#archiveteam 2014-06-19,Thu


Time Nickname Message
11:03 🔗 arkiver anyone else getting error 500 with S3 all the time?
11:20 🔗 Nemo_bis arkiver: have you checked that https://archive.org/catalog.php?history=1&justme=1 looks ok for you?
11:20 🔗 arkiver Nemo_bis: 101 waiting to run...
11:21 🔗 arkiver is that why S3 is acting weird?
11:21 🔗 Nemo_bis I've had s3 reject my uploads when I had too many in the queue sometimes
11:21 🔗 Nemo_bis that may be the reason or not
11:22 🔗 arkiver ah
11:22 🔗 arkiver maybe it is
11:22 🔗 arkiver https://catalogd.archive.org/log/316904585
11:22 🔗 arkiver I got 6 of these errors
11:22 🔗 arkiver rerunning one and it looks like it's working, so I'm going to rerun them
11:27 🔗 arkiver Nemo_bis: looks like it's working again. Thank you!
11:36 🔗 midas s3 had a lot of slowdowns, probably just going mental for a bit
12:22 🔗 midas getting some bigass originals from rawporter now
12:45 🔗 arkiver http://www.codespaces.com/
13:19 🔗 Cameron_D http://blog.snappytv.com/?p=2101
13:19 🔗 Cameron_D https://blog.twitter.com/2014/snappytv-is-joining-the-flock
13:21 🔗 Cameron_D doesn't seem like they really host anything
13:21 🔗 Cameron_D (Nor does it seem like they are closing up yet)
18:36 🔗 schbirid thanks for cleaning up https://archive.org/details/nycTaxiTripData2013 whoever did that :)
18:42 🔗 DFJustin you might want to put a better title/description on it
18:53 🔗 schbirid true
18:54 🔗 SN4T14 I want to make a shitty "acid trip" joke. :p
19:00 🔗 midas about 1000 more items to go in the rawporter grab from S3
19:10 🔗 swebb Pixorial video-, photo-sharing service shutting down - Denver... Giving 30-days notice. http://www.bizjournals.com/denver/news/2014/06/17/pixorial-video-photo-sharing-service-shutting-down.html?ana=twt
19:19 🔗 swebb "636,000 users, 2.7 million minutes of video and countless photos"
19:27 🔗 ivan` google doesn't seem to have indexed much of their user content, so I'm guessing it's almost entirely private?
19:27 🔗 ivan` did spot some "www.pixorial.com/watch/" https://encrypted.google.com/search?q=site:pixorial.com&gbv=1&prmd=ivns&ei=EzmjU5SWA4bxoATxyIKYAg&start=20&sa=N
19:27 🔗 db48x signups are disabled
19:27 🔗 db48x no information about their api
19:27 🔗 swebb Yea, I'm guessing so. It seems like the company is only 6 months old or so.
19:28 🔗 db48x they don't actually display any of that content except through embedding
19:29 🔗 db48x wikipedia says founded in 2007 and launched in 2009
19:30 🔗 db48x hmm, on the other hand: http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:31 🔗 db48x <source id="mp4Source" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b" type="video/mp4"></source><source id="webmSource" src="http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm" type="video/webm"></source>
19:51 🔗 arkiver so for pixorial
19:51 🔗 arkiver I think it's best to just scan all the links like http://pixori.al/26e01
19:51 🔗 arkiver with http://pixori.al/*****
19:51 🔗 arkiver * = 0-9 or a-f
19:52 🔗 arkiver http://pixori.al/26e01 goes to http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
19:54 🔗 arkiver change http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b to http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b and you have video
19:54 🔗 arkiver there is mp4 and webm
19:54 🔗 arkiver mp4: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4
19:54 🔗 arkiver webm: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm
19:54 🔗 exmic that's easy enough
19:54 🔗 arkiver so I don't think it's that hard to do...
19:55 🔗 arkiver just scanning all the http://pixori.al/***** links
19:55 🔗 arkiver and you have it :)
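
A minimal sketch of the scan arkiver describes, assuming curl and GNU coreutils; the output file name is made up for illustration:

    # enumerate all 16^5 five-character hex codes and resolve each short URL
    seq 0 1048575 | while read -r n; do
      code=$(printf '%05x' "$n")                  # 00000 .. fffff
      # ask curl where http://pixori.al/<code> redirects, without fetching the body
      target=$(curl -s -o /dev/null -w '%{redirect_url}' "http://pixori.al/$code")
      # a /watch/<id> page maps straight onto its /video/<id> (.mp4 / .webm) files
      [ -n "$target" ] && echo "$code $target" | sed 's|/watch/|/video/|' >> found.txt
    done
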
19:55 🔗 arkiver let's make a channel
19:56 🔗 arkiver #pixi-death
19:56 🔗 arkiver ? :P
19:58 🔗 arkiver found their s3 bucket again :) just like with earbits
20:00 🔗 db48x hah
20:02 🔗 arkiver http://pix-wp-assets.s3.amazonaws.com/
20:14 🔗 midas arkiver: http://pastebin.com/TNR3JXZy
20:14 🔗 midas open s3 bucket, again
20:14 🔗 * db48x revives his aws account
20:15 🔗 midas running du on it now
20:15 🔗 db48x was about to ask :)
20:15 🔗 midas did you find a bucket for the images/movies too arkiver ?
20:15 🔗 SN4T14 arkiver, hate to burst your bubble, but that's 60466176 unique addresses
20:15 🔗 midas 319591281 s3://pix-wp-assets/
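
For context, a hedged sketch of how an open bucket like this is typically sized and mirrored with s3cmd (the bucket name is from the log; the local path is made up):

    s3cmd du s3://pix-wp-assets/                      # total size in bytes
    s3cmd ls --recursive s3://pix-wp-assets/          # list every key in the bucket
    s3cmd sync s3://pix-wp-assets/ ./pix-wp-assets/   # mirror it locally
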
20:16 🔗 midas SN4T14: soo? that's not that much
20:16 🔗 db48x probably all just wordpress stuff like the name implies
20:16 🔗 midas yeah
20:17 🔗 db48x wouldn't surprise me if there were others though
20:17 🔗 arkiver SN4T14: http://pixori.al/***** is 1048576 addresses...
20:17 🔗 arkiver 0-9 and a-f
20:17 🔗 schbirid next archiveteam project will be "s3cmd ls s3://*"
20:17 🔗 db48x heh
20:17 🔗 schbirid there were some guys who indexed that years ago btw
20:18 🔗 schbirid based on keywords
20:18 🔗 schbirid err, *wordlists
20:18 🔗 SN4T14 arkiver, that's 36 letters, 36^5=60466176
20:18 🔗 midas schbirid: you know, it wouldn't surprise me if that was next :p
20:18 🔗 midas grabbing every s3 bucket that is public
20:18 🔗 SN4T14 Oh, a-f
20:18 🔗 SN4T14 Whoops
20:19 🔗 arkiver hehe
20:19 🔗 SN4T14 Yeah, then it's definitely doable
20:19 🔗 arkiver running crawl of 1048576 links now
20:19 🔗 arkiver see what's coming
20:19 🔗 arkiver the videos are embedded in html so that's easy
20:22 🔗 db48x even easier; the urls are predictable so we don't even have to parse the html
20:24 🔗 arkiver tested some urls: the urls with 5 characters are indeed a-f: http://pixori.al/b532f http://pixori.al/3bcbb http://pixori.al/5a5b2 http://pixori.al/dcc94 http://pixori.al/e5cef http://pixori.al/cd75e
20:25 🔗 arkiver looks like there are no capital letters in anything
20:25 🔗 arkiver but, I found this one too: http://myhub.pixorial.com/watch/045ad22847837e2cc150484f7421e950
20:25 🔗 arkiver it has http://pixori.al/7sAR
20:25 🔗 arkiver :/
20:26 🔗 schbirid redirects to the same url
20:26 🔗 schbirid as lowercase
20:26 🔗 midas nice
20:27 🔗 arkiver schbirid: great!
20:27 🔗 arkiver so http://pixori.al/7sAR is http://pixori.al/7sar
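
Since the short codes are case-insensitive and redirect to their lowercase form, a scanner can normalize candidates up front; a trivial sketch:

    # fold any mixed-case short code down to its canonical lowercase form
    echo "http://pixori.al/7sAR" | tr 'A-Z' 'a-z'   # -> http://pixori.al/7sar
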
20:28 🔗 arkiver so for http://pixori.al/**** there are 1679616 urls
20:30 🔗 arkiver wow.
20:30 🔗 arkiver going fast, running the crawl on the 1048576 urls from http://pixori.al/***** and it's doing 3 urls per second
20:30 🔗 arkiver well, not extremely fast, but better than earbits
20:30 🔗 SN4T14 A *whole* 3 urls per second. :p
20:31 🔗 SN4T14 Why don't you start multiple instances of your script?
20:31 🔗 arkiver SN4T14: will do that
20:31 🔗 arkiver will start 5
20:32 🔗 db48x it's more than just 0-9a-f
20:32 🔗 db48x alas
20:32 🔗 arkiver db48x: do you have an example url?
20:32 🔗 db48x arkiver: oh, good, we don't have to deal with capitals as well :)
20:33 🔗 arkiver db48x: oh sorry, I see
20:33 🔗 arkiver so on http://pixori.al/**** it is 0-9a-z and http://pixori.al/***** is 0-9a-f
20:34 🔗 midas Rawporter data is downloaded
20:34 🔗 arkiver I tested around 15 urls and they were all up to f from http://pixori.al/***** , but there could be more than f...
20:34 🔗 arkiver if anyone finds such a case
20:34 🔗 midas f seems logical, being hex
20:35 🔗 midas Rawporter ended up being 75GB of data
20:35 🔗 arkiver :)
20:35 🔗 arkiver of images only?
20:36 🔗 db48x if the final digit only goes up to f then that's still 26.8 million
20:36 🔗 midas images and video
20:37 🔗 arkiver db48x: how do you get that number?
20:37 🔗 arkiver 16^5 is 1048576
20:38 🔗 db48x 36^4*16
20:39 🔗 db48x although it could be two separate namespaces, in which case it's 36^4 + 16^5
20:39 🔗 arkiver yes....
20:40 🔗 arkiver it's 36^4 + 16^5, so that's 1679616 + 1048576 = 2728192 urls
20:40 🔗 arkiver but maybe also the http://pixori.al/*** , http://pixori.al/** and http://pixori.al/* exist..
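
The keyspace arithmetic above, checked in the shell (assuming bash for the ** operator):

    echo $((36**4))           # 1679616  four-char codes, 0-9a-z
    echo $((16**5))           # 1048576  five-char codes, 0-9a-f
    echo $((36**4 + 16**5))   # 2728192  total short URLs to scan
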
20:44 🔗 schbirid http://pixori.al/* always redirects to http://myhub.pixorial.com/s/* which then redirects to the actual URL. you can save one redirect by using the myhub URL directly. saved almost 20% on my simple tiny test
20:47 🔗 schbirid ok, the second test just gave ~4% though :P
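
schbirid's shortcut as a one-liner, assuming the scan list lives in a file (file names made up): rewriting each short link to its intermediate myhub form saves one redirect per fetch:

    sed 's|^http://pixori.al/|http://myhub.pixorial.com/s/|' urls.txt > urls-direct.txt
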
20:52 🔗 arkiver cool
20:52 🔗 arkiver Wayback is also playing the videos very well in the browser
20:52 🔗 arkiver http://web.archive.org/web/20140619195035/http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b
20:55 🔗 arkiver going to split list of links from http://pixori.al/***** into 5 packs
20:55 🔗 arkiver no, 10 packs
20:55 🔗 arkiver then download them simultaneously
20:55 🔗 arkiver hopefully, they don't have some kind of stupid banning system...
20:56 🔗 arkiver will set soTimeoutMs to 20000 milliseconds and timeoutSeconds to 120000 seconds
20:56 🔗 arkiver to stay safe
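
One way to run the ten packs in parallel with coreutils and wget, as a rough sketch (arkiver was driving Heritrix, whose soTimeoutMs/timeoutSeconds settings wget's --timeout only loosely mirrors; the file names are made up):

    split -n l/10 urls.txt pack_                 # pack_aa .. pack_aj, ~105k urls each
    for f in pack_*; do
      # -O /dev/null discards the bodies; --warc-file still records every response
      wget --input-file="$f" --warc-file="$f" -O /dev/null \
           --tries=3 --timeout=20 --wait=1 &     # --wait to stay polite / avoid bans
    done
    wait
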
20:57 🔗 arkiver the videos don't download very fast (I had 50-100 kbps for one video)
20:57 🔗 arkiver so if video is multiple GB's...
20:57 🔗 Smiley can you create url lists?
20:57 🔗 Smiley damnit, we need to get a warrior-type project where we can just insert lists of urls, like archivebot does.
20:57 🔗 arkiver haha no need man
20:57 🔗 arkiver :P
20:57 🔗 arkiver we got 30 days
20:58 🔗 arkiver going to start 10 simultaneous crawls on http://pixori.al/***** in some time
20:59 🔗 Smiley it'd still be an awesomely useful project
20:59 🔗 arkiver yep
21:00 🔗 arkiver Smiley: http://www.onetimebox.org/box/wi7sHp4t4Qrc26w58
21:00 🔗 arkiver those are the links
21:02 🔗 db48x I say we do it with the warrior
21:03 🔗 db48x it's a more sure way to go, and if we get it done faster, people are less likely to delete things
21:03 🔗 arkiver yeah
21:04 🔗 arkiver but I have zero knowledge about the warrior...
21:04 🔗 arkiver :/
21:04 🔗 db48x but we should start scanning the urls now, before that is set up
21:04 🔗 Smiley do we have some way of feeding the warrior loads of urls?
21:04 🔗 Smiley we really need a framework D:
21:04 🔗 db48x it's pretty easy
21:04 🔗 arkiver db48x: why scanning the urls?
21:04 🔗 Smiley like a generic one...
21:04 🔗 arkiver why not do the short urls in the warrior?
21:04 🔗 arkiver and people just downloading those?
21:04 🔗 db48x we could be finished scanning the urls in a day or two
21:05 🔗 arkiver yeah
21:05 🔗 arkiver but I have no idea how to scan urls...
21:05 🔗 db48x the pipeline for the warrior task may take that long to write and test
21:05 🔗 arkiver O.o
21:05 🔗 arkiver I know how to download and use heritrix and such things
21:05 🔗 arkiver but someone else would have to do the short url scanning
21:05 🔗 db48x ah
21:06 🔗 arkiver sorry
21:06 🔗 arkiver :(
21:06 🔗 db48x I thought you were just scanning
21:06 🔗 arkiver no no
21:06 🔗 arkiver I was already downloading
21:06 🔗 arkiver 3 urls per second
21:06 🔗 arkiver would be done in around 5 to 10 days
21:06 🔗 arkiver with all the urls
21:06 🔗 arkiver But it's more fun with a warrior project :)
21:06 🔗 db48x assuming it's only a few million
21:06 🔗 arkiver #pixi-death
21:07 🔗 arkiver couldn't think of anything better... :P
21:09 🔗 db48x #pixofail :D
21:10 🔗 arkiver :D muuuuch better
21:10 🔗 arkiver :P
21:15 🔗 db48x yipdw: ping?
21:34 🔗 yipdw db48x: hi
21:35 🔗 db48x yipdw: howdy. how do you usually create a new job for the warriors? fork seesaw-kit?
21:36 🔗 yipdw yeah
21:36 🔗 yipdw I usually also subtree merge the readme repo in, but copy is fine too
21:38 🔗 db48x I guess you don't actually use a github fork
21:38 🔗 yipdw you can
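
A hedged sketch of the workflow yipdw describes; the readme repository name here is an assumption, and simply copying the files in works too:

    # fork/clone the seesaw kit as the starting point for a new warrior task
    git clone https://github.com/ArchiveTeam/seesaw-kit pixorial-grab
    cd pixorial-grab
    git remote add readme https://github.com/ArchiveTeam/project-readme   # assumed name
    git fetch readme
    # subtree-merge the readme repo in (newer git needs --allow-unrelated-histories)
    git merge -s subtree --allow-unrelated-histories readme/master
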
22:27 🔗 deathy wondering if someone has studied CKAN archival/backup... it's a platform for "open data". Not that governments ever collapse and delete some of their published data... but best be safe...
22:49 🔗 Nemo_bis someone claims it's useful to upload data there in order to not have all eggs in IA's basket https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/How_to_deal_with_open_datasets
22:50 🔗 Nemo_bis (I simplified)
22:52 🔗 DFJustin the more baskets the merrier
23:05 🔗 deathy heh.. "lots of copies keeps stuff safe"
23:06 🔗 SN4T14 Yeah, that's why I always clone myself in three places.
23:07 🔗 db48x I want to back myself up outside my current light-cone
23:07 🔗 SN4T14 Dude, set up a RAIC array. :p
