[11:03] anyone else getting error 500 with S3 all the time?
[11:20] arkiver: have you checked that https://archive.org/catalog.php?history=1&justme=1 looks ok for you?
[11:20] Nemo_bis: 101 waiting to run...
[11:21] is that why S3 is acting weird?
[11:21] I've had s3 reject my uploads when I had too many in the queue sometimes
[11:21] that may be the reason or not
[11:22] ah
[11:22] maybe it is
[11:22] https://catalogd.archive.org/log/316904585
[11:22] I got 6 of these errors
[11:22] rerunning one and it looks like it's working, so I'm going to rerun them
[11:27] Nemo_bis: looks like it's working again. Thank you!
[11:36] s3 had a lot of slowdowns, probably just going mental for a bit
[12:22] getting some bigass originals from rawporter now
[12:45] http://www.codespaces.com/
[13:19] http://blog.snappytv.com/?p=2101
[13:19] https://blog.twitter.com/2014/snappytv-is-joining-the-flock
[13:21] doesn't seem like they really host anything
[13:21] (Nor does it seem like they are closing up yet)
[18:36] thanks for cleaning up https://archive.org/details/nycTaxiTripData2013 whoever did that :)
[18:42] you might want to put a better title/description on it
[18:53] true
[18:54] I want to make a shitty "acid trip" joke. :p
[19:00] about 1000 more items to go in the rawporter grab from S3
[19:10] Pixorial video-, photo-sharing service shutting down - Denver... Giving 30-days notice. http://www.bizjournals.com/denver/news/2014/06/17/pixorial-video-photo-sharing-service-shutting-down.html?ana=twt
[19:19] "636,000 users, 2.7 million minutes of video and countless photos"
[19:27] google doesn't seem to know much of their user content, so I'm guessing it's almost entirely private?
[19:27] did spot some "www.pixorial.com/watch/" https://encrypted.google.com/search?q=site:pixorial.com&gbv=1&prmd=ivns&ei=EzmjU5SWA4bxoATxyIKYAg&start=20&sa=N
[19:27] signups are disabled
[19:27] no information about their api
[19:27] Yea, I'm guessing so. It seems like the company is only 6 months old or so.
[19:28] they don't actually display any of that content except through embedding
[19:29] wikipedia says founded in 2007 and launched in 2009
[19:30] hmm, on the other hand: http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
[19:51] so for pixorial
[19:51] I think it's best to just scan all the links like http://pixori.al/26e01
[19:51] with http://pixori.al/*****
[19:51] * = 0-9 or a-f
[19:52] http://pixori.al/26e01 goes to http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b
[19:54] change http://myhub.pixorial.com/watch/636a65d9042aa8416876197a9c44e38b to http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b and you have the video
[19:54] there's mp4 and webm
[19:54] mp4: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.mp4
[19:54] webm: http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b.webm
[19:54] that's easy enough
[19:54] so I don't think it's that hard to do...
[19:55] just scanning all the http://pixori.al/***** links
[19:55] and you have it :)
[19:55] let's make a channel
[19:56] #pixi-death
[19:56] ? :P
[19:58] found s3 again :) just like with earbits
[20:00] hah
[20:02] http://pix-wp-assets.s3.amazonaws.com/
[20:14] arkiver: http://pastebin.com/TNR3JXZy
[20:14] open s3 bucket, again
[20:14] * db48x revives his aws account
[20:15] running du on it now
[20:15] was about to ask :)
[20:15] did you find a bucket for the images/movies too, arkiver?
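A minimal sketch of the watch-to-video trick described above, assuming the pixori.al redirects behave as observed in the log; the requests library and the video_urls helper name are illustrative choices, not anything Pixorial documents:

    import requests
    from urllib.parse import urlparse

    def video_urls(short_code):
        """Expand a pixori.al short code into direct .mp4/.webm URLs."""
        # pixori.al/<code> redirects to myhub.pixorial.com/watch/<hash>
        r = requests.get("http://pixori.al/" + short_code, allow_redirects=True, timeout=20)
        video_hash = urlparse(r.url).path.rsplit("/", 1)[-1]
        # per the log, swapping /watch/ for /video/ exposes the raw files
        base = "http://myhub.pixorial.com/video/" + video_hash
        return base + ".mp4", base + ".webm"

    # e.g. video_urls("26e01") should give the pair for
    # http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b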
[20:15] arkiver, hate to burst your bubble, but that's 60466176 unique addresses
[20:15] 319591281 s3://pix-wp-assets/
[20:16] SN4T14: soo? that's not that much
[20:16] probably all just wordpress stuff like the name implies
[20:16] yeah
[20:17] wouldn't surprise me if there were others though
[20:17] SN4T14: http://pixori.al/***** is 1048576 addresses...
[20:17] 0-9 and a-f
[20:17] next archiveteam project will be "s3cmd ls s3://*"
[20:17] heh
[20:17] there were some guys who indexed that years ago btw
[20:18] based on keywords
[20:18] err, *wordlists
[20:18] arkiver, that's 36 letters, 36^5=60466176
[20:18] schbirid: you know, wouldn't surprise me if that were next :p
[20:18] grabbing every s3 bucket that is public
[20:18] Oh, a-f
[20:18] Whoops
[20:19] hehe
[20:19] Yeah, then it's definitely doable
[20:19] running crawl of 1048576 links now
[20:19] see what's coming
[20:19] the videos are embedded in html so that's easy
[20:22] even easier; the urls are predictable so we don't even have to parse the html
[20:24] tested some urls: the urls with 5 characters are indeed a-f: http://pixori.al/b532f http://pixori.al/3bcbb http://pixori.al/5a5b2 http://pixori.al/dcc94 http://pixori.al/e5cef http://pixori.al/cd75e
[20:25] looks like there are no capital letters on anything
[20:25] but, I found this one too: http://myhub.pixorial.com/watch/045ad22847837e2cc150484f7421e950
[20:25] it has http://pixori.al/7sAR
[20:25] :/
[20:26] redirects to the same url
[20:26] as lowercase
[20:26] nice
[20:27] schbirid: great!
[20:27] so http://pixori.al/7sAR is http://pixori.al/7sar
[20:28] so there are 1679616 urls for http://pixori.al/****
[20:30] wow.
[20:30] going fast, running a crawl on the 1048576 urls from http://pixori.al/***** and it's going at 3 urls per second
[20:30] well, not extremely fast, but better than earbits
[20:30] A *whole* 3 urls per second. :p
[20:31] Why don't you start multiple instances of your script?
[20:31] SN4T14: will do that
[20:31] will start 5
[20:32] it's more than just 0-9a-f
[20:32] alas
[20:32] db48x: do you have an example url?
[20:32] arkiver: oh, good, we don't have to deal with capitals as well :)
[20:33] db48x: oh sorry, I see
[20:33] so on http://pixori.al/**** it is 0-9a-z and http://pixori.al/***** is 0-9a-f
[20:34] Rawporter data is downloaded
[20:34] I tested around 15 urls and they were all up to f from http://pixori.al/***** , but there could be more than f...
[20:34] if anyone finds such a case
[20:34] f seems logical, being hex
[20:35] Rawporter ended up being 75GB of data
[20:35] :)
[20:35] of images only?
[20:36] if the final digit only goes up to f then that's still 26.8 million
[20:36] images and video
[20:37] db48x: how do you get that number?
[20:37] 16^5 is 1048576
[20:38] 36^4*16
[20:39] although it could be two separate namespaces, in which case it's 36^4 + 16^5
[20:39] yes....
[20:40] it's 36^4 + 16^5, so that's 1679616 + 1048576 = 2728192 urls
[20:40] but maybe http://pixori.al/*** , http://pixori.al/** and http://pixori.al/* also exist..
[20:44] http://pixori.al/* always redirects to http://myhub.pixorial.com/s/* which then redirects to the actual URL. you can save one redirect by using the myhub URL directly. saved almost 20% on my simple tiny test
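A small sketch of generating the candidate list under the assumptions worked out above (two separate namespaces: 36^4 four-character codes over 0-9a-z plus 16^5 five-character codes over 0-9a-f, 2728192 urls total), going straight to the myhub URL to skip one redirect as just noted; the script structure and output filename are illustrative:

    import itertools
    import string

    # per the log: 4-char codes use 0-9a-z (36^4 = 1679616),
    # 5-char codes use 0-9a-f (16^5 = 1048576) -- 2728192 candidates in total
    ALPHABETS = [
        (string.digits + string.ascii_lowercase, 4),
        (string.digits + "abcdef", 5),
    ]

    def candidate_urls():
        for chars, length in ALPHABETS:
            for combo in itertools.product(chars, repeat=length):
                # hit myhub.pixorial.com/s/<code> directly to save one redirect
                yield "http://myhub.pixorial.com/s/" + "".join(combo)

    with open("pixorial_candidates.txt", "w") as out:
        for url in candidate_urls():
            out.write(url + "\n")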
[20:47] ok, the second test just gave ~4% though :P
[20:52] cool
[20:52] Wayback is also playing the videos very well in the browser
[20:52] http://web.archive.org/web/20140619195035/http://myhub.pixorial.com/video/636a65d9042aa8416876197a9c44e38b
[20:55] going to split the list of links from http://pixori.al/***** into 5 packs
[20:55] no, 10 packs
[20:55] then download them simultaneously
[20:55] hopefully, they don't have some kind of stupid banning system...
[20:56] will put soTimeoutMs to 20000 milliseconds and timeoutSeconds to 120000 seconds
[20:56] to stay safe
[20:57] the videos don't download very fast (I had 50-100 kbps) for one video
[20:57] so if a video is multiple GBs...
[20:57] can you create url lists?
[20:57] damnit we need to get a warrior-type project which we can just insert lists of urls into, like archivebot does.
[20:57] haha no need man
[20:57] :P
[20:57] we got 30 days
[20:58] going to start 10 simultaneous crawls on http://pixori.al/***** in some time
[20:59] it'd still be an awesomely useful project
[20:59] yep
[21:00] Smiley: http://www.onetimebox.org/box/wi7sHp4t4Qrc26w58
[21:00] those are the links
[21:02] I say we do it with the warrior
[21:03] it's a more sure way to go, and if we get it done faster, people are less likely to delete things
[21:03] yeah
[21:04] but I have zero knowledge about the warrior...
[21:04] :/
[21:04] but we should start scanning the urls now, before that is set up
[21:04] do we have some way of feeding the warrior loads of urls?
[21:04] we really need a framework D:
[21:04] it's pretty easy
[21:04] db48x: why scan the urls?
[21:04] like a generic one...
[21:04] why not do the short urls in the warrior?
[21:04] and have people just download those?
[21:04] we could be finished scanning the urls in a day or two
[21:05] yeah
[21:05] but I have no idea how to scan urls...
[21:05] the pipeline for the warrior task may take that long to write and test
[21:05] O.o
[21:05] I know how to download and use heritrix and such things
[21:05] but someone else would have to do the short url scanning
[21:05] ah
[21:06] sorry
[21:06] :(
[21:06] I thought you were just scanning
[21:06] no no
[21:06] I was already downloading
[21:06] 3 urls per second
[21:06] would be done in around 5 to 10 days
[21:06] with all the urls
[21:06] But it's more fun with a warrior project :)
[21:06] assuming it's only a few million
[21:06] #pixi-death
[21:07] couldn't think of anything better... :P
[21:09] #pixofail :D
[21:10] :D muuuuch better
[21:10] :P
[21:15] yipdw: ping?
[21:34] db48x: hi
[21:35] yipdw: howdy. how do you usually create a new job for the warriors? fork seesaw-kit?
[21:36] yeah
[21:36] I usually also subtree merge the readme repo in, but copy is fine too
[21:38] I guess you don't actually use a github fork
[21:38] you can
[22:27] wondering if someone has studied CKAN archival/backup... it's a platform for "open data". Not that governments will collapse and delete some of their published data... but best be safe...
[22:49] someone claims it's useful to upload data there in order to not have all our eggs in IA's basket https://meta.wikimedia.org/wiki/Talk:Requests_for_comment/How_to_deal_with_open_datasets
[22:50] (I simplified)
[22:52] the more baskets the merrier
[23:05] heh.. "lots of copies keeps stuff safe"
[23:06] Yeah, that's why I always clone myself in three places.
[23:07] I want to back myself up outside my current light-cone
[23:07] Dude, set up a RAIC array. :p
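Returning to the plan earlier in the log of splitting the pixori.al candidate list into ten packs and crawling them simultaneously: a rough sketch, assuming plain HTTP fetches with the requests library rather than the Heritrix setup actually used; the 20-second timeout mirrors the soTimeoutMs value mentioned, and the file name, pack count, and crawl_pack helper are illustrative. This only records which candidates resolve and where they redirect; the WARC-producing crawl would still happen separately.

    import concurrent.futures
    import requests

    NUM_PACKS = 10  # "no, 10 packs" per the log

    def crawl_pack(pack):
        # fetch every URL in one pack sequentially; one pack per worker thread
        session = requests.Session()
        results = []
        for url in pack:
            try:
                r = session.get(url, timeout=20, allow_redirects=True)
                results.append((url, r.status_code, r.url))
            except requests.RequestException as exc:
                results.append((url, None, str(exc)))
        return results

    with open("pixorial_candidates.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # round-robin split into NUM_PACKS roughly equal packs
    packs = [urls[i::NUM_PACKS] for i in range(NUM_PACKS)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_PACKS) as pool:
        for pack_results in pool.map(crawl_pack, packs):
            for url, status, target in pack_results:
                print(status, url, target)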