[00:04] I was continuing from what underscor said (the last person to say anything before me), and he was continuing from what dashcloud said (the last before underscor to say something)
[01:46] they're not uploading the full res to IA anyway
[02:53] DFJustin: of course not. why upload for free what you can put on DVD/bluray and make money from?
[02:54] (there has been an avgeeks collection at IA for years, with very few uploads added to it. Instead, Skip preferred selling his DVDs on his website)
[03:00] Coderjoe: http://archive.org/details/avgeeks
[03:00] they're still at it
[03:01] the latest one was uploaded 3 days ago
[03:01] you think I don't know that?
[03:01] before this 100 miles project, uploads to that collection were very few and far between
[03:05] hmm
[03:05] I apparently missed a burst or two. I didn't realize there are currently 794 items in that collection
[03:06] yes but the last 20 to 30 are only from 2012
[03:09] earlier this week, or perhaps the end of last week, I was comparing the postings on the tumblr with the uploads on IA. there weren't a whole lot of recent non-prelinger uploads by skip
[03:09] also i'm still downloading kat.ph/community
[03:09] ok
[03:09] hmm
[03:10] I had not known about a large number he uploaded in 2011
[03:10] and this should probably be in -bs
[03:21] "NOTICE: Due to continous abuse, privatepaste.com will be shutting down August 1st, 2012."
[03:22] too bad Google does not know about any pastes on it, I guess this is the "private" part
[03:23] https://encrypted.google.com/search?hl=en&source=hp&q="https+privatepaste.com"
[03:24] the only pastes that would show up on google are ones that were on some other website that they crawled
[03:28] So, what now? Automatic crawling of other pastebins to grab as much of privatepaste as possible?
[03:30] Is it possible to grab privatepastes via attrition?
[03:30] Find how it generates URLs, then step through each and toss out dead links (probably with --redirects=0)
[03:30] I'm grepping all of my IRC logs for privatepaste
[03:30] they have subdomains like http://pgsql.privatepaste.com/a9940ba8de
[03:31] 16^10...oof
[03:31] The URLs are likely either random or dependent on the text
[03:32] nitro2k01: That's why I was curious if attrition would work
[03:33] But the space is too great
[03:50] mmm
[03:50] 1 trillion IDs
[03:51] 1 trillion HTTP requests
[03:51] Wonder if we'd melt their network cards
[03:55] Probably more like melting their patience
[03:55] "That's it, we're closing early"
[03:55] Also, knowing all valid IDs would do no good without the password for some of them
[03:55] august 1, 2012 is in the past
[03:56] So be it
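To put the "16^10...oof" exchange in concrete numbers: the example ID a9940ba8de is ten hex characters, so a blind sweep means 16^10 ≈ 1.1 trillion candidates. Below is a minimal Python sketch of such a probe, assuming that ID format (inferred from the single example URL) and ignoring the per-subdomain and password complications raised above; the site shut down on August 1st, 2012, so it is purely illustrative.

```python
# Illustration only: the ten-hex-character ID format is an assumption based
# on the single example URL http://pgsql.privatepaste.com/a9940ba8de, and
# privatepaste.com is gone, so none of this can actually run against it.
import random
import urllib.error
import urllib.request

HEX = "0123456789abcdef"
ID_LEN = 10
SPACE = 16 ** ID_LEN                      # ~1.1 trillion candidate IDs

# At one request per second a full sweep would take roughly 35,000 years,
# which is "the space is too great" expressed in numbers.
print(f"{SPACE:,} IDs; full sweep at 1 req/s takes {SPACE / 31_536_000:,.0f} years")

def probe(paste_id, host="privatepaste.com"):
    """HEAD-request one candidate ID; True if the server answers 200."""
    req = urllib.request.Request(f"http://{host}/{paste_id}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

# Random sampling finds essentially nothing unless a large fraction of the
# ID space is actually populated.
for _ in range(5):
    candidate = "".join(random.choice(HEX) for _ in range(ID_LEN))
    print(candidate, probe(candidate))
```

That request volume is the "melt their patience" problem in practice, which is why crawling other sites for already-published paste links was the more realistic angle.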
[10:30] i have an issue with zip explorer
[10:30] http://ia600503.us.archive.org/zipview.php?zip=/21/items/Spanishrevolution-UnAnyoDeTrabajoEnLasPlazasYBarrios.ActasDel15m/Actas15M-0001-0500.zip
[10:30] all the files are downloaded as 0 bytes
[10:30] and do not open
[10:34] emijrp: how big is the file?
[10:34] sigh, I suppose this is he.net's fault again? http://p.defau.lt/?EhO6n_E45JN4GzsGwOCzAQ
[10:35] ^ this will make a single wiki take a couple days to upload, emijrp
[10:35] ok
[10:35] :P
[10:39] the zip is just 50 MB
[10:40] hmm
[10:41] filenames seem plain enough
[13:21] Linked an article about "How to crawl a quarter billion webpages in 40 hours" in #archiveteam-bs
[16:13] Interesting story from a friend. He set up some terrible, terrible... awful disgusting pile of work of a website on some free web hosting thing. For some reason he decided to see if the host was still even around, and sure enough, they were. Amazingly even after nearly 6 years of inactivity his account was still there, active, his site was not defaced and all the info is still there.
[16:13] God damn.. very interesting to read.
[16:13] * brayden has learned the value of saving data!
[16:16] lol and now he tried to transfer it to one of his dedicated machines. Says the RAM/CPU has gone insane and he has to reboot the box.
[18:22] So, the guy with all the computer mags just got back to me
[18:22] He has a brief list of the mags; anyone interested?
[18:26] hey shaqfu
[18:26] godane: Yo
[18:26] i'm still downloading kat.ph community
[18:26] close to 300mb .warc.gz
[18:26] Awesome
[18:27] do you know much about grep?
[18:27] What are you trying to grep?
[18:27] all photobucket.com images in my kat.ph dump
[18:28] Shouldn't those be picked up by --page-requisites? Or did you disable --span-hosts?
[18:28] i didn't have that
[18:28] Oh, hm
[18:29] Should be reasonable to grep out photobucket.com and scrape out the URLs
[18:29] i also didn't want it to start going after the whole internet
[18:29] You have to limit its levels
[18:29] Probably to 1
[18:29] since there are images everywhere
[18:31] also i was trying to only get stuff from kat.ph/community
[18:32] i added a --wait 1 so this took a very long time
[18:33] Ah
[18:33] i had to because it failed on me before
[18:35] there are also a lot of 404 errors
[18:57] godane: Are the image urls absolute?
[18:57] That would make the grepping much easier.
[18:58] there are spaces in some
[18:59] With absolute I meant that the url is not relative to the page it is on. E.g. ../../images/test.png is relative, http://photobucket.com/something.png or just /something.png is absolute.
[19:00] Ah, now that I read your messages again: the pages are from kat.ph and they link to photobucket.com? Then the urls must be absolute.
[19:00] yes they're absolute
[19:01] Then grepping is enough, since you don't need to know where the urls came from.
[19:01] Let me see.
[19:02] grep -ohP '<img[^>]+src="[^">]+"'
[19:02] Does that produce something?
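alard's one-liner only captures whole <img ...> tags, and as he notes a few lines further on, a second pass is needed to reduce them to bare URLs. A sketch of that second pass in Python; the photobucket-only filter and the stdin/stdout plumbing are assumptions for illustration, not godane's actual pipeline.

```python
# Sketch of the "second pass" over the grep output: reduce the matched
# <img ...> tags (one per line on stdin) to a deduplicated list of
# photobucket.com image URLs. The photobucket-only filter is an assumption
# for illustration, not godane's actual pipeline.
import re
import sys
from urllib.parse import urlparse

SRC_RE = re.compile(r'src="([^">]+)"')

seen = set()
for line in sys.stdin:
    m = SRC_RE.search(line)
    if not m:
        continue
    url = m.group(1).strip()
    if "photobucket.com" in urlparse(url).netloc.lower() and url not in seen:
        seen.add(url)
        print(url)
```

Piping the grep output through this and handing the resulting list to wget -i would then fetch the images themselves.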
[19:02] alard: I have a simple feature request for wget
[19:03] it would be nice if it had an option which would make it query for indices in all directories
[19:03] so if an html file has pictures in /img, it should also query /img in case the httpd has indices turned on
[19:03] err, embedded pictures
[19:03] it should do that in general with all subdirs, with that option enabled
[19:03] do you follow? :)
[19:04] balrog_: Yes. Have you tried it yourself? :)
[19:04] Alternatively, since it is quite site-specific, you might want to use a Lua script that does this.
[19:05] is it site specific?
[19:05] have I tried to add that feature? not atm
[19:05] holy crap
[19:05] your code made a 31.3mb text file of image urls
[19:06] balrog_: I'd say it is rather specific: some sites will have /index.html, others /. I don't think it's a problem that is general enough to need a special Wget option.
[19:06] godane: You probably need a second grep to postprocess the output.
[19:06] uhh, I don't think you follow!
[19:06] it got all of them i think too
[19:06] i know
[19:06] what I meant was that it would be nice if wget could query for indices automatically
[19:07] even if said indices aren't linked from anywhere
[19:07] balrog_: Yes, I think I do understand. If it finds /img/something.png you want it to request /img/ as well.
[19:07] yes
[19:10] I think the fastest way to get that result is by writing a small Lua script.
[19:11] That's what I would do, at least, if I were you.
[21:12] hey shaqfu
[21:47] shaqfu: I thought someone else would have asked to see the list of magazines by now, but in any event, I would be interested in the list
[21:50] i got the kastatic.com images now
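Returning to balrog_'s directory-index request and alard's suggestion to handle it with a small Lua script: the same effect can be approximated outside wget by deriving every ancestor directory of the URLs already fetched and requesting those too, in case the server has autoindex enabled. A rough Python illustration of that idea (not alard's Lua approach; the stdin/stdout plumbing is assumed):

```python
# Given URLs wget already fetched (one per line on stdin), emit the
# directory-index URL of every ancestor path, so they can be requested too
# in case the server has autoindex turned on. Standalone illustration only;
# alard's suggestion was to do this inside wget via a Lua hook.
import sys
from urllib.parse import urlsplit, urlunsplit

def index_candidates(url):
    """Yield the index URL of every ancestor directory of `url`."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop the final segment (the file itself) and walk back up the tree.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth]) + ("/" if depth else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

seen = set()
for line in sys.stdin:
    for candidate in index_candidates(line.strip()):
        if candidate not in seen:
            seen.add(candidate)
            print(candidate)
```

The printed candidates can be handed back to wget -i; directories without an index will simply return 403 or whatever page the server serves for them.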