#archiveteam 2012-08-11,Sat


00:04 🔗 Coderjoe I was continuing from what underscor said (the last person to say anything before me), and he was continuing from what dashcloud said (the last before underscor to say something)
01:46 🔗 DFJustin they're not uploading the full res to IA anyway
02:53 🔗 Coderjoe DFJustin: of course not. why upload for free what you can put on DVD/bluray and make money from?
02:54 🔗 Coderjoe (there has been an avgeeks collection at IA for years, with very few uploads added to it. Instead, Skip preferred selling his DVDs on his website)
03:00 🔗 godane Coderjoe: http://archive.org/details/avgeeks
03:00 🔗 godane they're still at it
03:01 🔗 godane the latest one was uploaded 3 days ago
03:01 🔗 Coderjoe you think I don't know that?
03:01 🔗 Coderjoe before this 100 miles project, uploads to that collection were very few and far between
03:05 🔗 Coderjoe hmm
03:05 🔗 Coderjoe I apparently missed a burst or two. I didn't realize there are currently 794 items in that collection
03:06 🔗 godane yes, but the last 20 to 30 are only from 2012
03:09 🔗 Coderjoe earlier this week, or perhaps the end of last week, I was comparing the postings on the tumblr with the uploads on IA. there weren't a whole lot of recent non-prelinger uploads by skip
03:09 🔗 godane also i'm still downloading kat.ph/community
03:09 🔗 godane ok
03:09 🔗 Coderjoe hmm
03:10 🔗 Coderjoe I had not known about a large number he uploaded in 2011
03:10 🔗 Coderjoe and this should probably be in -bs
03:21 🔗 ivan` "NOTICE: Due to continous abuse, privatepaste.com will be shutting down August 1st, 2012."
03:22 🔗 ivan` too bad Google does not know about any pastes on it, I guess this is the "private" part
03:23 🔗 ivan` https://encrypted.google.com/search?hl=en&source=hp&q="https+privatepaste.com"
03:24 🔗 Coderjoe the only pastes that would show up on google are ones that were on some other website that they crawled
03:28 🔗 nitro2k01 So, what now? Automatic crawling of other pastebins to grab as much of privatepaste as possible?
03:30 🔗 shaqfu Is it possible to grab privatepastes via attrition?
03:30 🔗 shaqfu Find how it generates URLs, then step through each and toss out dead links (probably with --max-redirect=0)
03:30 🔗 ivan` I'm grepping all of my IRC logs for privatepaste
03:30 🔗 ivan` they have subdomains like http://pgsql.privatepaste.com/a9940ba8de
03:31 🔗 shaqfu 16^10...oof
03:31 🔗 nitro2k01 The URLs are likely either random or dependent on the text
03:32 🔗 shaqfu nitro2k01: That's why I was curious if attrition would work
03:33 🔗 shaqfu But the space is too great
03:50 🔗 Coderjoe mmm
03:50 🔗 Coderjoe 1 trillion IDs
03:51 🔗 shaqfu 1 trillion HTTP requests
03:51 🔗 shaqfu Wonder if we'd melt their network cards
03:55 🔗 nitro2k01 Probably more like melting their patience
03:55 🔗 nitro2k01 "That's it, we're closing early"
03:55 🔗 nitro2k01 Also, knowing all valid IDs would do no good without the passwords for the protected ones
03:55 🔗 Coderjoe august 1, 2012 is in the past
03:56 🔗 nitro2k01 So be it
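
(A quick sanity check on the numbers above, assuming the IDs are 10 hex characters as in the pgsql.privatepaste.com example; the actual ID scheme was never confirmed. Bash arithmetic:)

    # 10 hex characters -> 16^10 possible IDs
    echo $((16 ** 10))                        # 1099511627776, ~1.1 trillion
    # even at 1000 requests/second, a full sweep would take ~34 years:
    echo $((16 ** 10 / 1000 / 86400 / 365))   # 34
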
10:30 🔗 emijrp i have an issue with zip explorer
10:30 🔗 emijrp http://ia600503.us.archive.org/zipview.php?zip=/21/items/Spanishrevolution-UnAnyoDeTrabajoEnLasPlazasYBarrios.ActasDel15m/Actas15M-0001-0500.zip
10:30 🔗 emijrp all the files are downloaded as 0 bytes
10:30 🔗 emijrp and do not open
10:34 🔗 Nemo_bis emijrp: how big is the file?
10:34 🔗 Nemo_bis sigh, I suppose this is he.net's fault again? http://p.defau.lt/?EhO6n_E45JN4GzsGwOCzAQ
10:35 🔗 Nemo_bis ^ this will make a single wiki take a couple days to upload, emijrp
10:35 🔗 emijrp ok
10:35 🔗 emijrp :P
10:39 🔗 emijrp the zip is just 50 MB
10:40 🔗 Nemo_bis hmm
10:41 🔗 Nemo_bis filenames seem plain enough
13:21 🔗 ersi Linked an article about "How to crawl a quarter billion webpages in 40 hours" in #archiveteam-bs
16:13 🔗 brayden Interesting story from a friend. He set up some terrible, terrible... awful disgusting pile of work of a website on some free web hosting thing. For some reason he decided to see if the host was still even around, and sure enough, they were. Amazingly, even after nearly 6 years of inactivity his account was still there, active, his site was not defaced and all the info is still there.
16:13 🔗 brayden God damn.. very interesting to read.
16:13 🔗 * brayden has learned the value of saving data!
16:16 🔗 brayden lol and now he tried to transfer it to one of his dedicated machines. Says the RAM/CPU has gone insane and he has to reboot the box.
18:22 🔗 shaqfu So, the guy with all the computer mags just got back to me
18:22 🔗 shaqfu He has a brief list of the mags; anyone interested?
18:26 🔗 godane hey shaqfu
18:26 🔗 shaqfu godane: Yo
18:26 🔗 godane i'm still downloading kat.ph community
18:26 🔗 godane close to 300mb .warc.gz
18:26 🔗 shaqfu Awesome
18:27 🔗 godane do you know much about grep?
18:27 🔗 shaqfu What are you trying to grep?
18:27 🔗 godane all photobucket.com images in my kat.ph dump
18:28 🔗 shaqfu Shouldn't those be picked up by --page-requisites? Or did you disable --span-hosts?
18:28 🔗 godane i didn't have that
18:28 🔗 shaqfu Oh, hm
18:29 🔗 shaqfu Should be reasonable to grep out photobucket.com and scrape out the URLs
18:29 🔗 godane also wouldn't it start going after the whole internet?
18:29 🔗 shaqfu You have to limit its levels
18:29 🔗 shaqfu Probably to 1
18:29 🔗 godane since there are images everywhere
18:31 🔗 godane also i was trying to only get stuff from kat.ph/community
18:32 🔗 godane i added a --wait 1 so this took a very long time
18:33 🔗 shaqfu Ah
18:33 🔗 godane i had to because it failed on me before
18:35 🔗 godane there are also a lot of 404 errors
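
(godane's exact command isn't in the log; a sketch of the kind of wget invocation being discussed, where --domains keeps --span-hosts from wandering off across the whole internet:)

    # illustrative only, not godane's actual command
    wget --recursive --level=1 --page-requisites --span-hosts \
         --domains=kat.ph,photobucket.com \
         --wait=1 --warc-file=kat.ph-community \
         http://kat.ph/community/
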
18:57 🔗 alard godane: Are the image urls absolute?
18:57 🔗 alard That would make the grepping much easier.
18:58 🔗 godane there are spaces in some
18:59 🔗 alard With absolute I meant that the url is not relative to the page it is on. E.g. ../../images/test.png is relative, http://photobucket.com/something.png or just /something.png is absolute.
19:00 🔗 alard Ah, now that I read your messages again: the pages are from kat.ph and they link to photobucket.com? Then the urls must be absolute.
19:00 🔗 godane yes, they're absolute
19:01 🔗 alard Then grepping is enough, since you don't need to know where the urls came from.
19:01 🔗 alard Let me see.
19:02 🔗 alard grep -ohP '<img[^>]+src="[^">]+"'
19:02 🔗 alard Does that produce something?
19:02 🔗 balrog_ alard: I have a simple feature request for wget
19:03 🔗 balrog_ it would be nice if it had an option which would make it query for indices in all directories
19:03 🔗 balrog_ so if an html file has pictures in /img, it should also query /img in case the httpd has indices turned on
19:03 🔗 balrog_ err, embedded pictures
19:03 🔗 balrog_ it should do that in general with all subdirs, with that option enabled
19:03 🔗 balrog_ do you follow? :)
19:04 🔗 alard balrog_: Yes. Have you tried it yourself? :)
19:04 🔗 alard Alternatively, since it is quite site-specific, you might want to use a Lua script that does this.
19:05 🔗 balrog_ is it site specific?
19:05 🔗 balrog_ have I tried to add that feature? not atm
19:05 🔗 godane holy crap
19:05 🔗 godane your code made a 31.3mb text file of image urls
19:06 🔗 alard balrog_: I'd say it is rather specific: some sites will have /index.html, others /. I don't think it's a problem that is general enough to need a special Wget option.
19:06 🔗 alard godane: You probably need a second grep to postprocess the output.
19:06 🔗 balrog_ uhh, I don't think you follow!
19:06 🔗 godane it got all of them i think too
19:06 🔗 godane i know
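
(A sketch of the second grep alard suggests, fed the output of his first command; the filenames here are assumed:)

    # strip the <img ... src="..."> wrapper, keep only photobucket links
    grep -oP 'src="\K[^">]+' img-tags.txt |
        grep 'photobucket\.com' | sort -u > photobucket-urls.txt
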
19:06 🔗 balrog_ what I meant was that it would be nice if wget could query for indices automatically
19:07 🔗 balrog_ even if said indices aren't linked from anywhere
19:07 🔗 alard balrog_: Yes, I think I do understand. If it finds /img/something.png you want it to request /img/ as well.
19:07 🔗 balrog_ yes
19:10 🔗 alard I think the fastest way to get that result is by writing a small Lua script.
19:11 🔗 alard That's what I would do, at least, if I were you.
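
(alard's actual suggestion is a Lua script for wget-lua; as a rough shell illustration of the same idea, one could post-process a list of fetched URLs and request each parent directory in case the httpd has indexes enabled. The filenames are assumed:)

    # for every URL seen, also fetch its directory listing (if any)
    sed -E 's,[^/]+$,,' urls.txt | sort -u |
        while read -r dir; do
            wget --no-clobber "$dir"
        done
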
21:12 🔗 godane hey shaqfu
21:47 🔗 dashcloud shaqfu: I thought someone else would have asked to see the list of magazines by now, but in any event, I would be interested in the list
21:50 🔗 godane i got the kastatic.com images now
