Time | Nickname | Message
00:04 | Coderjoe | I was continuing from what underscor said (the last person to say anything before me), and he was continuing from what dashcloud said (the last before underscor to say something)
01:46 | DFJustin | they're not uploading the full res to IA anyway
02:53 | Coderjoe | DFJustin: of course not. why upload for free what you can put on DVD/bluray and make money from?
02:54 | Coderjoe | (there has been an avgeeks collection at IA for years, with very few uploads added to it. Instead, Skip preferred selling his DVDs on his website)
03:00 | godane | Coderjoe: http://archive.org/details/avgeeks
03:00 | godane | they're still at it
03:01 | godane | the latest one was uploaded 3 days ago
03:01 | Coderjoe | you think I don't know that?
03:01 | Coderjoe | before this 100 miles project, uploads to that collection were very few and far between
03:05 | Coderjoe | hmm
03:05 | Coderjoe | I apparently missed a burst or two. I didn't realize there are currently 794 items in that collection
03:06 | godane | yes but the last 20 to 30 are only from 2012
03:09 | Coderjoe | earlier this week, or perhaps the end of last week, I was comparing the postings on the tumblr with the uploads on IA. there weren't a whole lot of recent non-prelinger uploads by skip
03:09 | godane | also i'm still downloading kat.ph/community
03:09 | godane | ok
03:10 | Coderjoe | hmm
03:10 | Coderjoe | I had not known about a large number he uploaded in 2011
03:21 | Coderjoe | and this should probably be in -bs
03:22 | ivan` | "NOTICE: Due to continous abuse, privatepaste.com will be shutting down August 1st, 2012."
03:22
🔗
|
ivan` |
too bad Google does not know about any pastes on it, I guess this is the "private" part |
03:23
🔗
|
ivan` |
https://encrypted.google.com/search?hl=en&source=hp&q="https+privatepaste.com" |
03:24
🔗
|
Coderjoe |
the only pastes that would show up on google are ones that were on some other website that they crawled |
03:28
🔗
|
nitro2k01 |
So, what now? Automatic crawling of other pastebins to grab as much of privatepaste as possible? |
03:30
🔗
|
shaqfu |
Is it possible to grab privatepastes via attrition? |
03:30
🔗
|
shaqfu |
Find how it generates URLs, then step through each and toss out dead links (probably with --redirects=0) |
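A minimal sketch of the attrition idea shaqfu describes, assuming the ten-character hexadecimal IDs seen in the example URL further down and a hypothetical output file; as the rest of the discussion makes clear, the size of the ID space makes this impractical:

    # check a handful of candidate IDs and keep the ones that don't 404
    # (purely illustrative -- the full space is roughly 1.1 trillion IDs)
    for id in 0000000000 0000000001 0000000002; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "http://privatepaste.com/$id")
        [ "$code" = "200" ] && echo "$id" >> live-ids.txt
    done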
03:30 | ivan` | I'm grepping all of my IRC logs for privatepaste
03:31 | ivan` | they have subdomains like http://pgsql.privatepaste.com/a9940ba8de
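A grep along these lines would pull the paste URLs, subdomains included, out of a directory of plain-text logs; the log path, the output file, and the assumption that IDs are alphanumeric are all hypothetical:

    grep -rhoE 'https?://([a-z0-9-]+\.)?privatepaste\.com/[A-Za-z0-9]+' ~/irclogs/ \
        | sort -u > privatepaste-urls.txt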
03:31 | shaqfu | 16^10...oof
03:31 | nitro2k01 | The URLs are likely either random or dependent on the text
03:32 | shaqfu | nitro2k01: That's why I was curious if attrition would work
03:33 | shaqfu | But the space is too great
03:50 | Coderjoe | mmm
03:50 | Coderjoe | 1 trillion IDs
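The figure checks out if the IDs are ten hexadecimal characters, as in the a9940ba8de example above:

    $ echo $((16**10))
    1099511627776

That is about 1.1 trillion candidate URLs, which is why the brute-force approach goes nowhere.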
03:51 | shaqfu | 1 trillion HTTP requests
03:51 | shaqfu | Wonder if we'd melt their network cards
03:55 | nitro2k01 | Probably more like melting their patience
03:55 | nitro2k01 | "That's it, we're closing early"
03:55 | nitro2k01 | Also, knowing all valid IDs would do no good without the passwords for some of them
03:56 | Coderjoe | august 1, 2012 is in the past
03:56 | nitro2k01 | So be it
10:30 | emijrp | i have an issue with the zip explorer
10:30 | emijrp | http://ia600503.us.archive.org/zipview.php?zip=/21/items/Spanishrevolution-UnAnyoDeTrabajoEnLasPlazasYBarrios.ActasDel15m/Actas15M-0001-0500.zip
10:30 | emijrp | all the files are downloaded as 0 bytes
10:30 | emijrp | and do not open
10:34 | Nemo_bis | emijrp: how big is the file?
10:34 | Nemo_bis | sigh, I suppose this is he.net's fault again? http://p.defau.lt/?EhO6n_E45JN4GzsGwOCzAQ
10:35 | Nemo_bis | ^ this will make a single wiki take a couple of days to upload, emijrp
10:35 | emijrp | ok
10:35 | emijrp | :P
10:39 | emijrp | the zip is just 50 MB
10:40 | Nemo_bis | hmm
10:41 | Nemo_bis | filenames seem plain enough
13:21 | ersi | Linked an article about "How to crawl a quarter billion webpages in 40 hours" in #archiveteam-bs
16:13 | brayden | Interesting story from a friend. He set up some terrible, terrible... awful disgusting pile of work of a website on some free web hosting thing. For some reason he decided to see if the host was still even around, and sure enough, they were. Amazingly, even after nearly 6 years of inactivity his account was still there and active, his site was not defaced, and all the info is still there.
16:13 | brayden | God damn.. very interesting to read.
16:13 | * | brayden has learned the value of saving data!
16:16 | brayden | lol and now he's tried to transfer it to one of his dedicated machines. Says the RAM/CPU has gone insane and he has to reboot the box.
18:22 | shaqfu | So, the guy with all the computer mags just got back to me
18:22 | shaqfu | He has a brief list of the mags; anyone interested?
18:26 | godane | hey shaqfu
18:26 | shaqfu | godane: Yo
18:26 | godane | i'm still downloading kat.ph community
18:26 | godane | close to a 300MB .warc.gz
18:27 | shaqfu | Awesome
18:27 | godane | do you know much about grep?
18:27 | shaqfu | What are you trying to grep?
18:28 | godane | all photobucket.com images in my kat.ph dump
18:28 | shaqfu | Shouldn't those be picked up by --page-requisites? Or did you disable --span-hosts?
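For comparison, a wget invocation roughly like the following would have pulled the photobucket images in as page requisites while keeping the crawl itself on kat.ph; the exact flag set is a guess, not what godane actually ran, and the --domains list is what stops --span-hosts from wandering across the whole web (the worry raised just below):

    wget --recursive --level=inf --no-parent \
         --page-requisites --span-hosts --domains=kat.ph,photobucket.com \
         --wait=1 --warc-file=kat-ph-community \
         http://kat.ph/community/

Note that listing photobucket.com in --domains also lets wget follow ordinary links into that site, so in practice one would probably tighten it further.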
18:28 | godane | i didn't have that
18:29 | shaqfu | Oh, hm
18:29 | shaqfu | Should be reasonable to grep out photobucket.com and scrape out the URLs
18:29 | godane | i also wouldn't want it to start going after the whole internet
18:29 | shaqfu | You have to limit its levels
18:29 | shaqfu | Probably to 1
18:31 | godane | since there are images everywhere
18:32 | godane | also i was trying to only get stuff from kat.ph/community
18:33 | godane | i added a --wait 1 so this took a very long time
18:33 | shaqfu | Ah
18:35 | godane | i had to because it failed on me before
18:57 | godane | there are also a lot of 404 errors
18:57 | alard | godane: Are the image urls absolute?
18:58 | alard | That would make the grepping much easier.
18:59 | godane | there are spaces in some
19:00 | alard | With absolute I meant that the url is not relative to the page it is on. E.g. ../../images/test.png is relative, http://photobucket.com/something.png or just /something.png is absolute.
19:00 | alard | Ah, now that I read your messages again: the pages are from kat.ph and they link to photobucket.com? Then the urls must be absolute.
19:01 | godane | yes, they're absolute
19:01 | alard | Then grepping is enough, since you don't need to know where the urls came from.
19:02 | alard | Let me see.
19:02 | alard | grep -ohP '<img[^>]+src="[^">]+"'
19:02 | alard | Does that produce something?
19:03 | balrog_ | alard: I have a simple feature request for wget
19:03 | balrog_ | it would be nice if it had an option which would make it query for indices in all directories
19:03 | balrog_ | so if an html file has pictures in /img, it should also query /img in case the httpd has indices turned on
19:03 | balrog_ | err, embedded pictures
19:03 | balrog_ | it should do that in general with all subdirs, with that option enabled
19:04 | balrog_ | do you follow? :)
19:04 | alard | balrog_: Yes. Have you tried it yourself? :)
19:05 | alard | Alternatively, since it is quite site-specific, you might want to use a Lua script that does this.
19:05 | balrog_ | is it site specific?
19:05 | balrog_ | have I tried to add that feature? not atm
19:05 | godane | holy crap
19:06 | godane | your code made a 31.3MB text file of image urls
19:06 | alard | balrog_: I'd say it is rather specific: some sites will have /index.html, others /. I don't think it's a problem that is general enough to need a special Wget option.
19:06 | alard | godane: You probably need a second grep to postprocess the output.
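A second pass roughly like this (filenames hypothetical) would reduce that output to bare photobucket URLs and feed them back to wget; matching everything up to the closing quote also tolerates the embedded spaces godane mentioned:

    grep -oE 'https?://[^"]*photobucket\.com/[^"]*' img-tags.txt | sort -u > photobucket-urls.txt
    wget --input-file=photobucket-urls.txt --wait=1 --warc-file=kat-ph-photobucket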
19:06 | balrog_ | uhh, I don't think you follow!
19:06 | godane | i think it got all of them too
19:06 | godane | i know
19:06 | balrog_ | what I meant was that it would be nice if wget could query for indices automatically
19:07 | balrog_ | even if said indices aren't linked from anywhere
19:07 | alard | balrog_: Yes, I think I do understand. If it finds /img/something.png you want it to request /img/ as well.
19:07 | balrog_ | yes
19:10 | alard | I think the fastest way to get that result is by writing a small Lua script.
19:11 | alard | That's what I would do, at least, if I were you.
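Short of writing that Lua hook, a crude shell approximation of balrog_'s idea is to take the list of URLs a finished crawl actually fetched, strip each back to its parent directory, and request those directories in a second pass in case the server has auto-indexing enabled; the file names here are hypothetical:

    # fetched-urls.txt: one URL per line from the first crawl
    sed 's|[^/]*$||' fetched-urls.txt | sort -u > dirs.txt
    wget --input-file=dirs.txt --recursive --level=1 --no-parent --warc-file=dir-indices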
21:12 | godane | hey shaqfu
21:47 | dashcloud | shaqfu: I thought someone else would have asked to see the list of magazines by now, but in any event, I would be interested in the list
21:50 | godane | i got the kastatic.com images now