00:03 <kennethre> tef: ping
00:17 <SketchCow> Regarding the downloading of the new site you want to archive, kennethre: Remember archive.org wants WARC and the world wants WGET.
00:18 <kennethre> SketchCow: I'm going to see what I can do. It may have to just be json dumps. The site's barely running and they have a heavy ajax interface.
00:18 <kennethre> but we'll see. WARC would be ideal.
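
For context, the WARC output SketchCow is asking for can be produced directly by a WARC-capable wget (the wget-warc fork, later merged into wget 1.14). A minimal sketch, with example.com standing in for the actual site:

    # minimal sketch, assuming a wget build with WARC support
    # --warc-file writes sitename.warc.gz alongside the normal mirror
    wget --mirror --page-requisites --warc-file=sitename "http://example.com/"
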
00:20 <SketchCow> That's fine.
00:20 <SketchCow> Do your best.
01:20 <SketchCow> Over 300 albums added to my jamendo-albums collection today.
01:20 <SketchCow> This is just really good stuff.
01:46 <balrog_ph> Did MultiUpload just croak?
01:48 <nitro2k01> Yep
01:48 <arrith> just wondering about that
01:49 <arrith> multiupload has been unresponsive for a while, same with scroogle.org
01:49 <arrith> been getting nxdomain for scroogle
01:51 <arrith> not finding any officialish news about multiupload on the google, just people reporting issues not getting it to load
02:47 <tef> kennethre: pong
03:22 <SketchCow> http://www.flickr.com/photos/textfiles/sets/72157629411540695/with/6913872479/
03:51 <nitro2k01> "Oh, the keyboards were bulky back then."
10:55 <DoubleJ> So, question about wget: If it's unable to connect to a host (say, due to the Fios router being a piece of junk) does it keep trying that one URL over and over, or does it skip every file until the connection comes back?
15:47 <chronomex> you usually get holes
15:58 <Schbirid> hm, i think i broke something in my item with s3
15:58 <Schbirid> new uploads were not added to the filelist
16:06 <Schbirid> maybe i should have used that bucket size hint
16:26 <DoubleJ> chronomex: So you're saying that any user I have with an unable-to-connect error needs to have that part of their account re-done. Crap.
16:27 <DoubleJ> Though I guess that explains why sometimes losing the internet causes dld-client to bomb and sometimes not. If it does it during --mirror things can recover before wget finishes.
16:28 <DoubleJ> Though if it results in incomplete downloads it should really be a fatal error.
16:30 <yipdw> DoubleJ: wget does do retries, but only up to a point
16:31 <yipdw> default number is 20 retries
16:31 <yipdw> as far as I can tell, the mobileme-grab scripts don't change that behavior
16:32 <DoubleJ> Hmm... I think the 5.5 hours I lost between midnight and waking up may account for more than 20 tries :)
16:32 <yipdw> well
16:32 <yipdw> hm
16:33 <yipdw> yeah, probably
16:33 <yipdw> I don't think wget uses exponential backoff
16:33 <yipdw> yeah it doesn't
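
For reference, the retry behaviour yipdw describes maps onto a few wget options; the values below are illustrative only (not what the mobileme-grab scripts set), and $url stands for whatever is being fetched:

    # defaults: --tries=20 and no exponential backoff
    # --waitretry waits 1s, 2s, ... up to N seconds between retries of a single URL
    wget --tries=40 --waitretry=10 --retry-connrefused --timeout=30 "$url"
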
16:33 <DoubleJ> Lemme try to find how many lines have the connect error. If it's 19 I'll be happy.
16:34 <yipdw> that said, if there was such an error, that should cause dld-single to exit erroneously
16:35 <yipdw> so you shouldn't have multiple users with holes, unless you're running dld-single in an unpredicated loop or something
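
A sketch of the distinction yipdw is drawing, i.e. a loop predicated on the child script's exit status rather than an unconditional one; the dld-single invocation here is a placeholder, see the mobileme-grab README for the real arguments:

    # stop as soon as dld-single exits non-zero, instead of looping forever
    while ./dld-single.sh "$@"; do
        echo "one user finished cleanly, fetching the next"
    done
    echo "dld-single reported an error; stopping so holes don't pile up" >&2
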
16:36 <DoubleJ> grep -i "unable " wget.log | wc -l
16:36 <DoubleJ> 307
16:36 <DoubleJ> D'oh.
16:37 <DoubleJ> I'm running dld-client.
16:37 <DoubleJ> Doesn't that just call dld-single, though?
16:37 <yipdw> yes
16:37 <yipdw> and it should exit on error
16:37 <yipdw> well, once a user's download finishes
16:38 <DoubleJ> Hm. So, either it didn't, or it's happy to try until the end of time.
16:38 <yipdw> point is that the number of users with holes should normally have the maximum number of simultaneous dld processes as its upper bound
16:38 <DoubleJ> Also: The actual error is "unable to resolve host", not "unable to connect" like I said before.
16:38 <yipdw> (also, try seesaw)
16:39 <DoubleJ> In case that changes anything.
16:39 <DoubleJ> Well, yes. Except that this has happened multiple times since I started downloading MobileMe. I think my router isn't happy with being asked to do actual work.
16:39 <yipdw> that should still generate an error exit
16:39 <yipdw> yeah, mine isn't either
16:39 <yipdw> I just use off-hours connection time at work
16:40 <yipdw> they have slightly more robust networking equipment
16:40 <DoubleJ> This would be less of a headache if the current user was able to write HTML and I hadn't been grabbing an endless series of 404s since Friday.
16:40 <DoubleJ> I can't just shrug it off, in case somewhere in that pile was a link he actually managed to code correctly.
16:41 <yipdw> ?
16:41 <DoubleJ> The user who set this question off.
16:41 <DoubleJ> wget --mirror has been working since Friday morning.
16:41 <yipdw> which one is that?
16:42 <DoubleJ> web.me.com/snoozeman
16:42 <DoubleJ> Not downloading the internet as far as I can tell.
16:42 <DoubleJ> Just a very long parade of bad links within his own account.
16:42 <DoubleJ> But like I said, somewhere in that 5.5 hours may have been a good link.
16:42 <DoubleJ> And spending another 4 days on the guy is going to be annoying.
16:49 <yipdw> ha, it's literally a family history
16:54 <DoubleJ> Oh christ.
16:54 <DoubleJ> If he goes back to the Vikings I'm screwed.
16:54 <yipdw> more specifically, it's a large set of pages generated via Reunion
16:55 <yipdw> I guess that user just didn't upload / deleted a bunch of them
16:55 <DoubleJ> Now, the wget log just says "unable to resolve host web.me.com". Is there any way of finding out what URL it was trying?
16:55 <Schbirid> i use ngrep for this kind of stuff
16:55 <DoubleJ> I think he renamed a directory. My URL list has "Ray's Genealogy" but the 404s are just "Genealogy".
16:56 <DoubleJ> Schbirid: 'splain?
16:56 <Schbirid> if you mean realtime checking of http traffic etc
16:56 <DoubleJ> Oh, not realtime. This happened while I was asleep.
16:56 <Schbirid> ah no idea then :(
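
For the curious, the sort of realtime check Schbirid has in mind with ngrep looks roughly like this (the interface name is an assumption):

    # watch outgoing HTTP request lines on the wire as they happen
    sudo ngrep -q -W byline -d eth0 'Host:' 'tcp and port 80'
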
16:56 <SketchCow> GREETINGS FROM THE PRESERVING VIRTUAL WORLDS 2 ADVISORY BOARD MEETING
16:56 <SketchCow> <---in Stanford
16:57 <SketchCow> Also awesome
16:57 <yipdw> I went to Stanford once
16:57 <yipdw> nice campus
16:57 <yipdw> DoubleJ: you can try to correlate it with the URL list, but other than that I'm not sure
16:57 <yipdw> probably better to just retry the download, or move on
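
One way to recover roughly which URL wget was on: in the log, each request is introduced a couple of lines earlier by a "--<timestamp>--  <url>" line, so grepping with leading context gets close. The log filename is the one used earlier in the channel, and the exact log layout varies by wget version:

    # print a few lines of context before each resolver error, then pull out the URLs
    grep -B 3 -i "unable to resolve host" wget.log | grep -o 'http://[^ ]*' | sort -u
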
16:57 <yipdw> I'm downloading snoozeman now
16:58 <DoubleJ> I think I'll just do that.
16:59 <DoubleJ> And thanks. Hopefully he takes less time for you. And no router problems.
16:59 <DoubleJ> I'll have to go through my other logs and see who else has that error.
16:59 <DoubleJ> What's a good way to find all files that have a particular string in them?
16:59 <yipdw> grep -R
17:00 <yipdw> ack also works if you have that installed
17:00 <DoubleJ> So in mobileme-grab/data I say grep -R "unable " and it spits out a file list?
17:00 <yipdw> yes, but you probably will want to scope it to just logs
17:01 <DoubleJ> All right, I'll give that a try after lunch. Thanks.
17:02 <yipdw> actually
17:02 <yipdw> if you just want a list, grep -lR
17:02 <yipdw> there may be better tools, I'm not sure
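
Putting yipdw's suggestion together and scoping it to just the logs might look like this; the data path and log filename are assumptions based on the discussion above:

    # -l lists only the matching filenames; --include restricts the search to wget logs
    grep -Rl --include='wget.log' "unable to resolve host" mobileme-grab/data/
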
17:02 <yipdw> so
17:02 <yipdw> i think the maximum number of 404s I can get on snoozeman is 17,982
17:03 <yipdw> awesome
17:03 <yipdw> oh hey, it's this guy
17:04 <yipdw> http://snoozemanscruiseblog.blogspot.com/
17:04 <yipdw> holy shit, he's done 69 cruises
17:05 <yipdw> wtf
17:05 <yipdw> considering that cruises generally run at least a couple grand per, that's kind of
17:07 <yipdw> conspicuous
17:11 <Soojin> maybe he hides in the cargo area
17:11 <SketchCow> Randall Schwartz has done many cruises.
17:11 <SketchCow> Mostly by being the guy co-running tour groups and the photography.
17:13 <SketchCow> He is on the JoCo Cruise Crazy 2 as we speak.
17:14 <godane> can you guys take like playstation demo discs?
17:14 <SketchCow> I'll gladly take them.
17:14 <SketchCow> I can suggest a better place.
17:14 <godane> i know of sites that have them too
17:14 * SketchCow is sitting here next to two guys who run game archives.
17:15 <SketchCow> JP Dyson of the Strong Museum of Play and the CHEG
17:15 <SketchCow> And Henry Lowood of the Stanford Software Archives
17:15 <SketchCow> Yes, we're all in the same room so this room is awesome now
17:15 <godane> http://www.emuparadise.me/Sony_Playstation_-_Demos_ISOs/25
17:16 <godane> the one rom site has 100s of demos
17:17 <godane> i also found a torrent of 15 years worth of scientific american
17:17 <SketchCow> Yeah, I snagged that
17:17 <SketchCow> The public domain ones are now up on archive.org.
17:18 <SketchCow> http://www.archive.org/details/scientific-american-1845-1909
17:18 <SketchCow> Oh, to explain cruiseman as another thing.
17:18 <godane> i think you are pushing me to buy blu-ray
17:19 <godane> only for archive reasons
17:19 <SketchCow> If you can organize cruise groups (and plane groups too) and you hit the mark of, like, 10 or 15 people I forget, you get along for free.
17:19 <yipdw> oh, hm
17:20 <SketchCow> So assume snoozeman organizes cruises
17:23 <godane> SketchCow: can you move this to software: http://www.archive.org/details/cdrom-linux-format-73
17:24 <SketchCow> You did it!
17:25 <godane> thank you
17:26 <DFJustin> somebody else pointed out that there is other stuff sitting in community texts http://www.archive.org/search.php?query=subject%3A%22cdbbsarchive%22
17:26 <SketchCow> I wish you had made the two CDs separate, but you didn't know and I need to think about it regardless.
17:32 <SketchCow> http://www.archive.org/details/cdrom-linux-format-73&reCache=1
17:35 <godane> sorry about that
17:36 <godane> i found 3 years of pc advisor
17:36 <godane> there was a cd that came in a magazine in 2005
17:36 <balrog> SketchCow: this might be worth archiving: http://adfly.simplaza.net/
17:36 <balrog> (as part of the link shortener project)
17:40 <balrog> the cache contains most of adf.ly and similar links
17:40 <SketchCow> Thanks for the tip, DFJustin - I just yanked all those into the CD archive.
17:41 <SketchCow> They're all Italian CDs!
17:44 <SketchCow> godane: Any chance of scanning the fronts of the DVDs/CDs?
17:58 <godane> maybe
17:59 <ersi> Hm, Waybackmachine is a separate front end that doesn't hang together with Heritrix - right?
18:00 <ersi> Or, I mean - Heritrix seems to "only" be the crawler
18:01 <SketchCow> Before you stamped off, ersi - the umich project turned out to be something alard could handle individually.
18:03 <ersi> Yeah, I figured :-)
18:03 <ersi> there we go, found the open-source wayback project
18:04 <soultcer> As far as I understood it the open source project is different from the wayback machine they use at archive.org?
18:04 <soultcer> Or did that change with the new wayback interface?
18:04 <SketchCow> Supposed to be.
18:04 <SketchCow> They've started to use our wget-warc now. :)
18:04 <SketchCow> In some aspects.
18:05 <ersi> According to http://archive-access.sourceforge.net/projects/wayback/ the current implementation is Perl, this one is Java
18:06 <soultcer> I want to run a crawl of websites from my home country using Heritrix, just to see what I will find
18:06 <ersi> And how would you crawl a country?
18:06 <soultcer> Limit it to the country specific top level domain and domains that resolve to IPs that some geoip db thinks are from that country
18:08 <ersi> alright
18:08 <yipdw> soultcer: one tricky bit: the mapping from IP to domain is one-to-many
18:08 <ersi> I'd say going for the ccTLD is probably the most viable if that's what you're choosing from
18:09 <yipdw> actually it's many-to-many but the other direction isn't so important
18:09 <soultcer> yipdw: Don't worry, I've got that part already covered.
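
A rough sketch of the scoping rule soultcer describes, written as a shell predicate. The ccTLD, the country name, and the use of dig and geoiplookup are illustrative assumptions; a real Heritrix crawl would express this as scope rules instead:

    CC_TLD=".example"       # placeholder ccTLD, e.g. ".at"
    COUNTRY="Examplestan"   # country name as printed by geoiplookup
    # in_scope HOST -> exit 0 if HOST ends in the ccTLD or geolocates to the country
    in_scope() {
        host="$1"
        case "$host" in
            *"$CC_TLD") return 0 ;;
        esac
        ip=$(dig +short "$host" | grep -m1 '^[0-9]')   # first IPv4 answer, if any
        [ -n "$ip" ] && geoiplookup "$ip" | grep -q "$COUNTRY"
    }
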
18:14 <yipdw> ok
19:02 <Schbirid> is "zgrep -h something manyfiles*" not working for anyone else too? it should not list the filenames but for me it does anyways.
19:03 <ersi> Hmmm, I'm considering building a spider/crawler
19:04 <tef> ersi: what in
19:04 <ersi> asking in what language?
19:05 <tef> yes
19:05 <ersi> I was thinking Python
19:05 <tef> ah
19:05 <tef> https://github.com/tef/crawler this might help
19:05 <ersi> What I really want is to map a domain/subdomain
19:05 <tef> although the warcs are terrible
19:06 <ersi> checking it
19:08 <ersi> ooh, there's a HTMLParser? I have only fiddled a little with SGMLParser
19:17 <ersi> actually, yeah I'll check this out ^_^
19:17 <ersi> tef: Hm, I pip installed hanzo-warc-tools - but the crawler can still not find it - suggestions? 'requests' went fine to pip install :o
19:17 <soultcer> ersi: Map a domain/subdomain? You mean crawl the whole page and get a warc as output?
19:18 <tef> ersi: python -m hanzo.warctools should work
19:18 <ersi> No, I'd just like to map all the URLs
19:18 <soultcer> ?
19:19 <ersi> "Find ALL THE URLS!"
19:19 <soultcer> You must have a lot of bandwidth and storage space ;-)
19:19 <ersi> I would only save URLs corresponding to my target domain
19:20 <ersi> note, Resource Locations. Not the *resources*
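
A crude way to get the kind of URL map ersi wants, without writing a crawler at all, is to let wget walk the site in spider mode and scrape the URLs back out of its log. The domain, depth, and awk field position are assumptions; wget's log format shifts slightly between versions:

    # --spider fetches pages only to discover links; the requested URLs show up
    # in log lines of the form "--<timestamp>--  http://..."
    wget --spider -r -l 5 -np "http://example.com/" 2>&1 \
        | grep '^--' | awk '{print $3}' | sort -u > url-map.txt
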
19:20 <soultcer> When we started on urlteam stuff we thought "me, maybe a couple hundred megabytes of shorturls"
19:20 <ersi> tef: Hmmm, I'm not sure what I'm doing wrong here
19:20 <soultcer> bit.ly alone has probably over a TB of urls...
19:21 <tef> ersi: I don't know pip that well :/
19:21 <ersi> tef: How'd you install hanzo-warc-tools? If you were to do it
19:21 <tef> oh I use the hg repo and set PYTHONPATH, but I am like that
19:21 <ersi> oh, heh
19:22 <tef> you could just hg clone from code.hanzoarchives.com
19:25 <tef> but I tend to push to the repo so
19:27 <ersi> hmmmm, I added my hanzo clone to PYTHONPATH but it's not taking it up for some reason
19:29 <ersi> augh, python packaging :(
19:31 <ersi> I'll go poke around in crawler.py instead :]
19:33 <soultcer> ersi: I still don't get the urls idea? what use would it be to only know the urls?
19:34 <ersi> well, you could for example break down the downloading of the target in smaller chunks and distribute the work
19:34 <soultcer> Hm, I guess
19:35 <tef> ersi: as in export PYTHONPATH='..'
19:35 <tef> ersi: maybe you're in a virtualenv
19:36 <ersi> Yeah, but I did it with an absolute location
19:36 <ersi> export PYTHONPATH=/home/ersi/libpython/ where hanzo/warc*/ is located
19:36 <tef> ah you need the repo dir in
19:36 <tef> not the dir containing the repo
19:37 <tef> inside the repo is the package
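
Concretely, what tef means here (the checkout directory name is hypothetical): PYTHONPATH has to point at the clone itself, because the hanzo/ package sits at its top level, not one level up:

    # wrong: the directory that merely contains the clone
    # export PYTHONPATH=/home/ersi/libpython/
    # right: the clone itself, which holds the hanzo/ package
    export PYTHONPATH=/home/ersi/libpython/warc-tools
    python -m hanzo.warctools   # the import check tef suggested earlier
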
19:37
🔗
|
ersi |
Oh, heh |
20:02
🔗
|
ersi |
tef: Have you ran the crawler against any sites? |
20:03
🔗
|
tef |
a few but nothing heavy - the warc output is ugly and needs to be cleaned up |
20:03
🔗
|
tef |
the htmlparser should pick out a few more types of links at least |
20:05
🔗
|
ersi |
oh, found the issue I was getting. you're using the variable 'url' in the HTMLException on line 100 instead of response.url :) |
20:05
🔗
|
ersi |
tef: ^ |
20:08
🔗
|
tef |
whoops |
20:08
🔗
|
tef |
I may not have pushed everything |
20:08
🔗
|
tef |
welp |
20:09
🔗
|
ersi |
Hm, wonder what I'm throwing at it that's making it not able to extract the links :o |
20:13
🔗
|
tef |
hmmm |
20:13
🔗
|
ersi |
haha, my friends stupid webserver closes the connection prematurely |
20:14
🔗
|
ersi |
might be why |
20:14
🔗
|
soultcer |
Welcome to webcrawler 101 |
20:14
🔗
|
soultcer |
Lection 1: Web crawling is like running the most complete fuzz test ever against your program |
20:14
🔗
|
ersi |
How suitable |
20:15
🔗
|
ersi |
I'm a software tester at day |
20:15
🔗
|
tef |
soultcer: yes yes yes |
20:16
🔗
|
tef |
constantly changing fuzzer |
20:16
🔗
|
ersi |
quirky quicky quirky~ |
20:19
🔗
|
tef |
also if any of you are REST types i'd appreciate any comments https://github.com/tef/hate-rpc |
20:19
🔗
|
tef |
but I am going to bed because i have been up for 28 hours. |
20:20
🔗
|
ersi |
websites for robots? woot |
20:28
🔗
|
kennethre |
tef: https://github.com/kennethreitz/convore.json |
20:33
🔗
|
tef |
I saw. Nice :D |
20:38
🔗
|
kennethre |
tef: i'm working on making it browsable html |
21:41
🔗
|
ersi |
argh, I feel so silly using Requests |
21:42
🔗
|
ersi |
or I'm getting used rather :D |
21:43
🔗
|
ersi |
Oh, I'm so silly. :| Don't do requestR.text() instead of requestR.test >_> |
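
In other words, in Requests the decoded body lives on the response as an attribute, not a method. A quick sanity check from the shell, with a placeholder URL:

    # r.text is a property holding the decoded body; calling r.text() raises TypeError
    python -c "import requests; r = requests.get('http://example.com/'); print(r.text[:80])"
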
21:51 <kennethre> ersi: :(
21:52 <kennethre> ersi: ah, gotcha. What do you think so far?
21:52 <ersi> It's too easy!
21:52 <kennethre> <3
21:53 <ersi> It feels awesome, feels pythonic
21:53 <kennethre> perfect