[00:03] tef: ping
[00:17] Regarding the downloading of the new site you want to archive, kennethre: Remember archive.org wants WARC and the world wants WGET.
[00:18] SketchCow: I'm going to see what I can do. It may have to just be json dumps. The site's barely running and they have a heavy ajax interface.
[00:18] but we'll see. WARC would be ideal.
[00:20] That's fine.
[00:20] Do your best.
[01:20] Over 300 albums added to my jamendo-albums collection today.
[01:20] This is just really good stuff.
[01:46] Did MultiUpload just croak?
[01:48] Yep
[01:48] just wondering about that
[01:49] multiupload has been unresponsive for a while, same with scroogle.org
[01:49] been getting nxdomain for scroogle
[01:51] not finding any officialish news about multiupload on the google, just people reporting issues not getting it to load
[02:47] kennethre: pong
[03:22] http://www.flickr.com/photos/textfiles/sets/72157629411540695/with/6913872479/
[03:51] "Oh, the keyboards were bulky back then."
[10:55] So, question about wget: If it's unable to connect to a host (say, due to the Fios router being a piece of junk) does it keep trying that one URL over and over, or does it skip every file until the connection comes back?
[15:47] you usually get holes
[15:58] hm, i think i broke something in my item with s3
[15:58] new uploads were not added to the filelist
[16:06] maybe i should have used that bucket size hint
[16:26] chronomex: So you're saying that any user I have with an unable-to-connect error needs to have that part of their account re-done. Crap.
[16:27] Though I guess that explains why sometimes losing the internet causes dld-client to bomb and sometimes not. If it does it during --mirror things can recover before wget finishes.
[16:28] Though if it results in incomplete downloads it should really be a fatal error.
[16:30] DoubleJ: wget does do retries, but only up to a point
[16:31] default number is 20 retries
[16:31] as far as I can tell, the mobileme-grab scripts don't change that behavior
[16:32] Hmm... I think the 5.5 hours I lost between midnight and waking up may account for more than 20 tries :)
[16:32] well
[16:32] hm
[16:33] yeah, probably
[16:33] I don't think wget uses exponential backoff
[16:33] yeah it doesn't
[16:33] Lemme try to find how many lines have the connect error. If it's 19 I'll be happy.
[16:34] that said, if there was such an error, that should cause dld-single to exit with an error
[16:35] so you shouldn't have multiple users with holes, unless you're running dld-single in an unpredicated loop or something
[16:36] grep -i "unable " wget.log | wc -l
[16:36] 307
[16:36] D'oh.
[16:37] I'm running dld-client.
[16:37] Doesn't that just call dld-single, though?
[16:37] yes
[16:37] and it should exit on error
[16:37] well, once a user's download finishes
[16:38] Hm. So, either it didn't, or it's happy to try until the end of time.
[16:38] point is that the number of users with holes should normally have the maximum number of simultaneous dld processes as its upper bound
[16:38] Also: The actual error is "unable to resolve host", not "unable to connect" like I said before.
[16:38] (also, try seesaw)
[16:38] In case that changes anything.
[16:39] Well, yes. Except that this has happened multiple times since I started downloading MobileMe. I think my router isn't happy with being asked to do actual work.
[16:39] that should still generate an error exit
[16:39] yeah, mine isn't either
[16:39] I just use off-hours connection time at work
[16:39] they have slightly more robust networking equipment
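A quick sketch of the wget behaviour discussed above: the per-URL retry limit defaults to 20 and the backoff between retries is linear, not exponential. The flag values, the example account URL, and the log filename below are illustrative only; as noted in the log, the mobileme-grab scripts leave the defaults alone.

```sh
# Raise wget's per-URL retry limit (default 20), cap the linear backoff,
# and also retry refused connections. Flag values and the example account
# URL are placeholders, not what mobileme-grab's dld scripts actually use.
wget --mirror --tries=40 --waitretry=30 --retry-connrefused \
    "http://web.me.com/exampleuser/"

# Count the DNS-failure lines in an existing log, without piping to wc -l:
grep -ci "unable to resolve" wget.log
```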
[16:40] This would be less of a headache if the current user was able to write HTML and I hadn't been grabbing an endless series of 404s since Friday.
[16:40] I can't just shrug it off, in case somewhere in that pile was a link he actually managed to code correctly.
[16:40] ?
[16:41] The user who set this question off.
[16:41] wget --mirror has been working since Friday morning.
[16:41] which one is that?
[16:41] web.me.com/snoozeman
[16:42] Not downloading the internet as far as I can tell.
[16:42] Just a very long parade of bad links within his own account.
[16:42] But like I said, somewhere in that 5.5 hours may have been a good link.
[16:42] And spending another 4 days on the guy is going to be annoying.
[16:49] ha, it's literally a family history
[16:54] Oh christ.
[16:54] If he goes back to the Vikings I'm screwed.
[16:54] more specifically, it's a large set of pages generated via Reunion
[16:55] I guess that user just didn't upload / deleted a bunch of them
[16:55] Now, the wget log just says "unable to resolve host web.me.com". Is there any way of finding out what URL it was trying?
[16:55] i use ngrep for this kind of stuff
[16:55] I think he renamed a directory. My URL list has "Ray's Genealogy" but the 404s are just "Genealogy".
[16:56] Schbirid: 'splain?
[16:56] if you mean realtime checking of http traffic etc
[16:56] Oh, not realtime. This happened while I was asleep.
[16:56] ah no idea then :(
[16:56] GREETINGS FROM THE PRESERVING VIRTUAL WORLDS 2 ADVISORY BOARD MEETING
[16:56] <---in Stanford
[16:56] Also awesome
[16:57] I went to Stanford once
[16:57] nice campus
[16:57] DoubleJ: you can try to correlate it with the URL list, but other than that I'm not sure
[16:57] probably better to just retry the download, or move on
[16:57] I'm downloading snoozeman now
[16:57] I think I'll just do that.
[16:58] And thanks. Hopefully he takes less time for you. And no router problems.
[16:59] I'll have to go through my other logs and see who else has that error.
[16:59] What's a good way to find all files that have a particular string in them?
[16:59] grep -R
[16:59] ack also works if you have that installed
[17:00] So in mobileme-grab/data I say grep -R "unable " and it spits out a file list?
[17:00] yes, but you probably will want to scope it to just logs
[17:00] All right, I'll give that a try after lunch. Thanks.
[17:01] actually
[17:02] if you just want a list, grep -lR
[17:02] there may be better tools, I'm not sure
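Spelled out, the scoped version of that grep suggestion could look like the following; the mobileme-grab/data path comes from the exchange above, but the wget*.log name pattern is an assumption about how the grab scripts name their logs.

```sh
# List only the files that contain the string (-l), searching recursively (-R)
# and case-insensitively (-i). --include is GNU grep and limits the search to
# wget logs; the wget*.log glob is an assumed naming convention.
grep -lRi --include='wget*.log' "unable to resolve" mobileme-grab/data/
```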
[17:02] so
[17:02] i think the maximum number of 404s I can get on snoozeman is 17,982
[17:02] awesome
[17:03] oh hey, it's this guy
[17:03] http://snoozemanscruiseblog.blogspot.com/
[17:04] holy shit, he's done 69 cruises
[17:04] wtf
[17:05] considering that cruises generally run at least a couple grand per, that's kind of
[17:05] conspicuous
[17:07] maybe he hides in the cargo area
[17:11] Randall Schwartz has done many cruises.
[17:11] Mostly by being the guy co-running tour groups and the photography.
[17:11] He is on the JoCo Cruise Crazy 2 as we speak.
[17:13] can you guys take like playstation demo discs?
[17:14] I'll gladly take them.
[17:14] I can suggest a better place.
[17:14] i know of sites that have them too
[17:14] * SketchCow is sitting here next to two guys who run game archives.
[17:14] JP Dyson of the Strong Museum of Play and ICHEG
[17:15] And Henry Lowood of the Stanford Software Archives
[17:15] Yes, we're all in the same room so this room is awesome now
[17:15] http://www.emuparadise.me/Sony_Playstation_-_Demos_ISOs/25
[17:15] that one rom site has 100s of demos
[17:16] i also found a torrent of 15 years worth of scientific american
[17:17] Yeah, I snagged that
[17:17] The public domain ones are now up on archive.org.
[17:17] http://www.archive.org/details/scientific-american-1845-1909
[17:18] Oh, to explain cruiseman as another thing.
[17:18] i think you are pushing me to buy blu-ray
[17:18] only for archive reasons
[17:19] If you can organize cruise groups (and plane groups too) and you hit the mark of, like, 10 or 15 people, I forget, you get to go along for free.
[17:19] oh, hm
[17:19] So assume snoozeman organizes cruises
[17:20] SketchCow: can you move this to software: http://www.archive.org/details/cdrom-linux-format-73
[17:23] You did it!
[17:24] thank you
[17:25] somebody else pointed out that there is other stuff sitting in community texts http://www.archive.org/search.php?query=subject%3A%22cdbbsarchive%22
[17:26] I wish you had made the two CDs separate, but you didn't know and I need to think about it regardless.
[17:26] http://www.archive.org/details/cdrom-linux-format-73&reCache=1
[17:32] sorry about that
[17:35] i found 3 years of pc advisor
[17:36] there was a cd that came in a magazine in 2005
[17:36] SketchCow: this might be worth archiving: http://adfly.simplaza.net/
[17:36] (as part of the link shortener project)
[17:36] the cache contains most of adf.ly and similar links
[17:40] Thanks for the tip, DFJustin - I just yanked all those into the CD archive.
[17:40] They're all Italian CDs!
[17:41] godane: Any chance of scanning the fronts of the DVDs/CDs?
[17:44] maybe
[17:58] Hm, the Wayback Machine is a separate front end that doesn't hang together with Heritrix, right?
[17:59] Or, I mean - Heritrix seems to "only" be the crawler
[18:00] Before you stamped off, ersi - the umich project turned out to be something alard could handle individually.
[18:01] Yeah, I figured :-)
[18:03] there we go, found the open-source wayback project
[18:03] As far as I understood it the open source project is different from the wayback machine they use at archive.org?
[18:04] Or did that change with the new wayback interface?
[18:04] Supposed to be.
[18:04] They've started to use our wget-warc now. :)
[18:04] In some aspects.
[18:04] According to http://archive-access.sourceforge.net/projects/wayback/ the current implementation is Perl, this one is Java
[18:05] I want to run a crawl of websites from my home country using Heritrix, just to see what I will find
[18:06] And how would you crawl a country?
[18:06] Limit it to the country-specific top level domain and domains that resolve to IPs that some geoip db thinks are from that country
[18:06] alright
[18:08] soultcer: one tricky bit: the mapping from IP to domain is one-to-many
[18:08] I'd say going for the ccTLD is probably the most viable if that's what you're choosing from
[18:08] actually it's many-to-many but the other direction isn't so important
[18:08] yipdw: Don't worry, I've got that part already covered.
[18:09] ok
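A minimal sketch of the scoping rule soultcer describes: keep a host if it sits under the ccTLD, or if its IP geolocates to the target country. It assumes the geoip2 package plus a downloaded GeoLite2 country database, and uses ".at"/"AT" purely as placeholders; none of those specifics come from the log. In Heritrix itself the ccTLD half maps naturally onto a SURT-prefix scope rule, while the geoip half would likely need a custom DecideRule.

```python
# Sketch only: in_scope() keeps a URL if its host is under the ccTLD, or if
# the host's IP geolocates to the target country. geoip2, the GeoLite2
# database file and the ".at"/"AT" values are assumptions for illustration.
import socket
from urllib.parse import urlparse

import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def in_scope(url, cctld=".at", iso_code="AT"):
    host = urlparse(url).hostname or ""
    if host.endswith(cctld):                  # rule 1: country-specific TLD
        return True
    try:
        ip = socket.gethostbyname(host)       # rule 2: geolocate the host's IP
        return reader.country(ip).country.iso_code == iso_code
    except (socket.gaierror, geoip2.errors.AddressNotFoundError):
        return False
```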
[18:14] is "zgrep -h something manyfiles*" not working for anyone else too? it should not list the filenames but for me it does anyway.
[19:02] Hmmm, I'm considering building a spider/crawler
[19:03] ersi: what in
[19:04] asking in what language?
[19:04] yes
[19:05] I was thinking Python
[19:05] ah
[19:05] https://github.com/tef/crawler this might help
[19:05] What I really want is to map a domain/subdomain
[19:05] although the warcs are terrible
[19:05] checking it
[19:06] ooh, there's a HTMLParser? I have only fiddled a little with SGMLParser
[19:08] actually, yeah I'll check this out ^_^
[19:17] tef: Hm, I pip installed hanzo-warc-tools - but the crawler can still not find it - suggestions? 'requests' went fine to pip install :o
[19:17] ersi: Map a domain/subdomain? You mean crawl the whole page and get a warc as output?
[19:17] ersi: python -m hanzo.warctools should work
[19:18] No, I'd just like to map all the URLs
[19:18] ?
[19:18] "Find ALL THE URLS!"
[19:19] You must have a lot of bandwidth and storage space ;-)
[19:19] I would only save URLs corresponding to my target domain
[19:19] note, Resource Locations. Not the *resources*
[19:20] When we started on urlteam stuff we thought "meh, maybe a couple hundred megabytes of shorturls"
[19:20] tef: Hmmm, I'm not sure what I'm doing wrong here
[19:20] bit.ly alone has probably over a TB of urls...
[19:20] ersi: I don't know pip that well :/
[19:21] tef: How'd you install hanzo-warc-tools? If you were to do it
[19:21] oh I use the hg repo and set PYTHONPATH, but I am like that
[19:21] oh, heh
[19:21] you could just hg clone from code.hanzoarchives.com
[19:22] but I tend to push to the repo so
[19:25] hmmm, I added my hanzo clone to PYTHONPATH but it's not picking it up for some reason
[19:27] augh, python packaging :(
[19:29] I'll go poke around in crawler.py instead :]
[19:31] ersi: I still don't get the urls idea? what use would it be to only know the urls?
[19:33] well, you could for example break down the downloading of the target into smaller chunks and distribute the work
[19:34] Hm, I guess
[19:34] ersi: as in export PYTHONPATH='..'
[19:35] ersi: maybe you're in a virtualenv
[19:35] Yeah, but I did it with an absolute location
[19:36] export PYTHONPATH=/home/ersi/libpython/ where hanzo/warc*/ is located
[19:36] ah you need the repo dir in
[19:36] not the dir containing the repo
[19:36] inside the repo is the package
[19:37] Oh, heh
[20:02] tef: Have you run the crawler against any sites?
[20:03] a few but nothing heavy - the warc output is ugly and needs to be cleaned up
[20:03] the htmlparser should pick out a few more types of links at least
[20:05] oh, found the issue I was getting. you're using the variable 'url' in the HTMLException on line 100 instead of response.url :)
[20:05] tef: ^
[20:08] whoops
[20:08] I may not have pushed everything
[20:08] welp
[20:09] Hm, wonder what I'm throwing at it that's making it not able to extract the links :o
[20:13] hmmm
[20:13] haha, my friend's stupid webserver closes the connection prematurely
[20:14] might be why
[20:14] Welcome to webcrawler 101
[20:14] Lesson 1: Web crawling is like running the most complete fuzz test ever against your program
[20:14] How suitable
[20:15] I'm a software tester by day
[20:15] soultcer: yes yes yes
[20:16] constantly changing fuzzer
[20:16] quirky quicky quirky~
[20:19] also if any of you are REST types i'd appreciate any comments https://github.com/tef/hate-rpc
[20:19] but I am going to bed because i have been up for 28 hours.
[20:20] websites for robots? woot
[20:28] tef: https://github.com/kennethreitz/convore.json
[20:33] I saw. Nice :D
[20:38] tef: i'm working on making it browsable html
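Circling back to the PYTHONPATH exchange above ([19:34]-[19:37]): the fix is to put the repository checkout itself on the path, because the hanzo package directory lives inside the checkout. Only python -m hanzo.warctools is taken from the log; the repository path and clone location below are guesses.

```sh
# Put the checkout itself (which contains the hanzo/ package) on PYTHONPATH,
# not the directory above it. The repo URL and clone location are assumptions.
hg clone https://code.hanzoarchives.com/warc-tools ~/src/warc-tools

export PYTHONPATH=~/src             # wrong: this is the dir containing the repo
export PYTHONPATH=~/src/warc-tools  # right: the repo dir, which holds hanzo/
python -m hanzo.warctools           # should now resolve the package
```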
[21:41] argh, I feel so silly using Requests
[21:42] or I'm getting used to it, rather :D
[21:43] Oh, I'm so silly. :| Don't do requestR.text() instead of requestR.text >_>
[21:51] ersi: :(
[21:51] ersi: ah, gotcha. What do you think so far?
[21:52] It's too easy!
[21:52] <3
[21:52] It feels awesome, feels pythonic
[21:53] perfect
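One closing note on the Requests detail ersi tripped over at [21:43]: Response.text is a property, not a method, so it is read without parentheses. The URL below is only a placeholder.

```python
# requests.Response.text is a property holding the decoded body as a str.
import requests

r = requests.get("http://example.com/")  # placeholder URL
body = r.text     # correct: the decoded response body
# r.text()        # wrong: str is not callable, raises TypeError
```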