#archiveteam 2012-02-21,Tue

Time Nickname Message
00:03 🔗 kennethre tef: ping
00:17 🔗 SketchCow Regarding the downloading of the new site you want to archive, kennethre: Remember archive.org wants WARC and the world wants WGET.
00:18 🔗 kennethre SketchCow: I'm going to see what I can do. It may have to just be json dumps. The site's barely running and they have a heavy ajax interface.
00:18 🔗 kennethre but we'll see. WARC would be ideal.
00:20 🔗 SketchCow That's fine.
00:20 🔗 SketchCow Do your best.
01:20 🔗 SketchCow Over 300 albums added to my jamendo-albums collection today.
01:20 🔗 SketchCow This is just really good stuff.
01:46 🔗 balrog_ph Did MultiUpload just croak?
01:48 🔗 nitro2k01 Yep
01:48 🔗 arrith just wondering about that
01:49 🔗 arrith multiupload has been unresponsive for a while, same with scroogle.org
01:49 🔗 arrith been getting nxdomain for scroogle
01:51 🔗 arrith not finding any officialish news about multiupload on the google, just people reporting issues not getting it to load
02:47 🔗 tef kennethre: pong
03:22 🔗 SketchCow http://www.flickr.com/photos/textfiles/sets/72157629411540695/with/6913872479/
03:51 🔗 nitro2k01 "Oh, the keyboards were bulky back then."
10:55 🔗 DoubleJ So, question about wget: If it's unable to connect to a host (say, due to the Fios router being a piece of junk) does it keep trying that one URL over and over, or does it skip every file until the connection comes back?
15:47 🔗 chronomex you usually get holes
15:58 🔗 Schbirid hm, i think i broke something in my item with s3
15:58 🔗 Schbirid new uploads were not added to the filelist
16:06 🔗 Schbirid maybe i should have used that bucket size hint
16:26 🔗 DoubleJ chronomex: So you're saying that any user I have with an unable-to-connect error needs to have that part of their account re-done. Crap.
16:27 🔗 DoubleJ Though I guess that explains why sometimes losing the internet causes dld-client to bomb and sometimes not. If it does it during --mirror things can recover before wget finishes.
16:28 🔗 DoubleJ Though if it results in incomplete downloads it should really be a fatal error.
16:30 🔗 yipdw DoubleJ: wget does do retries, but only up to a point
16:31 🔗 yipdw default number is 20 retries
16:31 🔗 yipdw as far as I can tell, the mobileme-grab scripts don't change that behavior
16:32 🔗 DoubleJ Hmm... I think the 5.5 hours I lost between midnight and waking up may account for more than 20 tries :)
16:32 🔗 yipdw well
16:32 🔗 yipdw hm
16:33 🔗 yipdw yeah, probably
16:33 🔗 yipdw I don't think wget uses exponential backoff
16:33 🔗 yipdw yeah it doesn't
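A minimal sketch of the wget knobs involved, assuming stock GNU wget (per yipdw, the mobileme-grab scripts leave these at their defaults):

    # Allow more attempts than the default 20 and wait between retries.
    # --waitretry uses linear backoff (1s, 2s, ...) capped at the given value,
    # not exponential backoff.
    wget --mirror --tries=50 --waitretry=60 "$URL"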
16:33 🔗 DoubleJ Lemme try to find how many lines have the connect error. If it's 19 I'll be happy.
16:34 🔗 yipdw that said, if there was such an error, that should cause dld-single to exit with an error
16:35 🔗 yipdw so you shouldn't have multiple users with holes, unless you're running dld-single in an unpredicated loop or something
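A sketch of the distinction yipdw is drawing; the exact dld-single invocation is a placeholder, and the only assumption is that it exits non-zero when a download fails:

    # Predicated: the loop stops as soon as dld-single exits non-zero.
    while ./dld-single yournick; do :; done

    # Unpredicated: the loop ignores the exit status and keeps claiming users,
    # which is how several users with holes can pile up.
    while true; do ./dld-single yournick; done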
16:36 🔗 DoubleJ grep -i "unable " wget.log | wc -l
16:36 🔗 DoubleJ 307
16:36 🔗 DoubleJ D'oh.
16:37 🔗 DoubleJ I'm running dld-client.
16:37 🔗 DoubleJ Doesn't that just call dld-single, though?
16:37 🔗 yipdw yes
16:37 🔗 yipdw and it should exit on error
16:37 🔗 yipdw well, once a user's download finishes
16:38 🔗 DoubleJ Hm. So, either it didn't, or it's happy to try until the end of time.
16:38 🔗 yipdw point is that the number of users with holes should normally have the maximum number of simultaneous dld processes as its upper bound
16:38 🔗 DoubleJ Also: The actual error is "unable to resolve host", not "unable to connect" like I said before.
16:38 🔗 yipdw (also, try seesaw)
16:38 🔗 DoubleJ In case that changes anything.
16:39 🔗 DoubleJ Well, yes. Except that this has happened multiple times since I started downloading MobileMe. I think my router isn't happy with being asked to do actual work.
16:39 🔗 yipdw that should still generate an error exit
16:39 🔗 yipdw yeah, mine isn't either
16:39 🔗 yipdw I just use off-hours connection time at work
16:39 🔗 yipdw they have slightly more robust networking equipment
16:40 🔗 DoubleJ This would be less of a headache if the current user was able to write HTML and I hadn't been grabbing an endless series of 404s since Friday.
16:40 🔗 DoubleJ I can't just shrug it off, in case somewhere in that pile was a link he actually managed to code correctly.
16:40 🔗 yipdw ?
16:41 🔗 DoubleJ The user who set this question off.
16:41 🔗 DoubleJ wget --mirror has been working since Friday morning.
16:41 🔗 yipdw which one is that?
16:41 🔗 DoubleJ web.me.com/snoozeman
16:42 🔗 DoubleJ Not downloading the internet as far as I can tell.
16:42 🔗 DoubleJ Just a very long parade of bad links within his own account.
16:42 🔗 DoubleJ But like I said, somewhere in that 5.5 hours may have been a good link.
16:42 🔗 DoubleJ And spending another 4 days on the guy is going to be annoying.
16:49 🔗 yipdw ha, it's literally a family history
16:54 🔗 DoubleJ Oh christ.
16:54 🔗 DoubleJ If he goes back to the Vikings I'm screwed.
16:54 🔗 yipdw more specifically, it's a large set of pages generated via Reunion
16:55 🔗 yipdw I guess that user just didn't upload / deleted a bunch of them
16:55 🔗 DoubleJ Now, the wget log just says "unable to resolve host web.me.com". Is there any way of finding out what URL it was trying?
16:55 🔗 Schbirid i use ngrep for this kind of stuff
16:55 🔗 DoubleJ I think he renamed a directory. My URL list has "Ray's Genealogy" but the 404s are just "Genealogy".
16:56 🔗 DoubleJ Schbirid: 'splain?
16:56 🔗 Schbirid if you mean realtime checking of http traffic etc
16:56 🔗 DoubleJ Oh, not realtime. This happened while I was asleep.
16:56 🔗 Schbirid ah no idea then :(
16:56 🔗 SketchCow GREETINGS FROM THE PRESERVING VIRTUAL WORLDS 2 ADVISORY BOARD MEETING
16:56 🔗 SketchCow <---in Stanford
16:56 🔗 SketchCow Also awesome
16:57 🔗 yipdw I went to Stanford once
16:57 🔗 yipdw nice campus
16:57 🔗 yipdw DoubleJ: you can try to correlate it with the URL list, but other than that I'm not sure
16:57 🔗 yipdw probably better to just retry the download, or move on
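One cheap thing to try first, assuming wget logged the URL it was fetching on the lines just before the failure (its usual behaviour, though not guaranteed here):

    # Print a few lines of context before each resolver error;
    # the attempted URL is often in that context.
    grep -B 3 "unable to resolve host" wget.log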
16:57 🔗 yipdw I'm downloading snoozeman now
16:57 🔗 DoubleJ I think I'll just do that.
16:58 🔗 DoubleJ And thanks. Hopefully he takes less time for you. And no router problems.
16:59 🔗 DoubleJ I'll have to go through my other logs and see who else has that error.
16:59 🔗 DoubleJ What's a good way to find all files that have a particular string in them?
16:59 🔗 yipdw grep -R
16:59 🔗 yipdw ack also works if you have that installed
17:00 🔗 DoubleJ So in mobileme-grab/data I say grep -R "unable " and it spits out a file list?
17:00 🔗 yipdw yes, but you probably will want to scope it to just logs
17:00 🔗 DoubleJ All right, I'll give that a try after lunch. Thanks.
17:01 🔗 yipdw actually
17:02 🔗 yipdw if you just want a list, grep -lR
17:02 🔗 yipdw there may be better tools, I'm not sure
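A sketch of the scoped version yipdw suggests, assuming the per-user wget logs are named wget.log somewhere under mobileme-grab/data (the exact layout is an assumption):

    # List only the files named wget.log that contain the error,
    # searching recursively under the data directory.
    grep -lR --include='wget.log' "unable to resolve host" mobileme-grab/data/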
17:02 🔗 yipdw so
17:02 🔗 yipdw i think the maximum number of 404s I can get on snoozeman is 17,982
17:02 🔗 yipdw awesome
17:03 🔗 yipdw oh hey, it's this guy
17:03 🔗 yipdw http://snoozemanscruiseblog.blogspot.com/
17:04 🔗 yipdw holy shit, he's done 69 cruises
17:04 🔗 yipdw wtf
17:05 🔗 yipdw considering that cruises generally run at least a couple grand per, that's kind of
17:05 🔗 yipdw conspicuous
17:07 🔗 Soojin maybe he hides in the cargo area
17:11 🔗 SketchCow Randall Schwartz has done many cruises.
17:11 🔗 SketchCow Mostly by being the guy co-running the tour groups and the photography.
17:11 🔗 SketchCow He is on the Joco Cruise Crazy 2 as we speak.
17:13 🔗 godane can you guys take like playstation demo discs?
17:14 🔗 SketchCow I'll gladly take them.
17:14 🔗 SketchCow I can suggest a better place.
17:14 🔗 godane i know of sites that have them too
17:14 🔗 * SketchCow is sitting here next to two guys who run game archives.
17:14 🔗 SketchCow JP Dyson of the Strong Museum of Play and ICHEG
17:15 🔗 SketchCow And Henry Lowood of the Stanford Software Archives
17:15 🔗 SketchCow Yes, we're all in the same room so this room is awesome now
17:15 🔗 godane http://www.emuparadise.me/Sony_Playstation_-_Demos_ISOs/25
17:15 🔗 godane that one rom site has 100s of demos
17:16 🔗 godane i also found a torrent of 15 years worth of scientific american
17:17 🔗 SketchCow Yeah, I snagged that
17:17 🔗 SketchCow The public domain ones are now up on archive.org.
17:17 🔗 SketchCow http://www.archive.org/details/scientific-american-1845-1909
17:18 🔗 SketchCow Oh, to explain cruiseman as another thing.
17:18 🔗 godane i think you are pushing me to buy blu-ray
17:18 🔗 godane only for archive reasons
17:19 🔗 SketchCow If you can organize cruise groups (and plane groups too) and you hit the mark of, like, 10 or 15 people, I forget, you get to go along for free.
17:19 🔗 yipdw oh, hm
17:19 🔗 SketchCow So assume snoozeman organizes cruises
17:20 🔗 godane SketchCow: can you move this to software: http://www.archive.org/details/cdrom-linux-format-73
17:23 🔗 SketchCow You did it!
17:24 🔗 godane thank you
17:25 🔗 DFJustin somebody else pointed out that there is other stuff sitting in community texts http://www.archive.org/search.php?query=subject%3A%22cdbbsarchive%22
17:26 🔗 SketchCow I wish you had made the two CDs separate, but you didn't know and I need to think about it regardless.
17:26 🔗 SketchCow http://www.archive.org/details/cdrom-linux-format-73&reCache=1
17:32 🔗 godane sorry about that
17:35 🔗 godane i found 3 years of pc advisor
17:36 🔗 godane there was a cd that came in a magazine in 2005
17:36 🔗 balrog SketchCow: this might be worth archiving: http://adfly.simplaza.net/
17:36 🔗 balrog (as part of the link shortener project)
17:36 🔗 balrog the cache contains most of adf.ly and similar links
17:40 🔗 SketchCow Thanks for the tip, DFJustin - I just yanked all those into the CD archive.
17:40 🔗 SketchCow They're all italian CDs!
17:41 🔗 SketchCow godane: Any chance of scanning the fronts of the DVDs/CDs?
17:44 🔗 godane maybe
17:58 🔗 ersi Hm, the Wayback Machine is a separate front end that isn't coupled to Heritrix - right?
17:59 🔗 ersi Or, I mean - Heritrix seems to "only" be the crawler
18:00 🔗 SketchCow Before you stamped off, ersi - the umich project turned out to be something alard could handle individually.
18:01 🔗 ersi Yeah, I figured :-)
18:03 🔗 ersi there we go, found the open-source wayback project
18:03 🔗 soultcer As far as I understood it the open source project is different from the wayback machine they use at archive.org?
18:04 🔗 soultcer Or did that change with the new wayback interface?
18:04 🔗 SketchCow Supposed to be.
18:04 🔗 SketchCow They've started to use our wget-warc now. :)
18:04 🔗 SketchCow In some aspects.
18:04 🔗 ersi According to http://archive-access.sourceforge.net/projects/wayback/ the current implementation is Perl, this one is Java
18:05 🔗 soultcer I want to run a crawl of websites from my home country using Heritrix, just to see what I will find
18:06 🔗 ersi And how would you crawl a country?
18:06 🔗 soultcer Limit it to the country-specific top level domain and domains that resolve to IPs that some geoip db thinks are from that country
18:06 🔗 ersi alright
18:08 🔗 yipdw soultcer: one tricky bit: the mapping from IP to domain is one-to-many
18:08 🔗 ersi I'd say going for the ccTLD is probably the most viable option, if that's what you're choosing from
18:08 🔗 yipdw actually it's many-to-many but the other direction isn't so important
18:08 🔗 soultcer yipdw: Don't worry, I've got that part already covered.
18:09 🔗 yipdw ok
18:14 🔗 Schbirid is "zgrep -h something manyfiles*" not working for anyone else too? it should not list the filenames but for me it does anyways.
19:02 🔗 ersi Hmmm, I'm considering building a spider/crawler
19:03 🔗 tef ersi: what in
19:04 🔗 ersi asking in what language?
19:04 🔗 tef yes
19:05 🔗 ersi I was thinking Python
19:05 🔗 tef ah
19:05 🔗 tef https://github.com/tef/crawler this might help
19:05 🔗 ersi What I really want is to map a domain/subdomain
19:05 🔗 tef although the warcs are terrible
19:05 🔗 ersi checking it
19:06 🔗 ersi ooh, there's an HTMLParser? I have only fiddled a little with SGMLParser
19:08 🔗 ersi actually, yeah I'll check this out ^_^
19:17 🔗 ersi tef: Hm, I pip installed hanzo-warc-tools - but the crawler still can't find it - suggestions? 'requests' installed fine via pip :o
19:17 🔗 soultcer ersi: Map a domain/subdomain? You mean crawl the whole page and get a warc as output?
19:17 🔗 tef ersi python -m hanzo.warctools should work
19:18 🔗 ersi No, I'd just like to map all the URLs
19:18 🔗 soultcer ?
19:18 🔗 ersi "Find ALL THE URLS!"
19:19 🔗 soultcer You must have a lot of bandwidth and storage space ;-)
19:19 🔗 ersi I would only save URLs corresponding to my target domain
19:19 🔗 ersi note, Resource Locations. Not the *resources*
19:20 🔗 soultcer When we started on urlteam stuff we thought "me, maybe a couple hundred megabytes of shorturls"
19:20 🔗 ersi tef: Hmmm, I'm not sure what I'm doing wrong here
19:20 🔗 soultcer bit.ly alone has probably over a TB of urls...
19:20 🔗 tef ersi: I don't know pip that well :/
19:21 🔗 ersi tef: How'd you install hanzo-warc-tools? If you were to do it
19:21 🔗 tef oh I use the hg repo and set PYTHONPATH, but I am like that
19:21 🔗 ersi oh, heh
19:21 🔗 tef you could just hg clone from code.hanzoarchives.com
19:22 🔗 tef but I tend to push to the repo so
19:25 🔗 ersi hmmmm, I added my hanzo clone to PYTHONPATH but it's not taking it up for some reason
19:27 🔗 ersi augh, python packaging :(
19:29 🔗 ersi I'll go poke around in crawler.py instead :]
19:31 🔗 soultcer ersi: I still don't get the urls idea? what use would it be to only know the urls?
19:33 🔗 ersi well, you could for example break the downloading of the target down into smaller chunks and distribute the work
19:34 🔗 soultcer Hm, I guess
19:34 🔗 tef ersi: as in export PYTHONPATH='..'
19:35 🔗 tef ersi: maybe you're in a virtualenv
19:35 🔗 ersi Yeah, but I did it with an absolute location
19:36 🔗 ersi export PYTHONPATH=/home/ersi/libpython/ where hanzo/warc*/ is located
19:36 🔗 tef ah you need the repo dir in
19:36 🔗 tef not the dir containing the repo
19:36 🔗 tef inside the repo is the package
19:37 🔗 ersi Oh, heh
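Putting tef's hints together, a sketch of the setup; the clone URL path and target directory are assumptions (tef only says the repo lives at code.hanzoarchives.com):

    # Clone the Mercurial repo and put the repo directory itself
    # (the one containing the hanzo/ package) on PYTHONPATH.
    hg clone https://code.hanzoarchives.com/warc-tools ~/src/warc-tools
    export PYTHONPATH=~/src/warc-tools
    # Quick import check, as tef suggests:
    python -m hanzo.warctools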
20:02 🔗 ersi tef: Have you run the crawler against any sites?
20:03 🔗 tef a few but nothing heavy - the warc output is ugly and needs to be cleaned up
20:03 🔗 tef the htmlparser should pick out a few more types of links at least
20:05 🔗 ersi oh, found the issue I was getting. you're using the variable 'url' in the HTMLException on line 100 instead of response.url :)
20:05 🔗 ersi tef: ^
20:08 🔗 tef whoops
20:08 🔗 tef I may not have pushed everything
20:08 🔗 tef welp
20:09 🔗 ersi Hm, wonder what I'm throwing at it that's making it not able to extract the links :o
20:13 🔗 tef hmmm
20:13 🔗 ersi haha, my friend's stupid webserver closes the connection prematurely
20:14 🔗 ersi might be why
20:14 🔗 soultcer Welcome to webcrawler 101
20:14 🔗 soultcer Lesson 1: Web crawling is like running the most complete fuzz test ever against your program
20:14 🔗 ersi How suitable
20:15 🔗 ersi I'm a software tester at day
20:15 🔗 tef soultcer: yes yes yes
20:16 🔗 tef constantly changing fuzzer
20:16 🔗 ersi quirky quicky quirky~
20:19 🔗 tef also if any of you are REST types i'd appreciate any comments https://github.com/tef/hate-rpc
20:19 🔗 tef but I am going to bed because i have been up for 28 hours.
20:20 🔗 ersi websites for robots? woot
20:28 🔗 kennethre tef: https://github.com/kennethreitz/convore.json
20:33 🔗 tef I saw. Nice :D
20:38 🔗 kennethre tef: i'm working on making it browsable html
21:41 🔗 ersi argh, I feel so silly using Requests
21:42 🔗 ersi or I'm getting used to it, rather :D
21:43 🔗 ersi Oh, I'm so silly. :| Don't do requestR.text() instead of requestR.text >_>
21:51 🔗 kennethre ersi: :(
21:51 🔗 kennethre ersi: ah, gotcha. What do you think so far?
21:52 🔗 ersi It's too easy!
21:52 🔗 kennethre <3
21:52 🔗 ersi It feels awesome, feels pythonic
21:53 🔗 kennethre perfect
