#archiveteam 2011-07-08,Fri

Time Nickname Message
03:22 🔗 swebb1 FYI, I've started downloading the google groups stuff using the IPv6 hack that alard and ndurner have put together, so you've got one more guy leeching at light-speed as of today. :)
03:23 🔗 swebb1 BTW, guys. I'm *mega* impressed by the ingenuity of the tools you've built! Nice-going!
03:24 🔗 underscor I'm waiting for some company to approach us and be like "Hey, how should we backup/archive this shit?"
03:25 🔗 dashcloud which google groups stuff? the old usenet stuff or something else?
03:45 🔗 soultcer I think we are overloading ndurner's app :-/
04:34 🔗 underscor Yay, finished the now playing/song request system for the dance tomorrow!
04:34 🔗 underscor http://lc.k-srv.info/
06:10 🔗 SketchCow Hey, gang.
06:10 🔗 SketchCow In Sydney until tomorrow morning, which would be 12-13 hours from now.
09:09 🔗 Spirit_ interesting "censoring" in here: http://www.time.com/robots.txt
09:09 🔗 Spirit_ they missed http://www.time.com/time/magazine/article/0,9171,46162,00.html though
09:10 🔗 Spirit_ weird
09:12 🔗 SketchCow that smacks of random policy
09:16 🔗 Spirit_ hm, can i add my repo to the archiveteam github even after i created it independently?
09:17 🔗 Spirit_ no, "If you intend to push a copy of a repository that is already hosted on GitHub, please fork it instead."
09:17 🔗 Spirit_ meh
09:48 🔗 alard Spirit_: you can, I believe, move a repo to the archiveteam team.
09:49 🔗 alard repo > Admin > Danger Zone > Transfer ownership
09:52 🔗 Spirit_ Repo moved to ArchiveTeam/robots-relapse
09:52 🔗 Spirit_ thanks!
09:54 🔗 Spirit_ now we just need to connect that with https://github.com/organizations/ArchiveTeam/teams/69519 somehow
09:56 🔗 alard ehm, Admin > Teams?
09:58 🔗 Spirit_ no such button for me
10:01 🔗 alard Strange. Can you see this: https://github.com/ArchiveTeam/robots-relapse/admin/collaboration ?
10:01 🔗 Spirit_ nope, getting served a 404 page
10:02 🔗 alard That's funny.
10:02 🔗 alard Shall I add it to the team?
10:02 🔗 alard (Already did it.)
10:14 🔗 Spirit_ ah, now i have the admin button
10:15 🔗 Spirit_ cheers
16:06 🔗 Qwerty0 SketchCow: I finally finished rsyncing! Before I delete it all, does it look ok over there? It was to gv_28@flophouse.textfiles.com::gv_28/
16:38 🔗 alard Hey all: the warc extension for wget is more or less ready for testing.
16:38 🔗 alard I'm running out of ideas of things to add.
16:38 🔗 alard So for those of you with a bit of spare time, I'd really appreciate it if you could have a look.
16:38 🔗 alard Try to compile and run it, see if you can break it, and if you can think of any additional features that might be useful.
16:38 🔗 alard Download: https://github.com/downloads/alard/wget-warc/wget-warc-20110708.tar.bz2
16:38 🔗 alard More info: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
16:38 🔗 alard Thanks!
16:48 🔗 swebb1 Where did you get info on the WARC format?
16:48 🔗 swebb1 My personal crawler stores the response headers, but just in a separate file.
16:49 🔗 swebb1 NM. I just read your docs and found the warc-tools thing that you're using. :)
17:03 🔗 closure damnit.. I was planning to archive the entire countdown of the last shuttle launch, but I forgot to start the recorder
17:06 🔗 Qwerty0 haha, so a bunch of pages saying "1:32", "1:31", "1:30" etc?
17:07 🔗 closure more like 6 hours of color commentary with engineers etc who are now packing up after their last day
17:09 🔗 Qwerty0 ohhh duh, right. for some reason i saw "countdown" and just imagined a nasa page with the countdown clock
17:09 🔗 Qwerty0 which would be kinda funny to archive..
17:11 🔗 closure it also seemed very unlikely it would launch today.. I'll bet it was a hell of a countdown
17:15 🔗 alard swebb1: There's also the ISO 28500 standard. (Or the draft version of it, which is what I used.)
17:18 🔗 swebb1 alard: is there a c library for writing to arc or warc files somewhere?
17:18 🔗 swebb1 my crawler is in c, and I'd like to keep most of it in c if possible.
17:20 🔗 swebb1 also, does arc/warc require the request headers too?
18:27 🔗 alard swebb1: warc-tools has a c library. http://code.google.com/p/warc-tools/ (it's not in the new implementation.)
18:27 🔗 alard You can store request headers in WARC, but it's not a strict requirement, more a recommendation.
18:28 🔗 alard The iso standard document ( http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf ) and the guidelines ( http://netpreserve.org/publications/WARC_Guidelines_v1.pdf ) are worth a read.
18:29 🔗 alard And arc is old, don't use that. :)
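A minimal sketch of what alard describes above, assuming the WARC/1.0 record layout from the ISO 28500 draft: a version line, named header fields, a blank line, the content block, then two CRLFs. The function name `warc_record` is hypothetical and is not part of wget-warc or warc-tools. Request headers go in an optional `request` record written alongside the `response` record.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, http_bytes):
    """Serialize one WARC/1.0 record holding raw HTTP bytes.

    record_type is "request" or "response"; http_bytes is the HTTP
    message exactly as sent or received (headers plus body).
    This is an illustrative helper, not an API of any WARC library.
    """
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=%s" % record_type),
        ("Content-Length", str(len(http_bytes))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join("%s: %s\r\n" % pair for pair in headers)
    head += "\r\n"  # blank line separates WARC headers from the block
    # Every record ends with two CRLFs after the content block.
    return head.encode("utf-8") + http_bytes + b"\r\n\r\n"


if __name__ == "__main__":
    request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
    response = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
    with open("example.warc", "wb") as out:
        # The request record is recommended, not required.
        out.write(warc_record("request", "http://example.com/", request))
        out.write(warc_record("response", "http://example.com/", response))
```

A crawler that already keeps response headers in a separate file could feed the stored headers and body into one `response` record instead.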
20:05 🔗 ndurner Is there a way to archive youtube video annotations (perhaps as subtitles)?
20:49 🔗 Coderjoe_ ndurner: got a video ID you want annotations for?
20:49 🔗 ndurner Coderjoe_: 6F0Qb1p0ukw
20:50 🔗 Coderjoe_ http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=VIDEOIDHERE
20:51 🔗 Coderjoe_ that will give you an xml file with the annotations
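A small sketch of the request Coderjoe_ describes, assuming the endpoint and the `feat=TCS` parameter exactly as pasted above. The function names are hypothetical, and the endpoint may no longer respond, so the fetch is left as an optional step.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as quoted in the channel; its behaviour is not documented here.
ANNOTATIONS_ENDPOINT = "http://www.youtube.com/api/reviews/y/read2"


def annotations_url(video_id):
    """Build the annotations URL for one video ID."""
    query = urlencode({"feat": "TCS", "video_id": video_id})
    return "%s?%s" % (ANNOTATIONS_ENDPOINT, query)


def fetch_annotations(video_id, timeout=30):
    """Download the annotation XML as raw bytes (requires network access)."""
    with urlopen(annotations_url(video_id), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    print(annotations_url("6F0Qb1p0ukw"))
```

Saving the raw bytes unchanged keeps the original XML available for any later conversion to subtitles.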
20:51 🔗 soultcer ndurner: How is the google groups thingie doing? I am getting a lot of http 501 responses
20:52 🔗 emijrp YOU HAVE BEEN BANNED FROM THE INTERNET.
20:52 🔗 ndurner Coderjoe_: thanks!
20:53 🔗 ndurner soultcer: it's hitting the quota... which is good
20:53 🔗 Coderjoe I pulled that from the firebug net window from another video that youtube had on their blog announcement for annotations and your ID allowed me to test if it worked without having to even visit the video. it did
20:54 🔗 ndurner soultcer: do you have sqlite3 on your server(s) so the script could dedup uploads to app engine?
20:54 🔗 Coderjoe I don't know what the TCS feature enables just yet
20:55 🔗 Coderjoe so far, it just changes one node that gives info about the request, as far as I can tell
20:56 🔗 ndurner the XML might be transformable to real subtitles you can use with VLC
20:56 🔗 soultcer ndurner: Do you mean the command-line version of sqlite3? I can install it if you need it
20:56 🔗 ndurner yes
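One way the dedup idea could look, as a sketch only: ndurner's script had not implemented it at this point, and the conversation is about the sqlite3 command-line tool, whereas this uses Python's built-in `sqlite3` module instead. The table name `seen` and both function names are assumptions. A primary key plus `INSERT OR IGNORE` lets the database decide whether an item was already recorded.

```python
import sqlite3


def open_seen_db(path=":memory:"):
    """Open (or create) the local database of items already uploaded."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (item TEXT PRIMARY KEY)")
    return conn


def mark_if_new(conn, item):
    """Record item and return True only the first time it is seen."""
    cur = conn.execute("INSERT OR IGNORE INTO seen (item) VALUES (?)", (item,))
    conn.commit()
    # rowcount is 1 when a row was inserted, 0 when the key already existed.
    return cur.rowcount == 1


if __name__ == "__main__":
    db = open_seen_db()
    for name in ["group-a", "group-b", "group-a"]:
        if mark_if_new(db, name):
            print("upload", name)  # placeholder for the real upload call
        else:
            print("skip", name)
```

Passing a file path instead of `:memory:` keeps the list of seen items across runs.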
20:56 🔗 swebb1 Mine isn't getting 501's.
20:57 🔗 ndurner soultcer is our directory discoverer :-)
20:57 🔗 swebb1 OH, ahh.
20:57 🔗 soultcer Putting all those vServers that usually rape url shorteners to some good use ;-)
21:03 🔗 * ndurner has just created an account on github
21:03 🔗 ndurner How do I join the Archiveteam there?
21:04 🔗 soultcer lemme invite you
21:06 🔗 swebb1 Invite me too. Username: scumola
21:06 🔗 ndurner thanks
21:07 🔗 soultcer I think the SOP is to create a new "team" inside the organization for the project
21:07 🔗 Coderjoe yes
21:08 🔗 Coderjoe and add that team to the repository for that project
21:10 🔗 soultcer Using git for the google groups thing will allow me to directly pull from version control (stupid bazaar changes repository format too often)
21:37 🔗 ndurner soultcer: https://github.com/ArchiveTeam/google-groups-files
21:41 🔗 soultcer It works by using files, not sqlite?
21:43 🔗 ndurner I haven't implemented the idea yet
21:51 🔗 soultcer ndurner: Is HTTP Error 403 also used for rate limiting?
21:51 🔗 ndurner no
21:52 🔗 ndurner Do you have the request?
21:52 🔗 soultcer --2011-07-09 01:50:00-- http://archiveteamorg.appspot.com/donedir2?sel=gtype%3D0%2Cusenet%3Dathena&
22:00 🔗 ndurner soultcer: ah, it's the filter for usenet groups!
22:10 🔗 swebb1 ndurner: instead of telling the clients to sleep for longer periods, why not issue more work for them to do instead? That would be more productive. My workers are mostly sleeping now.
22:11 🔗 swebb1 And issuing more work for them to do will take them longer to report back.
22:12 🔗 ndurner the problem is not the number of requests but a) database access and b) database timeouts
22:13 🔗 ndurner * a) load caused by database access
22:13 🔗 swebb1 So if google is limiting database access, request more items per request.
22:13 🔗 ndurner it's the time spent doing database access
22:14 🔗 ndurner or rather, the cycles burnt doing DB stuff
22:15 🔗 swebb1 Ok. Just a suggestion. Can you index the table somehow (like index the "done or not" field), so when you say "give me 5 that aren't done yet", it'll be fast?
22:16 🔗 Coderjoe the database provided by the app engine is not really an sql database, iirc
22:16 🔗 ndurner It's purely index organized
22:17 🔗 ndurner doing non-indexed queries isn't possible (GAE would throw an exception)
22:19 🔗 swebb1 Ok. I'm not familiar with the inner workings of GAE, so I'll leave the details up to you. Just figured that if it was DB access, changing the 'limit xxx' query at the end to a larger number would keep the number of queries the same, but would keep people's crawlers from sleeping as much as they are.
22:23 🔗 ndurner There are about 5 "lost" directories (but then reissued) a day because the query didn't finish fast enough for Google
22:24 🔗 ndurner so query time also matters
22:24 🔗 Coderjoe the entire HTTP request must finish in under 30 seconds (or GAE kills it)
22:25 🔗 Coderjoe but you're seriously running into that?
22:25 🔗 swebb1 That's one slow database.
22:25 🔗 Coderjoe any chance I can look at it? I don't have any real experience with GAE, but something does not seem right (and perhaps another set of eyes might help find it)
22:26 🔗 swebb1 Is there a queue system in GAE?
22:26 🔗 Coderjoe what do you mean by queue system?
22:26 🔗 swebb1 You could queue work units.
22:26 🔗 swebb1 and pull them off the queue when issued, and put them back on the queue if timeout.
22:26 🔗 swebb1 Standard queue stuff.
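The "standard queue stuff" swebb1 outlines can be sketched as a lease queue: items are handed out with a deadline and go back on the queue if no completion arrives in time. This is an in-memory illustration only, not how the App Engine tracker works; the class name `LeaseQueue` and its parameters are assumptions. The `count` argument also reflects the earlier suggestion of issuing more work per request.

```python
import time
from collections import deque


class LeaseQueue:
    """Hand out work units with a deadline and reissue the ones that expire."""

    def __init__(self, items, lease_seconds=600):
        self.pending = deque(items)   # work not currently issued
        self.leased = {}              # item -> lease deadline (epoch seconds)
        self.lease_seconds = lease_seconds

    def lease(self, count, now=None):
        """Issue up to count items; expired leases are requeued first."""
        now = time.time() if now is None else now
        self.requeue_expired(now)
        batch = []
        while self.pending and len(batch) < count:
            item = self.pending.popleft()
            self.leased[item] = now + self.lease_seconds
            batch.append(item)
        return batch

    def complete(self, item):
        """Mark an issued item as done so it is never reissued."""
        self.leased.pop(item, None)

    def requeue_expired(self, now):
        """Put items whose lease has run out back on the pending queue."""
        expired = [i for i, deadline in self.leased.items() if deadline <= now]
        for item in expired:
            del self.leased[item]
            self.pending.append(item)


if __name__ == "__main__":
    queue = LeaseQueue(["dir-1", "dir-2", "dir-3"], lease_seconds=10)
    print(queue.lease(2, now=0))
    queue.complete("dir-1")
    print(queue.lease(5, now=20))
```

On a platform where everything runs inside an HTTP request, the expiry check has to happen lazily inside `lease`, as it does here, rather than in a background job.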
22:34 🔗 swebb1 http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/ :)
22:34 🔗 Coderjoe not inherently. you can do it using another database table, though.
22:35 🔗 Coderjoe GAE has nothing that runs in the background. everything in your app has to happen as the result of an HTTP request.
22:40 🔗 ndurner Coderjoe: did you get the link?
22:41 🔗 Coderjoe just now
22:41 🔗 Coderjoe thanks
22:42 🔗 Coderjoe I wonder if using java instead of python incurs any speed penalty on request startup. (I haven't looked into using java on gae)
22:46 🔗 ndurner Coderjoe: is there a specification/explanation for the youtube annotation XML format somewhere?
22:46 🔗 Coderjoe i don't know
22:46 🔗 Coderjoe hmm
22:46 🔗 Coderjoe http://www.google.com/search?q=youtube+annotation+xml+format
22:47 🔗 Coderjoe my top two results are about converting or copying them
22:47 🔗 Coderjoe top one about converting to srt
22:52 🔗 ndurner The script is not too sophisticated, not much you can gather from it. The best thing I have found so far is http://code.google.com/p/plugin-blocker/source/browse/trunk/annotations.js?spec=svn30&r=30
22:53 🔗 Coderjoe the format looks fairly straightforward to me
22:55 🔗 ndurner It is, I just want to explore the possibilities
