[03:22] FYI, I've started downloading the google groups stuff using the IPv6 hack that alard and ndurner have put together, so you've got one more guy leeching at light-speed as of today. :)
[03:23] BTW, guys. I'm *mega* impressed by the ingenuity of the tools you've built! Nice going!
[03:24] I'm waiting for some company to approach us and be like "Hey, how should we backup/archive this shit?"
[03:25] which google groups stuff? the old usenet stuff or something else?
[03:45] I think we are overloading ndurner's app :-/
[04:34] Yay, finished the now playing/song request system for the dance tomorrow!
[04:34] http://lc.k-srv.info/
[06:10] Hey, gang.
[06:10] In Sydney until tomorrow morning, which would be 12-13 hours from now.
[09:09] interesting "censoring" in here: http://www.time.com/robots.txt
[09:09] they missed http://www.time.com/time/magazine/article/0,9171,46162,00.html though
[09:10] weird
[09:12] that smacks of random policy
[09:16] hm, can i add my repo to the archiveteam github even after i created it independently?
[09:17] no, "If you intend to push a copy of a repository that is already hosted on GitHub, please fork it instead."
[09:17] meh
[09:48] Spirit_: you can, I believe, move a repo to the archiveteam team.
[09:49] repo > Admin > Danger Zone > Transfer ownership
[09:52] Repo moved to ArchiveTeam/robots-relapse
[09:52] thanks!
[09:54] now we just need to connect that with https://github.com/organizations/ArchiveTeam/teams/69519 somehow
[09:56] ehm, Admin > Teams?
[09:58] no such button for me
[10:01] Strange. Can you see this: https://github.com/ArchiveTeam/robots-relapse/admin/collaboration ?
[10:01] nope, getting served a 404 page
[10:02] That's funny.
[10:02] Shall I add it to the team?
[10:02] (Already did it.)
[10:14] ah, now i have the admin button
[10:15] cheers
[16:06] SketchCow: I finally finished rsyncing! Before I delete it all, does it look ok over there?
It was to gv_28@flophouse.textfiles.com::gv_28/
[16:38] Hey all: the warc extension for wget is more or less ready for testing.
[16:38] I'm running out of ideas of things to add.
[16:38] So for those of you with a bit of spare time, I'd really appreciate it if you could have a look.
[16:38] Try to compile and run it, see if you can break it, and see if you can think of any additional features that might be useful.
[16:38] Download: https://github.com/downloads/alard/wget-warc/wget-warc-20110708.tar.bz2
[16:38] More info: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[16:38] Thanks!
[16:48] Where did you get info on the WARC format?
[16:48] My personal crawler stores the response headers, but just in a separate file.
[16:49] NM. I just read your docs and found the warc-tools thing that you're using. :)
[17:03] damnit.. I was planning to archive the entire countdown of the last shuttle launch, but I forgot to start the recorder
[17:06] haha, so a bunch of pages saying "1:32", "1:31", "1:30" etc?
[17:07] more like 6 hours of color commentary with engineers etc who are now packing up after their last day
[17:09] ohhh duh, right. for some reason i saw "countdown" and just imagined a nasa page with the countdown clock
[17:09] which would be kinda funny to archive..
[17:11] it also seemed very unlikely it would launch today.. I'll bet it was a hell of a countdown
[17:15] swebb1: There's also the ISO 28500 standard. (Or the draft version of it, which is what I used.)
[17:18] alard: is there a c library for writing to arc or warc files somewhere?
[17:18] my crawler is in c, and I'd like to keep most of it in c if possible.
[17:20] also, does arc/warc require the request headers too?
[18:27] swebb1: warc-tools has a c library. http://code.google.com/p/warc-tools/ (it's not in the new implementation.)
[18:27] You can store request headers in WARC, but it's not a strict requirement, more a recommendation.
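For the curious, here is roughly what a single response record in a WARC file looks like on disk, going by the draft ISO 28500 spec mentioned in this discussion. This is an illustrative sketch, not wget-warc's actual code; the target URI and the HTTP payload are made-up examples:

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri, http_bytes):
    """Build one WARC 1.0 response record (draft ISO 28500 layout)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    head = "WARC/1.0\r\n" + "".join("%s: %s\r\n" % h for h in headers)
    # Record = header block, blank line, payload, then a blank-line separator.
    return head.encode("utf-8") + b"\r\n" + http_bytes + b"\r\n\r\n"

# The payload of a response record is the full HTTP response,
# status line and headers included (which answers the "request
# headers too?" question: requests go in a separate record).
http = b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello"
record = warc_response_record("http://example.com/", http)
```

Records are simply concatenated (optionally gzip-per-record) to form the .warc file.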
[18:28] The iso standard document ( http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf ) and the guidelines ( http://netpreserve.org/publications/WARC_Guidelines_v1.pdf ) are worth a read.
[18:29] And arc is old, don't use that. :)
[20:05] Is there a way to archive youtube video annotations (perhaps as subtitles)?
[20:49] ndurner: got a video ID you want annotations for?
[20:49] Coderjoe_: 6F0Qb1p0ukw
[20:50] http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=VIDEOIDHERE
[20:51] that will give you an xml file with the annotations
[20:51] ndurner: How is the google groups thingie doing? I am getting a lot of http 501 responses
[20:52] YOU HAVE BEEN BANNED FROM THE INTERNET.
[20:52] Coderjoe_: thanks!
[20:53] soultcer: it's hitting the quota... which is good
[20:53] I pulled that from the firebug net window from another video that youtube had on their blog announcement for annotations, and your ID allowed me to test whether it worked without having to even visit the video. it did
[20:54] soultcer: do you have sqlite3 on your server(s) so the script could dedup uploads to app engine?
[20:54] I don't know what the TCS feature enables just yet
[20:55] so far, it just changes one node that gives info about the request, as far as I can tell
[20:56] the XML might be transformable to real subtitles you can use with VLC
[20:56] ndurner: Do you mean the command-line version of sqlite3? I can install it if you need it
[20:56] yes
[20:56] Mine isn't getting 501's.
[20:57] soultcer is our directory discoverer :-)
[20:57] OH, ahh.
[20:57] Putting all those vServers that usually rape url shorteners to some good use ;-)
[21:03] * ndurner has just created an account on github
[21:03] How do I join the Archiveteam there?
[21:04] lemme invite you
[21:06] Invite me too.
Username: scumola
[21:06] thanks
[21:07] I think the SOP is to create a new "team" inside the organization for the project
[21:07] yes
[21:08] and add that team to the repository for that project
[21:10] Using git for the google groups thing will allow me to pull directly from version control (stupid bazaar changes repository format too often)
[21:37] soultcer: https://github.com/ArchiveTeam/google-groups-files
[21:41] It works by using files, not sqlite?
[21:43] I haven't implemented the idea yet
[21:51] ndurner: Is HTTP Error 403 also used for rate limiting?
[21:51] no
[21:52] Do you have the request?
[21:52] --2011-07-09 01:50:00-- http://archiveteamorg.appspot.com/donedir2?sel=gtype%3D0%2Cusenet%3Dathena&
[22:00] soultcer: ah, it's the filter for usenet groups!
[22:10] ndurner: instead of telling the clients to sleep for longer periods, why not issue more work for them to do instead? That would be more productive. My workers are mostly sleeping now.
[22:11] And issuing more work for them to do will take them longer to report back.
[22:12] the problem is not the number of requests but a) database access and b) database timeouts
[22:13] * a) load caused by database access
[22:13] So if google is limiting database access, request more items per request.
[22:13] it's the time spent doing database access
[22:14] or rather, the cycles burnt doing DB stuff
[22:15] Ok. Just a suggestion. Can you index the table somehow (like index the "done or not" field), so when you say "give me 5 that aren't done yet" it'll be fast?
[22:16] the database provided by the app engine is not really an sql database, iirc
[22:16] It's purely index-organized
[22:17] doing non-indexed queries isn't possible (GAE would throw an exception)
[22:19] Ok. I'm not familiar with the inner workings of GAE, so I'll leave the details up to you.
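The "index the done-or-not field" idea being discussed here is easy to sketch outside of GAE. Below is a toy version of the pattern using sqlite3 (which came up earlier for dedup), with a lease column so items that are handed out but never reported back get reissued; the table and item names are invented for illustration:

```python
import sqlite3
import time

LEASE_SECONDS = 300  # reissue items not reported done within this window

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE work (
    item TEXT PRIMARY KEY,
    done INTEGER NOT NULL DEFAULT 0,
    leased_until REAL NOT NULL DEFAULT 0)""")
# Index the "done or not" field so "give me N that aren't done" stays fast.
db.execute("CREATE INDEX idx_todo ON work (done, leased_until)")

def lease(n):
    """Hand out up to n unfinished items whose lease has expired."""
    now = time.time()
    rows = db.execute(
        "SELECT item FROM work WHERE done = 0 AND leased_until < ? LIMIT ?",
        (now, n)).fetchall()
    items = [r[0] for r in rows]
    db.executemany("UPDATE work SET leased_until = ? WHERE item = ?",
                   [(now + LEASE_SECONDS, i) for i in items])
    return items

def finish(item):
    """Mark an item done so it is never issued again."""
    db.execute("UPDATE work SET done = 1 WHERE item = ?", (item,))

db.executemany("INSERT INTO work (item) VALUES (?)",
               [("dir-a",), ("dir-b",), ("dir-c",)])
batch = lease(2)   # e.g. two of the three directories; they won't be reissued yet
finish(batch[0])   # reported done; unfinished leases expire and get reissued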
Just figured that if it was DB access, changing the 'limit xxx' query at the end to a larger number would keep the number of queries the same, but would keep people's crawlers from sleeping as much as they are.
[22:23] There are about 5 "lost" directories a day (though they get reissued) because the query didn't finish fast enough for Google
[22:24] so query time also matters
[22:24] the entire HTTP request must finish in under 30 seconds (or GAE kills it)
[22:25] but you're seriously running into that?
[22:25] That's one slow database.
[22:25] any chance I can look at it? I don't have any real experience with GAE, but something does not seem right (and perhaps another set of eyes might help find it)
[22:26] Is there a queue system in GAE?
[22:26] what do you mean by queue system?
[22:26] You could queue work units.
[22:26] and pull them off the queue when issued, and put them back on the queue on timeout.
[22:26] Standard queue stuff.
[22:34] http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/ :)
[22:34] not inherently. you can do it using another database table, though.
[22:35] GAE has nothing that runs in the background. everything in your app has to happen as the result of an HTTP request.
[22:40] Coderjoe: did you get the link?
[22:41] just now
[22:41] thanks
[22:42] I wonder if using java instead of python incurs any speed penalty on request startup. (I haven't looked into using java on gae)
[22:46] Coderjoe: is there a specification/explanation for the youtube annotation XML format somewhere?
[22:46] i don't know
[22:46] hmm
[22:46] http://www.google.com/search?q=youtube+annotation+xml+format
[22:47] my top two results are about converting or copying them
[22:47] top one about converting to srt
[22:52] The script is not too sophisticated, not much you can gather from it.
The best thing I have found so far is http://code.google.com/p/plugin-blocker/source/browse/trunk/annotations.js?spec=svn30&r=30
[22:53] the format looks fairly straightforward to me
[22:55] It is, I just want to explore the possibilities
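A rough sketch of the "transform the annotation XML to real subtitles" idea floated above. The XML shape assumed here (annotation elements with a TEXT child and rectRegion t= timestamps) is a guess at what the read2 feed returns, so treat the sample and the element names as assumptions, not a spec:

```python
import xml.etree.ElementTree as ET

def ts(t):
    """'0:00:01.500' -> '00:00:01,500' (SRT timestamp)."""
    h, m, s = t.split(":")
    sec, _, frac = s.partition(".")
    return "%02d:%02d:%02d,%s" % (int(h), int(m), int(sec),
                                  (frac or "0").ljust(3, "0")[:3])

def annotations_to_srt(xml_text):
    """Turn each timed text annotation into one numbered SRT cue."""
    out, n = [], 0
    for ann in ET.fromstring(xml_text).iter("annotation"):
        text = ann.findtext("TEXT")
        times = [r.get("t") for r in ann.iter("rectRegion") if r.get("t")]
        if not text or len(times) < 2:
            continue  # skip annotations without text or a start/end time
        n += 1
        out.append("%d\n%s --> %s\n%s\n" % (n, ts(times[0]), ts(times[-1]), text))
    return "\n".join(out)

# Made-up sample in the assumed shape of the read2 response.
sample = """<document><annotations>
  <annotation id="a1" type="text"><TEXT>Last shuttle launch!</TEXT>
    <segment><movingRegion type="rect">
      <rectRegion x="5" y="5" w="30" h="10" t="0:00:01.500"/>
      <rectRegion x="5" y="5" w="30" h="10" t="0:00:04.0"/>
    </movingRegion></segment>
  </annotation>
</annotations></document>"""
srt = annotations_to_srt(sample)
```

The resulting .srt should load in VLC alongside the downloaded video.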