#archiveteam 2011-07-08,Fri


Time	Nickname	Message
03:22 ^🔗	swebb1	FYI, I've started downloading the google groups stuff using the IPv6 hack that alard and ndurner have put together, so you've got one more guy leeching at light-speed as of today. :)
03:23 ^🔗	swebb1	BTW, guys. I'm mega impressed by the ingenuity of the tools you've built! Nice-going!
03:24 ^🔗	underscor	I'm waiting for some company to approach us and be like "Hey, how should we backup/archive this shit?"
03:25 ^🔗	dashcloud	which google groups stuff? the old usenet stuff or something else?
03:45 ^🔗	soultcer	I think we are overloading ndurner's app :-/
04:34 ^🔗	underscor	Yay, finished the now playing/song request system for the dance tomorrow!
04:34 ^🔗	underscor	http://lc.k-srv.info/
06:10 ^🔗	SketchCow	Hey, gang.
06:10 ^🔗	SketchCow	In Sydney until tomorrow morning, which would be 12-13 hours from now.
09:09 ^🔗	Spirit_	interesting "censoring" in here: http://www.time.com/robots.txt
09:09 ^🔗	Spirit_	they missed http://www.time.com/time/magazine/article/0,9171,46162,00.html though
09:10 ^🔗	Spirit_	weird
09:12 ^🔗	SketchCow	that smacks of random policy
09:16 ^🔗	Spirit_	hm, can i add my repo to the archiveteam github even after i created it independently?
09:17 ^🔗	Spirit_	no, "If you intend to push a copy of a repository that is already hosted on GitHub, please fork it instead."
09:17 ^🔗	Spirit_	meh
09:48 ^🔗	alard	Spirit_: you can, I believe, move a repo to the archiveteam team.
09:49 ^🔗	alard	repo > Admin > Danger Zone > Transfer ownership
09:52 ^🔗	Spirit_	Repo moved to ArchiveTeam/robots-relapse
09:52 ^🔗	Spirit_	thanks!
09:54 ^🔗	Spirit_	now we just need to connect that with https://github.com/organizations/ArchiveTeam/teams/69519 somehow
09:56 ^🔗	alard	ehm, Admin > Teams?
09:58 ^🔗	Spirit_	no such button for me
10:01 ^🔗	alard	Strange. Can you see this: https://github.com/ArchiveTeam/robots-relapse/admin/collaboration ?
10:01 ^🔗	Spirit_	nope, getting served a 404 page
10:02 ^🔗	alard	That's funny.
10:02 ^🔗	alard	Shall I add it to the team?
10:02 ^🔗	alard	(Already did it.)
10:14 ^🔗	Spirit_	ah, now i have the admin button
10:15 ^🔗	Spirit_	cheers
16:06 ^🔗	Qwerty0	SketchCow: I finally finished rsyncing! Before I delete it all, does it look ok over there? It was to gv_28@flophouse.textfiles.com::gv_28/
16:38 ^🔗	alard	Hey all: the warc extension for wget is more or less ready for testing.
16:38 ^🔗	alard	I'm running out of ideas of things to add.
16:38 ^🔗	alard	So for those of you with a bit of spare time, I'd really appreciate it if you could have a look.
16:38 ^🔗	alard	Try to compile and run it, see if you can break it, and if you can think of any additional features that might be useful.
16:38 ^🔗	alard	Download: https://github.com/downloads/alard/wget-warc/wget-warc-20110708.tar.bz2
16:38 ^🔗	alard	More info: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
16:38 ^🔗	alard	Thanks!
16:48 ^🔗	swebb1	Where did you get info on the WARC format?
16:48 ^🔗	swebb1	My personal crawler stores the response headers, but just in a separate file.
16:49 ^🔗	swebb1	NM. I just read your docs and found the warc-tools thing that you're using. :)
17:03 ^🔗	closure	damnit.. I was planning to archive the entire countdown of the last shuttle launch, but I forgot to start the recorder
17:06 ^🔗	Qwerty0	haha, so a bunch of pages saying "1:32", "1:31", "1:30" etc?
17:07 ^🔗	closure	more like 6 hours of color commentary with engineers etc who are now packing up after their last day
17:09 ^🔗	Qwerty0	ohhh duh, right. for some reason i saw "countdown" and just imagined a nasa page with the countdown clock
17:09 ^🔗	Qwerty0	which would be kinda funny to archive..
17:11 ^🔗	closure	it also seemed very unlikely it would launch today.. I'll bet it was a hell of a countdown
17:15 ^🔗	alard	swebb1: There's also the ISO 28500 standard. (Or the draft version of it, which is what I used.)
17:18 ^🔗	swebb1	alard: is there a c library for writing to arc or warc files somewhere?
17:18 ^🔗	swebb1	my crawler is in c, and I'd like to keep most of it in c if possible.
17:20 ^🔗	swebb1	also, does arc/warc require the request headers too?
18:27 ^🔗	alard	swebb1: warc-tools has a c library. http://code.google.com/p/warc-tools/ (it's not in the new implementation.)
18:27 ^🔗	alard	You can store request headers in WARC, but it's not a strict requirement, more a recommendation.
18:28 ^🔗	alard	The iso standard document ( http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf ) and the guidelines ( http://netpreserve.org/publications/WARC_Guidelines_v1.pdf ) are worth a read.
18:29 ^🔗	alard	And arc is old, don't use that. :)
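For readers following the WARC discussion above, here is a minimal sketch of what a single uncompressed WARC/1.0 "response" record might look like when written by hand. It is not how wget-warc or warc-tools does it; the function name `warc_response_record` and its parameters are hypothetical, and only the record layout (version line, named header fields, blank line, block, two trailing CRLFs) follows the ISO 28500 draft linked above.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Build one uncompressed WARC/1.0 'response' record (illustrative helper)."""
    headers = [
        ("WARC-Type", "response"),
        # Record IDs are URIs in angle brackets; a random UUID URN is a common choice.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts only the record block (the raw HTTP response).
        ("Content-Length", str(len(http_payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The block is followed by two CRLFs that terminate the record.
    return head.encode("utf-8") + http_payload + b"\r\n\r\n"


payload = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
record = warc_response_record("http://example.com/", payload)
```

Request headers, which alard describes as recommended rather than required, would go into a separate record built the same way with `WARC-Type: request` and `msgtype=request`.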
20:05 ^🔗	ndurner	Is there a way to archive youtube video annotations (perhaps as subtitles)?
20:49 ^🔗	Coderjoe_	ndurner: got a video ID you want annotations for?
20:49 ^🔗	ndurner	Coderjoe_: 6F0Qb1p0ukw
20:50 ^🔗	Coderjoe_	http://www.youtube.com/api/reviews/y/read2?feat=TCS&video_id=VIDEOIDHERE
20:51 ^🔗	Coderjoe_	that will give you an xml file with the annotations
20:51 ^🔗	soultcer	ndurner: How is the google groups thingie doing? I am getting a lot of http 501 responses
20:52 ^🔗	emijrp	YOU HAVE BEEN BANNED FROM THE INTERNET.
20:52 ^🔗	ndurner	Coderjoe_: thanks!
20:53 ^🔗	ndurner	soultcer: it's hitting the quota... which is good
20:53 ^🔗	Coderjoe	I pulled that from the firebug net window from another video that youtube had on their blog announcement for annotations and your ID allowed me to test if it worked without having to even visit the video. it did
20:54 ^🔗	ndurner	soultcer: do you have sqlite3 on your server(s) so the script could dedup uploads to app engine?
20:54 ^🔗	Coderjoe	I don't know what the TCS feature enables just yet
20:55 ^🔗	Coderjoe	so far, it just changes one node that gives info about the request, as far as I can tell
20:56 ^🔗	ndurner	the XML might be transformable to real subtitles you can use with VLC
20:56 ^🔗	soultcer	ndurner: Do you mean the command-line version of sqlite3? I can install it if you need it
20:56 ^🔗	ndurner	yes
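The sqlite3 dedup idea mentioned above had not been implemented at this point in the log, so the following is only a sketch of one way it could work: keep a local table of items already sent and skip anything seen before. The table name `seen`, the helper names, and the example directory strings are all assumptions, and it uses Python's `sqlite3` module rather than the command-line tool being discussed.

```python
import sqlite3


def open_seen_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local database that remembers uploaded items."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (item TEXT PRIMARY KEY)")
    return conn


def mark_if_new(conn: sqlite3.Connection, item: str) -> bool:
    """Record the item; return True only the first time it is seen."""
    # The PRIMARY KEY makes a duplicate insert a no-op under OR IGNORE,
    # so rowcount is 1 for a new item and 0 for a repeat.
    cur = conn.execute("INSERT OR IGNORE INTO seen (item) VALUES (?)", (item,))
    conn.commit()
    return cur.rowcount == 1


conn = open_seen_db()
to_upload = [d for d in ["group-a/dir1", "group-a/dir1", "group-b/dir2"]
             if mark_if_new(conn, d)]
```

Passing a file path instead of `":memory:"` would let the record of uploaded items survive between runs of the script.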
20:56 ^🔗	swebb1	Mine isn't getting 501's.
20:57 ^🔗	ndurner	soultcer is our directory discoverer :-)
20:57 ^🔗	swebb1	OH, ahh.
20:57 ^🔗	soultcer	Putting all those vServers that usually rape url shorteners to some good use ;-)
21:03 ^🔗	*	ndurner has just created an account on github
21:03 ^🔗	ndurner	How do I join the Archiveteam there?
21:04 ^🔗	soultcer	lemme invite you
21:06 ^🔗	swebb1	Invite me too. Username: scumola
21:06 ^🔗	ndurner	thanks
21:07 ^🔗	soultcer	I think the SOP is to create a new "team" inside the organization for the project
21:07 ^🔗	Coderjoe	yes
21:08 ^🔗	Coderjoe	and add that team to the repository for that project
21:10 ^🔗	soultcer	Using git for the google groups thing will allow me to directly pull from version control (stupid bazaar changes repository format too often)
21:37 ^🔗	ndurner	soultcer: https://github.com/ArchiveTeam/google-groups-files
21:41 ^🔗	soultcer	It works by using files, not sqlite?
21:43 ^🔗	ndurner	I haven't implemented the idea yet
21:51 ^🔗	soultcer	ndurner: Is HTTP Error 403 also used for rate limiting?
21:51 ^🔗	ndurner	no
21:52 ^🔗	ndurner	Do you have the request?
21:52 ^🔗	soultcer	--2011-07-09 01:50:00-- http://archiveteamorg.appspot.com/donedir2?sel=gtype%3D0%2Cusenet%3Dathena&
22:00 ^🔗	ndurner	soultcer: ah, it's the filter for usenet groups!
22:10 ^🔗	swebb1	ndurner: instead of telling the clients to sleep for longer periods, why not issue more work for them to do instead? That would be more productive. My workers are mostly sleeping now.
22:11 ^🔗	swebb1	And issuing more work for them to do will take them longer to report back.
22:12 ^🔗	ndurner	the problem is not the number of requests but a) database access and b) database timeouts
22:13 ^🔗	ndurner	* a) load caused by database access
22:13 ^🔗	swebb1	So if google is limiting database access, request more items per request.
22:13 ^🔗	ndurner	it's the time spent doing database access
22:14 ^🔗	ndurner	or rather, the cycles burnt doing DB stuff
22:15 ^🔗	swebb1	Ok. Just a suggestion. Can you index the table somehow (like index the "done or not" field, so when you say "give me 5 that aren't done yet", it'll be fast)?
22:16 ^🔗	Coderjoe	the database provided by the app engine is not really an sql database, iirc
22:16 ^🔗	ndurner	It's purely index organized
22:17 ^🔗	ndurner	doing non-indexed queries isn't possible (GAE would throw an exception)
22:19 ^🔗	swebb1	Ok. I'm not familiar with the inner workings of GAE, so I'll leave the details up to you. Just figured that if it was DB access, that changing the 'limit xxx' query at the end to a larger number would keep the number of queries the same, but would keep people's crawlers from sleeping as much as they are.
22:23 ^🔗	ndurner	There are about 5 "lost" directories (but then reissued) a day because the query didn't finish fast enough for Google
22:24 ^🔗	ndurner	so query time also matters
22:24 ^🔗	Coderjoe	the entire HTTP request must finish in under 30 seconds (or GAE kills it)
22:25 ^🔗	Coderjoe	but you're seriously running into that?
22:25 ^🔗	swebb1	That's one slow database.
22:25 ^🔗	Coderjoe	any chance I can look at it? I don't have any real experience with GAE, but something does not seem right (and perhaps another set of eyes might help find it)
22:26 ^🔗	swebb1	Is there a queue system in GAE?
22:26 ^🔗	Coderjoe	what do you mean by queue system?
22:26 ^🔗	swebb1	You could queue work units.
22:26 ^🔗	swebb1	and pull them off the queue when issued, and put them back on the queue if timeout.
22:26 ^🔗	swebb1	Standard queue stuff.
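The "standard queue stuff" swebb1 describes (issue a work unit, put it back if the worker times out) can be sketched in memory as below. This is not the App Engine tracker's code and says nothing about how it would map onto the GAE datastore; the class name `LeaseQueue`, the lease length, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from collections import deque


class LeaseQueue:
    """Work queue that reissues items whose lease expired (illustrative only)."""

    def __init__(self, items, lease_seconds=300.0, clock=time.monotonic):
        self._pending = deque(items)
        self._leased = {}  # item -> time at which its lease expires
        self._lease_seconds = lease_seconds
        self._clock = clock

    def _requeue_expired(self):
        now = self._clock()
        expired = [i for i, deadline in self._leased.items() if deadline <= now]
        for item in expired:
            del self._leased[item]
            self._pending.append(item)  # goes to the back of the queue

    def issue(self, count=1):
        """Hand out up to `count` items, starting a lease on each."""
        self._requeue_expired()
        batch = []
        while self._pending and len(batch) < count:
            item = self._pending.popleft()
            self._leased[item] = self._clock() + self._lease_seconds
            batch.append(item)
        return batch

    def complete(self, item):
        """Mark an item done so it is never reissued."""
        self._leased.pop(item, None)


now = [0.0]  # fake clock so the example does not have to sleep
queue = LeaseQueue(["a", "b", "c"], lease_seconds=10, clock=lambda: now[0])
first = queue.issue(2)    # worker takes "a" and "b"
queue.complete("a")       # "a" is reported back, "b" never is
now[0] = 11.0             # the lease on "b" runs out
second = queue.issue(5)   # "c" is still pending, "b" is reissued after it
```

The `count` argument corresponds to the earlier suggestion of handing out more work per request, so that each worker reports back less often.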
22:34 ^🔗	swebb1	http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/ :)
22:34 ^🔗	Coderjoe	not inherently. you can do it using another database table, though.
22:35 ^🔗	Coderjoe	GAE has nothing that runs in the background. everything in your app has to happen as the result of an HTTP request.
22:40 ^🔗	ndurner	Coderjoe: did you get the link?
22:41 ^🔗	Coderjoe	just now
22:41 ^🔗	Coderjoe	thanks
22:42 ^🔗	Coderjoe	I wonder if using java instead of python incurs any speed penalty on request startup. (I haven't looked into using java on gae)
22:46 ^🔗	ndurner	Coderjoe: is there a specification/explanation for the youtube annotation XML format somewhere?
22:46 ^🔗	Coderjoe	i don't know
22:46 ^🔗	Coderjoe	hmm
22:46 ^🔗	Coderjoe	http://www.google.com/search?q=youtube+annotation+xml+format
22:47 ^🔗	Coderjoe	my top two results are about converting or copying them
22:47 ^🔗	Coderjoe	top one about converting to srt
22:52 ^🔗	ndurner	The script is not too sophisticated, not much you can gather from it. The best thing I have found so far is http://code.google.com/p/plugin-blocker/source/browse/trunk/annotations.js?spec=svn30&r=30
22:53 ^🔗	Coderjoe	the format looks fairly straightforward to me
22:55 ^🔗	ndurner	It is, I just want to explore the possibilities
