#urlteam 2015-11-14,Sat


Time	Nickname	Message
00:12 ^🔗		bwn has quit IRC (Read error: Operation timed out)
00:24 ^🔗		aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
00:25 ^🔗		aaaaaaaaa has joined #urlteam
00:25 ^🔗		swebb sets mode: +o aaaaaaaaa
01:14 ^🔗		JesseW has joined #urlteam
01:15 ^🔗		svchfoo3 sets mode: +o JesseW
01:20 ^🔗		xmc has quit IRC (Read error: Operation timed out)
01:21 ^🔗	JesseW	OK, here's the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt
01:21 ^🔗		svchfoo1 has quit IRC (Read error: Operation timed out)
01:23 ^🔗		xmc has joined #urlteam
01:23 ^🔗		swebb sets mode: +o xmc
01:23 ^🔗		Fusl has quit IRC (Ping timeout: 255 seconds)
01:23 ^🔗	JesseW	I need to figure out a better way to handle migre.me -- for now, I've turned it off.
01:23 ^🔗		svchfoo3 has quit IRC (Ping timeout: 369 seconds)
01:25 ^🔗		svchfoo3 has joined #urlteam
01:25 ^🔗		chazchaz has quit IRC (Ping timeout: 369 seconds)
01:27 ^🔗		phuzion has quit IRC (Ping timeout: 369 seconds)
01:27 ^🔗		atlogbot has quit IRC (Ping timeout: 369 seconds)
01:27 ^🔗		phuzion has joined #urlteam
01:27 ^🔗		JesseW has quit IRC (Leaving.)
01:28 ^🔗		aaaaaaaaa sets mode: +o svchfoo3
01:29 ^🔗		Fusl has joined #urlteam
01:30 ^🔗		atlogbot has joined #urlteam
01:31 ^🔗		chazchaz has joined #urlteam
01:31 ^🔗		svchfoo3 sets mode: +o chazchaz
01:35 ^🔗		svchfoo1 has joined #urlteam
01:35 ^🔗		svchfoo3 sets mode: +o svchfoo1
01:50 ^🔗		bwn has joined #urlteam
02:40 ^🔗		aaaaaaaaa has quit IRC (Read error: Operation timed out)
03:55 ^🔗		bwn has quit IRC (Read error: Operation timed out)
04:15 ^🔗		bwn has joined #urlteam
05:00 ^🔗		JesseW has joined #urlteam
05:00 ^🔗		svchfoo1 sets mode: +o JesseW
05:54 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
05:58 ^🔗		dashcloud has joined #urlteam
05:59 ^🔗		svchfoo3 sets mode: +o dashcloud
06:01 ^🔗		bwn has quit IRC (Read error: Operation timed out)
06:11 ^🔗		GLaDOS has quit IRC (Read error: Operation timed out)
06:22 ^🔗		bwn has joined #urlteam
06:41 ^🔗		JesseW has quit IRC (Leaving.)
06:43 ^🔗		GLaDOS has joined #urlteam
06:43 ^🔗		svchfoo3 sets mode: +o GLaDOS
08:40 ^🔗		WinterFox has quit IRC (Remote host closed the connection)
08:42 ^🔗		WinterFox has joined #urlteam
08:57 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
09:00 ^🔗		dashcloud has joined #urlteam
09:00 ^🔗		svchfoo3 sets mode: +o dashcloud
10:27 ^🔗		VADemon has quit IRC (left4dead)
11:27 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
11:31 ^🔗		dashcloud has joined #urlteam
11:31 ^🔗		svchfoo3 sets mode: +o dashcloud
11:42 ^🔗		VADemon has joined #urlteam
11:49 ^🔗		bwn has quit IRC (Read error: Operation timed out)
12:13 ^🔗		dashcloud has quit IRC (Read error: Operation timed out)
12:16 ^🔗		dashcloud has joined #urlteam
12:17 ^🔗		svchfoo3 sets mode: +o dashcloud
12:42 ^🔗		bwn has joined #urlteam
13:14 ^🔗		chazchaz has quit IRC (Read error: Operation timed out)
13:20 ^🔗		chazchaz has joined #urlteam
13:20 ^🔗		svchfoo1 sets mode: +o chazchaz
13:39 ^🔗		slang has quit IRC (Ping timeout: 240 seconds)
13:55 ^🔗		WinterFox has quit IRC (Remote host closed the connection)
15:56 ^🔗		JW_work has quit IRC (Read error: Connection reset by peer)
15:58 ^🔗		JW_work has joined #urlteam
15:58 ^🔗		JW_work has quit IRC (Read error: Connection reset by peer)
16:02 ^🔗		JW_work has joined #urlteam
17:51 ^🔗		JesseW has joined #urlteam
17:51 ^🔗		svchfoo3 sets mode: +o JesseW
18:16 ^🔗	JesseW	Once we finish the first round of migre.me, I'm going to have to go through the results, and do a second round of 1 URL items for the ones we missed in this round...
18:17 ^🔗	JesseW	but that's only ~5,000 so far, so it shouldn't be too painful.
18:38 ^🔗		bwn_ has joined #urlteam
18:39 ^🔗		JesseW has quit IRC (Leaving.)
18:46 ^🔗		bwn has quit IRC (Read error: Operation timed out)
18:57 ^🔗		aaaaaaaaa has joined #urlteam
18:57 ^🔗		swebb sets mode: +o aaaaaaaaa
19:32 ^🔗		JesseW has joined #urlteam
19:32 ^🔗		svchfoo3 sets mode: +o JesseW
20:44 ^🔗		VADemon has quit IRC (left4dead)
20:52 ^🔗		WinterFox has joined #urlteam
21:46 ^🔗		bwn_ has quit IRC (Ping timeout: 606 seconds)
21:52 ^🔗	WinterFox	How are the urls arranged in the dumps? JSON, csv?
22:03 ^🔗	*	JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam
22:06 ^🔗	WinterFox	I almost have yesterday's dump downloaded so I will check on that
22:13 ^🔗		bwn has joined #urlteam
22:20 ^🔗	JesseW	WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives
22:20 ^🔗	JesseW	let me know if you have questions
22:22 ^🔗	WinterFox	Looks good.
22:23 ^🔗	WinterFox	JesseW, Do you have a script to download all the daily dumps?
22:24 ^🔗	JesseW	I hacked together stuff, but I don't have a script, exactly.
22:26 ^🔗	JesseW	print '\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in list(internetarchive.api.search_items('subject:urlteam').iter_as_items()))
22:27 ^🔗	JesseW	should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another bt client) with xargs
22:27 ^🔗	WinterFox	That's Python, right?
22:27 ^🔗	JesseW	yep. Python using the internetarchive library
22:28 ^🔗	JesseW	It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them.
22:29 ^🔗	JesseW	One could pretty easily rewrite it in shell, with curl, etc.
22:30 ^🔗	WinterFox	AttributeError: 'Search' object has no attribute 'iter_as_items'
22:32 ^🔗	JesseW	ah, sorry, you'll need to use the 1.0 branch
22:32 ^🔗	JesseW	we're planning on merging them soon, but haven't done it yet
22:32 ^🔗	JesseW	want more tests first
22:33 ^🔗	WinterFox	So I need a newer version of the internetarchive lib?
22:33 ^🔗	WinterFox	Can I do that with pip?
22:34 ^🔗	JesseW	or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it in internetarchive.api.get_item
22:35 ^🔗	JesseW	e.g. print '\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))
22:35 ^🔗	JesseW	do import internetarchive.api as iaapi
22:35 ^🔗	JesseW	first
22:35 ^🔗	JesseW	or change the references to iaapi to use the longer name. :-)
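
The workaround JesseW describes (take each search result's identifier, turn it into an item with get_item, then pick out the torrent file) can be sketched as a small function. This is a minimal sketch in Python 3, not the code that was actually run: the one-liners in the log use the Python 2 print statement, and the function name torrent_urls and its two parameters are hypothetical. The lookup function is passed in as an argument so the logic can be tried without network access. Unlike the one-liner, it skips items that have no torrent file instead of raising IndexError.

```python
from typing import Callable, Iterable, Iterator, Mapping


def torrent_urls(
    search_results: Iterable[Mapping[str, str]],
    get_item: Callable[[str], object],
) -> Iterator[str]:
    """Yield the URL of the first *.torrent file of each search result.

    search_results: dicts carrying an 'identifier' key, as in the log's
        x['identifier'] usage.
    get_item: maps an identifier to an object with a
        get_files(glob_pattern=...) method, as in the log's one-liner.
    Both parameter names are hypothetical, chosen for this sketch.
    """
    for result in search_results:
        item = get_item(result["identifier"])
        torrents = list(item.get_files(glob_pattern="*.torrent"))
        if torrents:  # skip items without a torrent instead of failing
            yield torrents[0].url


# Possible usage with the internetarchive library (needs network access):
#
#   import internetarchive.api as iaapi
#   results = iaapi.search_items('subject:urlteam')
#   print('\n'.join(torrent_urls(results, iaapi.get_item)))
```

The printed list can then be fed to a BitTorrent client, as mentioned earlier in the conversation.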
22:37 ^🔗	WinterFox	So I just change 2015-11-05 to 2013 to get them all?
22:37 ^🔗	JesseW	also, the code above only gets the last few -- remove the AND addeddate part to get all of them
22:37 ^🔗	WinterFox	ah
22:37 ^🔗	JesseW	I was just using that to get a list of them to add to my collection
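
The difference between the two queries in this exchange (everything with the urlteam subject versus only recently added items) comes down to whether the addeddate clause is present. A minimal sketch of that choice follows, assuming the query syntax exactly as it appears in the log. The helper name build_query and its parameters are hypothetical and not part of any library.

```python
from typing import Optional


def build_query(
    subject: str,
    added_from: Optional[str] = None,
    added_to: Optional[str] = None,
) -> str:
    """Build a search query string in the form used in the log.

    With no date bounds, the query matches every item with the subject.
    With both bounds, an addeddate range clause is appended.
    """
    if (added_from is None) != (added_to is None):
        # The log only shows a range with both ends given, so this
        # sketch does not guess at an open-ended syntax.
        raise ValueError("give both added_from and added_to, or neither")
    query = f"subject:{subject}"
    if added_from is not None:
        query += f" AND addeddate:[{added_from} TO {added_to}]"
    return query


# All daily dumps, as suggested at 22:37.
all_dumps = build_query("urlteam")
# Only the recent ones, as in the 22:35 one-liner.
recent_dumps = build_query("urlteam", "2015-11-05", "2016")
```

Either string can be passed to search_items in place of the literal query.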
22:43 ^🔗	WinterFox	It seems to be working
22:51 ^🔗	JesseW	cool
23:20 ^🔗		JesseW has quit IRC (Leaving.)
