Time |
Nickname |
Message |
00:12
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
00:24
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
00:25
🔗
|
|
aaaaaaaaa has joined #urlteam |
00:25
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
01:14
🔗
|
|
JesseW has joined #urlteam |
01:15
🔗
|
|
svchfoo3 sets mode: +o JesseW |
01:20
🔗
|
|
xmc has quit IRC (Read error: Operation timed out) |
01:21
🔗
|
JesseW |
OK, here are the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt |
01:21
🔗
|
|
svchfoo1 has quit IRC (Read error: Operation timed out) |
01:23
🔗
|
|
xmc has joined #urlteam |
01:23
🔗
|
|
swebb sets mode: +o xmc |
01:23
🔗
|
|
Fusl has quit IRC (Ping timeout: 255 seconds) |
01:23
🔗
|
JesseW |
I need to figure out a better way to handle migre.me -- for now, I've turned it off. |
01:23
🔗
|
|
svchfoo3 has quit IRC (Ping timeout: 369 seconds) |
01:25
🔗
|
|
svchfoo3 has joined #urlteam |
01:25
🔗
|
|
chazchaz has quit IRC (Ping timeout: 369 seconds) |
01:27
🔗
|
|
phuzion has quit IRC (Ping timeout: 369 seconds) |
01:27
🔗
|
|
atlogbot has quit IRC (Ping timeout: 369 seconds) |
01:27
🔗
|
|
phuzion has joined #urlteam |
01:27
🔗
|
|
JesseW has quit IRC (Leaving.) |
01:28
🔗
|
|
aaaaaaaaa sets mode: +o svchfoo3 |
01:29
🔗
|
|
Fusl has joined #urlteam |
01:30
🔗
|
|
atlogbot has joined #urlteam |
01:31
🔗
|
|
chazchaz has joined #urlteam |
01:31
🔗
|
|
svchfoo3 sets mode: +o chazchaz |
01:35
🔗
|
|
svchfoo1 has joined #urlteam |
01:35
🔗
|
|
svchfoo3 sets mode: +o svchfoo1 |
01:50
🔗
|
|
bwn has joined #urlteam |
02:40
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Operation timed out) |
03:55
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
04:15
🔗
|
|
bwn has joined #urlteam |
05:00
🔗
|
|
JesseW has joined #urlteam |
05:00
🔗
|
|
svchfoo1 sets mode: +o JesseW |
05:54
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
05:58
🔗
|
|
dashcloud has joined #urlteam |
05:59
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
06:01
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
06:11
🔗
|
|
GLaDOS has quit IRC (Read error: Operation timed out) |
06:22
🔗
|
|
bwn has joined #urlteam |
06:41
🔗
|
|
JesseW has quit IRC (Leaving.) |
06:43
🔗
|
|
GLaDOS has joined #urlteam |
06:43
🔗
|
|
svchfoo3 sets mode: +o GLaDOS |
08:40
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
08:42
🔗
|
|
WinterFox has joined #urlteam |
08:57
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
09:00
🔗
|
|
dashcloud has joined #urlteam |
09:00
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
10:27
🔗
|
|
VADemon has quit IRC (left4dead) |
11:27
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
11:31
🔗
|
|
dashcloud has joined #urlteam |
11:31
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
11:42
🔗
|
|
VADemon has joined #urlteam |
11:49
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
12:13
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
12:16
🔗
|
|
dashcloud has joined #urlteam |
12:17
🔗
|
|
svchfoo3 sets mode: +o dashcloud |
12:42
🔗
|
|
bwn has joined #urlteam |
13:14
🔗
|
|
chazchaz has quit IRC (Read error: Operation timed out) |
13:20
🔗
|
|
chazchaz has joined #urlteam |
13:20
🔗
|
|
svchfoo1 sets mode: +o chazchaz |
13:39
🔗
|
|
slang has quit IRC (Ping timeout: 240 seconds) |
13:55
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
15:56
🔗
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
15:58
🔗
|
|
JW_work has joined #urlteam |
15:58
🔗
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
16:02
🔗
|
|
JW_work has joined #urlteam |
17:51
🔗
|
|
JesseW has joined #urlteam |
17:51
🔗
|
|
svchfoo3 sets mode: +o JesseW |
18:16
🔗
|
JesseW |
Once we finish the first round of migre.me, I'm going to have to go through the results, and do a second round of 1 URL items for the ones we missed in this round... |
18:17
🔗
|
JesseW |
but that's only ~5,000 so far, so it shouldn't be too painful. |
18:38
🔗
|
|
bwn_ has joined #urlteam |
18:39
🔗
|
|
JesseW has quit IRC (Leaving.) |
18:46
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
18:57
🔗
|
|
aaaaaaaaa has joined #urlteam |
18:57
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
19:32
🔗
|
|
JesseW has joined #urlteam |
19:32
🔗
|
|
svchfoo3 sets mode: +o JesseW |
20:44
🔗
|
|
VADemon has quit IRC (left4dead) |
20:52
🔗
|
|
WinterFox has joined #urlteam |
21:46
🔗
|
|
bwn_ has quit IRC (Ping timeout: 606 seconds) |
21:52
🔗
|
WinterFox |
How are the urls arranged in the dumps? JSON, csv? |
22:03
🔗
|
* |
JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam |
22:06
🔗
|
WinterFox |
I almost have yesterday's dump downloaded so I will check on that |
22:13
🔗
|
|
bwn has joined #urlteam |
22:20
🔗
|
JesseW |
WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives |
22:20
🔗
|
JesseW |
let me know if you have questions |
22:22
🔗
|
WinterFox |
Looks good. |
22:23
🔗
|
WinterFox |
JesseW, Do you have a script to download all the daily dumps? |
22:24
🔗
|
JesseW |
I hacked together stuff, but I don't have a script, exactly. |
22:26
🔗
|
JesseW |
print('\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in internetarchive.api.search_items('subject:urlteam').iter_as_items())) |
22:27
🔗
|
JesseW |
should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another bt client) with xargs |
22:27
🔗
|
WinterFox |
That's Python, right? |
22:27
🔗
|
JesseW |
yep. Python using the internetarchive library |
22:28
🔗
|
JesseW |
It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them. |
22:29
🔗
|
JesseW |
One could pretty easily rewrite it in shell, with curl, etc. |
22:30
🔗
|
WinterFox |
AttributeError: 'Search' object has no attribute 'iter_as_items' |
22:32
🔗
|
JesseW |
ah, sorry, you'll need to use the 1.0 branch |
22:32
🔗
|
JesseW |
we're planning on merging them soon, but haven't done it yet |
22:32
🔗
|
JesseW |
want more tests first |
22:33
🔗
|
WinterFox |
So I need a newer version of the internetarchive lib? |
22:33
🔗
|
WinterFox |
Can I do that with pip? |
22:34
🔗
|
JesseW |
or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it in internetarchive.api.get_item |
22:35
🔗
|
JesseW |
e.g. print('\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))) |
22:35
🔗
|
JesseW |
do import internetarchive.api as iaapi |
22:35
🔗
|
JesseW |
first |
22:35
🔗
|
JesseW |
or change the references to iaapi to use the longer name. :-) |
22:37
🔗
|
WinterFox |
So I just change 2015-11-05 to 2013 to get them all? |
22:37
🔗
|
JesseW |
also, the code above only gets the last few -- remove the AND addeddate part to get all of them |
22:37
🔗
|
WinterFox |
ah |
22:37
🔗
|
JesseW |
I was just using that to get a list of them to add to my collection |
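For readers who want the one-liners above as something reusable, here is a minimal Python 3 sketch of the same idea: search for items, fetch each one by identifier, and print the URL of its first `*.torrent` file. The function names `torrent_urls` and `main` are made up for illustration. Unlike the one-liner, this version skips items that have no torrent file instead of raising an error, and it assumes the installed `internetarchive` library exposes `search_items` and `get_item` at the top level.

```python
def torrent_urls(search_results, get_item):
    """Yield the URL of the first *.torrent file of each search result.

    search_results: iterable of dicts that each carry an 'identifier' key.
    get_item: callable mapping an identifier to an object with get_files().
    Both are passed in, so the logic does not depend on the network.
    """
    for result in search_results:
        item = get_item(result["identifier"])
        # Take the first matching file, or None if the item has no torrent.
        first = next(iter(item.get_files(glob_pattern="*.torrent")), None)
        if first is not None:
            yield first.url


def main(query="subject:urlteam"):
    # Imported lazily so torrent_urls() can be used without the library.
    # Assumption: search_items and get_item are available at the top level
    # of the installed internetarchive package.
    import internetarchive

    results = internetarchive.search_items(query)
    for url in torrent_urls(results, internetarchive.get_item):
        print(url)
```

Calling `main()` prints one torrent URL per line, suitable for feeding to a BitTorrent client with xargs as described above. Passing a narrower query, such as one with the `addeddate` range, limits the list to recent dumps.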
22:43
🔗
|
WinterFox |
It seems to be working |
22:51
🔗
|
JesseW |
cool |
23:20
🔗
|
|
JesseW has quit IRC (Leaving.) |