00:12 *** bwn has quit IRC (Read error: Operation timed out)
00:24 *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
00:25 *** aaaaaaaaa has joined #urlteam
00:25 *** swebb sets mode: +o aaaaaaaaa
01:14 *** JesseW has joined #urlteam
01:15 *** svchfoo3 sets mode: +o JesseW
01:20 *** xmc has quit IRC (Read error: Operation timed out)
01:21 <JesseW> OK, here's the 20,141 unique URLs containing "adrive.com" in the URLteam corpus: https://gist.github.com/JesseWeinstein/f1287df2ca12b1d96705/raw/3fb99fcea51fa990ded6e67c642e7a7c69a08aa3/gistfile1.txt
01:21 *** svchfoo1 has quit IRC (Read error: Operation timed out)
01:23 *** xmc has joined #urlteam
01:23 *** swebb sets mode: +o xmc
01:23 *** Fusl has quit IRC (Ping timeout: 255 seconds)
01:23 <JesseW> I need to figure out a better way to handle migre.me -- for now, I've turned it off.
01:23 *** svchfoo3 has quit IRC (Ping timeout: 369 seconds)
01:25 *** svchfoo3 has joined #urlteam
01:25 *** chazchaz has quit IRC (Ping timeout: 369 seconds)
01:27 *** phuzion has quit IRC (Ping timeout: 369 seconds)
01:27 *** atlogbot has quit IRC (Ping timeout: 369 seconds)
01:27 *** phuzion has joined #urlteam
01:27 *** JesseW has quit IRC (Leaving.)
01:28 *** aaaaaaaaa sets mode: +o svchfoo3
01:29 *** Fusl has joined #urlteam
01:30 *** atlogbot has joined #urlteam
01:31 *** chazchaz has joined #urlteam
01:31 *** svchfoo3 sets mode: +o chazchaz
01:35 *** svchfoo1 has joined #urlteam
01:35 *** svchfoo3 sets mode: +o svchfoo1
01:50 *** bwn has joined #urlteam
02:40 *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
03:55 *** bwn has quit IRC (Read error: Operation timed out)
04:15 *** bwn has joined #urlteam
05:00 *** JesseW has joined #urlteam
05:00 *** svchfoo1 sets mode: +o JesseW
05:54 *** dashcloud has quit IRC (Read error: Operation timed out)
05:58 *** dashcloud has joined #urlteam
05:59 *** svchfoo3 sets mode: +o dashcloud
06:01 *** bwn has quit IRC (Read error: Operation timed out)
06:11 *** GLaDOS has quit IRC (Read error: Operation timed out)
06:22 *** bwn has joined #urlteam
06:41 *** JesseW has quit IRC (Leaving.)
06:43 *** GLaDOS has joined #urlteam
06:43 *** svchfoo3 sets mode: +o GLaDOS
08:40 *** WinterFox has quit IRC (Remote host closed the connection)
08:42 *** WinterFox has joined #urlteam
08:57 *** dashcloud has quit IRC (Read error: Operation timed out)
09:00 *** dashcloud has joined #urlteam
09:00 *** svchfoo3 sets mode: +o dashcloud
10:27 *** VADemon has quit IRC (left4dead)
11:27 *** dashcloud has quit IRC (Read error: Operation timed out)
11:31 *** dashcloud has joined #urlteam
11:31 *** svchfoo3 sets mode: +o dashcloud
11:42 *** VADemon has joined #urlteam
11:49 *** bwn has quit IRC (Read error: Operation timed out)
12:13 *** dashcloud has quit IRC (Read error: Operation timed out)
12:16 *** dashcloud has joined #urlteam
12:17 *** svchfoo3 sets mode: +o dashcloud
12:42 *** bwn has joined #urlteam
13:14 *** chazchaz has quit IRC (Read error: Operation timed out)
13:20 *** chazchaz has joined #urlteam
13:20 *** svchfoo1 sets mode: +o chazchaz
13:39 *** slang has quit IRC (Ping timeout: 240 seconds)
13:55 *** WinterFox has quit IRC (Remote host closed the connection)
15:56 *** JW_work has quit IRC (Read error: Connection reset by peer)
15:58 *** JW_work has joined #urlteam
15:58 *** JW_work has quit IRC (Read error: Connection reset by peer)
16:02 *** JW_work has joined #urlteam
17:51 *** JesseW has joined #urlteam
17:51 *** svchfoo3 sets mode: +o JesseW
18:16 <JesseW> Once we finish the first round of migre.me, I'm going to have to go through the results, and do a second round of 1 URL items for the ones we missed in this round...
18:17 <JesseW> but that's only ~5,000 so far, so it shouldn't be too painful.
18:38 *** bwn_ has joined #urlteam
18:39 *** JesseW has quit IRC (Leaving.)
18:46 *** bwn has quit IRC (Read error: Operation timed out)
18:57 *** aaaaaaaaa has joined #urlteam
18:57 *** swebb sets mode: +o aaaaaaaaa
19:32 *** JesseW has joined #urlteam
19:32 *** svchfoo3 sets mode: +o JesseW
20:44 *** VADemon has quit IRC (left4dead)
20:52 *** WinterFox has joined #urlteam
21:46 *** bwn_ has quit IRC (Ping timeout: 606 seconds)
21:52 <WinterFox> How are the URLs arranged in the dumps? JSON, CSV?
22:03 * JesseW goes to write this up on http://archiveteam.org/index.php?title=URLTeam
22:06 <WinterFox> I almost have yesterday's dump downloaded, so I will check on that
22:13 *** bwn has joined #urlteam
22:20 <JesseW> WinterFox: OK, written up (thanks for prompting me to do it): http://archiveteam.org/index.php?title=URLTeam#Archives
22:20 <JesseW> let me know if you have questions
22:22 <WinterFox> Looks good.
22:23 <WinterFox> JesseW, do you have a script to download all the daily dumps?
22:24 <JesseW> I hacked together stuff, but I don't have a script, exactly.
22:26 <JesseW> print '\n'.join(list(x.get_files(glob_pattern='*.torrent'))[0].url for x in list(internetarchive.api.search_items('subject:urlteam').iter_as_items()))
22:27 <JesseW> should get you a list of URLs of the torrents for all the daily dumps, which you can then push into transmission (or another bt client) with xargs
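[Editor's note: the one-liner above, unpacked into a readable Python 3 sketch. This is hypothetical code, not from the chat: it assumes the modern `internetarchive` library (`pip install internetarchive`), whose search results are plain dicts carrying an `identifier` key; the discussion below notes that `iter_as_items` only existed on the library's 1.0 branch at the time.]

```python
# A readable sketch of the torrent-listing one-liner above.
# Assumption: the current `internetarchive` library API.

def first_torrent_urls(items):
    """Yield the URL of each item's first .torrent file.

    Items that carry no torrent file are silently skipped.
    """
    for item in items:
        torrents = list(item.get_files(glob_pattern='*.torrent'))
        if torrents:
            yield torrents[0].url

def daily_dump_items(query='subject:urlteam'):
    """Look up every Internet Archive item matching the search query."""
    import internetarchive  # imported lazily; the helper above needs no network
    for result in internetarchive.search_items(query):
        yield internetarchive.get_item(result['identifier'])

if __name__ == '__main__':
    # One torrent URL per line, ready to pipe into xargs + a bt client,
    # e.g.:  python dump_torrents.py | xargs -n1 transmission-remote -a
    print('\n'.join(first_torrent_urls(daily_dump_items())))
```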
22:27 <WinterFox> That's Python, right?
22:27 <JesseW> yep. Python using the internetarchive library
22:28 <JesseW> It's basically just making a call to the IA's Advanced Search for things with a subject of urlteam, then grabbing the files with a torrent extension, and printing out the download URLs for them.
22:29 <JesseW> One could pretty easily rewrite it in shell, with curl, etc.
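[Editor's note: a sketch of that shell rewrite, not from the chat. It assumes `jq` is installed and leans on two Internet Archive conventions: the `advancedsearch.php` JSON endpoint, and the auto-generated `<identifier>_archive.torrent` file each item serves under `/download/`.]

```shell
# Hypothetical shell version of the Python one-liner, using curl + jq.

# IA serves an auto-generated torrent for every item at a predictable path.
torrent_url() {
    printf 'https://archive.org/download/%s/%s_archive.torrent\n' "$1" "$1"
}

# Query the Advanced Search JSON endpoint for everything tagged urlteam,
# then print one torrent URL per item identifier.
urlteam_torrents() {
    curl -s 'https://archive.org/advancedsearch.php?q=subject%3Aurlteam&fl%5B%5D=identifier&rows=10000&output=json' \
        | jq -r '.response.docs[].identifier' \
        | while read -r id; do torrent_url "$id"; done
}
```

As in the chat, the output can be pushed straight into a bt client, e.g. `urlteam_torrents | xargs -n1 transmission-remote -a`.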
22:30 <WinterFox> AttributeError: 'Search' object has no attribute 'iter_as_items'
22:32 <JesseW> ah, sorry, you'll need to use the 1.0 branch
22:32 <JesseW> we're planning on merging them soon, but haven't done it yet
22:32 <JesseW> want more tests first
22:33 <WinterFox> So I need a newer version of the internetarchive lib?
22:33 <WinterFox> Can I do that with pip?
22:34 <JesseW> or, as iter_as_items is just a convenience, you could convert the search results to items yourself, by pulling out the identifier and stuffing it in internetarchive.api.get_item
22:35 <JesseW> e.g. print '\n'.join(list(iaapi.get_item(x['identifier']).get_files(glob_pattern='*.torrent'))[0].url for x in iaapi.search_items('subject:urlteam AND addeddate:[2015-11-05 TO 2016]'))
22:35 <JesseW> do import internetarchive.api as iaapi
22:35 <JesseW> first
22:35 <JesseW> or change the references to iaapi to use the longer name. :-)
22:37 <WinterFox> So I just change 2015-11-05 to 2013 to get them all?
22:37 <JesseW> also, the code above only gets the last few -- remove the AND addeddate part to get all of them
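[Editor's note: the `addeddate` clause discussed above is easy to get wrong by hand. A small hypothetical helper (`dump_query` is not part of the internetarchive library) makes the "all dumps" and "date-restricted" cases explicit, using Lucene's `*` for an open-ended range:]

```python
def dump_query(since=None, until=None):
    """Build the IA search query used above.

    With no dates it matches every URLTeam dump; with one or both dates it
    restricts by the item's addeddate field, using '*' for an open bound.
    """
    query = 'subject:urlteam'
    if since or until:
        query += ' AND addeddate:[{} TO {}]'.format(since or '*', until or '*')
    return query
```

`iaapi.search_items(dump_query())` then fetches everything, while `dump_query('2015-11-05', '2016')` reproduces the narrower query from the chat.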
22:37 <WinterFox> ah
22:37 <JesseW> I was just using that to get a list of them to add to my collection
22:43 <WinterFox> It seems to be working
22:51 <JesseW> cool
23:20 *** JesseW has quit IRC (Leaving.)