Time |
Nickname |
Message |
00:31
🔗
|
|
chfoo has quit IRC (Quit: chfoo) |
00:41
🔗
|
|
chfoo has joined #internetarchive.bak |
02:56
🔗
|
|
primus104 has quit IRC (Leaving.) |
05:12
🔗
|
|
wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES) |
05:27
🔗
|
|
wp494 has joined #internetarchive.bak |
06:18
🔗
|
|
primus104 has joined #internetarchive.bak |
06:37
🔗
|
|
zz_CyberJ is now known as CyberJaco |
07:20
🔗
|
|
CyberJaco is now known as zz_CyberJ |
07:27
🔗
|
|
primus104 has quit IRC (Leaving.) |
08:39
🔗
|
|
atomotic has joined #internetarchive.bak |
09:08
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
09:17
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 252 seconds) |
09:19
🔗
|
|
GLaDOS has joined #internetarchive.bak |
09:32
🔗
|
|
atomotic has joined #internetarchive.bak |
10:22
🔗
|
|
primus104 has joined #internetarchive.bak |
10:28
🔗
|
|
primus105 has joined #internetarchive.bak |
10:34
🔗
|
|
primus104 has quit IRC (Read error: Operation timed out) |
10:46
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
12:02
🔗
|
|
mariusz_ has joined #internetarchive.bak |
13:33
🔗
|
|
primus105 has quit IRC (Leaving.) |
13:35
🔗
|
|
atomotic has joined #internetarchive.bak |
14:21
🔗
|
SketchCow |
OK SO |
14:38
🔗
|
tpw_rules |
5 million reddit comments are scary |
14:39
🔗
|
tpw_rules |
also i'm getting that thing on the torrent where the last few blocks are corrupt |
14:40
🔗
|
tpw_rules |
"Downloaded: |
14:40
🔗
|
tpw_rules |
166.4 GB (208.4 GB corrupt)" |
14:46
🔗
|
zhongfu |
tpw_rules: that's... interesting |
14:46
🔗
|
zhongfu |
if you want I can throw up a rsync repo for a bit so you can sync the files that are broken |
14:46
🔗
|
tpw_rules |
re the reddit corpus? |
14:46
🔗
|
tpw_rules |
i tried re-adding the torrent; verifying local data |
14:46
🔗
|
zhongfu |
yep |
14:46
🔗
|
tpw_rules |
all of them are broken, but they're all at 99.999% |
14:47
🔗
|
tpw_rules |
how are you planning to process the files btw |
14:47
🔗
|
zhongfu |
it's already on IA iirc |
14:47
🔗
|
tpw_rules |
i'm adding it all to a mysql database; gonna play with RNN maybe |
14:47
🔗
|
tpw_rules |
that's the torrent i'm using |
14:48
🔗
|
tpw_rules |
and i've had that problem with torrents from IA in the past |
14:48
🔗
|
zhongfu |
i'm using the original magnet link |
14:48
🔗
|
tpw_rules |
using transmission 2.84 |
14:48
🔗
|
zhongfu |
same here |
14:48
🔗
|
tpw_rules |
that magnet link only had one file though? |
14:48
🔗
|
zhongfu |
http://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/ |
14:48
🔗
|
zhongfu |
you want the one from update 6 |
14:48
🔗
|
zhongfu |
magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80 |
14:49
🔗
|
tpw_rules |
lemme see what local data verification does |
14:52
🔗
|
tpw_rules |
do you have a file structure for the magnet link? like screenshot transmission or something |
14:52
🔗
|
zhongfu |
sec |
14:52
🔗
|
zhongfu |
https://znx.cc/s1436799165.png |
14:53
🔗
|
zhongfu |
goes on until RC_2015-05.bz2 |
14:53
🔗
|
zhongfu |
and then there's a README in the root directory |
14:53
🔗
|
tpw_rules |
ok cool. it's the same as the ia one then |
14:53
🔗
|
tpw_rules |
(ia only has 2015-01 though?) |
14:53
🔗
|
tpw_rules |
is reddit_data the root folder |
14:54
🔗
|
zhongfu |
yep |
14:54
🔗
|
tpw_rules |
wait that's weird. https://i.imgur.com/fJ4EXDW.png it's there twice |
14:54
🔗
|
zhongfu |
the IA directory listing shows up to 2015/05 though, not sure about the torrent |
14:55
🔗
|
zhongfu |
uhh |
14:55
🔗
|
zhongfu |
just tell transmission to download to reddit_data then I guess, not sure how the one in the top directory went there |
14:56
🔗
|
tpw_rules |
yeah i'm removing the rest |
14:56
🔗
|
tpw_rules |
ugh 150MB/s is too slow for verifying |
14:57
🔗
|
tpw_rules |
anyway thx. what are you doing with all this stuff? |
14:57
🔗
|
zhongfu |
i'm just seeding it for now |
14:57
🔗
|
tpw_rules |
ahh |
14:57
🔗
|
tpw_rules |
i'm shoving it all into mysql |
14:57
🔗
|
zhongfu |
not really free now so I'll leave it for later |
14:57
🔗
|
zhongfu |
I see |
14:57
🔗
|
tpw_rules |
i'm on summer break and i know nothing about data analysis |
14:58
🔗
|
tpw_rules |
science help us all |
14:59
🔗
|
zhongfu |
IIRC someone loaded the dataset into google bigquery, you might want to check that out |
15:00
🔗
|
tpw_rules |
i'm cheap and i have a good computer here. maybe once i figure out what i'm doing |
15:00
🔗
|
|
atomotic has quit IRC (Ping timeout: 252 seconds) |
15:00
🔗
|
tpw_rules |
granted i don't have "petabytes in seconds" but i can wait a little |
15:49
🔗
|
mariusz_ |
damn it why am I on expiring list again |
16:09
🔗
|
|
primus104 has joined #internetarchive.bak |
16:12
🔗
|
DFJustin |
sketchcow fucked up the torrent the first time and put that file twice, then updated it |
16:13
🔗
|
DFJustin |
so if you have the old torrent you need to get the updated torrent file from archive.org |
16:29
🔗
|
tpw_rules |
how many comments is it? |
16:33
🔗
|
tpw_rules |
hm if it's really 17 billion this may not be practical |
16:56
🔗
|
|
primus104 has quit IRC (Leaving.) |
17:33
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
17:50
🔗
|
|
primus104 has joined #internetarchive.bak |
18:54
🔗
|
|
zz_CyberJ is now known as CyberJaco |
18:55
🔗
|
|
balrog has quit IRC (Read error: Operation timed out) |
19:14
🔗
|
|
balrog has joined #internetarchive.bak |
20:02
🔗
|
|
mariusz_ has quit IRC (Read error: Operation timed out) |
20:51
🔗
|
|
Muad-Dib has quit IRC (Ping timeout: 252 seconds) |
20:59
🔗
|
|
Muad-Dib has joined #internetarchive.bak |
21:05
🔗
|
|
Muad-Dib has quit IRC (Ping timeout: 252 seconds) |
21:46
🔗
|
|
Muad-Dib has joined #internetarchive.bak |
21:52
🔗
|
|
Muad-Dib has quit IRC (Ping timeout: 252 seconds) |
22:16
🔗
|
SketchCow |
Hug |
23:31
🔗
|
|
protodev has quit IRC (Ping timeout: 606 seconds) |
23:32
🔗
|
|
protodev has joined #internetarchive.bak |
23:44
🔗
|
|
CyberJaco is now known as zz_CyberJ |