[00:31] *** chfoo has quit IRC (Quit: chfoo)
[00:41] *** chfoo has joined #internetarchive.bak
[02:56] *** primus104 has quit IRC (Leaving.)
[05:12] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
[05:27] *** wp494 has joined #internetarchive.bak
[06:18] *** primus104 has joined #internetarchive.bak
[06:37] *** zz_CyberJ is now known as CyberJaco
[07:20] *** CyberJaco is now known as zz_CyberJ
[07:27] *** primus104 has quit IRC (Leaving.)
[08:39] *** atomotic has joined #internetarchive.bak
[09:08] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[09:17] *** GLaDOS has quit IRC (Ping timeout: 252 seconds)
[09:19] *** GLaDOS has joined #internetarchive.bak
[09:32] *** atomotic has joined #internetarchive.bak
[10:22] *** primus104 has joined #internetarchive.bak
[10:28] *** primus105 has joined #internetarchive.bak
[10:34] *** primus104 has quit IRC (Read error: Operation timed out)
[10:46] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:02] *** mariusz_ has joined #internetarchive.bak
[13:33] *** primus105 has quit IRC (Leaving.)
[13:35] *** atomotic has joined #internetarchive.bak
[14:21] OK SO
[14:38] 5 million reddit comments are scary
[14:39] also i'm getting that thing on the torrent where the last few blocks are corrupt
[14:40] "Downloaded:
[14:40] 166.4 GB (208.4 GB corrupt)"
[14:46] tpw_rules: that's... interesting
[14:46] if you want I can throw up a rsync repo for a bit so you can sync the files that are broken
[14:46] re the reddit corpus?
[14:46] i tried re-adding the torrent; verifying local data
[14:46] yep
[14:46] all of them are broken, but they're all at 99.999%
[14:47] how are you planning to process the files btw
[14:47] it's already on IA iirc
[14:47] i'm adding it all to a mysql database; gonna play with RNN maybe
[14:47] that's the torrent i'm using
[14:48] and i've had that problem with torrents from IA in the past
[14:48] i'm using the original magnet link
[14:48] using transmission 2.84
[14:48] same here
[14:48] that magnet link only had one file though?
[14:48] http://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
[14:48] you want the one from update 6
[14:48] magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
[14:49] lemme see what local data verification does
[14:52] do you have a file structure for the magnet link? like screenshot transmission or something
[14:52] sec
[14:52] https://znx.cc/s1436799165.png
[14:53] goes on until RC_2015-05.bz2
[14:53] and then there's a README in the root directory
[14:53] ok cool. it's the same as the ia one then
[14:53] (ia only has 2015-01 though?
[14:53] is reddit_data the root folder
[14:54] yep
[14:54] wait that's weird. https://i.imgur.com/fJ4EXDW.png it's there twice
[14:54] the IA directory listing shows up to 2015/05 though, not sure about the torrent
[14:55] uhh
[14:55] just tell transmission to download to reddit_data then I guess, not sure how the one in the top directory went there
[14:56] yeah i'm removing the rest
[14:56] ugh 150MB/s is too slow for verifying
[14:57] anyway thx. what are you doing with all this stuff?
[14:57] i'm just seeding it for now
[14:57] ahh
[14:57] i'm shoving it all into mysql
[14:57] not really free now so I'll leave it for later
[14:57] I see
[14:57] i'm on summer break and i know nothing about data analysis
[14:58] science help us all
[14:59] IIRC someone loaded the dataset into google bigquery, you might want to check that out
[15:00] i'm cheap and i have a good computer here. maybe once i figure out what i'm doing
[15:00] *** atomotic has quit IRC (Ping timeout: 252 seconds)
[15:00] granted i don't have "petabytes in seconds" but i can wait a little
[15:49] damn it why am I on expiring list again
[16:09] *** primus104 has joined #internetarchive.bak
[16:12] sketchcow fucked up the torrent the first time and put that file twice, then updated it
[16:13] so if you have the old torrent you need to get the updated torrent file from archive.org
[16:29] how many comments is it?
[16:33] hm if it's really 17 billion this may not be practical
[16:56] *** primus104 has quit IRC (Leaving.)
[17:33] *** kyan has quit IRC (Quit: This computer has gone to sleep)
[17:50] *** primus104 has joined #internetarchive.bak
[18:54] *** zz_CyberJ is now known as CyberJaco
[18:55] *** balrog has quit IRC (Read error: Operation timed out)
[19:14] *** balrog has joined #internetarchive.bak
[20:02] *** mariusz_ has quit IRC (Read error: Operation timed out)
[20:51] *** Muad-Dib has quit IRC (Ping timeout: 252 seconds)
[20:59] *** Muad-Dib has joined #internetarchive.bak
[21:05] *** Muad-Dib has quit IRC (Ping timeout: 252 seconds)
[21:46] *** Muad-Dib has joined #internetarchive.bak
[21:52] *** Muad-Dib has quit IRC (Ping timeout: 252 seconds)
[22:16] Hug
[23:31] *** protodev has quit IRC (Ping timeout: 606 seconds)
[23:32] *** protodev has joined #internetarchive.bak
[23:44] *** CyberJaco is now known as zz_CyberJ