#internetarchive.bak 2015-07-13,Mon

Time Nickname Message
00:31 🔗 chfoo has quit IRC (Quit: chfoo)
00:41 🔗 chfoo has joined #internetarchive.bak
02:56 🔗 primus104 has quit IRC (Leaving.)
05:12 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
05:27 🔗 wp494 has joined #internetarchive.bak
06:18 🔗 primus104 has joined #internetarchive.bak
06:37 🔗 zz_CyberJ is now known as CyberJaco
07:20 🔗 CyberJaco is now known as zz_CyberJ
07:27 🔗 primus104 has quit IRC (Leaving.)
08:39 🔗 atomotic has joined #internetarchive.bak
09:08 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
09:17 🔗 GLaDOS has quit IRC (Ping timeout: 252 seconds)
09:19 🔗 GLaDOS has joined #internetarchive.bak
09:32 🔗 atomotic has joined #internetarchive.bak
10:22 🔗 primus104 has joined #internetarchive.bak
10:28 🔗 primus105 has joined #internetarchive.bak
10:34 🔗 primus104 has quit IRC (Read error: Operation timed out)
10:46 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
12:02 🔗 mariusz_ has joined #internetarchive.bak
13:33 🔗 primus105 has quit IRC (Leaving.)
13:35 🔗 atomotic has joined #internetarchive.bak
14:21 🔗 SketchCow OK SO
14:38 🔗 tpw_rules 5 million reddit comments are scary
14:39 🔗 tpw_rules also i'm getting that thing on the torrent where the last few blocks are corrupt
14:40 🔗 tpw_rules "Downloaded:
14:40 🔗 tpw_rules 166.4 GB (208.4 GB corrupt)"
14:46 🔗 zhongfu tpw_rules: that's... interesting
14:46 🔗 zhongfu if you want I can throw up a rsync repo for a bit so you can sync the files that are broken
14:46 🔗 tpw_rules re the reddit corpus?
14:46 🔗 tpw_rules i tried re-adding the torrent; verifying local data
14:46 🔗 zhongfu yep
14:46 🔗 tpw_rules all of them are broken, but they're all at 99.999%
14:47 🔗 tpw_rules how are you planning to process the files btw
14:47 🔗 zhongfu it's already on IA iirc
14:47 🔗 tpw_rules i'm adding it all to a mysql database; gonna play with RNN maybe
14:47 🔗 tpw_rules that's the torrent i'm using
14:48 🔗 tpw_rules and i've had that problem with torrents from IA in the past
14:48 🔗 zhongfu i'm using the original magnet link
14:48 🔗 tpw_rules using transmission 2.84
14:48 🔗 zhongfu same here
14:48 🔗 tpw_rules that magnet link only had one file though?
14:48 🔗 zhongfu http://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
14:48 🔗 zhongfu you want the one from update 6
14:48 🔗 zhongfu magnet:?xt=urn:btih:7690f71ea949b868080401c749e878f98de34d3d&dn=reddit%5Fdata&tr=http%3A%2F%2Ftracker.pushshift.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
14:49 🔗 tpw_rules lemme see what local data verification does
14:52 🔗 tpw_rules do you have a file structure for the magnet link? like screenshot transmission or something
14:52 🔗 zhongfu sec
14:52 🔗 zhongfu https://znx.cc/s1436799165.png
14:53 🔗 zhongfu goes on until RC_2015-05.bz2
14:53 🔗 zhongfu and then there's a README in the root directory
14:53 🔗 tpw_rules ok cool. it's the same as the ia one then
14:53 🔗 tpw_rules (ia only has 2015-01 though?)
14:53 🔗 tpw_rules is reddit_data the root folder
14:54 🔗 zhongfu yep
14:54 🔗 tpw_rules wait that's weird. https://i.imgur.com/fJ4EXDW.png it's there twice
14:54 🔗 zhongfu the IA directory listing shows up to 2015/05 though, not sure about the torrent
14:55 🔗 zhongfu uhh
14:55 🔗 zhongfu just tell transmission to download to reddit_data then I guess, not sure how the one in the top directory went there
14:56 🔗 tpw_rules yeah i'm removing the rest
14:56 🔗 tpw_rules ugh 150MB/s is too slow for verifying
14:57 🔗 tpw_rules anyway thx. what are you doing with all this stuff?
14:57 🔗 zhongfu i'm just seeding it for now
14:57 🔗 tpw_rules ahh
14:57 🔗 tpw_rules i'm shoving it all into mysql
14:57 🔗 zhongfu not really free now so I'll leave it for later
14:57 🔗 zhongfu I see
14:57 🔗 tpw_rules i'm on summer break and i know nothing about data analysis
14:58 🔗 tpw_rules science help us all
14:59 🔗 zhongfu IIRC someone loaded the dataset into google bigquery, you might want to check that out
15:00 🔗 tpw_rules i'm cheap and i have a good computer here. maybe once i figure out what i'm doing
15:00 🔗 atomotic has quit IRC (Ping timeout: 252 seconds)
15:00 🔗 tpw_rules granted i don't have "petabytes in seconds" but i can wait a little
15:49 🔗 mariusz_ damn it why am I on expiring list again
16:09 🔗 primus104 has joined #internetarchive.bak
16:12 🔗 DFJustin sketchcow fucked up the torrent the first time and put that file twice, then updated it
16:13 🔗 DFJustin so if you have the old torrent you need to get the updated torrent file from archive.org
16:29 🔗 tpw_rules how many comments is it?
16:33 🔗 tpw_rules hm if it's really 17 billion this may not be practical
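
[Note on the processing question above: a minimal sketch of how one of the monthly dump files could be streamed and grouped into batches before loading it into a database. It assumes each RC_*.bz2 file holds one JSON object per line, which the log does not state. The file path, the batch size, and the helper names iter_comments and batched are hypothetical. This is not what anyone in the channel actually ran.]

```python
import bz2
import json
from itertools import islice


def iter_comments(path):
    """Yield one dict per non-empty line of a bz2 file.

    Assumes newline-delimited JSON (an assumption, not confirmed by the log).
    The file is decompressed as a stream, so it never has to fit in memory.
    """
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(iterable, size):
    """Group any iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Hypothetical path; adjust to wherever the torrent was downloaded.
    total = 0
    for batch in batched(iter_comments("RC_2015-05.bz2"), 10_000):
        # A bulk INSERT into the database would go here, one batch at a time.
        total += len(batch)
    print("comments seen:", total)
```

[Counting one month this way before committing to a full import gives a rough per-file figure to extrapolate from, which speaks to the "how many comments is it?" question without loading everything first.]
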
16:56 🔗 primus104 has quit IRC (Leaving.)
17:33 🔗 kyan has quit IRC (Quit: This computer has gone to sleep)
17:50 🔗 primus104 has joined #internetarchive.bak
18:54 🔗 zz_CyberJ is now known as CyberJaco
18:55 🔗 balrog has quit IRC (Read error: Operation timed out)
19:14 🔗 balrog has joined #internetarchive.bak
20:02 🔗 mariusz_ has quit IRC (Read error: Operation timed out)
20:51 🔗 Muad-Dib has quit IRC (Ping timeout: 252 seconds)
20:59 🔗 Muad-Dib has joined #internetarchive.bak
21:05 🔗 Muad-Dib has quit IRC (Ping timeout: 252 seconds)
21:46 🔗 Muad-Dib has joined #internetarchive.bak
21:52 🔗 Muad-Dib has quit IRC (Ping timeout: 252 seconds)
22:16 🔗 SketchCow Hug
23:31 🔗 protodev has quit IRC (Ping timeout: 606 seconds)
23:32 🔗 protodev has joined #internetarchive.bak
23:44 🔗 CyberJaco is now known as zz_CyberJ
