#archiveteam-bs 2016-03-20,Sun


Time Nickname Message
00:06 🔗 JesseW BlueMax: repackage the existing ~200GB tar file of FanFiction.Net stories in a way that makes it easier for people to extract individual ones without downloading extra stuff.
00:21 🔗 JesseW OK, started zipping up the A's.
00:34 🔗 bsmith093 will this extract faster than a tar.gz file
00:46 🔗 BlueMaxim has joined #archiveteam-bs
00:56 🔗 BlueMax has quit IRC (Read error: Operation timed out)
01:07 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:08 🔗 BlueMaxim has joined #archiveteam-bs
01:13 🔗 JesseW I think so? I'll test...
01:13 🔗 JesseW It's made it to the Av's...
01:14 🔗 JesseW In the process of zipping it up
01:16 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:17 🔗 BlueMaxim has joined #archiveteam-bs
01:20 🔗 bsmith093 i mean any random file.
01:20 🔗 JesseW IDK
01:21 🔗 JesseW I mean, unlike a tar.gz, it *can* allow random access.
01:21 🔗 JesseW But I don't know if unzipping all of it is faster or slower
01:21 🔗 bsmith093 yay, that's what i meant!
01:21 🔗 JesseW 2.1G left to go
01:23 🔗 JesseW 1.8
01:24 🔗 godane SketchCow: 2011 of kpfa are all uploaded
01:25 🔗 godane also 2012-01 of kpfa is uploaded
01:27 🔗 JesseW nice!
01:30 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:34 🔗 bsmith093 godane: kpfa?
01:36 🔗 JesseW https://en.wikipedia.org/wiki/KPFA
01:37 🔗 JesseW The A's are compressed -- went from ~11GB to 4GB
01:37 🔗 bsmith093 whoo! good ratio
01:39 🔗 JesseW and it takes an unnoticeable amount of time to extract a single file
01:40 🔗 JesseW specifically, about 0.15s
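Editor's sketch of the random-access point above: a zip file keeps a central directory, so one member can be located and decompressed without reading the rest, whereas a tar.gz has to be decompressed from the start. The archive here is built in memory and the member names are made up for the demonstration; they are not the real layout of the repack.

```python
import io
import zipfile


def read_one_member(zip_source, member_name):
    """Return the bytes of a single archive member without unpacking the rest."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name)


# Build a tiny archive in memory so the example is self-contained.
# The paths below are hypothetical placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("A/Example Fandom/story-1.txt", "first story\n")
    archive.writestr("A/Example Fandom/story-2.txt", "second story\n")

# Only the requested member is decompressed here.
print(read_one_member(buffer, "A/Example Fandom/story-2.txt").decode())
```

With a real file, `zip_source` would simply be the path to the archive on disk.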
01:50 🔗 JesseW OK, now doing all the letters except H, N and T.
01:56 🔗 bsmith093 awesome!
01:56 🔗 bsmith093 it might be easier just to move those folders out of the path.
01:58 🔗 godane i'm also uploading more koreanet videos
01:58 🔗 godane http://archive.org/details/koreanet-1_chuncheon_pg_goodgw-20080221
01:58 🔗 godane i'm also looking at archiving coverville mp3s
01:59 🔗 BlueMaxim has joined #archiveteam-bs
02:01 🔗 godane i also figure i should be up to date with kpfa by may or june at the rate i'm going
02:02 🔗 godane i'm downloading march 2012 with 13 proxies right now
02:04 🔗 JesseW bsmith093: I should be able to use -x to exclude them; I just wanted to handle it manually
02:04 🔗 bsmith093 k
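Editor's sketch of a per-letter batch that skips the letters being handled manually. It assumes one top-level directory per letter, which the log implies but does not spell out; `zip_letter`, `zip_all_letters` and `SKIPPED_LETTERS` are hypothetical names, and this is not JesseW's actual script.

```python
import os
import tempfile
import zipfile

SKIPPED_LETTERS = {"H", "N", "T"}  # the letters set aside in the discussion


def zip_letter(root, letter, out_dir):
    """Zip root/<letter> into out_dir/<letter>.zip and return the archive path."""
    source = os.path.join(root, letter)
    archive_path = os.path.join(out_dir, letter + ".zip")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(source):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Store paths relative to root so no local prefix leaks in.
                archive.write(full, arcname=os.path.relpath(full, root))
    return archive_path


def zip_all_letters(root, out_dir, skip=SKIPPED_LETTERS):
    """Zip every letter directory under root except those listed in skip."""
    done = []
    for letter in sorted(os.listdir(root)):
        if letter in skip or not os.path.isdir(os.path.join(root, letter)):
            continue
        done.append(zip_letter(root, letter, out_dir))
    return done


# Demo on a throwaway tree: "A" is zipped, "H" is skipped.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as out_dir:
    for letter in ("A", "H"):
        os.makedirs(os.path.join(root, letter, "Example Fandom"))
        with open(os.path.join(root, letter, "Example Fandom", "story.txt"), "w") as handle:
            handle.write("text\n")
    made = [os.path.basename(path) for path in zip_all_letters(root, out_dir)]
print(made)
```

Storing member names relative to the root also keeps machine-specific path prefixes out of the archive.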
02:13 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
02:37 🔗 VADemon has quit IRC (Quit: left4dead)
02:48 🔗 JesseW has quit IRC (Quit: Leaving.)
02:49 🔗 JesseW has joined #archiveteam-bs
03:10 🔗 BlueMaxim has joined #archiveteam-bs
03:18 🔗 ohhdemgir has quit IRC (Quit: True)
03:34 🔗 JRWR has quit IRC (Read error: Connection reset by peer)
03:56 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
04:22 🔗 JesseW Finished the B's.
04:22 🔗 JesseW 15 -> 5.5
04:36 🔗 vitzli has joined #archiveteam-bs
05:01 🔗 JesseW C: 12 -> 4.4
05:06 🔗 acridAxid has quit IRC (marauder)
05:08 🔗 acridAxid has joined #archiveteam-bs
05:11 🔗 bsmith093 do you have to trigger each letter manually, or is it a batch job?
05:34 🔗 JesseW has quit IRC (Quit: Leaving.)
05:58 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
06:05 🔗 Sk1d has joined #archiveteam-bs
06:12 🔗 JesseW has joined #archiveteam-bs
06:13 🔗 bsmith093 JesseW: starting a hopefully much smaller grab of stories starting with id 10 million on up
06:14 🔗 JesseW bsmith093: it's a batch job, running on my home server (where I normally run the Warrior).
06:14 🔗 JesseW bsmith093: great, please put it in a zip file when it's done. :-)
06:14 🔗 JesseW currently it's 5G from the end of the Ds
06:25 🔗 JesseW 2.8G from the end of the Ds
06:44 🔗 JesseW and finished D! 17GB -> 6.3GB
06:45 🔗 JesseW and E is only 3GB uncompressed, so it should go quickly
06:45 🔗 JesseW but I'm going to sleep soon; I'll update in the morning about how far it's gotten
06:53 🔗 GLaDOS has joined #archiveteam-bs
07:10 🔗 JesseW E is done, and went from 3 -> 1.1
07:32 🔗 JesseW has quit IRC (Quit: Leaving.)
07:56 🔗 GLaDOS has quit IRC (Quit: Oh crap, I died.)
08:14 🔗 metalcamp has joined #archiveteam-bs
08:20 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
09:23 🔗 bwn has quit IRC (Ping timeout: 492 seconds)
09:44 🔗 bwn has joined #archiveteam-bs
09:58 🔗 vitzli has quit IRC (Leaving)
10:14 🔗 fpoee has joined #archiveteam-bs
10:20 🔗 plog99 has quit IRC (Read error: Operation timed out)
10:33 🔗 GLaDOS has joined #archiveteam-bs
10:39 🔗 bzc6p has joined #archiveteam-bs
10:39 🔗 swebb sets mode: +o bzc6p
10:46 🔗 bzc6p When you see "You are strictly forbidden to share/distribute/archive this free stuff" and "that external content is not available any more" on the same webpage. *facepalm*
10:46 🔗 bzc6p Both with bold red letters followed by multiple exclamation marks.
11:00 🔗 bzc6p You have heard bzc6p's monthly ventilation. Oh, here comes the ops train.
11:00 🔗 bzc6p sets mode: +ooo achip arkiver BnA-Rob1n
11:00 🔗 bzc6p sets mode: +oooo chfoo Fletcher Fletcher_ GLaDOS
11:00 🔗 bzc6p sets mode: +oooo godane HCross HCross2 joepie91
11:00 🔗 bzc6p sets mode: +ooo Kenshin Kazzy SimpBrain
11:00 🔗 bzc6p sets mode: +ooo Start wp494 yipdw
11:16 🔗 vitzli has joined #archiveteam-bs
11:22 🔗 bzc6p has left
13:10 🔗 vitzli has quit IRC (Ping timeout: 246 seconds)
13:17 🔗 RichardG_ has joined #archiveteam-bs
13:24 🔗 RichardG has quit IRC (Read error: Operation timed out)
14:06 🔗 VADemon has joined #archiveteam-bs
14:36 🔗 yakfish has quit IRC (Read error: Operation timed out)
14:37 🔗 yakfish has joined #archiveteam-bs
14:39 🔗 trane has joined #archiveteam-bs
14:39 🔗 trane has left
14:39 🔗 ohhdemgir has joined #archiveteam-bs
14:44 🔗 RichardG has joined #archiveteam-bs
14:51 🔗 RichardG_ has quit IRC (Ping timeout: 633 seconds)
15:21 🔗 metalcamp has joined #archiveteam-bs
15:39 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
16:17 🔗 zino has quit IRC (Read error: Operation timed out)
16:49 🔗 JesseW has joined #archiveteam-bs
16:52 🔗 JesseW The FanFictionNet repack is now on P; A through O (excluding H & N) is 49GB
17:47 🔗 bsmith093 could you do an ls -a of the files before you dump the uncompressed ones? probably want to compress that too :)
17:48 🔗 bsmith093 wow, that was fast!
17:49 🔗 JesseW sure
17:49 🔗 JesseW I mean, it won't be different than your inventory file, though
17:50 🔗 JesseW P done: 14 -> 5.2
17:50 🔗 JesseW Q done: 0.19G -> 0.07G
17:51 🔗 JesseW I don't think you meant "ls -a" ...?
17:51 🔗 JesseW Maybe you meant ls -r?
17:51 🔗 JesseW or find?
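Editor's sketch of the kind of listing being asked for: a recursive inventory of every file, written compressed. This is one possible reading of the request, not what either participant ran; `list_files` and `write_inventory` are hypothetical names.

```python
import gzip
import os
import tempfile


def list_files(root):
    """Return sorted paths of every regular file under root, relative to root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)


def write_inventory(root, out_path):
    """Write one relative path per line to a gzip-compressed text file."""
    paths = list_files(root)  # collect first, in case out_path lies under root
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")


# Demo on a throwaway tree; the inventory goes to a separate directory.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as out_dir:
    os.makedirs(os.path.join(root, "A", "Example Fandom"))
    for name in ("b.txt", "a.txt"):
        with open(os.path.join(root, "A", "Example Fandom", name), "w") as handle:
            handle.write("text\n")
    inventory_path = os.path.join(out_dir, "inventory.txt.gz")
    listing = list_files(root)
    write_inventory(root, inventory_path)
    with gzip.open(inventory_path, "rt", encoding="utf-8") as handle:
        saved = handle.read().splitlines()
print(saved)
```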
17:54 🔗 schbirid has joined #archiveteam-bs
18:12 🔗 metalcamp has joined #archiveteam-bs
18:22 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
18:29 🔗 JesseW R done: 6.4G -> 2.4G
18:30 🔗 JesseW Now doing the hideously large S -- 31GB
18:30 🔗 JesseW mainly due to lots and lots of crossovers, I think.
18:33 🔗 JesseW ~ 14GB in over 1G chunks, which leaves 17GB in over *3000* other fandoms
18:39 🔗 bsmith093 JesseW: might be considerably smaller, though, because of the lack of "home/Desktop/etc" in every path
18:40 🔗 bsmith093 JesseW: yes, that would be supernatural, i think
18:42 🔗 JesseW Well, Supernatural *is* the largest (at 3.9G) but Sailor Moon is next at 2.5G, followed by Star Wars at 2.0G
19:01 🔗 JesseW has quit IRC (Quit: Leaving.)
19:20 🔗 bwn has quit IRC (Ping timeout: 246 seconds)
19:32 🔗 zino has joined #archiveteam-bs
19:54 🔗 bwn has joined #archiveteam-bs
19:55 🔗 JetBalsa has joined #archiveteam-bs
19:56 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
20:00 🔗 zhongfu has joined #archiveteam-bs
20:53 🔗 metalcamp has joined #archiveteam-bs
20:59 🔗 schbirid has quit IRC (Quit: Leaving)
21:03 🔗 JesseW has joined #archiveteam-bs
21:05 🔗 RichardG has quit IRC (Read error: Operation timed out)
21:07 🔗 JesseW Up to Star Wars in the repack.
21:10 🔗 xXx_ndidd has quit IRC (Read error: Operation timed out)
21:30 🔗 DopefishJ has joined #archiveteam-bs
21:30 🔗 swebb sets mode: +o DopefishJ
21:31 🔗 Boltsie has quit IRC (Read error: Connection reset by peer)
21:31 🔗 BnA-Rob1n has quit IRC (Read error: Connection reset by peer)
21:31 🔗 JSharp___ has quit IRC (Write error: Connection reset by peer)
21:33 🔗 JSharp___ has joined #archiveteam-bs
21:33 🔗 Boltsie has joined #archiveteam-bs
21:33 🔗 DFJustin has quit IRC (Ping timeout: 260 seconds)
21:33 🔗 BnA-Rob1n has joined #archiveteam-bs
21:35 🔗 Ctrl-S___ has quit IRC (Ping timeout: 260 seconds)
21:35 🔗 _desu___ has quit IRC (Ping timeout: 260 seconds)
21:36 🔗 _desu___ has joined #archiveteam-bs
21:38 🔗 Ctrl-S___ has joined #archiveteam-bs
21:39 🔗 BnA-Rob1n has quit IRC (Quit: Updating details, brb)
21:40 🔗 BnA-Rob1n has joined #archiveteam-bs
21:57 🔗 metalcamp has quit IRC (Quit: Bye)
21:58 🔗 metalcamp has joined #archiveteam-bs
22:00 🔗 bsmith093 JesseW: hey, feel like doing statistical analysis on this massive corpus?
22:01 🔗 bsmith093 might be more trouble than it's worth, grabbing all the metadata out of every. single. file.
22:02 🔗 bsmith093 specifically the first 27 lines at the beginning. i just checked
22:03 🔗 JesseW I was thinking that would be interesting to do, yeah
22:03 🔗 JesseW Probably pull the 27 lines from each file and stuff them in a sqlite table?
22:04 🔗 JesseW Then upload that, too?
22:04 🔗 JesseW bsmith093:
22:04 🔗 bsmith093 oohhh, that would be awesome!
22:04 🔗 bsmith093 the colums would even be labeled!
22:05 🔗 bsmith093 columns*
22:05 🔗 JesseW :-)
22:05 🔗 bsmith093 holy crap you rock! can't afford gold, so have some reddit silver
22:06 🔗 JesseW again, I wouldn't be able to do it if you hadn't taken the time to extract that corpus first.
22:06 🔗 JesseW (and neither of us would have been able to do it if the (unknown?) people behind fanfiction.net hadn't been willing to provide it for this long.)
22:07 🔗 JesseW (and we all owe the fanfic *authors* gratitude for writing it in the first place)
22:07 🔗 bsmith093 https://www.google.com/search?q=reddit+silver&tbm=isch&imgil=Ebbr9Fm8RlfkqM%253A%253BxOpeMr0fiUOvhM%253Bhttps%25253A%25252F%25252Fwww.reddit.com%25252Fuser%25252Flotsalote%25252Fgilded%25252F&source=iu&pf=m&fir=Ebbr9Fm8RlfkqM%253A%252CxOpeMr0fiUOvhM%252C_&usg=__EY7Dc8QlfzxNYwsteoEtJeHtRrc%3D
22:07 🔗 bsmith093 damn straight! creatives rule(34)
22:08 🔗 JesseW lol! I hadn't seen reddit silver before.
22:08 🔗 bsmith093 i've read stories that turn the canon of their respective universes into something much better than that canon probably deserves
22:09 🔗 JesseW :-)
22:09 🔗 bsmith093 for example https://www.fanfiction.net/s/75517/1/Shadows-of-the-Past
22:10 🔗 bsmith093 this guy made me care about RAINBOW BRITE!!! nuff said
22:11 🔗 JesseW so are the headers fully consistent? i.e. title always on line 4, author on line 6, etc?
22:12 🔗 JesseW S done: 31G -> 12G
22:12 🔗 bsmith093 very nearly always, except that for a few hundred(ish) stories the packaged, published, and updated dates will either not be there or not have times with the dates.
22:12 🔗 bsmith093 everything is always in that order
22:13 🔗 JesseW ok cool
22:13 🔗 bsmith093 when grabbing, maybe just grab the path too, to save time, and just grab the first 27 lines first, less data to comb through
22:15 🔗 bsmith093 i had an app that saved fanfic to, apparently a sql db file. it turns out to be rather difficult to tell sql to undo that. also reddit is awesome, had a script in 3 hours
22:15 🔗 JesseW well, what my script will do is grab the first 27 lines and convert them into a CSV row, then import those into sqlite
22:16 🔗 JesseW U done: 0.423G -> 0.156G
22:17 🔗 JesseW Only about 20G more total (then the three big ones)
22:17 🔗 JesseW (and the rest of their letters)
22:17 🔗 bsmith093 when i switched to calibre i dumped those id numbers from the sql db into the fanficfare plugin, all 8000+ waited the 2.7 days it took to grab them, then sorted all the dead ones, and ran that script against the dead list. got 300 more.
22:17 🔗 bsmith093 you're going in order right, so 5 more letters, then the 3 huge ones.
22:18 🔗 bsmith093 check a random file though, i said 27 to grab some blank lines at the end.
22:18 🔗 JesseW yep
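Editor's sketch of the pipeline described above: take the first 27 lines of each story file, write them as one CSV row together with the path, then load the CSV into SQLite. The header layout is not shown in the log, so this keeps one generic column per line (`line_01` to `line_27`) instead of guessing field names; the table name `headers` and all function names are hypothetical.

```python
import csv
import itertools
import os
import sqlite3
import tempfile

HEADER_LINES = 27  # figure quoted in the discussion; check it against a sample file


def read_header(path, count=HEADER_LINES):
    """Return exactly `count` strings: the file's first lines, padded with ''."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = [line.rstrip("\n") for line in itertools.islice(handle, count)]
    return lines + [""] * (count - len(lines))


def write_csv(paths, csv_path):
    """Write one row per story: the path followed by its header lines."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in paths:
            writer.writerow([path] + read_header(path))


def import_csv(csv_path, db_path):
    """Load the CSV into a `headers` table and return the open connection."""
    columns = ["path"] + [f"line_{i:02d}" for i in range(1, HEADER_LINES + 1)]
    column_defs = ", ".join(name + " TEXT" for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection = sqlite3.connect(db_path)
    with connection:  # commits on success
        connection.execute(f"CREATE TABLE IF NOT EXISTS headers ({column_defs})")
        with open(csv_path, newline="", encoding="utf-8") as src:
            connection.executemany(
                f"INSERT INTO headers VALUES ({placeholders})", csv.reader(src)
            )
    return connection


# Demo with a two-line stand-in file; real story files are assumed to be longer.
with tempfile.TemporaryDirectory() as workdir:
    story = os.path.join(workdir, "story.txt")
    with open(story, "w", encoding="utf-8") as handle:
        handle.write("line one\nline two\n")
    csv_path = os.path.join(workdir, "headers.csv")
    write_csv([story], csv_path)
    connection = import_csv(csv_path, ":memory:")
    row = connection.execute("SELECT line_01, line_02, line_27 FROM headers").fetchone()
    connection.close()
print(row)
```

Padding short files to a fixed width keeps every row the same shape, which also covers the stories whose date lines are missing.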
22:28 🔗 Rickster has quit IRC (Ping timeout: 260 seconds)
22:29 🔗 goekesmi has quit IRC (Ping timeout: 260 seconds)
22:41 🔗 Rickster has joined #archiveteam-bs
22:46 🔗 acridAxid has quit IRC (marauder)
22:46 🔗 BlueMaxim has joined #archiveteam-bs
22:47 🔗 goekesmi has joined #archiveteam-bs
22:48 🔗 acridAxid has joined #archiveteam-bs
22:48 🔗 JesseW V done: 5.3G -> 2G
22:54 🔗 DopefishJ is now known as DFJustin
23:13 🔗 ndiddy has joined #archiveteam-bs
23:16 🔗 metalcamp has quit IRC (Ping timeout: 244 seconds)
23:23 🔗 dxrt- has quit IRC (Ping timeout: 260 seconds)
23:32 🔗 yipdw I am super happy this laptop has an ethernet port
23:32 🔗 yipdw 35 MB/s vs 11 MB/s: A Thing
23:53 🔗 bsmith093 yipdw: wired is always faster
23:53 🔗 JesseW W done: 7 -> 2.6
23:57 🔗 bsmith093 yipdw: i recently replaced a hub i had in my network with a gigabit unmanaged switch, internal speeds tripled.
