[00:06] BlueMax: repackage the existing ~200GB tar file of FanFiction.Net stories in a way that makes it easier for people to extract individual ones without downloading extra stuff.
[00:21] OK, started zipping up the A's.
[00:34] will this extract faster than a tar.gz file
[00:46] *** BlueMaxim has joined #archiveteam-bs
[00:56] *** BlueMax has quit IRC (Read error: Operation timed out)
[01:07] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:08] *** BlueMaxim has joined #archiveteam-bs
[01:13] I think so? I'll test...
[01:13] It's made it to the Av's...
[01:14] In the process of zipping it up
[01:16] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:17] *** BlueMaxim has joined #archiveteam-bs
[01:20] i mean any random file.
[01:20] IDK
[01:21] I mean, unlike a tar.gz, it *can* allow random access.
[01:21] But I don't know if unzipping all of it is faster or slower
[01:21] yay, that's what i meant!
[01:21] 2.1G left to go
[01:23] 1.8
[01:24] SketchCow: 2011 of kpfa are all uploaded
[01:25] also 2012-01 of kpfa is uploaded
[01:27] nice!
[01:30] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:34] godane: kpfa?
[01:36] https://en.wikipedia.org/wiki/KPFA
[01:37] The A's are compressed -- went from ~11GB to 4GB
[01:37] whoo! good ratio
[01:39] and it takes an unnoticeable amount of time to extract a single file
[01:40] specifically, about 0.15s
[01:50] OK, now doing all the letters except H, N and T.
[01:56] awesome!
[01:56] it might be easier just to move those folders out of the path.
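The random-access point above comes from how the two formats are laid out: a zip compresses each member separately and keeps a central directory of offsets, while a tar.gz is one gzip stream, so reaching a file means decompressing everything stored before it. A minimal sketch of reading a single member with Python's standard zipfile module follows. The member names are hypothetical stand-ins, not the real archive layout, and the archive is built in memory only to keep the example self-contained.

```python
import io
import zipfile


def build_demo_zip() -> bytes:
    """Create a small in-memory zip standing in for one letter's archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical member names; the real archive layout may differ.
        zf.writestr("A/Alpha Fandom/story-1.txt", "first story\n")
        zf.writestr("A/Avatar/story-2.txt", "second story\n")
    return buf.getvalue()


def read_one_member(zip_bytes: bytes, name: str) -> str:
    """Read a single member by name; other members are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name).decode("utf-8")


DEMO_ZIP = build_demo_zip()
ONE_STORY = read_one_member(DEMO_ZIP, "A/Avatar/story-2.txt")
```

With a real archive on disk, passing the file path to zipfile.ZipFile in place of the BytesIO object works the same way.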
[01:58] i'm also uploading more koreanet videos
[01:58] http://archive.org/details/koreanet-1_chuncheon_pg_goodgw-20080221
[01:58] i'm also looking at archiving coverville mp3s
[01:59] *** BlueMaxim has joined #archiveteam-bs
[02:01] i also figure i should be up to date with kpfa by may or june at the rate i'm going
[02:02] i'm downloading march 2012 with 13 proxies right now
[02:04] bsmith093: I should be able to use -x to exclude them; I just wanted to handle it manually
[02:04] k
[02:13] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[02:37] *** VADemon has quit IRC (Quit: left4dead)
[02:48] *** JesseW has quit IRC (Quit: Leaving.)
[02:49] *** JesseW has joined #archiveteam-bs
[03:10] *** BlueMaxim has joined #archiveteam-bs
[03:18] *** ohhdemgir has quit IRC (Quit: True)
[03:34] *** JRWR has quit IRC (Read error: Connection reset by peer)
[03:56] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:22] Finished the B's.
[04:22] 15 -> 5.5
[04:36] *** vitzli has joined #archiveteam-bs
[05:01] C: 12 -> 4.4
[05:06] *** acridAxid has quit IRC (marauder)
[05:08] *** acridAxid has joined #archiveteam-bs
[05:11] do you have to trigger each letter manually, or is it a batch job?
[05:34] *** JesseW has quit IRC (Quit: Leaving.)
[05:58] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[06:05] *** Sk1d has joined #archiveteam-bs
[06:12] *** JesseW has joined #archiveteam-bs
[06:13] JesseW: starting a hopefully much smaller grab of stories starting with id 10 million on up
[06:14] bsmith093: it's a batch job, running on my home server (where I normally run the Warrior).
[06:14] bsmith093: great, please put it in a zip file when it's done. :-)
[06:14] currently it's 5G from the end of the Ds
[06:25] 2.8G from the end of the Ds
[06:44] and finished D! 17GB -> 6.3GB
[06:45] and E is only 3GB uncompressed, so it should go quickly
[06:45] but I'm going to sleep soon; I'll update in the morning about how far it's gotten
[06:53] *** GLaDOS has joined #archiveteam-bs
[07:10] E is done, and went from 3 -> 1.1
[07:32] *** JesseW has quit IRC (Quit: Leaving.)
[07:56] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[08:14] *** metalcamp has joined #archiveteam-bs
[08:20] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[09:23] *** bwn has quit IRC (Ping timeout: 492 seconds)
[09:44] *** bwn has joined #archiveteam-bs
[09:58] *** vitzli has quit IRC (Leaving)
[10:14] *** fpoee has joined #archiveteam-bs
[10:20] *** plog99 has quit IRC (Read error: Operation timed out)
[10:33] *** GLaDOS has joined #archiveteam-bs
[10:39] *** bzc6p has joined #archiveteam-bs
[10:39] *** swebb sets mode: +o bzc6p
[10:46] When you see "You are strictly forbidden to share/distribute/archive this free stuff" and "that external content is not available any more" on the same webpage. *facepalm*
[10:46] Both with bold red letters followed by multiple exclamation marks.
[11:00] You have heard bzc6p's monthly ventilation. Oh, here comes the ops train.
[11:00] *** bzc6p sets mode: +ooo achip arkiver BnA-Rob1n
[11:00] *** bzc6p sets mode: +oooo chfoo Fletcher Fletcher_ GLaDOS
[11:00] *** bzc6p sets mode: +oooo godane HCross HCross2 joepie91
[11:00] *** bzc6p sets mode: +ooo Kenshin Kazzy SimpBrain
[11:00] *** bzc6p sets mode: +ooo Start wp494 yipdw
[11:16] *** vitzli has joined #archiveteam-bs
[11:22] *** bzc6p has left
[13:10] *** vitzli has quit IRC (Ping timeout: 246 seconds)
[13:17] *** RichardG_ has joined #archiveteam-bs
[13:24] *** RichardG has quit IRC (Read error: Operation timed out)
[14:06] *** VADemon has joined #archiveteam-bs
[14:36] *** yakfish has quit IRC (Read error: Operation timed out)
[14:37] *** yakfish has joined #archiveteam-bs
[14:39] *** trane has joined #archiveteam-bs
[14:39] *** trane has left
[14:39] *** ohhdemgir has joined #archiveteam-bs
[14:44] *** RichardG has joined #archiveteam-bs
[14:51] *** RichardG_ has quit IRC (Ping timeout: 633 seconds)
[15:21] *** metalcamp has joined #archiveteam-bs
[15:39] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[16:17] *** zino has quit IRC (Read error: Operation timed out)
[16:49] *** JesseW has joined #archiveteam-bs
[16:52] The FanFictionNet repack is now on P; A through O (excluding H & N) is 49GB
[17:47] could you do an ls -a of the files before you dump the uncompressed ones? probably want to compress that too :)
[17:48] wow, that was fast!
[17:49] sure
[17:49] I mean, it won't be different than your inventory file, though
[17:50] P done: 14 -> 5.2
[17:50] Q done: 0.19G -> 0.07G
[17:51] I don't think you meant "ls -a" ...?
[17:51] Maybe you meant ls -r?
[17:51] or find?
[17:54] *** schbirid has joined #archiveteam-bs
[18:12] *** metalcamp has joined #archiveteam-bs
[18:22] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[18:29] R done: 6.4G -> 2.4G
[18:30] Now doing the hideously large S -- 31GB
[18:30] mainly due to lots and lots of crossovers, I think.
[18:33] ~ 14GB in over 1G chunks, which leaves 17GB in over *3000* other fandoms
[18:39] JesseW: might be considerably smaller, though, because of the lack of "home/Desktop/etc" in every path
[18:40] JesseW: yes, that would be supernatural, i think
[18:42] Well, Supernatural *is* the largest (at 3.9G) but Sailor Moon is next at 2.5G, followed by Star Wars at 2.0G
[19:01] *** JesseW has quit IRC (Quit: Leaving.)
[19:20] *** bwn has quit IRC (Ping timeout: 246 seconds)
[19:32] *** zino has joined #archiveteam-bs
[19:54] *** bwn has joined #archiveteam-bs
[19:55] *** JetBalsa has joined #archiveteam-bs
[19:56] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[20:00] *** zhongfu has joined #archiveteam-bs
[20:53] *** metalcamp has joined #archiveteam-bs
[20:59] *** schbirid has quit IRC (Quit: Leaving)
[21:03] *** JesseW has joined #archiveteam-bs
[21:05] *** RichardG has quit IRC (Read error: Operation timed out)
[21:07] Up to Star Wars in the repack.
[21:10] *** xXx_ndidd has quit IRC (Read error: Operation timed out)
[21:30] *** DopefishJ has joined #archiveteam-bs
[21:30] *** swebb sets mode: +o DopefishJ
[21:31] *** Boltsie has quit IRC (Read error: Connection reset by peer)
[21:31] *** BnA-Rob1n has quit IRC (Read error: Connection reset by peer)
[21:31] *** JSharp___ has quit IRC (Write error: Connection reset by peer)
[21:33] *** JSharp___ has joined #archiveteam-bs
[21:33] *** Boltsie has joined #archiveteam-bs
[21:33] *** DFJustin has quit IRC (Ping timeout: 260 seconds)
[21:33] *** BnA-Rob1n has joined #archiveteam-bs
[21:35] *** Ctrl-S___ has quit IRC (Ping timeout: 260 seconds)
[21:35] *** _desu___ has quit IRC (Ping timeout: 260 seconds)
[21:36] *** _desu___ has joined #archiveteam-bs
[21:38] *** Ctrl-S___ has joined #archiveteam-bs
[21:39] *** BnA-Rob1n has quit IRC (Quit: Updating details, brb)
[21:40] *** BnA-Rob1n has joined #archiveteam-bs
[21:57] *** metalcamp has quit IRC (Quit: Bye)
[21:58] *** metalcamp has joined #archiveteam-bs
[22:00] JesseW: hey, feel like doing statistical analysis on this massive corpus?
[22:01] might be more trouble than it's worth, grabbing all the metadata out of every. single. file.
[22:02] specifically the first 27 lines at the beginning. i just checked
[22:03] I was thinking that would be interesting to do, yeah
[22:03] Probably pull the 27 lines from each file and stuff them in a sqlite table?
[22:04] Then upload that, too?
[22:04] bsmith093:
[22:04] oohhh, that would be awesome!
[22:04] the colums would even be labeled!
[22:05] columns*
[22:05] :-)
[22:05] holy crap you rock! can't afford gold, so have some reddit silver
[22:06] again, I wouldn't be able to do it if you hadn't taken the time to extract that corpus first.
[22:06] (and neither of us would have been able to do it if the (unknown?) people behind fanfiction.net hadn't been willing to provide it for this long.)
[22:07] (and we all owe the fanfic *authors* gratitude for writing it in the first place)
[22:07] https://www.google.com/search?q=reddit+silver&tbm=isch&imgil=Ebbr9Fm8RlfkqM%253A%253BxOpeMr0fiUOvhM%253Bhttps%25253A%25252F%25252Fwww.reddit.com%25252Fuser%25252Flotsalote%25252Fgilded%25252F&source=iu&pf=m&fir=Ebbr9Fm8RlfkqM%253A%252CxOpeMr0fiUOvhM%252C_&usg=__EY7Dc8QlfzxNYwsteoEtJeHtRrc%3D
[22:07] damn straight! creatives rule(34)
[22:08] lol! I hadn't seen reddit silver before.
[22:08] i've read stories that turn the canon of their respective universes into something much better than that canon probably deserves
[22:09] :-)
[22:09] for example https://www.fanfiction.net/s/75517/1/Shadows-of-the-Past
[22:10] this guy made me care about RAINBOW BRITE!!! nuff said
[22:11] so are the headers fully consistent? i.e. title always on line 4, author on line 6, etc?
[22:12] S done: 31G -> 12G
[22:12] very nearly always, except that for a few hundred(ish) stories the packaged, published, and updated dates will either not be there or not have times with the dates.
[22:12] everything is always in that order
[22:13] ok cool
[22:13] when grabbing, maybe just grab the path too, to save time, and just grab the first 27 lines first, less data to comb through
[22:15] i had an app that saved fanfic to, apparently, a sql db file. it turns out to be rather difficult to tell sql to undo that. also reddit is awesome, had a script in 3 hours
[22:15] well, what my script will do is grab the first 27 lines and convert them into a CSV row, then import those into sqlite
[22:16] U done: 0.423G -> 0.156G
[22:17] Only about 20G more total (then the three big ones)
[22:17] (and the rest of their letters)
[22:17] when i switched to calibre i dumped those id numbers from the sql db into the fanficfare plugin, all 8000+, waited the 2.7 days it took to grab them, then sorted all the dead ones, and ran that script against the dead list. got 300 more.
[22:17] you're going in order right, so 5 more letters, then the 3 huge ones.
[22:18] check a random file though, i said 27 to grab some blank lines at the end.
[22:18] yep
[22:28] *** Rickster has quit IRC (Ping timeout: 260 seconds)
[22:29] *** goekesmi has quit IRC (Ping timeout: 260 seconds)
[22:41] *** Rickster has joined #archiveteam-bs
[22:46] *** acridAxid has quit IRC (marauder)
[22:46] *** BlueMaxim has joined #archiveteam-bs
[22:47] *** goekesmi has joined #archiveteam-bs
[22:48] *** acridAxid has joined #archiveteam-bs
[22:48] V done: 5.3G -> 2G
[22:54] *** DopefishJ is now known as DFJustin
[23:13] *** ndiddy has joined #archiveteam-bs
[23:16] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[23:23] *** dxrt- has quit IRC (Ping timeout: 260 seconds)
[23:32] I am super happy this laptop has an ethernet port
[23:32] 35 MB/s vs 11 MB/s: A Thing
[23:53] yipdw: wired is always faster
[23:53] W done: 7 -> 2.6
[23:57] yipdw: i recently replaced a hub i had in my network with a gigabit unmanaged switch, internal speeds tripled.
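
The header-extraction plan discussed above (take the first 27 lines of each story file, keep the path, load the result into sqlite) could be sketched roughly as below. This is not the script mentioned in the channel. The table name, the .txt extension, and the one-row-per-header-line layout are assumptions made for illustration, and the sketch skips the CSV step and stores raw lines because the exact header fields are not spelled out in the log.

```python
import itertools
import sqlite3
import tempfile
from pathlib import Path

HEADER_LINES = 27  # figure quoted in the discussion; check it against a real file


def read_header(path: Path, n: int = HEADER_LINES) -> list[str]:
    """Return at most the first n lines of a file without reading the rest."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in itertools.islice(fh, n)]


def load_headers(root: Path, conn: sqlite3.Connection) -> int:
    """Store (path, line number, text) per header line; return the file count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS header_lines "
        "(path TEXT, line_no INTEGER, text TEXT)"
    )
    count = 0
    # Assumes stories are .txt files somewhere below root (hypothetical layout).
    for story in sorted(root.rglob("*.txt")):
        rows = [
            (str(story.relative_to(root)), i, text)
            for i, text in enumerate(read_header(story), start=1)
        ]
        conn.executemany("INSERT INTO header_lines VALUES (?, ?, ?)", rows)
        count += 1
    conn.commit()
    return count


def demo() -> tuple[int, int]:
    """Run the loader on one synthetic 40-line file in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "S").mkdir()
        (root / "S" / "story.txt").write_text(
            "\n".join(f"line {i}" for i in range(1, 41)), encoding="utf-8"
        )
        conn = sqlite3.connect(":memory:")
        files = load_headers(root, conn)
        stored = conn.execute("SELECT COUNT(*) FROM header_lines").fetchone()[0]
        conn.close()
        return files, stored


FILES_SEEN, LINES_STORED = demo()
```

Keeping the raw lines with their line numbers leaves the field parsing for later, which may help with the few hundred stories whose date lines are missing or incomplete.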