Time |
Nickname |
Message |
00:06
🔗
|
JesseW |
BlueMax: repackage the existing ~200GB tar file of FanFiction.Net stories in a way that makes it easier for people to extract individual ones without downloading extra stuff. |
00:21
🔗
|
JesseW |
OK, started zipping up the A's. |
00:34
🔗
|
bsmith093 |
will this extract faster than a tar.gz file |
00:46
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
00:56
🔗
|
|
BlueMax has quit IRC (Read error: Operation timed out) |
01:07
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
01:08
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
01:13
🔗
|
JesseW |
I think so? I'll test... |
01:13
🔗
|
JesseW |
It's made it to the Av's... |
01:14
🔗
|
JesseW |
In the process of zipping it up |
01:16
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
01:17
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
01:20
🔗
|
bsmith093 |
i mean any random file. |
01:20
🔗
|
JesseW |
IDK |
01:21
🔗
|
JesseW |
I mean, unlike a tar.gz, it *can* allow random access. |
01:21
🔗
|
JesseW |
But I don't know if unzipping all of it is faster or slower |
01:21
🔗
|
bsmith093 |
yay, that's what i meant! |
01:21
🔗
|
JesseW |
2.1G left to go |
01:23
🔗
|
JesseW |
1.8 |
01:24
🔗
|
godane |
SketchCow: 2011 of kpfa are all uploaded |
01:25
🔗
|
godane |
also 2012-01 of kpfa is uploaded |
01:27
🔗
|
JesseW |
nice! |
01:30
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
01:34
🔗
|
bsmith093 |
godane: kpfa? |
01:36
🔗
|
JesseW |
https://en.wikipedia.org/wiki/KPFA |
01:37
🔗
|
JesseW |
The A's are compressed -- went from ~11GB to 4GB |
01:37
🔗
|
bsmith093 |
whoo! good ratio |
01:39
🔗
|
JesseW |
and it takes an unnoticeable amount of time to extract a single file |
01:40
🔗
|
JesseW |
specifically, about 0.15s |
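The fast single-file extraction reported above follows from how zip archives are laid out: an index (the central directory) sits at the end of the file, so a reader can seek straight to one member instead of decompressing everything before it, as a tar.gz requires. A minimal sketch using Python's standard zipfile module; the function name, archive path and member name are hypothetical, not taken from the actual repack:

```python
import zipfile


def read_one_story(zip_path, member_name):
    """Return the bytes of one archive member without unpacking the rest."""
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile loads the central directory on open, so this seeks
        # directly to the requested member and decompresses only that.
        with archive.open(member_name) as story:
            return story.read()
```

Calling read_one_story("A.zip", "some/story.txt") would return just that story's bytes; both names here are placeholders.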
01:50
🔗
|
JesseW |
OK, now doing all the letters except H, N and T. |
01:56
🔗
|
bsmith093 |
awesome! |
01:56
🔗
|
bsmith093 |
it might be easier just to move those folders out of the path. |
01:58
🔗
|
godane |
i'm also uploading more koreanet videos |
01:58
🔗
|
godane |
http://archive.org/details/koreanet-1_chuncheon_pg_goodgw-20080221 |
01:58
🔗
|
godane |
i'm also looking at archiving coverville mp3s |
01:59
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
02:01
🔗
|
godane |
i also figure i should be up to date with kpfa by may or june at the rate i'm going |
02:02
🔗
|
godane |
i'm downloading march 2012 with 13 proxies right now |
02:04
🔗
|
JesseW |
bsmith093: I should be able to use -x to exclude them; I just wanted to handle it manually |
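The exclusion discussed above (skipping the H, N and T folders) can also be done without moving anything out of the path. A rough Python equivalent of excluding directories with zip's -x option, assuming one top-level folder per letter; the function name and that layout are assumptions for illustration, not the batch job actually used:

```python
import os
import zipfile


def zip_tree(root, zip_path, skip_dirs=()):
    """Zip everything under root except the top-level folders in skip_dirs."""
    skip = set(skip_dirs)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, dirnames, filenames in os.walk(root):
            if dirpath == root:
                # Pruning dirnames in place stops os.walk from descending
                # into the excluded folders at all.
                dirnames[:] = [d for d in dirnames if d not in skip]
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Store paths relative to root, not the absolute path.
                archive.write(full, os.path.relpath(full, root))
```

For example, zip_tree("stories", "rest.zip", skip_dirs=("H", "N", "T")) would pack every other letter. The output zip should live outside root so it does not try to include itself.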
02:04
🔗
|
bsmith093 |
k |
02:13
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
02:37
🔗
|
|
VADemon has quit IRC (Quit: left4dead) |
02:48
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
02:49
🔗
|
|
JesseW has joined #archiveteam-bs |
03:10
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
03:18
🔗
|
|
ohhdemgir has quit IRC (Quit: True) |
03:34
🔗
|
|
JRWR has quit IRC (Read error: Connection reset by peer) |
03:56
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
04:22
🔗
|
JesseW |
Finished the B's. |
04:22
🔗
|
JesseW |
15 -> 5.5 |
04:36
🔗
|
|
vitzli has joined #archiveteam-bs |
05:01
🔗
|
JesseW |
C: 12 -> 4.4 |
05:06
🔗
|
|
acridAxid has quit IRC (marauder) |
05:08
🔗
|
|
acridAxid has joined #archiveteam-bs |
05:11
🔗
|
bsmith093 |
do you have to trigger each letter manually, or is it a batch job? |
05:34
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
05:58
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
06:05
🔗
|
|
Sk1d has joined #archiveteam-bs |
06:12
🔗
|
|
JesseW has joined #archiveteam-bs |
06:13
🔗
|
bsmith093 |
JesseW: starting a hopefully much smaller grab of stories starting with id 10 million on up |
06:14
🔗
|
JesseW |
bsmith093: it's a batch job, running on my home server (where I normally run the Warrior). |
06:14
🔗
|
JesseW |
bsmith093: great, please put it in a zip file when it's done. :-) |
06:14
🔗
|
JesseW |
currently it's 5G from the end of the Ds |
06:25
🔗
|
JesseW |
2.8G from the end of the Ds |
06:44
🔗
|
JesseW |
and finished D! 17GB -> 6.3GB |
06:45
🔗
|
JesseW |
and E is only 3GB uncompressed, so it should go quickly |
06:45
🔗
|
JesseW |
but I'm going to sleep soon; I'll update in the morning about how far it's gotten |
06:53
🔗
|
|
GLaDOS has joined #archiveteam-bs |
07:10
🔗
|
JesseW |
E is done, and went from 3 -> 1.1 |
07:32
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
07:56
🔗
|
|
GLaDOS has quit IRC (Quit: Oh crap, I died.) |
08:14
🔗
|
|
metalcamp has joined #archiveteam-bs |
08:20
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
09:23
🔗
|
|
bwn has quit IRC (Ping timeout: 492 seconds) |
09:44
🔗
|
|
bwn has joined #archiveteam-bs |
09:58
🔗
|
|
vitzli has quit IRC (Leaving) |
10:14
🔗
|
|
fpoee has joined #archiveteam-bs |
10:20
🔗
|
|
plog99 has quit IRC (Read error: Operation timed out) |
10:33
🔗
|
|
GLaDOS has joined #archiveteam-bs |
10:39
🔗
|
|
bzc6p has joined #archiveteam-bs |
10:39
🔗
|
|
swebb sets mode: +o bzc6p |
10:46
🔗
|
bzc6p |
When you see "You are strictly forbidden to share/distribute/archive this free stuff" and "that external content is not available any more" on the same webpage. *facepalm* |
10:46
🔗
|
bzc6p |
Both with bold red letters followed by multiple exclamation marks. |
11:00
🔗
|
bzc6p |
You have heard bzc6p's monthly ventilation. Oh, here comes the ops train. |
11:00
🔗
|
|
bzc6p sets mode: +ooo achip arkiver BnA-Rob1n |
11:00
🔗
|
|
bzc6p sets mode: +oooo chfoo Fletcher Fletcher_ GLaDOS |
11:00
🔗
|
|
bzc6p sets mode: +oooo godane HCross HCross2 joepie91 |
11:00
🔗
|
|
bzc6p sets mode: +ooo Kenshin Kazzy SimpBrain |
11:00
🔗
|
|
bzc6p sets mode: +ooo Start wp494 yipdw |
11:16
🔗
|
|
vitzli has joined #archiveteam-bs |
11:22
🔗
|
|
bzc6p has left |
13:10
🔗
|
|
vitzli has quit IRC (Ping timeout: 246 seconds) |
13:17
🔗
|
|
RichardG_ has joined #archiveteam-bs |
13:24
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
14:06
🔗
|
|
VADemon has joined #archiveteam-bs |
14:36
🔗
|
|
yakfish has quit IRC (Read error: Operation timed out) |
14:37
🔗
|
|
yakfish has joined #archiveteam-bs |
14:39
🔗
|
|
trane has joined #archiveteam-bs |
14:39
🔗
|
|
trane has left |
14:39
🔗
|
|
ohhdemgir has joined #archiveteam-bs |
14:44
🔗
|
|
RichardG has joined #archiveteam-bs |
14:51
🔗
|
|
RichardG_ has quit IRC (Ping timeout: 633 seconds) |
15:21
🔗
|
|
metalcamp has joined #archiveteam-bs |
15:39
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
16:17
🔗
|
|
zino has quit IRC (Read error: Operation timed out) |
16:49
🔗
|
|
JesseW has joined #archiveteam-bs |
16:52
🔗
|
JesseW |
The FanFictionNet repack is now on P; A through O (excluding H & N) is 49GB |
17:47
🔗
|
bsmith093 |
could you do an ls -a of the files before you dump the uncompressed ones? probably want to compress that too :) |
17:48
🔗
|
bsmith093 |
wow, that was fast! |
17:49
🔗
|
JesseW |
sure |
17:49
🔗
|
JesseW |
I mean, it won't be different than your inventory file, though |
17:50
🔗
|
JesseW |
P done: 14 -> 5.2 |
17:50
🔗
|
JesseW |
Q done: 0.19G -> 0.07G |
17:51
🔗
|
JesseW |
I don't think you meant "ls -a" ...? |
17:51
🔗
|
JesseW |
Maybe you meant ls -r? |
17:51
🔗
|
JesseW |
or find? |
17:54
🔗
|
|
schbirid has joined #archiveteam-bs |
18:12
🔗
|
|
metalcamp has joined #archiveteam-bs |
18:22
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
18:29
🔗
|
JesseW |
R done: 6.4G -> 2.4G |
18:30
🔗
|
JesseW |
Now doing the hideously large S -- 31GB |
18:30
🔗
|
JesseW |
mainly due to lots and lots of crossovers, I think. |
18:33
🔗
|
JesseW |
~14GB is in chunks of over 1G each, which leaves 17GB in over *3000* other fandoms |
18:39
🔗
|
bsmith093 |
JesseW: might be considerably smaller, though, because of the lack of "home/Desktop/etc" in every path |
18:40
🔗
|
bsmith093 |
JesseW: yes, that would be supernatural, i think |
18:42
🔗
|
JesseW |
Well, Supernatural *is* the largest (at 3.9G) but Sailor Moon is next at 2.5G, followed by Star Wars at 2.0G |
19:01
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
19:20
🔗
|
|
bwn has quit IRC (Ping timeout: 246 seconds) |
19:32
🔗
|
|
zino has joined #archiveteam-bs |
19:54
🔗
|
|
bwn has joined #archiveteam-bs |
19:55
🔗
|
|
JetBalsa has joined #archiveteam-bs |
19:56
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
20:00
🔗
|
|
zhongfu has joined #archiveteam-bs |
20:53
🔗
|
|
metalcamp has joined #archiveteam-bs |
20:59
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:03
🔗
|
|
JesseW has joined #archiveteam-bs |
21:05
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
21:07
🔗
|
JesseW |
Up to Star Wars in the repack. |
21:10
🔗
|
|
xXx_ndidd has quit IRC (Read error: Operation timed out) |
21:30
🔗
|
|
DopefishJ has joined #archiveteam-bs |
21:30
🔗
|
|
swebb sets mode: +o DopefishJ |
21:31
🔗
|
|
Boltsie has quit IRC (Read error: Connection reset by peer) |
21:31
🔗
|
|
BnA-Rob1n has quit IRC (Read error: Connection reset by peer) |
21:31
🔗
|
|
JSharp___ has quit IRC (Write error: Connection reset by peer) |
21:33
🔗
|
|
JSharp___ has joined #archiveteam-bs |
21:33
🔗
|
|
Boltsie has joined #archiveteam-bs |
21:33
🔗
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
21:33
🔗
|
|
BnA-Rob1n has joined #archiveteam-bs |
21:35
🔗
|
|
Ctrl-S___ has quit IRC (Ping timeout: 260 seconds) |
21:35
🔗
|
|
_desu___ has quit IRC (Ping timeout: 260 seconds) |
21:36
🔗
|
|
_desu___ has joined #archiveteam-bs |
21:38
🔗
|
|
Ctrl-S___ has joined #archiveteam-bs |
21:39
🔗
|
|
BnA-Rob1n has quit IRC (Quit: Updating details, brb) |
21:40
🔗
|
|
BnA-Rob1n has joined #archiveteam-bs |
21:57
🔗
|
|
metalcamp has quit IRC (Quit: Bye) |
21:58
🔗
|
|
metalcamp has joined #archiveteam-bs |
22:00
🔗
|
bsmith093 |
JesseW: hey, feel like doing statistical analysis on this massive corpus? |
22:01
🔗
|
bsmith093 |
might be more trouble than it's worth, grabbing all the metadata out of every. single. file. |
22:02
🔗
|
bsmith093 |
specifically the first 27 lines at the beginning. i just checked |
22:03
🔗
|
JesseW |
I was thinking that would be interesting to do, yeah |
22:03
🔗
|
JesseW |
Probably pull the 27 lines from each file and stuff them in a sqlite table? |
22:04
🔗
|
JesseW |
Then upload that, too? |
22:04
🔗
|
JesseW |
bsmith093: |
22:04
🔗
|
bsmith093 |
oohhh, that would be awesome! |
22:04
🔗
|
bsmith093 |
the colums would even be labeled! |
22:05
🔗
|
bsmith093 |
columns* |
22:05
🔗
|
JesseW |
:-) |
22:05
🔗
|
bsmith093 |
holy crap you rock! can't afford gold, so have some reddit silver |
22:06
🔗
|
JesseW |
again, I wouldn't be able to do it if you hadn't taken the time to extract that corpus first. |
22:06
🔗
|
JesseW |
(and neither of us would have been able to do it if the (unknown?) people behind fanfiction.net hadn't been willing to provide it for this long.) |
22:07
🔗
|
JesseW |
(and we all owe the fanfic *authors* gratitude for writing it in the first place) |
22:07
🔗
|
bsmith093 |
https://www.google.com/search?q=reddit+silver&tbm=isch&imgil=Ebbr9Fm8RlfkqM%253A%253BxOpeMr0fiUOvhM%253Bhttps%25253A%25252F%25252Fwww.reddit.com%25252Fuser%25252Flotsalote%25252Fgilded%25252F&source=iu&pf=m&fir=Ebbr9Fm8RlfkqM%253A%252CxOpeMr0fiUOvhM%252C_&usg=__EY7Dc8QlfzxNYwsteoEtJeHtRrc%3D |
22:07
🔗
|
bsmith093 |
damn straight! creatives rule(34) |
22:08
🔗
|
JesseW |
lol! I hadn't seen reddit silver before. |
22:08
🔗
|
bsmith093 |
i've read stories that turn the canon of their respective universes into something much better than that canon probably deserves |
22:09
🔗
|
JesseW |
:-) |
22:09
🔗
|
bsmith093 |
for example https://www.fanfiction.net/s/75517/1/Shadows-of-the-Past |
22:10
🔗
|
bsmith093 |
this guy made me care about RAINBOW BRITE!!! nuff said |
22:11
🔗
|
JesseW |
so are the headers fully consistent? i.e. title always on line 4, author on line 6, etc? |
22:12
🔗
|
JesseW |
S done: 31G -> 12G |
22:12
🔗
|
bsmith093 |
very nearly always, except that for a few hundred(ish) stories the packaged, published, and updated dates will either not be there or not have times with the dates. |
22:12
🔗
|
bsmith093 |
everything is always in that order |
22:13
🔗
|
JesseW |
ok cool |
22:13
🔗
|
bsmith093 |
when grabbing, maybe just grab the path too, to save time, and just grab the first 27 lines first, less data to comb through |
22:15
🔗
|
bsmith093 |
i had an app that saved fanfic to, apparently, a sql db file. it turns out to be rather difficult to tell sql to undo that. also reddit is awesome, had a script in 3 hours |
22:15
🔗
|
JesseW |
well, what my script will do is grab the first 27 lines and convert them into a CSV row, then import those into sqlite |
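The plan described above (first 27 lines of each file into a CSV row, then into sqlite) could look roughly like the following. This is a sketch under stated assumptions, not the script JesseW wrote: the function names, the table name headers, and the generic columns line01 to line27 are all hypothetical, since the log does not spell out what each header line holds.

```python
import csv
import itertools
import os
import sqlite3

HEADER_LINES = 27  # count mentioned in the chat; includes trailing blank lines


def header_row(path):
    """Return [path, line 1, ..., line 27], padding short files with ''."""
    with open(path, encoding="utf-8", errors="replace") as story:
        lines = [ln.rstrip("\n") for ln in itertools.islice(story, HEADER_LINES)]
    lines += [""] * (HEADER_LINES - len(lines))
    return [path] + lines


def write_csv(root, csv_path):
    """Write one CSV row per story file found under root."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for dirpath, _dirs, filenames in os.walk(root):
            for name in sorted(filenames):
                writer.writerow(header_row(os.path.join(dirpath, name)))


def load_csv(csv_path, db_path):
    """Import the CSV rows into a sqlite table named 'headers'."""
    columns = ["path"] + [f"line{i:02d}" for i in range(1, HEADER_LINES + 1)]
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(f"CREATE TABLE IF NOT EXISTS headers ({', '.join(columns)})")
        with open(csv_path, newline="", encoding="utf-8") as src:
            conn.executemany(
                f"INSERT INTO headers VALUES ({placeholders})", csv.reader(src)
            )
    conn.close()
```

Storing the path alongside the raw lines matches bsmith093's suggestion to grab the path too. Missing dates in a few hundred stories would simply show up as shorter or differently filled columns, to be cleaned up in SQL afterwards. The CSV and database files should sit outside root so they are not scanned as stories.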
22:16
🔗
|
JesseW |
U done: 0.423G -> 0.156G |
22:17
🔗
|
JesseW |
Only about 20G more total (then the three big ones) |
22:17
🔗
|
JesseW |
(and the rest of their letters) |
22:17
🔗
|
bsmith093 |
when i switched to calibre i dumped those id numbers from the sql db into the fanficfare plugin, all 8000+, waited the 2.7 days it took to grab them, then sorted all the dead ones, and ran that script against the dead list. got 300 more. |
22:17
🔗
|
bsmith093 |
you're going in order, right? so 5 more letters, then the 3 huge ones. |
22:18
🔗
|
bsmith093 |
check a random file though, i said 27 to grab some blank lines at the end. |
22:18
🔗
|
JesseW |
yep |
22:28
🔗
|
|
Rickster has quit IRC (Ping timeout: 260 seconds) |
22:29
🔗
|
|
goekesmi has quit IRC (Ping timeout: 260 seconds) |
22:41
🔗
|
|
Rickster has joined #archiveteam-bs |
22:46
🔗
|
|
acridAxid has quit IRC (marauder) |
22:46
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
22:47
🔗
|
|
goekesmi has joined #archiveteam-bs |
22:48
🔗
|
|
acridAxid has joined #archiveteam-bs |
22:48
🔗
|
JesseW |
V done: 5.3G -> 2G |
22:54
🔗
|
|
DopefishJ is now known as DFJustin |
23:13
🔗
|
|
ndiddy has joined #archiveteam-bs |
23:16
🔗
|
|
metalcamp has quit IRC (Ping timeout: 244 seconds) |
23:23
🔗
|
|
dxrt- has quit IRC (Ping timeout: 260 seconds) |
23:32
🔗
|
yipdw |
I am super happy this laptop has an ethernet port |
23:32
🔗
|
yipdw |
35 MB/s vs 11 MB/s: A Thing |
23:53
🔗
|
bsmith093 |
yipdw: wired is always faster |
23:53
🔗
|
JesseW |
W done: 7 -> 2.6 |
23:57
🔗
|
bsmith093 |
yipdw: i recently replaced a hub i had in my network with a gigabit unmanaged switch, internal speeds tripled. |