00:00 <PepsiMax> '2Al' is done.
00:00 <yipdw> that's one reason inconsolata is an awesome typeface :)
00:00 <yipdw> obvious differentation of 0 and O
00:00 <PepsiMax> Lucida console also works fine
00:03 <chronomex> I use clearlyu-clean on my terminals.
00:05 <alard> I seem to be using Monospace. I can see the difference between 0 and O, just didn't remember that I should look for a difference. :)
00:10 <alard> Is it a good idea to add everything that isn't done yet back to the queue?
00:10 <alard> I can do it in a way that should ensure that it is given to someone else than the first claimant.
00:11 <db48x> how do you know what isn't done yet?
00:11 <PepsiMax> alard: whut? http://pastebin.com/raw.php?i=LAu1unPR
00:12 <alard> Well, a couple of things: I know what has been claimed but hasn't been marked done (602 items). I know what should have been done (generate a list of ids), I know what has been marked done.
00:13 <db48x> ah
00:13 <db48x> yea, no reason no to throw those back into the hopper
00:13 <alard> PepsiMax: A misplaced warc?
00:13 <PepsiMax> /archiveteam/anyhub-grab$ grep -l Cannot data/*/wget*.log
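An aside for readers of the log: `grep -l` prints only the names of files that contain the pattern, which is what makes it a quick way to find failed wget runs. A self-contained re-creation — the directory names are taken from the log, the log contents are invented:

```shell
# Throwaway tree standing in for the anyhub-grab data/ directory.
dir=$(mktemp -d)
cd "$dir"
mkdir -p data/3_q data/4xH
echo 'Cannot write to temporary WARC file.' > data/3_q/wget-3_q-1.log
echo 'Saved 42 files, 0 errors.'            > data/4xH/wget-4xH-1.log
# -l: list only the names of files containing the pattern.
grep -l Cannot data/*/wget*.log   # prints data/3_q/wget-3_q-1.log
```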
00:13 <PepsiMax> alard: it worked before
00:13 <PepsiMax> data/3_q/wget-3_q-1.log
00:13 <PepsiMax> data/4xH/wget-4xH-1.log
00:13 <PepsiMax> data/4xH/wget-4xH-d1.log
00:14 <alard> PepsiMax: You probably have a warc file in your data/ directory. (It looks like it, anyway.)
00:15 <PepsiMax> I do.
00:16 <PepsiMax> alard: can you see what just happend with my rsync? It looked like it started sending data you already should have...
00:17 <alard> 2011/11/18 00:16:50 [22037] rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]
00:17 <alard> 2011/11/18 00:16:50 [22037] rsync error: error in rsync protocol data stream (code 12) at io.c(760) [receiver=3.0.7]
00:17 <alard> 2011/11/18 00:16:50 [22037] rsync: connection unexpectedly closed (5454 bytes received so far) [generator]
00:17 <alard> 2011/11/18 00:16:50 [22037] rsync: read error: Connection reset by peer (104)
00:17 <alard> 2011/11/18 00:16:49 [22037] data/1RE/
00:17 <alard> 2011/11/18 00:16:50 [22037] data/1S6/
00:18 <PepsiMax> I CTRL+c 'd
00:18 <underscor> I have 130 ones that I just requeued
00:18 <alard> PepsiMax: A data/ directory has appeared.
00:18 <underscor> Because they weren't done
00:18 <underscor> alard: Just so you know
00:18 <alard> underscor: Well, I just added ~ 550 items back in the queue to be redone by someone, so it'll probably work.
00:19 <PepsiMax> Hmm /archiveteam/anyhub-grab/data pepsimax@8yourbox::pepsimax/anyhub/
00:19 <alard> PepsiMax: The slashes are always tricky. You may be missing a /
00:20 <PepsiMax> this is so cofusing
00:20 <alard> I think you're uploading data to anyhub/data if you do this.
00:21 <PepsiMax> if im in the data/ dir
00:21 <PepsiMax> no
00:21 <PepsiMax> wait
00:21 <PepsiMax> The location of the main dir is
00:21 <PepsiMax> /mnt/extdisk/archiveteam/anyhub-grab/
00:21 <alard> Can't you run the upload script?
00:21 <PepsiMax> there is?
00:22 <alard> There is, maybe I added it after you last git pulled.
00:22 <PepsiMax> i pulled
00:22 <PepsiMax> didnt saw it
00:22 <PepsiMax> who is dest?
00:22 <alard> Well, the script assumes you'll be uploading to SketchCOw.
00:22 <yipdw> PepsiMax: make sure your source tree contains b543af28807150554b3a0f0958615657def5df4d as an ancestor
00:23 <PepsiMax> Already up-to-date.
00:23 <yipdw> (or is at that commit)
00:23 <yipdw> what's git show HEAD --format=oneline show
00:23 <PepsiMax> yeah
00:23 <PepsiMax> but why?
00:23 <yipdw> then there should be an upload-finished.sh in the root of the repository
00:23 <PepsiMax> hurr
00:23 <PepsiMax> it is here
00:23 <PepsiMax> its kinda working
00:24 <PepsiMax> but i got login details of alard
00:24 <PepsiMax> i moved files around
00:24 <alard> Two options: modify the script, or ask SketchCow for an official rsync slot.
00:24 <PepsiMax> now i forgot how rsync assumes you are copying data
00:24 <PepsiMax> SketchCow: bzzzzzzzzzzzzzzzzzz
00:25 <PepsiMax> Can I haz 25GB of storage
00:27 <underscor> PepsiMax: You want rsync -avP /mnt/extdisk/archiveteam/anyhub-grab/ pepsimax@8yourbox::pepsimax/anyhub/
00:27 <PepsiMax> -avP?
00:27 <underscor> aRchival, Verbose, Progress, Partial
00:28 <underscor> That might give you a "group" error
00:28 <underscor> if it does, you want
00:28 <underscor> rsync rsync -rlptoDvP /mnt/extdisk/archiveteam/anyhub-grab/ pepsimax@8yourbox::pepsimax/anyhub/
00:28 <underscor> Er, 1 rsync
00:28 <PepsiMax> its uploading something
00:29 <PepsiMax> but god knows where
00:29 <PepsiMax> well
00:29 <underscor> Oh, whoops
00:29 <underscor> You want data on the end
00:29 <underscor> my bad
00:29 <underscor> rsync -rlptoDvP /mnt/extdisk/archiveteam/anyhub-grab/data/ pepsimax@8yourbox::pepsimax/anyhub/
00:29 <underscor> or
00:29 <underscor> rsync -avP /mnt/extdisk/archiveteam/anyhub-grab/data/ pepsimax@8yourbox::pepsimax/anyhub/
00:29 <alard> Did everyone suddenly stop their downloaders?
00:30 <underscor> alard: My clients are not running
00:30 <underscor> because I'm running dld-singles
00:30 <alard> Ah, I see.
00:30 <alard> It's eerily quiet on the tracker. 549 items to do, but no requesters.
00:30 <alard> :)
00:31 <yipdw> I'm stuck trying to fix 4z-
00:31 <PepsiMax> heh
00:31 <yipdw> well, h ell
00:31 <PepsiMax> yeah
00:31 <yipdw> I'll spin up another client
00:31 <PepsiMax> alard: i was trying to save the remaining 25GB, its nomming on 0ry now.
00:32 <underscor> alard: Fired up a few clients
00:34 <alard> PepsiMax: rsync, you mean? Good. Unfortunately, I'm shutting things down for today/tonight, so you'll have to continue tomorrow. (Or see if you get hold of SketchCow, which would be even better.)
00:34 <alard> underscor; Thanks.
00:34 <alard> Well, bye all.
00:34 <PepsiMax> Good.
00:34 <PepsiMax> cya
00:35 <underscor> adios
02:59 <underscor> SketchCow: Batcave is about to crap itself, just so you know
05:38 <yipdw> I love how I can see a Splinder download completing in one window -- and then see my name be shoved off the dashboard in seconds in another window
05:45 <chronomex> strange, I'm not seeing anything live in the dashboard.
05:45 <chronomex> ah, it's an opera thing
05:49 <chronomex> wat
05:50 <chronomex> screen won't let me have more than 40 windows?!?
05:59 <closure> there's a maxwin setting you can adjust
06:00 <closure> lol, it can only be set lower than 40.. guess you'd have to recompile. what a strange thing
06:01 <chronomex> you have to recompile and set MAXWIN to something else
06:01 <chronomex> lazy goddamn c programmers
06:02 <underscor> chronomex: Use tmux
06:02 <underscor> It's better
06:02 <underscor> :>
06:02 <closure> must save previous previous bytes in window list
06:03 <chronomex> pfeh, screen works fine.
06:03 <chronomex> for most uses.
06:03 <underscor> A Fatal Error has occurred.
06:03 <underscor> Error Number: -2147205115
06:03 <underscor> Error Source: [SystemInfo: GetSystemConfig] [SystemInfo: GetSystemConfigItem] [SystemInfo: LoadSystemConfigInfo]
06:03 <underscor> Welcome to Prince William County School's Parent Portal
06:03 <underscor> Description: Provider: Microsoft OLE DB Provider for SQL Server Interface: IDBInitialize Description: Timeout expired
06:03 <underscor> Fucking school gradebook
06:03 <closure> little Bobhy Tables must go there
06:04 <chronomex> to be sure, I should modify alard's excellent scraper to run a bunch of them nicely.
06:04 <closure> chronomex: well, I simply do ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure &
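closure's one-liner is five backgrounded invocations of the same script. The equivalent loop, with `dld-client.sh` replaced by a stub so the sketch runs anywhere:

```shell
# Self-contained: the real dld-client.sh is stubbed out here.
dir=$(mktemp -d); cd "$dir"
printf '#!/bin/sh\necho "client run for $1" >> runs.log\n' > dld-client.sh
chmod +x dld-client.sh

for i in 1 2 3 4 5; do
  ./dld-client.sh closure &   # & backgrounds each client
done
wait                          # block until all five exit
wc -l < runs.log              # five runs recorded
```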
06:04 <chronomex> right. but there's a cleaner way I have in mind.
06:14 <underscor> alard: Looks like jstor broke metadata scraping or something
06:14 <underscor> I keep getting WARNING:root:400 POST /data (71.126.138.142): Missing argument meta
06:14 <underscor> WARNING:root:400 POST /data (71.126.138.142) 3754.34ms
06:14 <underscor> on my listener
06:42 * chronomex working on properly multithreadifying splinder
06:42 <chronomex> # reap dead children
07:19 <Nemo_bis> donbex wrote some scripts
07:28 <chronomex> oh yeah?
07:29 <chronomex> I bet so, he's running so fast :P
07:38 <Nemo_bis> that seems not the reason, though :-p
07:39 <Nemo_bis> although part of the reason because there was a bug and at some moment he had 200 processes running
07:39 <chronomex> lol
07:39 <Nemo_bis> this is a genius:
07:39 <Nemo_bis> - Downloading blog from ----------------------.splinder.com... done, with network errors.
07:40 <chronomex> "is that eighteen or 22 hyphens?"
07:42 <Nemo_bis> yes, a very inspiring domain name
07:45 <Nemo_bis> i've put his scripts on http://toolserver.org/~nemobis/
07:46 <Nemo_bis> I'm not sure he agrees, though; ssshhh
08:05 <ersi> "If I make the name totally super hard, it's gonna be secret so no one will find it"
08:06 <chronomex> clearly
08:12 <ersi> then again, firefucks had problems loading urls with - in them earlier
08:13 <ersi> on linux atleast, I think it worked on the windows version
08:15 <chronomex> splinder shouldn't have allowed it.
08:15 <chronomex> being contrary to the dns spec and all
08:30 <chronomex> there we go, thready version works great.
08:30 <chronomex> gonna put it on github and issue a pull request in a bit
08:40 <chronomex> wooo I own the realtime list
08:40 <chronomex> ish
08:51 <chronomex> 12/50 PID 8006 finished 'us:replicawatch7': Success.
08:51 <chronomex> replicawatch7, eh. excellent.
08:57 <chronomex> dld-streamer.sh is now in the archiveteam splinder git repository, if y'all want to use it
08:57 <chronomex> usage: ./dld-streamer.sh <you> 40 or whatever number you can handle
08:58 <chronomex> caveats: it loads up your system right good
08:58 <yipdw> does it generate the same results as the old version?
08:58 <yipdw> (never hurts to ask)
08:58 <chronomex> so far as I can make it, yes. it's a modified dld-client.sh
08:59 <chronomex> console output from dld-single.sh goes to $username.log , which is deleted upon successful completion of the user.
09:02 <chronomex> multithreaded programming in bash is kind of a bitch, so I'm not completely certain if I'm properly catching failed dld-single.sh's.
09:02 <yipdw> I didn't even know bash did threads
09:02 <chronomex> a second pair of eyes would help
09:02 <chronomex> well, it's not threads so much as management of a lot of background tasks.
09:02 <yipdw> oh
09:02 <yipdw> so &?
09:03 <chronomex> yes
09:03 <chronomex> but then I have to save the pid I just forked off and check if it's still around periodically
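The bookkeeping chronomex describes — fork with `&`, save `$!`, poll with `kill -0` — in miniature. The worker here is an invented stand-in for dld-single.sh:

```shell
# Hypothetical worker; the real one would be dld-single.sh.
worker() { sleep 1; }

pids=""
for i in 1 2 3; do
  worker &
  pids="$pids $!"    # save the pid just forked off
done

# Periodically check whether each saved pid is still around.
# kill -0 sends no signal; it only tests that the process exists.
while :; do
  alive=0
  for pid in $pids; do
    kill -0 "$pid" 2>/dev/null && alive=$((alive + 1))
  done
  [ "$alive" -eq 0 ] && break
  sleep 1
done
wait    # reap dead children
echo "pool drained"
```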
09:07 <alard> chronomex: Cool. Would it be possible to make it so that it keeps running when it doesn't get a username? (With a sleep of 10 to 30 seconds in between, for example?)
09:07 <chronomex> I'm sure it would, but I don't have a way to test that easily.
09:09 <chronomex> closure just did a user named "it:MS.Dos". I wonder how that works.
09:09 <chronomex> oh wait, users != blogs
09:09 <chronomex> kind of like tumblr I suppose
09:16 <yipdw> oh good lord
09:16 <yipdw> - Parsing profile HTML to extract media urls... done.
09:16 <yipdw> - Downloading 834 media files...
09:16 <yipdw> gonna be here a while
09:16 <chronomex> heh
09:19 <yipdw> though, that's still not bad compared to the Redazione profile
09:19 <yipdw> I've been downloading the first blog on that profile for a week now
09:20 <yipdw> it would probably be a good idea for someone else could start up a download for that profile
09:20 <yipdw> just in case mine errors out
09:20 <chronomex> any idea what they're writing about?
09:21 <yipdw> I think it's some sort of official Splinder account
09:21 <yipdw> "redazione" means "editorial staff", I think
09:21 <chronomex> ah.
09:22 <yipdw> http://www.splinder.com/profile/Redazione/blogs is what worries me
09:22 <yipdw> journal.splinder.com is the first one out of 8
09:23 <yipdw> and they all have many entries, each with many comments :P
09:25 <yipdw> http://www.splinder.com/myblog/comment/list/4212591/49575751?from=400
09:25 <yipdw> oh, NOW I know why some of these blogs have such huge comment pages.
09:27 <chronomex> hahaha
09:27 <chronomex> lesbian strapon fisting!
09:27 <chronomex> wtf is that, do they have a prosthetic arm connected to a harness or something
09:28 <yipdw> I don't know, but I bet anime has the answer
09:28 <chronomex> I'm good
09:49 <Nemo_bis> I once tried to archive it.wikihow.com and it failed while downloading a talk page with a 10 GiB ish history full with spam
09:50 <Nemo_bis> surprisingly, deleting it didn't kill their servers
10:00 <chronomex> woops, dld-streamer.sh has a bug where it goes into a spinloop during stopping state
10:41 <Nemo_bis> chronomex, what's the opposite of touch STOP ?
10:41 <chronomex> opposite how?
10:41 <Schbirid> rm STOP i guess
10:41 <Nemo_bis> I don't knw, how do you stop the stopping?
10:41 <chronomex> rm STOP
10:41 <chronomex> :)
10:41 <Nemo_bis> ah ok
10:41 <Nemo_bis> thanks :)
10:42 <chronomex> you have to do it before the script sees STOP, which it will do usually within a second or two after you create it
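The STOP convention is a sentinel file the main loop checks between dispatches. A minimal sketch in which the operator's `touch STOP` is simulated after three rounds; an `rm STOP` that lands before the loop's next check would cancel the shutdown, as chronomex says:

```shell
dir=$(mktemp -d); cd "$dir"
dispatched=0
while [ ! -f STOP ]; do
  dispatched=$((dispatched + 1))        # a real streamer forks a worker here
  # Simulate the operator running `touch STOP` after a few rounds.
  [ "$dispatched" -ge 3 ] && touch STOP
done
echo "saw STOP after $dispatched dispatches; entering reap mode"
```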
10:42 * Nemo_bis didn't get how it works
10:42 <Nemo_bis> yep, I'm waiting for some processes to complete
10:50 <chronomex> it works for you too? excellent.
10:51 <chronomex> not that I don't expect it, but that's good to hear
10:53 <Nemo_bis> no, I didn't start it yet because I'm stopping the running dld-client now
10:53 <Nemo_bis> 44 left
10:53 <chronomex> aye
10:54 <chronomex> it's safe to run in a different terminal.
10:54 <chronomex> similarly, you can run as many dld-client.sh in that directory as you want at once
10:54 <Nemo_bis> hmm, yes, I'll do so because they're so slooooooooooooow now
10:55 <chronomex> wait
10:55 <chronomex> I suggest you run it as
10:55 <chronomex> ionice -c 3 -n 5 nice -n 10 ./dld-streamer.sh <name> 50
10:56 <chronomex> that way it's less likely to completely eat your machine
10:58 <Nemo_bis> right now it's not a problem
10:58 <db48x> hrm
10:58 <Nemo_bis> I tried ionice -c 3 but then many processes died because were not able to write to disk
10:59 <db48x> 17.5 KB/s is not the ideal transfer rate for this 40 gigs of data
10:59 <Nemo_bis> heh
11:00 <db48x> especially since the machines are right next to each other
11:04 <db48x> hrm
11:04 <chronomex> db48x: eew
11:04 <db48x> one of my splinder clients is stuck retrying --------.splinder.com
11:06 <db48x> it:luke1989
11:08 <Schbirid> chronomex: wait, you can pass eg 50 to make it dl 50 at once?
11:08 <chronomex> yeah
11:09 * Schbirid stops a lot of terminal windows...
11:09 <chronomex> heh
11:09 <chronomex> I suggest sticking to around 50 in each streamer instance, you can run multiple streamers at once
11:11 <chronomex> see how much it loads your machine ... it can really churn your disk around.
11:11 <Schbirid> yeah
11:12 <Schbirid> got around 12 io/s and latency of 100ms but i have no idea how much it is
11:12 <Schbirid> just compared to the normal almost-0 it sure shows up on the graphs
11:12 <chronomex> lol
11:25 <db48x> the download speed is going down
11:35 <Nemo_bis> for me too
11:35 <Nemo_bis> probably servers overloaded
11:36 <Nemo_bis> (this is peak hour)
11:36 <Schbirid> archiveteam, your free load testing service
11:37 <Nemo_bis> nah, I doubt it, we're not pulling so much
11:37 <chronomex> <3
11:43 <Nemo_bis> chronomex, is it normal that it's always the 49th or 50th process to tell me that a user is completed? http://p.defau.lt/?WnNq7oqrkhMYTsNPAhdNvQ
11:43 <chronomex> oh, yeah, that deserves some explanation
11:44 <chronomex> the first number is how many processes are running, the second is how many you want
11:44 <Nemo_bis> ah, ok
11:44 <chronomex> I don't number them individually, I just count how many there are
11:44 <chronomex> so, yes.
11:44 <Nemo_bis> so only PID tells something
11:44 <chronomex> right
11:44 <Nemo_bis> ok
11:44 <chronomex> yep
11:46 <chronomex> huh. https://plus.google.com/112313173544747389010/posts/UouzhaSbB1M
12:09 <Cameron_D> So I ahve ~15 splinder processes that are still running 12 hours after a touch STOP
12:09 <Cameron_D> and they are still downloading
12:13 <ersi> chronomex: Bwahaha
12:13 <PepsiMax> Cameron_D: huge downloads then.
12:13 <ersi> What a bunch of fucking retards Backupify is
12:13 <Cameron_D> PepsiMax, yeah :/
12:14 <Nemo_bis> eventually every client bumps into a big user and gets stuck with it :/
12:19 <PepsiMax> but the time between the small ones is so big.
12:20 <PepsiMax> so you start a few clients, to speed up the small user-waiting time
12:20 <PepsiMax> and booom, 4 clients sucking the internet.
12:25 <Cameron_D> I think I'm doing it right... http://i.imgur.com/Ln10H.jpg
12:26 <PepsiMax> lol skynet
12:26 <Cameron_D> Unoriginal name is unoriginal :P
14:00 <PepsiMax> bleerg
15:11 <Nemo_bis> No more usernames available. Entering reap mode...
15:11 <Nemo_bis> ??
15:11 <DoubleJ> Any chance of making the dld-client try harded or be more patient when telling the tracker it's done? I have a quarter of my processes that stopped this morning because they didn't hear back quickly enough.
15:11 <DoubleJ> s/harded/harder/
15:24 <Nemo_bis> the same for me
15:25 <Nemo_bis> and dld-streamer stops everything the first time it doesn't receive a user
16:55 <SketchCow> Summary of payments: * Your withdrawal of $107981.00 succeeded for a bank account.
16:55 <SketchCow> TIME TO GO SHOPPPPPPINGGGGGGGG
17:05 <Schbirid> >:)
17:05 <soultcer> Kickstarter is an awesome invention
17:10 <rude___> what will you be shooting with SketchCow?
17:17 <Schbirid> "./dld-client.sh mynick 30" will try 30 at once, correct?
17:27 <Nemo_bis> Schbirid, that's dld-streamer.sh
17:28 <Schbirid> oh
17:29 <Schbirid> oh yes
17:29 <Schbirid> thanks
17:53 <PepsiMax> For those who missed it: ARCHIVE TEAM: A Distributed Preservation of Service Attack http://youtu.be/-2ZTmuX3cog
17:56 <PepsiMax> anyhub deaded?
18:17 <godane> i found out how to make wikipedia work offline: http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
18:18 <godane> its best when you need a mirror of it locally
18:20 <Cowering> heh heh.. best to mirror *.gr while you are at it :)
18:26 <godane> why is it that wikipedia dumps are in bz2?
18:27 <godane> i have see some are not compressed at all
18:27 <godane> part of me thinks wikipedia dumps need to be done in lzma or xz
18:29 <Nemo_bis> there are also 7z versions
18:30 <Nemo_bis> bzip archives are needed for some applications which process them without uncompressing them
18:30 <godane> ok
18:30 <godane> thought there was a bzip2recover for xz
18:30 <godane> or lzma
18:30 <Nemo_bis> and lzma doesn't improve much in some cases; it's very useful for complete histories with a lot of duplicate text
18:31 <godane> i have seen 10% improvement
18:31 <Nemo_bis> that's not much :)
18:31 <Nemo_bis> 7z archives of full history are about 1/5 of bzip2
18:31 <godane> it is when your talking 7.3gb
18:31 <godane> saves 700mb
18:32 <Nemo_bis> which is not much if the archive uncompressed is 730 GiB
18:32 <Nemo_bis> and 7z lets you not uncompress it
18:32 <Nemo_bis> s/7z/bzip2
18:33 <godane> found a for 7z to do a 7zcat and 7zless
18:33 <godane> *a script
18:34 <godane> exec 7z e -so -bd "$@" 2>/dev/null | cat
18:34 <Nemo_bis> hm, doesn't look very efficient: piping 350 GiB?
18:34 <Nemo_bis> anyway, I don't really know what I'm talking about, just repeating things I heard real experts say .-p
18:35 <Nemo_bis> you might want to read https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l where this was discussed multiple times
19:06 <DFJustin> I use wikitaxi for offline wikipedia which is a nice ready-made app but there's like one guy writing it and he doesn't update enough
19:15 <ersi> godane: becauze xz is a fucking dickbitch to compress with
19:15 <ersi> since it takes a billion more resources
19:50 <SketchCow> Just slammed another 5gb of shareware CDs in
19:50 <SketchCow> I'm trying to empty batcave out in the next week
19:50 <SketchCow> Except Berlios and the current projects, while we determine ingestion
19:52 <DFJustin> I notice the cds that have gone in in the past while don't have the auto-generated file listings and galleries
19:54 <SketchCow> I have to initiate them.
19:54 <SketchCow> Shortly, I'll write something to extract them.
19:54 <SketchCow> Right now, I need the iso location of them to do it.
19:56 <SketchCow> Like, what I SHOULD do is you give it an item name, it checks if the work has been done, and then it does it if there's nothing.
19:56 <SketchCow> Then I can just glorb the rss feed
19:59 <DFJustin> isn't there already a derive infrastructure for this sort of thing
20:00 <underscor> Yeah, but adding derive jobs to the system is a pain in the ass
20:31 <chronomex> godane, Nemo_bis: wiki dumps compress even better when you first diff-compress them and then bzip the diffs.
20:31 <chronomex> godane: Nemo_bis: https://github.com/chronomex/wikiscraper is a tool that takes a wiki dump and turns it into a version-control repo
20:32 <Nemo_bis> chronomex, are you interested in doing the opposite? :)
20:33 <chronomex> hmmm, maybe, what do you have in mind?
20:34 <Nemo_bis> old diffs of Wikipedia in UseModWiki
20:34 <chronomex> ? do link
20:35 <Nemo_bis> http://reagle.org/joseph/blog/social/wikipedia/10k-redux.html
20:35 <DoubleJ> Hm. Splinder tracker seems to be getting less reliable. Just had to restart about half a dozen out of 16 processes for failure to tell teh tracker the user was finished.
20:36 <chronomex> hmmm.
20:37 <DoubleJ> I'd think it was at my end, but my rsync up to batcave was uninterrupted and others reported the same problem before.
20:37 <DoubleJ> So that leaves the tracker looking like the weak link.
20:38 <chronomex> I'm away from my box right now but it's not showing up on the dashboard.
20:38 <chronomex> hm.
20:38 <Nemo_bis> yes, and sometimes it also fails to give a user and dld-streamer stops adding processes
20:38 <chronomex> i don't think there's a huge problem with failing to notify the tracker occasionally. it'll cause duplication but it's better than saying you got something that you didn't actually get
20:39 <DoubleJ> True enough. But given that we're likely to be pressed for time, I'd like to avoid any duplication if possible.
20:39 <chronomex> dld-streamer is designed to abort gracefully when the tracker goes away.
20:39 <chronomex> yeah. I'll look into it when I get home in ~8 hours..
20:40 <DoubleJ> Especially if the solution might be as simple as, "if the response times out, wait 5 seconds and try again"
20:40 <chronomex> I'd have to move a few things around, it's a 15-minute fix.
20:40 <chronomex> including testing.
20:40 <DoubleJ> It always gives me a new user right away when I restart so it's definitely transient.
20:40 <chronomex> yeah. it's transient.
20:40 <Nemo_bis> yes, but it means that you have to restart the streamer a lot of times; I had to disable the automatic stop or I'd have had to restart it 14 times in the last ~2000 users
20:40 <chronomex> which is why I wasn't able to test it reliably
20:41 <chronomex> huh, lame.
20:41 <DoubleJ> Didn't we have the same problem with GV? Whichever fake-database-thing being used doesn't seem to be able to handle the mass of updates, or something like that.
20:41 <chronomex> well. streamer is much better than a bunch of clients in screen, that's for damn sure.
20:42 <DoubleJ> Maybe if you have 50 to deal with. My VM and ancientware machine can handle so little it's easy to pop between screen windows :)
20:42 <chronomex> heh
20:42 <chronomex> yeah
20:43 <chronomex> when I run 50, it spends a lot of cpu time in extract-urls.py
20:43 <chronomex> like, a lot.
20:43 <DoubleJ> Also handy for catching the "my blog domain has dashes" users that send it into a hoke-and-retry cycle
20:43 <chronomex> it chokes much on that?
20:43 <DoubleJ> s/hoke/choke
20:44 <DoubleJ> Not too often, but it gets a "bad" error when the standard-weenie DNS decides the subdomain can't exist, so it deletes the user and starts over.
20:44 <chronomex> ah.
20:44 <DoubleJ> I've had 2, so I assume they're out there at about a rate of 1 per 4000
20:44 <chronomex> sounds reasonable given the data available
20:45 <chronomex> I take it nobody is doing memac?
20:45 <DoubleJ> And again, that's with dld-client being called directly. dld-stramer may catch and fail more gracefully.
20:45 <chronomex> well, nobody other than me
20:45 <Nemo_bis> I added them to http://archiveteam.org/index.php?title=Splinder#Notes
20:45 <Nemo_bis> dld-streamer didn't tell anything about it
20:46 <chronomex> DoubleJ: dld-streamer is mostly just a re-arranged dld-client.
20:46 <DoubleJ> OK. Didn't know if it was a wrapper or a rewrite.
20:47 <chronomex> yeah. it manages a pool of dld-single.sh's
20:47 <DoubleJ> If you're in the mood to test, see what it does with it:ermejointutt
20:47 <chronomex> I have to put some more robustness into its error handling
20:48 <DoubleJ> The blog is at -dituttounpo-.splinder.com which Linux boxen seem to be incapable of resolving.
20:48 <chronomex> on bsd: $ host -- -dituttounpo-.splinder.com
20:48 <chronomex> host: convert UTF-8 textname to IDN encoding: prohibited character found
20:48 <DoubleJ> Which starts the download-choke-delete-retry cycle forever
20:49 <DoubleJ> Yep. Probably the same error on Linux. But it's out there and resolves correctly on other OSes, and prevents the user from being completed.
20:49 <chronomex> fuckres.
20:49 <DoubleJ> I think it may be an older version of the spec? ISTR that dashes used to be OK in subdomains, but I might be remembering wrong.
20:49 <Nemo_bis> "something of everything" (I wonder what's the exact translation)
20:49 <chronomex> *.splinder.com is CNAME aliased to blog.splinder.com. Can we bypass the DNS entirely?
20:50 <DoubleJ> You know bach, you tell me :)
20:50 <DoubleJ> bash
20:50 <chronomex> heh, it's more a wget question.
20:53 <chronomex> and I don't know very much bash, I just know how to program and read manpages ;)
20:54 <chronomex> btw. most of the logfiles that dld-streamer keeps around temporarily are because I didn't trust myself to handle work units properly, and wanted a record until they were successfully despatched
20:57 <chronomex> also, the wget's save their files in the directory and then are deleted. this causes a lot of io traffic that's needless on my system. I'm considering having them download into some other directory, and then mounting a tmpfs for that to reduce disk head
20:57 <Nemo_bis> ah, that would be great
20:57 <Nemo_bis> ionice blocks jobs but doesn't increase caching to memory, apparently
20:58 <chronomex> yeah. I want writebehind on that stuff.
20:59 <chronomex> <chronomex> good evening #archiveteam, I'm a random internet user who's concerned about the state of things
20:59 <chronomex> <chronomex> is it all right if I lurk here? I've got not much to offer besides Linux and a bit of disk space.
20:59 <chronomex> my first words in #archiveteam :)
21:02 <DoubleJ> It looks like we could just send the request to $BLOGS_IP_ADDRESS/whichever/url and specify the host in the wget command line, but I don't know if that works with mirroring. I know I can do it for a single page, but I have a feeling that it's download page 1 and choke on page 2 if we tried.
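One way to act on the bypass idea: fetch from the CNAME target and supply the Host header by hand. This sketch only builds the command rather than running it — whether wget keeps honoring the header across a recursive mirror is exactly DoubleJ's open question, and the hostnames are the ones from the log:

```shell
blog='-dituttounpo-.splinder.com'   # the subdomain the resolver rejects
target='blog.splinder.com'          # *.splinder.com is CNAMEd here
# Single-page fetch; a mirror attempt would add -m, with the caveat above.
cmd="wget --header='Host: $blog' 'http://$target/'"
echo "$cmd"
```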
21:03 <ndurner1> rsync: link_stat "data/it/^/^Z/^Zo/^ZoSo^" failed: No such file or directory (2)
21:03 <ndurner1> rsync: link_stat "data/it/^/^s/^so/^sognoribelle^" failed: No such file or directory (2)
21:03 <ndurner1> sending incremental file list
21:03 <ndurner1> splinder rsync trouble:
21:03 <ndurner1> $ ls "data/it/^/^s/^so/"
21:04 <ndurner1> \^sognoribelle\^
21:04 <chronomex> weird.
21:04 <chronomex> hm, maybe rsync is passing it through a shell at some point. that would cause problems.
21:05 <chronomex> I know with `scp' you have to double-escape filenames.
21:05 <DoubleJ> From an online man page: "You use rsync in the same way you use rcp."
21:06 <DoubleJ> So if that's like scp you may have to double-escape.
21:13 <chronomex> ugh.
21:13 <Nemo_bis> no, the problem is that usernames are escaped when writing to disk
21:13 <Nemo_bis> some examples:
21:13 <Nemo_bis> http://p.defau.lt/?NITL0SVf4K4QFRgCKmlWIg
21:14 <chronomex> why do you use p.defau.lt in particular?
21:14 <chronomex> are you sure you don't have ls set up to print pastable names?
21:14 <Nemo_bis> I don't know, it's popular in Wikimedia channels on freeNode, probably Domas Mituzas created it
21:14 <Nemo_bis> yes, nautlus shows the same
21:14 <chronomex> hmm ok
21:15 <chronomex> weird.
21:15 <Nemo_bis> anyway, defau.lt is fast and I like those long hashes
21:15 <chronomex> heh.
21:15 <chronomex> friend of mine runs rafb.me so I use that
21:17 <Nemo_bis> nah, too complex :-p
21:17 <Nemo_bis> and expires
21:18 <chronomex> sure.
21:26 <SketchCow> Yay for Chronomex coming to SXSW
21:28 <chronomex> \o/
21:30 <chronomex> I'm sure it'll be piles of fun
21:41 <SketchCow> It'll be something!
21:43 <underscor> http://www.archive.org/details/911/day/20010911#id/WJLA_20010911_130000_Live_With_Regis_and_Kelly/start/13:02:55UTC
21:43 <underscor> This... is indescribable
21:43 <underscor> Watching that, eating breakfast before first grade...
21:46 * Nemo_bis hates captchas in archiveteam wiki
21:57 <chronomex> underscor: wait. 1st grade?!?
22:11 <underscor> chronomex: I'm 17, I was 7 in first grade
22:21 <yipdw^> SketchCow: uploading about 56 GB of Anyhub data to my rsync account on batcave; let me know if I should abort it for any reason
22:22 <SketchCow> Parents against the marriage
22:22 <SketchCow> birth defect
22:22 <yipdw^> crap
22:24 <yipdw^> should have consulted with Focus on the Family beforehand
22:26 <SketchCow> http://www.flickr.com/photos/textfiles/6360036609/in/photostream
22:26 <SketchCow> Focus on the Hard Drive
22:27 <Nemo_bis> SketchCow, does it use esata?
22:28 <yipdw^> SketchCow: heh, make that hard drive appear huge by playing with the focal plane
22:30 <SketchCow> http://www.flickr.com/photos/textfiles/6360036609/in/photostream/lightbox/ with the lightbox
22:54 <alard> Splinder people: Please do a git pull if you have the time. Nothing urgent, but I added two heroku-instances of the tracker. The new version of the script randomly chooses one.
22:56 <db48x> cool, done
22:57 <alard> Thanks. (The problem of the tracker not responding is most likely due to my use of a free heroku account, which allows just one single HTTP connection. The Redis backend on my EC2 micro instance is working fine, so far.)
22:57 <underscor> alard: done
22:57 <underscor> does anyhub still need any downloaders?
22:58 <chronomex> http://anyhub.heroku.com/ says 0 to do
22:59 <alard> The to do list is empty, so I don't think so.
23:02 <underscor> Okay
23:03 <underscor> \o/
23:03 <underscor> Now what's left is to package the beast
23:03 <alard> I also get the impression that it has stopped working.
23:03 <alard> The urls on the recent items page all produce zero-byte downloads.
23:04 <underscor> Hm, you're right
23:04 <alard> Although older ids do still work.
23:05 <underscor> - Downloading blog from blogorrea.splinder.com...
23:05 <underscor> hahah
23:13 <rude___> SketchCow: heh, I have that same dock, with that same hard drive in it right now, and I also took a test photo of it when I got my EOS 5D2 (but with a crap lens)
23:52 <closure> ok, time to get this splinder thing to 50%
23:57 <alard> closure, ndurner: Since you're big splinder downloaders, please do a git pull if you haven't done so. The new version balances the load between two tracker instances on heroku.