#archiveteam 2011-11-18,Fri


Time Nickname Message
00:00 🔗 PepsiMax '2Al' is done.
00:00 🔗 yipdw that's one reason inconsolata is an awesome typeface :)
00:00 🔗 yipdw obvious differentiation of 0 and O
00:00 🔗 PepsiMax Lucida console also works fine
00:03 🔗 chronomex I use clearlyu-clean on my terminals.
00:05 🔗 alard I seem to be using Monospace. I can see the difference between 0 and O, just didn't remember that I should look for a difference. :)
00:10 🔗 alard Is it a good idea to add everything that isn't done yet back to the queue?
00:10 🔗 alard I can do it in a way that should ensure that it is given to someone else than the first claimant.
00:11 🔗 db48x how do you know what isn't done yet?
00:11 🔗 PepsiMax alard: whut? http://pastebin.com/raw.php?i=LAu1unPR
00:12 🔗 alard Well, a couple of things: I know what has been claimed but hasn't been marked done (602 items). I know what should have been done (generate a list of ids), I know what has been marked done.
00:13 🔗 db48x ah
00:13 🔗 db48x yeah, no reason not to throw those back into the hopper
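[The requeue calculation alard describes — "what should have been done" minus "what has been marked done" — can be sketched with `comm`; the filenames here are hypothetical:]

```shell
# Sketch of the requeue logic (hypothetical filenames): everything that
# should exist, minus what was marked done, goes back in the queue --
# including items claimed but never finished.
printf 'a\nb\nc\nd\n' > all_ids.txt     # ids that should have been done
printf 'b\nd\n'       > done_ids.txt    # ids marked done on the tracker
sort -o all_ids.txt all_ids.txt         # comm requires sorted input
sort -o done_ids.txt done_ids.txt
comm -23 all_ids.txt done_ids.txt > requeue.txt   # in all, not in done
```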
00:13 🔗 alard PepsiMax: A misplaced warc?
00:13 🔗 PepsiMax /archiveteam/anyhub-grab$ grep -l Cannot data/*/wget*.log
00:13 🔗 PepsiMax alard: it worked before
00:13 🔗 PepsiMax data/3_q/wget-3_q-1.log
00:13 🔗 PepsiMax data/4xH/wget-4xH-1.log
00:13 🔗 PepsiMax data/4xH/wget-4xH-d1.log
00:14 🔗 alard PepsiMax: You probably have a warc file in your data/ directory. (It looks like it, anyway.)
00:15 🔗 PepsiMax I do.
00:16 🔗 PepsiMax alard: can you see what just happened with my rsync? It looked like it started sending data you already should have...
00:17 🔗 alard 2011/11/18 00:16:50 [22037] rsync error: error in rsync protocol data stream (code 12) at io.c(601) [generator=3.0.7]
00:17 🔗 alard 2011/11/18 00:16:50 [22037] rsync error: error in rsync protocol data stream (code 12) at io.c(760) [receiver=3.0.7]
00:17 🔗 alard 2011/11/18 00:16:50 [22037] rsync: connection unexpectedly closed (5454 bytes received so far) [generator]
00:17 🔗 alard 2011/11/18 00:16:50 [22037] rsync: read error: Connection reset by peer (104)
00:17 🔗 alard 2011/11/18 00:16:49 [22037] data/1RE/
00:17 🔗 alard 2011/11/18 00:16:50 [22037] data/1S6/
00:18 🔗 PepsiMax I CTRL+c 'd
00:18 🔗 underscor I have 130 ones that I just requeued
00:18 🔗 alard PepsiMax: A data/ directory has appeared.
00:18 🔗 underscor Because they weren't done
00:18 🔗 underscor alard: Just so you know
00:18 🔗 alard underscor: Well, I just added ~ 550 items back in the queue to be redone by someone, so it'll probably work.
00:19 🔗 PepsiMax Hmm /archiveteam/anyhub-grab/data pepsimax@8yourbox::pepsimax/anyhub/
00:19 🔗 alard PepsiMax: The slashes are always tricky. You may be missing a /
00:20 🔗 PepsiMax this is so confusing
00:20 🔗 alard I think you're uploading data to anyhub/data if you do this.
00:21 🔗 PepsiMax if I'm in the data/ dir
00:21 🔗 PepsiMax no
00:21 🔗 PepsiMax wait
00:21 🔗 PepsiMax The location of the main dir is
00:21 🔗 PepsiMax /mnt/extdisk/archiveteam/anyhub-grab/
00:21 🔗 alard Can't you run the upload script?
00:21 🔗 PepsiMax there is?
00:22 🔗 alard There is, maybe I added it after you last git pulled.
00:22 🔗 PepsiMax i pulled
00:22 🔗 PepsiMax didn't see it
00:22 🔗 PepsiMax who is dest?
00:22 🔗 alard Well, the script assumes you'll be uploading to SketchCOw.
00:22 🔗 yipdw PepsiMax: make sure your source tree contains b543af28807150554b3a0f0958615657def5df4d as an ancestor
00:23 🔗 PepsiMax Already up-to-date.
00:23 🔗 yipdw (or is at that commit)
00:23 🔗 yipdw what's git show HEAD --format=oneline show
00:23 🔗 PepsiMax yeah
00:23 🔗 PepsiMax but why?
00:23 🔗 yipdw then there should be an upload-finished.sh in the root of the repository
00:23 🔗 PepsiMax hurr
00:23 🔗 PepsiMax it is here
00:23 🔗 PepsiMax its kinda working
00:24 🔗 PepsiMax but i got login details of alard
00:24 🔗 PepsiMax i moved files around
00:24 🔗 alard Two options: modify the script, or ask SketchCow for an official rsync slot.
00:24 🔗 PepsiMax now i forgot how rsync assumes you are copying data
00:24 🔗 PepsiMax SketchCow: bzzzzzzzzzzzzzzzzzz
00:25 🔗 PepsiMax Can I haz 25GB of storage
00:27 🔗 underscor PepsiMax: You want rsync -avP /mnt/extdisk/archiveteam/anyhub-grab/ pepsimax@8yourbox::pepsimax/anyhub/
00:27 🔗 PepsiMax -avP?
00:27 🔗 underscor Archive, Verbose, Partial + Progress (-P is --partial --progress)
00:28 🔗 underscor That might give you a "group" error
00:28 🔗 underscor if it does, you want
00:28 🔗 underscor rsync -rlptoDvP /mnt/extdisk/archiveteam/anyhub-grab/ pepsimax@8yourbox::pepsimax/anyhub/
00:28 🔗 PepsiMax its uploading something
00:29 🔗 PepsiMax but god knows where
00:29 🔗 PepsiMax well
00:29 🔗 underscor Oh, whoops
00:29 🔗 underscor You want data on the end
00:29 🔗 underscor my bad
00:29 🔗 underscor rsync -rlptoDvP /mnt/extdisk/archiveteam/anyhub-grab/data/ pepsimax@8yourbox::pepsimax/anyhub/
00:29 🔗 underscor or
00:29 🔗 underscor rsync -avP /mnt/extdisk/archiveteam/anyhub-grab/data/ pepsimax@8yourbox::pepsimax/anyhub/
00:29 🔗 alard Did everyone suddenly stop their downloaders?
00:30 🔗 underscor alard: My clients are not running
00:30 🔗 underscor because I'm running dld-singles
00:30 🔗 alard Ah, I see.
00:30 🔗 alard It's eerily quiet on the tracker. 549 items to do, but no requesters.
00:30 🔗 alard :)
00:31 🔗 yipdw I'm stuck trying to fix 4z-
00:31 🔗 PepsiMax heh
00:31 🔗 yipdw well, hell
00:31 🔗 PepsiMax yeah
00:31 🔗 yipdw I'll spin up another client
00:31 🔗 PepsiMax alard: i was trying to save the remaining 25GB, its nomming on 0ry now.
00:32 🔗 underscor alard: Fired up a few clients
00:34 🔗 alard PepsiMax: rsync, you mean? Good. Unfortunately, I'm shutting things down for today/tonight, so you'll have to continue tomorrow. (Or see if you get hold of SketchCow, which would be even better.)
00:34 🔗 alard underscor; Thanks.
00:34 🔗 alard Well, bye all.
00:34 🔗 PepsiMax Good.
00:34 🔗 PepsiMax cya
00:35 🔗 underscor adios
02:59 🔗 underscor SketchCow: Batcave is about to crap itself, just so you know
05:38 🔗 yipdw I love how I can see a Splinder download completing in one window -- and then see my name be shoved off the dashboard in seconds in another window
05:45 🔗 chronomex strange, I'm not seeing anything live in the dashboard.
05:45 🔗 chronomex ah, it's an opera thing
05:49 🔗 chronomex wat
05:50 🔗 chronomex screen won't let me have more than 40 windows?!?
05:59 🔗 closure there's a maxwin setting you can adjust
06:00 🔗 closure lol, it can only be set lower than 40.. guess you'd have to recompile. what a strange thing
06:01 🔗 chronomex you have to recompile and set MAXWIN to something else
06:01 🔗 chronomex lazy goddamn c programmers
06:02 🔗 underscor chronomex: Use tmux
06:02 🔗 underscor It's better
06:02 🔗 underscor :>
06:02 🔗 closure must save precious, precious bytes in the window list
06:03 🔗 chronomex pfeh, screen works fine.
06:03 🔗 chronomex for most uses.
06:03 🔗 underscor A Fatal Error has occurred.
06:03 🔗 underscor Error Number: -2147205115
06:03 🔗 underscor Error Source: [SystemInfo: GetSystemConfig] [SystemInfo: GetSystemConfigItem] [SystemInfo: LoadSystemConfigInfo]
06:03 🔗 underscor Welcome to Prince William County School's Parent Portal
06:03 🔗 underscor Description: Provider: Microsoft OLE DB Provider for SQL Server Interface: IDBInitialize Description: Timeout expired
06:03 🔗 underscor Fucking school gradebook
06:03 🔗 closure little Bobby Tables must go there
06:03 🔗 chronomex to be sure, I should modify alard's excellent scraper to run a bunch of them nicely.
06:04 🔗 closure chronomex: well, I simply do ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure & ./dld-client.sh closure &
06:04 🔗 chronomex right. but there's a cleaner way I have in mind.
06:14 🔗 underscor alard: Looks like jstor broke metadata scraping or something
06:14 🔗 underscor I keep getting WARNING:root:400 POST /data (71.126.138.142): Missing argument meta
06:14 🔗 underscor WARNING:root:400 POST /data (71.126.138.142) 3754.34ms
06:14 🔗 underscor on my listener
06:42 🔗 * chronomex working on properly multithreadifying splinder
06:42 🔗 chronomex # reap dead children
07:19 🔗 Nemo_bis donbex wrote some scripts
07:28 🔗 chronomex oh yeah?
07:29 🔗 chronomex I bet so, he's running so fast :P
07:38 🔗 Nemo_bis that doesn't seem to be the reason, though :-p
07:39 🔗 Nemo_bis although it's part of the reason, because there was a bug and at some point he had 200 processes running
07:39 🔗 chronomex lol
07:39 🔗 Nemo_bis this is a genius:
07:39 🔗 Nemo_bis - Downloading blog from ----------------------.splinder.com... done, with network errors.
07:40 🔗 chronomex "is that eighteen or 22 hyphens?"
07:42 🔗 Nemo_bis yes, a very inspiring domain name
07:45 🔗 Nemo_bis i've put his scripts on http://toolserver.org/~nemobis/
07:46 🔗 Nemo_bis I'm not sure he agrees, though; ssshhh
08:05 🔗 ersi "If I make the name totally super hard, it's gonna be secret so no one will find it"
08:06 🔗 chronomex clearly
08:12 🔗 ersi then again, firefucks had problems loading urls with - in them earlier
08:13 🔗 ersi on linux at least, I think it worked on the windows version
08:15 🔗 chronomex splinder shouldn't have allowed it.
08:15 🔗 chronomex being contrary to the dns spec and all
08:30 🔗 chronomex there we go, thready version works great.
08:30 🔗 chronomex gonna put it on github and issue a pull request in a bit
08:40 🔗 chronomex wooo I own the realtime list
08:40 🔗 chronomex ish
08:51 🔗 chronomex 12/50 PID 8006 finished 'us:replicawatch7': Success.
08:51 🔗 chronomex replicawatch7, eh. excellent.
08:57 🔗 chronomex dld-streamer.sh is now in the archiveteam splinder git repository, if y'all want to use it
08:57 🔗 chronomex usage: ./dld-streamer.sh <you> 40 or whatever number you can handle
08:58 🔗 chronomex caveats: it loads up your system right good
08:58 🔗 yipdw does it generate the same results as the old version?
08:58 🔗 yipdw (never hurts to ask)
08:58 🔗 chronomex so far as I can make it, yes. it's a modified dld-client.sh
08:59 🔗 chronomex console output from dld-single.sh goes to $username.log , which is deleted upon successful completion of the user.
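[The per-user logging pattern chronomex describes can be sketched like this; `dld_single` is a stand-in function for dld-single.sh, so the names here are illustrative only:]

```shell
# Per-user log handling (sketch): log each worker to $username.log and
# discard the log only when the download succeeded.
dld_single() { echo "downloading $1"; }   # stand-in for ./dld-single.sh
username=exampleuser
if dld_single "$username" > "$username.log" 2>&1; then
  rm -f "$username.log"                   # success: drop the log
else
  echo "kept $username.log for inspection"
fi
```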
09:02 🔗 chronomex multithreaded programming in bash is kind of a bitch, so I'm not completely certain if I'm properly catching failed dld-single.sh's.
09:02 🔗 yipdw I didn't even know bash did threads
09:02 🔗 chronomex a second pair of eyes would help
09:02 🔗 chronomex well, it's not threads so much as management of a lot of background tasks.
09:02 🔗 yipdw oh
09:02 🔗 yipdw so &?
09:02 🔗 chronomex yes
09:03 🔗 chronomex but then I have to save the pid I just forked off and check if it's still around periodically
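[The fork-and-watch approach chronomex describes — background tasks with `&`, then periodically checking how many are still alive — is roughly this shape in bash; `worker` is a stand-in for dld-single.sh and the numbers are arbitrary:]

```shell
# Sketch of the background-pool idea behind dld-streamer.sh: fork
# workers with &, and use `jobs -pr` (running jobs only) to keep at
# most MAX alive at once.
MAX=3
worker() { sleep 0.2; }        # stand-in for ./dld-single.sh "$user"
started=0
for i in 1 2 3 4 5 6 7 8; do
  while [ "$(jobs -pr | wc -l)" -ge "$MAX" ]; do
    sleep 0.05                 # a slot frees up when a child exits
  done
  worker &
  started=$((started+1))
done
wait                           # reap remaining dead children
echo "$started" > started.count
```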
09:07 🔗 alard chronomex: Cool. Would it be possible to make it so that it keeps running when it doesn't get a username? (With a sleep of 10 to 30 seconds in between, for example?)
09:07 🔗 chronomex I'm sure it would, but I don't have a way to test that easily.
09:09 🔗 chronomex closure just did a user named "it:MS.Dos". I wonder how that works.
09:09 🔗 chronomex oh wait, users != blogs
09:09 🔗 chronomex kind of like tumblr I suppose
09:16 🔗 yipdw oh good lord
09:16 🔗 yipdw - Parsing profile HTML to extract media urls... done.
09:16 🔗 yipdw - Downloading 834 media files...
09:16 🔗 yipdw gonna be here a while
09:16 🔗 chronomex heh
09:19 🔗 yipdw though, that's still not bad compared to the Redazione profile
09:19 🔗 yipdw I've been downloading the first blog on that profile for a week now
09:20 🔗 yipdw it would probably be a good idea if someone else could start up a download for that profile
09:20 🔗 yipdw just in case mine errors out
09:20 🔗 chronomex any idea what they're writing about?
09:21 🔗 yipdw I think it's some sort of official Splinder account
09:21 🔗 yipdw "redazione" means "editorial staff", I think
09:21 🔗 chronomex ah.
09:22 🔗 yipdw http://www.splinder.com/profile/Redazione/blogs is what worries me
09:22 🔗 yipdw journal.splinder.com is the first one out of 8
09:23 🔗 yipdw and they all have many entries, each with many comments :P
09:25 🔗 yipdw http://www.splinder.com/myblog/comment/list/4212591/49575751?from=400
09:25 🔗 yipdw oh, NOW I know why some of these blogs have such huge comment pages.
09:27 🔗 chronomex hahaha
09:27 🔗 chronomex lesbian strapon fisting!
09:27 🔗 chronomex wtf is that, do they have a prosthetic arm connected to a harness or something
09:28 🔗 yipdw I don't know, but I bet anime has the answer
09:28 🔗 chronomex I'm good
09:49 🔗 Nemo_bis I once tried to archive it.wikihow.com and it failed while downloading a talk page with a 10 GiB ish history full with spam
09:50 🔗 Nemo_bis surprisingly, deleting it didn't kill their servers
10:00 🔗 chronomex woops, dld-streamer.sh has a bug where it goes into a spinloop during stopping state
10:41 🔗 Nemo_bis chronomex, what's the opposite of touch STOP ?
10:41 🔗 chronomex opposite how?
10:41 🔗 Schbirid rm STOP i guess
10:41 🔗 Nemo_bis I don't know, how do you stop the stopping?
10:41 🔗 chronomex rm STOP
10:41 🔗 chronomex :)
10:41 🔗 Nemo_bis ah ok
10:41 🔗 Nemo_bis thanks :)
10:41 🔗 chronomex you have to do it before the script sees STOP, which it will do usually within a second or two after you create it
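[The STOP-file convention under discussion is a sentinel-file check in the dispatch loop: the script polls for the file before claiming each new item, which is why `rm STOP` only works if it happens before the next check. A minimal sketch:]

```shell
# STOP-file pattern (sketch): poll for the sentinel before starting
# each new item; once seen, stop dispatching and go reap children.
touch STOP                     # simulate someone running 'touch STOP'
while true; do
  if [ -f STOP ]; then
    echo "STOP seen, entering reap mode" > stop.log
    break
  fi
  sleep 1                      # would claim and start a new item here
done
```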
10:42 🔗 * Nemo_bis didn't get how it works
10:42 🔗 Nemo_bis yep, I'm waiting for some processes to complete
10:50 🔗 chronomex it works for you too? excellent.
10:51 🔗 chronomex not that I don't expect it, but that's good to hear
10:53 🔗 Nemo_bis no, I didn't start it yet because I'm stopping the running dld-client now
10:53 🔗 Nemo_bis 44 left
10:53 🔗 chronomex aye
10:54 🔗 chronomex it's safe to run in a different terminal.
10:54 🔗 chronomex similarly, you can run as many dld-client.sh in that directory as you want at once
10:54 🔗 Nemo_bis hmm, yes, I'll do so because they're so slooooooooooooow now
10:55 🔗 chronomex wait
10:55 🔗 chronomex I suggest you run it as
10:55 🔗 chronomex ionice -c 3 -n 5 nice -n 10 ./dld-streamer.sh <name> 50
10:56 🔗 chronomex that way it's less likely to completely eat your machine
10:58 🔗 Nemo_bis right now it's not a problem
10:58 🔗 db48x hrm
10:58 🔗 Nemo_bis I tried ionice -c 3 but then many processes died because they were not able to write to disk
10:59 🔗 db48x 17.5 KB/s is not the ideal transfer rate for this 40 gigs of data
10:59 🔗 Nemo_bis heh
11:00 🔗 db48x especially since the machines are right next to each other
11:04 🔗 db48x hrm
11:04 🔗 chronomex db48x: eew
11:04 🔗 db48x one of my splinder clients is stuck retrying --------.splinder.com
11:06 🔗 db48x it:luke1989
11:08 🔗 Schbirid chronomex: wait, you can pass eg 50 to make it dl 50 at once?
11:08 🔗 chronomex yeah
11:09 🔗 * Schbirid stops a lot of terminal windows...
11:09 🔗 chronomex heh
11:09 🔗 chronomex I suggest sticking to around 50 in each streamer instance, you can run multiple streamers at once
11:11 🔗 chronomex see how much it loads your machine ... it can really churn your disk around.
11:11 🔗 Schbirid yeah
11:12 🔗 Schbirid got around 12 io/s and latency of 100 ms, but I have no idea how much that is
11:12 🔗 Schbirid just compared to the normal almost-0 it sure shows up on the graphs
11:12 🔗 chronomex lol
11:25 🔗 db48x the download speed is going down
11:35 🔗 Nemo_bis for me too
11:35 🔗 Nemo_bis probably servers overloaded
11:36 🔗 Nemo_bis (this is peak hour)
11:36 🔗 Schbirid archiveteam, your free load testing service
11:37 🔗 Nemo_bis nah, I doubt it, we're not pulling so much
11:37 🔗 chronomex <3
11:43 🔗 Nemo_bis chronomex, is it normal that it's always the 49th or 50th process to tell me that a user is completed? http://p.defau.lt/?WnNq7oqrkhMYTsNPAhdNvQ
11:43 🔗 chronomex oh, yeah, that deserves some explanation
11:44 🔗 chronomex the first number is how many processes are running, the second is how many you want
11:44 🔗 Nemo_bis ah, ok
11:44 🔗 chronomex I don't number them individually, I just count how many there are
11:44 🔗 chronomex so, yes.
11:44 🔗 Nemo_bis so only PID tells something
11:44 🔗 chronomex right
11:44 🔗 Nemo_bis ok
11:44 🔗 chronomex yep
11:46 🔗 chronomex huh. https://plus.google.com/112313173544747389010/posts/UouzhaSbB1M
12:09 🔗 Cameron_D So I have ~15 splinder processes that are still running 12 hours after a touch STOP
12:09 🔗 Cameron_D and they are still downloading
12:13 🔗 ersi chronomex: Bwahaha
12:13 🔗 PepsiMax Cameron_D: huge downloads then.
12:13 🔗 ersi What a bunch of fucking retards Backupify is
12:13 🔗 Cameron_D PepsiMax, yeah :/
12:14 🔗 Nemo_bis eventually every client bumps into a big user and gets stuck with it :/
12:19 🔗 PepsiMax but the time between the small ones is so big.
12:20 🔗 PepsiMax so you start a few clients, to speed up the small user-waiting time
12:20 🔗 PepsiMax and booom, 4 clients sucking the internet.
12:25 🔗 Cameron_D I think I'm doing it right... http://i.imgur.com/Ln10H.jpg
12:26 🔗 PepsiMax lol skynet
12:26 🔗 Cameron_D Unoriginal name is unoriginal :P
14:00 🔗 PepsiMax bleerg
15:11 🔗 Nemo_bis No more usernames available. Entering reap mode...
15:11 🔗 Nemo_bis ??
15:11 🔗 DoubleJ Any chance of making the dld-client try harder or be more patient when telling the tracker it's done? I have a quarter of my processes that stopped this morning because they didn't hear back quickly enough.
15:24 🔗 Nemo_bis the same for me
15:25 🔗 Nemo_bis and dld-streamer stops everything the first time it doesn't receive a user
16:55 🔗 SketchCow Summary of payments: * Your withdrawal of $107981.00 succeeded for a bank account.
16:55 🔗 SketchCow TIME TO GO SHOPPPPPPINGGGGGGGG
17:05 🔗 Schbirid >:)
17:05 🔗 soultcer Kickstarter is an awesome invention
17:10 🔗 rude___ what will you be shooting with SketchCow?
17:17 🔗 Schbirid "./dld-client.sh mynick 30" will try 30 at once, correct?
17:27 🔗 Nemo_bis Schbirid, that's dld-streamer.sh
17:28 🔗 Schbirid oh
17:29 🔗 Schbirid oh yes
17:29 🔗 Schbirid thanks
17:53 🔗 PepsiMax For those who missed it: ARCHIVE TEAM: A Distributed Preservation of Service Attack http://youtu.be/-2ZTmuX3cog
17:56 🔗 PepsiMax anyhub deaded?
18:17 🔗 godane i found out how to make wikipedia work offline: http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
18:18 🔗 godane it's best when you need a mirror of it locally
18:20 🔗 Cowering heh heh.. best to mirror *.gr while you are at it :)
18:26 🔗 godane why is it that wikipedia dumps are in bz2?
18:27 🔗 godane i have seen some are not compressed at all
18:27 🔗 godane part of me thinks wikipedia dumps need to be done in lzma or xz
18:29 🔗 Nemo_bis there are also 7z versions
18:30 🔗 Nemo_bis bzip archives are needed for some applications which process them without uncompressing them
18:30 🔗 godane ok
18:30 🔗 godane thought there was a bzip2recover for xz
18:30 🔗 godane or lzma
18:30 🔗 Nemo_bis and lzma doesn't improve much in some cases; it's very useful for complete histories with a lot of duplicate text
18:31 🔗 godane i have seen 10% improvement
18:31 🔗 Nemo_bis that's not much :)
18:31 🔗 Nemo_bis 7z archives of full history are about 1/5 of bzip2
18:31 🔗 godane it is when you're talking 7.3 GB
18:31 🔗 godane saves 700mb
18:32 🔗 Nemo_bis which is not much if the archive uncompressed is 730 GiB
18:32 🔗 Nemo_bis and bzip2 lets you not uncompress it
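[The streamability argument for bzip2 — tools can consume the archive as a stream without ever writing the uncompressed dump to disk — looks like this with toy data:]

```shell
# Process a bzip2 archive as a stream (toy data standing in for a
# multi-hundred-GiB wiki dump): nothing uncompressed touches the disk.
printf 'page1\npage2\npage3\n' | bzip2 -c > dump.xml.bz2
bzcat dump.xml.bz2 | wc -l > lines.count
```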
18:33 🔗 godane found a script for 7z to do a 7zcat and 7zless
18:34 🔗 godane exec 7z e -so -bd "$@" 2>/dev/null | cat
18:34 🔗 Nemo_bis hm, doesn't look very efficient: piping 350 GiB?
18:34 🔗 Nemo_bis anyway, I don't really know what I'm talking about, just repeating things I heard real experts say :-p
18:35 🔗 Nemo_bis you might want to read https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l where this was discussed multiple times
19:06 🔗 DFJustin I use wikitaxi for offline wikipedia which is a nice ready-made app but there's like one guy writing it and he doesn't update enough
19:15 🔗 ersi godane: because xz is a fucking dickbitch to compress with
19:15 🔗 ersi since it takes a billion more resources
19:50 🔗 SketchCow Just slammed another 5gb of shareware CDs in
19:50 🔗 SketchCow I'm trying to empty batcave out in the next week
19:50 🔗 SketchCow Except Berlios and the current projects, while we determine ingestion
19:52 🔗 DFJustin I notice the cds that have gone in in the past while don't have the auto-generated file listings and galleries
19:54 🔗 SketchCow I have to initiate them.
19:54 🔗 SketchCow Shortly, I'll write something to extract them.
19:54 🔗 SketchCow Right now, I need the iso location of them to do it.
19:56 🔗 SketchCow Like, what I SHOULD do is you give it an item name, it checks if the work has been done, and then it does it if there's nothing.
19:56 🔗 SketchCow Then I can just glorb the rss feed
19:59 🔗 DFJustin isn't there already a derive infrastructure for this sort of thing
20:00 🔗 underscor Yeah, but adding derive jobs to the system is a pain in the ass
20:31 🔗 chronomex godane, Nemo_bis: wiki dumps compress even better when you first diff-compress them and then bzip the diffs.
20:31 🔗 chronomex godane: Nemo_bis: https://github.com/chronomex/wikiscraper is a tool that takes a wiki dump and turns it into a version-control repo
20:32 🔗 Nemo_bis chronomex, are you interested in doing the opposite? :)
20:33 🔗 chronomex hmmm, maybe, what do you have in mind?
20:34 🔗 Nemo_bis old diffs of Wikipedia in UseModWiki
20:34 🔗 chronomex ? do link
20:35 🔗 Nemo_bis http://reagle.org/joseph/blog/social/wikipedia/10k-redux.html
20:35 🔗 DoubleJ Hm. Splinder tracker seems to be getting less reliable. Just had to restart about half a dozen out of 16 processes for failure to tell the tracker the user was finished.
20:36 🔗 chronomex hmmm.
20:37 🔗 DoubleJ I'd think it was at my end, but my rsync up to batcave was uninterrupted and others reported the same problem before.
20:37 🔗 DoubleJ So that leaves the tracker looking like the weak link.
20:38 🔗 chronomex I'm away from my box right now but it's not showing up on the dashboard.
20:38 🔗 chronomex hm.
20:38 🔗 Nemo_bis yes, and sometimes it also fails to give a user and dld-streamer stops adding processes
20:38 🔗 chronomex i don't think there's a huge problem with failing to notify the tracker occasionally. it'll cause duplication but it's better than saying you got something that you didn't actually get
20:39 🔗 DoubleJ True enough. But given that we're likely to be pressed for time, I'd like to avoid any duplication if possible.
20:39 🔗 chronomex dld-streamer is designed to abort gracefully when the tracker goes away.
20:39 🔗 chronomex yeah. I'll look into it when I get home in ~8 hours..
20:39 🔗 DoubleJ Especially if the solution might be as simple as, "if the response times out, wait 5 seconds and try again"
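[DoubleJ's suggested fix — wait and retry when the tracker times out — has a simple shape; `request` here is a stand-in that fails twice and then succeeds, just to show the loop:]

```shell
# Retry-with-delay sketch for flaky tracker responses. "request" is a
# hypothetical stand-in that fails twice before succeeding.
tries=0
request() { tries=$((tries+1)); [ "$tries" -ge 3 ]; }
until request; do
  sleep 0.1    # would be 'sleep 5' against the real tracker
done
echo "succeeded after $tries attempts" > retry.log
```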
20:40 🔗 chronomex I'd have to move a few things around, it's a 15-minute fix.
20:40 🔗 chronomex including testing.
20:40 🔗 DoubleJ It always gives me a new user right away when I restart so it's definitely transient.
20:40 🔗 chronomex yeah. it's transient.
20:40 🔗 Nemo_bis yes, but it means that you have to restart the streamer a lot of times; I had to disable the automatic stop or I'd have had to restart it 14 times in the last ~2000 users
20:40 🔗 chronomex which is why I wasn't able to test it reliably
20:40 🔗 chronomex huh, lame.
20:41 🔗 DoubleJ Didn't we have the same problem with GV? Whichever fake-database-thing being used doesn't seem to be able to handle the mass of updates, or something like that.
20:41 🔗 chronomex well. streamer is much better than a bunch of clients in screen, that's for damn sure.
20:42 🔗 DoubleJ Maybe if you have 50 to deal with. My VM and ancientware machine can handle so little it's easy to pop between screen windows :)
20:42 🔗 chronomex heh
20:42 🔗 chronomex yeah
20:43 🔗 chronomex when I run 50, it spends a lot of cpu time in extract-urls.py
20:43 🔗 chronomex like, a lot.
20:43 🔗 DoubleJ Also handy for catching the "my blog domain has dashes" users that send it into a choke-and-retry cycle
20:43 🔗 chronomex it chokes much on that?
20:44 🔗 DoubleJ Not too often, but it gets a "bad" error when the standard-weenie DNS decides the subdomain can't exist, so it deletes the user and starts over.
20:44 🔗 chronomex ah.
20:44 🔗 DoubleJ I've had 2, so I assume they're out there at about a rate of 1 per 4000
20:44 🔗 chronomex sounds reasonable given the data available
20:45 🔗 chronomex I take it nobody is doing memac?
20:45 🔗 DoubleJ And again, that's with dld-client being called directly. dld-streamer may catch and fail more gracefully.
20:45 🔗 chronomex well, nobody other than me
20:45 🔗 Nemo_bis I added them to http://archiveteam.org/index.php?title=Splinder#Notes
20:45 🔗 Nemo_bis dld-streamer didn't say anything about it
20:45 🔗 chronomex DoubleJ: dld-streamer is mostly just a re-arranged dld-client.
20:46 🔗 DoubleJ OK. Didn't know if it was a wrapper or a rewrite.
20:46 🔗 chronomex yeah. it manages a pool of dld-single.sh's
20:47 🔗 DoubleJ If you're in the mood to test, see what it does with it:ermejointutt
20:47 🔗 chronomex I have to put some more robustness into its error handling
20:47 🔗 DoubleJ The blog is at -dituttounpo-.splinder.com which Linux boxen seem to be incapable of resolving.
20:48 🔗 chronomex on bsd: $ host -- -dituttounpo-.splinder.com
20:48 🔗 chronomex host: convert UTF-8 textname to IDN encoding: prohibited character found
20:48 🔗 DoubleJ Which starts the download-choke-delete-retry cycle forever
20:48 🔗 DoubleJ Yep. Probably the same error on Linux. But it's out there and resolves correctly on other OSes, and prevents the user from being completed.
20:49 🔗 chronomex fuckres.
20:49 🔗 DoubleJ I think it may be an older version of the spec? ISTR that dashes used to be OK in subdomains, but I might be remembering wrong.
20:49 🔗 Nemo_bis "something of everything" (I wonder what the exact translation is)
20:49 🔗 chronomex *.splinder.com is CNAME aliased to blog.splinder.com. Can we bypass the DNS entirely?
20:49 🔗 DoubleJ You know bash, you tell me :)
20:50 🔗 chronomex heh, it's more a wget question.
20:50 🔗 chronomex and I don't know very much bash, I just know how to program and read manpages ;)
20:53 🔗 chronomex btw. most of the logfiles that dld-streamer keeps around temporarily are because I didn't trust myself to handle work units properly, and wanted a record until they were successfully despatched
20:54 🔗 chronomex also, the wget's save their files in the directory and then are deleted. this causes a lot of needless io traffic on my system. I'm considering having them download into some other directory, and then mounting a tmpfs for that to reduce disk head movement
20:57 🔗 Nemo_bis ah, that would be great
20:57 🔗 Nemo_bis ionice blocks jobs but doesn't increase caching to memory, apparently
20:57 🔗 chronomex yeah. I want writebehind on that stuff.
20:58 🔗 chronomex <chronomex> good evening #archiveteam, I'm a random internet user who's concerned about the state of things
20:59 🔗 chronomex <chronomex> is it all right if I lurk here? I've got not much to offer besides Linux and a bit of disk space.
20:59 🔗 chronomex my first words in #archiveteam :)
21:02 🔗 DoubleJ It looks like we could just send the request to $BLOGS_IP_ADDRESS/whichever/url and specify the host in the wget command line, but I don't know if that works with mirroring. I know I can do it for a single page, but I have a feeling that it'd download page 1 and choke on page 2 if we tried.
21:03 🔗 ndurner1 splinder rsync trouble:
21:03 🔗 ndurner1 sending incremental file list
21:03 🔗 ndurner1 rsync: link_stat "data/it/^/^Z/^Zo/^ZoSo^" failed: No such file or directory (2)
21:03 🔗 ndurner1 rsync: link_stat "data/it/^/^s/^so/^sognoribelle^" failed: No such file or directory (2)
21:03 🔗 ndurner1 $ ls "data/it/^/^s/^so/"
21:03 🔗 ndurner1 \^sognoribelle\^
21:04 🔗 chronomex weird.
21:04 🔗 chronomex hm, maybe rsync is passing it through a shell at some point. that would cause problems.
21:04 🔗 chronomex I know with `scp' you have to double-escape filenames.
21:05 🔗 DoubleJ From an online man page: "You use rsync in the same way you use rcp."
21:05 🔗 DoubleJ So if that's like scp you may have to double-escape.
21:06 🔗 chronomex ugh.
21:13 🔗 Nemo_bis no, the problem is that usernames are escaped when writing to disk
21:13 🔗 Nemo_bis some examples:
21:13 🔗 Nemo_bis http://p.defau.lt/?NITL0SVf4K4QFRgCKmlWIg
21:14 🔗 chronomex why do you use p.defau.lt in particular?
21:14 🔗 chronomex are you sure you don't have ls set up to print pastable names?
21:14 🔗 Nemo_bis I don't know, it's popular in Wikimedia channels on freeNode, probably Domas Mituzas created it
21:14 🔗 Nemo_bis yes, nautilus shows the same
21:14 🔗 chronomex hmm ok
21:15 🔗 chronomex weird.
21:15 🔗 Nemo_bis anyway, defau.lt is fast and I like those long hashes
21:15 🔗 chronomex heh.
21:15 🔗 chronomex friend of mine runs rafb.me so I use that
21:17 🔗 Nemo_bis nah, too complex :-p
21:17 🔗 Nemo_bis and expires
21:18 🔗 chronomex sure.
21:26 🔗 SketchCow Yay for Chronomex coming to SXSW
21:28 🔗 chronomex \o/
21:30 🔗 chronomex I'm sure it'll be piles of fun
21:41 🔗 SketchCow It'll be something!
21:43 🔗 underscor http://www.archive.org/details/911/day/20010911#id/WJLA_20010911_130000_Live_With_Regis_and_Kelly/start/13:02:55UTC
21:43 🔗 underscor This... is indescribable
21:43 🔗 underscor Watching that, eating breakfast before first grade...
21:46 🔗 * Nemo_bis hates captchas in archiveteam wiki
21:57 🔗 chronomex underscor: wait. 1st grade?!?
22:11 🔗 underscor chronomex: I'm 17, I was 7 in first grade
22:21 🔗 yipdw^ SketchCow: uploading about 56 GB of Anyhub data to my rsync account on batcave; let me know if I should abort it for any reason
22:22 🔗 SketchCow Parents against the marriage
22:22 🔗 SketchCow birth defect
22:22 🔗 yipdw^ crap
22:24 🔗 yipdw^ should have consulted with Focus on the Family beforehand
22:26 🔗 SketchCow http://www.flickr.com/photos/textfiles/6360036609/in/photostream
22:26 🔗 SketchCow Focus on the Hard Drive
22:27 🔗 Nemo_bis SketchCow, does it use esata?
22:28 🔗 yipdw^ SketchCow: heh, make that hard drive appear huge by playing with the focal plane
22:30 🔗 SketchCow http://www.flickr.com/photos/textfiles/6360036609/in/photostream/lightbox/ with the lightbox
22:54 🔗 alard Splinder people: Please do a git pull if you have the time. Nothing urgent, but I added two heroku-instances of the tracker. The new version of the script randomly chooses one.
22:56 🔗 db48x cool, done
22:57 🔗 alard Thanks. (The problem of the tracker not responding is most likely due to my use of a free heroku account, which allows just one single HTTP connection. The Redis backend on my EC2 micro instance is working fine, so far.)
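[Picking one of the two tracker instances at random, as alard's updated script does, is a one-liner in bash; the URLs below are placeholders, not the real heroku instances:]

```shell
# Random load-balancing between tracker instances (sketch,
# hypothetical URLs standing in for the two heroku apps):
trackers=("http://tracker-a.example.org" "http://tracker-b.example.org")
tracker="${trackers[RANDOM % ${#trackers[@]}]}"
echo "$tracker" > tracker.choice   # the client would request a username from $tracker
```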
22:57 🔗 underscor alard: done
22:57 🔗 underscor does anyhub still need any downloaders?
22:58 🔗 chronomex http://anyhub.heroku.com/ says 0 to do
22:59 🔗 alard The to do list is empty, so I don't think so.
23:02 🔗 underscor Okay
23:03 🔗 underscor \o/
23:03 🔗 underscor Now what's left is to package the beast
23:03 🔗 alard I also get the impression that it has stopped working.
23:03 🔗 alard The urls on the recent items page all produce zero-byte downloads.
23:04 🔗 underscor Hm, you're right
23:04 🔗 alard Although older ids do still work.
23:05 🔗 underscor - Downloading blog from blogorrea.splinder.com...
23:05 🔗 underscor hahah
23:13 🔗 rude___ SketchCow: heh, I have that same dock, with that same hard drive in it right now, and I also took a test photo of it when I got my EOS 5D2 (but with a crap lens)
23:52 🔗 closure ok, time to get this splinder thing to 50%
23:57 🔗 alard closure, ndurner: Since you're big splinder downloaders, please do a git pull if you haven't done so. The new version balances the load between two tracker instances on heroku.
