#internetarchive.bak 2015-04-02 (Thu)

Time Nickname Message
00:07 🔗 patricko- is now known as patrickod
00:10 🔗 patrickod is now known as patricko-
00:45 🔗 closure tpw_rules: git annex info
00:46 🔗 tpw_rules ugh i'm running 16 at once and still having trouble pegging my internet
00:47 🔗 ohhdemgir (~ohhdemgir@[redacted]) has joined #internetarchive.bak
00:50 🔗 GitHub90/#internetarchive.bak IA.BAK/server 0533231 Joey Hess: fix html directory for stats
00:50 🔗 GitHub90/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhIx
00:51 🔗 SketchCow Well, that explains that
00:52 🔗 closure indeed
00:52 🔗 SketchCow Oh shit son, we have +4's
00:52 🔗 closure and I see there are now a few files replicated at +4 ..
00:52 🔗 closure yes
00:52 🔗 closure +4 is the current target
00:52 🔗 closure hrm, or is it +3?
00:53 🔗 closure anyway, might get a few over the target due to collisions
00:53 🔗 londoncal has quit (Quit: Leaving...)
00:56 🔗 SketchCow So, some questions, likely some unanswerable.
00:56 🔗 SketchCow Can you, as the poobah, see how much disk space is out there among the clients?
00:56 🔗 SketchCow (I mean amount that is filling, not how much is out there and reported back. Obviously that's what that list is.)
00:56 🔗 closure the data is there, it would take a little bit of calculation
00:56 🔗 closure oh, you mean the unused space too? No
00:57 🔗 SketchCow What I mean, to be specific, is to go "we have X tb in total, which we are now filling."
00:57 🔗 closure we could add more reporting when the clients report back, but git-annex doesn't track that anyway
00:58 🔗 ohhdemgir well, this project is radged! I'm in
01:00 🔗 SN4T14_ (~SN4T14@[redacted]) has joined #internetarchive.bak
01:00 🔗 SN4T14_ is now known as SN4T14
01:00 🔗 closure SketchCow: oh, we could add a client count though
01:01 🔗 closure currently 13
01:01 🔗 SketchCow Well, I want that listing out there we talked about, so I can make a map. Go ahead and do it and let me know the URL
01:01 🔗 SketchCow Have it update, oh, once a day
01:01 🔗 SketchCow Or maybe more for now, while more people are joining
01:01 🔗 closure hmm, I'm uncomfortable putting up IP addresses on http
01:01 🔗 SketchCow Well, if you'd like, YOU could do the calculations, you're a genius
01:02 🔗 closure I could rsync them to teamarchive1, or ..
01:02 🔗 SketchCow And then give me the rough names it shoots out
01:02 🔗 closure ugh, no time
01:02 🔗 SketchCow busy genius
01:02 🔗 closure tell me something to run, or I'll give you an account
01:02 🔗 SketchCow Well, how about this. e-mail me a list. I'll write my code and crap, then hand it to you.
01:02 🔗 closure never looked at geodns stuff
01:02 🔗 closure sure
01:02 🔗 SketchCow We're just doing extremely general, after all
01:03 🔗 patricko- is now known as patrickod
01:03 🔗 patrickod is now known as patricko-
01:08 🔗 patricko- is now known as patrickod
01:18 🔗 trs80 is this at a point where you want more testing clients?
01:20 🔗 closure yes
01:27 🔗 SketchCow The more the better, now.
01:27 🔗 SketchCow I don't want people sacrificing anything for it, don't go for unwarranted overuse, but it's worthwhile to get us in the realm of our plans
01:28 🔗 aschmitz closure: How big is the current test shard?
01:28 🔗 trs80 ok, how do I get started?
01:29 🔗 aschmitz Ah, I can read. 2.91 TB.
01:30 🔗 aschmitz trs80: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation#SHARD1
01:32 🔗 closure yeah, 2.91 tb, but you don't need that much disk, it will use what you give it
01:32 🔗 tpw_rules how often should i fsck?
01:33 🔗 tpw_rules what does that check?
01:33 🔗 closure still need to figure that out.. it checks the file contents
01:33 🔗 tpw_rules ah
01:33 🔗 tpw_rules so just md5 *
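
For context, git-annex fsck re-hashes each locally present file and compares it against the checksum embedded in the file's key — typically SHA-256 rather than md5. A minimal invocation, run from inside the shard:

    cd shard1
    git annex fsck    # verifies the content of every locally present file
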
01:34 🔗 aschmitz Any problem running on an NFS mount?
01:36 🔗 GitHub175/#internetarchive.bak IA.BAK/pubkey d95afb3 Joey Hess: add trs80
01:36 🔗 GitHub175/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jhni
01:36 🔗 closure aschmitz: there can be lock file problems. I would recommend not running multiple concurrent downloads on nfs
01:36 🔗 patrickod is now known as patricko-
01:39 🔗 aschmitz But with just one copy of git-annex, it should be fine?
01:40 🔗 aschmitz Second question is whether there's a way to get git-annex to use a specified keypair, rather than ~/.ssh/id_rsa.
01:40 🔗 closure probably.
01:40 🔗 closure aschmitz: the current iabak script generates its own dedicated ssh key, and makes it be used. so yes
01:40 🔗 aschmitz Ah, fun.
01:43 🔗 patricko- is now known as patrickod
01:44 🔗 trs80 closure: oops, so let me send you that new key
01:44 🔗 trs80 hmm, so you don't need to manually install the latest git-annex, ./iabak does that for you (although it got an i386 version on amd64)
01:46 🔗 GitHub59/#internetarchive.bak IA.BAK/pubkey d2f5097 Joey Hess: swap in right key for trs80
01:46 🔗 GitHub59/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jhWq
01:47 🔗 closure yeah, hadn't realized that old documentation was still in the wiki
01:50 🔗 trs80 hmm, now it's gone into a sleep for 1 hour
01:50 🔗 patrickod is now known as patricko-
01:52 🔗 trs80 right, because iabak-helper doesn't run git-annex init in shard1
01:53 🔗 trs80 hmm, but it should have ...
01:53 🔗 trs80 ah, because my git wasn't configured with user.name/email
01:55 🔗 closure oh, ok
01:56 🔗 closure trs80: did that leave the shard1 empty?
01:57 🔗 closure sounds like the problem zottelbey had earlier
01:57 🔗 trs80 closure: yeah, it did
01:58 🔗 GitHub42/#internetarchive.bak IA.BAK/master 717e95e Joey Hess: set user.name and user.email locally to deal with systems where git falls over otherwise...
01:58 🔗 GitHub42/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jh8D
02:02 🔗 trs80 also line 107 comparing versions failed a little bit because I ran git-annex init in the root dir, causing repository version lines to be output. maybe add -m1 to the grep "version"
02:04 🔗 trs80 little things to work around stupid users :)
02:04 🔗 GitHub125/#internetarchive.bak IA.BAK/master 418a7d3 Joey Hess: make version grep look at 1st line
02:04 🔗 GitHub125/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jhBB
02:05 🔗 tpw_rules http://cl.ly/image/24082s0z3c1c/Screen%20Shot%202015-04-01%20at%209.04.57%20PM.png EHEHEHHEHE
02:10 🔗 GitHub120/#internetarchive.bak IA.BAK/pubkey 1f9922f Joey Hess: add aschmitz
02:10 🔗 GitHub120/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to pubkey: http://git.io/jh0B
02:11 🔗 aschmitz Thanks!
02:20 🔗 underscor closure: fwiw I still can't ssh
02:20 🔗 hatseflat (~hatseflat@[redacted]) has joined #internetarchive.bak
02:20 🔗 underscor (at least, the script fails with "you're not signed up yet")
02:22 🔗 underscor interestingly, manually reconstructing the command it's using works fine
02:22 🔗 underscor (ssh SHARD1@iabak.archiveteam.org git-annex-shell -c configlist shard1)
02:22 🔗 aschmitz underscor: The script uses id_rsa in the local directory, while your manual command uses the one in ~/.ssh.
02:22 🔗 underscor aha
02:24 🔗 aschmitz Might be best to generate a new key / use the id_rsa.pub that probably got generated, rather than copying your personal id_rsa around to more places, though.
02:24 🔗 underscor this was a manually generated one from a previous iteration
02:25 🔗 underscor but yeah, good point to consider
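
A hand-drawn sketch of what the iabak script does here, per closure above — the filenames and options are illustrative, not the script's exact code:

    ssh-keygen -t rsa -f ./id_rsa -N ''    # dedicated keypair kept next to the repo
    # once the pubkey is accepted server-side, test the connection the way the script does:
    ssh -i ./id_rsa SHARD1@iabak.archiveteam.org git-annex-shell -c configlist shard1
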
02:25 🔗 trs80 how long should I expect git-annex --library-path to churn before starting to download? been about 20 minutes so far
02:26 🔗 underscor hum
02:26 🔗 underscor my dirname doesn't have -z
02:26 🔗 underscor weird
02:27 🔗 closure trs80: it can take a while on a slower disk.. you could ctrl-c, touch IA.BAK/NOSHUF and avoid the overhead of the shuffling it does
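
The workaround closure describes, spelled out:

    # Ctrl-C the running script, then:
    touch IA.BAK/NOSHUF    # tells the helper to skip the shuffling pass
    ./iabak                # restart
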
02:30 🔗 aschmitz "git annex [...] get -- [item names]" seems to just be hanging for me? There's a "[git] <defunct>" a few processes after it, if that's relevant.
02:30 🔗 aschmitz 100% CPU, but no network traffic, and strace seems to mostly be it checking the time.
02:31 🔗 tpw_rules this is gonna make my disk fragmented as fuck
02:31 🔗 trs80 aschmitz: same here
02:32 🔗 tpw_rules what happens if you just try git annex get
02:32 🔗 tpw_rules that's what i'm doing and it's working great
02:33 🔗 aschmitz Hmm. That'll be ordered, though.
02:33 🔗 aschmitz Which isn't ideal, but better than busy waiting.
02:34 🔗 tpw_rules ordered?
02:34 🔗 aschmitz Alphabetical by item name, no?
02:34 🔗 aschmitz (The requests)
02:34 🔗 tpw_rules oh yeah, i think
02:34 🔗 GitHub17/#internetarchive.bak IA.BAK/server 56507ef Joey Hess: sketchcow's geoip extractor
02:34 🔗 GitHub17/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jh2S
02:35 🔗 closure aschmitz: by default it looks at all 100 thousand files, finds ones that don't have enough copies, scrambles the list, and downloads at random. this takes a while
02:35 🔗 tpw_rules did that update just recently happen?
02:35 🔗 tpw_rules because mine is doing it alphabetically
02:36 🔗 tpw_rules using like git annex get --not --copies 3
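
A rough shell equivalent of the default behaviour closure describes, run inside the shard repo — list under-replicated files, scramble, then download. Assumes GNU shuf; the copies target of 3 is taken from the command above:

    git annex find --not --copies 3 --print0 | shuf -z | xargs -0 git annex get --
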
02:36 🔗 aschmitz Well, I think it had picked some to download, as it has a huge command line that looks like a result of that.
02:39 🔗 closure hmm idk
02:40 🔗 aschmitz Hm, interesting problem.
02:41 🔗 aschmitz Looks like some of these items have since been darked.
02:41 🔗 tpw_rules there are many in the shard that have been
02:42 🔗 trs80 I just killed that process and a new one started, which is now writing stuff
02:42 🔗 tpw_rules (though i'm not sure what 'darked' means in IA lingo? is it permanent?)
02:42 🔗 GitHub35/#internetarchive.bak IA.BAK/server 1dafefc Joey Hess: perm fixup
02:42 🔗 GitHub35/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhVP
02:42 🔗 trs80 although that git is now defunct again
02:42 🔗 aschmitz "darking" just makes items unavailable to the public, the IA keeps a copy, and could revert that if they wanted to. Doesn't usually happen, though, as far as I know.
02:43 🔗 GitHub177/#internetarchive.bak IA.BAK/server efda62a Joey Hess: sketch had a sort -u in there which I forgot
02:43 🔗 GitHub177/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhVH
02:47 🔗 trs80 closure: so the shuf completes, but the git-annex subprocess is dying later on
02:51 🔗 closure SketchCow: here we are! http://iabak.archiveteam.org/stats/SHARD1.geolist
02:52 🔗 closure trs80: hmm, if git cat-file is dying for some reason that must be the problem. I'd like to debug this, but not tonight
02:52 🔗 tpw_rules it has my zip code wrong :(
02:53 🔗 sep332 it's got my city right - probably because the box is sitting inside the ISP lol
02:53 🔗 aschmitz Yay ICBM addresses for everyone. :-/
02:54 🔗 tpw_rules sep332: how do you do these things? what isp
02:54 🔗 garyrh Must get IA in Antarctica...
02:54 🔗 sep332 oh i cheat! i work there ;)
02:55 🔗 tpw_rules oh
02:55 🔗 tpw_rules but shit, that lat/lon is only 3 miles from my house
02:56 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak/
02:56 🔗 tpw_rules whoever is doing that may want to chop off a couple digits just in case
02:57 🔗 SketchCow if you're truly concerned about the lat-long
02:57 🔗 SketchCow We can remove it.
02:57 🔗 closure SketchCow: hmm, why only 6 clients?
02:57 🔗 SketchCow we remove IP already.
02:57 🔗 SketchCow closure: Old data
02:57 🔗 SketchCow I've been hacking, bro!
02:57 🔗 underscor closure: is it expected that the git cat-file sits using a bunch of cpu for a while before downloading starts?
02:57 🔗 aschmitz Personally I'd stick with country and region, but I wouldn't fight over it or anything.
02:58 🔗 tpw_rules i'm personally not concerned at all. but it's a concern in the community. i'd probably round to one decimal
02:58 🔗 closure underscor: yes
02:58 🔗 SketchCow closure - kill lat-long
02:58 🔗 SketchCow The code is obvious in the script
02:58 🔗 SketchCow And my thing doesn't care.
03:01 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak/ now upgraded with all 15 clients
03:01 🔗 GitHub151/#internetarchive.bak IA.BAK/server 61a093f Joey Hess: de-icbm
03:01 🔗 GitHub151/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhKz
03:03 🔗 trs80 closure: yeah, git ls-files is what goes defunct
03:03 🔗 closure oh, interesting it's ls-files
03:04 🔗 trs80 right now it's stuck in write(1,
03:04 🔗 closure kinda suggests it's due to all those files being shoved through the command line and to ls-files
03:04 🔗 trs80 so the destination pipe is full I guess
03:04 🔗 closure ls-files is stuck in write?
03:05 🔗 trs80 yeah
03:05 🔗 SketchCow Now the fun part, more nerdy than anything.
03:05 🔗 SketchCow I want to add a second visual chart.
03:05 🔗 SketchCow I mean a second graphic chart. Now to understand how to make the api not blow up.
03:05 🔗 tpw_rules to do what
03:05 🔗 SketchCow just US.
03:06 🔗 SketchCow Because it's important to know how far from IA ground zero they are.
03:06 🔗 tpw_rules oh
03:06 🔗 tpw_rules where is IA ground zero?
03:06 🔗 SketchCow San Francisco.
03:06 🔗 tpw_rules ah
03:06 🔗 SketchCow We have one person in Walnut Creek
03:06 🔗 tpw_rules also why can't you just mail tapes to maine or something
03:06 🔗 SketchCow Fuck that guy, bomb's going to get him too
03:06 🔗 SketchCow that's a different solution path
03:06 🔗 SketchCow We can build a really nice off-road car, AND work on our sailboat
03:07 🔗 SketchCow AND our drone army
03:07 🔗 tpw_rules yes
03:07 🔗 tpw_rules where is amazon headquarters
03:07 🔗 SketchCow Seattle
03:07 🔗 tpw_rules oh
03:07 🔗 SketchCow But bear in mind, their shit is EVERYWHERE
03:07 🔗 tpw_rules i was gonna say "free IA tape with every drone delivery"
03:07 🔗 tpw_rules beam gps at them so they come to your facility, attach a tape, then let them go
03:09 🔗 tpw_rules do you think you'll be able to accelerate backup faster than the archive adds new crap?
03:11 🔗 SketchCow How did you stumble into this project?
03:11 🔗 SketchCow My tweet?
03:11 🔗 tpw_rules a tweet from @textfiles
03:11 🔗 SketchCow Holding off on USian graph - just because we had some nice advancement today, don't need to hack to 6am
03:11 🔗 aschmitz tpw_rules: SketchCow = @textfiles
03:11 🔗 tpw_rules ah. so yes
03:12 🔗 SketchCow I'll add US later.
03:13 🔗 SketchCow So, tpw_rules - I could say a very long thing, or I could say "the mountain must be climbed".
03:13 🔗 SketchCow Is "the mountain must be climbed" sufficient or do you want the long thing.
03:13 🔗 tpw_rules that's good enough
03:13 🔗 tpw_rules i'll do my part
03:14 🔗 SketchCow the project is forcing a mass assessment of the archive
03:14 🔗 tpw_rules also particularly in the US good fucking luck finding people without capped internet
03:14 🔗 tpw_rules i have to pay a ridiculous amount extra for no cap
03:14 🔗 SketchCow Which was desperately needed.
03:14 🔗 tpw_rules how is it protected internally? do you do tape or something in the HQ?
03:15 🔗 SketchCow Everything with little exception is on spinning disks
03:15 🔗 tpw_rules are they all spinning at once?
03:15 🔗 tpw_rules i assume you can recover from a failed drive for example. but what about an accidental rm -rf?
03:16 🔗 tpw_rules also btw textfiles.whatever is real neat
03:17 🔗 tpw_rules i have to confess to being a youngin, so i was never around for that. but it's cool to read about
03:17 🔗 SketchCow textfiles.whatever has always been proud of bringing history to the youngins
03:18 🔗 SketchCow unless you make a bomb, and then we know you're not a smart youngin and we let evolution sort that out
03:18 🔗 tpw_rules i knew about it but never really immersed myself in it. i'm at least fluent in 6502 assembly language though, but not the culture
03:18 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak/ now lists the countries because I got tired of counting.
03:18 🔗 SketchCow Or counts, anyway
03:19 🔗 tpw_rules can you put a size over the tree view at the top?
03:19 🔗 SketchCow I did. 2.91 terabytes.
03:19 🔗 tpw_rules no i mean for each box
03:20 🔗 SketchCow Not right now, no.
03:20 🔗 tpw_rules ie 1.7TB is not redundant at all
03:20 🔗 SketchCow use the areaaaaaa
03:20 🔗 SketchCow that's what it's forrrrr
03:20 🔗 tpw_rules gets out ruler
03:20 🔗 tpw_rules also textfiles.com*
03:20 🔗 SketchCow Also http://textfil.es/
03:21 🔗 SketchCow for those pesky blockers
03:21 🔗 yipdw how is there not a .whatever gTLD at this point
03:21 🔗 tpw_rules lol. i need to try it at school
03:21 🔗 tpw_rules (though i always run with a vpn)
03:24 🔗 tpw_rules can you guys remove a repo from the list? i deleted everything from mine because it was being funky and now i show up twice. 1d92bde5-54d3-41bc-932e-d8e8e7bfff51 is my real one and ff2f752d-b35a-4555-b8b4-617f23e4e015 is bad
03:26 🔗 trs80 touches NOSHUF and starts again
03:26 🔗 trs80 ahh, sweet downloads
03:29 🔗 closure trs80: I reproduced the problem.. so I'll probably be able to fix it
03:29 🔗 trs80 closure: ah, cool. was going to say I'm in UTC+8 if you wanted to look at it another time
03:29 🔗 closure it seems it prints out an enormous list of directory names before stalling?
03:30 🔗 tpw_rules okay it's sleepy time for me. closure can you delete that extra repo?
03:30 🔗 closure tpw_rules: we could, but let's not worry about it. We want to automatically detect dead repos and disregard them
03:30 🔗 tpw_rules ok. i just noticed it with annex info
03:31 🔗 tpw_rules but goodnight. got 500gb so far
03:31 🔗 trs80 closure: yeah, that sounds like what's happening
03:32 🔗 closure that is seriously weird. it's like it thinks that's all one file
03:32 🔗 closure I think I'll just make it run git-annex once per dir for now, and debug this tomorrow
03:34 🔗 GitHub127/#internetarchive.bak IA.BAK/master b3e2de7 Joey Hess: temporary workaround for strange hang-of-doom when git-annex is given a really, really big list of dirs to get
03:34 🔗 GitHub127/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jhDX
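
One plausible shape of that per-directory workaround — invoke git-annex once per top-level directory instead of handing it one enormous list (the copies target is an assumption):

    for dir in */; do
        git annex get --not --copies 3 -- "$dir"
    done
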
03:43 🔗 trs80 closure: the workaround wfm, although I've still got a defunct git process (not sure what type)
03:43 🔗 closure *** SO, if iabak is stuck not doing anything and eating cpu, now's a good time to restart it ***
03:43 🔗 trs80 cat-file and wget are fine though
03:43 🔗 underscor remote: error: hook declined to update refs/heads/synced/master
03:43 🔗 underscor :(
03:43 🔗 closure underscor: intentional, you're not supposed to be changing the master branch
03:44 🔗 underscor oh
03:44 🔗 underscor did I do something wrong?
03:44 🔗 underscor haha
03:44 🔗 underscor how do I check?
03:44 🔗 closure git log master
03:45 🔗 underscor aha
03:45 🔗 underscor wonder how that git commit happened, weird
03:45 🔗 closure and then you'll want to git reset --hard HEAD^ or so :)
03:45 🔗 closure but, please carry on trying to break it ;)
03:45 🔗 closure just don't break it by doing horrible commits in the git-annex branch, that is not checked yet at all
03:46 🔗 underscor ok
03:46 🔗 SketchCow OH LOOK WHAT BROUGHT BACK UNDERSCOR
03:46 🔗 underscor :D
03:47 🔗 yipdw wow
03:48 🔗 underscor man my shard is still really broken
03:49 🔗 underscor closure: I reverted the commit and did the reset --hard
03:49 🔗 underscor but it's still trying to commit master on annex sync
03:49 🔗 closure underscor: you probably have a synced/master branch lying around with the bad commit in it, which you'd need to delete
03:50 🔗 underscor closure: is delete different than revert in this context?
03:50 🔗 closure oh, but commit master .. idk, why it would have something to commit
03:51 🔗 underscor http://p.defau.lt/?rj4PyeY9VmzbB5LnHYw2qQ
03:51 🔗 closure yeah, git branch --delete synced/master
03:52 🔗 underscor closure: and now, http://p.defau.lt/?0gvXoSd272FIja7elUcgBQ
03:53 🔗 closure you need to delete synced/master and reset master, both
03:54 🔗 underscor yay!
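
The full recovery sequence, combining the two commands closure gave above:

    git branch --delete synced/master    # drop the synced branch carrying the bad commit (-D if unmerged)
    git reset --hard HEAD^               # rewind master past the accidental commit
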
04:00 🔗 closure trs80: ok, figured it out. It's simply an exponential blowup due to some fancy stuff it tries to do with the command line. Plus possibly a little bit of truncation
04:01 🔗 SketchCow Improvement Continues!
04:02 🔗 SketchCow Eventually, I will turn the graph page into an ad to help with the experiment.
04:12 🔗 espes__ (~espes@[redacted]) has joined #internetarchive.bak
04:17 🔗 closure sweet, sped that up by like 1000x
04:36 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
04:40 🔗 GitHub152/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jhN2
04:40 🔗 GitHub152/#internetarchive.bak IA.BAK/server 756666f Joey Hess: grep the compressed auth.log too, to get a full month of IPs
05:04 🔗 zottelbey has quit (Remote host closed the connection)
06:14 🔗 GitHub59/#internetarchive.bak IA.BAK/server 930b511 Joey Hess: gc repo too
06:14 🔗 GitHub59/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/jjY8
06:16 🔗 GitHub154/#internetarchive.bak IA.BAK/master 30f5611 Joey Hess: remove debug output
06:16 🔗 GitHub154/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/jjYF
07:05 🔗 bzc6p_ (~bzc6p@[redacted]) has joined #internetarchive.bak
07:10 🔗 trs80 right, 10 iabaks running, should be done in just over a day
07:11 🔗 bzc6p has quit (Ping timeout: 600 seconds)
07:52 🔗 GitHub182/#internetarchive.bak IA.BAK/pubkey 0e91e20 Daniel Brooks: another for me
07:52 🔗 GitHub182/#internetarchive.bak [IA.BAK] db48x pushed 1 new commit to pubkey: http://git.io/jjwL
08:11 🔗 londoncal (~londoncal@[redacted]) has joined #internetarchive.bak
08:48 🔗 londoncal has quit (Quit: Leaving...)
08:52 🔗 bzc6p_ has quit (Read error: Connection reset by peer)
08:53 🔗 bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
09:12 🔗 edsu (~edsu@[redacted]) has joined #internetarchive.bak
09:23 🔗 midas hm? you can start multiple jobs on 1 box?
09:27 🔗 db48x yea, git annex commands very carefully avoid stepping on their own toes
09:28 🔗 db48x you can run 'git annex get' as many times in parallel as you want
09:28 🔗 db48x easiest way to do that is to run iabak multiple times
09:29 🔗 db48x and then while they're running you can run git annex get manually to pull down a specific item that you're interested in
09:37 🔗 bzc6p midas: Except for network filesystems.
09:39 🔗 midas good point :p
11:34 🔗 hater db48x: i think gnu/parallel (http://www.gnu.org/software/parallel/ ) would be nice to build into the helper-script (as some kind of option)
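
A sketch of what such an option might run — the job count is arbitrary, and the output-interleaving problem db48x raises below still applies:

    git annex find --not --copies 3 --print0 | parallel -0 -j4 git annex get -- {}
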
11:57 🔗 GitHub79/#internetarchive.bak [IA.BAK] zottelbeyer opened pull request #9: correct zottelbeyer's pubkey (pubkey...patch-1) http://git.io/veebN
11:59 🔗 GitHub75/#internetarchive.bak IA.BAK/pubkey 09a303b Daniel Brooks: Merge pull request #9 from zottelbeyer/patch-1...
11:59 🔗 GitHub75/#internetarchive.bak IA.BAK/pubkey 6bb10d9 zottelbeyer: correct zottelbeyer's pubkey...
11:59 🔗 GitHub169/#internetarchive.bak [IA.BAK] db48x closed pull request #9: correct zottelbeyer's pubkey (pubkey...patch-1) http://git.io/veebN
11:59 🔗 GitHub75/#internetarchive.bak [IA.BAK] db48x pushed 2 new commits to pubkey: http://git.io/veeN1
12:00 🔗 db48x hater: possibly. it might be easier to build support for concurrent downloads into git annex itself
12:16 🔗 hater i am too lazy to learn haskell to program a tool which already exists
12:17 🔗 hater https://git-annex.branchable.com/todo/parallel_get/ <-- Posted 3 months and 14 days ago
14:43 🔗 sep332 i switched over to using the iabak script and i'm getting an error
14:43 🔗 sep332 error: Untracked working tree file 'internetarchivebooks/100storyofpatrio00sinc/100storyofpatrio00sinc_archive.torrent' would be overwritten by merge.
14:43 🔗 sep332 can I just delete the file and try again?
14:53 🔗 closure that's weird.. is your repository in direct mode maybe?
14:53 🔗 closure git annex info --fast
14:54 🔗 closure 1st line
14:54 🔗 sep332 i just did a "git clone" and then copied the files to the shard1 folder
14:54 🔗 sep332 "indirect"
14:54 🔗 closure oh, hm, so you switched over by copying files?
14:55 🔗 sep332 yeah
14:55 🔗 closure I hope you copied .git/annex, that's where the actual downloads are
14:55 🔗 closure but really, the right way is to just move your old git repo to IA.BAK/shard1
14:55 🔗 sep332 ok. i didn't realize about .git
14:56 🔗 closure suggest you move the files back to the old repo, delete the new repo, and move the old repo
15:04 🔗 sep332 ok, it's working fine. thanks closure
15:14 🔗 closure wow, we're over 50% on SHARD1
15:14 🔗 closure er, no. Over 25% :)
15:14 🔗 sep332 well... i have 1.9TB on this drive now
15:15 🔗 sep332 shouldn't that be higher then?
15:15 🔗 closure maybe.. could be your client has not communicated back, if you just started running the script
15:15 🔗 closure did you ever git annex sync manually before?
15:15 🔗 closure script does it once an hour
15:16 🔗 sep332 yeah, i stole that snippet that runs sync every hour
15:16 🔗 sep332 and i ran it manually twice in the last hour
15:16 🔗 closure well, at 3 copies, SHARD1 needs 9 tb
15:17 🔗 SketchCow Right.
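
(That is, 2.91 TB of unique data × 3 client copies ≈ 8.7 TB of client disk in total — roughly 9 TB.)
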
15:18 🔗 closure of course, the graph is counting by files not by size anyway. So somewhat comparing apples and oranges
15:18 🔗 SketchCow Shhh
15:18 🔗 SketchCow Don't wreck my dreams
15:19 🔗 SketchCow I agree, size is ideal.
15:19 🔗 SketchCow But I like incrementing, after all
15:20 🔗 SketchCow http://blog.dshr.org/2015/03/the-opposite-of-lockss.html
15:20 🔗 SketchCow (My comment at end)
15:24 🔗 SketchCow closure: If you create output files of data updated regularly about the activity, I'll make pretty graphs that display them.
15:25 🔗 closure SketchCow: how about a connecting clients per hour graph?
15:32 🔗 SketchCow I'm for any textfiles you want to generate.
15:32 🔗 SketchCow I'm not doing particularly smart graphing, so I'm converting files into graphs
16:13 🔗 hater warning: the iabak-helper script is broken atm: someone changed the output of 'git-annex version' - i pulled a bugfix but it is not merged into the master atm
16:21 🔗 hater here is the bugfix: https://github.com/cancerAlot/IA.BAK/commit/6c432e4808ebad9cbcb33902535a575d6b687f0e
16:55 🔗 SketchCow closure: How hard is it for me to give you a collection and have you go "aaaaand here's the stats on that."
16:56 🔗 SketchCow i.e. how big it is (originals and system files, number of items)
17:01 🔗 svchfoo1 has quit (Quit: Closing)
17:06 🔗 svchfoo1 (~chfoo1@[redacted]) has joined #internetarchive.bak
17:09 🔗 svchfoo2 gives channel operator status to svchfoo1
17:14 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
17:20 🔗 zottelbey alright, it's working now! though the speed is somewhat terrible.
17:22 🔗 hater db48x: 'one tool for one thing' - does implementing parallel-support into git-annex violate that 'rule'?
18:58 🔗 patricko- is now known as patrickod
19:03 🔗 GitHub0/#internetarchive.bak [IA.BAK] db48x created git-annex from synced/git-annex (+0 new commits): http://git.io/veU3j
19:03 🔗 GitHub188/#internetarchive.bak [IA.BAK] db48x created synced/git-annex from git-annex (+0 new commits): http://git.io/veU3p
19:03 🔗 GitHub169/#internetarchive.bak [IA.BAK] db48x created synced/master from master (+0 new commits): http://git.io/veU3h
19:04 🔗 GitHub97/#internetarchive.bak IA.BAK/server 2bad8e6 Joey Hess: add a client connections per hour data file
19:04 🔗 GitHub97/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUsE
19:09 🔗 GitHub62/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUZH
19:09 🔗 GitHub62/#internetarchive.bak IA.BAK/server 14ffc71 Joey Hess: typo
19:11 🔗 closure hater: current version code has: git-annex version | grep "version" -m 1
19:11 🔗 closure which seems to work ok...
19:13 🔗 closure goes and adds a git annex version --raw anyway
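
The two approaches side by side, --raw being the option closure mentions adding:

    git-annex version | grep "version" -m 1    # keep only the first line mentioning the version
    git annex version --raw                    # print the bare version string and nothing else
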
19:14 🔗 GitHub73/#internetarchive.bak IA.BAK/server 0ac68d4 Joey Hess: typo2
19:14 🔗 GitHub73/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUCi
19:17 🔗 closure SketchCow: I can ingest a collection into a new shard pretty quickly, and then can do anything we can do with SHARD1
19:18 🔗 closure by pretty quickly, 10 minutes or so
19:18 🔗 SketchCow Which is great.
19:18 🔗 SketchCow Mostly, I just wanted the ability for you to look at a collection and go "it's this big"
19:19 🔗 closure I've only done that for the number of files, which is all I care about, not disk size. The data is available in the census though
19:20 🔗 SketchCow Anyway, I have our next collection, I think.
19:20 🔗 SketchCow usfederalcourts
19:21 🔗 SketchCow and genealogy
19:21 🔗 SketchCow But obviously I think we should be at a SOLID 4 for current shard before we add more shards.
19:21 🔗 SketchCow And we have some shard-punching to do, etc.
19:23 🔗 closure are we going to 3 or to 4?
19:23 🔗 closure and by 4 I mean, 4 including IA
19:24 🔗 closure it's a big decision
19:24 🔗 sep332 aw, what's 14PB between friends
19:25 🔗 closure ... times 1770
19:25 🔗 closure er, you already multiplied, didn't you
19:25 🔗 closure numbers too big
19:25 🔗 closure SketchCow: so here is a new textfile for you.. http://iabak.archiveteam.org/stats/SHARD1.clientconnsperhour
19:26 🔗 closure that is the number of clients that connected for that shard, per hour.
19:26 🔗 SketchCow client connections
19:26 🔗 SketchCow Since people seem to be uber-connecting
19:26 🔗 closure the guys that are running concurrent iabak scripts count multiple
19:26 🔗 sep332 it would take 3 days for my computer to count that high
19:26 🔗 closure call it "worker threads" or something
19:27 🔗 closure it would make a nice bar graph
19:27 🔗 SketchCow closure: http://iabackup.archiveteam.org/ia.bak/
19:27 🔗 SketchCow I'm assuming we're working to make it 100% Dark Green
19:27 🔗 closure that's how it's set right now, yes
19:27 🔗 SketchCow (the area graph. Making the map 100% dark green will take longer, muhahaha)
19:28 🔗 closure 13124639 usfederalcourts.list
19:28 🔗 SketchCow closure: Well, that's what I'm shooting for.
19:28 🔗 SketchCow Yes, usfederalcourts will be 1.3tb
19:28 🔗 closure 13 million.. so, that's 130 shards. They may be smaller than usual disk size, I dunno
19:28 🔗 closure that's file count
19:28 🔗 SketchCow Ah.
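
(That is, 13,124,639 files ÷ ~100,000 files per shard ≈ 131, hence the rough "130 shards" estimate.)
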
19:29 🔗 SketchCow Well, anyway, point is that I always assumed "4" (3+IA). Everything green
19:29 🔗 closure oh, ok. I pulled COPIES=4 in iabak from /dev/ass
19:30 🔗 SketchCow My documentation and writing mentions it
19:30 🔗 SketchCow I bet you got it there
19:30 🔗 SketchCow These "sectors" are then checked into the virtual drive, and based on if there's zero, one, two or more than two copies of the item in "The Drive", a color-coding is assigned. (Red, Yellow, Green).
19:30 🔗 bzc6p has quit (Read error: Operation timed out)
19:31 🔗 SketchCow I just stole that idea from Josh S., creator of Delicious and most bitter Google Employee ever
19:31 🔗 SketchCow Who told me GMail works on "5 copies of mail, in 3 discrete geographical locations, at all times"
19:32 🔗 closure 5 is my bare minimum replication for important personal data. and yeah, 3 locations
19:32 🔗 SketchCow See? So we both agree
19:32 🔗 SketchCow IA + 3
19:32 🔗 SketchCow (IA is two)
19:32 🔗 SketchCow (Sort of)
19:32 🔗 SketchCow (Let's pretend I said it was)
19:32 🔗 bzc6p (~bzc6p@[redacted]) has joined #internetarchive.bak
19:32 🔗 SketchCow I mean, it's definitely two copies, but occasionally two copies end up in the same building.
19:33 🔗 closure 89 genealogy.list
19:33 🔗 closure heh! well, we can fit that in somewhere
19:34 🔗 closure I wonder if that's really all of it. There is a weirdness in the census where an item can be in multiple collections, and the data I'm working from just picked the first one
19:34 🔗 SketchCow I think it's not.
19:34 🔗 SketchCow It's huge.
19:35 🔗 db48x oh, I ran git annex sync in IA.BAK, not IA.BAK/shard1
19:35 🔗 db48x that's confusing
19:36 🔗 db48x hater: yes and no
19:41 🔗 db48x it'd be nice if we could always just use parallel (or any of a dozen alternatives), but there are a couple of problems with it
19:42 🔗 db48x interleaving the output of a bunch of git annex get commands is super annoying
19:43 🔗 db48x the number of jobs to run simultaneously is not obvious; what we really care about is how much bandwidth we're using
19:43 🔗 db48x some people want to use a lot, some people want to throttle it, some people want to do both at different times, or in different circumstances
19:43 🔗 closure yeah, a get that started/stalled to saturate would be great
19:44 🔗 midas this works a lot better with 10 gets
19:45 🔗 patrickod is now known as patricko-
19:46 🔗 closure "collection":["1880_census","microfilm","americana","us_census","genealogy","additional_collections"]
19:47 🔗 db48x some have a cap and don't care about throughput, but only the total data uploaded/downloaded
19:47 🔗 closure seeing a lot of that kind of thing.. that's presumably why genealogy has so few items, they all went to other more specific things
19:48 🔗 closure 35518 1880_census
19:48 🔗 SN4T14_ (~SN4T14@[redacted]) has joined #internetarchive.bak
19:49 🔗 closure 57995 jstor_virglawregi
19:51 🔗 closure wonders if we can go to 200 thousand or so per shard. Have not noticed many scaling issues with 100k files. Except for that startup delay for shuffling..
19:54 🔗 closure 103554 nasa_techdocs .. that would be a nice shard
19:54 🔗 db48x closure: this is a side issue, but I just noticed that every single iabak-helper I've ever run is still waiting around to do a sync every hour
19:55 🔗 closure because they bg?
19:55 🔗 db48x yep
19:55 🔗 closure perhaps it should fork off a single syncer if none is running
19:56 🔗 closure let's see, what lock file program is portably available..?
19:56 🔗 closure I'm thinking maybe perl
19:57 🔗 SN4T14 has quit (Ping timeout: 512 seconds)
19:57 🔗 db48x doesn't the assistant already do that?
19:57 🔗 midas you can try the ftp boneyard, it's big and has huge files
19:59 🔗 closure it does.. does some other stuff we maybe don't want
19:59 🔗 closure oh, util-linux has flock(1) now
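
A minimal sketch of that idea using flock(1) from util-linux — the lock-file name and the hourly interval are illustrative:

    (
        flock -n 9 || exit 0    # another syncer already holds the lock; bail out
        while true; do
            git annex sync      # report progress back to the server
            sleep 3600
        done
    ) 9> .iabak-sync.lock
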
20:05 🔗 GitHub120/#internetarchive.bak IA.BAK/master 2c3a13d Joey Hess: use separate program for hourly background sync, and use lock file so only 1 runs
20:05 🔗 GitHub135/#internetarchive.bak IA.BAK/server c061607 Joey Hess: this script seems to have bit rotted since I last ran it
20:05 🔗 GitHub120/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veUDM
20:05 🔗 GitHub135/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veUD1
20:10 🔗 closure here's a thought. SHARD2 could be made by taking the *smallest* collections, until we get to 100k files.
20:11 🔗 closure that turns out to be 3537 collections.
20:11 🔗 closure with the larger collections in it being ones like TheIncredibleSandwich, TheFivePercent, KingsInDisguise, HOLLER_band
20:11 🔗 closure public_library_of_science, usda-agriculturalhistoryseries
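
The packing rule sketched in shell, assuming a "filecount collection" list in the format quoted above (collections.lst is a hypothetical filename):

    sort -n collections.lst |
        awk '{ total += $1; print $2; if (total >= 100000) exit }'
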
20:14 🔗 patricko- is now known as patrickod
20:22 🔗 patrickod is now known as patricko-
20:24 🔗 hater closure: "grep "version" -m 1" - that "-m 1"-part was not in the sourcecode when i wrote the patch
20:39 🔗 closure hater: so, it's ok now?
20:41 🔗 hater yes
20:48 🔗 closure http://iabak.archiveteam.org/candidateshards/
20:48 🔗 closure so, that's some lists of collections, starting with the ones with the fewest files. Most of the lists are 100k files
20:49 🔗 closure around 100-150 there are some interesting sets of collections
20:50 🔗 closure http://iabak.archiveteam.org/candidateshards/smallestfirst118.lst I like this one
20:50 🔗 closure http://iabak.archiveteam.org/candidateshards/smallestfirst118.lst
20:50 🔗 closure oop
20:50 🔗 closure has: NISTJournalofResearch, 1880_census, speedydeletionwiki
20:51 🔗 closure http://iabak.archiveteam.org/candidateshards/smallestfirst107.lst archiveteam + glennbeck + some jstor
20:54 🔗 closure http://iabak.archiveteam.org/candidateshards/smallestfirst10.lst nice grab bag
20:55 🔗 GitHub14/#internetarchive.bak IA.BAK/server b39791a Joey Hess: add simple shard packer...
20:55 🔗 GitHub14/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veTLk
21:15 🔗 GitHub1/#internetarchive.bak [IA.BAK] cancerAlot closed pull request #8: install the right arch of git-annex (master...master) http://git.io/jpBr
21:16 🔗 closure SketchCow: another line for your graph.. http://iabak.archiveteam.org/stats/SHARD1.filestransferred
21:16 🔗 closure this will get 1 line added per hour, with the timestamp, and the total number of files transferred so far.
21:17 🔗 GitHub57/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veTnP
21:17 🔗 GitHub57/#internetarchive.bak IA.BAK/server f5f8a53 Joey Hess: add filestransferred data file
21:17 🔗 GitHub32/#internetarchive.bak [IA.BAK] cancerAlot opened pull request #10: checks if "reserve" is less than the available space (master...master) http://git.io/veTnH
21:18 🔗 hater closure: https://github.com/ArchiveTeam/IA.BAK/pull/10 closes the issue: https://github.com/ArchiveTeam/IA.BAK/issues/7
21:22 🔗 GitHub108/#internetarchive.bak IA.BAK/master 13e1b64 cancerAlot: Merge pull request #1 from ArchiveTeam/master...
21:22 🔗 GitHub108/#internetarchive.bak IA.BAK/master 6c432e4 cancerAlot: bugfix because the 'git-annex version' output was changed
21:22 🔗 GitHub108/#internetarchive.bak IA.BAK/master 7e2cf45 cancerAlot: .
21:22 🔗 GitHub9/#internetarchive.bak [IA.BAK] joeyh closed pull request #10: checks if "reserve" is less than the available space (master...master) http://git.io/veTnH
21:22 🔗 GitHub108/#internetarchive.bak [IA.BAK] joeyh pushed 5 new commits to master: http://git.io/veTCQ
21:38 🔗 svchfoo2 has quit (Ping timeout: 240 seconds)
21:39 🔗 svchfoo2 (~chfoo2@[redacted]) has joined #internetarchive.bak
21:39 🔗 svchfoo1 gives channel operator status to svchfoo2
21:53 🔗 patricko- is now known as patrickod
21:55 🔗 patrickod is now known as patricko-
22:12 🔗 hater who is able to add my ssh public key?
22:14 🔗 zottelbey could you implement a thread option in the script? i now have to run 4 ttys to reach 1.2-2MiB/s. i would like to max out my 7MB/s without opening another 15.
22:15 🔗 hater zottelbey: parallel downloading is in progress
22:15 🔗 zottelbey neat.
22:15 🔗 hater https://git-annex.branchable.com/todo/parallel_get/
22:15 🔗 zottelbey hater, also for the pubkey: anyone with write access to the git repo.
22:16 🔗 zottelbey i dont care about the output tbh i just want it to be faster.
22:17 🔗 zottelbey iabak could just run n copies of git-annex.
22:19 🔗 zottelbey "Last edited 3 months and 15 days ago" tells me git-annex is not going to get there any time soon probably.
22:24 🔗 yipdw the author of git-annex is present and has been making changes to better suit it for this project
22:24 🔗 yipdw no need to be an ass
22:25 🔗 db48x hater: I can add your key
22:25 🔗 zottelbey yipdw, sorry, didnt mean to offend anyone.
22:25 🔗 yipdw np, I probably read too far into it
22:25 🔗 zottelbey probably.
22:26 🔗 hater db48x: i sent the link in the query
22:26 🔗 Kazzy hm, is it possible to make the script store the "backup data" in a different location than the iabak scripts etc? Having issues, script seems to create symlinks, which doesn't play well with smb/cifs shares
22:27 🔗 GitHub37/#internetarchive.bak IA.BAK/pubkey 9886116 Daniel Brooks: add hater's public key
22:27 🔗 GitHub37/#internetarchive.bak [IA.BAK] db48x pushed 1 new commit to pubkey: http://git.io/veTMP
22:27 🔗 hater db48x: thx
22:27 🔗 db48x yw
22:29 🔗 db48x Kazzy: it's currently not an option, but you can edit iabak-helper to change the location
22:30 🔗 db48x Kazzy: you could also go into the shard1 directory and run 'git annex direct' to switch to direct mode, which doesn't use symlinks
22:32 🔗 Kazzy hm, i'll take a shot at changing paths in the iabak-helper script, see if it'll work that way, cheers
22:33 🔗 hater Kazzy: if something useful comes out, push it into the repo ;)
22:34 🔗 Kazzy well, at first it'll just be changing the hardcoded paths, I guess, will see where it goes from there
22:47 🔗 GitHub64/#internetarchive.bak [IA.BAK] kurtmclester opened pull request #11: Changed key. -Kazzy (pubkey...pubkey) http://git.io/veTHj
22:50 🔗 GitHub83/#internetarchive.bak IA.BAK/pubkey 6a6a11d Kurt: Changed key. -Kazzy
22:50 🔗 GitHub83/#internetarchive.bak IA.BAK/pubkey ccb9ea4 Daniel Brooks: Merge pull request #11 from kurtmclester/pubkey...
22:50 🔗 GitHub123/#internetarchive.bak [IA.BAK] db48x closed pull request #11: Changed key. -Kazzy (pubkey...pubkey) http://git.io/veTHj
22:50 🔗 GitHub83/#internetarchive.bak [IA.BAK] db48x pushed 2 new commits to pubkey: http://git.io/veTQK
22:52 🔗 Kazzy git-annex-shell: user error (git ["config","--null","--list"] exited 126)
22:53 🔗 db48x can you show the output from prior to that?
22:53 🔗 Kazzy Checking ssh to server...
22:53 🔗 Kazzy only bit before that is: Hit Enter once you're signed up!
22:53 🔗 Kazzy then throws that error, and asks me to sign up for access again
22:54 🔗 closure Kazzy: my guess is you've mangled the path to the git repo on the server
22:55 🔗 closure git remote add origin "$user:$dir"
22:55 🔗 closure since that uses $dir, if you changed it, it'll look in the wrong place on the server
22:56 🔗 Kazzy oh right hm, yeah didn't notice all the $dir references in there, will try some more poking
22:59 🔗 db48x all those dir variables should probably stay as-is
23:04 🔗 db48x just change to a different directory before the if on line 126
23:05 🔗 closure oho, stats have been broken today!
23:06 🔗 closure seems I have a stupid permissions error
23:09 🔗 closure omg
23:10 🔗 closure +1 and +2 have *both* overtaken +0 in the stats!
23:10 🔗 closure numcopies +0: 17275
23:10 🔗 closure numcopies +1: 42961
23:10 🔗 closure numcopies +2: 32420
23:10 🔗 closure numcopies +3: 10149
23:10 🔗 closure numcopies +4: 490
23:10 🔗 closure wonders if SketchCow's script will handle this, I forgot it sorted it like that
23:11 🔗 closure at 2 am yesterday, we had numcopies +0: 54519 ..
23:14 🔗 closure we've doubled the total files transferred today
23:20 🔗 db48x Kazzy: https://github.com/db48x/IA.BAK/commit/a320bbbf0abd1359c0b20fbe7f412864437fa357
23:21 🔗 Kazzy oh nice, thanks for that one
23:21 🔗 Kazzy will take a shot at that now
23:23 🔗 hater i love this channel; someone asks for a feature and it is available in less than an hour
23:30 🔗 zottelbey has quit (Remote host closed the connection)
23:31 🔗 closure http://teamarchive1.fnf.archive.org/ia.bak/graph.html \o/
23:35 🔗 patricko- is now known as patrickod
23:46 🔗 patrickod is now known as patricko-
