#internetarchive.bak 2015-04-06,Mon


Time Nickname Message
00:19 🔗 cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
00:42 🔗 beardicus (~beardicus@[redacted]) has joined #internetarchive.bak
00:49 🔗 beardicus has quit (Quit: Textual IRC Client: www.textualapp.com)
02:24 🔗 yipdw has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | http://iabackup.archiveteam.org/ia.bak/ALL | #archiveteam
03:03 🔗 niyaje4 (~niyaje@[redacted]) has joined #internetarchive.bak
03:42 🔗 cf_ has quit (cf_)
05:41 🔗 realeyes_ (~Space_Cas@[redacted]) has joined #internetarchive.bak
05:41 🔗 realeyes_ is it too late to participate in the backup of the internet? lol
05:43 🔗 DFJustin nope, pull up a seat
05:43 🔗 realeyes_ is now known as realeyes
05:43 🔗 realeyes Jason Scott led me here
05:43 🔗 realeyes that guy around?
05:44 🔗 DFJustin he's SketchCow here, may be asleep at this hour but you never know with him
05:44 🔗 realeyes aha
05:45 🔗 realeyes so, i like the idea of this project
05:45 🔗 realeyes where do i start?
05:46 🔗 realeyes first things first i guess, how the hell do i register my nick on this network?
05:47 🔗 csssuf you don't
05:47 🔗 tpw_rules the magic of efnet :D
05:48 🔗 realeyes i dont usually frequent this network lol
05:48 🔗 tpw_rules it has literally no *servs
05:49 🔗 realeyes how do i hide my hostname? lol
05:49 🔗 csssuf you don't
05:49 🔗 tpw_rules disconnect?
05:49 🔗 csssuf ^
05:50 🔗 realeyes lol
05:50 🔗 realeyes alright so lets get started, how do i do my part here?
05:53 🔗 realeyes you guys rsync'ing the archive or what?
05:55 🔗 tpw_rules http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
05:55 🔗 tpw_rules i have to sleep so you'll have to bother somebody else for details
05:55 🔗 tpw_rules but it's not hard
06:00 🔗 trs80 realeyes: so run the git checkout, then run ./iabak
06:00 🔗 trs80 that'll create an ssh key, you'll need to bug someone here (closure or db48x) to get it added
06:06 🔗 realeyes ok
06:09 🔗 realeyes has quit (Textual IRC Client: www.textualapp.com)
06:09 🔗 realeyes (~Space_Cas@[redacted]) has joined #internetarchive.bak
06:21 🔗 realeyes so, does anyone here have a backup of the ia?
06:21 🔗 aschmitz has quit (Read error: Operation timed out)
06:21 🔗 kyan has quit (Ping timeout: 258 seconds)
06:21 🔗 Kenshin has quit (hub.efnet.us irc.Prison.NET)
06:21 🔗 db48x has quit (hub.efnet.us irc.Prison.NET)
06:21 🔗 aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
06:28 🔗 Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
06:28 🔗 db48x (~user@[redacted]) has joined #internetarchive.bak
06:28 🔗 irc.Prison.NET gives channel operator status to db48x Kenshin
06:28 🔗 kyan (~kyan@[redacted]) has joined #internetarchive.bak
06:48 🔗 jbenet_ has quit (Read error: Connection reset by peer)
06:48 🔗 lhobas_ has quit (Read error: Connection reset by peer)
06:48 🔗 jbenet_ (sid17552@[redacted]) has joined #internetarchive.bak
06:48 🔗 lhobas_ (sid41114@[redacted]) has joined #internetarchive.bak
06:49 🔗 mrfoo has quit (Read error: Connection reset by peer)
06:51 🔗 mrfoo (sid25914@[redacted]) has joined #internetarchive.bak
09:27 🔗 midas IA has back-ups
09:28 🔗 midas also, you can see the status of how far the backup is on the stats page
09:44 🔗 zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
12:05 🔗 shabble_ (~shabble@[redacted]) has left #internetarchive.bak
12:25 🔗 SketchCow The WHOLE IA, no.
12:26 🔗 SketchCow http://iabackup.archiveteam.org/ia.bak/ BOOM, Shard1 is done!
12:27 🔗 SketchCow Who here is part of Shard1, and how much data do you have (roughly)
12:29 🔗 Senji I have a very little bit (less than 0.2TB)
12:40 🔗 trs80 I have most of it, if not all
12:41 🔗 trs80 and my shard2 download stopped because I'm out of space
12:45 🔗 Senji my shard1 download has frozen; presumably it'll eventually figure out there's nothing left.
12:46 🔗 Senji I do have 50 shard2 dls running.
12:47 🔗 db48x I have 36GB + 51GB
12:47 🔗 Senji getting about 50KB/s each
12:49 🔗 Senji Which is failing to stretch my internet connection.
12:49 🔗 sankin (~sankin@[redacted]) has joined #internetarchive.bak
12:50 🔗 db48x Senji: maybe you need to download larger files?
12:51 🔗 Senji well, i'll just let it keep dribbling :-)
13:19 🔗 sep332 i have 2.0TB of shard1, but it's offline. I should get it back online today sometime
13:20 🔗 db48x :)
13:52 🔗 tpw_rules did the archive change somehow?
13:52 🔗 tpw_rules i'm trying to fsck and all of my _files.xml are too big
13:52 🔗 tpw_rules Bad file size (5.12 kB larger); moved to .git/annex/bad/MD5-s0--bc90afa1a66ce6a816d6fed309c68efd
13:52 🔗 tpw_rules fsck internetarchivebooks/59to60sanfrancisco00sanfrich/59to60sanfrancisco00sanfrich_files.xml
13:53 🔗 tpw_rules and SketchCow i have the entirety of shard1 except for somehow those files
13:55 🔗 tpw_rules even if i redownload them they're still bad
13:57 🔗 GitHub152/#internetarchive.bak IA.BAK/master 5a15958 Joey Hess: SHARD1 done! activate SHARD2
13:57 🔗 GitHub152/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veBNE
13:57 🔗 Quile (quile@[redacted]) has joined #internetarchive.bak
13:58 🔗 closure tpw_rules: interesting problem
13:58 🔗 tpw_rules that's a bad sign :D
13:59 🔗 closure the size from the survey and reality must not match for those xml files
14:00 🔗 closure only the _files.xml right, not the other ones
14:00 🔗 tpw_rules yeah. all the other ones are checksumming fine
14:00 🔗 closure yeah, the survey has them as size 0
14:00 🔗 tpw_rules huh.
14:01 🔗 closure so, I can work around this for future shards. for shard1/2, hrm
14:01 🔗 tpw_rules should i just let the fsck run? i don't want to have to re-do it like tomorrow because it will probably take many hours
14:01 🔗 closure let's run it, maybe we'll learn something
14:02 🔗 tpw_rules also (i guess because they're size 0) it says 'ok' even if they're not there
14:02 🔗 tpw_rules fsck internetarchivebooks/1910catholicencyclop07herb/1910catholicencyclop07herb_files.xml ok
14:02 🔗 tpw_rules it previously got that one for being too big
14:03 🔗 closure I'm guessing the reason files.xml is empty is because files.xml includes files.xml inside itself
14:03 🔗 closure yeah, looking at the xml, their survey queried that, and of course the file list file doesn't include its own size or md5sum
14:04 🔗 tpw_rules heh. anyway i'm restarting it
14:05 🔗 closure so, I can special case this by making the shard use an URL key for that, not a md5sum, and omitting any size info. Then it'll fsck ok
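The workaround closure describes above can be sketched roughly as follows. This is a hypothetical illustration, not the actual shard-generation code: the function name, its parameters, and the simplified URL key format are assumptions (real git-annex URL keys escape some characters). Only the `MD5-s<size>--<md5>` shape is taken from the fsck output quoted earlier in the log.

```python
from typing import Optional


def annex_key(filename: str, url: str, md5: Optional[str], size: Optional[int]) -> str:
    """Pick a key for one surveyed file (hypothetical helper).

    Files whose survey metadata cannot be trusted (the item's own
    _files.xml, or anything without an md5) get a URL-based key with
    no size, so a later fsck has no size or checksum to compare.
    """
    if filename.endswith("_files.xml") or md5 is None or size is None:
        # Simplified URL key: no size field, no checksum.
        return "URL--" + url
    # Normal case: checksum and size are both recorded in the key.
    return "MD5-s{}--{}".format(size, md5)
```

With a key that carries neither size nor checksum, a size mismatch like the "5.12 kB larger" report above cannot occur for those files.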
14:05 🔗 tpw_rules do you say url like 'earl'
14:06 🔗 tpw_rules when will that be fixed?
14:07 🔗 closure even better I say nuclear like "nooclear". Clearly an uneducated hick
14:07 🔗 tpw_rules well it's not nucular
14:10 🔗 closure looks at the graph
14:10 🔗 closure wow, that's a serious spike
14:11 🔗 tpw_rules okay ima fsck then. tell me when it's figured out
14:11 🔗 closure http://iabackup.archiveteam.org/ia.bak/
14:12 🔗 closure wow, shard2 is coming along
14:24 🔗 beardicus (~beardicus@[redacted]) has joined #internetarchive.bak
14:25 🔗 GitHub65/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRTA
14:25 🔗 GitHub65/#internetarchive.bak IA.BAK/master 32165aa Joey Hess: begin handling different states differently...
14:26 🔗 GitHub95/#internetarchive.bak IA.BAK/master a0fdc55 Joey Hess: remove unused funct stub
14:26 🔗 GitHub95/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkZ
14:26 🔗 GitHub130/#internetarchive.bak IA.BAK/master 0e416a6 Joey Hess: thought I was writing haskell for a sec
14:26 🔗 GitHub130/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkz
14:27 🔗 GitHub169/#internetarchive.bak IA.BAK/master 783f35e Joey Hess: no fall-thru
14:27 🔗 GitHub169/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkH
14:29 🔗 GitHub55/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRIw
14:29 🔗 GitHub55/#internetarchive.bak IA.BAK/master 69cd7b6 Joey Hess: avoid ./ in find output
14:30 🔗 GitHub120/#internetarchive.bak IA.BAK/master bb01f4a Joey Hess: more haskellitus
14:30 🔗 GitHub120/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRLS
14:32 🔗 GitHub152/#internetarchive.bak IA.BAK/master 824fee8 Joey Hess: simplif
14:32 🔗 GitHub152/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRqk
14:38 🔗 closure oh, this is interesting
14:39 🔗 closure tpw_rules has dropped the _files.xml
14:39 🔗 closure so my client then goes and downloads every such directory
14:39 🔗 closure that's because of the dirname in the script
14:40 🔗 tpw_rules ?
14:40 🔗 closure means that everyone is going to try to re-download all your files
14:41 🔗 closure but I'll fix it
14:41 🔗 tpw_rules all the ones from the bad directory?
14:41 🔗 closure no, all of them
14:41 🔗 tpw_rules huh
14:44 🔗 GitHub170/#internetarchive.bak IA.BAK/master 13b07a4 Joey Hess: avoid redundant whole-directory downloads for 1 missing file in maint mode
14:44 🔗 GitHub170/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRsC
14:52 🔗 GitHub145/#internetarchive.bak IA.BAK/master 7bdc606 Joey Hess: add NOMORE control file
14:52 🔗 GitHub145/#internetarchive.bak IA.BAK/master fb202cb Joey Hess: get all clients to checkout shard2, unless they've touched NOMORE
14:52 🔗 GitHub145/#internetarchive.bak [IA.BAK] joeyh pushed 2 new commits to master: http://git.io/veRnS
14:54 🔗 GitHub185/#internetarchive.bak IA.BAK/master cf3ed9e Joey Hess: skip already checked out shard
14:54 🔗 GitHub185/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRcB
15:05 🔗 GitHub189/#internetarchive.bak IA.BAK/master 0f1d13d Joey Hess: improve message
15:05 🔗 GitHub189/#internetarchive.bak IA.BAK/master 478134d Joey Hess: start work on shard2 immediately after checkout
15:05 🔗 GitHub189/#internetarchive.bak [IA.BAK] joeyh pushed 2 new commits to master: http://git.io/veR89
15:06 🔗 closure is interested to see that SHARD1's git repo grew to 309 mb
15:07 🔗 closure so we'll need 0.5 to 1 tb to hold the git repos on the server
15:13 🔗 SketchCow closure: You did with ia.bak/ exactly what I'd have done
15:13 🔗 SketchCow So, next.
15:14 🔗 SketchCow 1. Closure determines Shard1 is "stable"
15:14 🔗 SketchCow 2. We have a handful of you delete, oh, 20gb off your shard1 directory
15:15 🔗 SketchCow 3. Closure determines if the reporting (the graph) will go back down, and if things are notable.
15:15 🔗 SketchCow 4. The command is issued, possibly automatically, to get that information back
15:15 🔗 SketchCow After that, we'll expand onto shard2
15:16 🔗 SketchCow Also, I think the page (the graph page) should have information on the collections we're saving, at the moment.
15:16 🔗 closure easy to add that
15:16 🔗 closure can just run ls > $SHARD.collections
15:17 🔗 SketchCow Well, I want it to be slightly more bespoke, but yes.
15:18 🔗 SketchCow Just because that's more a PR/publicity thing than engineering
15:18 🔗 SketchCow Educating people what sort of amazing things are there, etc.
15:18 🔗 closure well then you just make links to them
15:19 🔗 SketchCow So, while we start arranging the fire drill
15:19 🔗 SketchCow ...do we want more people?
15:19 🔗 SketchCow Do we want to continue to expand? That's the main question.
15:19 🔗 closure I think we should pause and take stock
15:20 🔗 closure and then expand
15:20 🔗 SketchCow Great.
15:20 🔗 SketchCow OK, so I say the above 4 steps are what to do next, unless you have a different idea.
15:24 🔗 GitHub136/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRgE
15:24 🔗 GitHub136/#internetarchive.bak IA.BAK/server 2a12096 Joey Hess: get per-shard list of collections
15:29 🔗 closure sounds good, will work on it tonight
15:29 🔗 closure in the meantime, if anyone restarts iabak, it should move on to shard2, assuming you have any disk space left
15:34 🔗 GitHub107/#internetarchive.bak IA.BAK/server 70a5611 Joey Hess: tell gc to prune all loose objects after packing
15:34 🔗 GitHub107/#internetarchive.bak IA.BAK/server 9b706e2 Joey Hess: make ALL.collections list collections of currently active shards, but not ones we have already downloaded
15:34 🔗 GitHub107/#internetarchive.bak [IA.BAK] joeyh pushed 2 new commits to server: http://git.io/veRoU
15:43 🔗 GitHub22/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRiX
15:43 🔗 GitHub22/#internetarchive.bak IA.BAK/server 207ea38 Joey Hess: add basic list of collections to graph page
15:45 🔗 GitHub52/#internetarchive.bak IA.BAK/server 5923a01 Joey Hess: avoid multiline var
15:45 🔗 GitHub52/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRPx
15:46 🔗 closure SketchCow: collection list is there. I'll let you dress it up
15:46 🔗 closure for the ALL.html, it's the collections of currently active shards (SHARD2 currently).. for SHARDn.html, it's the collections in that shard
15:50 🔗 zottelbey weird. when i run checkoutshard shard2 it asks for my pubkey to be added again although it is in the pubkey file for shard 2.
15:50 🔗 zottelbey any ideas ?
16:00 🔗 Senji It didn't for me last night
16:02 🔗 Senji Is there any way I can get iabak to stop once it has finished this file?
16:20 🔗 db48x Senji: not at the moment
16:28 🔗 db48x you can hit Control-C at any time
16:28 🔗 db48x and if you want to be sure of finishing up that one file, do a manual git annex get on it afterwards, it'll pick up where it left off
16:35 🔗 db48x closure: in my home directory on iabak are some changes to your propellor config
16:35 🔗 db48x closure: could you run your eye over them?
17:27 🔗 SketchCow closure: It appears your collection list is JUST the second shard?
17:27 🔗 SketchCow Oh, I see.
17:27 🔗 SketchCow Yes, for ALL, I'd like collections of all shards.
17:50 🔗 closure SketchCow: well, that's easier, but it will get enormous eventually..
17:51 🔗 closure tell you what, I'll put the active shard first
17:51 🔗 closure and then all the rest
17:52 🔗 GitHub44/#internetarchive.bak IA.BAK/server 04a0627 Joey Hess: include collections of all shards, active shards first
17:52 🔗 GitHub44/#internetarchive.bak [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/ve0Pk
17:52 🔗 closure db48x: it looks basically ok
17:52 🔗 closure the way to deal with passwords is: File.privContent "/path/to/file"
17:53 🔗 closure then I have to feed in the passwords in my config, which gets encrypted and stuff
17:54 🔗 closure note that the cmdProperties will be run each time propellor runs. If they should only run once, append `flagFile` "/some/flag/file"
18:08 🔗 SketchCow closure: It will eventually get huge. Yes
18:08 🔗 SketchCow Put them all in.
18:09 🔗 SketchCow Later, it'll be its own page
18:18 🔗 db48x hrm
18:20 🔗 db48x I don't see where flagFile is defined; I assume it will modify the property so that it creates the named file the first time it executes the command, and skips the command if it already exists?
18:21 🔗 closure right
18:21 🔗 closure It's in Properties
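The flag-file behaviour db48x describes and closure confirms can be sketched like this. It is a plain Python illustration of the idea, not propellor's Haskell implementation; `run_once` and its arguments are hypothetical names.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_once(command: Sequence[str], flag_file: str) -> bool:
    """Run command only if flag_file does not exist yet.

    Returns True if the command ran, False if it was skipped.
    The flag is created only after the command succeeds, so a
    failed run is retried next time.
    """
    flag = Path(flag_file)
    if flag.exists():
        return False
    subprocess.run(list(command), check=True)  # raises on non-zero exit
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return True
```

Each command needs its own flag path, which matches closure's point that one shared flag file for a list of properties would not work.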
18:23 🔗 db48x do you have a preference for where to put them?
18:24 🔗 db48x do you have a function that takes a list of properties and makes them all use a flagfile?
18:24 🔗 closure well, they need different flag files for each, so no
18:26 🔗 db48x would that make it harder to implement?
18:28 🔗 db48x also, what about upgrades?
18:28 🔗 closure upgrades of what?
18:28 🔗 db48x if graphite-web should ever be upgraded, we'd need it to run syncdb again
18:28 🔗 closure ah, I see. The debian package doesn't do that for you on upgrade?
18:29 🔗 db48x good question; I've no idea
18:30 🔗 db48x it certainly didn't do it on the initial install
18:33 🔗 closure ok.. I don't have a handy way to do that, but it would probably be a nice thing to add.
18:33 🔗 closure for now, I'd not worry about it. If it breaks, we can fix it
18:34 🔗 closure so, I'm looking at making shards that are full and in maintenance mode fsck themselves in full once per month, running up to 5 hours of fscking per iabak run
18:35 🔗 closure those numbers can be tuned, depends how quickly we want to catch bit flips..
18:35 🔗 closure and then a separate fsck --fast to detect gross damage, which should run in just a few minutes
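The "up to 5 hours of fscking per iabak run" idea amounts to a time-budgeted loop that stops when the budget is spent and resumes later. A minimal sketch, assuming a hypothetical `check` callback per file; how iabak actually implements the limit is not shown in this log.

```python
import time
from typing import Callable, Iterable, List


def fsck_with_budget(
    files: Iterable[str],
    check: Callable[[str], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Check files in order until the time budget is used up.

    Returns the files that were not reached, so the next run can
    continue from there instead of starting over.
    """
    deadline = clock() + budget_seconds
    remaining = list(files)
    while remaining and clock() < deadline:
        check(remaining.pop(0))
    return remaining
```

A monthly full pass would then be a matter of feeding the leftover list back in on each run until it is empty, and resetting it once a month.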
18:42 🔗 db48x seems reasonable
18:43 🔗 db48x in aggregate that's more often than I scrub my ZFS filesystems
18:43 🔗 closure it might be too often
18:44 🔗 closure I have not researched what makes sense
18:52 🔗 closure new git-annex just released
18:54 🔗 db48x I saw :)
18:55 🔗 closure cool, the auto-upgrade in iabak works
19:02 🔗 SketchCow I think as long as we're emotionally prepped for the scramble at the end of the month, a once per month check is worth it.
19:02 🔗 SketchCow I mean a scramble to add code and check situations for what we didn't expect.
19:04 🔗 db48x my haskell is too shaky to be certain of what GetAnnexBuilder is doing to get the rsync password
19:05 🔗 db48x the inner bit where you test to see if the password has changed is clear enough
19:06 🔗 db48x you either return an action which writes the password to /home/whoever/rsyncpassword if it has, or you return a noop
19:06 🔗 closure withPrivData returns an action that can be called to get the password
19:08 🔗 db48x which you do by chaining it together with others passed to combineProperties
19:08 🔗 WizCry (~Wizardcry@[redacted]) has joined #internetarchive.bak
19:11 🔗 closure well, that combines a bunch of Property into one, yes
19:15 🔗 db48x getpw here is a string with the actual password in it?
19:17 🔗 closure the String it returns is, yes
19:18 🔗 closure or, some arbitrary piece of data, maybe a whole file. Whatever is put in later
19:19 🔗 closure I realize that (((PrivData -> Propellor Result) -> Propellor Result) -> Property i) is a somewhat unintuitive data type, it was necessary to make introspection work
19:21 🔗 db48x the inner bit is a function which takes a PrivData, which is a string, and returns a result?
19:22 🔗 closure returns an action that modifies the system and returns a result yes
19:23 🔗 db48x which is the action?
19:24 🔗 closure Propellor Result
19:31 🔗 db48x I'm really going to have to study Haskell systematically some day
19:40 🔗 db48x mmm, is the type signature of getpw ((PrivData -> Propellor Result) -> Propellor Result)?
19:43 🔗 db48x that would make sense, because it lets the anonymous function on the next line take a PrivData called pw and return a Propellor Result
19:44 🔗 db48x presumably getpw first queries the PrivDataSource (here constructed using Password), and returns a failure if there's no data available, or calls its argument otherwise
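The shape db48x works out here, a getter that takes a callback, fails if no private data is available, and otherwise calls the callback with the data, can be mimicked in Python. This is only an analogy for the continuation-passing style under discussion: `PRIVDATA`, `SETTINGS`, `with_priv_data` and `has_line` are hypothetical stand-ins, and the result strings merely imitate propellor's result values.

```python
from typing import Callable, Dict, List, Tuple

Result = str                              # "MadeChange", "NoChange" or "FailedChange"
Action = Callable[[str], Result]          # roughly: PrivData -> Propellor Result
Getter = Callable[[Action], Result]       # roughly: (PrivData -> Result) -> Result

PRIVDATA: Dict[Tuple[str, str], str] = {}  # hypothetical private-data store
SETTINGS: List[str] = []                   # stands in for the settings file


def with_priv_data(field: str, context: str,
                   build: Callable[[Getter], Callable[[], Result]]) -> Callable[[], Result]:
    def getpw(action: Action) -> Result:
        value = PRIVDATA.get((field, context))
        if value is None:
            return "FailedChange"          # no data available: fail
        return action(value)               # otherwise hand the data to the callback
    return build(getpw)


def has_line(line: str) -> Result:
    if line in SETTINGS:
        return "NoChange"
    SETTINGS.append(line)
    return "MadeChange"


graphite_csrf = with_priv_data(
    "csrf-token", "graphite-web",
    lambda gettoken: lambda: gettoken(
        lambda token: has_line("SECRET_KEY = '" + token + "'")),
)
```

The caller never holds the secret directly; it only supplies what to do with it, which is what lets the getter decide to fail before the callback ever runs.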
19:48 🔗 SN4T14__ (~SN4T14@[redacted]) has joined #internetarchive.bak
19:51 🔗 db48x closure: how's this?
19:51 🔗 db48x graphiteCSRF = withPrivData (Password "guesswork") (Context "graphite-web-stuff")
19:51 🔗 db48x \gettoken -> property "graphite-web CSRF token" $
19:51 🔗 db48x gettoken $ \token -> do
19:51 🔗 db48x makeChange $ File.hasLine "/etc/graphite/local_settings.py" "SECRET_KEY = '"++ token ++"'"
19:51 🔗 db48x maybe change "guesswork" to something else...
19:51 🔗 closure looks great
19:52 🔗 db48x good
19:52 🔗 db48x how would I actually set this value, if I could?
19:52 🔗 closure you could use (PrivFile "/etc/graphite/local_settings.py")
19:53 🔗 db48x this is the only thing in there that has to be private
19:53 🔗 closure propellor --set something something < file
19:53 🔗 SN4T14_ has quit (Ping timeout: 306 seconds)
19:53 🔗 db48x what are the somethings?
19:53 🔗 closure if it's just a password, I'll make one up. Propellor will tell what the command line is to set it when it sees the value is missing
19:54 🔗 db48x it's not quite a password
19:54 🔗 db48x just needs to be a random string, decently long
19:54 🔗 db48x 30 or 40 characters will do
19:55 🔗 db48x when you log in it creates a CSRF token for your session which is formed by hashing this token plus some other information; the hash is the token
19:57 🔗 db48x ok, I changed it to graphiteCSRF = withPrivData (Password "csrf-token") (Context "graphite-web")
19:57 🔗 db48x as I understand it, you'll end up doing propellor --set csrf-token graphite-web < tokenfile, right?
19:58 🔗 closure right
19:58 🔗 db48x perfect
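For the "random string, decently long" requirement, one way to produce the token file is sketched below. The helper name and file handling are assumptions; the resulting file would then be fed to the `propellor --set csrf-token graphite-web < tokenfile` command discussed above.

```python
import os
import secrets


def write_token(path: str, nbytes: int = 30) -> str:
    """Write a fresh URL-safe random token to path and return it.

    30 random bytes encode to a 40-character string. The file is
    created with mode 0600 and the call fails if it already exists,
    so an existing token is never silently overwritten.
    """
    token = secrets.token_urlsafe(nbytes)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(token)
    return token
```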
20:00 🔗 db48x that only leaves the passwords
20:00 🔗 db48x obviously they could be done the same way, but I don't think there's any way to programmatically set them
20:01 🔗 db48x maybe echo "swordfish" | graphite-manage changepassword joey :P
20:02 🔗 db48x no
20:03 🔗 db48x well, I'll just leave that out
20:06 🔗 db48x yawns
20:06 🔗 db48x time for me to sleep, I think
20:07 🔗 db48x closure: thanks for helping me understand :)
20:18 🔗 SketchCow 1pm naptime
21:02 🔗 sankin has quit (Leaving.)
21:54 🔗 Senji iabak seems to be dling a new git-annex every time i start it
22:04 🔗 Senji closure: you need a lock to ensure iabak doesn't run git annex while another iabak is unpacking a new one
22:33 🔗 yipdw Senji: are you running latest
22:33 🔗 yipdw https://github.com/ArchiveTeam/IA.BAK/commit/8fc19b24ffededa104ddcbe73a7d3ad6d853f668 performs a check
22:39 🔗 hater db48x: is there some reason why you don't close https://github.com/ArchiveTeam/IA.BAK/pull/1 ?
22:49 🔗 Senji yipdw: yes; and I've just seen a dozen copies I've just restarted all upgrade git-annex
22:50 🔗 yipdw then the check must be failing somehow
22:51 🔗 yipdw I guess a lock could be useful if you have a dozen all attempting to update at once
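The lock Senji and yipdw have in mind could look something like this: every iabak instance takes an exclusive lock around the unpack step, so only one upgrades at a time and the others wait. This is a sketch using `flock` on a lock file (Unix only); the lock path and function name are hypothetical, and iabak itself is a shell script, so this is the idea rather than a patch.

```python
import fcntl
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def upgrade_lock(path: str) -> Iterator[None]:
    """Hold an exclusive advisory lock on path for the duration of the block."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)   # blocks until no one else holds it
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A process would wrap both "unpack the new git-annex" and "re-check whether an upgrade is still needed" inside the lock, so latecomers find the upgrade already done instead of downloading it again.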
23:07 🔗 tpw_rules why are URLs excluded from the wayback machine :(
23:08 🔗 tpw_rules some*
23:14 🔗 hater yipdw: good point; haven't thought about that yet
