[00:19] *** cf_ (~nickgrego@[redacted]) has joined #internetarchive.bak
[00:42] *** beardicus (~beardicus@[redacted]) has joined #internetarchive.bak
[00:49] *** beardicus has quit (Quit: Textual IRC Client: www.textualapp.com)
[02:24] *** yipdw has changed the topic to: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK | http://iabackup.archiveteam.org/ia.bak/ALL | #archiveteam
[03:03] *** niyaje4 (~niyaje@[redacted]) has joined #internetarchive.bak
[03:42] *** cf_ has quit (cf_)
[05:41] *** realeyes_ (~Space_Cas@[redacted]) has joined #internetarchive.bak
[05:41] is it too late to participate in the backup of the internet? lol
[05:43] nope, pull up a seat
[05:43] *** realeyes_ is now known as realeyes
[05:43] Jason Scott led me here
[05:43] that guy around?
[05:44] he's SketchCow here, may be asleep at this hour but you never know with him
[05:44] aha
[05:45] so, i like the idea of this project
[05:45] where do i start?
[05:46] first things first i guess, how the hell do i register my nick on this network?
[05:47] you don't
[05:47] the magic of efnet :D
[05:48] i dont usually frequent this network lol
[05:48] it has literally no *servs
[05:49] how do i hide my hostname? lol
[05:49] you don't
[05:49] disconnect?
[05:49] ^
[05:50] lol
[05:50] alright so lets get started, how do i do my part here?
[05:53] you guys rsync'ing the archive or what?
[05:55] http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-annex_implementation
[05:55] i have to sleep so you'll have to bother somebody else for details
[05:55] but it's not hard
[06:00] realeyes: so run the git checkout, then run ./iabak
[06:00] that'll create an ssh key, you'll need to bug someone here (closure or db48x) to get it added
[06:06] ok
[06:09] *** realeyes has quit (Textual IRC Client: www.textualapp.com)
[06:09] *** realeyes (~Space_Cas@[redacted]) has joined #internetarchive.bak
[06:21] so, does anyone here have a backup of the ia?
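The getting-started steps just described boil down to a few commands. A hedged sketch — the repository URL comes from a GitHub link later in this log, and the exact prompts ./iabak gives may differ:

```shell
# Sketch of the setup described above (assumed repo URL; ./iabak's exact
# behavior on first run is paraphrased from the conversation).
git clone https://github.com/ArchiveTeam/IA.BAK
cd IA.BAK
./iabak   # first run creates an SSH key; ask closure or db48x in-channel
          # to add it server-side before downloads will start
```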
[06:21] *** aschmitz has quit (Read error: Operation timed out)
[06:21] *** kyan has quit (Ping timeout: 258 seconds)
[06:21] *** Kenshin has quit (hub.efnet.us irc.Prison.NET)
[06:21] *** db48x has quit (hub.efnet.us irc.Prison.NET)
[06:21] *** aschmitz (~aschmitz@[redacted]) has joined #internetarchive.bak
[06:28] *** Kenshin (~rurouni@[redacted]) has joined #internetarchive.bak
[06:28] *** db48x (~user@[redacted]) has joined #internetarchive.bak
[06:28] *** irc.Prison.NET gives channel operator status to db48x Kenshin
[06:28] *** kyan (~kyan@[redacted]) has joined #internetarchive.bak
[06:48] *** jbenet_ has quit (Read error: Connection reset by peer)
[06:48] *** lhobas_ has quit (Read error: Connection reset by peer)
[06:48] *** jbenet_ (sid17552@[redacted]) has joined #internetarchive.bak
[06:48] *** lhobas_ (sid41114@[redacted]) has joined #internetarchive.bak
[06:49] *** mrfoo has quit (Read error: Connection reset by peer)
[06:51] *** mrfoo (sid25914@[redacted]) has joined #internetarchive.bak
[09:27] IA has back-ups
[09:28] also, you can see the status of how far the backup is on the stats page
[09:44] *** zottelbey (~zottelbey@[redacted]) has joined #internetarchive.bak
[12:05] *** shabble_ (~shabble@[redacted]) has left #internetarchive.bak
[12:25] The WHOLE IA, no.
[12:26] http://iabackup.archiveteam.org/ia.bak/ BOOM, Shard1 is done!
[12:27] Who here is part of Shard1, and how much data do you have (roughly)
[12:29] I have a very little bit (less than 0.2TB)
[12:40] I have most of it, if not all
[12:41] and my shard2 download stopped because I'm out of space
[12:45] my shard1 download has frozen; presumably it'll eventually figure out there's nothing left.
[12:46] I do have 50 shard2 dls running.
[12:47] I have 36GB + 51GB
[12:47] getting about 50KB/s each
[12:49] Which is failing to stretch my internet connection.
[12:49] *** sankin (~sankin@[redacted]) has joined #internetarchive.bak
[12:50] Senji: maybe you need to download larger files?
[12:51] well, i'll just let it keep dribbling :-)
[13:19] i have 2.0TB of shard1, but it's offline. I should get it back online today sometime
[13:20] :)
[13:52] did the archive change somehow?
[13:52] i'm trying to fsck and all of my _files.xml are too big
[13:52] Bad file size (5.12 kB larger); moved to .git/annex/bad/MD5-s0--bc90afa1a66ce6a816d6fed309c68efd
[13:52] fsck internetarchivebooks/59to60sanfrancisco00sanfrich/59to60sanfrancisco00sanfrich_files.xml
[13:53] and SketchCow i have the entirety of shard1 except for somehow those files
[13:55] even if i redownload them they're still bad
[13:57] IA.BAK/master 5a15958 Joey Hess: SHARD1 done! activate SHARD2
[13:57] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veBNE
[13:57] *** Quile (quile@[redacted]) has joined #internetarchive.bak
[13:58] tpw_rules: interesting problem
[13:58] that's a bad sign :D
[13:59] the size from the survey and reality must not match for those xml files
[14:00] only the _files.xml right, not the other ones
[14:00] yeah. all the other ones are checksumming fine
[14:00] yeah, the survey has them as size 0
[14:00] huh.
[14:01] so, I can work around this for future shards. for shard1/2, hrm
[14:01] should i just let the fsck run? i don't want to have to re-do it like tomorrow because it will probably take many hours
[14:01] let's run it, maybe we'll learn something
[14:02] also (i guess because they're size 0) it says 'ok' even if they're not there
[14:02] fsck internetarchivebooks/1910catholicencyclop07herb/1910catholicencyclop07herb_files.xml ok
[14:02] it previously got that one for being too big
[14:03] I'm guessing the reason files.xml is empty is because files.xml includes files.xml inside itself
[14:03] yeah, looking at the xml, their survey queried that, and of course the file list file doesn't include its own size or md5sum
[14:04] heh.
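The chicken-and-egg just described — an item's _files.xml cannot list its own size or md5sum, so the survey recorded it as size 0 — can be seen with a local toy example (the file names here are made up):

```shell
# A manifest cannot embed its own checksum: writing the checksum into the
# file changes the checksum. Toy demo with made-up names.
cd "$(mktemp -d)"
printf '<file name="scan.djvu" size="12345"/>\n' > files.xml
sum1=$(md5sum files.xml | cut -d' ' -f1)
# attempt to record the manifest's own checksum inside itself:
printf '<file name="files.xml" md5="%s"/>\n' "$sum1" >> files.xml
sum2=$(md5sum files.xml | cut -d' ' -f1)
if [ "$sum1" != "$sum2" ]; then
    echo "self-description fails: checksum changed"
fi
```

This is why closure's workaround is to track those files by a URL key with no recorded size or checksum, rather than an MD5 key.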
anyway i'm restarting it
[14:05] so, I can special case this by making the shard use an URL key for that, not a md5sum, and omitting any size info. Then it'll fsck ok
[14:05] do you say url like 'earl'
[14:06] when will that be fixed?
[14:07] even better I say nuclear like "nooclear". Clearly an uneducated hick
[14:07] well it's not nucular
[14:10] *** closure looks at the graph
[14:10] wow, that's a serious spike
[14:11] okay ima fsck then. tell me when it's figured out
[14:11] http://iabackup.archiveteam.org/ia.bak/
[14:12] wow, shard2 is coming along
[14:24] *** beardicus (~beardicus@[redacted]) has joined #internetarchive.bak
[14:25] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRTA
[14:25] IA.BAK/master 32165aa Joey Hess: begin handling different states differently...
[14:26] IA.BAK/master a0fdc55 Joey Hess: remove unused funct stub
[14:26] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkZ
[14:26] IA.BAK/master 0e416a6 Joey Hess: thought I was writing haskell for a sec
[14:26] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkz
[14:27] IA.BAK/master 783f35e Joey Hess: no fall-thru
[14:27] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRkH
[14:29] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRIw
[14:29] IA.BAK/master 69cd7b6 Joey Hess: avoid ./ in find output
[14:30] IA.BAK/master bb01f4a Joey Hess: more haskellitus
[14:30] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRLS
[14:32] IA.BAK/master 824fee8 Joey Hess: simplif
[14:32] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRqk
[14:38] oh, this is interesting
[14:39] tpw_rules has dropped the _files.xml
[14:39] so my client then goes and downloads every such directory
[14:39] that's because of the dirname in the script
[14:40] ?
[14:40] means that everyone is going to try to re-download all your files
[14:41] but I'll fix it
[14:41] all the ones from the bad directory?
[14:41] no, all of them
[14:41] huh
[14:44] IA.BAK/master 13b07a4 Joey Hess: avoid redundant whole-directory downloads for 1 missing file in maint mode
[14:44] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRsC
[14:52] IA.BAK/master 7bdc606 Joey Hess: add NOMORE control file
[14:52] IA.BAK/master fb202cb Joey Hess: get all clients to checkout shard2, unless they've touched NOMORE
[14:52] [IA.BAK] joeyh pushed 2 new commits to master: http://git.io/veRnS
[14:54] IA.BAK/master cf3ed9e Joey Hess: skip already checked out shard
[14:54] [IA.BAK] joeyh pushed 1 new commit to master: http://git.io/veRcB
[15:05] IA.BAK/master 0f1d13d Joey Hess: improve message
[15:05] IA.BAK/master 478134d Joey Hess: start work on shard2 immediately after checkout
[15:05] [IA.BAK] joeyh pushed 2 new commits to master: http://git.io/veR89
[15:06] *** closure is interested to see that SHARD1's git repo grew to 309 mb
[15:07] so we'll need 0.5 to 1 tb to hold the git repos on the server
[15:13] closure: You did with ia.bak/ exactly what I'd have done
[15:13] So, next.
[15:14] 1. Closure determines Shard1 is "stable"
[15:14] 2. We have a handful of you delete, oh, 20gb off your shard1 directory
[15:15] 3. Closure determines if the reporting (the graph) will go back down, and if things are notable.
[15:15] 4. The command is issued, possibly automatically, to get that information back
[15:15] After that, we'll expand onto shard2
[15:16] Also, I think the page (the graph page) should have information on the collections we're saving, at the moment.
[15:16] easy to add that
[15:16] can just run ls > $SHARD.collections
[15:17] Well, I want it to be slightly more bespoke, but yes.
[15:18] Just because that's more a PR/publicity thing than engineering
[15:18] Educating people what sort of amazing things are there, etc.
[15:18] well then you just make links to them
[15:19] So, while we start arranging the fire drill
[15:19] ...do we want more people?
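The dirname problem closure describes above — one dropped _files.xml causing every client to re-fetch the whole containing directory — is roughly this (a toy sketch; the real script's logic is not quoted in the log):

```shell
# Toy sketch of the pitfall: resolving a missing file to its directory and
# re-fetching the directory pulls every file in it, not just the missing one.
cd "$(mktemp -d)"
mkdir -p shard1/item1
touch shard1/item1/item1_files.xml shard1/item1/scan.pdf
rm shard1/item1/item1_files.xml             # one file gets dropped
missing=shard1/item1/item1_files.xml
dir=$(dirname "$missing")                   # the script works per-directory...
echo "re-fetching everything under: $dir"   # ...so scan.pdf gets downloaded again too
```

Commit 13b07a4 above is the fix: avoid the redundant whole-directory download when a single file is missing in maintenance mode.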
[15:19] Do we want to continue to expand? That's the main question.
[15:19] I think we should pause and take stock
[15:20] and then expand
[15:20] Great.
[15:20] OK, so I say the above 4 steps are what to do next, unless you have a different idea.
[15:24] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRgE
[15:24] IA.BAK/server 2a12096 Joey Hess: get per-shard list of collections
[15:29] sounds good, will work on it tonight
[15:29] in the meantime, if anyone restarts iabak, it should move on to shard2, assuming you have any disk space left
[15:34] IA.BAK/server 70a5611 Joey Hess: tell gc to prune all loose objects after packing
[15:34] IA.BAK/server 9b706e2 Joey Hess: make ALL.collections list collections of currently active shards, but not ones we have already downloaded
[15:34] [IA.BAK] joeyh pushed 2 new commits to server: http://git.io/veRoU
[15:43] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRiX
[15:43] IA.BAK/server 207ea38 Joey Hess: add basic list of collections to graph page
[15:45] IA.BAK/server 5923a01 Joey Hess: avoid multiline var
[15:45] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/veRPx
[15:46] SketchCow: collection list is there. I'll let you dress it up
[15:46] for the ALL.html, it's the collections of currently active shards (SHARD2 currently).. for SHARDn.html, it's the collections in that shard
[15:50] weird. when i run checkoutshard shard2 it asks for my pubkey to be added again although it is in the pubkey file for shard 2.
[15:50] any ideas ?
[16:00] It didn't for me last night
[16:02] Is there any way I can get iabak to stop once it has finished this file?
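The NOMORE control file from the commits above works as an opt-out: on restart, a client moves on to the newly activated shard unless the user has touched NOMORE. A minimal sketch of that check (the file name comes from the commit message; the surrounding iabak logic is assumed, not quoted from the script):

```shell
# Sketch of the NOMORE opt-out described above.
cd "$(mktemp -d)"
maybe_next_shard() {
    if [ -e NOMORE ]; then
        echo "NOMORE present: not checking out new shards"
    else
        echo "checking out shard2"
    fi
}
maybe_next_shard      # default: take on the newly activated shard
touch NOMORE          # user opts out of further shards
maybe_next_shard
```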
[16:20] Senji: not at the moment
[16:28] you can hit Control-C at any time
[16:28] and if you want to be sure of finishing up that one file, do a manual git annex get on it afterwards, it'll pick up where it left off
[16:35] closure: in my home directory on iabak are some changes to your propellor config
[16:35] closure: could you run your eye over them?
[17:27] closure: It appears your collection list is JUST the second shard?
[17:27] Oh, I see.
[17:27] Yes, for ALL, I'd like collections of all shards.
[17:50] SketchCow: well, that's easier, but it will get enormous eventually..
[17:51] tell you what, I'll put the active shard first
[17:51] and then all the rest
[17:52] IA.BAK/server 04a0627 Joey Hess: include collections of all shards, active shards first
[17:52] [IA.BAK] joeyh pushed 1 new commit to server: http://git.io/ve0Pk
[17:52] db48x: it looks basically ok
[17:52] the way to deal with passwords is: File.privContent "/path/to/file"
[17:53] then I have to feed in the passwords in my config, which gets encrypted and stuff
[17:54] note that the cmdProperties will be run each time propellor runs. If they should only run once, append `flagFile` "/some/flag/file"
[18:08] closure: It will eventually get huge. Yes
[18:08] Put them all in.
[18:09] Later, it'll be its own page
[18:18] hrm
[18:20] I don't see where flagFile is defined; I assume it will modify the property so that it creates the named file the first time it executes the command, and skips the command if it already exists?
[18:21] right
[18:21] It's in Properties
[18:23] do you have a preference for where to put them?
[18:24] do you have a function that takes a list of properties and makes them all use a flagfile?
[18:24] well, they need different flag files for each, so no
[18:26] would that make it harder to implement?
[18:28] also, what about upgrades?
[18:28] upgrades of what?
[18:28] if graphite-web should ever be upgraded, we'd need it to run syncdb again
[18:28] ah, I see.
The debian package doesn't do that for you on upgrade?
[18:29] good question; I've no idea
[18:30] it certainly didn't do it on the initial install
[18:33] ok.. I don't have a handy way to do that, but it would probably be a nice thing to add.
[18:33] for now, I'd not worry about it. If it breaks, we can fix it
[18:34] so, I'm looking at making shards that are full and in maintenance mode fsck themselves in full once per month, running up to 5 hours of fscking per iabak run
[18:35] those numbers can be tuned, depends how quickly we want to catch bit flips..
[18:35] and then a separate fsck --fast to detect gross damage, which should run in just a few minutes
[18:42] seems reasonable
[18:43] in aggregate that's more often than I scrub my ZFS filesystems
[18:43] it might be too often
[18:44] I have not researched what makes sense
[18:52] new git-annex just released
[18:54] I saw :)
[18:55] cool, the auto-upgrade in iabak works
[19:02] I think as long as we're emotionally prepped for the scramble at the end of the month, a once per month check is worth it.
[19:02] I mean a scramble to add code and check situations for what we didn't expect.
[19:04] my haskell is too shaky to be certain of what GetAnnexBuilder is doing to get the rsync password
[19:05] the inner bit where you test to see if the password has changed is clear enough
[19:06] you either return an action which writes the password to /home/whoever/rsyncpassword if it has, or you return a noop
[19:06] withPrivData returns an action that can be called to get the password
[19:08] which you do by chaining it together with others passed to combineProperties
[19:08] *** WizCry (~Wizardcry@[redacted]) has joined #internetarchive.bak
[19:11] well, that combines a bunch of Property into one, yes
[19:15] getpw here is a string with the actual password in it?
[19:17] the String it returns is, yes
[19:18] or, some arbitrary piece of data, maybe a whole file.
Whatever is put in later
[19:19] I realize that (((PrivData -> Propellor Result) -> Propellor Result) -> Property i) is a somewhat unintuitive data type, it was necessary to make introspection work
[19:21] the inner bit is a function which takes a PrivData, which is a string, and returns a result?
[19:22] returns an action that modifies the system and returns a result yes
[19:23] which is the action?
[19:24] Propellor Result
[19:31] I'm really going to have to study Haskell systematically some day
[19:40] mmm, is the type signature of getpw ((PrivData -> Propellor Result) -> Propellor Result)?
[19:43] that would make sense, because it lets the anonymous function on the next line take a PrivData called pw and return a Propellor Result
[19:44] presumably getpw first queries the PrivDataSource (here constructed using Password), and returns a failure if there's no data available, or calls its argument otherwise
[19:48] *** SN4T14__ (~SN4T14@[redacted]) has joined #internetarchive.bak
[19:51] closure: how's this?
[19:51] graphiteCSRF = withPrivData (Password "guesswork") (Context "graphite-web-stuff")
[19:51] \gettoken -> property "graphite-web CSRF token" $
[19:51] gettoken $ \token -> do
[19:51] makeChange $ File.hasLine "/etc/graphite/local_settings.py" "SECRET_KEY = '"++ token ++"'"
[19:51] maybe change "guesswork" to something else...
[19:51] looks great
[19:52] good
[19:52] how would I actually set this value, if I could?
[19:52] you could use (PrivFile "/etc/graphite/local_settings.py")
[19:53] this is the only thing in there that has to be private
[19:53] propellor --set something something < file
[19:53] *** SN4T14_ has quit (Ping timeout: 306 seconds)
[19:53] what are the somethings?
[19:53] if it's just a password, I'll make one up.
Propellor will tell what the command line is to set it when it sees the value is missing
[19:54] it's not quite a password
[19:54] just needs to be a random string, decently long
[19:54] 30 or 40 characters will do
[19:55] when you log in it creates a CSRF token for your session which is formed by hashing this token plus some other information; the hash is the token
[19:57] ok, I changed it to graphiteCSRF = withPrivData (Password "csrf-token") (Context "graphite-web")
[19:57] as I understand it, you'll end up doing propellor --set csrf-token graphite-web < tokenfile, right?
[19:58] right
[19:58] perfect
[20:00] that only leaves the passwords
[20:00] obviously they could be done the same way, but I don't think there's any way to programmatically set them
[20:01] maybe echo "swordfish" | graphite-manage changepassword joey :P
[20:02] no
[20:03] well, I'll just leave that out
[20:06] *** db48x yawns
[20:06] time for me to sleep, I think
[20:07] closure: thanks for helping me understand :)
[20:18] 1pm naptime
[21:02] *** sankin has quit (Leaving.)
[21:54] iabak seems to be dling a new git-annex every time i start it
[22:04] closure: you need a lock to ensure iabak doesn't run git annex while another iabak is unpacking a new one
[22:33] Senji: are you running latest
[22:33] https://github.com/ArchiveTeam/IA.BAK/commit/8fc19b24ffededa104ddcbe73a7d3ad6d853f668 performs a check
[22:39] db48x: is there some reason why you don't close https://github.com/ArchiveTeam/IA.BAK/pull/1 ?
[22:49] yipdw: yes; and I've just seen a dozen copies I've just restart all upsare git-annex
[22:50] then the check must be failing somehow
[22:51] I guess a lock could be useful if you have a dozen all attempting to update at once
[23:07] why are some URLs excluded from the wayback machine :(
[23:14] yipdw: good point; haven't thought about that yet
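The locking fix suggested in the last exchange can be sketched with flock(1): hold an exclusive lock around the git-annex unpack so that concurrent iabak runs don't race each other. The lock file path and messages here are assumptions, not from the iabak script:

```shell
# Sketch of the suggested lock: serialize the git-annex update step across
# concurrent iabak processes using an exclusive flock on fd 9.
lockfile=$(mktemp)
(
    flock 9                      # blocks until no other process holds the lock
    echo "unpacking new git-annex under lock"
    # ... the actual unpack/upgrade would happen here ...
) 9>"$lockfile"
echo "done"
```

With `flock -n 9` instead, a second iabak could skip the update rather than wait, which may be preferable when a dozen clients start at once.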