[00:13] *** protodev has quit IRC (Ping timeout: 606 seconds)
[00:14] *** protodev has joined #internetarchive.bak
[00:32] *** primus104 has quit IRC (Leaving.)
[00:40] *** protodev has quit IRC (Ping timeout: 606 seconds)
[00:41] *** protodev has joined #internetarchive.bak
[00:56] *** niyaje4 has joined #internetarchive.bak
[01:10] *** protodev has quit IRC (Ping timeout: 606 seconds)
[01:10] *** protodev has joined #internetarchive.bak
[01:21] *** protodev has quit IRC (Ping timeout: 606 seconds)
[01:22] *** protodev has joined #internetarchive.bak
[04:21] *** niyaje4 has quit IRC (Ping timeout: 600 seconds)
[05:03] *** Ctrl-S has quit IRC ( HydraIRC -> http://www.hydrairc.com <- In tests, 0x09 out of 0x0A l33t h4x0rz prefer it :))
[05:10] *** Ctrl-S has joined #internetarchive.bak
[05:48] *** marvinw has quit IRC (Read error: Operation timed out)
[06:08] *** zottelbey has joined #internetarchive.bak
[06:13] *** marvinw has joined #internetarchive.bak
[06:15] *** Control-S has joined #internetarchive.bak
[06:19] *** Ctrl-S has quit IRC (Read error: Operation timed out)
[06:19] *** Control-S is now known as Ctrl-S
[06:20] *** primus104 has joined #internetarchive.bak
[06:22] *** niyaje4 has joined #internetarchive.bak
[07:19] *** atomotic has joined #internetarchive.bak
[07:36] my SHARD3 download finished, with 108 files that could not be downloaded - any similar experiences here?
[08:15] There's a bug with URL parsing of files with # in the URLs
[08:23] *** niyaje4 has quit IRC (Ping timeout: 600 seconds)
[09:07] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[09:28] So, when's shard5 going to appear? :-)
[10:50] *** primus104 has quit IRC (Leaving.)
[11:00] neat, one of my shards became corrupted
[11:00] https://pastebin.mozilla.org/8832598
[11:00] neat?
[11:00] git annex fsck fixed it though
[11:01] Senji: yes
[11:01] this is a test, we have to have bad things happen to it
[11:03] Good point :)
[11:03] I deliberately put it on a disk that I suspected of having trouble
[11:04] Testing the disk too then? :)
[11:04] yes
[11:05] Any idea when shard5 comes out? :-)
[11:05] gah, shard2 is broken the same way
[11:06] weirdly it's exactly the same hash in the error messages...
[11:07] oh man, git annex fsck is broken here too
[11:08] it's just repeating this over and over:
[11:08] git-annex: fd:12: commitBuffer: resource vanished (Broken pipe)
[11:08] failed
[11:09] Odd
[11:10] indeed
[11:51] registrar master fc499f1 other SHARD4/pubkeys registration of db48x iabak on SHARD4
[12:32] *** atomotic has joined #internetarchive.bak
[12:47] *** sankin has joined #internetarchive.bak
[13:16] *** primus104 has joined #internetarchive.bak
[13:40] btw, i uploaded my current perl-script: https://github.com/cancerAlot/IA.BAK/tree/perl-master
[13:41] it's still broken and not ready yet; if someone spots the bug in checkssh() which causes to register with an wrong ssh-pub-key, tell me ;)
[13:51] *** garyrh has quit IRC (http://bnc4free.com/)
[13:58] *** Start has quit IRC (Disconnected.)
[14:35] *** Start has joined #internetarchive.bak
[14:39] protodev: excellent, I'll take a look
[14:59] *** Start has quit IRC (Disconnected.)
[15:00] db48x: there is no maintain-function, no cronjob, no proper osx-support (in the install-sub) and loads of other missing features
[15:01] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:03] and the whole @cmd-thing will be replaced - ugly hack which is not needed
[15:03] *** Start has joined #internetarchive.bak
[15:16] *** Start has quit IRC (Disconnected.)
[15:31] registrar master b2119e2 other SHARD3/pubkeys registration of protodev on SHARD3
[15:31] YES! :D
[15:32] it works !!
[15:32] woot
[15:33] ok, not yet...
[15:40] registrar master 106f19e other SHARD3/pubkeys registration of db48x+iabak on SHARD3
[15:41] Hmm, the disk I have my shard4 repo in is nearly full; should I create a new shard4 repo on the next disk and worry about not having duplication; or just wait until shard5?
[15:42] if you create another shard4, then you'll want to manually set them both up so that they don't want content that the other has
[15:43] Yeah, exactly
[15:44] can I do that with iabak or do i have to run "git-annex get" directly with appropriate flags?
[15:44] use git annex wanted
[15:44] registrar master ed8d47f other SHARD3/pubkeys registration of protodev on SHARD3
[15:44] *** Start has joined #internetarchive.bak
[15:47] The git annex wanted documentation says it doesn't have --in; which seems to rule out the obvious way of doing that (git annex wanted --not --in )
[15:49] *** Start has quit IRC (Read error: Connection reset by peer)
[15:57] yep, successfully registered
[15:57] plenty of other problems after that though :)
[16:00] protodev: I'll send you a pull request
[16:02] it's probably similar to what you've already got, but since you said that yours didn't work...
[16:08] *** primus104 has quit IRC (Leaving.)
[16:16] protodev: removed some duplicated bits
[16:18] are more shards planned? just curious
[16:21] *** Start has joined #internetarchive.bak
[16:24] *** garyrh has joined #internetarchive.bak
[16:43] *** Start has quit IRC (Disconnected.)
[16:56] db48x: we wrote the exactly same piece of code ^
[16:56] i just named my variables a bit different
[17:23] *** primus104 has joined #internetarchive.bak
[17:26] the reincarnation of Thompson and Ritchie
[17:31] :)
[17:31] sep332: yes
[17:31] sep332: the plan is to have enough shards to back up all of archive.org, of course
[17:32] are you going to add each as the previous one gets close to done? or all done?
[17:32] we should call them crystals in preparation for Final Fantasy XVI or whatever
[17:33] no, I believe the plan is to add all of the rest of them all at once, and spread the word wide and far
[17:34] ok cool
[17:38] I think there were a few more things we wanted to do before then, though
[17:44] *** primus104 has quit IRC (Leaving.)
[18:32] *** primus104 has joined #internetarchive.bak
[19:33] yipdw: you're only allowed to have 7 crystals
[19:33] perhaps make the number of total shards evenly divisible by 7?
[19:50] can make a couple more shards, just need to pick collections for them to get 100k-200k files
[19:51] if the same file (by MD5) gets uploaded multiple times, will it get backed up more? or only 4 copies per MD5?
[19:52] if it's in the same shard, only once
[19:52] well, 4 times
[19:52] closure: are there likely to be more shards soon then?
[19:53] once someone picks some collections
[19:53] my shard4 disk is full, so if I want to start another one off it'd currently have to be another shard4 with some faff
[19:53] or I can just randomly pick some, shrug
[19:53] Well, nearly full :)
[19:54] hm. a file can be in multiple collections. If those collections are in different shards, the same file will count for both shards
[19:54] when a given filename is in multiple collections, it only gets into one shard
[19:54] but if the same file is in the IA under different names, it might end up redundantly in multiple shards
[19:55] ok thanks
[19:55] (could sort and uniq this away, but it's a 32gb file listing them all..)
[19:55] There are enough files in the IA that I'm sure there are MD5 collisions
[19:55] that too
[19:56] although iirc we did the math and collisions seemed unlikely
[19:56] at least collisions within a shard seem very unlikely
[19:56] (non-intentional collisions)
[19:56] As I understand it the risk of *accidental* MD5 collisions is still astronomically low.
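[editor's note] The "we did the math" remark about accidental MD5 collisions can be checked with the standard birthday-bound approximation. The sketch below is only an illustration, not the calculation the channel actually ran: it assumes MD5 digests of non-adversarial files behave like uniformly random 128-bit values, and the function name is hypothetical. The 200,000 figure is the upper end of the per-shard file count mentioned in the log; the larger counts are made-up round numbers, not real archive.org totals.

```python
import math

MD5_BITS = 128  # an MD5 digest is 128 bits long


def accidental_collision_probability(n_files: int, bits: int = MD5_BITS) -> float:
    """Birthday-bound estimate of at least one accidental hash collision.

    Assumes digests are independent and uniformly distributed, which is only
    a model for files nobody crafted to collide.
    """
    space = 2 ** bits                        # number of possible digests
    pairs = n_files * (n_files - 1) // 2     # number of distinct file pairs
    # p ~= 1 - exp(-pairs / space); expm1 keeps precision when p is tiny.
    return -math.expm1(-pairs / space)


if __name__ == "__main__":
    # 200_000 comes from the log; the other counts are illustrative guesses.
    for n in (200_000, 10**9, 10**12):
        print(f"{n:>16,d} files: p ~= {accidental_collision_probability(n):.3e}")
```

Under that model the probability only approaches even odds when the file count nears 2**64, far beyond any of the counts tried above.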
[19:56] MD5 collisions are extremely unlikely unless you control both files
[19:57] Or, "unless you are actually trying for a collision".
[19:57] if i give you a static MD5, like one from the census, you couldn't make a file that collides.
[19:58] there is a cryptanalytic weakness but it's still not good enough to be feasible
[20:02] sep332: actually, people have created md5 collisions for existing other documents
[20:03] sep332: but the thing here isn't malicious intent, it's the birthday 'paradox'
[20:07] hadn't thought of birthday paradox. But I can't find any instance of preimage attack on md5?
[20:08] ./mkSHARD 'wwIIarchive dnalounge jstor_polisciequar usda-commoditysituationreports jstor_jpoliecon' 5
[20:08] fairly random, just a list an early shard packer found
[20:09] sep332: the Flame malware forged a Windows code-signing cert
[20:18] Senji2: that was a chosen-prefix collision, not a preimage. the attackers controlled part of the "original" certificate
[20:19] sep332: oh, OK, my apologies; I must have misunderstood
[20:19] I've seen a random collision (in work's backups) before :)
[20:19] some questionable conclusions but a good summary https://randomoracle.wordpress.com/2012/06/13/unanswered-questions-about-the-flame-certificate-forgery-22/
[20:19] oh, crazy
[20:29] ./mkSHARD 'sports wikipediadumps Strangefolk canadianpamphlets statedocs_maine 1920_census' 6
[20:39] *** zottelbey has quit IRC (Remote host closed the connection)
[20:47] wow, this would be a BIG shard
[20:47] 6.19 terabytes
[20:47] I'll bet it's all the sports vids
[20:47] Something to get teeth into :)
[20:47] maybe save that one for later
[20:48] similar number of files though? ~100,000?
[20:48] yes
[20:48] hmm, sports is only 1.25 tb
[20:49] must be the wikipedia
[20:49] 1.44 tb
[20:50] everything in this shard is kinda large
[20:50] 1920 census is 2.6 tb
[20:51] *** sankin has quit IRC (Leaving.)
[20:51] as in the year 1920? pretty sure they didn't have that many hard drives lo
[20:51] lol
[20:52] yeah, damnifiknow what's in tha
[20:52] high-res scans of every piece of paper?
[20:53] only 61 gb of canadian pamphlets
[20:54] yes
[20:55] scans of microfilm of photographs of the pieces of paper
[20:55] all the scratches lovingly preserved at 600dpi
[20:56] a video of someone reading all the papers out loud
[20:57] sep332: a scanning tunneling microscophy of your hard drive, 300 yrs after you back this up
[20:57] oh wait db48x are you serious?
[20:58] yes
[20:59] then "called it" ;)
[21:00] closure: did you see the stuff about my broken shards from this morning?
[21:01] yeah, when a git repo is that hosed, you can use git-annex repair, but it will be slow on the shards. Probably better to re-clone and move in .git/config and .git/annex/objects
[21:02] how is that different from fsck?
[21:02] fsck just finds problems
[21:02] it fixed the first shard
[21:03] it doesn't fix broken git repos, it only fixes up git-annex specific issues
[21:05] what do you think is broken?
[21:05] you have a bunch of broken .git/objects/ files
[21:05] well, one
[21:06] same hash repeated a number of times
[21:06] what about the other message?
[21:06] git-annex: fd:12: commitBuffer: resource vanished (Broken pipe)
[21:06] failed
[21:07] just a consequence of the actual broken git repo
[21:37] it'd be nice if it were more informative, and less spammy
[22:04] registrar master ad0b477 other SHARD5/pubkeys registration of jdamery+iabak on SHARD5
[22:08] registrar master 1308265 other SHARD3/pubkeys registration of protodev on SHARD3
[22:16] *** DFJustin has quit IRC (IMHOSTFU)
[22:17] *** DFJustin has joined #internetarchive.bak
[22:17] *** svchfoo2 sets mode: +o DFJustin
[22:17] *** svchfoo1 sets mode: +o DFJustin
[22:17] *** Start has joined #internetarchive.bak
[22:21] *** toad2 has joined #internetarchive.bak
[22:22] registrar master 67e3630 other SHARD3/pubkeys registration of protodev on SHARD3
[22:27] *** toad1 has quit IRC (Read error: Operation timed out)
[23:30] oh yea, this drive is toast:
[23:30] [db48x@celebdil shard2]$ ../git-annex.linux/git-annex fsck --quiet
[23:30] git-annex: .git/annex/objects/Gk/8X/MD5-s1843643129--f8d132c8f834ebf8597856bcfc069f8f/MD5-s1843643129--f8d132c8f834ebf8597856bcfc069f8f: hGetBufSome: hardware fault (Input/output error)
[23:31] and also:
[23:31] May 06 16:21:30 celebdil smartd[1196]: Device: /dev/sdc [SAT], 1062 Currently unreadable (pending) sectors
[23:31] May 06 16:21:30 celebdil smartd[1196]: Device: /dev/sdc [SAT], 51 Offline uncorrectable sectors
[23:38] loverly
[23:40] actually, I suppose the word "toast" applies better to a different drive from a different computer which got burnt a couple of weeks ago
[23:40] power supply shorted and caused a small fire
[23:44] oh, even better:
[23:44] May 06 15:16:16 celebdil kernel: EXT4-fs (sdc1): Delayed block allocation failed for inode 28312757 at logical offset 1162316 with max
[23:44] May 06 15:16:16 celebdil kernel: EXT4-fs (sdc1): This should not happen!! Data will be lost
[23:44] May 06 15:16:16 celebdil kernel: EXT4-fs error (device sdc1) in ext4_writepages:2395: Journal has aborted
[23:44] I'm so glad I use ZFS for most things