#internetarchive.bak 2015-05-06,Wed


Time Nickname Message
00:13 🔗 protodev has quit IRC (Ping timeout: 606 seconds)
00:14 🔗 protodev has joined #internetarchive.bak
00:32 🔗 primus104 has quit IRC (Leaving.)
00:40 🔗 protodev has quit IRC (Ping timeout: 606 seconds)
00:41 🔗 protodev has joined #internetarchive.bak
00:56 🔗 niyaje4 has joined #internetarchive.bak
01:10 🔗 protodev has quit IRC (Ping timeout: 606 seconds)
01:10 🔗 protodev has joined #internetarchive.bak
01:21 🔗 protodev has quit IRC (Ping timeout: 606 seconds)
01:22 🔗 protodev has joined #internetarchive.bak
04:21 🔗 niyaje4 has quit IRC (Ping timeout: 600 seconds)
05:03 🔗 Ctrl-S has quit IRC ( HydraIRC -> http://www.hydrairc.com <- In tests, 0x09 out of 0x0A l33t h4x0rz prefer it :))
05:10 🔗 Ctrl-S has joined #internetarchive.bak
05:48 🔗 marvinw has quit IRC (Read error: Operation timed out)
06:08 🔗 zottelbey has joined #internetarchive.bak
06:13 🔗 marvinw has joined #internetarchive.bak
06:15 🔗 Control-S has joined #internetarchive.bak
06:19 🔗 Ctrl-S has quit IRC (Read error: Operation timed out)
06:19 🔗 Control-S is now known as Ctrl-S
06:20 🔗 primus104 has joined #internetarchive.bak
06:22 🔗 niyaje4 has joined #internetarchive.bak
07:19 🔗 atomotic has joined #internetarchive.bak
07:36 🔗 lhobas my SHARD3 download finished, with 108 files that could not be downloaded - any similar experiences here?
08:15 🔗 Senji There's a bug with URL parsing of files with # in the URLs
08:23 🔗 niyaje4 has quit IRC (Ping timeout: 600 seconds)
09:07 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
09:28 🔗 Senji So, when's shard5 going to appear? :-)
10:50 🔗 primus104 has quit IRC (Leaving.)
11:00 🔗 db48x neat, one of my shards became corrupted
11:00 🔗 db48x https://pastebin.mozilla.org/8832598
11:00 🔗 Senji neat?
11:00 🔗 db48x git annex fsck fixed it though
11:01 🔗 db48x Senji: yes
11:01 🔗 db48x this is a test, we have to have bad things happen to it
11:03 🔗 Senji Good point :)
11:03 🔗 db48x I deliberately put it on a disk that I suspected of having trouble
11:04 🔗 Senji Testing the disk too then? :)
11:04 🔗 db48x yes
11:05 🔗 Senji Any idea when shard5 comes out? :-)
11:05 🔗 db48x gah, shard2 is broken the same way
11:06 🔗 db48x weirdly it's exactly the same hash in the error messages...
11:07 🔗 db48x oh man, git annex fsck is broken here too
11:08 🔗 db48x it's just repeating this over and over:
11:08 🔗 db48x git-annex: fd:12: commitBuffer: resource vanished (Broken pipe)
11:08 🔗 db48x failed
11:09 🔗 Senji Odd
11:10 🔗 db48x indeed
11:51 🔗 iabak-reg registrar master fc499f1 other SHARD4/pubkeys registration of db48x iabak on SHARD4
12:32 🔗 atomotic has joined #internetarchive.bak
12:47 🔗 sankin has joined #internetarchive.bak
13:16 🔗 primus104 has joined #internetarchive.bak
13:40 🔗 protodev btw, i uploaded my current perl-script: https://github.com/cancerAlot/IA.BAK/tree/perl-master
13:41 🔗 protodev it's still broken and not ready yet; if someone spots the bug in checkssh() which causes it to register with the wrong ssh-pub-key, tell me ;)
13:51 🔗 garyrh has quit IRC (http://bnc4free.com/)
13:58 🔗 Start has quit IRC (Disconnected.)
14:35 🔗 Start has joined #internetarchive.bak
14:39 🔗 db48x protodev: excellent, I'll take a look
14:59 🔗 Start has quit IRC (Disconnected.)
15:00 🔗 protodev db48x: there is no maintain-function, no cronjob, no proper osx-support (in the install-sub) and loads of other missing features
15:01 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
15:03 🔗 protodev and the whole @cmd-thing will be replaced - ugly hack which is not needed
15:03 🔗 Start has joined #internetarchive.bak
15:16 🔗 Start has quit IRC (Disconnected.)
15:31 🔗 iabak-reg registrar master b2119e2 other SHARD3/pubkeys registration of protodev on SHARD3
15:31 🔗 protodev YES! :D
15:32 🔗 protodev it works !!
15:32 🔗 sep332 woot
15:33 🔗 protodev ok, not yet...
15:40 🔗 iabak-reg registrar master 106f19e other SHARD3/pubkeys registration of db48x+iabak on SHARD3
15:41 🔗 Senji Hmm, the disk I have my shard4 repo in is nearly full; should I create a new shard4 repo on the next disk and worry about not having duplication; or just wait until shard5?
15:42 🔗 db48x if you create another shard4, then you'll want to manually set them both up so that they don't want content that the other has
15:43 🔗 Senji Yeah, exactly
15:44 🔗 sep332 can I do that with iabak or do i have to run "git-annex get" directly with appropriate flags?
15:44 🔗 db48x use git annex wanted
15:44 🔗 iabak-reg registrar master ed8d47f other SHARD3/pubkeys registration of protodev on SHARD3
15:44 🔗 Start has joined #internetarchive.bak
15:47 🔗 Senji The git annex wanted documentation says it doesn't have --in; which seems to rule out the obvious way of doing that (git annex wanted --not --in <uuid>)
15:49 🔗 Start has quit IRC (Read error: Connection reset by peer)
15:57 🔗 db48x yep, successfully registered
15:57 🔗 db48x plenty of other problems after that though :)
16:00 🔗 db48x protodev: I'll send you a pull request
16:02 🔗 db48x it's probably similar to what you've already got, but since you said that yours didn't work...
16:08 🔗 primus104 has quit IRC (Leaving.)
16:16 🔗 db48x protodev: removed some duplicated bits
16:18 🔗 sep332 are more shards planned? just curious
16:21 🔗 Start has joined #internetarchive.bak
16:24 🔗 garyrh has joined #internetarchive.bak
16:43 🔗 Start has quit IRC (Disconnected.)
16:56 🔗 protodev db48x: we wrote exactly the same piece of code ^
16:56 🔗 protodev i just named my variables a bit differently
17:23 🔗 primus104 has joined #internetarchive.bak
17:26 🔗 yipdw the reincarnation of Thompson and Ritchie
17:31 🔗 db48x :)
17:31 🔗 db48x sep332: yes
17:31 🔗 db48x sep332: the plan is to have enough shards to back up all of archive.org, of course
17:32 🔗 sep332 are you going to add each as the previous one gets close to done? or all done?
17:32 🔗 yipdw we should call them crystals in preparation for Final Fantasy XVI or whatever
17:33 🔗 db48x no, I believe the plan is to add all of the rest of them all at once, and spread the word wide and far
17:34 🔗 sep332 ok cool
17:38 🔗 db48x I think there were a few more things we wanted to do before then, though
17:44 🔗 primus104 has quit IRC (Leaving.)
18:32 🔗 primus104 has joined #internetarchive.bak
19:33 🔗 tpw_rules yipdw: you're only allowed to have 7 crystals
19:33 🔗 tpw_rules perhaps make the number of total shards evenly divisible by 7?
19:50 🔗 closure can make a couple more shards, just need to pick collections for them to get 100k-200k files
19:51 🔗 sep332 if the same file (by MD5) gets uploaded multiple times, will it get backed up more? or only 4 copies per MD5?
19:52 🔗 closure if it's in the same shard, only once
19:52 🔗 closure well, 4 times
19:52 🔗 Senji closure: are there likely to be more shards soon then?
19:53 🔗 closure once someone picks some collections
19:53 🔗 Senji my shard4 disk is full, so if I want to start another one off it'd currently have to be another shard4 with some faff
19:53 🔗 closure or I can just randomly pick some, shrug
19:53 🔗 Senji Well, nearly full :)
19:54 🔗 sep332 hm. a file can be in multiple collections. If those collections are in different shards, the same file will count for both shards
19:54 🔗 closure when a given filename is in multiple collections, it only gets into one shard
19:54 🔗 closure but if the same file is in the IA under different names, it might end up redundantly in multiple shards
19:55 🔗 sep332 ok thanks
19:55 🔗 closure (could sort and uniq this away, but it's a 32gb file listing them all..)
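A rough sketch of the "sort and uniq" idea closure mentions, written as a streaming pass that keeps only the first filename seen for each checksum. The input format (one `md5 filename` pair per line) and the sample entries are assumptions made for illustration; the log does not show the real layout of the 32 GB listing, and an in-memory set may well be too large for a listing of that size.

```python
from typing import Iterable, Iterator, Tuple


def first_name_per_md5(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (md5, filename) for the first occurrence of each md5.

    Assumes each line looks like "<md5> <filename>" (hypothetical format).
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        md5, _, name = line.partition(" ")
        if md5 in seen:
            # Same content already listed under another name: skip it.
            continue
        seen.add(md5)
        yield md5, name


# Made-up sample listing: the second and third entries share a checksum.
sample_listing = [
    "0123456789abcdef0123456789abcdef itemA/report.pdf",
    "fedcba9876543210fedcba9876543210 itemB/scan.tif",
    "fedcba9876543210fedcba9876543210 itemC/scan_copy.tif",
]

for md5, name in first_name_per_md5(sample_listing):
    print(md5, name)
```

For a listing too big to track in memory, an external sort on the checksum column followed by the same "keep the first of each run" step would avoid the set entirely.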
19:55 🔗 Senji There are enough files in the IA that I'm sure there are MD5 collisions
19:55 🔗 closure that too
19:56 🔗 closure although iirc we did the math and collisions seemed unlikely
19:56 🔗 closure at least collisions within a shard seem very unlikely
19:56 🔗 closure (non-intentional collisions)
19:56 🔗 pikhq As I understand it the risk of *accidental* MD5 collisions is still astronomically low.
19:56 🔗 sep332 MD5 collisions are extremely unlikely unless you control both files
19:57 🔗 pikhq Or, "unless you are actually trying for a collision".
19:57 🔗 sep332 if i give you a static MD5, like one from the census, you couldn't make a file that collides.
19:58 🔗 sep332 there is a cryptanalytic weakness but it's still not good enough to be feasible
20:02 🔗 Senji sep332: actually, people have created md5 collisions for existing other documents
20:03 🔗 Senji sep332: but the thing here isn't malicious intent, it's the birthday 'paradox'
20:07 🔗 sep332 hadn't thought of birthday paradox. But I can't find any instance of preimage attack on md5?
20:08 🔗 closure ./mkSHARD 'wwIIarchive dnalounge jstor_polisciequar usda-commoditysituationreports jstor_jpoliecon' 5
20:08 🔗 closure fairly random, just a list an early shard packer found
20:09 🔗 Senji sep332: the Flame malware forged a Windows code-signing cert
20:18 🔗 sep332 Senji: that was a chosen-prefix collision, not a preimage. the attackers controlled part of the "original" certificate
20:19 🔗 Senji sep332: oh, OK, my apologies; I must have misunderstood
20:19 🔗 Senji I've seen a random collision (in work's backups) before :)
20:19 🔗 sep332 some questionable conclusions but a good summary https://randomoracle.wordpress.com/2012/06/13/unanswered-questions-about-the-flame-certificate-forgery-22/
20:19 🔗 sep332 oh, crazy
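For the birthday 'paradox' point raised above, here is a back-of-the-envelope sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n uniformly random values in a space of N = 2^128 possible MD5 digests. The figure of 200,000 files comes from closure's "100k-200k files" per shard; treating MD5 output as uniformly random for non-malicious inputs is an assumption.

```python
import math

MD5_SPACE = 2 ** 128  # number of distinct 128-bit digests


def birthday_collision_probability(n: int, space: int = MD5_SPACE) -> float:
    """Approximate chance that at least two of n random digests collide."""
    exponent = n * (n - 1) / (2 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


files_per_shard = 200_000  # upper end of the per-shard range mentioned in the log
p_shard = birthday_collision_probability(files_per_shard)
print(f"accidental collision within one shard: about {p_shard:.2e}")
```

Under these assumptions the per-shard probability comes out far below 1e-20, which is in line with closure's recollection that accidental collisions within a shard are very unlikely.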
20:29 🔗 closure ./mkSHARD 'sports wikipediadumps Strangefolk canadianpamphlets statedocs_maine 1920_census' 6
20:39 🔗 zottelbey has quit IRC (Remote host closed the connection)
20:47 🔗 closure wow, this would be a BIG shard
20:47 🔗 closure 6.19 terabytes
20:47 🔗 closure I'll bet it's all the sports vids
20:47 🔗 Senji Something to get teeth into :)
20:47 🔗 closure maybe save that one for later
20:48 🔗 sep332 similar number of files though? ~100,000?
20:48 🔗 closure yes
20:48 🔗 closure hmm, sports is only 1.25 tb
20:49 🔗 closure must be the wikipedia
20:49 🔗 closure 1.44 tb
20:50 🔗 closure everything in this shard is kinda large
20:50 🔗 closure 1920 census is 2.6 tb
20:51 🔗 sankin has quit IRC (Leaving.)
20:51 🔗 sep332 as in the year 1920? pretty sure they didn't have that many hard drives lo
20:51 🔗 sep332 lol
20:52 🔗 closure yeah, damnifiknow what's in that
20:52 🔗 sep332 high-res scans of every piece of paper?
20:53 🔗 closure only 61 gb of canadian pamphlets
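The sizes closure quotes can be tallied to see how much of the 6.19 TB is accounted for. This is only arithmetic on the numbers in the log; matching "wikipedia" to `wikipediadumps` and "canadian pamphlets" to `canadianpamphlets`, and treating 61 gb as 0.061 tb (decimal units), are assumptions.

```python
shard_total_tb = 6.19  # total quoted for the proposed shard 6

# Per-collection sizes quoted in the channel, in terabytes.
known_tb = {
    "sports": 1.25,
    "wikipediadumps": 1.44,
    "1920_census": 2.6,
    "canadianpamphlets": 0.061,  # "61 gb", assuming 1 tb = 1000 gb
}

accounted_tb = sum(known_tb.values())
remaining_tb = shard_total_tb - accounted_tb

print(f"accounted for: {accounted_tb:.3f} TB")
print(f"left for Strangefolk and statedocs_maine: {remaining_tb:.3f} TB")
```

On these figures the four quoted collections cover most of the shard, leaving a bit under 1 TB for the two collections whose sizes were not mentioned.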
20:54 🔗 db48x yes
20:55 🔗 db48x scans of microfilm of photographs of the pieces of paper
20:55 🔗 closure all the scratches lovingly preserved at 600dpi
20:56 🔗 sep332 a video of someone reading all the papers out loud
20:57 🔗 closure sep332: a scanning tunneling microscopy of your hard drive, 300 yrs after you back this up
20:57 🔗 sep332 oh wait db48x are you serious?
20:58 🔗 db48x yes
20:59 🔗 sep332 then "called it" ;)
21:00 🔗 db48x closure: did you see the stuff about my broken shards from this morning?
21:01 🔗 closure yeah, when a git repo is that hosed, you can use git-annex repair, but it will be slow on the shards. Probably better to re-clone and move in .git/config and .git/annex/objects
21:02 🔗 db48x how is that different from fsck?
21:02 🔗 closure fsck just finds problems
21:02 🔗 db48x it fixed the first shard
21:03 🔗 closure it doesn't fix broken git repos, it only fixes up git-annex specific issues
21:05 🔗 db48x what do you think is broken?
21:05 🔗 closure you have a bunch of broken .git/objects/ files
21:05 🔗 db48x well, one
21:06 🔗 db48x same hash repeated a number of times
21:06 🔗 db48x what about the other message?
21:06 🔗 db48x git-annex: fd:12: commitBuffer: resource vanished (Broken pipe)
21:06 🔗 db48x failed
21:07 🔗 closure just a consequence of the actual broken git repo
21:37 🔗 db48x it'd be nice if it were more informative, and less spammy
22:04 🔗 iabak-reg registrar master ad0b477 other SHARD5/pubkeys registration of jdamery+iabak on SHARD5
22:08 🔗 iabak-reg registrar master 1308265 other SHARD3/pubkeys registration of protodev on SHARD3
22:16 🔗 DFJustin has quit IRC (IMHOSTFU)
22:17 🔗 DFJustin has joined #internetarchive.bak
22:17 🔗 svchfoo2 sets mode: +o DFJustin
22:17 🔗 svchfoo1 sets mode: +o DFJustin
22:17 🔗 Start has joined #internetarchive.bak
22:21 🔗 toad2 has joined #internetarchive.bak
22:22 🔗 iabak-reg registrar master 67e3630 other SHARD3/pubkeys registration of protodev on SHARD3
22:27 🔗 toad1 has quit IRC (Read error: Operation timed out)
23:30 🔗 db48x oh yea, this drive is toast:
23:30 🔗 db48x [db48x@celebdil shard2]$ ../git-annex.linux/git-annex fsck --quiet
23:30 🔗 db48x git-annex: .git/annex/objects/Gk/8X/MD5-s1843643129--f8d132c8f834ebf8597856bcfc069f8f/MD5-s1843643129--f8d132c8f834ebf8597856bcfc069f8f: hGetBufSome: hardware fault (Input/output error)
23:31 🔗 db48x and also:
23:31 🔗 db48x May 06 16:21:30 celebdil smartd[1196]: Device: /dev/sdc [SAT], 1062 Currently unreadable (pending) sectors
23:31 🔗 db48x May 06 16:21:30 celebdil smartd[1196]: Device: /dev/sdc [SAT], 51 Offline uncorrectable sectors
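The object name in the fsck error above is a git-annex key, which encodes the backend, the file size, and the checksum. A minimal sketch of pulling those fields apart, assuming the common `BACKEND-sSIZE--NAME` layout; the helper name `parse_annex_key` is hypothetical and other optional key fields are simply ignored here.

```python
from typing import Optional, Tuple


def parse_annex_key(key: str) -> Tuple[str, Optional[int], str]:
    """Split a git-annex key into (backend, size_in_bytes, name)."""
    head, sep, name = key.partition("--")
    if not sep:
        raise ValueError(f"not a git-annex style key: {key!r}")
    backend, *fields = head.split("-")
    size = None
    for field in fields:
        # The size field is the letter "s" followed by a byte count.
        if field.startswith("s") and field[1:].isdigit():
            size = int(field[1:])
    return backend, size, name


# Key copied from the fsck error in the log.
key = "MD5-s1843643129--f8d132c8f834ebf8597856bcfc069f8f"
backend, size, name = parse_annex_key(key)
print(backend, size, name)
```

Read this way, the unreadable object is a single file of roughly 1.84 GB hashed with the MD5 backend.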
23:38 🔗 Senji loverly
23:40 🔗 db48x actually, I suppose the word "toast" applies better to a different drive from a different computer which got burnt a couple of weeks ago
23:40 🔗 db48x power supply shorted and caused a small fire
23:44 🔗 db48x oh, even better:
23:44 🔗 db48x May 06 15:16:16 celebdil kernel: EXT4-fs (sdc1): Delayed block allocation failed for inode 28312757 at logical offset 1162316 with max
23:44 🔗 db48x May 06 15:16:16 celebdil kernel: EXT4-fs (sdc1): This should not happen!! Data will be lost
23:44 🔗 db48x May 06 15:16:16 celebdil kernel: EXT4-fs error (device sdc1) in ext4_writepages:2395: Journal has aborted
23:44 🔗 db48x I'm so glad I use ZFS for most things
