[03:41] hi
[03:58] howdy no2pencil
[04:13] hrm
[04:13] why can't I find any SATA controllers that are just SATA controllers with no RAID support?
[05:28] db48x: promise sata300 tx2 and tx4 fit that
[05:28] and several other cards
[05:29] and you can usually use a raid card without the raid features
[05:43] Optimus Primus
[05:44] bringing the Pork Soda?
[06:31] Coderjoe: ah, the tx4 is pretty much exactly what I want, except that it's PCI instead of PCIe :)
[07:06] best I've found so far is the LSI 9212-4i4e
[07:07] slightly expensive though
[07:08] paying for a lot of cpu power and ram that I won't use :(
[08:00] alard: wget-warc is great :)
[10:45] Hey, does anybody know of some blogs that focus on data backup topics, like backup solutions, hardware, software, etc?
[12:12] robots.txt download ran successfully tonight
[12:12] ~7 MB per day since it is not deduplicating
[12:13] http://91.121.208.153:32198/ if someone wants the files
[12:13] the paths are sometimes absolute at the moment
[12:13] doit.sh is my cronjob
[12:18] hmm
[12:18] any good ones?
[12:18] no idea
[12:19] i mean yes, i linked some stupid ones weeks ago :P
[12:20] next step will be making something that diffs the files and renders a nice html overview from that
[12:22] whoever is wgetting, stop that please
[12:22] it is utterly pointless
[12:22] :)
[12:23] heh
[12:23] exclude the files dir, then it makes sense
[12:23] why do you say that?
[12:24] files/ is 20000 files. they are inserted into the robots.db
[12:24] robots.db is harder to snapshot
[12:25] snapshot?
[12:26] i'll add a daily 7z of the files, that seems like a good idea
[12:26] my filesystem will record the changes as I mirror them
[12:26] in this case I should just change the scripts to overwrite the files if they've changed
[12:29] are the scripts in version control?
[12:29] nope
[12:30] I recommend git or mercurial
[12:30] but that sounds like a good idea
[12:30] we've got an Archive Team group on github that you could join
[12:31] http://91.121.208.153:32198/files_20110702.7z
[12:32] http://91.121.208.153:32198/files_20110703.7z
[12:33] i'll see about git
[12:33] gotta go now
[12:33] Spirit_: see you around :)
[12:33] :)
[12:43] http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
[12:43] Glorious!!
[12:46] SketchCow: yea, it's pretty cool
[12:47] SketchCow: the resulting warc file doesn't validate with the latest warc-tools though. it complains about the version number
[12:52] is anyone archiving digitalpreservation.org?
[13:05] what about archiving archive.org?
[13:05] we should get a wayback machine scraper going
[13:05] :)
[13:06] all those eggs in one basket...
[13:06] there should be a button
[13:06] "download the internet"
[13:06] mirroring digitalpreservation.org now
[13:07] Ymgve: insert disc 1...
[13:08] alard: hey
[13:28] db48x: Hey. (My internet connection is a bit unreliable at the moment. Perhaps I make too many connections.)
[13:29] heh
[13:29] alard: wget-warc is pretty cool
[13:29] Thanks.
[13:30] I notice that it's using version 0.18 and the latest version of the tools wants version 1.0
[13:30] any differences of note?
[13:30] I don't think so.
[13:31] 0.18 is the latest draft version of the specification, I believe. For version 1.0 you have to pay.
[13:31] In the version I have here I just changed the version number to 1.0
[13:31] yea, I suspected as much
[13:31] heh
[13:32] I am also not sure about the warc-tools library. The gzipped files it produces are a bit strange.
[13:32] strange in what way?
[13:32] Every warc record should end with a few newlines, according to the spec, and you are allowed to gzip records.
[13:33] The Heritrix warc writer gzips records including the newlines at the end.
[13:33] The warc-tools library doesn't: it gzips the record and then adds non-gzipped newlines.
[13:33] ah, interesting
[13:33] So I'm not sure if that's allowed.
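A minimal sketch of the per-record gzip layout described above, assuming zlib: gzopen() in append mode starts a fresh gzip stream, so each WARC record becomes its own independently decompressible gzip member, with the trailing CRLFs compressed inside the member the way the chat says Heritrix does it. The function name, file name, and sample record are hypothetical.

    #include <string.h>
    #include <zlib.h>

    /* Append one WARC record to a .warc.gz file as its own gzip member.
     * gzopen() in append mode starts a new gzip stream, so each call
     * produces one separate member. The trailing CRLFs are written
     * inside the compressed member (the Heritrix-style layout).
     * Sketch only; append_warc_record and its inputs are hypothetical. */
    static int append_warc_record(const char *path, const char *record, size_t len)
    {
        gzFile gz = gzopen(path, "ab");
        if (gz == NULL)
            return -1;

        int ok = gzwrite(gz, record, (unsigned) len) == (int) len
              && gzwrite(gz, "\r\n\r\n", 4) == 4;

        return (gzclose(gz) == Z_OK && ok) ? 0 : -1;
    }

    int main(void)
    {
        const char *rec = "WARC/1.0\r\n"
                          "WARC-Type: warcinfo\r\n"
                          "Content-Length: 0\r\n"
                          "\r\n";
        return append_warc_record("test.warc.gz", rec, strlen(rec)) == 0 ? 0 : 1;
    }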
[13:38] how much is the spec? we can take up a collection if needed.
[13:38] http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717
[13:39] 118 Swiss francs.
[13:39] But I think the 0.18 final draft is pretty similar to the final version.
[13:39] so about 140 in real money
[13:40] Or 96, depending on your definition of 'real money'. :)
[13:40] According to the Heritrix source code, they also used the 0.18 draft and they just output WARC/1.0
[13:40] that's exactly the sort of issue that they will have worked out in the standardization process
[13:41] heh
[13:42] > debug point: caller "couldn't add record to the warc file, maximum size reached"
[13:42] getting a lot of those
[13:42] I'd like to put up the wget-warc code somewhere. But wget uses bazaar, which more or less means launchpad, and from what I've seen of that I don't like it.
[13:42] Yeah, that's an unsolved problem.
[13:43] maximum size of the warc file is 1gb :(
[13:43] The warc-tools library lets you set a maximum file size. The idea is that you open a new file after that, but I didn't do that yet.
[13:43] that seems silly
[13:44] Well, it keeps things manageable.
[13:44] just let the filesystem worry about where it will put the file
[13:45] --warc-max-size
[13:45] which defaults to 0, which is interpreted as unlimited
[13:46] anyway, just because wget uses bzr doesn't mean you have to do anything with launchpad
[13:46] you can export the patches into a nice diff-with-metadata
[13:47] I'm going to add my code to the github repository.
[13:47] bzr send -o exported.patch
[13:47] won't that make it more difficult to get the patch accepted upstream?
[13:49] Maybe, but it keeps things simpler until then. And it should be possible to combine all the little patches into one big patch and send that upstream.
[13:49] I hate doing that
[13:49] (And I'm not even sure if this should be added to wget, or just stay a separate version.)
[13:49] I mean, yes, that makes it easier to review, so attach that to the bug
[13:50] but actually erasing the commit history and replacing it with a sanitized version...
[13:50] yea, it should definitely be pushed upstream
[13:51] one reason I don't like git is that it makes it hard to avoid rebasing in some situations
[13:51] like when sending patches via email
[13:51] the recipient has to be uber-careful or he'll rebase your patch for you
[13:52] Ah, I have no experience with that. I mostly work with my own repositories.
[13:53] btw, could you push to your github repository? I'd like to take a look :)
[13:59] Well, the code is in the tar.gz, so you could have a look at that. But I'll have a look.
[14:04] yea, but there's no version control
[14:04] No, that's in my local bzr.
[14:04] I guess I could recursively diff against a clean copy
[14:09] or you could use bzr send and pastebin/email the result
[14:09] the idea is to see a diff, not the whole source :)
[14:10] I'm constructing the git version now.
[14:10] cool
[14:21] digitalpreservation.gov is 2 GB
[14:21] https://github.com/alard/wget-warc/commit/cbbd701d784ebe6253a5a2b7d6abe5bd3c64670b
[14:21] cool
[14:22] It is quite a hackish modification, not very neat. But then, the wget source code wasn't very clean either.
[14:24] yes, there are some rough edges apparent
[14:26] And I don't normally program in C, so there may be all kinds of memory leaks and other dangerous bugs.
[14:26] answers my question about ftp though :)
[14:26] It doesn't.
[14:26] yea
[14:26] as for memory, I saw stack allocations but no heap allocations
[14:27] so you didn't leak anything at least
[14:27] That's a relief. I tried to keep clear of defining too many variables myself.
[14:28] oh, unless bless() is a heap allocation
[14:28] I think it is. If you call bless(), you also have to call dispose().
[14:28] yea
[14:28] weird name for it
[14:30] It's also strange that the warc library has a public and a private section, the idea being that you only use the public part, but you can't actually compile anything without also referring to the private bits.
[14:31] But I'll give you access to the repository, then you can fix things if you want.
[14:31] this is a silly line:
[14:31] WRecord_setContentType (responseWRecord, ((warc_u8_t *) "application/http;msgtype=response"), w_strlen(((warc_u8_t *) "application/http;msgtype=response")));
[14:32] although possibly the optimizer is smart enough to figure out the length at compile time, instead of having two string literals in memory
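A sketch of how that line could be tightened so the length is computed at compile time from a single literal. The typedef and prototype below are stand-ins for the real warc-tools declarations, which may differ, and CONTENT_TYPE_RESPONSE is a hypothetical name.

    #include <stddef.h>

    /* Stubs standing in for the warc-tools declarations quoted above;
     * the real header and signature may differ (assumption). */
    typedef unsigned char warc_u8_t;
    void WRecord_setContentType(void *record, const warc_u8_t *type, size_t len);

    /* One literal instead of two: sizeof on a string literal is a
     * compile-time constant, so no run-time w_strlen() call is needed. */
    #define CONTENT_TYPE_RESPONSE "application/http;msgtype=response"

    void set_response_content_type(void *responseWRecord)
    {
        WRecord_setContentType(responseWRecord,
                               (warc_u8_t *) CONTENT_TYPE_RESPONSE,
                               sizeof(CONTENT_TYPE_RESPONSE) - 1); /* minus the NUL */
    }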
[18:19] warc would be awesome to have in browsers instead of the rather common MHTML, huh?
[18:19] Yes...
[18:23] http://en.wikipedia.org/wiki/KDE_WAR_(file_format) eek
[18:30] random idea: we could collect exclude rules for mirroring common website framework systems (forums, blogs)
[18:30] some forums have infinite link loops, etc
[18:39] db48x: depends on the card. there are a lot of cards that claim raid features, but depend on the driver and main CPU to do most of the work
[19:01] so how do i get into the github group?
[19:01] i am https://github.com/SpiritQuaddicted
[19:12] haha, i just deleted my files when trying to create a repository
[19:12] yay for backups :)
[19:15] and now that stupid ssh key hackery again
[19:15] i hate their tutorial on that
[19:15] "just remove the old stuff", this can't be safe
[19:16] wait, my account is still authenticated
[19:17] ok, now a repo. do i have to create it on the github site, clone and then commit? or how does my local git know what to do
[19:18] i need a good name for a robots.txt downloader and, some day in the future, DIFFer
[19:19] TheDroidYouAreLookingFor
[19:20] robots-robber
[19:22] radical robots
[19:23] robotic ramifications
[19:23] robot rush
[19:24] robots replace
[19:24] err
[19:24] robots relapse
[19:24] relics
[19:25] robot rollback
[19:25] aaaah, this will go on forever
[19:26] robot rowels
[19:27] enough
[19:28] torn between
[19:28] "robotic ramifications", "robots relapse", "robot rollback"
[19:28] (i hate the linux clipboard)
[19:29] robots-relapse it is
[19:29] nice double meaning too
[19:30] https://github.com/SpiritQuaddicted/robots-relapse
[19:30] \o/
[19:30] interesting, i had another ancient github profile
[19:31] crap
[19:32] sometimes i hate the internet
[19:35] better now
[19:47] i am pretty sure it will not work as it is online right now
[19:47] but enough for today