#archiveteam 2011-07-03,Sun

Time Nickname Message
03:41 🔗 no2pencil hi
03:58 🔗 db48x howdy no2pencil
04:13 🔗 db48x hrm
04:13 🔗 db48x why can't I find any SATA controllers that are just SATA controllers with no RAID support?
05:28 🔗 Coderjoe db48x: promise sata300 tx2 and tx4 fit that
05:28 🔗 Coderjoe and several other cards
05:29 🔗 Coderjoe and you can usually use a raid card without the raid features
05:43 🔗 no2pencil Optimus Primus
05:44 🔗 Coderjoe bringing the Pork Soda?
06:31 🔗 db48x Coderjoe: ah, the tx4 is pretty much exactly what I want, except that it's PCI instead of PCIe :)
07:06 🔗 db48x best I've found so far is the LSI 9212-4i4e
07:07 🔗 db48x slightly expensive though
07:08 🔗 db48x paying for a lot of cpu power and ram that I won't use :(
08:00 🔗 db48x alard: wget-warc is great :)
10:45 🔗 Auguste Hey, does anybody know of some blogs that focus on data backup topics, like backup solutions, hardware, software, etc?
12:12 🔗 Spirit_ robots.txt download ran with success tonight
12:12 🔗 Spirit_ ~7mb per day since it is not deduplicating
12:13 🔗 Spirit_ http://91.121.208.153:32198/ if someone wants the files
12:13 🔗 Spirit_ the paths are sometimes absolute at the moment
12:13 🔗 Spirit_ doit.sh is my cronjob
12:18 🔗 db48x hmm
12:18 🔗 db48x any good ones?
12:18 🔗 Spirit_ no idea
12:19 🔗 Spirit_ i mean yes, i linked some stupid ones weeks ago :P
12:20 🔗 Spirit_ next step will be making something that diffs the files and renders some nice html overview with that
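A minimal sketch of the diff-and-render step described here, assuming two snapshots of the same robots.txt are already available as strings. The function names and labels are hypothetical and are not taken from Spirit_'s scripts; the standard-library difflib module stands in for whatever diff tool ends up being used.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the added and removed lines, prefixed with '+ ' or '- '."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def robots_diff_html(old_text, new_text, old_label="previous", new_label="current"):
    """Render a side-by-side HTML table comparing two robots.txt snapshots."""
    return difflib.HtmlDiff().make_table(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc=old_label,
        todesc=new_label,
        context=True,   # show only the changed regions
        numlines=2,     # plus two lines of surrounding context
    )


if __name__ == "__main__":
    old = "User-agent: *\nDisallow: /private/\n"
    new = "User-agent: *\nDisallow: /private/\nDisallow: /files/\n"
    for line in changed_lines(old, new):
        print(line)
    print(robots_diff_html(old, new)[:60])
```

An overview page could then concatenate one such table per site whose changed_lines() result is non-empty.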
12:22 🔗 Spirit_ whoever is wgetting, stop that please
12:22 🔗 Spirit_ it is utterly pointless
12:22 🔗 Spirit_ :)
12:23 🔗 db48x heh
12:23 🔗 Spirit_ exclude the files dir, then it makes sense
12:23 🔗 db48x why do you say that?
12:24 🔗 Spirit_ files/ is 20000 files. they are inserted into the robots.db
12:24 🔗 db48x robots.db is harder to snapshot
12:25 🔗 Spirit_ snapshot?
12:26 🔗 Spirit_ i'll add a daily 7z of the files, that seems like a good idea
12:26 🔗 db48x my filesystem will record the changes as I mirror them
12:26 🔗 db48x in this case I should just change the scripts to overwrite the files if they've changed
12:29 🔗 db48x are the scripts in version control?
12:29 🔗 Spirit_ nope
12:30 🔗 db48x I recommend git or mercurial
12:30 🔗 Spirit_ but that sounds like a good idea
12:30 🔗 db48x we've got an Archive Team group on github that you could join
12:31 🔗 Spirit_ http://91.121.208.153:32198/files_20110702.7z
12:32 🔗 Spirit_ http://91.121.208.153:32198/files_20110703.7z
12:33 🔗 Spirit_ i'll see about git
12:33 🔗 Spirit_ gotta go now
12:33 🔗 db48x Spirit_: see you around :)
12:33 🔗 Spirit_ :)
12:43 🔗 SketchCow http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
12:43 🔗 SketchCow Glorious!!
12:46 🔗 db48x SketchCow: yea, it's pretty cool
12:47 🔗 db48x SketchCow: the resulting warc file doesn't validate with the latest warc-tools though. it complains about the version number
12:52 🔗 db48x is anyone archiving digitalpreservation.org?
13:05 🔗 Ymgve what about archiving archive.org?
13:05 🔗 Ymgve we should get a wayback machine scraper going
13:05 🔗 db48x :)
13:06 🔗 db48x all those eggs in one basket...
13:06 🔗 Ymgve there should be a button
13:06 🔗 Ymgve "download the internet"
13:06 🔗 db48x mirroring digitalpreservation.org now
13:07 🔗 db48x Ymgve: insert disc 1...
13:08 🔗 db48x alard: hey
13:28 🔗 alard db48x: Hey. (My internet connection is a bit unreliable at the moment. Perhaps I make too many connections.)
13:29 🔗 db48x heh
13:29 🔗 db48x alard: wget-warc is pretty cool
13:29 🔗 alard Thanks.
13:30 🔗 db48x I notice that it's using version 0.18 and the latest version of the tools want version 1.0
13:30 🔗 db48x any differences of note?
13:30 🔗 alard I don't think so.
13:31 🔗 alard 0.18 is the latest draft version of the specification, I believe. For version 1.0 you have to pay.
13:31 🔗 alard In the version I have here I just changed the version number to 1.0
13:31 🔗 db48x yea, I suspected as much
13:31 🔗 db48x heh
13:32 🔗 alard I am also not sure about the warc-tools library. The gzipped files it produces are a bit strange.
13:32 🔗 db48x strange in what way?
13:32 🔗 alard Every warc record should end with a few newlines, according to the spec, and you are allowed to gzip records.
13:33 🔗 alard The Heritrix warc writer gzips records including the newlines at the end.
13:33 🔗 alard The warc-tools library doesn't: it gzips the record and then adds non-gzipped newlines.
13:33 🔗 db48x ah, interesting
13:33 🔗 alard So I'm not sure if that's allowed.
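The two layouts alard describes can be sketched as follows. This is an illustration with Python's standard gzip module, not the Heritrix or warc-tools code; the function names are hypothetical, and the two-CRLF terminator is an assumption based on a reading of the WARC drafts.

```python
import gzip

# Assumption: each WARC record is terminated by two CRLF pairs.
RECORD_END = b"\r\n\r\n"


def gzip_with_trailer_inside(record):
    """Heritrix-style as described above: the newlines are part of the gzip member."""
    return gzip.compress(record + RECORD_END)


def gzip_with_trailer_outside(record):
    """warc-tools-style as described above: plain newlines follow the gzip member."""
    return gzip.compress(record) + RECORD_END


def reads_as_multimember_gzip(data):
    """True if Python's gzip module accepts data as concatenated gzip members."""
    try:
        gzip.decompress(data)
        return True
    except (OSError, EOFError):
        return False


if __name__ == "__main__":
    records = [b"WARC/0.18 record one", b"WARC/0.18 record two"]
    inside = b"".join(gzip_with_trailer_inside(r) for r in records)
    outside = b"".join(gzip_with_trailer_outside(r) for r in records)
    print(reads_as_multimember_gzip(inside))
    print(reads_as_multimember_gzip(outside))
```

With the first layout the whole file is a plain sequence of gzip members, so a generic reader can decompress it in one pass; with the second, a strict reader meets bytes between members that are not a gzip header.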
13:38 🔗 db48x how much is the spec? we can take a collection if needed.
13:38 🔗 alard http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717
13:39 🔗 alard 118 Swiss francs.
13:39 🔗 alard But I think the 0.18 final draft is pretty similar to the final version.
13:39 🔗 db48x so about a 140 in real money
13:40 🔗 alard Or 96, depending on your definition of 'real money'. :)
13:40 🔗 alard According to the Heritrix source code, they also used the 0.18 draft and
13:40 🔗 alard they just output WARC/1.0
13:40 🔗 db48x that's exactly the sort of issue that they will have worked out in the standardization process
13:41 🔗 db48x heh
13:42 🔗 db48x > debug point: caller<lib private wfile.c:WFile_storeRecordUncompressed:1998>"couldn't add record to the warc file, maximum size reached"
13:42 🔗 db48x getting a lot of those
13:42 🔗 alard I'd like to put up the wget-warc code somewhere. But wget uses bazaar, which more or less means launchpad, and from what I've seen of that I don't like that.
13:42 🔗 alard Yeah, that's an unsolved problem.
13:43 🔗 db48x maximum size of the warc file is 1gb :(
13:43 🔗 alard The warc-tools let you set a maximum file size. The idea is that you open a new file after that, but I didn't do that yet.
13:43 🔗 db48x that seems silly
13:44 🔗 alard Well, it keeps things manageable.
13:44 🔗 db48x just let the filesystem worry about where it will put the file
13:45 🔗 db48x --warc-max-size
13:45 🔗 db48x which defaults to 0, which is interpreted as unlimited
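A rough sketch of how the two ideas could fit together: a size limit that opens a new file once it is reached, with 0 meaning unlimited as db48x suggests. The class, its file naming, and the max_size parameter are hypothetical; none of this is wget-warc or warc-tools code.

```python
import os


class RollingWriter:
    """Append records to numbered files, opening a new one when max_size is reached."""

    def __init__(self, directory, prefix="archive", max_size=0):
        self.directory = directory
        self.prefix = prefix
        self.max_size = max_size  # 0 is interpreted as unlimited
        self.paths = []           # every file opened so far
        self._file = None
        self._written = 0

    def _open_next(self):
        if self._file is not None:
            self._file.close()
        name = "%s-%05d.warc" % (self.prefix, len(self.paths))
        path = os.path.join(self.directory, name)
        self._file = open(path, "wb")
        self.paths.append(path)
        self._written = 0

    def write_record(self, record):
        # Never split a record; an oversized record gets a file to itself.
        over_limit = (
            self.max_size > 0
            and self._written > 0
            and self._written + len(record) > self.max_size
        )
        if self._file is None or over_limit:
            self._open_next()
        self._file.write(record)
        self._written += len(record)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

With max_size=0 everything lands in a single file and the filesystem decides where it goes; with a positive limit the writer rolls over instead of failing with "maximum size reached".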
13:46 🔗 db48x anyway, just because wget uses bzr doesn't mean you have to do anything with launchpad
13:46 🔗 db48x you can export the patches into a nice diff-with-metadata
13:47 🔗 alard I'm going to add my code to the github repository.
13:47 🔗 db48x bzr send -o exported.patch
13:47 🔗 db48x won't that make it more difficult to get the patch accepted upstream?
13:49 🔗 alard Maybe, but it keeps things simpler until then. And it should be possible to combine all the little patches into one big patch and send that upstream.
13:49 🔗 db48x I hate doing that
13:49 🔗 alard (And I'm not even sure if this should be added to wget, or just stay a separate version.)
13:49 🔗 db48x I mean, yes, that makes it easier to review, so attach that to the bug
13:50 🔗 db48x but actually erasing the commit history and replacing it with a sanitized version...
13:50 🔗 db48x yea, it should definitely be pushed upstream
13:51 🔗 db48x one reason I don't like git is that it makes it hard to avoid rebasing in some situations
13:51 🔗 db48x like when sending patches via email
13:51 🔗 db48x the recipient has to be uber-careful or he'll rebase your patch for you
13:52 🔗 alard Ah, I have no experience with that. I mostly work with my own repositories.
13:53 🔗 db48x btw, could you push to your github repository? I'd like to take a look :)
13:59 🔗 alard Well, the code is in the tar.gz, so you could have a look at that. But I'll have a look.
14:04 🔗 db48x yea, but there's no version control
14:04 🔗 alard No, that's in my local bzr.
14:04 🔗 db48x I guess I could recursively diff against a clean copy
14:09 🔗 db48x or you could use bzr send and pastebin/email the result
14:09 🔗 db48x the idea is to see a diff, not the whole source :)
14:10 🔗 alard I'm constructing the git version now.
14:10 🔗 db48x cool
14:21 🔗 db48x digitalpreservation.gov is 2 GB
14:21 🔗 alard https://github.com/alard/wget-warc/commit/cbbd701d784ebe6253a5a2b7d6abe5bd3c64670b
14:21 🔗 db48x cool
14:22 🔗 alard It is quite a hackish modification, not very neat. But then, the wget source code wasn't very clean either.
14:24 🔗 db48x yes, there are some rough edges apparent
14:26 🔗 alard And I don't normally program in C, so there may be all kinds of memory leaks and other dangerous bugs.
14:26 🔗 db48x answers my question about ftp though :)
14:26 🔗 alard It doesn't.
14:26 🔗 db48x yea
14:26 🔗 db48x as for memory, I saw stack allocations but no heap allocations
14:27 🔗 db48x so you didn't leak anything at least
14:27 🔗 alard That's a relief. I tried to keep clear of defining too many variables myself.
14:28 🔗 db48x oh, unless bless() is a heap allocation
14:28 🔗 alard I think it is. If you call bless( ), you also have to call dispose( ).
14:28 🔗 db48x yea
14:28 🔗 db48x weird name for it
14:30 🔗 alard It's also strange that the warc library has a public and a private section, the idea being that you only use the public part, but you can't actually compile anything without also referring to the private bits.
14:31 🔗 alard But I'll give you access to the repository, then you can fix things if you want.
14:31 🔗 db48x this is a silly line:
14:31 🔗 db48x WRecord_setContentType (responseWRecord, ((warc_u8_t *) "application/http;msgtype=response"), w_strlen(((warc_u8_t *) "application/http;msgtype=response")));
14:32 🔗 db48x although possibly the optimizer is smart enough to figure out the length at compile time, instead of having two string literals in memory
18:19 🔗 Spirit_ warc would be awesome to have in browsers instead of the rather common MHTML, huh?
18:19 🔗 marceloan Yes...
18:23 🔗 Spirit_ http://en.wikipedia.org/wiki/KDE_WAR_(file_format) eek
18:30 🔗 Spirit_ random idea: we could collect exclude rules for mirroring common website framework (forums, blogs) systems
18:30 🔗 Spirit_ some forums have infinite link circles, etc
18:39 🔗 Coderjoe db48x: depends on the card. there are a lot of cards that claim raid features, but depend on the driver and main CPU to do most of the work
19:01 🔗 Spirit_ so how do i get into the github group?
19:01 🔗 Spirit_ i am https://github.com/SpiritQuaddicted
19:12 🔗 Spirit_ haha, i just deleted my files when trying to create a repository
19:12 🔗 Spirit_ yay for backups:)
19:15 🔗 Spirit_ and now that stupid ssh key hackery again
19:15 🔗 Spirit_ i hate their tutorial on that
19:15 🔗 Spirit_ "just remove the old stuff", this cant be safe
19:16 🔗 Spirit_ wait, my account is still authenticated
19:17 🔗 Spirit_ ok, now a repo. do i have to create it on the github site, clone and then commit? or how does my local git know what to do
19:18 🔗 Spirit_ i need a good name for a robots.txt downloader and some day in the future DIFFer
19:19 🔗 Spirit_ TheDroidYouAreLookingFor
19:20 🔗 Spirit_ robots-robber
19:22 🔗 Spirit_ radical robots
19:23 🔗 Spirit_ robotic ramifications
19:23 🔗 Spirit_ robot rush
19:24 🔗 Spirit_ robots replace
19:24 🔗 Spirit_ err
19:24 🔗 Spirit_ robots relapse
19:24 🔗 Spirit_ relics
19:25 🔗 Spirit_ robot rollback
19:25 🔗 Spirit_ aaaah, this will go on forever
19:26 🔗 Spirit_ robot rowels
19:27 🔗 Spirit_ enough
19:28 🔗 Spirit_ tied between
19:28 🔗 Spirit_ "robotic ramifications", "robots relapse", "robot rollback"
19:28 🔗 Spirit_ (i hate the linux clipboard)
19:29 🔗 Spirit_ robots-relapse it is
19:29 🔗 Spirit_ nice double meaning too
19:30 🔗 Spirit_ https://github.com/SpiritQuaddicted/robots-relapse
19:30 🔗 Spirit_ \o/
19:30 🔗 Spirit_ interesting, i had another ancient github profile
19:31 🔗 Spirit_ crap
19:32 🔗 Spirit_ sometimes i hate the internet
19:35 🔗 Spirit_ better now
19:47 🔗 Spirit_ i am pretty sure it will not work as it is online right now
19:47 🔗 Spirit_ but enough for today
