Time |
Nickname |
Message |
03:41
🔗
|
no2pencil |
hi |
03:58
🔗
|
db48x |
howdy no2pencil |
04:13
🔗
|
db48x |
hrm |
04:13
🔗
|
db48x |
why can't I find any SATA controllers that are just SATA controllers with no RAID support? |
05:28
🔗
|
Coderjoe |
db48x: promise sata300 tx2 and tx4 fit that |
05:28
🔗
|
Coderjoe |
and several other cards |
05:29
🔗
|
Coderjoe |
and you can usually use a raid card without the raid features |
05:43
🔗
|
no2pencil |
Optimus Primus |
05:44
🔗
|
Coderjoe |
bringing the Pork Soda? |
06:31
🔗
|
db48x |
Coderjoe: ah, the tx4 is pretty much exactly what I want, except that it's PCI instead of PCIe :) |
07:06
🔗
|
db48x |
best I've found so far is the LSI 9212-4i4e |
07:07
🔗
|
db48x |
slightly expensive though |
07:08
🔗
|
db48x |
paying for a lot of cpu power and ram that I won't use :( |
08:00
🔗
|
db48x |
alard: wget-warc is great :) |
10:45
🔗
|
Auguste |
Hey, does anybody know of some blogs that focus on data backup topics, like backup solutions, hardware, software, etc? |
12:12
🔗
|
Spirit_ |
robots.txt download ran with success tonight |
12:12
🔗
|
Spirit_ |
~7mb per day since it is not deduplicating |
12:13
🔗
|
Spirit_ |
http://91.121.208.153:32198/ if someone wants the files |
12:13
🔗
|
Spirit_ |
the paths are sometimes absolute at the moment |
12:13
🔗
|
Spirit_ |
doit.sh is my cronjob |
12:18
🔗
|
db48x |
hmm |
12:18
🔗
|
db48x |
any good ones? |
12:18
🔗
|
Spirit_ |
no idea |
12:19
🔗
|
Spirit_ |
i mean yes, i linked some stupid ones weeks ago :P |
12:20
🔗
|
Spirit_ |
next step will be making something that diffs the files and renders some nice html overview with that |
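The diff-and-render step Spirit_ describes could look roughly like the sketch below. It uses Python's standard `difflib.HtmlDiff`; the function names and the "yesterday"/"today" labels are made up for illustration and are not taken from doit.sh or any of Spirit_'s scripts.

```python
import difflib


def has_changed(old_text: str, new_text: str) -> bool:
    """True when two robots.txt snapshots differ line by line."""
    return old_text.splitlines() != new_text.splitlines()


def render_robots_diff(old_text: str, new_text: str,
                       old_label: str = "yesterday",
                       new_label: str = "today") -> str:
    """Return a standalone HTML page with a side-by-side diff of two snapshots."""
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        old_text.splitlines(),
        new_text.splitlines(),
        fromdesc=old_label,
        todesc=new_label,
        context=True,   # only show changed lines plus a little context
        numlines=2,
    )
```

An overview page would then just link to one such rendered diff per site where `has_changed` returns true.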
12:22
🔗
|
Spirit_ |
whoever is wgetting, stop that please |
12:22
🔗
|
Spirit_ |
it is utterly pointless |
12:22
🔗
|
Spirit_ |
:) |
12:23
🔗
|
db48x |
heh |
12:23
🔗
|
Spirit_ |
exclude the files dir, then it makes sense |
12:23
🔗
|
db48x |
why do you say that? |
12:24
🔗
|
Spirit_ |
files/ is 20000 files. they are inserted into the robots.db |
12:24
🔗
|
db48x |
robots.db is harder to snapshot |
12:25
🔗
|
Spirit_ |
snapshot? |
12:26
🔗
|
Spirit_ |
i'll add a daily 7z of the files, that seems like a good idea |
12:26
🔗
|
db48x |
my filesystem will record the changes as I mirror them |
12:26
🔗
|
db48x |
in this case I should just change the scripts to overwrite the files if they've changed |
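A minimal sketch of "overwrite the files if they've changed", assuming each robots.txt is stored as one file on disk. The helper name and the path layout in the usage note are hypothetical. Leaving unchanged files alone means a snapshotting filesystem or a version control system only records real changes.

```python
import hashlib
from pathlib import Path


def write_if_changed(path: Path, new_bytes: bytes) -> bool:
    """Write new_bytes to path only when the content differs.

    Returns True if the file was created or replaced, False if it was
    already up to date and left untouched.
    """
    if path.exists():
        old_digest = hashlib.sha256(path.read_bytes()).digest()
        if old_digest == hashlib.sha256(new_bytes).digest():
            return False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary sibling first so a crash never leaves a half-written file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(new_bytes)
    tmp.replace(path)
    return True
```

For example, `write_if_changed(Path("files/example.com.txt"), body)` would be called once per downloaded robots.txt.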
12:29
🔗
|
db48x |
are the scripts in version control? |
12:29
🔗
|
Spirit_ |
nope |
12:30
🔗
|
db48x |
I recommend git or mercurial |
12:30
🔗
|
Spirit_ |
but that sounds like a good idea |
12:30
🔗
|
db48x |
we've got an Archive Team group on github that you could join |
12:31
🔗
|
Spirit_ |
http://91.121.208.153:32198/files_20110702.7z |
12:32
🔗
|
Spirit_ |
http://91.121.208.153:32198/files_20110703.7z |
12:33
🔗
|
Spirit_ |
i'll see about git |
12:33
🔗
|
Spirit_ |
gotta go now |
12:33
🔗
|
db48x |
Spirit_: see you around :) |
12:33
🔗
|
Spirit_ |
:) |
12:43
🔗
|
SketchCow |
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output |
12:43
🔗
|
SketchCow |
Glorious!! |
12:46
🔗
|
db48x |
SketchCow: yea, it's pretty cool |
12:47
🔗
|
db48x |
SketchCow: the resulting warc file doesn't validate with the latest warc-tools though. it complains about the version number |
12:52
🔗
|
db48x |
is anyone archiving digitalpreservation.org? |
13:05
🔗
|
Ymgve |
what about archiving archive.org? |
13:05
🔗
|
Ymgve |
we should get a wayback machine scraper going |
13:05
🔗
|
db48x |
:) |
13:06
🔗
|
db48x |
all those eggs in one basket... |
13:06
🔗
|
Ymgve |
there should be a button |
13:06
🔗
|
Ymgve |
"download the internet" |
13:06
🔗
|
db48x |
mirroring digitalpreservation.org now |
13:07
🔗
|
db48x |
Ymgve: insert disc 1... |
13:08
🔗
|
db48x |
alard: hey |
13:28
🔗
|
alard |
db48x: Hey. (My internet connection is a bit unreliable at the moment. Perhaps I make too many connections.) |
13:29
🔗
|
db48x |
heh |
13:29
🔗
|
db48x |
alard: wget-warc is pretty cool |
13:29
🔗
|
alard |
Thanks. |
13:30
🔗
|
db48x |
I notice that it's using version 0.18 and the latest version of the tools want version 1.0 |
13:30
🔗
|
db48x |
any differences of note? |
13:30
🔗
|
alard |
I don't think so. |
13:31
🔗
|
alard |
0.18 is the latest draft version of the specification, I believe. For version 1.0 you have to pay. |
13:31
🔗
|
alard |
In the version I have here I just changed the version number to 1.0 |
13:31
🔗
|
db48x |
yea, I suspected as much |
13:31
🔗
|
db48x |
heh |
13:32
🔗
|
alard |
I am also not sure about the warc-tools library. The gzipped files it produces are a bit strange. |
13:32
🔗
|
db48x |
strange in what way? |
13:32
🔗
|
alard |
Every warc record should end with a few newlines, according to the spec, and you are allowed to gzip records. |
13:33
🔗
|
alard |
The Heritrix warc writer gzips records including the newlines at the end. |
13:33
🔗
|
alard |
The warc-tools library doesn't: it gzips the record and then adds non-gzipped newlines. |
13:33
🔗
|
db48x |
ah, interesting |
13:33
🔗
|
alard |
So I'm not sure if that's allowed. |
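The two layouts alard compares can be reproduced with Python's standard `gzip` module. This is only a toy: the record below omits fields a real WARC record needs, and the function names are invented for the sketch. It does not call Heritrix or warc-tools; it only mimics the byte layouts described above.

```python
import gzip

RECORD_END = b"\r\n\r\n"  # the newlines that close each record


def toy_record(payload: bytes) -> bytes:
    """A deliberately incomplete WARC-like record, for illustration only."""
    header = (b"WARC/1.0\r\nWARC-Type: resource\r\n"
              b"Content-Length: %d\r\n\r\n" % len(payload))
    return header + payload


def layout_newlines_inside(records) -> bytes:
    """One gzip member per record, trailing newlines compressed with the record."""
    return b"".join(gzip.compress(r + RECORD_END) for r in records)


def layout_newlines_outside(records) -> bytes:
    """One gzip member per record, followed by raw, uncompressed newlines."""
    return b"".join(gzip.compress(r) + RECORD_END for r in records)


def readable_as_gzip_members(data: bytes) -> bool:
    """True if the data parses as a plain sequence of concatenated gzip members."""
    try:
        gzip.decompress(data)
    except (OSError, EOFError):
        return False
    return True
```

With the first layout the file is an ordinary multi-member gzip stream. With the second, the raw newlines sit between members, so a reader that expects a gzip header right after each member may reject the file.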
13:38
🔗
|
db48x |
how much is the spec? we can take a collection if needed. |
13:38
🔗
|
alard |
http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717 |
13:39
🔗
|
alard |
118 Swiss francs. |
13:39
🔗
|
alard |
But I think the 0.18 final draft is pretty similar to the final version. |
13:39
🔗
|
db48x |
so about a 140 in real money |
13:40
🔗
|
alard |
Or 96, depending on your definition of 'real money'. :) |
13:40
🔗
|
alard |
According to the Heritrix source code, they also used the 0.18 draft and |
13:40
🔗
|
alard |
they just output WARC/1.0 |
13:40
🔗
|
db48x |
that's exactly the sort of issue that they will have worked out in the standardization process |
13:41
🔗
|
db48x |
heh |
13:42
🔗
|
db48x |
> debug point: caller<lib private wfile.c:WFile_storeRecordUncompressed:1998>"couldn't add record to the warc file, maximum size reached" |
13:42
🔗
|
db48x |
getting a lot of those |
13:42
🔗
|
alard |
I'd like to put up the wget-warc code somewhere. But wget uses bazaar, which more or less means launchpad, and from what I've seen of that I don't like that. |
13:42
🔗
|
alard |
Yeah, that's an unsolved problem. |
13:43
🔗
|
db48x |
maximum size of the warc file is 1gb :( |
13:43
🔗
|
alard |
The warc-tools let you set a maximum file size. The idea is that you open a new file after that, but I didn't do that yet. |
13:43
🔗
|
db48x |
that seems silly |
13:44
🔗
|
alard |
Well, it keeps things manageable. |
13:44
🔗
|
db48x |
just let the filesystem worry about where it will put the file |
13:45
🔗
|
db48x |
--warc-max-size |
13:45
🔗
|
db48x |
which defaults to 0, which is interpreted as unlimited |
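The "open a new file once the maximum size is reached" idea, with 0 meaning unlimited as db48x describes, could be sketched like this. The class and the file naming scheme are hypothetical; this is not how warc-tools or wget-warc implement it.

```python
from pathlib import Path


class RollingWriter:
    """Append records to numbered files, starting a new one at max_size bytes."""

    def __init__(self, directory, prefix: str, max_size: int = 0):
        self.directory = Path(directory)
        self.prefix = prefix
        self.max_size = max_size  # 0 is interpreted as "no limit"
        self.index = 0
        self.written = 0
        self.handle = None

    def _open_next(self):
        if self.handle is not None:
            self.handle.close()
        name = f"{self.prefix}-{self.index:05d}.warc"
        self.handle = open(self.directory / name, "wb")
        self.index += 1
        self.written = 0

    def write_record(self, record: bytes):
        # Roll over only when a limit is set and the current file already
        # holds data, so an oversized record still gets a file of its own.
        too_big = (self.max_size and self.written
                   and self.written + len(record) > self.max_size)
        if self.handle is None or too_big:
            self._open_next()
        self.handle.write(record)
        self.written += len(record)

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None
```

Records are never split across files here, which keeps each output file readable on its own.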
13:46
🔗
|
db48x |
anyway, just because wget uses bzr doesn't mean you have to do anything with launchpad |
13:46
🔗
|
db48x |
you can export the patches into a nice diff-with-metadata |
13:47
🔗
|
alard |
I'm going to add my code to the github repository. |
13:47
🔗
|
db48x |
bzr send -o exported.patch |
13:47
🔗
|
db48x |
won't that make it more difficult to get the patch accepted upstream? |
13:49
🔗
|
alard |
Maybe, but it keeps things simpler until then. And it should be possible to combine all the little patches into one big patch and send that upstream. |
13:49
🔗
|
db48x |
I hate doing that |
13:49
🔗
|
alard |
(And I'm not even sure if this should be added to wget, or just stay a separate version.) |
13:49
🔗
|
db48x |
I mean, yes, that makes it easier to review, so attach that to the bug |
13:50
🔗
|
db48x |
but actually erasing the commit history and replacing it with a sanitized version... |
13:50
🔗
|
db48x |
yea, it should definitely be pushed upstream |
13:51
🔗
|
db48x |
one reason I don't like git is that it makes it hard to avoid rebasing in some situations |
13:51
🔗
|
db48x |
like when sending patches via email |
13:51
🔗
|
db48x |
the recipient has to be uber-careful or he'll rebase your patch for you |
13:52
🔗
|
alard |
Ah, I have no experience with that. I mostly work with my own repositories. |
13:53
🔗
|
db48x |
btw, could you push to your github repository? I'd like to take a look :) |
13:59
🔗
|
alard |
Well, the code is in the tar.gz, so you could have a look at that. But I'll have a look. |
14:04
🔗
|
db48x |
yea, but there's no version control |
14:04
🔗
|
alard |
No, that's in my local bzr. |
14:04
🔗
|
db48x |
I guess I could recursively diff against a clean copy |
14:09
🔗
|
db48x |
or you could use bzr send and pastebin/email the result |
14:09
🔗
|
db48x |
the idea is to see a diff, not the whole source :) |
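Recursively diffing the tarball against a clean copy, as db48x suggests, can be sketched with the standard library alone. The function names are made up, and the sketch assumes text files and two already-unpacked directories.

```python
import difflib
from pathlib import Path


def _read_lines(path: Path):
    """Lines of a text file, or an empty list if the file is missing."""
    if not path.is_file():
        return []
    return path.read_text(encoding="utf-8", errors="replace").splitlines(keepends=True)


def tree_diff(clean_dir, modified_dir):
    """Yield unified-diff lines for every file that differs between two trees."""
    clean, modified = Path(clean_dir), Path(modified_dir)
    relative = sorted(
        {p.relative_to(clean) for p in clean.rglob("*") if p.is_file()}
        | {p.relative_to(modified) for p in modified.rglob("*") if p.is_file()}
    )
    for rel in relative:
        old = _read_lines(clean / rel)
        new = _read_lines(modified / rel)
        if old != new:
            yield from difflib.unified_diff(
                old, new,
                fromfile=f"a/{rel.as_posix()}",
                tofile=f"b/{rel.as_posix()}",
            )
```

Unlike the output of `bzr send`, a diff produced this way carries no commit history or metadata.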
14:10
🔗
|
alard |
I'm constructing the git version now. |
14:10
🔗
|
db48x |
cool |
14:21
🔗
|
db48x |
digitalpreservation.gov is 2 GB |
14:21
🔗
|
alard |
https://github.com/alard/wget-warc/commit/cbbd701d784ebe6253a5a2b7d6abe5bd3c64670b |
14:21
🔗
|
db48x |
cool |
14:22
🔗
|
alard |
It is quite a hackish modification, not very neat. But then, the wget source code wasn't very clean either. |
14:24
🔗
|
db48x |
yes, there are some rough edges apparent |
14:26
🔗
|
alard |
And I don't normally program in C, so there may be all kinds of memory leaks and other dangerous bugs. |
14:26
🔗
|
db48x |
answers my question about ftp though :) |
14:26
🔗
|
alard |
It doesn't. |
14:26
🔗
|
db48x |
yea |
14:26
🔗
|
db48x |
as for memory, I saw stack allocations but no heap allocations |
14:27
🔗
|
db48x |
so you didn't leak anything at least |
14:27
🔗
|
alard |
That's a relief. I tried to keep clear of defining too many variables myself. |
14:28
🔗
|
db48x |
oh, unless bless() is a heap allocation |
14:28
🔗
|
alard |
I think it is. If you call bless( ), you also have to call dispose( ). |
14:28
🔗
|
db48x |
yea |
14:28
🔗
|
db48x |
weird name for it |
14:30
🔗
|
alard |
It's also strange that the warc library has a public and a private section, the idea being that you only use the public part, but you can't actually compile anything without also referring to the private bits. |
14:31
🔗
|
alard |
But I'll give you access to the repository, then you can fix things if you want. |
14:31
🔗
|
db48x |
this is a silly line: |
14:31
🔗
|
db48x |
WRecord_setContentType (responseWRecord, ((warc_u8_t *) "application/http;msgtype=response"), w_strlen(((warc_u8_t *) "application/http;msgtype=response"))); |
14:32
🔗
|
db48x |
although possibly the optimizer is smart enough to figure out the length at compile time, instead of having two string literals in memory |
18:19
🔗
|
Spirit_ |
warc would be awesome to have in browsers instead of the rather common MHTML, huh? |
18:19
🔗
|
marceloan |
Yes... |
18:23
🔗
|
Spirit_ |
http://en.wikipedia.org/wiki/KDE_WAR_(file_format) eek |
18:30
🔗
|
Spirit_ |
random idea: we could collect exclude rules for mirroring common website framework (forums, blogs) systems |
18:30
🔗
|
Spirit_ |
some forums have infinite link circles, etc |
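A shared collection of exclude rules might be nothing more than named lists of URL patterns. The rule sets below are illustrative guesses at the kind of links that cause loops (session IDs, re-sorted listings, reply forms). They have not been checked against any particular forum or blog software.

```python
import re

# Hypothetical rule sets; the names and patterns are assumptions for this sketch.
EXCLUDE_RULES = {
    "generic-forum": [
        r"[?&](sid|PHPSESSID)=[0-9a-fA-F]+",  # session IDs make every link look new
        r"[?&]sort(by|order)?=",              # re-sorted views of the same listing
        r"[?&]action=(reply|quote|print)",    # forms and printer views
    ],
    "generic-blog": [
        r"[?&]replytocom=\d+",                # one URL per comment reply link
        r"/calendar/\d{4}/\d{2}/",            # calendars that go on forever
    ],
}


def compile_rules(rule_sets, names):
    """Compile the patterns of the selected rule sets."""
    return [re.compile(pattern) for name in names for pattern in rule_sets[name]]


def should_fetch(url: str, compiled) -> bool:
    """True unless the URL matches any exclude pattern."""
    return not any(rx.search(url) for rx in compiled)
```

A mirroring script would call `should_fetch` on every discovered link before queueing it.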
18:39
🔗
|
Coderjoe |
db48x: depends on the card. there are a lot of cards that claim raid features, but depend on the driver and main CPU to do most of the work |
19:01
🔗
|
Spirit_ |
so how do i get into the github group? |
19:01
🔗
|
Spirit_ |
i am https://github.com/SpiritQuaddicted |
19:12
🔗
|
Spirit_ |
haha, i just deleted my files when trying to create a repository |
19:12
🔗
|
Spirit_ |
yay for backups:) |
19:15
🔗
|
Spirit_ |
and now that stupid ssh key hackery again |
19:15
🔗
|
Spirit_ |
i hate their tutorial on that |
19:15
🔗
|
Spirit_ |
"just remove the old stuff", this cant be safe |
19:16
🔗
|
Spirit_ |
wait, my account is still authenticated |
19:17
🔗
|
Spirit_ |
ok, now a repo. do i have to create it on the github site, clone and then commit? or how does my local git know what to d |
19:17
🔗
|
Spirit_ |
do |
19:18
🔗
|
Spirit_ |
i need a good name for a robots.txt downloader and some day in the future DIFFer |
19:19
🔗
|
Spirit_ |
TheDroidYouAreLookingFor |
19:20
🔗
|
Spirit_ |
robots-robber |
19:22
🔗
|
Spirit_ |
radical robots |
19:23
🔗
|
Spirit_ |
robotic ramifications |
19:23
🔗
|
Spirit_ |
robot rush |
19:24
🔗
|
Spirit_ |
robots replace |
19:24
🔗
|
Spirit_ |
err |
19:24
🔗
|
Spirit_ |
robots relapse |
19:24
🔗
|
Spirit_ |
relics |
19:25
🔗
|
Spirit_ |
robot rollback |
19:25
🔗
|
Spirit_ |
aaaah, this will go on forever |
19:26
🔗
|
Spirit_ |
robot rowels |
19:27
🔗
|
Spirit_ |
enough |
19:28
🔗
|
Spirit_ |
tied between |
19:28
🔗
|
Spirit_ |
"robotic ramifications", "robots relapse", "robot rollback" |
19:28
🔗
|
Spirit_ |
(i hate the linux clipboard) |
19:29
🔗
|
Spirit_ |
robots-relapse it is |
19:29
🔗
|
Spirit_ |
nice double meaning too |
19:30
🔗
|
Spirit_ |
https://github.com/SpiritQuaddicted/robots-relapse |
19:30
🔗
|
Spirit_ |
\o/ |
19:30
🔗
|
Spirit_ |
interesting, i had another ancient github profile |
19:31
🔗
|
Spirit_ |
crap |
19:32
🔗
|
Spirit_ |
sometimes i hate the internet |
19:35
🔗
|
Spirit_ |
better now |
19:47
🔗
|
Spirit_ |
i am pretty sure it will not work as it is online right now |
19:47
🔗
|
Spirit_ |
but enough for today |