| Time |
Nickname |
Message |
|
00:00
🔗
|
FalconK |
but it accounted for less than 1% of cost |
|
00:03
🔗
|
Odd0002 |
I think it helps detect the character set? |
|
00:05
🔗
|
FalconK |
yeah but why |
|
00:06
🔗
|
FalconK |
unless it's metadata needed for the warc |
|
00:06
🔗
|
Odd0002 |
so it can parse the site? |
|
00:06
🔗
|
FalconK |
display of the text in the document is charset-dependent but I'm pretty sure the parsing of the HTML and hrefs is not |
|
00:07
🔗
|
Odd0002 |
well it has to decode the text when parsing into a more machine-usable format right? Convert it to a python string/unicode string? |
|
00:08
🔗
|
Odd0002 |
that's my guess |
|
00:10
🔗
|
Odd0002 |
as I recently discovered, URLs can contain emoji characters now |
|
00:10
🔗
|
FalconK |
idk. the best way would be to read the code. |
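
The charset question above can be illustrated with a minimal sketch. It is not taken from any crawler's code; `HrefCollector` and `extract_hrefs` are hypothetical names, and only the Python standard library is assumed. It shows that a `str`-based HTML parser needs the bytes decoded first, and that a non-ASCII href (emoji included) comes out differently when the wrong charset is guessed.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class HrefCollector(HTMLParser):
    """Collect href attribute values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(body, charset="utf-8"):
    # html.parser works on str, so the raw bytes must be decoded first.
    # errors="replace" keeps parsing even if the guessed charset is wrong.
    text = body.decode(charset, errors="replace")
    parser = HrefCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs


# A page whose link contains an accented letter and an emoji.
page = '<a href="/caf\u00e9/\U0001F600">link</a>'.encode("utf-8")

right = extract_hrefs(page, "utf-8")    # correct charset
wrong = extract_hrefs(page, "latin-1")  # wrong guess: mojibake in the href

print(right == wrong)   # False: the extracted link depends on the charset
print(quote(right[0]))  # /caf%C3%A9/%F0%9F%98%80
```

A pure-ASCII href would come out identical under both decodings, which fits FalconK's point that markup parsing is mostly charset-independent; only the non-ASCII parts of a link are affected.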
|
00:33
🔗
|
|
godane has joined #archiveteam-bs |
|
00:33
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
|
00:33
🔗
|
|
Stilett0 has joined #archiveteam-bs |
|
01:01
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
|
01:02
🔗
|
godane |
so this happened: http://kotaku.com/guy-finds-starcraft-source-code-and-returns-it-to-blizz-1794897125 |
|
01:03
🔗
|
godane |
wish we got an ISO image of that disk for the archives |
|
01:30
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
01:49
🔗
|
godane |
SketchCow: did you get the zines magazine from here: https://diz.srve.io/zines/ |
|
01:50
🔗
|
godane |
if not then you can grab that |
|
02:07
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
02:19
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |
|
02:20
🔗
|
|
Stilett0 has joined #archiveteam-bs |
|
04:15
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
|
04:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:24
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
04:25
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
06:03
🔗
|
|
SpaffGarg has quit IRC (Read error: Operation timed out) |
|
06:05
🔗
|
|
SpaffGarg has joined #archiveteam-bs |
|
06:58
🔗
|
|
sun_rise has quit IRC (Read error: Connection reset by peer) |
|
07:00
🔗
|
|
GE has joined #archiveteam-bs |
|
07:20
🔗
|
|
bztoot has joined #archiveteam-bs |
|
07:21
🔗
|
|
t2t2 has quit IRC (Read error: Operation timed out) |
|
07:22
🔗
|
|
schbirid has joined #archiveteam-bs |
|
09:32
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
|
10:47
🔗
|
|
Honno_ has joined #archiveteam-bs |
|
10:52
🔗
|
|
Honno has quit IRC (Ping timeout: 370 seconds) |
|
11:11
🔗
|
|
GE has joined #archiveteam-bs |
|
11:33
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
11:34
🔗
|
|
pizzaiolo has left |
|
11:34
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
12:52
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
13:22
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
|
13:24
🔗
|
|
Frogging has joined #archiveteam-bs |
|
13:45
🔗
|
|
bztoot has quit IRC (Read error: Operation timed out) |
|
13:48
🔗
|
|
t2t2 has joined #archiveteam-bs |
|
14:41
🔗
|
Zeryl |
howdy folks, curious, I have a warc to upload, is there any way to feed it to IA so that it is fully used (i.e. wayback machine)? |
|
14:42
🔗
|
jtn2 |
(this is gna.org mailing list archives) |
|
14:42
🔗
|
Zeryl |
yes, yes it is |
|
15:21
🔗
|
Zeryl |
Ok, I used: https://gist.github.com/Asparagirl/6206247 -- It's done uploading, but no idea where it is now :/ |
|
15:40
🔗
|
xmc |
it should show up at https://archive.org/details/@yourusername |
|
15:48
🔗
|
Zeryl |
yep, found it, just showed up: https://archive.org/details/mail.gna.org_2017-05-04 |
|
15:48
🔗
|
Zeryl |
So i need to get that moved over to the AT collection at some point, who do I ask to do that? |
|
15:58
🔗
|
xmc |
paging SketchCow |
|
16:01
🔗
|
Coderjo |
interesting archival problem: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-archivists-cant-outpace-obsolescence |
|
16:06
🔗
|
Zeryl |
yea, i'm curious why they don't use something other than tape, but I guess really there isn't something better, at the scale they are talking |
|
16:07
🔗
|
Coderjo |
I'm somewhat annoyed that the tape drive manufacturers can't just maintain more than 2 generations of backward compatibility within the same tape system |
|
16:11
🔗
|
Coderjo |
although even if they did have that backward compatibility, there is the problem of bit rot on such a dense medium |
|
16:17
🔗
|
Zeryl |
yep, and you can't "innovate" if you have to keep the backwards compatibility (or something yadda yadda) |
|
16:39
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
|
16:39
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
16:39
🔗
|
|
JAA has joined #archiveteam-bs |
|
16:47
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
|
16:47
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
16:55
🔗
|
JAA |
It's an interesting article for sure. But for that much money, couldn't you just get yourself a contract with a tape drive manufacturer where they'd supply drives capable of reading old tape generations for the next X years? |
|
16:56
🔗
|
Zeryl |
That seems like the right way to go, yea. |
|
16:56
🔗
|
Zeryl |
Or why not move to disk, where we SHOULD be good for the next 30+ years. |
|
16:56
🔗
|
Zeryl |
Seems like a non-issue, if they move from tape. |
|
16:56
🔗
|
JAA |
More expensive for that amount of data, I assume. |
|
16:57
🔗
|
Zeryl |
I'm certain |
|
16:57
🔗
|
Zeryl |
And I assume not much in the way of de-dupe available |
|
16:58
🔗
|
Zeryl |
they certainly are not unique in this though, hospitals do the same |
|
16:58
🔗
|
Zeryl |
they certainly are not unique in this though, hospitals do the same |
|
16:58
🔗
|
Zeryl |
\ |
|
16:58
🔗
|
Zeryl |
sorry :/ cat hit the keyboard |
|
16:59
🔗
|
* |
JAA pets Zeryl's cat. |
|
16:59
🔗
|
Zeryl |
but if someone like IA can operate, I can't see why the movie studios, who have a significant amount more money, can't do similar |
|
17:00
🔗
|
JAA |
Indeed. And they could probably do it better, too. One thing that really bothers me about IA is that it's all in a single building. If anything happens to that church... |
|
17:01
🔗
|
Zeryl |
and we're not even talking data that has a real SLA on it. We're talking data that if it takes 20 minutes to bring online, or 2 hours, you're not worried. |
|
17:04
🔗
|
Zeryl |
but, this is from the guy with a paltry 12 TB in house |
|
17:15
🔗
|
xmc |
studios have more money but archival is a cost center for them, it's not their fundamental purpose |
|
17:17
🔗
|
Zeryl |
this is true. just another thing to let them whine about. And how they "lose" money on EVERY movie! |
|
17:37
🔗
|
DFJustin |
JAA it isn't all in a single building, everything is duplicated in a warehouse across town |
|
17:37
🔗
|
DFJustin |
and now they're setting up a third copy in canada |
|
17:38
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
17:40
🔗
|
JAA |
DFJustin: Oh, never heard about that warehouse before. As far as I understand it, the Canada copy will only be partial though, right? |
|
17:43
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
17:44
🔗
|
joepie91 |
Coderjo: Zeryl: worth noting that that problem is why IA doesn't use tape, afaik |
|
17:44
🔗
|
joepie91 |
:p |
|
17:46
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
|
19:11
🔗
|
|
Aranje has joined #archiveteam-bs |
|
19:12
🔗
|
|
fie has quit IRC (Ping timeout: 245 seconds) |
|
19:17
🔗
|
arkiver |
yipdw: I have updated the script to only load records.json once |
|
19:17
🔗
|
arkiver |
https://github.com/ArchiveTeam/ftp-items/blob/master/tools/deduplicate.py |
|
19:17
🔗
|
arkiver |
Unfortunately it doesn't really have a clean way to shut it down |
|
19:17
🔗
|
arkiver |
do you think you can make a copy of the json, test to see if it's good json, and then shut down and start the new script? |
|
19:17
🔗
|
arkiver |
also moving the copy back as the original |
|
19:20
🔗
|
|
GE has joined #archiveteam-bs |
|
19:25
🔗
|
|
fie has joined #archiveteam-bs |
|
19:26
🔗
|
yipdw |
arkiver: cool, yeah |
|
19:26
🔗
|
arkiver |
thanks! |
|
19:26
🔗
|
yipdw |
the JSON I have for gov-ftp is definitely not a good copy |
|
19:26
🔗
|
yipdw |
I can save it somewhere though |
|
19:26
🔗
|
arkiver |
the script was already stopped? |
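
The copy, validate, swap sequence arkiver describes could be sketched roughly as follows. This is an illustration only, not part of `deduplicate.py`; `snapshot_if_valid` and `restore` are hypothetical helpers, and the file names in the demo are placeholders inside a temporary directory.

```python
import json
import os
import shutil
import tempfile


def snapshot_if_valid(src, dst):
    """Copy src to dst and return True only if the copy parses as JSON."""
    shutil.copy2(src, dst)
    try:
        with open(dst, "r", encoding="utf-8") as fh:
            json.load(fh)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
        return False
    return True


def restore(copy_path, original_path):
    """Move a known-good copy back into place as the original."""
    os.replace(copy_path, original_path)


with tempfile.TemporaryDirectory() as tmp:
    records = os.path.join(tmp, "records.json")
    backup = os.path.join(tmp, "records.json.bak")

    # A complete file: the snapshot validates.
    with open(records, "w", encoding="utf-8") as fh:
        fh.write('{"a": 1}')
    good = snapshot_if_valid(records, backup)

    # A truncated file, as if the writer was killed mid-dump.
    with open(records, "w", encoding="utf-8") as fh:
        fh.write('{"a": 1')
    bad = snapshot_if_valid(records, os.path.join(tmp, "bad.bak"))

    # Put the good copy back before starting the new script.
    restore(backup, records)
    with open(records, "r", encoding="utf-8") as fh:
        restored = json.load(fh)

print(good, bad, restored)  # True False {'a': 1}
```

Validating the copy rather than the live file avoids racing with a script that has no clean shutdown and may still be rewriting the original.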
|
19:36
🔗
|
Zeryl |
@yipdw, are you accepting new nodes for the archive bot now? |
|
19:36
🔗
|
yipdw |
no |
|
19:36
🔗
|
Zeryl |
ok |
|
19:36
🔗
|
yipdw |
the main reason is it's still a management hassle |
|
19:37
🔗
|
Zeryl |
understood, no worries, just figured i'd offer again :) |
|
19:42
🔗
|
yipdw |
yeah np |
|
19:45
🔗
|
|
Zeryl_ has joined #archiveteam-bs |
|
19:50
🔗
|
HCross2 |
Anyone know if I can change where grab-site is saving warcs, mid crawl? I'm midway through a large (over 2 TB) crawl and one HDD is filling, so I need to divert to another |
|
19:50
🔗
|
|
Zeryl has quit IRC (Read error: Operation timed out) |
|
19:55
🔗
|
Selavi |
HCross2, might be able to slap a symlink on the parent dir? |
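
Selavi's symlink idea might look something like the sketch below. The log does not confirm that grab-site tolerates this mid-crawl, so treat it as an assumption; `relocate_with_symlink` is a hypothetical helper and the paths are stand-ins created in a temporary directory.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(old_dir, new_dir):
    """Move old_dir to new_dir (which must not exist yet) and leave a
    symlink at the old path so later writes land on the new disk."""
    shutil.move(old_dir, new_dir)
    os.symlink(new_dir, old_dir, target_is_directory=True)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the full disk and the disk with free space.
    old_dir = os.path.join(tmp, "disk1", "crawl")
    new_dir = os.path.join(tmp, "disk2", "crawl")
    os.makedirs(old_dir)
    os.makedirs(os.path.dirname(new_dir))
    with open(os.path.join(old_dir, "a.warc.gz"), "wb") as fh:
        fh.write(b"existing data")

    relocate_with_symlink(old_dir, new_dir)

    # A new file written through the old path ends up on the new disk.
    with open(os.path.join(old_dir, "b.warc.gz"), "wb") as fh:
        fh.write(b"new data")

    is_link = os.path.islink(old_dir)
    moved = sorted(os.listdir(new_dir))

print(is_link, moved)  # True ['a.warc.gz', 'b.warc.gz']
```

Files that a running process already has open are not redirected by this, so the crawl would likely need to be paused or stopped while the directory is moved.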
|
19:59
🔗
|
|
Zeryl__ has joined #archiveteam-bs |
|
20:07
🔗
|
|
Zeryl_ has quit IRC (Read error: Operation timed out) |
|
20:21
🔗
|
Coderjo |
joepie91: well, that and tape isn't suited for random access |
|
20:43
🔗
|
|
Zeryl__ is now known as Zeryl |
|
21:19
🔗
|
joepie91 |
Coderjo: right, I was more referring to the commonly-named idea of "why don't you store the darked items on tape so it's cheaper to store them" |
|
21:19
🔗
|
joepie91 |
since those don't require random access |
|
21:19
🔗
|
joepie91 |
(generally) |
|
21:43
🔗
|
Coderjo |
oh |
|
21:44
🔗
|
Coderjo |
yeah, tape is good for short-term, regularly cycling backups. not for long term archiving. (aside from the capacity issue) |
|
22:00
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
|
22:04
🔗
|
godane |
looks like 1992-03-27 episode of Charlie Rose doesn't work at all: https://charlierose.com/episodes/21428?autoplay=true |
|
22:25
🔗
|
|
ndiddy has joined #archiveteam-bs |
|
22:26
🔗
|
bsmith093 |
i'm saving fictionpress the same way as ffnet, and it's going swimmingly! |
|
22:26
🔗
|
|
Swizzle has joined #archiveteam-bs |
|
22:26
🔗
|
bsmith093 |
evidently, most of the first million IDs are also gone, barely 150K stories so far. |
|
22:28
🔗
|
bsmith093 |
i'd be amazed if the whole dump ends up >20GB |
|
22:29
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
23:39
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |