Time |
Nickname |
Message |
00:00
🔗
|
FalconK |
but it accounted for less than 1% of cost |
00:03
🔗
|
Odd0002 |
I think it helps detect the character set? |
00:05
🔗
|
FalconK |
yeah but why |
00:06
🔗
|
FalconK |
unless it's metadata needed for the warc |
00:06
🔗
|
Odd0002 |
so it can parse the site? |
00:06
🔗
|
FalconK |
display of the text in the document is charset-dependent but I'm pretty sure the parsing of the HTML and hrefs is not |
00:07
🔗
|
Odd0002 |
well it has to decode the text when parsing into a more machine-usable format right? Convert it to a python string/unicode string? |
00:08
🔗
|
Odd0002 |
that's my guess |
00:10
🔗
|
Odd0002 |
as I recently discovered, URLs can contain emoji characters now |
00:10
🔗
|
FalconK |
idk. the best way would be to read the code. |
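A minimal sketch of the point under discussion, not taken from the crawler's actual code: the response body arrives as bytes and has to be decoded with some charset before the hrefs can be handled as Python strings. The names `HrefCollector` and `extract_hrefs`, and the latin-1 fallback, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href attribute values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(body: bytes, declared_charset: str = "utf-8") -> list:
    # Decode the raw bytes first. Fall back to latin-1 (which accepts any
    # byte sequence) if the declared charset is wrong or unknown.
    try:
        text = body.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        text = body.decode("latin-1")
    parser = HrefCollector()
    parser.feed(text)
    parser.close()
    return parser.hrefs
```

For ASCII-only links the result is the same under either charset, which matches FalconK's intuition. For links with non-ASCII characters, decoding the same bytes with the wrong charset yields a different href string, which may be why a crawler would bother detecting it.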
00:33
🔗
|
|
godane has joined #archiveteam-bs |
00:33
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
00:33
🔗
|
|
Stilett0 has joined #archiveteam-bs |
01:01
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
01:02
🔗
|
godane |
so this happened: http://kotaku.com/guy-finds-starcraft-source-code-and-returns-it-to-blizz-1794897125 |
01:03
🔗
|
godane |
wish we got an ISO image of that disk for the archives |
01:30
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
01:49
🔗
|
godane |
SketchCow: did you get the zines magazine from here: https://diz.srve.io/zines/ |
01:50
🔗
|
godane |
if not then you can grab that |
02:07
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
02:19
🔗
|
|
pizzaiolo has quit IRC (pizzaiolo) |
02:20
🔗
|
|
Stilett0 has joined #archiveteam-bs |
04:15
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:22
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:24
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
04:25
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
06:03
🔗
|
|
SpaffGarg has quit IRC (Read error: Operation timed out) |
06:05
🔗
|
|
SpaffGarg has joined #archiveteam-bs |
06:58
🔗
|
|
sun_rise has quit IRC (Read error: Connection reset by peer) |
07:00
🔗
|
|
GE has joined #archiveteam-bs |
07:20
🔗
|
|
bztoot has joined #archiveteam-bs |
07:21
🔗
|
|
t2t2 has quit IRC (Read error: Operation timed out) |
07:22
🔗
|
|
schbirid has joined #archiveteam-bs |
09:32
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
10:47
🔗
|
|
Honno_ has joined #archiveteam-bs |
10:52
🔗
|
|
Honno has quit IRC (Ping timeout: 370 seconds) |
11:11
🔗
|
|
GE has joined #archiveteam-bs |
11:33
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
11:34
🔗
|
|
pizzaiolo has left |
11:34
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:52
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
13:22
🔗
|
|
Frogging has quit IRC (Read error: Operation timed out) |
13:24
🔗
|
|
Frogging has joined #archiveteam-bs |
13:45
🔗
|
|
bztoot has quit IRC (Read error: Operation timed out) |
13:48
🔗
|
|
t2t2 has joined #archiveteam-bs |
14:41
🔗
|
Zeryl |
howdy folks, curious, I have a warc to upload, is there any way to feed it to IA so that it is fully used (i.e. wayback machine)? |
14:42
🔗
|
jtn2 |
(this is gna.org mailing list archives) |
14:42
🔗
|
Zeryl |
yes, yes it is |
15:21
🔗
|
Zeryl |
Ok, I used: https://gist.github.com/Asparagirl/6206247 -- It's done uploading, but no idea where it is now :/ |
15:40
🔗
|
xmc |
it should show up at https://archive.org/details/@yourusername |
15:48
🔗
|
Zeryl |
yep, found it, just showed up: https://archive.org/details/mail.gna.org_2017-05-04 |
15:48
🔗
|
Zeryl |
So i need to get that moved over to the AT collection at some point, who do I ask to do that? |
15:58
🔗
|
xmc |
paging SketchCow |
16:01
🔗
|
Coderjo |
interesting archival problem: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-archivists-cant-outpace-obsolescence |
16:06
🔗
|
Zeryl |
yea, i'm curious why they don't use something other than tape, but I guess really there isn't something better, at the scale they are talking |
16:07
🔗
|
Coderjo |
I'm somewhat annoyed that the tape drive manufacturers can't just maintain more than 2 generations of backward compatibility within the same tape system |
16:11
🔗
|
Coderjo |
although even if they did have that backward compatibility, there is the problem of bit rot on such dense media |
16:17
🔗
|
Zeryl |
yep, and you can't "innovate" if you have to keep the backwards compatibility (or something yadda yadda) |
16:39
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
16:39
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
16:39
🔗
|
|
JAA has joined #archiveteam-bs |
16:47
🔗
|
|
pizzaiolo has quit IRC (Read error: Connection reset by peer) |
16:47
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
16:55
🔗
|
JAA |
It's an interesting article for sure. But for that much money, couldn't you just get yourself a contract with a tape drive manufacturer where they'd supply drives capable of reading old tape generations for the next X years? |
16:56
🔗
|
Zeryl |
That seems like the right way to go, yea. |
16:56
🔗
|
Zeryl |
Or why not move to disk, where we SHOULD be good for the next 30+ years. |
16:56
🔗
|
Zeryl |
Seems like a non-issue, if they move from tape. |
16:56
🔗
|
JAA |
More expensive for that amount of data, I assume. |
16:57
🔗
|
Zeryl |
I'm certain |
16:57
🔗
|
Zeryl |
And I assume not much in the way of de-dupe available |
16:58
🔗
|
Zeryl |
they certainly are not unique in this though, hospitals do the same |
16:58
🔗
|
Zeryl |
they certainly are not unique in this though, hospitals do the same |
16:58
🔗
|
Zeryl |
\ |
16:58
🔗
|
Zeryl |
sorry :/ cat hit the keyboard |
16:59
🔗
|
* |
JAA pets Zeryl's cat. |
16:59
🔗
|
Zeryl |
but if someone like IA can operate, I can't see why the movie studios, who have a significant amount more money, can't do similar |
17:00
🔗
|
JAA |
Indeed. And they could probably do it better, too. One thing that really bothers me about IA is that it's all in a single building. If anything happens to that church... |
17:01
🔗
|
Zeryl |
and we're not even talking data that has a real SLA on it. We're talking data that if it takes 20 minutes to bring online, or 2 hours, you're not worried. |
17:04
🔗
|
Zeryl |
but, this is from the guy with a paltry 12 TB in house |
17:15
🔗
|
xmc |
studios have more money but archival is a cost center for them, it's not their fundamental purpose |
17:17
🔗
|
Zeryl |
this is true. just another thing to let them whine about. And how they "lose" money on EVERY movie! |
17:37
🔗
|
DFJustin |
JAA it isn't all in a single building, everything is duplicated in a warehouse across town |
17:37
🔗
|
DFJustin |
and now they're setting up a third copy in canada |
17:38
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
17:40
🔗
|
JAA |
DFJustin: Oh, never heard about that warehouse before. As far as I understand it, the Canada copy will only be partial though, right? |
17:43
🔗
|
|
dashcloud has joined #archiveteam-bs |
17:44
🔗
|
joepie91 |
Coderjo: Zeryl: worth noting that that problem is why IA doesn't use tape, afaik |
17:44
🔗
|
joepie91 |
:p |
17:46
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
19:11
🔗
|
|
Aranje has joined #archiveteam-bs |
19:12
🔗
|
|
fie has quit IRC (Ping timeout: 245 seconds) |
19:17
🔗
|
arkiver |
yipdw: I have updated the script to only load records.json once |
19:17
🔗
|
arkiver |
https://github.com/ArchiveTeam/ftp-items/blob/master/tools/deduplicate.py |
19:17
🔗
|
arkiver |
Unfortunately it doesn't really have a clean way to shut it down |
19:17
🔗
|
arkiver |
do you think you can make a copy of the json, test to see if it's good json, and then shut down and start the new script? |
19:17
🔗
|
arkiver |
also moving the copy back as the original |
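The copy, check, and swap-back steps arkiver describes could look roughly like this. It is a sketch under assumptions, not part of deduplicate.py: the function names and the backup path are hypothetical, and it assumes the running script is not writing to the file while the copy is taken.

```python
import json
import os
import shutil


def snapshot_if_valid(original: str, copy_path: str) -> bool:
    """Copy `original` to `copy_path` and report whether the copy parses as JSON."""
    shutil.copy2(original, copy_path)
    try:
        with open(copy_path, "r", encoding="utf-8") as handle:
            json.load(handle)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False
    return True


def restore_snapshot(copy_path: str, original: str) -> None:
    """Move the verified copy back into place as the original."""
    # os.replace overwrites the destination; it is atomic when both paths
    # are on the same filesystem.
    os.replace(copy_path, original)
```

The idea is to call `snapshot_if_valid("records.json", "records.json.bak")` first, stop the old script only if it returns True, then call `restore_snapshot` before starting the new one.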
19:20
🔗
|
|
GE has joined #archiveteam-bs |
19:25
🔗
|
|
fie has joined #archiveteam-bs |
19:26
🔗
|
yipdw |
arkiver: cool, yeah |
19:26
🔗
|
arkiver |
thanks! |
19:26
🔗
|
yipdw |
the JSON I have for gov-ftp is definitely not a good copy |
19:26
🔗
|
yipdw |
I can save it somewhere though |
19:26
🔗
|
arkiver |
the script was already stopped? |
19:36
🔗
|
Zeryl |
@yipdw, are you accepting new nodes for the archive bot now? |
19:36
🔗
|
yipdw |
no |
19:36
🔗
|
Zeryl |
ok |
19:36
🔗
|
yipdw |
the main reason is it's still a management hassle |
19:37
🔗
|
Zeryl |
understood, no worries, just figured i'd offer again :) |
19:42
🔗
|
yipdw |
yeah np |
19:45
🔗
|
|
Zeryl_ has joined #archiveteam-bs |
19:50
🔗
|
HCross2 |
Anyone know if I can change where grab-site is saving warcs, mid crawl? I'm mid way through a large (over 2tb) crawl and one HDD is filling and so need to divert to another |
19:50
🔗
|
|
Zeryl has quit IRC (Read error: Operation timed out) |
19:55
🔗
|
Selavi |
HCross2, might be able to slap a symlink on the parent dir? |
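Selavi's symlink idea, sketched in Python. This is only an illustration: the function name and paths are hypothetical, it assumes nothing has files open in the directory while it is moved, and whether grab-site behaves well with a relocated directory mid-crawl is untested here.

```python
import os
import shutil


def relocate_with_symlink(old_dir: str, new_dir: str) -> None:
    """Move old_dir to new_dir and leave a symlink at the old location."""
    if os.path.exists(new_dir):
        # shutil.move would nest old_dir inside an existing directory,
        # so refuse rather than guess what the caller wanted.
        raise FileExistsError(new_dir)
    shutil.move(old_dir, new_dir)
    # Anything that still writes to old_dir now lands on the new disk.
    os.symlink(new_dir, old_dir, target_is_directory=True)
```

A call such as `relocate_with_symlink("/mnt/disk1/crawl", "/mnt/disk2/crawl")` moves the existing WARCs across, so it frees space on the full disk as well as redirecting new writes.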
19:59
🔗
|
|
Zeryl__ has joined #archiveteam-bs |
20:07
🔗
|
|
Zeryl_ has quit IRC (Read error: Operation timed out) |
20:21
🔗
|
Coderjo |
joepie91: well, that and tape isn't suited for random access |
20:43
🔗
|
|
Zeryl__ is now known as Zeryl |
21:19
🔗
|
joepie91 |
Coderjo: right, I was more referring to the commonly-named idea of "why don't you store the darked items on tape so it's cheaper to store them" |
21:19
🔗
|
joepie91 |
since those don't require random access |
21:19
🔗
|
joepie91 |
(generally) |
21:43
🔗
|
Coderjo |
oh |
21:44
🔗
|
Coderjo |
yeah, tape is good for short-term, regularly cycling backups. not for long term archiving. (aside from the capacity issue) |
22:00
🔗
|
|
GE has quit IRC (Remote host closed the connection) |
22:04
🔗
|
godane |
looks like 1992-03-27 episode of Charlie Rose doesn't work at all: https://charlierose.com/episodes/21428?autoplay=true |
22:25
🔗
|
|
ndiddy has joined #archiveteam-bs |
22:26
🔗
|
bsmith093 |
i'm saving fictionpress the same way as ffnet, and it's going swimmingly! |
22:26
🔗
|
|
Swizzle has joined #archiveteam-bs |
22:26
🔗
|
bsmith093 |
evidently, most of the first million IDs are also gone, barely 150K stories so far. |
22:28
🔗
|
bsmith093 |
i'd be amazed if the whole dump ends up >20GB |
22:29
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:39
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |