#archiveteam-bs 2017-05-05,Fri


Who What When
FalconK but it accounted for less than 1% of cost [00:00]
Odd0002 I think it helps detect the character set? [00:03]
FalconK yeah but why
unless it's metadata needed for the warc
[00:05]
Odd0002 so it can parse the site? [00:06]
FalconK display of the text in the document is charset-dependent but I'm pretty sure the parsing of the HTML and hrefs is not [00:06]
Odd0002 well it has to decode the text when parsing into a more machine-usable format right? Convert it to a python string/unicode string?
that's my guess
as I recently discovered, URLs can contain emoji characters now
[00:07]
FalconK idk. the best way would be to read the code. [00:10]
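A minimal sketch of the point Odd0002 raises above, using only Python's standard library. It is not the crawler's actual code, which nobody in the channel had checked; `HrefCollector`, `extract_hrefs` and the sample page are hypothetical names made up for illustration. It shows that the same bytes decoded under two different charsets give different href strings, so a link extractor plausibly needs a charset guess before it can build URLs that contain non-ASCII characters.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(raw, charset):
    """Decode raw HTML bytes with the given charset, then collect hrefs."""
    parser = HrefCollector()
    parser.feed(raw.decode(charset, errors="replace"))
    parser.close()
    return parser.hrefs


# Hypothetical page: the link contains a non-ASCII character, stored as UTF-8.
page = '<a href="/caf\u00e9">menu</a>'.encode("utf-8")

as_utf8 = extract_hrefs(page, "utf-8")      # decoded with the right charset
as_latin1 = extract_hrefs(page, "latin-1")  # decoded with the wrong charset

# Percent-encode each result the way it would be sent in a request line.
request_path_utf8 = quote(as_utf8[0])
request_path_latin1 = quote(as_latin1[0])
```

The tag structure parses the same either way, which matches FalconK's intuition, but the two request paths differ, so the wrong guess would fetch a different URL.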
..... (idle for 23mn)
***godane has joined #archiveteam-bs
Stilett0 has quit IRC (Read error: Operation timed out)
Stilett0 has joined #archiveteam-bs
[00:33]
...... (idle for 28mn)
Stilett0 has quit IRC (Read error: Operation timed out) [01:01]
godane so this happened: http://kotaku.com/guy-finds-starcraft-source-code-and-returns-it-to-blizz-1794897125
wish we got an iso image of that disk for the archives
[01:02]
...... (idle for 27mn)
***BlueMaxim has quit IRC (Read error: Operation timed out) [01:30]
.... (idle for 19mn)
godane SketchCow: did you get the zines magazine from here: https://diz.srve.io/zines/
if not then you can grab that
[01:49]
.... (idle for 17mn)
***BlueMaxim has joined #archiveteam-bs [02:07]
pizzaiolo has quit IRC (pizzaiolo)
Stilett0 has joined #archiveteam-bs
[02:19]
........................ (idle for 1h55mn)
Sk1d has quit IRC (Ping timeout: 250 seconds) [04:15]
Sk1d has joined #archiveteam-bs
BlueMaxim has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
[04:22]
.................... (idle for 1h38mn)
SpaffGarg has quit IRC (Read error: Operation timed out)
SpaffGarg has joined #archiveteam-bs
[06:03]
........... (idle for 53mn)
sun_rise has quit IRC (Read error: Connection reset by peer)
GE has joined #archiveteam-bs
[06:58]
..... (idle for 20mn)
bztoot has joined #archiveteam-bs
t2t2 has quit IRC (Read error: Operation timed out)
schbirid has joined #archiveteam-bs
[07:20]
........................... (idle for 2h10mn)
GE has quit IRC (Remote host closed the connection) [09:32]
................ (idle for 1h15mn)
Honno_ has joined #archiveteam-bs [10:47]
Honno has quit IRC (Ping timeout: 370 seconds) [10:52]
.... (idle for 19mn)
GE has joined #archiveteam-bs [11:11]
..... (idle for 22mn)
pizzaiolo has joined #archiveteam-bs
pizzaiolo has left
pizzaiolo has joined #archiveteam-bs
[11:33]
................ (idle for 1h18mn)
BlueMaxim has quit IRC (Quit: Leaving) [12:52]
....... (idle for 30mn)
Frogging has quit IRC (Read error: Operation timed out)
Frogging has joined #archiveteam-bs
[13:22]
..... (idle for 21mn)
bztoot has quit IRC (Read error: Operation timed out)
t2t2 has joined #archiveteam-bs
[13:45]
........... (idle for 53mn)
Zeryl howdy folks, curious, I have a warc to upload, is there any way to feed it to IA so that it is fully used (i.e. wayback machine)? [14:41]
jtn2 (this is gna.org mailing list archives) [14:42]
Zeryl yes, yes it is [14:42]
........ (idle for 39mn)
Ok, I used: https://gist.github.com/Asparagirl/6206247 -- It's done uploading, but no idea where it is now :/ [15:21]
.... (idle for 19mn)
xmc it should show up at https://archive.org/details/@yourusername [15:40]
Zeryl yep, found it, just showed up: https://archive.org/details/mail.gna.org_2017-05-04
So I need to get that moved over to the AT collection at some point, who do I ask to do that?
[15:48]
xmc paging SketchCow [15:58]
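For readers wondering what such an upload might look like in code, here is a hedged sketch using the third-party `internetarchive` Python package. It is not the method from the gist linked above. The `mediatype` value of `web`, the title, the date and the file name are assumptions made for illustration; only the item identifier comes from the log. Whether the Wayback Machine then indexes the WARC depends on the item's collection, which is what Zeryl is asking about, and this sketch does not address that.

```python
def build_metadata(title, date):
    """Build item metadata for a WARC upload.

    Assumption: WARC items are usually given mediatype "web";
    the other fields are illustrative.
    """
    return {
        "mediatype": "web",
        "title": title,
        "date": date,
    }


def upload_warc(identifier, warc_path, metadata):
    """Upload one WARC file into an archive.org item.

    Imported lazily so the rest of this sketch runs without the package.
    Needs credentials set up beforehand, for example via `ia configure`.
    """
    from internetarchive import upload

    return upload(identifier, files=[warc_path], metadata=metadata)


# Identifier taken from the log; title, date and file name are hypothetical.
metadata = build_metadata("mail.gna.org mailing list archives", "2017-05-04")
# upload_warc("mail.gna.org_2017-05-04", "mail.gna.org.warc.gz", metadata)
```

The upload call is left commented out because it needs network access and an account.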
Coderjo interesting archival problem: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-archivists-cant-outpace-obsolescence [16:01]
Zeryl yea, i'm curious why they don't use something other than tape, but I guess really there isn't something better, at the scale they are talking [16:06]
Coderjo I'm somewhat annoyed that the tape drive manufacturers can't just maintain more than 2 generations of backward compatibility within the same tape system
although even if they did have that backward compatibility, there is the problem of bit rot on such dense media
[16:07]
Zeryl yep, and you can't "innovate" if you have to keep the backwards compatibility (or something yadda yadda) [16:17]
..... (idle for 22mn)
***pizzaiolo has quit IRC (Read error: Connection reset by peer)
pizzaiolo has joined #archiveteam-bs
JAA has joined #archiveteam-bs
[16:39]
pizzaiolo has quit IRC (Read error: Connection reset by peer)
pizzaiolo has joined #archiveteam-bs
[16:47]
JAA It's an interesting article for sure. But for that much money, couldn't you just get yourself a contract with a tape drive manufacturer where they'd supply drives capable of reading old tape generations for the next X years? [16:55]
Zeryl That seems like the right way to go, yea.
Or why not move to disk, where we SHOULD be good for the next 30+ years.
Seems like a non-issue, if they move from tape.
[16:56]
JAA More expensive for that amount of data, I assume. [16:56]
Zeryl I'm certain
And I assume not much in the way of de-dupe available
they certainly are not unique in this though, hospitals do the same
they certainly are not unique in this though, hospitals do the same
\
sorry :/ cat hit the keyboard
[16:57]
JAA JAA pets Zeryl's cat. [16:59]
Zeryl but if someone like IA can operate, I can't see why the movie studios, who have a significant amount more money, can't do similar [16:59]
JAA Indeed. And they could probably do it better, too. One thing that really bothers me about IA is that it's all in a single building. If anything happens to that church... [17:00]
Zeryl and we're not even talking data that has a real SLA on it. We're talking data that if it takes 20 minutes to bring online, or 2 hours, you're not worried.
but, this is from the guy with a paltry 12tb in house
[17:01]
xmc studios have more money but archival is a cost center for them, it's not their fundamental purpose [17:15]
Zeryl this is true. just another thing to let them whine about. And how they "lose" money on EVERY movie! [17:17]
..... (idle for 20mn)
DFJustin JAA it isn't all in a single building, everything is duplicated in a warehouse across town
and now they're setting up a third copy in canada
[17:37]
***dashcloud has quit IRC (Read error: Operation timed out) [17:38]
JAA DFJustin: Oh, never heard about that warehouse before. As far as I understand it, the Canada copy will only be partial though, right? [17:40]
***dashcloud has joined #archiveteam-bs [17:43]
joepie91 Coderjo: Zeryl: worth noting that that problem is why IA doesn't use tape, afaik
:p
[17:44]
***GE has quit IRC (Remote host closed the connection) [17:46]
.................. (idle for 1h25mn)
Aranje has joined #archiveteam-bs
fie has quit IRC (Ping timeout: 245 seconds)
[19:11]
arkiver yipdw: I have updated the script to only load records.json once
https://github.com/ArchiveTeam/ftp-items/blob/master/tools/deduplicate.py
Unfortunately it doesn't really have a clean way to shut it down
do you think you can make a copy of the json, test to see if it's good json, and then shut down and start the new script?
also moving the copy back as the original
[19:17]
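The copy, validate, swap sequence arkiver describes above could be sketched roughly as follows. This is an illustration with Python's standard library only, not part of deduplicate.py; the function names, the temporary paths and the sample contents are hypothetical.

```python
import json
import os
import shutil
import tempfile


def copy_is_valid_json(original, copy_path):
    """Copy `original` to `copy_path`; return True if the copy parses as JSON."""
    shutil.copy2(original, copy_path)
    try:
        with open(copy_path, "r", encoding="utf-8") as handle:
            json.load(handle)
    except (ValueError, OSError):
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return False
    return True


def promote_copy(copy_path, original):
    """Move the validated copy back over the original in one step."""
    os.replace(copy_path, original)


# Demonstration in a throwaway directory with made-up contents.
with tempfile.TemporaryDirectory() as workdir:
    records = os.path.join(workdir, "records.json")
    backup = os.path.join(workdir, "records.json.copy")

    # Case 1: the file is intact, so the copy validates and can be promoted.
    with open(records, "w", encoding="utf-8") as handle:
        json.dump({"example-hash": "example-item"}, handle)
    good_copy_ok = copy_is_valid_json(records, backup)
    promote_copy(backup, records)
    with open(records, "r", encoding="utf-8") as handle:
        restored = json.load(handle)

    # Case 2: the file was cut off mid-write, so the copy fails validation.
    with open(records, "w", encoding="utf-8") as handle:
        handle.write('{"example-hash": "exam')
    truncated_copy_ok = copy_is_valid_json(records, backup)
```

Validating the copy rather than the live file means a write that lands mid-check cannot turn a passing check into a bad snapshot, which seems to be the reason for copying first.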
***GE has joined #archiveteam-bs [19:20]
fie has joined #archiveteam-bs [19:25]
yipdw arkiver: cool, yeah [19:26]
arkiver thanks! [19:26]
yipdw the JSON I have for gov-ftp is definitely not a good copy
I can save it somewhere though
[19:26]
arkiver the script was already stopped? [19:26]
Zeryl @yipdw, are you accepting new nodes for the archive bot now? [19:36]
yipdw no [19:36]
Zeryl ok [19:36]
yipdw the main reason is it's still a management hassle [19:36]
Zeryl understood, no worries, just figured i'd offer again :) [19:37]
yipdw yeah np [19:42]
***Zeryl_ has joined #archiveteam-bs [19:45]
HCross2 Anyone know if I can change where grab site is saving warcs, mid crawl? I'm mid way through a large (over 2tb) crawl and one HDD is filling and so need to divert to another [19:50]
***Zeryl has quit IRC (Read error: Operation timed out) [19:50]
Selavi HCross2, might be able to slap a symlink on the parent dir? [19:55]
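A rough sketch of the symlink idea Selavi floats above, assuming a POSIX system and Python's standard library. The directory names and `relocate_with_symlink` are hypothetical, and nothing in the log confirms that grab-site tolerates this mid-crawl. In particular, a file the crawler still holds open during the move would keep being written at its old location, so the crawl would probably need to be paused first.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(old_dir, new_dir):
    """Move old_dir to new_dir, then leave a symlink at the old path.

    new_dir must not exist yet; its parent directory must.
    """
    shutil.move(old_dir, new_dir)
    os.symlink(new_dir, old_dir, target_is_directory=True)


# Demonstration with two stand-in "disks" inside a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    old_dir = os.path.join(root, "full-disk", "crawl")
    new_dir = os.path.join(root, "spare-disk", "crawl")
    os.makedirs(old_dir)
    os.makedirs(os.path.dirname(new_dir))

    # A placeholder file standing in for an already-written WARC.
    with open(os.path.join(old_dir, "part-00000.warc.gz"), "wb") as handle:
        handle.write(b"placeholder")

    relocate_with_symlink(old_dir, new_dir)

    old_path_is_link = os.path.islink(old_dir)
    # The old path still resolves, so new files opened there land on the spare disk.
    with open(os.path.join(old_dir, "part-00000.warc.gz"), "rb") as handle:
        seen_through_link = handle.read()
```

Files opened by path after the swap follow the symlink, which is the behaviour the suggestion relies on.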
***Zeryl__ has joined #archiveteam-bs [19:59]
Zeryl_ has quit IRC (Read error: Operation timed out) [20:07]
Coderjo joepie91: well, that and tape isn't suited for random access [20:21]
..... (idle for 22mn)
***Zeryl__ is now known as Zeryl [20:43]
........ (idle for 36mn)
joepie91 Coderjo: right, I was more referring to the commonly-named idea of "why don't you store the darked items on tape so it's cheaper to store them"
since those don't require random access
(generally)
[21:19]
..... (idle for 24mn)
Coderjo oh
yeah, tape is good for short-term, regularly cycling backups. not for long term archiving. (aside from the capacity issue)
[21:43]
.... (idle for 16mn)
***GE has quit IRC (Remote host closed the connection) [22:00]
godane looks like 1992-03-27 episode of Charlie Rose doesn't work at all: https://charlierose.com/episodes/21428?autoplay=true [22:04]
..... (idle for 21mn)
***ndiddy has joined #archiveteam-bs [22:25]
bsmith093 i'm saving fictionpress the same way as ffnet, and it's going swimmingly! [22:26]
***Swizzle has joined #archiveteam-bs [22:26]
bsmith093 evidently, most of the first million id's are also gone, barely 150K stories so far.
i'd be amazed if the whole dump ends up >20GB
[22:26]
***BlueMaxim has joined #archiveteam-bs [22:29]
............... (idle for 1h10mn)
Swizzle has quit IRC (Quit: Leaving) [23:39]
