[00:00] but it accounted for less than 1% of cost
[00:03] I think it helps detect the character set?
[00:05] yeah but why
[00:06] unless it's metadata needed for the warc
[00:06] so it can parse the site?
[00:06] display of the text in the document is charset-dependent but I'm pretty sure the parsing of the HTML and hrefs is not
[00:07] well it has to decode the text when parsing into a more machine-usable format right? Convert it to a python string/unicode string?
[00:08] that's my guess
[00:10] as I recently discovered, URLs can contain emoji characters now
[00:10] idk. the best way would be to read the code.
[00:33] *** godane has joined #archiveteam-bs
[00:33] *** Stilett0 has quit IRC (Read error: Operation timed out)
[00:33] *** Stilett0 has joined #archiveteam-bs
[01:01] *** Stilett0 has quit IRC (Read error: Operation timed out)
[01:02] so this happened: http://kotaku.com/guy-finds-starcraft-source-code-and-returns-it-to-blizz-1794897125
[01:03] wish we got an ISO image of that disk for the archives
[01:30] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[01:49] SketchCow: did you get the zines magazine from here: https://diz.srve.io/zines/
[01:50] if not then you can grab that
[02:07] *** BlueMaxim has joined #archiveteam-bs
[02:19] *** pizzaiolo has quit IRC (pizzaiolo)
[02:20] *** Stilett0 has joined #archiveteam-bs
[04:15] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:22] *** Sk1d has joined #archiveteam-bs
[04:24] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:25] *** BlueMaxim has joined #archiveteam-bs
[06:03] *** SpaffGarg has quit IRC (Read error: Operation timed out)
[06:05] *** SpaffGarg has joined #archiveteam-bs
[06:58] *** sun_rise has quit IRC (Read error: Connection reset by peer)
[07:00] *** GE has joined #archiveteam-bs
[07:20] *** bztoot has joined #archiveteam-bs
[07:21] *** t2t2 has quit IRC (Read error: Operation timed out)
[07:22] *** schbirid has joined #archiveteam-bs
[09:32] *** GE has quit IRC (Remote host closed the connection)
[10:47] *** Honno_ has joined #archiveteam-bs
[10:52] *** Honno has quit IRC (Ping timeout: 370 seconds)
[11:11] *** GE has joined #archiveteam-bs
[11:33] *** pizzaiolo has joined #archiveteam-bs
[11:34] *** pizzaiolo has left
[11:34] *** pizzaiolo has joined #archiveteam-bs
[12:52] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:22] *** Frogging has quit IRC (Read error: Operation timed out)
[13:24] *** Frogging has joined #archiveteam-bs
[13:45] *** bztoot has quit IRC (Read error: Operation timed out)
[13:48] *** t2t2 has joined #archiveteam-bs
[14:41] howdy folks, curious, I have a warc to upload, is there any way to feed it to IA so that it is fully used (i.e. wayback machine)?
[14:42] (this is gna.org mailing list archives)
[14:42] yes, yes it is
[15:21] Ok, I used: https://gist.github.com/Asparagirl/6206247 -- It's done uploading, but no idea where it is now :/
[15:40] it should show up at https://archive.org/details/@yourusername
[15:48] yep, found it, just showed up: https://archive.org/details/mail.gna.org_2017-05-04
[15:48] So I need to get that moved over to the AT collection at some point, who do I ask to do that?
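The charset question from the top of the log comes down to one step: the fetched bytes have to be decoded into a Python string before an HTML parser sees them, and the declared charset only changes the result for non-ASCII text (hrefs in ASCII-compatible encodings survive either way). A minimal sketch of that decode-then-parse step, using only the stdlib; the function and class names are made up for illustration and are not from any crawler discussed above:

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

def extract_hrefs(raw_bytes, declared_charset=None):
    # Decode with the declared charset if one was detected, falling back
    # to UTF-8 with replacement characters so parsing never fails outright.
    text = raw_bytes.decode(declared_charset or "utf-8", errors="replace")
    parser = HrefCollector()
    parser.feed(text)
    return parser.hrefs
```

For a page whose links are plain ASCII, `extract_hrefs` returns the same list regardless of which charset was guessed, which is the point being made at 00:06.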
[15:58] paging SketchCow
[16:01] interesting archival problem: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-archivists-cant-outpace-obsolescence
[16:06] yeah, I'm curious why they don't use something other than tape, but I guess there really isn't anything better at the scale they're talking about
[16:07] I'm somewhat annoyed that the tape drive manufacturers can't just maintain more than 2 generations of backward compatibility within the same tape system
[16:11] although even if they did have that backward compatibility, there is the problem of bit rot on such a dense medium
[16:17] yep, and you can't "innovate" if you have to keep the backwards compatibility (or something yadda yadda)
[16:39] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[16:39] *** pizzaiolo has joined #archiveteam-bs
[16:39] *** JAA has joined #archiveteam-bs
[16:47] *** pizzaiolo has quit IRC (Read error: Connection reset by peer)
[16:47] *** pizzaiolo has joined #archiveteam-bs
[16:55] It's an interesting article for sure. But for that much money, couldn't you just get yourself a contract with a tape drive manufacturer where they'd supply drives capable of reading old tape generations for the next X years?
[16:56] That seems like the right way to go, yeah.
[16:56] Or why not move to disk, where we SHOULD be good for the next 30+ years.
[16:56] Seems like a non-issue, if they move from tape.
[16:56] More expensive for that amount of data, I assume.
[16:57] I'm certain
[16:57] And I assume not much in the way of de-dupe available
[16:58] they certainly are not unique in this though, hospitals do the same
[16:58] \
[16:58] sorry :/ cat hit the keyboard
[16:59] * JAA pets Zeryl's cat.
[16:59] but if someone like IA can operate, I can't see why the movie studios, who have significantly more money, can't do similar
[17:00] Indeed. And they could probably do it better, too.
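For reference, the Asparagirl gist linked earlier drives Internet Archive's S3-compatible upload endpoint with curl. A minimal sketch of the kind of command it builds; the filename is an example, ACCESS/SECRET are obvious placeholders for real S3 keys from archive.org/account/s3.php, and this only constructs the command string rather than running it:

```python
def build_ia_upload_command(item, warc_path, access="ACCESS", secret="SECRET"):
    """Build (but do not run) an upload command for IA's S3-compatible
    API, in the style of the Asparagirl gist.  The keys default to
    placeholders rather than real credentials."""
    return (
        f'curl --location -X PUT '
        f'--header "authorization: LOW {access}:{secret}" '
        f'--header "x-archive-auto-make-bucket:1" '
        f'--upload-file {warc_path} '
        f'http://s3.us.archive.org/{item}/{warc_path}'
    )

print(build_ia_upload_command("mail.gna.org_2017-05-04", "lists.warc.gz"))
```

The resulting item lands under the uploader's account (hence the archive.org/details/@yourusername tip above); moving it into the ArchiveTeam collection afterwards still takes an admin like SketchCow.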
[17:00] One thing that really bothers me about IA is that it's all in a single building. If anything happens to that church...
[17:01] and we're not even talking data that has a real SLA on it. We're talking data where, if it takes 20 minutes to bring online, or 2 hours, you're not worried.
[17:04] but, this is from the guy with a paltry 12TB in-house
[17:15] studios have more money but archival is a cost center for them, it's not their fundamental purpose
[17:17] this is true. just another thing to let them whine about. And how they "lose" money on EVERY movie!
[17:37] JAA: it isn't all in a single building, everything is duplicated in a warehouse across town
[17:37] and now they're setting up a third copy in Canada
[17:38] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:40] DFJustin: Oh, never heard about that warehouse before. As far as I understand it, the Canada copy will only be partial though, right?
[17:43] *** dashcloud has joined #archiveteam-bs
[17:44] Coderjo: Zeryl: worth noting that that problem is why IA doesn't use tape, afaik
[17:44] :p
[17:46] *** GE has quit IRC (Remote host closed the connection)
[19:11] *** Aranje has joined #archiveteam-bs
[19:12] *** fie has quit IRC (Ping timeout: 245 seconds)
[19:17] yipdw: I have updated the script to only load records.json once
[19:17] https://github.com/ArchiveTeam/ftp-items/blob/master/tools/deduplicate.py
[19:17] Unfortunately it doesn't really have a clean way to shut it down
[19:17] do you think you can make a copy of the JSON, test to see if it's good JSON, and then shut down and start the new script?
[19:17] also moving the copy back as the original
[19:20] *** GE has joined #archiveteam-bs
[19:25] *** fie has joined #archiveteam-bs
[19:26] arkiver: cool, yeah
[19:26] thanks!
[19:26] the JSON I have for gov-ftp is definitely not a good copy
[19:26] I can save it somewhere though
[19:26] the script was already stopped?
[19:36] @yipdw, are you accepting new nodes for ArchiveBot now?
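The copy-then-verify procedure arkiver describes for records.json (copy the JSON aside, test that the copy is good JSON, then move it back as the original) can be sketched like this; the function name is made up for illustration:

```python
import json
import os
import shutil

def snapshot_if_valid(path):
    """Copy a JSON file aside and, only if the copy parses cleanly,
    atomically move it back over the original.  Returns True on success,
    False if the file is not valid JSON (the copy is then discarded)."""
    backup = path + ".check"
    shutil.copy2(path, backup)
    try:
        with open(backup, "r", encoding="utf-8") as f:
            json.load(f)          # raises a ValueError subclass if bad
    except ValueError:
        os.remove(backup)         # broken copy is useless, drop it
        return False
    os.replace(backup, path)      # atomic rename on POSIX
    return True
```

Using `os.replace` for the final step means a crash mid-swap leaves either the old file or the new one in place, never a half-written mix, which matters when the running script may be restarted against that same path.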
[19:36] no
[19:36] ok
[19:36] the main reason is it's still a management hassle
[19:37] understood, no worries, just figured I'd offer again :)
[19:42] yeah np
[19:45] *** Zeryl_ has joined #archiveteam-bs
[19:50] Anyone know if I can change where grab-site is saving WARCs mid-crawl? I'm midway through a large (over 2TB) crawl, and one HDD is filling up, so I need to divert to another
[19:50] *** Zeryl has quit IRC (Read error: Operation timed out)
[19:55] HCross2: might be able to slap a symlink on the parent dir?
[19:59] *** Zeryl__ has joined #archiveteam-bs
[20:07] *** Zeryl_ has quit IRC (Read error: Operation timed out)
[20:21] joepie91: well, that and tape isn't suited for random access
[20:43] *** Zeryl__ is now known as Zeryl
[21:19] Coderjo: right, I was more referring to the commonly raised idea of "why don't you store the darked items on tape so it's cheaper to store them"
[21:19] since those don't require random access
[21:19] (generally)
[21:43] oh
[21:44] yeah, tape is good for short-term, regularly cycling backups. not for long-term archiving. (aside from the capacity issue)
[22:00] *** GE has quit IRC (Remote host closed the connection)
[22:04] looks like the 1992-03-27 episode of Charlie Rose doesn't work at all: https://charlierose.com/episodes/21428?autoplay=true
[22:25] *** ndiddy has joined #archiveteam-bs
[22:26] I'm saving fictionpress the same way as ffnet, and it's going swimmingly!
[22:26] *** Swizzle has joined #archiveteam-bs
[22:26] evidently, most of the first million IDs are also gone; barely 150K stories so far.
[22:28] I'd be amazed if the whole dump ends up >20GB
[22:29] *** BlueMaxim has joined #archiveteam-bs
[23:39] *** Swizzle has quit IRC (Quit: Leaving)
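The symlink trick suggested for the mid-crawl disk swap can be sketched as below. One caveat worth stating: if the move crosses filesystems it is a copy-plus-delete, so any file grab-site already has open keeps writing to the old (now unlinked) inode on the full disk; only files opened after the swap go through the symlink to the new disk. The function name is illustrative:

```python
import os
import shutil

def divert_directory(src_dir, new_parent):
    """Move a crawl's output directory to another disk and leave a
    symlink at the old location, so a writer that uses the old path
    keeps working for files it opens from now on.  Files already open
    at swap time keep writing to the old filesystem."""
    dest = os.path.join(new_parent, os.path.basename(src_dir))
    shutil.move(src_dir, dest)    # copy+delete if crossing filesystems
    os.symlink(dest, src_dir)     # old path now points at the new disk
    return dest
```

Because of that open-handle caveat, the safest moment to do this is between WARC files, e.g. right after grab-site rolls over to a new `.warc.gz`.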