#archiveteam-bs 2017-05-05,Fri


Time Nickname Message
00:00 🔗 FalconK but it accounted for less than 1% of cost
00:03 🔗 Odd0002 I think it helps detect the character set?
00:05 🔗 FalconK yeah but why
00:06 🔗 FalconK unless it's metadata needed for the warc
00:06 🔗 Odd0002 so it can parse the site?
00:06 🔗 FalconK display of the text in the document is charset-dependent but I'm pretty sure the parsing of the HTML and hrefs is not
00:07 🔗 Odd0002 well it has to decode the text when parsing into a more machine-usable format right? Convert it to a python string/unicode string?
00:08 🔗 Odd0002 that's my guess
00:10 🔗 Odd0002 as I recently discovered, URLs can contain emoji characters now
00:10 🔗 FalconK idk. the best way would be to read the code.
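For context on the exchange above: href extraction can usually run on the raw bytes, but turning the response into a Python str (as Odd0002 suggests) is where the charset comes in. A minimal sketch of that idea, not the actual wget/wpull code, with a hypothetical URL:

    # Hedged sketch: decode using the declared charset before parsing.
    # The URL and the utf-8 fallback are placeholders; a wrong charset
    # mostly mangles text nodes, since href values are usually ASCII.
    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    resp = urllib.request.urlopen("http://example.com/")        # hypothetical URL
    charset = resp.headers.get_content_charset() or "utf-8"     # from Content-Type, else a guess
    text = resp.read().decode(charset, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    print(parser.links)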
00:33 🔗 godane has joined #archiveteam-bs
00:33 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
00:33 🔗 Stilett0 has joined #archiveteam-bs
01:01 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
01:02 🔗 godane so this happened: http://kotaku.com/guy-finds-starcraft-source-code-and-returns-it-to-blizz-1794897125
01:03 🔗 godane wish we got an ISO image of that disk for the archives
01:30 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
01:49 🔗 godane SketchCow: did you get the zine magazines from here: https://diz.srve.io/zines/
01:50 🔗 godane if not then you can grab that
02:07 🔗 BlueMaxim has joined #archiveteam-bs
02:19 🔗 pizzaiolo has quit IRC (pizzaiolo)
02:20 🔗 Stilett0 has joined #archiveteam-bs
04:15 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
04:22 🔗 Sk1d has joined #archiveteam-bs
04:24 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
04:25 🔗 BlueMaxim has joined #archiveteam-bs
06:03 🔗 SpaffGarg has quit IRC (Read error: Operation timed out)
06:05 🔗 SpaffGarg has joined #archiveteam-bs
06:58 🔗 sun_rise has quit IRC (Read error: Connection reset by peer)
07:00 🔗 GE has joined #archiveteam-bs
07:20 🔗 bztoot has joined #archiveteam-bs
07:21 🔗 t2t2 has quit IRC (Read error: Operation timed out)
07:22 🔗 schbirid has joined #archiveteam-bs
09:32 🔗 GE has quit IRC (Remote host closed the connection)
10:47 🔗 Honno_ has joined #archiveteam-bs
10:52 🔗 Honno has quit IRC (Ping timeout: 370 seconds)
11:11 🔗 GE has joined #archiveteam-bs
11:33 🔗 pizzaiolo has joined #archiveteam-bs
11:34 🔗 pizzaiolo has left
11:34 🔗 pizzaiolo has joined #archiveteam-bs
12:52 🔗 BlueMaxim has quit IRC (Quit: Leaving)
13:22 🔗 Frogging has quit IRC (Read error: Operation timed out)
13:24 🔗 Frogging has joined #archiveteam-bs
13:45 🔗 bztoot has quit IRC (Read error: Operation timed out)
13:48 🔗 t2t2 has joined #archiveteam-bs
14:41 🔗 Zeryl howdy folks, curious, I have a warc to upload, is there any way to feed it to IA so that it is fully used (i.e. in the Wayback Machine)?
14:42 🔗 jtn2 (this is gna.org mailing list archives)
14:42 🔗 Zeryl yes, yes it is
15:21 🔗 Zeryl Ok, I used: https://gist.github.com/Asparagirl/6206247 -- It's done uploading, but no idea where it is now :/
15:40 🔗 xmc it should show up at https://archive.org/details/@yourusername
15:48 🔗 Zeryl yep, found it, just showed up: https://archive.org/details/mail.gna.org_2017-05-04
15:48 🔗 Zeryl So I need to get that moved over to the AT collection at some point; who do I ask to do that?
15:58 🔗 xmc paging SketchCow
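For reference, a rough sketch of this kind of upload using the internetarchive Python library (the item identifier mirrors the one above; the filename and keys are placeholders):

    # Hedged sketch with the internetarchive library; not the gist linked above.
    from internetarchive import upload

    upload(
        "mail.gna.org_2017-05-04",                     # identifier from the chat
        files=["mail.gna.org_2017-05-04.warc.gz"],     # placeholder filename
        metadata={
            "mediatype": "web",
            "title": "gna.org mailing list archives grab",
        },
        access_key="YOUR_ACCESS_KEY",                  # S3-like keys from archive.org/account/s3.php
        secret_key="YOUR_SECRET_KEY",
    )

Whether the WARC then feeds the Wayback Machine still depends on the item landing in an appropriate collection, which is exactly the "who do I ask" question above.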
16:01 🔗 Coderjo interesting archival problem: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-archivists-cant-outpace-obsolescence
16:06 🔗 Zeryl yea, i'm curious why they don't use something other than tape, but I guess really there isn't something better, at the scale they are talking
16:07 🔗 Coderjo I'm somewhat annoyed that the tape drive manufacturers can't just maintain more than 2 generations of backward compatibility within the same tape system
16:11 🔗 Coderjo although even if they did have that backward compatibility, there is the problem of bit rot on such a dense medium
16:17 🔗 Zeryl yep, and you can't "innovate" if you have to keep the backwards compatibility (or something yadda yadda)
16:39 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
16:39 🔗 pizzaiolo has joined #archiveteam-bs
16:39 🔗 JAA has joined #archiveteam-bs
16:47 🔗 pizzaiolo has quit IRC (Read error: Connection reset by peer)
16:47 🔗 pizzaiolo has joined #archiveteam-bs
16:55 🔗 JAA It's an interesting article for sure. But for that much money, couldn't you just get yourself a contract with a tape drive manufacturer where they'd supply drives capable of reading old tape generations for the next X years?
16:56 🔗 Zeryl That seems like the right way to go, yea.
16:56 🔗 Zeryl Or why not move to disk, where we SHOULD be good for the next 30+ years.
16:56 🔗 Zeryl Seems like a non-issue, if they move from tape.
16:56 🔗 JAA More expensive for that amount of data, I assume.
16:57 🔗 Zeryl I'm certain
16:57 🔗 Zeryl And I assume not much in the way of de-dupe available
16:58 🔗 Zeryl they certainly are not unique in this though, hospitals do the same
16:58 🔗 Zeryl they certainly are not unique in this though, hospitals do the same
16:58 🔗 Zeryl \
16:58 🔗 Zeryl sorry :/ cat hit the keyboard
16:59 🔗 * JAA pets Zeryl's cat.
16:59 🔗 Zeryl but if someone like IA can operate, I can't see why the movie studios, who have significantly more money, can't do something similar
17:00 🔗 JAA Indeed. And they could probably do it better, too. One thing that really bothers me about IA is that it's all in a single building. If anything happens to that church...
17:01 🔗 Zeryl and we're not even talking data that has a real SLA on it. We're talking data where, if it takes 20 minutes or 2 hours to bring online, you're not worried.
17:04 🔗 Zeryl but, this is from the guy with a paltry 12 TB in house
17:15 🔗 xmc studios have more money, but archival is a cost center for them; it's not their fundamental purpose
17:17 🔗 Zeryl this is true. just another thing to let them whine about. And how they "lose" money on EVERY movie!
17:37 🔗 DFJustin JAA it isn't all in a single building, everything is duplicated in a warehouse across town
17:37 🔗 DFJustin and now they're setting up a third copy in canada
17:38 🔗 dashcloud has quit IRC (Read error: Operation timed out)
17:40 🔗 JAA DFJustin: Oh, never heard about that warehouse before. As far as I understand it, the Canada copy will only be partial though, right?
17:43 🔗 dashcloud has joined #archiveteam-bs
17:44 🔗 joepie91 Coderjo: Zeryl: worth noting that that problem is why IA doesn't use tape, afaik
17:44 🔗 joepie91 :p
17:46 🔗 GE has quit IRC (Remote host closed the connection)
19:11 🔗 Aranje has joined #archiveteam-bs
19:12 🔗 fie has quit IRC (Ping timeout: 245 seconds)
19:17 🔗 arkiver yipdw: I have updated the script to only load records.json once
19:17 🔗 arkiver https://github.com/ArchiveTeam/ftp-items/blob/master/tools/deduplicate.py
19:17 🔗 arkiver Unfortunately it doesn't really have a clean way to shut down
19:17 🔗 arkiver do you think you can make a copy of the json, test to see if it's good json, and then shut down and start the new script?
19:17 🔗 arkiver also moving the copy back as the original
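A standalone sketch of the procedure arkiver describes, with the filename taken from the chat: copy records.json while the script is stopped, check that the copy parses as JSON, then move the verified copy back as the original.

    # Hedged sketch of the copy-validate-swap step; not part of deduplicate.py.
    import json
    import shutil

    def snapshot_records(path="records.json"):
        copy_path = path + ".copy"           # hypothetical name for the working copy
        shutil.copy2(path, copy_path)        # take a copy while nothing is writing
        with open(copy_path, "r") as f:
            json.load(f)                     # raises an error if the copy is not valid JSON
        shutil.move(copy_path, path)         # move the verified copy back as the original
        return path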
19:20 🔗 GE has joined #archiveteam-bs
19:25 🔗 fie has joined #archiveteam-bs
19:26 🔗 yipdw arkiver: cool, yeah
19:26 🔗 arkiver thanks!
19:26 🔗 yipdw the JSON I have for gov-ftp is definitely not a good copy
19:26 🔗 yipdw I can save it somewhere though
19:26 🔗 arkiver the script was already stopped?
19:36 🔗 Zeryl @yipdw, are you accepting new nodes for ArchiveBot now?
19:36 🔗 yipdw no
19:36 🔗 Zeryl ok
19:36 🔗 yipdw the main reason is it's still a management hassle
19:37 🔗 Zeryl understood, no worries, just figured i'd offer again :)
19:42 🔗 yipdw yeah np
19:45 🔗 Zeryl_ has joined #archiveteam-bs
19:50 🔗 HCross2 Anyone know if I can change where grab-site is saving WARCs, mid-crawl? I'm midway through a large (over 2 TB) crawl and one HDD is filling up, so I need to divert to another
19:50 🔗 Zeryl has quit IRC (Read error: Operation timed out)
19:55 🔗 Selavi HCross2, might be able to slap a symlink on the parent dir?
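A sketch of Selavi's suggestion with made-up paths: move the crawl directory onto the roomier disk and leave a symlink at the old location. Whether this is safe mid-crawl depends on grab-site not holding WARC files open across the move, so pausing the crawl first would be the cautious option.

    # Hedged sketch; both paths are hypothetical.
    import os
    import shutil

    old = "/data1/grab-site/my-big-crawl"    # original crawl dir on the filling disk
    new = "/data2/grab-site/my-big-crawl"    # destination on the disk with free space

    shutil.move(old, new)                    # copies across filesystems, then removes the source
    os.symlink(new, old)                     # writes to the old path now land on the new disk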
19:59 🔗 Zeryl__ has joined #archiveteam-bs
20:07 🔗 Zeryl_ has quit IRC (Read error: Operation timed out)
20:21 🔗 Coderjo joepie91: well, that and tape isn't suited for random access
20:43 🔗 Zeryl__ is now known as Zeryl
21:19 🔗 joepie91 Coderjo: right, I was more referring to the commonly raised idea of "why don't you store the darked items on tape so it's cheaper to store them"
21:19 🔗 joepie91 since those don't require random access
21:19 🔗 joepie91 (generally)
21:43 🔗 Coderjo oh
21:44 🔗 Coderjo yeah, tape is good for short-term, regularly cycling backups. not for long term archiving. (aside from the capacity issue)
22:00 🔗 GE has quit IRC (Remote host closed the connection)
22:04 🔗 godane looks like the 1992-03-27 episode of Charlie Rose doesn't work at all: https://charlierose.com/episodes/21428?autoplay=true
22:25 🔗 ndiddy has joined #archiveteam-bs
22:26 🔗 bsmith093 i'm saving fictionpress the same way as ffnet, and it's going swimmingly!
22:26 🔗 Swizzle has joined #archiveteam-bs
22:26 🔗 bsmith093 evidently, most of the first million IDs are also gone; barely 150K stories so far.
22:28 🔗 bsmith093 i'd be amazed if the whole dump ends up >20GB
22:29 🔗 BlueMaxim has joined #archiveteam-bs
23:39 🔗 Swizzle has quit IRC (Quit: Leaving)
