[01:04] *** RichardG_ is now known as RichardG
[01:12] *** Asparagir has quit IRC (Asparagir)
[01:13] *** Asparagir has joined #archiveteam-bs
[01:33] *** fie has joined #archiveteam-bs
[01:36] *** yakfish has quit IRC (Operation timed out)
[02:40] *** Asparagir has quit IRC (Asparagir)
[03:09] *** yakfish has joined #archiveteam-bs
[03:34] *** Asparagir has joined #archiveteam-bs
[03:35] *** krazedkat has quit IRC (Quit: Leaving)
[04:14] <Somebody2> godane: http://calteches.library.caltech.edu/ -- Archive of Caltech magazine back to the 1930s; might be good for you to grab when you get a chance
[04:30] *** ndiddy has quit IRC (Quit: Leaving)
[04:31] <Somebody2> It looks like it is part of a large open database of Caltech materials, so it's *probably* pretty safe where it is, though.
[04:56] *** Asparagir has quit IRC (Asparagir)
[05:12] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:18] *** Sk1d has joined #archiveteam-bs
[05:26] *** Asparagir has joined #archiveteam-bs
[05:29] *** Asparagir has quit IRC (Client Quit)
[06:03] <godane> Somebody2: thanks
[06:03] <godane> first journal i have seen where the full issues are archived
[06:04] <godane> i always see science journals only put out the articles but no full issue scans
[07:09] *** godane has quit IRC (Ping timeout: 250 seconds)
[07:17] *** VADemon has joined #archiveteam-bs
[07:18] *** godane has joined #archiveteam-bs
[07:21] *** vitzli has joined #archiveteam-bs
[07:43] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[08:34] *** VADemon_ has joined #archiveteam-bs
[08:40] *** VADemon has quit IRC (Ping timeout: 370 seconds)
[08:42] *** VADemon_ has quit IRC (Read error: Operation timed out)
[09:18] *** Honno has joined #archiveteam-bs
[09:30] *** schbirid has joined #archiveteam-bs
[12:07] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:08] *** RichardG has quit IRC (Ping timeout: 244 seconds)
[12:10] *** RichardG has joined #archiveteam-bs
[12:33] <schbirid> ugh, 1MB/s to ACD right now. had gbit speeds earlier
[13:25] <vitzli> you're lucky, last month i got ~30-80 kb/s, though I believe it has to do with ISP messing around
[13:56] *** godane has left 
[13:57] *** godane has joined #archiveteam-bs
[15:01] *** sep332 has joined #archiveteam-bs
[15:06] *** vitzli has quit IRC (Leaving)
[15:41] *** Boppen has quit IRC (Ping timeout: 194 seconds)
[16:08] *** Boppen has joined #archiveteam-bs
[16:13] *** Aranje has joined #archiveteam-bs
[16:20] <yan> arkiver: in fccbda81dc24d605f74ecdc24bca290e74683c2b you broke the link to the IRC channel
[16:39] <yan> arkiver: (in the ftp-gov-grab repo btw); IRC link was changed from cheetoflee to cheetoftp
[16:45] <arkiver> yan: fixed.
[16:52] *** VADemon has joined #archiveteam-bs
[17:53] *** HCross2 has quit IRC (Ping timeout: 260 seconds)
[18:09] *** johtso has joined #archiveteam-bs
[18:09] *** HCross2 has joined #archiveteam-bs
[19:08] <arkiver> anyone going to SHA2017?
[20:11] *** Boppen has quit IRC (Quit: Nettalk6 - www.ntalk.de)
[20:15] *** Boppen has joined #archiveteam-bs
[20:48] *** GinhijiQu has joined #archiveteam-bs
[20:49] <PurpleSym> API data is more useful for robot consumption and transformation. Accessing HTML pages is easier for human beings.
[20:50] <PurpleSym> So, depends on your audience, GinhijiQu.
[20:52] <GinhijiQu> I just suspect that storing the whole webpages will lead to a lot of redundancy and waste storage that could be used to store more information?
[20:54] <PurpleSym> Sure, you trade time to generate a visually appealing output for space. But then again HTML probably compresses well.
[20:56] <Aranje> definitely does
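The claim that HTML compresses well can be checked with a rough sketch like the one below. The page content is a made-up stand-in for a blog page with heavily repeated markup, not real Tumblr data, and `compression_ratio` is a hypothetical helper name. Real pages with varied text would compress less dramatically than this artificial example.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(data)) / len(data)


# Hypothetical page: one post template repeated 200 times, which mimics the
# boilerplate markup that blog themes and widgets repeat on every page.
post = (
    b'<div class="post"><h2 class="title">Title</h2>'
    b'<p class="body">Some text here.</p></div>\n'
)
page = b"<html><body>\n" + post * 200 + b"</body></html>\n"

ratio = compression_ratio(page)
print(f"{len(page)} bytes original, compressed/original ratio = {ratio:.3f}")
```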
[20:57] <GinhijiQu> How well will that go with blogs that include stuff like the bloated Flickr widgets?
[20:58] <GinhijiQu> I'd prefer to just grab the images and deduplicate the images, but then again that would probably require some modifications to the web pages.
[20:59] <PurpleSym> Afaik grab-site implements deduplication, output is stored as WARC and can be played back with another piece of software.
[21:01] <PurpleSym> Have a look at http://archiveteam.org/index.php?title=The_WARC_Ecosystem
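As an illustration of the deduplication idea discussed above, here is a minimal sketch that stores each distinct payload once, keyed by its SHA-1 digest. This is not grab-site's actual implementation; the function name, the URLs, and the byte strings are all hypothetical placeholders.

```python
import hashlib


def deduplicate(responses):
    """Store each distinct payload once and map every URL to its digest."""
    store = {}  # digest -> payload bytes (one copy per distinct payload)
    index = {}  # url -> digest (every URL is still recorded)
    for url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        store.setdefault(digest, payload)  # keep only the first copy
        index[url] = digest
    return store, index


# Hypothetical crawl results: the same widget image served from two blogs.
responses = [
    ("http://example.com/a/widget.png", b"PNGDATA"),
    ("http://example.com/b/widget.png", b"PNGDATA"),
    ("http://example.com/b/photo.jpg", b"JPGDATA"),
]
store, index = deduplicate(responses)
print(f"{len(index)} URLs, {len(store)} stored payloads")
```

Because the index keeps every URL, the original pages need no modification: a playback tool can look up the shared payload by digest.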
[21:07] <GinhijiQu> Maybe I will run some tests tonight to see how well it works with the things I am most worried about.
[21:09] <GinhijiQu> I guess a perfect archive would include both the HTML pages and data from the API embedded as comments or stored alongside the other documents, so there would be a way to upload blogs to other platforms later. (But that would probably be really too much data scaled across all of Tumblr.)
[21:10] <dashcloud> apis are nice, but generally they have limits and such, which isn't terribly helpful when you're trying to save a sinking ship
[21:12] <GinhijiQu> Tumblr has an old API which they didn't seem to care about that much a while ago. I didn't go for an extreme stress test but I never hit any rate limits either... :-) But maybe if hundreds of clients started accessing that API it would overload the servers, idk.
[21:16] <GinhijiQu> Also it doesn't require authentication.
[21:51] *** tsr has joined #archiveteam-bs
[21:52] *** BlueMaxim has joined #archiveteam-bs
[22:01] *** ndiddy has joined #archiveteam-bs
[22:01] *** GE has joined #archiveteam-bs
[22:40] *** pizzaiolo has joined #archiveteam-bs
[22:49] *** GE has quit IRC (Quit: zzz)
[23:45] <HCross2> It looks like that HDD deal I posted the other day was an accident. Couple friends reporting their orders cancelled
[23:50] *** Honno has quit IRC (Read error: Operation timed out)