[00:02] *** dashcloud has joined #archiveteam-bs
[00:02] 328GB downloaded of the 444GB torrent dump
[00:02] They do well but you'd want a debinding knife
[00:11] *** RichardG has quit IRC (Read error: Connection reset by peer)
[00:12] *** RichardG has joined #archiveteam-bs
[00:18] voltagex: Yeah, I'd be interested. Endpoint is pretty slow for me, not sure who is throttling.
[00:18] What are you using for downloading? zsync?
[00:19] I am still maxing out at ~1Mbps
[00:20] aria2c
[00:20] They're throttling per connection
[00:25] robogoat: I can scp it to you in about 8 hours
[00:38] Ok, let me muster a dropbox.
[00:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[00:48] *** dashcloud has joined #archiveteam-bs
[01:07] *** WubTheCap has joined #archiveteam-bs
[01:42] SketchCow: i really hope he does Byte magazines from 1987 on
[01:43] cause i think we have all issues from 1986 and before
[01:45] btw i have been grabbing Popular Science from google books: https://books.google.com/books/serial/ISSN:01617370?rview=1
[01:45] so hope he just focuses on the issues with the worse scans on google books
[02:08] so tape 25 is going to get digitized soon
[02:09] this is going faster than i thought it would
[02:15] *** bitBaron has quit IRC (Quit: Bye!)
[02:17] so i found this: https://120minutes.tylerc.com/1991/
[02:17] that site is going in archivebot
[02:18] anyways that helped me date an episode of 120 Minutes to 1991-09-01
[02:19] *** Odd0002 has quit IRC (Ping timeout: 260 seconds)
[02:25] *** Odd0002 has joined #archiveteam-bs
[02:46] *** yuitimoth has quit IRC (Ping timeout: 250 seconds)
[03:40] *** nightpool has joined #archiveteam-bs
[04:02] *** espes__ has quit IRC (Ping timeout: 252 seconds)
[04:02] *** espes__ has joined #archiveteam-bs
[04:38] *** qw3rty111 has joined #archiveteam-bs
[04:44] *** qw3rty119 has quit IRC (Read error: Operation timed out)
[05:03] *** octothorp has quit IRC (Remote host closed the connection)
[05:04] *** octothorp has joined #archiveteam-bs
[05:31] *** ivan has quit IRC (Leaving)
[05:33] *** ivan has joined #archiveteam-bs
[05:41] *** Jens has quit IRC (Remote host closed the connection)
[05:42] *** Jens has joined #archiveteam-bs
[05:46] *** rsznick has joined #archiveteam-bs
[05:47] *** rsznik has quit IRC (Read error: Operation timed out)
[05:47] *** fie has joined #archiveteam-bs
[06:18] *** RichardG has quit IRC (Ping timeout: 250 seconds)
[07:01] *** zyphlar_ has joined #archiveteam-bs
[07:35] spc.fimea.fi isn't on Wayback Machine and hosts medicine info for every medicine with a permit in Finland, both for professionals (summary of product characteristics) and individuals (package leaflets). ArchiveBot?
[07:36] I checked a few PDFs at least and they're not there
[07:37] *** bwn has quit IRC (Read error: Operation timed out)
[08:06] *** bwn has joined #archiveteam-bs
[08:32] *** Mateon1 has quit IRC (Ping timeout: 250 seconds)
[08:32] *** Mateon1 has joined #archiveteam-bs
[08:32] *** Mateon1 has quit IRC (Connection closed)
[09:09] *** schbirid has joined #archiveteam-bs
[10:17] *** mr_archiv has quit IRC (Ping timeout: 252 seconds)
[10:23] *** mr_archiv has joined #archiveteam-bs
[10:59] *** zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
[11:40] *** fie has quit IRC (Quit: Leaving)
[12:13] SketchCow: I have everything ready for Charlie Rose, just need the rsync target now.
[13:51] *** yuitimoth has joined #archiveteam-bs
[14:12] *** RichardG has joined #archiveteam-bs
[14:31] Does anyone have any recommendations how to best share and archive the code associated with WARC captures?
[14:31] I'm currently uploading all my manual wpulled WARCs to IA and need to figure out what to do with the scripts around that.
[14:41] *** GLaDOS has joined #archiveteam-bs
[14:45] JAA: Embed it into the WARCs using a resource record?
[14:52] Regarding the DevTools API: It exposes abstract HTTP requests/responses regardless of the actual transport medium (HTTP 1/2, SPDY, QUIC, Avian Carrier, …). DOM is a layer above that, but the API can provide that one too.
[15:01] PurpleSym: Huh, I hadn't thought about that. Sounds like a decent idea.
[15:05] That would be able to handle multiple versions of the files, including being able to match those to the data (through the timestamps).
[15:06] I guess I'd either use a separate file or add it to the meta WARC.
[15:07] Have you done this before? If so, can you link an example? I'm curious about the details, e.g. the WARC-Target-URI choice.
[15:25] JAA: Yeah, chromebot writes behavior scripts to the WARCs. https://github.com/PromyLOPh/crocoite/blob/master/crocoite/behavior.py#L59
[15:26] The header in question looks like this: WARC-Target-URI: urn:crocoite:script – probably not the best choice, but oh well.
[15:27] Actually it is using metadata records right now.
[15:44] *** GLaDOS has quit IRC (Quit: Leaving)
[15:50] robogoat: going to bed, have the 444GB dump. DM me some way of getting it to you, via something that works from the Linux console / preferably multithreaded
[15:52] PurpleSym: Sweet, thanks.
[16:22] *** Mateon1 has joined #archiveteam-bs
[16:22] *** Mateon1 has quit IRC (Connection closed)
[18:08] *** atomicthu has quit IRC (Read error: Operation timed out)
[18:10] *** c4rc4s has quit IRC (Ping timeout: 600 seconds)
[18:48] *** c4rc4s has joined #archiveteam-bs
[18:53] so editing tape 25
[18:53] tape 26 is done digitizing
[19:10] *** atomicthu has joined #archiveteam-bs
[19:35] *** ola_norsk has joined #archiveteam-bs
[19:36] anyone know if web.archive.org/save/ is currently having some technical difficulties?
[19:36] http://web.archive.org/save/https://www.merriam-webster.com/dictionary/stigma -> HTTP ERROR 400
[19:52] *** jschwart has joined #archiveteam-bs
[20:23] *** RichardG has quit IRC (Read error: Connection reset by peer)
[20:24] *** RichardG has joined #archiveteam-bs
[20:59] *** bsmith093 has quit IRC (Quit: Leaving.)
[21:02] *** bsmith093 has joined #archiveteam-bs
[21:45] *** sekolyn has joined #archiveteam-bs
[21:45] *** octothorp has quit IRC (Read error: Connection reset by peer)
[21:52] *** sekolyn has quit IRC (Read error: Connection reset by peer)
[21:52] *** octothorp has joined #archiveteam-bs
[21:59] *** ola_norsk has quit IRC (the internet is drunk, st00pid and insane. I am not! https://youtu.be/ItZyaOlrb7E)
[21:59] *** Asparagir has joined #archiveteam-bs
[22:27] *** schbirid has quit IRC (Quit: Leaving)
[22:30] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[22:34] *** dashcloud has joined #archiveteam-bs
[22:37] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:37] *** dashcloud has joined #archiveteam-bs
[22:43] SketchCow: we may have a way to run local wayback now
[22:44] i'm testing this: https://github.com/oduwsdl/ipwb
[22:45] i will say the !ao option really sucks when trying to keep the format the same in warc
[22:48] i only say that cause using this local wayback makes the pages look like a raw html grab of the page
[22:49] What's wrong with pywb?
[22:49] Well, I guess it doesn't offer "save now".
[22:50] i have not tried pywb yet
[22:52] Also webrecorder, I think. That's the software behind webrecorder.io.
[22:52] What's your goal, by the way?
[22:52] a wayback machine that runs offline on a rpi
[22:53] Ah offline, so only playback of existing archives?
[22:53] Definitely look at pywb then.
[23:01] *** jschwart has quit IRC (Quit: Konversation terminated!)
[23:10] *** RichardG has quit IRC (Read error: Connection reset by peer)
[23:11] *** RichardG has joined #archiveteam-bs
[23:16] JAA: how do i build a cdx from a warc.gz file?
[23:17] i ask cause the archivebot cdx will not work for pywb
[23:21] godane: Ah, yeah, I think they're using a different CDX format than IA. I just ran 'wb-manager add file.warc.gz' for each archive usually. I think there's also a way to use the IA CDX format and convert it to pywb's index though, which should be much faster.
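Editor's sketch of the "run wb-manager add for each archive" approach mentioned at 23:21. It assumes pywb is installed, that a collection has already been created with `wb-manager init`, and that the script is run from the directory that holds pywb's `collections/` folder. The function names and the default dry-run behaviour are made up for illustration; note that the real command takes the collection name before the file path.

```python
import subprocess
from pathlib import Path


def build_add_commands(collection, warc_dir):
    """Return one `wb-manager add` command per *.warc.gz file in warc_dir."""
    warcs = sorted(Path(warc_dir).glob("*.warc.gz"))
    return [["wb-manager", "add", collection, str(path)] for path in warcs]


def index_all(collection, warc_dir, dry_run=True):
    """Print the commands (dry run) or execute them; executing needs pywb."""
    commands = build_add_commands(collection, warc_dir)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first archive that fails to index.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `index_all("wayback", "/path/to/warcs")` only prints the commands; pass `dry_run=False` to actually run them. Both the collection name and the path here are placeholders.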
[23:22] godane: https://pywb.readthedocs.io/en/latest/manual/usage.html#using-existing-web-archive-collections
[23:40] error : {'args': {'coll': u'wayback', 'type': 'replay'}, 'error': u'{"message": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'", "errors": {"WARCPathLoader": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20
[23:40] 180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'"}}'}
[23:41] sorry for the dump
[23:43] ok the fix is to remove archiveteam_archivebot_go_20180216070002/ from the cdx
[23:45] so i guess you remove the archive item name from warc.os.cdx.gz after gunzipping it
[23:51] I hope you're not storing the WARCs in /usr/lib.
[23:55] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[23:59] *** Stilett0 has joined #archiveteam-bs
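Editor's sketch of the fix described at 23:43 and 23:45: gunzip the CDX and strip the item-name directory from each entry's filename. It assumes the filename is the last space-separated field of every CDX line, which matches the common IA field layouts but was not confirmed in the channel. The function names are hypothetical, and this is not necessarily how it was done in practice.

```python
import gzip
import posixpath


def strip_item_prefix(line):
    """Drop any directory part from the last (filename) field of a CDX line."""
    if line.lstrip().startswith("CDX") or not line.strip():
        return line  # leave the header line and blank lines untouched
    body = line.rstrip("\n")
    ending = line[len(body):]
    fields = body.split(" ")
    # Assumption: the WARC filename is the final field of the line.
    fields[-1] = posixpath.basename(fields[-1])
    return " ".join(fields) + ending


def rewrite_cdx(src_gz, dst):
    """Read a gzipped CDX and write an uncompressed copy with bare filenames."""
    with gzip.open(src_gz, "rt", encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(strip_item_prefix(line))
```

After the rewrite, each entry points at the bare WARC filename, so the WARCs need to sit directly in the directory the replay tool resolves filenames against.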