#archiveteam-bs 2018-02-23,Fri

Time Nickname Message
00:02 🔗 dashcloud has joined #archiveteam-bs
00:02 🔗 voltagex 328GB downloaded of the 444GB torrent dump
00:02 🔗 SketchCow They do well but you'd want a debinding knife
00:11 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
00:12 🔗 RichardG has joined #archiveteam-bs
00:18 🔗 robogoat voltagex: Yeah, I'd be interested. Endpoint is pretty slow for me, not sure who is throttling.
00:18 🔗 robogoat What are you using for downloading? zsync?
00:19 🔗 robogoat I am still maxing out at ~1Mbps
00:20 🔗 voltagex aria2c
00:20 🔗 voltagex They're throttling per connection
00:25 🔗 voltagex robogoat: I can scp it to you in about 8 hours
00:38 🔗 robogoat Ok, let me muster a dropbox.
00:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
00:48 🔗 dashcloud has joined #archiveteam-bs
01:07 🔗 WubTheCap has joined #archiveteam-bs
01:42 🔗 godane SketchCow: i really hope he does byte magazines from 1987 on
01:43 🔗 godane cause i think we have all issues from 1986 and before
01:45 🔗 godane btw i have been grabbing Popular Science from google books: https://books.google.com/books/serial/ISSN:01617370?rview=1
01:45 🔗 godane so i hope he just focuses on the issues with the worst scans on google books
02:08 🔗 godane so tape 25 is going to get digitized soon
02:09 🔗 godane this is going faster than i thought it would
02:15 🔗 bitBaron has quit IRC (Quit: Bye!)
02:17 🔗 godane so i found this: https://120minutes.tylerc.com/1991/
02:17 🔗 godane that site is going in archivebot
02:18 🔗 godane anyways that helped me date an episode of 120 minutes to 1991-09-01
02:19 🔗 Odd0002 has quit IRC (Ping timeout: 260 seconds)
02:25 🔗 Odd0002 has joined #archiveteam-bs
02:46 🔗 yuitimoth has quit IRC (Ping timeout: 250 seconds)
03:40 🔗 nightpool has joined #archiveteam-bs
04:02 🔗 espes__ has quit IRC (Ping timeout: 252 seconds)
04:02 🔗 espes__ has joined #archiveteam-bs
04:38 🔗 qw3rty111 has joined #archiveteam-bs
04:44 🔗 qw3rty119 has quit IRC (Read error: Operation timed out)
05:03 🔗 octothorp has quit IRC (Remote host closed the connection)
05:04 🔗 octothorp has joined #archiveteam-bs
05:31 🔗 ivan has quit IRC (Leaving)
05:33 🔗 ivan has joined #archiveteam-bs
05:41 🔗 Jens has quit IRC (Remote host closed the connection)
05:42 🔗 Jens has joined #archiveteam-bs
05:46 🔗 rsznick has joined #archiveteam-bs
05:47 🔗 rsznik has quit IRC (Read error: Operation timed out)
05:47 🔗 fie has joined #archiveteam-bs
06:18 🔗 RichardG has quit IRC (Ping timeout: 250 seconds)
07:01 🔗 zyphlar_ has joined #archiveteam-bs
07:35 🔗 WubTheCap spc.fimea.fi isn't on Wayback Machine and hosts medicine info for every medicine with a permit in Finland, both for professionals (summary of product characteristics) and individuals (package leaflets). ArchiveBot?
07:36 🔗 WubTheCap I checked a few PDFs at least and they're not there
07:37 🔗 bwn has quit IRC (Read error: Operation timed out)
08:06 🔗 bwn has joined #archiveteam-bs
08:32 🔗 Mateon1 has quit IRC (Ping timeout: 250 seconds)
08:32 🔗 Mateon1 has joined #archiveteam-bs
08:32 🔗 Mateon1 has quit IRC (Connection closed)
09:09 🔗 schbirid has joined #archiveteam-bs
10:17 🔗 mr_archiv has quit IRC (Ping timeout: 252 seconds)
10:23 🔗 mr_archiv has joined #archiveteam-bs
10:59 🔗 zyphlar_ has quit IRC (Quit: Connection closed for inactivity)
11:40 🔗 fie has quit IRC (Quit: Leaving)
12:13 🔗 JAA SketchCow: I have everything ready for Charlie Rose, just need the rsync target now.
13:51 🔗 yuitimoth has joined #archiveteam-bs
14:12 🔗 RichardG has joined #archiveteam-bs
14:31 🔗 JAA Does anyone have any recommendations how to best share and archive the code associated with WARC captures?
14:31 🔗 JAA I'm currently uploading all my manual wpulled WARCs to IA and need to figure out what to do with the scripts around that.
14:41 🔗 GLaDOS has joined #archiveteam-bs
14:45 🔗 PurpleSym JAA: Embed it into the WARCs using a resource record?
14:52 🔗 PurpleSym Regarding the DevTools API: It exposes abstract HTTP requests/responses regardless of the actual transport medium (HTTP 1/2, SPDY, QUIC, Avian Carrier, …). DOM is a layer above that, but the API can provide that one too.
15:01 🔗 JAA PurpleSym: Huh, I hadn't thought about that. Sounds like a decent idea.
15:05 🔗 JAA That would be able to handle multiple versions of the files, including being able to match those to the data (through the timestamps).
15:06 🔗 JAA I guess I'd either use a separate file or add it to the meta WARC.
15:07 🔗 JAA Have you done this before? If so, can you link an example? I'm curious about the details, e.g. the WARC-Target-URI choice.
15:25 🔗 PurpleSym JAA: Yeah, chromebot writes behavior scripts to the WARCs. https://github.com/PromyLOPh/crocoite/blob/master/crocoite/behavior.py#L59
15:26 🔗 PurpleSym The header in question looks like this: WARC-Target-URI: urn:crocoite:script – probably not the best choice, but ohwell.
15:27 🔗 PurpleSym Actually it is using metadata records right now.
15:44 🔗 GLaDOS has quit IRC (Quit: Leaving)
15:50 🔗 voltagex robogoat: going to bed, have the 444GB dump. DM me some way of getting it to you, via something that works from the Linux console / preferably multithreaded
15:52 🔗 JAA PurpleSym: Sweet, thanks.
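The idea discussed above (storing capture scripts inside the WARC as a resource record) can be sketched with the Python standard library alone. This is a minimal illustration and not crocoite's or wpull's actual code: the `urn:example:script:` URI scheme and the function names are hypothetical, and only the basic WARC/1.0 record layout (version line, named headers, blank line, block, two CRLFs) is assumed.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record as raw bytes."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        # Hypothetical URI; pick a scheme that makes sense for your project.
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record ends with two CRLFs after the block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def append_script(warc_gz_path: str, script_name: str, script_source: bytes) -> None:
    """Append the script as its own gzip member at the end of a .warc.gz file."""
    record = build_resource_record(
        f"urn:example:script:{script_name}", script_source, "text/plain"
    )
    # Opening in "ab" mode writes a new gzip member, which matches the
    # one-member-per-record layout commonly used for .warc.gz files.
    with gzip.open(warc_gz_path, "ab") as fh:
        fh.write(record)
```

Because each record carries its own WARC-Date, several versions of the same script can sit in one file and be matched to the captured data by timestamp, as JAA notes at 15:05. A dedicated WARC library would be the safer choice for real archives.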
16:22 🔗 Mateon1 has joined #archiveteam-bs
16:22 🔗 Mateon1 has quit IRC (Connection closed)
18:08 🔗 atomicthu has quit IRC (Read error: Operation timed out)
18:10 🔗 c4rc4s has quit IRC (Ping timeout: 600 seconds)
18:48 🔗 c4rc4s has joined #archiveteam-bs
18:53 🔗 godane so editing tape 25
18:53 🔗 godane tape 26 is done digitizing
19:10 🔗 atomicthu has joined #archiveteam-bs
19:35 🔗 ola_norsk has joined #archiveteam-bs
19:36 🔗 ola_norsk anyone know if web.archive.org/save/ is currently having some technical difficulties?
19:36 🔗 ola_norsk http://web.archive.org/save/https://www.merriam-webster.com/dictionary/stigma -> HTTP ERROR 400
19:52 🔗 jschwart has joined #archiveteam-bs
20:23 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
20:24 🔗 RichardG has joined #archiveteam-bs
20:59 🔗 bsmith093 has quit IRC (Quit: Leaving.)
21:02 🔗 bsmith093 has joined #archiveteam-bs
21:45 🔗 sekolyn has joined #archiveteam-bs
21:45 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
21:52 🔗 sekolyn has quit IRC (Read error: Connection reset by peer)
21:52 🔗 octothorp has joined #archiveteam-bs
21:59 🔗 ola_norsk has quit IRC (the internet is drunk, st00pid and insane. I am not! https://youtu.be/ItZyaOlrb7E)
21:59 🔗 Asparagir has joined #archiveteam-bs
22:27 🔗 schbirid has quit IRC (Quit: Leaving)
22:30 🔗 dashcloud has quit IRC (Ping timeout: 260 seconds)
22:34 🔗 dashcloud has joined #archiveteam-bs
22:37 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:37 🔗 dashcloud has joined #archiveteam-bs
22:43 🔗 godane SketchCow: we may have a way to run local wayback now
22:44 🔗 godane i'm testing this: https://github.com/oduwsdl/ipwb
22:45 🔗 godane i will say the !ao option really sucks when trying to keep the format the same in warc
22:48 🔗 godane i only say that cause using this local wayback makes the pages look like a raw html grab of the page
22:49 🔗 JAA What's wrong with pywb?
22:49 🔗 JAA Well, I guess it doesn't offer "save now".
22:50 🔗 godane i have not tried pywb yet
22:52 🔗 JAA Also webrecorder, I think. That's the software behind webrecorder.io.
22:52 🔗 JAA What's your goal, by the way?
22:52 🔗 godane a wayback machine that runs offline on a rpi
22:53 🔗 JAA Ah offline, so only playback of existing archives?
22:53 🔗 JAA Definitely look at pywb then.
23:01 🔗 jschwart has quit IRC (Quit: Konversation terminated!)
23:10 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
23:11 🔗 RichardG has joined #archiveteam-bs
23:16 🔗 godane JAA: how do i build cdx from warc.gz file?
23:17 🔗 godane i ask cause the archivebot cdx will not work for pywb
23:21 🔗 JAA godane: Ah, yeah, I think they're using a different CDX format than IA. I just ran 'wb-manager add file.warc.gz' for each archive usually. I think there's also a way to use the IA CDX format and convert it to pywb's index though, which should be much faster.
23:22 🔗 JAA godane: https://pywb.readthedocs.io/en/latest/manual/usage.html#using-existing-web-archive-collections
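The per-archive `wb-manager add` approach JAA describes could be scripted roughly as below. This is a sketch under assumptions: it assumes pywb is installed with `wb-manager` on the PATH, that the command takes a collection name before the WARC path (`wb-manager add <collection> <file>`), and that all WARCs sit in one directory. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def wb_manager_commands(collection: str, warc_dir: str) -> list[list[str]]:
    """Build the wb-manager invocations without running them."""
    warcs = sorted(Path(warc_dir).glob("*.warc.gz"))
    # 'init' creates the collection; it may fail if the collection already
    # exists, in which case that first command can simply be dropped.
    commands = [["wb-manager", "init", collection]]
    commands += [["wb-manager", "add", collection, str(path)] for path in warcs]
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Example (not executed here):
#   run_all(wb_manager_commands("wayback", "/data/warcs"))
```

Building the command list separately from running it makes it easy to print and review the commands first, which is handy on a slow machine such as an rpi where indexing each archive takes a while.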
23:40 🔗 godane error : {'args': {'coll': u'wayback', 'type': 'replay'}, 'error': u'{"message": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'", "errors": {"WARCPathLoader": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20
23:40 🔗 godane 180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'"}}'}
23:41 🔗 godane sorry for the dump
23:43 🔗 godane ok the fix is to remove archiveteam_archivebot_go_20180216070002/ from the cdx
23:45 🔗 godane so i guess you remove the archive item name from the warc.os.cdx.gz after gunzipping it
23:51 🔗 JAA I hope you're not storing the WARCs in /usr/lib.
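The fix godane describes at 23:43 (dropping the item-name directory from the filename column of the CDX) might look like the following. It is a sketch that assumes a space-separated CDX whose last field is the WARC filename, and whose header line begins with ` CDX`; check the header of the actual file before relying on it. The function names are hypothetical.

```python
import gzip


def strip_item_prefix(line: str) -> str:
    """Remove any directory part from the filename (last) field of a CDX line."""
    if line.startswith(" CDX") or not line.strip():
        return line  # leave the header and blank lines untouched
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    fields = body.split(" ")
    # Assumption: the WARC filename is the last field on the line.
    fields[-1] = fields[-1].rsplit("/", 1)[-1]
    return " ".join(fields) + ending


def rewrite_cdx(src_gz_path: str, dst_path: str) -> None:
    """Read a gzipped CDX and write an uncompressed copy with bare filenames."""
    with gzip.open(src_gz_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_item_prefix(line))
```

Only the last field is changed, so slashes inside the URL columns are left alone.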
23:55 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
23:59 🔗 Stilett0 has joined #archiveteam-bs
