Time |
Nickname |
Message |
00:02
🔗
|
|
dashcloud has joined #archiveteam-bs |
00:02
🔗
|
voltagex |
328GB downloaded of the 444GB torrent dump |
00:02
🔗
|
SketchCow |
They do well but you'd want a debinding knife |
00:11
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
00:12
🔗
|
|
RichardG has joined #archiveteam-bs |
00:18
🔗
|
robogoat |
voltagex: Yeah, I'd be interested. Endpoint is pretty slow for me, not sure who is throttling. |
00:18
🔗
|
robogoat |
What are you using for downloading? zsync? |
00:19
🔗
|
robogoat |
I am still maxing out at ~1Mbps |
00:20
🔗
|
voltagex |
aria2c |
00:20
🔗
|
voltagex |
They're throttling per connection |
00:25
🔗
|
voltagex |
robogoat: I can scp it to you in about 8 hours |
00:38
🔗
|
robogoat |
Ok, let me muster a dropbox. |
00:43
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
00:48
🔗
|
|
dashcloud has joined #archiveteam-bs |
01:07
🔗
|
|
WubTheCap has joined #archiveteam-bs |
01:42
🔗
|
godane |
SketchCow: i really hope he does byte magazines from 1987 on |
01:43
🔗
|
godane |
cause i think we have all issues from 1986 and before |
01:45
🔗
|
godane |
btw i have been grabbing Popular Science from google books: https://books.google.com/books/serial/ISSN:01617370?rview=1 |
01:45
🔗
|
godane |
so i hope he just focuses on the issues with the worse scans on google books |
02:08
🔗
|
godane |
so tape 25 is going to get digitized soon |
02:09
🔗
|
godane |
this is going faster than i thought it would |
02:15
🔗
|
|
bitBaron has quit IRC (Quit: Bye!) |
02:17
🔗
|
godane |
so i found this: https://120minutes.tylerc.com/1991/ |
02:17
🔗
|
godane |
that site is going in archivebot |
02:18
🔗
|
godane |
anyways that helped me date an episode of 120 minutes to 1991-09-01 |
02:19
🔗
|
|
Odd0002 has quit IRC (Ping timeout: 260 seconds) |
02:25
🔗
|
|
Odd0002 has joined #archiveteam-bs |
02:46
🔗
|
|
yuitimoth has quit IRC (Ping timeout: 250 seconds) |
03:40
🔗
|
|
nightpool has joined #archiveteam-bs |
04:02
🔗
|
|
espes__ has quit IRC (Ping timeout: 252 seconds) |
04:02
🔗
|
|
espes__ has joined #archiveteam-bs |
04:38
🔗
|
|
qw3rty111 has joined #archiveteam-bs |
04:44
🔗
|
|
qw3rty119 has quit IRC (Read error: Operation timed out) |
05:03
🔗
|
|
octothorp has quit IRC (Remote host closed the connection) |
05:04
🔗
|
|
octothorp has joined #archiveteam-bs |
05:31
🔗
|
|
ivan has quit IRC (Leaving) |
05:33
🔗
|
|
ivan has joined #archiveteam-bs |
05:41
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
05:42
🔗
|
|
Jens has joined #archiveteam-bs |
05:46
🔗
|
|
rsznick has joined #archiveteam-bs |
05:47
🔗
|
|
rsznik has quit IRC (Read error: Operation timed out) |
05:47
🔗
|
|
fie has joined #archiveteam-bs |
06:18
🔗
|
|
RichardG has quit IRC (Ping timeout: 250 seconds) |
07:01
🔗
|
|
zyphlar_ has joined #archiveteam-bs |
07:35
🔗
|
WubTheCap |
spc.fimea.fi isn't on Wayback Machine and hosts medicine info for every medicine with a permit in Finland, both for professionals (summary of product characteristics) and individuals (package leaflets). ArchiveBot? |
07:36
🔗
|
WubTheCap |
I checked a few PDFs at least and they're not there |
07:37
🔗
|
|
bwn has quit IRC (Read error: Operation timed out) |
08:06
🔗
|
|
bwn has joined #archiveteam-bs |
08:32
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 250 seconds) |
08:32
🔗
|
|
Mateon1 has joined #archiveteam-bs |
08:32
🔗
|
|
Mateon1 has quit IRC (Connection closed) |
09:09
🔗
|
|
schbirid has joined #archiveteam-bs |
10:17
🔗
|
|
mr_archiv has quit IRC (Ping timeout: 252 seconds) |
10:23
🔗
|
|
mr_archiv has joined #archiveteam-bs |
10:59
🔗
|
|
zyphlar_ has quit IRC (Quit: Connection closed for inactivity) |
11:40
🔗
|
|
fie has quit IRC (Quit: Leaving) |
12:13
🔗
|
JAA |
SketchCow: I have everything ready for Charlie Rose, just need the rsync target now. |
13:51
🔗
|
|
yuitimoth has joined #archiveteam-bs |
14:12
🔗
|
|
RichardG has joined #archiveteam-bs |
14:31
🔗
|
JAA |
Does anyone have any recommendations on how best to share and archive the code associated with WARC captures? |
14:31
🔗
|
JAA |
I'm currently uploading all my manual wpulled WARCs to IA and need to figure out what to do with the scripts around that. |
14:41
🔗
|
|
GLaDOS has joined #archiveteam-bs |
14:45
🔗
|
PurpleSym |
JAA: Embed it into the WARCs using a resource record? |
14:52
🔗
|
PurpleSym |
Regarding the DevTools API: It exposes abstract HTTP requests/responses regardless of the actual transport medium (HTTP 1/2, SPDY, QUIC, Avian Carrier, …). DOM is a layer above that, but the API can provide that one too. |
15:01
🔗
|
JAA |
PurpleSym: Huh, I hadn't thought about that. Sounds like a decent idea. |
15:05
🔗
|
JAA |
That would be able to handle multiple versions of the files, including being able to match those to the data (through the timestamps). |
15:06
🔗
|
JAA |
I guess I'd either use a separate file or add it to the meta WARC. |
15:07
🔗
|
JAA |
Have you done this before? If so, can you link an example? I'm curious about the details, e.g. the WARC-Target-URI choice. |
15:25
🔗
|
PurpleSym |
JAA: Yeah, chromebot writes behavior scripts to the WARCs. https://github.com/PromyLOPh/crocoite/blob/master/crocoite/behavior.py#L59 |
15:26
🔗
|
PurpleSym |
The header in question looks like this: WARC-Target-URI: urn:crocoite:script – probably not the best choice, but ohwell. |
15:27
🔗
|
PurpleSym |
Actually it is using metadata records right now. |
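The idea discussed above (storing the capture scripts inside the WARC as their own record) can be sketched with the standard library alone. This is a minimal, hand-rolled illustration of a WARC/1.0 `resource` record, not the code crocoite or wpull actually use. The function names, the `text/x-python` content type, and the `urn:example:script:grab.py` target URI are assumptions made up for the example. The chat only confirms that crocoite used `urn:crocoite:script` and later switched to metadata records.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_resource_record(target_uri: str, payload: bytes,
                          content_type: str = "text/x-python") -> bytes:
    """Serialize one WARC/1.0 'resource' record as raw bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",   # hypothetical URI scheme
        f"WARC-Date: {now}",                # lets the script be matched to the data
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is: header block, blank line, payload, then two CRLFs.
    return head + payload + b"\r\n\r\n"


def append_record(warc_path: str, record: bytes) -> None:
    """Append the record as its own gzip member, the usual .warc.gz layout."""
    with open(warc_path, "ab") as fh:
        fh.write(gzip.compress(record))
```

Calling `append_record("meta.warc.gz", build_resource_record("urn:example:script:grab.py", script_bytes))` once per script version would keep every revision, each with its own WARC-Date. A maintained WARC library would likely be the safer choice for real captures.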
15:44
🔗
|
|
GLaDOS has quit IRC (Quit: Leaving) |
15:50
🔗
|
voltagex |
robogoat: going to bed, have the 444GB dump. DM me some way of getting it to you, via something that works from the Linux console / preferably multithreaded |
15:52
🔗
|
JAA |
PurpleSym: Sweet, thanks. |
16:22
🔗
|
|
Mateon1 has joined #archiveteam-bs |
16:22
🔗
|
|
Mateon1 has quit IRC (Connection closed) |
18:08
🔗
|
|
atomicthu has quit IRC (Read error: Operation timed out) |
18:10
🔗
|
|
c4rc4s has quit IRC (Ping timeout: 600 seconds) |
18:48
🔗
|
|
c4rc4s has joined #archiveteam-bs |
18:53
🔗
|
godane |
so editing tape 25 |
18:53
🔗
|
godane |
tape 26 is done digitizing |
19:10
🔗
|
|
atomicthu has joined #archiveteam-bs |
19:35
🔗
|
|
ola_norsk has joined #archiveteam-bs |
19:36
🔗
|
ola_norsk |
anyone know if web.archive.org/save/ is currently having some technical difficulties? |
19:36
🔗
|
ola_norsk |
http://web.archive.org/save/https://www.merriam-webster.com/dictionary/stigma -> HTTP ERROR 400 |
19:52
🔗
|
|
jschwart has joined #archiveteam-bs |
20:23
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
20:24
🔗
|
|
RichardG has joined #archiveteam-bs |
20:59
🔗
|
|
bsmith093 has quit IRC (Quit: Leaving.) |
21:02
🔗
|
|
bsmith093 has joined #archiveteam-bs |
21:45
🔗
|
|
sekolyn has joined #archiveteam-bs |
21:45
🔗
|
|
octothorp has quit IRC (Read error: Connection reset by peer) |
21:52
🔗
|
|
sekolyn has quit IRC (Read error: Connection reset by peer) |
21:52
🔗
|
|
octothorp has joined #archiveteam-bs |
21:59
🔗
|
|
ola_norsk has quit IRC (the internet is drunk, st00pid and insane. I am not! https://youtu.be/ItZyaOlrb7E) |
21:59
🔗
|
|
Asparagir has joined #archiveteam-bs |
22:27
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:30
🔗
|
|
dashcloud has quit IRC (Ping timeout: 260 seconds) |
22:34
🔗
|
|
dashcloud has joined #archiveteam-bs |
22:37
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:37
🔗
|
|
dashcloud has joined #archiveteam-bs |
22:43
🔗
|
godane |
SketchCow: we may have a way to run local wayback now |
22:44
🔗
|
godane |
i'm testing this: https://github.com/oduwsdl/ipwb |
22:45
🔗
|
godane |
i will say the !ao option really sucks when trying to keep the format the same in warc |
22:48
🔗
|
godane |
i only say that cause using this local wayback makes the pages look like a raw html grab of the page |
22:49
🔗
|
JAA |
What's wrong with pywb? |
22:49
🔗
|
JAA |
Well, I guess it doesn't offer "save now". |
22:50
🔗
|
godane |
i have not tried pywb yet |
22:52
🔗
|
JAA |
Also webrecorder, I think. That's the software behind webrecorder.io. |
22:52
🔗
|
JAA |
What's your goal, by the way? |
22:52
🔗
|
godane |
a wayback machine that runs offline on a rpi |
22:53
🔗
|
JAA |
Ah offline, so only playback of existing archives? |
22:53
🔗
|
JAA |
Definitely look at pywb then. |
23:01
🔗
|
|
jschwart has quit IRC (Quit: Konversation terminated!) |
23:10
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
23:11
🔗
|
|
RichardG has joined #archiveteam-bs |
23:16
🔗
|
godane |
JAA: how do i build cdx from warc.gz file? |
23:17
🔗
|
godane |
i ask cause the archivebot cdx will not work for pywb |
23:21
🔗
|
JAA |
godane: Ah, yeah, I think they're using a different CDX format than IA. I just ran 'wb-manager add file.warc.gz' for each archive usually. I think there's also a way to use the IA CDX format and convert it to pywb's index though, which should be much faster. |
23:22
🔗
|
JAA |
godane: https://pywb.readthedocs.io/en/latest/manual/usage.html#using-existing-web-archive-collections |
23:40
🔗
|
godane |
error : {'args': {'coll': u'wayback', 'type': 'replay'}, 'error': u'{"message": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'", "errors": {"WARCPathLoader": "archiveteam_archivebot_go_20180216070002/qz.com-shallow-20 |
23:40
🔗
|
godane |
180216-040733-1u9jd-00000.warc.gz: [Errno 2] No such file or directory: u\'/usr/lib/python2.7/wayback/archive/archiveteam_archivebot_go_20180216070002/qz.com-shallow-20180216-040733-1u9jd-00000.warc.gz\'"}}'} |
23:41
🔗
|
godane |
sorry for the dump |
23:43
🔗
|
godane |
ok the fix is to remove archiveteam_archivebot_go_20180216070002/ from the cdx |
23:45
🔗
|
godane |
so i guess you remove the archive item name from warc.os.cdx.gz after gunzipping it |
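The fix described above (dropping the item name from the filename column of the gunzipped CDX) could be scripted roughly as follows. This is a sketch under one stated assumption: that the WARC filename is the last space-separated field on each CDX line. The function names are made up for the example, and the sample line in the usage note uses invented field values around the real filename from the error message.

```python
import gzip


def strip_item_prefix(cdx_line: str, item_name: str) -> str:
    """Drop a leading 'item_name/' from the filename field of one CDX line."""
    # Header lines (' CDX ...') and blank lines pass through unchanged.
    if not cdx_line.strip() or cdx_line.lstrip().startswith("CDX"):
        return cdx_line
    newline = "\n" if cdx_line.endswith("\n") else ""
    # Assumption: the WARC filename is the last space-separated field.
    head, sep, filename = cdx_line.rstrip("\n").rpartition(" ")
    prefix = item_name.rstrip("/") + "/"
    if filename.startswith(prefix):
        filename = filename[len(prefix):]
    return head + sep + filename + newline


def rewrite_cdx(src_gz: str, dst: str, item_name: str) -> None:
    """Gunzip src_gz and write a plain CDX with the item prefix removed."""
    with gzip.open(src_gz, "rt", encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(strip_item_prefix(line, item_name))
```

For example, `rewrite_cdx("qz.com-shallow-20180216-040733-1u9jd.warc.os.cdx.gz", "index.cdx", "archiveteam_archivebot_go_20180216070002")` would leave only the bare `.warc.gz` names in the last column; the input filename here is a guess at the naming pattern, not something stated in the log.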
23:51
🔗
|
JAA |
I hope you're not storing the WARCs in /usr/lib. |
23:55
🔗
|
|
Stiletto has quit IRC (Ping timeout: 246 seconds) |
23:59
🔗
|
|
Stilett0 has joined #archiveteam-bs |