#archiveteam-bs 2020-09-13,Sun

Time	Nickname	Message
00:24 ^🔗		jshoard has quit IRC (Quit: Leaving)
01:00 ^🔗		Arcorann has joined #archiveteam-bs
01:01 ^🔗		Arcorann has quit IRC (Remote host closed the connection)
01:01 ^🔗		Arcorann has joined #archiveteam-bs
01:38 ^🔗		scorche has quit IRC (Read error: Operation timed out)
01:48 ^🔗		scorche has joined #archiveteam-bs
01:53 ^🔗		Mateon1 has quit IRC (Ping timeout: 272 seconds)
01:53 ^🔗		Ctrl has quit IRC (Read error: Operation timed out)
01:54 ^🔗		Mateon1 has joined #archiveteam-bs
01:55 ^🔗		brayden has quit IRC (Ping timeout: 272 seconds)
01:55 ^🔗		Laverne has quit IRC (Ping timeout: 272 seconds)
02:29 ^🔗		Ctrl has joined #archiveteam-bs
02:38 ^🔗		brayden has joined #archiveteam-bs
02:38 ^🔗		Laverne has joined #archiveteam-bs
02:55 ^🔗		asdf01011 has quit IRC (Remote host closed the connection)
03:12 ^🔗		qw3rty_ has joined #archiveteam-bs
03:13 ^🔗		asdf01011 has joined #archiveteam-bs
03:15 ^🔗		qw3rty__ has quit IRC (Ping timeout: 265 seconds)
03:23 ^🔗	SketchCow	My 50th birthday party tomorrow: http://50.textfiles.com
05:12 ^🔗		benjinsmi has quit IRC (Read error: Connection reset by peer)
06:32 ^🔗		larryv has quit IRC (Quit: larryv)
06:40 ^🔗		endrift has quit IRC (Read error: Operation timed out)
06:48 ^🔗		endrift has joined #archiveteam-bs
08:10 ^🔗		jshoard has joined #archiveteam-bs
08:48 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
11:04 ^🔗		scorche has quit IRC (Read error: Operation timed out)
11:35 ^🔗		benjins has joined #archiveteam-bs
11:46 ^🔗		scorche has joined #archiveteam-bs
12:11 ^🔗		VADemon has joined #archiveteam-bs
13:19 ^🔗		VADemon has quit IRC (Read error: Connection reset by peer)
14:02 ^🔗		dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
14:03 ^🔗		dashcloud has joined #archiveteam-bs
14:21 ^🔗		nepeat has quit IRC (Quit: ZNC 1.7.5 - https://znc.in)
14:23 ^🔗		nepeat has joined #archiveteam-bs
14:36 ^🔗		Lord_Nigh has quit IRC (Read error: Operation timed out)
14:45 ^🔗		Lord_Nigh has joined #archiveteam-bs
14:55 ^🔗		Lord_Nigh has quit IRC (Ping timeout: 272 seconds)
14:57 ^🔗		Lord_Nigh has joined #archiveteam-bs
14:58 ^🔗		Lord_Nigh has quit IRC (Remote host closed the connection)
15:24 ^🔗		Arcorann has quit IRC (Read error: Connection reset by peer)
16:07 ^🔗		VADemon has joined #archiveteam-bs
16:35 ^🔗		godane has quit IRC (Read error: Connection reset by peer)
16:51 ^🔗		godane has joined #archiveteam-bs
17:50 ^🔗		DigiDigi has quit IRC (Remote host closed the connection)
18:16 ^🔗		DigiDigi has joined #archiveteam-bs
19:06 ^🔗		Laverne has quit IRC (Ping timeout: 272 seconds)
19:07 ^🔗		brayden has quit IRC (Ping timeout: 272 seconds)
19:13 ^🔗	jodizzle	cm: So what kind of solution are you referring to?
19:14 ^🔗	cm	well my naive approach is to wget all the enclosure files (mp3s)
19:14 ^🔗	cm	for each item that i archive, i replace the enclosure url in the rss feed with a link to my own copy
19:14 ^🔗	cm	but this is not portable
19:15 ^🔗	jodizzle	Oh I see, so you're saying that you're creating your own podcast feed?
19:15 ^🔗	cm	yeah
19:15 ^🔗	jodizzle	Based on the original feed, but where items reference what you've already downloaded
19:16 ^🔗	cm	yeah
19:16 ^🔗	jodizzle	That's interesting
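A minimal sketch of the approach cm describes (pointing each enclosure at a local copy), assuming the mirrored files sit flat under one base URL and keep their original file names. The function name `rewrite_enclosures` and the `mirror_base` parameter are hypothetical, not anything from cm's actual setup.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def rewrite_enclosures(feed_xml: str, mirror_base: str) -> str:
    """Return the feed with every <enclosure url=...> pointed at a mirror.

    Assumption: each mirrored file is stored as <mirror_base>/<file name>.
    """
    root = ET.fromstring(feed_xml)
    base = mirror_base.rstrip("/")
    for enclosure in root.iter("enclosure"):
        original = enclosure.get("url")
        if not original:
            continue  # leave malformed items untouched
        # Keep only the last path segment; the query string is dropped.
        filename = posixpath.basename(urlsplit(original).path)
        enclosure.set("url", base + "/" + filename)
    return ET.tostring(root, encoding="unicode")
```

Two caveats with this sketch: episodes that share a file name would collide, and ElementTree renames namespace prefixes (such as `itunes:`) on output unless they are registered first.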
19:16 ^🔗	cm	but creating the archive feed could be a separate step if you have a well-defined archive format
19:17 ^🔗	jodizzle	What do you mean by "archive format", in this case?
19:17 ^🔗	cm	simply wgetting the rss and the enclosures doesn't work, because wget doesn't store the name of the url that was wgotten
19:18 ^🔗	cm	maybe warc would be enough?
19:18 ^🔗	jodizzle	What would you be putting in the WARCs, to be clear?
19:19 ^🔗	cm	the mp3 files and rss i guess
19:19 ^🔗	cm	warc remembers the url of the fetched file
19:21 ^🔗	jodizzle	Yeah, I guess one approach would be to save all the podcasts downloaded into WARCs, so you'd get data + metadata for each. You could also save the rss feed at the time of the grab. (Basically what you said.)
19:21 ^🔗	cm	i haven't thought through what could be bundled in a single warc file, and what would have to be separate
19:21 ^🔗	jodizzle	Then, you can generate a CDX of the WARCs that acts as an index for which URLs you've gotten
19:22 ^🔗	cm	cdx?
19:23 ^🔗	jodizzle	https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
19:25 ^🔗	jodizzle	Also, typically a WARC record is a single thing, in this case, either one request or response. But you can bundle WARC records together into gzipped .warc.gz files.
19:26 ^🔗	jodizzle	That's how they're usually stored in bulk.
19:26 ^🔗	jodizzle	(Here are examples of different records, for reference: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-b-informative-examples-of-warc-records)
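To illustrate the idea of an index over WARC records, here is a rough sketch that scans an uncompressed WARC byte string and maps each `WARC-Target-URI` to the byte offset of its record. It is not the CDX format linked above, which carries more fields, and a maintained library such as warcio would be the usual choice in practice. The function name `index_warc` is hypothetical.

```python
import io
from typing import Dict


def index_warc(data: bytes) -> Dict[str, int]:
    """Map WARC-Target-URI -> byte offset of the record that holds it.

    Assumption: `data` is an uncompressed WARC file held in memory.
    """
    index: Dict[str, int] = {}
    stream = io.BytesIO(data)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of data
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        # Read the named header fields up to the empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record block using its declared length.
        stream.seek(int(headers.get("content-length", "0")), io.SEEK_CUR)
        uri = headers.get("warc-target-uri")
        if uri:
            index[uri] = offset
    return index
```

If several records share a target URI (a request and its response, for example), this sketch keeps only the last offset; a real index would store one entry per record.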
19:27 ^🔗	cm	so i guess the cleanest and most consistent way would be to treat the podcast rss as a web page, and crawl it like you would a website
19:27 ^🔗	cm	then you get an archive of the rss and the content files
19:27 ^🔗	jodizzle	But this is all archival details. I am a little confused by what you mean by "portable". What would a "portable feed archive" be?
19:28 ^🔗	cm	the current format of my feed archives is a directory with feed.xml, and subdirectories written by wget containing the content
19:29 ^🔗	cm	the whole directory is served by a webserver, and the feed.xml has links to my copies of the content
19:30 ^🔗	cm	one way to make it portable would be to use relative links to the content, but i dont think that is possible in RSS
19:32 ^🔗	cm	so i thought about putting a custom keyword in place of the webroot for each archived piece of content, then to make a usable RSS feed you would replace that keyword with whatever the webroot happens to be
19:33 ^🔗	jodizzle	Oh, so you mean it's not currently portable because if the IP address/domain changes, the links in feed.xml will break?
19:33 ^🔗	cm	yeah
19:34 ^🔗	jodizzle	Okay, yeah. If you can't use relative links in RSS (is that definitely true?), then I don't think there's any way around something like what you described.
19:34 ^🔗	jodizzle	You could have feed.xml created periodically by a cronjob or similar via a script that references a "webroot" setting
19:35 ^🔗	jodizzle	Or something like that.
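The placeholder idea could be sketched roughly as follows, assuming the archived feed is stored as a template in which a keyword stands in for the webroot. The keyword `{{WEBROOT}}` and the function name `render_feed` are made up for illustration.

```python
# Hypothetical keyword stored in the archived feed in place of the webroot.
PLACEHOLDER = "{{WEBROOT}}"


def render_feed(template: str, webroot: str) -> str:
    """Produce a usable feed by substituting the current webroot."""
    return template.replace(PLACEHOLDER, webroot.rstrip("/"))
```

A cron job or deploy script could call `render_feed` whenever the host or domain changes and write the result to feed.xml, so the stored template itself never needs editing.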
19:36 ^🔗		VADemon has quit IRC (left4dead)
19:38 ^🔗	cm	i couldn't find anything definitive saying RSS does not support relative links
19:38 ^🔗	cm	but there are at least a significant number of readers that dont support it
19:39 ^🔗	cm	and it makes sense that readers would not store the prefix of the URL used to fetch the rss feed, which would be necessary to determine the full url for a relative link
19:41 ^🔗	jodizzle	Hm, I don't know. It seems like if feed readers have to fetch feeds from web domains, they could keep track of those domains and use them to resolve relative links?
19:41 ^🔗	cm	yeah true
19:41 ^🔗	jodizzle	But if they don't do that, they don't do that. Maybe there's some more complexity to it that I'm not thinking of.
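For reference, resolving a relative enclosure link against the URL the feed was fetched from is a one-liner with Python's standard library. This only shows what a reader could do, not what any particular reader does; the function name `resolve_enclosure` is hypothetical.

```python
from urllib.parse import urljoin


def resolve_enclosure(feed_url: str, enclosure_url: str) -> str:
    """Resolve a possibly relative enclosure URL against the feed's own URL."""
    # Absolute enclosure URLs pass through unchanged.
    return urljoin(feed_url, enclosure_url)
```

So a reader that remembers it fetched `http://example.org/podcasts/feed.xml` could turn `episodes/ep1.mp3` into `http://example.org/podcasts/episodes/ep1.mp3`.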
19:42 ^🔗	cm	maybe i'll do a test
19:42 ^🔗		obskyr has quit IRC (Read error: Operation timed out)
19:43 ^🔗		sHATNER has quit IRC (Read error: Operation timed out)
19:43 ^🔗		omglolba- has quit IRC (Read error: Operation timed out)
19:46 ^🔗		omglolbah has joined #archiveteam-bs
19:46 ^🔗		closure has quit IRC (Read error: Operation timed out)
19:48 ^🔗		obskyr has joined #archiveteam-bs
19:50 ^🔗		closure has joined #archiveteam-bs
19:50 ^🔗		Maylay has quit IRC (Read error: Operation timed out)
19:50 ^🔗	cm	yeah my default podcasts app rejects items with relative enclosure links
19:51 ^🔗	cm	i.e. doesn't display them
19:52 ^🔗	nico_32	SketchCow: have a nice party & birthday!
19:52 ^🔗	jodizzle	cm: Unfortunate.
19:53 ^🔗	cm	now i guess warc files are basically an annotated transcript of an http response
19:54 ^🔗	cm	does warc have any way to refer to a standalone file?
19:54 ^🔗	cm	i.e. "the next thing the server sent was the contents of this file"
19:54 ^🔗	cm	with a pointer to a html or mp3 file on disk
19:56 ^🔗		Maylay has joined #archiveteam-bs
19:58 ^🔗	jodizzle	I think you could manufacture something like that, but typically you would just store the bytes in the WARC.
19:59 ^🔗	jodizzle	So a response warc would contain the response bytes and all the headers and metadata of that response
20:01 ^🔗	cm	yeah
20:01 ^🔗	cm	then to view the file you need a server side script to strip out the metadata
20:03 ^🔗	jodizzle	In a sense, yes. Though typically you'd be reading it with a library or using some toolkit that's built for reading WARC data.
20:05 ^🔗	jodizzle	But yes, I think I see what you mean. I think the links in the feed.xml you generate would basically have to route to some webserver endpoint which does the work necessary to read from the WARC.
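The "strip out the metadata" step such an endpoint would perform can be sketched like this, assuming it has already pulled the raw block of a WARC response record (HTTP status line, headers, blank line, body). The function name `split_http_response` is hypothetical, and the sketch does not undo chunked transfer encoding or content compression.

```python
from typing import Dict, Tuple


def split_http_response(block: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split a stored HTTP response into status line, headers and body."""
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body
```

The endpoint would then serve `body` with the stored `content-type`, which is the kind of work existing replay tools already do.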
20:07 ^🔗	cm	couldn't i use the warc server for the rss as well
20:10 ^🔗	jodizzle	What do you mean?
20:10 ^🔗		sHATNER has joined #archiveteam-bs
20:10 ^🔗		brayden has joined #archiveteam-bs
20:10 ^🔗	cm	pywb for example
20:10 ^🔗	cm	if im using it to browse a website, it will rewrite links to point to archived copies, right?
20:11 ^🔗		Laverne has joined #archiveteam-bs
20:13 ^🔗	cm	so couldnt i let pywb rewrite the links in the rss feed? or does it not do that
20:13 ^🔗	jodizzle	Does pywb have a browsing feature like that? I've only ever used it in the context of a local WBM to view WARCs I've generated separately.
20:13 ^🔗	jodizzle	If it does though, that sounds pretty cool
20:14 ^🔗	cm	idk actually
20:14 ^🔗	cm	i assumed it works like web.archive.org
20:14 ^🔗	cm	but come to think of it, web.archive.org doesn't do that for rss, only html
20:15 ^🔗	jodizzle	And in terms of how this would play with an RSS feed reader, I don't know.
20:31 ^🔗		dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
20:35 ^🔗		scorche has quit IRC (Read error: Operation timed out)
20:48 ^🔗		dashcloud has joined #archiveteam-bs
21:01 ^🔗		icedice has joined #archiveteam-bs
21:05 ^🔗		icedice has quit IRC (Client Quit)
21:17 ^🔗		scorche has joined #archiveteam-bs
21:37 ^🔗		scorche has quit IRC (Read error: Operation timed out)
21:57 ^🔗		lennier1 has joined #archiveteam-bs
22:01 ^🔗		paul2520 has quit IRC (Read error: Operation timed out)
22:01 ^🔗		Jake has quit IRC (Read error: Operation timed out)
22:01 ^🔗		endrift has quit IRC (Read error: Operation timed out)
22:02 ^🔗		britmob has quit IRC (Read error: Operation timed out)
22:02 ^🔗		paul2520 has joined #archiveteam-bs
22:02 ^🔗		Jake has joined #archiveteam-bs
22:02 ^🔗		endrift has joined #archiveteam-bs
22:02 ^🔗		britmob has joined #archiveteam-bs
22:02 ^🔗		systwi_ has joined #archiveteam-bs
22:03 ^🔗		Meli has joined #archiveteam-bs
22:03 ^🔗		Hecatz- has joined #archiveteam-bs
22:04 ^🔗		asdf01011 has quit IRC (Read error: Operation timed out)
22:04 ^🔗		scorche has joined #archiveteam-bs
22:05 ^🔗		voltagex_ has joined #archiveteam-bs
22:05 ^🔗		nico_32_ has joined #archiveteam-bs
22:05 ^🔗		Coderjo has joined #archiveteam-bs
22:05 ^🔗		systwi has quit IRC (Read error: Operation timed out)
22:06 ^🔗		colona_ has joined #archiveteam-bs
22:07 ^🔗		systwi_ is now known as systwi
22:07 ^🔗		AlsoJAA_ has joined #archiveteam-bs
22:07 ^🔗		JAA sets mode: +o AlsoJAA_
22:08 ^🔗		N4Y_ has joined #archiveteam-bs
22:09 ^🔗		nightpoo- has joined #archiveteam-bs
22:09 ^🔗		second_ has joined #archiveteam-bs
22:09 ^🔗		actually_ has joined #archiveteam-bs
22:12 ^🔗		obskyr has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		nepeat has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		apache2_ has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		Meli-sama has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		nightpool has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		PotcFdk has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		voltagex has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		mr_archiv has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		AlsoJAA has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		N4Y has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		nico_32 has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		N4Y_ is now known as N4Y
22:12 ^🔗		atg has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		Mateon1 has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		second has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		DFJustin has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		Flashfire has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		acridAxid has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		zhongfu has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		igloo25 has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		Coderjo_ has quit IRC (Ping timeout: 745 seconds)
22:12 ^🔗		step has quit IRC (Ping timeout: 745 seconds)
22:13 ^🔗		Hecatz has quit IRC (Ping timeout: 745 seconds)
22:13 ^🔗		Hecatz- is now known as Hecatz
22:13 ^🔗		colona has quit IRC (Ping timeout: 745 seconds)
22:14 ^🔗		Mateon1 has joined #archiveteam-bs
22:16 ^🔗		mr_archiv has joined #archiveteam-bs
22:52 ^🔗		BlueMax has joined #archiveteam-bs
23:35 ^🔗		Lord_Nigh has joined #archiveteam-bs
23:38 ^🔗		Lord_Nigh has quit IRC (Client Quit)
23:42 ^🔗		jshoard has quit IRC (Quit: Leaving)
23:44 ^🔗		Lord_Nigh has joined #archiveteam-bs
23:52 ^🔗		Arcorann has joined #archiveteam-bs
