#archiveteam-bs 2020-09-13,Sun


Time Nickname Message
00:24 🔗 jshoard has quit IRC (Quit: Leaving)
01:00 🔗 Arcorann has joined #archiveteam-bs
01:01 🔗 Arcorann has quit IRC (Remote host closed the connection)
01:01 🔗 Arcorann has joined #archiveteam-bs
01:38 🔗 scorche has quit IRC (Read error: Operation timed out)
01:48 🔗 scorche has joined #archiveteam-bs
01:53 🔗 Mateon1 has quit IRC (Ping timeout: 272 seconds)
01:53 🔗 Ctrl has quit IRC (Read error: Operation timed out)
01:54 🔗 Mateon1 has joined #archiveteam-bs
01:55 🔗 brayden has quit IRC (Ping timeout: 272 seconds)
01:55 🔗 Laverne has quit IRC (Ping timeout: 272 seconds)
02:29 🔗 Ctrl has joined #archiveteam-bs
02:38 🔗 brayden has joined #archiveteam-bs
02:38 🔗 Laverne has joined #archiveteam-bs
02:55 🔗 asdf01011 has quit IRC (Remote host closed the connection)
03:12 🔗 qw3rty_ has joined #archiveteam-bs
03:13 🔗 asdf01011 has joined #archiveteam-bs
03:15 🔗 qw3rty__ has quit IRC (Ping timeout: 265 seconds)
03:23 🔗 SketchCow My 50th birthday party tomorrow: http://50.textfiles.com
05:12 🔗 benjinsmi has quit IRC (Read error: Connection reset by peer)
06:32 🔗 larryv has quit IRC (Quit: larryv)
06:40 🔗 endrift has quit IRC (Read error: Operation timed out)
06:48 🔗 endrift has joined #archiveteam-bs
08:10 🔗 jshoard has joined #archiveteam-bs
08:48 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:04 🔗 scorche has quit IRC (Read error: Operation timed out)
11:35 🔗 benjins has joined #archiveteam-bs
11:46 🔗 scorche has joined #archiveteam-bs
12:11 🔗 VADemon has joined #archiveteam-bs
13:19 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
14:02 🔗 dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
14:03 🔗 dashcloud has joined #archiveteam-bs
14:21 🔗 nepeat has quit IRC (Quit: ZNC 1.7.5 - https://znc.in)
14:23 🔗 nepeat has joined #archiveteam-bs
14:36 🔗 Lord_Nigh has quit IRC (Read error: Operation timed out)
14:45 🔗 Lord_Nigh has joined #archiveteam-bs
14:55 🔗 Lord_Nigh has quit IRC (Ping timeout: 272 seconds)
14:57 🔗 Lord_Nigh has joined #archiveteam-bs
14:58 🔗 Lord_Nigh has quit IRC (Remote host closed the connection)
15:24 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
16:07 🔗 VADemon has joined #archiveteam-bs
16:35 🔗 godane has quit IRC (Read error: Connection reset by peer)
16:51 🔗 godane has joined #archiveteam-bs
17:50 🔗 DigiDigi has quit IRC (Remote host closed the connection)
18:16 🔗 DigiDigi has joined #archiveteam-bs
19:06 🔗 Laverne has quit IRC (Ping timeout: 272 seconds)
19:07 🔗 brayden has quit IRC (Ping timeout: 272 seconds)
19:13 🔗 jodizzle cm: So what kind of solution are you referring to?
19:14 🔗 cm well my naive approach is to wget all the enclosure files (mp3s)
19:14 🔗 cm for each item that i archive, i replace the enclosure url in the rss feed with a link to my own copy
19:14 🔗 cm but this is not portable
19:15 🔗 jodizzle Oh I see, so you're saying that you're creating your own podcast feed?
19:15 🔗 cm yeah
19:15 🔗 jodizzle Based on the original feed, but where items reference what you've already downloaded
19:16 🔗 cm yeah
19:16 🔗 jodizzle That's interesting
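[Editor's sketch] The approach cm describes (an archive feed whose items point at already-downloaded copies) could look roughly like the following. This is a minimal sketch, not cm's actual script: the function name `rewrite_enclosures`, the `archive_base` parameter, and the assumption that local copies sit under a wget-style `host/path` layout are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def rewrite_enclosures(feed_xml: str, archive_base: str) -> str:
    """Return the feed with every <enclosure url="..."> pointed at a local copy.

    Assumption: the local copy of http://host/path lives at
    <archive_base>/host/path, as in a wget directory tree.
    """
    root = ET.fromstring(feed_xml)
    base = archive_base.rstrip("/")
    for enclosure in root.iter("enclosure"):
        original = urlsplit(enclosure.get("url", ""))
        # Rebuild the URL as <archive_base>/<original host><original path>.
        enclosure.set("url", f"{base}/{original.netloc}{original.path}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    feed = (
        '<rss version="2.0"><channel><item>'
        '<enclosure url="http://example.com/ep/1.mp3" type="audio/mpeg" length="3"/>'
        "</item></channel></rss>"
    )
    print(rewrite_enclosures(feed, "https://my.example.org/archive/"))
```

The absolute `archive_base` baked into the output is exactly what makes the result non-portable, as discussed further down the log.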
19:16 🔗 cm but creating the archive feed could be a separate step if you have a well-defined archive format
19:17 🔗 jodizzle What do you mean by "archive format", in this case?
19:17 🔗 cm simply wgetting the rss and the enclosures doesn't work, because wget doesn't store the name of the url that was wgotten
19:18 🔗 cm maybe warc would be enough?
19:18 🔗 jodizzle What would you be putting in the WARCs, to be clear?
19:19 🔗 cm the mp3 files and rss i guess
19:19 🔗 cm warc remembers the url of the fetched file
19:21 🔗 jodizzle Yeah, I guess one approach would be to save all the podcasts downloaded into WARCs, so you'd get data + metadata for each. You could also save the rss feed at the time of the grab. (Basically what you said.)
19:21 🔗 cm i haven't thought through what could be bundled in a single warc file, and what would have to be separate
19:21 🔗 jodizzle Then, you can generate a CDX of the WARCs that acts as an index for which URLs you've gotten
19:22 🔗 cm cdx?
19:23 🔗 jodizzle https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
19:25 🔗 jodizzle Also, typically a WARC record is a single thing, in this case either one request or one response. But you can bundle records together into gzipped .warc.gz files.
19:26 🔗 jodizzle That's how they're usually stored in bulk.
19:26 🔗 jodizzle (Here are examples of different records, for reference: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-b-informative-examples-of-warc-records)
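[Editor's sketch] To make the record/bundle distinction concrete, here is a minimal hand-rolled sketch of a WARC/1.1 `response` record and of a `.warc.gz`-style bundle in which each record is its own gzip member. It is only an illustration under stated assumptions: the function names are hypothetical, only a handful of header fields are written, and in practice a dedicated tool (for example wget's `--warc-file` option) or a WARC library would produce the files.

```python
import gzip
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap raw HTTP response bytes (status line, headers, body) in one record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",  # this is how the URL is remembered
        "Content-Type: application/http;msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is: header block, blank line, content block, two CRLFs.
    return head + http_response + b"\r\n\r\n"


def bundle_records(records: list[bytes]) -> bytes:
    """Concatenate records, each compressed as a separate gzip member."""
    return b"".join(gzip.compress(record) for record in records)


if __name__ == "__main__":
    http = b"HTTP/1.1 200 OK\r\nContent-Type: audio/mpeg\r\nContent-Length: 3\r\n\r\nID3"
    rec = warc_response_record("http://example.com/ep/1.mp3", http)
    print(rec.decode("latin-1"))
    print(len(bundle_records([rec, rec])), "bytes when bundled")
```

Compressing record by record keeps each one independently readable, which is what lets an index such as a CDX point at a byte offset inside the bundle.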
19:27 🔗 cm so i guess the cleanest and most consistent way would be to treat the podcast rss as a web page, and crawl it like you would a website
19:27 🔗 cm then you get an archive of the rss and the content files
19:27 🔗 jodizzle But this is all archival details. I am a little confused by what you mean by "portable". What would a "portable feed archive" be?
19:28 🔗 cm the current format of my feed archives is a directory with feed.xml, and subdirectories written by wget containing the content
19:29 🔗 cm the whole directory is served by a webserver, and the feed.xml has links to my copies of the content
19:30 🔗 cm one way to make it portable would be to use relative links to the content, but i dont think that is possible in RSS
19:32 🔗 cm so i thought about putting a custom keyword in place of the webroot for each archived piece of content, then to make a usable RSS feed you would replace that keyword with whatever the webroot happens to be
19:33 🔗 jodizzle Oh, so you mean it's not currently portable because if the IP address/domain changes, the links in feed.xml will break?
19:33 🔗 cm yeah
19:34 🔗 jodizzle Okay, yeah. If you can't use relative links in RSS (is that definitely true?), then I don't think there's any way around something like what you described.
19:34 🔗 jodizzle You could have feed.xml created periodically by a cronjob or similar via a script that references a "webroot" setting
19:35 🔗 jodizzle Or something like that.
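[Editor's sketch] The placeholder idea could be sketched as below: the archive stores a feed template containing a keyword, and a small script (run by hand or from cron) writes the real feed.xml for whatever webroot is current. The placeholder string `@@WEBROOT@@` and the function and file names are hypothetical choices, not something from the discussion.

```python
from pathlib import Path

# Hypothetical marker stored in the archived feed in place of the webroot.
PLACEHOLDER = "@@WEBROOT@@"


def render_feed(template_text: str, webroot: str) -> str:
    """Substitute the current webroot for every placeholder occurrence."""
    return template_text.replace(PLACEHOLDER, webroot.rstrip("/"))


def write_feed(template_path: str, output_path: str, webroot: str) -> None:
    """Read the portable template and write a usable feed next to the content."""
    template = Path(template_path).read_text(encoding="utf-8")
    Path(output_path).write_text(render_feed(template, webroot), encoding="utf-8")


if __name__ == "__main__":
    sample = '<enclosure url="@@WEBROOT@@/example.com/ep/1.mp3" type="audio/mpeg"/>'
    print(render_feed(sample, "https://host.example/podcasts/"))
```

Only the template needs to travel with the archive; the rendered feed.xml can be regenerated whenever the domain or IP address changes.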
19:36 🔗 VADemon has quit IRC (left4dead)
19:38 🔗 cm i couldn't find anything definitive saying RSS does not support relative links
19:38 🔗 cm but there are at least a significant number of readers that dont support it
19:39 🔗 cm and it makes sense that readers would not store the prefix of the URL used to fetch the rss feed, which would be necessary to determine the full url for a relative link
19:41 🔗 jodizzle Hm, I don't know. It seems like if feed readers have to fetch feeds from web domains, they could keep track of those domains and use them to resolve relative links?
19:41 🔗 cm yeah true
19:41 🔗 jodizzle But if they don't do that, they don't do that. Maybe there's some more complexity to it that I'm not thinking of.
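[Editor's sketch] For reference, this is the kind of resolution a feed reader could perform if it kept the URL it fetched the feed from. It uses standard URL joining; whether any particular reader does this is not established here, and the function name and example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_enclosure(feed_url: str, enclosure_url: str) -> str:
    """Resolve a possibly-relative enclosure URL against the feed's own URL."""
    return urljoin(feed_url, enclosure_url)


if __name__ == "__main__":
    feed_url = "https://host.example/podcasts/feed.xml"
    # Relative link: resolved against the directory containing feed.xml.
    print(resolve_enclosure(feed_url, "example.com/ep/1.mp3"))
    # Absolute link: returned unchanged.
    print(resolve_enclosure(feed_url, "http://other.example/ep/2.mp3"))
```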
19:42 🔗 cm maybe i'll do a test
19:42 🔗 obskyr has quit IRC (Read error: Operation timed out)
19:43 🔗 sHATNER has quit IRC (Read error: Operation timed out)
19:43 🔗 omglolba- has quit IRC (Read error: Operation timed out)
19:46 🔗 omglolbah has joined #archiveteam-bs
19:46 🔗 closure has quit IRC (Read error: Operation timed out)
19:48 🔗 obskyr has joined #archiveteam-bs
19:50 🔗 closure has joined #archiveteam-bs
19:50 🔗 Maylay has quit IRC (Read error: Operation timed out)
19:50 🔗 cm yeah my default podcasts app rejects items with relative enclosure links
19:51 🔗 cm i.e. doesn't display them
19:52 🔗 nico_32 SketchCow: have a nice party & birthday!
19:52 🔗 jodizzle cm: Unfortunate.
19:53 🔗 cm now i guess warc files are basically an annotated transcript of an http response
19:54 🔗 cm does warc have any way to refer to a standalone file?
19:54 🔗 cm i.e. "the next thing the server sent was the contents of this file"
19:54 🔗 cm with a pointer to a html or mp3 file on disk
19:56 🔗 Maylay has joined #archiveteam-bs
19:58 🔗 jodizzle I think you could manufacture something like that, but typically you would just store the bytes in the WARC.
19:59 🔗 jodizzle So a response warc would contain the response bytes and all the headers and metadata of that response
20:01 🔗 cm yeah
20:01 🔗 cm then to view the file you need a server side script to strip out the metadata
20:03 🔗 jodizzle In a sense, yes. Though typically you'd be using a library or some toolkit that's built for reading WARC data.
20:05 🔗 jodizzle But yes, I think I see what you mean. I think the links in the feed.xml you generate would basically have to route to some webserver endpoint which does the work necessary to read from the WARC.
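[Editor's sketch] The "strip out the metadata" step that such an endpoint would perform can be sketched for a single uncompressed `response` record as follows. This is a simplified illustration with a hypothetical function name: it assumes the record has a Content-Length header and that the stored HTTP body is neither chunked nor content-encoded, cases a real WARC library would handle.

```python
def extract_payload(record: bytes) -> bytes:
    """Return the HTTP body stored inside one uncompressed WARC response record."""
    # WARC headers end at the first blank line; the content block follows.
    warc_head, _, rest = record.partition(b"\r\n\r\n")
    length = None
    for line in warc_head.split(b"\r\n")[1:]:  # skip the "WARC/1.x" version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("record has no Content-Length header")
    http_message = rest[:length]
    # Inside the block, HTTP headers and body are again split by a blank line.
    _http_head, _, body = http_message.partition(b"\r\n\r\n")
    return body


if __name__ == "__main__":
    http = b"HTTP/1.1 200 OK\r\nContent-Type: audio/mpeg\r\n\r\nID3"
    record = (
        b"WARC/1.1\r\nWARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/ep/1.mp3\r\n"
        b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n"
        + http + b"\r\n\r\n"
    )
    print(extract_payload(record))
```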
20:07 🔗 cm couldn't i use the warc server for the rss as well
20:10 🔗 jodizzle What do you mean?
20:10 🔗 sHATNER has joined #archiveteam-bs
20:10 🔗 brayden has joined #archiveteam-bs
20:10 🔗 cm pywb for example
20:10 🔗 cm if im using it to browse a website, it will rewrite links to point to archived copies, right?
20:11 🔗 Laverne has joined #archiveteam-bs
20:13 🔗 cm so couldnt i let pywb rewrite the links in the rss feed? or does it not do that
20:13 🔗 jodizzle Does pywb have a browsing feature like that? I've only ever used it in the context of a local WBM to view WARCs I've generated separately.
20:13 🔗 jodizzle If it does though, that sounds pretty cool
20:14 🔗 cm idk actually
20:14 🔗 cm i assumed it works like web.archive.org
20:14 🔗 cm but come to think of it, web.archive.org doesn't do that for rss, only html
20:15 🔗 jodizzle And in terms of how this would play with an RSS feed reader, I don't know.
20:31 🔗 dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
20:35 🔗 scorche has quit IRC (Read error: Operation timed out)
20:48 🔗 dashcloud has joined #archiveteam-bs
21:01 🔗 icedice has joined #archiveteam-bs
21:05 🔗 icedice has quit IRC (Client Quit)
21:17 🔗 scorche has joined #archiveteam-bs
21:37 🔗 scorche has quit IRC (Read error: Operation timed out)
21:57 🔗 lennier1 has joined #archiveteam-bs
22:01 🔗 paul2520 has quit IRC (Read error: Operation timed out)
22:01 🔗 Jake has quit IRC (Read error: Operation timed out)
22:01 🔗 endrift has quit IRC (Read error: Operation timed out)
22:02 🔗 britmob has quit IRC (Read error: Operation timed out)
22:02 🔗 paul2520 has joined #archiveteam-bs
22:02 🔗 Jake has joined #archiveteam-bs
22:02 🔗 endrift has joined #archiveteam-bs
22:02 🔗 britmob has joined #archiveteam-bs
22:02 🔗 systwi_ has joined #archiveteam-bs
22:03 🔗 Meli has joined #archiveteam-bs
22:03 🔗 Hecatz- has joined #archiveteam-bs
22:04 🔗 asdf01011 has quit IRC (Read error: Operation timed out)
22:04 🔗 scorche has joined #archiveteam-bs
22:05 🔗 voltagex_ has joined #archiveteam-bs
22:05 🔗 nico_32_ has joined #archiveteam-bs
22:05 🔗 Coderjo has joined #archiveteam-bs
22:05 🔗 systwi has quit IRC (Read error: Operation timed out)
22:06 🔗 colona_ has joined #archiveteam-bs
22:07 🔗 systwi_ is now known as systwi
22:07 🔗 AlsoJAA_ has joined #archiveteam-bs
22:07 🔗 JAA sets mode: +o AlsoJAA_
22:08 🔗 N4Y_ has joined #archiveteam-bs
22:09 🔗 nightpoo- has joined #archiveteam-bs
22:09 🔗 second_ has joined #archiveteam-bs
22:09 🔗 actually_ has joined #archiveteam-bs
22:12 🔗 obskyr has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 nepeat has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 apache2_ has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 Meli-sama has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 nightpool has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 PotcFdk has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 voltagex has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 mr_archiv has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 AlsoJAA has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 N4Y has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 nico_32 has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 N4Y_ is now known as N4Y
22:12 🔗 atg has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 Mateon1 has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 second has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 DFJustin has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 Flashfire has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 acridAxid has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 zhongfu has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 igloo25 has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 Coderjo_ has quit IRC (Ping timeout: 745 seconds)
22:12 🔗 step has quit IRC (Ping timeout: 745 seconds)
22:13 🔗 Hecatz has quit IRC (Ping timeout: 745 seconds)
22:13 🔗 Hecatz- is now known as Hecatz
22:13 🔗 colona has quit IRC (Ping timeout: 745 seconds)
22:14 🔗 Mateon1 has joined #archiveteam-bs
22:16 🔗 mr_archiv has joined #archiveteam-bs
22:52 🔗 BlueMax has joined #archiveteam-bs
23:35 🔗 Lord_Nigh has joined #archiveteam-bs
23:38 🔗 Lord_Nigh has quit IRC (Client Quit)
23:42 🔗 jshoard has quit IRC (Quit: Leaving)
23:44 🔗 Lord_Nigh has joined #archiveteam-bs
23:52 🔗 Arcorann has joined #archiveteam-bs
