[00:24] *** jshoard has quit IRC (Quit: Leaving)
[01:00] *** Arcorann has joined #archiveteam-bs
[01:01] *** Arcorann has quit IRC (Remote host closed the connection)
[01:01] *** Arcorann has joined #archiveteam-bs
[01:38] *** scorche has quit IRC (Read error: Operation timed out)
[01:48] *** scorche has joined #archiveteam-bs
[01:53] *** Mateon1 has quit IRC (Ping timeout: 272 seconds)
[01:53] *** Ctrl has quit IRC (Read error: Operation timed out)
[01:54] *** Mateon1 has joined #archiveteam-bs
[01:55] *** brayden has quit IRC (Ping timeout: 272 seconds)
[01:55] *** Laverne has quit IRC (Ping timeout: 272 seconds)
[02:29] *** Ctrl has joined #archiveteam-bs
[02:38] *** brayden has joined #archiveteam-bs
[02:38] *** Laverne has joined #archiveteam-bs
[02:55] *** asdf01011 has quit IRC (Remote host closed the connection)
[03:12] *** qw3rty_ has joined #archiveteam-bs
[03:13] *** asdf01011 has joined #archiveteam-bs
[03:15] *** qw3rty__ has quit IRC (Ping timeout: 265 seconds)
[03:23] My 50th birthday party tomorrow: http://50.textfiles.com
[05:12] *** benjinsmi has quit IRC (Read error: Connection reset by peer)
[06:32] *** larryv has quit IRC (Quit: larryv)
[06:40] *** endrift has quit IRC (Read error: Operation timed out)
[06:48] *** endrift has joined #archiveteam-bs
[08:10] *** jshoard has joined #archiveteam-bs
[08:48] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:04] *** scorche has quit IRC (Read error: Operation timed out)
[11:35] *** benjins has joined #archiveteam-bs
[11:46] *** scorche has joined #archiveteam-bs
[12:11] *** VADemon has joined #archiveteam-bs
[13:19] *** VADemon has quit IRC (Read error: Connection reset by peer)
[14:02] *** dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[14:03] *** dashcloud has joined #archiveteam-bs
[14:21] *** nepeat has quit IRC (Quit: ZNC 1.7.5 - https://znc.in)
[14:23] *** nepeat has joined #archiveteam-bs
[14:36] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[14:45] *** Lord_Nigh has joined #archiveteam-bs
[14:55] *** Lord_Nigh has quit IRC (Ping timeout: 272 seconds)
[14:57] *** Lord_Nigh has joined #archiveteam-bs
[14:58] *** Lord_Nigh has quit IRC (Remote host closed the connection)
[15:24] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[16:07] *** VADemon has joined #archiveteam-bs
[16:35] *** godane has quit IRC (Read error: Connection reset by peer)
[16:51] *** godane has joined #archiveteam-bs
[17:50] *** DigiDigi has quit IRC (Remote host closed the connection)
[18:16] *** DigiDigi has joined #archiveteam-bs
[19:06] *** Laverne has quit IRC (Ping timeout: 272 seconds)
[19:07] *** brayden has quit IRC (Ping timeout: 272 seconds)
[19:13] cm: So what kind of solution are you referring to?
[19:14] well my naive approach is to wget all the enclosure files (mp3s)
[19:14] for each item that i archive, i replace the enclosure url in the rss feed with a link to my own copy
[19:14] but this is not portable
[19:15] Oh I see, so you're saying that you're creating your own podcast feed?
[19:15] yeah
[19:15] Based on the original feed, but where items reference what you've already downloaded
[19:16] yeah
[19:16] That's interesting
[19:16] but creating the archive feed could be a separate step if you have a well-defined archive format
[19:17] What do you mean by "archive format", in this case?
[19:17] simply wgetting the rss and the enclosures doesn't work, because wget doesn't store the name of the url that was wgotten
[19:18] maybe warc would be enough?
[19:18] What would you be putting in the WARCs, to be clear?
[19:19] the mp3 files and rss i guess
[19:19] warc remembers the url of the fetched file
[19:21] Yeah, I guess one approach would be to save all the podcasts downloaded into WARCs, so you'd get data + metadata for each. You could also save the rss feed at the time of the grab. (Basically what you said.)
[19:21] i haven't thought through what could be bundled in a single warc file, and what would have to be separate
[19:21] Then, you can generate a CDX of the WARCs that acts as an index for which URLs you've gotten
[19:22] cdx?
[19:23] https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
[19:25] Also, typically a WARC record is a single thing, in this case either one request or one response. But you can bundle records together into gzipped .warc.gz files.
[19:26] That's how they're usually stored in bulk.
[19:26] (Here are examples of different records, for reference: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-b-informative-examples-of-warc-records)
[19:27] so i guess the cleanest and most consistent way would be to treat the podcast rss as a web page, and crawl it like you would a website
[19:27] then you get an archive of the rss and the content files
[19:27] But this is all archival details. I am a little confused by what you mean by "portable". What would a "portable feed archive" be?
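The idea above (store each fetched file together with its URL, then build an index of which URLs you've gotten) can be sketched roughly as follows. This is only an illustration under stated assumptions: it writes uncompressed WARC `resource` records by hand with the standard library, whereas a real crawl would normally produce `response` records including the HTTP headers via a dedicated tool. The index is a plain URL-to-offset dict, not the CDX format linked above, and the names `make_record` and `index_records` and the example.org URLs are hypothetical.

```python
import io
import uuid
from datetime import datetime, timezone


def make_record(url: str, payload: bytes, content_type: str) -> bytes:
    """Build one uncompressed WARC 'resource' record for payload fetched from url."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record ends with two CRLFs after the payload.
    return head + payload + b"\r\n\r\n"


def index_records(stream) -> dict:
    """Map each WARC-Target-URI to the byte offset where its record starts."""
    index = {}
    while True:
        offset = stream.tell()
        if not stream.readline():  # version line; empty bytes means end of file
            break
        fields = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b""):  # blank line closes the header block
                break
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip().lower()] = value.strip()
        # Skip the payload plus the trailing CRLF CRLF.
        stream.seek(int(fields["content-length"]) + 4, io.SEEK_CUR)
        if "warc-target-uri" in fields:
            index[fields["warc-target-uri"]] = offset
    return index


if __name__ == "__main__":
    # Hypothetical URLs and payloads, for illustration only.
    feed = make_record("https://example.org/feed.xml", b"<rss/>", "application/rss+xml")
    mp3 = make_record("https://example.org/ep1.mp3", b"ID3\x00", "audio/mpeg")
    for url, offset in index_records(io.BytesIO(feed + mp3)).items():
        print(offset, url)
```

With an index like this, a later step can tell which enclosure URLs from the feed are already archived and where to find their bytes.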
[19:28] the current format of my feed archives is a directory with feed.xml, and subdirectories written by wget containing the content
[19:29] the whole directory is served by a webserver, and the feed.xml has links to my copies of the content
[19:30] one way to make it portable would be to use relative links to the content, but i dont think that is possible in RSS
[19:32] so i thought about putting a custom keyword in place of the webroot for each archived piece of content, then to make a usable RSS feed you would replace that keyword with whatever the webroot happens to be
[19:33] Oh, so you mean it's not currently portable because if the IP address/domain changes, the links in feed.xml will break?
[19:33] yeah
[19:34] Okay, yeah. If you can't use relative links in RSS (is that definitely true?), then I don't think there's any way around something like what you described.
[19:34] You could have feed.xml created periodically by a cronjob or similar via a script that references a "webroot" setting
[19:35] Or something like that.
[19:36] *** VADemon has quit IRC (left4dead)
[19:38] i couldn't find anything definitive saying RSS does not support relative links
[19:38] but there are at least a significant number of readers that dont support it
[19:39] and it makes sense that readers would not store the prefix of the URL used to fetch the rss feed, which would be necessary to determine the full url for a relative link
[19:41] Hm, I don't know. It seems like if feed readers have to fetch feeds from web domains, they could keep track of those domains and use them to resolve relative links?
[19:41] yeah true
[19:41] But if they don't do that, they don't do that. Maybe there's some more complexity to it that I'm not thinking of.
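A minimal sketch of the two ideas discussed above, assuming the stored feed uses a placeholder token where the webroot would go. The token `@@WEBROOT@@`, the function names, and the example.org URLs are all hypothetical. `render_feed` is what a cronjob-style script could do with a "webroot" setting, and `absolutize` shows how a feed reader could resolve a relative enclosure link if it kept track of the URL it fetched the feed from.

```python
from urllib.parse import urljoin

# Hypothetical placeholder; the discussion only says "a custom keyword".
PLACEHOLDER = "@@WEBROOT@@"


def render_feed(template_xml: str, webroot: str) -> str:
    """Substitute the current webroot (trailing slash dropped) for the placeholder."""
    return template_xml.replace(PLACEHOLDER, webroot.rstrip("/"))


def absolutize(feed_url: str, enclosure_url: str) -> str:
    """Resolve a possibly relative enclosure URL against the feed's own URL."""
    return urljoin(feed_url, enclosure_url)


if __name__ == "__main__":
    template = '<enclosure url="@@WEBROOT@@/ep1/audio.mp3" type="audio/mpeg"/>'
    print(render_feed(template, "https://archive.example.org/podcast/"))
    print(absolutize("https://archive.example.org/podcast/feed.xml", "ep1/audio.mp3"))
```

Plain string replacement keeps the stored feed.xml untouched apart from the token, so the same archive directory can be served from any host by re-rendering the feed.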
[19:42] maybe i'll do a test
[19:42] *** obskyr has quit IRC (Read error: Operation timed out)
[19:43] *** sHATNER has quit IRC (Read error: Operation timed out)
[19:43] *** omglolba- has quit IRC (Read error: Operation timed out)
[19:46] *** omglolbah has joined #archiveteam-bs
[19:46] *** closure has quit IRC (Read error: Operation timed out)
[19:48] *** obskyr has joined #archiveteam-bs
[19:50] *** closure has joined #archiveteam-bs
[19:50] *** Maylay has quit IRC (Read error: Operation timed out)
[19:50] yeah my default podcasts app rejects items with relative enclosure links
[19:51] i.e. doesn't display them
[19:52] SketchCow: have a nice party & birthday!
[19:52] cm: Unfortunate.
[19:53] now i guess warc files are basically an annotated transcript of an http response
[19:54] does warc have any way to refer to a standalone file?
[19:54] i.e. "the next thing the server sent was the contents of this file"
[19:54] with a pointer to a html or mp3 file on disk
[19:56] *** Maylay has joined #archiveteam-bs
[19:58] I think you could manufacture something like that, but typically you would just store the bytes in the WARC.
[19:59] So a response warc would contain the response bytes and all the headers and metadata of that response
[20:01] yeah
[20:01] then to view the file you need a server side script to strip out the metadata
[20:03] In a sense, yes. Though typically you'd be reading it with a library or using some toolkit that's built for reading WARC data.
[20:05] But yes, I think I see what you mean. I think the links in the feed.xml you generate would basically have to route to some webserver endpoint which does the work necessary to read from the WARC.
[20:07] couldn't i use the warc server for the rss as well
[20:10] What do you mean?
[20:10] *** sHATNER has joined #archiveteam-bs
[20:10] *** brayden has joined #archiveteam-bs
[20:10] pywb for example
[20:10] if im using it to browse a website, it will rewrite links to point to archived copies, right?
[20:11] *** Laverne has joined #archiveteam-bs
[20:13] so couldnt i let pywb rewrite the links in the rss feed? or does it not do that
[20:13] Does pywb have a browsing feature like that? I've only ever used it in the context of a local WBM to view WARCs I've generated separately.
[20:13] If it does though, that sounds pretty cool
[20:14] idk actually
[20:14] i assumed it works like web.archive.org
[20:14] but come to think of it, web.archive.org doesn't do that for rss, only html
[20:15] And in terms of how this would play with an RSS feed reader, I don't know.
[20:31] *** dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[20:35] *** scorche has quit IRC (Read error: Operation timed out)
[20:48] *** dashcloud has joined #archiveteam-bs
[21:01] *** icedice has joined #archiveteam-bs
[21:05] *** icedice has quit IRC (Client Quit)
[21:17] *** scorche has joined #archiveteam-bs
[21:37] *** scorche has quit IRC (Read error: Operation timed out)
[21:57] *** lennier1 has joined #archiveteam-bs
[22:01] *** paul2520 has quit IRC (Read error: Operation timed out)
[22:01] *** Jake has quit IRC (Read error: Operation timed out)
[22:01] *** endrift has quit IRC (Read error: Operation timed out)
[22:02] *** britmob has quit IRC (Read error: Operation timed out)
[22:02] *** paul2520 has joined #archiveteam-bs
[22:02] *** Jake has joined #archiveteam-bs
[22:02] *** endrift has joined #archiveteam-bs
[22:02] *** britmob has joined #archiveteam-bs
[22:02] *** systwi_ has joined #archiveteam-bs
[22:03] *** Meli has joined #archiveteam-bs
[22:03] *** Hecatz- has joined #archiveteam-bs
[22:04] *** asdf01011 has quit IRC (Read error: Operation timed out)
[22:04] *** scorche has joined #archiveteam-bs
[22:05] *** voltagex_ has joined #archiveteam-bs
[22:05] *** nico_32_ has joined #archiveteam-bs
[22:05] *** Coderjo has joined #archiveteam-bs
[22:05] *** systwi has quit IRC (Read error: Operation timed out)
[22:06] *** colona_ has joined #archiveteam-bs
[22:07] *** systwi_ is now known as systwi
[22:07] *** AlsoJAA_ has joined #archiveteam-bs
[22:07] *** JAA sets mode: +o AlsoJAA_
[22:08] *** N4Y_ has joined #archiveteam-bs
[22:09] *** nightpoo- has joined #archiveteam-bs
[22:09] *** second_ has joined #archiveteam-bs
[22:09] *** actually_ has joined #archiveteam-bs
[22:12] *** obskyr has quit IRC (Ping timeout: 745 seconds)
[22:12] *** nepeat has quit IRC (Ping timeout: 745 seconds)
[22:12] *** apache2_ has quit IRC (Ping timeout: 745 seconds)
[22:12] *** Meli-sama has quit IRC (Ping timeout: 745 seconds)
[22:12] *** nightpool has quit IRC (Ping timeout: 745 seconds)
[22:12] *** PotcFdk has quit IRC (Ping timeout: 745 seconds)
[22:12] *** voltagex has quit IRC (Ping timeout: 745 seconds)
[22:12] *** mr_archiv has quit IRC (Ping timeout: 745 seconds)
[22:12] *** AlsoJAA has quit IRC (Ping timeout: 745 seconds)
[22:12] *** N4Y has quit IRC (Ping timeout: 745 seconds)
[22:12] *** nico_32 has quit IRC (Ping timeout: 745 seconds)
[22:12] *** N4Y_ is now known as N4Y
[22:12] *** atg has quit IRC (Ping timeout: 745 seconds)
[22:12] *** Mateon1 has quit IRC (Ping timeout: 745 seconds)
[22:12] *** second has quit IRC (Ping timeout: 745 seconds)
[22:12] *** DFJustin has quit IRC (Ping timeout: 745 seconds)
[22:12] *** Flashfire has quit IRC (Ping timeout: 745 seconds)
[22:12] *** acridAxid has quit IRC (Ping timeout: 745 seconds)
[22:12] *** zhongfu has quit IRC (Ping timeout: 745 seconds)
[22:12] *** igloo25 has quit IRC (Ping timeout: 745 seconds)
[22:12] *** Coderjo_ has quit IRC (Ping timeout: 745 seconds)
[22:12] *** step has quit IRC (Ping timeout: 745 seconds)
[22:13] *** Hecatz has quit IRC (Ping timeout: 745 seconds)
[22:13] *** Hecatz- is now known as Hecatz
[22:13] *** colona has quit IRC (Ping timeout: 745 seconds)
[22:14] *** Mateon1 has joined #archiveteam-bs
[22:16] *** mr_archiv has joined #archiveteam-bs
[22:52] *** BlueMax has joined #archiveteam-bs
[23:35] *** Lord_Nigh has joined #archiveteam-bs
[23:38] *** Lord_Nigh has quit IRC (Client Quit)
[23:42] *** jshoard has quit IRC (Quit: Leaving)
[23:44] *** Lord_Nigh has joined #archiveteam-bs
[23:52] *** Arcorann has joined #archiveteam-bs