Time |
Nickname |
Message |
00:24
🔗
|
|
jshoard has quit IRC (Quit: Leaving) |
01:00
🔗
|
|
Arcorann has joined #archiveteam-bs |
01:01
🔗
|
|
Arcorann has quit IRC (Remote host closed the connection) |
01:01
🔗
|
|
Arcorann has joined #archiveteam-bs |
01:38
🔗
|
|
scorche has quit IRC (Read error: Operation timed out) |
01:48
🔗
|
|
scorche has joined #archiveteam-bs |
01:53
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 272 seconds) |
01:53
🔗
|
|
Ctrl has quit IRC (Read error: Operation timed out) |
01:54
🔗
|
|
Mateon1 has joined #archiveteam-bs |
01:55
🔗
|
|
brayden has quit IRC (Ping timeout: 272 seconds) |
01:55
🔗
|
|
Laverne has quit IRC (Ping timeout: 272 seconds) |
02:29
🔗
|
|
Ctrl has joined #archiveteam-bs |
02:38
🔗
|
|
brayden has joined #archiveteam-bs |
02:38
🔗
|
|
Laverne has joined #archiveteam-bs |
02:55
🔗
|
|
asdf01011 has quit IRC (Remote host closed the connection) |
03:12
🔗
|
|
qw3rty_ has joined #archiveteam-bs |
03:13
🔗
|
|
asdf01011 has joined #archiveteam-bs |
03:15
🔗
|
|
qw3rty__ has quit IRC (Ping timeout: 265 seconds) |
03:23
🔗
|
SketchCow |
My 50th birthday party tomorrow: http://50.textfiles.com |
05:12
🔗
|
|
benjinsmi has quit IRC (Read error: Connection reset by peer) |
06:32
🔗
|
|
larryv has quit IRC (Quit: larryv) |
06:40
🔗
|
|
endrift has quit IRC (Read error: Operation timed out) |
06:48
🔗
|
|
endrift has joined #archiveteam-bs |
08:10
🔗
|
|
jshoard has joined #archiveteam-bs |
08:48
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
11:04
🔗
|
|
scorche has quit IRC (Read error: Operation timed out) |
11:35
🔗
|
|
benjins has joined #archiveteam-bs |
11:46
🔗
|
|
scorche has joined #archiveteam-bs |
12:11
🔗
|
|
VADemon has joined #archiveteam-bs |
13:19
🔗
|
|
VADemon has quit IRC (Read error: Connection reset by peer) |
14:02
🔗
|
|
dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
14:03
🔗
|
|
dashcloud has joined #archiveteam-bs |
14:21
🔗
|
|
nepeat has quit IRC (Quit: ZNC 1.7.5 - https://znc.in) |
14:23
🔗
|
|
nepeat has joined #archiveteam-bs |
14:36
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
14:45
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
14:55
🔗
|
|
Lord_Nigh has quit IRC (Ping timeout: 272 seconds) |
14:57
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
14:58
🔗
|
|
Lord_Nigh has quit IRC (Remote host closed the connection) |
15:24
🔗
|
|
Arcorann has quit IRC (Read error: Connection reset by peer) |
16:07
🔗
|
|
VADemon has joined #archiveteam-bs |
16:35
🔗
|
|
godane has quit IRC (Read error: Connection reset by peer) |
16:51
🔗
|
|
godane has joined #archiveteam-bs |
17:50
🔗
|
|
DigiDigi has quit IRC (Remote host closed the connection) |
18:16
🔗
|
|
DigiDigi has joined #archiveteam-bs |
19:06
🔗
|
|
Laverne has quit IRC (Ping timeout: 272 seconds) |
19:07
🔗
|
|
brayden has quit IRC (Ping timeout: 272 seconds) |
19:13
🔗
|
jodizzle |
cm: So what kind of solution are you referring to? |
19:14
🔗
|
cm |
well my naive approach is to wget all the enclosure files (mp3s) |
19:14
🔗
|
cm |
for each item that i archive, i replace the enclosure url in the rss feed with a link to my own copy |
19:14
🔗
|
cm |
but this is not portable |
19:15
🔗
|
jodizzle |
Oh I see, so you're saying that you're creating your own podcast feed? |
19:15
🔗
|
cm |
yeah |
19:15
🔗
|
jodizzle |
Based on the original feed, but where items reference what you've already downloaded |
19:16
🔗
|
cm |
yeah |
19:16
🔗
|
jodizzle |
That's interesting |
19:16
🔗
|
cm |
but creating the archive feed could be a separate step if you have a well-defined archive format |
19:17
🔗
|
jodizzle |
What do you mean by "archive format", in this case? |
19:17
🔗
|
cm |
simply wgetting the rss and the enclosures doesn't work, because wget doesn't store the name of the url that was wgotten |
19:18
🔗
|
cm |
maybe warc would be enough? |
19:18
🔗
|
jodizzle |
What would you be putting in the WARCs, to be clear? |
19:19
🔗
|
cm |
the mp3 files and rss i guess |
19:19
🔗
|
cm |
warc remembers the url of the fetched file |
19:21
🔗
|
jodizzle |
Yeah, I guess one approach would be to save all the podcasts downloaded into WARCs, so you'd get data + metadata for each. You could also save the rss feed at the time of the grab. (Basically what you said.) |
19:21
🔗
|
cm |
i haven't thought through what could be bundled in a single warc file, and what would have to be separate |
19:21
🔗
|
jodizzle |
Then, you can generate a CDX of the WARCs that acts as an index for which URLs you've gotten |
19:22
🔗
|
cm |
cdx? |
19:23
🔗
|
jodizzle |
https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/ |
19:25
🔗
|
jodizzle |
Also, typically a WARC is a single thing, in this case, either one request or response. But you can bundle WARCs together into gzipped .warc.gz files. |
19:26
🔗
|
jodizzle |
That's how they're usually stored in bulk. |
19:26
🔗
|
jodizzle |
(Here are examples of different records, for reference: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#annex-b-informative-examples-of-warc-records) |
19:27
🔗
|
cm |
so i guess the cleanest and most consistent way would be to treat the podcast rss as a web page, and crawl it like you would a website |
19:27
🔗
|
cm |
then you get an archive of the rss and the content files |
19:27
🔗
|
jodizzle |
But this is all archival details. I am a little confused by what you mean by "portable". What would a "portable feed archive" be? |
19:28
🔗
|
cm |
the current format of my feed archives is a directory with feed.xml, and subdirectories written by wget containing the content |
19:29
🔗
|
cm |
the whole directory is served by a webserver, and the feed.xml has links to my copies of the content |
19:30
🔗
|
cm |
one way to make it portable would be to use relative links to the content, but i dont think that is possible in RSS |
19:32
🔗
|
cm |
so i thought about putting a custom keyword in place of the webroot for each archived piece of content, then to make a usable RSS feed you would replace that keyword with whatever the webroot happens to be |
19:33
🔗
|
jodizzle |
Oh, so you mean it's not currently portable because if the IP address/domain changes, the links in feed.xml will break? |
19:33
🔗
|
cm |
yeah |
19:34
🔗
|
jodizzle |
Okay, yeah. If you can't use relative links in RSS (is that definitely true?), then I don't think there's any way around something like what you described. |
19:34
🔗
|
jodizzle |
You could have feed.xml created periodically by a cronjob or similar via a script that references a "webroot" setting |
19:35
🔗
|
jodizzle |
Or something like that. |
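The placeholder idea discussed above can be sketched roughly as follows. This is only an illustration under assumptions: the `{{WEBROOT}}` keyword, the sample feed, and the `render_feed` helper are hypothetical names, not anything cm or jodizzle actually used.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword stored in the archived feed in place of the webroot.
PLACEHOLDER = "{{WEBROOT}}"

# Hypothetical archived feed: enclosure URLs start with the placeholder.
ARCHIVED_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="{{WEBROOT}}/media/ep1.mp3" length="123" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def render_feed(archived_xml: str, webroot: str) -> str:
    """Return a usable feed with the placeholder swapped for the real webroot."""
    root = ET.fromstring(archived_xml)
    base = webroot.rstrip("/")
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url", "")
        if url.startswith(PLACEHOLDER):
            # Keep the path after the placeholder, prepend the current webroot.
            enclosure.set("url", base + url[len(PLACEHOLDER):])
    return ET.tostring(root, encoding="unicode")


rendered = render_feed(ARCHIVED_FEED, "https://archive.example.org/podcasts/")
print(rendered)
```

A cronjob or a request handler could call something like `render_feed` with whatever webroot is configured, so the stored archive itself never contains a host name.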
19:36
🔗
|
|
VADemon has quit IRC (left4dead) |
19:38
🔗
|
cm |
i couldn't find anything definitive saying RSS does not support relative links |
19:38
🔗
|
cm |
but there are at least a significant number of readers that dont support it |
19:39
🔗
|
cm |
and it makes sense that readers would not store the prefix of the URL used to fetch the rss feed, which would be necessary to determine the full url for a relative link |
19:41
🔗
|
jodizzle |
Hm, I don't know. It seems like if feed readers have to fetch feeds from web domains, they could keep track of those domains and use them to resolve relative links? |
19:41
🔗
|
cm |
yeah true |
19:41
🔗
|
jodizzle |
But if they don't do that, they don't do that. Maybe there's some more complexity to it that I'm not thinking of. |
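For reference, resolving a relative enclosure link against the URL the feed was fetched from is a standard operation. Here is a minimal sketch using Python's `urllib.parse.urljoin`; the feed URL and paths are made-up examples, and whether any particular feed reader does this is not something this snippet can show.

```python
from urllib.parse import urljoin

# Hypothetical URL that a reader used to fetch the feed.
feed_url = "https://archive.example.org/podcasts/show/feed.xml"


def resolve_enclosure(feed_url: str, enclosure_url: str) -> str:
    """Resolve an enclosure URL (relative or absolute) against the feed URL."""
    return urljoin(feed_url, enclosure_url)


# Relative to the feed's directory.
relative = resolve_enclosure(feed_url, "media/ep1.mp3")
# Relative to the server root.
rooted = resolve_enclosure(feed_url, "/media/ep1.mp3")
# Already absolute: returned unchanged.
absolute = resolve_enclosure(feed_url, "https://cdn.example.com/ep1.mp3")

print(relative)
print(rooted)
print(absolute)
```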
19:42
🔗
|
cm |
maybe i'll do a test |
19:42
🔗
|
|
obskyr has quit IRC (Read error: Operation timed out) |
19:43
🔗
|
|
sHATNER has quit IRC (Read error: Operation timed out) |
19:43
🔗
|
|
omglolba- has quit IRC (Read error: Operation timed out) |
19:46
🔗
|
|
omglolbah has joined #archiveteam-bs |
19:46
🔗
|
|
closure has quit IRC (Read error: Operation timed out) |
19:48
🔗
|
|
obskyr has joined #archiveteam-bs |
19:50
🔗
|
|
closure has joined #archiveteam-bs |
19:50
🔗
|
|
Maylay has quit IRC (Read error: Operation timed out) |
19:50
🔗
|
cm |
yeah my default podcasts app rejects items with relative enclosure links |
19:51
🔗
|
cm |
i.e. doesn't display them |
19:52
🔗
|
nico_32 |
SketchCow: have a nice party & birthday! |
19:52
🔗
|
jodizzle |
cm: Unfortunate. |
19:53
🔗
|
cm |
now i guess warc files are basically an annotated transcript of an http response |
19:54
🔗
|
cm |
does warc have any way to refer to a standalone file? |
19:54
🔗
|
cm |
i.e. "the next thing the server sent was the contents of this file" |
19:54
🔗
|
cm |
with a pointer to a html or mp3 file on disk |
19:56
🔗
|
|
Maylay has joined #archiveteam-bs |
19:58
🔗
|
jodizzle |
I think you could manufacture something like that, but typically you would just store the bytes in the WARC. |
19:59
🔗
|
jodizzle |
So a response warc would contain the response bytes and all the headers and metadata of that response |
20:01
🔗
|
cm |
yeah |
20:01
🔗
|
cm |
then to view the file you need a server side script to strip out the metadata |
20:03
🔗
|
jodizzle |
In a sense, yes. Though typically you'd be reading it with a library or using some toolkit that's built for reading WARC data. |
20:05
🔗
|
jodizzle |
But yes, I think I see what you mean. I think the links in the feed.xml you generate would basically have to route to some webserver endpoint which does the work necessary to read from the WARC. |
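To make "strip out the metadata" concrete, here is a rough sketch of what such an endpoint would do with one uncompressed WARC response record: read the WARC headers (which carry the original URL), then drop the HTTP headers to get the file bytes. The sample record is hand-built and simplified, and `split_warc_record` is a hypothetical helper. In practice a dedicated WARC library or a replay tool such as pywb would handle this, including gzip and edge cases this sketch ignores.

```python
def split_warc_record(record: bytes):
    """Split one uncompressed WARC response record into (target_uri, http_body)."""
    # WARC headers end at the first blank line; the rest is the record block.
    warc_head, _, block = record.partition(b"\r\n\r\n")
    headers = {}
    for line in warc_head.split(b"\r\n")[1:]:  # skip the "WARC/1.1" version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # Content-Length gives the size of the block (the full HTTP message).
    length = int(headers[b"content-length"])
    http_message = block[:length]
    # Strip the HTTP status line and headers to leave only the payload bytes.
    _http_head, _, body = http_message.partition(b"\r\n\r\n")
    return headers[b"warc-target-uri"].decode(), body


# Hand-built sample: a fake HTTP response wrapped in a WARC response record.
http_message = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: audio/mpeg\r\n"
    b"\r\n"
    b"fake mp3 bytes"
)
record = (
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Date: 2020-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Target-URI: https://example.org/media/ep1.mp3\r\n"
    b"Content-Type: application/http;msgtype=response\r\n"
    b"Content-Length: " + str(len(http_message)).encode() + b"\r\n"
    b"\r\n" + http_message + b"\r\n\r\n"
)

uri, body = split_warc_record(record)
print(uri, body)
```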
20:07
🔗
|
cm |
couldn't i use the warc server for the rss as well |
20:10
🔗
|
jodizzle |
What do you mean? |
20:10
🔗
|
|
sHATNER has joined #archiveteam-bs |
20:10
🔗
|
|
brayden has joined #archiveteam-bs |
20:10
🔗
|
cm |
pywb for example |
20:10
🔗
|
cm |
if im using it to browse a website, it will rewrite links to point to archived copies, right? |
20:11
🔗
|
|
Laverne has joined #archiveteam-bs |
20:13
🔗
|
cm |
so couldnt i let pywb rewrite the links in the rss feed? or does it not do that |
20:13
🔗
|
jodizzle |
Does pywb have a browsing feature like that? I've only ever used it in the context of a local WBM to view WARCs I've generated separately. |
20:13
🔗
|
jodizzle |
If it does though, that sounds pretty cool |
20:14
🔗
|
cm |
idk actually |
20:14
🔗
|
cm |
i assumed it works like web.archive.org |
20:14
🔗
|
cm |
but come to think of it, web.archive.org doesn't do that for rss, only html |
20:15
🔗
|
jodizzle |
And in terms of how this would play with an RSS feed reader, I don't know. |
20:31
🔗
|
|
dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.) |
20:35
🔗
|
|
scorche has quit IRC (Read error: Operation timed out) |
20:48
🔗
|
|
dashcloud has joined #archiveteam-bs |
21:01
🔗
|
|
icedice has joined #archiveteam-bs |
21:05
🔗
|
|
icedice has quit IRC (Client Quit) |
21:17
🔗
|
|
scorche has joined #archiveteam-bs |
21:37
🔗
|
|
scorche has quit IRC (Read error: Operation timed out) |
21:57
🔗
|
|
lennier1 has joined #archiveteam-bs |
22:01
🔗
|
|
paul2520 has quit IRC (Read error: Operation timed out) |
22:01
🔗
|
|
Jake has quit IRC (Read error: Operation timed out) |
22:01
🔗
|
|
endrift has quit IRC (Read error: Operation timed out) |
22:02
🔗
|
|
britmob has quit IRC (Read error: Operation timed out) |
22:02
🔗
|
|
paul2520 has joined #archiveteam-bs |
22:02
🔗
|
|
Jake has joined #archiveteam-bs |
22:02
🔗
|
|
endrift has joined #archiveteam-bs |
22:02
🔗
|
|
britmob has joined #archiveteam-bs |
22:02
🔗
|
|
systwi_ has joined #archiveteam-bs |
22:03
🔗
|
|
Meli has joined #archiveteam-bs |
22:03
🔗
|
|
Hecatz- has joined #archiveteam-bs |
22:04
🔗
|
|
asdf01011 has quit IRC (Read error: Operation timed out) |
22:04
🔗
|
|
scorche has joined #archiveteam-bs |
22:05
🔗
|
|
voltagex_ has joined #archiveteam-bs |
22:05
🔗
|
|
nico_32_ has joined #archiveteam-bs |
22:05
🔗
|
|
Coderjo has joined #archiveteam-bs |
22:05
🔗
|
|
systwi has quit IRC (Read error: Operation timed out) |
22:06
🔗
|
|
colona_ has joined #archiveteam-bs |
22:07
🔗
|
|
systwi_ is now known as systwi |
22:07
🔗
|
|
AlsoJAA_ has joined #archiveteam-bs |
22:07
🔗
|
|
JAA sets mode: +o AlsoJAA_ |
22:08
🔗
|
|
N4Y_ has joined #archiveteam-bs |
22:09
🔗
|
|
nightpoo- has joined #archiveteam-bs |
22:09
🔗
|
|
second_ has joined #archiveteam-bs |
22:09
🔗
|
|
actually_ has joined #archiveteam-bs |
22:12
🔗
|
|
obskyr has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
nepeat has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
apache2_ has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
Meli-sama has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
nightpool has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
PotcFdk has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
voltagex has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
mr_archiv has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
AlsoJAA has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
N4Y has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
nico_32 has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
N4Y_ is now known as N4Y |
22:12
🔗
|
|
atg has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
second has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
DFJustin has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
Flashfire has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
acridAxid has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
zhongfu has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
igloo25 has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
Coderjo_ has quit IRC (Ping timeout: 745 seconds) |
22:12
🔗
|
|
step has quit IRC (Ping timeout: 745 seconds) |
22:13
🔗
|
|
Hecatz has quit IRC (Ping timeout: 745 seconds) |
22:13
🔗
|
|
Hecatz- is now known as Hecatz |
22:13
🔗
|
|
colona has quit IRC (Ping timeout: 745 seconds) |
22:14
🔗
|
|
Mateon1 has joined #archiveteam-bs |
22:16
🔗
|
|
mr_archiv has joined #archiveteam-bs |
22:52
🔗
|
|
BlueMax has joined #archiveteam-bs |
23:35
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
23:38
🔗
|
|
Lord_Nigh has quit IRC (Client Quit) |
23:42
🔗
|
|
jshoard has quit IRC (Quit: Leaving) |
23:44
🔗
|
|
Lord_Nigh has joined #archiveteam-bs |
23:52
🔗
|
|
Arcorann has joined #archiveteam-bs |