[01:21] *** toad2 has joined #urlteam
[01:23] *** toad1 has quit IRC (Read error: Operation timed out)
[04:46] *** aaaaaaaaa has quit IRC (Leaving)
[06:16] *** JesseW has joined #urlteam
[06:20] Whoever was considering converting URLteam data into WARCs -- at least for shorteners that still exist, would it work to use ArchiveBot (or a separate set of such) to re-request the known-valid short URLs, thereby gathering all the expected WARC data, bypassing the issue of synthesizing it?
[06:37] 3.3 billion confirmed URLs is rather a lot; it would probably take over a year using a single machine to recheck them
[06:37] Plus not all the sites are still up
[06:39] *** JesseW has quit IRC (Read error: Operation timed out)
[06:55] *** Start has joined #urlteam
[07:07] *** JesseW has joined #urlteam
[07:53] *** JesseW has quit IRC (Read error: Operation timed out)
[09:35] *** Start has quit IRC (Read error: Connection reset by peer)
[09:37] *** Start has joined #urlteam
[10:22] *** VADemon has joined #urlteam
[13:37] *** Start has quit IRC (Read error: Connection reset by peer)
[13:38] *** Start has joined #urlteam
[14:59] *** JesseW has joined #urlteam
[15:13] bentpins: The sites being up isn't an issue -- the WARC would still capture the fact of the redirection, and (if the site had been previously captured) the Wayback Machine could load the other capture.
[15:13] The issue of how long it would take is a reason it might be better to make a Warrior task for it, so we could distribute it over multiple machines. :-)
[15:15] Of course, reading further about the WARC format, we could also generate WARCs with the relationship described using `resource` WARC-Type records, which don't require a full request/response. But I don't know if the Wayback Machine actually uses those...
[15:16] JesseW: I was considering converting this to WARCs
[15:16] I didn't because of the missing metadata
[15:17] arkiver: did you already consider the ideas I had?
[15:22] I haven't considered using resource records
[15:43] * JesseW was reading over the WARC spec for fun last night...
[15:44] arkiver: what were your thoughts/reasons for rejecting the idea of re-capturing valid shortURLs with wpull or some such?
[15:46] `resource` records are called out in the spec for just this purpose: "the result of a networked retrieval where the protocol information has been discarded."
[17:02] *** JesseW has quit IRC (Read error: Operation timed out)
[17:38] *** Start has quit IRC (Read error: Connection reset by peer)
[17:39] *** Start has joined #urlteam
[17:47] hm, yes
[17:47] i like this idea
[17:47] loading urlteam into wayback is a thing i've had at the back of my mind for a long time
[17:49] *** aaaaaaaaa has joined #urlteam
[17:49] *** swebb sets mode: +o aaaaaaaaa
[18:55] *** n00b832 has joined #urlteam
[18:58] Hello! The urlteam tracker keeps throwing 502's at me for a few hours.
[19:09] n00b832: yeah it ran out of disk and they are in the process of fixing it.
[19:12] Thanks for the info!
[19:26] *** n00b832 has quit IRC (Ping timeout: 240 seconds)
[20:03] *** Start_ has joined #urlteam
[20:03] *** Start has quit IRC (Read error: Connection reset by peer)
[21:12] *** aaaaaaaa_ has joined #urlteam
[21:12] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[21:12] *** swebb sets mode: +o aaaaaaaa_
[21:13] *** aaaaaaaa_ is now known as aaaaaaaaa
[21:39] *** Start_ has quit IRC (Read error: Connection reset by peer)
[21:40] *** slang has joined #urlteam
[21:40] *** Start has joined #urlteam
[21:47] *** Start has quit IRC (Read error: Connection reset by peer)
[21:47] *** Start has joined #urlteam
[22:02] *** Start_ has joined #urlteam
[22:07] *** Start has quit IRC (Ping timeout: 362 seconds)
[22:17] *** Start_ is now known as Start
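
Note on the [06:37] estimate and the [15:13] Warrior suggestion: the "over a year on a single machine" figure can be sanity-checked with simple arithmetic. The sketch below is only an illustration. The 3.3 billion count comes from the log, but the per-machine request rate (100 requests per second) and the machine count (50) are assumptions chosen for the example, not numbers anyone in the channel stated, and `days_to_recheck` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400

# Figure quoted in the log at [06:37].
CONFIRMED_URLS = 3_300_000_000


def days_to_recheck(url_count, requests_per_second, machines=1):
    """Days needed to re-request url_count URLs at a steady per-machine rate.

    Assumes every machine sustains the same rate with no retries,
    rate limiting, or downtime, which is optimistic.
    """
    if requests_per_second <= 0 or machines < 1:
        raise ValueError("rate must be positive and machines must be >= 1")
    total_rate = requests_per_second * machines
    return url_count / (total_rate * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Both the rate and the machine counts are assumed values.
    for machines in (1, 50):
        days = days_to_recheck(CONFIRMED_URLS, 100, machines)
        print(f"{machines:>3} machine(s) at 100 req/s each: {days:.1f} days")
```

Under these assumptions a single machine needs a bit more than a year, which matches the rough estimate in the log, while spreading the same work over 50 machines brings it down to about a week. Real shorteners would likely throttle well below the assumed rate, so the actual duration depends mostly on per-site limits rather than raw machine count.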