#urlteam 2015-09-05,Sat

Time Nickname Message
01:21 🔗 toad2 has joined #urlteam
01:23 🔗 toad1 has quit IRC (Read error: Operation timed out)
04:46 🔗 aaaaaaaaa has quit IRC (Leaving)
06:16 🔗 JesseW has joined #urlteam
06:20 🔗 JesseW Whoever was considering converting URLteam data into WARCs -- at least for shorteners that still exist, would it work to use ArchiveBot (or a separate set of such) to re-request the known-valid short URLs, thereby gathering all the expected WARC data, bypassing the issue of synthesizing it?
06:37 🔗 bentpins 3.3 billion confirmed URLs is rather a lot; it would probably take over a year to recheck them using a single machine
06:37 🔗 bentpins Plus not all the sites are still up
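For scale, here is a back-of-envelope check of the estimate above. It only restates the arithmetic: the 3.3 billion figure comes from the log, while the 50 requests per second used in the second call is an assumed rate chosen for illustration, not a measured one.

```python
# Rough arithmetic for rechecking every known short URL.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap days

TOTAL_URLS = 3_300_000_000  # figure quoted in the log


def required_rate(total_urls, years):
    """Sustained requests per second needed to finish in `years`."""
    return total_urls / (years * SECONDS_PER_YEAR)


def years_needed(total_urls, urls_per_second):
    """Years needed at a sustained rate of `urls_per_second`."""
    return total_urls / (urls_per_second * SECONDS_PER_YEAR)


if __name__ == "__main__":
    # Rate one machine would have to sustain to finish in exactly one year.
    print("rate for 1 year: %.1f req/s" % required_rate(TOTAL_URLS, 1))
    # Assumed rate (hypothetical): 50 requests per second.
    print("years at 50 req/s: %.2f" % years_needed(TOTAL_URLS, 50))
```

Finishing in one year needs a bit over 100 requests per second sustained around the clock, so "over a year" on a single machine holds for any rate below that.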
06:39 🔗 JesseW has quit IRC (Read error: Operation timed out)
06:55 🔗 Start has joined #urlteam
07:07 🔗 JesseW has joined #urlteam
07:53 🔗 JesseW has quit IRC (Read error: Operation timed out)
09:35 🔗 Start has quit IRC (Read error: Connection reset by peer)
09:37 🔗 Start has joined #urlteam
10:22 🔗 VADemon has joined #urlteam
13:37 🔗 Start has quit IRC (Read error: Connection reset by peer)
13:38 🔗 Start has joined #urlteam
14:59 🔗 JesseW has joined #urlteam
15:13 🔗 JesseW bentpins: The sites being up isn't an issue -- the WARC would still capture the fact of the redirection, and (if the site had been previously captured), the Wayback Machine could load the other capture.
15:13 🔗 JesseW The issue of how long it would take is a reason it might be better to make a Warrior task for it, so we could distribute it over multiple machines. :-)
15:15 🔗 JesseW Of course, reading further about the WARC format, we could also generate WARCs with the relationship described using `resource` WARC-Type records, which don't require a full request/response. But I don't know if the Wayback Machine actually uses those...
15:16 🔗 arkiver JesseW: I was considering converting this to WARCs
15:16 🔗 arkiver I didn't because of the missing metadata
15:17 🔗 JesseW arkiver: did you already consider the ideas I had?
15:22 🔗 arkiver I haven't considered using resource records
15:43 🔗 * JesseW was reading over the WARC spec for fun last night...
15:44 🔗 JesseW arkiver: what were your thoughts/reasons for rejecting the idea of re-capturing valid shortURLs with wpull or some such?
15:46 🔗 JesseW `resource` records are called out in the spec for just this purpose: "the result of a networked retrieval where the protocol information has been discarded."
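A minimal sketch of what one such `resource` record could look like, written with the Python standard library only. The record layout (version line, named fields, blank line, block, two trailing CRLFs) follows WARC 1.0. Everything else is an assumption for illustration: the helper name `build_resource_record`, the example URLs, and in particular the choice to store the redirect target as a `text/plain` block, which the discussion above does not settle. Whether the Wayback Machine would use such records is, as noted in the log, unknown.

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(short_url, target_url, when=None):
    """Return one WARC/1.0 `resource` record as bytes.

    Hypothetical helper: the block is just the redirect target as plain
    text, which is an assumed encoding, not an established convention.
    """
    when = when or datetime.now(timezone.utc)
    block = (target_url + "\n").encode("utf-8")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        # `when` is assumed to be in UTC.
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", short_url),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join("%s: %s\r\n" % (name, value) for name, value in headers)
    head += "\r\n"  # blank line separates the header from the block
    # A record ends with two CRLFs after the block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_resource_record(
        "http://example.com/abc",        # hypothetical short URL
        "http://example.org/long/page",  # hypothetical redirect target
    )
    print(record.decode("utf-8"))
```

No HTTP request or response appears in the record, which is the point of the `resource` type: only the known short URL, its target, and a date are needed.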
17:02 🔗 JesseW has quit IRC (Read error: Operation timed out)
17:38 🔗 Start has quit IRC (Read error: Connection reset by peer)
17:39 🔗 Start has joined #urlteam
17:47 🔗 xmc hm, yes
17:47 🔗 xmc i like this idea
17:47 🔗 xmc loading urlteam into wayback is a thing i've had at the back of my mind for a long time
17:49 🔗 aaaaaaaaa has joined #urlteam
17:49 🔗 swebb sets mode: +o aaaaaaaaa
18:55 🔗 n00b832 has joined #urlteam
18:58 🔗 n00b832 Hello! The urlteam tracker has been throwing 502s at me for a few hours.
19:09 🔗 aaaaaaaaa n00b832: yeah it ran out of disk and they are in the process of fixing it.
19:12 🔗 n00b832 Thanks for the info!
19:26 🔗 n00b832 has quit IRC (Ping timeout: 240 seconds)
20:03 🔗 Start_ has joined #urlteam
20:03 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:12 🔗 aaaaaaaa_ has joined #urlteam
21:12 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
21:12 🔗 swebb sets mode: +o aaaaaaaa_
21:13 🔗 aaaaaaaa_ is now known as aaaaaaaaa
21:39 🔗 Start_ has quit IRC (Read error: Connection reset by peer)
21:40 🔗 slang has joined #urlteam
21:40 🔗 Start has joined #urlteam
21:47 🔗 Start has quit IRC (Read error: Connection reset by peer)
21:47 🔗 Start has joined #urlteam
22:02 🔗 Start_ has joined #urlteam
22:07 🔗 Start has quit IRC (Ping timeout: 362 seconds)
22:17 🔗 Start_ is now known as Start
