#urlteam 2015-09-05,Sat


Time	Nickname	Message
01:21 ^🔗		toad2 has joined #urlteam
01:23 ^🔗		toad1 has quit IRC (Read error: Operation timed out)
04:46 ^🔗		aaaaaaaaa has quit IRC (Leaving)
06:16 ^🔗		JesseW has joined #urlteam
06:20 ^🔗	JesseW	Whoever was considering converting URLteam data into WARCs -- at least for shorteners that still exist, would it work to use ArchiveBot (or a separate set of such) to re-request the known-valid short URLs, thereby gathering all the expected WARC data, bypassing the issue of synthesizing it?
06:37 ^🔗	bentpins	3.3 billion confirmed URLs is rather a lot; it would probably take over a year on a single machine to recheck them
06:37 ^🔗	bentpins	Plus not all the sites are still up
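A rough back-of-the-envelope check of the "over a year" estimate above, as a minimal sketch. The 3.3 billion figure is from the log; the request rates are assumptions chosen only for illustration, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def days_to_recheck(total_urls: int, requests_per_second: float) -> float:
    """Days needed to re-request every URL at a constant sustained rate."""
    return total_urls / requests_per_second / SECONDS_PER_DAY

def rate_for_deadline(total_urls: int, days: float) -> float:
    """Sustained requests per second needed to finish within `days`."""
    return total_urls / (days * SECONDS_PER_DAY)

TOTAL_URLS = 3_300_000_000  # figure quoted in the channel

if __name__ == "__main__":
    # Assumed rates; real throughput depends on rate limits per shortener.
    for rate in (10, 50, 100):
        print(f"{rate:>4} req/s -> {days_to_recheck(TOTAL_URLS, rate):,.0f} days")
    print(f"rate needed for one year: {rate_for_deadline(TOTAL_URLS, 365):.1f} req/s")
```

Under these assumptions, finishing within a year needs roughly 105 requests per second sustained, so any single machine running slower than that would indeed take more than a year.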
06:39 ^🔗		JesseW has quit IRC (Read error: Operation timed out)
06:55 ^🔗		Start has joined #urlteam
07:07 ^🔗		JesseW has joined #urlteam
07:53 ^🔗		JesseW has quit IRC (Read error: Operation timed out)
09:35 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
09:37 ^🔗		Start has joined #urlteam
10:22 ^🔗		VADemon has joined #urlteam
13:37 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
13:38 ^🔗		Start has joined #urlteam
14:59 ^🔗		JesseW has joined #urlteam
15:13 ^🔗	JesseW	bentpins: The sites being up isn't an issue -- the WARC would still capture the fact of the redirection, and (if the site had been previously captured) the Wayback Machine could load the other capture.
15:13 ^🔗	JesseW	The issue of how long it would take is a reason it might be better to make a Warrior task for it, so we could distribute it over multiple machines. :-)
15:15 ^🔗	JesseW	Of course, reading further about the WARC format, we could also generate WARCs with the relationship described using `resource` WARC-Type records, which don't require a full request/response. But I don't know if the Wayback Machine actually uses those...
15:16 ^🔗	arkiver	JesseW: I was considering converting this to WARCs
15:16 ^🔗	arkiver	I didn't because of the missing metadata
15:17 ^🔗	JesseW	arkiver: did you already consider the ideas I had?
15:22 ^🔗	arkiver	I haven't considered using resource records
15:43 ^🔗	*	JesseW was reading over the WARC spec for fun last night...
15:44 ^🔗	JesseW	arkiver: what were your thoughts/reasons for rejecting the idea of re-capturing valid shortURLs with wpull or some such?
15:46 ^🔗	JesseW	`resource` records are called out in the spec for just this purpose: "the result of a networked retrieval where the protocol information has been discarded."
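To make the `resource` record idea concrete, here is a minimal sketch that serializes one such record by hand using only the Python standard library. The function name is hypothetical, and storing the long URL as a `text/plain` payload is purely an assumption for illustration: the log does not say what the record body would contain, and, as noted above, it is not known whether the Wayback Machine would use such records.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

def build_resource_record(short_url: str, long_url: str,
                          when: Optional[datetime] = None) -> bytes:
    """Serialize a single WARC/1.0 `resource` record (hypothetical helper).

    Assumption: the payload is just the long URL as plain text.
    `when` must be a timezone-aware datetime; defaults to the current time.
    """
    payload = long_url.encode("utf-8")
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {short_url}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: text/plain",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates them from the
    # content block, and the record is terminated by two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"

if __name__ == "__main__":
    # Placeholder URLs, not real URLteam data.
    record = build_resource_record("http://example.com/abc",
                                   "http://example.org/long")
    print(record.decode("utf-8"))
```

No request or response record is written, which is the point of the approach: only the short-URL-to-long-URL relationship is recorded, without the discarded HTTP exchange.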
17:02 ^🔗		JesseW has quit IRC (Read error: Operation timed out)
17:38 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
17:39 ^🔗		Start has joined #urlteam
17:47 ^🔗	xmc	hm, yes
17:47 ^🔗	xmc	i like this idea
17:47 ^🔗	xmc	loading urlteam into wayback is a thing i've had at the back of my mind for a long time
17:49 ^🔗		aaaaaaaaa has joined #urlteam
17:49 ^🔗		swebb sets mode: +o aaaaaaaaa
18:55 ^🔗		n00b832 has joined #urlteam
18:58 ^🔗	n00b832	Hello! The urlteam tracker has been throwing 502s at me for a few hours.
19:09 ^🔗	aaaaaaaaa	n00b832: yeah it ran out of disk and they are in the process of fixing it.
19:12 ^🔗	n00b832	Thanks for the info!
19:26 ^🔗		n00b832 has quit IRC (Ping timeout: 240 seconds)
20:03 ^🔗		Start_ has joined #urlteam
20:03 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
21:12 ^🔗		aaaaaaaa_ has joined #urlteam
21:12 ^🔗		aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
21:12 ^🔗		swebb sets mode: +o aaaaaaaa_
21:13 ^🔗		aaaaaaaa_ is now known as aaaaaaaaa
21:39 ^🔗		Start_ has quit IRC (Read error: Connection reset by peer)
21:40 ^🔗		slang has joined #urlteam
21:40 ^🔗		Start has joined #urlteam
21:47 ^🔗		Start has quit IRC (Read error: Connection reset by peer)
21:47 ^🔗		Start has joined #urlteam
22:02 ^🔗		Start_ has joined #urlteam
22:07 ^🔗		Start has quit IRC (Ping timeout: 362 seconds)
22:17 ^🔗		Start_ is now known as Start
