Time |
Nickname |
Message |
01:21
🔗
|
|
toad2 has joined #urlteam |
01:23
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
04:46
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
06:16
🔗
|
|
JesseW has joined #urlteam |
06:20
🔗
|
JesseW |
Whoever was considering converting URLteam data into WARCs -- at least for shorteners that still exist, would it work to use ArchiveBot (or a separate set of such) to re-request the known-valid short URLs, thereby gathering all the expected WARC data and bypassing the issue of synthesizing it? |
06:37
🔗
|
bentpins |
3.3 billion confirmed URLs is rather a lot; it would probably take over a year using a single machine to recheck them |
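A rough back-of-the-envelope check of that estimate, as a sketch only. The 3.3 billion figure comes from the message above; the request rate of 100 per second and the function names are assumptions made for illustration, not measured values from the project.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_recheck(total_urls, requests_per_second, machines=1):
    """Days needed to re-request every URL at a steady rate.

    requests_per_second is a per-machine rate (an assumed figure).
    """
    total_rate = requests_per_second * machines
    return total_urls / total_rate / SECONDS_PER_DAY


def required_rate(total_urls, days):
    """Sustained requests per second needed to finish within `days`."""
    return total_urls / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    total = 3_300_000_000  # confirmed URLs, from the chat
    print(f"rate to finish in one year: {required_rate(total, 365):.1f} req/s")
    print(f"one machine at 100 req/s:   {days_to_recheck(total, 100):.0f} days")
    print(f"ten machines at 100 req/s:  {days_to_recheck(total, 100, 10):.0f} days")
```

Under these assumptions a single machine would have to sustain a bit over 100 requests per second for a full year, and spreading the work over several machines shortens the run proportionally.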
06:37
🔗
|
bentpins |
Plus not all the sites are still up |
06:39
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
06:55
🔗
|
|
Start has joined #urlteam |
07:07
🔗
|
|
JesseW has joined #urlteam |
07:53
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
09:35
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
09:37
🔗
|
|
Start has joined #urlteam |
10:22
🔗
|
|
VADemon has joined #urlteam |
13:37
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
13:38
🔗
|
|
Start has joined #urlteam |
14:59
🔗
|
|
JesseW has joined #urlteam |
15:13
🔗
|
JesseW |
bentpins: The sites being up isn't an issue -- the WARC would still capture the fact of the redirection, and (if the site had been previously captured), the Wayback Machine could load the other capture. |
15:13
🔗
|
JesseW |
The issue of how long it would take is a reason it might be better to make a Warrior task for it, so we could distribute it over multiple machines. :-) |
15:15
🔗
|
JesseW |
Of course, reading further about the WARC format, we could also generate WARCs with the relationship described using `resource` WARC-Type records, which don't require a full request/response. But I don't know if the Wayback Machine actually uses those... |
15:16
🔗
|
arkiver |
JesseW: I was considering converting this to WARCs |
15:16
🔗
|
arkiver |
I didn't because of the missing metadata |
15:17
🔗
|
JesseW |
arkiver: did you already consider the ideas I had? |
15:22
🔗
|
arkiver |
I haven't considered using resource records |
15:43
🔗
|
* |
JesseW was reading over the WARC spec for fun last night... |
15:44
🔗
|
JesseW |
arkiver: what were your thoughts/reasons for rejecting the idea of re-capturing valid shortURLs with wpull or some such? |
15:46
🔗
|
JesseW |
`resource` records are called out in the spec for just this purpose: "the result of a networked retrieval where the protocol information has been discarded." |
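A minimal sketch of what writing one such `resource` record could look like, using only the Python standard library. The function name `build_resource_record` is hypothetical, and storing the long URL as a `text/plain` block is an assumption made for illustration; nothing in the discussion settles what the payload should be or whether the Wayback Machine would replay it.

```python
import uuid
from datetime import datetime, timezone


def build_resource_record(short_url, long_url, when=None):
    """Serialize one WARC/1.0 `resource` record as bytes.

    Assumption: the record block is just the long URL as plain text.
    """
    when = when or datetime.now(timezone.utc)
    block = long_url.encode("utf-8")
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", short_url),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(block))),
    ]
    # Version line, named fields, blank line, block, then two CRLFs.
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers)
    head += "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_resource_record(
        "http://example.com/abc",  # placeholder short URL
        "http://example.org/some/long/path",  # placeholder target
    )
    print(record.decode("utf-8"))
```

Records built this way could be concatenated into a single file, one per known short URL, without re-requesting anything from the shortener.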
17:02
🔗
|
|
JesseW has quit IRC (Read error: Operation timed out) |
17:38
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
17:39
🔗
|
|
Start has joined #urlteam |
17:47
🔗
|
xmc |
hm, yes |
17:47
🔗
|
xmc |
i like this idea |
17:47
🔗
|
xmc |
loading urlteam into wayback is a thing i've had at the back of my mind for a long time |
17:49
🔗
|
|
aaaaaaaaa has joined #urlteam |
17:49
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
18:55
🔗
|
|
n00b832 has joined #urlteam |
18:58
🔗
|
n00b832 |
Hello! The urlteam tracker has been throwing 502s at me for a few hours. |
19:09
🔗
|
aaaaaaaaa |
n00b832: yeah it ran out of disk and they are in the process of fixing it. |
19:12
🔗
|
n00b832 |
Thanks for the info! |
19:26
🔗
|
|
n00b832 has quit IRC (Ping timeout: 240 seconds) |
20:03
🔗
|
|
Start_ has joined #urlteam |
20:03
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
21:12
🔗
|
|
aaaaaaaa_ has joined #urlteam |
21:12
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Connection reset by peer) |
21:12
🔗
|
|
swebb sets mode: +o aaaaaaaa_ |
21:13
🔗
|
|
aaaaaaaa_ is now known as aaaaaaaaa |
21:39
🔗
|
|
Start_ has quit IRC (Read error: Connection reset by peer) |
21:40
🔗
|
|
slang has joined #urlteam |
21:40
🔗
|
|
Start has joined #urlteam |
21:47
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
21:47
🔗
|
|
Start has joined #urlteam |
22:02
🔗
|
|
Start_ has joined #urlteam |
22:07
🔗
|
|
Start has quit IRC (Ping timeout: 362 seconds) |
22:17
🔗
|
|
Start_ is now known as Start |