Time |
Nickname |
Message |
01:01
🔗
|
|
ShellyRol has quit IRC (Ping timeout: 496 seconds) |
01:15
🔗
|
|
ShellyRol has joined #archiveteam-ot |
01:24
🔗
|
|
foureyes has quit IRC (Quit: brb) |
01:25
🔗
|
|
foureyes has joined #archiveteam-ot |
01:38
🔗
|
|
nyany_ has quit IRC (Read error: Operation timed out) |
01:40
🔗
|
|
ats has quit IRC (New kernel) |
01:43
🔗
|
|
ats has joined #archiveteam-ot |
01:52
🔗
|
|
nyany_ has joined #archiveteam-ot |
02:23
🔗
|
|
ShellyRol has quit IRC (Read error: Operation timed out) |
02:25
🔗
|
|
ShellyRol has joined #archiveteam-ot |
02:32
🔗
|
|
Veeryuk has joined #archiveteam-ot |
02:54
🔗
|
|
X-Scale` has joined #archiveteam-ot |
02:57
🔗
|
|
X-Scale` has quit IRC (irc.efnet.nl efnet.deic.eu) |
02:57
🔗
|
|
tuluu has quit IRC (irc.efnet.nl efnet.deic.eu) |
02:57
🔗
|
|
kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu) |
02:57
🔗
|
|
X-Scale` has joined #archiveteam-ot |
02:57
🔗
|
|
tuluu has joined #archiveteam-ot |
02:57
🔗
|
|
kiska3 has joined #archiveteam-ot |
02:58
🔗
|
|
MrRadar2 has quit IRC (ny.us.hub irc.efnet.nl) |
02:59
🔗
|
|
X-Scale has quit IRC (Ping timeout: 745 seconds) |
02:59
🔗
|
|
X-Scale` is now known as X-Scale |
02:59
🔗
|
|
MrRadar2 has joined #archiveteam-ot |
02:59
🔗
|
|
nyany_ has quit IRC (Read error: Operation timed out) |
03:00
🔗
|
|
icedice has quit IRC (Leaving) |
03:01
🔗
|
|
nyany_ has joined #archiveteam-ot |
03:11
🔗
|
Somebody2 |
prq: best to ask your question about urlteam in the #urlteam channel |
03:11
🔗
|
Somebody2 |
but to answer it anyway -- no, it doesn't (yet) |
03:11
🔗
|
Somebody2 |
urlteam is all about brute-force scanning the possibilities, not targeted efforts |
03:12
🔗
|
Somebody2 |
but it would be lovely if someone wanted to dig thru the full ArchiveBot corpus and *extract* all the shortcodes... |
03:30
🔗
|
atphoenix |
Somebody2, by ArchiveBot corpus, do you mean the AB source code, the web pages saved by AB, or a URL list of every URL AB has visited? |
03:31
🔗
|
JAA |
You won't find anything in the source code, but either the WARCs ("web pages saved by AB") or CDXs ("URL list of every URL AB has visited") would work. |
03:32
🔗
|
JAA |
Extracting from the WARCs would discover more URLs because it would also find shortlinks that weren't attempted by AB for whatever reason (recursion limits such as off-site or !ao jobs, ignores, etc.). |
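[Editor's sketch] The extraction idea above could look roughly like the following, assuming the page bodies have already been pulled out of the WARC records as text. The host list and the function name `extract_shortcodes` are illustrative assumptions, and the output is a plain list of (host, shortcode) pairs, not the format URLTeam actually uses.

```python
import re

# Hypothetical list of shortener hosts to look for; a real run would need
# a much longer, curated list.
SHORTENER_HOSTS = ("bit.ly", "tinyurl.com")


def extract_shortcodes(text, hosts=SHORTENER_HOSTS):
    """Return sorted, de-duplicated (host, shortcode) pairs found in text."""
    host_pattern = "|".join(re.escape(h) for h in hosts)
    pattern = re.compile(
        r"https?://(?:www\.)?(" + host_pattern + r")/([A-Za-z0-9_-]+)",
        re.IGNORECASE,
    )
    # Host names are case-insensitive, so normalise them; shortcodes are
    # kept as-is because they are usually case-sensitive.
    found = {(m.group(1).lower(), m.group(2)) for m in pattern.finditer(text)}
    return sorted(found)
```

Because this scans the saved page text rather than the list of fetched URLs, it would also pick up shortlinks that were never requested during the crawl, which is the advantage described above.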
03:32
🔗
|
JAA |
All AB WARCs are probably well over 600 TiB by now, but I don't have any current numbers. |
03:41
🔗
|
|
scorche has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC) |
03:45
🔗
|
OrIdow6 |
I don't see the point of looking through the CDXs - wouldn't the ones listed there have been captured anyhow? |
03:46
🔗
|
Somebody2 |
atphoenix: the web pages saved by AB, all of which can be downloaded by anyone, from IA. |
03:46
🔗
|
Somebody2 |
atphoenix: a list of all the urls visited could be extracted from that |
03:47
🔗
|
Somebody2 |
OrIdow6: sure, but extracting them into the format used by Urlteam would mean they could be used by whatever can process those |
03:47
🔗
|
Somebody2 |
there's been various talk about setting up a "dead shortener" site, that would hold all the URLs for shorteners that don't exist anymore |
03:50
🔗
|
atphoenix |
To make this whole idea work better I'd think that it would make sense to first rework the urlteam effort to allow it to |
03:50
🔗
|
atphoenix |
1.) ingest lists of known used shortened URLs and |
03:50
🔗
|
atphoenix |
2.) to store found URL results in a searchable DB alongside the date the URL was most recently resolved |
03:50
🔗
|
atphoenix |
Resolving known used shortened URLs would take priority. If a URL was re-resolved later, and the result was the same, update the date. If different, create a new DB entry with the date of resolution. This could also work to merge in externally resolved lists (leave the last-resolved date empty if the externally resolved list does not have one). Also keep a field that links to metadata about ingested lists. |
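[Editor's sketch] The update rule in this proposal can be illustrated with a small SQLite example. It is only a sketch under assumptions: the table name `resolutions`, its columns, and the function names are hypothetical, and a database holding billions of rows would likely need a different storage engine.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS resolutions (
               id INTEGER PRIMARY KEY,
               short_url TEXT NOT NULL,
               target_url TEXT NOT NULL,
               last_resolved TEXT,   -- NULL when an external list has no date
               source_list TEXT      -- link to metadata about the ingested list
           )"""
    )
    return conn


def record_resolution(conn, short_url, target_url,
                      resolved_on=None, source_list=None):
    """Apply the rule: same result updates the date, new result adds a row."""
    latest = conn.execute(
        "SELECT id, target_url FROM resolutions "
        "WHERE short_url = ? ORDER BY id DESC LIMIT 1",
        (short_url,),
    ).fetchone()
    if latest is not None and latest[1] == target_url:
        # Same target as the most recent entry: only refresh the date.
        if resolved_on is not None:
            conn.execute(
                "UPDATE resolutions SET last_resolved = ? WHERE id = ?",
                (resolved_on, latest[0]),
            )
            conn.commit()
        return latest[0]
    # First sighting or a changed target: keep history by adding a new row.
    cur = conn.execute(
        "INSERT INTO resolutions "
        "(short_url, target_url, last_resolved, source_list) "
        "VALUES (?, ?, ?, ?)",
        (short_url, target_url, resolved_on, source_list),
    )
    conn.commit()
    return cur.lastrowid
```

Keeping one row per distinct target, rather than overwriting, preserves the history of a shortcode whose destination changed over time.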
03:52
🔗
|
OrIdow6 |
Somebody2: What advantage would that have over the WBM? |
03:55
🔗
|
Somebody2 |
OrIdow6: We can't load most of the urlteam data into the WBM, as we didn't store the full headers. |
03:55
🔗
|
Somebody2 |
atphoenix: yes, that would be a lovely *additional* thing to do! |
03:56
🔗
|
OrIdow6 |
Oh, I see |
03:59
🔗
|
atphoenix |
Does ArchiveTeam have a place where it can keep and run a live, searchable database large enough to store billions of shortener lookup results? |
04:00
🔗
|
Somebody2 |
ArchiveTeam itself? probably not |
04:00
🔗
|
Somebody2 |
but some volunteer? maybe |
04:00
🔗
|
atphoenix |
I know IA does bulk storage, but that's not the same as having a DB we can use to run the project |
04:02
🔗
|
Somebody2 |
yep |
04:10
🔗
|
kode54 |
cool, hackint doesn't like Matrix |
04:10
🔗
|
kode54 |
wonder what kind of horribly underspecced servers they're running on |
04:16
🔗
|
|
qw3rty_ has joined #archiveteam-ot |
04:21
🔗
|
|
qw3rty has quit IRC (Ping timeout: 276 seconds) |
04:29
🔗
|
atphoenix |
Somebody2, should I copy that urlteam proposal somewhere? To the wiki perhaps? (yes I know the urlteam wiki needs reorg) |
04:54
🔗
|
|
scorche has joined #archiveteam-ot |
05:06
🔗
|
|
odemg has quit IRC (Ping timeout: 745 seconds) |
05:10
🔗
|
|
odemg has joined #archiveteam-ot |
05:25
🔗
|
|
Veeryuk has quit IRC (Read error: Connection reset by peer) |
05:48
🔗
|
|
cerca has quit IRC (Remote host closed the connection) |
06:10
🔗
|
|
dxrt_ has quit IRC (Read error: Connection reset by peer) |
06:11
🔗
|
|
kiska has quit IRC (Read error: Operation timed out) |
06:11
🔗
|
|
Wingy has quit IRC (Read error: Operation timed out) |
06:12
🔗
|
|
britmob_ has joined #archiveteam-ot |
06:12
🔗
|
|
asdf0101 has quit IRC (Read error: Operation timed out) |
06:14
🔗
|
|
britmob has quit IRC (Read error: Operation timed out) |
06:15
🔗
|
|
Sora_Uta has joined #archiveteam-ot |
06:15
🔗
|
|
SoraUta has quit IRC (Read error: Operation timed out) |
06:15
🔗
|
|
benjinss has quit IRC (Read error: Operation timed out) |
06:16
🔗
|
|
Raccoon has quit IRC (Ping timeout: 622 seconds) |
06:16
🔗
|
|
Raccoon` has joined #archiveteam-ot |
06:16
🔗
|
|
benjins has joined #archiveteam-ot |
06:19
🔗
|
|
_niklas has quit IRC (Read error: Operation timed out) |
06:19
🔗
|
|
systwi_ has quit IRC (Ping timeout: 622 seconds) |
06:23
🔗
|
|
_niklas has joined #archiveteam-ot |
06:27
🔗
|
|
systwi has joined #archiveteam-ot |
06:28
🔗
|
|
oxguy3 has joined #archiveteam-ot |
06:28
🔗
|
|
dxrt_ has joined #archiveteam-ot |
06:28
🔗
|
|
dxrt sets mode: +o dxrt_ |
06:30
🔗
|
|
Wingy has joined #archiveteam-ot |
06:41
🔗
|
|
LowLevelM has quit IRC (Ping timeout: 496 seconds) |
06:53
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
07:05
🔗
|
Somebody2 |
atphoenix: yes please! |
07:05
🔗
|
Somebody2 |
or at least repeat it in #urlteam |
07:06
🔗
|
Somebody2 |
atphoenix: see above |
07:09
🔗
|
|
oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
07:14
🔗
|
|
kiska has joined #archiveteam-ot |
07:15
🔗
|
|
svchfoo3 sets mode: +o kiska |
07:15
🔗
|
|
svchfoo1 sets mode: +o kiska |
07:15
🔗
|
|
Sora_Uta has quit IRC (Ping timeout: 276 seconds) |
07:16
🔗
|
|
oxguy3 has joined #archiveteam-ot |
07:19
🔗
|
|
kiska has quit IRC (Ping timeout: 276 seconds) |
07:30
🔗
|
|
kiska has joined #archiveteam-ot |
07:30
🔗
|
|
svchfoo3 sets mode: +o kiska |
07:30
🔗
|
|
svchfoo1 sets mode: +o kiska |
08:06
🔗
|
|
BlueMaxim has joined #archiveteam-ot |
08:08
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
11:23
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
11:28
🔗
|
|
schbirid has joined #archiveteam-ot |
11:40
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
11:41
🔗
|
|
oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
11:53
🔗
|
|
oxguy3 has joined #archiveteam-ot |
12:05
🔗
|
JAA |
Somebody2: Most importantly, URLTeam should be changed to produce WARCs instead of text files. |
12:06
🔗
|
JAA |
I believe that's been on the todo list since the very early days. |
13:04
🔗
|
|
oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
13:27
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
13:44
🔗
|
|
godane has joined #archiveteam-ot |
14:17
🔗
|
|
oxguy3 has joined #archiveteam-ot |
15:40
🔗
|
|
Sanqui has quit IRC (Remote host closed the connection) |
15:40
🔗
|
|
Sanqui has joined #archiveteam-ot |
16:09
🔗
|
|
icedice has joined #archiveteam-ot |
16:13
🔗
|
|
dhyan_nat has joined #archiveteam-ot |
16:43
🔗
|
|
Wingy has quit IRC (Read error: Operation timed out) |
16:53
🔗
|
|
Wingy has joined #archiveteam-ot |
17:10
🔗
|
|
LowLevelM has joined #archiveteam-ot |
17:24
🔗
|
|
oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…) |
18:41
🔗
|
|
VADemon has joined #archiveteam-ot |
18:44
🔗
|
Somebody2 |
amen |
18:47
🔗
|
|
DogsRNice has joined #archiveteam-ot |
19:04
🔗
|
|
ats has quit IRC (Quit: old kernel, since the new one doesn't work) |
19:06
🔗
|
|
ats has joined #archiveteam-ot |
20:33
🔗
|
|
DogsRNice has quit IRC (Ping timeout: 276 seconds) |
20:48
🔗
|
|
Ryz has quit IRC (Quit: Ping timeout (120 seconds)) |
20:49
🔗
|
|
Ryz has joined #archiveteam-ot |
21:16
🔗
|
|
oxguy3 has joined #archiveteam-ot |
22:19
🔗
|
|
asdf0101 has joined #archiveteam-ot |
22:59
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
23:23
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
23:40
🔗
|
|
Maylay has quit IRC (Read error: Connection reset by peer) |
23:42
🔗
|
|
Maylay has joined #archiveteam-ot |