#archiveteam-ot 2020-01-04,Sat

Time Nickname Message
01:01 🔗 ShellyRol has quit IRC (Ping timeout: 496 seconds)
01:15 🔗 ShellyRol has joined #archiveteam-ot
01:24 🔗 foureyes has quit IRC (Quit: brb)
01:25 🔗 foureyes has joined #archiveteam-ot
01:38 🔗 nyany_ has quit IRC (Read error: Operation timed out)
01:40 🔗 ats has quit IRC (New kernel)
01:43 🔗 ats has joined #archiveteam-ot
01:52 🔗 nyany_ has joined #archiveteam-ot
02:23 🔗 ShellyRol has quit IRC (Read error: Operation timed out)
02:25 🔗 ShellyRol has joined #archiveteam-ot
02:32 🔗 Veeryuk has joined #archiveteam-ot
02:54 🔗 X-Scale` has joined #archiveteam-ot
02:57 🔗 X-Scale` has quit IRC (irc.efnet.nl efnet.deic.eu)
02:57 🔗 tuluu has quit IRC (irc.efnet.nl efnet.deic.eu)
02:57 🔗 kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
02:57 🔗 X-Scale` has joined #archiveteam-ot
02:57 🔗 tuluu has joined #archiveteam-ot
02:57 🔗 kiska3 has joined #archiveteam-ot
02:58 🔗 MrRadar2 has quit IRC (ny.us.hub irc.efnet.nl)
02:59 🔗 X-Scale has quit IRC (Ping timeout: 745 seconds)
02:59 🔗 X-Scale` is now known as X-Scale
02:59 🔗 MrRadar2 has joined #archiveteam-ot
02:59 🔗 nyany_ has quit IRC (Read error: Operation timed out)
03:00 🔗 icedice has quit IRC (Leaving)
03:01 🔗 nyany_ has joined #archiveteam-ot
03:11 🔗 Somebody2 prq: best to ask your question about urlteam in the #urlteam channel
03:11 🔗 Somebody2 but to answer it anyway -- no, it doesn't (yet)
03:11 🔗 Somebody2 urlteam is all about brute-force scanning the possibilities, not targeted efforts
03:12 🔗 Somebody2 but it would be lovely if someone wanted to dig thru the full ArchiveBot corpus and *extract* all the shortcodes...
03:30 🔗 atphoenix Somebody2, by ArchiveBot corpus, do you mean the AB source code, the web pages saved by AB, or a URL list of every URL AB has visited?
03:31 🔗 JAA You won't find anything in the source code, but either the WARCs ("web pages saved by AB") or CDXs ("URL list of every URL AB has visited") would work.
03:32 🔗 JAA Extracting from the WARCs would discover more URLs because it would also find shortlinks that weren't attempted by AB for whatever reason (recursion limits such as off-site or !ao jobs, ignores, etc.).
03:32 🔗 JAA All AB WARCs are probably well over 600 TiB by now, but I don't have any current numbers.
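The extraction JAA describes could look roughly like the sketch below, which pulls shortener links out of already-decoded page text with a regular expression. The host list, the shortcode character set, and the names `SHORTENER_HOSTS`, `build_pattern`, and `extract_shortcodes` are assumptions for illustration, not anything urlteam or ArchiveBot actually uses. Reading the response bodies out of the WARCs is left out.

```python
import re

# Hypothetical host list for illustration only; a real run would need the
# full set of shorteners of interest.
SHORTENER_HOSTS = ("bit.ly", "goo.gl", "t.co")


def build_pattern(hosts):
    """Compile a regex matching http(s)://<host>/<shortcode> for the given hosts."""
    alternation = "|".join(re.escape(host) for host in hosts)
    # Assumption: shortcodes consist of letters, digits, "_" and "-".
    return re.compile(
        r"https?://(" + alternation + r")/([A-Za-z0-9_-]+)",
        re.IGNORECASE,
    )


def extract_shortcodes(text, hosts=SHORTENER_HOSTS):
    """Return the set of (host, shortcode) pairs found in text.

    Hosts are lowercased; shortcodes keep their case because many
    shorteners treat them as case-sensitive.
    """
    pattern = build_pattern(hosts)
    return {(m.group(1).lower(), m.group(2)) for m in pattern.finditer(text)}
```

Because this scans page content rather than the list of fetched URLs, it would also pick up shortlinks that were never requested, which is the advantage of the WARC route mentioned above.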
03:41 🔗 scorche has quit IRC (Quit: HydraIRC -> http://www.hydrairc.com <- Organize your IRC)
03:45 🔗 OrIdow6 I don't see the point of looking through the CDXs - wouldn't the ones listed there have been captured anyhow?
03:46 🔗 Somebody2 atphoenix: the web pages saved by AB, all of which can be downloaded by anyone, from IA.
03:46 🔗 Somebody2 atphoenix: a list of all the urls visited could be extracted from that
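For the CDX route, a minimal sketch of pulling the visited URLs out of space-delimited CDX lines follows. It assumes the common layout in which a header line such as ` CDX N b a m s k r M S V g` names the fields and `a` marks the original URL. When no header is seen, it falls back to the third field, which is also an assumption. The function name `original_urls` is hypothetical.

```python
def original_urls(cdx_lines):
    """Yield the original URL from each space-delimited CDX record.

    If a " CDX ..." header line is present, the position of the "a"
    (original URL) field is taken from it; otherwise the third field
    is assumed.
    """
    url_index = 2  # assumed default position of the original URL
    for line in cdx_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.lstrip().startswith("CDX "):
            # Header line: the letters after "CDX" name the fields in order.
            fields = line.split()[1:]
            if "a" in fields:
                url_index = fields.index("a")
            continue
        parts = line.split(" ")
        if len(parts) > url_index:
            yield parts[url_index]
```

The output could then be filtered for shortener hosts before being converted into whatever format urlteam tooling expects.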
03:47 🔗 Somebody2 OrIdow6: sure, but extracting them into the format used by Urlteam would mean they could be used by whatever can process those
03:47 🔗 Somebody2 there's been various talk about setting up a "dead shortener" site, that would hold all the URLs for shorteners that don't exist anymore
03:50 🔗 atphoenix To make this whole idea work better I'd think that it would make sense to first rework the urlteam effort to allow it to
03:50 🔗 atphoenix 1.) ingest lists of known used shortened URLs and
03:50 🔗 atphoenix 2.) to store found URL results in a searchable DB alongside the date the URL was most recently resolved
03:50 🔗 atphoenix Resolving known used shortened URLs would take priority. If a URL was re-resolved later, and the result was the same, update the date. If different, create a new DB entry with date of resolution. This could also work to merge in externally resolved lists (leave the last-resolved date empty if externally resolved list does not have a last-resolved date). Also keep a field that links to metadata about ingested lists.
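The update rule in atphoenix's proposal can be sketched with SQLite as below. The table layout, column names, and the functions `open_db` and `record_resolution` are hypothetical. The proposal does not specify a schema, and a store for billions of rows would likely need something other than SQLite.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a database with a hypothetical resolutions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS resolutions (
               id INTEGER PRIMARY KEY,
               short_url TEXT NOT NULL,
               target TEXT NOT NULL,
               last_resolved TEXT,   -- NULL when an external list has no date
               source_list TEXT      -- reference to metadata about the ingested list
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_short_url ON resolutions (short_url)"
    )
    return conn


def record_resolution(conn, short_url, target, resolved_on=None, source_list=None):
    """Apply the proposed rule: same target updates the date, a new target adds a row."""
    latest = conn.execute(
        "SELECT id, target FROM resolutions "
        "WHERE short_url = ? ORDER BY id DESC LIMIT 1",
        (short_url,),
    ).fetchone()
    if latest is not None and latest[1] == target:
        # Same result as last time: only refresh the date, if one is known.
        if resolved_on is not None:
            conn.execute(
                "UPDATE resolutions SET last_resolved = ? WHERE id = ?",
                (resolved_on, latest[0]),
            )
    else:
        # First sighting, or the shortlink now resolves somewhere else.
        conn.execute(
            "INSERT INTO resolutions (short_url, target, last_resolved, source_list) "
            "VALUES (?, ?, ?, ?)",
            (short_url, target, resolved_on, source_list),
        )
    conn.commit()
```

Comparing only against the most recent row means a shortlink that flips back to an earlier target gets a fresh entry, which keeps the history in order of observation. Whether that is the desired behavior is a design choice the proposal leaves open.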
03:52 🔗 OrIdow6 Somebody2: What advantage would that have over the WBM?
03:55 🔗 Somebody2 OrIdow6: We can't load most of the urlteam data into the WBM, as we didn't store the full headers.
03:55 🔗 Somebody2 atphoenix: yes, that would be a lovely *additional* thing to do!
03:56 🔗 OrIdow6 Oh, I see
03:59 🔗 atphoenix does ArchiveTeam have a place where it can keep/run a live, searchable database large enough to store billions of shortener lookup results?
04:00 🔗 Somebody2 ArchiveTeam itself? probably not
04:00 🔗 Somebody2 but some volunteer? maybe
04:00 🔗 atphoenix I know IA does bulk storage, but that's not the same as having a DB we can use to run the project
04:02 🔗 Somebody2 yep
04:10 🔗 kode54 cool, hackint doesn't like Matrix
04:10 🔗 kode54 wonder what kind of horribly underspecced servers they're running on
04:16 🔗 qw3rty_ has joined #archiveteam-ot
04:21 🔗 qw3rty has quit IRC (Ping timeout: 276 seconds)
04:29 🔗 atphoenix Somebody2, should I copy that urlteam proposal somewhere? To the wiki perhaps? (yes I know the urlteam wiki needs reorg)
04:54 🔗 scorche has joined #archiveteam-ot
05:06 🔗 odemg has quit IRC (Ping timeout: 745 seconds)
05:10 🔗 odemg has joined #archiveteam-ot
05:25 🔗 Veeryuk has quit IRC (Read error: Connection reset by peer)
05:48 🔗 cerca has quit IRC (Remote host closed the connection)
06:10 🔗 dxrt_ has quit IRC (Read error: Connection reset by peer)
06:11 🔗 kiska has quit IRC (Read error: Operation timed out)
06:11 🔗 Wingy has quit IRC (Read error: Operation timed out)
06:12 🔗 britmob_ has joined #archiveteam-ot
06:12 🔗 asdf0101 has quit IRC (Read error: Operation timed out)
06:14 🔗 britmob has quit IRC (Read error: Operation timed out)
06:15 🔗 Sora_Uta has joined #archiveteam-ot
06:15 🔗 SoraUta has quit IRC (Read error: Operation timed out)
06:15 🔗 benjinss has quit IRC (Read error: Operation timed out)
06:16 🔗 Raccoon has quit IRC (Ping timeout: 622 seconds)
06:16 🔗 Raccoon` has joined #archiveteam-ot
06:16 🔗 benjins has joined #archiveteam-ot
06:19 🔗 _niklas has quit IRC (Read error: Operation timed out)
06:19 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
06:23 🔗 _niklas has joined #archiveteam-ot
06:27 🔗 systwi has joined #archiveteam-ot
06:28 🔗 oxguy3 has joined #archiveteam-ot
06:28 🔗 dxrt_ has joined #archiveteam-ot
06:28 🔗 dxrt sets mode: +o dxrt_
06:30 🔗 Wingy has joined #archiveteam-ot
06:41 🔗 LowLevelM has quit IRC (Ping timeout: 496 seconds)
06:53 🔗 dhyan_nat has joined #archiveteam-ot
07:05 🔗 Somebody2 atphoenix: yes please!
07:05 🔗 Somebody2 or at least repeat it in #urlteam
07:06 🔗 Somebody2 atphoenix: see above
07:09 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
07:14 🔗 kiska has joined #archiveteam-ot
07:15 🔗 svchfoo3 sets mode: +o kiska
07:15 🔗 svchfoo1 sets mode: +o kiska
07:15 🔗 Sora_Uta has quit IRC (Ping timeout: 276 seconds)
07:16 🔗 oxguy3 has joined #archiveteam-ot
07:19 🔗 kiska has quit IRC (Ping timeout: 276 seconds)
07:30 🔗 kiska has joined #archiveteam-ot
07:30 🔗 svchfoo3 sets mode: +o kiska
07:30 🔗 svchfoo1 sets mode: +o kiska
08:06 🔗 BlueMaxim has joined #archiveteam-ot
08:08 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:23 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
11:28 🔗 schbirid has joined #archiveteam-ot
11:40 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
11:41 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
11:53 🔗 oxguy3 has joined #archiveteam-ot
12:05 🔗 JAA Somebody2: Most importantly, URLTeam should be changed to produce WARCs instead of text files.
12:06 🔗 JAA I believe that's been on the todo list since the very early days.
13:04 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
13:27 🔗 godane has quit IRC (Read error: Operation timed out)
13:44 🔗 godane has joined #archiveteam-ot
14:17 🔗 oxguy3 has joined #archiveteam-ot
15:40 🔗 Sanqui has quit IRC (Remote host closed the connection)
15:40 🔗 Sanqui has joined #archiveteam-ot
16:09 🔗 icedice has joined #archiveteam-ot
16:13 🔗 dhyan_nat has joined #archiveteam-ot
16:43 🔗 Wingy has quit IRC (Read error: Operation timed out)
16:53 🔗 Wingy has joined #archiveteam-ot
17:10 🔗 LowLevelM has joined #archiveteam-ot
17:24 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
18:41 🔗 VADemon has joined #archiveteam-ot
18:44 🔗 Somebody2 amen
18:47 🔗 DogsRNice has joined #archiveteam-ot
19:04 🔗 ats has quit IRC (Quit: old kernel, since the new one doesn't work)
19:06 🔗 ats has joined #archiveteam-ot
20:33 🔗 DogsRNice has quit IRC (Ping timeout: 276 seconds)
20:48 🔗 Ryz has quit IRC (Quit: Ping timeout (120 seconds))
20:49 🔗 Ryz has joined #archiveteam-ot
21:16 🔗 oxguy3 has joined #archiveteam-ot
22:19 🔗 asdf0101 has joined #archiveteam-ot
22:59 🔗 dhyan_nat has quit IRC (Read error: Operation timed out)
23:23 🔗 schbirid has quit IRC (Quit: Leaving)
23:40 🔗 Maylay has quit IRC (Read error: Connection reset by peer)
23:42 🔗 Maylay has joined #archiveteam-ot