#urlteam 2020-01-04,Sat


Time Nickname Message
01:38 🔗 nyany_ has quit IRC (Read error: Operation timed out)
01:52 🔗 nyany_ has joined #urlteam
02:57 🔗 kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
02:57 🔗 kiska3 has joined #urlteam
02:58 🔗 MrRadar2 has quit IRC (ny.us.hub irc.efnet.nl)
02:59 🔗 MrRadar2 has joined #urlteam
02:59 🔗 nyany_ has quit IRC (Read error: Operation timed out)
03:01 🔗 nyany_ has joined #urlteam
05:06 🔗 odemg has quit IRC (Ping timeout: 745 seconds)
05:10 🔗 odemg has joined #urlteam
06:11 🔗 kiska has quit IRC (Read error: Operation timed out)
06:11 🔗 Wingy has quit IRC (Read error: Operation timed out)
06:12 🔗 Zerote has joined #urlteam
06:12 🔗 britmob_ has joined #urlteam
06:14 🔗 Zerote__ has quit IRC (Read error: Operation timed out)
06:14 🔗 britmob has quit IRC (Read error: Operation timed out)
06:19 🔗 systwi_ has quit IRC (Ping timeout: 622 seconds)
06:27 🔗 systwi has joined #urlteam
06:28 🔗 oxguy3 has joined #urlteam
06:30 🔗 Wingy has joined #urlteam
07:09 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
07:16 🔗 oxguy3 has joined #urlteam
07:30 🔗 kiska has joined #urlteam
07:30 🔗 svchfoo3 sets mode: +o kiska
11:41 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
11:53 🔗 oxguy3 has joined #urlteam
13:04 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
14:17 🔗 oxguy3 has joined #urlteam
16:14 🔗 atphoenix Here's a relevant conversation from another channel, including a feature/design proposal from myself:
16:14 🔗 atphoenix <prq> I'm watching http://dashboard.at.ninjawedding.org/?showNicks=1 and some of the jobs seem to encounter short URLs. Does archivebot feed those back into the http://urlte.am/ project?
16:14 🔗 atphoenix <prq> or does http://urlte.am/ simply stick to its brute force method?
16:14 🔗 atphoenix <Somebody2> prq: best to ask your question about urlteam in the #urlteam channel
16:14 🔗 atphoenix <Somebody2> but to answer it anyway -- no, it doesn't (yet)
16:14 🔗 atphoenix <Somebody2> urlteam is all about brute-force scanning the possibilities, not targeted efforts
16:14 🔗 atphoenix <Somebody2> but it would be lovely if someone wanted to dig thru the full ArchiveBot corpus and *extract* all the shortcodes...
16:14 🔗 atphoenix <atphoenix> Somebody2, by ArchiveBot corpus, do you mean the AB source code or do you mean the web pages saved by AB or do you mean there is an URL list of every URL AB has visited?
16:14 🔗 atphoenix <JAA> You won't find anything in the source code, but either the WARCs ("web pages saved by AB") or CDXs ("URL list of every URL AB has visited") would work.
16:14 🔗 atphoenix <JAA> Extracting from the WARCs would discover more URLs because it would also find shortlinks that weren't attempted by AB for whatever reason (recursion limits such as off-site or !ao jobs, ignores, etc.).
16:14 🔗 atphoenix <JAA> All AB WARCs are probably well over 600 TiB by now, but I don't have any current numbers.
16:14 🔗 atphoenix <OrIdow6> I don't see the point of looking through the CDXs - wouldn't the ones listed there have been captured anyhow?
16:14 🔗 atphoenix <Somebody2> atphoenix: the web pages saved by AB, all of which can be downloaded by anyone, from IA.
16:14 🔗 atphoenix <Somebody2> atphoenix: a list of all the urls visited could be extracted from that
16:14 🔗 atphoenix <Somebody2> OrIdow6: sure, but extracting them into the format used by Urlteam would mean they could be used by whatever can process those
16:14 🔗 atphoenix <Somebody2> there's been various talk about setting up a "dead shortener" site, that would hold all the URLs for shorteners that don't exist anymore
16:14 🔗 atphoenix <atphoenix> To make this whole idea work better I'd think that it would make sense to first rework the urlteam effort to allow it to
16:15 🔗 atphoenix <atphoenix> 1.) ingest lists of known used shortened URLs and
16:15 🔗 atphoenix <atphoenix> 2.) to store found URL results in a searchable DB alongside the date the URL was most recently resolved
16:15 🔗 atphoenix <atphoenix> Resolving known used shortened URLs would take priority. If a URL was re-resolved later, and the result was the same, update the date. If different, create a new DB entry with date of resolution. This could also work to merge in externally resolved lists (leave the last-resolved date empty if externally resolved list does not have a last-resolved date). Also keep a field that links to metadata about ingested lists.
16:15 🔗 atphoenix <OrIdow6> Somebody2: What advantage would that have over the WBM?
16:15 🔗 atphoenix <Somebody2> OrIdow6: We can't load most of the urlteam data into the WBM, as we didn't store the full headers.
16:15 🔗 atphoenix <Somebody2> atphoenix: yes, that would be a lovely *additional* thing to do!
16:15 🔗 atphoenix <OrIdow6> Oh, I see
16:15 🔗 atphoenix <atphoenix> does ArchiveTeam have a place it can keep/run a live/searchable database large enough to store (billions of) shortener lookup results?
16:15 🔗 atphoenix <Somebody2> ArchiveTeam itself? probably not
16:15 🔗 atphoenix <Somebody2> but some volunteer? maybe
16:15 🔗 atphoenix <atphoenix> I know IA does bulk storage, but that's not the same as having a DB we can use to run the project
16:15 🔗 atphoenix <Somebody2> yep
16:15 🔗 atphoenix <atphoenix> Somebody2, should I copy that urlteam proposal somewhere? To the wiki perhaps? (yes I know the urlteam wiki needs reorg)
16:15 🔗 atphoenix <Somebody2> atphoenix: yes please!
16:15 🔗 atphoenix <Somebody2> or at least repeat it in #urlteam
16:15 🔗 atphoenix <Somebody2> atphoenix: see above
16:15 🔗 atphoenix <JAA> Somebody2: Most importantly, URLTeam should be changed to produce WARCs instead of text files.
16:15 🔗 atphoenix <JAA> I believe that's been on the todo list since the very early days.
16:16 🔗 atphoenix above was copied from #archiveteam-ot
16:18 🔗 astrid yes that's been a wishlist item for most of a decade :)
16:23 🔗 atphoenix let's hope this decade looks a bit different :)
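The extraction idea in the copied conversation — digging shortcodes out of ArchiveBot's CDX indexes — could be sketched roughly as below. The shortener list and the CDX field layout are illustrative assumptions, not project code:

```python
# Rough sketch: pull candidate shortlinks out of a CDX index file.
# SHORTENER_HOSTS and the field index are assumptions for illustration.
from urllib.parse import urlparse

SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co"}  # hypothetical sample

def extract_shortlinks(cdx_lines):
    """Yield (host, shortcode) pairs for URLs on known shortener hosts.

    Assumes the common space-separated CDX layout where the third
    field is the original URL; other CDX variants need a different index.
    """
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        parsed = urlparse(fields[2])
        code = parsed.path.lstrip("/")
        if parsed.hostname in SHORTENER_HOSTS and code:
            yield parsed.hostname, code
```

As JAA notes above, scanning the WARCs instead would additionally catch shortlinks that appear in page bodies but were never requested by AB.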
16:43 🔗 Wingy has quit IRC (Read error: Operation timed out)
16:53 🔗 Wingy has joined #urlteam
17:08 🔗 marked1 I can't imagine what's difficult in using warc instead of text
17:17 🔗 marked1 oh, text files are created by the tracker as part of export?
17:24 🔗 oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
17:50 🔗 JAA Yes, URLTeam works differently from all other DPoS projects.
17:51 🔗 JAA Workers send back shortcode-longurl mappings to the tracker, they get stored in Redis and dumped to text files periodically.
17:56 🔗 Kagee to be honest, i would hope one would continue producing text as well, as that is all *I* need :)
17:56 🔗 JAA Sure, that can be done pretty easily.
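The tracker flow JAA describes (mappings into Redis, periodic text dumps) could look roughly like this sketch, with a plain dict standing in for Redis; the pipe-separated `shortcode|longurl` line format is an assumption modeled on the published dumps:

```python
# Sketch of the tracker-side export JAA describes: workers report
# shortcode -> longurl mappings, which are periodically dumped to a
# plain text file. A dict stands in for Redis here, and the
# "shortcode|longurl" line format is an assumption.
import io

def dump_mappings(mappings, out):
    """Write each mapping as one 'shortcode|longurl' line."""
    for code in sorted(mappings):
        out.write(f"{code}|{mappings[code]}\n")

mappings = {"abc123": "http://example.com/long-page"}  # hypothetical data
buf = io.StringIO()
dump_mappings(mappings, buf)
```

Keeping this text export alongside any new WARC output would cover use cases like Kagee's, which only need the URLs themselves.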
18:21 🔗 marked1 so I'm proposing warcio on the client side. The perverse question is: do you want warc data in the tracker DB? Otherwise you'd need a parallel megawarc factory path or similar
18:30 🔗 astrid parallel warc factory feels cleaner to me
18:41 🔗 VADemon has joined #urlteam
18:41 🔗 Zerote_ has joined #urlteam
18:43 🔗 Somebody2 yeah, parallel path seems fine
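For the parallel WARC path, each resolved shortlink would become a WARC response record preserving the full redirect headers (which, per Somebody2 above, the existing text data does not). A minimal hand-rolled record following the WARC/1.0 layout might look like this; in practice a library such as warcio would build it, and the URLs here are hypothetical:

```python
# Sketch: build a WARC/1.0 response record for a shortener redirect,
# keeping the full HTTP headers. The URLs below are hypothetical.
import uuid
from datetime import datetime, timezone

def warc_redirect_record(short_url, long_url):
    """Return the bytes of one WARC response record wrapping a 301."""
    http = (f"HTTP/1.1 301 Moved Permanently\r\n"
            f"Location: {long_url}\r\n\r\n").encode()
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {short_url}\r\n"
        f"WARC-Date: {date}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http)}\r\n\r\n"
    ).encode()
    return headers + http + b"\r\n\r\n"

record = warc_redirect_record("http://bit.ly/abc123",
                              "http://example.com/long-page")
```

Records like this could be batched client-side and shipped through a separate megawarc-style path, leaving the existing text export untouched.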
18:44 🔗 Somebody2 and thanks for copying that in, atphoenix
18:44 🔗 Zerote has quit IRC (Ping timeout: 276 seconds)
19:01 🔗 marked1 would the .warc export and .txt exports need to align or synchronize?
19:12 🔗 Somebody2 marked1: not particularly -- what do you mean?
19:27 🔗 marked1 would it be weird to have text files that are exported every hour, and warcs that are dumped every X GB
19:39 🔗 Zerote__ has joined #urlteam
19:44 🔗 Zerote_ has quit IRC (Read error: Operation timed out)
20:15 🔗 atphoenix I don't think that would be weird. I think the use-cases and consumers of the results are different. IMO the primary consumer of WARC results would be IA for WBM integration (assumes IA is interested in taking these results for integration).
20:17 🔗 atphoenix Kagee, can you explain your use-case more? How do you use the results?
20:22 🔗 Kagee atphoenix: my use is very limited, but I'm only interested in the urls. Urlteam is one of the sources I scrape for domains in the .no TLD, as the .no registry does not publish a list of domains.
20:25 🔗 Kagee There is a lot of overlap in the datasets, but urlteam data sums up to about 70k .no domains atm, approx 10% of the total.
21:16 🔗 oxguy3 has joined #urlteam
23:45 🔗 picklefac has joined #urlteam
