[01:38] *** nyany_ has quit IRC (Read error: Operation timed out)
[01:52] *** nyany_ has joined #urlteam
[02:57] *** kiska3 has quit IRC (irc.efnet.nl efnet.deic.eu)
[02:57] *** kiska3 has joined #urlteam
[02:58] *** MrRadar2 has quit IRC (ny.us.hub irc.efnet.nl)
[02:59] *** MrRadar2 has joined #urlteam
[02:59] *** nyany_ has quit IRC (Read error: Operation timed out)
[03:01] *** nyany_ has joined #urlteam
[05:06] *** odemg has quit IRC (Ping timeout: 745 seconds)
[05:10] *** odemg has joined #urlteam
[06:11] *** kiska has quit IRC (Read error: Operation timed out)
[06:11] *** Wingy has quit IRC (Read error: Operation timed out)
[06:12] *** Zerote has joined #urlteam
[06:12] *** britmob_ has joined #urlteam
[06:14] *** Zerote__ has quit IRC (Read error: Operation timed out)
[06:14] *** britmob has quit IRC (Read error: Operation timed out)
[06:19] *** systwi_ has quit IRC (Ping timeout: 622 seconds)
[06:27] *** systwi has joined #urlteam
[06:28] *** oxguy3 has joined #urlteam
[06:30] *** Wingy has joined #urlteam
[07:09] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[07:16] *** oxguy3 has joined #urlteam
[07:30] *** kiska has joined #urlteam
[07:30] *** svchfoo3 sets mode: +o kiska
[11:41] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[11:53] *** oxguy3 has joined #urlteam
[13:04] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[14:17] *** oxguy3 has joined #urlteam
[16:14] Here's a relevant conversation from another channel, including a feature/design proposal from myself:
[16:14] I'm watching http://dashboard.at.ninjawedding.org/?showNicks=1 and some of the jobs seem to encounter short URLs. Does archivebot feed those back into the http://urlte.am/ project?
[16:14] or does http://urlte.am/ simply stick to its brute-force method?
[16:14] prq: best to ask your question about urlteam in the #urlteam channel
[16:14] but to answer it anyway -- no, it doesn't (yet)
[16:14] urlteam is all about brute-force scanning the possibilities, not targeted efforts
[16:14] but it would be lovely if someone wanted to dig thru the full ArchiveBot corpus and *extract* all the shortcodes...
[16:14] Somebody2, by ArchiveBot corpus, do you mean the AB source code, or the web pages saved by AB, or is there a URL list of every URL AB has visited?
[16:14] You won't find anything in the source code, but either the WARCs ("web pages saved by AB") or the CDXs ("URL list of every URL AB has visited") would work.
[16:14] Extracting from the WARCs would discover more URLs, because it would also find shortlinks that weren't attempted by AB for whatever reason (recursion limits such as off-site or !ao jobs, ignores, etc.).
[16:14] All AB WARCs are probably well over 600 TiB by now, but I don't have any current numbers.
[16:14] I don't see the point of looking through the CDXs - wouldn't the ones listed there have been captured anyhow?
[16:14] atphoenix: the web pages saved by AB, all of which can be downloaded by anyone from IA.
[16:14] atphoenix: a list of all the urls visited could be extracted from that
[16:14] OrIdow6: sure, but extracting them into the format used by Urlteam would mean they could be used by whatever can process those
[16:14] there's been various talk about setting up a "dead shortener" site that would hold all the URLs for shorteners that don't exist anymore
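The extraction idea floated above could start out as simple as the following sketch, which uses the warcio library to scan a downloaded ArchiveBot WARC for links pointing at shortener hosts. The SHORTENERS set and the link regex are illustrative placeholders, not URLTeam's actual target list:

```python
# Minimal sketch: scan one ArchiveBot WARC and pull out links that point
# at known URL shorteners. Hosts and regex below are examples only.
import re
import sys

from warcio.archiveiterator import ArchiveIterator

SHORTENERS = {'bit.ly', 'tinyurl.com', 'goo.gl', 't.co', 'ow.ly'}  # illustrative
LINK_RE = re.compile(rb'https?://([a-z0-9.-]+)/([A-Za-z0-9_-]+)')

def extract_shortlinks(warc_path):
    """Yield (host, shortcode) pairs found in the response bodies of a WARC."""
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            body = record.content_stream().read()
            for match in LINK_RE.finditer(body):
                host = match.group(1).decode()
                if host in SHORTENERS:
                    yield host, match.group(2).decode()

if __name__ == '__main__':
    for host, code in extract_shortlinks(sys.argv[1]):
        print(f'{host}\t{code}')
```

At the 600+ TiB corpus size mentioned above, this would of course have to run distributed across many workers rather than on one machine.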
[16:14] To make this whole idea work better, I'd think it would make sense to first rework the urlteam effort to allow it to
[16:15] 1.) ingest lists of known used shortened URLs, and
[16:15] 2.) store found URL results in a searchable DB alongside the date the URL was most recently resolved
[16:15] Resolving known used shortened URLs would take priority. If a URL was re-resolved later and the result was the same, update the date. If different, create a new DB entry with the date of resolution. This could also work to merge in externally resolved lists (leave the last-resolved date empty if an externally resolved list does not have one). Also keep a field that links to metadata about ingested lists.
[16:15] Somebody2: What advantage would that have over the WBM?
[16:15] OrIdow6: We can't load most of the urlteam data into the WBM, as we didn't store the full headers.
[16:15] atphoenix: yes, that would be a lovely *additional* thing to do!
[16:15] Oh, I see
[16:15] does ArchiveTeam have a place it can keep/run a live/searchable database large enough to store (billions of) shortener lookup results?
[16:15] ArchiveTeam itself? probably not
[16:15] but some volunteer? maybe
[16:15] I know IA does bulk storage, but that's not the same as having a DB we can use to run the project
[16:15] yep
[16:15] Somebody2, should I copy that urlteam proposal somewhere? To the wiki perhaps? (yes, I know the urlteam wiki needs a reorg)
[16:15] atphoenix: yes please!
[16:15] or at least repeat it in #urlteam
[16:15] atphoenix: see above
[16:15] Somebody2: Most importantly, URLTeam should be changed to produce WARCs instead of text files.
[16:15] I believe that's been on the todo list since the very early days.
[16:16] above was copied from #archiveteam-ot
[16:18] yes, that's been a wishlist item for most of a decade :)
[16:23] let's hope this decade looks a bit different :)
[16:43] *** Wingy has quit IRC (Read error: Operation timed out)
[16:53] *** Wingy has joined #urlteam
[17:08] I can't imagine what's difficult in using warc instead of text
[17:17] oh, text files are created by the tracker as part of export?
[17:24] *** oxguy3 has quit IRC (Quit: My MacBook has gone to sleep. ZZZzzz…)
[17:50] Yes, URLTeam works differently from all other DPoS projects.
[17:51] Workers send shortcode-longurl mappings back to the tracker; they get stored in Redis and dumped to text files periodically.
[17:56] to be honest, I would hope one would continue producing text as well, as that is all *I* need :)
[17:56] Sure, that can be done pretty easily.
[18:21] so I'm proposing warcio on the client side. The perverse question, then, is whether you want warc data in the tracker DB; otherwise you'd need a parallel megawarc factory path or similar
[18:30] parallel warc factory feels cleaner to me
[18:41] *** VADemon has joined #urlteam
[18:41] *** Zerote_ has joined #urlteam
[18:43] yeah, parallel path seems fine
[18:44] and thanks for copying that in, atphoenix
[18:44] *** Zerote has quit IRC (Ping timeout: 276 seconds)
[19:01] would the .warc and .txt exports need to align or synchronize?
[19:12] marked1: not particularly -- what do you mean?
[19:27] would it be weird to have text files that are exported every hour, and warcs that are dumped every X GB?
[19:39] *** Zerote__ has joined #urlteam
[19:44] *** Zerote_ has quit IRC (Read error: Operation timed out)
[20:15] I don't think that would be weird. I think the use-cases and consumers of the results are different. IMO the primary consumer of WARC results would be IA for WBM integration (assuming IA is interested in taking these results for integration).
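As a sketch of the client-side warcio path proposed at [18:21], each resolved mapping could be written out as a WARC response record. The synthesized 301 below is for illustration only; a real client would record the shortener's actual HTTP exchange, headers and all:

```python
# Rough sketch of "warcio on the client side": write each shortener
# redirect as a WARC response record instead of a text line.
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def write_redirect(writer, short_url, long_url):
    """Append one shortcode-longurl resolution to an open WARC."""
    # Illustrative synthesized response; a real client would capture
    # the shortener's genuine status line and headers.
    http_headers = StatusAndHeaders(
        '301 Moved Permanently',
        [('Location', long_url), ('Content-Length', '0')],
        protocol='HTTP/1.1')
    record = writer.create_warc_record(short_url, 'response',
                                       http_headers=http_headers)
    writer.write_record(record)

# Example usage with a hypothetical shortener URL:
with open('urlteam-redirects.warc.gz', 'wb') as fh:
    writer = WARCWriter(fh, gzip=True)
    write_redirect(writer, 'https://example-shortener.test/abc123',
                   'https://example.com/some/long/page')
```

A parallel path on the tracker side could then batch these uploads alongside the existing hourly text dumps, per the discussion above.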
[20:17] Kagee, can you explain your use-case more? How do you use the results?
[20:22] atphoenix: my use is very limited; I'm only interested in the urls. Urlteam is one of the sources I scrape for domains in the .no TLD, as the .no registry does not publish a list of domains.
[20:25] There is a lot of overlap in the datasets, but urlteam data sums up to about 70k .no domains atm, approx 10% of the total.
[21:16] *** oxguy3 has joined #urlteam
[23:45] *** picklefac has joined #urlteam
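Kagee's scrape could look something like the sketch below; the `shortcode|longurl` line format is an assumption about the URLTeam release files and may need adjusting to match the dumps you actually have:

```python
# Sketch of Kagee's use-case: collect distinct .no hostnames from a
# URLTeam text dump. Assumes one "<shortcode>|<long-url>" pair per line.
from urllib.parse import urlsplit

def no_domains(dump_path):
    """Return sorted .no hostnames found in a shortcode|url text dump."""
    domains = set()
    with open(dump_path, encoding='utf-8', errors='replace') as fh:
        for line in fh:
            try:
                _, long_url = line.rstrip('\n').split('|', 1)
            except ValueError:
                continue  # skip lines that don't match the assumed format
            host = urlsplit(long_url).hostname
            if host and host.endswith('.no'):
                domains.add(host)
    return sorted(domains)
```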