[05:02] *** odemg has quit IRC (Read error: Operation timed out)
[05:16] *** odemg has joined #urlteam
[09:01] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
[11:53] *** mtntmnky has quit IRC (Remote host closed the connection)
[11:53] *** mtntmnky has joined #urlteam
[15:15] *** Craigle has joined #urlteam
[16:02] *** Smiley has quit IRC (Read error: Operation timed out)
[16:03] *** Smiley has joined #urlteam
[17:42] *** mtntmnky_ has joined #urlteam
[17:55] *** mtntmnky has quit IRC (Remote host closed the connection)
[18:14] *** VoynichCr has quit IRC (Ping timeout: 258 seconds)
[20:45] *** Smiley has quit IRC (irc.efnet.nl efnet.deic.eu)
[20:45] *** nepeat has quit IRC (irc.efnet.nl efnet.deic.eu)
[20:45] *** Terbium has quit IRC (irc.efnet.nl efnet.deic.eu)
[20:45] *** kiskaWee has quit IRC (irc.efnet.nl efnet.deic.eu)
[20:45] *** SmileyG has joined #urlteam
[20:49] *** Terbium_ has joined #urlteam
[21:29] *** atphoenix has quit IRC (Quit: Leaving)
[21:30] *** atphoenix has joined #urlteam
[22:41] FYI: I've started running my "find all of the new url shorteners used in the twitter stream for 2019" project by un-rolling all urls posted to twitter in the 2% spritzer stream. I'm starting out with Jan-Nov 2019 and I've got 78 million initial t.co urls to crawl.
[22:43] It'll probably take until well into 2020 before the crawl is actually done. LOL
[22:44] Grabbing the redirects into WARC I hope? :-)
[22:44] Sounds like a great project!
[22:46] I do it every couple of years at the end of the year... The last time I did it was in 2015.
[22:46] https://gist.github.com/scumola/5216839
[22:47] I'm not storing the urls in WARC, but I can provide the SRC, DEST urls (which url redirected to which other url) if you guys want them.
[22:49] swebb: are you also recording the specific shortened URLs that were used?
[22:50] i.e. recording that bit.ly/1234 was used?
[22:51] Ah, so like URLTeam as simple mappings. That's fine as well. Eventually, we can regrab everything as WARC to get them into the WBM.
[22:52] Yea, I'm storing the SRC url (the shortener url) and the destination url. If there are intermediate shorteners/forwarders in the path, I'll be grabbing those as well. My crawler doesn't automatically follow all redirects; I'm following them manually, so I can record them all.
[22:53] atphoenix: so yea.
[22:53] recommendation: keep the dates/times related to shortened URLs. Like the date when it is resolved. If scraped from the web, when it was seen on the web. If part of a post with a date, the date that post was made.
[22:54] reason for dates: some shorteners expire the URLs. That means re-use may also happen.
[22:55] I'm not storing date/times. Sorry. I'm not storing any headers either. Mainly it's for discovery of new shortener services.
[22:56] Now I want to launch a URL shortening service that changes all redirects to https://invidio.us/watch?v=dQw4w9WgXcQ after a few hours. :-P
[22:57] Or operate it normally for a while, then start that when people use it widely. >:-D
[22:57] swebb: it sounds like you are resolving the URLs you find? If you can add a 'date resolved' column to your results, I think it'll help future integration.
[23:00] *** mtntmnky_ has quit IRC (Remote host closed the connection)
[23:00] *** mtntmnky_ has joined #urlteam
[23:02] Well, they're all within the last couple of days. How's that? :)
[23:08] The initial url database was 4.4G. I'm going to try to whittle it down to something I can post in a gist like I did in 2015.
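A minimal sketch of the hop-by-hop resolution described at [22:52], assuming Python and the requests library (the crawler is described as Python later in the log); the function name, hop limit, and example URL are illustrative and not the actual crawler code:

```python
# Sketch: resolve a short URL hop by hop so every (src, dest) pair in the
# redirect chain can be recorded, not just the final destination.
import requests
from urllib.parse import urljoin

def resolve_chain(url, max_hops=10, timeout=10):
    """Follow redirects manually; return a list of (src, dest) pairs."""
    hops = []
    current = url
    for _ in range(max_hops):
        resp = requests.head(current, allow_redirects=False, timeout=timeout)
        location = resp.headers.get("Location")
        if resp.status_code not in (301, 302, 303, 307, 308) or not location:
            break  # no further redirect: current is the final destination
        dest = urljoin(current, location)  # Location may be a relative URL
        hops.append((current, dest))
        current = dest
    return hops

# Hypothetical usage: each intermediate shortener shows up as its own pair.
# for src, dest in resolve_chain("https://t.co/example"):
#     print(src, "->", dest)
```

Following each redirect manually rather than letting the HTTP client auto-follow is what makes it possible to record every intermediate shortener in the chain, which is the point swebb makes at [22:52].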
[23:12] There is a note of my 2015 stuff on https://www.archiveteam.org/index.php/URLTeam/unsorted at the top of the page. Once I'm done, I can append my new results.
[23:21] swebb: I only ask that the dates somehow be kept, with the minimum being the date the URL was resolved from short to long. This way we'll be able to properly time-sort, and have a shot at getting them into the right place on the WBM timelines. Especially if the resolutions change.
[23:22] My vision is that all these discrete lists of resolved URLs can be merged, with dates as a way to better handle potential collisions.
[23:23] atphoenix: I'm doing the crawl over a period of a few days and these are all urls from early in 2019, so if they haven't changed by now, the odds that they're going to change in the next couple of days are very slim. I can keep a timestamp of when I crawled the urls if that will make you happy, but the goal of my crawl is to discover new shorteners, not crawl/un-shorten masses of urls. That's the warrior's job.
[23:23] JAA: try shortener domain jaa.ru.le (you can have the TLD of .le :)
[23:30] atphoenix: I'd still like to have jaa.at, but there's no way I'll ever have it unless I pay at least €€€€.
[23:31] swebb: I think start and end time of the crawl are perfectly sufficient for this.
[23:32] I'm guessing the full crawl should be done in less than a week.
[23:33] Nice. How many URLs (roughly)?
[23:33] Oh wait, you said it above, 78M.
[23:33] Neat.
[23:34] The starting url list is 78M urls, but if there are intermediate url shorteners in the chain of redirects, it could be much more.
[23:34] Right
[23:34] I've seen a chain of 4 before I think. People be stupid.
[23:35] I've got the system running pretty smoothly now with very little babysitting and I'm doing about 250M urls every couple of hours with no throttling by twitter (yet).
[23:35] 3 mins to process 10k urls.
[23:36] Nice. Yeah, Twitter doesn't seem to care about request rates. How's this set up hardware-wise?
[23:38] Technologies in use: MySQL for the master DB, RabbitMQ for the crawl queue (10k urls queued at a time), Python is doing the crawling, then back to Rabbit with the results, then another process takes the results from a queue and dumps the un-wrapped url and any newly-discovered urls that may also be shorteners back into MySQL. 10 crawlers are running now. All running on a single 8-core, 16G RAM linux machine w/ 1Gbit internet.
[23:42] screenshot: https://imgur.com/gallery/C3NJssi
[23:46] Very nice!
[23:48] I'm interested to see how well it's running in 24 hours. :)
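A minimal sketch of the queue side of the pipeline described at [23:38], assuming the pika client for RabbitMQ; the queue names, message format, and prefetch value are assumptions, and resolve_chain refers to the hop-by-hop resolver sketched earlier:

```python
# Sketch: one crawler worker pulls short URLs from a RabbitMQ "crawl" queue,
# resolves each redirect chain, and publishes (src, dest) mappings to a
# "results" queue for a separate process to write into MySQL.
import json
import pika

def on_message(channel, method, properties, body):
    src = body.decode()
    for hop_src, hop_dest in resolve_chain(src):  # resolver from the earlier sketch
        channel.basic_publish(
            exchange="",
            routing_key="results",
            body=json.dumps({"src": hop_src, "dest": hop_dest}),
        )
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="crawl")
channel.queue_declare(queue="results")
channel.basic_qos(prefetch_count=100)  # keep a batch in flight per worker
channel.basic_consume(queue="crawl", on_message_callback=on_message)
channel.start_consuming()
```

A separate consumer on the results queue would then insert the unwrapped URLs into MySQL and feed any newly discovered shortener domains back into the crawl queue, matching the loop described at [23:38].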