#urlteam 2019-12-30,Mon


Time Nickname Message
05:02 🔗 odemg has quit IRC (Read error: Operation timed out)
05:16 🔗 odemg has joined #urlteam
09:01 🔗 Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat)
11:53 🔗 mtntmnky has quit IRC (Remote host closed the connection)
11:53 🔗 mtntmnky has joined #urlteam
15:15 🔗 Craigle has joined #urlteam
16:02 🔗 Smiley has quit IRC (Read error: Operation timed out)
16:03 🔗 Smiley has joined #urlteam
17:42 🔗 mtntmnky_ has joined #urlteam
17:55 🔗 mtntmnky has quit IRC (Remote host closed the connection)
18:14 🔗 VoynichCr has quit IRC (Ping timeout: 258 seconds)
20:45 🔗 Smiley has quit IRC (irc.efnet.nl efnet.deic.eu)
20:45 🔗 nepeat has quit IRC (irc.efnet.nl efnet.deic.eu)
20:45 🔗 Terbium has quit IRC (irc.efnet.nl efnet.deic.eu)
20:45 🔗 kiskaWee has quit IRC (irc.efnet.nl efnet.deic.eu)
20:45 🔗 SmileyG has joined #urlteam
20:49 🔗 Terbium_ has joined #urlteam
21:29 🔗 atphoenix has quit IRC (Quit: Leaving)
21:30 🔗 atphoenix has joined #urlteam
22:41 🔗 swebb FYI: I've started running my "find all of the new url shorteners used in the twitter stream for 2019" by un-rolling all urls posted to twitter in the 2% spritzer stream. I'm starting out with Jan-Nov 2019 and I've got 78 million initial t.co urls to crawl.
22:43 🔗 swebb It'll probably take until well into 2020 before the crawl is actually done. LOL
22:44 🔗 JAA Grabbing the redirects into WARC I hope? :-)
22:44 🔗 JAA Sounds like a great project!
22:46 🔗 swebb I do it every couple of years at the end of the year... Last time I did it was in 2015.
22:46 🔗 swebb https://gist.github.com/scumola/5216839
22:47 🔗 swebb I'm not storing the urls in WARC, but I can provide the SRC, DEST urls (which url redirected to which other url) if you guys want them.
22:49 🔗 atphoenix swebb: are you also getting the specific used shortened URLs?
22:50 🔗 atphoenix i.e. recording the bit.ly/1234 was used?
22:51 🔗 JAA Ah, so like URLTeam as simple mappings. That's fine as well. Eventually, we can regrab everything as WARC to get them into the WBM.
22:52 🔗 swebb Yea, I'm storing the SRC url (the shortener url) and the destination url. If there are intermediate shorteners/forwarders in the path, I'll be grabbing those as well. My crawler doesn't automatically follow all redirects, I'm manually following them, so I can record them all.
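
The hop-by-hop approach swebb describes could look roughly like the sketch below. It is not swebb's crawler: the function names `fetch_location` and `unroll`, the use of HEAD requests, and the hop limit are all assumptions made for illustration. The point is only that a client which never follows redirects on its own lets every intermediate shortener be recorded as a (SRC, DEST) pair.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_location(url, timeout=10):
    """Issue one HEAD request and return the redirect target, or None.

    http.client never follows redirects itself, so each hop stays visible.
    Assumption: the shortener answers HEAD; some only answer GET.
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        if resp.status in REDIRECT_CODES:
            location = resp.getheader("Location")
            # Location may be relative; resolve it against the current URL.
            return urljoin(url, location) if location else None
        return None
    finally:
        conn.close()


def unroll(url, fetch=fetch_location, max_hops=10):
    """Return one (src, dest) pair per redirect hop, in order."""
    hops = []
    seen = {url}
    for _ in range(max_hops):
        dest = fetch(url)
        if dest is None:
            break
        hops.append((url, dest))
        if dest in seen:  # redirect loop: record the hop, then stop
            break
        seen.add(dest)
        url = dest
    return hops
```

Passing a different `fetch` callable (for example a dict's `.get`) makes the chain logic easy to exercise without touching the network.
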
22:53 🔗 swebb atphoenix: so yea.
22:53 🔗 atphoenix recommendation: keep the dates/times related to shortened URLs. Like date when it is resolved. If scraped from web, when it was seen on the web. If part of a post with a date, date that post was made.
22:54 🔗 atphoenix reason for dates: some shorteners expire the URLs. That means re-use may also happen.
22:55 🔗 swebb I'm not storing date/times. Sorry. I'm not storing any headers either. Mainly it's for discovery of new shortener services.
22:56 🔗 JAA Now I want to launch a URL shortening service that changes all redirects to https://invidio.us/watch?v=dQw4w9WgXcQ after a few hours. :-P
22:57 🔗 JAA Or operate it normally for a while, then start that when people use it widely. >:-D
22:57 🔗 atphoenix swebb: it sounds like you are resolving the URL you find? If you can add a column to your results of 'date resolved', I think it'll help future integration.
23:00 🔗 mtntmnky_ has quit IRC (Remote host closed the connection)
23:00 🔗 mtntmnky_ has joined #urlteam
23:02 🔗 swebb Well, they're all within the last couple of days. How's that? :)
23:08 🔗 swebb The initial url database was 4.4G. I'm going to try to whittle it down to something I can post in a gist like I did in 2015.
23:12 🔗 swebb There is a note of my 2015 stuff on https://www.archiveteam.org/index.php/URLTeam/unsorted at the top of the page. Once I'm done, I can append my new results.
23:21 🔗 atphoenix swebb: I only ask that somehow that dates be kept; with the minimum being the date the URL was resolved from short to long. This way we'll be able to properly time-sort, and have a shot at getting them into the right place on the WBM timelines. Especially if the resolutions change.
23:22 🔗 atphoenix My vision is that all these discrete lists of resolved URLs can be merged, with dates as a way to better handle potential collisions
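
A minimal sketch of the merge atphoenix has in mind, assuming each list is a sequence of (short URL, long URL, date resolved) records with ISO 8601 date strings. The record layout and the names `merge_mappings` and `collisions` are hypothetical; no such format is agreed on in this discussion.

```python
from collections import defaultdict


def merge_mappings(*sources):
    """Merge (short_url, long_url, resolved_on) records from several lists.

    Returns {short_url: [(resolved_on, long_url), ...]} sorted by date, with
    exact duplicates dropped. resolved_on is an ISO 8601 date string, so
    plain string ordering is also chronological ordering.
    """
    merged = defaultdict(set)
    for source in sources:
        for short_url, long_url, resolved_on in source:
            merged[short_url].add((resolved_on, long_url))
    return {short: sorted(obs) for short, obs in merged.items()}


def collisions(merged):
    """Short URLs that resolved to more than one destination over time."""
    return {short: obs for short, obs in merged.items()
            if len({long_url for _, long_url in obs}) > 1}
```

Without the date column, two lists that disagree about the same short URL cannot be ordered, which is the collision case raised above for shorteners that expire and re-use codes.
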
23:23 🔗 swebb atphoenix: I'm doing the crawl over a period of a few days and these are all urls from early in 2019, so if they haven't changed by now, the odds that they're going to change in the next couple of days are very slim. I can keep a timestamp that I crawled the urls if that will make you happy, but the goal of my crawl is to discover new shorteners, not crawl/un-shorten masses of urls. That's the warrior's job.
23:23 🔗 atphoenix JAA: try shortener domain jaa.ru.le (you can have the TLD of .le :)
23:30 🔗 JAA atphoenix: I'd still like to have jaa.at, but there's no way I'll ever have it unless I pay at least €€€€.
23:31 🔗 JAA swebb: I think start and end time of the crawl are perfectly sufficient for this.
23:32 🔗 swebb I'm guessing the full crawl should be done in less than a week.
23:33 🔗 JAA Nice. How many URLs (roughly)?
23:33 🔗 JAA Oh wait, you said it above, 78M.
23:33 🔗 JAA Neat.
23:34 🔗 swebb The starting url list is 78M urls, but if there are intermediate url shorteners in the chain of redirects, it could be much more.
23:34 🔗 JAA Right
23:34 🔗 JAA I've seen a chain of 4 before I think. People be stupid.
23:35 🔗 swebb I've got the system pretty smoothly running now with very little babysitting and I'm doing about 250M urls every couple of hours with no throttling by twitter (yet).
23:35 🔗 swebb 3 mins to process 10k urls.
23:36 🔗 JAA Nice. Yeah, Twitter doesn't seem to care about request rates. How's this set up hardware-wise?
23:38 🔗 swebb Technologies in use: Mysql for the master DB, rabbitmq for the crawl queue (10k urls queued at a time), python is doing the crawling, then back to rabbit with the results, then another process takes the results from a queue and dumps back into mysql the un-wrapped url and any newly-discovered urls that may also be shorteners. 10 crawlers are running now. All running on a single 8-core, 16G ram linux machine w/ 1Gbit internet.
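
The data flow in that setup can be sketched with the standard library alone. This is a stand-in, not swebb's code: `queue.Queue` replaces RabbitMQ, a plain dict replaces the MySQL master table, threads replace the separate crawler processes, and the writer runs in the main thread. `run_pipeline` and `resolve` are hypothetical names.

```python
import queue
import threading


def run_pipeline(seed_urls, resolve, n_crawlers=10):
    """resolve(url) returns the destination URL, or None if it does not redirect."""
    work = queue.Queue()      # stand-in for the RabbitMQ crawl queue
    results = queue.Queue()   # stand-in for the RabbitMQ results queue
    store = {}                # stand-in for the MySQL master table: src -> dest

    def crawler():
        while True:
            url = work.get()
            if url is None:   # shutdown sentinel
                break
            try:
                dest = resolve(url)
            except Exception:
                dest = None   # treat a failed fetch as "no redirect"
            results.put((url, dest))

    threads = [threading.Thread(target=crawler) for _ in range(n_crawlers)]
    for t in threads:
        t.start()

    pending = 0
    for url in seed_urls:
        if url not in store:
            store[url] = None
            work.put(url)
            pending += 1

    # Writer loop: record each result and queue any destination not seen yet,
    # since it may itself be a shortener.
    while pending:
        src, dest = results.get()
        pending -= 1
        store[src] = dest
        if dest is not None and dest not in store:
            store[dest] = None
            work.put(dest)
            pending += 1

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return store
```

Feeding newly seen destinations back into the crawl queue is what lets chains of several shorteners be unwrapped, and it is also why the total number of URLs crawled can exceed the starting list.
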
23:42 🔗 swebb screenshot: https://imgur.com/gallery/C3NJssi
23:46 🔗 JAA Very nice!
23:48 🔗 swebb I'm interested to see how well it's running in 24 hours. :)
