#urlteam 2015-07-15,Wed


Time Nickname Message
04:37 🔗 aaaaaaaaa has quit IRC (Leaving)
05:39 🔗 Start_ has joined #urlteam
05:39 🔗 Start has quit IRC (Read error: Connection reset by peer)
05:55 🔗 bsmith093 has quit IRC (Read error: Operation timed out)
06:11 🔗 Start has joined #urlteam
06:11 🔗 svchfoo1 sets mode: +o Start
06:12 🔗 Start_ has quit IRC (Read error: Connection reset by peer)
08:06 🔗 toad2 has joined #urlteam
08:08 🔗 toad1 has quit IRC (Read error: Operation timed out)
09:19 🔗 svchfoo2 has quit IRC (Ping timeout: 265 seconds)
11:35 🔗 matthusby has quit IRC (ZNC - http://znc.in)
12:05 🔗 dashcloud has quit IRC (Read error: Operation timed out)
12:11 🔗 dashcloud has joined #urlteam
15:58 🔗 T31M has joined #urlteam
17:39 🔗 dashcloud has quit IRC (Read error: Operation timed out)
17:44 🔗 aaaaaaaaa has joined #urlteam
17:44 🔗 swebb sets mode: +o aaaaaaaaa
17:45 🔗 dashcloud has joined #urlteam
18:29 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:35 🔗 dashcloud has joined #urlteam
18:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
18:59 🔗 dashcloud has joined #urlteam
19:18 🔗 aaaaaaaaa has quit IRC (Leaving)
19:33 🔗 aaaaaaaaa has joined #urlteam
19:33 🔗 swebb sets mode: +o aaaaaaaaa
19:54 🔗 aaaaaaaaa has quit IRC (Leaving)
20:35 🔗 dashcloud has quit IRC (Read error: Operation timed out)
20:38 🔗 dashcloud has joined #urlteam
20:43 🔗 TNPuAin has joined #urlteam
20:44 🔗 TNPuAin Hello, anyone from the URLTeam that can answer a question I have?
21:00 🔗 TNPuAin I am doing a research paper on link rot, life span of url shorteners, and typos in domains which lead to 404 pages
21:01 🔗 TNPuAin I was wondering if there is access to the harvested short urls that, rather than leading to a new domain, instead lead to a 404 page
21:08 🔗 arkiver I think there is access
21:08 🔗 arkiver chfoo probably knows more about that than me though
21:12 🔗 chfoo we just have unshortened urls. i don't think anyone has done any processing on them
21:17 🔗 TNPuAin you have the URLs that were harvested from random websites before they were checked to see where they redirected to?
21:19 🔗 achip for reference, here's an example of what one of the datasets from URLTeam looks like http://paste.nerds.io/raw/usepasodoc, in this case it's is.gd urls
21:20 🔗 TNPuAin thanks achip, I have the torrent downloaded and have gone through the files
21:21 🔗 TNPuAin my understanding is that before each of those files is made, like the one you gave an example of, the short urls are first scraped from the web and then followed to their destination
21:21 🔗 TNPuAin what about the short urls that are harvested, but when followed, lead to a 404 page? are they just discarded from the dataset?
21:33 🔗 chfoo we don't follow urls. we just brute force the url shortener server and save whatever the server returns back
21:35 🔗 TNPuAin so you take a url such as bit.ly/a and then continue in order like bit.ly/aa bit.ly/ab bit.ly/ac etc?
21:35 🔗 xmc yep
21:35 🔗 xmc you can follow the urls yourself, either manually or with a script
21:35 🔗 TNPuAin ah, I thought the URLs were scraped from the web
21:35 🔗 xmc however, most of the urls we have scraped don't come with a timestamp
21:35 🔗 xmc nope :]
21:36 🔗 TNPuAin do you know of any possible way I can find lists of short urls that were actually scraped off of webpages?
21:38 🔗 xmc i doubt it
21:40 🔗 TNPuAin does anyone know of any public data sets that contain large lists of urls (including short urls) that were scraped from the web? I have seen common crawl but don't have the knowledge to be able to extract urls from their data
22:28 🔗 aaaaaaaaa has joined #urlteam
22:28 🔗 swebb sets mode: +o aaaaaaaaa
22:41 🔗 dashcloud has quit IRC (Remote host closed the connection)
22:43 🔗 dashcloud has joined #urlteam
22:53 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:58 🔗 dashcloud has joined #urlteam
23:22 🔗 TNPuAin has quit IRC (Ping timeout: 265 seconds)
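
Editor's note: the enumeration discussed at 21:33-21:35 (walking a shortener's code space in order rather than scraping links from web pages) can be sketched roughly as below. This is only an illustration, not URLTeam's actual tooling: the alphabet, the ordering (shortest codes first), and the names `shortcodes`, `candidate_urls`, and `classify_response` are all assumptions made for the example. No network requests are made; `classify_response` only shows how a saved status code and `Location` header might be sorted into "redirect" versus "not found" by someone following up on the 404 question.

```python
from itertools import count, islice, product
from string import ascii_lowercase, digits

# Assumed alphabet; real shorteners vary (many codes are case-sensitive).
ALPHABET = ascii_lowercase + digits


def shortcodes(alphabet=ALPHABET, max_length=None):
    """Yield candidate codes shortest-first: a, b, ..., then aa, ab, ..."""
    lengths = count(1) if max_length is None else range(1, max_length + 1)
    for length in lengths:
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def candidate_urls(host, limit, alphabet=ALPHABET):
    """Return the first `limit` candidate short URLs for a shortener host."""
    return [f"http://{host}/{code}" for code in islice(shortcodes(alphabet), limit)]


def classify_response(status, location=None):
    """Label a saved shortener response (hypothetical categories)."""
    if 300 <= status < 400 and location:
        return "redirect"
    if status in (404, 410):
        return "not_found"
    return "other"
```

For example, `candidate_urls("bit.ly", 3, alphabet="abc")` yields the URLs for codes `a`, `b`, and `c`. A script that actually requests each candidate would need rate limiting and should record the raw response, since, as noted in the log, the dataset stores whatever the shortener returns rather than the final destination page.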
