Time |
Nickname |
Message |
04:37
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:39
|
|
Start_ has joined #urlteam |
05:39
|
|
Start has quit IRC (Read error: Connection reset by peer) |
05:55
|
|
bsmith093 has quit IRC (Read error: Operation timed out) |
06:11
|
|
Start has joined #urlteam |
06:11
|
|
svchfoo1 sets mode: +o Start |
06:12
|
|
Start_ has quit IRC (Read error: Connection reset by peer) |
08:06
|
|
toad2 has joined #urlteam |
08:08
|
|
toad1 has quit IRC (Read error: Operation timed out) |
09:19
|
|
svchfoo2 has quit IRC (Ping timeout: 265 seconds) |
11:35
|
|
matthusby has quit IRC (ZNC - http://znc.in) |
12:05
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
12:11
|
|
dashcloud has joined #urlteam |
15:58
|
|
T31M has joined #urlteam |
17:39
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
17:44
|
|
aaaaaaaaa has joined #urlteam |
17:44
|
|
swebb sets mode: +o aaaaaaaaa |
17:45
|
|
dashcloud has joined #urlteam |
18:29
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:35
|
|
dashcloud has joined #urlteam |
18:43
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:59
|
|
dashcloud has joined #urlteam |
19:18
|
|
aaaaaaaaa has quit IRC (Leaving) |
19:33
|
|
aaaaaaaaa has joined #urlteam |
19:33
|
|
swebb sets mode: +o aaaaaaaaa |
19:54
|
|
aaaaaaaaa has quit IRC (Leaving) |
20:35
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
20:38
|
|
dashcloud has joined #urlteam |
20:43
|
|
TNPuAin has joined #urlteam |
20:44
|
TNPuAin |
Hello, anyone from the URLTeam that can answer a question I have? |
21:00
|
TNPuAin |
I am doing a research paper on link rot, the life span of url shorteners, and typos in domains that lead to 404 pages |
21:01
|
TNPuAin |
I was wondering if there is access to the harvested short urls that, rather than leading to a new domain, lead to a 404 page |
21:08
|
arkiver |
I think there is access |
21:08
|
arkiver |
chfoo probably knows more about that than me though |
21:12
|
chfoo |
we just have unshortened urls. i don't think anyone has done any processing on them |
21:17
|
TNPuAin |
you have the URLS that were harvested from random websites before they were checked to see where they redirected to? |
21:19
|
achip |
for reference, here's an example of what one of the datasets from URLTeam looks like http://paste.nerds.io/raw/usepasodoc, in this case it's is.gd urls |
21:20
|
TNPuAin |
thanks achip, I have the torrent downloaded and have gone through the files |
21:21
|
TNPuAin |
my understanding is that before each of those files is made, like the one you gave as an example, the short urls are 1st, scraped from the web, and 2nd, followed to their destination |
21:21
|
TNPuAin |
what about the short urls that are harvested, but when followed, lead to a 404 page? are they just discarded from the dataset? |
21:33
|
chfoo |
we don't follow urls. we just brute force the url shortener server and save whatever the server returns back |
21:35
|
TNPuAin |
so you take a url such as bit.ly/a and then continue in order like bit.ly/aa bit.ly/ab bit.ly/ac etc? |
21:35
|
xmc |
yep |
21:35
|
xmc |
you can follow the urls yourself, either manually or with a script |
21:35
|
TNPuAin |
ah, I thought the URLS were scraped from the web |
21:35
|
xmc |
however, most of the urls we have scraped don't come with a timestamp |
21:35
|
xmc |
nope :] |
21:36
|
TNPuAin |
do you know of any possible way I can find lists of short urls that were actually scraped off of webpages? |
21:38
|
xmc |
i doubt it |
21:40
|
TNPuAin |
does anyone know of any public data sets that contain large lists of urls (including short urls) that were scraped from the web? I have seen common crawl but don't have the knowledge to be able to extract urls from their data |
22:28
|
|
aaaaaaaaa has joined #urlteam |
22:28
|
|
swebb sets mode: +o aaaaaaaaa |
22:41
|
|
dashcloud has quit IRC (Remote host closed the connection) |
22:43
|
|
dashcloud has joined #urlteam |
22:53
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:58
|
|
dashcloud has joined #urlteam |
23:22
|
|
TNPuAin has quit IRC (Ping timeout: 265 seconds) |
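
Editor's note: the brute-force enumeration that chfoo and TNPuAin describe between 21:33 and 21:35 (bit.ly/a, then bit.ly/aa, bit.ly/ab, and so on) can be sketched roughly as below. This is only an illustration and not URLTeam's actual tooling. The alphabet, the names `shortcodes`, `candidate_urls`, and `classify`, and the `example.com` base are all assumptions made for the sketch. Real shorteners use their own character sets and code lengths, and the log does not say which status codes end up in the datasets. The sketch makes no network requests.

```python
from itertools import count, islice, product
import string

# Assumed alphabet (hypothetical): lowercase letters followed by digits.
# Each real shortener has its own character set.
ALPHABET = string.ascii_lowercase + string.digits


def shortcodes(alphabet=ALPHABET):
    """Yield every code of length 1, then every code of length 2, and so on."""
    for length in count(1):
        for chars in product(alphabet, repeat=length):
            yield "".join(chars)


def candidate_urls(base, limit, alphabet=ALPHABET):
    """Return the first `limit` candidate short URLs under `base`."""
    return [f"{base}/{code}" for code in islice(shortcodes(alphabet), limit)]


def classify(status):
    """Bucket an HTTP status code the way a link-rot study might (assumed buckets)."""
    if status in (301, 302, 303, 307, 308):
        return "redirect"  # the shortener points somewhere
    if status in (404, 410):
        return "missing"   # the short code is unknown or gone
    return "other"


if __name__ == "__main__":
    # Placeholder host; nothing is fetched here.
    for url in candidate_urls("http://example.com", 5, alphabet="abc"):
        print(url)
```

Following each candidate, as xmc suggests, would mean requesting it and passing the response status to something like `classify`. That step is left out so the sketch stays free of network access.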