Time |
Nickname |
Message |
08:06
|
omf_ |
I never said collecting urls was small I just said it is the barrier |
08:07
|
omf_ |
fetching pages, processing and all that other stuff is simple scripts |
08:07
|
soultcer |
When you are fetching those pages simply extract all outgoing links |
08:07
|
soultcer |
Voila, you have a neverending supply of URLs |
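The link-harvesting idea soultcer describes could be sketched roughly like this, using only the Python standard library. The `LinkExtractor` class, the `extract_links` helper, and the sample HTML are hypothetical names made up for illustration; a real crawler would also need fetching, deduplication across pages, and politeness delays.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href=...> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.add(absolute)


def extract_links(html, base_url):
    """Return the sorted, de-duplicated outgoing links found in html."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return sorted(parser.links)


# Hypothetical sample page, not taken from the log.
sample = (
    '<a href="/about">About</a> '
    '<a href="http://example.org/x#top">X</a> '
    '<a href="mailto:a@b.c">mail</a>'
)
print(extract_links(sample, "http://example.com/index.html"))
```

Each URL returned here would go back onto the crawl queue, which is where the "neverending supply" comes from.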
08:07
|
omf_ |
yes and with my internet it takes forever to build that up |
08:08
|
omf_ |
but common crawl and others have done a good chunk of it already |
08:08
|
omf_ |
Here is something interesting I learned |
08:08
|
soultcer |
But even processing the common crawl data will take ages (and you also need to download it first) |
08:09
|
omf_ |
yes but compared to doing all that work myself it is a short amount of time |
08:10
|
omf_ |
wget seems to be the only tool that is smart enough to fetch js and css dependencies |
08:10
|
omf_ |
most frameworks have a get_content() function but nothing to do with everything else |
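For context, the wget behaviour omf_ mentions is its `--page-requisites` option. A very rough approximation of the "everything else" step, assuming the page HTML has already been fetched, might look like the sketch below. `RequisiteFinder` and `find_requisites` are hypothetical names, and this only covers scripts, stylesheets, and images; wget itself handles more cases (for example, URLs referenced from inside CSS).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RequisiteFinder(HTMLParser):
    """Collect URLs of scripts, stylesheets and images a page depends on."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.requisites = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = None
        if tag in ("script", "img"):
            url = attrs.get("src")
        elif tag == "link":
            # Only follow <link> tags whose rel includes "stylesheet".
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                url = attrs.get("href")
        if url:
            self.requisites.append(urljoin(self.base_url, url))


def find_requisites(html, base_url):
    """Return dependency URLs in document order."""
    finder = RequisiteFinder(base_url)
    finder.feed(html)
    return finder.requisites


# Hypothetical sample page for illustration.
page = (
    '<link rel="stylesheet" href="css/site.css">'
    '<script src="/js/app.js"></script>'
    '<img src="logo.png">'
    '<link rel="icon" href="fav.ico">'
)
print(find_requisites(page, "http://example.com/blog/post.html"))
```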
08:11
|
soultcer |
Heritrix will be a lot better at getting javascript stuff than wget |
08:12
|
omf_ |
it is on my list but I haven't gotten to it yet |
08:12
|
omf_ |
I am going to move it to the top |
08:13
|
soultcer |
Hehe you are like me 10 years ago when I wanted to write a web crawler that collects all the URLs in the world and gathers information about them ;-) |
08:15
|
omf_ |
I figured out that was a waste of time when google overtook altavista. I am only really concerned with the top 3% of the internet |
08:15
|
soultcer |
How will you know which part is the top 3 percent? |
08:15
|
omf_ |
but then who determines what that is? Gartner, Alexa, Google, StatCounter |
08:16
|
soultcer |
Alexa releases their top million domains list. It might be a good start? |
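One way to turn a ranked domain list into a "top N percent" cut is sketched below. It assumes the list is a CSV of `rank,domain` rows; the `top_domains` helper and the four sample rows are hypothetical and not real ranking data.

```python
import csv
import io


def top_domains(csv_text, fraction):
    """Return the best-ranked `fraction` of domains from rank,domain CSV text."""
    ranked = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # tolerate blank lines
        rank, domain = row
        ranked.append((int(rank), domain))
    ranked.sort()
    # Always keep at least one domain, even for tiny lists.
    keep = max(1, int(len(ranked) * fraction))
    return [domain for _rank, domain in ranked[:keep]]


# Hypothetical sample in the assumed rank,domain layout.
sample_list = "1,example.com\n2,example.org\n3,example.net\n4,example.edu\n"
print(top_domains(sample_list, 0.5))
```

With a million-row list, `top_domains(text, 0.03)` would keep the first 30,000 domains, which is one possible reading of "the top 3%".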
08:16
|
omf_ |
Got it |
08:16
|
omf_ |
and a few other sources |
08:19
|
soultcer |
There are lots of domain name lists out there if you type the right keywords into google |
08:19
|
omf_ |
oh yeah but most are short |
08:20
|
omf_ |
I want to cross section certain types of sites |
08:20
|
omf_ |
I mainly do web dev now for my day job which is where this stuff would be useful |
08:21
|
omf_ |
Right now there is some decent data out there because people want to share it |
08:23
|
soultcer |
People share data because data wants to be free |
08:56
|
omf_ |
It stopped snowing outside :( |
10:59
|
ersi |
http://archive.org/details/more_dangerous_then_dynamite :D |
10:59
|
ersi |
TIL.. people used gasoline to wash clothes |
11:00
|
ersi |
Linked from IA's latest blog post btw |
11:01
|
chronomex |
the most effective chemicals are usually the most toxic ones |
11:02
|
ersi |
heh |
11:03
|
chronomex |
perchloroethane |
11:03
|
chronomex |
xylene |
11:03
|
chronomex |
asbestos |
11:14
|
ersi |
http://blog.thelifeofkenneth.com/2013/02/tear-down-of-hp-procurve-2824-ethernet.html |
14:46
|
godane |
i got this: http://www.imdb.com/title/tt1113745/ |
14:46
|
godane |
it's very rare cause it was only aired once |
14:49
|
SmileyG |
nice |
15:06
|
godane |
i really hope we can fix these videos: https://archive.org/details/g4tv.com-video14737 |
15:06
|
godane |
most of the ces 2007 videos are broken |
15:08
|
ersi |
Created an IA account, haven't gotten an activation e-mail after several re-send trials :( |
15:10
|
godane |
i'm most likely going to have the microsoft ces 2007 keynote coverage from youtube |
15:10
|
godane |
the g4tv.com version is in 3 parts and they're all broken |
15:17
|
godane |
so i think i will have to limit the amount of hd videos from g4tv.com i can grab |
15:20
|
godane |
so i'm at about 627gb of SD g4tv.com videos |
15:22
|
godane |
i have about 19gb of HD g4tv.com videos |
15:57
|
Aranje |
gasoline is actually a great cleaner |
15:57
|
Aranje |
It's particularly good at anything sticky or gooey, it just dissolves it. |
15:58
|
Aranje |
If you're ever working in a bar and you have a /real/ bar, gas and a rag is the best way to clean it (watch for fumes, of course) |
15:59
|
ersi |
Yeah, but, like clothes. Every wash. |
16:00
|
ersi |
I'm fine with using it as a "special" solvent in some situations ;) |
16:05
|
Aranje |
I feel like that'd be really rough on the fabric |
16:28
|
Smiley |
urgh |
20:03
|
godane |
looks like mapsforus.org is blocked cause of robots |
20:04
|
godane |
they're making fun of a Miss USA contestant: https://archive.org/details/g4tv.com-video17620 |
20:05
|
godane |
fun fact: Pat & Stu show makes fun of it all the time |
20:44
|
balrog_ |
yeah mapsforus.org specifically blocks all robots except google |
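A robots.txt that admits only Google's crawler can be checked with the standard library's `urllib.robotparser`, as in the sketch below. The rules shown are a hypothetical file written for illustration, not the actual contents of mapsforus.org's robots.txt, and `ExampleArchiveBot` is a made-up user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, everyone else nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "Googlebot", "http://example.com/maps/"))
print(allowed(ROBOTS_TXT, "ExampleArchiveBot", "http://example.com/maps/"))
```

The first call is permitted and the second is refused, which is the situation any crawler that honours robots.txt runs into on a site configured this way.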
20:58
|
omf_ |
Posterous just banning ips seems like such a banal problem. |
20:58
|
omf_ |
compared to how hard this scraping problem could be |
20:59
|
omf_ |
I am a glutton for data liberation punishment :) |