[08:06] I never said collecting URLs was small, I just said it's the barrier
[08:07] fetching pages, processing, and all that other stuff is just simple scripts
[08:07] When you're fetching those pages, simply extract all outgoing links
[08:07] Voilà, you have a never-ending supply of URLs
[08:07] yes, and with my internet connection it takes forever to build that up
[08:08] but Common Crawl and others have done a good chunk of it already
[08:08] Here is something interesting I learned
[08:08] But even processing the Common Crawl data will take ages (and you also need to download it first)
[08:09] yes, but compared to doing all that work myself it's a short amount of time
[08:10] wget seems to be the only tool that is smart enough to fetch JS and CSS dependencies
[08:10] most frameworks have a get_content() function but nothing to deal with everything else
[08:11] Heritrix will be a lot better at getting JavaScript stuff than wget
[08:12] it is on my list but I haven't gotten to it yet
[08:12] I am going to move it to the top
[08:13] Hehe, you are like me 10 years ago when I wanted to write a web crawler that collects all the URLs in the world and gathers information about them ;-)
[08:15] I figured out that was a waste of time when Google overtook AltaVista. I am only really concerned with the top 3% of the internet
[08:15] How will you know which part is the top 3 percent?
[08:15] but then who determines what that is? Gartner, Alexa, Google, StatCounter
[08:16] Alexa releases their top million domains list. It might be a good start?
[08:16] Got it
[08:16] and a few other sources
[08:19] There are lots of domain name lists out there if you type the right keywords into Google
[08:19] oh yeah, but most are short
[08:20] I want to cross-section certain types of sites
[08:20] I mainly do web dev now for my day job, which is where this stuff would be useful
[08:21] Right now there is some decent data out there because people want to share it
[08:23] People share data because data wants to be free
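A minimal sketch of the extract-the-outgoing-links step from the 08:07 messages above, using only the Python standard library; the example.com URL is a placeholder, and a real crawl would also need politeness delays, deduplication across runs, and error handling:

```python
# Fetch one page and extract its outgoing links to feed a crawl frontier.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.add(urljoin(self.base_url, value))

def outgoing_links(url):
    with urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links

# Every fetched page yields more URLs, which is the "never-ending supply":
# push these back onto the fetch queue and the frontier keeps growing.
for link in sorted(outgoing_links("http://example.com/")):
    print(link)
```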
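Processing Common Crawl means iterating over its WARC files. A sketch under the assumption that the third-party warcio package is installed and a segment has already been downloaded (example.warc.gz is a placeholder name):

```python
# Iterate over one downloaded Common Crawl WARC segment and print the URL
# of every captured response. Assumes `pip install warcio`.
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))
```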
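On the 08:10 wget point: the behavior being described is wget's --page-requisites mode, which pulls the CSS, JS, and images a page needs to render. A sketch that drives it from Python, assuming wget is installed and on the PATH:

```python
# Mirror a single page along with its rendering dependencies via wget.
import subprocess

subprocess.run([
    "wget",
    "--page-requisites",   # fetch the CSS/JS/images the page depends on
    "--convert-links",     # rewrite links so the local copy is browsable
    "--adjust-extension",  # save files with sensible extensions
    "--no-parent",         # don't wander above the starting directory
    "http://example.com/", # placeholder target
], check=True)
```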
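And for the Alexa top-million list mentioned at 08:16, a sketch that downloads and parses it into a seed list; the S3 URL below is the one Alexa published around this time, so treat it as an assumption that may no longer resolve:

```python
# Pull Alexa's top-1m list and keep the first 1000 domains as crawl seeds.
import csv
import io
import zipfile
from urllib.request import urlopen

URL = "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip"

with urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

with archive.open("top-1m.csv") as f:
    reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
    seeds = []
    for rank, domain in reader:  # each row is "rank,domain"
        seeds.append(domain)
        if len(seeds) >= 1000:
            break

print(seeds[:10])
```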
[08:56] It stopped snowing outside :(
[10:59] http://archive.org/details/more_dangerous_then_dynamite :D
[10:59] TIL... people used gasoline to wash clothes
[11:00] Linked from IA's latest blog post, btw
[11:01] the most effective chemicals are usually the most toxic ones
[11:02] heh
[11:03] perchloroethane
[11:03] xylene
[11:03] asbestos
[11:14] http://blog.thelifeofkenneth.com/2013/02/tear-down-of-hp-procurve-2824-ethernet.html
[14:46] i got this: http://www.imdb.com/title/tt1113745/
[14:46] it's very rare because it was only aired once
[14:49] nice
[15:06] i really hope we can fix these videos: https://archive.org/details/g4tv.com-video14737
[15:06] most of the CES 2007 videos are broken
[15:08] Created an IA account, haven't gotten an activation e-mail after several re-send attempts :(
[15:10] i'm most likely going to have the Microsoft CES 2007 keynote coverage from YouTube
[15:10] the g4tv.com version is in 3 parts and they're all broken
[15:17] so i think i will have to limit the number of HD videos from g4tv.com i can grab
[15:20] so i'm at about 627 GB of SD g4tv.com videos
[15:22] i have about 19 GB of HD g4tv.com videos
[15:57] gasoline is actually a great cleaner
[15:57] It's particularly good at anything sticky or gooey, it just dissolves it.
[15:58] If you're ever working in a bar and you have a /real/ bar, gas and a rag is the best way to clean it (watch for fumes, of course)
[15:59] Yeah, but, like clothes. Every wash.
[16:00] I'm fine with using it as a "special" solvent in some situations ;)
[16:05] I feel like that'd be really rough on the fabric
[16:28] urgh
[20:03] looks like mapsforus.org is blocked because of robots.txt
[20:04] they're making fun of a Miss USA contestant: https://archive.org/details/g4tv.com-video17620
[20:05] fun fact: the Pat & Stu show makes fun of it all the time
[20:44] yeah, mapsforus.org specifically blocks all robots except Google
[20:58] Posterous just banning IPs seems like such a banal problem.
[20:58] compared to how hard this scraping problem could be
[20:59] I am a glutton for data liberation punishment :)
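Checking a blanket robots.txt block like the mapsforus.org one above takes only a few lines of standard-library Python:

```python
# Compare what a site's robots.txt allows for a generic crawler versus
# Googlebot; a site that whitelists only Google shows up immediately.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://mapsforus.org/robots.txt")
rp.read()

for agent in ("*", "Googlebot"):
    allowed = rp.can_fetch(agent, "http://mapsforus.org/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```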