#archiveteam-bs 2013-02-16,Sat


Time Nickname Message
08:06 🔗 omf_ I never said collecting urls was small I just said it is the barrier
08:07 🔗 omf_ fetching pages, processing and all that other stuff is simple scripts
08:07 🔗 soultcer When you are fetching those pages simply extract all outgoing links
08:07 🔗 soultcer Voila, you have a neverending supply of URLs
08:07 🔗 omf_ yes and with my internet it takes forever to build that up
08:08 🔗 omf_ but common crawl and others have done a good chunk of it already
08:08 🔗 omf_ Here is something interesting I learned
08:08 🔗 soultcer But even processing the common crawl data will take ages (and you also need to download it first)
08:09 🔗 omf_ yes but compared to doing all that work myself it is a short amount of time
08:10 🔗 omf_ wget seems to be the only tool that is smart enough to fetch js and css dependencies
08:10 🔗 omf_ most frameworks have a get_content() function but nothing to do with everything else
08:11 🔗 soultcer Heritrix will be a lot better at getting javascript stuff than wget
08:12 🔗 omf_ it is on my list but I haven't gotten to it yet
08:12 🔗 omf_ I am going to move it to the top
08:13 🔗 soultcer Hehe you are like me 10 years ago when I wanted to write a web crawler that collects all the URLs in the world and gathers information about them ;-)
08:15 🔗 omf_ I figured out that was a waste of time when google overtook altavista. I am only really concerned with the top 3% of the internet
08:15 🔗 soultcer How will you know which part is the top 3 percent?
08:15 🔗 omf_ but then who determines what that is? gartner, alexa, google, statcounter
08:16 🔗 soultcer Alexa releases their top million domains list. It might be a good start?
08:16 🔗 omf_ Got it
08:16 🔗 omf_ and a few other sources
08:19 🔗 soultcer There are lots of domain name lists out there if you type the right keywords into google
08:19 🔗 omf_ oh yeah but most are short
08:20 🔗 omf_ I want to cross section certain types of sites
08:20 🔗 omf_ I mainly do web dev now for my day job which is where this stuff would be useful
08:21 🔗 omf_ Right now there is some decent data out there because people want to share it
08:23 🔗 soultcer People share data because data wants to be free
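
soultcer's suggestion above (extract all outgoing links from every fetched page) can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not anything either participant posted; the names `LinkExtractor` and `extract_links` are made up for this example, and a real crawler would also need fetching, politeness delays, and a persistent URL queue.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs found in <a href="..."> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and other non-web schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the outgoing links of one page, de-duplicated, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return list(dict.fromkeys(parser.links))
```

Feeding each returned URL back into the fetch queue is what gives the "neverending supply" of URLs; the hard part omf_ describes (bandwidth and time) is untouched by this sketch.
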
08:56 🔗 omf_ It stopped snowing outside :(
10:59 🔗 ersi http://archive.org/details/more_dangerous_then_dynamite :D
10:59 🔗 ersi TIL.. people used gasoline to wash clothes
11:00 🔗 ersi Linked from IA's latest blog post btw
11:01 🔗 chronomex the most effective chemicals are usually the most toxic ones
11:02 🔗 ersi heh
11:03 🔗 chronomex perchloroethane
11:03 🔗 chronomex xylene
11:03 🔗 chronomex asbestos
11:14 🔗 ersi http://blog.thelifeofkenneth.com/2013/02/tear-down-of-hp-procurve-2824-ethernet.html
14:46 🔗 godane i got this: http://www.imdb.com/title/tt1113745/
14:46 🔗 godane it's very rare cause it was only aired once
14:49 🔗 SmileyG nice
15:06 🔗 godane i really hope we can fix these videos: https://archive.org/details/g4tv.com-video14737
15:06 🔗 godane most of the ces 2007 videos are broken
15:08 🔗 ersi Created an IA account, haven't gotten an activation e-mail after several re-send attempts :(
15:10 🔗 godane i'm most likely going to have the microsoft ces 2007 keynote coverage from youtube
15:10 🔗 godane the g4tv.com version is in 3 parts and they're all broken
15:17 🔗 godane so i think i will have to limit the amount of hd videos from g4tv.com i can grab
15:20 🔗 godane so i'm at about 627gb of SD g4tv.com videos
15:22 🔗 godane i have about 19gb of HD g4tv.com videos
15:57 🔗 Aranje gasoline is actually a great cleaner
15:57 🔗 Aranje It's particularly good at anything sticky or gooey, it just dissolves it.
15:58 🔗 Aranje If you're ever working in a bar and you have a /real/ bar, gas and a rag is the best way to clean it (watch for fumes, of course)
15:59 🔗 ersi Yeah, but, like clothes. Every wash.
16:00 🔗 ersi I'm fine with using it as a "special" solvent in some situations ;)
16:05 🔗 Aranje I feel like that'd be really rough on the fabric
16:28 🔗 Smiley urgh
20:03 🔗 godane looks like mapsforus.org is blocked cause of robots
20:04 🔗 godane they're making fun of a miss USA contestant: https://archive.org/details/g4tv.com-video17620
20:05 🔗 godane fun fact: Pat & Stu show makes fun of it all the time
20:44 🔗 balrog_ yeah mapsforus.org specifically blocks all robots except google
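
The kind of robots.txt policy balrog_ describes (everything blocked except Google) can be checked with Python's standard `urllib.robotparser`. The robots.txt text below is a hypothetical example written for illustration, not the actual file served by mapsforus.org, and the `allowed` helper is an assumed name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything, all others nothing.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this example file, `allowed(EXAMPLE_ROBOTS_TXT, "Googlebot", "http://example.org/")` is true, while the same call with a user agent such as `"wget"` is false.
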
20:58 🔗 omf_ Posterous just banning ips seems like such a banal problem.
20:58 🔗 omf_ compared to how hard this scraping problem could be
20:59 🔗 omf_ I am a glutton for data liberation punishment :)
