[00:03] btw how do IA's bots find/determine which pages to crawl on a site?
[00:04] If, say, there are links in the footer of a site that aren't crawled along with the main page, why would that be?
[00:07] ¯\_(ツ)_/¯
[00:59] *** yipdw has quit IRC (Quit: No Ping reply in 180 seconds.)
[01:02] *** yipdw has joined #internetarchive
[01:15] their crawlers are finite, so I assume there are limits in place on how many links deep it will go or how many URLs it will keep in the queue on any given crawl
[04:33] *** Stiletto has quit IRC ()
[05:32] *** Stiletto has joined #internetarchive
[05:51] *** LordNigh2 has joined #internetarchive
[05:52] *** Lord_Nigh has quit IRC (hub.se efnet.portlane.se)
[05:52] *** espes__ has quit IRC (hub.se efnet.portlane.se)
[06:07] *** LordNigh2 is now known as Lord_Nigh
[06:17] *** espes__ has joined #internetarchive
[07:27] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[07:30] *** Lord_Nigh has joined #internetarchive
[10:57] *** zhongfu_ has joined #internetarchive
[10:58] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[11:26] *** zhongfu_ is now known as zhongfu
[13:21] *** zyclonicz has left
[13:21] *** zyclonicz has joined #internetarchive
[14:28] Sum: the Wayback crawlers optimize for "a bit of everything, regularly recrawled"
[14:29] so I would imagine that it 1) sets a pages-per-site limit based on how many sites it knows about and what the current crawling capacity is, and 2) uses some heuristics/analytics to determine which links on the page are the most popular/important/complete
[14:29] that's a bit of a guess, but it seems like the most reasonable implementation for that goal
[14:55] *** atomotic has joined #internetarchive
[16:24] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:39] *** Stilett0 has joined #internetarchive
[17:39] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:00] *** Stilett0 has quit IRC (Read error: Operation timed out)
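
[Editor's note: the limits speculated about above (link depth, queue size, pages-per-site budget) can be sketched as a toy frontier-based crawler. This is not how the Wayback Machine's crawler actually works; it is a minimal illustration of the guessed heuristics, operating on an in-memory link graph instead of live HTTP fetches. The function name `crawl` and all parameters are hypothetical.]

```python
from collections import deque

def crawl(graph, seeds, max_depth=2, pages_per_site=5):
    """Return the set of URLs a depth/budget-limited BFS crawler would visit.

    graph: dict mapping URL -> list of outlinked URLs (stand-in for fetched pages)
    seeds: starting URLs, e.g. site front pages
    max_depth: how many links deep to follow from a seed
    pages_per_site: per-host page budget for this crawl
    """
    def site(url):
        # Naive hostname extraction; a real crawler would use urllib.parse
        return url.split('/')[2]

    visited = set()
    budget = {}                       # pages fetched so far, per host
    queue = deque((u, 0) for u in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue                  # depth limit: too many links deep
        host = site(url)
        if budget.get(host, 0) >= pages_per_site:
            continue                  # site budget exhausted; skip this page
        visited.add(url)
        budget[host] = budget.get(host, 0) + 1
        for link in graph.get(url, []):
            queue.append((link, depth + 1))
    return visited
```

Under this sketch, a footer link that sits one hop deeper than `max_depth`, or that is enqueued after the site's page budget runs out, simply never gets visited, which would explain the behavior asked about at [00:04].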