[00:03] btw how do IA's bots find/determine which pages to crawl on a site?
[00:04] If, say, there are links in the footer of a site that aren't crawled along with the main page, why would that be?
[00:07] ¯\_(ツ)_/¯
[00:59] *** yipdw has quit IRC (Quit: No Ping reply in 180 seconds.)
[01:02] *** yipdw has joined #internetarchive
[01:15] their crawlers are finite, so I assume there are limits in place on how many links deep it will go or how many URLs it will keep in the queue on any given crawl
[04:33] *** Stiletto has quit IRC ()
[05:32] *** Stiletto has joined #internetarchive
[05:51] *** LordNigh2 has joined #internetarchive
[05:52] *** Lord_Nigh has quit IRC (hub.se efnet.portlane.se)
[05:52] *** espes__ has quit IRC (hub.se efnet.portlane.se)
[06:07] *** LordNigh2 is now known as Lord_Nigh
[06:17] *** espes__ has joined #internetarchive
[07:27] *** Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
[07:30] *** Lord_Nigh has joined #internetarchive
[10:57] *** zhongfu_ has joined #internetarchive
[10:58] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[11:26] *** zhongfu_ is now known as zhongfu
[13:21] *** zyclonicz has left
[13:21] *** zyclonicz has joined #internetarchive
[14:28] Sum: the Wayback crawlers optimize for "a bit of everything, regularly recrawled"
[14:29] so I would imagine that it 1) sets a pages-per-site limit based on how many sites it knows about and what the current crawling capacity is, and 2) uses some heuristics/analytics to determine which links on the page are the most popular/important/complete
[14:29] that's a bit of a guess, but it seems like the most reasonable implementation for that goal
[14:55] *** atomotic has joined #internetarchive
[16:24] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:39] *** Stilett0 has joined #internetarchive
[17:39] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:00] *** Stilett0 has quit IRC (Read error: Operation timed out)
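
[Editor's note: the limits speculated about above (link depth, queue size, pages-per-site budget) can be sketched as a toy frontier-based crawler. This is not how the Wayback Machine's crawler actually works; it is a minimal illustration of the guessed heuristics, operating on an in-memory link graph instead of live HTTP fetches. The function name `crawl` and all parameters are hypothetical.]

```python
from collections import deque

def crawl(graph, seeds, max_depth=2, pages_per_site=5):
    """Return the set of URLs a depth/budget-limited BFS crawler would visit.

    graph: dict mapping URL -> list of outlinked URLs (stand-in for fetched pages)
    seeds: starting URLs, e.g. site front pages
    max_depth: how many links deep to follow from a seed
    pages_per_site: per-host page budget for this crawl
    """
    def site(url):
        # Naive hostname extraction; a real crawler would use urllib.parse
        return url.split('/')[2]

    visited = set()
    budget = {}                       # pages fetched so far, per host
    queue = deque((u, 0) for u in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue                  # depth limit: too many links deep
        host = site(url)
        if budget.get(host, 0) >= pages_per_site:
            continue                  # site budget exhausted; skip this page
        visited.add(url)
        budget[host] = budget.get(host, 0) + 1
        for link in graph.get(url, []):
            queue.append((link, depth + 1))
    return visited
```

Under this sketch, a footer link that sits one hop deeper than `max_depth`, or that is enqueued after the site's page budget runs out, simply never gets visited, which would explain the behavior asked about at [00:04].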