[00:19] *** Stiletto has joined #internetarchive
[01:02] joepie91, yeah I have to imagine part of it is decided by popularity of pages
[01:02] hence why some pages like the ToS wouldn't be crawled
[01:18] *** ats has quit IRC (Ping timeout: 244 seconds)
[01:23] *** Stiletto has quit IRC (Ping timeout: 244 seconds)
[01:24] *** Coderjoe has quit IRC (Read error: Operation timed out)
[01:28] *** Coderjoe has joined #internetarchive
[02:18] *** Stiletto has joined #internetarchive
[02:36] *** ats has joined #internetarchive
[02:46] *** ats has quit IRC (Read error: Operation timed out)
[02:52] *** ats has joined #internetarchive
[03:33] *** Coderjoe has quit IRC (Read error: Operation timed out)
[04:01] *** Coderjoe has joined #internetarchive
[04:12] I would guess that popularity affects how likely it is that the page is linked from another site, and thus how likely it is that a crawler will stumble across the link
[04:30] *** GLaDOS has quit IRC (Ping timeout: 260 seconds)
[08:24] *** zyclonicz has quit IRC (Remote host closed the connection)
[08:57] *** zhongfu_ has joined #internetarchive
[08:57] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[09:04] *** zhongfu_ has quit IRC (Ping timeout: 260 seconds)
[09:05] *** zhongfu has joined #internetarchive
[09:39] *** atomotic has joined #internetarchive
[09:51] *** zhongfu has quit IRC (Remote host closed the connection)
[10:06] *** Sum has quit IRC (Ping timeout: 246 seconds)
[10:07] *** Sum has joined #internetarchive
[10:14] *** zhongfu has joined #internetarchive
[10:20] *** Sum has quit IRC (Ping timeout: 246 seconds)
[10:25] *** zyclonicz has joined #internetarchive
[10:32] *** zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
[10:32] *** GLaDOS has joined #internetarchive
[10:34] *** zhongfu has joined #internetarchive
[10:58] *** Sum has joined #internetarchive
[11:03] *** Sum has quit IRC (Quit: Leaving)
[11:09] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[13:00] DFJustin: perhaps, but even then it spiders recursively
[13:01] so the question is how - knowing all the links on the site - it prioritizes the important ones
[13:32] I think there is a limit to how many links it will follow from a page, hence the bottom ones may never be reached
[13:32] Or something silly/simple of that kind
[13:33] I often stumble upon wordpress.com domains (or similar) where only the main page is ever crawled
[14:32] *** zyclonicz has left
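
Editor's sketch of the [13:32] guess, not anything known about the real crawler: a toy breadth-first spider that follows only the first few links on each page. The function name crawl, the max_links_per_page parameter, and the SITE map are all made up for illustration. It just shows how a per-page cap could leave pages linked near the bottom (like a ToS in the footer) unreached.

```python
from collections import deque


def crawl(site, start, max_links_per_page):
    """Breadth-first crawl of an in-memory site map.

    Follows at most max_links_per_page links from each page, taken in
    the order they appear on the page. Purely illustrative.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Only the first few links are followed; later ones are ignored.
        for link in site.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site: footer-style links (about, ToS) come last on the main page.
SITE = {
    "/": ["/post-1", "/post-2", "/about", "/tos"],
    "/post-1": ["/", "/post-2"],
    "/post-2": ["/"],
    "/about": ["/tos"],
    "/tos": [],
}

reached = crawl(SITE, "/", max_links_per_page=2)
missed = sorted(set(SITE) - set(reached))
print("reached:", reached)
print("missed:", missed)
```

With a cap of 2 the toy crawl never gets to /about or /tos; raising the cap to 4 reaches every page. Whether the actual crawler does anything like this is exactly the open question in the log.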