#internetarchive 2016-07-07,Thu

Time Nickname Message
00:03 🔗 Sum btw how do IA's bots find/determine which pages to crawl on a site?
00:04 🔗 Sum If say there are links in the footer of a site that aren't crawled along with the main page why would that be?
00:07 🔗 xmc ¯\_(ツ)_/¯
00:59 🔗 yipdw has quit IRC (Quit: No Ping reply in 180 seconds.)
01:02 🔗 yipdw has joined #internetarchive
01:15 🔗 DFJustin their crawlers are finite so I assume there are limits in place on how many links deep it will go or how many urls it will keep in the queue on any given crawl
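A minimal sketch of the kind of limits DFJustin is guessing at: a breadth-first walk with a depth cap and a queue-size cap. Nothing here is taken from IA's actual crawler. The function name `bounded_crawl`, the parameters `max_depth` and `max_queue`, and the in-memory `links` graph are all hypothetical and only illustrate how such caps could cause some links on a page (for example footer links) to be skipped.

```python
from collections import deque


def bounded_crawl(links, seed, max_depth=2, max_queue=1000):
    """Breadth-first walk over an in-memory link graph with two caps.

    links: dict mapping a URL to the list of URLs it links to
           (hypothetical stand-in for fetching and parsing pages).
    Returns the URLs visited, in crawl order.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth cap: do not follow links from this page
        for target in links.get(url, []):
            if target in seen:
                continue
            if len(queue) >= max_queue:
                break  # queue cap: the remaining links on this page are dropped
            seen.add(target)
            queue.append((target, depth + 1))
    return visited


# Hypothetical site: the footer link is listed last on the front page.
site = {"/": ["/a", "/b", "/footer"], "/a": ["/a1"], "/a1": ["/a2"]}
print(bounded_crawl(site, "/", max_depth=1))
print(bounded_crawl(site, "/", max_queue=2))
```

With `max_queue=2` the queue is already full when the front page's last link is reached, so `/footer` is never visited even though it is only one hop from the seed.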
04:33 🔗 Stiletto has quit IRC ()
05:32 🔗 Stiletto has joined #internetarchive
05:51 🔗 LordNigh2 has joined #internetarchive
05:52 🔗 Lord_Nigh has quit IRC (hub.se efnet.portlane.se)
05:52 🔗 espes__ has quit IRC (hub.se efnet.portlane.se)
06:07 🔗 LordNigh2 is now known as Lord_Nigh
06:17 🔗 espes__ has joined #internetarchive
07:27 🔗 Lord_Nigh has quit IRC (Ping timeout: 244 seconds)
07:30 🔗 Lord_Nigh has joined #internetarchive
10:57 🔗 zhongfu_ has joined #internetarchive
10:58 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
11:26 🔗 zhongfu_ is now known as zhongfu
13:21 🔗 zyclonicz has left
13:21 🔗 zyclonicz has joined #internetarchive
14:28 🔗 joepie91 Sum: the wayback crawlers optimize for "a bit of everything, regularly recrawled"
14:29 🔗 joepie91 so I would imagine that it 1) sets a pages-per-site limit based on how many sites it knows about and what the current crawling capacity is, and 2) does some heuristics/analytics to determine what the most popular/important/complete links on the page are
14:29 🔗 joepie91 that's a bit of a guess, though, but that seems like the most reasonable implementation for the goal it has
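joepie91's two guessed steps can be sketched as follows. As the message itself says, this is speculation, not the Wayback crawler's documented behavior. The names `pages_per_site` and `pick_links`, the even split of capacity across sites, and the numeric link scores are all assumptions made up for illustration.

```python
def pages_per_site(crawl_capacity, known_sites):
    """Guess 1: split the available crawl capacity evenly across known sites.

    Both arguments are hypothetical counts (pages per crawl cycle, sites).
    Every site gets at least one page.
    """
    if known_sites <= 0:
        raise ValueError("known_sites must be positive")
    return max(1, crawl_capacity // known_sites)


def pick_links(scores, budget):
    """Guess 2: keep only the highest-scoring links, up to the page budget.

    scores: dict mapping a URL to an importance score (hypothetical, e.g.
            some popularity measure). Ties are broken by URL so the
            result is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [url for url, _ in ranked[:budget]]


budget = pages_per_site(crawl_capacity=1000, known_sites=400)
front_page_links = {"/": 10, "/about": 5, "/footer-legal": 1}
print(budget, pick_links(front_page_links, budget))
```

Under these assumptions a low-scoring link such as `/footer-legal` falls outside the budget and is skipped on that crawl.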
14:55 🔗 atomotic has joined #internetarchive
16:24 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
17:39 🔗 Stilett0 has joined #internetarchive
17:39 🔗 Stiletto has quit IRC (Read error: Operation timed out)
23:00 🔗 Stilett0 has quit IRC (Read error: Operation timed out)
