Time |
Nickname |
Message |
00:03
|
Sum |
btw how do IA's bots find/determine which pages to crawl on a site? |
00:04
|
Sum |
If, say, there are links in the footer of a site that aren't crawled along with the main page, why would that be? |
00:07
|
xmc |
¯\_(ツ)_/¯ |
00:59
|
|
yipdw has quit IRC (Quit: No Ping reply in 180 seconds.) |
01:02
|
|
yipdw has joined #internetarchive |
01:15
|
DFJustin |
their crawlers are finite, so I assume there are limits in place on how many links deep they will go or how many URLs they will keep in the queue on any given crawl |
04:33
|
|
Stiletto has quit IRC () |
05:32
|
|
Stiletto has joined #internetarchive |
05:51
|
|
LordNigh2 has joined #internetarchive |
05:52
|
|
Lord_Nigh has quit IRC (hub.se efnet.portlane.se) |
05:52
|
|
espes__ has quit IRC (hub.se efnet.portlane.se) |
06:07
|
|
LordNigh2 is now known as Lord_Nigh |
06:17
|
|
espes__ has joined #internetarchive |
07:27
|
|
Lord_Nigh has quit IRC (Ping timeout: 244 seconds) |
07:30
|
|
Lord_Nigh has joined #internetarchive |
10:57
|
|
zhongfu_ has joined #internetarchive |
10:58
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
11:26
|
|
zhongfu_ is now known as zhongfu |
13:21
|
|
zyclonicz has left |
13:21
|
|
zyclonicz has joined #internetarchive |
14:28
|
joepie91 |
Sum: the Wayback crawlers optimize for "a bit of everything, regularly recrawled" |
14:29
|
joepie91 |
so I would imagine that it 1) sets a pages-per-site limit based on how many sites it knows about and what the current crawling capacity is, and 2) does some heuristics/analytics to determine what the most popular/important/complete links on the page are |
14:29
|
joepie91 |
that's a bit of a guess, though; it seems like the most reasonable implementation for the goal it has |
14:55
|
|
atomotic has joined #internetarchive |
16:24
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
17:39
|
|
Stilett0 has joined #internetarchive |
17:39
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
23:00
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
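
Illustrative sketch (not from the channel): DFJustin and joepie91 both guess that the crawler works under limits on link depth, queue size, and pages per site. The Python below is a minimal sketch of how such limits could cause footer links to be skipped. It is not the Internet Archive's actual crawler logic; the function `bounded_crawl`, its parameters, and the `site` link graph are all hypothetical, and the "crawl" runs over an in-memory dictionary rather than the network.

```python
from collections import deque


def bounded_crawl(links, seed, max_depth=2, max_queue=100, max_pages=50):
    """Breadth-first walk over an in-memory link graph with three caps.

    All three limits are assumptions made for illustration:
    max_depth  - how many links deep to follow from the seed
    max_queue  - how many URLs may wait in the frontier at once
    max_pages  - per-site page budget for this crawl
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            if target in seen:
                continue
            if len(queue) >= max_queue:
                break  # frontier is full: the page's remaining links are dropped
            seen.add(target)
            queue.append((target, depth + 1))
    return crawled


# Hypothetical site: links are listed in page order, footer links last.
site = {
    "/": ["/about", "/blog", "/footer-terms", "/footer-privacy"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog/post-1/comments"],
}

# With a tiny queue, the footer links never make it into the frontier.
print(bounded_crawl(site, "/", max_depth=1, max_queue=2))
# → ['/', '/about', '/blog']
```

In this toy model the footer links are lost only because they appear last on the page and the queue is already full. A real crawler might instead rank links by some importance heuristic, as joepie91 suggests, which this sketch does not attempt.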