#internetarchive 2016-07-08,Fri

Time	Nickname	Message
00:19		Stiletto has joined #internetarchive
01:02	Sum	joepie91, yeah I have to imagine part of it is decided by popularity of pages
01:02	Sum	hence why some pages like the ToS wouldn't be crawled
01:18		ats has quit IRC (Ping timeout: 244 seconds)
01:23		Stiletto has quit IRC (Ping timeout: 244 seconds)
01:24		Coderjoe has quit IRC (Read error: Operation timed out)
01:28		Coderjoe has joined #internetarchive
02:18		Stiletto has joined #internetarchive
02:36		ats has joined #internetarchive
02:46		ats has quit IRC (Read error: Operation timed out)
02:52		ats has joined #internetarchive
03:33		Coderjoe has quit IRC (Read error: Operation timed out)
04:01		Coderjoe has joined #internetarchive
04:12	DFJustin	I would guess that popularity affects how likely it is that the page is linked from another site, and thus how likely it is that a crawler will stumble across the link
04:30		GLaDOS has quit IRC (Ping timeout: 260 seconds)
08:24		zyclonicz has quit IRC (Remote host closed the connection)
08:57		zhongfu_ has joined #internetarchive
08:57		zhongfu has quit IRC (Ping timeout: 260 seconds)
09:04		zhongfu_ has quit IRC (Ping timeout: 260 seconds)
09:05		zhongfu has joined #internetarchive
09:39		atomotic has joined #internetarchive
09:51		zhongfu has quit IRC (Remote host closed the connection)
10:06		Sum has quit IRC (Ping timeout: 246 seconds)
10:07		Sum has joined #internetarchive
10:14		zhongfu has joined #internetarchive
10:20		Sum has quit IRC (Ping timeout: 246 seconds)
10:25		zyclonicz has joined #internetarchive
10:32		zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
10:32		GLaDOS has joined #internetarchive
10:34		zhongfu has joined #internetarchive
10:58		Sum has joined #internetarchive
11:03		Sum has quit IRC (Quit: Leaving)
11:09		atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:00	joepie91	DFJustin: perhaps, but even then it spiders recursively
13:01	joepie91	so the question is how - knowing all the links on the site - it prioritizes the important ones
13:32	Nemo_bis	I think there is a limit to how many links it will follow from a page, hence the bottom ones may never be reached
13:32	Nemo_bis	Or something silly/simple of that kind
13:33	Nemo_bis	I often stumble upon wordpress.com domains (or similar) where only the main page is ever crawled
14:32		zyclonicz has left
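
Nemo_bis's guess at 13:32 (a cap on how many links are followed from each page) can be illustrated with a toy breadth-first crawl. This is only a sketch of that guess, not a description of how the Internet Archive's crawler actually works. The `crawl` function, the `max_links_per_page` parameter, and the `SITE` link graph are all hypothetical names made up for the illustration.

```python
from collections import deque


def crawl(site, start, max_links_per_page):
    """Breadth-first crawl of a toy link graph.

    `site` maps a page to its outgoing links in page order.
    Only the first `max_links_per_page` links of each page are followed
    (a hypothetical limit, assumed for illustration).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Links past the cap are ignored, even though the crawler "knows" them.
        for link in site.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical site: the ToS link sits at the bottom of the home page
# and nothing else links to it.
SITE = {
    "/": ["/news", "/blog", "/about", "/tos"],
    "/news": ["/"],
    "/blog": ["/", "/news"],
    "/about": ["/"],
}

reached = crawl(SITE, "/", max_links_per_page=3)
print(sorted(reached))
print("/tos" in reached)
```

With a cap of 3, the recursive crawl still visits every page it follows, yet `/tos` is never reached because it is the fourth link on the only page that links to it. Raising the cap to 4 makes it reachable, which fits the idea that link position rather than importance could decide what gets crawled under such a rule.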
