#internetarchive 2016-07-08,Fri


Time Nickname Message
00:19 🔗 Stiletto has joined #internetarchive
01:02 🔗 Sum joepie91, yeah I have to imagine part of it is decided by popularity of pages
01:02 🔗 Sum hence why some pages like the ToS wouldn't be crawled
01:18 🔗 ats has quit IRC (Ping timeout: 244 seconds)
01:23 🔗 Stiletto has quit IRC (Ping timeout: 244 seconds)
01:24 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
01:28 🔗 Coderjoe has joined #internetarchive
02:18 🔗 Stiletto has joined #internetarchive
02:36 🔗 ats has joined #internetarchive
02:46 🔗 ats has quit IRC (Read error: Operation timed out)
02:52 🔗 ats has joined #internetarchive
03:33 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
04:01 🔗 Coderjoe has joined #internetarchive
04:12 🔗 DFJustin I would guess that popularity affects how likely it is that the page is linked from another site, and thus how likely it is that a crawler will stumble across the link
04:30 🔗 GLaDOS has quit IRC (Ping timeout: 260 seconds)
08:24 🔗 zyclonicz has quit IRC (Remote host closed the connection)
08:57 🔗 zhongfu_ has joined #internetarchive
08:57 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
09:04 🔗 zhongfu_ has quit IRC (Ping timeout: 260 seconds)
09:05 🔗 zhongfu has joined #internetarchive
09:39 🔗 atomotic has joined #internetarchive
09:51 🔗 zhongfu has quit IRC (Remote host closed the connection)
10:06 🔗 Sum has quit IRC (Ping timeout: 246 seconds)
10:07 🔗 Sum has joined #internetarchive
10:14 🔗 zhongfu has joined #internetarchive
10:20 🔗 Sum has quit IRC (Ping timeout: 246 seconds)
10:25 🔗 zyclonicz has joined #internetarchive
10:32 🔗 zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
10:32 🔗 GLaDOS has joined #internetarchive
10:34 🔗 zhongfu has joined #internetarchive
10:58 🔗 Sum has joined #internetarchive
11:03 🔗 Sum has quit IRC (Quit: Leaving)
11:09 🔗 atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
13:00 🔗 joepie91 DFJustin: perhaps, but even then it spiders recursively
13:01 🔗 joepie91 so the question is how - knowing all the links on the site - it prioritizes the important ones
13:32 🔗 Nemo_bis I think there is a limit to how many links it will follow from a page, hence the bottom ones may never be reached
13:32 🔗 Nemo_bis Or something silly/simple of that kind
13:33 🔗 Nemo_bis I often stumble upon wordpress.com domains (or similar) where only the main page is ever crawled
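
Nemo_bis's guess about a per-page link limit can be illustrated with a toy crawler. This is only a sketch of that hypothesis, not a description of how the Internet Archive's crawler actually behaves: the `crawl` function, the `max_links_per_page` parameter, and the `SITE` link map are all invented for illustration.

```python
from collections import deque


def crawl(site, start, max_links_per_page):
    """Breadth-first crawl that follows only the first N links on each page.

    `site` maps a page path to the list of links on it, in page order.
    The per-page cap is the hypothetical limit guessed at in the chat.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Assumed behavior: links past the first N on a page are ignored.
        for link in site.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Made-up site: the ToS link sits at the bottom of the home page.
SITE = {
    "/": ["/blog", "/about", "/contact", "/tos"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/about": ["/"],
}

capped = crawl(SITE, "/", max_links_per_page=3)
uncapped = crawl(SITE, "/", max_links_per_page=10)
print("/tos" in capped, "/tos" in uncapped)
```

With a cap of 3 the crawl never reaches `/tos`, even though it spiders recursively, because the link is fourth on the only page that points to it. Raising the cap lets the same crawl pick it up.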
14:32 🔗 zyclonicz has left
