Time |
Nickname |
Message |
00:19
🔗
|
|
Stiletto has joined #internetarchive |
01:02
🔗
|
Sum |
joepie91, yeah I have to imagine part of it is decided by popularity of pages |
01:02
🔗
|
Sum |
hence why some pages like the ToS wouldn't be crawled |
01:18
🔗
|
|
ats has quit IRC (Ping timeout: 244 seconds) |
01:23
🔗
|
|
Stiletto has quit IRC (Ping timeout: 244 seconds) |
01:24
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
01:28
🔗
|
|
Coderjoe has joined #internetarchive |
02:18
🔗
|
|
Stiletto has joined #internetarchive |
02:36
🔗
|
|
ats has joined #internetarchive |
02:46
🔗
|
|
ats has quit IRC (Read error: Operation timed out) |
02:52
🔗
|
|
ats has joined #internetarchive |
03:33
🔗
|
|
Coderjoe has quit IRC (Read error: Operation timed out) |
04:01
🔗
|
|
Coderjoe has joined #internetarchive |
04:12
🔗
|
DFJustin |
I would guess that popularity affects how likely it is that the page is linked from another site, and thus how likely it is that a crawler will stumble across the link |
04:30
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 260 seconds) |
08:24
🔗
|
|
zyclonicz has quit IRC (Remote host closed the connection) |
08:57
🔗
|
|
zhongfu_ has joined #internetarchive |
08:57
🔗
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
09:04
🔗
|
|
zhongfu_ has quit IRC (Ping timeout: 260 seconds) |
09:05
🔗
|
|
zhongfu has joined #internetarchive |
09:39
🔗
|
|
atomotic has joined #internetarchive |
09:51
🔗
|
|
zhongfu has quit IRC (Remote host closed the connection) |
10:06
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
10:07
🔗
|
|
Sum has joined #internetarchive |
10:14
🔗
|
|
zhongfu has joined #internetarchive |
10:20
🔗
|
|
Sum has quit IRC (Ping timeout: 246 seconds) |
10:25
🔗
|
|
zyclonicz has joined #internetarchive |
10:32
🔗
|
|
zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.) |
10:32
🔗
|
|
GLaDOS has joined #internetarchive |
10:34
🔗
|
|
zhongfu has joined #internetarchive |
10:58
🔗
|
|
Sum has joined #internetarchive |
11:03
🔗
|
|
Sum has quit IRC (Quit: Leaving) |
11:09
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
13:00
🔗
|
joepie91 |
DFJustin: perhaps, but even then it spiders recursively |
13:01
🔗
|
joepie91 |
so the question is how - knowing all the links on the site - it prioritizes the important ones |
13:32
🔗
|
Nemo_bis |
I think there is a limit to how many links it will follow from a page, hence the bottom ones may never be reached |
13:32
🔗
|
Nemo_bis |
Or something silly/simple of that kind |
13:33
🔗
|
Nemo_bis |
I often stumble upon wordpress.com domains (or similar) where only the main page is ever crawled |
14:32
🔗
|
|
zyclonicz has left |
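
Nemo_bis's guess at 13:32 (a cap on how many links are followed from each page, so links near the bottom may never be reached) can be illustrated with a toy crawl. This is only a sketch of that hypothesis, not how the Internet Archive's crawler is known to work: the `crawl` function, the `max_links_per_page` parameter, and the example site graph are all made up for illustration.

```python
from collections import deque


def crawl(graph, seed, max_links_per_page):
    """Breadth-first crawl of an in-memory link graph (hypothetical model).

    graph maps each page to its outgoing links in document order.
    Only the first max_links_per_page links on each page are followed.
    """
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # Links beyond the cap are silently dropped, whatever they point to.
        for link in graph.get(page, [])[:max_links_per_page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Made-up site: the ToS link sits last on the front page.
site = {
    "/": ["/blog", "/about", "/contact", "/tos"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/about": ["/"],
}

reached = crawl(site, "/", max_links_per_page=3)
missed = sorted(set(site["/"]) - set(reached))
print(reached)
print(missed)
```

With a cap of 3, `/tos` is the fourth link on the front page and is never queued, even though the crawl is recursive. That would match Sum's observation about ToS pages without any popularity ranking being involved. Raising the cap above the number of links on the page makes `/tos` reachable again.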