03:32  zerkalo has quit IRC (Ping timeout: 264 seconds)
03:32  zerkalo has joined #warrior
03:39  dxrt has quit IRC (Ping timeout: 360 seconds)
03:39  dxrt has joined #warrior
03:40  svchfoo1 sets mode: +o dxrt
03:41  saper has quit IRC (Ping timeout: 264 seconds)
03:47  phirephly has quit IRC (Ping timeout: 360 seconds)
03:47  fennec has quit IRC (Ping timeout: 360 seconds)
03:47  arkiver has quit IRC (Ping timeout: 360 seconds)
03:48  phirephly has joined #warrior
03:48  saper has joined #warrior
03:52  arkiver has joined #warrior
03:53  svchfoo1 sets mode: +o arkiver
03:57  fennec has joined #warrior
04:05  Cameron_D has quit IRC (Read error: Operation timed out)
04:08  Cameron_D has joined #warrior
06:11  anarcat has joined #warrior
06:11 <anarcat> how does warrior compare to archivebot? why are they different?
06:12 <anarcat> if i have a batch of archival work to coordinate with a bunch of people, can i get a slot on the tracker and we all jump in and party hard?
06:18  logchfoo0 starts logging #warrior at Fri Nov 02 06:18:53 2018
06:18  logchfoo0 has joined #warrior
06:19  svchfoo1 has joined #warrior
06:20  svchfoo3 sets mode: +o svchfoo1
07:23 <JAA> anarcat: ArchiveBot = recursive crawl of a site, distributed only in the sense that jobs are spread over multiple machines. Warrior = distributed retrieval of a site, i.e. each worker retrieves parts of the site; works only when a site can easily be split into work items, e.g. when content can be accessed by a numeric ID in the URL.
07:31  SmileyG has quit IRC (Read error: Operation timed out)
07:31  Smiley has joined #warrior
07:36  alex__ has joined #warrior
08:22  svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
08:22  ivan has quit IRC (hub.efnet.us irc.colosolutions.net)
08:22  JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
08:32  svchfoo1 has joined #warrior
08:32  ivan has joined #warrior
08:32  JAA has joined #warrior
08:32  irc.colosolutions.net sets mode: +ooo svchfoo1 ivan JAA
08:32  bakJAA sets mode: +o JAA
08:34  JAA sets mode: +o bakJAA
09:36  alex__ has quit IRC (Quit: alex__)
09:38  alex__ has joined #warrior
09:50  alex__ has quit IRC (Quit: alex__)
10:16  nertzy has joined #warrior
10:46  nertzy has quit IRC (Quit: This computer has gone to sleep)
10:52  alex__ has joined #warrior
13:02 <anarcat> JAA: so would be relevant for (say) dados.gov.br, but not for crawling sites more generically
13:06 <jut> It would be useful for Flickr
13:16 <JAA> anarcat: I guess it could be used for dados.gov.br, but that would require generating a list of the dataset slugs (e.g. dominios-gov-br in http://dados.gov.br/dataset/dominios-gov-br ) at least. An example of where this works very well is forums, especially those that use URLs like /showthread.php?t=1234. You just create items like 'threads:1230-1239' which then grabs those 10 threads.
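A minimal sketch in Python of the item scheme JAA describes: a tracker item named 'threads:1230-1239' expands into the ten thread URLs a worker would fetch. The function name and forum base URL are illustrative assumptions, not actual ArchiveTeam project code.

```python
# Sketch of expanding a Warrior-style work item into URLs, based on
# JAA's 'threads:1230-1239' example. The base URL is a hypothetical
# forum, not a real project target.
FORUM_BASE = "http://forum.example.com/showthread.php?t={thread_id}"

def expand_item(item_name: str) -> list[str]:
    """Turn an item like 'threads:1230-1239' into the thread URLs it covers."""
    item_type, _, id_range = item_name.partition(":")
    if item_type != "threads":
        raise ValueError(f"unknown item type: {item_type!r}")
    start, _, end = id_range.partition("-")
    return [FORUM_BASE.format(thread_id=i) for i in range(int(start), int(end) + 1)]

if __name__ == "__main__":
    for url in expand_item("threads:1230-1239"):
        print(url)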
13:17 <anarcat> i see
13:19 <JAA> So yeah, there has to be a way to easily discover content based on a short ID. In the best case, that's simply a range of numeric IDs since that's really easiest. But it can also involve scraping identifiers beforehand, e.g. usernames for ISP hosting (thinking of Angelfire and Tripod there).
13:20 <JAA> There just has to be a way to somehow split the site up into nice, small chunks. That always requires code for the specific site, so yeah, not very generalisable.
13:22 <anarcat> yeah and the dados pieces are not necessarily "small"
13:22 <anarcat> FSOV small
13:22 <JAA> Of course, it is not impossible to build a recursive, distributed crawl. Each worker could grab e.g. 100 URLs from the queue, retrieve those, extract all links it finds there, then send those back to the tracker for insertion into the queue. (I was actually working on a version of wpull that can do this using a central pgsql DB.) But that has its own host of problems.
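A rough sketch of the worker loop JAA outlines: claim a batch of URLs from a central queue, fetch each, extract links, and send discoveries back for queueing. The HTTP tracker endpoints here are invented purely for illustration; JAA's actual experiment was a wpull variant backed by a central pgsql DB.

```python
# Rough sketch of the recursive, distributed crawl loop JAA describes.
# The tracker API (/claim, /report) is hypothetical.
import re
import urllib.parse
import urllib.request

TRACKER = "http://tracker.example.com"  # hypothetical endpoint
HREF_RE = re.compile(rb'href="([^"]+)"')

def claim_batch(size=100):
    # Hypothetical API: tracker hands out up to `size` unclaimed URLs,
    # one per line.
    with urllib.request.urlopen(f"{TRACKER}/claim?n={size}") as resp:
        return resp.read().decode().splitlines()

def report(done_url, discovered):
    # Hypothetical API: mark `done_url` finished and submit newly found
    # links; the tracker deduplicates and inserts them into the queue.
    data = urllib.parse.urlencode({"done": done_url, "found": "\n".join(discovered)})
    urllib.request.urlopen(f"{TRACKER}/report", data=data.encode())

def crawl_once():
    # Grab e.g. 100 URLs, retrieve them, extract all links, send back.
    for url in claim_batch(100):
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        links = [urllib.parse.urljoin(url, m.group(1).decode())
                 for m in HREF_RE.finditer(body)]
        report(url, links)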
13:24 <JAA> "Small" is relative, obviously. We had a project for VidMe which had some massive items.
13:25 <JAA> So "small" with respect to the size of the entire site. We typically aim for a few 100k to a few million items in total.
17:00  * anarcat nods
19:05  alex____ has joined #warrior
19:06  alex__ has quit IRC (Ping timeout: 252 seconds)
21:47  tuluu has quit IRC (Remote host closed the connection)
21:49  tuluu has joined #warrior