#warrior 2018-11-02, Fri

03:32 🔗 zerkalo has quit IRC (Ping timeout: 264 seconds)
03:32 🔗 zerkalo has joined #warrior
03:39 🔗 dxrt has quit IRC (Ping timeout: 360 seconds)
03:39 🔗 dxrt has joined #warrior
03:40 🔗 svchfoo1 sets mode: +o dxrt
03:41 🔗 saper has quit IRC (Ping timeout: 264 seconds)
03:47 🔗 phirephly has quit IRC (Ping timeout: 360 seconds)
03:47 🔗 fennec has quit IRC (Ping timeout: 360 seconds)
03:47 🔗 arkiver has quit IRC (Ping timeout: 360 seconds)
03:48 🔗 phirephly has joined #warrior
03:48 🔗 saper has joined #warrior
03:52 🔗 arkiver has joined #warrior
03:53 🔗 svchfoo1 sets mode: +o arkiver
03:57 🔗 fennec has joined #warrior
04:05 🔗 Cameron_D has quit IRC (Read error: Operation timed out)
04:08 🔗 Cameron_D has joined #warrior
06:11 🔗 anarcat has joined #warrior
06:11 🔗 anarcat how does warrior compare to archivebot? why are they different?
06:12 🔗 anarcat if i have a batch of archival work to coordinate with a bunch of people, can i get a slot on the tracker and we all jump in and party hard?
06:18 🔗 logchfoo0 starts logging #warrior at Fri Nov 02 06:18:53 2018
06:18 🔗 logchfoo0 has joined #warrior
06:19 🔗 svchfoo1 has joined #warrior
06:20 🔗 svchfoo3 sets mode: +o svchfoo1
07:23 🔗 JAA anarcat: ArchiveBot = recursive crawl of a site, distributed only in the sense that jobs are spread over multiple machines. Warrior = distributed retrieval of a site, i.e. each worker retrieves parts of the site; works only when a site can easily be split into work items, e.g. when content can be accessed by a numeric ID in the URL.
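
A minimal sketch of the item-based split JAA describes, assuming a site whose content is reachable by numeric ID; the `ids:` item naming scheme and the chunk size are illustrative, not any real tracker's convention:

```python
def make_items(max_id, chunk=10):
    """Yield item names like 'ids:0-9' covering the whole ID space."""
    for start in range(0, max_id + 1, chunk):
        end = min(start + chunk - 1, max_id)
        yield f"ids:{start}-{end}"

print(list(make_items(99))[:3])
# ['ids:0-9', 'ids:10-19', 'ids:20-29']
```

Each worker then claims one such item from the tracker and fetches only the IDs it covers, which is what makes the retrieval distributable.
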
07:31 🔗 SmileyG has quit IRC (Read error: Operation timed out)
07:31 🔗 Smiley has joined #warrior
07:36 🔗 alex__ has joined #warrior
08:22 🔗 svchfoo1 has quit IRC (hub.efnet.us irc.colosolutions.net)
08:22 🔗 ivan has quit IRC (hub.efnet.us irc.colosolutions.net)
08:22 🔗 JAA has quit IRC (hub.efnet.us irc.colosolutions.net)
08:32 🔗 svchfoo1 has joined #warrior
08:32 🔗 ivan has joined #warrior
08:32 🔗 JAA has joined #warrior
08:32 🔗 irc.colosolutions.net sets mode: +ooo svchfoo1 ivan JAA
08:32 🔗 bakJAA sets mode: +o JAA
08:34 🔗 JAA sets mode: +o bakJAA
09:36 🔗 alex__ has quit IRC (Quit: alex__)
09:38 🔗 alex__ has joined #warrior
09:50 🔗 alex__ has quit IRC (Quit: alex__)
10:16 🔗 nertzy has joined #warrior
10:46 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
10:52 🔗 alex__ has joined #warrior
13:02 🔗 anarcat JAA: so it would be relevant for, say, dados.gov.br, but not for crawling sites more generically
13:06 🔗 jut It would be useful for Flickr
13:16 🔗 JAA anarcat: I guess it could be used for dados.gov.br, but that would require generating a list of the dataset slugs (e.g. dominios-gov-br in http://dados.gov.br/dataset/dominios-gov-br ) at least. An example of where this works very well is forums, especially those that use URLs like /showthread.php?t=1234. You just create items like 'threads:1230-1239' which then grabs those 10 threads.
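
The forum example above can be sketched directly; this toy expansion of a 'threads:1230-1239' item into showthread.php URLs assumes a hypothetical forum host:

```python
def urls_for_item(item, base="http://forum.example.com/showthread.php?t="):
    """Expand one claimed item back into the URLs a worker should fetch."""
    _kind, _, span = item.partition(":")
    start, _, end = span.partition("-")
    return [f"{base}{n}" for n in range(int(start), int(end) + 1)]

print(urls_for_item("threads:1230-1239"))  # exactly those 10 thread URLs
```
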
13:17 🔗 anarcat i see
13:19 🔗 JAA So yeah, there has to be a way to easily discover content based on a short ID. In the best case, that's simply a range of numeric IDs since that's really easiest. But it can also involve scraping identifiers beforehand, e.g. usernames for ISP hosting (thinking of Angelfire and Tripod there).
13:20 🔗 JAA There just has to be a way to somehow split the site up into nice, small chunks. That always requires code for the specific site, so yeah, not very generalisable.
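
Where IDs are not numeric, the identifiers are scraped first and then grouped into fixed-size work items, roughly like this sketch (the `users:` scheme and the names are made up for illustration):

```python
def chunk_usernames(usernames, size=50):
    """Group pre-scraped identifiers into fixed-size work items."""
    for i in range(0, len(usernames), size):
        yield "users:" + ",".join(usernames[i:i + size])

# e.g. a list scraped from a member directory beforehand (made-up names)
scraped = ["alice", "bob", "carol", "dave"]
print(list(chunk_usernames(scraped, size=2)))
# ['users:alice,bob', 'users:carol,dave']
```
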
13:22 🔗 anarcat yeah and the dados pieces are not necessarily "small"
13:22 🔗 anarcat FSOV small (for some value of "small")
13:22 🔗 JAA Of course, it is not impossible to build a recursive, distributed crawl. Each worker could grab e.g. 100 URLs from the queue, retrieve those, extract all links it finds there, then send those back to the tracker for insertion into the queue. (I was actually working on a version of wpull that can do this using a central pgsql DB.) But that has its own host of problems.
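
A very rough sketch of the recursive, distributed crawl JAA outlines: each worker claims a batch of URLs, fetches them, extracts links, and sends the discoveries back for queueing. The tracker endpoints are hypothetical stand-ins, and the link extraction is deliberately crude; a real crawler such as wpull parses HTML properly.

```python
import re
import urllib.request

TRACKER = "http://tracker.example.org"  # hypothetical tracker, not a real API

def claim_batch(n=100):
    """Claim roughly n queued URLs from the tracker (hypothetical endpoint)."""
    with urllib.request.urlopen(f"{TRACKER}/claim?n={n}") as resp:
        return resp.read().decode().splitlines()

def extract_links(html):
    """Crude href scrape; real crawlers parse the HTML instead."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def work_once():
    discovered = []
    for url in claim_batch():
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                discovered.extend(extract_links(resp.read().decode("utf-8", "replace")))
        except Exception:
            pass  # a real worker would report failures so the tracker can retry
    # Send newly found URLs back for deduplication and queueing.
    urllib.request.urlopen(f"{TRACKER}/discovered", data="\n".join(discovered).encode())
```

The "host of problems" mentioned above shows up in exactly the parts this sketch glosses over: deduplicating discovered URLs centrally, handling worker failures, and keeping the shared queue from becoming a bottleneck.
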
13:24 🔗 JAA "Small" is relative, obviously. We had a project for VidMe which had some massive items.
13:25 🔗 JAA So "small" with respect to the size of the entire site. We typically aim for a few 100k to a few million items in total.
17:00 🔗 * anarcat nods
19:05 🔗 alex____ has joined #warrior
19:06 🔗 alex__ has quit IRC (Ping timeout: 252 seconds)
21:47 🔗 tuluu has quit IRC (Remote host closed the connection)
21:49 🔗 tuluu has joined #warrior
