[06:46] *** Fusl_ has quit IRC (Ping timeout: 615 seconds)
[06:52] *** Fusl_ has joined #webroasting
[08:55] *** Fusl sets mode: +o Fusl_
[10:37] say I wanted to scrape as much of those inspire.net.nz homepages as I could, how would I go about that? After hitting Google, Bing, etc. for a set of base names, then what? Is there recommended tooling, or do I just use wget's mirroring function and throw it all in a directory?
[10:42] eythian: #archivebot or https://github.com/ArchiveTeam/grab-site
[10:44] Fusl: thanks, I'll investigate those.
[10:54] oh, also, is there a nice way to scrape search engines for links matching a search query?
[11:34] eythian: https://github.com/JustAnotherArchivist/little-things/blob/master/bing-scrape for Bing. All other search engines I've tried have very strict rate limits that make them unscrapeable.
[11:35] Bing's fine with one request every 2 seconds.
[11:36] However, you need to run it for a *very* long time to actually obtain "all" results. Bing's results are ridiculously bad, and there is a lot of duplication in there.
[11:37] That script will by default retrieve 1k result pages (and take a bit over half an hour to run), which may or may not be enough.
[11:38] JAA: thanks.
[13:49] https://developers.google.com/custom-search/v1/overview?csw=1 100 queries per day for free. That seems useful for occasional use, like the case I'm wanting it for.
[22:06] *** t3 has joined #webroasting
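
Editor's note on the mirroring question at [10:37]-[10:42]: a minimal sketch of both suggested routes. The URL below is a placeholder, not a real inspire.net.nz homepage, and the wget flag set is one reasonable choice for a polite mirror rather than anything prescribed in the channel.

    # grab-site: writes a WARC and offers a dashboard for monitoring the crawl
    grab-site https://example.inspire.net.nz/

    # plain wget mirror into a directory, rewriting links for offline browsing
    wget --mirror --page-requisites --adjust-extension --convert-links --wait=1 \
        https://example.inspire.net.nz/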
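
Editor's note on the Bing scraping discussed at [11:34]-[11:37]: a rough sketch of the kind of loop the linked bing-scrape script runs, not the script itself. The 2-second delay and the 1k-page default come from the conversation; the "first" pagination parameter and the href-extraction regex are assumptions about Bing's current result markup and may need adjusting.

    import re
    import time
    import urllib.parse

    import requests

    QUERY = 'site:inspire.net.nz'   # hypothetical query; substitute your own
    PAGES = 1000                    # the script's default per [11:37]
    DELAY = 2                       # one request every 2 seconds per [11:35]

    seen = set()
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0 (compatible; example-scraper)'

    for page in range(PAGES):
        # Bing paginates with a 1-based result offset in the "first" parameter.
        params = urllib.parse.urlencode({'q': QUERY, 'first': page * 10 + 1})
        html = session.get('https://www.bing.com/search?' + params).text
        # Crude extraction of organic result links; the <h2><a ...> markup is
        # an assumption about Bing's HTML, not a stable interface.
        for url in re.findall(r'<h2><a href="(http[^"]+)"', html):
            if url not in seen:
                seen.add(url)
                print(url)
        time.sleep(DELAY)

As noted at [11:36], expect heavy duplication across result pages; de-duplicating with a set, as above, is the cheap fix.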
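
Editor's note on the Custom Search API mentioned at [13:49]: a minimal sketch of one query against the JSON endpoint. The API key and search engine ID are placeholders you create in the Google console, and each request returns at most 10 results, so the 100 free queries per day work out to roughly 1,000 results.

    import requests

    API_KEY = 'YOUR_API_KEY'   # placeholder: create one in the Google Cloud console
    CX = 'YOUR_ENGINE_ID'      # placeholder: your Programmable Search Engine ID

    def search(query, start=1):
        """Fetch one page (up to 10 results) from the Custom Search JSON API."""
        resp = requests.get(
            'https://www.googleapis.com/customsearch/v1',
            params={'key': API_KEY, 'cx': CX, 'q': query, 'start': start},
            timeout=30,
        )
        resp.raise_for_status()
        return [item['link'] for item in resp.json().get('items', [])]

    # Hypothetical query; the engine must be configured to cover the sites you want.
    for link in search('site:inspire.net.nz'):
        print(link)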