[06:46] *** Fusl_ has quit IRC (Ping timeout: 615 seconds)
[06:52] *** Fusl_ has joined #webroasting
[08:55] *** Fusl sets mode: +o Fusl_
[10:37] say I wanted to scrape as much of those inspire.net.nz homepages as I could, how would I go about that? After hitting Google, Bing, etc. for a set of base names, then what? Is there recommended tooling, or do I just use wget's mirroring function and throw it all in a directory?
[10:42] eythian: #archivebot or https://github.com/ArchiveTeam/grab-site
[10:44] Fusl: thanks, I'll investigate those.
[10:54] oh, also, is there a nice way to scrape search engines for links matching a search query?
[11:34] eythian: https://github.com/JustAnotherArchivist/little-things/blob/master/bing-scrape for Bing. All other search engines I've tried have very strict rate limits that make them unscrapeable.
[11:35] Bing's fine with one request every 2 seconds.
[11:36] However, you need to run it for a *very* long time to actually obtain "all" results. Bing's results are ridiculously bad, and there is a lot of duplication in there.
[11:37] That script will by default retrieve 1k result pages (and take a bit over half an hour to run), which may or may not be enough.
[11:38] JAA: thanks.
[13:49] https://developers.google.com/custom-search/v1/overview?csw=1 100 queries per day for free. That seems useful for occasional use, like the case I'm wanting it for.
[22:06] *** t3 has joined #webroasting
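
Editor's note on the mirroring question at [10:37]-[10:42]: a minimal sketch of both suggested routes. The URL below is a placeholder, not a real inspire.net.nz homepage, and the wget flag set is one reasonable choice for a polite mirror rather than anything prescribed in the channel.

    # grab-site: writes a WARC and offers a dashboard for monitoring the crawl
    grab-site https://example.inspire.net.nz/

    # plain wget mirror into a directory, rewriting links for offline browsing
    wget --mirror --page-requisites --adjust-extension --convert-links --wait=1 \
        https://example.inspire.net.nz/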
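
Editor's note on the Bing scraping discussed at [11:34]-[11:37]: a rough sketch of the kind of loop the linked bing-scrape script runs, not the script itself. The 2-second delay and the 1k-page default come from the conversation; the "first" pagination parameter and the href-extraction regex are assumptions about Bing's current result markup and may need adjusting.

    import re
    import time
    import urllib.parse

    import requests

    QUERY = 'site:inspire.net.nz'   # hypothetical query; substitute your own
    PAGES = 1000                    # the script's default per [11:37]
    DELAY = 2                       # one request every 2 seconds per [11:35]

    seen = set()
    session = requests.Session()
    session.headers['User-Agent'] = 'Mozilla/5.0 (compatible; example-scraper)'

    for page in range(PAGES):
        # Bing paginates with a 1-based result offset in the "first" parameter.
        params = urllib.parse.urlencode({'q': QUERY, 'first': page * 10 + 1})
        html = session.get('https://www.bing.com/search?' + params).text
        # Crude extraction of organic result links; the <h2><a ...> markup is
        # an assumption about Bing's HTML, not a stable interface.
        for url in re.findall(r'<h2><a href="(http[^"]+)"', html):
            if url not in seen:
                seen.add(url)
                print(url)
        time.sleep(DELAY)

As noted at [11:36], expect heavy duplication across result pages; de-duplicating with a set, as above, is the cheap fix.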
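
Editor's note on the Custom Search API mentioned at [13:49]: a minimal sketch of one query against the JSON endpoint. The API key and search engine ID are placeholders you create in the Google console, and each request returns at most 10 results, so the 100 free queries per day work out to roughly 1,000 results.

    import requests

    API_KEY = 'YOUR_API_KEY'   # placeholder: create one in the Google Cloud console
    CX = 'YOUR_ENGINE_ID'      # placeholder: your Programmable Search Engine ID

    def search(query, start=1):
        """Fetch one page (up to 10 results) from the Custom Search JSON API."""
        resp = requests.get(
            'https://www.googleapis.com/customsearch/v1',
            params={'key': API_KEY, 'cx': CX, 'q': query, 'start': start},
            timeout=30,
        )
        resp.raise_for_status()
        return [item['link'] for item in resp.json().get('items', [])]

    # Hypothetical query; the engine must be configured to cover the sites you want.
    for link in search('site:inspire.net.nz'):
        print(link)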