06:46  Fusl_ has quit IRC (Ping timeout: 615 seconds)
06:52  Fusl_ has joined #webroasting
08:55  Fusl sets mode: +o Fusl_
10:37 <eythian> say I wanted to scrape as much of those inspire.net.nz homepages as I could, how would I go about that? After hitting Google, Bing, etc. for a set of base names, then what? Is there recommended tooling, or do I just use wget's mirroring function and throw it all in a directory?
10:42 <Fusl> eythian: #archivebot or https://github.com/ArchiveTeam/grab-site
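For the wget-style "throw it all in a directory" approach eythian asks about, here is a minimal sketch in Python using only the standard library: a breadth-first fetch of same-host pages saved under one directory. The start URL, output directory, and page cap are placeholders, and this is illustrative only; grab-site or ArchiveBot, as Fusl suggests, are the right tools for real archiving (retries, WARC output, page requisites).

# Minimal single-host mirror sketch (illustrative only; use grab-site or
# ArchiveBot for real archiving). Standard library only.
import os
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

START = "http://homepages.inspire.net.nz/"  # placeholder start URL
OUTDIR = "mirror"                           # output directory (assumption)
LIMIT = 100                                 # page cap, just for the sketch

class LinkParser(HTMLParser):
    # Collect href attributes from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def save(url, data):
    # Map the URL path onto a file under OUTDIR.
    path = urllib.parse.urlparse(url).path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"
    dest = os.path.join(OUTDIR, path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(data)

host = urllib.parse.urlparse(START).netloc
seen, queue, fetched = {START}, deque([START]), 0
while queue and fetched < LIMIT:
    url = queue.popleft()
    try:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
    except OSError:
        continue  # URLError/HTTPError both subclass OSError
    fetched += 1
    save(url, data)
    parser = LinkParser()
    parser.feed(data.decode("utf-8", "replace"))
    for href in parser.links:
        nxt = urllib.parse.urljoin(url, href).split("#")[0]
        if urllib.parse.urlparse(nxt).netloc == host and nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)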
10:44 <eythian> Fusl: thanks, I'll investigate those.
10:54 <eythian> oh, also, is there a nice way to scrape search engines for the links returned for a search query?
11:34 <JAA> eythian: https://github.com/JustAnotherArchivist/little-things/blob/master/bing-scrape for Bing. All other search engines I've tried have very strict rate limits that make them unscrapeable.
11:35 <JAA> Bing's fine with one request every 2 seconds.
11:36 <JAA> However, you need to run it for a *very* long time to actually obtain "all" results. Bing's results are ridiculously bad, and there is a lot of duplication in there.
11:37 <JAA> That script will by default retrieve 1k result pages (and take a bit over half an hour to run), which may or may not be enough.
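This is not JAA's bing-scrape script itself, but a sketch of the approach described above: page through Bing's HTML results at one request every 2 seconds, deduplicating links as you go (1000 pages at 2 seconds each is the roughly half-hour run mentioned). The `first` pagination parameter and the h2-anchor extraction heuristic are assumptions about Bing's markup, which changes frequently.

# Sketch of rate-limited Bing result paging (not the actual bing-scrape
# script; the `first` parameter and result markup are assumptions).
import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser

QUERY = "site:homepages.inspire.net.nz"  # example query (assumption)
PAGES = 1000                             # default page count JAA mentions
DELAY = 2                                # one request every 2 seconds

class ResultParser(HTMLParser):
    # Bing result links are assumed to sit inside <h2><a href=...>.
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            href = dict(attrs).get("href", "")
            if href.startswith("http"):
                self.links.append(href)
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

seen = set()
for page in range(PAGES):
    url = ("https://www.bing.com/search?"
           + urllib.parse.urlencode({"q": QUERY, "first": page * 10 + 1}))
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            html = resp.read().decode("utf-8", "replace")
    except OSError:
        break
    parser = ResultParser()
    parser.feed(html)
    if not parser.links:
        break  # no results on this page; probably past the end
    for link in parser.links:
        if link not in seen:  # heavy duplication, as noted above
            seen.add(link)
            print(link)
    time.sleep(DELAY)  # Bing's fine with one request every 2 seconds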
11:38 <eythian> JAA: thanks.
13:49 <eythian> https://developers.google.com/custom-search/v1/overview?csw=1 100 queries per day for free. That seems useful for occasional use, like the case I'm wanting it for.
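A sketch of calling the Custom Search JSON API linked above: the API key and cx (Programmable Search Engine ID) are placeholders you would create in Google's console. Each request returns up to 10 results, so paging through 100 results spends 10 of the day's 100 free queries.

# Sketch of the Google Custom Search JSON API (free tier: 100 queries/day;
# the API key, engine ID, and query below are placeholders).
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"   # from the Google Cloud console (placeholder)
CX = "YOUR_ENGINE_ID"      # Programmable Search Engine ID (placeholder)
QUERY = "site:homepages.inspire.net.nz"  # example query (assumption)

def search(query, start=1):
    # Each call returns up to 10 results; `start` pages through them.
    params = urllib.parse.urlencode(
        {"key": API_KEY, "cx": CX, "q": query, "start": start})
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Ten calls here = ten of the day's 100 free queries.
for start in range(1, 100, 10):
    data = search(QUERY, start)
    for item in data.get("items", []):
        print(item["link"])
    if "nextPage" not in data.get("queries", {}):
        break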
22:06  t3 has joined #webroasting