#webroasting 2019-04-16,Tue


Time Nickname Message
06:46 🔗 Fusl_ has quit IRC (Ping timeout: 615 seconds)
06:52 🔗 Fusl_ has joined #webroasting
08:55 🔗 Fusl sets mode: +o Fusl_
10:37 🔗 eythian Say I wanted to scrape as many of those inspire.net.nz homepages as I could, how would I go about that? After hitting Google, Bing, etc. for a set of base names, then what? Is there recommended tooling, or do I just use wget's mirroring function and throw it all in a directory?
10:42 🔗 Fusl eythian: #archivebot or https://github.com/ArchiveTeam/grab-site
10:44 🔗 eythian Fusl: thanks, I'll investigate those.
10:54 🔗 eythian oh, also, is there a nice way to scrape search engines for links matching a search result?
11:34 🔗 JAA eythian: https://github.com/JustAnotherArchivist/little-things/blob/master/bing-scrape for Bing. All other search engines I've tried have very strict rate limits that make them unscrapeable.
11:35 🔗 JAA Bing's fine with one request every 2 seconds.
11:36 🔗 JAA However, you need to run it for a *very* long time to actually obtain "all" results. Bing's results are ridiculously bad, and there is a lot of duplication in there.
11:37 🔗 JAA That script will by default retrieve 1k result pages (and take a bit over half an hour to run), which may or may not be enough.
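The pacing described above (one request every 2 seconds, 1k result pages, lots of duplicate results) can be sketched roughly as follows. This is a minimal sketch and not the linked bing-scrape script: `fetch_page` is a hypothetical callable standing in for whatever retrieves the links from one result page, and `sleep` is made injectable only so the loop can be tried without actually waiting.

```python
import time
from typing import Callable, Iterable, List

DELAY_SECONDS = 2.0  # "one request every 2 seconds" (from the log)
RESULT_PAGES = 1000  # "1k result pages" (from the log)


def estimated_runtime_minutes(pages: int = RESULT_PAGES,
                              delay: float = DELAY_SECONDS) -> float:
    """Rough estimate that counts only the pause per page, not network latency."""
    return pages * delay / 60


def collect_links(
    fetch_page: Callable[[int], Iterable[str]],  # hypothetical: page index -> links
    pages: int = RESULT_PAGES,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch result pages one at a time, pausing between requests and
    dropping links already seen on an earlier page."""
    seen = set()
    ordered = []
    for index in range(pages):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        for link in fetch_page(index):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered
```

With the defaults, `estimated_runtime_minutes()` works out to 1000 × 2 / 60 ≈ 33.3 minutes of waiting, which is consistent with "a bit over half an hour". Deduplicating while preserving first-seen order keeps the output stable across runs even when the same link shows up on many result pages.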
11:38 🔗 eythian JAA: thanks.
13:49 🔗 eythian https://developers.google.com/custom-search/v1/overview?csw=1 100 queries per day for free. That seems useful for occasional use, which is the case I want it for.
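A hedged sketch of how the 100-queries-per-day allowance might be budgeted when building requests for the Custom Search JSON API. The endpoint and the `key`, `cx`, `q` and `start` parameters are the API's documented ones; the figure of 10 results per request, the `site:` query and the placeholder credentials are assumptions for illustration. The sketch only builds URLs, it does not send them, and it does not check how deep the API allows a single query to page.

```python
from typing import List
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"
FREE_QUERIES_PER_DAY = 100  # from the log
RESULTS_PER_QUERY = 10      # assumption: results returned per request


def query_url(query: str, api_key: str, engine_id: str, start: int = 1) -> str:
    """Build one request URL; `start` is the 1-based index of the first result."""
    params = {"key": api_key, "cx": engine_id, "q": query, "start": start}
    return f"{ENDPOINT}?{urlencode(params)}"


def paged_urls(query: str, api_key: str, engine_id: str, pages: int,
               budget: int = FREE_QUERIES_PER_DAY) -> List[str]:
    """Build URLs for consecutive result pages, never exceeding the daily budget."""
    count = min(pages, budget)
    return [
        query_url(query, api_key, engine_id, start=1 + i * RESULTS_PER_QUERY)
        for i in range(count)
    ]
```

For example, `paged_urls("site:inspire.net.nz", "KEY", "CX", pages=3)` yields three URLs with `start=1`, `start=11` and `start=21`; `"KEY"` and `"CX"` are placeholders for a real API key and search engine ID.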
22:06 🔗 t3 has joined #webroasting
