06:46  Fusl_ has quit IRC (Ping timeout: 615 seconds)
06:52  Fusl_ has joined #webroasting
08:55  Fusl sets mode: +o Fusl_
10:37 <eythian> say I wanted to scrape as much of those inspire.net.nz homepages as I could, how would I go about that? After hitting Google, Bing, etc. for a set of base names, then what? Is there recommended tooling, or do I just use wget's mirroring function and throw it all in a directory?
10:42 <Fusl> eythian: #archivebot or https://github.com/ArchiveTeam/grab-site
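For the wget-style "throw it all in a directory" approach eythian asks about, here is a minimal sketch in Python using only the standard library: a breadth-first fetch of same-host pages saved under one directory. The start URL, output directory, and page cap are placeholders, and this is illustrative only; grab-site or ArchiveBot, as Fusl suggests, are the right tools for real archiving (retries, WARC output, page requisites).

# Minimal single-host mirror sketch (illustrative only; use grab-site or
# ArchiveBot for real archiving). Standard library only.
import os
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

START = "http://homepages.inspire.net.nz/"  # placeholder start URL
OUTDIR = "mirror"                           # output directory (assumption)
LIMIT = 100                                 # page cap, just for the sketch

class LinkParser(HTMLParser):
    # Collect href attributes from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def save(url, data):
    # Map the URL path onto a file under OUTDIR.
    path = urllib.parse.urlparse(url).path.lstrip("/")
    if not path or path.endswith("/"):
        path += "index.html"
    dest = os.path.join(OUTDIR, path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(data)

host = urllib.parse.urlparse(START).netloc
seen, queue, fetched = {START}, deque([START]), 0
while queue and fetched < LIMIT:
    url = queue.popleft()
    try:
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
    except OSError:
        continue  # URLError/HTTPError both subclass OSError
    fetched += 1
    save(url, data)
    parser = LinkParser()
    parser.feed(data.decode("utf-8", "replace"))
    for href in parser.links:
        nxt = urllib.parse.urljoin(url, href).split("#")[0]
        if urllib.parse.urlparse(nxt).netloc == host and nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)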
10:44 <eythian> Fusl: thanks, I'll investigate those.
10:54 <eythian> oh, also, is there a nice way to scrape search engines for the links returned for a search query?
11:34 <JAA> eythian: https://github.com/JustAnotherArchivist/little-things/blob/master/bing-scrape for Bing. All other search engines I've tried have very strict rate limits that make them unscrapeable.
11:35 <JAA> Bing's fine with one request every 2 seconds.
11:36 <JAA> However, you need to run it for a *very* long time to actually obtain "all" results. Bing's results are ridiculously bad, and there is a lot of duplication in there.
11:37 <JAA> That script will by default retrieve 1k result pages (and take a bit over half an hour to run), which may or may not be enough.
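This is not JAA's bing-scrape script itself, but a sketch of the approach described above: page through Bing's HTML results at one request every 2 seconds, deduplicating links as you go (1000 pages at 2 seconds each is the roughly half-hour run mentioned). The `first` pagination parameter and the h2-anchor extraction heuristic are assumptions about Bing's markup, which changes frequently.

# Sketch of rate-limited Bing result paging (not the actual bing-scrape
# script; the `first` parameter and result markup are assumptions).
import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser

QUERY = "site:homepages.inspire.net.nz"  # example query (assumption)
PAGES = 1000                             # default page count JAA mentions
DELAY = 2                                # one request every 2 seconds

class ResultParser(HTMLParser):
    # Bing result links are assumed to sit inside <h2><a href=...>.
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            href = dict(attrs).get("href", "")
            if href.startswith("http"):
                self.links.append(href)
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

seen = set()
for page in range(PAGES):
    url = ("https://www.bing.com/search?"
           + urllib.parse.urlencode({"q": QUERY, "first": page * 10 + 1}))
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            html = resp.read().decode("utf-8", "replace")
    except OSError:
        break
    parser = ResultParser()
    parser.feed(html)
    if not parser.links:
        break  # no results on this page; probably past the end
    for link in parser.links:
        if link not in seen:  # heavy duplication, as noted above
            seen.add(link)
            print(link)
    time.sleep(DELAY)  # Bing's fine with one request every 2 seconds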
11:38 <eythian> JAA: thanks.
13:49 <eythian> https://developers.google.com/custom-search/v1/overview?csw=1 100 queries per day for free. That seems useful for occasional use, like the case I'm wanting it for.
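A sketch of calling the Custom Search JSON API linked above: the API key and cx (Programmable Search Engine ID) are placeholders you would create in Google's console. Each request returns up to 10 results, so paging through 100 results spends 10 of the day's 100 free queries.

# Sketch of the Google Custom Search JSON API (free tier: 100 queries/day;
# the API key, engine ID, and query below are placeholders).
import json
import urllib.parse
import urllib.request

API_KEY = "YOUR_API_KEY"   # from the Google Cloud console (placeholder)
CX = "YOUR_ENGINE_ID"      # Programmable Search Engine ID (placeholder)
QUERY = "site:homepages.inspire.net.nz"  # example query (assumption)

def search(query, start=1):
    # Each call returns up to 10 results; `start` pages through them.
    params = urllib.parse.urlencode(
        {"key": API_KEY, "cx": CX, "q": query, "start": start})
    url = "https://www.googleapis.com/customsearch/v1?" + params
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Ten calls here = ten of the day's 100 free queries.
for start in range(1, 100, 10):
    data = search(QUERY, start)
    for item in data.get("items", []):
        print(item["link"])
    if "nextPage" not in data.get("queries", {}):
        break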
22:06  t3 has joined #webroasting