Time |
Nickname |
Message |
00:03
🔗
|
|
JAA has quit IRC (Ping timeout: 246 seconds) |
01:02
🔗
|
|
JAA has joined #webroasting |
01:03
🔗
|
|
bakJAA sets mode: +o JAA |
05:18
🔗
|
|
ripdog has joined #webroasting |
05:19
🔗
|
ripdog |
Vodafone NZ is shutting down their ihug (old acquired company) hosting service... tomorrow. Example url: http://homepages.ihug.co.nz/~(user) |
05:19
🔗
|
ripdog |
I guess it should all be in the internet archive, so probably not a disaster |
05:25
🔗
|
ripdog |
context: https://www.geekzone.co.nz/forums.asp?forumid=40&topicid=243123 |
05:26
🔗
|
|
ripdog has quit IRC (Quit: Page closed) |
06:18
🔗
|
|
hook54321 has joined #webroasting |
09:59
🔗
|
|
mr_archiv has quit IRC (Read error: Operation timed out) |
10:00
🔗
|
|
chfoo has quit IRC (Read error: Operation timed out) |
10:00
🔗
|
|
chfoo has joined #webroasting |
10:18
🔗
|
|
mr_archiv has joined #webroasting |
12:02
🔗
|
JAA |
Three days lead time. Fantastic... |
12:21
🔗
|
JAA |
Extracting links from the WBM and Bing now. |
12:45
🔗
|
|
eientei95 has joined #webroasting |
12:46
🔗
|
eientei95 |
Bing seems to have a lot more on ihug than Google or DDG |
12:46
🔗
|
JAA |
For log completeness: 2018-11-29 12:42:20 < eientei95> JAA: https://transfer.sh/KGXuC/hp_ihug here's a bunch of the ihug sites |
12:47
🔗
|
JAA |
I'm reusing my TalkTalk scripts. So far, I have a list of ~17.5k URLs. |
12:47
🔗
|
eientei95 |
`site:http://homepages.ihug.co.nz/` gives 7,230 results in Bing |
12:47
🔗
|
eientei95 |
Oh wow |
12:47
🔗
|
JAA |
I'm scraping the Wayback Machine and Bing. |
12:47
🔗
|
eientei95 |
Better than me copy-pasting it into a text doc and regexing out the URLs |
12:48
🔗
|
JAA |
And those 17.5k URLs are a bit inflated since they also include every directory level. |
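Editor's sketch of why such a list gets inflated: if every discovered URL is also expanded into each of its parent directories, one deep link yields several entries. The function name `directory_levels` and the `~user` path are hypothetical, chosen only for illustration; the actual scripts are not shown in this log.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_levels(url):
    """Return the host root, every parent directory, and the URL itself.

    Hypothetical helper: it only illustrates how one URL can turn into
    several list entries when each directory level is included.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    levels = []
    for depth in range(len(segments) + 1):
        path = "/" + "/".join(segments[:depth])
        # Directories get a trailing slash; the final segment keeps the
        # form it had in the input URL.
        if depth and (depth < len(segments) or parts.path.endswith("/")):
            path += "/"
        levels.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return levels


if __name__ == "__main__":
    for level in directory_levels(
        "http://homepages.ihug.co.nz/~user/photos/index.html"
    ):
        print(level)
```

A single page three levels deep contributes four entries here, which is roughly the kind of inflation described above.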
12:48
🔗
|
eientei95 |
Ah |
12:48
🔗
|
JAA |
Well, that's pretty much what I'm doing anyway, but simply in an automated way. :-P |
12:50
🔗
|
JAA |
The WBM is returning lots of weird stuff as usual, e.g. http://homepages.paradise.net.nz/~rhector%0a/ |
12:50
🔗
|
eientei95 |
:P |
12:50
🔗
|
eientei95 |
Good ol' unstriped newline chars |
12:50
🔗
|
eientei95 |
*pp |
12:51
🔗
|
JAA |
http://homepages.paradise.net.nz/rossstan'target='_blank/ |
12:51
🔗
|
JAA |
http://homepages.paradise.net.nz/riffraff4robots.txt/ |
12:51
🔗
|
eientei95 |
Someone didn't check their links properly |
12:53
🔗
|
JAA |
Aand the list shrank by almost 3k after filtering out the trailing newlines. |
12:54
🔗
|
JAA |
Oh, I like this one: http://homepages.ihug.co.nz/~%0a%0a/ |
12:55
🔗
|
eientei95 |
lol |
13:00
🔗
|
JAA |
Turns out there's a good number of pages which are only listed by the WBM with the trailing newline. Ok, strip & deduplicate it is. |
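Editor's sketch of the "strip & deduplicate" step: rather than dropping URLs that end in an encoded newline, remove the newline and then discard duplicates. It assumes the stray characters are percent-encoded CR/LF (`%0a`, `%0d`) or literal whitespace at the very end of the URL, optionally followed by a trailing slash, as in the examples quoted above. The function name is hypothetical and this is not the script actually used.

```python
import re

# Percent-encoded CR/LF or literal whitespace at the end of the URL,
# optionally followed by one trailing slash (which is kept).
_TRAILING_NEWLINES = re.compile(r"(?:%0[ad]|\s)+(?=/?$)", re.IGNORECASE)


def strip_and_deduplicate(urls):
    """Strip trailing newline debris, then drop duplicates, keeping order."""
    seen = set()
    result = []
    for url in urls:
        cleaned = _TRAILING_NEWLINES.sub("", url)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


if __name__ == "__main__":
    sample = [
        "http://homepages.paradise.net.nz/~rhector%0a/",
        "http://homepages.paradise.net.nz/~rhector/",
        "http://homepages.ihug.co.nz/~%0a%0a/",
    ]
    for url in strip_and_deduplicate(sample):
        print(url)
```

Pages that the WBM lists only with the trailing newline survive in cleaned form, while pages listed both ways collapse into one entry.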
13:01
🔗
|
|
Sanqui has joined #webroasting |
13:02
🔗
|
JAA |
15840 currently |
13:02
🔗
|
JAA |
Bing scrape is still running though. |
13:02
🔗
|
JAA |
I'll throw this onto trent-nz, by the way. |
13:10
🔗
|
JAA |
eientei95: FYI, there are 7 sites in your list which aren't in mine so far. |
13:10
🔗
|
eientei95 |
Huh |
13:10
🔗
|
eientei95 |
Cool |
13:11
🔗
|
eientei95 |
Mine is the result of scraping Google and DDG and a bit of Bing |
13:12
🔗
|
eientei95 |
Anyway, it's 11 past 2 here in Hobbitland |
13:12
🔗
|
JAA |
11 past 2 here as well in Switzerland. :-) |
13:13
🔗
|
kiska |
Now that another web space hosting service has gone down, I'll start looking at every ISP service that has web spaces. That will be in the morning, though; it's 13 past midnight here |
13:13
🔗
|
eientei95 |
I found an aussie one while scraping bing iirc |
13:14
🔗
|
eientei95 |
http://www.pcug.org.au/~djordan/ that page 404s but the site exists |
13:15
🔗
|
JAA |
kiska: Yeah, that's a good idea. I fully expect more shutdowns in the near future. |
13:15
🔗
|
eientei95 |
http://members.pcug.org.au/~arhen/ > November 1995 |
13:17
🔗
|
eientei95 |
> Computing: with a modem and a no-name clone IBM PC as a member of the PC Users Group (ACT). |
13:17
🔗
|
eientei95 |
> Page last revised 28 September 1997 |
13:17
🔗
|
JAA |
We should also compile a good list of keywords for search engine scrapes. I have one which covers mostly genealogy, local businesses, and stuff like schools or churches, but it's definitely not a good list. |
13:17
🔗
|
JAA |
Scraping Bing is very effective but takes forever because you can only make one request every 2 seconds. |
13:18
🔗
|
JAA |
(Which is still better than any other search engine I've tried; those usually ban after a few requests.) |
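Editor's sketch of pacing requests to one every 2 seconds. The `Throttle` class is hypothetical and makes no network calls; the caller would invoke `wait()` before each request. The 2-second figure comes from the message above; everything else is an assumption for illustration.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the pacing can be tested
        self._sleep = sleep
        self._last = None      # time of the previous wait() return

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(2.0)
    for query in ["first query", "second query"]:
        throttle.wait()
        # A real scraper would issue its search request here.
        print("would request:", query)
```

At this pace, each 1,000 requests take a little over half an hour, which is consistent with a scrape that runs for hours.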
13:20
🔗
|
eientei95 |
https://www.cs.auckland.ac.nz/~pgut001/ |
13:22
🔗
|
eientei95 |
http://homepage.eircom.net/~seanjmurphy/ |
13:23
🔗
|
eientei95 |
https://www-cs-faculty.stanford.edu/~knuth/ |
13:23
🔗
|
eientei95 |
http://homepages.ecs.vuw.ac.nz/~yimei/ |
13:23
🔗
|
JAA |
Hmm, .ac.nz and .edu are universities & Co., right? Those might be less at risk than ISPs' offerings. |
13:24
🔗
|
eientei95 |
Yeah |
13:25
🔗
|
eientei95 |
http://pages.stern.nyu.edu/~adamodar/ https://www.csie.ntu.edu.tw/~cjlin/ http://www.robots.ox.ac.uk/~vgg/ http://www.cs.toronto.edu/~hinton/ More unis |
13:29
🔗
|
eientei95 |
http://www.tcm.phy.cam.ac.uk/~bdj10/ http://homepages.ecs.vuw.ac.nz/~mengjie/ https://people.math.ethz.ch/~embrecht/ http://www.wisdom.weizmann.ac.il/~zeitouni/ http://www.cs.huji.ac.il/~nati/ |
13:33
🔗
|
kiska |
Maybe if the jobs aren't urgent enough we can use warriors to get these sites? |
13:33
🔗
|
kiska |
Also I am going to sleep as well |
13:33
🔗
|
eientei95 |
Yeah |
13:57
🔗
|
JAA |
18.2k URLs in my list representing ~10.6k sites (including probably a whole lot of dead ones) |
15:12
🔗
|
|
sep332 has joined #webroasting |
16:07
🔗
|
JAA |
I threw everything discovered so far into ArchiveBot. 22073 URLs from 10663 sites, split into four jobs (one per domain). |
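Editor's sketch of splitting a URL list into one job per domain and counting distinct sites. It assumes a "site" is identified by hostname plus the first path segment (for example `/~user/`); that definition, the function names, and the `~alice`/`~bob` paths are hypothetical, and the log does not say how the real counts were produced.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_domain(urls):
    """Group URLs by hostname, giving one URL list (job) per domain."""
    jobs = defaultdict(list)
    for url in urls:
        jobs[urlsplit(url).hostname].append(url)
    return dict(jobs)


def count_sites(urls):
    """Count distinct (hostname, first path segment) pairs.

    Assumption: each user's site lives under its own top-level directory.
    """
    sites = set()
    for url in urls:
        parts = urlsplit(url)
        first_segment = parts.path.lstrip("/").split("/", 1)[0]
        sites.add((parts.hostname, first_segment))
    return len(sites)


if __name__ == "__main__":
    urls = [
        "http://homepages.ihug.co.nz/~alice/",
        "http://homepages.ihug.co.nz/~alice/pics/",
        "http://homepages.paradise.net.nz/~bob/",
    ]
    for host, job in split_by_domain(urls).items():
        print(host, len(job))
    print("sites:", count_sites(urls))
```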
16:07
🔗
|
JAA |
Bing scrape is still running, and I'll add another job for the remaining stuff when it finishes (another hour or so). |
18:01
🔗
|
JAA |
Scrape finished, job queued. |
18:01
🔗
|
JAA |
There's definitely some duplication with the other jobs, but it's too much work to separate that properly. |
21:59
🔗
|
JAA |
The logs of all these Vodafone jobs should be a good source for additional ISP hosting services, by the way. |
23:27
🔗
|
|
mr_archiv has quit IRC (Read error: Operation timed out) |
23:28
🔗
|
|
mr_archiv has joined #webroasting |