[00:03] *** JAA has quit IRC (Ping timeout: 246 seconds)
[01:02] *** JAA has joined #webroasting
[01:03] *** bakJAA sets mode: +o JAA
[05:18] *** ripdog has joined #webroasting
[05:19] Vodafone NZ is shutting down their ihug (old acquired company) hosting service... tomorrow. Example url: http://homepages.ihug.co.nz/~(user)
[05:19] I guess it should all be in the internet archive, so probably not a disaster
[05:25] context: https://www.geekzone.co.nz/forums.asp?forumid=40&topicid=243123
[05:26] *** ripdog has quit IRC (Quit: Page closed)
[06:18] *** hook54321 has joined #webroasting
[09:59] *** mr_archiv has quit IRC (Read error: Operation timed out)
[10:00] *** chfoo has quit IRC (Read error: Operation timed out)
[10:00] *** chfoo has joined #webroasting
[10:18] *** mr_archiv has joined #webroasting
[12:02] Three days lead time. Fantastic...
[12:21] Extracting links from the WBM and Bing now.
[12:45] *** eientei95 has joined #webroasting
[12:46] Bing seems to have a lot more on ihug than Google or DDG
[12:46] For log completeness: 2018-11-29 12:42:20 < eientei95> JAA: https://transfer.sh/KGXuC/hp_ihug here's a bunch of the ihug sites
[12:47] I'm reusing my TalkTalk scripts. So far, I have a list of ~17.5k URLs.
[12:47] `site:http://homepages.ihug.co.nz/` gives 7,230 results in Bing
[12:47] Oh wow
[12:47] I'm scraping the Wayback Machine and Bing.
[12:47] Better than me copy-pasting it into a text doc and regexing out the URLs
[12:48] And those 17.5k URLs are a bit inflated since they also include every directory level.
[12:48] Ah
[12:48] Well, that's pretty much what I'm doing anyway, but simply in an automated way. :-P
[12:50] The WBM is returning lots of weird stuff as usual, e.g. http://homepages.paradise.net.nz/~rhector%0a/
[12:50] :P
[12:50] Good ol' unstriped newline chars
[12:50] *pp
[12:51] http://homepages.paradise.net.nz/rossstan'target='_blank/
[12:51] http://homepages.paradise.net.nz/riffraff4robots.txt/
[12:51] Someone didn't check their links properly
[12:53] Aand the list shrunk by almost 3k after filtering out the trailing newlines.
[12:54] Oh, I like this one: http://homepages.ihug.co.nz/~%0a%0a/
[12:55] lol
[13:00] Turns out there's a good number of pages which are only listed by the WBM with the trailing newline. Ok, strip & deduplicate it is.
[13:01] *** Sanqui has joined #webroasting
[13:02] 15840 currently
[13:02] Bing scrape is still running though.
[13:02] I'll throw this onto trent-nz, by the way.
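[Editor's note: the workflow described above (pull URLs from the Wayback Machine, strip the stray "%0a" newlines, deduplicate) can be sketched roughly as below. This is an illustration only, not the actual TalkTalk scripts; the CDX parameters shown are standard, but the helper names and output file name are made up, and a real run would also need to handle CDX pagination.]

```python
# Illustration only -- not the actual TalkTalk scripts. Pulls every captured URL
# under homepages.ihug.co.nz from the Wayback Machine CDX API, strips the stray
# trailing newlines ("%0a") mentioned above, and deduplicates the result.
import requests

CDX = "https://web.archive.org/cdx/search/cdx"

def wbm_urls(prefix):
    """Yield the original URLs of all captures under the given URL prefix."""
    params = {
        "url": prefix,
        "matchType": "prefix",   # everything under this path
        "fl": "original",        # return only the original-URL column
        "collapse": "urlkey",    # one row per unique URL
    }
    resp = requests.get(CDX, params=params, timeout=300)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        if line:
            yield line

def normalise(url):
    # Strip percent-encoded newlines that leak into captured URLs,
    # e.g. http://homepages.ihug.co.nz/~%0a%0a/
    return url.replace("%0a", "").replace("%0A", "").strip()

urls = sorted({normalise(u) for u in wbm_urls("homepages.ihug.co.nz/")})
with open("ihug_wbm_urls.txt", "w") as f:   # output file name is an assumption
    f.write("\n".join(urls) + "\n")
print(len(urls), "unique URLs")
```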
[13:10] eientei95: FYI, there are 7 sites in your list which aren't in mine so far.
[13:10] Huh
[13:10] Cool
[13:11] Mine is the result of scraping Google and DDG and a bit of Bing
[13:12] Anyway, it's 11 past 2 here in Hobbitland
[13:12] 11 past 2 here as well in Switzerland. :-)
[13:13] Now that another web space hosting service has gone down, I'll start looking at every ISP service that has web spaces, in the morning, it's 13 past midnight here
[13:13] I found an aussie one while scraping bing iirc
[13:14] http://www.pcug.org.au/~djordan/ that page 404s but the site exists
[13:15] kiska: Yeah, that's a good idea. I fully expect more shutdowns in the near future.
[13:15] http://members.pcug.org.au/~arhen/ > November 1995
[13:17] > Computing: with a modem and a no-name clone IBM PC as a member of the PC Users Group (ACT).
[13:17] > Page last revised 28 September 1997
[13:17] We should also compile a good list of keywords for search engine scrapes. I have one which covers mostly genealogy, local businesses, and stuff like schools or churches, but it's definitely not a good list.
[13:17] Scraping Bing is very effective but takes forever because you can only make one request every 2 seconds.
[13:18] (Which is still better than any other search engine I've tried; those usually ban after a few requests.)
[13:20] https://www.cs.auckland.ac.nz/~pgut001/
[13:22] http://homepage.eircom.net/~seanjmurphy/
[13:23] https://www-cs-faculty.stanford.edu/~knuth/
[13:23] http://homepages.ecs.vuw.ac.nz/~yimei/
[13:23] Hmm, .ac.nz and .edu are universities & Co., right? Those might be less at risk than ISPs' offers.
[13:24] Yeah
[13:25] http://pages.stern.nyu.edu/~adamodar/ https://www.csie.ntu.edu.tw/~cjlin/ http://www.robots.ox.ac.uk/~vgg/ http://www.cs.toronto.edu/~hinton/ More unis
[13:29] http://www.tcm.phy.cam.ac.uk/~bdj10/ http://homepages.ecs.vuw.ac.nz/~mengjie/ https://people.math.ethz.ch/~embrecht/ http://www.wisdom.weizmann.ac.il/~zeitouni/ http://www.cs.huji.ac.il/~nati/
[13:33] Maybe if the jobs aren't urgent enough we can use warriors to get these sites?
[13:33] Also I am going to sleep as well
[13:33] Yeah
[13:57] 18.2k URLs in my list representing ~10.6k sites (including probably a whole lot of dead ones)
[15:12] *** sep332 has joined #webroasting
[16:07] I threw everything discovered so far into ArchiveBot. 22073 URLs from 10663 sites, split into four jobs (one per domain).
[16:07] Bing scrape is still running, and I'll add another job for the remaining stuff when it finishes (another hour or so).
[18:01] Scrape finished, job queued.
[18:01] There's definitely some duplication with the other jobs, but it's too much work to separate that properly.
[21:59] The logs of all these Vodafone jobs should be a good source for additional ISP hosting services, by the way.
[23:27] *** mr_archiv has quit IRC (Read error: Operation timed out)
[23:28] *** mr_archiv has joined #webroasting
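[Editor's note: the 21:59 idea of mining the Vodafone job logs for further ISP hosting services could be automated along these lines. This is purely a sketch: the input file name and the hostname/path patterns are assumptions, and the output is only a list of leads to check by hand.]

```python
# Illustration only: scan a URL list pulled from the Vodafone ArchiveBot jobs for
# hosts that look like personal-homepage hosting (tilde paths or "homepages"/
# "members"-style hostnames). Input file name and patterns are assumptions.
import re
from collections import Counter
from urllib.parse import urlsplit

TILDE_PATH = re.compile(r"^/~[^/]+")
HOSTING_HOST = re.compile(r"^(homepages?|members|users|pages)\.", re.IGNORECASE)

hosts = Counter()
with open("vodafone_job_urls.txt") as f:   # one discovered URL per line (assumed)
    for line in f:
        parts = urlsplit(line.strip())
        if not parts.hostname:
            continue
        if TILDE_PATH.match(parts.path) or HOSTING_HOST.match(parts.hostname):
            hosts[parts.hostname] += 1

# Most frequently seen candidate hosts first; these are leads, not confirmed services.
for host, count in hosts.most_common(50):
    print(f"{count:6d}  {host}")
```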