#webroasting 2018-11-29,Thu

Time	Nickname	Message
00:03 ^🔗		JAA has quit IRC (Ping timeout: 246 seconds)
01:02 ^🔗		JAA has joined #webroasting
01:03 ^🔗		bakJAA sets mode: +o JAA
05:18 ^🔗		ripdog has joined #webroasting
05:19 ^🔗	ripdog	Vodafone NZ is shutting down their ihug (old acquired company) hosting service... tomorrow. Example url: http://homepages.ihug.co.nz/~(user)
05:19 ^🔗	ripdog	I guess it should all be in the internet archive, so probably not a disaster
05:25 ^🔗	ripdog	context: https://www.geekzone.co.nz/forums.asp?forumid=40&topicid=243123
05:26 ^🔗		ripdog has quit IRC (Quit: Page closed)
06:18 ^🔗		hook54321 has joined #webroasting
09:59 ^🔗		mr_archiv has quit IRC (Read error: Operation timed out)
10:00 ^🔗		chfoo has quit IRC (Read error: Operation timed out)
10:00 ^🔗		chfoo has joined #webroasting
10:18 ^🔗		mr_archiv has joined #webroasting
12:02 ^🔗	JAA	Three days lead time. Fantastic...
12:21 ^🔗	JAA	Extracting links from the WBM and Bing now.
12:45 ^🔗		eientei95 has joined #webroasting
12:46 ^🔗	eientei95	Bing seems to have a lot more on ihug than Google or DDG
12:46 ^🔗	JAA	For log completeness: 2018-11-29 12:42:20 < eientei95> JAA: https://transfer.sh/KGXuC/hp_ihug here's a bunch of the ihug sites
12:47 ^🔗	JAA	I'm reusing my TalkTalk scripts. So far, I have a list of ~17.5k URLs.
12:47 ^🔗	eientei95	`site:http://homepages.ihug.co.nz/` gives 7,230 results in Bing
12:47 ^🔗	eientei95	Oh wow
12:47 ^🔗	JAA	I'm scraping the Wayback Machine and Bing.
12:47 ^🔗	eientei95	Better than me copy-pasting it into a text doc and regexing out the URLs
12:48 ^🔗	JAA	And those 17.5k URLs are a bit inflated since they also include every directory level.
12:48 ^🔗	eientei95	Ah
12:48 ^🔗	JAA	Well, that's pretty much what I'm doing anyway, but simply in an automated way. :-P
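
Note on the 12:48 remark that the list also includes every directory level: a minimal sketch of how such an expansion might work. This is not JAA's TalkTalk script; the function name `directory_levels` and the `~user` example URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_levels(url):
    """Return the URL plus every parent directory up to the host root.

    Hypothetical helper: query strings and fragments are dropped.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    levels = []
    for i in range(len(segments) + 1):
        path = "/" + "/".join(segments[:i])
        # Parent directories always get a trailing slash; the final level
        # keeps one only if the original URL had it.
        if i and (i < len(segments) or parts.path.endswith("/")):
            path += "/"
        levels.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return levels


if __name__ == "__main__":
    for level in directory_levels("http://homepages.ihug.co.nz/~user/a/b.html"):
        print(level)
```

Expanding this way is why a raw URL count overstates the number of distinct pages: one deep link contributes one entry per path component.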
12:50 ^🔗	JAA	The WBM is returning lots of weird stuff as usual, e.g. http://homepages.paradise.net.nz/~rhector%0a/
12:50 ^🔗	eientei95	:P
12:50 ^🔗	eientei95	Good ol' unstriped newline chars
12:50 ^🔗	eientei95	*pp
12:51 ^🔗	JAA	http://homepages.paradise.net.nz/rossstan'target='_blank/
12:51 ^🔗	JAA	http://homepages.paradise.net.nz/riffraff4robots.txt/
12:51 ^🔗	eientei95	Someone didn't check their links properly
12:53 ^🔗	JAA	Aand the list shrunk by almost 3k after filtering out the trailing newlines.
12:54 ^🔗	JAA	Oh, I like this one: http://homepages.ihug.co.nz/~%0a%0a/
12:55 ^🔗	eientei95	lol
13:00 ^🔗	JAA	Turns out there's a good number of pages which are only listed by the WBM with the trailing newline. Ok, strip & deduplicate it is.
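
Note on "strip & deduplicate": a rough sketch of what that step could look like, assuming the stray newlines show up as percent-encoded `%0a`/`%0d` (or literal whitespace) at the end of the path, optionally followed by a trailing slash, as in the examples quoted above. The name `clean_urls` is hypothetical and this is not the actual script used.

```python
import re

# Trailing percent-encoded CR/LF or literal whitespace, optionally followed
# by the slash that the directory-level expansion appended.
_TRAILING_NEWLINES = re.compile(r"(?:%0[aAdD]|\s)+(/?)$")


def clean_urls(urls):
    """Strip trailing newline debris, then deduplicate, keeping first-seen order."""
    stripped = (_TRAILING_NEWLINES.sub(r"\1", url) for url in urls)
    return list(dict.fromkeys(stripped))


if __name__ == "__main__":
    sample = [
        "http://homepages.paradise.net.nz/~rhector%0a/",
        "http://homepages.paradise.net.nz/~rhector/",
        "http://homepages.ihug.co.nz/~%0a%0a/",
    ]
    for url in clean_urls(sample):
        print(url)
```

Stripping before deduplicating keeps pages that the WBM lists only in the newline form, while collapsing them with their clean twins where both exist.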
13:01 ^🔗		Sanqui has joined #webroasting
13:02 ^🔗	JAA	15840 currently
13:02 ^🔗	JAA	Bing scrape is still running though.
13:02 ^🔗	JAA	I'll throw this onto trent-nz, by the way.
13:10 ^🔗	JAA	eientei95: FYI, there are 7 sites in your list which aren't in mine so far.
13:10 ^🔗	eientei95	Huh
13:10 ^🔗	eientei95	Cool
13:11 ^🔗	eientei95	Mine is the result of scraping Google and DDG and a bit of Bing
13:12 ^🔗	eientei95	Anyway, it's 11 past 2 here in Hobbitland
13:12 ^🔗	JAA	11 past 2 here as well in Switzerland. :-)
13:13 ^🔗	kiska	Now that another web space hosting service has gone down, I'll start looking at every ISP service that has web spaces in the morning; it's 13 past midnight here
13:13 ^🔗	eientei95	I found an aussie one while scraping bing iirc
13:14 ^🔗	eientei95	http://www.pcug.org.au/~djordan/ that page 404s but the site exists
13:15 ^🔗	JAA	kiska: Yeah, that's a good idea. I fully expect more shutdowns in the near future.
13:15 ^🔗	eientei95	http://members.pcug.org.au/~arhen/ > November 1995
13:17 ^🔗	eientei95	> Computing: with a modem and a no-name clone IBM PC as a member of the PC Users Group (ACT).
13:17 ^🔗	eientei95	> Page last revised 28 September 1997
13:17 ^🔗	JAA	We should also compile a good list of keywords for search engine scrapes. I have one which covers mostly genealogy, local businesses, and stuff like schools or churches, but it's definitely not a good list.
13:17 ^🔗	JAA	Scraping Bing is very effective but takes forever because you can only make one request every 2 seconds.
13:18 ^🔗	JAA	(Which is still better than any other search engine I've tried; those usually ban after a few requests.)
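
Note on the one-request-every-2-seconds limit: a small sketch of a client-side throttle that spaces calls by a fixed interval. The `Throttle` class is hypothetical, and the 2-second figure is simply the one mentioned in the log, not a documented Bing limit. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Too early: sleep for the remaining time in the window.
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval


if __name__ == "__main__":
    throttle = Throttle(2.0)
    for query in ["first query", "second query"]:  # placeholder queries
        throttle.wait()
        print("would send request for:", query)
```

Calling `throttle.wait()` before each request keeps the scraper under the limit without sleeping longer than necessary when the previous request was slow.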
13:20 ^🔗	eientei95	https://www.cs.auckland.ac.nz/~pgut001/
13:22 ^🔗	eientei95	http://homepage.eircom.net/~seanjmurphy/
13:23 ^🔗	eientei95	https://www-cs-faculty.stanford.edu/~knuth/
13:23 ^🔗	eientei95	http://homepages.ecs.vuw.ac.nz/~yimei/
13:23 ^🔗	JAA	Hmm, .ac.nz and .edu are universities & Co., right? Those might be less at risk than ISPs' offerings.
13:24 ^🔗	eientei95	Yeah
13:25 ^🔗	eientei95	http://pages.stern.nyu.edu/~adamodar/ https://www.csie.ntu.edu.tw/~cjlin/ http://www.robots.ox.ac.uk/~vgg/ http://www.cs.toronto.edu/~hinton/ More unis
13:29 ^🔗	eientei95	http://www.tcm.phy.cam.ac.uk/~bdj10/ http://homepages.ecs.vuw.ac.nz/~mengjie/ https://people.math.ethz.ch/~embrecht/ http://www.wisdom.weizmann.ac.il/~zeitouni/ http://www.cs.huji.ac.il/~nati/
13:33 ^🔗	kiska	Maybe if the jobs aren't urgent enough we can use warriors to get these sites?
13:33 ^🔗	kiska	Also I am going to sleep as well
13:33 ^🔗	eientei95	Yeah
13:57 ^🔗	JAA	18.2k URLs in my list representing ~10.6k sites (including probably a whole lot of dead ones)
15:12 ^🔗		sep332 has joined #webroasting
16:07 ^🔗	JAA	I threw everything discovered so far into ArchiveBot. 22073 URLs from 10663 sites, split into four jobs (one per domain).
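
Note on "22073 URLs from 10663 sites, split into four jobs (one per domain)": a sketch of how URLs might be reduced to sites and grouped per domain. It assumes a "site" is the host plus the first path segment (e.g. `~user`), which is a guess at the definition used here; the helper names and the `~alice`/`~bob` users are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Return (host, first path segment); the segment is '' for the bare root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    first = segments[0] if segments else ""
    return parts.netloc.lower(), first


def sites_by_domain(urls):
    """Group distinct sites under their host, one bucket per future job."""
    grouped = defaultdict(set)
    for url in urls:
        host, first = site_of(url)
        if first:  # skip the host root, which is not a user site
            grouped[host].add(first)
    return dict(grouped)


if __name__ == "__main__":
    urls = [
        "http://homepages.ihug.co.nz/~alice/",
        "http://homepages.ihug.co.nz/~alice/pics/a.jpg",
        "http://homepages.paradise.net.nz/~bob/",
    ]
    for host, sites in sites_by_domain(urls).items():
        print(host, len(sites))
```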
16:07 ^🔗	JAA	Bing scrape is still running, and I'll add another job for the remaining stuff when it finishes (another hour or so).
18:01 ^🔗	JAA	Scrape finished, job queued.
18:01 ^🔗	JAA	There's definitely some duplication with the other jobs, but it's too much work to separate that properly.
21:59 ^🔗	JAA	The logs of all these Vodafone jobs should be a good source for additional ISP hosting services, by the way.
23:27 ^🔗		mr_archiv has quit IRC (Read error: Operation timed out)
23:28 ^🔗		mr_archiv has joined #webroasting