#webroasting 2018-11-29,Thu


Time Nickname Message
00:03 🔗 JAA has quit IRC (Ping timeout: 246 seconds)
01:02 🔗 JAA has joined #webroasting
01:03 🔗 bakJAA sets mode: +o JAA
05:18 🔗 ripdog has joined #webroasting
05:19 🔗 ripdog Vodafone NZ is shutting down their ihug (old acquired company) hosting service... tomorrow. Example url: http://homepages.ihug.co.nz/~(user)
05:19 🔗 ripdog I guess it should all be in the internet archive, so probably not a disaster
05:25 🔗 ripdog context: https://www.geekzone.co.nz/forums.asp?forumid=40&topicid=243123
05:26 🔗 ripdog has quit IRC (Quit: Page closed)
06:18 🔗 hook54321 has joined #webroasting
09:59 🔗 mr_archiv has quit IRC (Read error: Operation timed out)
10:00 🔗 chfoo has quit IRC (Read error: Operation timed out)
10:00 🔗 chfoo has joined #webroasting
10:18 🔗 mr_archiv has joined #webroasting
12:02 🔗 JAA Three days lead time. Fantastic...
12:21 🔗 JAA Extracting links from the WBM and Bing now.
12:45 🔗 eientei95 has joined #webroasting
12:46 🔗 eientei95 Bing seems to have a lot more on ihug than Google or DDG
12:46 🔗 JAA For log completeness: 2018-11-29 12:42:20 < eientei95> JAA: https://transfer.sh/KGXuC/hp_ihug here's a bunch of the ihug sites
12:47 🔗 JAA I'm reusing my TalkTalk scripts. So far, I have a list of ~17.5k URLs.
12:47 🔗 eientei95 `site:http://homepages.ihug.co.nz/` gives 7,230 results in Bing
12:47 🔗 eientei95 Oh wow
12:47 🔗 JAA I'm scraping the Wayback Machine and Bing.
12:47 🔗 eientei95 Better than me copy-pasting it into a text doc and regexing out the URLs
12:48 🔗 JAA And those 17.5k URLs are a bit inflated since they also include every directory level.
12:48 🔗 eientei95 Ah
12:48 🔗 JAA Well, that's pretty much what I'm doing anyway, but simply in an automated way. :-P
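Editorial sketch, not JAA's actual script: one way a URL list ends up including "every directory level" is to add each parent directory of every discovered URL. The function name `directory_levels` and the example user `~example` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_levels(url):
    """Return every parent directory of `url`, followed by `url` itself."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    levels = []
    # Build /a/, /a/b/, ... for every ancestor directory of the final segment.
    for i in range(1, len(segments)):
        path = "/" + "/".join(segments[:i]) + "/"
        levels.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    levels.append(url)
    return levels


if __name__ == "__main__":
    # Hypothetical example URL on the host mentioned in the log.
    for level in directory_levels("http://homepages.ihug.co.nz/~example/photos/index.html"):
        print(level)
```

This is why a raw count such as 17.5k overstates the number of distinct pages: a single deep URL contributes several entries.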
12:50 🔗 JAA The WBM is returning lots of weird stuff as usual, e.g. http://homepages.paradise.net.nz/~rhector%0a/
12:50 🔗 eientei95 :P
12:50 🔗 eientei95 Good ol' unstriped newline chars
12:50 🔗 eientei95 *pp
12:51 🔗 JAA http://homepages.paradise.net.nz/rossstan'target='_blank/
12:51 🔗 JAA http://homepages.paradise.net.nz/riffraff4robots.txt/
12:51 🔗 eientei95 Someone didn't check their links properly
12:53 🔗 JAA Aand the list shrunk by almost 3k after filtering out the trailing newlines.
12:54 🔗 JAA Oh, I like this one: http://homepages.ihug.co.nz/~%0a%0a/
12:55 🔗 eientei95 lol
13:00 🔗 JAA Turns out there's a good number of pages which are only listed by the WBM with the trailing newline. Ok, strip & deduplicate it is.
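Editorial sketch of the "strip & deduplicate" step, assuming the debris is percent-encoded CR/LF (`%0a`, `%0d`) at the end of the URL, optionally followed by a slash, as in the examples quoted above. The names `clean_url` and `strip_and_dedupe` are hypothetical; this is not the script used in the channel.

```python
import re

# Percent-encoded CR/LF runs that sit at the very end of the URL,
# or directly before a final trailing slash.
_ENCODED_NEWLINES = re.compile(r"(?:%0[ad])+(?=/?$)", re.IGNORECASE)


def clean_url(url):
    """Remove trailing encoded newlines, keeping any final slash."""
    return _ENCODED_NEWLINES.sub("", url.strip())


def strip_and_dedupe(urls):
    """Clean every URL and drop duplicates, preserving first-seen order."""
    cleaned = (clean_url(u) for u in urls if u.strip())
    return list(dict.fromkeys(cleaned))


if __name__ == "__main__":
    sample = [
        "http://homepages.paradise.net.nz/~rhector%0a/",
        "http://homepages.paradise.net.nz/~rhector/",
    ]
    print(strip_and_dedupe(sample))
```

Stripping instead of filtering keeps pages that the WBM lists only in their newline-suffixed form, which is the point made in the message above.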
13:01 🔗 Sanqui has joined #webroasting
13:02 🔗 JAA 15840 currently
13:02 🔗 JAA Bing scrape is still running though.
13:02 🔗 JAA I'll throw this onto trent-nz, by the way.
13:10 🔗 JAA eientei95: FYI, there are 7 sites in your list which aren't in mine so far.
13:10 🔗 eientei95 Huh
13:10 🔗 eientei95 Cool
13:11 🔗 eientei95 Mine is the result of scraping Google and DDG and a bit of Bing
13:12 🔗 eientei95 Anyway, it's 11 past 2 here in Hobbitland
13:12 🔗 JAA 11 past 2 here as well in Switzerland. :-)
13:13 🔗 kiska Now that another web space hosting service has gone down, I'll start looking at every ISP service that has web spaces in the morning; it's 13 past midnight here
13:13 🔗 eientei95 I found an aussie one while scraping bing iirc
13:14 🔗 eientei95 http://www.pcug.org.au/~djordan/ that page 404s but the site exists
13:15 🔗 JAA kiska: Yeah, that's a good idea. I fully expect more shutdowns in the near future.
13:15 🔗 eientei95 http://members.pcug.org.au/~arhen/ > November 1995
13:17 🔗 eientei95 > Computing: with a modem and a no-name clone IBM PC as a member of the PC Users Group (ACT).
13:17 🔗 eientei95 > Page last revised 28 September 1997
13:17 🔗 JAA We should also compile a good list of keywords for search engine scrapes. I have one which covers mostly genealogy, local businesses, and stuff like schools or churches, but it's definitely not a good list.
13:17 🔗 JAA Scraping Bing is very effective but takes forever because you can only make one request every 2 seconds.
13:18 🔗 JAA (Which is still better than any other search engine I've tried; those usually ban after a few requests.)
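Editorial sketch of pacing requests to one every 2 seconds. The `Throttle` class is hypothetical and deliberately knows nothing about any search engine; the caller would invoke `wait()` before each request. The clock and sleep functions are injectable so the pacing can be checked without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(2.0)  # 2 seconds, the figure mentioned in the log
    for page in range(3):
        throttle.wait()
        print("would request result page", page)
```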
13:20 🔗 eientei95 https://www.cs.auckland.ac.nz/~pgut001/
13:22 🔗 eientei95 http://homepage.eircom.net/~seanjmurphy/
13:23 🔗 eientei95 https://www-cs-faculty.stanford.edu/~knuth/
13:23 🔗 eientei95 http://homepages.ecs.vuw.ac.nz/~yimei/
13:23 🔗 JAA Hmm, .ac.nz and .edu are universities & Co., right? Those might be less at risk than ISPs' offerings.
13:24 🔗 eientei95 Yeah
13:25 🔗 eientei95 http://pages.stern.nyu.edu/~adamodar/ https://www.csie.ntu.edu.tw/~cjlin/ http://www.robots.ox.ac.uk/~vgg/ http://www.cs.toronto.edu/~hinton/ More unis
13:29 🔗 eientei95 http://www.tcm.phy.cam.ac.uk/~bdj10/ http://homepages.ecs.vuw.ac.nz/~mengjie/ https://people.math.ethz.ch/~embrecht/ http://www.wisdom.weizmann.ac.il/~zeitouni/ http://www.cs.huji.ac.il/~nati/
13:33 🔗 kiska Maybe if the jobs aren't urgent enough we can use warriors to get these sites?
13:33 🔗 kiska Also I am going to sleep as well
13:33 🔗 eientei95 Yeah
13:57 🔗 JAA 18.2k URLs in my list representing ~10.6k sites (including probably a whole lot of dead ones)
15:12 🔗 sep332 has joined #webroasting
16:07 🔗 JAA I threw everything discovered so far into ArchiveBot. 22073 URLs from 10663 sites, split into four jobs (one per domain).
16:07 🔗 JAA Bing scrape is still running, and I'll add another job for the remaining stuff when it finishes (another hour or so).
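Editorial sketch of splitting a URL list into one batch per domain, as described for the four jobs above. The function name `group_by_host` and the sample user names are hypothetical, and how the batches are submitted to ArchiveBot is not shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each hostname to the list of URLs found on it, in input order."""
    jobs = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        jobs[host].append(url)
    return dict(jobs)


if __name__ == "__main__":
    # Hosts taken from the log; the user names are made up for illustration.
    sample = [
        "http://homepages.ihug.co.nz/~alice/",
        "http://homepages.paradise.net.nz/~bob/",
        "http://homepages.ihug.co.nz/~carol/",
    ]
    for host, batch in group_by_host(sample).items():
        print(host, len(batch))
```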
18:01 🔗 JAA Scrape finished, job queued.
18:01 🔗 JAA There's definitely some duplication with the other jobs, but it's too much work to separate that properly.
21:59 🔗 JAA The logs of all these Vodafone jobs should be a good source for additional ISP hosting services, by the way.
23:27 🔗 mr_archiv has quit IRC (Read error: Operation timed out)
23:28 🔗 mr_archiv has joined #webroasting
