#archiveteam 2014-10-30,Thu

Time	Nickname	Message
00:04 ^🔗	db48x	parsons: how is Meetup Everywhere different from the rest of Meetup?
00:10 ^🔗	db48x	I should upgrade my cpu so that I could actually play these arcade games
01:02 ^🔗	DFJustin	we're still working on making it faster
01:11 ^🔗	aaaaaaaaa	I wonder if there shouldn't be some sort of benchmark and you can compare your benchmark score to a recommendation in the page. That way as speed increases come from either better software or more powerful hardware, people can more easily determine what they can run.
01:12 ^🔗	DFJustin	I'd love that, go write it
01:37 ^🔗	db48x	DFJustin: and the browsers are getting faster too. It'll get there eventually, but in the meantime it would be a nice excuse to upgrade
01:38 ^🔗	db48x	aaaaaaaaa: I wonder if instrumenting MAME would be enough
01:39 ^🔗	db48x	build a version of MAME that reports some metrics (instructions per second or something) about the simulation, then build it for each platform
01:39 ^🔗	db48x	then compare that with the original hardware specs
03:25 ^🔗	joepie91	aaaaaaaaa: isn't that basically what Microsoft's performance index set out (and failed) to do?
03:36 ^🔗	aaaaaaaaa	I suppose, but with the performance index, there are all sorts of measures and different software is limited by different things. I think (but could be wrong) that jsmess is dependent on cpu.
03:39 ^🔗	joepie91	aaaaaaaaa: yes, and that's exactly the reason it failed :P
05:28 ^🔗	DFJustin	yes cpu
14:36 ^🔗	parsons	db48x: Meetup Everywhere is an experimental platform -- more top-down than bottom up. A single group could create a community (Reddit, Coursera, etc) and people could sign on to create local "chapters"
14:37 ^🔗	parsons	It was somewhat successful but on a small scale. We're trying to get successful Meetup Everywhere groups to move over to the main platform before the shutdown
17:59 ^🔗	SadDM	SketchCow: http://linux.slashdot.org/story/14/10/30/1614249/slashdot-asks-appropriate-place-for-free--open-source-software-artifacts
19:14 ^🔗	schbirid	i wonder how big gfycat is and if it might be incredibly journeyed some day
20:06 ^🔗	ionpulse	Anyone working on, or have worked on, yahoo dir? getting halted at subfolders, like trying to start off at games or science instead of just dir.yahoo.com
20:07 ^🔗	ionpulse	using wget
20:13 ^🔗	ionpulse	recursive retrieval is simply not working
20:15 ^🔗	schbirid	what happens?
20:16 ^🔗	ionpulse	it's not following any links
20:16 ^🔗	ionpulse	i have tried everything
20:16 ^🔗	arkiver	ionpulse what link are you starting from?
20:16 ^🔗	arkiver	what are you trying to download?
20:16 ^🔗	ionpulse	https://dir.yahoo.com/recreation/games/video_games/
20:16 ^🔗	schbirid	what's your commandline?
20:17 ^🔗	arkiver	might be that the href is not supported
20:17 ^🔗	schbirid	the urls on the page have Uppercase letters
20:17 ^🔗	arkiver	the way it is written
20:17 ^🔗	schbirid	while the starting url you posted does not
20:18 ^🔗	schbirid	erm
20:18 ^🔗	schbirid	lol
20:18 ^🔗	schbirid	i should sleep
20:18 ^🔗	ionpulse	yes I noticed the case issue
20:19 ^🔗	ionpulse	gonna try ignore-case quick
20:20 ^🔗	schbirid	you guys getting certificate issues too?
20:22 ^🔗	ionpulse	ok i got it
20:22 ^🔗	ionpulse	you do have to set --ignore-case with wget
20:22 ^🔗	schbirid	yay
20:23 ^🔗	ionpulse	there are a few more critical things as well, like --no-check-certificate, and a regex reject to kill the alphabetical option
20:23 ^🔗	ionpulse	if you don't block that you will get everything twice
20:23 ^🔗	ionpulse	and given the narrow time window to get this stuff, i am blocking the alpha sort option
20:24 ^🔗	schbirid	oh is it shutting down?
20:24 ^🔗	ionpulse	yea, thought tomorrow was shutoff
20:24 ^🔗	aaaaaaaaa	Yahoo directory is shutting down December 31st, IIRC
20:24 ^🔗	ionpulse	ah
20:25 ^🔗	ionpulse	i thought it was October 31st
20:26 ^🔗	aaaaaaaaa	http://www.theverge.com/2014/9/27/6854139/yahoo-directory-once-the-center-of-a-web-empire-will-shut-down
20:27 ^🔗	ionpulse	ok nice
20:30 ^🔗	aaaaaaaaa	you are probably thinking of qwiki, which is done on November 1st.
20:41 ^🔗	ionpulse	i will have an adjusted cmdline string here in a sec for wget, so you guys can get a jump on this easier if you haven't already
20:42 ^🔗	ionpulse	i have to merge in some stuff from my wgetrc so its a standalone working execution
20:51 ^🔗	schbirid	nice "ERROR 999: Unable to process request at this time -- error 999."
20:52 ^🔗	ionpulse	Okay, here is a working wget commandline for Yahoo Directory:
20:53 ^🔗	ionpulse	wget -rkEpH -l inf -np --random-wait -w 0.5 --restrict-file-names=windows --trust-server-names=on -Ddir.yahoo.com,yahooapis.com,yimg.com -Pydir_games --no-check-certificate --secure-protocol=auto --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --referer="https://dir.yahoo.com/recreation/games/" --reject-regex '(.)(\?o\=a)(.)' --ignore-case -e robots=off https://dir.yahoo.com/recreation/games/video_games/
20:53 ^🔗	ionpulse	sorry if this is a bit messy, as I had to quickly weave in stuff I usually have in an rc file
20:54 ^🔗	ionpulse	and there is some armor in here that "may" not be necessary, like ignore robots
20:54 ^🔗	ionpulse	but I just want the process to run smoothly without snags
20:54 ^🔗	ionpulse	like referrer may not even be needed but I added it anyway
20:55 ^🔗	ionpulse	That reject regex is important though if you don't want to grab double everything.
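A rough way to see what that reject pattern does is to run a few URLs through it with grep -E. This is only a sketch: the sample URLs are hypothetical (the log does not show the real sort links), classify_url is a made-up helper name, and the backslash before = is dropped because it is not needed.

```shell
# Sketch: check which URLs the reject pattern would drop.
# The pattern is the one from the wget command, minus the backslash before "=".
reject_re='(.)(\?o=a)(.)'

# Prints "reject" if the URL matches the pattern, "keep" otherwise.
classify_url() {
  if printf '%s\n' "$1" | grep -Eq "$reject_re"; then
    echo "reject"
  else
    echo "keep"
  fi
}

# Hypothetical sample URLs: a plain listing and an alphabetical-sort variant.
classify_url "https://dir.yahoo.com/recreation/games/video_games/"
classify_url "https://dir.yahoo.com/recreation/games/video_games/?o=a&page=2"
```

In this sketch the trailing (.) means the pattern needs at least one character after ?o=a, so a URL that ends exactly in ?o=a would be kept. Whether that matters depends on the real link format.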
20:56 ^🔗	ionpulse	Right now I am just grabbing computer/game related, then art/science in a non-alphabetical grab. Then, depending on how long those processes run for, could do a complete grab.
20:56 ^🔗	ionpulse	However it makes sense to do it based on category and multi-thread it.
20:57 ^🔗	ionpulse	AWS EC2's might come in handy
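The per-category idea could look roughly like the following, assuming "multi-thread" here means several wget processes running side by side. start_job and DRY_RUN are made-up names, the category paths other than recreation/games are placeholders, and the flags are cut down to the minimum; a real run would presumably carry the full flag set from the command posted at 20:53.

```shell
# Sketch: one wget job per category, started in the background.
# DRY_RUN defaults to 1, so this only prints the commands it would run.

start_job() {
  category=$1                                            # e.g. recreation/games
  prefix="ydir_$(printf '%s' "$category" | tr '/' '_')"  # output dir per category
  url="https://dir.yahoo.com/$category/"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "wget -r -np -P $prefix $url"
  else
    wget -r -np -P "$prefix" "$url"
  fi
}

# Placeholder category list; each job runs in its own background subshell.
for c in recreation/games science arts; do
  start_job "$c" &
done
wait   # block until every background job has finished
```

Giving each category its own output directory keeps the parallel jobs from writing into the same tree.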
21:07 ^🔗	schbirid	ionpulse: dont forget warc!
21:07 ^🔗	schbirid	--warc-file="dir.yahoo.com_$(date +%Y%m%d)" --warc-cdx
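For anyone unsure what that --warc-file value turns into, here is a small sketch of the expansion. warc_name is a made-up helper; it only builds the name and does not call wget.

```shell
# Sketch: build the date-stamped base name used for --warc-file.
warc_name() {
  # $1 is the site label; date +%Y%m%d yields a YYYYMMDD stamp
  printf '%s_%s\n' "$1" "$(date +%Y%m%d)"
}

warc_name "dir.yahoo.com"
```

wget adds the .warc.gz extension to that base name itself, and --warc-cdx writes a matching .cdx index file next to it.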
21:11 ^🔗	ionpulse	I am archiving stuff differently than you guys. I have different types of projects going on.
21:11 ^🔗	ionpulse	So I don't tag with warc
21:12 ^🔗	ionpulse	Makes it especially hard on websites because only a percentage of the site ends up being tagged, as I have complex post-process routines that stitch in more data than would otherwise be archived by a set-it-and-forget-it wget/httrack run.
21:13 ^🔗	ionpulse	But yes, traditionally, if someone were to use warc, that would be added (as many Archive Team projects do)
21:14 ^🔗	schbirid	making wget build a warc only costs space (tiny, so it's just ~1/3 more) and you can just shove them into IA. would be nice
21:14 ^🔗	ionpulse	I am grabbing data out of Yahoo Web Directory to parse it for working links to archive on the web, and then to extract dead sites out of IA.
21:15 ^🔗	ionpulse	So Yahoo Dir is a means to an end for some of my other projects.
21:18 ^🔗	ionpulse	the video games run finished already
21:19 ^🔗	ionpulse	it's only 583 files
21:24 ^🔗	ionpulse	yea... what's up with that ERROR 999
21:24 ^🔗	ionpulse	wtf
21:25 ^🔗	ionpulse	geocities was like this
21:27 ^🔗	wp494	behold: {{specialcase}} can now be used for sites with a special case such as twitpic/4chan
21:27 ^🔗	wp494	I was going to make a "hybrid" one for sites like 4chan that actively purge data, but decided against it
21:27 ^🔗	wp494	if there's support I can get one rolling
22:14 ^🔗	ionpulse	so i dropped the user-agent for googlebot and Yahoo Directory started working again
22:14 ^🔗	ionpulse	the ERROR 999 went away
22:14 ^🔗	ionpulse	So we have to find the right combination of user-agent and wait time most likely