#archiveteam 2014-10-30,Thu

Time Nickname Message
00:04 🔗 db48x parsons: how is Meetup Anywhere different from the rest of Meetup?
00:10 🔗 db48x I should upgrade my cpu so that I could actually play these arcade games
01:02 🔗 DFJustin we're still working on making it faster
01:11 🔗 aaaaaaaaa I wonder if there shouldn't be some sort of benchmark and you can compare your benchmark score to a recommendation in the page. That way as speed increases come from either better software or more powerful hardware, people can more easily determine what they can run.
01:12 🔗 DFJustin I'd love that, go write it
01:37 🔗 db48x DFJustin: and the browsers are getting faster too. It'll get there eventually, but in the meantime it would be a nice excuse to upgrade
01:38 🔗 db48x aaaaaaaaa: I wonder if instrumenting MAME would be enough
01:39 🔗 db48x build a version of MAME that reports some metrics (instructions per second or something) about the simulation, then build it for each platform
01:39 🔗 db48x then compare that with the original hardware specs
03:25 🔗 joepie91 aaaaaaaaa: isn't that basically what Microsoft's performance index set out (and failed) to do?
03:36 🔗 aaaaaaaaa I suppose, but with the performance index, there are all sorts of measures and different software is limited by different things. I think (but could be wrong) that jsmess is dependent on cpu.
03:39 🔗 joepie91 aaaaaaaaa: yes, and that's exactly the reason it failed :P
05:28 🔗 DFJustin yes cpu
14:36 🔗 parsons db48x: Meetup Everywhere is an experimental platform -- more top-down than bottom up. A single group could create a community (Reddit, Coursera, etc) and people could sign on to create local "chapters"
14:37 🔗 parsons It was somewhat successful but on a small scale. We're trying to get successful Meetup Everywhere groups to move over to the main platform before the shutdown
17:59 🔗 SadDM SketchCow: http://linux.slashdot.org/story/14/10/30/1614249/slashdot-asks-appropriate-place-for-free--open-source-software-artifacts
19:14 🔗 schbirid i wonder how big gfycat is and if it might be incredibly journeyed some day
20:06 🔗 ionpulse Anyone working or have worked on yahoo dir? getting halted at sub folders, like trying to start off at games or science instead of just dir.yahoo.com
20:07 🔗 ionpulse using wget
20:13 🔗 ionpulse recursive retrieval is simply not working
20:15 🔗 schbirid what happens?
20:16 🔗 ionpulse it's not following any links
20:16 🔗 ionpulse i have tried everything
20:16 🔗 arkiver ionpulse what link are you starting from?
20:16 🔗 arkiver what are you trying to download?
20:16 🔗 ionpulse https://dir.yahoo.com/recreation/games/video_games/
20:16 🔗 schbirid what's your commandline?
20:17 🔗 arkiver might be that the href is not supported
20:17 🔗 schbirid the urls on the page have Uppercase letters
20:17 🔗 arkiver the way it is written
20:17 🔗 schbirid while the starting url you posted does not
20:18 🔗 schbirid erm
20:18 🔗 schbirid lol
20:18 🔗 schbirid i should sleep
20:18 🔗 ionpulse yes I noticed the case issue
20:19 🔗 ionpulse gonna try ignore-case quick
20:20 🔗 schbirid you guys getting certificate issues too?
20:22 🔗 ionpulse ok i got it
20:22 🔗 ionpulse you do have to set --ignore-case with wget
20:22 🔗 schbirid yay
20:23 🔗 ionpulse there are a few more critical things as well, like --no-check-certificate, and a regex reject to kill the alphabetical option
20:23 🔗 ionpulse if you don't block that you will get everything twice
20:23 🔗 ionpulse and given the narrow time window to get this stuff, i am blocking the alpha sort option
20:24 🔗 schbirid oh is it shutting down?
20:24 🔗 ionpulse yea, thought tomorrow was shutoff
20:24 🔗 aaaaaaaaa Yahoo directory is shutting down December 31st, IIRC
20:24 🔗 ionpulse ah
20:25 🔗 ionpulse i thought it was October 31st
20:26 🔗 aaaaaaaaa http://www.theverge.com/2014/9/27/6854139/yahoo-directory-once-the-center-of-a-web-empire-will-shut-down
20:27 🔗 ionpulse ok nice
20:30 🔗 aaaaaaaaa you are probably thinking of qwiki, which is done on November 1st.
20:41 🔗 ionpulse i will have an adjusted cmdline string here in a sec for wget, so you guys can get a jump on this easier if you haven't already
20:42 🔗 ionpulse i have to merge in some stuff from my wgetrc so it's a standalone working execution
20:51 🔗 schbirid nice "ERROR 999: Unable to process request at this time -- error 999."
20:52 🔗 ionpulse Okay, here is a working wget commandline for Yahoo Directory:
20:53 🔗 ionpulse wget -rkEpH -l inf -np --random-wait -w 0.5 --restrict-file-names=windows --trust-server-names=on -Ddir.yahoo.com,yahooapis.com,yimg.com -Pydir_games --no-check-certificate --secure-protocol=auto --user-agent="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" --referer="https://dir.yahoo.com/recreation/games/" --reject-regex '(.*)(\?o\=a)(.*)' --ignore-case -e robots=off https://dir.yahoo.com/recreation/games/video_games/
20:53 🔗 ionpulse sorry if this is a bit messy, as I had to quickly weave in stuff I usually have in an rc file
20:54 🔗 ionpulse and there is some armor in here that "may" not be necessary, like ignore robots
20:54 🔗 ionpulse but I just want the process to run smoothly without snags
20:54 🔗 ionpulse like referrer may not even be needed but I added it anyway
20:55 🔗 ionpulse That reject regex is important though if you don't want to grab double everything.
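Editor's sketch of what the reject regex in ionpulse's command is doing: it drops any URL containing the literal `?o=a`, which the log implies is the alphabetical-sort variant of each listing. The function name `would_reject` and the sample URLs are hypothetical, and the pattern is a simplified equivalent of the one in the command, not a copy of wget's own matcher.

```shell
#!/bin/sh
# Illustrative only: classify URLs the way the --reject-regex above is meant to.
# The sample URLs are hypothetical, patterned on the directory's layout.
would_reject() {
    # Same idea as '(.*)(\?o\=a)(.*)': any URL containing the literal "?o=a".
    if printf '%s\n' "$1" | grep -Eq '\?o=a'; then
        echo "reject"
    else
        echo "keep"
    fi
}

would_reject "https://dir.yahoo.com/recreation/games/video_games/"
would_reject "https://dir.yahoo.com/recreation/games/video_games/?o=a"
```

Running candidate URLs through a check like this before a long crawl is a cheap way to confirm the filter skips only the duplicate alphabetical listings.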
20:56 🔗 ionpulse Right now I am just grabbing computer/game related, then art/science in a non-alphabetical grab. Then, depending on how long those processes run for, could do a complete grab.
20:56 🔗 ionpulse However it makes sense to do it based on category and multi-thread it.
20:57 🔗 ionpulse AWS EC2's might come in handy
21:07 🔗 schbirid ionpulse: dont forget warc!
21:07 🔗 schbirid --warc-file="dir.yahoo.com_$(date +%Y%m%d)" --warc-cdx
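Editor's sketch of how schbirid's WARC options could be attached to a recursive wget run. It only prints the command rather than running it; the helper name `warc_name` is hypothetical, and the reduced flag set is an assumption chosen for brevity, not the full command from the log.

```shell
#!/bin/sh
# Sketch: print (not run) a wget command with the WARC options added.
# Remove the leading "echo" to actually start the crawl.
warc_name() {
    # Date-stamped base name in the form dir.yahoo.com_YYYYMMDD
    echo "dir.yahoo.com_$(date +%Y%m%d)"
}

echo wget -r -l inf -np --ignore-case \
    --warc-file="$(warc_name)" --warc-cdx \
    "https://dir.yahoo.com/recreation/games/video_games/"
```

With `--warc-file`, wget writes the WARC alongside the normal mirrored files, so the regular output is still available for other post-processing.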
21:11 🔗 ionpulse I am archiving stuff different than you guys. I have different types of projects going on.
21:11 🔗 ionpulse So I don't tag with warc
21:12 🔗 ionpulse Makes it especially hard on websites because only a percentage of the site ends up being tagged. As I have complex post process routines that stitch in more data than would otherwise be archived by a set it and forget it wget/httrack run.
21:13 🔗 ionpulse But yes, traditionally, if someone were to use warc, that would be added (as many Archive Team projects do)
21:14 🔗 schbirid making wget build a warc only costs space (tiny, so it's just ~1/3 more) and you can just shove them into IA. would be nice
21:14 🔗 ionpulse I am grabbing data out of Yahoo Web Directory to parse it for working links to archive on the web, and then to extract dead sites out of IA.
21:15 🔗 ionpulse So Yahoo Dir is a means to an end for some of my other projects.
21:18 🔗 ionpulse the video games run finished already
21:19 🔗 ionpulse it's only 583 files
21:24 🔗 ionpulse yea... whats up with that ERROR 999
21:24 🔗 ionpulse wtf
21:25 🔗 ionpulse geocities was like this
21:27 🔗 wp494 behold: {{specialcase}} can now be used for sites with a special case such as twitpic/4chan
21:27 🔗 wp494 I was going to make a "hybrid" one for sites like 4chan that actively purge data, but decided against it
21:27 🔗 wp494 if there's support I can get one rolling
22:14 🔗 ionpulse so i dropped the user-agent for googlebot and Yahoo Directory started working again
22:14 🔗 ionpulse the ERROR 999 went away
22:14 🔗 ionpulse So we have to find the right combination of user-agent and wait time most likely
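Editor's sketch of one way to organise that search: enumerate user-agent and wait-time combinations and print the matching wget options to try one at a time. The function name `list_combos` and every candidate value are assumptions for illustration, not settings known to avoid ERROR 999.

```shell
#!/bin/sh
# Sketch: list user-agent / wait-time combinations to try by hand.
# An empty user agent means "leave wget's default user agent in place".
list_combos() {
    for ua in "" "Mozilla/5.0"; do
        for wait in 0.5 1 2; do
            if [ -n "$ua" ]; then
                echo "wget --user-agent=\"$ua\" -w $wait --random-wait"
            else
                echo "wget -w $wait --random-wait"
            fi
        done
    done
}

list_combos
```

Each printed line is only the option prefix; the rest of the command and the start URL would be appended before running it.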
