#archiveteam 2012-06-17,Sun


Time Nickname Message
01:24 🔗 DrainLib and hornet archive mirror completed. now the damn thing exists in more than 3 places on the internet
01:24 🔗 Famicoman nice, how'd ya get it?
01:24 🔗 DrainLib good ole ftp, nothing fancy
01:25 🔗 Famicoman is it like 3gb or something hilarious? :P
01:25 🔗 DrainLib 6gb
01:26 🔗 Famicoman hell, you could mirror that everywhere
01:26 🔗 DrainLib yeah i was like wait, wtf, this is only in 3 places?
01:26 🔗 DrainLib found 2-3 stray copies of it on what looked like test sites, like someone just said "hey i need 6gb of random data to test my http/ftp server"
01:27 🔗 Famicoman haha
01:28 🔗 DrainLib working on modarchive.org torrents too, at least those are on an official tracker. and don't look nearly as ethereal. but when the entire collection of something exists on one website with sub-30 torrent seeders, i worry
01:32 🔗 DrainLib after that, scene.org as a whole. don't think their ftp admins are real happy with me though, seems i got blocked with around 300 files of the hornet remaining, had to hit up a mirror
01:33 🔗 Famicoman I've actually been downloading a bit of scene.org
01:33 🔗 Famicoman it's kinda fun
01:33 🔗 DrainLib skip hornet, grab the whole damn thing :)
01:33 🔗 Famicoman lol, sorta working on it
01:33 🔗 Famicoman I think I got about 25gb of it today
01:34 🔗 DrainLib alright i'll leave that one up to you then. go forth.
01:34 🔗 Famicoman haha okay
01:41 🔗 DrainLib just grabbed entire Kosmic archives too cause it was easy, though I think they're all in the modarchive.org collection anyhow
01:46 🔗 DrainLib wow. just stumbled across aminet.net - epic archive of amiga stuff.
01:48 🔗 DrainLib no action necessary there, looks widely dispersed across aminet cds
02:38 🔗 shaqfu Hm, is there any clever way in Perl to have it ignore carriage returns?
02:38 🔗 shaqfu There's absolutely no reason for them to be in these .html files, but they're trashing output :(
02:38 🔗 Coderjoe change the line separator metavariable?
02:38 🔗 shaqfu Coderjoe: They're hanging out in the middle of lines
02:39 🔗 Coderjoe then s/^M//g ?
02:39 🔗 Coderjoe (remove them?)
02:39 🔗 shaqfu Ah, true - forgot they're removable
02:39 🔗 shaqfu Thanks
02:41 🔗 shaqfu Hm, nope - still carriage returning
02:42 🔗 Coderjoe did you use a real ^M or carat M?
02:42 🔗 shaqfu It has to be escaped?
02:43 🔗 chronomex try \r
02:43 🔗 chronomex or an actual control-M character
02:43 🔗 Coderjoe you either need the real ^M (in vim, you would type ctrl-V ctrl-M) or the proper escape character sequence (like \x0D)
02:45 🔗 shaqfu Neither \r nor \x0D are working
02:45 🔗 shaqfu Er, ^r
02:45 🔗 chronomex ^r isn't even a thing
02:46 🔗 Coderjoe beginning-of-line r?
02:46 🔗 shaqfu Oh, nevermind - 0D works. No more reformatting HTML pages *after* hitting the bar next time
02:46 🔗 shaqfu Thanks
02:49 🔗 shaqfu (last thing I needed for FilePlanet)
02:51 🔗 godane i don't see the 250 hours of bbs documentary on archive.org
02:52 🔗 godane Was there a problem getting all of them on archive.org SketchCow?
02:53 🔗 aggro Am I late in hearing about this?
02:53 🔗 aggro https://www.eff.org/press/releases/internet-archive-sues-stop-new-washington-state-law
02:56 🔗 chronomex I heard about it the other day, kind of fucked up
03:12 🔗 instence i'm doing my own more selective grab from fileplanet
03:13 🔗 shaqfu instence: Of what?
03:14 🔗 instence i went through 250,000 id's and verified working ones, then pulled down the fileinfo pages, then im scraping them to build a list, then targeting specific game titles
03:14 🔗 shaqfu instence: Wait, you have a list of all the verified IDs? We need that...
03:15 🔗 instence yea
03:15 🔗 shaqfu It'd hurry things along, at least - I was waiting until we downloaded every ID before listing them
03:15 🔗 instence finished this morning
03:15 🔗 shaqfu instence: Get in #fireplanet
03:16 🔗 instence At first I was running a scrape against every single id
03:16 🔗 instence but that was taking forever
03:17 🔗 instence So instead i used Xenu, which is multithreaded
03:17 🔗 instence since it checks "size"
03:17 🔗 instence i was able to blast through them, and sort by size
03:17 🔗 instence file not found pages would be smaller
03:18 🔗 shaqfu Ah - we were just grabbing every id, and pitching ones that returned errors
03:18 🔗 shaqfu We hoped to use sitemaps.xml files, but those don't list every file
03:18 🔗 instence Well my process will yield an excel file with uhhh 1 sec
03:19 🔗 shaqfu instence: Do you have a complete set of /fileinfo/ pages?
03:19 🔗 shaqfu Inc. ones not listed in sitemaps
03:19 🔗 shaqfu I have a set of scripts ready to fire, once I get them
03:20 🔗 instence i have a complete list of working fileinfo pages from 0-249999
03:20 🔗 instence it's about 100,000 pages that are working
03:20 🔗 shaqfu Sounds about right
03:20 🔗 shaqfu Toss them on archive.org
03:22 🔗 shaqfu I'm curious - why only certain games, vs. the entire site?
03:23 🔗 instence my process still isn't finished, i have to compress the info pages, pull them down, then scrape more info off them (Title, Author, Created, filename, fileid, filesize)
03:23 🔗 instence i do independent research
03:23 🔗 shaqfu instence: If you want, I have scraping /fileinfo/ done
03:23 🔗 shaqfu I can send you what I have
03:24 🔗 shaqfu The plan was to take the metadata, put it into SQLite, and possibly host it elsewhere later
03:26 🔗 instence For fileplanet, my goal was to just target older games of interest. Quake, UT99, AVP2, etc
03:26 🔗 instence plus doing the fileinfo scrape allows me to weed out large files that are deemed unnecessary, that would be bloat
03:26 🔗 shaqfu Were you getting beat up by the encrypted clients?
03:27 🔗 instence hmmm? beat up?
03:27 🔗 shaqfu We ran into nonsense like 11GB copies of games that were encrypted
03:27 🔗 shaqfu So they're functionally useless...just very big
03:28 🔗 shaqfu I think IGN pooled FilePlanet and Direct2Drive at some point
03:28 🔗 instence I haven't begun pulling down any files yet, but yea there is stuff in there like 4GB beta clients for extinct mmo's and stuff
03:28 🔗 shaqfu We had stuff like the UT3 installer
03:28 🔗 instence yea lol
03:29 🔗 shaqfu Hm, did you check out the project page?
03:29 🔗 instence Prioritization helps a lot. That's why i'm targeting older, smaller files first, vs newer, bigger ones
03:29 🔗 instence nah
03:29 🔗 shaqfu Sensible. Thankfully, IGN's taking their time axing it
03:30 🔗 shaqfu We've been on it for six weeks (it's a backburner project)
03:30 🔗 instence I sort of ride solo, working on my own projects and just idle in here to read
03:30 🔗 shaqfu We already have 90% of the site on archive.org, so if it does go down, you can get a lot there
03:31 🔗 shaqfu http://archiveteam.org/index.php?title=Fileplanet
03:31 🔗 shaqfu In a sense, it's the smaller games that are more interesting, since they're less likely to be saved
03:32 🔗 shaqfu But most of the community-made stuff centers around Half-Life, UT, Quake, etc
03:32 🔗 instence yea to me that's where the really important stuff is
03:33 🔗 instence the planet sites fueled hosting of user created content
03:33 🔗 shaqfu Yep
03:33 🔗 instence basically classic golden age pc games
03:33 🔗 shaqfu You might want to talk to Schbirid - he's the one working on saving most of this
03:33 🔗 shaqfu Planet* forums are getting killed in a few days; he saved many of them, if not most
03:34 🔗 instence question:
03:34 🔗 shaqfu What's up
03:35 🔗 instence Did you guys ever save the public hosted stuff from 2008?
03:35 🔗 instence from the big shutdown?
03:35 🔗 shaqfu Public hosted stuff?
03:35 🔗 shaqfu Unless Schbirid did so, I don't think so - do you have any of it?
03:36 🔗 instence oh wait 2009
03:36 🔗 instence GameSpy public hosted sites
03:36 🔗 instence i archived them in 2009
03:36 🔗 shaqfu Talk to Schbirid and compare notes
03:36 🔗 instence was curious if archiveteam was ever working on that
03:36 🔗 instence well these were all shut down
03:37 🔗 instence Sept 2009
03:37 🔗 shaqfu Are those the planet* sites?
03:37 🔗 shaqfu Like PlanetHeretic and such
03:37 🔗 instence yes, but also all user subsites
03:38 🔗 shaqfu Gotcha; again, you'd have to swap notes. He should be on in a few hours
03:38 🔗 instence I have carbon copy archives of over 850 sites out of 1,350 urls
03:39 🔗 instence took 3 months to get it all
03:39 🔗 shaqfu Wait up -- do any of them have FilePlanet download URLs?
03:39 🔗 instence most likely, lots of sites in my archive would
03:40 🔗 shaqfu Since not every file has an ID
03:41 🔗 instence The trickiest part about the planet sites was the news archives
03:41 🔗 instence they were form based select boxes with no pagination
03:41 🔗 instence so crawlers would completely miss them
03:41 🔗 shaqfu Ouch
03:43 🔗 instence I had to build these crazy stepladders of dates in excel and create these custom lists for each planet site, and run a separate process on the news listings
03:46 🔗 instence lol.... the -k is still running on the fileinfo html files
03:46 🔗 instence been over an hour
03:47 🔗 instence btw
03:47 🔗 instence one thing i noticed is that, there are some fileinfo pages that have a different id for the file (FileID) in the source code
03:48 🔗 instence <input id="FileId" type="hidden" value="10058"/>
03:48 🔗 shaqfu Ah, interesting
03:48 🔗 instence i was running a php scrape against that at first
03:48 🔗 shaqfu I haven't come across any collisions yet...wonder if it's a meaningful number
03:48 🔗 instence er well that's not an example of a collision
03:49 🔗 instence What this means is there are extra fileinfo ids
03:49 🔗 instence so here is an example
03:50 🔗 instence lets say you had fileinfo url with 2456, but in its source code there is a different fileid, at 4567
03:50 🔗 instence but a fileinfo page also would exist at 4567, with the same data as 2456
03:51 🔗 shaqfu Gotcha
03:54 🔗 instence 1 site i am pissed i never got was planetdreamcast
03:54 🔗 shaqfu Router's acting up again - losing connection. Running another test instance of scraping fileinfo tonight; I'll report how it goes
03:54 🔗 instence it was shut down in may of 2009 before i even knew wtf was going down
03:54 🔗 shaqfu I'm out for a bit
03:54 🔗 instence word
04:12 🔗 shaqdroid Do you have any other GameSpy stuff?
04:22 🔗 instence i'd have to check my archives
04:23 🔗 instence a lot of my projects are in limbo right now
06:13 🔗 godane looks like yahoo may have kept geocities.co.jp up
06:23 🔗 chronomex yes, they're a different organization
06:28 🔗 Coderjoe yahoo japan is an independent company
06:28 🔗 Coderjoe yahoo is huge over there
09:13 🔗 Coderjoe http://archive.org/details/archiveteam-ftp_abit_com_tw
10:11 🔗 Coderjoe and I think that concludes my independently downloaded collections. I do have a bunch of old CDs I could image. Unfortunately, while some came with a book and case inserts in jewel cases, the only thing remaining in my possession is the CD itself
10:11 🔗 Coderjoe (things like several slackware CDs)
10:22 🔗 Nemo_bis Coderjoe, as long as you're able to give them at least a title this doesn't seem an issue
15:03 🔗 underscor I'm IRCing from 34,248 feet
15:03 🔗 underscor how fucking awesome is that :D
15:07 🔗 BlueMax must be one huge woman you climbed up in
15:08 🔗 underscor zing
15:38 🔗 godane i got NEXT Generation #41 CD to upload
15:39 🔗 godane :-D
17:06 🔗 godane uploaded: http://archive.org/details/cdrom-next-generation-1998-05
17:16 🔗 mistym So I've got a site I was trying to archive with wget-warc, but it's got an infinite loop going on that means it never finishes.
17:16 🔗 mistym Any advice?
17:48 🔗 aggro You could set the '--level=n' option, where 'n' is some integer.
17:49 🔗 underscor or figure out what specific page is a spidertrap and exclude it
19:06 🔗 godane do you guys have world of spectrum snapshot 2009-06-17?
19:06 🔗 godane its on underground gamer and its 42gb
19:20 🔗 Zebranky http://thedailyoat.com/?p=987
19:20 🔗 Zebranky :|
19:20 🔗 Zebranky SketchCow: underscor: all too relevant
19:31 🔗 closure SketchCow: hey, you're a freebsd guy aren't you?
22:16 🔗 amerrykan warrior -- what a great idea
23:04 🔗 shaqfu I should make a Swiss Archivist's Knife USB disk with ATW, Archivematica, etc on it
23:50 🔗 closure anyone remember when SketchCow was talking about having an archiveteam conference?
23:51 🔗 Coderjoe shaqfu: unfortunately, you can't really run virtualbox on a system in a portable manner
23:52 🔗 Coderjoe closure: sure. I recall him mentioning asking about holding it at IA as well
23:52 🔗 closure was that end of January?
23:54 🔗 shaqfu Coderjoe: Really? I'm curious how Archivematica does it, then
23:55 🔗 shaqfu closure: He's mentioned it a few times since then, but dunno anything specific
23:58 🔗 shaqfu I should get familiar with Amatica...might be using it if this contract position works out
23:59 🔗 Coderjoe I'm not aware of a way to do so, anyway
23:59 🔗 Coderjoe and I had never heard of archivematica
23:59 🔗 shaqfu Coderjoe: Virtualized Ubuntu with a bunch of command-line tools, so you can do digipres work without worrying about OS
