[01:24] and hornet archive mirror completed. now the damn thing exists in more than 3 places on the internet
[01:24] nice, how'd ya get it?
[01:24] good ole ftp, nothing fancy
[01:25] is it like 3gb or something hilarious? :P
[01:25] 6gb
[01:26] hell, you could mirror that everywhere
[01:26] yeah i was like wait, wtf, this is only in 3 places?
[01:26] found 2-3 stray copies of it on what looked like test sites, like someone just said "hey i need 6gb of random data to test my http/ftp server"
[01:27] haha
[01:28] working on modarchive.org torrents too, at least those are on an official tracker and don't look nearly as ethereal. but when the entire collection of something exists on one website with sub-30 torrent seeders, i worry
[01:32] after that, scene.org as a whole. don't think their ftp admins are real happy with me though, seems i got blocked with around 300 files of the hornet remaining, had to hit up a mirror
[01:33] I've actually been downloading a bit of scene.org
[01:33] it's kinda fun
[01:33] skip hornet, grab the whole damn thing :)
[01:33] lol, sorta working on it
[01:33] I think I got about 25gb of it today
[01:34] alright i'll leave that one up to you then. go forth.
[01:34] haha okay
[01:41] just grabbed the entire Kosmic archives too cause it was easy, though I think they're all in the modarchive.org collection anyhow
[01:46] wow. just stumbled across aminet.net - epic archive of amiga stuff.
[01:48] no action necessary there, looks widely dispersed across aminet cds
[02:38] Hm, is there any clever way in Perl to have it ignore carriage returns?
[02:38] There's absolutely no reason for them to be in these .html files, but they're trashing output :(
[02:38] change the line separator metavariable?
[02:38] Coderjoe: They're hanging out in the middle of lines
[02:39] then s/^M//g ?
[02:39] (remove them?)
[02:39] Ah, true - forgot they're removable
[02:39] Thanks
[02:41] Hm, nope - still carriage returning
[02:42] did you use a real ^M or caret M?
[02:42] It has to be escaped?
[02:43] try \r
[02:43] or an actual control-M character
[02:43] you either need the real ^M (in vim, you would type ctrl-V ctrl-M) or the proper escape character sequence (like \x0D)
[02:45] Neither \r nor \x0D is working
[02:45] Er, ^r
[02:45] ^r isn't even a thing
[02:46] beginning-of-line r?
[02:46] Oh, never mind - 0D works. No more reformatting HTML pages *after* hitting the bar next time
[02:46] Thanks
[02:49] (last thing I needed for FilePlanet)
[02:51] i don't see the 250 hours of bbs documentary on archive.org
[02:52] Was there a problem getting all of them on archive.org, SketchCow?
[02:53] Am I late in hearing about this?
[02:53] https://www.eff.org/press/releases/internet-archive-sues-stop-new-washington-state-law
[02:56] I heard about it the other day, kind of fucked up
[03:12] i'm doing my own more selective grab from fileplanet
[03:13] instence: Of what?
[03:14] i went through 250,000 id's and verified working ones, then pulled down the fileinfo pages, then i'm scraping them to build a list, then targeting specific game titles
[03:14] instence: Wait, you have a list of all the verified IDs? We need that...
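Going back to the carriage-return question at 02:38–02:46, here is a minimal Perl sketch of the fix that was settled on (stripping \x0D); the *.html glob and the in-place rewrite are assumptions about how the saved pages are laid out, not anything stated in the log:

```perl
#!/usr/bin/perl
# Strip stray carriage returns (\x0D) out of saved .html files before
# parsing them, since they can sit in the middle of lines. The *.html
# glob is a placeholder for wherever the pages actually live.
use strict;
use warnings;

for my $file (glob '*.html') {
    local $/;                                   # slurp mode: read the whole file at once
    open my $in, '<', $file or die "open $file: $!";
    my $html = <$in>;
    close $in;

    $html =~ s/\x0D//g;                         # remove every CR, mid-line ones included

    open my $out, '>', $file or die "write $file: $!";
    print {$out} $html;
    close $out;
}
```

The same cleanup as a one-liner would be `perl -pi -e 's/\x0D//g' *.html`.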
[03:15] yea
[03:15] It'd hurry things along, at least - I was waiting until we downloaded every ID before listing them
[03:15] finished this morning
[03:15] instence: Get in #fireplanet
[03:16] At first I was running a scrape against every single id
[03:16] but that was taking forever
[03:17] So instead i used Xenu, which is multithreaded
[03:17] since it checks "size"
[03:17] i was able to blast through them, and sort by size
[03:17] file not found pages would be smaller
[03:18] Ah - we were just grabbing every id, and pitching ones that returned errors
[03:18] We hoped to use sitemaps.xml files, but those don't list every file
[03:18] Well my process will yield an excel file with uhhh 1 sec
[03:19] instence: Do you have a complete set of /fileinfo/ pages?
[03:19] Inc. ones not listed in sitemaps
[03:19] I have a set of scripts ready to fire, once I get them
[03:20] i have a complete list of working fileinfo pages from 0-249999
[03:20] it's about 100,000 pages that are working
[03:20] Sounds about right
[03:20] Toss them on archive.org
[03:22] I'm curious - why only certain games, vs. the entire site?
[03:23] my process still isn't finished, i have to compress the info pages, pull them down, then scrape more info off them (Title, Author, Created, filename, fileid, filesize)
[03:23] i do independent research
[03:23] instence: If you want, I have scraping /fileinfo/ done
[03:23] I can send you what I have
[03:24] The plan was to take the metadata, put it into SQLite, and possibly host it elsewhere later
[03:26] For fileplanet, my goal was to just target older games of interest. Quake, UT99, AVP2, etc
[03:26] plus doing the fileinfo scrape allows me to weed out large files that are deemed unnecessary, that would be bloat
[03:26] Were you getting beat up by the encrypted clients?
[03:27] hmmm? beat up?
[03:27] We ran into nonsense like 11GB copies of games that were encrypted
[03:27] So they're functionally useless...just very big
[03:28] I think IGN pooled FilePlanet and Direct2Drive at some point
[03:28] I haven't begun pulling down any files yet, but yea there is stuff in there like 4GB beta clients for extinct mmo's and stuff
[03:28] We had stuff like the UT3 installer
[03:28] yea lol
[03:29] Hm, did you check out the project page?
[03:29] Prioritization helps a lot. That's why i'm targeting older/smaller first, vs newer/bigger
[03:29] nah
[03:29] Sensible. Thankfully, IGN's taking their time axing it
[03:30] We've been on it for six weeks (it's a backburner project)
[03:30] I sort of ride solo, working on my own projects and just idle in here to read
[03:30] We already have 90% of the site on archive.org, so if it does go down, you can get a lot there
[03:31] http://archiveteam.org/index.php?title=Fileplanet
[03:31] In a sense, it's the smaller games that are more interesting, since they're less likely to be saved
[03:32] But most of the community-made stuff centers around Half-Life, UT, Quake, etc
[03:32] yea to me that's where the really important stuff is
[03:33] the planet sites fueled hosting of user-created content
[03:33] Yep
[03:33] basically classic golden age pc games
[03:33] You might want to talk to Schbirid - he's the one working on saving most of this
[03:33] Planet* are getting killed in a few days; he saved many of them, if not most
[03:34] Er, forums*
[03:34] question:
[03:34] What's up
[03:35] Did you guys ever save the public hosted stuff from 2008?
[03:35] from the big shutdown?
[03:35] Public hosted stuff?
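For reference, a rough sketch of the "check size and sort" approach instence describes above (dead-ID pages come back noticeably smaller than real fileinfo pages). It runs serially with LWP rather than a multithreaded checker like Xenu, and the URL pattern and 10 kB cutoff are guesses, not FilePlanet's real values:

```perl
#!/usr/bin/perl
# Size heuristic for weeding out dead IDs: fetch each candidate fileinfo
# page and keep the IDs whose response is large enough to be a real page.
# The URL layout and the cutoff are assumptions.
use strict;
use warnings;
use LWP::UserAgent;

my $ua     = LWP::UserAgent->new(timeout => 30);
my $cutoff = 10_000;    # bytes; calibrate against known-good and known-dead IDs

for my $id (0 .. 249_999) {
    my $url = "http://www.fileplanet.com/fileinfo/$id";    # hypothetical URL pattern
    my $res = $ua->get($url);
    next unless $res->is_success;

    my $size = length($res->decoded_content // '');
    print "$id\t$size\n" if $size >= $cutoff;               # likely a working fileinfo page
}
```

The surviving IDs' pages could then be scraped for the fields mentioned above (Title, Author, Created, filename, fileid, filesize) and loaded into SQLite as planned.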
[03:35] Unless Schbirid did so, I don't think so - do you have any of it?
[03:36] oh wait 2009
[03:36] GameSpy public hosted sites
[03:36] i archived them in 2009
[03:36] Talk to Schbirid and compare notes
[03:36] was curious if archiveteam was ever working on that
[03:36] well these were all shut down
[03:37] Sept 2009
[03:37] Are those the planet* sites?
[03:37] Like PlanetHeretic and such
[03:37] yes, but also all user subsites
[03:38] Gotcha; again, you'd have to swap notes. He should be on in a few hours
[03:38] I have carbon copy archives of over 850 sites out of 1,350 urls
[03:39] took 3 months to get it all
[03:39] Wait up --- do any of them have FilePlanet download URLs?
[03:39] most likely, lots of sites in my archive would
[03:40] Since not every file has an ID
[03:41] The trickiest part about the planet sites was the news archives
[03:41] they were form-based select boxes with no pagination
[03:41] so crawlers would completely miss them
[03:41] Ouch
[03:43] I had to build these crazy stepladders of dates in excel and create these custom lists for each planet site, and run a separate process on the news listings
[03:46] lol.... the -k is still running on the fileinfo html files
[03:46] been over an hour
[03:47] btw
[03:47] one thing i noticed is that there are some fileinfo pages that have a different id for the file (FileID) in the source code
[03:48]
[03:48] Ah, interesting
[03:48] i was running a php scrape against that at first
[03:48] I haven't come across any collisions yet...wonder if it's a meaningful number
[03:48] er well that's not an example of a collision
[03:49] What this means is there are extra fileinfo ids
[03:49] so here is an example
[03:50] let's say you had a fileinfo url with 2456, but in its source code there is a different fileid, at 4567
[03:50] but a fileinfo page would also exist at 4567, with the same data as 2456
[03:51] Gotcha
[03:54] 1 site i am pissed i never got was planetdreamcast
[03:54] Router's acting up again - losing connection. Running another test instance of scraping fileinfo tonight; I'll report how it goes
[03:54] it was shut down in May of 2009 before i even knew wtf was going down
[03:54] I'm out for a bit
[03:54] word
[04:12] Do you have any other GameSpy stuff?
[04:22] i'd have to check my archives
[04:23] a lot of my projects are in limbo right now
[06:13] looks like yahoo may have kept geocities.co.jp up
[06:23] yes, they're a different organization
[06:28] yahoo japan is an independent company
[06:28] yahoo is huge over there
[09:13] http://archive.org/details/archiveteam-ftp_abit_com_tw
[10:11] and I think that concludes my independent downloaded collections. I do have a bunch of old CDs I could image. Unfortunately, while some came with a book and case inserts in jewel cases, the only thing remaining in my possession is the CD itself
[10:11] (things like several slackware CDs)
[10:22] Coderjoe, as long as you're able to give them at least a title this doesn't seem an issue
[15:03] I'm IRCing from 34,248 feet
[15:03] how fucking awesome is that :D
[15:07] must be one huge woman you climbed up in
[15:08] zing
[15:38] i got the NEXT Generation #41 CD to upload
[15:39] :-D
[17:06] uploaded: http://archive.org/details/cdrom-next-generation-1998-05
[17:16] So I've got a site I was trying to archive with wget-warc, but it's got an infinite loop going on that means it never finishes.
[17:16] Any advice?
[17:48] You could set the '--level=n' option, where 'n' is some integer.
[17:49] or figure out what specific page is a spidertrap and exclude it
[19:06] do you guys have the world of spectrum snapshot 2009-06-17?
[19:06] it's on underground gamer and it's 42gb
[19:20] http://thedailyoat.com/?p=987
[19:20] :|
[19:20] SketchCow: underscor: all too relevant
[19:31] SketchCow: hey, you're a freebsd guy, aren't you?
[22:16] warrior -- what a great idea
[23:04] I should make a Swiss Archivist's Knife USB disk with ATW, Archivematica, etc on it
[23:50] anyone remember when SketchCow was talking about having an archiveteam conference?
[23:51] shaqfu: unfortunately, you can't really run virtualbox on a system in a portable manner
[23:52] closure: sure. I recall him mentioning asking about holding it at IA as well
[23:52] was that end of January?
[23:54] Coderjoe: Really? I'm curious how Archivematica does it, then
[23:55] closure: He's mentioned it a few times since then, but dunno anything specific
[23:58] I should get familiar with Amatica...might be using it if this contract position works out
[23:59] I'm not aware of a way to do so, anyway
[23:59] and I had never heard of archivematica
[23:59] Coderjoe: Virtualized Ubuntu with a bunch of command-line tools, so you can do digipres work without worrying about OS
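Going back to the wget-warc question at 17:16–17:49: besides capping recursion with --level, one way to "figure out what specific page is a spidertrap" is to scan the URLs already collected for paths that keep deepening or repeating the same segment. A rough Perl sketch; the depth and repeat thresholds are my own guesses, not anything from the discussion:

```perl
#!/usr/bin/perl
# Read URLs (one per line) on stdin and flag likely spider traps:
# paths that are suspiciously deep or that repeat the same segment
# many times. The thresholds are arbitrary.
use strict;
use warnings;
use URI;

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^https?://}i;

    my @segments = grep { length } URI->new($line)->path_segments;

    my %seen;
    $seen{$_}++ for @segments;
    my ($max_repeat) = sort { $b <=> $a } values %seen;

    print "TRAP? $line\n" if @segments > 15 or ($max_repeat // 0) >= 4;
}
```

Anything it flags can then be kept out of the next run with wget's -X/--exclude-directories or --reject options before redoing the WARC grab.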