[01:45] FortuneCity is now 100% into the format for the Wayback Machine.
[01:46] No idea if they've done the final sweep yet; the power outage definitely set some projects and activities back.
[01:46] But this basically leaves PicPlz.
[01:47] I've got that project going now in two windows (one ingesting, one uploading)
[01:47] Will definitely take a day or two, since it's 3.5 TB of pictures.
[01:51] For MobileMe, we're going to have to do flat-out pull-downs, conversion, and REPLACEMENT, so I want to hold off on that for a bit.
[01:52] There's just no other way - too much data.
[01:54] yeah, we oughtn't double it unnecessarily
[01:54] So first I want to see all this data we just did make the full journey into Wayback and go completely live.
[01:55] When it does, I want to then start dialing down/removing the doubled FortuneCity and other large collections, like PicPlz, so they're no longer doubles.
[02:02] On the good side, I'm getting between 15-21mb a second off of the pipe.
[02:02] So I think whatever was going on before the power outage is now in good shape.
[02:03] SketchCow, so it really went out?
[02:03] no, it was staged, just like the moon landing
[02:03] of course the power went out
[02:04] Richmond district of SF had a power outage.
[02:04] Took Internet Archive with it.
[02:04] Exciting.
[02:04] That's a lot of stuff out.
[02:04] sorry
[02:04] Stuff came back slowly, but it did come back.
[02:04] RIGHT during their big party to celebrate 10 petabytes of web historical data in the Wayback Machine.
[02:05] http://www.flickr.com/photos/mlinksva/8126312466/
[02:05] So as you can see, they got out emergency lights and power, put the laptop on it, and just kept going.
[02:06] sucko
[02:06] funny tho
[02:06] was the livestream interrupted?
[02:06] Well, this was the party - it had no livestream.
[02:07] ah
[02:07] The whole Books in Browsers went fine - they lost power at, like, 8pm.
[02:09] Is H-Net at any sort of risk?
[02:09] What's H-Net in this context?
[02:09] hurricane electric?
[02:10] The humanities mailing list
[02:10] I mean, never trust any mailing list is my rule.
[02:10] And it's text and trivial to grab.
[02:10] Did a very quick survey and saw a lot of defunct lists, but it still seems to be in some use
[02:10] I assume you don't mean http://dhhumanist.org/text.html
[02:11] I'm on several mailing lists more or less solely so I have my own archive of them
[02:11] for example, it's why I will never unsubscribe from a yahoo list
[02:11] No, http://www.h-net.org/lists/
[02:11] DHHumanist is what made me look at it
[02:16] And yeah, it's trivial, but no need if it's still under active watch
[02:27] hey shaqfu
[02:27] Yo
[02:28] i'm grabbing another magazine
[02:28] called ce lifestyles
[02:40] https://docs.google.com/a/textfiles.com/spreadsheet/ccc?key=0ApQeH7pQrcBWdDZIUEVjR3d1UmRoU0lPSWZYX0Q1Ync#gid=0
[02:40] Lot of great stuff - thanks, team.
[02:57] alard: 2 instances running with different ips
[03:23] Could someone please WARC http://www.wikipediareview.com/ up?
[03:29] is all the info how to do it on the wiki? then i might try :)
[03:29] SHOULD be. If not, let me know
[03:30] then i'll try. Thx
[03:35] SketchCow: src/wget "http://www.archiveteam.org/" --mirror --warc-file="at" <- is the command to use, right? anything else i need to watch out for?
[03:35] via http://archiveteam.org/index.php?title=Wget_with_WARC_output
[03:38] flaushy: with a different starting url, of course :)
[03:38] tef yeah :)
[03:52] http://projects.metafilter.com/3766/Just-Solve-the-Problem-Month-Solve-File-Formats
[07:09] looks like we have an admirer https://archive.org/details/virtualitera.freeweb7.com
[07:14] yay, wrote a javascript unpacker
[07:14] ... in python
[09:24] flaushy: Thanks. The ask-crawl produced 100 new usernames overnight.
[13:32] alard: i think i got blocked at ask
[13:32] Your client does not have permission to access this site.
[13:54] I'm getting HTTP 400s on wikipediareview after a while, and i feel like it is too small to be a complete rip. Any suggestions on wget parameters to "crawl gently" and avoid blocks? And any suggestions on "checking" complete rips?
[14:20] flaushy: You can use gunzip -c *.warc.gz | grep -a Target-URI to get a list of the URLs in the warc file.
[14:21] wget has options to have a delay between requests (there are multiple, look in wget --help).
[14:22] If you want to download the images, you probably need --page-requisites (I think that's not included in --mirror).
[14:23] ah cool
[14:23] humanizing as well?
[14:23] what delays would be good? (i once mirrored a wiki with 5 sec average, painful...)
[14:25] alard: do we want to keep crawling at ask?
[14:25] What do you mean by "humanizing"? I don't know about delays, it really depends on the purpose.
[14:26] Well, I'm not sure if it's worth it. It is producing some usernames, very slowly, and they block reasonably quickly.
[14:26] you have a random backoff and it is on average X secs
[14:26] eg so you keep a window of 0 - 10 secs
[14:27] ah --random-wait was its name in wget :)
[14:27] Yes, that's it.
[14:27] You might want to set the --user-agent to something other than Wget.
[14:28] okie thx
[14:28] btw did yacy crawl over bt?
[14:28] maybe we could get data out of their index
[14:31] Yacy, never heard of it before. Does it contain any data? (I just used the demo to search for "archive team", but that didn't produce results.)
[14:32] it is a p2p search engine attempt
[14:32] it didn't scale a couple of years ago, lost interest in it after a while
[14:34] http://search.yacy.net/HostBrowser.html?path=www.btinternet.com&list=Browse+Host
[14:34] okie screw that ;(
[14:35] I now see that http://wikipediareview.com/ is a forum. They're hard to archive.
[14:39] Agreed
[14:40] There's not a RUSH on this. They're just on the skids
[14:40] They've been up and down over the past couple years. Unpaid bills, etc.
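Editor's sketch for the wget advice in this stretch of the log: one way to assemble a gentler crawl command and then list the URLs that ended up in the resulting WARC, as a rough completeness check. The function names, the default `wget` binary name, the wait value, and the example user-agent string are all assumptions for illustration. The wget flags themselves are the ones named in the chat, plus `--wait`, which `--random-wait` varies around. The URL listing simply scans for `WARC-Target-URI:` header lines, so a payload line that happens to start with that text would be counted too.

```python
import gzip


def build_wget_command(start_url, warc_name, wait_seconds=5,
                       user_agent=None, wget_binary="wget"):
    """Build an argv list for a polite WARC crawl (hypothetical helper)."""
    args = [
        wget_binary,
        "--mirror",                    # recursive mirror, as in the chat
        "--page-requisites",           # also fetch images, CSS, etc.
        f"--warc-file={warc_name}",    # wget appends .warc.gz itself
        f"--wait={wait_seconds}",      # base delay between requests
        "--random-wait",               # vary the delay around --wait
    ]
    if user_agent:
        # Assumption: any non-default string; pick one that identifies you.
        args.append(f"--user-agent={user_agent}")
    args.append(start_url)
    return args


def list_target_uris(warc_gz_path):
    """Return WARC-Target-URI values from a gzipped WARC file, in file order."""
    prefix = b"WARC-Target-URI:"
    uris = []
    # gzip.open reads through concatenated gzip members, which is how
    # per-record-compressed WARC files are laid out.
    with gzip.open(warc_gz_path, "rb") as fh:
        for line in fh:
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                # Some writers wrap the URI in angle brackets.
                uris.append(value.decode("utf-8", errors="replace").strip("<>"))
    return uris
```

The argv list can be passed to `subprocess.run(...)` once the values look right. Comparing `len(set(list_target_uris("name.warc.gz")))` against what the site visibly contains gives a first hint whether a rip stopped early, though it cannot prove completeness.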
[14:43] Well, with 'hard to archive' I only meant that it probably needs something more structured than just wget --mirror.
[14:43] A Lua script could be a solution to do a structured download. What type of forum software is it?
[14:44] It may be time for a new script in the forum download library.
[14:49] Which reminds me: should we do a second run of boards.cityofheroes.com?
[14:53] I am not against it.
[14:53] I think people are probably going pretty nuts towards the end as this idiocy goes down.
[14:54] Oh, wait, it's less urgent than I thought: "The City of Heroes® servers will shut off on November 30, 2012" http://na.cityofheroes.com/en/news/news_archive/city_of_heroes_sunset_faq.php
[14:57] Well, that's still urgent. And I mean obviously we wait until closer to the end, like the 20th or later.
[14:57] Set an alarm!
[14:58] Someone is working on a file format of motion picture film.
[14:58] Agreed, they encode audio and visual data right into the film.
[15:39] ha! http://source.git-annex.branchable.com/?p=source.git;a=commitdiff;h=b72d04988f767dd4b8dab3e1267c03b7f80d4c2c;hp=6633a5158d4d3a6f0bdf9fa5c2c8725e47b051cc
[15:59] http://monitor.us.archive.org/weathermap/weathermap.html
[16:11] "Paul" is very important.
[16:18] So here's a script to download the Invision Power Board forums on wikipediareview.com:
[16:18] https://github.com/ArchiveTeam/wikipediareview-grab/blob/master/invpowerboard.lua
[16:22] It could be a small warrior project. (I think it might be too big for one single wget run.)
[16:38] alard: thx. i will put it on my nas later on :)
[20:16] is all of ftp.scene.org backed up?
[20:31] it has multiple mirrors right
[20:51] i think i found something interesting
[20:51] a magazine called hebdogiciel
[20:52] it's a French magazine from 1983 to 1987
[20:52] collection item is not viewable even though i can see the magazines just fine
[20:53] anyways archive.org only has 13 issues
[20:55] i have found the rest of them
[22:17] godane: the french site that has all that stuff got really butthurt when jason put it up so that's why it all went dark
[22:48] DFJustin: Only 13 issues are visible
[22:49] I don't think jason uploaded the full set
[23:11] Back.
[23:11] Had to visit underscor
[23:11] Yes.
[23:12] All the French magazines are dark unless we digitize them ourselves.
[23:13] They got butthurt like the Al Qaeda gets butthurt
[23:13] :(
[23:14] yeah I was digging around about early PC programming books ... quite a bit of that on IA, but dark :/
[23:15] Anyway, you're all missing the important point
[23:15] John Romero got married.
[23:15] Off the market
[23:15] Now, this is a major blow to the group but I think we can recover
[23:15] If we stick together
[23:16] "Sandy could potentially be an unprecedented threat to the masses in its path, a massive storm that hasn't been rivaled in generations."
[23:16] Whew, way to couch it carefully
[23:36] SketchCow: So you can only undark them if archive.org digitizes them?
[23:38] SketchCow: that sort of would defeat the point of darking it then
[23:40] No.
[23:40] I can undark them if SOMEONE I KNOW digitizes them.
[23:40] This is a specific situation, to those magazines.
[23:41] that's just weird
[23:42] anyway there is no seed for the torrent collection right now
[23:42] no point in downloading it