[00:03] put a zfs filesystem image on the cd [00:03] problem solved [00:08] SketchCow: desktop.google.com has arrived on batcave. [00:09] Thanks, alard. [00:13] anyone here messing with zfs for mac? [00:13] (the tenscomplement port) [00:16] I so don't trust zfs [00:18] SketchCow: oh? [00:18] it just puts your data into a merkle tree, which is super awesome [01:19] I'm a huge fan of the archive.org online reader, I wish there was a desktop version [01:19] it relies on browser image scaling, which varies a lot and can be lame [01:20] true, that only seems to be an issue for 1-bit stuff though [01:34] merkle trees are like pixie dust; you basically can't go wrong [01:47] is the steve meretsky archive up yet? [02:05] "We believe that all of the Early Journal Content is out of copyright." -- JSTOR "Additional uses are allowed, including the ability to download, share, and reuse the content for any non-commercial purpose." -- JSTOR .. um, if it's out of copyright, who the fuck do they think they are slapping these restrictions on it? [02:06] * chronomex shrugs [02:06] well, they're not 100% sure it's 100% out of copyright [02:31] i have a feeling backing up something like reddit will be a problem [02:32] only cause images are linked to other sites [02:33] so to archive reddit we would have to archive also the exterinal link too [04:31] comcast-- [05:15] Oh dear, what did they do this time? [05:17] left me offline for 12 hours, then couldn't explain why it just started working again while I was talking to support [05:18] Sounds like what we've come to expect from them. [06:59] 5.0G www.instructables.com/ [06:59] Growin' and growin' [07:00] ersi: Does it work to just wget that? [07:05] Yeah [07:06] Or well, it *seems* to work. I'm going to check through what I get though [07:08] This is one massive site though, with mostly internal links [07:11] Hmm, think it would work for ehow? [07:11] Or is ehow already crawled by ia_archiver? [07:14] Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com [07:14] neither was instructables btw ;) [07:18] Ominous. [07:36] Hey hey. [07:37] I finally game to negotiations with the developer set who found I was choking archive.org [07:37] So yay? [07:38] developer set? [07:39] Set of developers who were finding I was choking things. [07:39] To be honest, OCR is a bottleneck I don't like existing. [07:39] Add more OCRs [07:39] Everything else is going fine. [07:40] I'm getting into a useless twitter fight with some fathead [07:40] heh [07:40] I finally got the digitizer rig going [07:41] GDC tapes. I need to be digitizing at the rate of 15-20 a day. [07:41] One ends.... next one. [07:41] Just keep going [07:41] In middle of month, they send me money to buy a second one [07:41] It'll render. [07:41] And we'll kill these fuckers [07:41] sweet [07:43] buy a second what? [07:43] oh, digitizer rig [07:45] SketchCow: so the second question is "game"? [07:48] ? [07:48] " I finally game to negotiations..." [07:49] anyway [07:50] SAFE. So safe you wouldn't believe it. [07:50] root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe [07:50] replace game with came, and it'll make more sense [07:50] Yes, I wrote a script that asks if the queue can handle me. 
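The ~jscott/isitsafe script referred to here is not shown in the log, so the following is only a guess at its shape: compare the pending OCR workload against a threshold and print SAFE or UNSAFE accordingly. How the real script obtains the count is unknown; current_ocr_count below is a hypothetical placeholder for whatever internal command or status endpoint reports that number.

    #!/bin/bash
    # Rough sketch of an "isitsafe"-style gate (not the actual script).
    # current_ocr_count is a hypothetical stand-in for whatever reports the
    # number of items still waiting in the OCR/derive queue.
    THRESHOLD=200

    count=$(current_ocr_count)

    if [ "$count" -gt "$THRESHOLD" ]; then
        echo "UNSAFE. Current OCR count is $count."
        exit 1
    else
        echo "SAFE. So safe you wouldn't believe it."
    fi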
[07:51] rsync to batcave finally started up again [07:51] SketchCow: lol [07:51] ersi: oh, I suppose if negotiations is an event [07:51] but then I would have expected "went" [07:51] anyway [07:52] http://www.archive.org/details/fox40newsaug222011 [07:52] Entertainment for you [07:52] ooh [07:53] Hm, wonder if I should have thrown on more parameters to wget before starting this :| [07:53] ersi: -D [07:54] * Closing connection #0 [07:54] < [07:54] < Connection: close [07:54] < Content-Length: 0 [07:54] < Content-Type: text/plain [07:54] ersi: --warc-file [07:54] root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe [07:54] SAFE. So safe you wouldn't believe it. [07:54] Tah dah, it says I didn't break it! [07:54] heh [07:55] db48x2: So the answer is 'yes, I should have' [07:55] there's probably always another option you could throw on there [07:55] like -k? for converting teh links [07:55] yes [07:55] and -K to save a copy of the original from before it munged the links [07:56] well, dang. [07:56] heh [07:59] At one point in this talk, Will Wright shows a self-riding motorcycle [07:59] It's hilarious [07:59] Running around a park scaring people [07:59] heh [07:59] he seems like a pretty crazy guy [08:01] does work against wget even when you do -e robots=off? [08:01] Not sure [08:02] oh, interesting [08:02] this time it crashed [08:03] I don't know if I showed this script I run. [08:03] aha [08:03] root@teamarchive-0:/3/TIMAGS/smartprogrammer# ./ingestor SmartProgrammer_1984_02.pdf [08:03] OK, then, SmartProgrammer_1984_02.pdf gets the love. [08:03] Here's what I plan to do. [08:03] I was telling it to mirror fanfiction.net, but it redirects to www.fanfiction.net [08:03] In the collection named smart-programmer-newsletter... [08:03] I will add an item called smart-programmer-newsletter-1984-02. [08:03] I will say this dates to 1984-02. [08:03] I will give it the title of The Smart Programmer Newsletter (February 1984). [08:04] .. [08:04] It looked at SmartProgrammer_1984_02.pdf to figure it out. [08:04] That's test mode [08:04] It tells me it's working. [08:04] sweet [08:04] There are 18 issues. [08:05] Running. [08:05] db48x2: I think wget doesn't listen to robots noarchive at all. It only understands nofollow. [08:05] It uploads each issue in roughly 8 seconds. [08:05] alard: good to know [08:07] Done. [08:07] 18 issues in what, 2 minutes. [08:08] SketchCow: what do you use for downloading them? [08:12] doh [08:12] UNSAFE. Current OCR count is 207. [08:12] root@teamarchive-0:/3/TIMAGS# ~jscott/isitsafe [08:13] 1am already [08:13] Oh no! [08:13] time to put more machines on the task of misreading the text in magazines [08:20] Yeah! [08:25] * db48x2 is watching Time's Arrow [08:37] hullo [08:38] Hi [08:38] So, I want to throw Atari Force up there. [08:38] But Atari Force is a DC comic book [08:38] A super defunct one, but still [08:39] So as awesome as it is, I don't think it'll count right now. [08:41] But this? [08:41] http://www.bombjack.org/commodore/commodore/ [08:41] As soon as it finishes downloading, it goes up. [08:41] Fwip [08:42] woah [08:54] Michael S. Hart is dead .... [08:55] So how good is httrack for mirroring things really? [08:56] it's kinda shitty [08:56] good for small projects [08:56] crap, hit a snag with fortunecity.com [08:57] Really? Damn. [08:58] Funny, I had completely forgotten about fortunecity, too. [08:58] Nothing really good on windows for ripping a site, but if your on linux wget or curl is really good. 
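For reference, a minimal sketch of the kind of wget invocation being discussed above: -k rewrites links for local browsing, -K keeps the unmodified originals alongside the converted copies, and --warc-file requires the WARC-patched wget build linked from the wiki. The host and the WARC filename prefix are placeholders.

    wget --mirror -p -np \
         -k -K \
         -e robots=off \
         --wait=1 --random-wait \
         --warc-file=example-site \
         http://www.example.com/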
[08:59] ive been doing some poking around in it, and found their directory structure to be.....not quite as i expected on fortune city [09:00] josephwdy: Yeah, they're utilities useful in proportion to the length of their man pages. [09:00] But their man pages are...short story-length. [09:03] What options are good? looks like wget -mkKe robots=off --warc-file from just the past few bits of history [09:03] -E [09:03] --mirror [09:04] --wait [09:04] --random-wait [09:04] -p --protocol-directories -np --follow-ftp --progress=dot:decimal --warc-file --warc-cdx --warch-header --user-agent [09:05] the --warc options require a special build of wget which you'll find on the wiki [09:05] crap, now im stuck.... [09:05] they cause it to record an archive that contains not just the files retrieved, but the http request and response headers that lead to the files themselves [09:06] OK, here we go. [09:06] Michael S. Hart is dead and we will miss him. [09:06] Only got to meet him once. [09:07] kin37ik: recording headers as db48x2 recommends is the ideal; for some time we mirrored without doing it but now we do when possible [09:07] chronomex: dont you mean Wyatt, and not me? lol [09:07] um, right. [09:07] I'm not sober. [09:07] lol [09:08] SketchCow: that's pretty awesome :D do tell more. [09:08] DRUNKIVING [09:08] DRUNKIRCING [09:08] DON'T DRINK AND DERIVE [09:08] I actually don't know how to drive. [09:08] Drunk Relay Chat [09:08] lol [09:09] There it goes! [09:09] Adding 156 books [09:09] Wyatt: the at wiki has a good starting point http://archiveteam.org/index.php?title=Wget [09:09] Yeah, thanks. I was just looking over that. [09:09] SketchCow: you still need to hook me up with your adder thing. [09:09] http://www.archive.org/details/commodore-manuals [09:09] chronomex: Yes [09:10] Sometimes I forget that there _are_ good resources for this stuff. [09:11] Ideally, wouldn't one want; A) 'just a plain wget' mirroring of the site, no modification B) modified links wget mirroring C) a WARC kind of wget mirroring= [09:11] s//=//?/ [09:11] ersi: ues [09:11] Ideally, you want both [09:11] But sometimes, no choice [09:11] -k and -K get you a modified and unmodified mirror [09:11] both three? [09:11] Ah, true. [09:11] and the --warc gets you the archive [09:11] Shut up, drunky [09:12] does warc make 'archives'? [09:12] yea, after a fashion [09:12] SketchCow: I'M drunk?!? [09:12] db48x2: Hm? [09:12] it's not a tarball [09:12] I mean, like heritrex (or whatever it's called) [09:12] You said there's a patch for warc on the wik? [09:12] yea, very similar to what heritrix does [09:12] similar being compatible? [09:13] when they wrote heritrix they invented the arc format [09:13] it's been updated [09:13] I don't know the exact timeline [09:13] So it makes 'old version WARC archives'? [09:13] http://www.youtube.com/watch?v=xDjOr68VxKw [09:13] Go watch that [09:14] http://archiveteam.org/index.php?title=Wget_with_WARC_output [09:14] hmm, heres a problem, if i poke members.fortunecity.com, ill get all the dir files on that domain but wont get any of the members subsites as they arent linked, how could i get around that to poke the member accounts?? [09:14] "sites [...] run by habitual whiners, will complain when a site scraping uses 200 megabytes of transfer when it could have used 100." 
-- sites run by whiners bitch at EVERYTHING [09:14] Truth^ [09:14] once you create your warc file, you should append a record that contains the script you ran to grab the site, if it's more than a single invocation of wget [09:15] Let me add; derp [09:16] oh, alard wrote the --warc wget support [09:16] yea [09:16] you will note, alard has an @ by his name [09:16] Oh, the headers is probably used for Wayback Machine to place it in the timeline [09:17] Historically, I had a @ by my name as well. [09:17] I think it's just for the masturbatory completeness factor [09:17] :P [09:17] fine. [09:17] usually the @s occupy 1 of 6 nickname columns on my screen; we're running low. [09:17] ish. [09:19] Man, I'd like to just ./mirror-archive-the-fuck-out-of-url [09:20] I think it's obvious we're going to have to write a script set that does this. [09:20] I've been working on one [09:20] also, darn these dynamic pages that generate these weird files [09:20] ersi: weird how? [09:20] trololo?COMMENTS=UPSIDEDOWN?&SORT=INMYPANTS [09:21] what's the problem with that? [09:21] that's the bit after the last / in the url [09:21] is filename. [09:21] None really, besides that it bothers me and feels naughty [09:21] unix is okay with it, right? [09:21] Right. [09:21] you can use -E [09:21] if it's okay with unix, it's okay with chronomex [09:21] it'll slap a .html on the end of all that [09:22] yeah, but I didn't do that :) [09:22] I'm unsure if I should CONTINUE RAPING or STOP and modify my parameters [09:23] indeed [09:23] a dilemma for the ages [09:24] If I let it run, i'll get a feel for if they use other domains for CDN or trickery and possibly total size of site [09:25] this is instructables, right? [09:26] Yes. It's probably effing huge [09:26] It's up at 6GB currently [09:27] ahyeah. [09:28] god I hate it when people who are insane but kind of interesting email me [09:29] Seems like a generalised distributed parallel archival-quality...I hesitate to say "bandwidth fucker" because it's awfully uncouth. [09:29] But yes, challenging, but boy would it be useful. [09:33] * kin37ik is getting frustrated [09:34] chronomex: Do you get lots of insane interesting people mailing you? :) [09:35] no, for the most part, it's confused transsexual folk who think I care. [09:35] responding to this one with "This sounds like something one would ask a lover. Before you proceed any further, ask yourself the following question: Is chronomex my lover?" [09:35] lol [09:37] Wyatt: My continuation of questionabe quality archival effort? [09:37] ersi: what would you change? [09:38] OK, who wants a short project [09:38] ersi: In a sense? I'm saying it would be nice to spread the love around [09:38] http://census.ire.org/ [09:38] Well, I'd throw on -kK and perhaps some more [09:39] Turn that into an "item", a collection that makes sense. [09:39] Module threw exception: [09:39] item must be OCR'd via auto_submit [09:40] That's interesting. [09:40] Wouldn't the "raw data datasets" from the bottom of http://census.ire.org/data/bulkdata.html be good candidates? [09:40] SketchCow: how is this better than the data on census.gov? [09:40] I am not clear at all it is. [09:40] I'm not seeing any real value add, except a shinier interface [09:40] If that's the case, I trust that opinion. [09:41] I've spent a good deal of time working with census data; I practically majored in that shit. 
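If someone did want the census.ire.org bulk data alard points at (the thread above ends up doubting it adds much over census.gov), one hedged starting point would be a one-level recursive fetch from the bulkdata page. The -H flag is an assumption in case the dataset files sit on a different host; with -l1 the crawl still stops one hop from the start page.

    # One-level grab of the bulk data page and the dataset files it links to.
    wget -e robots=off --wait=1 --random-wait \
         -r -l1 -H -p -k -K \
         --warc-file=census-ire-bulkdata \
         http://census.ire.org/data/bulkdata.html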
[09:41] :o [09:41] geography is a lot to do with demography [09:42] https://github.com/ireapps/census yeah, it's a fancy interface to census data [09:43] ersi: -k is not archive-safe, unless combined with -K. [09:43] That's why I'd do -kK [09:43] -K means some extra work to get an archive-safe version [09:44] what other flags were you thinking of? [09:45] well, I'd consider building alard's patched wget version and do WARC perhaps [09:45] warc is good [09:45] can you combine --continue with --warc ? [09:45] http://www.archive.org/details/commodore-manuals [09:45] aww yeah! [09:46] maybe add some domains to -D [09:46] SketchCow: color monitor service manual?!? fuck yeah! [09:46] Hm, maybe [09:46] See, these are all useful things [09:46] But I'd rather do a full blown new run with --warc [09:46] That have been around a long time [09:47] But they're going to be consolidated now. [09:47] ersi: right, just wondering. remember, alcohol. [09:47] Also, change the useragent to Firefox or something instead of Googlebot [09:47] maybe I'm getting 'GBot customised' versions of pages :/ [09:47] yea, that helps a lot [09:47] ersi: or "ARCHIVETEAM FUCKYOUBOT" [09:48] ArchiveTeam 1.0/Bitch I'm a Bus [09:48] SketchCow: might just grab a copy of all of those and store them away somewhere [09:48] "ARCHIVETEAM FUCKYOUBOT 3.6" [09:48] currently running with; wget -m -c -p -e robots=off http://www.instructables.com/index --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" [09:48] to be honest we ought to archive with lots of different user agents, to make sure [09:48] ersi: --mirror [09:48] -m == --mirror [09:48] db48x2: this sounds like "wget replacement project" to me [09:48] oh, right [09:48] wget is great but it's not the ultimate spider. [09:48] We've moved in the last few months from panic downloads to proactives. [09:48] I like 'em short parameters [09:49] Proactives, I am fine with 5 400mb .tar.gz files, representing different approaches. [09:49] I really do hope AutoCAD will take great care of Instructables.. but.. Trust No One. [09:49] I just don't want to lose stuff that's time critical. [09:49] SketchCow: This bitch be huge though [09:49] size isn't an issue [09:49] My opinion, which I told Bre, is that AutoCAD will buy Makerbot within 4-5 years [09:49] It can complicate things :) [09:49] SketchCow: that would be very interesting [09:49] Size is an issue when we've only got two weeks to get all of it. [09:50] SketchCow: Does not sound unlikely. Since they bought Instructables for the exactly same reason they would buy Makerbot [09:50] SketchCow: my personal opinion? makerbot is in violation of its lease, which says "robots made must obey asimov's 3 laws". I've had my fingers burned by a makerbot. [09:50] lol [09:50] Was that the makerbot's fault? [09:50] yes [09:50] Yeah, seriously [09:50] it let him get injured [09:50] yes, it went down when it ought have gone up because my fingers were there! [09:51] If I'm canoeing with you, and you're a fuck and fall over and drown [09:51] Which is within Jason Scott's Three Laws of Robotics [09:51] 2. You die, I get your wallet [09:51] 1. I didn't know him, officer [09:51] wait wait wait [09:51] 3. If our size is the same, hey, you died naked for whatever reason [09:51] you're a robot? [09:51] I KNEW there was a reason I don't carry cash! And all this time, I thought it was roaming bands of thugs. 
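Circling back to the instructables run: folding the refinements discussed above into ersi's command (archive-safe -k/-K, a WARC record, a polite delay, and a browser-style user agent instead of Googlebot) might look roughly like this, assuming the WARC-patched wget build. Whether --continue combines cleanly with --warc was left an open question above, so -c is dropped here.

    wget -m -p -k -K \
         -e robots=off \
         --wait=1 --random-wait \
         --warc-file=instructables \
         --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20100101 Firefox/6.0" \
         http://www.instructables.com/index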
[09:52] So we're using the DEFCON speech to apply to TED [09:52] The question is, can they get an adequate idea I could do a TED speech when half the words are profanity [09:52] We'll see!! [09:53] Oh fuck, that'd be great [09:54] Attend TEDActive 2012 in Palm Springs [09:54] Held in Palm Springs, TEDActive is a parallel event held at the same time as TED in Long Beach, featuring the simulcast of the conference. Get the benefits of the TED Book Club, conference video archives, online social networking, and many special offers (Learn more .). [09:54] Price: $3,750 [09:54] I wish I could afford TED [09:55] I already qualify as an insider [09:55] If you get in, you have to pull the "Fuck you, you are all in ArchiveTeam" bit. [09:55] * db48x2 sighs [09:55] 3am now [09:55] But I can't pay retail for that shit [09:55] They're expensive/costly as fuck [09:55] It was so great [09:55] I paid wholesale price [09:55] Still expensive [09:55] Bajeezus, though, that's worse than SXSW... [09:55] Worth every dime. [09:55] Every. Dime. [09:55] I did get to watch TED live for free last year [09:55] Retail is $7,500 [09:56] (I also RTMPDumped the shit out of the stream) [09:56] I harassed one of the google founders (Page) for 40 seconds. [09:56] Come on, that was worth it right there [09:56] http://pastebin.com/8EDZBLE0 [09:56] hahahahaha SketchCow [09:57] I demanded he buy 4chan through a shell company [09:57] This was before canv.as of course. [09:57] Shook Bill Gates' hand, had a long talk with The Amazing Randi [09:57] Come on, so worth it [09:58] Also surprised by the people who knew me on sight [09:58] Like Wozniak [09:58] Anyway, I'm applying [09:58] With some help [09:58] If I get in, you'll see probably a 7 or 12 minute version of that speech [09:59] I've got another script that does a zfs snapshot [09:59] db48x2: you want me to run that pastebin? [09:59] runs this script and then takes a zfs snapshot to preserve it [09:59] chronomex: this is just the script that I'm working on [10:00] you have to customize it per site, of course [10:00] right. [10:00] for GoogleFriendsNewsletter: [10:00] grab -a log http://groups.google.com/group/google-friends/download?s=pages -O google-friends-pages.zip [10:00] mirror -a log "${SITE2}" [10:00] mirror -o log "${SITE}" [10:00] grab -a log http://groups.google.com/group/google-friends/download?s=files -O google-friends-files.zip [10:00] etc [10:01] so it's not really as simple as it ought to be, I guess [10:01] but I could make those command line args [10:01] http://www.guardian.co.uk/books/2011/sep/07/michael-moore-hated-man-america [10:01] makeup? makeup. [10:02] --mirror http://wherever/ --mirror http://another/ --grab http://some/file [10:02] Lol! Nice that Wozy recognized ya' :) [10:08] OK, bed [10:09] 'Night [10:09] Time's Arrow is a pretty good episode [10:09] it's got everything [10:10] Time's Arrow? [10:11] severed heads, time travel, body snatchers, robots, historical figures [10:11] ersi: Star Trek: TNG episode [10:11] oh, heh [10:11] S05E26 and S06E01 [10:12] they find Data's 500-year-old severed head in a mine under San Francisco [10:12] hijinks ensure [10:15] yeah that was kind of strange. [10:16] right, now that im off the phone i need to figure out this dir [10:27] how do i get Wget to fetch and grab the user/member subsites if they arent linked somewhere on fortunecity for Wget to follow? 
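A sketch of how db48x2's per-site script and the proposed --mirror/--grab interface might fit together. Only the grab/mirror helper names, the -a log usage, and the --mirror/--grab argument idea come from the snippets pasted above; the wget defaults, the script name, and the zfs dataset name are assumptions.

    #!/bin/bash
    # grab-site (hypothetical name): generic per-site archiving wrapper.

    grab() {
        # passes straight through to wget, e.g. grab -a log URL -O file.zip
        wget "$@"
    }

    mirror() {
        # recursive, archive-safe mirror; the caller's extra wget args
        # (such as -a log) come before the URL
        wget --mirror -p -np -k -K -e robots=off --wait=1 --random-wait "$@"
    }

    # Proposed form: ./grab-site --mirror http://wherever/ --mirror http://another/ --grab http://some/file
    while [ $# -gt 0 ]; do
        case "$1" in
            --mirror) mirror -a grab.log "$2"; shift 2 ;;
            --grab)   grab   -a grab.log "$2"; shift 2 ;;
            *)        echo "unknown argument: $1" >&2; exit 1 ;;
        esac
    done

    # Freeze the result in a zfs snapshot afterwards, as described above.
    # tank/archives is a placeholder dataset name.
    zfs snapshot "tank/archives@$(date +%Y%m%d-%H%M%S)"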
[10:27] you have to find out the usernames [10:27] feed them to wget [10:27] thats the problem, i need to fetch all the usernames, as far as ive worked out so far [10:28] yep [10:28] members.fortunecity.com contains all the member pages but none of those member pages are actually linked in the members.fortunecity.com/ directory [10:33] if you were to poke for all potential user accounts, how would you go about it? [10:34] when we scraped geocities, we did google site: searches for all the words in the dictionary and pulled out the urls [10:35] it's kind of icky but it works [10:35] hmm [10:39] i dont know how well that would work on fortunecity [10:39] that would probably hit well over half but then obtaining the rest [10:39] how many do you have now? [10:40] at the moment, ive only hit a few user accounts, and then the directory structure just started getting a bit funky [10:41] The wayback machine can also give you a list: http://wayback.archive.org/web/*/http://members.fortunecity.com/* [10:41] so, half would be an improvement. [10:41] (But that of course will only give you things that are already archived.) [10:42] alard: yes, but that still helps [10:42] they have a bit of a weird dir structure, not only do they keep the user accounts at something like, for example members.fortunecity.com/user0001/index.html [10:43] but they are also doing the dir in dir as well so like, members.fortunecity.com/millenium/baloons/1035/index.html sort of thing [11:00] ouch, that doesnt help.... [11:13] id better write all this down [12:10] http://feedproxy.google.com/~r/hackaday/LgoM/~3/rMV2Fqe2uao/ [12:10] oh fuck you google, mangling urls [12:10] http://hackaday.com/2011/09/08/recovering-data-for-a-homemade-cray/ * [12:10] Fentons cray recovery thingie majingy :) [12:20] cool :) [12:53] Afternoon [12:53] * SpaceCore reads backlog [12:54] ersi: need any help with that? [13:01] Hm? [13:01] with instructables? [13:03] yeah [13:05] I dunno, I got a process running along nicely - hopefully it's useful data :P [13:05] Ok [13:05] * SpaceCore goes back to attempting to rebuild his netbook [13:20] can we archive Michael S. Hart plox? [13:40] Sep 6, 2011 - On Sep 3rd (just before the long labor day weekend), WebCite went down due to a hardware failure. While we are restoring the database from our backups, no new snapshots can be made, and old snapshots may be temporarily unavailable. We apologize for any inconvenience caused. [13:41] http://www.webcitation.org/archive.php [15:30] Oh, web citation [15:33] Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com [15:34] you're doing it wrong [15:34] need an http:// before www.ehow.com [15:45] http://www.archive.org/details/commodore-manuals [16:04] cool, im sure youve already got any commodore manual I do, but i'll check [16:07] Well, if I DON'T, then yes, it would be good of the world to put that together. [16:32] SketchCow: what is the status of jamendo downloading? [16:34] Stops and starts. [16:34] It times out and dies constantly. [16:35] but you have to restart or it auto resumes? [16:35] I have to restart it, and I resume it by knowing when it last died. [16:35] ok [16:52] DFJustin: yeah yeah, i wrote that manually here [17:16] just download floss weely 114 [17:17] slowly getting old twit.tv show [17:20] I heard a rumor that archiveteam is doing something with the Yahoo Video archive soon? Is that true? 
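Back to the fortunecity problem from earlier: once a member list exists (pulled by hand from the Wayback Machine listing alard linked, or from dictionary-word site: searches as was done for Geocities), feeding it to wget is the easy part. members.txt below is an assumed one-username-per-line file.

    # members.txt: one fortunecity username per line, gathered separately
    while read -r user; do
        wget --mirror -p -np -k -K \
             -e robots=off --wait=1 --random-wait \
             --warc-file="fortunecity-$user" \
             "http://members.fortunecity.com/$user/"
    done < members.txt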
[17:21] I'm uploading it [17:24] I have a 385GB slice of it, users# 1,300,000 - 1,400,000 [17:24] do you have those already? [17:24] I want it. [17:25] I have to head out, but I am for it. [17:25] OK, it will be about 6 hours before I can get to them, but I'll put them wherever you want. [17:26] Ok, mail jason@textfiles.com, I'll set up an rsync slot [17:26] ok cool, thanks [17:29] i hope this comes out in 5 years: http://en.wikipedia.org/wiki/Stacked_Volumetric_Optical_Disk [17:30] one layer equals about 2.4TB [17:30] and it can have 100x or more layers [17:32] more likely a better optical disc then hvd or 5D dvd since the most these will save is 6tb to 10tb max [17:34] I hope its not a disk-shape, I'm sick of discs [17:34] how about a cube? or a nice hexagonal crystal [17:34] only like for archive reasons [17:34] pyramid power [17:34] no write to the device [17:35] just the speed that of the laser for SVOD will have very fast [17:36] I think you can write with a holographic laser [17:36] other wise it could take months just to burn it [17:38] Kenwood had a 7-laser parallel CDROM reader back in 2001, http://hothardware.com/Reviews/Kenwoods-72X-True-X-CDROM-Drive/ [17:38] I think we can do better :) [18:15] emijrp: does dumpgenerator.py do the actual downloading or does it generate a urllist? [18:16] it downloads the text and images [18:16] nice [18:16] any idea what might be wrong if a wikia wiki is not in http://wiki-stats.wikia.com/ ? [18:16] i want to perserve quake.wikia.com [18:17] wikia dumps are generated on demand [18:18] you have to request it, but im not sure where [18:18] ah ok [18:19] the doom wiki was just forked to doomwiki.org [18:19] :) [18:19] although you can try with dumpgenerator [18:19] using http://quake.wikia.com/api.php [18:19] i mean, it is better if wikia gives you the dump, but if you dont want to ask or wait, just use wikiteam tools [18:20] yeah [18:20] i shall try it :) [18:20] thanks! [18:20] we should totally sync our jamendo archives some day btw [18:20] im downlloading incrementally [18:21] me too [18:21] SketchCow too on IA [18:21] jamendo? [18:21] yes [18:21] oh wow [18:21] mp3 and ogg [18:21] i am still waiting for them to show why removed albums were removed otherwise i would have started doing that [18:21] i was in contact with IA about it once [18:22] jamendo offers to sync albums to servers as community hosted mirrors but they require you to run some python stuff iirc [18:22] there is one such server but the guys are hard to contact [18:40] I forgot that I had this: http://badcheese.com/~steve/crawl/ [18:43] I am an archivist, and what is this? [18:45] swebb1: nifty [18:45] wget http://badcheese.com/~steve/crawl/crawling.flv [18:46] http://www.onlineuniversity.net/1996-vs-2011/ [18:47] crap sory [18:47] infographic spam [18:47] go http://images.onlineuniversity.net.s3.amazonaws.com/96vs11.jpg instead [18:47] 1 petabyte = 74 terabytes? [18:48] yes, if you are a macfag and buy 1 petabyte you only get 74tb :) [18:49] DRM included? [18:51] 1050tb worth of it [18:59] I want a petabyte of storage in my apartment [19:09] just need to get 250 4tb hard drivers thoughs come out [19:09] cause 333 3tb hard drives is a very old number [19:10] also saves space cause there will be fewer drives [19:11] start saving [19:13] stop shaving [19:13] by the time you buy you there will be 8tb or 16tb drives [19:14] bbl [19:21] I have a petabyte in my PC. [19:21] I signed up on Internet Archive. I can upload whatever I want. [19:22] Cloud storage for free. HELL YEAH. 
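For the wikiteam route emijrp describes, the quake.wikia.com grab boils down to roughly the line below, using the same --api/--xml/--images flags that show up when the run is resumed later in the log. It assumes a checkout of the wikiteam tools from the Google Code project mentioned above.

    # run from a checkout of the wikiteam tools
    python dumpgenerator.py --api=http://quake.wikia.com/api.php --xml --images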
[19:22] Buy a good internet connection 100mbit and you have almost the same bus speed that local drives. [19:24] You can do it too. But, you have to credit me. It was my idea. [19:24] Thanks. [19:25] he he he [19:25] IA = cloud [19:25] i'm sure glad my hard drives are faster than 100mbit.... [19:28] Uh, I get around 300-550mbit to my drives [19:28] even more in my workstation [19:28] that's the idea :) [19:30] maybe it's ye olde ultra-dma drives he's talking about? [19:31] my god, I'm looking at that infographic he posted and godaddy is still just as cluttered and confusing was it was in 1996 [19:32] are you surprised? [19:32] Not really >_> [19:32] Alittle, I guess [19:32] but I shouldn't be [19:32] Then again, I've only had internet since 2005, so I wouldn't know what it looked like then [19:32] It's more suprising how those tards are still in business [19:48] emijrp: error on image retrieval or normal output http://pastebin.com/k8UxDwgD ? [19:49] looks like it fails with images at wikia [19:49] file a bug http://code.google.com/p/wikiteam/issues/list [19:50] it requires me to use a google account [19:55] yes, blame to spammers [19:56] im fixing the bug, wait [19:57] awesome! [20:00] done [20:00] do svn up [20:01] and resume [20:01] python dumpgenerator.py --api=... --xml --images --resume --path=pathtodirectory [20:02] remove quakewikiacom-20110908-images.txt before to be sure [20:02] no such file [20:02] ok [20:02] works [20:02] nice [20:02] thanks! [20:03] : ) [20:18] You know what site I'd actually like to have archived? project gutenberg. [20:18] I should figure that out. [20:21] Aranje: download the dvd image [20:21] iirc that is simple and nice :) [20:21] oh is there one? [20:21] sweet! [20:22] it was his goal to spread it easily [20:22] awesome :D [20:22] It's totally something I should have a copy of [20:22] you might also be interested in http://gen.lib.rus.ec/ [20:23] quite illegal though [20:25] the lines of legal and illegal blur often for me :) [20:26] and of course there is a torrent [20:26] lmao [20:38] the dvd image isn't a complete set [20:38] but the gutenberg etexts are already mirrored on dozens of mirrors and on archive.org [20:39] oh, cool [20:39] http://www.archive.org/details/gutenberg [20:40] http://www.gutenberg.org/catalog/world/mirror-redirect [20:41] neat :D
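On the Project Gutenberg idea: besides the DVD image and the archive.org collection linked above, a full local copy can be pulled from one of the etext mirrors, many of which can be reached over rsync. The host and module below are placeholders to be replaced with a real entry from the mirror list.

    # placeholder mirror host and module -- substitute a real one from the PG mirror list
    rsync -av --delete rsync.example.org::gutenberg /data/gutenberg-mirror/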