#archiveteam 2011-09-08,Thu

Time Nickname Message
00:03 🔗 db48x put a zfs filesystem image on the cd
00:03 🔗 db48x problem solved
00:08 🔗 alard SketchCow: desktop.google.com has arrived on batcave.
00:09 🔗 SketchCow Thanks, alard.
00:13 🔗 balrog anyone here messing with zfs for mac?
00:13 🔗 balrog (the tenscomplement port)
00:16 🔗 SketchCow I so don't trust zfs
00:18 🔗 db48x SketchCow: oh?
00:18 🔗 db48x it just puts your data into a merkle tree, which is super awesome
01:19 🔗 DFJustin I'm a huge fan of the archive.org online reader, I wish there was a desktop version
01:19 🔗 chronomex it relies on browser image scaling, which varies a lot and can be lame
01:20 🔗 DFJustin true, that only seems to be an issue for 1-bit stuff though
01:34 🔗 db48x merkle trees are like pixie dust; you basically can't go wrong
01:47 🔗 bsmith093 is the steve meretsky archive up yet?
02:05 🔗 closure "We believe that all of the Early Journal Content is out of copyright." -- JSTOR "Additional uses are allowed, including the ability to download, share, and reuse the content for any non-commercial purpose." -- JSTOR .. um, if it's out of copyright, who the fuck do they think they are slapping these restrictions on it?
02:06 🔗 * chronomex shrugs
02:06 🔗 chronomex well, they're not 100% sure it's 100% out of copyright
02:31 🔗 godane i have a feeling backing up something like reddit will be a problem
02:32 🔗 godane only cause images are linked to other sites
02:33 🔗 godane so to archive reddit we would also have to archive the external links too
04:31 🔗 db48x2 comcast--
05:15 🔗 Wyatt Oh dear, what did they do this time?
05:17 🔗 db48x2 left me offline for 12 hours, then couldn't explain why it just started working again while I was talking to support
05:18 🔗 Wyatt Sounds like what we've come to expect from them.
06:59 🔗 ersi 5.0G www.instructables.com/
06:59 🔗 ersi Growin' and growin'
07:00 🔗 Wyatt ersi: Does it work to just wget that?
07:05 🔗 ersi Yeah
07:06 🔗 ersi Or well, it *seems* to work. I'm going to check through what I get though
07:08 🔗 ersi This is one massive site though, with mostly internal links
07:11 🔗 Wyatt Hmm, think it would work for ehow?
07:11 🔗 Wyatt Or is ehow already crawled by ia_archiver?
07:14 🔗 ersi Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com
07:14 🔗 ersi neither was instructables btw ;)
07:18 🔗 Wyatt Ominous.
07:36 🔗 SketchCow Hey hey.
07:37 🔗 SketchCow I finally game to negotiations with the developer set who found I was choking archive.org
07:37 🔗 SketchCow So yay?
07:38 🔗 db48x2 developer set?
07:39 🔗 SketchCow Set of developers who were finding I was choking things.
07:39 🔗 SketchCow To be honest, OCR is a bottleneck I don't like existing.
07:39 🔗 SketchCow Add more OCRs
07:39 🔗 SketchCow Everything else is going fine.
07:40 🔗 SketchCow I'm getting into a useless twitter fight with some fathead
07:40 🔗 db48x2 heh
07:40 🔗 SketchCow I finally got the digitizer rig going
07:41 🔗 SketchCow GDC tapes. I need to be digitizing at the rate of 15-20 a day.
07:41 🔗 SketchCow One ends.... next one.
07:41 🔗 SketchCow Just keep going
07:41 🔗 SketchCow In middle of month, they send me money to buy a second one
07:41 🔗 SketchCow It'll render.
07:41 🔗 SketchCow And we'll kill these fuckers
07:41 🔗 db48x2 sweet
07:43 🔗 ersi buy a second what?
07:43 🔗 ersi oh, digitizer rig
07:45 🔗 db48x2 SketchCow: so the second question is "game"?
07:48 🔗 SketchCow ?
07:48 🔗 db48x2 "<SketchCow> I finally game to negotiations..."
07:49 🔗 db48x2 anyway
07:50 🔗 SketchCow SAFE. So safe you wouldn't believe it.
07:50 🔗 SketchCow root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe
07:50 🔗 ersi replace game with came, and it'll make more sense
07:50 🔗 SketchCow Yes, I wrote a script that asks if the queue can handle me.
07:51 🔗 db48x2 rsync to batcave finally started up again
07:51 🔗 db48x2 SketchCow: lol
07:51 🔗 db48x2 ersi: oh, I suppose if negotiations is an event
07:51 🔗 db48x2 but then I would have expected "went"
07:51 🔗 db48x2 anyway
07:52 🔗 SketchCow http://www.archive.org/details/fox40newsaug222011
07:52 🔗 SketchCow Entertainment for you
07:52 🔗 db48x2 ooh
07:53 🔗 ersi Hm, wonder if I should have thrown on more parameters to wget before starting this :|
07:53 🔗 db48x2 ersi: -D
07:54 🔗 SketchCow * Closing connection #0
07:54 🔗 SketchCow <
07:54 🔗 SketchCow < Connection: close
07:54 🔗 SketchCow < Content-Length: 0
07:54 🔗 SketchCow < Content-Type: text/plain
07:54 🔗 db48x2 ersi: --warc-file
07:54 🔗 SketchCow root@teamarchive-0:/3/TIMAGS/super99# ~jscott/isitsafe
07:54 🔗 SketchCow SAFE. So safe you wouldn't believe it.
07:54 🔗 SketchCow Tah dah, it says I didn't break it!
07:54 🔗 db48x2 heh
07:55 🔗 ersi db48x2: So the answer is 'yes, I should have'
07:55 🔗 db48x2 there's probably always another option you could throw on there
07:55 🔗 ersi like -k? for converting teh links
07:55 🔗 db48x2 yes
07:55 🔗 db48x2 and -K to save a copy of the original from before it munged the links
07:56 🔗 ersi well, dang.
07:56 🔗 db48x2 heh
07:59 🔗 SketchCow At one point in this talk, Will Wright shows a self-riding motorcycle
07:59 🔗 SketchCow It's hilarious
07:59 🔗 SketchCow Running around a park scaring people
07:59 🔗 db48x2 heh
07:59 🔗 db48x2 he seems like a pretty crazy guy
08:01 🔗 db48x2 does <META NAME='ROBOTS' CONTENT='NOARCHIVE'> work against wget even when you do -e robots=off?
08:01 🔗 SketchCow Not sure
08:02 🔗 db48x2 oh, interesting
08:02 🔗 db48x2 this time it crashed
08:03 🔗 SketchCow I don't know if I showed this script I run.
08:03 🔗 db48x2 aha
08:03 🔗 SketchCow root@teamarchive-0:/3/TIMAGS/smartprogrammer# ./ingestor SmartProgrammer_1984_02.pdf
08:03 🔗 SketchCow OK, then, SmartProgrammer_1984_02.pdf gets the love.
08:03 🔗 SketchCow Here's what I plan to do.
08:03 🔗 db48x2 I was telling it to mirror fanfiction.net, but it redirects to www.fanfiction.net
08:03 🔗 SketchCow In the collection named smart-programmer-newsletter...
08:03 🔗 SketchCow I will add an item called smart-programmer-newsletter-1984-02.
08:03 🔗 SketchCow I will say this dates to 1984-02.
08:03 🔗 SketchCow I will give it the title of The Smart Programmer Newsletter (February 1984).
08:04 🔗 SketchCow ..
08:04 🔗 SketchCow It looked at SmartProgrammer_1984_02.pdf to figure it out.
08:04 🔗 SketchCow That's test mode
08:04 🔗 SketchCow It tells me it's working.
08:04 🔗 db48x2 sweet
08:04 🔗 SketchCow There are 18 issues.
08:05 🔗 SketchCow Running.
08:05 🔗 alard db48x2: I think wget doesn't listen to robots noarchive at all. It only understands nofollow.
08:05 🔗 SketchCow It uploads each issue in roughly 8 seconds.
08:05 🔗 db48x2 alard: good to know
08:07 🔗 SketchCow Done.
08:07 🔗 SketchCow 18 issues in what, 2 minutes.
08:08 🔗 db48x2 SketchCow: what do you use for downloading them?
08:12 🔗 db48x2 doh
08:12 🔗 SketchCow UNSAFE. Current OCR count is 207.
08:12 🔗 SketchCow root@teamarchive-0:/3/TIMAGS# ~jscott/isitsafe
08:13 🔗 db48x2 1am already
08:13 🔗 SketchCow Oh no!
08:13 🔗 db48x2 time to put more machines on the task of misreading the text in magazines
08:20 🔗 SketchCow Yeah!
08:25 🔗 * db48x2 is watching Time's Arrow
08:37 🔗 kin37ik hullo
08:38 🔗 ersi Hi
08:38 🔗 SketchCow So, I want to throw Atari Force up there.
08:38 🔗 SketchCow But Atari Force is a DC comic book
08:38 🔗 SketchCow A super defunct one, but still
08:39 🔗 SketchCow So as awesome as it is, I don't think it'll count right now.
08:41 🔗 SketchCow But this?
08:41 🔗 SketchCow http://www.bombjack.org/commodore/commodore/
08:41 🔗 SketchCow As soon as it finishes downloading, it goes up.
08:41 🔗 SketchCow Fwip
08:42 🔗 kin37ik woah
08:54 🔗 josephwdy Michael S. Hart is dead ....
08:55 🔗 Wyatt So how good is httrack for mirroring things really?
08:56 🔗 josephwdy it's kinda shitty
08:56 🔗 josephwdy good for small projects
08:56 🔗 kin37ik crap, hit a snag with fortunecity.com
08:57 🔗 Wyatt Really? Damn.
08:58 🔗 Wyatt Funny, I had completely forgotten about fortunecity, too.
08:58 🔗 josephwdy Nothing really good on windows for ripping a site, but if you're on linux, wget or curl is really good.
08:59 🔗 kin37ik ive been doing some poking around in it, and found their directory structure to be.....not quite as i expected on fortune city
09:00 🔗 Wyatt josephwdy: Yeah, they're utilities useful in proportion to the length of their man pages.
09:00 🔗 Wyatt But their man pages are...short story-length.
09:03 🔗 Wyatt What options are good? looks like wget -mkKe robots=off --warc-file from just the past few bits of history
09:03 🔗 db48x2 -E
09:03 🔗 db48x2 --mirror
09:04 🔗 db48x2 --wait
09:04 🔗 db48x2 --random-wait
09:04 🔗 db48x2 -p --protocol-directories -np --follow-ftp --progress=dot:decimal --warc-file --warc-cdx --warc-header --user-agent
09:05 🔗 db48x2 the --warc options require a special build of wget which you'll find on the wiki
09:05 🔗 kin37ik crap, now im stuck....
09:05 🔗 db48x2 they cause it to record an archive that contains not just the files retrieved, but the http request and response headers that lead to the files themselves
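The options listed above could be assembled roughly as follows. This is a minimal sketch, not the channel's actual script: the function name, its parameters, and the choice of which flags to include are assumptions made for illustration, and the `--warc-*` options only work with a wget build that has WARC support, as db48x2 notes.

```python
import shlex

def build_wget_command(url, warc_name, user_agent, wait_seconds=1):
    """Assemble a wget argument list from the flags mentioned in the channel.

    The function and its parameters are hypothetical; only the flags come
    from the discussion.
    """
    return [
        "wget",
        "--mirror",                      # same as -m: recursive mirror
        "-p",                            # also fetch page requisites (images, css)
        "-np",                           # never ascend to the parent directory
        "-E",                            # append .html to pages that lack it
        "-k", "-K",                      # convert links, keep a backup of the original
        "-e", "robots=off",              # ignore robots.txt
        f"--wait={wait_seconds}",        # pause between requests
        "--random-wait",                 # vary that pause
        f"--user-agent={user_agent}",
        f"--warc-file={warc_name}",      # needs a WARC-enabled wget build
        "--warc-cdx",
        url,
    ]

if __name__ == "__main__":
    cmd = build_wget_command("http://www.instructables.com/", "instructables", "Mozilla/5.0")
    print(shlex.join(cmd))  # a copy-pasteable command line; nothing is executed
```

Building a list rather than a single string keeps the user agent intact as one argument when it contains spaces.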
09:06 🔗 SketchCow OK, here we go.
09:06 🔗 SketchCow Michael S. Hart is dead and we will miss him.
09:06 🔗 SketchCow Only got to meet him once.
09:07 🔗 chronomex kin37ik: recording headers as db48x2 recommends is the ideal; for some time we mirrored without doing it but now we do when possible
09:07 🔗 kin37ik chronomex: dont you mean Wyatt, and not me? lol
09:07 🔗 chronomex um, right.
09:07 🔗 chronomex I'm not sober.
09:07 🔗 kin37ik lol
09:08 🔗 josephwdy SketchCow: that's pretty awesome :D do tell more.
09:08 🔗 SketchCow DRUNKIVING
09:08 🔗 chronomex DRUNKIRCING
09:08 🔗 SketchCow DON'T DRINK AND DERIVE
09:08 🔗 chronomex I actually don't know how to drive.
09:08 🔗 Wyatt Drunk Relay Chat
09:08 🔗 kin37ik lol
09:09 🔗 SketchCow There it goes!
09:09 🔗 SketchCow Adding 156 books
09:09 🔗 josephwdy Wyatt: the at wiki has a good starting point http://archiveteam.org/index.php?title=Wget
09:09 🔗 Wyatt Yeah, thanks. I was just looking over that.
09:09 🔗 chronomex SketchCow: you still need to hook me up with your adder thing.
09:09 🔗 SketchCow http://www.archive.org/details/commodore-manuals
09:09 🔗 SketchCow chronomex: Yes
09:10 🔗 Wyatt Sometimes I forget that there _are_ good resources for this stuff.
09:11 🔗 ersi Ideally, wouldn't one want; A) 'just a plain wget' mirroring of the site, no modification B) modified links wget mirroring C) a WARC kind of wget mirroring=
09:11 🔗 ersi s//=//?/
09:11 🔗 db48x2 ersi: yes
09:11 🔗 SketchCow Ideally, you want both
09:11 🔗 SketchCow But sometimes, no choice
09:11 🔗 db48x2 -k and -K get you a modified and unmodified mirror
09:11 🔗 chronomex both three?
09:11 🔗 ersi Ah, true.
09:11 🔗 db48x2 and the --warc gets you the archive
09:11 🔗 SketchCow Shut up, drunky
09:12 🔗 ersi does warc make 'archives'?
09:12 🔗 db48x2 yea, after a fashion
09:12 🔗 chronomex SketchCow: I'M drunk?!?
09:12 🔗 ersi db48x2: Hm?
09:12 🔗 db48x2 it's not a tarball
09:12 🔗 ersi I mean, like heritrex (or whatever it's called)
09:12 🔗 Wyatt You said there's a patch for warc on the wiki?
09:12 🔗 db48x2 yea, very similar to what heritrix does
09:12 🔗 ersi similar being compatible?
09:13 🔗 db48x2 when they wrote heritrix they invented the arc format
09:13 🔗 db48x2 it's been updated
09:13 🔗 db48x2 I don't know the exact timeline
09:13 🔗 ersi So it makes 'old version WARC archives'?
09:13 🔗 SketchCow http://www.youtube.com/watch?v=xDjOr68VxKw
09:13 🔗 SketchCow Go watch that
09:14 🔗 db48x2 http://archiveteam.org/index.php?title=Wget_with_WARC_output
09:14 🔗 kin37ik hmm, heres a problem, if i poke members.fortunecity.com, ill get all the dir files on that domain but wont get any of the members subsites as they arent linked, how could i get around that to poke the member accounts??
09:14 🔗 chronomex "sites [...] run by habitual whiners, will complain when a site scraping uses 200 megabytes of transfer when it could have used 100." -- sites run by whiners bitch at EVERYTHING
09:14 🔗 Wyatt Truth^
09:14 🔗 db48x2 once you create your warc file, you should append a record that contains the script you ran to grab the site, if it's more than a single invocation of wget
09:15 🔗 ersi Let me add; derp
09:16 🔗 ersi oh, alard wrote the --warc wget support
09:16 🔗 db48x2 yea
09:16 🔗 chronomex you will note, alard has an @ by his name
09:16 🔗 ersi Oh, the headers is probably used for Wayback Machine to place it in the timeline
09:17 🔗 ersi Historically, I had a @ by my name as well.
09:17 🔗 chronomex I think it's just for the masturbatory completeness factor
09:17 🔗 ersi </careface> :P
09:17 🔗 chronomex fine.
09:17 🔗 chronomex usually the @s occupy 1 of 6 nickname columns on my screen; we're running low.
09:17 🔗 chronomex ish.
09:19 🔗 ersi Man, I'd like to just ./mirror-archive-the-fuck-out-of-url <url>
09:20 🔗 SketchCow I think it's obvious we're going to have to write a script set that does this.
09:20 🔗 db48x2 I've been working on one
09:20 🔗 ersi also, darn these dynamic pages that generate these weird files
09:20 🔗 chronomex ersi: weird how?
09:20 🔗 ersi trololo?COMMENTS=UPSIDEDOWN?&SORT=INMYPANTS
09:21 🔗 chronomex what's the problem with that?
09:21 🔗 chronomex that's the bit after the last / in the url
09:21 🔗 chronomex is filename.
09:21 🔗 ersi None really, besides that it bothers me and feels naughty
09:21 🔗 chronomex unix is okay with it, right?
09:21 🔗 ersi Right.
09:21 🔗 db48x2 you can use -E
09:21 🔗 chronomex if it's okay with unix, it's okay with chronomex
09:21 🔗 db48x2 it'll slap a .html on the end of all that
09:22 🔗 ersi yeah, but I didn't do that :)
09:22 🔗 ersi I'm unsure if I should CONTINUE RAPING or STOP and modify my parameters
09:23 🔗 db48x2 indeed
09:23 🔗 db48x2 a dilemma for the ages
09:24 🔗 ersi If I let it run, i'll get a feel for if they use other domains for CDN or trickery and possibly total size of site
09:25 🔗 chronomex this is instructables, right?
09:26 🔗 ersi Yes. It's probably effing huge
09:26 🔗 ersi It's up at 6GB currently
09:27 🔗 chronomex ahyeah.
09:28 🔗 chronomex god I hate it when people who are insane but kind of interesting email me
09:29 🔗 Wyatt Seems like a generalised distributed parallel archival-quality...I hesitate to say "bandwidth fucker" because it's awfully uncouth.
09:29 🔗 Wyatt But yes, challenging, but boy would it be useful.
09:33 🔗 * kin37ik is getting frustrated
09:34 🔗 ersi chronomex: Do you get lots of insane interesting people mailing you? :)
09:35 🔗 chronomex no, for the most part, it's confused transsexual folk who think I care.
09:35 🔗 chronomex responding to this one with "This sounds like something one would ask a lover. Before you proceed any further, ask yourself the following question: Is chronomex my lover?"
09:35 🔗 kin37ik lol
09:37 🔗 ersi Wyatt: My continuation of questionable quality archival effort?
09:37 🔗 chronomex ersi: what would you change?
09:38 🔗 SketchCow OK, who wants a short project
09:38 🔗 Wyatt ersi: In a sense? I'm saying it would be nice to spread the love around
09:38 🔗 SketchCow http://census.ire.org/
09:38 🔗 ersi Well, I'd throw on -kK and perhaps some more
09:39 🔗 SketchCow Turn that into an "item", a collection that makes sense.
09:39 🔗 SketchCow Module threw exception:
09:39 🔗 SketchCow item must be OCR'd via auto_submit
09:40 🔗 SketchCow That's interesting.
09:40 🔗 ersi Wouldn't the "raw data datasets" from the bottom of http://census.ire.org/data/bulkdata.html be good candidates?
09:40 🔗 chronomex SketchCow: how is this better than the data on census.gov?
09:40 🔗 SketchCow I am not clear at all it is.
09:40 🔗 chronomex I'm not seeing any real value add, except a shinier interface
09:40 🔗 SketchCow If that's the case, I trust that opinion.
09:41 🔗 chronomex I've spent a good deal of time working with census data; I practically majored in that shit.
09:41 🔗 ersi :o
09:41 🔗 chronomex geography is a lot to do with demography
09:42 🔗 chronomex https://github.com/ireapps/census yeah, it's a fancy interface to census data
09:43 🔗 chronomex ersi: -k is not archive-safe, unless combined with -K.
09:43 🔗 ersi That's why I'd do -kK
09:43 🔗 chronomex -K means some extra work to get an archive-safe version
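The "extra work" after a `-k -K` run might look something like the sketch below. wget's `-K` keeps the pre-conversion copy of each rewritten file with a `.orig` suffix; the sketch assumes that backup sits next to the converted file under the same name plus `.orig`, and copies the mirror to a second directory where the backups replace the converted files. The function name and directory layout are hypothetical, and a site file that is genuinely named `something.orig` would be mishandled.

```python
import shutil
from pathlib import Path

def restore_originals(mirror_dir, out_dir):
    """Copy a -k -K mirror into out_dir, preferring each .orig backup.

    Returns the relative paths that were restored from a backup.
    out_dir should not live inside mirror_dir.
    """
    mirror_dir, out_dir = Path(mirror_dir), Path(out_dir)
    restored = []
    for src in sorted(mirror_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(mirror_dir)
        if src.name.endswith(".orig"):
            # The backup holds the bytes as served: store it under the original name.
            rel = rel.with_name(src.name[: -len(".orig")])
            restored.append(rel.as_posix())
        elif src.with_name(src.name + ".orig").exists():
            # A link-converted file that has a backup: skip it, the backup wins.
            continue
        dest = out_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
    return restored
```

Writing to a separate directory leaves the browsable, link-converted mirror untouched, so both versions survive.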
09:44 🔗 chronomex what other flags were you thinking of?
09:45 🔗 ersi well, I'd consider building alard's patched wget version and do WARC perhaps
09:45 🔗 chronomex warc is good
09:45 🔗 chronomex can you combine --continue with --warc ?
09:45 🔗 SketchCow http://www.archive.org/details/commodore-manuals
09:45 🔗 SketchCow aww yeah!
09:46 🔗 ersi maybe add some domains to -D
09:46 🔗 chronomex SketchCow: color monitor service manual?!? fuck yeah!
09:46 🔗 ersi Hm, maybe
09:46 🔗 SketchCow See, these are all useful things
09:46 🔗 ersi But I'd rather do a full blown new run with --warc
09:46 🔗 SketchCow That have been around a long time
09:47 🔗 SketchCow But they're going to be consolidated now.
09:47 🔗 chronomex ersi: right, just wondering. remember, alcohol.
09:47 🔗 ersi Also, change the useragent to Firefox or something instead of Googlebot
09:47 🔗 ersi maybe I'm getting 'GBot customised' versions of pages :/
09:47 🔗 db48x2 yea, that helps a lot
09:47 🔗 chronomex ersi: or "ARCHIVETEAM FUCKYOUBOT"
09:48 🔗 SketchCow ArchiveTeam 1.0/Bitch I'm a Bus
09:48 🔗 kin37ik SketchCow: might just grab a copy of all of those and store them away somewhere
09:48 🔗 chronomex "ARCHIVETEAM FUCKYOUBOT 3.6"
09:48 🔗 ersi currently running with; wget -m -c -p -e robots=off http://www.instructables.com/index --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)"
09:48 🔗 db48x2 to be honest we ought to archive with lots of different user agents, to make sure
09:48 🔗 db48x2 ersi: --mirror
09:48 🔗 ersi -m == --mirror
09:48 🔗 chronomex db48x2: this sounds like "wget replacement project" to me
09:48 🔗 db48x2 oh, right
09:48 🔗 chronomex wget is great but it's not the ultimate spider.
09:48 🔗 SketchCow We've moved in the last few months from panic downloads to proactives.
09:48 🔗 ersi I like 'em short parameters
09:49 🔗 SketchCow Proactives, I am fine with 5 400mb .tar.gz files, representing different approaches.
09:49 🔗 ersi I really do hope AutoCAD will take great care of Instructables.. but.. Trust No One.
09:49 🔗 SketchCow I just don't want to lose stuff that's time critical.
09:49 🔗 ersi SketchCow: This bitch be huge though
09:49 🔗 db48x2 size isn't an issue
09:49 🔗 SketchCow My opinion, which I told Bre, is that AutoCAD will buy Makerbot within 4-5 years
09:49 🔗 ersi It can complicate things :)
09:49 🔗 chronomex SketchCow: that would be very interesting
09:49 🔗 Wyatt Size is an issue when we've only got two weeks to get all of it.
09:50 🔗 ersi SketchCow: Does not sound unlikely. Since they bought Instructables for the exactly same reason they would buy Makerbot
09:50 🔗 chronomex SketchCow: my personal opinion? makerbot is in violation of its lease, which says "robots made must obey asimov's 3 laws". I've had my fingers burned by a makerbot.
09:50 🔗 ersi lol
09:50 🔗 Wyatt Was that the makerbot's fault?
09:50 🔗 db48x2 yes
09:50 🔗 SketchCow Yeah, seriously
09:50 🔗 db48x2 it let him get injured
09:50 🔗 chronomex yes, it went down when it ought have gone up because my fingers were there!
09:51 🔗 SketchCow If I'm canoeing with you, and you're a fuck and fall over and drown
09:51 🔗 SketchCow Which is within Jason Scott's Three Laws of Robotics
09:51 🔗 SketchCow 2. You die, I get your wallet
09:51 🔗 SketchCow 1. I didn't know him, officer
09:51 🔗 chronomex wait wait wait
09:51 🔗 SketchCow 3. If our size is the same, hey, you died naked for whatever reason
09:51 🔗 chronomex you're a robot?
09:51 🔗 Wyatt I KNEW there was a reason I don't carry cash! And all this time, I thought it was roaming bands of thugs.
09:52 🔗 SketchCow So we're using the DEFCON speech to apply to TED
09:52 🔗 SketchCow The question is, can they get an adequate idea I could do a TED speech when half the words are profanity
09:52 🔗 SketchCow We'll see!!
09:53 🔗 ersi Oh fuck, that'd be great
09:54 🔗 SketchCow Attend TEDActive 2012 in Palm Springs
09:54 🔗 SketchCow Held in Palm Springs, TEDActive is a parallel event held at the same time as TED in Long Beach, featuring the simulcast of the conference. Get the benefits of the TED Book Club, conference video archives, online social networking, and many special offers (Learn more .).
09:54 🔗 SketchCow Price: $3,750
09:54 🔗 SketchCow I wish I could afford TED
09:55 🔗 SketchCow I already qualify as an insider
09:55 🔗 Wyatt If you get in, you have to pull the "Fuck you, you are all in ArchiveTeam" bit.
09:55 🔗 * db48x2 sighs
09:55 🔗 db48x2 3am now
09:55 🔗 SketchCow But I can't pay retail for that shit
09:55 🔗 ersi They're expensive/costly as fuck
09:55 🔗 SketchCow It was so great
09:55 🔗 SketchCow I paid wholesale price
09:55 🔗 SketchCow Still expensive
09:55 🔗 Wyatt Bajeezus, though, that's worse than SXSW...
09:55 🔗 SketchCow Worth every dime.
09:55 🔗 SketchCow Every. Dime.
09:55 🔗 ersi I did get to watch TED live for free last year
09:55 🔗 SketchCow Retail is $7,500
09:56 🔗 ersi (I also RTMPDumped the shit out of the stream)
09:56 🔗 SketchCow I harassed one of the google founders (Page) for 40 seconds.
09:56 🔗 SketchCow Come on, that was worth it right there
09:56 🔗 db48x2 http://pastebin.com/8EDZBLE0
09:56 🔗 chronomex hahahahaha SketchCow
09:57 🔗 SketchCow I demanded he buy 4chan through a shell company
09:57 🔗 SketchCow This was before canv.as of course.
09:57 🔗 SketchCow Shook Bill Gates' hand, had a long talk with The Amazing Randi
09:57 🔗 SketchCow Come on, so worth it
09:58 🔗 SketchCow Also surprised by the people who knew me on sight
09:58 🔗 SketchCow Like Wozniak
09:58 🔗 SketchCow Anyway, I'm applying
09:58 🔗 SketchCow With some help
09:58 🔗 SketchCow If I get in, you'll see probably a 7 or 12 minute version of that speech
09:59 🔗 db48x2 I've got another script that does a zfs snapshot
09:59 🔗 chronomex db48x2: you want me to run that pastebin?
09:59 🔗 db48x2 runs this script and then takes a zfs snapshot to preserve it
09:59 🔗 db48x2 chronomex: this is just the script that I'm working on
10:00 🔗 db48x2 you have to customize it per site, of course
10:00 🔗 chronomex right.
10:00 🔗 db48x2 for GoogleFriendsNewsletter:
10:00 🔗 db48x2 grab -a log http://groups.google.com/group/google-friends/download?s=pages -O google-friends-pages.zip
10:00 🔗 db48x2 mirror -a log "${SITE2}"
10:00 🔗 db48x2 mirror -o log "${SITE}"
10:00 🔗 db48x2 grab -a log http://groups.google.com/group/google-friends/download?s=files -O google-friends-files.zip
10:00 🔗 db48x2 etc
10:01 🔗 db48x2 so it's not really as simple as it ought to be, I guess
10:01 🔗 db48x2 but I could make those command line args
10:01 🔗 SketchCow http://www.guardian.co.uk/books/2011/sep/07/michael-moore-hated-man-america
10:01 🔗 SketchCow makeup? makeup.
10:02 🔗 db48x2 --mirror http://wherever/ --mirror http://another/ --grab http://some/file
10:02 🔗 ersi Lol! Nice that Wozy recognized ya' :)
10:08 🔗 SketchCow OK, bed
10:09 🔗 Wyatt 'Night
10:09 🔗 db48x2 Time's Arrow is a pretty good episode
10:09 🔗 db48x2 it's got everything
10:10 🔗 ersi Time's Arrow?
10:11 🔗 db48x2 severed heads, time travel, body snatchers, robots, historical figures
10:11 🔗 db48x2 ersi: Star Trek: TNG episode
10:11 🔗 ersi oh, heh
10:11 🔗 db48x2 S05E26 and S06E01
10:12 🔗 db48x2 they find Data's 500-year-old severed head in a mine under San Francisco
10:12 🔗 db48x2 hijinks ensue
10:15 🔗 chronomex yeah that was kind of strange.
10:16 🔗 kin37ik right, now that im off the phone i need to figure out this dir
10:27 🔗 kin37ik how do i get Wget to fetch and grab the user/member subsites if they arent linked somewhere on fortunecity for Wget to follow?
10:27 🔗 db48x2 you have to find out the usernames
10:27 🔗 db48x2 feed them to wget
10:27 🔗 kin37ik thats the problem, i need to fetch all the usernames, as far as ive worked out so far
10:28 🔗 db48x2 yep
10:28 🔗 kin37ik members.fortunecity.com contains all the member pages but none of those member pages are actually linked in the members.fortunecity.com/ directory
10:33 🔗 kin37ik if you were to poke for all potential user accounts, how would you go about it?
10:34 🔗 chronomex when we scraped geocities, we did google site: searches for all the words in the dictionary and pulled out the urls
10:35 🔗 chronomex it's kind of icky but it works
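The dictionary-search trick chronomex describes could be sketched as below. The host name default, the function names, and the sample words are assumptions for illustration; the sketch only builds the `site:` queries and pulls member URLs out of text you have already saved, and it does not contact any search engine.

```python
import re

# Matches member-page URLs inside any blob of text, such as saved result pages.
MEMBER_URL = re.compile(r"https?://members\.fortunecity\.com/[^\s\"'<>]+")

def site_queries(words, host="members.fortunecity.com"):
    """Yield one 'site:' search query per non-empty dictionary word."""
    for word in words:
        word = word.strip().lower()
        if word:
            yield f"site:{host} {word}"

def extract_member_urls(text):
    """Return the unique member URLs found in text, sorted."""
    return sorted(set(MEMBER_URL.findall(text)))

if __name__ == "__main__":
    for query in site_queries(["Apple", "", "balloon\n"]):
        print(query)
```

How the result pages get fetched is deliberately left out; that part depends on the search engine and its terms.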
10:35 🔗 kin37ik hmm
10:39 🔗 kin37ik i dont know how well that would work on fortunecity
10:39 🔗 kin37ik that would probably hit well over half but then obtaining the rest
10:39 🔗 chronomex how many do you have now?
10:40 🔗 kin37ik at the moment, ive only hit a few user accounts, and then the directory structure just started getting a bit funky
10:41 🔗 alard The wayback machine can also give you a list: http://wayback.archive.org/web/*/http://members.fortunecity.com/*
10:41 🔗 chronomex so, half would be an improvement.
10:41 🔗 alard (But that of course will only give you things that are already archived.)
10:42 🔗 kin37ik alard: yes, but that still helps
10:42 🔗 kin37ik they have a bit of a weird dir structure, not only do they keep the user accounts at something like, for example members.fortunecity.com/user0001/index.html
10:43 🔗 kin37ik but they are also doing the dir in dir as well so like, members.fortunecity.com/millenium/baloons/1035/index.html sort of thing
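Given a pile of URLs from searches or from the Wayback listing, one way to cope with both layouts kin37ik describes is to reduce every URL to its directory and keep only the outermost directories as starting points for wget. This is a rough sketch with hypothetical function names; it treats a last path segment containing a dot as a file name, which is a guess that will misread directories with dots in them.

```python
from urllib.parse import urlsplit

def site_root(url):
    """Return the directory part of a URL, e.g. .../user0001/index.html -> .../user0001/"""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    head, _, tail = parts.path.rpartition("/")
    if "." in tail:
        path = head + "/"            # last segment looks like a file: drop it
    else:
        path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{parts.scheme}://{parts.netloc.lower()}{path}"

def member_roots(urls):
    """Return sorted, de-duplicated directories, dropping any nested inside another."""
    roots = set()
    for url in urls:
        root = site_root(url)
        if urlsplit(root).path != "/":   # the bare host would swallow everything
            roots.add(root)
    kept = []
    for root in sorted(roots):           # a prefix always sorts before its children
        if not any(root.startswith(k) for k in kept):
            kept.append(root)
    return kept
```

Each returned directory can then be handed to a recursive wget run with `-np`, so the crawl stays inside that member's pages.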
11:00 🔗 kin37ik ouch, that doesnt help....
11:13 🔗 kin37ik id better write all this down
12:10 🔗 ersi http://feedproxy.google.com/~r/hackaday/LgoM/~3/rMV2Fqe2uao/
12:10 🔗 ersi oh fuck you google, mangling urls
12:10 🔗 ersi http://hackaday.com/2011/09/08/recovering-data-for-a-homemade-cray/ *
12:10 🔗 ersi Fentons cray recovery thingie majingy :)
12:20 🔗 Soojin cool :)
12:53 🔗 SpaceCore Afternoon
12:53 🔗 * SpaceCore reads backlog
12:54 🔗 SpaceCore ersi: need any help with that?
13:01 🔗 ersi Hm?
13:01 🔗 ersi with instructables?
13:03 🔗 SpaceCore yeah
13:05 🔗 ersi I dunno, I got a process running along nicely - hopefully it's useful data :P
13:05 🔗 SpaceCore Ok
13:05 🔗 * SpaceCore goes back to attempting to rebuild his netbook
13:20 🔗 emijrp can we archive Michael S. Hart plox?
13:40 🔗 emijrp Sep 6, 2011 - On Sep 3rd (just before the long labor day weekend), WebCite went down due to a hardware failure. While we are restoring the database from our backups, no new snapshots can be made, and old snapshots may be temporarily unavailable. We apologize for any inconvenience caused.
13:41 🔗 emijrp http://www.webcitation.org/archive.php
15:30 🔗 SketchCow Oh, web citation
15:33 🔗 DFJustin <ersi> Wyatt: Doesn't seem crawled by ia_archiver at all when I visited http://liveweb.archive.org/www.ehow.com
15:34 🔗 DFJustin you're doing it wrong
15:34 🔗 DFJustin need an http:// before www.ehow.com
15:45 🔗 SketchCow http://www.archive.org/details/commodore-manuals
16:04 🔗 lowtekk cool, im sure youve already got any commodore manual I do, but i'll check
16:07 🔗 SketchCow Well, if I DON'T, then yes, it would be good of the world to put that together.
16:32 🔗 emijrp SketchCow: what is the status of jamendo downloading?
16:34 🔗 SketchCow Stops and starts.
16:34 🔗 SketchCow It times out and dies constantly.
16:35 🔗 emijrp but you have to restart or it auto resumes?
16:35 🔗 SketchCow I have to restart it, and I resume it by knowing when it last died.
16:35 🔗 emijrp ok
16:52 🔗 ersi DFJustin: yeah yeah, i wrote that manually here
17:16 🔗 godane just downloaded floss weekly 114
17:17 🔗 godane slowly getting old twit.tv show
17:20 🔗 sep332 I heard a rumor that archiveteam is doing something with the Yahoo Video archive soon? Is that true?
17:21 🔗 SketchCow I'm uploading it
17:24 🔗 sep332 I have a 385GB slice of it, users# 1,300,000 - 1,400,000
17:24 🔗 sep332 do you have those already?
17:24 🔗 SketchCow I want it.
17:25 🔗 SketchCow I have to head out, but I am for it.
17:25 🔗 sep332 OK, it will be about 6 hours before I can get to them, but I'll put them wherever you want.
17:26 🔗 SketchCow Ok, mail jason@textfiles.com, I'll set up an rsync slot
17:26 🔗 sep332 ok cool, thanks
17:29 🔗 godane i hope this comes out in 5 years: http://en.wikipedia.org/wiki/Stacked_Volumetric_Optical_Disk
17:30 🔗 godane one layer equals about 2.4TB
17:30 🔗 godane and it can have 100x or more layers
17:32 🔗 godane more likely a better optical disc than hvd or 5D dvd since the most these will save is 6tb to 10tb max
17:34 🔗 sep332 I hope its not a disk-shape, I'm sick of discs
17:34 🔗 sep332 how about a cube? or a nice hexagonal crystal
17:34 🔗 godane only like for archive reasons
17:34 🔗 closure pyramid power
17:34 🔗 godane no write to the device
17:35 🔗 godane just that the speed of the laser for SVOD will have to be very fast
17:36 🔗 sep332 I think you can write with a holographic laser
17:36 🔗 godane other wise it could take months just to burn it
17:38 🔗 sep332 Kenwood had a 7-laser parallel CDROM reader back in 2001, http://hothardware.com/Reviews/Kenwoods-72X-True-X-CDROM-Drive/
17:38 🔗 sep332 I think we can do better :)
18:15 🔗 Schbirid emijrp: does dumpgenerator.py do the actual downloading or does it generate a urllist?
18:16 🔗 emijrp it downloads the text and images
18:16 🔗 Schbirid nice
18:16 🔗 Schbirid any idea what might be wrong if a wikia wiki is not in http://wiki-stats.wikia.com/ ?
18:16 🔗 Schbirid i want to preserve quake.wikia.com
18:17 🔗 emijrp wikia dumps are generated on demand
18:18 🔗 emijrp you have to request it, but im not sure where
18:18 🔗 Schbirid ah ok
18:19 🔗 Schbirid the doom wiki was just forked to doomwiki.org
18:19 🔗 Schbirid :)
18:19 🔗 emijrp although you can try with dumpgenerator
18:19 🔗 emijrp using http://quake.wikia.com/api.php
18:19 🔗 emijrp i mean, it is better if wikia gives you the dump, but if you dont want to ask or wait, just use wikiteam tools
18:20 🔗 Schbirid yeah
18:20 🔗 Schbirid i shall try it :)
18:20 🔗 Schbirid thanks!
18:20 🔗 Schbirid we should totally sync our jamendo archives some day btw
18:20 🔗 emijrp im downloading incrementally
18:21 🔗 Schbirid me too
18:21 🔗 emijrp SketchCow too on IA
18:21 🔗 Schbirid jamendo?
18:21 🔗 emijrp yes
18:21 🔗 Schbirid oh wow
18:21 🔗 emijrp mp3 and ogg
18:21 🔗 Schbirid i am still waiting for them to show why removed albums were removed otherwise i would have started doing that
18:21 🔗 Schbirid i was in contact with IA about it once
18:22 🔗 Schbirid jamendo offers to sync albums to servers as community hosted mirrors but they require you to run some python stuff iirc
18:22 🔗 Schbirid there is one such server but the guys are hard to contact
18:40 🔗 swebb1 I forgot that I had this: http://badcheese.com/~steve/crawl/
18:43 🔗 emijrp I am an archivist, and what is this?
18:45 🔗 db48x swebb1: nifty
18:45 🔗 emijrp wget http://badcheese.com/~steve/crawl/crawling.flv
18:46 🔗 Schbirid http://www.onlineuniversity.net/1996-vs-2011/
18:47 🔗 Schbirid crap sorry
18:47 🔗 Schbirid infographic spam
18:47 🔗 Schbirid go http://images.onlineuniversity.net.s3.amazonaws.com/96vs11.jpg instead
18:47 🔗 emijrp 1 petabyte = 74 terabytes?
18:48 🔗 Schbirid yes, if you are a macfag and buy 1 petabyte you only get 74tb :)
18:49 🔗 emijrp DRM included?
18:51 🔗 Schbirid 1050tb worth of it
18:59 🔗 db48x I want a petabyte of storage in my apartment
19:09 🔗 godane just need to get 250 4tb hard drives once those come out
19:09 🔗 godane cause 333 3tb hard drives is a very old number
19:10 🔗 godane also saves space cause there will be fewer drives
19:11 🔗 lowtekk start saving
19:13 🔗 ersi stop shaving
19:13 🔗 godane by the time you buy them there will be 8tb or 16tb drives
19:14 🔗 godane bbl
19:21 🔗 emijrp I have a petabyte in my PC.
19:21 🔗 emijrp I signed up on Internet Archive. I can upload whatever I want.
19:22 🔗 emijrp Cloud storage for free. HELL YEAH.
19:22 🔗 emijrp Buy a good internet connection 100mbit and you have almost the same bus speed as local drives.
19:24 🔗 emijrp You can do it too. But, you have to credit me. It was my idea.
19:24 🔗 emijrp Thanks.
19:25 🔗 Schbirid he he he
19:25 🔗 Schbirid IA = cloud
19:25 🔗 lowtekk i'm sure glad my hard drives are faster than 100mbit....
19:28 🔗 ersi Uh, I get around 300-550mbit to my drives
19:28 🔗 ersi even more in my workstation
19:28 🔗 lowtekk that's the idea :)
19:30 🔗 lowtekk maybe it's ye olde ultra-dma drives he's talking about?
19:31 🔗 Aranje my god, I'm looking at that infographic he posted and godaddy is still just as cluttered and confusing as it was in 1996
19:32 🔗 chronomex are you surprised?
19:32 🔗 Aranje Not really >_>
19:32 🔗 Aranje A little, I guess
19:32 🔗 Aranje but I shouldn't be
19:32 🔗 Aranje Then again, I've only had internet since 2005, so I wouldn't know what it looked like then
19:32 🔗 ersi It's more surprising how those tards are still in business
19:48 🔗 Schbirid emijrp: error on image retrieval or normal output http://pastebin.com/k8UxDwgD ?
19:49 🔗 emijrp looks like it fails with images at wikia
19:49 🔗 emijrp file a bug http://code.google.com/p/wikiteam/issues/list
19:50 🔗 Schbirid it requires me to use a google account
19:55 🔗 emijrp yes, blame the spammers
19:56 🔗 emijrp im fixing the bug, wait
19:57 🔗 Schbirid awesome!
20:00 🔗 emijrp done
20:00 🔗 emijrp do svn up
20:01 🔗 emijrp and resume
20:01 🔗 emijrp python dumpgenerator.py --api=... --xml --images --resume --path=pathtodirectory
20:02 🔗 emijrp remove quakewikiacom-20110908-images.txt first, to be sure
20:02 🔗 Schbirid no such file
20:02 🔗 emijrp ok
20:02 🔗 Schbirid works
20:02 🔗 Schbirid nice
20:02 🔗 Schbirid thanks!
20:03 🔗 emijrp : )
20:18 🔗 Aranje You know what site I'd actually like to have archived? project gutenberg.
20:18 🔗 Aranje I should figure that out.
20:21 🔗 db48x Aranje: download the dvd image
20:21 🔗 Schbirid iirc that is simple and nice :)
20:21 🔗 Aranje oh is there one?
20:21 🔗 Aranje sweet!
20:22 🔗 Schbirid it was his goal to spread it easily
20:22 🔗 Aranje awesome :D
20:22 🔗 Aranje It's totally something I should have a copy of
20:22 🔗 Schbirid you might also be interested in http://gen.lib.rus.ec/
20:23 🔗 Schbirid quite illegal though
20:25 🔗 Aranje the lines of legal and illegal blur often for me :)
20:26 🔗 Aranje and of course there is a torrent
20:26 🔗 Aranje lmao
20:38 🔗 DFJustin the dvd image isn't a complete set
20:38 🔗 DFJustin but the gutenberg etexts are already mirrored on dozens of mirrors and on archive.org
20:39 🔗 Aranje oh, cool
20:39 🔗 DFJustin http://www.archive.org/details/gutenberg
20:40 🔗 DFJustin http://www.gutenberg.org/catalog/world/mirror-redirect
20:41 🔗 Aranje neat :D
