[01:20] neat, I found an undeveloped roll of film
[01:20] (in this box, not mine)
[01:20] uh
[01:21] I hope it was not exposed to light or anything
[01:21] well, it's mine.
[01:21] now
[01:21] na, it's been in the container
[01:21] (aside from actually taking the pictures)
[01:21] oh. you said roll not reel
[01:22] easier to tell with rolls
[01:22] yeah
[01:22] I wonder if it's worth getting developed. Hope this guy wasn't into weird stuff.
[01:27] Yeah, EFNet
[01:27] Fuck you too
[01:27] alard: Absolutely
[01:30] mmm
[01:30] DCI... talking around 1.5TB for a single 100-minute movie, with only one 8-channel soundtrack (at 96k)
[02:43] lachlan mirror still chugging. at 3.7G
[02:46] that's quite the website
[03:24] Back
[03:42] wb
[03:42] :>
[05:24] today my work as an archivist involves simulating a tape read circuit to decode bits off a data tape image recorded with audio gear
[05:24] just in case you guys thought I was slacking :)
[05:26] ooh, wow. what's this for?
[05:29] http://xrtc.net/f/phreak/3ess.shtml <-- this machine, a 1973 computer welded to a telephone switch, has bad tape carts.
[05:29] solution: replace tape drive with something solid-state
[05:29] tape drive is in center above teletype, the thing with the round sticker on
[05:30] have to replace the tape drive to run diagnostics
[05:31] have to run diagnostics to figure out what's wrong with the offline processor
[05:31] have to fix the offline processor to run code on the machine safely
[05:31] have to run code on the machine to do a backup
[05:31] have to do a backup before rebooting
[05:31] have to reboot because that will probably clear some stuck trouble that's been plaguing it since 1998 at least
[05:32] yeah ... it was last booted in 1992
[05:33] that view is the operator console side; the machine is two of those lineups - the second is the switching network and stuff
[05:35] I want to strangle the fucker that decided that 1/4" tape cartridges are better than open-reel tape
[05:36] STRANGLE you hear me
[05:52] Yeah
[05:52] batcave went south, can't get anyone to reset.
[05:53] So heartbroken, I know
[05:53] D:
[06:19] http://www.freshdv.com/wp-content/uploads/2011/10/hurlbut-letus-41.jpg
[06:19] What a way to jizz up a perfectly fine DSLR
[06:20] wow that's a lot of shit to bolt onto a dslr
[06:21] wow
[06:21] I count... four different handles?
[06:55] http://www.archive.org/search.php?query=collection%3Aarchiveteam-yahoovideo&sort=-publicdate
[06:55] Back in business.
[06:55] speaking of video: http://ia700209.us.archive.org/6/items/dicksonfilmtwo/DicksonFilm_High_512kb.mp4
[06:55] cool shit
[07:02] Yeah, going to let those go
[07:02] And get some rest, then back up
[07:02] There's so much stuff uploading now, the machine's finally emptying out
[07:07] Oh, and I found the artist for the archiveteam t-shirt and poster
[07:10] oh?
[07:34] Dicks On Film?
[07:34] documentary about chatroulette?
[07:34] ah. that explains the rsync troubles
[07:45] daamn: http://popc64.blogspot.com/
[07:48] lachlan mirror still underway, at 4.2G
[11:10] chronomex: http://www.myspace.com/pagefault D:
[11:10] hahahaha
[11:18] Morning, probably need to sleep a tad
[11:18] But the batcave now has 12tb free
[11:19] So we have a lot of room again.
[11:36] SketchCow: The scripts for me.com/mac.com are more or less working now, so that would be a way to get new things to fill it with.
[11:36] Excellent.
[11:36] So, we should talk about that.
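A back-of-the-envelope check on the ~1.5TB DCI figure mentioned at 01:30. This sketch assumes an uncompressed 2K DCDM (12-bit, three colour channels, 24 fps) plus 24-bit PCM audio; none of those parameters are stated in the log, so treat it as rough arithmetic only.

```python
# All parameters below are assumptions, not values from the log.
frames  = 100 * 60 * 24                        # 100-minute feature at 24 fps
picture = frames * 2048 * 1080 * 3 * 12 // 8   # uncompressed 2K frames, 12 bits/channel
audio   = 100 * 60 * 8 * 96000 * 24 // 8       # one 8-channel soundtrack, 96 kHz, 24-bit PCM
print((picture + audio) / 1e12, "TB")          # ~1.45 TB, in the ballpark of the 1.5TB figure
```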
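And a minimal sketch of the kind of tape-read simulation described at 05:24, i.e. recovering bit cells from an audio capture of a data tape. The WAV layout and the short-interval/long-interval coding are illustrative assumptions; the actual 3ESS cartridge format is not described in the log.

```python
import wave
import numpy as np

def zero_crossing_intervals(path):
    # Assumes a mono 16-bit capture; a real decoder would also filter and equalize.
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    flips = np.where(np.diff(np.sign(samples)) != 0)[0]  # sample indices of polarity changes
    return np.diff(flips) / rate                         # seconds between flux transitions

def intervals_to_bits(intervals):
    # Crude clock recovery: intervals shorter than the median count as 1s, longer as 0s.
    threshold = np.median(intervals)
    return [1 if t < threshold else 0 for t in intervals]

bits = intervals_to_bits(zero_crossing_intervals("tape_capture.wav"))
print(len(bits), "raw bit cells recovered")
```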
[11:37] The number one thing besides making stuff be in a way the wayback machine can accept, when possible, is to have ways to package this crap up into units I can use to upload again.
[11:37] Yes, probably have a look at the results as well.
[11:37] I'm starting down the google groups stuff, and oh man, this is going to take forever.
[11:38] Did wayback successfully swallow the earlier warc-files btw?
[11:38] They've been doing lots of runs against them.
[11:38] I don't know how many are fully in but that work is being done.
[11:38] MobileMe works with usernames, so there's not an easy way to group it into numbered chunks. (And the full list of usernames is not yet available.)
[11:38] So that's a yes?
[11:39] I am pretty sure it's a yes.
[11:39] Awesome, to 11
[11:39] Even the wget-warc ones? That's good news.
[11:41] So, I asked archive team to back up a site.
[11:41] Someone came out and said he was doing it, but he got me nervous because he basically said "their robots.txt is blocking the images!"
[11:42] Which is like a private detective saying "and then they walked into a building that said no trespassers!"
[11:42] 11:31 I have the backup of csoon.com
[11:42] 11:45 And i'm kinda unsure where to upload it.
[11:42] So, I'd like someone else to do it.
[11:42] It's not that large.
[11:42] But it's fucking hilarious.
[11:42] Died in 2000.
[11:42] Heh. (Already did it, yesterday. Look in batcave. :)
[11:42] Been there ever since.
[11:42] Good deal, thanks.
[11:43] They're right, that's like finding an untouched dinosaur fossil
[11:44] I found another amazing site
[11:44] Collections of old department stores
[11:45] http://departmentstoremuseum.blogspot.com/
[11:46] http://departmentstoremuseum.blogspot.com/2010/06/may-co-cleveland-ohio.html
[11:46] That is a lot of crazy work
[11:46] I also had a nice long chat with the head of the CULINARY CURATION GROUP OF THE NEW YORK PUBLIC LIBRARY
[11:46] Try THAT for crazy
[11:46] http://legacy.www.nypl.org/research/chss/grd/resguides/menus/
[11:57] http://batcave.textfiles.com/ocrcount/ <--- You can see how long batcave was in the shitter
[12:00] was that, ocr jobs that were running on batcave? :o
[12:08] No.
[12:09] This was me tracking a limit imposed on my ingestion.
[12:09] Ah, alrighty
[12:09] I was using a method that worked fine but was hard on the structure
[12:09] And got into a fight over that
[12:09] Part of it was "you shouldn't use that method if there's more than 200 jobs in queue"
[12:09] Now, over time, that's not going to matter, i.e., a queue will be made that DOESN'T hold the job in queue on the machine, but just generally.
[12:10] But this was me seeing "So, does it EVER go below 200 or should I even watch"
[12:10] Answer: Yes
[12:10] And bam, you started filling it up gradually instead of appending to an ever-increasing derive queue? :)
[12:10] Fuck no
[12:11] I slammed that shit up to max
[12:11] Then what was the point of that tracking?
[12:11] To know if I was being lied to
[12:11] I was not specifically being lied to
[12:11] ah
[12:12] Any time you see me mention interacting with other human beings, ask yourself "So, what's the most hostile interpretation as to why Jason is doing this"
[12:12] It'll save you time
[12:12] "Hey, guys, I went out to eat"
[12:12] Meaning: I got banned from a new diner
[12:13] Already known for.. long :)
[12:13] Apparently you forgot, twerp!
[12:13] Zing!
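One way around the "no numbered chunks" problem from 11:38, assuming a username list eventually exists to work from: sort it and cut it into fixed-size, numbered pieces that can double as upload units. The file names and chunk size here are invented for illustration; this is not how the actual scripts work.

```python
def chunked_usernames(path="mobileme-usernames.txt", size=1000):
    # Hypothetical input file: one MobileMe username per line.
    with open(path) as f:
        users = sorted({line.strip() for line in f if line.strip()})
    for i in range(0, len(users), size):
        yield i // size, users[i:i + size]

# Write each chunk out as a numbered worklist / upload unit.
for n, chunk in chunked_usernames():
    with open(f"mobileme-chunk-{n:05d}.txt", "w") as out:
        out.write("\n".join(chunk) + "\n")
```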
[12:13] The brutal thing coming up with yahoo video is I will be writing something that pulls down an item, does huge stats on it, then uploads again.
[12:14] hm, I should get going on instructables again
[12:14] that thing is fuckin' huge though
[12:15] It's funny for me that I now go into a directory on batcave, see it's 35gb, go "oh."
[12:15] I've put up 400gb items
[12:15] This is going to be hilarious
[12:17] http://googleblog.blogspot.com/2011/10/fall-sweep.html
[12:18] Shutting down: Code Search, Google Buzz, Jaiku, Google Labs (immediately), University Research Program for Google Search
[12:18] Yeah
[12:18] Boutiques.com and like.com gone
[12:19] Code Search was critical
[12:47] What would you like to get from the me.com/mac.com downloaders? At the moment, they produce:
[12:48] 1. a warc.gz for web.me.com (plus xml index and log file)
[12:48] 2. a warc.gz for homepage.mac.com (plus a log file)
[12:48] 3. the xml feed for public.me.com, plus a copy of the file structure + the headers for each file (not warc)
[12:49] 4. the xml feed for gallery.me.com, plus a zip file for each gallery
[13:37] Hmmm.
[13:37] I'd like all of it - what's the size differential?
[13:42] You do get all of the content, it's just a question of in what form you'd like to get it.
[13:42] Just a WARC or also separate files, that sort of thing.
[13:45] Here's an example listing of what it produces now: http://pastebin.com/raw.php?i=438zhmSR
[13:46] http://vimeo.com/28173775
[13:46] The files can get quite large (up to 2 GB for the users I've tried so far), so I don't think it's useful to have the data in more than one form.
[13:46] I think it could be.
[13:47] WARC is so forward-looking, but you can't use it for anything BUT wayback.
[13:47] Or you have to run a WARC extractor to create the structure wget would create otherwise.
[13:48] Hmmm.
[13:48] So you'd like to have the wget copy as well?
[13:48] Well, you know, I could see that.
[13:48] With or without link conversion?
[13:48] Massive post-processing.
[13:48] I am fine with massive post-processing.
[13:48] So WARC might make the most sense.
[13:48] I'd like to run that against your warcs we've added already to archive.org, see how that looks.
[13:48] It does save a lot of duplicate uploading.
[13:49] Agreed.
[13:49] And the thing with these machines I have, they suck down data at 40-80MB a second.
[13:49] So it can yank it down, rejigger, upload
[13:50] (As a reference: the four users I have now have 3.6GB of data together. But maybe I chose the wrong examples.)
[13:50] Wow, what the hell.
[13:50] Can you link me to them?
[13:50] http://web.me.com/sleemason/
[13:51] WARC is the way.
[13:51] http://homepage.mac.com/ueda_daisuke/
[13:52] http://gallery.me.com/amurnieks
[13:52] yeah, those.
[13:52] (each user has something on homepage, gallery, public, web)
[13:53] hmm, how does WARC do it?
[13:53] I currently make WARCs for homepage.mac.com and web.me.com.
[13:53] For gallery.me.com I download the zip files that the server offers.
[13:53] ohh, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[13:53] For public.me.com I download the files.
[13:53] balrog: Yup.
[13:53] And this is all closing June 2012?
[13:54] Are they blocking with robots.txt?
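On the 13:47 point about needing "a WARC extractor to create the structure wget would create otherwise": this is roughly what such post-processing could look like. It uses the warcio library (a much later tool, shown only as one option, not what Archive Team used at the time), and the file and directory names are illustrative.

```python
import os
from urllib.parse import urlsplit
from warcio.archiveiterator import ArchiveIterator

def extract(warc_path, out_dir="extracted"):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue                       # skip request/metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            parts = urlsplit(url)
            rel = parts.path.lstrip("/")
            if not rel or rel.endswith("/"):
                rel += "index.html"            # directory-style URLs need a file name
            target = os.path.join(out_dir, parts.netloc, rel)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(record.content_stream().read())

extract("web.me.com-example.warc.gz")
```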
[13:54] SketchCow: yes, as per current info
[13:54] Sorry for not paying more attention, been dealing with data
[13:54] SketchCow: last I checked, no, but it's messy to parse because it uses XML and JS
[13:54] basically it uses JS to load the web content on many pages
[13:54] (from an XML file)
[13:55] Only gallery.me.com has a robots.txt. public.me.com doesn't, but it is somewhat inaccessible to crawlers.
[13:55] Well, Jobs is dead, nobody is watching
[13:55] homepage.mac.com has normal sites, can be crawled. web.me.com has some iWeb sites which are hard to crawl (but it's possible if you use WebDAV).
[13:56] alard: homepage.mac.com could have iWeb sites.
[13:56] Any examples? The wayback machine doesn't have any.
[13:56] I should dig around, but I thought I saw some.
[13:57] But wow, we're talking a fuckton of data, aren't we.
[13:57] Not really sure, the gallery/public sections can get large, the web sites are somewhat smaller.
[13:57] I'm sure this is some related concept to having such intense integration of the OS and the site
[13:57] I'm pretty sure there are many GB of data on here.
[13:57] So people can just blow shit back and forth.
[13:58] balrog: TB, probably.
[13:58] alard: I'll send you a list of homepage.mac.com pulled from my web history (which unfortunately doesn't go all that far back)
[13:58] SketchCow: what exactly are you referring to?
[13:58] alard: a few hundred TB, if you count all the gallery data
[13:58] I mean that the .me stuff Apple did really smoothed the process of handling data and stuff.
[13:59] Similar to what we saw with Friendster, when photo albums explode
[13:59] yeah, they did. they improved it with iCloud, but took away the web-facing features :[
[14:01] alard: hold on a moment :)
[14:02] alard: this is not mac.com but may be useful … http://www.wilmut.webspace.virginmedia.com/notes/webpages.html
[14:02] http://www.archive.org/details/ARCHIVETEAM-YV-9200002-9299997
[14:02] I am going to get in trouble for that one.
[14:03] There was major debate about what the maximum item size should be.
[14:03] Most people agreed on 100gb
[14:03] ooooh.
[14:03] That's 408gb
[14:03] why not break it up then?
[14:03] I meant to, but it was in the wrong directory when an uploader script ran
[14:03] I misread it as 40gb
[14:03] I may have to yank it down and split it
[14:03] urgh. can you take it down?
[14:03] I am really good at yanking it, ask around
[14:04] Nothing's breaking, it just becomes harder for it to be moved around.
[14:04] alard: you there?
[14:04] Yes.
[14:04] http://pastie.org/private/gi3mrystmzx5ogyeocapg
[14:04] that came out of my history
[14:04] not all may work though
[14:04] and it's short
[14:04] there's another db I have which I have to go through
[14:05] (raw sql)
[14:07] http://www.archive.org/details/ARCHIVETEAM-YV-3900000-3999999&reCache=1
[14:07] Really, 200gb is not bad for the videos from 100,000 potential userspaces
[14:07] isn't that a little large too?
[14:07] I am fine with 200gb
[14:08] alard: I'll grep this db for mac.com/me.com :p
[14:08] however, do you know of a regex that can be used?
[14:08] balrog: I downloaded your list. (Though most of the users were already on my list, it seems.)
[14:08] grep (homepage|web)\.(me|mac)\.com ?
[14:09] I'll get another bigger list, I just need a regex that will get the proper results
[14:09] yeah but this is sql
[14:09] it's likely to be in the middle of a line
[14:09] like, a forum post
[14:09] I see. Dump all the content, feed it to grep?
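A hedged sketch of the WebDAV route mentioned at 13:55 for the iWeb sites on web.me.com: a PROPFIND request returns an XML "multistatus" listing of everything under a path, which a crawler can then fetch. The URL is a made-up example, and the real service may require authentication or different endpoints.

```python
import requests
import xml.etree.ElementTree as ET

def propfind(url, depth="1"):
    # Depth: 1 lists the immediate children of the collection at `url`.
    resp = requests.request("PROPFIND", url, headers={"Depth": depth})
    resp.raise_for_status()
    tree = ET.fromstring(resp.content)
    # Each listed resource appears as a <D:href> element in the DAV: namespace.
    return [href.text for href in tree.iter("{DAV:}href")]

for entry in propfind("http://web.me.com/exampleuser/"):
    print(entry)
```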
[14:09] well yeah, I'd be working from a sql dump
[14:09] but there's stuff in the middle of lines
[14:10] In that case, I repeat the previous regexp.
[14:10] ok ...
[14:10] we'll see if it works.
[14:10] SketchCow: So I should keep it as WARCs?
[14:11] Yeah
[14:11] What about the files on public.me.com?
[14:11] As we discussed, we can make more contemporary extractions.
[14:11] All of them
[14:11] archive.org can sustain two copies, one generated from the other.
[14:11] So don't download them separately, but download to a WARC.
[14:11] WARC ensures long-term sustainability
[14:11] This is the tradeoff, which I am fine with
[14:12] (archive.org prefers we always do WARCs; in return, a fuck they do not give how much we waterfall into their serverspace)
[14:12] This from on high
[14:12] alard: you mean each user in his own WARC?
[14:12] What about the images on gallery.me.com? I currently ask Apple to produce zip files, which is really handy, but isn't WARC.
[14:12] If that's the best we can do, that's fine.
[14:12] balrog: Yes, each user results in four WARCs.
[14:12] aha.
[14:12] SketchCow: You can download the images, it just takes a little longer.
[14:13] So if WARC is nicer, we should do WARC.
[14:13] Yes
[14:13] * balrog copies over the latest .sql
[14:13] Also a mess: our Star Wars forum thing
[14:13] (Although I should look at what happens to the album structure if we do that.)
[14:13] That's what's not up
[14:13] I trust your judgement, alard.
[14:14] Now you know big daddy's preferences.
[14:14] Heh.
[14:14] I just didn't like us shutting out the potential for contemporary users, and if post-facto conversion to items that are easier to regard is possible then I'm on board.
[14:14] Where possible, WARC is what the "legit" sites like
[14:14] alard: what's used to dump sites as WARC?
[14:15] wget-warc.
[14:15] also does that deal with when you have to use phantomjs?
[14:15] What's the status on those fucks accepting wget-warc
[14:15] or are those special-case?
[14:16] SketchCow: The last response was 'wow, that diff is huge', and he was inclined not to include it, but to offer it as a separate extension (as in: you'd have to enable it before compiling).
[14:16] alard: your regex doesn't work :/
[14:16] alard: hmmmm… mailing list?
[14:16] But I made the mistake of including the whole warctools library, which includes things like the curl extension etc.
[14:16] Well, optimize and get that in
[14:16] That's a huge win
[14:17] It'll change everything out there
[14:17] * balrog reads up on regex
[14:17] Yeah, well, I replied that the files that the wget extension uses are much smaller. I haven't yet got a reply to that.
[14:17] I say just do it.
[14:17] It'll make a huge change in the world.
[14:17] I'll probably make a smaller diff and send that to them.
[14:18] Or two versions: the small one with built-in warc, the other one with warctools included.
[14:18] I have now discovered I have two .tar files of the same range.
[14:18] Kick ass effort, alard. Kick ass
[14:18] One is 111gb. One is 206gb
[14:18] huh, why the difference?
[14:18] NO IDEA
[14:18] balrog: Did you use grep -E ?
[14:18] oops, no :p
[14:19] that worked, but it grabbed full lines
[14:19] I don't want full lines
[14:19] I want to isolate the relevant parts
[14:19] Maybe do grep -oE "http://(homepage|web)\.(mac|me)\.com/[^/]+"
[14:20] alard: does that assume lines start with http://? they don't
[14:21] Yes, it does.
[14:21] It also assumes that every url ends with a /
[14:21] grep -oE "(homepage|web)\.(mac|me)\.com/[^ ]+" stops at the first whitespace character.
[14:22] URLs are formatted http:// … /username. however they may have text in front of or after them, within the same line
[14:22] you could have like "Check out this site: Here!"
[14:22] Oh, sorry, it doesn't assume that the *line* starts with http://, just that the *url* starts with http://.
[14:22] grep -oE 'http://(homepage|web)\.(mac|me)\.com/[^/"]+'
[14:26] much shorter list than I expected.
[14:26] Then it's probably good to check the regexp.
[14:26] http://pastie.org/private/l5cjotdi58ttf8bq8g4m8g
[14:26] I did.
[14:27] the incoming HTML filter would put http:// before all urls
[14:27] you have all these?
[14:40] alard: did you have these already?
[14:43] balrog: Just checked, most of them, not all.
[14:43] OK
[15:01] SketchCow: One more question, if you're still there. It's possible to download the gallery contents to WARC. However, I think it doesn't make sense. It certainly wouldn't be useful with the wayback machine.
[15:02] So I'm thinking that downloading the metadata xml/json and zipping the images per album is the best solution.
[15:03] I agree, then.
[15:04] The problem with the gallery is that it isn't really a web page, but a collection of image files that can be rendered in different formats. So for a wayback-thing, you'd have to get every possible format.
[15:14] Well then, I think that the scripts are finished.
[15:14] If anyone would like to do a test run, please do! https://github.com/ArchiveTeam/mobileme-grab
[15:42] -rw-r--r-- 1 root root 205 2011-10-05 17:14 ballsack
[15:42] -rw-r--r-- 1 root root 2425 2011-10-05 16:20 balls
[15:42] drwxr-xr-x 2 root root 4096 2011-10-05 17:19 DONE
[15:42] That's how you know it was me
[15:43] LOL
[15:48] i seem to have acquired an "@", considering I may as well be a stranger, someone should probably take it away
[15:49] "@"?
[15:49] i do enjoy lurking, and as much as i love collecting old documents, i haven't contributed a darn thing to this cause
[15:49] op status, unless I'm mistaken
[15:49] oh, that
[15:50] yeah I don't know :p
[15:50] I think I was made op here once, though. idk either
[15:50] this is efnet though
[15:50] if you were to part and return, it would go away
[15:50] i've grown rather fond of it
[16:00] lol, i was made ops once in this chan
[16:00] happens sometimes
[16:01] It's all on my arbitrary observations, bitches
[16:31] free-flowing ephemeral op-bit
[16:31] probably the best way to avoid power clashes
[16:56] hey friends :)
[16:56] hello
[16:57] it's old news, but i think it would make sense to note the closure of labs.google.com somewhere in the archiveteam.org wiki?
[16:58] do it
[16:59] http://archiveteam.org/index.php?title=Deathwatch
[17:01] it made me lose thrust in google and google innovation, i miss google sets and google squared
[17:02] ok im going to add a line there and to the article about google
[17:02] s/thrust/trust
[17:02] I made that spelling error a lot earlier :)
[17:03] of course...
[17:03] I agree.
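For completeness, the same extraction as the grep one-liner from 14:22, as a small Python pass over a SQL dump; the input file name is made up, and the character class is widened slightly so matches also stop at whitespace and angle brackets.

```python
import re

# Matches the full prefix through the username, like the grep version above.
pattern = re.compile(r'https?://(?:homepage|web)\.(?:mac|me)\.com/[^/"\s<>]+')

found = set()
with open("forum_dump.sql", encoding="utf-8", errors="replace") as f:
    for line in f:
        found.update(pattern.findall(line))

for url in sorted(found):
    print(url)
```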
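And, roughly, what a single wget-warc grab of the kind discussed at 14:14 might look like when driven from a script. The exact options the mobileme-grab scripts use aren't shown in the log, so this is only indicative; --warc-file and --warc-header are the options the WARC-enabled wget adds, and the example URL is made up.

```python
import subprocess

def grab(url, warc_name):
    # Assumes a wget build with WARC support is on PATH; output lands in warc_name.warc.gz
    subprocess.run([
        "wget",
        "--mirror", "--page-requisites",
        "-e", "robots=off",
        "--warc-file=" + warc_name,
        "--warc-header", "operator: Archive Team",
        url,
    ], check=True)

grab("http://homepage.mac.com/exampleuser/", "homepage.mac.com-exampleuser")
```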
[17:04] Stupid Google
[17:04] It's not impressive to turn off Google Labs
[17:04] It was inspiring to go there and see crazy projects
[17:04] The only (only) justification I can come up with is that people/businesses/entities were monetizing or showing reliance on them
[17:05] Closing down Google Code Search is fucking stupid as well
[17:05] that was back when you had gaming equipment by thrustmaster?^^
[17:05] their main shit is/was search once upon a time
[17:07] when did google code search vanish :-O
[17:07] ?
[17:07] was it considered part of google labs too?
[17:07] It's not gone yet
[17:08] It's being killed
[17:08] January
[17:08] *sigh*
[17:09] Also, no, it was a separate project.
[17:36] but is there any content in google code search? or was it just an alternative view of stuff that's already on the web?
[17:36] No content
[17:36] Just a great tool
[17:36] Which still makes it a fucking shame that they're disbanding it
[17:37] someone tell ms to make bing code search
[17:37] I mean, what do you think of, when you think Google? Most people think Search.
[17:37] Or did, at least. I think of advertisement these days.. and crappy search
[17:42] is there a better search engine? I know blekko and duckduckgo have some cool stuff, but for general web stuff?
[17:45] grep
[17:52] * Coderjoe grumbles
[17:52] I am beginning to think I should have used wget-warc
[17:53] 5GB and still going. apparently there are some books in there too
[17:55] what are you archiving?
[17:55] Coderjoe: wow, when he popped in talking about the site I expected a few hundred megs tops
[17:56] possibly for google code search there is some rationale to close it down - that it can be used as a tool for hacking in various ways
[17:56] jjonas: lachlan.bluehaze.com.au
[17:57] australian physicist that died last year. doing an AFK pull
[17:57] I should go bluehaze.com.au as well, as that site belonged to a guy that died in 2006
[17:57] s/go/do
[17:58] argh. can't type
[17:58] but then, who wrote on top of it that he died in 2010 and that it stays as a memorial?
[17:59] the person keeping bluehaze around as well.
[17:59] .... but i really have no idea why they dropped/hid google labs completely
[17:59] i tried to look it up in the waybackmachine
[18:00] to see all the various nice tools/attempts that i don't even remember
[18:00] but it's not in the waybackmachine
[18:03] nvm! googlelabs.com is, just the subdomain isn't
[18:15] jjonas: That's a fucking stupid ass rationale
[18:15] :D
[18:15] haha
[18:15] I mean seriously, punch-you-in-the-face stupid
[18:16] I can stab someone in the eye with a pencil. should we remove all pencils?
[18:16] i wasn't trying to defend such a rationale
[18:17] I didn't perhaps mean you as in you
[18:17] If you're a sad frightened panda right now, that is
[18:17] i would just be as surprised about that kind of reasoning
[18:17] that google might have done before deciding to close it down
[18:18] heh.. it's like someone went "The terrorists crashed planes into buildings. We must outlaw all planes."
[18:18] than i am about them closing google labs
[18:18] *NOT be as surprised
[18:19] Remember Johnny Long's "Google Hacking" books?
[18:20] sep332: aye
[18:21] but if google and other big companies consistently thought like you
[18:21] they would have realized many useful features already
[18:22] that aren't there yet
[18:23] if you add this as a firefox bookmark and set keyword "mp3"
[18:23] http://www.google.de/search?hl=de&safe=off&q=intitle%3A%22index.of%22+(mp*|avi|wma|mov)+%s%2Bparent%2Bdirectory+-inurl%3A(htm|html|cf|jsp|asp|php|js)+-site%3Amp3s.pl+-download+-torrent+-inurl%3A(franceradio|null3d|infoweb|realm|boxxet|openftp|indexofmp3|spider|listen77|karelia|randombase|mp3*)&btnG=Suche&meta=
[18:23] shrug
[18:23] I think we should remove all CoderJoes; the world will be safer without their(?) violent imaginations
[18:23] then you can type in the address bar "mp3 any title/artist"
[18:23] and find working mp3 links
[18:24] i changed it the last time like 5 years ago so the excluded spam sites might not be up to date
[18:24] yeah... there is apparently another coderjoe out there, whose name is actually Joe
[18:24] (mine is not)
[18:24] ...but it works
[18:24] and you maybe use something similar already
[18:24] so why does google not have a tab "mp3" next to images, maps, ...
[18:25] Let's get back to talking about archiveteam stuff instead of fluff
[18:25] expected record company outrage?
[18:26] baidu has an mp3 search, mp3.baidu.com
[18:26] that's a different environment, google china also has a million songs freely downloadable
[18:26] (with a chinese IP only of course)
[18:27] ......
[18:28] yeah, let's talk about archiving, since i made my point why they maybe would (sadly) close down google code for such a reason :D
[18:29] if you don't mind, check my grammar about google labs in http://archiveteam.org/index.php?title=Deathwatch#2011
[18:34] btw, just to finish the mp3 subtopic condignly: the russian facebook counterpart vkontakte.ru has a great community directory sharing all mp3s paired with lyrics files among 100+ million users, just like there is no copyright :D
[18:34] is no copyright in soviet russia
[18:35] nor in capitalist russia
[18:35] so warez sites are legal there too
[18:35] ?
[18:35] even if they have international users
[18:35] :-O
[18:35] * chronomex shrugs
[18:35] eez joke
[18:36] Calm the fuck down
[18:36] * ersi brings out the sedatives
[18:36] http://yfrog.com/z/obj01nxj
[18:36] SketchCow: Hah, awesome
[18:37] :) i'm not nervous, just kidding
[20:19] oh joy
[20:19] I don't know where the link was that caused me to go astray
[20:19] but apparently, the server has no trouble treating html files as directories
[20:19] http://lachlan.bluehaze.com.au/deep.html/books/usa2001/usa2001/usa2001/gnomes.html
[20:20] that gives you the "deep.html" page
[20:20] that's called "path info"
[20:20] yes, i know. and I've used it on php, just not html
[20:21] but there is a bad link that led me to an infinite recursion problem
[20:21] ah
[20:23] my apache config at home does not appear to allow pathinfo on html, but then I am not parsing html (while the lachlan server is)
[20:25] somewhere on that site is at least one bad link that adds a directory level to the entire site
[20:30] i'm going to terminate that until I have a chance to inspect things a bit more
[21:40] http://www.economist.com/node/21529030
[21:41] Scanning and destroying books, for a fee. I wonder if this horrifies SketchCow. Obviously, it's not archiving, though some people might use it that way.
[21:44] scanning good. destruction BAD
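A small guard for the path-info recursion problem from 20:19-20:30: URLs whose path keeps repeating the same segment (.../usa2001/usa2001/usa2001/...) can be rejected before they're fetched. This is only a sketch of one heuristic, and the repeat cutoff is arbitrary.

```python
from urllib.parse import urlsplit

def looks_recursive(url, max_repeats=2):
    # Reject URLs where any path segment appears more than max_repeats times.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in segments)

print(looks_recursive(
    "http://lachlan.bluehaze.com.au/deep.html/books/usa2001/usa2001/usa2001/gnomes.html"
))  # True -> skip this URL
```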
[21:46] related blog post on it: http://ascii.textfiles.com/archives/2672
[21:52] It's always a hard call when it comes to books.
[22:20] if the book needs to be destroyed, I'm expecting perfection for results - anything less isn't worth it (for the sake of archiving, it's not worth it, but I'm sure many people would be happy to make that choice)
[22:21] oh, I dunno -- people seem perfectly happy to accept 1080p masters for films these days
[22:24] yeah, but people are stupid
[22:24] at least 1DollarScan/Bookscan seem to be clear that they only do this for mass-market copies
[22:24] that seems to be a bit more sane
[22:24] well, I think, I dunno -- it's not spelled out in that article
[22:26] if it's mass-market and not hard to find, that's a little different
[22:27] right
[22:27] I think that's the intent here
[22:30] yipdw: what quality masters should people be asking for?
[22:32] Hiiii
[22:33] dashcloud: the highest available, which for some films is 1080p -- Ultraviolet and new scenes in Star Wars come to mind
[22:33] dashcloud: but it's more that 1080p is markedly inferior in terms of resolution to earlier production techniques, and what with the availability of digital cameras like the RED ONE system it doesn't have to be that way
[22:33] so, yeah, more of an offhand snark
[22:34] at least some of the Blender Foundation's open movies are available as higher than 1080p films
[22:36] yeah, and with those it's theoretically better because the film assets are available
[22:36] here's an awesome article about gifs: http://motherboard.tv/2010/11/19/the-gif-that-keeps-on-gifing-why-animated-images-are-still-a-defining-part-of-our-internets
[22:37] I say theoretically because I sure as hell haven't been able to e.g. re-render Big Buck Bunny from the assets directory :P
[22:37] I know the 2k frames were/are available from xiph's sample site
[22:41] My attitude on 1dollarbookscan is it makes more sense than throwing them out
[22:54] Barely
[23:20] ^
[23:45] BURP
[23:45] Another 300GB into the archive