Time |
Nickname |
Message |
01:20
🔗
|
human39 |
neat, I found an undeveloped roll of film |
01:20
🔗
|
human39 |
(in this box, not mine) |
01:20
🔗
|
Coderjoe |
uh |
01:21
🔗
|
Coderjoe |
I hope it was not exposed to light or anything |
01:21
🔗
|
human39 |
well, it's mine. |
01:21
🔗
|
human39 |
now |
01:21
🔗
|
human39 |
na, it's been in the container |
01:21
🔗
|
Coderjoe |
(aside from actually taking the pictures) |
01:21
🔗
|
Coderjoe |
oh. you said roll not reel |
01:22
🔗
|
Coderjoe |
easier to tell with rolls |
01:22
🔗
|
human39 |
yeah |
01:22
🔗
|
human39 |
I wonder if it's worth getting developed. Hope this guy wasn't into weird stuff. |
01:27
🔗
|
underscor |
Yeah, EFNet |
01:27
🔗
|
underscor |
Fuck you too |
01:27
🔗
|
underscor |
alard: Absolutely |
01:30
🔗
|
Coderjoe |
mmm |
01:30
🔗
|
Coderjoe |
DCI... talking around 1.5TB for a single 100-minute movie, with only one 8-channel soundtrack (at 96k) |
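(Rough arithmetic behind that figure, assuming an uncompressed 2K DCDM rather than a finished JPEG 2000 DCP: 2048 x 1080 px x 4.5 bytes/px for 12-bit X'Y'Z' is about 10 MB per frame; at 24 fps that is roughly 239 MB/s, or about 1.43 TB of picture for 100 minutes, plus around 14 GB for eight channels of 24-bit/96 kHz audio.)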
02:43
🔗
|
Coderjoe |
lachlan mirror still chugging. at 3.7G |
02:46
🔗
|
chronomex |
that's quite the website |
03:24
🔗
|
SketchCow |
Back |
03:42
🔗
|
underscor |
wb |
03:42
🔗
|
underscor |
:> |
05:24
🔗
|
chronomex |
today my work as an archivist involves simulating a tape read circuit to decode bits off a data tape image recorded with audio gear |
05:24
🔗
|
chronomex |
just in case you guys thought I was slacking :) |
05:26
🔗
|
balrog |
ooh, wow. what's this for? |
05:29
🔗
|
chronomex |
http://xrtc.net/f/phreak/3ess.shtml <-- this machine, a 1973 computer welded to a telephone switch, has bad tape carts. |
05:29
🔗
|
chronomex |
solution: replace tape drive with something solid-state |
05:29
🔗
|
chronomex |
tape drive is in center above teletype, the thing with the round sticker on |
05:30
🔗
|
chronomex |
have to replace tape drive to run diagnostics |
05:31
🔗
|
chronomex |
have to run diagnostics to figure out what's wrong with the offline processor |
05:31
🔗
|
chronomex |
have to fix the offline processor to run code on the machine safely |
05:31
🔗
|
chronomex |
have to run code on the machine to do a backup |
05:31
🔗
|
chronomex |
have to do a backup before rebooting |
05:31
🔗
|
chronomex |
have to reboot because that will probably clear some stuck trouble that's been plaguing it since 1998 at least |
05:32
🔗
|
chronomex |
yeah ... it was last booted in 1992 |
05:33
🔗
|
chronomex |
that view is the operator console side; the machine is two of those lineups - the second is the switching network and stuff |
05:35
🔗
|
chronomex |
I want to strangle the fucker that decided that 1/4" tape cartridges are better than open-reel tape |
05:36
🔗
|
chronomex |
STRANGLE you hear me |
05:52
🔗
|
SketchCow |
Yeah |
05:52
🔗
|
SketchCow |
batcave went south, can't get anyone to reset. |
05:53
🔗
|
SketchCow |
So heartbroken, I know |
05:53
🔗
|
chronomex |
D: |
06:19
🔗
|
SketchCow |
http://www.freshdv.com/wp-content/uploads/2011/10/hurlbut-letus-41.jpg |
06:19
🔗
|
SketchCow |
What a way to jizz up a perfectly fine DSLR |
06:20
🔗
|
chronomex |
wow that's a lot of shit to bolt onto a dslr |
06:21
🔗
|
bbot_ |
wow |
06:21
🔗
|
bbot_ |
I count... four different handles? |
06:55
🔗
|
SketchCow |
http://www.archive.org/search.php?query=collection%3Aarchiveteam-yahoovideo&sort=-publicdate |
06:55
🔗
|
SketchCow |
Back in business. |
06:55
🔗
|
chronomex |
speaking of video: http://ia700209.us.archive.org/6/items/dicksonfilmtwo/DicksonFilm_High_512kb.mp4 |
06:55
🔗
|
chronomex |
cool shit |
07:02
🔗
|
SketchCow |
Yeah, going to let those go |
07:02
🔗
|
SketchCow |
And get some rest, then back up |
07:02
🔗
|
SketchCow |
There's so much stuff uploading now, the machine's finally emptying out |
07:07
🔗
|
SketchCow |
Oh, and I found the artist for the archiveteam t-shirt and poster |
07:10
🔗
|
chronomex |
oh? |
07:34
🔗
|
Ymgve |
Dicks On Film? |
07:34
🔗
|
Ymgve |
documentary about chatroulette? |
07:34
🔗
|
Coderjoe |
ah. that explains the rsync troubles |
07:45
🔗
|
Ymgve |
daamn: http://popc64.blogspot.com/ |
07:48
🔗
|
Coderjoe |
lachlan mirror still underway, at 4.2G |
11:10
🔗
|
underscor |
chronomex: http://www.myspace.com/pagefault D: |
11:10
🔗
|
underscor |
hahahaha |
11:18
🔗
|
SketchCow |
Morning, probably need to sleep a tad |
11:18
🔗
|
SketchCow |
But the batcave now has 12tb free |
11:19
🔗
|
SketchCow |
So we have a lot of room again. |
11:36
🔗
|
alard |
SketchCow: The scripts for me.com/mac.com are more or less working now, so that would be a way to get new things to fill it with. |
11:36
🔗
|
SketchCow |
Excellent. |
11:36
🔗
|
SketchCow |
So, we should talk about that. |
11:37
🔗
|
SketchCow |
The number one thing besides making stuff be in a way the wayback machine can accept, when possible, is to have ways to package this crap up into units I can use to upload again. |
11:37
🔗
|
alard |
Yes, probably have a look at the results as well. |
11:37
🔗
|
SketchCow |
I'm starting down the google groups stuff, and oh man, this is going to take forever. |
11:38
🔗
|
ersi |
Did wayback successfully swallow the earlier warc-files btw? |
11:38
🔗
|
SketchCow |
They've been doing lots of runs against them. |
11:38
🔗
|
SketchCow |
I don't know how many are fully in but that work is being done. |
11:38
🔗
|
alard |
MobileMe works with usernames, so there's not an easy way to group it into numbered chunks. (And the full list of usernames is not yet available.) |
11:38
🔗
|
ersi |
So that's a yes? |
11:39
🔗
|
SketchCow |
I am pretty sure it's a yes. |
11:39
🔗
|
ersi |
Awesome, to 11 |
11:39
🔗
|
alard |
Even the wget-warc ones? That's good news. |
11:41
🔗
|
SketchCow |
So, I asked archive team to back up a site. |
11:41
🔗
|
SketchCow |
Someone came out and said he was doing it, but he got me nervous because he basically said "their robots.txt is blocking the images!" |
11:42
🔗
|
SketchCow |
Which is like a private detective saying "and then they walked into a building that said no trespassers!" |
11:42
🔗
|
SketchCow |
11:31 <bearh> I have the backup of csoon.com |
11:42
🔗
|
SketchCow |
11:45 <bearh> And i'm kinda unsure where to upload it. |
11:42
🔗
|
SketchCow |
So, I'd like someone else to do it. |
11:42
🔗
|
SketchCow |
It's not that large. |
11:42
🔗
|
SketchCow |
But it's fucking hilarious. |
11:42
🔗
|
SketchCow |
Died in 2000. |
11:42
🔗
|
alard |
Heh. (Already did it, yesterday. Look in batcave. :) |
11:42
🔗
|
SketchCow |
Been there ever since. |
11:42
🔗
|
SketchCow |
Good deal, thanks. |
11:43
🔗
|
SketchCow |
They're right, that's like finding an untouched dinosaur fossil |
11:44
🔗
|
SketchCow |
I found another amazing site |
11:44
🔗
|
SketchCow |
Collections of old department stores |
11:45
🔗
|
SketchCow |
http://departmentstoremuseum.blogspot.com/ |
11:46
🔗
|
SketchCow |
http://departmentstoremuseum.blogspot.com/2010/06/may-co-cleveland-ohio.html |
11:46
🔗
|
SketchCow |
That is a lot of crazy work |
11:46
🔗
|
SketchCow |
I also had a nice long chat with the head of the CULINARY CURATION GROUP OF THE NEW YORK PUBLIC LIBRARY |
11:46
🔗
|
SketchCow |
Try THAT for crazy |
11:46
🔗
|
SketchCow |
http://legacy.www.nypl.org/research/chss/grd/resguides/menus/ |
11:57
🔗
|
SketchCow |
http://batcave.textfiles.com/ocrcount/ <--- You can see how long batcave was in the shitter |
12:00
🔗
|
ersi |
was that, ocr jobs that were running on batcave? :o |
12:08
🔗
|
SketchCow |
No. |
12:09
🔗
|
SketchCow |
This was me tracking a limit imposed on my ingestion. |
12:09
🔗
|
ersi |
Ah, alrighty |
12:09
🔗
|
SketchCow |
I was using a method that worked fine but was hard on the structure |
12:09
🔗
|
SketchCow |
And got into a fight over that |
12:09
🔗
|
SketchCow |
Part of it was "you shouldn't use that method if there's more than 200 jobs in queue" |
12:09
🔗
|
SketchCow |
Now, over time, that's not going to matter, i.e., a queue will be made that DOESN'T hold the job in queue on the machine, but just generally. |
12:10
🔗
|
SketchCow |
But this was me seeing "So, does it EVER go below 200 or should I even watch" |
12:10
🔗
|
SketchCow |
Answer: Yes |
12:10
🔗
|
ersi |
And bam, you started filling it up gradually instead of appending to an ever increasing derive queue? :) |
12:10
🔗
|
SketchCow |
Fuck no |
12:11
🔗
|
SketchCow |
I slammed that shit up to max |
12:11
🔗
|
ersi |
Then what was the point of that tracking? |
12:11
🔗
|
SketchCow |
To know if I was being lied to |
12:11
🔗
|
SketchCow |
I was not specifically being lied to |
12:11
🔗
|
ersi |
ah |
12:12
🔗
|
SketchCow |
Any time you see me mention interacting with other human beings, ask yourself "So, what's the most hostile interpretation as to why Jason is doing this" |
12:12
🔗
|
SketchCow |
It'll save you time |
12:12
🔗
|
SketchCow |
"Hey, guys, I went out to eat" |
12:12
🔗
|
SketchCow |
Meaning: I got banned from a new diner |
12:13
🔗
|
ersi |
Already known for.. long :) |
12:13
🔗
|
SketchCow |
Apparently you forgot, twerp! |
12:13
🔗
|
ersi |
Zing! |
12:13
🔗
|
SketchCow |
The brutal thing coming up with yahoo video is I will be writing something that pulls down an item, does huge stats on it, then uploads again. |
12:14
🔗
|
ersi |
hm, I should get going on instructables again |
12:14
🔗
|
ersi |
that thing is fuckin' huge though |
12:15
🔗
|
SketchCow |
It's funny for me that I now go into a directory on batcave, see it's 35gb, go "oh." |
12:15
🔗
|
SketchCow |
I've put up 400gb items |
12:15
🔗
|
SketchCow |
This is going to be hilarious |
12:17
🔗
|
SketchCow |
http://googleblog.blogspot.com/2011/10/fall-sweep.html |
12:18
🔗
|
SketchCow |
Shutting down: Code Search, Google Buzz, Jaiku, Google Labs (Immediately), University Research Program for Google Search |
12:18
🔗
|
ersi |
Yeah |
12:18
🔗
|
SketchCow |
Boutiques.com and like.com gone |
12:19
🔗
|
SketchCow |
Code Search was critical |
12:47
🔗
|
alard |
What would you like to get from the me.com/mac.com downloaders? At the moment, they produce: |
12:48
🔗
|
alard |
1. a warc.gz for web.me.com (plus xml index and log file) |
12:48
🔗
|
alard |
2. a warc.gz for homepage.mac.com (plus a log file) |
12:48
🔗
|
alard |
3. the xml feed for public.me.com, plus a copy of the file structure + the headers for each file (not warc) |
12:49
🔗
|
alard |
4. the xml feed for gallery.me.com, plus a zip file for each gallery |
13:37
🔗
|
SketchCow |
Hmmm. |
13:37
🔗
|
SketchCow |
I'd like all of it - what's the size differential. |
13:42
🔗
|
alard |
You do get all of the content, it's just a question of in what form you'd like to get it. |
13:42
🔗
|
alard |
Just a WARC or also separate files, that sort of thing. |
13:45
🔗
|
alard |
Here's an example listing of what it produces now: http://pastebin.com/raw.php?i=438zhmSR |
13:46
🔗
|
SketchCow |
http://vimeo.com/28173775 |
13:46
🔗
|
alard |
The files can get quite large (up to 2 GB for the users I've tried so far), so I don't think it's useful to have the data in more than one form. |
13:46
🔗
|
SketchCow |
I think it could be. |
13:47
🔗
|
SketchCow |
WARC is so forward looking, but you can't use it for anything BUT wayback. |
13:47
🔗
|
alard |
Or you have to run a WARC extractor to create the structure wget would create otherwise. |
13:48
🔗
|
SketchCow |
Hmmm. |
13:48
🔗
|
alard |
So you'd like to have the wget copy as well? |
13:48
🔗
|
SketchCow |
Well, you know, I could see that. |
13:48
🔗
|
alard |
With or without link conversion? |
13:48
🔗
|
SketchCow |
Massive post-processing. |
13:48
🔗
|
SketchCow |
I am fine with massive post-processing. |
13:48
🔗
|
SketchCow |
So WARC might make the most sense. |
13:48
🔗
|
SketchCow |
I'd like to run that against your warcs we've added already to archive.org, see how that looks. |
13:48
🔗
|
alard |
It does save a lot of duplicate uploading. |
13:49
🔗
|
SketchCow |
Agreed. |
13:49
🔗
|
SketchCow |
And the thing with these machines I have, they suck down data at 40-80MB a second. |
13:49
🔗
|
SketchCow |
So it can yank it down, rejigger, upload |
13:50
🔗
|
alard |
(As a reference: the four users I have now have 3.6GB of data together. But maybe I chose the wrong examples.) |
13:50
🔗
|
SketchCow |
Wow, what the hell. |
13:50
🔗
|
SketchCow |
Can you link me to them? |
13:50
🔗
|
alard |
http://web.me.com/sleemason/ |
13:51
🔗
|
SketchCow |
WARC is the way. |
13:51
🔗
|
alard |
http://homepage.mac.com/ueda_daisuke/ |
13:52
🔗
|
alard |
http://gallery.me.com/amurnieks |
13:52
🔗
|
balrog |
yeah, those. |
13:52
🔗
|
alard |
(each user has something on homepage, gallery, public, web) |
13:53
🔗
|
balrog |
hmm, how does WARC do it? |
13:53
🔗
|
alard |
I currently make WARCs for homepage.mac.com and web.me.com. |
13:53
🔗
|
alard |
For gallery.me.com I download the zip files that the server offers. |
13:53
🔗
|
balrog |
ohh, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml |
13:53
🔗
|
alard |
For public.me.com I download the files. |
13:53
🔗
|
alard |
balrog: Yup. |
13:53
🔗
|
SketchCow |
And this is all closing June 2012? |
13:54
🔗
|
SketchCow |
Are they blocking with robots.txt? |
13:54
🔗
|
balrog |
SketchCow: yes, as per current info |
13:54
🔗
|
SketchCow |
Sorry for not paying more attention, been dealing with data |
13:54
🔗
|
balrog |
SketchCow: last I checked, no, but it's messy to parse because it uses XML and JS |
13:54
🔗
|
balrog |
basically it uses JS to load the web content on many pages |
13:54
🔗
|
balrog |
(from an XML file) |
13:55
🔗
|
alard |
Only gallery.me.com has a robots.txt. public.me.com doesn't, but it is somewhat inaccessible to crawlers. |
13:55
🔗
|
SketchCow |
Well, Jobs is dead, nobody is watching |
13:55
🔗
|
alard |
homepage.mac.com has normal sites, can be crawled. web.me.com has some iWeb sites which are hard to crawl (but it's possible if you use webdav). |
13:56
🔗
|
balrog |
alard: homepage.mac.com could have iWeb sites. |
13:56
🔗
|
alard |
Any examples? The wayback machine doesn't have any. |
13:56
🔗
|
balrog |
I should dig around, but I thought I saw some. |
13:57
🔗
|
SketchCow |
But wow, we're talking a fuckton of data, aren't we. |
13:57
🔗
|
alard |
Not really sure, the gallery/public sections can get large, the web sites are somewhat smaller. |
13:57
🔗
|
SketchCow |
I'm sure this is some related concept to having such intense integration of the OS and the site |
13:57
🔗
|
balrog |
I'm pretty sure there are many GB of data on here. |
13:57
🔗
|
SketchCow |
So people can just blow shit back and forth. |
13:58
🔗
|
alard |
balrog: TB, probably. |
13:58
🔗
|
balrog |
alard: I'll send you a list of homepage.mac.com pulled from my webhistory (which unfortunately doesn't go all that far back) |
13:58
🔗
|
balrog |
SketchCow: what exactly are you referring to? |
13:58
🔗
|
balrog |
alard: a few hundred TB, if you count all the gallery data |
13:58
🔗
|
SketchCow |
I mean that the .me stuff Apple did really smoothed the process of handling data and stuff. |
13:59
🔗
|
SketchCow |
Similar to what we saw with Friendster, when photo albums explode |
13:59
🔗
|
balrog |
yeah, they did. they improved it with iCloud, but took away the web-facing features :[ |
14:01
🔗
|
balrog |
alard: hold on a moment :) |
14:02
🔗
|
balrog |
alard: this is not mac.com but may be useful … http://www.wilmut.webspace.virginmedia.com/notes/webpages.html |
14:02
🔗
|
SketchCow |
http://www.archive.org/details/ARCHIVETEAM-YV-9200002-9299997 |
14:02
🔗
|
SketchCow |
I am going to get in trouble for that one. |
14:03
🔗
|
SketchCow |
There was major debate what the maximum item size should be. |
14:03
🔗
|
SketchCow |
Most people agreed 100gb |
14:03
🔗
|
balrog |
ooooh. |
14:03
🔗
|
SketchCow |
That's 408gb |
14:03
🔗
|
balrog |
why not break it up then? |
14:03
🔗
|
SketchCow |
I meant to but it was in the wrong directory when an uploader script ran |
14:03
🔗
|
SketchCow |
I misread it as 40gb |
14:03
🔗
|
SketchCow |
I may have to yank it down and split it |
14:03
🔗
|
balrog |
urgh. can you take it down? |
14:03
🔗
|
SketchCow |
I am really good at yanking it, ask around |
14:04
🔗
|
SketchCow |
Nothing's breaking, it just becomes harder for it to be moved around. |
14:04
🔗
|
balrog |
alard: you there? |
14:04
🔗
|
alard |
Yes. |
14:04
🔗
|
balrog |
http://pastie.org/private/gi3mrystmzx5ogyeocapg |
14:04
🔗
|
balrog |
that came out of my history |
14:04
🔗
|
balrog |
not all may work though |
14:04
🔗
|
balrog |
and it's short |
14:04
🔗
|
balrog |
there's another db I have which I have to go through |
14:05
🔗
|
balrog |
(raw sql) |
14:07
🔗
|
SketchCow |
http://www.archive.org/details/ARCHIVETEAM-YV-3900000-3999999&reCache=1 |
14:07
🔗
|
SketchCow |
Really, 200gb is not bad for the videos from 100,000 potential userspaces |
14:07
🔗
|
balrog |
isn't that a little large too? |
14:07
🔗
|
SketchCow |
I am fine with 200gb |
14:08
🔗
|
balrog |
alard: I'll grep this db for mac.com/me.com :p |
14:08
🔗
|
balrog |
however, do you know of a regex that can be used? |
14:08
🔗
|
alard |
balrog: I downloaded your list. (Though most of the users were already on my list, it seems.) |
14:08
🔗
|
alard |
grep (homepage|web)\.(me|mac)\.com ? |
14:09
🔗
|
balrog |
I'll get another bigger list, I just need a regex that will get the proper results |
14:09
🔗
|
balrog |
yeah but this is sql |
14:09
🔗
|
balrog |
it's likely to be in the middle of a line |
14:09
🔗
|
balrog |
like, a forum post |
14:09
🔗
|
alard |
I see. Dump all the content, feed it to grep? |
14:09
🔗
|
balrog |
well yeah, I'd be working from a sql dump |
14:09
🔗
|
balrog |
but there's stuff in the middle of lines |
14:10
🔗
|
alard |
In that case, I repeat the previous regexp. |
14:10
🔗
|
balrog |
ok ... |
14:10
🔗
|
balrog |
we'll see if it works. |
14:10
🔗
|
alard |
SketchCow: So I should keep it as WARCs? |
14:11
🔗
|
SketchCow |
Yeah |
14:11
🔗
|
alard |
What about the files public.me.com? |
14:11
🔗
|
SketchCow |
As we discussed, we can make more contemporary extractions. |
14:11
🔗
|
SketchCow |
All of them |
14:11
🔗
|
SketchCow |
archive.org can sustain two copies, one generated from the others. |
14:11
🔗
|
alard |
So don't download them separately, but download to a WARC. |
14:11
🔗
|
SketchCow |
WARC ensures long-term sustaining |
14:11
🔗
|
SketchCow |
This is the tradeoff, which I am fine with |
14:12
🔗
|
SketchCow |
(archive.org prefers we always do WARCs, in return a fuck they do not give how much we waterfall into their serverspace) |
14:12
🔗
|
SketchCow |
This from on-high |
14:12
🔗
|
balrog |
alard: you mean each user in his own WARC? |
14:12
🔗
|
alard |
What about the images on gallery.me.com? I currently ask Apple to produce zip files, which is really handy, but isn't WARC. |
14:12
🔗
|
SketchCow |
If that's the best we can do, that's fine. |
14:12
🔗
|
alard |
balrog: Yes, each user results in four WARCs. |
14:12
🔗
|
balrog |
aha. |
14:12
🔗
|
alard |
SketchCow: You can download the images, it just takes a little longer. |
14:13
🔗
|
alard |
So if WARC is nicer, we should do WARC. |
14:13
🔗
|
SketchCow |
Yes |
14:13
🔗
|
* |
balrog copies over the latest .sql |
14:13
🔗
|
SketchCow |
Also a mess: Our star wars forum thing |
14:13
🔗
|
alard |
(Although I should look at what happens to the album structure if we do that.) |
14:13
🔗
|
SketchCow |
That's what's not up |
14:13
🔗
|
SketchCow |
I trust your judgement, alard. |
14:14
🔗
|
SketchCow |
Now you know big daddy's preferences. |
14:14
🔗
|
alard |
Heh. |
14:14
🔗
|
SketchCow |
I just didn't like us shutting out the potential for contemporary users, and if post-facto conversions to items that are easier to regard are possible then I'm on board. |
14:14
🔗
|
SketchCow |
Where possible, WARC is what the "legit" sites like |
14:14
🔗
|
balrog |
alard: what's used to dump sites as WARC? |
14:15
🔗
|
alard |
wget-warc. |
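A minimal sketch of a typical wget-warc invocation, assuming a wget build with the WARC patch applied (the WARC options here are the ones that later landed in mainline wget) and a placeholder URL:

    wget --mirror --page-requisites --no-parent \
         --warc-file=example-site --warc-cdx \
         --warc-header="operator: Archive Team" \
         "http://example.com/"

This writes example-site.warc.gz and a CDX index alongside the usual mirrored directory tree, so one crawl can feed both the wayback machine and ordinary browsing.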
14:15
🔗
|
balrog |
also does that deal with when you have to use phantomjs? |
14:15
🔗
|
SketchCow |
What's the status on those fucks accepting wget-warc |
14:15
🔗
|
balrog |
or are those special-case? |
14:16
🔗
|
alard |
SketchCow: The last response was 'wow, that diff is huge', and he was inclined not to include it, but offer it as a separate extension (as in: you'd have to enable it before compiling). |
14:16
🔗
|
balrog |
alard: your regex doesn't work :/ |
14:16
🔗
|
balrog |
alard: hmmmm… mailing list? |
14:16
🔗
|
alard |
But I made the mistake of including the whole warctools library, which includes things like the curl-extension etc. |
14:16
🔗
|
SketchCow |
Well optimize and get that in |
14:16
🔗
|
SketchCow |
That's a huge win |
14:17
🔗
|
SketchCow |
It'll change everything out there |
14:17
🔗
|
* |
balrog reads up on regex |
14:17
🔗
|
alard |
Yeah, well, I replied that the files that the wget extension uses are much smaller. I haven't yet got a reply to that. |
14:17
🔗
|
SketchCow |
I say just do it. |
14:17
🔗
|
SketchCow |
It'll make a huge change in the world. |
14:17
🔗
|
alard |
I'll probably make a smaller diff and send that to them. |
14:18
🔗
|
alard |
Or two versions: the small one with built-in warc, the other one with the warctools library included. |
14:18
🔗
|
SketchCow |
I have now discovered I have two .tar files of the same range. |
14:18
🔗
|
ersi |
Kick ass effort alard. Kick ass |
14:18
🔗
|
SketchCow |
One is 111gb. One is 206gb |
14:18
🔗
|
balrog |
huh, why the difference? |
14:18
🔗
|
SketchCow |
NO IDEA |
14:18
🔗
|
alard |
balrog: Did you use grep -E ? |
14:18
🔗
|
balrog |
oops, no :p |
14:19
🔗
|
balrog |
that worked, but it grabbed full lines |
14:19
🔗
|
balrog |
I don't want full lines |
14:19
🔗
|
balrog |
I want to isolate the relevant parts |
14:19
🔗
|
alard |
Maybe do grep -oE "http://(homepage|web)\.(mac|me)\.com/[^/]+" |
14:20
🔗
|
balrog |
alard: does that assume lines start with http://? they don't |
14:21
🔗
|
alard |
Yes, it does. It also assumes that every url ends with a / |
14:21
🔗
|
alard |
grep -oE "(homepage|web)\.(mac|me)\.com/[^ ]+" stops as the first whitespace character. |
14:22
🔗
|
balrog |
URLs are formatted http:// … /username. however they may have text in front, or after them, within the same line |
14:22
🔗
|
balrog |
you could have like "Check out this site: <a href="http://homepage.mac.com/someone">Here!</a>" |
14:22
🔗
|
alard |
Oh, sorry, it doesn't assume that the *line* starts with http://, just that the *url* starts with http://. |
14:22
🔗
|
alard |
grep -oE 'http://(homepage|web)\.(mac|me)\.com/[^/"]+' |
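Putting the pieces of this exchange together, a sketch of the full extraction, assuming a plain-text SQL dump named forum.sql (the filename and output path are hypothetical):

    # pull mac.com/me.com user URLs even when they sit mid-line in a post,
    # then strip the scheme and host to leave a deduplicated username list
    grep -oE 'http://(homepage|web)\.(mac|me)\.com/[^/" ]+' forum.sql \
      | sed -E 's|^http://[^/]+/||' \
      | sort -u > mobileme-usernames.txt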
14:26
🔗
|
balrog |
much shorter list than I expected. |
14:26
🔗
|
alard |
Then it's probably good to check the regexp. |
14:26
🔗
|
balrog |
http://pastie.org/private/l5cjotdi58ttf8bq8g4m8g |
14:26
🔗
|
balrog |
I did. |
14:27
🔗
|
balrog |
the incoming HTML filter would put http:// before all urls |
14:27
🔗
|
balrog |
you have all these? |
14:40
🔗
|
balrog |
alard: did you have these already? |
14:43
🔗
|
alard |
balrog: Just checked, most of them, not all. |
14:43
🔗
|
balrog |
OK |
15:01
🔗
|
alard |
SketchCow: One more question, if you're still there. It's possible to download the gallery contents to WARC. However, I think it doesn't make sense. It certainly wouldn't be useful with the wayback machine. |
15:02
🔗
|
alard |
So I'm thinking that downloading the metadata xml/json and zipping the images per album is the best solution. |
15:03
🔗
|
SketchCow |
I agree, then. |
15:04
🔗
|
alard |
The problem with the gallery is that it isn't really a web page, but a collection of image files that can be rendered in different formats. So for a wayback-thing, you'd have to get every possible format. |
15:14
🔗
|
alard |
Well then, I think that the scripts are finished. |
15:14
🔗
|
alard |
If anyone would like to do a test run, please do! https://github.com/ArchiveTeam/mobileme-grab |
15:42
🔗
|
SketchCow |
-rw-r--r-- 1 root root 205 2011-10-05 17:14 ballsack |
15:42
🔗
|
SketchCow |
-rw-r--r-- 1 root root 2425 2011-10-05 16:20 balls |
15:42
🔗
|
SketchCow |
drwxr-xr-x 2 root root 4096 2011-10-05 17:19 DONE |
15:42
🔗
|
SketchCow |
That's how you know it was me |
15:43
🔗
|
balrog |
LOL |
15:48
🔗
|
lowtekk |
i seem to have acquired an "@", considering I may as well be a stranger, someone should probably take it away |
15:49
🔗
|
balrog |
"@"? |
15:49
🔗
|
lowtekk |
i do enjoy lurking, and as much as i love collecting old documents, i haven't contributed a darn thing to this cause |
15:49
🔗
|
lowtekk |
op status, unless I'm mistaken |
15:49
🔗
|
balrog |
oh, that |
15:50
🔗
|
balrog |
yeah I don't know :p |
15:50
🔗
|
balrog |
I think I was made op here once, though. idk either |
15:50
🔗
|
balrog |
this is efnet though |
15:50
🔗
|
balrog |
if you were to part and return, it would go away |
15:50
🔗
|
lowtekk |
i've grown rather fond of it |
16:00
🔗
|
sp0rus |
lol, i was made ops once in this chan |
16:00
🔗
|
sp0rus |
happens sometimes |
16:01
🔗
|
SketchCow |
It's all on my arbitrary observations, bitches |
16:31
🔗
|
yipdw |
free-flowing ephemeral op-bit |
16:31
🔗
|
yipdw |
probably the best way to avoid power clashes |
16:56
🔗
|
jjonas |
hey friends:) |
16:56
🔗
|
sp0rus |
hello |
16:57
🔗
|
jjonas |
its old news but i think it would make sense to note the closure of labs.google.com somewhere in the archiveteam.org wiki? |
16:58
🔗
|
sp0rus |
do it |
16:59
🔗
|
sp0rus |
http://archiveteam.org/index.php?title=Deathwatch |
17:01
🔗
|
jjonas |
it made me lose thrust in google and google inovation, i miss google sets and google squared |
17:02
🔗
|
jjonas |
ok im going to add a line there and to the article about google |
17:02
🔗
|
ersi |
s/thrust/trust |
17:02
🔗
|
ersi |
I made that spelling error a lot earlier :) |
17:03
🔗
|
jjonas |
of course... |
17:03
🔗
|
SketchCow |
I agree. |
17:04
🔗
|
SketchCow |
Stupid Google |
17:04
🔗
|
SketchCow |
It's not impressive to turn off Google Labs |
17:04
🔗
|
SketchCow |
It was inspiring to go there and see crazy projects |
17:04
🔗
|
SketchCow |
The only (only) justification I can come up with is that people/businesses/entities were monetizing or showing reliance on them |
17:05
🔗
|
ersi |
Closing down Google Code Search is fucking stupid as well |
17:05
🔗
|
jjonas |
that was back when you had gameing equipment by thrustmaster?^^ |
17:05
🔗
|
ersi |
their main shit is/was search once upon a time |
17:07
🔗
|
jjonas |
when did google code search vanish :-O |
17:07
🔗
|
jjonas |
? |
17:07
🔗
|
jjonas |
was it considered part of google labs too? |
17:07
🔗
|
SketchCow |
It's not gone yet |
17:08
🔗
|
SketchCow |
It's being killed |
17:08
🔗
|
SketchCow |
January |
17:08
🔗
|
jjonas |
*sigh* |
17:09
🔗
|
ersi |
Also, no, it was a separate project. |
17:36
🔗
|
Ymgve |
but is there any content in google code search? or was it just an alternative view of stuff that's already on the web? |
17:36
🔗
|
SketchCow |
No content |
17:36
🔗
|
SketchCow |
Just a great tool |
17:36
🔗
|
ersi |
Which still makes it a fucking shame that they're disbanding it |
17:37
🔗
|
Ymgve |
someone tell ms to make bing code search |
17:37
🔗
|
ersi |
I mean, what do you think, when you think Google? Most people think Search. |
17:37
🔗
|
ersi |
Or did, at least. I think of advertisement these days.. and crappy search |
17:42
🔗
|
sep332 |
is there a better search engine? I know blekko and duckduckgo have some cool stuff, but for general web stuff? |
17:45
🔗
|
SketchCow |
grep |
17:52
🔗
|
* |
Coderjoe grumbles |
17:52
🔗
|
Coderjoe |
I am beginning to think I should have used wget-warc |
17:53
🔗
|
Coderjoe |
5GB and still going. apparently there are some books in there too |
17:55
🔗
|
jjonas |
what are you archiving? |
17:55
🔗
|
sp0rus |
Coderjoe: wow, when he popped in talking about the site I expected a few hundred megs tops |
17:56
🔗
|
jjonas |
possibly for google code search there is some rationale to close it down - that it can be used as a tool for hacking in various ways |
17:56
🔗
|
Coderjoe |
jjonas: lachlan.bluehaze.com.au |
17:57
🔗
|
Coderjoe |
australian physicist that died last year. doing an AFK pull |
17:57
🔗
|
Coderjoe |
I should go bluehaze.com.au as well, as that site belonged to a guy that died in 2006 |
17:57
🔗
|
Coderjoe |
s/go/do |
17:58
🔗
|
Coderjoe |
argh. can't type |
17:58
🔗
|
jjonas |
but then, who wrote on top of it that he died in 2010 and that it stays as a memorial? |
17:59
🔗
|
Coderjoe |
the person keeping bluehaze around as well. |
17:59
🔗
|
jjonas |
.... but i really have no idea why they dropped/hid google labs completely |
17:59
🔗
|
jjonas |
i tried to look it up in the waybackmachine |
18:00
🔗
|
jjonas |
to see all the various nice tools/attempts that i dont even remember |
18:00
🔗
|
jjonas |
but its not in the waybackmachine |
18:03
🔗
|
jjonas |
nvm! googlelabs.com is, just the subdomain isnt |
18:15
🔗
|
ersi |
jjonas: That's a fucking stupid ass rationale |
18:15
🔗
|
jjonas |
:D |
18:15
🔗
|
jjonas |
haha |
18:15
🔗
|
ersi |
I mean seriously, punch you in the face stupid |
18:16
🔗
|
Coderjoe |
I can stab someone in the eye with a pencil. should we remove all pencils? |
18:16
🔗
|
jjonas |
i wasnt trying to defend such a rationale |
18:17
🔗
|
ersi |
I didn't perhaps mean you as in you |
18:17
🔗
|
ersi |
If you're a sad frightened panda right now, that is |
18:17
🔗
|
jjonas |
i would just be as surprised about that kind of reasoning |
18:17
🔗
|
jjonas |
that google might have done before deciding to close it down |
18:18
🔗
|
Coderjoe |
heh.. it's like someone went "The terrorists crashed planes into buildings. We must outlaw all planes." |
18:18
🔗
|
jjonas |
than i am about them closing google labs |
18:18
🔗
|
jjonas |
*NOT be as surprised |
18:19
🔗
|
sep332 |
Remember Johnny Long's "Google Hacking" books? |
18:20
🔗
|
sp0rus |
sep332: aye |
18:21
🔗
|
jjonas |
but if google and other big companies would think like you consequently |
18:21
🔗
|
jjonas |
they would have realized many useful features already |
18:22
🔗
|
jjonas |
that aren't there yet |
18:23
🔗
|
jjonas |
if you add this as a firefox bookmark and set keyword "mp3" |
18:23
🔗
|
jjonas |
http://www.google.de/search?hl=de&safe=off&q=intitle%3A%22index.of%22+(mp*|avi|wma|mov)+%s%2Bparent%2Bdirectory+-inurl%3A(htm|html|cf|jsp|asp|php|js)+-site%3Amp3s.pl+-download+-torrent+-inurl%3A(franceradio|null3d|infoweb|realm|boxxet|openftp|indexofmp3|spider|listen77|karelia|randombase|mp3*)&btnG=Suche&meta= |
18:23
🔗
|
ersi |
shrug |
18:23
🔗
|
sep332 |
I think we should remove all CoderJoe's, the world will be safer without their(?) violent imaginations |
18:23
🔗
|
jjonas |
then you can type in the address bar "mp3 any title/artist" |
18:23
🔗
|
jjonas |
and find working mp3 links |
18:24
🔗
|
jjonas |
i changed it the last time like 5 years ago so the excluded spam sites might not be up to date |
18:24
🔗
|
Coderjoe |
yeah... there is apparently another coderjoe out there, whose name is actually Joe |
18:24
🔗
|
Coderjoe |
(mine is not) |
18:24
🔗
|
jjonas |
...but it works |
18:24
🔗
|
jjonas |
and you maybe use something similar already |
18:24
🔗
|
jjonas |
so why does google not have a tab "mp3" next to images,maps,... |
18:25
🔗
|
ersi |
Let's get back to talking about archiveteam stuff instead of fluff |
18:25
🔗
|
Coderjoe |
expected record company outrage? |
18:26
🔗
|
sep332 |
baidu has an mp3 search, mp3.baidu.com |
18:26
🔗
|
jjonas |
thats a different environment, google china also has a million songs freely downloadable |
18:26
🔗
|
jjonas |
(with a chinese IP only of course |
18:27
🔗
|
jjonas |
...... |
18:28
🔗
|
jjonas |
yeah, lets talk about archiving, since i made my point why they maybe would (sadly) close down google code for such a reason :D |
18:29
🔗
|
jjonas |
if you dont mind check my grammar about google labs in http://archiveteam.org/index.php?title=Deathwatch#2011 |
18:34
🔗
|
jjonas |
btw, just to finish the mp3 subtopic condignly: the russian facebook counterpart vkontakte.ru has a great community directory sharing all mp3s paired with lyrics files among 100+ million users just like there is no copyright :D |
18:34
🔗
|
chronomex |
is no copyright in soviet russia |
18:35
🔗
|
chronomex |
nor in capitalist russia |
18:35
🔗
|
jjonas |
so warez sites are legal there too |
18:35
🔗
|
jjonas |
? |
18:35
🔗
|
jjonas |
even if they have international users |
18:35
🔗
|
jjonas |
:-O |
18:35
🔗
|
* |
chronomex shrugs |
18:35
🔗
|
chronomex |
eez joke |
18:36
🔗
|
ersi |
Calm the fuck down |
18:36
🔗
|
* |
ersi brings out the sedatives |
18:36
🔗
|
SketchCow |
http://yfrog.com/z/obj01nxj |
18:36
🔗
|
ersi |
SketchCow: Hah, awesome |
18:37
🔗
|
jjonas |
:) im not nervous, just kidding |
20:19
🔗
|
Coderjoe |
oh joy |
20:19
🔗
|
Coderjoe |
I don't know where the link was that caused me to go astray |
20:19
🔗
|
Coderjoe |
but apparently, the server has no trouble treating html files as directories |
20:19
🔗
|
Coderjoe |
http://lachlan.bluehaze.com.au/deep.html/books/usa2001/usa2001/usa2001/gnomes.html |
20:20
🔗
|
Coderjoe |
that gives you the "deep.html" page |
20:20
🔗
|
Frigolit |
that's called "path info" |
20:20
🔗
|
Coderjoe |
yes, i know. and I've used it on php, just not html |
20:21
🔗
|
Coderjoe |
but there is a bad link that led me to an infinite recursion problem |
20:21
🔗
|
Frigolit |
ah |
20:23
🔗
|
Coderjoe |
my apache config at home does not appear to allow pathinfo on html, but then I am not parsing html (while the lachlan server is) |
20:25
🔗
|
Coderjoe |
somewhere on that site is at least one bad link that adds a directory level to the entire site |
20:30
🔗
|
Coderjoe |
i'm going to terminate that until I have a chance to inspect things a bit more |
21:40
🔗
|
Paradoks |
http://www.economist.com/node/21529030 |
21:41
🔗
|
Paradoks |
Scanning and destroying books, for a fee. I wonder if this horrifies Sketchcow. Obviously, it's not archiving, though some people might use it that way. |
21:44
🔗
|
Coderjoe |
scanning good. destruction BAD |
21:46
🔗
|
sep332 |
related blog post on it http://ascii.textfiles.com/archives/2672 |
21:52
🔗
|
goekesmi |
It's always a hard call when it comes to books. |
22:20
🔗
|
dashcloud |
if the book needs to be destroyed, I'm expecting perfection for results- anything less isn't worth it (for the sake of archiving, it's not worth it, but I'm sure many people would be happy to make that choice) |
22:21
🔗
|
yipdw |
oh, I dunno -- people seem perfectly happy to accept 1080p masters for films these days |
22:24
🔗
|
sp0rus |
yeah, but people are stupid |
22:24
🔗
|
yipdw |
at least 1DollarScan/Bookscan seem to be clear that they only do this for mass-market copies |
22:24
🔗
|
yipdw |
that seems to be a bit more sane |
22:24
🔗
|
yipdw |
well, I think, I dunno -- it's not spelled out in that article |
22:26
🔗
|
sp0rus |
if it's mass-market and not hard to find, that's a little different |
22:27
🔗
|
yipdw |
right |
22:27
🔗
|
yipdw |
I think that's the intent here |
22:30
🔗
|
dashcloud |
yipdw: what quality masters should people be asking for? |
22:32
🔗
|
SketchCow |
Hiiii |
22:33
🔗
|
yipdw |
dashcloud: the highest available, which for some films is 1080p -- Ultraviolet and new scenes in Star Wars come to mind |
22:33
🔗
|
yipdw |
dashcloud: but it's more that 1080p is markedly inferior in terms of resolution to earlier production techniques, and what with the availability of digital cameras like the RED ONE system it doesn't have to be that way |
22:33
🔗
|
yipdw |
so, yeah, more of an offhand snark |
22:34
🔗
|
dashcloud |
at least some of the Blender Foundation's open movies are available as higher than 1080p films |
22:36
🔗
|
yipdw |
yeah, and with those it's theoretically better because the film assets are available |
22:36
🔗
|
dashcloud |
here's an awesome article about gifs : http://motherboard.tv/2010/11/19/the-gif-that-keeps-on-gifing-why-animated-images-are-still-a-defining-part-of-our-internets |
22:37
🔗
|
yipdw |
I say theoretically because I sure as hell haven't been able to e.g. re-render Big Buck Bunny from the assets directory :P |
22:37
🔗
|
dashcloud |
I know the 2k frames were/are available from xiph's sample site |
22:41
🔗
|
SketchCow |
My attitude on 1dollarbookscan is it makes more sense than throwing them out |
22:54
🔗
|
SketchCow |
Barely |
23:20
🔗
|
chronomex |
^ |
23:45
🔗
|
underscor |
BURP |
23:45
🔗
|
underscor |
Another 300GB into the archive |