#archiveteam 2014-10-05,Sun


Time Nickname Message
02:18 🔗 LD100 WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
02:19 🔗 garyrh LD100, yahoosucks
02:22 🔗 LD100 thx
09:00 🔗 Kazzy we got a channel for panoramio yet?
09:38 🔗 Arkiver2 #paranormio
09:39 🔗 Arkiver2 -- Help us save the full website.
09:39 🔗 Arkiver2 -- Panoramio is closing!
09:39 🔗 Arkiver2 -- Starting in a few days: #paranormio
09:39 🔗 Arkiver2 ---------------------------------------------------
10:59 🔗 Arkiver2 SketchCow: So far I've tested one item of panoramio
10:59 🔗 Arkiver2 I'll test more, but if that item has an average size we are looking at around 500 TB of data in total
11:00 🔗 Arkiver2 The biggest part of that 500 TB is because of the /original/ and /1920x1280/ versions of pictures, which can be up to 8 MB each
11:01 🔗 Arkiver2 When I find a picture I'm currently generating: /mini_square/, /square/, /thumbnail/,
11:02 🔗 Arkiver2 /small/, /medium/, /large/, /original/, and /1920x1280/
11:02 🔗 Arkiver2 for the domains static.panoramio.com.storage.googleapis.com and static.panoramio.com
11:03 🔗 Arkiver2 and everything except /large/, /original/ and /1920x1280/ for mw2.google.com
11:05 🔗 Arkiver2 so, my question is, what do you want me to do? grab all the different image sizes? grab only images linked to from html pages and not generate the urls (/original/, etc.)? grab all generated images except /original/ and /1920x1280/?
11:09 🔗 Arkiver2 SketchCow: example list of urls currently saved for a found image: http://paste.archivingyoursh.it/raw/jagovusoda
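(For reference, a minimal Python sketch of the URL generation Arkiver2 describes. The /photos/<size>/<id>.jpg path pattern is taken from example URLs later in this log; the host/size combinations follow Arkiver2's description above and may not be exhaustive.)

```python
# Hypothetical sketch -- hosts, sizes, and path pattern are taken from
# this log, not from the actual panoramio-grab scripts.
SIZES = ["mini_square", "square", "thumbnail", "small",
         "medium", "large", "original", "1920x1280"]

HOSTS = {
    "http://static.panoramio.com": SIZES,
    "http://static.panoramio.com.storage.googleapis.com": SIZES,
    # mw2.google.com serves everything except the three largest variants
    "http://mw2.google.com": [s for s in SIZES
                              if s not in ("large", "original", "1920x1280")],
}

def candidate_urls(photo_id):
    """Yield every host/size variant for one photo id."""
    for host, sizes in HOSTS.items():
        for size in sizes:
            yield "{}/photos/{}/{}.jpg".format(host, size, photo_id)

# e.g. list(candidate_urls(83983236)) -> 21 candidate URLs
```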
11:26 🔗 LD100 personally I think we should grab (only) original, as every other version could be rebuilt from the original
11:28 🔗 Arkiver2 LD100: no, we will not rebuild pictures!
11:33 🔗 Arkiver2 also, /original/ only would still be a few hundred TB
11:37 🔗 LD100 yea I just mean you could rebuild them in the future if you want to fully restore the website.
11:38 🔗 LD100 I can't see any real benefit to keeping multiple lower-resolution copies of the same picture.
11:39 🔗 schbirid i had that argument with Arkiver2 with twitpic and it went nowhere
11:59 🔗 nblr Arkiver2: mind the metadata. most of the value of panoramio comes from the locations connected to the pictures. are they (always) in the exif tag? or do they need to be saved alongside the picture?
12:11 🔗 Arkiver2 nblr: I saved everything I could find
12:11 🔗 Arkiver2 but there might be some urls that I haven't found yet
12:12 🔗 Arkiver2 Please give me examples of any important urls and I'll make sure they will be saved
13:20 🔗 LD100 I don't mind if you save all picture resolutions (It's not my disk space) but I would miss the originals if you don't save them.
13:21 🔗 arkiver we're going to get at least 300 TB (could also be 500 TB) of originals if we save them
13:22 🔗 arkiver And that is a lot... so I want to wait for a decision on what to do from SketchCow
13:38 🔗 LD100 hm ok that might be a problem. are we just saving pics or comments/etc too?
13:40 🔗 arkiver LD100: everything, even the kml files that are used to locate the pictures in google earth :)
13:43 🔗 LD100 ah ok. the pictures seem not to have exif data with coordinates.
13:46 🔗 LD100 as the pictures should get migrated to google views, the resolution might not be that important as long as it is possible to connect the views image to the panoramio page.
13:48 🔗 arkiver please send any urls you can find which are used for metadata to me
14:13 🔗 antomatic For me, saving the same thing in several different formats from several different servers is an affordable luxury _only if_ you have enough time to grab everything, and enough space to store it all.
14:14 🔗 antomatic if time or space are limited - there's a case to be made to 'just get ONE version of the best quality asset', in my view.
14:14 🔗 arkiver we have enough time to grab everything most likely
14:14 🔗 antomatic sounds good
14:14 🔗 arkiver as for space, I'm waiting for the answer of how much we can grab
14:15 🔗 arkiver MobileMe was 200 TB, and that was 2 years ago
14:15 🔗 arkiver I think 600 TB now would be the same as 200 TB back then
14:16 🔗 schbirid it would be awesome if we had some distributed redundant file system to start this kind of thing without waiting for space
14:17 🔗 antomatic definitely, schbirid - that's being discussed in #huntinggrounds
14:17 🔗 schbirid nice
14:18 🔗 antomatic Sketchcow is the ultimate arbiter of what AT can push to IA, so it's for him to decide whether he wants just original images or original + multiple resized versions + duplicates from other CDNs, and so on
14:18 🔗 arkiver or only images linked to from html pages
14:19 🔗 antomatic indeed
14:19 🔗 arkiver Those are the most important when it comes to viewing pages in the wayback machine
14:19 🔗 antomatic although don't the HTML pages link to the multiple size options, a la flickr?
14:19 🔗 arkiver The pages will look unfinished and ugly without them
14:19 🔗 arkiver antomatic: no
14:19 🔗 arkiver only the large version
14:20 🔗 antomatic ah, cool. agreed, wayback presentation is a high priority (I would imagine)
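(A rough sketch of what "only images linked to from html pages" means in practice: extract the image URLs a page actually references, so wayback playback looks right. The URL shape is assumed from the examples in this log, and an actual grab script would more likely hook this into the fetch pipeline than use a standalone regex.)

```python
import re

# Assumed URL shape: http(s)://<host>/photos/<size>/<id>.jpg
EMBED_RE = re.compile(
    r"https?://(?:static\.panoramio\.com|mw2\.google\.com)"
    r"/photos/[a-z0-9_]+/\d+\.jpg")

def embedded_images(html):
    """Return the deduplicated image URLs referenced by an HTML page."""
    return sorted(set(EMBED_RE.findall(html)))
```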
14:20 🔗 * arkiver is away for a bit
18:25 🔗 SketchCow I'm talking to the head of Panic, Inc. about possibly funding a donation to archive.org for storage of Twitpics or paying for it to be "sleeping on the couch" for a year.
18:25 🔗 SketchCow That's what
18:26 🔗 SketchCow I would prefer we just saved the full-resolution photo from Panormaialamfjf
18:27 🔗 balrog wait... that Panic?
18:28 🔗 balrog (the one that makes Mac softwarE)
18:28 🔗 balrog software*
18:46 🔗 SketchCow Yes
18:57 🔗 db48x shiny
18:57 🔗 db48x has anyone here played with libvirt's lxc sandboxes?
19:15 🔗 SketchCow The Great Shoving of multiple archiveteam projects into Internet Archive is happening now.
19:18 🔗 SmileyG :)
19:19 🔗 SketchCow https://archive.org/details/archiveteam_quizilla
19:19 🔗 SketchCow https://archive.org/details/archiveteam_ancestry
19:21 🔗 SketchCow https://archive.org/details/archiveteam_swipnet
19:24 🔗 joepie91 SketchCow: will it be a comfy couch?
19:28 🔗 SketchCow I hope so
19:28 🔗 arkiver SketchCow: Only originals (around 300 TB probably) from both domains, or only from static.panoramio.com, or also the html and images linked to from the site?
19:29 🔗 SketchCow This is a major problem.
19:31 🔗 arkiver I think the best thing is to do html and images linked to and embedded from the html pages. I can later make a separate project to do originals too (like we did with the cloudfront grab for twitpic, which gave us 55 TB)
19:32 🔗 arkiver That means I will not generate any other of the /mini_square/, /square/, /thumbnail/, etc. versions
19:32 🔗 SketchCow These are just too big for archive.org.
19:33 🔗 arkiver MobileMe was 200 TB. Besides, I'm estimating the size of only html and embedded pictures to be around 100 TB (50 TB more or less is possible)
19:33 🔗 arkiver that means no originals
19:34 🔗 SketchCow MobileMe nearly cost me my job.
19:34 🔗 SketchCow It is a poor comparison.
19:34 🔗 chfoo google is deleting the photos as well?
19:34 🔗 SketchCow Google is moving the photos, it appears.
19:34 🔗 arkiver yes
19:36 🔗 Nemo_bis Would be nice to get a dump of freely licensed originals though
19:36 🔗 Nemo_bis They're probably a fraction of the total? I don't remember
19:39 🔗 SketchCow I'm trying to think what to do here.
19:39 🔗 SketchCow These are not tiny sites.
19:39 🔗 chfoo i thought all accounts were migrated already. they did it to mine without asking
19:39 🔗 SketchCow Twitpic is just goddamned huge.
19:39 🔗 arkiver Will make a small list with the options and what the different domains mean
19:39 🔗 SketchCow And this is goddamned huge.
19:39 🔗 SketchCow Do we have any others?
19:39 🔗 arkiver Not right now
19:40 🔗 arkiver Maybe orkut
19:40 🔗 SketchCow But no, arkiver, you can't just go "well, you once took MobileMe, surely you can take another couple huge ones."
19:40 🔗 SketchCow Archive.org has 20 petabytes TOTAL
19:40 🔗 arkiver But yipdw wants to save that through the /save/ feature
19:40 🔗 arkiver Sorry about that then, bad thinking from me
19:42 🔗 arkiver So let's say we are saving picture 83983236
19:42 🔗 arkiver I'm grabbing/generating the following:
19:42 🔗 arkiver http://paste.archivingyoursh.it/raw/fawikehico
19:43 🔗 arkiver I found that http://static.panoramio.com.storage.googleapis.com/ is used mostly on ssl.panoramio.com
19:43 🔗 arkiver As for the rest
19:44 🔗 arkiver panoramio prefers http://mw2.google.com/ over http://static.panoramio.com/ when the picture is a /mini_square/, /square/, /thumbnail/, /small/ or /medium/. The others go to static.panoramio.com
19:45 🔗 arkiver https://ssl.panoramio.com/photos/large/83983236.jpg > https://static.panoramio.com.storage.googleapis.com/photos/large/83983236.jpg
19:46 🔗 arkiver http://www.panoramio.com/photos/large/83983236.jpg > http://static.panoramio.com/photos/large/83983236.jpg
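(In code form, the host preference arkiver just described might look like the sketch below; the size names come from earlier in the log, and the ssl mapping follows the example URL above.)

```python
# Sketch of the observed host-preference rule, not the actual scripts.
SMALL_SIZES = {"mini_square", "square", "thumbnail", "small", "medium"}

def preferred_host(size, ssl=False):
    if ssl:  # ssl.panoramio.com pages point at the googleapis.com mirror
        return "https://static.panoramio.com.storage.googleapis.com"
    if size in SMALL_SIZES:
        return "http://mw2.google.com"
    return "http://static.panoramio.com"
```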
19:46 🔗 SketchCow So, look.
19:46 🔗 SketchCow (why can't they be easier)
19:47 🔗 SketchCow Since it's clear they're going to try and delete comments, groups, etc, we definitely should go after those.
19:47 🔗 SketchCow Hosting 5+ versions of every photo is batshit insane, we just can't do that.
19:47 🔗 SketchCow Even with twitpic, we saved the biggest one.
19:47 🔗 SketchCow you can always thumbnail the html when the smaller versions are gone.
19:47 🔗 schbirid http://paste.archivingyoursh.it/raw/fawikehico <- still has identical duplicates (just different domain/URL) i think
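(schbirid's point is the classic dedup case: identical bytes reachable under several URLs. WARC handles this with revisit records keyed on a payload digest; a toy sketch of the idea, not of any specific tool:)

```python
import hashlib

seen = {}  # payload digest -> first URL stored with that payload

def store_or_revisit(url, payload):
    """Store a payload once; later identical fetches become revisits."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        return ("revisit", seen[digest])  # refer back to the earlier copy
    seen[digest] = url
    return ("stored", url)
```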
19:48 🔗 SketchCow But we're also running into "can this even be brought back under the wayback machine"
19:50 🔗 arkiver so what we are going to do:
19:50 🔗 arkiver - Html and metadata pages
19:50 🔗 arkiver - (embedded images?)
19:50 🔗 arkiver - (originals?)
19:50 🔗 arkiver If we have a decision on those two I can finish the scripts and start the grab
19:50 🔗 * arkiver is brb
19:51 🔗 SketchCow At least grab originals and embeds and HTML/Metadata
19:51 🔗 SketchCow intermediates can die.
19:51 🔗 SketchCow But also, I am not sure IA can even take this.
19:54 🔗 * LD100 supports SketchCow
19:55 🔗 wp494 and even if we were to talk to Jimmy from WMF, they'd probably be pissed about having to host stuff
19:55 🔗 wp494 so that doesn't look like an option either
19:55 🔗 SketchCow ...Jimbo Wales?
19:55 🔗 wp494 yep
19:55 🔗 SketchCow That guy and I really hate each other, you know that, right.
19:55 🔗 SketchCow I mean really unpleasantly.
19:56 🔗 wp494 oh, well TIL
19:56 🔗 wp494 so definitely not an option
19:56 🔗 SketchCow WMF doesn't have that space anyway.
19:56 🔗 SketchCow We need, by my back-of-the-napkin math... 500 TB for this year's crap.
19:57 🔗 Nemo_bis WMF wouldn't host even 5 TB
19:57 🔗 aaaaaaaaa Just wanted to remind everyone that there is a #paranormio channel
19:57 🔗 wp494 okay, let me go list that
20:02 🔗 Nemo_bis Your.Org has plenty of space (nobody else offered to mirror Wikimedia projects' media) but probably not that much
20:09 🔗 arkiver SketchCow: I'm going to run a grab with html, embed and original and will tell you what percentage of the original estimated size we lose
20:11 🔗 arkiver also, originals from static.panoramio.com or from static.panoramio.com.storage.googleapis.com ?
20:11 🔗 arkiver (I'd do static.panoramio.com)
20:16 🔗 arkiver https://github.com/ArchiveTeam/panoramio-grab/commit/400cc5f63cb30cec077097b2d8ad112e4153ec63 < see description
20:39 🔗 arkiver SketchCow: still running a grab with the new decisions (see above github commit description), but don't expect a whole lot of saved space!
20:40 🔗 arkiver It will still be very large :/
20:49 🔗 godane SketchCow: do you happen to know why msnbc is not in the 9/11 video collection for 9/11 to 9/13?
20:49 🔗 godane i ask cause it's there for 9/14 to 9/17
20:50 🔗 SketchCow No idea
20:51 🔗 godane ok
20:51 🔗 SketchCow Asking the TV goddess
20:57 🔗 godane same goes for fox news
20:57 🔗 godane but fox news is nowhere in the 9/11 collection
21:06 🔗 arkiver SketchCow: the size of a new grab is 53% of the size with all images included
21:06 🔗 arkiver I'm now going to do a grab without originals
21:07 🔗 Nemo_bis How about avatars and similar smaller resources?
21:08 🔗 arkiver those are all being done. they are embedded images, which are being grabbed
21:09 🔗 Nemo_bis ok
21:10 🔗 SketchCow How big is that 53%?
21:10 🔗 SketchCow godane:
21:10 🔗 SketchCow [5:08:08 PM] Tracey Jaquith (Internet Archive): prolly recorder went down
21:10 🔗 SketchCow [5:08:38 PM] Tracey Jaquith (Internet Archive): oh, no prolly we weren't recording it until 9/14
21:10 🔗 SketchCow [5:08:54 PM] Tracey Jaquith (Internet Archive): betcha rod scrambled to get it, when someone said "we need X channel!!"
21:10 🔗 SketchCow [5:10:34 PM] Jason Scott: Got it
21:11 🔗 godane thats what i figured
21:12 🔗 godane that's the reason why i upload things the way i do
21:12 🔗 godane i have no idea if you guys have it or if the system was not recording it at the time
21:13 🔗 godane hopefully we get those days from that 40k VHS collection IA got
21:22 🔗 APerti 40k VHS tapes?
21:22 🔗 APerti Holy mackerel.
21:23 🔗 DFJustin http://www.fastcompany.com/3028069/the-internet-archive-is-digitizing-40000-vhs-tapes
21:23 🔗 godane APerti: http://www.dailydot.com/entertainment/marion-stokes-vhs-internet-archive-input/
21:23 🔗 APerti Rad.
21:24 🔗 godane i really hope we get 'The Site' series that aired on MSNBC from that vhs collection
21:24 🔗 APerti What hardware, software, and codecs are going to be used?
21:26 🔗 APerti Wow. Metadata Fest 40,000!
21:26 🔗 antomatic a complete treasure trove of closed-captioning data too
21:26 🔗 antomatic probably some of the earliest examples of the medium
21:27 🔗 DFJustin archive.org's m.o. with vhs tapes seems to be to rip to dvd-quality mpeg2 which is then provided in web-quality mp4 and ogg streams
21:27 🔗 APerti Excellent. DVD standard is basically twice the resolution of most VHS tapes.
21:27 🔗 DFJustin you can see with the 'input' collection which they've already done https://archive.org/details/marionstokesinput
21:28 🔗 APerti https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
21:28 🔗 antomatic APerti: Depends. Full quality DVD is twice-VHS or more, but you can easily record to DVD at half-resolution if you want more recording time
21:30 🔗 APerti Yes, I'm aware. MPEG-2 streams allow for many resolutions. I was speaking to the "standard" of 720x480.
21:31 🔗 * antomatic nods.
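(Back-of-envelope support for the "DVD is roughly twice VHS" claim, assuming the commonly cited ~3 MHz VHS luma bandwidth and standard Rec.601 sampling; exact figures vary by tape and deck.)

```python
vhs_luma_bandwidth_hz = 3.0e6    # approximate; varies by tape/deck
rec601_sample_rate_hz = 13.5e6   # Rec.601 luma sampling (720 active pixels)

nyquist_minimum = 2 * vhs_luma_bandwidth_hz  # ~6 MHz to capture VHS detail
print(rec601_sample_rate_hz / nyquist_minimum)  # ~2.25x headroom
```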
21:31 🔗 DFJustin also some of it is betamax
21:31 🔗 antomatic yay! :)
21:31 🔗 * antomatic has 2x Betamax
22:07 🔗 godane so i saved a segment of the 'NBC Today Show' with Arlen Specter in it
22:08 🔗 godane i brute forced that one from the TV Guide description
22:08 🔗 godane this one: http://www.tvguide.com/detail/tv-show.aspx?tvobjectid=100548&more=ucepisodelist&episodeid=2952892
22:18 🔗 arkiver SketchCow:
22:18 🔗 arkiver For example, pack image99pack:969175:
22:18 🔗 arkiver - everything: 719 MB
22:19 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 382 MB
22:20 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed: 120 MB
22:20 🔗 arkiver let's take 600 TB (probably a bit too high)
22:20 🔗 arkiver - everything: 600 TB
22:20 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 319 TB
22:21 🔗 antomatic arkiver: how much duplication is there in getting www. and ssl. ? Is there anything to be saved by just getting one?
22:21 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed: 100 TB
22:21 🔗 arkiver SketchCow: last option would be doable (assuming 100 TB can be stored at IA)
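(arkiver's extrapolation, reproduced: scale the 600 TB "everything" estimate by each option's share of the 719 MB sample pack.)

```python
sample_mb = {"everything": 719.0,
             "html + embed + original": 382.0,
             "html + embed": 120.0}
total_tb = 600.0  # arkiver's (probably slightly high) full-site estimate

for option, mb in sample_mb.items():
    share = mb / sample_mb["everything"]
    print("{}: {:.0%} -> ~{:.0f} TB".format(option, share, share * total_tb))
# everything: 100% -> ~600 TB
# html + embed + original: 53% -> ~319 TB
# html + embed: 17% -> ~100 TB
```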
22:21 🔗 LD100 couldn't we start with the smallest and run that? when it finishes we'll have a better estimate for adding originals, and can then decide if we should run again grabbing only originals
22:22 🔗 Diesel_ Wasn't 100 TB a problem when we did twitpic?
22:22 🔗 Diesel_ That's why we tried to come up with our own storage until IA could handle something like that
22:22 🔗 Diesel_ 100 TB was too much for them
22:22 🔗 antomatic at $2,000 per TB, every TB is a problem, potentially
22:23 🔗 arkiver SketchCow: I can first run a grab for html (www.panoramio.com + ssl.panoramio.com) + embed. When that is finished I can create a new grab for originals only
22:23 🔗 arkiver I think that would be the best thing to do
22:23 🔗 antomatic We do need to respect that and consider the rule of 'MVA' - Minimum Viable Archive.
22:23 🔗 LD100 arkiver: yep thats what I mean
22:24 🔗 arkiver I'm off now
22:24 🔗 antomatic night arkiver!
22:25 🔗 aaaaaaaaa Well, if the pictures are being transferred, do they need to get grabbed at all?
22:25 🔗 Kenshin friendly reminder, i'm sitting on 60tb of twitpic data. i do not have much space left to screw around
22:25 🔗 antomatic ^
22:25 🔗 arkiver Please PM me the final decision and please keep the last option I provided in mind
22:26 🔗 * arkiver is off to bed
23:44 🔗 joepie91 antomatic: I guess the basic idea here is "let's make the $2000 problem at least worth the headache, with good data"
23:44 🔗 joepie91 :)
23:44 🔗 joepie91 useful TBs etc
