[02:18] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[02:19] LD100, yahoosucks
[02:22] thx
[09:00] we got a channel for panoramio yet?
[09:38] #paranormio
[09:39] -- Help us save the full website.
[09:39] -- Panoramio is closing!
[09:39] -- Starting in a few days: #paranormio
[09:39] ---------------------------------------------------
[10:59] SketchCow: For now I tested one item of panoramio
[10:59] I'll test more, but if that item has an average size we are looking at around 500 TB of data in total
[11:00] The biggest part of that 500 TB is because of the /original/ and /1920x1280/ versions of pictures, which can be up to 8 MB
[11:01] When I find a picture I'm currently generating: /mini_square/, /square/, /thumbnail/, /sa
[11:02] /small/, /medium/, /large/, /original/, and /1920x1280/
[11:02] for the domains static.panoramio.com.storage.googleapis.com and static.panoramio.com
[11:03] and everything except /large/, /original/ and /1920x1280/ for mw2.google.com
[11:05] so, my question is, what do you want me to do? grab all the different image sizes? grab only images linked to from html pages and not generate the urls (/original/, etc.)? grab all generated images except /original/ and /1920x1280/?
[11:09] SketchCow: example list of urls currently saved for a found image: http://paste.archivingyoursh.it/raw/jagovusoda
[11:26] personally I think we should grab (only) original as every other picture could be rebuilt from original
[11:28] LD100: no, we will not rebuild pictures!
[11:33] also, /original/ only would still be a few hundred TB
[11:37] yea I just mean you could rebuild it in the future if you want to fully restore the website.
[11:38] I can't see any real benefit to multiple pictures with lower resolution.
[11:39] i had that argument with Arkiver2 with twitpic and it went nowhere
[11:59] Arkiver2: mind the meta-data. most of the value of panoramio comes from the locations connected to the pictures. are they (always) in the exif tag? or do they need to be saved alongside the picture?
[12:11] nblr: I saved everything I could find
[12:11] but there might be some urls that I haven't found yet
[12:12] Please give me examples of any important urls and I'll make sure they will be saved
[13:20] I don't mind if you save all picture resolutions (It's not my disc space) but I would miss the originals if you don't save those.
[13:21] we're going to get at least 300 TB (could also be 500 TB) of originals if we save them
[13:22] And that is a lot... so I want to wait for a decision on what to do from SketchCow
[13:38] hm ok that might be a problem. are we just saving pics or comments/etc too?
[13:40] LD100: everything, even the kml files that are used to locate the pictures in google earth :)
[13:43] ah ok. the pictures seem not to have exif data with coordinates.
[13:46] as the pictures should get migrated to google view, the resolution might not be that important as long as it is possible to connect the view image to the panoramio page.
[13:48] please send me any urls you can find which are used for metadata
[14:13] For me, saving the same thing in several different formats from several different servers is an affordable luxury _only if_ you have enough time to grab everything, and enough space to store it all.
[14:14] if time or space are limited - there's a case to be made to 'just get ONE version of the best quality asset', in my view.
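As a rough illustration of the size/domain expansion described above (11:01-11:05), here is a minimal sketch in Python; it is not the actual panoramio-grab code. The /photos/<size>/<id>.jpg layout is assumed from the example URLs pasted later in this log, and the mw2.google.com path in particular is a guess that may differ from the real site.

# Hypothetical sketch of the URL generation described above; assumptions noted inline.

SIZES = ["mini_square", "square", "thumbnail", "small",
         "medium", "large", "original", "1920x1280"]

# mw2.google.com is only generated for the smaller renditions (see 11:03).
MW2_EXCLUDED = {"large", "original", "1920x1280"}

def candidate_urls(photo_id):
    """Return every image URL that would be generated for one photo ID."""
    urls = []
    for size in SIZES:
        for host in ("static.panoramio.com",
                     "static.panoramio.com.storage.googleapis.com"):
            urls.append("http://%s/photos/%s/%d.jpg" % (host, size, photo_id))
        if size not in MW2_EXCLUDED:
            # Assumed path layout; the real mw2.google.com URLs may differ.
            urls.append("http://mw2.google.com/photos/%s/%d.jpg" % (size, photo_id))
    return urls

if __name__ == "__main__":
    # 83983236 is the photo ID used as an example later in the log.
    for url in candidate_urls(83983236):
        print(url)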
[14:14] we have enough time to grab everything most likely
[14:14] sounds good
[14:14] as for space, I'm waiting for an answer on how much we can grab
[14:15] MobileMe was 200 TB, and that was 2 years ago
[14:15] I think 600 TB now would be the same as 200 TB back then
[14:16] it would be awesome if we had some distributed redundant file system to start those kinds of things without waiting for space
[14:17] definitely, schbirid - that's being discussed in #huntinggrounds
[14:17] nice
[14:18] SketchCow is the ultimate arbiter of what AT can push to IA, so it's for him to decide whether he wants just original images or original + multiple resized versions + duplicates from other CDNs, and so on
[14:18] or only images linked to from html pages
[14:19] indeed
[14:19] Those are the most important when it comes to viewing pages in the wayback machine
[14:19] although don't the HTML pages link to the multiple size options, a la flickr?
[14:19] The pages will look unfinished and ugly without them
[14:19] antomatic: no
[14:19] only the large version
[14:20] ah, cool. agreed, wayback presentation is a high priority (I would imagine)
[14:20] * arkiver is away for a bit
[18:25] I'm talking to the head of Panic, Inc. about possibly funding a donation to archive.org for storage of Twitpics or paying for it to be "sleeping on the couch" for a year.
[18:25] That's what
[18:26] I would prefer we just saved the full-resolution photos from Panoramio
[18:27] wait... that Panic?
[18:28] (the one that makes Mac software)
[18:46] Yes
[18:57] shiny
[18:57] has anyone here played with libvirt's lxc sandboxes?
[19:15] The Great Shoving of multiple archiveteam projects into Internet Archive is happening now.
[19:18] :)
[19:19] https://archive.org/details/archiveteam_quizilla
[19:19] https://archive.org/details/archiveteam_ancestry
[19:21] https://archive.org/details/archiveteam_swipnet
[19:24] SketchCow: will it be a comfy couch?
[19:28] I hope so
[19:28] SketchCow: Only original (around 300 TB probably) from both domains or only from static.panoramio.com, or also the html and images linked to from the site?
[19:29] This is a major problem.
[19:31] I think the best thing is to do html and images linked to and embedded from the html pages. I can later make a separate project to do originals too (like we did with the cloudfront grab for twitpic, which gave us 55 TB)
[19:32] That means I will not generate any of the other sizes (/mini_square/, /square/, /thumbnail/, etc.)
[19:32] These are just too big for archive.org.
[19:33] MobileMe was 200 TB. Besides, I'm estimating the size of only html and embedded pictures to be around 100 TB (50 TB more or less is possible)
[19:33] that means no originals
[19:34] MobileMe nearly cost me my job.
[19:34] It is a poor comparison.
[19:34] google is deleting the photos as well?
[19:34] Google is moving the photos, it appears.
[19:34] yes
[19:36] Would be nice to get a dump of freely licensed originals though
[19:36] They're probably a fraction of the total? I don't remember
[19:39] I'm trying to think what to do here.
[19:39] These are not tiny sites.
[19:39] i thought all accounts were migrated already. they did it to mine without asking
[19:39] Twitpic is just goddamned huge.
[19:39] Will make a small list with options and what the different domains mean
[19:39] And this is goddamned huge.
[19:39] Do we have any others?
[19:39] Not right now
[19:40] Maybe orkut
[19:40] But no, arkiver, you can't just go "well, you once took MobileMe, surely you can take another couple huge ones."
[19:40] Archive.org has 20 petabytes TOTAL
[19:40] But yipdw wants to save that through the /save/ feature
[19:40] Sorry about that then, bad thinking on my part
[19:42] So let's say we are saving picture 83983236
[19:42] I'm grabbing/generating the following:
[19:42] http://paste.archivingyoursh.it/raw/fawikehico
[19:43] http://static.panoramio.com.storage.googleapis.com/ is mostly used on ssl.panoramio.com, I found
[19:43] As for the rest
[19:44] panoramio prefers http://mw2.google.com/ over http://static.panoramio.com/ when the picture is a /mini_square/, /square/, /thumbnail/, /small/ or /medium/. The others go to static.panoramio.com
[19:45] https://ssl.panoramio.com/photos/large/83983236.jpg > https://static.panoramio.com.storage.googleapis.com/photos/large/83983236.jpg
[19:46] http://www.panoramio.com/photos/large/83983236.jpg > http://static.panoramio.com/photos/large/83983236.jpg
[19:46] So, look.
[19:46] (why can't they be easier)
[19:47] Since it's clear they're going to try and delete comments, groups, etc, we definitely should go after those.
[19:47] Hosting 5+ versions of every photo is batshit insane, we just can't do that.
[19:47] Even with twitpic, we saved the biggest one.
[19:47] you can always thumbnail the html when the smaller ones are gone.
[19:47] http://paste.archivingyoursh.it/raw/fawikehico <- still has identical duplicates (just different domain/URL) i think
[19:48] But we're also running into "can this even be brought back under the wayback machine"
[19:50] so what we are going to do:
[19:50] - Html and metadata pages
[19:50] - (embedded images?)
[19:50] - (originals?)
[19:50] If we have a decision on those two I can start finishing the scripts and start the grab
[19:50] * arkiver is brb
[19:51] At least grab originals and embeds and HTML/Metadata
[19:51] intermediates can die.
[19:51] But also, I am not sure IA can even take this.
[19:54] * LD100 supports SketchCow
[19:55] and even if we were to talk to Jimmy from WMF, they'd probably be pissed about having to host stuff
[19:55] so that doesn't look like an option either
[19:55] ...Jimbo Wales?
[19:55] yep
[19:55] That guy and I really hate each other, you know that, right.
[19:55] I mean really unpleasantly.
[19:56] oh, well TIL
[19:56] so definitely not an option
[19:56] WMF doesn't have that space anyway.
[19:56] We need, by my back of napkin... 500 TB for this year's crap.
[19:57] WMF wouldn't host even 5 TB
[19:57] Just wanted to remind everyone that there is a #paranormio channel
[19:57] okay, let me go list that
[20:02] Your.Org has plenty of space (nobody else offered to mirror Wikimedia projects' media) but probably not that much
[20:09] SketchCow: I'm going to run a grab with html, embed and original and will tell you what percentage of the originally estimated size we lose
[20:11] also, originals from static.panoramio.com or from static.panoramio.com.storage.googleapis.com ?
[20:11] (I'd do static.panoramio.com)
[20:16] https://github.com/ArchiveTeam/panoramio-grab/commit/400cc5f63cb30cec077097b2d8ad112e4153ec63 < see description
[20:39] SketchCow: still running the grab with the new decisions (see the github commit description above), but don't expect a whole lot of saved space!
[20:40] It will still be very large :/
[20:49] SketchCow: do you happen to know why msnbc is not in the 9/11 video collection for 9/11 to 9/13?
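For reference, a minimal sketch (an assumption, not the project's actual code) of the page-URL to CDN-URL mapping described above at 19:43-19:46: ssl.panoramio.com is backed by the googleapis bucket, www.panoramio.com by static.panoramio.com, and mw2.google.com is preferred for the smaller sizes (its path layout is not given in the log, so it is left out here).

def cdn_url(page_url):
    """Map a panoramio page-facing image URL to its CDN equivalent (sketch)."""
    if page_url.startswith("https://ssl.panoramio.com/"):
        return page_url.replace(
            "https://ssl.panoramio.com/",
            "https://static.panoramio.com.storage.googleapis.com/", 1)
    return page_url.replace(
        "http://www.panoramio.com/", "http://static.panoramio.com/", 1)

# The two example mappings quoted in the log:
assert cdn_url("https://ssl.panoramio.com/photos/large/83983236.jpg") == \
    "https://static.panoramio.com.storage.googleapis.com/photos/large/83983236.jpg"
assert cdn_url("http://www.panoramio.com/photos/large/83983236.jpg") == \
    "http://static.panoramio.com/photos/large/83983236.jpg"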
[20:49] i ask cause it's there for 9/14 to 9/17
[20:50] No idea
[20:51] ok
[20:51] Asking the TV goddess
[20:57] same goes for fox news
[20:57] but fox news is nowhere in the 9/11 collection
[21:06] SketchCow: the size of the new grab is 53% of the size with all images included
[21:06] I'm now going to do a grab without originals
[21:07] How about avatars and similar smaller resources?
[21:08] those are all being done. they are embedded images, which are being grabbed
[21:09] ok
[21:10] How big is that 53%?
[21:10] godane:
[21:10] [5:08:08 PM] Tracey Jaquith (Internet Archive): prolly recorder went down
[21:10] [5:08:38 PM] Tracey Jaquith (Internet Archive): oh, no prolly we weren't recording it until 9/14
[21:10] [5:08:54 PM] Tracey Jaquith (Internet Archive): betcha rod scrambled to get it, when someone said "we need X channel!!"
[21:10] [5:10:34 PM] Jason Scott: Got it
[21:11] that's what i figured
[21:12] that's the reason why i upload things the way i do
[21:12] i have no idea if you guys have it or the system was not recording it at the time
[21:13] hopefully we get those days from that 40k VHS collection IA got
[21:22] 40k VHS tapes?
[21:22] Holy mackerel.
[21:23] http://www.fastcompany.com/3028069/the-internet-archive-is-digitizing-40000-vhs-tapes
[21:23] APerti: http://www.dailydot.com/entertainment/marion-stokes-vhs-internet-archive-input/
[21:23] Rad.
[21:24] i really hope we get 'The Site' series that aired on MSNBC from that vhs collection
[21:24] What hardware, software, and codecs are going to be used?
[21:26] Wow. Metadata Fest 40,000!
[21:26] a complete treasure trove of closed-captioning data too
[21:26] probably some of the earliest examples of the medium
[21:27] archive.org's m.o. with vhs tapes seems to be to rip to dvd-quality mpeg2, which is then provided in web-quality mp4 and ogg streams
[21:27] Excellent. DVD standard is basically twice the resolution of most VHS tapes.
[21:27] you can see it with the 'input' collection which they've already done https://archive.org/details/marionstokesinput
[21:28] https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
[21:28] APerti: Depends. Full quality DVD is twice-VHS or more, but you can easily record to DVD at half-resolution if you want more recording time
[21:30] Yes, I'm aware. MPEG-2 streams allow for many resolutions. I was speaking to the "standard" of 720x480.
[21:31] * antomatic nods.
[21:31] also some of it is betamax
[21:31] yay! :)
[21:31] * antomatic has 2x Betamax
[22:07] so i saved a segment of the 'NBC Today Show' with Arlen Specter in it
[22:08] i brute-forced that one from the TV Guide description
[22:08] this one: http://www.tvguide.com/detail/tv-show.aspx?tvobjectid=100548&more=ucepisodelist&episodeid=2952892
[22:18] SketchCow:
[22:18] For example pack image99pack:969175
[22:18] - everything: 719 MB
[22:19] - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 382 MB
[22:20] - html (www.panoramio.com + ssl.panoramio.com) + embed: 120 MB
[22:20] let's take 600 TB (probably a bit too high)
[22:20] - everything: 600 TB
[22:20] - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 319 TB
[22:21] arkiver: how much duplication is there in getting www. and ssl. ? Is there anything to be saved by just getting one?
[22:21] - html (www.panoramio.com + ssl.panoramio.com) + embed: 100 TB
[22:21] SketchCow: the last option would be doable (assuming 100 TB can be stored at IA)
[22:21] couldn't we start with the smallest and run that? when it finishes we'll have a better estimate for adding originals, and can then decide if we should run again grabbing only originals
[22:22] Wasn't 100 TB a problem when we did twitpic?
[22:22] That's why we tried to come up with our own storage until IA could handle something like that
[22:22] 100 TB was too much for them
[22:22] at $2,000 per TB, every TB is a problem, potentially
[22:23] SketchCow: I can first run a grab for html (www.panoramio.com + ssl.panoramio.com) + embed. When that is finished I can create a new grab for originals only
[22:23] I think that would be the best thing to do
[22:23] We do need to respect that and consider the rule of 'MVA' - Minimum Viable Archive.
[22:23] arkiver: yep, that's what I mean
[22:24] I'm off now
[22:24] night arkiver!
[22:25] Well, if the pictures are being transferred, do they need to get grabbed at all?
[22:25] friendly reminder, i'm sitting on 60 TB of twitpic data. i do not have much space left to screw around
[22:25] ^
[22:25] Please PM me the final decision and please keep the last option I provided in mind
[22:26] * arkiver is off to bed
[23:44] antomatic: I guess the basic idea here is "let's make the $2000 problem at least worth the headache, with good data"
[23:44] :)
[23:44] useful TBs etc
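The TB estimates in the 22:18-22:21 messages are a straight ratio extrapolation from the sample pack. A small sketch of that arithmetic, with the per-pack figures taken from the log and 600 TB as the rough total used there:

# Back-of-envelope check of the estimates above (sketch; figures from the log).
everything_mb = 719.0        # pack image99pack:969175, all sizes and domains
with_original_mb = 382.0     # html + embed + original
html_embed_mb = 120.0        # html + embed only
total_everything_tb = 600.0  # rough estimate of the full-site size

for label, mb in [("html + embed + original", with_original_mb),
                  ("html + embed", html_embed_mb)]:
    share = mb / everything_mb
    print("%s: %.0f%% of everything, ~%.0f TB of %.0f TB"
          % (label, share * 100, share * total_everything_tb, total_everything_tb))

# Prints roughly 53% / ~319 TB and 17% / ~100 TB, matching the numbers quoted above.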