#archiveteam 2014-10-05,Sun


Time Nickname Message
02:18 🔗 LD100 WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
02:19 🔗 garyrh LD100, yahoosucks
02:22 🔗 LD100 thx
09:00 🔗 Kazzy we got a channel for panoramio yet?
09:38 🔗 Arkiver2 #paranormio
09:39 🔗 Arkiver2 -- Help us save the full website.
09:39 🔗 Arkiver2 -- Panoramio is closing!
09:39 🔗 Arkiver2 -- Starting in a few days: #paranormio
09:39 🔗 Arkiver2 ---------------------------------------------------
10:59 🔗 Arkiver2 SketchCow: So far I've tested one item of panoramio
10:59 🔗 Arkiver2 I'll test more, but if that item has an average size we are looking at around 500 TB of data in total
11:00 🔗 Arkiver2 The biggest part of that 500 TB is because of the /original/ and /1920x1280/ versions of pictures, which can be up to 8 MB each
11:01 🔗 Arkiver2 When I find a picture I'm currently generating: /mini_square/, /square/, /thumbnail/,
11:02 🔗 Arkiver2 /small/, /medium/, /large/, /original/, and /1920x1280/
11:02 🔗 Arkiver2 for the domains static.panoramio.com.storage.googleapis.com and static.panoramio.com
11:03 🔗 Arkiver2 and everything except /large/, /original/ and /1920x1280/ for mw2.google.com
11:05 🔗 Arkiver2 so, my question is, what do you want me to do? grab all the different image sizes? grab only images linked to from html pages and not generate the urls (/original/, etc.)? grab all generated images except /original/ and /1920x1280/?
11:09 🔗 Arkiver2 SketchCow: example list of urls currently saved for a found image: http://paste.archivingyoursh.it/raw/jagovusoda
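(For reference, a minimal Python sketch of the URL generation Arkiver2 describes. The /photos/<size>/<id>.jpg path pattern is taken from example URLs later in this log; the host/size combinations follow Arkiver2's description above and may not be exhaustive.)

```python
# Hypothetical sketch -- hosts, sizes, and path pattern are taken from
# this log, not from the actual panoramio-grab scripts.
SIZES = ["mini_square", "square", "thumbnail", "small",
         "medium", "large", "original", "1920x1280"]

HOSTS = {
    "http://static.panoramio.com": SIZES,
    "http://static.panoramio.com.storage.googleapis.com": SIZES,
    # mw2.google.com serves everything except the three largest variants
    "http://mw2.google.com": [s for s in SIZES
                              if s not in ("large", "original", "1920x1280")],
}

def candidate_urls(photo_id):
    """Yield every host/size variant for one photo id."""
    for host, sizes in HOSTS.items():
        for size in sizes:
            yield "{}/photos/{}/{}.jpg".format(host, size, photo_id)

# e.g. list(candidate_urls(83983236)) -> 21 candidate URLs
```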
11:26 🔗 LD100 personally I think we should grab (only) original, as every other version could be rebuilt from the original
11:28 🔗 Arkiver2 LD100: no, we will not rebuild pictures!
11:33 🔗 Arkiver2 also, /original/ only would still be a few hundred TB
11:37 🔗 LD100 yea I just mean you could rebuild them in the future if you want to fully restore the website.
11:38 🔗 LD100 I can't see any real benefit to keeping multiple lower-resolution copies of the same picture.
11:39 🔗 schbirid i had that argument with Arkiver2 with twitpic and it went nowhere
11:59 🔗 nblr Arkiver2: mind the metadata. most of the value of panoramio comes from the locations connected to the pictures. are they (always) in the exif tag? or do they need to be saved alongside the picture?
12:11 🔗 Arkiver2 nblr: I saved everything I could find
12:11 🔗 Arkiver2 but there might be some urls that I haven't found yet
12:12 🔗 Arkiver2 Please give me examples of any important urls and I'll make sure they will be saved
13:20 🔗 LD100 I don't mind if you save all picture resolutions (It's not my disk space) but I would miss the originals if you don't save them.
13:21 🔗 arkiver we're going to get at least 300 TB (could also be 500 TB) of originals if we save them
13:22 🔗 arkiver And that is a lot... so I want to wait for a decision on what to do from SketchCow
13:38 🔗 LD100 hm ok that might be a problem. are we just saving pics or comments/etc too?
13:40 🔗 arkiver LD100: everything, even the kml files that are used to locate the pictures in google earth :)
13:43 🔗 LD100 ah ok. the pictures seem not to have exif data with coordinates.
13:46 🔗 LD100 as the pictures should get migrated to google views, the resolution might not be that important as long as it is possible to connect the views image to the panoramio page.
13:48 🔗 arkiver please send any urls you can find which are used for metadata to me
14:13 🔗 antomatic For me, saving the same thing in several different formats from several different servers is an affordable luxury _only if_ you have enough time to grab everything, and enough space to store it all.
14:14 🔗 antomatic if time or space are limited - there's a case to be made to 'just get ONE version of the best quality asset', in my view.
14:14 🔗 arkiver we have enough time to grab everything most likely
14:14 🔗 antomatic sounds good
14:14 🔗 arkiver as for space, I'm waiting for the answer of how much we can grab
14:15 🔗 arkiver MobileMe was 200 TB, and that was 2 years ago
14:15 🔗 arkiver I think 600 TB now would be the same as 200 TB back then
14:16 🔗 schbirid it would be awesome if we had some distributed redundant file system to start this kind of thing without waiting for space
14:17 🔗 antomatic definitely, schbirid - that's being discussed in #huntinggrounds
14:17 🔗 schbirid nice
14:18 🔗 antomatic Sketchcow is the ultimate arbiter of what AT can push to IA, so it's for him to decide whether he wants just original images or original + multiple resized versions + duplicates from other CDNs, and so on
14:18 🔗 arkiver or only images linked to from html pages
14:19 🔗 antomatic indeed
14:19 🔗 arkiver Those are the most important when it comes to viewing pages in the wayback machine
14:19 🔗 antomatic although don't the HTML pages link to the multiple size options, a la flickr?
14:19 🔗 arkiver The pages will look unfinished and ugly without them
14:19 🔗 arkiver antomatic: no
14:19 🔗 arkiver only the large version
14:20 🔗 antomatic ah, cool. agreed, wayback presentation is a high priority (I would imagine)
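(A rough sketch of what "only images linked to from html pages" means in practice: extract the image URLs a page actually references, so wayback playback looks right. The URL shape is assumed from the examples in this log, and an actual grab script would more likely hook this into the fetch pipeline than use a standalone regex.)

```python
import re

# Assumed URL shape: http(s)://<host>/photos/<size>/<id>.jpg
EMBED_RE = re.compile(
    r"https?://(?:static\.panoramio\.com|mw2\.google\.com)"
    r"/photos/[a-z0-9_]+/\d+\.jpg")

def embedded_images(html):
    """Return the deduplicated image URLs referenced by an HTML page."""
    return sorted(set(EMBED_RE.findall(html)))
```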
14:20 🔗 * arkiver is away for a bit
18:25 🔗 SketchCow I'm talking to the head of Panic, Inc. about possibly funding a donation to archive.org for storage of Twitpics or paying for it to be "sleeping on the couch" for a year.
18:25 🔗 SketchCow That's what
18:26 🔗 SketchCow I would prefer we just saved the full-resolution photo from Panormaialamfjf
18:27 🔗 balrog wait... that Panic?
18:28 🔗 balrog (the one that makes Mac softwarE)
18:28 🔗 balrog software*
18:46 🔗 SketchCow Yes
18:57 🔗 db48x shiny
18:57 🔗 db48x has anyone here played with libvirt's lxc sandboxes?
19:15 🔗 SketchCow The Great Shoving of multiple archiveteam projects into Internet Archive is happening now.
19:18 🔗 SmileyG :)
19:19 🔗 SketchCow https://archive.org/details/archiveteam_quizilla
19:19 🔗 SketchCow https://archive.org/details/archiveteam_ancestry
19:21 🔗 SketchCow https://archive.org/details/archiveteam_swipnet
19:24 🔗 joepie91 SketchCow: will it be a comfy couch?
19:28 🔗 SketchCow I hope so
19:28 🔗 arkiver SketchCow: Only originals (around 300 TB probably) from both domains, or only from static.panoramio.com, or also the html and images linked to from the site?
19:29 🔗 SketchCow This is a major problem.
19:31 🔗 arkiver I think the best thing is to do html and images linked to and embedded from the html pages. I can later make a separate project to do originals too (like we did with the cloudfront grab for twitpic, which gave us 55 TB)
19:32 🔗 arkiver That means I will not generate any other of the /mini_square/, /square/, /thumbnail/, etc. versions
19:32 🔗 SketchCow These are just too big for archive.org.
19:33 🔗 arkiver MobileMe was 200 TB. Besides, I'm estimating the size of only html and embedded pictures to be around 100 TB (50 TB more or less is possible)
19:33 🔗 arkiver that means no originals
19:34 🔗 SketchCow MobileMe nearly cost me my job.
19:34 🔗 SketchCow It is a poor comparison.
19:34 🔗 chfoo google is deleting the photos as well?
19:34 🔗 SketchCow Google is moving the photos, it appears.
19:34 🔗 arkiver yes
19:36 🔗 Nemo_bis Would be nice to get a dump of freely licensed originals though
19:36 🔗 Nemo_bis They're probably a fraction of the total? I don't remember
19:39 🔗 SketchCow I'm trying to think what to do here.
19:39 🔗 SketchCow These are not tiny sites.
19:39 🔗 chfoo i thought all accounts were migrated already. they did it to mine without asking
19:39 🔗 SketchCow Twitpic is just goddamned huge.
19:39 🔗 arkiver Will make a small list with the options and what the different domains mean
19:39 🔗 SketchCow And this is goddamned huge.
19:39 🔗 SketchCow Do we have any others?
19:39 🔗 arkiver Not right now
19:40 🔗 arkiver Maybe orkut
19:40 🔗 SketchCow But no, arkiver, you can't just go "well, you once took MobileMe, surely you can take another couple huge ones."
19:40 🔗 SketchCow Archive.org has 20 petabytes TOTAL
19:40 🔗 arkiver But yipdw wants to save that through the /save/ feature
19:40 🔗 arkiver Sorry about that then, bad thinking from me
19:42 🔗 arkiver So let's say we are saving picture 83983236
19:42 🔗 arkiver I'm grabbing/generating the following:
19:42 🔗 arkiver http://paste.archivingyoursh.it/raw/fawikehico
19:43 🔗 arkiver I found that http://static.panoramio.com.storage.googleapis.com/ is used mostly on ssl.panoramio.com
19:43 🔗 arkiver As for the rest
19:44 🔗 arkiver panoramio prefers http://mw2.google.com/ over http://static.panoramio.com/ when the picture is a /mini_square/, /square/, /thumbnail/, /small/ or /medium/. The others go to static.panoramio.com
19:45 🔗 arkiver https://ssl.panoramio.com/photos/large/83983236.jpg > https://static.panoramio.com.storage.googleapis.com/photos/large/83983236.jpg
19:46 🔗 arkiver http://www.panoramio.com/photos/large/83983236.jpg > http://static.panoramio.com/photos/large/83983236.jpg
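(In code form, the host preference arkiver just described might look like the sketch below; the size names come from earlier in the log, and the ssl mapping follows the example URL above.)

```python
# Sketch of the observed host-preference rule, not the actual scripts.
SMALL_SIZES = {"mini_square", "square", "thumbnail", "small", "medium"}

def preferred_host(size, ssl=False):
    if ssl:  # ssl.panoramio.com pages point at the googleapis.com mirror
        return "https://static.panoramio.com.storage.googleapis.com"
    if size in SMALL_SIZES:
        return "http://mw2.google.com"
    return "http://static.panoramio.com"
```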
19:46 🔗 SketchCow So, look.
19:46 🔗 SketchCow (why can't they be easier)
19:47 🔗 SketchCow Since it's clear they're going to try and delete comments, groups, etc, we definitely should go after those.
19:47 🔗 SketchCow Hosting 5+ versions of every photo is batshit insane, we just can't do that.
19:47 🔗 SketchCow Even with twitpic, we saved the biggest one.
19:47 🔗 SketchCow you can always thumbnail the html when the smaller versions are gone.
19:47 🔗 schbirid http://paste.archivingyoursh.it/raw/fawikehico <- still has identical duplicates (just different domain/URL) i think
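(schbirid's point is the classic dedup case: identical bytes reachable under several URLs. WARC handles this with revisit records keyed on a payload digest; a toy sketch of the idea, not of any specific tool:)

```python
import hashlib

seen = {}  # payload digest -> first URL stored with that payload

def store_or_revisit(url, payload):
    """Store a payload once; later identical fetches become revisits."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen:
        return ("revisit", seen[digest])  # refer back to the earlier copy
    seen[digest] = url
    return ("stored", url)
```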
19:48 🔗 SketchCow But we're also running into "can this even be brought back under the wayback machine"
19:50 🔗 arkiver so what we are going to do:
19:50 🔗 arkiver - Html and metadata pages
19:50 🔗 arkiver - (embedded images?)
19:50 🔗 arkiver - (originals?)
19:50 🔗 arkiver If we have a decision on those two I can finish the scripts and start the grab
19:50 🔗 * arkiver is brb
19:51 🔗 SketchCow At least grab originals and embeds and HTML/Metadata
19:51 🔗 SketchCow intermediates can die.
19:51 🔗 SketchCow But also, I am not sure IA can even take this.
19:54 🔗 * LD100 supports SketchCow
19:55 🔗 wp494 and even if we were to talk to Jimmy from WMF, they'd probably be pissed about having to host stuff
19:55 🔗 wp494 so that doesn't look like an option either
19:55 🔗 SketchCow ...Jimbo Wales?
19:55 🔗 wp494 yep
19:55 🔗 SketchCow That guy and I really hate each other, you know that, right.
19:55 🔗 SketchCow I mean really unpleasantly.
19:56 🔗 wp494 oh, well TIL
19:56 🔗 wp494 so definitely not an option
19:56 🔗 SketchCow WMF doesn't have that space anyway.
19:56 🔗 SketchCow We need, by my back-of-the-napkin math... 500 TB for this year's crap.
19:57 🔗 Nemo_bis WMF wouldn't host even 5 TB
19:57 🔗 aaaaaaaaa Just wanted to remind everyone that there is a #paranormio channel
19:57 🔗 wp494 okay, let me go list that
20:02 🔗 Nemo_bis Your.Org has plenty of space (nobody else offered to mirror Wikimedia projects' media) but probably not that much
20:09 🔗 arkiver SketchCow: I'm going to run a grab with html, embed and original and will tell you what percentage of the original estimated size we lose
20:11 🔗 arkiver also, originals from static.panoramio.com or from static.panoramio.com.storage.googleapis.com ?
20:11 🔗 arkiver (I'd do static.panoramio.com)
20:16 🔗 arkiver https://github.com/ArchiveTeam/panoramio-grab/commit/400cc5f63cb30cec077097b2d8ad112e4153ec63 < see description
20:39 🔗 arkiver SketchCow: still running a grab with the new decisions (see above github commit description), but don't expect a whole lot of saved space!
20:40 🔗 arkiver It will still be very large :/
20:49 🔗 godane SketchCow: do you happen to know why msnbc is not in the 9/11 video collection for 9/11 to 9/13?
20:49 🔗 godane i ask cause it's there for 9/14 to 9/17
20:50 🔗 SketchCow No idea
20:51 🔗 godane ok
20:51 🔗 SketchCow Asking the TV goddess
20:57 🔗 godane same goes for fox news
20:57 🔗 godane but fox news is nowhere in the 9/11 collection
21:06 🔗 arkiver SketchCow: the size of a new grab is 53% of the size with all images included
21:06 🔗 arkiver I'm now going to do a grab without originals
21:07 🔗 Nemo_bis How about avatars and similar smaller resources?
21:08 🔗 arkiver those are all being done. they are embedded images, which are being grabbed
21:09 🔗 Nemo_bis ok
21:10 🔗 SketchCow How big is that 53%?
21:10 🔗 SketchCow godane:
21:10 🔗 SketchCow [5:08:08 PM] Tracey Jaquith (Internet Archive): prolly recorder went down
21:10 🔗 SketchCow [5:08:38 PM] Tracey Jaquith (Internet Archive): oh, no prolly we weren't recording it until 9/14
21:10 🔗 SketchCow [5:08:54 PM] Tracey Jaquith (Internet Archive): betcha rod scrambled to get it, when someone said "we need X channel!!"
21:10 🔗 SketchCow [5:10:34 PM] Jason Scott: Got it
21:11 🔗 godane thats what i figured
21:12 🔗 godane that's the reason why i upload things the way i do
21:12 🔗 godane i have no idea if you guys have it or if the system was not recording it at the time
21:13 🔗 godane hopefully we get those days from that 40k VHS collection IA got
21:22 🔗 APerti 40k VHS tapes?
21:22 🔗 APerti Holy mackerel.
21:23 🔗 DFJustin http://www.fastcompany.com/3028069/the-internet-archive-is-digitizing-40000-vhs-tapes
21:23 🔗 godane APerti: http://www.dailydot.com/entertainment/marion-stokes-vhs-internet-archive-input/
21:23 🔗 APerti Rad.
21:24 🔗 godane i really hope we get 'The Site' series that aired on MSNBC from that vhs collection
21:24 🔗 APerti What hardware, software, and codecs are going to be used?
21:26 🔗 APerti Wow. Metadata Fest 40,000!
21:26 🔗 antomatic a complete treasure trove of closed-captioning data too
21:26 🔗 antomatic probably some of the earliest examples of the medium
21:27 🔗 DFJustin archive.org's m.o. with vhs tapes seems to be to rip to dvd-quality mpeg2 which is then provided in web-quality mp4 and ogg streams
21:27 🔗 APerti Excellent. DVD standard is basically twice the resolution of most VHS tapes.
21:27 🔗 DFJustin you can see with the 'input' collection which they've already done https://archive.org/details/marionstokesinput
21:28 🔗 APerti https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
21:28 🔗 antomatic APerti: Depends. Full quality DVD is twice-VHS or more, but you can easily record to DVD at half-resolution if you want more recording time
21:30 🔗 APerti Yes, I'm aware. MPEG-2 streams allow for many resolutions. I was speaking to the "standard" of 720x480.
21:31 🔗 * antomatic nods.
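(Back-of-envelope support for the "DVD is roughly twice VHS" claim, assuming the commonly cited ~3 MHz VHS luma bandwidth and standard Rec.601 sampling; exact figures vary by tape and deck.)

```python
vhs_luma_bandwidth_hz = 3.0e6    # approximate; varies by tape/deck
rec601_sample_rate_hz = 13.5e6   # Rec.601 luma sampling (720 active pixels)

nyquist_minimum = 2 * vhs_luma_bandwidth_hz  # ~6 MHz to capture VHS detail
print(rec601_sample_rate_hz / nyquist_minimum)  # ~2.25x headroom
```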
21:31 🔗 DFJustin also some of it is betamax
21:31 🔗 antomatic yay! :)
21:31 🔗 * antomatic has 2x Betamax
22:07 🔗 godane so i saved a segment of the 'NBC Today Show' with Arlen Specter in it
22:08 🔗 godane i brute forced that one from the TV Guide description
22:08 🔗 godane this one: http://www.tvguide.com/detail/tv-show.aspx?tvobjectid=100548&more=ucepisodelist&episodeid=2952892
22:18 🔗 arkiver SketchCow:
22:18 🔗 arkiver For example, pack image99pack:969175:
22:18 🔗 arkiver - everything: 719 MB
22:19 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 382 MB
22:20 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed: 120 MB
22:20 🔗 arkiver let's take 600 TB (probably a bit too high)
22:20 🔗 arkiver - everything: 600 TB
22:20 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed + original: 319 TB
22:21 🔗 antomatic arkiver: how much duplication is there in getting www. and ssl. ? Is there anything to be saved by just getting one?
22:21 🔗 arkiver - html (www.panoramio.com + ssl.panoramio.com) + embed: 100 TB
22:21 🔗 arkiver SketchCow: last option would be doable (assuming 100 TB can be stored at IA)
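(arkiver's extrapolation, reproduced: scale the 600 TB "everything" estimate by each option's share of the 719 MB sample pack.)

```python
sample_mb = {"everything": 719.0,
             "html + embed + original": 382.0,
             "html + embed": 120.0}
total_tb = 600.0  # arkiver's (probably slightly high) full-site estimate

for option, mb in sample_mb.items():
    share = mb / sample_mb["everything"]
    print("{}: {:.0%} -> ~{:.0f} TB".format(option, share, share * total_tb))
# everything: 100% -> ~600 TB
# html + embed + original: 53% -> ~319 TB
# html + embed: 17% -> ~100 TB
```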
22:21 🔗 LD100 couldn't we start with the smallest and run that? when it finishes we'll have a better estimate for adding originals, and can then decide if we should run again grabbing only originals
22:22 🔗 Diesel_ Wasn't 100 TB a problem when we did twitpic?
22:22 🔗 Diesel_ That's why we tried to come up with our own storage until IA could handle something like that
22:22 🔗 Diesel_ 100 TB was too much for them
22:22 🔗 antomatic at $2,000 per TB, every TB is a problem, potentially
22:23 🔗 arkiver SketchCow: I can first run a grab for html (www.panoramio.com + ssl.panoramio.com) + embed. When that is finished I can create a new grab for originals only
22:23 🔗 arkiver I think that would be the best thing to do
22:23 🔗 antomatic We do need to respect that and consider the rule of 'MVA' - Minimum Viable Archive.
22:23 🔗 LD100 arkiver: yep thats what I mean
22:24 🔗 arkiver I'm off now
22:24 🔗 antomatic night arkiver!
22:25 🔗 aaaaaaaaa Well, if the pictures are being transferred, do they need to get grabbed at all?
22:25 🔗 Kenshin friendly reminder, i'm sitting on 60tb of twitpic data. i do not have much space left to screw around
22:25 🔗 antomatic ^
22:25 🔗 arkiver Please PM me the final decision and please keep the last option I provided in mind
22:26 🔗 * arkiver is off to bed
23:44 🔗 joepie91 antomatic: I guess the basic idea here is "let's make the $2000 problem at least worth the headache, with good data"
23:44 🔗 joepie91 :)
23:44 🔗 joepie91 useful TBs etc
