#archiveteam 2011-09-22,Thu

Time Nickname Message
00:57 🔗 dashcloud SketchCow: I'm sure someone else has mentioned this to you, but the big area where assembly is still huge is video encoding
01:04 🔗 SketchCo1 21:05 < dashcloud> SketchCow: I'm sure someone else has mentioned this to you, but the big area where assembly is still huge is video encoding
01:04 🔗 SketchCo1 What?
01:04 🔗 dashcloud ffmpeg/libav and x264 utilize assembly heavily
01:06 🔗 dashcloud if there's anyone on the cutting edge of assembly & processors, that would be those folks
01:07 🔗 SketchCow oh.
01:40 🔗 underscor For anyone who missed Jason's QA, here's an ad-free version ripped from ustream
01:40 🔗 underscor http://tracker.archive.org/jscott_kickstarter_qa.flv
01:40 🔗 primus104 awesome, thanks
01:40 🔗 perfinion archiving the archiver. good job :D
01:42 🔗 SketchCow Yeah, that add bullshit is, in fact, bullshit
01:42 🔗 SketchCow ad
01:58 🔗 Wyatts Is there an adaptor or something for the smaller Betacam tapes?
01:59 🔗 SketchCow The machine just takes them.
02:00 🔗 Wyatts Oh, well that's spiffy
02:02 🔗 SketchCow The big issue is I have a few freakjob Digital Betacams, and nothing to play them on.
02:02 🔗 SketchCow Not even a big issue yet, I have tons of tapes to go.
02:05 🔗 Wyatts Ahh, that's right! What proportion were Beta formats again?
03:07 🔗 underscor alard: chronomex Coderjoe Spread some op goodness?
03:07 🔗 dnova yeah :D
03:08 🔗 underscor vmbrasseu: Hey, I know you! :D
03:08 🔗 vmbrasseu O RLY?
03:08 🔗 underscor :D
03:08 🔗 underscor (Alex, from IA)
03:08 🔗 chronomex underscor: you're alex?!?
03:08 🔗 vmbrasseu Oh, hey!
03:08 🔗 underscor chronomex: Yes?
03:08 🔗 vmbrasseu *hugs Alex*
03:08 🔗 underscor (Was that sarcasm, chronomex?)
03:08 🔗 chronomex underscor: put some pressure on those guys to document scandata.xml, I'm tired of not being able to number pages properly
03:09 🔗 * chronomex shrug
03:09 🔗 chronomex dunno
03:09 🔗 underscor chronomex: I'll go yell at people now
03:09 🔗 chronomex thanks
03:09 🔗 vmbrasseu chronomex: I am one of those "guys"
03:09 🔗 underscor Yeah, vm's from the archive too
03:10 🔗 vmbrasseu I'll add it to the queue but please don't hold your breath. There's rather a backlog of documentation (read: NONE).
03:10 🔗 underscor I didn't want to say anything, in case she wanted to go 'incognito'
03:10 🔗 chronomex mhm
03:10 🔗 vmbrasseu Meh. I am who I am. One quick web search will out me as someone at IA. ;-)
03:10 🔗 chronomex underscor: well, is there internal documentation that exists, or code to read it? I'd take -anything-
03:11 🔗 underscor Possibly
03:11 🔗 chronomex I
03:11 🔗 vmbrasseu Not as such.
03:11 🔗 chronomex I've got dozens of things with page numbers like G4AD
03:11 🔗 chronomex (technical drawings)
03:12 🔗 underscor chronomex: fyi
03:12 🔗 underscor [11:11:39 PM] rajamaphone: we will automatically create scandata for you if you upload a pdf
03:12 🔗 underscor But I suppose that doesn't help :P
03:14 🔗 chronomex right, I don't have a way to number pdfs either
03:15 🔗 chronomex also I'm scanning to uncompressed tiffs and uploading those; the software I have doesn't do lossless pdf
03:16 🔗 underscor oic
03:16 🔗 chronomex but regardless, I don't have pdf numbering capabilities
03:17 🔗 underscor That blog post I linked may be of use, idk
03:17 🔗 chronomex doesn't look like much in that direction
03:18 🔗 * chronomex shrug
03:35 🔗 underscor alard: How do you plan to get around SOP?
03:38 🔗 underscor Oh, I see how you inject it
03:42 🔗 underscor Man, this is *really* well done
03:48 🔗 chronomex SOP?
03:58 🔗 underscor Same origin policy
03:59 🔗 chronomex oh
04:17 🔗 chronomex oh dear.
04:17 🔗 chronomex vmbrasseu: are you still here? I've discovered an unpleasant bug in the S3 infrastructure.
04:17 🔗 * vmbrasseu gasps.
04:17 🔗 vmbrasseu Lay it on me.
04:17 🔗 vmbrasseu But no promises.
04:18 🔗 chronomex I uploaded a file using a PUT to http://s3.us.archive.org/CD-1A210-01/CD-1A210-01/bellsystem_CD-1A210-01_images.zip
04:18 🔗 chronomex note the extra slash, it's an error in my script
04:18 🔗 chronomex that last / got turned into %2F
04:18 🔗 chronomex which prevents derive from running to completion; I also cannot delete it with S3 interface (500 error) nor with the web interface
04:19 🔗 chronomex actually not quite
04:19 🔗 chronomex I actually uploaded it to http://s3.us.archive.org/bellsystem_CD-1A210-01/CD-1A210-01%2Fbellsystem_CD-1A210-01_images.zip
04:20 🔗 vmbrasseu Are you trying to delete the file or the item?
04:20 🔗 chronomex er, s,%2F,/,
04:20 🔗 chronomex just the file
04:20 🔗 chronomex I was able to delete it with that url
04:20 🔗 chronomex I got the item id wrong when I was trying to fix it right now
04:20 🔗 chronomex but the undeletable-from-web-interface thing sounds like a bug
04:20 🔗 vmbrasseu Well, deleting in general is a bit of a delicate issue at IA.
04:21 🔗 chronomex understood
04:21 🔗 chronomex the % prevents derive from working properly too
04:21 🔗 vmbrasseu But the encoding seems bug-like.
04:21 🔗 chronomex if I'm not mistaken
04:21 🔗 vmbrasseu Deriving is special voodoo. I'm still working on getting the full lowdown on that one so I can't answer whether the % will bork it here.
04:22 🔗 chronomex aye
04:22 🔗 chronomex tossing that in, I hope it'll get handled properly :)
04:22 🔗 chronomex it seems to parse the url into /{item}/{filename}, then encodes filename to be unix-safe
04:22 🔗 vmbrasseu As soon as I can get someone to define "handled properly" I assure you it'll enter the correct channels. ;-)
04:23 🔗 chronomex hehe okay
04:23 🔗 * chronomex goes to undo the havoc he's wreaked so far today
04:23 🔗 vmbrasseu Yes, that seems like a correct assumption (encoding filename). I'd have to do some code spelunking to confirm.
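The behaviour chronomex guesses at above can be sketched in a few lines of Python. This is a minimal illustration of the guess, not the archive's actual code: the function name split_s3_path is made up, and percent-encoding every slash in the filename is an assumption based only on the %2F seen in the log.

```python
from urllib.parse import quote, urlsplit

def split_s3_path(url):
    """Split an upload URL into (item, filename) the way chronomex guesses."""
    path = urlsplit(url).path.lstrip("/")
    # Everything before the first slash is taken as the item identifier.
    item, _, filename = path.partition("/")
    # Assumption: any slash left in the filename gets percent-encoded.
    return item, quote(filename, safe="")

# The URL chronomex says he actually uploaded to, with the stray path segment.
url = ("http://s3.us.archive.org/bellsystem_CD-1A210-01/"
       "CD-1A210-01/bellsystem_CD-1A210-01_images.zip")
item, filename = split_s3_path(url)
print(item)
print(filename)
```

Under that assumption the extra path segment ends up inside the filename with the slash encoded, which matches the %2F name reported in the log.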
04:24 🔗 DFJustin do one of you archive.org guys know how to tell the system that you've uploaded a two-page-per-image pdf so the online reader doesn't look retarded http://www.archive.org/stream/DieKoptischenZaubertexteDerSammlungPapyrusErzherzogRainerInWien/stegemann_koptischen_zaubertexte#page/n1/mode/2up
04:24 🔗 vmbrasseu As far as I can tell SO FAR there is no way to declare such a thing.
04:25 🔗 vmbrasseu However that would likely be rolled up in the aforementioned deriving voodoo.
04:25 🔗 vmbrasseu Wait...
04:25 🔗 vmbrasseu You're uploading papyri?
04:25 🔗 DFJustin I guess
04:25 🔗 vmbrasseu Ah, texts about papyri.
04:26 🔗 vmbrasseu Still
04:26 🔗 vmbrasseu This is relevant to my interests!
04:26 🔗 chronomex !
04:26 🔗 chronomex what are you interested in ?
04:27 🔗 vmbrasseu I have a degree in Classical Philology (Latin but mostly Greek) and was headed to grad school for papyrology when The Big Job Offer came through from California.
04:27 🔗 DFJustin heh I guess archiving attracts papyrology geeks
04:27 🔗 chronomex neato
04:28 🔗 vmbrasseu DFJustin: you just got my attention. I'll poke the appropriate personage(s) to see whether there's an answer to your question.
04:28 🔗 DFJustin I'm a computer programmer but I have an amateur interest in philology
04:28 🔗 DFJustin the pdf is from the oriental institute site, they have various stuff that I was going to try to feed in
04:28 🔗 vmbrasseu Computer programming is so much easier than Ancient Greek.
04:28 🔗 SketchCow http://www.archive.org/search.php?query=collection%3Aenter-magazine&sort=-publicdate
04:28 🔗 SketchCow awwww yeah
04:29 🔗 BlueMax lol
04:30 🔗 DFJustin I can crop the pdf manually using briss but it would be nice not to alter it
04:31 🔗 vmbrasseu DFJustin: I've sent your question on to likely suspects.
04:32 🔗 DFJustin thx
04:32 🔗 vmbrasseu Glad to oblige. Stay tuned (probably in a few days).
04:44 🔗 DFJustin I need to get back to greek, it was going so well until the aorists :(
04:44 🔗 vmbrasseu There's method to that madness.
04:45 🔗 vmbrasseu Headed offline here, so we can discuss it off channel sometime.
04:45 🔗 SketchCow I'm up to 4tb of Friendster uploaded.
04:48 🔗 Coderjoe you madman
04:48 🔗 chronomex SketchCow: this is an odd name. http://www.archive.org/details/FRIENDSTER-FRIENDSTER-014200000
04:50 🔗 SketchCow Yes.
04:51 🔗 SketchCow That was me dealing with a big
04:51 🔗 SketchCow bug
04:52 🔗 SketchCow In the code
04:53 🔗 SketchCow And the thing is, until it finishes the deriving and the rest, I can't rename the item.
04:53 🔗 * chronomex nods
04:53 🔗 SketchCow And when you're deriving/dealing with that many gigs, it takes a while.
04:53 🔗 chronomex but you can rename items, that's good. I've got a misnamed item too
04:53 🔗 chronomex uploader bugs--
04:54 🔗 SketchCow I can.
04:54 🔗 SketchCow I am using a script that does the uploading, called FRIENDSMASH
04:54 🔗 SketchCow And I didn't have error checking
04:55 🔗 SketchCow Then stepped away and phrased the argument wrong
04:55 🔗 chronomex FRIENDSMASH
04:55 🔗 chronomex I like it
04:55 🔗 chronomex mine are rather more buttoned down
04:55 🔗 chronomex but then ... this is The Phone Company
04:56 🔗 SketchCow Next is the Yahoo Video stuff.
04:57 🔗 SketchCow In both these cases, I'd like to write scripts that will suck down the final items, analyze them, and upload info files.
04:57 🔗 SketchCow You saw what I do with CD-ROM images, right.
04:58 🔗 Coderjoe whee... only 15 hours left on this file
04:58 🔗 SketchCow What file are you uploading
04:58 🔗 Coderjoe friendster.002800001-002900000.tar.xz
04:58 🔗 SketchCow Uh oh
04:59 🔗 SketchCow I'm sorry, stop and reupload.
04:59 🔗 Coderjoe uh...
04:59 🔗 Coderjoe okay?
04:59 🔗 SketchCow I was sure you were done.
05:00 🔗 SketchCow Sorry.
05:00 🔗 Coderjoe we'll see how it goes... I did use --partial, so it might have kept the dotfile it uploads to
05:01 🔗 Coderjoe (and renamed it)
05:01 🔗 Coderjoe still waiting for it to tell me anything
05:01 🔗 SketchCow Sorry for this. Let's compare the files you have and lengths before you delete them, when you're done
05:02 🔗 SketchCow I'm getting a lot of pressure to get this data into the system and make room for more stuff.
05:02 🔗 SketchCow The Rsync.net guys want their machine back, etc.
05:02 🔗 Coderjoe i suspect it is doing a checksum check on the 95% of the file up there
05:05 🔗 Coderjoe stupid massively-asymmetric internet connections
05:13 🔗 db48x oh, good
05:13 🔗 db48x IO errors on my /dev/sda
05:18 🔗 Coderjoe looks like --partial saved it
05:19 🔗 Coderjoe it's currently listing a speed of 58MB/s, which is in no way going over my internet connection
05:20 🔗 chronomex yeah --partial is awesome
05:21 🔗 Coderjoe chronomex: well, in this case, a combination of --partial and the fact that rsync writes to a dotfile
05:22 🔗 chronomex rsync only writes to a dotfile if you don't say --partial
05:22 🔗 Coderjoe no, it still writes to a dotfile, but then moves the partially-completed dotfile to the final name
05:23 🔗 Coderjoe (it uses the non-dotfile as the source for blocks that match the remote file)
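The temp-file-then-rename behaviour Coderjoe describes can be imitated with a small Python simulation. It is an illustration only and not rsync's code: the receive function, the ".name.part" temp-file naming, and the keep_partial flag are all hypothetical stand-ins for what the log attributes to rsync and --partial.

```python
import os
import tempfile

def receive(chunks, dest, keep_partial):
    """Write chunks to a hidden temp file next to dest, then rename it."""
    folder, name = os.path.split(dest)
    # Hypothetical dotfile name; rsync's real temp names differ.
    temp = os.path.join(folder, "." + name + ".part")
    try:
        with open(temp, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
    except ConnectionError:
        if keep_partial:
            # As described for --partial: keep what arrived, under the
            # final name, so a later run can reuse it.
            os.replace(temp, dest)
        else:
            os.remove(temp)
        return False
    # Normal completion: the dotfile becomes the final file.
    os.replace(temp, dest)
    return True

def flaky_source():
    """Yield one chunk, then fail like a dropped connection."""
    yield b"first 95 percent"
    raise ConnectionError("link dropped")

with tempfile.TemporaryDirectory() as folder:
    dest = os.path.join(folder, "friendster.tar.xz")
    finished = receive(flaky_source(), dest, keep_partial=True)
    kept = os.path.exists(dest)
    print(finished, kept)
```

With keep_partial set to False the interrupted transfer leaves nothing behind, which is the case where a restart has to send everything again.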
05:24 🔗 db48x hmm
05:24 🔗 db48x rebooting seems to have "fixed" it
06:08 🔗 SketchCow Rebooting fixes everything
06:28 🔗 vmbrasseu DFJustin: headed to bed but an answer came in to your question and wanted to get it to you ASAP:
06:28 🔗 vmbrasseu "Yes, in fact.  We added a meta.xml element specifically to deal with that. If they give their item a "bookreader-defaults" value of "mode/1up", BookReader will start up in 1-page mode instead of the usual 2-page mode. See, for instance, item CLARION_CALL_1961-1962_v33 and its Read Online link."
06:28 🔗 vmbrasseu Give that a go.
06:30 🔗 vmbrasseu Good luck and good night.
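For anyone trying the suggestion above, here is a rough Python sketch of what the resulting metadata element might look like. Only the element name "bookreader-defaults" and the value "mode/1up" come from the quoted answer; the <metadata> wrapper and the example-item identifier are assumptions for illustration, and the sketch does not show how to submit the change to archive.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an item's meta.xml; a real one has more fields.
meta = ET.fromstring(
    "<metadata><identifier>example-item</identifier></metadata>"
)

# The element name and value are the ones quoted in the answer above.
field = ET.SubElement(meta, "bookreader-defaults")
field.text = "mode/1up"

result = ET.tostring(meta, encoding="unicode")
print(result)
```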
07:40 🔗 Wyatt Jason, after tonight, I appreciate your push for metadata curation more than ever.
07:40 🔗 perfinion what happened tonight?
07:41 🔗 Wyatt Oh, I was explaining some of the issues with crowdsourcing tags for music. And to drive my point home, I went to last.fm.
07:41 🔗 Wyatt And even I wasn't fully prepared for that mess. :/
07:41 🔗 perfinion yeeah
07:41 🔗 perfinion crowd sourcing is a nice idea
07:42 🔗 perfinion but it needs stricter implementations
07:42 🔗 Wyatt But it requires a guiding hand
07:42 🔗 perfinion yeah
07:42 🔗 perfinion i suppose just giving some ppl mod rights would be enough
07:43 🔗 Wyatt Well part of the issue is last.fm is really just inadequate for this task in its current form.
07:43 🔗 perfinion i never really got the point of lastfm
07:43 🔗 Wyatt Tags on last.fm are...third-class citizens?
07:43 🔗 perfinion why would i want to advertise exactly what songs im listening to?
07:44 🔗 Wyatt At its heart, it's something like a social network for music listeners.
07:44 🔗 perfinion i guess i dont really use facebook much either, so im the wrong person to figure it out :P
07:44 🔗 Wyatt And it makes recommendations and allows you to listen with people and such. I use it primarily to see data about what I listened to and when and how often and such.
07:45 🔗 perfinion my music player on my laptop queries it for recommendations
07:46 🔗 perfinion but i dont see why i'd want to scrobble my songs
07:46 🔗 perfinion although i suppose enough ppl have to do it otherwise it wont have data for recommendations
07:46 🔗 Wyatt Pretty much. It's a heuristic based on community similarity rather than actual music traits (Music Genome Project)
07:48 🔗 Wyatt It's interesting to me as a case study, and there are valuable lessons to learn from it, but it could use a makeover.
07:48 🔗 perfinion indeed
07:48 🔗 Wyatt (Though hopefully not like Friendster)
07:48 🔗 perfinion hahaha
07:49 🔗 Wyatt Funny until it comes true. That'd be one to keep an eye on, come to think of it. :/
07:51 🔗 ersi What's there to grab at last.fm by the way? Every users individual scrobbles?
07:51 🔗 ersi usernames? artists / song names?
07:52 🔗 Wyatt It also has user groups with forum functionality, wiki pages per-artist and _per-song_...and I think there's some other stuff.
07:53 🔗 Wyatt It started as a radio station/forum hybrid bolted to a CS project as I recall. And I think it never really knew what to grow up into so it became a Web 2.0 chimera.
07:53 🔗 ersi oh yeah
07:54 🔗 ersi Yeah, definitely
07:55 🔗 Wyatt Actually, now that I look at the history of last year, it might be one to watch. Owned by viacom and making moves that upset users? Sounds like an unfavourable recipe.
07:55 🔗 ersi indeed
07:55 🔗 ersi there's a few scripts made by libre.fm to migrate/gobble user scrobbles at least
07:56 🔗 ersi I think one needs to log in with its user to gobble them though
07:56 🔗 Wyatt libre.fm? Haha, okay, I guess I should have seen that coming.
07:57 🔗 Wyatt Ah, no, "CBS Interactive"?
07:58 🔗 Wyatt Oh, right, them.
07:58 🔗 ersi CBS Interactive?
07:58 🔗 Wyatt Not Viacom; CBS owns last.fm
07:58 🔗 ersi ah
09:08 🔗 chronomex huh, I had no idea
14:13 🔗 DFJustin yeah last.fm drives me nuts because they have an automated metadata correction system and even pull known-correct data from musicbrainz and still utterly fail to meaningfully fix anything
14:14 🔗 DFJustin and basically don't seem to give a shit despite blog posts trumpeting all this
14:15 🔗 Wyatt Oh my, I wasn't aware of THAT aspect.
14:16 🔗 DFJustin this is pretty slick though http://encukou.github.com/lastscrape-gui/
14:17 🔗 Wyatt Ooh, nice
14:21 🔗 DFJustin like, people have been robovoting on these since 2009 and half of them still don't pass the autocorrect threshold http://www.last.fm/group/The+Auto-Correct+Correction+Brigade/forum/119632/_/522788
14:25 🔗 Wyatt That doesn't terribly surprise me.
14:26 🔗 Wyatt Which goes back to my thesis that the push for curation is much appreciated.
17:02 🔗 chronomex metadata curation is the exact opposite of sexy
17:03 🔗 Zebranky That's one for /topic
17:20 🔗 DFJustin the thing is as a web company you don't even have to do anything, just slap on an edit button and let asperger's do the work for you
17:22 🔗 Coderjoe which could turn out bad, as some aspergers don't realize or care that they are actually incorrect.
17:34 🔗 DFJustin it's still a huge improvement over routing everything through your staff who don't care
17:34 🔗 DFJustin it's amazing to me how many sites don't understand this
17:34 🔗 DFJustin like, even archive.org won't let visitors fix metadata, and surprise, their metadata sucks
17:43 🔗 Coderjoe they made the assumption that the people adding items would care enough about them, i guess
18:48 🔗 SketchCow the OpenLibrary interface allows metadata repair
18:49 🔗 SketchCow But the issue is different. The issue isn't the uploaders won't do metadata, it's that there's a severe documentation problem that some people are working on, and which I'm trying to help with.
19:21 🔗 DFJustin I mean stuff like this where people can only leave an ineffectual comment https://encrypted.google.com/search?q=%22wrong+book%22+site%3Aarchive.org%2Fdetails&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:unofficial&client=firefox-a
19:22 🔗 DFJustin yes, if you e-mail collections-service they'll deal with it but that's a high barrier
19:28 🔗 alard Quick statistics update: there are 449.287 free articles on JSTOR (that I know of).
20:07 🔗 * Electroni Great Electronics Sale! Prices are reduced up to 50%! Laptops, PDAs, Tablet PCs and more only at X Laptops Co, Ltd. Check us out at http://XLaptops.net
20:51 🔗 Coderjoe woohoo
20:51 🔗 Coderjoe 2 minutes left on this file
20:54 🔗 Coderjoe and done
20:54 🔗 Coderjoe SketchCow: done with friendster.002800001-002900000.tar.xz
21:04 🔗 SketchCow Thanks.
21:04 🔗 SketchCow Can you give me the bytesize?
23:49 🔗 Coderjoe SketchCow: 102797504180
23:50 🔗 Coderjoe SketchCow: I forgot to move other files out of the directory I was uploading from, so I accidentally started uploading friendster.000104001-000105000.tar.xz again
23:52 🔗 Coderjoe there's a .csv file with filenames, sizes, and crc32s of all of the files I have
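A manifest like the one Coderjoe mentions (filenames, sizes, and crc32s) could be produced along these lines. This is a sketch under assumptions: the column names, the helper names crc32_of and manifest_rows, and the sample.bin test file are all invented here, and nothing is known about how his actual .csv was generated.

```python
import csv
import io
import os
import tempfile
import zlib

def crc32_of(path, block_size=1 << 20):
    """Return the CRC32 of a file as 8 hex digits, reading it in blocks."""
    crc = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return format(crc & 0xFFFFFFFF, "08x")

def manifest_rows(folder):
    """List (filename, size in bytes, crc32) for each regular file."""
    rows = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            rows.append((name, os.path.getsize(path), crc32_of(path)))
    return rows

# Small demonstration on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "sample.bin"), "wb") as handle:
        handle.write(b"123456789")
    rows = manifest_rows(folder)

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["filename", "size", "crc32"])
writer.writerows(rows)
print(out.getvalue())
```

Comparing the size and crc32 columns from both ends is enough to spot a truncated or corrupted upload before anything gets deleted.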
23:59 🔗 Wyatt Now that's curious...what might cause warc-wget to segfault after only 5800 files?
