[00:57] SketchCow: I'm sure someone else has mentioned this to you, but the big area where assembly is still huge is video encoding
[01:04] 21:05 < dashcloud> SketchCow: I'm sure someone else has mentioned this to you, but the big area where assembly is still huge is video encoding
[01:04] What?
[01:04] ffmpeg/libav and x264 utilize assembly heavily
[01:06] if there's anyone on the cutting edge of assembly & processors, that would be those folks
[01:07] oh.
[01:40] For anyone who missed Jason's Q&A, here's an ad-free version ripped from ustream
[01:40] http://tracker.archive.org/jscott_kickstarter_qa.flv
[01:40] awesome, thanks
[01:40] archiving the archiver. good job :D
[01:42] Yeah, that ad bullshit is, in fact, bullshit
[01:58] Is there an adaptor or something for the smaller Betacam tapes?
[01:59] The machine just takes them.
[02:00] Oh, well that's spiffy
[02:02] The big issue is I have a few freakjob Digital Betacams, and nothing to play them on.
[02:02] Not even a big issue yet, I have tons of tapes to go.
[02:05] Ahh, that's right! What proportion were Beta formats again?
[03:07] alard: chronomex Coderjoe Spread some op goodness?
[03:07] yeah :D
[03:08] vmbrasseu: Hey, I know you! :D
[03:08] O RLY?
[03:08] :D
[03:08] (Alex, from IA)
[03:08] underscor: you're alex?!?
[03:08] Oh, hey!
[03:08] chronomex: Yes?
[03:08] *hugs Alex*
[03:08] (Was that sarcasm, chronomex?)
[03:08] underscor: put some pressure on those guys to document scandata.xml, I'm tired of not being able to number pages properly
[03:09] * chronomex shrug
[03:09] dunno
[03:09] chronomex: I'll go yell at people now
[03:09] thanks
[03:09] chronomex: I am one of those "guys"
[03:09] Yeah, vm's from the archive too
[03:10] I'll add it to the queue but please don't hold your breath. There's rather a backlog of documentation (read: NONE).
[03:10] I didn't want to say anything, in case she wanted to go 'incognito'
[03:10] mhm
[03:10] Meh. I am who I am. One quick web search will out me as someone at IA. ;-)
[03:10] underscor: well, is there internal documentation that exists, or code to read it? I'd take -anything-
[03:11] Possibly
[03:11] I
[03:11] Not as such.
[03:11] I've got dozens of things with page numbers like G4AD
[03:11] (technical drawings)
[03:12] chronomex: fyi
[03:12] [11:11:39 PM] rajamaphone: we will automatically create scandata for you if you upload a pdf
[03:12] But I suppose that doesn't help :P
[03:14] right, I don't have a way to number pdfs either
[03:15] also I'm scanning to uncompressed tiffs and uploading those; the software I have doesn't do lossless pdf
[03:16] oic
[03:16] but regardless, I don't have pdf numbering capabilities
[03:17] That blog post I linked may be of use, idk
[03:17] doesn't look like much in that direction
[03:18] * chronomex shrug
[03:35] alard: How do you plan to get around SOP?
[03:38] Oh, I see how you inject it
[03:42] Man, this is *really* well done
[03:48] SOP?
[03:58] Same origin policy
[03:59] oh
[04:17] oh dear.
[04:17] vmbrasseu: are you still here? I've discovered an unpleasant bug in the S3 infrastructure.
[04:17] * vmbrasseu gasps.
[04:17] Lay it on me.
[04:17] But no promises.
[04:18] I uploaded a file using a PUT to http://s3.us.archive.org/CD-1A210-01/CD-1A210-01/bellsystem_CD-1A210-01_images.zip
[04:18] note the extra slash, it's an error in my script
[04:18] that last / got turned into %2F
[04:18] which prevents derive from running to completion; I also cannot delete it with the S3 interface (500 error) nor with the web interface
[04:19] actually not quite
[04:19] I actually uploaded it to http://s3.us.archive.org/bellsystem_CD-1A210-01/CD-1A210-01%2Fbellsystem_CD-1A210-01_images.zip
[04:20] Are you trying to delete the file or the item?
[04:20] er, s,%2F,/,
[04:20] just the file
[04:20] I was able to delete it with that url
[04:20] I got the item id wrong when I was trying to fix it just now
[04:20] but the undeletable-from-web-interface thing sounds like a bug
[04:20] Well, deleting in general is a bit of a delicate issue at IA.
[04:21] understood
[04:21] the % prevents derive from working properly too
[04:21] But the encoding seems bug-like.
[04:21] if I'm not mistaken
[04:21] Deriving is special voodoo. I'm still working on getting the full lowdown on that one, so I can't answer whether the % will bork it here.
[04:22] aye
[04:22] tossing that in, I hope it'll get handled properly :)
[04:22] it seems to parse the url into /{item}/{filename}, then encodes the filename to be unix-safe
[04:22] As soon as I can get someone to define "handled properly" I assure you it'll enter the correct channels. ;-)
[04:23] hehe okay
[04:23] * chronomex goes to undo the havoc he's wreaked so far today
[04:23] Yes, that seems like a correct assumption (encoding the filename). I'd have to do some code spelunking to confirm.
[04:24] do one of you archive.org guys know how to tell the system that you've uploaded a two-page-per-image pdf so the online reader doesn't look retarded http://www.archive.org/stream/DieKoptischenZaubertexteDerSammlungPapyrusErzherzogRainerInWien/stegemann_koptischen_zaubertexte#page/n1/mode/2up
[04:24] As far as I can tell SO FAR there is no way to declare such a thing.
[04:25] However, that would likely be rolled up in the aforementioned deriving voodoo.
[04:25] Wait...
[04:25] You're uploading papyri?
[04:25] I guess
[04:25] Ah, texts about papyri.
[04:26] Still
[04:26] This is relevant to my interests!
[04:26] !
[04:26] what are you interested in ?
[04:27] I have a degree in Classical Philology (Latin but mostly Greek) and was headed to grad school for papyrology when The Big Job Offer came through from California.
[04:27] heh I guess archiving attracts papyrology geeks
[04:27] neato
[04:28] DFJustin: you just got my attention. I'll poke the appropriate personage(s) to see whether there's an answer to your question.
[04:28] I'm a computer programmer but I have an amateur interest in philology
[04:28] the pdf is from the Oriental Institute site, they have various stuff that I was going to try to feed in
[04:28] Computer programming is so much easier than Ancient Greek.
[04:28] http://www.archive.org/search.php?query=collection%3Aenter-magazine&sort=-publicdate
[04:28] awwww yeah
[04:29] lol
[04:30] I can crop the pdf manually using briss, but it would be nice not to alter it
[04:31] DFJustin: I've sent your question on to likely suspects.
[04:32] thx
[04:32] Glad to oblige. Stay tuned (probably in a few days).
[04:44] I need to get back to Greek, it was going so well until the aorists :(
[04:44] There's method to that madness.
[04:45] Headed offline here, so we can discuss it off channel sometime.
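A minimal sketch of the parsing chronomex guessed at in the %2F bug report above: the server splitting the PUT path into /{item}/{filename} and percent-encoding the filename. The URL is the one from the log; the split-and-encode behavior is an assumption (as it was in the chat), not confirmed IA code.

    from urllib.parse import urlsplit, quote

    # The accidental upload URL, with the stray "CD-1A210-01/" path segment
    # that the script bug introduced.
    url = ("http://s3.us.archive.org/"
           "bellsystem_CD-1A210-01/CD-1A210-01/bellsystem_CD-1A210-01_images.zip")

    path = urlsplit(url).path.lstrip("/")
    # Assumed behavior: the first segment is the item, everything after the
    # first slash is the filename, which then gets encoded to be "unix-safe".
    item, _, filename = path.partition("/")
    stored_name = quote(filename, safe="")  # "/" becomes %2F

    print(item)         # bellsystem_CD-1A210-01
    print(stored_name)  # CD-1A210-01%2Fbellsystem_CD-1A210-01_images.zip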
[04:45] I'm up to 4tb of Friendster uploaded.
[04:48] you madman
[04:48] SketchCow: this is an odd name. http://www.archive.org/details/FRIENDSTER-FRIENDSTER-014200000
[04:50] Yes.
[04:51] That was me dealing with a big
[04:51] bug
[04:52] In the code
[04:53] And the thing is, until it finishes the deriving and the rest, I can't rename the item.
[04:53] * chronomex nods
[04:53] And when you're deriving/dealing with that many gigs, it takes a while.
[04:53] but you can rename items, that's good. I've got a misnamed item too
[04:53] uploader bugs--
[04:54] I can.
[04:54] I am using a script that does the uploading, called FRIENDSMASH
[04:54] And I didn't have error checking
[04:55] Then stepped away and phrased the argument wrong
[04:55] FRIENDSMASH
[04:55] I like it
[04:55] mine are rather more buttoned down
[04:55] but then ... this is The Phone Company
[04:56] Next is the Yahoo Video stuff.
[04:57] In both these cases, I'd like to write scripts that will suck down the final items, analyze them, and upload info files.
[04:57] You saw what I do with CD-ROM images, right.
[04:58] whee... only 15 hours left on this file
[04:58] What file are you uploading
[04:58] friendster.002800001-002900000.tar.xz
[04:58] Uh oh
[04:59] I'm sorry, stop and reupload.
[04:59] uh...
[04:59] okay?
[04:59] I was sure you were done.
[05:00] Sorry.
[05:00] we'll see how it goes... I did use --partial, so it might have kept the dotfile it uploads to
[05:01] (and renamed it)
[05:01] still waiting for it to tell me anything
[05:01] Sorry for this. Let's compare the files you have and the lengths before you delete them, when you're done
[05:02] I'm getting a lot of pressure to get this data into the system and make room for more stuff.
[05:02] The rsync.net guys want their machine back, etc.
[05:02] i suspect it is doing a checksum check on the 95% of the file that's up there
[05:05] stupid massively-asymmetric internet connections
[05:13] oh, good
[05:13] IO errors on my /dev/sda
[05:18] looks like --partial saved it
[05:19] it's currently listing a speed of 58MB/s, which is in no way going over my internet connection
[05:20] yeah --partial is awesome
[05:21] chronomex: well, in this case, a combination of --partial and the fact that rsync writes to a dotfile
[05:22] rsync only writes to a dotfile if you don't say --partial
[05:22] no, it still writes to a dotfile, but then moves the partially-completed dotfile to the final name
[05:23] (it uses the non-dotfile as the source for blocks that match the remote file)
[05:24] hmm
[05:24] rebooting seems to have "fixed" it
[06:08] Rebooting fixes everything
[06:28] DFJustin: headed to bed, but an answer came in to your question and I wanted to get it to you ASAP:
[06:28] "Yes, in fact. We added a meta.xml element specifically to deal with that. If they give their item a "bookreader-defaults" value of "mode/1up", BookReader will start up in 1-page mode instead of the usual 2-page mode. See, for instance, item CLARION_CALL_1961-1962_v33 and its Read Online link."
[06:28] Give that a go.
[06:30] Bonne chance et bonne nuit. ("Good luck and good night.")
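For reference, a rough sketch of applying that bookreader-defaults answer by hand. The element name and the "mode/1up" value come from the quoted answer; the filename and the assumption that the item's meta.xml has a plain <metadata> root are illustrative, not confirmed.

    import xml.etree.ElementTree as ET

    # Hypothetical local copy of the item's meta.xml
    META = "DieKoptischenZaubertexte_meta.xml"

    tree = ET.parse(META)
    root = tree.getroot()  # assumed to be <metadata>

    # Tell BookReader to start in 1-page mode instead of the usual 2-page mode
    el = ET.SubElement(root, "bookreader-defaults")
    el.text = "mode/1up"

    tree.write(META, encoding="utf-8", xml_declaration=True)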
[07:40] Jason, after tonight, I appreciate your push for metadata curation more than ever.
[07:40] what happened tonight?
[07:41] Oh, I was explaining some of the issues with crowdsourcing tags for music. And to drive my point home, I went to last.fm.
[07:41] And even I wasn't fully prepared for that mess. :/
[07:41] yeeah
[07:41] crowdsourcing is a nice idea
[07:42] but it needs stricter implementations
[07:42] But it requires a guiding hand
[07:42] yeah
[07:42] i suppose just giving some ppl mod rights would be enough
[07:43] Well, part of the issue is that last.fm is really just inadequate for this task in its current form.
[07:43] i never really got the point of lastfm
[07:43] Tags on last.fm are...third-class citizens?
[07:43] why would i want to advertise exactly what songs im listening to?
[07:44] At its heart, it's something like a social network for music listeners.
[07:44] i guess i dont really use facebook much either, so im the wrong person to figure it out :P
[07:44] And it makes recommendations and allows you to listen with people and such. I use it primarily to see data about what I listened to and when and how often and such.
[07:45] my music player on my laptop queries it for recommendations
[07:46] but i dont see why i'd want to scrobble my songs
[07:46] although i suppose enough ppl have to do it, otherwise it wont have data for recommendations
[07:46] Pretty much. It's a heuristic based on community similarity rather than actual music traits (Music Genome Project)
[07:48] It's interesting to me as a case study, and there are valuable lessons to learn from it, but it could use a makeover.
[07:48] indeed
[07:48] (Though hopefully not like Friendster)
[07:48] hahaha
[07:49] Funny until it comes true. That'd be one to keep an eye on, come to think of it. :/
[07:51] What's there to grab at last.fm, by the way? Every user's individual scrobbles?
[07:51] usernames? artists / song names?
[07:52] It also has user groups with forum functionality, wiki pages per-artist and _per-song_...and I think there's some other stuff.
[07:53] It started as a radio station/forum hybrid bolted to a CS project, as I recall. And I think it never really knew what to grow up into, so it became a Web 2.0 chimera.
[07:53] oh yeah
[07:54] Yeah, definitely
[07:55] Actually, now that I look at the history of last year, it might be one to watch. Owned by Viacom and making moves that upset users? Sounds like an unfavourable recipe.
[07:55] indeed
[07:55] there's a few scripts made by libre.fm to migrate/gobble user scrobbles, at least
[07:56] I think one needs to log in as the user to gobble them, though
[07:56] libre.fm? Haha, okay, I guess I should have seen that coming.
[07:57] Ah, no, "CBS Interactive"?
[07:58] Oh, right, them.
[07:58] CBS Interactive?
[07:58] Not Viacom; CBS owns last.fm
[07:58] ah
[09:08] huh, I had no idea
[14:13] yeah, last.fm drives me nuts because they have an automated metadata correction system and even pull known-correct data from MusicBrainz, and still utterly fail to meaningfully fix anything
[14:14] and basically don't seem to give a shit despite blog posts trumpeting all this
[14:15] Oh my, I wasn't aware of THAT aspect.
[14:16] this is pretty slick though http://encukou.github.com/lastscrape-gui/
[14:17] Ooh, nice
[14:21] like, people have been robovoting on these since 2009 and half of them still don't pass the autocorrect threshold http://www.last.fm/group/The+Auto-Correct+Correction+Brigade/forum/119632/_/522788
[14:25] That doesn't terribly surprise me.
[14:26] Which goes back to my thesis that the push for curation is much appreciated.
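For the curious, a minimal sketch of the kind of scrobble export lastscrape and the libre.fm migration scripts perform, using last.fm's public user.getRecentTracks API method. The API key and username are placeholders; error handling and full pagination are omitted.

    import json
    import urllib.request

    API_KEY = "YOUR_API_KEY"   # placeholder: register for one at last.fm
    USER = "some_username"     # placeholder

    def fetch_scrobbles(user, page=1):
        """Fetch one page (up to 200 tracks) of a user's scrobbles."""
        url = ("http://ws.audioscrobbler.com/2.0/"
               f"?method=user.getrecenttracks&user={user}"
               f"&api_key={API_KEY}&format=json&limit=200&page={page}")
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    data = fetch_scrobbles(USER)
    for track in data["recenttracks"]["track"]:
        print(track["artist"]["#text"], "-", track["name"])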
[17:02] metadata curation is the exact opposite of sexy
[17:03] That's one for /topic
[17:20] the thing is, as a web company you don't even have to do anything, just slap on an edit button and let asperger's do the work for you
[17:22] which could turn out bad, as some aspergers don't realize or care that they are actually incorrect.
[17:34] it's still a huge improvement over routing everything through your staff, who don't care
[17:34] it's amazing to me how many sites don't understand this
[17:34] like, even archive.org won't let visitors fix metadata, and surprise, their metadata sucks
[17:43] they made the assumption that the people adding items would care enough about them, i guess
[18:48] the OpenLibrary interface allows metadata repair
[18:49] But the issue is different. The issue isn't that the uploaders won't do metadata, it's that there's a severe documentation problem that some people are working on, and which I'm trying to help with.
[19:21] I mean stuff like this, where people can only leave an ineffectual comment: https://encrypted.google.com/search?q=%22wrong+book%22+site%3Aarchive.org%2Fdetails&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:unofficial&client=firefox-a
[19:22] yes, if you e-mail collections-service they'll deal with it, but that's a high barrier
[19:28] Quick statistics update: there are 449,287 free articles on JSTOR (that I know of).
[20:51] woohoo
[20:51] 2 minutes left on this file
[20:54] and done
[20:54] SketchCow: done with friendster.002800001-002900000.tar.xz
[21:04] Thanks.
[21:04] Can you give me the bytesize?
[23:49] SketchCow: 102797504180
[23:50] SketchCow: I forgot to move other files out of the directory I was uploading from, so I accidentally started uploading friendster.000104001-000105000.tar.xz again
[23:52] there's a .csv file with filenames, sizes, and crc32s of all of the files I have
[23:59] Now that's curious...what might cause warc-wget to segfault after only 5800 files?
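As a footnote, a minimal sketch of generating the kind of manifest described above (filenames, sizes, and crc32s in a .csv). The directory path and output filename are placeholders.

    import csv
    import os
    import zlib

    UPLOAD_DIR = "/path/to/friendster"  # placeholder

    def crc32_of(path, chunk_size=1 << 20):
        """Stream the file through zlib.crc32 so huge tarballs fit in memory."""
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                crc = zlib.crc32(chunk, crc)
        return crc & 0xFFFFFFFF

    with open("manifest.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "size", "crc32"])
        for name in sorted(os.listdir(UPLOAD_DIR)):
            path = os.path.join(UPLOAD_DIR, name)
            if os.path.isfile(path):
                writer.writerow([name, os.path.getsize(path),
                                 f"{crc32_of(path):08x}"])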