[02:12] benuski: I saw a system the other day that, from the cursory look I gave it, would split newspapers by page region
[02:13] Seemed like it might be useful for your project
[02:13] hm, page region how? did it figure out where columns were, or was it kind of like google maps slippy-view for newspaper scans?
[02:16] chronomex: It seemed able to break a page apart into blocks, and sort images/articles/ads
[02:16] that's a real trick
[02:16] I'll have to find it again; I was on vacation
[02:16] Again, I didn't look too hard, so I could be wrong
[02:17] still
[02:18] I've seen things happen like that, didn't think about them before
[02:18] I saw it and thought it might be useful for breaking up issues
[02:21] Ah, found it. Let's see if my broken modem will let it load...
[02:23] http://historying.org/2012/07/12/coding-a-middle-ground/
[02:31] ah, a semi-manual approach
[02:31] anyone still on the fanfiction archiving? keeping up with that?
[02:31] I really like human-assisted machine systems
[02:32] wikiteam could use some love, i think, that could be much easier to use with a tracker
[02:32] chronomex: Train on humans, then let the machines take over when they learn?
[02:32] oh, no, he gridded out the page and figured the purpose of grid squares based on words
[02:32] that's neat
[02:33] chronomex: That's why I figured it'd be useful here. Find the staff list, find what page it always appears on, and use that as your anchor for file breaks
[02:33] shaqfu: something like that, really anything that takes away as many of the easy cases as possible
[02:33] I seem to have come late to this conversation, what's the problem space?
[02:34] I've been away for a few days, and was thinking of benuski's problem of finding automatic ways to break pages from that ugly website dump
[02:35] ugly website dump of newspaper scans?
[02:35] http://www.fultonhistory.com/Fulton.html
[02:35] But they're not split into issues - just pages
[02:36] ahhh
[02:36] ooo flash :(
[02:36] I told you it was ugly :(
[02:37] so the problem is taking a page and determining whether it's an issue start?
[02:37] Yeah
[02:37] that sounds conceptually straightforward
[02:38] fuzzy warp-tolerant image-matching is the hard part of the problem
[02:38] Yep, although it doesn't seem to need to be very accurate
[02:38] ok, so it's 2 million pages
[02:38] I see why you don't want to do this manually :P
[02:39] If a page has a high incidence of large fonts that aren't ads, odds are profoundly good that it's the front
[02:39] chronomex: 20M
[02:39] sheeze
[02:43] I'll link benuski to it next time he's on
[03:21] Ops, please.
[03:22] Also, please help me find all the archiveteam aux channels so I can join them.
[03:22] I'm in #archiveteam #wikiteam #urlteam #archiveteam-bs #nowwhat
[03:23] There's #fireplanet
[03:32] So, what did I miss today?
[03:45] Apparently nothing :P
[03:45] I bought you a tutu though!
[04:00] I think I somehow made cdbbsarchive go away from the top bar at http://archive.org/details/software
[04:28] http://kotaku.com/5926527/the-secret-atari-emails-you-were-never-supposed-to-see-until-some-guy-released-them
[04:40] http://www.neogaf.com/forum/showthread.php?t=147082
[04:52] DFJustin: Impossible.
[04:54] Do you know how long those atari e-mails have been on textfiles.com?
[04:56] April 4, 2004.
[04:56] When were they put online? 2001.
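An aside on the "large fonts that aren't ads means front page" heuristic floated at 02:39: a minimal sketch of what that idea could look like, assuming OpenCV 4 and hand-guessed thresholds. This is not the system benuski saw, just an illustration that an issue's first page usually carries an oversized nameplate in its top strip:

```python
# Sketch only: flag a scanned newspaper page as a likely issue start if the
# top strip of the page contains several unusually tall ink components
# (nameplate-sized lettering). All thresholds are guesses needing tuning.
import cv2

def looks_like_front_page(path, band=0.2, min_height_frac=0.05):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise ValueError("could not read %s" % path)
    h, w = img.shape
    top = img[: int(h * band), :]          # only examine the top strip
    # Invert and binarize so ink becomes white for contour detection.
    _, binary = cv2.threshold(top, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # OpenCV 4 return signature (OpenCV 3 returns three values).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Count components tall enough to plausibly be nameplate lettering.
    tall = [c for c in contours
            if cv2.boundingRect(c)[3] > h * min_height_frac]
    return len(tall) >= 3   # a handful of huge glyphs ~= a masthead

if __name__ == "__main__":
    import sys
    for page in sys.argv[1:]:
        print(page, looks_like_front_page(page))
```

A crude classifier like this only needs to propose candidate issue breaks; as the conversation notes, it doesn't have to be very accurate, and a human-assisted pass can clean up the rest.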
[05:15] DFJustin: It wasn't your fault, but it was set to be hidden
[05:25] might throw classicpcgames up there while you're at it
[05:33] whoops
[07:27] http://www.kickstarter.com/projects/jmathai/openphoto-a-photo-service-for-your-s3-or-dropbox-a came to me to mention.
[07:29] So easy and useful http://stackoverflow.com/questions/4560400/how-can-i-get-google-cache-age-of-any-url-or-web-page
[07:35] ... XD
[09:27] Pumping in lots of stuff.
[09:27] Now adding all the JWZ mixtapes I was given, 104 in all
[09:27] http://archive.org/details/jwz-mixtape-001
[10:06] maybe those will help with solving the riddle on his page
[13:52] Ok, I think: http://arcticready.com/social might be something good and funny to save
[13:54] ersi: it's a spoof
[13:56] Schbirid: Still think it's worthwhile. I'm gonna see if I can whip something together
[13:58] hmmmm
[14:00] :)
[14:02] Also, I'm not sure if mentioned earlier... this is half off-topic and half on-topic: "Marissa Mayer, Google's employee #20 and Vice President of Local, has been appointed CEO of Yahoo."
[14:05] Who in their right mind would leave Google for Yahoo
[14:06] Someone who's in need of a *real* challenge, perhaps?
[14:06] Hard to get that floating turd of a ship (i.e. Yahoo) going upwards instead of down into the bottomless pit of the ocean
[14:09] Someone who knows they'll always be welcome at google
[14:09] or already have enough dosh, why not have some fun?
[14:10] if i was a billionaire (she is?) i would love to go to yahoo
[14:15] Exactly, and I'm gonna steer this ship back on course now
[14:15] Some figures estimate she's worth 300 million USD, so in some currencies - she's a billionaire
[14:26] Or not... However, thinking of it another way - If she fails, no one will really care. If she succeeds? wow.
[16:16] -bs
[16:35] http://archive.org/details/dnamixtape is now uploaded.
[18:11] anyone happen to know if there's a python library for putting stuff up at internet archive using their s3-ish api?
[18:16] oh hmm might be possible w/ boto http://www.elastician.com/2011/02/accessing-internet-archive-with-boto.html
[18:26] there's no specific one, no. but urllib/urllib2/requests etc are what you'd probably need
[18:54] edsu: boto is what we recommend
[18:54] >>> import boto
[18:54] Hopefully you can just global search and replace amazonaws.com with us.archive.org.
[18:54] The S3 API works well with the boto python library (multipart too!),
[18:54] We strive to make the S3 API compatible enough with current client code.
[18:54] use is_secure=False, host='s3.us.archive.org' when creating your boto connection.
[18:54] >>> conn = boto.connect_s3(key, secret, host='s3.us.archive.org', is_secure=False)
[19:08] underscor: is working pretty nicely
[19:08] underscor: if i accidentally uploaded some files without setting the key name properly is it possible to remove the errorneous keys?
[19:09] s/errorneous/erroneous/
[19:09] http://ia600804.us.archive.org/4/items/kasabi/
[19:10] underscor: i guess through this i can http://ia600804.us.archive.org/edit.php?identifier=kasabi
[19:11] underscor: but not through the api eh?
[19:11] no, no deleting through the API
[19:12] and we'll probably be removing that ability from edit.php at some point too
[19:12] we don't like deleting
[19:12] * edsu neither
[19:23] underscor: is it easy to reassign the collection that an item is part of?
[19:23] underscor: there's an archiveteam collection isn't there?
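Putting underscor's pointers above together, a minimal end-to-end upload with the boto 2 era API might look like the following sketch. The item identifier, filename, and metadata value are placeholders, and the x-archive-meta-* header is the IA S3 gateway's convention for setting item metadata at creation time:

```python
# Minimal sketch following underscor's recipe above (boto 2 API).
# 'my-test-item' and 'dataset.nt.gz' are placeholders.
import boto

ACCESS_KEY = '...'   # your archive.org S3 keys
SECRET_KEY = '...'

conn = boto.connect_s3(ACCESS_KEY, SECRET_KEY,
                       host='s3.us.archive.org', is_secure=False)

# On IA's S3-ish gateway, creating a "bucket" creates an item with that
# identifier; x-archive-meta-* headers become item metadata.
bucket = conn.create_bucket('my-test-item',
                            headers={'x-archive-meta-title': 'Test item'})

# Each key uploaded into the bucket becomes a file in the item.
key = bucket.new_key('dataset.nt.gz')
key.set_contents_from_filename('dataset.nt.gz')
```

As the log notes, there is no deleting through this API, so it pays to get key names right before uploading.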
[19:23] Not as a normal user, you can't
[19:23] but if you give me the identifier, I can move it
[19:23] or talk to the people who can
[19:23] kasabi
[19:24] going to be archiving the data they made available before they close their doors
[19:24] http://blog.kasabi.com/2012/07/09/shutting-down-kasabi/
[19:24] What will the data look like?
[19:24] bunch of rdf quads
[19:24] (zips, tars, textfiles)
[19:24] ah
[19:24] separate gzipped files for each dataset
[19:24] http://blog.kasabi.com/2012/07/16/archive-of-datasets/ has some details
[19:28] edsu: moved
[19:28] http://archive.org/details/kasabi?reCache=1
[19:29] thanks!
[19:37] underscor: I've got some items that I uploaded twice, once under the wrong identifier. what should I do about this?
[19:37] about 30 of them
[19:47] uploading to ia via boto is a thing of beauty, kudos to whoever made that happen https://github.com/edsu/kasabi-archive/blob/master/archive.py
[19:47] that's really spiffy
[19:53] underscor: thanks for your help
[19:54] np!
[22:51] hi folks, there was some discussion earlier about archiving floppies on -bs, so I'd like to contribute what I've used for my 3.5" windows/dos floppies: http://pastebin.com/mUVVpPZD
[22:52] I plan to put it up on the wiki on the archiving floppies part, but I thought I'd get some comments from folks first
[23:57] http://sriramk.com/unsolicitedyahoo.html
[23:57] SketchCow ^
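On the floppy-imaging discussion at 22:51: the pastebin's exact script is not reproduced here, but the usual approach it likely resembles - read the raw device sector by sector, retry failures, and zero-fill anything unreadable so the image geometry stays intact - can be sketched in a few lines. The device path assumes Linux, and the sector count assumes a standard 1.44 MB 3.5" disk:

```python
# Generic sketch, not the pastebin script above: image a 1.44 MB floppy
# sector by sector, retrying reads and zero-filling bad sectors so the
# output image keeps the correct size for later mounting/emulation.
import os

SECTOR = 512
TOTAL = 2880          # sectors on a standard 1.44 MB 3.5" floppy

def image_floppy(device='/dev/fd0', out='floppy.img', retries=3):
    fd = os.open(device, os.O_RDONLY)
    bad = []
    try:
        with open(out, 'wb') as img:
            for n in range(TOTAL):
                data = None
                for _ in range(retries):
                    try:
                        os.lseek(fd, n * SECTOR, os.SEEK_SET)
                        data = os.read(fd, SECTOR)
                        break
                    except OSError:
                        pass          # bad sector; retry
                if not data:
                    bad.append(n)
                    data = b'\x00' * SECTOR   # keep geometry intact
                img.write(data)
    finally:
        os.close(fd)
    return bad

if __name__ == '__main__':
    print('bad sectors:', image_floppy())
```

For damaged disks, a dedicated recovery tool such as GNU ddrescue does this job far more thoroughly; the sketch above only shows the basic idea.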