#archiveteam 2012-07-17,Tue


Time Nickname Message
02:12 🔗 shaqfu benuski: I saw a system the other day that, from the cursory look I gave it, would split newspapers by page region
02:13 🔗 shaqfu Seemed like it might be useful for your project
02:13 🔗 chronomex hm, page region how? did it figure out where columns were, or was it kind of like google maps slippy-view for newspaper scans?
02:16 🔗 shaqfu chronomex: It seemed able to break a page apart into blocks, and sort images/articles/ads
02:16 🔗 chronomex that's a real trick
02:16 🔗 shaqfu I'll have to find it again; I was on vacation
02:16 🔗 shaqfu Again, I didn't look too hard, so I could be wrong
02:17 🔗 chronomex still
02:18 🔗 chronomex I've seen things happen like that, didn't think about them before
02:18 🔗 shaqfu I saw it and thought it might be useful for breaking up issues
02:21 🔗 shaqfu Ah, found it. Let's see if my broken modem will let it load...
02:23 🔗 shaqfu http://historying.org/2012/07/12/coding-a-middle-ground/
02:31 🔗 chronomex ah, a semimanual approach
02:31 🔗 bsmith094 anyone still on the fanfiction archiving? keeping up with that?
02:31 🔗 chronomex I really like human-assisted machine systems
02:32 🔗 bsmith094 wikiteam could use some love, i think, that could be much easier to use with a tracker
02:32 🔗 shaqfu chronomex: Train on humans, then let the machines take over when they learn?
02:32 🔗 chronomex oh, no, he gridded out the page and figured the purpose of grid squares based on words
02:32 🔗 chronomex that's neat
02:33 🔗 shaqfu chronomex: That's why I figured it'd be useful here. Find the staff list, find what page it always appears on, and use that as your anchor for file breaks
02:33 🔗 chronomex shaqfu: something like that, really anything that takes away as many of the easy cases as possible
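[Editor's sketch] The anchor idea from 02:33 (find the staff list, note which page of an issue it always lands on, and count back to the issue start) might look roughly like this. It assumes OCR text is already available per page; the marker words, the `min_hits` cutoff, and the `staff_page_offset` parameter are all hypothetical and would need tuning against the real scans.

```python
from typing import List, Sequence

# Hypothetical marker words; a real masthead/staff box would need its own list.
STAFF_MARKERS = ("editor", "publisher", "subscription")


def has_staff_list(ocr_text: str,
                   markers: Sequence[str] = STAFF_MARKERS,
                   min_hits: int = 2) -> bool:
    """Return True if at least `min_hits` distinct marker words occur in the page text."""
    text = ocr_text.lower()
    return sum(1 for marker in markers if marker in text) >= min_hits


def issue_starts(pages: List[str], staff_page_offset: int) -> List[int]:
    """Guess issue start indices from pages that carry the staff list.

    `staff_page_offset` is how many pages after the front page the staff
    list is assumed to appear (0 means it is on the front page itself).
    """
    starts = []
    for index, text in enumerate(pages):
        start = index - staff_page_offset
        if start >= 0 and has_staff_list(text):
            starts.append(start)
    return starts
```

Pages between two consecutive start indices would then be treated as one issue. This only removes the easy cases; pages with poor OCR would still need a human pass.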
02:33 🔗 chronomex I seem to have come late to this conversation, what's the problem space?
02:34 🔗 shaqfu I've been away for a few days, and was thinking of benuski's problem of finding automatic ways to break pages from that ugly website dump
02:35 🔗 chronomex ugly website dump of newspaper scans?
02:35 🔗 shaqfu http://www.fultonhistory.com/Fulton.html
02:35 🔗 shaqfu But they're not split into issues - just pages
02:36 🔗 chronomex ahhh
02:36 🔗 chronomex ooo flash :(
02:36 🔗 shaqfu I told you it was ugly :(
02:37 🔗 chronomex so the problem is taking a page and determining whether it's an issue start?
02:37 🔗 shaqfu Yeah
02:37 🔗 chronomex that sounds conceptually straightforward
02:38 🔗 chronomex fuzzy warp-tolerant image-matching is the hard part of the problem
02:38 🔗 shaqfu Yep, although it doesn't seem to need to be very accurate
02:38 🔗 chronomex ok, so it's 2 million pages
02:38 🔗 chronomex I see why you don't want to do this manually :P
02:39 🔗 shaqfu If a page has a high incidence of large fonts that aren't ads, odds are profoundly good that it's the front
02:39 🔗 shaqfu chronomex: 20M
02:39 🔗 chronomex sheeze
02:43 🔗 shaqfu I'll link benuski to it next time he's on
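[Editor's sketch] shaqfu's 02:39 heuristic (lots of large type that is not advertising suggests a front page) could be prototyped as below. It assumes some upstream layout/OCR step has already produced text blocks with an estimated font size and an ad/non-ad label; the `TextBlock` fields and both thresholds are illustrative assumptions, not anything measured on the Fulton History scans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    font_size: float  # estimated type size in points (assumed layout-analysis output)
    area: float       # block area, in any consistent unit
    is_ad: bool       # True if the block was labelled as advertising


def large_font_share(blocks: List[TextBlock], large_pt: float = 24.0) -> float:
    """Fraction of non-ad text area set in type of at least `large_pt` points."""
    editorial = [b for b in blocks if not b.is_ad]
    total = sum(b.area for b in editorial)
    if total == 0:
        return 0.0
    large = sum(b.area for b in editorial if b.font_size >= large_pt)
    return large / total


def looks_like_front_page(blocks: List[TextBlock],
                          threshold: float = 0.25,
                          large_pt: float = 24.0) -> bool:
    """Guess 'front page' when enough of the editorial area is large type."""
    return large_font_share(blocks, large_pt) >= threshold


def split_into_issues(pages: List[List[TextBlock]]) -> List[List[List[TextBlock]]]:
    """Start a new issue at every page that looks like a front page."""
    issues: List[List[List[TextBlock]]] = []
    for page in pages:
        if not issues or looks_like_front_page(page):
            issues.append([])
        issues[-1].append(page)
    return issues
```

As noted in the channel, this does not have to be very accurate: a false split or a missed one can be corrected by hand far more cheaply than sorting 20M pages manually.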
03:21 🔗 SketchCow Ops, please.
03:22 🔗 SketchCow Also, please help me find all the archiveteam aux channels so I can join them.
03:22 🔗 chronomex I'm in #archiveteam #wikiteam #urlteam #archiveteam-bs #nowwhat
03:23 🔗 shaqfu There's #fireplanet
03:32 🔗 SketchCow So, what did I miss today?
03:45 🔗 BlueMax Apparently nothing :P
03:45 🔗 BlueMax I bought you a tutu though!
04:00 🔗 DFJustin I think I somehow made cdbbsarchive go away from the top bar at http://archive.org/details/software
04:28 🔗 lemonkey http://kotaku.com/5926527/the-secret-atari-emails-you-were-never-supposed-to-see-until-some-guy-released-them
04:40 🔗 lemonkey http://www.neogaf.com/forum/showthread.php?t=147082
04:52 🔗 SketchCow DFJustin: Impossible.
04:54 🔗 SketchCow Do you know how long those atari e-mails have been on textfiles.com?
04:56 🔗 SketchCow April 4, 2004.
04:56 🔗 SketchCow When were they put online? 2001.
05:15 🔗 underscor DFJustin: It wasn't your fault, but it was set to be hidden
05:25 🔗 DFJustin might throw classicpcgames up there while you're at it
05:33 🔗 underscor whoops
07:27 🔗 SketchCow http://www.kickstarter.com/projects/jmathai/openphoto-a-photo-service-for-your-s3-or-dropbox-a came to me to mention.
07:29 🔗 nitro2k01 So easy and useful http://stackoverflow.com/questions/4560400/how-can-i-get-google-cache-age-of-any-url-or-web-page
07:35 🔗 SmileyG ... XD
09:27 🔗 SketchCow Pumping in lots of stuff.
09:27 🔗 SketchCow Now adding all the JWZ mixtapes I was given, 104 in all
09:27 🔗 SketchCow http://archive.org/details/jwz-mixtape-001
10:06 🔗 Ymgve maybe those will help with solving the riddle on his page
13:52 🔗 ersi Ok, I think: http://arcticready.com/social might be something good and funny to save
13:54 🔗 Schbirid ersi: it's a spoof
13:56 🔗 ersi Schbirid: Still think it's worth while. I'm gonna see if I can whip something together
13:58 🔗 SmileyG hmmmm
14:00 🔗 Schbirid :)
14:02 🔗 ersi Also, I'm not sure if mentioned earlier... this is half off-topic and half on-topic: "Marissa Mayer, Google's employee #20 and Vice President of Local, has been appointed CEO of Yahoo."
14:05 🔗 BlueMax Who in their right mind would leave Google for Yahoo
14:06 🔗 ersi Someone who's in need of a *real* challenge, perhaps?
14:06 🔗 ersi Hard to get that floating turd of a ship (ie Yahoo) going upwards instead of down into the bottomless pit of the ocean
14:09 🔗 SmileyG Someone who knows they'll always be welcome at google
14:09 🔗 SmileyG or already have enough dosh, why not have some fun?
14:10 🔗 Schbirid if i was a billionaire (she is?) i would love to go to yahoo
14:15 🔗 SmileyG Exactly, and I'm gonna steer this ship back on course now
14:15 🔗 ersi Some figures estimate she's worth 300 million USD, so in some currencies - she's a billionaire
14:26 🔗 SmileyG Or not...... However; thinking of it another way - If she fails, no one will really care. If she succeeds? wow.
16:16 🔗 SketchCow -bs
16:35 🔗 SketchCow http://archive.org/details/dnamixtape is now uploaded.
18:11 🔗 edsu anyone happen to know if there's a python library for putting stuff up at internet archive using their s3-ish api?
18:16 🔗 edsu oh hmm might be possible w/ boto http://www.elastician.com/2011/02/accessing-internet-archive-with-boto.html
18:26 🔗 ersi there's no specific one, no. but urllib/urllib2/requests etc are what you'd probably need
18:54 🔗 underscor edsu: boto is what we recommend
18:54 🔗 underscor >>> import boto
18:54 🔗 underscor Hopefully you can just global search and replace amazonaws.com with us.archive.org.
18:54 🔗 underscor The S3 API works well with the boto python library (multipart too!),
18:54 🔗 underscor We strive to make the S3 API compatible enough with current client code.
18:54 🔗 underscor use is_secure=False, host='s3.us.archive.org' when creating your boto connection.
18:54 🔗 underscor >>> conn = boto.connect_s3(key, secret, host='s3.us.archive.org', is_secure=False)
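[Editor's sketch] Putting underscor's pasted settings together, an upload through the legacy boto 2 library might look like the following. The helper names (`ia_connection_kwargs`, `upload_to_item`) are made up for illustration, and the mapping of an S3 "bucket" to an archive.org item identifier is an assumption about how the S3-like API is laid out; only `host='s3.us.archive.org'` and `is_secure=False` come from the log. boto is imported inside the function so the rest of the snippet loads even where boto 2 is not installed.

```python
IA_S3_HOST = "s3.us.archive.org"  # host quoted in the channel


def ia_connection_kwargs() -> dict:
    """Keyword arguments for boto.connect_s3, as quoted in the channel."""
    return {"host": IA_S3_HOST, "is_secure": False}


def upload_to_item(access_key: str, secret_key: str,
                   identifier: str, path: str, key_name: str) -> str:
    """Upload one local file into an item (hypothetical helper).

    Assumes the item identifier is used as the bucket name and that
    `key_name` is the file name the upload should get inside the item.
    """
    import boto  # legacy boto 2, imported lazily; requires `pip install boto`

    conn = boto.connect_s3(access_key, secret_key, **ia_connection_kwargs())
    bucket = conn.create_bucket(identifier)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(path)
    return key.name
```

A call would look like `upload_to_item(key, secret, "my-item", "data.nq.gz", "data.nq.gz")`, where every argument is a placeholder. It is worth setting `key_name` deliberately before uploading, since a wrongly named key is awkward to clean up afterwards.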
19:08 🔗 edsu underscor: is working pretty nicely
19:08 🔗 edsu underscor: if i accidentally uploaded some files without setting the key name properly is it possible to remove the errorneous keys?
19:09 🔗 edsu s/errorneous/erroneous/
19:09 🔗 edsu http://ia600804.us.archive.org/4/items/kasabi/
19:10 🔗 edsu underscor: i guess through this i can http://ia600804.us.archive.org/edit.php?identifier=kasabi
19:11 🔗 edsu underscor: but not through the api eh?
19:11 🔗 underscor no, no deleting through the API
19:12 🔗 underscor and we'll probably be removing that ability from edit.php at some point too
19:12 🔗 underscor we don't like deleting
19:12 🔗 * edsu neither
19:23 🔗 edsu underscor: is it easy to reassign the collection that an item is part of?
19:23 🔗 edsu underscor: there's a archiveteam collection isn't there?
19:23 🔗 underscor Not as a normal user, you can't
19:23 🔗 underscor but if you give me the identifier, I can move it
19:23 🔗 underscor or talk to the people who can
19:23 🔗 edsu kasabi
19:24 🔗 edsu going to be archiving the data they made available before they close their doors
19:24 🔗 edsu http://blog.kasabi.com/2012/07/09/shutting-down-kasabi/
19:24 🔗 underscor What will the data look like?
19:24 🔗 edsu bunch of rdf quads
19:24 🔗 underscor (zips, tars, textfiles)
19:24 🔗 underscor ah
19:24 🔗 edsu separate gzipped files for each dataset
19:24 🔗 edsu http://blog.kasabi.com/2012/07/16/archive-of-datasets/ has some details
19:28 🔗 underscor edsu: moved
19:28 🔗 underscor http://archive.org/details/kasabi?reCache=1
19:29 🔗 edsu thanks!
19:37 🔗 chronomex underscor: I've got some items that I uploaded twice, once under the wrong identifier. what should I do about this?
19:37 🔗 chronomex about 30 of them
19:47 🔗 edsu uploading to ia via boto is a thing of beauty, kudos to whoever made that happen https://github.com/edsu/kasabi-archive/blob/master/archive.py
19:47 🔗 chronomex that's really spiffy
19:53 🔗 edsu underscor: thanks for your help
19:54 🔗 underscor np!
22:51 🔗 dashcloud hi folks, there was some discussion earlier about archiving floppies on -bs, so I'd like to contribute what I've used for my 3.5'' windows/dos floppies: http://pastebin.com/mUVVpPZD
22:52 🔗 dashcloud I plan to put it up on the wiki on the archiving floppies part, but I thought I'd get some comments from folks first
23:57 🔗 kennethre http://sriramk.com/unsolicitedyahoo.html
23:57 🔗 kennethre SketchCow ^
