Time |
Nickname |
Message |
02:12
🔗
|
shaqfu |
benuski: I saw a system the other day that, from the cursory look I gave it, would split newspapers by page region |
02:13
🔗
|
shaqfu |
Seemed like it might be useful for your project |
02:13
🔗
|
chronomex |
hm, page region how? did it figure out where columns were, or was it kind of like google maps slippy-view for newspaper scans? |
02:16
🔗
|
shaqfu |
chronomex: It seemed able to break a page apart into blocks, and sort images/articles/ads |
02:16
🔗
|
chronomex |
that's real trick |
02:16
🔗
|
shaqfu |
I'll have to find it again; I was on vacation |
02:16
🔗
|
shaqfu |
Again, I didn't look too hard, so I could be wrong |
02:17
🔗
|
chronomex |
still |
02:18
🔗
|
chronomex |
I've seen things happen like that, didn't think about them before |
02:18
🔗
|
shaqfu |
I saw it and thought it might be useful for breaking up issues |
02:21
🔗
|
shaqfu |
Ah, found it. Let's see if my broken modem will let it load... |
02:23
🔗
|
shaqfu |
http://historying.org/2012/07/12/coding-a-middle-ground/ |
02:31
🔗
|
chronomex |
ah, a semimanual approach |
02:31
🔗
|
bsmith094 |
anyone still on the fanfiction archiving? keeping up with that? |
02:31
🔗
|
chronomex |
I really like human-assisted machine systems |
02:32
🔗
|
bsmith094 |
wikiteam could use some love, i think, that could be much easier to use with a tracker |
02:32
🔗
|
shaqfu |
chronomex: Train on humans, then let the machines take over when they learn? |
02:32
🔗
|
chronomex |
oh, no, he gridded out the page and figured the purpose of grid squares based on words |
02:32
🔗
|
chronomex |
that's neat |
02:33
🔗
|
shaqfu |
chronomex: That's why I figured it'd be useful here. Find the staff list, find what page it always appears on, and use that as your anchor for file breaks |
02:33
🔗
|
chronomex |
shaqfu: something like that, really anything that takes away as many of the easy cases as possible |
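The anchor idea shaqfu describes can be sketched roughly as follows. This is only an illustration under stated assumptions: OCR text is assumed to be available per page, and the names `split_issues`, `is_anchor` and `anchor_offset` are hypothetical, not part of any tool mentioned in the channel.

```python
from typing import Callable, List


def split_issues(
    pages: List[str],
    is_anchor: Callable[[str], bool],
    anchor_offset: int = 0,
) -> List[List[int]]:
    """Group page indices into issues using a recurring anchor page.

    pages         -- OCR text of each scanned page, in scan order (assumed input)
    is_anchor     -- hypothetical predicate, e.g. "does this page contain the staff list?"
    anchor_offset -- how many pages into an issue the anchor always appears
                     (0 means the anchor page is the first page of the issue)
    """
    if not pages:
        return []
    # Every anchor hit implies an issue start `anchor_offset` pages earlier.
    starts = sorted(
        {max(i - anchor_offset, 0) for i, text in enumerate(pages) if is_anchor(text)}
    )
    # Pages before the first detected issue start form a leading group of their own.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(pages)]
    return [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    demo = ["front", "STAFF: editor", "ads", "front", "STAFF: editor", "sports", "classifieds"]
    print(split_issues(demo, lambda text: "STAFF" in text, anchor_offset=1))
```

Pages where the anchor is missed by OCR simply get merged into the previous issue, which leaves the hard cases for a human and takes away the easy ones.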
02:33
🔗
|
chronomex |
I seem to have come late to this conversation, what's the problem space? |
02:34
🔗
|
shaqfu |
I've been away for a few days, and was thinking of benuski's problem of finding automatic ways to break pages from that ugly website dump |
02:35
🔗
|
chronomex |
ugly website dump of newspaper scans? |
02:35
🔗
|
shaqfu |
http://www.fultonhistory.com/Fulton.html |
02:35
🔗
|
shaqfu |
But they're not split into issues - just pages |
02:36
🔗
|
chronomex |
ahhh |
02:36
🔗
|
chronomex |
ooo flash :( |
02:36
🔗
|
shaqfu |
I told you it was ugly :( |
02:37
🔗
|
chronomex |
so the problem is taking a page and determining whether it's an issue start? |
02:37
🔗
|
shaqfu |
Yeah |
02:37
🔗
|
chronomex |
that sounds conceptually straightforward |
02:38
🔗
|
chronomex |
fuzzy warp-tolerant image-matching is the hard part of the problem |
02:38
🔗
|
shaqfu |
Yep, although it doesn't seem to need to be very accurate |
02:38
🔗
|
chronomex |
ok, so it's 2 million pages |
02:38
🔗
|
chronomex |
I see why you don't want to do this manually :P |
02:39
🔗
|
shaqfu |
If a page has a high incidence of large fonts that aren't ads, odds are profoundly good that it's the front |
02:39
🔗
|
shaqfu |
chronomex: 20M |
02:39
🔗
|
chronomex |
sheeze |
02:43
🔗
|
shaqfu |
I'll link benuski to it next time he's on |
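The "large fonts that aren't ads" heuristic from earlier in this exchange could look something like the sketch below. It assumes an OCR or layout step has already produced text blocks with a font height and an ad flag; the `Block` type, the 40-pixel cutoff and the 0.3 share threshold are made-up values for illustration, not measurements from the Fulton scans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One text block from a page, as a hypothetical layout step might report it."""
    text: str
    font_height: float  # in pixels (assumed unit)
    is_ad: bool


def large_font_share(blocks: List[Block], min_height: float = 40.0) -> float:
    """Fraction of non-ad blocks whose font is at least `min_height` tall."""
    editorial = [b for b in blocks if not b.is_ad]
    if not editorial:
        return 0.0
    large = sum(1 for b in editorial if b.font_height >= min_height)
    return large / len(editorial)


def looks_like_front_page(
    blocks: List[Block], min_height: float = 40.0, min_share: float = 0.3
) -> bool:
    """Guess that a page starts an issue when large non-ad type is common."""
    return large_font_share(blocks, min_height) >= min_share
```

Since the match does not need to be very accurate, a loose threshold plus a manual pass over borderline pages may be enough.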
03:21
🔗
|
SketchCow |
Ops, please. |
03:22
🔗
|
SketchCow |
Also, please help me find all the archiveteam aux channels so I can join them. |
03:22
🔗
|
chronomex |
I'm in #archiveteam #wikiteam #urlteam #archiveteam-bs #nowwhat |
03:23
🔗
|
shaqfu |
There's #fireplanet |
03:32
🔗
|
SketchCow |
So, what did I miss today? |
03:45
🔗
|
BlueMax |
Apparently nothing :P |
03:45
🔗
|
BlueMax |
I bought you a tutu though! |
04:00
🔗
|
DFJustin |
I think I somehow made cdbbsarchive go away from the top bar at http://archive.org/details/software |
04:28
🔗
|
lemonkey |
http://kotaku.com/5926527/the-secret-atari-emails-you-were-never-supposed-to-see-until-some-guy-released-them |
04:40
🔗
|
lemonkey |
http://www.neogaf.com/forum/showthread.php?t=147082 |
04:52
🔗
|
SketchCow |
DFJustin: Impossible. |
04:54
🔗
|
SketchCow |
Do you know how long those atari e-mails have been on textfiles.com? |
04:56
🔗
|
SketchCow |
April 4, 2004. |
04:56
🔗
|
SketchCow |
When were they put online? 2001. |
05:15
🔗
|
underscor |
DFJustin: It wasn't your fault, but it was set to be hidden |
05:25
🔗
|
DFJustin |
might throw classicpcgames up there while you're at it |
05:33
🔗
|
underscor |
whoops |
07:27
🔗
|
SketchCow |
http://www.kickstarter.com/projects/jmathai/openphoto-a-photo-service-for-your-s3-or-dropbox-a came to me to mention. |
07:29
🔗
|
nitro2k01 |
So easy and useful http://stackoverflow.com/questions/4560400/how-can-i-get-google-cache-age-of-any-url-or-web-page |
07:35
🔗
|
SmileyG |
... XD |
09:27
🔗
|
SketchCow |
Pumping in lots of stuff. |
09:27
🔗
|
SketchCow |
Now adding all the JWZ mixtapes I was given, 104 in all |
09:27
🔗
|
SketchCow |
http://archive.org/details/jwz-mixtape-001 |
10:06
🔗
|
Ymgve |
maybe those will help with solving the riddle on his page |
13:52
🔗
|
ersi |
Ok, I think: http://arcticready.com/social might be something good and funny to save |
13:54
🔗
|
Schbirid |
ersi: it's a spoof |
13:56
🔗
|
ersi |
Schbirid: Still think it's worth while. I'm gonna see if I can whip something together |
13:58
🔗
|
SmileyG |
hmmmm |
14:00
🔗
|
Schbirid |
:) |
14:02
🔗
|
ersi |
Also, I'm not sure if mentioned earlier... this is half off-topic and half on-topic: "Marissa Mayer, Google's employee #20 and Vice President of Local, has been appointed CEO of Yahoo." |
14:05
🔗
|
BlueMax |
Who in their right mind would leave Google for Yahoo |
14:06
🔗
|
ersi |
Someone who's in need of a *real* challenge, perhaps? |
14:06
🔗
|
ersi |
Hard to get that floating turd of a ship (ie Yahoo) going upwards instead of down into the bottomless pit of the ocean |
14:09
🔗
|
SmileyG |
Someone who knows they'll always be welcome at google |
14:09
🔗
|
SmileyG |
or already have enough dosh, why not have some fun? |
14:10
🔗
|
Schbirid |
if i was a billionaire (she is?) i would love to go to yahoo |
14:15
🔗
|
SmileyG |
Exactly, and I'm gonna steer this ship back on course now |
14:15
🔗
|
ersi |
Some figures estimate she's worth 300 million USD, so in some currencies - she's a billionaire |
14:26
🔗
|
SmileyG |
Or not...... However; thinking of it another way - If she fails, no one will really care. If she succeeds? wow. |
16:16
🔗
|
SketchCow |
-bs |
16:35
🔗
|
SketchCow |
http://archive.org/details/dnamixtape is now uploaded. |
18:11
🔗
|
edsu |
anyone happen to know if there's a python library for putting stuff up at internet archive using their s3-ish api? |
18:16
🔗
|
edsu |
oh hmm might be possible w/ boto http://www.elastician.com/2011/02/accessing-internet-archive-with-boto.html |
18:26
🔗
|
ersi |
there's no specific one, no. but urllib/urllib2/requests etc are what you'd probably need |
18:54
🔗
|
underscor |
edsu: boto is what we recommend |
18:54
🔗
|
underscor |
>>> import boto |
18:54
🔗
|
underscor |
Hopefully you can just global search and replace amazonaws.com with us.archive.org. |
18:54
🔗
|
underscor |
The S3 API works well with the boto python library (multipart too!), |
18:54
🔗
|
underscor |
We strive to make the S3 API compatible enough with current client code. |
18:54
🔗
|
underscor |
use is_secure=False, host='s3.us.archive.org' when creating your boto connection. |
18:54
🔗
|
underscor |
>>> conn = boto.connect_s3(key, secret, host='s3.us.archive.org', is_secure=False) |
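Building on the connection line quoted above, a minimal upload sketch with boto 2.x might look like this. The host and `is_secure=False` come from the chat; everything else is an assumption for illustration: the helper names are hypothetical, the item (bucket) is assumed to exist already, and whether `get_bucket`'s default validation behaves well against archive.org is not confirmed here.

```python
import os

IA_S3_HOST = "s3.us.archive.org"  # host quoted in the chat above


def connect_ia_s3(access_key, secret_key):
    """Open a boto 2.x S3 connection pointed at the Internet Archive."""
    import boto  # imported lazily so upload_files() can be exercised without boto

    return boto.connect_s3(access_key, secret_key, host=IA_S3_HOST, is_secure=False)


def upload_files(conn, identifier, paths):
    """Upload each local path into the item `identifier`, keyed by its file name.

    Assumes the item (bucket) already exists; key names are set explicitly from
    the base name of each path, which is a choice made for this sketch.
    """
    bucket = conn.get_bucket(identifier)
    uploaded = []
    for path in paths:
        key_name = os.path.basename(path)
        key = bucket.new_key(key_name)
        key.set_contents_from_filename(path)
        uploaded.append(key_name)
    return uploaded
```

Typical use would be `upload_files(connect_ia_s3(key, secret), "some-identifier", ["data.gz"])`, where the identifier and file name are placeholders.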
19:08
🔗
|
edsu |
underscor: is working pretty nicely |
19:08
🔗
|
edsu |
underscor: if i accidentally uploaded some files without setting the key name properly is it possible to remove the errorneous keys? |
19:09
🔗
|
edsu |
s/errorneous/erroneous/ |
19:09
🔗
|
edsu |
http://ia600804.us.archive.org/4/items/kasabi/ |
19:10
🔗
|
edsu |
underscor: i guess through this i can http://ia600804.us.archive.org/edit.php?identifier=kasabi |
19:11
🔗
|
edsu |
underscor: but not through the api eh? |
19:11
🔗
|
underscor |
no, no deleting through the API |
19:12
🔗
|
underscor |
and we'll probably be removing that ability from edit.php at some point too |
19:12
🔗
|
underscor |
we don't like deleting |
19:12
🔗
|
* |
edsu neither |
19:23
🔗
|
edsu |
underscor: is it easy to reassign the collection that an item is part of? |
19:23
🔗
|
edsu |
underscor: there's an archiveteam collection isn't there? |
19:23
🔗
|
underscor |
Not as a normal user, you can't |
19:23
🔗
|
underscor |
but if you give me the identifier, I can move it |
19:23
🔗
|
underscor |
or talk to the people who can |
19:23
🔗
|
edsu |
kasabi |
19:24
🔗
|
edsu |
going to be archiving the data they made available before they close their doors |
19:24
🔗
|
edsu |
http://blog.kasabi.com/2012/07/09/shutting-down-kasabi/ |
19:24
🔗
|
underscor |
What will the data look like? |
19:24
🔗
|
edsu |
bunch of rdf quads |
19:24
🔗
|
underscor |
(zips, tars, textfiles |
19:24
🔗
|
underscor |
ah |
19:24
🔗
|
edsu |
separate gzipped files for each dataset |
19:24
🔗
|
edsu |
http://blog.kasabi.com/2012/07/16/archive-of-datasets/ has some details |
19:28
🔗
|
underscor |
edsu: moved |
19:28
🔗
|
underscor |
http://archive.org/details/kasabi?reCache=1 |
19:29
🔗
|
edsu |
thanks! |
19:37
🔗
|
chronomex |
underscor: I've got some items that I uploaded twice, once under the wrong identifier. what should I do about this? |
19:37
🔗
|
chronomex |
about 30 of them |
19:47
🔗
|
edsu |
uploading to ia via boto is a thing of beauty, kudos to whoever made that happen https://github.com/edsu/kasabi-archive/blob/master/archive.py |
19:47
🔗
|
chronomex |
that's really spiffy |
19:53
🔗
|
edsu |
underscor: thanks for your help |
19:54
🔗
|
underscor |
np! |
22:51
🔗
|
dashcloud |
hi folks, there was some discussion earlier about archiving floppies on -bs, so I'd like to contribute what I've used for my 3.5'' windows/dos floppies: http://pastebin.com/mUVVpPZD |
22:52
🔗
|
dashcloud |
I plan to put it up on the wiki on the archiving floppies part, but I thought I'd get some comments from folks first |
23:57
🔗
|
kennethre |
http://sriramk.com/unsolicitedyahoo.html |
23:57
🔗
|
kennethre |
SketchCow ^ |