00:01 <bsmith093> how do i get a bunch of text files to have all their paragraphs on one line, meaning just paragraph breaks, no line breaks? 96 files all at once is preferred
00:02 <arrith> bsmith093: what os / distro?
00:02 <bsmith093> ubuntu lucid
00:02 <arrith> are the files plaintext? or html/xml/etc?
00:02 <SketchCow> http://twitter.com/#!/textfiles/status/129707255141642240
00:03 <bsmith093> arrith: plain text
00:04 <bsmith093> SketchCow: who's your friend?
00:04 <SketchCow> Aaron Swartz
00:05 <bsmith093> of Demand Progress, the PAC?
00:05 <bsmith093> ps google rocks :)
00:06 <arrith> bsmith093: do you know if they're dos (crlf) or unix (lf)? and if they have multiple linebreaks between paragraphs consistently?
00:07 <arrith> sed or awk or perl btw would all probably work. the exact command depends on how the files are structured.
00:07 <bsmith093> ummm, not sure, some were originally pdb files converted to txt, with several doc files converted with unoconv
00:07 <bsmith093> how do i find out?
00:08 <arrith> bsmith093: in a terminal, do: file textfile.txt
00:08 <arrith> it should say something like: robots.txt: ASCII text, with CRLF line terminators
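That check scales to all 96 files in one go, since file(1) accepts a glob. A minimal sketch (the two sample files are made up here for demonstration):

```shell
# Create one DOS-style and one Unix-style sample, then let file(1)
# report the line terminators for every .txt at once.
printf 'dos line one\r\ndos line two\r\n' > dos-sample.txt
printf 'unix line one\nunix line two\n' > unix-sample.txt
file *.txt
```

Files using CRLF are called out explicitly ("with CRLF line terminators"); plain LF files just report as text.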
00:11 <bsmith093> all of them say, and i quote, "UTF-8 Unicode English text, with very long lines"
00:12 <bsmith093> and they mostly are, but some parts of the text are annoyingly skinny columns, and i'd like to batch fix that
00:24 <arrith> hmm
00:24 <bsmith093> any ideas?
00:25 <bsmith093> the closest thing i've found is something with vim, but that's like greek to me, and it's only one file at a time
00:26 <Coderjoe> yes, fuck you SFO
00:26 <arrith> bsmith093: what's the thing in vim?
00:26 <Coderjoe> also, what was that that needed 4 phases?
00:26 <arrith> bsmith093: i was just looking for a thing to find if any files had crlf
00:26 <arrith> http://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-under-l
00:27 <arrith> this seems to do it: grep -IUrl --color '^M' .
00:27 <arrith> the ^M is a literal ctrl-M in the terminal (type ctrl-V, then ctrl-M)
00:29 <arrith> i'd convert any CRLFs to LF, then separate the files into batches depending on how the paragraphs and sentences are separated
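The first step of that plan can be sketched with GNU sed (present on Ubuntu); the sample filename is invented, and the edit is in place, so back up anything irreplaceable first:

```shell
# Normalize CRLF to LF by deleting the trailing carriage return on
# each line, for every .txt file in the current directory.
printf 'para one\r\n\r\npara two\r\n' > crlf-sample.txt   # demo input
sed -i 's/\r$//' *.txt
```

After this, every file is plain LF, which makes the later paragraph-joining steps behave consistently.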
00:49 <arrith> learning awk will make all of this way easier
01:35 <bsmith093> arrith: ok, i found something that looks vaguely probable for removing line breaks: awk 'BEGIN{}{printf "%d, ", NR}END{printf "\n"}' filename. now how do i use this, and can it do many files at once?
01:37 <arrith> bsmith093: if you just want to remove lf linebreaks you can do: tr -d '\n' < in-file.txt > out-file.txt
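Note that tr -d '\n' strips every newline, paragraph breaks included. What the original question asks for (join the lines inside each paragraph, keep the blank lines between paragraphs) is awk's paragraph mode; a sketch with a made-up sample file:

```shell
# awk "paragraph mode": RS="" makes each blank-line-separated block a
# single record; gsub() then folds its internal newlines into spaces.
printf 'first paragraph\nstill first.\n\nsecond\nparagraph here.\n' > wrapped.txt
awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }' wrapped.txt > joined.txt
cat joined.txt
```

Doing all 96 files at once is then just a loop: for f in *.txt; do awk 'BEGIN{RS="";ORS="\n\n"}{gsub(/\n/," ");print}' "$f" > "$f.joined"; done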
01:38 <arrith> i don't really know awk, so that might be doing something like replacing two newlines with one
01:40 <bsmith093> i just get a prompt arrow, like it's waiting for something
01:42 <arrith> bsmith093: are you able to pastebin an example file?
01:42 <bsmith093> an example of what i'm trying to fix? yes
01:42 <arrith> that would help
01:45 <bsmith093> here: http://pastebin.com/YuaErAjh
01:45 <bsmith093> notice the first 50 lines compared to the rest
01:45 <bsmith093> i have hundreds like this, and it's really annoying
01:46 <arrith> hmm, so unwanted linebreaks in some places
01:46 <bsmith093> it's not just that little chunk either, otherwise i'd quit whining and just fix it manually, but it randomly happens throughout the file
01:46 <bsmith093> and others
01:47 <arrith> well, to do it in a non-ai automated manner you have to find a pattern for where the issue is. like: remove newlines until a period is found, then skip two newlines and repeat
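That remove-newlines-until-a-sentence-ends heuristic can be sketched in awk (the sample text is invented; real prose with abbreviations like "Mr." would need more care):

```shell
# Accumulate lines into buf; when a line ends with ., ! or ?, treat
# the sentence run as complete and emit it as one output line.
printf 'The quick\nbrown fox.\nJumped over\nthe dog.\n' > narrow.txt
awk '{ buf = (buf == "" ? $0 : buf " " $0) }
     /[.!?]$/ { print buf; buf = "" }
     END { if (buf != "") print buf }' narrow.txt
```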
01:48 <arrith> but that won't ensure paragraphs are grouped properly
01:48 <arrith> just sentences
01:49 <arrith> hm
01:49 <arrith> bsmith093: did you edit stuff in that pastebin manually, or is that just what the text is like?
01:50 <bsmith093> wouldn't it be easier to just remove all line break chars, but leave paragraphs alone?
01:50 <bsmith093> nope, that's the exact file as i have it, unedited
01:50 <arrith> well, with text files there isn't really a difference between a newline at the end of a paragraph and a blank line
01:51 <arrith> you can open up a file like that in a hex editor to see
01:51 <arrith> 0A is LF and 0D is CR
01:56 <arrith> could do a thing in awk where a line has to begin with either a capital letter or a quote, otherwise it joins lines
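That capital-letter-or-quote rule as an awk sketch (sample text invented; it will mis-join paragraphs whose wrapped continuation lines happen to start with a capital, so treat it as a starting point, not a finished tool):

```shell
# Start a fresh output line only when the input line opens with a
# capital letter or a double quote; otherwise glue it onto the buffer.
printf 'He said\nhello there.\n"Fine," she\nreplied.\n' > cols.txt
awk 'NR == 1 { buf = $0; next }
     /^["A-Z]/ { print buf; buf = $0; next }
     { buf = buf " " $0 }
     END { print buf }' cols.txt
```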
02:02 <bsmith093> well, i originally converted them from a mix of pdb and doc files, so it might just be some weirdness there
02:04 <arrith> yeah. there might be a better conversion program out there
02:40 <underscor> THAT AWESOME FEELING YOU GET WHEN YOU FIND 4TB OF FRIENDSTER FOR SketchCow
02:40 <bbot_> tab complete doesn't respect capslock, apparently
02:47 <underscor> Is someone rtmpdumping the conference?
04:33 <Coderjoe> underscor: conference?
04:34 <underscor> Books in Browsers
04:34 <underscor> @ IA
04:34 <Coderjoe> if I had known someone wanted something dumped, I might have done it
04:34 <underscor> I'm sure they're gonna release it anyway
04:34 <underscor> I was just curious
04:35 <Coderjoe> unfortunately, I was trying to grab something else, and ustream's ppv live setup still baffles me, so I had to TRY to screenrecord it. I am pretty sure the recording failed.
04:35 <Coderjoe> (i need to hack up camstudio to use opendml or libav or something. it currently has a 4gb filesize limit)
04:36 <Coderjoe> in other news, that wget process for woxy is now up to 9.5G of memory and still chugging
04:36 <Coderjoe> (the instance still has 7G free)
04:38 <underscor> Big instance?
04:38 <Coderjoe> high-memory xlarge instance
04:38 <Coderjoe> as a spot instance. currently running about $6/day
04:40 <Coderjoe> I pushed the broken woxy fetch out to s3 and deleted the 100GB ebs volume those files were on.
04:41 <Coderjoe> (paying s3 prices on 17GB of data is better than paying ebs prices on a 100GB volume)
04:44 <Coderjoe> so that other wget would have failed from ram issues if it hadn't failed trying to write the warc file at some point
04:45 <Coderjoe> (it was a 32bit instance, so the absolute max wget could have in the process is 4g)
06:30 <arrith> closure: have you heard back from the BerliOS people?
09:50 <Coderjoe> well that's awesome...
09:50 <Coderjoe> livejournal's friends page can only go back 20 posts. once you get to the second page (skip=20), the page is blank
09:53 <ersi> "It's not a bug, it's a feature!"
11:30 <ersi> http://www.jwz.org/blog/2011/10/the-internet-archive/ :)
11:33 <phik> interesting stuff
15:28 <Coderjoe> and now that wget is at 12.3G
15:44 <alard> Coderjoe: Still doing woxy.com? I've got a 16GB heritrix dump if you want. :)
15:47 <SketchCow> http://www.jwz.org/blog/2011/10/the-internet-archive
16:08 <closure> I have not heard back from the BerliOS admins (someone asked)
16:09 <closure> we should get all the data we ripped into one place (before people lose it)
16:15 <SketchCow> I have tons of space on batcave right now.
16:18 <closure> ok, I think we had about 3 tb of data
16:18 <closure> thing is, we will want to run one more rsync pass later, probably, to get the final updates to projects etc.
16:19 <closure> since we're rsyncing everything from berlios it will be pretty fast to run -- could it be run on batcave?
16:50 <SketchCow> Yes
16:50 <SketchCow> Just throw that up there.
16:52 <closure> ok, sweet..
16:53 <closure> oh, it's only 300 gb anyway
16:54 <closure> balrog alard dashcloud yipdw underscor wyatt Coderjoe ersi: time to upload your BerliOS stuff to batcave
16:54 <DFJustin> lol this channel, "oh, it's only 300 gb"
16:54 <yipdw> closure: sure thing
16:55 <yipdw> closure: are connection details further up in the channel?
16:55 <closure> I think SketchCow has to set you up with an account
16:55 <yipdw> ok
16:56 <yipdw> SketchCow: whenever you've got time, send me upload info for batcave
16:56 * closure too
17:40 <SketchCow> On it
17:40 * SketchCow blew in a pile of Roland items.
17:50 <SketchCow> Please rsync into a directory called berlios
17:51 <closure> SketchCow: one thing before it starts pouring in.. we are not tarring the stuff up, because we want to run rsync again. so expect lots of loose files
17:51 <SketchCow> Currently, we're at 17tb of free disk space
17:51 <SketchCow> We can handle it
18:12 <SketchCow> -----
18:12 <SketchCow> From a local:
18:12 <SketchCow> Hey, you might be just the man for this: Do you know of a tool that takes a corrupted gzip and extracts useful stuff from it?
18:13 <SketchCow> ....do we know of any?
18:13 <SketchCow> -----
19:13 <closure> not unless it's a rare one made with gzip --rsyncable
19:14 <SketchCow> http://www.gzip.org/recover.txt
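One practical takeaway from that recover.txt note: gzip is a stream format, so plain decompression already emits everything it can decode before hitting the damage. A sketch with a deliberately damaged file (the damage here is trailing garbage; for corruption mid-stream, a dedicated tool such as gzrecover from the gzrt package can go further):

```shell
# Build a gzip file, damage it, and salvage what gzip can still read.
printf 'recoverable text\n' | gzip > good.gz
cat good.gz > damaged.gz
printf 'trailing garbage' >> damaged.gz    # simulate corruption
gzip -dc damaged.gz > salvaged.txt 2>/dev/null || true
cat salvaged.txt
```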
19:14 <SketchCow> I gave him that
19:19 <closure> SketchCow: upload in progress to batcave.. some of it's behind a slow link, I estimate 15 days to complete
19:32 <underscor> SketchCow: iirc Coderjoe wrote something that at least tells you what's wrong with it
19:32 <underscor> closure: All mine already is
19:32 <closure> underscor: update wiki?
20:33 <SketchCow> No issues, closure
20:33 <alard> closure: My berlios chunks are already on batcave. (I started uploading a bit earlier.)
20:35 <alard> As is my copy of woxy.com, by the way.
22:02 <SketchCow> ---------------------------
22:02 <SketchCow> Whoever wants it - http://www.bbc.co.uk/rd/publications/bbc_monograph_39.shtml
22:02 <SketchCow> Just looking for the monographs to be downloaded, plus a .txt file of description where they have one.
22:02 <SketchCow> ---------------------------
22:02 <SketchCow> I made the official call for a Javascript port of MESS/MAME
22:13 <Cowering> SketchCow, MESS/MAME already have DRCs for certain things.. talk someone into making a DRC to javascript 'CPU'
22:13 <SketchCow> Aware.
22:14 <Cowering> but, since quite a few systems won't even emulate on a native i7 3.5 GHz at full speed, javascript might still be pushing it a little :)
22:15 <SketchCow> Aware.
22:15 <alard> SketchCow: The monographs are currently uploading to batcave.
22:15 <SketchCow> alard: Thanks, man
22:16 <alard> There's a text file for each pdf with line 1: title, line 2: authors, line 3+: description. Is that what you need?
22:16 <SketchCow> Works great for me.
22:17 <alard> Actually, only the first one has a description.
22:18 <alard> (And it's not even about the document itself, but about the series.)
22:21 <alard> SketchCow: Which documents do you actually want? All of them, or just the monographs 1955 to 1969 that you linked to?
22:21 <SketchCow> Just the monographs at the moment.
22:21 <alard> Not 'Engineering 1970 to 1970'?
22:22 <SketchCow> Looks like just one.
22:22 <SketchCow> If you want to grab all of them, I'll snap them all up.
22:23 <alard> I'll get them all. The numbering continues: 80 is the last monograph, from 81 to 115 it's BBC Engineering.
22:23 <alard> But not linked anywhere, it seems.
22:24 <SketchCow> Do it if you can.
22:25 <alard> Is html in the description OK? (There are some with numbered lists.)
22:43 <SketchCow> Yes, it's up to me to deal.
23:03 <alard> SketchCow: Upload finished (see bbc-monographs on batcave). Would you like the research reports (from 1950 - now) too? If so, I'll add those tomorrow.
23:11 <dashcloud> the idea of MAME/MESS in javascript may not be such a crazy idea: there was very recently a javascript h264 decoder demoed
23:14 <dashcloud> so I finally got the opportunity to find out how much a gigabyte connection would cost, thanks to the sales guy who called me today
23:16 <dashcloud> 8k or so a month
23:16 <SketchCow> alard: Sure!
23:16 <SketchCow> I am not proposing a crazy idea.