#archiveteam 2011-10-28,Fri


Time Nickname Message
00:01 🔗 bsmith093 how do i get a bunch of text files to get all their paragraphs on one line, meaning just p-breaks, no line breaks? 96 files all at once is preferred
00:02 🔗 arrith bsmith093: what os / distro?
00:02 🔗 bsmith093 ubuntu lucid
00:02 🔗 arrith are the files plaintext? or html/xml/etc?
00:02 🔗 SketchCow http://twitter.com/#!/textfiles/status/129707255141642240
00:03 🔗 bsmith093 arrith: plain text
00:04 🔗 bsmith093 SketchCow: who's your friend?
00:04 🔗 SketchCow Aaron Swartz
00:05 🔗 bsmith093 of Demand Progress, the PAC?
00:05 🔗 bsmith093 ps google rocks :)
00:06 🔗 arrith bsmith093: do you know if they're dos (crlf) or unix (lf)? and if they have multiple linebreaks between paragraphs consistently?
00:07 🔗 arrith sed or awk or perl btw would all probably work. exact command depends on how the files are structured.
00:07 🔗 bsmith093 ummm, not sure, some were originally pdb files, converted to txt, with several doc files converted with unoconv
00:07 🔗 bsmith093 how do i find out
00:08 🔗 arrith bsmith093: in a terminal do file textfile.txt
00:08 🔗 arrith should say something like robots.txt: ASCII text, with CRLF line terminators
00:11 🔗 bsmith093 all of them say and i quote "UTF-8 Unicode English text, with very long lines"
00:12 🔗 bsmith093 and they mostly are but some parts of the text are annoyingly skinny columns, and id like to batch fix that
00:24 🔗 arrith hmm
00:24 🔗 bsmith093 any ideas?
00:25 🔗 bsmith093 the closest thing ive found is something with vim, but that's like Greek to me, and it's only one file at a time
00:26 🔗 Coderjoe yes, fuck you SFO
00:26 🔗 arrith bsmith093: what's the thing in vim?
00:26 🔗 Coderjoe also, what was that that needed 4 phases?
00:26 🔗 arrith bsmith093: i was just looking for a thing to find if any files had crlf
00:26 🔗 arrith http://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-under-l
00:27 🔗 arrith this seems to do it: grep -IUrl --color '^M' .
00:27 🔗 arrith the ^M is a literal "ctrl-M" (carriage return) in the terminal
00:29 🔗 arrith i'd convert any CRLFs to LF then separate them into batches depending on how the paragraphs and sentences are separated
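(A minimal sketch of the CRLF-to-LF conversion arrith describes, assuming bash and GNU grep/sed; the demo filename is made up:)

```shell
# Create a sample DOS-style file just for demonstration.
printf 'dos line one\r\ndos line two\r\n' > demo.txt

for f in *.txt; do
    # grep -I skips binaries, -U keeps the CR byte intact so it can be matched
    if grep -IUq $'\r' "$f"; then
        sed -i 's/\r$//' "$f"    # strip the trailing CR from every line, in place
    fi
done

file demo.txt    # should now report plain text without CRLF terminators
```

(Running `dos2unix` on the files, where installed, does the same conversion.)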
00:49 🔗 arrith learning awk will make all of this way easier
01:35 🔗 bsmith093 arrith: ok i found something that looks vaguely probable for removing line breaks "awk 'BEGIN{}{printf "%d, ", NR}END{printf "\n"}' filename" now how do i use this, and can it do many files at once
01:37 🔗 arrith bsmith093: if you just want to remove lf linebreaks you can do tr -d '\n' < in-file.txt > out-file.txt
01:38 🔗 arrith i don't really know awk so that might be doing something like replacing two newlines with one
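(For the original goal of joining wrapped lines while keeping paragraph breaks, awk's "paragraph mode" is a common sketch. This assumes paragraphs are separated by blank lines; filenames are made up:)

```shell
# Sample file: two paragraphs, each hard-wrapped across lines.
printf 'first para\nstill first\n\nsecond para\n' > demo.txt

for f in *.txt; do
    awk 'BEGIN { RS = ""; ORS = "\n\n" }  # RS="" is paragraph mode: records are blank-line-separated
         { gsub(/\n/, " "); print }       # join the wrapped lines inside each paragraph
    ' "$f" > "${f%.txt}.out"
done

cat demo.out    # each paragraph is now one long line
```

(Unlike `tr -d '\n'`, which flattens the whole file into a single line, this keeps one blank line between paragraphs, and the loop handles all 96 files in one pass.)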
01:40 🔗 bsmith093 i just get a prompt arrow like its waiting for something
01:42 🔗 arrith bsmith093: are you able to pastebin an example file?
01:42 🔗 bsmith093 example of what im trying to fix? yes
01:42 🔗 arrith that would help
01:45 🔗 bsmith093 here http://pastebin.com/YuaErAjh
01:45 🔗 bsmith093 notice the first 50 lines compared to the rest
01:45 🔗 bsmith093 i have hundreds like this, and its really annoying
01:46 🔗 arrith hmm so unwanted linebreaks in some places
01:46 🔗 bsmith093 its not just that little chunk either, otherwise i'd quit whining and just fix it manually, but it randomly happens throughout the file
01:46 🔗 bsmith093 and others
01:47 🔗 arrith well to do it in a non-ai automated manner you have to find a pattern for where the issue is. like remove newlines until a period is found, then skip two newlines and repeat
01:48 🔗 arrith but that won't ensure paragraphs are grouped properly
01:48 🔗 arrith just sentences
01:49 🔗 arrith hm
01:49 🔗 arrith bsmith093: did you edit stuff in that pastebin manually or is that just what the text is like?
01:50 🔗 bsmith093 wouldn't it be easier to just remove all line break chars, but leave paragraphs alone?
01:50 🔗 bsmith093 nope, that's the exact file as i have it, unedited
01:50 🔗 arrith well with text files there isn't really a difference between a line at the end of a paragraph and a blank line
01:51 🔗 arrith you can open up a file like that in a hex editor to see
01:51 🔗 arrith 0A is LF and 0D is CR
01:51 🔗 arrith a newline at the end of a paragraph*
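(No hex editor needed for a quick look; `od -c` prints the line-terminator bytes directly, `\r \n` for CRLF and a lone `\n` for Unix endings:)

```shell
# Dump the bytes of a two-line sample: one DOS line, one Unix line.
# 0d is CR (\r) and 0a is LF (\n), as noted above.
printf 'dos\r\nunix\n' | od -c
```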
01:56 🔗 arrith could do a thing in awk where a line has to begin with either a capital letter or quote, otherwise it joins lines
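(A sketch of that heuristic: buffer lines and only start a new output line when the input line begins with a capital letter or a quote. The sample text is made up, and real books would need more rules, e.g. for blank lines and dialogue:)

```shell
# Sample: three wrapped lines; the middle one continues the first.
printf 'The quick brown\nfox jumps over\nAnother sentence here.\n' > demo.txt

awk '
    /^[A-Z"'\'']/ { if (buf != "") print buf; buf = $0; next }  # looks like a new paragraph start
                  { buf = buf " " $0 }                          # otherwise join onto the buffer
    END           { if (buf != "") print buf }
' demo.txt
```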
02:02 🔗 bsmith093 well i originally converted them from a mix of pdb and doc files, so it might just be some weirdness there
02:04 🔗 arrith yeah. there might be a better conversion program out there
02:40 🔗 underscor THAT AWESOME FEELING YOU GET WHEN YOU FIND 4TB OF FRIENDSTER FOR SketchCow
02:40 🔗 bbot_ tab complete doesn't respect capslock, apparently
02:47 🔗 underscor Is someone rtmpdumping the conference?
04:33 🔗 Coderjoe underscor: conference?
04:34 🔗 underscor Books in Browsers
04:34 🔗 underscor @ IA
04:34 🔗 Coderjoe if I had known someone wanted something dumped, I might have done it
04:34 🔗 underscor I'm sure they're gonna release it anyway
04:34 🔗 underscor I was just curious
04:35 🔗 Coderjoe unfortunately, I was trying to grab something else, and ustream's ppv live setup still baffles me, so I had to TRY to screenrecord it. I am pretty sure the recording failed.
04:35 🔗 Coderjoe (i need to hack up camstudio to use opendml or libav or simething. it currently has a 4gb filesize limit)
04:36 🔗 Coderjoe in other news, that wget process for woxy is now up to 9.5G of memory and still chugging
04:36 🔗 Coderjoe (the instance still has 7G free)
04:38 🔗 underscor Big instance?
04:38 🔗 Coderjoe high-memory xlarge instance
04:38 🔗 Coderjoe as a spot instance. currently running about $6/day
04:40 🔗 Coderjoe I pushed the broken woxy fetch out to s3 and deleted the 100GB ebs volume those files were on.
04:41 🔗 Coderjoe (paying s3 prices on 17GB of data is better than paying ebs prices on a 100GB volume)
04:44 🔗 Coderjoe so that other wget would have failed from ram issues if it hadn't failed trying to write the warc file at some point
04:45 🔗 Coderjoe (it was a 32bit instance, so the absolute max wget could have in the process is 4g)
06:30 🔗 arrith closure: have you heard back from the BerliOS people?
09:50 🔗 Coderjoe well that's awesome...
09:50 🔗 Coderjoe livejournal's friends page can only go back 20 posts. once you get to the second page (skip=20), the page is blank
09:53 🔗 ersi "It's not a bug, it's a feature!"
11:30 🔗 ersi http://www.jwz.org/blog/2011/10/the-internet-archive/ :)
11:33 🔗 phik interesting stuff
15:28 🔗 Coderjoe and now that wget is at 12.3G
15:44 🔗 alard Coderjoe: Still doing woxy.com? I've got a 16GB heritrix dump if you want. :)
15:47 🔗 SketchCow http://www.jwz.org/blog/2011/10/the-internet-archive
16:08 🔗 closure I have not heard back from BerliOS admins (someone asked)
16:09 🔗 closure we should get all the data we ripped into one place (before people lose it)
16:15 🔗 SketchCow I have tons of space on batcave right now.
16:18 🔗 closure ok, I think we had about 3 tb of data
16:18 🔗 closure thing is, we will want to run one more rsync pass later, probably, to get the final updates to projects etc.
16:19 🔗 closure since we're rsyncing everything from berlios it will be pretty fast to run -- could it be run on batcave?
16:50 🔗 SketchCow Yes
16:50 🔗 SketchCow Just throw that up there.
16:52 🔗 closure ok, sweet..
16:53 🔗 closure oh, it's only 300 gb anyway
16:54 🔗 closure balrog alard dashcloud yipdw underscor wyatt Coderjoe ersi: time to upload your Berlios stuff to batcave
16:54 🔗 DFJustin lol this channel, "oh, it's only 300 gb"
16:54 🔗 yipdw closure: sure thing
16:55 🔗 yipdw closure: are connection details further up in the channel?
16:55 🔗 closure I think SketchCow has to set you up with an account
16:55 🔗 yipdw ok
16:56 🔗 yipdw SketchCow: whenever you've got time, send me upload info for batcave
16:56 🔗 * closure too
17:40 🔗 SketchCow On it
17:40 🔗 * SketchCow blew in a pile of Roland items.
17:50 🔗 SketchCow Please rsync into a directory called berlios
17:51 🔗 closure SketchCow: one thing before it starts pouring in.. we are not tarring the stuff up, because we want to run rsync again. so expect lots of loose files
17:51 🔗 SketchCow Currently, we're at 17tb of free disk space
17:51 🔗 SketchCow We can handle
18:12 🔗 SketchCow -----
18:12 🔗 SketchCow From a local:
18:12 🔗 SketchCow Hey, you might be just the man for this: Do you know of a tool that takes a corrupted gzip and extracts useful stuff from it?
18:13 🔗 SketchCow ....do we know of any?
18:13 🔗 SketchCow -----
19:13 🔗 closure not unless it's a rare one made with gzip --rsyncable
19:14 🔗 SketchCow http://www.gzip.org/recover.txt
19:14 🔗 SketchCow I gave him that
19:19 🔗 closure SketchCow: upload in progress to batcave.. some of it is behind a slow link, I estimate 15 days to complete
19:32 🔗 underscor SketchCow: iirc Coderjoe wrote something that at least tells you what's wrong with it
19:32 🔗 underscor closure: All mine already is
19:32 🔗 closure underscor: update wiki?
20:33 🔗 SketchCow No issues, closure
20:33 🔗 alard closure: My berlios chunks are already on batcave. (I started uploading a bit earlier.)
20:35 🔗 alard As is my copy of woxy.com, by the way.
22:02 🔗 SketchCow ---------------------------
22:02 🔗 SketchCow Whoever wants it - http://www.bbc.co.uk/rd/publications/bbc_monograph_39.shtml
22:02 🔗 SketchCow Just looking for the monographs to be downloaded, plus a .txt file of description where they have one.
22:02 🔗 SketchCow ---------------------------
22:02 🔗 SketchCow I made the official call for a Javascript port of MESS/MAME
22:13 🔗 Cowering SketchCow, MESS/MAME already have DRCs for certain things.. talk someone into making a DRC to javascript 'CPU'
22:13 🔗 SketchCow Aware.
22:14 🔗 Cowering but, since quite a few systems won't even emulate on a native i7 3.5 GHz at full speed, javascript might still be pushing it a little :)
22:15 🔗 SketchCow Aware.
22:15 🔗 alard SketchCow: The monographs are currently uploading to batcave.
22:15 🔗 SketchCow alard: Thanks, man
22:16 🔗 alard There's a text file for each pdf with line 1: title, line 2: authors, line 3+: description. Is that what you need?
22:16 🔗 SketchCow Works great for me.
22:17 🔗 alard Actually, only the first one has a description.
22:18 🔗 alard (And it's not even about the document itself, but about the series.)
22:21 🔗 alard SketchCow: Which documents do you actually want? All of them, or just the monographs 1955 to 1969 that you linked to?
22:21 🔗 SketchCow Just the monographs at the moment.
22:21 🔗 alard Not 'Engineering 1970 to 1970'?
22:22 🔗 SketchCow Looks like just one.
22:22 🔗 SketchCow If you want to grab all of them, I'll snap them all up.
22:23 🔗 alard I'll get them all. The numbering continues: 80 is the last monograph, from 81 to 115 it's BBC Engineering.
22:23 🔗 alard But not linked anywhere, it seems.
22:23 🔗 SketchCow Do it if you can.
22:24 🔗 alard Is html in the description OK? (There are some with numbered lists.)
22:25 🔗 SketchCow Yes, it's up to me to deal.
22:43 🔗 alard SketchCow: Upload finished (see bbc-monographs on batcave). Would you like the research reports (from 1950 - now) too? If so, I'll add those tomorrow.
23:03 🔗 dashcloud the idea of MAME/MESS in javascript may not be such a crazy idea- there was very recently a javascript h264 decoder demoed
23:11 🔗 dashcloud so I finally got the opportunity to find out how much a gigabyte connection would cost thanks to the sales guy who called me today
23:14 🔗 dashcloud 8k or so a month
23:16 🔗 SketchCow alard: Sure!
23:16 🔗 SketchCow I am not proposing a crazy idea.
