[00:01] how do i get a bunch of text files to get all their paragraphs on one line, meaning just p-breaks, no line breaks? 96 files all at once is preferred
[00:02] bsmith093: what os / distro?
[00:02] ubuntu lucid
[00:02] are the files plaintext? or html/xml/etc?
[00:02] http://twitter.com/#!/textfiles/status/129707255141642240
[00:03] arrith: plain text
[00:04] SketchCow: who's your friend?
[00:04] Aaron Swartz
[00:05] of Demand Progress, the PAC?
[00:05] ps google rocks :)
[00:06] bsmith093: do you know if they're dos (crlf) or unix (lf)? and if they have multiple linebreaks between paragraphs consistently?
[00:07] sed or awk or perl btw would all probably work. exact command depends on how the files are structured.
[00:07] ummm, not sure, some were originally pdb files converted to txt, with several doc files converted with unoconv
[00:07] how do i find out
[00:08] bsmith093: in a terminal do: file textfile.txt
[00:08] should say something like: robots.txt: ASCII text, with CRLF line terminators
[00:11] all of them say, and i quote, "UTF-8 Unicode English text, with very long lines"
[00:12] and they mostly are, but some parts of the text are annoyingly skinny columns, and i'd like to batch fix that
[00:24] hmm
[00:24] any ideas?
[00:25] the closest thing i've found is something with vim, but that's like greek to me, and it's only one file at a time
[00:26] yes, fuck you SFO
[00:26] bsmith093: what's the thing in vim?
[00:26] also, what was that that needed 4 phases?
[00:26] bsmith093: i was just looking for a thing to find if any files had crlf
[00:26] http://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-under-l
[00:27] this seems to do it: grep -IUrl --color '^M' .
[00:27] the ^M is "ctrl-M" in the terminal
[00:29] i'd convert any CRLFs to LF, then separate them into batches depending on how the paragraphs and sentences are separated
[00:49] learning awk will make all of this way easier
[01:35] arrith: ok i found something that looks vaguely probable for removing line breaks "awk 'BEGIN{}{printf "%d, ", NR}END{printf "\n"}' filename" now how do i use this, and can it do many files at once
[01:37] bsmith093: if you just want to remove lf linebreaks you can do: tr -d '\n' < in-file.txt > out-file.txt
[01:38] i don't really know awk so that might be doing something like replacing two newlines with one
[01:40] i just get a prompt arrow like it's waiting for something
[01:42] bsmith093: are you able to pastebin an example file?
[01:42] example of what i'm trying to fix? yes
[01:42] that would help
[01:45] here http://pastebin.com/YuaErAjh
[01:45] notice the first 50 lines compared to the rest
[01:45] i have hundreds like this, and it's really annoying
[01:46] hmm so unwanted linebreaks in some places
[01:46] it's not just that little chunk either, otherwise i'd quit whining and just fix it manually, but it randomly happens throughout the file
[01:46] and others
[01:47] well to do it in a non-ai automated manner you have to find a pattern for where the issue is. like remove newlines until a period is found, then skip two newlines and repeat
[01:48] but that won't ensure paragraphs are grouped properly
[01:48] just sentences
[01:49] hm
[01:49] bsmith093: did you edit stuff in that pastebin manually or is that just what the text is like?
[01:50] wouldn't it be easier to just remove all line break chars, but leave paragraphs alone?
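Aside on the awk one-liner quoted at [01:35]: it only prints a comma-separated list of line numbers and never touches the text, so it won't join anything. A minimal sketch of what is actually being asked for, assuming the paragraphs are separated by blank lines and the files already use plain LF endings; the *.txt glob and the .oneline.txt output names are placeholders, not anything from the channel:

    # join each blank-line-separated paragraph onto a single line, for every .txt file
    for f in *.txt; do
        awk 'BEGIN { RS=""; ORS="\n\n" } { gsub(/\n/, " "); print }' "$f" > "${f%.txt}.oneline.txt"
    done

If any of the files turn out to have CRLF endings after all, running tr -d '\r' on them first keeps the blank-line paragraph detection from breaking.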
[01:50] nope, that's the exact file as i have it, unedited
[01:50] well with text files there isn't really a difference between a line at the end of a paragraph and a blank line
[01:51] you can open up a file like that in a hex editor to see
[01:51] 0A is LF and 0D is CR
[01:51] a newline at the end of a paragraph*
[01:56] could do a thing in awk where a line has to begin with either a capital letter or a quote, otherwise it joins lines
[02:02] well i originally converted them from a mix of pdb and doc files, so it might just be some weirdness there
[02:04] yeah. there might be a better conversion program out there
[02:40] THAT AWESOME FEELING YOU GET WHEN YOU FIND 4TB OF FRIENDSTER FOR SketchCow
[02:40] tab complete doesn't respect capslock, apparently
[02:47] Is someone rtmpdumping the conference?
[04:33] underscor: conference?
[04:34] Books in Browsers
[04:34] @ IA
[04:34] if I had known someone wanted something dumped, I might have done it
[04:34] I'm sure they're gonna release it anyway
[04:34] I was just curious
[04:35] unfortunately, I was trying to grab something else, and ustream's ppv live setup still baffles me, so I had to TRY to screenrecord it. I am pretty sure the recording failed.
[04:35] (i need to hack up camstudio to use opendml or libav or something. it currently has a 4gb filesize limit)
[04:36] in other news, that wget process for woxy is now up to 9.5G of memory and still chugging
[04:36] (the instance still has 7G free)
[04:38] Big instance?
[04:38] high-memory xlarge instance
[04:38] as a spot instance. currently running about $6/day
[04:40] I pushed the broken woxy fetch out to s3 and deleted the 100GB ebs volume those files were on.
[04:41] (paying s3 prices on 17GB of data is better than paying ebs prices on a 100GB volume)
[04:44] so that other wget would have failed from ram issues if it hadn't failed trying to write the warc file at some point
[04:45] (it was a 32bit instance, so the absolute max wget could have in the process is 4g)
[06:30] closure: have you heard back from the BerliOS people?
[09:50] well that's awesome...
[09:50] livejournal's friends page can only go back 20 posts. once you get to the second page (skip=20), the page is blank
[09:53] "It's not a bug, it's a feature!"
[11:30] http://www.jwz.org/blog/2011/10/the-internet-archive/ :)
[11:33] interesting stuff
[15:28] and now that wget is at 12.3G
[15:44] Coderjoe: Still doing woxy.com? I've got a 16GB heritrix dump if you want. :)
[15:47] http://www.jwz.org/blog/2011/10/the-internet-archive
[16:08] I have not heard back from BerliOS admins (someone asked)
[16:09] we should get all the data we ripped into one place (before people lose it)
[16:15] I have tons of space on batcave right now.
[16:18] ok, I think we had about 3 tb of data
[16:18] thing is, we will want to run one more rsync pass later, probably, to get the final updates to projects etc.
[16:19] since we're rsyncing everything from berlios it will be pretty fast to run -- could it be run on batcave?
[16:50] Yes
[16:50] Just throw that up there.
[16:52] ok, sweet..
[16:53] oh, it's only 300 gb anyway
[16:54] balrog alard dashcloud yipdw underscor wyatt Coderjoe ersi: time to upload your Berlios stuff to batcave
[16:54] lol this channel, "oh, it's only 300 gb"
[16:54] closure: sure thing
[16:55] closure: are connection details further up in the channel?
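A rough sketch of the awk heuristic floated at [01:56] (start a new output line only when a line begins with a capital letter or a quote, otherwise glue it to the previous one); in.txt and out.txt are placeholders, and the heuristic will still misgroup paragraphs that happen to start with a lowercase letter:

    # join wrapped lines: a new output line starts at a capital letter, a quote,
    # or after a blank line; everything else is appended to the previous line
    awk '
        /^[[:upper:]"]/  { if (buf != "") print buf; buf = $0; next }
        /^[[:space:]]*$/ { if (buf != "") print buf; print ""; buf = ""; next }
                         { buf = (buf == "" ? $0 : buf " " $0) }
        END              { if (buf != "") print buf }
    ' in.txt > out.txt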
[16:55] I think SketchCow has to set you up with an account
[16:55] ok
[16:56] SketchCow: whenever you've got time, send me upload info for batcave
[16:56] * closure too
[17:40] On it
[17:40] * SketchCow blew in a pile of Roland items.
[17:50] Please rsync into a directory called berlios
[17:51] SketchCow: one thing before it starts pouring in.. we are not tarring the stuff up, because we want to run rsync again. so expect lots of loose files
[17:51] Currently, we're at 17tb of free disk space
[17:51] We can handle it
[18:12] -----
[18:12] From a local:
[18:12] Hey, you might be just the man for this: Do you know of a tool that takes a corrupted gzip and extracts useful stuff from it?
[18:13] ....do we know of any?
[18:13] -----
[19:13] not unless it's a rare one made with gzip --rsyncable
[19:14] http://www.gzip.org/recover.txt
[19:14] I gave him that
[19:19] SketchCow: upload in progress to batcave.. some of it's behind a slow link, I estimate 15 days to complete
[19:32] SketchCow: iirc Coderjoe wrote something that at least tells you what's wrong with it
[19:32] closure: All mine already is
[19:32] underscor: update wiki?
[20:33] No issues, closure
[20:33] closure: My berlios chunks are already on batcave. (I started uploading a bit earlier.)
[20:35] As is my copy of woxy.com, by the way.
[22:02] ---------------------------
[22:02] Whoever wants it - http://www.bbc.co.uk/rd/publications/bbc_monograph_39.shtml
[22:02] Just looking for the monographs to be downloaded, plus a .txt file of description where they have one.
[22:02] ---------------------------
[22:02] I made the official call for a Javascript port of MESS/MAME
[22:13] SketchCow, MESS/MAME already have DRCs for certain things.. talk someone into making a DRC-to-javascript 'CPU'
[22:13] Aware.
[22:14] but, since quite a few systems won't even emulate on a native i7 3.5 GHz at full speed, javascript might still be pushing it a little :)
[22:15] Aware.
[22:15] SketchCow: The monographs are currently uploading to batcave.
[22:15] alard: Thanks, man
[22:16] There's a text file for each pdf with line 1: title, line 2: authors, line 3+: description. Is that what you need?
[22:16] Works great for me.
[22:17] Actually, only the first one has a description.
[22:18] (And it's not even about the document itself, but about the series.)
[22:21] SketchCow: Which documents do you actually want? All of them, or just the monographs 1955 to 1969 that you linked to?
[22:21] Just the monographs at the moment.
[22:21] Not 'Engineering 1970 to 1970'?
[22:22] Looks like just one.
[22:22] If you want to grab all of them, I'll snap them all up.
[22:23] I'll get them all. The numbering continues: 80 is the last monograph, from 81 to 115 it's BBC Engineering.
[22:23] But not linked anywhere, it seems.
[22:23] Do it if you can.
[22:24] Is html in the description OK? (There are some with numbered lists.)
[22:25] Yes, it's up to me to deal.
[22:43] SketchCow: Upload finished (see bbc-monographs on batcave). Would you like the research reports (from 1950 - now) too? If so, I'll add those tomorrow.
[23:03] the idea of MAME/MESS in javascript may not be such a crazy idea; there was very recently a javascript h264 decoder demoed
[23:11] so I finally got the opportunity to find out how much a gigabit connection would cost thanks to the sales guy who called me today
[23:14] 8k or so a month
[23:16] alard: Sure!
[23:16] I am not proposing a crazy idea.
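A hedged sketch of what the [17:50] upload looks like from the contributor side; the host name, user name, and local directory below are placeholders, not the real batcave connection details (those are sent out privately by SketchCow):

    # push a local berlios grab into a "berlios" directory on batcave;
    # host, user, and local path are made up for illustration
    rsync -avP ./berlios-rips/ someuser@batcave.example.net:berlios/

Because the files are left loose rather than tarred up (per [17:51]), re-running the same command for the final pass only transfers whatever changed since the first upload.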