00:01 <bsmith093> how do i get a bunch of text files to have all their paragraphs on one line, meaning just paragraph breaks, no line breaks? 96 files all at once is preferred
00:02 <arrith> bsmith093: what os / distro?
00:02 <bsmith093> ubuntu lucid
00:02 <arrith> are the files plaintext? or html/xml/etc?
00:02 <SketchCow> http://twitter.com/#!/textfiles/status/129707255141642240
00:03 <bsmith093> arrith: plain text
00:04 <bsmith093> SketchCow: who's your friend?
00:04 <SketchCow> Aaron Swartz
00:05 <bsmith093> of Demand Progress, the PAC?
00:05 <bsmith093> ps google rocks :)
00:06 <arrith> bsmith093: do you know if they're dos (crlf) or unix (lf)? and if they have multiple linebreaks between paragraphs consistently?
00:07 <arrith> sed or awk or perl btw would all probably work. the exact command depends on how the files are structured.
00:07 <bsmith093> ummm, not sure, some were originally pdb files converted to txt, with several doc files converted with unoconv
00:07 <bsmith093> how do i find out?
00:08 <arrith> bsmith093: in a terminal, do: file textfile.txt
00:08 <arrith> it should say something like: robots.txt: ASCII text, with CRLF line terminators
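That check scales to all 96 files in one go, since file(1) accepts a glob. A minimal sketch (the two sample files are made up here for demonstration):

```shell
# Create one DOS-style and one Unix-style sample, then let file(1)
# report the line terminators for every .txt at once.
printf 'dos line one\r\ndos line two\r\n' > dos-sample.txt
printf 'unix line one\nunix line two\n' > unix-sample.txt
file *.txt
```

Files using CRLF are called out explicitly ("with CRLF line terminators"); plain LF files just report as text.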
00:11 <bsmith093> all of them say, and i quote, "UTF-8 Unicode English text, with very long lines"
00:12 <bsmith093> and they mostly are, but some parts of the text are annoyingly skinny columns, and i'd like to batch fix that
00:24 <arrith> hmm
00:24 <bsmith093> any ideas?
00:25 <bsmith093> the closest thing i've found is something with vim, but that's like greek to me, and it's only one file at a time
00:26 <Coderjoe> yes, fuck you SFO
00:26 <arrith> bsmith093: what's the thing in vim?
00:26 <Coderjoe> also, what was that that needed 4 phases?
00:26 <arrith> bsmith093: i was just looking for a thing to find if any files had crlf
00:26 <arrith> http://stackoverflow.com/questions/73833/how-do-you-search-for-files-containing-dos-line-endings-crlf-with-grep-under-l
00:27 <arrith> this seems to do it: grep -IUrl --color '^M' .
00:27 <arrith> the ^M is a literal ctrl-M in the terminal (type ctrl-V, then ctrl-M)
00:29 <arrith> i'd convert any CRLFs to LF, then separate the files into batches depending on how the paragraphs and sentences are separated
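The first step of that plan can be sketched with GNU sed (present on Ubuntu); the sample filename is invented, and the edit is in place, so back up anything irreplaceable first:

```shell
# Normalize CRLF to LF by deleting the trailing carriage return on
# each line, for every .txt file in the current directory.
printf 'para one\r\n\r\npara two\r\n' > crlf-sample.txt   # demo input
sed -i 's/\r$//' *.txt
```

After this, every file is plain LF, which makes the later paragraph-joining steps behave consistently.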
00:49 <arrith> learning awk will make all of this way easier
01:35 <bsmith093> arrith: ok, i found something that looks vaguely probable for removing line breaks: awk 'BEGIN{}{printf "%d, ", NR}END{printf "\n"}' filename. now how do i use this, and can it do many files at once?
01:37 <arrith> bsmith093: if you just want to remove lf linebreaks you can do: tr -d '\n' < in-file.txt > out-file.txt
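Note that tr -d '\n' strips every newline, paragraph breaks included. What the original question asks for (join the lines inside each paragraph, keep the blank lines between paragraphs) is awk's paragraph mode; a sketch with a made-up sample file:

```shell
# awk "paragraph mode": RS="" makes each blank-line-separated block a
# single record; gsub() then folds its internal newlines into spaces.
printf 'first paragraph\nstill first.\n\nsecond\nparagraph here.\n' > wrapped.txt
awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }' wrapped.txt > joined.txt
cat joined.txt
```

Doing all 96 files at once is then just a loop: for f in *.txt; do awk 'BEGIN{RS="";ORS="\n\n"}{gsub(/\n/," ");print}' "$f" > "$f.joined"; done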
01:38 <arrith> i don't really know awk, so that might be doing something like replacing two newlines with one
01:40 <bsmith093> i just get a prompt arrow, like it's waiting for something
01:42 <arrith> bsmith093: are you able to pastebin an example file?
01:42 <bsmith093> an example of what i'm trying to fix? yes
01:42 <arrith> that would help
01:45 <bsmith093> here: http://pastebin.com/YuaErAjh
01:45 <bsmith093> notice the first 50 lines compared to the rest
01:45 <bsmith093> i have hundreds like this, and it's really annoying
01:46 <arrith> hmm, so unwanted linebreaks in some places
01:46 <bsmith093> it's not just that little chunk either, otherwise i'd quit whining and just fix it manually, but it randomly happens throughout the file
01:46 <bsmith093> and others
01:47 <arrith> well, to do it in a non-ai automated manner you have to find a pattern for where the issue is. like: remove newlines until a period is found, then skip two newlines and repeat
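That remove-newlines-until-a-sentence-ends heuristic can be sketched in awk (the sample text is invented; real prose with abbreviations like "Mr." would need more care):

```shell
# Accumulate lines into buf; when a line ends with ., ! or ?, treat
# the sentence run as complete and emit it as one output line.
printf 'The quick\nbrown fox.\nJumped over\nthe dog.\n' > narrow.txt
awk '{ buf = (buf == "" ? $0 : buf " " $0) }
     /[.!?]$/ { print buf; buf = "" }
     END { if (buf != "") print buf }' narrow.txt
```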
01:48 <arrith> but that won't ensure paragraphs are grouped properly
01:48 <arrith> just sentences
01:49 <arrith> hm
01:49 <arrith> bsmith093: did you edit stuff in that pastebin manually, or is that just what the text is like?
01:50 <bsmith093> wouldn't it be easier to just remove all line break chars, but leave paragraphs alone?
01:50 <bsmith093> nope, that's the exact file as i have it, unedited
01:50 <arrith> well, with text files there isn't really a difference between a newline at the end of a paragraph and a blank line
01:51 <arrith> you can open up a file like that in a hex editor to see
01:51 <arrith> 0A is LF and 0D is CR
01:56 <arrith> could do a thing in awk where a line has to begin with either a capital letter or a quote, otherwise it joins lines
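That capital-letter-or-quote rule as an awk sketch (sample text invented; it will mis-join paragraphs whose wrapped continuation lines happen to start with a capital, so treat it as a starting point, not a finished tool):

```shell
# Start a fresh output line only when the input line opens with a
# capital letter or a double quote; otherwise glue it onto the buffer.
printf 'He said\nhello there.\n"Fine," she\nreplied.\n' > cols.txt
awk 'NR == 1 { buf = $0; next }
     /^["A-Z]/ { print buf; buf = $0; next }
     { buf = buf " " $0 }
     END { print buf }' cols.txt
```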
02:02 <bsmith093> well, i originally converted them from a mix of pdb and doc files, so it might just be some weirdness there
02:04 <arrith> yeah. there might be a better conversion program out there
02:40 <underscor> THAT AWESOME FEELING YOU GET WHEN YOU FIND 4TB OF FRIENDSTER FOR SketchCow
02:40 <bbot_> tab complete doesn't respect capslock, apparently
02:47 <underscor> Is someone rtmpdumping the conference?
04:33 <Coderjoe> underscor: conference?
04:34 <underscor> Books in Browsers
04:34 <underscor> @ IA
04:34 <Coderjoe> if I had known someone wanted something dumped, I might have done it
04:34 <underscor> I'm sure they're gonna release it anyway
04:34 <underscor> I was just curious
04:35 <Coderjoe> unfortunately, I was trying to grab something else, and ustream's ppv live setup still baffles me, so I had to TRY to screenrecord it. I am pretty sure the recording failed.
04:35 <Coderjoe> (i need to hack up camstudio to use opendml or libav or something. it currently has a 4gb filesize limit)
04:36 <Coderjoe> in other news, that wget process for woxy is now up to 9.5G of memory and still chugging
04:36 <Coderjoe> (the instance still has 7G free)
04:38 <underscor> Big instance?
04:38 <Coderjoe> high-memory xlarge instance
04:38 <Coderjoe> as a spot instance. currently running about $6/day
04:40 <Coderjoe> I pushed the broken woxy fetch out to s3 and deleted the 100GB ebs volume those files were on.
04:41 <Coderjoe> (paying s3 prices on 17GB of data is better than paying ebs prices on a 100GB volume)
04:44 <Coderjoe> so that other wget would have failed from ram issues if it hadn't failed trying to write the warc file at some point
04:45 <Coderjoe> (it was a 32bit instance, so the absolute max wget could have in the process is 4g)
06:30 <arrith> closure: have you heard back from the BerliOS people?
09:50 <Coderjoe> well that's awesome...
09:50 <Coderjoe> livejournal's friends page can only go back 20 posts. once you get to the second page (skip=20), the page is blank
09:53 <ersi> "It's not a bug, it's a feature!"
11:30 <ersi> http://www.jwz.org/blog/2011/10/the-internet-archive/ :)
11:33 <phik> interesting stuff
15:28 <Coderjoe> and now that wget is at 12.3G
15:44 <alard> Coderjoe: Still doing woxy.com? I've got a 16GB heritrix dump if you want. :)
15:47 <SketchCow> http://www.jwz.org/blog/2011/10/the-internet-archive
16:08 <closure> I have not heard back from the BerliOS admins (someone asked)
16:09 <closure> we should get all the data we ripped into one place (before people lose it)
16:15 <SketchCow> I have tons of space on batcave right now.
16:18 <closure> ok, I think we had about 3 tb of data
16:18 <closure> thing is, we will want to run one more rsync pass later, probably, to get the final updates to projects etc.
16:19 <closure> since we're rsyncing everything from berlios it will be pretty fast to run -- could it be run on batcave?
16:50 <SketchCow> Yes
16:50 <SketchCow> Just throw that up there.
16:52 <closure> ok, sweet..
16:53 <closure> oh, it's only 300 gb anyway
16:54 <closure> balrog alard dashcloud yipdw underscor wyatt Coderjoe ersi: time to upload your BerliOS stuff to batcave
16:54 <DFJustin> lol this channel, "oh, it's only 300 gb"
16:54 <yipdw> closure: sure thing
16:55 <yipdw> closure: are connection details further up in the channel?
16:55 <closure> I think SketchCow has to set you up with an account
16:55 <yipdw> ok
16:56 <yipdw> SketchCow: whenever you've got time, send me upload info for batcave
16:56 * closure too
17:40 <SketchCow> On it
17:40 * SketchCow blew in a pile of Roland items.
17:50 <SketchCow> Please rsync into a directory called berlios
17:51 <closure> SketchCow: one thing before it starts pouring in.. we are not tarring the stuff up, because we want to run rsync again. so expect lots of loose files
17:51 <SketchCow> Currently, we're at 17tb of free disk space
17:51 <SketchCow> We can handle it
18:12 <SketchCow> -----
18:12 <SketchCow> From a local:
18:12 <SketchCow> Hey, you might be just the man for this: Do you know of a tool that takes a corrupted gzip and extracts useful stuff from it?
18:13 <SketchCow> ....do we know of any?
18:13 <SketchCow> -----
19:13 <closure> not unless it's a rare one made with gzip --rsyncable
19:14 <SketchCow> http://www.gzip.org/recover.txt
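One practical takeaway from that recover.txt note: gzip is a stream format, so plain decompression already emits everything it can decode before hitting the damage. A sketch with a deliberately damaged file (the damage here is trailing garbage; for corruption mid-stream, a dedicated tool such as gzrecover from the gzrt package can go further):

```shell
# Build a gzip file, damage it, and salvage what gzip can still read.
printf 'recoverable text\n' | gzip > good.gz
cat good.gz > damaged.gz
printf 'trailing garbage' >> damaged.gz    # simulate corruption
gzip -dc damaged.gz > salvaged.txt 2>/dev/null || true
cat salvaged.txt
```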
19:14 <SketchCow> I gave him that
19:19 <closure> SketchCow: upload in progress to batcave.. some of it's behind a slow link, I estimate 15 days to complete
19:32 <underscor> SketchCow: iirc Coderjoe wrote something that at least tells you what's wrong with it
19:32 <underscor> closure: All mine already is
19:32 <closure> underscor: update wiki?
20:33 <SketchCow> No issues, closure
20:33 <alard> closure: My berlios chunks are already on batcave. (I started uploading a bit earlier.)
20:35 <alard> As is my copy of woxy.com, by the way.
22:02 <SketchCow> ---------------------------
22:02 <SketchCow> Whoever wants it - http://www.bbc.co.uk/rd/publications/bbc_monograph_39.shtml
22:02 <SketchCow> Just looking for the monographs to be downloaded, plus a .txt file of description where they have one.
22:02 <SketchCow> ---------------------------
22:02 <SketchCow> I made the official call for a Javascript port of MESS/MAME
22:13 <Cowering> SketchCow, MESS/MAME already have DRCs for certain things.. talk someone into making a DRC to javascript 'CPU'
22:13 <SketchCow> Aware.
22:14 <Cowering> but, since quite a few systems won't even emulate on a native i7 3.5 GHz at full speed, javascript might still be pushing it a little :)
22:15 <SketchCow> Aware.
22:15 <alard> SketchCow: The monographs are currently uploading to batcave.
22:15 <SketchCow> alard: Thanks, man
22:16 <alard> There's a text file for each pdf with line 1: title, line 2: authors, line 3+: description. Is that what you need?
22:16 <SketchCow> Works great for me.
22:17 <alard> Actually, only the first one has a description.
22:18 <alard> (And it's not even about the document itself, but about the series.)
22:21 <alard> SketchCow: Which documents do you actually want? All of them, or just the monographs 1955 to 1969 that you linked to?
22:21 <SketchCow> Just the monographs at the moment.
22:21 <alard> Not 'Engineering 1970 to 1970'?
22:22 <SketchCow> Looks like just one.
22:22 <SketchCow> If you want to grab all of them, I'll snap them all up.
22:23 <alard> I'll get them all. The numbering continues: 80 is the last monograph, from 81 to 115 it's BBC Engineering.
22:23 <alard> But not linked anywhere, it seems.
22:24 <SketchCow> Do it if you can.
22:25 <alard> Is html in the description OK? (There are some with numbered lists.)
22:43 <SketchCow> Yes, it's up to me to deal.
23:03 <alard> SketchCow: Upload finished (see bbc-monographs on batcave). Would you like the research reports (from 1950 - now) too? If so, I'll add those tomorrow.
23:11 <dashcloud> the idea of MAME/MESS in javascript may not be such a crazy idea: there was very recently a javascript h264 decoder demoed
23:14 <dashcloud> so I finally got the opportunity to find out how much a gigabyte connection would cost, thanks to the sales guy who called me today
23:16 <dashcloud> 8k or so a month
23:16 <SketchCow> alard: Sure!
23:16 <SketchCow> I am not proposing a crazy idea.