The ‘raw’ version of the Brown corpus was constructed by starting from the
tagged version of the corpus available from
http://archive.org/details/BrownCorpus and applying a series of automatic and
manual transformations. Paragraph breaks from the tagged corpus were
preserved. The Bergen Format I version of the data was used to inform the
transformation decisions.

First, the automatic changes (a rough sketch of these rewrites follows the
loop below):
 * normalises spacing
 * drops the tags
 * treats sentences tagged as headlines as one-line paragraphs
 * removes extra spaces around punctuation, where this can be done automatically
 * maps LaTeX quotes back to double straight quotes, where this can be done
 	automatically
 * removes the double punctuation that wasn't in the raw text

# for each tagged file, work out its genre prefix (e.g. ca01 -> ca), create the
# matching directory under ../cooked, and run the normalisation script over it
for x in c*[0-9]; do
	base=`echo $x|perl -pe 's/(..)\d+/$1/;'`;
	if [ ! -d ../cooked/$base ]; then
		mkdir ../cooked/$base;
	fi;
	cat $x|../scripts/normalisetagged.pl > ../cooked/$base/$x;
done
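
The normalisation itself is done by ../scripts/normalisetagged.pl, which is not
reproduced here. Purely as an illustration, a minimal Perl sketch of the kinds
of rewrites listed above could look like the following; the word/TAG token
format, the exact quote mapping and the punctuation handling are assumptions,
and the headline and double-punctuation steps are omitted:

#!/usr/bin/perl
# Rough sketch only, not the actual normalisetagged.pl.
use strict;
use warnings;

while (my $line = <STDIN>) {
	$line =~ s/\s+/ /g;             # normalise spacing (blank lines pass through)
	$line =~ s/^ //; $line =~ s/ $//;
	$line =~ s{(\S+)/\S+}{$1}g;     # drop the /TAG part of each token
	$line =~ s/``|''/"/g;           # map LaTeX quotes to double straight quotes
	$line =~ s/ ([.,;:?!])/$1/g;    # remove extra spaces before punctuation
	print "$line\n";
}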

Text matching the following patterns was manually corrected according to the
Bergen Format I version (an example search command is given after the list):
/^['"] /
/ ['"]$/
/ ['"] /
/"' /
/ '[",?!;]/
/''/
/``/
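
The lines to inspect can be pulled out with plain grep; the invocation below is
only illustrative (the flags and the cooked/*/* path are assumptions, chosen to
match the later commands):

# list every line matching one of the patterns above, with file and line number
for p in "^['\"] " " ['\"]\$" " ['\"] " "\"' " " '[\",?!;]" "''" '``'; do
	grep -n "$p" cooked/*/*;
done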

Other errors were corrected opportunistically if they came up while searching,
but no other systematic corrections were made.

To create the unsegmented.txt file, used in the segmentation experiments:

# collapse runs of spaces, join every line into one long line, then turn the
# double spaces left by the original blank lines back into paragraph breaks
cat cooked/*/* |perl -pe 's/ +/ /g;'|perl -pe 's/\n/ /;'|\
	perl -pe 's/  +/\n\n/g' > unsegmented.txt

And the segmented.txt, used for evaluation:

# drop the blank paragraph-break lines, keeping one sentence per line
cat cooked/*/* |grep -v "^$" > segmented.txt
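
A possible sanity check, not part of the original procedure: the two files
should contain the same tokens and differ only in line breaks.

# word counts should agree
wc -w unsegmented.txt segmented.txt

# stricter check: compare the two token streams (bash process substitution)
diff <(tr -s ' ' '\n' < unsegmented.txt) <(tr -s ' ' '\n' < segmented.txt)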