The WeScience data used for the segmentation experiments reported in Read et al. (2012) was created from data produced by the WeScience project (http://moin.delph-in.net/WeScience). The original data retains some (normalised) markup that is considered relevant for parsing; we stripped this markup for these experiments:

  cat txt/ws0* txt/ws1{0,1,2,3} | scripts/makeWSgold.pl > segmented.txt

To produce the A and B versions of the unsegmented text, we re-ran the original WeScience preprocessing scripts (available in SVN), altered to retain paragraph breaks (resulting in pre-AB.txt). We then stripped markup as above, but added blank lines at paragraph breaks (version A), or at paragraph breaks and after blockquotes, headings and list items (version B):

  cat pre-AB.txt | scripts/makeA.pl > A/unsegmented.txt
  cat pre-AB.txt | scripts/makeB.pl > B/unsegmented.txt
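
If versions A and B differ only in where blank lines are inserted, as the description above suggests, a quick sanity check after regenerating the files is to compare them with blank lines removed. The following is a minimal sketch under that assumption. The function name same_modulo_blank_lines is hypothetical and is not part of the distributed scripts.

```shell
#!/bin/sh
# Hypothetical helper (not one of the original scripts): succeeds when the
# two files given as arguments are identical once blank lines are ignored.
same_modulo_blank_lines() {
    # Delete empty and whitespace-only lines, then compare what is left.
    a=$(sed '/^[[:space:]]*$/d' "$1")
    b=$(sed '/^[[:space:]]*$/d' "$2")
    [ "$a" = "$b" ]
}
```

For example, after sourcing the function:

  same_modulo_blank_lines A/unsegmented.txt B/unsegmented.txt && echo "A and B differ only in blank lines"

A failure here would only indicate that the assumption does not hold for the regenerated files, not necessarily that the scripts are wrong.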