WeSearch Data Collection (WDC)
==============================

The WeSearch Data Collection is a freely redistributable, partly
annotated, comprehensive sample of user-generated content.  The WDC
contains data extracted from a range of genres of varying formality
(user forums, product review sites, blogs, and Wikipedia) and covers
two different domains (NLP and Linux).  For full details about its
construction, please see:

@inproceedings{Rea:Fli:Dri:12,
  author    = {Jonathon Read and Dan Flickinger and Rebecca Dridan and
               Stephan Oepen and Lilja {\O}vrelid},
  title     = {The WeSearch Corpus, Treebank, and Treecache --
               A Comprehensive Sample of User-Generated Content},
  booktitle = {Proceedings of the Eighth International Conference on
               Language Resources and Evaluation},
  year      = 2012,
  month     = {May},
  address   = {Istanbul, Turkey},
  pages     = {1829--1835},
  url       = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/774_Paper.pdf}}

Naming Conventions:

Each ‘collection’ (i.e. a combination of domain and genre) is
identified by a three-letter code, following the pattern ‘w[ln][bfrw]’:

  w      (WeSearch Data Collection)
  [ln]   (Linux or NLP)
  [bfrw] (blogs, forums, reviews, or wikipedia)

Two collections draw on independently developed resources, viz. WLN
and WLW, which correspond to the WeScience Corpus and parts of the
WikiWoods Corpus, respectively.  Thus, the assignment of item
identifiers in these collections does not follow the same pattern as
for the other parts of the WDC (in fact, the WeScience data is
maintained and distributed separately, for now).

Directory Structure:

For each collection, there are three sub-directories making the text
available at various levels of normalization (sometimes dubbed L0, L1,
and L2):

  raw : Raw HTML files (L0).  There are separate sub-directories for
        each source website.  The files inside each source directory
        are HTML, named to correspond to their paths on the source
        website (replacing ‘/’ with ‘:’).
  txt : [incr tsdb()] import files, containing HTML annotations (L1).
  gml : [incr tsdb()] import files, containing GML annotations (L2).

Additional meta-information about the construction of the WDC is
recorded in auxiliary files (in each sub-directory, as appropriate)
as follows:

  Xref    : Mapping from original (‘raw’) document names to the
            8-digit identifier prefix (see below for the exact
            identifier format).
  Account : Accounts of the deletions made to create the L1 and L2
            collections.  These files contain one line for each item
            in the collection.  The first two numbers are the item
            identifier and the character offset of the origin of the
            post.  They are followed by zero or more pairs of numbers;
            in each pair, the first number is a character position
            relative to the origin, and the second is the number of
            characters that have been deleted.

[incr tsdb()] Import Files:

These files contain text sentences ready for import into
[incr tsdb()].  The files numbered 00–03 are reserved:

  00 : for the benefit of future generations
  01 : a test set drawn from several sources
  02 : a single-source test set
  03 : a development set

Each line in an import file contains an item identifier and a string,
delimited with ‘ |’.  Identifiers take the form ‘DGSPPPPPIIII’:

  D = domain (1=linux, 2=nlp)
  G = genre (2=blogs, 3=forums, 4=reviews, 5=wiki)
  S = source (a unique number with respect to domain and genre)

        121 = embraceubuntu.com
        122 = ubuntu.philipcasey.com
        123 = www.linuxscrew.com
        124 = www.markshuttleworth.com
        125 = www.ubuntugeek.com
        126 = www.ubuntu-unleashed.com
        221 = blog.cyberling.org
        222 = gameswithwords.fieldofscience.com
        223 = lingpipe-blog.com
        224 = nlpers.blogspot.com
        225 = thelousylinguist.blogspot.com

  P = post (a unique number with respect to domain, genre, and source)
  I = item (a unique number with respect to domain, genre, source,
            and post)
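The identifier scheme above can be sketched in code.  The following is
a minimal decoding sketch, not part of the WDC distribution; the
sample identifier it decodes is hypothetical, constructed from the
documented fields (domain 1, genre 2, source 1, i.e. the ‘121’ prefix):

```python
# Decode a 12-digit WDC item identifier of the form 'DGSPPPPPIIII':
# 1 digit each for domain, genre, and source, then a 5-digit post
# number and a 4-digit item number.
DOMAINS = {"1": "linux", "2": "nlp"}
GENRES = {"2": "blogs", "3": "forums", "4": "reviews", "5": "wiki"}

def decode_identifier(identifier):
    """Split a WDC identifier into its named fields."""
    if len(identifier) != 12 or not identifier.isdigit():
        raise ValueError("expected a 12-digit identifier: %r" % identifier)
    return {
        "domain": DOMAINS[identifier[0]],   # D
        "genre": GENRES[identifier[1]],     # G
        "source": identifier[2],            # S
        "post": identifier[3:8],            # PPPPP
        "item": identifier[8:12],           # IIII
    }

# Hypothetical identifier: post 1, item 1 from source 121
# (embraceubuntu.com, i.e. a Linux blog).
fields = decode_identifier("121000010001")
# fields["domain"] == "linux", fields["genre"] == "blogs"
```

Note that the first eight digits (‘DGSPPPPP’) form exactly the 8-digit
prefix recorded in the Xref files.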