==================================================
*SEM SHARED TASK 2012
http://www.clips.ua.ac.be/sem2012-st-neg/
TRAINING AND DEVELOPMENT FILES FOR TASK 1 ON SCOPE DETECTION
==================================================

---------------------
VERSION 09 March 2012 b
---------------------

- A sentence was missing in training: baskervilles13/214.

---------------------
VERSION 09 March 2012
---------------------

- Tokenization of "cannot" has been fixed. It was "cannot", it is now "can not".

- Some annotation errors of affixal negation in adverbs have been fixed; the
  negation should scope over the clause (e.g. "impatiently" in "he muttered
  impatiently as he watched its sluggish drift").

- Annotation errors in the following sentences have also been fixed:

  wisteria01 46 11 come come VB * _ _ _
  ("come" should belong to the scope)

  wisteria01 44 10 no no DT * ***
  wisteria01 44 11 sympathy sympathy NN *) ***
  ("no" was not annotated as a cue)

  baskervilles01 30 3 said say VBN (VP* _ said _
  baskervilles01 30 4 as as RB (ADVP* _ as as
  baskervilles01 30 5 much much RB *) _ much much
  ("said" is the negated event instead of "as much")

------------------
VERSION 28 Feb 2012
------------------

- PoS tags for parentheses have been reverted to -LRB- and -RRB-.

- ASCII italics ('_not_') have been removed.

- An error in the numbering of tokens in the original corpus annotated with
  negation has been fixed. It caused errors when converting the XML
  annotations into CoNLL format, and some scopes were incorrectly aligned as
  a result.

- Cases where a negation annotation was not a (sub-)string of the surface
  form have been fixed.

------------------
VERSION 22 Feb 2012
------------------

This is a new version of the training and development files for Task 1 on
Scope Detection. The corpus has been preprocessed again with different tools
in order to fix the following problems:

- New tokenization of the corpus. In the first version there were many errors
  that affected contractions such as don't, can't, etc. As a result, the
  annotation of negation for these words was inconsistent.

- A distinction is now made between opening (left) and closing (right) quote
  marks, e.g. |``| or |“| vs. |''| or |”|.

The original text has been preprocessed again and the annotation of negation
has been mapped to the new preprocessed file.


Dataset CD-SCO for Task 1
-------------------------

SEM-2012-SharedTask-CD-SCO-training-22022012.txt: training file that contains
chapters 1-14 of The Hound of the Baskervilles by Conan Doyle.

SEM-2012-SharedTask-CD-SCO-dev-22022012.txt: development file that contains
The Adventure of Wisteria Lodge by Conan Doyle.

The original text of the Conan Doyle stories has been obtained from Project
Gutenberg.


Pre-processing
--------------

In the original txt files, sentences have been manually segmented; each file
contains one sentence per line, with paragraph boundaries indicated by double
linebreaks. Whereas the original versions from Project Gutenberg are pure
7-bit ASCII, our '.txt' versions make use of a handful of Unicode characters
to provide important distinctions; the file encoding of all files in this
collection is UTF-8.

These Unicode characters mostly pertain to quotation marks: where the ASCII
texts employ straight (so-called typewriter) quotes, these have been
disambiguated into Unicode opening (aka left) and closing (aka right) quote
marks, both for double (U+201C and U+201D) and single (U+2018 and U+2019)
quotation marks. The apostrophe (e.g. in |don’t| or |o’clock|) uses the same
Unicode code point as the closing single quote (U+2019). Furthermore, ASCII
double hyphens have been converted to Unicode em dashes (U+2014), as in, for
example (The Adventure of Wisteria Lodge, Chapter 1):

  “How do you define the word ‘grotesque’?”

  “Strange—remarkable,” I suggested.
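
For quick reference, the sketch below collects these character conventions as
Python constants, together with a small hypothetical helper for iterating over
the sentence-segmented '.txt' files (one sentence per line, UTF-8 encoded,
blank lines between paragraphs). It is illustrative only and is not part of
the released tooling.

  # Character conventions of the sentence-segmented '.txt' files, as above.
  LEFT_DOUBLE_QUOTE  = "\u201C"  # “  opening double quote
  RIGHT_DOUBLE_QUOTE = "\u201D"  # ”  closing double quote
  LEFT_SINGLE_QUOTE  = "\u2018"  # ‘  opening single quote
  RIGHT_SINGLE_QUOTE = "\u2019"  # ’  closing single quote; also used for the apostrophe
  EM_DASH            = "\u2014"  # —  replaces ASCII double hyphens ("--")

  def read_sentences(path):
      """Hypothetical helper: yield sentences from a sentence-segmented,
      UTF-8 encoded '.txt' file, skipping blank paragraph-boundary lines."""
      with open(path, encoding="utf-8") as handle:
          for line in handle:
              line = line.rstrip("\n")
              if line:
                  yield line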

The sentence-segmented files have then been processed as follows: tokenization
is obtained with the PTB-compliant tokenizer that is part of the LinGO English
Resource Grammar; for details, please see:

  http://moin.delph-in.net/ErgTokenization
  http://moin.delph-in.net/ReppTop

Pre-tokenized strings were then lemmatized using the GENIA tagger (in version
3.0.1, with the '-nt' command line option) and parsed with the re-ranking
parser of Charniak & Johnson (2005), in the November 2009 release available
from Brown University.

In preparing inputs for lemmatization and parsing, the following mapping from
Unicode characters to PTB conventions was used (to better align with the
training data used in constructing these tools):

  “ --> ``
  ” --> ''
  … --> ...
  — --> --
  – --> --

Note that GENIA PoS tags are complemented with TnT PoS tags, again for
increased compatibility with the original PTB: GENIA does not make a common
vs. proper noun distinction (NN(S) vs. NNP(S) in the PTB tag set). Tokens
tagged /NNS?/ by GENIA but /NNPS?/ by TnT therefore take the TnT tag
assignment; all other tokens keep their GENIA PoS tags. For compatibility with
PTB conventions, the top-level nodes in C&J parse trees, which are always
labelled 'S1', have been removed. The C&J parser internally distinguishes
auxiliary from other verbs, i.e. it adds to the original PTB inventory tags
like AUX (e.g. for the form 'is') or AUXG (e.g. for 'being'). Where C&J trees
have preterminal nodes matching /AUX.*/, the original PoS tags from GENIA
(plus TnT) were used.

The conversion of PTB-style syntactic analysis trees into CoNLL-style,
line-oriented format was accomplished with the software made available by the
organizers of the 2005 CoNLL Shared Task; see:

  http://www.lsi.upc.edu/~srlconll/
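
As a rough illustration of the tag-merging conventions just described, the
following sketch (Python; our own assumption, not the actual preprocessing
scripts) spells out the two rules: take the TnT tag where GENIA misses the
common vs. proper noun distinction, and restore the merged GENIA/TnT tag
wherever the C&J parser produced an AUX* preterminal.

  import re

  def merge_genia_tnt(genia_tag, tnt_tag):
      """Tokens tagged /NNS?/ by GENIA but /NNPS?/ by TnT take the TnT tag;
      all other tokens keep their GENIA tag."""
      if re.fullmatch(r"NNS?", genia_tag) and re.fullmatch(r"NNPS?", tnt_tag):
          return tnt_tag
      return genia_tag

  def fix_preterminal(cj_label, merged_tag):
      """Where the C&J parser assigned an AUX* preterminal, use the merged
      GENIA/TnT tag instead."""
      if re.fullmatch(r"AUX.*", cj_label):
          return merged_tag
      return cj_label

  # e.g. merge_genia_tnt("NN", "NNP") -> "NNP"
  #      fix_preterminal("AUX", "VBZ") -> "VBZ"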

Annotation
----------

All occurrences of negation are annotated, covering negation expressed by
nouns, pronouns, verbs, adverbs, determiners, conjunctions and prepositions.
For each negation, the cue and its scope are marked, as well as the negated
event or property, if any. Cues and scopes can be discontinuous. More
information about the annotation can be found in the annotation guidelines,
published as Morante et al. (2011), Annotation of Negation Cues and their
Scope. Guidelines v1.0, CLiPS Technical Report Series, which can be downloaded
from:

  http://www.clips.ua.ac.be/annotation-of-negation-cues-and-their-scope-guidelines-v10


Format
------

The data are provided in CoNLL format. Each line corresponds to a token and
each annotation is provided in a column; empty lines indicate the end of a
sentence. The content of the columns is specified below.

Column 1: chapter name
Column 2: sentence number within chapter
Column 3: token number within sentence
Column 4: word
Column 5: lemma
Column 6: part-of-speech
Column 7: syntax
Columns 8 to last: negation annotation (to be produced by the systems; it
will not be provided in the test data)

- If the sentence has no negations, column 8 has the value "***" and there
  are no further columns.

- If the sentence has negations, the annotation for each negation is provided
  in three columns. The first column contains the word, or the part of the
  word (e.g. the morpheme "un"), that belongs to the negation cue. The second
  contains the word or part of the word that belongs to the scope of the
  negation cue. The third column contains the word or part of the word that
  is the negated event or property.

It can be the case that no negated event or property is marked. For example,
in Example 3 none of the negations has a negated event annotated, because of
the conditional construction. In Example 1 there are two negations:
information for the first negation is provided in columns 8-10, and for the
second in columns 11-13. Example 2 shows how prefixal negation is
represented: "un" is the negation cue, "his own conventional appearance" is
the scope, and "conventional" is the negated property.

Example 1

wisteria01 288 0 He He PRP (S(NP*) _ He _ _ He _
wisteria01 288 1 is be VBZ (VP* _ is _ _ is _
wisteria01 288 2 not not RB (ADJP* not _ _ _ _ _
wisteria01 288 3 particularly particularly RB * _ particularly _ _ _ _
wisteria01 288 4 intelligent intelligent JJ *) _ intelligent intelligent _ _ _
wisteria01 288 5 -- -- : * _ _ _ _ _ _
wisteria01 288 6 not not RB (NP(NP* _ _ _ not _ _
wisteria01 288 7 a a DT * _ _ _ _ a _
wisteria01 288 8 man man NN *) _ _ _ _ man _
wisteria01 288 9 likely likely JJ (ADJP* _ _ _ _ likely likely
wisteria01 288 10 to to TO (S(VP* _ _ _ _ to _
wisteria01 288 11 be be VB (VP* _ _ _ _ be _
wisteria01 288 12 congenial congenial JJ (ADJP* _ _ _ _ congenial _
wisteria01 288 13 to to TO (PP* _ _ _ _ to _
wisteria01 288 14 a a DT (NP* _ _ _ _ a _
wisteria01 288 15 quick-witted quick-witted JJ * _ _ _ _ quick-witted _
wisteria01 288 16 Latin Latin NNP *))))))))) _ _ _ _ Latin _
wisteria01 288 17 . . . *) _ _ _ _ _ _

Example 2

wisteria01 60 0 Our Our PRP$ (S(NP* _ _ _
wisteria01 60 1 client client NN *) _ _ _
wisteria01 60 2 looked look VBD (VP* _ _ _
wisteria01 60 3 down down RB (ADVP*) _ _ _
wisteria01 60 4 with with IN (PP* _ _ _
wisteria01 60 5 a a DT (NP(NP* _ _ _
wisteria01 60 6 rueful rueful JJ * _ _ _
wisteria01 60 7 face face NN *) _ _ _
wisteria01 60 8 at at IN (PP* _ _ _
wisteria01 60 9 his his PRP$ (NP* _ his _
wisteria01 60 10 own own JJ * _ own _
wisteria01 60 11 unconventional unconventional JJ * un conventional conventional
wisteria01 60 12 appearance appearance NN *))))) _ appearance _
wisteria01 60 13 . . . *) _ _ _

Example 3

wisteria01 320 0 She She PRP (S(NP*) _ She _ _ _ _
wisteria01 320 1 would would MD (VP* _ would _ _ _ _
wisteria01 320 2 not not RB * not _ _ _ _ _
wisteria01 320 3 have have VB (VP* _ have _ _ _ _
wisteria01 320 4 said say VBD (VP* _ said _ _ _ _
wisteria01 320 5 ` ` `` (SBAR(S(NP* _ ' _ _ _ _
wisteria01 320 6 Godspeed Godspeed NNP * _ Godspeed _ _ _ _
wisteria01 320 7 ' ' '' *) _ ' _ _ _ _
wisteria01 320 8 had have VBD (VP* _ had _ _ had _
wisteria01 320 9 it it PRP (ADVP* _ it _ _ it _
wisteria01 320 10 not not RB *) _ not _ not _ _
wisteria01 320 11 been be VBN (VP* _ been _ _ been _
wisteria01 320 12 so so RB (ADVP*)))))))) _ so _ _ so _
wisteria01 320 13 . . . *) _ _ _ _ _ _
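
To make the column layout concrete, here is a minimal reading sketch (Python;
the function is hypothetical, not an official reader supplied with the data).
Given the whitespace-separated token lines of one sentence, it returns the
word forms together with one cue/scope/event record per negation, following
the conventions and examples above.

  def parse_sentence(lines):
      """Parse the token lines of one sentence from the CD-SCO CoNLL file."""
      rows = [line.split() for line in lines]
      tokens = [cols[3] for cols in rows]            # column 4: word form
      negations = []
      if rows and rows[0][7:] != ["***"]:            # "***" means no negation
          n_neg = (len(rows[0]) - 7) // 3            # three columns per negation
          negations = [{"cue": [], "scope": [], "event": []} for _ in range(n_neg)]
          for cols in rows:
              for k in range(n_neg):
                  cue, scope, event = cols[7 + 3 * k : 10 + 3 * k]
                  if cue != "_":
                      negations[k]["cue"].append(cue)
                  if scope != "_":
                      negations[k]["scope"].append(scope)
                  if event != "_":
                      negations[k]["event"].append(event)
      return tokens, negations

  # For Example 2 above this yields one negation with cue ["un"],
  # scope ["his", "own", "conventional", "appearance"], and event ["conventional"].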

Credits
-------

The corpus has been preprocessed by Stephan Oepen at the University of Oslo.
In case you notice room for improvement in these files, or need more
information on any of the intermediate steps, feel free to contact him at
oe@ifi.uio.no.

The corpus has been annotated with negation at CLiPS, University of Antwerp
(www.clips.ua.ac.be), with funding from the GOA project BIOGRAPH of the
University of Antwerp. Questions about the annotation of negation should be
addressed to Roser Morante (roser.morante@ua.ac.be).


Comments and errors
-------------------

If you find errors or inconsistencies in the dataset, please report them to
Roser Morante (roser.morante@ua.ac.be). We also welcome comments on the
annotation guidelines.


References
----------

E. Charniak and M. Johnson (2005). Coarse-to-Fine n-Best Parsing and MaxEnt
Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL 2005), pages 173-180, Ann
Arbor, Michigan, June 2005. Association for Computational Linguistics.

R. Morante, S. Schrauwen, and W. Daelemans (2011). Annotation of Negation
Cues and their Scope. Guidelines v1.0. Technical Report CTR-003, CLiPS,
University of Antwerp, Antwerp.