CoNLL 2020 Shared Task: Meaning Representation Parsing --- Companion Data

Version 1.2; July 29, 2020


Overview
========

This directory (together with its ‘sibling’ directory for the cross-lingual
track) contains what is called morpho-syntactic ‘companion’ data for the MRP
2020 shared task, viz. tokenization, PoS tagging, lemmatization, and
dependency trees for the parser ‘input’ strings of the MRP training and
validation data.  For general information on the task and the meaning
representation frameworks involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented
at:

  http://mrp.nlpl.eu/2020/index.php?page=14#format


Cross-Framework Track (English Only)
====================================

These parses were obtained using a combination of the rule-based tokenizer
in the CoreNLP package (https://stanfordnlp.github.io/CoreNLP/) and a
forthcoming development version of the UDPipe engine by Straka (2018;
CoNLL).  We believe that the quality of these morpho-syntactic parses
reflects the state of the art in parsing into basic Universal Dependencies
(UD) for English.

Unlike in the predecessor (MRP 2019) shared task, tokenization in the 2020
companion data follows the conventions from the OntoNotes project (rather
than the traditional PTB rules), which have also been adopted in the
Universal Dependencies initiative: most hyphens and slashes introduce token
boundaries.

The training data for the parser was compiled from a broad range of
syntactic treebanks for English, viz. (a) the Wall Street Journal portion
from the recent re-release of the venerable Penn Treebank (LDC2015T13),
(b) the English parts of the OntoNotes annotations (LDC2013T19, excluding
the WSJ segments), and (c) the English Web Treebank (EWT) from the UD 2.6
release.  The WSJ and OntoNotes phrase structure trees were converted to
UD 2.x dependencies using version 4.0 of the converter by Sebastian
Schuster.
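To make the tokenization difference concrete, the following is a minimal, purely illustrative sketch (it is NOT the CoreNLP tokenizer, and the example words are invented) of splitting at hyphens and slashes, as the OntoNotes-style conventions prescribe:

```python
import re

# Illustrative sketch only: under the OntoNotes-style conventions used in
# the 2020 companion data, most hyphens and slashes introduce token
# boundaries, unlike traditional PTB tokenization, which keeps hyphenated
# compounds as single tokens.
def split_hyphens_slashes(token):
    """Split a word at hyphens and slashes, keeping the separators as tokens."""
    return [part for part in re.split(r"([-/])", token) if part]

print(split_hyphens_slashes("state-of-the-art"))
# ['state', '-', 'of', '-', 'the', '-', 'art']
print(split_hyphens_slashes("and/or"))
# ['and', '/', 'or']
```

Note that the real tokenizer applies further exceptions (hence ‘most’ hyphens above); the sketch only shows the basic boundary convention.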
Please see the ‘Makefile’ for complete details on the conversion pipeline
and invocation of the CoreNLP tokenizer.  Because part of the MRP training
data overlaps with the resources used to train the morpho-syntactic parser,
five-fold jack-knifing was applied to the WSJ and EWT sub-sets, yielding a
total of six parsing models.  In each case, all of the OntoNotes data was
included in training, as well as four fifths of the training data for WSJ
and EWT; again, please see the ‘Makefile’ for details.


Cross-Lingual Track (Chinese, Czech, German)
============================================

Chinese Abstract Meaning Representation (CAMR) graphs annotate sentences
that overlap with the Chinese Tree Bank (CTB), which is a common source of
data for training syntactic dependency parsers (like CoreNLP).  Therefore,
jack-knifing has also been applied in the creation of the MRP 2020 Chinese
companion dependency parses: the CAMR subset of CTB 9.0 was split into two
parts, and each part was combined with the non-CAMR subset to train a model
that was used to parse the other part.  Parsing accuracy on the validation
set is: UAS = 84.11, LAS = 81.85.

The Czech MRP training data comes from the Prague Dependency Treebank,
hence it overlaps with the resources used to train the morpho-syntactic
parser and jack-knifing had to be applied.  The part of PDT for which PTG
annotations are available was split into ten parts (train-1 to train-8,
dtest, etest).  For each part, a UDPipe model was trained on those
sentences of UD_Czech-PDT (release 2.6) that do not come from that
particular part, and on the entire UD_Czech-CAC and UD_Czech-FicTree; the
model was then used to tag and parse the part.  The average accuracy of the
prediction is LAS = 93.45, MLAS = 88.85, BLEX = 91.16.

For training the German morpho-syntactic parser, all German UD 2.6
treebanks were used: GSD, HDT, LIT, and PUD.  Since there is no overlap
with the German MRP training data, there was no need for jack-knifing in
this case.
A UDPipe model was trained on 90% of the concatenation of these treebanks,
and 10% was used as a validation set (irrespective of the original
train/dev/test splits).  The accuracy of the prediction on the validation
set is LAS = 95.84, MLAS = 84.70, BLEX = 90.27.  For parsing the MRP
training data, the UDPipe 1.2.0 tokenizer was used, with the UD 2.5 German
GSD model, followed by the trained UDPipe model for tagging and parsing.


Contents
========

The main contents for the cross-framework track are in the following files:

  $ wc -l udpipe.mrp
  122661 udpipe.mrp
  $ wc -l jamr.mrp isi.mrp
  61445 jamr.mrp
  61445 isi.mrp
  $ wc -l boxer.mrp
  7488 boxer.mrp

For each of the MRP training graphs, ‘udpipe.mrp’ contains one dependency
tree, where correspondence to the MRP training data is by graph
‘id’entifiers.  There are a few identifiers that occur more than once in
the MRP training data, for example the first four items in
‘training/amr/amr-guidelines.mrp’, as well as the 89 graphs over WSJ
sentences that are annotated in all frameworks.  In the companion data,
each identifier (and corresponding ‘input’ string) occurs only once.  In
other words, some of the sentences in the companion data correspond to
multiple semantic graphs in the MRP training data.  Note that the input
strings for the EDS and PTG graphs correspond one-to-one with each other
(for overlapping WSJ sentences), but there are some sentences not annotated
in PTG, and others not annotated in EDS.

For the AMR graphs, the files ‘jamr.mrp’ and ‘isi.mrp’ provide reference
(if by no means gold-standard) anchorings, obtained from the JAMR system of
Flanigan et al. (2016; SemEval) and the ISI aligner by Pourdamghani et al.
(2014; EMNLP) and converted to the MRP file format.
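Since each ‘.mrp’ file holds one JSON object per line (hence the ‘wc -l’ counts above), semantic graphs and companion trees can be joined by their ‘id’ values. A minimal sketch, where the two sample lines are invented stand-ins for real entries:

```python
import io
import json

def index_by_id(stream):
    """Index a JSON-lines .mrp stream by its graph 'id' values."""
    index = {}
    for line in stream:
        line = line.strip()
        if line:
            graph = json.loads(line)
            index[graph["id"]] = graph
    return index

# Toy stand-in for two lines of 'udpipe.mrp' (contents invented):
sample = io.StringIO(
    '{"id": "sample-1", "input": "First sentence."}\n'
    '{"id": "sample-2", "input": "Second sentence."}\n'
)
trees = index_by_id(sample)
print(sorted(trees))
# ['sample-1', 'sample-2']
```

Because identifiers are unique in the companion data but not in the MRP training data, joining in this direction (graph id → companion tree) is the safe lookup; several training graphs may map to the same tree.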
In these overlay files, nodes, edges, and property values can bear
anchoring information, using the newly minted version 1.1 of the MRP
serialization format: JSON objects for edges have been augmented with an
(integer-valued) ‘id’ field, which (much like for nodes) serves to encode
the correspondence between elements of the AMR graphs proper and their
anchoring overlay in ‘jamr.mrp’ or ‘isi.mrp’.  To record anchoring
information on edges, the overlay files use the same ‘anchors’ property as
on nodes; for properties, node objects have been augmented with an
‘anchorings’ array, which follows the same order coding as the
corresponding ‘properties’ array.

For the AMR graphs, anchoring information (as computed by the above
aligners) is encoded in terms of token identifiers, using the tokenization
from the MRP companion parses, in a format as follows (for anchoring to
tokens #8 and #9 on node #7):

  { ...
    "nodes": [ ... {"id": 7, "anchors": [{"#": 8}, {"#": 9}]} ... ]
  }

In a similar spirit, the file ‘boxer.mrp’ provides companion anchorings for
the English DRG annotations, also using version 1.1 of the MRP
serialization, as described for the AMR graphs above (however, there are no
node properties in DRG, hence the ‘anchorings’ array is not used).  Unlike
for AMR, however, anchors for DRG use the familiar character-based ‘from’
and ‘to’ format, i.e. they are independent of the UDPipe companion
tokenization.

For the cross-lingual track, the main contents are in the files:

  $ wc -l *.mrp
  43955 ces.mrp
   5283 deu.mrp
  18365 zho.mrp
   1575 boxer.mrp

Again, for each of the MRP training graphs there is one dependency tree,
where correspondence to the gold-standard training data is by graph
‘id’entifiers.  The cross-lingual syntactic companion parses are separated
by language, but in each case the file format is the same as for the
corresponding cross-framework ‘udpipe.mrp’.
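One plausible operation for consumers of the token-based AMR anchorings is to project anchors of the form {"#": k} back onto character spans, by looking up the corresponding tokens in the companion parse. The sketch below assumes, as in the MRP format, that companion tokens carry character-based ‘from’/‘to’ anchors; the token labels and offsets are invented for illustration:

```python
def resolve_token_anchors(overlay_node, companion_tokens):
    """Project token-index anchors ({"#": k}) onto one character span."""
    spans = [
        (tok["anchors"][0]["from"], tok["anchors"][0]["to"])
        for tok in (companion_tokens[a["#"]] for a in overlay_node["anchors"])
    ]
    # Span from the start of the first token to the end of the last one.
    return min(f for f, _ in spans), max(t for _, t in spans)

# Toy companion tokens #8 and #9 (labels and offsets invented):
companion = {
    8: {"id": 8, "label": "need", "anchors": [{"from": 30, "to": 34}]},
    9: {"id": 9, "label": "him", "anchors": [{"from": 35, "to": 38}]},
}
node = {"id": 7, "anchors": [{"#": 8}, {"#": 9}]}
print(resolve_token_anchors(node, companion))
# (30, 38)
```

For the DRG overlays in ‘boxer.mrp’, no such projection is needed, since their anchors are already character-based.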
Because the Chinese AMR graphs include anchoring as part of the
gold-standard annotations, the only reference anchorings for the
cross-lingual track are for the German DRG structures, in the file
‘boxer.mrp’.


Acknowledgments
===============

Sebastian Schuster advised on how best to convert from PTB-style
constituent trees to (basic) UD 2.x dependency graphs.  Milan Straka
provided invaluable assistance in training and running the latest
development version of his UDPipe system, to generate the morpho-syntactic
companion trees for the MRP sentences.  Jayeol Chun most helpfully provided
the AMR alignments, including forcing the aligners to respect the
tokenization from the MRP morpho-syntactic companion parses.  He also
coordinated the creation of the Chinese dependency parses and has been
instrumental in ensuring that the anchorings on the parser outputs exactly
match the ‘input’ strings underlying the Chinese MRP graphs.  Shamy Ji
reported several deficiencies in the initial release, viz. some 1882
missing parses (for EDS graphs for WSJ and Brown sentences not annotated in
the PTG data) and a smaller number of WSJ sentences that had erroneously
been parsed in their pre-tokenized form (as used in the AMR graphs), i.e.
including spurious white space (see below).  Hiroaki Ozaki discovered that
two of the Chinese AMR graphs were missing their syntactic companion parse
in the initial release.


Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’
strings, i.e. follow common conventions regarding punctuation marks and
whitespace.  In the case of some of the AMR ‘input’ values, the strings
appear semi-tokenized, in the sense of separating punctuation marks like
commas, periods, and quote marks, as well as contracted auxiliaries and
possessives, from adjacent tokens with spurious whitespace.  Furthermore,
some of these strings use (non-standard) conventions for directional quote
marks, viz.
the LaTeX-style two-character sequences that have been popularized in NLP
corpora by the Penn Treebank.  For example:

  wb.eng_0003.13  Where 's Homer Simpson when you need him ?
  wb.eng_0003.14  This is a major `` D'oh! '' moment .

For participants starting from the companion morpho-syntactic trees, the
first of these artifacts may have led to wrong quote disambiguation in the
tokenizer: ‘straight’ single and double quote marks preceded by whitespace
are treated as left (or opening) quotes, which can at times result in
directionally unmatched quote marks, as well as in contractions whose first
character is a left quote mark rather than an apostrophe.  LaTeX-style
quote marks, on the other hand, should have been normalized properly during
tokenization for the companion morpho-syntactic trees, i.e. to “ and ” for
the above example.


Release History
===============

[Version 1.2; July 29, 2020]
+ Add 417 missing German companion parses; correct ‘input’s of five Czech
  ones.

[Version 1.1; July 21, 2020]
+ Add two missing Chinese companion parses (from non-terminated CoNLL-U
  file).

[Version 1.0; June 22, 2020]
+ Re-release for missing and corrected strings, plus companion anchorings.

[Version 0.9; June 1, 2020]
+ First release of the MRP 2020 morpho-syntactic companion trees.


Contact
=======

For questions or comments, please do not hesitate to email the task
organizers at: ‘mrp-organizers@nlpl.eu’.

Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajič, Daniel Hershcovich,
Bin Li, Stephan Oepen (chair), Tim O'Gorman, Nianwen Xue, and Dan Zeman