CoNLL 2019 Shared Task: Meaning Representation Parsing --- Evaluation Companion

Version 1.1; July 15, 2019


Overview
========

This directory contains a re-release of the morpho-syntactic companion trees
for the MRP 2019 shared task, i.e. tokenized and syntactically parsed parser
inputs.  A re-release of this part of the evaluation package was required to
augment the companion trees with anchoring information, i.e. character start
and end indices into the original input string.

For general information on the task and the meaning representation frameworks
involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented at:

  http://mrp.nlpl.eu/index.php?page=4#format


Contents
========

The main content of this release is the file providing the companion trees:

  $ wc -l udpipe.mrp
  6288 udpipe.mrp

Here, the number of lines corresponds to the number of parser inputs, i.e. the
evaluation data for MRP 2019 comprises 6288 strings to be parsed.

The ‘companion’ morpho-syntactic trees were created using the same software
version and parsing model as for the training data, but the anchoring in this
re-release is taken directly from the output of the REPP tokenizer.

For additional technical information on the preparation of the companion
trees, please see the original companion package for the training data:

  http://svn.nlpl.eu/mrp/2019/public/companion.tgz


Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’ strings,
i.e. follow common conventions regarding punctuation marks and whitespace.  In
the case of some of the AMR ‘input’ values, the strings appear semi-tokenized,
in the sense of separating punctuation marks like commas, periods, quote marks,
and contracted auxiliaries and possessives from adjacent tokens with spurious
whitespace.  Furthermore, some of these strings use (non-standard) conventions
for directional quote marks, viz. the LaTeX-style two-character sequences that
have been popularized in NLP corpora by the Penn Treebank.  For example:

  wb.eng_0003.13  Where 's Homer Simpson when you need him ?
  wb.eng_0003.14  This is a major `` D'oh! '' moment .

For participants starting from the companion morpho-syntactic trees, the first
of these artifacts may have led to wrong quote disambiguation in the tokenizer:
‘straight’ single and double quote marks preceded by whitespace are treated as
left (or opening) quotes, which will at times result in directionally unmatched
quote marks, as well as in contractions whose first character is a left quote
mark rather than an apostrophe.  In retrospect, it turns out that the training
data for MRP 2019 exhibited the same limitations in six of the AMR sub-corpora.

The LaTeX-style quote marks, on the other hand, have been normalized properly
during tokenization for the companion morpho-syntactic trees, i.e. to “ and ”
for the above example.


Release History
===============

[Version 1.1; July 15, 2019]

+ Re-release of companion trees only, now including anchoring.

[Version 1.0; July 1, 2019]

+ First release of MRP 2019 training data in all frameworks.


Contact
=======

For questions or comments, please do not hesitate to email the task organizers
at: ‘mrp-organizers@nlpl.eu’.

Omri Abend
Jan Hajič
Daniel Hershcovich
Marco Kuhlmann
Stephan Oepen
Tim O'Gorman
Nianwen Xue
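

Appendix: Reading Anchors (Illustrative Sketch)
===============================================

The anchoring described in the overview (character start and end indices into
the original input string) can be illustrated with a short Python sketch.  It
is not part of the official tooling.  It assumes that each line of udpipe.mrp
is one JSON object with an ‘input’ string and a ‘nodes’ list, and that each
node carries an ‘anchors’ list of ‘from’/‘to’ offsets; these field names are
assumptions to be checked against the interchange format documentation linked
above.  The function name and the sample record are hypothetical and invented
for illustration; the record is not taken from the evaluation data.

```python
import json


def anchored_tokens(record):
    """Return (node id, label, anchored substring) triples for one record.

    Assumes the field names 'input', 'nodes', 'anchors', 'from', and 'to';
    verify these against the interchange format documentation.
    """
    text = record["input"]
    triples = []
    for node in record.get("nodes", []):
        # A node may carry several anchors; collect every covered substring.
        spans = [text[a["from"]:a["to"]] for a in node.get("anchors", [])]
        triples.append((node["id"], node.get("label"), " ".join(spans)))
    return triples


# Hand-written sample line (hypothetical, not from udpipe.mrp).
sample_line = json.dumps({
    "id": "example.1",
    "input": "This is a test.",
    "nodes": [
        {"id": 0, "label": "This", "anchors": [{"from": 0, "to": 4}]},
        {"id": 1, "label": "is", "anchors": [{"from": 5, "to": 7}]},
        {"id": 2, "label": "a", "anchors": [{"from": 8, "to": 9}]},
        {"id": 3, "label": "test", "anchors": [{"from": 10, "to": 14}]},
        {"id": 4, "label": ".", "anchors": [{"from": 14, "to": 15}]},
    ],
})

for node_id, label, surface in anchored_tokens(json.loads(sample_line)):
    print(node_id, label, repr(surface))
```

To apply the same function to the release file, one would read udpipe.mrp line
by line and pass each line through json.loads() first.  Comparing the label
with the anchored substring is one way to spot tokens that were normalized
during tokenization, such as the quote marks discussed under Known
Limitations.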