CoNLL 2019 Shared Task: Meaning Representation Parsing --- Companion Data

Version 1.0; June 24, 2019


Overview
========

This directory contains what is called morpho-syntactic ‘companion’ data for
the MRP 2019 shared task: PoS taggings, lemmatization, and dependency trees
for the parser ‘input’ strings of the MRP training data.

For general information on the task and the meaning representation frameworks
involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented at:

  http://mrp.nlpl.eu/index.php?page=4#format

These parses were obtained using a combination of the rule-based tokenizer by
Dridan & Oepen (2012; ACL) and a recent (post-futuristic) development version
of the UDPipe engine by Straka (2018; CoNLL).  We believe that the quality of
these morpho-syntactic analyses reflects the state of the art in parsing into
basic Universal Dependencies (UD) for English.

The training data for the parser was compiled from a broad range of syntactic
treebanks for English, viz. (a) the Wall Street Journal and Brown partitions
of the Penn Treebank (LDC99T42), (b) the GENIA Treebank (in its conversion to
PTB-style serialization by David McClosky), and (c) the English Web Treebank
(EWT) from the UD 2.3 release.  The WSJ, Brown, and GENIA phrase structure
trees were converted to UD 2.x dependencies using a pre-release snapshot of
the converter by Sebastian Schuster (which is an extension of the conversion
in the Stanford CoreNLP system).  For compatibility with the majority of the
training data, the tokenizer was configured for PTB-style tokenization, i.e.
(unlike in UD) there will typically not be token boundaries at hyphens or
slashes.

Because part of the MRP training data overlaps with the resources used to
train the morpho-syntactic parser, five-fold jack-knifing was applied to the
WSJ and EWT sub-sets, yielding a total of eleven parsing models.
In each case, all of the Brown and GENIA data was included in training, as
well as the totality of the available training data for either WSJ or EWT not
subjected to jack-knifing in a particular model.


Contents
========

The main contents of this release are in the following files:

  $ wc -l udpipe.mrp
  98290

  $ wc -l jamr.mrp isi.mrp
  56240 jamr.mrp
  56240 isi.mrp

For each of the MRP training graphs, ‘udpipe.mrp’ contains one dependency
tree, where correspondence to the MRP training data is by graph
‘id’entifiers.  There are a few identifiers that occur more than once in the
MRP training data, for example the first four items in
‘training/amr/amr-guidelines.mrp’, as well as the 89 graphs over WSJ sentences
that are annotated in all frameworks.  In the companion data, each identifier
(and corresponding ‘input’ string) occurs only once.  In other words, some of
the sentences in the companion data correspond to multiple semantic graphs in
the MRP training data.

Mostly for archival purposes, probably, the package also provides the
companion parses in the native, tab-separated CoNLL-U output format of the
parser:

  $ for i in amr/*.conllu dm/*.conllu ucca/*.conllu; do \
      echo -n "$i "; egrep "^#" $i | wc -l; \
    done
  amr/amr-guidelines.conllu 969
  amr/bolt.conllu 1061
  amr/cctv.conllu 213
  amr/dfa.conllu 7378
  amr/dfb.conllu 32914
  amr/fables.conllu 48
  amr/lorelei.conllu 4440
  amr/mt09sdl.conllu 203
  amr/proxy.conllu 6603
  amr/rte.conllu 527
  amr/wb.conllu 865
  amr/wiki.conllu 191
  amr/xinhua.conllu 741
  dm/wsj00.conllu 7132
  dm/wsj01.conllu 7131
  dm/wsj02.conllu 7131
  dm/wsj03.conllu 7131
  dm/wsj04.conllu 7131
  ucca/ewt00.conllu 763
  ucca/ewt01.conllu 763
  ucca/ewt02.conllu 762
  ucca/ewt03.conllu 762
  ucca/ewt04.conllu 762
  ucca/wiki.conllu 2673

Note that the input strings for the EDS and PSD graphs correspond one-to-one
with the DM sentences.

For the AMR graphs, the files ‘jamr.mrp’ and ‘isi.mrp’ provide reference (if
by no means gold-standard) alignments, obtained from the JAMR system of
Flanigan et al.
(2016; SemEval) and the ISI aligner by Pourdamghani et al. (2014; EMNLP) and
converted to the MRP file format.  Here, nodes and property values can bear
alignments, encoded in the ‘label’ and (order-coded) ‘values’ JSON properties,
respectively.  Alignments take the form of lists of token indices, e.g.
[0, 1], which are zero-based pointers into the token sequences in
‘udpipe.mrp’.


Acknowledgments
===============

Sebastian Schuster kindly made available a pre-release of his converter from
PTB-style constituent trees to (basic) UD 2.x dependency graphs.

Milan Straka provided invaluable assistance in training and running the latest
development version of his UDPipe system, to generate the morpho-syntactic
companion trees for the MRP sentences.

Jayeol Chun most helpfully provided the AMR alignments, including forcing the
aligners to respect the tokenization from the MRP morpho-syntactic companion
parses.


Known Limitations
=================

The Wikipedia texts in the AMR graph bank contain some residual HTML mark-up,
which the tokenizer probably should be made to strip.


Release History
===============

[Version 1.0; June 24, 2019]

+ Re-release, including AMR alignments from JAMR and the ISI aligner.

[Version 0.9; May 20, 2019]

+ First release of the MRP 2019 morpho-syntactic companion trees.


Contact
=======

For questions or comments, please do not hesitate to email the task organizers
at: ‘mrp-organizers@nlpl.eu’.

Omri Abend
Jan Hajič
Daniel Hershcovich
Marco Kuhlmann
Stephan Oepen
Tim O'Gorman
Nianwen Xue
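

Example: Resolving Alignments
=============================

The alignments described under Contents are zero-based token indices into
‘udpipe.mrp’, matched by graph identifier.  The following is a minimal Python
sketch of that lookup, not part of the release.  It assumes that each line of
an ‘.mrp’ file is one JSON object with ‘id’ and ‘nodes’ properties, that the
companion nodes are listed in token order with the token form in ‘label’, and
that an aligned node carries its list of token indices in ‘label’.  The
function names and the toy records are hypothetical; alignments on property
values (in ‘values’) are not handled.

```python
import json
from typing import Dict, Iterable, List


def read_mrp(lines: Iterable[str]) -> Dict[str, dict]:
    """Index MRP graphs (assumed: one JSON object per line) by their 'id'."""
    graphs = {}
    for line in lines:
        line = line.strip()
        if line:
            graph = json.loads(line)
            graphs[graph["id"]] = graph
    return graphs


def tokens_of(companion: dict) -> List[str]:
    """Token forms in surface order (assumes nodes are listed in token order)."""
    return [node.get("label", "") for node in companion.get("nodes", [])]


def resolve(alignment: dict, companion: dict) -> Dict[int, List[str]]:
    """Map each aligned node id to the token forms its indices point at."""
    tokens = tokens_of(companion)
    resolved = {}
    for node in alignment.get("nodes", []):
        indices = node.get("label")
        if isinstance(indices, list):  # unaligned nodes carry no index list
            resolved[node["id"]] = [
                tokens[i] for i in indices if 0 <= i < len(tokens)
            ]
    return resolved


# Toy records standing in for one line each of 'udpipe.mrp' and 'jamr.mrp';
# these are invented for illustration and do not come from the release.
companion_lines = [
    '{"id": "toy-1", "nodes": [{"id": 0, "label": "The"},'
    ' {"id": 1, "label": "boy"}, {"id": 2, "label": "sleeps"}]}'
]
alignment_lines = [
    '{"id": "toy-1", "nodes": [{"id": 0, "label": [2]},'
    ' {"id": 1, "label": [0, 1]}, {"id": 2}]}'
]

companion = read_mrp(companion_lines)
alignments = read_mrp(alignment_lines)
for graph_id, graph in alignments.items():
    print(graph_id, resolve(graph, companion[graph_id]))
```

With the actual files, the same functions could be fed open file handles, e.g.
read_mrp(open("udpipe.mrp", encoding="utf-8")), provided the assumptions above
hold for the data at hand.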