CoNLL 2019 Shared Task: Meaning Representation Parsing --- Companion Data

Version 1.0; June 24, 2019


Overview
========

This directory contains morpho-syntactic ‘companion’ data for the MRP 2019
shared task: PoS tagging, lemmatization, and dependency trees for the parser
‘input’ strings of the MRP training data.

For general information on the task and the meaning representation frameworks
involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented at:

  http://mrp.nlpl.eu/index.php?page=4#format

These parses were obtained using a combination of the rule-based tokenizer by
Dridan & Oepen (2012; ACL) and a recent (post-futuristic) development version
of the UDPipe engine by Straka (2018; CoNLL).  We believe that the quality of
these morpho-syntactic analyses reflects the state of the art in parsing into
basic Universal Dependencies (UD) for English.

The training data for the parser was compiled from a broad range of syntactic
treebanks for English, viz. (a) the Wall Street Journal and Brown partitions of
the Penn Treebank (LDC99T42), (b) the GENIA Treebank (in its conversion to PTB-
style serialization by David McClosky), and (c) the English Web Treebank (EWT)
from the UD 2.3 release.  The WSJ, Brown, and GENIA phrase structure trees were
converted to UD 2.x dependencies using a pre-release snapshot of the converter
by Sebastian Schuster (which is an extension of the conversion in the Stanford
CoreNLP system).  For compatibility with the majority of the training data, the
tokenizer was configured for PTB-style tokenization, i.e. (unlike in UD) there
will typically not be token boundaries at hyphens or slashes.

Because part of the MRP training data overlaps with the resources used to train
the morpho-syntactic parser, five-fold jack-knifing was applied to the WSJ and
EWT sub-sets, yielding a total of eleven parsing models.  In each case, all of
the Brown and GENIA data was included in training, as well as the totality of
the available training data for either WSJ or EWT not subjected to jack-knifing
in a particular model.
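
The general principle of jack-knifing can be sketched as follows: each fold
is held out from one model, so that every sentence is parsed by a model that
did not see it in training.  This is a minimal illustration only; the function
names are hypothetical, the round-robin fold assignment is an assumption, and
‘always_included’ merely stands for whatever data goes into every model (such
as Brown and GENIA).  It is not the procedure actually used for this release.

```python
def jackknife_folds(items, k=5):
    """Split items into k folds (round-robin assignment is an assumption)."""
    return [items[i::k] for i in range(k)]


def jackknife_plan(jackknifed, always_included, k=5):
    """Yield one (training data, held-out fold) pair per model."""
    folds = jackknife_folds(jackknifed, k)
    for held_out in range(k):
        # Data that is never held out goes into every model.
        training = list(always_included)
        for i, fold in enumerate(folds):
            if i != held_out:
                training.extend(fold)
        # The held-out fold is what this model would be used to parse.
        yield training, folds[held_out]
```

Each of the k pairs describes one model: train on the first element, then
parse the second.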


Contents
========

The main contents of this release are in the following files:

  $ wc -l udpipe.mrp
  98290 udpipe.mrp
  $ wc -l jamr.mrp isi.mrp
  56240 jamr.mrp
  56240 isi.mrp

For each of the MRP training graphs, ‘udpipe.mrp’ contains one dependency tree,
where correspondence to the MRP training data is by graph identifiers (the ‘id’
property).  There are a few identifiers that occur more than once in the MRP
training data, for example the first four items in
‘training/amr/amr-guidelines.mrp’, as well as the 89 graphs over WSJ sentences
that are annotated in all frameworks.  In the companion data, each identifier
(and corresponding ‘input’ string) occurs only once.  In other words, some of
the sentences in the companion data correspond to multiple semantic graphs in
the MRP training data.
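
As a minimal sketch of the lookup by identifier described above, assuming that
each line of ‘udpipe.mrp’ is one JSON object with a top-level ‘id’ property,
the companion trees can be indexed once and then shared by all semantic graphs
that carry the same identifier.  The helper names are hypothetical and not part
of any MRP tooling.

```python
import json


def index_companion(lines):
    """Map graph identifier -> companion tree (one JSON object per line)."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tree = json.loads(line)
        # Identifiers are expected to be unique in the companion data.
        if tree["id"] in index:
            raise ValueError("duplicate companion id: " + tree["id"])
        index[tree["id"]] = tree
    return index


def pair_with_companion(graphs, index):
    """Yield (graph, companion tree or None) for each training graph."""
    for graph in graphs:
        yield graph, index.get(graph["id"])
```

To use it, pass an open file object for ‘udpipe.mrp’ to ‘index_companion’ and
the parsed training graphs to ‘pair_with_companion’; graphs that share an
identifier receive the same companion tree.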

Mostly for archival purposes, the package also provides the companion parses
in the native, tab-separated CoNLL-U output format of the parser:

  $ for i in amr/*.conllu dm/*.conllu ucca/*.conllu; do \
      echo -n "$i "; grep "^#" "$i" | wc -l; \
    done
  amr/amr-guidelines.conllu 969
  amr/bolt.conllu 1061
  amr/cctv.conllu 213
  amr/dfa.conllu 7378
  amr/dfb.conllu 32914
  amr/fables.conllu 48
  amr/lorelei.conllu 4440
  amr/mt09sdl.conllu 203
  amr/proxy.conllu 6603
  amr/rte.conllu 527
  amr/wb.conllu 865
  amr/wiki.conllu 191
  amr/xinhua.conllu 741
  dm/wsj00.conllu 7132
  dm/wsj01.conllu 7131
  dm/wsj02.conllu 7131
  dm/wsj03.conllu 7131
  dm/wsj04.conllu 7131
  ucca/ewt00.conllu 763
  ucca/ewt01.conllu 763
  ucca/ewt02.conllu 762
  ucca/ewt03.conllu 762
  ucca/ewt04.conllu 762
  ucca/wiki.conllu 2673

Note that the input strings for the EDS and PSD graphs correspond one-to-one
with the DM sentences.
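
For readers who prefer to work with the CoNLL-U files directly, the following
is a minimal reader sketch.  It relies only on the standard ten-column CoNLL-U
layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and on
comment lines starting with ‘#’; the function name and the dictionary keys are
hypothetical, and no assumption is made about what the comment lines contain.

```python
def read_conllu(lines):
    """Yield (comments, tokens) for each sentence in CoNLL-U input."""
    comments, tokens = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line terminates the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            fields = line.split("\t")
            tokens.append({
                "id": fields[0], "form": fields[1], "lemma": fields[2],
                "upos": fields[3], "xpos": fields[4],
                "head": fields[6], "deprel": fields[7],
            })
    # Emit the last sentence if the input lacks a final blank line.
    if tokens:
        yield comments, tokens
```

Applied to an open file object for one of the ‘.conllu’ files, the generator
yields one pair per sentence.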

For the AMR graphs, the files ‘jamr.mrp’ and ‘isi.mrp’ provide reference (if by
no means gold-standard) alignments, obtained from the JAMR system of Flanigan
et al. (2016; SemEval) and the ISI aligner by Pourdamghani et al. (2014; EMNLP)
and converted to the MRP file format.  Here, nodes and property values can bear
alignments, encoded in the ‘label’ and (order-coded) ‘values’ JSON properties,
respectively.  Alignments take the form of lists of token indices, e.g. [0, 1],
which are zero-based pointers into the token sequences in ‘udpipe.mrp’.
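
A minimal sketch of how such alignments might be resolved against the
companion tokens is shown below.  It assumes (this is not confirmed here) that
the nodes of a tree in ‘udpipe.mrp’ are listed in surface token order and carry
the token form in their ‘label’ property; the function name and the toy data
are invented for illustration.

```python
def aligned_tokens(alignment_node, companion_tree):
    """Resolve the alignments on one node from jamr.mrp or isi.mrp.

    Assumption: companion_tree["nodes"] is in surface token order, and
    each token node stores its surface form under "label".
    """
    forms = [token.get("label") for token in companion_tree["nodes"]]
    # Alignment of the node itself: a list of zero-based token indices.
    label = [forms[i] for i in alignment_node.get("label", [])]
    # One index list per property value, in the order of the properties.
    values = [[forms[i] for i in indices]
              for indices in alignment_node.get("values", [])]
    return label, values


# Toy data, invented for illustration only.
tree = {"nodes": [{"id": 0, "label": "New"},
                  {"id": 1, "label": "York"},
                  {"id": 2, "label": "sleeps"}]}
node = {"id": 0, "label": [0, 1], "values": [[0], [1]]}
print(aligned_tokens(node, tree))
# prints (['New', 'York'], [['New'], ['York']])
```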


Acknowledgments
===============

Sebastian Schuster kindly made available a pre-release of his converter from
PTB-style constituent trees to (basic) UD 2.x dependency graphs.  Milan Straka
provided invaluable assistance in training and running the latest development
version of his UDPipe system, to generate the morpho-syntactic companion trees
for the MRP sentences.  Jayeol Chun most helpfully provided the AMR alignments,
including forcing the aligners to respect the tokenization from the MRP morpho-
syntactic companion parses. 


Known Limitations
=================

The Wikipedia texts in the AMR graph bank contain some residual HTML mark-up,
which the tokenizer probably should be made to strip.


Release History
===============

[Version 1.0; June 24, 2019]

+ Re-release, including AMR alignments from JAMR and the ISI aligner.

[Version 0.9; May 20, 2019]

+ First release of the MRP 2019 morpho-syntactic companion trees.


Contact
=======

For questions or comments, please do not hesitate to email the task organizers
at: ‘mrp-organizers@nlpl.eu’.

Omri Abend
Jan Hajič
Daniel Hershcovich
Marco Kuhlmann
Stephan Oepen
Tim O'Gorman
Nianwen Xue