CoNLL 2020 Shared Task: Meaning Representation Parsing --- Companion Data

Version 1.2; July 29, 2020


Overview
========

This directory (together with its ‘sibling’ directory for the cross-lingual
track) contains what is called morpho-syntactic ‘companion’ data for the MRP
2020 shared task, viz. tokenization, PoS tagging, lemmatization, and
dependency trees for the parser ‘input’ strings of the MRP training and
validation data.  For general information on the task and the meaning
representation frameworks involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented
at:

  http://mrp.nlpl.eu/2020/index.php?page=14#format


Cross-Framework Track (English Only)
====================================

These parses were obtained using a combination of the rule-based tokenizer
in the CoreNLP package (https://stanfordnlp.github.io/CoreNLP/) and a
forthcoming development version of the UDPipe engine by Straka (2018;
CoNLL).  We believe that the quality of these morpho-syntactic parses
reflects the state of the art in parsing into basic Universal Dependencies
(UD) for English.

Unlike in the predecessor (MRP 2019) shared task, tokenization in the 2020
companion data follows the conventions from the OntoNotes project (rather
than the traditional PTB rules), which have also been adopted in the
Universal Dependencies initiative: most hyphens and slashes introduce token
boundaries.

The training data for the parser was compiled from a broad range of
syntactic treebanks for English, viz. (a) the Wall Street Journal portion
from the recent re-release of the venerable Penn Treebank (LDC2015T13),
(b) the English parts of the OntoNotes annotations (LDC2013T19, excluding
the WSJ segments), and (c) the English Web Treebank (EWT) from the UD 2.6
release.  The WSJ and OntoNotes phrase structure trees were converted to
UD 2.x dependencies using version 4.0 of the converter by Sebastian
Schuster.
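To make the tokenization difference concrete, the following is a minimal, purely illustrative sketch (it is NOT the CoreNLP tokenizer, and the example words are invented) of splitting at hyphens and slashes, as the OntoNotes-style conventions prescribe:

```python
import re

# Illustrative sketch only: under the OntoNotes-style conventions used in
# the 2020 companion data, most hyphens and slashes introduce token
# boundaries, unlike traditional PTB tokenization, which keeps hyphenated
# compounds as single tokens.
def split_hyphens_slashes(token):
    """Split a word at hyphens and slashes, keeping the separators as tokens."""
    return [part for part in re.split(r"([-/])", token) if part]

print(split_hyphens_slashes("state-of-the-art"))
# ['state', '-', 'of', '-', 'the', '-', 'art']
print(split_hyphens_slashes("and/or"))
# ['and', '/', 'or']
```

Note that the real tokenizer applies further exceptions (hence ‘most’ hyphens above); the sketch only shows the basic boundary convention.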
Please see the ‘Makefile’ for complete details on the conversion pipeline
and invocation of the CoreNLP tokenizer.  Because part of the MRP training
data overlaps with the resources used to train the morpho-syntactic parser,
five-fold jack-knifing was applied to the WSJ and EWT sub-sets, yielding a
total of six parsing models.  In each case, all of the OntoNotes data was
included in training, as well as four fifths of the training data for WSJ
and EWT; again, please see the ‘Makefile’ for details.


Cross-Lingual Track (Chinese, Czech, German)
============================================

Chinese Abstract Meaning Representation (CAMR) graphs annotate sentences
that overlap with the Chinese Tree Bank (CTB), which is a common source of
data for training syntactic dependency parsers (like CoreNLP).  Therefore,
jack-knifing has also been applied in the creation of the MRP 2020 Chinese
companion dependency parses: the CAMR subset of CTB 9.0 was split into two
parts, and each part was combined with the non-CAMR subset to train a model
that was used to parse the other part.  Parsing accuracy on the validation
set is: UAS = 84.11, LAS = 81.85.

The Czech MRP training data comes from the Prague Dependency Treebank,
hence it overlaps with the resources used to train the morpho-syntactic
parser and jack-knifing had to be applied.  The part of PDT for which PTG
annotations are available was split into ten parts (train-1 to train-8,
dtest, etest).  For each part, a UDPipe model was trained on those
sentences of UD_Czech-PDT (release 2.6) that do not come from that
particular part, and on the entire UD_Czech-CAC and UD_Czech-FicTree; the
model was then used to tag and parse the part.  The average accuracy of the
prediction is LAS = 93.45, MLAS = 88.85, BLEX = 91.16.

For training the German morpho-syntactic parser, all German UD 2.6
treebanks were used: GSD, HDT, LIT, and PUD.  Since there is no overlap
with the German MRP training data, there was no need for jack-knifing in
this case.
A UDPipe model was trained on 90% of the concatenation of these treebanks,
and 10% was used as a validation set (irrespective of the original
train/dev/test splits).  The accuracy of the prediction on the validation
set is LAS = 95.84, MLAS = 84.70, BLEX = 90.27.  For parsing the MRP
training data, the UDPipe 1.2.0 tokenizer was used, with the UD 2.5 German
GSD model, followed by the trained UDPipe model for tagging and parsing.


Contents
========

The main contents for the cross-framework track are in the following files:

  $ wc -l udpipe.mrp
  122661 udpipe.mrp
  $ wc -l jamr.mrp isi.mrp
  61445 jamr.mrp
  61445 isi.mrp
  $ wc -l boxer.mrp
  7488 boxer.mrp

For each of the MRP training graphs, ‘udpipe.mrp’ contains one dependency
tree, where correspondence to the MRP training data is by graph
‘id’entifiers.  There are a few identifiers that occur more than once in
the MRP training data, for example the first four items in
‘training/amr/amr-guidelines.mrp’, as well as the 89 graphs over WSJ
sentences that are annotated in all frameworks.  In the companion data,
each identifier (and corresponding ‘input’ string) occurs only once.  In
other words, some of the sentences in the companion data correspond to
multiple semantic graphs in the MRP training data.  Note that the input
strings for the EDS and PTG graphs correspond one-to-one with each other
(for overlapping WSJ sentences), but there are some sentences not annotated
in PTG, and others not annotated in EDS.

For the AMR graphs, the files ‘jamr.mrp’ and ‘isi.mrp’ provide reference
(if by no means gold-standard) anchorings, obtained from the JAMR system of
Flanigan et al. (2016; SemEval) and the ISI aligner by Pourdamghani et al.
(2014; EMNLP) and converted to the MRP file format.
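Since each ‘.mrp’ file holds one JSON object per line (hence the ‘wc -l’ counts above), semantic graphs and companion trees can be joined by their ‘id’ values. A minimal sketch, where the two sample lines are invented stand-ins for real entries:

```python
import io
import json

def index_by_id(stream):
    """Index a JSON-lines .mrp stream by its graph 'id' values."""
    index = {}
    for line in stream:
        line = line.strip()
        if line:
            graph = json.loads(line)
            index[graph["id"]] = graph
    return index

# Toy stand-in for two lines of 'udpipe.mrp' (contents invented):
sample = io.StringIO(
    '{"id": "sample-1", "input": "First sentence."}\n'
    '{"id": "sample-2", "input": "Second sentence."}\n'
)
trees = index_by_id(sample)
print(sorted(trees))
# ['sample-1', 'sample-2']
```

Because identifiers are unique in the companion data but not in the MRP training data, joining in this direction (graph id → companion tree) is the safe lookup; several training graphs may map to the same tree.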
In these overlay files, nodes, edges, and property values can bear
anchoring information, using the newly minted version 1.1 of the MRP
serialization format: JSON objects for edges have been augmented with an
(integer-valued) ‘id’ field, which (much like for nodes) serves to encode
the correspondence between elements of the AMR graphs proper and their
anchoring overlay in ‘jamr.mrp’ or ‘isi.mrp’.  To record anchoring
information on edges, the overlay files use the same ‘anchors’ property as
on nodes; for properties, node objects have been augmented with an
‘anchorings’ array, which follows the same order coding as the
corresponding ‘properties’ array.

For the AMR graphs, anchoring information (as computed by the above
aligners) is encoded in terms of token identifiers, using the tokenization
from the MRP companion parses, in a format as follows (for anchoring to
tokens #8 and #9 on node #7):

  { ...
    "nodes": [ ... {"id": 7, "anchors": [{"#": 8}, {"#": 9}]} ... ]
  }

In a similar spirit, the file ‘boxer.mrp’ provides companion anchorings for
the English DRG annotations, also using version 1.1 of the MRP
serialization, as described for the AMR graphs above (however, there are no
node properties in DRG, hence the ‘anchorings’ array is not used).  Unlike
for AMR, however, anchors for DRG use the familiar character-based ‘from’
and ‘to’ format, i.e. they are independent of the UDPipe companion
tokenization.

For the cross-lingual track, the main contents are in the files:

  $ wc -l *.mrp
  43955 ces.mrp
   5283 deu.mrp
  18365 zho.mrp
   1575 boxer.mrp

Again, for each of the MRP training graphs there is one dependency tree,
where correspondence to the gold-standard training data is by graph
‘id’entifiers.  The cross-lingual syntactic companion parses are separated
by language, but in each case the file format is the same as for the
corresponding cross-framework ‘udpipe.mrp’.
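One plausible operation for consumers of the token-based AMR anchorings is to project anchors of the form {"#": k} back onto character spans, by looking up the corresponding tokens in the companion parse. The sketch below assumes, as in the MRP format, that companion tokens carry character-based ‘from’/‘to’ anchors; the token labels and offsets are invented for illustration:

```python
def resolve_token_anchors(overlay_node, companion_tokens):
    """Project token-index anchors ({"#": k}) onto one character span."""
    spans = [
        (tok["anchors"][0]["from"], tok["anchors"][0]["to"])
        for tok in (companion_tokens[a["#"]] for a in overlay_node["anchors"])
    ]
    # Span from the start of the first token to the end of the last one.
    return min(f for f, _ in spans), max(t for _, t in spans)

# Toy companion tokens #8 and #9 (labels and offsets invented):
companion = {
    8: {"id": 8, "label": "need", "anchors": [{"from": 30, "to": 34}]},
    9: {"id": 9, "label": "him", "anchors": [{"from": 35, "to": 38}]},
}
node = {"id": 7, "anchors": [{"#": 8}, {"#": 9}]}
print(resolve_token_anchors(node, companion))
# (30, 38)
```

For the DRG overlays in ‘boxer.mrp’, no such projection is needed, since their anchors are already character-based.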
Because the Chinese AMR graphs include anchoring as part of the
gold-standard annotations, the only reference anchorings for the
cross-lingual track are for the German DRG structures, in the file
‘boxer.mrp’.


Acknowledgments
===============

Sebastian Schuster advised on how best to convert from PTB-style
constituent trees to (basic) UD 2.x dependency graphs.  Milan Straka
provided invaluable assistance in training and running the latest
development version of his UDPipe system, to generate the morpho-syntactic
companion trees for the MRP sentences.  Jayeol Chun most helpfully provided
the AMR alignments, including forcing the aligners to respect the
tokenization from the MRP morpho-syntactic companion parses.  He also
coordinated the creation of the Chinese dependency parses and has been
instrumental in ensuring that the anchorings on the parser outputs exactly
match the ‘input’ strings underlying the Chinese MRP graphs.  Shamy Ji
reported several deficiencies in the initial release, viz. some 1882
missing parses (for EDS graphs for WSJ and Brown sentences not annotated in
the PTG data) and a smaller number of WSJ sentences that had erroneously
been parsed in their pre-tokenized form (as used in the AMR graphs), i.e.
including spurious white space (see below).  Hiroaki Ozaki discovered that
two of the Chinese AMR graphs were missing their syntactic companion parse
in the initial release.


Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’
strings, i.e. follow common conventions regarding punctuation marks and
whitespace.  In the case of some of the AMR ‘input’ values, the strings
appear semi-tokenized, in the sense of separating punctuation marks like
commas, periods, and quote marks, as well as contracted auxiliaries and
possessives, from adjacent tokens with spurious whitespace.  Furthermore,
some of these strings use (non-standard) conventions for directional quote
marks, viz.
the LaTeX-style two-character sequences that have been popularized in NLP
corpora by the Penn Treebank.  For example:

  wb.eng_0003.13  Where 's Homer Simpson when you need him ?
  wb.eng_0003.14  This is a major `` D'oh! '' moment .

For participants starting from the companion morpho-syntactic trees, the
first of these artifacts may have led to wrong quote disambiguation in the
tokenizer: ‘straight’ single and double quote marks preceded by whitespace
are treated as left (or opening) quotes, which can at times result in
directionally unmatched quote marks, as well as in contractions whose first
character is a left quote mark rather than an apostrophe.  LaTeX-style
quote marks, on the other hand, should have been normalized properly during
tokenization for the companion morpho-syntactic trees, i.e. to “ and ” for
the above example.


Release History
===============

[Version 1.2; July 29, 2020]
+ Add 417 missing German companion parses; correct ‘input’s of five Czech
  ones.

[Version 1.1; July 21, 2020]
+ Add two missing Chinese companion parses (from non-terminated CoNLL-U
  file).

[Version 1.0; June 22, 2020]
+ Re-release for missing and corrected strings, plus companion anchorings.

[Version 0.9; June 1, 2020]
+ First release of the MRP 2020 morpho-syntactic companion trees.


Contact
=======

For questions or comments, please do not hesitate to email the task
organizers at: ‘mrp-organizers@nlpl.eu’.

Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajič, Daniel Hershcovich,
Bin Li, Stephan Oepen (chair), Tim O'Gorman, Nianwen Xue, and Dan Zeman