CoNLL 2019 Shared Task: Meaning Representation Parsing --- Evaluation Companion

Version 1.1; July 15, 2019


Overview
========

This directory contains a re-release of the morpho-syntactic companion trees
for the MRP 2019 shared task, i.e. tokenized and syntactically parsed parser
inputs.  A re-release of this part of the evaluation package was required to
augment the companion trees with anchoring information, i.e. character start
and end indices into the original input string.
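As a rough illustration of what the anchoring provides, the following sketch
recovers the surface string of each node from its character offsets.  It
assumes the field names of the MRP interchange format ('input', 'nodes',
'anchors', 'from', 'to'); the example graph is made up for illustration and is
not taken from the release, and 'anchored_strings' is a hypothetical helper.

```python
def anchored_strings(graph):
    """Return (node id, surface string) pairs recovered from character anchors."""
    text = graph["input"]
    result = []
    for node in graph.get("nodes", []):
        # A node may carry several anchors; join their substrings with a space.
        spans = [text[a["from"]:a["to"]] for a in node.get("anchors", [])]
        result.append((node["id"], " ".join(spans)))
    return result


# Made-up example graph; offsets index into the 'input' string.
example = {
    "id": "example",
    "input": "Where 's Homer?",
    "nodes": [
        {"id": 0, "label": "Where", "anchors": [{"from": 0, "to": 5}]},
        {"id": 1, "label": "'s", "anchors": [{"from": 6, "to": 8}]},
        {"id": 2, "label": "Homer", "anchors": [{"from": 9, "to": 14}]},
        {"id": 3, "label": "?", "anchors": [{"from": 14, "to": 15}]},
    ],
}

for node_id, surface in anchored_strings(example):
    print(node_id, repr(surface))
```

Because the offsets point into the original input string, token forms that
were normalized during tokenization can still be mapped back to the exact
characters they cover.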

For general information on the task and the meaning representation frameworks
involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is documented at:

  http://mrp.nlpl.eu/index.php?page=4#format


Contents
========

The main content of this release is the file providing the companion trees:

  $ wc -l udpipe.mrp
  6288 udpipe.mrp

Here, the number of lines corresponds to the number of parser inputs, i.e. the
evaluation data for MRP 2019 comprises 6288 strings to be parsed.  The
‘companion’ morpho-syntactic trees were created using the same software version
and parsing model as for the training data, but the anchoring in this re-release
is taken directly from the output of the REPP tokenizer.
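Since each line holds one parser input, the file can be read line by line.
The sketch below assumes that every line is a self-contained JSON object, as
in the MRP interchange format; 'read_mrp' is a hypothetical helper name and
the two sample lines are invented for illustration.

```python
import json


def read_mrp(lines):
    """Yield one graph (a dict) per non-empty line of an MRP file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Invented sample lines standing in for the contents of the file.
sample = [
    '{"id": "a", "input": "Hello."}\n',
    '{"id": "b", "input": "Bye."}\n',
]
graphs = list(read_mrp(sample))
print(len(graphs))
```

To read the actual release, pass an open file object instead of the sample
list, e.g. open("udpipe.mrp", encoding="utf-8"); the number of graphs should
then match the line count reported above.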

For additional technical information on the preparation of the companion trees,
please see the original companion package for the training data: 

  http://svn.nlpl.eu/mrp/2019/public/companion.tgz


Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’ strings,
i.e. follow common conventions regarding punctuation marks and whitespace.  In
the case of some of the AMR ‘input’ values, the strings appear semi-tokenized,
in the sense of separating punctuation marks like commas, periods, quote marks,
and contracted auxiliaries and possessives from adjacent tokens with spurious
whitespace.  Furthermore, some of these strings use (non-standard) conventions
for directional quote marks, viz. the LaTeX-style two-character sequences that
have been popularized in NLP corpora by the Penn Treebank.  For example:

  wb.eng_0003.13  Where 's Homer Simpson when you need him ?
  wb.eng_0003.14  This is a major `` D'oh! '' moment .

For participants starting from the companion morpho-syntactic trees, the first
of these artifacts may have led to incorrect quote disambiguation in the
tokenizer: ‘straight’ single and double quote marks preceded by whitespace are
treated as left (or opening) quotes, which will at times result in directionally
unmatched quote marks, as well as in contractions whose first character is a
left quote mark rather than an apostrophe.  In retrospect, it turns out that the
training data for MRP 2019 exhibited the same limitations in six of the AMR
sub-corpora.
The LaTeX-style quote marks, on the other hand, have been normalized properly
during tokenization for the companion morpho-syntactic trees, i.e. to “ and ”
for the above example.
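
The two effects can be reproduced with a small sketch.  This is not the REPP
rule set used for the companion trees; both functions are hypothetical
simplifications that only mimic the behavior described above: LaTeX-style
sequences map to directional double quotes, and a straight quote preceded by
whitespace (or at the start of the string) is taken to be an opening quote.

```python
import re


def normalize_latex_quotes(text):
    """Map LaTeX-style `` and '' to directional double quote marks."""
    return text.replace("``", "\u201c").replace("''", "\u201d")


def naive_straight_quotes(text):
    """Simplified heuristic: a straight quote at the start of the string or
    after whitespace becomes a left quote; any other becomes a right quote."""
    text = re.sub(r"(?<!\S)'", "\u2018", text)
    text = text.replace("'", "\u2019")
    text = re.sub(r'(?<!\S)"', "\u201c", text)
    return text.replace('"', "\u201d")


print(normalize_latex_quotes("This is a major `` D'oh! '' moment ."))
print(naive_straight_quotes("Where 's Homer Simpson when you need him ?"))
```

In the second call, the contraction separated by spurious whitespace receives
a left quote mark instead of an apostrophe, which is the artifact noted above.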


Release History
===============

[Version 1.1; July 15, 2019]

+ Re-release of companion trees only, now including anchoring.

[Version 1.0; July 1, 2019]

+ First release of MRP 2019 training data in all frameworks.


Contact
=======

For questions or comments, please do not hesitate to email the task organizers
at: ‘mrp-organizers@nlpl.eu’.

Omri Abend
Jan Hajič
Daniel Hershcovich
Marco Kuhlmann
Stephan Oepen
Tim O'Gorman
Nianwen Xue