CoNLL 2020 Shared Task: Meaning Representation Parsing --- Training Data
[Cross-Framework Track: English Only]
Version 1.2; July 21, 2020

Overview
========

This directory (together with its ‘sibling’ directory for the
cross-lingual track) contains training data for the MRP 2020 shared
task, semantic graphs in five distinct frameworks: AMR, DRG, EDS, PTG,
and UCCA.  All graphs are encoded in a uniform abstract representation
with a common serialization based on JSON.

For general information on the task and the meaning representation
frameworks involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is
documented at:

  http://mrp.nlpl.eu/2020/index.php?page=14#format

Note that this release introduces a new, backwards-compatible version
of the MRP serialization format, adding the ‘id’ and ‘anchors’ fields
also on edges and introducing an additional, order-coded ‘anchorings’
array on nodes, with anchors for corresponding positions in the
‘properties’ array.

Cross-Framework Contents
========================

The main contents in this release are the following files for the
English-only cross-framework track:

  $ wc -l cf/training/*.mrp
    57885 amr.mrp
     6605 drg.mrp
    37192 eds.mrp
    42024 ptg.mrp
     6872 ucca.mrp
   150665 total

Here, line counts correspond to the number of graphs available in each
of the frameworks.  For at least several of the frameworks, the
original data release may be internally sub-divided; AMR, for example,
draws on a diverse range of text types and domains.  Optional ‘source’
and ‘provenance’ top-level fields on the graphs preserve such
sub-divisions, even though they have no significance in MRP 2020.

The task setup (unlike for MRP 2019) provides an additional set of
English graphs for ‘validation’ (i.e. system development).  These
graphs (by and large) correspond to the held-out evaluation data from
the 2019 shared task, so as to facilitate comparison to earlier
published results.
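For illustration, a minimal Python sketch for iterating over the
graphs in one of these files (assuming, as the line counts above
suggest, one JSON object per line; the helper name is hypothetical):

```python
import json

def read_mrp(path):
    """Yield one MRP graph (a Python dict) per line of a .mrp file."""
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical usage, e.g. counting the graphs in the EDS training data:
#
#   n = sum(1 for graph in read_mrp("cf/training/eds.mrp"))
```

Each graph object then exposes the top-level fields documented at the
format page above (e.g. ‘id’, ‘input’, ‘nodes’, and ‘edges’).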
The validation graphs are organized into a separate directory, but
also documented here, for uniformity:

  $ wc -l cf/validation/*.mrp
     3560 amr.mrp
      885 drg.mrp
     3302 eds.mrp
     1664 ptg.mrp
     1585 ucca.mrp
    10996 total

Participants are free to put the full training data (from this
directory) to use as they best see fit, whereas the validation graphs
must not be used for training proper (i.e. parameter estimation).
They can, however, be used for hyper-parameter tuning during system
development.

In general, the goal of this distribution is to re-package the five
collections of semantic graphs in a uniform representation, to
facilitate cross-framework learning and unified evaluation.  Thus, the
MRP graphs only contain information that parsers are expected to
predict, i.e. structural and labeling components that will be
considered in evaluation.  In several cases, this design decision has
led to the omission of additional information from the original
annotations, for example :wiki links in AMR and implicit units and
inter-sentence relations in UCCA.

The MRP training data includes what is called the shared sample of WSJ
graphs: 89 sentences for which gold-standard annotations are available
in four of the five frameworks.  This sample is also available as a
separate, public package, including visual renderings of these graphs
in DOT and PDF format, which may facilitate human inspection:

  http://svn.nlpl.eu/mrp/2020/public/sample.tgz

Cross-Lingual Contents
======================

Starting with release version 1.0 (as of May 2020), the package also
provides the training data for the cross-lingual track, with
additional graphs in additional languages, as follows:

  AMR:  Chinese
  DRG:  German
  PTG:  Czech
  UCCA: German

Sentence and token counts per language vary substantially, viz.
  $ make count
  amr.zho.mrp:  18365 428055
  drg.deu.mrp:   1575   7479
  ptg.ces.mrp:  43955 637084
  ucca.deu.mrp:  4125  81915

For EDS, creation of additional Spanish graphs has regrettably been
delayed due to the coronavirus pandemic; we may or may not be able to
release an update to the cross-lingual training package in time for
the MRP 2020 shared task.

AMR: Abstract Meaning Representation
====================================

In the AMR graphs, all nodes have the ‘label’ property (holding what
AMR calls concept identifiers), and many nodes additionally use
‘properties’ and ‘values’, for example to encode negative :polarity or
the various components of complex proper names, e.g. :op1, :op2, etc.
The AMR graphs are unordered, and there is no explicit linking to the
surface string, i.e. there are no instances of the ‘anchors’ property
on nodes.

As discussed by Kuhlmann & Oepen (2016; CL), AMR graphs can be viewed
in two variants, viz. either in the tree-like structure that is
created by annotators or in a normalized variant, where inverse edges
(something like ‘ARG0-of’) are un-inverted, i.e. treated as an ‘ARG0’
edge in the opposite direction.  There is an established tradition in
AMR evaluation to score the normalized graphs, i.e. to assume that
there can be multiple equivalent serializations of the same graph.  On
the other hand, at least some AMR parsers have found it beneficial to
predict graphs in the tree-like, un-normalized topology, and therefore
the MRP release represents both views on the AMR graphs in the same
structure: The ‘source’, ‘target’, and ‘label’ properties on edge
objects correspond to the tree-like form, i.e. AMR graphs as
annotated; an optional ‘normal’ property on edges indicates inversion.
On an ‘ARG0-of’ edge, for example, the ‘normal’ property will be
‘ARG0’; conversely, a ‘consist-of’ edge (which superficially might
look like an inverted edge, but of course is not) does not carry the
‘normal’ property.
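The un-inversion described above can be sketched as follows (a
hypothetical helper, assuming edge objects are dicts with ‘source’,
‘target’, ‘label’, and optionally ‘normal’ keys, as in the MRP
serialization):

```python
def normalize_edges(edges):
    """Return the edges of an AMR graph in the normalized view: for
    every edge carrying a 'normal' property (e.g. an 'ARG0-of' edge
    with normal 'ARG0'), swap source and target and use the normal
    label; leave all other edges unchanged."""
    normalized = []
    for edge in edges:
        if "normal" in edge:
            normalized.append({"source": edge["target"],
                               "target": edge["source"],
                               "label": edge["normal"]})
        else:
            normalized.append({"source": edge["source"],
                               "target": edge["target"],
                               "label": edge["label"]})
    return normalized
```

Note that an edge like ‘consist-of’ passes through unchanged, since it
carries no ‘normal’ property in the first place.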
DRG: Discourse Representation Graphs
====================================

Discourse Representation Graphs (DRG) provide a new and
information-preserving graph encoding of Discourse Representation
Structure (DRS), as annotated in the Parallel Meaning Bank (PMB;
Abzianidze et al., 2017).  These graphs differ from the other
frameworks in the MRP collection in that they provide an encoding of
scopal contexts (‘boxes’ in the underlying DRS annotations), which
requires the reification of roles and the introduction of separate box
membership edges.

Albeit not encoded formally, nodes in these graphs can be thought of
as conceptually representing three different types of structural
elements: (0) unlabeled boxes; (1) entities (either discourse
referents or constants), usually labeled with a concept; and (2)
reified roles.  For the latter, the role itself is encoded as one
labeled node, with two unlabeled edges (one incoming, one outgoing)
which connect the role to two entities.  Conversely, labeled edges in
DRG either encode discourse relations between two nodes that represent
DRS boxes, or they associate a node corresponding to a discourse
referent or a role as a member of a scopal context, using the
designated edge label ‘in’.

When using mtool to visualize DRGs via the DOT language, the
‘--pretty’ command line option will show the conceptual distinctions
of nodes using three distinct shapes, as exemplified in the MRP
framework overview:

  http://mrp.nlpl.eu/2020/index.php?page=12#drg

EDS: Elementary Dependency Structures
=====================================

The EDS graphs, in a sense, present a middle ground between purely
bi-lexical semantic graphs (like the DM and PSD frameworks included in
the MRP 2019 shared task) and the unanchored AMR or DRG ones.  In EDS,
all nodes are anchored onto sub-strings of the input, but anchors can
correspond to arbitrary (contiguous) character ranges (e.g.
corresponding to affix or phrasal sub-strings).
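Such anchors can be resolved against the graph's ‘input’ string; as a
minimal sketch (assuming MRP-style anchor objects with ‘from’ and ‘to’
character offsets, and a hypothetical helper name):

```python
def anchored_substrings(graph):
    """Map each node id to the list of input sub-strings picked out by
    its anchors, resolved as character ranges into graph['input']."""
    text = graph["input"]
    spans = {}
    for node in graph.get("nodes", []):
        spans[node["id"]] = [text[anchor["from"]:anchor["to"]]
                             for anchor in node.get("anchors", [])]
    return spans
```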
Also, multiple nodes can have overlapping anchors.  Node labels in EDS
are semantic predicates that are sense-disambiguated inasmuch as is
determined by syntactic structure, for example ‘_increase_v_cause’
vs. ‘_increase_v_1’ for causative vs. inchoative usages of ‘increase’,
or ‘_look_v_1’ vs. ‘_look_v_up’ to mark the distinction between plain
‘look’ and the verb–particle combination ‘look up’.

In the context of MRP 2020, EDS node properties (encoding for example
tense, aspect, or number) have been simplified, seeking to only
explicitly encode marked (or non-default) information, e.g. actual
progressives or perfectives.  For example, ‘MOOD’ only marks
‘subjunctive’ (not indicative), ‘PERS’(on) is only present on nodes
corresponding to personal pronouns, and there are no ‘NUM’(ber) values
on quantities that cannot be countably individuated.  In addition to
such morpho-semantic reflections of tense, aspect, and more, some
nodes exhibit a property called ‘CARG’ (for constant argument), to
encode a string-valued parameter that is used with predicates like
‘named’ or ‘dofw’, for proper names and the days of the week,
respectively.  EDS has no edge attributes, and while there can in
principle be multiple edges between two nodes, edge labels are
functional.

PTG: Prague Tectogrammatical Graphs
===================================

The PTG structures provide a faithful rendering of the core of the
annotations in what are called tectogrammatical trees (or t-trees) in
the tradition of the Prague Dependency Treebank (PDT) and the Prague
Czech–English Dependency Treebank (PCEDT).  While some of the
information annotated in the PCEDT and PDT has been omitted during
conversion to the uniform MRP graph encoding, the graph topology fully
reflects the structure of the underlying t-tree, and separate
annotations of coreference among nodes are now straightforwardly
encoded as additional edges.
Most PTG nodes are anchored (to arbitrary, non-contiguous input
sub-strings, where overlap of anchoring is possible), but there are
two types of unanchored nodes: an unlabeled virtual root of the graph
(the unique ‘top’ node in the MRP context, whose outgoing edge labels
encode different construction types), and what in the Prague tradition
are called ‘generated’ (or empty) nodes, to encode unexpressed
arguments and, where applicable, their coreference relations.

PTG uses a small number of node properties, of which ‘frame’ encodes a
sense identifier in the associated EngValLex valency dictionary
(verbal nodes only), ‘sempos’ provides a more coarse semantic sense
disambiguation (on all nodes), and ‘sentmod’ a reflection of what is
at times called ‘sentence force’.  All edges in PTG are labeled, but
labels are not universally functional: in coordinate and appositive
structures, there will frequently be multiple outgoing edges from a
node with the same label.  Some PTG edges use the binary properties
‘member’ and ‘effective’, indicating edges that are members of a
paratactic construction or dependencies that during conversion to MRP
have been recursively propagated and distributed through paratactic
constructions, respectively.

UCCA: Universal Conceptual Cognitive Annotation
===============================================

In the UCCA graphs, nodes are generally unlabeled and free of
properties, as they essentially work as group-forming structural
elements.  Leaf nodes in the graphs are anchored to non-overlapping
sub-strings of the underlying input, but there can be multiple,
non-consecutive anchors on a node (e.g. for discontinuous multi-word
expressions such as ‘neither ... nor’).  UCCA is the only framework
with edge properties that parsers are expected to predict (and which
will be considered for evaluation, unlike the AMR ‘normal’ property on
edges, which merely provides structural hints).
On re-entrant nodes (with in-degree greater than one), all but one of
the incoming edges will be considered remote participants (from other
UCCA units).  This distinction is encoded through a boolean-valued
‘remote’ property, which (currently at least) is only present on edges
that actually are remote (i.e. have a ‘true’ value for this property).

Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’
strings, i.e. follow common conventions regarding punctuation marks
and whitespace.  In the case of some of the AMR ‘input’ values, the
strings appear semi-tokenized, in the sense of separating punctuation
marks like commas, periods, and quote marks, as well as contracted
auxiliaries and possessives, from adjacent tokens with spurious
whitespace.  Furthermore, some of these strings use (non-standard)
conventions for directional quote marks, viz. the LaTeX-style
two-character sequences that have been popularized in NLP corpora by
the Penn Treebank.  For example:

  wb.eng_0009.104  Could n't agree more .
  wb.eng_0002.139  That would be a tough `` choice '' .

Anchor values in EDS graphs sometimes include character positions
corresponding to adjacent punctuation marks (reflecting their
morpho-syntactic analysis as a kind of ‘pseudo-affix’ in the
underlying annotations).  Evaluation of anchoring in the official MRP
scorer is somewhat robust to such variation, i.e. there is a notion of
anchor normalization (ignoring a specific set of punctuation marks in
prefix or suffix positions), such that it will be legitimate for a
parser to directly predict normalized anchors.  For background, please
see:

  http://mrp.nlpl.eu/2020/index.php?page=15#software

Release History
===============

[Version 1.2; July 21, 2020]

+ Remove spurious ‘正 在’ space in surface string of CAMR graph
  #export_amr.2032.

[Version 1.1; June 22, 2020]

+ Quality improvements in Chinese AMR and German DRG graphs.
[Version 1.0; May 23, 2020]

+ Add cross-lingual training data (Chinese, Czech, and German).

[Version 0.9; April 28, 2020]

+ First release of MRP 2020 training data in all frameworks.

Contact
=======

For questions or comments, please do not hesitate to email the task
organizers at: ‘mrp-organizers@nlpl.eu’.

Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajič, Daniel
Hershcovich, Bin Li, Stephan Oepen (chair), Tim O'Gorman, Nianwen Xue,
and Dan Zeman