CoNLL 2020 Shared Task: Meaning Representation Parsing --- Training Data
[Cross-Framework Track: English Only]
Version 1.2; July 21, 2020

Overview
========

This directory (together with its ‘sibling’ directory for the
cross-lingual track) contains training data for the MRP 2020 shared
task, semantic graphs in five distinct frameworks: AMR, DRG, EDS, PTG,
and UCCA.  All graphs are encoded in a uniform abstract representation
with a common serialization based on JSON.

For general information on the task and the meaning representation
frameworks involved, please see:

  http://mrp.nlpl.eu

The JSON-based uniform interchange format for all frameworks is
documented at:

  http://mrp.nlpl.eu/2020/index.php?page=14#format

Note that this release introduces a new, backwards-compatible version
of the MRP serialization format, adding the ‘id’ and ‘anchors’ fields
also on edges and introducing an additional, order-coded ‘anchorings’
array on nodes, with anchors for corresponding positions in the
‘properties’ array.

Cross-Framework Contents
========================

The main contents in this release are the following files for the
English-only cross-framework track:

  $ wc -l cf/training/*.mrp
    57885 amr.mrp
     6605 drg.mrp
    37192 eds.mrp
    42024 ptg.mrp
     6872 ucca.mrp
   150665 total

Here, line counts correspond to the number of graphs available in each
of the frameworks.  For at least several of the frameworks, the
original data release may be internally sub-divided; AMR, for example,
draws on a diverse range of text types and domains.  Optional ‘source’
and ‘provenance’ top-level fields on the graphs preserve such
sub-divisions, even though they have no significance in MRP 2020.

The task setup (unlike for MRP 2019) provides an additional set of
English graphs for ‘validation’ (i.e. system development).  These
graphs (by and large) correspond to the held-out evaluation data from
the 2019 shared task, so as to facilitate comparison to earlier
published results.
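For illustration, a minimal Python sketch for iterating over the
graphs in one of these files (assuming, as the line counts above
suggest, one JSON object per line; the helper name is hypothetical):

```python
import json

def read_mrp(path):
    """Yield one MRP graph (a Python dict) per line of a .mrp file."""
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical usage, e.g. counting the graphs in the EDS training data:
#
#   n = sum(1 for graph in read_mrp("cf/training/eds.mrp"))
```

Each graph object then exposes the top-level fields documented at the
format page above (e.g. ‘id’, ‘input’, ‘nodes’, and ‘edges’).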
The validation graphs are organized into a separate directory, but
also documented here, for uniformity:

  $ wc -l cf/validation/*.mrp
     3560 amr.mrp
      885 drg.mrp
     3302 eds.mrp
     1664 ptg.mrp
     1585 ucca.mrp
    10996 total

Participants are free to put the full training data (from this
directory) to use as they best see fit, whereas the validation graphs
must not be used for training proper (i.e. parameter estimation).
They can, however, be used for hyper-parameter tuning during system
development.

In general, the goal of this distribution is to re-package the five
collections of semantic graphs in a uniform representation, to
facilitate cross-framework learning and unified evaluation.  Thus, the
MRP graphs only contain information that parsers are expected to
predict, i.e. structural and labeling components that will be
considered in evaluation.  In several cases, this design decision has
led to the omission of additional information from the original
annotations, for example :wiki links in AMR and implicit units and
inter-sentence relations in UCCA.

The MRP training data includes what is called the shared sample of WSJ
graphs: 89 sentences for which gold-standard annotations are available
in four of the five frameworks.  This sample is also available as a
separate, public package, including visual renderings of these graphs
in DOT and PDF format, which may facilitate human inspection:

  http://svn.nlpl.eu/mrp/2020/public/sample.tgz

Cross-Lingual Contents
======================

Starting with release version 1.0 (as of May 2020), the package also
provides the training data for the cross-lingual track, with
additional graphs in additional languages, as follows:

  AMR:  Chinese
  DRG:  German
  PTG:  Czech
  UCCA: German

Sentence and token counts per language vary substantially, viz.
  $ make count
  amr.zho.mrp:  18365 428055
  drg.deu.mrp:   1575   7479
  ptg.ces.mrp:  43955 637084
  ucca.deu.mrp:  4125  81915

For EDS, creation of additional Spanish graphs has regrettably been
delayed due to the coronavirus pandemic; we may or may not be able to
release an update to the cross-lingual training package in time for
the MRP 2020 shared task.

AMR: Abstract Meaning Representation
====================================

In the AMR graphs, all nodes have the ‘label’ property (holding what
AMR calls concept identifiers), and many nodes additionally use
‘properties’ and ‘values’, for example to encode negative :polarity or
the various components of complex proper names, e.g. :op1, :op2, etc.
The AMR graphs are unordered, and there is no explicit linking to the
surface string, i.e. there are no instances of the ‘anchors’ property
on nodes.

As discussed by Kuhlmann & Oepen (2016; CL), AMR graphs can be viewed
in two variants, viz. either in the tree-like structure that is
created by annotators or in a normalized variant, where inverse edges
(something like ‘ARG0-of’) are un-inverted, i.e. treated as an ‘ARG0’
edge in the opposite direction.  There is an established tradition in
AMR evaluation to score the normalized graphs, i.e. to assume that
there can be multiple equivalent serializations of the same graph.  On
the other hand, at least some AMR parsers have found it beneficial to
predict graphs in the tree-like, un-normalized topology, and therefore
the MRP release represents both views on the AMR graphs in the same
structure: The ‘source’, ‘target’, and ‘label’ properties on edge
objects correspond to the tree-like form, i.e. AMR graphs as
annotated; an optional ‘normal’ property on edges indicates inversion.
On an ‘ARG0-of’ edge, for example, the ‘normal’ property will be
‘ARG0’; conversely, a ‘consist-of’ edge (which superficially might
look like an inverted edge, but of course is not) does not carry the
‘normal’ property.
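The un-inversion described above can be sketched as follows (a
hypothetical helper, assuming edge objects are dicts with ‘source’,
‘target’, ‘label’, and optionally ‘normal’ keys, as in the MRP
serialization):

```python
def normalize_edges(edges):
    """Return the edges of an AMR graph in the normalized view: for
    every edge carrying a 'normal' property (e.g. an 'ARG0-of' edge
    with normal 'ARG0'), swap source and target and use the normal
    label; leave all other edges unchanged."""
    normalized = []
    for edge in edges:
        if "normal" in edge:
            normalized.append({"source": edge["target"],
                               "target": edge["source"],
                               "label": edge["normal"]})
        else:
            normalized.append({"source": edge["source"],
                               "target": edge["target"],
                               "label": edge["label"]})
    return normalized
```

Note that an edge like ‘consist-of’ passes through unchanged, since it
carries no ‘normal’ property in the first place.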
DRG: Discourse Representation Graphs
====================================

Discourse Representation Graphs (DRG) provide a new and
information-preserving graph encoding of Discourse Representation
Structure (DRS), as annotated in the Parallel Meaning Bank (PMB;
Abzianidze et al., 2017).  These graphs differ from the other
frameworks in the MRP collection in that they provide an encoding of
scopal contexts (‘boxes’ in the underlying DRS annotations), which
requires the reification of roles and the introduction of separate box
membership edges.

Albeit not encoded formally, nodes in these graphs can be thought of
as conceptually representing three different types of structural
elements: (0) unlabeled boxes; (1) entities (either discourse
referents or constants), usually labeled with a concept; and (2)
reified roles.  For the latter, the role itself is encoded as one
labeled node, with two unlabeled edges (one incoming, one outgoing)
which connect the role to two entities.  Conversely, labeled edges in
DRG either encode discourse relations between two nodes that represent
DRS boxes, or they associate a node corresponding to a discourse
referent or a role as a member of a scopal context, using the
designated edge label ‘in’.

When using mtool to visualize DRGs via the DOT language, the
‘--pretty’ command line option will show the conceptual distinctions
of nodes using three distinct shapes, as exemplified in the MRP
framework overview:

  http://mrp.nlpl.eu/2020/index.php?page=12#drg

EDS: Elementary Dependency Structures
=====================================

The EDS graphs, in a sense, present a middle ground between purely
bi-lexical semantic graphs (like the DM and PSD frameworks included in
the MRP 2019 shared task) and the unanchored AMR or DRG ones.  In EDS,
all nodes are anchored onto sub-strings of the input, but anchors can
correspond to arbitrary (contiguous) character ranges (e.g.
corresponding to affix or phrasal sub-strings).
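Such anchors can be resolved against the graph's ‘input’ string; as a
minimal sketch (assuming MRP-style anchor objects with ‘from’ and ‘to’
character offsets, and a hypothetical helper name):

```python
def anchored_substrings(graph):
    """Map each node id to the list of input sub-strings picked out by
    its anchors, resolved as character ranges into graph['input']."""
    text = graph["input"]
    spans = {}
    for node in graph.get("nodes", []):
        spans[node["id"]] = [text[anchor["from"]:anchor["to"]]
                             for anchor in node.get("anchors", [])]
    return spans
```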
Also, multiple nodes can have overlapping anchors.  Node labels in EDS
are semantic predicates that are sense-disambiguated inasmuch as is
determined by syntactic structure, for example ‘_increase_v_cause’
vs. ‘_increase_v_1’ for causative vs. inchoative usages of ‘increase’,
or ‘_look_v_1’ vs. ‘_look_v_up’ to mark the distinction between plain
‘look’ and the verb–particle combination ‘look up’.

In the context of MRP 2020, EDS node properties (encoding for example
tense, aspect, or number) have been simplified, seeking to only
explicitly encode marked (or non-default) information, e.g. actual
progressives or perfectives.  For example, ‘MOOD’ only marks
‘subjunctive’ (not indicative), ‘PERS’(on) is only present on nodes
corresponding to personal pronouns, and there are no ‘NUM’(ber) values
on quantities that cannot be countably individuated.  In addition to
such morpho-semantic reflections of tense, aspect, and more, some
nodes exhibit a property called ‘CARG’ (for constant argument), to
encode a string-valued parameter that is used with predicates like
‘named’ or ‘dofw’, for proper names and the days of the week,
respectively.  EDS has no edge attributes, and while there can in
principle be multiple edges between two nodes, edge labels are
functional.

PTG: Prague Tectogrammatical Graphs
===================================

The PTG structures provide a faithful rendering of the core of the
annotations in what are called tectogrammatical trees (or t-trees) in
the tradition of the Prague Dependency Treebank (PDT) and the Prague
Czech–English Dependency Treebank (PCEDT).  While some of the
information annotated in the PCEDT and PDT has been omitted during
conversion to the uniform MRP graph encoding, the graph topology fully
reflects the structure of the underlying t-tree, and separate
annotations of coreference among nodes are now straightforwardly
encoded as additional edges.
Most PTG nodes are anchored (to arbitrary, non-contiguous input
sub-strings, where overlap of anchoring is possible), but there are
two types of unanchored nodes: an unlabeled virtual root of the graph
(the unique ‘top’ node in the MRP context, whose outgoing edge labels
encode different construction types), and what in the Prague tradition
are called ‘generated’ (or empty) nodes, to encode unexpressed
arguments and, where applicable, their coreference relations.

PTG uses a small number of node properties, of which ‘frame’ encodes a
sense identifier in the associated EngValLex valency dictionary
(verbal nodes only), ‘sempos’ provides a more coarse semantic sense
disambiguation (on all nodes), and ‘sentmod’ a reflection of what is
at times called ‘sentence force’.  All edges in PTG are labeled, but
labels are not universally functional: in coordinate and appositive
structures, there will frequently be multiple outgoing edges from a
node with the same label.  Some PTG edges use the binary properties
‘member’ and ‘effective’, indicating edges that are members of a
paratactic construction or dependencies that during conversion to MRP
have been recursively propagated and distributed through paratactic
constructions, respectively.

UCCA: Universal Conceptual Cognitive Annotation
===============================================

In the UCCA graphs, nodes are generally unlabeled and free of
properties, as they essentially work as group-forming structural
elements.  Leaf nodes in the graphs are anchored to non-overlapping
sub-strings of the underlying input, but there can be multiple,
non-consecutive anchors on a node (e.g. for discontinuous multi-word
expressions such as ‘neither ... nor’).  UCCA is the only framework
with edge properties that parsers are expected to predict (and which
will be considered for evaluation, unlike the AMR ‘normal’ property on
edges, which merely provides structural hints).
On re-entrant nodes (with in-degree greater than one), all but one of
the incoming edges will be considered remote participants (from other
UCCA units).  This distinction is encoded through a boolean-valued
‘remote’ property, which (currently at least) is only present on edges
that actually are remote (i.e. have a ‘true’ value for this property).

Known Limitations
=================

In general, the MRP task design assumes that parser inputs are ‘raw’
strings, i.e. follow common conventions regarding punctuation marks
and whitespace.  In the case of some of the AMR ‘input’ values, the
strings appear semi-tokenized, in the sense of separating punctuation
marks like commas, periods, and quote marks, as well as contracted
auxiliaries and possessives, from adjacent tokens with spurious
whitespace.  Furthermore, some of these strings use (non-standard)
conventions for directional quote marks, viz. the LaTeX-style
two-character sequences that have been popularized in NLP corpora by
the Penn Treebank.  For example:

  wb.eng_0009.104  Could n't agree more .
  wb.eng_0002.139  That would be a tough `` choice '' .

Anchor values in EDS graphs sometimes include character positions
corresponding to adjacent punctuation marks (reflecting their
morpho-syntactic analysis as a kind of ‘pseudo-affix’ in the
underlying annotations).  Evaluation of anchoring in the official MRP
scorer is somewhat robust to such variation, i.e. there is a notion of
anchor normalization (ignoring a specific set of punctuation marks in
prefix or suffix positions), such that it will be legitimate for a
parser to directly predict normalized anchors.  For background, please
see:

  http://mrp.nlpl.eu/2020/index.php?page=15#software

Release History
===============

[Version 1.2; July 21, 2020]

+ Remove spurious ‘正 在’ space in surface string of CAMR graph
  #export_amr.2032.

[Version 1.1; June 22, 2020]

+ Quality improvements in Chinese AMR and German DRG graphs.
[Version 1.0; May 23, 2020]

+ Add cross-lingual training data (Chinese, Czech, and German).

[Version 0.9; April 28, 2020]

+ First release of MRP 2020 training data in all frameworks.

Contact
=======

For questions or comments, please do not hesitate to email the task
organizers at: ‘mrp-organizers@nlpl.eu’.

Omri Abend, Lasha Abzianidze, Johan Bos, Jan Hajič, Daniel
Hershcovich, Bin Li, Stephan Oepen (chair), Tim O'Gorman, Nianwen Xue,
and Dan Zeman