Extrinsic Parser Evaluation

The First Shared Task on Extrinsic Parser Evaluation (EPE 2017) sought to provide better estimates of the relative utility of different types of dependency representations for a variety of downstream applications that depend heavily on the analysis of grammatical structure.  The task was sponsored jointly by the Fourth International Conference on Dependency Linguistics (DepLing 2017) and the 15th International Conference on Parsing Technologies (IWPT 2017), which were co-located in Pisa (Italy) from September 18 to 22, 2017.

On Wednesday, September 20, DepLing and IWPT held one overlapping day of joint programme, seeking to exploit the (broad) points of contact between dependency linguistics and parsing technologies.  The EPE 2017 shared task formed part of this joint programme and was designed to enable a more informed comparison of dependency representations, both empirically and linguistically.  Task results have been published as a separate, peer-reviewed volume of the IWPT proceedings.

Some historical information and examples of variation in syntactico-semantic dependency representations are available on a separate background page.  The task will provide (a) a pre-defined selection of state-of-the-art downstream systems, (b) generic and parameterizable interfaces to bi-lexical syntactico-semantic dependency graphs, and (c) an infrastructure for large-scale end-to-end parameter tuning and evaluation.

There is a comparatively low barrier to entry for candidate participants: the minimum requirement for a team is to parse the running text (training, development, and eventually evaluation data; from various genres and domains) provided by the EPE 2017 organizers; parsing results in any kind of dependency representation that satisfies the formal definitions below can be submitted for evaluation, using a common interchange format defined for the task.  Teams are free to decide whether and to what degree to adapt or tune their parsers to the specific types of text in the task, and whether to submit a single dependency representation or several different types.

Supported downstream systems will comprise at least the following:

biological event extraction;
negation resolution; and
fine-grained opinion analysis.

State-of-the-art results for all three downstream applications have been shown to depend strongly on ‘relational’ structure reflecting syntactico-semantic analysis; these systems are thus expected to provide an informative test-bed for contrastive extrinsic evaluation of different dependency representations.  Each downstream application defines its own inventory of evaluation measures, and an unweighted combination of these will be applied to rank EPE 2017 submissions across applications.  Beyond this empirical comparison, the co-organizers hope that the task will contribute to a better understanding of relevant (i.e. contentful) linguistic differences between different schools of dependency analysis, as well as to a re-usable infrastructure for future extrinsic parser evaluation.
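
As a minimal sketch, assuming that the ‘unweighted combination’ amounts to a plain average of one headline score (e.g. F1) per downstream application, the cross-application ranking score of a submission could be computed as follows; application names and numbers here are purely hypothetical:

    # Illustrative sketch only: assumes a single headline score per
    # downstream application; the official measures are defined separately
    # by each application, and the figures below are made up.
    def combined_score(per_application_scores):
        # Unweighted combination: the plain (macro-)average across applications.
        return sum(per_application_scores.values()) / len(per_application_scores)

    scores = {"events": 0.52, "negation": 0.64, "opinion": 0.58}
    print(round(combined_score(scores), 2))  # 0.58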

Task Definitions

The term (bi-lexical) dependency representation in the context of EPE 2017 will be interpreted as a graph whose nodes correspond to surface lexical units, and whose edges represent labeled directed relations between two nodes.  Each node corresponds to a sub-string of the underlying linguistic signal (or ‘input string’), identified by character stand-off pointers.  Node labels can comprise a non-recursive attribute–value matrix (or ‘feature structure’), for example to encode lemma and part-of-speech information.  Each graph can optionally designate one or more ‘top’ nodes, broadly interpreted as the root-level head or highest-scoping predicate (Kuhlmann & Oepen, 2016).  This generalized notion of dependency graphs is intended to capture both ‘classic’ syntactic dependency trees and structures that relax one or more of the ‘treeness’ assumptions made in much statistical dependency parsing work.
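
As a minimal illustrative sketch (not the official interchange format, which will be specified separately), such a graph over the input string ‘Kim sleeps.’ could be encoded along the following lines, with nodes identified by character stand-off pointers, a flat attribute–value matrix per node, labeled directed edges, and an optional set of ‘top’ nodes; all field names and labels here are assumptions for illustration:

    # Minimal sketch of a generalized dependency graph as defined above;
    # field names and the edge label are illustrative, not the EPE format.
    text = "Kim sleeps."
    graph = {
        "input": text,
        "nodes": [
            {"id": 1, "start": 0, "end": 3,    # "Kim"
             "properties": {"lemma": "Kim", "pos": "NNP"}},
            {"id": 2, "start": 4, "end": 10,   # "sleeps"
             "properties": {"lemma": "sleep", "pos": "VBZ"}},
        ],
        "tops": [2],                           # highest-scoping predicate
        "edges": [{"source": 2, "target": 1, "label": "subject"}],
    }
    assert text[0:3] == "Kim" and text[4:10] == "sleeps"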

Defining nodes in terms of (in principle arbitrary) sub-strings of the surface signal makes this view on dependency representations independent of notions of ‘token’ or ‘word’ (which can receive divergent interpretations in different types of dependency representations).  Furthermore, the above definition does not exclude overlapping or empty (i.e. zero-span) node sub-strings, as might characterize more weakly lexicalized dependency graphs like Elementary Dependency Structures (EDS; Oepen & Lønning 2006) or Abstract Meaning Representation (AMR; Banarescu et al. 2013). However, current downstream systems may only have limited (if any) support for ‘overlapping’ or ‘empty’ dependency nodes and, hence, may not immediately be able to take full advantage of the above more ‘abstract’ types of semantic (dependency) graphs.
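
Under the same illustrative encoding, an ‘empty’ node is simply a zero-length span and ‘overlapping’ nodes are spans that share characters; both are admitted by the definition above, even where downstream support for them is limited:

    # Continuing the sketch above: zero-span and overlapping nodes are
    # representable, though downstream systems may not be able to use them.
    empty_node = {"id": 3, "start": 4, "end": 4,   # zero-span ('empty') node
                  "properties": {"type": "abstract-predicate"}}
    overlapping_nodes = [
        {"id": 4, "start": 0, "end": 10},          # "Kim sleeps"
        {"id": 5, "start": 4, "end": 10},          # "sleeps" (overlaps node 4)
    ]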

EPE 2017 is (regrettably) limited to parsing English text. For each downstream application, separate training, development, and evaluation data will be provided as ‘running’ clean text (i.e. without information about sentence and token boundaries).  There are no limitations on which parsing approaches and resources can be put to use, as long as the output of the parsing system is a dependency representation in the above sense, and the parser is wholly independent of the evaluation data.

For each participating parsing system, a ‘personalized’ instance of each downstream system will have to be developed (mostly automatically, and mostly by the task organizers), and to support this re-training of downstream systems, each team will be asked to provide parsing results for both the training and development data.  A first ‘trial’ run of participating parsing systems and the EPE downstream systems will be conducted in early April 2017; this will give participating teams initial feedback on expected end-to-end results.

For better comparability of purely learning-based parsing approaches, the organizers are considering the option of an additional ‘closed’ track, in which training data with gold-standard annotations in some common dependency representations (e.g. LTH, Stanford, SDP, and UD) is made available, and parsing systems must be derived exclusively from this data, without the use of additional resources.  Should this option prove feasible, participants will of course be free to submit results to either track, or to both.

Tentative Schedule

March 13, 2017
    First Call for Participation
    Availability of Parser Input Texts
March 27, 2017
    Specification of Common Interchange Format
April 17, 2017
    ‘Trial’ Submission of Parser Outputs
April 24, 2017
    ‘Trial’ End-to-End Scores on Development Set
    Downstream Systems Available to Participants
    Second Call for Participation
June 6–June 17, 2017
    Evaluation Period (Held-Out Data)
June 26, 2017
    Official End-to-End Evaluation Results
August 13, 2017
    Submission of System Descriptions
September 4, 2017
    Camera-Ready Manuscripts
September 20, 2017
    Presentation and Discussion of Results