Release notes for MMT grammars, August 15, 2007

1. Background

The MMT system (Matrix Machine Translation or Massively
Multilingual Translation) is an experiment in adapting the
LOGON MT architecture and Grammar Matrix-derived grammars to
create an NxN machine translation system (where the current
value of N is 10).  The goals for the system are to have all
languages equally available as source or target languages,
and all language pairs equally functional.  In addition, it
should be possible to add the N+1st language (as source and
target) without directly considering every new language pair
that this adds.

While the grammars have relatively interesting coverage over
a range of phenomena (coordination, negation, polar
questions, marking of definiteness and demonstratives,
clause-embedding verbs), the system targets a tiny toy
domain (sentences about dogs and cats chasing cars and
sleeping), and side-steps many important issues in transfer
by positing a pseudo-interlingua.  Even with this
oversimplification, we still need to posit transfer rules to
handle residual mismatches between the MRSs for the
different languages.  The two main sources of mismatch are
the treatment of pro-drop (as these grammars do not posit
_pronoun_n_rels for dropped pronouns) and complex predicates
(with "hurt" in Italian and Farsi being expressed as "make
harm", and "chase" in Farsi as "make pursuit").

Rather than writing N^2 transfer grammars, we create one
transfer grammar per target language, which instantiates
transfer rules that accommodate the expectations of the
target language's (monolingual) grammar.  For further
generalization, the transfer rule types are taken from the
Transfer MatriX (mtr.tdl, mrs.tdl).  Specific transfer rules
(e.g., pronoun insertion) are defined as types as well in a
single shared file (acm.tdl).  Particular transfer grammars
then instantiate only the transfer rules in acm.tdl that are
required.

At this point, it remains an open question whether this
strategy can be scaled, or whether N^2 language pairs
require N^2 transfer grammars.
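
The difference in scale between the two strategies can be
made concrete with a little arithmetic.  The following is
only an illustrative sketch; the function name and the
include_identity flag are hypothetical and not part of the
MMT code.  Whether same-language pairs (e.g., eng to eng)
count as language pairs is not settled here, so both
readings are shown.

```python
def transfer_grammar_counts(n, include_identity=True):
    """Return (pairwise, per_target) transfer grammar counts for n languages.

    pairwise:   one transfer grammar per ordered (source, target) pair.
    per_target: one accommodation transfer grammar per target language.
    """
    # N^2 ordered pairs if a language may be paired with itself,
    # N * (N - 1) if source and target must differ.
    pairwise = n * n if include_identity else n * (n - 1)
    per_target = n
    return pairwise, per_target


if __name__ == "__main__":
    print(transfer_grammar_counts(10))
    print(transfer_grammar_counts(10, include_identity=False))
    # Adding the N+1st language adds one per-target grammar,
    # but 2N + 1 ordered pairs (identity pair included).
    print(transfer_grammar_counts(11))
```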

2. MMT setup

The file setup.lisp defines the variables *mmt-languages*
and *mmt-transfer-grammars*.  The former lists the languages
handled, and the latter associates each target language with
its accommodation transfer grammar.  Each language is
identified by its three-letter ISO code.
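
As a rough picture of what these two variables hold, the
sketch below mirrors them in Python.  It is not a
transcription of setup.lisp: the Python names are
hypothetical, the language codes are taken from the table in
section 3, and the assumption that every transfer grammar is
named after the pattern of eng-acm is ours.

```python
# Counterpart of *mmt-languages*: the languages handled.
MMT_LANGUAGES = ["hye", "eng", "epo", "fas", "fin",
                 "hau", "heb", "isl", "ita", "zul"]

# Counterpart of *mmt-transfer-grammars*: target language ->
# accommodation transfer grammar.
# Assumption: all grammars follow the "eng-acm" naming pattern.
MMT_TRANSFER_GRAMMARS = {tgt: f"{tgt}-acm" for tgt in MMT_LANGUAGES}


def transfer_grammar_for(src, tgt):
    """Pick the transfer grammar for translating src -> tgt.

    The choice depends on the target language only; the source
    is checked merely to confirm that it is a known language.
    """
    for code in (src, tgt):
        if code not in MMT_TRANSFER_GRAMMARS:
            raise ValueError(f"unknown language code: {code}")
    return MMT_TRANSFER_GRAMMARS[tgt]


if __name__ == "__main__":
    print(transfer_grammar_for("ita", "eng"))
    print(transfer_grammar_for("fas", "eng"))
```

Because the lookup ignores the source language, adding a new
language means adding one entry to each structure rather
than one entry per language pair.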

A single language pair can be invoked with the following
command (issued in the $LOGONROOT directory):

./batch --binary --from src --to tgt --ascii ./uw/mmt/test_sentences/src2tgt.txt 

where `src' and `tgt' are replaced with the three-letter
codes for the source and target languages respectively
(each code appears twice in the command).

The script all_lg_test performs a test run of all of the
language pairs.  It invokes format_results.py to output
a PDF file with a table summarizing coverage over the
17 test sentences.
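
To show how the single-pair command generalizes to every
pair, the sketch below builds the command line for each
ordered pair of distinct languages.  It does not reproduce
all_lg_test; the function names are hypothetical, the
language codes come from the table in section 3, and whether
all_lg_test also runs same-language pairs is not stated
here, so they are left out.

```python
from itertools import permutations

LANGUAGES = ["hye", "eng", "epo", "fas", "fin",
             "hau", "heb", "isl", "ita", "zul"]


def batch_command(src, tgt):
    """Build the ./batch argument list for one language pair."""
    bitext = f"./uw/mmt/test_sentences/{src}2{tgt}.txt"
    return ["./batch", "--binary", "--from", src, "--to", tgt,
            "--ascii", bitext]


def all_pair_commands(languages=LANGUAGES):
    """Build one command per ordered pair of distinct languages."""
    return [batch_command(src, tgt)
            for src, tgt in permutations(languages, 2)]


if __name__ == "__main__":
    # Print the commands rather than running them.
    for command in all_pair_commands():
        print(" ".join(command))
```

Each argument list could be passed to subprocess.run with
the working directory set to $LOGONROOT, which is where the
batch script expects to be invoked.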


3. Provenance of grammars

With the exception of the English grammar, the MMT
monolingual grammars all began as course projects for
Linguistics 471/567 at the University of Washington.  In
this class (renumbered to 567 in 2005), each student
develops a grammar for a different language over the 10-week
quarter, according to lab instructions highlighting
different phenomena each week.  The students begin with the
Grammar Matrix (and since 2006, with a starter grammar
configured from the Grammar Matrix customization system) and
build out from there.  In many cases, students work with
languages that were previously unfamiliar to them, using
reference grammars and, in some cases, native speaker
consultants to assess the facts of the language as they
attempt to model it.

All of the grammars use ASCII transliteration, which may or
may not correspond to any standard transliteration.  In
addition, many of the grammars assume (but do not include) a
morphophonological analyzer, and so parse and generate
strings of regularized forms.

For the MMT system, 9 of these grammars were selected and
then updated for consistency with the current version of the
Matrix and current conventions for the MRS representations.
In some cases, the grammar coverage needed to be extended in
order to handle our toy domain.  In general, the grammars
from earlier years required more modifications than the more
recent grammars.  These updates were done by Scott
Drellishak, Margalit Zabludowski, and Emily M. Bender.

The English grammar was created specifically for the MMT
system, beginning with a starter grammar from the Grammar
Matrix customization system.

Language   Code  Orig. Author    Orig. Dev Date  Modifications
---------  ----  --------------  --------------  -------------------------------
Armenian   hye   S. Drellishak   2004            Drellishak, Bender
English    eng   S. Drellishak   2007            Drellishak, Bender
Esperanto  epo   J. Pool         2005            Drellishak, Bender
Farsi      fas   W. McNeill      2004            Drellishak, Bender
Finnish    fin   R. Mattson      2005            Drellishak, Bender
Hausa      hau   K. Hutchins     2007            Bender, Drellishak
Hebrew     heb   M. Zabludowski  2006            Zabludowski, Bender, Drellishak
Icelandic  isl   K. Sickles      2007            Bender, Drellishak
Italian    ita   J. Johanson     2006            Zabludowski, Bender, Drellishak
Zulu       zul   K. O'Hara       2007            Bender, Drellishak


4. Files

In the mmt directory, there is a subdirectory for each
monolingual grammar, identified by the three-letter language
codes given above.  Inside each grammar directory, there is
a subdirectory called "doc" which contains student write-ups
from the course in which the grammars were developed, as
well as the instructor's responses to those write-ups.

Also in the mmt directory are the transfer grammars (eng-acm
et al.), the shared files for the transfer grammars (mrs.tdl,
mtr.tdl, and acm.tdl), a directory called test_sentences, and
a directory called tsdb.  The test_sentences directory stores
the input sentences for each language (again identified by
the three-letter code), as well as bitexts for each language
pair.  The bitexts themselves can be created by the script
create_bitexts.py.

The tsdb/home directory contains (untreebanked) profiles for
each monolingual grammar over the test suites created by
the students as they developed their grammars, and separate
profiles for the MMT sentences.  tsdb/skeletons provides
skeletons for the general test suites and MMT sentences
for each language.  (For heb and eng, only MMT sentences
are available.)

5. Acknowledgments

The initial work of adapting these grammars for the MMT
system was supported by a gift from the Utilika Foundation
to the Turing Center at the University of Washington.
Development of the Grammar Matrix is currently supported by
NSF grant BCS-0644097.