.. Copyright (C) 2001-2012 NLTK Project
.. For license information, see LICENSE.TXT

.. -*- coding: utf-8 -*-

=========
Alignment
=========

Corpus Reader
-------------

    >>> from nltk.corpus import comtrans
    >>> comtrans.words('alignment-en-fr.txt')
    ['Resumption', 'of', 'the', 'session', 'I', 'declare', ...]
    >>> als = comtrans.aligned_sents('alignment-en-fr.txt')[0]
    >>> als # doctest: +NORMALIZE_WHITESPACE
    AlignedSent(['Resumption', 'of', 'the', 'session'],
        ['Reprise', 'de', 'la', 'session'],
        Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))

Alignment Objects
-----------------

An aligned sentence is simply a mapping between the words of a sentence pair:

    >>> als.words
    ['Resumption', 'of', 'the', 'session']
    >>> als.mots
    ['Reprise', 'de', 'la', 'session']
    >>> als.alignment
    Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])

Usually we look at them from the perspective of a source language to a
target language, but they are easily inverted:

    >>> als.invert() # doctest: +NORMALIZE_WHITESPACE
    AlignedSent(['Reprise', 'de', 'la', 'session'],
        ['Resumption', 'of', 'the', 'session'],
        Alignment([(0, 0), (1, 1), (2, 2), (3, 3)]))

We can create new alignments, but these need to be in the correct range of
the corresponding sentences:

    >>> from nltk.align import Alignment, AlignedSent
    >>> als = AlignedSent(['Reprise', 'de', 'la', 'session'],
    ...     ['Resumption', 'of', 'the', 'session'],
    ...     Alignment([(0, 0), (1, 4), (2, 1), (3, 3)]))
    Traceback (most recent call last):
        ...
    IndexError: Alignment is outside boundary of mots

You can set alignments with any sequence of tuples, so long as the first
two elements of each tuple are the alignment indices:

    >>> als.alignment = Alignment([(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1,2))])

    >>> Alignment([(0, 0), (1, 1), (2, 2, "boat"), (3, 3, False, (1,2))])
    Alignment([(0, 0), (1, 1), (2, 2, 'boat'), (3, 3, False, (1, 2))])

Alignment Algorithms
--------------------

EM for IBM Model 1
~~~~~~~~~~~~~~~~~~

Here is an example from [KOEHN2010]_:

    >>> from nltk.align import IBMModel1
    >>> corpus = [AlignedSent(['the', 'house'], ['das', 'Haus']),
    ...           AlignedSent(['the', 'book'], ['das', 'Buch']),
    ...           AlignedSent(['a', 'book'], ['ein', 'Buch'])]
    >>> em_ibm1 = IBMModel1(corpus, 1e-3)
    >>> print round(em_ibm1.probabilities['the', 'das'], 1)
    1.0
    >>> print round(em_ibm1.probabilities['book', 'das'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['house', 'das'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['the', 'Buch'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['book', 'Buch'], 1)
    1.0
    >>> print round(em_ibm1.probabilities['a', 'Buch'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['book', 'ein'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['a', 'ein'], 1)
    1.0
    >>> print round(em_ibm1.probabilities['the', 'Haus'], 1)
    0.0
    >>> print round(em_ibm1.probabilities['house', 'Haus'], 1)
    1.0
    >>> print round(em_ibm1.probabilities['book', None], 1)
    0.5
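``IBMModel1`` hides the estimation procedure behind its ``probabilities``
table. As a rough illustration of what it computes, here is a minimal
sketch of the EM procedure for IBM Model 1 described in [KOEHN2010]_. The
function name, the plain ``(words, mots)`` pair representation and the
fixed iteration count are assumptions of this sketch rather than NLTK API
(the second argument to ``IBMModel1`` above appears to act as a
convergence threshold instead)::

    from collections import defaultdict

    def ibm1_table(corpus, iterations=10):
        # corpus: (words, mots) pairs, e.g. (['the', 'house'], ['das', 'Haus'])
        vocab = set(w for words, _ in corpus for w in words)
        uniform = 1.0 / len(vocab)
        t = defaultdict(lambda: uniform)     # t[word, mot], uniform start
        for _ in range(iterations):
            count = defaultdict(float)       # expected counts c(word, mot)
            total = defaultdict(float)       # expected counts c(mot)
            for words, mots in corpus:
                candidates = [None] + mots   # None stands in for the NULL mot
                for w in words:
                    # E-step: distribute one count for w over the candidates
                    z = sum(t[w, m] for m in candidates)
                    for m in candidates:
                        count[w, m] += t[w, m] / z
                        total[m] += t[w, m] / z
            # M-step: re-estimate the translation table from the counts
            for w, m in count:
                t[w, m] = count[w, m] / total[m]
        return t

Run on the toy corpus above, the table should converge towards the same
probabilities as the doctests, e.g. ``t['the', 'das']`` tending to 1.0.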
And using an NLTK corpus. We train on only 10 sentences, since training is
so incredibly slow:

    >>> from nltk.corpus import comtrans
    >>> com_ibm1 = IBMModel1(comtrans.aligned_sents()[:10])
    >>> print round(com_ibm1.probabilities['bitte', 'Please'], 1)
    0.2
    >>> print round(com_ibm1.probabilities['Sitzungsperiode', 'session'], 1)
    1.0

Get the alignments:

    >>> em_ibm1.aligned() # doctest: +NORMALIZE_WHITESPACE
    [AlignedSent(['the', 'house'], ['das', 'Haus'],
        Alignment([(0, 0), (1, 1)])),
    AlignedSent(['the', 'book'], ['das', 'Buch'],
        Alignment([(0, 0), (1, 1)])),
    AlignedSent(['a', 'book'], ['ein', 'Buch'],
        Alignment([(0, 0), (1, 1)]))]

Evaluation
----------

Evaluation metrics for alignments are usually not interested in the
contents of the alignments themselves, but rather in how they compare to a
"gold standard" alignment constructed by human experts. For this reason we
often want to work just with raw set operations on the alignment points.
This gives us a very clean form for defining our evaluation metrics.

.. Note::
    The AlignedSent class draws no distinction between "possible" and
    "sure" alignments. Thus all alignments are treated as "sure".

Consider the following aligned sentence for evaluation:

    >>> my_als = AlignedSent(['Resumption', 'of', 'the', 'session'],
    ...     ['Reprise', 'de', 'la', 'session'],
    ...     [(0, 0), (3, 3), (1, 2), (1, 1), (1, 3)])

Precision
~~~~~~~~~
``precision = |A∩P| / |A|``

**Precision** is probably the best-known evaluation metric, and there is
already a set-based implementation in NLTK as
`nltk.metrics.scores.precision`_. Since precision is simply interested in
the proportion of correct alignments, we calculate the ratio of the number
of our test alignments (*A*) that match a possible alignment (*P*) over
the number of test alignments provided. We compare against the possible
alignment set so that we don't penalise the system for producing an
alignment that a human might well have considered correct [OCH2000]_.

Here are some examples:

    >>> print als.precision(set())
    0.0
    >>> print als.precision([(0,0), (1,1), (2,2), (3,3)])
    1.0
    >>> print als.precision([(0,0), (3,3)])
    0.5
    >>> print als.precision([(0,0), (1,1), (2,2), (3,3), (1,2), (2,1)])
    1.0
    >>> print my_als.precision(als)
    0.6

.. _nltk.metrics.scores.precision:
    http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics.scores-module.html#precision

Recall
~~~~~~
``recall = |A∩S| / |S|``

**Recall** is another well-known evaluation metric, with a set-based
implementation in NLTK as `nltk.metrics.scores.recall`_. Since recall is
simply interested in the proportion of found alignments, we calculate the
ratio of the number of our test alignments (*A*) that match a sure
alignment (*S*) over the number of sure alignments. Since we are not sure
about some of our possible alignments, we don't penalise the system for
failing to find them [OCH2000]_.

Here are some examples:

    >>> print als.recall(set())
    None
    >>> print als.recall([(0,0), (1,1), (2,2), (3,3)])
    1.0
    >>> print als.recall([(0,0), (3,3)])
    1.0
    >>> print als.recall([(0,0), (1,1), (2,2), (3,3), (1,2), (2,1)])
    0.666666666667
    >>> print my_als.recall(als)
    0.75

.. _nltk.metrics.scores.recall:
    http://nltk.googlecode.com/svn/trunk/doc/api/nltk.metrics.scores-module.html#recall
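Since an ``Alignment`` is just a collection of index tuples, both of these
metrics reduce to the raw set operations mentioned above. Here is a
minimal self-contained sketch; the function names are invented for this
example, and ``None`` is returned when the denominator set is empty,
mirroring the ``recall(set())`` doctest::

    def alignment_precision(A, P):
        # precision = |A & P| / |A|: share of test alignments judged possible
        A, P = set(A), set(P)
        return len(A & P) / float(len(A)) if A else None

    def alignment_recall(A, S):
        # recall = |A & S| / |S|: share of sure alignments that were found
        A, S = set(A), set(S)
        return len(A & S) / float(len(S)) if S else None

For ``my_als`` against ``als`` above, the intersection contains the three
points (0, 0), (1, 1) and (3, 3), giving a precision of 3/5 = 0.6 and a
recall of 3/4 = 0.75, matching the doctests.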
Alignment Error Rate (AER)
~~~~~~~~~~~~~~~~~~~~~~~~~~
``AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|)``

**Alignment Error Rate** is a commonly used metric for assessing sentence
alignments. It combines the precision and recall metrics such that a
perfect alignment must contain all of the sure alignments and may contain
some possible alignments [MIHALCEA2003]_ [KOEHN2010]_.

.. Note::
    [KOEHN2010]_ defines the AER as
    ``AER = (|A∩S| + |A∩P|) / (|A| + |S|)``, under which a perfect
    alignment has ``AER = 1.0``. We instead follow the more intuitive
    definition of [MIHALCEA2003]_, where we are minimising an error rate.

Here are some examples:

    >>> print als.alignment_error_rate(set())
    1.0
    >>> print als.alignment_error_rate([(0,0), (1,1), (2,2), (3,3)])
    0.0
    >>> print my_als.alignment_error_rate(als)
    0.333333333333
    >>> print my_als.alignment_error_rate(als,
    ...     als.alignment | set([(1,2), (2,1)]))
    0.222222222222

.. [OCH2000] Och, F. and Ney, H. (2000)
    *Statistical Machine Translation*, EAMT Workshop

.. [MIHALCEA2003] Mihalcea, R. and Pedersen, T. (2003)
    *An Evaluation Exercise for Word Alignment*, HLT-NAACL 2003

.. [KOEHN2010] Koehn, P. (2010)
    *Statistical Machine Translation*, Cambridge University Press
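Finally, like precision and recall, the AER reduces to raw set operations.
Here is a minimal sketch under the same assumptions as the helpers above
(an invented function name, with the possible set ``P`` defaulting to
``S`` when only sure alignments are supplied)::

    def alignment_error_rate(A, S, P=None):
        # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|)
        A, S = set(A), set(S)
        P = S if P is None else set(P)   # no possible set given: P = S
        return 1.0 - (len(A & S) + len(A & P)) / float(len(A) + len(S))

For ``my_als`` against ``als`` this gives ``1 - (3 + 3) / (5 + 4)``, i.e.
one third; adding (1, 2) and (2, 1) to the possible set raises |A∩P| to 4
and lowers the rate to 2/9, matching the doctests above.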