.. Copyright (C) 2001-2012 NLTK Project
.. For license information, see LICENSE.TXT

=============
 Classifiers
=============

Classifiers label tokens with category labels (or *class labels*).
Typically, labels are represented with strings (such as ``"health"``
or ``"sports"``).  In NLTK, classifiers are defined using classes that
implement the `ClassifierI` interface:

    >>> import nltk
    >>> nltk.usage(nltk.classify.ClassifierI)
    ClassifierI supports the following operations:
      - self.batch_classify(featuresets)
      - self.batch_prob_classify(featuresets)
      - self.classify(featureset)
      - self.labels()
      - self.prob_classify(featureset)
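
For example, here is a trivial classifier that implements this
interface directly.  It is purely illustrative (the class is not part
of NLTK); note that `ClassifierI` supplies a default
``batch_classify`` in terms of ``classify``:

    >>> class AlwaysX(nltk.classify.ClassifierI):
    ...     '''A toy classifier that labels everything "x".'''
    ...     def labels(self): return ['x']
    ...     def classify(self, featureset): return 'x'
    >>> AlwaysX().batch_classify([dict(a=1), dict(b=0)])
    ['x', 'x']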

NLTK defines several classifier classes:

- `ConditionalExponentialClassifier`
- `DecisionTreeClassifier`
- `MaxentClassifier`
- `NaiveBayesClassifier`
- `WekaClassifier`

Classifiers are typically created by training them on a corpus of
labeled examples (a *training corpus*).
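
For example, a minimal training-and-classification session looks like
the following.  This is an illustrative sketch only (the feature names
and labels are invented for this example), so it is not executed as
part of the regression tests:

    >>> labeled = [({'last_letter': 'a'}, 'female'),
    ...            ({'last_letter': 'k'}, 'male')]
    >>> clf = nltk.classify.NaiveBayesClassifier.train(labeled)  # doctest: +SKIP
    >>> clf.classify({'last_letter': 'a'})  # doctest: +SKIP
    'female'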


Regression Tests
~~~~~~~~~~~~~~~~

We define a very simple training corpus with three binary features
('a', 'b', 'c') and two labels ('x' and 'y').  We use a simple feature
set so that the correct answers can be calculated analytically
(although we haven't done this yet for all tests).

    >>> train = [
    ...     (dict(a=1,b=1,c=1), 'y'),
    ...     (dict(a=1,b=1,c=1), 'x'),
    ...     (dict(a=1,b=1,c=0), 'y'),
    ...     (dict(a=0,b=1,c=1), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     (dict(a=0,b=0,c=1), 'y'),
    ...     (dict(a=0,b=1,c=0), 'x'),
    ...     (dict(a=0,b=0,c=0), 'x'),
    ...     (dict(a=0,b=1,c=1), 'y'),
    ...     ]
    >>> test = [
    ...     dict(a=1,b=0,c=1), # unseen
    ...     dict(a=1,b=0,c=0), # unseen
    ...     dict(a=0,b=1,c=1), # seen 3 times, labels=y,y,x
    ...     dict(a=0,b=1,c=0), # seen 1 time, label=x
    ...     ]

Test the Naive Bayes classifier:

    >>> classifier = nltk.classify.NaiveBayesClassifier.train(train)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> classifier.batch_classify(test)
    ['y', 'x', 'y', 'x']
    >>> for pdist in classifier.batch_prob_classify(test):
    ...     print '%.4f %.4f' % (pdist.prob('x'), pdist.prob('y'))
    0.3104 0.6896
    0.5746 0.4254
    0.3685 0.6315
    0.6365 0.3635
    >>> classifier.show_most_informative_features()
    Most Informative Features
                           c = 0                   x : y      =      2.0 : 1.0
                           c = 1                   y : x      =      1.5 : 1.0
                           a = 1                   y : x      =      1.4 : 1.0
                           a = 0                   x : y      =      1.2 : 1.0
                           b = 0                   x : y      =      1.2 : 1.0
                           b = 1                   y : x      =      1.1 : 1.0
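
These results can be checked analytically.  NLTK's Naive Bayes
classifier smooths its frequency estimates -- by default, we believe,
with the expected likelihood estimate (Lidstone smoothing with
gamma=0.5), reserving one extra bin per feature for values never seen
with a given label.  Under that assumption, the label 'x' (seen 4 times
out of 9) has prior (4+0.5)/(9+1) = 0.45, and e.g.
P(a=0|'x') = (3+0.5)/(4+1.5) = 3.5/5.5.  The posterior for the last
test instance (a=0, b=1, c=0) then matches the output above:

    >>> px = 4.5/10 * (3.5/5.5) * (3.5/5.5) * (2.5/5.5)  # P('x') * P(a=0,b=1,c=0 | 'x')
    >>> py = 5.5/10 * (3.5/6.5) * (4.5/6.5) * (1.5/6.5)  # P('y') * P(a=0,b=1,c=0 | 'y')
    >>> print '%.4f %.4f' % (px/(px+py), py/(px+py))
    0.6365 0.3635

The "most informative features" ratios are the corresponding smoothed
feature-probability ratios; e.g. for c=0:

    >>> print '%.1f' % ((2.5/5.5) / (1.5/6.5))  # the c=0 ratio, x : y
    2.0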

Test the Decision Tree classifier:

    >>> classifier = nltk.classify.DecisionTreeClassifier.train(
    ...     train, entropy_cutoff=0, support_cutoff=0)
    >>> sorted(classifier.labels())
    ['x', 'y']
    >>> print classifier
    c=0? .................................................. x
      a=0? ................................................ x
      a=1? ................................................ y
    c=1? .................................................. y
    <BLANKLINE>
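
The tree is read from the top down: an instance first follows the
branch matching its ``c`` value, and within the ``c=0`` branch, the
leaf matching its ``a`` value.  For example, the second test instance
(a=1, b=0, c=0) takes the ``c=0`` branch and then the ``a=1`` leaf:

    >>> classifier.classify(dict(a=1, b=0, c=0))
    'y'
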
    >>> classifier.batch_classify(test)
    ['y', 'y', 'y', 'x']

`DecisionTreeClassifier` does not implement ``prob_classify``, so
requesting probability estimates raises an error:

    >>> for pdist in classifier.batch_prob_classify(test):
    ...     print '%.4f %.4f' % (pdist.prob('x'), pdist.prob('y'))
    Traceback (most recent call last):
      . . .
    NotImplementedError

Test SklearnClassifier, which requires the scikit-learn package:

    >>> from nltk.classify import SklearnClassifier
    >>> from sklearn.naive_bayes import BernoulliNB
    >>> from sklearn.svm import SVC
    >>> train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
    ...               ({"a": 5, "b": 2, "c": 1}, "ham"),
    ...               ({"a": 0, "b": 3, "c": 4}, "spam"),
    ...               ({"a": 5, "b": 1, "c": 1}, "ham"),
    ...               ({"a": 1, "b": 4, "c": 3}, "spam")]
    >>> classif = SklearnClassifier(BernoulliNB()).train(train_data)
    >>> test_data = [{"a": 3, "b": 2, "c": 1},
    ...              {"a": 0, "b": 3, "c": 7}]
    >>> classif.batch_classify(test_data)
    ['ham', 'spam']
    >>> classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
    >>> classif.batch_classify(test_data)
    ['ham', 'spam']
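
Other scikit-learn estimators with the usual ``fit``/``predict``
interface can be wrapped the same way.  The following sketch is not
executed here, since the predictions depend on the estimator's default
parameters:

    >>> from sklearn.linear_model import LogisticRegression
    >>> classif = SklearnClassifier(LogisticRegression()).train(train_data)  # doctest: +SKIP
    >>> classif.batch_classify(test_data)  # doctest: +SKIP
    ['ham', 'spam']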

Test the SVM classifier, which requires the PySVMlight implementation
of SVMlight:

    >>> import nltk.classify.svm
    >>> nltk.classify.svm.demo()
    --- nltk.classify.svm demo ---
    Number of training examples: 7444
    Total SVM dimensions: 63
    Label mapping: {'male': -1, 'female': 1}
    --- Processing an example instance ---
    Reference instance: ('Agnola', 'female')
    NLTK-format features:
        ({'penultimate_letter': 'l', 'last_letter': 'a'}, 'female')
    SVMlight-format features:
        (1, [(12, 1.0), (48, 1.0)])
    Instance classification and confidence: female 1.0
    --- Measuring classifier performance ---
    Overall accuracy: 0.762


Test the Maximum Entropy classifier training algorithms.  Because the
maximum entropy training objective is convex, every algorithm should
converge to the same model and generate the same results:

    >>> def test_maxent(algorithms):
    ...     classifiers = {}
    ...     for algorithm in algorithms:
    ...         if algorithm.lower() == 'megam':
    ...             # megam is an external binary; ignore errors if it
    ...             # cannot be located or configured.
    ...             try: nltk.classify.megam.config_megam()
    ...             except: pass
    ...         try:
    ...             classifiers[algorithm] = nltk.classify.MaxentClassifier.train(
    ...                 train, algorithm, trace=0, max_iter=1000)
    ...         except Exception, e:
    ...             classifiers[algorithm] = e
    ...     print ' '*11+''.join(['      test[%s]  ' % i
    ...                           for i in range(len(test))])
    ...     print ' '*11+'     p(x)  p(y)'*len(test)
    ...     print '-'*(11+15*len(test))
    ...     for algorithm, classifier in classifiers.items():
    ...         print '%11s' % algorithm,
    ...         if isinstance(classifier, Exception):
    ...             print 'Error: %r' % classifier; continue
    ...         for featureset in test:
    ...             pdist = classifier.prob_classify(featureset)
    ...             print '%8.2f%6.2f' % (pdist.prob('x'), pdist.prob('y')),
    ...         print
    >>> test_maxent(nltk.classify.MaxentClassifier.ALGORITHMS)
    ... # doctest: +ELLIPSIS
    [Found megam: ...]
                     test[0]        test[1]        test[2]        test[3]
                    p(x)  p(y)     p(x)  p(y)     p(x)  p(y)     p(x)  p(y)
    -----------------------------------------------------------------------
         LBFGSB     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            GIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
            IIS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
    Nelder-Mead     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
             CG     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
           BFGS     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
          MEGAM     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
         Powell     0.16  0.84     0.46  0.54     0.41  0.59     0.76  0.24
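
The trained model can also be inspected directly.  For example,
``MaxentClassifier.explain()`` prints the contribution each feature
makes to a classification decision.  This is shown as a sketch only
(the exact weights depend on the training run, so the output is not
checked here):

    >>> classifier = nltk.classify.MaxentClassifier.train(
    ...     train, 'IIS', trace=0, max_iter=1000)  # doctest: +SKIP
    >>> classifier.explain(test[0])  # doctest: +SKIP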


Regression tests for TypedMaxentFeatureEncoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`TypedMaxentFeatureEncoding` is a feature encoding that accepts
numeric (e.g. floating-point) feature values as well as nominal ones:

    >>> from nltk.classify import maxent
    >>> train = [
    ...     ({'a': 1, 'b': 1, 'c': 1}, 'y'),
    ...     ({'a': 5, 'b': 5, 'c': 5}, 'x'),
    ...     ({'a': 0.9, 'b': 0.9, 'c': 0.9}, 'y'),
    ...     ({'a': 5.5, 'b': 5.4, 'c': 5.3}, 'x'),
    ...     ({'a': 0.8, 'b': 1.2, 'c': 1}, 'y'),
    ...     ({'a': 5.1, 'b': 4.9, 'c': 5.2}, 'x')
    ... ]

    >>> test = [
    ...     {'a': 1, 'b': 0.8, 'c': 1.2},
    ...     {'a': 5.2, 'b': 5.1, 'c': 5}
    ... ]

    >>> encoding = maxent.TypedMaxentFeatureEncoding.train(
    ...     train, count_cutoff=3, alwayson_features=True)

    >>> classifier = maxent.MaxentClassifier.train(
    ...     train, bernoulli=False, encoding=encoding, trace=0)

    >>> classifier.batch_classify(test)
    ['y', 'x']
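
As with the other classifiers, probability estimates are also
available.  The exact values depend on the optimizer's convergence, so
they are sketched here without checked output:

    >>> for pdist in classifier.batch_prob_classify(test):  # doctest: +SKIP
    ...     print '%.4f %.4f' % (pdist.prob('x'), pdist.prob('y'))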