.. Copyright (C) 2001-2012 NLTK Project .. For license information, see LICENSE.TXT ========================= Text Segmentation Metrics ========================= The `nltk.metrics.segmentation` module provides a variety of *evaluation measures* which can be used for evaluating text segmentation methods A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation. >>> from nltk.metrics import windowdiff, ghd, pk ---------- Windowdiff ---------- Compute the windowdiff score for a pair of segmentations. >>> s1 = "00000010000000001000000" >>> s2 = "00000001000000010000000" >>> s3 = "00010000000000000001000" >>> windowdiff(s1, s1, 3) 0 >>> windowdiff(s1, s2, 3) 4 >>> windowdiff(s2, s3, 3) 16 ---------------------------- Generalized Hamming Distance ---------------------------- Generalized Hamming Distance may be used as an evaluation metric for text segmentation. It compares two segmentations, and returns the cost of transforming one segmentation into the other. The transformation is done though boundary insertions, deletions and shifts. Each operation may have a different cost. >>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5) 0.5 >>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5) 2.0 >>> ghd('011', '110', 1.0, 1.0, 0.5) 1.0 >>> ghd('1', '0', 1.0, 1.0, 0.5) 1.0 >>> ghd('111', '000', 1.0, 1.0, 0.5) 3.0 >>> ghd('000', '111', 1.0, 2.0, 0.5) 6.0 -------------- Befferman's Pk -------------- Beeferman's Pk was proposed as an evaluation metric for text segmentation. It takes a reference segmentation as first argument, an hypothesis segmentation as second argument. It returns the propability that randomly chosen pair of words a distance of k words is inconsistently classified. >>> print pk('1000100', '1000100', 3) 0.0 >>> print pk('100', '010', 2) 0.5 >>> print pk('100100', '111111', 2) 0.64 >>> print pk('100100', '000000', 2) 0.04 >>> print pk('100100', '111111', 3) 0.25 >>> print pk('100100', '000000', 3) 0.25