================
Sourced Strings
================

"Sourced strings" are strings that are annotated with information
about the location in a document where they were originally found.
Sourced strings are subclassed from Python strings.  As a result, they
can usually be used anywhere a normal Python string can be used.

Creating Sourced Strings
========================
A sourced string for a document can be constructed by calling the
`SourcedString` constructor with two arguments: a Python string
(containing the contents of the document), and a document identifier
(such as a file name):

    >>> from nltk.sourcedstring import *
    >>> newt_contents = """\
    ... She turned me into a newt!
    ... I got better."""
    >>> newt_doc = SourcedString(newt_contents, 'newt.txt')
    >>> print repr(newt_doc)
    'She turned me into a newt!\nI got better.'@[0:40]
    >>> newt = newt_doc.split()[5] # Find the sixth word.
    >>> print repr(newt)
    'newt!'@[21:26]

The suffix ``@[0:40]`` at the end of ``newt_doc``'s string representation
indicates that it is a sourced string beginning at offset 0, and
ending at offset 40.  Similarly, the suffix ``@[21:26]`` at the end of
``newt``'s string representation indicates that it spans from offset
21 to offset 26.

.. note:: The `SourcedString` constructor automatically delegates to
   either `SimpleSourcedByteString` or `SimpleSourcedUnicodeString`,
   depending on whether its first argument has type ``str`` or ``unicode``.
   The subclasses of `SourcedString` are discussed in more detail
   in `Unicode and Sourced Strings`_.

Sourced strings can also be created using the `SourcedStringStream`
class, which wraps an existing stream object, and causes its read
methods to return sourced strings.

    >>> from StringIO import StringIO
    >>> stream = SourcedStringStream(StringIO(newt_contents))
    >>> for line in stream:
    ...     print repr(line)
    'She turned me into a newt!\n'@[0:27]
    'I got better.'@[27:40]

Finally, some of NLTK's corpus readers can be instructed to return
sourced strings instead of Python strings:

    >>> from nltk.corpus import gutenberg
    >>> emma_words = gutenberg.words('austen-emma.txt', sourced=True)
    >>> remembrance = emma_words[114]
    >>> print repr(remembrance)
    'remembrance'@[552:563]
    >>> emma_sents = gutenberg.sents('austen-emma.txt', sourced=True)
    >>> emma_sents[28] # doctest: +NORMALIZE_WHITESPACE
    ['The'@[4980:4983], 'Woodhouses'@[4984:4994], 'were'@[4995:4999],
     'first'@[5000:5005], 'in'@[5006:5008], 'consequence'@[5009:5020],
     'there'@[5021:5026], '.'@[5026]]

String Sources
==============
The location where a sourced string was found is recorded using the
``source`` attribute:

    >>> newt.source
    StringSource('newt.txt', begin=21, end=26)
    >>> remembrance.source
    StringSource('austen-emma.txt', begin=552, end=563)

Sources are encoded using `StringSource` objects, which consist of a
document identifier along with information about the offsets of the
characters that make up the string.  These offsets are typically
either byte offsets or character offsets.  (As we'll see below, byte
offsets and character offsets are not equivalent when used to describe
unicode strings.)

String Sources define four attributes that describe the location where
a string was found: ``docid``, ``begin``, ``end``, and ``offsets``.
The ``docid`` attribute contains an identifier (such as a filename)
that names the document where the string was found:

    >>> newt.source.docid
    'newt.txt'

The ``begin`` and ``end`` attributes should be interpreted in the same
way as Python slice indices.  In particular, the ``begin`` index
specifies the offset of the first character in the string; and the
``end`` index specifies the offset just past the last character in the
string:

    >>> newt.source.begin
    21
    >>> newt.source.end
    26
    >>> newt_contents[newt.source.begin:newt.source.end]
    'newt!'

The ``offsets`` attribute returns a tuple of offsets specifying the
location of each character in the document:

    >>> newt.source.offsets
    (21, 22, 23, 24, 25, 26)

In particular, for a `SourcedString` ``s``, character ``s[i]`` begins
at offset ``s.source.offsets[i]`` and ends at offset
``s.source.offsets[i+1]``.  Note that the ``offsets`` list contains
one more offset than there are characters in the string:

    >>> len(newt), len(newt.source.offsets)
    (5, 6)

That's because the `StringSource` specifies both the begin offset and
the end offset for each character.  The ``begin`` and ``end``
attributes are always equal to the first and last elements of the
``offsets`` attribute, respectively:

    >>> assert newt.source.begin == newt.source.offsets[0]
    >>> assert newt.source.end == newt.source.offsets[-1]

The `pprint()` method (which stands for "pretty-print") is helpful for
showing the relationship between offsets and characters.  In the
following example, compare the pretty-printed document with the list
of offsets in newt's source:

    >>> print newt_doc.pprint(wrap='\n')
    [=======================newt.txt=======================]
                        1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2  2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6  7
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+--+
    |S|h|e| |t|u|r|n|e|d| |m|e| |i|n|t|o| |a| |n|e|w|t|!|\n|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+--+
    [=========newt.txt========]
    2 2 2 3 3 3 3 3 3 3 3 3 3 4
    7 8 9 0 1 2 3 4 5 6 7 8 9 0
    +-+-+-+-+-+-+-+-+-+-+-+-+-+
    |I| |g|o|t| |b|e|t|t|e|r|.|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+
    >>> newt.source.offsets
    (21, 22, 23, 24, 25, 26)

At first, it may seem redundant to keep track of the offsets for every
character in a string -- for many strings, the offset of ``s[i]`` is
simply ``s.begin+i``.  However, when byte offsets are used to describe
unicode characters, we can no longer assume that the characters in a
string have consecutive offsets.  In the following example, we
construct a `SourcedString` from a utf-8 encoded byte string (thus
ensuring that we are using byte offsets); and then decode that string
to unicode.  When we print the `SourcedString`, we can see that
several of its characters span two bytes:

    >>> students_and_time = SourcedString("""\
    ... Le temps est un grand ma\xc3\xaetre, dit-on, le malheur est \
    ... qu'il tue ses \xc3\xa9l\xc3\xa8ves""", 'Berlioz').decode('utf-8')
    >>> print students_and_time.pprint()
    [==============================Berlioz===============================]
                        1 1 1 1 1 1 1 1 1 1 2 2 2 2 2      2 2 2 2 3 3 3 3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4      6 7 8 9 0 1 2 3
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+------+-+-+-+-+-+-+-+
    |L|e| |t|e|m|p|s| |e|s|t| |u|n| |g|r|a|n|d| |m|a|\u00ee|t|r|e|,| |d|i|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+------+-+-+-+-+-+-+-+
    [===============================Berlioz===============================]
    3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6
    3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |t|-|o|n|,| |l|e| |m|a|l|h|e|u|r| |e|s|t| |q|u|'|i|l| |t|u|e| |s|e|s| |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    [=======Berlioz=======]
    6      7 7      7 7 7 7
    8      0 1      3 4 5 6
    +------+-+------+-+-+-+
    |\u00e9|l|\u00e8|v|e|s|
    +------+-+------+-+-+-+

StringSource Subclasses
-----------------------
In order to efficiently encode the sources of strings with consecutive
characters while also accommodating strings without consecutive
characters, the `StringSource` class defines two subclasses:

- `ConsecutiveCharStringSource` is used to describe the source of
   strings whose characters have consecutive offsets.  In particular,
   it is used for byte strings with byte offsets; and unicode strings
   with character offsets.  It is encoded using a document identifier,
   a begin offset, and an end offset:

    >>> ConsecutiveCharStringSource('newt.txt', begin=12, end=18)
    StringSource('newt.txt', begin=12, end=18)

- `ContiguousCharStringSource` is used to describe the source of
   strings whose characters are contiguous, but do not necessarily
   have consecutive offsets.  In particular, it is used for unicode
   strings with byte offsets.  It is encoded using a document
   identifier and a tuple of character offsets.

    >>> ContiguousCharStringSource('newt.txt', offsets=[12, 15, 16, 18])
    StringSource('newt.txt', offsets=(12, 15, 16, 18))

The `StringSource` class itself is an abstract base class; but its
constructor automatically delegates to the appropriate subclass,
depending on how it was called:

    >>> StringSource('newt.txt', begin=12, end=18)
    StringSource('newt.txt', begin=12, end=18)
    >>> type(StringSource('newt.txt', begin=12, end=18))
    <class 'nltk.sourcedstring.ConsecutiveCharStringSource'>

    >>> StringSource('newt.txt', offsets=[12, 15, 16, 18])
    StringSource('newt.txt', offsets=(12, 15, 16, 18))
    >>> type(StringSource('newt.txt', offsets=[12, 15, 16, 18]))
    <class 'nltk.sourcedstring.ContiguousCharStringSource'>

SourcedString Source Attributes
-------------------------------
For convenience, the ``SourcedString`` class defines the attributes
``begin``, ``end``, and ``docid``.  Their value is identical to the
corresponding attribute of the string's source:

    >>> assert newt.begin == newt.source.begin
    >>> assert newt.end == newt.source.end
    >>> assert newt.docid == newt.source.docid

As we'll see `below <Compound Sourced Strings>`_, these three
attributes are only defined for "simple sourced strings" -- i.e.,
strings that correspond to a single substring of a document.  They are
not defined for "compound sourced strings," which are constructed by
concatenating strings from multiple sources.

Substrings of Sourced Strings
=============================
Operations that return substrings of a `SourcedString` (such as
slicing, indexing, and `split()`) return them as `SourcedString`s:

    >>> newt_doc[4:10]
    'turned'@[4:10]
    >>> newt_doc[5]
    'u'@[5]
    >>> newt_doc.split() # doctest: +NORMALIZE_WHITESPACE
    ['She'@[0:3], 'turned'@[4:10], 'me'@[11:13], 'into'@[14:18],
     'a'@[19], 'newt!'@[21:26], 'I'@[27], 'got'@[29:32], 'better.'@[33:40]]
    >>> newt_doc[:4].strip()
    'She'@[0:3]

Most regular expression operations will also return a `SourcedString`
when given a `SourcedString` as input:

    >>> import re
    >>> re.findall(r'\w*e\w*', newt_doc) # doctest: +NORMALIZE_WHITESPACE
    ['She'@[0:3], 'turned'@[4:10], 'me'@[11:13],
     'newt'@[21:25], 'better'@[33:39]]
    >>> re.search(r'\w+ed', newt_doc).group()
    'turned'@[4:10]

The exception to this rule is the regular expression substitution
operations, ``re.sub`` and ``re.subn``.  See `Limitations`_ for more
information.

Compound Sourced Strings
========================
When sourced strings are concatenated with other strings, the result
is a compound sourced string:

    >>> better_newt = 'My orange ' + newt_doc[21:25] + ' is ' + newt_doc[33:39]
    >>> print better_newt.pprint()
               [newt.tx]    [==newt.txt=]
               2 2 2 2 2    3 3 3 3 3 3 3
               1 2 3 4 5    3 4 5 6 7 8 9
    +----------+-+-+-+-+----+-+-+-+-+-+-+
    |My orange |n|e|w|t| is |b|e|t|t|e|r|
    +----------+-+-+-+-+----+-+-+-+-+-+-+
    >>> print repr(better_newt)
    'My orange newt is better'@[...,21:25,...,33:39]

Compound sourced strings keep track of the sources of all the
substrings they were composed from.  The pieces that make up a
compound sourced string can be retrieved using the ``substrings``
attribute:

    >>> better_newt.substrings
    ('My orange ', 'newt'@[21:25], ' is ', 'better'@[33:39])

The substrings of a compound sourced string are always either simple
sourced strings or Python strings, never compound sourced strings.

Slicing Compound Sourced Strings
--------------------------------
The type of object that is returned by slicing a compound sourced
string will depend on what portion of the compound sourced string is
covered by the slice.  If the slice falls within a single Python
substring, then it will be returned as a Python string:

    >>> better_newt[3:9] # Returns a Python string
    'orange'

If the slice falls within a single simple sourced string, then
it will be returned as a simple sourced string:

    >>> better_newt[10:13] # Returns a simple sourced string
    'new'@[21:24]

Otherwise, it will be returned as a compound sourced string:

    >>> better_newt[3:14] # Returns a compound sourced string
    'orange newt'@[...,21:25]

Note that a single-character sourced string may never be compound;
therefore, indexing a Sourced String will always return either a
Python character or a simple sourced string:

    >>> better_newt[8]
    'e'
    >>> better_newt[10]
    'n'@[21]

If you are not sure what type of string will result from an operation,
you can use ``isinstance()`` to check whether it's a Python string, a
simple sourced string, or a compound sourced string:

    >>> def check_type(s):
    ...     if isinstance(s, SimpleSourcedString):
    ...         print 'simple sourced string'
    ...     elif isinstance(s, CompoundSourcedString):
    ...         print 'compound sourced string'
    ...     else:
    ...         print 'python string'

    >>> check_type(better_newt[1:2])
    python string
    >>> check_type(better_newt[10:13])
    simple sourced string
    >>> check_type(better_newt[3:14])
    compound sourced string

Alternatively, you can use ``hasattr()`` to check whether a substring
has a source:

    >>> hasattr(better_newt[1:2], 'source')   # Python string
    False
    >>> hasattr(better_newt[10:13], 'source') # Simple sourced string
    True
    >>> hasattr(better_newt[3:14], 'source')  # Compound sourced string
    False

Concatenating Compound Sourced Strings
--------------------------------------
When two compound sourced strings ``c1`` and ``c2`` are concatenated
together, the resulting compound sourced string ``c3`` does *not*
contain ``c1`` and ``c2`` themselves as substrings.  Instead, ``c3``
contains ``c1``'s substrings and ``c2``'s substrings.  This
"flattening" ensures that the substrings of a compound sourced string
will always be either Python strings or simple sourced strings, and
never compound strings.

    >>> c1 = better_newt
    >>> c2 = ' than your ' + better_newt[3:14]
    >>> c3 = c1+c2
    >>> for substring in c3.substrings:
    ...     print '%25r %s' % (substring, type(substring).__name__)
                 'My orange ' str
               'newt'@[21:25] SimpleSourcedByteString
                       ' is ' str
             'better'@[33:39] SimpleSourcedByteString
         ' than your orange ' str
               'newt'@[21:25] SimpleSourcedByteString

Multi-Document Sourced Strings
------------------------------
It is possible to concatenate sourced strings that come from different
documents:

    >>> doc2 = SourcedString("Hello World", 'hello.txt')
    >>> cello = ("Does "+newt[:-1].capitalize()+"on like my " +
    ...      doc2[:5].replace('H','C')+"?")
    >>> print cello.pprint()
          [newt.tx]            [hello.t]
          2 2 2 2 2
          1 2 3 4 5            1 2 3 4 5
    +-----+-+-+-+-+------------+-+-+-+-+-+
    |Does |N|e|w|t|on like my C|e|l|l|o|?|
    +-----+-+-+-+-+------------+-+-+-+-+-+

Transforming Sourced Strings
============================
The `SourcedString` methods that return a modified string will preseve
source information whenever possible.

Case Modification
-----------------
Case modification methods return a sourced string with the same source
as the original string:

    >>> sent = newt_doc.split('\n')[0]
    >>> sent.lower()
    'she turned me into a newt!'@[0:26]
    >>> sent.title()
    'She Turned Me Into A Newt!'@[0:26]
    >>> better_newt.title()
    'My Orange Newt Is Better'@[...,21:25,...,33:39]

In particular, the characters that are modified keep their original
source information.  This is in contrast with the `replace()` method
(discussed below), where the replacement string has its own source
information.

Justification methods
---------------------
The string justification methods preserve source information of the
original string.  The padding substring will usually be sourceless
(unless you supply a sourced string as the fill character):

    >>> print newt.rjust(15).pprint()
               [=newt.txt]
               2 2 2 2 2 2
               1 2 3 4 5 6
    +----------+-+-+-+-+-+
    |          |n|e|w|t|!|
    +----------+-+-+-+-+-+
    >>> print newt.center(15, '.').pprint()
          [=newt.txt]
          2 2 2 2 2 2
          1 2 3 4 5 6
    +-----+-+-+-+-+-+-----+
    |.....|n|e|w|t|!|.....|
    +-----+-+-+-+-+-+-----+

Replacement Method
------------------
The ``replace`` method preserves string information for both the
original string and the replacement string:

    >>> print sent.replace('newt', doc2[6:]).pprint()
    [=================newt.txt================][hello.txt][n]
                        1 1 1 1 1 1 1 1 1 1 2 2        1 12 2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 16 7 8 9 0 15 6
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-++-+
    |S|h|e| |t|u|r|n|e|d| |m|e| |i|n|t|o| |a| ||W|o|r|l|d||!|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-++-+

If the replacement string is a Python string, then the
corresponding substring will be sourceless:

    >>> print newt.replace('!', 'on').pprint()
    [newt.tx]
    2 2 2 2 2
    1 2 3 4 5
    +-+-+-+-+--+
    |n|e|w|t|on|
    +-+-+-+-+--+

Other Modifications
-------------------
Unfortunately, several other modification methods (such as
``str.join()`` and ``re.sub()``) do not always preserve source
information.  See `Limitations`_ for more details.

Unicode and Sourced Strings
===========================

The `SourcedString` class is an abstract base class.  It defines two
abtract subclasses, each of which defines two concrete subclasses::

  SourcedString (abstract)
    |
    +-- SimpleSourcedString (abstract)
    |     |
    |     +-- SimpleSourcedByteString
    |     |
    |     +-- SimpleSourcedUnicodeString
    |
    +-- CompoundSourcedString (abstract)
          |
          +-- CompoundSourcedByteString
          |
          +-- CompoundSourcedUnicodeString

The two ``-ByteString`` classes are subclassed from ``str``; and the
two ``-UnicodeString`` classes are subclassed from ``unicode``.  When
the `SourcedString` constructor is called directly, it will delegate
to the appropriate subclass, based on the type of the content string:

    >>> byte_str = 'He was a tall lumberjack.'
    >>> SourcedString(byte_str, 'lumberjack.txt')
    'He was a tall lumberjack.'@[0:25]
    >>> type(SourcedString(byte_str, 'lumberjack.txt'))
    <class 'nltk.sourcedstring.SimpleSourcedByteString'>

    >>> unicode_str = u'He was a tall lumberjack.'
    >>> SourcedString(unicode_str, 'lumberjack.txt')
    u'He was a tall lumberjack.'@[0:25]
    >>> type(SourcedString(unicode_str, 'lumberjack.txt'))
    <class 'nltk.sourcedstring.SimpleSourcedUnicodeString'>

The ``CompoundSourced-*-String`` classes are not usually instantiated
directly; instead, they are created by concatenating sourced strings
with other strings.  See `Compound Sourced Strings`_ for details.

Equality
========
Two sourced strings are considered equal if their contents are equal,
even if their sources differ:

    >>> newt_doc[3], newt_doc[10]
    (' '@[3], ' '@[10])
    >>> newt_doc[3] == newt_doc[10]
    True

Sourced strings may also be compared for equality with non-sourced
strings:

    >>> newt == 'newt!'
    True
    >>> cello == "Does Newton like my Cello?"
    True

The fact that string equality ignores sources is important in ensuring
that sourced strings act like normal strings.  In particular, it
allows sourced strings to be used with code that was originally
intended to process plain Python strings.  E.g., this fact allows
sourced strings to be parsed by standard parsing algorithms (which
have no knowledge of sourced strings).

If you wish to determine whether two simple sourced strings correspond
to the same location in a document, simply compare their ``source``
attribute:

    >>> x = newt_doc[4:10]
    >>> y = newt_doc.split()[1]
    >>> z = x.upper()
    >>> (x, y, z)
    ('turned'@[4:10], 'turned'@[4:10], 'TURNED'@[4:10])
    >>> x==y, x.source==y.source
    (True, True)
    >>> x==z, x.source==z.source
    (False, True)

If you may be dealing with compound sourced strings, then you should
use the ``sources`` attribute instead.  This attribute is defined for
both simple and compound sourced strings, and contains a sorted tuple
of ``(index,source)`` pairs.  Each such pair specifies that the source
of the substring starting at ``index``, and extending ``len(source)``
characters, is ``source``:

    >>> newt.sources
    ((0, StringSource('newt.txt', begin=21, end=26)),)
    >>> cello.sources # doctest: +NORMALIZE_WHITESPACE
    ((5, StringSource('newt.txt', begin=21, end=25)),
     (20, StringSource('hello.txt', begin=0, end=0)),
     (21, StringSource('hello.txt', begin=1, end=5)))

If you wish to compare two strings, and they might be simple sourced
strings, compound sourced strings, or plain Python strings, then you
can use ``getattr(s, 'sources', ())``, which will return ``s.sources``
for sourced strings, and ``()`` for plain Python strings:

    >>> print getattr(cello[:4], 'sources', ())
    ()
    >>> print getattr(cello[5:9], 'sources', ())
    ((0, StringSource('newt.txt', begin=21, end=25)),)
    >>> print getattr(cello[17:], 'sources', ()) # doctest: +NORMALIZE_WHITESPACE
    ((3, StringSource('hello.txt', begin=0, end=0)),
     (4, StringSource('hello.txt', begin=1, end=5)))

Sourced Strings as Dictionary Keys and Set Values
==================================================
When sourced strings are used as dictionary keys, or placed in sets,
we would sometimes like to ensure that strings with different sources
are treated as different values.  However, the fact that sourced
string equality ignores sources makes this impossible.  To get around
this problem, you can use the sourced string's ``source`` (or
``sources`` for compound strings), or a tuple containing the sourced
string and its ``source``, as a dictionary key or set value:

    >>> animals_contents = 'the dog and the cat'
    >>> animals = SourcedString(animals_contents, source='animals.txt')

    >>> # Create a list of words, including some case-normalized duplicates
    >>> words = animals.split() + re.findall('DOG|CAT', animals.upper())
    >>> sorted(words) # doctest: +NORMALIZE_WHITESPACE
    ['CAT'@[16:19], 'DOG'@[4:7], 'and'@[8:11], 'cat'@[16:19],
     'dog'@[4:7], 'the'@[0:3], 'the'@[12:15]]

    >>> # Show the set of unique words (using string equality).  Note
    >>> # that the second occurence of 'the' was discarded.
    >>> sorted(set(words)) # doctest: +NORMALIZE_WHITESPACE
    ['CAT'@[16:19], 'DOG'@[4:7], 'and'@[8:11], 'cat'@[16:19],
     'dog'@[4:7], 'the'@[0:3]]

    >>> # Show the set of locations where words occur.  Note that
    >>> # the locations of 'cat' and 'dog' each appear only once.
    >>> sorted(set(word.source for word in words)) # doctest: +NORMALIZE_WHITESPACE
    [StringSource('animals.txt', begin=0, end=3),
     StringSource('animals.txt', begin=4, end=7),
     StringSource('animals.txt', begin=8, end=11),
     StringSource('animals.txt', begin=12, end=15),
     StringSource('animals.txt', begin=16, end=19)]

    >>> # Show the set of unique (string, location) pairs.  Note
    >>> # that both occurences of 'the' appear; and that both copies
    >>> # of 'dog' and 'cat' appear.
    >>> sorted(set((word.source, word) for word in words)) # doctest: +NORMALIZE_WHITESPACE
    [(StringSource('animals.txt', begin=0, end=3), 'the'@[0:3]),
     (StringSource('animals.txt', begin=4, end=7), 'DOG'@[4:7]),
     (StringSource('animals.txt', begin=4, end=7), 'dog'@[4:7]),
     (StringSource('animals.txt', begin=8, end=11), 'and'@[8:11]),
     (StringSource('animals.txt', begin=12, end=15), 'the'@[12:15]),
     (StringSource('animals.txt', begin=16, end=19), 'CAT'@[16:19]),
     (StringSource('animals.txt', begin=16, end=19), 'cat'@[16:19])]

Limitations
===========
Some types of string manipulation can cause source information to be
lost.  In particular, functions and methods that accesses a sourced
string using the low-level "buffer" interface will often bypass the
sourced string's ability to preserve source information.

Operations that are known to result in a loss of source information
are listed below:

- ``str.join()``, where the joining string is not sourced:

    >>> '+'.join(sent.split())
    'She+turned+me+into+a+newt!'

- ``str.replace()``, where the original string is not sourced:

    >>> turned = newt_doc.split()[1]
    >>> 'I twisted around'.replace('twisted', turned)
    'I turned around'

- String formatting, where the format string is not sourced:

    >>> 'My %s is %s' % (newt_doc[21:25], newt_doc[33:39])
    'My newt is better'

- Regular expression substitution, where the regular expression
  pattern string is not sourced:

    >>> re.sub('orange', 'green', better_newt)
    'My green newt is better'
    >>> re.subn('orange', 'green', better_newt, 1)
    ('My green newt is better', 1)

- String justification methods, where the string being justified
  is unsourced but the fill character is sourced:

    >>> 'coconut'.center(25, newt[-1])
    '!!!!!!!!!coconut!!!!!!!!!'
    >>> 'coconut'.ljust(25, newt[-1])
    'coconut!!!!!!!!!!!!!!!!!!'
    >>> 'coconut'.rjust(25, newt[-1])
    '!!!!!!!!!!!!!!!!!!coconut'


.. ======================= Regression Tests ===========================

Regression Tests
================

String Sources
--------------

ConsecutiveCharStringSource
~~~~~~~~~~~~~~~~~~~~~~~~~~~
String representations:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 18)
    >>> repr(source)
    "StringSource('coconut.txt', begin=5, end=18)"
    >>> str(source)
    '@coconut.txt[5:18]'

Attributes:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 18)
    >>> source.begin, source.end, source.docid
    (5, 18, 'coconut.txt')
    >>> source.docid
    'coconut.txt'
    >>> source.offsets
    (5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18)

Begin and end must be integers (or longs):

    >>> ConsecutiveCharStringSource('coconut.txt', 5, 10)
    StringSource('coconut.txt', begin=5, end=10)
    >>> ConsecutiveCharStringSource('coconut.txt', 5L, 10L)
    StringSource('coconut.txt', begin=5L, end=10L)
    >>> ConsecutiveCharStringSource('coconut.txt', 5.3, 10)
    Traceback (most recent call last):
      . . .
    TypeError: begin attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 10.3)
    Traceback (most recent call last):
      . . .
    TypeError: end attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 'five', 10)
    Traceback (most recent call last):
      . . .
    TypeError: begin attribute expected an integer
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 'ten')
    Traceback (most recent call last):
      . . .
    TypeError: end attribute expected an integer

The end index must be greater than or equal to the begin offset:

    >>> ConsecutiveCharStringSource('coconut.txt', 5, 6)
    StringSource('coconut.txt', begin=5, end=6)
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 5)
    StringSource('coconut.txt', begin=5, end=5)
    >>> ConsecutiveCharStringSource('coconut.txt', 5, 4)
    Traceback (most recent call last):
      . . .
    ValueError: begin must be less than or equal to end

The begin and end offsets may be negative:

    >>> ConsecutiveCharStringSource('coconut.txt', -5, 5)
    StringSource('coconut.txt', begin=-5, end=5)
    >>> ConsecutiveCharStringSource('coconut.txt', -5, -2)
    StringSource('coconut.txt', begin=-5, end=-2)

Length-1 source:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 6)
    >>> repr(source)
    "StringSource('coconut.txt', begin=5, end=6)"
    >>> str(source)
    '@coconut.txt[5]'
    >>> len(source)
    1
    >>> source.begin, source.end, source.docid
    (5, 6, 'coconut.txt')
    >>> source.offsets
    (5, 6)

zero-length source:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 5, 5)
    >>> repr(source)
    "StringSource('coconut.txt', begin=5, end=5)"
    >>> str(source)
    '@coconut.txt[5:5]'
    >>> len(source)
    0
    >>> source.begin, source.end, source.docid
    (5, 5, 'coconut.txt')
    >>> source.offsets
    (5,)

Indexing:

    >>> source = ConsecutiveCharStringSource('coconut.txt', 15, 21)
    >>> for i in range(-len(source), len(source)):
    ...     print '  source[%2d] = %r' % (i, source[i])
      source[-6] = StringSource('coconut.txt', begin=15, end=16)
      source[-5] = StringSource('coconut.txt', begin=16, end=17)
      source[-4] = StringSource('coconut.txt', begin=17, end=18)
      source[-3] = StringSource('coconut.txt', begin=18, end=19)
      source[-2] = StringSource('coconut.txt', begin=19, end=20)
      source[-1] = StringSource('coconut.txt', begin=20, end=21)
      source[ 0] = StringSource('coconut.txt', begin=15, end=16)
      source[ 1] = StringSource('coconut.txt', begin=16, end=17)
      source[ 2] = StringSource('coconut.txt', begin=17, end=18)
      source[ 3] = StringSource('coconut.txt', begin=18, end=19)
      source[ 4] = StringSource('coconut.txt', begin=19, end=20)
      source[ 5] = StringSource('coconut.txt', begin=20, end=21)
    >>> source[len(source)]
    Traceback (most recent call last):
      . . .
    IndexError: StringSource index out of range
    >>> source[-len(source)-1]
    Traceback (most recent call last):
      . . .
    IndexError: StringSource index out of range

Slicing:

    >>> def slice_test(source, *indices):
    ...     """Print a table showing the result of slicing the given
    ...     source, using each of the given indices as a start or end
    ...     index for the slice."""
    ...     print '     |'+' '.join(str(j).center(5) for j in indices)
    ...     print '-----+'+'------'*len(indices)
    ...     for i in indices:
    ...         print '%4s |' % i,
    ...         for j in indices:
    ...             if i is None and j is None: sliced_source = source[:]
    ...             elif i is None: sliced_source = source[:j]
    ...             elif j is None: sliced_source = source[i:]
    ...             else:  sliced_source = source[i:j]
    ...             print '%2s:%-2s' % (sliced_source.begin, sliced_source.end),
    ...             assert sliced_source.docid == 'coconut.txt'
    ...         print
    ...
    >>> source = ConsecutiveCharStringSource('coconut.txt', 15, 28)
    >>> slice_test(source, None, 0, 1, len(source)-1, len(source), 100,
    ...            -1, -len(source)+1, -len(source), -100)
         | None   0     1     12    13   100    -1   -12   -13   -100
    -----+------------------------------------------------------------
    None | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
       0 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
       1 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16
      12 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27
      13 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28
     100 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28
      -1 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27
     -12 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16
     -13 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
    -100 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15

    >>> source = ConsecutiveCharStringSource('coconut.txt', 50, 53)
    >>> slice_test(source, -4, -3, 0, -2, 1, -1, 2, 3, 4, 5)
         |  -4    -3    0     -2    1     -1    2     3     4     5
    -----+------------------------------------------------------------
      -4 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
      -3 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
       0 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
      -2 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53
       1 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53
      -1 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53
       2 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53
       3 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53
       4 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53
       5 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53

ContiguousCharStringSource
~~~~~~~~~~~~~~~~~~~~~~~~~~
String representations:

    >>> offsets = [5, 7, 8, 9, 13, 18]
    >>> source = ContiguousCharStringSource('coconut.txt', offsets)
    >>> repr(source)
    "StringSource('coconut.txt', offsets=(5, 7, 8, 9, 13, 18))"
    >>> str(source)
    '@coconut.txt[5:18]'

Attributes:

    >>> source = ContiguousCharStringSource('coconut.txt', offsets)
    >>> source.begin, source.end, source.docid
    (5, 18, 'coconut.txt')
    >>> source.docid
    'coconut.txt'
    >>> source.offsets
    (5, 7, 8, 9, 13, 18)

Offsets must be integers (or longs):

    >>> ContiguousCharStringSource.CONSTRUCTOR_CHECKS_OFFSETS = True
    >>> ContiguousCharStringSource('coconut.txt', [5, 6L, 7])
    StringSource('coconut.txt', offsets=(5, 6L, 7))
    >>> ContiguousCharStringSource('coconut.txt', [6.2])
    Traceback (most recent call last):
      . . .
    TypeError: offsets must be integers
    >>> ContiguousCharStringSource('coconut.txt', ['five'])
    Traceback (most recent call last):
      . . .
    TypeError: offsets must be integers

Offsets must be monotonic increasing:

    >>> ContiguousCharStringSource('coconut.txt', [5, 6, 7])
    StringSource('coconut.txt', offsets=(5, 6, 7))
    >>> ContiguousCharStringSource('coconut.txt', [5, 5])
    StringSource('coconut.txt', offsets=(5, 5))
    >>> ContiguousCharStringSource('coconut.txt', [7, 6, 5])
    Traceback (most recent call last):
      . . .
    TypeError: offsets must be monotonic increasing

Offsets may be negative:

    >>> ContiguousCharStringSource('coconut.txt', [-5, 5])
    StringSource('coconut.txt', offsets=(-5, 5))
    >>> ContiguousCharStringSource('coconut.txt', [-5, -2])
    StringSource('coconut.txt', offsets=(-5, -2))

Length-1 source:

    >>> source = ContiguousCharStringSource('coconut.txt', [5,6])
    >>> repr(source)
    "StringSource('coconut.txt', offsets=(5, 6))"
    >>> str(source)
    '@coconut.txt[5]'
    >>> len(source)
    1
    >>> source.begin, source.end, source.docid
    (5, 6, 'coconut.txt')
    >>> source.offsets
    (5, 6)

zero-length source:

    >>> source = ContiguousCharStringSource('coconut.txt', [5])
    >>> repr(source)
    "StringSource('coconut.txt', offsets=(5,))"
    >>> str(source)
    '@coconut.txt[5:5]'
    >>> len(source)
    0
    >>> source.begin, source.end, source.docid
    (5, 5, 'coconut.txt')
    >>> source.offsets
    (5,)

Indexing:

    >>> source = ContiguousCharStringSource('coconut.txt', range(15, 22))
    >>> for i in range(-len(source), len(source)):
    ...     print '  source[%2d] = %r' % (i, source[i])
      source[-6] = StringSource('coconut.txt', offsets=(15, 16))
      source[-5] = StringSource('coconut.txt', offsets=(16, 17))
      source[-4] = StringSource('coconut.txt', offsets=(17, 18))
      source[-3] = StringSource('coconut.txt', offsets=(18, 19))
      source[-2] = StringSource('coconut.txt', offsets=(19, 20))
      source[-1] = StringSource('coconut.txt', offsets=(20, 21))
      source[ 0] = StringSource('coconut.txt', offsets=(15, 16))
      source[ 1] = StringSource('coconut.txt', offsets=(16, 17))
      source[ 2] = StringSource('coconut.txt', offsets=(17, 18))
      source[ 3] = StringSource('coconut.txt', offsets=(18, 19))
      source[ 4] = StringSource('coconut.txt', offsets=(19, 20))
      source[ 5] = StringSource('coconut.txt', offsets=(20, 21))
    >>> source[len(source)]
    Traceback (most recent call last):
      . . .
    IndexError: StringSource index out of range
    >>> source[-len(source)-1]
    Traceback (most recent call last):
      . . .
    IndexError: StringSource index out of range

Slicing:

    >>> def slice_test(source, *indices):
    ...     """Print a table showing the result of slicing the given
    ...     source, using each of the given indices as a start or end
    ...     index for the slice."""
    ...     print '     |'+' '.join(str(j).center(5) for j in indices)
    ...     print '-----+'+'------'*len(indices)
    ...     for i in indices:
    ...         print '%4s |' % i,
    ...         for j in indices:
    ...             if i is None and j is None: sliced_source = source[:]
    ...             elif i is None: sliced_source = source[:j]
    ...             elif j is None: sliced_source = source[i:]
    ...             else:  sliced_source = source[i:j]
    ...             print '%2s:%-2s' % (sliced_source.begin, sliced_source.end),
    ...             assert sliced_source.docid == 'coconut.txt'
    ...         print
    ...
    >>> source = ContiguousCharStringSource('coconut.txt', range(15, 29))
    >>> slice_test(source, None, 0, 1, len(source)-1, len(source), 100,
    ...            -1, -len(source)+1, -len(source), -100)
         | None   0     1     12    13   100    -1   -12   -13   -100
    -----+------------------------------------------------------------
    None | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
       0 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
       1 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16
      12 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27
      13 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28
     100 | 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28 28:28
      -1 | 27:28 27:27 27:27 27:27 27:28 27:28 27:27 27:27 27:27 27:27
     -12 | 16:28 16:16 16:16 16:27 16:28 16:28 16:27 16:16 16:16 16:16
     -13 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15
    -100 | 15:28 15:15 15:16 15:27 15:28 15:28 15:27 15:16 15:15 15:15

    >>> source = ContiguousCharStringSource('coconut.txt', range(50, 54))
    >>> slice_test(source, -4, -3, 0, -2, 1, -1, 2, 3, 4, 5)
         |  -4    -3    0     -2    1     -1    2     3     4     5
    -----+------------------------------------------------------------
      -4 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
      -3 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
       0 | 50:50 50:50 50:50 50:51 50:51 50:52 50:52 50:53 50:53 50:53
      -2 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53
       1 | 51:51 51:51 51:51 51:51 51:51 51:52 51:52 51:53 51:53 51:53
      -1 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53
       2 | 52:52 52:52 52:52 52:52 52:52 52:52 52:52 52:53 52:53 52:53
       3 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53
       4 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53
       5 | 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53 53:53


Sourced Strings
---------------

The following helper function that checks a sourced string's
characters to make sure that the string they come from is what it
should be.  It looks in ``check.documents[docid]`` for the text of the
document named ``docid``.

    >>> def check(sourced_string):
    ...     for char in sourced_string:
    ...         if isinstance(char, SourcedString):
    ...             document = check.documents[char.docid]
    ...             source_char = document[char.begin:char.end]
    ...             assert (char == source_char or
    ...                     (isinstance(source_char, str) and
    ...                      isinstance(char, unicode) and
    ...                      char.decode('utf-8') == source_char))
    >>> check.documents = {}


Constructing string tokens:

    >>> from nltk.data import *
    >>> s = ("Good muffins cost $3.88\nin New York.  Please buy me\n"
    ...      "two of them.\n\nThanks.")
    >>> check.documents['muffins.txt'] = s
    >>> doc = SourcedString(s, source='muffins.txt')

SourcedString indexing:

    >>> for i in 0, 1, 50, 72:
    ...     check(doc[i]); check(doc[-i])
    ...     assert doc[i] == s[i]
    ...     assert doc[-i] == s[-i]

    >>> doc[8], s[8]
    ('f'@[8], 'f')
    >>> doc[-1], s[-1], s[72]
    ('.'@[72], '.', '.')
    >>> doc[-5], s[-5], s[68]
    ('a'@[68], 'a', 'a')
    >>> doc[74]
    Traceback (most recent call last):
      . . .
    IndexError: string index out of range
    >>> doc[-74]
    Traceback (most recent call last):
      . . .
    IndexError: string index out of range

    >>> in_new_york = doc[-49:-38]; in_new_york
    'in New York'@[24:35]
    >>> doc[27], in_new_york[3], s[27]
    ('N'@[27], 'N'@[27], 'N')

SourcedString slicing:

    >>> len(in_new_york)
    11
    >>> def test_slice(sstring, string, print_indices, start, stop):
    ...     check(sstring)
    ...     assert (sstring == string)
    ...     if (start in print_indices and stop in print_indices):
    ...         s_repr = re.sub(r'^(.{30}).*(.{15})$', r'\1...\2',
    ...                         repr(sstring))
    ...         print 's[%4s:%4s] = %r' % (start, stop, s_repr)
    >>> def test_slices(sstring, string, test_indices, print_indices):
    ...     test_slice(sstring[:], string[:], print_indices, '', '')
    ...     for i in test_indices:
    ...         test_slice(sstring[i:], string[i:], print_indices, i, '')
    ...         test_slice(sstring[:i], string[:i], print_indices, '', i)
    ...     for start in test_indices:
    ...         for stop in test_indices:
    ...             test_slice(sstring[start:stop], string[start:stop],
    ...                        print_indices, start, stop)
    >>> test_slices(in_new_york, 'in New York',
    ...             range(-12, 13)+[None,100,-100, -20, 20],
    ...             ('', 0, 1, -1, 5))
    s[    :    ] = "'in New York'@[24:35]"
    s[  -1:    ] = "'k'@[34]"
    s[    :  -1] = "'in New Yor'@[24:34]"
    s[   0:    ] = "'in New York'@[24:35]"
    s[    :   0] = "''@[24:24]"
    s[   1:    ] = "'n New York'@[25:35]"
    s[    :   1] = "'i'@[24]"
    s[   5:    ] = "'w York'@[29:35]"
    s[    :   5] = "'in Ne'@[24:29]"
    s[  -1:  -1] = "''@[34:34]"
    s[  -1:   0] = "''@[34:34]"
    s[  -1:   1] = "''@[34:34]"
    s[  -1:   5] = "''@[34:34]"
    s[   0:  -1] = "'in New Yor'@[24:34]"
    s[   0:   0] = "''@[24:24]"
    s[   0:   1] = "'i'@[24]"
    s[   0:   5] = "'in Ne'@[24:29]"
    s[   1:  -1] = "'n New Yor'@[25:34]"
    s[   1:   0] = "''@[25:25]"
    s[   1:   1] = "''@[25:25]"
    s[   1:   5] = "'n Ne'@[25:29]"
    s[   5:  -1] = "'w Yor'@[29:34]"
    s[   5:   0] = "''@[29:29]"
    s[   5:   1] = "''@[29:29]"
    s[   5:   5] = "''@[29:29]"
    >>> check(in_new_york[:])
    >>> assert (in_new_york[:] ==
    ...         'in New York'[:])
    >>> for i in range(-12, 13)+[None,100,-100, -20, 20]:
    ...     check(in_new_york[:i])
    ...     check(in_new_york[i:])
    ...     assert (in_new_york[i:] == 'in New York'[i:])
    ...     assert (in_new_york[:i] == 'in New York'[:i])


Misc other tests:

    >>> doc[5:12]
    'muffins'@[5:12]
    >>> doc[:4]
    'Good'@[0:4]
    >>> doc[-7:]
    'Thanks.'@[66:73]
    >>> doc[-7:-1]
    'Thanks'@[66:72]
    >>> doc[-46:-38]
    'New York'@[27:35]

    >>> tok = doc[-49:-38]
    >>> tok[:]
    'in New York'@[24:35]
    >>> tok[:2]
    'in'@[24:26]
    >>> tok[3:]
    'New York'@[27:35]
    >>> tok[3:4]
    'N'@[27]

When a token slice is taken, and the step is not 1, a plain unicode
string is returned:

    >>> tok[::-1]
    'kroY weN ni'
    >>> tok[1:-1:2]
    'nNwYr'

Regular expressions can be used to search SourcedStrings:

    >>> import re
    >>> intoks = re.findall('in', doc)
    >>> print intoks
    ['in'@[9:11], 'in'@[24:26]]

Two tokens with the same string contents compare equal even if their
source/begin/end differ:

    >>> intoks[0] == intoks[1]
    True

Sourced strings can also be compared for equality with simple strings:

    >>> intoks[0] == 'in'
    True

Case manipulation:

    >>> tok.capitalize()
    'In new york'@[24:35]
    >>> tok.lower()
    'in new york'@[24:35]
    >>> tok.upper()
    'IN NEW YORK'@[24:35]
    >>> tok.swapcase()
    'IN nEW yORK'@[24:35]
    >>> tok.title()
    'In New York'@[24:35]

Stripping:

    >>> wstok = SourcedString(u'   Test   ', 'source')
    >>> wstok.lstrip()
    u'Test   '@[3:10]
    >>> wstok.rstrip()
    u'   Test'@[0:7]
    >>> wstok.strip()
    u'Test'@[3:7]

Splitting:

    >>> doc.split() # doctest: +NORMALIZE_WHITESPACE
    ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23],
     'in'@[24:26], 'New'@[27:30], 'York.'@[31:36], 'Please'@[38:44],
     'buy'@[45:48], 'me'@[49:51], 'two'@[52:55], 'of'@[56:58],
     'them.'@[59:64], 'Thanks.'@[66:73]]
    >>> doc.split(None, 5) # doctest: +NORMALIZE_WHITESPACE
    ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23],
     'in'@[24:26], 'New York.  Please buy me\ntwo of them.\n\nThanks.'@[27:73]]
    >>> doc.split('\n') # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88'@[0:23],
     'in New York.  Please buy me'@[24:51],
     'two of them.'@[52:64],
     ''@[65:65],
     'Thanks.'@[66:73]]
    >>> doc.split('\n', 1) # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88'@[0:23],
     'in New York.  Please buy me\ntwo of them.\n\nThanks.'@[24:73]]

    >>> doc.rsplit() # doctest: +NORMALIZE_WHITESPACE
    ['Good'@[0:4], 'muffins'@[5:12], 'cost'@[13:17], '$3.88'@[18:23],
     'in'@[24:26], 'New'@[27:30], 'York.'@[31:36], 'Please'@[38:44],
     'buy'@[45:48], 'me'@[49:51], 'two'@[52:55], 'of'@[56:58],
     'them.'@[59:64], 'Thanks.'@[66:73]]
    >>> doc.rsplit(None, 5) # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88\nin New York.  Please buy'@[0:48],
     'me'@[49:51], 'two'@[52:55], 'of'@[56:58], 'them.'@[59:64],
     'Thanks.'@[66:73]]
    >>> doc.rsplit('\n') # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88'@[0:23], 'in New York.  Please buy me'@[24:51],
     'two of them.'@[52:64], ''@[65:65], 'Thanks.'@[66:73]]
    >>> doc.rsplit('\n', 3) # doctest: +NORMALIZE_WHITESPACE
    ['Good muffins cost $3.88\nin New York.  Please buy me'@[0:51],
     'two of them.'@[52:64], ''@[65:65], 'Thanks.'@[66:73]]

Adding adjacent string tokens gives new string tokens:

    >>> doc[:4] + doc[4:12]
    'Good muffins'@[0:12]

Adding empty strings to string tokens gives string tokens:

    >>> tok + ''
    'in New York'@[24:35]
    >>> '' + tok
    'in New York'@[24:35]

All other add operations give basic strings:

    >>> 'not '+tok
    'not in New York'@[...,24:35]
    >>> doc[:4] + doc[12:17]
    'Good cost'@[0:4,12:17]

Regexps:

    >>> sent = newt_doc.split('\n')[1]
    >>> re.sub('better', 'worse', sent)
    'I got worse.'
    >>> SourcedStringRegexp('better').sub('worse', sent)
    'I got worse.'@[27:33,...,39:40]
    >>> SourcedStringRegexp.patch_re_module()
    >>> re.sub('better', 'worse', sent)
    'I got worse.'@[27:33,...,39:40]
    >>> SourcedStringRegexp.unpatch_re_module()
    >>> re.sub('better', 'worse', sent)
    'I got worse.'


Str/Unicode Interactions
------------------------
    >>> x = SourcedString('byte string \xcc', 'str')
    >>> y = SourcedString(u'unicode string \ubbbb', 'unicode')

Any operation that combines a byte string with a unicode string will
first decode the byte string using the default encoding.  As a result,
all of the following operations raise an exception (since the string
``x`` can't be decoded using the ASCII encoding):

    >>> x+y
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y+x
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.__radd__(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.__radd__(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.find(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.find(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.lstrip(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.lstrip(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.rstrip(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.rstrip(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.strip(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.strip(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.split(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.split(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.rsplit(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.rsplit(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.partition(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.partition(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.rpartition(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.rpartition(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.join(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.join(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128)
    >>> x.center(100, y[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.center(100, x[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128)
    >>> x.ljust(100, y[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.ljust(100, x[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128)
    >>> x.rjust(100, y[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.rjust(100, x[-1])
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 0: ordinal not in range(128)
    >>> x.find(y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.find(x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.replace('x', y)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> x.replace(y, 'x')
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.replace('x', x)
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)
    >>> y.replace(x, 'x')
    Traceback (most recent call last):
      . . .
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 12: ordinal not in range(128)


    >>> x = SourcedString('ascii byte string', 'str')
    >>> y = SourcedString(u'unicode string \ubbbb', 'unicode')

But these will all work, because x is ASCII:

    >>> x+y
    u'ascii byte stringunicode string \ubbbb'@[0:17,0:16]
    >>> y+x
    u'unicode string \ubbbbascii byte string'@[0:16,0:17]
    >>> x.__radd__(y)
    u'unicode string \ubbbbascii byte string'@[0:16,0:17]
    >>> y.__radd__(x)
    u'ascii byte stringunicode string \ubbbb'@[0:17,0:16]
    >>> x.find(y)
    -1
    >>> y.find(x)
    -1
    >>> x.lstrip(y)
    'ascii byte string'@[0:17]
    >>> y.lstrip(x)
    u'unicode string \ubbbb'@[0:16]
    >>> x.rstrip(y)
    'ascii by'@[0:8]
    >>> y.rstrip(x)
    u'unicode string \ubbbb'@[0:16]
    >>> x.strip(y)
    'ascii by'@[0:8]
    >>> y.strip(x)
    u'unicode string \ubbbb'@[0:16]
    >>> x.split(y)
    [u'ascii byte string'@[0:17]]
    >>> y.split(x)
    [u'unicode string \ubbbb'@[0:16]]
    >>> x.rsplit(y)
    [u'ascii byte string'@[0:17]]
    >>> y.rsplit(x)
    [u'unicode string \ubbbb'@[0:16]]
    >>> x.partition(y)
    ('ascii byte string'@[0:17], ''@[17:17], ''@[17:17])
    >>> y.partition(x)
    (u'unicode string \ubbbb'@[0:16], u''@[16:16], u''@[16:16])
    >>> x.rpartition(y)
    (''@[0:0], ''@[0:0], 'ascii byte string'@[0:17])
    >>> y.rpartition(x)
    (u''@[0:0], u''@[0:0], u'unicode string \ubbbb'@[0:16])
    >>> x.join(y) # doctest: +ELLIPSIS
    u'uascii byte stringnascii byte stringiascii byte stri...5:16]
    >>> y.join(x) # doctest: +ELLIPSIS
    u'aunicode string \ubbbbsunicode string \ubbbbcunicode...6:17]
    >>> x.center(100, y[-1]) # doctest: +ELLIPSIS
    u'\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubb...5:16]
    >>> y.center(100, x[-1]) # doctest: +ELLIPSIS
    u'ggggggggggggggggggggggggggggggggggggggggggunicode st...6:17]
    >>> x.ljust(100, y[-1]) # doctest: +ELLIPSIS
    u'ascii byte string\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbb...5:16]
    >>> y.ljust(100, x[-1]) # doctest: +ELLIPSIS
    u'unicode string \ubbbbggggggggggggggggggggggggggggggg...6:17]
    >>> x.rjust(100, y[-1]) # doctest: +ELLIPSIS
    u'\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubbbb\ubb...0:17]
    >>> y.rjust(100, x[-1]) # doctest: +ELLIPSIS
    u'gggggggggggggggggggggggggggggggggggggggggggggggggggg...0:16]
    >>> x.find(y)
    -1
    >>> y.find(x)
    -1
    >>> x.replace('x', y)
    u'ascii byte string'@[0:17]
    >>> x.replace(y, 'x')
    u'ascii byte string'@[0:17]
    >>> y.replace('x', x)
    u'unicode string \ubbbb'@[0:16]
    >>> y.replace(x, 'x')
    u'unicode string \ubbbb'@[0:16]


Translate

    >>> table = [chr(i) for i in range(256)]
    >>> table[ord('e')] = '3'
    >>> table[ord('!')] = '|'
    >>> newt.translate(''.join(table))
    'n3wt|'@[21:26]
    >>> newt.translate(''.join(table), 'n3t')
    '3w|'@[22:24,25:26]

    >>> newt.decode().translate({'e':'3', '!':'*'})
    u'n3wt*'@[21:26]
    >>> newt.decode().translate({'e':'3', '!':'*', 'w': None})
    u'n3t*'@[21:23,24:26]