;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-

;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars.  requires LKB version of 1-feb-09 or
;;; newer.  note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulator are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns.  empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization.  for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing.  the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs.  specifically, many punctuation marks will be re-combined with 
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.  
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large
;;; should yield very similar results.  with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena.  however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all.  also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common.  full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes.  we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB.  the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs).  we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;


;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;; 
@$Date: 2009-02-06 08:33:49 +0100 (fre, 06 feb 2009) $

;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+

;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)') in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;

;;
;; pad the full string with trailing and leading whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators with a space.
;;
!^(.+)$			\1
!  +			
!\t			

;;
;; a set of `mark-up modules', often replacing mark-up character entitities
;; with actual UniCode characters (e.g. |&mdash;| or |---|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.
;;
>xml
>latex
>ascii
>wiki

;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e. a numeric range (typically tokenized off, i.e. |42| |–| |43|).  maybe
;; the latter can also occur between non-numbers?  we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?;;                                                              (24-sep-08; oe)
;;
!([^.])\.\.\.+([^.])			\1 … \2
!\[\.\.\.\]			…
!([0-9]) *\.\. *([0-9])			\1 – \2

!(^| )-( |$)			–
;;
;; some UniCode characters force token boundaries: m-dash, n-dash, ellipsis.
;;
!([—–…])			\1

;; FIXME: For now tokenize punctuation marks off (for TnT and Negra compatibility)

!(.*)([\.,?!])			\1 \2
!(.*)([\.,?!])$			\1 \2


;;
;; deviating from the PTB conventions, we use one-character double quote marks
;; (i.e. |“| and |"| instead of |``| and |''|); much like the PTB, however, we
;; aim to disambiguate neutral quotes (|"| and |''|) at the string level, i.e.
;; opening quotes are preceded by a token boundary (white space), with a small
;; number of additional, token-initial characters than can intervene.  anything
;; else, we assume, is a closing quote.  rather than the proper UniCode closing
;; quote (|”|), however, use a straight double quote (|"|), which can double as
;; a unit of measure (feet).  do the same for single quotes, using apostrophes
;; (|'|) rather than proper closing quotes (|’|), to allow ambiguity with the
;; possessive maker, specifically when following |s|.  to not create spurious
;; ambiguity, preserve UniCode closing quotes, if used in the input.
;;
;; convert quotes to single characters prior to tokenizing off other characters
;; (group #1 below) to make adjacent whitespace detection easier, as e.g. in
;; |``$20!''|.
;;
;; _fix_me_
;; in principle, i just discovered, there are separate prime and double prime
;; UniCode characters, intended for the units of measure.  i doubt we see them
;; in any of the existing data sets, but in carefully edited documents, they
;; may show up eventually.  assuming these are never used as quotes, we should
;; probably preserve them here.  but as for the distinction between straight
;; and closing quotes, i now suspect we might see the closing quotes as a unit
;; of measure too.  hence, consider ditching straight quotes altogether.
;;                                                              (23-jan-09; oe)
;;
!``			“
!(^| [[({]*)("|\'\')			\1 “ 

!\'\'			”
!`			‘
!(^| [[({]*)\'			\1 ‘

;;
;; normalize stylistic variance in (directional) quote marks.  once these rules
;; are complete, we are down to only six quote marks: |“|, |”|, |"|, |‘|, |’|,
;; and |'|.  of these, the straight ones (the traditional ASCII characters) are
;; ambiguous between being a closing quote and something else.
;;
![„«]			“
![»]			”
![‚‹]			‘
![›]			’


;;; FIXME: Tokenise  quotes off for now: 

!([^ ]+) *["“”‘’] *([.,!?:;])			\1 \2

!( ["“”‘’])([^ ]+)			\1 \2
!([^ ]+)(["“”‘’])			\1 \2

;;;!([^ ]+)_\+\+\+			\1

;;
;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to
;; avoid stripping off their apostrophes.
;;
! ([OlL]) ['’]			\1'


;;; FIXME: Remove quotes for now: 

!(["“””‘’])			
!((``|\'\'))			

;;
;; a new REPP facility: named groups and iterative group calls.  there are a
;; number of characters that PTB tokenizes off (unconditionally, it seems, in
;; the original `tokenizer.sed'), though not when they are parts of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|.  thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters.  it seems an iterative group is
;; the most straightforward way of getting that effect.  the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches.  we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces.  thus, ditch multiple spaces initially.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses.  to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously 
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; like in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
!  +			
#1
!([^ ])([][(){}?!,;:@#$€¢£¥%&“”"‘’']) ([^ ]|$)			\1 \2 \3
!([^ ])\. ([])}”"’' ]*)$			\1 . \2
!(^|[^ ]) ([][(){}?!,;:@#$€¢£¥%&“”"‘’'])([^ ])			\1 \2 \3
#

>1

;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there 
;; is a REPP module to undo PTB-style separation of tokens.  this module will
;; only be activated for use within the LKB, not by preprocess-for-pet().
;;
>gg