;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-
;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars.  requires LKB version of 1-feb-09 or
;;; newer.  note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulators are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns.  empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker; this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization.  for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing.  the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs.  specifically, many punctuation marks will be re-combined with
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script and by and large
;;; should yield very similar results.  with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena.  however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all.  also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common.
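As a cross-check of the line syntax described above, here is a minimal Python sketch (not part of the REPP file, and `parse_repp_line` is a hypothetical helper, not an actual LKB or PET API): the rule type marker sits in column 1, the pattern runs from column 2 to the first tabulator, one or more tabulators separate pattern from replacement, and empty or semicolon-initial lines are ignored.

```python
def parse_repp_line(line):
    """Sketch of the REPP line syntax: marker in column 1, pattern up to the
    first tab, one or more tabs before the replacement; other whitespace is
    part of the pattern/replacement fields themselves."""
    if not line or line[0] == ";":
        return None                         # empty lines and `;' lines: ignored
    marker = line[0]                        # rule type: `!', `>', `:', `#', ...
    body = line[1:]                         # pattern starts in column 2
    pattern, _, replacement = body.partition("\t")
    replacement = replacement.lstrip("\t")  # one or more tabs as separator
    return marker, pattern, replacement
```

Note that trailing spaces in the replacement field are significant (several rules below end in a space), which is why only tabulators, not general whitespace, delimit the fields.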
;;; full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes.  we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB.  the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs).  we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;

;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;;
@$Date: 2009-02-06 08:33:49 +0100 (fre, 06 feb 2009) $

;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+

;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)' in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;

;;
;; pad the full string with leading and trailing whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators with a space.
;;
!^(.+)$	 \1 
! +	 
!\t	 

;;
;; a set of `mark-up modules', often replacing mark-up character entities
;; with actual UniCode characters (e.g. |&mdash;| or |---|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.
;;
>xml
>latex
>ascii
>wiki

;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e.
;; a numeric range (typically tokenized off, i.e. |42| |–| |43|).  maybe
;; the latter can also occur between non-numbers?  we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?
;;                                                             (24-sep-08; oe)
;;
!([^.])\.\.\.+([^.])	\1 … \2
!\[\.\.\.\]	 … 
!([0-9]) *\.\. *([0-9])	\1 – \2
!(^| )-( |$)	 – 

;;
;; some UniCode characters force token boundaries: m-dash, n-dash, ellipsis.
;;
!([—–…])	 \1 

;; FIXME: for now, tokenize punctuation marks off (for TnT and Negra
;; compatibility).
!(.*)([\.,?!]) 	\1 \2 
!(.*)([\.,?!])$	\1 \2

;;
;; deviating from the PTB conventions, we use one-character double quote marks
;; (i.e. |“| and |"| instead of |``| and |''|); much like the PTB, however, we
;; aim to disambiguate neutral quotes (|"| and |''|) at the string level, i.e.
;; opening quotes are preceded by a token boundary (white space), with a small
;; number of additional, token-initial characters that can intervene.  anything
;; else, we assume, is a closing quote.  rather than the proper UniCode closing
;; quote (|”|), however, we use a straight double quote (|"|), which can double
;; as a unit of measure (feet).  do the same for single quotes, using
;; apostrophes (|'|) rather than proper closing quotes (|’|), to allow
;; ambiguity with the possessive marker, specifically when following |s|.  to
;; not create spurious ambiguity, preserve UniCode closing quotes, if used in
;; the input.
;;
;; convert quotes to single characters prior to tokenizing off other characters
;; (group #1 below) to make adjacent whitespace detection easier, as e.g. in
;; |``$20!''|.
;;
;; _fix_me_
;; in principle, i just discovered, there are separate prime and double prime
;; UniCode characters, intended for the units of measure.  i doubt we see them
;; in any of the existing data sets, but in carefully edited documents, they
;; may show up eventually.
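The period rules above can be transliterated into Python `re` calls almost verbatim, since Python shares the `\1`-style group references; the following is only an illustrative sketch (`normalize_periods` is a hypothetical name), not code from the grammar toolchain:

```python
import re

def normalize_periods(s):
    """Transliteration of the three period rules above: ASCII ellipsis to |…|,
    bracketed ellipsis to a free-standing |…|, and |..| between digits to an
    n-dash marking a numeric range."""
    s = re.sub(r"([^.])\.\.\.+([^.])", r"\1 … \2", s)
    s = re.sub(r"\[\.\.\.\]", " … ", s)
    s = re.sub(r"([0-9]) *\.\. *([0-9])", r"\1 – \2", s)
    return s
```

As in the REPP file, any doubled spaces these replacements introduce are harmless, because the string is later broken at runs of whitespace anyway.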
;; assuming these are never used as quotes, we should
;; probably preserve them here.  but as for the distinction between straight
;; and closing quotes, i now suspect we might see the closing quotes as a unit
;; of measure too.  hence, consider ditching straight quotes altogether.
;;                                                             (23-jan-09; oe)
;;
!``	“
!(^| [[({]*)("|\'\')	\1“
!\'\'	”
!`	‘
!(^| [[({]*)\'	\1‘

;;
;; normalize stylistic variance in (directional) quote marks.  once these rules
;; are complete, we are down to only six quote marks: |“|, |”|, |"|, |‘|, |’|,
;; and |'|.  of these, the straight ones (the traditional ASCII characters) are
;; ambiguous between being a closing quote and something else.
;;
![„«]	“
![»]	”
![‚‹]	‘
![›]	’

;;; FIXME: tokenize quotes off for now:
!([^ ]+) *["“”‘’] *([.,!?:;])	\1 \2
!( ["“”‘’])([^ ]+)	\1 \2
!([^ ]+)(["“”‘’])	\1 \2
;;;!([^ ]+)_\+\+\+	\1

;;
;; remove the space after initial |O'| and |L'|, i.e. irish and romance names,
;; to avoid stripping off their apostrophes.
;;
! ([OlL]) ['’]	 \1'

;;; FIXME: remove quotes for now:
!(["“””‘’])	
!((``|\'\'))	

;;
;; a new REPP facility: named groups and iterative group calls.  there are a
;; number of characters that the PTB tokenizes off (unconditionally, it seems,
;; in the original `tokenizer.sed'), though not when they are part of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|.  thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters.  it seems an iterative group is
;; the most straightforward way of getting that effect.
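The double-quote portion of the disambiguation strategy described above can be sketched in Python as follows; this is a partial illustration only (it omits the single-quote and directional-variant rules), and `disambiguate_quotes` is a hypothetical name, not part of any DELPH-IN tool:

```python
import re

def disambiguate_quotes(s):
    """Sketch of the double-quote rules above: |``| always opens; a straight
    quote or PTB |''| opens only after a token boundary, optionally with
    bracket characters intervening; any remaining |''| is a closing quote,
    while remaining straight |"| is left untouched (closing quote or unit of
    measure)."""
    s = re.sub(r"``", "“", s)
    s = re.sub(r"(^| [\[({]*)(\"|'')", r"\1“", s)
    s = re.sub(r"''", "”", s)
    return s
```

The ordering matters: neutral quotes are resolved to opening quotes first, so the later catch-all for `''` only ever sees closing occurrences.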
;; the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches.  we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces.  thus, ditch multiple spaces initially.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses.  to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; as in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
! +	 
#1
!([^ ])([][(){}?!,;:@#$€¢£¥%&“”"‘’']) ([^ ]|$)	\1 \2 \3
!([^ ])\. ([])}”"’' ]*)$	\1 . \2
!(^|[^ ]) ([][(){}?!,;:@#$€¢£¥%&“”"‘’'])([^ ])	\1 \2 \3
#
>1

;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there
;; is a REPP module to undo the PTB-style separation of tokens.  this module
;; will only be activated for use within the LKB, not by preprocess-for-pet().
;;
>gg
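The iterative-group semantics can be emulated in Python as a fixpoint loop, which also illustrates why adjacency to whitespace propagates transitively through clusters like |(42%),|. This is a reduced sketch under stated assumptions: the character class is abbreviated, the sentence-final period rule is omitted, and the input is assumed to be already padded and squashed (the `! +` rule above), which is what guards against duplicate-space recursion.

```python
import re

PUNCT = r"[][(){}?!,;:%&]"  # reduced version of the class used in group #1

def iterative_group(rules, text):
    """Sketch of a REPP iterative group: the rules are re-applied in order
    until the string stops changing, so tokenizing one character off can
    feed further matches on the next pass."""
    while True:
        before = text
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)
        if text == before:
            return text

group1 = [
    # tokenize a punctuation character off when it is followed by whitespace
    (r"([^ ])(" + PUNCT + r") ([^ ]|$)", r"\1 \2 \3"),
    # ... and when it is preceded by whitespace
    (r"(^|[^ ]) (" + PUNCT + r")([^ ])", r"\1 \2 \3"),
]
```

On the padded input `" (42%), "`, successive passes peel off `,`, then `)`, then `%`, reaching the fixpoint `" ( 42 % ) , "` in a few iterations.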