;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*-
;;;
;;; Copyright (c) 2009 -- 2010 Stephan Oepen (oe@ifi.uio.no);
;;; see `LICENSE' for conditions.
;;;
;;;
;;; another shot at a finite-state language for preprocessing, normalization,
;;; and tokenization in LKB grammars.  requires LKB version of 1-feb-09 or
;;; newer.  note that the syntax is rigid: everything starting in column 2
;;; (i.e. right after the rule type marker) is used as the match pattern until
;;; the first `\t' (tabulator sign); one or more tabulators are considered the
;;; separator between the matching pattern and the replacement, but other
;;; whitespace will be considered part of the patterns.  empty lines or lines
;;; with a semicolon in column 1 (i.e. in place of the rule type marker, this
;;; is not Lisp) will be ignored.
;;;
;;; this is a fresh attempt (as of September 2008) at input tokenization.  for
;;; increased compatibility with existing tools (specifically taggers trained
;;; on the PTB), we now assume a PTB-like tokenization in pre-processing.  the
;;; grammar includes token mapping rules (using the new chart mapping machinery
;;; in PET) to eventually adjust (i.e. correct, in some cases) tokenization to
;;; its needs.  specifically, many punctuation marks will be re-combined with
;;; preceding or following tokens, reflecting standard orthographic convention,
;;; and are then analyzed as pseudo-affixes.
;;;
;;; this file is inspired by the PTB `tokenizer.sed' script, and by and large
;;; should yield very similar results.  with the addition of token mapping as
;;; a separate step inside the parser, we want to restrict RE-based processing
;;; to pure string-level phenomena.  however, to actually tokenize (following
;;; some set of principles), we need to do more than just break at whitespace.
;;; some punctuation marks give rise to token boundaries, but not all.  also,
;;; inputs (in the 21st century) may contain some amount of mark-up, where XML
;;; character references have become relatively common.  full UniCode support
;;; now makes it possible to represent a much larger range of characters, e.g.
;;; various types of quotes and dashes.  we aim to map mark-up to corresponding
;;; UniCode characters, and preserve those in parsing, as much as possible.
;;;
;;; the original `tokenizer.sed' script actually cannot always yield the exact
;;; tokenization found in the PTB.  the script unconditionally separates a set
;;; of punctuation or other non-alphanumeric characters (e.g. |&| and |!|) that
;;; may be part of a single token (say in |AT&T| or URLs).  we aim to do better
;;; than the original script, here, conditioning on adjacent whitespace.
;;;

;;
;; preprocessor rules versioning; auto-maintained upon CVS (or SVN) check-in.
;;
@$Date: 2010-08-20 14:45:36 +0200 (fre, 20 aug 2010) $

;;
;; tokenization pattern: after normalization, the string will be broken up at
;; each occurrence of this pattern; the pattern match itself is deleted.
;;
:[ \t]+

;;;
;;; string rewrite rules: all matches, over the entire string, are replaced by
;;; the right-hand side; grouping (using `(' and `)' in the pattern) and group
;;; references (`\1' for the first group, et al.) carry over part of the match.
;;;

;;
;; pad the full string with trailing and leading whitespace; makes matches for
;; word boundaries a little easier down the road; also, squash multiple spaces
;; and replace tabulators with a space.
;;
!^(.+)$		 \1 
!\t		 
!  +		 
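
;;
;; for illustration (a hypothetical raw input, writing `\t' for a literal
;; tabulator): the three rules above should turn |the\tquick  fox| into
;; | the quick fox |, i.e. the string is padded with one space at either end,
;; tabulators become spaces, and runs of spaces collapse to a single space.
;;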

;;
;; a set of `mark-up modules', often replacing mark-up character entities with
;; actual UniCode characters (e.g. |&mdash;| or |---| as |—|), or just ditching
;; mark-up that has no bearing on parsing for now (e.g. most wiki mark-up).
;; these modules can be activated selectively by name in the REPP environment
;; or the top-level call into REPP.
;;
>xml
>latex
>ascii
>wiki

;;
;; we hope to keep it to a minimum, and only invoke it in select
;; configurations, but at times we need to heuristically `fix up' deficiencies
;; in the input; for example spurious duplicate punctuation, and such.
;;
>robustness

;;
;; our treatment of quote marks continues to evolve.  as of february 2010, it
;; seems we need to support two distinct scenarios: (a) inputs using standard
;; typography (quotes either are directional already or can be disambiguated by
;; adjacency to preceding or following whitespace) vs. (b) inputs that already
;; have been pre-tokenized or otherwise damaged (as in the current CoNLL 2010
;; training data), where we stand no chance of disambiguating quote marks.  in
;; downstream processing, we thus need to handle six types of quotes: genuine
;; directional quotes (|“|, |”|, |‘|, and |’|) and ambiguous, straight ones
;; (|"| and |'|).  the former group can be restricted to opening and closing
;; functions, and in the case of closing quotes also to apostrophes and units
;; of measure (feet and inches, or seconds and minutes; for which in `proper'
;; typography the double prime and prime symbols could be used); the straight
;; quotes, on the other hand, need to be considered ambiguous between opening
;; and closing, apostrophe, or prime usages.                   (9-feb-10; oe)
;;
>quotes

;;
;; two special cases involving periods: map ASCII ellipsis (|...|) to a single
;; UniCode character (|…|), and convert |..| between numbers into an n-dash,
;; i.e. a numeric range (typically tokenized off, i.e. |42| |–| |43|).  maybe
;; the latter can also occur between non-numbers?  we could also just preserve
;; it, but always make it a token in its own right?
;;
;; _fix_me_
;; what about a sentence-final period following the ellipsis (as in cb/7060)?
;;                                                            (24-sep-08; oe)
;;
!([^.])\.\.\.+([^.])		\1 … \2
!\[\.\.\.\]		…
!([0-9]) *\.\. *([0-9])		\1 – \2

;;
;; some UniCode characters force token boundaries: m-dash, ellipsis.  we used
;; to include n-dashes in this list, but some authors (e.g. ESR) use n-dashes
;; much like hyphens, attempting to be clever about bracketing in expressions
;; like |non–source-aware| [users].
;;
!([—…])		 \1 

;;
;; remove space after initial |O'| and |L'|, i.e. irish and romance names, to
;; avoid stripping off their apostrophes.
;;
! ([OL])['’] 		 \1'

;;
;; a new REPP facility: named groups and iterative group calls.  there are a
;; number of characters that PTB tokenizes off (unconditionally, it seems, in
;; the original `tokenizer.sed'), though not when they are parts of names or
;; NE patterns, e.g. |AT&T| or |http://www.emmtee.net/?foo.php&bar=42|.  thus,
;; we only want these as separate tokens when they are preceded or followed by
;; whitespace; this leaves a problem with, say, |http://www.emmtee.net/|, where
;; one would have to apply NE recognition (what used to be `ersatzing') _prior_
;; to tokenization.
;;
;; either way, because characters we want to tokenize off might be `clustered'
;; with each other, e.g. |(42%), |, the notion of adjacent whitespace needs to
;; apply transitively through such clusters.  it seems an iterative group is
;; the most straightforward way of getting that effect.  the rules from the
;; group will be applied repeatedly (in order) at the time the group is called
;; (by means of the `>' operator), until there are no further matches.  we need
;; to be careful to avoid indefinite recursion within the group, i.e. not add
;; duplicate spaces.  thus, once again, make sure the full string is bracketed
;; by single spaces, and ditch multiple spaces initially; then we can require a
;; non-whitespace preceding or following context, besides the actual spaces.
;;
;; at this point, we exclude a few punctuation characters from this policy, in
;; part because that is the PTB approach (|-| and |/|), in part because they
;; can be prefixes or suffixes of one-token named entities, i.e. |<| and |>| in
;; URLs and email addresses.  to work around these, we may need a string-level
;; `ersatzing' facility, associating a sub-string (that can be unambiguously
;; identified by surface properties, e.g. a URL) with an identifier of a token
;; class.
;;
;; like in the original PTB script, periods are only tokenized off in sentence-
;; final position, maybe followed only by closing quote marks or parentheses.
;;
;; the story for parentheses and square brackets is somewhat involved, to avoid
;; breaking up inputs like |factor(s)| or |Ca[2+]|.
;;
;; _fix_me_
;; our current solution unconditionally strips off leading parentheses, which
;; would break an input like |(mis-)read|.  talk to dan about how best to deal
;; with these (maybe see whether we find actual examples).    (26-feb-10; oe)
;;
;; _fix_me_
;; there is an issue with some of the characters that are asserted (at least in
;; whitespace adjacency) to constitute separate tokens, specifically the dollar
;; sign.  inputs like |HK$ 7.8| will end up with a bogus token boundary (which
;; is the case too in the original PTB sed(1) script).        (15-jul-09; oe)
;;
!^(.+)$		 \1 
!  +		 

#1

!([^ (]+)(\)) ([^ ]|$)		\1 \2 \3
!([^ []+)(]) ([^ ]|$)		\1 \2 \3
!([^ ])([]}?!,;:@#$€¢£¥%&”"’']) ([^ ]|$)		\1 \2 \3
!([^ ])\. ([])}”"’' ]*)$		\1 . \2
!(^|[^ ]) ([[({:@#$€¢£¥%&“‘])([^ ])		\1 \2 \3

#

>1
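
;;
;; for illustration (a hypothetical, already normalized input): applied to
;; | brown (42%), said foo. |, the group above should converge on
;; | brown ( 42 % ) , said foo . |: the comma, the sentence-final period, and
;; the opening |(| come off on the first pass (they are already adjacent to
;; whitespace), which in turn exposes the |)| and the |%| to the same rules
;; on the next pass.
;;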

;;
;; any word-final apostrophe, by now, should be separated (e.g. |abrams'| -->
;; |abrams '|).  which only leaves contracted forms, including the undesirable
;; PTB ones, e.g. |don't| --> |do n't|.  but not |cannot| --> |can not| and the
;; more obscure ones: |gimme|, |lemme|, |'tis|, |wanna|, et al.
;;
;; _fix_me_
;; the |cannot| case, especially without characterization information, is a bit
;; challenging: it presumably is frequent enough so that for PTB compliance we
;; should pull it apart, but that would seem to introduce unwanted ambiguity.
;; i doubt that |she cannot participate on monday| has the reading of her being
;; able to `not participate' (stay out of the way) on monday.  or does it?
;;                                                            (19-sep-08; oe)
;;
;;
;; as it stands, we violate what we said above about quotes: in the contracted
;; auxiliaries and possessives, we use a straight (ASCII) quote to represent
;; the apostrophe, not the closing single quote (as would be recommended by the
;; UniCode standard).  the cost of conversion would be relatively high in this
;; case: many lexical entries would need to be duplicated for `correct' quote
;; mode, and still be kept with ASCII apostrophes for robust mode.
;;
!([^ ])[’']([dDmMsS]) 		\1 '\2 
!([^ ])[’'](ll|LL|re|RE|ve|VE) 		\1 '\2 
!([^ ])(n[’']t) 		\1 n't 
!([^ ])(N[’']T) 		\1 N'T 
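
;;
;; for illustration (hypothetical inputs): the four rules above should rewrite
;; |she’ll| as |she 'll|, |kim’s| as |kim 's|, and |don’t| (or |don't|) as
;; |do n't|, i.e. the contracted part becomes a token of its own, and a curly
;; apostrophe in these forms is normalized to the straight ASCII apostrophe,
;; as discussed above.
;;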

;;
;; to allow parsing (of inputs involving basic punctuation) in the LKB, there
;; is a REPP module to undo PTB-style separation of tokens.  this module will
;; only be activated for use within the LKB, not by preprocess-for-pet().
;;
>lkb