;;; -*- mode: fundamental; coding: utf-8; indent-tabs-mode: t; -*- ;;; ;;; Copyright (c) 2009 -- 2010 Stephan Oepen (oe@ifi.uio.no); ;;; see `LICENSE' for conditions. ;;; ;; ;; deviating from the PTB conventions, we use one-character double quote marks ;; (i.e. |“| and |”| instead of |``| and |''|); much like the PTB, however, we ;; aim to disambiguate neutral quotes (|"| and |''|) at the string level, i.e. ;; opening quotes are preceded by a token boundary (white space), with a small ;; number of additional, token-initial characters than can intervene; anything ;; else, we assume, is a closing quote. ;; ;; convert quotes to single characters prior to tokenizing off other characters ;; (group #1 below) to make adjacent whitespace detection easier, as e.g. in ;; |``$20!''|. closing quotes can double as apostrophes and units of measure, ;; i.e. feet and inches, or seconds and minutes. ;; ;; ;; it appears we cannot trust writers to use `funny' quotes properly, hence we ;; neuter them to straight double or single quotes, which will then go through ;; disambiguation, based on adjacency to token boundaries. ;; ![«»] " ![‹›] ' ;; ;; _fix_me_ ;; in bio-medical texts we see names with double or triple apostrophes, e.g. ;; |Figure B''| or |(A–C'')| (presumably in a figure caption). clearly, the ;; LaTeX-style conventions are incompatible with such usage of the apostrophe, ;; and probably we should limit support for LaTeX-style quotes to the LaTeX ;; REPP module. at present, however, i doubt the ERG would do the right thing ;; for double-apostrophe inputs anyway, and a full analysis of |A'| and |A''| ;; could be expensive in terms of extra ambiguity. discuss this with dan, one ;; fine day. (13-mar-10; oe) ;; ;; ;; the ‘raw’ WSJ texts every now and again contain sentence-final quotes that ;; are preceded by whitespace (many of these look like quotes that actually ;; should be part of the following sentence, i.e. were moved across a sentence ;; boundary). for robustness, force directionality of quotes in sentence- ;; initial and final positions. ;; !^([[({“‘' ]*)("|``) \1“ !^([[({“‘ ]*)'(?![0-9]{2}s?) \1‘ !("|'')([])}”’' ]*)$ ”\2 !'([])}”’ ]*)$ ’\1 #1 !(„|``) “ !(^| [[({“‘]*)("|'') \1“ !("|'') ” !(‚|`) ‘ !(^| [[({“‘]*)'(?![0-9]{2}s?) \1‘ !' ’ # >1