this directory contains various summary views of the LOGON development and test
corpus.  the files here are typically produced using a combination of one or
more of the scripts in `$LOGONROOT/uio/data/bin/' (kindly provided by lars
nygaard) and some of the standard Un*x text utilities.  please read on to see
what the individual files are, and how they were constructed.  where a recipe
for creating a file is provided, the resulting file was compiled by me; all
others (i.e. everything requiring access to the test data) were provided by
lars.                                                           (6-nov-06; oe)


+ jh.no.forms, ps.no.forms, tg.no.forms, jhpstg.no.forms

  these four files are frequency-ordered lists of tokens (word forms): one for
  each of the three development corpora, plus one combined list
  (`jhpstg.no.forms').  capitalization is preserved from the original texts,
  and sentence-initial forms are flagged with an asterisk (`*').  some
  punctuation marks have been removed.

  cat $LOGONROOT/uio/data/jh{0,1,2,3,4,5}.txt > $LOGONROOT/uio/data/jh.txt

  $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/jh.txt \
  | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
  > $LOGONROOT/uio/data/lists/jh.no.forms

  $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/ps.txt \
  | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
  > $LOGONROOT/uio/data/lists/ps.no.forms

  $LOGONROOT/uio/data/bin/extr_vocab_fan.pl < $LOGONROOT/uio/data/tg.txt \
  | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
  > $LOGONROOT/uio/data/lists/tg.no.forms

  cat $LOGONROOT/uio/data/{jh,ps,tg}.txt \
  | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
  | sed -e 's/^\*\*/\*/g' -e 's/^\*+/\*/g' | sort | uniq -c | sort -nr \
  > $LOGONROOT/uio/data/lists/jhpstg.no.forms
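
  as a minimal sketch of what the counting stage of these recipes does, the
  following runs the same `sort | uniq -c | sort -nr' idiom over a handful of
  made-up tokens.  the toy input stands in for the output of
  `extr_vocab_fan.pl' (assumed here to be one token per line), and the
  `count_forms' helper is hypothetical, not part of the LOGON tree.  the sed
  expression collapses any run of two or more leading asterisks into one,
  which is only one reading of what the recipes above intend.

```shell
# hypothetical helper: normalize leading asterisks, then count and rank forms
count_forms() {
  sed -e 's/^\*\{2,\}/*/' | sort | uniq -c | sort -nr
}

# made-up tokens, one per line, standing in for extr_vocab_fan.pl output
printf '%s\n' 'og' '*Det' 'og' '**Det' 'fjell' 'og' | count_forms
```

  each output line is a count followed by a form, most frequent first; the
  width of the padding before the count depends on the local `uniq'.  in basic
  (non `-E') sed syntax a bare `+' is an ordinary character, which is why the
  sketch spells out the repetition with a `\{2,\}' interval.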


+ jhk.no.forms, jhk.en.forms, psk.no.forms, tgk.no.forms

  word lists (most likely compiled using the same `extr_vocab_fan.pl' script as
  for the development corpus) for the known-vocabulary test segments of JH, PS,
  and TG.


+ jhk.no.new, psk.no.new, tgk.no.new

  set differences of, for example, `jhk.no.forms' minus `jhpstg.no.forms'.  in
  other words, word forms found in one of the known-vocabulary test segments
  but nowhere in the development corpus.

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{jhpstg,jhk}.no.forms \
  | sort > $LOGONROOT/uio/data/lists/jhk.no.new

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{jhpstg,psk}.no.forms \
  | sort > $LOGONROOT/uio/data/lists/psk.no.new

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{jhpstg,tgk}.no.forms \
  | sort > $LOGONROOT/uio/data/lists/tgk.no.new
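
  for readers without access to `diff_vocab_lists.pl', the set difference can
  be approximated with standard tools.  this is a sketch under assumptions, not
  a description of what the perl script actually does: it assumes both inputs
  are in `uniq -c' format (count, then form), compares forms only, and prints
  the forms of the second list that are missing from the first.  the
  `new_forms' helper and the toy data are hypothetical.

```shell
# hypothetical helper: forms (column 2) of list $2 that do not occur in list $1
new_forms() {
  awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember its forms
       !($2 in seen) { print $2 }         # second file: print unseen forms
      ' "$1" "$2" | LC_ALL=C sort
}

tmp=$(mktemp -d)
# made-up development and test lists in `uniq -c' format
printf '%s\n' '3 og' '2 *Det' '1 fjell' > "$tmp/dev.forms"
printf '%s\n' '2 og' '1 bre' '1 *Vi' > "$tmp/test.forms"

new_forms "$tmp/dev.forms" "$tmp/test.forms"
```

  the argument order mirrors the recipes above (reference list first, list
  under inspection second).  `LC_ALL=C' only pins the sort order of the
  sketch; it is not taken from the original recipes.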


+ old/jhpstg.no.forms

  a legacy copy of the former (incomplete) `$LOGONROOT/uio/data/lists/a.form'.


+ old/psk.no.forms, old/psk.en.forms, old/tgk.no.forms, old/tgk.en.forms

  legacy copies of the former word lists for the known-vocabulary held-out 
  parts of PS and TG (which used to be in `$LOGONROOT/uio/data/test-vocab/').
  these are now superseded by files of the same name in the parent directory, 
  because the PS and TG parts of the test corpus had to be reduced in size, in
  order to make the proportions of items from the three distinct sources 
  parallel to the distribution in the development data.  with a total of 200 JH
  items held out, the test segments of PS and TG had to be limited to 30 and 95
  items, respectively (see `maintainers' email around 2-nov-06).


+ old/jhpstg.no.new

  the set difference of the current (complete) word list for the development
  parts of JHPSTG, minus the earlier (incomplete) list of JHPSTG word forms. 

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{old/jhpstg,jhpstg}.no.forms \
  | sort -u > $LOGONROOT/uio/data/lists/old/jhpstg.no.new

  note that this list (with 1348 entries) over-estimates the amount of missing
  vocabulary in the original list.  the original list was compiled by
  downcasing all forms and without the asterisk flag on sentence-initial forms,
  hence quite a few of the gaps reported in `old/jhpstg.no.new' are non-issues.


+ old/jhpstg.no.surprise

  another take on the same set difference, attempting to wash out the effect of
  capitalization and initial asterisks.

  gawk '{ sub(/^\*/, "", $2); 
          printf("%s %s\n", $1, tolower($2)); }' \
    $LOGONROOT/uio/data/lists/jhpstg.no.forms > /tmp/jhpstg.forms.new
  gawk '{ printf("%s %s\n", $1, tolower($2)); }' \
    $LOGONROOT/uio/data/lists/old/jhpstg.no.forms > /tmp/jhpstg.forms.old
  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    /tmp/jhpstg.forms.old /tmp/jhpstg.forms.new \
  | sort -u > $LOGONROOT/uio/data/lists/old/jhpstg.no.surprise

  this time we end up with 461 forms, and the list looks plausible to me (oe).
  however, this set could in principle under-estimate the number of forms that
  were missing in the earlier JHPSTG word list: once everything is downcased,
  it could happen that the proper name `Ås' gets conflated with the common noun
  `ås'.  assuming we wanted both, if the former were in the incomplete list but
  not the latter, the common noun would not be in `old/jhpstg.no.surprise'.
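
  the conflation effect can be reproduced on toy data.  the sketch below is
  illustrative only: the lists are made up, the ASCII pair `Berg'/`berg' stands
  in for `Ås'/`ås' (to stay independent of how the local awk downcases
  non-ASCII letters), and `fold_forms' is a hypothetical helper that uses
  `comm' in place of `diff_vocab_lists.pl'.

```shell
# hypothetical helper: strip one leading asterisk and downcase each form
fold_forms() {
  awk '{ sub(/^\*/, "", $2); print tolower($2) }' "$1" | LC_ALL=C sort -u
}

tmp=$(mktemp -d)
# earlier list, already downcased: `berg' here stems from the proper name only
printf '%s\n' '4 berg' '9 og' > "$tmp/old.forms"
# current list: proper name, common noun, and a sentence-initial form
printf '%s\n' '4 Berg' '2 berg' '9 og' '1 *Og' > "$tmp/new.forms"

fold_forms "$tmp/old.forms" > "$tmp/old.folded"
fold_forms "$tmp/new.forms" > "$tmp/new.folded"

# forms found only in the folded current list
missing=$(LC_ALL=C comm -13 "$tmp/old.folded" "$tmp/new.folded")
echo "missing after folding: '$missing'"
```

  nothing is reported as missing, although the common noun was never seen when
  the earlier list was compiled.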

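
+ handon.no.forms, handon.en.forms

  frequency-ordered word lists over all `*.no.txt' files (plus `tg+.txt') and
  all `*.en.txt' files, respectively, in `$LOGONROOT/uio/data/'.  only lines
  starting with a bracketed number are used, and lines containing `<p>' are
  skipped.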

  cat $LOGONROOT/uio/data/*.no.txt $LOGONROOT/uio/data/tg+.txt \
  | egrep '^\[[0-9]+\]' | grep -v '<p>' | egrep -v '^[\t]*$' \
  | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
  | sort | uniq -c | sort -nr > $LOGONROOT/uio/data/lists/handon.no.forms

  cat $LOGONROOT/uio/data/*.en.txt \
  | egrep '^\[[0-9]+\]' | grep -v '<p>' | egrep -v '^[\t]*$' \
  | $LOGONROOT/uio/data/bin/extr_vocab_fan.pl \
  | sort | uniq -c | sort -nr > $LOGONROOT/uio/data/lists/handon.en.forms
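
  the line filter in front of `extr_vocab_fan.pl' can be tried in isolation.
  the sketch below assumes input lines of the form `[number] text', which is
  inferred only from the regular expression in the recipes; the sample lines
  and the `keep_items' helper are made up.  `grep -E' is used as the
  equivalent of `egrep'.  the third filter of the recipes is left out, since a
  line that has to start with `[' cannot be blank anyway.

```shell
# hypothetical helper: keep numbered lines, drop those containing <p>
keep_items() {
  grep -E '^\[[0-9]+\]' | grep -v '<p>'
}

# made-up input: two usable items, a bare marker, a numbered marker, a comment
printf '%s\n' \
  '[101] Vi ser fjellet.' \
  '<p>' \
  '[102] <p>' \
  'uten nummer' \
  '[103] Stien er bratt.' \
| keep_items
```

  only the lines numbered 101 and 103 survive; everything else is either not
  numbered or contains a paragraph marker.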

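
+ handon.no.new, handon.en.new

  set differences, analogous to `jhk.no.new' above: word forms in the `handon'
  lists that are not in the corresponding `jhpstg' list.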

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{jhpstg,handon}.no.forms \
  | sort -u > $LOGONROOT/uio/data/lists/handon.no.new

  perl $LOGONROOT/uio/data/bin/diff_vocab_lists.pl \
    $LOGONROOT/uio/data/lists/{jhpstg,handon}.en.forms \
  | sort -u > $LOGONROOT/uio/data/lists/handon.en.new