Morphology, POS Tags, and Unknown Words


Introduction

Czech is a highly inflected language with a much richer morphology than English. The Czech training and test data for this project are tagged with 15-letter morphology strings, which encode a much wider range of possibilities than the part-of-speech tags that the Collins parser was designed to work with. This part of the project explores how to collapse those morphology strings into a set of POS tags that is maximally informative for the parser. The results are particularly important for unknown words, where the POS information is all that the parser has to go on, so alternative strategies for dealing with unknown words are also being explored.

Initial Round of POS Tag Experiments

An initial round of experiments measured the performance of various simple projections of the original 15-letter morphology strings.
  1. Primary POS: this baseline strategy used just the first letter of the morphology string.
  2. P-punct: the first letter, but with special handling of the punctuation tag.
  3. P-S: the first two letters, primary-POS and sub-POS tags, from the morphology string. (Note: because there is no overlap between sub-POS values for different primary-POS values, this could equivalently be termed plain S.)
  4. P-C: primary-POS plus the case.
  5. P-SC: primary-POS plus a second letter that usually encodes case, but encodes sub-POS when the primary-POS value is D, V, or X.
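
The projections listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the experiments. It assumes that case sits in the fifth position of the 15-letter string (as in the Prague Dependency Treebank positional tagset), which the text here does not state. The function name project and the scheme labels are hypothetical. P-punct is left out because its special handling of punctuation is not described.

```python
# Assumed layout of the 15-letter morphology string (an assumption, not
# stated in the report): position 1 = primary POS, position 2 = sub-POS,
# position 5 = case. Indices below are 0-based.
POS_IDX, SUBPOS_IDX, CASE_IDX = 0, 1, 4

# Primary-POS values for which P-SC keeps the sub-POS instead of the case.
SUBPOS_INSTEAD_OF_CASE = {"D", "V", "X"}


def project(morph: str, scheme: str) -> str:
    """Collapse a 15-letter morphology string into a reduced POS tag."""
    if len(morph) != 15:
        raise ValueError(f"expected 15 letters, got {len(morph)}: {morph!r}")
    pos, subpos, case = morph[POS_IDX], morph[SUBPOS_IDX], morph[CASE_IDX]
    if scheme == "P":        # primary POS only
        return pos
    if scheme == "P-S":      # primary POS + sub-POS
        return pos + subpos
    if scheme == "P-C":      # primary POS + case
        return pos + case
    if scheme == "P-SC":     # case, except sub-POS for D, V, and X
        second = subpos if pos in SUBPOS_INSTEAD_OF_CASE else case
        return pos + second
    raise ValueError(f"unknown scheme: {scheme!r}")
```

Under these assumptions, a noun string such as NNFS1-----A---- maps to N, NN, N1, and N1 under P, P-S, P-C, and P-SC respectively, while a verb string such as VB-S---3P-AA--- maps to VB under P-SC because V is one of the three exceptions.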
 
Performance of Various POS Tagsets on Preliminary Training/Test Data 
P 71.57%
P-punct 71.65%
P-S 72.50%
P-C 72.67%
P-SC 73.00%
These figures (which show the percentage of correct dependencies) indicate that the P-SC tagset improves parser performance by about 1.4 percentage points (from 71.57% to 73.00%) compared to using only the primary POS.

Using Hand-Assigned Tags vs. Machine-Assigned Tags for Training

Besides the choice of which reduced tag set to map the full morphology strings into, there is also a choice as to which source tag to use. Each word in both training and test data comes annotated with all of the tags that it could possibly have and with a single tag chosen by an automatic tagger (which was trained on a different, non-overlapping dataset). In addition, the training data words are annotated by hand with the correct tag. Training on those correct tags should presumably lead to a more accurate model, but training on the machine-assigned tags has the advantage that the training data will most closely resemble the test data.
A comparison test on the portion of the official training set that was available at the time produced the following results:
Training on Machine vs. Hand Tags
machine tags 72.31%
hand tags 70.50%
Thus it appears that the consistency advantage of machine tags outweighs the correctness advantage of the hand-assigned ones.

Unknown Words

An initial measurement shows that 38% of the words in the test data are treated by the parser as unknown. (The parser considers every word seen fewer than 5 times in the training data to be unknown.) This figure is much higher than the comparable figure for English, owing to the richer inflectional morphology of Czech. Various changes to the parser's handling of unknown words are being tried in order to deal with this more effectively.
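
The unknown-word rate described above can be computed with a simple frequency count. The following is a minimal sketch, assuming the data are available as flat lists of word tokens. The function name unknown_rate and its parameters are hypothetical and are not taken from the parser.

```python
from collections import Counter


def unknown_rate(train_tokens, test_tokens, min_count=5):
    """Fraction of test tokens seen fewer than min_count times in training."""
    counts = Counter(train_tokens)
    if not test_tokens:
        return 0.0
    # Counter returns 0 for words never seen in training, so those words
    # are counted as unknown as well.
    unknown = sum(1 for word in test_tokens if counts[word] < min_count)
    return unknown / len(test_tokens)
```

The rate is measured over test tokens rather than distinct word types, which is one reading of "38% of the words in test data". Counting over types would give a different figure.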