Morphology, POS Tags, and Unknown Words
Introduction
Czech is a highly inflected language, with a much richer morphology than
English. The Czech training and test data for this project are tagged with
15-letter morphology strings, which encode a much wider range of possibilities
than the part-of-speech tags that the Collins parser was designed to work
with. This part of the project explores how to collapse those morphology
strings into a set of POS tags that will be maximally informative for the
parser. The results are of particular importance for unknown words, where
the POS information is all that the parser has to go on, so alternative
strategies for dealing with unknown words are also being explored.
Initial Round of POS Tag Experiments
An initial round of experiments measured the performance of various simple
projections of the original 15-letter morphology strings.
- Primary POS (P): this baseline strategy uses just the first letter of the
  morphology string.
- P-punct: the first letter, but with special handling of the punctuation tag.
- P-S: the first two letters of the morphology string, that is, the primary-POS
  and sub-POS tags. (Note: because there is no overlap between the sub-POS
  values of different primary-POS values, this could equivalently be termed
  plain S.)
- P-C: the primary POS plus the case.
- P-SC: the primary POS plus a second letter that usually encodes case, but
  encodes sub-POS for the primary-POS values D, V, and X.
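The projections above can be sketched as simple string operations. This is only an illustrative sketch, not the project's actual code: the function names are hypothetical, the position of the case letter (assumed here to be the fifth letter) is not stated in the text, and the example tag strings are made up for illustration. P-punct is omitted because the text does not say what its special handling consists of.

```python
# Assumed 0-based positions within the 15-letter morphology string.
POS_INDEX = 0     # primary POS: the first letter (stated in the text)
SUBPOS_INDEX = 1  # sub-POS: the second letter (stated in the text)
CASE_INDEX = 4    # ASSUMPTION: case is the fifth letter (not stated in the text)

# Primary-POS values for which P-SC uses sub-POS instead of case (from the text).
SUBPOS_INSTEAD_OF_CASE = frozenset("DVX")


def project_p(tag: str) -> str:
    """Primary POS: just the first letter."""
    return tag[POS_INDEX]


def project_ps(tag: str) -> str:
    """P-S: primary POS plus sub-POS (the first two letters)."""
    return tag[POS_INDEX] + tag[SUBPOS_INDEX]


def project_pc(tag: str) -> str:
    """P-C: primary POS plus the case letter."""
    return tag[POS_INDEX] + tag[CASE_INDEX]


def project_psc(tag: str) -> str:
    """P-SC: primary POS plus case, or plus sub-POS for D, V, and X."""
    pos = tag[POS_INDEX]
    if pos in SUBPOS_INSTEAD_OF_CASE:
        return pos + tag[SUBPOS_INDEX]
    return pos + tag[CASE_INDEX]


# Made-up example strings, 15 letters each, for illustration only.
noun_tag = "NNFS1-----A----"
verb_tag = "VB-S---3P-AA---"
```

With these assumptions, `project_psc(noun_tag)` keeps the case letter while `project_psc(verb_tag)` keeps the sub-POS letter, which is the distinction the P-SC tagset draws.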
Performance of Various POS Tagsets on Preliminary Training/Test Data
| Tagset | Correct dependencies |
| P | 71.57% |
| P-punct | 71.65% |
| P-S | 72.50% |
| P-C | 72.67% |
| P-SC | 73.00% |
These figures (the percentage of correct dependencies) indicate that
including the SC data improves parser performance by about 1.4 percentage
points (from 71.57% to 73.00%) compared to using only the primary POS.
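For reference, a percentage of correct dependencies can be computed along the following lines. This is a minimal sketch under stated assumptions: each word is taken to have exactly one head, heads are represented as integer indices, and every word counts toward the score. The text does not say how the reported figures treat punctuation or the sentence root, and the function name is hypothetical.

```python
def dependency_accuracy(gold_heads, predicted_heads):
    """Percentage of words whose predicted head matches the gold head.

    ASSUMPTION: both arguments are equal-length sequences holding one
    head index per word, and all words are scored.
    """
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Toy example: three of the four predicted heads agree with the gold heads.
toy_score = dependency_accuracy([2, 0, 2, 3], [2, 0, 2, 2])
```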
Using Hand-Assigned Tags vs. Machine-Assigned Tags for Training
Besides choices about what reduced tag set to map the full morphology tag
strings into, there are also choices as to which tag to use. Each word
in both training and test data comes annotated with all of the tags that
it could possibly have and with a single tag chosen by an automatic tagger
(which was trained on a different, non-overlapping dataset). In addition,
the training data words are annotated by hand with the correct tag. Training
on those correct tags should presumably lead to a more accurate model,
but training on the machine-assigned tags has the advantage that the test
data will most closely resemble the training data.
A comparison test on the portion of the official training set that
was available at the time produced the following results:
Training on Machine vs. Hand Tags
| Training tags | Correct dependencies |
| machine tags | 72.31% |
| hand tags | 70.50% |
Thus it appears that the consistency advantage of machine tags outweighs
the correctness advantage of the hand-assigned ones.
Unknown Words
An initial measurement shows that 38% of the words in the test data are
treated by the parser as unknown. (The parser considers all words not seen
at least 5 times in the training data to be unknown.) This figure is much
higher than the comparable figure for English, due to the richer inflectional
morphology of Czech. Various changes to the parser's handling of unknown
words are being tried in order to deal with this more effectively.
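The unknown-word measure described above can be sketched as follows. This is an illustration only, not the parser's own code: the function names are hypothetical, words are assumed to be compared as exact surface strings, and the rate is assumed to be computed over word tokens rather than distinct word types, which the text does not specify.

```python
from collections import Counter

# From the text: a word must be seen at least 5 times in training to be known.
UNKNOWN_THRESHOLD = 5


def known_vocabulary(training_words, threshold=UNKNOWN_THRESHOLD):
    """Return the set of words seen at least `threshold` times in training."""
    counts = Counter(training_words)
    return {word for word, count in counts.items() if count >= threshold}


def unknown_rate(test_words, known):
    """Percentage of test tokens that are not in the known vocabulary."""
    if not test_words:
        return 0.0
    unknown = sum(1 for word in test_words if word not in known)
    return 100.0 * unknown / len(test_words)


# Toy data: "a" occurs 5 times (known), "b" only 4 times (unknown).
toy_known = known_vocabulary(["a"] * 5 + ["b"] * 4)
toy_rate = unknown_rate(["a", "b", "c", "a"], toy_known)
```

Because each inflected form of a Czech word counts as a separate string under this scheme, many forms fall below the threshold even when the underlying lemma is frequent.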