Feature Selection for the Dependency Model
WS96 Presentation, Aug 21
Dekai WU,
dekai@cs.ust.hk
----------------------------------------------------------------------
CAVEAT: All numbers in this document are provisional.
----------------------------------------------------------------------
The original plan was to select link bigrams by a mutual information
criterion against an ordinary ngram model. Experiments revealed problems
with sparse counts, even against a bigram model.
The plan was subsequently revised to:
1. Analyze link-bigrams primarily by frequency, whose effect appears from
inspection to swamp the mutual information scores.
2. Apply both manual and mutual-information criteria to select worthwhile
link types.
Basic characteristics of the corpus
The training corpus (SW00-11) contains 1437972 links. Note this is
exactly the number of words in the corpus, under a pure dependency
model (as opposed to a link grammar model where the number of links
generally exceeds the number of words).
Basic characteristics of link-bigrams
The majority of links occur between adjacent words, consistent with
the generally decent performance of simple bigram models. The
following table shows how the frequency of occurrence and the number
of distinct link-bigrams falls as the minimum span distance is
increased. Note the Average Occurrences is deceptive as the median
number of occurrences is 1; 213627 out of 303548 link-bigrams occurred
just once.
Link-bigrams Arrow-bigrams Word-bigrams Collins-bigrams
Min Freq Pct Types Avg Types Avg Types Avg Types Avg
Dist Occur Occur Occur Occur
1 1437972 100 % 303548 4.74
2 648011 45.1% 176661 3.67 160194 4.05 156476 4.14 163725 3.96
3 322009 103247 3.12 94007 3.43 91937 3.50
4 182796 57912 3.16
For us the informative links will be those that span longer distances,
since our language model already incorporates ordinary bigram
constraints. Because our dependency models are currently restricted
to bigram histories (in the reduced histories), we henceforth
exclusively use models where the minimum link distance is 2, i.e.,
non-adjacent link-bigrams.
The number of distinct link-bigrams types to apply constraints on can
be pruned by thresholding on frequency. (All entries in the table
below are for non-adjacent links.)
Link Arrow Collins
Min Bigram Bigram Bigram
Freq Types Types Types
2 46666 45771
3 26327 26177
4 18628
5 14370
6 11821
7 10005
10 6943
12 5733
13 5248
14 4859 4847
15 4538
Basic characteristics of links
Min Link
Dist Types
1 2477
2 1798
3 1335
Over half of the 2477 link types occur 6 times or less.
1 548
2 273
3 169
4 124
5 93
6 61
7 62
8 51
9 52
10 30
11 33
12 24
13 26
14 32
15 17
16 16
...
32 8 (representative picks)
64 4
128 4
256 1 (mostly zero starting here)
516 1
1023 1
2073 1
4250 1
8204 1
16477 1
37846 1
57392 1
65210 1
112507 1
204722 1
Link selection
Our simplest link model classes all link types into just two types,
preserving only the left/right arrow distinction. During his visit to
the Workshop, Mike Collins suggested five important link classes whose
specific type identity we might not want to discard.
Information on Collins' suggested most important links
ALL LINKS DISTANCE >= 2 DISTANCE >= 3 LINK
-------------- -------------- -------------- ----------------
RANK FREQ RANK FREQ RANK FREQ a ll lc lr
(8) 27567 (8) 14745 (8) 6753 < VB VP NP
(39) 7820 (32) 4446 (28) 2443 < VB VP S
(38) 7883 (31) 4453 (26) 2673 < VB VP SBAR
(11) 22547 (7) 22206 (6) 7608 < IN SBAR S
(1) 204722 (4) 27307 (4) 11444 > NP S VP
Note that these all rank in the top 40 link types, which suggests
selecting other links out of (some of) the most frequent links.
For non-adjacent links, which are most likely to be useful,
these are:
110513 < TOPL TOPC S
43771 < IN PP NP
37360 > CC S VP
27307 > NP S VP
24091 > DT NP NN
23523 > INTJ S VP
22206 < IN SBAR S
14745 < VB VP NP
13042 < VBP VP NP
12525 < VBZ VP NP
11807 < VBP VP VP
9587 < VBP VP SBAR
9343 < VBD VP NP
8550 < NP NP NP
8182 > S S VP
7892 > PRN S VP
7882 > RB S VP
7847 < WHADVP SBAR S
7780 < VB VP PP
7565 < WHNP SBAR S
7524 > ADVP S VP
6333 < VP VP CC
6058 > DT NP NNS
5944 < VP VP VP
5706 > SBAR S VP
5037 < VBP VP PP
4913 < VBD VP PP
4558 < VBP VP ADVP
4517 < MD VP VP
4472 > PP S VP
4453 < VB VP SBAR
4446 < VB VP S
4204 < VBZ VP ADJP
4107 > EDITED S VP
4064 < VBG VP NP
4049 < VBN VP NP
3835 < VB VP ADVP
3510 < VBZ VP S
3492 < VBZ VP ADVP
3363 > JJ NP NN
Hand-pruning this list to 24 types gives:
43771 < IN PP NP
37360 > CC S VP
27307 > NP S VP
24091 > DT NP NN
22206 < IN SBAR S
14745 < VB VP NP
13042 < VBP VP NP
12525 < VBZ VP NP
11807 < VBP VP VP
9587 < VBP VP SBAR
9343 < VBD VP NP
8182 > S S VP
7892 > PRN S VP
7882 > RB S VP
7847 < WHADVP SBAR S
7565 < WHNP SBAR S
6058 > DT NP NNS
5706 > SBAR S VP
4517 < MD VP VP
4453 < VB VP SBAR
4446 < VB VP S
4064 < VBG VP NP
4049 < VBN VP NP
3510 < VBZ VP S
A more principled approach to link selection should be to use
information gain. Define the {\it average link gain\/} as
\begin{eqnarray*}
I(z ; L | y[L])
& = & E_{P(L)} G(L)
\end{eqnarray*}
where the individual {\it link gain}
\begin{eqnarray*}
G(L) & = & E_{P(yz|L)} \lg \frac {P(Lz|y[L])} {P(z|y[L]) P(L|y[L])}
\end{eqnarray*}
indicates how much information on average is gained by knowing the
specific identity $L$ of the link between two linked words $y$ and
$z$, compared with knowing only the equivalence class $[L]$. Note
this measure is designed to be essentially insensitive to the
frequency of $L$. The gain is measured relative to a unigram LM.
In our models, $[L]$ is just the arrow direction $a$ of $L$.
Also define the {\it weighted link gain\/}
\begin{eqnarray*}
\tilde{G}(L) & = & P(L) G(L)
\end{eqnarray*}
which is sensitive to the frequency of $L$. However, from inspection
of results it appears that this measure is swamped by the frequency,
so we might as well use just the frequency (particularly since the
rank of link types, not their gain score, is what really matters).
LINK LINK GAIN WEIGHTED LINK GAIN
-------------------- ---------------------- ----------------------
a ll lc lr RANK BITS RANK BITS
< VB VP NP (2289) 1.83543 ( 3) 0.0351866
< VB VP S ( 937) 6.87514 (27) 0.0139335
< VB VP SBAR (1010) 6.59419 (23) 0.014747
< IN SBAR S ( 959) 6.78454 (46) 0.00909541
> NP S VP (2451) 0.182855 ( 7) 0.0260329
Another potentially interesting gain measure is comes from the
{\it average anchor gain}
\begin{eqnarray*}
I(z ; y | L)
& = & E_{P(L)} G_A(L)
\end{eqnarray*}
where the individual {\it link gain}
\begin{eqnarray*}
G_A(L) & = & E_{P(yz|L)} \lg \frac {P(yz|L)} {P(y|L) P(z|L)}
\end{eqnarray*}
indicates the informativeness of the identity of the anchor $y$
if $L$ is already known. It is not worth using $L$ as a link type
for link-bigrams if the bulk of the information comes from $L$
rather than $y$.
LINK LINK GAIN LINK ANCHOR GAIN
-------------------- ---------------------- ----------------------
a ll lc lr RANK BITS RANK BITS
< VB VP NP (2289) 1.83543 ( 39) 6.55541
< VB VP S ( 937) 6.87514 ( 90) 5.52263
< VB VP SBAR (1010) 6.59419 ( 83) 5.64445
< IN SBAR S ( 959) 6.78454 (389) 2.99731
> NP S VP (2451) 0.182855 (199) 4.13835
----------------------------------------------------------------------
CAVEAT: All numbers in this document are provisional.
----------------------------------------------------------------------