Feature Selection for the Dependency Model

WS96 Presentation, Aug 21

Dekai WU, dekai@cs.ust.hk




----------------------------------------------------------------------
CAVEAT:  All numbers in this document are provisional.
----------------------------------------------------------------------


The original plan was to select link bigrams by a mutual information
criterion against an ordinary ngram model.  Experiments revealed problems
with sparse counts, even against a bigram model.

The plan was subsequently revised to:

1.  Analyze link-bigrams primarily by frequency, whose effect appears from
inspection to swamp the mutual information scores.  

2.  Apply both manual and mutual-information criteria to select worthwhile
link types.


Basic characteristics of the corpus

The training corpus (SW00-11) contains 1437972 links.  Note this is
exactly the number of words in the corpus, under a pure dependency
model (as opposed to a link grammar model where the number of links
generally exceeds the number of words).


Basic characteristics of link-bigrams

The majority of links occur between adjacent words, consistent with
the generally decent performance of simple bigram models.  The
following table shows how the frequency of occurrence and the number
of distinct link-bigrams falls as the minimum span distance is
increased.  Note the Average Occurrences is deceptive as the median
number of occurrences is 1; 213627 out of 303548 link-bigrams occurred
just once.


			Link-bigrams	Arrow-bigrams	Word-bigrams	Collins-bigrams

Min	Freq	Pct	Types	Avg	Types	Avg	Types	Avg	Types	Avg
Dist				Occur		Occur		Occur		Occur

1	1437972	100  %	303548	4.74
2	 648011	 45.1%	176661	3.67	160194	4.05	156476	4.14	163725	3.96
3	 322009		103247	3.12	 94007	3.43	 91937	3.50
4	 182796				 57912	3.16


For us the informative links will be those that span longer distances,
since our language model already incorporates ordinary bigram
constraints.  Because our dependency models are currently restricted
to bigram histories (in the reduced histories), we henceforth
exclusively use models where the minimum link distance is 2, i.e.,
non-adjacent link-bigrams.

The number of distinct link-bigrams types to apply constraints on can
be pruned by thresholding on frequency.  (All entries in the table
below are for non-adjacent links.)

	Link	Arrow	Collins
Min	Bigram	Bigram	Bigram
Freq	Types	Types	Types

 2	46666	45771
 3	26327	26177
 4		18628
 5		14370
 6		11821
 7		10005
10		 6943
12		 5733
13		 5248
14		 4859	4847
15		 4538


Basic characteristics of links

Min	Link
Dist	Types

1	2477
2	1798
3	1335

Over half of the 2477 link types occur 6 times or less.

1       548
2       273
3       169
4       124
5       93
6       61
7       62
8       51
9       52
10      30
11      33
12      24
13      26
14      32
15      17
16      16
...
32      8	(representative picks)
64      4
128     4
256     1	(mostly zero starting here)
516     1
1023    1
2073    1
4250    1
8204    1
16477   1
37846   1
57392   1
65210   1
112507  1
204722  1


Link selection

Our simplest link model classes all link types into just two types,
preserving only the left/right arrow distinction.  During his visit to
the Workshop, Mike Collins suggested five important link classes whose
specific type identity we might not want to discard.

Information on Collins' suggested most important links


ALL LINKS	DISTANCE >= 2	DISTANCE >= 3	LINK
--------------	--------------	--------------	----------------
RANK	FREQ	RANK	FREQ	RANK	FREQ	a ll   lc   lr

(8)	27567	(8)	14745	(8)	6753	< VB   VP   NP
(39)	7820	(32)	4446	(28)	2443	< VB   VP   S
(38)	7883	(31)	4453	(26)	2673	< VB   VP   SBAR
(11)	22547	(7)	22206	(6)	7608	< IN   SBAR S
(1)	204722	(4)	27307	(4)	11444	> NP   S    VP


Note that these all rank in the top 40 link types, which suggests
selecting other links out of (some of) the most frequent links.
For non-adjacent links, which are most likely to be useful,
these are:

110513 <        TOPL    TOPC    S
43771 < IN      PP      NP
37360 > CC      S       VP
27307 > NP      S       VP
24091 > DT      NP      NN
23523 > INTJ    S       VP
22206 < IN      SBAR    S
14745 < VB      VP      NP
13042 < VBP     VP      NP
12525 < VBZ     VP      NP
11807 < VBP     VP      VP
9587 <  VBP     VP      SBAR
9343 <  VBD     VP      NP
8550 <  NP      NP      NP
8182 >  S       S       VP
7892 >  PRN     S       VP
7882 >  RB      S       VP
7847 <  WHADVP  SBAR    S
7780 <  VB      VP      PP
7565 <  WHNP    SBAR    S
7524 >  ADVP    S       VP
6333 <  VP      VP      CC
6058 >  DT      NP      NNS
5944 <  VP      VP      VP
5706 >  SBAR    S       VP
5037 <  VBP     VP      PP
4913 <  VBD     VP      PP
4558 <  VBP     VP      ADVP
4517 <  MD      VP      VP
4472 >  PP      S       VP
4453 <  VB      VP      SBAR
4446 <  VB      VP      S
4204 <  VBZ     VP      ADJP
4107 >  EDITED  S       VP
4064 <  VBG     VP      NP
4049 <  VBN     VP      NP
3835 <  VB      VP      ADVP
3510 <  VBZ     VP      S
3492 <  VBZ     VP      ADVP
3363 >  JJ      NP      NN


Hand-pruning this list to 24 types gives:


43771   <       IN      PP      NP
37360   >       CC      S       VP
27307   >       NP      S       VP
24091   >       DT      NP      NN
22206   <       IN      SBAR    S
14745   <       VB      VP      NP
13042   <       VBP     VP      NP
12525   <       VBZ     VP      NP
11807   <       VBP     VP      VP
9587    <       VBP     VP      SBAR
9343    <       VBD     VP      NP
8182    >       S       S       VP
7892    >       PRN     S       VP
7882    >       RB      S       VP
7847    <       WHADVP  SBAR    S
7565    <       WHNP    SBAR    S
6058    >       DT      NP      NNS
5706    >       SBAR    S       VP
4517    <       MD      VP      VP
4453    <       VB      VP      SBAR
4446    <       VB      VP      S
4064    <       VBG     VP      NP
4049    <       VBN     VP      NP
3510    <       VBZ     VP      S


A more principled approach to link selection should be to use
information gain.  Define the {\it average link gain\/} as
\begin{eqnarray*}
I(z ; L | y[L])
 & = & E_{P(L)} G(L)
\end{eqnarray*}
where the individual {\it link gain}
\begin{eqnarray*}
G(L) & = & E_{P(yz|L)} \lg \frac {P(Lz|y[L])} {P(z|y[L]) P(L|y[L])}
\end{eqnarray*}
indicates how much information on average is gained by knowing the
specific identity $L$ of the link between two linked words $y$ and
$z$, compared with knowing only the equivalence class $[L]$.  Note
this measure is designed to be essentially insensitive to the
frequency of $L$.  The gain is measured relative to a unigram LM.
In our models, $[L]$ is just the arrow direction $a$ of $L$.

Also define the {\it weighted link gain\/}
\begin{eqnarray*}
\tilde{G}(L) & = & P(L) G(L)
\end{eqnarray*}
which is sensitive to the frequency of $L$.  However, from inspection
of results it appears that this measure is swamped by the frequency,
so we might as well use just the frequency (particularly since the
rank of link types, not their gain score, is what really matters).


LINK			LINK GAIN		WEIGHTED LINK GAIN
--------------------	----------------------	----------------------
a ll   lc   lr		RANK	BITS		RANK	BITS

< VB   VP   NP		(2289)	1.83543		( 3)	0.0351866
< VB   VP   S		( 937)	6.87514		(27)	0.0139335
< VB   VP   SBAR	(1010)	6.59419		(23)	0.014747
< IN   SBAR S		( 959)	6.78454		(46)	0.00909541
> NP   S    VP		(2451)	0.182855	( 7)	0.0260329


Another potentially interesting gain measure is comes from the
{\it average anchor gain}
\begin{eqnarray*}
I(z ; y | L)
 & = & E_{P(L)} G_A(L)
\end{eqnarray*}
where the individual {\it link gain}
\begin{eqnarray*}
G_A(L) & = & E_{P(yz|L)} \lg \frac {P(yz|L)} {P(y|L) P(z|L)}
\end{eqnarray*}
indicates the informativeness of the identity of the anchor $y$
if $L$ is already known.  It is not worth using $L$ as a link type
for link-bigrams if the bulk of the information comes from $L$
rather than $y$.


LINK			LINK GAIN		LINK ANCHOR GAIN
--------------------	----------------------	----------------------
a ll   lc   lr	RANK	BITS		RANK	BITS

< VB   VP   NP		(2289)	1.83543		( 39)	6.55541
< VB   VP   S		( 937)	6.87514		( 90)	5.52263
< VB   VP   SBAR	(1010)	6.59419		( 83)	5.64445
< IN   SBAR S		( 959)	6.78454		(389)	2.99731
> NP   S    VP		(2451)	0.182855	(199)	4.13835



----------------------------------------------------------------------
CAVEAT:  All numbers in this document are provisional.
----------------------------------------------------------------------