Note: the instructions contained on this page were written for use specifically in the CLSP WS02 Lab. They may refer to applications, files or other materials that are not accessible from other locations.
JHU NLP Workshop Parsing Lab - 11 July, 2002
Experiment: Beam Search Thresholding in a Probabilistic CKY Parser
The first experiment will introduce you to the basics of
statistical parsing. The task will be to modify an existing parser
written in Perl in a sequence of steps listed below. Knowledge of
basic data types, control structures, and regular expressions in Perl
is all you should need to do this lab.
Background. A recurring theme in probabilistic parsing is
the need to limit the search space of the parser. Many productions
that a hand-constructed CFG would rule out as impossible, a learned
PCFG might permit with low but non-zero probability. Techniques to
limit the search space might still guarantee that the optimal parse is
found (such as A* search). Or they might do more aggressive pruning
for even more speed, but sometimes fail to find the most probable
parse. The Beam Search Thresholding you will implement is an example
of this second, more aggressive pruning.
You will be taking an existing probabilistic context-free CKY
parser and adding beam search thresholding to improve its running
time. Beam search thresholding is a particular form of pruning which
ignores items stored in the parser chart that have a very low
probability, relative to other items in the same cell. Pruning these
items is like making a bet that these items are so unlikely, they will
not be in the most probable parse, even if they combine with other
high-probability items. The ratio in probabilities from the most
probable item in a cell to the lowest probability permitted for a
"competing" item is the Beam Width, and this parameter can be
tuned to trade off speed (a narrower beam can prune more items)
vs. accuracy (pruning more items risks pruning the most probable
parse).
Notes on data. The grammar was trained from the Penn
Treebank of WSJ articles, sections 01-09, with no smoothing of the
count statistics. The test sentences were drawn from section 0. To
reduce the size of the grammar and parses, the words (lexical symbols)
were removed, so that the parse tree "leaves" are part-of-speech (POS)
tags. The full set of Penn Treebank POS and syntactic tags is below. While the full POS tagset is large to
capture fine distinctions in meaning, the initial letter of the
symbols generally maps to the standard simple grammatical categories.
The hand-annotated "gold" reference parses, and the output of the
provided parser, use the standard nested parentheses to represent the
parse structure. The evaluation of the quality of parses uses the "EVALB"
tool from NYU, which reports the precision and recall of the brackets
(the parse tree nodes and their spans) relative to the reference parse.
Notes on CKY algorithm. The CKY algorithm without pruning is
exhaustive, guaranteed to find every possible parse of a sentence.
There are several ways to order the triple-nested loops that test
whether a cell in the chart can be built from two smaller cells.
All of these orders must preserve the property of never trying to
build an item before building all of the smaller items that it could
be built from. The variant of CKY in the provided parser uses as its
outer-most loop j, the right edge of the new item being built,
incrementing it left-to-right across the words in the sentence. The
next loop is on i, the left edge of the new item, decrementing
it from j-1 back to 0. (The other direction would lose the above property.)
And the next loop is on k, the split between the two smaller cells being
examined. The remaining loops look at every combination of items from
these two cells, and check the grammar for every rule with these
symbols as a Right-Hand-Side (RHS).
The implementation has some additional code to handle unary rules, since
the grammar used to annotate the treebank include many nodes with a
single non-terminal as a RHS. (It's not in Chomsky Normal Form.)
You can work on this lab solo or in a two-person team. If Perl is
unfamiliar, you might try to partner with someone who knows it better.
You can find an online Perl reference here . Follow these steps:
- Copy all the files from ~ruhlen/parselab/files into a new
directory. Test that the provided program can parse the sentences in
the development set by running:
perl cky-parser.pl ~ruhlen/parselab/out.gram < input.small > parser.out.nopruning
Evaluate the accuracy of the best parses by running:
./test_output parser.out.nopruning gold.small
This reports accuracy of the parser output and the labeled development
set. Parsing these sentences with no pruning may take around 15
minutes, depending on your system speed and load. Note both the
accuracy (precision and recall) and the total CPU time of the
exhaustive CKY parser. You can get a qualitative look at how the
parser's output differs from the reference parse by piping your output
and the gold parses into "prettyprint".
cat parser.out.nopruning gold.small | prettyprint
- While waiting for the development set parsing to finish, analyze
the nested loops of the CKY algorithm in the file cky-parser.pl . Find the point
in the program at which, for each cell i,j in the chart, all
possible items have been created and stored.
- While writing and debugging your code, you should use the file "input.tiny" instead of "input.small".
At the point you identified above, add a call to the provided function "prune_items" that measures the number of items stored
at each chart cell i,j. Extend that function so that after the parser outputs its best
parse for a sentence, you can output the mean, max, and min number of items
per chart cell for this sentence, using statistics collected in
"prune_items". Display these values using the format:
"# Items/cell mean: X.X max: N min: N Total items: N cells: N"
-
Extend the function to also compute the maximum and minimum
log-probabilities of the items in each cell, and the log-ratio of
max:min probabilities in that cell. (Hint: what's the direct way to
calculate the log of a ratio of two probabilities already stored as
log-probabilities?) Output another line after the Items/cell output
in the format:
"# Log Max:min probability log-ratios mean: X.X max: X.X min: X.X"
using statistics collected in your function. (Note: each non-empty
cell has a single max:min ratio, and this output shows the statistics
for this value across all non-empty cells.)
- Modify the function prune_items further to use the flag variables
$pruning_enabled and $beam, removing all items in each cell whose
probability is less than that of the most probable item in the same
cell divided by beta. (Again, remember we're in log space.)
Add another output line for each sentence showing how many items were
pruned, in the format:
"# Items pruned total: N mean per cell: X.X"
For a given sentence, should the total number of items in the chart
using beam search be the same as the total number without any pruning
minus the total number pruned? Why or why not? Look at your output
for several sentences from this step and the previous step to confirm.
- Measure running times and error rates on the entire development
data set with beam thresholding using several different beam widths.
Include the beam values 0.0001 and 0.01.
perl cky-parser.pl -p -b=0.0001 ~ruhlen/parselab/out.gram < input.small > parser.out.pruning.0.0001
perl cky-parser.pl -p -b=0.01 ~ruhlen/parselab/out.gram < input.small > parser.out.pruning.0.01
perl cky-parser.pl -p -b= ...
-
Plot the change in parsing time,
comparing the parser without pruning and the parser with pruning at
the various beam widths. Use the following script to
generate the output graph:
perl plot_times.pl -g=/apps/bin/gnuplot parser.out.nopruning parser.out.pruning.0.0001 parser.out.pruning.0.01 ...
Check the name of the output file: timing-xx.eps and view it using gv:
/apps/bin/gv timing-xx.eps
- Check how the precision/recall figures change using the
"test_output" script (usage explained above). Is there a "sweet spot"
beam width that significantly reduces time with little change in
precision/recall?
Show either Paul or Anoop the timing graph and the precision/recall figures, and you're ready to move on the optional steps.
- Optional: As time allows, consider, implement, and test
other ways to speed up parsing. Possible things to try:
- A grammar with unary rules can cause problems for this type of
pruning, since a single parse tree can have more than one node (item) in
a single cell. These items could trigger pruning against each other,
which would eliminate that entire parse tree, even if it had no
competing complete parses. Other than converting the grammar to
CNF, can you think of a way to avoid this self-pruning suicide?
Add diagnostics to detect whether this kind of self-pruning is
taking place for the development data, and eliminate it if it is.
- Instead of just applying the beam width to the inside
probability of each item, multiply it by a rough estimate of the
outside probability before finding the most probable item and
pruning items outside the beam. (Hint: the provided function
prior(NT) gives the prior probability of a non-terminal in a
parse tree being NT.)
- Keep only the N-best items in a cell instead of using beam
search, or combine this with beam search in some way.
- Dynamically adjust the beam width based on the contents of a
cell. (For example, is there an item with a much larger decrease in
probability from the next most probable item?)
- Examine the new parse errors introduced with narrow beam widths.
Is there a pattern to the kinds of non-terminals or tree structures?
Does it suggest any way to focus on fixing those errors?
- Be creative--try something new and different!
Table 2:
The Penn Treebank POS tagset
1. CC Coordinating conjunction 25.TO to
2. CD Cardinal number 26.UH Interjection
3. DT Determiner 27.VB Verb, base form
4. EX Existential there 28.VBD Verb, past tense
5. FW Foreign word 29.VBG Verb, gerund/present participle
6. IN Preposition/subord. 30.VBN Verb, past participle
218z conjunction
7. JJ Adjective 31.VBP Verb, non-3rd ps. sing. present
8. JJR Adjective, comparative 32.VBZ Verb, 3rd ps. sing. present
9. JJS Adjective, superlative 33.WDT wh-determiner
10.LS List item marker 34.WP wh-pronoun
11.MD Modal 35.WP Possessive wh-pronoun
12.NN Noun, singular or mass 36.WRB wh-adverb
13.NNS Noun, plural 37. # Pound sign
14.NNP Proper noun, singular 38. $ Dollar sign
15.NNPS Proper noun, plural 39. . Sentence-final punctuation
16.PDT Predeterminer 40. , Comma
17.POS Possessive ending 41. : Colon, semi-colon
18.PRP Personal pronoun 42. ( Left bracket character
19.PP Possessive pronoun 43. ) Right bracket character
20.RB Adverb 44. " Straight double quote
21.RBR Adverb, comparative 45. ` Left open single quote
22.RBS Adverb, superlative 46. " Left open double quote
23.RP Particle 47. ' Right close single quote
24.SYM Symbol 48. " Right close double quote
(mathematical or scientific)
__________________
Some symbols cannot be displayed in HTML format
Table 3:
The Penn Treebank syntactic tagset
Tags
1. ADJP Adjective phrase
2. ADVP Adverb phrase
3. NP Noun phrase
4. PP Prepositional phrase
5. S Simple declarative clause
6. SBAR Clause introduced by subordinating conjunction or
0 (see below)
7. SBARQ Direct question introduced by wh-word or wh-phrase
8. SINV Declarative sentence with subject-aux inversion
9. SQ Subconstituent of SBARQ excluding wh-word or wh-phrase
10. VP Verb phrase
11. WHADVP Wh-adverb phrase
12. WHNP Wh-noun phrase
13. WHPP Wh-prepositional phrase
14. X Constituent of unknown or uncertain category
Null elements
1. * ``Understood'' subject of infinitive or imperative
2. 0 Zero variant of that in subordinate clauses
3. T Trace---marks position where moved wh-constituent
is interpreted
4. NIL Marks position where preposition is interpreted in
pied-piping contexts