SUMMARY - PRONUNCIATION MODELING TEAM MEETING -- 2/4/97 CHANTILLY, VA
(Mike Riley)
Present - Bill Byrne (BB), Sanjeev Khudanpur (SK), Michael Riley (MR),
Roni Rosenfeld (RR), Patrick Schone (PS), Steve Young (SY), George
Zavaliagkos (GZ)
An impromptu meeting of the pronunciation modeling group was held
at the DARPA Speech Recognition Workshop in Chantilly, VA on Tuesday
Feb 4, 1997. Given that most of the team members were present, we felt
we should take advantage of the Tuesday 'free time' and meet.
Our apologies to those members who were not at the conference.
Below is my summary of the meeting. Amendments are welcome,
as well as comments/questions from those who were not present.
Fred Jelinek told me that he will provide travel (for non-students)
for two pre-workshop meetings of each team at the 'center of gravity'
of its members. Since many of our members are involved in the LVCSR
evaluation, there was a general desire to hold the first meeting after
the Maritime Institute of Technology LVCSR meeting in May. Thus, it
was agreed that we would hold our first meeting just after this
meeting at a nearby location (e.g., JHU) and that it would last for a
day or two. A second meeting would be planned for June.
There was a widespread view that RR's earlier proposal that our
group set as its main goal the reduction of perplexity on the ICSI
transcriptions was not enough and that we must also try to improve
word error rate on the Switchboard task. We therefore developed a
plan to accomplish this. (We can view RR's proposal as our failsafe
in the event of a meltdown of computational resources.) Here are the
steps in that plan:
STEP 1: OBTAIN AN INITIAL PRONUNCIATION MODEL
We realized that this is where our views on the best strategy
differed most. There were several proposals:
(a) use unconstrained phone recognition (WS96 pron group)
(b) use a decision tree model trained on the ICSI data (including
cross-word modeling) (MR)
(c) use a model obtained from phone recognition of frequent
words (GZ)
(d) use rule-based pronunciation models trained on ICSI data
(Michael Finke, in absentia)
(e) use a model that would allow a 'constrained alignment' of the
Switchboard corpus, which would be more efficient and accurate
than (a), but still allow alternatives not seen in the ICSI
transcriptions (SY)
MR noted that by pruning the tree in (b), there was a continuum
from model (a) (pruned to the root) to allowing only alternatives
seen in ICSI (no pruning). SY observed that (b) with pruning
was therefore an implementation of his (e).
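The pruning continuum can be illustrated with a toy example. The
Python sketch below is not MR's decision-tree tooling: the Node class,
the alternatives function, the single context question, and all counts
are hypothetical, and the way the real trees are organized is not
stated in this summary. It only shows that stopping higher in a tree
admits more surface alternatives, while descending to a leaf admits
only those seen in that specific context.

```python
from collections import Counter

VOWELS = {"aa", "ae", "ah", "iy", "uw"}  # hypothetical, abbreviated vowel set


class Node:
    """One node of a toy pronunciation tree for a single phoneme."""

    def __init__(self, counts, question=None, yes=None, no=None):
        self.counts = Counter(counts)  # surface-phone counts observed at this node
        self.question = question       # function(context) -> bool; None at a leaf
        self.yes = yes
        self.no = no


def alternatives(node, context, max_depth):
    """Ask at most max_depth questions, then return the surface phones allowed there."""
    depth = 0
    while node.question is not None and depth < max_depth:
        node = node.yes if node.question(context) else node.no
        depth += 1
    return set(node.counts)


# Made-up counts for the phoneme /t/: "t" = released, "dx" = flap, "-" = deleted.
before_vowel = Node({"t": 10, "dx": 30})
elsewhere = Node({"t": 40, "-": 20})
root = Node({"t": 50, "dx": 30, "-": 20},
            question=lambda ctx: ctx["next"] in VOWELS,
            yes=before_vowel, no=elsewhere)

context = {"next": "iy"}
pruned_to_root = alternatives(root, context, max_depth=0)  # every alternative seen for /t/
unpruned = alternatives(root, context, max_depth=1)        # only those seen before a vowel
```

In this toy setting, intermediate values of max_depth in a deeper tree
would sit between the two extremes, which is one way to read SY's
'constrained alignment' in (e).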
TOOLS: MR will provide decision-tree tools for (b) for the workshop.
Other members with alternative proposals will arrange that any
needed tools are discussed/brought/obtained.
STEP 2: AUTOMATICALLY TRANSCRIBE THE SWITCHBOARD TRAINING DATA
2.1 USE PRON MODEL from STEP 1 to transform word transcriptions into
phonetic lattices. MR will provide finite-state (FSM) tools that
will accomplish this for proposal 1b (including cross-word
pronunciation models!).
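As a rough picture of what Step 2.1 produces, here is a minimal Python
sketch. It is not the FSM tools MR will provide: the PRONS lexicon, the
function names, and the arc representation are all hypothetical, and
the sketch ignores probabilities and cross-word effects. It expands a
word transcription into a phone lattice in which each word's
alternative pronunciations run in parallel between shared word-boundary
states.

```python
# Hypothetical toy lexicon: word -> non-empty alternative phone strings.
PRONS = {
    "and": [("ae", "n", "d"), ("ae", "n"), ("en",)],
    "the": [("dh", "ah"), ("dh", "iy")],
}


def words_to_phone_lattice(words, prons):
    """Return (arcs, final_state); arcs are (src, dst, phone) and the start state is 0."""
    arcs = []
    next_state = 1
    word_start = 0
    for word in words:
        word_end = next_state  # shared state where all alternatives rejoin
        next_state += 1
        for pron in prons[word]:
            src = word_start
            for i, phone in enumerate(pron):
                if i == len(pron) - 1:
                    dst = word_end
                else:
                    dst = next_state  # fresh intermediate state inside the word
                    next_state += 1
                arcs.append((src, dst, phone))
                src = dst
        word_start = word_end
    return arcs, word_start


def phone_paths(arcs, final_state, state=0):
    """Enumerate every phone sequence in the lattice (exponential; toy use only)."""
    if state == final_state:
        return [[]]
    paths = []
    for src, dst, phone in arcs:
        if src == state:
            for rest in phone_paths(arcs, final_state, dst):
                paths.append([phone] + rest)
    return paths


arcs, final_state = words_to_phone_lattice(["and", "the"], PRONS)
paths = phone_paths(arcs, final_state)
```

With three alternatives for the first word and two for the second, the
lattice encodes six phone sequences. A real implementation would
typically express this as composition of finite-state machines rather
than explicit enumeration.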
2.2 USE PHONETIC LATTICES TO TRANSCRIBE DATA - Workshop HTK
tools/models will be used.
STEP 3: BUILD AN ACOUSTIC MODEL BASED ON THE TRANSCRIPTIONS OF STEP 2
Workshop HTK tools/models will be used.
STEP 4: TEST MODEL OF STEP 3
4.1 USE PRON MODEL from STEP 1 TO TRANSFORM WORKSHOP-PROVIDED WORD
LATTICES TO PHONETIC LATTICES
MR will provide FSM tools for this.
4.2 PHONETIC LATTICE RESCORING - Workshop HTK tools will be used
STEP 5: BUILD A NEW PRONUNCIATION MODEL BASED ON TRANSCRIPTIONS IN
STEP 2 THEN GO TO STEP 2 (ITERATE)
For proposal 1b, this is just a retraining of the trees. I am
unclear on what is required here for the other proposals.
PRE-WORKSHOP ACTIVITIES:
MR will report on preliminary work on proposal 1b by our May meeting.
Similarly, GZ will report on preliminary work on 1c at that time.
SK will report on any new results on 1a.
SK will soon arrange an account for MR at JHU and direct him to sample
HTK word lattices. MR will explore Step 4.1 using these lattices. He
will also visit JHU and meet with SK and others before the May meeting
so we can discuss/check compatibility of various tools we will be
using.
-m
From Roni_Rosenfeld@HEE.SPEECH.CS.CMU.EDU Fri Jul 4 19:42:21 1997
Date: Thu, 13 Feb 97 09:42:58 EST
From: Roni_Rosenfeld@HEE.SPEECH.CS.CMU.EDU
Reply-To: ws97_pron@cspjhu.ece.jhu.edu
To: ws97_pron@cspjhu.ece.jhu.edu
Subject: Re: Chantilly summary
> STEP 4: TEST MODEL OF STEP 3
>
> 4.1 USE PRON MODEL from STEP 1 TO TRANSFORM WORKSHOP-PROVIDED WORD
> LATTICES TO PHONETIC LATTICES
>
> MR will provide FSM tools for this.
>
> 4.2 PHONETIC LATTICE RESCORING - Workshop HTK tools will be used
(I probably missed this part of the discussion, so please bear with
me.)
Is it possible to also do a complete re-decode, namely to somehow
interface the FSM tools (or the other pron models) to the initial
pass of the decoder? If it is possible, is it feasible? Say, for
the FSM tools?
A relevant datapoint here is the Lattice Word Error Rate (LWER) for
the workshop-provided word lattices. Does anyone know/remember
what it is?
-Roni
From sanjeev@cspjhu.ece.jhu.edu Fri Jul 4 19:42:27 1997
Date: Thu, 13 Feb 1997 09:51:41 -0500 (EST)
From: Sanjeev Khudanpur
Reply-To: ws97_pron@cspjhu.ece.jhu.edu
To: ws97_pron@cspjhu.ece.jhu.edu
Subject: Re: Chantilly summary
On Thu, 13 Feb 1997 Roni_Rosenfeld@HEE.SPEECH.CS.CMU.EDU wrote:
> A relevant datapoint here is the Lattice Word Error Rate (LWER) for
> the workshop-provided word lattices. Does anyone know/remember
> what it is?
The WS96 lattices have a LWER of about 13%.
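For readers unfamiliar with the term, lattice word error rate is
usually understood as the error rate of the best path the lattice
contains, so it bounds what any rescoring of that lattice can achieve.
The Python sketch below is one way to compute it under that reading;
it is not the scoring tool used for the WS96 figure. The function
names, the (src, dst, word) arc format, and the toy lattice are
hypothetical, and states are assumed to be numbered so that every arc
has src < dst.

```python
def oracle_errors(arcs, num_states, ref):
    """Minimum word edit distance between ref and any path from state 0 to the last state."""
    n = len(ref)
    inf = float("inf")
    # cost[s][j]: fewest edits aligning some path from state 0 to s with ref[:j]
    cost = [[inf] * (n + 1) for _ in range(num_states)]
    cost[0][0] = 0
    outgoing = {}
    for src, dst, word in arcs:
        outgoing.setdefault(src, []).append((dst, word))
    for state in range(num_states):  # topological order by assumption
        row = cost[state]
        for j in range(1, n + 1):
            # reference word with no lattice counterpart (deletion)
            row[j] = min(row[j], row[j - 1] + 1)
        for dst, word in outgoing.get(state, []):
            nxt = cost[dst]
            for j in range(n + 1):
                # lattice word with no reference counterpart (insertion)
                nxt[j] = min(nxt[j], row[j] + 1)
                if j > 0:
                    # match (cost 0) or substitution (cost 1)
                    nxt[j] = min(nxt[j], row[j - 1] + (word != ref[j - 1]))
    return cost[num_states - 1][n]


def lattice_wer(items):
    """items: list of (arcs, num_states, ref). Returns total oracle errors / total ref words."""
    errors = sum(oracle_errors(a, k, r) for a, k, r in items)
    words = sum(len(r) for _, _, r in items)
    return errors / words


# Toy lattice: "i" then ("scream" or "screen") then "today".
toy_arcs = [(0, 1, "i"), (1, 2, "scream"), (1, 2, "screen"), (2, 3, "today")]
in_lattice = oracle_errors(toy_arcs, 4, ["i", "scream", "today"])
not_in_lattice = oracle_errors(toy_arcs, 4, ["i", "screamed", "today"])
toy_lwer = lattice_wer([(toy_arcs, 4, ["i", "scream", "today"]),
                        (toy_arcs, 4, ["i", "screamed", "today"])])
```

The second reference contains a word that appears on no path, so even
a perfect rescoring pass would make one error on it.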
From Roni_Rosenfeld@HEE.SPEECH.CS.CMU.EDU Fri Jul 4 19:42:43 1997
Date: Thu, 13 Feb 97 10:06:13 EST
From: Roni_Rosenfeld@HEE.SPEECH.CS.CMU.EDU
Reply-To: ws97_pron@cspjhu.ece.jhu.edu
To: ws97_pron@cspjhu.ece.jhu.edu
Subject: Re: Chantilly summary
> The WS96 lattices have a LWER of about 13%.
Thanks, Sanjeev. (Btw, the LM95 LWER was 10%. Is the difference
solely due to a harder set?).
In any case, 13% is quite high, especially considering that the
lattice errors have an effect on the neighboring words due to word
boundary requirements and LM word transitions. So re-decoding, if
feasible, is desirable.
Otherwise, a cheating experiment might be to add the missing words
to the lattice. Here we need to distinguish between words that are
missing due to search errors (a real cheat) and those missing due to
modeling errors (less of a cheat, since they will be rejected again if
the improved pronunciations don't help them).
-Roni
From riley%tiberius%research.att.com@cspjhu.ece.jhu.edu
Date: Mon, 17 Feb 1997 21:58:08 -0500
From: "Michael D. Riley"
Reply-To: ws97_pron@cspjhu.ece.jhu.edu
To: ws97_pron@cspjhu.ece.jhu.edu
Subject: Re: Chantilly summary
Roni Rosenfeld writes:
> Is it possible to also do a complete re-decode, namely to somehow
> interface the FSM tools (or the other pron models) to the initial pass
> of the decoder? If it is possible, is it feasible? Say, for the FSM
> tools?
Andrej Ljolje and I have been trying that. Will let you know if we
succeed. Rescoring has the advantage that the search space, although
expanded by the multiple prons (especially myriads of deletions), is
kept manageable.
-m
Harriet Nock
Last modified: Sat Jul 5 14:58:59 EDT 1997