CLSP Workshop '99 Acoustic Properties of Speech Sounds 1
Anatomical Structures for Speech Production
[Figure: two views of the vocal tract with labeled structures: nasal cavity, hard palate, soft palate (velum), tongue, jaw, hyoid bone, epiglottis, thyroid cartilage, cricoid cartilage, vocal folds (vocal cords), esophagus, trachea, lung, sternum]
Sub-Word Linguistic Units
The phoneme is one of the most basic linguistic units used to
represent pronunciations of words
ASR systems typically represent words as phoneme sequences
English contains approximately 40 phonemes which can be
grouped by manner and place of articulation
Manner Class   Number
Vowels         16
Fricatives      8
Stops           6
Semivowels      4
Nasals          3
Affricates      2
Aspirant        1
Phonemes in American English
IPA   AB   Word      IPA   AB   Word      IPA   AB   Word
/i/   iy   beat      /s/   s    see       /w/   w    wet
/ɪ/   ih   bit       /ʃ/   sh   she       /r/   r    red
/e/   ey   bait      /f/   f    fee       /l/   l    let
/ɛ/   eh   bet       /θ/   th   thief     /y/   y    yet
/æ/   ae   bat       /z/   z    z         /m/   m    meet
/ɑ/   aa   Bob       /ʒ/   zh   Gigi      /n/   n    neat
/ɔ/   ao   bought    /v/   v    v         /ŋ/   ng   sing
/ʌ/   ah   but       /ð/   dh   thee      /č/   ch   church
/o/   ow   boat      /p/   p    pea       /ǰ/   jh   judge
/ʊ/   uh   book      /t/   t    tea       /h/   hh   heat
/u/   uw   boot      /k/   k    key
/ɝ/   er   bird      /b/   b    bay
/ɑʸ/  ay   bite      /d/   d    day
/ɔʸ/  oy   Boyd      /g/   g    geese
/ɑʷ/  aw   bout
/ə/   ax   about
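As a rough illustration of how an ASR system might represent words as phoneme sequences, the sketch below stores a few of the example words from the table as lists of labels from the AB column. The names LEXICON and pronounce are hypothetical, and the transcriptions are written by hand for illustration rather than taken from any particular dictionary.

```python
# Hypothetical pronunciation lexicon: each word maps to a sequence of
# phoneme labels taken from the AB column of the phoneme table.
LEXICON = {
    "beat": ["b", "iy", "t"],
    "bit": ["b", "ih", "t"],
    "boat": ["b", "ow", "t"],
    "see": ["s", "iy"],
    "sing": ["s", "ih", "ng"],
    "judge": ["jh", "ah", "jh"],
}


def pronounce(words):
    """Concatenate the phoneme sequences of the given words.

    Raises KeyError for a word that is not in the lexicon.
    """
    phonemes = []
    for word in words:
        phonemes.extend(LEXICON[word.lower()])
    return phonemes


if __name__ == "__main__":
    print(pronounce(["see", "boat"]))
```

A real lexicon would also need pronunciation variants and some policy for out-of-vocabulary words; neither is attempted here.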
Places of Articulation for Speech Production
[Figure: vocal tract diagram labeling the places of articulation: Labial, Dental, Alveolar, Alveopalatal, Palatal, Velar, Uvular]
A Speech Waveform
[Figure: waveform of the utterance "Two plus seven is less than ten"]
Spectral Representations
Speech waveforms are usually sampled at rates ranging from 8,000
(telephone) to 20,000 (wide-band) samples/sec
ASR systems typically transform the waveform into a spectral
representation: a sequence of frequency-based analyses, usually
performed at regular intervals (e.g., every 10 ms)
A short-time Fourier transform (STFT) performs a spectral analysis
on waveform segments short enough that the speech signal can be
assumed quasi-stationary within each segment
Each waveform segment is selected by a moving window, whose
type (e.g., Hamming) and duration (e.g., 5-25 ms) have a
significant impact on the resulting spectrum
A spectrogram is an image computed from the resulting sequence
of spectra, and is often used to examine the speech signal
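The analysis described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure used to produce the spectrograms in these slides: the function name stft_magnitude_db, the 25 ms window and 10 ms step defaults, and the synthetic test tone are all assumptions, and the input is assumed to be at least one window long.

```python
import numpy as np


def stft_magnitude_db(signal, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Short-time Fourier analysis with a moving Hamming window.

    Returns (spec_db, freqs): a frames-by-bins array of log-magnitude
    spectra in dB, and the center frequency of each bin in Hz.
    window_ms and hop_ms are illustrative defaults only.
    """
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = np.hamming(win_len)

    # Slice the waveform into overlapping segments and apply the window.
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[i * hop_len:i * hop_len + win_len] * window
        for i in range(n_frames)
    ])

    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(win_len, d=1.0 / sample_rate)
    # A small floor avoids log(0) on silent frames.
    return 20.0 * np.log10(magnitude + 1e-10), freqs


if __name__ == "__main__":
    fs = 8000                              # telephone-rate sampling
    t = np.arange(fs) / fs                 # one second of samples
    tone = np.sin(2 * np.pi * 1000.0 * t)  # synthetic 1 kHz test tone
    spec_db, freqs = stft_magnitude_db(tone, fs)
    peak = freqs[np.argmax(spec_db.mean(axis=0))]
    print(spec_db.shape, peak)
```

Displaying spec_db as an image, with time on one axis and frequency on the other, gives a spectrogram.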
A Wide-Band Speech Spectrogram
[Figure: wide-band spectrogram of the utterance "Two plus seven is less than ten"]
A Narrow-Band Speech Spectrogram
[Figure: narrow-band spectrogram of the utterance "Two plus seven is less than ten"]
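Wide-band and narrow-band spectrograms differ mainly in the duration of the analysis window: a short window gives a wide analysis bandwidth, a long window a narrow one. The sketch below is back-of-the-envelope arithmetic, assuming a Hamming window whose main lobe is roughly 4/T Hz wide (null to null) for a window of T seconds. The 5 ms and 25 ms durations are the endpoints of the range quoted earlier and are illustrative; the window lengths actually used for these figures are not stated. The function names and the 16 kHz sampling rate are also assumptions.

```python
def hamming_mainlobe_hz(window_ms):
    """Approximate null-to-null main-lobe width of a Hamming window, in Hz.

    Uses the common approximation 4 / T for a window lasting T seconds.
    """
    return 4.0 / (window_ms / 1000.0)


def window_samples(window_ms, sample_rate):
    """Number of samples spanned by a window of the given duration."""
    return int(round(sample_rate * window_ms / 1000.0))


if __name__ == "__main__":
    sample_rate = 16000  # assumed sampling rate, for illustration only
    for label, window_ms in [("short window (wide-band)", 5.0),
                             ("long window (narrow-band)", 25.0)]:
        print(label + ":",
              window_samples(window_ms, sample_rate), "samples,",
              round(hamming_mainlobe_hz(window_ms)), "Hz main lobe")
```

With the short window the analysis bandwidth tends to exceed the spacing between pitch harmonics, so harmonics blur together while rapid events stay sharp in time; the long window reverses the trade-off.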
Vowel Production
No significant constriction in the vocal tract
Usually produced with periodic excitation
Acoustic characteristics depend on the position of the jaw,
tongue, and lips
[Figure: vocal tract configurations for the vowels [i], [æ], [ɑ], [u]]
Vowels of American English
There are approximately 18 vowels in American English, made up
of monophthongs, diphthongs, and reduced vowels (schwas)
They are often described by the articulatory features High/Low,
Front/Back, Retroflexed, Rounded, and Tense/Lax
/i/   iy   beat      /ɔ/   ao   bought    /ɑʸ/  ay   bite
/ɪ/   ih   bit       /ʌ/   ah   but       /ɔʸ/  oy   Boyd
/e/   ey   bait      /o/   ow   boat      /ɑʷ/  aw   bout
/ɛ/   eh   bet       /ʊ/   uh   book      [ə]   ax   about
/æ/   ae   bat       /u/   uw   boot      [ɨ]   ix   roses
/ɑ/   aa   Bob       /ɝ/   er   Bert      [ɚ]   axr  butter
Vowel Formant Averages
Vowels are often characterized by F1, F2, and F3
High/Low is correlated with F1 (high vowels have a low F1)
Front/Back is correlated with F2 (front vowels have a high F2)
Retroflexion is marked by a low F3
[Figure: average F1, F2, and F3 frequency (Hz, axis 0 to 3500) for each vowel, in two panels: female speakers and male speakers]
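The three formant correlations listed above can be turned into a very coarse labeling rule. The sketch below is only an illustration of the direction of each correlation: the function name describe_vowel and the three split frequencies are hypothetical round numbers, not values read from the figure, and real boundaries differ between speakers (for example, between female and male averages).

```python
def describe_vowel(f1_hz, f2_hz, f3_hz,
                   f1_split=500.0, f2_split=1500.0, f3_split=2000.0):
    """Map formant frequencies (Hz) to coarse articulatory labels.

    The split points are hypothetical values chosen for illustration;
    they are not taken from the text or from measured data.
    """
    return {
        # A low F1 goes with a high vowel.
        "height": "high" if f1_hz < f1_split else "low",
        # A high F2 goes with a front vowel.
        "backness": "front" if f2_hz > f2_split else "back",
        # A low F3 marks retroflexion.
        "retroflexed": f3_hz < f3_split,
    }


if __name__ == "__main__":
    # Made-up formant measurements, for demonstration only.
    print(describe_vowel(300.0, 2300.0, 3000.0))
    print(describe_vowel(500.0, 1350.0, 1700.0))
```

Hard thresholds like these ignore speaker differences and formant motion within a vowel, so they are best read as a mnemonic for the correlations rather than as a classifier.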
Vowel Formant Trajectories
Diphthongs can have significant formant motion
Most vowels in American English are somewhat diphthongized
[Figure: vowel formant trajectories plotted as F2 (700 to 2700 Hz) against F1 (300 to 900 Hz), in two panels: female speakers and male speakers]
Vowel Durations
Each vowel has a different intrinsic duration
Schwas have distinctly shorter durations (50 ms)
/ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
Context can greatly influence vowel duration
[Figure: average duration (ms, axis 0 to 250) of each vowel, in two panels: female speakers and male speakers]
Fricative Production
Turbulence produced at narrow constriction
Constriction position determines acoustic characteristics
Can be produced with periodic excitation
[Figure: vocal tract configurations for the fricatives [f], [θ], [s], [ʃ]]
Fricatives of American English
There are 8 fricatives in American English
They are often described by the features Strident/Non-Strident
(Strong/Weak) and Voiced/Unvoiced
Four places of articulation: Labial, Dental, Alveolar, and Palatal
Type       Unvoiced            Voiced
Labial     /f/   f    fee      /v/   v    v
Dental     /θ/   th   thief    /ð/   dh   thee
Alveolar   /s/   s    see      /z/   z    z
Palatal    /ʃ/   sh   she      /ʒ/   zh   Gigi
Fricative Energy
[Figure: probability density (unadjusted for frequency) of average total energy, axis from -100 to -40, for strident and non-strident fricatives]
Strident fricatives tend to be stronger than non-strident
Fricative Durations
[Figure: probability density (unadjusted for frequency) of duration, axis from 0.0 to 0.30, for voiced and unvoiced fricatives]
Voiced fricatives tend to be shorter than unvoiced
Nasal Production
Velum lowering results in airflow through nasal cavity
Consonants produced with closure in oral cavity
Nasalized vowels have output through oral and nasal cavities
Nasal murmurs have similar spectral characteristics
[Figure: vocal tract configurations for the nasals [m], [n], [ŋ]]
Nasal Consonants of American English
Three places of articulation: Labial, Alveolar, and Velar
Always attached to a vowel, though can form an entire syllable in
unstressed environments ([n̩], [m̩], [ŋ̩])
/ŋ/ is always post-vocalic
Place identified by neighboring formant transitions
Type       Nasal
Labial     /m/   m    me
Alveolar   /n/   n    knee
Velar      /ŋ/   ng   sing
Nasal Durations
[Figure: average nasal duration (ms, axis 0 to 150) for singletons, unvoiced clusters, and voiced clusters]
Nasal consonants tend to be shorter in clusters with unvoiced
consonants, and longer with voiced consonants
Semivowel Production
Constriction in vocal tract, no turbulence
Slower articulatory motion than other consonants
Laterals form complete closure with tongue tip,
airflow via sides of constriction
[Figure: vocal tract configurations for the semivowels [w], [y], [r], [l]]
Semivowels of American English
There are 4 semivowels in American English
Always attached to a vowel, though /l/ can form an entire syllable
in unstressed environments ([l̩])
Extreme articulation of a corresponding vowel
  - Similar formant positions
  - Generally weaker due to constriction
Type      Semivowel           Nearest Vowel
Glides    /w/   w    wet      /u/
          /y/   y    yet      /i/
Liquids   /r/   r    red      /ɝ/
          /l/   l    let      /o/
Acoustic Properties of Semivowels
/w/ is characterized by a very low F1 and F2
  - Typically a rapid spectral falloff above F2
/y/ is characterized by a very low F1 and a very high F2
/r/ is characterized by a very low F3
  - Prevocalic F3 < medial F3 < postvocalic F3
/l/ is characterized by a low F1 and F2
  - Often presence of high-frequency energy
  - Postvocalic /l/ characterized by minimal spectral discontinuity,
    gradual motion of formants
Aspirant Production
/h/ in American English
Turbulence excitation at glottis
No constriction in the vocal tract, normal formant excitation
Coupling with subglottal system results in little energy in F1
region
Periodic excitation can be present in medial position
Stop Production
Complete closure in the vocal tract, pressure build up
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
[Figure: vocal tract configurations for the stops [b], [d], [g]]
Stops of American English
There are 6 stop consonants in American English
Same places of articulation as nasal consonants
Unvoiced stops are typically aspirated
Voiced stops usually exhibit a "voice bar" during closure
Information about formant transitions and release is useful for
classification
Type       Voiced              Unvoiced
Labial     /b/   b    bee      /p/   p    pea
Alveolar   /d/   d    Dee      /t/   t    tea
Velar      /g/   g    geese    /k/   k    key
Singleton Stop Durations
[Figure: average VOT duration (ms, axis 0 to 80) of the singleton stops /b/, /d/, /g/, /p/, /t/, /k/]
The voice onset time (VOT) of unvoiced stops
is longer than that of voiced stops
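VOT is the interval between the release of the stop and the onset of voicing. As a hedged sketch of how this single cue could be used, the code below computes VOT from two time marks and applies a fixed cut-off. The function names, the 30 ms threshold, and the time marks in the example are all hypothetical; the text gives no decision boundary.

```python
def vot_ms(release_time_s, voicing_onset_time_s):
    """VOT in milliseconds: time from stop release to onset of voicing."""
    return (voicing_onset_time_s - release_time_s) * 1000.0


def guess_voicing(vot, threshold_ms=30.0):
    """Label a stop as voiced or unvoiced from its VOT alone.

    threshold_ms is a hypothetical cut-off chosen for illustration.
    """
    return "unvoiced" if vot > threshold_ms else "voiced"


if __name__ == "__main__":
    # Made-up time marks (seconds) for two singleton stops.
    for release, onset in [(0.100, 0.160), (0.100, 0.110)]:
        vot = vot_ms(release, onset)
        print(round(vot), "ms ->", guess_voicing(vot))
```

A single threshold is fragile: unvoiced stops following /s/ are unaspirated, so their VOT falls toward the voiced range and this rule would mislabel them.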
/s/-Stop Durations
[Figure: average VOT duration (ms, axis 0 to 80) of /p/, /t/, /k/ in /s/-stop sequences]
Unvoiced stops are unaspirated in /s/-stop sequences
Stop-Semivowel Durations
[Figure: average VOT duration (ms, axis 0 to 100) of the stops /b/, /d/, /g/, /p/, /t/, /k/ as singletons and in [stop][semivowel] clusters]
Semivowels are partially devoiced in stop-semivowel sequences
Voicing Cues for Stops
There are many voicing cues for a stop
Affricate Production
Alveolar-stop palatal-fricative pairs
Sudden release of the constriction, turbulence noise
Can have periodic excitation during closure
Affricates of American English
There are two affricates in American English
Voiced              Unvoiced
/ǰ/   jh   judge    /č/   ch   church
Speech from a Close-Talking Microphone
[Figure: the utterance "The Thinker is a famous sculpture" (file sennheiser.wav, time axis 0.0 to 1.9 seconds), shown as a wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform]
Speech from an Omni-Directional Microphone
[Figure: the utterance "The Thinker is a famous sculpture" (file bk.wav, time axis 0.0 to 1.9 seconds), shown as a wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform]
Speech over a Telephone Channel
[Figure: the utterance "The Thinker is a famous sculpture" (file telephone.wav, time axis 0.0 to 1.9 seconds), shown as a wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform]
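The three measurement tracks that accompany each spectrogram in these microphone and telephone displays (zero crossing rate, total energy, and energy from 125 Hz to 750 Hz) can be approximated frame by frame. The sketch below is one plausible way to compute them with NumPy, not the tool that produced the figures: the function names, the 25 ms frame and 10 ms step, the Hamming window for the band measurement, and the synthetic test tones are all assumptions, and the dB values are relative to an arbitrary reference.

```python
import numpy as np


def frame_signal(signal, frame_len, hop_len):
    """Slice a 1-D signal into overlapping frames (frames x samples)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len:i * hop_len + frame_len]
                     for i in range(n_frames)])


def zero_crossing_rate(frames, sample_rate):
    """Sign changes per second within each frame."""
    signs = np.signbit(frames)
    crossings = np.sum(signs[:, 1:] != signs[:, :-1], axis=1)
    return crossings * sample_rate / (frames.shape[1] - 1)


def energy_db(frames):
    """Total energy of each frame in dB (arbitrary reference)."""
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)


def band_energy_db(frames, sample_rate, lo_hz=125.0, hi_hz=750.0):
    """Energy of each frame between lo_hz and hi_hz, in dB."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return 10.0 * np.log10(np.sum(power[:, in_band], axis=1) + 1e-10)


if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # Synthetic stand-ins: a low tone inside the band, a high tone outside it.
    for name, freq in [("500 Hz tone", 500.0), ("3000 Hz tone", 3000.0)]:
        frames = frame_signal(np.sin(2 * np.pi * freq * t + 0.1), 400, 160)
        print(name,
              round(float(zero_crossing_rate(frames, fs).mean())),
              round(float(energy_db(frames).mean()), 1),
              round(float(band_energy_db(frames, fs).mean()), 1))
```

The two tones have nearly the same total energy but very different zero crossing rates and low-band energies, which is the kind of contrast these tracks make visible between voiced, low-frequency sounds and noisy, high-frequency ones.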