Acoustic Properties of Speech Sounds
 Speech production
 Signal processing
 Properties of speech sounds of American English
 Microphone variations
 Cheat sheets!
 Spectrographic Examples

CLSP Workshop '99 Acoustic Properties of Speech Sounds 1

Anatomical Structures for Speech Production

[Figure: anatomical diagrams of the speech production system. Labeled structures: nasal cavity, hard palate, soft palate (velum), tongue, jaw, hyoid bone, epiglottis, thyroid cartilage, cricoid cartilage, vocal cords (vocal folds), esophagus, trachea, lung, sternum]

Sub-Word Linguistic Units
 The phoneme is one of the most basic linguistic units used to
represent pronunciations of words
 ASR systems typically represent words as phoneme sequences

 English contains approximately 40 phonemes which can be
grouped by manner and place of articulation

Manner Class   Number
Vowels         16
Fricatives      8
Stops           6
Semivowels      4
Nasals          3
Affricates      2
Aspirant        1
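
As a rough illustration of words represented as phoneme sequences, the sketch below builds a tiny pronunciation lexicon. The phoneme labels follow the two-letter "AB" symbols used in these slides; the word entries and the names `LEXICON`, `MANNER_COUNTS`, and `pronounce` are illustrative assumptions, not part of any particular ASR toolkit.

```python
# Manner-class counts copied from the table above.
MANNER_COUNTS = {
    "vowels": 16, "fricatives": 8, "stops": 6, "semivowels": 4,
    "nasals": 3, "affricates": 2, "aspirant": 1,
}

# Hypothetical mini-lexicon: each word maps to a sequence of phoneme labels.
LEXICON = {
    "beat": ["b", "iy", "t"],
    "bit": ["b", "ih", "t"],
    "sing": ["s", "ih", "ng"],
    "church": ["ch", "er", "ch"],
}

def pronounce(words):
    """Concatenate the phoneme sequences of a list of words."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word.lower()])
    return phones

total_phonemes = sum(MANNER_COUNTS.values())
print(total_phonemes)                # 40
print(pronounce(["beat", "sing"]))   # ['b', 'iy', 't', 's', 'ih', 'ng']
```

The class counts sum to the "approximately 40" phonemes quoted above; a real lexicon would also need to handle multiple pronunciations per word.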


Phonemes in American English
IPA    AB   Word      IPA   AB   Word     IPA   AB   Word
/i/    iy   beat      /s/   s    see      /w/   w    wet
/ɪ/    ih   bit       /ʃ/   sh   she      /r/   r    red
/e/    ey   bait      /f/   f    fee      /l/   l    let
/ɛ/    eh   bet       /θ/   th   thief    /y/   y    yet
/æ/    ae   bat       /z/   z    z        /m/   m    meet
/ɑ/    aa   bob       /ʒ/   zh   Gigi     /n/   n    neat
/ɔ/    ao   bought    /v/   v    v        /ŋ/   ng   sing
/ʌ/    ah   but       /ð/   dh   thee     /č/   ch   church
/o/    ow   boat      /p/   p    pea      /ǰ/   jh   judge
/ʊ/    uh   book      /t/   t    tea      /h/   hh   heat
/u/    uw   boot      /k/   k    key
/ɝ/    er   bird      /b/   b    bay
/aʸ/   ay   bite      /d/   d    day
/ɔʸ/   oy   Boyd      /g/   g    geese
/aʷ/   aw   bout
/ə/    ax   about
Places of Articulation for Speech Production
[Figure: vocal tract diagram marking the places of articulation: Labial, Dental, Alveolar, Alveopalatal, Palatal, Velar, Uvular]

A Speech Waveform

[Figure: waveform of the utterance "Two plus seven is less than ten"]
Spectral Representations
 Speech waveforms are usually sampled at rates ranging from 8,000
(telephone) to 20,000 (wide-band) samples/sec

 ASR systems typically transform the waveform into a spectral
representation: a sequence of frequency-based analyses, usually
performed at regular intervals (e.g., every 10 ms)

 A short-time Fourier transform (STFT) performs a spectral analysis
on waveform segments short enough that the speech signal can be
assumed to be quasi-stationary

 Each waveform segment is extracted by a moving window, whose
type (e.g., Hamming) and duration (e.g., 5-25 ms) have a
significant impact on the resulting spectrum

 A spectrogram is an image computed from the resulting sequence
of spectra, and is often used to examine the waveform
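
The windowed analysis described above can be sketched with NumPy. This is a minimal illustration, not the analysis used to produce the spectrograms in these slides: the function name `stft_magnitude`, the 16 kHz synthetic tone, and the default window and hop values are assumptions chosen from the ranges quoted above. A short window gives a wide-band analysis (fine time resolution, coarse frequency resolution), while a long window gives a narrow-band analysis.

```python
import numpy as np

def stft_magnitude(signal, sample_rate, window_ms=25.0, hop_ms=10.0):
    """Return a (frames x frequency bins) log-magnitude spectrogram in dB."""
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    # Slice the waveform into overlapping segments and taper each one.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(spectrum + 1e-10)  # small offset avoids log(0)

# Synthetic input (an assumption): one second of a 1 kHz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

wide = stft_magnitude(tone, sample_rate, window_ms=5.0)     # short window
narrow = stft_magnitude(tone, sample_rate, window_ms=25.0)  # long window
print(wide.shape, narrow.shape)
```

With these settings the 5 ms window yields bins spaced 200 Hz apart and the 25 ms window yields bins spaced 40 Hz apart, which is the trade-off the window-duration bullet refers to.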


A Wide-Band Speech Spectrogram

[Figure: wide-band spectrogram of the utterance "Two plus seven is less than ten"]
A Narrow-Band Speech Spectrogram
[Figure: narrow-band spectrogram of the utterance "Two plus seven is less than ten"]
Vowel Production
 No significant constriction in the vocal tract
 Usually produced with periodic excitation
 Acoustic characteristics depend on the position of the jaw,
tongue, and lips

[Figure: illustrations for the vowels [i], [æ], [ɑ], and [u]]

Vowels of American English
 There are approximately 18 vowels in American English, made up
of monophthongs, diphthongs, and reduced vowels (schwas)

 They are often described by the articulatory features High/Low,
Front/Back, Retroflexed, Rounded, and Tense/Lax

/i/  iy  beat     /ɔ/  ao  bought    /aʸ/  ay   bite
/ɪ/  ih  bit      /ʌ/  ah  but       /ɔʸ/  oy   Boyd
/e/  ey  bait     /o/  ow  boat      /aʷ/  aw   bout
/ɛ/  eh  bet      /ʊ/  uh  book      [ə]   ax   about
/æ/  ae  bat      /u/  uw  boot      [ɨ]   ix   roses
/ɑ/  aa  Bob      /ɝ/  er  Bert      [ɚ]   axr  butter


Vowel Formant Averages
 Vowels are often characterized by F1, F2, and F3
 High/Low is correlated with F1
 Front/Back is correlated with F2
 Retroflexion is marked by a low F3

[Figure: two charts (Female Speakers, Male Speakers) of average F1, F2, and F3 frequency (Hz, 0 to 3500) for the vowels iʸ, ɪ, eʸ, ɛ, æ, ɑ, ɔ, ʌ, oʷ, ʊ, u, ɝ, ə, ɨ]
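
The formant correlations listed on this slide can be turned into a toy rule set. The sketch below relies on the standard direction of each correlation (high vowels have a low F1, front vowels have a high F2, retroflexed vowels have a low F3). The function name `describe_vowel` and all three threshold values are illustrative assumptions; they are not read from the charts, and real thresholds would differ between female and male speakers.

```python
def describe_vowel(f1_hz, f2_hz, f3_hz,
                   f1_split=500.0, f2_split=1500.0, f3_retroflex=2000.0):
    """Map formant frequencies to coarse articulatory labels.

    The three thresholds are hypothetical defaults, not measured values.
    """
    return {
        # High vowels have a low first formant.
        "height": "high" if f1_hz < f1_split else "low",
        # Front vowels have a high second formant.
        "backness": "front" if f2_hz > f2_split else "back",
        # Retroflexion shows up as an unusually low third formant.
        "retroflexed": f3_hz < f3_retroflex,
    }

# Made-up formant triples, chosen only to exercise each branch.
print(describe_vowel(300.0, 2300.0, 3000.0))
print(describe_vowel(700.0, 1100.0, 2500.0))
print(describe_vowel(450.0, 1400.0, 1700.0))
```

A practical classifier would normalize for the speaker and use formant trajectories rather than single values, since most vowels are somewhat diphthongized.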

Vowel Formant Trajectories
 Diphthongs can have significant formant motion
 Most vowels in American English are somewhat diphthongized

[Figure: two charts (Female Speakers, Male Speakers) of vowel formant trajectories in the F1 (300 to 900) versus F2 (700 to 2700) plane, with each trajectory labeled by its vowel]

Vowel Durations
 Each vowel has a different intrinsic duration
 Schwas have distinctly shorter durations (about 50 ms)
 /ɪ, ɛ, ʌ, ʊ/ are the shortest monophthongs
 Context can greatly influence vowel duration

[Figure: two charts (Female Speakers, Male Speakers) of average duration (ms, 0 to 250) for the vowels iʸ, ɪ, eʸ, ɛ, æ, ɑ, ɔ, ʌ, oʷ, ʊ, u, ɝ, ə, ɨ, aʷ, ɔʸ, aʸ, ʸu]
Fricative Production
 Turbulence produced at narrow constriction
 Constriction position determines acoustic characteristics
 Can be produced with periodic excitation

[Figure: illustrations for the fricatives [f], [θ], [s], and [ʃ]]

Fricatives of American English
 There are 8 fricatives in American English
 They are often described by the features Strident/Non-Strident
(Strong/Weak) and Voiced/Unvoiced

 Four places of articulation: Labial, Dental, Alveolar, and Palatal

Type      Unvoiced           Voiced
Labial    /f/  f   fee       /v/  v   v
Dental    /θ/  th  thief     /ð/  dh  thee
Alveolar  /s/  s   see       /z/  z   z
Palatal   /ʃ/  sh  she       /ʒ/  zh  Gigi

Fricative Energy
[Figure: probability density (unadjusted for frequency) of average total energy, x-axis from -100 to -40, with separate curves for STRIDENT and NON-STRIDENT fricatives]

Strident fricatives tend to be stronger than non-strident

Fricative Durations

[Figure: probability density (unadjusted for frequency) of fricative duration, x-axis from 0.0 to 0.30, with separate curves for VOICED and UNVOICED fricatives]

Voiced fricatives tend to be shorter than unvoiced
Nasal Production
 Velum lowering results in airflow through the nasal cavity
 Consonants are produced with closure in the oral cavity
 Nasalized vowels have output through both oral and nasal cavities
 Nasal murmurs have similar spectral characteristics

[Figure: illustrations for the nasals [m], [n], and [ŋ]]

Nasal Consonants of American English
 Three places of articulation: Labial, Alveolar, and Velar
 Always attached to a vowel, though they can form an entire syllable
in unstressed environments ([n̩], [m̩], [ŋ̩])

 /ŋ/ is always post-vocalic

 Place is identified by neighboring formant transitions

Type      Nasal
Labial    /m/  m   me
Alveolar  /n/  n   knee
Velar     /ŋ/  ng  sing

Nasal Durations
[Figure: nasal duration (ms, 0 to 150) for three contexts: Singleton, Unvoiced Cluster, Voiced Cluster]

Nasal consonants tend to be shorter in clusters with unvoiced
consonants, and longer with voiced consonants


Semivowel Production
 Constriction in the vocal tract, but no turbulence
 Slower articulatory motion than other consonants
 Laterals form a complete closure with the tongue tip,
with airflow around the sides of the constriction

[Figure: illustrations for the semivowels [w], [y], [r], and [l]]

Semivowels of American English
 There are 4 semivowels in American English
 Always attached to a vowel, though /l/ can form an entire syllable
in unstressed environments ([l̩])

 Extreme articulation of a corresponding vowel
   - Similar formant positions
   - Generally weaker due to constriction

Type     Semivowel       Nearest Vowel
Glides   /w/  w  wet     /u/
         /y/  y  yet     /i/
Liquids  /r/  r  red     /ɝ/
         /l/  l  let     /o/


Acoustic Properties of Semivowels
 /w/ is characterized by a very low F1 and F2
   - Typically a rapid spectral falloff above F2
 /y/ is characterized by a very low F1 and a very high F2
 /r/ is characterized by a very low F3
   - Prevocalic F3 < medial F3 < postvocalic F3
 /l/ is characterized by a low F1 and F2
   - Often presence of high-frequency energy
   - Postvocalic /l/ is characterized by minimal spectral discontinuity
     and gradual motion of formants

Aspirant Production
 /h/ in American English
 Turbulence excitation at the glottis
 No constriction in the vocal tract, normal formant excitation
 Coupling with the subglottal system results in little energy in the
F1 region

 Periodic excitation can be present in medial position


Stop Production
 Complete closure in the vocal tract, pressure buildup
 Sudden release of the constriction, turbulence noise
 Can have periodic excitation during closure

[Figure: illustrations for the stops [b], [d], and [g]]

Stops of American English
 There are 6 stop consonants in American English
 Same places of articulation as nasal consonants
 Unvoiced stops are typically aspirated
 Voiced stops usually exhibit a "voice bar" during closure
 Information about formant transitions and the release is useful for
classification

Type      Voiced            Unvoiced
Labial    /b/  b  bee       /p/  p  pea
Alveolar  /d/  d  Dee       /t/  t  tea
Velar     /g/  g  geese     /k/  k  key


Singleton Stop Durations

[Figure: VOT duration (ms, 0 to 80) for the singleton stops b, d, g, p, t, k]

The voice onset time (VOT) of unvoiced stops
is longer than that of voiced stops
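
The VOT contrast can be expressed as a simple measurement plus a threshold. In the sketch below, the function names and the 30 ms threshold are assumptions for illustration only; the slides give no numeric boundary, and the burst and voicing-onset times would have to come from some earlier detection step that is not shown here.

```python
def voice_onset_time_ms(burst_time_s, voicing_onset_s):
    """VOT: delay from the stop release burst to the start of voicing, in ms."""
    return (voicing_onset_s - burst_time_s) * 1000.0

def guess_voicing(vot_ms, threshold_ms=30.0):
    """Label a stop from VOT alone; the threshold is a hypothetical value."""
    return "unvoiced" if vot_ms > threshold_ms else "voiced"

# Made-up event times (seconds) for two stop releases.
long_vot = voice_onset_time_ms(0.100, 0.165)
short_vot = voice_onset_time_ms(0.400, 0.412)
print(round(long_vot), guess_voicing(long_vot))
print(round(short_vot), guess_voicing(short_vot))
```

VOT alone is not a reliable cue in every context: unvoiced stops lose their aspiration after /s/, so a classifier built only on this rule would mislabel them.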

/s/-Stop Durations
[Figure: VOT duration (ms, 0 to 80) for p, t, k in /s/-stop sequences]

Unvoiced stops are unaspirated in /s/-stop sequences

Stop-Semivowel Durations

[Figure: VOT duration (ms, 0 to 100) for the stops b, d, g, p, t, k, comparing singletons with [stop][semivowel] clusters]

Semivowels are partially devoiced in stop-semivowel sequences
Voicing Cues for Stops
There are many voicing cues for a stop

Affricate Production
 Alveolar-stop palatal-fricative pairs
 Sudden release of the constriction, turbulence noise
 Can have periodic excitation during closure

Affricates of American English
There are two affricates in American English
Voiced             Unvoiced
/ǰ/  jh  judge     /č/  ch  church

Speech from a Close-Talking Microphone
[Figure: spectrographic display of the utterance "The Thinker is a famous sculpture", 0.0 to 1.9 seconds. Panels: wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform. File: /server/users/jwc/latex/sum97/sennheiser.wav, printed by jwc on Wed Jul 16 11:58:32 1997]
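
The display includes tracks for zero crossing rate, total energy, and energy in the 125 Hz to 750 Hz band. The sketch below shows one plausible way to compute such tracks frame by frame with NumPy. It is not the tool that produced these figures: the function name `frame_measurements`, the 25 ms frame, the 10 ms hop, and the Hamming window are all assumptions.

```python
import numpy as np

def frame_measurements(signal, sample_rate, frame_ms=25.0, hop_ms=10.0,
                       band_hz=(125.0, 750.0)):
    """Return per-frame zero crossing rate (crossings/sec), total energy (dB),
    and band-limited energy (dB)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    zcr, total_db, band_db = [], [], []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        # Count sign changes between neighboring samples, scaled to per second.
        crossings = np.count_nonzero(
            np.signbit(frame[1:]) != np.signbit(frame[:-1]))
        zcr.append(crossings * sample_rate / frame_len)
        # Power spectrum of the tapered frame.
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        total_db.append(10.0 * np.log10(power.sum() + 1e-12))
        band_db.append(10.0 * np.log10(power[in_band].sum() + 1e-12))
    return np.array(zcr), np.array(total_db), np.array(band_db)

# Synthetic inputs (assumptions): a low tone and a high tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
low_zcr, low_total, low_band = frame_measurements(
    np.sin(2 * np.pi * 500.0 * t + 0.3), sr)
high_zcr, high_total, high_band = frame_measurements(
    np.sin(2 * np.pi * 4000.0 * t + 0.3), sr)
print(low_zcr.mean(), high_zcr.mean())
```

For the low tone, nearly all of the energy falls inside the 125 Hz to 750 Hz band; for the high tone the band energy is far below the total, which is the kind of contrast these tracks make visible between voiced sounds and high-frequency frication.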


Speech from an Omni-Directional Microphone

[Figure: spectrographic display of the utterance "The Thinker is a famous sculpture", 0.0 to 1.9 seconds. Panels: wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform. File: /server/users/jwc/latex/sum97/bk.wav, printed by jwc on Wed Jul 16 11:57:43 1997]

Speech over a Telephone Channel
[Figure: spectrographic display of the utterance "The Thinker is a famous sculpture", 0.0 to 1.9 seconds. Panels: wide-band spectrogram (0 to 8 kHz), zero crossing rate (0 to 16 kHz), total energy (dB), energy from 125 Hz to 750 Hz (dB), and waveform. File: /server/users/jwc/latex/sum97/telephone.wav, printed by jwc on Wed Jul 16 11:59:12 1997]
