Workshop 2000

Digital Sound

Paul Bamberg
Dragon Systems, Inc.


The "real" sound that we hear is "analog": air pressure or velocity is a continuously varying function of time. A microphone converts it to a voltage that is also a continuous function of time.

To be used by a digital computer, sound must first be digitized. An analog-to-digital converter (ADC) samples the sound at regular intervals and converts the voltage for each sample to an integer.

Common sampling rates:

8000 and 16000 Hertz (the rest of the world)
11025, 22050, 44100 Hertz (Windows)

(roughly 20000 Hertz is the upper limit of human hearing)

Common sample values:

-128 to 127 (8-bit sound)
-2048 to 2047 (12-bit sound)
-32768 to 32767 (16-bit sound)

11025 Hertz, 8 bits takes about 0.66 megabytes per minute
44100 Hertz, 16 bits takes about 5 megabytes per minute
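
The storage figures above follow from simple arithmetic. The sketch below reproduces them under two assumptions that the text does not state: the sound is mono (one channel), and a megabyte means one million bytes. The function name is illustrative only.

```python
def megabytes_per_minute(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed storage for one minute of sound, in millions of bytes.

    Assumes mono by default and 1 megabyte = 1,000,000 bytes.
    """
    bytes_per_second = sample_rate_hz * (bits_per_sample / 8) * channels
    return bytes_per_second * 60 / 1_000_000


print(megabytes_per_minute(11025, 8))   # prints 0.6615
print(megabytes_per_minute(44100, 16))  # prints 5.292
```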

Compression can save a factor of 10 without perceived loss of quality.

What can go wrong during conversion

Clipping: some of the samples are too large or too negative to represent. The effect on a sinusoidal sound is to add "harmonics" of the frequency.

Quantization: since only integer values are allowed, the recorded sample values are only an approximation of the true values.

Aliasing: with a sample rate of 2N Hertz it is impossible to represent sound whose frequency exceeds N Hertz (the "Nyquist frequency.")

After sampling, a frequency of M Hertz (N < M < 2N) is indistinguishable from (2N - M) Hertz.

Example: sampling frequency is 11025 Hertz.
Samples taken from a sine wave of frequency 8000 Hertz look as though they came from a wave of frequency 3025 Hertz.
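
This example can be checked numerically. The sketch below samples both sine waves at 11025 Hertz and compares them sample by sample. The two sets of samples agree except for a sign flip (a phase inversion), which does not change the apparent frequency. The helper name and the choice of 50 samples are illustrative assumptions.

```python
import math

SAMPLE_RATE = 11025  # Hertz, as in the example above


def sample_sine(freq_hz, num_samples, sample_rate=SAMPLE_RATE):
    """Return num_samples samples of a unit sine wave at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]


high = sample_sine(8000, 50)  # above the Nyquist frequency (5512.5 Hertz)
low = sample_sine(3025, 50)   # 11025 - 8000

# Each 8000 Hertz sample is the negative of the 3025 Hertz sample,
# so the sum of each pair is zero up to rounding error.
max_diff = max(abs(h + l) for h, l in zip(high, low))
print(max_diff)
```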

When clipping or quantization occurs, the extra harmonics that are generated may be above the Nyquist frequency. They get "aliased down" below the Nyquist frequency and may well be lower in pitch than the original.

Needless to say, this makes a mess of speech recognition!
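
The harmonics that clipping adds to a sinusoid can be seen with a small discrete Fourier transform. The sketch below is an illustration under assumed values: one period of a unit sine is represented by 80 samples and clipped at half its peak amplitude. The function name is hypothetical. Because the clipping is symmetric, only odd harmonics (3, 5, ...) should appear.

```python
import cmath
import math


def harmonic_amplitude(samples, k):
    """Amplitude of the k-th harmonic of one period of samples (plain DFT)."""
    total = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * n / total)
              for n, x in enumerate(samples))
    return 2 * abs(acc) / total


N = 80  # samples in one period (assumed for illustration)
sine = [math.sin(2 * math.pi * n / N) for n in range(N)]
clipped = [max(-0.5, min(0.5, x)) for x in sine]  # symmetric clipping

# Columns: harmonic number, amplitude before clipping, amplitude after.
for k in (1, 2, 3, 5):
    print(k,
          round(harmonic_amplitude(sine, k), 4),
          round(harmonic_amplitude(clipped, k), 4))
```

The unclipped sine has energy only at harmonic 1, while the clipped version also shows a clearly nonzero third harmonic.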

Examples (sampling at 11025 Hertz): Start with 2756 Hz and "hard-clip" it.
Harmonics are generated at 3 * 2756 = 8268 -> 2757

and at 5 * 2756 = 13780 -> 2755 (above the sampling frequency, so 13780 - 11025)
The result sounds almost unchanged.

Do the same with 2700 Hz.
Harmonics are at 3 * 2700 = 8100 -> 2925

and at 5 * 2700 = 13500 -> 2475
The result sounds like a buzzer.
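
The aliased frequencies in these two examples can be reproduced by "folding" each harmonic into the range from 0 to the Nyquist frequency. This is a minimal sketch with a hypothetical function name. It generalizes the 2N - M rule to frequencies above the sampling frequency as well.

```python
def aliased_frequency(freq_hz, sample_rate_hz):
    """Frequency between 0 and sample_rate_hz / 2 that freq_hz
    is indistinguishable from after sampling."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


# Columns: fundamental, harmonic number, harmonic frequency, aliased frequency.
for fundamental in (2756, 2700):
    for harmonic in (3, 5):
        freq = harmonic * fundamental
        print(fundamental, harmonic, freq, aliased_frequency(freq, 11025))
```

For 2756 Hz the aliased harmonics land within 1 Hz of the fundamental, which is presumably why the result sounds almost unchanged. For 2700 Hz they land 225 Hz away on either side.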

Digital speech is a mixture of "voiced" and "unvoiced" segments.

Voiced speech (vowels, r, l, m, n, ng) is almost periodic, with a well-defined fundamental frequency ("pitch"). It is vulnerable to clipping.

If you repeatedly play a short segment of voiced speech, it is easy to hear the pitch (almost like singing).

Unvoiced speech (s, f, th, sh) is random noise, not periodic at all. It typically has low amplitude and is vulnerable to quantization.
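
To see why low amplitude makes a signal vulnerable to quantization, compare the signal-to-noise ratio after rounding to integers for a loud and a quiet signal. The sketch below uses a sine wave as a stand-in for speech. The amplitudes, the frequency, and the function name are assumptions chosen for illustration, not values from the text.

```python
import math


def quantization_snr_db(amplitude, num_samples=1000):
    """Signal-to-noise ratio in dB after rounding a sine of the given
    peak amplitude to integer sample values."""
    # 0.0123 cycles per sample is an arbitrary test frequency.
    signal = [amplitude * math.sin(2 * math.pi * 0.0123 * n)
              for n in range(num_samples)]
    noise = [round(x) - x for x in signal]  # quantization error
    signal_power = sum(x * x for x in signal)
    noise_power = sum(e * e for e in noise)
    return 10 * math.log10(signal_power / noise_power)


loud = quantization_snr_db(20000)  # uses most of the 16-bit range
quiet = quantization_snr_db(20)    # uses only a few dozen levels
print(round(loud, 1), round(quiet, 1))
```

The rounding error has about the same size in both cases, so the quiet signal should come out with a much worse ratio, about 60 dB lower for an amplitude 1000 times smaller.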

All speech is vulnerable to aliasing unless the frequencies above the Nyquist frequency are filtered out before the speech is digitized. This is done over the telephone (8000 Hz sampling) so that an "s" sound will be lost instead of being aliased.

With practice, you can learn to recognize some phonemes (at least for one talker) from their waveforms.

Recording Speech - the "dump truck" algorithm
Problem - your automated gravel mine spews out a steady stream of sand and gravel - first sand, then gravel, then sand again. You want to use just the gravel to pave a new road.

Unsatisfactory solution: spray the output of the mine directly onto the road and try to separate sand from gravel.

Better solution: Get two dump trucks and send them to the mine. After the first one is full, while the second is loading, look in the first truck. If you have gravel, keep it and put it on the road, otherwise throw it out. Make sure the first truck is empty and back at the mine before the second one is full. As soon as you have covered the road with gravel or the mine has stopped producing gravel, send the dump trucks home.

The mine is the output of your ADC. Sand is silence, gravel speech.

The road is the region of memory (or a file) in which you place the speech.

The dump trucks are two "memory buffers" that receive speech from the ADC. They are filled automatically by the Windows sound system, and you receive a message when a buffer is full. It is then the responsibility of your program to inspect and perhaps copy the contents of the full buffer.
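
The dump truck algorithm can be sketched as a sequential simulation. In the real program the buffers are filled asynchronously by the Windows sound system. Here a list of samples stands in for the ADC, and a simple mean-energy threshold stands in for the test that decides whether a buffer contains speech. The function names, buffer size, threshold, and memory limit are all hypothetical choices for illustration.

```python
def mean_energy(buffer):
    """Average squared sample value of a buffer."""
    return sum(s * s for s in buffer) / len(buffer)


def record_speech(adc_samples, buffer_size=4, threshold=100.0,
                  max_kept=10_000):
    """Keep buffers that look like speech and discard silence.

    Stops when silence follows speech (the mine has stopped producing
    gravel) or when max_kept samples are stored (the road is covered).
    """
    trucks = [[], []]    # the two memory buffers
    road = []            # memory region that receives the speech
    seen_speech = False
    filling = 0          # index of the truck currently at the mine
    for start in range(0, len(adc_samples), buffer_size):
        trucks[filling] = adc_samples[start:start + buffer_size]
        full = trucks[filling]
        filling = 1 - filling            # the other truck starts loading
        if mean_energy(full) >= threshold:
            road.extend(full)            # gravel: keep it
            seen_speech = True
            if len(road) >= max_kept:
                break
        elif seen_speech:
            break                        # sand after gravel: stop
        # otherwise: leading silence, throw the buffer away
    return road


stream = ([0, 1, 0, -1] + [50, -60, 40, -30] + [45, -55, 35, -25]
          + [0, 0, 1, 0] + [70, -70, 70, -70])
print(record_speech(stream))
```

With this made-up stream, the leading silence is discarded, the two loud buffers are kept, and recording stops at the first quiet buffer that follows them.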

Playback works the same way - you fill one or more buffers with speech, and send them off for playback. Transfer from the buffers to your audio output takes place without intervention from your program, and you receive a message when playback of a buffer is complete. By using two buffers you can play a large file without loading it all into memory.

Design of the demonstration program (written in Visual C++ using MFC)

[Diagram: components of the demonstration program. Shown are a "Document" object with an embedded "Wave" object, an input file, an output file, callback functions (outside all objects), a form view object (buttons, etc.), and a scrolling view object.]

When a buffer is filled, Windows calls a callback function. That sends a message to the wave object, which determines whether the buffer contains speech and, if so, saves it. When speech followed by silence has been seen, the wave object calls a function to stop recording.

When recording or playback is finally done, Windows calls a callback function. That sends a message to the wave object, which in turn sends a message to the form view object to re-enable the buttons.

When the scroll bar is moved in the scrolling view, it calls the function UpdateAllViews with the scroll position as a "hint."
