Who’s Calling?

June 20, 2015

The call comes into the White House switchboard and the voice on the other end of the line hisses: “The president better not come to New York tomorrow if he knows what’s good for him.”

“Wouldn’t the Secret Service like to know who is calling?” asks Sanjeev Khudanpur, associate professor of electrical and computer engineering and a member of the Center for Language and Speech Processing.

A one-year grant from the Forensic Services Division of the Secret Service, the agency that protects the president and other national leaders, is helping Khudanpur and Adjunct Professor Jack Godfrey, former chief of human technology language research at the National Security Agency, close in on one of the toughest challenges in forensic science: identifying suspects by their voices alone, with enough certainty to sway a jury.

“Think of the Trayvon Martin case, where we have four seconds of cellphone audio of someone shouting ‘Help!’” Khudanpur says. “Or, somebody robs a convenience store. He’s wearing a mask, but as he leaves, he shouts something and the security camera picks it up.”

The promise of using voice-recognition technology to assist the justice system, Khudanpur says, lies in the fact that each person’s spectrum of vocal energy and pitch is tied to his or her own unique throat structure and nasal physiology. But teaching a machine, or a CSI professional, to tell the difference between two very similar voices is extremely difficult, especially when the audio comes from a speakerphone, a video camera, or an online chat.

“Voice recognition is inherently hard,” Khudanpur says. “Let’s stop thinking that it is purely an engineering problem, and let’s stop thinking that it is purely a human detective problem. We need to marry the computer and the human elements.”

To test and refine their software, Khudanpur and Godfrey capture the speech of hundreds of volunteers, then play back the recordings in pairs. Sometimes the pairs are of the same person, and sometimes they are not. Khudanpur says that “it is now possible [under favorable conditions] to identify matched pairs [and discard mismatched pairs] about 99 times out of 100.”

This article originally appeared in the Summer 2015 issue of Johns Hopkins Engineering magazine.
