How Things Work: Speech Recognition

“Any sufficiently advanced technology is indistinguishable from magic,” wrote Arthur C. Clarke, the renowned 20th-century science fiction author, and we can still see the truth of that statement all around us today.

“Open Sesame!” chanted Ali Baba to open the treasure cavern of the Forty Thieves. Today, we deal with much the same in our daily lives: from phones that dial numbers at the sound of a name to automated voice menus in various phone services, voice recognition is a little bit of magic that technology has introduced into our lives.

Voice recognition, however, is a far trickier business than it sounds. If even real-life human students have trouble understanding professors’ thick accents, what hope do machines have in this regard?

They have quite a bit of hope, it would seem, judging by the growing prevalence of voice-activated devices today. Not unlike humans, voice recognition programs take a statistical approach: they analyze the sound, work out what you might have said, and pick the most likely option.

First, the microphone picks up the speech input, which is digitized into a waveform and immediately filtered to remove most frequencies outside the normal range of human speech, cutting out static, background noise, and general interference. Digitizing the sound is perhaps the easiest part; it’s the subsequent analysis that takes the most time.
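
The filtering step can be sketched in a few lines of Python. This is only an illustration, not how any particular product does it: the function name band_pass, the 300 to 3,400 Hz cutoffs (the traditional telephone voice band), and the test tones are all assumptions chosen for the demo.

```python
import cmath
import math

def band_pass(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Zero every frequency component outside [low_hz, high_hz]."""
    n = len(samples)
    # Forward discrete Fourier transform (naive O(n^2), fine for a short demo).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bins above n/2 mirror the lower bins, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse transform; the imaginary part is only rounding error.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

# One-twentieth of a second at 8 kHz: a 1,000 Hz "voice" tone plus 60 Hz hum.
RATE = 8000
N = 400
voice = [math.sin(2 * math.pi * 1000 * t / RATE) for t in range(N)]
hum = [0.5 * math.sin(2 * math.pi * 60 * t / RATE) for t in range(N)]
noisy = [v + h for v, h in zip(voice, hum)]
cleaned = band_pass(noisy, RATE)

# Largest difference between the cleaned signal and the pure voice tone.
print(max(abs(c - v) for c, v in zip(cleaned, voice)))
```

The hum sits well below the lower cutoff, so it is removed while the voice tone passes through essentially untouched. A real device would more likely use a streaming filter than a whole-signal transform like this one.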

After filtering, the computer scans the voice file and attempts to break it up into phonemes, the smallest units of sound that distinguish one word from another. For example, the word “cat” is constructed out of three phonemes: “k,” “a,” and “t.” Each segment of the broken-up voice file is then analyzed and matched to a phoneme or, more commonly, to several possible candidates.
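
As a rough sketch of how a segment might end up with several candidate phonemes, the Python below compares each segment against stored templates and keeps the closest few. Everything here is hypothetical: the two-number "fingerprints," the TEMPLATES table, and the scoring formula are invented for illustration, and real recognizers use far richer acoustic features.

```python
import math

# Hypothetical two-number fingerprints for three phonemes (invented values).
TEMPLATES = {
    "k": (0.9, 0.2),
    "a": (0.1, 0.8),
    "t": (0.8, 0.3),
}

def phoneme_candidates(features, top_n=2):
    """Return the top_n closest phonemes with scores that sum to at most 1."""
    # Closer templates get exponentially larger weights.
    weights = {
        phoneme: math.exp(-10.0 * math.dist(features, template))
        for phoneme, template in TEMPLATES.items()
    }
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(phoneme, weight / total) for phoneme, weight in ranked[:top_n]]

# Three made-up segments from someone saying "cat".
segments = [(0.88, 0.22), (0.15, 0.75), (0.80, 0.30)]
for segment in segments:
    print(phoneme_candidates(segment))
```

Note that the first and last segments each list both "k" and "t" as candidates, since the invented fingerprints for those two sounds are close together. That leftover uncertainty is what the later stages have to resolve.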

Next, the computer in the voice recognition device must search through its entire database of words (which may range from a mere handful in simpler applications to millions in advanced dictation programs) to find the word that was spoken.

Obviously, this is far from a trivial task: given the uncertainty in each phoneme, differences in dialect and speaking speed, and run-on words that may be hard to separate, hundreds of words may seem to fit the digitized input.
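
A toy version of this word search might look like the following. The candidate probabilities, the five-word LEXICON, and the function name rank_words are all made up for the example; the point is only that several words can fit the same uncertain input, each with a different score.

```python
# Each spoken segment maps candidate phonemes to made-up probabilities.
heard = [
    {"k": 0.6, "t": 0.3, "p": 0.1},
    {"a": 0.7, "o": 0.3},
    {"t": 0.5, "p": 0.4, "k": 0.1},
]

# A tiny hypothetical word database: word -> phoneme spelling.
LEXICON = {
    "cat": ["k", "a", "t"],
    "cap": ["k", "a", "p"],
    "cot": ["k", "o", "t"],
    "tap": ["t", "a", "p"],
    "dog": ["d", "o", "g"],
}

def rank_words(heard, lexicon):
    """Score every word by multiplying its per-segment probabilities."""
    scores = {}
    for word, phonemes in lexicon.items():
        if len(phonemes) != len(heard):
            continue  # this toy version only compares equal lengths
        score = 1.0
        for segment, phoneme in zip(heard, phonemes):
            score *= segment.get(phoneme, 0.0)
        if score > 0:
            scores[word] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for word, score in rank_words(heard, LEXICON):
    print(f"{word}: {score:.3f}")
```

Four of the five words survive as plausible matches, with "cat" narrowly ahead of "cap." Scale that up to a full vocabulary and it is easy to see how hundreds of words might fit.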

For a smaller voice recognition computer, which only needs to receive simple input such as numbers or specific directions, the process ends here. If the input is a positive match for something within its library, or at least a close approximation, the appropriate command is chosen and executed. Otherwise, the computer simply rejects the input and asks the user to reissue the command.
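
The accept-or-reject logic described above can be sketched with Python's standard difflib module. As a simplification, text strings stand in for sound patterns here, and the command list, the recognize function, and the 0.8 cutoff are assumptions picked for the example.

```python
import difflib

# A hypothetical command library for a small voice-dialing device.
COMMANDS = ["call home", "redial", "hang up"]

def recognize(heard, cutoff=0.8):
    """Return the closest known command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(heard, COMMANDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for attempt in ["call hone", "play music"]:
    command = recognize(attempt)
    if command is None:
        print(f"'{attempt}': not recognized, please repeat the command")
    else:
        print(f"'{attempt}': executing '{command}'")
```

The slightly garbled "call hone" is close enough to "call home" to be accepted, while "play music" matches nothing in the library and is rejected. Raising the cutoff makes the device stricter; lowering it makes it more forgiving but more error-prone.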

For larger dictation programs, though, the work has only just begun. The sheer ambiguity of language, which human speakers navigate as second nature, is hard to capture in hard-coded rules. These programs must maintain gigantic libraries of information on how words and phonemes relate to one another, including the statistical probability that any given word is preceded or followed by any other given word in the language.
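
One simple way to build such word-to-word statistics is to count adjacent word pairs in sample text. The sketch below assumes a three-sentence corpus and a function named train_bigrams, both invented for illustration; a real dictation program would draw on vastly more text and more sophisticated models.

```python
from collections import Counter, defaultdict

# A tiny made-up training corpus.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

def train_bigrams(sentences):
    """Estimate P(next word | previous word) from adjacent word pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for previous, following in zip(words, words[1:]):
            pair_counts[previous][following] += 1
    model = {}
    for previous, followers in pair_counts.items():
        total = sum(followers.values())
        model[previous] = {word: count / total for word, count in followers.items()}
    return model

model = train_bigrams(corpus)
print(model["the"])  # which words follow "the", and how often
print(model["sat"])  # "sat" is always followed by "on" in this corpus
```

In this corpus, "cat" follows "the" in two of six cases, so the model considers it twice as likely as "mat" or "dog" in that position.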

The computer must then individually evaluate every single one of these possibilities before arriving at a sentence that approximates what was spoken. The possibilities are staggering: given a short sentence of 10 words and perhaps 30 phonemes, hundreds of different sentences may present themselves. Given that today’s computers are as yet incapable of actually “understanding” what is being spoken, it’s quite a miracle that speech recognition programs work at all.

Homophones and near-homophones, words that sound the same or nearly the same, are yet another giant obstacle for voice recognition. For example, “there” and “their,” or “hair,” “air,” and “heir,” sound very similar but have quite different meanings. Once again, the statistical correlation of words is the only way a machine can hope to decode what is actually being said.
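
To see how word statistics can settle sound-alike words, the sketch below lists every sentence that fits a sequence of sounds and scores each one. The slots, the probability table, and the FLOOR value are all invented for the example. It enumerates every combination, as described above; practical systems typically use dynamic programming to avoid scoring each sentence separately.

```python
from itertools import product

# Each spoken slot lists the words that fit the sound.
slots = [["their", "there"], ["hair", "air", "heir"], ["is"], ["red", "read"]]

# Made-up probabilities P(next | previous); "<s>" marks the sentence start.
BIGRAMS = {
    ("<s>", "their"): 0.4, ("<s>", "there"): 0.6,
    ("their", "hair"): 0.3, ("their", "heir"): 0.1, ("their", "air"): 0.05,
    ("hair", "is"): 0.2, ("air", "is"): 0.2, ("heir", "is"): 0.2,
    ("is", "red"): 0.1, ("is", "read"): 0.02,
}
FLOOR = 0.001  # small probability for any pair not in the table

def sentence_score(words):
    """Multiply the probability of each word given the one before it."""
    score = 1.0
    previous = "<s>"
    for word in words:
        score *= BIGRAMS.get((previous, word), FLOOR)
        previous = word
    return score

candidates = [list(choice) for choice in product(*slots)]
best = max(candidates, key=sentence_score)
print(len(candidates), "candidate sentences")
print("best guess:", " ".join(best))
```

Even this four-word example produces a dozen candidate sentences. "There" is the more common way to start a sentence in this made-up table, yet "their" wins because "there hair" is so unlikely.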

One practical application of speech recognition that is rapidly gaining notice is automatic prescription dictation in medical institutions.

All jibes at doctors’ handwriting aside, automatic dictation could relieve many pharmacists of their fear of giving out mistaken prescriptions. Whether the lives of human beings can be entrusted to still-unreliable speech recognition programs remains a matter of debate.

One of the best-known companies in voice recognition technology, VoiceSignal, says it makes recognition programs for over 21 different languages; one can only hope it is just a matter of time before C-3PO’s fluency in over six million forms of communication becomes the standard.