Features - February 1999 - Talking point
Is voice recognition technology all talk or does it have a future? Philip Hunter investigates

Microsoft has clearly decided that the time is right to roll speech technology into its products, including future versions of NT beyond 5.0. That much is evident from the company’s greatly expanded activities in this field, in particular its $45 million investment in the speech technology all-rounder Lernout & Hauspie (L&H) in September 1997. The idea is to allow the Windows environment to be driven entirely by spoken commands, while relevant applications such as Word will also have speech recognition capabilities. Does all this mean that speech technology has suddenly come of age, and that there has been a breakthrough that makes it viable for mainstream business applications rather than just for hobbyists and a few niche markets? Well, not really. What is true, though, is that there is greater realism about what can be achieved in a field that has been characterised by steady and often disappointingly slow progress over the last 25 years.

Machines cannot achieve a 100% recognition rate all of the time, and neither can we


Increasing processing power has helped improve recognition accuracy, but as in some other fields such as numerical weather forecasting, the law of diminishing returns applies here. Even with infinite processing power it would not be possible to achieve a 100% recognition rate 100% of the time using current methods. But this is where the realism comes in. For no human can recognise all spoken words all of the time either, as anyone who has tried to hold a conversation at a rave party or listen to someone with a particularly thick regional accent knows. It all depends on the situation. Machines have been able to achieve 100% hit rates in a favourable environment for many years, i.e. when they are trained to just one speaker who then has to leave tangible gaps between words. The emphasis now is on reaching acceptable levels of accuracy for given applications, and there are already many satisfied users of speech recognition systems in a variety of fields.

At the third stroke…

Applications of speech technology can be split into four types. Firstly, there is speech synthesis, involving the conversion of text into spoken words, which can either be generated by a computer from component parts of speech, or selected from a pre-recorded database of words or sentences spoken by a human. The latter, which is well known to users of telephone directory enquiries, is not strictly speaking synthesis but is usually placed under the same heading. Synthesis is technically the simplest task, at least as far as accuracy is concerned, since getting the words right is no problem, although achieving a pleasing sound with computer-generated speech is harder. The other three application types, involving recognition rather than output of speech, generate much more interest. One is straightforward recognition of individual commands, or words from a menu: this is not too difficult to achieve as there is usually just a handful of words, and the system knows exactly what the options are. Another involves recognition not of the words but of the person speaking them, with obvious use as a biometric security technique replacing or reinforcing passwords, PINs and plastic. This again is not too hard to achieve, as the system only has to identify characteristics of the speaker’s voice and not the actual words.

Total recognition


Then we have the full works - total recognition of the spoken words, which may or may not include speaker identification. For many years it was only possible to achieve reasonable recognition rates when there were gaps between words, but more recently there has been steady and substantial progress in the recognition of continuous speech. There has also been progress in making systems independent of the speaker, which is necessary for any applications involving public access and is desirable for any business product. There is also a distinction between online and deferred recognition. With online recognition, the words appear on the screen as the user dictates them, with a delay of perhaps a second or so depending on the power of the system. With deferred recognition, the speech is recorded and digitised at the PC, or perhaps by a portable device held by the user, and then transmitted to a central server where the recognition process is performed.

Deferred recognition has the advantage that a single powerful server can be dedicated to the task, which requires considerable processing power, more than many PCs possess. Furthermore the process no longer has to take place in near real time, so that longer can be spent on it to improve accuracy. The other advantage is that by networking the application, the task of cleaning up the text afterwards, given that accuracy is never 100%, can be delegated.

Networked deferred speech recognition systems


For this reason, networked deferred speech recognition systems are being adopted for applications where letters or documents tended to be dictated and then transcribed by secretaries or clerks anyway. "In some markets, such as the legal profession, the preferred way of working is to send a document to a secretary, or dictate into a machine, and so a move to speech technology means no change in working patterns," says Chris Ford, head of software for northern Europe at Philips Dictation Systems. In such cases the speech recognition tends to be integrated with existing workflow management software, giving users the choice between voice or keyboard for entering text. And when the task of producing a report from the digitised words is deferred, the secretary then has the option of either using the text produced by the speech recognition system, or just listening to the recording as usual.

Ford claims that accuracy levels are such that the task of producing letters or reports will be quicker using the speech recognition engine. One Philips customer, the Christie Hospital NHS Trust, where speech technology is used by radiologists to dictate reports, has made significant productivity gains with, on average, around 93% of words being correctly recognised. Such figures disguise the fact that for a given system there are significant variations in word recognition rates between different users. On a system where the average is 93%, it is quite normal for people with good diction to achieve close to 100% recognition rates, while those with incoherent speech, speech defects, strong accents, or who sound ‘r’ as ‘w’ might struggle to exceed 50%. Perhaps speech recognition systems will encourage us all to talk proper!
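A little arithmetic shows what such percentages mean for whoever has to clean up the text afterwards. The sketch below is illustrative only: the 500-word report length, the 99% stand-in for "close to 100%" and the function name are assumptions, while the 93% and 50% rates are the figures quoted above.

```python
def expected_corrections(word_count, recognition_rate):
    """Expected number of misrecognised words in a dictated document."""
    return round(word_count * (1.0 - recognition_rate))


REPORT_WORDS = 500  # assumed report length, not a figure from the article

for label, rate in [("average user", 0.93),
                    ("good diction (assumed 99%)", 0.99),
                    ("strong accent", 0.50)]:
    errors = expected_corrections(REPORT_WORDS, rate)
    print(f"{label}: about {errors} of {REPORT_WORDS} words to correct")
```

On these assumptions the same system leaves one user with a handful of slips per report and another with half the document to retype, which is why an average figure on its own says little about whether a given speaker will find the technology usable.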

Is there a real market for speech recognition?


But, apart from the question of accuracy, there must also be some doubt as to whether speech recognition will catch on for applications where the keyboard is currently king. Ford admits that a wholesale switch to speech technology in such cases would lead to a fall in productivity. In fact for short documents such as email, using the keyboard is certainly faster than speech input. There is also the psychological factor that with keyboards, users have total control over their input and can immediately correct typos, while speech users could be frustrated by the frequent appearance of errors when words are not recognised correctly.

This suggests that the important new markets for speech recognition will be where no keyboard is used, or where the keyboard is too small for regular input of large amounts of text (as on some palmtop devices). Surveyors, for example, could dictate reports into portable machines and then clean them up and print them later. Dictation could either be via palmtops, or with devices known as digital voice tracers, which are dedicated purely to recording and which store the voice in digitised form.

The basic techniques of speech recognition are now well understood, and although some refinement is possible and more processing power would help, there is unlikely to be a major breakthrough. Neural network techniques have, however, accelerated progress somewhat. They have been applied to the process of training systems to extract the phonemes, which are the vowels and consonants comprising the basic parts of speech, from the input audio signal.

The problem that neural network training helps to solve is the fact that although we all use the same phonemes, we all pronounce them differently. Neural network technology allows the system to be trained as it is continually presented with samples of speech, ideally spoken by a variety of different people, and then told what the phonemes are. Then, by a process of feedback, the system learns to identify the defining features of a phoneme that are common to each person.
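The training loop described above can be sketched in a few lines. This is a toy, not any vendor's method: it uses a single-layer network, three vowel sounds, and two made-up acoustic features per sample (rough formant values in kHz), with each hypothetical speaker stretching those values by a different amount. All names and numbers in it are assumptions chosen for illustration.

```python
import math
import random

# Rough first and second formant values (kHz) for three vowel sounds.
# Illustrative numbers only, assumed for this sketch.
PROTOTYPES = {"ee": (0.27, 2.29), "ah": (0.73, 1.09), "oo": (0.30, 0.87)}
LABELS = sorted(PROTOTYPES)


def speak(phoneme, speaker_scale, rng):
    """One noisy sample of a phoneme as voiced by one hypothetical speaker."""
    f1, f2 = PROTOTYPES[phoneme]
    return (f1 * speaker_scale + rng.gauss(0, 0.03),
            f2 * speaker_scale + rng.gauss(0, 0.03))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(weights, features):
    x = (features[0], features[1], 1.0)  # trailing 1.0 is the bias input
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def train(samples, epochs=200, rate=0.2, seed=0):
    """Present each labelled sample in turn and nudge the weights by the error."""
    rng = random.Random(seed)
    weights = [[0.0, 0.0, 0.0] for _ in LABELS]
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, phoneme in samples:
            probs = predict_probs(weights, features)
            x = (features[0], features[1], 1.0)
            for k, label in enumerate(LABELS):
                # Feedback: difference between the told answer and the guess.
                error = (1.0 if label == phoneme else 0.0) - probs[k]
                for j in range(3):
                    weights[k][j] += rate * error * x[j]
    return weights


def classify(weights, features):
    probs = predict_probs(weights, features)
    return LABELS[probs.index(max(probs))]


def accuracy(weights, samples):
    hits = sum(classify(weights, f) == p for f, p in samples)
    return hits / len(samples)


rng = random.Random(1)
# Four training speakers whose voices stretch the formants differently.
training = [(speak(p, scale, rng), p)
            for scale in (0.85, 0.95, 1.05, 1.15)
            for p in LABELS for _ in range(10)]
# A fifth speaker the system has never heard.
unseen = [(speak(p, 1.0, rng), p) for p in LABELS for _ in range(20)]

weights = train(training)
print(f"accuracy on unseen speaker: {accuracy(weights, unseen):.0%}")
```

The point of testing on a speaker who was absent from training is that the network has to rely on what the phonemes have in common across voices rather than on the quirks of any one speaker. Real systems work from far richer acoustic features and many more phonemes, so this shows only the principle.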

It is well known that humans exploit a considerable amount of contextual information when identifying words and meanings, and this can help fill in gaps when, for example, some words are not heard properly. This is an active field of research from which future speech recognition systems could benefit and so become more widely used.

Main players

Speech technology is not particularly suitable for start-ups as it is a long-term game which requires heavy investment in R&D. Most players have been in the field many years now, and Microsoft gave itself a leg up by buying into one of them, L&H, which also specialises in machine translation. Microsoft is interested in this aspect too, primarily for translating Internet content from English into other languages.

L&H acquired its speech recognition technology from a company called Kurzweil, which was a rare start-up in the field during the 1980s. L&H unveiled a number of new speech recognition products at the recent Comdex ’98 show in Las Vegas, including the ‘Now You’re Talking Deluxe’ which incorporates the first fruit of the relationship with Microsoft. It has voice-activated features for Microsoft Office 95 and 97 applications as well as a voice-driven Web finder which exploits L&H’s natural language technology, allowing users to surf the Web and initiate searches by asking questions without having to memorise specific commands.

L&H has also developed a speech synthesis product called RealSpeak which goes some way to answering the criticism that such systems sound too awful for words. The product understands and reads texts presented to it, and does sound less robotic than previous efforts, although still rather unnatural and not as human-like as L&H claims in its blurb. Indeed most speech system vendors have been guilty of hyperbole at some point, desperate for some headline-seeking developments in a slow-moving field of technology. Claims of being the first with a system capable of speaker-independent recognition of continuous talk without gaps between words are about as common and erroneous as exclusive tags on tabloid news stories.

Aside from L&H, the three other major and long-established players in the field are IBM, Dragon Systems, and Philips, which all have commercially proven speech recognition systems. Dragon Systems has DragonDictate, which facilitates both dictation and hands-free control of the computer. This product’s vital statistics usefully illustrate the current state of the field. It is ‘speaker-adaptive’ in that its dictionary of prepared words stored in a base voice model is tuned to the idiosyncrasies of individual users via an initial enrolment process that takes about 45 minutes, during which time you talk to the system. Thereafter the system continues to adapt to your voice, and the company reckons it will take a further 6 to 12 hours to reach the full level of recognition accuracy, which is just over 90% of the words spoken on average. The product comes with a base dictionary of 180,000 words, but this is too great for online recognition with standard hardware. Therefore the actual working dictionary is a subset of this, comprising 10,000 to 60,000 words depending on the version purchased.

The product does not support continuous speech as it typically requires a pause of at least 0.1 seconds between words, although this depends on the power of the CPU. Speaking in this way, users can dictate between 35 and 55 words a minute while maintaining full accuracy, which is slower than a professional typist but faster than handwriting - and significantly ahead of pen-based computer technology. Given the average level of typos that most users of word processors make, the performance is acceptable for those who fancy that mode of working. But there is some way to go to eradicate or at least mitigate undesirable features such as the fact that accuracy can diminish significantly when the user has a cold.

Telecoms companies such as BT and AT&T are also well into the field, but their focus, as far as implementing systems is concerned, has been more on command-level recognition of single words for telephony applications. The loss of quality, and therefore of signal information, over the telephone makes it harder to recognise continuous speech accurately, and even for single words identification rates are often not much greater than 90%.