Machines cannot achieve a 100% recognition rate all of the time, neither can we
Total recognition
Networked deferred speech recognition systems
Is there a real market for speech recognition?
Main players
Microsoft has clearly decided that the time is right to
roll speech technology into its products, including future versions of NT beyond 5.0. That
much is clear from Microsoft's greatly expanded activities in this field, in
particular, its $45 million investment in the speech technology all-rounder Lernout &
Hauspie (L&H) in September 1997. The idea is to allow the Windows environment to be
driven entirely by spoken commands, while relevant applications such as Word will also
have speech recognition capabilities. Does all this mean that speech technology has
suddenly come of age and that there has been a breakthrough that makes it viable for
mainstream business applications rather than just hobbyists and a few niche markets? Well,
not really. What is true, though, is that there is greater realism about what can be
achieved in a field that has been characterised by steady and often disappointingly slow
progress over the last 25 years.
Machines cannot achieve a 100% recognition rate all of the time, neither can we
Increasing processing power has helped improve recognition accuracy, but as in some other
fields such as numerical weather forecasting, the law of diminishing returns applies here.
Even with infinite processing power it would not be possible to achieve a 100% recognition
rate 100% of the time using current methods. But this is where the realism comes in. For
no human can recognise all spoken words all of the time either, as anyone who has tried to
hold a conversation at a rave party or listen to someone with a particularly thick
regional accent knows. It all depends on the situation. Machines have been able to
achieve 100% hit rates in a favourable environment for many years, i.e. when they are
trained to just one speaker who then has to leave tangible gaps between words. The
emphasis now is on reaching acceptable levels of accuracy for given applications, and
there are already many satisfied users of speech recognition systems in a variety of
fields.
At the third stroke
Applications of speech technology can be split into four types. Firstly, there is speech
synthesis, involving the conversion of text into spoken words, which can either be
generated by a computer from component parts of speech, or selected from a pre-recorded
database of words or sentences spoken by a human. The latter, which is well known to users
of telephone directory enquiries, is not strictly speaking synthesis but is usually placed
under the same heading. Synthesis is technically the simplest task, at least as far as
accuracy is concerned, since getting the words right is no problem, although achieving a
pleasing sound with computer-generated speech is harder. The other three application types, involving
recognition rather than output of speech, generate much more interest. One is
straightforward recognition of individual commands, or words from a menu: this is not too
difficult to achieve as there is usually just a handful of words, and the system knows
exactly what the options are. The second application involves recognition not of the words
but of the person speaking them, with obvious use as a biometric security technique
replacing or reinforcing passwords, PINs and plastic. This again is not too hard to
achieve, as the system only has to identify characteristics of the speaker's voice
and not the actual words.
Total recognition
Then we have the full works - total recognition of the spoken words, which may or may not
include speaker identification. For many years it was only possible to achieve reasonable
recognition rates when there were gaps between words, but more recently there has been
steady and substantial progress in the recognition of continuous speech. There has also
been progress in making systems independent of the speaker, which is necessary for any
applications involving public access and is desirable for any business product. There is
also a distinction between online and deferred recognition. With online recognition, the
words appear on the screen as the user dictates them, with a delay of perhaps a second or
so depending on the power of the system. With deferred recognition, the speech is recorded
and digitised at the PC, or perhaps by a portable device held by the user, and then
transmitted to a central server where the recognition process is performed.
Deferred recognition has the advantage that a single powerful server can be dedicated to
the task, which requires considerable processing power, more than many PCs possess.
Furthermore the process no longer has to take place in near real time, so more time can
be spent on it to improve accuracy. The other advantage is that by networking the
application, the task of cleaning up the text afterwards, given that accuracy is never
100%, can be delegated.
Networked deferred speech recognition systems
For this reason, networked deferred speech recognition systems are being adopted for
applications where letters or documents tended to be dictated and then transcribed by
secretaries or clerks anyway. "In some markets, such as the legal profession, the
preferred way of working is to send a document to a secretary, or dictate into a machine,
and so a move to speech technology means no change in working patterns," says Chris
Ford, head of software for northern Europe at Philips Dictation Systems. In such cases the
speech recognition tends to be integrated with existing workflow management software,
giving users the choice between voice or keyboard for entering text. And when the task of
producing a report from the digitised words is deferred, the secretary then has the option
of either using the text produced by the speech recognition system, or just listening to
the recording as usual.
Ford claims that accuracy levels are such that the task of producing letters or reports
will be quicker using the speech recognition engine. One Philips customer, the
Christie Hospital NHS Trust, where speech technology is used by radiologists to dictate
reports, has made significant productivity gains with, on average, around 93% of words
being correctly recognised. Such figures disguise the fact that for a given system there
are significant variations in word recognition rates between different users. On a system
where the average is 93%, it is quite normal for people with good diction to achieve close
to 100% recognition rates, while those with incoherent speech, defects, strong accents, or
who sound "r" as "w" might struggle to exceed 50%. Perhaps speech
recognition systems will encourage us all to talk proper!
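How a figure such as 93% is arrived at is not stated here. A common approach, and the one assumed in the short Python sketch below, is to line up the system's output against a corrected transcript and count the words that were substituted, inserted or dropped. The function names and the sample sentence are invented for illustration and are not taken from any Philips product.

```python
def word_edit_distance(reference, hypothesis):
    """Fewest word substitutions, insertions and deletions that turn
    the reference transcript into the recognised text."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j recognised words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word dropped
                            curr[j - 1] + 1,      # word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Fraction of reference words recognised correctly (1.0 is perfect)."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_edit_distance(reference, hypothesis) / n


# A speaker who sounds "r" as "w" loses two of four words here.
print(word_accuracy("the red rabbit ran", "the wed wabbit ran"))  # 0.5
```

Averaging this figure over many speakers gives a single headline number, which is exactly why it can hide a wide spread between individual users.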
Is there a real market for speech recognition?
But, apart from the question of accuracy, there must also be some doubt as to whether
speech recognition will catch on for applications where the keyboard is currently king.
Ford admits that a wholesale switch to speech technology in such cases would lead to a
fall in productivity. In fact for short documents such as email, using the keyboard is
certainly faster than speech input. There is also the psychological factor that with
keyboards, users have total control over their input and can immediately correct typos,
while speech users could be frustrated by the frequent appearance of errors when words are
not recognised correctly.
This suggests that the important new markets for speech recognition will be where no
keyboard is used, or where the keyboard is too small for regular input of large amounts of
text (e.g. some palmtop devices). Surveyors, for example, could dictate reports into
portable machines and then clean them up and print them later. Dictation could either be
via palmtops, or with devices known as digital voice tracers, which are dedicated purely to
recording and store the voice in digitised form. The basic techniques are now well
understood and although some refinement is possible and more processing power would help,
there is unlikely to be a major breakthrough. Neural network techniques have, however,
accelerated progress somewhat. They have been applied to the process of training systems
to extract the phonemes, which are the vowels and consonants comprising the basic parts of
speech, from the input audio signal.
The problem that neural network training helps to solve is the fact that although we all
use the same phonemes, we all pronounce them differently. Neural network technology allows
the system to be trained as it is continually presented with samples of speech, ideally
spoken by a variety of different people, and then told what the phonemes are. Then, by a
process of feedback, the system learns to identify the defining features of a phoneme that
are common to each person.
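The feedback process described above can be illustrated with the smallest possible neural network: a single sigmoid unit trained by gradient descent. This is only a sketch under loud assumptions. The two-number "acoustic features", the three imaginary speakers and the two phoneme labels are all invented, and commercial systems use far larger networks and far richer input.

```python
import math

# Invented feature pairs for two phonemes as spoken by three imaginary
# speakers. Label 1 stands for one phoneme ("ah"), label 0 for another ("ee").
TRAINING = [
    ((0.80, 0.20), 1), ((0.70, 0.30), 1), ((0.90, 0.25), 1),
    ((0.20, 0.80), 0), ((0.30, 0.70), 0), ((0.25, 0.90), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=500, rate=0.5):
    """Adjust the weights a little on each pass, in proportion to the error
    between what the unit guessed and what it was told the phoneme was."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), target in samples:
            error = sigmoid(w[0] * x1 + w[1] * x2 + b) - target  # feedback
            grad_w[0] += error * x1
            grad_w[1] += error * x2
            grad_b += error
        w[0] -= rate * grad_w[0]
        w[1] -= rate * grad_w[1]
        b -= rate * grad_b
    return w, b


def classify(model, features):
    """Return 1 or 0 according to which phoneme the trained unit favours."""
    w, b = model
    return 1 if w[0] * features[0] + w[1] * features[1] + b > 0 else 0


model = train(TRAINING)
# A fourth, unseen "speaker" whose pronunciation differs slightly again.
print(classify(model, (0.75, 0.30)), classify(model, (0.30, 0.80)))
```

The point of training on several speakers is that the unit settles on what the samples of a phoneme have in common, rather than memorising one voice.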
It is well known that humans exploit a considerable amount of contextual information when
identifying words and meanings, and this can help fill in gaps when some words are not
heard properly, for example. This is an active field of research from which future speech
recognition systems could benefit and become more widely used.
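One simple way of using context, offered here purely as an illustration and not as a description of any vendor's method, is to weigh how a word sounded against how often it follows the previous word. The Python sketch below does this with word-pair counts. The counts, the acoustic scores and the function name are all invented.

```python
# Invented counts of how often one word followed another in some body of text.
BIGRAM_COUNTS = {
    ("weather", "forecast"): 50,
    ("petrol", "forecourt"): 30,
}


def pick_word(previous, candidates, bigram_counts=BIGRAM_COUNTS):
    """Choose between similar-sounding candidates using the preceding word.

    candidates maps each possible word to an acoustic score between 0 and 1.
    """
    total = sum(bigram_counts.get((previous, w), 0) for w in candidates)

    def score(word):
        seen = bigram_counts.get((previous, word), 0)
        # Add-one smoothing: an unseen pairing is unlikely, not impossible.
        context = (seen + 1) / (total + len(candidates))
        return candidates[word] * context

    return max(candidates, key=score)


# On sound alone the system slightly prefers "forecourt".
heard = {"forecast": 0.45, "forecourt": 0.55}
print(pick_word("weather", heard))  # forecast
print(pick_word("petrol", heard))   # forecourt
```

When the preceding word offers no guidance, both candidates get the same context weight and the choice falls back on the acoustic score alone.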
Main players
Speech technology is not particularly suitable for start-ups as it is a long-term game
which requires heavy investment in R&D. Most players have been in the field many years
now, and Microsoft gave itself a leg up by buying into one of them, L&H, which also
specialises in machine translation. Microsoft is interested in this aspect too, primarily
for translating Internet content from English into other languages.
L&H acquired its speech recognition technology from a company called Kurzweil, which was a rare start-up in
the field during the 1980s. The company unveiled a number of new speech recognition
products at the recent Comdex 98 show in Las Vegas, including Now
You're Talking Deluxe, which incorporates the first fruit of the relationship
with Microsoft. It has voice-activated features for Microsoft Office 95 and 97
applications as well as a voice-driven Web finder which exploits L&H's natural
language technology, allowing users to surf the Web and initiate searches by asking
questions without having to memorise specific commands.
L&H has also developed a speech synthesis product called RealSpeak which goes some way
to answering the criticism that such systems sound too awful for words. The product
understands and reads texts presented to it, and does sound less robotic than previous
efforts, although still rather unnatural and not as human-like as L&H claims in its
blurb. Indeed most speech system vendors have been guilty of hyperbole at some point,
desperate for some headline-seeking developments in a slow-moving field of technology.
Claims of being the first with a system capable of speaker-independent recognition of
continuous talk without gaps between words are about as common and erroneous as "exclusive"
tags on tabloid news stories.
Aside from L&H, the three other major and long-established players in the field are
IBM, Dragon Systems, and Philips, which all have commercially proven speech recognition
systems. Dragon Systems has DragonDictate, which facilitates both dictation and hands-free
control of the computer. This product's vital statistics usefully illustrate the
current state of the field: it is speaker-adaptive in that its dictionary of
prepared words stored in a base voice model is tuned to the different idiosyncrasies of
individual users via an initial enrolment process that takes about 45 minutes, during
which time you talk to the system. Thereafter the system continues to adapt to your voice,
and the company reckons it will take a further 6 to 12 hours to reach the full level of
recognition accuracy, which is just over 90% of the words spoken on average. The product
comes with a base dictionary of 180,000 words, but this is too great for online
recognition with standard hardware. Therefore the actual working dictionary is a subset of
this, comprising 10,000 to 60,000 words depending on the version purchased.
The product does not support continuous speech as it typically requires a pause of at
least 0.1 second between words, although this depends on the power of the CPU. Speaking in
this way, users can dictate between 35 and 55 words a minute while maintaining full
accuracy, which is slower than a professional typist but faster than handwriting - and
significantly ahead of pen-based computer technology. Given the average level of typos
that most users of word processors make, the performance is acceptable for those who fancy
that mode of working. But there is some way to go to eradicate or at least mitigate
undesirable features such as the fact that accuracy can diminish significantly when the
user has a cold. Telecoms companies such as BT and AT&T are also well into the field,
but their focus, as far as implementing systems is concerned, has been more on
command-level recognition of single words for telephony applications. The loss of quality
and therefore signal information over the telephone makes it harder to recognise
continuous speech accurately, and even for single words identification rates are often not
much greater than 90%.