Open Sesame

Date: Thursday, September 04, 2008

Most of us cannot forget the wide-eyed feeling we had as children when we heard the story of Ali Baba and the treasure cave that opened at a spoken command, “Open Sesame”. Speech recognition technologies open up vistas of exciting enterprise mobility applications. This article briefly examines how far the technology has come, its strengths and limitations, and what is possible within current constraints.

Ali Baba is confined to the realm of fairy tales, but a new breed of technologists has been working to make the spoken command work in seemingly magical ways. Today, voice recognition, speech recognition, and semantics are driving forces in several industries. They have opened up a treasure trove of exciting prospects. Shorter response times and round-the-clock availability are redefining how we work and how productive we are. Enterprise mobility has moved from being a social statement to a useful productivity technology. The day is not far off when ROI-driven applications built on voice, speech, and semantics will define the functional contours of enterprises.

The challenges of adaptation are daunting, but surmountable. We have made a successful transition from passive IVR applications driven by DTMF menu structures to dialogue-driven applications that are more interactive, have personality, and offer a near-natural experience.

Rather than pressing buttons or interacting with a computer screen, users speak to the computer. Because automatic speech recognition returns probabilities, not certainties, the uncertainty associated with users’ speech input can be daunting.

The most palpable Achilles’ heel is the potential for misrecognition. No matter how much effort and care is put into developing a piece of speech recognition software, there will always be times when the application misrecognizes user input. Because of this, speech applications need more thorough error handling than other applications. If the confidence score on a specific recognition is low, the system should confirm what the user said, and it may have to ask users to repeat themselves. Sometimes a given user will just not be understood, perhaps because he or she is in a noisy environment. If a speech engine returns low confidence values for the same user several times, it may be imperative to transfer that user to a human agent so the user can complete the transaction.

Speech recognition is also affected by the quality of the input. If a user is calling a system, a bad cell phone connection or overly compressed Internet audio may throw off recognition. Providing for these situations is critical when designing speech recognition applications.
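
The error-handling policy described above (accept, confirm, ask again, hand over to a human agent) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold values, the limit of three failed attempts, and the names Recognition and next_action are hypothetical and do not come from any particular speech engine, which would report confidence on its own scale.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical values; real thresholds must be tuned per engine and application.
CONFIRM_THRESHOLD = 0.80  # at or above this, accept the result as heard
REJECT_THRESHOLD = 0.45   # below this, treat the input as not understood
MAX_LOW_CONFIDENCE = 3    # consecutive failures before handing over to an agent


@dataclass
class Recognition:
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def next_action(result: Recognition, low_count: int) -> Tuple[str, int]:
    """Return the next dialogue action and the updated count of
    consecutive not-understood results for this caller."""
    if result.confidence >= CONFIRM_THRESHOLD:
        return "accept", 0
    if result.confidence >= REJECT_THRESHOLD:
        # Plausible but uncertain: read the result back and ask yes or no.
        return "confirm", low_count
    low_count += 1
    if low_count >= MAX_LOW_CONFIDENCE:
        return "transfer_to_agent", low_count
    return "reprompt", low_count


# A caller in a noisy environment who is misheard three times in a row.
actions = []
count = 0
for confidence in (0.30, 0.20, 0.25):
    action, count = next_action(Recognition("check balance", confidence), count)
    actions.append(action)
print(actions)  # ['reprompt', 'reprompt', 'transfer_to_agent']
```

Keeping the failure count per caller rather than per question is a design choice in this sketch: a user on a bad connection reaches a human agent after a few failed attempts no matter which prompt caused them.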

Despite these weaknesses, speech recognition remains the best way to handle many applications. Traditional DTMF (Touch-Tone) phone applications require users to navigate long and complicated menus and submenus. At any point a user is limited to a handful of possible choices and must remember the proper number to press.

Speech Recognition is not simply a new name for IVR. We have built systems that understand not only the King’s and Queen’s English as spoken by Westerners, but also several variations of Indian accents in English. That includes ‘Malayalee English’, ‘Telugu English’, and ‘Marathi English’, to name just a few of the exotic accents that the systems have now been trained to understand.

A speech-enabled system gives users much greater flexibility. Speech systems are built around asking users questions and allowing them to answer in a way that is natural and intuitive. Speech applications can also offer users more menu options at any given time, as they are not limited by the number of keys on a phone keypad, and they do not depend on the users’ memory of obscure numerical choices. Users can simply say what they want and get through their interactions much faster.

Speech also opens up new types of applications. Call routers become easier for users, since they don’t need to know how to spell a name in order to say it. It also becomes easier for users who are driving, or otherwise unable to look at a keypad, to interact with a system.

Users can provide open-ended input that would not be possible in standard DTMF systems: specifying the city and state for a phone number directory, picking a specific color or make of a car, choosing toppings on a pizza, dialing a number by saying a person’s name, and looking up addresses are all examples of responses that would not be easy in traditional IVR applications.

Speech applications are better able to convey a company’s unique brand, as users identify more with a computer system they talk to. By using quality voice talent that conveys specific emotions and personality, designers can build systems that connect with users in ways that other software never could.

Several speech-recognition-driven applications are now available in native Indian languages such as Hindi, Punjabi, Bengali, Oriya, Tamil, Telugu, Kannada, and Malayalam. Applications handling over a million calls from the least literate segments of pre-paid users have been running in pan-India rollouts for over a year now.

The challenges posed by natural language grammar have been largely met by advances in directed-dialogue applications, which anticipate the words the user might speak to the system. For example, a directed-dialogue application would ask the caller, “Would you like sales, support, or accounting?”, whereas a natural language application would simply ask, “How may I help you?”
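
The difference between the two styles can be illustrated with a small Python sketch that works on already-recognized text. It is a simplification under stated assumptions: the keyword lists and the function names route_directed and route_open are hypothetical, and production natural language systems typically rely on statistical models rather than keyword matching.

```python
import re

DIRECTED_PROMPT = "Would you like sales, support, or accounting?"
# The directed dialogue anticipates exactly these words.
DIRECTED_GRAMMAR = {"sales", "support", "accounting"}

OPEN_PROMPT = "How may I help you?"
# Hypothetical keyword lists standing in for a natural language model.
KEYWORDS = {
    "sales": {"buy", "purchase", "price", "sales"},
    "support": {"broken", "problem", "repair", "support"},
    "accounting": {"invoice", "bill", "payment", "accounting"},
}


def words(utterance):
    """Lower-case the utterance and split it into simple word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


def route_directed(utterance):
    """Return the department if the utterance contains exactly one
    distinct anticipated word; otherwise return None."""
    matches = {w for w in words(utterance) if w in DIRECTED_GRAMMAR}
    return matches.pop() if len(matches) == 1 else None


def route_open(utterance):
    """Score each department by keyword overlap and return the best one,
    or None if no keyword matches. Ties go to the first department listed."""
    tokens = set(words(utterance))
    scores = {dept: len(tokens & kws) for dept, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(route_directed("Support, please"))
print(route_open("There is a mistake on my bill payment"))
```

The directed version is easy to recognize reliably because only three words are expected, but it fails on anything outside its grammar. The open version accepts free phrasing at the cost of a much larger space of possible inputs.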

Other challenges, such as the need for redundancy, performance, scalability, and efficient upgrading, have been met by advances in distributed models for speech-based applications, which now have feasible architectures and offer truly effective load distribution for high-call-volume applications.

The advantage of reaching a geometrically growing number of stakeholders in any application lies in scaling. Being able to run applications on the move with mere spoken commands is no longer a figment of the imagination.

Voice-enabled applications handling customer queries, preliminary FAQs, remote server functional activations, and authorizations have been successfully field-tested, scaled, and run in large-scale implementations across the world. In India, where multiple languages pose a barrier, Indian-language speech overlays on well-researched applications have resulted in innovative products in employee safety, logistics management, compliance management, customer complaints handling, student admission processes, and even those aspects of recruitment that are process-driven.

With the exponential growth of the mobile population in India and the increasing ease of use of voice-based applications, enterprise mobility services based on voice, speech, and semantics may set the trend for IT services in the years ahead.

The author is Advisor, Lattice Bridge Infotech