Siri might be the best-known example of a voice-enabled UI, but it is certainly not the only one. Whether speech will eventually replace touch as the primary way that smartphone and tablet users interact with apps is debatable. But for now, one thing is clear: Developers should start considering whether and how to add speech control to their apps to stay competitive.
When looking to integrate voice controls into mobile apps, developers are faced with a growing and often bewildering array of implementation options. To help sort through the technology and design options, we recently spoke with Ben Lilienthal, co-founder and CEO of OneTok, a new company that aims to make it easier and cheaper for developers to speech-enable their apps.
How do you define "voice control" for mobile apps? It seems to have become a catchall term for everything from basic voice recognition (e.g., speech to text) to natural language understanding to voice biometrics.
We define speech as the ability to interact with an app [by voice] in a normal and natural way and have the app return the appropriate information to the user. This solves the bulk of consumers' frustrations with mobile devices, such as small form factors, limited data input, difficulty of typing with thumbs, etcetera.
In order to provide this level of engagement and accuracy, apps need to use a combination of voice recognition and natural language processing. There are sub-categories of speech -- like voice biometrics -- which can be very useful for highly sensitive data, such as financial information. Further, we believe that having information read back to the user (also called text to speech) is not as big a pain point as data input.
What should developers consider when deciding which parts of their app to speech-enable?
Speech recognition is very good when you can constrain the vocabulary that you are transcribing. Free-form speech recognition such as voicemail transcription only works about 85 percent of the time.
Our experience has been that trying to speech-enable the entire app at once is too big a problem to solve. Rather, it makes sense to voice-enable a couple of the killer use cases first. Find out what the primary methods or functions people use in your app are, and voice-enable those first. It's an iterative process, and the rest of the app can be speech-enabled or speech-controlled in a phase II or later development cycle.
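The accuracy benefit of a constrained vocabulary can be sketched with a toy matcher. The command phrases and actions below are hypothetical, invented for illustration; they are not part of any vendor SDK:

```python
# Minimal sketch: checking a transcription against a constrained
# command vocabulary before falling back to free-form handling.
# The commands imagine a food-ordering app's "killer use cases."
COMMANDS = {
    "find restaurants": "SEARCH",
    "show my orders": "ORDER_HISTORY",
    "call the restaurant": "CALL",
}

def match_command(transcript):
    """Return the action for an exact or prefix match, else None."""
    text = transcript.lower().strip()
    for phrase, action in COMMANDS.items():
        if text == phrase or text.startswith(phrase):
            return action
    return None  # unknown utterance: re-prompt or use free-form recognition
```

With only a handful of phrases in scope, the recognizer (and this matcher) has far fewer ways to be wrong than with open-ended dictation; for example, `match_command("find restaurants near me")` resolves to the `SEARCH` action.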
What are a developer's technological options for speech-enabling their app? For example, a lot of vendors and even mobile operators offer solutions. What should developers look for when comparing those solutions?
There are a number of technologies in the marketplace to speech-enable apps. The key metric to look at -- as in any development project -- is efficacy of the solution versus total cost of ownership. Many of the solutions in the market are not full solutions at all but only provide the developer with speech-to-text transcription. So the onus is on the developer to interpret the transcription.
As a simple example: "I'm hungry" most likely means "I'm looking for a place to eat," not "I am a country in the European Union." But somewhere along the way, this needs to be defined and interpreted. So a developer is faced with either implementing and maintaining their own natural language processing layer to do this -- an expensive, difficult and time-consuming effort -- or contracting out an expensive and time-consuming NRE consulting project from a vendor to implement, maintain and tune.
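The interpretation layer that raw speech-to-text leaves to the developer can be sketched as a rule-based intent mapper. The keyword rules below are illustrative assumptions; a production natural language processing layer is vastly more involved:

```python
# Sketch of the step between transcription and action: mapping a
# transcript to an intent. Each rule pairs a keyword set with an
# intent label; the rule with the most overlapping words wins.
INTENT_RULES = [
    ({"hungry", "eat", "restaurant"}, "FIND_FOOD"),
    ({"directions", "navigate", "route"}, "NAVIGATE"),
]

def interpret(transcript):
    """Return the best-matching intent label, or None if nothing matches."""
    words = set(transcript.lower().replace("'", " ").split())
    best_intent, best_score = None, 0
    for keywords, intent in INTENT_RULES:
        score = len(words & keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent
```

Here `interpret("I'm hungry")` maps to `FIND_FOOD` rather than anything to do with Hungary, which is exactly the disambiguation a bare transcription service does not provide.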
The OneTok solution is the first end-to-end, software-as-a-service (SaaS) platform in the market that encapsulates audio signal processing on the client, speech recognition and NLP to provide app developers with a simple, easy-to-use, subscription-based solution. The OneTok platform can be customized by the developer to correspond to the phrases and words that its users say.
A lot of speech-enabled apps rely on a server to perform tasks rather than doing them entirely on the phone. What should developers keep in mind to minimize the amount of traffic that speech features generate so the app doesn't use up too much of the customer's monthly data bucket?
There are many ways to compress the audio using different speech codecs while still maintaining high-quality samples. One of the things that we do at OneTok is make sure that the packets are in the right order when we get them on the server so that the transcriptions come out correctly.
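Server-side packet reordering of the kind described can be sketched in a few lines. Tagging each audio chunk with a sequence number is an assumed protocol detail here (much like RTP does), not a description of OneTok's actual wire format:

```python
# Sketch: reassembling streamed audio chunks that may arrive out of
# order. Each chunk is a (sequence_number, bytes) pair; sorting by
# sequence number restores the original audio before transcription.
def reassemble(chunks):
    """Return the audio bytes in their original capture order."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    return b"".join(data for _, data in ordered)
```

Feeding a transcriber audio whose packets are out of order would garble the result, so this ordering step happens before recognition.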
For a developer who has no experience with speech UIs and tools, what's the learning curve like?
It's pretty steep. There are a couple of tools built into Android which make it easy to do simple voice recognition and transcription. Once a developer wants to move beyond that into more of a natural language environment, it can become complicated and costly.
In your experience, what aspects of speech (e.g., cost to implement, usability) do app developers often overlook or underestimate?
All of it -- and I would include OneTok in that bucket. The computational linguistics discipline, which leads to the creation of the natural language process layer, is incredibly complex. Think about it: We're trying to model and interpret the English language.
Luckily, computing power is cheap enough now that we can get pretty close. Costs vary from free to millions of dollars, but with services like ours, the cost to benefit for long- and mid-tail developers is finally at equilibrium. I think defining the use cases where speech makes sense is probably the hardest part of the process. Once those have been defined, the implementation is the more straightforward part.