Archive for the ‘News’ Category

A More Optimistic Outlook on the Future of Speech

Wednesday, June 30th, 2010

The speech application industry got some critical press in recent months (here are some spirited responses, respectively).

All the more refreshing to come across this New York Times article presenting current work in speech and artificial intelligence. The article highlights broadly what kind of AI applications have moved into the mainstream (or have potential to do so). Speech and natural language understanding, the article claims, have gone furthest.

One thing that can be generalized from both criticisms above is that development of speech-enabled applications has stagnated in various ways [1]. The underlying technology – speech recognition (ASR) – has gone as far as it can. Application designers and developers haven’t adapted. Dictation has learned to understand doctors and lawyers better, but still struggles with conversational speech.

This point may have to be conceded. In terms of commercial applications, however, especially speech-enabled voice (IVR) systems, the root cause of stagnation is not necessarily a failure of AI but rather a maturing of standards and best practices. Fulfilling the expectation that voice applications, much like websites, behave according to certain rules is much to the advantage of the millions who interact with such systems every day.

What I take away from both the generalized criticism and the Times’ optimistic perspective is that, short of a revolution in underlying technologies (which hardly anyone expects), filling practical, everyday niches is where things can still move forward for speech and language processing. These niches have certainly not been fully uncovered.


[1] Roughly summarized, Robert Fortner: “development in speech technology has flat-lined since 2001”; David Suendermann: “(statistical) engineering methods are more efficient than traditional symbolic linguistic approaches to language processing.”

Roger Ebert TTS

Wednesday, March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has gotten his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECtalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark. The anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples of real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, new voice development for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preprocessing (studio recording, annotation, hand alignment, etc.). However, as the “grunt work” gets more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to enable building his custom voice. Another player in the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems more and more realistic is employing “voice morphing” or “voice transformation”. The idea here is to simply apply changes to an already existing, high-quality TTS voice. The following is a demonstration of how the latter can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like that of another using relatively little training data. The following are some examples:

[Audio samples: original speaker 1; target speaker 2; speaker 1 converted to speaker 2]

Similar technology was also showcased extensively during the 2009 Interspeech Conference. Perhaps this will one day enable those who have lost their voice, without hours (or days) of recordings of it at their disposal, to have their own custom voices to talk to their loved ones.
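
As a rough illustration of the "voice transformation" idea mentioned above (changing acoustic properties such as pitch and rate), here is a minimal sketch in Python. It is not how any of the products or research systems named in this post work; the function name `resample` and the test tone are made up for illustration. Naive resampling changes pitch and speaking rate together, whereas real systems treat them (and timbre) separately.

```python
import math


def resample(samples, factor):
    """Naive linear-interpolation resampling (illustrative only).

    factor > 1 shortens the signal; played back at the original sample
    rate it sounds faster AND higher-pitched. factor < 1 does the opposite.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                       # position in the input signal
        lo = int(pos)                          # sample just before pos
        hi = min(lo + 1, len(samples) - 1)     # sample just after (clamped)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# Hypothetical input: one second of a 200 Hz tone at 8 kHz, standing in
# for a recorded voice signal.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * t / RATE) for t in range(RATE)]

# 1.25x faster and higher when played back at the same sample rate.
faster = resample(tone, 1.25)
```

Changing pitch without changing rate (or the other way around) takes more machinery, which is part of why convincing voice transformation is harder than this sketch suggests.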

SpinVox, Voice-to-Text and Some Terminology

Monday, January 18th, 2010

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late adopter power user” – out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences – I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily I came across some useful reviews of the most prominent providers to get me up to speed. I won’t go into them, as I’m sure others have more to say about the actual user experience. However, as “mobile” is the way speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused something of a stir when launching their voice-to-text service in 2004 and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as it would to anyone reading the company’s patent or patent applications, in which the use of human operators is mentioned explicitly. However, regular users would probably have been duped into thinking a machine was doing all the typing. Failure to understand or communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term used to refer to the technology that automatically turns spoken words into machine-readable text. However, there are different dimensions along which to describe this technology, such as the models employed (HMM-based vs. connectionist) and who it’s for (a single speaker or all speakers of a dialect or language). Also, there is a host of applications that employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually voice recognition refers to software that works for only a single speaker. However, this is anecdotal, and in marketing the two are used synonymously.

Voice-to-Text – a service that converts spoken words into text. ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – this is a security technology typically used to perform one of two tasks: (1) identifying a speaker from a group of known speakers or (2) determining whether a speaker is really who s/he claims. These are very similar tasks that people often confuse.  Think of the first one as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology is used to turn written text into an audio signal (such as an MP3 file). This should be an obvious label, but surprisingly people seem to confuse it with Voice-to-Text services frequently (purely my own anecdote).
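
To make the two speaker recognition tasks from the list above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three-number "voiceprints", the enrolled names, the cosine similarity measure and the threshold value are all hypothetical, and real systems derive much richer features from audio.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical enrolled "voiceprints" (toy vectors, not real audio features).
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
    "carol": [0.4, 0.4, 0.9],
}


def identify(sample, enrolled=ENROLLED):
    """Task (1): pick the closest match out of a group of known speakers."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


def verify(sample, claimed, enrolled=ENROLLED, threshold=0.95):
    """Task (2): accept or reject a claimed identity.

    The threshold is an arbitrary illustrative value.
    """
    return cosine(sample, enrolled[claimed]) >= threshold


sample = [0.85, 0.15, 0.35]          # features from an unknown utterance
who = identify(sample)               # closest enrolled speaker
is_alice = verify(sample, "alice")   # does the sample match the claim?
is_bob = verify(sample, "bob")
```

Note that identification always returns someone from the known group, even for a complete stranger, whereas verification can say "no". That difference is a large part of why the two tasks should not be confused.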

I’m also told SpinVox’s sale price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…

SVOX purchases Siemens AG speech-related IP

Monday, January 26th, 2009

Following Nuance’s acquisition of IBM speech technology intellectual property two weeks ago, Zurich-based SVOX today announced the purchase of the Siemens AG speech recognition technology group. The deal is geared toward creating “obvious synergies of developing TTS, ASR and speech dialog solutions” and expands SVOX’s portfolio of technologies, which to date included only highly specialized speech synthesis solutions, to now include speech recognition.

Like the Nuance-IBM deal (and unlike the Microsoft acquisition of TellMe), this merger breaks with the obvious big-fish small-fish paradigm. Here, a larger company’s (IBM, Siemens) R&D division was sold to a smaller, more specialized company (Nuance, SVOX).

Both transactions come with an intent to pursue development of novel interactive voice applications. However, while Nuance announced the potential development of applications across platforms and environments with IBM expertise and IP, SVOX appears to stay on course with its successful line of automotive solutions to build “a commanding market share in speech solutions for premium cars”.

This deal adds SVOX to a list of companies offering network and embedded speech recognition technologies, also including Nuance, Telisma, Loquendo and Microsoft. Financial terms of the deal were not announced.

Nuance acquires IBM speech patents

Friday, January 16th, 2009

Nuance yesterday announced the acquisition of speech-related patents from IBM. The deal encompasses a “licensing and technical services agreement”, with IBM continuing to support existing customers. Integrated solutions of the two companies’ technologies are expected in two years’ time, according to the press release.

This deal represents a further step in market consolidation, which Nuance has pursued via a number of mergers and acquisitions over the past years. Friends in the industry tell me IBM has been trying to market its suite of IVR voice application server software more aggressively; however, speech research activity, once part of the company’s “pervasive computing” vision, has declined lately.

Perhaps the IBM vision will bear fruit at Nuance, as the announcement comes with a commitment “to proliferate advanced speech capabilities across a broad range of devices and environments”. One thing is sure: much like Nuance’s recent acquisition of Philips voice products, years after taking over Philips IVR products and solutions, this deal represents another closure, as Nuance has been marketing and supporting IBM’s ViaVoice product line for years. The de facto number of competitors in the speech and voice technology market is shrinking as applications become more mainstream.


Nuance buys Philips Speech Recognition Systems

Thursday, October 2nd, 2008

Nuance announced this week its acquisition of Philips Speech Recognition Systems. This represents another step in a series of acquisitions by the speech technology giant towards market and portfolio expansion. In 2002, ScanSoft Inc., which through further mergers and acquisitions became today’s Nuance, already acquired Philips’ network speech processing group, though not its dictation unit. With this week’s acquisition, the dictation unit will be incorporated into Nuance’s already strong dictation portfolio, expanding especially in European healthcare markets, the company announced. Highlights of the purchase include a larger customer base, broader language and solutions portfolios, more distribution channels and a great leap forward in international expansion.

OnMobile buys Telisma

Monday, May 19th, 2008

OnMobile Global Ltd today acquired France-based Telisma, a producer of speech recognition software for network/telephony environments.

The acquisition comes shortly after OnMobile partnered with Nuance, a Telisma competitor in speech recognition markets, to deploy voice search applications for its home market, India. India’s multilingual market has made it a tough one to crack for speech technology companies, though a lucrative one, as India has recently surpassed the U.S. as the second largest mobile market in the world, according to Om Malik at GigaOm.

I suspect issues specific to speech technology and India’s multilingualism have something to do with this deal. As I recently pointed out, internationalization of speech and language technologies comes at a steep entry cost, due to the high demands on expertise and data required for building language-specific models. In addition, speech recognition companies like Nuance have long kept their language models under wraps. In other words, if your language isn’t catered to, reaching that language’s customer base becomes a very pricey affair.

While open-source efforts to build freely available language models for speech recognition exist, Telisma has opted for a middle ground in this matter by allowing partners and customers to build their own models, but selling the tools to do so at a price. In a market like India, the ability to cater to a multilingual customer base without purchase of expensive proprietary software (or paying someone else to develop proprietary software for you to purchase) may have made a big difference in this deal.

On a different note, this acquisition is the latest in a series of acquisitions consolidating the speech technology market. While five years ago telephony speech technology was a highly redundant market of small companies building similar products, today they have largely been acquired by or merged with bigger players. In the meantime, companies like Microsoft, IBM, Siemens and Google are making their own moves to enter the market.

Telisma’s acoustic modelling toolkit is indeed not for sale, but free, as one reader has pointed out. Thanks!

News Redux & Building VoiceGlue

Tuesday, December 4th, 2007

I stumbled across some “traditional” news bits this week for speech and language technologies, representing most of the major and a few interesting minor market players. Yahoo is offering some kind of NLP-driven structured search for e-commerce solutions starting next year. A new bundled automatic translation software with automatic learning capabilities was announced by across Systems GmbH and Language Weaver. Loquendo is sponsoring a speech-for-in-car-navigation industry event. Persay, maker of voice authentication software, is shipping solutions securing Planet Payment’s voice-enabled payment processing. Lastly, Nuance, continuing its acquisition spree, is buying Viecore, a contact-center integration consulting company, indicating a clear focus on strengthening its traditional speech and telephony market position.

Recently I stumbled across and blogged about VoiceGlue, an integration of various GPL-licensed pieces of software, providing full IVR capabilities (including rudimentary speech synthesis but not recognition). Well, last night, together with Christoph, I finally took a stab at it myself.

Our test setup involved running Fedora 9 virtualized in Mac OS X. Our Fedora installation was missing a few pieces of software beyond the indicated prerequisites, but after about an hour everything was under way.

The trickiest bit proved to be building various modules required for the XML parser (needed later, I presume, for the VoiceGlue-customized DTMF grammar parser). For some reason CPAN’s console kept conking out on us (claiming inexplicably missing or unbuildable prereqs), so after wrestling with that for some time, we decided to manually build all the modules ourselves (hoorah, makefiles).

This worked like a charm, though we hit a snag with the Module::Build Perl module, which required C_support, which in turn required another Perl module (ExtUtils::CBuilder) not mentioned in any documentation (scant across the board, though that’s half the fun, isn’t it).

After that, the VoiceGlue installation completed swiftly and all services started running after a minimal bit of configuration.

Next week we’ll be back with some test calls and our first impressions. In the meantime we’ll keep our eyes peeled for ASR integration (LumenVox/Sphinx), which will make this a truly valuable stab at open-sourcing some of the most expensive carrier-grade technology out there.

Google on the Move, News Redux

Wednesday, July 25th, 2007

Very quiet recently. No big acquisitions, no speech-tech revolution.

Most interesting: Google announced Mike Cohen (formerly of Nuance) will appear as keynote speaker at SpeechTek in August to reveal Google’s speech technology strategy. Google has already moved into the speech application market with GOOG411, an automatic directory assistance application leveraging business search and Google Maps.

UBC researchers announce a speech learning system that doesn’t use a traditional data-driven model to learn the sounds of a language. Instead it is said to represent more experience-driven learning, much like that of infants. So far, the system has acquired English and Japanese vowels.

Some product reviews/announcements: a quick history of desktop dictation, uses of TextAloud for the iPhone, and Nuance’s new South African voice “Tessa”.

Also on the web: NIST evaluates DARPA automatic translation software in military contexts, and What Semantic Search is Not.

I may post less frequently in coming weeks. Stay tuned.

This week: Bunnies, Trojans and the Jetsons

Wednesday, July 11th, 2007

There was no shortage of novel uses for speech technology this week. Avaya and Jersey City’s Liberty Science Center announced speech-enabled exhibits, allowing customers to access information and services in the museum using their voice (and, of course, mobile devices).

Gizmo freaks should love (and everyone else should hate) this bunny, displaying speech recognition and synthesis, while also providing some unified communication capacities.

Also novel, though on a sadder note: speech is finally on the malware radar for good, as TTS trojans popped up using Microsoft’s built-in text-to-speech engine to annoy users by commenting on their own malicious behavior. Call it the salt-in-wound virus. This news comes about half a year after an MS Vista speech recognition security flaw was revealed, whereby the recognizer enables remote execution of content on a computer running speech recognition.

Traditional speech applications made some headlines this week as well: Nuance signs a deal with Damovo to roll out speech apps in Ireland, forecasting €1.5m in profits over the next year. TuVox announces hosted on-demand speech apps for VoIP access.

Lastly, here is an interesting article about the Jetsons and why speech technology hasn’t caught on as much as we have all hoped.