Posts Tagged ‘Nuance’

SpinVox, Voice-to-Text and Some Terminology

Monday, January 18th, 2010

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late adopter power user”, out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences, I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily, I came across some useful reviews of the most prominent providers to get me up to snuff. I won’t go into them, as I’m sure others have more to say about the actual user experience. However, as “mobile” is the way speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused something of a stir when launching its voice-to-text service in 2004, and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as it would to anyone reading the company’s patent or patent applications, in which the use of human operators is mentioned explicitly. However, regular users would probably have been duped into thinking a machine was doing all the typing. Failure to understand or communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term used to refer to the technology that automatically turns spoken words into machine-readable text. However, there are different dimensions along which to describe this technology, such as the models employed (HMM-based vs connectionist) and who it’s for (one single speaker or all speakers of a dialect or language). Also, there is a host of applications that employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually voice recognition refers to software that works for only a single speaker. However, this is anecdotal, and in marketing the two are used synonymously.

Voice-to-Text – a service that converts spoken words into text. ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – this is a security technology typically used to perform one of two tasks: (1) identifying a speaker from a group of known speakers or (2) determining whether a speaker is really who s/he claims. These are very similar tasks that people often confuse.  Think of the first one as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology is used to turn written text into an audio signal (such as an MP3). This should be an obvious label, but surprisingly people seem to confuse it with Voice-to-Text services frequently (purely my own anecdote).
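
To make the difference between the two speaker recognition tasks above concrete, here is a minimal sketch. It assumes each speaker is already represented by a fixed-length “voiceprint” vector; the vectors, the names and the 0.9 threshold are all made up for illustration, and real systems derive such representations from audio with far more elaborate models.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two voiceprint vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(sample, enrolled):
    """Task (1): pick the closest speaker out of a group of known speakers."""
    return max(enrolled, key=lambda name: cosine_similarity(sample, enrolled[name]))

def verify(sample, claimed_name, enrolled, threshold=0.9):
    """Task (2): accept or reject a claimed identity.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return cosine_similarity(sample, enrolled[claimed_name]) >= threshold

# Hypothetical enrolled voiceprints and an incoming sample.
ENROLLED = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
}
sample = [0.85, 0.15, 0.25]

print(identify(sample, ENROLLED))         # closest known speaker
print(verify(sample, "alice", ENROLLED))  # does the sample match the claim?
print(verify(sample, "bob", ENROLLED))
```

Note that identification as sketched here always returns someone from the known group, even for a stranger, whereas verification can simply say no. That difference is a large part of why the two tasks should not be confused.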

I’m also told SpinVox’s sale price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…

SVOX purchases Siemens AG speech-related IP

Monday, January 26th, 2009
Following Nuance’s acquisition of IBM speech technology intellectual property two weeks ago, Zurich-based SVOX today announced the purchase of the Siemens AG speech recognition technology group. The deal is geared toward creating “obvious synergies of developing TTS, ASR and speech dialog solutions” and extends SVOX’s portfolio of technologies, which to date included only highly specialized speech synthesis solutions, to now include speech recognition.
Like the Nuance-IBM deal (and unlike the Microsoft acquisition of TellMe), this merger breaks with the obvious big-fish small-fish paradigm. Here, a larger company’s (IBM, Siemens) R&D division was sold to a smaller, more specialized company (Nuance and SVOX, respectively).
Both transactions come with an intent to pursue development of novel interactive voice applications. However, while Nuance announced the potential development of applications across platforms and environments with IBM expertise and IP, SVOX appears to stay on course with its successful line of automotive solutions to build “a commanding market share in speech solutions for premium cars”.

This deal adds SVOX to a list of companies offering network and embedded speech recognition technologies, also including Nuance, Telisma, Loquendo and Microsoft. Financial terms of the deal were not announced.

Nuance acquires IBM speech patents

Friday, January 16th, 2009

Nuance yesterday announced the acquisition of speech-related patents from IBM. The deal encompasses a “licensing and technical services agreement”, with IBM continuing to support existing customers. Integrated solutions of the two companies’ technologies are expected in two years’ time, according to the press release.

This deal represents a further step in market consolidation, which Nuance has pursued via a number of mergers and acquisitions over the past few years. Friends in the industry tell me IBM has been trying to market its suite of IVR voice application server software more aggressively. However, speech research activity, once part of the company’s “pervasive computing” vision, has declined lately.

Perhaps the IBM vision will bear fruit at Nuance, as the announcement comes with a commitment “to proliferate advanced speech capabilities across a broad range of devices and environments”. One thing is sure: much like Nuance’s recent acquisition of Philips voice products, years after taking over Philips IVR products and solutions, this deal represents another closure, as Nuance has been marketing and supporting IBM’s ViaVoice product line for years. The de facto number of competitors in the speech and voice technology market is shrinking as applications become more mainstream.


Nuance buys Philips Speech Recognition Systems

Thursday, October 2nd, 2008

Nuance announced this week its acquisition of Philips Speech Recognition Systems. This represents another step in a series of acquisitions by the speech technology giant towards market and portfolio expansion. In 2002, ScanSoft Inc., which through further mergers and acquisitions became today’s Nuance, already acquired Philips’ network speech processing group, though not its dictation unit. With this week’s acquisition, the dictation unit will be incorporated into Nuance’s already strong dictation portfolio, expanding especially in European healthcare markets, the company announced. Highlights of the purchase include a larger customer base, broader language and solutions portfolios, more distribution channels and a great leap forward in international expansion.

Internationalization and Speech Technologies

Monday, May 5th, 2008

The not-so-subtle truth is, of course, that we all speak English. Yet localization and internationalization are at once a prerequisite and a stumbling block for many web-based endeavors.

In my own backyard, two examples illustrate the effect of and the need for internationalization, respectively. German professional social network XING has internationally outperformed competitors like LinkedIn through early and aggressive internationalization. StudiVZ, the “German Facebook”, had gained much of the student social network market before Facebook decided to release a German version of its web app, making this a tough-to-crack market.

Ironically, as these two examples underline, the need for localization remains even in cases where the demands on usability are low (join group/contact person/send message) and the target audience can largely be expected to speak sufficient English (read this for an interesting take on the same issues and solutions in online gaming). Moreover, localization is an effort far greater than providing an interface in the local language.

As one expects, localization, internationalization and speech technology are inextricably linked – in a sense, developing speech technologies is internationalization. And using such technology in professional service projects is akin to building an internationalized web application. Here are some of the oddities I’ve observed while working with speech technologies in an international environment:

Translation is not enough. When you write software that speaks or wants to be spoken to, there is more at stake than providing interface text. Can you expect all your users to spell input when your system doesn’t understand the raw speech input? Can you be sure that all your translated content will generate well-formed speech-synthesis output? Language and culture are sensitive issues, so a well-localized speech application must do more than provide a translated user interface. Employing local staff is usually the minimum for building a speech application for a new market.

The cost shifts. Re-usability of resources from previous speech projects is usually low. So unlike localizing a web application, porting a speech application requires grunt work that you thought you had done the first time around. Moreover, speech applications in new languages almost always come with additional licensing burdens and questions about the appropriate technology partner. Expect to pay for things you didn’t expect.

There is no long tail. The buy-in costs for developing a new language in almost any speech or language technology (recognition, synthesis, translation) remain constant. This makes every newly developed language a strategic decision and translates into a two-tier localization effort: one developing basic technologies, one employing such technology in professional service projects.
As an example, take the world’s most successful dictation software packages: Dragon NaturallySpeaking ships in five flavors of English and six European languages. Philips SpeechMagic ships in 23 dialects of 11 languages. Both are a far cry from world coverage.
The enormous cost of development has a decided effect on developing speech technology for lesser-spoken languages. It has also posed a significant hurdle for open-source speech technology initiatives that aim to provide such resources for free.

GOOG: We need more data

Thursday, January 3rd, 2008

The old maxim “I need more data” should be familiar to anyone who has ever tried to wrestle with language technology issues, attempted speech application tuning or delved into any statistical approach to an AI-related problem. Google moved into the speech world last year with GOOG-411, a speech recognition driven directory assistance application (you say what you are looking for and where, it returns suitable businesses and connects you to the one you want or sends you details in an SMS).
Like all (well, most) other Google services, GOOG-411 is free for the end-user. As such, the basic business model (collect data, turn data into cash) applies. This was recently confirmed in an interview by Marissa Mayer, Google’s VP of Search Products and User Experience:


“Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model … that we can use for all kinds of different things, including video search.”

Google thus couples statistical AI and its general data-driven approach to everything in a novel way. In doing so, Google may find itself in a catch-up race with the likes of Nuance, Loquendo, IBM or Telisma, whose stronghold in speech recognition technology comes, in part, from having aggregated speech and language databases through data collection during professional services projects.
What’s new in Google’s approach, however, is the convergence of the dual role that data plays in AI and in the overall service-driven business model. Google will presumably not be content to bootstrap a pattern matching engine to sell licenses like the technology companies above. More interesting to follow will be the range of services Google can spin using this technology (context-sensitive video advertising, audio indexing, IVR hosting), which are more befitting of its overall company strategy.
Unsurprisingly, Mayer goes on to claim that Google isn’t working on ways out of the world of brute-force data-driven algorithms:

“People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. … A lot of people will turn to things like the semantic Web as a possible answer to that. But what we’re seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they’re done through brute force.”

User privacy advocates may also have a thought or two on this new dimension of data collection, as Google is beginning to lose the “conventionally trustworthy” image it held amongst many over the past years. Fortunately, the ways in which speech data is commonly used to train pattern matching models involve very little in the way of privacy infringement.
Happy data collecting!

News Redux & Building VoiceGlue

Tuesday, December 4th, 2007

I stumbled across some “traditional” news bits this week for speech and language technologies, representing most of the major and a few interesting minor market players. Yahoo is offering some kind of NLP-driven structured search for e-commerce solutions starting next year. A new bundled automatic translation software with automatic learning capabilities was announced by across Systems GmbH and Language Weaver. Loquendo is sponsoring a speech-for-in-car-navigation industry event. Persay, maker of voice authentication software, is shipping solutions securing Planet Payment’s voice-enabled payment processing. Lastly, Nuance, continuing its acquisition spree, buys Viecore, a contact-center integration consulting company, indicating a clear focus on strengthening its traditional speech and telephony market position.

Recently I stumbled across and blogged about VoiceGlue, an integration of various GPL-licensed pieces of software, providing full IVR capabilities (including rudimentary speech synthesis but not recognition). Well, last night, together with Christoph, I finally had a stab at it myself.
Our test setup involved running Fedora 9 virtualized in Mac OS X. Our Fedora installation was missing a few pieces of software beyond the indicated prerequisites, but after about an hour everything was under way.
The trickiest bit proved to be building various modules required for the XML parser (needed later, I presume, for VoiceGlue’s customized DTMF grammar parser). For some reason CPAN’s console kept conking out on us (claiming inexplicably missing/unbuildable prereqs), so after wrestling with that for some time, we decided to manually build all the modules ourselves (hoorah, makefiles).
This worked like a charm, though we hit a snag with the Module::Build Perl module, which required C_support, which in turn required another Perl module (ExtUtils-CBuilder), not mentioned in any documentation (scant across the board, though that’s half the fun, isn’t it).
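
For anyone facing the same manual build, the ordering problem described above can be worked out up front. The following is a rough sketch, not what we actually ran: the dependency map only mirrors the chain mentioned in this post, "xml-parser-modules" is a placeholder name rather than a real distribution, and the command lists are just the conventional build steps for each packaging style.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each distribution maps to the distributions that must be built first.
# Assumption: this mirrors only the chain described in the post.
DEPENDS_ON = {
    "Module-Build": {"ExtUtils-CBuilder"},
    "xml-parser-modules": {"Module-Build"},  # placeholder, not a real name
}

def build_order(depends_on):
    """Return the distributions in an order where prerequisites come first."""
    return list(TopologicalSorter(depends_on).static_order())

def build_commands(uses_build_pl=False):
    """Conventional manual build steps, run inside the unpacked distribution."""
    if uses_build_pl:
        return ["perl Build.PL", "./Build", "./Build test", "./Build install"]
    return ["perl Makefile.PL", "make", "make test", "make install"]

for dist in build_order(DEPENDS_ON):
    print(dist)
```

The sketch only prints the order and does not run anything. Which of the two command styles applies depends on whether a distribution ships a Makefile.PL or a Build.PL.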
After that, the VoiceGlue installation completed swiftly and all services started running after a minimal bit of configuration.
Next week we’ll be back with some test calls and our first impressions. In the meantime we’ll keep our eyes peeled for ASR integration (LumenVox/Sphinx), which will make this a truly valuable stab at open sourcing some of the most expensive carrier-grade technology out there.

Assistive and Accessibility Technology

Wednesday, November 21st, 2007

Diligent readers may have noticed that dominant news bits concerning speech and language technologies seem to focus on the cost- or time-saving aspects of it. This is understandable, as the big players (Google, Microsoft, Nuance, IBM) have made it their mandate to capture lucrative markets (call center automation, directory assistance). Application of natural language technologies elsewhere, e.g. where it’s fun (in games) or necessary (providing accessibility for visually impaired users), seems to lag.
Not so this week. This week seems to shine under the assistive/accessibility technology star. Note SourceForge project “Speak as Daisy” – a Microsoft Word plugin that enables creation of XML files with markup for speech synthesis or electronic braille generation. The plugin is said to be available in 2008.
Mac users with a need for improved document read-back in British English will rejoice over the improved Infovox iVox voices.
Philips and Elsevier are developing a speech-enabled diagnostic system for radiologists.
Behold Nattiq’s USB Hal Pen, which allows blind users to use the company’s accessibility features on any computer with a USB port without installation.
Of course there’s some overlap with time-, cost- and money-saving technologies as well. The FBI has announced widespread use of Nuance Dragon NaturallySpeaking dictation for report and interview transcription.
Lastly, here’s an apropos rant about call center automation and frustrated end-users, a target group for speech and language technologies all too often neglected. Perhaps there’s a lesson to be learned about usability by the “money savers” employing speech technology, taken from those that rely on speech recognition and synthesis for their daily needs. I don’t know, but F-word spotting as a means of prioritizing frustrated callers seems like an acknowledgement of defeat.

Back in the saddle with MSFT, GOOG and VoiceGlue

Tuesday, November 13th, 2007

Back after an extensive break. Been working hard on some of my own multi-modal ideas. Keep your eyes peeled.
Looks like it’s been a quiet fall, speech and language technology-wise. After GOOG-411, Microsoft has also added speech to their search engine endeavors (if in a different domain) by speech-enabling Live Search for mobile users. Nuance continues to consolidate the speech tech market.
Exciting news on the IVR front. Finally a serious attempt to integrate various open-source technologies to provide free carrier-grade speech/telephone services is under way. VoiceGlue has managed to combine OpenVXI (VXML browser) and Flite (speech synthesis) on Asterisk and is planning to integrate Sphinx2 for speech recognition. All components would then be available under some form of the GPL. Could this herald a change in availability of speech telephone platforms for developers unwilling to dish out horrendous per-port costs? Something to follow, anyway.
Lastly, here’s an article describing the growing role of speech in warehouse management.

Google on the Move, News Redux

Wednesday, July 25th, 2007

Very quiet recently. No big acquisitions, no speech-tech revolution.

Most interesting: Google announced that Mike Cohen (formerly of Nuance) will appear as keynote speaker at SpeechTEK in August to reveal Google’s speech technology strategy. Google has already moved into the speech application market with GOOG-411, an automatic directory assistance application leveraging business search and Google Maps.
UBC researchers announce a speech learning system that doesn’t use a traditional data-driven model to learn the sounds of a language. Instead it is said to represent more experience-driven learning, much like that of infants. So far, the system has acquired English and Japanese vowels.
Some product reviews/announcements: a quick history of desktop dictation, uses of TextAloud for the iPhone, and Nuance’s new South African voice “Tessa”.
Also on the web: NIST evaluates DARPA automatic translation software in military contexts, and What Semantic Search is Not.

I may post less frequently in coming weeks. Stay tuned.