Archive for the ‘Brands’ Category

Roger Ebert TTS

Wednesday, March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has been given his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECTalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark. The anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples exhibiting real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, new voice development for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preparation (studio recording, annotation, hand alignment, etc.). However, as the “grunt work” gets more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to enable building his custom voice. Another player on the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems to be more and more realistic is employing “voice morphing” and “voice transformation”. The idea here is simply to apply changes to an already existing, high-quality voice. The latter can be demonstrated by changing purely acoustic properties (timbre, pitch, rate) of a voice signal.
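For the curious, here is a minimal sketch of this kind of purely acoustic manipulation, assuming the Python libraries librosa and soundfile and a placeholder input file (my own illustration, not a production voice transformation system). Timbre changes would require more involved spectral-envelope processing and are left out:

# Shift the pitch and change the speaking rate of a recorded voice.
import librosa
import soundfile as sf

y, sr = librosa.load("input.wav", sr=None)  # "input.wav" is a placeholder path

# Raise the pitch by two semitones without changing the rate.
y_higher = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)

# Speak roughly 20% faster without changing the pitch.
y_faster = librosa.effects.time_stretch(y, rate=1.2)

sf.write("output_higher.wav", y_higher, sr)
sf.write("output_faster.wav", y_faster, sr)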

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like those of another speaker using relatively little training data. The following are some examples:

Audio examples: the original Speaker 1, the target Speaker 2, and Speaker 1 converted to sound like Speaker 2.

Similar technology was also showcased extensively during the 2009 Interspeech Conference. Perhaps this will one day enable those who have lost their voice, and who do not have hours (or days) of recordings at their disposal, to have custom voices of their own with which to talk to their loved ones.

SpinVox, Voice-to-Text and Some Terminology

Monday, January 18th, 2010

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late adopter power user”, out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences, I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily I came across some useful reviews of the most prominent providers to get me up to snuff. I won’t go into them, as I’m sure others have more to say about the actual user experience. However, as “mobile” is the way speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused somewhat of a stir when launching their voice-to-text service in 2004 and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as well as to anyone reading the company’s patent or patent applications, in which the use of human operators is mentioned explicitly. However regular users would probably have been duped into thinking a machine was doing all the typing.  Failure to understand/communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term used to refer to the technology that automatically turns spoken words into machine-readable text. However, there are different dimensions along which to describe this technology, such as the models employed (HMM-based vs. connectionist) and who it’s for (a single speaker or all speakers of a dialect or language). Also, there is a host of applications that employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually voice recognition refers to software that works for only a single speaker. However, this distinction is anecdotal, and in marketing the two are used synonymously.

Voice-to-Text – a service that converts spoken words into text. ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – this is a security technology typically used to perform one of two tasks: (1) identifying a speaker from a group of known speakers or (2) determining whether a speaker is really who s/he claims. These are very similar tasks that people often confuse.  Think of the first one as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology is used to turn written text into an audio signal (such as an MP3). This should be an obvious label, but surprisingly people seem to confuse it with voice-to-text services frequently (purely anecdotal on my part).
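To make the speaker recognition distinction above a little more concrete, here is a toy sketch in Python (entirely my own illustration, not tied to any particular product). It treats made-up, fixed-length “voiceprint” vectors as given; real systems derive such vectors from the audio itself:

import numpy as np

def cosine(a, b):
    # Similarity between two voiceprints; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical enrolled voiceprints for three known speakers.
enrolled = {
    "alice": np.array([0.9, 0.1, 0.3]),
    "bob":   np.array([0.2, 0.8, 0.5]),
    "carol": np.array([0.4, 0.4, 0.9]),
}

test = np.array([0.85, 0.15, 0.35])  # voiceprint extracted from an incoming call

# (1) Identification: which of the known speakers is this most likely to be?
identified = max(enrolled, key=lambda name: cosine(enrolled[name], test))

# (2) Verification: is this really Alice, as the caller claims?
THRESHOLD = 0.95  # an arbitrary decision threshold
is_alice = cosine(enrolled["alice"], test) >= THRESHOLD

print(identified, is_alice)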

I’m also told SpinVox’s sales price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…

Quick Voice Prompts with Google Translate TTS Service

Tuesday, January 12th, 2010

Google last month released several new features for their translation service, among them a text-to-speech rendition of the English translation. As reported elsewhere, it turns out you can directly access this service using a simple URL in your browser. Following this link will return an MP3 of the text sent along with it:

http://translate.google.com/translate_tts?q=Hello+reader

Just replace “Hello+reader” in your address bar with any text that you want spoken. Remember to replace spaces with pluses (+).
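If you would rather fetch the MP3 from a script than from the address bar, a sketch along the following lines should work in Python. The endpoint and the “q” parameter are exactly those from the link above; everything else, including the browser-like User-Agent header, is my own guesswork for this unofficial service:

import urllib.parse
import urllib.request

text = "Hello reader"
# quote_plus turns spaces into pluses, just like in the example URL.
url = "http://translate.google.com/translate_tts?q=" + urllib.parse.quote_plus(text)

# Some servers refuse requests without a browser-like User-Agent; this header
# is an assumption, not something documented for this unofficial endpoint.
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(request) as response, open("hello.mp3", "wb") as out:
    out.write(response.read())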

Some browsers, however, seem to have problems with the returned audio. Chrome worked for me, and Internet Explorer is reportedly working as well.

As this is not an official RESTful Google API don’t be surprised if it stops working. Beware that commercial reuse of the output audio is likely also governed by license restrictions.

Update:
Friend Schamai pointed out how this could be employed in a web form. Here’s an example in HTML:

<form action="http://translate.google.com/translate_tts">
<input name="q" size="55" value="just saying" />
</form>

Speech and Dialog Conferences / Speech for iPhone and Android

Saturday, July 11th, 2009

Conference time: I will be spending a couple of days in London and Brighton from September 5th attending Interspeech and SIGDIAL, as well as a researcher round-table. Anyone interested in meeting up, feel free to get in touch.

Also, here are some more or less recent, interesting news for Android (at about 6:20, thanks Schamai) and iPhone speech developers.

Tim O’Reilly: Google Voice Search Key Technology

Thursday, April 2nd, 2009

ReadWriteWeb reports that Tim O’Reilly addressed attendees at the San Francisco Web 2.0 Expo this week, talking about key technologies for the Web >2.0. Voice search (the Google iPhone app), he claimed, was a tipping point in terms of “sensor-based interfaces”.

While not the only vendor to provide voice search (e.g. Yahoo oneSearch powered by Vlingo), Google certainly seems ahead of the game in what appears to be the gradual unfolding of a broad voice strategy, such as Voice Search and the recent rebranding of a feature-enhanced GrandCentral as Google Voice. Future work we can expect on the voice front includes promoting its own speech recognition capabilities through Android, Google Gears bringing speech capabilities to all browsers, tighter integration of Gaudi (audio indexing) with other services, and perhaps one day opening up voice services via APIs.

As I’ve previously pointed out, to Google voice is just another form of data, but what’s slowly beginning to emerge is a central role for speech and voice technologies to play in coming developments for the web and how we search and interface with it.

Kindle Speech Synthesis

Thursday, February 26th, 2009

News about speech and language technology tends to be an in-industry affair, interesting largely to those who need and use it on a daily basis or those who produce (develop or market) it. Every so often, however, mainstream news surfaces that raises issues of broad interest. Google’s efforts with speech recognition are an example of this. Last month, Amazon’s Kindle 2 e-book reader created a buzz with its text-to-speech “audio book” functionality.

The underlying issue is that Amazon is selling e-books, which can be listened to using speech synthesis, without owning the rights to produce audio book versions. The Authors Guild argues that this undermines the lucrative audio book market. While it is debatable whether a synthesized voice is comparable to the experience of listening to a well-produced audio book, Amazon decided not to fight this one out.

What do you think? Can synthesized audio books provide an experience comparable to real voice productions?

Microsoft Recite Preview – Note Dictation and Voice Search

Monday, February 16th, 2009

Ars Technica reports today on the release of the Microsoft Recite “Technology Preview” for Windows Mobile. The application lets users record short notes as audio snippets, which can later be searched for content by speaking key words. Apparently it does not entail speech recognition but rather simpler pattern matching, meaning notes cannot be searched in text form, but the approach may work more robustly and eliminates the effort of training for speaker independence.
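As a rough idea of how such audio-to-audio matching can work without any speech recognition, one could compare MFCC features of a spoken query against each stored note using dynamic time warping and rank the notes by alignment cost. This is a generic sketch of the technique in Python (librosa assumed, hypothetical file names), not Microsoft’s actual method:

import librosa

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def match_cost(query_path, note_path):
    q = mfcc_features(query_path)
    n = mfcc_features(note_path)
    # Cumulative DTW alignment cost; for a short query against a longer note,
    # subsequence DTW (subseq=True) would be the more appropriate choice.
    cost, _ = librosa.sequence.dtw(X=q, Y=n)
    return cost[-1, -1] / (q.shape[1] + n.shape[1])  # crude length normalization

# Rank stored notes by how well they match the spoken query.
notes = ["note_groceries.wav", "note_meeting.wav", "note_flight.wav"]
ranked = sorted(notes, key=lambda note: match_cost("query_meeting.wav", note))
print(ranked[0])  # best-matching note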

While not a full product yet, this sounds like a nifty little application for cognitive off-loading.

Have you tried Microsoft Recite?



More speech on the iPhone

Sunday, February 8th, 2009

The iPhone has proved a game-changer in many regards, and speech is no exception. Both Google and Yahoo (with Vlingo) have deployed mobile speech applications for the iPhone.
Today I came across another sighting of iPhone speech recognition: Vocalia by Creaceed, which employs the open-source ASR engine Julius as its back-end technology. There is no “push to talk” button, but there is a “shake to retry” gesture, which may prove useful when recognition goes awry. The app supports French, English and German for now and costs €2.99. Dictation is not available at this point, though Julius is certainly capable of it from an architecture point of view.

Other speech- and language-related iPhone apps:

Has anyone used these extensively? What is your experience with speech on the iPhone?

Zumba Lumba – iPhone killer or simply a hoax?

Monday, February 2nd, 2009

A no-frills phone with the unlikely name of Zumba Lumba has recently received some attention from the BBC. The phone is said to be top secret, developed by a defense-aviation company. It does without frills like a camera or an applications platform, but touts some interesting security and computational features (not only) related to speech technology:

  • Cloud computing – the phone uses no local storage for contacts or data.
  • Network speech recognition – user input is recognized over the internet. This should avoid hardware-intensive local computation for voice input, but requires internet access.
  • Voice identification – enhanced security, because the phone will only respond to a single user’s voice.

Some seem to think this is a potential iPhone killer, at least in terms of making use of innovative input modalities (though Google already released a speech recognition app for the iPhone). Others simply think it’s a hoax.

Either way, the idea of joining mobile with cloud computing is interesting. Using voice identification for security has its appeal as well, even if it’s unclear whether keeping data in the cloud and sending voice data over the internet is any more secure than simply keeping data on your phone, locally.

SVOX purchases Siemens AG speech-related IP

Monday, January 26th, 2009
Following Nuance’s acquisition of IBM speech technology intellectual property two weeks ago, Zurich-based SVOX today announced the purchase of the Siemens AG speech recognition technology group. The deal is aimed at creating “obvious synergies of developing TTS, ASR and speech dialog solutions” and expands SVOX’s portfolio of technologies, which to date comprised only highly specialized speech synthesis solutions, to now include speech recognition.
Like the Nuance-IBM deal (and unlike the Microsoft acquisition of TellMe), this merger breaks with the obvious big-fish small-fish paradigm. Here, a larger company’s (IBM, Siemens) R&D division was sold to a smaller, more specialized company (SVOX, Nuance).
Both transactions come with an intent to pursue development of novel interactive voice applications. However, while Nuance announced the potential development of applications across platforms and environments with IBM expertise and IP, SVOX appears to stay on course with its successful line of automotive solutions to build “a commanding market share in speech solutions for premium cars”.

This deal adds SVOX to a list of companies offering network and embedded speech recognition technologies, also including Nuance, Telisma, Loquendo and Microsoft. Financial terms of the deal were not announced.