Posts Tagged ‘TTS’

Roger Ebert TTS

Wednesday, March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has been given his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECTalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark. The anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples of real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, new voice development for general purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preprocessing (studio recording, annotation, hand alignment, etc.). However, as the “grunt work” gets more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to enable building his custom voice. Another player on the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems to be more and more realistic is employing “voice-morphing” and “voice transformation”. The idea here is to simply apply changes to an already existing, high-quality TTS voice. The following is a demonstration of how the latter can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:
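To give a rough feel for what changing an acoustic property means at the signal level, here is a minimal Python sketch of the crudest possible transformation: resampling a list of audio samples by linear interpolation. This is an illustration only, not the method used in any of the systems mentioned here; the function name `change_rate` is made up for this example. Note that naive resampling changes rate and pitch together, whereas real voice transformation systems control them independently.

```python
def change_rate(samples, factor):
    """Return `samples` resampled so playback is `factor` times faster.

    Uses plain linear interpolation between neighbouring samples.
    Pitch rises (factor > 1) or falls (factor < 1) along with the rate.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                     # fractional position in the input
        lo = int(pos)                        # sample just before that position
        hi = min(lo + 1, len(samples) - 1)   # sample just after, clamped at the end
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out
```

Calling `change_rate(samples, 2)` keeps roughly every second sample, so the result plays twice as fast and an octave higher; `change_rate(samples, 0.5)` does the opposite.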

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like that of another using relatively little training data. The following are some examples:

Original Speaker 1:

Target Speaker 2:

Converted Speaker 1 to Speaker 2:

Similar technology was also showcased extensively during the 2009 Interspeech Conference. Perhaps this will one day enable those who have lost their voice, and who do not have hours (or days) of recordings of it at their disposal, to have their own custom voices to talk to their loved ones.

Quick Voice Prompts with Google Translate TTS Service

Tuesday, January 12th, 2010

Google last month released several new features for its translation service, among them a text-to-speech rendition of the English translation. As reported elsewhere, it turns out you can directly access this service using a simple URL in your browser. Following this link will return an MP3 of the text sent along with it:

Just replace “Hello+reader” in your address bar with any text that you want spoken. Remember to replace spaces with pluses (+).
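If you would rather build such a link in code than by hand, the plus-for-space encoding is what Python's standard `urlencode` produces. The following is a minimal sketch only: the base URL is a made-up placeholder (the actual link is not reproduced here), and the parameter name `q` is an assumption borrowed from the form field further down.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real service URL.
BASE_URL = "https://tts.example.invalid/speak"


def build_tts_url(text, base_url=BASE_URL, param="q"):
    """Return a URL carrying `text` as a query parameter.

    urlencode() turns spaces into '+' and percent-escapes characters
    such as '&' that would otherwise break the query string.
    The parameter name 'q' is an assumption, not a documented API.
    """
    return base_url + "?" + urlencode({param: text})


print(build_tts_url("Hello reader"))
```

Letting the library do the escaping also takes care of punctuation and non-ASCII characters, which are easy to get wrong when typing the link manually.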

Some browsers, however, seem to have problems with the returned audio. Chrome worked for me, and Internet Explorer reportedly works as well.

As this is not an official RESTful Google API, don’t be surprised if it stops working. Beware that commercial reuse of the output audio is likely also governed by license restrictions.

Friend Schamai pointed out how this could be employed in a web form. Here’s an example:

Or the corresponding HTML:

<form action="">
<input name="q" size="55" value="just saying" />
<input type="submit" value="Speak" />
</form>

Speaking Piano

Thursday, December 31st, 2009

I greatly enjoyed this video about a piano-cum-speech-synthesis installation. I also think that this would make a great GarageBand plugin.

Speech and Dialog Conferences / Speech for iPhone and Android

Saturday, July 11th, 2009

Conference time: I will be spending a couple of days in London and Brighton from September 5th attending Interspeech, SIGDIAL as well as a researcher round-table. Anyone interested in meeting up, feel free to get in touch.

Also, here is some more or less recent, interesting news for Android (at about 6:20, thanks Schamai) and iPhone speech developers.

Kindle Speech Synthesis

Thursday, February 26th, 2009

News about speech and language technology tends to be an in-industry affair, interesting largely to those who need and use it on a daily basis or those who produce (develop or market) it. Every so often, however, mainstream news surfaces that raises issues of broad interest. Google’s efforts with speech recognition are an example of this. Last month, Amazon’s Kindle 2 e-book reader created a buzz with its text-to-speech “audio book” functionality.

The underlying issue is that Amazon is selling e-books, which can be listened to using speech synthesis, without owning the rights to produce audio book versions. The Authors Guild argues that this undermines the lucrative audio book market. While it is debatable whether a synthesized voice is comparable to the experience of listening to a well-produced audio book, Amazon decided not to fight this one out.

What do you think? Can synthesized audio books provide an experience comparable to real voice productions?

More speech on the iPhone

Sunday, February 8th, 2009

The iPhone has proved a game-changer in many regards and speech is no exception. Both Google and Yahoo (with vlingo) have deployed mobile speech applications for the iPhone.
Today I came across another sighting of iPhone speech recognition, Vocalia by Creaceed, employing open-source ASR engine Julius for back-end technology. There is no “push to talk” button but a “shake to retry”, which may prove useful when recognition goes awry. The app supports French, English and German for now and costs €2.99. Dictation is not available at this point, though Julius is certainly capable of it from an architecture point of view.

Other speech and language related iPhone apps:

Has anyone used these extensively? What is your experience with speech on the iPhone?

SVOX purchases Siemens AG speech-related IP

Monday, January 26th, 2009
Following Nuance’s acquisition of IBM speech technology intellectual property two weeks ago, Zurich-based SVOX today announced the purchase of the Siemens AG speech recognition technology group. The deal aims at creating “obvious synergies of developing TTS, ASR and speech dialog solutions” and extends SVOX’s portfolio of technologies, which to date included only highly specialized speech synthesis solutions, to include speech recognition.
Like the Nuance-IBM deal (and unlike the Microsoft acquisition of TellMe), this merger breaks with the obvious big-fish-eats-small-fish paradigm. Here, a larger company’s (IBM, Siemens) R&D division was sold to a smaller, more specialized company (Nuance, SVOX).
Both transactions come with an intent to pursue development of novel interactive voice applications. However, while Nuance announced the potential development of applications across platforms and environments with IBM expertise and IP, SVOX appears to stay on course with its successful line of automotive solutions to build “a commanding market share in speech solutions for premium cars”.

This deal adds SVOX to a list of companies offering network and embedded speech recognition technologies, also including Nuance, Telisma, Loquendo and Microsoft. Financial terms of the deal were not announced.

IBM Predicts Talking Web

Friday, November 28th, 2008

IBM’s annual crystal ball list of Innovations That Will Change Our Lives in the Next Five Years includes a forecast of a voice-enabled talking web. “You will be able to sort through the Web verbally to find what you are looking for and have the information read back to you,” the article predicts.
IBM itself has launched several voice-enabled products and initiatives over the years, most notably the WebSphere Voice family of web servers, which adds various voice functionality to its flagship WebSphere platform, leveraging it in areas such as unified messaging and call-center automation.
Some problems exist with a vision such as the one advocated by the article. Speech recognition accuracy and noise filtering have obviously come a long way and may pose only a minor impediment.
The user’s desire to speak rather than type or click is another problem. Issuing voice commands in the presence of others may not always be desirable and can be disruptive, for instance at work or on public transport. Lastly, there are usability concerns, beyond the quality of speech technology, when converting a visual 2- or even 3-dimensional representation of information into a 1-dimensional audio stream. The cognitive load increases significantly with tasks more complex than, for instance, obtaining time-table information or finding the nearest Italian restaurant.
The effort that stands behind the vision, to put voice technology to uses beyond call-center automation, is laudable. Mobile internet access and computing on the road may indeed do their part to make this vision come true. And clearly, there are use cases, such as improved accessibility for users with impairments, that in their own right merit making the web voice-accessible. Widespread usage of a voice-enabled web, however, may be more than five years off.

Internationalization and Speech Technologies

Monday, May 5th, 2008

The not-so-subtle truth is, of course, that we all speak English. Yet localization and internationalization are at once prerequisite and stumbling block for many web-based endeavors.

In my own backyard, two examples illustrate the effect of and the need for internationalization, respectively. German professional social network XING has internationally outperformed competitors like LinkedIn through early and aggressive internationalization. StudiVZ, the “German Facebook”, gained much of the student social network market before Facebook decided to release a German version of its web app, making this a tough-to-crack market.

Ironically, as these two examples underline, the need for localization remains in cases where the demands on usability are low (join group/contact person/send message) and the target audience can largely be expected to speak sufficient English (read this for an interesting take on the same issues and solutions in online gaming.) Moreover, localization is an effort far greater than providing an interface in the local language.

As one expects, localization, internationalization and speech technology are inextricably linked; in a sense, developing speech technologies is internationalization. And using such technology in professional service projects is akin to building an internationalized web application. Here are some of the oddities I’ve observed while working with speech technologies in an international environment:

Translation is not enough. When you write software that speaks or wants to be spoken to, there is more at stake than providing interface text. Can you expect all your users to spell input when your system doesn’t understand the raw speech input? Can you be sure that all your translated content will generate well-formed speech-synthesis output? Language and culture are sensitive issues, so a well-localized speech application must do more than provide a translated user interface. Employing local staff is usually a minimum requirement for building a speech application for a new market.

The cost shifts. Re-usability of resources from previous speech projects is usually low. So unlike localizing a web application, porting a speech application requires grunt work that you thought you had done the first time around. Moreover, speech applications in new languages almost always come with additional licensing burdens and questions about the appropriate technology partner. Expect to pay for things you didn’t expect.

There is no long tail. The buy-in costs for developing a new language in almost any speech or language technology (recognition, synthesis, translation) remain constant. This makes every newly developed language a strategic decision and translates into a two-tier localization effort: one developing basic technologies, one employing such technology in professional service projects.
Take, as an example, the world’s most successful dictation software packages: Dragon NaturallySpeaking ships in five flavors of English and six European languages. Philips SpeechMagic ships in 23 dialects of 11 languages. Both are a far cry from world coverage.
The enormous cost of development has a decided effect on developing speech technology for lesser-spoken languages. And it has posed a significant hurdle as well for open-source initiatives of speech technologies to provide such resources for free.

The Times Reports & Is SciFi Really Wrong?

Sunday, January 27th, 2008

The New York Times today published an interesting, if brief, article about speech recognition in the mobile/telco space, cited as a “$1.6 billion market in 2007”. The article provides a brief overview of a range of applications and mashups that use voice, such as SimulScribe as well as some directory assistance services (but omitting some others such as SpinVox and GOOG411).
The article opens:

“Innovation usually needs time to steep. Time to turn the idea into something tangible, time to get it to market, time for people to decide they accept it. Speech recognition technology has steeped for a long time”

And concludes:

“Even a digital expert [...] cautions that some people may never be satisfied with the quality of speech recognition technology — thanks to a steady diet of fictional books, movies and television shows featuring machines that understand everything a person says, no matter how sharp the diction or how loud the ambient noise.”

But isn’t this a bit hackneyed? Perhaps by today’s standards a twenty-year steeping period seems long, but this is hardly the case anywhere else in history. And after re-watching 1982’s Blade Runner recently, I actually felt rather optimistic that we are today close to the movie’s expectations of speech recognition and speaker verification for 2019. Elsewhere, a similar picture emerges.
The Star Trek ship computer’s speech recognition engine (the year is 2151), while accurate, still requires the push of a button to kick in, rather than listening for the hot word “computer”, a capacity available, if not quite ripe for deployment, today.
Of course, there are the HALs (2001), Marvins (no date), C-3POs (Long long time ago…), whose capacities far exceed what we dare dream our mobile phones can one day understand. But here the problem seems to be less about the quality of speech technology (the quality of HAL’s speech synthesis is available today, and Marvin’s characteristic monotone baritone should be easy to do) than about the old hard-soft divide in Artificial Intelligence. As long as we use a hard-AI problem, which speech arguably is, to solve soft-AI problems (“find closest pizza service”), we cannot fail to be disappointed.