Archive for the ‘Research’ Category

A More Optimistic Outlook on the Future of Speech

Wednesday, June 30th, 2010

The speech application industry has received some critical press in recent months (here are some spirited responses, respectively).

All the more refreshing to come across this New York Times article presenting current work in speech and artificial intelligence. The article highlights broadly what kind of AI applications have moved into the mainstream (or have potential to do so). Speech and natural language understanding, the article claims, have gone furthest.

One thing that can be generalized from both criticisms above is that development of speech-enabled applications has stagnated in various ways [1]. The underlying technology – speech recognition (ASR) – has gone as far as it can. Application designers and developers haven’t adopted. Dictation has learned to understand doctors and lawyers better, but still struggles with conversational speech.

This point may have to be conceded. In terms of commercial applications, however, especially speech-enabled voice (IVR) systems, the root cause of stagnation is not necessarily a failure of AI but rather a maturing of standards and best practices. Fulfilling expectations that voice applications, much like websites, behave according to certain rules is much to the advantage of the millions who interact with such systems every day.

What I take away from the generalized criticism, as well as from the Times’ optimistic perspective, is that, short of a revolution in underlying technologies (which hardly anyone expects), filling practical, everyday niches is where things can still move forward for speech and language processing. These niches have certainly not been fully uncovered.

Thoughts?


1 Roughly summarized, Robert Fortner: “development in speech technology has flat-lined since 2001”; David Suendermann: “(statistical) engineering methods are more efficient than traditional symbolic linguistic approaches to language processing.”

Roger Ebert TTS

Wednesday, March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has been given his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECTalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark. The anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples of real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, new voice development for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preprocessing (studio recording, annotation, hand alignment, etc.). However, as the “grunt work” becomes more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to enable building his custom voice. Another player in the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems to be more and more realistic is employing “voice-morphing” and “voice transformation”. The idea here is to simply apply changes to an already existing, high-quality TTS voice. The following is a demonstration of how the latter can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like that of another using relatively little training data. The following are some examples:

Original Speaker 1:

Target Speaker 2:

Converted Speaker 1 to Speaker 2:

Similar technology was also showcased extensively during the 2009 Interspeech Conference. Perhaps this will one day enable those who have lost their voice, and who do not have hours (or days) of recordings of it at their disposal, to have their own custom voices to talk to their loved ones.
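To make the "purely acoustic" kind of voice transformation a little more concrete, here is a minimal sketch of the crudest possible version: resampling a signal so that it plays back faster and higher (or slower and lower). This is not how CereProc or the Cambridge project work; the function names and the synthetic test tone are assumptions made up for illustration, and real systems change pitch, rate and timbre independently with far more sophisticated signal processing.

```python
import math


def sine_wave(freq_hz, duration_s, sample_rate=16000):
    """Generate a synthetic test tone (a stand-in for a recorded voice signal)."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def resample(samples, factor):
    """Resample by linear interpolation.

    Played back at the original sample rate, factor > 1 gives a shorter,
    higher-pitched signal and factor < 1 a longer, lower-pitched one.
    Rate and pitch change together here, which is the main limitation
    of this naive approach.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    out_len = int(len(samples) / factor)
    out = []
    for j in range(out_len):
        pos = j * factor                      # fractional position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1 - frac) + nxt * frac)
    return out


tone = sine_wave(200, 1.0)       # one second of a 200 Hz tone
faster = resample(tone, 1.25)    # 25% faster, and correspondingly higher
```

Decoupling pitch from rate, or changing timbre, requires working on a spectral or source-filter representation of the signal, which is beyond a sketch like this.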

Incremental Dialogue Management

Wednesday, December 30th, 2009

Dilbert.com
For the past year I’ve been involved in research on incremental processing in spoken dialogue systems at Potsdam University. Our project looks at how information in dialogues can be reduced to basic units, which get passed between modules (such as a speech recognizer and a semantic engine), based on a general abstract model of how this can be done. Thus far, we’ve been mainly concerned with issues originating close to the input speech signal (ASR, semantics, reference resolution, n-best lists, prosody, etc.). As these issues are mostly laid out, 2010 will be dedicated to research on larger dialogue issues (interaction and dialogue management, incremental output generation).
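A toy sketch may help picture what "basic units passed between modules" can mean in practice. It is not our project's implementation: the class and function names, the tiny lexicon and the example utterance are all made up for illustration. The idea shown is only that a recognizer's partial hypotheses can be turned into "add" and "revoke" operations on word units, so that a downstream semantic module can update its interpretation while the user is still speaking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A minimal unit of information (hypothetical structure)."""
    kind: str      # e.g. "word" or "concept"
    payload: str


class KeywordSemantics:
    """Toy downstream module: maps each incoming word to a concept, if any."""

    def __init__(self, lexicon):
        self.lexicon = lexicon
        self.history = []   # stack of (word unit, concept unit or None)

    def add(self, word):
        concept = self.lexicon.get(word.payload)
        self.history.append((word, Unit("concept", concept) if concept else None))

    def revoke(self, word):
        # Revocations arrive in reverse order, so the word is on top of the stack.
        last_word, _ = self.history.pop()
        assert last_word == word, "revoked unit does not match the latest unit"

    def concepts(self):
        return [c.payload for _, c in self.history if c is not None]


def diff_hypotheses(old, new):
    """Return (revoked, added) words that turn hypothesis `old` into `new`."""
    common = 0
    while common < min(len(old), len(new)) and old[common] == new[common]:
        common += 1
    return old[common:], new[common:]


def run(partial_hypotheses, module):
    """Feed successive partial ASR hypotheses to a module as unit edits."""
    previous = []
    for hypothesis in partial_hypotheses:
        revoked, added = diff_hypotheses(previous, hypothesis)
        for word in reversed(revoked):
            module.revoke(Unit("word", word))
        for word in added:
            module.add(Unit("word", word))
        previous = hypothesis
    return module.concepts()


# Made-up example: the recognizer first hears "red", then corrects it to "green".
lexicon = {"take": "ACTION:take", "red": "COLOUR:red",
           "green": "COLOUR:green", "piece": "OBJECT:piece"}
partials = [["take"], ["take", "the"], ["take", "the", "red"],
            ["take", "the", "green"], ["take", "the", "green", "piece"]]
result = run(partials, KeywordSemantics(lexicon))
```

The point of the add/revoke bookkeeping is that the semantic module never has to wait for the end of the utterance, yet can still take back an interpretation when the recognizer changes its mind.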

As in the Dilbert dialogue snippet, some issues that will naturally arise are (1) how different types of questions can be handled by an incremental dialogue system (breaking with the established Question-Answer-Question-A-Q… paradigm in favour of something more dynamic) and (2) what turn-taking means in an incremental framework (we now have a system that can interrupt the user at appropriate moments). Incrementality delivers mostly benefits of speed, robustness and naturalness on the interaction front and these are linked to output generation, so this is a third issue to watch out for. Larger dialogue strategies may not be as affected, but if they are, we need to establish in what ways.

We’ll certainly steer clear of calling our prototype Morgan. If you are involved in speech and language processing and interested in creating interesting, more natural human-machine dialogues, I’d love to hear from you.

Speech and Dialog Conferences / Speech for iPhone and Android

Saturday, July 11th, 2009

Conference time: I will be spending a couple of days in London and Brighton from September 5th attending Interspeech, SIGDIAL as well as a researcher round-table. Anyone interested in meeting up, feel free to get in touch.

Also, here is some more or less recent, interesting news for Android (at about 6:20, thanks Schamai) and iPhone speech developers.

Incrementality in Verbal Interaction

Thursday, June 18th, 2009

Since I joined a research program at Potsdam University at the end of last year (as a researcher and PhD student), I’ve decided to use this blog for some additional, more personal updates. This is the first :-).

Our research is concerned with human-machine spoken dialog systems from an incremental, i.e. real-time processing, perspective. As such, members of our team, including me, were recently invited to a workshop on “Incrementality in Verbal Interaction.” The workshop brought together an interesting mix of perspectives on incrementality from Psycholinguistics as well as Theoretical and Computational Linguistics. Slides from our project presentation are available here.

Internationalization and Speech Technologies

Monday, May 5th, 2008

The not-so-subtle truth is, of course, that we all speak English. Yet localization and internationalization are at once a prerequisite and a stumbling block for many web-based endeavors.

In my own backyard, two examples illustrate the effect of and the need for internationalization, respectively. German professional social network XING has internationally outperformed competitors like LinkedIn through early and aggressive internationalization. StudiVZ, the “German Facebook”, gained much of the student social network market before Facebook decided to release a German version of its web app, making this a tough-to-crack market.

Ironically, as these two examples underline, the need for localization remains in cases where the demands on usability are low (join group/contact person/send message) and the target audience can largely be expected to speak sufficient English (read this for an interesting take on the same issues and solutions in online gaming.) Moreover, localization is an effort far greater than providing an interface in the local language.

As one might expect, localization, internationalization and speech technology are inextricably linked – in a sense, developing speech technologies is internationalization. And using such technology in professional service projects is akin to building an internationalized web application. Here are some of the oddities I’ve observed while working with speech technologies in an international environment:

Translation is not enough. When you write software that speaks or wants to be spoken to, there is more at stake than providing interface text. Can you expect all your users to spell input when your system doesn’t understand the raw speech input? Can you be sure that all your translated content will generate well-formed speech-synthesis output? Language and culture are sensitive issues, so a well-localized speech application must do more than provide a translated user interface. Employing local staff is usually the minimum requirement for building a speech application for a new market.

The cost shifts. Re-usability of resources from previous speech projects is usually low. So unlike localizing a web application, porting a speech application requires grunt work that you thought you had done the first time around. Moreover, speech applications in new languages almost always come with additional licensing burdens and questions about the appropriate technology partner. Expect to pay for things you didn’t expect.

There is no long tail. The buy-in costs for developing a new language in almost any speech or language technology (recognition, synthesis, translation) remain constant. This makes every newly developed language a strategic decision and translates into a two-tier localization effort: one developing basic technologies, one employing such technology in professional service projects.
As an example, take the world’s most successful dictation software packages: Dragon NaturallySpeaking ships in five flavors of English and six European languages. Philips SpeechMagic ships in 23 dialects of 11 languages. Both are a far cry from world coverage.
The enormous cost of development has a decided effect on developing speech technology for lesser-spoken languages. And it has posed a significant hurdle as well for open-source initiatives of speech technologies to provide such resources for free.

Speech Enabled Knowledge Bases

Tuesday, April 24th, 2007

Two articles and a product showcase recently demonstrated speech-enabled knowledge base solutions. In essence, products and solutions such as these are expert systems of varying complexity, ranging from speaking manuals to complex diagnosis systems. Users can describe a problem and ultimately receive an answer, whether through complex one-shot natural language processing/understanding or a plain-old, multi-step directed dialogue.
Alongside traditional call-center automation applications – e.g. customer service, process automation, pre-qualification, directory assistance – these systems represent a minor market segment. However they are relatively novel, so much can still happen. Especially in medical/health care domains, the market appears untapped and the list of potential applications broad.

Web 3.0 and Natural Language Processing

Monday, April 9th, 2007

Web 3.0 is getting some buzz in the blogosphere. Like Web 2.0, it begs the question that PCMag.com recently ran by its readers: what is it? However, this time around things seem a bit easier.

Web 2.0 seems to be happy with being vaguely defined (delimited may be a better term) and with being equally a social and a technological movement. Web 3.0 clearly hovers over the idea of the “Semantic Web”, a term coined by Tim Berners-Lee, in which richly marked-up hypertext and data allow for novel, more meaningful human-machine and machine-machine communication. Radar Networks (currently in stealth mode) claim to be driving some interesting developments in this direction and are followed closely by those interested.

This has already raised some questions: Will content be expensive hand labor or machine boot-strappable? What new privacy policies do we have to live with? How does one separate style and content? What are the alternatives to RDF?

Sadly, there’s very little inspiring out there about potential applications.

My question (though not uniquely mine) to add to this: What role will natural language processing play in this (i.e. how “semantic” is this talk of Semantics)? Semantic content in RDF appears to be little more than a means for one machine to tell another who authored a particular book or what are the postal codes in the greater Boston area. Semantics to me is as much about intentions (“Why is web-service A dispensing such information?”) and interpreting such information for the purposes of action (“What can web-service B – or my browser or I – do with it?”).
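To illustrate how thin this kind of "semantics" is, here is a minimal sketch of the subject-predicate-object triples that RDF is built on, using plain Python tuples rather than an RDF library. The identifiers `ex:book1`, `ex:personA` and so on are made-up placeholders, and the `match` helper is a hypothetical stand-in for a real query language; only `dc:creator` and `dc:title` correspond to actual Dublin Core terms.

```python
# Each statement is a (subject, predicate, object) triple.
# All identifiers and values below are invented placeholders.
triples = {
    ("ex:book1", "dc:creator", "ex:personA"),
    ("ex:book1", "dc:title", "An Example Title"),
    ("ex:book2", "dc:creator", "ex:personB"),
}


def match(store, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# "Who authored book1?" is a pattern lookup, nothing more.
authors_of_book1 = match(triples, s="ex:book1", p="dc:creator")
```

Nothing in such a store says why the information is being offered or what the receiving service should do with it, which is exactly the gap between this notion of semantics and the intention-oriented one described above.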

Perhaps this misses the mark and the Semantic Web really isn’t about natural language. But there is a weaker, more real form of this “language and technology” concern: insofar as semantics is just information, can it be bootstrapped by a machine (perhaps even in a linguistically informed rather than a purely statistical way)?

Thoughts?

Three Observations about Recent Language Technology News

Wednesday, March 28th, 2007

To start us off, recent experience has shown three things:

  1. Speech (i.e. voice) related news is dominated by TTS, less so by ASR.
  2. The company featured most frequently in the news is Nuance.
  3. The talk of semantic search engines seems to dominate the NLP news.

The success of TTS is largely due to requirements set by mobile and in-car technologies, especially GPS and communications. The future of ASR, on the other hand, seems to depend on the dictation market (especially in the healthcare sector) and a growing relevance of network ASR (driven by advancing VoIP and the impact of multi-modal applications).

Nuance’s continued position will depend on the role of “super players” IBM and Microsoft and to a lesser degree the role of open-source initiatives, especially on the network/telephony side.

Semantic search engines recently got some media hype with “Google-Killer” Powerset, a PARC offspring. While it is still in its infancy, some believe this development towards a semantic web will usher in a Web 3.0 revolution. Of course, some others believe this has already begun, while yet more just wanna see what happens with all this.

Let’s see how these trends develop. Especially multi-modality and semantic searches will be issues to follow closely.