Archive for the ‘Research’ Category

ICT Workshop on Overlap in Human-Computer Dialogue and Semdial 2011

Thursday, September 29th, 2011

Sorry, this title is a mouthful. I spent the week of September 19th-23rd, 2011 at the offices of the good folks at ICT in Los Angeles, who organized the aforementioned workshop, co-located with Semdial 2011 “Los Angelogue”, held there that same week. Many thanks to all the wonderful, busy people who made it happen.

The workshop’s main theme, “overlap”, describes a set of less-studied behaviors in human-human (and human-machine) dialogue. While there’s general agreement that humans do not speak at the same time very often, interesting things do happen when they do. The workshop tried to describe and classify these, and to tie them into more abstract dialogue-related themes. Notably, back-channels, interruptions (competitive and cooperative), side-talk and turn-taking events were recurring themes.

The workshop consisted of a number of breakout sessions, which focused on specific aspects of these issues. Our second-day group, for instance, tried to account for turn-taking behavior, i.e. how dialogue participants make decisions about when to speak next, in terms of avoiding or initiating overlap. As an example, we used the following “pre-linguistic” behavior:

I’m not sure we solved the problem even for a set of dialogue acts consisting of DADA, STOMP and LAUGH, but there was general consensus that the desire to speak next (i.e. having something to say) and the utility of starting to talk should somehow be maximized at any given moment. Whether the two should come from a joint utility function was left open.
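To make the idea a bit more concrete, here is a minimal sketch of one way to read it. Everything in it is an assumption for illustration: the 0-to-1 scales, the product as a joint score, the threshold, and all names are hypothetical and not something the group agreed on.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    desire: float   # how much the participant has to say (assumed 0..1 scale)
    utility: float  # how favourable this moment is for starting to talk (assumed 0..1)


def joint_score(p):
    # One possible joint utility (an assumption): the product is only high
    # when both the desire to speak and the utility of speaking now are high.
    return p.desire * p.utility


def next_speaker(participants, threshold=0.5):
    """Return the participant who should start talking now, or None for silence."""
    best = max(participants, key=joint_score)
    return best if joint_score(best) >= threshold else None


participants = [
    Participant("A", desire=0.9, utility=0.4),
    Participant("B", desire=0.7, utility=0.9),
]
speaker = next_speaker(participants)
print(speaker.name if speaker else "nobody")
```

Keeping the two terms separate and only combining them in `joint_score` leaves the open question open: the product can be swapped for two independent decisions without touching the rest.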

A recurring question was of what use low-level “overlap” phenomena – especially back-channels like “yeah”, “ok”, “mhm”, “uh”, indicating (non-)understanding – are in modeling higher-level aspects of discourse, such as grounding. “Of little use” may be supported by the observation that speakers do not generally consciously remember producing such short utterances, indicating their more or less reflexive nature. “Of much use”, in turn, is supported by the observation that such phenomena often occur at specific and (semantically/informationally) relevant points in conversation, indicating that they are useful for grounding information.

Following the workshop, Semdial offered a broader view of Semantics in Dialogue, from psycho- and developmental linguistics, formal semantics and human-machine dialogue research. Keynote speakers Patrick Healey, Jerry Hobbs, David Schlangen and Lenhart Schubert exemplified this range. Jerry Hobbs’ talk in particular contained a number of observations, arrived at by analysis of a three-party meeting-schedule dialogue, that left me with ideas to try out in my own dialogue system implementation.

Firstly, he observed that such dialogues largely follow an ordered, task-specific breakdown of steps towards a high-level goal (finding a suitable scheduling arrangement). “Violations” of this order occur when relevant partial tasks were not recently ratified or were important and thus merited revisitation. (“Violation” is his term; it seems to imply that the order is somehow prescribed, although it is not clear by whom or what. “Revisitation” is more apt, I believe.) I take this finding as a good indication that my own dialogue management approach in terms of hierarchical discourse units (that encode “order” in a similar fashion) is on the right track (see my own Semdial contribution for details).

Secondly, a more low-level analysis of the types of questions posed during the scheduling session indicates that, even though a large number of questions are either posed as explicit or implicit YesNo questions, or are otherwise answerable by “yes” and “no”, a very low proportion of them is ever treated as such. Here I see a strong divergence between human-human and human-machine dialogues, as the latter explicitly make use of YesNo questions to overcome technological shortcomings, e.g. posing them to guard against speech recognition errors in order to produce more robust dialogues. Such “computer-speak” (“I think you said Boston. Is that right?”), of course, is perceived as unnatural by human users/subjects. Perhaps, instead, posing the more common type of YesNo question, i.e. one that is aimed at collecting actual information (“Do you know how you want to travel?”), in a spoken dialogue system can rehabilitate (or “even out”) the more explicit kind used in systems today. To be determined.
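For readers who haven’t built such a system, the two kinds of YesNo question can be told apart in a small sketch. It assumes the recognizer reports a confidence score and that the system confirms below a fixed threshold; the function name, the threshold value and the scores are hypothetical.

```python
def next_prompt(value, confidence, confirm_below=0.8):
    """Pick the system's next utterance after recognizing a city name.

    `confidence` is assumed to be a recognizer score between 0 and 1.
    """
    if confidence < confirm_below:
        # Explicit confirmation: robust against recognition errors,
        # but this is the "computer-speak" kind of YesNo question.
        return f"I think you said {value}. Is that right?"
    # Confident enough: continue with an information-seeking YesNo question.
    return "Do you know how you want to travel?"


print(next_prompt("Boston", confidence=0.55))
print(next_prompt("Boston", confidence=0.95))
```

The first call produces the confirmation question, the second the information-seeking one.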

A More Optimistic Outlook on the Future of Speech

Wednesday, June 30th, 2010

The speech application industry got some critical press in recent months (here are some spirited responses, respectively.)

All the more refreshing to come across this New York Times article presenting current work in speech and artificial intelligence. The article highlights broadly what kind of AI applications have moved into the mainstream (or have potential to do so). Speech and natural language understanding, the article claims, have gone furthest.

One thing that is generalizable from both criticisms above is that development of speech-enabled applications has stagnated, in various ways¹. The underlying technology – speech recognition (ASR) – has gone as far as it can. Application designers and developers haven’t adapted. Dictation has learned to understand doctors and lawyers better, but still struggles with conversational speech.

This point may have to be conceded. In terms of commercial applications, however, especially speech-enabled voice (IVR) systems, the root cause of stagnation is not necessarily a failure of AI but rather a maturing of standards and best practices. Fulfilling expectations that voice applications, much like websites, behave according to certain rules is much to the advantage of the millions who interact with such systems every day.

What I walk away with from both the generalized critical perspective and the Times’ optimistic one is that, short of a revolution in underlying technologies (which hardly anyone expects), filling practical, everyday niches is where things can still move forward for speech and language processing. These niches have certainly not been fully uncovered.

Thoughts?


1 Roughly summarized, Robert Fortner: “development in speech technology has flat-lined since 2001”; David Suendermann: “(statistical) engineering methods are more efficient than traditional symbolic linguistic approaches to language processing.”

Roger Ebert TTS

Wednesday, March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has been given his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECtalk, since 2009 NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark. The anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples exhibiting real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, new voice development for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preprocessing (studio recording, annotation, hand alignment, etc.). However, as the “grunt work” gets more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to enable building his custom voice. Another player on the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems to be more and more realistic is employing “voice-morphing” and “voice transformation”. The idea here is to simply apply changes to an already existing, high-quality TTS voice. The following is a demonstration of how the latter can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like those of another using relatively little training data. The following are some examples:

Original Speaker 1:

Target Speaker 2:

Converted Speaker 1 to Speaker 2:

Similar technology was also showcased extensively during the 2009 Interspeech Conference. Perhaps this will one day enable those who have lost their voice, and who do not have hours (or days) of recordings of it at their disposal, to have their own custom voices to talk to their loved ones.

Incremental Dialogue Management

Wednesday, December 30th, 2009

Dilbert.com
For the past year I’ve been involved in research on incremental processing in spoken dialogue systems at Potsdam University. Our project looks at how information in dialogues can be reduced to basic units, which get passed between modules (such as a speech recognizer and a semantic engine), based on a general abstract model of how this can be done. Thus far, we’ve been mainly concerned with issues originating close to the input speech signal (ASR, semantics, reference resolution, n-best lists, prosody etc.). As these issues are mostly laid out, 2010 will be dedicated to research on larger dialogue issues (interaction & dialogue management, incremental output generation).
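As a rough illustration of what “basic units passed between modules” can mean, here is a toy sketch. It is not our project’s model or code: the class names, the add/revoke operations and the city list are all assumptions made up for this example. A recognizer hands over one word at a time, may take a word back, and the semantic module updates its reading after every step instead of waiting for the end of the utterance.

```python
from dataclasses import dataclass, field


@dataclass
class WordUnit:
    """A single word hypothesis handed over by a (hypothetical) recognizer."""
    text: str


@dataclass
class SemanticModule:
    cities: tuple = ("boston", "berlin")  # toy vocabulary, assumed for the example
    heard: list = field(default_factory=list)

    def add(self, unit):
        # A new word arrives: store it and re-interpret right away.
        self.heard.append(unit)
        return self.interpret()

    def revoke(self, unit):
        # The recognizer changed its mind: drop the word and re-interpret.
        self.heard.remove(unit)
        return self.interpret()

    def interpret(self):
        # The destination is simply the last city word heard so far.
        found = [u.text for u in self.heard if u.text in self.cities]
        return {"destination": found[-1]} if found else {}


sem = SemanticModule()
print(sem.add(WordUnit("to")))
wrong = WordUnit("berlin")
print(sem.add(wrong))       # early, possibly wrong reading
print(sem.revoke(wrong))    # reading is withdrawn again
print(sem.add(WordUnit("boston")))
```

The point of the sketch is only that every module produces output after every unit, and that earlier output can be taken back.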

As in the Dilbert dialogue snippet, some issues that will naturally arise are (1) how different types of questions can be handled by an incremental dialogue system (breaking with the established Question-Answer-Question-A-Q… paradigm in favour of something more dynamic) and (2) what turn-taking means in an incremental framework (we now have a system that can interrupt the user at appropriate moments). Incrementality mostly delivers benefits in speed, robustness and naturalness on the interaction front, and these are linked to output generation, so this is a third issue to watch out for. Larger dialogue strategies may not be as affected, but if they are, we need to establish in what ways.

We’ll certainly steer clear of calling our prototype Morgan. If you are involved in speech and language processing and interested in creating interesting, more natural human-machine dialogues, I’d love to hear from you.

Speech and Dialog Conferences / Speech for iPhone and Android

Saturday, July 11th, 2009

Conference time: I will be spending a couple of days in London and Brighton from September 5th attending Interspeech, SIGDIAL as well as a researcher round-table. Anyone interested in meeting up, feel free to get in touch.

Also, here is some more or less recent, interesting news for Android (at about 6:20, thanks Schamai) and iPhone speech developers.

Incrementality in Verbal Interaction

Thursday, June 18th, 2009

Since I joined a research program at Potsdam University at the end of last year (as a researcher and PhD student), I’ve decided to use this blog for some additional, more personal updates. This is the first :-).

Our research is concerned with human-machine spoken dialog systems from an incremental, i.e. real-time processing, perspective. As such, members of our team, including me, were recently invited to a workshop on “Incrementality in Verbal Interaction.” The workshop brought together an interesting mix of perspectives on incrementality from Psycholinguistics as well as Theoretical and Computational Linguistics. Slides from our project presentation are available here.

Internationalization and Speech Technologies

Monday, May 5th, 2008

The not-so-subtle truth is, of course, that we all speak English. Yet localization and internationalization are at once prerequisite and stumbling block for many web-based endeavors.

In my own backyard, two examples illustrate the effect of and the need for internationalization, respectively. German professional social network XING has internationally outperformed competitors like LinkedIn through early and aggressive internationalization. StudiVZ – the “German Facebook” – gained much of the student social network market before Facebook decided to release a German version of its web app, making this a tough-to-crack market.

Ironically, as these two examples underline, the need for localization remains in cases where the demands on usability are low (join group/contact person/send message) and the target audience can largely be expected to speak sufficient English (read this for an interesting take on the same issues and solutions in online gaming.) Moreover, localization is an effort far greater than providing an interface in the local language.

As one expects, localization, internationalization and speech technology are inextricably linked – in a sense, developing speech technologies is internationalization. And using such technology in professional service projects is akin to building an internationalized web application. Here are some of the oddities I’ve observed while working with speech technologies in an international environment:

Translation is not enough. When you write software that speaks or wants to be spoken to, there is more at stake than providing interface text. Can you expect all your users to spell input when your system doesn’t understand the raw speech input? Can you be sure that all your translated content will generate well-formed speech-synthesis output? Language and culture are sensitive issues, so a well-localized speech application must do more than provide a translated user interface. Employing local staff is usually a minimum requirement for building a speech application for a new market.

The cost shifts. Re-usability of resources from previous speech projects is usually low. So unlike localizing a web application, porting a speech application requires grunt work that you thought you had done the first time around. Moreover, speech applications in new languages almost always come with additional licensing burdens and questions about the appropriate technology partner. Expect to pay for things you didn’t expect.

There is no long tail. The buy-in costs for developing a new language in almost any speech or language technology (recognition, synthesis, translation) remain constant. This makes every newly developed language a strategic decision and translates into a two-tier localization effort: one developing basic technologies, one employing such technology in professional service projects.
As an example, take the world’s most successful dictation software packages: Dragon NaturallySpeaking ships in five flavors of English and six European languages; Philips SpeechMagic ships in 23 dialects of 11 languages. Both are a far cry from world coverage.
The enormous cost of development has a decided effect on developing speech technology for lesser-spoken languages. And it has posed a significant hurdle as well for open-source initiatives of speech technologies to provide such resources for free.

Speech Enabled Knowledge Bases

Tuesday, April 24th, 2007

Two articles and a product showcase recently demonstrated speech-enabled knowledge base solutions. In essence, products/solutions such as these are expert systems with various degrees of complexity, ranging from speaking manuals to complex diagnosis systems. Users can describe a problem and ultimately receive an answer, whether through complex one-shot natural language processing/understanding or a plain-old, multi-step directed dialogue.
Alongside traditional call-center automation applications – e.g. customer service, process automation, pre-qualification, directory assistance – these systems represent a minor market segment. However, they are relatively novel, so much can still happen. Especially in medical/health care domains, the market appears untapped and the list of potential applications broad.

Web 3.0 and Natural Language Processing

Monday, April 9th, 2007

Web 3.0 is getting some buzz in the blogosphere. Like Web 2.0, it begs the question that PCMag.com recently ran by its readers: what is it? However, this time around things seem a bit easier.

Web 2.0 seems to be happy with being vaguely defined (delimited may be a better term) and equally a social and a technological movement. Web 3.0 clearly hovers over the idea of the “Semantic Web”, a term coined by Tim Berners-Lee, in which richly marked-up hypertext and data allow for novel, more meaningful human-machine and machine-machine communication. Radar Networks (currently in stealth mode) claim to be driving some interesting developments in this direction and are followed closely by those interested.

This has already raised some questions: will content be expensive hand labor or machine boot-strappable, what new privacy policies do we have to live with, how does one separate style and content, what are alternatives to RDF.

Sadly, there’s very little inspiring out there about potential applications.

My question (though not uniquely mine) to add to this: What role will natural language processing play in this (i.e. how “semantic” is this talk of Semantics)? Semantic content in RDF appears to be little more than a means for one machine to tell another who authored a particular book or what are the postal codes in the greater Boston area. Semantics to me is as much about intentions (“Why is web-service A dispensing such information?”) and interpreting such information for the purposes of action (“What can web-service B – or my browser or I – do with it?”).
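To show how thin this kind of “semantics” is, here is a small sketch that mimics RDF-style subject/predicate/object triples with plain Python tuples. It does not use a real RDF library, and the identifiers, the predicate names and the sample data are made up for illustration.

```python
# Subject, predicate, object: the basic shape of an RDF statement.
triples = [
    ("book:Moby-Dick", "author", "Herman Melville"),
    ("area:GreaterBoston", "postalCode", "02108"),
    ("area:GreaterBoston", "postalCode", "02139"),
]


def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if subject in (None, t[0])
        and predicate in (None, t[1])
        and obj in (None, t[2])
    ]


print(query(predicate="author"))
print(query(subject="area:GreaterBoston", predicate="postalCode"))
```

One machine can answer another’s lookup this way, but nothing in the data says why the information is offered or what the receiver should do with it.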

Perhaps this misses the mark and semantics really isn’t about natural language. But there is a weaker, more real form of this “language and technology” concern: Insofar as semantics is just information, can it be bootstrapped by a machine (perhaps even linguistically informed rather than statistically)?

Thoughts?

Three Observations about Recent Language Technology News

Wednesday, March 28th, 2007

To start us off, recent experience has shown three things:

  1. Speech (i.e. voice) related news is dominated by TTS, less so by ASR.
  2. The company featured most frequently in the news is Nuance.
  3. The talk of semantic search engines seems to dominate the NLP news.

The success of TTS is largely due to requirements set by mobile and in-car technologies, especially GPS and communications. The future of ASR, on the other hand, seems to depend on the dictation market (especially in the healthcare sector) and a growing relevance of network ASR (driven by advancing VoIP, impact of multi-modal applications).

Nuance’s continued position will depend on the role of “super players” IBM and Microsoft and to a lesser degree the role of open-source initiatives, especially on the network/telephony side.

Semantic search engines recently got some media hype with “Google-Killer” Powerset, a PARC offspring. While in its infancy, some believe this development towards the semantic web will usher in a Web 3.0 revolution. Of course, some others believe this has already begun, while yet more just wanna see what happens with all this.

Let’s see how these trends develop. Especially multi-modality and semantic searches will be issues to follow closely.