ICT Workshop on Overlap in Human-Computer Dialogue and Semdial 2011

September 29th, 2011

Sorry, this title is a mouthful. I spent the week of October 19th–23rd, 2011 at the offices of the good folks at ICT in Los Angeles, who organized the aforementioned workshop, co-located with Semdial 2011 (“Los Angelogue”), held there that same week. Many thanks to all the wonderful, busy people who made it happen.

The workshop’s main theme, “overlap”, describes a set of less-studied behaviors in human-human (and human-machine) dialogue. While there is general agreement that humans do not speak at the same time very often, interesting things do happen when they do. The workshop tried to describe and classify these behaviors and tie them into more abstract dialogue-related themes. Notably, back-channels, interruptions (competitive and cooperative), side-talk and turn-taking events were recurring topics.

The workshop consisted of a number of breakout sessions, each of which focused on specific aspects of these issues. Our second-day group, for instance, tried to account for turn-taking behavior, i.e. how dialogue participants decide when to speak next, in terms of avoiding or initiating overlap. As an example, we used the following “pre-linguistic” behavior:

I’m not sure we solved the problem even for a set of dialogue acts consisting of DADA, STOMP and LAUGH, but there was general consensus that the desire to speak next (i.e. having something to say) and the utility of starting to talk should somehow be maximized at any given moment. Whether the two should come from a joint utility function was left open.
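As a minimal sketch of what such a decision rule might look like (entirely my own toy formulation; the function names, the weighted-average combination and the threshold are invented, not something the workshop agreed on):

```python
# Hypothetical sketch of the turn-taking decision discussed above.
# Names, weights and threshold are my own invention, not workshop output.

def speak_score(desire: float, floor_free: float, w: float = 0.5) -> float:
    """Combine the desire to speak (having something to say) with the
    utility of starting to talk right now (e.g. the floor being free)
    into one score. Both inputs are assumed to lie in [0, 1]."""
    # One possible joint utility: a weighted average. Whether a joint
    # function like this is appropriate was left open at the workshop.
    return w * desire + (1 - w) * floor_free

def should_speak(desire: float, floor_free: float, threshold: float = 0.6) -> bool:
    """Start speaking when the joint score exceeds a threshold."""
    return speak_score(desire, floor_free) > threshold

# A participant with much to say, while someone else holds the floor:
print(should_speak(desire=0.9, floor_free=0.2))  # False (no interruption)
# The same desire once the floor is free:
print(should_speak(desire=0.9, floor_free=0.9))  # True
```

Raising `w` would model a speaker more willing to interrupt competitively; the open question from the workshop is whether desire and opportunity should really be collapsed into one number at all.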

A recurring question was what use low-level “overlap” phenomena – especially back-channels like “yeah”, “ok”, “mhm”, “uh” that indicate (non-)understanding – are in modeling higher-level aspects of discourse, such as grounding. “Of little use” is supported by the observation that speakers generally do not consciously remember producing such short utterances, indicating their more or less reflexive nature. “Of much use”, in turn, is supported by the observation that such phenomena often occur at specific and (semantically/informationally) relevant points in conversation, indicating that they are useful for grounding information.

Following the workshop, Semdial offered a broader view of Semantics in Dialogue, spanning psycho- and developmental linguistics, formal semantics and human-machine dialogue research. Keynote speakers Patrick Healey, Jerry Hobbs, David Schlangen and Lenhart Schubert exemplified this range. Jerry Hobbs’ talk in particular contained a number of observations, arrived at by analyzing a three-party meeting-scheduling dialogue, that left me with ideas to try out in my own dialogue system implementation.

First, he observed that such dialogues largely follow an ordered, task-specific breakdown of steps towards a high-level goal (finding a suitable scheduling arrangement). “Violations” of this order occur when relevant partial tasks were not recently ratified, or were important and thus merited revisitation. (“Violation” is his term, though it seems to imply that the order is somehow prescribed, and it is not clear by whom or what; “revisitation” seems more apt to me.) I take this finding as a good indication that my own dialogue management approach in terms of hierarchical discourse units (which encode “order” in a similar fashion) is on the right track (see my own Semdial contribution for details).

Second, a more low-level analysis of the types of questions posed during the scheduling session indicates that, even though a large number of questions are posed as explicit or implicit YesNo questions, or are otherwise answerable by “yes” or “no”, a very low proportion of them is ever treated as such. Here I see a strong divergence between human-human and human-machine dialogue: the latter explicitly uses YesNo questions to overcome technological shortcomings, e.g. to avoid speech recognition errors and thus produce more robust dialogues. Such “computer-speak” (“I think you said Boston. Is that right?”) is of course perceived as unnatural by human users and subjects. Perhaps posing the more common type of YesNo question, i.e. one aimed at collecting actual information (“Do you know how you want to travel?”), in a spoken dialogue system can rehabilitate (or “even out”) the more explicit kind used in systems today. To be determined.

Machines talking to machines that talk

September 6th, 2011

As a sign of life, something amusing I came across today:

InproTk Demonstration

November 4th, 2010

I’ve uploaded a simple incremental demonstration system built with our toolkit to my Potsdam University site. It’s still micro-domain-based, meaning you can’t do much yet (select and delete puzzle pieces), but it exhibits some interesting interaction phenomena, such as prosody-driven end-of-turn classification, mid-utterance action execution and display of partial ASR hypotheses (German only).

This Goes to Eleven

July 20th, 2010

No content, just for fun.

A More Optimistic Outlook on the Future of Speech

June 30th, 2010

The speech application industry has received some critical press in recent months (along with some spirited responses).

All the more refreshing to come across this New York Times article presenting current work in speech and artificial intelligence. The article highlights broadly what kind of AI applications have moved into the mainstream (or have potential to do so). Speech and natural language understanding, the article claims, have gone furthest.

One criticism that generalizes from both pieces above is that development of speech-enabled applications has stagnated, in various ways1: the underlying technology – speech recognition (ASR) – has gone as far as it can; application designers and developers haven’t adopted it; dictation has learned to understand doctors and lawyers better, but still struggles with conversational speech.

This point may have to be conceded. In terms of commercial applications, however, especially speech-enabled voice (IVR) systems, the root cause of stagnation is not necessarily a failure of AI but rather a maturing of standards and best practices. Fulfilling the expectation that voice applications, much like websites, behave according to certain rules works much to the advantage of the millions who interact with such systems every day.

What I take away from the generalized criticism, as well as from the Times’ optimistic perspective, is that, short of a revolution in underlying technologies (which hardly anyone expects), filling practical, everyday niches is where things can still move forward for speech and language processing. These niches have certainly not all been uncovered yet.


1 Roughly summarized – Robert Fortner: “development in speech technology has flat-lined since 2001”; David Suendermann: “(statistical) engineering methods are more efficient than traditional symbolic linguistic approaches to language processing.”

Roger Ebert TTS

March 10th, 2010

Roger Ebert, who lost his lower jaw to cancer, has his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine based on old recordings such as TV appearances and DVD commentary tracks.

This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECTalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was of course a bit of a trademark; the anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples of real utility.

Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, developing a new voice for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preparation (studio recording, annotation, hand alignment, etc.). However, as this “grunt work” is becoming more streamlined and automated, the buy-in cost for a new voice is dropping. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded in good enough quality to build his custom voice. Another player in the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.

Another option that seems more and more realistic is employing “voice morphing” and “voice transformation”. The idea of the latter is simply to apply changes to an already existing, high-quality TTS voice. The following is a demonstration of how this can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:
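As a toy stand-in (my own, not part of the demo above), the crudest such manipulation is naive resampling, which changes rate and pitch together. The sketch below interpolates a mono signal at a new rate in plain Python; real voice transformation would of course alter pitch, rate and timbre independently:

```python
# Toy illustration: naive linear resampling of a mono signal. Playing the
# result back at the original sample rate changes rate and pitch together,
# the crudest form of acoustic voice manipulation.
import math

def resample(samples: list[float], factor: float) -> list[float]:
    """Linearly interpolate `samples` at `factor` times the original length.
    factor > 1 stretches the signal (slower, lower-pitched playback);
    factor < 1 shrinks it (faster, higher-pitched playback)."""
    n = int(len(samples) * factor)
    out = []
    for i in range(n):
        pos = i / factor                      # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of a 440 Hz tone at 8 kHz, shrunk by a factor of 2: played
# back at 8 kHz it lasts half as long and sounds an octave higher.
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
higher = resample(tone, 0.5)
print(len(higher))  # 4000
```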

Voice morphing changes one voice to another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like that of another using relatively little training data. The following are some examples:

Original Speaker 1:

Target Speaker 2:

Converted Speaker 1 to Speaker 2:

Similar technology was also showcased extensively during the 2009 Interspeech conference. Perhaps this will one day enable those who have lost their voice, but do not have hours (or days) of recordings of it at their disposal, to have custom voices with which to talk to their loved ones.

SpinVox, Voice-to-Text and Some Terminology

January 18th, 2010

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late-adopter power user” – out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences – I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily I came across some useful reviews of the most prominent providers to get me up to speed. I won’t go into them here, as I’m sure others have more to say about the actual user experience. However, as “mobile” is where speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused somewhat of a stir when launching its voice-to-text service in 2004, and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as it would to anyone reading the company’s patents or patent applications, in which the use of human operators is mentioned explicitly. Regular users, however, would probably have been led to believe a machine was doing all the typing. Failure to understand and communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term for technology that automatically turns spoken words into machine-readable text. There are different dimensions along which to describe it, such as the models employed (HMM-based vs. connectionist) and who it’s for (a single speaker, or all speakers of a dialect or language). There is also a host of applications that employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually “voice recognition” refers to software that works for only a single speaker, but this distinction is anecdotal, and in marketing the two terms are used synonymously.

Voice-to-Text – a service that converts spoken words into text. Some ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – a security technology typically used for one of two tasks: (1) identifying a speaker from a group of known speakers, or (2) verifying that a speaker really is who s/he claims to be. These are closely related tasks that people often confuse. Think of the first as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology turns written text into an audio signal (such as an MP3). The label should be self-explanatory, but surprisingly, people seem to confuse it with voice-to-text services frequently (purely my own anecdote).
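The identification-versus-verification distinction under Speaker Recognition can be sketched in a few lines. This is purely illustrative: real systems compare high-dimensional acoustic feature vectors, not the made-up three-number “voiceprints” used here, and the names and threshold are my own:

```python
# Hypothetical sketch of speaker identification (1) vs. verification (2).
# The three-number "voiceprints" are invented for illustration only.
import math

def similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two voiceprints."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

enrolled = {
    "alice": [0.9, 0.1, 0.3],
    "bob":   [0.2, 0.8, 0.5],
}

def identify(sample: list[float]) -> str:
    """(1) Identification: pick the best match among all known speakers."""
    return max(enrolled, key=lambda name: similarity(sample, enrolled[name]))

def verify(sample: list[float], claimed: str, threshold: float = 0.9) -> bool:
    """(2) Verification: does the sample match the one claimed identity?"""
    return similarity(sample, enrolled[claimed]) >= threshold

probe = [0.85, 0.15, 0.35]
print(identify(probe))        # "alice" (picked out of the crowd)
print(verify(probe, "bob"))   # False (fingerprint match fails)
```

The structural difference is exactly the one in the text: identification searches over everyone enrolled, while verification makes a single yes/no comparison against the claimed identity.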

I’m also told SpinVox’s sale price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…

Twitter List RSS with Yahoo Pipes

January 18th, 2010

This post isn’t really about speech technology, but I wanted to share that after a long time of wondering what the point was, I finally found a use for Twitter: Twitter Lists. With these you can follow a group of users with a common theme, either by packing them into a list yourself or by subscribing to other users’ public lists.
However, I still can’t be bothered to check twitter.com for updates, nor do I care to install yet another third-party app to enrich my user experience. And unfortunately there is no direct way to follow a list as an RSS feed, which is how I prefer to consume information1.

Thankfully, yet another neat little Yahoo Pipes mashup comes to the rescue. Simply enter the list creator’s user name and the list name, and off you go.

To add a bit of speech tech to this post, here are a few sample lists that you might find interesting:
(And thanks to people compiling these!)

1 Interestingly, several friends have recently pointed out that they have ditched RSS for twitter as most of their regular feeds also post there.  However I receive too much content via RSS that twitter won’t deliver, such as Google Alerts, and I find sorting through the twitfeed quickly becomes a chore, something you’ll still have to do when reading lists, I suppose. Also, leaving an open protocol for a commercial (if free) service seems like a step in the wrong direction…

Quick Voice Prompts with Google Translate TTS Service

January 12th, 2010

Google last month released several new features for its translation service, among them a text-to-speech rendition of the English translation. As reported elsewhere, it turns out you can access this service directly using a simple URL in your browser. Following this link will return an MP3 of the text sent along with it:

http://translate.google.com/translate_tts?q=Hello+reader

Just replace “Hello+reader” in your address bar with any text that you want spoken. Remember to replace spaces with pluses (+).

Some browsers, however, seem to have problems with the returned audio. Chrome worked for me, and Internet Explorer reportedly works as well.

As this is not an official RESTful Google API, don’t be surprised if it stops working. Beware that commercial reuse of the output audio is likely also governed by license restrictions.
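For the programmatically inclined, the same request can be built in a few lines of Python. The sketch below only constructs the URL; actually fetching it depends on the unofficial endpoint still answering, which it may not:

```python
# Build a request URL for the (unofficial, possibly short-lived) Google
# Translate TTS endpoint. urlencode handles the space-to-plus substitution.
from urllib.parse import urlencode

BASE = "http://translate.google.com/translate_tts"

def tts_url(text: str) -> str:
    """Return a URL that, at the time of writing, yielded an MP3 of `text`."""
    return BASE + "?" + urlencode({"q": text})

url = tts_url("Hello reader")
print(url)  # http://translate.google.com/translate_tts?q=Hello+reader

# To actually download the MP3 (untested, given the unofficial endpoint):
# from urllib.request import urlretrieve
# urlretrieve(url, "hello.mp3")
```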

My friend Schamai pointed out how this could be employed in a web form. Here’s an example:

Or the corresponding HTML:

<form action="http://translate.google.com/translate_tts">
  <input name="q" size="55" value="just saying" />
  <input type="submit" />
</form>

Speaking Piano

December 31st, 2009

I greatly enjoyed this video about a piano-cum-speech-synthesis installation. I also think that this would make a great GarageBand plugin.