SpinVox, Voice-to-Text and Some Terminology

January 18th, 2010

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late adopter power user”, out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences, I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily I came across some useful reviews of the most prominent providers to get me up to snuff. I won’t go into them, as I’m sure others have more to say about the actual user experience. However, as “mobile” is the way speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused something of a stir when launching their voice-to-text service in 2004 and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as it would to anyone reading the company’s patent or patent applications, in which the use of human operators is mentioned explicitly. However, regular users would probably have been duped into thinking a machine was doing all the typing. Failure to understand or communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term used to refer to the technology that automatically turns spoken words into machine-readable text. However, the technology can be described along different dimensions, such as the models employed (HMM-based vs. connectionist) or who it is for (a single speaker or all speakers of a dialect or language). Also, there is a host of applications that employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually voice recognition refers to software that works for only a single speaker. However, this is anecdotal, and in marketing the two are used synonymously.

Voice-to-Text – a service that converts spoken words into text. ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – this is a security technology typically used to perform one of two tasks: (1) identifying a speaker from a group of known speakers or (2) determining whether a speaker is really who s/he claims. These are very similar tasks that people often confuse.  Think of the first one as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology is used to turn written text into an audio signal (such as an MP3). This should be an obvious label, but surprisingly people seem to confuse it with Voice-to-Text services frequently (purely my own anecdote).
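
The two speaker recognition tasks above are easy to mix up, so here is a toy sketch of the difference in Python. It assumes each voice has already been reduced to a small vector of numbers; the vectors, the names `identify` and `verify`, and the threshold of 0.9 are all made up for illustration, and real systems use far richer models than a cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify(sample, enrolled):
    """Task 1 (identification): pick the closest of the known speakers."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


def verify(sample, claimed, enrolled, threshold=0.9):
    """Task 2 (verification): accept or reject one claimed identity."""
    return cosine(sample, enrolled[claimed]) >= threshold


# Hypothetical voice vectors for two enrolled speakers and one new sample.
enrolled = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.3]}
sample = [0.85, 0.15, 0.25]

print(identify(sample, enrolled))       # compares against everyone
print(verify(sample, "bob", enrolled))  # compares against one claim only
```

Identification always returns somebody from the group, even if the match is poor, whereas verification looks at a single claimed identity and can say no.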

I’m also told SpinVox’s sale price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…

Twitter List RSS with Yahoo Pipes

January 18th, 2010

This post isn’t really about speech technology, but I wanted to share that after a long time of wondering what the point was, I finally found a use for twitter: Twitter Lists. With these you can follow a group of users with a common theme, either by packing them into a list yourself or by subscribing to other users’ public lists.
However, I still can’t be bothered to check twitter.com for updates, nor do I care to install another 3rd-party app for enriching my user experience. And unfortunately there is no direct way to follow a list as an RSS feed, which is how I prefer to consume information¹.

Thankfully, yet another neat little Yahoo Pipes mashup comes to the rescue. Simply enter the list creator’s user name and the list name, and off you go.

To add a bit of speech tech to this post, here are a few sample lists that you might find interesting:
@die_lautmaler/voicebusiness
@alisohani/machine-learning
@suellewellyn/cunning-linguists
@rachelcotterill/computational-linguistics
(And thanks to people compiling these!)


¹ Interestingly, several friends have recently pointed out that they have ditched RSS for twitter as most of their regular feeds also post there. However, I receive too much content via RSS that twitter won’t deliver, such as Google Alerts, and I find that sorting through the twitfeed quickly becomes a chore, something you’ll still have to do when reading lists, I suppose. Also, leaving an open protocol for a commercial (if free) service seems like a step in the wrong direction…

Quick Voice Prompts with Google Translate TTS Service

January 12th, 2010

Last month Google released several new features for their translation service, among them a text-to-speech rendition of the English translation. As reported elsewhere, it turns out you can directly access this service using a simple URL in your browser. Following this link will return an MP3 of the text sent along with it:

http://translate.google.com/translate_tts?q=Hello+reader

Just replace “Hello+reader” in your address bar with any text that you want spoken. Remember to replace spaces with pluses (+).
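
Swapping spaces for pluses by hand gets fiddly once the text contains punctuation or other special characters. Here is a minimal sketch in Python that only builds the URL and does not fetch anything. It assumes the endpoint still accepts the `q` parameter as shown above; the function name `build_tts_url` is my own invention.

```python
from urllib.parse import urlencode

# Unofficial endpoint taken from the link above; it may change or disappear.
TTS_ENDPOINT = "http://translate.google.com/translate_tts"


def build_tts_url(text):
    """Return the TTS URL for the given text, with spaces encoded as '+'."""
    # urlencode applies form-style quoting, which turns spaces into pluses
    # and percent-encodes characters such as '&' or '?'.
    return TTS_ENDPOINT + "?" + urlencode({"q": text})


print(build_tts_url("Hello reader"))
# http://translate.google.com/translate_tts?q=Hello+reader
```

The resulting string can be pasted into the address bar just like the hand-written link.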

Some browsers, however, seem to have problems with the returned audio. Chrome worked for me, and Internet Explorer reportedly works as well.

As this is not an official RESTful Google API, don’t be surprised if it stops working. Beware that commercial reuse of the output audio is likely also governed by license restrictions.

Update:
Friend Schamai pointed out how this could be employed in a web form. Here’s the corresponding HTML:

<form action="http://translate.google.com/translate_tts">
<input name="q" size="55" value="just saying" />
</form>

Speaking Piano

December 31st, 2009

I greatly enjoyed this video about a piano-cum-speech-synthesis installation. I also think that this would make a great GarageBand plugin.

Incremental Dialogue Management

December 30th, 2009

Dilbert.com
For the past year I’ve been involved in research on incremental processing in spoken dialogue systems at Potsdam University. Our project looks at how information in dialogues can be reduced to basic units, which get passed between modules (such as a speech recognizer and a semantic engine), based on a general abstract model of how this can be done. Thus far, we’ve been mainly concerned with issues originating close to the input speech signal (ASR, semantics, reference resolution, n-best lists, prosody etc.). As these issues are mostly laid out, 2010 will be dedicated to research on larger dialogue issues (interaction & dialogue management, incremental output generation).

As in the Dilbert dialogue snippet, some issues that will naturally arise are (1) how different types of questions can be handled by an incremental dialogue system (breaking with the established Question-Answer-Question-A-Q… paradigm in favour of something more dynamic) and (2) what turn-taking means in an incremental framework (we now have a system that can interrupt the user at appropriate moments). Incrementality mostly delivers benefits in speed, robustness and naturalness on the interaction front, and these are linked to output generation, so that is a third issue to watch out for. Larger dialogue strategies may not be as affected, but if they are, we need to establish in what ways.

We’ll certainly steer clear of calling our prototype Morgan. If you are involved in speech and language processing and interested in creating interesting, more natural human-machine dialogues, I’d love to hear from you.

Welcome to the new URL

December 29th, 2009

Hello reader,

You may be new, or you may have found me at my old blog (the content of which has already been migrated here). This is a fairly content-free post, bidding you a warm welcome here.

Only the best for 2010,

Okko

Speech and Dialog Conferences / Speech for iPhone and Android

July 11th, 2009

Conference time: I will be spending a couple of days in London and Brighton from September 5th attending Interspeech, SIGDIAL as well as a researcher round-table. Anyone interested in meeting up, feel free to get in touch.

Also, here is some more or less recent, interesting news for Android (at about 6:20, thanks Schamai) and iPhone speech developers.

Incrementality in Verbal Interaction

June 18th, 2009

Since I joined a research program at Potsdam University at the end of last year (as a researcher and PhD student), I’ve decided to use this blog for some additional, more personal updates. This is the first :-).

Our research is concerned with human-machine spoken dialog systems from an incremental, i.e. real-time processing, perspective. As such, members of our team, including me, were recently invited to a workshop on “Incrementality in Verbal Interaction.” The workshop brought together an interesting mix of perspectives on incrementality from Psycholinguistics as well as Theoretical and Computational Linguistics. Slides from our project presentation are available here.

Tim O’Reilly: Google Voice Search Key Technology

April 2nd, 2009

ReadWriteWeb reports that Tim O’Reilly addressed attendees at the San Francisco Web 2.0 Expo this week, talking about key technologies for the Web >2.0. Voice search (Google iPhone App), he claimed, was a tipping point in terms of “sensor-based interfaces”.

While not the only vendor to provide voice search (e.g. Yahoo oneSearch powered by Vlingo), Google certainly seems ahead of the game in what appears to be a gradual unfolding of a broad voice strategy, with steps such as Voice Search and the recent rebranding of a feature-enhanced GrandCentral as Google Voice. Future work we can expect on the voice front includes promotion of its own speech recognition capacities through Android, Google Gears bringing speech capacities to all browsers, tighter integration of Gaudi (audio indexing) with other services and perhaps one day opening up voice services over APIs.

As I’ve previously pointed out, to Google, voice is just another form of data, but what’s slowly beginning to emerge is a central role for speech and voice technologies in coming developments for the web and in how we search and interface with it.

Language Technology April Fools

April 1st, 2009

Just posting some gems from today concerning speech and language technology, such as natural language generation, speech recognition and natural language processing.

Have you found any others?