SpinVox, Voice-to-Text and Some Terminology

The recent acquisition of SpinVox by Nuance not only represents another major step towards market consolidation by the latter company, but also prompted me to have a look at the voice-to-text market. Being a “late adopter power user”, out of some combination of complacency with existing workflows and a refusal to pay for certain conveniences, I have refrained from using such services until now. Shameful for one whose bread and butter is working with speech technology, I admit.

Luckily, I came across some useful reviews of the most prominent providers to get me up to snuff. I won’t go into them, as I’m sure others have more to say about the actual user experience. However, as “mobile” is the way speech and language technology seems to want to go, and as I finally plan to use more personal mobile computing resources (especially various gadgets starting with “i”) for speech technology, I may give some of these a whirl in the near future…

SpinVox caused something of a stir when launching their voice-to-text service in 2004, and another when the BBC “uncovered” that the company used a combination of human and machine intelligence. To anyone working in speech and language technology this would have been obvious from the get-go, as it would have been to anyone reading the company’s patent or patent applications, in which the use of human operators is mentioned explicitly. However, regular users would probably have been duped into thinking a machine was doing all the typing. Failure to understand or communicate this caused a wholly avoidable privacy debacle.

One thing that’s clear from last year’s privacy debacle is that there’s a bit of a mess of terminology when it comes to voice and speech technologies. So here’s an attempt at shedding some light on what’s what:

Speech Recognition – also ASR (automatic speech recognition) for short. This is the general term for technology that automatically turns spoken words into machine-readable text. However, the technology can be described along several dimensions, such as the models employed (HMM-based vs. connectionist) and the intended users (a single speaker, or all speakers of a dialect or language). A host of applications also employ it (dictation, IVR/telephone systems, voice-to-text services), each with different requirements. Hence ASR is really an umbrella term.

Voice Recognition – often confused with speech recognition. Usually, voice recognition refers to software that works for a single speaker only. However, this distinction is anecdotal, and in marketing the two terms are used synonymously.

Voice-to-Text – a service that converts spoken words into text. ASR may be used to help do so, as may human transcribers; the label itself makes no claim as to whether the process is fully automated.

Speaker Recognition – a security technology typically used to perform one of two tasks: (1) identifying a speaker from a group of known speakers, or (2) determining whether a speaker really is who s/he claims to be. These are very similar tasks that people often confuse. Think of the first one as picking a person out of a crowd and the second as a kind of “voice fingerprint matching”.
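To make the difference between the two speaker recognition tasks a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three-number “voiceprints”, the speaker names, the cosine similarity measure and the 0.9 threshold are all made up, and real systems use far richer models. The point is only the shape of the two questions being asked.

```python
import math

# Hypothetical enrolled "voiceprints": toy feature vectors, not real speaker models.
ENROLLED = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.2, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(sample, enrolled=ENROLLED):
    """Task (1): pick the best-matching speaker from a known group."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))


def verify(sample, claimed, enrolled=ENROLLED, threshold=0.9):
    """Task (2): accept or reject one claimed identity (threshold is illustrative)."""
    return cosine(sample, enrolled[claimed]) >= threshold


sample = [0.85, 0.15, 0.25]      # an incoming utterance, reduced to toy features
best_match = identify(sample)    # compares against everyone enrolled
is_alice = verify(sample, "alice")
is_bob = verify(sample, "bob")
```

With these toy numbers the sample is identified as “alice”, the claim to be “alice” is accepted and the claim to be “bob” is rejected. Note that identification as sketched here always names somebody from the group, whereas verification can simply say no.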

Text-to-Speech – or TTS for short, another term for speech synthesis. This technology is used to turn written text into an audio signal (such as an MP3). This should be an obvious label, but surprisingly, people seem to confuse it with Voice-to-Text services frequently (purely my own anecdote).

I’m also told SpinVox’s sales price of $102m is a bit of a disappointment, representing just over 50% of the initial $200m that SpinVox raised in 2003. But that’s something I’ll let others address. Let’s see where Nuance goes with this, in terms of trying to fully automate the whole transcription process…


4 Responses to “SpinVox, Voice-to-Text and Some Terminology”

  1. James Siminoff Says:

    You should check out http://www.techcrunch.com/2010/01/28/phonetag-voice-to-text-86-percent-accurate-google-voice/. You will probably also be interested in this study: http://www.scribd.com/doc/26017529/Accuracy-of-Voicemail-To-text-Services.

    I think that this will be a very interesting market to follow in the next few years.

  2. Okko Says:

    Thanks for those links. That study slipped by me.

    I’m very curious to see where this is going. Some obvious questions are how accessible and pervasive voice transcription will become. Will there be a healthy developer base (voice technologies always suffer from having a small, somewhat esoteric one)? What about “real” web APIs, leveraging this stuff for mash-ups like smart ad placements in video, 3rd party calendar plug-ins, etc.?

    How to enable real international market penetration is also a big question. English-speaking markets appear to be reaching a good level of maturity. I’ve previously written about the fact that there is no long tail in speech and language technology development. The buy-in costs per market/language remain the same, regardless of market size. Capital per market, however, varies greatly. This could be a real show stopper.

    Thoughts?

  3. philippine tv online Says:

    I used OS X’s ‘text to mp3’ to do the voice within my YouTube video. So trust me guys, voice things work great, you should use them too! One problem I had was when it tried to pronounce some ‘fake’ words/sounds like ‘woosh’, but with practice it’s easy :-)

  4. Barney Bochat Says:

    Thanks for this amazing post. This is very useful for someone looking for a transcription article. I will be checking your site again soon.
