Roger Ebert, who lost his lower jaw to cancer, has his old voice back. Or at least a version of it. Edinburgh-based CereProc has built a custom voice for its own speech synthesis engine from old recordings such as TV appearances and DVD commentary tracks.
This is of course not the first case of text-to-speech (TTS) being used for essential day-to-day communication. Most prominently, Professor Stephen Hawking has been doing so since 1985, initially using DECtalk and, since 2009, NeoSpeech. The poor quality of his voice prior to the switch was something of a trademark; the anecdote goes that Professor Hawking stuck with his old voice out of attachment. While many speech and language technologies suffer a wow-but-who-really-needs-it existence, these cases are wonderful examples of real utility.
Mr. Ebert’s voice is novel in one regard: he got his own voice back. I have half-seriously mused in the past whether this wasn’t becoming a real option. Typically, developing a new voice for general-purpose speech synthesis is a costly affair, mostly due to time- and labor-intensive data preparation (studio recording, annotation, hand alignment, etc.). But as this grunt work becomes more streamlined and automated, the buy-in cost for a new voice drops. Mr. Ebert was “lucky” in the sense that large amounts of his voice had already been recorded at high enough quality to enable building his custom voice. Another player on the TTS market, Cepstral, has recently launched its VoiceForge offering, which aims to lower the entry threshold for home-grown TTS developers.
Another option that is becoming more and more realistic is “voice morphing” and “voice transformation”. The idea behind the latter is simply to apply changes to an already existing, high-quality TTS voice. The following demonstrates how this can be done by changing purely acoustic properties (timbre, pitch, rate) of a voice signal:
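As a toy illustration of the signal-level knobs involved, here is a small Python sketch (the function names and the synthetic "voice" are illustrative, not taken from any of the products mentioned). It shifts pitch by the crudest possible means, resampling, which is why the duration changes too; real voice transformation uses techniques like PSOLA or a phase vocoder to move pitch and rate independently:

```python
import numpy as np

SR = 16000  # sample rate, Hz

def resample_shift(signal, semitones):
    """Crudely shift pitch by resampling. This also changes duration;
    production systems (e.g. PSOLA, phase vocoder) decouple the two."""
    factor = 2 ** (semitones / 12.0)
    # Read the original signal back at a compressed/stretched rate.
    idx = np.arange(0, len(signal), factor)
    return np.interp(idx, np.arange(len(signal)), signal)

def dominant_freq(signal, sr=SR):
    """Location of the magnitude-spectrum peak, in Hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.fft.rfftfreq(len(signal), 1.0 / sr)[np.argmax(spectrum)]

# A 220 Hz stand-in for a voice, pitched up one octave (+12 semitones).
t = np.arange(SR) / SR
tone = np.sin(2 * np.pi * 220 * t)
shifted = resample_shift(tone, 12)

print(dominant_freq(tone), dominant_freq(shifted))  # ~220 Hz vs ~440 Hz
```

Note the trade-off the comment mentions: the shifted signal is half as long, which is exactly the artifact that PSOLA-style methods exist to avoid.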
Voice morphing changes one voice into another. A Cambridge University research project demonstrated how recordings of one speaker could be made to sound like those of another using relatively little training data. The following are some examples:
Original Speaker 1:
Target Speaker 2:
Converted Speaker 1 to Speaker 2:
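The kind of mapping learned in such conversion systems can be sketched in miniature: given time-aligned feature frames from a source and a target speaker, fit a transform from one feature space to the other and apply it to new source speech. The sketch below is a deliberate simplification, using synthetic data and a plain least-squares linear map as a stand-in for the GMM-based regressions used in actual voice-conversion research; all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical aligned spectral features (e.g. MFCC frames) for the
# same utterances from a source and a target speaker. In a real system
# these come from parallel recordings, aligned by dynamic time warping.
n_frames, n_dims = 200, 13
source = rng.normal(size=(n_frames, n_dims))
true_map = rng.normal(size=(n_dims, n_dims))  # unknown in practice
target = source @ true_map + 0.01 * rng.normal(size=(n_frames, n_dims))

# Learn a linear feature mapping by least squares -- the simplest
# possible stand-in for a trained conversion function.
W, *_ = np.linalg.lstsq(source, target, rcond=None)

# Convert unseen source frames toward the target speaker's space.
new_frames = rng.normal(size=(10, n_dims))
converted = new_frames @ W

err = np.abs(converted - new_frames @ true_map).max()
print("max deviation from ideal mapping:", err)
```

The "relatively little training data" point shows up even here: 200 frames suffice to pin down a 13x13 map, though a real spectral mapping is nonlinear and needs correspondingly richer models.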
Similar technology was also showcased extensively at the 2009 Interspeech conference. Perhaps this will one day enable those who have lost their voice, and who do not have hours (or days) of recordings at their disposal, to have custom voices of their own to talk to their loved ones.