To prepare for my interview with Susan Gerhart I tried using text-to-speech software to read menu choices and text selections aloud. As always, I experienced the reaction that Susan, in her latest post, calls synthetic voice shock.
For those of us who don’t need to rely on synthetic voices, that reaction isn’t a problem; it’s merely a deterrent to optional use of the technology. For example, though it might be convenient to shift some material from the domain of written text to the domain of audio, the unpleasantness of synthetic voices stops me from doing that.
But the real problem, Susan explains today, is that synthetic voice shock deters people who have lost their vision, and who would benefit greatly if they could adapt to those voices. Here’s how she characterizes typical reactions:
- I cannot understand that voice!!!
- The voice is so inhuman, inexpressive, robotic, unpleasant!
- How could I possibly benefit from using anything that hard to listen to?
- If that’s how the blind read, I am definitely not ready to take that step.
She adds:
Conversely, those long experienced with screen readers and reading appliances may be surprised at these adverse reactions to the text-to-speech technology they listen to many hours a day.
How can we help people cross that chasm? Susan offers advice to four groups: “vision losers”, developers of assistive reading technologies, sighted people who are helping vision losers, and rehab trainers.
I’m in the third group. My mom’s macular degeneration is progressing, and although she’s not yet forced to rely on text-to-speech, that day may come. To those of us in this group, Susan recommends that, when evaluating applications and appliances, we bear in mind that voice quality is a separable concern, not directly tied to the capabilities of the software and hardware. And she suggests that, in order to help friends or family members, we might want to develop some familiarity with the range of available voices.
To that end, Susan has provided audio renderings of her blog posts, including four different versions of today’s post as read by Neospeech Kate, Neospeech Paul, Microsoft Mike, and Robotic UK Jane. None of the readings is pleasant to listen to. But Susan says:
I testify it takes a little patience and self-training and then you hear past these voices and your brain naturally absorbs the underlying content.
Those of us not compelled to learn how to “hear past” those voices might still want to try the experiment, in order to help friends and family members make the transition.
The voices in the demos are certainly more understandable than the Kurzweil reading machine demo I heard 30 years ago.
There are better voice solutions. ATT has technology called Natural Voices, integrated into various text-to-speech software, that sounds much better than these demos.
My company (Vocollect Healthcare Systems) develops a wearable computer and software for voice-assisted healthcare, and we get much the same reaction from people at first blush. We can explain that workers get used to the TTS because they hear it every day, all the time, but the decision-makers have a mental barrier that prevents them from believing this. Nearly every user who is not hostile to the system is fine with the TTS after getting used to it.
One of the interesting issues is an ‘uncanny valley’ for voice. Making a voice sound *too* human can be disorienting in the 5-10% of cases where it mispronounces a word or pauses unnaturally. It shocks you into remembering you’re listening to a computer.
Additionally, for applications like ours, where the user talks back to the computer, it’s useful for the user to remember there is actually a computer on the other end of the conversation; otherwise they may relax and fall into more natural speech patterns instead of the constrained dialog that’s necessary.
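Here’s a toy sketch of what a constrained dialog might look like, written in Python. This is not Vocollect’s actual system; the command vocabulary and prompts are invented for illustration. The point is simply that the recognizer accepts only a small fixed set of utterances, so the user has to stay inside the expected pattern:

```python
# Toy constrained dialog: only a small, fixed command vocabulary is
# accepted, so users can't drift into free-form natural speech.
# (Vocabulary and prompts are invented; this is not Vocollect's system.)
EXPECTED_COMMANDS = {"yes", "no", "repeat", "ready", "cancel"}

def handle_utterance(heard: str) -> str:
    """Return a confirmation for a recognized command, or re-prompt."""
    word = heard.strip().lower()
    if word in EXPECTED_COMMANDS:
        return f"confirmed: {word}"
    return "Please say one of: " + ", ".join(sorted(EXPECTED_COMMANDS))

if __name__ == "__main__":
    print(handle_utterance("Ready"))           # confirmed: ready
    print(handle_utterance("um, I guess so"))  # falls through to a re-prompt
```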
Whoops, I plead guilty to the same developer bias. I use Neospeech Kate for both screen and text reading, so that voice sounds best to me.
I added ATT Mike and ATT Crystal to the list of audio recordings for more contrast in voice. Actually, I like ATT Mike best for this recording.
Just to mention a few more parameters. Speed in these recordings is slower than a more experienced listener would use; reported reading rates run up to 800 words per minute, versus a more typical 250.
Another variable is the dictionary that governs how words and abbreviations are pronounced. These dictionaries are highly tuned for limited-vocabulary applications, as in telephony.
Also, every text reader has its own style of pausing, pronouncing, and so on. The same text, read in the same voice, will sound different in a screen reader and in a tool that converts text to mp3.
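To make those parameters concrete, here’s a minimal sketch using the cross-platform pyttsx3 Python library (my choice of tool, not one mentioned above; the abbreviation entries are invented for illustration). It shows a speaking-rate setting and a simple pronunciation dictionary applied before synthesis:

```python
import re
import pyttsx3  # cross-platform TTS wrapper over the OS speech engine

# A tiny pronunciation/abbreviation dictionary, applied before synthesis.
# These entries are invented for illustration.
PRONUNCIATIONS = {
    r"\bTTS\b": "text to speech",
    r"\bDr\.\s": "Doctor ",
    r"\bwpm\b": "words per minute",
}

def preprocess(text: str) -> str:
    """Expand abbreviations so the engine doesn't have to guess."""
    for pattern, spoken in PRONUNCIATIONS.items():
        text = re.sub(pattern, spoken, text)
    return text

engine = pyttsx3.init()
print(engine.getProperty("rate"))   # default is typically around 200 wpm
engine.setProperty("rate", 250)     # a comfortable beginner's pace
# engine.setProperty("rate", 800)   # the rates experienced listeners reportedly reach

engine.say(preprocess("Dr. Gerhart listens to TTS at 250 wpm."))
engine.runAndWait()
```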
And, even if we know the voice is nothing but a data file, we still ascribe personality, gender, and other human attributes, as reported in Nass’ fascinating “Wired for Speech” experiments.
Listening to synthetic speech is a skill. Or, from the other side, inability to master synthetic speech is, well, a disability when it comes to using audio-interfaced tools.
Thanks for the feedback,
Susan
The standard response these days is to try Mac OS X VoiceOver with the Alex voice, which is uncanny but keeps you out of the valley.
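For readers who want to sample Alex without setting up VoiceOver itself, OS X also ships a command-line speech tool, say. Here’s a minimal sketch driving it from Python; the file names are hypothetical:

```python
import subprocess

# Speak a test sentence with the Alex voice through the speakers.
subprocess.run(["say", "-v", "Alex", "Hello, this is the Alex voice."],
               check=True)

# Render a text file to an audio file instead of the speakers
# (sample.txt and sample.aiff are hypothetical file names).
subprocess.run(["say", "-v", "Alex", "-o", "sample.aiff", "-f", "sample.txt"],
               check=True)
```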
For a while, back around 2000, I got into using TTS to make mp3s of Gutenberg stuff. I can tell you that after a bit it does fade away; in the end I was actually listening to Chesterton and James, and at a certain point you don’t sense it as a voice at all.
It was strange, because I was also using Audible.com at the time, and I came to the conclusion that TTS was inferior to a good reader, but *superior* to an annoying reader (which happens a bunch on audiobooks).
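To try that Gutenberg-to-audio experiment today, a sketch along these lines would work, using the pyttsx3 library (an assumption; the commenter doesn’t say which tool they used). It renders a downloaded text to an audio file; producing an mp3 would take a separate encoding step:

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 250)  # a moderate listening pace

# chesterton.txt stands in for any plain-text Project Gutenberg download.
with open("chesterton.txt", encoding="utf-8") as f:
    text = f.read()

# save_to_file writes audio rather than speaking aloud; the container
# format depends on the platform backend, and converting the result to
# mp3 requires a separate encoder.
engine.save_to_file(text, "chesterton.wav")
engine.runAndWait()
```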
I haven’t tried this yet, but IBM Research has an “expressive” text-to-speech system on their site:
http://www.research.ibm.com/tts/