Hugh McGuire notes that Google Labs has expanded the audio indexing and search of political videos on YouTube. I checked out the examples, and guessed that this system works by doing speech-to-text conversion, then conventional indexing and search of the text. That’s feasible because even an imperfect conversion yields plenty of recognizable words for search to find.
Here’s an example of imperfect conversion that doesn’t interfere with a search for the word “health”:
spoken: and that’s true with health care. Of the estimated 47 million
transcribed: the ranks not and that how much health care the native forty seven
Now that’s the worst of the small set I sampled. Here’s a less imperfect example:
spoken: businesses liberated from high taxes and health care costs will unleash
transcribed: businesses liberated from high taxes and health care costs well I’m
And sure enough, the FAQ confirms my hunch:
Google Audio Indexing uses speech technology to transform spoken words into text and leverages the Google indexing technology to return the best results to the user.
Way back in 2002, I reviewed Fast-Talk, a product (actually, a technology demo of a licensable SDK) that took a completely different approach to audio indexing. It worked phonetically. One of my tests was of a phone interview with Tim Bray, which I recorded and which was indexed in realtime as we spoke. Here’s what happened next:
When my interview with Tim Bray was done, the first segment I looked for was the one where Bray said, “Jean Paoli spent four hours showing me XDocs.” The name “Jean Paoli” was, not surprisingly, ineffective as a search term. But “four hours” found the segment instantly, as did “fore ours” — which of course resolves to the same string of phonemes. “Zhawn Powli” also worked, illustrating what will soon become a new strategy for users of voice-aware search engines: When in doubt, spell it out phonetically. In practice, I find myself resorting to this strategy less often than I’d have expected. And it was fairly obvious when to do so. I guessed correctly that “MySQL” would not work, for example, but that “my sequel” would.
This approach doesn’t yield a transcript, but it’s so fast, efficient, and effective that I felt sure it would be in widespread use by now, and that audio indexing would be far more prevalent than it has become.
Why didn’t my prediction come true?
9 thoughts on “Why didn’t phonetic audio indexing prevail?”
My guess would be your own comment: “…this approach doesn’t yield a transcript”.
We seem (or at least /I/ seem *grin*) to find searches that appear with context more tractable, and audio/video context is simply much more difficult. It involves dragging back and forth through a media file, and even if that’s automagic, the simple act of reviewing 3-7 seconds of audio/video multiple times to find your best/correct search result seems much more involved than skimming 10 lines of text. Text, if nothing else, is efficient.
It’s so intuitive to search phonetically, especially for names, and even more so for names from a different culture/language/country. I would think a hybrid of the two would really work well. The question is how one signals which parts to index phonetically and which as text. As usual, an amazing observation!
I also thought that FastTalk or something like it would be a winner one day. I still have a FastTalk t-shirt someplace.
Maybe what’s keeping it back is the workarounds we have – “start listening at 13:42.”
> We seem (or at least /I/ seem *grin*) to
> find searches that appear with context
> more tractable
That’s certainly true. Still, it’s not as though people have had the opportunity to try phonetic systems and reject them on that basis. I can’t point to examples where the approach has even been tried.
Even without a text snippet, the raw capability to jump in an audio stream to a word you’ve searched for and found is powerful and, I would have thought, compelling.
What’s more, the phonetic approach is radically more efficient computationally. Had it prevailed, we’d have vast quantities of searchable audio now. Plus, the ability to search current material — like a convention speech that just ended — in near realtime.
It’s really hard for me to understand why these potential benefits have not been exploited.
> Maybe what’s keeping it back is the
> workarounds we have – “start listening
> at 13:42.”
Yes, although both the original FastTalk prototype and the current Google implementation do a very good job of jumping you to the quote in the audio stream.
Here’s an idea: Use the phonetic transcription as the seed for a wikipage-transcript, and let users improve it. A mostly-right transcript is trivial for a human to correct incrementally, where a full transcription (and time-coding) is a big job.
I’ve created ‘stub’ pages in wikipedia on topics that I hoped were there, only to find them a couple of months later fairly well developed. Move the ball into the “easy to do in little collaborative steps” and it’ll happen.
Using multiple phonetic mappings for the search text could probably improve recall for rare or new names and expressions. Maybe it could work for audio search the way orthographic corrections work for text search: Google suggests a correction when your search returns few results compared to a very similar word or expression.
But I just wanted to add that I see no technical reason why a purely phonetic transcription process would be so much faster or more readily available than a word-based transcription process.
Indeed, in both cases, the speech-to-text (or ASR) engine uses some word-based statistical language model to rank the transcription hypotheses. In non-technical terms: no engine (and in fact no human) is capable of transcribing speech phonetically without some knowledge of the language, and of how words usually combine.
So in either case, FastTalk or GoogleSearch, a text transcription must be part of the engine’s primitive output.
If GoogleSearch and FastTalk do not have the same availability, speed, or distribution scheme, blame business choices, not technical limitations.
Besides, I’ve seen no evidence that GoogleSearch is not real-time, and indeed if you want to transcribe and index all of each day’s audio, you’d better be faster than real-time! (Though I admit I haven’t checked how much audio GoogleSearch indexes each day.)
“It’s really hard for me to understand why these potential benefits have not been exploited.”
Great question Jon.
Possible sources from which insight might be gleaned: participants with http://www.podscope.com/ and the former http://podzinger.com/ ?
Both focused on passionate early podcast audiences. Podscope uses (used to use?) phonetic tech. Podzinger did not.
Podzinger was born at BBN, using technology developed for phone systems (possibly an overgeneralization), then spun off in an attempt to monetize the service, one way to exploit it.
It may only be a coincidence that the service that started with a phonetic approach is still standing…
About 2 years ago, I used both of these services in attempts to find podcast subject matter that was of interest. Neither was particularly effective. I recall using podzinger more often. Since not much came of it, I stopped using them…
The examples you provided with Google audio indexing appear to perform much better than I recall either of these services at that time.
FastTalk became Nexidia, which I believe is the largest audio indexing software company based on phonetic indexing.
A hybrid of word lattice and phonetic lattice indexing has been shown to perform pretty well. See Jonathan Mamou’s paper in SIGIR ’06 as well as work by Peng Yu and Frank Seide at MSR Asia.
With respect to text, my recent master’s thesis (not published yet…) focused on the utility of transcript snippets in speech retrieval compared to relevance visualizations. The not-so-surprising result was that users preferred text to be present in the search interface even though they performed just as well without it. They wanted to see the content, even if the snippets did not improve search performance.