Hugh McGuire notes that Google Labs has expanded the audio indexing and search of political videos on YouTube. I checked out the examples, and guessed that this system works by doing speech-to-text conversion, then conventional indexing and search of the text. That’s feasible because even an imperfect conversion yields plenty of recognizable words for search to find.
Here’s an example of imperfect conversion that doesn’t interfere with a search for the word “health”:
spoken: and that’s true with health care. Of the estimated 47 million
transcribed: the ranks not and that how much health care the native forty seven
Now that’s the worst of the small set I sampled. Here’s a less imperfect example:
spoken: businesses liberated from high taxes and health care costs will unleash
transcribed: businesses liberated from high taxes and health care costs well I’m
And sure enough, the FAQ confirms my hunch:
Google Audio Indexing uses speech technology to transform spoken words into text and leverages the Google indexing technology to return the best results to the user.
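To make that concrete, here is a minimal sketch of the "transcribe, then index the text" idea, applied to the two imperfect transcriptions quoted above. It is only an illustration: the segment ids and the helper names (tokenize, build_index, search) are my own assumptions, and nothing here reflects how Google actually implements its indexing.

```python
import re
from collections import defaultdict

# Machine transcriptions of the two segments quoted above, errors included.
# The segment ids are made up for this sketch.
transcripts = {
    "segment-1": "the ranks not and that how much health care the native forty seven",
    "segment-2": "businesses liberated from high taxes and health care costs well I'm",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(docs):
    """Build an inverted index: word -> set of segment ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the segment ids whose transcript contains every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(transcripts)
print(sorted(search(index, "health")))   # both segments survive the bad transcription
print(sorted(search(index, "million")))  # this spoken word was lost, so nothing matches
```

The second query shows the limit of the approach: a word that the recognizer drops or mangles can't be found, however good the text index is.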
Way back in 2002, I reviewed Fast-Talk, a product (actually, a technology demo of a licensable SDK) that took a completely different approach to audio indexing. It worked phonetically. One of my tests was of a phone interview with Tim Bray, which I recorded and which was indexed in real time as we spoke. Here’s what happened next:
When my interview with Tim Bray was done, the first segment I looked for was the one where Bray said, “Jean Paoli spent four hours showing me XDocs.” The name “Jean Paoli” was, not surprisingly, ineffective as a search term. But “four hours” found the segment instantly, as did “fore ours” — which of course resolves to the same string of phonemes. “Zhawn Powli” also worked, illustrating what will soon become a new strategy for users of voice-aware search engines: When in doubt, spell it out phonetically. In practice, I find myself resorting to this strategy less often than I’d have expected. And it was fairly obvious when to do so. I guessed correctly that “MySQL” would not work, for example, but that “my sequel” would.
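Here is a rough sketch of why "four hours" and "fore ours" find the same spot. It assumes the audio has already been reduced to a stream of phonemes, and it uses a tiny hand-written pronunciation table, loosely ARPAbet-style, that I made up for the illustration. Fast-Talk's real engine worked from the audio itself and its internals aren't described here, so treat the names (PHONEMES, to_phonemes, find) as hypothetical.

```python
# Toy pronunciation table (hypothetical, loosely ARPAbet-style). A real
# system would derive phonemes from the audio and from the query text.
PHONEMES = {
    "spent": ["S", "P", "EH", "N", "T"],
    "four": ["F", "AO", "R"],
    "fore": ["F", "AO", "R"],
    "hours": ["AW", "ER", "Z"],
    "ours": ["AW", "ER", "Z"],
    "showing": ["SH", "OW", "IH", "NG"],
    "me": ["M", "IY"],
    "my": ["M", "AY"],
    "sequel": ["S", "IY", "K", "W", "AH", "L"],
}

def to_phonemes(text):
    """Convert text to a flat phoneme list, or None if any word is unknown."""
    phones = []
    for word in text.lower().split():
        if word not in PHONEMES:
            return None
        phones.extend(PHONEMES[word])
    return phones

def find(audio_phones, query):
    """Return the phoneme offset of the first match for the query, or -1."""
    q = to_phonemes(query)
    if not q:
        return -1
    for i in range(len(audio_phones) - len(q) + 1):
        if audio_phones[i:i + len(q)] == q:
            return i
    return -1

# Stand-in for the phoneme stream an indexer would extract from the recording.
audio = to_phonemes("spent four hours showing me")

print(find(audio, "four hours"))  # matches
print(find(audio, "fore ours"))   # same phonemes, so the same offset
print(find(audio, "MySQL"))       # no pronunciation in the toy table, so no match
print(to_phonemes("my sequel"))   # spelled out phonetically, it converts fine
```

In this sketch an unknown spelling simply fails to convert, which is a crude model of why spelling a term out phonetically helps.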
This approach doesn’t yield a transcript, but it’s so fast, efficient, and effective that I felt sure it would be in widespread use by now, and that audio indexing would be far more prevalent than it has become.
Why didn’t my prediction come true?