For this week’s Perspectives show I spoke with Scott Prevost, general manager and product director for Powerset, the semantic search engine that was recently acquired by Microsoft, and that can currently be seen in action working with the combined contents of Wikipedia and Freebase.
In our interview, Scott discusses the natural language engine — 30 years in the making — that Powerset acquired from PARC (formerly Xerox PARC). But he also makes clear that the use of that engine is part of a blended strategy that also takes advantage of statistical and machine learning techniques.
If you try Powerset, you find that your mileage will vary depending on a lot of factors. It’s clearly a work in progress, as all of natural language technology has been since, really, the dawn of computing. But the approach that Scott describes here sounds like a flexible and pragmatic way to leverage the technology as it continues to evolve.
Here’s one evocative use of Powerset:
The first result, Dreams from My Father, comes from Freebase, where that book is one of two items in the Works Written slot of Obama’s Person record. In this case, there’s no need to discover structure because Freebase has already encoded it. But the natural language technology is being used in a complementary way, to map between a natural form of the question and the corresponding Freebase query.
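To make that mapping concrete, here is a minimal Python sketch of turning a natural form of a question into a lookup against a structured record. Everything in it is an assumption for illustration: the question wordings, the regular expressions, and the in-memory RECORDS table are hypothetical stand-ins, not Powerset's method or Freebase's actual query interface. Only the slot name and the one book title come from the example above.

```python
import re

# Toy stand-in for a structured record. The slot name follows the post;
# only the one title the post names is listed here.
RECORDS = {
    "barack obama": {"Works Written": ["Dreams from My Father"]},
}

# Hypothetical question patterns, each paired with the slot it should read.
PATTERNS = [
    (re.compile(r"^(?:what )?books? (?:did|has) (?P<name>.+?) (?:write|written)\??$", re.I),
     "Works Written"),
    (re.compile(r"^books (?:written )?by (?P<name>.+?)\??$", re.I),
     "Works Written"),
]


def to_query(question):
    """Return (entity, slot) for a recognized question, else None."""
    text = question.strip()
    for pattern, slot in PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("name").lower(), slot
    return None


def answer(question):
    """Look up the slot named by the question in the toy record table."""
    query = to_query(question)
    if query is None:
        return []
    entity, slot = query
    return RECORDS.get(entity, {}).get(slot, [])
```

With this sketch, answer("What books did Barack Obama write?") and answer("books by Barack Obama") both resolve to the same slot, which is the point: several natural phrasings collapse to one structured query.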
To see a glimpse of what Powerset’s linguistic analysis of Wikipedia can do, try this query:
Here, Powerset uses its semantic representation of my Wikipedia page to extract two “Factz” based on one of the linguistic patterns it uses. In this case, the pattern is subject / verb / object, and two Factz are adduced. One is bogus:
udell authored advisor
And the other is valid:
udell authored Practical Internet Groupware
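One way to picture these Factz is as subject / verb / object triples. The Python sketch below parses the two lines above into triples and applies a crude plausibility filter. The Fact type, the verb list, and the capitalization heuristic are all hypothetical and are my own illustration; the post does not describe how Powerset represents or filters its Factz.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    verb: str
    obj: str


def parse_fact(line, verbs):
    """Split 'subject verb object' text on the first known verb."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        # The verb needs at least one token on each side of it.
        if token.lower() in verbs and 0 < i < len(tokens) - 1:
            return Fact(" ".join(tokens[:i]), token.lower(), " ".join(tokens[i + 1:]))
    return None


def plausible(fact):
    """Assumed heuristic: the object of 'authored' is a title, so it is capitalized."""
    return fact.verb != "authored" or fact.obj[:1].isupper()


VERBS = {"authored", "chaired"}
lines = ["udell authored advisor", "udell authored Practical Internet Groupware"]
facts = [parse_fact(line, VERBS) for line in lines]
kept = [fact for fact in facts if plausible(fact)]
```

Here kept holds only the Practical Internet Groupware triple. A real system would need something much stronger than a capitalization check, but the example shows why a single pattern can yield both a valid and a bogus result.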
There isn’t much in Wikipedia about me, but if you pick a more notable person — say, Tim Bray — the list of Factz includes:
chaired Atompub Working group
Missing from this list, by the way, is:
live-engineered Electric Eel Shock
OK, I’m just kidding about that. Electric Eel Shock’s live engineer was another Tim Bray, which points out the need, as Scott and I discussed briefly, for name/entity recognition and disambiguation.
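As a rough illustration of what disambiguation involves, the Python sketch below picks between two people with the same name by counting the words a mention shares with each candidate's known context. The candidate labels, the context word sets, and the overlap scoring are hypothetical; this is a toy under stated assumptions, not a description of any production technique.

```python
# Hypothetical context words for two people who share one name.
CANDIDATES = {
    "Tim Bray (Atompub chair)": {"atompub", "working", "group", "chaired"},
    "Tim Bray (live engineer)": {"electric", "eel", "shock", "live", "engineer"},
}


def disambiguate(mention_context, candidates=CANDIDATES):
    """Return the candidate whose context shares the most words with the mention."""
    words = set(mention_context.lower().split())
    scores = {label: len(words & context) for label, context in candidates.items()}
    best = max(scores, key=scores.get)
    # With no overlap at all, decline to guess.
    return best if scores[best] > 0 else None
```

Calling disambiguate("live-engineered Electric Eel Shock") selects the live engineer, while disambiguate("chaired Atompub Working group") selects the Atompub chair. A mention with no shared words returns None instead of a guess.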
I’ve always been fascinated by the ongoing effort to understand and produce natural language using computers and software. Fifty years ago, early computer scientists thought they’d lick the problem in five years. Now many people believe it may never happen. I think it will, but gradually over a long time. And as Scott Prevost points out, it’s just one tool in the kit, and should be used appropriately, in concert with other tools.
5 thoughts on “Scott Prevost explains Powerset’s approach to semantic search”
It’s really not clear that this approach offers much advantage. Try searching jim smith football in Powerset/Live and Google. The Google result does a reasonable job of finding and presenting the Jim Smith from UK football and the Jim Smith from US football. The Powerset/Live version is a bit of a mix-up. I’m concerned about the risks of clever search producing mashed-up but misleading results as in http://nelh.blogspot.com/2008/09/does-anyone-actually-want-semantic.html
That’s an interesting example. In general, to compare directly, you’ll want to restrict Google using site:wikipedia.org, so:
site:wikipedia.org jim smith football
Although in this case, it doesn’t matter, because the unrestricted results are the same: the Wikipedia pages for the American and English players are results 1 and 2.
In Powerset, they’re 1 and 3.
Neither, of course, explicitly disambiguates the two players.
There’s a subtler disambiguation to be made as well. Both players’ pages include ‘Birmingham’, but that’s also ambiguous: one is in Alabama and the other in England.
Here’s how I see it. Making linguistic sense of texts is a long-term challenge. Years ago I interviewed one of the early machine translation (MT) researchers, who said: “It’s more perspiration than inspiration”.
The trick will be to use this gradually evolving technology wherever it is appropriate and effective, and to avoid it where it is not. That’s going to be a real balancing act, which is part of what makes the whole thing so interesting.