Proximity search in Google and Live?

I recently added a specialized search to help curators working with the elmcity project find recurring events in their communities. It’s helpful, but would be much more helpful if it produced results only when the two searched-for phrases occur in close proximity.

The phrase pairs look like this:

"every thursday" "keene nh"
"first friday" "keene nh"

I’d like to limit results to pages where these pairs occur within, say, 100 words of one another. My search robot uses both Google and Live because, well, why wouldn’t you want the best of both worlds? But as far as I know, neither supports a proximity syntax like:

"every thursday" within 100 "keene nh"

I only need to run my search robot occasionally, there are only thousands of pages per calendar hub, and there are only a dozen hubs so far. So for now it’s feasible to use brute force. I can, and likely will, fetch all the pages found by the two engines, analyze them, and reject those that fail my proximity test.
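For the record, here is a minimal sketch of what that proximity test might look like, assuming the page has already been fetched and reduced to plain text. The function names (`tokenize`, `phrase_positions`, `within`) and the decision to count only the words between the two phrases are my own illustrative choices, not anything either engine provides.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_positions(words, phrase):
    """Return the start indexes where the phrase's words occur consecutively."""
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return []
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]

def within(text, phrase_a, phrase_b, max_words=100):
    """True if the two phrases occur within max_words of each other, in either order."""
    words = tokenize(text)
    len_a = len(tokenize(phrase_a))
    len_b = len(tokenize(phrase_b))
    for a in phrase_positions(words, phrase_a):
        for b in phrase_positions(words, phrase_b):
            # Count the words between the end of the earlier phrase
            # and the start of the later one.
            if a <= b:
                gap = b - (a + len_a)
            else:
                gap = a - (b + len_b)
            if gap <= max_words:
                return True
    return False

page = "Contra dance every Thursday at the town hall in Keene NH"
print(within(page, "every thursday", "keene nh", max_words=100))
print(within(page, "every thursday", "keene nh", max_words=4))
```

In the sample sentence five words separate the two phrases, so the first call passes and the second does not. Fetching the pages and stripping their markup is left out here; that part is where the real brute force goes.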

But since I am virtuously lazy, I just thought I’d ask. Are there undocumented features for either or both of these engines that I’m missing?

12 Comments

  1. Google has a * operator for any word… can you do something like “every thursday * * * keene nh”? Or does that make too many search strings?

  2. Not exactly what you want, but perhaps useful: Google allows the asterisk wildcard to match words in an exact string search.

    So, while “apples oranges” will only find pages containing those words consecutively, “apples * oranges” will match “apples and oranges”, and “apples * * oranges” will match “apples, pears, and oranges”

    It would be a lot of trouble to get from this feature to a general proximity rule that goes up to 100 words and ignores order, but it could be handy to combine a few terms of the above if you can live with a tighter proximity and/or know what order the tokens will appear in.

  3. “It would be a lot of trouble to get from this feature to a general proximity rule that goes up to 100 words and ignores order”

    Indeed. If it becomes important enough — and I have a hunch that it might — I’ll just fetch the pages and do it the hard way.

    But laziness already paid off once today. Maybe if I wait a bit longer, it’ll pay off twice :-)

  4. While I work neither at Google nor at MS/Live, I can hazard a fairly confident guess that proximity operators are already built into the “secret sauce” of these search engine algorithms. You know, the hundreds of features that search engines say they use to rank documents. By default, and without you having to specify it, I believe that proximity is one of those features.

    Ceteris paribus, when you type the query [“every thursday” “keene nh”] into one of these engines, the documents that you get back will by default be ranked in nearest-proximity order.

    I know that’s not exactly what you want; you want a boolean filter, rather than a fuzzy ranking. But if you cut off the results at the top n, anyway, you’ll probably get something very close to what you’re seeking.

    I also found this, FWIW: http://www.blueroom.com/google/search-proximity.htm

    This topic, however, is an ongoing point of contention among some in the information retrieval research and industry communities. It’s the notion of explicit vs. implicit capability. Google (I’m 98% sure) really does give you proximity-sorted results. But it is an implicit operation with no transparency. The user cannot tell for sure whether it is happening, and to what degree. Even though it really is there. So is this the best way to design a search engine? Or is it better to give explicit interfaces to the underlying algorithms, so that the user can explicitly require proximity?

    See also http://irgupf.com/2009/03/09/exploration-and-explanation/

  5. Whups, I had this window open too long in my browser and didn’t see all the other comments that came through. Good to know that Live and Exalead do explicitly support the prox operator. I think Google still doesn’t.

    The academic, research search engine (Inquery) that I worked with in grad school, many years ago, had both ordered and unordered proximity operators. Even Live and Exalead only provide the unordered version. Inquery was developed 15 years ago; sometimes I wonder why we still don’t have that functionality on the web.

  6. Well, looky there, the link jeremy posted for blueroom has a proximity search, with downloadable Perl code that implements it. It uses the asterisk operator, as Matt and I suggested. All you would have to do is download http://www.staggernation.com/ga/source/gaps_cgi.txt, put it on your own server, and modify $max_distance to 100.

    However, you may run into issues with the 1000 query per day limit. It looks like it would run 200 searches to do so; it doesn’t appear to attempt to use the OR operator at all to consolidate the searches.

    –Kevin

  7. > Google (I’m 98% sure) really does give
    > you proximity-sorted results.

    This is (fairly) easy to test, right?

    > If you cut off the results at the top,
    > you’ll probably get something very close
    > to what you’re seeking.

    That’s a great suggestion. Thanks!

  8. “Inquery was developed 15 years ago; sometimes I wonder why we still don’t have that functionality on the web.”

    Funny, isn’t it? We in the tech biz cultivate the impression of relentless fast-paced innovation. But a lot of things actually cook very slowly.

  9. I think prox search is already well done. Perhaps an explicit operator is missing because it conflicts with their built-in prox search. Or perhaps it conflicts with hidden marketing.
