Talking with Stefano Mazzocchi about reconciling web naming systems

When Stefano Mazzocchi saw my posts on webscale identifiers [1, 2], he pointed me to some recent work he and others have been doing at Metaweb. At ids.freebaseapps.com you can find sets of different web identifiers that refer to the same things. So, for example:

Apple Inc.
versus
Apple Records

Each of these views collects identifiers from different sources. For Apple Inc. they include:

The NYTimes: topics.nytimes.com/top/news/business/companies/apple_computer_inc/

Wikipedia: wikipedia.org/wiki/Apple_Computer

Open Library: openlibrary.org/a/OL2669993A/Inc._Apple_Computer
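
To make the idea of an identifier set concrete, here is a minimal sketch in Python. It is not how Metaweb stores or serves these sets; the names IDENTIFIER_SETS, build_reverse_index, and same_thing are hypothetical, and the only data used is the three Apple Inc. identifiers listed above. The point is just that once identifiers are grouped by the thing they name, you can ask whether two of them refer to the same thing.

```python
# Hypothetical sketch: one topic and the identifiers listed above for it.
# The structure and names are assumptions for illustration only.
IDENTIFIER_SETS = {
    "Apple Inc.": {
        "NYTimes": "topics.nytimes.com/top/news/business/companies/apple_computer_inc/",
        "Wikipedia": "wikipedia.org/wiki/Apple_Computer",
        "Open Library": "openlibrary.org/a/OL2669993A/Inc._Apple_Computer",
    },
}


def build_reverse_index(identifier_sets):
    """Map each identifier back to the topic it names."""
    index = {}
    for topic, sources in identifier_sets.items():
        for identifier in sources.values():
            index[identifier] = topic
    return index


def same_thing(index, id_a, id_b):
    """True when both identifiers are known and resolve to the same topic."""
    topic_a = index.get(id_a)
    return topic_a is not None and topic_a == index.get(id_b)


REVERSE_INDEX = build_reverse_index(IDENTIFIER_SETS)
```

With that index, same_thing(REVERSE_INDEX, a, b) answers whether two identifiers have been reconciled to one topic, and an identifier nobody has reconciled yet simply comes back as unknown.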

On this week’s Innovators show Stefano joins me to discuss efforts underway at Metaweb to reconcile many different web naming systems and activate connections among them.

Meanwhile my recent guest Kingsley Idehen is demonstrating a similar kind of name reconciliation at bbc.openlinksw.com. At this URL, for example, you can see canonical identifiers for Michael Jackson from the BBC’s own namespace and others including DBpedia and OpenCyc.

I’m not quite sure what to make of all this. But my spidey sense is telling me to pay attention, so I am.


Related:

  1. Semantic web mashups for the rest of us

  2. A conversation with Stefano Mazzocchi about Cocoon and SIMILE

  3. Motivating people to write the semantic web: A conversation with David Huynh about Parallax

  4. Talking with Kingsley Idehen about mastering your own search index

7 Comments

  1. Related to this conversation of canonical identifiers, what do you think of Google’s new Places ids, like the following:

    http://maps.google.com/places/us/cambridge/brattle-st/52/-burdick-chocolate-cafe

    Clearly an attempt to encode geolocation and common names in a single identifier, yet also clearly somewhat fragile in both respects. Google will be attempting to aggregate a *whole lot* of other identifier mappings in these pages, as well, as they pull in local content aggregators, review sites, etc.

    1. It’s funny you mention this. Just today my daughter told me that she’d recently been to Burdick in Cambridge, and thought it was funny that most visitors to that shop probably don’t know that the original shop is in Walpole, NH: http://maps.google.com/maps/place?cid=17313875298298739692&q=burdick+chocolate. (Although that one doesn’t seem to rate its own places URL.)

      Should http://maps.google.com/places/us/burdick-chocolate-cafe produce the same two results as http://maps.google.com/maps?q=burdick+chocolate+cafe ?

      If it did, would the two forms mean the same thing or different things?

      1. If http://maps.google.com/places/us/burdick-chocolate-cafe has the same content as http://maps.google.com/maps?q=burdick+chocolate+cafe then I’d say they mean (are different representations of) the same thing.

        If there are two results, both http://maps.google.com/places/us/burdick-chocolate-cafe and http://maps.google.com/maps?q=burdick+chocolate+cafe could return a “300 Multiple Choices” response, with a listing containing unique URLs for each of the shops listed – would that be appropriate?

  2. The following comment from Stefano Bertolo, via email, is quoted in full here with his permission.


    I wanted to bring to Jon’s attention (I have discussed this with Stefano separately a few weeks ago), that in addition to Metaweb there’s a bunch of smart people doing work in this space in the EU.

    In particular, I would like to flag the OKKAM project

    http://www.okkam.org/

    for which I am responsible as the project officer in charge from the European Commission, its funding agency.

    Jon, if you want to learn more about what OKKAM does and how far they’ve got, feel free to get in touch with the project coordinator Paolo Bouquet (in copy) from the University of Trento.

    In his presentation Stefano makes an interesting point (akin to the ‘curse of dimensionality’ problem that has been long recognized in machine learning): the more distinctions you introduce in the type system of web data links, the higher the chance that data ships will pass each other in the night.
    In this scenario, curation can be seen as the process of setting up a form of traffic control that will identify identities and resolve them as they arise and are detected (up to the degree of precision, coverage, and reliability needed by various applications; this is in itself a gradient worth exploring).

    In this space as well there is a lot of good work going on in the EU. The teams whose results I am most familiar with are the Silk team from the Free University in Berlin

    http://www4.wiwiss.fu-berlin.de/bizer/silk/

    Soeren Auer’s team in Leipzig

    http://www.informatik.uni-leipzig.de/~auer/

    and Giovanni Tummarello’s team at DERI

    http://www.deri.ie/about/team/member/giovanni_tummarello/

    Chris, Soeren and Giovanni are also in copy.

    Finally, my two, untutored cents.

    If I were tasked to work on data reconciliation, I think I would set up the problem as an evolutionary problem.

    I would set up data linkers as a population of data-crawler + identity evaluators (“is A = B?” is a binary classification problem to which various ensemble methods can be applied). I would then let this population of linkers evolve according to a fitness function whose definition could be crowdsourced by outfits such as

    https://www.mturk.com/mturk/welcome

    or

    http://doloreslabs.com/

    or

    http://www.gwap.com/

    (which was one of the tactics mentioned by Stefano in his interview).

    Bottom line: considering that this is a global problem with global benefits when solved, it is useful for people to be informed as to what pieces are being worked on where.
