Freebase Gridworks: A power tool for data scrubbers

I’ve had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] about the data magic they performed at MIT’s Project Simile and now perform at Metaweb. If you’re somebody who values clean data and has wrestled with the dirty stuff, these screencasts about a forthcoming product called Freebase Gridworks will make you weep with joy.

There’s one by David, and another by Stefano. Using common public datasets about food, international disasters, and US government contracts, they fly through a series of transformations that:

  • Merge similar names using a host of methods:
    • Automatic title-casing
    • A rich expression language
    • Analysis of “edit distance” between similar phrases, using several clustering algorithms
  • Split multi-valued facets
  • Create new facets (e.g., a year column from a data column)
  • Morph linear scales to log scales where appropriate

It’s all live, undoable, and fully instrumented, by which I mean that every transformation updates the counts of the values in each facet, and displays histograms of the new distribution of values — along with sliders for selecting and focusing on subsets.

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.