I’ve had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] about the data magic they performed at MIT’s Project Simile and now perform at Metaweb. If you’re somebody who values clean data and has wrestled with the dirty stuff, these screencasts about a forthcoming product called Freebase Gridworks will make you weep with joy.
There’s one by David, and another by Stefano. Using common public datasets about food, international disasters, and US government contracts, they fly through a series of transformations that:
- Merge similar names using a host of methods:
  - Automatic title-casing
  - A rich expression language
  - Analysis of “edit distance” between similar phrases, using several clustering algorithms
- Split multi-valued facets
- Create new facets (e.g., a year column from a date column)
- Morph linear scales to log scales where appropriate
It’s all live, undoable, and fully instrumented, by which I mean that every transformation updates the counts of the values in each facet, and displays histograms of the new distribution of values — along with sliders for selecting and focusing on subsets.
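To make the transformations listed above concrete, here is a minimal Python sketch of a few of them. It is not Gridworks code and makes no claim about how Gridworks works internally. The function names, the greedy clustering strategy, the `;` separator, and the ISO date format are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from datetime import date


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cluster_names(names, max_distance=2):
    """Hypothetical greedy clustering: title-case each name, then put it in
    the first cluster whose first member is within max_distance edits."""
    clusters = []
    for name in names:
        key = name.strip().title()  # automatic title-casing
        for cluster in clusters:
            if levenshtein(key, cluster[0]) <= max_distance:
                cluster.append(key)
                break
        else:
            clusters.append([key])
    return clusters


def split_multi_valued(cell, sep=";"):
    """Split a multi-valued cell into trimmed, non-empty values."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def year_of(iso_date):
    """Derive a year value from a date string (assumes YYYY-MM-DD)."""
    return date.fromisoformat(iso_date).year


def log_scale(value):
    """Map a positive value from a linear scale to a base-10 log scale."""
    return math.log10(value)


def facet_counts(values):
    """Count occurrences of each value, as a facet display would."""
    return Counter(values)
```

Calling `cluster_names(["ACME corp", "Acme Corp", "Acme Corp.", "Globex"])` groups the three Acme variants together and leaves Globex in its own cluster. Running `facet_counts` before and after a merge shows how the counts for a facet shift as names are consolidated.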
As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.
March 26, 2010 at 9:42 am
“As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it.”
Please print that on a t-shirt and mail to:
1330 New Hampshire Ave NW
Washington, DC 20036
MyType has been getting its hands dirty with messy data for a few months now. When is Freebase Gridworks forthcoming? (I haven’t had a chance to watch the vids yet.)
March 26, 2010 at 11:11 am
[...] Freebase Gridworks would have been amazingly useful at Datamonitor. Really user-friendly ways of cleaning up large quantities of grid-shaped data. [...]
March 26, 2010 at 12:56 pm
[...] Freebase Gridworks: A power tool for data scrubbers « Jon Udell – I’ve been a fan of Freebase, its Parallax, and SIMILE, but I had no idea that they were all created by the same couple of people. [...]
March 26, 2010 at 9:40 pm
Jon, have any of the many tools like this you’ve seen convinced you that they will see meaningful use?
I’m just not sure the ‘average joe’ is able to do what the expert users who present the screencast can. There’s the data cleaning challenge, the question-posing challenge, the tool-learning challenge … in many cases I think we need someone who can interpret the results we ask for. I’m thinking of NYTimes infographics here.
March 27, 2010 at 12:56 pm
I just watched both screencasts: wow! I’ve needed this tool so many times over the past few years (both while working for the Guardian newspaper and for my own personal projects). I can’t wait.
Neil: Gridworks doesn’t need to work for the average joe – it needs to work for expert and semi-expert users (like journalists) who work with this kind of data. And if those users are sharing their results, a dataset only needs to be cleaned up once.
You can bet newspaper infographics departments will be all over this once it’s released.
March 28, 2010 at 9:02 am
[...] found out about Freebase Gridworks through a post by Jon Udell. In the post, Jon refers to two screencasts on this yet unreleased product. In the Freebase blog [...]
March 28, 2010 at 10:31 pm
[...] By eduprobe Leave a Comment Categories: data Gridworks sounds very exciting. Check out the screencasts linked from Jon Udell’s [...]
March 29, 2010 at 9:03 am
[...] Freebase Gridworks: A power tool for data scrubbers « Jon Udell RT @mattmcalister: cleanup and analyze #spreadsheet #data with gridworks: http://bit.ly/bSqSBX #tools (tags: tools data spreadsheet via:packrati.us) [...]
March 30, 2010 at 8:05 am
[...] Freebase Gridworks: A power tool for data scrubbers « Jon Udell Kevin: Jon Udell writes about a project from the team that worked on MIT's Project Simile and now works at Metaweb. Jon writes: "As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess." They are building a system that will clean up the data, especially the metadata on datasets. This will be a godsend for anyone using messy datasets: it will speed up merging and also help with the creation of new 'facets' (e.g., a year column from a date column). (tags: data gridworks freebase information metaweb datamining visualization) [...]
April 7, 2010 at 4:34 pm
David Huynh will be giving a hands-on presentation of Gridworks at the SF Freebase meetup next week… so if anyone here really wants to see it in person, come by!
http://www.meetup.com/sf-freebase/calendar/12845548/
April 19, 2010 at 1:28 pm
[...] other piece of this puzzle is Freebase Gridworks, which I’m testing in pre-release. The exercise I’ll describe here is really a [...]