Freebase Gridworks: A power tool for data scrubbers

26 Mar 2010 ~ Jon Udell

I’ve had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] about the data magic they performed at MIT’s Project Simile and now perform at Metaweb. If you’re somebody who values clean data and has wrestled with the dirty stuff, these screencasts about a forthcoming product called Freebase Gridworks will make you weep with joy.

There’s one by David, and another by Stefano. Using common public datasets about food, international disasters, and US government contracts, they fly through a series of transformations that:

- Merge similar names using a host of methods:
  - Automatic title-casing
  - A rich expression language
  - Analysis of “edit distance” between similar phrases, using several clustering algorithms
- Split multi-valued facets
- Create new facets (e.g., a year column from a date column)
- Morph linear scales to log scales where appropriate
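To make those transformations concrete, here is a minimal Python sketch of the same kinds of operations. It is not Gridworks code and makes no claim about how Gridworks implements them: the function names, the greedy clustering strategy, and the default distance threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from datetime import date


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cluster_names(names, max_distance=2):
    """Greedy clustering (an assumed strategy): a name joins the first
    cluster whose first member is within max_distance edits, ignoring case."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if edit_distance(name.lower(), cluster[0].lower()) <= max_distance:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def merge_names(names, max_distance=2):
    """Map each name to the title-cased most common spelling in its cluster."""
    mapping = {}
    for cluster in cluster_names(names, max_distance):
        canonical = Counter(cluster).most_common(1)[0][0].title()
        for name in cluster:
            mapping[name] = canonical
    return mapping


def split_multi_valued(cell, sep=","):
    """Split one cell holding several values into a list of trimmed values."""
    return [part.strip() for part in cell.split(sep) if part.strip()]


def year_of(d: date) -> int:
    """Derive a year value from a date value."""
    return d.year


def log_scale(values):
    """Re-express positive values on a base-10 log scale."""
    return [math.log10(v) for v in values if v > 0]
```

Calling `merge_names(["Kraft Foods", "Kraft Foods", "kraft foods", "Kraft Food", "Nestle"])` maps the three Kraft variants to a single spelling and leaves "Nestle" alone. A real tool would offer several clustering methods and let you review each proposed merge before applying it.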

It’s all live, undoable, and fully instrumented, by which I mean that every transformation updates the counts of the values in each facet, and displays histograms of the new distribution of values — along with sliders for selecting and focusing on subsets.
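The idea of facet counts that refresh after every transformation can be sketched in a few lines of Python. This is only an illustration under assumed names (`facet_counts`, `histogram`, `select_range`) and an assumed row format of plain dictionaries; it does not reflect Gridworks internals.

```python
from collections import Counter


def facet_counts(rows, column):
    """Count how many rows carry each distinct value in the given column."""
    return Counter(row[column] for row in rows)


def histogram(values, bin_width):
    """Bucket numeric values into fixed-width bins, keyed by each bin's start."""
    return Counter((v // bin_width) * bin_width for v in values)


def select_range(rows, column, low, high):
    """Keep rows whose numeric column lies in [low, high], like a slider would."""
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical sample rows, invented for this sketch.
rows = [
    {"type": "Flood", "year": 1998},
    {"type": "flood", "year": 2004},
    {"type": "Storm", "year": 2005},
]

before = facet_counts(rows, "type")

# Apply a transformation (title-casing), then recompute the facet.
cleaned = [{**row, "type": row["type"].title()} for row in rows]
after = facet_counts(cleaned, "type")
```

Here `before` treats "Flood" and "flood" as two separate values, while `after` counts them as one. Recomputing the counts after each edit is what makes the effect of a cleanup step visible right away.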

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.

Published by Jon Udell


17 thoughts on “Freebase Gridworks: A power tool for data scrubbers”

Tim Koelkebeck says:

26 Mar 2010 at 9:42 am

“As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it.”

Please print that on a t-shirt and mail to:

1330 New Hampshire Ave NW
Washington, DC 20036

MyType has been getting its hands dirty with messy data for a few months now. When is Freebase Gridworks forthcoming? (I haven’t had a chance to watch the vids yet.)

Pingback: Gridworks « Rage on Omnipotent
Pingback: Puzzlepieces – Freebase Gridworks: A power tool for data scrubbers « Jon Udell (March 26, 2010)
Neil says:

26 Mar 2010 at 9:40 pm

Jon, have any of the many tools like this you’ve seen convinced you that they will see meaningful use?

I’m just not sure the ‘average joe’ is able to do what the expert users who present the screencast can. There’s the data cleaning challenge, the question-posing challenge, the tool-learning challenge … in many cases I think we need someone who can interpret the results we ask for. I’m thinking of NYTimes infographics here.

Pingback: Daily Digest for March 26th at dandube.com
Simon Willison says:

27 Mar 2010 at 12:56 pm

I just watched both screencasts: wow! I’ve needed this tool so many times over the past few years (both while working for the Guardian newspaper and for my own personal projects). I can’t wait.

Neil: Gridworks doesn’t need to work for the average joe – it needs to work for expert and semi-expert users (like journalists) who work with this kind of data. And if those users are sharing their results, a dataset only needs to be cleaned up once.

You can bet newspaper infographics departments will be all over this once it’s released.

Pingback: Freebase Gridworks: The data curation tool
Pingback: links for 2010-03-28 « links and tweets
Pingback: Freebase Gridworks « Open Analysis
Pingback: links for 2010-03-29 « Onlinejournalismtest's Blog
Pingback: links for 2010-03-30
Alec Flett says:

7 Apr 2010 at 4:34 pm

David Huynh will be giving a hands-on presentation of Gridworks at the SF Freebase meetup next week… so if anyone here really wants to see it in person, come by!

http://www.meetup.com/sf-freebase/calendar/12845548/

Pingback: PowerPivot + Gridworks = Wow! « Jon Udell
Pingback: More curation tools
Pingback: Location-tagged events in elmcity hubs « Jon Udell
