PowerPivot + Gridworks = Wow!

While reading Jonathan Safran Foer’s Eating Animals I got to wondering about global and national trends in the production of meat and fish. He mentions, for example, that US chicken production is way up over the past few decades. How do we compare to other countries? Here’s how I answered that question:

The screenshot is of Excel 2010 augmented by PowerPivot, the business intelligence add-in that’s the subject of last week’s Innovators show with John Hancock.

Using the same spreadsheet, I asked and answered some questions about fish production:

What’s the worldwide trend for capture versus aquaculture?

How much fish are certain countries capturing?

On a per capita basis?

How much fish are certain countries growing?

On a per capita basis?

The book raises very different kinds of questions. How should we treat the animals we eat? How much land should we use to raise crops that feed the animals we eat, instead of raising crops that feed people directly? We won’t find the answers to these kinds of questions in spreadsheets. But what I hope we will find — in spreadsheets, in linked databases, in data visualizations — is a framework that will ground our discussion of these and many other issues.

In order to get there, a number of puzzle pieces will have to fall into place. The exercise that yielded this particular spreadsheet led me to explore two that I want to discuss. One is PowerPivot. It’s a tool that comes from the world of business intelligence, but that I think will appeal much more widely as various sources of public data come online, and as various kinds of people realize that they want to analyze that data.

The other piece of this puzzle is Freebase Gridworks, which I’m testing in pre-release. The exercise I’ll describe here is really a collaboration involving Excel, PowerPivot, and Gridworks, in a style that I think will become very common.

My starting point, in this case, was data on fish and meat production from the Food and Agriculture Organization, via a set of OData feeds in Dallas. These series report total production by country, but since I also wanted to look at per-capita production I added population data from data.un.org to the mix.

To see how Gridworks and Excel/PowerPivot complement one another, let’s look at two PowerPivot tables. First, population:

Second, fish production:

PowerPivot is relational. Because these tables are joined by the concatenation of Country and Year, the PerCapita value in the production table is able to use that relationship. Here is the formula for the column:

=[value] / RELATED(‘population'[Pop in Millions])

In other words, divide what’s in the Value column of the production table by what’s in the Pop in Millions table for the corresponding Country and Year. This declarative style, available in PowerPivot, is vastly more convenient that Excel’s procedural style which requires table-flattening and lookup gymnastics.

But here’s the thing. In the world of business intelligence, the tables you feed to PowerPivot are likely to come from relational databases with strong referential integrity. In the realm of open-ended data mashups from a variety of sources, that’s not likely. For example, the Food and Agriculture series has rows for United States, but the population series has rows for United States of America. You have to reconcile those names before you can join the tables.

Enter Gridworks. To feed it the list of names to reconcile, I took this population table from Excel:

And stacked it on top of this food production table from Excel:

The only column that lines up is the Country column, but that’s all that mattered. I read the combined table into Gridworks, and told it to reconcile that column against the Freebase type country:

On the first pass, 8146 rows matched and 868 didn’t.

I focused on the rows that didn’t match.

And then I worked through the list. To approve all the rows with British Indian Ocean Territory, I clicked on the double checkmark shown here:

Sometimes the reconciled name differs:

Sometimes Gridworks doesn’t know what match to propose. One problem name that came up was Côte d’Ivoire, not Côte d’Ivoire, which is something that happens commonly when the proper character encoding is lost during data transfer. In that case, you can search Freebase for a match.

Proceeding in this manner I quickly reduced the set of unmatched names until I got to one that should not match.

Should it be Belgium? Luxembourg? Actually neither. At this point I realized that the population table was a mixture of country names and region names. I wanted to exclude the latter. So I matched up everything else, and was left with 202 rows that had names like Belgium/Luxembourg, Australia/New Zealand, Northern America, Western Africa, and World. When I selected just the matching rows for export, these unmatched rows were left on the cutting room floor.

I split the tables apart again, took them back into the Excel/PowerPivot environment, and found that things still didn’t quite work. In cases where the original and reconciled names differed, Gridworks was exporting the original name (e.g. Iran, Islamic Republic of) rather than the reconciled name (e.g., Iran). To export the reconciled names, I added a new column in Gridworks, based on the country column, and used a Gridworks expression to display the reconciled name.

There will be much more to say about PowerPivot and Gridworks. Each, on its own, is an amazing tool. But the combination makes my spidey sense tingle in a way I haven’t felt for a long time.

Freebase Gridworks: A power tool for data scrubbers

I’ve had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] about the data magic they performed at MIT’s Project Simile and now perform at Metaweb. If you’re somebody who values clean data and has wrestled with the dirty stuff, these screencasts about a forthcoming product called Freebase Gridworks will make you weep with joy.

There’s one by David, and another by Stefano. Using common public datasets about food, international disasters, and US government contracts, they fly through a series of transformations that:

  • Merge similar names using a host of methods:
    • Automatic title-casing
    • A rich expression language
    • Analysis of “edit distance” between similar phrases, using several clustering algorithms
  • Split multi-valued facets
  • Create new facets (e.g., a year column from a data column)
  • Morph linear scales to log scales where appropriate

It’s all live, undoable, and fully instrumented, by which I mean that every transformation updates the counts of the values in each facet, and displays histograms of the new distribution of values — along with sliders for selecting and focusing on subsets.

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.

Visualizing Nobel Peace Prize winners in Freebase

When I watched Barack Obama accept the Nobel Peace Prize, I thought about how the world has changed since the inception of the prize, and how it will continue to change. Since the winners of the Prize are themselves a reflection of what’s changing, I thought I’d try using Freebase to visualize them over the century the Prize has existed.

What you can find out, with Freebase, depends on its coverage of the topics you’re asking about. So realize that what I’ll show here is possible because Nobel Peace Prize winners are a well-covered topic. Still, it’s wildly impressive.

The Nobel site tells us that 89 Nobel Peace Prizes have been awarded since 1901. I haven’t been able to reproduce that number in Freebase because there are multiple winners in a few years, and I haven’t found a way to group results by year. But for my purposes this related query is good enough:

That number, 100, isn’t as closely related to 89 as you might think. It’s less by the number of years no award was given, but more by the number of recipients in multiple-award years. Perhaps a Freebase guru can show us how to measure those uncertainties, but I’ve eyeballed them and I don’t think they invalidate my results.

How did I wind up querying the topic /award/award_winner? It wasn’t immediately obvious. I spent a while searching and then exploring the facets that emerged, including:

The crazy thing about Freebase is that, in a way, it doesn’t matter where you start. Everything’s connected to everything, so you can pick up any node of the graph and re-dangle the rest.

Except when you can’t. I haven’t yet gotten a good feel for which paths to prefer and why.

But in the end I came up with the kind of results I’d envisioned:

1901-2009 nobel peace prize winners by gender
male female

1901-2009 nobel peace prize winners by nationality
male female

Taken together they show a couple of trends. First, of course, we see most female winners after about 1960. Second, we see a more even geographic distribution of female winners because, prior to 1960, most winners were not only male but also American or European.

These results didn’t surprise me. What did is the relative ease with which I was able to discover and document them. I thought it would be necessary to write MQL queries in order to do this kind of analysis. I’d previously done a bit of work with MQL, and dug further into it this time around.

But in the end I found that it was just as effective to use interactive filtering. Now to be clear, getting the software to actually do the things I’ve shown here wasn’t a cakewalk. I had to develop a feel for the web of topics in the domain I chose. And it’s painfully slow to add and drop filters.

But still, it’s doable. And you can do it yourself by pointing and clicking. That is an astonishing tour de force, and a glimpse of what things will be like when we can all fluently visualize information about our world.