PowerPivot + Gridworks = Wow!

While reading Jonathan Safran Foer’s Eating Animals I got to wondering about global and national trends in the production of meat and fish. He mentions, for example, that US chicken production is way up over the past few decades. How do we compare to other countries? Here’s how I answered that question:

The screenshot is of Excel 2010 augmented by PowerPivot, the business intelligence add-in that’s the subject of last week’s Innovators show with John Hancock.

Using the same spreadsheet, I asked and answered some questions about fish production:

What’s the worldwide trend for capture versus aquaculture?

How much fish are certain countries capturing?

On a per capita basis?

How much fish are certain countries growing?

On a per capita basis?

The book raises very different kinds of questions. How should we treat the animals we eat? How much land should we use to raise crops that feed the animals we eat, instead of raising crops that feed people directly? We won’t find the answers to these kinds of questions in spreadsheets. But what I hope we will find — in spreadsheets, in linked databases, in data visualizations — is a framework that will ground our discussion of these and many other issues.

In order to get there, a number of puzzle pieces will have to fall into place. The exercise that yielded this particular spreadsheet led me to explore two that I want to discuss. One is PowerPivot. It’s a tool that comes from the world of business intelligence, but that I think will appeal much more widely as various sources of public data come online, and as various kinds of people realize that they want to analyze that data.

The other piece of this puzzle is Freebase Gridworks, which I’m testing in pre-release. The exercise I’ll describe here is really a collaboration involving Excel, PowerPivot, and Gridworks, in a style that I think will become very common.

My starting point, in this case, was data on fish and meat production from the Food and Agriculture Organization, via a set of OData feeds in Dallas. These series report total production by country, but since I also wanted to look at per-capita production I added population data from data.un.org to the mix.

To see how Gridworks and Excel/PowerPivot complement one another, let’s look at two PowerPivot tables. First, population:

Second, fish production:

PowerPivot is relational. Because these tables are joined by the concatenation of Country and Year, the PerCapita value in the production table is able to use that relationship. Here is the formula for the column:

=[Value] / RELATED('population'[Pop in Millions])

In other words, divide what’s in the Value column of the production table by what’s in the Pop in Millions column of the population table for the corresponding Country and Year. This declarative style, available in PowerPivot, is vastly more convenient than Excel’s procedural style, which requires table-flattening and lookup gymnastics.

But here’s the thing. In the world of business intelligence, the tables you feed to PowerPivot are likely to come from relational databases with strong referential integrity. In the realm of open-ended data mashups from a variety of sources, that’s not likely. For example, the Food and Agriculture series has rows for United States, but the population series has rows for United States of America. You have to reconcile those names before you can join the tables.

Enter Gridworks. To feed it the list of names to reconcile, I took this population table from Excel:

And stacked it on top of this food production table from Excel:

The only column that lines up is the Country column, but that’s all that mattered. I read the combined table into Gridworks, and told it to reconcile that column against the Freebase type country:

On the first pass, 8146 rows matched and 868 didn’t.

I focused on the rows that didn’t match.

And then I worked through the list. To approve all the rows with British Indian Ocean Territory, I clicked on the double checkmark shown here:

Sometimes the reconciled name differs:

Sometimes Gridworks doesn’t know what match to propose. One problem name that came up was a garbled rendering of Côte d’Ivoire, which is something that happens commonly when the proper character encoding is lost during data transfer. In that case, you can search Freebase for a match.

Proceeding in this manner I quickly reduced the set of unmatched names until I got to one that should not match.

Should it be Belgium? Luxembourg? Actually neither. At this point I realized that the population table was a mixture of country names and region names. I wanted to exclude the latter. So I matched up everything else, and was left with 202 rows that had names like Belgium/Luxembourg, Australia/New Zealand, Northern America, Western Africa, and World. When I selected just the matching rows for export, these unmatched rows were left on the cutting room floor.

I split the tables apart again, took them back into the Excel/PowerPivot environment, and found that things still didn’t quite work. In cases where the original and reconciled names differed, Gridworks was exporting the original name (e.g. Iran, Islamic Republic of) rather than the reconciled name (e.g., Iran). To export the reconciled names, I added a new column in Gridworks, based on the country column, and used a Gridworks expression to display the reconciled name.
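
(If I’m remembering the syntax right, that expression is just cell.recon.match.name, which reaches into the reconciliation data attached to each cell and pulls out the name of the matched Freebase topic.)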

There will be much more to say about PowerPivot and Gridworks. Each, on its own, is an amazing tool. But the combination makes my spidey sense tingle in a way I haven’t felt for a long time.

The “it just works” kind of efficiency

I’m editing an interview with John Hancock, who leads the PowerPivot charge and championed its support of OData. During our conversation, I told him this story about how pleased I was to discover that OData “just works” with PubSubHubbub. His response made me smile, and I had to stop and transcribe it:

Any two teams can invent a really efficient way to exchange data. But every time you do that, every time you create a custom protocol, you block yourself off from the effect you just described. If you can get every team — and this is something we went for a long time telling people around the company — look, REST and Atom aren’t the most efficient things you can possibly imagine. We could take some of your existing APIs and our engine and wire them together. But we’d be going around and doing that forever, with every single pair of things we wanted to wire up. So if we take a step back and look at what is the right way to do this, what’s the right way to exchange data between applications, and bet on a standard thing that’s out there already, namely Atom, other things will come along that we haven’t imagined. Dallas is a good example of that. It developed independently of PowerPivot. It was quite late in the game before we finally connected up and started working with it, but we had a prototype in an afternoon. It was so simple, just because we had taken the right bets.

There are, of course, many kinds of efficiency. Standards like Atom aren’t the most efficient in every way. But they are definitely the most efficient in the “it just works” way.

OData and PubSubHubbub: An answer and a question

I had been meaning to explore PubSubHubbub, a protocol that enables near-realtime consumption of data feeds. Then somebody asked me: “Can OData feeds update through PubSubHubbub?” OData, which recently made a splash at the MIX conference, is based on Atom feeds. And PubSubHubbub works with Atom feeds. So I figured it would be trivial for an OData producer to hook into a PubSubHubbub cloud.

I’ve now done the experiment, and the answer is: Yes, it is trivial. In an earlier post I described how I’m exporting health and performance data from my elmcity service as an OData feed. In theory, enabling that feed for PubSubHubbub should only require me to add a single XML element to that feed. If the hub that connects publishers and subscribers is Google’s own reference implementation of the protocol, at http://pubsubhubbub.appspot.com, then that element is:

<link rel="hub" href="http://pubsubhubbub.appspot.com"/>

So I added that to my OData feed. To verify that it worked, I tried using the publish and subscribe tools at pubsubhubbub.appspot.com, at first with no success. That was OK, because it forced me to implement my own publisher and my own subscriber, which helped me understand the protocol. Once I worked out the kinks, I was able to use my own subscriber to tell Google’s hub that I wanted my subscriber to receive near-realtime updates when the feed was updated. And I was able to use my own publisher to tell Google’s hub that the feed had been updated, thus triggering a push to the subscriber.
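
My own publisher boils down to a single form-encoded POST. Here’s a minimal C# sketch of that ping; the topic URL is a placeholder for my actual feed, and I’m showing the bare protocol rather than the elmcity code itself:

using System.Collections.Specialized;
using System.Net;

class HubPing
{
    static void Main()
    {
        // Hypothetical topic URL; substitute the feed you want to publish.
        string topic = "http://example.cloudapp.net/services/odata?table=monitor";

        var form = new NameValueCollection();
        form["hub.mode"] = "publish"; // tell the hub new content is available
        form["hub.url"] = topic;      // the topic (feed) URL that just changed

        using (var client = new WebClient())
        {
            // A successful ping comes back as 204 No Content, after which
            // the hub fetches the feed and pushes updates to subscribers.
            client.UploadValues("http://pubsubhubbub.appspot.com/", form);
        }
    }
}

The subscriber side is only a little more involved: it POSTs hub.mode=subscribe, hub.topic, hub.callback, and hub.verify to the hub, and then answers the hub’s verification GET by echoing back the hub.challenge parameter.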

In this case, the feed is produced by the Azure Table service. It could also have been produced by the SQL Azure service, or by any other data service — based on SQL or not — that knows how to emit Atom feeds. And in this case, the feed URL (or, as the spec calls it, the topic URL) expresses query syntax that passes through to the underlying Azure Table service. Here’s one variant of that URL:

http://{ODATA HOST}/services/odata?table=monitor

That query asks for the whole table. But even though the service that populates that table only adds a new record every 10 minutes, the total number of records becomes unwieldy after a few days. So the query URL can also restrict the results to just recent records, like so:

http://{ODATA HOST}/services/odata?table=monitor&since_hours_ago=4

The result of that query, however, is different from the result of this one:

http://{ODATA HOST}/services/odata?table=monitor&since_hours_ago=24

Now here’s my question. The service knows, when it updates the table, that any URL referring to that table is now stale. How does it tell the hub that? The spec says that the topic URL “MUST NOT contain an anchor fragment” but “can otherwise be free-form.” If the feed producer is a data service that supports a query language, and the corresponding OData service supports RESTful query, there is a whole family of topic URLs that can be subscribed to. How do publishers and subscribers specify the parent?

YesAndNoSQL

I’ve written elsewhere about some of the reasons OData makes me happy. Following the announcements at MIX this week, best summarized here by Doug Purdy, I’d like to add another. It can be a nice bridge between the NoSQL and SQL worlds.

For example, my elmcity service monitors itself by sampling a bunch of performance counters and pushing the data to an Azure table. Because the Azure Table service is an OData producer, I’m able to analyze and chart the data using PowerPivot, an Excel add-in that’s an OData consumer.

I’ve been thinking about moving this data store over to SQL Azure, because some kinds of queries I might want to run will be much easier using SQL rather than Azure Table’s primitive query language. Now that SQL Azure is also an OData producer I’ll be able to make that switch seamlessly. From a SQL perspective I’ll have a more powerful query capability. But from a NoSQL perspective it’ll look just the same.
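
To make that concrete: the table service’s query syntax can select rows (something like $filter=ProcName eq 'WaWebHost' and mem_available_mbytes lt 500), but it can’t aggregate them, so a question like “what was the average request execution time, per hour, last week?” is a one-line GROUP BY in SQL and a client-side looping exercise against an Azure table.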

I’ve always loved Sam Ruby’s tagline: It’s just data. I should be able to spin up a service using a decentralized key/value store, like Azure Table or SimpleDB, and then gracefully migrate to a SQL store if and when that becomes necessary. With OData living on both sides of the SQL/NoSQL divide, that glide path will be much smoother.

Hey Honda, I paid for that data!

Yesterday at the Honda dealer’s service desk I found myself in an all-too-familiar situation, craning my head for a glimpse of a screenful of data that I paid for but do not own. Well, that’s not quite true. I do have a degraded form of the data: printouts of work orders. But I don’t have it in a useful form that would enable me to compute the ownership cost of my car, or share its maintenance history with owners of similar cars so we can know which repairs have been normal or abnormal.

Although we tend to focus on the portability of our health care data, the same principles apply to all kinds of service providers. And in many of those cases, we would be less concerned about the privacy of the data.

Why, then, don’t service providers and their customers co-own this data? Is it because providers want to keep high-quality electronic data, while only dispensing low-quality paper data, in order to make their services stickier? It would make a certain kind of sense for Honda to think that way, but I don’t think that’s the answer. Instead:

1. Nobody asks for the data.

2. There’s no convenient way to provide it.

We’ll get over the first hurdle as our cultural expectations evolve. Today it would be weird to find an OData URL printed on your paid work order. In a few years, I hope, that will be normal.

We’ll get over the second hurdle as service providers begin to colonize the cloud. One of the key points I tried to make in a recent interview about cloud computing is that cloud-based services can flip a crucial default setting. If you want to export access to data stored in today’s point-of-sale and back-end systems, you have to swim upstream. But when those systems are cloud-based, you can go with the flow. The data in those systems can still be held closely. But when you’re asked to share it, the request is much easier to satisfy.

Producing and consuming OData feeds: An end-to-end example

Having waxed theoretical about the Open Data Protocol (OData), it’s time to make things more concrete. I’ve been adding instrumentation to monitor the health and performance of my elmcity service. Now I’m using OData to feed the telemetry into Excel. It makes a nice end-to-end example, so let’s unpack it.

Data capture

The web and worker roles in my Azure service take periodic snapshots of a set of Windows performance counters, and store those to an Azure table. Although I could be using the recently-released Azure diagnostics API, I’d already come up with my own approach. I keep a list of the counters I want to measure in another Azure table, shown here in Cerebrata’s viewer/editor:
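
The capture code itself is modest. Here’s a trimmed-down C# sketch of the sampling step; the counter shown is illustrative (the real list comes from the counters table above), and I’ve left out the call that writes the snapshot to the monitor table:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class CounterSnapshot
{
    static Dictionary<string, object> TakeSnapshot()
    {
        var snapshot = new Dictionary<string, object>
        {
            // PartitionKey and RowKey become the Azure table keys for this entity
            { "PartitionKey", "ProcessMonitor" },
            { "RowKey", DateTime.UtcNow.Ticks.ToString() },
            { "HostName", Environment.MachineName }
        };

        // One illustrative counter: available memory, in megabytes
        using (var counter = new PerformanceCounter("Memory", "Available MBytes"))
        {
            snapshot["mem_available_mbytes"] = (double)counter.NextValue();
        }

        return snapshot; // then written to the 'monitor' table via the REST interface
    }
}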

When you query an Azure table like this one, the records come back packaged as content elements within Atom entries:

<entry m:etag="W/datetime'2010-02-09T00:00:53.7164253Z'">
<id>http://elmcity.table.core.windows.net/monitor(PartitionKey='ProcessMonitor',
  RowKey='634012704503641218')</id>
<content type="application/xml">
<m:properties>
<d:PartitionKey>ProcessMonitor</d:PartitionKey>
<d:RowKey>634012704503641218</d:RowKey>
<d:HostName>RD00155D317B3F</d:HostName>
<d:ProcName>WaWorkerHost</d:ProcName>
<d:mem_available_mbytes m:type="Edm.Double">1320</d:mem_available_mbytes>
...snip...
<d:tcp_connections_established m:type="Edm.Double">24</d:tcp_connections_established>
</m:properties>
</content>
</entry>

This isn’t immediately obvious if you use the storage client library that comes with the Azure SDK, which wraps an ADO.NET Data Services abstraction around the Azure table service. But if you peek under the covers using a tool like Eric Lawrence’s astonishingly capable Fiddler, you’ll see nothing but Atom entries. In order to get direct access to them, I don’t actually use the storage client library in the SDK, but instead use an alternate interface that exposes the underlying HTTP/REST machinery.

Exposing data services

If the Azure table service did not require special authentication, it would itself be an OData service that you could point any OData-aware client at. To fetch recent entries from my table of snapshots, for example, you could use this URL in any browser:

GET http://elmcity.table.core.windows.net/monitor?$filter=Timestamp+gt+datetime'2010-02-08'

(A table named ‘monitor’ is where the telemetry data are stored.)

The table service does require authentication, though, so in order to export data feeds I’m creating wrappers around selected queries. Until recently, I’ve always packaged the query response as a .NET List of Dictionaries. A record in an Azure table maps nicely to a Dictionary. Both are flexible bags of name/value pairs, and a Dictionary is easily consumed from both C# and IronPython.
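
The core of such a wrapper is a signed GET against the table service. Here’s a rough C# sketch using the Shared Key Lite scheme; the account, key, and query are placeholders, I’ve simplified the canonicalization, and the storage service docs are the authority on the details:

using System;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using System.Text;

class TableQuery
{
    static string SignedGet(string account, string base64Key, string tableAndQuery)
    {
        // e.g. tableAndQuery = "monitor?$filter=Timestamp gt datetime'2010-02-08'"
        string date = DateTime.UtcNow.ToString("R"); // RFC 1123 date

        // Shared Key Lite for the table service signs just the date plus
        // the canonicalized resource (the path, without the query string).
        string resource = "/" + account + "/" + tableAndQuery.Split('?')[0];
        string stringToSign = date + "\n" + resource;

        string signature;
        using (var hmac = new HMACSHA256(Convert.FromBase64String(base64Key)))
        {
            signature = Convert.ToBase64String(
                hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign)));
        }

        var request = (HttpWebRequest)WebRequest.Create(
            "http://" + account + ".table.core.windows.net/" + tableAndQuery);
        request.Headers["x-ms-date"] = date;
        request.Headers["Authorization"] = "SharedKeyLite " + account + ":" + signature;
        request.Accept = "application/atom+xml";

        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd(); // the raw Atom feed, entries and all
        }
    }
}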

To enable OData services I just added an alternate method that returns the raw response from an Azure table query. Then I extended the public namespace of my service, adding a /odata mapping that accepts URL parameters for the name of a table, and for the text of a query. I’m doing this in ASP.NET MVC, but there’s nothing special about the technique. If you were working in, say, Rails or Django, it would be just the same. You’d map out a piece of public namespace, and wire it to a parameterized service that returns Atom feeds.
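
In MVC terms the mapping looks something like the following sketch. I’m paraphrasing from memory rather than pasting the elmcity source; the route and the QueryTableService helper (which performs the signed table query and hands back the raw Atom payload) are stand-ins:

using System.Web.Mvc;

public class ServicesController : Controller
{
    // Registered in Global.asax.cs with something like:
    // routes.MapRoute("odata", "odata",
    //     new { controller = "Services", action = "GetODataFeed" });

    public ActionResult GetODataFeed(string table, int? hours_ago, string query)
    {
        // Handles URLs like /odata?table=monitor&hours_ago=48&query=ProcName eq 'WaWebHost'
        // QueryTableService is a hypothetical wrapper around the signed REST query.
        string atom = ElmcityAzureTable.QueryTableService(table, hours_ago, query);
        return Content(atom, "application/atom+xml");
    }
}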

Discovering data services

An OData-aware client can use an Atom service document to find out what feeds are available from a provider. The one I’m using looks kind of like this:

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<service xmlns:atom='http://www.w3.org/2005/Atom' 
  xmlns:app='http://www.w3.org/2007/app' xmlns='http://www.w3.org/2007/app'>
  <workspace>
    <atom:title>elmcity odata feeds</atom:title>
    <collection href="http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48">
      <atom:title>recent monitor data (web and worker roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48&amp;query=ProcName eq 'WaWebHost'">
      <atom:title>recent monitor data (web roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48&amp;query=ProcName eq 'WaWorkerHost'">
      <atom:title>recent monitor data (worker roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=counters">
      <atom:title>performance counters</atom:title>
    </collection>
  </workspace>
</service>

PowerPivot is an Excel add-in that knows about this stuff. Here’s a picture of PowerPivot discovering those feeds:

It’s straightforward for any application or service, written in any language, running in any environment, to enable this kind of discovery.

Using data services

In my case, PowerPivot — which is an add-in that brings some nice business intelligence capability to Excel — makes a good consumer of my data services. Here are some charts that slice my service’s request execution times in a couple of different ways:

Again, it’s straightforward for any application or service, written in any language, running in any environment, to do this kind of thing. It’s all just Atom feeds with data-describing payloads. There’s nothing special about it, which is the whole point. If things pan out as I hope, we’ll have a cornucopia of OData feeds — from our banks, from our Internet service providers, from our governments, and from every other source that currently publishes data on paper, or in less useful electronic formats like PDF and HTML. And we’ll have a variety of OData clients, on mobile devices and on our desktops and in the cloud, that enable us to work with those data feeds.

OData for collaborative sense-making

OData, the Open Data Protocol, is described at odata.org:

The Open Data Protocol (OData) is a web protocol for querying and updating data. OData applies web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores.

The other day, Pablo Castro wrote an excellent post explaining how developers can implement aspects of the modular OData spec, and outlining some benefits that accrue from each. One of the aspects is query, and Pablo gives this example:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=zipcode eq 20007

One benefit for exposing query to developers, Pablo says, is:

Developers using the Data Services client for .NET would be able to use LINQ against your service, at least for the operators that map to the query options you implemented.

I’d like to suggest that there’s a huge benefit for users as well. Consider Pablo’s example, based on some Washington, DC datasets published using the Open Government Data Initiative toolkit. Let’s look at one of those datasets, BankLocations, through the lens of Excel 2010’s PowerPivot.

PowerPivot adds heavy-duty business analytics to Excel in ways I’m not really qualified to discuss, but for my purposes here that’s beside the point. I’m just using it to show what it can be like, from a user’s perspective, to point an OData-aware client, which could be any desktop or web application, at an OData source, which could be provided by any backend service.

In this case, I pointed PowerPivot at the following URL:

http://ogdi.cloudapp.net/v1/dc/BankLocations

I previewed the Atom feed, selected a subset of the columns, and imported them into a pivot table. I used slicers to help visualize the zipcodes associated with each bank. And I wound up with a view which reports that there are three branches of WashingtonFirst Bank in DC, at three addresses, in two zipcodes.

If I were to name this worksheet, I’d call it WashingtonFirst Bank branches in DC. But it has another kind of name, one that’s independent of the user who makes such a view, and of the application used to make it. Here is that other name:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=name eq 'WashingtonFirst Bank'

If you and I want to have a conversation about banks in Washington, DC, and if we agree that this dataset is an authoritative list of them, then we — and anyone else who cares about this stuff — can converse using a language in which phrases like ‘WashingtonFirst Bank branches in DC’ or ‘banks in zipcode 20007’ are well defined.

If we incorporate this kind of fully articulated web namespace into public online discourse, then others can engage with it too. Suppose, to take just one small example, I find what I think is an error in the dataset. Maybe I think one of the branch addresses is wrong. Or maybe I want to associate some extra information with the address. Today, the way things usually work, I’d visit the source website and look for some kind of feedback mechanism. If there is one, and if I’m willing to provide my feedback in a form it will accept, and if my feedback is accepted, then my effort to engage with that dataset will be successful. But that’s a lot of ifs.

When public datasets provide fully articulated web namespaces, though, things can happen in a more loosely coupled way. I can post my feedback anywhere — for example, right here on this blog. If I have something to say about the WashingtonFirst branch at 1500 K Street, NW, I can refer to it using an URL: 1500 K Street, NW.

That URL is, in effect, a trackback that points to one record in the dataset.1 The service that hosts the dataset could scan the web for these inbound links and, if desired, reflect them back to its users. Or any other service could do the same. Discourse about the dataset can grow online in a decentralized way. The publisher need not explicitly support, maintain, or be liable for that discourse. But it can be discovered and aggregated by any interested party.

The open data movement, in government and elsewhere, aims to help people engage with and participate in processes represented by the data. When you publish data in a fully articulated way, you build a framework for engagement, a trellis for participation. This is a huge opportunity, and it’s what most excites me about OData.


1 PowerPivot doesn’t currently expose that URL, but it could, and so could any other OData-aware application.