A Pivot visualization of my WordPress blog

A Pivot experiment

Pivot, from Microsoft Live Labs, is a data browser that visualizes data facets represented as Deep Zoom images and collections. I’ve been meaning to try my hand at creating a Pivot collection. My first experiment is a visualization of my blog which, in its current incarnation at WordPress.com, has about 600 entries. That’s a reasonable number of items for the simplest (and most common) kind of collection in which data and visuals are pre-computed and cached in the viewer. Here’s the default Pivot view of those entries.

The default view

To create this collection, I needed a visual representation of each blog entry. I didn’t think screenshots would be very useful, but the method worked out better than I expected. At the default zoom level there’s not much to see, but you can pick out entries that include pictures.

A selected entry

When you select an entry, the view zooms about halfway in to focus on it.

A text-only entry

Here’s a purely textual entry at the same resolution. If you click to enlarge that picture, you’ll see that at this level the titles of the current entry and its surrounding entries are legible.

The Show Info control

Clicking the Show Info control opens up an information window that displays title, description, and metadata. I’ve included the first paragraph of each entry as the description.

Zooming closer

If I zoom in further, the text becomes fully legible.

Histogram of entries

Of course the screenshot doesn’t capture the entire entry; it’s just a picture of the first screenful. To read the full entry, you click the Open control to view the entire HTML page inside Pivot.

Pivot itself isn’t a reader; it’s a data browser. This becomes clear when you switch from item view to graph view. 2006 and 2010 are incomplete years, but the period 2007-2009 shows a clear decline. I suspect a lot of blogs would show a similar trend, reflecting Twitter’s eclipse of the blogosphere.

2007 distribution

Here’s the distribution for just the year 2007.

Histogram of comments

And here’s the comments facet, which counts the number of comments on each entry.

Histogram of entries with more than 20 comments

Adjusting the slider limits the view to entries with more than 20 comments.

Filtering by tags

Of course I can also view entries by tags or tag combinations.

Filtering by keywords

When I start typing a keyword, the wordwheel displays matches from two namespaces: tags and titles.

Other possible views

Facets can be anything you can enumerate and count. I could, for example, count the number of images, tables, and other kinds of HTML constructs in each entry. That isn’t just a gratuitous exercise. Some years back, I outfitted my blog with an XQuery service that could search for items that contained more than a few images or tables, and it was useful for finding items that I remembered that way.
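Such counts are easy to derive from each entry’s HTML. Here’s a minimal sketch using only the standard library; the set of constructs counted is just illustrative:

    import re

    def construct_counts(html):
        # Count a few kinds of HTML constructs, suitable for use
        # as numeric Pivot facets.
        return {
            "images": len(re.findall(r"<img\b", html, re.I)),
            "tables": len(re.findall(r"<table\b", html, re.I)),
            "lists": len(re.findall(r"<[ou]l\b", html, re.I)),
        }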

It would also be nice to include facets based on the WordPress stats API. And since a lot of the flow to the blog nowadays comes through bit.ly-shortened URLs on Twitter, a facet based on those referrals would be handy.

How I did it

Life’s too short to make 600 screenshots by hand, so the process had to be automated. Also, I want to be able to update this collection as I add entries to the blog. So I’m using IECapt to programmatically render pages as images, and the indispensable ImageMagick to crop the images in a standard way.
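Here’s a minimal sketch of that pipeline, assuming IECapt and ImageMagick’s convert are on the PATH; the file names and crop geometry are illustrative, not the exact values I used:

    import subprocess

    def snapshot(url, out_png, width=800, height=600):
        # Render the page to an image with IECapt (Windows-only).
        subprocess.check_call(["IECapt", "--url=" + url, "--out=" + out_png])
        # Crop to a standard first-screenful with ImageMagick.
        subprocess.check_call(
            ["convert", out_png,
             "-crop", "%dx%d+0+0" % (width, height), "+repage", out_png])

    snapshot("http://example.com/a-blog-entry/", "shots/entry-0001.png")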

To automate the creation of Deep Zoom images (and XML files), I’m using deepzoom.py. (Note that I had to make two small changes to that version. At line 224, I changed tile_size=254 to tile_size=256. And at line 291 I changed PIL.Image.open(source_path) to PIL.Image.open(source_path.name).)
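With those changes in place, a driver for this step might look roughly like the following, assuming the ImageCreator API exposed by the version of deepzoom.py I used; parameters other than tile_size are illustrative:

    import deepzoom

    # tile_size=256 reflects the change described above.
    creator = deepzoom.ImageCreator(tile_size=256, tile_overlap=1,
                                    tile_format="png")

    # Writes entry-0001.dzi plus its pyramid of tile images.
    creator.create("shots/entry-0001.png", "dzi/entry-0001.dzi")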

To build the main CXML (collection XML) file, I export my WordPress blog and run a Python script against it. I hadn’t looked at that export file in a long time, and was surprised to find that currently it isn’t quite valid XML. The declaration of the Atom namespace is missing. My script does a search-and-replace to fix that before it parses the XML.
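Here’s a minimal sketch of that fix-then-parse step; the file name is illustrative, and the exact attribute text of the root rss element may vary from export to export:

    import xml.etree.ElementTree as ET

    with open("wordpress-export.xml", "rb") as f:
        doc = f.read()

    # The export uses the atom: prefix but never declares it,
    # so patch in the declaration before parsing.
    if b"xmlns:atom" not in doc:
        doc = doc.replace(
            b'<rss version="2.0"',
            b'<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"',
            1)

    root = ET.fromstring(doc)

    # Pull each entry's title and first paragraph (the description).
    ns = {"content": "http://purl.org/rss/1.0/modules/content/"}
    for item in root.iter("item"):
        title = item.findtext("title")
        body = item.findtext("content:encoded", namespaces=ns) or ""
        description = body.split("\n\n", 1)[0]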

I haven’t uploaded the collection to a server yet, because there are a bazillion little files and I’m still tweaking. Once I’m happy with the results, though, I should be able to establish a baseline collection on a server and then easily extend it an entry at a time.

If there’s interest I’ll publish the script. It’ll be more compelling, I suspect, once Pivot becomes available as a Silverlight control. Currently you have to download and install the Windows version of Pivot to use this visualization. But imagine if WordPress.com could deliver something like this for all of its blogs as a zero-install, cross-platform, cross-browser feature. That would be handy.

Two interpretations of US health care cost vs. life expectancy

On FiveThirtyEight.com the other day, Andrew Gelman posted this chart illustrating the high cost of US health care:

He did so to correct a “somewhat misleading (in my opinion) presentation of these numbers [that] has been floating around on the web recently.” The misleading graph, which appeared on a National Geographic blog, was — I agree — a confusing way to show information better represented in a scatterplot.

But I’ve seen this data before, and there’s more to the story. Neither the National Geographic post nor the FiveThirtyEight post says anything about which numbers it’s charting.

Back in 2005, in a review of John Abramson’s excellent book Overdo$ed America, I noted that he had used a different source to reach a slightly different conclusion.

His chart, based on OECD health-expenditure data (link now 404) and WHO healthy life expectancy data (link still alive), looked like this:

He used it to make the oft-cited point that US healthcare isn’t just wildly expensive, but that it also correlates with worse life expectancy than in many countries that spend less.

I wondered what the chart would look like if based on the same OECD expenditure data but on the OECD’s rather than the WHO’s definition of life expectancy. The result looked like this:

The U.S. is the clear cost outlier on both charts. The first chart, however, places us near the low end of the life expectancy range, justifying Abramson’s assertion that we combine “poor health and high costs.” The second chart places us near the high end of the life expectancy range, suggesting that while value still isn’t proportional to cost, we’re at least buying more value than the first chart indicates.

Although based on older data, this second chart closely resembles the ones recently shown and discussed by the National Geographic and FiveThirtyEight.

My review of Abramson’s book concluded:

Has Abramson spun the data to make his point, just as he accuses the pharmaceutical industry of doing? Of course. Everybody spins the data. What matters is that:

  • Everybody can access the source data, as we can in the case of Abramson’s book but cannot (he argues) in the case of much medical research
  • The interpretation used to drive policy expresses the values shared by the citizenry

Would we generally agree that we should measure the value of our health care in terms of healthy life expectancy, not raw life expectancy? That the WHO’s way of assessing healthy life expectancy is valid? These are the kinds of questions that citizens have not been able to address easily or effectively. Pushing the data and surrounding discussion into the blogosphere is the best way — arguably the only way — to change that.

That was five years ago. The data was, and is, out there. So it’s disheartening to see the same chart pop up again without any further discussion of the sources of its data, or of the definitions underlying those sources.

Visualizing Nobel Peace Prize winners in Freebase

When I watched Barack Obama accept the Nobel Peace Prize, I thought about how the world has changed since the inception of the prize, and how it will continue to change. Since the winners of the Prize are themselves a reflection of what’s changing, I thought I’d try using Freebase to visualize them over the century the Prize has existed.

What you can find out, with Freebase, depends on its coverage of the topics you’re asking about. So realize that what I’ll show here is possible because Nobel Peace Prize winners are a well-covered topic. Still, it’s wildly impressive.

The Nobel site tells us that 89 Nobel Peace Prizes have been awarded since 1901. I haven’t been able to reproduce that number in Freebase, because a few years have multiple winners and I haven’t found a way to group results by year. But for my purposes a related query, for topics of type /award/award_winner that won the Nobel Peace Prize, is good enough: it returns 100 results.

That number, 100, isn’t as closely related to 89 as you might think: it’s lowered by the number of years in which no award was given, and raised by the number of extra recipients in multiple-award years. Perhaps a Freebase guru can show us how to measure those uncertainties, but I’ve eyeballed them and I don’t think they invalidate my results.

How did I wind up querying the topic /award/award_winner? It wasn’t immediately obvious. I spent a while searching, and then exploring the facets that emerged along the way.

The crazy thing about Freebase is that, in a way, it doesn’t matter where you start. Everything’s connected to everything, so you can pick up any node of the graph and re-dangle the rest.

Except when you can’t. I haven’t yet gotten a good feel for which paths to prefer and why.

But in the end I came up with the kind of results I’d envisioned:

1901-2009 Nobel Peace Prize winners by gender (legend: male, female)

1901-2009 Nobel Peace Prize winners by nationality (legend: male, female)

Taken together they show a couple of trends. First, of course, most of the female winners appear after about 1960. Second, the female winners are more evenly distributed geographically, because prior to 1960 most winners were not only male but also American or European.

These results didn’t surprise me. What did surprise me was the relative ease with which I was able to discover and document them. I thought it would be necessary to write MQL queries in order to do this kind of analysis. I’d previously done a bit of work with MQL, and dug further into it this time around.
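For the curious, the MQL version of the core query would have looked something like this, expressed here as a Python structure; the property names follow the /award schema as I understand it, so treat it as a sketch rather than a tested query:

    # None and [] mark the values Freebase's mqlread service
    # would fill in and return.
    query = [{
        "type": "/award/award_honor",
        "award": "Nobel Peace Prize",
        "year": None,           # when the prize was awarded
        "award_winner": [],     # the recipient(s) that year
    }]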

But in the end I found that it was just as effective to use interactive filtering. Now, to be clear, getting the software to do the things I’ve shown here wasn’t a cakewalk. I had to develop a feel for the web of topics in the domain I chose. And it’s painfully slow to add and drop filters.

But still, it’s doable. And you can do it yourself by pointing and clicking. That is an astonishing tour de force, and a glimpse of what things will be like when we can all fluently visualize information about our world.