Transparency trends (continued): A data-wrangling tale

As promised yesterday, here’s a detailed account of the gymnastics required to extract usable data from Transparency International’s Corruption Perception Index (CPI) reports.

The reports are published as yearly editions, one for each of the 11 years from 1998 through 2008. They’re not consolidated, at least not anywhere I can find, so if you want to analyze trends in the TI data you’ve got to consolidate those reports yourself.

The yearly reports are available both as HTML tables and as corresponding Excel spreadsheets. I didn’t know about the latter at first: for the recent years I examined first, the website makes only the HTML tables obvious. So the procedure I’ll show here wasn’t strictly necessary; I could have gone straight to the Excel files.

But in the end it’s the same data, and all the subsequent processing is necessary in either case. So I’ll take this opportunity to show how to use Excel to extract data from an HTML table. That’s a really common operation if you’re into this sort of thing, and Excel does it pretty well.

Here’s part of the 2005 CPI table:

TI 2005 Corruption Perceptions Index

Country rank   Country        2005 CPI score   Confidence range   Surveys used
1              Iceland        9.7              9.5 – 9.7          8
2              Finland        9.6              9.5 – 9.7          9
               New Zealand    9.6              9.5 – 9.7          9
4              Denmark        9.5              9.3 – 9.6          10

To import it into Excel 2007, first visit the page and capture its URL.

Then, in Excel, do Data -> Get External Data -> From Web, navigate to the table you want, click the arrow at its top left corner, and click Import.
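If you don’t have Excel handy, the same extraction can be scripted. Here’s a minimal sketch using Python’s standard-library HTMLParser. The URL is a placeholder, and real-world markup may need more defensive handling than this, but it shows the idea: accumulate the text of each td cell, and emit a row at each closing tr.

from HTMLParser import HTMLParser
import urllib2

class TableGrabber(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_cell = False   # are we inside a <td>?
    self.row = []          # cells of the current row
    self.rows = []         # all completed rows
  def handle_starttag(self, tag, attrs):
    if tag == 'tr':
      self.row = []
    elif tag == 'td':
      self.in_cell = True
  def handle_endtag(self, tag):
    if tag == 'td':
      self.in_cell = False
    elif tag == 'tr' and self.row:
      self.rows.append(self.row)
  def handle_data(self, data):
    if self.in_cell and data.strip():
      self.row.append(data.strip())

grabber = TableGrabber()
# placeholder URL -- substitute a real CPI report page
grabber.feed(urllib2.urlopen('http://www.example.org/cpi2005.html').read())
for row in grabber.rows:
  print row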

The import was the easy part. Before long, I had a spreadsheet with 11 CPI reports. To simplify things, I stripped each one down to just two columns: country name and CPI rank. I wanted to see trends in the ranking over time. To do that, I needed to merge the 11 sheets into a single sheet with a column of normalized names, and 11 columns of normalized ranking data.

The names had to be normalized for a couple of reasons. First, there were six different encodings of Côte d´Ivoire:

C\xC3\xB4te d\xC2\xB4Ivoire
Cote d'Ivoire
C\xF4te-d'Ivoire
Cote d\xB4Ivoire
Cote d?Ivoire
C\xF4te d\xB4Ivoire

There were also typos (Moldovaa for Moldova) and variant spellings (USA vs. United States).

The rankings had to be normalized because countries are sometimes tied for a rank. In those cases (as in the New Zealand row above), some of the files were sparse, with empty cells where a rank repeated. In other cases, all cells were populated.

To do this normalization I exported the data from Excel to 11 CSV files, and used the following Python script:

import csv

r98 = csv.reader(open('cpi1998.csv'))
r99 = csv.reader(open('cpi1999.csv'))
r00 = csv.reader(open('cpi2000.csv'))
r01 = csv.reader(open('cpi2001.csv'))
r02 = csv.reader(open('cpi2002.csv'))
r03 = csv.reader(open('cpi2003.csv'))
r04 = csv.reader(open('cpi2004.csv'))
r05 = csv.reader(open('cpi2005.csv'))
r06 = csv.reader(open('cpi2006.csv'))
r07 = csv.reader(open('cpi2007.csv'))
r08 = csv.reader(open('cpi2008.csv'))

def fix(c):
  # reduce variant spellings and encodings to one canonical name
  c = c.replace('(Former Yugoslav Republic of)','')
  c = c.replace('Congo, Republic of','Congo, Republic')
  c = c.replace('Congo, Republic the','Congo, Republic')
  c = c.replace('Dominican Rep.','Dominican Republic')
  c = c.replace('Dominican Rep\n','Dominican Republic\n')
  c = c.replace('FYR ','')
  c = c.replace('Saint Vincent and the','Saint Vincent')
  c = c.replace('Saint Vincent and','Saint Vincent')
  c = c.replace('Macedonia ','Macedonia')
  c = c.replace('Moldovaa','Moldova')
  c = c.replace('Serbia and Montenegro','Serbia')
  c = c.replace('Palestinian Authority','Palestine')
  c = c.replace('the Grenadines','Grenadines')
  c = c.replace('&','and')
  c = c.replace('USA','United States')
  c = c.replace('Viet Nam','Vietnam')
  c = c.replace('Slovak Republic','Slovakia')
  c = c.replace('Kuweit','Kuwait')
  c = c.replace('Taijikistan','Tajikistan')
  c = c.replace('Republik','Republic')
  c = c.replace('Herzgegovina','Herzegovina')
  c = c.replace("Ivory Coast",'C\xC3\xB4te d\xC2\xB4Ivoire')
  c = c.replace("Cote d'Ivoire",'C\xC3\xB4te d\xC2\xB4Ivoire')
  c = c.replace("C\xF4te-d'Ivoire", 'C\xC3\xB4te d\xC2\xB4Ivoire')
  c = c.replace('Cote d\xB4Ivoire', 'C\xC3\xB4te d\xC2\xB4Ivoire')
  c = c.replace('Cote d?Ivoire', 'C\xC3\xB4te d\xC2\xB4Ivoire')
  c = c.replace('C\xF4te d\xB4Ivoire', 'C\xC3\xB4te d\xC2\xB4Ivoire')
  return c

d = {}            # country name -> list of 11 yearly ranks
rnum = -1         # index of the current report (0..10)
lastrank = None   # carries the last seen rank down through empty cells

for reader in [r98,r99,r00,r01,r02,r03,r04,r05,r06,r07,r08]:
  rnum += 1
  for row in reader:
    rank = row[0]
    if rank == '':         # normalize rank
      rank = lastrank
    lastrank = rank
    country = fix(row[1])  # normalize name
    if country not in d:
      d[country] = [0,0,0,0,0,0,0,0,0,0,0]
    d[country][rnum] = rank
   
keys = d.keys()
keys.sort()

for key in keys:
  print "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" % \
    ( key, d[key][0],d[key][1],d[key][2],d[key][3],
           d[key][4],d[key][5],d[key][6],d[key][7],
           d[key][8],d[key][9],d[key][10] )

As you can see, the bulk of this script is really just data, in the form of search/replace pairs. Its output is a tab-delimited file that Excel can read directly. It took me a few tries to reduce the list of names to a normalized core: I ran the script, took the output into Excel, eyeballed the list, and added new search/replace pairs.
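That eyeballing can be partly automated. Any name that doesn’t appear in all 11 files is a candidate for a typo or variant spelling, so a quick census surfaces most of the suspects. Here’s a sketch along those lines, assuming the same 11 CSV files and reusing the fix() normalizer above:

import csv

counts = {}
for year in range(1998, 2009):
  for row in csv.reader(open('cpi%d.csv' % year)):
    name = fix(row[1])               # normalize before counting
    counts[name] = counts.get(name, 0) + 1

# names appearing in fewer than 11 files are candidates for
# typos or variant spellings
for name in sorted(counts):
  if counts[name] < 11:
    print counts[name], name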

Eventually I wound up with this data, which I brought back into Excel to explore. Because I wanted to look at what I’m calling volatility (that is, the variability in CPI rankings), I added a column that computes the difference between a country’s highest and lowest rankings over the 11-year period, and then sorted the countries by that difference, from most to least volatile.
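In Excel that’s just a MAX-minus-MIN formula across each country’s row of ranks. For the record, here’s the same computation as a Python sketch, using the dictionary d from the script above and treating zeroes as unreported years:

def volatility(ranks):
  # ignore the zeroes that stand for years with no reported rank
  seen = [int(r) for r in ranks if int(r) > 0]
  if len(seen) < 2:
    return 0
  return max(seen) - min(seen)

# most volatile countries first
countries = sorted(d.keys(), key=lambda c: volatility(d[c]), reverse=True)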

We can debate whether a stack of sparklines is a useful way to visualize trends in this data, but that’s the approach I decided to try. It gave me a chance to experiment with some of the sparkline kits available for Excel, and the one I settled on is BonaVista’s MicroCharts.

Here’s a picture of two chart styles I tried:

These microcharts do succeed in telling stories about each country individually, while also making it possible to notice that Georgia, atypically among the more volatile countries, is moving toward a lower (better) ranking.

In another variation on this theme, I flipped the rankings to their negative counterparts so that the charts would flip too, corresponding to my natural sense that up means better and down means worse. I also removed the zeroes so that they wouldn’t show up as data points.
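That transformation is trivial in Excel, but for completeness, here’s what it looks like as a Python sketch: negate the real ranks, and blank the zeroes that stand for unreported years. The sample row is made up for illustration.

def flip(ranks):
  # negate real ranks so that up means better; blank unreported years
  return [-int(r) if int(r) > 0 else '' for r in ranks]

print flip([0, 52, 54, 48, 0, 47, 50, 51, 55, 61, 58])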

That was good enough for my purposes, but when I converted the spreadsheet back to HTML I wasn’t happy with the results. That’s partly because the microcharts, which are rendered using TrueType fonts, had to be converted to lower-resolution images. And it’s partly because the HTML that Excel generated was too complicated for my WordPress blog to handle gracefully.

So I exported the enhanced data back out to a CSV file, and switched to Python again. There are a million ways to generate sparklines from data, but the one I remembered from a previous encounter was Joe Gregorio’s handy sparkline service.

(By the way, it should be possible to use that web-based service from Excel, and interactively you can. If you capture a sparkline URL like this one, you can paste it into the File Open dialog presented by Excel’s Insert -> Picture feature. The dialog asks for a filename, but you can give it an URL and it’ll work.

When I realized that, I spent a few minutes trying to automate the procedure so that Excel 2007 could programmatically grab data, run it through an image-generating web service, and embed the resulting pictures. I failed, as have others before me, but it’s a nifty idea. If you know the solution, please share.)

Anyway, here’s the little Python script that reads the data, produces sparkline images, and embeds them in the HTML table I displayed on my blog:

# -*- coding: utf-8 -*-
import urllib2,os

data = open('cpi.csv').read()

url_template = ("http://bitworking.org/projects/sparklines/spark.cgi?"
                "type=discrete&d=%s&height=20&limits=0,200&upper=1&"
                "above-color=black&below-color=white&width=4")

rows = ''
row_template = """<tr>
<td class="sparkline">
<img src="http://jonudell.net/img/cpi/%s">
</td>
<td>%s</td></tr>\n"""

lines = data.split('\n')
for line in lines:
  if not line.strip():      # skip blank lines
    continue
  country = line.split(',')[0]
  ranks = line.split(',')[1:]
  quoted_fname = '%s.png' % urllib2.quote(country)
  fname = '%s.png' % country
  imgurl = url_template % ','.join(ranks)   # rejoin ranks for the d= parameter
  cmd = 'curl "%s" > "./cpi/%s" ' % (imgurl,fname)
  os.system(cmd)
  cmd = 'mogrify -flip "./cpi/%s"' % fname  # flip so that up means better
  os.system(cmd)
  rows += row_template % (quoted_fname,country.replace(' ','&nbsp;'))

html = '<table cellspacing="4">%s</table>' % rows
f = open('cpi.html','w')
f.write(html)
f.close()

By specifying upper=1 and below-color=white in the sparkline-generating URL, the zeroes (representing unreported data) vanish from the charts.

The charts don’t include reference lines as shown in the Excel screenshot, but I added them back using this bit of CSS:

td.sparkline {
  border-top: 1px solid #cccccc;
}

I’m using Python here partly as a shell language. It invokes a pair of command-line utilities: cURL to download images, and mogrify (part of the ImageMagick suite) to flip them.

Although one of these commands invokes a cloud-based sparkline service, and the other runs a locally-installed image-processing program, they’re treated in exactly the same way. When the quantities of data involved are small (these .PNG images are just a few hundred bytes), there’s no discernible difference between the two modes. I like that symmetry.
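One caveat about the os.system approach: a country name containing a quote or other shell metacharacter can break the command line. If that bites, subprocess passes arguments without shell interpolation. A minimal sketch of the same two steps, assuming the imgurl and fname variables from the script above:

import subprocess

def fetch_and_flip(imgurl, fname):
  # download the sparkline image, writing curl's output to a file
  img = open('./cpi/%s' % fname, 'wb')
  subprocess.call(['curl', '-s', imgurl], stdout=img)
  img.close()
  # flip it in place, as before
  subprocess.call(['mogrify', '-flip', './cpi/%s' % fname])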

What I don’t like is all the moving parts. It’s awkward for me to move from Excel to Python to Excel to Python, with excursions to the command line along the way, and no normal person would even consider doing that.

In a simple case like this, such gymnastics should never have been required. If you’re going to publish data to the web, assume that people will want to use it, and do the basic hygiene and consolidation that will let them.

At some point, though, people will want to do fancier tricks. Today you have to be a “data geek” to perform them, but that shouldn’t be so. We’ve got to find a way to integrate Excel, dynamic scripting, command-line voodoo, and web publishing into a suite of capabilities that’s much more accessible.


15 thoughts on “Transparency trends (continued): A data-wrangling tale”

  1. Nice flexibility on those maneuvers! I have my own Ruby toolkit for scraping/munging/joining data and exporting as a nicer format for using in other tools.

    Of course, it’s also possible for various services to wrangle this data and offer it in more broadly used formats.

    For example, we have the CPI for several years on GeoCommons. Download as Shapefile, KML, and CSV – or visualize against other open data.

  2. Interesting post… did a quick look to see if I could import into OpenOffice, since I don’t use Microsoft tools… and yes I could, though not with any graphical preview… found it under Insert -> Link to External Data, inserted the URL and hit enter. Then I got a list of things it could import, but no preview.

    Anyway, the main thing I wanted to bring up was proposing to use an emerging standard: JSON REST. I’m in active development on a project using the Dojo Toolkit, and the folks over there are pioneering ways of representing data in ways that become easy to consume and manipulate. The Dojo DataGrid can get plugged into a variety of data formats, but the fastest/easiest is a simple JavaScript array of rows, with each row an object with each field name and field data. Publishing data in this way, while not necessarily improving your experience of getting it into Excel, would greatly lower the barriers to remixing from other sites…

    A thought, anyway…

  3. In the realm of pure Excel hackery, one technique is to use Visual Basic to flip data underneath a chart through a technique like this:

    http://www.juiceanalytics.com/writing/why-make-100-charts-when-one-will-do/

    This changes the data underneath a chart allowing you to use one chart to display hundreds of data series.

    You can then use VB Chart.Export (not sure exactly what the function call is) to write each of those charts out to hundreds of GIF images. Then include those charts in your HTML.

    We’ve used this successfully a number of times, the most extreme example being this video:

    http://www.juiceanalytics.com/writing/extreme-excel-charts/

    The sheer insanity of that video still makes me laugh.

  4. > For example, we have the CPI for several
    > years on GeoCommons.

    I have looked at GeoCommons before, very nice!

    How would you use it to compare, say, CPI 2008 vs world income inequality?

    I just tried that, and could not come up with a visualization I liked. Would be interested to see your solution.

  5. > Top 100 ways to write: Guns N’ Roses –
    > Knockin’ on Heaven’s Door

    Of course Jeff Jonas would say this is a good thing. This cloud of associations is a better definition of the entity!

  6. > Publishing data in this way, while not
    > necessarily improving your experience of
    > getting it into Excel, would greatly lower
    > the barriers to remixing from other sites…

    JSON is a beautiful thing. But when we of the geek tribe discuss the pros and cons of JSON vs XML vs CSV, we’re off in the weeds.

    The problems I ran into here have nothing to do with delivery format. It’s all in the data management. As we saw here, typos, variant encodings, and format inconsistencies were expressed identically in HTML and XLS, and would be expressed the same way in JSON or XML or any other delivery format.

    We need to give people more upstream assistance so they can avoid the typos, variant encodings, and format inconsistencies in the first place.

  7. > The sheer insanity of that video still
    > makes me laugh.

    There’s nothing insane about it at all. Disney made animations a cel at a time. Pixar still does, thanks to automation.

    I often want to see a “small multiples” display as a movie. One not-entirely-successful experiment along those lines:

    http://weblog.infoworld.com/udell/2004/11/03.html

    In general, I’m fascinated by the evolution toward animated infographics that only really work interactively:

    http://move.rmi.org/features/oilmap.html

    This is cool. And yet, I would like to see more granularity. You ought to be able to bookmark any frame for reference. You ought to be able to grab the sequence of frames and render it in alternate ways, even as a printed “small multiples” poster.

  8. > Did you consider trying to do all the
    > work in Resolver

    Yes. But in Resolver I couldn’t, among other things, use an Excel sparkline add-in, and that was part of what I wanted to explore.

  9. > But now I won’t.

    Why not? There are a zillion good solutions worth mentioning.

    I focused on Excel here because I wanted to try the inline sparkline thing.

    Your comment prompted me to check if GSheets can insert a sparkline image from an URL like http://bitworking.org/projects/sparklines/spark.cgi?type=smooth&d=88,84,82,92,82,86,66,82,44,64,66,88,96,80,24,26,14,0,0,26,8,6,6,24,52,66,36,6,10,14,30&height=20&limits=0,100&min-m=false&max-m=false&last-m=false&min-color=red&max-color=blue&last-color=green&step=2.

    It can, but onto the sheet, not into a cell.

    Putting multiple sparklines into a column, and then sorting on other columns, was a nice aspect of this experiment, and one that I’ve yet to replicate online.
