Since 1998, Transparency International has published an annual report called the Corruption Perceptions Index (CPI), which “ranks 180 countries by their perceived levels of corruption, as determined by expert assessments and opinion surveys.” Looking at the 2008 edition, I wondered about trends. Which countries have shown the most CPI volatility since 1998? Is there a trend toward light or darkness? If so, which countries run counter to the trend, and why?
The table of sparklines shown here presents a rendering of the data in a way that allows us to ask, and begin to answer, such questions. It defines CPI volatility as the difference between a country’s highest and lowest CPI ranking over the 11-year period, and sorts countries from most to least volatile. Sparklines chart this data under a reference line, and distance from that line signifies descent into darkness.
To answer one of my questions: among the most volatile countries, Bangladesh, Nigeria, Georgia, and Guatemala stand out as atypically hopeful amidst a general downhill slide. That, anyway, is what Transparency International’s data seems to indicate. I’ll leave it to political experts to weigh in on the plausibility of that interpretation.
Here I’ll just ask a more basic question. We see tables, maps, and charts, like the ones published by Transparency International, all over the web. But in my experience, when you try to actually use the data, it’s almost always way too hard. In a later entry I’ll describe, in gory detail, the gymnastics required to massage the TI data and produce this visualization. But just to give you a hint, here are the six different ways of encoding Côte d´Ivoire that I found in the eleven files I had to merge:
C\xC3\xB4te d\xC2\xB4Ivoire
Cote d'Ivoire
C\xF4te-d'Ivoire
Cote d\xB4Ivoire
Cote d?Ivoire
C\xF4te d\xB4Ivoire
There were also typos (Moldovaa for Moldova), variant spellings (USA vs. United States), and format inconsistencies (empty vs. non-empty cells when a rank is repeated). Why go to all the trouble to gather and publish this kind of data, and then not consolidate it into a form we can use directly?
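As a rough illustration of what reconciling those six variants might involve, here is a minimal Python sketch. It is not the method used to build the visualization. It assumes the files are either UTF-8 or Latin-1 (a guess based on the byte sequences above), and the helper names `decode_guess` and `match_key` are made up for this example.

```python
import unicodedata

# The six byte sequences quoted above, written as Python bytes literals.
VARIANTS = [
    b"C\xC3\xB4te d\xC2\xB4Ivoire",  # looks like UTF-8
    b"Cote d'Ivoire",                # plain ASCII
    b"C\xF4te-d'Ivoire",             # looks like Latin-1
    b"Cote d\xB4Ivoire",
    b"Cote d?Ivoire",
    b"C\xF4te d\xB4Ivoire",
]

def decode_guess(raw: bytes) -> str:
    """Try UTF-8 first, then fall back to Latin-1 (an assumption; it never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def match_key(name: str) -> str:
    """Reduce a name to lowercase ASCII letters, for comparison only."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalpha()).lower()

# Collect the distinct comparison keys across all six variants.
keys = {match_key(decode_guess(raw)) for raw in VARIANTS}
print(keys)
```

The key is deliberately lossy: it throws away accents, spaces, and punctuation, so it is only good for matching rows, not for display, and two genuinely different names could in principle collapse to the same key.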
Hi John – Nice to see such an insightful and trenchant post that is also honest about the hard work that went into the data analysis. As a data geek, I couldn’t agree more with you that data scrubbing is a painful, laborious process. I’m optimistic that this may change, given some recent trends (my 2 cents are at http://www.dataspora.com/blog).
The harsh truth is this: data is messy because the world is messy. Borders shift. Metrics change. Data goes uncollected or missing.
But I’m hopeful posts like this will give the data geeks out there courage to push forward and produce informative graphics like yours — in spite of the hard work.
> The harsh truth is this: data is messy
> because the world is messy. Borders shift.
> Metrics change. Data goes uncollected or
> missing.
Agreed. And yet…in so many cases, it just ain’t rocket science. This is a simple spreadsheet:
http://jonudell.net/data/cpi.csv
If this info were just maintained in a master spreadsheet, and it had a row for Ivory Coast, and the new data came in tagged Côte d´Ivoire, it would be an obvious and trivial thing to reconcile that.
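To make the "master spreadsheet" idea concrete, here is a small Python sketch of how incoming labels might be reconciled against it. The `MASTER` list, the `ALIASES` table, and the `reconcile` function are all hypothetical, and the 0.85 fuzzy-match cutoff is an arbitrary choice for illustration.

```python
import difflib
from typing import Optional

# Hypothetical master list: one canonical name per row of the spreadsheet.
MASTER = ["Ivory Coast", "Moldova", "United States"]

# Hypothetical alias table, maintained alongside the master list.
ALIASES = {
    "Côte d´Ivoire": "Ivory Coast",
    "Cote d'Ivoire": "Ivory Coast",
    "USA": "United States",
}

def reconcile(name: str) -> Optional[str]:
    """Return the master name for an incoming label, or None if it needs review."""
    name = name.strip()
    if name in MASTER:
        return name
    if name in ALIASES:
        return ALIASES[name]
    # Fall back to a fuzzy match to catch typos such as "Moldovaa".
    close = difflib.get_close_matches(name, MASTER, n=1, cutoff=0.85)
    return close[0] if close else None

for label in ["Côte d´Ivoire", "USA", "Moldovaa", "Atlantis"]:
    print(label, "->", reconcile(label))
```

Anything that comes back as `None` goes to a person for a decision, and the answer gets added to the alias table so the question is only asked once.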
We shouldn’t need a tribe of “data geeks” to reverse-engineer simple stuff like this. Their skills should be applied to a different class of problem.
People need to begin to understand and apply some very basic principles of data management. It’s yet another example of how computational thinking needs to become one of the pillars of primary education along with reading, writing, and arithmetic.
first off, I agree, a disproportionate amount of effort is being directed at pretty trivial data management work, that should be addressed waaaay earlier in the “pipeline”.
that being said, i’m not really sure about visualizing this information as a sparkline? or, i guess more specifically, i’m not really sure that i get much value out of seeing 50+ sparklines stacked vertically on top of each other. I actually had a hard time using the presented visualizations to answer the questions you posed (trending, etc). IMO, sparklines are great micro-visualizations that can be embedded into a body of text to illustrate one specific point… however, to compare large volumes of data, wouldn’t a simple line graph have served effectively?
> compare large volumes of data, wouldn’t
> a simple line graph have served effectively?
It’s a good question. Try it and see!
Seriously, I’m not claiming this is the be-all, end-all for this data set.
Nothing ever is, really.
That said, I like the Tuftean “small multiples” idea. This is really 2 columns of a spreadsheet I made. A better version of this idea would be active, not static, and would enable sorting by country name as well as by volatility. That makes it easier to look up the trend for a particular country you’re interested in.
For comparison, I think 50 lines would be too many.
When I did the volatility sort in the spreadsheet, what really struck me was scrolling down the list and watching the sparklines a) flatten, and b) approach the reference line.
You get that same effect by scrolling down in this HTML page.
Now admittedly, the effect relies on a kind of poor-man’s-animation, a flip-card effect, if you will.
For what it’s worth, I actually think that all static infographics are challenged w/respect to visualizing change, and that we ultimately in many cases need moving pictures to best convey moving data.
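For anyone who wants to try the volatility sort themselves, here is a minimal Python sketch using the post's definition (highest rank minus lowest rank over the period). The sample data is made up, and the layout it assumes (a `country` column followed by one column per year, with blanks for missing years) is an assumption, not necessarily how cpi.csv is organized.

```python
import csv
import io

# Made-up ranks for illustration only; not Transparency International data.
SAMPLE = """country,1998,1999,2000
Alpha,10,12,30
Beta,5,5,6
Gamma,40,,22
"""

def volatility_table(csv_text):
    """Return (country, volatility) pairs, most volatile first."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        country = record.pop("country")
        # Skip blank cells, which stand for years with no ranking.
        ranks = [int(value) for value in record.values() if value.strip()]
        if ranks:
            rows.append((country, max(ranks) - min(ranks)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

for country, volatility in volatility_table(SAMPLE):
    print(country, volatility)
```

Sorting by country name instead is a one-line change to the sort key, which is the kind of switch an active version of the table would offer.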
i tried it- and yes, the simple line graph was completely meaningless (at least, the limited resolution graph that excel produced)…. i waaay underestimated the volume of data. a 100% stacked area chart was somewhat more useful, but again, i think the main limiter was resolution and “navigation” capabilities.
I think a growing/shrinking bubble visualization (e.g., hans rosling style: http://www.ted.com/index.php/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html) would be great…