January 2010


OData, the Open Data Protocol, is described at odata.org:

The Open Data Protocol (OData) is a web protocol for querying and updating data. OData applies web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores.

The other day, Pablo Castro wrote an excellent post explaining how developers can implement aspects of the modular OData spec, and outlining some benefits that accrue from each. One of the aspects is query, and Pablo gives this example:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=zipcode eq 20007

One benefit of exposing query to developers, Pablo says, is:

Developers using the Data Services client for .NET would be able to use LINQ against your service, at least for the operators that map to the query options you implemented.

I’d like to suggest that there’s a huge benefit for users as well. Consider Pablo’s example, based on some Washington, DC datasets published using the Open Government Data Initiative toolkit. Let’s look at one of those datasets, BankLocations, through the lens of Excel 2010’s PowerPivot.

PowerPivot adds heavy-duty business analytics to Excel in ways I’m not really qualified to discuss, but for my purposes here that’s beside the point. I’m just using it to show what it can be like, from a user’s perspective, to point an OData-aware client, which could be any desktop or web application, at an OData source, which could be provided by any backend service.

In this case, I pointed PowerPivot at the following URL:

http://ogdi.cloudapp.net/v1/dc/BankLocations

I previewed the Atom feed, selected a subset of the columns, and imported them into a pivot table. I used slicers to help visualize the zipcodes associated with each bank. And I wound up with a view which reports that there are three branches of WashingtonFirst Bank in DC, at three addresses, in two zipcodes.

If I were to name this worksheet, I’d call it WashingtonFirst Bank branches in DC. But it has another kind of name, one that’s independent of the user who makes such a view, and of the application used to make it. Here is that other name:

http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=name eq 'WashingtonFirst Bank'

If you and I want to have a conversation about banks in Washington, DC, and if we agree that this dataset is an authoritative list of them, then we, and anyone else who cares about this stuff, can converse using a language in which phrases like ‘WashingtonFirst Bank branches in DC’ or ‘banks in zipcode 20007’ are well defined.
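To make that concrete, here is a minimal Python sketch that composes such names from a field and a value. The helper name `odata_filter_url` is invented for illustration; the base URL and the two filters are the ones shown above. It assumes only the simple `field eq value` form, and it percent-encodes the expression so the URL survives being pasted into an email or a blog post.

```python
from urllib.parse import quote

# Base URL of the dataset discussed above.
BASE = "http://ogdi.cloudapp.net/v1/dc/BankLocations"


def odata_filter_url(base, field, value):
    """Build base?$filter=<field> eq <value> (hypothetical helper, eq only)."""
    if isinstance(value, str):
        # OData string literals are single-quoted; an embedded quote is doubled.
        literal = "'" + value.replace("'", "''") + "'"
    else:
        literal = str(value)
    expression = f"{field} eq {literal}"
    # Encode spaces and quotes in the expression; leave "$filter" itself readable.
    return f"{base}?$filter={quote(expression, safe='')}"


if __name__ == "__main__":
    print(odata_filter_url(BASE, "name", "WashingtonFirst Bank"))
    print(odata_filter_url(BASE, "zipcode", 20007))
```

The first call prints `http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=name%20eq%20%27WashingtonFirst%20Bank%27`, which is the encoded form of the name given above.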

If we incorporate this kind of fully articulated web namespace into public online discourse, then others can engage with it too. Suppose, to take just one small example, I find what I think is an error in the dataset. Maybe I think one of the branch addresses is wrong. Or maybe I want to associate some extra information with the address. Today, the way things usually work, I’d visit the source website and look for some kind of feedback mechanism. If there is one, and if I’m willing to provide my feedback in a form it will accept, and if my feedback is accepted, then my effort to engage with that dataset will be successful. But that’s a lot of ifs.

When public datasets provide fully articulated web namespaces, though, things can happen in a more loosely coupled way. I can post my feedback anywhere — for example, right here on this blog. If I have something to say about the WashingtonFirst branch at 1500 K Street, NW, I can refer to it using an URL: 1500 K Street, NW.

That URL is, in effect, a trackback that points to one record in the dataset.1 The service that hosts the dataset could scan the web for these inbound links and, if desired, reflect them back to its users. Or any other service could do the same. Discourse about the dataset can grow online in a decentralized way. The publisher need not explicitly support, maintain, or be liable for that discourse. But it can be discovered and aggregated by any interested party.
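The scanning step could start out as simply as the sketch below, which pulls the links out of one already-fetched page and keeps those that point into the dataset's namespace. The class and function names are hypothetical and the sample page is invented. A real service would also have to fetch pages, normalize URLs, and match more carefully than a plain prefix test.

```python
from html.parser import HTMLParser

# Namespace of the dataset discussed above.
DATASET_PREFIX = "http://ogdi.cloudapp.net/v1/dc/BankLocations"

# Invented sample page: one link into the dataset, one unrelated link.
SAMPLE_PAGE = """
<p>I think the list of
<a href="http://ogdi.cloudapp.net/v1/dc/BankLocations?$filter=zipcode%20eq%2020007">banks in zipcode 20007</a>
is missing a branch. See also <a href="http://example.org/">this unrelated page</a>.</p>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def inbound_links(html, prefix=DATASET_PREFIX):
    """Return the links in html that start with the dataset prefix."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.links if href.startswith(prefix)]


if __name__ == "__main__":
    for href in inbound_links(SAMPLE_PAGE):
        print(href)
```

Because the matching is done on URLs alone, any interested party can run something like this, not only the publisher.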

The open data movement, in government and elsewhere, aims to help people engage with and participate in processes represented by the data. When you publish data in a fully articulated way, you build a framework for engagement, a trellis for participation. This is a huge opportunity, and it’s what most excites me about OData.


1 PowerPivot doesn’t currently expose that URL, but it could, and so could any other OData-aware application.

I’m listening to the audio version of a very cool talk given by astronaut-turned-artist Alan Bean. (Skip the hokey intro, though, and jump in at minute 7 when he starts.)

He tells great stories about the space program, but also offers wider perspectives on life, art, and human potential.

Along the way, he tells an amusing anecdote about the famous picture of Neil Armstrong planting an American flag onto the moon’s surface. Armstrong told Bean it had been a scary moment, and Bean asked why. Armstrong said (as paraphrased by Bean):

Well, I couldn’t get that flag into the ground, like in training. Up there, those particles in the dirt aren’t rounded like regular sand. On Earth I would just do like that, and it would go in. But up there I did like that and it didn’t go in.

I imagined that when I let go, it would fall into the dirt, and people all over the world would see the American flag fall into the dirt. So I tipped it back until the center of gravity was over the hole. Then I put a little dirt around it. I knew that if I could get it balanced, and get away from it, that without any wind it would stay balanced. So that’s what we did. We got away from it, and we never got close to it again.

Bean adds: “It probably blew over when they launched, but it didn’t make any difference. That’s an engineer’s solution!”

What a great hack!

Today on a conference call I was reminded of another. A few years ago, in an airport, I saw a guy with a cellphone in one hand and a payphone in the other. His ear, brain, and mouth were trying to bridge two phone networks together, it wasn’t working well, and he was visibly frustrated. Finally he removed his head from between the two phones, stuck them together, and reversed them earphone-to-microphone, so the two parties were talking directly to each other.

My conference call today presented a different version of that scenario. It was scheduled as a VOIP call, then was switched to a POTS call, but not everybody got the memo. So I made the POTS call. And since I have a podcast rig that lets me do POTS calls through my computer, using the same headset I use for VOIP, I made the call that way.

Then people started to show up on both the POTS side and the VOIP side. I realized that, unexpectedly, I was hearing both sides and they were hearing me. Both were being conveyed through my computer’s audio subsystem. I was just like the guy with the cellphone on one ear and the payphone on the other.

It would have been cool to do the same kind of earphone-to-microphone hack. But before I got the chance to try, the VOIP folks hung up and dialed back in on the POTS side.

Oh well, maybe next time.

The soundtrack for yesterday’s run was a talk by primatologist Richard Wrangham1, author of Catching Fire: How Cooking Made Us Human. Cooking, he says, has long been thought to be an optional cultural practice, like wearing jewelry. But really, he argues, cooking was the essential technological innovation that enabled us to produce the metabolic energy we needed to become human.

How? Cooked food is more digestible than raw food. And not just by a little, but by a lot. Learn how to control fire, use it to cook your food, and you free up extra energy — plus time that would otherwise be spent masticating. Spend that time hunting, and your metabolic equation gets even better.

Wrangham has fascinating things to say about how this surplus time and energy explains such cultural universals (or former universals) as marriage, sexual division of labor, and the family dinner. Whether you agree or disagree with this analysis, though, it’s supported by an attention-grabbing claim. Everything we thought we knew about absorption of energy from food is wrong.

To this day, Wrangham says, the USDA website2 publishes tables that make no distinction between the nutritional value of cooked and raw food. On this page, for example, the energy content of one large raw egg is given as 75 kcal. The value for one large hard-boiled egg is almost the same: 78 kcal.

This is wrong, Wrangham says. A cooked egg delivers way more energy than a raw egg. How could this be? And how could we not know it?3

Here’s the explanation. We have traditionally measured the energy content of food by comparing input (the food we eat) and output (the feces we excrete). Burn both in a calorimeter, subtract, and the difference is the energy that was extracted from the food.

Yes, but extracted by whom? Or rather, by what? The energy that we humans take from our food has almost all been extracted by the time it reaches the end of the small intestine. But it has a long way to go yet. It must also pass through the large intestine, where dwell a myriad of gut flora. And they, Wrangham says, are hungry. If you eat a raw banana you only get some of its energy, and they get most of the remainder. If you eat a cooked banana, though, you get a lot more of its energy and leave less for them. The end result looks the same, but the internal distribution is quite different.

So you need to compare the energy in food entering the mouth to the energy remaining in the digestive products leaving the small intestine.4 Only then does the dramatic difference between the energies we get from raw versus cooked food become evident.
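The bookkeeping can be written out symbolically. The symbols below are introduced here only for illustration, and the split assumes, as a simplification of Wrangham's account, that whatever energy disappears in the large intestine goes to the gut flora rather than to us.

```latex
\begin{align*}
E_{\text{measured}} &= E_{\text{food}} - E_{\text{feces}}
  && \text{(traditional mouth-to-feces measurement)} \\
E_{\text{human}} &= E_{\text{food}} - E_{\text{ileum}}
  && \text{(extracted by the end of the small intestine)} \\
E_{\text{flora}} &= E_{\text{ileum}} - E_{\text{feces}}
  && \text{(extracted in the large intestine)} \\
E_{\text{measured}} &= E_{\text{human}} + E_{\text{flora}}
\end{align*}
```

The traditional method reports only the sum. If cooking moves energy from the flora term to the human term, the sum can stay nearly the same, which is how a raw egg and a hard-boiled egg can end up with almost identical table values.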

This is a great parable about instrumentation, measurement, knowledge, and epistemology. What other profound errors of basic understanding arise from misplaced instrumentation? And what might we learn by making simple — and in retrospect obvious — adjustments?


1 Yet another podcast from KUOW’s Speakers’ Forum, which has become one of my most reliable sources of audio brain food.

2 A sad reminder that government website and chamber of horrors are still, too often, synonymous.

3 The error, if it is indeed an error, propagates to WolframAlpha, which sources the USDA data. Compare 100 g of raw egg to 100 g of cooked egg.

4 How do you tap in at that point? Recruit people who have had ileostomies.

If you’re interested in the use of computers and networks to support collaboration, you’ll have heard of PLATO. It was an early courseware system, and by early I mean circa 1960, running on vacuum tubes. But it was also a petri dish in which much of what we now know as online culture first evolved.

I’ve long known that PLATO inspired many other systems, including VAX Notes and Lotus Notes. But I never heard the backstory. So when I found out that Brian Dear is completing a history of PLATO, and planning a conference to commemorate its 50th anniversary, I invited him onto my weekly show to find out more about it. PLATO matters, Brian says, because

it challenges our assumptions of how the online world evolved. It rewrites the history. It’s as if we discovered Wilbur and Orville Wright were not the first to fly a powered plane — that it’d been done faster and longer with a jet aircraft 30 years earlier.

Of course the same can be said of other early technologies, notably Smalltalk, which introduced ideas and methods that are only now hitting the mainstream. It’s fun to wax nostalgic, but I’d rather explore how these systems arose, why they flourished, and what accounts for the propagation of their memes but not their genes.

From that perspective Brian reminds us, first, that PLATO was expensive. Few universities were willing or able to invest millions in a Control Data mainframe and a fleet of gas-plasma flat-panel bitmapped touch-screen display terminals. Those terminals enabled some extraordinary things, like the interactive music software that captivated Brian as a University of Delaware undergrad. They also enabled a now-extinct species of emoticons, which relied on the bitmapped graphics. But since much of what became PLATO’s essential DNA required only character-mapped graphics, those expensive bitmapped screens became an evolutionary bottleneck.

Another feature that didn’t pass through that bottleneck was PLATO’s ability to make sense of natural language input. Many thousands of programmer hours were invested in enabling PLATO to recognize a variety of human utterances. That in turn enabled courseware authors to create lessons that responded intelligently — and, Brian says, in ways that are sadly still not typical of modern courseware.

Today we can attack that problem by creating open source libraries, by reusing them, and by extending them. That’s a great way to create DNA that can propagate. But it’s useful to consider why it might not. We still, for the most part, create dependencies on specific programming languages, and on the environments in which they run.

As we move into an era of services, though, we can start to imagine a more fluid environment in which capabilities persist across language and system boundaries. Consider this exhibit from an antique PLATO library:

This is a screenshot from the live PLATO system running (in emulation) at cyber1.org. It’s a page from the catalog of functions in PLATO’s CYBIS library. Shown here are some of the methods available to process responses to questions.

Some of those methods might still be useful. And if they’d been packaged in a language- and system-independent way, some might conceivably still be in use.

PLATO programmers didn’t have the option to package their work in such a way. Now we’re on the cusp of an era in which these kinds of library services can also be language- and system-independent web services. Will we exploit this new possibility? Will some of today’s core services still be delivering value decades from now, freeing developers to add value farther up the stack? It’s worth pondering.

A while back I reviewed the reading machine that my mom, who suffers from macular degeneration, now depends on. I gave it a thumbs up, but also noted that she was having some problems.

On my last visit I came up with a method that will help, if she can get the hang of it. The method is non-obvious, and isn’t documented anywhere I’ve been able to find, so I made a short movie to illustrate it.

The key insights are:

Use the left margin screw to set a left margin somewhere

It almost doesn’t matter where, you just need a guide for carriage returns.

Position the book and the tray

Getting this right makes a huge difference. My mom was constantly fiddling with the position of the book on the tray. This frustrated her, and seriously impaired her ability to read fluidly.

But if you position the tray correctly, and the book relative to the tray, then you can easily read the whole page without touching or moving the book at all. Here’s how:

Align the bottom left corner of the book with the bottom left corner of the screen.

This is counter-intuitive. The natural expectation is to start at the top of the page. And you do want to start reading there. But I found that establishing a bottom margin is a crucial first maneuver, and it involves three steps:

1. Push tray all the way forward and rightward

2. Place book on tray

3. Move book to align bottom left corner of page with bottom left corner of screen

With the tray still as far forward and as far right as it will go, you have defined both a left margin and a bottom margin for the page. Now read the whole page without touching the book again. Here’s how:

Find the top of the page.

To do that you pull the tray out (forward, towards yourself) until the top margin of the page lines up with the top of the screen.

Read as many lines vertically as the screen can display.

Use only a two-stroke left/right motion of the tray. The sequence is:

1. Slide tray left to reveal ends of lines

2. Slide tray right for carriage return

My mom had been advancing the tray (by pushing it in) once per line. This wastes effort and disrupts context. If the left margin screw is set, a carriage return always goes to the same place. So it was easy — at least for her — to make a visual connection from the end of the previous line to the beginning of the next one.

I realize this part may not work for everyone, and maybe not even for her as her vision worsens. Right now, at her magnification, her screen can display 8 or 10 lines. At higher magnification, when only a few are visible, there will be less context to help make that connection. Then it may become necessary to scroll vertically once per line. But the longer that can be avoided, the better.

Why was this necessary?

Shouldn’t multi-thousand-dollar gizmos like this come with training materials that help people figure this stuff out? Yes, but I’ve given up being shocked that they don’t.

If you’ve got a friend or relative in the same boat, let me know if this writeup — and/or the accompanying video — makes sense.

A note on making the movie

The video combines slides with a side-by-side animation of the tray and the screen. I wound up using PowerPoint, which conveniently handles the three ingredients: text, bitmap graphics, and vector graphics.

Rather than use PowerPoint’s animation features, though, I made a sequence of frames, nudging objects by small increments from frame to frame. This turned out to be a surprisingly easy and approachable technique.

Then I turned on a screen recorder — I used Camtasia, but it could have been any other — and stepped through the frames.

On this week’s podcast, Greg Wilson tells the story of a university course he created, and has taught for many years, called Software Carpentry. I have known Greg for a long time. We are kindred spirits in several ways. Most notably, we like to mine veins of knowledge, experience, and technique that some practitioners take for granted, but that many others haven’t yet discovered — or don’t yet use as well as they could.

I, for example, wonder why we don’t teach everyone basic principles of structured information, namespace design, and syndication. Greg, similarly, wonders why student programmers — and student scientists whose careers increasingly depend on computational methods — are not taught basic principles of version control, debugging, and refactoring. And why we don’t read great software in the same way we read great literature or study landmark scientific experiments. And why the controlled reproducibility of commercial software development isn’t typical of computational science.

If you care about these issues, there are two ways you can help. First, take a look at the reboot of the Software Carpentry course that Greg’s experience has led him to propose. Second, help him find the funding to keep doing this work.

On FiveThirtyEight.com the other day, Andrew Gelman posted this chart illustrating the high cost of US health care:

He did so to correct a “somewhat misleading (in my opinion) presentation of these numbers [that] has been floating around on the web recently.” The misleading graph, which appeared on a National Geographic blog, was — I agree — a confusing way to show information better represented in a scatterplot.

But I’ve seen this data before, and there’s more to the story. Neither the National Geographic nor FiveThirtyEight has anything to say about which numbers they’re charting.

Back in 2005, in a review of John Abramson’s excellent book Overdo$ed America, I noted that he had used a different source to reach a slightly different conclusion.

His chart, based on OECD health-expenditure data (link now 404) and WHO healthy life expectancy data (link still alive), looked like this:

He used it to make the oft-cited point that US healthcare isn’t just wildly expensive, but that it also correlates with worse life expectancy than in many countries that spend less.

I wondered what the chart would look like if based on the same OECD expenditure data but on the OECD’s rather than the WHO’s definition of life expectancy. The result looked like this:

The U.S. is the clear cost outlier on both charts. The first chart, however, places us near the low end of the life expectancy range, justifying Abramson’s assertion that we combine “poor health and high costs.” The second chart places us near the high end of the life expectancy range, suggesting that while value still isn’t proportional to cost, we’re at least buying more value than the first chart indicates.

Although based on older data, this second chart closely resembles the ones recently shown and discussed by the National Geographic and FiveThirtyEight.

My review of Abramson’s book concluded:

Has Abramson spun the data to make his point, just as he accuses the pharmaceutical industry of doing? Of course. Everybody spins the data. What matters is that:

  • Everybody can access the source data, as we can in the case of Abramson’s book but cannot (he argues) in the case of much medical research
  • The interpretation used to drive policy expresses the values shared by the citizenry

Would we generally agree that we should measure the value of our health care in terms of healthy life expectancy, not raw life expectancy? That the WHO’s way of assessing healthy life expectancy is valid? These are kinds of questions that citizens have not been able to address easily or effectively. Pushing the data and surrounding discussion into the blogosphere is the best way — arguably the only way — to change that.

That was five years ago. The data was, and is, out there. So it’s disheartening to see the same chart pop up again without any further discussion of the sources of its data, or of the definitions underlying those sources.

On this week’s Innovators show, Doug Day joins me to discuss the new iCalendar validator he has recently deployed on Azure.

The project draws inspiration from the pathbreaking RSS/Atom feed validator originally created by Mark Pilgrim and Sam Ruby. The RSS/Atom validator’s test-driven and advice-oriented approach is exemplary, and the iCalendar validator follows in its footsteps.

The tests, in this case, are iCalendar snippets that are, or are not, valid according to the spec. These snippets, packaged into XML files, form a library of examples that does not depend on the programming language used to run the tests. So although Doug’s validator, based on his open source parser, is written in C#, another validator written in Java or Python or Ruby could use the same test suite.
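To see why such a suite is portable, consider this rough Python sketch of a runner. Everything about the XML layout here (the `tests`, `test`, `description`, and `ical` elements and the `expect` attribute) is invented for illustration and is not the format Doug's suite actually uses. The `toy_validator` is a deliberately trivial stand-in for a real validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical suite layout: each case pairs an iCalendar snippet with a verdict.
SUITE = """<tests>
  <test expect="valid">
    <description>Minimal calendar wrapper</description>
    <ical>BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//example//sketch//EN
END:VCALENDAR</ical>
  </test>
  <test expect="invalid">
    <description>Missing END:VCALENDAR</description>
    <ical>BEGIN:VCALENDAR
VERSION:2.0</ical>
  </test>
</tests>"""


def toy_validator(ical_text):
    """Stand-in check: the text must be wrapped in BEGIN/END:VCALENDAR."""
    lines = [line.strip() for line in ical_text.strip().splitlines()]
    return bool(lines) and lines[0] == "BEGIN:VCALENDAR" and lines[-1] == "END:VCALENDAR"


def run_suite(xml_text, validator):
    """Return (description, passed) for every case in the suite."""
    results = []
    for case in ET.fromstring(xml_text).iter("test"):
        expected = case.get("expect") == "valid"
        actual = validator(case.findtext("ical", default=""))
        results.append((case.findtext("description"), expected == actual))
    return results


if __name__ == "__main__":
    for description, passed in run_suite(SUITE, toy_validator):
        print("PASS" if passed else "FAIL", description)
```

Nothing in the suite refers to Python. A runner in C#, Java, or Ruby would read the same file and plug in its own validator.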

The advice offered is minimal so far, but I hope will expand as the test suite grows. Sam Ruby observes:

Identifying real issues that prevent real feeds from being consumed by real consumers and describing the issue in terms that makes sense to the producer is what most would call value.

In that spirit, I am gathering examples of calendars in the wild and looking for ways to help Doug add value.

In the podcast we discuss a nice example that came up recently in the curators’ room of the elmcity project. A custom-built calendar contained events (VEVENT components, in iCalendar-speak) with no start or end times (DTSTART and DTEND properties). This, it turns out, is not prohibited by the spec. But reporting no error is unhelpful. The author of the calendar — or of the software that produced the calendar — ought to be warned that such a calendar won’t yield a useful or expected result.
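A warning of that kind could be produced by a check like the following Python sketch. It is not how Doug's validator works. The function names are made up, and the parsing is deliberately naive: it unfolds continuation lines and looks only at property names inside each VEVENT.

```python
def unfold(text):
    """Join iCalendar continuation lines (those starting with a space or tab)."""
    lines = []
    for raw in text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def property_name(line):
    """Return the upper-cased name before the first ';' or ':' in a content line."""
    for i, ch in enumerate(line):
        if ch in ";:":
            return line[:i].upper()
    return line.upper()


def events_missing_start(text):
    """Return one warning string per VEVENT that has no DTSTART property."""
    warnings = []
    in_event = False
    seen = set()
    index = 0
    for line in unfold(text):
        marker = line.strip().upper()
        if marker == "BEGIN:VEVENT":
            in_event, seen = True, set()
            index += 1
        elif marker == "END:VEVENT":
            if in_event and "DTSTART" not in seen:
                warnings.append(
                    f"VEVENT #{index}: no DTSTART; consumers may not be able "
                    "to place this event on a calendar"
                )
            in_event = False
        elif in_event:
            seen.add(property_name(line))
    return warnings
```

Run against a calendar whose second event has only a SUMMARY, `events_missing_start` returns a single warning that names VEVENT #2. That is a hint to the producer rather than a verdict of invalidity.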

Why would anyone produce such a calendar in the first place? This harkens back to the early days of RSS. Many of us found that we could craft simple ad-hoc feeds in order to leverage RSS as a lightweight data exchange. It was liberating to be able to do that. But hand-crafted feeds, or feeds written by hand-crafted software, were valuable only to the extent they would reliably interoperate. Often they would not. The feed validator, by showing what was wrong with these feeds, and explaining why and how to fix them, was a powerful ally for those of us trying to bootstrap a feed ecosystem.

The iCalendar validator has a long way to go yet. But the road ahead is well lit, and I’m grateful to Doug Day for resolving to travel it.

The other day I listened to a Spark (CBC Radio) interview with Larry Lessig about his New Republic essay Against Transparency, which begins:

We are not thinking critically enough about where and when transparency works, and where and when it may lead to confusion, or to worse. And I fear that the inevitable success of this movement–if pursued alone, without any sensitivity to the full complexity of the idea of perfect openness–will inspire not reform, but disgust. The “naked transparency movement,” as I will call it here, is not going to inspire change. It will simply push any faith in our political system over the cliff.

The essay was published in October 2009. In this interview from November, Prof. Lessig reflected on the reactions that it provoked. Although the delicious and bitly feedback now suggests that most people understood the essay to be a thoughtfully nuanced critique, there were evidently some early responders who read it as a retreat from openness and an assault on the Internet.

I’m glad I missed the essay when it first appeared. Reading it along with a cloud of feedback from readers and from the author amplifies one of the key points: We don’t really want naked transparency, we want transparency clothed in context.

The Net can be an engine for context assembly, a wonderful phrase I picked up years ago from Jack Ozzie and echoed in several essays. But it can also be a context destroyer.

In the interview, Lessig notes one example of context destruction. The article, which most people will read online, spans eleven pages, each of which wraps its nugget of “content” in layers of distraction. Some early negative comments, Lessig says, came from people who had clearly not read to the end.

Our increasingly compressed and fragmented attention can also be a context destroyer:

What about when the claims are neither true nor false? Or worse, when the claims actually require more than the 140 characters in a tweet?

This is the problem of attention-span. To understand something–an essay, an argument, a proof of innocence–requires a certain amount of attention. But on many issues, the average, or even rational, amount of attention given to understand many of these correlations, and their defamatory implications, is almost always less than the amount of time required. The result is a systemic misunderstanding–at least if the story is reported in a context, or in a manner, that does not neutralize such misunderstanding. The listing and correlating of data hardly qualifies as such a context. Understanding how and why some stories will be understood, or not understood, provides the key to grasping what is wrong with the tyranny of transparency.

Transparency is a necessary but not a sufficient condition. Recently my town’s crime data and council meetings have appeared online. But this remarkable transparency does not alone enable the sort of collaborative sense-making that we all rightly envision.

In the case of crime data, we require a context that includes historical trends, regional and national comparisons, guidance from government about how its local taxonomy relates to regional and national taxonomies, and reporting by newspapers and citizens.

In the case of city council meetings, we require a context that includes relevant state law and local code, and reporting by stakeholders, by newspapers, and by affected citizens.

To enable context assembly, we’ll need to organize the numeric and narrative data produced by the “naked transparency” movement in ways friendly to linking, aggregation, and discovery.

But these principles will need to be adopted more broadly than by governments alone. Everyone needs to understand the principles of linking, aggregation, and discovery, so that everyone can help create the context we crave.