Polymath = user innovation

In February 2007, Mike Adams, who had recently joined Automattic, the company that makes WordPress, decided on a lark to endow all blogs running on WordPress.com with the ability to use LaTeX, the venerable mathematical typesetting language. So I can write this:

$latex \pi r^2$

And produce this:

πr²

When he introduced the feature, Mike wrote:

Odd as it may sound, I miss all the equations from my days in grad school, so I decided that what WordPress.com needed most was a hot, niche feature that maybe 17 people would use regularly.

A whole lot more than 17 people cared. And some of them, it turns out, are Fields medalists. Back in January, one member of that elite group — Tim Gowers — asked: Is massively collaborative mathematics possible? Since then, as reported by observer/participant Michael Nielsen (1, 2), Tim Gowers, Terence Tao, and a bunch of their peers have been pioneering a massively collaborative approach to solving hard mathematical problems.

Reflecting on the outcome of the first polymath experiment, Michael Nielsen wrote:

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

Just this week, a polymath blog has emerged to serve as an online home for the further evolution of this approach.

I am completely unqualified to evaluate the nature of mathematical discourse that’s going on in these polymath collaborations, or the claims being made regarding outcomes. But it sure makes my spidey-sense tingle.

I am, however, qualified to evaluate the nature of the collaborative methods being employed. And on that front, I’m amused (and chagrined) to recall something I wrote back in 2000, in a report called Internet groupware for scientific collaboration. The report was commissioned by Greg Wilson, who organized this week’s Science 2.0 event in Toronto. At that event, my report served as a historical frame for the polymath experimentation that’s going on right now, and that Michael Nielsen discussed at the Toronto event in an updated version of this talk.

In my 2000 report I said:

TeX and LaTeX define scientific publishing for a generation of scientists. But these formats don’t integrate directly into the shared spaces of the Web. The rise of XML as a universal markup language, along with vocabularies such as MathML (for mathematical notation) and SVG (for scalable vector graphics), suggests that the Web may yet reach its original collaborative goal.

Why didn’t I see, then, that the crux of the issue wasn’t XML and MathML and SVG, but rather the ability to “integrate directly into the shared spaces of the Web”? And that what ought to be integrated directly was the typesetting language already familiar to mathematicians, namely LaTeX?

The answer is that I needed (and still need) to be reminded that good-enough solutions that are here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.

From that perspective, I’m wondering what will and won’t turn out to be good enough for the polymathematicians. The current setup is admittedly imperfect, and they’re now beginning to explore WordPress plugins that enable, for example, more powerful ways to organize, reply to, and refer to one another’s comments.

I don’t think anybody yet knows what the right tooling will be for polymathematical collaboration. The ones who are best qualified to figure it out are the polymathematical collaborators themselves, but they are not WordPress plugin developers.

What’s needed is what Eric von Hippel calls a user innovation toolkit. The idea is this: Leading users, as they employ a tool, also modify it, and in so doing they express intentions that tool developers can then capture and formalize.

If you look at the systems of notation that the polymathematicians are creating in order to organize and refer to their contributions in these long and complex threads of mathematical discourse, you can see intentions being expressed. So arguably, WordPress is a user innovation toolkit, and we’ll see these innovations codified in future plugins. I’ll be watching with great interest.

Update: As per Jonathan Fine’s comment below, it appears that MathTran.org has offered the same kind of service for quite a while now:

Talking with Mike Dunn about practical uses of semantic technology

My guest for this week’s Innovators show is Mike Dunn, a veteran media technologist who recently attended, and spoke at, the 2009 Semantic Technology Conference. Mike and I were both impressed by Tom Tague’s keynote talk, which avoided theory and focused on practical ways that here-and-now semantic technologies are helping media businesses work smarter and more profitably. In this conversation, Mike describes some of the ways that his company, Hearst Media Interactive, is proving that point.

Search engine optimization is currently one of the best ways to profit from data-enabled content. Meanwhile, one of the expected benefits of semantic technology — better search recall and precision — hasn’t materialized. But although most users may not care about querying archives more comprehensively and more precisely, writers and editors should. And not only because it helps automate the assembly of context around a current story. If you can review an archive in a precise and comprehensive way, you can do a better job of planning future stories that acknowledge — and advance — the ones you’ve already done.

Topical event hubs

The elmcity project began with a focus on aggregating events for communities defined by places: cities, towns. But I realized a while ago that it could also be used to aggregate events for communities defined by topics. So now I’m building out that capability. One early adopter tracks and promotes online events in the e-learning domain. Another tracks and promotes conferences and events related to environmentally-sustainable business practices.

The curation method is very similar to what’s defined in the elmcity project FAQ. To define a topic hub you use a Delicious account, you create a metadata URL as shown in the FAQ, and you use what= instead of where= to define a topic instead of a location. Since there’s no location, there’s no aggregation of Eventful and Upcoming events. The topical hub is driven purely by your registry of iCalendar feeds.

If you (or somebody you know) need to curate events by topic, and would like to try this method, please get in touch. I’d love to have you help me define how this can work, and discover where it can go.

Why we need an XML representation for iCalendar

On this week’s Innovators show I got together with two of the authors of a new proposal for representing iCalendar in XML. Mike Douglass is lead developer of the Bedework Calendar System, and Steven Lees is Microsoft’s program manager for FeedSync and chair of the XML technical committee in CalConnect, the Calendaring and Scheduling Consortium.

What’s proposed is no more, but no less, than a well-defined two-way mapping between the current non-XML-based iCalendar format and an equivalent XML format. So, for example, here’s an event — the first low tide of 2009 in Myrtle Beach, SC — in iCalendar format:

BEGIN:VEVENT
SUMMARY:Low Tide 0.39 ft
DTSTART:20090101T090000Z
UID:2009.0
DTSTAMP:20080527T000001Z
END:VEVENT

And here’s the equivalent XML:

<vevent>
  <properties>
    <dtstamp>
      <date-time utc='yes'>
        <year>2008</year><month>5</month><day>27</day>
        <hour>0</hour><minute>0</minute><second>1</second>
      </date-time>
    </dtstamp>
    <dtstart>
      <date-time utc='yes'>
        <year>2009</year><month>1</month><day>1</day>
        <hour>9</hour><minute>0</minute><second>0</second>
      </date-time>
    </dtstart>
    <summary>
      <text>Low Tide 0.39 ft</text>
    </summary>
    <uid>
      <text>2009.0</text>
    </uid>
  </properties>
</vevent>

The mapping is quite straightforward, as you can see. At first glance, the XML version just seems verbose. So why bother? Because the iCalendar format can be tricky to read and write, either directly (using eyes and hands) or indirectly (using software). That’s especially true when, as is typical, events include longer chunks of text than you see here.
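To give a feel for how mechanical the mapping is, here’s a minimal Python sketch that turns the tide event above into the XML layout shown. It is not the proposal’s reference implementation. The function name vevent_to_xml is made up for illustration, and the sketch assumes the simplest possible input: one VEVENT, UTC date-times, plain text values, no property parameters, no folded lines, and no escaping.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper, for illustration only. It handles just the two value
# shapes in the example: UTC date-times and plain text.
DATE_TIME = re.compile(r"^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$")
PARTS = ("year", "month", "day", "hour", "minute", "second")


def vevent_to_xml(ical_text):
    """Map one simple VEVENT block to the XML layout shown above."""
    vevent = ET.Element("vevent")
    properties = ET.SubElement(vevent, "properties")
    for line in ical_text.strip().splitlines():
        name, _, value = line.strip().partition(":")
        if name in ("BEGIN", "END"):
            continue  # the wrapper lines become the <vevent> element itself
        prop = ET.SubElement(properties, name.lower())
        match = DATE_TIME.match(value)
        if match:
            date_time = ET.SubElement(prop, "date-time", utc="yes")
            for part, digits in zip(PARTS, match.groups()):
                # int() drops leading zeros, as in the example (05 becomes 5)
                ET.SubElement(date_time, part).text = str(int(digits))
        else:
            ET.SubElement(prop, "text").text = value
    return vevent


event = """BEGIN:VEVENT
SUMMARY:Low Tide 0.39 ft
DTSTART:20090101T090000Z
UID:2009.0
DTSTAMP:20080527T000001Z
END:VEVENT"""

xml_event = vevent_to_xml(event)
print(ET.tostring(xml_event, encoding="unicode"))
```

Once the event is XML, any off-the-shelf XML tool can query it. For example, xml_event.find("properties/summary/text").text retrieves the summary without any iCalendar-specific parsing.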

I make an analogy to the RSS ecosystem. When I published my first RSS feed a decade ago, I wrote it by hand. More specifically, I copied an existing feed as a template, and altered it using cut-and-paste. Soon afterward, I wrote the first of countless scripts that flowed data through similar templates to produce various kinds of RSS feeds.

Lots of other people did the same, and that’s part of the reason why we now have a robust network of RSS and Atom feeds that carries not only blogs, but all kinds of data packets.

Another part of the reason is the Feed Validator which, thanks to heroic efforts by Mark Pilgrim and Sam Ruby, became and remains the essential sanity check for anybody who’s whipping up an ad-hoc RSS or Atom feed.

No such ecosystem exists for iCalendar. I’ve been working hard to show why we need one, but the most compelling rationale comes from a Scott Adams essay that I quoted from in this blog entry. Dilbert’s creator wrote:

I think the biggest software revolution of the future is that the calendar will be the organizing filter for most of the information flowing into your life. You think you are bombarded with too much information every day, but in reality it is just the timing of the information that is wrong. Once the calendar becomes the organizing paradigm and filter, it won’t seem as if there is so much.

If you buy that argument, then we’re going to need more than a handful of applications that can reliably create and exchange calendar data. We’ll want anyone to be able to whip up a calendar feed as easily as anyone can now whip up an RSS/Atom feed.

We’ll also need more than a handful of parsers that can reliably read calendar feeds, so that thousands of ad-hoc applications, services, and scripts will be able to consume all the new streams of time-and-date-oriented information.

I think that a standard XML representation of iCalendar will enable lots of ad-hoc producers and consumers to get into the game, and collectively bootstrap this new ecosystem. And that will enable what Scott Adams envisions.

Here’s a small but evocative example. Yesterday I started up a new instance of the elmcity aggregator for Myrtle Beach, SC. The curator, Dave Slusher, found a tide table for his location, and it offers an iCalendar feed. So the Myrtle Beach calendar for today begins like this:

Thu Jul 23 2009

WeeHours

Thu 03:07 AM Low Tide -0.58 ft (Tide Table for Myrtle Beach, SC)

Morning

Thu 06:21 AM Sunrise 6:21 AM EDT (Tide Table for Myrtle Beach, SC)
Thu 09:09 AM High Tide 5.99 ft (Tide Table for Myrtle Beach, SC)
Thu 10:00 AM Free Coffee Fridays (eventful: )
Thu 10:00 AM Summer Arts Project at The Market Common (eventful: )
Thu 10:00 AM E.B. Lewis: Story Painter (eventful: )

Imagine this kind of thing happening on the scale of the RSS/Atom feed ecosystem. The lack of an agreed-upon XML representation for iCalendar isn’t the only reason why we don’t have an equally vibrant ecosystem of calendar feeds. But it’s an impediment that can be swept away, and I hope this proposal will finally do that.

Late July in Toronto: DemoCamp and Science 2.0

On Tuesday July 28 I’ll be at the Toronto DemoCamp. I’m looking forward to meeting the designers and developers who’ll be there, seeing what you’re working on, and showing you what I’m working on.

The following day I’ll be speaking at a Science 2.0 event organized by my friend Greg Wilson. Here are the forward-thinking scientists I’ll be joining:

  • Titus Brown: Choosing Infrastructure and Testing Tools for Scientific Software Projects
  • Cameron Neylon: A Web Native Research Record: Applying the Best of the Web to the Lab Notebook
  • Michael Nielsen: Doing Science in the Open: How Online Tools are Changing Scientific Discovery
  • David Rich: Using “Desktop” Languages for Big Problems
  • Victoria Stodden: How Computational Science is Changing the Scientific Method

I am not a scientist, nor do I play one on TV, so why me? Because back in 2000, Greg commissioned me to write a report entitled Internet Groupware for Scientific Collaboration. Greg was then working with the Los Alamos National Laboratory on ways to help scientists make better use of the tools of computation as well as the methods of online collaboration. I had recently finished my book Practical Internet Groupware, I was exploring what we would now call the Web 2.0 landscape, and I was thinking and writing a lot about how these open and loosely-coupled modes of communication could enable the sort of collaboration at the core of science (and other kinds of academic endeavors) in powerful new ways.

Nearly a decade later, that vision is becoming a reality. I’m really excited to meet these folks, whose adventures I’ve been following through their blogs, and hear about their experiences at the forefront of what I believe will be a new golden age of science.

In my own talk, I’ll review how my own current project tackles the challenge of social information management, and aims to democratize the computational way of thinking that enables us to wire the web.

Tinker to Evers to Chance, Tripit to Dopplr to Facebook

A few months back I observed:

Tripit, meet Dopplr. Dopplr, Tripit. You two should really get to know one another.

Richard Akerman replied:

You can feed TripIt’s ical output into Dopplr, I hear (I haven’t tried it)

That remark should have rung a loud bell for me, but somehow it didn’t. Then, yesterday, in conversation with James Senior, the bell rang. We were talking about how many services publish and/or subscribe to iCalendar feeds, how few people know that, and how much latent capability is being left on the table. Paraphrasing James:

I’ll give you a perfect example. I use Tripit, it’s a wonderful service. You email it your travel itinerary, and it organizes all your information for you. But I’ve been frustrated not to be able to share that information with my friends on Facebook. I also use Dopplr, and Dopplr talks to Facebook, but Tripit doesn’t. Then I realized that Tripit publishes an iCalendar feed, and that Dopplr can subscribe to iCalendar feeds. So I made that connection, and now my Tripit events are showing up in Facebook.

Brilliant. Look:

How did I miss that? Me, of all people, Mr. Splice-Everything-To-Everything, Mr. Find-Unintended-Uses-Of-Software, Mr. Cosmic-Significance-Of-Pub-Sub, Mr. Champion-Of-The-Underutilized-iCalendar-Standard, Mr. Computational-Thinking?

Because wiring the web is still too abstract, too convoluted, and too non-obvious — even, sometimes, for me.

The phrase wiring the web comes from Ray Ozzie, by the way. At ETech in 2006, he demoed a concept called Live Clipboard. From my InfoWorld writeup:

Subscribing to an RSS feed, for example, has never conformed to any familiar user-interface pattern. Soon copying and pasting RSS feeds will feel natural to everyone, and Ozzie hopes the copy/paste metaphor will also make advanced capabilities more accessible. Consider my LibraryLookup bookmarklet. Dragging it onto the browser’s toolbar isn’t something easily understood or explained. Using the clipboard as the wiring junction will make a lot more sense to most people.

The same metaphor can accommodate what I’ve called lightweight service composition and what Ozzie calls “wiring the Web.” He showed how RSS feeds acting as service end points can be pasted into apps to create dynamically updating views. Virtually anyone can master this Tinkertoy approach to self-serve mashups.

This was, and remains, a crucial insight. From now on, we are all going to be wiring the web in one way or another. And we’re going to need a conceptual frame in which to do that — ideally, a user-interface metaphor that’s already familiar. Maybe it’s as simple as copy/paste. Maybe it’s more like Yahoo! Pipes or Popfly blocks. Whatever it turns out to be, we need to invent and deploy a universal junction box for wiring the web.

Talking with Peter O’Toole about gathering clinical data and sharing medical knowledge

My guest for this week’s Innovators show is Peter O’Toole from mTuitive, a company whose authoring toolkit for clinical data collection I featured in a 2006 screencast. mTuitive is working at the intersection of a number of disciplines that all need to come together to deliver cheaper and better health care.

First, usability. Designing clinical data gathering systems that capture what’s right for the patient, along with what’s mandated by the insurance company, requires a careful balancing of constraints and freedom in software user interfaces.

Second, knowledge engineering. Clinical systems don’t merely record data, they embody medical protocols that reflect an ever-changing consensus about methods and best practices. mTuitive’s authoring system aims to enable leading practitioners to encode that knowledge in ways that can then guide others. But knowledge grows at the edge as well as at the center. So mTuitive also enables practitioners to extend and modify the software, injecting local knowledge and custom. Who owns this knowledge? Who’s liable for the consequences of its use? These are some of the implications we discussed.

Third, semantics. Electronic medical records are still mainly narrative in form, says Peter O’Toole. But we’re moving toward more computable ways of describing observations about, say, the nature and size of tumors.

Fourth, social software. My hunch, and Peter O’Toole’s too, is that progress toward the nirvana of medical records that are both semantically rich and interoperable will be powered by a two-stroke engine. One stroke of the piston will be driven by centrally-defined standards and centrally-imposed legislation. But the other will be driven by networked collaboration, at the edge, among doctors who pool and codify their experiential knowledge using ad-hoc, Web 2.0-like methods.

Hat tip to Joshua Allen’s Better Living Through Software

Here’s another piece of Say Everything that I want to spotlight:

Microsoft wasn’t known as a haven of openness and cooperation. But it was a big place with a lot of smart people. At the turn of the millennium, during the company’s bitter antitrust fight with the U.S. Department of Justice, many of those people found it impossible to recognize themselves in the press’s portrait of the company. The first programmer at Microsoft to start blogging, Joshua Allen, set himself up with an account on Dave Winer’s EditThisPage service in 2000 and started posting under the header “Better Living Through Software: Tales of Life at Microsoft.” It was totally informal and unauthorized — a lone call for a parley raised from behind the company’s siege walls. Allen explained his intent: “I wanted to say that I am a Microsoft person and you can talk with me.”

I read Joshua’s blog back then and I still read it now, so it was nice to see its seminal role acknowledged in the book.

Here’s a picture of the blog’s home page, annotated by the ClearForest Gnosis entity extractor:

Quite a cast of characters!

More fun than herding servers

Until recently, the elmcity calendar aggregator was running as a single instance of an Azure worker role. The idea all along, of course, was to exploit the system’s ability to farm out the work of aggregation to many workers. Although the sixteen cities currently being aggregated don’t yet require the service to scale beyond a single instance, I’d been meaning to lay the foundation for that. This week I finally did.

Will there ever be hundreds or thousands of participating cities and towns? Maybe that’ll happen, maybe it won’t, but the gating factor will not be my ability to babysit servers. That’s a remarkable change from just a few years ago. Over the weekend I read Scott Rosenberg’s new history of blogging, Say Everything. Here’s a poignant moment from 2001:

Blogger still lived a touch-and-go existence. Its expenses had dropped from a $50,000-a-month burn rate to a few thousand in rent and technical costs for bandwidth and such; still, even that modest budget wasn’t easy to meet. Eventually [Evan] Williams had to shut down the office entirely and move the servers into his apartment. He remembers this period as an emotional rollercoaster. “I don’t know how I’m going to pay the rent, and I can’t figure that out because the server’s not running, and I have to stay up all night, trying to figure out Linux, and being hacked, and then fix that.”

I’ve been one of those guys who babysits the server under the desk, and I’m glad I won’t ever have to go back there again. What I will have to do, instead, is learn how to take advantage of the cloud resources now becoming available. But I’m finding that to be an enjoyable challenge.

In the case of the calendar aggregator, which needs to map many worker roles to many cities, I’m using a blackboard approach. Here’s a snapshot of it, from an aggregator run using only a single worker instance:

     id: westlafcals
  start: 7/14/2009 12:12:05 PM
   stop: 7/14/2009 12:14:46 PM
running: False

     id: networksierra
  start: 7/14/2009 12:14:48 PM
   stop: 7/14/2009 12:15:05 PM
running: False

     id: localist
  start: 7/14/2009 12:15:06 PM
   stop: 7/14/2009  5:37:03 AM
running: True

     id: aroundfred
  start: 7/14/2009  5:37:05 AM
   stop: 7/14/2009  5:39:20 AM
running: False

The moving finger wrote westlafcals (West Lafayette) and networksierra (Sonora), it’s now writing localist (Baltimore), and will next write aroundfred (Fredericksburg).

Here’s a snapshot from another run using two worker instances:

     id: westlafcals
  start: 7/14/2009 10:12:05 PM
   stop: 7/14/2009  4:37:03 AM
running: True

     id: networksierra
  start: 7/14/2009 10:12:10 PM
   stop: 7/14/2009 10:13:05 PM
running: False

     id: localist
  start: 7/14/2009 10:13:06 PM
   stop: 7/14/2009  4:41:12 AM
running: True

     id: aroundfred
  start: 7/14/2009  4:41:05 AM
   stop: 7/14/2009  4:42:20 AM
running: False

Now there are two moving fingers. One’s writing westlafcals, one has written networksierra, one’s writing localist, and one or the other will soon write aroundfred. The total elapsed time will be very close to half what it was in the single-instance case. I’d love to crank up the instance count and see an aggregation run rip through all the cities in no time flat. But the Azure beta caps the instance count at two.

The blackboard is an Azure table with one record for each city. Records are flexible bags of name/value pairs. If you make a REST call to the table service to query for one of those records, the Atom payload that comes back looks like this:

<m:properties>
   <d:PartitionKey>blackboard</d:PartitionKey>
   <d:RowKey>aroundfred</d:RowKey>
   <d:start>7/14/2009 4:41:05 AM</d:start>
   <d:stop>7/14/2009 4:42:20 AM</d:stop>
   <d:running>False</d:running>
</m:properties>
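Here’s a rough Python sketch of how a script might turn that payload into a plain dictionary of name/value pairs. The fragment above omits the namespace declarations, so the two URIs below are my assumption about what the d: and m: prefixes are bound to, and record_from_properties is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs: the fragment above does not show the declarations.
NS = {
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

payload = """<m:properties xmlns:d="{d}" xmlns:m="{m}">
   <d:PartitionKey>blackboard</d:PartitionKey>
   <d:RowKey>aroundfred</d:RowKey>
   <d:start>7/14/2009 4:41:05 AM</d:start>
   <d:stop>7/14/2009 4:42:20 AM</d:stop>
   <d:running>False</d:running>
</m:properties>""".format(**NS)


def record_from_properties(xml_text):
    """Return one record's name/value pairs as a plain dict of strings."""
    root = ET.fromstring(xml_text)
    record = {}
    for child in root:
        local_name = child.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
        record[local_name] = child.text
    return record


record = record_from_properties(payload)
print(record["RowKey"], record["running"])
```

Every value comes back as a string, so the caller still has to decide how to interpret the dates and the running flag.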

At the start of a cycle, each worker wakes up, iterates through all the cities, aggregates those not claimed by other workers, and then sleeps until the next cycle. To claim a city, a worker tries to create a record in a parallel Azure table, using the PartitionKey locks instead of blackboard. If the worker succeeds in doing that, it considers the city locked for its own use, it aggregates the city’s calendars, and then it deletes the lock record. If the worker fails to create that record, it considers the city locked by another worker and moves on.
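The claim-or-skip logic can be sketched in a few lines of Python. This is an illustration, not the service’s actual code: LockTable is an in-memory stand-in I’ve invented for the locks table, and run_cycle and its parameters are hypothetical names. In the real service the guarantee would have to come from the table service itself refusing a second record with the same key.

```python
class LockTable:
    """In-memory stand-in for a table that rejects duplicate keys."""

    def __init__(self):
        self._rows = {}

    def try_create(self, partition_key, row_key, owner):
        """Create a lock record; return False if one already exists."""
        key = (partition_key, row_key)
        if key in self._rows:
            return False
        self._rows[key] = owner
        return True

    def delete(self, partition_key, row_key):
        self._rows.pop((partition_key, row_key), None)


def run_cycle(worker, cities, table, aggregate):
    """Aggregate each city this worker manages to claim; return those cities."""
    done = []
    for city in cities:
        if not table.try_create("locks", city, owner=worker):
            continue  # locked by another worker, so move on
        try:
            aggregate(city)
            done.append(city)
        finally:
            table.delete("locks", city)  # release the lock either way
    return done


table = LockTable()
# Pretend a second worker is already busy with Baltimore.
table.try_create("locks", "localist", owner="worker-2")

cities = ["westlafcals", "networksierra", "localist", "aroundfred"]
claimed = run_cycle("worker-1", cities, table, aggregate=lambda city: None)
print(claimed)
```

In this run worker-1 handles every city except localist, which it finds locked and skips.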

This cycle is currently one hour. But in order to respect the various services it pulls from, the service defines the interval between aggregation runs to be 8 hours. So when a worker claims a city, it first checks to see if the last aggregation started more than 8 hours ago. If not, the worker skips that city.

Locks can be abandoned. That could happen if a worker hangs or crashes, or when I redeploy a new version of the service. So the worker also checks to see if a lock has been hanging around longer than the aggregation interval. If so, it overrides the lock and aggregates that city.
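Put together, the two timing rules amount to a small decision function. Here’s a minimal sketch under my own assumptions: should_aggregate and its parameters are hypothetical names, and I’m assuming the worker can look up both the start time of a city’s last aggregation and the creation time of any existing lock.

```python
from datetime import datetime, timedelta

AGGREGATION_INTERVAL = timedelta(hours=8)


def should_aggregate(last_start, lock_created, now):
    """Decide whether a worker should aggregate a city right now.

    last_start:   when the city's last aggregation began, or None if never
    lock_created: when the existing lock was created, or None if unlocked
    """
    if last_start is not None and now - last_start <= AGGREGATION_INTERVAL:
        return False  # aggregated recently enough, so skip this city
    if lock_created is not None and now - lock_created <= AGGREGATION_INTERVAL:
        return False  # another worker holds a fresh lock
    return True  # unlocked, or the lock looks abandoned and gets overridden


now = datetime(2009, 7, 14, 12, 0)
hours = lambda n: timedelta(hours=n)

due = should_aggregate(now - hours(9), None, now)
too_soon = should_aggregate(now - hours(1), None, now)
locked = should_aggregate(now - hours(9), now - timedelta(minutes=10), now)
abandoned = should_aggregate(now - hours(20), now - hours(9), now)
print(due, too_soon, locked, abandoned)
```

The four calls cover a city that is due, one that was aggregated too recently, one that is freshly locked by another worker, and one whose lock has outlived the interval.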

I’m sure this scheme isn’t bulletproof, but I reckon it doesn’t need to be. If two workers should happen to wind up aggregating the same city at about the same time, it’s no big deal. The last writer wins, and a little extra work gets done.

Anyway, I’ll be watching the blackboard over the next few days. There’s undoubtedly more tinkering to do. And it’s a lot more fun than herding servers.

The civic dashboard

On Friday my local paper ran a story entitled Keene crime rates steady over years. Because that link will go dark soon, I’m going to assert fair use for the part of the story that cites statistics:

Strings of vehicle break-ins and vandalism and the occasional vicious beating or stabbing may lead some to believe that Keene’s streets are getting meaner, but crime statistics show little change over the last six years.

Even in light of rough economic times, which typically parallel a spike in shoplifting — people begin stealing groceries or other necessities they can no longer afford — the Elm City’s property crime rate remains stable.

The city’s social programs, such as The Community Kitchen, which provides food to area residents in need, play a significant role in curbing crime, Keene police Lt. Jay U. Duguay said.

“We’re behind the nation when it comes to economic issues. People are still losing their homes and jobs, but overall we haven’t felt the effects of it yet,” he said. “Right now it’s wait-and-see.”

During the last six years, Keene police have received an average of 490 reports dealing with larceny or theft. Last year they took 667 reports of larceny or theft, the highest number of those types of crimes since 2002, which saw 604 reports.

From the beginning of this year to the end of April, there were 202 reports of larceny and theft, slightly higher than the 147 during the same period last year, and 33 burglaries, which is on par with previous years.

“There’s going to be periods with a little influx, but for the most part it’s steady,” Duguay said. “I was actually kind of surprised at how consistent the numbers were.”

In 2004 and 2005, property crime rates dipped dramatically. While 2003 saw 557 larcenies and thefts, that number hit 272 the following year and then slightly increased to 286 the next year before rising to 455 in 2006.

“We didn’t change our patrol procedures during those times (2004 and 2005) and we weren’t up to full staff. So I don’t know why those years are lower,” Duguay said. “I think the more consistent number is the high number, but thank goodness for the lows.”

Violent crime reports in Keene have also remained steady over the last several years, with an average of 366 assaults annually.

Between 20 and 30 sex assaults are reported in the city each year, though only a small fraction of those cases result in arrests because the others lack sufficient evidence, Duguay said.

Statistics only tell part of the story, though. For the crime victims, the numbers hold little meaning.

The story concludes with anecdotes from townsfolk who either do, or don’t, believe that tough economic times are making Keene’s streets meaner.

I quote at length from it here because I think it captures a moment in time. The story seems to be data-driven, but not in the way that many of us now realize such stories can be. The reporter got some numbers from the police department, and the story quotes a lieutenant’s interpretation of those numbers, but there’s nothing available for an interested citizen to verify or falsify. And there’s no reference to an alternative source — from the US Bureau of Justice — that could confirm, challenge, or otherwise contextualize the numbers.

I hope that my response, below, also marks a moment in time — one in which people didn’t demand, governments didn’t provide, journalists didn’t exploit, and all these groups didn’t collaboratively engage with more and better evidence than informs most civic dialogue today.

From time to time, communities ask: Are we having a crime wave? A couple of summers ago it seemed that way. The Sentinel invited TalkBack comments, and one person wrote: “Keene has gone downhill. Once a peaceful, quaint city that was safe, it is no more.”

We shouldn’t have to just speculate about these trends though. We should be able to look at the data and draw reasonable conclusions. Increasingly, we can.

In 2007 I looked, and the first source I found was the data reported by the Keene police department (and every other police department around the country) to the US Bureau of Justice. I noticed a couple of things. First, the numbers showed no uptick in violent crime. But since they stopped in 2005, they didn’t address concern about events in the then-current 2006-2007 period.

Second, because the numbers went back to 1985, they revealed a remarkable anomaly. There was a huge spike in violent crime — assaults and rapes — from 1990 to 1994. You can see the trend plainly in the charts and data I’ve posted at http://jonudell.net/crime/keene-crime.html. What happened then? How should this historical context influence our perception of current trends? I’d love to see the Sentinel ask, and try to answer, these questions.

Since the Bureau of Justice data wasn’t current enough to address the 2006-2007 concerns about crime, I asked the police department to provide me with more recent data. In the end, after multiple requests and some nudging by an attorney, they complied. The snapshot I received, with numbers through July 2007, showed no evidence of a recent uptick in either violent crime or property crime. That was reassuring.

It was also enlightening to compare the raw data in the police spreadsheet to the numbers reported to the Bureau of Justice. They don’t exactly line up. This isn’t nefarious, it’s just what happens when local systems try to mesh with national systems. There is a lot of local variation in the classification of different types of crimes, and room for interpretation when you bundle them into larger categories.

Fast forward to summer 2009. The economy has tanked, and people are again wondering whether we’re having a crime wave. The Sentinel gathered some data, talked to the police, and concluded — I suspect correctly — that as before, the perception of a crime wave is not the reality.

For the reasons I’ve explained, the police department numbers reported in the Sentinel don’t quite line up with those reported to the Bureau of Justice. Consider larceny-theft, for example:

          2003   2004   2005   2006
Sentinel   557    272    286    455
Justice    534    245    235    622
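To see how far apart the two sources are, here’s a few lines of Python that compute the year-by-year gap. It uses only the eight numbers in the table above. The variable names are mine, and it’s just a sketch of the kind of check an interested citizen could run if the data were published.

```python
years = [2003, 2004, 2005, 2006]
sentinel = [557, 272, 286, 455]  # larceny-theft as reported in the Sentinel
justice = [534, 245, 235, 622]   # larceny-theft as reported to the Bureau of Justice

# Positive means the Sentinel's figure is higher than the Justice figure.
differences = [s - j for s, j in zip(sentinel, justice)]

for year, s, j, diff in zip(years, sentinel, justice, differences):
    print(f"{year}: Sentinel {s}, Justice {j}, difference {diff:+d}")
```

The gap is a few dozen reports from 2003 through 2005, then flips sign in 2006, when the Justice figure is 167 higher.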

But I do wonder about this:

“Violent crime reports in Keene have also remained steady over the last several years, with an average of 366 assaults annually.”

I hope that’s an error. According to the Bureau of Justice there were at most about 100 violent crimes per year, back in the dark ages of 1990-1994, and we’ve averaged between 40 and 60 per year from then until 2007.

In any case, here’s the larger point. Cities around the country have begun to realize that the operational data of city government can be made available to everyone — citizens as well as journalists — so that we can all monitor the health of our cities in a collaborative way. Crime statistics are one popular category of data, others include restaurant inspections, infrastructure repairs, and licensing.

Nowadays it costs about $100/month to augment a police department’s information system with software that reports current crime statistics online, and also displays the locations of crimes on a map. In New Hampshire, one such system (crimereports.com) has been installed in Exeter, Hampton, Laconia, and Rochester.

I’d love to see the Keene police department join that club. A civic dashboard is part of what I proposed during the Community Visioning Process. But there’s no need to wait until 2028. Cities around the country are creating their dashboards now, and we can too.

Understanding Wikipedia notability

Some fellow residents of my town have recently noticed, and pointed out to me, that I’m listed in Wikipedia as a notable inhabitant of Keene, NH. They’re more impressed than they should be. All forms of notability are subject to bias, but Internet notability is subject to a different kind of bias than most people realize.

For example, friends and family used to be impressed by the fact that I was the top result in Google for my first name — and then second to Jon Stewart for a long while, until I had to reboot my InfoWorld archive. Why? Just because I’ve projected a large surface area of searchable documents whose titles include the trigram jon.

An example of a far more notable person than me is Glenn Fine, who was in my grade in junior high school and is now Inspector General for the Department of Justice. You won’t find him anywhere near the top of a search for his first name because Inspectors General don’t (yet) project a large surface area of documents onto the web.

To place my newfound Wikipedia notability into a similar context, I wanted to show people how these lists of notable inhabitants are made. I figured the person who made the change is somebody who knows of my work, because I’ve written about it so much online, and who is inclined to edit Wikipedia, which correlates with an interest in my work.

I wanted to illustrate exactly who, when, and how, so I went to Wikipedia with the confident expectation that it would be easy to answer those questions.

Surprisingly, it wasn’t. I guess I haven’t really tried searching revision histories in Wikipedia before, but in this case and a few others I’ve tried lately, it seems quite difficult to pinpoint the author of a change.

For example, on Twitter I asked:

Wikipedia: “The term ‘Web 2.0’ was coined by Darcy DiNucci in 1999.” Added when, by whom? WikiBlame seems an ineffective way to find out.

@bazzargh replied: Robert Gehl. http://bit.ly/46r1a

Thanks. By the way, how’d you do that?

switch to 500 view in history, then rough bisection from oldest. Couple of minutes; used this a lot to find long-lived vandalism.

if older, I progressively back off 2..4..8… pages through this. In this case though, there was a clueful log message!
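
The rough bisection @bazzargh describes is a binary search over the revision history. Here is a minimal sketch, assuming the revision texts have already been fetched into a list ordered oldest to newest, and assuming the phrase, once added, stayed in every later revision. The function name and the sample revisions are made up for illustration; nothing here talks to Wikipedia.

```python
from typing import Optional, Sequence


def first_revision_with(revisions: Sequence[str], phrase: str) -> Optional[int]:
    """Return the index of the earliest revision containing `phrase`.

    Assumes `revisions` is ordered oldest to newest and that the phrase,
    once introduced, is present in every later revision.
    """
    if not revisions or phrase not in revisions[-1]:
        return None  # the phrase is not in the current text at all
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if phrase in revisions[mid]:
            hi = mid       # already present here, so look earlier
        else:
            lo = mid + 1   # not yet present, so look later
    return lo


# Hypothetical revision texts, oldest first.
history = [
    "Web 2.0 is a term.",
    "Web 2.0 is a term for a second generation of the web.",
    "Web 2.0 is a term coined by Darcy DiNucci in 1999.",
    "Web 2.0 is a term coined by Darcy DiNucci in 1999. See also: AJAX.",
]
print(first_revision_with(history, "coined by Darcy DiNucci"))
```

Each probe halves the remaining range, which is why a history of thousands of revisions takes only a dozen or so looks. If the phrase was removed and later restored, the assumption breaks and the search can land on the wrong revision.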

That’s pretty much what I’ve found myself doing when trying to track down changes, so I was glad to know it wasn’t just me. But this highlights an important point about transparency: It’s all relative.

One of the reasons we think of government as opaque is that while records may be notionally public, it takes time, effort, and skill to visit city hall, dig through them, and find what you’re looking for.

I have always regarded Wikipedia as an extreme counter-example. And that’s true. It is radically transparent. You can ultimately find out exactly how any statement in any article came to be. You may not be able to correlate the author’s pseudonym to a real-world identity, but you can evaluate that author’s corpus and reputation within the context of Wikipedia.

And yet, the ability to do this spelunking requires more time, effort, and skill than most people possess. Although I’m reluctant to deflate my status as a notable inhabitant of Keene, I wish it were easier for people who read that to also find out what it does — and doesn’t — mean.

Strategic choices for calendar publishers

Although I haven’t been able to confirm this officially yet, it looks like FuseCal, the HTML screen-scraping service that I’ve been using (and recommending) as a way to convert calendar-like web pages into iCalendar feeds, has shut down.

The web pages that FuseCal has been successfully processing, for several curators participating in the elmcity project, are listed below. They’re a kind of existence proof, validating the notion that unstructured calendar info — what people intuitively create — can be mechanically transformed into structured info that syndicates reliably.

I hope this service, or some future variant of it, will continue. It’s a really useful way to help people grasp the concept of publishing calendar feeds.

But in the long run, it’s a set of training wheels. Ultimately we need to teach people why and how to produce such feeds more directly. All of the event information shown below could be managed in a more structured way using calendar software that produces data feeds for syndication and web pages for browsing.

More broadly, incidents like this prompt us to consider the nature of the services ecosystem we’re all embedded in — as users and, increasingly, as co-creators. In the software business, developers have long since learned to evaluate the benefits and risks of “taking a dependency” on a component, library, or service. Users didn’t have to think too much about that. A software product that was discontinued would keep working perfectly well, maybe for years. But services can — and sometimes do — stop abruptly.

Since the elmcity project is embedded in a services ecosystem, as both a provider and a consumer, how should a curator evaluate service dependencies and their associated risks and benefits? Here are some guidelines.

Many eggs, many baskets

An instance of the calendar aggregator gathers events from three main sources: Eventful (service #1), Upcoming (service #2), and a curated set of iCalendar feeds. A subset of those feeds may (until recently) have been mediated by FuseCal (service #3). So there were three main service dependencies here, and that’s one form of diversification.

But the iCalendar feeds represent another, and more powerful, form of diversification. One may be served up by a Drupal system, one may be an ICS file posted from Outlook 2007, one may be an instance of Google Calendar. Each depends on its own supporting services, but the ecosystem is very diverse.
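
The value of that diversity shows up when one source disappears. Here is a minimal sketch of an aggregation loop that keeps going when a source fails. It is not the elmcity code; the function names and the sample events are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]
Source = Callable[[], List[Event]]


def gather(sources: Dict[str, Source]) -> Tuple[List[Event], List[str]]:
    """Collect events from every source, skipping any source that fails."""
    events: List[Event] = []
    failed: List[str] = []
    for name, fetch in sources.items():
        try:
            events.extend(fetch())
        except Exception:
            # One dead service should not take down the whole calendar.
            failed.append(name)
    return events, failed


# Hypothetical fetchers standing in for real services and feeds.
def from_eventful() -> List[Event]:
    return [{"title": "Farmers market", "source": "eventful"}]


def from_ical_feed() -> List[Event]:
    return [{"title": "Library story hour", "source": "ical"}]


def from_fusecal() -> List[Event]:
    raise ConnectionError("service unavailable")


events, failed = gather({
    "eventful": from_eventful,
    "ical": from_ical_feed,
    "fusecal": from_fusecal,
})
print(len(events), failed)
```

The calendar still gets the events from the two working sources, and the failed source is reported by name so a curator can follow up.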

Data and service portability

The elmcity project isn’t a database of events, but rather an aggregator of feeds of events. What matters in this case is portability of metadata describing the feeds, as well as data describing events. The system depends on Delicious for the management of the metadata. But all this metadata is replicated to Azure for safekeeping.

Since the elmcity project does run on Azure, there’s clearly a strong dependence on that platform’s compute and storage services. But I could run the code on another host — even another cloud-based host, thanks to Amazon’s EC2 for Windows. Likewise I could store blobs and tables in Amazon’s S3 and SimpleDB.

Strategic choices

In this context, the use of FuseCal was a strategic choice. There isn’t a readily available replacement, and that’s a recipe for the sort of disruption we’ve just experienced. But since the system is diversified, that disruption is contained. Was the benefit provided by this unique service worth the cost of disruption? Some curators may disagree, but I think the answer is yes. It was really helpful to be able to show people that informational web pages are implicitly data feeds, and to show what can happen when those data feeds are made explicit.

Still, it was a crutch. Ultimately we want people to stand on their own two feet, and take direct control of the information they publish to the web. FuseCal had to guess which times went with which events, and sometimes guessed wrong. If you’re publishing the event, you want to state these facts unambiguously. And using a variety of methods, as I’ve shown, you can. Those methods are the real strategic choices. If you can publish your own data feed, simply and inexpensively, you should seize the opportunity to do so.
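
To make "state these facts unambiguously" concrete, here is a minimal sketch that writes a one-event iCalendar feed by hand, with explicit start and end times. Most people would let calendar software produce this; the point is only to show what an explicit feed contains. The event, the UID, and the PRODID are invented for illustration, and folding of long lines is left out.

```python
from datetime import datetime, timezone
from typing import List

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # UTC date-time form, with trailing "Z"


def escape_text(value: str) -> str:
    """Escape characters that are special in iCalendar text values."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a UTC iCalendar timestamp."""
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def make_event(uid: str, summary: str, start: datetime,
               end: datetime, created: datetime) -> List[str]:
    """Return the content lines for one VEVENT."""
    return [
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{utc_stamp(created)}",
        f"DTSTART:{utc_stamp(start)}",
        f"DTEND:{utc_stamp(end)}",
        f"SUMMARY:{escape_text(summary)}",
        "END:VEVENT",
    ]


def make_calendar(events: List[List[str]]) -> str:
    """Wrap event lines in a VCALENDAR, using CRLF line endings."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//feed sketch//EN"]
    for event in events:
        lines.extend(event)
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"


# A made-up event; the times are given in UTC so nothing has to be guessed.
feed = make_calendar([
    make_event(
        uid="concert-1@example.org",
        summary="Concert on the common, rain or shine",
        start=datetime(2009, 8, 15, 23, 0, tzinfo=timezone.utc),
        end=datetime(2009, 8, 16, 1, 0, tzinfo=timezone.utc),
        created=datetime(2009, 8, 1, 12, 0, tzinfo=timezone.utc),
    )
])
print(feed)
```

Here the start time is a field in its own right, not something a scraper has to infer from the layout of a web page.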


Calendar pages successfully parsed by FuseCal

prescottaz

fallschurchcals

ottawacals

snoqualmie

mashablecity

elmcity

a2cal

whyhuntington

Influencing the production of public data

In the latest installment of my Innovators podcast, which ran while I was away on vacation, I spoke with Steven Willmott of 3scale, one of several companies in the emerging business of third-party API management. As more organizations get into the game of providing APIs to their online data, there’s a growing need for help in the design and management of those APIs.

By way of demonstration, 3scale is providing an unofficial API to some of the datasets offered by the United Nations. The UN data at http://data.un.org, while browseable and downloadable, is not programmatically accessible. If you visit 3scale’s demo at www.undata-api.org/ you can sign up for an access key, ask for available datasets — mostly, so far, from the World Health Organization (see below) — and then query them.

The query capability is rather limited. For a given measure, like Births by caesarean section (percent), you can select subsets by country or by year, but you can’t query or order by values. And you can’t make correlations across tables in one query.

It’s just a demo, of course. If 3scale wanted to invest more effort, a more robust query system could be built. The fact that such a system can be built by an unofficial intermediary, rather than by the provider of the data, is quite interesting.
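
In the meantime, the two gaps mentioned above (ordering by value, and correlating across tables) can be filled on the client once the rows are downloaded. Here is a minimal sketch, assuming each dataset arrives as a list of records with country, year, and value fields. That record shape is my assumption, not the documented format of the 3scale demo, and the countries and numbers below are placeholders, not WHO data.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Placeholder rows for two measures; not real WHO figures.
caesarean: List[Row] = [
    {"country": "A", "year": 2005, "value": 10.0},
    {"country": "B", "year": 2005, "value": 30.0},
    {"country": "C", "year": 2005, "value": 20.0},
]
physicians: List[Row] = [
    {"country": "A", "year": 2005, "value": 5.0},
    {"country": "C", "year": 2005, "value": 25.0},
]


def order_by_value(rows: List[Row], descending: bool = True) -> List[Row]:
    """Sort rows by their value field, something the demo API does not offer."""
    return sorted(rows, key=lambda r: r["value"], reverse=descending)


def join_on_country_year(left: List[Row], right: List[Row]) -> List[Row]:
    """Pair up rows from two measures that share a country and a year."""
    index = {(r["country"], r["year"]): r["value"] for r in right}
    return [
        {
            "country": l["country"],
            "year": l["year"],
            "left": l["value"],
            "right": index[(l["country"], l["year"])],
        }
        for l in left
        if (l["country"], l["year"]) in index
    ]


ranked = order_by_value(caesarean)
paired = join_on_country_year(caesarean, physicians)
print([r["country"] for r in ranked])
print(paired)
```

Country B drops out of the joined result because the second measure has no row for it, which is the usual behavior of an inner join.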

As I watch this data publication meme spread, here’s something that interests me even more. These efforts don’t really reflect the Web 2.0 values of engagement and participation to the extent they could. We’re now very focused on opening up flexible means of access to data. But the conversation is still framed in terms of a producer/consumer relationship that isn’t itself much discussed.

At the end of this entry you’ll find a list of WHO datasets. Here’s one: Community and traditional health workers density (per 10,000 population). What kinds of questions do we think we might try to answer by counting this category of worker? What kinds of questions can’t we try to answer using the datasets WHO is collecting? How might we therefore want to try to influence the WHO’s data-gathering efforts, and those of other public health organizations?

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.


Adolescent fertility rate
Adult literacy rate (percent)
Gross national income per capita (PPP international $)
Net primary school enrolment ratio female (percent)
Net primary school enrolment ratio male (percent)
Population (in thousands) total
Population annual growth rate (percent)
Population in urban areas (percent)
Population living below the poverty line (percent living on less than US$1 per day)
Population median age (years)
Population proportion over 60 (percent)
Population proportion under 15 (percent)
Registration coverage of births (percent)
Registration coverage of deaths (percent)
Total fertility rate (per woman)
Antenatal care coverage – at least four visits (percent)
Antiretroviral therapy coverage among HIV-infected pregnant women for PMTCT (percent)
Antiretroviral therapy coverage among people with advanced HIV infections (percent)
Births attended by skilled health personnel (percent)
Births by caesarean section (percent)
Children aged 6-59 months who received vitamin A supplementation (percent)
Children aged less than 5 years sleeping under insecticide-treated nets (percent)
Children aged less than 5 years who received any antimalarial treatment for fever (percent)
Children aged less than 5 years with ARI symptoms taken to facility (percent)
Children aged less than 5 years with diarrhoea receiving ORT (percent)
Contraceptive prevalence (percent)
Neonates protected at birth against neonatal tetanus (PAB) (percent)
One-year-olds immunized with MCV
One-year-olds immunized with three doses of Hepatitis B (HepB3) (percent)
One-year-olds immunized with three doses of Hib (Hib3) vaccine (percent)
One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (percent)
Tuberculosis detection rate under DOTS (percent)
Tuberculosis treatment success under DOTS (percent)
Women who have had PAP smear (percent)
Women who have had mammography (percent)
Community and traditional health workers density (per 10 000 population)
Dentistry personnel density (per 10 000 population)
Environment and public health workers density (per 10 000 population)
External resources for health as percentage of total expenditure on health
General government expenditure on health as percentage of total expenditure on health
General government expenditure on health as percentage of total government expenditure
Hospital beds (per 10 000 population)
Laboratory health workers density (per 10 000 population)
Number of community and traditional health workers
Number of dentistry personnel
Number of environment and public health workers
Number of laboratory health workers
Number of nursing and midwifery personnel
Number of other health service providers
Number of pharmaceutical personnel
Nursing and midwifery personnel density (per 10 000 population)
Other health service providers density (per 10 000 population)
Out-of-pocket expenditure as percentage of private expenditure on health
Per capita total expenditure on health (PPP int. $)
Per capita total expenditure on health at average exchange rate (US$)
Pharmaceutical personnel density (per 10 000 population)
Physicians density (per 10 000 population)
Private expenditure on health as percentage of total expenditure on health
Private prepaid plans as percentage of private expenditure on health
Ratio of health management and support workers to health service providers
Ratio of nurses and midwives to physicians
Social security expenditure on health as percentage of general government expenditure on health
Total expenditure on health as percentage of gross domestic product