July 2007
Monthly Archive
July 31, 2007
Posted by Jon Udell under
Uncategorized [11] Comments
I’ve had a brief fling with Taglocity, an Outlook add-in for tagging email, contacts, and tasks. You can of course already tag messages in Outlook using categories, and I do, but rarely, just as I’ve rarely used labels in Gmail. For me at least, tagging is most interesting and useful when it is social.
Consider, for example, the recent interaction around the publicdata tag in del.icio.us. I’m continuing to see new items show up in the global bucket, /tag/publicdata. Occasionally I add one of these to the list I’m curating in my private bucket, /judell/publicdata. There, I can see at a glance which of the items I’ve collected are of interest to others, and to how many others. Focusing on the Bureau of Justice crime data I’ve been using lately, I can see who else tagged it, I can observe the historical interest in that URL back to 2004, and I can notice that it was most recently tagged by somebody at Many Eyes. I can then compare the list that’s being curated by Many Eyes to the list that I’m curating.
So here’s the question: Can these effects occur in email? In theory they can, and Taglocity lays the foundation with a feature called traveling tags. This is actually an idea that I discussed long ago in my book Practical Internet Groupware, where I suggested that keywords could be passed in SMTP headers, or in XML packets carried in message bodies. From the Taglocity FAQ:
Taglocity has a number of ways of transferring the tags you have assigned to an outgoing message to the recipient. These include using the SMTP Keyword header and something called a ‘Tagline’. A Tagline is a footer that includes the text ‘Tags:’. The Tagline is very simple but will work on any device and mail relay, as it is placed within the message content.
In my experiments I didn’t find any evidence of the SMTP method, but maybe that got stripped out by an intermediate relay. The tagline wouldn’t be the nicest thing to have to parse mechanically:
<p class=3DMsoNormal><span style=3D'font-size:8.0pt;font-family:"Arial","sa=
ns-serif";
color:#8C8C8C'><a href=3D"http://www.Taglocity.com">Taglocity</a> Tags: tag=
ging,
socialsoftware</span><span style=3D'font-size:12.0pt;font-family:"Times New=
Roman","serif"'><o:p></o:p></span></p>
But it’s doable, and there may be an option for including a more well-structured packet.
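To make "doable" concrete, here is a minimal Python sketch of recovering traveling tags from a message. It assumes the footer arrives as quoted-printable HTML like the sample above, that tags are comma-separated after the literal text "Tags:", and that the tagline is the last text in the body. The function names are illustrative and not part of Taglocity. The header variant assumes the tags travel in a standard Keywords header, which did not show up in these experiments.

```python
import quopri
import re
from email import message_from_string
from html import unescape

# The footer as it appeared in the message source (quoted-printable HTML).
SAMPLE = b"""<p class=3DMsoNormal><span style=3D'font-size:8.0pt;font-family:"Arial","sa=
ns-serif";
color:#8C8C8C'><a href=3D"http://www.Taglocity.com">Taglocity</a> Tags: tag=
ging,
socialsoftware</span><span style=3D'font-size:12.0pt;font-family:"Times New=
Roman","serif"'><o:p></o:p></span></p>"""


def tags_from_tagline(qp_html: bytes) -> list[str]:
    """Decode quoted-printable HTML, strip markup, and read the 'Tags:' footer."""
    html = quopri.decodestring(qp_html).decode("utf-8", errors="replace")
    # Replace every tag with a space, then collapse runs of whitespace.
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    text = " ".join(text.split())
    # Assumption: nothing follows the tagline, so take the rest of the text.
    match = re.search(r"Tags:\s*(.*)", text)
    if not match:
        return []
    return [t.strip() for t in match.group(1).split(",") if t.strip()]


def tags_from_keywords_header(raw_message: str) -> list[str]:
    """Read tags from a Keywords header, if an intermediate relay left it intact."""
    value = message_from_string(raw_message).get("Keywords", "")
    return [t.strip() for t in value.split(",") if t.strip()]


print(tags_from_tagline(SAMPLE))
print(tags_from_keywords_header("Subject: hi\nKeywords: tagging, socialsoftware\n\nbody"))
```

A receiver could try the header first and fall back to the tagline, since the footer survives any relay that preserves the message body.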
These are just technical details though. The real question is: What kind of tag-related social dynamics are even possible in email? I guess that’s something you could only find out by trading tags with other people for a while.
In an internal email thread on this topic, one person noted:
I wonder if there’s both a public tagspace and a private tagspace? What if I want to tag a thread as “readlater” or “notmyproblem”… does everyone have to suffer my personal categorization?
I investigated, and found that there is a notion of public and private tagging. It works on a per-tag basis, though, unlike del.icio.us’ per-item privacy, so apparently you’d need to develop a private vocabulary that included no tags you’d want to use in public.
In another message on that internal thread, someone noted the obvious problem of critical mass. The social effects in del.icio.us are a function of scale. Is departmental or even corporate scale sufficient to sustain those effects? Even when it is, as I’ve heard is true for IBM’s internal use of Dogear, can the social effects of that kind of web application be usefully translated to the email environment?
I don’t know the answer, but I’d love to hear from folks who are actively using Taglocity (or equivalents, if they exist) about what works or doesn’t, and why or why not.
July 30, 2007
Posted by Jon Udell under
Uncategorized [10] Comments
When I read this story about cancer care in the Sunday New York Times yesterday, I was struck by one particular information graphic which I thought was very nicely done:
It turns out that Chris Gemignani was impressed too, and he decided to recreate the image using Excel. Here’s what he came up with:
Going one huge step further, and in the spirit of today’s theme of narrating the work, he created a screencast in which he demonstrates the process of making that graphic. It’s a wonderful example of the dynamic I’ve been describing. One of the commenters on Chris’ blog thanks him for teaching him some helpful techniques. Another suggests a technique that Chris hadn’t used but thinks is interesting. Very cool!
With Excel, as with all software — on the desktop and on the web — there’s so much untapped potential. The obstacles are knowing what’s even possible, and then knowing how to achieve it. Screencasts like this one leap over both obstacles in a single bound.
July 30, 2007
Posted by Jon Udell under
Uncategorized [3] Comments
I’ve listened to many of Moira Gunn’s Tech Nation podcasts, so it was a treat to turn the tables and interview her for this week’s episode of my show. Recently she’s been devoting a lot of attention to the world of biotechnology. There’s a new show focused exclusively on that subject, BioTech Nation, and in March she published a book about the show: Welcome to BioTech Nation: My Unexpected Odyssey into the Land of Small Molecules, Lean Genes, and Big Ideas.
In this conversation we discuss what it’s like for a computer scientist and engineer to venture into the world of biotechnology, why the decade of biotech may finally have arrived, what makes biotech entrepreneurs special, and how we can use Internet media to enlarge the public understanding of science and technology.
On that last point, Moira echoed comments that I’ve also heard recently from Hugh McGuire and Timo Hannay, both of whom told me they listen voraciously to science-oriented podcasts. They all agree that hearing scientists narrate their own work, in their own words, is a wonderful new opportunity, and a great way to promote more and better public understanding.
It’s also interesting to hear Moira’s take on podcasting. Comparing her long experience with large terrestrial and satellite radio audiences to her recent experience with a smaller Internet radio audience, she says: “The quality of the listener you get on the Internet is far better.”
July 30, 2007
Posted by Jon Udell under
Uncategorized [3] Comments
Last year Greg Wilson wrote to tell me about the collection of essays that he and Andy Oram were compiling into what has now become the book Beautiful Code: Leading Programmers Explain How They Think:
The idea is to get a bunch of well-known and not-yet-well-known programmers to select medium-sized pieces of code (100-200 lines) that they think are particularly elegant, and spend 2500 words or so explaining why.
The 600-page tome arrived recently, and as I’ve been reading it I’m struck once again by the theme of narrating the work. Of the chapters I’ve read so far, three are especially vivid examples of that: Karl Fogel’s exegesis of the stream-oriented interface used in Subversion to convey changes across the network, Alberto Savoia’s meditation on the process of software testing, and Lincoln Stein’s sketches (“code stories”) that he writes for himself as he develops a new bioinformatics module.
Although this is a book by programmers and for programmers, the method of narrating the work process is, in principle, much more widely applicable. In practice, it’s something that’s especially easy and natural for programmers to do.
It’s easy because a programmer’s work product — in intermediate and final form — happens to be lines of text that can be printed in a book or published online.
It’s natural because programmers have been embedded for longer than most other professionals in a work process that’s fundamentally enabled by electronic publishing. We’ve been sharing code, and conversations about code, online for decades.
Most work processes don’t lend themselves to the sort of direct capture and literal representation that you see in Beautiful Code. Not yet, anyway. I think that can and will change, though, and I think two emerging forms of media will be powerful agents of change.
One of those forms is Internet video, which enables the capture and sharing of many kinds of physical-world expertise. The other is screencasting, which does the same for virtual-world expertise. Narration of work in these forms won’t be able to be printed in a book. But it will be just as valuable as the narration in Beautiful Code, and for the same reasons. Access to expert minds is just inherently valuable. We’re entering an era in which we’ll be able to access many more — and many different kinds of — expert minds. I’m looking forward to it. Meanwhile, I’m enjoying the access I have now to the 38 minds that Greg and Andy have collected for this book.
July 25, 2007
Posted by Jon Udell under
Uncategorized [7] Comments
While I was editing today’s screencast I kept a log of my edits, and I’ve included that log below. As is typical when I edit screencasts, this one squeezed down quite a lot: from almost 54 minutes to 34 minutes. The result not only saves the viewer a precious 20 minutes, but also unfolds in a far more entertaining and engaging way.
I’ve written a lot about why to do this kind of editing, but never shown in detail what the process is like. For folks who are familiar with the editing process — in any medium — this is all just basic knowledge and common sense. But there are lots of folks who are not familiar with the editing process in any medium. So to convey what it’s like, I decided to narrate (part of) the editing of this particular screencast.
As I’ve mentioned before, there’s one huge difference between editing audio and editing video. With audio, as with text, you can seamlessly cut and rearrange to your heart’s content. With video, the need to preserve visual continuity imposes severe limits, especially on the so-called internal edits that elide words and phrases. It’s interesting to note that, in this respect, the demo/interview genre of screencasting has more in common with audio than with video. There’s usually a lot less happening in a screencast than in motion video. So you can usually get away with the sort of heavy editing that’s normally only possible in the audio domain. And it’s very useful to be able to do that.
(Initial length: 53:45.)
I cut the first 2.5 mins of Henrik talking in general terms about CCR, DSS, the programming model. Why? Nothing to show, and this info is available elsewhere.
The real meat of this demo is to show how the Robotics Studio exposes a RESTful interface, and to demo interactions with (real and simulated) robots using that interface.
In the next segment, Henrik starts by saying “I have a nice big robot next to me, I might be able to show you, if I can just…”
I then cut 15 seconds of him fumbling around in the services directory and muttering to himself, while hunting for the webcam interface. So it went from:
“I might be able to show you, if I can just” …. 15 seconds of fumbling and muttering … “there you go! [image appears]”
to:
“I might be able to show you, if I can just … there you go! [image appears]”
This is partly about respect for the viewer’s time, because people have better things to do than watch and listen to 15 seconds of fumbling and muttering. And it’s partly about keeping the storyline moving forward in an engaging way.
A subtle point here is that I left in just enough of the fumbling and muttering. If I had reduced it to:
“I might just be able to show you…there you go!”
then it would have felt overproduced and inauthentic. I want Henrik to fumble and mutter a little bit, that’s part of the whole charm of the thing. But I want to limit the fumbling and muttering to a reasonable length. I think that leaving in “if I can just…” retains just enough of that quality — but not too much:
“I might be able to show you…if I can just…there you go! [image appears]”
During the next stretch I made no major cuts, but lots of little ones in the range of 2 to 5 seconds. These are places where the audio pauses because Henrik is thinking, or waiting for the computer to respond. And they are also places where he’s just verbally warming up to what he really wants to say — or where I am doing the same.
Example: “…and so, um…” –> “” == 2 seconds saved
Example: “And what we have here is that, um, and so, everybody has seen a web server” –> “Everybody has seen a web server” == 5 seconds saved
These internal cuts are completely inaudible and, so long as they don’t interrupt the onscreen action, also invisible. Since a typical screencast is often visually quiescent there are many opportunities to make these cuts. They not only reduce the end-to-end time, but also — just as important — they make the video far more watchable.
The next major cut was a 20-second setup leading to the statement: “In our model, everybody is a client and everybody is a server.” In the setup, Henrik talked about how typical web apps (like home banking) exhibit a more classical client/server architecture. It was a judgement call but, in this case, I decided that the kinds of folks who will care about RESTful interfaces to a robotic services fabric didn’t need the setup, and that it was more valuable to shave those 20 seconds than to keep them.
It’s worth noting how the context supported making this cut. Originally:
“We have services that talk to each other, that wire each other up, and use each other to construct and compose applications.”
… 20-second setup ….
“In our model, everybody is a client and everybody is a server.”
Finally:
“We have services that talk to each other, that wire each other up, and use each other to construct and compose applications. In our model, everybody is a client and everybody is a server.”
It flows perfectly.
Next I cut a restatement of “everybody is a client and a server” which chewed up 5 or 6 seconds without adding anything new. In doing so I ran into a logistical problem. When trying to make precise audio cuts in Camtasia you can run into trouble in tight spaces. (I keep meaning — and keep forgetting — to mitigate this problem by capturing at a higher frame rate than the one at which I finally produce.) A workaround is to silence a region that’s too small to accurately cut.
So, for example, after cutting that restatement I wound up with:
...at the same time same time. That has some great benefits.
                    ---------
I wasn’t able to cut the redundant “same time” without affecting the “That has” — but I was able to replace the redundant “same time” with silence:
...at the same time. ________ That has some great benefits.
That left a perfectly natural-sounding 1-second pause.
(Length now: 49:38)
Through the next section I made assorted internal cuts, and one major cut. After Henrik contrasted OO-style inheritance with the additive composition of RESTful services, which is the extension pattern for the Robotics Studio, we got into a several-minute discussion about the tradeoffs between these approaches. It wasn’t really conclusive, though, and I realized that it would be better to factor that out. In fact, while recording, I decided at this point to do a separate podcast in which we’ll drill down on these more abstract points. In a screencast, you want to keep the visuals moving along.
(Length now: 46:38)
For the remainder, more of the same: internal cuts, plus 20- or 30-second chunks that were disposable.
Final length: 34:30
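The running lengths in the log make it easy to tally where the time went. A small Python sketch, using only the four checkpoint lengths recorded above (the helper names are illustrative):

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total: int) -> str:
    """Convert a number of seconds back to 'minutes:seconds'."""
    return f"{total // 60}:{total % 60:02d}"


# Initial length, the two intermediate checkpoints, and the final length.
checkpoints = ["53:45", "49:38", "46:38", "34:30"]
secs = [to_seconds(c) for c in checkpoints]

# Time removed between each pair of consecutive checkpoints.
savings = [to_mmss(a - b) for a, b in zip(secs, secs[1:])]
total_saved = secs[0] - secs[-1]
percent_saved = round(100 * total_saved / secs[0])

print(savings)               # ['4:07', '3:00', '12:08']
print(to_mmss(total_saved))  # 19:15
print(percent_saved)         # 36
```

So the edit removed a little over a third of the original recording, with most of that coming from the final stretch.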
July 25, 2007
Posted by Jon Udell under
Uncategorized [9] Comments

Henrik Frystyk Nielsen used to work for the World Wide Web Consortium on some key pieces of infrastructure including the HTTP specification and libwww. He left the W3C in 1999 and now works for Microsoft where his current project is Robotics Studio, whose tagline is: “A Windows-based environment for academic, hobbyist and commercial developers to easily create robotics applications across a wide variety of hardware.” What that description doesn’t tell you, but today’s screencast shows, is that the Robotics Studio is based on a RESTful architecture, and that applications are built by composing lightweight services in ways that will be instantly familiar to every web developer.
To drive home that point, much of the action in this screencast occurs in a web browser, where you’ll see Henrik explore a distributed directory of services and view XML snapshots of the current state of bumpers, cameras, and laser range finders.
From a read-only perspective it’s all HTTP GET, and you can do things like subscribe to robotic sensors using RSS feeds. When you control a robot, SOAP is used to optimize fine-grained updates. But either way it’s a loosely coupled and late bound system that leverages the fundamental flexibility of web architecture in a very different domain. In one compelling demonstration of that flexibility, you’ll see a generic controller — which had been controlling the robot in Henrik’s office with no prior knowledge of the device, purely by interface discovery — switch over to a simulated robot and drive it by means of the same kind of discovery.
July 23, 2007
Posted by Jon Udell under
Uncategorized [33] Comments
Chart: Mean temperatures for December, Concord, NH, 1871-2007
The weekend was beautiful here in New Hampshire but today we’re back to what has over the past five years started to seem like normal for July: cool and cloudy. I’ve been taking an informal survey this month, asking friends whether they’ve gone swimming yet this summer. Almost nobody has, which we all agree is just weird, and which we all tend to attribute to climate change.
For me, our recent pattern of cool and cloudy summers is becoming a deal-breaker. Reliably nice summers used to be the payoff for living here through the winter, but if I can’t count on summer to be reliably nice, I’m tempted to consider other options. That’d be a big decision though, so I’d like to support it with hard data.
Given all that’s been said and written about climate change, it turns out to be surprisingly hard to get hold of historical climate data. I had to look around quite a while before I found this FTP site where NOAA has parked files full of raw temperature and precipitation data.
The files cover the whole world and they go back a long way. Here’s one line from the mean temperature file:
4257260500201993 -37 -73 -2 85 143 190 219 208 164 88 45 -13
The leading 16-digit field packs several identifiers together: the first three digits (425) are the country code, the next five (72605) are the World Meteorological Organization station number, and the last four (1993) are the year. What follows are twelve values, which are monthly mean temperatures in tenths of degrees Celsius.
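Here is a minimal Python sketch of reading one such line. It assumes the layout just described (country code in the first three digits of the leading field, station number in the next five, year in the last four) and that the twelve monthly values are separated by whitespace, as they are in the sample. The class and function names are illustrative, and any missing-value marker the files may use is not handled.

```python
from dataclasses import dataclass


@dataclass
class StationYear:
    country: str
    station: str
    year: int
    monthly_means_c: list[float]  # January through December, degrees Celsius


def parse_line(line: str) -> StationYear:
    """Split one record into its identifiers and twelve monthly means."""
    head, *values = line.split()
    if len(head) != 16 or len(values) != 12:
        raise ValueError(f"unexpected record layout: {line!r}")
    return StationYear(
        country=head[:3],
        station=head[3:8],
        year=int(head[-4:]),
        # Values are stored in tenths of a degree, so divide by ten.
        monthly_means_c=[int(v) / 10 for v in values],
    )


record = parse_line("4257260500201993 -37 -73 -2 85 143 190 219 208 164 88 45 -13")
print(record.station, record.year, record.monthly_means_c[6])  # 72605 1993 21.9
```

Splitting on whitespace is the fragile part: if the real files are fixed-width, adjacent values could run together, and slicing by column would be the safer choice.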
Since Concord, NH is the closest WMO station to me — 55 miles away — I uploaded the Concord data for mean temperatures (1871-2007) and precipitation (1859-2007) to Many Eyes, and looked for a recent pattern.
I didn’t find one. Here’s a view of summer mean temperatures, and here’s a view of precipitation. Look for yourself.
On the temperature front, we’re all very aware that this past December was freakily warm. And as this view shows, there hasn’t been an above-zero-Celsius December since 1982. But there were more (and warmer) above-zero-Celsius Decembers before the midpoint of the time series — 1939 — than there have been since.
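The before-and-after comparison is simple to express in code. A Python sketch follows, assuming a mapping from year to December mean in degrees Celsius. Except for 1993, which comes from the sample line above, the values here are made-up placeholders for illustration and not the Concord measurements.

```python
def count_warm_decembers(dec_means_c: dict[int, float], midpoint: int) -> tuple[int, int]:
    """Count above-zero Decembers before the midpoint year and from it onward."""
    before = sum(1 for year, t in dec_means_c.items() if t > 0 and year < midpoint)
    after = sum(1 for year, t in dec_means_c.items() if t > 0 and year >= midpoint)
    return before, after


# Midpoint of the 1871-2007 series.
midpoint = (1871 + 2007) // 2

# Placeholder values, NOT real data (only 1993 is taken from the sample line).
placeholder = {1875: 1.2, 1900: -2.0, 1923: 0.4, 1982: 0.3, 1993: -1.3, 2006: 1.0}

print(midpoint)                                     # 1939
print(count_warm_decembers(placeholder, midpoint))  # (2, 2)
```

Run against the full Concord series, the same function would yield the two counts being compared here.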
I am not saying that the planet isn’t warming or that the climate isn’t changing. But we ought to be able to explore the evidence for these phenomena, and review the interpretations of them, in much more interactive and collaborative ways. Not only for reasons of global policy, but also so we can contextualize what we see happening around us.
Arguing about the weather has undoubtedly been a favorite pastime of our species since we learned how to talk. Now we can have those arguments in the context of actual data. And as questions about climate change grow more critical, it’s imperative that we do. Hats off to Martin Wattenberg, Fernanda Viégas, and their colleagues for creating Many Eyes and showing how we can.
July 23, 2007
Posted by Jon Udell under
Uncategorized [6] Comments
In a pair of articles from today’s New York Times, the world’s unequal distribution of Internet access refracts through two very different lenses. Paul Krugman’s subscriber-only column, The French Connections, highlights the sorry state of the United States relative to France, Japan, and other nations where broadband access is more widely distributed and much faster. But as Ron Nixon points out in Africa, Offline: Waiting for the Web, “less than 4 percent of Africa’s population is connected to the Web.”
That’s not likely to change anytime soon, according to Ken Banks whom I interviewed for my weekly ITConversations podcast. The network that matters in Africa is the pervasive cellphone network. (The US, of course, fails to lead in that realm too.) Leveraging the ubiquity of text messaging, Ken has created an entry-level SMS hub called FrontlineSMS — free to charities and non-profits — which automates various patterns of text-message-based communication. He’s recently been awarded a MacArthur grant to continue this work. Good going, Ken!
July 20, 2007
Posted by Jon Udell under
Uncategorized [10] Comments
In today’s installment of my Microsoft Conversations series I talked with John Shewchuk about BizTalk Services, a project to create what he likes to call an Internet Service Bus. The project’s blog, with pointers to key resources, is here. There’s also a Channel 9 video on this same topic, in which John Shewchuk and Dennis Pilarinos illustrate the concepts using a whiteboard and demos.
I began our conversation with a reference to a blog item posted by Clemens Vasters back in April when BizTalk Services was announced. He described a Slingbox-like application he’d done for his family.
It’s a custom-built (software/hardware) combo of two machines (one in Germany, one here in the US) that provide me and my family with full Windows Media Center embedded access to live and recorded TV along with electronic program guide data for 45+ German TV channels, Sports Pay-TV included.
Clemens did this the hard way, and it was really hard:
The work of getting the connectivity right (dynamic DNS, port mappings, firewall holes), dealing with the bandwidth constraints and shielding this against unwanted access were ridiculously complicated.
And he observed:
Using BizTalk Services would throw out a whole lot of complexity that I had to deal with myself, especially on the access control/identity and connectivity and discoverability fronts.
I began by asking John to describe how BizTalk Services attacks these challenges in order to mitigate that complexity. We talked through a couple of scenarios in detail. The one you’ve heard the most about, if you’ve heard of this at all, is the cross-organization scenario in which a pair of organizations can very easily interconnect services — of any flavor, could be REST, could be WS-* — with reliable connectivity through NATs and firewalls, dynamic addressing, and declarative access control.
There’s another scenario that hasn’t been much discussed, but is equally fascinating to me: peer-to-peer. We haven’t heard that term a whole lot lately, but as the Windows Communication Foundation begins to spread to the installed base of PCs, and with the advent of a WCF-based service fabric such as BizTalk Services, I expect we’ll see the pendulum start to swing back.
At one point I asked John whether BizTalk Services supports the classic optimization — used by Skype and other P2P apps — in which endpoints, having used the fabric’s services to rendezvous with one another, are able to establish direct communication. He said that it does, and followed with this observation about the economics of hosting BizTalk Services.
When we host it, we’ll incur certain operational costs, so we’ll want to recover those costs. But our goal is not to differentiate our offering from others because we host the software, it should be the case that Microsoft competes on an equal basis with other hosters.
In many regards, our motivations differ from other providers. Take Amazon’s queueing service as an example. Because we’ve got software running both up in the cloud as well as on the edge nodes, we can create a network shortcut so that the two endpoints can talk directly. In that scenario, we don’t see any traffic up on our network. All we did was provide a simple name capability, so the two applications end up talking to each other, using their own network bandwidth. We can use the smarts in the clients and in our servers to reduce the overall operating cost.
Now in that scenario, the endpoints are presumably servers running within an organization, or maybe across organizations. But WCF-equipped clients can play in this sandbox too. The idea is, in effect, to generalize the capabilities of an application like Skype, and enable developers to build all sorts of applications that leverage that kind of fabric.
That’s a vision that many of us in the industry share. We’d just like to reduce the barriers to being able to connect our machines and our solutions. The industry’s seen a big transition to a hosted world, because that’s been the easiest way to get universal connectivity. If a big organization with a whole bunch of high priests of IT were out there running the servers for you, then you didn’t have to get your machine to be able to do that.
But sometimes I might just want to put those things on my machine, and if it were easy enough, wouldn’t that be a great model? Why do I want to be beholden to some organization that’s capturing my data? Maybe I want to have more privacy.
Of course there are benefits to moving it out to the cloud, but we think that should be a decision you make after the fact. Build your application to a consistent abstraction, then decide where you want the dial as the demands on the application change. If I’m just trying to do a quick video share with my friends, why do I have to create a new space? Why not simply say, here’s the URL? And have that URL be stable?
Why not indeed. At its core the Internet has always been fundamentally peer-to-peer, but after a while we couldn’t sanely continue in that mode. Things got too scary, so we built walls and created ghettoes. Technically our PCs are still Internet hosts but, except when they’re running a few important P2P apps, they haven’t really been hosts for a long time. It’d be great to get back to that.
July 19, 2007
Posted by Jon Udell under
Uncategorized [9] Comments
I’m making some progress in my quest to improve access to (and interpretation of) local public data. Yesterday’s meeting with the police department yielded a couple of spreadsheets — one with five-year historical data, and one with recent incident reports. The latter includes addresses which enabled me to plot the incidents in Virtual Earth.
It has been a while since I’ve done this, and the technology has matured. GeoRSS, in particular, seems to be a fairly new thing in the world. It’s a simple idea: use RSS (or Atom) to package sets of locations, encapsulating latitude and longitude coordinates in the GeoRSS namespace. Here’s the GeoRSS file I built from the police spreadsheet.
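For readers who haven't seen the format, here is a minimal Python sketch of producing such a feed. It assumes the GeoRSS-Simple encoding, in which each RSS item carries a georss:point element holding "latitude longitude". The function name, the incident fields, and the coordinates are placeholders for illustration and are not taken from the police spreadsheet.

```python
import xml.etree.ElementTree as ET

GEORSS_NS = "http://www.georss.org/georss"
ET.register_namespace("georss", GEORSS_NS)


def build_georss(title: str, incidents: list[dict]) -> str:
    """Wrap a list of located incidents in an RSS 2.0 feed with GeoRSS points."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for inc in incidents:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = inc["title"]
        ET.SubElement(item, "description").text = inc["description"]
        # GeoRSS-Simple: latitude first, then longitude, separated by a space.
        point = ET.SubElement(item, f"{{{GEORSS_NS}}}point")
        point.text = f"{inc['lat']} {inc['lon']}"
    return ET.tostring(rss, encoding="unicode")


# Placeholder incident with made-up coordinates.
feed = build_georss(
    "Incident reports",
    [{"title": "Example incident", "description": "Placeholder entry", "lat": 43.0, "lon": -72.0}],
)
print(feed)
```

The mapping services only need the point elements; titles and descriptions typically become the pushpin labels.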
In poking around online in order to learn how to use GeoRSS, I ran across a familiar name: Jef Poskanzer. For many years I have been enlightened by Jef’s various experiments at acme.com. Way back in 1997, for example, I was using his implementation of Java servlets to explore that way of building online services. So I was delighted to see Jef’s name pop up again when I looked into GeoRSS.
On his GeoRSS page you can plug in the URL of a GeoRSS file and his service will map it for you in either Google Maps or Yahoo! Simple Maps. Nice! Jef’s page doesn’t include Virtual Earth as an option, but Virtual Earth now supports GeoRSS too, so this was a good opportunity to try out that combination. It was, as you’d expect, quite easy to do. Given a well-formed GeoRSS file, all of the modern mapping APIs require very little of a developer who wants to spray the locations in the file onto a map.
But as I was reminded when going through this exercise, it requires a whole lot of work to transform a typical real-world document like the one I received yesterday — an Excel spreadsheet with manually-typed addresses — into a well-formed GeoRSS file. Data preparation is always the bottleneck.
Reflecting on how I got the job done, it’s amazing to consider the number and diversity of tools that I used. A partial list includes Excel for massaging and sorting data, Python for various bits of transformational glue, and curl to pump addresses through an online geocoder.
I also leveraged a ton of tacit knowledge about the web, about XML, about regular expressions, and about the organization and display of data.
It’s always striking to me how we technical folk tend to focus on the endgame. “Look, ma, no hands! Just plug your [insert newest format] into your [insert newest tool] and it’s automatic!”
Here’s one small example of the difficulties we sweep under the rug. Consider a series of incidents at these addresses:
27 Damon Ct.
27 Damon Ct.
35 Castle St.
35 Castle St.
45 Damon Ct.
45 Damon Ct.
165 Castle St.
165 Castle St.
That’s how the address column will sort in a typical spreadsheet. But that’s not how you’d like to scan the legend in a mashup. To help visualize neighborhood patterns, you’d rather see something like:
Damon Ct. (27)
Damon Ct. (27)
Damon Ct. (45)
Damon Ct. (45)
Castle St. (35)
Castle St. (35)
Castle St. (165)
Castle St. (165)
In reality it’s more complicated because I’ve omitted apartment numbers, dates, and annotations. The organization of these elements has a profound influence on which kinds of visualizations tend to come for free, and which will require a lot of extra work.
Now, because I’m handy with data, with text processing, and with regular expressions, I know how to reorganize the raw data. And because I have a view of the endgame, I know why to do it. But none of this is evident to a normal person sitting in an office compiling incident reports into an Excel log.
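The reorganization for the simplified example above can be sketched in a few lines of Python. This assumes every address is a house number followed by a street name, with none of the apartment numbers, dates, or annotations that complicate the real data. Streets are kept in order of first appearance, which matches the listing above; the function name is illustrative.

```python
import re

# A house number, whitespace, then the street name.
ADDRESS = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")


def restructure(addresses: list[str]) -> list[str]:
    """Rewrite '27 Damon Ct.' as 'Damon Ct. (27)', grouped by street."""
    street_order: dict[str, int] = {}
    parsed = []
    for raw in addresses:
        match = ADDRESS.match(raw)
        if not match:
            raise ValueError(f"unrecognized address: {raw!r}")
        number, street = int(match.group(1)), match.group(2)
        # Remember the order in which each street was first seen.
        street_order.setdefault(street, len(street_order))
        parsed.append((street_order[street], number, street))
    # Sort by street (first-seen order), then numerically by house number.
    parsed.sort()
    return [f"{street} ({number})" for _, number, street in parsed]


raw = [
    "27 Damon Ct.", "27 Damon Ct.", "35 Castle St.", "35 Castle St.",
    "45 Damon Ct.", "45 Damon Ct.", "165 Castle St.", "165 Castle St.",
]
for line in restructure(raw):
    print(line)
```

Sorting house numbers as integers matters: as text, "165" would sort ahead of "35".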
It seems, though, that we should now be able to normalize this kind of data entry in a way that would maximize the reuse value of the data. If I can feed random addresses into a geocoder and pretty reliably get back coordinates, I should also be able to feed unstructured addresses into some other online service and get back well-structured addresses. And I should be able to equip Excel to use that service to ensure that the structured addresses are logged with incidents. Is there a recipe for doing that?
July 17, 2007
Posted by Jon Udell under
Uncategorized [17] Comments
Recently I began keeping track of interesting public data sources using the del.icio.us tag judell/publicdata, and invited others to do the same using their own del.icio.us accounts. That method sets up an interesting pattern of collaboration whereby all contributions flow up to the global bucket, tag/publicdata, but individual contributors can curate subsets of that collection according to their own interests.
A nice example of that pattern emerged when the Many Eyes folks showed up at manyeyes/publicdata. Their contributions flowed up to the global bucket, and thence to the RSS feed I’m watching, which is how I got to find out about this excellent survey of a variety of public sources. It was done for a class at the University of Maryland, and it very helpfully characterizes data sources along a number of axes including searchability, browsability, interaction, and formats.
All this is quite straightforward and unsurprising to anyone who’s familiar with social bookmarking — which is to say, still quite unfamiliar to most people today.
So there’s not much chance that the next maneuver I’m going to describe will resonate in the general population, but I want to describe it anyway because those of us who think about these things ought to be thinking about how to make it more discoverable.
Several years ago, in a screencast entitled Language evolution in del.icio.us, I posited that tag vocabularies could evolve in the same way that natural languages do. In the realm of natural language, we coin new words all the time. When we hear a new word that we like, we adopt it — or, perhaps, adapt it. The punchline of the screencast was that this is how the grassroots semantic web will form. There are just two requirements: We need to be able to speak, and we need to be able to hear others speak.
Speaking, in the realm of tag vocabularies, means writing tags, and sometimes creating new ones. Hearing means reading tags, and observing how they’re applied to resources and by whom.
If you land on a page that you haven’t yet bookmarked, you can use the del.icio.us posting bookmarklet to show you (as recommended tags) which other tags have been assigned to that URL.
I tend to rely on a more sensitive organ of hearing: a bookmarklet that I call dc, for del.icio.us conversation. I use it all the time. Suppose, for example, I’d found that University of Maryland page through some other means of referral than del.icio.us. I’d have reflexively clicked the dc bookmarklet to produce this report which shows who else has bookmarked that page, and how it has been described.
In this case there’s not much to see. The URL was bookmarked once in Feb 07, by elzzup, to the tags data and class, and again in Jul 07, by manyeyes, to the tag publicdata.
This view is interesting for a couple of reasons that I don’t think are widely appreciated. First, it shows a progression from general ways of describing the resource to a more particular way. Note, by the way, that the proposed refinement of data to publicdata is not visible when you launch the bookmarking form, which recommends only class and publicdata. Note also that the introduction of publicdata is really a hack. It would arguably be better to rely on the individual tags public and data. But that would make it necessary to query for the conjunction, and that connection is too fragile. So publicdata also suggests something about how to form tags — that is, by making these conjunctions explicit.
Second, it shows who has proposed publicdata — namely, manyeyes, an identity that may be recognized, and that if recognized will add weight to the proposed usage of the tag.
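On the first point, the fragility of the conjunction is easy to see in a minimal Python sketch. The first two bookmarks are the ones shown in the report above; the third user, and the helper name with_all_tags, are hypothetical and exist only for illustration. This is not how del.icio.us stores or queries tags.

```python
# Bookmarks for one URL. The first two come from the report above;
# the third is a hypothetical user who applied the two separate tags.
bookmarks = [
    {"user": "elzzup", "tags": {"data", "class"}},
    {"user": "manyeyes", "tags": {"publicdata"}},
    {"user": "example_user", "tags": {"public", "data"}},  # hypothetical
]


def with_all_tags(bookmarks, *wanted):
    """Return the users whose bookmark carries every one of the wanted tags."""
    wanted = set(wanted)
    return [b["user"] for b in bookmarks if wanted <= b["tags"]]


# One explicit compound tag is a single lookup.
print(with_all_tags(bookmarks, "publicdata"))

# The conjunction only matches people who remembered to apply both tags.
print(with_all_tags(bookmarks, "public", "data"))
```

The conjunction query finds only the person who applied both tags, and it misses anyone who used just one of them or the compound form.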
These are subtle effects. For most people, they’re too subtle to matter at all. But I’m reminded that there’s important work yet to be done to render these effects in ways that make it easier for everyone to hear (and visualize) linguistic evolution in the tag domain, so that people can participate more actively and more naturally in that evolution.
July 16, 2007
Posted by Jon Udell under
Uncategorized [9] Comments
The LibraryLookup project is almost five years old, and people are still gradually discovering it, as I’m periodically reminded when I get a flurry of emails such as was provoked by this Lifehacker article. I think it’s time for this idea to graduate from the realm of hacks for adventurous people, and enter the realm of normal capabilities that everyone takes for granted.
For starters, if your system supports searching by ISBN, I suggest that you offer — in addition to whatever syntax you already use — one simple and standard pattern:
/search?isbn=1565925378
Next, use the OCLC’s xISBN service to expand the search to include all manifestations of the work indicated by the given ISBN.
Finally, have each instance of your system publish the bookmarklet made from these ingredients on its home page, along with instructions for using it.
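The first and last steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalog host catalog.example.org and the function names are hypothetical, the page URL is a made-up bookseller address, and the xISBN expansion step is left out because I don't want to guess at that service's request format. A real bookmarklet would do the same thing in JavaScript in the browser.

```python
import re
from urllib.parse import urlencode


def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 checksum: weights 10 down to 1, total divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def find_isbn10(page_url: str):
    """Return the first valid ISBN-10 embedded in a page URL, or None."""
    for candidate in re.findall(r"\b\d{9}[\dXx]\b", page_url):
        candidate = candidate.upper()
        if is_valid_isbn10(candidate):
            return candidate
    return None


def library_search_url(catalog_base: str, isbn: str) -> str:
    """Build the simple, standard lookup pattern: /search?isbn=..."""
    return f"{catalog_base.rstrip('/')}/search?{urlencode({'isbn': isbn})}"


# Hypothetical bookseller page and hypothetical library catalog.
page = "https://books.example.com/dp/1565925378/ref=example"
isbn = find_isbn10(page)
print(library_search_url("https://catalog.example.org", isbn))
```

If every catalog answered the same /search?isbn= pattern, the only thing that would vary from library to library is the base address.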
For extra credit, enable patrons to indicate wish lists of books they’re interested in, and notify them when books on those lists become available.
People like this stuff when they discover it, but as yet not many have, and until it’s baked into your systems, most won’t.
July 12, 2007
Posted by Jon Udell under
Uncategorized [12] Comments
The same audio glitch that ruined my interview with Joel Selanikio also affected another interview on the same day. That interview, for my Microsoft Conversations series, was with Ted Okada, who is the director of a small group called Microsoft Humanitarian Systems. So again I’ll have to settle for reporting highlights from the interview along with some quotes I was able to salvage.
Ted came to Microsoft by way of Groove, where he’d been hired to spearhead Groove’s use in the humanitarian sector, which had increasingly come to value the product for two interesting properties — technical resilience in the face of intermittent connectivity, and political resilience in the sense that it creates neutral infrastructure owned by no single agency. When I caught up with Ted, as he was packing for a trip to Afghanistan, he gave this example of the latter:
We’ve been working with an NGO that was using Groove to negotiate between the Tamil Tigers and the Sinhalese government in Sri Lanka. The two parties wouldn’t sit in the same room, but they did agree to use Groove to arbitrate the conflict.
In this appearance on Channel 10, Ted talks about how Groove is uniquely well equipped to support collaboration in disaster relief situations, and he demonstrates a Groove-based solution that enabled five different relief organizations responding to the 2005 Kashmir earthquake to synchronize on the same operational picture.
Ted has also been one of Microsoft’s representatives at Strong Angel, an exercise to simulate disaster response that’s been held three times — in 2000, 2004, and most recently 2006. Strong Angel was the brainchild of U.S. Navy Medical Corps commander Dr. Eric Rasmussen. I asked Ted what it’s like to participate, and he replied:
It’s an odd mixture of the early Interop conferences — where people were trying to get routers from different manufacturers to work together — plus a little bit of Burning Man, a little bit of Foo Camp, and a little bit of the military channel. Officially it’s a demonstration, but it involves all those elements and addresses all kinds of questions. How do you cross the civil/military boundary, particularly when trust is low and the need for collective action is high? How do you make sure all the gear works together? Of course it’s also a venue for some interesting gear, like solar reflective yurts that you might find at Burning Man — and in fact actually were taken to Burning Man.
As John Markoff reported, there were some notable interoperability failures at Strong Angel 3 but also some notable successes. One of the latter involved the use of Simple Sharing Extensions (SSE), an extension of RSS, to synchronize location data between Google Earth and Microsoft Virtual Earth.
I wondered what broader role SSE might play, given that it extends a Groove-like data synchronization capability to a diverse set of applications. It turns out that Ted will be testing a prototype SSE adapter for Microsoft Access on a trip to Kabul next week:
From my perspective as a relief and development person for 20 years, you can’t overestimate the value of simple tools like good old Access. What if Access could relay messages and synchronize via SSE, so that you’ve got persistent statefulness and failover on highly intermittent and jittery networks? Suddenly Access becomes a much more lively player in the edge-based mesh. So now in Afghanistan we’ll actually be using this wonderful everyman’s tool, Access, enlivened with SSE adapters, to help out an NGO partner who’s told us that would really help them share data with the other stakeholders in the reconstruction project they’re working on.
Ted has an interesting take on what Microsoft might learn by collaborating with these kinds of partners:
If you make the developer part of an environment that is itself stressed, and build for the extreme case, maybe you can titrate lessons faster and close the loop quicker on accelerated learning. It’s hard to work in a place like Afghanistan. It’s an austere environment and you’re at the mercy of that environment. Very few people know who Microsoft is — or care who we are — and there aren’t many places in the world where that’s true. In some ways, perhaps, immersion in that environment could turn out to be the ultimate sort of extreme programming.
Those were the highlights. It’s painful to have bungled those audio recordings. When I told Phil Windley he said, “I live in fear of that.” Well, the silver lining — for folks who don’t listen to podcasts, at least — is that it forced me to write more about the interviews than I normally do. Tomorrow I’ll record what will be the third in a series of conversations about humanitarian uses of technology, and you can bet I’ll double-check to make sure I’m recording what I think I’m recording!
July 12, 2007
Posted by Jon Udell under
Uncategorized [6] Comments
For this week’s ITConversations show I interviewed Dr. Joel Selanikio, co-founder of DataDyne, a non-profit consultancy dedicated to improving the quantity and quality of public health data. DataDyne’s principal tool is EpiSurveyor, a free and open source software product that simplifies the creation of forms for doing field data collection with handheld devices. There’s a Windows-based forms designer, and a runtime for Palm OS-based PDAs which is being ported to Windows Mobile- and Java-based devices.
It was a great interview but, when I opened up the audio file I’d recorded, I was horrified to find that an audio glitch had rendered it unusable. So instead I’ll report here on what Joel told me, and weave in some quotes I was able to salvage.
Our conversation was well-timed because I’d just watched the dustup between Michael Moore and Wolf Blitzer, checked out Moore’s rebuttal of Sanjay Gupta’s report on SiCKo, and tracked down some of the cited sources — including the United Nations Human Development Report. How reliable are these sources, particularly for developing countries? As you might imagine, and as Joel’s experiences confirm, there’s a lot of guessing going on:
It’s amazing how unaware people are of the tenuous nature of our knowledge of, really, anything. One of the things I ask people is: “What’s the population of the United States?” And they’ll say 290 million, or 300 million. But the real answer is: We don’t know. And we check, every 10 years, and we do a pretty good job of checking.
So if you want to know what’s the leading cause of disease in children in rural Africa, what’s the chance that you’ll have any idea what the answer to that question is?
I was a first responder to the tsunami in Southeast Asia. Imagine showing up in a place where the slate has been wiped clean, God just slammed his hand down and flattened everything. The roads are gone, and three or four thousand aid workers are all clustered in the few places they can get to. But of course, you have to come up with an estimate of how many people are dead. So somebody picks a number, and then you hear it on CNN that night. Fifty thousand, a hundred thousand, a hundred and twenty-five thousand, none of those estimates were based on any attempt to really find out.
A friend who works for American Red Cross asked me what I thought they could do. I said the most valuable thing to do with the hundreds of millions of dollars of donations they’d received was to invest in data collection. Normally in a situation like that you do a sample. You go to ten percent of households, and try to extrapolate, and hope that your sample isn’t biased. But I said, with five hundred million dollars, there’s no reason we can’t get the local people to do a census. Go to every refugee area and every household and actually find out — not estimates — but actual numbers, which would be of huge importance for reconstruction.
So of course it didn’t happen.
As a one-time database programmer who went to medical school and then became an epidemiologist, Joel’s acutely aware of the relationship between information technology and epidemiology, a discipline that is, as he reflects on here, profoundly data-driven:
In about 1995, over the course of six weeks, kids start showing up at the university hospital in Port-au-Prince, Haiti. They had different symptoms, but they all died. At about the hundred mark the local docs contacted the World Health Organization who contacted CDC, and a colleague of mine and I went down to Haiti. It’s a high-pressure situation, kids are dying every day. When we got there we began creating a database of responses to questions: what their symptoms had been, what medications they had taken. Within a few days it became apparent that all of the kids who’d died had taken one of several locally-produced Tylenol-like medications. Once we discovered that, they made an announcement and the outbreak ended.
People would ask me, “What magic did you work?” Well, in clinical medicine, the way that we understand things is — if it’s a rash, I look at the rash, I think about it, I look stuff up, but I don’t systematically create a database. For one patient you can juggle the variables in your head. But when you have a population of affected people, you need to collect data and analyze it. That’s the basis of epidemiology.
Unfortunately our standards and methods for data collection are far lower in the realm of public health than in the realm of business:
Imagine if you were the CEO of Toyota, and your CFO said, well, sales are pretty good, we think, but we’re not sure. He’d flip. And yet that’s how things are with public health. We’ve been making a concerted effort for fifty years to get rid of malaria, but the quality of our statistics is terrible, and we’ve just gotten used to that.
A key reason for that poor quality is that the collection of public health data in developing countries is still mostly a paper-based activity. And while handheld devices are obviously a great alternative, what Joel found is that the software available for creating surveys was way too hard for ordinary folks to use:
If you’re the ministry of health in Kenya and you want to survey a hundred thousand households, and use handhelds to do it, you’ll need some knowledge of programming. If I told you to write down your questions in Microsoft Word you could easily do it, it’s frictionless, but with the commercially available software for creating surveys that run on PDAs, you can’t do it. That software can do all kinds of fancy things, of course, but most of the time the information you need to collect is very simple stuff.
So that’s what EpiSurveyor does, Joel says. It makes the simple stuff simple, so that ordinary folks in developing countries can create surveys without having to hire programmers and consultants.
But how can you have any assurance that the data gathered in these kinds of surveys will be usefully comparable? Are there standard forms and standard schemas? Not really, Joel says. The existing forms are hard to find and reuse, and there’s been little progress toward standardization:
If you went to a UN organization and said, we want to standardize how we collect data about child nutrition, the response would be, let’s have a conference. We’ll have experts get together in Rome, and then in Paris, and decide what are the key questions for any standard child nutrition survey. But it’s hard to achieve unanimity, and there’s a built-in incentive not to because every time you get together it’s a trip to Rome.
Coming at the problem from a grassroots web 2.0 perspective, Joel’s working to translate the various forms used by international agencies into EpiSurveyor’s XML format, and to make them available in a shareable repository. The notion is that reuse will occur naturally when it lies along the path of least resistance. And he sees that starting to happen. For example, having trained field workers in both Kenya and Zambia, he discovered — after the fact — that the Zambian workers had found, and reused, a Kenyan survey posted on DataDyne’s private project management site.
My Zambian contact said, Joel, I hope this is OK, but I downloaded their form, and opened it up, and made a few changes — basically just the names of provinces — and then I used their form.
Of course it was more than OK, he was delighted. Asking the same questions, in the same ways, is exactly what you want to happen, and yet it rarely does.
The forms repository that Joel envisions doesn’t yet exist, but he’s hoping that as DataDyne builds up a reputation around successful deployments of EpiSurveyor, the company will be able to attract the resources and the attention needed to make that happen.
July 11, 2007
Posted by Jon Udell under
Uncategorized [5] Comments
Here’s an update for those who’ve been following the story of my quest for local crime statistics[1,2]. This morning I met with the police chief and some other officials. Given that I began asking for this data in late April or early May, and went through four rounds of telephone and then email contact, it shouldn’t have taken so long to convene the meeting. And it would have taken longer had I not engaged my friend Ted Parent, who is a lawyer and a great champion of democracy, to write a letter to the city attorney. The magic incantation in New Hampshire, by the way, is not Freedom of Information Act (FOIA), but rather Right to Know. It wasn’t enough for me to utter those magic words, though. Ted had to do that, in a letter that went on to describe in great detail my reputation, qualifications, and seriousness of purpose. That description is true, but shouldn’t have been necessary, nor should Ted’s services have been.
In any event, we had a productive discussion and will meet again soon to discuss logistics: what’s unavailable and why, what’s available and how to get it. What will likely be available is an update to this data set, which might or might not reveal trends since 2005. That would be of interest locally because, while there’s a strong sense that crime is worse lately, nobody seems to be clear about the details.
But here’s why that might not help. The feds only gather and report on certain categories of crime. Among those not included, the chief told me, are drunk driving incidents, which he’s been seeing a lot more of lately. Another systematic omission: rapes only count as rapes when inflicted on females by males.
Then there’s the fact that state participation in the National Incident-Based Reporting System (NIBRS) — which is apparently the new name for what used to be called UCR (Uniform Crime Reporting) — is voluntary and spotty.
So it’s unclear what questions can even be answered — in local, state, or national context — by the UCR/NIBRS data that the city’s software can and does report to the feds.
But other questions are entirely outside the scope of that dataset. It includes no location information, for example. I was surprised to learn that while the city does of course collect street addresses when entering crime reports into its database, they’re unaware of any straightforward way to get the location data back out in order to visualize geographic patterns. My hunch is that I can help them with that, if I can get hold of a raw export, so that’s something we’re going to explore at our next meeting.
This has been an interesting process to observe. Today the assistant city attorney said something that crystallized, for me, an insight about the stewardship of public data. Although the city has so far received very few Right to Know requests, one of them, she said, could have proved very costly in terms of the software and consulting services that would have been needed in order to comply. That insight won’t rewrite the legacy system, but it certainly imposes an important new requirement on its successor.
The folks I met with today aren’t familiar with ChicagoCrime.org or CAPStat, but I didn’t get the impression they’re opposed to the idea of citizen participation in the interpretation of government data. On the contrary, I think they may conclude that deploying systems to enable that participation would be as useful to them as it would be to the public.
We have a long way to go at all levels: local, state, national, international. But expectations are being reset, up and down the line, and I’m hopeful that we’ll get where we need to go.
July 9, 2007
Posted by Jon Udell under
Uncategorized [12] Comments

Hans Rosling has been justly acclaimed for a couple of TED talks on global health in which he makes mesmerizing use of his (and now Google’s) GapMinder software, which he uses to tell compelling stories with data. The software is very cool, but what really makes the stories come to life is Rosling’s narrative. Data analysis, for him, is a performance art.
I’ve been thinking about this because I’ve been trying to investigate a perceived crime wave in my home town. You’d think it would be straightforward to get hold of the data but, after four months, I’m still trying. Meanwhile, however, I found some historical data at the Bureau of Justice, and I decided to see what I could make of that.
The visualizations shown in today’s screencast were done with Many Eyes, which is another very cool piece of software. But what I realized while making them is that narrated animation is really the secret sauce. Analytical software, whether it’s Excel or GapMinder or Many Eyes or something else, is necessary but not sufficient. The stories that people will understand, and remember, are the ones that have been performed well.
Now I’m no Hans Rosling, and you certainly won’t see me swallow a sword at the end of this screencast — as he amazingly does at the end of this video. But I will be trying to emulate his example when I tell stories with data. And I’m struck, once again, by the way in which screencasting can bring software interaction to life.
The charts used in my screencast could have been made in Excel or in any other charting package. By making them in Many Eyes, I added the important new dimension of social analysis. So you can visit the data sets there, comment on the visualizations, and add your own visualizations. But data analysis as performance art goes beyond the snapshots produced by analytical tools. It lives in the interstitial spaces between the snapshots, traces a narrative arc, shows as it tells.
July 6, 2007
Posted by Jon Udell under
Uncategorized [8] Comments
As director of web publishing for Nature Publishing Group, Timo Hannay’s projects include: Connotea, a social bookmarking service for scientists; Nature Network, a social network for scientists; and Nature Precedings, a site where researchers can share and discuss work prior to publication.
The social and collaborative aspects of these systems are, of course, inspired by their more general counterparts on the web: del.icio.us, Facebook and LinkedIn, the blogosphere. That’s part of what we discussed in this week’s ITConversations podcast. We also talked about my longstanding concern that scientists, like other academics and indeed most professional people, aren’t directly rewarded for being wired into the web. Timo has some great ideas about how to change that. He notes:
This will sound a bit strange coming from someone who works for a journal publisher, but to date, the way that scientists’ output has been measured has been unduly focused on publications in peer-reviewed journals. That is, and will continue to be, a really important part of it, but it’s not the only thing they do.
Here’s one specific proposal for change — measure, and reward, contributions of data:
Biology in recent years has seen a move from what I would characterize as cottage industry science, where everything from data capture through to analysis to writing the paper happens within one lab among a small group of people, to a much more industrial scale where you have different groups, widely dispersed, perhaps who don’t even know each other, doing the data capture versus the analysis versus writing the paper.
But you can’t just publish a data set. So what tends to happen is that, for a really big important data set — like a new major genome — they’ll publish a paper off the back of it, and do a very quick preliminary analysis. But the real news is not the analysis, it’s the data set. They have to make this fig leaf of analysis in order to justify publishing the paper.
We need to make it possible for people to publish data sets — to put them out there, track what use is made of them by other people, and then eventually gain credit for that.
Excellent suggestion!
More broadly, Timo wants to measure activity in the specialized versions of the blogging, bookmarking, and social networking services that Nature Publishing Group is creating for scientists. He says NPG is working with funding organizations to figure out what kinds of measurement can support a broader system of credit and recognition.
I know it’s hard to nail down this touchy-feely stuff, but it really does matter. Yesterday I found a great quote from E.O. Wilson — in Consilience, which I’ve finally gotten around to reading — that helps explain why:
The creative process is an opaque mix. Perhaps only openly confessional memoirs, still rare to nonexistent, might disclose how scientists actually find their way to a publishable conclusion. In one sense scientific articles are deliberately misleading. Just as a novel is better than the novelist, a scientific report is better than the scientist, having been stripped of all the confusions and ignoble thought that led to its composition. Yet such voluminous and incomprehensible chaff, soon to be forgotten, contains most of the secrets of scientific success.
Narrating the work in openly confessional memoirs can and should be measurable, valuable, credit-worthy.
July 5, 2007
Posted by Jon Udell under
Uncategorized [20] Comments
The emerging discipline of social data analysis and visualization faces two challenges. First, obviously, you need data. Then, more interestingly, you need to figure out ways for people to create, share, and collaboratively refine interpretations of the data. There are a handful of well-known and powerful sources of data. The OECD’s data, for example, drives several of the visualizations at IBM’s Many Eyes site. Where else can you find data for these kinds of tools and services to chew on?
Sources I’ve used and discussed include Washington DC’s CAPStat and the Dartmouth Atlas of Health Care. A number of others are listed in this summary from the session at Foo Camp 07 on liberating government data.
For my own purposes, I’ve decided to keep track of these kinds of public data sources at del.icio.us/judell/publicdata. One of the delightful consequences of doing things that way is that I can pop up a level, to del.icio.us/tag/publicdata, in order to find out what other folks have been storing in the publicdata bucket.
There’s not a whole lot there, yet, but here’s one gem I discovered by way of a link to Gapminder: the United Nations Common Database. From the Gapminder blog on June 7:
UN statistics finally liberated and free of charge!
In a bold move that hopefully will set the standard for all major producers of statistics, UN Statistical Division have made their data accessible and FREE OF CHARGE from May 1 this year. United Nations Common Database (UNCDB) is now available for everyone, with no demand of subscription or user fees on their web-site.
We now look forward to the domino-effect and the liberation of other hidden or locked global statistics from other producers and collectors of data.
Amen. To that end, I invite readers of this blog to contribute these kinds of findings — as you encounter them in your travels — to the publicdata bucket in del.icio.us, to which I’m now subscribed. I’ll in turn curate that list at judell/publicdata, with an eye toward sources that I deem to be noteworthy, conveniently accessible, and likely to yield useful analysis.
July 3, 2007
Posted by Jon Udell under
Uncategorized [8] Comments
In the latest episode of my Microsoft Conversations series I talked with Pablo Castro about Astoria, a layer of middleware that makes data readable and writeable by means of a RESTful interface. Even if you don’t know or care about the buzzwords, it’s easy to show what Astoria does and to explain why it’s interesting. One of the sample databases configured to work with the experimental version of Astoria is a subset of the Encarta encyclopedia. You don’t have to be a programmer or grok XML in order to appreciate the following dialogue with the Astoria-enhanced version of Encarta.
| What are Encarta’s topic areas? |
encarta/encarta.rse/Areas |
<Area uri="Areas[4]">
<ID>4</ID>
<Name>Life Sciences</Name>
<Articles href="Areas[4]/Articles" />
</Area>
...etc...
|
| The answer comes back in exactly the form shown here. It’s XML, but a very webby kind of XML that’s full of links that I’ve rendered as clickable. |
| So, what’s the fifth Area? |
encarta/encarta.rse/Areas[5] |
<Area uri="Areas[5]">
<ID>5</ID>
<Name>Sports, Hobbies, and Pets</Name>
<Articles href="Areas[5]/Articles"/>
</Area>
|
| Every link asks a question, and gets an answer that embeds links to ask more questions. |
| OK, what are the articles in that area? |
encarta/encarta.rse/Areas[5]/Articles |
<Article uri="Articles[761553558]">
<ID>761553558</ID>
<Title>Aaron, Hank</Title>
<Preview>
Aaron, Hank, born in 1934, American baseball player,
nicknamed Hammerin’ Hank, whose 755 home runs broke
the all-time record previously held by ...
</Preview>
<Url>
http://encarta.msn.com/encyclopedia_761553558/Hank_Aaron.html
</Url>
<Area href="Articles[761553558]/Area"/>
<ArticleBody href="Articles[761553558]/ArticleBody"/>
<Notes href="Articles[761553558]/Notes"/>
<RelatedArticles href="Articles[761553558]/RelatedArticles"/>
</Article>
...etc...
|
A database with Astoria layered on top of it isn’t a web application, but it’s within shouting distance of being one, and you don’t even have to shout very loudly.
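To show how little a client has to do, here is a minimal Python sketch that reads the Area answer shown above and works out which questions can be asked next. The XML is copied from the dialogue. The function name follow_up_links is my own, and so is the assumption that each href is relative to the service root encarta/encarta.rse, which is what the URLs in the dialogue suggest. The sketch does no network access and is not Astoria's client API.

```python
import xml.etree.ElementTree as ET

# Assumption: hrefs are relative to this service root, as the dialogue suggests.
SERVICE_ROOT = "encarta/encarta.rse"

# The answer to "what's the fifth Area?", copied from the dialogue above.
AREA_XML = """
<Area uri="Areas[5]">
  <ID>5</ID>
  <Name>Sports, Hobbies, and Pets</Name>
  <Articles href="Areas[5]/Articles"/>
</Area>
"""


def follow_up_links(xml_text, service_root=SERVICE_ROOT):
    """Map each child element that carries an href to the next URL to request."""
    root = ET.fromstring(xml_text)
    return {
        child.tag: f"{service_root}/{child.get('href')}"
        for child in root
        if child.get("href") is not None
    }


area = ET.fromstring(AREA_XML)
print(area.findtext("Name"))
print(follow_up_links(AREA_XML))
```

The same function would work unchanged on an Article answer, where it would pick up the Area, ArticleBody, Notes, and RelatedArticles links.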
Pablo’s presentation at MIX is chock full of demos and explanations. Our podcast refers to and complements that presentation.
I’m not even close to being an expert in the underlying data access technologies, including ADO.NET, the Entity Data Model, and LINQ, so parts of the discussion quite frankly went over my head. Nor am I yet familiar with the tooling that’s required to wrap this kind of services layer around a plain data source. But I’m 100% clear that it’s a good idea, and a great example of RESTful web services — a book that Pablo Castro says is “required reading” for members of the Astoria team.
July 2, 2007
Posted by Jon Udell under
Uncategorized [26] Comments
If you plug the quoted phrase “the data finds the data” into any of the search engines, the first hit will be one of several essays on Jeff Jonas’ blog. Other evocative phrases that lead to Jeff’s blog include “perpetual analytics”, “sequence neutrality,” and “persistent context,” but while those will soon resonate once you scratch the surface of Jeff’s work, none is as broadly compelling as “the data finds the data.” As sound bites go, that one’s a keeper.
Jeff Jonas is chief scientist for IBM’s Entity Analytic Solutions. His long career in data surveillance, and recent interest in privacy-respecting data surveillance, has drawn a lot of media attention lately. In the mainstream he’s appeared in Newsweek and on NPR. In the techsphere, Tim O’Reilly blogged about Jeff’s visit to PC Forum, Dan Farber interviewed him at the Web 2.0 conference and Phil Windley wrote a detailed review of his keynote at ETech 2007.
Given our shared interests — including surveillance, analytics, security, privacy, and manufactured serendipity — it’s surprising that I only recently became aware of Jeff’s work. Of course, we’ve been working different ends of the same street. He’s focused on finding bad guys: casino fraudsters, terrorists, and others who collaborate secretly. I’ve focused on helping people who collaborate openly do so more effectively. And yet…these really are two sides of the same coin.
Here’s an example of “the data finds the data” in Jeff’s world, from his article in IEEE Security and Privacy entitled Threat and Fraud Intelligence, Las Vegas Style. You have two records that refer to the same person, but you don’t know that they do. Then a third record appears which relates to each of the first two, and which establishes that all three refer to the same person. The first two pieces of data find one another, through the agency of a third piece of data.
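That three-record scenario can be sketched in Python with a small union-find structure. To be clear about the assumptions: the records, field names, and values are invented, and linking on exact shared values is a deliberate simplification of my own, not the matching technique used in Jeff's systems.

```python
def resolve(records):
    """Group record ids that are connected through any shared field value."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every record to each (field, value) pair it contains.
    for rec_id, fields in records.items():
        for item in fields.items():
            union(("record", rec_id), ("value", item))

    groups = {}
    for rec_id in records:
        groups.setdefault(find(("record", rec_id)), set()).add(rec_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records: r1 and r2 have nothing in common.
records = {
    "r1": {"phone": "555-0100"},
    "r2": {"address": "12 Elm St"},
}
print(resolve(records))

# A third record arrives that relates to each of the first two.
records["r3"] = {"phone": "555-0100", "address": "12 Elm St"}
print(resolve(records))
```

Before the third record arrives there are two separate groups; afterward there is one group containing all three records.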
Here’s an example of “the data finds the data” in my world. On June 17 I bookmarked this item from Mike Caulfield, who is a local friend, the webmaster at Keene State College, and a forward thinker about Net-enabled education. On June 19 I noticed that Jim Groom — who is a distant acquaintance at the University of Mary Washington and another forward thinker on the same topic — had responded to Mike’s post. Ten days later I noticed that Mike had become Jim’s new favorite blogger.
I don’t know whether Jim subscribes to my bookmark feed or not, but if he does, that would be the likely vector for this nice bit of manufactured serendipity. I’d been wanting to introduce Mike at KSC to Jim (and his innovative team) at UMW. It would be delightful to have accomplished that introduction by simply publishing a bookmark.
But even if that weren’t the vector, the point is that given the overlap between Jim’s published work and Mike’s published work, it’s likely that they would sooner or later have discovered one another. In the realm of personal publishing, thanks to syndication and search, data tends to find data. And when it does, people find each other.
This process of discovery works best, of course, when there’s common data available to the syndication and search engines. When the same things have different URLs or different names, the connections are non-obvious.
For non-obvious connections that don’t want to be found, you need a technology like the one Jeff Jonas sold to IBM. It goes by the name NORA: non-obvious relationship awareness.
For non-obvious connections that do want to be found, though, we can help the process along in a variety of ways. Publishing hyperlinks is one way to expose non-obvious relationships. Publishing key words and phrases is another. So, for example, in reading up on Jeff Jonas’ work, I realized that the privacy-assuring version of NORA, called ANNA, which uses one-way hashes to obscure private information while still enabling matching and discovery, is related to Peter Wayner’s notion of translucent databases (1, 2).
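The general idea of matching on one-way hashes can be sketched in Python. This is a toy under stated assumptions, not a description of how ANNA or translucent databases actually work: the keyed hash (HMAC-SHA-256), the shared key, the normalization rule, and the sample records are all my own choices for illustration.

```python
import hashlib
import hmac


def blind(value: str, shared_key: bytes) -> str:
    """Return a one-way digest of a normalized identifier."""
    # Normalize case and whitespace so trivially different spellings match.
    normalized = " ".join(value.lower().split())
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def blind_all(values, shared_key):
    """Blind every value in a list, returning a set of digests."""
    return {blind(v, shared_key) for v in values}


# Hypothetical key and records, for illustration only.
KEY = b"key agreed on by both parties"
list_a = ["Pat Example, 12 Elm St", "Lee Sample, 9 Oak Ave"]
list_b = ["pat example,  12 elm st", "Kim Placeholder, 3 Pine Rd"]

# Each party shares only digests; the overlap shows that a match exists
# without either side disclosing its unmatched records in the clear.
shared = blind_all(list_a, KEY) & blind_all(list_b, KEY)
print(len(shared))
```

Each side can compare digests and learn that one record overlaps, while the records that don't match stay opaque to the other party.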
I’m not the first one to make that connection — Noah Campbell noted it last fall — but this item will strengthen it, in a way that may help some data find some other data, and some people find some other people.