Small steps forward for calendar syndication

In turbulent times it can help to focus on small steps and tangible signs of progress. In that spirit, here’s a fragment of the collaborative events calendar I’ve been trying to summon into existence:

07:00 PM DREW HICKUM & THE COLONELS (armadillos)
08:00 PM Roger McGuinn & Tom Rush (eventful: Colonial Theatre)
08:00 PM Patty Larkin | Francestown Meetinghouse (monadnock folk)
08:30 PM Chris Fitz (eventful: E.F. Lane Hotel)

A pretty good selection for a Saturday night in the Monadnock region! That’s good news for all of us living around here.

What’s especially encouraging, for me, is the process behind the scenes. That sequence of four closely-spaced events comes from four contributors who are publishing three different flavors of calendar feed, using Eventful, Google Calendar, and WordPress.

Best of all, only one of those contributors was me.
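
Mechanically, the merge is simple. Here’s a minimal sketch of the kind of aggregation behind a listing like that, assuming each contributor’s feed is available as an iCalendar URL. The URLs are placeholders, and the icalendar package is my choice for the sketch, not necessarily what my aggregator actually runs:

    # Sketch: merge several iCalendar feeds into one chronological listing.
    # Assumes the icalendar package; the feed URLs are placeholders.
    import urllib.request
    from icalendar import Calendar

    FEEDS = [
        "http://example.org/wordpress-calendar.ics",   # placeholder
        "http://example.org/eventful-venues.ics",      # placeholder
        "http://example.org/google-calendar.ics",      # placeholder
    ]

    events = []
    for url in FEEDS:
        cal = Calendar.from_ical(urllib.request.urlopen(url).read())
        for ev in cal.walk("VEVENT"):
            events.append((ev.decoded("DTSTART"), str(ev.get("SUMMARY"))))

    # One merged, time-ordered listing, whatever flavor of feed each event came from.
    for start, summary in sorted(events, key=lambda e: str(e[0])):
        print(start, summary)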

A conversation with Howard Bloom about collective learning, group selectionism, and the global brain

My guest for this week’s Innovators podcast is Howard Bloom. He’s written several books, one of which — Global Brain: The Evolution of Mass Mind from the Big Bang to the 21st Century — is the main topic of our conversation.

There’s no easy way to summarize this show, but here are some notes that I took while reading the book, and used to guide the discussion:

global data sharing among bacteria

complex adaptive system

imitative learning

individual vs group selection

passion for gathering in cities

raven roosts are data collection centers

elements of a collective learning machine:

  1. conformity enforcers (genome, social norms)
  2. diversity generators (curiosity, deviance)
  3. inner judges
  4. resource shifters
  5. intergroup tournaments

apoptosis / cell suicide

behavioral vs verbal memes

the group influences individual perception

each node in the collective brain represents a different approach available to the mesh of mind

individuals and subgroups are disposable rovers, sensors for an interlaced intelligence

pumphouse gang shows how individuals and groups can become test pilots for speculative strategies

team hunters, crop thieves, garbage raiders: each a separate “hypothesis”

collective intelligence uses the ground rules of a neural net: shuttling resources and influence to those who master problems, stripping influence, connection, and luxury from those who cannot seem to understand

If these themes resonate, you’ll love hearing Howard elaborate them.

Meme tracking with Twitter and Timeline

Social networks are Petri dishes in which we can watch memes emerge and spread by imitation. Three years ago, I traced the effect of a powerful one created by the ACLU: a fictional screencast about a dystopian future in which identity and privacy have gone horribly wrong. What I found when I looked at the data was that, although forward thinkers and actors in the realm of digital identity had only recently become aware of the ACLU’s powerful meme, it had been active for 18 months, most forcefully at the beginning of that span.

In that case the meme was an idea which, because it was neatly represented by an URL, could be tracked by using services like del.icio.us and Bloglines as proxies for the attention that flows to an URL.

In other cases, a meme is best represented by a word — often, a neologism. There’s no canonical URL to track, but there are other ways to monitor the spread of the meme. Search engines, for example. In the case of screencast, there were 200 Google hits in April 2005, 60,000 in June 2005, 325,000 in November 2005, and there are 3,000,000 today.

I’m always on the lookout for new ways to make these kinds of observations. Yesterday I encountered Pecha Kucha for the first time. It has a Wikipedia page, so the revision log there is one source of insight.

Since I encountered the phrase on Twitter, I tried a different strategy. While relaying a definition of the term, I used the tag #pechakucha. I realized that these Twitter “hashtags” are another proxy for linguistic memeflow, so I plotted occurrences of the tag on a Timeline. There were only 16 occurrences as of yesterday, so it’s a little sparse, but the same approach can be used to provide insight into the birth and evolution of any Twitter hashtag.
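
Generating the Timeline’s event source is the easy part. Given a list of (timestamp, text) pairs, gathered from Twitter search however you like, a sketch along these lines will emit the XML that Timeline consumes. The sample tweets are made up:

    # Sketch: turn (timestamp, text) pairs into a SIMILE Timeline event source.
    # The tweets below are made-up placeholders.
    from xml.sax.saxutils import quoteattr

    tweets = [
        ("Oct 08 2008 14:05:00 GMT", "Learning about #pechakucha: 20 slides, 20 seconds each"),
        ("Oct 09 2008 09:30:00 GMT", "Our first #pechakucha night is next month"),
    ]

    lines = ["<data>"]
    for when, text in tweets:
        lines.append("  <event start=%s title=%s />" % (quoteattr(when), quoteattr(text)))
    lines.append("</data>")

    open("pechakucha.xml", "w").write("\n".join(lines))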

Here’s a Timeline for #quotes. It started on April 6, 2008, when Leonardo Souza quoth: “#quotes ‘This story, like any story worth telling, is about a girl'”, which evidently is from Spider-Man.

One of the nice features of Timeline, one of David Huynh’s many ingenious creations, is this condensed summary of activity:

Here we can see sporadic use of #quotes from April to the first of September, and then much heavier use. What happened on September 1? Tim O’Reilly, a powerful meme transmitter and amplifier, quoth: “‘The skill of writing is to create a context in which other people can think.’ Edwin Schlossberg. #quotes”

In Timeline we can watch other Twitter users immediately begin to use and transmit the #quotes meme:

This method will be most useful for watching Twitter hashtags that haven’t yet been widely adopted. If you apply it to, say, #ike you’ll run into two problems. First, Twitter’s API caps the number of search results you can retrieve, so in the case of #ike we can only see back as far as September 18. Second, Timeline struggles to display thousands of events.

These are general problems. No matter which Petri dish we observe — del.icio.us tagspace, the blogosphere, Twitter — our ability to watch memes evolve is limited by the amount of data we can gather, and also by our ability to effectively visualize what data we can gather. I expect both constraints to gradually erode. As they do, this game of meme tracking will become even more interesting.

The Congressional content management system

Recent legislative drama highlights the absurdity of expecting people to make sense of complex texts that are evolving rapidly in high-stakes, high-pressure situations. What we have here is a classic culture clash, in this case between people who think in terms of paper documents and those who think in terms of electronic documents.

Washington is a paper-based culture. There are hopeful signs of change, and Bob Glushko spotted one of them here:

Based on the file name embedded in the pdf of the bill — O:\AYO\AYO08C04.xml — at least the people doing the publishing work for the bill are doing their best to save our tax dollars by creating the file using XML for efficient production and revision.

But there’s no public access to AYO08C04.xml. The government’s reflex is still to publish paper, or its electronic equivalent, PDF. So when the Sunlight Foundation’s John Wonderlich tried to visualize the evolution of the Senate’s version of the bailout bill, he was reduced to printing out PDFs, arranging them on the floor, and marking them up with a yellow highlighter.

Recognizing the futility of this approach, he complained on a mailing list and Joshua Tauberer responded with a special GovTrack.us feature that extracts the text from the PDFs and provides electronic comparisons. John Wonderlich observes:

Josh’s page does what I failed to effectively do with paper: get a comprehensive view of what has changed between each copy of the bill.
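
The underlying recipe is worth sketching, even though it’s surely not exactly what GovTrack does. This version assumes the pdftotext utility is installed, and the file names are placeholders:

    # Sketch: extract the text from two versions of a bill and compare them.
    # Assumes the pdftotext utility (xpdf/poppler); the file names are placeholders.
    import difflib, subprocess

    def text_of(pdf):
        subprocess.check_call(["pdftotext", "-layout", pdf, pdf + ".txt"])
        return open(pdf + ".txt").read().splitlines()

    old, new = text_of("bailout_v1.pdf"), text_of("bailout_v2.pdf")
    for line in difflib.unified_diff(old, new, "v1", "v2", lineterm=""):
        print(line)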

As I noted with respect to my recent legislative excursion, every Wikipedia author/editor takes for granted the ability to review the entire history of an article, compare differences between any two versions easily and effectively, and collaborate with other interested parties.

Even more powerful change visualization is possible, as we saw when Andy Baio, in response to my LazyWeb request for animation of Wikipedia change history, sponsored a contest that Dan Phiffer won.

Is MediaWiki, the software that powers Wikipedia, a more capable content management system than the one used by Congress to produce and collaboratively edit AYO08C04.xml? I would hope that the internal taxpayer-funded system is actually delivering the benefits that Bob Glushko supposes it must be. But how can we be sure? Maybe somebody in the know can comment.

Old-fashioned and newfangled plumbing

On this week’s Interviews with Innovators I followed up on the most unusual thing I saw at DEMO: a silicon-based flow-control valve for air conditioners. Mark Luckevich, VP of engineering for Microstaq, explains how they’re using MEMS (micro-electro-mechanical systems) to enable a simple retrofit that could save large amounts of the energy currently used for commercial air conditioning.

This conjunction of old-fashioned and newfangled styles of plumbing represents the sort of cultural mashup that always gets my attention. As Amory Lovins has been saying for decades about energy conservation, there’s low-hanging fruit we can harvest by instrumenting, monitoring, and controlling our HVAC systems using modern sensors, controls, and information systems. The Microstaq valve is a great example of that.

More generally, it points toward an interdisciplinary cross-fertilization that enables a set of well-established IT practices — logging, testing, debugging, hot-spot analysis, refactoring — to be applied in a very different domain.

Socializing the analysis of the socialization of banking

When Allen Noren pointed to this visualization of U.S. government bailouts, I wanted to tweak it by showing the magnitudes on a timeline. I found this data set on Many Eyes, updated it with the number $700B, and made this bubble chart:

Bailouts by year (Many Eyes)

It was quick and easy to do that, and the result was automatically shared in a social environment that invites discussion and revisualization.

But the visualization wasn’t quite what I wanted. So I fired up Excel and made this version:

Bailouts by year (Excel)

That does what I meant1. It also took longer, was harder, and isn’t automatically shared as an active object in a social environment on the web.

The reason it took longer and was harder is that I couldn’t find a way to automate the creation, placement, and rotation of custom labels. That might be more than we can expect a software wizard to do. But there are plenty of human wizards who know how to do it, and some of them probably have automation tricks up their sleeves.
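
Outside of Excel, that kind of label automation is routine in scripting environments. Here’s a rough sketch in Python with matplotlib, using made-up numbers, just to show that creating, placing, and rotating labels can happen in a loop rather than by hand:

    # Sketch: a bubble chart whose labels are created, placed, and rotated in a loop.
    # The names and dollar figures are placeholders, not the real data set.
    import matplotlib.pyplot as plt

    bailouts = [("bailout A", 1980, 4), ("bailout B", 1989, 293), ("bailout C", 2008, 700)]

    fig, ax = plt.subplots()
    for name, year, billions in bailouts:
        ax.scatter(year, 0, s=billions * 5, alpha=0.4)
        ax.annotate("%s ($%dB)" % (name, billions), (year, 0),
                    rotation=45, ha="left", va="bottom", fontsize=8)
    ax.get_yaxis().set_visible(False)
    plt.savefig("bailouts.png")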

That makes two reasons I wish that Excel could share active objects into social environments on the web. First, so that, a la Many Eyes, we could more directly riff on one another’s visualizations. Second, so that we could more easily teach one another how to use Excel to become more active participants in a world of data-driven abstractions.


1 Well, almost. I’d rather cluster the 2008 bubbles within an aggregate bubble.

Scott Prevost explains Powerset’s approach to semantic search

For this week’s Perspectives show I spoke with Scott Prevost, general manager and product director for Powerset, the semantic search engine that was recently acquired by Microsoft, and that can currently be seen in action working with the combined contents of Wikipedia and Freebase.

In our interview, Scott discusses the natural language engine — 30 years in the making — that Powerset acquired from PARC (formerly Xerox PARC). But he also makes clear that the use of that engine is part of a blended strategy that also takes advantage of statistical and machine learning techniques.

If you try Powerset, you’ll find that your mileage varies depending on a lot of factors. It’s clearly a work in progress, as all of natural language technology has been since, really, the dawn of computing. But the approach that Scott describes here sounds like a flexible and pragmatic way to leverage the technology as it continues to evolve.

Here’s one evocative use of Powerset:

Barack Obama’s book

The first result, Dreams from My Father, comes from Freebase, where that book is one of two items in the Works Written slot of Obama’s Person record. In this case there’s no need to discover structure; Freebase has already encoded it. But the natural language technology is being used in a complementary way, to map between a natural form of the question and the corresponding Freebase query.
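
For comparison, here’s roughly what the equivalent hand-written Freebase query looks like in MQL. The property names reflect my reading of the Freebase schema, and this is certainly not Powerset’s internal representation:

    # Sketch: asking Freebase directly, via MQL, for Obama's written works.
    # The name lookup and property names are my guesses at the schema;
    # this is not Powerset's internal query.
    import json, urllib.parse, urllib.request

    query = {"query": {
        "name": "Barack Obama",
        "type": "/book/author",
        "works_written": [],
    }}
    url = ("http://api.freebase.com/api/service/mqlread?query=" +
           urllib.parse.quote(json.dumps(query)))
    print(json.load(urllib.request.urlopen(url))["result"]["works_written"])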

To see a glimpse of what Powerset’s linguistic analysis of Wikipedia can do, try this query:

Jon Udell

Here, Powerset uses its semantic representation of my Wikipedia page to extract two “Factz” based on one of the linguistic patterns it uses. In this case, the pattern is subject / verb / object, and two Factz are adduced. One is bogus:

udell authored advisor

And the other is valid:

udell authored Practical Internet Groupware

There isn’t much in Wikipedia about me, but if you pick a more notable person — say, Tim Bray — the list of Factz includes:

chaired Atompub Working group
resigned editorship
founded OpenText
wrote Lark
invented XML
organized conferences
provoked protests
edited specifications

Missing from this list, by the way, is:

live-engineered Electric Eel Shock

Who knew?

OK, I’m just kidding about that, Electric Eel Shock’s live engineer was another Tim Bray, which points out the need — as Scott and I discussed briefly — for name/entity recognition and disambiguation.

I’ve always been fascinated by the ongoing effort to understand and produce natural language using computers and software. Fifty years ago, early computer scientists thought they’d lick the problem in five years. Now many people believe it may never happen. I think it will, but gradually over a long time. And as Scott Prevost points out, it’s just one tool in the kit, and should be used appropriately, in concert with other tools.

Ground truthing

My wife is preparing for one of her annual open studio events, and asked me to update her website with an announcement and a map. I’ve done this before, so it’s odd I never noticed that none of the popular mapping services accurately pinpoints our address. In Google Maps, the pushpin is stuck near Douglass St., a little more than a block away from the Roxbury/Grant intersection where we really are:

Live Maps gets the location right, but it shows a Roxbury/Hardy Ct. intersection that does not exist.

We believe that, long ago, Hardy Ct. did join Roxbury St., but at some point that changed, and now it’s a dead end, as Google Maps correctly shows.

Yahoo Maps, like Live Maps, has the correct location but wrong street layout:

What to do? I tried adjusting the lat/lon parameters in the Google Maps URL in order to move the pushpin to the correct location. That turned out to be more trouble than it was worth, so I just took a picture of a map, stuck a marker in the right place, and published that.
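
For the record, the URL pattern involved looks roughly like this; a q parameter carrying an explicit lat/lon pair drops a marker at that spot. The coordinates shown are placeholders, not our actual location:

    http://maps.google.com/maps?q=42.9339,-72.2786&z=17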

Along the way, I noticed a few things about these online maps that I’ve never considered before. First, in all these applications, it’s harder to work directly with lat/lon coordinates than I would have thought. Everything’s geared toward street addresses, which makes sense most of the time for most people. But as GPS coordinates become commonly available, shouldn’t they be first-class citizens in these user interfaces?

More broadly, there’s the question of ground truthing — the subject of a wonderful New Yorker article. The reporter, Nick Paumgarten, writes:

Despite the digitization of maps and the satellites circling the earth, the cartographic revolution still relies heavily on fresh observations made by people.

He goes on to describe a tour of the area near LaGuardia Airport with a pair of Navteq “field researchers” who spend their days comparing maps with reality. There are six hundred people doing this work.

But wait a sec. We’re living in the Age of Web 2.0 Participation. It’s a Two-Way Web. I know where my house is, and everyone on the east side of town knows that Hardy Ct. is a dead end. Shouldn’t there be an obvious way for millions of people to convey the ground truths they know to Navteq and Tele Atlas?

Catch-22

From a financial services company:

Dear Sir or Madam,

We are writing to let you know that computer tapes containing some of your personal information were lost while being transported to an off-site storage facility by our archive services vendor.

The missing tapes held certain personal information, such as your name, address, Social Security number and/or shareowner account information.

We have no reason to believe your information has been or will be improperly accessed or misused as a result of this incident. We understand, however, that you may have questions and concerns. That is why we are offering you and other impacted individuals a free credit monitoring product, TripleAlertSM, for 24 months to help you detect possible misuse of your data.

https://experian.consumerinfo.com/triplealert/…

Tap, tap, click, click…

You’re only steps away from enrolling in Triple AlertSM Credit Monitoring.

Verification Information:

  Social Security number     Confirm Social security number
    _____ ____ ______             _____ ____ ______

Aw, forget it.

Future shock, cowpaths, and Government 2.0

In a stunning September 11 essay on accelerating change and future shock1, Adam Greenfield asserts that the future is deeply terrifying to Americans whose “eponymous century…ended seven years ago today.” He adds:

In the relatively narrow field of my interests – ambient informatics, the networked city – can be seen something profound writ small: among fully-developed nations, the US stands out as having generally rejected “futuristic” interventions in everyday urban life, to the point that what I’m bound to present as innovative to US audiences is almost laughably banal elsewhere.

I wonder if that’s because the wave of acceleration hit America first, and we’ve been living through it longer. In any case I do sense a slackening appetite for technological novelty, and a nostalgia for simple solutions that meet basic needs quietly and capably.

From that perspective, I’m enjoying the action over at PublicMarkup.org, where the Dodd and Treasury proposals are accumulating per-section comments from citizens. You’ve got to love the proposed names for the Treasury’s Act.

This isn’t a futuristic intervention, it’s just a good old-fashioned bulletin board, with a brutally simple mental model shared by everyone. Exactly the sort of cowpath that we might, or might not, need to pave as we advance toward Government 2.0.


1 Via Matt McCalister’s excellent The opportunity cost of noiselessness.

What is an Internet operating system?

I trace the phrase Internet operating system back to a 2002 essay in which Tim O’Reilly imagined that the Internet OS would arise from, and become the governing framework for, a soup of ingredients:

All of these things [including web services, p2p filesharing, blogs] come together into what I’m calling “the emergent Internet operating system.”


In the third stage, the hodgepodge of individual services will be integrated into a true operating system layer, in which a single vendor (or a few competing vendors) will provide a comprehensive set of APIs that turns the Internet into a huge collection of program-callable components…

Of course the web had always been a collection of components, as I had pointed out in 1996, but the implicitly-available services woven into the web’s fabric were hard to use back then, and in many ways still are. One key enabler for the Internet OS, therefore, would be a framework for defining and deploying services. Another would be universal data-exchange mechanisms that would supply the grease to overcome data friction. Still another would be standard ways for services to communicate through intermediaries that support authentication, authorization, and group membership.

The Internet itself, meanwhile, had always natively supported peer-to-peer networking, a capability that was eclipsed when its success spawned a layer of NAT (network-address-translation) firewalls protecting hordes of semi-connected private networks. As a result, enabling the new Internet OS would also require some means of restoring that original P2P connectivity.

What blogging brought to the table, in addition to the liberating power of personal publishing, was a new take on the venerable publish/subscribe pattern, expressed now in terms of the familiar metaphor of news syndication. In any version of the new Internet OS, syndication-oriented architecture would have to play a crucial role.

Fast-forwarding to 2008, I’ve been reading definitions of the Internet OS like this one from Doc Searls:

[Google’s] Chrome also runs apps. In that respect, it’s more than the UI-inside-a-window that all browsers have become. It’s essentially an operating system.

An Internet OS is, to be sure, a platform for running applications, though that’s a slippery term given application styles ranging from native-to-the-underlying-OS to dynamic-HTML-plus-JavaScript to rich-a-la-Flash-and-Silverlight. But when you expand the notion of an application beyond UI-inside-a-window, a number of supporting themes come into view: universal data exchange, peer-to-peer connectivity, group formation, publish/subscribe messaging, syndication-oriented architecture. One of the places where these themes come together is Live Mesh, as Mike Zintel explained eloquently in Live Mesh as a Platform.

As I mentioned in my interview with Ray Ozzie about Live Mesh, the Internet OS meme morphed into the Web 2.0 meme which embodied an application style based on dynamic-HTML-plus-JavaScript and a cultural preference for open participation. But it was inevitable that the original notion would come round again. As it does, it behooves us to ask: “What is an Internet operating system?” I think we’ll find that it includes what we mean by Web 2.0 but also expands that meaning in ways we’ll want to discuss and define.

DayJet at the end of its runway

James Fallows, who wrote the book on the air taxi movement, delivers a first-draft post mortem on DayJet, whose founder and CEO Ed Iacobucci I interviewed last year. Evidently, and sadly, it’s curtains for DayJet.

I’ll leave the industry analysis to James Fallows, who’s in a far better position to assess the financial and aeronautical factors. As for me, I’ll just thank Ed Iacobucci for pioneering two important new categories of software: the operating system for a peer-to-peer air travel network, and the simulator used to model regional transportation options and evaluate how best to grow that network.

I am certain that these innovations will flourish in one way or another, and I hope that Ed and his teams will be there when it happens.

Biomedical initiatives at Microsoft Research

In this week’s installment of my Perspectives series I spoke with Kristin Tolle about a couple of important biomedical initiatives ongoing at MSR. One is a program with the daunting title Computational Challenges of Genome Wide Association Studies (GWAS). These studies entail scanning individual human genomes to look for genes implicated in diseases, or to check for reactions to drugs. The computational challenges that MSR wants to clarify, and help address, range from pattern recognition to data visualization to automatic analysis of clinical records. What’s the payoff? Kristin boils it down:

In simplest terms, genome-wide association studies will deliver on being able to provide personalized medicine for all of us.

Another program carries a much more accessible title: Cell Phone as a Platform for Healthcare. It aims to reach rural and underserved communities with solutions that leverage components that are cheap and ubiquitous — cellphones and TVs — using the Fone+, an enabling technology that was developed by Microsoft Research Asia:

It’s a phone that sits in a cradle, with RGB out to a television set, and USB input ports for mouse, keyboard, etc. So basically it enables your phone to work like a PC.

In one application, this combo powers a low-cost ultrasound scanner. In another, it might be a platform for a field microscope that detects malaria parasites.

Like other initiatives I’ve explored in the Perspectives series, this one belongs to MSR’s External Research division. There’s a lot of good work happening there, and I’m enjoying the process of finding and telling the stories.

Why didn’t phonetic audio indexing prevail?

Hugh McGuire notes that Google Labs has expanded the audio indexing and search of political videos on YouTube. I checked out the examples, and guessed that this system works by doing speech-to-text conversion, then conventional indexing and search of the text. That’s feasible because even an imperfect conversion yields plenty of recognizable words for search to find.

Here’s an example of imperfect conversion that doesn’t interfere with a search for the word “health”:

spoken: and that’s true with health care. Of the estimated 47 million

transcribed: the ranks not and that how much health care the native forty seven

Now that’s the worst of the small set I sampled. Here’s a less imperfect example:

spoken: businesses liberated from high taxes and health care costs will unleash

transcribed: businesses liberated from high taxes and health care costs well I’m

And sure enough, the FAQ confirms my hunch:

Google Audio Indexing uses speech technology to transform spoken words into text and leverages the Google indexing technology to return the best results to the user.

Way back in 2002, I reviewed Fast-Talk, a product (actually, a technology demo of a licenseable SDK) that took a completely different approach to audio indexing. It worked phonetically. One of my tests was of a phone interview with Tim Bray, which I recorded and which was indexed in realtime as we spoke. Here’s what happened next:

When my interview with Tim Bray was done, the first segment I looked for was the one where Bray said, “Jean Paoli spent four hours showing me XDocs.” The name “Jean Paoli” was, not surprisingly, ineffective as a search term. But “four hours” found the segment instantly, as did “fore ours” — which of course resolves to the same string of phonemes. “Zhawn Powli” also worked, illustrating what will soon become a new strategy for users of voice-aware search engines: When in doubt, spell it out phonetically. In practice, I find myself resorting to this strategy less often than I’d have expected. And it was fairly obvious when to do so. I guessed correctly that “MySQL” would not work, for example, but that “my sequel” would.

This approach doesn’t yield a transcript, but it’s so fast, efficient, and effective that I felt sure it would be in widespread use by now, and that audio indexing would be far more prevalent than it has become.
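
The core trick is easy to sketch: index phoneme sequences instead of words, and reduce queries to the same representation, so that “four hours” and “fore ours” land in the same place. The toy phoneme table below stands in for a real phonetic model:

    # Sketch: phonetic indexing with a toy phoneme table. A real system derives
    # phonemes acoustically; this hypothetical dictionary just illustrates the idea.
    PHONEMES = {
        "four": "F AO R", "fore": "F AO R",
        "hours": "AW ER Z", "ours": "AW ER Z",
    }

    def to_phonemes(text):
        return " ".join(PHONEMES.get(w, w) for w in text.lower().split())

    # The index maps phoneme strings to offsets in the recording (seconds, made up).
    index = {to_phonemes("four hours"): 312.5}

    for query in ("four hours", "fore ours"):
        print(query, "->", index.get(to_phonemes(query), "not found"))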

Why didn’t my prediction come true?

Jock Gill on energy, information, technology, networks, markets, and society (part 1)

Here is the first part of a two-part interview with Jock Gill, whom I can only partly describe as a technologist, philosopher, humanist, media hacker, and alternative energy entrepreneur. We met in a wonderfully serendipitous way. I was on a bicycle tour through the White Mountain National Forest last month, staying overnight with a friend of a friend, when the subject of pellet heat arose — as it frequently does nowadays in New England. My friend’s friend showed me this article about an entrepreneur who’s exploring the conversion of grass into fuel. But not liquid fuel for transportation. Rather, solid fuel for thermal applications, mainly heating (in the near term) but also potentially local power production. His name sounded familiar. Jock Gill? Where had I heard that before?

When I invited Jock to do this interview, I learned there were a couple of possible connections. During the early Internet years, he was director of special projects in the Office of Media Affairs at The White House. I had probably heard about that.

Going back a bit further, though, we discovered that we were both at Lotus Development Corp. at the end of the 1980s, by way of two separate acquisitions. He arrived with BlueFish, a company that did indexing and search. I arrived with Datext, a company that aggregated business information. We worked in the same division and it seems we must have met at some point, but perhaps not. Funny how that goes.

Anyway, twenty years on we connected for a fascinating conversation. It begins with grass as a potential source of solid-fuel biomass. Jock then expands to consider micro combined heat and power (MicroCHP). He ties that to decentralization, relocalization, and peer-to-peer resource sharing. He reminds us that while Al Gore did not invent the Internet he did presciently advocate the electranet. And in general, he connects the dots with respect to information, energy, technology, networks, markets, and society.

Swim-lane visualization of security protocols

Reacting to this report about a flaw in the single signon protocol for Google Apps (via ZDNet and heise Security), Kim Cameron writes:

As an industry we shouldn’t be making the kinds of mistakes we made 15 or 20 years ago. There must be better processes in place. I hope we’ll get to the point where we are all using vetted software frameworks so this kind of do-it-yourself brain surgery doesn’t happen.

The “brain surgery” Kim refers to here was the omission of a unique ID that’s supposed to be cryptographically bound into a SAML assertion, so that the party relying on the assertion knows it was “freshly minted in response to its needs”.
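
The missing check is simple to state, even if the surrounding protocol isn’t. In rough Python, with signature verification and XML parsing assumed to happen elsewhere:

    # Sketch: the freshness check a relying party should make on a SAML assertion.
    # Signature verification and XML parsing are assumed to happen elsewhere; this
    # only shows the "freshly minted in response to one of our requests" bookkeeping.
    import secrets

    outstanding = set()   # IDs of authentication requests we actually issued

    def issue_request_id():
        request_id = secrets.token_hex(16)    # unique and unguessable
        outstanding.add(request_id)
        return request_id

    def check_freshness(in_response_to):
        # in_response_to: the request ID cryptographically bound into the signed assertion
        if in_response_to not in outstanding:
            raise ValueError("assertion was not minted for one of our requests")
        outstanding.remove(in_response_to)    # a second presentation is a replay
        return True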

It would certainly be useful to standardize on a relatively small set of frameworks that have been vetted, as Kim suggests, and are believed to implement these tricky protocols accurately and reliably.

I can imagine taking things a step further, exposing the test suites for these frameworks so that any implementation can be explored interactively and probed automatically. Given the complex dance of machine-to-machine, machine-to-human, and sometimes human-to-human interaction that occurs when a security protocol is enacted, I’m reminded of Ward Cunningham’s swim-lane visualizations. The idea is that anyone can run business-logic tests on demand, visualize the resulting flow of interaction, and verify the outcomes. Ward’s vision didn’t garner nearly the interest I expected when I first wrote it up (and then followed with a podcast). But like so many of his brainstorms, I think his approach to implementing Brian Marick’s notion of Visible Workings is revolutionary.

Evaluating an implementation of a security protocol is a job that requires expert brainpower aided by all the automated tooling it can marshal. But security protocols are also forms of business logic that can, and should, be transparent and understandable to everyone — at least at some useful level of description. In Ward’s world, when you’re ready to submit your credentials to a login authority, you could hit an Explore button and land in a swim-lane visualization driven by the actual tests used to validate the implementation of the protocol you’re enacting. I’d like to live in that world.

That first step can be a doozy

Andy Baio has done a tour de force analysis of Girl Talk’s Feed the Animals, a musical mashup made from hundreds of samples. From Wikipedia, he extracted data about the samples: the artist, title, and start time of each sample. Then, remarkably, he used Mechanical Turk to crowdsource the lookup of a missing fact: the release dates for each sample. The resulting visualizations are wonderfully evocative. To arrive at that point, though, you’d have to intuit something that was “easy” for Andy but wouldn’t be for many people. He writes:

Getting the sample list was easy. I took a snapshot of the album’s Wikipedia entry and extracted all the samples using Excel’s Text to Columns feature.

Actually, I’m not sure what kind of snapshot would be amenable to the Text-to-Columns treatment, which divides text into columns based on delimiters (e.g., commas) or according to fixed-length rules (e.g., columns 1-30, then 31-40).

Of course there are other approaches, and that got me to wondering. Of the million ways to extract tabular data from an HTML table on a web page, which method would be most obvious to a nontechnical person? Which would be most effective? And, would those two methods coincide?

I think it’s reasonable to suppose that an average person would reach for Excel in this situation. What then? How would you think about importing the Wikipedia page into Excel?

One approach might be to plug the URL into Excel’s File Open dialog, though I think that’s unlikely because it’s labeled File name:. That doesn’t suggest that it accepts an URL. In fact it does, but the web page arrives as unformatted text, not formatted HTML.

Following Andy’s notion of taking a snapshot, you might instead do a Save As Web Page in the browser, and then try opening the saved file in Excel. This works rather well. The tables from Wikipedia arrive almost completely intact. However, a few cells need to be tweaked, and that’s problematic because the hyperlinks are active, and every time you touch a cell containing one you trigger a security alert.

If you went down this path, you might then try searching the web for ways to remove hyperlinks from an Excel sheet. You’d find advice ranging from creating a VBA macro to performing a brain-exploding manual procedure. And you’d probably bail.

Alternatively — but perhaps less intuitively? — you could activate Excel’s Data->From Web feature, plug in the Wikipedia URL, select the entire Wikipedia page as the table to be imported, and click Import. This works wonderfully well. The HTML tables arrive almost perfectly formatted as before, but sans hyperlinks so you can easily make the final corrections.
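
And for those comfortable with a bit of scripting, there’s a third route that bypasses Excel entirely. Here’s a sketch using Python and BeautifulSoup (my choice of tool, not Andy’s) that pulls the rows of each table on a page straight into a CSV file:

    # Sketch: scrape the tables on a Wikipedia page into a CSV file.
    # Assumes the BeautifulSoup package; the URL is a placeholder.
    import csv, urllib.request
    from bs4 import BeautifulSoup

    url = "http://en.wikipedia.org/wiki/Some_article_with_tables"
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")

    with open("samples.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for table in soup.find_all("table"):
            for tr in table.find_all("tr"):
                writer.writerow([cell.get_text(" ", strip=True)
                                 for cell in tr.find_all(["th", "td"])])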

It seems silly to dwell on these mundane details but they are where the devil resides. If frictionless capture of data from Wikipedia were more widely evident, Andy’s eye-opening use of Mechanical Turk would be well within the reach of many people less technical than he, and I, and doubtless many readers of this blog.

I’ve dwelled here before, but it’s just striking how some very basic kinds of data friction keep getting in the way of ever-more-amazing possibilities for analysis and insight.

A conversation with the founders of Princeton’s Center for Information Technology Policy

As information technologies weave their way into every aspect of our personal, professional, and civic lives, there’s a growing need for informed public discussion of their public policy implications. Princeton’s Center for Information Technology Policy (CITP) is one emerging forum for that discussion. My guests on this week’s Innovators show are Ed Felten and David Robinson, who are respectively the director and the associate director of the Center. Ed holds a wonderfully mashed-up job title: He’s professor of computer science and public affairs at Princeton. The Center’s mission, Ed says, is to “do the intellectual import/export work” necessary to build bridges of understanding between information technologists and the rest of society. Widely known as a leading researcher in the field of computer security, he started thinking more broadly about a decade ago:

It was pretty clear that information technology was going to do more than change the way geeks like us do our jobs. It was going to be a big deal for the way society was organized, for the way markets work, for the way people relate to one another.

David adds:

When I was an undergrad at Princeton, one of my frustrations was that there wasn’t enough institutional support for studying issues that relate to digital technology but are not solely technical, and touch many other areas.

In this conversation we discuss the origin and goals of the Center, and then turn our attention to a recent CITP paper entitled Government Data and the Invisible Hand. This widely-cited essay argues that governments should worry less about building full-service web portals and focus more on providing raw data in easily digestible formats that third parties can mash up as needed.

I agree in principle, but argue that governments will need to supply context for the raw data they provide. Consider XBRL (eXtensible Business Reporting Language), the standard that may become mandatory for companies filing with the Securities and Exchange Commission. An XBRL report isn’t just raw data, it’s data contextualized by a set of definitions that capture key principles and practices of accounting.

Nailing down those definitions is brutally hard work, which is why XBRL has been slow to develop. But it’s important and necessary work. The data produced internally by governments will need to be similarly contextualized if we’re going to realize the benefits of making it transparently available.

Of course if we were to wait for governments themselves to finalize definitions before releasing data, we’d wait forever. What’s more, the relevant principles and practices will be, in many cases, far less evident than in the realm of accounting. So I think we all agreed, in the end, that governments should release data early and often, and should emphasize reusable data products over fixed-function web portals. But at the same time, governments should engage in a public dialogue that iteratively refines the data products that are published, and the explanations of what they are intended to mean.

Silicon-based flow control for smarter/cheaper air conditioning and refrigeration

One of the key themes in Amory Lovins’ series of talks on energy efficiency in buildings is the energy lost when pumping fluids. At DEMO today, a company called Microstaq showed a silicon-based valve that promises to dramatically reduce the power required to move fluids through air conditioning and refrigeration systems. Current refrigerant expansion valves, controlled thermostatically or by step motors, are imprecise and waste a surprising amount of electricity. Microstaq’s Silicon Expansion Valve, an electronically actuated device, delivers more precise control for far less power — a savings of 25%, the company claims.

“We married two different industries that weren’t even dating,” said CEO Sandeep Kumar. When you look at the gadget, which embeds a silicon chip into a plumbing fixture, it’s easy to see what he means. It’s a pipe fitting with a brain.

When we think about how electronics can improve energy efficiency, the notion of an energy web, or electranet, often comes up. And there’s no doubt we need a modernized power grid that acts more like a digital network. But Lovins reminds us that we’re not just pushing electrons around, we’re pushing immense volumes of fluids. Semiconductor-based flow control technologies, applied rather straightforwardly to hydraulic systems, can produce major efficiency gains.

Annotating DNS with personal information

I’ve always had a fondness for solutions that scribble in the margins of the Domain Name System. Today I saw a new one at the DEMO conference: Telnic, a service you can use to store basic personal or business information directly in the DNS. The service is associated with the .tel top-level domain. If you visit, say, henri.tel, which belongs to Henri Asseily, Telnic’s chief strategist, you’ll see a web page, but it’s rendered by a proxy that pulls the information from DNS records.

As Henri notes on his blog, which is one of the links advertised in henri.tel, the system at its core is a way to store key-value pairs in the DNS. Users control this data by way of a web-based management console. Developers of .tel-aware applications can use DNS directly, or can use access libraries provided by Telnic. An application might, for example, locate people by way of the LOC (location) record in their .tel domains.
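
Here’s a sketch of that direct-DNS route, using the dnspython package (my choice, not one of Telnic’s access libraries) to read the TXT and LOC records for a .tel domain:

    # Sketch: read a .tel domain's records straight out of the DNS.
    # Assumes the dnspython package; henri.tel is the example from this post.
    import dns.resolver

    domain = "henri.tel"
    for rtype in ("TXT", "LOC"):
        try:
            for rdata in dns.resolver.resolve(domain, rtype):
                print(rtype, rdata)
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            print(rtype, "(no records)")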

You could, of course, use a web-based convention — like foaf.xml — to accomplish the same thing. People mostly don’t, though. Would a system bound more closely to DNS identity seem more natural and be more appealing? Maybe.

21st century Yankee ingenuity

Serendipity brought me a copy of this article on Jock Gill’s vision of small-scale grass farming operations. He thinks they’ll be able to produce biomass fuel, in a sustainable and decentralized way, for local production of heat and power. We had a long talk about this, and related themes, which will appear in two upcoming episodes of my Innovators show. Meanwhile, this paragraph from the article keeps echoing in my head:

He said a high school student in Morrisville has completed a successful prototype of a green wood chip combustion unit that can produce 50,000 BTUs of heat per hour. Gill said the student is confident his system could also burn dry grass tablets.

Good old Yankee ingenuity, in other words, hasn’t yet run its course. As we reconfigure our energy systems, that latent talent will flourish again.

The World Bank’s web of data could be webbier

As Stefan Tilkov notes tearfully, the REST API for the World Bank data leaves something to be desired. The URI to fetch the list of countries looks like this:

http://open.worldbank.org/rest.php?method=wb.countries.get

In my review of the Leonard Richardson and Sam Ruby book RESTful Web Services I summarized their best practices for making a service “look like the web”:

  • Data are organized as sets of resources
  • Resources are addressable
  • An application presents a broad surface area of addressable resources
  • Representations of resources are densely interconnected

So in this case, the resource that is the list of countries might simply be modeled as:

http://open.worldbank.org/countries

Likewise, given that the identifier for electric power consumption is EG.USE.ELEC.KH.PC, and for China is CHN, you might have:

http://open.worldbank.org/indicators/EG.USE.ELEC.KH.PC/CHN

Or, for all countries:

http://open.worldbank.org/indicators/EG.USE.ELEC.KH.PC

Or, to fetch all indicators for China:

http://open.worldbank.org/countries/CHN

If the data were organized this way, and if results included URIs for traversing up and down, you could explore this web of data by just navigating around in it.
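
To make the contrast concrete, here’s what client code against such a layout might look like. The resource-style URIs are the hypothetical ones proposed above, not anything the Bank actually serves today:

    # Sketch: what a client might do against the hypothetical resource-style layout above.
    # Only the rest.php URI at the top of this post is the Bank's real entry point.
    BASE = "http://open.worldbank.org"

    def countries():
        return BASE + "/countries"                    # list of countries

    def country(code):
        return BASE + "/countries/" + code            # all indicators for one country

    def indicator(ind, code=None):
        url = BASE + "/indicators/" + ind             # one indicator, all countries...
        return url + "/" + code if code else url      # ...or just one country

    for url in (countries(), country("CHN"),
                indicator("EG.USE.ELEC.KH.PC"),
                indicator("EG.USE.ELEC.KH.PC", "CHN")):
        print(url)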

When you draw attention to this question of style, you risk seeming like a pedantic REST philosopher. But it’s really just a practical, down-to-earth kind of thing. If I’m a developer exploring a collection of data, it’s convenient to be able to navigate a web of linked resources, and have the API revealed to me as I go along.

World Bank data now available through APIs

By way of David Stephenson I’ve learned that the World Bank now offers an API for several of its data sets on development, governance, and business conditions, plus a collection of photos.

Here are the indicators you can explore, for many countries, going back to about 1960 (though the data are sparse in some cases):

Agricultural land (% of land area)
Forest area (% of land area)
Surface area (sq. km)
Foreign direct investment, net inflows (BoP, current US$)
Workers’ remittances and compensation of employees, received (US$)
Control of Corruption
Market capitalization of listed companies (% of GDP)
Cost (% of estate) [Closing a Business]
Recovery rate (cents on the dollar) [Closing a Business]
Rank [Closing a Business]
Time (years) [Closing a Business]
Cost (% of income per capita) [Dealing with Licenses]
Procedures (number) [Dealing with Licenses]
Rank [Dealing with Licenses]
Time (days) [Dealing with Licenses]
Cost (% of debt) [Enforcing Contracts]
Procedures (number) [Enforcing Contracts]
Rank [Enforcing Contracts]
Time (days) [Enforcing Contracts]
Difficulty of Firing Index [Employing Workers]
Difficulty of Hiring Index [Employing Workers]
Firing costs (weeks of wages) [Employing Workers]
Nonwage labor cost (% of salary) [Employing Workers]
Rigidity of Employment Index [Employing Workers]
Rigidity of Hours Index [Employing Workers]
Rank [Employing Workers]
Credit Information Index [Getting Credit]
Legal Rights Index [Getting Credit]
Private bureau coverage (% adults) [Getting Credit]
Public registry coverage (% adults) [Getting Credit]
Rank [Getting Credit]
Disclosure Index [Protecting Investors]
Director Liability Index [Protecting Investors]
Investor Protection Index [Protecting Investors]
Rank [Protecting Investors]
Shareholder Suits Index [Protecting Investors]
Labor tax and contributions (%) [Paying Taxes]
Other taxes (%) [Paying Taxes]
Payments (number) [Paying Taxes]
Profit tax (%) [Paying Taxes]
Rank [Paying Taxes]
Time (hours) [Paying Taxes]
Total tax rate (% profit) [Paying Taxes]
Ease of Doing Business Rank
Cost (% of property value) [Registering Property]
Procedures (number) [Registering Property]
Rank [Registering Property]
Time (days) [Registering Property]
Cost (% of income per capita) [Starting a Business]
Min. capital (% of income per capita) [Starting a Business]
Procedures (number) [Starting a Business]
Rank [Starting a Business]
Time (days) [Starting a Business]
Cost to export (US$ per container) [Trading Across Borders]
Cost to import (US$ per container) [Trading Across Borders]
Documents for export (number) [Trading Across Borders]
Documents for import (number) [Trading Across Borders]
Rank [Trading Across Borders]
Time for export (days) [Trading Across Borders]
Time for import (days) [Trading Across Borders]
External debt total (DOD current US$)
Short-term debt outstanding (DOD current US$)
Official development assistance and official aid (current US$)
Total debt service (% of exports of goods, services and income)
Electric power consumption (kWh per capita)
Energy use (kg of oil equivalent per capita)
CO2 emissions (metric tons per capita)
Annual freshwater withdrawals total (% of internal resources)
Cash surplus/deficit (% of GDP)
Revenue, excluding grants (current LCU)
Government Effectiveness
Time required to start a business (days)
Roads, paved (% of total roads)
Internet users (per 100 people)
Internet users (per 1,000 people)
Fixed line and mobile phone subscribers (per 100 people)
Fixed line and mobile phone subscribers (per 1,000 people)
Military expenditure (% of GDP)
Exports of goods and services (% of GDP)
Gross capital formation (% of GDP)
Imports of goods and services (% of GDP)
Agriculture, value added (% of GDP)
Industry, value added (% of GDP)
Services, etc., value added (% of GDP)
Inflation, GDP deflator (annual %)
GDP (current US$)
GDP growth (annual %)
GNI, Atlas method (current US$)
GNI PPP (current international $)
GNI per capita, Atlas method (current US$)
GNI per capita PPP (current international $)
Political Stability and Absence of Violence
Rule of Law
Regulatory Quality
Ratio of girls to boys in primary and secondary education (%)
Primary completion rate, total (% of relevant age group)
Prevalence of HIV, total (% of population ages 15-49)
Mortality rate, under-5 (per 1,000)
Improved water source (% of population with access)
Immunization, measles (% of children ages 12-23 months)
Improved sanitation facilities, urban (% of urban population with access)
Births attended by skilled health staff (% of total)
Malnutrition prevalence, weight for age (% of children under 5)
Income share held by lowest 20%
Poverty headcount ratio at national poverty line (% of population)
Adolescent fertility rate (births per 1000 women ages 15-19)
Contraceptive prevalence (% of women ages 15-49)
Life expectancy at birth, total (years)
Fertility rate, total (births per woman)
Population growth (annual %)
Population, total
Merchandise trade (% of GDP)
Net barter terms of trade (2000 = 100)
High-technology exports (% of manufactured exports)
Voice and Accountability

So, for example, you might wonder about electric power consumption (kWh per capita) for the US, India, and China. According to these data:

          US   India   China
1975    8522     116     196
1985   10414     194     353
1995   12659     365     770
2005   13647     480    1780

Thanks, World Bank! That sure beats digging the answers out of PDF files.

Freebase, Wikipedia, Powerset

Although I’ve explored Freebase in several ways, I hadn’t seen the way it is now integrated — along with Wikipedia — into the Powerset demo of natural language search. It’s quite eye-opening to see the answer to a deceptively simple query like Tim O’Reilly’s siblings.

Assuming that a database goes to the trouble of actually knowing Tim as an entity, knowing his siblings as entities, and knowing their relationships, it’s fascinating to think about when you’d want to do Parallax-style exploration — in this case, find Tim, then click to see his siblings — and when you’d want to cut to the chase by asking a direct question. Personally I’d want both modes available, but I suppose some people will mainly prefer to navigate and others to ask questions.

Activating the web: One programming language or many?

Google’s newly-announced browser, which bakes in a JavaScript-specific virtual machine, reminds me of an earlier era in which the Netscape browser baked in support for the Java VM. It makes perfect sense for Google under the circumstances, but also serves as a reminder that language-specific runtimes aren’t the only game in town. From that perspective, it’s worth recalling that Silverlight is based on the .NET Common Language Runtime, a multilingual engine that can accommodate languages ranging from C# to Ruby while leveraging a common set of libraries and a common security architecture.

It’s true of course that JavaScript is the web’s original and predominant mechanism for injecting active behavior into otherwise static web pages. But the web’s evolution — from a collection of hyperlinked documents into what is now also becoming a collection of applications and services — is ongoing. The capabilities of both the HTML+JavaScript layer and of the plug-in layer — where Flash and Silverlight reside — are evolving too. And the boundary between those layers is being redrawn.

Google’s new browser runtime aims to improve the HTML+JavaScript layer in a way that further enshrines JavaScript as the web’s programming lingua franca. Meanwhile, Silverlight 2.0 and the Dynamic Language Runtime aim to improve the plug-in layer so that other languages — including dynamic languages like IronPython and IronRuby — can be used to activate the web. As both efforts go forward, it’ll be fascinating to see just how that boundary between the layers does get redrawn.

New England’s biomass-fueled home heating future, part 2

The essay I posted last winter about New England’s historic transition from oil-fired home heating to biomass-fired alternatives has been read consistently ever since. Here’s a Labor Day 2008 update.

As is typical in New England homes, my 1870-era home isn’t conducive to space heating. Which makes you wonder why open plan wasn’t fashionable back then. A railroad layout connects a series of small rooms, and the three chimneys tell you that the original space heating solution — many fireplaces — was a challenge. When coal-fired and then oil-fired central heating began to deliver hot water to radiators in every room, it must have seemed like a miracle.

Then, suddenly, oil prices more than quadrupled and the miracle became a nightmare. Supplemental heating with a pellet stove helped, but it would be crazy to put a pellet stove everywhere fireplaces used to be. The central heating system has to be reconfigured to burn an alternate fuel.

Additionally, of course, the thermal integrity of the shell has to be improved. In my case there’s adequate attic insulation, so to make a big difference you’d want to replace all the windows and rebuild the walls from the inside. That’s an investment arguably worth making at this point, but if you’re still burning oil, that might only wind the clock back to 2005, when we were at two-buck-a-gallon oil, not 2000, when it was eighty-nine cents. And of course the clock keeps ticking.

So biomass-fired central heating has become imperative, and two classes of solution are emerging. Pellet boilers are the central-heating equivalent of pellet stoves. And wood gasification boilers are the central-heating equivalent of wood stoves.

It’s a back to the future scenario. Yes, it’s a return to a solid-fuel-based regime that we thought we had left behind. But both solutions burn biomass far more cleanly, efficiently, and safely than was ever possible before. Neither is as automatic or convenient as oil heat — wood gasifiers even less so than pellet boilers — but that’s going to be the new reality, at least for a while to come.

So, pellet boiler or wood gasifier? I chose the latter because, while more labor-intensive, I like the idea of being closer to the fuel source, i.e. trees. Cordwood is a minimally-processed derivative. If it became necessary I could own a woodlot and make it myself. I have lots of friends who do just that.

Pellets are a downstream, more highly-processed product. They’ve been an attractive option so far because cost has been reasonable and availability hasn’t been a problem. But as I understand it, that’s largely because the pellet industry is currently harvesting waste wood products — sawdust, wood scrap. At some point it will have to go back to the source. When the pellet industry has to start harvesting trees to make pellets, I’m betting that the real cost of their convenience will become apparent.

In either case, of course, there are important unanswered questions about sustainable forestry. Can we manage our forests for sustainable production of wood-based solid fuel on the scale that will be necessary? Nobody knows, but we are about to start finding out.

It’s not necessarily all about trees, by the way. I recently had a fascinating conversation with Jock Gill, for an upcoming interview, about a different approach based on grass pellets. That’s a story for another day, but if you’re curious, read this article and think about the challenges of transporting trees to multimillion-dollar processing plants and then distributing the derivative solid fuels. Jock envisions, instead, a decentralized network of local producers whose processing operations require far less capital investment, and whose products need not travel far.

But I digress. Here’s my situation at the moment. I imported an EKO-40 wood gasifier, it’s sitting in my garage, and it’s ready to be installed. Except it can’t be. Because I’ve discovered, to my horror, that my city’s building code won’t allow it. Why not? It doesn’t have UL and/or ASME stickers. Instead, it has TUV and CE stickers, which certify that the machine complies with the following European standards and directives:

standards

EN 60335: Specification for safety of household and similar electrical appliances. General requirements

EN 50165: Electrical equipment of non-electric appliances for household and similar purposes. Safety requirements

EN 55014: Electromagnetic compatibility. Requirements for household appliances, electric tools and similar apparatus. Emission

EN 61000-6-3: Electromagnetic compatibility (EMC). Generic standards. Emission standard for residential, commercial and light-industrial environments

EN 45011: General requirements for bodies operating product certification systems

EN 303-5: Heating boilers. Heating boilers with forced draught burners. Heating boilers for solid fuels, hand and automatically fired, nominal heat output of up to 300 kW. Terminology, requirements, testing and marking

EN 60529: Specification for degrees of protection provided by enclosures (IP code)

directives

97/23/EG: Directive 97/23/EC of the European Parliament and of the Council of 29 May 1997 on the approximation of the laws of the Member States concerning pressure equipment

73/23/EEC: Council Directive 73/23/EEC of 19 February 1973 on the harmonization of the laws of Member States relating to electrical equipment designed for use within certain voltage limits

89/336/EWG: EMC-directive, Electromagnetic compatibility

I have been trying to identify the relevant UL and/or ASTM standards so that I can get a qualified engineer to write a letter to the city explaining that my machine is as safe, clean, sophisticated, and effective as any of the U.S.-certified machines they would approve.

Along the way, I’ve discovered that it’s not clear there are any devices that they would approve. If a UL sticker is required, which UL standard should it certify? UL 391? That’s an umbrella standard governing older electromechanical systems but, I’m told, it may not be relevant to the latest technology with its more sophisticated electronic controls. Or UL 2523, entitled Solid fuel-fired water heaters and boilers, which isn’t yet supported by any system I’ve found?

Likewise if an ASTM sticker, which ASTM standard, and why? And oh by the way, although the city’s codes don’t yet say anything about emissions, my EKO is tested to the strict EN 303-5 standard because Europe, unlike the U.S., takes emissions seriously. My understanding is that the EKO isn’t just way cleaner than the wood stoves and outdoor wood boilers that people are frantically installing these days, it’s cleaner in most respects than an oil burner! Shouldn’t I be rewarded, not punished, for investing in a solution that respects the city’s air quality and the planet’s carbon burden?

On Tuesday I’ll meet with the city’s chief code officer to try to answer these questions, and see if there’s a way we can move forward. Based on what they’ve told me so far, though, it seems possible that none of the best pellet boilers and wood gasifiers, from either domestic or foreign manufacturers, would meet my city’s code requirements as currently written. And that’s because this class of machine, long used in Europe, has only recently started to become interesting to American homeowners. The codes haven’t had time to adjust to a technology landscape that’s now undergoing major and rapid upheaval.

If I weren’t stuck in the middle of it, this would just be a case study of the perpetual tug of war between standards and innovation, at an unusual moment in history when time is of the essence. But I am stuck in the middle, and I have no idea how the story’s going to turn out. I’m writing it anyway because I went into this with two goals. First, I wanted to solve a pressing problem for me and my family. But second, I wanted to be able to document and openly discuss a solution that will work for many others who will want to follow. I don’t think that my EKO boiler, if it were to be permitted, would be the first to be installed in Keene, NH. But I do think it would be the first to be legally permitted. So, wish me luck!

Trident: A workflow system for doing data-intensive science with reproducible results

Another of the many interesting stories coming out of Microsoft External Research these days is the one Roger Barga tells in this week’s installment of Perspectives. When Roger told me that Trident, the system he’s developing to automate scientific workflow, was inspired by Jim Gray, it was a déjà vu moment. Everywhere I turn, I find new evidence of Jim’s profound and far-reaching influence at the intersection of science and computing.

I never met Jim in person, but we collaborated briefly on this 1995 BYTE feature that condenses his career-long work in the field of scalable databases and transaction monitors into a lucid taxonomy. Well, it’s a stretch to say that we collaborated. Jim delivered the article in pristine condition, and there were only minor editorial details needing attention. But when he did attend to them, he exhibited the qualities I’ve since heard about from many others. He was gracious, fully attentive, deeply wise, broadly connected. It’s remarkable to watch the connections he formed continue to ripple through MSR and out to MSR’s external partners.

In Roger’s case, here was the seed:

Jim Gray was the first person who had the vision of an oceanographer’s workbench. His insight was that scientists really want to interact with visualizations of the ocean, but there was a huge gap between the raw data and those visualizations.

In our interview, Roger describes a project called Trident, a system for authoring, running, and tracking the provenance of scientific workflows — that is, sequences of computational steps that bridge the gap between the data produced by the Neptune sensor array and the COVE visualization system.

Oceanography is only the first scientific discipline that will benefit from Trident. Astronomy is next in line, and other fields are expected to follow. As all scientific disciplines become increasingly data intensive, two related requirements emerge. There needs to be a general framework for creating pipelines of reusable data transformations, and it needs to be coupled with the ability to document, version, and reliably reproduce the results that come out of those pipelines.

Today, as Roger points out, reproducing a scientific result is often a dicey thing:

If you happen to know the person who did the experiment, or if you happen to capture enough stuff in your lab notebook or on your whiteboard, then you have a chance of being able to do it again.

In the domain of software engineering, both commercial and open source, that would simply be unsustainable. So strong traditions of version control and provenance have developed. But as Greg Wilson has been observing for many years, those traditions have not sufficiently taken hold in many computationally-intensive areas of science. In this interview Greg takes the HPC (high-performance computing) community to task for caring too little about verifying the correctness of models and ensuring that code and data are managed in ways that make experiments reliably reproducible.

Some scientists can and do assimilate the best practices from software engineering. But most will need a system that embodies those best practices, and that is what Trident aims to be.
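
To make the provenance idea concrete, here’s a toy sketch in Python (my own illustration, not Trident’s actual design) of what it means for a single workflow step to record enough about its inputs, code, and outputs to be rerun and checked later:

    # Illustrative only, not Trident: run one workflow step and append a
    # provenance record noting what went in, what code ran, and what came out.
    import datetime
    import hashlib
    import json

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def run_step(name, code_version, transform, in_path, out_path,
                 log="provenance.jsonl"):
        transform(in_path, out_path)        # the actual data transformation
        record = {
            "step": name,
            "code_version": code_version,   # e.g. a source-control revision id
            "input": {"path": in_path, "sha256": sha256(in_path)},
            "output": {"path": out_path, "sha256": sha256(out_path)},
            "ran_at": datetime.datetime.utcnow().isoformat() + "Z",
        }
        with open(log, "a") as f:
            f.write(json.dumps(record) + "\n")

Multiply that by every step in a pipeline, add versioning of the steps themselves and a way to re-execute them from the recorded descriptions, and you begin to see the shape of what a system like Trident has to manage.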

One final comment of Roger’s particularly struck me:

The hope is that here in External Research, because we’re building these tools not just in the context of one science project, but many, you can have community tools that bridge communities. We’re talking to people in the earth sciences doing atmospheric studies, and their workflows and analyses are so similar to what the oceanographers are doing. But right now, since those two communities aren’t talking or sharing tools, it’s very difficult for one community to interact with the other.

Now more than ever there is a pressing need to make interdisciplinary science as frictionless as it can possibly be. I hope that what Roger and his team are doing will supply some of the necessary lubricant.

Specifying exceptions to recurring calendar events

Thanks to my calendar syndication project, I’ve gotten intimately familiar with how various calendar programs — including Outlook, Google Calendar, and Apple iCal — handle the entry of recurring events. They all make the task reasonably straightforward, but there’s one vexing problem. There isn’t a way to specify exceptions. My local YMCA, for example, is closed for maintenance during the last week of August. You could enter a “YMCA closed” event for that week, and hope that it gets rendered so that people will understand it to override all the recurring events shown for that week. But that’s not a great workaround.

Really, you’d like to be able to specify exceptions as part of the recurrence rule. To do that in a standard way, that capability would have to be part of the iCalendar standard. And sure enough, it is:

Property Name: EXRULE

Purpose: This property defines a rule or repeating pattern for an exception to a recurrence set.

Property Name: EXDATE

Purpose: This property defines the list of date/time exceptions for a recurring calendar component.

But none of the calendar programs I’m familiar with seem to support these features as part of event data entry. Are there others that do? Even if there are, I couldn’t depend on the feature being ubiquitously available to folks who contribute to the calendar network I’m trying to assemble.

The service operates as an iCalendar intermediary, though, so it might be able to inject some exceptions — at least for global ones like “YMCA closed last week of August” (an EXRULE). It’d be harder for event-specific exceptions like “Pool closed for maintenance July 22” (really an EXDATE), which would affect a subset of events, or “Kickboxing class won’t be held Sept 14 or 18”, which would affect a single event.
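
For the curious, here’s roughly what the combination of a recurrence rule and its exceptions looks like when you expand it, sketched in Python with the dateutil library. The schedule and dates are mine and purely illustrative:

    # Illustrative only: a recurring class, minus a closed week (in the spirit
    # of EXRULE) and a single skipped date (in the spirit of EXDATE), expanded
    # with python-dateutil's rruleset.
    from datetime import datetime
    from dateutil.rrule import DAILY, FR, MO, WE, WEEKLY, rrule, rruleset

    events = rruleset()

    # The recurring event: a class meeting Mon/Wed/Fri at 6pm through September.
    events.rrule(rrule(WEEKLY, byweekday=(MO, WE, FR),
                       dtstart=datetime(2008, 8, 1, 18, 0),
                       until=datetime(2008, 9, 30)))

    # A global exception: closed for the last week of August.
    events.exrule(rrule(DAILY,
                        dtstart=datetime(2008, 8, 25, 18, 0),
                        until=datetime(2008, 8, 31, 18, 0)))

    # A single-date exception: no class on Wednesday, September 17.
    events.exdate(datetime(2008, 9, 17, 18, 0))

    for occurrence in events:
        print(occurrence)

Running it prints the Mon/Wed/Fri occurrences, minus the closed week and the skipped Wednesday, which is exactly the calculation an intermediary would have to perform, or delegate to the consuming calendar by emitting the corresponding EXRULE and EXDATE properties.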

One of the questions this project has led me to ponder is: Why, after all these years, are calendar programs not used as extensively as it seems they should be? Maybe this is part of the answer. Exceptions to rules are part of the fabric of real life. If the software doesn’t enable people to specify those exceptions, that’s a problem.

Update: Thanks to commenters for pointing out that of course calendar programs enable people to specify exceptions. They just don’t do it the way I expected, i.e. as a continuation of the dialog used to specify the recurrence rule. Instead they do it by enabling you to edit or delete a single event in the series, after the series has been created.

Now I’m curious as to whether my expectation is a geeky aberration that hasn’t affected most people, who have been happily creating exceptions all along. Or whether it’s broadly undiscovered by civilians too.

The continuum of access styles in the emerging Microsoft cloud

SQL Server Data Services (SSDS) is a cloud-based data service that’s currently comparable to a combination of Amazon’s SimpleDB (for key/value storage) and S3 (for blobs). But as Soumitra Sengupta explains here, SSDS is indeed based on SQL Server, and it aims to progressively open a wider channel to the capabilities of SQL Server and the broader Microsoft data platform. When SSDS was introduced back in March, Information Week said this:

Though Microsoft has often been criticized for making its products work only with other Microsoft products, SQL Server Data Services doesn’t require SQL Server or .Net applications. “I can walk up to it with standard types of tools,” [project leader Dave] Campbell said. The service supports Rest and SOAP interfaces and will support the AtomPub protocol.

That’s a true statement, as I recently verified while exploring the service. From a web developer’s perspective, one of the easiest ways to tirekick a RESTful system is to use cURL, the “Swiss-army knife for HTTP,” to retrieve and also post data. Within minutes of cracking open the SSDS documentation I was doing just that: using curl, on the command line, to create and query SSDS containers and entities.
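
For anyone who wants to try the same kind of exploration, here’s the general shape of it in Python. The endpoint, payload, and query string below are placeholders of my own invention, not the actual SSDS wire format, so treat this strictly as a sketch and consult the SSDS documentation for the real details:

    # Illustrative only: poking at a RESTful data service with plain HTTP.
    # The URL, XML payload, and query syntax are made-up placeholders, not
    # the real SSDS formats.
    import requests

    container = "https://example.cloudservice.net/v1/mycontainer"  # hypothetical

    # Create an entity by POSTing a document into the container...
    resp = requests.post(container,
                         data="<entity><name>test</name></entity>",
                         headers={"Content-Type": "application/xml"})
    print(resp.status_code)

    # ...then query the container with a GET.
    resp = requests.get(container, params={"q": "name eq 'test'"})
    print(resp.text)

The curl equivalents are one-liners: -X POST with a -d payload for the create, and a plain GET for the query.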

Of course there are other ways. The SSDS SDK beta released last week, for example, includes several handy tools. One is the browser-like SSDS Explorer, which enables you to navigate around in SSDS data space, create and delete containers and entities, run queries, and view the underlying HTTP requests and responses. Another is a command-line tool that you can use to automate those interactions.

But the cURL experience is worth mentioning because it underscores how the emerging Microsoft cloud is crossing a cultural chasm. In a blog entry about Astoria (ADO.NET Data Services) — another RESTful offering — RedMonk’s Michael Coté wrote:

You’re either a 100% Microsoft coder or a 0% Microsoft coder. Sure, that’s an exaggeration, but the more nuanced consequences are that something intriguing like Astoria will play best with Microsoft coders, unlike Amazon’s web services which will play well with any coder.

For that reason, Michael was more intrigued by Astoria as a hosted web-facing service than by Astoria as an on-premises service:

A hosted option has the potential to remove this mental barrier to usage. If you’re just coding to a URL, that’s not quite so bad as coding to a .Net library and all the Microsoft baggage and tool-chain needed to support that.

I’d put it a bit differently. With all RESTful services, there is properly a continuum of access styles. The most primitive is something like curl, which is useful for certain kinds of basic exploration. But most real work is done with the help of libraries that abstract the RESTful interface. Those libraries can be available for Java, for C#, for dynamic languages like Perl, Python, and Ruby, and perhaps even for those same dynamic languages implemented for Java or .NET. These flavors in turn determine a range of tools that can be brought to bear on development and debugging.
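
To illustrate the point, here’s the same made-up example from above pushed one rung up the continuum: the raw HTTP calls disappear behind a thin wrapper, and the caller starts thinking in the service’s vocabulary rather than in verbs and URLs. Again, none of these names come from any shipping library:

    # A thin, hypothetical wrapper over the raw HTTP calls shown earlier.
    # Real client libraries will differ in detail; the point is the shift
    # from HTTP verbs to domain vocabulary.
    import requests

    class Container:
        def __init__(self, url):
            self.url = url

        def create(self, entity_xml):
            """POST a new entity into the container."""
            return requests.post(self.url, data=entity_xml,
                                 headers={"Content-Type": "application/xml"})

        def query(self, q):
            """GET the entities that match a query expression."""
            return requests.get(self.url, params={"q": q})

    c = Container("https://example.cloudservice.net/v1/mycontainer")  # hypothetical
    c.create("<entity><name>test</name></entity>")
    print(c.query("name eq 'test'").text)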

But I agree with Michael’s point: Microsoft platforms have not historically encompassed the continuum of access styles. Happily, the emerging cloud does. And while the novelty of “just coding to a URL” on a Microsoft platform will undoubtedly attract some tirekickers who otherwise wouldn’t show up, the real draw will be the ability to exercise choice along the whole continuum.

That choice, by the way, is not only relevant to developers accessing hosted, web-facing services. It will matter as much — and with the SOAP option, even more — to developers accessing on-premises services behind the corporate firewall and across corporate boundaries. Facing outward or inward, you’re most productive when you can choose the right access style for the job at hand. Between the realm of the 100% Microsoft coder and the 0% Microsoft coder, a wide and fruitful middle ground is opening up.

Update: Some nice examples of cURL SSDS idioms here from Jeff Currier. (Thanks, Soumitra!)