A conversation with Greg Wilson about doing HPC right

My guest on Innovators this week is Greg Wilson. We share common interests in collaboration and Python, but neither of those topics was the focus of this conversation. Instead, we discussed Greg’s unique and somewhat curmudgeonly take on high-performance computing. In his view, the HPC industry has focused on achieving bigger and faster computation at the expense of human productivity, verifiable correctness, and reproducibility.

I claim no expertise in that field, but Greg is an expert, so I wondered what he’d think about the approach discussed in one of my recent Perspectives shows, Cluster computing for the classroom. On that show, Kyril Faenov — Microsoft’s general manager for Windows HPC — describes a system that enables professors to define computational models that students can check out, tweak, and then run against large data on a compute cluster.

From a human productivity standpoint Greg likes that approach. But he’d prefer to see more attention paid to verifying the correctness of the models, and to ensuring that code and the data are managed in ways that make experiments reliably reproducible.


Disclosure: While working at Los Alamos National Laboratory back in 2000, Greg commissioned me to write a report on Internet Groupware for Scientific Collaboration.

Free online calendar publishing, part 2: Google Calendar

This post is part two of a series in which I’ll summarize what I know about publishing calendars openly on the web, for free, using popular calendar applications including Outlook, Google Calendar, and Apple iCal.

Google Calendar

You’ll need a Google account. If you use Gmail you already have one. Start the calendar program by clicking the Calendar link at the top of the Gmail page.

To publish your calendar in ICS (aka ICAL) format, open the drop-down menu for your calendar’s name under the My Calendars heading, and select Calendar Settings.

The first tab on the ensuing page is called Calendar Details. If you scroll to the bottom you’ll see two sections containing sets of hyperlinked icons. The sections are labeled Calendar Address and Private Address.

You don’t actually have to make your calendar public in order to share both its ICS (ICAL) and HTML formats. You could use the second set of private links to publish (and otherwise communicate) those formats without exposing the contents of your calendar to the Google search engine. But if the goal is to advertise your calendar as widely as possible, you’ll want to do that. So, visit the second tab on this page, labeled Share this Calendar, and check Make This Calendar Public:

Now your ICS feed is active at an URL that looks like this:

http://www.google.com/calendar/ical/yourname%40gmail.com/
public/basic.ics

To capture your version of this link, right-click the ICAL icon in the Calendar Address section, and use your browser’s link-capture method: Copy Shortcut (IE), Copy Link Location (Firefox), Copy Link (Safari). You can paste the link into a web page that you publish, or into a web form or an email that transmits it to another site to which you want to syndicate your calendar.
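Once the feed is public, any program can fetch and parse it. Here’s a minimal Python sketch of pulling event summaries out of ICS text; real iCalendar files fold long lines and carry many more properties, so treat this as an illustration only:

```python
def parse_ics_events(ics_text):
    """Extract (summary, start) pairs from iCalendar text.

    Deliberately minimal: it ignores line folding and most
    properties, keeping just enough to show the feed's shape.
    """
    events, current = [], None
    for line in ics_text.splitlines():
        line = line.strip()
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT" and current is not None:
            events.append((current.get("SUMMARY"), current.get("DTSTART")))
            current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.split(";")[0]] = value  # drop params like DTSTART;TZID=...
    return events

# An invented sample, standing in for the feed at the URL above.
sample = """BEGIN:VCALENDAR
BEGIN:VEVENT
SUMMARY:Town meeting
DTSTART:20080512T190000
END:VEVENT
END:VCALENDAR"""

print(parse_ics_events(sample))  # [('Town meeting', '20080512T190000')]
```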

Similarly, the web view of your calendar is active at an URL that looks like this:

http://www.google.com/calendar/embed?
src=yourname%40gmail.com&ctz=America/New_York

To capture your version of this link, right-click the HTML icon in the Calendar Address section and do as above. This link leads to a Google-hosted page for viewing the calendar.

If your web hosting circumstances allow you to use an HTML feature called IFRAME, you can instead embed the calendar in one of your own pages. The HTML code to do that is provided in the Embed This Calendar section.

Free online calendar publishing, part 1: Outlook

This post is part one of a series in which I’ll summarize what I know about publishing calendars openly on the web, for free, using popular calendar applications including Outlook, Google Calendar, and Apple iCal.

Outlook 2007

With Outlook 2007, you can publish for free to calendars.office.microsoft.com. You’ll need a Live ID account. If you don’t already have one, a Live ID is useful for many other services too. To get one, start at login.live.com and click the “Sign up for an account” link.

To start publishing, right-click the name of your Outlook calendar as it appears under My Calendars in Outlook’s navigation pane, select Publish to Internet, and select Publish to Office Online as shown here:

You’ll land on this screen, where — for an open public calendar — you can just click OK and take the defaults.

Now you’ll be prompted for your Live ID credentials.

Enter the email address and password of your Live ID account. And check “Remember my password” so that Outlook can send calendar updates to the server automatically.

Here’s the confirmation:

Even though you likely won’t want to send individual invitations, click Yes anyway. That’s the easiest way to discover what the web address of your published calendar will be. Here’s the email message:

You don’t need to send it to anyone; you just need to capture the calendar’s web address, which, in this case, is:

webcals://calendars.office.microsoft.com/pubcalstorage/
j447ytlz27542/test_Calendar.ics

If you publish that link on a web page (more realistically, with a label like Subscribe to calendar), visitors who click the link will be invited to launch one or another calendar program (such as Outlook, or Apple iCal) to view the calendar and subscribe to updates. That same address can be used by online services like http://elmcity.info/events, which combine calendars from multiple sources.

The .ics extension on test_Calendar.ics denotes the iCalendar format (RFC 2445). The ICS file is useful for exchanging calendar information among calendar programs that run on personal computers, and among calendar services that live online. But it’s not something people can view directly on the web. For that, you’ll want to use a variant of the address that produces a web page people can see and interact with. Here’s the variant:

http://calendars.office.microsoft.com/en-us/pubcal/viewer.aspx?path=
/pubcalstorage/j447ytlz27542/test_Calendar.ics

To form your version of this link, copy the initial part of the above link (everything up through path=), and then replace the trailing /pubcalstorage/j447ytlz27542/test_Calendar.ics with the corresponding part of the address from the invitation email shown above.
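If you’d rather script that substitution, here’s a Python sketch; the viewer-URL pattern is inferred from the example above, not from any documented API:

```python
from urllib.parse import urlsplit

def viewer_url(webcal_url):
    """Turn an Office Online webcals:// subscription link into the
    corresponding human-viewable page, following the pattern shown
    above (a guess from one example, not a documented interface)."""
    # Swap the scheme so urlsplit can parse the address normally.
    parts = urlsplit(webcal_url.replace("webcals://", "https://", 1))
    return ("http://calendars.office.microsoft.com/en-us/pubcal/viewer.aspx"
            "?path=" + parts.path)

link = ("webcals://calendars.office.microsoft.com/pubcalstorage/"
        "j447ytlz27542/test_Calendar.ics")
print(viewer_url(link))
```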

If you then publish that link on your website, it will lead visitors to a page like this:

Visitors to that page can view the calendar in several ways. And they can subscribe to the calendar by clicking the Subscribe link.

Earlier versions of Outlook

I’m still researching the options. Comments welcome.

Caroline Arms on digital formats for long-term preservation

My guest for this week’s Perspectives show is Caroline Arms, a digital preservation pioneer at the Library of Congress. She’s a leading student and promoter of digital formats for long-term preservation.

It was fascinating to hear her take on the interplay between the reality of market forces and the interests of cultural preservation. From the Library’s perspective, an important format is one that is both disclosed (i.e., openly specified) and widely adopted. The Library has few illusions about its ability to influence adoption, but it does participate in standardization efforts such as PDF/A and Office Open XML.

Caroline joined the Library of Congress in 1995 to work on the American Memory project, and she well understands that our memories are not only represented by commercially-published content, but also by personally-created content such as photographs and diaries. When that content is paper-based, it tends to survive benign neglect. But digital content doesn’t survive benign neglect, and the Library is thinking hard about the challenge that presents for the photographs and diaries we’re creating from now on.

Yesterday’s proposal for an association of URL-shortening services was motivated by that same challenge. It’s overwhelming to think about tackling the URL persistence problem in a general way, although there’s good progress being made in particular domains, notably scholarly publishing. But it strikes me that URL-shortening is an area where we could bootstrap a scheme that would provide at least some assurance of continuity, in a way that would be evident to a lot of mainstream users. It wouldn’t solve a major problem, but that’s actually the point. We need to pluck some low-hanging fruit, and start to raise expectations about the persistence of the digital resources we’re all creating.

Could there be an association of URL-shortening services?

The creator of a new URL-shortening service, urlborg, recently wrote to me to announce some new features. There are, at this point, quite a few of these URL-shortening services. I’m sure each has differentiating features, but before I explore the differences I’d like to see a new and important kind of commonality.

Each of these services invites you to invest in creating a set of short URLs that point to your own longer URLs. None of them provides any guarantees about the future availability of those short URLs. I’d love to see these services form an association that does make such guarantees.

There can never be a simple solution to the problem of linkrot. We don’t own domain names, we only rent them. As content management systems evolve, so often do the URLs they project onto the web. Even if an association of URL-shortening services guaranteed the continuity of short URLs, the long URLs behind them would remain as fragile as they are today.

Still, it would be an inspiring and forward-looking experiment to try. What if TinyURL, snurl, urlborg, and the others were members of an association that would inherit the URL mappings of any member that ceased to honor them? Given such a guarantee, I’d be much more willing to invest in the creation of URL mappings with any of the members, and to explore the features that differentiate them.
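To make the idea concrete, here’s a toy Python model of such an association (service names and URLs are invented for illustration): each member keeps its own short-to-long mappings, and the association inherits the table of any member that shuts down, so its short URLs keep resolving.

```python
# Each member service maintains its own short->long mappings.
members = {
    "tinyurl": {"abc": "http://example.com/long/1"},
    "urlborg": {"xyz": "http://example.com/long/2"},
}
# Mappings the association has absorbed from defunct members.
inherited = {}

def retire(service):
    """A member ceases to honor its URLs; the association absorbs them."""
    inherited.update(members.pop(service, {}))

def resolve(short_code):
    """Look up a short code across live members, then the inherited pool."""
    for table in list(members.values()) + [inherited]:
        if short_code in table:
            return table[short_code]
    return None

retire("urlborg")
print(resolve("xyz"))  # still resolves, via the association
```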

Semi-structured database records for social tagging

In my writeup on MIT’s Project Simile, and again in my talk at the CUSEC conference, I lauded an approach to collective information management that respects our actual linguistic nature. People don’t normally create vocabularies by committee. Rather, we absorb, imitate, innovate, and negotiate the vocabularies we use. Simile embraces that reality. It encourages people to name resources in ways that make sense to them, within the context of their tribes. Then it provides ways to map out equivalences among the terms used by different tribes.

This same idea of pluralistic naming and equivalence mapping came up in last week’s Perspectives interview with Quentin Clark: Where is WinFS now? The connection was implicit but it’s worth making explicit. Here’s what Quentin said:

QC: Going through the litany of technologies that have come from WinFS, one of them is the notion of what I refer to as semi-structured records. The schema is not necessarily all that well defined at the outset of the application. How does the database handle that? We had built WinFS around a feature called UDTs [user-defined types], which is a column type — a CLR type system type.

We finished that up, and we built a whole spatial datatype on it in SQL Server 2008; it’s all good stuff.

But when we stepped back and looked at the semi-structured data problem in a larger context, beyond the WinFS requirements, we saw the need to extend the top-level SQL type system in that way. Not just UDTs, but to have arbitrary extensibility.

So we did this feature in SQL Server 2008 that we internally refer to as sparse columns. It’s a combination of various things. First, a large number of columns. Right now there’s a 1024 limit on the number of columns in a single SQL table. We’re way widening that out.

That comes of course with the ability to store data that’s very sparsely populated across a large number of columns. In SQL Server 2005 we actually allocate space for every column in every row, whether it’s filled or not.

JU: This is what the semantic web folks are interested in, right? Having attributes scattered through a sparse matrix?

QC: That’s right. And that leads to another thing which we call column groups, which allow you to clump a few of them together and say, that’s a thing, I’m going to put a moniker on that and treat it as an equivalence class in some dimension.

Given my enduring fascination with del.icio.us as a prime example of social tagging services that enable real people to evolve metadata vocabularies in a natural way, that really got my spidey sense tingling.
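As a rough analogy (this is not SQL Server’s actual storage engine, just an illustration in Python), a sparse record stores only the columns it actually uses, and a column group is a named clump of columns treated as one thing:

```python
# Sparse records: each row carries only the attributes it has,
# rather than allocating space for every possible column.
rows = [
    {"title": "sunset.jpg", "camera": "Canon G9", "iso": 80},
    {"title": "todo.txt"},  # no photo attributes at all
]

# A "column group": a named subset of columns. The group name and
# columns here are invented for illustration.
photo_group = {"camera", "iso"}

def has_group(row, group):
    """Does this row populate any column in the group?"""
    return bool(group & row.keys())

print([row["title"] for row in rows if has_group(row, photo_group)])
```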

A conversation with Gabriel Dance and Shan Carter about interactive graphics at the New York Times

Last November the New York Times ran an interactive visualization of one of the Republican debates that absolutely wowed me. On this week’s Interviews with Innovators show I spoke with two of its creators, Gabriel Dance and Shan Carter, about that project, and about some of their other work including the stunning Faces of the Dead in Iraq. It’s a great overview of how and why the NYTimes has been raising the level of its game — and therefore of everyone’s game — in the realm of interactive data display.

There’s an odd little Web 2.0 backstory about how we arranged this interview. When I cited the credits for the debate visualizer in my blog post, I had a hunch that my use of those names would appear on the creators’ radar screens. And sure enough, I heard back from Gabriel Dance. When I didn’t find any contact info for him on his website, I went hunting around and eventually found him on Facebook.

We then began an on-again, off-again dialogue that lasted for a couple of months, until we eventually settled on a time for the interview. At one point I tried to steer the discussion away from Facebook and into regular email, but for some reason that didn’t happen, so we wound up doing all the communication in Facebook.

When we finally got together for the interview, Gabriel mentioned that he’d never been involved in such a long Facebook email thread. Me neither. Somehow we got stuck in a loop where each of us thought the other preferred to communicate only in Facebook. I was glad to know that this wasn’t some kind of Gen-Y thing, and that we both thought it was a weird glitch.

The other delightful thing about this interview is the audio quality. Gabriel and Shan called me from the Times’ tape synch facility, so their half of the call was professionally recorded, then I merged their track with my locally recorded track. Nice!

Where is WinFS now? Quentin Clark explains.

In 2004 I interviewed Quentin Clark, who led the WinFS effort, for an InfoWorld cover story on Longhorn. We had dinner recently, and Quentin made a surprising remark. He said that although WinFS never shipped, many of the underlying technologies already have. I wanted to hear more.

So, on this week’s Perspectives show, Quentin expounds at length on the question: Where is WinFS now? Topics include schemas, the entity data model, filestream and hierarchical namespace support in SQL Server, and synchronization.

In general I’m trying to aim Perspectives at a wider audience. But although you have to be fairly technical to enjoy reading or listening to this interview, I couldn’t resist. It’s a fascinating story, and not one the technology press is ever likely to tell. From that perspective, when the WinFS project was shut down, the whole thing evaporated. But as we know, technologies often wind up being used in ways not originally intended. WinFS is a prime example.

Computational thinkers make good body hackers

Sean McGrath’s report on coping with RSI reminded me of a couple of things. First, I need to find out whether the chair-mounted split keyboard shown here is still available. It’s been hugely helpful to me over the years, but I’m not sure it can be replaced at this point, and that would suck.

(Update: Uh oh. Discontinued 3 years ago.)

Second, I’ve been meaning to note a connection between computational thinking and health. Sean writes:

RSI is about the most complex problem I have ever tried to debug.

His reference to debugging might seem like a geeky affectation, but I don’t think that it is. When you’re searching for the causes of health problems, including mechanical ones like RSI, it can be fiendishly hard to, as Sean says, “establish repeatable causal connections between events.” Our bodies are complex, layered systems. Problems arise at different levels; the levels interact; any assumption may need to be questioned. But ultimately our bodies are systems, and computational thinkers can be pretty good at hacking and debugging them.

You see it when geeks deal with RSI. And you also see it when they deal with obesity. I know seven or eight technical types who have slimmed dramatically in recent years. We’re talking major weight losses of 75 pounds, or 100, or even more. In each case they describe the process in the language of computational thinking. “I hacked my body.” “I debugged my metabolism.”

Sean is right to offer this disclaimer:

I am a computer geek. Not a medical practitioner. If you have symptoms, go see a doctor, ok?

And yet, in my experience with RSI and with other kinds of mechanically-induced soft tissue injuries, doctors can’t help much if at all. What’s required is realtime analysis and debugging of a complex system, on a continuous and perpetual basis. The person best equipped to do that debugging is you, the owner, operator, and inhabitant of the system.

A conversation with Lucas Gonze about discovering, sharing, and experiencing music

It was a great pleasure to speak with Lucas Gonze for this week’s Innovators interview. Back in 2004, in Blogs + playlists = collaborative listening, I first wrote about webjay.org, the playlist-sharing service that Lucas founded and later sold to Yahoo. Later that year, I made an audio documentary about the people, the services, and the ideas that I saw coming together to create a new kind of cultural curation. The factors in play included abundant talent, Creative Commons licensing, and linkable hypermedia.

That vision hasn’t materialized yet. In our conversation, Lucas and I discuss why it hasn’t — and how it might still.

In the realm of music, I think that Lucas’ project to reanimate 19th-century songs provides one of the missing pieces of the puzzle. Copyright restrictions are what sent him to the archives to learn, perform, record, and distribute these old tunes. But as he’s explored them, he’s realized that parlour music of that era was social and participatory in ways that are far less common today.

Lucas once wrote about how he was happy with a recording he’d made because it “only had a few mistakes.” The other day he wrote:

Imagine that we lived in a world where all photography was the kind you see in magazines. In this world all photos are taken by professionals and all the people who got their pictures taken are models at the peak of their career. If you had your picture taken normally, you’d think you were hideously ugly. That is the musical world we grew up in, and it’s bogus. Things don’t have to be that way.

In an era of cognitive surplus, as the pendulum swings back from consumption to production of culture, that’s a good thing to remember.

That word, syndication, I do not think it means what you think it means

Something about the title of this week’s Perspectives interview, OpenSearch federation with Search Server 2008, has been nagging me ever since I wrote it. In the interview, Richard Riley and Keller Smith describe how the new Microsoft search server can extend its reach by sending queries to other search services that can return results as OpenSearch-compliant RSS or Atom feeds.

We call this activity federation, but the enabling technology is syndication. So is the group of participating servers a federation, or is it a syndicate?

Some definitions of federation, (1) from dictionary.com and (2) from Merriam-Webster:

1 a federated body formed by a number of nations, states, societies, unions, etc., each retaining control of its own internal affairs.

2 an encompassing political or societal entity formed by uniting smaller or more localized entities: as a: a federal government b: a union of organizations

That seems too formal, too heavyweight, for an OpenSearch-mediated search scenario. When you modify a search service to return results in the OpenSearch format, you’re not necessarily joining any kind of union. You’re just making it easier for other entities to latch onto your search results.

OpenSearch was announced on March 16, 2005, at the Web 2.0 conference. That same day I adapted my version of the InfoWorld search service to use it. There was nothing special about what I did, which is why it only took a few minutes. I just added a variant of the query URL that returned results as RSS, with a few minor extensions to comply with OpenSearch.
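For the curious, here’s roughly what such a variant does, sketched in Python with the standard library (titles and links invented): wrap ordinary search hits in RSS 2.0, plus the OpenSearch 1.1 extension elements that report result counts.

```python
import xml.etree.ElementTree as ET

OS_NS = "http://a9.com/-/spec/opensearch/1.1/"
ET.register_namespace("opensearch", OS_NS)

def opensearch_rss(query, hits, total):
    """Wrap search hits in an RSS 2.0 feed with OpenSearch 1.1
    extension elements. `hits` is a list of (title, link) pairs,
    a shape invented for this sketch."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Search results for {query!r}"
    # The "few minor extensions" that make a feed OpenSearch-compliant:
    ET.SubElement(channel, f"{{{OS_NS}}}totalResults").text = str(total)
    ET.SubElement(channel, f"{{{OS_NS}}}startIndex").text = "1"
    ET.SubElement(channel, f"{{{OS_NS}}}itemsPerPage").text = str(len(hits))
    for title, link in hits:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")

feed = opensearch_rss("Jean Paoli", [("XML at 10", "http://example.com/1")], 42)
print(feed)
```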

Then I registered my service with Amazon’s A9, searched A9 for “Jean Paoli”, and saw the combined results shown here.

This arguably was a federation, because you had to join the club in order to have results from your service show up in A9. But nothing about OpenSearch required things to work that way. Other services could consume my search feeds without requiring me to register with them or ask their permission.

What’s more, any RSS reader could consume those feeds. Although I’d done the OpenSearch hack to showcase integration with A9, it turned out that I’d solved another problem without even intending to. It was now also possible for individuals to subscribe to InfoWorld queries.

OpenSearch can involve federation, but more fundamentally it’s about syndication. So, do the participating entities form a syndicate?

1 a: a group of persons or concerns who combine to carry out a particular transaction or project b: cartel c: a loose association of racketeers in control of organized crime

2 a group of individuals or organizations combined or making a joint effort to undertake some specific duty or carry out specific transactions or negotiations

That doesn’t seem right either. We can get closer by focusing on the definitions that emphasize simultaneous publication:

1 a business concern that sells materials for publication in a number of newspapers or periodicals simultaneously

2 to publish simultaneously, or supply for simultaneous publication, in a number of newspapers or other periodicals in different places: Her column is syndicated in 120 papers

But these definitions still involve more business coordination than OpenSearch, or feed syndication in general, require. If I use OpenSearch to publish a search service within the enterprise, I don’t need to make a formal agreement with the Search Server administrator in order to enable that server to include my search results. I just need to publish my results as an RSS feed, and tell that person I’ve done so. That same RSS feed is available to users who may wish to subscribe to searches performed directly on my service.

It’s the same on the open web. When you adopt a syndication-oriented architecture, small pieces can be loosely joined, or they can be more tightly coupled. But the underlying publish/subscribe mechanism doesn’t determine that choice.

Chewing on these definitions is more than a pedantic exercise for me. In my local community, I’m trying to show how a particular use of publish/subscribe technology — namely, calendar syndication — can solve an important problem for people, organizations, and the community as a whole.

Federation would clearly be the wrong word for the network of calendars that I’m trying to bring into existence. I’ve been using the word syndication instead. But now I suspect that’s the wrong word too. I want to convey that we can create small pieces, that they can be loosely joined, and that important network effects will emerge. I don’t yet know what word or phrase will make that cluster of concepts light up in people’s heads.

Calendar software is natural for reading, but not for writing

In response to a popular recent item — “We posted weekly.pdf to the website. Isn’t that good enough?” — Sarah Allen echoes my favorite Sergey Brin quote. Sergey said: “I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.”

Sarah, citing weblog software as an example of software that enables people to write naturally, goes on to say:

Likewise, it is natural to record calendar information overlaid on a timeline with day, week, and month views that mimic traditional paper visualizations of time. This enables the software to generate structured data without people needing to think about it.

I mostly agree with her about blog software. And I would have been inclined to agree with her about calendar software too, until I started looking seriously into how people do — and often don’t — use calendar software.

Let’s look at a fragment of a softball schedule which, significantly, has been written as an Excel file:

Fri. Apr. 25    6:15    Whitney Brothers    Greenwald Realty
                7:45    Servpro             Athen’s Pizza
Sat. Apr. 26    9:00    WR Painting         Peerless Insurance

Notice what’s missing? There’s no AM/PM, because everybody is expected to know that 6:15AM would be too early for a Friday game while 9:00PM would be too late for a Saturday game.

Yes, it’s natural to view calendar information in ways that mimic traditional presentations. But it’s unnatural to write it using calendar software that constantly nags you to specify nitpicky details like AM and PM. People understand what’s a reasonable time for a Friday or Saturday game. Why can’t software figure that out?
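Here’s one way software could figure it out, sketched in Python (the 8 AM cutoff is an illustrative assumption, not something taken from the schedule):

```python
def infer_meridiem(hour, earliest_start=8):
    """Guess AM or PM for an hour written without a meridiem.

    Assumes no game starts before `earliest_start` (8 AM here), so
    any written hour below that threshold must be an evening time.
    """
    if hour < earliest_start:
        return "PM"  # e.g. a 6:15 game: 6:15 AM would be too early
    return "AM" if hour < 12 else "PM"

print(infer_meridiem(6))  # PM: 6:15 AM is too early for a Friday game
print(infer_meridiem(9))  # AM: 9:00 PM would be too late for a Saturday game
```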

I guess that’s why another recent item on parsing human-written date and time information struck a chord with readers. Until we create (and widely deploy) naturalistic interfaces, people are going to avoid the Procrustean bed that is conventional calendar data entry.

A conversation with Janis Dickinson about citizen science

On this week’s Interviews with Innovators I spoke with Janis Dickinson, director of citizen science at the Cornell Ornithology Lab. We talked about several of the lab’s projects that involve collection and analysis of volunteer observations about birds and bird habitats.

Courtesy of the eBird project, for example, here is a view of first sightings of common bird species in New Hampshire. At first glance it might be tempting to see the preponderance of dates in the current decade as an effect of global warming. But to support that interpretation, you’d have to answer a bunch of questions about the evolution of record-keeping over the period, and the distribution, reliability, and bias of volunteer observers.

Extracting signal from noise is, of course, one of the classic bread-and-butter activities of information science. What’s fascinating here is the Web 2.0 angle. Birdwatchers are famously passionate data collectors who develop reputations among their peers. When they contribute their data to eBird — and thence to the Avian Knowledge Network — those reputations can begin to be measured, and used to tune the analysis of a large body of contributed data.

For example, the all-time latest reported sighting of the Nelson’s Sharp-tailed Sparrow in New Hampshire was on Nov 24 2007, by Michael Harvey. Is that unusually late? And if so, is it credible? To answer these questions, Cornell’s data crunchers can compare what was and wasn’t reported in the region around that time, by observers whose reputations are one kind of signal that emerges from noisy data.

Stonewall Farm, Darby Brook Farm, and the collaborative curation of data

Lately I’m obsessed with figuring out how to harness the cognitive surplus and put it to work doing better social information management.

The other night I attended a kick-off meeting for a group interested in advancing the cause of local food production in our region. Inevitably the discussion turned to questions that require data to answer. Who are the local producers? Where are they? What do they produce?

In the ensuing discussion, various sources of data emerged. There’s a USDA website, a state government website, a special-interest website, this or that blog. Two things were immediately clear to everyone. First, there would be no effective way to collate these existing sources. Second, most of the needed data wouldn’t be there anyway.

I’d like to be able to recommend the sort of loosely-coupled collaborative list-making method that works so effectively for me. But here’s why I can’t. The method presumes that all the things you’d want to collaboratively curate are already represented by URLs.

In the real world, some are and some aren’t. Consider two examples from this list:

Name: Darby Brook Farm
Day/Time:  8:00 AM – 5:00 PM
Season:  June 1 – October 1
Address:  347 Hill Road
What you’ll find: Vegetables, raspberries, apples.
More Info: 603.835.6624

Name: Stonewall Farm
Day/Time:  Hours vary
Season:  June – October
Address:  242 Chesterfield Road
What you’ll find:  Garden fresh produce through the Community Supported Agriculture (CSA) program, call for options
More Info:  603.357.7278,   bsaunders@stonewallfarm.org,  www.stonewallfarm.org

Because Stonewall Farm has a web presence, we can do all kinds of useful things with its URL. We can tag various bits of metadata onto it (location, products), we can derive views that include that information, we can syndicate those views.

Because Darby Brook Farm doesn’t have an URL, we can’t do those things.

Of course Darby Brook Farm does have an implicit URL-addressable identity at Lighten Up NH. That identity is the record in Lighten Up NH’s database that’s currently being published into a web page by its ColdFusion server.

If that record were directly URL-addressable, the implicit identity would be explicit. Using the record’s URL as a temporary placeholder, we could bootstrap Darby Brook Farm into a collaborative list-making regime based on URLs, tags, and syndication.

Later, when Darby Brook Farm does establish a real web presence, we can unhook its cloud of annotations from the placeholder URL and attach it to the official one.
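That unhook-and-reattach step is simple to model. Here’s a Python sketch (the URLs and tags are invented for illustration) in which a cloud of annotations keyed by a placeholder URL gets rebound to the official one:

```python
# Annotations keyed by URL. The record URL below stands in for the
# placeholder identity in the Lighten Up NH database.
annotations = {
    "http://example.org/db-record?id=123": [
        ("location", "347 Hill Road"),
        ("products", "vegetables, raspberries, apples"),
    ],
}

def rebind(annotations, placeholder, official):
    """Move the cloud of tags from the placeholder URL to the real one."""
    annotations.setdefault(official, []).extend(annotations.pop(placeholder, []))
    return annotations

rebind(annotations,
       "http://example.org/db-record?id=123",
       "http://darbybrookfarm.example.com/")
print(sorted(annotations))
```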

This scenario highlights a subtle but powerful benefit of data-publishing technologies like Astoria. When you aggressively expose record-level URLs, you can enable the same methods that will work for Stonewall Farm to also work for Darby Brook Farm.

Negotiating shared responsibility for community information

This week’s Interviews with Innovators show is a conversation with Raymond Yee, author of the recently-published Pro Web 2.0 Mashups.

The book is chock full of good examples. Even if you’re an experienced developer of mashups that involve Flickr, del.icio.us, Eventful, and the various mapping services, you’ll learn helpful strategies for using these services individually and in combination.

What we wound up mostly talking about, though, is the vast space of information that’s not currently available to be mashed up. That might be because the information isn’t online at all, or because it isn’t online in a form that’s tractable.

As a kind of social experiment I’ve been tackling this problem in my local community, with particular emphasis on calendar information. In this week’s interview, Raymond talks about tackling the same kind of problem with emphasis on geographic information. Both cases can exemplify a pattern that I’m calling shared responsibility.

Consider, for example, the public library. It hosts a variety of events, some of which are its own (children’s story hour) and some of which aren’t (an AA meeting). Who’s responsible for putting these events onto the library’s public calendar?

Clearly the library should publish its own events. But it needn’t necessarily feel obliged to publish other organizations’ events. In the case of AA meetings, for example, the library is only one of about a dozen venues around town. Shouldn’t AA publish its events to those venues?

We have the tools and services now to enable this kind of small-pieces-loosely-joined approach. In this case, acting as a proxy for AA, I published its regular meetings to Eventful. One of those meetings happens at the public library. So now when you visit the combined calendar, events at the library show up from multiple sources. One is clearly identified with the library itself, others are identified with the various groups using the library.

Of course nothing prevents the library from choosing to authoritatively publish all of the events that it hosts. But it’s useful to show how that can be a choice, not an obligation. If we take a decentralized, small-pieces-loosely-joined approach, information management chores that look insurmountable can turn out not to be.

A conversation with Ray Ozzie about Live Mesh

Ray Ozzie joined me for this week’s Perspectives show. It’s available there as audio plus a text transcript, and you can also watch the video on Channel 9.

Ray opens the conversation by reflecting on his transition to Microsoft three years ago, and on the roles he and Craig Mundie will play as they jointly inherit Bill Gates’ responsibilities.

Next the conversation turns to a meme that Tim O’Reilly once evangelized: the Internet operating system. That phrase never resonated as powerfully as Web 2.0 did, but the ideas behind it are becoming realities. Ray applauds the work that Amazon and Google have done in this area. And he talks about how Microsoft’s legacy as a platform company, dedicated to helping developers succeed, will influence its approach.

In that context, Ray explores one piece of Microsoft’s emerging Internet operating system: the newly-announced Live Mesh. Sharing common DNA with earlier projects, notably Groove and before that Notes, Live Mesh is a data synchronizer born to the Web. The objects that it synchronizes are represented as RSS and Atom feeds, and are manipulated with a RESTful API that works symmetrically on local and cloud-based nodes.

Although the most visible Live Mesh application is a file-and-folder synchronizer, Ray notes that this is just one example of an application pattern that can apply equally to the synchronization of custom objects, like calendar events, across all the devices in a mesh. It also applies across the spectrum of application types, ranging from the browser to conventional rich clients to Web-based rich clients like Flash and Silverlight.

There’s another pattern for Live Mesh applications, one that’s less familiar. In this pattern, a website uses Live Mesh as a pipeline to communicate with Live Mesh users. If you’re running a travel site, or a bank, you can use that pipeline to transmit structured data to your users — for example, itineraries or transaction reports. It’s easy to create those XML feeds, you can leverage the Live Mesh infrastructure to deliver them securely and reliably at scale, they synchronize across all devices in each user’s Live Mesh, and they’re accessible to local applications using same RESTful feed APIs that were used to create them.

“We posted weekly.pdf to the website. Isn’t that good enough?”

It’s almost 10 years since I began producing and consuming data feeds, initially in RSS format. Although I regard the syndication of data feeds, in general, as a transformative technology, the concept still makes no sense to civilians and has little or no effect on their lives.

In order to understand why not, and as a way of figuring out how to motivate a practical understanding of syndication, I’m tackling a problem whose solution doesn’t involve RSS, or Atom, or microformats, or XML. The problem is calendar syndication, and part of the solution is iCalendar, a non-XML format that all widely-used calendar programs support well enough for my purposes.

It’s only part of the solution because the real problem is that most people, most of the time, for most of their calendar-related activities, don’t use calendar programs. They use spreadsheets and wordprocessors, and they produce unstructured web pages and PDF files.

There was a time when, behind their backs, I would mock them for doing so. No longer. As I meet with intelligent and well-educated professionals in my community, and talk with them about how to synchronize calendar information from a variety of sources, I realize that they simply have no intuition about the difference between a PDF file and an ICS file that contain the same calendar information. Both are computer files, right? Both can be posted to the web, right? Both can be searched, right? Problem solved.

There are really two aspects to this missing intuition. First, the concept that some kinds of computer files are more structured than other kinds. Second, the concept that the structured kind can flow easily around the Net without loss of fidelity, and can deliver use value in a variety of contexts, whereas the unstructured kind is inert.
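To make the missing intuition concrete: here’s the structured counterpart of a calendar item that might otherwise live in a PDF flyer. It’s a minimal hand-rolled iCalendar file — a sketch, since real calendar programs emit more properties than this:

```python
# A minimal iCalendar (ICS) file containing one event, built with
# nothing but string formatting. Unlike the same text in a PDF, this
# version can flow into any calendar program without loss of fidelity.
def make_ics_event(uid, summary, dtstart, dtend):
    """Return a minimal iCalendar file containing one event."""
    return "\r\n".join([
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//calendar//EN",
        "BEGIN:VEVENT",
        "UID:" + uid,
        "SUMMARY:" + summary,
        "DTSTART:" + dtstart,
        "DTEND:" + dtend,
        "END:VEVENT",
        "END:VCALENDAR",
    ])

ics = make_ics_event("1@example.org", "Story hour",
                     "20080405T100000", "20080405T110000")
```

Both this and a PDF are “computer files” that can be posted to the web. But only this one can be imported, merged, syndicated, and reliably machine-read.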

These are modes of computational thinking unknown to most people. If you’re a school administrator, librarian, city planner, social worker, or retail store owner, nobody expects you to understand and apply these principles.

And yet almost everybody needs to harmonize personal and organizational calendars. And many individuals and organizations need to flow their calendar data into other contexts to promote and coordinate their activities.

So here’s my approach. I’m scooping up all the calendar information I can find for my community, in whatever form I can find it, and flowing it into a coommon view. Then I’m syndicating that view elsewhere to show that there’s nothing special about my aggregation.

The idea is to establish a critical mass by brute force, and allow people to see how, over time, sources that are structured and can syndicate will remain in the game, and sources that aren’t will have to sit out on the sidelines.

It’s turning into a nice case study of how organizations and individuals can negotiate shared responsibility for calendar information that’s of common interest. But that’s a story for another day. First things first. I need to give people a reason to care about using a calendar program — any calendar program, could be Outlook or Apple iCal or Google Calendar, so long as it exports iCalendar — in preference to a spreadsheet or word processor. Although the geek tribe can scarcely imagine why, that first step is a doozy.

A conversation with Deepak Singh about science in the web 2.0 era

For this week’s Interviews with Innovators show I spoke with Deepak Singh. This interview extends what has become an ongoing series of discussions with folks who are applying the principles of web 2.0 to the practice of science. This was, of course, the original purpose of web 1.0.

Other Innovators shows on this topic include conversations with Joel Selanikio about epidemiological data collection, Barbara Aronson about giving poor countries free subscriptions to biomedical journals, and Timo Hannay about the impressive stream of online innovations that’s flowing from the Nature Publishing Group.

My new Perspectives series has also explored this theme of Net-enabled science. There, I’ve talked with Catharine van Ingen and Dennis Baldocchi about collaborative analysis of atmospheric CO2 data, and with Pablo Fernicola about using Word to produce scientific articles in the National Library of Medicine’s XML format.

Panoramic Westmoreland

For some reason I’ve never gotten around to doing stitched-together panoramic photos until recently. Today, with spring fever raging, I hopped on my bicycle, did one of my favorite circuits, and made this 360 view of Park Hill in Westmoreland:

It turned out to be an interesting study in perception. If you check the enlarged view, you’ll see a tiny, insignificant-looking church in the center of the spread, dwarfed by mailboxes in the foreground. In my memory of the scene, that church was the dominant feature. But what my eyes actually saw is what the camera saw: a tiny, insignificant-looking church.

Next time I’ll need to stand closer to it. And I’ll need to bear in mind that what we think we see is a heavily interpreted version of what hits the retinas.

Still, it was fun. I love that you can see the handlebars of my bicycle on the left, and the seat on the right.

I’m sure there lots of ways to do this, I’ve never really looked into it, but Windows Live Photo Gallery makes the whole thing a snap. From camera import, to photo stitching, to Flickr upload, was under 10 minutes. And most of that was CPU time.

Radio commentary on citizen use of public data

A while ago I recorded a commentary for New Hampshire public radio on the topic of public data. The themes will be familiar to readers of this blog: transparency, citizen use of government data. I wondered when it would air, and then last night, while doing the dishes, I heard myself on the kitchen radio.

The piece is available on the NHPR site here. Will it make sense to folks listening at their kitchen sinks, or driving in their cars? I hope so, because as powerful an idea as this is, it’ll go nowhere until it does make sense to those folks.

Syndication of rules versus syndication of data

To follow up on last week’s item about parsing the kinds of dates and times that people actually write, Google Calendar’s Quick Add feature looks like the clear winner. Here’s a test page with expressions like:

Third Saturday of Every Month, 10 – 11:30 am

Let’s try the Chronic module from Ruby:

irb(main):007:0> Chronic.parse('Third Saturday of Every Month, 10 - 11:30 am')
=> nil

No joy.

As David French pointed out, Google Calendar’s Quick Add gets this right. Or anyway, close enough. There seems to be a small bug that pokes an instance of the event into today’s slot, whether or not today is a 3rd Saturday. But otherwise it works great.

There are tougher challenges on that test page, like:

9:00 am – 1:30 pm, North Conference Room 1
April: April 5 and 12
May: May 3 and 10
June: June 7 and 14

I don’t think anything we’ve mentioned so far can touch that, though I’d be happy to be proven wrong.

Meanwhile, the ability to capture recurring events like ‘Third Saturday of Every Month, 10 – 11:30 am’ for my aggregated community calendar has raised a new question. When I use Google Calendar for this purpose, its iCal export doesn’t enumerate the series, it defines a rule:

LOCATION:Cheshire Medical Center
RRULE:FREQ=MONTHLY;INTERVAL=1;BYDAY=3SA;WKST=MO

When I pull that event into elmcity.info/events, the RRULE (recurrence rule) only fires once each time the feed is fetched. And that’s fine. I don’t necessarily want to see these recurring events on the calendar into the far future.
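That iCalendar recurrence syntax can also be expanded client-side. Here’s a sketch using python-dateutil, which parses RRULE strings directly; the DTSTART is an assumption I’ve supplied (a third Saturday at 10:00):

```python
# Expanding the iCalendar recurrence rule shown above with
# python-dateutil. The dtstart is assumed for illustration.
import itertools
from datetime import datetime
from dateutil.rrule import rrulestr

rule = rrulestr("FREQ=MONTHLY;INTERVAL=1;BYDAY=3SA;WKST=MO",
                dtstart=datetime(2008, 4, 19, 10, 0))  # a third Saturday

# The first three occurrences: third Saturdays of April, May, June 2008.
occurrences = list(itertools.islice(rule, 3))
```

This is the sense in which a rule is more compact than its data: three lines of RRULE stand in for an unbounded series of events.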

But while I can syndicate these events directly from Google Calendar into elmcity.info, I would rather route them through Eventful.com. The reason is social, not technical. Although I’m herding almost all these events into my aggregator for the time being, I want their rightful owners to claim them at some point and take care of them thereafter. Eventful is better suited for the kind of commons-based peer production I’m hoping to encourage.

But, I don’t see how to inject dynamic rules, rather than static events, into Eventful. You could run the rule yourself, then poke the generated events into Eventful, but that’d create maintenance woes when events are rescheduled, modified, or cancelled. I’d rather syndicate the rule than the data.

A conversation with Phil Libin about EverNote’s new memex

In his 1945 Atlantic Monthly essay As We May Think, Vannevar Bush famously imagined the memex, a mechanism that would augment human memory. This idea of mental augmentation inspired Doug Engelbart, and we’ve been chasing the dream ever since. On this week’s Interviews with Innovators, Phil Libin discusses EverNote, a new software-plus-services offering that aims to become your memex.

Listeners may recall that Phil appeared on the show once before. In fact he was the first guest in this series. Then he was CEO of Corestreet, a company tackling the problem of large-scale credentials validation in really interesting ways. Now, as EverNote’s CEO, he’s tackling a very different problem. But although EverNote is an application for ordinary folks rather than for governments and major institutions, it raises its own set of scale issues. And not just in terms of scaling out numbers of users and quantities of storage. EverNote wants to scale in the dimension of time as well.

Like me, Phil’s a huge fan of the Long Now Foundation. When he says that EverNote wants to guarantee the integrity of the digital objects that you commit to it forever, he’s not kidding.

While it’s refreshing to see a Web 2.0 company taking this long view, Phil admits that addressing the forever challenge in a meaningful way is beyond the means of EverNote. I’d add that it’s beyond any individual organization, and will require a federation of players to hammer out not only technical standards, but also shared business arrangements.

That’s not going to happen anytime soon, but then EverNote isn’t currently making guarantees that sentimental memorabilia will be preserved for your great-grandchildren. Instead it wants to guarantee that you’ll have effective near-term use of operational memorabilia — key documents, and in particular photos from which it finds, extracts, and indexes text.

The idea with this photo feature is that you can take pictures of receipts, wine labels, magazine pages, or event posters, dump the pictures into EverNote, and then find the photos by searching for the text in them. EverNote’s secret sauce here is its ability to find text not only in high-res scans, but also in “crappy cellphone photos taken at an angle.”

As Phil points out, from EverNote’s perspective the world comes at its users in two modes. First, when they’re away from their computers and out in the world, usually with some kind of camera. Second, when they’re at their computers, in which case they can take clippings from the web, or forward email.

I’m in that second mode a lot, so we’ll see whether EverNote becomes another of the memory augmentation methods I already use. These include blogging, email, and social bookmarking. Each method serves a communication function but also provides a repository where I often stash things purely so I can find them later.

Here’s an interesting and counter-intuitive aspect of EverNote. Human memory degrades over time. Digital memories, however, not only retain full fidelity, they can actually improve over time. Faces that you can’t find in your EverNote archive today may become recognizable next month or next year.

That’s true not only for EverNote, of course, but also for any system to which we commit digital objects. Human augmentation is powerful magic. We’re only starting to realize what it can do for us. And, I should add, to us.

Making sense of CO2 data: A scientific collaboration

This week on Perspectives, I explore the partnership between Dennis Baldocchi, a Berkeley climate scientist, and Catharine van Ingen, an MSR researcher. They’ve been working together on Fluxnet, a scientific data server and collaboration service for hundreds of scientists around the world who are measuring CO2 flux in the atmosphere and trying to understand the dynamics of that flux.

Science in the twenty-first century is increasingly a game of data curation and analysis, involving hundreds or thousands of players distributed all around the world. To make progress, teams will need to coordinate online. The coordination systems will emerge from partnerships like the one Dennis Baldocchi and Catharine van Ingen discuss in this interview.

It’s also fascinating to hear, from the horse’s mouth, what we actually know, and don’t know, about atmospheric CO2. And about how and why we know or don’t know. On key issues like global warming, there’s a huge gap between scientific knowledge and public understanding. Projects like this one can help close that gap.

Parsing human-written date and time information

I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:

Tue, 4/1/08

2 Apr – Wed 10:00AM-10:45AM

Weekdays 8:30am-4:30pm

Thu, 11/15/07 – Fri, 4/11/08

Every Tuesday of the month from 10:00-11:00 a.m

Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction

2nd Saturday of every other month, 10:00 am-12:00 pm

Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.
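For the easiest items on that list, a general-purpose parser does get you partway. Here’s a sketch with python-dateutil, which recognizes many human-written single dates, though it has no notion of ranges or recurrence:

```python
# A sketch of how far a general-purpose parser gets with the easy
# cases. python-dateutil handles single human-written dates; the
# recurring and range-style items need a recognizer of their own.
from dateutil import parser

d = parser.parse("Tue, 4/1/08")
# Yields a datetime for April 1, 2008 (US month-first convention).
```

The gap, in other words, isn’t in parsing a single timestamp; it’s in recognizing the recurring, ranged, and multi-part expressions that people actually write.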

In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.

What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?

Office XML: The long view

For many years I have tried, and mostly failed, to get people to appreciate the value of structured information. Sure, I’ve connected with the chattering classes who Twitter, blog, and read TechMeme, but I’ve only been preaching to the choir. Inside our echo chambers we grok XML, tagging, syndication, and information architecture. Out in the real world, though, most people aren’t hopping on that cluetrain, and that’s almost as true today as it was a decade ago.

Of course I’m not alone in my quest. Tim Berners-Lee has also tried, and mostly failed, to evangelize the power of structured information. The gating factor always was, and still is, data entry. You can go a long, long way with unstructured information, as Google has brilliantly shown. In late 2002 Sergey Brin told me:

Look, putting angle brackets around things is not a technology, by itself. I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.

That’s a great way to make progress, but we’re not in an either/or situation here. There’s also huge progress still to be made by enabling (not forcing) people to write in ways that computers can understand more deeply and effectively.

Jean Paoli saw an opportunity to do something about that on a large scale. It was also late 2002 when I first started talking to him about the injection of XML capabilities into Office. I evangelized that stuff long before I became Microsoft evangelist, because I believed then, and still believe today, that it’s a crucial enabler for a world facing challenges that are infinitely compounded by almost universally crummy information management.

In the flurry of commentary surrounding yesterday’s approval of Office Open XML as an ISO standard, I haven’t seen anyone thank Jean and his team for having the vision to transform Office in this important way, and the constancy of purpose to make it real. Well, I’ll say it. Thanks!

My close encounter with the Hannaford data breach

My debit card was one of the potentially 4.2 million exposed in the recent Hannaford data breach. Here’s part of the letter from my bank, the Savings Bank of Walpole.

I’ve thanked them privately, and want to thank them publicly as well, for being proactive and doing the right thing here. They’re dealing with fallout from a problem they didn’t create.

Details are still emerging; we don’t yet have the full story. As the InfoWorld story notes, Hannaford’s servers might have been compromised by a remote exploit through the network, or a local exploit made possible by unauthorized physical access.

In the aftermath, most of the usual defense-in-depth strategies are being rehashed, and that’s good. But one-time account numbers still aren’t on the radar screen, and I keep on wondering: Why not?

A conversation with Tim Spalding about LibraryThing

I had a great time talking about LibraryThing with Tim Spalding for this week’s ITConversations show. He says LibraryThing is a baroque application. I think of it as deep in the same ways that Flickr is: Many features, many modes of use, many constituencies. Although Tim is flagellating himself about the way we swam around in those depths, I enjoyed the conversation immensely. If you’re fascinated by the dynamics of social information management — whether or not you are a book-lover — I think you will too.

We wound up talking for almost two hours. I omitted the second hour not only for reasons of length, but also because it raised a question that neither of us felt we were able to address very well. As mentioned in comments here, though, it does warrant further consideration. A lot of folks, me included, feel that the inability to move identity and relationships across social networks is increasingly an impediment to joining them and participating in them.

But Tim rightly points out that friction has value. Rites of initiation are costly for a reason. When you invest effort you create meaning. So here’s the question. How do we separate those aspects of social information management that should be portable and frictionless from those that should be unique and special?

Cluster computing, with large data, for the classroom

This week’s Perspectives is a two-parter: an interview and companion screencast on the topic of cluster computing in the classroom. The interview is with Kyril Faenov, the General Manager of the Windows HPC (high performance computing) unit, and the screencast is with Rich Ciapala, a program manager for Microsoft HPC++ Labs.

The project demonstrated in the screencast, and discussed in the interview, is called CompFin Lab. It’s a system that lets professors give their students the ability to run computationally expensive financial models on large quantities of data. From the student’s perspective, you go to a SharePoint server, select a computational model, pick a basket of stocks, and run the model. Behind the scenes the task is partitioned and sprayed across a cluster of computers, then the results are gathered and presented in an Excel spreadsheet.

From the professor’s point of view, some .NET programming is required. But a framework abstracts the mechanics of dealing with the cluster, so the professor can focus on the logic of the model itself.

There are a couple of key points about the evolution of high-performance computing that I want to highlight here. First, there’s what Kyril calls “the gravitational pull of data.” Increasingly, people and organizations are building vast repositories of data that other people and organizations will want to analyze in computationally expensive ways. It’s great to have access to a compute cluster in the cloud that can do the heavy lifting, but when datasets get really big you get bottlenecked trying to send the data to where the code runs. At a certain point you’d rather send the code to where the data lives.

A second and related point is that in our current model for large-scale cloud-based computing, there are only a handful of what I call intergalactic clusters — namely, those operated by Google, Yahoo, Amazon, and Microsoft. These are one-of-a-kind behemoths. You can’t replicate one of them locally and apply it to your terabytes of data. So as Kyril and his team build out their cloud-based HPC services, they’re working to ensure the services can be replicated locally.

Maybe the most optimal thing is for you to stand up a 1000-node cluster with each node having a terabyte of disk. We want to enable that. We want to be able to tell our customers: Here’s how we run these large-scale data-driven HPC applications, and here’s how, within a day or two, you can stand up one of these yourself.

The idea is that if you build one of those for your own terabyte trove of astronomical or climatological data, you can run your own computations against that data, and you can also share that capability with other people and organizations who want to run their code against your data.

Revisiting the InfoWorld metadata explorer

A while ago I wrote an alternative search and navigation interface to InfoWorld.com. The search is broken now because the underlying engine switched from Ultraseek to Google, and nobody has updated the search wrapper. But the navigation piece still works, and while it does, I want to invite some commentary because I’m thinking of doing something similar for another project.

In this model the navigation is metadata-driven, and supports views like:

InfoWorld stories tagged ‘Silverlight’

InfoWorld news stories tagged ‘Silverlight’

InfoWorld news stories by Elizabeth Montalbano tagged ‘Silverlight’

Every piece of metadata in the tabular display is active, and toggles a filter for that item. This works especially well for the tags, and enables you to cruise through the tagspace in a fluid way. For example, try this progression:

1. InfoWorld news stories tagged ‘Silverlight’

2. Click ‘flash’ to toggle it on

3. InfoWorld news stories tagged ‘Silverlight’ and ‘Flash’

4. Click ‘silverlight’ to toggle it off

5. InfoWorld news stories tagged ‘Flash’

The same principle holds for other bits of metadata, like storytype. So for example:

1. InfoWorld news stories tagged ‘Silverlight’

2. Click ‘News’ to toggle it off

3. InfoWorld stories tagged ‘Silverlight’

4. Click ‘Review’ to toggle it on

5. InfoWorld Reviews tagged ‘Silverlight’

6. Click ‘Martin Heller’ to toggle it on

7. InfoWorld Reviews by Martin Heller tagged ‘Silverlight’

8. Click ‘silverlight’ to toggle it off

9. InfoWorld Reviews by Martin Heller
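The toggle behavior in these progressions is simple to model: a set of active (field, value) filters, where clicking an item adds it if absent and removes it if present. Here’s a sketch with made-up records standing in for InfoWorld story metadata:

```python
# A sketch of toggle-filter navigation. Each click adds or removes a
# (field, value) pair, and the view is whatever records match every
# active filter. The story records are invented for illustration.
def toggle(filters, field, value):
    """Click handler: add the filter if absent, remove it if present."""
    filters ^= {(field, value)}  # in-place symmetric difference
    return filters

def view(records, filters):
    return [r for r in records
            if all(r.get(f) == v for f, v in filters)]

stories = [
    {"type": "news",   "author": "Elizabeth Montalbano", "tag": "silverlight"},
    {"type": "review", "author": "Martin Heller",        "tag": "silverlight"},
]

active = set()
toggle(active, "tag", "silverlight")   # tag on: both stories match
toggle(active, "type", "review")       # type on: just the review
```

What the sketch can’t convey, of course, is the discoverability problem: nothing in the data model tells a user that every displayed item is also a control.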

It’s powerful to explore things this way, but if I did something like this again, I’d look for ways to make these filter progressions more intuitive and discoverable.

I just don’t think people expect every item to work as a control as well as an information display. And because they don’t, it may be a bad idea to do things that way. Or maybe it’s a good idea that’s still in search of its perfect expression. I’d be curious to know what you think.

Rediscovering LibraryThing

To prepare for an interview with Tim Spalding, the founder and lead developer of LibraryThing, I re-registered with LibraryThing, spent some quality time with the service, and was wildly impressed.

At one point in the interview, Tim asked me how I, Mr. LibraryLookup, as likely a person as there is to use and appreciate LibraryThing, could have gone so long without hooking up with it.

I think part of the answer is hidden in the first paragraph: I had to re-register for the service, which I had tirekicked a year or two ago. The friction of joining and re-joining online services has become a major barrier.

There’s also conceptual friction. LibraryThing is a deep application that does lots of things, but on the surface, it appears to be a mechanism for cataloging books that you own. In fact it isn’t only that, you can just load it with books that you’ve read, or might read, as a way to seed discovery and recommendation.

Finally, there’s data friction. There are bibliophiles who will obsessively catalog their own collections, but I’m not one of them. I do, however, maintain a list of books on my Amazon wishlist. I syndicate that list to the version of LibraryLookup that alerts me when books on the wishlist become available in my local library.

What I needed was a frictionless way to reuse that list. And on this go-round with LibraryThing I found it. Sort of. You can import your Amazon wishlist into LibraryThing, which is a great way to jumpstart the discovery and recommendation process. It doesn’t yet syndicate from Amazon, so the initial import won’t be refreshed, but Tim says that’s coming.

It turns out not to matter at all that the list of books I’m interested in happens to be an Amazon wishlist. All that matters is that I can keep it in some service, somewhere, that can syndicate data to other services elsewhere.