Something about the title of this week’s Perspectives interview, OpenSearch federation with Search Server 2008, has been nagging me ever since I wrote it. In the interview, Richard Riley and Keller Smith describe how the new Microsoft search server can extend its reach by sending queries to other search services that can return results as OpenSearch-compliant RSS or Atom feeds.

We call this activity federation, but the enabling technology is syndication. So is the group of participating servers a federation, or is it a syndicate?

Some definitions of federation, from 1 dictionary.com and 2 Merriam-Webster:

1 a federated body formed by a number of nations, states, societies, unions, etc., each retaining control of its own internal affairs.

2 an encompassing political or societal entity formed by uniting smaller or more localized entities: as a: a federal government b: a union of organizations

That seems too formal, too heavyweight, for an OpenSearch-mediated search scenario. When you modify a search service to return results in the OpenSearch format, you’re not necessarily joining any kind of union. You’re just making it easier for other entities to latch onto your search results.

OpenSearch was announced on March 16, 2005, at the Web 2.0 conference. That same day I adapted my version of the InfoWorld search service to use it. There was nothing special about what I did, which is why it only took a few minutes. I just added a variant of the query URL that returned results as RSS, with a few minor extensions to comply with OpenSearch.

Then I registered my service with Amazon’s A9, searched A9 for “Jean Paoli”, and saw the combined results shown here.

This arguably was a federation, because you had to join the club in order to have results from your service show up in A9. But nothing about OpenSearch required things to work that way. Other services could consume my search feeds without requiring me to register with them or to grant them permission.

What’s more, any RSS reader could consume those feeds. Although I’d done the OpenSearch hack to showcase integration with A9, it turned out that I’d solved another problem without even intending to. It was now also possible for individuals to subscribe to InfoWorld queries.
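A minimal sketch of what such a feed might look like, in Python. It assumes the response elements (totalResults, startIndex, itemsPerPage) and the namespace from the later OpenSearch 1.1 spec. The example.com URL, the function name, and the sample results are invented for illustration and are not the InfoWorld service's actual code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of the OpenSearch 1.1 response elements (an assumption; see above).
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"
ET.register_namespace("opensearch", OPENSEARCH_NS)


def search_results_as_rss(query, results, total, start_index=1):
    """Render (title, link) search results as RSS 2.0 with OpenSearch extensions."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Search results for {query}"
    # Hypothetical query URL; a real service would use its own.
    ET.SubElement(channel, "link").text = (
        "http://example.com/search?" + urlencode({"q": query, "format": "rss"})
    )
    ET.SubElement(channel, "description").text = f"Results for the query: {query}"

    # The "few minor extensions": paging metadata in the OpenSearch namespace.
    paging = (
        ("totalResults", total),
        ("startIndex", start_index),
        ("itemsPerPage", len(results)),
    )
    for name, value in paging:
        ET.SubElement(channel, f"{{{OPENSEARCH_NS}}}{name}").text = str(value)

    # Everything else is ordinary RSS, which is why any feed reader can use it.
    for title, link in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link

    return ET.tostring(rss, encoding="unicode")
```

Because the extensions live in their own namespace, a plain RSS reader can ignore them and still show the items, while an aggregating search service can use them for paging.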

OpenSearch can involve federation, but more fundamentally it’s about syndication. So, do the participating entities form a syndicate?

1 a: a group of persons or concerns who combine to carry out a particular transaction or project b: cartel c: a loose association of racketeers in control of organized crime

2 a group of individuals or organizations combined or making a joint effort to undertake some specific duty or carry out specific transactions or negotiations

That doesn’t seem right either. We can get closer by focusing on the definitions that emphasize simultaneous publication:

1 a business concern that sells materials for publication in a number of newspapers or periodicals simultaneously

2 to publish simultaneously, or supply for simultaneous publication, in a number of newspapers or other periodicals in different places: Her column is syndicated in 120 papers

But these definitions still involve more business coordination than OpenSearch, or feed syndication in general, requires. If I use OpenSearch to publish a search service within the enterprise, I don’t need to make a formal agreement with the Search Server administrator in order to enable that server to include my search results. I just need to publish my results as an RSS feed, and tell that person I’ve done so. That same RSS feed is available to users who may wish to subscribe to searches performed directly on my service.

It’s the same on the open web. When you adopt a syndication-oriented architecture, small pieces can be loosely joined, or they can be more tightly coupled. But the underlying publish/subscribe mechanism doesn’t determine that choice.

Chewing on these definitions is more than a pedantic exercise for me. In my local community, I’m trying to show how a particular use of publish/subscribe technology — namely, calendar syndication — can solve an important problem for people, organizations, and the community as a whole.

Federation would clearly be the wrong word for the network of calendars that I’m trying to bring into existence. I’ve been using the word syndication instead. But now I suspect that’s the wrong word too. I want to convey that we can create small pieces, that they can be loosely joined, and that important network effects will emerge. I don’t yet know what word or phrase will make that cluster of concepts light up in people’s heads.

In response to a popular recent item — “We posted weekly.pdf to the website. Isn’t that good enough?” — Sarah Allen echoes my favorite Sergey Brin quote. Sergey said: “I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.”

Sarah, citing weblog software as an example of software that enables people to write naturally, goes on to say:

Likewise, it is natural to record calendar information overlaid on a timeline with day, week, and month views that mimic traditional paper visualizations of time. This enables the software to generate structured data without people needing to think about it.

I mostly agree with her about blog software. And I would have been inclined to agree with her about calendar software too, until I started looking seriously into how people do — and often don’t — use calendar software.

Let’s look at a fragment of a softball schedule which, significantly, has been written as an Excel file:

Fri. Apr. 25   6:15   Whitney Brothers   Greenwald Realty
               7:45   Servpro            Athen’s Pizza
Sat. Apr. 26   9:00   WR Painting        Peerless Insurance

Notice what’s missing? There’s no AM/PM, because everybody is expected to know that 6:15AM would be too early for a Friday game while 9:00PM would be too late for a Saturday game.

Yes, it’s natural to view calendar information in ways that mimic traditional presentations. But it’s unnatural to write it using calendar software that constantly nags you to specify nitpicky details like AM and PM. People understand what’s a reasonable time for a Friday or Saturday game. Why can’t software figure that out?
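One way software might figure that out, sketched in Python: assume that every game starts inside a single 12-hour window, and pick whichever reading of the bare clock time falls inside it. The 8:00 AM default and the function name are my assumptions, not anything the schedule states.

```python
from datetime import time


def infer_24h(clock, earliest=time(8, 0)):
    """Resolve an AM/PM-less time such as '6:15' to a 24-hour time.

    Assumes events start within the 12-hour window that begins at
    `earliest`, and that `earliest` is before noon. A window exactly
    12 hours wide contains exactly one of the two possible readings.
    """
    hour, minute = (int(part) for part in clock.split(":"))
    am = time(hour % 12, minute)       # the morning reading (12:xx becomes 0:xx)
    pm = time(hour % 12 + 12, minute)  # the afternoon or evening reading
    return am if am >= earliest else pm
```

With the default window, the schedule's 6:15 and 7:45 come out as 18:15 and 19:45, and 9:00 stays at 9:00 in the morning. A real recognizer would want a window per league or per venue, but the principle is the same: context that people carry in their heads can be written down once instead of typed into every entry.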

I guess that’s why another recent item on parsing human-written date and time information struck a chord with readers. Until we create (and widely deploy) naturalistic interfaces, people are going to avoid the Procrustean bed that is conventional calendar data entry.

On this week’s Interviews with Innovators I spoke with Janis Dickinson, director of citizen science at the Cornell Ornithology Lab. We talked about several of the lab’s projects that involve collection and analysis of volunteer observations about birds and bird habitats.

Courtesy of the eBird project, for example, here is a view of first sightings of common bird species in New Hampshire. At first glance it might be tempting to see the preponderance of dates in the current decade as an effect of global warming. But to support that interpretation, you’d have to answer a bunch of questions about the evolution of record-keeping over the period, and the distribution, reliability, and bias of volunteer observers.

Extracting signal from noise is, of course, one of the classic bread-and-butter activities of information science. What’s fascinating here is the Web 2.0 angle. Birdwatchers are famously passionate data collectors who develop reputations among their peers. When they contribute their data to eBird — and thence to the Avian Knowledge Network — those reputations can begin to be measured, and used to tune the analysis of a large body of contributed data.

For example, the all-time latest reported sighting of the Nelson’s Sharp-tailed Sparrow in New Hampshire was on Nov 24 2007, by Michael Harvey. Is that unusually late? And if so, is it credible? To answer these questions, Cornell’s data crunchers can compare what was and wasn’t reported in the region around that time, by observers whose reputations are one kind of signal that emerges from noisy data.

Lately I’m obsessed with figuring out how to harness the cognitive surplus and put it to work doing better social information management.

The other night I attended a kick-off meeting for a group interested in advancing the cause of local food production in our region. Inevitably the discussion turned to questions that require data to answer. Who are the local producers? Where are they? What do they produce?

In the ensuing discussion, various sources of data emerged. There’s a USDA website, a state government website, a special-interest website, this or that blog. Two things were immediately clear to everyone. First, there would be no effective way to collate these existing sources. Second, most of the needed data wouldn’t be there anyway.

I’d like to be able to recommend the sort of loosely-coupled collaborative list-making method that works so effectively for me. But here’s why I can’t. The method presumes that all the things you’d want to collaboratively curate are already represented by URLs.

In the real world, some are and some aren’t. Consider two examples from this list:

Name: Darby Brook Farm
Day/Time:  8:00 AM - 5:00 PM
Season:  June 1 - October 1
Address:  347 Hill Road
What you’ll find: Vegetables, raspberries, apples.
More Info: 603.835.6624

Name: Stonewall Farm
Day/Time:  Hours vary
Season:  June - October
Address:  242 Chesterfield Road
What you’ll find:  Garden fresh produce through the Community Supported Agriculture (CSA) program, call for options
More Info:  603.357.7278,   bsaunders@stonewallfarm.org,  www.stonewallfarm.org

Because Stonewall Farm has a web presence, we can do all kinds of useful things with its URL. We can tag various bits of metadata onto it (location, products), we can derive views that include that information, we can syndicate those views.

Because Darby Brook Farm doesn’t have a URL, we can’t do those things.

Of course Darby Brook Farm does have an implicit URL-addressable identity at Lighten Up NH. That identity is the record in Lighten Up NH’s database that’s currently being published into a web page by its ColdFusion server.

If that record were directly URL-addressable, the implicit identity would be explicit. Using the record’s URL as a temporary placeholder, we could bootstrap Darby Brook Farm into a collaborative list-making regime based on URLs, tags, and syndication.

Later, when Darby Brook Farm does establish a real web presence, we can unhook its cloud of annotations from the placeholder URL and attach it to the official one.

This scenario highlights a subtle but powerful benefit of data-publishing technologies like Astoria. When you aggressively expose record-level URLs, you can enable the same methods that will work for Stonewall Farm to also work for Darby Brook Farm.
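The placeholder idea can be sketched as a store of tags keyed by URL. This is only an illustration: both URLs below are hypothetical, and the two helper functions are invented names, not part of Astoria or any tagging service.

```python
def tag(store, url, *tags):
    """Attach tags to a URL in a store that maps url -> set of tags."""
    store.setdefault(url, set()).update(tags)


def rehome(store, old_url, new_url):
    """Move every annotation from a placeholder URL to the official one."""
    store.setdefault(new_url, set()).update(store.pop(old_url, set()))


annotations = {}

# Hypothetical record-level URL exposed by the directory's database.
placeholder = "http://example.org/records/darby-brook-farm"
tag(annotations, placeholder, "vegetables", "raspberries", "apples")

# Later, once the farm has its own site (hypothetical address).
official = "http://darbybrookfarm.example/"
rehome(annotations, placeholder, official)
```

Nothing about the tagging code cares whether the key is a placeholder or the real thing, which is the point: once a record has any URL at all, the same methods apply.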

This week’s Interviews with Innovators show is a conversation with Raymond Yee, author of the recently-published Pro Web 2.0 Mashups.

The book is chock full of good examples. Even if you’re an experienced developer of mashups that involve Flickr, del.icio.us, Eventful, and the various mapping services, you’ll learn helpful strategies for using these services individually and in combination.

What we wound up mostly talking about, though, is the vast space of information that’s not currently available to be mashed up. That might be because the information isn’t online at all, or because it isn’t online in a form that’s tractable.

As a kind of social experiment I’ve been tackling this problem in my local community, with particular emphasis on calendar information. In this week’s interview, Raymond talks about tackling the same kind of problem with emphasis on geographic information. Both cases can exemplify a pattern that I’m calling shared responsibility.

Consider, for example, the public library. It hosts a variety of events, some of which are its own (children’s story hour) and some of which aren’t (an AA meeting). Who’s responsible for putting these events onto the library’s public calendar?

Clearly the library should publish its own events. But it needn’t necessarily feel obliged to publish other organizations’ events. In the case of AA meetings, for example, the library is only one of about a dozen venues around town. Shouldn’t AA publish its events to those venues?

We have the tools and services now to enable this kind of small-pieces-loosely-joined approach. In this case, acting as a proxy for AA, I published its regular meetings to Eventful. One of those meetings happens at the public library. So now when you visit the combined calendar, events at the library show up from multiple sources. One is clearly identified with the library itself, others are identified with the various groups using the library.

Of course nothing prevents the library from choosing to authoritatively publish all of the events that it hosts. But it’s useful to show how that can be a choice, not an obligation. If we take a decentralized, small-pieces-loosely-joined approach, information management chores that look insurmountable can turn out not to be.
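To make the shared-responsibility pattern concrete, here is a small sketch of a combined calendar. Each publisher maintains only its own list, and the merged view labels every event with its source. The feed names, event details, and dates are invented for illustration.

```python
from datetime import datetime


def combined_calendar(feeds, venue=None):
    """Merge per-publisher event lists into one view, sorted by start time.

    `feeds` maps a source name to a list of (start, title, venue) tuples.
    If `venue` is given, only events at that venue are kept.
    """
    merged = [
        {"start": start, "title": title, "venue": where, "source": source}
        for source, events in feeds.items()
        for start, title, where in events
        if venue is None or where == venue
    ]
    return sorted(merged, key=lambda event: event["start"])


# Hypothetical feeds: the library publishes its own events, AA publishes its own.
feeds = {
    "library": [(datetime(2008, 4, 26, 10, 0), "Story hour", "Public Library")],
    "aa": [
        (datetime(2008, 4, 25, 19, 0), "Meeting", "Public Library"),
        (datetime(2008, 4, 27, 19, 0), "Meeting", "Community Center"),
    ],
}
at_library = combined_calendar(feeds, venue="Public Library")
```

The library's view of its own building now includes the AA meeting without the library having entered it.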

Ray Ozzie joined me for this week’s Perspectives show. It’s available there as audio plus a text transcript, and you can also watch the video on Channel 9.

Ray opens the conversation by reflecting on his transition to Microsoft three years ago, and on the roles he and Craig Mundie will play as they jointly inherit Bill Gates’ responsibilities.

Next the conversation turns to a meme that Tim O’Reilly once evangelized: the Internet operating system. That phrase never resonated as powerfully as Web 2.0 did, but the ideas behind it are becoming realities. Ray applauds the work that Amazon and Google have done in this area. And he talks about how Microsoft’s legacy as a platform company, dedicated to helping developers succeed, will influence its approach.

In that context, Ray explores one piece of Microsoft’s emerging Internet operating system: the newly-announced Live Mesh. Sharing common DNA with earlier projects, notably Groove and before that Notes, Live Mesh is a data synchronizer born to the Web. The objects that it synchronizes are represented as RSS and Atom feeds, and are manipulated with a RESTful API that works symmetrically on local and cloud-based nodes.

Although the most visible Live Mesh application is a file-and-folder synchronizer, Ray notes that this is just one example of an application pattern that can apply equally to the synchronization of custom objects, like calendar events, across all the devices in a mesh. It also applies across the spectrum of application types, ranging from the browser to conventional rich clients to Web-based rich clients like Flash and Silverlight.

There’s another pattern for Live Mesh applications, one that’s less familiar. In this pattern, a website uses Live Mesh as a pipeline to communicate with Live Mesh users. If you’re running a travel site, or a bank, you can use that pipeline to transmit structured data to your users — for example, itineraries or transaction reports. It’s easy to create those XML feeds, you can leverage the Live Mesh infrastructure to deliver them securely and reliably at scale, they synchronize across all devices in each user’s Live Mesh, and they’re accessible to local applications using the same RESTful feed APIs that were used to create them.

It’s almost 10 years since I began producing and consuming data feeds, initially in RSS format. Although I regard the syndication of data feeds, in general, as a transformative technology, the concept still makes no sense to civilians and has little or no effect on their lives.

In order to understand why not, and as a way of figuring out how to motivate a practical understanding of syndication, I’m tackling a problem whose solution doesn’t involve RSS, or Atom, or microformats, or XML. The problem is calendar syndication, and part of the solution is iCalendar, a non-XML format that all widely-used calendar programs support well enough for my purposes.

It’s only part of the solution because the real problem is that most people, most of the time, for most of their calendar-related activities, don’t use calendar programs. They use spreadsheets and wordprocessors, and they produce unstructured web pages and PDF files.

There was a time when, behind their backs, I would mock them for doing so. No longer. As I meet with intelligent and well-educated professionals in my community, and talk with them about how to synchronize calendar information from a variety of sources, I realize that they simply have no intuition about the difference between a PDF file and an ICS file that contain the same calendar information. Both are computer files, right? Both can be posted to the web, right? Both can be searched, right? Problem solved.

There are really two aspects to this missing intuition. First, the concept that some kinds of computer files are more structured than other kinds. Second, the concept that the structured kind can flow easily around the Net without loss of fidelity, and can deliver use value in a variety of contexts, whereas the unstructured kind is inert.
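Here is a rough sketch of what "structured" buys you. It writes one event as iCalendar text and then reads the start time back out mechanically, which is exactly what you can't reliably do with the same information in a PDF. The event, the UID, and the function names are made up. The reader is deliberately naive: it ignores property parameters, line folding, and text escaping that real ICS files can contain.

```python
from datetime import datetime, timezone

STAMP = "%Y%m%dT%H%M%S"  # iCalendar's basic date-time form


def to_ics(summary, start, end, uid):
    """Serialize one event as a minimal iCalendar document (local, floating times)."""
    now = datetime.now(timezone.utc).strftime(STAMP) + "Z"
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//calendar sketch//EN",
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{now}",
        f"DTSTART:{start.strftime(STAMP)}",
        f"DTEND:{end.strftime(STAMP)}",
        f"SUMMARY:{summary}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"


def read_events(ics_text):
    """Collect each VEVENT's properties into a dict (naive parser, see above)."""
    events, current = [], None
    for line in ics_text.splitlines():
        name, _, value = line.partition(":")
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT":
            events.append(current)
            current = None
        elif current is not None:
            current[name] = value
    return events


ics = to_ics(
    "Story hour",
    datetime(2008, 4, 26, 10, 0),
    datetime(2008, 4, 26, 11, 0),
    "story-hour-1@example.org",
)
event = read_events(ics)[0]
starts_at = datetime.strptime(event["DTSTART"], STAMP)
```

Any program that fetches this file can recover `starts_at` without guessing, and that is the property that lets the data flow into other calendars without loss of fidelity.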

These are ways of computational thinking unknown to most people. As a school administrator, librarian, city planner, social worker, or retail store owner, nobody expects you to understand and apply these principles.

And yet almost everybody needs to harmonize personal and organizational calendars. And many individuals and organizations need to flow their calendar data into other contexts to promote and coordinate their activities.

So here’s my approach. I’m scooping up all the calendar information I can find for my community, in whatever form I can find it, and flowing it into a common view. Then I’m syndicating that view elsewhere to show that there’s nothing special about my aggregation.

The idea is to establish a critical mass by brute force, and allow people to see how, over time, sources that are structured and can syndicate will remain in the game, and sources that aren’t will have to sit out on the sidelines.

It’s turning into a nice case study of how organizations and individuals can negotiate shared responsibility for calendar information that’s of common interest. But that’s a story for another day. First things first. I need to give people a reason to care about using a calendar program — any calendar program, could be Outlook or Apple iCal or Google Calendar, so long as it exports iCalendar — in preference to a spreadsheet or word processor. Although the geek tribe can scarcely imagine why, that first step is a doozy.

For this week’s Interviews with Innovators show I spoke with Deepak Singh. This interview extends what has become an ongoing series of discussions with folks who are applying the principles of web 2.0 to the practice of science. This was, of course, the original purpose of web 1.0.

Other Innovators shows on this topic include conversations with Joel Selanikio about epidemiological data collection, Barbara Aronson about giving poor countries free subscriptions to biomedical journals, and Timo Hannay about the impressive stream of online innovations that’s flowing from the Nature Publishing Group.

My new Perspectives series has also explored this theme of Net-enabled science. There, I’ve talked with Catharine van Ingen and Dennis Baldocchi about collaborative analysis of atmospheric CO2 data, and with Pablo Fernicola about using Word to produce scientific articles in the National Library of Medicine’s XML format.

For some reason I’ve never gotten around to doing stitched-together panoramic photos until recently. Today, with spring fever raging, I hopped on my bicycle, did one of my favorite circuits, and made this 360 view of Park Hill in Westmoreland:

It turned out to be an interesting study in perception. If you check the enlarged view, you’ll see a tiny, insignificant-looking church in the center of the spread, dwarfed by mailboxes in the foreground. In my memory of the scene, that church was the dominant feature. But what my eyes actually saw is what the camera saw: a tiny, insignificant-looking church.

Next time I’ll need to stand closer to it. And I’ll need to bear in mind that what we think we see is a heavily interpreted version of what hits the retinas.

Still, it was fun. I love that you can see the handlebars of my bicycle on the left, and the seat on the right.

I’m sure there are lots of ways to do this, and I’ve never really looked into it, but Windows Live Photo Gallery makes the whole thing a snap. From camera import, to photo stitching, to Flickr upload, it took under 10 minutes. And most of that was CPU time.

A while ago I recorded a commentary for New Hampshire public radio on the topic of public data. The themes will be familiar to readers of this blog: transparency, citizen use of government data. I wondered when it would air, and then last night, while doing the dishes, I heard myself on the kitchen radio.

The piece is available on the NHPR site here. Will it make sense to folks listening at their kitchen sinks, or driving in their cars? I hope so, because as powerful an idea as this is, it’ll go nowhere until it does make sense to those folks.

To follow up on last week’s item about parsing the kinds of dates and times that people actually write, Google Calendar’s Quick Add feature looks like the clear winner. Here’s a test page with expressions like:

Third Saturday of Every Month, 10 - 11:30 am

Let’s try the Chronic module from Ruby:

irb(main):007:0> Chronic.parse('Third Saturday of Every Month, 10 - 11:30 am')
=> nil

No joy.

As David French pointed out, Google Calendar’s Quick Add gets this right. Or anyway, close enough. There seems to be a small bug that pokes an instance of the event into today’s slot, whether or not today is a 3rd Saturday. But otherwise it works great.

There are tougher challenges on that test page, like:

9:00 am - 1:30 pm, North Conference Room 1
April: April 5 and 12
May: May 3 and 10
June: June 7 and 14

I doubt anything we’ve mentioned so far can touch that, though I’d be happy to be proven wrong.

Meanwhile, the ability to capture recurring events like ‘Third Saturday of Every Month, 10 - 11:30 am’ for my aggregated community calendar has raised a new question. When I use Google Calendar for this purpose, its iCal export doesn’t enumerate the series, it defines a rule:

LOCATION:Cheshire Medical Center
RRULE:FREQ=MONTHLY;INTERVAL=1;BYDAY=3SA;WKST=MO

When I pull that event into elmcity.info/events, the RRULE (recurrence rule) only fires once each time the feed is fetched. And that’s fine. I don’t necessarily want to see these recurring events on the calendar into the far future.

But while I can syndicate these events directly from Google Calendar into elmcity.info, I would rather route them through Eventful.com. The reason is social, not technical. Although I’m herding almost all these events into my aggregator for the time being, I want their rightful owners to claim them at some point and take care of them thereafter. Eventful is better suited for the kind of commons-based peer production I’m hoping to encourage.

But I don’t see how to inject dynamic rules, rather than static events, into Eventful. You could run the rule yourself, then poke the generated events into Eventful, but that’d create maintenance woes when events are rescheduled, modified, or cancelled. I’d rather syndicate the rule than the data.
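For reference, running the rule yourself is not hard. Here is a minimal sketch in Python's standard library of what BYDAY=3SA with FREQ=MONTHLY expands to. It handles only this one shape of rule; the function names are mine, and this is not how Google Calendar or any iCalendar library implements recurrence.

```python
import calendar
from datetime import date


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0 .. Sunday=6) in a month.

    Returns None when the month has no such day (possible only for n=5).
    """
    first_weekday, days_in_month = calendar.monthrange(year, month)
    day = 1 + (weekday - first_weekday) % 7 + 7 * (n - 1)
    return date(year, month, day) if day <= days_in_month else None


def expand_monthly(start, weekday, n, count):
    """First `count` monthly occurrences of the n-th weekday on or after `start`."""
    occurrences = []
    year, month = start.year, start.month
    while len(occurrences) < count:
        candidate = nth_weekday(year, month, weekday, n)
        if candidate is not None and candidate >= start:
            occurrences.append(candidate)
        # Step to the next month, rolling over the year when needed.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return occurrences


third_saturdays = expand_monthly(date(2008, 4, 1), calendar.SATURDAY, 3, 3)
```

Starting from April 1, 2008, this yields April 19, May 17, and June 21. The maintenance problem is everything that happens after the expansion: once those three dates are copied into another service, a change to the rule no longer reaches them.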

In his 1945 Atlantic Monthly essay As We May Think, Vannevar Bush famously imagined the memex, a mechanism that would augment human memory. This idea of mental augmentation inspired Doug Engelbart, and we’ve been chasing the dream ever since. On this week’s Interviews with Innovators, Phil Libin discusses EverNote, a new software-plus-services offering that aims to become your memex.

Listeners may recall that Phil appeared on the show once before. In fact he was the first guest in this series. Then he was CEO of Corestreet, a company tackling the problem of large-scale credentials validation in really interesting ways. Now, as EverNote’s CEO, he’s tackling a very different problem. But although EverNote is an application for ordinary folks rather than for governments and major institutions, it raises its own set of scale issues. And not just in terms of scaling out numbers of users and quantities of storage. EverNote wants to scale in the dimension of time as well.

Like me, Phil’s a huge fan of the Long Now Foundation. When he says that EverNote wants to guarantee the integrity of the digital objects that you commit to it forever, he’s not kidding.

While it’s refreshing to see a Web 2.0 company taking this long view, Phil admits that addressing the forever challenge in a meaningful way is beyond the means of EverNote. I’d add that it’s beyond any individual organization, and will require a federation of players to hammer out not only technical standards, but also shared business arrangements.

That’s not going to happen anytime soon, but then EverNote isn’t currently making guarantees that sentimental memorabilia will be preserved for your great-grandchildren. Instead it wants to guarantee that you’ll have effective near-term use of operational memorabilia — key documents, and in particular photos from which it finds, extracts, and indexes text.

The idea with this photo feature is that you can take pictures of receipts, wine labels, magazine pages, or event posters, dump the pictures into EverNote, and then find the photos by searching for the text in them. EverNote’s secret sauce here is its ability to find text not only in high-res scans, but also in “crappy cellphone photos taken at an angle.”

As Phil points out, from EverNote’s perspective the world comes at its users in two modes. First, when they’re away from their computers and out in the world, usually with some kind of camera. Second, when they’re at their computers, in which case they can take clippings from the web, or forward email.

I’m in that second mode a lot, so we’ll see whether EverNote becomes another of the memory augmentation methods I already use. These include blogging, email, and social bookmarking. Each method serves a communication function but also provides a repository where I often stash things purely so I can find them later.

Here’s an interesting and counter-intuitive aspect of EverNote. Human memory degrades over time. Digital memories, however, not only retain full fidelity, they can actually improve over time. Faces that you can’t find in your EverNote archive today may become recognizable next month or next year.

That’s true not only for EverNote, of course, but also for any system to which we commit digital objects. Human augmentation is powerful magic. We’re only starting to realize what it can do for us. And, I should add, to us.

This week on Perspectives, I explore the partnership between Dennis Baldocchi, a Berkeley climate scientist, and Catharine van Ingen, an MSR researcher. They’ve been working together on Fluxnet, a scientific data server and collaboration service for hundreds of scientists around the world who are measuring CO2 flux in the atmosphere and trying to understand the dynamics of that flux.

Science in the twenty-first century is increasingly a game of data curation and analysis, involving hundreds or thousands of players distributed all around the world. To make progress, teams will need to coordinate online. The coordination systems will emerge from partnerships like the one Dennis Baldocchi and Catharine van Ingen discuss in this interview.

It’s also fascinating to hear, from the horse’s mouth, what we actually know, and don’t know, about atmospheric CO2. And about how and why we know or don’t know. On key issues like global warming, there’s a huge gap between scientific knowledge and public understanding. Projects like this one can help close that gap.

I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:

Tue, 4/1/08

2 Apr - Wed 10:00AM-10:45AM

Weekdays 8:30am-4:30pm

Thu, 11/15/07 - Fri, 4/11/08

Every Tuesday of the month from 10:00-11:00 a.m

Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction

2nd Saturday of every other month, 10:00 am-12:00 pm

Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.

In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.

What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?
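To show the kind of thing I mean, here is a deliberately narrow sketch in Python. It recognizes only the "Nth weekday of every (other) month" shape and turns it into an iCalendar-style recurrence rule, which a calendar program could then expand into dates. It ignores the time range, and the pattern, tables, and function name are my own invention, not an existing library.

```python
import re

ORDINALS = {
    "1st": 1, "first": 1, "2nd": 2, "second": 2,
    "3rd": 3, "third": 3, "4th": 4, "fourth": 4,
}
DAYS = {
    "monday": "MO", "tuesday": "TU", "wednesday": "WE", "thursday": "TH",
    "friday": "FR", "saturday": "SA", "sunday": "SU",
}
PATTERN = re.compile(
    r"(?P<ord>\w+)\s+(?P<day>\w+day)\s+of\s+every\s+(?P<other>other\s+)?month",
    re.IGNORECASE,
)


def to_rrule(text):
    """Return an RRULE value for 'Nth weekday of every (other) month', else None."""
    match = PATTERN.search(text)
    if not match:
        return None
    n = ORDINALS.get(match.group("ord").lower())
    day = DAYS.get(match.group("day").lower())
    if n is None or day is None:
        return None
    interval = 2 if match.group("other") else 1  # "every other month" means every 2nd
    return f"FREQ=MONTHLY;INTERVAL={interval};BYDAY={n}{day}"
```

For the example above this produces FREQ=MONTHLY;INTERVAL=2;BYDAY=2SA, and it returns None for "Weekdays 8:30am-4:30pm" and for everything else in the list. That gap is the problem in miniature: each human format needs its own pattern, which is why I'm hoping somebody has already packaged a more complete recognizer.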

For many years I have tried, and mostly failed, to get people to appreciate the value of structured information. Sure, I’ve connected with the chattering classes who Twitter, blog, and read TechMeme, but I’ve only been preaching to the choir. Inside our echo chambers we grok XML, tagging, syndication, and information architecture. Out in the real world, though, most people aren’t hopping on that cluetrain, and that’s almost as true today as it was a decade ago.

Of course I’m not alone in my quest. Tim Berners-Lee has also tried, and mostly failed, to evangelize the power of structured information. The gating factor always was, and still is, data entry. You can go a long, long way with unstructured information, as Google has brilliantly shown. In late 2002 Sergey Brin told me:

Look, putting angle brackets around things is not a technology, by itself. I’d rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.

That’s a great way to make progress, but we’re not in an either/or situation here. There’s also huge progress still to be made by enabling (not forcing) people to write in ways that computers can understand more deeply and effectively.

Jean Paoli saw an opportunity to do something about that on a large scale. It was also late 2002 when I first started talking to him about the injection of XML capabilities into Office. I evangelized that stuff long before I became a Microsoft evangelist, because I believed then, and still believe today, that it’s a crucial enabler for a world facing challenges that are infinitely compounded by almost universally crummy information management.

In the flurry of commentary surrounding yesterday’s approval of Office Open XML as an ISO standard, I haven’t seen anyone thank Jean and his team for having the vision to transform Office in this important way, and the constancy of purpose to make it real. Well, I’ll say it. Thanks!

My debit card was one of the potentially 4.2 million exposed in the recent Hannaford data breach. Here’s part of the letter from my bank, the Savings Bank of Walpole.

I’ve thanked them privately, and want to thank them publicly as well, for being proactive and doing the right thing here. They’re dealing with fallout from a problem they didn’t create.

Details are still emerging, and we don’t yet have the full story. As the InfoWorld story notes, Hannaford’s servers might have been compromised by a remote exploit through the network, or a local exploit made possible by unauthorized physical access.

In the aftermath, most of the usual defense-in-depth strategies are being rehashed, and that’s good. But one-time account numbers still aren’t on the radar screen, and I keep on wondering: Why not?

I had a great time talking about LibraryThing with Tim Spalding for this week’s ITConversations show. He says LibraryThing is a baroque application. I think of it as deep in the same ways that Flickr is: Many features, many modes of use, many constituencies. Although Tim is flagellating himself about the way we swam around in those depths, I enjoyed the conversation immensely. If you’re fascinated by the dynamics of social information management — whether or not you are a book-lover — I think you will too.

We wound up talking for almost two hours. I omitted the second hour not only for reasons of length, but also because it raised a question that neither of us felt we were able to address very well. As mentioned in comments here, though, it does warrant further consideration. A lot of folks, me included, feel that the inability to move identity and relationships across social networks is increasingly an impediment to joining them and participating in them.

But Tim rightly points out that friction has value. Rites of initiation are costly for a reason. When you invest effort you create meaning. So here’s the question. How do we separate those aspects of social information management that should be portable and frictionless from those that should be unique and special?

This week’s Perspectives is a two-parter: an interview and companion screencast on the topic of cluster computing in the classroom. The interview is with Kyril Faenov, the General Manager of the Windows HPC (high performance computing) unit, and the screencast is with Rich Ciapala, a program manager for Microsoft HPC++ Labs.

The project demonstrated in the screencast, and discussed in the interview, is called CompFin Lab. It’s a system that enables professors to in turn enable their students to run computationally expensive financial models on large quantities of data. From the student’s perspective, you go to a SharePoint server, select a computational model, pick a basket of stocks, and run the model. Behind the scenes the task is partitioned and sprayed across a cluster of computers, then the results are gathered and presented in an Excel spreadsheet.

From the professor’s point of view, some .NET programming is required. But a framework abstracts the mechanics of dealing with the cluster, so the professor can focus on the logic of the model itself.
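To make the partition-and-gather pattern concrete, here is a minimal sketch in Python. It is not the CompFin Lab framework (which is .NET and runs on a real cluster). The function names, the toy "model," and the price data are all hypothetical, and a local thread pool stands in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def partition(items, n_chunks):
    """Split items into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def average_return(prices):
    """Hypothetical model: mean period-over-period return of one price series."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return mean(returns)


def run_model_on_chunk(chunk):
    """Work done by one 'node': apply the model to each stock in its slice."""
    return {symbol: average_return(prices) for symbol, prices in chunk}


def run_on_cluster(basket, model_worker, n_workers=4):
    """Scatter the basket across workers, then gather the partial results."""
    chunks = partition(list(basket.items()), n_workers)
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(model_worker, chunks):
            results.update(partial)
    return results


# Made-up basket of stocks; symbols and prices are placeholders.
basket = {
    "AAA": [100.0, 110.0, 121.0],
    "BBB": [50.0, 50.0, 50.0],
    "CCC": [10.0, 9.0, 8.1],
}
results = run_on_cluster(basket, run_model_on_chunk, n_workers=2)
```

The point of the split is the same one the framework makes: the professor would write only something like average_return, while partitioning, dispatch, and gathering live in code the professor never touches.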

There are a couple of key points about the evolution of high-performance computing that I want to highlight here. First, there’s what Kyril calls “the gravitational pull of data.” Increasingly, people and organizations are building vast repositories of data that other people and organizations will want to analyze in computationally expensive ways. It’s great to have access to a compute cluster in the cloud that can do the heavy lifting, but when datasets get really big you get bottlenecked trying to send the data to where the code runs. At a certain point you’d rather send the code to where the data lives.

A second and related point is that in our current model for large-scale cloud-based computing, there are only a handful of what I call intergalactic clusters — namely, those operated by Google, Yahoo, Amazon, and Microsoft. These are one-of-a-kind behemoths. You can’t replicate one of them locally and apply it to your terabytes of data. So as Kyril and his team build out their cloud-based HPC services, they’re working to ensure the services can be replicated locally.

Maybe the most optimal thing is for you to stand up a 1000-node cluster with each node having a terabyte of disk. We want to enable that. We want to be able to tell our customers: Here’s how we run these large-scale data-driven HPC applications, and here’s how, within a day or two, you can stand up one of these yourself.

The idea is that if you build one of those for your own terabyte trove of astronomical or climatological data, you can run your own computations against that data, and you can also share that capability with other people and organizations who want to run their code against your data.

A while ago I wrote an alternative search and navigation interface to InfoWorld.com. The search is broken now because the underlying engine switched from Ultraseek to Google, and nobody has updated the search wrapper. But the navigation piece still works, and while it does, I want to invite some commentary because I’m thinking of doing something similar for another project.

In this model the navigation is metadata-driven, and supports views like:

InfoWorld stories tagged ‘Silverlight’

InfoWorld news stories tagged ‘Silverlight’

InfoWorld news stories by Elizabeth Montalbano tagged ‘Silverlight’

Every piece of metadata in the tabular display is active, and toggles a filter for that item. This works especially well for the tags, and enables you to cruise through the tagspace in a fluid way. For example, try this progression:

1. InfoWorld news stories tagged ‘Silverlight’

2. Click ‘flash’ to toggle it on

3. InfoWorld news stories tagged ‘Silverlight’ and ‘Flash’

4. Click ‘silverlight’ to toggle it off

5. InfoWorld news stories tagged ‘Flash’

The same principle holds for other bits of metadata, like storytype. So for example:

1. InfoWorld news stories tagged ‘Silverlight’

2. Click ‘News’ to toggle it off

3. InfoWorld stories tagged ‘Silverlight’

4. Click ‘Review’ to toggle it on

5. InfoWorld Reviews tagged ‘Silverlight’

6. Click ‘Martin Heller’ to toggle it on

7. InfoWorld Reviews by Martin Heller tagged ‘Silverlight’

8. Click ‘silverlight’ to toggle it off

9. InfoWorld Reviews by Martin Heller
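The progressions above boil down to one operation: flipping a single (facet, value) pair in a set of active filters. Here is a minimal sketch of that idea in Python. It is not the code behind the InfoWorld interface; the field names (tags, type, author) and the sample stories are hypothetical stand-ins.

```python
def toggle(filters, facet, value):
    """Return a new filter set with the (facet, value) pair flipped on or off."""
    return filters ^ {(facet, value.lower())}  # symmetric difference


def matches(story, filters):
    """A story matches when it satisfies every active filter."""
    for facet, value in filters:
        field = story[facet]
        # Multi-valued fields (tags) and single-valued fields are treated alike.
        values = field if isinstance(field, (set, frozenset, list)) else [field]
        if value not in {v.lower() for v in values}:
            return False
    return True


def select(stories, filters):
    """Return the titles of all stories that pass the current filters."""
    return [s["title"] for s in stories if matches(s, filters)]


# Placeholder stories; titles and metadata are invented for illustration.
stories = [
    {"title": "Story A", "type": "News", "author": "Elizabeth Montalbano",
     "tags": {"silverlight", "flash"}},
    {"title": "Story B", "type": "Review", "author": "Martin Heller",
     "tags": {"silverlight"}},
    {"title": "Story C", "type": "Review", "author": "Martin Heller",
     "tags": {"flash"}},
]

# Replay the second progression: news + silverlight, then swap news for
# review, add the author, and finally drop the tag.
f = frozenset()
f = toggle(f, "tags", "Silverlight")
f = toggle(f, "type", "News")
step1 = select(stories, f)
f = toggle(f, "type", "News")
f = toggle(f, "type", "Review")
f = toggle(f, "author", "Martin Heller")
step7 = select(stories, f)
f = toggle(f, "tags", "silverlight")
step9 = select(stories, f)
```

Because every click is the same symmetric-difference operation, each state also maps naturally onto a URL, which is what makes the views linkable.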

It’s powerful to explore things this way, but if I did something like this again, I’d look for ways to make these filter progressions more intuitive and discoverable.

I just don’t think people expect every item to work as a control as well as an information display. And because they don’t, it may be a bad idea to do things that way. Or maybe it’s a good idea that’s still in search of its perfect expression. I’d be curious to know what you think.

To prepare for an interview with Tim Spalding, the founder and lead developer of LibraryThing, I re-registered with LibraryThing, spent some quality time with the service, and was wildly impressed.

At one point in the interview, Tim asked me how I, Mr. LibraryLookup, as likely a person as there is to use and appreciate LibraryThing, could have gone so long without hooking up with it.

I think part of the answer is hidden in the first paragraph: I had to re-register for the service, which I had tirekicked a year or two ago. The friction of joining and re-joining online services has become a major barrier.

There’s also conceptual friction. LibraryThing is a deep application that does lots of things, but on the surface, it appears to be a mechanism for cataloging books that you own. In fact it isn’t only that: you can just load it with books that you’ve read, or might read, as a way to seed discovery and recommendation.

Finally, there’s data friction. There are bibliophiles who will obsessively catalog their own collections, but I’m not one of them. I do, however, maintain a list of books on my Amazon wishlist. I syndicate that list to the version of LibraryLookup that alerts me when books on the wishlist become available in my local library.

What I needed was a frictionless way to reuse that list. And on this go-round with LibraryThing I found it. Sort of. You can import your Amazon wishlist into LibraryThing, which is a great way to jumpstart the discovery and recommendation process. It doesn’t yet syndicate from Amazon, so the initial import won’t be refreshed, but Tim says that’s coming.

It turns out not to matter at all that the list of books I’m interested in happens to be an Amazon wishlist. All that matters is that I can keep it in some service, somewhere, that can syndicate data to other services elsewhere.

This week’s ITConversations show is a chat with Carl Malamud, whose exploits I’ve followed ever since he launched podcasting a decade ahead of schedule with a project called Internet Talk Radio. Since then, Carl has become known mainly for his tireless crusade to release troves of public information to the Net: SEC filings, patents, Congressional video, historical photographs, and most recently, U.S. case law.

One of the questions I wanted to explore with Carl is also raised here by John Montgomery:

Popfly, a mashup tool, depends on three things: data that is simple to access programmatically, interesting, and available under terms that enable users to work with it. As with most software endeavors, you can pick two.

The government has a huge amount of interesting data that’s available under really great terms. Weather? Check out http://www.noaa.gov. Financial information? Start with http://www.sec.gov. Crime statistics? Dig around in http://www.usdoj.gov/. But how much of this is programmatically accessible? Very little, as it turns out.

John mentions the Sunlight Foundation’s efforts to provide an intermediary layer of services that make raw data easier to access and manipulate, and I raised that point with Carl. From his perspective, of course, it all starts with the data, which he is rightly focused on providing. Even though the U.S. is far ahead of many other countries in this regard, there are oceans of important information not yet available even in raw form.

Carl has enormous faith in the Net’s ability to interconnect and enhance these raw sources, and I do too. Here’s a small but significant example. If you view source on 28 Fed.R.Serv.3d 415, you’ll see one of my favorite strategies at work: semantic metadata encoded using CSS style tags. That enables an important kind of programmatic access. Now it’s true that today, Internet search engines don’t support queries that ask for documents where Shelby Reed appears as a plaintiff in an appeal to the U.S. Court of Appeals, Fifth Circuit. Someday, though, that kind of query will be supported, and the latent semantics of this rendering of U.S. case law will emerge.
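To show what that kind of programmatic access might look like, here is a minimal sketch using only Python’s standard library. The class names (court, plaintiff) and the sample markup are my assumptions for illustration; I haven’t reproduced the actual markup of the case law pages, so treat this as the shape of the technique and not a parser for that site.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element, keyed by its CSS class names.

    Assumes reasonably well-nested markup.
    """

    def __init__(self):
        super().__init__()
        self.found = {}    # class name -> list of text strings
        self._stack = []   # one (classes, text parts) entry per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append((classes, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, parts in self._stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        classes, parts = self._stack.pop()
        text = " ".join("".join(parts).split())  # normalize whitespace
        for name in classes:
            self.found.setdefault(name, []).append(text)


def is_plaintiff_in_court(html, name, court):
    """True if `name` is marked up as a plaintiff and `court` names the court."""
    parser = ClassTextExtractor()
    parser.feed(html)
    parser.close()
    plaintiffs = parser.found.get("plaintiff", [])
    courts = parser.found.get("court", [])
    return name in plaintiffs and any(court in c for c in courts)


# Hypothetical markup, invented to illustrate class-based metadata.
SAMPLE = """
<div class="case">
  <p class="court">United States Court of Appeals, Fifth Circuit</p>
  <p class="parties"><span class="plaintiff">Shelby Reed</span></p>
</div>
"""

hit = is_plaintiff_in_court(SAMPLE, "Shelby Reed", "Fifth Circuit")
```

A search engine that indexed those class names could answer the plaintiff-and-court query directly; until then, anyone can run this sort of extraction over pages they fetch themselves.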

These enhanced services don’t necessarily just arise from the grassroots, however. Resource-rich organizations are often in the best position to provide them. One example, we agreed, is the New York Times’ stunningly effective visualization of presidential election debates. Ideally we’d be able to visualize all of the proceedings of Congress in the same way. That’s probably too much to expect of public-interest groups running shoestring operations. But what such groups can do is apply Carl’s favorite technique: Create a few high-profile examples, and then pressure the government into internalizing the process.

The second installment of Perspectives is up, with Vittorio Bertocci, author of Understanding Windows CardSpace. This interview was recorded a few months ago, and has been waiting for the Perspectives site to launch. In January I excerpted the part about omnidirectional identity, a difficult phrase that I continue to struggle with. Maybe a better one is Internet persona: the social mask that you project when you self-publish online, and to which reputation attaches. Whatever we call this phenomenon, its Laws of Identity — not only for people, but also for digital objects — are not yet well defined.

Most of the interview, though, concerns the existing “unidirectional” mechanisms supported by CardSpace. I asked Vittorio to relate those mechanisms to precursors like SSL client certificates and Kerberos, and also to the complementary OpenID system. As discussed in my ITConversations podcast with Dick Hardt, the principles that govern this identity machinery are abstract and, until we experience them firsthand, will be hard for most of us to grasp. But Vittorio does a good job of explaining those principles in terms of concrete examples.

While reviewing a white paper by a colleague on the subject of personal digital archives, I realized that I hadn’t followed through on a plan to consolidate a few different caches of digital photos from various digicam and computer eras. So of course, when I went looking, things weren’t exactly the way I remembered. One particular batch was missing, and there were some anxious moments while I booted up dormant computers and mounted shelved disks. In the end I found the missing set, but although I could have sworn they were in three safe places, there was really only one.

In these moments of panic, the need for a lifebits service becomes crystal clear. But the moments pass, and we move on. Most people, most of the time, don’t yet feel the need for that kind of service.

Inevitably that will change. I wonder how, and when?

I’m running a couple of services that make automatic use of Amazon wishlists, and today I noticed that the current version of the API is going away:

503 - Service Unavailable

ECS3 is currently unavailable due to a planned outage in preparation for the complete shutdown of ECS3 on March 31, 2008.

After March 31, 2008, we will no longer accept Amazon ECS 3.0 requests. Please upgrade to the Amazon Associates Web Service (previously called Amazon E-Commerce Web Service 4.0) by then to ensure that you or your customers are not affected by the upcoming deprecation.

Amazon ECS 3.0 deprecation was announced a year ago in February 2007. You can read the original post at http://developer.amazonwebservices.com/connect/ann.jspa?annID=164.

In preparation of the March 31st deprecation, the Amazon ECS 3.0 web service will experience several outages. The complete outage schedule can be viewed at http://developer.amazonwebservices.com/connect/ann.jspa?annID=276.

Please refer to the migration guide for assistance in mapping Amazon ECS 3.0 calls to their Amazon Associates Web Service 4.0 equivalents. You can find the migration guide at http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=12&externalID=627. Please use the Amazon Associates Web Service forum to ask technical questions and share answers with your fellow developers.

We thank you for being part of Amazon’s Developer community and look forward to your continued support.

Like Rich Burridge, I’ll be needing a replacement for PyAmazon, the Python module Mark Pilgrim wrote long ago to simplify use of the original Amazon API.

In our modern world of aggregation, search, and syndication, it’s easy to wait and see what will happen. I went to Bloglines and searched for blog items that — like Rich’s and now mine — point to Amazon’s page about migrating to the new API. And then I subscribed to that search.

In a way, this is too easy. I can imagine a bunch of people camped on that query, watching the clock and waiting for someone else to step up to the plate before March 31. The first time around, when Amazon web services were new and shiny, it was cool to be that person. Now, not so much.

Update: A couple of folks have pointed to PyAWS. As mentioned in Rich Burridge’s blog entry, it doesn’t seem to offer, e.g., a single call to retrieve all items from a wishlist. However, when I reviewed my use of the earlier PyAmazon, in terms of raw interaction with the RESTful API and its XML output, I remembered how simple that interaction was. It’s just as simple in the new Amazon API, just slightly different. Encapsulating what I needed to do required only a few lines of code.

Generalizing that encapsulation is much harder. And when you have to repeat that hard work for many different languages, and for many different APIs, the inevitable result is that these per-language API wrappers tend to lag.

That’s one reason I’m looking forward to services built on Astoria (ADO.NET Data Services), or an equivalent normalization layer. I think it can substantially narrow the gap between RESTful APIs and the convenience wrappers we enjoy in various programming languages.

This week on ITConversations I have a two-part interview with Ward Cunningham. In part one, we explore his implementation of Brian Marick’s visible workings idea, which combines software testing with business process transparency. This is one of those transformative ideas that will not, at first, seem interesting and important to most people. And maybe it never will. But then again, Ward has a track record. The wiki idea didn’t at first seem interesting and important to most people either, and look what’s happened there. So, you never know. Maybe in 2020 we’ll notice that business software is a lot more reliable and understandable than it used to be, and we’ll look back and say: Ward did it again.

In part two, we discuss Ward’s new wiki-based venture, aboutus.org. It’s a directory that aims to become a sort of extended WHOIS database, where domain name owners — along with anyone who reads the websites attached to those domains — can collaboratively describe the people, companies, and organizations represented by those websites. I like the concept, but I wish it weren’t necessary to sign up in order to update http://aboutus.org/jonudell.net. Instead I’d prefer to describe myself on my own hosted lifebits service, wherever that might be, and then syndicate the information to aboutus.org and elsewhere.

I wasn’t going to post this humorous anecdote, but Mike Caulfield reminded me that it’s too funny not to share. After musing about a subscription service for running shoes, I walked into my local store, bought a new pair, and invited them to notify me in three months. Hilarity ensued.

He: We’re not really set up to do that.

Me: You could email me.

He: Yeah, but then we’d have to keep some kind of customer database on the computer.

Oh, right. Having a database of customers who’ve invited you to contact them on a regular basis … that’d suck, wouldn’t it?

Today I’m launching a new Microsoft-oriented interview series called Perspectives. The show will touch on a variety of topics including robotics, digital identity, e-science, and social software. I’ll be speaking mostly with passionate Microsoft innovators, and sometimes also with key partners from academia and industry.

The format is an audio podcast and a blog, where the blog provides a partial (but substantial) text transcription in order to make these conversations accessible to folks who don’t listen to podcasts, and also to expose them to the Net’s ecosystem of search, linking, and aggregation. Where appropriate, I’ll also use screencasts to show software in action.

Perspectives runs on the same publishing platform that supports Channel 10 (for enthusiasts), Channel 8 (for students), TechNet Edge (for IT pros), and VisitMIX (for Web designers and developers). (Channel 9, the original site, will migrate to this platform too.) Perspectives intersects with the interests of all these sites, but it doesn’t really belong in any of them, so we’ve created an independent home for it. Thanks to the EvNet team, especially Duncan Mackenzie, David Shadle, and Jeff Sandquist, for making that happen.

The first episode, with Henrik Nielsen and Tandy Trower, explores the Microsoft Robotics initiative. We discuss why robotics is — as futurist Paul Saffo believes — a Next Big Thing. And Henrik and Tandy explain how the concurrency and decentralized-services infrastructure that supports the robotics platform is broadly relevant in an era of loosely-coupled services.

On the Ann Arbor public library’s website you can find a wonderful example of how two local institutions — the library and the police department — can work together to curate an online exhibit. In 2002, history buff and police sergeant Michael Logghe self-published the lavishly illustrated True Crimes and the History of the Ann Arbor Police Department. The library worked with Logghe to produce an online version of the book. And when he visited the library to speak about the book and the online exhibit, his talk was recorded and made available for download (as video or audio-only) from the library’s podcast feed. Nicely done!

In my Remixing the library talk, I said that the two-way web paves the way for this kind of productive teamwork. It’s not a natural reflex, as Cassandra Targett points out:

It’s a shift from being passive recipients of the world’s knowledge to active participants in its creation, a shift that in many ways goes against some of the deepest core principles of what has become library science.

For a profession steeped in the idea that our role is to describe packaged knowledge and then help people find it (and play no role in how they use it once we point the way to it), the idea that we can not only modify some types of packages or even create substantially new ones is quite foreign still.

As I noted in my interview with Adrian Holovaty about EveryBlock, the curatorial collaboration among local governments, newspapers and libraries can encompass more than text, images, audio, and video. Those same institutions can work together to curate data about the operation of government (crime, taxes, maintenance), about social and civic life (event calendars), about the environment (weather, air quality), and more.

Although it’s starting to happen more in the scientific realm, I haven’t yet found a good example of that kind of data-oriented collaboration in the civic realm. But the teamwork shown by Ann Arbor’s police department and public library embodies the spirit that will make it happen.

John Lam asked how to excerpt fragments of Steve Ballmer’s keynote, and the principle of keystroke conservation requires me to answer here. The VisitMIX page for the keynote lists three streams. The links point to .asx files, which are wrappers around references to media files or streams. In this case, the references point to streams, which means that you can excerpt fragments by specifying the starttime and duration parameters.

Here’s the medium-bandwidth .asx file into which I’ve inserted starttime and duration parameters to create a fragment that points to a question and answer about HealthVault.

<asx version="3.0">
  <title>mix08: steve ballmer</title>
  <entry>
    <title>mix08: steve ballmer on healthvault</title>
    <starttime value="52:50.0"/>
    <duration value="1:45"/>
    <copyright>copyright 2008. all rights reserved.</copyright>
    <ref href="mms://istreampl.wmod.llnwd.net/a269/o2/microsoft/300_microsoft_mix_080306.wmv" />
  </entry>
</asx>

I’ve posted the file at http://channel9.msdn.com/media/ballmer-keynote-healthvault.asx. It should play in Windows Media Player, and also in VLC on the Mac or Linux, though I can’t check those at the moment.
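If you wanted to cut several excerpts this way, a few lines of Python can generate the wrapper. This is a minimal sketch that emits the same elements used in the file above (title, entry, starttime, duration, ref). The function name make_asx_fragment is my own, and the time strings are passed through untouched, so it is up to you to supply values the player accepts.

```python
import xml.etree.ElementTree as ET


def make_asx_fragment(title, stream_url, start, duration):
    """Build an .asx document that plays `duration` of a stream from `start`."""
    asx = ET.Element("asx", version="3.0")
    ET.SubElement(asx, "title").text = title
    entry = ET.SubElement(asx, "entry")
    ET.SubElement(entry, "title").text = title
    ET.SubElement(entry, "starttime", value=start)    # offset into the stream
    ET.SubElement(entry, "duration", value=duration)  # length of the excerpt
    ET.SubElement(entry, "ref", href=stream_url)      # the stream itself
    return ET.tostring(asx, encoding="unicode")


doc = make_asx_fragment(
    "mix08: steve ballmer on healthvault",
    "mms://istreampl.wmod.llnwd.net/a269/o2/microsoft/300_microsoft_mix_080306.wmv",
    start="52:50.0",
    duration="1:45",
)
```

Building the document with an XML library instead of string concatenation has a side benefit: the attribute quotes always come out as plain straight quotes, which is exactly what tends to get mangled when this kind of markup passes through a blog editor.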

In general, launching appropriate media players from a web page is a complex process. I’m hoping and expecting that Silverlight, over time, will simplify it, and help make rich media more granularly linkable.

In Montreal this Friday, McGill professor Darin Barney will be giving a version of his talk on citizenship and technology. Here’s an excerpt:

Each of the telegraph, telephone, radio and television was accompanied by its own heroic rhetoric of democratic transformation and reinvigorated civic engagement. None have delivered fully on this promise, but each has been crucial for the maintenance of a system of political and economic power in which most people are systematically distanced from the practice of citizenship most of the time. For the most part, these technologies have been means of anything but citizenship: spectacular entertainment; docile recreation; habituation to the rhythms of capitalist production and consumption; cultural normalization. The internet, as a radically decentralized medium whose capacity for publication and circulation far surpasses that of its broadcast predecessors, has certainly provided the means by which politically-engaged citizens can access and produce politically-charged information that would never have seen the light of day under the regime of the television and newspaper. This information can be an important resource for political judgment. But the Internet also surpasses its predecessors as an integrated medium of enrolment in the depoliticized economy and culture of consumer capitalism. This is why we should be wary of equating more and better access to information and communication technology with enhanced citizenship.

One Montreal resident deeply influenced by Barney’s critique of the Internet as an enabler of citizenship is Michael Lenczner, whom I interviewed for this week’s ITConversations show. Mike is a co-founder of Île Sans Fil, Montreal’s community wireless network. With over 150 access points and nearly 60,000 users, the project is a huge success, all the more so given that municipal wi-fi projects in other cities have failed to materialize. And yet, Mike questions the value of what’s been accomplished. The project’s goal was not merely to light up hotspots in downtown Montreal, but to enhance the “sociality” of the city and elicit more and better civic engagement. He doubts these goals have been achieved, and asks himself hard questions about how technology can be deployed to these ends.

When I met Mike recently in Montreal, I said: “It amazes me that you’re asking yourself these questions.” He replied: “It amazes me that others don’t.”
