Talking with Martin Hepp about solving the paradox of choice

In his luminous essay Information obesity, Ned Gulley illustrates the paradox of choice:

I’m reading about the Mohawk Trail, where the Cold River crashes noisily down the granitic glacier-fractured hillside. Where whispering understory birches are sheltered by towering firs. Now my mouth is watering. I have to go. I am referred to ReserveAmerica, a well-built web site that manages thousands of parks nationwide, and — DAMN! Mohawk Trail State Forest is booked solid. I start researching other nearby campgrounds, and now I’m sucked into the game. Unfortunately, ReserveAmerica lets you pick your campsite from an interactive map, and my book tells you which sites are the very best at each campground. Just when you start to salivate about the perfect spot, your dream is dashed by some early bird camper who’s beaten you to the reservation. You can cycle through this process for hours.

I borrow the phrase paradox of choice from Barry Schwartz, who argues in a compelling TED talk that as we broaden our options in all areas, we ratchet up our expectations about how good those options will be. The result is disappointment.

Less is more — except when it isn’t. My counterexample is a recent quest of mine for a particular kind of double-stick tape I needed for an interior storm window project. Key criteria included width (roughly 5/8″) and type of adhesion (plastic to wood). Web search yielded a bewildering array of choices, from various sources, but no way to filter by my criteria. This isn’t some idle consumer whim. I’m trying to save energy in the most effective way I can. I want to see as many qualifying choices as possible. But I can’t.

In Restructuring expert attention to revive the lost art of personal customer service I described one great solution to this problem: Kevin, the resident expert at FindTape.com, with whom I discussed SCF-01, DC-4420LB, and eventually settled on 3M-4905.

When there’s a Kevin available, he’ll be my first choice. But there won’t always be a Kevin. The answer in that case is not to artificially constrain my choices. That already happens because web search doesn’t enable me to state my criteria. Instead I want to search more effectively. To do that — as noted by several comments on Barry Schwartz’s TED video — we need to overcome filter failure.

This week’s Innovators show, with Martin Hepp, explores how we can create better filters. It’s a follow-on to an earlier show with Kingsley Idehen on the topics of RDFa, the GoodRelations ontology, and the idea that we can become the masters of our own search indexes.

The conversation mainly revolves around how to express an offer for goods or services by means of RDFa snippets that use the GoodRelations e-commerce vocabulary, that are generated by a form-based tool, and that rely on the web’s venerable traditions of view source and copy/paste.

But the same vocabulary used to describe offers can also express needs. And here Martin makes a really good observation about the current architecture of web search:

You can only search synchronously. You can’t ask a question and say, ‘Work on this for two weeks, improve your results in the background, and then come back with the best answer.’ But think about the potential if we can increase the amount of computational time for returning results. Currently there is only 400 milliseconds, because this is the average patience of web users. But if you can express what you’re looking for, and save it with a name, then the search engine will have two weeks to produce a good list of results.

I was also intrigued by Martin’s comments on intermediaries and affiliates. In his view, a commerce site like Amazon is not the only possible source of filter-enhancing metadata. Affiliates can play too. A travel service, for example, might supply search engines with enhanced views of Amazon relative to certain places and certain areas of expertise.

The paradox of choice is real, and in many cases we may indeed be happier with less. But when we really need or want more options, we shouldn’t have to prematurely foreclose them. Search could be far more effective, and an approach like the one Martin envisions is the way to make it so.

SQL Azure “Vidalia”: Practical translucency

Ever since Peter Wayner introduced me to the idea of a translucent database I’ve been thinking about the implications of this powerful idea. In a nutshell, the data in a translucent database service is opaque to the operator of the service, and visible only to sets of users who establish trust relationships. My 2002 review of Peter’s book summarizes his babysitter example:

Imagine a web service that enables parents to find available babysitters. A compromise would disastrously reveal vulnerable households where parents are absent and teenage girls are present. Translucency, in this case, means encrypting sensitive data (identities of parents, identities and schedules of babysitters) so that it is hidden even from the database itself, while yet enabling the two parties (parents, babysitters) to rendezvous.

Fast forwarding to 2009, here’s a current headline from InfoWorld: Microsoft adds access controls for SQL Azure online database. The article doesn’t say so, but this is database translucency in action.

The 2009 version of the babysitter example appears at 37:45 in this PDC session, where Dave Campbell and Rahul Auradkar discuss, and also show, a translucent pharmaceutical reagent marketplace. Dave Campbell spells out the scenario:

Pharma companies see reagents as being pre-competitive. They don’t compete at that level, and they’re willing to sell these reagents to one another, as long as nobody can see what’s being bought and sold. That’s the controlled trust we need to set up.

The trick is accomplished by means of encryption and careful separation of concerns. Access policies are isolated from data storage, capable of federation, and auditable by trusted intermediaries.
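As a minimal sketch of the rendezvous idea (my own illustration, not the actual Azure mechanism): two parties who share a secret out of band can derive matching opaque tokens, so a service can connect them without ever learning who they are.

```python
import hashlib

def rendezvous_token(shared_secret: str, context: str) -> str:
    # Both parties derive the same opaque token from a secret the
    # service never sees; the service only stores and matches tokens.
    return hashlib.sha256(f"{shared_secret}:{context}".encode()).hexdigest()

# Hypothetical: parent and babysitter agreed on a secret out of band.
parent = rendezvous_token("maple-street-42", "2009-11-20")
sitter = rendezvous_token("maple-street-42", "2009-11-20")

print(parent == sitter)  # the parties rendezvous; the token reveals nothing
```

The point of the sketch is the separation of concerns: the matching logic runs against opaque tokens, so the operator's view of the data is useless to an attacker.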

This is exciting new territory. Historically, we’ve always assumed that the operator of an online information system has complete access to the data in that service. Translucency turns that assumption on its head, and leads to entirely new service design patterns. To implement those patterns requires more than just a database in the cloud. You also need a coordinated suite of supporting services for identity, access control, auditing, and more. Azure, as it becomes one provider of such services, will help make translucency a practical reality.

OData is grease to cut data friction

Back in 2007 I talked with Pablo Castro about Astoria, which I described as a way of making data readable and writeable by means of a RESTful interface. The technology has continued to move forward, and I’m now a heavy user of one of its implementations: the Azure table store. Yesterday at PDC we announced the proposed standardization of this approach as OData, which InfoQ nicely summarizes here.

I’ll leave detailed analysis of the proposal, and the inevitable comparisons to Google’s GData, to others who are better qualified. Nowadays I’m mainly a developer building a web service, and from that perspective it’s very clear that wide adoption of something like “ODBC for the cloud” is needed. We have no shortage of APIs, all of which yield XML and/or JSON data, but you have to overcome friction to compose with these APIs.

For example, the elmcity service merges event information from sets of iCalendar feeds and also from three different sources — Eventful, Upcoming, and (recently added) Eventbrite. In each of those three cases, I’ve had to create slightly different versions of the same algorithm:

  • Query for future events
  • Retrieve the count of matching events
  • Page through the matching events
  • Map events into a common data model

Each service uses a slightly different syntax to query for future events. And each reports the count of matching events differently: page_count vs. total_results vs. resultcount. OData would normalize the queries. And because the spec says:

The count value included in the result MUST be enclosed in an <m:count> element

it would also normalize the counting of results.
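Here is a sketch of the kind of per-service shim this forces today. The mapping of field names to services is an illustrative guess, not the documented APIs:

```python
# Each events API reports the count of matching events under its own
# field name; which name belongs to which service is assumed here.
COUNT_FIELDS = {
    "eventful": "page_count",
    "upcoming": "resultcount",
    "eventbrite": "total_results",
}

def matching_count(service: str, response: dict) -> int:
    # One adapter per service; a normalized <m:count> would replace
    # this entire mapping with a single code path.
    return int(response[COUNT_FIELDS[service]])

print(matching_count("eventful", {"page_count": "7"}))
```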

Open data on the web has enormous potential value, but if we have to overcome too much data friction in order to combine it and make sense of it, we will often fail to realize that value. ODBC in its era was a terrific lubricant. I’m hoping that OData, widely implemented in software, services, and mashup environments like the just-announced Dallas, will be another.

Talking with Gavin Bell about Building Social Web Applications

My guest for this week’s Innovators show is Gavin Bell, author of Building Social Web Applications. A lot has changed in the decade since I wrote my own book on this topic. One constant, as we discuss in the podcast, is that we still reach for special terminology like computer-supported cooperative work or groupware or social software. That won’t be true forever. Sooner or later we’ll take for granted that all networked information systems augment us collectively as well as individually. Until then, though, it remains appropriate to speak of social web applications as opposed to simply web applications.

Whatever we call this kind of software, it’s a challenge in this era of tech churn to write about it at book length. This effort succeeds by exploring patterns and principles that will endure no matter which technologies prevail. Yes, it’s an O’Reilly technical book, with the traditional animal picture on the cover — in this case, of spiders. But it’s not code-heavy. Gavin Bell aptly compares it to the polar bear book by Peter Morville and Louis Rosenfeld. Both books draw on a wealth of experience gleaned from building and evolving web applications.

For designers, developers, project managers, and online community managers, Building Social Web Applications addresses questions like:

What are the social objects at the core of our application?

How can relationships form around such objects?

Which search, navigation, access, and notification patterns can best support those relationships?

How do we evolve our application as our users gain experience with these object-mediated relationships?

We’ll be thinking about these kinds of questions from now on. Gavin Bell’s excellent book provides a framework in which to do that thinking.

Where is the money going?

Over the weekend I was poking around in the recipient-reported data at recovery.gov. I filtered the New Hampshire spreadsheet down to items for my town, Keene, and was a bit surprised to find no descriptions in many cases. Here’s the breakdown:

# of awards                               25
# of awards with descriptions              5    20%
# of awards without descriptions          20    80%
$ of awards                       10,940,770
$ of awards with descriptions      1,260,719    12%
$ of awards without descriptions   9,680,053    88%

In this case, the half-dozen largest awards aren’t described:

award                    amount      funding agency                               recipient                            description
EE00161                  2,601,788                                                Southwestern Community Services Inc
S394A090030              1,471,540                                                Keene School District
AIP #3-33-SBGP-06-2009   1,298,500                                                City of Keene
2W-33000209-0            1,129,608                                                City of Keene
2F-96102301-0              666,379                                                City of Keene
2F-96102301-0              655,395                                                City of Keene
0901NHCOS2                 600,930                                                Southwestern Community Services Inc
2009RKWX0608               459,850   Department of Justice                        KEENE, CITY OF                       The COPS Hiring Recovery Program (CHRP) provides funding directly to law enforcement agencies to hire and/or rehire career law enforcement officers in an effort to create and preserve jobs, and to increase their community policing capacity and crime prevention efforts.
NH36S01050109              413,394   Department of Housing and Urban Development  KEENE HOUSING AUTHORITY              ARRA Capital Fund Grant. Replacement of roofing, siding, and repair of exterior storage sheds on 29 public housing units at a family complex

That got me wondering: Where does the money go? So I built a little app that explores ARRA awards for any city or town: http://elmcity.cloudapp.net/arra. For most places, it seems, the ratio of awards with descriptions to awards without isn’t quite so bad. In the case of Philadelphia, for example, “only” 27% of the dollars awarded ($280 million!) are not described.

But even when the description field is filled in, how much does that tell us about what’s actually being done with the money? We can’t expect to find that information in a spreadsheet at recovery.gov. The knowledge is held collectively by the many people who are involved in the projects funded by these awards.

If we want to materialize a view of that collective knowledge, the ARRA data provides a useful starting point. Every award is identified by an award number. These are, effectively, webscale identifiers — that is, more-or-less unique tags we could use to collate newspaper articles, blog entries, tweets, or any other online chatter about awards.
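Here is a toy sketch of that collation, with invented snippets standing in for real chatter:

```python
from collections import defaultdict

# Invented snippets standing in for newspaper articles, blog posts, tweets.
snippets = [
    "Keene School District begins work funded by S394A090030",
    "Runway project moves ahead under AIP #3-33-SBGP-06-2009",
    "Update on S394A090030: contractors selected",
]

award_numbers = ["EE00161", "S394A090030", "AIP #3-33-SBGP-06-2009"]

by_award = defaultdict(list)
for text in snippets:
    for award in award_numbers:
        if award in text:  # the award number acts as a webscale tag
            by_award[award].append(text)

print(len(by_award["S394A090030"]))
```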

To promote this idea, the app reports award numbers as search strings. In Keene, for example, the school district got an award for $1.47 million. The award number is S394A090030. If you search for that you’ll find nothing but a link back to a recovery.gov page entitled Where is the Money Going?

Recovery.gov can’t bootstrap itself out of this circular trap. But if we use the tags that it has helpfully provided, we might be able to find out a lot more about where the money is going.

Talking with Marco Barulli about zero-knowledge online password management

A couple of years ago I was enamored with a clever password manager that pointed the way toward an ideal solution. It was really just a bookmarklet — a small chunk of JavaScript code — that used a simple method to produce a unique and strong password for the website you were visiting. The method was to combine a passphrase that you could remember with the domain name of the site, using a one-way cryptographic hash, in order to produce a strong password that would be unique to the site — and that you’d otherwise never be able to remember.

It wasn’t perfect. Sometimes the passwords it generated wouldn’t meet a site’s requirements. And sometimes the login domain name would vary, which broke the scheme. But it introduced me to two powerful — and related — ideas. JavaScript could turn your browser into a programmable cryptographic engine. And that engine could be used to implement protocols that relied on cryptography but transmitted no secrets over the wire.
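The core of the scheme can be sketched in a few lines. The function name, the separator, and the truncation length here are my own choices, not the original bookmarklet's:

```python
import base64
import hashlib

def site_password(passphrase: str, domain: str) -> str:
    # One-way hash of a memorable passphrase plus the site's domain
    # yields a strong, site-unique password that is never stored anywhere.
    digest = hashlib.sha256(f"{passphrase}:{domain}".encode()).digest()
    return base64.b64encode(digest).decode()[:16]

# The same passphrase produces unrelated passwords on different sites.
print(site_password("correct horse battery", "example.com"))
print(site_password("correct horse battery", "example.org"))
```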

To my way of thinking, that’s a killer combination. For years I’ve been using Bruce Schneier’s Password Safe, a Windows program that keeps my passwords in an encrypted store. There are many such programs, another example being 1Password for the Mac. This kind of app lives on your computer and talks to a local data store. That means it’s cumbersome to move the app and your data from one of your machines to another. And you can’t use it online, say from a public machine at the library or a friend’s computer.

Imagine a web application that would encrypt your credentials and store them in the cloud. It would deliver that encrypted store to any browser you happen to be using, along with a JavaScript engine that could decrypt it, display your credentials, and even use them to automatically log you onto any of your password-protected services. You’d trust it because its cryptographic code would be available for security pros to validate.

I’ve wanted this solution for a long time. Now I have it: Clipperz. My guest for this week’s Innovators show is Marco Barulli, founder and CEO of Clipperz, which he describes as a zero-knowledge web application. What Clipperz has zero knowledge of is you and your data. It just connects you with your data, on terms that you control, in a way that reminds me of Peter Wayner’s concept of translucent databases.

Clipperz is immediately useful to all of us who struggle to manage our growing collections of online credentials. But it’s also a great example of an important design principle. We reflexively build services that identify users and retain all kinds of information about them. Often we need such knowledge, but it’s a liability for the operators of services that store it, and a risk for users of those services. If it’s feasible not to know, we can embrace that constraint and achieve powerful effects.

A literary appreciation of the Olson/Zoneinfo/tz database

You will probably never need to know about the Olson database, also known as the Zoneinfo or tz database. And were it not for my elmcity project I never would have looked into it. I knew roughly that this bedrock database is a compendium of definitions of the world’s timezones, plus rules for daylight saving time (DST) transitions, used by many operating systems and programming languages.

I presumed that it was written Unix-style, in some kind of plain-text format, and that’s true. Here, for example, are top-level DST rules for the United States since 1918:

# Rule NAME FROM  TO    IN   ON         AT      SAVE    LETTER/S
Rule   US   1918  1919  Mar  lastSun    2:00    1:00    D
Rule   US   1918  1919  Oct  lastSun    2:00    0       S
Rule   US   1942  only  Feb  9          2:00    1:00    W # War
Rule   US   1945  only  Aug  14         23:00u  1:00    P # Peace
Rule   US   1945  only  Sep  30         2:00    0       S
Rule   US   1967  2006  Oct  lastSun    2:00    0       S
Rule   US   1967  1973  Apr  lastSun    2:00    1:00    D
Rule   US   1974  only  Jan  6          2:00    1:00    D
Rule   US   1975  only  Feb  23         2:00    1:00    D
Rule   US   1976  1986  Apr  lastSun    2:00    1:00    D
Rule   US   1987  2006  Apr  Sun>=1     2:00    1:00    D
Rule   US   2007  max   Mar  Sun>=8     2:00    1:00    D
Rule   US   2007  max   Nov  Sun>=1     2:00    0       S

What I didn’t appreciate, until I finally unzipped and untarred a copy of ftp://elsie.nci.nih.gov/pub/tzdata2009o.tar.gz, is the historical scholarship scribbled in the margins of this remarkable database, or document, or hybrid of the two.

You can see a glimpse of that scholarship in the above example. The most recent two rules define the latest (2007) change to US daylight savings. The spring forward rule says: “On the second Sunday in March, at 2AM, save one hour, and use D to change EST to EDT.” Likewise, on the fast-approaching first Sunday in November, spend one hour and go back to EST.
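The “Sun>=8” notation in those rules means “the first Sunday on or after the 8th,” which is easy to compute. This is a sketch of the idea, not the tz reference code:

```python
import datetime

def first_sunday_on_or_after(year: int, month: int, day: int) -> datetime.date:
    # Implements the tz "Sun>=N" notation: first Sunday on or after day N.
    d = datetime.date(year, month, day)
    return d + datetime.timedelta(days=(6 - d.weekday()) % 7)  # Monday=0 .. Sunday=6

# US rules since 2007: spring forward "Mar Sun>=8", fall back "Nov Sun>=1"
print(first_sunday_on_or_after(2009, 3, 8))   # second Sunday in March 2009
print(first_sunday_on_or_after(2009, 11, 1))  # first Sunday in November 2009
```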

But look at the rules for Feb 9 1942 and Aug 14 1945. The letters are W and P instead of D and S. And the comments tell us that during that period there were timezones like Eastern War Time (EWT) and Eastern Peace Time (EPT). Arthur David Olson elaborates:

From Arthur David Olson (2000-09-25):

Last night I heard part of a rebroadcast of a 1945 Arch Oboler radio drama. In the introduction, Oboler spoke of “Eastern Peace Time.” An AltaVista search turned up: “When the time is announced over the radio now, it is ‘Eastern Peace Time’ instead of the old familiar ‘Eastern War Time.’ Peace is wonderful.”


Most of this Talmudic scholarship comes from founding contributor Arthur David Olson and editor Paul Eggert, both of whose Wikipedia pages, although referenced from the Zoneinfo page, strangely do not exist.

But the Olson/Eggert commentary is also interspersed with many contributions, like this one about the Mount Washington Observatory.

From Dave Cantor (2004-11-02)

Early this summer I had the occasion to visit the Mount Washington Observatory weather station atop (of course!) Mount Washington [, NH]…. One of the staff members said that the station was on Eastern Standard Time and didn’t change their clocks for Daylight Saving … so that their reports will always have times which are 5 hours behind UTC.


Since Mount Washington has a climate all its own, I guess it makes sense for it to have its own time as well.

Here’s a glimpse of Alaska’s timezone history:

From Paul Eggert (2001-05-30):

Howse writes that Alaska switched from the Julian to the Gregorian calendar, and from east-of-GMT to west-of-GMT days, when the US bought it from Russia. This was on 1867-10-18, a Friday; the previous day was 1867-10-06 Julian, also a Friday. Include only the time zone part of this transition, ignoring the switch from Julian to Gregorian, since we can’t represent the Julian calendar.

As far as we know, none of the exact locations mentioned below were permanently inhabited in 1867 by anyone using either calendar. (Yakutat was colonized by the Russians in 1799, but the settlement was destroyed in 1805 by a Yakutat-kon war party.) However, there were nearby inhabitants in some cases and for our purposes perhaps it’s best to simply use the official transition.


You have to have a sense of humor about this stuff, and Paul Eggert does:

From Paul Eggert (1999-03-31):

Shanks writes that Michigan started using standard time on 1885-09-18, but Howse writes (pp 124-125, referring to Popular Astronomy, 1901-01) that Detroit kept

local time until 1900 when the City Council decreed that clocks should be put back twenty-eight minutes to Central Standard Time. Half the city obeyed, half refused. After considerable debate, the decision was rescinded and the city reverted to Sun time. A derisive offer to erect a sundial in front of the city hall was referred to the Committee on Sewers. Then, in 1905, Central time was adopted by city vote.


This story is too entertaining to be false, so go with Howse over Shanks.


The document is chock full of these sorts of you-can’t-make-this-stuff-up tales:

From Paul Eggert (2001-03-06), following a tip by Markus Kuhn:

Pam Belluck reported in the New York Times (2001-01-31) that the Indiana Legislature is considering a bill to adopt DST statewide. Her article mentioned Vevay, whose post office observes a different time zone from Danner’s Hardware across the street.


I love this one about the cranky Portuguese prime minister:

Martin Bruckmann (1996-02-29) reports via Peter Ilieve

that Portugal is reverting to 0:00 by not moving its clocks this spring.
The new Prime Minister was fed up with getting up in the dark in the winter.


Of course Gaza could hardly fail to exhibit weirdness:

From Ephraim Silverberg (1997-03-04, 1998-03-16, 1998-12-28, 2000-01-17 and 2000-07-25):

According to the Office of the Secretary General of the Ministry of Interior, there is NO set rule for Daylight-Savings/Standard time changes. One thing is entrenched in law, however: that there must be at least 150 days of daylight savings time annually.


The rule names for this zone are poignant too:

# Zone  NAME            GMTOFF  RULES   FORMAT  [UNTIL]
Zone    Asia/Gaza       2:17:52 -       LMT     1900 Oct
                        2:00    Zion    EET     1948 May 15
                        2:00 EgyptAsia  EE%sT   1967 Jun  5
                        2:00    Zion    I%sT    1996
                        2:00    Jordan  EE%sT   1999
                        2:00 Palestine  EE%sT

There’s also some wonderful commentary in the various software libraries that embody the Olson database. Here’s Stuart Bishop on why pytz, the Python implementation, supports almost all of the Olson timezones:

As Saudi Arabia gave up trying to cope with their timezone definition, I see no reason to complicate my code further to cope with them. (I understand the intention was to set sunset to 0:00 local time, the start of the Islamic day. In the best case caused the DST offset to change daily and worst case caused the DST offset to change each instant depending on how you interpreted the ruling.)


It’s all deliciously absurd. And according to Paul Eggert, Ben Franklin is having the last laugh:

From Paul Eggert (2001-03-06):

Daylight Saving Time was first suggested as a joke by Benjamin Franklin in his whimsical essay “An Economical Project for Diminishing the Cost of Light” published in the Journal de Paris (1784-04-26). Not everyone is happy with the results.


So is Olson/Zoneinfo/tz a database or a document? Clearly both. And its synthesis of the two modes is, I would argue, a nice example of literate programming.

More Python and C# idioms: Finding the difference between two lists

Recently I’ve posted two examples[1, 2] of Python idioms alongside corresponding C# idioms. It always intrigues me to look at the same concept through different lenses, and it seems to intrigue others as well, so here’s a third installment.

Today’s example comes from a real scenario. I’ve recently added a feature to the elmcity service that enables curators to control their hubs by sending Twitter direct messages to the service. One method, GetDirectMessagesFromTwitter, calls the Twitter API and returns a list of direct messages sent to the elmcity service. Another method, GetDirectMessagesFromAzure, calls the Azure table storage API and returns a list of direct messages stored there. The difference between the two lists — if any — represents new messages to be processed.

Here’s one take on Python and C# idioms for finding the difference between two lists:

Python:

  fetched_messages = GetDirectMessagesFromTwitter()
  stored_messages = GetDirectMessagesFromAzure()
  diff = set(fetched_messages) - set(stored_messages)
  return list(diff)

C#:

  var fetched_messages = GetDirectMessagesFromTwitter();
  var stored_messages = GetDirectMessagesFromAzure();
  var diff = fetched_messages.Except(stored_messages);
  return diff.ToList();

I can’t decide which one I prefer. Python’s set arithmetic is mathematically pure. But C#’s noun-verb syntax is appealing too. Which do you prefer? And why?


PS: The Python example above is slightly concocted. It won’t work as shown here because I’m modeling Twitter direct messages as .NET objects. IronPython can use those objects, but the set subtraction fails because the objects returned from the two API calls aren’t directly comparable.

A real working example would add something like this:

fetched_message_sigs = [x.text+x.datetime for x in fetched_messages]
stored_message_sigs = [x.text+x.datetime for x in stored_messages]
diff = list(set(fetched_message_sigs) - set(stored_message_sigs))

But that’s a detail that would only obscure the side-by-side comparison I’m making here.
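For completeness, here is a self-contained version of that signature-based workaround, with simple stand-in message objects replacing the .NET ones:

```python
from collections import namedtuple

# Stand-ins for the .NET direct-message objects described above.
Msg = namedtuple("Msg", ["text", "datetime"])

fetched_messages = [Msg("start", "2009-11-01"), Msg("stop", "2009-11-02")]
stored_messages = [Msg("start", "2009-11-01")]

# Build comparable string signatures, then take the set difference.
fetched_sigs = [m.text + m.datetime for m in fetched_messages]
stored_sigs = [m.text + m.datetime for m in stored_messages]
new_sigs = list(set(fetched_sigs) - set(stored_sigs))

print(new_sigs)
```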

To: elmcity, From: @curator, Message: start

Because I am lazy, curious, and evangelical, the elmcity service works in an unusual way. Anything that I can delegate to other services I do. So when curators add feeds to hubs, or modify the behavior of hubs, they do it by bookmarking and tagging URLs at delicious.com. It would be foolish to only keep that registry and configuration data in delicious, so I don’t; I persist it to Azure tables. But for now, I’m delegating the data entry interface to delicious.

It’s a lazy approach, in the good sense of lazy. I don’t want to build my own data entry system unless I can add important value, and in this case I can’t.

I’m also curious to see how far this approach can take us. As the project has evolved, so has the tag vocabulary spoken between curators and the service. It’s an easy and natural process, and I don’t see any roadblocks ahead.

Finally, I’m evangelizing this way of doing things because I continue to think that more people should appreciate it.

In this scenario I’ve delegated something else to delicious: authentication. My service doesn’t have its own user accounts. Instead, as the administrator of the service, I tell it to trust a specific set of delicious accounts. When one of those accounts bookmarks an iCalendar URL, and tags it in a particular way, the service regards that as an authenticated request to add the feed to that hub’s registry.

Other requests that curators can make include:

Make the radius for my hub 5 miles.

Make my timezone Arizona.

Get my CSS file from this URL.
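Requests like these travel as delicious tags in a key=value style. The specific keys below are illustrative guesses, except twitter=, which the service actually uses:

```python
# Illustrative curator tags; only the twitter= key is attested in this post.
tags = ["radius=5", "tz=arizona", "css=http://example.org/hub.css", "twitter=judell"]

# Split each tag on the first "=" to recover the hub's settings.
settings = dict(tag.split("=", 1) for tag in tags if "=" in tag)

print(settings["radius"], settings["twitter"])
```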

But here’s one that curators have wanted to make and couldn’t:

I just added a feed or changed a configuration option. Please reprocess my hub ASAP.

We could represent this message with a tag. Or we could use the rudimentary messaging system in delicious. But these approaches seemed awkward, and I rejected them.

Well, why not Twitter? True, it means that curators who want to send messages to the service will now need accounts in two places. But if they don’t already have accounts on both delicious and Twitter, they can create them. And those accounts will serve them in a variety of ways, unlike a single-purpose account on elmcity.

So, it’s done. As the curator for Keene, I’ve added the tag twitter=judell to the delicious account that controls the Keene hub. As the elmcity service periodically scans its designated set of delicious accounts, it follows any Twitter handle it isn’t already following. Those Twitter accounts can then send direct messages to the Twitter account of the elmcity service.

For now there’s only one thing a curator can say to the service in a direct message — “start” — which means “please reprocess my hub ASAP.” But I’m sure the control vocabulary will evolve. And of course the service can use the channel to send notifications back to curators.

Twitter is famously unreliable, but that should be OK for my purposes. We’re not controlling the space shuttle. If a message doesn’t get through to the service on the first or second try, it’ll get through eventually, and that’ll be good enough.

Someday I may have to build a data entry system and an accounts system. Then again, maybe not. Meanwhile I’m going to keep exploring this lightweight approach. It’s effective and, not coincidentally, it’s fun.

Restructuring expert attention to revive the lost art of personal customer service

Instead of mourning the lost art of personal customer service, I would rather celebrate examples that show it’s still possible. Yesterday I found two gems.

First, Southwest Airlines. I had booked a round-trip flight and then needed to change to one-way. You can’t do that online. So I clenched my jaw, called customer service, and prepared for the long wait.

Instead, this:

IVR: “Would you like us to call you back in about 20 minutes?”

Me: “Why…yes! Beep, beep, beep, beep, beep, beep, beep, #.”

My jaw relaxed.

Twenty or so minutes later, an agent called back and we made the change. Now the unclenched jaw morphed into a smile.

Second, FindTape.com. I’m making interior storm windows and I need double-stick tape for the project. Which, sure, you can buy online. But the smorgasbord of choices is paralyzing. I wasted a half-hour trying to figure out which product would best suit my unusual application and made no progress whatsoever.

Then, at FindTape.com, I read this:

If you have a specific question related to which tape would work best in your application please fill out and submit the following fields so that we can have an appropriate representative get back in contact with you.

A fellow named Kevin wrote back, we’ve been discussing my options, and now I’m ready to buy.

Both examples remind me of Michael Nielsen’s luminous phrase: the restructuring of expert attention. He coined it to define a new era of scientific collaboration, but it applies more broadly.

We’ve been told that companies can’t afford to focus expert attention on customers. The truth, of course, is that they can’t afford not to.

For a generation and more we’ve driven a wedge between people who have expertise with products and services and people who need that expertise. How’s that working for you? Me neither.

It’s true that expert attention is a scarce resource. But we’re living through a Cambrian explosion of awareness networks and communication modes. Used adroitly, they can optimize the allocation of that scarce resource. Which is a fancy way of saying: Maybe personal customer service isn’t a lost art after all.

Allman Brothers, Oct 14: Huntington or Nashville? A parable about syndication and provenance.

Yesterday Bill Rawlinson, the elmcity curator for Huntington, WV, noticed something odd about an event that showed up on Eventful.com:

Here’s the example: http://eventful.com/huntington/events/allman-brothers-/E0-001-020736056-0. It appears the Allman Brothers were in concert today, but I’m pretty sure they weren’t.

I’m pretty sure they weren’t either. At AllmanBrothersBand.com it says they were in Nashville on October 14. But if that’s true, Eventful isn’t the only site that got the date wrong. So, apparently, did a number of event-gathering and ticket-selling sites. Here are a couple of examples I found.

In cases like these it’s hard to nail down the provenance of a “fact” such as Allman Brothers, Huntington WV, October 14 2009. There is clearly syndication going on, but who’s upstream and who’s downstream? How is the network of feeds interconnected? Which is the authoritative source?

I know what the answer to all these questions should be. The Allman Brothers themselves should be the authoritative source, and everyone else should syndicate from them.

If AllmanBrothersBand.com published its schedule as calendar data rather than as calendarish web pages, the organization could control the data. Was there originally a concert planned for Huntington on the 14th? I don’t know, but say for the sake of argument there was. The Allman Brothers calendarish web page cannot effectively propagate a change of plan.

An iCalendar feed, on the other hand, could. But calendarish web pages are almost never alternately available as machine-readable iCalendar data that can reliably syndicate.

Looking under the covers, I see that AllmanBrothersBand.com is a PostNuke site. Are there calendar modules for PostNuke that export iCalendar? None of the ones that I found seem to.

Why don’t more content management systems make event information available as useful data? Why do they instead advertise things like XHTML compliance and not-very-useful RSS feeds? Because, chicken-and-egg, nobody ever seems to expect an iCalendar feed.

If we can change that expectation, a nice chunk of the real-world semantic web will fall into place. And it won’t require RDFa or SPARQL or ontologies. Just good old RFC2445, right under our noses the whole time, if only we would open our eyes and look.
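To make the distinction concrete, here’s a minimal sketch in Python of the kind of machine-readable RFC2445 data a band’s site could publish alongside its calendarish pages. The UID, date, and venue below are hypothetical illustrations, and a real feed would add more properties, but even this skeleton is enough to syndicate reliably:

```python
# Build a minimal RFC 2445 iCalendar feed by hand; no libraries needed.
# The UID, date, and venue below are hypothetical illustrations.

def make_icalendar(events):
    """Render (uid, dtstart, summary, location) tuples as iCalendar text."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//example//events//EN"]
    for uid, dtstart, summary, location in events:
        lines += [
            "BEGIN:VEVENT",
            "UID:%s" % uid,
            "DTSTART:%s" % dtstart,   # e.g. 20091014T200000
            "SUMMARY:%s" % summary,
            "LOCATION:%s" % location,
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)

feed = make_icalendar([
    ("20091014-nashville@example.com", "20091014T200000",
     "Allman Brothers Band", "Nashville, TN"),
])
print(feed)
```

If the venue or date changes, the publisher edits one record and every downstream subscriber’s copy corrects itself on the next fetch — which is exactly the propagation a calendarish web page can’t guarantee.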

Talking with Daniel Debow about using Rypple to open the Johari Window

On this week’s Innovators show, with Daniel Debow of Rypple, I learned about a cognitive psychological tool called the Johari Window. Rypple focuses on the quadrant of the Johari Window at the intersection of “known to others” and “not known to self” — the so-called blind area. The company is dedicated to the proposition that if we can become more aware of what others know about us that we don’t, we can improve ourselves along various axes: personal, social, and — critically for Rypple’s business model — professional.

How do you gain that awareness? By asking questions like:

Am I giving sufficiently clear guidance?

or

Do I interrupt people too often?

You direct these questions to a set of people whose feedback you value. Rypple anonymizes their responses and, to the extent you buy into the service, provides a progressively capable framework within which to continue the dialogue. This is a great idea, and one of the very few appropriate uses for online anonymity that I can imagine.

Rypple, as a company, lives at the intersection of a couple of key trends. Social media, obviously, but also the services ecosystem. As we discuss in the podcast, corporate HR has historically been a monolith that expects 100% compliance with its systems. But people, as we know, differ emotionally and cognitively. We should be able to use a variety of methods to manage and evaluate people, and help them manage and evaluate themselves. Software delivered as a service is an enabler of that possibility.

Here’s a twist: A company won’t have access to the feedback that employees solicit using Rypple. Daniel Debow says that HR folks, well aware of mainstream social software, are ready to embrace this model. I hope he’s right.

His favorite recent story about Rypple goes like this:

At an HR conference I talked to the CEO of a company that uses Rypple. He’s excited about what we’re doing, but he said: “You have a real problem. Use of your system might make your system obsolete. We’ve been using it for a while now, and I’ve noticed that people are much more willing to give me feedback face-to-face, they’re willing to talk to me.”

Well that’s the furthest thing from a problem I can imagine. It’s like saying to Facebook, you’ve got a problem, people keep meeting on Facebook and then meeting up in person and creating real relationships offline.

Actually that would be a problem for Facebook. But Rypple isn’t about pageviews, it’s about helping people improve. Which seems like a great idea to me.

You can, by the way, use Rypple to solicit anonymized feedback not only from a chosen set of responders, but also from an open-ended set. So here’s my question:

How can I make my ideas more accessible and more actionable?

I’m asking a chosen set too, but if you can perceive my blind spot I’d love to know what you see there.

More visualization of Nobel Peace Prize winners in Freebase

To sharpen the point I made the other day about the eroding bias toward giving the Nobel Peace Prize to Americans and Europeans, here’s a comparison of the nationalities of winners before and after 1960.

[Charts: 1901-2009 Nobel Peace Prize winners by nationality, before 1960 and after 1960]

Here’s another point I forgot to mention. There are gaps in the timeline for the Nobel Peace Prize, because it wasn’t awarded in 1914-1918, 1923, 1924, 1928, 1932, 1939-1943, 1948, 1955-1956, 1966-1967 and 1972. The timeline shows those gaps concisely:

As in the earlier examples, you can do this with point-and-click filtering in Freebase, no query-writing required. Which is awesome.

Finally, Stefano Mazzocchi offers a clarification of a point that came up in our recent interview:

I made it sound like Freebase loaded directly IMDB data while what I should have specified is that we loaded the IMDB ‘identifiers’ along with our movie data.

Thanks Stefano. And, kudos to the Metaweb team!

Recovering forgotten methods of construction

After feasting on audio podcasts for years, I realized that I don’t always want somebody else’s voice in my head while running, biking, and hiking. So I went on an audio fast for a couple of months. But now I’m ready for more input, and I’m once again reminded how wonderful it is to be able to bring engaging minds with me on my outdoor excursions.

One of my companions on yesterday’s hike was John Ochsendorf, a historian and structural engineer who explores the relevance of ancient and sometimes forgotten construction methods, like Incan suspension bridges woven from grass. One of his passions is Guastavino tile vaulting, a system that was patented in 1885. Although widely used in many notable structures — including Grand Central Station — Ochsendorf says that some of these structures have been torn down and rebuilt conventionally because modern engineers no longer understand how the Guastavino system works, and cannot evaluate its integrity.

This theme of forgotten knowledge echoes something I heard in Amory Lovins’ epic MAP/Ming lecture series. He describes a large government building in Washington, DC, that was made of stone and cooled by a carefully-designed pattern of air flow. The cooling system wasn’t completely passive, though. You had to open and close windows in a particular sequence throughout the day. Now that building is cooled by hundreds of window-mounted air conditioners. I’m sure our modern expectation of extreme cooling is part of the reason why. But Lovins also says that air conditioning became necessary because people forgot how to operate the building.

I love the idea of recovering — and scientifically validating — forgotten knowledge. That’s what John Ochsendorf’s research group does. One of his students, Joe Dahmen, did a project called Rammed Earth — a long-term experiment to see if that ancient construction method could actually work in present-day New England. John Ochsendorf says:

Historical methods of construction that are very green, very local, may create beautiful low-energy architecture, but we’ve forgotten how to do them. So we have to rediscover them, and do testing to prove to clients and building owners that you can use these methods. And it’s a good example of MIT’s motto of mind and hand. We don’t like to just read about rammed earth walls, we like to get dirty and build them.

Very cool. I think the MacArthur Foundation invested wisely in this guy.

Visualizing Nobel Peace Prize winners in Freebase

When I watched Barack Obama accept the Nobel Peace Prize, I thought about how the world has changed since the inception of the prize, and how it will continue to change. Since the winners of the Prize are themselves a reflection of what’s changing, I thought I’d try using Freebase to visualize them over the century the Prize has existed.

What you can find out, with Freebase, depends on its coverage of the topics you’re asking about. So realize that what I’ll show here is possible because Nobel Peace Prize winners are a well-covered topic. Still, it’s wildly impressive.

The Nobel site tells us that 89 Nobel Peace Prizes have been awarded since 1901. I haven’t been able to reproduce that number in Freebase because there are multiple winners in a few years, and I haven’t found a way to group results by year. But for my purposes this related query is good enough:

That number, 100, isn’t as closely related to 89 as you might think. It’s less than it would otherwise be by the number of years in which no award was given, but more by the number of extra recipients in multiple-award years. Perhaps a Freebase guru can show us how to measure those uncertainties, but I’ve eyeballed them and I don’t think they invalidate my results.

How did I wind up querying the topic /award/award_winner? It wasn’t immediately obvious. I spent a while searching and then exploring the facets that emerged, including:

The crazy thing about Freebase is that, in a way, it doesn’t matter where you start. Everything’s connected to everything, so you can pick up any node of the graph and re-dangle the rest.

Except when you can’t. I haven’t yet gotten a good feel for which paths to prefer and why.

But in the end I came up with the kind of results I’d envisioned:

[Chart: 1901-2009 Nobel Peace Prize winners by gender (male/female)]

[Chart: 1901-2009 Nobel Peace Prize winners by nationality (male/female)]

Taken together they show a couple of trends. First, of course, we see most female winners after about 1960. Second, we see a more even geographic distribution of female winners because, prior to 1960, most winners were not only male but also American or European.

These results didn’t surprise me. What did was the relative ease with which I was able to discover and document them. I thought it would be necessary to write MQL queries in order to do this kind of analysis. I’d previously done a bit of work with MQL, and dug further into it this time around.

But in the end I found that it was just as effective to use interactive filtering. Now to be clear, getting the software to actually do the things I’ve shown here wasn’t a cakewalk. I had to develop a feel for the web of topics in the domain I chose. And it’s painfully slow to add and drop filters.

But still, it’s doable. And you can do it yourself by pointing and clicking. That is an astonishing tour de force, and a glimpse of what things will be like when we can all fluently visualize information about our world.

Magic glasses and magic projectors: Private versus public augmentation of experience

At its core, your browser is built around a data structure called the Document Object Model, hereafter DOM. You can think of the DOM as an outline, and the browser as an outline processor that shows and hides things, displays things in different ways, and even adds, removes, or rearranges things. Nowadays what you see, when you view a web page, is the result of a complex interaction between data and code. The data is the HTML content of the page, and the code is its JavaScript behavior. But these are slippery terms. A lot of content never originates as HTML, but is instead produced dynamically — by a web server, but also quite possibly in the browser as it manipulates the DOM. And a lot of behavior happens opportunistically in response to content on the page.

This arrangement has radical implications. For example, back in 2002 I invented LibraryLookup, a bookmarklet that noticed when you were visiting an Amazon or Barnes and Noble book page and offered a one-click search for that book in your local library. A few years later, a Firefox extension called Greasemonkey arrived on the scene. It offered two capabilities that, working together, enabled a zero-click LibraryLookup. First, it could call out to a web service. Second, it could modify the DOM based on the response. Putting these two things together, I wrote a script that would notice that you were visiting an Amazon book page, check to see if the book was available at your local library, and if so, insert a paragraph into the DOM that said: “Hey, it’s available at the [YOUR LIBRARY NAME] library!”

Is this kosher? I think so, but it’s a tricky question. At the time I made a short screencast that reflected on questions of ownership and fair use in an environment that’s designed and built to support intermediation and remixing. These questions were still largely hypothetical, though, because Firefox users who had also installed Greasemonkey were a very small number indeed.

But now, thanks to modern browser-independent JavaScript libraries like jQuery, those hypothetical questions are becoming very real. Here’s Phil Windley demonstrating his 2009 version of LibraryLookup:

The example comes from Phil’s recent essay The Forgotten Edge: Building a Purpose-Centric Web, which makes the case for contextualized browsing as enabled by libraries like jQuery and by infrastructure like that provided by Phil’s company, Kynetx.

In Phil’s next blog item, Claiming My Right to a Purpose-Centric Web: SideWiki, he asserts:

I claim the right to mash-up, remix, annotate, augment, and otherwise modify Web content for my purposes in my browser using any tool I choose and I extend to everyone else that same privilege.

That item grew a long tail of comments. It includes some interesting back-and-forth between Phil Windley and Dave Winer, but I want to focus on this observation from Greg Yardley:

Sites also generally come with a contract attached – some implicit (the view-through), some explicit (the click-through) – and these contracts, done correctly, are generally enforceable.

This whole post mystifies me, because you don’t have the right to mash-up, remix, annotate, augment, and otherwise modify Web content – it’s not your content.

Earlier in the thread, Jeremy Pickens cited an example of such a contract: Google’s terms of service:

8.2 You should be aware that Content presented to you as part of the Services, including but not limited to advertisements in the Services and sponsored Content within the Services may be protected by intellectual property rights which are owned by the sponsors or advertisers who provide that Content to Google (or by other persons or companies on their behalf). You may not modify, rent, lease, loan, sell, distribute or create derivative works based on this Content (either in whole or in part) unless you have been specifically told that you may do so by Google or by the owners of that Content, in a separate agreement.

In response to Greg Yardley, Phil Rees cites fair use:

Actually we do have those rights.

http://www.law.cornell.edu/uscode/17/107.html

I believe so too. Sooner or later, that belief will be tested.

After my March interview with Phil about Kynetx, I wrote:

There’s a continuum of ways in which I can modify a web page in a browser, ranging from font enlargement to translation to contextual overlays. I wouldn’t draw a line anywhere along that continuum. It seems to me that I’m entitled to view the world through any lens I choose.

This doesn’t only apply to my view of the virtual world, by the way. It will apply to my view of the physical world too. We don’t yet have magic glasses that overlay web prices on shelf items, or web reputations on store signage, but someday we will.

I can’t see how I could be prevented from creating a heads-up display — for realspace or cyberspace — that’s advantageous to me. But I’ve got a hunch that those magic glasses are going to be controversial.

I wonder if it’s going to boil down to magic glasses versus magic projectors. Or, in other words, private versus public augmentation of our experiences of the virtual and real worlds. I can wear my magic glasses, but I can’t necessarily project the view that I’m seeing.

Talking with Victoria Stodden about Science Commons

On this week’s Innovators show I spoke with Victoria Stodden about Science Commons, an effort to bring the values and methods of Creative Commons to the realm of science. Because modern science is so data- and computation-intensive, Science Commons provides legal tools that govern the sharing of data and code. There are lots of good reasons to share the artifacts of scientific computation. Victoria particularly focuses on the benefit of reproducibility. It’s one thing to say that your analysis of a data set leads to a conclusion. It’s quite another to give me your data, and the code you used to process it, and invite me to repeat the experiment.

In this kind of discussion, the word “repository” always comes up. If you put your stuff into a repository, I can take it out and work with it. But I’ve always had a bit of an allergic reaction to that word, and during this podcast I realized why: it connotes a burial ground. What goes into a repository just sits there. It might be looked at, it might be copied, but it’s essentially inert, a dead artifact divorced from its live context.

Sooner or later, cloud computing will change that. The live context in which primary research happens will be a shareable online space. Publishing won’t entail pushing your code and data to a repository, but rather granting access to that space.

It’s a hard conceptual shift to make, though. We think of publishing as a way of pushing stuff out from where we work on it to someplace else where people can get at it. But when we do our work in the cloud, publishing is really just an invitation to visit us there.

Querying mobile data objects with LINQ

I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated values), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.

For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As often happens lately, I was struck by the convergence of the languages. Here’s a comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:

Python:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2008 = 17

handle = urllib.urlopen(url)

reader = csv.reader(
  handle, delimiter=',')

rows = itertools.ifilter(lambda x :
  x[i_name].startswith('Keene') and
  x[i_statename] == 'New Hampshire',
    reader)

found_rows = list(rows)

count = len(found_rows)

if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])

C#:

public class USCensusPopulationData
  {
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }

var csv = new WebClient().
  DownloadString(url);

var stream = new MemoryStream(
  Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };

var reader =
  cc.Read<USCensusPopulationData>(sr, fd);

var rows = reader.ToList();

var found_rows = rows.FindAll(row =>
  row.NAME.StartsWith("Keene") &&
  row.STATENAME == "New Hampshire");

var count = found_rows.Count;

if ( count > 0 )
  pop = Convert.ToInt32(
    found_rows[0].POP_2008);

Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.

This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:

2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (row[i_name], row[i_statename], 
      row[i_pop2000], row[i_pop2008])    
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and 
            p2008 > minpop   and 
            p2008 < p2000 )

state = 'New Hampshire'

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x : 
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % ( 
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.

As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database — perhaps SQLite in order to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And the database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.
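For the record, that SQLite route looks roughly like this. The three rows below are an illustrative stand-in for the real 80,000-row census file, and the column names are simplified:

```python
import csv, io, sqlite3

# A tiny stand-in for the census file; the real one has ~80,000 rows.
data = """NAME,STATENAME,POP_2000,POP_2008
Berlin city,New Hampshire,10331,9495
Keene city,New Hampshire,22563,22380
Nashua city,New Hampshire,86605,86407
"""

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pop (name TEXT, statename TEXT, p2000 INT, p2008 INT)")
reader = csv.reader(io.StringIO(data))
next(reader)  # skip fieldnames
db.executemany("INSERT INTO pop VALUES (?,?,?,?)", reader)

# Declarative version of the shrinking-towns query.
rows = db.execute("""
  SELECT name, round(100.0 * p2008 / p2000 - 100, 2)
  FROM pop
  WHERE statename = 'New Hampshire' AND p2008 > 3000 AND p2008 < p2000
  ORDER BY 2
""").fetchall()
for name, pct in rows:
    print("%.2f%% %s" % (pct, name))
```

The query is pleasantly declarative, but note the chasm: the percent calculation now lives in SQL, not Python, and the string-to-integer conversion happens silently via SQLite’s type affinity rather than in your own code.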

With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:

var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows = 
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000) 
  orderby percent(row.POP_2000,row.POP_2008) 
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008    
    };

 foreach (var row in rows)
   Console.WriteLine("{0:0.00}% {1}",
     percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.

With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.

Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejlsberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.

I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.


private static float percent(string a, string b)
  {
  var y0 = float.Parse(a);
  var y1 = float.Parse(b);
  return - ( y0 / y1 * 100 - 100);
  }

private static IEnumerable<USCensusPopulationData> make_reader()
  {
  var h = new FileStream("pop.csv", FileMode.Open);
  var bytes = new byte[h.Length];
  h.Read(bytes, 0, (Int32)h.Length);
  bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
  var stream = new MemoryStream(bytes);
  var sr = new StreamReader(stream);
  var cc = new CsvContext();
  var fd = new CsvFileDescription { };

  var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
  return census_rows;
  }

public class USCensusPopulationData
  {
  public string SUMLEV;
  public string state;
  public string county;
  public string PLACE;
  public string cousub;
  public string NAME;
  public string STATENAME;
  public string POPCENSUS_2000;
  public string POPBASE_2000;
  public string POP_2000;
  public string POP_2001;
  public string POP_2002;
  public string POP_2003;
  public string POP_2004;
  public string POP_2005;
  public string POP_2006;
  public string POP_2007;
  public string POP_2008;

  public override string ToString()
    {
    return
      NAME + ", " + STATENAME + " " + 
      "pop2000=" + POP_2000 + " | " +
      "pop2008=" + POP_2008;
    } 
  }

public class CensusRowComparer : IEqualityComparer<USCensusPopulationData>
  {
  public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
    return x.NAME == y.NAME && x.STATENAME == y.STATENAME ;
    }

  public int GetHashCode(USCensusPopulationData obj)
    {
    var hash = obj.ToString();
    return hash.GetHashCode();
    }
  }

Talking with Stefano Mazzocchi about reconciling web naming systems

When Stefano Mazzocchi saw my posts on webscale identifiers [1, 2] he pointed me to some recent work he and others have been doing at Metaweb. At ids.freebaseapps.com you can find sets of different web identifiers that refer to the same things. So, for example:

Apple Inc.
versus
Apple Records

Each of these views collects identifiers from different sources. For Apple Inc. they include:

The NYTimes: topics.nytimes.com/top/news/business/companies/apple_computer_inc/

Wikipedia: wikipedia.org/wiki/Apple_Computer

Open Library: openlibrary.org/a/OL2669993A/Inc._Apple_Computer

On this week’s Innovators show Stefano joins me to discuss efforts underway at Metaweb to reconcile many different web naming systems and activate connections among them.

Meanwhile my recent guest Kingsley Idehen is demonstrating a similar kind of name reconciliation at bbc.openlinksw.com. At this URL, for example, you can see canonical identifiers for Michael Jackson from the BBC’s own namespace and others including DBpedia and OpenCyc.

I’m not quite sure what to make of all this. But my spidey sense is telling me to pay attention, so I am.


Related:

  1. Semantic web mashups for the rest of us

  2. A conversation with Stefano Mazzocchi about Cocoon and SIMILE

  3. Motivating people to write the semantic web: A conversation with David Huynh about Parallax

  4. Talking with Kingsley Idehen about mastering your own search index

Speaking and writing webscale identifiers

I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.

Gavin Bell likes Amazon’s hybrid approach:

The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.

Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:

Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!

On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:

BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.

They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.

The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:

http://www.bbc.co.uk/iplayer/search/?q=agatha christie

The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.

This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:

queries (what we can and do type) → results (what we can’t and don’t type)

http://www.librarything.com/catalog/jonudell&deepsearch=practical internet groupware
  → http://www.librarything.com/work/16804
  → http://www.librarything.com/work/16804/book/28447984

http://www.google.com/search?q=practical internet groupware
  → http://oreilly.com/catalog/9781565925373
  → http://oreilly.com/catalog/pracintgr

http://www.bing.com/results.aspx?q=practical internet groupware
  → http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/156592537
  → http://my.safaribooksonline.com/1565925378

http://www.worldcat.org/search?q=practical internet groupware
  → http://www.worldcat.org/oclc/43188074

http://www.amazon.com/s?index=blended&field-keywords=practical internet groupware
  → http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/1565925378

Looking at the consistency in the left column, and the variation in the right, I’ve got to conclude that:

  1. Practical Internet Groupware is the de facto webscale identifier for my book.

  2. 16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.

I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.

Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.
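As a thought experiment, a minimal equivalence service could be little more than a mapping from the speakable name to the unspeakable identifiers enumerated above. The API sketched here is hypothetical; a real service would need provenance, versioning, and a way to contest bad mappings:

```python
# A toy equivalence service: one speakable name, many unspeakable
# identifiers. The mapping is drawn from the identifiers listed above;
# the lookup() API itself is a hypothetical sketch.

equivalences = {
    "practical internet groupware": {
        "librarything": ["16804", "28447984"],
        "oreilly": ["9781565925373", "pracintgr"],
        "worldcat": ["43188074"],
        "amazon": ["1565925378"],
    },
}

def lookup(name, source=None):
    """Resolve a speakable name to identifiers, optionally for one source."""
    ids = equivalences.get(name.lower(), {})
    if source:
        return ids.get(source, [])
    return sorted(i for v in ids.values() for i in v)

print(lookup("Practical Internet Groupware", "worldcat"))
```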

Familiar idioms in Perl, Python, JavaScript, and C#

When I started working on the elmcity project, I planned to use my language of choice in recent years: Python. But early on, IronPython wasn’t fully supported on Azure, so I switched to C#. Later, when IronPython became fully supported, there was really no point in switching my core roles (worker and web) to it, so I’ve proceeded in a hybrid mode. The core roles are written in C#, and a variety of auxiliary pieces are written in IronPython.

Meanwhile, I’ve been creating other auxiliary pieces in JavaScript, as will happen with any web project. The other day, at the request of a calendar curator, I used JavaScript to prototype a tag summarizer. This was so useful that I decided to make it a new feature of the service. The C# version was so strikingly similar to the JavaScript version that I just had to set them side by side for comparison:

JavaScript:

var tagdict = new Object();

for ( var i = 0; i < obj.length; i++ )
  {
  var evt = obj[i];
  if ( evt["categories"] != undefined )
    {
    var tags = evt["categories"].split(',');
    for ( var j = 0; j < tags.length; j++ )
      {
      var tag = tags[j];
      if ( tagdict[tag] != undefined )
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = [];

for ( var tag in tagdict )
  sorted_keys.push(tag);

sorted_keys.sort(function(a,b)
  { return tagdict[b] - tagdict[a] });

C#:

var tagdict = new Dictionary<string, int>();

foreach (var evt in es.events)
  {
  if (evt.categories != null)
    {
    var tags = evt.categories.Split(',');
    foreach (var tag in tags)
      {
      if (tagdict.ContainsKey(tag))
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = new List<string>();

foreach (var tag in tagdict.Keys)
  sorted_keys.Add(tag);

sorted_keys.Sort((a, b)
  => tagdict[b].CompareTo(tagdict[a]));

The idioms involved here include:

  • Splitting a string on a delimiter to produce a list

  • Using a dictionary to build a concordance of strings and occurrence counts

  • Sorting an array of keys by their associated occurrence counts

I first used these idioms in Perl. Later they became Python staples. Now here they are again, in both JavaScript and C#.
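For comparison, here is a sketch of the same three idioms in Python, using a made-up list of event dictionaries shaped like the objects above:

```python
from collections import defaultdict

# Hypothetical events, shaped like the objects in the snippets above
events = [
    {"title": "Open mic", "categories": "music,community"},
    {"title": "Jazz night", "categories": "music"},
    {"title": "Book club", "categories": None},
]

# Idioms 1 and 2: split on a delimiter, build a concordance of
# tags and occurrence counts
tagdict = defaultdict(int)
for evt in events:
    if evt["categories"] is not None:
        for tag in evt["categories"].split(","):
            tagdict[tag] += 1

# Idiom 3: sort the keys by descending occurrence count
sorted_keys = sorted(tagdict, key=lambda tag: tagdict[tag], reverse=True)
```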

Talking with Hugh McGuire about BookOven

On this week’s Innovators show I reconnect with Hugh McGuire. He’s the 104th guest in the current incarnation of the show, and was also the fourth. With Hugh it’s always about books and collaboration. Our first conversation explored one of my favorite projects, LibriVox, which brings people together to make free downloadable audiobooks. This time around we talked about his new project, BookOven, which aims to help authors, editors, and readers work together to create new books.

Writing a book was the hardest thing I’ve ever done. The loneliness was what got to me. I finished around the time the blogosphere was starting to emerge, and the collegial joy I found here made me think I’d never want to repeat that solitary experience.

Nowadays I wouldn’t have to. Authors commonly write books out in the open on blogs. BookOven aims to push that strategy further by providing a suite of online tools purpose-built for discussing, editing, and proofing long texts.

Given the rise of the 140-character blurb, this emphasis on the long form is counter-cyclical. But for me, at least, the pendulum is swinging back. Lately I’m snacking less on Twitter and enjoying full meals served up by the blogosphere, online magazines, and library books. It’s been nourishing. But I’m also noticing that much of this work — in the commercial as well as the amateur realm — could benefit from better organization, editing, and proofing.

The collaborative restructuring of all kinds of professional work has only just begun. Hugh McGuire and I share the belief that our new ability to harness what Yochai Benkler calls the loose affiliation of ad-hoc teams will yield better results in many areas. Book-length writing is the domain that Hugh has staked out. How can the new modes of collaboration enhance this ancient practice? We’ll see.

Ask and ye may receive, don’t ask and ye surely will not

This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.

From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes [1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.

But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:

If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.

Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.

Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.

I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.

So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And it’s one that I hope more computer science students will learn to aspire to.

Talking with Kingsley Idehen about mastering your own search index

Kingsley Idehen’s vision of a web of linked data long predates the recognition I accorded him in 2003. He’s seen the big picture for a very long time, and has been driving toward it consistently. Over the years we’ve had conversations in which I’ve always wound up saying: “Yes, OK, but how will we get people to create this web of linked data that we want to navigate and query?”

On this week’s Innovators show he responds with what I find to be a plausible scenario. Every business, and increasingly every person, presents some kind of home page to the world. On those pages you will find, implied but not clearly stated, one or both of the following kinds of assertions:

1. Things I offer.

2. Things I seek.

A plumber, for example, may offer hydronic heating services, and may seek an assistant with certain qualifications. By encoding these kinds of assertions as subject-verb-object triples we could, in theory, build a semantic web that matches seekers and finders more efficiently than the current searchable web can. But that first step was always a doozy. Writing the assertions required an XML syntax which has never become a web mainstay.
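The matching idea itself can be sketched with plain Python tuples standing in for triples. Every name here is a made-up illustration, not part of any real vocabulary:

```python
# Subject-verb-object assertions of the "things I offer" and
# "things I seek" kinds. All names are hypothetical.
triples = [
    ("PlumberBob", "offers", "hydronic heating services"),
    ("PlumberBob", "seeks", "qualified assistant"),
    ("Homeowner Sue", "seeks", "hydronic heating services"),
]

def match(ts=triples):
    """Pair each 'seeks' assertion with the 'offers' assertions
    for the same object."""
    offers = [(s, o) for (s, p, o) in ts if p == "offers"]
    hits = []
    for (s, p, o) in ts:
        if p == "seeks":
            for (provider, offered) in offers:
                if offered == o:
                    hits.append((s, provider, o))
    return hits
```

Here `match()` would connect Homeowner Sue to PlumberBob on "hydronic heating services", which is the efficiency the searchable web can’t yet deliver.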

There are other ways to write them, however. Using an approach called RDFa, you can embed them directly into human-readable web pages. This isn’t a new idea. A decade ago, in my book Practical Internet Groupware, I showed how CSS class attributes could do double duty within a web page, governing style while also conveying meaning. In 2003 I was still experimenting with the idea, which I then called microcontent. Nowadays the term is microformats.

Although we’ve heard plenty about this idea over the years, it has yet to bear fruit. I don’t know that it will, but the scenario Kingsley Idehen outlines strikes me as plausible because, as Dries Buytaert evocatively says, structured data is the new search engine optimization. Formerly of concern only to publishers, the rationale for search engine optimization is now becoming evident to everyone who writes an About page for their businesses or — what often comes to the same thing — for themselves.

The formula for an About page is well known: name, address, services offered, hours of operation, etc. Everyone writes this stuff once for the About page, and then again in countless variations for inclusion in various directories. Kingsley and I both hope that the time is now ripe for a web-friendly way to write this data into About pages once, for common use by human visitors, search crawlers, and syndicated directories.

His proposal relies on RDFa to encode factual assertions, and on an e-commerce ontology called GoodRelations which, as its creator Martin Hepp says, provides the vocabulary to say things like:

  • a particular Web site describes an offer to sell cellphones of a certain make and model at a certain price,
  • a pianohouse offers maintenance for pianos that weigh less than 150 kg,
  • a car rental company leases out cars of a certain make and model from a particular set of branches across the country.

The GoodRelations wiki shows cookbook examples for Yahoo and Google. You’d have to be fairly technical to adapt these using cut-and-paste, but there’s also a form that, although currently still wired to emit the older RDF/XML kinds of assertions, will soon also emit RDFa that can be woven into existing About pages.

To navigate and query a web of linked data you need, obviously, mechanisms by which to do the navigation and the querying. That’s never been the problem. Technologists love to figure such things out. But we’ve spectacularly failed to help people create that web of linked data in the first place. I don’t know if the approach Kingsley Idehen sketches in this week’s podcast will succeed. But it feels right, and I love his tagline: “Be the master of your own index.”

The joy of webscale identifiers

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: http://www.bbc.co.uk/nature/species/African_Bush_Elephant. The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit’s] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:

homepage: http://www.bbc.co.uk/programmes/b006qpgr/

upcoming shows: http://www.bbc.co.uk/programmes/b006qpgr/episodes/upcoming.xml

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

http://www.bbc.co.uk/programmes/The_Archers/

http://www.bbc.co.uk/programmes/The_Archers/episodes/upcoming.xml

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”
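One way to honor both rules is to let the speakable slug resolve to the opaque identifier behind the scenes. A sketch in Python, assuming a hypothetical lookup table (b006qpgr is the Archers identifier cited above):

```python
# Hypothetical slug-to-ID table; in practice the publisher would
# maintain this mapping and redirect one form to the other.
slug_to_id = {"The_Archers": "b006qpgr"}

def programme_urls(slug, resource=""):
    """Build the human-friendly URL and the canonical opaque-ID URL
    that it could resolve to."""
    friendly = "http://www.bbc.co.uk/programmes/%s/%s" % (slug, resource)
    canonical = "http://www.bbc.co.uk/programmes/%s/%s" % (slug_to_id[slug], resource)
    return friendly, canonical
```

With that in place, people could cite The_Archers in colloquial discourse while the system keeps b006qpgr as the stable key.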

FriendFeed for project collaboration

For me, FriendFeed has been a new answer to an old question — namely, how to collaborate in a loosely-coupled way with people who are using, and helping to develop, an online service. The elmcity project’s FriendFeed room has been an incredibly simple and effective way to interleave curated calendar feeds, blog postings describing the evolving service that aggregates those feeds, and discussion among a growing number of curators.

In his analysis of Where FriendFeed Went Wrong, Dare Obasanjo describes the value of a handful of services (Facebook, Twitter, etc.) in terms that would make sense to non-geeks like his wife. Here’s the elevator pitch for FriendFeed:

Republish all of the content from the different social networking media websites you use onto this site. Also one place to stay connected to what people are saying on multiple social media sites instead of friending them on multiple sites.

As usual, I’m an outlying data point. I’m using FriendFeed as a lightweight, flexible aggregator of feeds from my blog and from Delicious, and as a discussion forum. These feeds report key events in the life of the project: I added a new feature to the aggregator, the curator for Saskatoon found and added a new calendar. The discussion revolves around strategies for finding or creating calendar feeds, features that curators would like me to add to the service, and problems they’re having with the service.

I doubt there’s a mainstream business model here. It’s valuable to me because I’ve created a project environment in which key events in the life of the project are already flowing through feeds that are available to be aggregated and discussed. Anyone could arrange things that way, but few people will.

It’s hugely helpful to me, though. And while I don’t know for sure that FriendFeed’s acquisition by Facebook will end my ability to use FriendFeed in this way, I do need to start thinking about how I’d replace the service.

I don’t need a lot of what FriendFeed offers. Many of the services it can aggregate — Flickr, YouTube, SlideShare — aren’t relevant. And we don’t need realtime notification. So it really boils down to a lightweight feed aggregator married to a discussion forum.
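The aggregator half of that pair can be sketched with the standard library alone. This sketch parses item titles and dates out of already-fetched RSS 2.0 documents and merges them newest-first; in any real use the feed contents would come from the blog and Delicious feeds mentioned above:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def entries(feed_xml):
    """Yield (date, title) pairs from an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        when = parsedate_to_datetime(item.findtext("pubDate"))
        yield when, item.findtext("title")

def aggregate(feed_xmls):
    """Merge several feeds into one list, newest first."""
    merged = []
    for feed_xml in feed_xmls:
        merged.extend(entries(feed_xml))
    return sorted(merged, reverse=True)
```

Married to any discussion forum, something this small covers most of what the project actually uses FriendFeed for.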

One feature that FriendFeed’s API doesn’t offer, by the way, but that I would find useful, is programmatic control of the aggregator’s registry. When a new curator shows up, I have to manually add the associated Delicious feed to the FriendFeed room. It’d be nice to automate that.

Ideally FriendFeed will coast along in a way that lets me keep using it as I currently am. If not, it wouldn’t be too hard to recreate something that provides just the subset of FriendFeed’s services that I need. But ideally, of course, I’d repurpose an existing service rather than build a new one. If you’re using something that could work, let me know.

Purple Numbers for PDF documents?

My contribution to Silona Bonewald week was an interview about her new project citability.org. Silona proposes two new features for government websites. First, change tracking. Second, permalinks for documents, sections, and paragraphs.

Nobody will dispute the need for, or utility of, these features. The question is how to implement them across a sprawling landscape of content management systems and publishing procedures that still, in many cases, regard print as canonical and the web as an afterthought.

In a follow-on discussion with Silona, on the citability wiki, I recalled a little-known and rarely-used feature of PDF documents. You can form URLs that point to specific pages. And with the right preparation, you can even form URLs that point to named destinations within pages.
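For the record, that URL syntax comes from Adobe’s open parameters for PDF: a fragment can select a page (#page=N) or, if the author has defined one, a named destination (#nameddest=NAME). A trivial sketch, with hypothetical file and destination names:

```python
def pdf_page_url(base, page):
    # '#page=N' opens the document at page N
    return "%s#page=%d" % (base, page)

def pdf_dest_url(base, dest):
    # '#nameddest=NAME' opens at a named destination, provided the
    # document was prepared with that destination defined
    return "%s#nameddest=%s" % (base, dest)
```

So a citation like pdf_page_url("http://example.gov/budget.pdf", 12) already works against unmodified documents; named destinations are what require the preparation step.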

Those of us fluent in web-friendly document formats like HTML and XML will tend to recommend that these become canonical. But having recently observed what happened when the old-fashioned non-XML method of math typesetting was supported by WordPress.com, I have to ask: How much more mileage might we be getting out of the existing print-oriented systems?

I am not an expert user of PDF authoring tools, nor an expert user of software libraries that enable programmatic manipulation of PDF files. But some of you are. What would it take, I wonder, to post-process the kinds of PDF files that governments typically produce, in order to add Purple Numbers?

elmcity and WordPress MU: Questions and answers

In the spirit of keystroke conservation, I’m relaying some elmcity-related questions and answers from email to here. Hopefully it will attract more questions and more answers.

Dear Mr. Udell,

I am looking for a flexible calendar aggregator that I can use to report upcoming events for our college’s “Learning Commons” WordPress MU website, a site that will hopefully help keep our students abreast of events and opportunities taking place on campus.

1) Our site will be maintained using WordPress MU, so ideally the
display of the calendars, and/or event-lists will be handled by a
WordPress plugin. The one I am favouring is
http://wordpress.org/extend/plugins/wordpress-ics-importer/ . I have
tried this plugin and it almost does what we want.

Specifically, the plugin includes:

– a single widget that can display the “event-list” for one calendar;

– flexible options for displaying and aggregating calendars.

This plugin almost does what I want, but not quite.

a) The plugin is now limited to a single “events-list” widget. But with WordPress 2.8, it is possible to have many instances of a widget, so theoretically, I could display the “Diagnostic Tests” calendar in one instance, and the “Peer-tutoring” calendar in another widget instance.

b) It would be nice to have an option to display only the current week for specific calendars, while in other cases it makes sense to display the entire month. And although I haven’t thought about it, displaying just the current day would likely be useful.

c) I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

2) It would be nice to remove the calendar aggregation from the WordPress plugin, and handle that separately. Hopefully the calendars will change much less frequently than the website will be viewed. If I understand http://blog.jonudell.net/elmcity-project-faq/ properly, this might be possible using the elmcity-project.

For example, I think we could use “topical hub aggregation” to create a “diagnostic test calendar” that aggregates the holiday calendar and the different departments “diagnostic test” calendars. What I don’t understand is what is the output of “elmcity”. Does it output a single merged calendar (ics) that could be displayed by the above plugin? Is that a possibility?

Similarly, I believe I could create a different meta bookmark to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

We have lots of groups, faculty, departments and staff on campus, and each wants to publicize their upcoming events. Letting them input and maintain their own calendars really seems to make sense. (Thanks for the idea. It seems clear this is the way to go, but I don’t seem to have the pieces to construct the output properly, as yet.)

I agree with your analysis that it would be better to have a separation of concerns between aggregation and display. So let’s do that, and start with aggregation.

I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

I think the elmcity system can be helpful here. I’ve recently discovered that there are really two levels — what I’ve started to call curation and meta-curation.

I believe I could create bookmarks to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

Right. It sounds like you’d want to curate a topic hub. It could be YourCollege, but if there may need to be other topic hubs you could choose a more specific name, like YourCollegeLearningCommons. That’d be your Delicious account name, and you’d be the “meta-curator” in this scenario.

As meta-curator you’d bookmark, in that Delicious account:

– Your holiday calendar

– Multiple departments’ calendars

Each of those would be managed by the responsible/authoritative person, using any software (Outlook, Google, Apple, Drupal, Notes, Live, etc.) that can publish an ICS feed.

There’s another level of flexibility using tags. In the above scenario, as meta-curator you could tag your holiday feed as holiday, and your LearningCommons feeds as LearningCommons, and then filter them accordingly.

What I don’t understand is what is the output of elmcity. Does it output a single merged calendar (ics) that could be displayed by the above plugin?

Yes. The outputs currently include a single merged ICS feed, rendered HTML, and JSON.

Now, for the display options. So far, we’ve got:

  • Use the WordPress plugin to display merged ICS

  • Display the entire calendar as included (maybe customized) HTML

  • Display today’s events as included or script-sourced HTML

  • I have also just recently added a new method that enables things like this: http://jonudell.net/test/upcoming-widget.html

  • You can view the source to see how it’s done. The “API call” here is:

    http://elmcity.cloudapp.net/services/elmcity/json?jsonp=eventlist&recent=7&view=music

    Yours might be:

    http://elmcity.cloudapp.net/services/YourCollegeLearningCommons/json?jsonp=eventlist&recent=10

    or

    &recent=20&view=holiday

    etc.

    This is brand new, as of yesterday. Actually I just realized I should use “upcoming” instead of “recent” so I’ll go and change that now :-) But you get the idea.

    The flexibility here is ultimately governed by:

    1. The curator’s expressive and disciplined use of tags to create useful views

    2. The kinds of queries I make available through the API. So far I’ve only been asked to do ‘next N events’ so that’s what I did yesterday. But my intention is to support every kind of query that’s feasible, and that people ask for. Things like a week’s worth, or a week’s worth in a category, are obvious next steps.
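A caller consuming that API’s JSONP output would strip the callback padding and parse the JSON inside. A sketch in Python; the unwrapping logic is generic, and the event schema shown in any example is invented rather than the service’s actual one:

```python
import json
import re

def unwrap_jsonp(body):
    """Strip the 'callback( ... )' padding from a JSONP response
    and parse the JSON payload inside it."""
    match = re.match(r"^\s*\w+\((.*)\)\s*;?\s*$", body, re.DOTALL)
    return json.loads(match.group(1))
```

Given a response like eventlist([...]), this hands back the event list as ordinary Python objects, ready for whatever view the curator wants to build.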

Two projects for civic-minded student programmers

One of the key findings of the elmcity project, so far, is that there’s a lot of calendar information online, but very little in machine-readable form. Transforming this implicit data about public events into explicit data is an important challenge. I’ve been invited to define the problem, for students who may want to tackle it as a school project. Here are the two major aspects I’ve identified.

A general scraper for calendar-like web pages

There are zillions of calendar-like web pages, like this one for Harlow’s Pub in Peterborough, NH. These ideally ought to be maintained using calendar programs that publish machine-readable iCalendar feeds which are also transformed and styled to create human-readable web pages. But that doesn’t (yet) commonly happen.

These web pages are, however, often amenable to scraping. And for a while, elmcity curators were making very effective use of FuseCal (1, 2, 3) to transform these kinds of pages into iCalendar feeds.

When that service shut down, I retained a list of the pages that elmcity curators were successfully transforming into iCalendar feeds using FuseCal. These are test cases for an HTML-to-iCalendar service. Anyone who’s handy with scraping libraries like Beautiful Soup can solve these individually. The challenge here is to create, by abstraction and generalization, an engine that can handle a significant swath of these cases.
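To make the shape of an individual solution concrete, here is a sketch of a page-specific scraper using only the standard library. The page structure it assumes (<li class="event">YYYYMMDD Title</li>) is invented; a real page like Harlow’s would need its own rules, which is exactly why the generalization problem is interesting:

```python
import re

def scrape_to_ics(html):
    """Turn a calendar-like page into a minimal iCalendar document.
    Assumes events appear as <li class="event">YYYYMMDD Title</li>."""
    events = re.findall(r'<li class="event">(\d{8}) ([^<]+)</li>', html)
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0"]
    for date, title in events:
        lines += ["BEGIN:VEVENT",
                  "DTSTART;VALUE=DATE:" + date,
                  "SUMMARY:" + title.strip(),
                  "END:VEVENT"]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)
```

The challenge for students is to replace the single hardwired pattern with an engine that learns or abstracts such patterns across many pages.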

A hybrid system for finding implicit recurring events and making them explicit

Lots of implicit calendar data online doesn’t even pretend to be calendar-like, and cannot be harvested using a scraper. Finding one-off events in this category is out of scope for my project. But finding recurring events seems promising. The singular effort required to publish one of these will pay ongoing dividends.

It’s helpful that the language people use to describe these events — “every Tuesday”, “third Saturday of every month” — is distinctive. To begin exploring this domain, I wrote a specialized search robot that looks for these patterns, in conjunction with names of places. Its output is available for all the cities and towns participating in the elmcity project. For example, this page is the output for Keene, NH. It includes more than 2000 links to web pages — or, quite often, PDF files — some fraction of which represent recurring events.
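A matcher for that distinctive recurrence language can be sketched in a few lines of Python. The pattern here is illustrative, not the actual robot’s:

```python
import re

# Phrases like "every Tuesday" or "third Saturday of every month"
RECURRING = re.compile(
    r"\b(?:every\s+\w+day"
    r"|(?:first|second|third|fourth|last)\s+\w+day\s+of\s+(?:every|each)\s+month)\b",
    re.IGNORECASE)

def recurring_phrases(text):
    """Return the recurrence phrases found in a chunk of text."""
    return RECURRING.findall(text)
```

Run over pages mentioning a town’s place names, even a pattern this crude surfaces candidates worth a human look.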

In Finding and connecting social capital I showed a couple of cases where the pages found this way did, in fact, represent recurring events that could be added to an iCalendar feed.

To a computer scientist this looks like a problem that you might solve using a natural language parser. And I think it is partly that, but only partly. Let’s look at another example:

At first glance, this looks hopeful:

First Monday of each month: Dads Group, 105 Castle Street, Keene NH

But the real world is almost always messier than that. For starters, that listing comes from the Monadnock Men’s Resource Center’s Fall 2004 newsletter. So before I add this to a calendar, I’ll want to confirm the information. The newsletter is hosted at the MMRC site. Investigation yields these observations:

  • The most recent issue of the newsletter was Winter ’06

  • The last-modified date of the MMRC home page is September 2008

  • As of that date, the Dads Group still seems to have been active, under a slightly different name: Parent Outreach Project, DadTime Program, 355-3082

  • There’s no email address, only a phone number.

So I called the number, left a message, and will soon know the current status.

What kind of software-based system can help us scale this gnarly process? There is an algorithmic solution, surely, but it will need to operate in a hybrid environment. The initial search-driven discovery of candidate events can be done by an automated parser tuned for this domain. But the verification of candidates will need to be done by human volunteers, assisted by software that helps them:

  • Divide long lists of candidates into smaller batches

  • Work in parallel on those batches

  • Evaluate the age and provenance of candidates

  • Verify or disqualify candidates based on discoverable evidence, if possible

  • Otherwise, find appropriate email addresses (preferably) or phone numbers, and manage the back-and-forth communication required to verify or disqualify a candidate

  • Refer event sponsors to a calendar publishing how-to, and invite them to create data feeds that can reliably syndicate

Students endowed with the geek gene are likely to gravitate toward the first problem because it’s cleaner. But I hope I can also attract interest in the second problem. We really need people who can hack that kind of real-world messiness.

That word “events”: It does not mean what you think it means

In one of my favorite scenes from one of my favorite movies, The Princess Bride, Vizzini (Wallace Shawn) has been repeatedly exclaiming: “Inconceivable!” Finally Inigo Montoya (Mandy Patinkin) responds:

You keep using that word. I do not think it means what you think it means.

I’ve already riffed on that classic bit in the titles of two other items. Now I’m compelled to do it again, because when I talk about events, vis-a-vis the elmcity project, I think the word means something different from what you probably think it means.

Here’s one common meaning: major public events. These include things like artistic performances, festivals, fairs, and sporting events. They dominate the “Things to See and Do” section of every newspaper and online community guide, and are usually well publicized.

Here’s another common meaning: minor events that are often (but not always) private. These include birthday parties, house concerts, and outdoor excursions. They are, nowadays, often publicized very well in Facebook.

Although I’m happy to see major public events showing up in an elmcity hub, that isn’t my main goal. And private events, of course, don’t belong in an elmcity hub; they belong in Facebook, or in other private networks.

There’s a third kind of event that interests me most of all. It occupies a space between the other two. It’s public, but minor: a book discussion, a roadside cleanup, a support group, a square dance. These events typically don’t show up in “Things To See And Do” guides because they’re considered too niche, and because it’s too much work — for both the publisher and the contributor — to get them included. They might show up in Facebook, but if so they will be visible there only within a closed social network.

There are tons of events in this minor-but-public category. Here’s one of my favorite examples. We were having dinner with our friends Lin and Tom recently, and Lin mentioned that Tom had just won the New Hampshire state archery tournament.

Me: “Really? Congratulations! Where was that held?”

Lin: “At the Keene Recreation Center, last Saturday.”

The Rec Center is a ten-minute walk from my house. I’d have loved to see those precision archers ply their trade. And it was open to the public. Anybody could have gone. But nobody knew.

Everyone I talk to has similar stories. Everyone says they find out about such things — if they find out at all — only after the fact. Everyone acknowledges that there should be a better way to inform one another about the goings-on that implicitly form much of the social capital of the community. If we can make more of it explicit, we will lead richer lives. And here I mean richer in two senses of that word. There’s the Robert Putnam sense of social well-being. And there’s the Richard Florida sense of economic well-being. If we can make more of our implicit social capital explicit, we’ll profit in both ways.