

You will probably never need to know about the Olson database, also known as the Zoneinfo or tz database. And were it not for my elmcity project I never would have looked into it. I knew roughly that this bedrock database is a compendium of definitions of the world’s timezones, plus rules for daylight savings transitions (DST), used by many operating systems and programming languages.

I presumed that it was written Unix-style, in some kind of plain-text format, and that’s true. Here, for example, are top-level DST rules for the United States since 1918:

# Rule NAME FROM  TO    IN   ON         AT      SAVE    LETTER/S
Rule   US   1918  1919  Mar  lastSun    2:00    1:00    D
Rule   US   1918  1919  Oct  lastSun    2:00    0       S
Rule   US   1942  only  Feb  9          2:00    1:00    W # War
Rule   US   1945  only  Aug  14         23:00u  1:00    P # Peace
Rule   US   1945  only  Sep  30         2:00    0       S
Rule   US   1967  2006  Oct  lastSun    2:00    0       S
Rule   US   1967  1973  Apr  lastSun    2:00    1:00    D
Rule   US   1974  only  Jan  6          2:00    1:00    D
Rule   US   1975  only  Feb  23         2:00    1:00    D
Rule   US   1976  1986  Apr  lastSun    2:00    1:00    D
Rule   US   1987  2006  Apr  Sun>=1     2:00    1:00    D
Rule   US   2007  max   Mar  Sun>=8     2:00    1:00    D
Rule   US   2007  max   Nov  Sun>=1     2:00    0       S

What I didn’t appreciate, until I finally unzipped and untarred a copy of ftp://elsie.nci.nih.gov/pub/tzdata2009o.tar.gz, is the historical scholarship scribbled in the margins of this remarkable database, or document, or hybrid of the two.

You can see a glimpse of that scholarship in the above example. The most recent two rules define the latest (2007) change to US daylight savings. The spring forward rule says: “On the second Sunday in March, at 2AM, save one hour, and use D to change EST to EDT.” Likewise, on the fast-approaching first Sunday in November, spend one hour and go back to EST.
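To make the Sun>=8 notation concrete, here is a minimal Python 3 sketch that resolves the two current US rules for a given year. It is not part of the tz distribution: the helper names are illustrative, and the "E%sT" format string is assumed to be the one that applies to the US Eastern zone.

```python
import datetime

SUNDAY = 6  # datetime.date.weekday(): Monday is 0, Sunday is 6

def on_or_after(year, month, day, weekday=SUNDAY):
    """Resolve a tz-style 'Sun>=N' rule: first such weekday on or after the day."""
    start = datetime.date(year, month, day)
    return start + datetime.timedelta(days=(weekday - start.weekday()) % 7)

def us_transitions(year):
    """Dates for the 2007-and-later US rules shown above, with their LETTER values."""
    return [
        (on_or_after(year, 3, 8), "D"),   # Rule US 2007 max Mar Sun>=8
        (on_or_after(year, 11, 1), "S"),  # Rule US 2007 max Nov Sun>=1
    ]

for when, letter in us_transitions(2009):
    # The LETTER fills the %s in the zone's FORMAT (assumed here: "E%sT")
    print(when.isoformat(), "E%sT" % letter)
```

For 2009 this yields March 8 (EDT) and November 1 (EST). The sketch ignores the AT and SAVE columns, which a real implementation would need.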

But look at the rules for Feb 9 1942 and Aug 14 1945. The letters are W and P instead of D and S. And the comments tell us that during that period there were timezones like Eastern War Time (EWT) and Eastern Peace Time (EPT). Arthur David Olson elaborates:

From Arthur David Olson (2000-09-25):

Last night I heard part of a rebroadcast of a 1945 Arch Oboler radio drama. In the introduction, Oboler spoke of “Eastern Peace Time.” An AltaVista search turned up: “When the time is announced over the radio now, it is ‘Eastern Peace Time’ instead of the old familiar ‘Eastern War Time.’ Peace is wonderful.”

 

Most of this Talmudic scholarship comes from founding contributor Arthur David Olson and editor Paul Eggert, both of whose Wikipedia pages, although referenced from the Zoneinfo page, strangely do not exist.

But the Olson/Eggert commentary is also interspersed with many contributions, like this one about the Mount Washington Observatory.

From Dave Cantor (2004-11-02)

Early this summer I had the occasion to visit the Mount Washington Observatory weather station atop (of course!) Mount Washington [, NH]…. One of the staff members said that the station was on Eastern Standard Time and didn’t change their clocks for Daylight Saving … so that their reports will always have times which are 5 hours behind UTC.

 

Since Mount Washington has a climate all its own, I guess it makes sense for it to have its own time as well.

Here’s a glimpse of Alaska’s timezone history:

From Paul Eggert (2001-05-30):

Howse writes that Alaska switched from the Julian to the Gregorian calendar, and from east-of-GMT to west-of-GMT days, when the US bought it from Russia. This was on 1867-10-18, a Friday; the previous day was 1867-10-06 Julian, also a Friday. Include only the time zone part of this transition, ignoring the switch from Julian to Gregorian, since we can’t represent the Julian calendar.

As far as we know, none of the exact locations mentioned below were permanently inhabited in 1867 by anyone using either calendar. (Yakutat was colonized by the Russians in 1799, but the settlement was destroyed in 1805 by a Yakutat-kon war party.) However, there were nearby inhabitants in some cases and for our purposes perhaps it’s best to simply use the official transition.

 

You have to have a sense of humor about this stuff, and Paul Eggert does:

From Paul Eggert (1999-03-31):

Shanks writes that Michigan started using standard time on 1885-09-18, but Howse writes (pp 124-125, referring to Popular Astronomy, 1901-01) that Detroit kept

local time until 1900 when the City Council decreed that clocks should be put back twenty-eight minutes to Central Standard Time. Half the city obeyed, half refused. After considerable debate, the decision was rescinded and the city reverted to Sun time. A derisive offer to erect a sundial in front of the city hall was referred to the Committee on Sewers. Then, in 1905, Central time was adopted by city vote.

 

This story is too entertaining to be false, so go with Howse over Shanks.

 

The document is chock full of these sorts of you-can’t-make-this-stuff-up tales:

From Paul Eggert (2001-03-06), following a tip by Markus Kuhn:

Pam Belluck reported in the New York Times (2001-01-31) that the Indiana Legislature is considering a bill to adopt DST statewide. Her article mentioned Vevay, whose post office observes a different time zone from Danner’s Hardware across the street.

 

I love this one about the cranky Portuguese prime minister:

Martin Bruckmann (1996-02-29) reports via Peter Ilieve that Portugal is reverting to 0:00 by not moving its clocks this spring. The new Prime Minister was fed up with getting up in the dark in the winter.

 

Of course Gaza could hardly fail to exhibit weirdness:

From Ephraim Silverberg (1997-03-04, 1998-03-16, 1998-12-28, 2000-01-17 and 2000-07-25):

According to the Office of the Secretary General of the Ministry of Interior, there is NO set rule for Daylight-Savings/Standard time changes. One thing is entrenched in law, however: that there must be at least 150 days of daylight savings time annually.

 

The rule names for this zone are poignant too:

# Zone  NAME            GMTOFF  RULES   FORMAT  [UNTIL]
Zone    Asia/Gaza       2:17:52 -       LMT     1900 Oct
                        2:00    Zion    EET     1948 May 15
                        2:00 EgyptAsia  EE%sT   1967 Jun  5
                        2:00    Zion    I%sT    1996
                        2:00    Jordan  EE%sT   1999
                        2:00 Palestine  EE%sT

There’s also some wonderful commentary in the various software libraries that embody the Olson database. Here’s Stuart Bishop on why pytz, the Python implementation, supports almost all of the Olson timezones:

As Saudi Arabia gave up trying to cope with their timezone definition, I see no reason to complicate my code further to cope with them. (I understand the intention was to set sunset to 0:00 local time, the start of the Islamic day. In the best case caused the DST offset to change daily and worst case caused the DST offset to change each instant depending on how you interpreted the ruling.)

 

It’s all deliciously absurd. And according to Paul Eggert, Ben Franklin is having the last laugh:

From Paul Eggert (2001-03-06):

Daylight Saving Time was first suggested as a joke by Benjamin Franklin in his whimsical essay “An Economical Project for Diminishing the Cost of Light” published in the Journal de Paris (1784-04-26). Not everyone is happy with the results.

 

So is Olson/Zoneinfo/tz a database or a document? Clearly both. And its synthesis of the two modes is, I would argue, a nice example of literate programming.

Because I am lazy, curious, and evangelical, the elmcity service works in an unusual way. Anything that I can delegate to other services I do. So when curators add feeds to hubs, or modify the behavior of hubs, they do it by bookmarking and tagging URLs at delicious.com. It would be foolish to keep that registry and configuration data only in delicious, so I don’t; I persist it to Azure tables. But for now, I’m delegating the data entry interface to delicious.

It’s a lazy approach, in the good sense of lazy. I don’t want to build my own data entry system unless I can add important value, and in this case I can’t.

I’m also curious to see how far this approach can take us. As the project has evolved, so has the tag vocabulary spoken between curators and the service. It’s an easy and natural process, and I don’t see any roadblocks ahead.

Finally, I’m evangelizing this way of doing things because I continue to think that more people should appreciate it.

In this scenario I’ve delegated something else to delicious: authentication. My service doesn’t have its own user accounts. Instead, as the administrator of the service, I tell it to trust a specific set of delicious accounts. When one of those accounts bookmarks an iCalendar URL, and tags it in a particular way, the service regards that as an authenticated request to add the feed to that hub’s registry.

Other requests that curators can make include:

Make the radius for my hub 5 miles.

Make my timezone Arizona.

Set my CSS file to this URL.

But here’s one that curators have wanted to make and couldn’t:

I just added a feed or changed a configuration option. Please reprocess my hub ASAP.

We could represent this message with a tag. Or we could use the rudimentary messaging system in delicious. But these approaches seemed awkward, and I rejected them.

Well, why not Twitter? True, it means that curators who want to send messages to the service will now need accounts in two places. But if they don’t already have accounts on both delicious and Twitter, they can create them. And those accounts will serve them in a variety of ways, unlike a single-purpose account on elmcity.

So, it’s done. As the curator for Keene, I’ve added the tag twitter=judell to the delicious account that controls the Keene hub. As the elmcity service periodically scans its designated set of delicious accounts, it follows any Twitter handle it isn’t already following. Those Twitter accounts can then send direct messages to the Twitter account of the elmcity service.
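The scan described above can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the account names, the bookmark data structure, and the radius=5 tag syntax are hypothetical, and only the twitter=judell tag comes from the text. No delicious or Twitter API calls are shown.

```python
# Hypothetical delicious account names that the administrator has chosen to trust
TRUSTED_ACCOUNTS = {"elmcity_keene"}

def parse_tags(tags):
    """Turn key=value tags into a settings dict; plain tags are ignored."""
    settings = {}
    for tag in tags:
        key, sep, value = tag.partition("=")
        if sep and key and value:
            settings[key.lower()] = value
    return settings

def handles_to_follow(bookmarks, already_following):
    """Collect Twitter handles named by trusted accounts and not yet followed."""
    wanted = set()
    for account, tags in bookmarks:
        if account not in TRUSTED_ACCOUNTS:
            continue  # untrusted accounts cannot configure a hub
        handle = parse_tags(tags).get("twitter")
        if handle and handle not in already_following:
            wanted.add(handle)
    return wanted

# Illustrative input: (account, tags) pairs as a scan might collect them
bookmarks = [
    ("elmcity_keene", ["twitter=judell", "radius=5"]),
    ("someone_else", ["twitter=intruder"]),
]
print(handles_to_follow(bookmarks, already_following=set()))
```

Because trust is decided by the account that owns the bookmark, the tag itself needs no signature or password.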

For now there’s only one thing a curator can say to the service in a direct message — “start” — which means “please reprocess my hub ASAP.” But I’m sure the control vocabulary will evolve. And of course the service can use the channel to send notifications back to curators.

Twitter is famously unreliable, but that should be OK for my purposes. We’re not controlling the space shuttle. If a message doesn’t get through to the service on the first or second try, it’ll get through eventually, and that’ll be good enough.

Someday I may have to build a data entry system and an accounts system. Then again, maybe not. Meanwhile I’m going to keep exploring this lightweight approach. It’s effective and, not coincidentally, it’s fun.

Instead of mourning the lost art of personal customer service, I would rather celebrate examples that show it’s still possible. Yesterday I found two gems.

First, Southwest Airlines. I had booked a round-trip flight and then needed to change to one-way. You can’t do that online. So I clenched my jaw, called customer service, and prepared for the long wait.

Instead, this:

IVR: “Would you like us to call you back in about 20 minutes?”

Me: “Why…yes! Beep, beep, beep, beep, beep, beep, beep, #.”

My jaw relaxed.

Twenty or so minutes later, an agent called back and we made the change. Now the unclenched jaw morphed into a smile.

Second, FindTape.com. I’m making interior storm windows and I need double-stick tape for the project. Which, sure, you can buy online. But the smorgasbord of choices is paralyzing. I wasted a half-hour trying to figure out which product would best suit my unusual application and made no progress whatsoever.

Then, at FindTape.com, I read this:

If you have a specific question related to which tape would work best in your application please fill out and submit the following fields so that we can have an appropriate representative get back in contact with you.

A fellow named Kevin wrote back, we’ve been discussing my options, and now I’m ready to buy.

Both examples remind me of Michael Nielsen’s luminous phrase: the restructuring of expert attention. He coined it to define a new era of scientific collaboration, but it applies more broadly.

We’ve been told that companies can’t afford to focus expert attention on customers. The truth, of course, is that they can’t afford not to.

For a generation and more we’ve driven a wedge between people who have expertise with products and services and people who need that expertise. How’s that working for you? Me neither.

It’s true that expert attention is a scarce resource. But we’re living through a Cambrian explosion of awareness networks and communication modes. Used adroitly, they can optimize the allocation of that scarce resource. Which is a fancy way of saying: Maybe personal customer service isn’t a lost art after all.

On this week’s Innovators show, with Daniel Debow of Rypple, I learned about a cognitive psychological tool called the Johari Window. Rypple focuses on the quadrant of the Johari Window at the intersection of “known to others” and “not known to self”, the so-called blind area. The company is dedicated to the proposition that if we can become more aware of what others know about us that we don’t, we can improve ourselves along various axes: personal, social, and (critically for Rypple’s business model) professional.

How do you gain that awareness? By asking questions like:

Am I giving sufficiently clear guidance?

or

Do I interrupt people too often?

You direct these questions to a set of people whose feedback you value. Rypple anonymizes their responses and, to the extent you buy into the service, provides a progressively capable framework within which to continue the dialogue. This is a great idea, and one of the very few appropriate uses for online anonymity that I can imagine.

Rypple, as a company, lives at the intersection of a couple of key trends. Social media, obviously, but also the services ecosystem. As we discuss in the podcast, corporate HR has historically been a monolith that expects 100% compliance with its systems. But people, as we know, differ emotionally and cognitively. We should be able to use a variety of methods to manage and evaluate people, and help them manage and evaluate themselves. Software delivered as a service is an enabler of that possibility.

Here’s a twist: A company won’t have access to the feedback that employees solicit using Rypple. Daniel Debow says that HR folks, well aware of mainstream social software, are ready to embrace this model. I hope he’s right.

His favorite recent story about Rypple goes like this:

At an HR conference I talked to the CEO of a company that uses Rypple. He’s excited about what we’re doing, but he said: “You have a real problem. Use of your system might make your system obsolete. We’ve been using it for a while now, and I’ve noticed that people are much more willing to give me feedback face-to-face, they’re willing to talk to me.”

Well that’s the furthest thing from a problem I can imagine. It’s like saying to Facebook, you’ve got a problem, people keep meeting on Facebook and then meeting up in person and creating real relationships offline.

Actually that would be a problem for Facebook. But Rypple isn’t about pageviews, it’s about helping people improve. Which seems like a great idea to me.

You can, by the way, use Rypple not only to solicit anonymized feedback from a chosen set of responders, but also from an open-ended set. So here’s my question:

How can I make my ideas more accessible and more actionable?

I’m asking a chosen set too, but if you can perceive my blind spot I’d love to know what you see there.

When I watched Barack Obama accept the Nobel Peace Prize, I thought about how the world has changed since the inception of the prize, and how it will continue to change. Since the winners of the Prize are themselves a reflection of what’s changing, I thought I’d try using Freebase to visualize them over the century the Prize has existed.

What you can find out, with Freebase, depends on its coverage of the topics you’re asking about. So realize that what I’ll show here is possible because Nobel Peace Prize winners are a well-covered topic. Still, it’s wildly impressive.

The Nobel site tells us that 89 Nobel Peace Prizes have been awarded since 1901. I haven’t been able to reproduce that number in Freebase because there are multiple winners in a few years, and I haven’t found a way to group results by year. But for my purposes this related query is good enough:

That number, 100, isn’t as closely related to 89 as you might think. It’s less by the number of years no award was given, but more by the number of recipients in multiple-award years. Perhaps a Freebase guru can show us how to measure those uncertainties, but I’ve eyeballed them and I don’t think they invalidate my results.
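The difference between counting prizes and counting recipients is easy to see once results are grouped by year. The Python 3 sketch below uses a tiny hand-entered sample, not the Freebase result set, and the (year, winner) tuple layout is an assumption made for illustration.

```python
from collections import defaultdict

# Tiny sample for illustration only; None marks a year with no award.
sample = [
    (1901, "Henry Dunant"),
    (1901, "Frédéric Passy"),
    (1903, "Randal Cremer"),
    (1914, None),
]

# Group recipients by year, skipping years in which nothing was awarded
by_year = defaultdict(list)
for year, winner in sample:
    if winner is not None:
        by_year[year].append(winner)

years_awarded = len(by_year)                                    # one prize per awarded year
recipients = sum(len(names) for names in by_year.values())      # shared prizes count each recipient
years_skipped = sum(1 for _, winner in sample if winner is None)

print(years_awarded, recipients, years_skipped)
```

In this sample there are two awarded years, three recipients, and one skipped year, which is the same kind of gap that separates 89 from 100.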

How did I wind up querying the topic /award/award_winner? It wasn’t immediately obvious. I spent a while searching and then exploring the facets that emerged, including:

The crazy thing about Freebase is that, in a way, it doesn’t matter where you start. Everything’s connected to everything, so you can pick up any node of the graph and re-dangle the rest.

Except when you can’t. I haven’t yet gotten a good feel for which paths to prefer and why.

But in the end I came up with the kind of results I’d envisioned:

[Chart: 1901-2009 Nobel Peace Prize winners by gender (male, female)]

[Chart: 1901-2009 Nobel Peace Prize winners by nationality (male, female)]

Taken together they show a couple of trends. First, of course, we see most female winners after about 1960. Second, we see a more even geographic distribution of female winners because, prior to 1960, most winners were not only male but also American or European.

These results didn’t surprise me. What did is the relative ease with which I was able to discover and document them. I thought it would be necessary to write MQL queries in order to do this kind of analysis. I’d previously done a bit of work with MQL, and dug further into it this time around.

But in the end I found that it was just as effective to use interactive filtering. Now to be clear, getting the software to actually do the things I’ve shown here wasn’t a cakewalk. I had to develop a feel for the web of topics in the domain I chose. And it’s painfully slow to add and drop filters.

But still, it’s doable. And you can do it yourself by pointing and clicking. That is an astonishing tour de force, and a glimpse of what things will be like when we can all fluently visualize information about our world.

I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated values), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.

For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As happens lately, I was struck by the convergence of the languages. Here’s a side-by-side comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:

Python:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2008 = 17

handle = urllib.urlopen(url)

reader = csv.reader(
  handle, delimiter=',')

rows = itertools.ifilter(lambda x :
  x[i_name].startswith('Keene') and
  x[i_statename] == 'New Hampshire',
    reader)

found_rows = list(rows)

count = len(found_rows)

if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])

C#:

public class USCensusPopulationData
  {
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }

var csv = new WebClient().
  DownloadString(url);

var stream = new MemoryStream(
  Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };

var reader =
  cc.Read<USCensusPopulationData>(sr, fd);

var rows = reader.ToList();

// Field names match the class above; C# strings take double quotes
var found_rows = rows.FindAll(row =>
  row.NAME.StartsWith("Keene") &&
  row.STATENAME == "New Hampshire");

// Count the matches, not all rows
var count = found_rows.Count;

if ( count > 0 )
  pop = Convert.ToInt32(
    found_rows[0].POP_2008);

Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.

This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:

2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (row[i_name], row[i_statename],
      row[i_pop2000], row[i_pop2008])
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and
            p2008 > minpop   and
            p2008 < p2000 )

state = 'New Hampshire'

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x :
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % (
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.
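The listing above is Python 2: itertools.ifilter, the print statement, and sort with a cmp function are all gone in Python 3. As a hedged sketch of how the same steps (filter, de-duplicate, sort) might look in Python 3, here is a version that reads named columns with csv.DictReader. The sample rows are made up for illustration, and the column names are assumed to match the census file’s header.

```python
import csv
import io

# Made-up rows for illustration; the real file has more columns and rows.
SAMPLE = """NAME,STATENAME,POP_2000,POP_2008
Alpha city,New Hampshire,10000,9000
Alpha city,New Hampshire,10000,9000
Beta town,New Hampshire,5000,4900
Gamma town,New Hampshire,2000,1900
Delta city,Vermont,8000,7000
"""

def percent(row):
    """Same formula as the percent function above: negative means loss."""
    return -(float(row["POP_2000"]) / float(row["POP_2008"]) * 100 - 100)

def losses(lines, state, minpop=1):
    """Rows for one state that lost population, de-duplicated, biggest loss first."""
    seen = {}
    for row in csv.DictReader(lines):
        p2000, p2008 = int(row["POP_2000"]), int(row["POP_2008"])
        if row["STATENAME"] == state and minpop < p2008 < p2000:
            key = (row["NAME"], row["STATENAME"], p2000, p2008)
            seen[key] = row  # a later duplicate replaces an earlier one
    return sorted(seen.values(), key=percent)

for row in losses(io.StringIO(SAMPLE), "New Hampshire", minpop=3000):
    print("%.2f%% %s" % (percent(row), row["NAME"]))
```

To run it against the real data, pass an open file object instead of the io.StringIO wrapper.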

As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database, perhaps SQLite in order to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And the database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.
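For comparison, here is a minimal sketch of the SQLite route using Python 3’s standard sqlite3 module. The table layout and the sample rows are assumptions made for illustration. The type systems still differ, but sqlite3’s create_function lets SQL call back into a Python function, which narrows part of the gap.

```python
import sqlite3

def percent(p2000, p2008):
    """Same formula as the percent function above: negative means loss."""
    return -(float(p2000) / float(p2008) * 100 - 100)

# Made-up rows for illustration only
rows = [
    ("Alpha city", "New Hampshire", 10000, 9000),
    ("Beta town", "New Hampshire", 5000, 4900),
    ("Delta city", "Vermont", 8000, 7000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census "
    "(name TEXT, statename TEXT, pop_2000 INTEGER, pop_2008 INTEGER)"
)
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?)", rows)

# Register the Python function so that SQL statements can call it
conn.create_function("percent", 2, percent)

query = """
    SELECT name, percent(pop_2000, pop_2008) AS pct
    FROM census
    WHERE statename = ? AND pop_2008 > ? AND pop_2008 < pop_2000
    ORDER BY pct
"""
results = conn.execute(query, ("New Hampshire", 3000)).fetchall()
for name, pct in results:
    print("%.2f%% %s" % (pct, name))
```

The query is still a string handed to another engine, so the editor and debugger cannot see into it the way they can with the LINQ version.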

With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:

var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows =
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000)
  orderby percent(row.POP_2000,row.POP_2008)
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008
    };

 foreach (var row in rows)
   Console.WriteLine("{0:0.00}% {1}",
     percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.

With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.

Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejlsberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.

I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.


private static float percent(string a, string b)
  {
  var y0 = float.Parse(a);
  var y1 = float.Parse(b);
  return - ( y0 / y1 * 100 - 100);
  }

private static IEnumerable<USCensusPopulationData> make_reader()
  {
  var h = new FileStream("pop.csv", FileMode.Open);
  var bytes = new byte[h.Length];
  h.Read(bytes, 0, (Int32)h.Length);
  bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
  var stream = new MemoryStream(bytes);
  var sr = new StreamReader(stream);
  var cc = new CsvContext();
  var fd = new CsvFileDescription { };

  var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
  return census_rows;
  }

public class USCensusPopulationData
  {
  public string SUMLEV;
  public string state;
  public string county;
  public string PLACE;
  public string cousub;
  public string NAME;
  public string STATENAME;
  public string POPCENSUS_2000;
  public string POPBASE_2000;
  public string POP_2000;
  public string POP_2001;
  public string POP_2002;
  public string POP_2003;
  public string POP_2004;
  public string POP_2005;
  public string POP_2006;
  public string POP_2007;
  public string POP_2008;

  public override string ToString()
    {
    return
      NAME + ", " + STATENAME + " " +
      "pop2000=" + POP_2000 + " | " +
      "pop2008=" + POP_2008;
    }
  }

public class  CensusRowComparer : IEqualityComparer<USCensusPopulationData>
  {
  public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
    return x.NAME == y.NAME && x.STATENAME == y.STATENAME ;
    }

  public int GetHashCode(USCensusPopulationData obj)
    {
    // Hash the same fields that Equals compares, so equal rows hash alike
    return (obj.NAME + "|" + obj.STATENAME).GetHashCode();
    }
  }

I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.

Gavin Bell likes Amazon’s hybrid approach:

The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.

Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:

Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!

On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:

BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.

They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.

The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:

http://www.bbc.co.uk/iplayer/search/?q=agatha christie

The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.

This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:

Queries (what we can and do type), each followed by its results (what we can’t and don’t type):

Query: http://www.librarything.com/catalog/jonudell&deepsearch=practical internet groupware
Results:
http://www.librarything.com/work/16804
http://www.librarything.com/work/16804/book/28447984

Query: http://www.google.com/search?q=practical internet groupware
Results:
http://oreilly.com/catalog/9781565925373
http://oreilly.com/catalog/pracintgr

Query: http://www.bing.com/results.aspx?q=practical internet groupware
Results:
http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/156592537
http://my.safaribooksonline.com/1565925378

Query: http://www.worldcat.org/search?q=practical internet groupware
Results:
http://www.worldcat.org/oclc/43188074

Query: http://www.amazon.com/s?index=blended&field-keywords=practical internet groupware
Results:
http://www.amazon.com/Practical-Internet-Groupware-Jon-Udell/dp/1565925378

Looking at the consistency of the queries, and the variation in the results, I’ve got to conclude that:

  1. Practical Internet Groupware is the de facto webscale identifier for my book.

  2. 16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.

I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.
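A few of these identifiers can already be bridged mechanically. As a small illustration of what one piece of an equivalence service might do, this Python 3 sketch converts the book’s ISBN-10 into its ISBN-13 using the standard check-digit rules. The function names are illustrative, and the other identifiers (LibraryThing work numbers, OCLC numbers, publisher slugs) have no such formula and would need lookup tables.

```python
def isbn10_is_valid(isbn10):
    """Check an ISBN-10: weighted sum (weights 10 down to 1) must divide by 11."""
    if len(isbn10) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn10):
        if char in "Xx" and position == 9:
            value = 10  # X stands for 10, allowed only as the check digit
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0

def isbn10_to_isbn13(isbn10):
    """Prefix 978, drop the old check digit, and compute the ISBN-13 check digit."""
    if not isbn10_is_valid(isbn10):
        raise ValueError("not a valid ISBN-10: %r" % isbn10)
    body = "978" + isbn10[:9]
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first twelve digits
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)

print(isbn10_to_isbn13("1565925378"))
```

This prints 9781565925373, the identifier in the O’Reilly URL above. The nine-character 156592537 from the other Amazon URL fails validation, which is a reminder that copy/paste can silently truncate an opaque ID.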

Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.

This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.

From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes[1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.

But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:

If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.

Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.

Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.

I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.

So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And it’s one that I hope more computer science students will learn to aspire to.

My guest for this week’s Innovators show, Ian Forrester, heads up the BBC’s Backstage project. Launched in 2005, Backstage lives at a cultural crossroads where legacy systems and methods intersect with their next-generation counterparts. The tagline for the feeds and APIs provided under the Backstage umbrella is “use our stuff to build your stuff.”

Admittedly that sounded a lot more exciting prior to 2006, when the BBC ended its trial of the Creative Archive service that was expected to “open the floodgates” to a “treasure trove” of cultural riches. Ian Forrester says those expectations were ratcheted back for two reasons. First, much of that treasure trove remains undigitized. Second, rights clearance proved to be an intractable problem.

So the “our stuff” that’s available to build “your stuff” turns out to be mostly metadata: news headlines, program titles and schedules. What’s more, that metadata comes from a plethora of BBC content management systems. What can you make out of these ingredients?

Here’s an evocative example: http://www.bbc.co.uk/nature/species/African_Bush_Elephant. The BBC’s Tom Scott explains:

Over the last few months we’ve been plundering the NHU’s [Natural History Unit's] archive to find the best bits — segmenting the TV programmes, tagging them (with DBpedia terms) and then aggregating them around URIs for the key concepts within the natural history domain; so that you can discover those programme segments via both the originating programme and via concepts within the natural history domain — species, habitats, adaptations and the like.

This is just the sort of remixing that Backstage ought to enable anyone, inside or outside the BBC, to achieve. Since I’m a US resident, and don’t pay the UK’s television license fee, I can’t watch the videos on that page. There’s nothing that the Backstage team can do about that. But they can take a radically open and inclusive approach to the management of the metadata that supports this remixing, and that’s just what they’re doing.

In our conversation, Ian Forrester describes how the taxonomy that governs the Backstage feeds and APIs is shared with that of Wikipedia and its structured derivative, DBpedia. Tom Scott elaborates:

You might have noticed that the slugs for our URIs (the last bit of the URL) are the same as those used by Wikipedia and DBpedia that’s because I believe in the simple joy of webscale identifiers, you will also see that much like the BBC’s music site we are transcluding the introductory text from Wikipedia to provide background information for most things. This also means that we are creating and editing Wikipedia articles where they need improving (of course you are also more than welcome to improve upon the articles).

As someone who both practices and preaches collaborative curation, I’m delighted to see the BBC taking this approach. And I love the phrase webscale identifier. Here’s how Michael Smethurst defines it, in the post pointed to by Tom Scott:

I agree with the four Linked Data rules but I’d like to try to add a fifth: if possible don’t reinvent other people’s web identifiers. By web identifiers I mean those fragments of URLs that uniquely identify a resource within a domain. So in the case of the MusicBrainz entry for The Fall (http://musicbrainz.org/artist/d5da1841-9bc8-4813-9f89-11098090148e.html) that’ll be d5da1841-9bc8-4813-9f89-11098090148e.

The last time we updated the /music site we made this mistake (kind of unavoidable at the time). Even though we linked our data to MusicBrainz we minted new identifiers for artists. So The Fall became http://www.bbc.co.uk/music/artist/jb9x/ where jb9x was the identifier. But jb9x doesn’t exist anywhere outside of /music. We’ll (hopefully) never make that mistake again.

Beautifully said. Enormous synergies have gone unrealized because web publishers have chosen to mint new namespaces rather than add value to existing ones.

What I realized when talking with Ian, though, is that there is one namespace for which the BBC is the appropriate mint, namely its own. Here, for example, are some of the family of URLs for a radio drama called The Archers:

homepage: http://www.bbc.co.uk/programmes/b006qpgr/

upcoming shows: http://www.bbc.co.uk/programmes/b006qpgr/episodes/upcoming.xml

In this example b006qpgr is, at least potentially, a webscale identifier. It’s a unique tag for the show that, if used on blogs, on Twitter, and elsewhere, would make it easy to assemble all kinds of online activity related to the show. But in fact only web developers using Backstage feeds and APIs will ever discover, or use, b006qpgr. In colloquial discourse people use The Archers.

If the BBC wants people to collaborate with its namespace in the same way that it collaborates with Wikipedia’s, this would be more inviting:

http://www.bbc.co.uk/programmes/The_Archers/

http://www.bbc.co.uk/programmes/The_Archers/episodes/upcoming.xml

It should go without saying, but right after the first rule for linked data, “Use URIs as names for things,” I would add “Where possible, choose names that make sense to people.”
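
To make that rule concrete, here is a minimal sketch of a human-readable slug in the Wikipedia style, where spaces become underscores. The function names are mine, and the URL pattern is the hypothetical one proposed above, not something the BBC actually serves.

```python
from urllib.parse import quote


def readable_slug(title: str) -> str:
    """Turn a title into a Wikipedia-style slug: spaces become underscores."""
    # quote() percent-encodes anything that is unsafe in a URL path segment.
    return quote(title.strip().replace(" ", "_"))


def programme_url(title: str, base: str = "http://www.bbc.co.uk/programmes/") -> str:
    """Build the hypothetical people-friendly programme URL suggested above."""
    return f"{base}{readable_slug(title)}/"


print(programme_url("The Archers"))
```

The same function applied to "African Bush Elephant" yields the slug that the nature site already shares with Wikipedia and DBpedia.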

For me, FriendFeed has been a new answer to an old question — namely, how to collaborate in a loosely-coupled way with people who are using, and helping to develop, an online service. The elmcity project’s FriendFeed room has been an incredibly simple and effective way to interleave curated calendar feeds, blog postings describing the evolving service that aggregates those feeds, and discussion among a growing number of curators.

In his analysis of Where FriendFeed Went Wrong, Dare Obasanjo describes the value of a handful of services (Facebook, Twitter, etc.) in terms that would make sense to non-geeks like his wife. Here’s the elevator pitch for FriendFeed:

Republish all of the content from the different social networking media websites you use onto this site. Also one place to stay connected to what people are saying on multiple social media sites instead of friending them on multiple sites.

As usual, I’m an outlying data point. I’m using FriendFeed as a lightweight, flexible aggregator of feeds from my blog and from Delicious, and as a discussion forum. These feeds report key events in the life of the project: I added a new feature to the aggregator, the curator for Saskatoon found and added a new calendar. The discussion revolves around strategies for finding or creating calendar feeds, features that curators would like me to add to the service, and problems they’re having with the service.

I doubt there’s a mainstream business model here. It’s valuable to me because I’ve created a project environment in which key events in the life of the project are already flowing through feeds that are available to be aggregated and discussed. Anyone could arrange things that way, but few people will.

It’s hugely helpful to me, though. And while I don’t know for sure that FriendFeed’s acquisition by Facebook will end my ability to use FriendFeed in this way, I do need to start thinking about how I’d replace the service.

I don’t need a lot of what FriendFeed offers. Many of the services it can aggregate — Flickr, YouTube, SlideShare — aren’t relevant. And we don’t need realtime notification. So it really boils down to a lightweight feed aggregator married to a discussion forum.

One feature that FriendFeed’s API doesn’t offer, by the way, but that I would find useful, is programmatic control of the aggregator’s registry. When a new curator shows up, I have to manually add the associated Delicious feed to the FriendFeed room. It’d be nice to automate that.

Ideally FriendFeed will coast along in a way that lets me keep using it as I currently am. If not, it wouldn’t be too hard to recreate something that provides just the subset of FriendFeed’s services that I need. But ideally, of course, I’d repurpose an existing service rather than build a new one. If you’re using something that could work, let me know.

One of the key findings of the elmcity project, so far, is that there’s a lot of calendar information online, but very little in machine-readable form. Transforming this implicit data about public events into explicit data is an important challenge. I’ve been invited to define the problem, for students who may want to tackle it as a school project. Here are the two major aspects I’ve identified.

A general scraper for calendar-like web pages

There are zillions of calendar-like web pages, like this one for Harlow’s Pub in Peterborough, NH. These ideally ought to be maintained using calendar programs that publish machine-readable iCalendar feeds which are also transformed and styled to create human-readable web pages. But that doesn’t (yet) commonly happen.

These web pages are, however, often amenable to scraping. And for a while, elmcity curators were making very effective use of FuseCal (1, 2, 3) to transform these kinds of pages into iCalendar feeds.

When that service shut down, I retained a list of the pages that elmcity curators were successfully transforming into iCalendar feeds using FuseCal. These are test cases for an HTML-to-iCalendar service. Anyone who’s handy with scraping libraries like Beautiful Soup can solve these individually. The challenge here is to create, by abstraction and generalization, an engine that can handle a significant swath of these cases.

A hybrid system for finding implicit recurring events and making them explicit

Lots of implicit calendar data online doesn’t even pretend to be calendar-like, and cannot be harvested using a scraper. Finding one-off events in this category is out of scope for my project. But finding recurring events seems promising. The singular effort required to publish one of these will pay ongoing dividends.

It’s helpful that the language people use to describe these events (“every Tuesday”, “third Saturday of every month”) is distinctive. To begin exploring this domain, I wrote a specialized search robot that looks for these patterns, in conjunction with names of places. Its output is available for all the cities and towns participating in the elmcity project. For example, this page is the output for Keene, NH. It includes more than 2000 links to web pages (or, quite often, PDF files), some fraction of which represent recurring events.
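
Here is a rough sketch of the kind of pattern matching involved. It is not the robot's actual code: the regular expression and the function name are my own illustration, and a real version would need many more phrasings.

```python
import re

# Illustrative vocabulary only; real-world phrasing varies far more than this.
WEEKDAY = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"
ORDINAL = r"(?:first|second|third|fourth|last|1st|2nd|3rd|4th)"

RECURRENCE = re.compile(
    rf"\bevery\s+(?:other\s+)?{WEEKDAY}\b"
    rf"|\b{ORDINAL}\s+{WEEKDAY}\s+of\s+(?:each|every|the)\s+month\b",
    re.IGNORECASE,
)


def find_recurrences(text):
    """Return every recurring-event phrase found in a block of text."""
    return [match.group(0) for match in RECURRENCE.finditer(text)]


print(find_recurrences("Story hour every Tuesday; book club third Saturday of every month"))
```

A match only flags a page as a candidate. Whether the event still happens is a separate question that pattern matching can't answer.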

In Finding and connecting social capital I showed a couple of cases where the pages found this way did, in fact, represent recurring events that could be added to an iCalendar feed.

To a computer scientist this looks like a problem that you might solve using a natural language parser. And I think it is partly that, but only partly. Let’s look at another example:

At first glance, this looks hopeful:

First Monday of each month: Dads Group, 105 Castle Street, Keene NH

But the real world is almost always messier than that. For starters, that image comes from the Monadnock Men’s Resource Center’s Fall 2004 newsletter. So before I add this to a calendar, I’ll want to confirm the information. The newsletter is hosted at the MMRC site. Investigation yields these observations:

  • The most recent issue of the newsletter was Winter ‘06

  • The last-modified date of the MMRC home page is September 2008

  • As of that date, the Dads Group still seems to have been active, under a slightly different name: Parent Outreach Project, DadTime Program, 355-3082

  • There’s no email address, only a phone number.

So I called the number, left a message, and will soon know the current status.

What kind of software-based system can help us scale this gnarly process? There is an algorithmic solution, surely, but it will need to operate in a hybrid environment. The initial search-driven discovery of candidate events can be done by an automated parser tuned for this domain. But the verification of candidates will need to be done by human volunteers, assisted by software that helps them:

  • Divide long lists of candidates into smaller batches

  • Work in parallel on those batches

  • Evaluate the age and provenance of candidates

  • Verify or disqualify candidates based on discoverable evidence, if possible

  • Otherwise, find appropriate email addresses (preferably) or phone numbers, and manage the back-and-forth communication required to verify or disqualify a candidate

  • Refer event sponsors to a calendar publishing how-to, and invite them to create data feeds that can reliably syndicate
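
The first two items on that list are the easy part, and a minimal sketch shows the shape of the data involved. The `Candidate` record, its fields, and the status names are all hypothetical. The hard part, the back-and-forth with event sponsors, is exactly what this leaves out.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import islice


class Status(Enum):
    UNREVIEWED = "unreviewed"
    VERIFIED = "verified"
    DISQUALIFIED = "disqualified"
    AWAITING_REPLY = "awaiting reply"


@dataclass
class Candidate:
    url: str                       # page where the recurring phrase was found
    phrase: str                    # e.g. "First Monday of each month"
    status: Status = Status.UNREVIEWED
    contact: str = ""              # email (preferred) or phone, once discovered


def batches(candidates, size):
    """Split a long candidate list into chunks volunteers can work on in parallel."""
    iterator = iter(candidates)
    while chunk := list(islice(iterator, size)):
        yield chunk


candidates = [Candidate(f"http://example.org/page{i}", "every Tuesday") for i in range(7)]
for number, batch in enumerate(batches(candidates, 3), start=1):
    print(number, [c.url for c in batch])
```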

Students endowed with the geek gene are likely to gravitate toward the first problem because it’s cleaner. But I hope I can also attract interest in the second problem. We really need people who can hack that kind of real-world messiness.

In one of my favorite scenes from one of my favorite movies, The Princess Bride, Vizzini (Wallace Shawn) has been repeatedly exclaiming: “Inconceivable!” Finally Inigo Montoya (Mandy Patinkin) responds:

You keep using that word. I do not think it means what you think it means.

I’ve already riffed on that classic bit in the titles of two other items. Now I’m compelled to do it again because when I talk about events, vis-a-vis the elmcity project, I think the word means something different from what you probably think it means.

Here’s one common meaning: major public events. These include things like artistic performances, festivals, fairs, and sporting events. They dominate the “Things to See and Do” section of every newspaper and online community guide, and are usually well publicized.

Here’s another common meaning: minor events that are often (but not always) private. These include birthday parties, house concerts, and outdoor excursions. They are, nowadays, often publicized very well in Facebook.

Although I’m happy to see major public events showing up in an elmcity hub, that isn’t my main goal. And private events, of course, don’t belong in an elmcity hub; they belong in Facebook, or in other private networks.

There’s a third kind of event that interests me most of all. It occupies a space between the other two. It’s public, but minor: a book discussion, a roadside cleanup, a support group, a squaredance. These events typically don’t show up in “Things To See And Do” guides because they’re considered too niche, and because it’s too much work — for both the publisher and the contributor — to get them included. They might show up in Facebook, but if so they will be visible there only within a closed social network.

There are tons of events in this minor-but-public category. Here’s one of my favorite examples. We were having dinner with our friends Lin and Tom recently, and Lin mentioned that Tom had just won the New Hampshire state archery tournament.

Me: “Really? Congratulations! Where was that held?”

Lin: “At the Keene Recreation Center, last Saturday.”

The Rec Center is a ten-minute walk from my house. I’d have loved to have seen those precision archers ply their trade. And it was open to the public. Anybody could have gone. But nobody knew.

Everyone I talk to has similar stories. Everyone says they find out about such things — if they find out at all — only after the fact. Everyone acknowledges that there should be a better way to inform one another about the goings-on that implicitly form much of the social capital of the community. If we can make more of it explicit, we will lead richer lives. And here I mean richer in two senses of that word. There’s the Robert Putnam sense of social well-being. And there’s the Richard Florida sense of economic well-being. If we can make more of our implicit social capital explicit, we’ll profit in both ways.

I’ve long been dissatisfied with how we discover and tune into Net radio. This iTunes screenshot illustrates the problem:

Start with a genre, pick a station in that genre, then listen to that station. This just doesn’t work for me. I like to listen to a lot of different things. And I especially value serendipitous recommendations from curators whose knowledge and preferences diverge radically from my own.

Yes there’s Pandora, but what I’ve been wanting all along is a way to enable and then subscribe to curators who guide me to what’s playing now on the live streams coming from radio stations around the world. It’s Wednesday morning, 11AM Eastern Daylight Time, and I know there are all kinds of shows playing right now. But how do I materialize a view for this moment in time — or for tonight at 9PM, or for Sunday morning at 10AM — across that breadth and wealth of live streams?

I started thinking about schedules of radio programs, and about calendars, and about BBC Backstage — because I’ll be interviewing Ian Forrester for an upcoming episode of my podcast — and I landed on this blog post which shows how to form an URL that retrieves upcoming episodes of a BBC show as an iCalendar feed.

Meanwhile, I’ve just created a new mode for the elmcity calendar aggregator. Now instead of creating a geographical hub, which combines events from Eventful and Upcoming and events from a list of iCalendar feeds — all for one location — you can create a topical hub whose events are governed only by time, not by location.

Can these ingredients combine to solve my Net radio problem? Could a curator for an elmcity topical aggregator cherrypick favorite shows from around the Net, and create a calendar that shows me what’s playing right now?

It seems plausible, so I spun up a new topical hub in the elmcity aggregator and started experimenting.

I began with the BBC’s iCalendar feeds. But evidently they don’t include VTIMEZONE components, which means calendar clients (or aggregators) can’t translate UK times to other times.

I ran into a few other issues, which perhaps can be sorted out when I chat with Ian Forrester. But meanwhile, since the universe of Net radio is much vaster than the BBC, and since most of it won’t be accessible in the form of data feeds, I stepped back for a broader view.

Really, anyone can publish an event that gives the time for a live show, plus a link to its player. And when a show happens on a regular recurring schedule, the little bit of effort it takes to publish that event pays recurring dividends.

Consider, for example, Nic Harcourt’s Sounds Eclectic. It’s on at these (Pacific) times: SUN 6:00A-8:00A, SAT 2:00P-4:00P, SAT 10:00P-12:00A. You can plug these into any calendar program as recurring events. And if you publish a feed, it’s not only available to you from any calendar client, it’s also available to any other calendar client — or to any aggregator.
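
For the curious, here is roughly what one of those recurring events looks like when written out as iCalendar text. This is a sketch: the UID, the timestamp, and the starting date (a Sunday picked for illustration) are made up, and a complete feed would wrap the event in a VCALENDAR along with a VTIMEZONE definition for the zone it names.

```python
from datetime import datetime, timedelta


def weekly_vevent(uid, summary, first_start, hours, byday, tzid, dtstamp):
    """Return one weekly recurring VEVENT as iCalendar text.

    first_start is a naive local time in the zone named by tzid;
    dtstamp is a naive datetime taken to be UTC.
    """
    fmt = "%Y%m%dT%H%M%S"
    end = first_start + timedelta(hours=hours)
    lines = [
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{dtstamp.strftime(fmt)}Z",
        f"SUMMARY:{summary}",
        f"DTSTART;TZID={tzid}:{first_start.strftime(fmt)}",
        f"DTEND;TZID={tzid}:{end.strftime(fmt)}",
        f"RRULE:FREQ=WEEKLY;BYDAY={byday}",
        "END:VEVENT",
    ]
    return "\r\n".join(lines)  # iCalendar lines end with CRLF


# The SUN 6:00A-8:00A Pacific slot, starting from an arbitrary Sunday.
event = weekly_vevent(
    uid="sounds-eclectic-sunday@example.org",   # hypothetical identifier
    summary="Sounds Eclectic",
    first_start=datetime(2009, 8, 2, 6, 0),
    hours=2,
    byday="SU",
    tzid="America/Los_Angeles",
    dtstamp=datetime(2009, 8, 1, 0, 0),
)
print(event)
```

Because the start time is pinned to a named zone rather than to UTC, the show stays at 6AM Pacific across daylight saving transitions.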

Here’s a calendar with three recurring events for Sounds Eclectic, plus one recurring event for WICN’s Sunday jazz show, plus a single non-recurring event — the BBC’s Folkscene — which will be on the BBC iPlayer on Thursday at 4:05PM my time and 9:05PM UK time. If you load the calendar feed into a client — Outlook, Apple iCal, Google Calendar, Lotus Notes — you’ll see these events translated into your local timezone.

Note that Live Calendar is especially handy for publishing events from many different timezones. That’s because like Outlook, but unlike Google Calendar, it enables you to specify timezones on a per-event basis. So instead of having to enter the Sunday morning recurrence of Sounds Eclectic as 9AM Eastern Daylight, I can enter it as 6AM Pacific Daylight Time. Likewise Folkscene: I can enter 9:05 British Summer Time. Since these are the times that appear on the shows’ websites, it’s natural to use them.
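
The arithmetic a calendar client does for you is simple once each event carries its own zone. Here is a small sketch using fixed summer offsets, which is an assumption made to keep the example self-contained. A real client would consult the timezone database instead. The dates are also just illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer offsets, assumed for illustration only.
PDT = timezone(timedelta(hours=-7), "PDT")   # Pacific Daylight Time
EDT = timezone(timedelta(hours=-4), "EDT")   # Eastern Daylight Time
BST = timezone(timedelta(hours=1), "BST")    # British Summer Time

# Enter each show in the zone its website uses...
sounds_eclectic = datetime(2009, 8, 2, 6, 0, tzinfo=PDT)
folkscene = datetime(2009, 8, 6, 21, 5, tzinfo=BST)

# ...and let the software translate to the listener's zone.
for show in (sounds_eclectic, folkscene):
    print(show.astimezone(EDT).strftime("%H:%M %Z"))
```

Run as written, this should report 09:00 and 16:05 Eastern, matching the times mentioned above.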

This sort of calendar is great for personal use. But I’m looking for the Webjay of Net radio. And I think maybe elmcity topical hubs can help enable that.

There’s a way of using these topical hubs I hadn’t thought of until Tony Karrer created one. Tony runs TechEmpower, a software, web, and eLearning development firm. He wants to track and publish online eLearning events, so he’s managing them in Google Calendar and syndicating them through an elmcity topical hub to his website.

A topical hub, like a geographic hub, is controlled by a Delicious account whose owner maintains a list of feeds. I’d been thinking of the account owner as the curator, and of the feeds as homogeneous sources of events: school board meetings, soccer games, and so on.

But then Tony partnered with another organization that tracks webinars, invited that group to publish its own feed, added it to the eLearning hub, and wrote a blog post entitled Second Calendar Curator Joins to Help with List of Free Webinars:

The initial list of calendar entries, we added ourselves. But I’m pleased to announce that we’ve just signed up our second calendar curator – Coaching Ourselves. Their events are now appearing in the listings. … It is exactly because we can distribute the load of keeping this list current that makes me think this will work really well in the long run.

This probably shouldn’t have surprised me, but it did. I’d been thinking in terms of curators, feeds, and events. What Tony showed me is that you can also (optionally) think in terms of meta-curators, curators, feeds, and events. In this example, Tony is himself a curator, but he is also a meta-curator — that is, a collector of curators.

I’d love to see this model evolve in the realm of Net radio. If you want to join the experiment, just use any calendar program to keep track of some of your favorite recurring shows. (Again, it’s very helpful to use one that supports per-event timezones.) Then publish the shows as an iCalendar feed, and send me the URL. As the meta-curator of delicious.com/InternetRadio, as well as the curator of jonu.calendar.live.com/calendar/InternetRadio/index.html, I’ll have two options. If I like most or all of the shows you like, I can add your feed to the hub. If I only like some of the shows you like, I can cherrypick them for my feed. Either way, the aggregated results will be available as XML, as JSON, and as an iCalendar feed that can flow into calendar clients or aggregators.

Naturally there can also be other meta-curators. To become one, designate a Delicious account for the purpose, spin up your own topical hub, and tell me about it.

My guest for this week’s Innovators show is Cathy Marshall, a Senior Researcher in Microsoft’s Silicon Valley Lab. She’s long been intrigued by personal information management — and nowadays, also by its social dimension.

We kicked off the conversation with a discussion of her recent paper Do Tags Work?. (See also her slides from a talk about the project.) This was a clever study in which she collected a bunch of Flickr photos of people spinning on the bull’s balls in Milan. Notice how that fulltext query effectively retrieves a pile of images, taken by different people, of the same curious custom:

If you are passing through the Galleria Vittorio Emanuele II, you should spin around on the testicles of the bull mosaic found in the centre. Legend has it that this will bring you good luck!

Now try this query, which uses the same terms but looks at tags instead of the free text (title, description) associated with the photos. It finds nothing.

Cathy concludes that while many people think tags are effective hooks for information retrieval, they really aren’t.

Of course, those of us who attend conferences where the first order of business is to announce a tag know that tags can be a very effective way to aggregate all the blog postings, tweets, and photos associated with an event. Folksonomies that aren’t intended to converge don’t. Those that are meant to converge do, quite dramatically, which is why I’ve long been obsessed with intentional tagging as an enabler of loosely-coupled collaboration.

In the second half of the conversation we discussed personal digital archiving, curation, benign neglect, and lifestreams. Cathy tells a lot of stories about the ways in which people do, and also don’t, take care of their digital stuff. She observes, for example, that when people lose the contents of a computer, they react initially with horror, but then often feel a sense of relief. It turns out a lot of what was there wasn’t really needed. The burden of culling through it is lifted, and the guilt associated with not doing that culling goes away.

(I laughed harder than I have in a long time when Cathy described rental storage units as “garbage cans you pay for, and then when you realize you no longer care about the stuff in them, you stop paying for.”)

We ended by agreeing that the hardest thing about introducing a hosted lifebits service ecosystem will be the conceptual model. For psychological reasons, people will want to think in terms of monolithic containers that keep stuff in one place, and monolithic services that do everything related to that stuff. For architectural reasons, though, we’ll want to federate storage, and also decouple classes of service — so that storage, for example, is orthogonal to access control and authorization, which is orthogonal to social interaction.

In February 2007, Mike Adams, who had recently joined Automattic, the company that makes WordPress, decided on a lark to endow all blogs running on WordPress.com with the ability to use LaTeX, the venerable mathematical typesetting language. So I can write this:

$latex \pi r^2$

And produce this:

\pi r^2

When he introduced the feature, Mike wrote:

Odd as it may sound, I miss all the equations from my days in grad school, so I decided that what WordPress.com needed most was a hot, niche feature that maybe 17 people would use regularly.

A whole lot more than 17 people cared. And some of them, it turns out, are Fields medalists. Back in January, one member of that elite group — Tim Gowers — asked: Is massively collaborative mathematics possible? Since then, as reported by observer/participant Michael Nielsen (1, 2), Tim Gowers, Terence Tao, and a bunch of their peers have been pioneering a massively collaborative approach to solving hard mathematical problems.

Reflecting on the outcome of the first polymath experiment, Michael Nielsen wrote:

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

Just this week, a polymath blog has emerged to serve as an online home for the further evolution of this approach.

I am completely unqualified to evaluate the nature of mathematical discourse that’s going on in these polymath collaborations, or the claims being made regarding outcomes. But it sure makes my spidey-sense tingle.

I am, however, qualified to evaluate the nature of the collaborative methods being employed. And on that front, I’m amused (and chagrined) to recall something I wrote back in 2000, in a report called Internet groupware for scientific collaboration. The report was commissioned by Greg Wilson, who organized this week’s Science 2.0 event in Toronto. At that event, my report served as a historical frame for the polymath experimentation that’s going on right now, and that Michael Nielsen discussed at the Toronto event in an updated version of this talk.

In my 2000 report I said:

TeX and LaTeX define scientific publishing for a generation of scientists. But these formats don’t integrate directly into the shared spaces of the Web. The rise of XML as a universal markup language, along with vocabularies such as MathML (for mathematical notation) and SVG (for scalable vector graphics), suggests that the Web may yet reach its original collaborative goal.

Why didn’t I see, then, that the crux of the issue wasn’t XML and MathML and SVG, but rather the ability to “integrate directly into the shared spaces of the Web”? And that what ought to be integrated directly was the typesetting language already familiar to mathematicians, namely LaTeX?

The answer is that I needed (and still need) to be reminded that good-enough solutions that are here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.

From that perspective, I’m wondering what will and won’t turn out to be good enough for the polymathematicians. The current setup is admittedly imperfect, and they’re now beginning to explore WordPress plugins that enable, for example, more powerful ways to organize, reply to, and refer to one another’s comments.

I don’t think anybody yet knows what the right tooling will be for polymathematical collaboration. The ones who are best qualified to figure it out are the polymathematical collaborators themselves, but they are not WordPress plugin developers.

What’s needed is what Eric von Hippel calls a user innovation toolkit. The idea is this: Leading users, as they employ a tool, also modify it, and in so doing they express intentions that tool developers can then capture and formalize.

If you look at the systems of notation that the polymathematicians are creating in order to organize and refer to their contributions in these long and complex threads of mathematical discourse, you can see intentions being expressed. So arguably, WordPress is a user innovation toolkit, and we’ll see these innovations codified in future plugins. I’ll be watching with great interest.

Update: As per Jonathan Fine’s comment below, it appears that MathTran.org has offered the same kind of service for quite a while now:

On this week’s Innovators show I got together with two of the authors of a new proposal for representing iCalendar in XML. Mike Douglass is lead developer of the Bedework Calendar System, and Steven Lees is Microsoft’s program manager for FeedSync and chair of the XML technical committee in CalConnect, the Calendaring and Scheduling Consortium.

What’s proposed is no more, but no less, than a well-defined two-way mapping between the current non-XML-based iCalendar format and an equivalent XML format. So, for example, here’s an event — the first low tide of 2009 in Myrtle Beach, SC — in iCalendar format:

BEGIN:VEVENT
SUMMARY:Low Tide 0.39 ft
DTSTART:20090101T090000Z
UID:2009.0
DTSTAMP:20080527T000001Z
END:VEVENT

And here’s the equivalent XML:

<vevent>
  <properties>
    <dtstamp>
      <date-time utc='yes'>
        <year>2008</year><month>5</month><day>27</day>
        <hour>0</hour><minute>0</minute><second>1</second>
      </date-time>
    </dtstamp>
    <dtstart>
      <date-time utc='yes'>
        <year>2009</year><month>1</month><day>1</day>
        <hour>9</hour><minute>0</minute><second>0</second>
      </date-time>
    </dtstart>
    <summary>
      <text>Low Tide 0.39 ft</text>
    </summary>
    <uid>
      <text>2009.0</text>
    </uid>
  </properties>
</vevent>

The mapping is quite straightforward, as you can see. At first glance, the XML version just seems verbose. So why bother? Because the iCalendar format can be tricky to read and write, either directly (using eyes and hands) or indirectly (using software). That’s especially true when, as is typical, events include longer chunks of text than you see here.
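To make the mapping concrete, here is a minimal sketch in Python that turns the VEVENT above into the XML shape shown. It is not the proposal's reference implementation. The function names are hypothetical, and it assumes unfolded lines, no property parameters, and UTC timestamps only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def date_time_element(value):
    """Turn an iCalendar UTC stamp such as 20090101T090000Z into a <date-time> element."""
    stamp = datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    element = ET.Element("date-time", {"utc": "yes"})
    for name in ("year", "month", "day", "hour", "minute", "second"):
        ET.SubElement(element, name).text = str(getattr(stamp, name))
    return element

def text_element(value):
    """Wrap a plain value in a <text> element."""
    element = ET.Element("text")
    element.text = value
    return element

def vevent_to_xml(lines):
    """Map the lines of one VEVENT to a <vevent> element (simplified: no parameters, no folding)."""
    vevent = ET.Element("vevent")
    properties = ET.SubElement(vevent, "properties")
    for line in lines:
        name, _, value = line.partition(":")
        if name in ("BEGIN", "END"):
            continue
        prop = ET.SubElement(properties, name.lower())
        if name in ("DTSTART", "DTSTAMP"):
            prop.append(date_time_element(value))
        else:
            prop.append(text_element(value))
    return vevent

ICAL = """BEGIN:VEVENT
SUMMARY:Low Tide 0.39 ft
DTSTART:20090101T090000Z
UID:2009.0
DTSTAMP:20080527T000001Z
END:VEVENT"""

event = vevent_to_xml(ICAL.splitlines())
print(ET.tostring(event, encoding="unicode"))
```

Going the other way is the same walk in reverse, which is the point of a well-defined two-way mapping.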

I make an analogy to the RSS ecosystem. When I published my first RSS feed a decade ago, I wrote it by hand. More specifically, I copied an existing feed as a template, and altered it using cut-and-paste. Soon afterward, I wrote the first of countless scripts that flowed data through similar templates to produce various kinds of RSS feeds.

Lots of other people did the same, and that’s part of the reason why we now have a robust network of RSS and Atom feeds that carries not only blogs, but all kinds of data packets.

Another part of the reason is the Feed Validator which, thanks to heroic efforts by Mark Pilgrim and Sam Ruby, became and remains the essential sanity check for anybody who’s whipping up an ad-hoc RSS or Atom feed.

No such ecosystem exists for iCalendar. I’ve been working hard to show why we need one, but the most compelling rationale comes from a Scott Adams essay that I quoted from in this blog entry. Dilbert’s creator wrote:

I think the biggest software revolution of the future is that the calendar will be the organizing filter for most of the information flowing into your life. You think you are bombarded with too much information every day, but in reality it is just the timing of the information that is wrong. Once the calendar becomes the organizing paradigm and filter, it won’t seem as if there is so much.

If you buy that argument, then we’re going to need more than a handful of applications that can reliably create and exchange calendar data. We’ll want anyone to be able to whip up a calendar feed as easily as anyone can now whip up an RSS/Atom feed.

We’ll also need more than a handful of parsers that can reliably read calendar feeds, so that thousands of ad-hoc applications, services, and scripts will be able to consume all the new streams of time-and-date-oriented information.

I think that a standard XML representation of iCalendar will enable lots of ad-hoc producers and consumers to get into the game, and collectively bootstrap this new ecosystem. And that will enable what Scott Adams envisions.

Here’s a small but evocative example. Yesterday I started up a new instance of the elmcity aggregator for Myrtle Beach, SC. The curator, Dave Slusher, found a tide table for his location, and it offers an iCalendar feed. So the Myrtle Beach calendar for today begins like this:

Thu Jul 23 2009

WeeHours

Thu 03:07 AM Low Tide -0.58 ft (Tide Table for Myrtle Beach, SC)

Morning

Thu 06:21 AM Sunrise 6:21 AM EDT (Tide Table for Myrtle Beach, SC)
Thu 09:09 AM High Tide 5.99 ft (Tide Table for Myrtle Beach, SC)
Thu 10:00 AM Free Coffee Fridays (eventful: )
Thu 10:00 AM Summer Arts Project at The Market Common (eventful: )
Thu 10:00 AM E.B. Lewis: Story Painter (eventful: )

Imagine this kind of thing happening on the scale of the RSS/Atom feed ecosystem. The lack of an agreed-upon XML representation for iCalendar isn’t the only reason why we don’t have an equally vibrant ecosystem of calendar feeds. But it’s an impediment that can be swept away, and I hope this proposal will finally do that.

My guest for this week’s Innovators show is Peter O’Toole from mTuitive, a company whose authoring toolkit for clinical data collection I featured in a 2006 screencast. mTuitive is working at the intersection of a number of disciplines that all need to come together to deliver cheaper and better health care.

First, usability. Designing clinical data gathering systems that capture what’s right for the patient, along with what’s mandated by the insurance company, requires a careful balancing of constraints and freedom in software user interfaces.

Second, knowledge engineering. Clinical systems don’t merely record data, they embody medical protocols that reflect an ever-changing consensus about methods and best practices. mTuitive’s authoring system aims to enable leading practitioners to encode that knowledge in ways that can then guide others. But knowledge grows at the edge as well as at the center. So mTuitive also enables practitioners to extend and modify the software, injecting local knowledge and custom. Who owns this knowledge? Who’s liable for the consequences of its use? These are some of the implications we discussed.

Third, semantics. Electronic medical records are still mainly narrative in form, says Peter O’Toole. But we’re moving toward more computable ways of describing observations about, say, the nature and size of tumors.

Fourth, social software. My hunch, and Peter O’Toole’s too, is that progress toward the nirvana of medical records that are both semantically rich and interoperable will be powered by a two-stroke engine. One stroke of the piston will be driven by centrally-defined standards and centrally-imposed legislation. But the other will be driven by networked collaboration, at the edge, among doctors who pool and codify their experiential knowledge using ad-hoc, Web 2.0-like methods.

Until recently, the elmcity calendar aggregator was running as a single instance of an Azure worker role. The idea all along, of course, was to exploit the system’s ability to farm out the work of aggregation to many workers. Although the sixteen cities currently being aggregated don’t yet require the service to scale beyond a single instance, I’d been meaning to lay the foundation for that. This week I finally did.

Will there ever be hundreds or thousands of participating cities and towns? Maybe that’ll happen, maybe it won’t, but the gating factor will not be my ability to babysit servers. That’s a remarkable change from just a few years ago. Over the weekend I read Scott Rosenberg’s new history of blogging, Say Everything. Here’s a poignant moment from 2001:

Blogger still lived a touch-and-go existence. Its expenses had dropped from a $50,000-a-month burn rate to a few thousand in rent and technical costs for bandwidth and such; still, even that modest budget wasn’t easy to meet. Eventually [Evan] Williams had to shut down the office entirely and move the servers into his apartment. He remembers this period as an emotional rollercoaster. “I don’t know how I’m going to pay the rent, and I can’t figure that out because the server’s not running, and I have to stay up all night, trying to figure out Linux, and being hacked, and then fix that.”

I’ve been one of those guys who babysits the server under the desk, and I’m glad I won’t ever have to go back there again. What I will have to do, instead, is learn how to take advantage of the cloud resources now becoming available. But I’m finding that to be an enjoyable challenge.

In the case of the calendar aggregator, which needs to map many worker roles to many cities, I’m using a blackboard approach. Here’s a snapshot of it, from an aggregator run using only a single worker instance:

     id: westlafcals
  start: 7/14/2009 12:12:05 PM
   stop: 7/14/2009 12:14:46 PM
running: False

     id: networksierra
  start: 7/14/2009 12:14:48 PM
   stop: 7/14/2009 12:15:05 PM
running: False

     id: localist
  start: 7/14/2009 12:15:06 PM
   stop: 7/14/2009  5:37:03 AM
running: True

     id: aroundfred
  start: 7/14/2009  5:37:05 AM
   stop: 7/14/2009  5:39:20 AM
running: False

The moving finger wrote westlafcals (West Lafayette) and networksierra (Sonora), it’s now writing localist (Baltimore), and will next write aroundfred (Fredericksburg).

Here’s a snapshot from another run using two worker instances:

     id: westlafcals
  start: 7/14/2009 10:12:05 PM
   stop: 7/14/2009  4:37:03 AM
running: True

     id: networksierra
  start: 7/14/2009 10:12:10 PM
   stop: 7/14/2009 10:13:05 PM
running: False

     id: localist
  start: 7/14/2009 10:13:06 PM
   stop: 7/14/2009  4:41:12 AM
running: True

     id: aroundfred
  start: 7/14/2009  4:41:05 AM
   stop: 7/14/2009  4:42:20 AM
running: False

Now there are two moving fingers. One’s writing westlafcals, one has written networksierra, one’s writing localist, and one or the other will soon write aroundfred. The total elapsed time will be very close to half what it was in the single-instance case. I’d love to crank up the instance count and see an aggregation run rip through all the cities in no time flat. But the Azure beta caps the instance count at two.

The blackboard is an Azure table with one record for each city. Records are flexible bags of name/value pairs. If you make a REST call to the table service to query for one of those records, the Atom payload that comes back looks like this:

<m:properties>
   <d:PartitionKey>blackboard</d:PartitionKey>
   <d:RowKey>aroundfred</d:RowKey>
   <d:start>7/14/2009 4:41:05 AM</d:start>
   <d:stop>7/14/2009 4:42:20 AM</d:stop>
   <d:running>False</d:running>
</m:properties>
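Reading such a record back into a bag of name/value pairs takes only a few lines. This is a rough sketch, not the aggregator's actual code. The excerpt omits the namespace declarations, so the URIs below are an assumption based on the ADO.NET Data Services Atom format, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the d: prefix; the excerpt does not show the declarations.
DATA_NS = "http://schemas.microsoft.com/ado/2007/08/dataservices"

PAYLOAD = """<m:properties
    xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
    xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
   <d:PartitionKey>blackboard</d:PartitionKey>
   <d:RowKey>aroundfred</d:RowKey>
   <d:start>7/14/2009 4:41:05 AM</d:start>
   <d:stop>7/14/2009 4:42:20 AM</d:stop>
   <d:running>False</d:running>
</m:properties>"""

def record_from_properties(xml_text):
    """Return the d:-namespaced children of <m:properties> as a plain dict of strings."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DATA_NS + "}"
    return {
        child.tag[len(prefix):]: child.text
        for child in root
        if child.tag.startswith(prefix)
    }

record = record_from_properties(PAYLOAD)
print(record["RowKey"], record["running"])
```

Every value comes back as a string here, so a caller would still need to convert start, stop, and running to dates and a boolean.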

At the start of a cycle, each worker wakes up, iterates through all the cities, aggregates those not claimed by other workers, and then sleeps until the next cycle. To claim a city, a worker tries to create a record in a parallel Azure table, using the PartitionKey locks instead of blackboard. If the worker succeeds in doing that, it considers the city locked for its own use, it aggregates the city’s calendars, and then it deletes the lock record. If the worker fails to create that record, it considers the city locked by another worker and moves on.

This cycle is currently one hour. But in order to respect the various services it pulls from, the service defines the interval between aggregation runs to be 8 hours. So when a worker claims a city, it first checks to see if the last aggregation started more than 8 hours ago. If not, the worker skips that city.

Locks can be abandoned. That could happen if a worker hangs or crashes, or when I redeploy a new version of the service. So the worker also checks to see if a lock has been hanging around longer than the aggregation interval. If so, it overrides the lock and aggregates that city.
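The claim, skip, and override rules above can be sketched as follows. This is an illustration under stated assumptions, not the service's code: an in-memory dictionary stands in for the Azure locks table, and LockTable, claim, and run_cycle are hypothetical names. The one property borrowed from the real table is that creating a record fails when the record already exists.

```python
from datetime import datetime, timedelta

AGGREGATION_INTERVAL = timedelta(hours=8)  # interval between aggregation runs, per the text

class LockTable:
    """In-memory stand-in for the 'locks' table (hypothetical helper)."""

    def __init__(self):
        self.rows = {}  # city id -> time the lock was created

    def try_create(self, city, now):
        """Create a lock row; fail if one already exists, as a duplicate insert would."""
        if city in self.rows:
            return False
        self.rows[city] = now
        return True

    def delete(self, city):
        self.rows.pop(city, None)

def claim(city, locks, last_start, now):
    """Return True if this worker should aggregate the city now."""
    started = last_start.get(city)
    if started is not None and now - started <= AGGREGATION_INTERVAL:
        return False  # aggregated recently, skip
    if locks.try_create(city, now):
        return True  # we hold the lock
    if now - locks.rows[city] > AGGREGATION_INTERVAL:
        locks.rows[city] = now  # abandoned lock, override it
        return True
    return False  # locked by another worker

def run_cycle(cities, locks, last_start, now, aggregate):
    """One wake-up of one worker: walk all cities, aggregate those it can claim."""
    for city in cities:
        if claim(city, locks, last_start, now):
            last_start[city] = now
            try:
                aggregate(city)
            finally:
                locks.delete(city)
```

In the sketch, as in the scheme it illustrates, the override step is not atomic, so two workers can occasionally claim the same stale lock.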

I’m sure this scheme isn’t bulletproof, but I reckon it doesn’t need to be. If two workers should happen to wind up aggregating the same city at about the same time, it’s no big deal. The last writer wins, a little extra work gets done.

Anyway, I’ll be watching the blackboard over the next few days. There’s undoubtedly more tinkering to do. And it’s a lot more fun than herding servers.

Although I haven’t been able to confirm this officially yet, it looks like FuseCal, the HTML screen-scraping service that I’ve been using (and recommending) as a way to convert calendar-like web pages into iCalendar feeds, has shut down.

The web pages that FuseCal has been successfully processing, for several curators participating in the elmcity project, are listed below. They’re a kind of existence proof, validating the notion that unstructured calendar info — what people intuitively create — can be mechanically transformed into structured info that syndicates reliably.

I hope this service, or some future variant of it, will continue. It’s a really useful way to help people grasp the concept of publishing calendar feeds.

But in the long run, it’s a set of training wheels. Ultimately we need to teach people why and how to produce such feeds more directly. All of the event information shown below could be managed in a more structured way using calendar software that produces data feeds for syndication and web pages for browsing.

More broadly, incidents like this prompt us to consider the nature of the services ecosystem we’re all embedded in — as users and, increasingly, as co-creators. In the software business, developers have long since learned to evaluate the benefits and risks of “taking a dependency” on a component, library, or service. Users didn’t have to think too much about that. A software product that was discontinued would keep working perfectly well, maybe for years. But services can — and sometimes do — stop abruptly.

Since the elmcity project is embedded in a services ecosystem, as both a provider and a consumer, how should a curator evaluate service dependencies and their associated risks and benefits? Here are some guidelines.

Many eggs, many baskets

An instance of the calendar aggregator gathers events from three main sources: Eventful (service #1), Upcoming (service #2), and a curated set of iCalendar feeds. A subset of those feeds may (until recently) have been mediated by FuseCal (service #3). So there were three main service dependencies here, and that’s one form of diversification.

But the iCalendar feeds represent another, and more powerful, form of diversification. One may be served up by a Drupal system, one may be an ICS file posted from Outlook 2007, one may be an instance of Google Calendar. Each depends on its own supporting services, but the ecosystem is very diverse.

Data and service portability

The elmcity project isn’t a database of events, but rather an aggregator of feeds of events. What matters in this case is portability of metadata describing the feeds, as well as data describing events. The system depends on Delicious for the management of the metadata. But all this metadata is replicated to Azure for safekeeping.

Since the elmcity project does run on Azure, there’s clearly a strong dependence on that platform’s compute and storage services. But I could run the code on another host — even another cloud-based host, thanks to Amazon’s EC2 for Windows. Likewise I could store blobs and tables in Amazon’s S3 and SimpleDB.

Strategic choices

In this context, the use of FuseCal was a strategic choice. There isn’t a readily available replacement, and that’s a recipe for the sort of disruption we’ve just experienced. But since the system is diversified, that disruption is contained. Was the benefit provided by this unique service worth the cost of disruption? Some curators may disagree, but I think the answer is yes. It was really helpful to be able to show people that informational web pages are implicitly data feeds, and to show what can happen when those data feeds are made explicit.

Still, it was a crutch. Ultimately we want people to stand on their own two feet, and take direct control of the information they publish to the web. FuseCal had to guess which times went with which events, and sometimes guessed wrong. If you’re publishing the event, you want to state these facts unambiguously. And using a variety of methods, as I’ve shown, you can. Those methods are the real strategic choices. If you can publish your own data feed, simply and inexpensively, you should seize the opportunity to do so.


Calendar pages successfully parsed by FuseCal

prescottaz

fallschurchcals

ottawacals

snoqualmie

mashablecity

elmcity

a2cal

whyhuntington

In the latest installment of my Innovators podcast, which ran while I was away on vacation, I spoke with Steven Willmott of 3scale, one of several companies in the emerging business of third-party API management. As more organizations get into the game of providing APIs to their online data, there’s a growing need for help in the design and management of those APIs.

By way of demonstration, 3scale is providing an unofficial API to some of the datasets offered by the United Nations. The UN data at http://data.un.org, while browseable and downloadable, is not programmatically accessible. If you visit 3scale’s demo at www.undata-api.org/ you can sign up for an access key, ask for available datasets — mostly, so far, from the World Health Organization (see below) — and then query them.

The query capability is rather limited. For a given measure, like Births by caesarean section (percent), you can select subsets by country or by year, but you can’t query or order by values. And you can’t make correlations across tables in one query.

It’s just a demo, of course. If 3scale wanted to invest more effort, a more robust query system could be built. The fact that such a system can be built by an unofficial intermediary, rather than by the provider of the data, is quite interesting.

As I watch this data publication meme spread, here’s something that interests me even more. These efforts don’t really reflect the Web 2.0 values of engagement and participation to the extent they could. We’re now very focused on opening up flexible means of access to data. But the conversation is still framed in terms of a producer/consumer relationship that isn’t itself much discussed.

At the end of this entry you’ll find a list of WHO datasets. Here’s one: Community and traditional health workers density (per 10,000 population). What kinds of questions do we think we might try to answer by counting this category of worker? What kinds of questions can’t we try to answer using the datasets WHO is collecting? How might we therefore want to try to influence the WHO’s data-gathering efforts, and those of other public health organizations?

“Give us the data” is an easy slogan to chant. And there’s no doubt that much good will come from poking through what we are given. But we also need to have ideas about what we want the data for, and communicate those ideas to the providers who are gathering it on our behalf.


Adolescent fertility rate
Adult literacy rate (percent)
Gross national income per capita (PPP international $)
Net primary school enrolment ratio female (percent)
Net primary school enrolment ratio male (percent)
Population (in thousands) total
Population annual growth rate (percent)
Population in urban areas (percent)
Population living below the poverty line (percent living on less than US$1 per day)
Population median age (years)
Population proportion over 60 (percent)
Population proportion under 15 (percent)
Registration coverage of births (percent)
Registration coverage of deaths (percent)
Total fertility rate (per woman)
Antenatal care coverage – at least four visits (percent)
Antiretroviral therapy coverage among HIV-infected pregnant women for PMTCT (percent)
Antiretroviral therapy coverage among people with advanced HIV infections (percent)
Births attended by skilled health personnel (percent)
Births by caesarean section (percent)
Children aged 6-59 months who received vitamin A supplementation (percent)
Children aged less than 5 years sleeping under insecticide-treated nets (percent)
Children aged less than 5 years who received any antimalarial treatment for fever (percent)
Children aged less than 5 years with ARI symptoms taken to facility (percent)
Children aged less than 5 years with diarrhoea receiving ORT (percent)
Contraceptive prevalence (percent)
Neonates protected at birth against neonatal tetanus (PAB) (percent)
One-year-olds immunized with MCV
One-year-olds immunized with three doses of Hepatitis B (HepB3) (percent)
One-year-olds immunized with three doses of Hib (Hib3) vaccine (percent)
One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (percent)
Tuberculosis detection rate under DOTS (percent)
Tuberculosis treatment success under DOTS (percent)
Women who have had PAP smear (percent)
Women who have had mammography (percent)
Community and traditional health workers density (per 10 000 population)
Dentistry personnel density (per 10 000 population)
Environment and public health workers density (per 10 000 population)
External resources for health as percentage of total expenditure on health
General government expenditure on health as percentage of total expenditure on health
General government expenditure on health as percentage of total government expenditure
Hospital beds (per 10 000 population)
Laboratory health workers density (per 10 000 population)
Number of community and traditional health workers
Number of dentistry personnel
Number of environment and public health workers
Number of laboratory health workers
Number of nursing and midwifery personnel
Number of other health service providers
Number of pharmaceutical personnel
Nursing and midwifery personnel density (per 10 000 population)
Other health service providers density (per 10 000 population)
Out-of-pocket expenditure as percentage of private expenditure on health
Per capita total expenditure on health (PPP int. $)
Per capita total expenditure on health at average exchange rate (US$)
Pharmaceutical personnel density (per 10 000 population)
Physicians density (per 10 000 population)
Private expenditure on health as percentage of total expenditure on health
Private prepaid plans as percentage of private expenditure on health
Ratio of health management and support workers to health service providers
Ratio of nurses and midwives to physicians
Social security expenditure on health as percentage of general government expenditure on health
Total expenditure on health as percentage of gross domestic product

My recent adventure in naming the times of day was so much fun that I lost track of the original purpose of the exercise, which was to improve accessibility for sight-impaired users.

When I interspersed time-of-day labels into each day’s event listing, I used HTML DIV tags. Wrong, wrong, wrong! Those labels are structural elements, and as my accessibility consultant Susan Gerhart gently reminded me, screen readers depend on HTML headings to find and announce them. The labels should have been second-level headings, i.e., HTML H2 tags.

It gets worse. When Susan prompted me to take another look at what I’d done, I found that the date labels were inexplicably tagged as paragraphs (P) instead of the top-level headers (H1) that they logically are.

Oh. Right. Of course. Duh. Fixed. Sorry.
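For the record, the corrected structure looks roughly like what this sketch produces: an H1 for the date and an H2 for each time-of-day label. The function name is hypothetical, and rendering the events as a UL list is my assumption for illustration, not a description of the aggregator's actual markup.

```python
from html import escape

def render_day(date_label, sections):
    """Render one day: <h1> for the date, <h2> per time-of-day label, events as list items."""
    parts = ["<h1>%s</h1>" % escape(date_label)]
    for label, events in sections:
        parts.append("<h2>%s</h2>" % escape(label))
        parts.append("<ul>")
        parts.extend("<li>%s</li>" % escape(event) for event in events)
        parts.append("</ul>")
    return "\n".join(parts)

print(render_day("Thu Jul 23 2009", [
    ("Morning", ["06:21 AM Sunrise", "09:09 AM High Tide 5.99 ft"]),
]))
```

With headings in place, a screen reader can jump from day to day and from label to label instead of reading every event in sequence.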

What was I thinking? How could somebody like me, who has preached about the attention-focusing power of heads, decks, and leads, screw up something so basic as this?

Easily, as it turns out, in the absence of feedback. If you yourself don’t depend on a design feature, there is a natural tendency to forget why it matters to others.

Coincidentally (or not) Susan recently wrote an essay, and published a companion audio recording, that will help me — and I hope others — not to forget again. Entitled Hear Me Stumble Around White House, Recovery, and Data GOV web sites, it’s a blow-by-blow account of her efforts to navigate those sites with a screen reader.

In this recording you can hear Susan and her screen reader trying to make sense of whitehouse.gov. If you’ve never heard a screen reader in action, it’s worth listening for that alone. You’ll get a very clear sense of how these tools depend on the hierarchy of the page.

Simultaneously you’ll hear Susan narrate her intention — to read an article about cybersecurity — and her frustration. For example:

I was thrown off by the slide show at the top of the page. Once I hit the cybersecurity story, the next time I traverse this section the story was about the Supreme Court nominee.

Despite this randomness, the page does at least identify the top stories with H1 tags. And Signed Legislation is an H2. But none of the headlines under Signed Legislation are H3s, they’re Ps.

Over at recovery.gov and data.gov Susan finds no headings at all, and reacts to their omissions less gently than she did to mine:

It’s the headings, stupid!!!

Thanks. I will try not to forget that again.


PS: In a follow-up to her blog essay, Susan links to detailed reports by accessibility pioneer Jim Thatcher on the issues he found with data.gov and recovery.gov.

When I invited folks to become calendar curators for the elmcity project, the person who stepped forward in Prescott AZ was Susan Gerhart, whom I interviewed here. One of her great insights about web design is that the right thing for a vision-impaired user is almost always also the right thing for everyone. She calls this the curb cuts principle:

Curb cuts for wheelchairs also guide blind persons into street crossings and prevent accidents for baby strollers, bicyclists, skateboarders, and inattentive walkers.

So I shouldn’t have been surprised when Susan noticed that the HTML rendering of the calendar needed some curb cuts. Within each day, the events show up as a long undifferentiated list. She suggested that subdividing the list by time of day (morning, afternoon, evening) will be helpful to folks using screen readers. But in fact, it’s just plain helpful. So I’m testing a version of that idea now.

Ironically I was just thinking about this same principle in another context. The new version of Oakland Crimespotting, which I raved about, segments incidents using this vocabulary:

light, dark, commute, nightlife, day, night, swing shift

In that spirit, I’m trying this:

morning, lunch, afternoon, evening, night

This of course leads to the question: When do these times begin and end?

I was fascinated to see that both Google and Bing return the same Yahoo answers page for the query morning afternoon evening.

For now, though, I’m going with this ruleset:

  Morning:  5:00 AM to 11:30 AM
    Lunch: 11:30 AM to  1:00 PM
Afternoon:  1:30 PM to  5:30 PM
  Evening:  5:30 PM to  9:00 PM
    Night:  9:00 PM to  5:00 AM

But I’ll make these rules — and maybe even the time-of-day names — configurable on a per-location basis.
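Here is a minimal sketch of how such a configurable ruleset might be applied. The start times come from the table above. The assumption that each label lasts until the next label's start time is mine, and so is the function name. Passing a different rules list is how a per-location override would work in this sketch.

```python
from bisect import bisect_right
from datetime import time

# Start times from the ruleset above, in ascending order.
# Assumption: each label runs until the next label starts.
RULES = [
    (time(5, 0), "Morning"),
    (time(11, 30), "Lunch"),
    (time(13, 30), "Afternoon"),
    (time(17, 30), "Evening"),
    (time(21, 0), "Night"),
]

def time_of_day(t, rules=RULES):
    """Return the label whose start time is the latest one not after t."""
    starts = [start for start, _ in rules]
    index = bisect_right(starts, t) - 1
    # index is -1 before the first start time; that wraps to the last label,
    # which is what we want because Night runs past midnight.
    return rules[index][1]

print(time_of_day(time(9, 9)))   # a 9:09 AM event
print(time_of_day(time(3, 7)))   # a 3:07 AM event
```

Keeping only start times means a location can add, rename, or remove a label without having to keep end times consistent.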

In his writeup on Google Wave, Dare Obasanjo says:

I’m sure there are thousands of Web developers out there right now asking themselves “would my app be better if users could see each others’ edits in real time?”,”should we add a playback feature to our service as well” [ed note - wikipedia could really use this] and “why don’t we support seamless drag and drop in our application?”. All inspired by their exposure to Google Wave.

Indeed, every application that preserves a change history needs playback. Wikipedia, as Dare notes, is a prime candidate. Back in 2006, I made this LazyWeb request:

Animation is the best way to visualize the flow of change, as I discovered when I made my Wikipedia screencast. For Wikipedia, and indeed for all kinds of living documents supported by revision history and diff tools, I can imagine being able to isolate a paragraph or section and autogenerate the screencast of its evolution. I can even imagine the content of such visualizations being considered not just cutting-room floor debris but, rather, part of the “real” document, like footnotes.

Andy Baio responded by sponsoring a contest for a tool that would do just that. And I made a screencast demonstrating Dan Phiffer’s winning entry.

That script is unavailable at the moment because, ironically, Dan’s server reports:

Oh noes! I got HACK*D. I’m sifting through my files and should restore things back to normal soon.

In any case, it probably wasn’t practical for routine use. Fetching every revision on the fly really hammers Wikipedia. What’s really needed — again, not just for Wikipedia but everywhere — is a general way to query change history, and return a stream of versions and differences.

One way of doing the latter would be to use FeedSync, an open extension to RSS/Atom that supports synchronization in Live Mesh. Another would be to use Google’s Wave protocol. Because FeedSync deals with lists of items, which can be arbitrary chunks of content, whereas Wave deals with lists of document-mutation operations, like delete-element and start-annotation, it seems to me that FeedSync is more general, albeit less immediately useful for collaborative editing.
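Setting the transport aside, the stream of versions and differences itself is easy to sketch. The following uses Python's difflib to pair each revision with a diff against its predecessor. The function name and the sample revisions are made up for illustration, and a real service would compute these on the server rather than fetching every revision on the fly.

```python
import difflib

def change_stream(revisions):
    """Yield (version_number, text, diff_lines) for each revision after the first."""
    for number in range(1, len(revisions)):
        previous, current = revisions[number - 1], revisions[number]
        diff = list(difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="rev%d" % (number - 1),
            tofile="rev%d" % number,
            lineterm="",
        ))
        yield number, current, diff

# Made-up revision history for illustration.
history = [
    "Heavy metal umlaut\nA diacritic over letters in band names.",
    "Heavy metal umlaut\nA gratuitous diacritic over letters in band names.",
]

for number, text, diff in change_stream(history):
    print("\n".join(diff))
```

An animation tool could then play the diffs back in order, one frame per revision.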

To explain why generality matters, consider change animation in a very different domain: software configuration. My wife, for example, sometimes changes her settings — in Word or Firefox — in ways that cause problems. If these apps persisted their settings to Live Mesh, as they could and arguably should, I’d be able to debug a mishap locally or remotely. But ideally, the change visualization would be sufficiently user-friendly so that she’d have a shot at figuring it out for herself.


PS: Speaking of history and restoration, I’ve been feeling like an amnesiac ever since my InfoWorld archive went dark. So in spare moments I’ve been reconstructing and republishing it. I’ll have the text of all the old blog entries up soon. And I’ve been restoring the screencasts as well. I’m keeping track of my progress at delicious.com/judell/screencast+restored.

My plumber’s last name is Thieme. I was just looking up his phone number, and got distracted when I realized that the people search in Bing does a fair job of visualizing the geographic distribution of surnames. If you do a people search for Thieme, New Hampshire, and start panning around at county and state resolutions, you can see where Thiemes have clustered and where they haven’t.

As I was doing this, I suddenly realized: Why don’t maps offer named zoom levels? If you want to pan across the country at state or county resolution, it requires an enormous amount of continuous zooming in and out. Of course the sizes of states and counties vary as you move across the country. But that’s the whole point. Computers can do the math and automate those adjustments.
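The math in question is small. As a rough sketch, assuming the usual Web Mercator projection with 256-pixel tiles, a named zoom level such as "state" could be resolved by taking the bounding box of whatever state is under the map center and computing the largest zoom at which that box fits the viewport. The function names are hypothetical, and the bounding boxes in the usage lines are approximate values I supplied for illustration.

```python
import math

TILE_SIZE = 256  # pixels per tile at zoom 0 (common web-map convention, assumed here)

def mercator_y(lat_degrees):
    """Project latitude to Web Mercator y in the range 0..1, with north at 0."""
    s = math.sin(math.radians(lat_degrees))
    return 0.5 - math.log((1 + s) / (1 - s)) / (4 * math.pi)

def zoom_for_bounds(south, west, north, east, width_px, height_px, max_zoom=19):
    """Largest integer zoom at which the box fits the viewport.

    Does not handle boxes that cross the antimeridian.
    """
    x_fraction = (east - west) / 360.0
    y_fraction = mercator_y(south) - mercator_y(north)
    zoom_x = math.log2(width_px / (TILE_SIZE * x_fraction))
    zoom_y = math.log2(height_px / (TILE_SIZE * y_fraction))
    return max(0, min(max_zoom, math.floor(min(zoom_x, zoom_y))))

# Approximate bounding boxes (south, west, north, east), for illustration only.
NEW_HAMPSHIRE = (42.7, -72.6, 45.3, -70.6)
TEXAS = (25.8, -106.6, 36.5, -93.5)

print(zoom_for_bounds(*NEW_HAMPSHIRE, width_px=1024, height_px=768))
print(zoom_for_bounds(*TEXAS, width_px=1024, height_px=768))
```

Panning at "state resolution" would then mean recomputing the zoom each time the map center crosses into a different state.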

What prompted this thought was the newly-redesigned Oakland Crimespotting, which features a nifty new widget for selecting times of day. Stamen Design’s Eric Rodenbeck, whom I recently interviewed, calls it the time pie. It’s fun to spin your way through the hours, making contiguous or discontiguous selections. But what’s really useful are the named slices: light, dark, commute, nightlife, day, night, swing shift. As Stamen’s blog notes:

The last time slices (day, night and swing) are the ways that the police view this information, and one thing we hope will come from the project is a better understanding of how the police view their data as it’s collected.

Nice!

What you may not notice, as you navigate the new interface, is that every adjustment is reflected in an exquisitely detailed URL. It’s not obvious because the URLs are really long, and the changes happen outside the visible part of the browser’s location window. But watch:

Default: http://oakland.crimespotting.org/map/#dtend=2009-06-04T20:35:28-07:00&lat=37.806&types=AA,Mu,Ro,SA,DP,Na,Al,Pr,Th,VT,Va,Bu,Ar&lon=-122.270&hours=16-23&zoom=14&dtstart=2009-05-28T20:35:28-07:00

Hide all crime types: http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00&lat=37.806&types=&lon=-122.270&hours=0-23&zoom=14&dtstart=2009-05-28T23:59:59-07:00

Show all and extend dates to max range: http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00&lat=37.806&types=AA,Mu,Ro,SA,DP,Na,Al,Pr,Th,VT,Va,Bu,Ar&lon=-122.270&hours=0-23&zoom=14&dtstart=2009-05-08T00:00:00-07:00

Narcotics only: http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00&lat=37.806&types=Na&lon=-122.270&hours=0-23&zoom=14&dtstart=2009-05-08T00:00:00-07:00

Nighttime narcotics: http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00&lat=37.806&types=Na&lon=-122.270&hours=16-23&zoom=14&dtstart=2009-05-08T00:00:00-07:00

Wee hours narcotics: http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00&lat=37.806&types=Na&lon=-122.270&hours=1-4&zoom=14&dtstart=2009-05-08T00:00:00-07:00
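Because the whole view state lives in the URL fragment, any script can read it back. Here is a small sketch using Python's standard urllib.parse. The function name is hypothetical, and treating a missing hours parameter as 0-23 is my assumption, not documented Crimespotting behavior.

```python
from urllib.parse import urlsplit, parse_qs

def map_state(url):
    """Parse the fragment of a Crimespotting-style URL into a dict of view settings."""
    fragment = urlsplit(url).fragment
    # keep_blank_values preserves "types=" (all crime types hidden).
    params = parse_qs(fragment, keep_blank_values=True)
    state = {key: values[0] for key, values in params.items()}
    state["types"] = [code for code in state.get("types", "").split(",") if code]
    first, _, last = state.get("hours", "0-23").partition("-")
    state["hours"] = (int(first), int(last))
    return state

WEE_HOURS = ("http://oakland.crimespotting.org/map/#dtend=2009-06-04T23:59:59-07:00"
             "&lat=37.806&types=Na&lon=-122.270&hours=1-4&zoom=14"
             "&dtstart=2009-05-08T00:00:00-07:00")

state = map_state(WEE_HOURS)
print(state["types"], state["hours"], state["zoom"])
```

The same dictionary, serialized back into a fragment, is a link that reproduces the view.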

As noted on the Stamen blog, this means that:

It’s now possible to navigate and link to recent newsworthy events like the assassination of journalist Chauncey Bailey, the Oscar Grant riots from January 2009, and the Lovelle Mixon incident from this past March.

The Stamen crew is renowned for brilliance, and rightly so. But the principles at work here — thoughtful naming, granular linking — are ones that we all can and should practice, in the many small ways that we can as we explore and co-create the infosphere.

Curation is always a two-step tango. First you collect, then you categorize. Until now, the elmcity project has been all about collecting. But as the nodes of this network of community hubs start to light up, and as curators gather growing numbers of calendar feeds, it’s time to start enabling them to categorize as well.

This is a classic hard problem. How do you get people to tag hundreds or thousands of items? What makes the problem even harder, in the domain of events, is that once those items fade into the past, any effort invested in tagging them is lost.

My answer is, at least for now: Don’t worry too much about tagging individual events. Instead, gain leverage by finding ways to tag sources of events. Here are two good strategies:

1. Categorizing iCalendar feeds

The obvious place to start is with the iCalendar feeds that curators are collecting. There’s already a mechanism in place to capture metadata about those feeds. Here, for example, is the iCalendar feed for the 2009 Board of Supervisors meetings in Prescott, AZ:

http://fusecal.com/calendar/ical/3200531?h=b75b09c8-50c2-11de-9169-00163e12298c

That’s an iCalendar feed that was made from this web page:

http://www.co.yavapai.az.us/Events.aspx/id=32794

If you check the Delicious metadata for Prescott’s iCalendar feeds, you’ll see this structure:

title: Board of Supervisors
  url: http://fusecal.com/calendar/ical/3200531?h=b75b09c8-50c2-11de-9169-00163e12298c
  tag: trusted
  tag: ics
  tag: feed
  tag: url=http://www.co.yavapai.az.us/Meetings.aspx/folderid=1488&year=2009
  tag: category=government

The url= tag was already there. It provides the all-important link back to a human-readable authoritative source for events coming from this feed. It’s best if individual events provide their own links, but often in iCalendar feeds they don’t, so this is the default link.

What’s new is the category= tag. Now all events coming from this feed will carry that category. For example:

Mon Jun 15 2009


Regular Meeting – Cottonwood N/A
(Board of Supervisors)
(government)

The same info travels downstream, to the aggregated Prescott iCalendar feed:

BEGIN:VEVENT
CATEGORIES:government
DESCRIPTION:Regular Meeting - Cottonwood N/A \n\n****************
nfrom  FuseCal.com\n ******************************\n\n
DTSTART;VALUE=DATE:20090615
LOCATION: (see http://www.co.yavapai.az.us/Events.aspx?id=32794)
SEQUENCE:0
SUMMARY:Regular Meeting - Cottonwood N/A
UID:633797255542010000-1196352865@elmcity.cloudapp.net
URL:http://www.co.yavapai.az.us/Events.aspx?id=32794
END:VEVENT

And to the aggregated XML feed:

<event>
<title>Regular Meeting - Cottonwood N/A</title>
<url>http://www.co.yavapai.az.us/Events.aspx?id=32794</url>
<source>Board of Supervisors</source>
<dtstart>2009-06-15T00:00:00</dtstart>
<categories>government</categories>
</event>
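
As a rough illustration of how feed-level tags can act as metadata, here is a small Python sketch. It is not the elmcity implementation; the function names split_tags and event_link are hypothetical, and the only assumption is the key=value tag convention shown above.

```python
def split_tags(tags):
    """Separate plain tags from key=value metadata tags."""
    plain, meta = [], {}
    for tag in tags:
        key, sep, value = tag.partition("=")  # split on the first "=" only
        if sep:
            meta[key] = value
        else:
            plain.append(tag)
    return plain, meta

def event_link(event, meta):
    """Prefer the event's own link; otherwise fall back to the feed's url= tag."""
    return event.get("url") or meta.get("url")

tags = [
    "trusted", "ics", "feed",
    "url=http://www.co.yavapai.az.us/Meetings.aspx/folderid=1488&year=2009",
    "category=government",
]
plain, meta = split_tags(tags)

event = {"title": "Regular Meeting - Cottonwood N/A"}  # no link of its own
print(meta["category"])         # government
print(event_link(event, meta))  # the feed-level default link
```

Splitting on the first = only matters here, because the value of the url= tag itself contains = characters.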

This strategy only works, of course, for feeds that can be categorized. And that won’t always be true. Events coming from the ReadItNews feed don’t fit into any single category (or short list of categories). So they’ll remain untagged for now. That’s OK. Better to make some progress than to make none. This partial approach yields a nice return on investment. And thanks to the bulk editing feature of Delicious, it’s really quick and easy to select a set of feeds and then tag them with a category= tag.

2. Categorizing Eventful and Upcoming venues

We can use a variation of this strategy to categorize sources of events coming from Eventful and Upcoming. In this case, the lever is the venue. Not all venues host events that can be categorized. But some do, and in those cases, why not exploit that?

The strategy here is to bookmark and tag the event’s venue URL from Upcoming or Eventful. Here are two examples:

Upcoming

title: Venue: Prescott YMCA - Upcoming
  url: http://upcoming.yahoo.com/venue/435420
  tag: venue=upcoming
  tag: category=recreation

Eventful

title: Venue: Raven Cafe
  url: http://eventful.com/prescott/venues/raven-cafe-/V0-001-000366078-7
  tag: venue=eventful
  tag: category=music

If you check the default HTML view of Prescott’s aggregated events, you’ll see that these categories indeed show up. They’re also in the downstream XML, ICS, and JSON feeds.

But can’t the source iCalendar feeds provide per-event categories?

Yes, some do. In the case of Prescott, the public library’s iCalendar feed uses the CATEGORIES property, so those categories show up too. For example:

Thu 02:00 PM
Sign up for Computer Mentor
(Prescott Library)
(Adult Computer Class,library)

Here we see a list of two categories. The first item, Adult Computer Class, was in the original iCalendar feed. The second item, library, was inherited from the feed metadata specified by the curator.
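
The merge described here can be sketched in a few lines of Python. This is an illustration under assumptions, not the service’s actual code: merge_categories is a made-up name, and the ordering (the event’s own categories first, the curator’s feed category last) simply follows the example above.

```python
def merge_categories(event_categories, feed_category=None):
    """Combine per-event categories with the feed-level category, without duplicates."""
    candidates = list(event_categories)
    if feed_category:
        candidates.append(feed_category)
    merged = []
    for name in candidates:
        if name not in merged:  # keep first occurrence, preserve order
            merged.append(name)
    return merged

print(",".join(merge_categories(["Adult Computer Class"], "library")))
# Adult Computer Class,library
```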

There’s a long way to go with this stuff. But this is a nice start!

Jamie Heywood joined me for this week’s Innovators show. His quest to cure ALS (Amyotrophic Lateral Sclerosis, aka Lou Gehrig’s Disease) is featured in a book and a movie. In this conversation, we explore Jamie’s current project: PatientsLikeMe. It’s a website where people pool data about their medical conditions, their drug regimes and related therapies, and their outcomes.

Of course people have been sharing medical information online since it became possible to do so. But PatientsLikeMe differs from other online health communities in several ways. The profile of a user is someone who is grappling with a serious, life-changing illness where:

  • You are very debilitated, perhaps even unable to go to work.

  • You can tell if your treatment is helping. (If you have Parkinson’s disease or depression, for example, you can judge what works or doesn’t. If you have breast cancer, you can’t.)

  • You are in a situation where both diagnosis and treatment are ambiguous.

The data that you report brings you into direct contact with other patients who share similar conditions and treatments. In this sense, PatientsLikeMe is a uniquely data-driven social network:

It is the richest open quantified human-to-human network that exists. There are a couple of hundred measured channels on which you can evaluate yourself against everyone else that you might be interested in connecting to. And you can go across any of those channels to anyone else in the world.

The data you report also brings you into direct contact with drug companies:

It connects you with the people who are developing the drugs to treat your disease. This cuts out an immense amount of inefficiency and middlemen, and can potentially make the system much better. It’s a way of rationalizing and accelerating discovery.

For that reason, Jamie sees no need to apologize for PatientsLikeMe’s business model, which is to sell the data it collects to drug companies. This arrangement may even, arguably, be a form of citizen science:

Do I think that we’ll be using crowdsourcing to interpret the RNA signature in blood? No. But in the real world, when you ask what it means to have ALS, each patient in the system is a representative of their own specific phenotype of this illness. Which is a way of putting it into the process of discovery. Because if you’re not in there — if you’re different, and everyone is unique in some way — the specific components of your own health and its impacts on your life will not be addressed in the process of treatment.

What about privacy? Jamie admits, honestly, that there can be no guarantees, and does not think people who expect guarantees should use PatientsLikeMe. It isn’t for everyone. But there are a number of folks who, after evaluating the risk of participating (pseudonymously) in the service, conclude that the benefit outweighs that risk. They are part of a collective experiment that I will be watching with the greatest interest.

When I shared my strategy for harvesting Keene’s softball schedules, the Little League baseball schedules hadn’t yet been published online. Now I see why. It took the folks at the Keene Cal Ripken Baseball Association (KCRBA) a while to get them written down in Excel, and then produced and uploaded as a set of web pages like this one. We’re two weeks into the season, and those pages are finally up, but not — sadly yet typically — in a useful calendar format that can mesh with other calendars.

Over the weekend, @llama_grande tweeted:

Dilbert creator on calendars @judell may enjoy http://bit.ly/2lKTlb

I set it aside thinking it was a cartoon I’d enjoy later. In fact, it’s a cogent essay by Scott Adams that nicely captures part of my motivation for doing the elmcity project. From the essay:

I think the family calendar is the organizing principle into which all external information should flow. I want the kids’ school schedules for sports and plays and even lunch choices to automatically flow into the home calendar. And when I want to decide what to do on the weekend, I want to click on the date for next Saturday and have all the relevant choices of plays, movies, and events pop up.

I think the biggest software revolution of the future is that the calendar will be the organizing filter for most of the information flowing into your life. You think you are bombarded with too much information every day, but in reality it is just the timing of the information that is wrong. Once the calendar becomes the organizing paradigm and filter, it won’t seem as if there is so much.

Meanwhile, here’s the reality for Kevin Curry:

checking a PDF 4 school lunch is daily routine 4 me

That’s how it is for most of us, most of the time. But it needn’t be.

Consider the Little League example. If the keystrokes that were poured into Excel to create those web pages had been directed into almost any calendar program, the schedules could have been published both as HTML for online viewing and as iCalendar for syndication to other calendars.

Happily, FuseCal can set things straight. It handily created calendars for each of the 27 teams. I collected the feed URLs and wrote a throwaway script to spray them into Delicious. In a few hours, when the elmcity service scans that account again, all the games will be included in the combined calendar. And anybody who wants what Scott Adams wants — to have the kids’ sports events flow into a home calendar — can have it.

This is wrong and backwards, of course. And while the creator of Dilbert would probably enjoy the absurdity of my solution, I’m glad to know he’s also thinking about the right way to move forward.

Like all University of Michigan alumni who were in the school of Literature, Science, and the Arts, I receive the quarterly LSA Magazine. This spring I’m actually in the magazine. For an issue on the theme of surviving in tough economic times, I contributed the back-page editorial which the editors entitled Can the Noosphere Save Us? The themes will be familiar to readers who know me: personal publishing, knowledge sharing, online collaboration. It was a treat to be asked to write about these topics for a diverse audience of UM alumni.

I would have subtitled the piece: “Ask not what the web can do for you. Instead ask what you can do with the web.” It features three people I have interviewed for my Innovators show, all of whom exemplify that dictum. They are Jean-Claude Bradley, Susan Gerhart, and John Leeke.

In order to make things easier for Susan Gerhart, who’ll be using a screen reader, I’m supplementing the PDF version posted at the magazine’s site with a plain HTML version.

When Phil Windley pointed me to Jeannette Wing’s manifesto on computational thinking, she had me at hello. The intellectual tools of computer science, she argues — including the ability to work at multiple levels of abstraction, to automate repetitive processes, and to make and use state machines — are really “a universally applicable attitude and skill set that everyone, not just computer scientists, would be eager to learn and use.”

In 2007 I interviewed Jeannette Wing for my Innovators show. Since then she has moved from Carnegie Mellon to the National Science Foundation, where she is — among other activities — working to define, promote, and bootstrap the teaching of computational thinking.

On this week’s show I spoke with Joan Peckham, a University of Rhode Island professor of computer science who’s on leave to work with the NSF on that mission.

Toward the end of the podcast, she relates this delightful anecdote:

At the first CSTB workshop on computational thinking for everyone, someone from the University of Indiana showed a video on how he was teaching science. We all looked at it and thought: “But it’s also computational thinking!”

In the course of teaching elementary school students about honey bees, he took them out on the playground and asked them to act out what the honey bees did: leaving the hive, finding the pollen, giving directions to the other bees. Then he brought them back into the classroom, went to a whiteboard, and engaged them in activities that I would identify as modeling, debugging, and drawing finite state diagrams. He didn’t call them that, but that’s what they were.

Yes he was teaching them science, but the way he was analyzing the subject, and engaging them in analysis, clearly involved a set of computational constructs.

In my own recent writing and speaking, I’ve suggested that feed syndication and lightweight service composition are aspects of computational thinking that we ought to formulate as basic principles and teach in middle school or even grade school.

We tried, but failed, to come up with a phrase that embellishes computational thinking with connotations of flow, orchestration, and connectedness. Syndication-oriented architecture gets partway there, but will never fly in the mainstream. Maybe connected thinking? But you don’t want to leave out what computational connotes. Perhaps computational and connected thinking? Nah, too wordy. I’d love to hear suggestions for a tagline that concisely captures both aspects.

For more background on computational thinking, here are Joan Peckham’s show notes:

The CSTB (Computer Science and Telecommunications Board) of the National Academy of Sciences is holding Computational Thinking for Everyone: A Workshop Series in 2009. Monitor their website for developments and reports: http://sites.nationalacademies.org/cstb/CurrentProjects/CSTB_043590

Previously awarded CPATH projects (only some of which address computational thinking directly … although the current solicitation requires it):

2007 award portfolio – http://www.nsf.gov/cise/funding/CPATH2007awardsfinal.pdf

2008 award portfolio – http://www.nsf.gov/cise/funding/CPATH2008awardsfinal.pdf

Computer Science Unplugged (http://csunplugged.org/) site has a wealth of classroom ready activities.

Rebooting Computing Summit in January 2009 (http://www.rebootingcomputing.org/). Several working groups emerged from this meeting. Some of the groups were concerned with computing education, and in defining and better communicating computing to others.

The Computer Science Teachers Association (CSTA) has a web repository with K-12 computer science teaching and learning materials: http://csta.acm.org/WebRepository/WebRepository.html

The Carnegie Mellon University Center for Computational Thinking site has materials and resources: http://www.cs.cmu.edu/~CompThink/. [ed: Sponsored, I'm pleased to say, by Microsoft Research.]

Keene is crazy about baseball and softball. In the men’s softball league alone there are 56 teams, they have played 73 games so far, and will play another 431 through August. I know this because the schedule was made in Excel, and published as a web page that Excel’s Data->From Web feature can easily read back.

That Excel spreadsheet isn’t at all useful, however, if you want to combine the schedule with other public calendars, or with your own personal calendar. For that you need an ICS feed. And almost nobody — from the major league websites to local leagues like mine — bothers to provide those.

So I made an ICS feed for Keene men’s softball, and I did it in an unusual way. My first thought was to point FuseCal at the schedule page, which is just an HTML table that looks like this:

DATE         TIME     FIELD  AWAY                         HOME       Lg
Fri. Apr 17  6:00 PM  D      Computer Solutions of Keene  J.A. Jubb  C1
Fri. Apr 17  6:00 PM  O      Peerless Insurance           C&S 1      D2

But FuseCal wouldn’t read that page. It’s a service that specializes in digging structure out of unstructured text, and I guess it got freaked out when it saw too much structure in this page!

Normally in cases like this I’d write a script to read the HTML table, parse out the dates and times, and write an ICS feed. But that isn’t a skill most people have, and I’m looking for ways to help calendar curators do this kind of thing for themselves.
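
For the curious, such a script might look roughly like the Python sketch below. It skips the HTML-reading step and starts from rows already pulled out of the table. Several things are assumptions made for illustration: the year (the schedule page omits it), the UID scheme, the PRODID string, and the function names.

```python
from datetime import datetime, timezone

YEAR = 2009  # assumption: the schedule page gives month and day only

# (date, time, field, away, home), as in the schedule table
ROWS = [
    ("Fri. Apr 17", "6:00 PM", "D", "Computer Solutions of Keene", "J.A. Jubb"),
    ("Fri. Apr 17", "6:00 PM", "O", "Peerless Insurance", "C&S 1"),
]

def escape_text(value):
    """Escape characters that are special in iCalendar TEXT values."""
    return value.replace("\\", "\\\\").replace(";", "\\;").replace(",", "\\,")

def to_vevent(row, stamp):
    date_text, time_text, field, away, home = row
    month_day = date_text.split(" ", 1)[1]  # "Fri. Apr 17" -> "Apr 17"
    start = datetime.strptime(f"{month_day} {YEAR} {time_text}",
                              "%b %d %Y %I:%M %p")
    start_text = start.strftime("%Y%m%dT%H%M%S")  # floating local time
    return [
        "BEGIN:VEVENT",
        f"UID:{start_text}-{field}@softball.example",  # hypothetical UID scheme
        f"DTSTAMP:{stamp.strftime('%Y%m%dT%H%M%SZ')}",
        f"DTSTART:{start_text}",
        f"SUMMARY:{escape_text(away)} vs. {escape_text(home)}",
        f"LOCATION:Field {field}",
        "END:VEVENT",
    ]

def to_ics(rows, stamp):
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0",
             "PRODID:-//example//softball sketch//EN"]
    for row in rows:
        lines.extend(to_vevent(row, stamp))
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"  # iCalendar lines end with CRLF

print(to_ics(ROWS, datetime.now(timezone.utc)))
```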

Then it occurred to me: What would FuseCal read? How about this:

Fri. Apr 17 06:00 PM,
Computer Solutions of Keene vs. J.A. Jubb, Field D
Fri. Apr 17 06:00 PM,
Peerless Insurance vs. C&S 1, Field O

In other words, the same stuff lightly reformatted, and coalesced into a single cell per row. And yes, FuseCal will read that.

So I added a column to the Excel sheet with this formula:

=CONCATENATE(A4, " ", TEXT(B4,"hh:mm AM/PM"), ", ", D4, " vs. ", F4,
 ", ", "Field ", C4)

Then I exported that column back out as this HTML page, used FuseCal to create this ICS feed, and bookmarked it for inclusion in the aggregator.
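
If you’d rather do the flattening outside Excel, the same reshaping can be sketched in Python. This mirrors what the formula produces rather than reproducing any particular tool; flatten_row is a made-up name, and the argument order follows the columns of the schedule table.

```python
from datetime import datetime

def flatten_row(date_text, time_text, field, away, home):
    """Coalesce one schedule row into a single line of loosely structured text."""
    # Re-render the time so "6:00 PM" becomes "06:00 PM", like TEXT(..., "hh:mm AM/PM")
    padded = datetime.strptime(time_text, "%I:%M %p").strftime("%I:%M %p")
    return f"{date_text} {padded}, {away} vs. {home}, Field {field}"

print(flatten_row("Fri. Apr 17", "6:00 PM", "D",
                  "Computer Solutions of Keene", "J.A. Jubb"))
# Fri. Apr 17 06:00 PM, Computer Solutions of Keene vs. J.A. Jubb, Field D
```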

This has to be the weirdest maneuver I’ve ever thought of. Taking away structure in order to be able to add structure? Crazy! And yet it makes perfect sense. FuseCal is a component that specializes in turning weakly-structured calendar-like data into better-structured calendar data. It also knows how to do other useful things, like monitor the source of that data for changes, and convert the data into ICS format. If it’s easy enough to provide the sort of weak structure that FuseCal expects, why not just do that and leverage its strengths?

So I did, and here are the key outcomes:

  1. The softball events now show up on the aggregated calendar.
  2. They’re also available directly from the ICS feed, so that players and their families can add these events to personal calendars.

Nice!

It would be even nicer if, as a member of, say, the Blazers, I could scoop up just my own team’s events. And in fact FuseCal does support filtering. As the creator of the feed, I can go into the application, type Blazers, and restrict the feed to just those events. But I’d have to create 56 separate filtered calendars to provide feeds for all the teams. Feature request for FuseCal: Support filtering on the feed URL, so I can form URLs like:

http://fusecal.com/calendar/view/741833?h=5f7c2ac6-13cc-11de-a48e-00163e284ee0&filter=Blazers

http://fusecal.com/calendar/view/741833?h=5f7c2ac6-13cc-11de-a48e-00163e284ee0&filter=Greenwald+Realty
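
To show what that request amounts to, here is a hedged Python sketch of URL-driven filtering. To be clear, FuseCal does not offer this; the filter parameter is the wished-for feature, and the function name and sample events are invented for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def filter_events(events, feed_url):
    """Keep only events whose summary contains the URL's filter term, if any."""
    query = parse_qs(urlsplit(feed_url).query)
    terms = query.get("filter", [])
    if not terms:
        return list(events)  # no filter given: return the whole feed
    needle = terms[0].lower()  # parse_qs has already turned "+" into a space
    return [e for e in events if needle in e["summary"].lower()]

# Invented sample events; only the team names come from the schedule
events = [
    {"summary": "Blazers vs. Greenwald Realty"},
    {"summary": "Peerless Insurance vs. C&S 1"},
]
url = ("http://fusecal.com/calendar/view/741833"
       "?h=5f7c2ac6-13cc-11de-a48e-00163e284ee0&filter=Greenwald+Realty")
for event in filter_events(events, url):
    print(event["summary"])  # Blazers vs. Greenwald Realty
```

With something like this on the server side, one feed could serve all 56 teams.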

While we’re wishing, here’s a feature request for Yahoo Pipes: Add a module for ICS feeds! Pipes is a fabulous tool for transforming, filtering, and merging RSS feeds. It would be great to be able to do the same kinds of magic with ICS feeds.