How George Bailey can save Delicious

Every Christmas we watch It’s a Wonderful Life. This year I’ll be imagining Jimmy Stewart saying, to a panicked crowd of delicious.com users rushing for the exits, “Now, hold on, it’ll work out, we’ve just got to stick together.”

If you’ve never used the social bookmarking service that began life with the whimsical domain name del.icio.us, here’s the Wikipedia summary. The service began in 2003, and by 2004 had transformed my work practices more profoundly than almost anything else before or since. I’ve written scores of essays explaining how and why. Here are some of my favorites:

2004: Collaborative knowledge gardening

2005: Language evolution with del.icio.us (screencast)

2005: Collaborative filtering with del.icio.us

2006: Del.icio.us is a database

2007: Discovering and teaching principles of information management

2007: Social information management

2008: Twine, del.icio.us, and event-driven service integration

2008: Databasing trusted feeds with del.icio.us

2008: Why and how to blurb your social bookmarks

2009: Collaborative curation as a service

Since the now-infamous leak of an internal Yahoo! slide naming delicious as one of a set of doomed services, there’s been some great gallows humor. Ed Kohler:

The easiest way to shut down Wikileaks would be to have Yahoo! acquire it.

And Anil Dash:

It seems like @pinboardIN is the most successful product Yahoo!’s had a hand in launching in five years. Congrats, @baconmeteor.

Anil is referring to pinboard.in, one of several delicious-like services to which delicious users began fleeing. Pinboard is notable for a clever model in which the price of a lifetime subscription rises with the number of users. When I first checked yesterday morning, that price was $6.90. I signed up at $7.24. Neil Saunders started tracking it at #pinboardwatch; it got to $7.74 last night; it’s $8.17 now. Maybe I should’ve bought 100 accounts at $6.90!

But seriously, this is a moment to reflect on how we can preserve the value we collectively create online. As some of you know, I have made heavy use of delicious in my own service, elmcity. When the news broke, Jeremy Dunck asked: “Bad news for elmcity, huh?”

Actually that’s the least of my worries. The folks who curate elmcity calendar hubs use delicious to configure their hubs, and to list the feeds aggregated by their hubs. It’ll be a bit inconvenient to transition to another bookmarking service, but it’s no big deal. And of course all the existing data is cached in an Azure database; the elmcity service doesn’t depend on live access to delicious.

The real concern is far broader. Millions of us have used delicious to create named sets of online resources. We can recreate our individual collections in other services, but not our collaborative efforts. In Delicious’s Data Policy is Like Setting a Museum on Fire, Marshall Kirkpatrick writes:

One community of non-profit technologists has been bookmarking links with the tag “NPTech” for years – they have 24,028 links categorized as relevant for organizations seeking to change the world and peoples’ lives using technology. Wouldn’t it be good to have that body of data, metadata and curated resources available elsewhere once Delicious is gone?

The problem with “elsewhere,” of course, is that there’s no elsewhere immune to the same business challenges faced by Yahoo!. Maybe now is the time for a new model to emerge. Except it wouldn’t be new at all. The Building and Loan that George Bailey ran in It’s a Wonderful Life wasn’t a bank; it was a co-op, and its customers were shareholders. Could delicious become the first user-owned Internet service? Could we users collectively make Yahoo! an offer, buy in as shareholders, and run the service ourselves?

It’s bound to happen sooner or later. My top Christmas wish: delicious goes first.

Automatic shifting and manual steering on the information superhighway

I’d like to thank the folks at the Berkman Center for listening to my talk yesterday, and for feedback that was skeptical about the very points I know that I need to sharpen. The talk is available here in multiple audio and video formats. The slides are separately available on SlideShare. There are many ways to use these materials. If I wanted to listen and watch, here are the methods I’d choose. For a tethered experience I’d download the original PowerPoint deck from SlideShare and watch it along with the MP3 audio. For an untethered experience I’d look at the slides first, and then dump the MP3 onto a portable player and head out for a run. Finally, if I lacked the time or inclination for either of those modes, but was still curious about the talk, I’d read Ethan Zuckerman’s excellent write-up.

After the talk we had a stimulating discussion that raised questions some of us have been kicking around forever in the blogosphere:

  1. Do “real people” — that is, people who do not self-identify as geeks — actually use feed syndication?

  2. If not directly and intentionally, do they use it indirectly and unconsciously by way of systems that syndicate feeds without drawing attention to the concept?

  3. Does the concept matter?

The third question is the big one for me. From the moment that the blogosphere booted up, I thought that pub/sub syndication — formerly a topic of interest only to engineers of networked information systems — was now becoming a tool that everyone would want to master in order to actively engage with networked information systems. Mastering the principles of pub/sub syndication wasn’t like mastering the principles of automotive technology in order to drive a car. It was, instead, like knowing how to steer the car — a form of knowledge that we don’t fully intuit. I have been driving for over 35 years. But there are things I never learned until we sent our kids to Skid School and participated in the training.

I’ll admit I have waffled on this. After convincing Gardner Campbell that we should expect people to know how to steer their cars on the information superhighway, I began to doubt that was possible. Maybe people don’t just need automatic transmission. Maybe they need automatic steering too. Maybe I was expecting too much.

But Gardner was unfazed by my doubt. He continued to believe that people need to learn how to steer, and he created a Skid School in order to teach them. It’s called the New Media Studies Faculty Seminar, it’s taking place at Baylor University where Gardner teaches, at partner schools, and from wherever else like minds are drawn by the tags that stitch together this distributed and syndicated conversation. Here’s Gardner reflecting on the experience:

Friday, I was scanning the blog feeds to read the HCC blogs about the discussion. Then I clicked over to some of the other sites’ blogs to see what was happening there. Oops! I was brought up short. I thought I’d clicked on a St. Lawrence University blog post. It sure looked like their site. But as I read the post, it was clear to me something had gone wrong. I was reading a description of the discussion at HCC, which had included very thoughtful inquiries into the relationship of information, knowledge, and wisdom. Then I realized that in fact I was reading a description of the St. Lawrence discussion — because that’s what they’d talked about at St. Lawrence University as well.

And now my links bear witness to that connection, tell my story of those connections, and enact them anew.

This property of the link — that it is both map and territory — is one I’ve blogged about before (a lucky blog for me, as it elicited three of my Favorite Comments Ever). But now I see something much larger coming into view. Each person enacts the network. At the same time, the network begins to represent and enact the infinities within the persons who make it up. The inside is bigger than the outside. Each part contains the whole, and also contributes to the whole.

The New Media Studies Faculty Seminar has given some educators a lesson in how to steer their own online destinies, and a Skid School course on which to practice their new skills. That pretty much sums up my ambition for the elmcity project too. Automatic transmissions are great. But we really do need to teach folks how to steer.

Jazz in Madison, Wisconsin: A case study in curation

The elmcity project’s newest hub is called Madison Jazz. The curator, Bob Kerwin, will be aggregating jazz-related events in Madison, Wisconsin. Bob thought about creating a Where hub, which merges events from Eventful, Upcoming, and Eventbrite with a curated list of iCalendar feeds. That model works well for hyperlocal websites looking to do general event coverage, like the Falls Church Times and Berkeleyside. But Bob didn’t want to cast that kind of wide net. He just wanted to enumerate jazz-related iCalendar feeds.

So he created a What hub — that is, a topical rather than a geographic hub. It has a geographic aspect, of course, because it serves the jazz scene in Madison. But in this case the topical aspect is dominant. So to create the hub, Bob spun up the delicious account MadisonJazz. And in its metadata bookmark he wrote what=JazzInMadisonWI instead of where=Madison,WI.

If you want to try something like this, for any kind of local or regional or global topic, the first thing you’ll probably want to do — as Bob did — is set up your own iCalendar feed where you record events not otherwise published in a machine-readable way. You can use Google Calendar, or Live Calendar, or Outlook, or Apple iCal, or any other application that publishes an iCalendar feed.

If you are very dedicated, you can enter individual future events on that calendar. But it’s hard, for me anyway, to do that kind of data entry for single events that will just scroll off the event horizon in a few weeks or months. So for my own hub I use this special kind of curatorial calendar mainly for recurring events. As I use it, the effort invested in data entry pays recurring dividends and builds critical mass for the calendar.

Next, you’ll want to look for existing iCalendar feeds to bookmark. Most often, these are served up by Google Calendar. Other sources include Drupal-based websites, and an assortment of other content management systems. Sadly there’s no easy way to search for these. You have to visit websites relevant to the domain you’re curating, look for the event sections on websites, and then look for iCalendar feeds as alternatives to the standard web views. These are few and far between. Teaching event sponsors how and why to produce such feeds is a central goal of the elmcity project.

When a site does offer a Google Calendar feed, it will often be presented as seen here on the Surrounded By Reality blog. The link to its calendar of events points to this Google Calendar. Its URL looks like this:

1. google.com/calendar/embed?src=surroundedbyreality@gmail.com

That’s not the address of the iCalendar feed, though. The feed’s address is, instead, a variant that looks like this:

2. google.com/calendar/ical/surroundedbyreality@gmail.com/public/basic.ics

To turn URL #1 into URL #2, just transfer the email address into the pattern shown in #2. Alternatively, click the Google icon on the HTML version to add the calendar to the Google Calendar app, then open its settings, right-click the green ICAL button, and capture the URL of the iCalendar feed that way.
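
If you find yourself doing that conversion often, it's easy to script. Here's a minimal sketch in Python; the function name is mine, and real embed URLs sometimes percent-encode the @ sign, which the sketch accounts for:

[sourcecode language="python"]
import re, urllib

def embed_url_to_ics(embed_url):
    # Pull the calendar's address out of the embed URL's src parameter
    match = re.search(r'[?&]src=([^&]+)', embed_url)
    if match is None:
        return None
    # Undo any percent-encoding (e.g. %40 for @) before reusing the address
    address = urllib.unquote(match.group(1))
    return 'http://www.google.com/calendar/ical/%s/public/basic.ics' % address

print embed_url_to_ics(
    'http://www.google.com/calendar/embed?src=surroundedbyreality@gmail.com')
# http://www.google.com/calendar/ical/surroundedbyreality@gmail.com/public/basic.ics
[/sourcecode]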

Note that even though a What hub will not automatically aggregate events from Eventful or Upcoming, these services can sometimes provide iCalendar feeds that you’ll want to include. For example, Upcoming lists the Cafe Montmartre as a wine bar and jazz cafe. If there were future events listed there, Bob could add the iCalendar feed for that venue to his list of MadisonJazz bookmarks.

Likewise for Eventful. One of the Google Calendars that Bob Kerwin has collected is for Restaurant Magnus. It is also an Eventful venue that provides an iCalendar feed for its upcoming schedule. If Restaurant Magnus weren’t already publishing its own feed, the Eventful feed would be an alternate source Bob could collect.

For curators of musical events, MySpace is another possible source of iCalendar feeds. For example, the band dot to dot management plays all around the midwest, but has a couple of upcoming shows in Madison. I haven’t been able to persuade anybody at MySpace to export iCalendar feeds for the zillions of musical calendars on its site. But although the elmcity service doesn’t want to be in the business of scraping web pages, it does make exceptions to that rule, and MySpace is one of them. So Bob could bookmark that band’s MySpace web page, filter the results to include only shows in Madison, and bookmark the resulting iCalendar feed.

This should all be much more obvious than it is. Anyone publishing event info online should expect that any publishing tool used for the purpose will export an iCalendar feed. Anyone looking for event info should expect to find it in an iCalendar feed. Anyone wishing to curate events should expect to find lots of feeds that can be combined in many ways for many purposes.

Maybe, as more apps and services support OData, and as more people become generally familiar with the idea of publishing, subscribing to, and mashing up feeds of data … maybe then the model I’m promoting here will resonate more widely. A syndicated network of calendar feeds is just a special case of something much more general: a syndicated network of data feeds. That’s a model lots of people need to know and apply.

Producing and consuming OData feeds: An end-to-end example

Having waxed theoretical about the Open Data Protocol (OData), it’s time to make things more concrete. I’ve been adding instrumentation to monitor the health and performance of my elmcity service. Now I’m using OData to feed the telemetry into Excel. It makes a nice end-to-end example, so let’s unpack it.

Data capture

The web and worker roles in my Azure service take periodic snapshots of a set of Windows performance counters, and store those to an Azure table. Although I could be using the recently-released Azure diagnostics API, I’d already come up with my own approach. I keep a list of the counters I want to measure in another Azure table, shown here in Cerebrata’s viewer/editor:

When you query an Azure table like this one, the records come back packaged as content elements within Atom entries:

[sourcecode language="xml"]
<entry m:etag="W/datetime'2010-02-09T00:00:53.7164253Z'">
  <id>http://elmcity.table.core.windows.net/monitor(PartitionKey='ProcessMonitor',RowKey='634012704503641218')</id>
  <content type="application/xml">
    <m:properties>
      <d:PartitionKey>ProcessMonitor</d:PartitionKey>
      <d:RowKey>634012704503641218</d:RowKey>
      <d:HostName>RD00155D317B3F</d:HostName>
      <d:ProcName>WaWorkerHost</d:ProcName>
      <d:mem_available_mbytes m:type="Edm.Double">1320</d:mem_available_mbytes>
      ...snip...
      <d:tcp_connections_established m:type="Edm.Double">24</d:tcp_connections_established>
    </m:properties>
  </content>
</entry>
[/sourcecode]

This isn’t immediately obvious if you use the storage client library that comes with the Azure SDK, which wraps an ADO.NET Data Services abstraction around the Azure table service. But if you peek under the covers using a tool like Eric Lawrence’s astonishingly capable Fiddler, you’ll see nothing but Atom entries. In order to get direct access to them, I don’t actually use the storage client library in the SDK, but instead use an alternate interface that exposes the underlying HTTP/REST machinery.

Exposing data services

If the Azure table service did not require special authentication, it would itself be an OData service that you could point any OData-aware client at. To fetch recent entries from my table of snapshots, for example, you could use this URL in any browser:

GET http://elmcity.table.core.windows.net/monitor?$filter=Timestamp+gt+datetime'2010-02-08'

(A table named ‘monitor’ is where the telemetry data are stored.)

The table service does require authentication, though, so in order to export data feeds I’m creating wrappers around selected queries. Until recently, I’ve always packaged the query response as a .NET List of Dictionaries. A record in an Azure table maps nicely to a Dictionary. Both are flexible bags of name/value pairs, and a Dictionary is easily consumed from both C# and IronPython.

To enable OData services I just added an alternate method that returns the raw response from an Azure table query. Then I extended the public namespace of my service, adding a /odata mapping that accepts URL parameters for the name of a table, and for the text of a query. I’m doing this in ASP.NET MVC, but there’s nothing special about the technique. If you were working in, say, Rails or Django, it would be just the same. You’d map out a piece of public namespace, and wire it to a parameterized service that returns Atom feeds.
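
For what it's worth, here's a rough sketch of the same move in Python, using Flask only because it's compact. The query_azure_table helper is hypothetical, standing in for whatever authenticated query your table store requires; this is not the actual elmcity code:

[sourcecode language="python"]
from flask import Flask, Response, request

app = Flask(__name__)

def query_azure_table(table, query):
    # Hypothetical helper: run an authenticated query against the table
    # service and return the raw Atom feed it sends back.
    raise NotImplementedError

@app.route('/odata')
def odata():
    # Carve out a piece of public namespace that accepts a table name
    # and a query string...
    table = request.args.get('table', 'monitor')
    query = request.args.get('query', '')
    # ...and relay the resulting Atom feed to the caller unchanged.
    return Response(query_azure_table(table, query),
                    mimetype='application/atom+xml')
[/sourcecode]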

Discovering data services

An OData-aware client can use an Atom service document to find out what feeds are available from a provider. The one I’m using looks kind of like this:

[sourcecode language="xml"]
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<service xmlns:atom='http://www.w3.org/2005/Atom'
         xmlns:app='http://www.w3.org/2007/app' xmlns='http://www.w3.org/2007/app'>
  <workspace>
    <atom:title>elmcity odata feeds</atom:title>
    <collection href='http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48'>
      <atom:title>recent monitor data (web and worker roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48&amp;query=ProcName eq 'WaWebHost'">
      <atom:title>recent monitor data (web roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=monitor&amp;hours_ago=48&amp;query=ProcName eq 'WaWorkerHost'">
      <atom:title>recent monitor data (worker roles)</atom:title>
    </collection>
    <collection href="http://elmcity.cloudapp.net/odata?table=counters">
      <atom:title>performance counters</atom:title>
    </collection>
  </workspace>
</service>
[/sourcecode]

PowerPivot is an Excel add-in that knows about this stuff. Here’s a picture of PowerPivot discovering those feeds:

It’s straightforward for any application or service, written in any language, running in any environment, to enable this kind of discovery.

Using data services

In my case, PowerPivot — which is an add-in that brings some nice business intelligence capability to Excel — makes a good consumer of my data services. Here are some charts that slice my service’s request execution times in a couple of different ways:
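
PowerPivot isn't the only possible consumer, of course. Just to underline the any-language point, here's a minimal Python sketch that fetches one of the collections listed in the service document above and prints the data-describing properties of each Atom entry; the property names are simply whatever the table happens to contain:

[sourcecode language="python"]
import urllib
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
DATA = '{http://schemas.microsoft.com/ado/2007/08/dataservices}'

url = 'http://elmcity.cloudapp.net/odata?table=monitor&hours_ago=48'
feed = ET.parse(urllib.urlopen(url))

for entry in feed.findall(ATOM + 'entry'):
    # Each entry's <m:properties> element holds the name/value pairs
    # stored in the underlying Azure table record
    for prop in entry.iter():
        if prop.tag.startswith(DATA):
            print prop.tag[len(DATA):], prop.text
    print '---'
[/sourcecode]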

Again, it’s straightforward for any application or service, written in any language, running in any environment, to do this kind of thing. It’s all just Atom feeds with data-describing payloads. There’s nothing special about it, which is the whole point. If things pan out as I hope, we’ll have a cornucopia of OData feeds — from our banks, from our Internet service providers, from our governments, and from every other source that currently publishes data on paper, or in less useful electronic formats like PDF and HTML. And we’ll have a variety of OData clients, on mobile devices and on our desktops and in the cloud, that enable us to work with those data feeds.

Where is the money going?

Over the weekend I was poking around in the recipient-reported data at recovery.gov. I filtered the New Hampshire spreadsheet down to items for my town, Keene, and was a bit surprised to find no descriptions in many cases. Here’s the breakdown:

# of awards                               25
# of awards with descriptions              5    20%
# of awards without descriptions          20    80%
$ of awards                       10,940,770
$ of awards with descriptions      1,260,719    12%
$ of awards without descriptions   9,680,053    88%

In this case, the half-dozen largest awards aren’t described:

award                     amount      funding agency                                recipient and description
EE00161                   2,601,788                                                 Sothwestern Community Services Inc
S394A090030               1,471,540                                                 Keene School District
AIP #3-33-SBGP-06-2009    1,298,500                                                 City of Keene
2W-33000209-0             1,129,608                                                 City of Keene
2F-96102301-0               666,379                                                 City of Keene
2F-96102301-0               655,395                                                 City of Keene
0901NHCOS2                  600,930                                                 Sothwestern Community Services Inc
2009RKWX0608                459,850   Department of Justice                         KEENE, CITY OF
                                      The COPS Hiring Recovery Program (CHRP) provides funding directly to law enforcement agencies to hire and/or rehire career law enforcement officers in an effort to create and preserve jobs, and to increase their community policing capacity and crime prevention efforts.
NH36S01050109               413,394   Department of Housing and Urban Development   KEENE HOUSING AUTHORITY
                                      ARRA Capital Fund Grant. Replacement of roofing, siding, and repair of exterior storage sheds on 29 public housing units at a family complex

That got me wondering: Where does the money go? So I built a little app that explores ARRA awards for any city or town: http://elmcity.cloudapp.net/arra. For most places, it seems, the ratio of awards with descriptions to awards without isn’t quite so bad. In the case of Philadelphia, for example, “only” 27% of the dollars awarded ($280 million!) are not described.

But even when the description field is filled in, how much does that tell us about what’s actually being done with the money? We can’t expect to find that information in a spreadsheet at recovery.gov. The knowledge is held collectively by the many people who are involved in the projects funded by these awards.

If we want to materialize a view of that collective knowledge, the ARRA data provides a useful starting point. Every award is identified by an award number. These are, effectively, webscale identifiers — that is, more-or-less unique tags we could use to collate newspaper articles, blog entries, tweets, or any other online chatter about awards.

To promote this idea, the app reports award numbers as search strings. In Keene, for example, the school district got an award for $1.47 million. The award number is S394A090030. If you search for that you’ll find nothing but a link back to a recovery.gov page entitled Where is the Money Going?

Recovery.gov can’t bootstrap itself out of this circular trap. But if we use the tags that it has helpfully provided, we might be able to find out a lot more about where the money is going.
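
A collation effort could start with nothing fancier than canned search queries. Here's a tiny Python sketch that turns an award number into the sort of search URLs you might hand to volunteers or feed to a monitoring script; the choice of engines is mine, just for illustration:

[sourcecode language="python"]
import urllib

def search_urls(award_number):
    # Quote the award number so it's searched as an exact phrase
    q = urllib.quote('"%s"' % award_number)
    return ['http://www.google.com/search?q=' + q,
            'http://search.twitter.com/search?q=' + q]

for url in search_urls('S394A090030'):
    print url
[/sourcecode]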

A literary appreciation of the Olson/Zoneinfo/tz database

You will probably never need to know about the Olson database, also known as the Zoneinfo or tz database. And were it not for my elmcity project I never would have looked into it. I knew roughly that this bedrock database is a compendium of definitions of the world’s timezones, plus rules for daylight saving time (DST) transitions, used by many operating systems and programming languages.

I presumed that it was written Unix-style, in some kind of plain-text format, and that’s true. Here, for example, are top-level DST rules for the United States since 1918:

# Rule NAME FROM  TO    IN   ON         AT      SAVE    LETTER/S
Rule   US   1918  1919  Mar  lastSun    2:00    1:00    D
Rule   US   1918  1919  Oct  lastSun    2:00    0       S
Rule   US   1942  only  Feb  9          2:00    1:00    W # War
Rule   US   1945  only  Aug  14         23:00u  1:00    P # Peace
Rule   US   1945  only  Sep  30         2:00    0       S
Rule   US   1967  2006  Oct  lastSun    2:00    0       S
Rule   US   1967  1973  Apr  lastSun    2:00    1:00    D
Rule   US   1974  only  Jan  6          2:00    1:00    D
Rule   US   1975  only  Feb  23         2:00    1:00    D
Rule   US   1976  1986  Apr  lastSun    2:00    1:00    D
Rule   US   1987  2006  Apr  Sun>=1     2:00    1:00    D
Rule   US   2007  max   Mar  Sun>=8     2:00    1:00    D
Rule   US   2007  max   Nov  Sun>=1     2:00    0       S

What I didn’t appreciate, until I finally unzipped and untarred a copy of ftp://elsie.nci.nih.gov/pub/tzdata2009o.tar.gz, is the historical scholarship scribbled in the margins of this remarkable database, or document, or hybrid of the two.

You can see a glimpse of that scholarship in the above example. The most recent two rules define the latest (2007) change to US daylight savings. The spring forward rule says: “On the second Sunday in March, at 2AM, save one hour, and use D to change EST to EDT.” Likewise, on the fast-approaching first Sunday in November, spend one hour and go back to EST.
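
The Sun>=8 notation just means "the first Sunday falling on or after the 8th," which is the tz way of saying "second Sunday in March." A few lines of Python (mine, not anything from the tz distribution) make the semantics concrete:

[sourcecode language="python"]
from datetime import date, timedelta

def first_sunday_on_or_after(year, month, day):
    # Implements the tz-style "Sun>=day" rule: scan forward from the
    # given day until we hit a Sunday (weekday() == 6)
    d = date(year, month, day)
    while d.weekday() != 6:
        d += timedelta(days=1)
    return d

# Rule US 2007 max Mar Sun>=8 ... D  -> spring forward
print first_sunday_on_or_after(2010, 3, 8)   # 2010-03-14, the second Sunday in March
# Rule US 2007 max Nov Sun>=1 ... S  -> fall back
print first_sunday_on_or_after(2010, 11, 1)  # 2010-11-07, the first Sunday in November
[/sourcecode]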

But look at the rules for Feb 9 1942 and Aug 14 1945. The letters are W and P instead of D and S. And the comments tell us that during that period there were timezones like Eastern War Time (EWT) and Eastern Peace Time (EPT). Arthur David Olson elaborates:

From Arthur David Olson (2000-09-25):

Last night I heard part of a rebroadcast of a 1945 Arch Oboler radio drama. In the introduction, Oboler spoke of “Eastern Peace Time.” An AltaVista search turned up :”When the time is announced over the radio now, it is ‘Eastern Peace Time’ instead of the old familiar ‘Eastern War Time.’ Peace is wonderful.”

 

Most of this Talmudic scholarship comes from founding contributor Arthur David Olson and editor Paul Eggert, both of whose Wikipedia pages, although referenced from the Zoneinfo page, strangely do not exist.

But the Olson/Eggert commentary is also interspersed with many contributions, like this one about the Mount Washington Observatory.

From Dave Cantor (2004-11-02)

Early this summer I had the occasion to visit the Mount Washington Observatory weather station atop (of course!) Mount Washington [, NH]…. One of the staff members said that the station was on Eastern Standard Time and didn’t change their clocks for Daylight Saving … so that their reports will always have times which are 5 hours behind UTC.

 

Since Mount Washington has a climate all its own, I guess it makes sense for it to have its own time as well.

Here’s a glimpse of Alaska’s timezone history:

From Paul Eggert (2001-05-30):

Howse writes that Alaska switched from the Julian to the Gregorian calendar, and from east-of-GMT to west-of-GMT days, when the US bought it from Russia. This was on 1867-10-18, a Friday; the previous day was 1867-10-06 Julian, also a Friday. Include only the time zone part of this transition, ignoring the switch from Julian to Gregorian, since we can’t represent the Julian calendar.

As far as we know, none of the exact locations mentioned below were permanently inhabited in 1867 by anyone using either calendar. (Yakutat was colonized by the Russians in 1799, but the settlement was destroyed in 1805 by a Yakutat-kon war party.) However, there were nearby inhabitants in some cases and for our purposes perhaps it’s best to simply use the official transition.

 

You have to have a sense of humor about this stuff, and Paul Eggert does:

From Paul Eggert (1999-03-31):

Shanks writes that Michigan started using standard time on 1885-09-18, but Howse writes (pp 124-125, referring to Popular Astronomy, 1901-01) that Detroit kept

local time until 1900 when the City Council decreed that clocks should be put back twenty-eight minutes to Central Standard Time. Half the city obeyed, half refused. After considerable debate, the decision was rescinded and the city reverted to Sun time. A derisive offer to erect a sundial in front of the city hall was referred to the Committee on Sewers. Then, in 1905, Central time was adopted by city vote.

 

This story is too entertaining to be false, so go with Howse over Shanks.

 

The document is chock full of these sorts of you-can’t-make-this-stuff-up tales:

From Paul Eggert (2001-03-06), following a tip by Markus Kuhn:

Pam Belluck reported in the New York Times (2001-01-31) that the Indiana Legislature is considering a bill to adopt DST statewide. Her article mentioned Vevay, whose post office observes a different time zone from Danner’s Hardware across the street.

 

I love this one about the cranky Portuguese prime minister:

Martin Bruckmann (1996-02-29) reports via Peter Ilieve

that Portugal is reverting to 0:00 by not moving its clocks this spring.
The new Prime Minister was fed up with getting up in the dark in the winter.

 

Of course Gaza could hardly fail to exhibit weirdness:

From Ephraim Silverberg (1997-03-04, 1998-03-16, 1998-12-28, 2000-01-17 and 2000-07-25):

According to the Office of the Secretary General of the Ministry of Interior, there is NO set rule for Daylight-Savings/Standard time changes. One thing is entrenched in law, however: that there must be at least 150 days of daylight savings time annually.

 

The rule names for this zone are poignant too:

# Zone  NAME            GMTOFF  RULES   FORMAT  [UNTIL]
Zone    Asia/Gaza       2:17:52 -       LMT     1900 Oct
                        2:00    Zion    EET     1948 May 15
                        2:00 EgyptAsia  EE%sT   1967 Jun  5
                        2:00    Zion    I%sT    1996
                        2:00    Jordan  EE%sT   1999
                        2:00 Palestine  EE%sT

There’s also some wonderful commentary in the various software libraries that embody the Olson database. Here’s Stuart Bishop on why pytz, the Python implementation, supports almost all of the Olson timezones:

As Saudi Arabia gave up trying to cope with their timezone definition, I see no reason to complicate my code further to cope with them. (I understand the intention was to set sunset to 0:00 local time, the start of the Islamic day. In the best case caused the DST offset to change daily and worst case caused the DST offset to change each instant depending on how you interpreted the ruling.)

 

It’s all deliciously absurd. And according to Paul Eggert, Ben Franklin is having the last laugh:

From Paul Eggert (2001-03-06):

Daylight Saving Time was first suggested as a joke by Benjamin Franklin in his whimsical essay “An Economical Project for Diminishing the Cost of Light” published in the Journal de Paris (1784-04-26). Not everyone is happy with the results.

 

So is Olson/Zoneinfo/tz a database or a document? Clearly both. And its synthesis of the two modes is, I would argue, a nice example of literate programming.

Querying mobile data objects with LINQ

I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated values), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.

For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As so often happens lately, I was struck by the convergence of the languages. Here’s a comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:

Python:

i_name = 5
i_statename = 6
i_pop2008 = 17

handle = urllib.urlopen(url)

reader = csv.reader(handle, delimiter=',')

rows = itertools.ifilter(lambda x :
  x[i_name].startswith('Keene') and
  x[i_statename] == 'New Hampshire',
    reader)

found_rows = list(rows)

count = len(found_rows)

if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])

C#:

public class USCensusPopulationData
  {
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }

var csv = new WebClient().DownloadString(url);

var stream = new MemoryStream(Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };

var reader = cc.Read<USCensusPopulationData>(sr, fd);

var rows = reader.ToList();

var found_rows = rows.FindAll(row =>
  row.NAME.StartsWith("Keene") &&
  row.STATENAME == "New Hampshire");

var count = found_rows.Count;

if ( count > 0 )
  pop = Convert.ToInt32(found_rows[0].POP_2008);

Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.

This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:

2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (row[i_name], row[i_statename], 
      row[i_pop2000], row[i_pop2008])    
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and 
            p2008 > minpop   and 
            p2008 < p2000 )

state = 'New Hampshire'

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x : 
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % ( 
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.

As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database — perhaps SQLite in order to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And the database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.

With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:

var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows = 
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000) 
  orderby percent(row.POP_2000,row.POP_2008) 
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008    
    };

 foreach (var row in rows)
   Console.WriteLine("{0:0.00}% {1}",
     percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.

With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.

Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejlsberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.

I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.


private static float percent(string a, string b)
  {
  var y0 = float.Parse(a);
  var y1 = float.Parse(b);
  return - ( y0 / y1 * 100 - 100);
  }

private static IEnumerable<USCensusPopulationData> make_reader()
  {
  var h = new FileStream("pop.csv", FileMode.Open);
  var bytes = new byte[h.Length];
  h.Read(bytes, 0, (Int32)h.Length);
  bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
  var stream = new MemoryStream(bytes);
  var sr = new StreamReader(stream);
  var cc = new CsvContext();
  var fd = new CsvFileDescription { };

  var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
  return census_rows;
  }

public class USCensusPopulationData
  {
  public string SUMLEV;
  public string state;
  public string county;
  public string PLACE;
  public string cousub;
  public string NAME;
  public string STATENAME;
  public string POPCENSUS_2000;
  public string POPBASE_2000;
  public string POP_2000;
  public string POP_2001;
  public string POP_2002;
  public string POP_2003;
  public string POP_2004;
  public string POP_2005;
  public string POP_2006;
  public string POP_2007;
  public string POP_2008;

  public override string ToString()
    {
    return
      NAME + ", " + STATENAME + " " + 
      "pop2000=" + POP_2000 + " | " +
      "pop2008=" + POP_2008;
    } 
  }

public class  CensusRowComparer : IEqualityComparer<USCensusPopulationData>
  {
  public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
    return x.NAME == y.NAME && x.STATENAME == y.STATENAME ;
    }

  public int GetHashCode(USCensusPopulationData obj)
    {
    var hash = obj.ToString();
    return hash.GetHashCode();
    }
  }

Familiar idioms in Perl, Python, JavaScript, and C#

When I started working on the elmcity project, I planned to use my language of choice in recent years: Python. But early on, IronPython wasn’t fully supported on Azure, so I switched to C#. Later, when IronPython became fully supported, there was really no point in switching my core roles (worker and web) to it, so I’ve proceeded in a hybrid mode. The core roles are written in C#, and a variety of auxiliary pieces are written in IronPython.

Meanwhile, I’ve been creating other auxiliary pieces in JavaScript, as will happen with any web project. The other day, at the request of a calendar curator, I used JavaScript to prototype a tag summarizer. This was so useful that I decided to make it a new feature of the service. The C# version was so strikingly similar to the JavaScript version that I just had to set them side by side for comparison:

JavaScript:

var tagdict = new Object();

for ( i = 0; i < obj.length; i++ )
  {
  var evt = obj[i];
  if ( evt["categories"] != undefined)
    {
    var tags = evt["categories"].split(',');
    for (j = 0; j < tags.length; j++ )
      {
      var tag = tags[j];
      if ( tagdict[tag] != undefined )
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = [];

for ( var tag in tagdict )
  sorted_keys.push(tag);

sorted_keys.sort(function(a,b)
  { return tagdict[b] - tagdict[a] });

C#:

var tagdict = new Dictionary<string, int>();

foreach (var evt in es.events)
  {
  if (evt.categories != null)
    {
    var tags = evt.categories.Split(',');
    foreach (var tag in tags)
      {
      if (tagdict.ContainsKey(tag))
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = new List<string>();

foreach (var tag in tagdict.Keys)
  sorted_keys.Add(tag);

sorted_keys.Sort((a, b) => tagdict[b].CompareTo(tagdict[a]));

The idioms involved here include:

  • Splitting a string on a delimiter to produce a list

  • Using a dictionary to build a concordance of strings and occurrence counts

  • Sorting an array of keys by their associated occurrence counts

I first used these idioms in Perl. Later they became Python staples. Now here they are again, in both JavaScript and C#.
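
For completeness, since Python is where these idioms became staples for me, here's roughly how the same tag summarizer would read there. This is a sketch, assuming the same list of event records with a comma-delimited categories field:

[sourcecode language="python"]
tagdict = {}

for evt in events:
    if evt.get('categories'):
        # Split the comma-delimited category string into individual tags...
        for tag in evt['categories'].split(','):
            # ...and build a concordance of tags and occurrence counts
            tagdict[tag] = tagdict.get(tag, 0) + 1

# Sort the tags by descending occurrence count
sorted_keys = sorted(tagdict.keys(), key=lambda tag: tagdict[tag], reverse=True)
[/sourcecode]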

Ask and ye may receive, don’t ask and ye surely will not

This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.

From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes [1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.

But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:

If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.

Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.

Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.

I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.

So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And it’s one that I hope more computer science students will learn to aspire to.

FriendFeed for project collaboration

For me, FriendFeed has been a new answer to an old question — namely, how to collaborate in a loosely-coupled way with people who are using, and helping to develop, an online service. The elmcity project’s FriendFeed room has been an incredibly simple and effective way to interleave curated calendar feeds, blog postings describing the evolving service that aggregates those feeds, and discussion among a growing number of curators.

In his analysis of Where FriendFeed Went Wrong, Dare Obasanjo describes the value of a handful of services (Facebook, Twitter, etc.) in terms that would make sense to non-geeks like his wife. Here’s the elevator pitch for FriendFeed:

Republish all of the content from the different social networking media websites you use onto this site. Also one place to stay connected to what people are saying on multiple social media sites instead of friending them on multiple sites.

As usual, I’m an outlying data point. I’m using FriendFeed as a lightweight, flexible aggregator of feeds from my blog and from Delicious, and as a discussion forum. These feeds report key events in the life of the project: I added a new feature to the aggregator, the curator for Saskatoon found and added a new calendar. The discussion revolves around strategies for finding or creating calendar feeds, features that curators would like me to add to the service, and problems they’re having with the service.

I doubt there’s a mainstream business model here. It’s valuable to me because I’ve created a project environment in which key events in the life of the project are already flowing through feeds that are available to be aggregated and discussed. Anyone could arrange things that way, but few people will.

It’s hugely helpful to me, though. And while I don’t know for sure that FriendFeed’s acquisition by Facebook will end my ability to use FriendFeed in this way, I do need to start thinking about how I’d replace the service.

I don’t need a lot of what FriendFeed offers. Many of the services it can aggregate — Flickr, YouTube, SlideShare — aren’t relevant. And we don’t need realtime notification. So it really boils down to a lightweight feed aggregator married to a discussion forum.

One feature that FriendFeed’s API doesn’t offer, by the way, but that I would find useful, is programmatic control of the aggregator’s registry. When a new curator shows up, I have to manually add the associated Delicious feed to the FriendFeed room. It’d be nice to automate that.

Ideally FriendFeed will coast along in a way that lets me keep using it as I currently am. If not, it wouldn’t be too hard to recreate something that provides just the subset of FriendFeed’s services that I need. But ideally, of course, I’d repurpose an existing service rather than build a new one. If you’re using something that could work, let me know.

elmcity and WordPress MU: Questions and answers

In the spirit of keystroke conservation, I’m relaying some elmcity-related questions and answers from email to here. Hopefully it will attract more questions and more answers.

Dear Mr. Udell,

I am looking for a flexible calendar aggregator that I can use to report upcoming events for our college’s “Learning Commons” WordPress MU website, a site that will hopefully help keep our students abreast of events and opportunities taking place on campus.

1) Our site will be maintained using WordPress MU, so ideally the display of the calendars, and/or event-lists will be handled by a WordPress plugin. The one I am favouring is http://wordpress.org/extend/plugins/wordpress-ics-importer/ . I have tried this plugin and it almost does what we want.

Specifically, the plugin includes:

– a single widget that can display the “event-list” for one calendar;

– flexible options for displaying and aggregating calendars.

This plugin almost does what I want, but not quite.

a) The plugin is now limited to a single “events-list” widget. But with WordPress 2.8, it is possible to have many instances of a widget, so theoretically, I could display the “Diagnostic Tests” calendar in one instance , and the “Peer-tutoring” calendar in another widget instance.

b) It would be nice to have an option to display only the current week for specific calendars. While in other cases, it makes sense to display the entire month. And although I haven’t thought about it, likely displaying just the current day would be useful.

c) I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

2) It would be nice to remove the calendar aggregation from the WordPress plugin, and handle that separately. Hopefully the calendars will change much less frequently than the website will be viewed. If I understand http://blog.jonudell.net/elmcity-project-faq/ properly, this might be possible using the elmcity-project.

For example, I think we could use “topical hub aggregation” to create a “diagnostic test calendar” that aggregates the holiday calendar and the different departments “diagnostic test” calendars. What I don’t understand is what is the output of “elmcity”. Does it output a single merged calendar (ics) that could be displayed by the above plugin? Is that a possibility?

Similarly, I believe I could create a different meta bookmark to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

We have lots of groups, faculty, departments and staff on campus, and each wants to publicize their upcoming events. Letting them input and maintain their own calendars really seems to make sense. (Thanks for the idea. It seems clear this is the way to go, but I don’t seem to have the pieces to construct the output properly, as yet.)

I agree with your analysis that it would be better to have a separation of concerns between aggregation and display. So let’s do that, and start with aggregation.

I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

I think the elmcity system can be helpful here. I’ve recently discovered that there are really two levels — what I’ve started to call curation and meta-curation.

I believe I could create bookmarks to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

Right. It sounds like you’d want to curate a topic hub. It could be YourCollege, but if there may need to be other topic hubs you could choose a more specific name, like YourCollegeLearningCommons. That’d be your Delicious account name, and you’d be the “meta-curator” in this scenario.

As meta-curator you’d bookmark, in that Delicious account:

– Your holiday calendar

– Multiple departments’ calendars

Each of those would be managed by the responsible/authoritative person, using any software (Outlook, Google, Apple, Drupal, Notes, Live, etc.) that can publish an ICS feed.

There’s another level of flexibility using tags. In the above scenario, as meta-curator you could tag your holiday feed as holiday, and your LearningCommons feeds as LearningCommons, and then filter them accordingly.

What I don’t understand is what is the output of elmcity. Does it output a single merged calendar (ics) that could be displayed by the above plugin?

Yes. The outputs currently include a merged iCalendar (ICS) feed, HTML renderings, and a JSON feed.

Now, for the display options. So far, we’ve got:

  • Use the WordPress plugin to display merged ICS

  • Display the entire calendar as included (maybe customized) HTML

  • Display today’s events as included or script-sourced HTML

  • I have also just recently added a new method that enables things like this: http://jonudell.net/test/upcoming-widget.html

  • You can view the source to see how it’s done. The “API call” here is:

    http://elmcity.cloudapp.net/services/elmcity/json?jsonp=eventlist&recent=7&view=music

    Yours might be:

    http://elmcity.cloudapp.net/services/YourCollegeLearningCommons/json?jsonp=eventlist&recent=10

    or

    &recent=20&view=holiday

    etc.

    This is brand new, as of yesterday. Actually I just realized I should use “upcoming” instead of “recent” so I’ll go and change that now :-) But you get the idea.

    The flexibility here is ultimately governed by:

    1. The curator’s expressive and disciplined use of tags to create useful views

    2. The kinds of queries I make available through the API. So far I’ve only been asked to do ‘next N events’ so that’s what I did yesterday. But my intention is to support every kind of query that’s feasible, and that people ask for. Things like a week’s worth, or a week’s worth in a category, are obvious next steps.

Two projects for civic-minded student programmers

One of the key findings of the elmcity project, so far, is that there’s a lot of calendar information online, but very little in machine-readable form. Transforming this implicit data about public events into explicit data is an important challenge. I’ve been invited to define the problem, for students who may want to tackle it as a school project. Here are the two major aspects I’ve identified.

A general scraper for calendar-like web pages

There are zillions of calendar-like web pages, like this one for Harlow’s Pub in Peterborough, NH. These ideally ought to be maintained using calendar programs that publish machine-readable iCalendar feeds which are also transformed and styled to create human-readable web pages. But that doesn’t (yet) commonly happen.

These web pages are, however, often amenable to scraping. And for a while, elmcity curators were making very effective use of FuseCal (1, 2, 3) to transform these kinds of pages into iCalendar feeds.

When that service shut down, I retained a list of the pages that elmcity curators were successfully transforming into iCalendar feeds using FuseCal. These are test cases for an HTML-to-iCalendar service. Anyone who’s handy with scraping libraries like Beautiful Soup can solve these individually. The challenge here is to create, by abstraction and generalization, an engine that can handle a significant swath of these cases.
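
To give a flavor of the per-page version of the problem, here's an illustrative Beautiful Soup fragment. The markup it assumes (event rows with date and title cells) is invented for the example; real pages vary wildly, and abstracting across that variety is the actual challenge:

[sourcecode language="python"]
import urllib
from bs4 import BeautifulSoup

def scrape_events(url):
    # Hypothetical markup: each event is a <tr class="event"> containing
    # <td class="date"> and <td class="title"> cells
    soup = BeautifulSoup(urllib.urlopen(url).read())
    events = []
    for row in soup.find_all('tr', class_='event'):
        date = row.find('td', class_='date')
        title = row.find('td', class_='title')
        if date and title:
            events.append((date.get_text(strip=True), title.get_text(strip=True)))
    return events

# Each (date, title) pair would then become a VEVENT in an iCalendar feed
[/sourcecode]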

A hybrid system for finding implicit recurring events and making them explicit

Lots of implicit calendar data online doesn’t even pretend to be calendar-like, and cannot be harvested using a scraper. Finding one-off events in this category is out of scope for my project. But finding recurring events seems promising. The singular effort required to publish one of these will pay ongoing dividends.

It’s helpful that the language people use to describe these events — “every Tuesday”, “third Saturday of every month” — is distinctive. To begin exploring this domain, I wrote a specialized search robot that looks for these patterns, in conjunction with names of places. Its output is available for all the cities and towns participating in the elmcity project. For example, this page is the output for Keene, NH. It includes more than 2000 links to web pages — or, quite often, PDF files — some fraction of which represent recurring events.
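
Here's a rough sketch (mine, not the robot's actual code) of the kind of pattern matching involved:

[sourcecode language="python"]
import re

WEEKDAY = r'(monday|tuesday|wednesday|thursday|friday|saturday|sunday)s?'
ORDINAL = r'(first|second|third|fourth|last|1st|2nd|3rd|4th)'

# Phrases like "every Tuesday" or "third Saturday of every month"
recurring_phrase = re.compile(
    r'\bevery\s+' + WEEKDAY +
    r'|\b' + ORDINAL + r'\s+' + WEEKDAY + r'\s+of\s+(the|every|each)\s+month',
    re.IGNORECASE)

def looks_recurring(text):
    return recurring_phrase.search(text) is not None

print looks_recurring('Dads Group meets the first Monday of each month')  # True
print looks_recurring('Open mic every Tuesday at 7pm')                    # True
print looks_recurring('Annual meeting, March 3')                          # False
[/sourcecode]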

In Finding and connecting social capital I showed a couple of cases where the pages found this way did, in fact, represent recurring events that could be added to an iCalendar feed.

To a computer scientist this looks like a problem that you might solve using a natural language parser. And I think it is partly that, but only partly. Let’s look at another example:

At first glance, this looks hopeful:

First Monday of each month: Dads Group, 105 Castle Street, Keene NH

But the real world is almost always messier than that. For starters, that image comes from the Monadnock Men’s Resource Center’s Fall 2004 newsletter. So before I add this to a calendar, I’ll want to confirm the information. The newsletter is hosted at the MMRC site. Investigation yields these observations:

  • The most recent issue of the newsletter was Winter ’06

  • The last-modified date of the MMRC home page is September 2008

  • As of that date, the Dads Group still seems to have been active, under a slightly different name: Parent Outreach Project, DadTime Program, 355-3082

  • There’s no email address, only a phone number.

So I called the number, left a message, and will soon know the current status.

What kind of software-based system can help us scale this gnarly process? There is an algorithmic solution, surely, but it will need to operate in a hybrid environment. The initial search-driven discovery of candidate events can be done by an automated parser tuned for this domain. But the verification of candidates will need to be done by human volunteers, assisted by software that helps them:

  • Divide long lists of candidates into smaller batches

  • Work in parallel on those batches

  • Evaluate the age and provenance of candidates

  • Verify or disqualify candidates based on discoverable evidence, if possible

  • Otherwise, find appropriate email addresses (preferably) or phone numbers, and manage the back-and-forth communication required to verify or disqualify a candidate

  • Refer event sponsors to a calendar publishing how-to, and invite them to create data feeds that can reliably syndicate
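Here’s a rough Python sketch of two of those supports, batching and age evaluation via Last-Modified headers. The function names, the batch size, and the record shape are assumptions, not part of any existing tool.

import requests  # pip install requests

def batch(items, size=25):
    """Divide a long list of candidates into smaller batches for volunteers."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def annotate_age(urls):
    """Attach Last-Modified headers so stale pages can be deprioritized."""
    annotated = []
    for url in urls:
        try:
            resp = requests.head(url, timeout=10, allow_redirects=True)
            last_modified = resp.headers.get("Last-Modified", "unknown")
        except requests.RequestException:
            last_modified = "unreachable"
        annotated.append({"url": url, "last_modified": last_modified})
    return annotated

# batches = batch(annotate_age(candidate_urls))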

Students endowed with the geek gene are likely to gravitate toward the first problem because it’s cleaner. But I hope I can also attract interest in the second problem. We really need people who can hack that kind of real-world messiness.

Curation, meta-curation, and live Net radio

I’ve long been dissatisfied with how we discover and tune into Net radio. This iTunes screenshot illustrates the problem:

Start with a genre, pick a station in that genre, then listen to that station. This just doesn’t work for me. I like to listen to a lot of different things. And I especially value serendipitous recommendations from curators whose knowledge and preferences diverge radically from my own.

Yes, there’s Pandora, but what I’ve been wanting all along is a way to enable and then subscribe to curators who guide me to what’s playing now on the live streams coming from radio stations around the world. It’s Wednesday morning, 11AM Eastern Daylight Time, and I know there are all kinds of shows playing right now. But how do I materialize a view for this moment in time — or for tonight at 9PM, or for Sunday morning at 10AM — across that breadth and wealth of live streams?

I started thinking about schedules of radio programs, and about calendars, and about BBC Backstage — because I’ll be interviewing Ian Forrester for an upcoming episode of my podcast — and I landed on this blog post, which shows how to form a URL that retrieves upcoming episodes of a BBC show as an iCalendar feed.

Meanwhile, I’ve just created a new mode for the elmcity calendar aggregator. Now instead of creating a geographical hub, which combines events from Eventful and Upcoming and events from a list of iCalendar feeds — all for one location — you can create a topical hub whose events are governed only by time, not by location.

Can these ingredients combine to solve my Net radio problem? Could a curator for an elmcity topical aggregator cherrypick favorite shows from around the Net, and create a calendar that shows me what’s playing right now?

It seems plausible, so I spun up a new topical hub in the elmcity aggregator and started experimenting.

I began with the BBC’s iCalendar feeds. But evidently they don’t include VTIMEZONE components, which means calendar clients (or aggregators) can’t translate UK times to other times.
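For reference, here’s roughly what a feed needs to carry so a client can do that translation: a VTIMEZONE component defining the zone’s rules, plus events whose DTSTART values refer to it by TZID. The Folkscene event and its date below are illustrative, not copied from an actual BBC feed.

BEGIN:VTIMEZONE
TZID:Europe/London
BEGIN:DAYLIGHT
TZNAME:BST
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
DTSTART:19700329T010000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZNAME:GMT
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
DTSTART:19701025T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
SUMMARY:Folkscene
DTSTART;TZID=Europe/London:20090723T210500
END:VEVENT

Without the VTIMEZONE, a client sees 21:05 with no way to know whether that means GMT, BST, or something else entirely.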

I ran into a few other issues, which perhaps can be sorted out when I chat with Ian Forrester. But meanwhile, since the universe of Net radio is much vaster than the BBC, and since most of it won’t be accessible in the form of data feeds, I stepped back for a broader view.

Really, anyone can publish an event that gives the time for a live show, plus a link to its player. And when a show happens on a regular recurring schedule, the little bit of effort it takes to publish that event pays recurring dividends.

Consider, for example, Nic Harcourt’s Sounds Eclectic. It’s on at these (Pacific) times: SUN 6:00A-8:00A, SAT 2:00P-4:00P, SAT 10:00P-12:00A. You can plug these into any calendar program as recurring events. And if you publish a feed, it’s not only available to you from any calendar client, it’s also available to any other calendar client — or to any aggregator.

Here’s a calendar with three recurring events for Sounds Eclectic, plus one recurring event for WICN’s Sunday jazz show, plus a single non-recurring event — the BBC’s Folkscene — which will be on the BBC iPlayer on Thursday at 4:05PM my time and 9:05PM UK time. If you load the calendar feed into a client — Outlook, Apple iCal, Google Calendar, Lotus Notes — you’ll see these events translated into your local timezone.

Note that Live Calendar is especially handy for publishing events from many different timezones. That’s because, like Outlook but unlike Google Calendar, it enables you to specify timezones on a per-event basis. So instead of having to enter the Sunday-morning recurrence of Sounds Eclectic as 9AM Eastern Daylight Time, I can enter it as 6AM Pacific Daylight Time. Likewise for Folkscene: I can enter 9:05PM British Summer Time. Since these are the times that appear on the shows’ websites, it’s natural to use them.
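Here’s roughly what one of those per-event-timezone recurrences looks like in the underlying iCalendar data: the Sunday-morning airing of Sounds Eclectic, entered in its home timezone with a weekly recurrence rule. The start date is illustrative, and a complete feed would also carry the corresponding VTIMEZONE component.

BEGIN:VEVENT
SUMMARY:Sounds Eclectic
DTSTART;TZID=America/Los_Angeles:20090705T060000
DTEND;TZID=America/Los_Angeles:20090705T080000
RRULE:FREQ=WEEKLY;BYDAY=SU
END:VEVENT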

This sort of calendar is great for personal use. But I’m looking for the Webjay of Net radio. And I think maybe elmcity topical hubs can help enable that.

There’s a way of using these topical hubs I hadn’t thought of until Tony Karrer created one. Tony runs TechEmpower, a software, web, and eLearning development firm. He wants to track and publish online eLearning events, so he’s managing them in Google Calendar and syndicating them through an elmcity topical hub to his website.

A topical hub, like a geographic hub, is controlled by a Delicious account whose owner maintains a list of feeds. I’d been thinking of the account owner as the curator, and of the feeds as homogeneous sources of events: school board meetings, soccer games, and so on.

But then Tony partnered with another organization that tracks webinars, invited that group to publish its own feed, added it to the eLearning hub, and wrote a blog post entitled Second Calendar Curator Joins to Help with List of Free Webinars:

The initial list of calendar entries, we added ourselves. But I’m pleased to announce that we’ve just signed up our second calendar curator – Coaching Ourselves. Their events are now appearing in the listings. … It is exactly because we can distribute the load of keeping this list current that makes me think this will work really well in the long run.

This probably shouldn’t have surprised me, but it did. I’d been thinking in terms of curators, feeds, and events. What Tony showed me is that you can also (optionally) think in terms of meta-curators, curators, feeds, and events. In this example, Tony is himself a curator, but he is also a meta-curator — that is, a collector of curators.

I’d love to see this model evolve in the realm of Net radio. If you want to join the experiment, just use any calendar program to keep track of some of your favorite recurring shows. (Again, it’s very helpful to use one that supports per-event timezones.) Then publish the shows as an iCalendar feed, and send me the URL. As the meta-curator of delicious.com/InternetRadio, as well as the curator of jonu.calendar.live.com/calendar/InternetRadio/index.html, I’ll have two options. If I like most or all of the shows you like, I can add your feed to the hub. If I only like some of the shows you like, I can cherrypick them for my feed. Either way, the aggregated results will be available as XML, as JSON, and as an iCalendar feed that can flow into calendar clients or aggregators.

Naturally there can also be other meta-curators. To become one, designate a Delicious account for the purpose, spin up your own topical hub, and tell me about it.

Topical event hubs

The elmcity project began with a focus on aggregating events for communities defined by places: cities, towns. But I realized a while ago that it could also be used to aggregate events for communities defined by topics. So now I’m building out that capability. One early adopter tracks and promotes online events in the e-learning domain. Another tracks and promotes conferences and events related to environmentally-sustainable business practices.

The curation method is very similar to what’s defined in the elmcity project FAQ. To define a topical hub, you use a Delicious account, create a metadata URL as shown in the FAQ, and use what= instead of where= to specify a topic instead of a location. Since there’s no location, there’s no aggregation of Eventful and Upcoming events. The topical hub is driven purely by your registry of iCalendar feeds.

If you (or somebody you know) needs to curate events by topic, and would like to try this method, please get in touch. I’d love to have you help me define how this can work, and discover where it can go.

Why we need an XML representation for iCalendar

Translations:

Croatian

On this week’s Innovators show I got together with two of the authors of a new proposal for representing iCalendar in XML. Mike Douglass is lead developer of the Bedework Calendar System, and Steven Lees is Microsoft’s program manager for FeedSync and chair of the XML technical committee in CalConnect, the Calendaring and Scheduling Consortium.

What’s proposed is no more, but no less, than a well-defined two-way mapping between the current non-XML-based iCalendar format and an equivalent XML format. So, for example, here’s an event — the first low tide of 2009 in Myrtle Beach, SC — in iCalendar format:

BEGIN:VEVENT
SUMMARY:Low Tide 0.39 ft
DTSTART:20090101T090000Z
UID:2009.0
DTSTAMP:20080527T000001Z
END:VEVENT

And here’s the equivalent XML:

<vevent>
  <properties>
    <dtstamp>
      <date-time utc='yes'>
        <year>2008</year><month>5</month><day>27</day>
        <hour>0</hour><minute>0</minute><second>1</second>
      </date-time>
    </dtstamp>
    <dtstart>
      <date-time utc='yes'>
        <year>2009</year><month>1</month><day>1</day>
        <hour>9</hour><minute>0</minute><second>0</second>
      </date-time>
    </dtstart>
    <summary>
      <text>Low Tide 0.39 ft</text>
    </summary>
    <uid>
      <text>2009.0</text>
    </uid>
  </properties>
</vevent>

The mapping is quite straightforward, as you can see. At first glance, the XML version just seems verbose. So why bother? Because the iCalendar format can be tricky to read and write, either directly (using eyes and hands) or indirectly (using software). That’s especially true when, as is typical, events include longer chunks of text than you see here.
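Two of the trickier bits are line folding and text escaping. Long content lines are folded onto continuation lines that begin with a space, and that leading space is folding syntax, not content; commas, semicolons, and newlines inside text values must be escaped with backslashes. Here’s an illustrative fragment (the event itself is invented):

BEGIN:VEVENT
UID:dads-group-example
DTSTAMP:20090723T000000Z
DTSTART:20090706T223000Z
SUMMARY:Dads Group\, first Monday of each month
DESCRIPTION:Peer support for fathers. Bring a friend\; childcare
  provided. Meets at 105 Castle Street\, Keene NH.\nCall ahead to confirm.
END:VEVENT

Hand-rolled parsers and generators routinely get these details wrong, which is exactly why an XML mapping, with its off-the-shelf tooling, lowers the bar.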

I make an analogy to the RSS ecosystem. When I published my first RSS feed a decade ago, I wrote it by hand. More specifically, I copied an existing feed as a template, and altered it using cut-and-paste. Soon afterward, I wrote the first of countless scripts that flowed data through similar templates to produce various kinds of RSS feeds.

Lots of other people did the same, and that’s part of the reason why we now have a robust network of RSS and Atom feeds that carries not only blogs, but all kinds of data packets.

Another part of the reason is the Feed Validator which, thanks to heroic efforts by Mark Pilgrim and Sam Ruby, became and remains the essential sanity check for anybody who’s whipping up an ad-hoc RSS or Atom feed.

No such ecosystem exists for iCalendar. I’ve been working hard to show why we need one, but the most compelling rationale comes from a Scott Adams essay that I quoted from in this blog entry. Dilbert’s creator wrote:

I think the biggest software revolution of the future is that the calendar will be the organizing filter for most of the information flowing into your life. You think you are bombarded with too much information every day, but in reality it is just the timing of the information that is wrong. Once the calendar becomes the organizing paradigm and filter, it won’t seem as if there is so much.

If you buy that argument, then we’re going to need more than a handful of applications that can reliably create and exchange calendar data. We’ll want anyone to whip up a calendar feed as easily as anyone can now whip up an RSS/Atom feed.
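In that spirit, here’s a hedged Python sketch of whipping up a calendar feed from a template, the analog of my old cut-and-paste RSS scripts. The event data is invented, and a real script would also emit CRLF line endings and handle folding and escaping of long text values.

# Flow simple event records through a template to produce an iCalendar feed.
EVENT_TEMPLATE = """BEGIN:VEVENT
UID:{uid}@example
DTSTAMP:{stamp}
DTSTART:{start}
SUMMARY:{summary}
END:VEVENT"""

def make_feed(events, stamp="20090723T000000Z"):
    body = "\n".join(
        EVENT_TEMPLATE.format(uid=i, stamp=stamp, **e)
        for i, e in enumerate(events))
    return "BEGIN:VCALENDAR\nVERSION:2.0\n" + body + "\nEND:VCALENDAR\n"

print(make_feed([
    {"start": "20090101T090000Z", "summary": "Low Tide 0.39 ft"},
    {"start": "20090101T152500Z", "summary": "High Tide (illustrative)"},
]))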

We’ll also need more than a handful of parsers that can reliably read calendar feeds, so that thousands of ad-hoc applications, services, and scripts will be able to consume all the new streams of time-and-date-oriented information.

I think that a standard XML representation of iCalendar will enable lots of ad-hoc producers and consumers to get into the game, and collectively bootstrap this new ecosystem. And that will enable what Scott Adams envisions.

Here’s a small but evocative example. Yesterday I started up a new instance of the elmcity aggregator for Myrtle Beach, SC. The curator, Dave Slusher, found a tide table for his location, and it offers an iCalendar feed. So the Myrtle Beach calendar for today begins like this:

Thu Jul 23 2009

WeeHours

Thu 03:07 AM Low Tide -0.58 ft (Tide Table for Myrtle Beach, SC)

Morning

Thu 06:21 AM Sunrise 6:21 AM EDT (Tide Table for Myrtle Beach, SC)
Thu 09:09 AM High Tide 5.99 ft (Tide Table for Myrtle Beach, SC)
Thu 10:00 AM Free Coffee Fridays (eventful: )
Thu 10:00 AM Summer Arts Project at The Market Common (eventful: )
Thu 10:00 AM E.B. Lewis: Story Painter (eventful: )

Imagine this kind of thing happening on the scale of the RSS/Atom feed ecosystem. The lack of an agreed-upon XML representation for iCalendar isn’t the only reason why we don’t have an equally vibrant ecosystem of calendar feeds. But it’s an impediment that can be swept away, and I hope this proposal will finally do that.

More fun than herding servers

Until recently, the elmcity calendar aggregator was running as a single instance of an Azure worker role. The idea all along, of course, was to exploit the system’s ability to farm out the work of aggregation to many workers. Although the sixteen cities currently being aggregated don’t yet require the service to scale beyond a single instance, I’d been meaning to lay the foundation for that. This week I finally did.

Will there ever be hundreds or thousands of participating cities and towns? Maybe that’ll happen, maybe it won’t, but the gating factor will not be my ability to babysit servers. That’s a remarkable change from just a few years ago. Over the weekend I read Scott Rosenberg’s new history of blogging, Say Everything. Here’s a poignant moment from 2001:

Blogger still lived a touch-and-go existence. Its expenses had dropped from a $50,000-a-month burn rate to a few thousand in rent and technical costs for bandwidth and such; still, even that modest budget wasn’t easy to meet. Eventually [Evan] Williams had to shut down the office entirely and move the servers into his apartment. He remembers this period as an emotional rollercoaster. “I don’t know how I’m going to pay the rent, and I can’t figure that out because the server’s not running, and I have to stay up all night, trying to figure out Linux, and being hacked, and then fix that.”

I’ve been one of those guys who babysits the server under the desk, and I’m glad I won’t ever have to go back there again. What I will have to do, instead, is learn how to take advantage of the cloud resources now becoming available. But I’m finding that to be an enjoyable challenge.

In the case of the calendar aggregator, which needs to map many worker roles to many cities, I’m using a blackboard approach. Here’s a snapshot of it, from an aggregator run using only a single worker instance:

     id: westlafcals
  start: 7/14/2009 12:12:05 PM
   stop: 7/14/2009 12:14:46 PM
running: False

     id: networksierra
  start: 7/14/2009 12:14:48 PM
   stop: 7/14/2009 12:15:05 PM
running: False

     id: localist
  start: 7/14/2009 12:15:06 PM
   stop: 7/14/2009  5:37:03 AM
running: True

     id: aroundfred
  start: 7/14/2009  5:37:05 AM
   stop: 7/14/2009  5:39:20 AM
running: False

The moving finger wrote westlafcals (West Lafayette) and networksierra (Sonora); it’s now writing localist (Baltimore), and will next write aroundfred (Fredericksburg).

Here’s a snapshot from another run using two worker instances:

     id: westlafcals
  start: 7/14/2009 10:12:05 PM
   stop: 7/14/2009  4:37:03 AM
running: True

     id: networksierra
  start: 7/14/2009 10:12:10 PM
   stop: 7/14/2009 10:13:05 PM
running: False

     id: localist
  start: 7/14/2009 10:13:06 PM
   stop: 7/14/2009  4:41:12 AM
running: True

     id: aroundfred
  start: 7/14/2009  4:41:05 AM
   stop: 7/14/2009  4:42:20 AM
running: False

Now there are two moving fingers. One’s writing westlafcals, one has written networksierra, one’s writing localist, and one or the other will soon write aroundfred. The total elapsed time will be very close to half what it was in the single-instance case. I’d love to crank up the instance count and see an aggregation run rip through all the cities in no time flat. But the Azure beta caps the instance count at two.

The blackboard is an Azure table with one record for each city. Records are flexible bags of name/value pairs. If you make a REST call to the table service to query for one of those records, the Atom payload that comes back looks like this:

<m:properties>
   <d:PartitionKey>blackboard</d:PartitionKey>
   <d:RowKey>aroundfred</d:RowKey>
   <d:start>7/14/2009 4:41:05 AM</d:start>
   <d:stop>7/14/2009 4:42:20 AM</d:stop>
   <d:running>False</d:running>
</m:properties>

At the start of a cycle, each worker wakes up, iterates through all the cities, aggregates those not claimed by other workers, and then sleeps until the next cycle. To claim a city, a worker tries to create a record in a parallel Azure table, using the PartitionKey locks instead of blackboard. If the worker succeeds, it considers the city locked for its own use, aggregates the city’s calendars, and then deletes the lock record. If it fails to create that record, it considers the city locked by another worker and moves on.

This cycle is currently one hour. But in order to respect the various services it pulls from, the service defines the interval between aggregation runs to be 8 hours. So when a worker claims a city, it first checks to see if the last aggregation started more than 8 hours ago. If not, the worker skips that city.

Locks can be abandoned. That could happen if a worker hangs or crashes, or when I redeploy a new version of the service. So the worker also checks to see if a lock has been hanging around longer than the aggregation interval. If so, it overrides the lock and aggregates that city.
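Here’s a hedged Python sketch of that claim/skip/override logic, not the actual elmcity code, which runs on Azure. The helpers (get_blackboard_entry, get_lock, table_insert, table_delete, aggregate_calendars) and the EntityAlreadyExists exception are hypothetical stand-ins for Azure Table Storage calls; the insert is assumed to fail if another worker already created the lock record.

from datetime import datetime, timedelta

AGGREGATION_INTERVAL = timedelta(hours=8)

def maybe_aggregate(city):
    # Skip cities whose last aggregation started less than 8 hours ago.
    entry = get_blackboard_entry(city)                 # hypothetical helper
    if datetime.utcnow() - entry.start < AGGREGATION_INTERVAL:
        return
    try:
        # Claim the city by creating a lock record; assumed to raise
        # EntityAlreadyExists if another worker already holds the lock.
        table_insert(partition="locks", row=city, locked_at=datetime.utcnow())
    except EntityAlreadyExists:
        lock = get_lock(city)                          # hypothetical helper
        if datetime.utcnow() - lock.locked_at < AGGREGATION_INTERVAL:
            return                                     # locked by another worker
        # Otherwise the lock is stale (hung worker, redeploy); override it.
    try:
        aggregate_calendars(city)                      # does the real work
    finally:
        table_delete(partition="locks", row=city)      # release the lock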

I’m sure this scheme isn’t bulletproof, but I reckon it doesn’t need to be. If two workers should happen to wind up aggregating the same city at about the same time, it’s no big deal. The last writer wins, a little extra work gets done.

Anyway, I’ll be watching the blackboard over the next few days. There’s undoubtedly more tinkering to do. And it’s a lot more fun than herding servers.