Searching for calendar information

WHAT AND WHY

With the community calendar service now live, I’ve got to do a bit more work to make it fully data-driven. Since I’m already managing the per-community feed lists and metadata on Delicious, I figure I might as well go all the way. So I’m keeping a list of the Delicious accounts that control each community’s calendar aggregator on Delicious too. Today there are three. The idea is that when I add the fourth, I won’t touch any code — or even configuration data — that will require an update to the running service. I’ll just bookmark a fourth Delicious account and tag it with calendarcuration.

But that’s merely an administrative convenience. Much more critical, at this point, is to help curators find machine-readable calendars in their communities and — since most of the calendars that might exist don’t — also show people how they can easily create them.

I got a running start when I bootstrapped the Ann Arbor instance, thanks to Google Calendar. I searched for Ann Arbor there and found a nice list of iCalendar feeds. But that search feature is, at least for now, gone.

Several curators have tried searching the web for .ICS files (e.g. filetype:ics), but that’s not very productive for a couple of reasons. Where iCalendar resources do exist, they often aren’t exposed as files with .ICS extensions. But more importantly, relative to the number of iCalendar resources that could exist, very few actually do.
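
Since an extension isn't a reliable signal, a curator's tool could look at the response itself. Here is a minimal sketch in Python 3, assuming you already have the Content-Type header and the body text in hand; the function name looks_like_icalendar is hypothetical, not part of the calendar service.

```python
def looks_like_icalendar(content_type, body):
    """Guess whether an HTTP response carries iCalendar data.

    content_type: the Content-Type header value (may be None or empty).
    body: the response body as text.
    """
    # text/calendar is the registered media type for iCalendar.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "text/calendar":
        return True
    # Otherwise sniff the body: an iCalendar object starts with BEGIN:VCALENDAR.
    return body.lstrip().upper().startswith("BEGIN:VCALENDAR")
```

A feed served as text/plain from a URL with no .ICS extension would still be caught by the second check.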

So I thought back on how I bootstrapped the original Keene instance. A number of the events there are recurring events that were advertised on the web, but not in any structured format. I found them one day by doing web queries like:

"first monday" keene
"every thursday" keene

There’s no fully automatic way to convert this stuff into structured calendar data. But it’s pretty straightforward to fire up a calendar program, enter some recurring events, and publish a feed. The advantage of recurring events, of course, is that they keep showing up, which is very helpful if you’re trying to build critical mass.

So I’m now envisioning a pair of tools to help curators do this more easily. First, I’d like to have each community’s aggregator running a scheduled search that helps the curator be aware of calendar-like information that could be upgraded to actual calendar data. Second, I’d like to provide a tool that partly automates the cumbersome data entry.
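
To make the second tool a little more concrete, here is a rough sketch of one way it could suggest a recurrence rule from the kind of phrase the search turns up. It is an illustration under stated assumptions, not the tool itself: the name suggest_rrule is hypothetical, it only handles the qualifier and day words used in the search, and it reads "first monday" as "the first Monday of every month," which a curator would still need to confirm.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
DAY_CODES = {
    "monday": "MO", "tuesday": "TU", "wednesday": "WE", "thursday": "TH",
    "friday": "FR", "saturday": "SA", "sunday": "SU",
}

# Matches phrases such as "first monday" or "every thursday".
PATTERN = re.compile(
    r"\b(first|second|third|fourth|every)\s+"
    r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)

def suggest_rrule(text):
    """Return an iCalendar RRULE value for the first recurrence phrase, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    qualifier = match.group(1).lower()
    day = DAY_CODES[match.group(2).lower()]
    if qualifier == "every":
        return "FREQ=WEEKLY;BYDAY=%s" % day
    # Assumption: an ordinal means the Nth such weekday of each month.
    return "FREQ=MONTHLY;BYDAY=%d%s" % (ORDINALS[qualifier], day)
```

The result is only a suggestion. The time, place, and title still have to come from a person reading the page.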

I’ve done an initial version of the search tool, and an example of its output is here. I’ll attach the code to the end of this item, for those who care, although I expect that if it winds up being useful to curators, most will appropriately not care, and will only want to scan the links now and then.

It may be interesting, over time, to try to evolve this into a robot that makes sense of the calendar information that people actually write, as opposed to the information that calendar programs constrain them to produce. But meanwhile this hybrid approach seems like a way to make progress.

HOW

I did this tool in two parts. The kernel, so to speak, is in C#, because for now that’s the most practical way to write Azure services and applications. But the application is in IronPython, because the search function doesn’t yet need to be hosted on Azure, and IronPython is a really flexible and convenient way to experiment with the kernel.

The C# piece uses James Newton-King’s Json.NET library because JavaScript interfaces are now the preferred way to search programmatically. It’s been a while since I’ve done this kind of thing. Used to be, the REST APIs were easy to find. But now, since those interfaces are mainly intended for use by JavaScript objects embedded in web pages, I had to do a bit of spelunking.

One of the interesting things about Json.NET is that it includes LINQ to JSON, an implementation of LINQ for JSON. That’s why you see the “from … select” syntax, which extracts an enumerable sequence of URLs from the JSON results returned by the search services.

using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

namespace CalendarAggregator
{
  public class Search
    {
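    public List<string> search_result_urls = new List<string>();
    // Maps each result URL to the number of result pages it appeared on.
    // Initialized here so dictify() never sees a null dictionary.
    public Dictionary<string, int> dict = new Dictionary<string, int>();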

    public List<string> livesearch(string query)
      {
      // AppId=XXX is a placeholder for a real application ID.
      var url_template = "http://api.search.live.net/json.aspx?AppId=XXX&" +
          "Sources=web&Query={0}&Web.Count=50";
      var offset_template = "&Web.Offset={1}";
      var search_url = "";
      int[] offsets = { 0, 50, 100, 150 };
      foreach (var offset in offsets)
        {
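        // Escape the query: it contains spaces and double quotes.
        if (offset == 0)
          search_url = string.Format(url_template, Uri.EscapeDataString(query));
        else
          search_url = string.Format(url_template + 
            offset_template, Uri.EscapeDataString(query), offset);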

        // Utils.FetchUrl is a helper that is not part of this listing.
        var page = Utils.FetchUrl(search_url).data_as_string;
        JObject o = (JObject)JsonConvert.DeserializeObject(page);

        var urls =
          from url in o["SearchResponse"]["Web"]["Results"].Children()
            select url.Value<string>("Url").ToString();

        dictify(urls);
        }
      // Results accumulate in dict; the returned list is a placeholder and stays empty.
      return new List<string>();
    }

    public List<string> googlesearch(string query)
      {
      var url_template = "http://ajax.googleapis.com/ajax/services/search" +
          "/web?v=1.0&rsz=large&q={0}&start={1}";
      var search_url = "";
      int[] offsets = { 0, 8, 16, 24, 32, 40, 48 };
      foreach (var offset in offsets)
        {
        // Escape the query: it contains spaces and double quotes.
        search_url = string.Format(url_template, Uri.EscapeDataString(query), offset);
        var page = Utils.FetchUrl(search_url).data_as_string;
        JObject o = (JObject)JsonConvert.DeserializeObject(page);

        var urls =
          from url in o["responseData"]["results"].Children()
             select url.Value<string>("url").ToString();

        dictify(urls);
        }
      // Results accumulate in dict; the returned list is a placeholder and stays empty.
      return new List<string>();
    }

    private void dictify(IEnumerable<string> urls)
      {
      foreach (var url in urls)
        {
        if (dict.ContainsKey(url))
          dict[url] += 1;
        else
          dict[url] = 1;
        }
    }
  }
}
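
For readers who don't use C#, the extract-and-count step can be sketched in plain Python. This is an illustration, not a port of the service: the response shape (responseData, results, url) is taken from the Google query in the listing above, while the sample pages, the example.org URLs, and the function names are made up for the demonstration.

```python
import json

def extract_urls(page):
    """Pull result URLs out of a JSON response shaped like the Google one."""
    data = json.loads(page)
    return [result["url"] for result in data["responseData"]["results"]]

def dictify(counts, urls):
    """Tally how many result pages each URL appeared on."""
    for url in urls:
        counts[url] = counts.get(url, 0) + 1

# Two hypothetical result pages; the first URL shows up in both.
page_one = json.dumps({"responseData": {"results": [
    {"url": "http://example.org/a"},
    {"url": "http://example.org/b"},
]}})
page_two = json.dumps({"responseData": {"results": [
    {"url": "http://example.org/a"},
]}})

counts = {}
for page in (page_one, page_two):
    dictify(counts, extract_urls(page))
```

A URL that turns up for several different queries gets a higher count, which is a hint that the page mentions more than one recurring event.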

Here’s the IronPython piece which uses the search methods from the C# code:

import clr
clr.AddReference("CalendarAggregator")
# Bring the C# Search class into scope.
from CalendarAggregator import Search

locations = [
'ann arbor',
'huntington wv',
'keene',
'virginia beach',
]

qualifiers = [
'first',
'second',
'third',
'fourth',
'every'
]

days = [
'monday',
'tuesday',
'wednesday',
'thursday',
'friday',
'saturday',
'sunday'
]

for location in locations:
  search = Search()
  for qualifier in qualifiers:
    for day in days:
      q = '"%s" "%s %s"' % ( location, qualifier, day )
      search.googlesearch(q)
      search.livesearch(q)

  # Report each location's URLs and counts before moving on to the next one.
  for key in search.dict.Keys:
    print key, search.dict[key]
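
To get a feel for how much searching this loop does, here is a small standalone sketch that builds the query strings for one location without calling any search service. The helper name build_queries is hypothetical; the lists and the query format are the ones used above.

```python
from itertools import product

qualifiers = ['first', 'second', 'third', 'fourth', 'every']
days = ['monday', 'tuesday', 'wednesday', 'thursday',
        'friday', 'saturday', 'sunday']

def build_queries(location):
    """Return every '"location" "qualifier day"' query for one location."""
    return ['"%s" "%s %s"' % (location, qualifier, day)
            for qualifier, day in product(qualifiers, days)]

queries = build_queries('keene')

# Each query is fetched at 7 offsets by googlesearch and 4 by livesearch.
fetches_per_location = len(queries) * (7 + 4)
```

That works out to 35 queries and 385 page fetches per location, which is one reason to run this as a scheduled job rather than on demand.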

7 Comments

  1. The people over at FuseCal have already done all this work for you. http://fusecal.com. You just plop in a URL that may contain some kind of structured event data, and it generates an ICS file from it.

    It’s been really slick in my experiences.

    They said they reached out to you a year ago and you said you were angry that they had to exist :)

  2. FuseCal is great, and since I practice as well as preach the virtue of laziness I use it wherever I can.

    Currently, the Keene Public Library is the only calendar I’ve been able to usefully scrape using it:

    http://www.fusecal.com/calendar/ical/320516?h=ea236e2a-d8fb-11dd-a692-00163e284ee0

    But yes, whenever you see a web page that looks like a calendar it’s worth trying FuseCal on it. These are cases where the info has been published in a vaguely but not precisely structured way.

    What I’m talking about in this entry, though, is the vast majority of stuff that’s purely narrative. Examples from Virginia Beach:

    “We meet every Sunday evening at 7:17 at Western Branch Community Church!”

    “The clinic is held on the first Monday of each month at 1:00 p.m. at: The CHKD Health Center 171 Kempsville Road Building A, Norfolk, Virginia”

    I would love to be proven wrong, but I do not think that we will anytime soon have technology that can do what human brains can do with this information.

    I do think there is value in reducing the impedance that currently makes it way too hard for human brains to a) discover, and b) process this kind of stuff.

    UPDATE: The curator for Huntington, WV reports that FuseCal successfully parsed three of the web pages that turned up in his search. That’s great!

  3. Mykel, thanks for mentioning fusecal; I’d not heard of that service before. So, since I’m working on calendar curation for Huntington, WV, I decided to try a bunch of the calendars I had found, and so far I’ve successfully gotten it to parse/scrape five different calendars. I’m going to try for some more.
