Parsing human-written date and time information

I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:

Tue, 4/1/08

2 Apr – Wed 10:00AM-10:45AM

Weekdays 8:30am-4:30pm

Thu, 11/15/07 – Fri, 4/11/08

Every Tuesday of the month from 10:00-11:00 a.m

Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction

2nd Saturday of every other month, 10:00 am-12:00 pm

Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.

In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.

What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?
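
To make the target concrete, here is a minimal sketch of the "sequence of items in machine format" for that last example, using only the Python standard library. It covers the expansion step only, not the recognition step, which is the hard part. The function names, the April 2008 starting month, and the count of three occurrences are assumptions for illustration; the phrase itself gives no anchor date.

```python
import calendar
from datetime import date, datetime, time

def nth_weekday(year, month, weekday, n):
    """Return the date of the n-th given weekday (Mon=0 .. Sun=6) in a month."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    # Days from the 1st of the month to the first occurrence of the weekday.
    offset = (weekday - first_weekday) % 7
    day = 1 + offset + 7 * (n - 1)
    if day > days_in_month:
        raise ValueError("month has no such weekday occurrence")
    return date(year, month, day)

def expand(start_year, start_month, weekday, n, interval, start, end, count):
    """Yield `count` (start, end) datetime pairs, one every `interval` months."""
    year, month = start_year, start_month
    for _ in range(count):
        d = nth_weekday(year, month, weekday, n)
        yield datetime.combine(d, start), datetime.combine(d, end)
        # Advance by `interval` months, rolling over the year as needed.
        month0 = (month - 1) + interval
        year, month = year + month0 // 12, month0 % 12 + 1

# "2nd Saturday of every other month, 10:00 am-12:00 pm",
# anchored (by assumption) at April 2008.
events = list(expand(2008, 4, calendar.SATURDAY, 2, 2, time(10, 0), time(12, 0), 3))
for s, e in events:
    print(s.isoformat(), e.isoformat())
```

A recognizer would have to produce the arguments to `expand` (weekday, ordinal, interval, time range) from the free-form text.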

37 thoughts on “Parsing human-written date and time information”

  1. stuart

    Not sure if it’s what you’re after, but I’ve used a JS library called Date.js (www.datejs.com) which works really well. And they have a cool website to test it out.

  2. David French

    Jon Udell raises the challenge of translating the human formats of a calendar entry into a machine format. Google Calendar’s Quick Add feature makes a fair effort and responds as the human intended in most cases.
    Results for the examples Jon gave:
    Tue, 4/1/08: ok
    2 Apr – Wed 10:00AM-10:45AM: gets date wrong (time of day ok)
    Weekdays 8:30am-4:30pm: ok
    Thu, 11/15/07 – Fri, 4/11/08: ok
    Every Tuesday of the month from 10:00-11:00 a.m: ok
    Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction: ok
    2nd Saturday of every other month, 10:00 am-12:00 pm: ok
    The API seems to provide a neat packaging of the requirement as a service which could be used in many ways. Problems that are encountered, like the example above, might eventually be dealt with by the team at Google but seem tractable through pre-processing.

  3. ohxten

    Very interesting. Certainly not a simple task.

    It does seem fitting that Google would provide a good implementation — they’re in the search business, after all. I can type many things in a bunch of different ways (whether it be in Google Maps, search, etc) and it usually gets it right.

    Good luck.

  4. Larry O'Brien

    I know you know that if you can’t find a library, or if customization is needed, a parser-generator such as ANTLR is the way to go. Heck! Write a Popfly component!

  5. Glen

    I’ve been looking for something similar to this. I even checked out date.js. It doesn’t support something as simple as ‘the day after tomorrow’…

  6. Jon Udell Post author

    Wow. Thanks for all the great suggestions! At a glance, Chronic seems the most promising, and it’s an excuse to revisit Ruby, which I’ve only scratched the surface of.

    Would be interesting to assemble a suite of test cases, drawn from my examples above plus the examples given in the docs for these various libraries, see the results, and then accumulate more results for other modules as they emerge.
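
Such a suite could start out very small. The following is a rough sketch in Python, not a finished harness: the inputs come from the post, but `strptime_recognizer` is a made-up toy baseline, `run_suite` is a hypothetical name, and only the first case has an expected value because the others need a decision on how ranges and recurrences should be represented.

```python
from datetime import datetime

# Inputs taken from the post. Only the first has an expected value filled in;
# the rest stay None until a representation for ranges/recurrences is chosen.
CASES = {
    "Tue, 4/1/08": datetime(2008, 4, 1),
    "2 Apr – Wed 10:00AM-10:45AM": None,
    "Weekdays 8:30am-4:30pm": None,
    "2nd Saturday of every other month, 10:00 am-12:00 pm": None,
}

def strptime_recognizer(text):
    """Toy baseline (hypothetical): accepts only the 'Tue, 4/1/08' shape."""
    try:
        return datetime.strptime(text, "%a, %m/%d/%y")
    except ValueError:
        return None

def run_suite(recognizers, cases):
    """Return {recognizer name: {input: 'pass' | 'fail' | 'unscored'}}."""
    report = {}
    for name, recognize in recognizers.items():
        results = {}
        for text, expected in cases.items():
            if expected is None:
                results[text] = "unscored"
            else:
                results[text] = "pass" if recognize(text) == expected else "fail"
        report[name] = results
    return report

report = run_suite({"strptime baseline": strptime_recognizer}, CASES)
for text, outcome in report["strptime baseline"].items():
    print(f"{outcome:9} {text}")
```

Each library under evaluation would be wrapped in a function with the same shape as `strptime_recognizer` and added to the dictionary passed to `run_suite`.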

  7. Jon Udell Post author

    > From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data

    Fascinating. At a glance, though, it appears this ad hoc data comes from machines (webserver log, crash log, transaction records) and not from people.

  8. Jon Udell Post author

    “The API seems to provide a neat packaging of the requirement as a service which could be used in many ways.”

    Yes, Google Calendar’s recognizer does seem like a promising approach.

  9. Mary Branscombe

    You might get some higher level ideas from talking to the folks at ReQall and Tripit, who are doing nice parsing of information, including ‘human friendly’ dates and times.

  10. Pablo Fernicola

    An angle on this topic is that the US is one of the few (only?) countries where dates are Month/Day/Year, while most other countries are Day/Month/Year, so when trying to parse the date, the locale becomes important.

    Thus, referring to events by date, like 9/11, doesn’t quite translate directly when talking to folks overseas.

    Also, swapping month and day is a common mistake when people come to the US. A likely instance appeared in the news in 2002: it was assumed that a person was trying to deceive authorities, when he may simply have made a mistake writing down his date of birth. April 7 and July 4 are both written 7/4, depending on the country (http://www.guardian.co.uk/world/2002/jul/05/usgunviolence.usa3).
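
The 7/4 ambiguity is easy to reproduce. Here is a small Python sketch, assuming a hypothetical helper `candidate_dates` that tries both orderings with the standard `datetime.strptime`; the year is passed in only so that a full date can be built, and is not taken from the news story.

```python
from datetime import datetime

def candidate_dates(text, year):
    """Return every calendar date that 'a/b' could mean in the given year."""
    readings = {}
    for label, fmt in (("month/day", "%m/%d/%Y"), ("day/month", "%d/%m/%Y")):
        try:
            readings[label] = datetime.strptime(f"{text}/{year}", fmt).date()
        except ValueError:
            pass  # e.g. '15' cannot be a month, so that reading is dropped
    return readings

# Both readings are valid, so only the locale can settle it.
print(candidate_dates("7/4", 2002))
# Only one reading survives here, so no locale hint is needed.
print(candidate_dates("11/15", 2007))
```

When two readings come back, a parser has to fall back on the locale or on other evidence in the text, such as a weekday name.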

  11. Greg

    You should talk to Rael Dornfest about this. I know he’s got something serious in this regard he built for iwantsandy.com. It’ll be in Ruby and it’ll be seriously tested with a large amount of real world data.

  12. Pingback: Thursday night notes and links | clock — watching time, the only true currency

  13. Pingback: Links: 4-3-2008

  14. Warren Young

    There’s a module in CPAN called Date::Manip that does this. It has a method called ParseDate() that does its best to figure out what a given input means. Then, once parsed, the rest of the module lets you work with dates in more computer-friendly ways.

  15. bear

    Looks like Parand above has already mentioned my library: parsedatetime for Python.

    I would be extremely interested in any feedback for items it cannot handle.

    One item it already handles is adjusting to different locales’ day/month/year order.

  16. Pingback: Syndication of rules versus syndication of data « Jon Udell

  17. Sen Hu

    Use biterscripting for things like this. Just create a generic date/time processor function or script and reuse it over and over. Download free at http://www.biterscripting.com .

    There are some sample scripts (one I use is at http://www.biterscripting.com/Download/SS_SearchWeb.txt – I don’t remember all of them but with the biterscripting download, all their sample scripts are downloaded also.)

    Email me if you need more help.
    (It may be some time before I can respond.)

    Sen

  18. Pingback: Searching for calender information « Jon Udell

  19. Pingback: Searching for calendar information « Jon Udell

  20. Pingback: Calendar software is natural for reading, but not for writing « Jon Udell

  21. Pingback: 纽约网站设计

  22. Jon Udell Post author

    There was no final solution :-) I punted on trying to parse free-form text and focused exclusively on iCalendar clients (some of which, notably Google’s, do incorporate such parsing).
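
For what it's worth, the recurring example from the post has a compact iCalendar form once a person or a client has recognized it. The following is a minimal Python sketch, assuming RFC 5545 recurrence syntax. The helper name `vevent`, the summary text, and the April 2008 start date are made up for illustration, and a complete VEVENT would also need UID and DTSTAMP properties inside a VCALENDAR wrapper.

```python
from datetime import datetime

def vevent(summary, start, end, rrule):
    """Build a VEVENT fragment with floating (zone-less) local times."""
    fmt = "%Y%m%dT%H%M%S"
    lines = [
        "BEGIN:VEVENT",
        f"SUMMARY:{summary}",
        f"DTSTART:{start.strftime(fmt)}",
        f"DTEND:{end.strftime(fmt)}",
        f"RRULE:{rrule}",
        "END:VEVENT",
    ]
    # iCalendar content lines are terminated with CRLF.
    return "\r\n".join(lines) + "\r\n"

# "2nd Saturday of every other month, 10:00 am-12:00 pm":
# monthly frequency, every 2nd month, on the 2nd Saturday (2SA).
text = vevent(
    "Example event",                 # placeholder summary
    datetime(2008, 4, 12, 10, 0),    # assumed first occurrence
    datetime(2008, 4, 12, 12, 0),
    "FREQ=MONTHLY;INTERVAL=2;BYDAY=2SA",
)
print(text)
```

The rule carries the whole recurrence, so the client that reads the feed does the expansion and no free-form parsing is needed downstream.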
