Parsing human-written date and time information

I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:

Tue, 4/1/08

2 Apr – Wed 10:00AM-10:45AM

Weekdays 8:30am-4:30pm

Thu, 11/15/07 – Fri, 4/11/08

Every Tuesday of the month from 10:00-11:00 a.m

Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction

2nd Saturday of every other month, 10:00 am-12:00 pm

Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.
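
A minimal Python sketch of that asymmetry. Python and the `try_parse` helper are assumptions chosen for illustration, not something the project depends on. The standard library’s `datetime.strptime` reads the first example when it is handed the exact format, but the same format recognizes none of the others:

```python
from datetime import datetime

# Assumed format string, written by hand to match "Tue, 4/1/08".
FORMAT = "%a, %m/%d/%y"

def try_parse(text, fmt=FORMAT):
    """Return a datetime if text matches fmt exactly, otherwise None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

examples = [
    "Tue, 4/1/08",
    "Weekdays 8:30am-4:30pm",
    "Thu, 11/15/07 – Fri, 4/11/08",
    "2nd Saturday of every other month, 10:00 am-12:00 pm",
]

for text in examples:
    # Only the first example matches; the rest come back as None.
    print(repr(text), "->", try_parse(text))
```

Each additional human format would need its own hand-written pattern, which is the gap a recognizer is supposed to fill.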

In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.

What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?
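
To make “a sequence of items in machine format” concrete, here is a rough Python sketch of the target output for that last example. It skips the hard part, recognizing the phrase, and only shows the expansion once the rule is known. The helper names `nth_weekday` and `expand`, the April 2008 starting month, and the count of three occurrences are all assumptions for illustration; the phrase itself does not say which months “every other month” refers to.

```python
import calendar
from datetime import date, datetime, time

def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (0 = Monday) in a month."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    offset = (weekday - first_weekday) % 7   # days until the first such weekday
    day = 1 + offset + 7 * (n - 1)
    if day > days_in_month:
        raise ValueError("that weekday does not occur n times in this month")
    return date(year, month, day)

def expand(start_year, start_month, count,
           weekday=calendar.SATURDAY, n=2, month_step=2,
           start=time(10, 0), end=time(12, 0)):
    """Expand the rule into a list of (start, end) datetime pairs."""
    occurrences = []
    year, month = start_year, start_month
    for _ in range(count):
        d = nth_weekday(year, month, weekday, n)
        occurrences.append((datetime.combine(d, start),
                            datetime.combine(d, end)))
        # Advance by month_step months, rolling over into the next year.
        month += month_step
        year += (month - 1) // 12
        month = (month - 1) % 12 + 1
    return occurrences

for begin, finish in expand(2008, 4, 3):
    print(begin.isoformat(), "to", finish.isoformat())
```

A recognizer would have to produce the arguments to something like `expand` from the free-form text; that step is what the question is asking about.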


37 thoughts on “Parsing human-written date and time information”

  1. That’s the sort of thing that REXX did really well, as I recall. But that’s going back a ‘fer piece’.

  2. Not sure if it’s what you’re after, but I’ve used a JS library called Date.js (www.datejs.com) which works really well. And they have a cool website to test it out.

  3. Jon Udell raises the challenge of translating the human formats of a calendar entry into a machine format. Google Calendar’s Quick Add feature makes a fair effort and responds as the human intended in most cases.
    From the examples given by Jon:
    Tue, 4/1/08 ok
    2 Apr – Wed 10:00AM-10:45AM Gets date wrong (time of day ok)
    Weekdays 8:30am-4:30pm ok
    Thu, 11/15/07 – Fri, 4/11/08 ok
    Every Tuesday of the month from 10:00-11:00 a.m ok
    Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction ok
    2nd Saturday of every other month, 10:00 am-12:00 pm ok
    The API seems to provide a neat packaging of the requirement as a service which could be used in many ways. Problems that are encountered, like the example above, might eventually be dealt with by the team at Google but seem tractable through pre-processing.

  4. Very interesting. Certainly not a simple task.

    It does seem fitting that Google would provide a good implementation — they’re in the search business, after all. I can type many things in a bunch of different ways (whether it be in Google Maps, search, etc) and it usually gets it right.

    Good luck.

  5. I know you know this, but if you can’t find a library, or if customization is needed, a parser generator such as ANTLR is the way to go. Heck! Write a Popfly component!

  6. I’ve been looking for something similar to this. I even checked out date.js. It doesn’t support something as simple as ‘the day after tomorrow’…

  7. Wow. Thanks for all the great suggestions! At a glance, Chronic seems the most promising, and it’s an excuse to revisit Ruby which I’ve only scratched the surface of.

    Would be interesting to assemble a suite of test cases, drawn from my examples above plus the examples given in the docs for these various libraries, see the results, and then accumulate more results for other modules as they emerge.

  8. > From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data

    Fascinating. At a glance, though, it appears this ad hoc data comes from machines (webserver log, crash log, transaction records) and not from people.

  9. “The API seems to provide a neat packaging of the requirement as a service which could be used in many ways.”

    Yes, Google Calendar’s recognizer does seem like a promising approach.

  10. You might get some higher-level ideas from talking to the folks at ReQall and Tripit, who are doing nice parsing of information, including ‘human friendly’ dates and times.

  11. Note that Google Calendar will get all your examples right if you drop the redundant day of week in “2 Apr – Wed 10:00AM-10:45AM”.

  12. An angle on this topic is that the US is one of the few (only?) countries where dates are Month/Day/Year, while most other countries are Day/Month/Year, so when trying to parse the date, the locale becomes important.

    Thus, referring to events by date, like 9/11, doesn’t quite translate directly when talking to folks overseas.

    Also, swapping month/day is a common mistake when people come to the US. A likely instance of this appeared in the news in 2002, where the assumption was made that a person was trying to deceive authorities, when it may have been just a mistake in writing down his date of birth: April 7 and July 4 are both written 7/4, depending on the country (http://www.guardian.co.uk/world/2002/jul/05/usgunviolence.usa3).

  13. You should talk to Rael Dornfest about this. I know he’s got something serious in this regard he built for iwantsandy.com. It’ll be in Ruby and it’ll be seriously tested with a large amount of real world data.

  14. Pingback: Links: 4-3-2008
  15. There’s a module in CPAN called Date::Manip that does this. It has a method called ParseDate() that does its best to figure out what a given input means. Then, once parsed, the rest of the module lets you work with dates in more computer-friendly ways.

  16. Looks like Parand above has already mentioned my library: parsedatetime for Python.

    I would be extremely interested in any feedback for items it cannot handle.

    One item it already handles is adjusting to different locales’ day/month/year order.

  17. The GNU coreutils have this functionality as seen in commands like “touch” and “at”.

  18. Use biterscripting for things like this. Just create a generic date/time processor function or script and reuse it over and over. Download free at http://www.biterscripting.com .

    There are some sample scripts (one I use is at http://www.biterscripting.com/Download/SS_SearchWeb.txt – I don’t remember all of them but with the biterscripting download, all their sample scripts are downloaded also.)

    Email me if you need more help.
    (It may be some time before I can respond.)

    Sen

  19. Pingback: 纽约网站设计
  20. There was no final solution :-) I punted on trying to parse free-form text and focused exclusively on iCalendar clients (some of which, notably Google, do incorporate such parsing.)
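
The locale point raised in comment 12 is easy to see in code. The following is only a sketch in Python, with a hypothetical helper `candidate_dates` that is not taken from any of the libraries mentioned in this thread. It lists every valid reading of a numeric date like 7/4 rather than silently picking one.

```python
from datetime import date

def candidate_dates(text, year):
    """Return every valid reading of a numeric 'a/b' date for the given year."""
    a, b = (int(part) for part in text.split("/"))
    readings = {}
    for label, month, day in (("month/day", a, b), ("day/month", b, a)):
        try:
            readings[label] = date(year, month, day)
        except ValueError:
            # This reading is impossible (for example, there is no month 15).
            pass
    return readings

# Ambiguous: July 4 in the US reading, 7 April in the day-first reading.
print(candidate_dates("7/4", 2002))

# Unambiguous: 15 cannot be a month, so only the US reading survives.
print(candidate_dates("11/15", 2007))
```

When both readings survive, something outside the string (the source’s locale, or a day-of-week hint such as the “Tue” in “Tue, 4/1/08”) has to break the tie.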
