I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:
2 Apr – Wed 10:00AM-10:45AM
Thu, 11/15/07 – Fri, 4/11/08
Every Tuesday of the month from 10:00-11:00 a.m
Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction
2nd Saturday of every other month, 10:00 am-12:00 pm
Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.
In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.
What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?
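To make the target concrete, here is a minimal sketch of the machine-format output I have in mind for that last example. It uses only the Python standard library and does no text recognition at all: the rule is hard-coded. The helper names (`nth_weekday`, `second_saturdays`) and the April 2008 starting month are assumptions for illustration, since the free-form text does not say when the series begins.

```python
import calendar
from datetime import datetime

def nth_weekday(year, month, weekday, n):
    """Return the day of month of the n-th given weekday (Mon=0 .. Sun=6)."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    day = 1 + (weekday - first_weekday) % 7 + 7 * (n - 1)
    if day > days_in_month:
        raise ValueError("no such weekday in this month")
    return day

def second_saturdays(start_year, start_month, count, interval=2):
    """Yield (start, end) pairs for the 2nd Saturday of every `interval`-th month."""
    year, month = start_year, start_month
    for _ in range(count):
        day = nth_weekday(year, month, calendar.SATURDAY, 2)
        # "10:00 am-12:00 pm" hard-coded from the example text
        yield (datetime(year, month, day, 10, 0),
               datetime(year, month, day, 12, 0))
        month += interval
        year += (month - 1) // 12      # roll over into the next year if needed
        month = (month - 1) % 12 + 1

# Assumed start: April 2008
events = list(second_saturdays(2008, 4, 3))
for start, end in events:
    print(start.isoformat(), end.isoformat())
# 2008-04-12T10:00:00 2008-04-12T12:00:00
# 2008-06-14T10:00:00 2008-06-14T12:00:00
# 2008-08-09T10:00:00 2008-08-09T12:00:00
```

The hard part, of course, is getting from the human text to the rule; generating the sequence afterward is the easy half.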
37 thoughts on “Parsing human-written date and time information”
Check out http://www.datejs.com and also in Python, dateutil, http://labix.org/python-dateutil .
That’s the sort of thing that REXX did really well, as I recall. But that’s going back a ‘fer piece’.
Have you seen Chronic, the natural language time parser for Ruby?
There’s a blog post and screencast of it here: http://www.rubyinside.com/chronic-ruby-date-time-parser-screencast-263.html
I haven’t used it in a while but when I did it could handle almost anything I threw at it.
Have you tried DateTime::Format::Natural::parse_datetime_duration(), or DateTime::Format::Flexible::parse_datetime?
Python parsedatetime does some of what you’re looking for, but not all:
Not sure if it’s what you’re after, but I’ve used a JS library called Date.js (www.datejs.com) which works really well. And they have a cool website to test it out.
Jon Udell raises the challenge of translating the human formats of a calendar entry into a machine format. Google Calendar’s Quick Add feature makes a fair effort and responds as the human intended in most cases.
From the examples given by Jon:
Tue, 4/1/08 ok
2 Apr – Wed 10:00AM-10:45AM Gets date wrong (time of day ok)
Weekdays 8:30am-4:30pm ok
Thu, 11/15/07 – Fri, 4/11/08 ok
Every Tuesday of the month from 10:00-11:00 a.m
Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction ok
2nd Saturday of every other month, 10:00 am-12:00 pm ok
The API seems to provide a neat packaging of the requirement as a service which could be used in many ways. Problems that are encountered, like the example above, might eventually be dealt with by the team at Google but seem tractable through pre-processing.
Very interesting. Certainly not a simple task.
It does seem fitting that Google would provide a good implementation — they’re in the search business, after all. I can type many things in a bunch of different ways (whether it be in Google Maps, search, etc) and it usually gets it right.
I know you know that if you can’t find a library, or if customization is needed, a parser generator such as ANTLR is the way to go. Heck! Write a Popfly component!
You might be interested in the paper “From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data”:
I’ve been looking for something similar to this. I even checked out date.js. It doesn’t support something as simple as ‘the day after tomorrow’…
Wow. Thanks for all the great suggestions! At a glance, Chronic seems the most promising, and it’s an excuse to revisit Ruby which I’ve only scratched the surface of.
Would be interesting to assemble a suite of test cases, drawn from my examples above plus the examples given in the docs for these various libraries, see the results, and then accumulate more results for other modules as they emerge.
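A rough sketch of what such a suite might look like in Python, standard library only. The cases are the five examples from the post plus one that was reported as working in Google Calendar above. `strptime_baseline` is a hypothetical stand-in that accepts exactly one rigid format; the idea is that each real library would be wrapped in a function with the same shape and added to the `parsers` dict.

```python
from datetime import datetime

CASES = [
    "2 Apr – Wed 10:00AM-10:45AM",
    "Thu, 11/15/07 – Fri, 4/11/08",
    "Every Tuesday of the month from 10:00-11:00 a.m",
    "Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction",
    "2nd Saturday of every other month, 10:00 am-12:00 pm",
    "Tue, 4/1/08",
]

def strptime_baseline(text):
    """Hypothetical baseline parser: accepts only 'Tue, 4/1/08'-style input."""
    return datetime.strptime(text, "%a, %m/%d/%y")

def run_suite(parsers, cases):
    """Run every parser on every case; record the result or the exception name."""
    results = {}
    for name, parse in parsers.items():
        row = results[name] = {}
        for case in cases:
            try:
                row[case] = parse(case)
            except Exception as exc:  # a failing parser should not stop the run
                row[case] = type(exc).__name__
    return results

results = run_suite({"strptime_baseline": strptime_baseline}, CASES)
for name, row in results.items():
    for case, outcome in row.items():
        print(f"{name:20} {case!r:75} {outcome}")
```

Recording raw outcomes rather than pass/fail keeps the table useful even before anyone has agreed on what the "right" machine format for each case should be.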
> From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data
Fascinating. At a glance, though, it appears this ad hoc data comes from machines (webserver log, crash log, transaction records) and not from people.
“The API seems to provide a neat packaging of the requirement as a service which could be used in many ways.”
Yes, Google Calendar’s recognizer does seem like a promising approach.
You might get some higher-level ideas from talking to the folks at ReQall and Tripit, who are doing nice parsing of information including ‘human friendly’ dates and times.
Note that Google Calendar will get all your examples right if you drop the redundant day of week in “2 Apr – Wed 10:00AM-10:45AM”.
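That kind of pre-processing could be sketched roughly as follows, in Python with only the standard library. The regular expressions and the function name `drop_redundant_weekday` are illustrative assumptions, and the sketch only cleans the text; it does not call Google Calendar. The weekday is removed only when the text also contains an explicit date, so that something like “Every Tuesday of the month” is left alone.

```python
import re

# Weekday names, abbreviated or full, with optional trailing "." and ","
WEEKDAY = re.compile(
    r"\b(?:mon|tue(?:s)?|wed(?:nes)?|thu(?:rs?)?|fri|sat(?:ur)?|sun)(?:day)?\b\.?,?\s*",
    re.IGNORECASE,
)

MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"
EXPLICIT_DATE = re.compile(
    r"\b\d{1,2}\s+" + MONTH               # "2 Apr"
    + r"|\b" + MONTH + r"\s+\d{1,2}\b"    # "Apr. 05"
    + r"|\b\d{1,2}/\d{1,2}\b",            # "11/15"
    re.IGNORECASE,
)

def drop_redundant_weekday(text):
    """Strip weekday names, but only if the text also carries an explicit date."""
    if not EXPLICIT_DATE.search(text):
        return text  # the weekday is the only date information; keep it
    cleaned = WEEKDAY.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(drop_redundant_weekday("2 Apr – Wed 10:00AM-10:45AM"))
# 2 Apr – 10:00AM-10:45AM
print(drop_redundant_weekday("Every Tuesday of the month from 10:00-11:00 a.m"))
# Every Tuesday of the month from 10:00-11:00 a.m
```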
GATE will do that
An angle on this topic is that the US is one of the few (only?) countries where dates are Month/Day/Year, while most other countries are Day/Month/Year, so when trying to parse the date, the locale becomes important.
Thus, referring to events by date, like 9/11, doesn’t quite translate directly when talking to folks overseas.
Also, swapping month/day is a common mistake when people come to the US. An interesting likely instance of this appeared in the news in 2002: the assumption was made that a person was trying to deceive authorities, when he may simply have made a mistake writing down his date of birth. April 7 and July 4 are both written 7/4, depending on the country (http://www.guardian.co.uk/world/2002/jul/05/usgunviolence.usa3).
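A small Python illustration of the ambiguity, using only `datetime.strptime` from the standard library and the two dates from the post’s “Thu, 11/15/07 – Fri, 4/11/08” example. The function name `interpretations` and the labels are made up for this sketch.

```python
from datetime import datetime

def interpretations(text):
    """Return {label: date} for each day/month order under which `text` is valid."""
    results = {}
    for label, fmt in (("month/day", "%m/%d/%y"), ("day/month", "%d/%m/%y")):
        try:
            results[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # not a valid date under this ordering
    return results

print(interpretations("11/15/07"))
# only month/day works (there is no month 15): 2007-11-15
print(interpretations("4/11/08"))
# both work: 2008-04-11 (month/day) and 2008-11-04 (day/month)
```

In the second case the “Fri” in the original text could serve as a tie-breaker: April 11, 2008 was a Friday, while November 4, 2008 was a Tuesday.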
You should talk to Rael Dornfest about this. I know he’s got something serious in this regard he built for iwantsandy.com. It’ll be in Ruby and it’ll be seriously tested with a large amount of real world data.
Yes, I would point you to iwantsandy.com.
There’s a module in CPAN called Date::Manip that does this. It has a method called ParseDate() that does its best to figure out what a given input means. Then, once parsed, the rest of the module lets you work with dates in more computer-friendly ways.
Looks like Parand above has already mentioned my library: parsedatetime for Python.
I would be extremely interested in any feedback for items it cannot handle.
One item it already handles is adjusting to different locales’ day/month/year order.
It’s not doing ranges and repeats just yet, but I’ve started a project in C# that handles dates such as “tomorrow”, “next friday”, or even “the 3rd tuesday in next june”.
All of this sounds very handy for search engine indexing and entity extraction. I’ll check out some of those packages.
The GNU coreutils have this functionality as seen in commands like “touch” and “at”.
Use biterscripting for things like this. Just create a generic date/time processor function or script and reuse it over and over. Download free at http://www.biterscripting.com .
There are some sample scripts (one I use is at http://www.biterscripting.com/Download/SS_SearchWeb.txt – I don’t remember all of them but with the biterscripting download, all their sample scripts are downloaded also.)
Email me if you need more help.
(It may be some time before I can respond.)
We developed exactly what you are looking for on an internal project. We are thinking of making this public if there is sufficient need for it. Take a look at this blog for more details: http://precisionsoftwaredesign.com/blog.php.
Feel free to contact me if you are interested: email@example.com
Jon – curious to hear what your final solution was.
There was no final solution :-) I punted on trying to parse free-form text and focused exclusively on iCalendar clients (some of which, notably Google, do incorporate such parsing).