I’m working on a project that aggregates a bunch of community calendars, plus a lot of calendar info that’s just written out free-form. Some examples of the latter, in ascending order of resistance to mechanical parsing:
Tue, 4/1/08
2 Apr – Wed 10:00AM-10:45AM
Weekdays 8:30am-4:30pm
Thu, 11/15/07 – Fri, 4/11/08
Every Tuesday of the month from 10:00-11:00 a.m
Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction
2nd Saturday of every other month, 10:00 am-12:00 pm
Programming languages tend to offer lots of functions and modules for converting among machine formats, and for converting machine formats into human formats, but when it comes to recognizing human formats, not so much.
In looking around for a recognizer, I came across the script that Jamie Zawinski uses to manage the calendar for his DNA Lounge. It looks like it can handle many of these formats, but it’s a 6500-line Perl behemoth that does a bunch of different things.
What else is available, for any language, preferably more focused and packaged, that can turn an item in human format, like “2nd Saturday of every other month, 10:00 am-12:00 pm,” into a sequence of items in machine format?
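To make the target concrete, here is a minimal sketch of the "machine format" side only: it assumes the phrase has already been recognized as "2nd Saturday, every 2 months, 10:00 to 12:00" and expands that into start/end pairs. The function names (`nth_weekday`, `expand`), the April 2008 starting month, and the count of three occurrences are all illustrative assumptions, not part of any library mentioned here.

```python
import calendar
from datetime import date, datetime, time

def nth_weekday(year, month, weekday, n):
    """Return the date of the n-th given weekday (0 = Monday) in a month."""
    first_weekday, days_in_month = calendar.monthrange(year, month)
    # Days from the 1st of the month to the first requested weekday.
    offset = (weekday - first_weekday) % 7
    day = 1 + offset + 7 * (n - 1)
    if day > days_in_month:
        raise ValueError("month has no such weekday occurrence")
    return date(year, month, day)

def expand(start_year, start_month, count, weekday, n, start_t, end_t, step=2):
    """Yield (start, end) datetimes for `count` occurrences, every `step` months."""
    year, month = start_year, start_month
    for _ in range(count):
        d = nth_weekday(year, month, weekday, n)
        yield datetime.combine(d, start_t), datetime.combine(d, end_t)
        # Advance by `step` months, rolling over into the next year if needed.
        month += step
        year += (month - 1) // 12
        month = (month - 1) % 12 + 1

# Hypothetical starting point: April 2008, three occurrences.
events = list(expand(2008, 4, 3, calendar.SATURDAY, 2, time(10, 0), time(12, 0)))
for start, end in events:
    print(start.isoformat(), end.isoformat())
```

The hard part, of course, is the recognizer that gets from the English phrase to those arguments; this only shows what it would need to produce.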
April 2, 2008 at 3:16 pm
Check out http://www.datejs.com and also in Python, dateutil, http://labix.org/python-dateutil .
The Date.js stuff looks like more of what you’re looking for; however, it is written in JavaScript.
April 2, 2008 at 3:21 pm
That’s the sort of thing that REXX did really well, as I recall. But that’s going back a ‘fer piece’.
April 2, 2008 at 3:49 pm
Have you seen Chronic, the natural language time parser for Ruby?
There’s a blog post and screencast of it here: http://www.rubyinside.com/chronic-ruby-date-time-parser-screencast-263.html
I haven’t used it in a while but when I did it could handle almost anything I threw at it.
April 2, 2008 at 3:50 pm
Have you tried DateTime::Format::Natural::parse_datetime_duration(), or DateTime::Format::Flexible::parse_datetime?
April 2, 2008 at 3:51 pm
Python parsedatetime does some of what you’re looking for, but not all:
http://code.google.com/p/parsedatetime/
April 2, 2008 at 3:51 pm
Not sure if it’s what you’re after, but I’ve used a JS library called Date.js (www.datejs.com) which works really well. And they have a cool website to test it out.
April 2, 2008 at 4:04 pm
Jon Udell raises the challenge of translating the human formats of a calendar entry into a machine format. Google Calendar’s Quick Add feature makes a fair effort and responds as the human intended in most cases.
From the examples given by Jon:
Tue, 4/1/08 ok
2 Apr – Wed 10:00AM-10:45AM Gets date wrong (time of day ok)
Weekdays 8:30am-4:30pm ok
Thu, 11/15/07 – Fri, 4/11/08 ok
Every Tuesday of the month from 10:00-11:00 a.m ok
Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction ok
2nd Saturday of every other month, 10:00 am-12:00 pm ok
The API seems to provide a neat packaging of the requirement as a service which could be used in many ways. Problems that are encountered, like the example above, might eventually be dealt with by the team at Google but seem tractable through pre-processing.
April 2, 2008 at 4:11 pm
Very interesting. Certainly not a simple task.
It does seem fitting that Google would provide a good implementation — they’re in the search business, after all. I can type many things in a bunch of different ways (whether it be in Google Maps, search, etc) and it usually gets it right.
Good luck.
April 2, 2008 at 4:41 pm
I know you know that if you can’t find a library, or if customization is needed, a parser generator such as ANTLR is the way to go. Heck! Write a Popfly component!
April 2, 2008 at 7:00 pm
You might be interested in the paper “From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data”:
http://www.cs.princeton.edu/~dpw/papers/padslearning-0707.pdf
April 2, 2008 at 8:27 pm
I’ve been looking for something similar to this. I even checked out date.js. It doesn’t support something as simple as ‘the day after tomorrow’…
April 3, 2008 at 12:44 am
Wow. Thanks for all the great suggestions! At a glance, Chronic seems the most promising, and it’s an excuse to revisit Ruby which I’ve only scratched the surface of.
Would be interesting to assemble a suite of test cases, drawn from my examples above plus the examples given in the docs for these various libraries, see the results, and then accumulate more results for other modules as they emerge.
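A rough sketch of what such a harness might look like, using only the Python standard library. The `strict_recognizer` below is a deliberately weak, hypothetical stand-in (it only understands the "Tue, 4/1/08" pattern and assumes English weekday names); the idea is that wrappers around the real libraries would be plugged into the same dictionary.

```python
from datetime import datetime

# The examples from the post, used as shared test cases.
CASES = [
    "Tue, 4/1/08",
    "2 Apr – Wed 10:00AM-10:45AM",
    "Weekdays 8:30am-4:30pm",
    "Thu, 11/15/07 – Fri, 4/11/08",
    "Every Tuesday of the month from 10:00-11:00 a.m",
    "Sat., Apr. 05, 9:00 AM Registration/Preview, 10:00 AM Live Auction",
    "2nd Saturday of every other month, 10:00 am-12:00 pm",
]

def strict_recognizer(text):
    """Hypothetical stand-in: accepts only 'Tue, 4/1/08'-style dates."""
    return datetime.strptime(text, "%a, %m/%d/%y")

def run_suite(recognizers, cases):
    """Return {recognizer name: {case: parsed value, or None on failure}}."""
    results = {}
    for name, recognize in recognizers.items():
        row = {}
        for case in cases:
            try:
                row[case] = recognize(case)
            except Exception:
                # Any failure counts as "not recognized" for this case.
                row[case] = None
        results[name] = row
    return results

report = run_suite({"strict": strict_recognizer}, CASES)
for name, row in report.items():
    recognized = sum(value is not None for value in row.values())
    print(f"{name}: {recognized}/{len(row)} recognized")
```

This only records whether each recognizer returned something; checking that the result is what the human meant would need hand-written expected values per case.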
April 3, 2008 at 12:50 am
> From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data
Fascinating. At a glance, though, it appears this ad hoc data comes from machines (webserver log, crash log, transaction records) and not from people.
April 3, 2008 at 12:53 am
“The API seems to provide a neat packaging of the requirement as a service which could be used in many ways.”
Yes, Google Calendar’s recognizer does seem like a promising approach.
April 3, 2008 at 7:23 am
You might get some higher-level ideas from talking to the folks at ReQall and Tripit, who are doing nice parsing of information including ‘human friendly’ dates and times.
April 3, 2008 at 12:42 pm
Note that Google Calendar will get all your examples right if you drop the redundant day of week in “2 Apr – Wed 10:00AM-10:45AM”.
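That kind of pre-processing could be sketched roughly as follows. This is an illustration under stated assumptions, not anything Google Calendar provides: the function name `drop_redundant_weekday` and both regular expressions are made up here, they only cover English names, and the weekday is removed only when an explicit date is also present, so that phrases like “Every Tuesday” are left alone.

```python
import re

MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"

# An explicit calendar date: "2 Apr", "Apr. 05", or "4/1/08".
EXPLICIT_DATE = re.compile(
    rf"\b\d{{1,2}}\s+{MONTH}|\b{MONTH}\s+\d{{1,2}}\b|\b\d{{1,2}}/\d{{1,2}}/\d{{2,4}}\b",
    re.IGNORECASE,
)

# Weekday names and common abbreviations, plus trailing punctuation and spaces.
WEEKDAY = re.compile(
    r"\b(?:mon|tues?|wed(?:nes)?|thu(?:rs?)?|fri|sat(?:ur)?|sun)(?:day)?\b[.,]*\s*",
    re.IGNORECASE,
)

def drop_redundant_weekday(text):
    """Remove weekday names, but only when an explicit date is also present."""
    if not EXPLICIT_DATE.search(text):
        return text  # the weekday carries the meaning, e.g. "Every Tuesday"
    return WEEKDAY.sub("", text).strip()

print(drop_redundant_weekday("2 Apr – Wed 10:00AM-10:45AM"))
print(drop_redundant_weekday("Every Tuesday of the month from 10:00-11:00 a.m"))
```

Note that this throws away the weekday without checking it against the date, so a genuine mismatch between the two would go unnoticed.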
April 3, 2008 at 3:28 pm
GATE will do that
http://gate.ac.uk/
April 3, 2008 at 5:18 pm
An angle on this topic is that the US is one of the few (only?) countries where dates are Month/Day/Year, while most other countries are Day/Month/Year, so when trying to parse the date, the locale becomes important.
Thus, referring to events by date, like 9/11, doesn’t quite translate directly when talking to folks overseas.
Also, swapping month/day is a common mistake when people come to the US. An interesting likely instance of this appeared in the news in 2002, where the assumption was made that a person was trying to deceive authorities, when it may have been just a mistake when writing down his date of birth, April 7 and July 4, which are both 7/4, depending on the country (http://www.guardian.co.uk/world/2002/jul/05/usgunviolence.usa3).
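To illustrate the ambiguity, here is a small sketch using Python’s standard `datetime.strptime`. The helper name `readings` and the sample input strings are just for illustration; it tries both field orders and keeps whichever ones produce a valid date.

```python
from datetime import datetime

def readings(text):
    """Return every valid reading of a numeric date, keyed by field order."""
    formats = {"month/day/year": "%m/%d/%y", "day/month/year": "%d/%m/%y"}
    result = {}
    for label, fmt in formats.items():
        try:
            result[label] = datetime.strptime(text, fmt).date()
        except ValueError:
            pass  # this field order yields an impossible date
    return result

ambiguous = readings("7/4/08")      # valid under both orders, two different dates
unambiguous = readings("11/15/07")  # there is no month 15, so only one reading

for label, value in ambiguous.items():
    print(label, value.isoformat())
```

When both readings are valid, nothing in the string itself says which was intended; the locale (or some other context) has to decide.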
April 3, 2008 at 7:41 pm
You should talk to Rael Dornfest about this. I know he’s got something serious in this regard he built for iwantsandy.com. It’ll be in Ruby and it’ll be seriously tested with a large amount of real world data.
April 3, 2008 at 10:44 pm
[...] Udell goes LazyWeb with “Parsing human-written date and time information,” and the commenters come through, especially with DateJS.com. Not the only solution, though. [...]
April 3, 2008 at 11:13 pm
Yes, I would point you to iwantsandy.com.
April 4, 2008 at 4:50 am
[...] Parsing human-written date and time information « Jon Udell Good stuff in the comments about parsing human generated (not computer generated) dates, need to check out Chronic. (categories: dates datetime parsing antlr chronic sandy ) [...]
April 6, 2008 at 4:08 am
There’s a module in CPAN called Date::Manip that does this. It has a method called ParseDate() that does its best to figure out what a given input means. Then, once parsed, the rest of the module lets you work with dates in more computer-friendly ways.
April 9, 2008 at 1:09 am
Looks like Parand above has already mentioned my library: parsedatetime for Python.
I would be extremely interested in any feedback for items it cannot handle.
One item it already handles is adjusting to different locales’ day/month/year order.
April 10, 2008 at 5:33 am
[...] of data Posted by Jon Udell under Uncategorized To follow up on last week’s item about parsing the kinds of dates and times that people actually write, Google Calendar’s [...]
April 12, 2008 at 1:13 am
It’s not doing ranges and repeats just yet, but I’ve started a project in C# that handles dates such as “tomorrow”, “next friday”, or even “the 3rd tuesday in next june”.
http://www.codeplex.com/DateTimeEnglishParse
May 13, 2008 at 2:22 pm
All of this sounds very handy for search engine indexing and entity extraction. I’ll check out some of those packages.
August 21, 2008 at 5:21 pm
GNU coreutils has this functionality, as seen in commands like “date” and “touch”; the separate “at” command accepts similar input.
January 4, 2009 at 1:24 pm
Use biterscripting for things like this. Just create a generic date/time processor function or script and reuse it over and over. Download free at http://www.biterscripting.com .
There are some sample scripts (one I use is at http://www.biterscripting.com/Download/SS_SearchWeb.txt – I don’t remember all of them but with the biterscripting download, all their sample scripts are downloaded also.)
Email me if you need more help.
(It may be some time before I can respond.)
Sen
March 10, 2009 at 11:37 am
We developed exactly what you are looking for on an internal project. We are thinking of making this public if there is sufficient need for it. Take a look at this blog for more details: http://precisionsoftwaredesign.com/blog.php.
Feel free to contact me if you are interested: contact@precisionsoftware.us
March 13, 2009 at 12:04 pm
[...] time, to try to evolve this into a robot that makes sense of the calendar information that people actually write, as opposed to the information that calendar programs constrain them to produce. But meanwhile this [...]
June 10, 2009 at 10:29 am
[...] guess that’s why another recent item on parsing human-written date and time information struck a chord with readers. Until we create (and widely deploy) naturalistic interfaces, people [...]