This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.
From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes[1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.
But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:
If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.
Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.
Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.
I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.
So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And its one that I hope more computer science students will learn to aspire to.