Thanks to some really great comments on yesterday’s item, I’ve taken another pass through the spreadsheet I got from the police department [1]. It looks like Chris Anderson and David French were exactly right to suggest a “police station effect”: namely, that there’s more crime at or near the police station.

Here’s a version of yesterday’s chart (with cleaner underlying data):

It’s focused on the old location of the police station, which, you may recall, moved from Central Square in January 2006. If you thought the presence of the station would suppress the number of incidents, you wouldn’t find evidence for that here.

Now here’s the same thing focused on the new location of the police station:

That’s pretty clear!

There were two causes suggested.

1 (Chris): “The station was the place of the crime report and there was often no specific address.”

Yup. Of the 341 incidents within 0.1 mile of the new station, 315 were at the station’s exact address.

2 (David): “This is where you end up when they let you out of the drunk tank.”

It’s possible to explore that spillover effect, but I’ll stop here and call out another excellent comment from Doug Finner:

“If you get a big pile-o-data and don’t know everything about how the data was collected, it can be pretty close to impossible to do anything other than make very general observations. Trying to draw conclusions from data that is likely ‘dirty’ is often a fools errand. Probably the best you can do, is find interesting trends and then try and get good clean data collected - the whole scientific method thing.”

Indeed. For this round I took a much more critical look at the address data. I discarded a fair number of junk addresses that had resolved erroneously to the city center. And because the addresses in the file didn’t specify “St” or “Rd”, there were systematic problems, particularly in the case of Marlboro, which was resolving to Rd rather than St.
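For the missing-suffix problem, one option is a small override table applied before the addresses go to the geocoder. This is only a sketch: the function name, the list of known suffixes, and the idea of keying on the last word of the address are assumptions for illustration. The only entry taken from the data is Marlboro.

```python
# Hypothetical override table: street name (lowercase) -> suffix to append.
# Only the Marlboro entry comes from the post; extend it as problems turn up.
SUFFIX_OVERRIDES = {"marlboro": "St"}

# Assumed set of suffixes that mean "this address is already qualified".
KNOWN_SUFFIXES = {"st", "rd", "ave", "ln", "dr", "ct"}

def add_suffix(address):
    """Append a suffix from the override table when the address lacks one.

    Only the last word is checked, so multi-word street names would need
    a smarter lookup than this sketch provides.
    """
    words = address.split()
    if not words or words[-1].lower().rstrip(".") in KNOWN_SUFFIXES:
        return address  # empty, or already carries a suffix
    suffix = SUFFIX_OVERRIDES.get(words[-1].lower())
    return f"{address} {suffix}" if suffix else address
```

Addresses that aren’t in the table pass through untouched, so the worst case is the same ambiguity as before rather than a new wrong answer.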

As Doug Finner suggests, it would be wise at this point to hand back the file augmented not only with latitude/longitude coordinates, but also with indications of how clean or dirty the geocoding was, and recommendations on how to improve it.
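A quality indicator could be as simple as one extra column per row. Here is a minimal sketch, assuming each row is a dict with `lat` and `lon` keys and that junk addresses land on a single known fallback point. The field name `geocode_quality`, the labels, and the tolerance are all made up for illustration.

```python
import math

def annotate_geocode_quality(rows, center, tolerance_deg=1e-4):
    """Return copies of rows with a 'geocode_quality' field added.

    rows: list of dicts with 'lat' and 'lon' keys (None if not geocoded).
    center: (lat, lon) that unresolvable addresses fall back to (assumed).
    """
    annotated = []
    for row in rows:
        lat, lon = row.get("lat"), row.get("lon")
        if lat is None or lon is None:
            quality = "missing"
        elif (math.isclose(lat, center[0], abs_tol=tolerance_deg)
              and math.isclose(lon, center[1], abs_tol=tolerance_deg)):
            # Sitting on the fallback point: probably a junk address.
            quality = "suspect_city_center"
        else:
            quality = "ok"
        annotated.append({**row, "geocode_quality": quality})
    return annotated
```

A legitimate incident that really did happen at the city center would also be flagged, which is why the label says “suspect” rather than “bad”.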

Meanwhile, the toolsmith in me is getting fired up with all kinds of ideas. For example, when I processed the raw file to create this categorized stack graph I wound up creating an ad-hoc system of piped filters in Python. Each one takes a list of rows and returns a transformed list of rows. Here are some of them:

  • removeIncidentnums
  • dedupeCasenums
  • adjustDates
  • trimDescs
  • removeSingletonDescs
  • addCategories
  • addMonthlyCounts
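To give a flavor of the pattern, here is a rough sketch of two such filters and a runner that chains them. The filter names match the list above, but the bodies, the `casenum` and `desc` field names, and the `runPipeline` helper are guesses for illustration, not the code used for the chart.

```python
from functools import reduce

def dedupeCasenums(rows):
    """Keep only the first row seen for each case number."""
    seen, kept = set(), []
    for row in rows:
        if row["casenum"] not in seen:
            seen.add(row["casenum"])
            kept.append(row)
    return kept

def trimDescs(rows):
    """Strip leading and trailing whitespace from each description."""
    return [{**row, "desc": row["desc"].strip()} for row in rows]

def runPipeline(rows, filters):
    """Feed rows through each filter in order (hypothetical helper)."""
    return reduce(lambda acc, f: f(acc), filters, rows)

# Usage with made-up rows:
rows = [
    {"casenum": "A1", "desc": " theft "},
    {"casenum": "A1", "desc": "theft"},
    {"casenum": "B2", "desc": "noise  "},
]
cleaned = runPipeline(rows, [dedupeCasenums, trimDescs])
```

Because every filter has the same rows-in, rows-out shape, reordering or dropping a step is just a matter of editing the list passed to `runPipeline`.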

All well and good. But this just begs for some kind of social treatment à la Pipes or Popfly, with a particular focus on the transformation of rectangular datasets.

I’m also thinking about ways to meld Python and Excel together more closely. So far, I’ve only relied on code generation — that is, using Python to write VBA macros to, for example, define named ranges. There’s also the possibility of outside-in automation, where Python drives Excel through its automation interface. But then I got to wondering: Will there be a role for IronPython (or IronRuby) here, someday, such that you could use these languages inside Excel? That’d be very cool.
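The code-generation approach mentioned above might look something like this. It is a minimal sketch: the function name, the macro name, and the sample range are hypothetical, and the output is just VBA text that you would paste or import into the workbook yourself.

```python
def namedRangeMacro(macro_name, ranges):
    """Return the text of a VBA Sub that defines each named range.

    ranges: dict mapping a range name to an address like 'Sheet1!$A$2:$A$100'.
    """
    lines = [f"Sub {macro_name}()"]
    for name, refers_to in ranges.items():
        lines.append(
            f'    ActiveWorkbook.Names.Add Name:="{name}", RefersTo:="={refers_to}"'
        )
    lines.append("End Sub")
    return "\n".join(lines)

# Usage with a made-up range:
macro = namedRangeMacro("DefineRanges", {"Dates": "Sheet1!$A$2:$A$100"})
print(macro)
```

The appeal of generating text is that nothing on the Python side needs Excel installed. The cost is the manual step of getting the macro into the workbook.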


[1] Yes, I will publish this data once I’ve had a chance to show my work to the police and get their approval.