How to create a Pivot visualization of a WordPress blog

I’ve posted the Python script I used to make the Pivot visualization of this blog. I need to set it aside for now and do other things, but here’s a snapshot of the process for my future self and for anyone else who’s interested.

Using deepzoom.py to create Deep Zoom images and collections

I’m using this Python component to create Deep Zoom images and collections. I made the following changes to it:

1. tile_size=256 (not 254) at line 59, line 160, and line 224

2. source_path.name instead of source_path at line 291

3. destination + '.xml' instead of destination at line 341

Let’s assume that Python is installed, along with the Python Imaging Library, and that your current directory contains the files 001.jpg, 002.jpg, and 003.jpg:

001.jpg
002.jpg
003.jpg

For each image file, you could run deepzoom.py thrice from the command line, like so:

python deepzoom.py -d 001.xml 001.jpg
python deepzoom.py -d 002.xml 002.jpg
python deepzoom.py -d 003.xml 003.jpg

My script doesn’t actually do it that way, it enumerates JPEGs and instantiates deepzoom.py’s ImageCreator object once for each. But either way, for each JPEG you end up with a DZI (Deep Zoom Image) package that consists of (for 001.jpg):

  • A settings file: 001.xml
  • A subdirectory: 001_files
  • More subdirectories (named 0, 1, etc.) inside 001_files
  • JPG files inside those subdirectories

Now, in this case, the current directory looks like this (using -> to mark additions):

001.jpg
-> 001.xml
-> 001_files
002.jpg
-> 002.xml
-> 002_files
003.jpg
-> 003.xml
-> 003_files

To build a collection, do something like this in Python:

from deepzoom import *
images = ['001.xml','002.xml', '003.xml']
creator = CollectionCreator()
creator.create(images, 'dzc_output')

Now the current directory looks like:

001.jpg
001.xml
001_files
002.jpg
002.xml
002_files
003.jpg
003.xml
003_files
-> dzc_output.xml
-> dzc_output_files

The Pivot collection’s CXML file will refer to dzc_output.xml, like so:

<Items ImgBase="dzc_output.xml">

Using IECapt to grab screenshots

This tool uses Internet Explorer, so only works on Windows. There is also CutyCapt for WebKit, which I haven’t tried but would be curious to hear about.

Here’s an example of the IECapt command line I’m using:

iecapt –url=https://blog.jonudell.net/… –delay=1000 –out=tmp.jpg

The result in most cases is a tall skinny JPEG, because it renders the whole page — which can be very long — before imaging it. When I ran it over a 600-item collection, it hung a couple of times because of JavaScript errors. So I went to Internet Options->Browsing in IE, checked Disable script debugging, and unchecked Display a notification about every script error.

Using ImageMagic to crop screenshots

Here’s a picture of an image produced by IECapt, overlaid with a rectangle marking where I want to crop:

The rectangle’s origin is at x=30 and y=180. Its width is 530 pixels, and height 500. Here’s the ImageMagick command to crop a captured image in tmp.jpg into a cropped image in 001.jpg:

convert -quality 100 -crop 530×500+30+180 -border 1×1 -bordercolor Black tmp.jpg 001.jpg

I’m writing this down here mainly for myself. ImageMagic can do everything under the sun, but it always takes me a while to dig up the recipe for a given operation.

Parsing the WordPress export file

I found to my surprise that WordPress currently exports invalid XML. So the script starts with a search-and-replace that looks for this:

xmlns:wp="http://wordpress.org/export/1.0/"

And replaces it with this:

xmlns:wp="http://wordpress.org/export/1.0/"
xmlns:atom="http://www.w3.org/2005/Atom"

Then it walks through the items in the Atom feed, extracting the various things that will become Pivot facets. For the description, it tries to parse the content:encoded element as XML, and find the first paragraph element within it. If that fails, it just treats the element as text and grabs the beginning of it.

Weaving the collection

There are two control files that need to be synchronized. First, there’s dzc_output.xml, for the Deep Zoom collection. It has elements like this:

<I Id=”596″ N=”596″ Source=”2245.xml”>

Then there’s pivot.cxml which drives the visualization. It has elements like this:

<Item Id="596" Img="#596" 
  Name="Freebase Gridworks: A power tool for data scrubbers" 
  Href="https://blog.jonudell.net/2010/03/26/...
<Description><![CDATA[
I've had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] 
about the data magic they performed at MIT's Project Simile and now perform 
at Metaweb. If you're somebody who values clean data and has wrestled with 
the dirty stuff, these screencasts about a forthcoming product called 
Freebase Gridworks will make you weep with joy.
]]></Description>
<Facets>
  <Facet Name="date">
    <DateTime Value="2010-03-26T00:00:00-00:00" />
  </Facet>
<Facet Name="tag">
<String Value="freebase" />
<String Value="gridworks" />
<String Value="metaweb" />
</Facet>
  <Facet Name="comments">
    <Number Value="24" />
  </Facet>
</Facets>
</Item>

In this example, Source="2245.xml" in dzc_output.xml refers to a Deep Zoom image whose name comes from the WordPress post_id for that entry, which is:

<wp:post_id>2245</wp:post_id>

But Id="596", which is the connection between dzc_output.xml and pivot.cxml, comes from a counter in the script that increments for each item processed. I don’t know why the numbering of items in the WordPress export file is sparse, but it is, hence the difference.

Things to do

Here are some ideas for next steps.

1. Check the comment logic. I just noticed the counts seem odd. Maybe because I’m counting all comments instead of approved comments?

2. Use HTML Tidy to ensure that item content will parse as XML, and then count various kinds of elements within it: tables, images, etc.

2. Use APIs of various services — Twitter, bit.ly, etc. — to count reactions to each item.

A Pivot visualization of my WordPress blog

A Pivot experiment

Pivot, from Microsoft Live Labs, is a data browser that visualizes data facets represented as Deep Zoom images and collections. I’ve been meaning to try my hand at creating a Pivot collection. My first experiment is a visualization of my blog which, in its current incarnation at WordPress.com, has about 600 entries. That’s a reasonable number of items for the simplest (and most common) kind of collection in which data and visuals are pre-computed and cached in the viewer. Here’s the default Pivot view of those entries.

The default view

To create this collection, I needed a visual representation of each blog entry. I didn’t think screenshots would be very useful, but the method worked out better than I expected. At the default zoom level there’s not much to see, but you can pick out entries that include pictures.

A selected entry

When you select an entry, the view zooms about halfway in to focus on it.

A text-only entry

Here’s a purely textual entry at the same resolution. If you click to enlarge that picture, you’ll see that at this level the titles of the current entry and its surrounding entries are legible.

The Show Info control

Clicking the Show Info control opens up an information window that displays title, description, and metadata. I’ve included the first paragraph of each entry as the description.

Zooming closer

If I zoom in further, the text becomes fully legible.

Histogram of entries

Of course the screenshot doesn’t capture the entire entry, it’s just a picture of the first screenful. To read the full entry, you click the Open control to view the entire HTML page inside Pivot.

Pivot itself isn’t a reader, it’s a data browser. This becomes clear when you switch from item view to graph view. 2006 and 2010 are incomplete years, but the period 2007-2009 shows a clear decline. I suspect a lot of blogs would show a similar trend, reflecting Twitter’s eclipse of the blogosophere.

2007 distribution

Here’s the distribution for just the year 2007.

Histogram of comments

And here’s the comments facet, which counts the number of comments on each entry.

Histogram of entries with more than 20 comments

Adjusting the slider limits the view to entries with more than 20 comments.

Filtering by tags

Of course I can also view entries by tags or tag combinations.

Filtering by keywords

When I start typing a keyword, the wordwheel displays matches from two namespaces: tags and titles.

Other possible views

Facets can be anything you can enumerate and count. I could, for example, count the number of images, tables, and other kinds of HTML constructs in each entry. That isn’t just a gratuitous exercise. Some years back, I outfitted my blog with an XQuery service that could search for items that contained more than a few images or tables, and it was useful for finding items that I remembered that way.

It would also be nice to include facets based on the WordPress stats API. And since a lot of the flow to the blog nowadays comes through bit.ly-shorted URLs on Twitter, a facet based on those referrals would be handy.

How I did it

Life’s too short to make 600 screenshots by hand, so the process had to be automated. Also, I want to be able to update this collection as I add entries to the blog. So I’m using IECapt to programmatically render pages as images, and the indispensable ImageMagick to crop the images in a standard way.

To automate the creation of Deep Zoom images (and XML files), I’m using deepzoom.py. (Note that I had to make two small changes to that version. At line 224, I changed tile_size=254 to tile_size=256. And at line 291 I changed PIL.Image.open(source_path) to PIL.Image.open(source_path.name).)

To build the main CXML (collection XML) file, I export my WordPress blog and run a Python script against it. I hadn’t looked at that export file in a long time, and was surprised to find that currently it isn’t quite valid XML. The declaration of the Atom namespace is missing. My script does a search-and-replace to fix that before it parses the XML.

I haven’t uploaded the collection to a server yet, because there are a bazillion little files and I’m still tweaking. Once I’m happy with the results, though, I should be able to establish a baseline collection on a server and then easily extend it an entry at a time.

If there’s interest I’ll publish the script. It’ll be more compelling, I suspect, once Pivot becomes available as a Silverlight control. Currently you have to download and install the Windows version of Pivot to use this visualization. But imagine if WordPress.com could deliver something like this for all of its blogs as a zero-install, cross-platform, cross-browser feature. That would be handy.

elmcity and WordPress MU: Questions and answers

In the spirit of keystroke conservation, I’m relaying some elmcity-related questions and answers from email to here. Hopefully it will attract more questions and more answers.

Dear Mr. Udell,

I am looking for a flexible calendar aggregator that I can use to report upcoming events for our college’s “Learning Commons” WordPress MU website, a site that will hopefully help keep our students abreast of events and opportunities taking place on campus.

1) Our site will be maintained using WordPress MU, so ideally the
display of the calendars, and/or event-lists will be handled by a
WordPress plugin. The one I am favouring is
http://wordpress.org/extend/plugins/wordpress-ics-importer/ . I have
tried this plugin and it almost does what we want.

Specifically, the plugin includes:

– a single widget that can display the “event-list” for one calendar;

– flexible options for displaying and aggregating calendars.

This plugin almost does what I want, but not quite.

a) The plugin is now limited to a single “events-list” widget. But with WordPress 2.8, it is possible to have many instances of a widget, so theoretically, I could display the “Diagnostic Tests” calendar in one instance , and the “Peer-tutoring” calendar in another widget instance.

b) It would be nice to have an option to display only the current week for specific calendars. While in other cases, it makes sense to display the entire month. And although I haven’t thought about it, likely displaying just the current day would be useful.

c) I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

2) It would be nice to remove the calendar aggregation from the WordPress plugin, and handle that separately. Hopefully the calendars will change much less frequently than the website will be viewed. If I understand https://blog.jonudell.net/elmcity-project-faq/ properly, this might be possible using the elmcity-project.

For example, I think we could use “topical hub aggregation” to create a “diagnostic test calendar” that aggregates the holiday calendar and the different departments “diagnostic test” calendars. What I don’t understand is what is the output of “elmcity”. Does it output a single merged calendar (ics) that could be displayed by the above plugin? Is that a possibility?

Similarly, I believe I could create a different meta bookmark to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

We have lots of groups, faculty, departments and staff on campus, and each wants to publicize their upcoming events. Letting them input and maintain their own calendars really seems to make sense. (Thanks for the idea. It seems clear this is the way to go, but I don’t seem to have the pieces to construct the output properly, as yet.)

I agree with your analysis that it would be better to have a separation of concerns between aggregation and display. So let’s do that, and start with aggregation.

I would like flexibility over which calendars to aggregate, creating as many “topic” hubs as the current maintainer of the website might think useful for the students.

I think the elmcity system can be helpful here. I’ve recently discovered that there are really two levels — what I’ve started to call curation and meta-curation.

I believe I could create bookmarks to aggregate our holiday calendar and our different peer-tutoring calendars (created by each department). Is this correct?

Right. It sounds like you’d want to curate a topic hub. It could be YourCollege, but if there may need to be other topic hubs you could choose a more specific name, like YourCollegeLearningCommons. That’d be your Delicious account name, and you’d be the “meta-curator” in this scenario.

As meta-curator you’d bookmark, in that Delicious account:

– Your holiday calendar

– Multiple departments’ calendars

Each of those would be managed by the responsible/authoritative person, using any software (Outlook, Google, Apple, Drupal, Notes, Live, etc.) that can publish an ICS feed.

There’s another level of flexibility using tags. In the above scenario, as meta-curator you could tag your holiday feed as holiday, and your LearningCommons feeds as LearningCommons, and then filter them accordingly.

What I don’t understand is what is the output of elmcity. Does it output a single merged calendar (ics) that could be displayed by the above plugin?

Yes. The outputs currently are:

Now, for the display options. So far, we’ve got:

  • Use the WordPress plugin to display merged ICS

  • Display the entire calendar as included (maybe customized) HTML

  • Display today’s events as included or script-sourced HTML

  • I have also just recently added a new method that enables things like this: http://jonudell.net/test/upcoming-widget.html

  • You can view the source to see how it’s done. The “API call” here is:

    http://elmcity.cloudapp.net/services/elmcity/json?jsonp=eventlist&recent=7&view=music

    Yours might be:

    http://elmcity.cloudapp.net/services/YourCollegeLearningCommons/json?jsonp=eventlist&recent=10

    or

    &recent=20&view=holiday

    etc.

    This is brand new, as of yesterday. Actually I just realized I should use “upcoming” instead of “recent” so I’ll go and change that now :-) But you get the idea.

    The flexibility here is ultimately governed by:

    1. The curator’s expressive and disciplined use of tags to create useful views

    2. The kinds of queries I make available through the API. So far I’ve only been asked to do ‘next N events’ so that’s what I did yesterday. But my intention is to support every kind of query that’s feasible, and that people ask for. Things like a week’s worth, or a week’s worth in a category, are obvious next steps.

Polymath = user innovation

In February 2007, Mike Adams, who had recently joined Automattic, the company that makes WordPress, decided on a lark to endow all blogs running on WordPress.com with the ability to use LaTeX, the venerable mathematical typesetting language. So I can write this:

$latex \pi r^2$

And produce this:

\pi r^2

When he introduced the feature, Mike wrote:

Odd as it may sound, I miss all the equations from my days in grad school, so I decided that what WordPress.com needed most was a hot, niche feature that maybe 17 people would use regularly.

A whole lot more than 17 people cared. And some of them, it turns out, are Fields medalists. Back in January, one member of that elite group — Tim Gowers — asked: Is massively collaborative mathematics possible? Since then, as reported by observer/participant Michael Nielsen (1, 2), Tim Gowers, Terence Tao, and a bunch of their peers have been pioneering a massively collaborative approach to solving hard mathematical problems.

Reflecting on the outcome of the first polymath experiment, Michael Nielsen wrote:

The scope of participation in the project is remarkable. More than 1000 mathematical comments have been written on Gowers’ blog, and the blog of Terry Tao, another mathematician who has taken a leading role in the project. The Polymath wiki has approximately 59 content pages, with 11 registered contributors, and more anonymous contributors. It’s already a remarkable resource on the density Hales-Jewett theorem and related topics. The project timeline shows notable mathematical contributions being made by 23 contributors to date. This was accomplished in seven weeks.

Just this week, a polymath blog has emerged to serve as an online home for the further evolution of this approach.

I am completely unqualified to evaluate the nature of mathematical discourse that’s going in on these polymath collaborations, or the claims being made regarding outcomes. But it sure makes my spidey-sense tingle.

I am, however, qualified to evaluate the nature of the collaborative methods being employed. And on that front, I’m amused (and chagrined) to recall something I wrote back in 2000, in a report called Internet groupware for scientific collaboration. The report was commissioned by Greg Wilson, who organized this week’s Science 2.0 event in Toronto. At that event, my report served as a historical frame for the polymath experimentation that’s going on right now, and that Michael Nielsen discussed at the Toronto event in an updated version of this talk.

In my 2000 report I said:

TeX and LaTeX define scientific publishing for a generation of scientists. But these formats don’t integrate directly into the shared spaces of the Web. The rise of XML as a universal markup language, along with vocabularies such as MathML (for mathematical notation) and SVG (for scalable vector graphics), suggests that the Web may yet reach its original collaborative goal.

Why didn’t I see, then, that the crux of the issue wasn’t XML and MathML and SVG, but rather the ability to “integrate directly into the shared spaces of the Web”? And that what ought to be integrated directly was the typesetting language already familiar to mathematicians, namely LaTeX?

The answer is that I needed (and still need) to be reminded that good-enough solutions here now, and familiar to people, often trump great solutions that aren’t here and wouldn’t be familiar if they were.

From that perspective, I’m wondering what will and won’t turn out be good enough for the polymathematicians. The current setup is admittedly imperfect, and they’re now begining to explore WordPress plugins that enable, for example, more powerful ways to organize, reply to, and refer to one anothers’ comments.

I don’t think anybody yet knows what the right tooling will be for polymathematical collaboration. The ones who are best qualified to figure it out are the polymathematical collaborators themselves, but they are not WordPress plugin developers.

What’s needed is what Eric von Hippel calls a user innovation toolkit. The idea is this: Leading users, as they employ a tool, also modify it, and in so doing they express intentions that tool developers can then capture and formalize.

If you look at the systems of notation that the polymathematicians are creating in order to organize and refer to their contributions in these long and complex threads of mathematical discourse, you can see intentions being expressed. So arguably, WordPress is a user innovation toolkit, and we’ll see these innovations codified in future plugins. I’ll be watching with great interest.

Update: As per Jonathan Fine’s comment below, it appears that MathTran.org has offered the same kind of service for quite a while now: