I’ve posted the Python script I used to make the Pivot visualization of this blog. I need to set it aside for now and do other things, but here’s a snapshot of the process for my future self and for anyone else who’s interested.

Using deepzoom.py to create Deep Zoom images and collections

I’m using this Python component to create Deep Zoom images and collections. I made the following changes to it:

1. tile_size=256 (not 254) at line 59, line 160, and line 224

2. source_path.name instead of source_path at line 291

3. destination + '.xml' instead of destination at line 341

Let’s assume that Python is installed, along with the Python Imaging Library, and that your current directory contains the files 001.jpg, 002.jpg, and 003.jpg:

001.jpg
002.jpg
003.jpg

For each image file, you could run deepzoom.py thrice from the command line, like so:

python deepzoom.py -d 001.xml 001.jpg
python deepzoom.py -d 002.xml 002.jpg
python deepzoom.py -d 003.xml 003.jpg
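If you’d rather batch that from the shell, a loop like this (assuming a POSIX shell) handles any number of JPEGs:

```shell
# Derive each DZI settings filename from the JPEG name and
# run deepzoom.py once per image.
for f in *.jpg; do
  [ -e "$f" ] || continue    # skip the literal pattern if no JPEGs exist
  python deepzoom.py -d "${f%.jpg}.xml" "$f"
done
```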

My script doesn’t actually do it that way; it enumerates JPEGs and instantiates deepzoom.py’s ImageCreator object once for each. But either way, for each JPEG you end up with a DZI (Deep Zoom Image) package that consists of (for 001.jpg):

  • A settings file: 001.xml
  • A subdirectory: 001_files
  • More subdirectories (named 0, 1, etc.) inside 001_files
  • JPG files inside those subdirectories
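In my script the equivalent step looks roughly like this. It’s a sketch: the `ImageCreator` constructor arguments shown here are assumptions, so check your copy of deepzoom.py, and the actual `create` calls are commented out since they require deepzoom.py on the path:

```python
import glob
import os

def dzi_name(jpg):
    # Map 001.jpg -> 001.xml, the settings file deepzoom.py writes.
    return os.path.splitext(jpg)[0] + '.xml'

# from deepzoom import ImageCreator   # requires deepzoom.py
# creator = ImageCreator(tile_size=256)
for jpg in sorted(glob.glob('*.jpg')):
    # creator.create(jpg, dzi_name(jpg))
    print(jpg, '->', dzi_name(jpg))
```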

Now, in this case, the current directory looks like this (using -> to mark additions):

001.jpg
-> 001.xml
-> 001_files
002.jpg
-> 002.xml
-> 002_files
003.jpg
-> 003.xml
-> 003_files

To build a collection, do something like this in Python:

from deepzoom import *
images = ['001.xml', '002.xml', '003.xml']
creator = CollectionCreator()
creator.create(images, 'dzc_output')

Now the current directory looks like:

001.jpg
001.xml
001_files
002.jpg
002.xml
002_files
003.jpg
003.xml
003_files
-> dzc_output.xml
-> dzc_output_files

The Pivot collection’s CXML file will refer to dzc_output.xml, like so:

<Items ImgBase="dzc_output.xml">

Using IECapt to grab screenshots

This tool uses Internet Explorer, so it only works on Windows. There is also CutyCapt, a WebKit-based alternative that I haven’t tried but would be curious to hear about.

Here’s an example of the IECapt command line I’m using:

iecapt --url=http://blog.jonudell.net/… --delay=1000 --out=tmp.jpg

The result in most cases is a tall skinny JPEG, because it renders the whole page — which can be very long — before imaging it. When I ran it over a 600-item collection, it hung a couple of times because of JavaScript errors. So I went to Internet Options->Browsing in IE, checked Disable script debugging, and unchecked Display a notification about every script error.

Using ImageMagick to crop screenshots

Here’s a picture of an image produced by IECapt, overlaid with a rectangle marking where I want to crop:

The rectangle’s origin is at x=30 and y=180. Its width is 530 pixels, and height 500. Here’s the ImageMagick command to crop a captured image in tmp.jpg into a cropped image in 001.jpg:

convert -quality 100 -crop 530x500+30+180 -border 1x1 -bordercolor Black tmp.jpg 001.jpg

I’m writing this down here mainly for myself. ImageMagick can do everything under the sun, but it always takes me a while to dig up the recipe for a given operation.
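Since the geometry syntax is the part I always forget, a script can build the command programmatically. Here’s a sketch; the actual subprocess call is commented out because it requires ImageMagick to be installed:

```python
def crop_command(src, dest, width=530, height=500, x=30, y=180):
    # ImageMagick crop geometry is WIDTHxHEIGHT+XOFFSET+YOFFSET.
    geometry = '%dx%d+%d+%d' % (width, height, x, y)
    return ['convert', '-quality', '100', '-crop', geometry,
            '-border', '1x1', '-bordercolor', 'Black', src, dest]

cmd = crop_command('tmp.jpg', '001.jpg')
# import subprocess; subprocess.check_call(cmd)
print(' '.join(cmd))
```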

Parsing the WordPress export file

I found to my surprise that WordPress currently exports invalid XML. So the script starts with a search-and-replace that looks for this:

xmlns:wp="http://wordpress.org/export/1.0/"

And replaces it with this:

xmlns:wp="http://wordpress.org/export/1.0/"
xmlns:atom="http://www.w3.org/2005/Atom"
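In Python that search-and-replace is essentially one line; a minimal sketch:

```python
def fix_wp_export(xml_text):
    # The wp: namespace declaration is the landmark; append the missing
    # atom: declaration right after it so atom-prefixed elements parse.
    old = 'xmlns:wp="http://wordpress.org/export/1.0/"'
    new = old + '\n xmlns:atom="http://www.w3.org/2005/Atom"'
    return xml_text.replace(old, new, 1)

print(fix_wp_export('<rss xmlns:wp="http://wordpress.org/export/1.0/">'))
```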

Then it walks through the items in the export feed, extracting the various things that will become Pivot facets. For the description, it tries to parse the content:encoded element as XML and find the first paragraph element within it. If that fails, it just treats the element as text and grabs the beginning of it.
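That fallback logic might look like this. It’s a sketch using ElementTree; the real script’s details differ:

```python
import xml.etree.ElementTree as ET

def first_paragraph(encoded, fallback_len=200):
    """Return the first <p> element's text if the payload parses as
    XML; otherwise fall back to a plain-text prefix."""
    try:
        # Wrap in a root element so fragments with several
        # top-level elements still parse.
        root = ET.fromstring('<root>' + encoded + '</root>')
        p = root.find('.//p')
        if p is not None and p.text:
            return p.text
    except ET.ParseError:
        pass
    return encoded[:fallback_len]

print(first_paragraph('<p>Hello world.</p><p>More.</p>'))  # Hello world.
```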

Weaving the collection

There are two control files that need to be synchronized. First, there’s dzc_output.xml, for the Deep Zoom collection. It has elements like this:

<I Id="596" N="596" Source="2245.xml">

Then there’s pivot.cxml which drives the visualization. It has elements like this:

<Item Id="596" Img="#596" 
  Name="Freebase Gridworks: A power tool for data scrubbers" 
  Href="http://blog.jonudell.net/2010/03/26/...
<Description><![CDATA[
I've had many conversations with Stefano Mazzocchi and David Huynh [1, 2, 3] 
about the data magic they performed at MIT's Project Simile and now perform 
at Metaweb. If you're somebody who values clean data and has wrestled with 
the dirty stuff, these screencasts about a forthcoming product called 
Freebase Gridworks will make you weep with joy.
]]></Description>
<Facets>
  <Facet Name="date">
    <DateTime Value="2010-03-26T00:00:00-00:00" />
  </Facet>
  <Facet Name="tag">
    <String Value="freebase" />
    <String Value="gridworks" />
    <String Value="metaweb" />
  </Facet>
  <Facet Name="comments">
    <Number Value="24" />
  </Facet>
</Facets>
</Item>

In this example, Source="2245.xml" in dzc_output.xml refers to a Deep Zoom image whose name comes from the WordPress post_id for that entry, which is:

<wp:post_id>2245</wp:post_id>

But Id="596", which is the connection between dzc_output.xml and pivot.cxml, comes from a counter in the script that increments for each item processed. I don’t know why the numbering of items in the WordPress export file is sparse, but it is, hence the difference.

Things to do

Here are some ideas for next steps.

1. Check the comment logic. I just noticed the counts seem odd. Maybe because I’m counting all comments instead of approved comments?

2. Use HTML Tidy to ensure that item content will parse as XML, and then count various kinds of elements within it: tables, images, etc.

3. Use APIs of various services — Twitter, bit.ly, etc. — to count reactions to each item.

Recently I’ve posted two examples[1, 2] of Python idioms alongside corresponding C# idioms. It always intrigues me to look at the same concept through different lenses, and it seems to intrigue others as well, so here’s a third installment.

Today’s example comes from a real scenario. I’ve recently added a feature to the elmcity service that enables curators to control their hubs by sending Twitter direct messages to the service. One method, GetDirectMessagesFromTwitter, calls the Twitter API and returns a list of direct messages sent to the elmcity service. Another method, GetDirectMessagesFromAzure, calls the Azure table storage API and returns a list of direct messages stored there. The difference between the two lists — if any — represents new messages to be processed.

Here’s one take on Python and C# idioms for finding the difference between two lists:

Python:

fetched_messages = GetDirectMessagesFromTwitter()
stored_messages = GetDirectMessagesFromAzure()
diff = set(fetched_messages) - set(stored_messages)
return list(diff)

C#:

var fetched_messages = GetDirectMessagesFromTwitter();
var stored_messages = GetDirectMessagesFromAzure();
var diff = fetched_messages.Except(stored_messages);
return diff.ToList();
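For a concrete, runnable version of the Python side, with plain strings standing in for the message objects (the ids are hypothetical):

```python
fetched_messages = ['dm-1', 'dm-2', 'dm-3']  # ids fetched from Twitter
stored_messages = ['dm-1', 'dm-2']           # ids already in Azure storage

# Anything fetched but not yet stored is new and needs processing.
new_messages = list(set(fetched_messages) - set(stored_messages))
print(new_messages)  # ['dm-3']
```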

I can’t decide which one I prefer. Python’s set arithmetic is mathematically pure. But C#’s noun-verb syntax is appealing too. Which do you prefer? And why?


PS: The Python example above is slightly concocted. It won’t work as shown here because I’m modeling Twitter direct messages as .NET objects. IronPython can use those objects, but the set subtraction fails because the objects returned from the two API calls aren’t directly comparable.

A real working example would add something like this:

fetched_message_sigs = [x.text+x.datetime for x in fetched_messages]
stored_message_sigs = [x.text+x.datetime for x in stored_messages]
diff = list(set(fetched_message_sigs) - set(stored_message_sigs))

But that’s a detail that would only obscure the side-by-side comparison I’m making here.

I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated values), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.

For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As so often lately, I was struck by the convergence of the languages. Here’s a side-by-side comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:

Python:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2008 = 17

handle = urllib.urlopen(url)

reader = csv.reader(handle, delimiter=',')

rows = itertools.ifilter(lambda x :
  x[i_name].startswith('Keene') and
  x[i_statename] == 'New Hampshire',
    reader)

found_rows = list(rows)

count = len(found_rows)

if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])

C#:

public class USCensusPopulationData
  {
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }

var csv = new WebClient().DownloadString(url);

var stream = new MemoryStream(Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };

var reader = cc.Read<USCensusPopulationData>(sr, fd);

var rows = reader.ToList();

var found_rows = rows.FindAll(row =>
  row.NAME.StartsWith("Keene") &&
  row.STATENAME == "New Hampshire");

var count = found_rows.Count;

if ( count > 0 )
  pop = Convert.ToInt32(found_rows[0].POP_2008);

Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.

This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:

2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (row[i_name], row[i_statename], 
      row[i_pop2000], row[i_pop2008])    
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and 
            p2008 > minpop   and 
            p2008 < p2000 )

state = 'New Hampshire'

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x : 
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % ( 
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.

As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database — perhaps SQLite, to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And the database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.
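To make the chasm concrete, here’s what the SQLite route looks like. The rows are made-up illustrations, not real census figures, and the real file has many more columns:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pop (name TEXT, statename TEXT, '
             'pop2000 INTEGER, pop2008 INTEGER)')
# Illustrative sample rows only.
conn.executemany('INSERT INTO pop VALUES (?, ?, ?, ?)',
                 [('Alphaville town', 'New Hampshire', 10000, 9500),
                  ('Betaville town', 'New Hampshire', 8000, 8200)])

# The query ships to the database as a SQL string: a different
# language and type system from the Python surrounding it.
losers = conn.execute(
    'SELECT name FROM pop WHERE pop2008 < pop2000').fetchall()
print(losers)  # [('Alphaville town',)]
```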

With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:

var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows = 
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000) 
  orderby percent(row.POP_2000,row.POP_2008) 
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008    
    };

foreach (var row in rows)
  Console.WriteLine("{0:0.00}% {1}",
    percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.

With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.

Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejlsberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.

I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.


private static float percent(string a, string b)
  {
  var y0 = float.Parse(a);
  var y1 = float.Parse(b);
  return - ( y0 / y1 * 100 - 100);
  }

private static IEnumerable<USCensusPopulationData> make_reader()
  {
  var h = new FileStream("pop.csv", FileMode.Open);
  var bytes = new byte[h.Length];
  h.Read(bytes, 0, (Int32)h.Length);
  bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
  var stream = new MemoryStream(bytes);
  var sr = new StreamReader(stream);
  var cc = new CsvContext();
  var fd = new CsvFileDescription { };

  var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
  return census_rows;
  }

public class USCensusPopulationData
  {
  public string SUMLEV;
  public string state;
  public string county;
  public string PLACE;
  public string cousub;
  public string NAME;
  public string STATENAME;
  public string POPCENSUS_2000;
  public string POPBASE_2000;
  public string POP_2000;
  public string POP_2001;
  public string POP_2002;
  public string POP_2003;
  public string POP_2004;
  public string POP_2005;
  public string POP_2006;
  public string POP_2007;
  public string POP_2008;

  public override string ToString()
    {
    return
      NAME + ", " + STATENAME + " " + 
      "pop2000=" + POP_2000 + " | " +
      "pop2008=" + POP_2008;
    } 
  }

public class  CensusRowComparer : IEqualityComparer<USCensusPopulationData>
  {
  public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
    return x.NAME == y.NAME && x.STATENAME == y.STATENAME ;
    }

  public int GetHashCode(USCensusPopulationData obj)
    {
    var hash = obj.ToString();
    return hash.GetHashCode();
    }
  }

When I started working on the elmcity project, I planned to use my language of choice in recent years: Python. But early on, IronPython wasn’t fully supported on Azure, so I switched to C#. Later, when IronPython became fully supported, there was really no point in switching my core roles (worker and web) to it, so I’ve proceeded in a hybrid mode. The core roles are written in C#, and a variety of auxiliary pieces are written in IronPython.

Meanwhile, I’ve been creating other auxiliary pieces in JavaScript, as will happen with any web project. The other day, at the request of a calendar curator, I used JavaScript to prototype a tag summarizer. This was so useful that I decided to make it a new feature of the service. The C# version was so strikingly similar to the JavaScript version that I just had to set them side by side for comparison:

JavaScript:

var tagdict = new Object();

for ( var i = 0; i < obj.length; i++ )
  {
  var evt = obj[i];
  if ( evt["categories"] != undefined )
    {
    var tags = evt["categories"].split(',');
    for ( var j = 0; j < tags.length; j++ )
      {
      var tag = tags[j];
      if ( tagdict[tag] != undefined )
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = [];

for ( var tag in tagdict )
  sorted_keys.push(tag);

sorted_keys.sort(function(a,b) 
  { return tagdict[b] - tagdict[a] });

C#:

var tagdict = new Dictionary<string, int>();

foreach (var evt in es.events)
  {
  if (evt.categories != null)
    {
    var tags = evt.categories.Split(',');
    foreach (var tag in tags)
      {
      if (tagdict.ContainsKey(tag))
        tagdict[tag]++;
      else
        tagdict[tag] = 1;
      }
    }
  }

var sorted_keys = new List<string>();

foreach (var tag in tagdict.Keys)
  sorted_keys.Add(tag);

sorted_keys.Sort( (a, b) 
  => tagdict[b].CompareTo(tagdict[a]));

The idioms involved here include:

  • Splitting a string on a delimiter to produce a list

  • Using a dictionary to build a concordance of strings and occurrence counts

  • Sorting an array of keys by their associated occurrence counts

I first used these idioms in Perl. Later they became Python staples. Now here they are again, in both JavaScript and C#.
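For completeness, here are the same three idioms in Python, with made-up event data standing in for a hub’s feed:

```python
# Hypothetical event records; in the service these come from a hub's feed.
events = [{'categories': 'music,arts'},
          {'categories': 'music'},
          {}]

# Split on the delimiter and build the tag -> count concordance.
tagdict = {}
for evt in events:
    if 'categories' in evt:
        for tag in evt['categories'].split(','):
            tagdict[tag] = tagdict.get(tag, 0) + 1

# Sort the keys by descending occurrence count.
sorted_keys = sorted(tagdict, key=lambda t: tagdict[t], reverse=True)
print(sorted_keys)  # ['music', 'arts']
```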
