Querying mobile data objects with LINQ

I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated variable), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.

For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As happens lately, I was struck by the convergence of the languages. Here’s a side-by-side comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:

Python C#
 
 
i_name = 5
i_statename = 6
 
i_pop2008 = 17
 
 
handle = urllib.urlopen(url)
 
 
 
 
 
 
 
 
reader = csv.reader(
  handle, delimiter=',')
 
 
rows = itertools.ifilter(lambda x : 
  x[i_name].startswith('Keene') and 
  x[i_statename] == 'New Hampshire', 
    reader)
 
found_rows = list(rows)
 
 
 
count = len(found_rows)
 
if ( count > 0 ):
  pop = int(found_rows[0][i_pop2008])    
public class USCensusPopulationData
  {   
  public string NAME;
  public string STATENAME;
  ... etc. ...
  public string POP_2008;
  }
  
var csv = new WebClient().
  DownloadString(url);
  
var stream = new MemoryStream(
  Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };
  
var reader = 
  cc.Read<USCensusPopulationData>(sr, fd);
  
 
var rows = reader.ToList();
 
  
  
 
var found_rows = rows.FindAll(row => 
  row.name.StartsWith('Keene') && 
  row.statename == 'New Hampshire');
  
var count = rows.Count;
  
if ( count > 0 )
  pop = Convert.ToInt32(
    found_rows[0].pop_2008)

Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.

This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:

2000 - 2008 population loss in NH

-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city

In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:

import urllib, itertools, csv

i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17

def make_reader():
  handle = open('pop.csv')
  return csv.reader(handle, delimiter=',')

def unique(rows):
  dict = {}
  for row in rows:
    key = "%s %s %s %s" % (i_name, i_statename, 
      row[i_pop2000], row[i_pop2008])    
    dict[key] = row
  list = []
  for key in dict:
    list.append( dict[key] )
  return list

def percent(row,a,b):
  pct = - (  float(row[a]) / float(row[b]) * 100 - 100 )
  return pct

def change(x,state,minpop=1):
  statename = x[i_statename]
  p2000 = int(x[i_pop2000])
  p2008 = int(x[i_pop2008])
  return (  statename==state and 
            p2008 > minpop   and 
            p2008 < p2000 )

state = 'New Hampshire'

reader = make_reader()
reader.next() # skip fieldnames

rows = itertools.ifilter(lambda x : 
  change(x,state,minpop=3000), reader)

l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
  percent(y,i_pop2000,i_pop2008)))

for row in l:
  print "%2.2f%% %s" % ( 
       percent(row,i_pop2000,i_pop2008),
       row[i_name] )

A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.

As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database — perhaps SQLite in order to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.

With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:

var census_rows = make_reader();

var distinct_rows = census_rows.Distinct(new CensusRowComparer());

var threshold = 3000;

var rows = 
  from row in distinct_rows
  where row.STATENAME == statename
      && Convert.ToInt32(row.POP_2008) > threshold
      && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000) 
  orderby percent(row.POP_2000,row.POP_2008) 
  select new
    {
    name = row.NAME,
    pop2000 = row.POP_2000,
    pop2008 = row.POP_2008    
    };

 foreach (var row in rows)
   Console.WriteLine("{0:0.00}% {1}",
     percent(row.pop2000,row.pop2008), row.name );

You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.

With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.

Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejslberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.

I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.


private static float percent(string a, string b)
  {
  var y0 = float.Parse(a);
  var y1 = float.Parse(b);
  return - ( y0 / y1 * 100 - 100);
  }

private static IEnumerable<USCensusPopulationData> make_reader()
  {
  var h = new FileStream("pop.csv", FileMode.Open);
  var bytes = new byte[h.Length];
  h.Read(bytes, 0, (Int32)h.Length);
  bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
  var stream = new MemoryStream(bytes);
  var sr = new StreamReader(stream);
  var cc = new CsvContext();
  var fd = new CsvFileDescription { };

  var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
  return census_rows;
  }

public class USCensusPopulationData
  {
  public string SUMLEV;
  public string state;
  public string county;
  public string PLACE;
  public string cousub;
  public string NAME;
  public string STATENAME;
  public string POPCENSUS_2000;
  public string POPBASE_2000;
  public string POP_2000;
  public string POP_2001;
  public string POP_2002;
  public string POP_2003;
  public string POP_2004;
  public string POP_2005;
  public string POP_2006;
  public string POP_2007;
  public string POP_2008;

  public override string ToString()
    {
    return
      NAME + ", " + STATENAME + " " + 
      "pop2000=" + POP_2000 + " | " +
      "pop2008=" + POP_2008;
    } 
  }

public class  CensusRowComparer : IEqualityComparer<USCensusPopulationData>
  {
  public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
    return x.NAME == y.NAME && x.STATENAME == y.STATENAME ;
    }

  public int GetHashCode(USCensusPopulationData obj)
    {
    var hash = obj.ToString();
    return hash.GetHashCode();
    }
  }

2 thoughts on “Querying mobile data objects with LINQ

  1. Pingback: Arjan’s World » LINKBLOG for Sep 29, 2009

  2. Pingback: More Python and C# idioms: Finding the difference between two list « Jon Udell

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s