September 2009
Monthly Archive
September 29, 2009
I’m using US census data to look up the estimated populations of the cities and towns running elmcity hubs. The dataset is just plain old CSV (comma-separated values), a format that’s more popular than ever thanks in part to a new wave of web-based data services like DabbleDB, ManyEyes, and others.
For my purposes, simple pattern matching was enough to look up the population of a city and state. But I’d been meaning to try out LINQtoCSV, the .NET equivalent of my old friend, Python’s csv module. As happens lately, I was struck by the convergence of the languages. Here’s a side-by-side comparison of Python and C# using their respective CSV modules to query for the population of Keene, NH:
Python
import csv, itertools, urllib
i_name = 5
i_statename = 6
i_pop2008 = 17
handle = urllib.urlopen(url)
reader = csv.reader(
    handle, delimiter=',')
rows = itertools.ifilter(lambda x :
    x[i_name].startswith('Keene') and
    x[i_statename] == 'New Hampshire',
    reader)
found_rows = list(rows)
count = len(found_rows)
if ( count > 0 ):
    pop = int(found_rows[0][i_pop2008])
C#
public class USCensusPopulationData
{
    public string NAME;
    public string STATENAME;
    ... etc. ...
    public string POP_2008;
}
var csv = new WebClient().
    DownloadString(url);
var stream = new MemoryStream(
    Encoding.UTF8.GetBytes(csv));
var sr = new StreamReader(stream);
var cc = new CsvContext();
var fd = new CsvFileDescription { };
var reader =
    cc.Read<USCensusPopulationData>(sr, fd);
var rows = reader.ToList();
var found_rows = rows.FindAll(row =>
    row.NAME.StartsWith("Keene") &&
    row.STATENAME == "New Hampshire");
var count = found_rows.Count;
if ( count > 0 )
    pop = Convert.ToInt32(
        found_rows[0].POP_2008);
Things don’t line up quite as neatly as in my earlier example, or as in the A/B comparison (from way back in 2005) between my first LINQ example and Sam Ruby’s Ruby equivalent. But the two examples share a common approach based on iterators and filters.
This idea of running queries over simple text files is something I first ran into long ago in the form of the ODBC Text driver, which provides SQL queries over comma-separated data. I’ve always loved this style of data access, and it remains incredibly handy. Yes, some data sets are huge. But the 80,000 rows of that census file add up to only 8MB. The file isn’t growing quickly, and it can tell a lot of stories. Here’s one:
2000 - 2008 population loss in NH
-8.09% Berlin city
-3.67% Coos County
-1.85% Portsmouth city
-1.85% Plaistow town
-1.78% Balance of Coos County
-1.43% Claremont city
-1.02% Lancaster town
-0.99% Rye town
-0.81% Keene city
-0.23% Nashua city
In both Python and C# you can work directly with the iterators returned by the CSV modules to accomplish this kind of query. Here’s a Python version:
import urllib, itertools, csv
i_name = 5
i_statename = 6
i_pop2000 = 9
i_pop2008 = 17
def make_reader():
    handle = open('pop.csv')
    return csv.reader(handle, delimiter=',')
def unique(rows):
    # keep one row per (name, state, pop2000, pop2008) combination
    seen = {}
    for row in rows:
        key = "%s %s %s %s" % (row[i_name], row[i_statename],
            row[i_pop2000], row[i_pop2008])
        seen[key] = row
    return list(seen.values())
def percent(row,a,b):
    pct = - ( float(row[a]) / float(row[b]) * 100 - 100 )
    return pct
def change(x,state,minpop=1):
    statename = x[i_statename]
    p2000 = int(x[i_pop2000])
    p2008 = int(x[i_pop2008])
    return ( statename==state and
        p2008 > minpop and
        p2008 < p2000 )
state = 'New Hampshire'
reader = make_reader()
reader.next() # skip fieldnames
rows = itertools.ifilter(lambda x :
    change(x,state,minpop=3000), reader)
l = list(rows)
l = unique(l)
l.sort(lambda x,y: cmp(percent(x,i_pop2000,i_pop2008),
    percent(y,i_pop2000,i_pop2008)))
for row in l:
    print "%2.2f%% %s" % (
        percent(row,i_pop2000,i_pop2008),
        row[i_name] )
A literal C# translation could do all the same things in the same ways: Convert the iterator into a list, use a dictionary to remove duplication, filter the list with a lambda function, sort the list with another lambda function.
As queries grow more complex, though, you tend to want a more declarative style. To do that in Python, you’d likely import the CSV file into a SQL database, perhaps SQLite in order to stay true to the lightweight nature of this example. Then you’d ship queries to the database in the form of SQL statements. But you’re crossing a chasm when you do that. The database’s type system isn’t the same as Python’s. And the database’s internal language for writing functions won’t be Python either. In the case of SQLite, there won’t even be an internal language.
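For comparison, here is a minimal sketch of that SQLite route, using only Python's standard library. The table name census, the four sample rows, and their population figures are invented for illustration; they are not taken from the census file. Two crossings show up even at this size: the population strings are converted by SQLite's column affinity rules rather than by Python's int(), and because SQLite has no stored-procedure language, the percent function has to be registered by the application before SQL can call it.

```python
import sqlite3

# Illustrative rows only: (name, statename, pop_2000, pop_2008) as strings,
# the way csv.reader would deliver them. The numbers are made up.
rows = [
    ("Town A", "New Hampshire", "10000", "9500"),
    ("Town B", "New Hampshire", "5000", "5200"),
    ("Town C", "New Hampshire", "4000", "3900"),
    ("Town D", "Vermont", "8000", "7000"),
]

def percent(p2000, p2008):
    """Same formula as the percent() helper used elsewhere in the post."""
    return -(float(p2000) / float(p2008) * 100 - 100)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census "
    "(name TEXT, statename TEXT, pop_2000 INTEGER, pop_2008 INTEGER)"
)
# INTEGER column affinity turns the numeric strings into integers on insert.
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?)", rows)

# SQL cannot see Python functions until they are registered on the connection.
conn.create_function("percent", 2, percent)

query = """
    SELECT name, percent(pop_2000, pop_2008) AS pct
    FROM census
    WHERE statename = ? AND pop_2008 > ? AND pop_2008 < pop_2000
    ORDER BY pct
"""
losses = conn.execute(query, ("New Hampshire", 3000)).fetchall()

for name, pct in losses:
    print("%.2f%% %s" % (pct, name))
```

The query itself is declarative, but the filter lives in SQL while the formatting loop lives in Python, which is the split described above.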
With LINQ there’s no chasm to cross. Here’s the LINQ code that produces the same result:
var census_rows = make_reader();
var distinct_rows = census_rows.Distinct(new CensusRowComparer());
var threshold = 3000;
var rows =
    from row in distinct_rows
    where row.STATENAME == statename
        && Convert.ToInt32(row.POP_2008) > threshold
        && Convert.ToInt32(row.POP_2008) < Convert.ToInt32(row.POP_2000)
    orderby percent(row.POP_2000,row.POP_2008)
    select new
    {
        name = row.NAME,
        pop2000 = row.POP_2000,
        pop2008 = row.POP_2008
    };
foreach (var row in rows)
    Console.WriteLine("{0:0.00}% {1}",
        percent(row.pop2000,row.pop2008), row.name );
You can see the supporting pieces below. There are a number of aspects to this approach that I’m enjoying. It’s useful, for example, that every row of data becomes an object whose properties are available to the editor and the debugger. But what really delights me is the way that the query context and the results context share the same environment, just as in the Python example above. In this (slightly contrived) example I’m using the percent function in both contexts.
With LINQ to CSV I’m now using four flavors of LINQ in my project. Two are built into the .NET Framework: LINQ to XML, and LINQ to native .NET objects. And two are extensions: LINQ to CSV, and LINQ to JSON. In all four cases, I’m querying some kind of mobile data object: an RSS feed, a binary .NET object retrieved from the Azure blob store, a JSON response, and now a CSV file.
Six years ago I was part of a delegation from InfoWorld that visited Microsoft for a preview of technologies in the pipeline. At a dinner I sat with Anders Hejlsberg and listened to him lay out his vision for what would become LINQ. There were two key goals. First, a single environment for query and results. Second, a common approach to many flavors of data.
I think he nailed both pretty well. And it’s timely because the cloud isn’t just an ecosystem of services, it’s also an ecosystem of mobile data objects that come in a variety of flavors.
private static float percent(string a, string b)
{
    var y0 = float.Parse(a);
    var y1 = float.Parse(b);
    return - ( y0 / y1 * 100 - 100);
}
private static IEnumerable<USCensusPopulationData> make_reader()
{
    var h = new FileStream("pop.csv", FileMode.Open);
    var bytes = new byte[h.Length];
    h.Read(bytes, 0, (Int32)h.Length);
    bytes = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(bytes));
    var stream = new MemoryStream(bytes);
    var sr = new StreamReader(stream);
    var cc = new CsvContext();
    var fd = new CsvFileDescription { };
    var census_rows = cc.Read<USCensusPopulationData>(sr, fd);
    return census_rows;
}
public class USCensusPopulationData
{
    public string SUMLEV;
    public string state;
    public string county;
    public string PLACE;
    public string cousub;
    public string NAME;
    public string STATENAME;
    public string POPCENSUS_2000;
    public string POPBASE_2000;
    public string POP_2000;
    public string POP_2001;
    public string POP_2002;
    public string POP_2003;
    public string POP_2004;
    public string POP_2005;
    public string POP_2006;
    public string POP_2007;
    public string POP_2008;
    public override string ToString()
    {
        return
            NAME + ", " + STATENAME + " " +
            "pop2000=" + POP_2000 + " | " +
            "pop2008=" + POP_2008;
    }
}
public class CensusRowComparer : IEqualityComparer<USCensusPopulationData>
{
    // Equals and GetHashCode must agree, so both look at the same four fields
    public bool Equals(USCensusPopulationData x, USCensusPopulationData y)
    {
        return x.NAME == y.NAME && x.STATENAME == y.STATENAME &&
            x.POP_2000 == y.POP_2000 && x.POP_2008 == y.POP_2008;
    }
    public int GetHashCode(USCensusPopulationData obj)
    {
        var hash = obj.ToString();
        return hash.GetHashCode();
    }
}
September 28, 2009
Posted by Jon Udell under Uncategorized. 7 Comments
When Stefano Mazzocchi saw my posts on webscale identifiers [1, 2] he pointed me to some recent work he and others have been doing at Metaweb. At ids.freebaseapps.com you can find sets of different web identifiers that refer to the same things. So, for example:
Apple Inc.
versus
Apple Records
Each of these views collects identifiers from different sources. For Apple Inc. they include:
The NYTimes: topics.nytimes.com/top/news/business/companies/apple_computer_inc/
Wikipedia: wikipedia.org/wiki/Apple_Computer
Open Library: openlibrary.org/a/OL2669993A/Inc._Apple_Computer
On this week’s Innovators show Stefano joins me to discuss efforts underway at Metaweb to reconcile many different web naming systems and activate connections among them.
Meanwhile my recent guest Kingsley Idehen is demonstrating a similar kind of name reconciliation at bbc.openlinksw.com. At this URL, for example, you can see canonical identifiers for Michael Jackson from the BBC’s own namespace and others including DBpedia and OpenCyc.
I’m not quite sure what to make of all this. But my spidey sense is telling me to pay attention, so I am.
Related:
- Semantic web mashups for the rest of us
- A conversation with Stefano Mazzocchi about Cocoon and SIMILE
- Motivating people to write the semantic web: A conversation with David Huynh about Parallax
- Talking with Kingsley Idehen about mastering your own search index
September 17, 2009
I’ve really enjoyed the conversation about webscale identifiers. Naming web resources is such a crucial discipline, and yet one we’re all still making up as we go along. I ended the earlier post by suggesting that when we invent namespaces we should, where feasible, prefer names that make sense to people. In comments, a number of folks who have wrestled with the problem of ambiguity pointed out all sorts of reasons why that often just isn’t feasible.
Gavin Bell likes Amazon’s hybrid approach:
The model that Amazon have since moved to with a unique URL identifier and an ignored pretty human readable section is a good compromise.
Michael Smethurst agreed with me that the BBC’s opaque IDs — for example, b006qpgr for The Archers — could be promoted as a tag vocabulary that people would be encouraged to use:
Shownar is a prototype by Schulze and Webb that aims to track “buzz” around bbc programmes. For now it’s based on inbound links from blogs/twitter/etc but it could be expanded to use machine tags!?!
On Shownar, I find that this episode of Miss Marple was discussed in this blog entry:
BBC Radio have just started an Agatha Christie season and a whole host of programmes about the Queen of Crime are available to UK listeners on the iPlayer.
They include dramatizations of works starring super sleuths from Miss Marple to the Mysterious Mr Quin, as well as revealing documentaries.
The entry uses URLs that embed these BBC ids: b00mk71d, b007jvht. How did the author find them? Clearly, in this case, by way of the search URL which is also cited in the entry:
http://www.bbc.co.uk/iplayer/search/?q=agatha christie
The search term agatha christie is wildly ambiguous, of course. Shownar would never have included this item had it not cited specific BBC shows by way of their opaque IDs. Nor would the author have cited them if that had required typing b00mk71d or b007jvht. It only works thanks to copy/paste, but it works quite nicely, and it shows why site-specific search still matters in an era of uber search engines.
This example got me thinking about the character strings that we can and do type, easily and naturally, versus those we can’t and won’t. For example:
Looking at the consistency on the left column, and the variation on the right, I’ve got to conclude that:
- Practical Internet Groupware is the de facto webscale identifier for my book.
- 16804, 28447984, 9781565925373, pracintgr, 156592537, 1565925378, and 43188074 will never converge.
I’ve long imagined a class of equivalence services that would help us bridge the gap between vocabularies we can speak and write and those we’ll never speak and need help to write.
Both are sets of webscale identifiers that we’ll need to use in complementary ways. That’ll require a mix of social conventions and technical services.
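As a rough sketch of what the simplest possible equivalence service might do, the Python below maps the opaque identifiers listed above back to the name people actually type, and the name forward to its identifiers. The function names and the in-memory dictionary are assumptions for illustration; which system issued each identifier is not recorded here, so they are kept as one undifferentiated set.

```python
# Identifiers listed in the post for the book. Their issuing systems are
# deliberately left out because they are not stated here.
EQUIVALENCE_CLASSES = {
    "Practical Internet Groupware": {
        "16804", "28447984", "9781565925373", "pracintgr",
        "156592537", "1565925378", "43188074",
    },
}

def build_index(classes):
    """Map every opaque identifier back to its human-readable name."""
    index = {}
    for name, identifiers in classes.items():
        for identifier in identifiers:
            index[identifier] = name
    return index

INDEX = build_index(EQUIVALENCE_CLASSES)

def speakable_name(identifier):
    """Return the name people can say and type, or None if unknown."""
    return INDEX.get(identifier)

def opaque_ids(name):
    """Return the identifiers known to be equivalent to a name."""
    return sorted(EQUIVALENCE_CLASSES.get(name, ()))

print(speakable_name("9781565925373"))
print(opaque_ids("Practical Internet Groupware"))
```

A real service would also need the social half of the bargain: some agreed way for people and sites to contribute new equivalences.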
September 16, 2009
When I started working on the elmcity project, I planned to use my language of choice in recent years: Python. But early on, IronPython wasn’t fully supported on Azure, so I switched to C#. Later, when IronPython became fully supported, there was really no point in switching my core roles (worker and web) to it, so I’ve proceeded in a hybrid mode. The core roles are written in C#, and a variety of auxiliary pieces are written in IronPython.
Meanwhile, I’ve been creating other auxiliary pieces in JavaScript, as will happen with any web project. The other day, at the request of a calendar curator, I used JavaScript to prototype a tag summarizer. This was so useful that I decided to make it a new feature of the service. The C# version was so strikingly similar to the JavaScript version that I just had to set them side by side for comparison:
JavaScript
var tagdict = new Object();
for ( i = 0; i < obj.length; i++ )
{
    var evt = obj[i];
    if ( evt["categories"] != undefined)
    {
        var tags = evt["categories"].split(',');
        for (j = 0; j < tags.length; j++ )
        {
            var tag = tags[j];
            if ( tagdict[tag] != undefined )
                tagdict[tag]++;
            else
                tagdict[tag] = 1;
        }
    }
}
C#
var tagdict = new Dictionary<string, int>();
foreach (var evt in es.events)
{
    if (evt.categories != null)
    {
        var tags = evt.categories.Split(',');
        foreach (var tag in tags)
        {
            if (tagdict.ContainsKey(tag))
                tagdict[tag]++;
            else
                tagdict[tag] = 1;
        }
    }
}
JavaScript
var sorted_keys = [];
for ( var tag in tagdict )
    sorted_keys.push(tag);
sorted_keys.sort(function(a,b)
    { return tagdict[b] - tagdict[a] });
C#
var sorted_keys = new List<string>();
foreach (var tag in tagdict.Keys)
    sorted_keys.Add(tag);
sorted_keys.Sort( (a, b)
    => tagdict[b].CompareTo(tagdict[a]));
The idioms involved here include:
- Splitting a string on a delimiter to produce a list
- Using a dictionary to build a concordance of strings and occurrence counts
- Sorting an array of keys by their associated occurrence counts
I first used these idioms in Perl. Later they became Python staples. Now here they are again, in both JavaScript and C#.
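For completeness, here is roughly how the same three idioms look in Python. This is a sketch rather than code from the service: the events list and its field names are hypothetical stand-ins for the calendar data.

```python
# Hypothetical event records standing in for the calendar data.
events = [
    {"title": "Farmers market", "categories": "food,outdoors"},
    {"title": "Concert", "categories": "music,outdoors"},
    {"title": "Lecture"},  # no categories at all
    {"title": "Potluck", "categories": "food"},
]

# Build the concordance: tag -> number of occurrences.
tagdict = {}
for evt in events:
    categories = evt.get("categories")
    if categories is not None:
        # Split the comma-delimited string into a list of tags.
        for tag in categories.split(","):
            tagdict[tag] = tagdict.get(tag, 0) + 1

# Sort the keys by their counts, most frequent first.
sorted_keys = sorted(tagdict, key=lambda tag: tagdict[tag], reverse=True)

for tag in sorted_keys:
    print("%s: %d" % (tag, tagdict[tag]))
```

The shape is the same as in the JavaScript and C# versions: one loop to split and count, one sort keyed on the counts.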
September 15, 2009
Posted by Jon Udell under Uncategorized. 1 Comment
On this week’s Innovators show I reconnect with Hugh McGuire. He’s the 104th guest in the current incarnation of the show, and was also the fourth. With Hugh it’s always about books and collaboration. Our first conversation explored one of my favorite projects, LibriVox, which brings people together to make free downloadable audiobooks. This time around we talked about his new project, BookOven, which aims to help authors, editors, and readers work together to create new books.
Writing a book was the hardest thing I’ve ever done. The loneliness was what got to me. I finished around the time the blogosphere was starting to emerge, and the collegial joy I found here made me think I’d never want to repeat that solitary experience.
Nowadays I wouldn’t have to. Authors commonly write books out in the open on blogs. BookOven aims to push that strategy further by providing a suite of online tools purpose-built for discussing, editing, and proofing long texts.
Given the rise of the 140-character blurb, this emphasis on the long form is counter-cyclical. But for me, at least, the pendulum is swinging back. Lately I’m snacking less on Twitter and enjoying full meals served up by the blogosphere, online magazines, library books. It’s been nourishing. But I’m also noticing that much of this work — in the commercial as well as the amateur realm — could benefit from better organization, editing, and proofing.
The collaborative restructuring of all kinds of professional work has only just begun. Hugh McGuire and I share the belief that our new ability to harness what Yochai Benkler calls the loose affiliation of ad-hoc teams will yield better results in many areas. Book-length writing is the domain that Hugh has staked out. How can the new modes of collaboration enhance this ancient practice? We’ll see.
September 14, 2009
This fall a small team of University of Toronto and Michigan State undergrads will be working on parts of the elmcity project by way of Undergraduate Capstone Open Source Projects (UCOSP), organized by Greg Wilson. In our first online meeting, the students decided they’d like to tackle the problem that FuseCal was solving: extraction of well-structured calendar information from weakly-structured web pages.
From a computer science perspective, there’s a fairly obvious path. Start with specific examples that can be scraped, then work toward a more general solution. So the first two examples are going to be MySpace and LibraryThing. The recipes[1, 2] I’d concocted for FuseCal-written iCalendar feeds were especially valuable because they could be used by almost any curator for almost any location.
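To make the "specific example first" step concrete, here is a minimal Python sketch of a scraper for one made-up page shape, followed by a function that renders the results as a bare-bones iCalendar document. The HTML fragment, the regular expression, the UID scheme, and the PRODID string are all assumptions for illustration; neither MySpace nor LibraryThing pages are claimed to look like this.

```python
import re
from datetime import datetime

# A made-up fragment of weakly structured HTML; real pages will differ.
html = """
<div class="event">Sep 19, 2009 - Open mic night</div>
<div class="event">Oct 3, 2009 - Author reading</div>
"""

EVENT_RE = re.compile(r'<div class="event">(\w{3} \d{1,2}, \d{4}) - (.+?)</div>')

def scrape(html):
    """Pull (date, title) pairs out of the fragment."""
    events = []
    for date_text, title in EVENT_RE.findall(html):
        # Month abbreviations are parsed with the default (English) locale.
        day = datetime.strptime(date_text, "%b %d, %Y").date()
        events.append((day, title))
    return events

def to_icalendar(events, stamp="20090914T000000Z"):
    """Render all-day events as a minimal iCalendar document.

    Titles are written as-is; escaping of commas and semicolons is omitted.
    """
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//scraper sketch//EN",
    ]
    for n, (day, title) in enumerate(events):
        lines += [
            "BEGIN:VEVENT",
            "UID:%d-%s@example.invalid" % (n, day.strftime("%Y%m%d")),
            "DTSTAMP:" + stamp,
            "DTSTART;VALUE=DATE:" + day.strftime("%Y%m%d"),
            "SUMMARY:" + title,
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"

feed = to_icalendar(scrape(html))
print(feed)
```

Everything above the to_icalendar function is tied to one page layout, which is why generalizing a scraper is the hard part of the problem.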
But as I mentioned to the students, there’s another way to approach these two cases. And I was reminded of it again when Michael Foord pointed to this fascinating post prompted by the open source release of FriendFeed’s homegrown web server, Tornado. The author of the post, Glyph Lefkowitz, is the founder of Twisted, a Python-based network programming framework that includes the sort of asynchronous event-driven capabilities that FriendFeed recreated for Tornado. Glyph writes:
If you’re about to undergo a re-write of a major project because it didn’t meet some requirements that you had, please tell the project that you are rewriting what you are doing. In the best case scenario, someone involved with that project will say, “Oh, you’ve misunderstood the documentation, actually it does do that”. In the worst case, you go ahead with your rewrite anyway, but there is some hope that you might be able to cooperate in the future, as the project gradually evolves to meet your requirements. Somewhere in the middle, you might be able to contribute a few small fixes rather than re-implementing the whole thing and maintaining it yourself.
Whether FriendFeed could have improved the parts of Twisted that it found lacking, while leveraging its synergistic aspects, is a question only specialists close to both projects can answer. But Glyph is making a more general point. If you don’t communicate your intentions, such questions can never even be asked.
Tying this back to the elmcity project, I mentioned to the students that the best scraper for MySpace and LibraryThing calendars is no scraper at all. If these services produced iCalendar feeds directly, there would be no need. That would be the ideal solution — a win for existing users of the services, and for the iCalendar ecosystem I’m trying to bootstrap.
I’ve previously asked contacts at MySpace and LibraryThing about this. But now, since we’re intending to scrape those services for calendar info, it can’t hurt to announce that intention and hope one or both services will provide feeds directly and obviate the need. That way the students can focus on different problems — and there are plenty to choose from.
So I’ll be sending the URL of this post to my contacts at those companies, and if any readers of this blog can help move things along, please do. We may end up with scrapers anyway. But maybe not. Maybe iCalendar feeds have already been provided, but aren’t documented. Maybe they were in the priority stack and this reminder will bump them up. It’s worth a shot. If the problem can be solved by communicating intentions rather than writing redundant code, that’s the ultimate hack. And it’s one that I hope more computer science students will learn to aspire to.
September 9, 2009
Kingsley Idehen’s vision of a web of linked data long predates the recognition I accorded him in 2003. He’s seen the big picture for a very long time, and has been driving toward it consistently. Over the years we’ve had conversations in which I’ve always wound up saying: “Yes, OK, but how will we get people to create this web of linked data that we want to navigate and query?”
On this week’s Innovators show he responds with what I find to be a plausible scenario. Every business, and increasingly every person, presents some kind of home page to the world. On those pages you will find, implied but not clearly stated, one or both of the following kinds of assertions:
1. Things I offer.
2. Things I seek.
A plumber, for example, may offer hydronic heating services, and may seek an assistant with certain qualifications. By encoding these kinds of assertions as subject-verb-object triples we could, in theory, build a semantic web that matches seekers and finders more efficiently than the current searchable web can. But that first step was always a doozy. Writing the assertions required an XML syntax which has never become a web mainstay.
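The triple idea itself doesn't depend on any particular syntax. As a minimal sketch in Python, with plain strings standing in for what a real vocabulary would express as URIs, offers and seeks can be stored as tuples and matched directly. The homeowner and apprentice entries are hypothetical additions so that the plumber's two assertions each have a counterpart.

```python
# Each assertion is a (subject, verb, object) triple. The names are
# illustrative strings; a real vocabulary would use URIs.
triples = [
    ("plumber", "offers", "hydronic heating services"),
    ("plumber", "seeks", "plumbing assistant"),
    ("homeowner", "seeks", "hydronic heating services"),
    ("apprentice", "offers", "plumbing assistant"),
]

def matches(triples):
    """Pair each seeker with everyone offering the thing sought."""
    offers = {}
    for subject, verb, obj in triples:
        if verb == "offers":
            offers.setdefault(obj, []).append(subject)
    pairs = []
    for subject, verb, obj in triples:
        if verb == "seeks":
            for provider in offers.get(obj, []):
                pairs.append((subject, obj, provider))
    return pairs

for seeker, thing, provider in matches(triples):
    print("%s seeks %s, offered by %s" % (seeker, thing, provider))
```

Matching here depends on both parties using exactly the same string for the thing offered or sought, which is the job a shared vocabulary is meant to do.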
There are other ways to write them, however. Using an approach called RDFa, you can embed them directly into human-readable web pages. This isn’t a new idea. A decade ago, in my book Practical Internet Groupware, I showed how CSS class attributes could do double duty within a web page, governing style while also conveying meaning. In 2003 I was still experimenting with the idea, which I then called microcontent. Nowadays the term is microformats.
Although we’ve heard plenty about this idea over the years, it has yet to bear fruit. I don’t know that it will, but the scenario Kingsley Idehen outlines strikes me as plausible because, as Dries Buytaert evocatively says, structured data is the new search engine optimization. Formerly of concern only to publishers, the rationale for search engine optimization is now becoming evident to everyone who writes an About page for their businesses or — what often comes to the same thing — for themselves.
The formula for an About page is well known: name, address, services offered, hours of operation, etc. Everyone writes this stuff once for the About page, and then again in countless variations for inclusion in various directories. Kingsley and I both hope that the time is now ripe for a web-friendly way to write this data into About pages once, for common use by human visitors, search crawlers, and syndicated directories.
His proposal relies on RDFa to encode factual assertions, and on an e-commerce ontology called GoodRelations which, as its creator Martin Hepp says, provides the vocabulary to say things like:
- a particular Web site describes an offer to sell cellphones of a certain make and model at a certain price,
- a pianohouse offers maintenance for pianos that weigh less than 150 kg,
- a car rental company leases out cars of a certain make and model from a particular set of branches across the country.
The GoodRelations wiki shows cookbook examples for Yahoo and Google. You’d have to be fairly technical to adapt these using cut-and-paste, but there’s also a form that, although currently still wired to emit the older RDF/XML kinds of assertions, will soon also emit RDFa that can be woven into existing About pages.
To navigate and query a web of linked data you need, obviously, mechanisms by which to do the navigation and the querying. That’s never been the problem. Technologists love to figure such things out. But we’ve spectacularly failed to help people create that web of linked data in the first place. I don’t know if the approach Kingsley Idehen sketches in this week’s podcast will succeed. But it feels right, and I love his tagline: “Be the master of your own index.”