Andy Baio has done a tour de force analysis of Girl Talk’s Feed the Animals, a musical mashup made from hundreds of samples. From Wikipedia, he extracted data about the samples: the artist, title, and start time of each sample. Then, remarkably, he used Mechanical Turk to crowdsource the lookup of a missing fact: the release dates for each sample. The resulting visualizations are wonderfully evocative. To arrive at that point, though, you’d have to intuit something that was “easy” for Andy but wouldn’t be for many people. He writes:

Getting the sample list was easy. I took a snapshot of the album’s Wikipedia entry and extracted all the samples using Excel’s Text to Columns feature.

Actually, I’m not sure what kind of snapshot would be amenable to the Text-to-Columns treatment, which divides text into columns based on delimiters (e.g., commas) or according to fixed-length rules (e.g., columns 1-30, then 31-40).

Of course there are other approaches, and that got me to wondering. Of the million ways to extract tabular data from an HTML table on a web page, which method would be most obvious to a nontechnical person? Which would be most effective? And, would those two methods coincide?

I think it’s reasonable to suppose that an average person would reach for Excel in this situation. What then? How would you think about importing the Wikipedia page into Excel?

One approach might be to plug the URL into Excel’s File Open dialog, though I think that’s unlikely because it’s labeled File name:. That doesn’t suggest that it accepts an URL. In fact it does, but the web page arrives as unformatted text not formatted HTML.

Following Andy’s notion of taking a snapshot, you might instead do a Save As Web Page in the browser, and then try opening the saved file in Excel. This works rather well. The tables from Wikipedia arrive almost completely intact. However a few cells need to be tweaked, and that’s problematic because the hyperlinks are active, and every time you touch a cell containing one you trigger a security alert.

If you went down this path, you might then try searching the web for ways to remove hyperlinks from an Excel sheet. You’d find advice ranging from creating a VBA macro to performing a brain-exploding manual procedure. And you’d probably bail.

Alternatively — but perhaps less intuitively? — you could activate Excel’s Data->From Web feature, plug in the Wikipedia URL, select the entire Wikipedia page as the table to be imported, and click Import. This works wonderfully well. The HTML tables arrive almost perfectly formatted as before, but sans hyperlinks so you can easily make the final corrections.

It seems silly to dwell on these mundane details but they are where the devil resides. If frictionless capture of data from Wikipedia were more widely evident, Andy’s eye-opening use of Mechanical Turk would be well within the reach of many people less technical than he, and I, and doubtless many readers of this blog.

I’ve dwelled here before, but it’s just striking how some very basic kinds of data friction keep getting in the way of ever-more-amazing possibilities for analysis and insight.