That first step can be a doozy

Andy Baio has done a tour de force analysis of Girl Talk’s Feed the Animals, a musical mashup made from hundreds of samples. From Wikipedia, he extracted data about the samples: the artist, title, and start time of each sample. Then, remarkably, he used Mechanical Turk to crowdsource the lookup of a missing fact: the release dates for each sample. The resulting visualizations are wonderfully evocative. To arrive at that point, though, you’d have to intuit something that was “easy” for Andy but wouldn’t be for many people. He writes:

Getting the sample list was easy. I took a snapshot of the album’s Wikipedia entry and extracted all the samples using Excel’s Text to Columns feature.

Actually, I’m not sure what kind of snapshot would be amenable to the Text-to-Columns treatment, which divides text into columns either at delimiters (e.g., commas) or according to fixed-width rules (e.g., columns 1-30, then 31-40).
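
To make that concrete, here is a rough Python sketch of the two modes; the sample lines and column boundaries are invented for illustration.

```python
# Delimited mode: split at a separator character.
line = "0:00,The Kills,Sour Cherry"
delimited = line.split(",")

# Fixed-width mode: slice at predetermined column boundaries.
line2 = "0:00      The Kills           Sour Cherry"
widths = [(0, 10), (10, 30), (30, None)]         # hypothetical boundaries
fixed = [line2[a:b].strip() for a, b in widths]

print(delimited)   # ['0:00', 'The Kills', 'Sour Cherry']
print(fixed)       # ['0:00', 'The Kills', 'Sour Cherry']
```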

Of course there are other approaches, and that got me to wondering. Of the million ways to extract tabular data from an HTML table on a web page, which method would be most obvious to a nontechnical person? Which would be most effective? And, would those two methods coincide?

I think it’s reasonable to suppose that an average person would reach for Excel in this situation. What then? How would you think about importing the Wikipedia page into Excel?

One approach might be to plug the URL into Excel’s File Open dialog, though I think that’s unlikely because the field is labeled “File name:”, which doesn’t suggest that it accepts a URL. In fact it does, but the web page arrives as unformatted text, not formatted HTML.

Following Andy’s notion of taking a snapshot, you might instead do a Save As Web Page in the browser and then try opening the saved file in Excel. This works rather well: the tables from Wikipedia arrive almost completely intact. However, a few cells need to be tweaked, and that’s problematic because the hyperlinks are active, and every time you touch a cell containing one you trigger a security alert.

If you went down this path, you might then try searching the web for ways to remove hyperlinks from an Excel sheet. You’d find advice ranging from creating a VBA macro to performing a brain-exploding manual procedure. And you’d probably bail.
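
One escape hatch, for what it’s worth: strip the links from the saved page before Excel ever sees them. Here’s a rough Python sketch of that idea. The filenames are hypothetical, and a regex is a blunt instrument for HTML, but for a one-off snapshot it would probably do.

```python
import re

# Read the page saved from the browser (hypothetical filename).
with open("Feed_the_Animals.html", encoding="utf-8") as f:
    html = f.read()

# Replace each <a ...>text</a> with just its text, dropping the link.
html = re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html,
              flags=re.DOTALL | re.IGNORECASE)

with open("Feed_the_Animals_nolinks.html", "w", encoding="utf-8") as f:
    f.write(html)
```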

Alternatively, though perhaps less intuitively, you could activate Excel’s Data->From Web feature, plug in the Wikipedia URL, select the entire Wikipedia page as the table to be imported, and click Import. This works wonderfully well. The HTML tables arrive almost perfectly formatted, as before, but sans hyperlinks, so you can easily make the final corrections.
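
For the programmatically inclined, the present-day equivalent of Data->From Web is a one-liner in, for example, Python’s pandas library, which pulls every HTML table off a page in one call. A minimal sketch, assuming pandas and lxml are installed; the URL is illustrative.

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/Feed_the_Animals"
tables = pd.read_html(url)    # one DataFrame per HTML <table> on the page

print(len(tables))                             # how many tables arrived
tables[0].to_csv("samples.csv", index=False)   # save the first as CSV
```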

It seems silly to dwell on these mundane details, but they are where the devil resides. If frictionless capture of data from Wikipedia were more common, Andy’s eye-opening use of Mechanical Turk would be well within the reach of many people less technical than Andy, or me, or doubtless many readers of this blog.

I’ve dwelled on this theme before, but it’s just striking how some very basic kinds of data friction keep getting in the way of ever-more-amazing possibilities for analysis and insight.


15 thoughts on “That first step can be a doozy”

  1. I wonder whether in this case you and I would fall foul of the expert’s trait of making it overly difficult?

    I was thinking about how my non-techie girlfriend would approach the problem. Having observed her cut-and-pasting things from web pages into Word when she wants to reorganise them, I think she’d try the same with Excel.

    Lo and behold, selecting all the relevant text from that Wikipedia page, switching to Excel, and hitting Ctrl-V gives me all of it in separate cells automatically.

    But it would be one of the last things that I’d try.

  2. Thanks for the writeup, Jon!

    @Adrian: Yeah, that’s basically what I did. Select the text from Wikipedia, paste into Excel. I had to tweak some of the cells after that, but that’s pretty much it.

  3. > I wonder whether in this case you and I
    > would fall foul of the expert’s trait of
    > making it overly difficult?

    Almost certainly. That’s why I floated this item.

    > But it would be one of the last things
    > that I’d try.

    Fascinating, isn’t it.

    When I try this from IE, the Ctrl-V result is as above: mostly good, some correction needed, and the correction complicated by active hyperlinks.

    When I try from FF, the Ctrl-V result is garbage. But Paste Special -> As Text gives the mostly-good result without the problematic hyperlinks.

  4. > Yeah, that’s basically what I did. Select
    > the text from Wikipedia, paste into Excel.

    Why/how did you apply Text to Columns then?

  5. Turns out I was remembering incorrectly. For me, when pasting from Firefox, Excel 2004 for Mac automatically applies the default Text-to-Columns transform.

  6. Hm. I do this to get race results from triathlons into spreadsheets. Typically I follow a variant of the non-techie approach: select the table in Firefox, use ‘view selection source’, and cut/paste the HTML into either Excel or Word, via Paste Special or Ctrl-V, whichever treats it better.

    Results that aren’t in an HTML table require more fudging in Word: search/replace the spaces with some other character, which can then be replaced with a tab so the whole thing can be made into a table and cut/pasted into Excel. Usually, for example, names are separated by a single space while the virtual columns of data are separated by two or more spaces, so a careful search/replace strategy can get it all lined up (a regex version is sketched below).

    I’m going to try the Data->From Web next, thanks ;-)
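
    In Python regex terms, that careful search/replace amounts to splitting each line on runs of two or more spaces. A minimal sketch, with an invented sample line:

    ```python
    import re

    # Columns are separated by 2+ spaces; names use single spaces.
    line = "12  Jane Q Smith   F35-39   2:14:07"
    fields = re.split(r" {2,}", line.strip())
    print(fields)   # ['12', 'Jane Q Smith', 'F35-39', '2:14:07']
    ```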

  7. For me, running it from FF3 into Excel 2007 with a simple Ctrl-V works but has the complication of active hyperlinks. As you say, paste special would be the way to go.

    I wonder if we’re just conditioned to look for the programmable, re-usable solution and so don’t spot that the so-simple-it-doesn’t-need-re-using solution is easier.

  8. > I wonder if we’re just conditioned to look
    > for the programmable, re-usable solution
    > and so don’t spot that the
    > so-simple-it-doesn’t-need-re-using
    > solution is easier.

    Probably so. It’s interesting to be reminded by these examples, though, what a grab-bag clipboard formats can be, and therefore how unpredictable the results are.

  9. No mention of IE’s “Export to Microsoft Excel” contextual menu when right-clicking on HTML tables?

  10. > No mention of IE’s “Export to Microsoft
    > Excel” contextual menu when right-clicking
    > on HTML tables?

    Good grief, you’re right! How did I not know, or discover, this?

    Well, it’s a sure bet I’m not alone in my ignorance. This leads to two questions:

    1. What uncommon search strategy would have located this nugget?

    2. How would the nugget need to have been packaged in order to yield to common search strategies?

    Anyway, thanks Alf!

  11. This is a good argument for technical folk involved in publishing information to think about what they do, too.

    We need (easier) ways to publish data as XML documents or fragments, or at the least to have a way to tag data on pages to make it easily consumable.

    Some ideas:

    – publish pages like that as XML with an XSLT stylesheet for presentation;

    – modify the HTML standard to allow some sort of tag that can contain only table elements within it.

    I tend to favor the XML method. If you want the page to be presented as HTML, it’s fairly trivial to transform XML to HTML and CSS using XSLT, and we’re not heading down the road of screwing up HTML and browser (in)compatibility.

  12. > I tend to favor the XML method

    In general I do too, but it hasn’t prevailed so far.

    In the case of Wikipedia, there’s a much more straightforward possibility. The wikitable markup is essentially CSV with pipes instead of commas:

    | 0:00 || [[The Kills]] || “[[Sour Cherry]]”

    Why not put a bug next to every table that extracts its contents as CSV or tab-delimited text?
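
    Here’s a minimal Python sketch of that extraction, assuming the simplified one-line row form shown above; real wikitable markup has more wrinkles.

    ```python
    import csv, io, re

    row = '| 0:00 || [[The Kills]] || "[[Sour Cherry]]"'

    cells = [c.strip() for c in row.lstrip("| ").split("||")]
    cells = [re.sub(r"\[\[|\]\]", "", c) for c in cells]   # drop link brackets

    out = io.StringIO()
    csv.writer(out).writerow(cells)
    print(out.getvalue().strip())   # 0:00,The Kills,"""Sour Cherry"""
    ```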
