A beautiful power tool to scrape, clean, and combine data

Labels like “data scientist” and “data journalist” connote an elite corps of professionals who can analyze data and use it to reason about the world. There are elite practitioners, of course, but since the advent of online data a quarter century ago I’ve hoped that every thinking citizen of the world (and of the web) could engage in similar analysis and reasoning.

That’s long been possible for those of us with the ability to wrangle APIs and transform data using SQL, Python, or another programming language. But even for us it hasn’t been easy. When I read news stories that relate to the generation of electric power in California, for example, questions occur to me that I know I could illuminate by finding, transforming, and charting sources of web data:

– How can I visualize the impact of shutting down California’s last nuclear plant?

– What’s the relationship between drought and hydro power?

All the ingredients are lying around in plain sight, but the effort required to combine them winds up being more trouble than it’s worth. And that’s for me, a skilled longtime scraper and transformer of web data. For you — even if you’re a scientist or a journalist! — that may not even be an option.

Enter Workbench, a web app with the tagline: “Scrape, clean, combine and analyze data without code.” I’ve worked with tools in the past that pointed the way toward that vision. DabbleDB in 2005 (now gone) and Freebase Gridworks in 2010 (still alive as OpenRefine) were effective ways to cut through data friction. Workbench carries those ideas forward delightfully. It enables me to fly through the boring and difficult stuff — the scraping, cleaning, and combining — in order to focus on what matters: the analysis.

Here’s the report that I made to address the questions I posed above. It’s based on a workflow that you can visit and explore as I describe it here. (If you create your own account you can clone and modify it.)

The workflow contains a set of tabs; each tab contains a sequence of steps; each step transforms a data set and displays output as a table or chart. When you load the page the first tab runs, and the result of its last step is displayed. In this case that’s the first chart shown in the report:

As in a Jupyter notebook you can run each step individually. Try clicking step 1. You’ll see a table of data from energy.ca.gov. Notice that step 1 is labeled Concatenate tabs. If you unfurl it you’ll see that it uses another tab, 2001-2020 scraped, which in turn concatenates two other tabs, 2001-2010 scraped and 2011-2020 scraped. Note that I’ve helpfully explained that in the optional comment field above step 1.

Each of the two source tabs scrapes a table from the page at energy.ca.gov. As I note in the report, it wasn’t necessary to scrape those tables since the data are available as an Excel file that can be downloaded, then uploaded to Workbench (as I’ve done in the tab named energy.ca.gov xlsx). I scraped them anyway because that web page presents a common challenge: the data appear in two separate HTML tables. That’s helpful to the reader but frustrating to an analyst who wants to use the data. Rapid and fluid combination of scraped tables is grease for cutting through data friction; Workbench supplies that grease.
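
If you prefer to see the moving parts spelled out in code, here is a rough pandas sketch of what that scrape-and-concatenate amounts to outside Workbench. The URL and table positions are placeholders for illustration, not the ones my workflow actually uses:

```python
import pandas as pd

# Placeholder URL; the real workflow points at the energy.ca.gov page
# that holds the two generation tables.
URL = "https://www.energy.ca.gov/electricity-data"  # hypothetical

# read_html returns one dataframe per HTML <table> found on the page.
tables = pd.read_html(URL)

# Suppose the first two tables cover 2001-2010 and 2011-2020; stacking them
# is the moral equivalent of the Concatenate tabs step.
combined = pd.concat(tables[:2], ignore_index=True)
```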

Now click step 2 in the first tab. It’s the last step, so you’re back to the opening display of the chart. Unfurl it and you’ll see the subset of columns included in the chart. I’ve removed some minor sources, like oil and waste heat, in order to focus on major ones. Several details are notable here. First: colors. The system provides a default palette but you can adjust it. Black wasn’t on the default palette but I chose it for coal.

Second: the grand total. The data set doesn’t include that column, and it’s not something I needed here. But in some situations I’d want it, so the system offers it as a choice. That’s an example of the attention to detail that pervades every aspect of Workbench.

Third: Vega. See the triple-dot button above the legend in the chart? Click it, then select Open in Vega Editor, and when you get there, click Run. Today I learned that Vega is:

a declarative format for creating, saving, and sharing visualization designs. With Vega, visualizations are described in JSON, and generate interactive views using either HTML5 Canvas or SVG.

Sweet! I think I’ll use it in my own work to simplify what I’ve recently (and painfully) learned how to do with D3.js. It’s also a nice example of how Workbench prioritizes openness, reusability, and reproducibility in every imaginable way.

I use the chart as the intro to my report, which is made with an elegant block editor in which you can combine tables and charts from any of your tabs with snippets of text written in markdown. There I begin to ask myself questions, adding tabs to marshal supporting evidence and pulling evidence from those tabs into the report.

My first question is about the contribution that the Diablo Canyon nuclear plant has been making to the overall mix. In the 2020 percentages all major sources tab I start in step 1 by reusing the tab 2001-2020 scraped. Step 2 filters the columns to just the same set of major sources shown in the chart. I could instead apply that step in 2001-2020 scraped and avoid the need to select columns for the chart. Since I’m not sure how that decision might affect downstream analysis, I keep all the columns. If I change my mind it’s easy to push the column selection upstream.
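
In pandas terms, that column filtering is a one-liner. The column names below are stand-ins for whatever the scraped table actually calls them:

```python
import pandas as pd

# Stand-in names for the major sources kept in the chart.
MAJOR_SOURCES = ["Year", "Coal", "Natural Gas", "Nuclear", "Large Hydro",
                 "Solar", "Wind", "Geothermal", "Biomass"]

def keep_major_sources(table: pd.DataFrame) -> pd.DataFrame:
    # Keep only the listed columns, tolerating any that happen to be absent.
    return table[[c for c in MAJOR_SOURCES if c in table.columns]]
```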

Workbench not only makes it possible to refactor a workflow, it practically begs you to do that. When things go awry, as they inevitably will, it’s no problem. You can undo and redo the steps in each tab! You won’t see that in the read-only view, but if you create your own account and duplicate my workflow there, give it a try. With stepwise undo/redo, exploratory analysis becomes a safe and stress-free activity.

At step 2 of 2020 percentages all major sources we have rows for all the years. In thinking about Diablo Canyon’s contribution I want to focus on a single reference year so in step 3 I apply a filter that selects just the 2020 row. Here’s the UX for that.

In situations like this, where you need to select one or more items from a list, Workbench does all the right things to minimize tedium: search if needed, start from all or none depending on which will be easier, then keep or remove selected items, again depending on which will be easier.

In step 4 I include an alternate way to select just the 2020 row. It’s a Select SQL step that says select * from input where Year = '2020'. That doesn’t change anything here; I could omit either step 3 or step 4 without affecting the outcome; I include step 4 just to show that SQL is available at any point to transform the output of a prior step.

Which is fantastic, but wait, there’s more. In step 5 I use a Python step to do the same thing in terms of a pandas dataframe. Again, this doesn’t affect the outcome; I’m just showing that Python is available at any point to transform the output of a prior step. Providing equivalent methods for novices and experts, in a common interface, is extraordinarily powerful.

I’m noticing now, by the way, that step 5 doesn’t work if you’re not logged in. So I’ll show it to you here:
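
Since I can’t embed the live step, here is a minimal pandas sketch of what it does. The function wrapper and variable names are mine for illustration, not Workbench’s own step interface; the step simply receives the prior output as a dataframe and returns a filtered one:

```python
import pandas as pd

def select_2020(table: pd.DataFrame) -> pd.DataFrame:
    # Keep only the 2020 row; compare as strings because the scraped
    # Year values arrive as text.
    return table[table["Year"].astype(str) == "2020"]
```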

Step 6 transposes the table so we can reason about the fuel types. In steps 3-5 they’re columns; in step 6 they become rows. This is a commonly needed maneuver. And while I might use the SQL in step 4 to do the row selection handled by the widget in step 3, I won’t easily accomplish the transposition that way. The Transpose step is one of the most powerful tools in the kit.
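
Here is a pandas approximation of that transposition, assuming the input still has a Year column and one column per fuel type; the helper name is mine:

```python
import pandas as pd

def transpose_fuel_types(table: pd.DataFrame) -> pd.DataFrame:
    # The values in Year become the new column headers; the old headers
    # (the fuel types) become rows.
    out = table.set_index("Year").T
    # The leftover "Year" label on the first column mirrors what step 6
    # produces, and is what step 7 then renames to Fuel Type.
    out.index.name = "Year"
    return out.reset_index()
```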

Notice at step 6 that the first column is named Year. That’s a common outcome of transposition, and here it necessitates step 7, in which I rename the column to Fuel Type. There are two ways to do that. You can click the + Add Step button, choose the Rename columns option, drag the new step into position 7, open it, and do the renaming there.

But look:

You can edit anything in a displayed table. When I change Year to Fuel Type that way, the same step 7 that you can create manually appears automatically.

It’s absolutely brilliant.

In step 8 I use the Calculate step to add a new column showing each row’s percentage of the column sum. In SQL I’d have to think about that a bit. Here, as is true for so many routine operations like this, Workbench offers the solution directly:
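
In pandas the same calculation is a couple of lines; the new column’s name below is mine, not necessarily what the Calculate step would produce:

```python
import pandas as pd

def add_percent_of_total(table: pd.DataFrame, value_col: str = "2020") -> pd.DataFrame:
    out = table.copy()
    # Each row's share of the column total, expressed as a percentage.
    out["Percent of total"] = out[value_col] / out[value_col].sum() * 100
    return out
```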

Finally in step 9 I sort the table. The report includes it, and there I consider the question of Diablo Canyon’s contribution. According to my analysis nuclear power was 9% of the major sources I’ve selected, contributing 16,280 GWh in 2020. According to another energy.ca.gov page that I cite in the report, Diablo Canyon is the only remaining nuke plant in the state, producing “about 18,000 GWh.” That’s not an exact match but it’s close enough to give me confidence that reasoning about the nuclear row in the table applies to Diablo Canyon specifically.

Next I want to compare nuclear power to just the subset of sources that are renewable. That happens in the 2020 percentages renewable tab, the output of which is also included in the report. Step 1 begins with the output of the 2020 percentages all major sources tab. In step 2 I clarify that the 2020 column is really 2020 GWh. In step 3 I remove the percent column in order to recalculate it. In step 4 I remove rows in order to focus on just nuclear and renewables. In step 5 I recalculate the percentages. And in step 6 I make the chart that also flows through to the report.
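
Sketched in pandas, with an illustrative (not authoritative) list of the fuel types kept, and column names carried over from the sketches above, the whole tab boils down to this:

```python
import pandas as pd

# Illustrative list; the workflow's actual selection may differ.
KEEP = ["Nuclear", "Large Hydro", "Small Hydro", "Solar", "Wind", "Geothermal", "Biomass"]

def nuclear_vs_renewables(table: pd.DataFrame) -> pd.DataFrame:
    out = table.rename(columns={"2020": "2020 GWh"})                # step 2: clarify the column name
    out = out.drop(columns=["Percent of total"], errors="ignore")   # step 3: drop the stale percentages
    out = out[out["Fuel Type"].isin(KEEP)].copy()                   # step 4: keep nuclear and renewables
    out["Percent of total"] = out["2020 GWh"] / out["2020 GWh"].sum() * 100  # step 5: recalculate
    return out
```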

Now, as I look at the chart, I notice that the line for large hydro is highly variable and appears to correlate with drought years. In order to explore that correlation I look for data on reservoir levels and arrive at https://cdec.water.ca.gov/. I’d love to find a table that aggregates levels for all reservoirs statewide since 2001, but that doesn’t seem to be on offer. So I decide to use Lake Mendocino as a proxy. In step 1 I scrape an HTML table with monthly levels for the lake since 2001. In step 2 I delete the first row, which has values for only some months. In step 3 I rename the first column to Year in order to match what’s in the table I want to join with. In step 4 I convert the types of the month columns from text to numeric to enable calculation. In step 5 I calculate the average into a new column, Avg. In step 6 I select just Year and Avg.
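
Here is the pandas shape of those six steps, with the month column names guessed at for illustration:

```python
import pandas as pd

# Guessed header names for the scraped month columns.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def average_lake_level(table: pd.DataFrame) -> pd.DataFrame:
    out = table.iloc[1:].copy()                           # step 2: drop the partial first row
    out = out.rename(columns={out.columns[0]: "Year"})    # step 3: first column becomes Year
    out[MONTHS] = out[MONTHS].apply(pd.to_numeric, errors="coerce")  # step 4: text -> numbers
    out["Avg"] = out[MONTHS].mean(axis=1)                 # step 5: average across the months
    return out[["Year", "Avg"]]                           # step 6: keep just Year and Avg
```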

When I first try the join in step 8 it fails for a common reason that Workbench helpfully explains:

In the other table Year looks like ‘2001’, but the text scraped from energy.ca.gov looks like ‘2,001’. That’s a common glitch that can bring an exercise like this to a screeching halt. There’s probably a Workbench way to do this, but in step 7 I use SQL to reformat the values in the Year column, removing the commas to enable the join. While there I also rename the Avg column to Lake Mendocino Avg Level. Now in step 8 I can do the join.
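
The same cleanup and join, sketched in pandas; to stay on the safe side it normalizes Year on both sides rather than assuming which table carries the commas:

```python
import pandas as pd

def clean_year(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Strip thousands separators so '2,001' and '2001' compare equal.
    out["Year"] = out["Year"].astype(str).str.replace(",", "", regex=False)
    return out

def join_generation_and_lake(generation: pd.DataFrame, lake: pd.DataFrame) -> pd.DataFrame:
    lake = clean_year(lake).rename(columns={"Avg": "Lake Mendocino Avg Level"})
    # Step 8: join the energy mix and the reservoir average on Year.
    return clean_year(generation).merge(lake, on="Year", how="inner")
```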

In step 9 I scale the values for Large Hydro into a new column, Scaled Large Hydro. Why? The chart I want to see will compare power generation in GWh (gigawatt hours) and lake levels in AF (acre-feet). These aren’t remotely compatible but I don’t care; I’m just looking for comparable trends. Doubling the value for Large Hydro gets close enough for the comparison chart in step 10, which also flows through to the report.
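
That scaling step amounts to a single multiplication; the factor of 2 is eyeballed purely so the two trend lines share a chart, nothing more:

```python
import pandas as pd

def add_scaled_hydro(table: pd.DataFrame, factor: float = 2.0) -> pd.DataFrame:
    out = table.copy()
    # Scale Large Hydro only so its trend sits near the lake-level line on one chart.
    out["Scaled Large Hydro"] = out["Large Hydro"] * factor
    return out
```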

All this adds up to an astonishingly broad, deep, and rich set of features. And I haven’t even talked about the Clean text step for tweaking whitespace, capitalization, and punctuation, or the Refine step for finding and merging clusters of similar values that refer to the same things. Workbench is also simply beautiful as you can see from the screen shots here, or by visiting my workflow. When I reviewed software products for BYTE and InfoWorld it was rare to encounter one that impressed me so thoroughly.

But wait, there’s more.

At the core of my workflow there’s a set of tabs; each is a sequence of steps; some of these produce tables and charts. Wrapped around the workflow there’s the report into which I cherry-pick tables and charts for the story I’m telling there.

There’s also another kind of wrapper: a lesson wrapped around the workflow. I could write a lesson that guides you through the steps I’ve described and checks that each yields the expected result. See Intro to data journalism for an example you can try. Again, it’s brilliantly well done.

So Workbench succeeds in three major ways. If you’re not a data-oriented professional, but you’re a thinking person who wants to use available web data to reason about the world, Workbench will help you power through the grunt work of scraping, cleaning, and combining that data so you can focus on analysis. If you aspire to become such a professional, and you don’t have a clue about how to do that grunt work, it will help you learn the ropes. And if you are one of the pros you’ll still find it incredibly useful.

Kudos to the Workbench team, and especially to core developers Jonathan Stray, Pierre Conti, and Adam Hooper, for making this spectacularly great software tool.


8 thoughts on “A beautiful power tool to scrape, clean, and combine data”

  1. Thanks for this great review, Jon. I wonder if Workbench, or something like it, can be hosted on a corporate Intranet, where it could scrape internal data that’s not accessible to a cloud-based system.

    1. Here’s the repo: https://github.com/CJWorkbench. So yes in theory; in practice it’s likely a heavy lift, although the ongoing march of containerization tends to lighten that load over time.

      There are other ways to scrape, of course, so one could use an alternate scraper and upload data to a private workflow for analysis, assuming you’d be ok with having that private data visible to Workbench.
