A Bloomberg terminal for Mastodon

As I mentioned last time, the Steampipe dashboard for Mastodon has evolved in unexpected ways. I imagined that the components — a plugin that maps Mastodon APIs to Postgres foreign tables, and a suite of views that query the APIs — would combine to enable a broad overview of activity in the Fediverse. That didn’t pan out for two reasons.

First, I learned that the Mastodon community didn’t appreciate the kind of surveillance required for such analysis. That was the original community, I should stress, and things have changed dramatically, but I want to respect the original ethos. Plenty of people will, nevertheless, crawl and index the Fediverse, but I don’t need to put my shoulder to that wheel. And if I did I’d be pushing Steampipe out of its sweet spot: realtime acquisition, querying, and visualization of API-sourced data.

Second, Mastodon’s API allows 300 requests every 5 minutes. You can use Steampipe in batch mode to defeat that limit, and you can store data permanently in its Postgres database, but that cuts against the grain of both Steampipe and Mastodon. All Mastodon clients are subject to the same API rate limit. If you use the web app, or one of the phone apps, you have likely never seen a message announcing that you’ve hit the limit and need to wait a few minutes. I never saw that message until I started querying the API with Steampipe while also using the web app.

So if Mastodon culture and tech resist deep data mining, and the system is optimized for clients that live within an API budget of 300 requests every 5 minutes, what kind of Mastodon client could Steampipe enable? It wouldn’t be a conventional client because Steampipe is a read-only system. The path forward would be some kind of reader, or browser, that augments the interactive apps.

The outcome, so far, is a suite of dashboards that display tabular views (along with some charts) of the home, local, and federated timelines, of my toot history and my favorites, of my follows and followers, of my notifications, of searches for terms, people, and hashtags, and of the timelines formed by the lists to which I’ve assigned people I follow. These are all HTML tables rendered by Steampipe’s dashboard server. The columns are all sortable, and the cells of the tables can contain only links or plain text.

Given that the toot content returned from the Mastodon API is HTML, the plain-text-only constraint felt, initially, like a blocker. No images? No links in toot content? What good is that?

Some constraints are worth embracing, though, and that may prove true here. The views created this way put a lot of information onto the screen. Here’s my default view in the stock client.

At a glance I can see three items on the home timeline, and if I want to scroll through 100 items I can do so only awkwardly, in small gulps.

Here’s my home timeline in the Steampipe dashboard. I can see a dozen items at a glance, and can easily scan 100 items in gulps of that size.

When I described this effect to Greg Wilson he gave me the title for this post: “That sounds like the Bloomberg terminal for Mastodon.” I’ve never used one, and I’m aware that its design is often derided as a UX disaster, but as I understand it the product is built to enable traders to scan fast-moving data feeds from many different sources. In that sense I do think it’s an interesting and useful comparison.

The underlying principle is one I’ve learned from Edward Tufte: present information at maximum density. Our brains are built to take in a lot of information at a glance, and if it’s organized well we can do that very effectively. It feels like that’s happening for me when I scan these densely-packed views of Mastodon activity.

To enhance the effect, I’ve begun to apply filters. In a Mastodon timeline, for example, a chatty person can dominate what you see at a glance. When we participate in social media we are always making bids for one another’s attention. As publishers of feeds it’s wise to consider how a flurry of items can overwhelm a reader’s experience. But it’s also useful to consider ways that feed readers can filter a chatty source. Steampipe’s SQL foundation affords an easy and natural way to do that. Here’s part of the query that drives the list view.

select distinct on (list, user_name, person, hour) -- only one per list/user/hour
  person,
  url,
  hour,
  toot
from
  data
order by
  list, user_name, person, hour desc -- distinct on requires these columns to lead the order by

It was easy to implement a rule that limits each person to at most one toot per hour. Next steps here will be to apply this rule to other views, show the number of collapsed toots, and enable such rules on a per-person basis.
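
Here’s a rough sketch of how that count might look, reusing the data CTE and columns from the query above (this isn’t the dashboard’s actual code): a window function tallies each person’s toots per hour before distinct on prunes them, so the surviving row can report how many were collapsed.

select distinct on (list, user_name, person, hour)
  person,
  url,
  hour,
  toot,
  count(*) over (partition by list, user_name, person, hour) - 1 as collapsed_toots
from
  data
order by
  list, user_name, person, hour desc

Which toot survives the pruning is arbitrary here; adding a timestamp column to the order by, if the view exposes one, would pin that down.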

There are always links into the Mastodon web app, and I follow them when I want to view images, boost someone, or reply to someone. The dashboards help me scan a lot of Mastodon activity quickly, and decide which items I want to interact with. Your 500-character toot is all you’ve got to grab my attention, and I’ll only see it as an unformatted chunk of plain text. That’s a pretty severe constraint, and not everyone will want to embrace it, but it’s working pretty well for me so far.

I expect that our dashboard system will support formatted text and images in cells of HTML tables. When it does I’d like to make it an option you can turn on or off in Mastodon dashboards. What should the default be? I suspect I’ll want plain text and no images, especially if image captions can appear along with the text of toots. Some of the original Mastodon cultural norms aren’t surviving the onslaught of new people, but writing descriptions of images is one that’s held up so far, and it’s a wonderful thing. So write a short thoughtful post, write a caption for your image if you include one, and if you capture my attention I’ll click through to view and interact.



Browsing the Fediverse

A month ago, when the Great Discontinuity happened, I started working on a Steampipe plugin to enable SQL queries against the Mastodon API, along with a companion Steampipe “mod” (suite of dashboards) to display and chart the results of those queries.

I expect these dashboards will soon be available in Steampipe Cloud, where it will take just a few seconds to pop in your Mastodon access token (from, e.g., https://mastodon.social/settings/applications/new) and begin using the dashboards.

Meanwhile, if you’re so inclined, you can find the plugin here and the dashboards here. If you’re reasonably technical you can pretty quickly and easily install Steampipe, clone these repos, build the plugin, and start using the dashboards.

Why would you want to? My own motivation, originally, was to do Mastodon analytics. I thought Steampipe’s SQLification of the API would be a handy way to discern and monitor activity trends during a period of extraordinary flux. And that’s proven to be true, to a limited extent. Here’s a snapshot of the dashboard that uses the instance activity API.

I’m watching this chart with great interest. Where does it go from here? I’m not going to hazard a guess. Everything’s up in the air right now, and anything could happen.

But as I added tables to the plugin to encapsulate more of the Mastodon API, and added dashboards to visualize those tables, my focus shifted. I began to see the suite of dashboards as a Mastodon reader/browser that complements the web and phone clients, and that’s how I mainly use them now.

I think the key benefit is one of Edward Tufte’s core principles: information density. Each of these dashboards shows more activity than you can see at a glance in the web or phone interfaces. I find this very helpful for searching and browsing. When I see items of interest that I want to interact with, I click through to the web app in order to boost, reply, or favorite.

Will this way of browsing Mastodon appeal to you? To get a feel for what it’s like, here are snapshots of some of the dashboards I’ve built so far.

dashboard.Favorites

dashboard.Following

dashboard.Home

dashboard.List

dashboard.Me

dashboard.Notification

dashboard.PeopleSearch

dashboard.StatusSearch

dashboard.TagSearch

For me, at least, this approach has become an effective way to browse the fediverse, find interesting people, read what they boost, and keep track of my own activity.

If you are dev-minded, by the way, please note that these dashboards are just one way to skin the results of queries against the plugin. Any SQL client can connect to Steampipe’s Postgres endpoint. You could use dashboarding tools like Metabase or Grafana, or you could embed Steampipe as a component in an app.



When your database is an HTTP client

Here are three things I once hoped — but no longer expect — to outlive:

1. PDF files

2. Passwords

3. The occasional need to scrape data from web pages

PDF files and passwords are topics for another day, but web scraping is timely. Today I was asked to corral data from the Steampipe blog, and the path of least resistance was (of course!) to extract it from the site.

I was all warmed up for the exercise because we’d just published a post dramatically entitled “Why build an HTTP client into a database? So you can ingest web data directly!” In that post I show three solutions enabled by the Net plugin’s net_http_request table.

Since the dawn of the web, scraping has worked this way: Use a script to fetch the data, then save it for retrieval and analysis. You might use the scripting language to query the data, or the query language of a database.

A couple of years ago I found a way to unify those ingredients: Run the script inside the database. You can do a lot with Postgres’ built-in procedural language, and even more if you activate Python inside Postgres. I went deeply into both and explained why in an earlier episode.
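
For concreteness, here’s a minimal sketch of what “Python inside Postgres” looks like. It assumes the plpython3u extension is available to install, and fetch_page is just an illustrative name, not something from that earlier episode.

-- hypothetical example: a PL/Python function that fetches a page from inside Postgres
create extension if not exists plpython3u;

create or replace function fetch_page(url text) returns text as $$
import urllib.request
return urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
$$ language plpython3u;

-- usage: select fetch_page('https://steampipe.io')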

PL/Python was great for advanced uses, and I used it for a bit of web scraping too. It all worked fine, and I never thought to ask this question: “What if the database is the HTTP client, and SQL the common way to reason over data coming through that pipe?”

The examples in the post show what that’s like. In its simplest form you write a query like this.

select
  response_status_code,
  jsonb_pretty(response_headers) as headers
from
  net_http_request
where
  url = 'https://steampipe.io'

The result is like this.

+----------------------+-------------------------------------------------------+
| response_status_code | headers                                               |
+----------------------+-------------------------------------------------------+
| 200                  | {                                                     |
|                      |     "Age": [                                          |
|                      |         "45557"                                       |
|                      |     ],                                                |
|                      |     "Date": [                                         |
|                      |         "Fri, 09 Dec 2022 06:46:40 GMT"               |
|                      |     ],                                                |
|                      |     "Etag": [                                         |
|                      |         "W/\"614a142998557b388e053bfa4408cf70\""      |

The response_status_code is a regular Postgres column; the headers column is a JSONB column that you can index into (e.g. headers ->> 'etag'). If you also select the response_body column of the net_http_request table you’ll get another regular Postgres column containing the text of the web page. If it’s HTML text, you can use regular expressions to match patterns in it. If it’s JSON text you can use Postgres’ JSON functions to query and transform it.
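
To make that concrete, here’s a small sketch (not from the post) that indexes into the headers and pulls links out of the body with a regular expression. As the output above shows, each header value arrives as a JSON array, hence the ->> 0.

select
  response_headers -> 'Etag' ->> 0 as etag,
  (regexp_matches(response_body, 'href="([^"]+)"', 'g'))[1] as link  -- one row per link found
from
  net_http_request
where
  url = 'https://steampipe.io'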

You can join fields nested in the JSON with other Postgres columns. And those other columns can belong to tables populated in any of the ways Steampipe plugins populate tables: from JSON or GraphQL API endpoints, from CSV or Terraform or Yaml files, from anywhere really. As a developer writing Steampipe queries (and flowing results into dashboards) you see all of these sources as tables, you query them individually in all the ways Postgres can, and you join across diverse sources in a common way.

Of course web pages are structured in ways that regular expressions can’t easily grok. It’s easy to match links, but parsing HTML tag names and attributes is a job for a real parser. I’d made a start on an HTML plugin for Steampipe. There were already two tables: one to extract links from a web page, one to transform HTML tables to CSV format. So today, when tasked with tabulating blog metadata, I added a third table to enable these queries.

-- find the title

select
  page,
  tag_name,
  tag_content
from
  html_tag
where
  page = 'https://steampipe.io/blog/selective-select'
  and tag_name = 'title'

-- list the meta tags

select
  page,
  tag_name,
  tag_attrs
from
  html_tag
where
  page = 'https://steampipe.io/blog/selective-select'
  and tag_name = 'meta'

That’s a powerful way to reason over HTML data! It was easy for me to extend the HTML plugin in this way, and I assure you that I’m no 10x programmer. The Steampipe plugin SDK and the wonderful goquery package are doing all the heavy lifting. I just had to stitch the components together, and if you’re any kind of programmer, with or without Go experience, you could pretty readily do the same.

Mastodon, Steampipe, and RSS

I was determined to write my Mastodon #introduction today. To get started I used the tag search in the dashboard I’m building.

The idea was to look at a bunch of other #introduction posts to get a feel for how mine should go. When you search specifically for hashtags, the Mastodon search API returns this information.

"hashtags": [
    {
      "name": "introduction",
      "url": "https://mastodon.social/tags/introduction",
      "history": [
        {
          "day": "1574553600",
          "uses": "10",
          "accounts": "9"
        },
        // ...
      ]
    },

A first version of the dashboard, having only this data to work with, just listed the names of tags matching the search term along with corresponding URLs. Here was the initial query.

select 
  name,
  url
from 
  mastodon_search_hashtag 
where 
  query = 'introduction'

That produced a list of links, like https://mastodon.social/tags/introduction, to home pages for variants of the tag. These are useful links! Each goes to a page where you can see who is posting to the tag.

To make this view slightly more useful, I tapped the third element of the API response, history, in a revised query.

with data as (
  select 
    name,
    url,
    ( jsonb_array_elements(history) ->> 'uses' )::int as uses
  from 
    mastodon_search_hashtag 
  where 
    query = 'introduction'
)
select
  name,
  url,
  sum(uses)
from
  data
group by
  name, url
order by
  sum desc

These results help me decide which variant to use.

+-------------------+---------------------------------------------------+------+
| name              | url                                               | sum  |
+-------------------+---------------------------------------------------+------+
| introduction      | https://mastodon.social/tags/introduction         | 1816 |
| introductions     | https://mastodon.social/tags/introductions        | 218  |
| introductionpost  | https://mastodon.social/tags/introductionpost     | 19   |
| introductionfr    | https://mastodon.social/tags/introductionfr       | 6    |

But I still need to visit each link’s page to explore how it’s being used. It would be nice to surface more context in the dashboard, and I found a nifty way to do it, but first let’s dwell on the revised query for a minute. Postgres’ JSON features are powerful and it’s often a challenge (at least for me) to visualize how they work.

The Postgres jsonb_array_elements() function is what’s called a set-returning function. Here it unpacks Postgres’ JSON representation of the list of history structures returned from the Mastodon API. In its simplest form, the function call jsonb_array_elements(history) produces a temporary table with per-tag, per-day data.

select
  name,
  jsonb_array_elements(history) as history
from
  mastodon_search_hashtag 
where 
  query = 'introduction'

+--------------------------------+----------------------------------------------------+
| name                           | history                                            |
+--------------------------------+----------------------------------------------------+
| introduction                   | {"accounts":"16","day":"1670371200","uses":"19"}   |
| introduction                   | {"accounts":"250","day":"1670284800","uses":"269"} |
| introduction                   | {"accounts":"259","day":"1670198400","uses":"274"} |
| introduction                   | {"accounts":"253","day":"1670112000","uses":"270"} |
| introduction                   | {"accounts":"245","day":"1670025600","uses":"269"} |
| introduction                   | {"accounts":"345","day":"1669939200","uses":"383"} |
| introduction                   | {"accounts":"307","day":"1669852800","uses":"339"} |
| introductionsfr                | {"accounts":"0","day":"1670371200","uses":"0"}     |
| introductionsfr                | {"accounts":"0","day":"1670284800","uses":"0"}     |
| introductionsfr                | {"accounts":"0","day":"1670198400","uses":"0"}     |
| introductionsfr                | {"accounts":"0","day":"1670112000","uses":"0"}     |
| introductionsfr                | {"accounts":"0","day":"1670025600","uses":"0"}     |

history is a JSONB column that holds an object with three fields. The revised query uses Postgres’ JSON indexing operator ->> to reach into that object and hoist the number of daily uses into its own column, so it can be the target of a SQL SUM function.
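
To see that operator in isolation, here’s a one-liner you can run against one of the history objects shown above.

-- extract 'uses' as text with ->>, then cast it to an integer
select ('{"accounts":"16","day":"1670371200","uses":"19"}'::jsonb ->> 'uses')::int as uses  -- returns 19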

OK, ready for the nifty solution? Recall that https://mastodon.social/tags/introduction is the home page for that variant of the tag. There you can see introduction posts from people using the tag. Those posts typically include other tags. In the dashboard shown above you can see that Kathy Nickels is using these: #Music #Art #Equestrian #Nature #Animals. The tags appear in her introduction post.

I didn’t immediately see how to capture them for use in the dashboard. Then I remembered that certain classes of Mastodon page have corresponding RSS feeds, and wondered if the tag pages are members of one such class. Sure enough they are, and https://mastodon.social/tags/introduction.rss is a thing. That link, formed by tacking .rss onto the base URL, provides the extra context I was looking for. Here’s the final version of the query.

with data as (
  select
    name,
    url,
    ( jsonb_array_elements(history) ->> 'uses' )::int as uses
  from
    mastodon_search_hashtag
  where
    query = 'introduction'
),
uses as (
  select
    name,
    url || '.rss' as feed_link,
    sum(uses) as recent_uses
  from
    data
  group by
    name, url
)
select
  u.name,
  r.guid as link,
  to_char(r.published, 'YYYY-MM-DD') as published,
  r.categories
from
  uses u
join
  rss_item r
on
  r.feed_link = u.feed_link
where
  recent_uses > 1
order by
  recent_uses desc, published desc

The new ingredients, courtesy of the RSS feed, are: guid, which links to an individual introduction like Kathy’s; published, which is the day the introduction appeared; and categories, which lists the tags used in the introduction post. Sweet! Now I can scan the dashboard to get a sense of which introductions I want to check out.

The first three queries use the Steampipe plugin for Mastodon, and in particular its mastodon_search_hashtag table which encapsulates the Mastodon API for searching tags. The final version joins that table with the rss_item table provided by the RSS plugin, using the common base URL as the basis of the join.

This delights me in so many ways. When the blogosphere first emerged in the early 2000s, some of us discovered that the RSS protocol was capable of far more than just delivering feeds to RSS readers. The other new hot protocol in that era was XML web services. As an InfoWorld analyst I was supposed to be cheering the latter as an enterprise-grade technology, but I couldn’t help noticing that RSS kept turning out to be a great way to move data between cooperating systems. That’s always been true, and I love how this example reminds us that it’s still true.

I’m equally delighted to show how Steampipe enables this modern exercise in RSS-powered integration. Steampipe was, originally, an engine for mapping results from JSON API endpoints to SQL tables. Over time, though, it has broadened its view of what constitutes an API. You can use Steampipe to query CSV files, or Terraform files, or — as we see here — RSS feeds. Data comes in all kinds of flavors. Steampipe abstracts those differences and brings all the flavors into a common space where you can reason over them using SQL.
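
For instance, here’s the kind of standalone query the RSS plugin makes possible, using the feed URL from this example and the same columns the join above relies on.

select
  guid,
  to_char(published, 'YYYY-MM-DD') as published,
  categories
from
  rss_item
where
  feed_link = 'https://mastodon.social/tags/introduction.rss'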

And finally, it’s just wonderful to be at the intersection of Mastodon, Steampipe, and RSS in this remarkable moment. I’ll readily admit that nostalgia is a factor. But RSS did bust things wide open twenty years ago, Mastodon’s doing that now, and I love that RSS can help it happen again.

Now I need to write that #introduction!



Autonomy, packet size, friction, fanout, and velocity

Nostalgia is a dangerous drug and it’s always risky to wallow in it. So those of us who fondly remember the early blogosphere, and now want to draw parallels to the fediverse, should do so carefully. But we do want to learn from history.

Here’s one way to compare five generations of social software along the five dimensions named in the title of this post.

              Autonomy    Packet Size   Friction    Fanout    Velocity
 
Usenet        medium      high          medium      medium    low

Blogosphere   high        high          high        low       low

Facebook      low         high          low         medium    high

Twitter       low         low           low         high      high

Fediverse     high        medium        high        medium    medium

These are squishy categories, but I think they surface key distinctions. Many of us who were active in the blogosphere of the early 2000s enjoyed a high level of autonomy. Our RSS readers were our Internet dashboards. We loaded them with a curated mix of official and individual voices. There were no limits on the size of packets exchanged in this network. You could write one short paragraph or a 10,000-word essay. Networking wasn’t frictionless because blog posts did mostly feel like essays, and because comments didn’t yet exist. To comment on my blog post you’d write your own blog post linking to it.

That friction limited the degree to which a post would fan out through the network, and the velocity of its propagation. The architecture of high friction, low fanout, and low velocity was a winning combination for a while. In that environment I felt connected but not over-connected, informed but not overloaded.

Twitter flipped things around completely. It wasn’t just the loss of autonomy as ads and algos took over. With packets capped at 140 characters, and tweets potentially seen immediately by everyone, friction went nearly to zero. The architecture of low friction created an addictive experience and enabled powerful effects. But it wasn’t conducive to healthy discourse.

The fediverse can, perhaps, strike a balance. Humans didn’t evolve to thrive in frictionless social networks with high fanout and velocity, and arguably we shouldn’t. We did evolve in networks governed by Dunbar’s number, and our online networks should respect that limit. We need less friction within communities of knowledge and practice, more friction between them. We want messages to fan out pervasively and rapidly within communities, but less so between them.

We’re at an extraordinary inflection point right now. Will the fediverse enable us to strike the right balance? I think it has the right architectural ingredients to land where I’ve (speculatively) placed it in that table. High autonomy. As little friction as necessary, but not too little. As much fanout and velocity as necessary, but not too much. Nobody knows how things will turn out, predictions are futile, behavior is emergent, but I am on the edge of my seat watching this all unfold.



Debuggable explanations

I’ve been reviewing Greg Wilson’s current book project, Software Design in Python. Like the earlier JavaScript-based Software Design by Example it’s a guided tour of tools, techniques, and components common to many software systems: testing frameworks, parsers, virtual machines, debuggers. Each chapter of each of these books shows how to build the simplest possible working version of one of these things.

Though I’ve used this stuff for most of my life, I’ve never studied it formally. How does an interpreter work? The chapter on interpreters explains the basic mechanism using a mixture of prose and code. When I read the chapter I can sort of understand what’s happening, but I’m not great at mental simulation of running code. I need to run the code in a debugger, set breakpoints, step through execution, and watch variables change. Then it sinks in.

The GitHub repo for the book includes all the text and all the code. I’d like to put them side-by-side, so that as I read the narrative I can run and debug the code that’s being described. Here’s how I’m doing that in VSCode.

This is pretty good! But it wasn’t dead simple to get there. In a clone of the repo, the steps included:

  • Find the HTML file for chapter 3.
  • Install a VSCode extension to preview HTML.
  • Find the code for chapter 3.
  • Adjust the code to not require command-line arguments.
  • Arrange the text and code in side-by-side panes.

Though it’s all doable, the activation threshold is high enough to thwart my best intention of repeating the steps for every chapter.

Whether in VSCode or another IDE or some other kind of app, what would be the best way to lower that activation threshold?

Curating the Studs Terkel archive

I can read much faster than I can listen, so I rarely listen to podcasts when I’m at home with screens on which to read. Instead I listen on long hikes when I want to shift gears, slow down, and absorb spoken words. Invariably some of those words strike me as particularly interesting, and I want to capture them. Yesterday, what stuck was these words from a 1975 Studs Terkel interview with Muhammad Ali:

Everybody’s watching me. I was rich. The world saw me, I had lawyers to fight it, I was getting credit for being a strong man. So that didn’t really mean nothing. What about, I admire the man that had to go to jail named Joe Brown or Sam Jones, who don’t nobody know who’s in the cell, you understand? Doing his time, he got no lawyers’ fees to pay. And when he get out, he won’t be praised for taking a stand. So he’s really stronger than me. I had the world watching me. I ain’t so great. I didn’t do nothing that’s so great. What about the little man don’t nobody know? He’s really the one.

I heard these words on an episode of Radio OpenSource about the Studs Terkel Radio Archive, an extraordinary compilation of (so far) about a third of his 5600 hours of interviews with nearly every notable person during the latter half of the twentieth century.

If you weren’t aware of him, the Radio OpenSource episode, entitled Studs Terkel’s Feeling Tone, is the perfect introduction. And it’s delightful to hear one great interviewer, Chris Lydon, discuss with his panel of experts the interviewing style of perhaps the greatest interviewer ever.

Because I’d heard Muhammad Ali’s words on Radio OpenSource, I could have captured them in the usual way. I always remember where I was when I heard a segment of interest. If that was 2/3 of the way through my hike, I’ll find the audio at the 2/3 mark on the timeline. I made a tool to help me capture and share a link to that segment, but it’s a clumsy solution.

What you’d really rather do is search for the words in a transcript, select surrounding context, use that selection to define an audio segment, and share a link to both text and audio. That’s exactly what I did to produce this powerful link, courtesy of WFMT’s brilliant remixer, which captures both the written words I quoted above and the spoken words synced to them.

That’s a dream come true. Thank you! It’s so wonderful that I hesitate to ask, but … WFMT, can you please make the archive downloadable? I would listen to a lot of those interviews if I could put them in my pocket and take them with me on hikes. Then I could use your remixer to help curate the archive.

GitHub for English teachers

I’ve long imagined a tool that would enable a teacher to help students learn how to write and edit. In Thoughts in motion I explored what might be possible in Federated Wiki, a writing tool that keeps version history for each paragraph. I thought it could be extended to enable the kind of didactic editing I have in mind, but never found a way forward.

In How to write a press release I tried bending Google Docs to this purpose. To narrate the process of editing a press release, I dropped a sample release into a GDoc and captured a series of edits as named versions. Then I captured the versions as screenshots and combined them with narration, so the reader of the blog post can see each edit as a color-coded diff with an explanation.

The key enabler is GDoc’s File -> Version history -> Name current version, along with File -> See version history‘s click-driven navigation of the set of diffs. It’s easy to capture a sequence of editing steps that way.

But it’s much harder to present those steps as I do in the post. That required me to make, name, and organize a set of images, then link them to chunks of narration. It’s tedious work. And if you want to build something like this for students, that’s work you shouldn’t be doing. You just want to do the edits, narrate them, and share the result.

This week I tried a different approach when editing a document written by a colleague. Again the goal was not only to produce an edited version, but also to narrate the edits in a didactic way. In this case I tried bending GitHub to my purpose. I put the original doc in a repository, made step-by-step edits in a branch, and created a pull request. We were then able to review the pull request, step through the changes, and review each as a color-coded diff with an explanation. No screenshots had to be made, named, organized, or linked to the narration. I could focus all my attention on doing and narrating the edits. Perfect!

Well, perfect for someone like me who uses GitHub every day. If that’s not you, could this technique possibly work?

In GitHub for the rest of us I argued that GitHub’s superpowers could serve everyone, not just programmers. In retrospect I felt that I’d overstated the case. GitHub was, and remains, a tool that’s deeply optimized for programmers who create and review versioned source code. Other uses are possible, but awkward.

As an experiment, though, let’s explore how awkward it would be to recreate my Google Docs example in GitHub. I will assume that you aren’t a programmer, have never used GitHub, and don’t know (or want to know) anything about branches or commits or pull requests. But you would like to be able to create a presentation that walks a learner through a sequence of edits, with step-by-step narration and color-coded diffs. At the end of this tutorial you’ll know how to do that. The method isn’t as straightforward as I wish it were. But I’ll describe it carefully, so you can try it for yourself and decide whether it’s practical.

Here’s the final result of the technique I’ll describe.

If you want to replicate that, and don’t already have a GitHub account, create one now and log in.

Ready to go? OK, let’s get started.

Step 1: Create a repository

Click the + button in the top right corner, then click New repository.

Here’s the next screen. All you must do here is name the repository, e.g. editing-step-by-step, then click Create repository. I’ve ticked the Add a README file box, and chosen the Apache 2.0 license, but you could leave the defaults — box unchecked, license None — as neither matters for our purpose here.

Step 2: Create a new file

On your GitHub home page, click the Repositories tab. Your new repo shows up first. Click its link to open it, then click the Add file dropdown and choose Create new file. Here’s where you land.

Step 3: Add the original text, create a new branch, commit the change, and create a pull request

What happens on the next screen is bewildering, but I will spare you the details because I’m assuming you don’t want to know about branches or commits or pull requests, you just want to build the kind of presentation I’ve promised you can. So, just follow this recipe.

  • Name the file (e.g. sample-press-release.txt)
  • Copy/paste the text of the document into the edit box
  • Select Create a new branch for this commit and start a pull request
  • Name the branch (e.g. edits)
  • Click Propose new file

On the next screen, title the pull request (e.g. edit the press release) and click Create pull request.

Step 4: Visit the new branch and begin editing

On the home page of your repo, use the main dropdown to open the list of branches. There are now two: main and edits. Select edits.

Here’s the next screen.

Click the name of the document you created (e.g. sample-press-release.txt) to open it.

Click the pencil icon’s dropdown, and select Edit this file.

Make and preview your first edit. Here, that’s my initial rewrite of the headline. I’ve written a title for the commit (Step 1: revise headline), and I’ve added a detailed explanation in the box below the title. You can see the color-coded diff above, and the rationale for the change below.

Click Commit changes, and you’re back in the editor ready to make the next change.

Step 5: Visit the pull request to review the change

On your repo’s home page (e.g. https://github.com/judell/editing-step-by-step), click the Pull requests button. You’ll land here.

Click the name of the pull request (e.g. edit the press release) to open it. In the rightmost column you’ll see links with alphanumeric labels.

Click the first one of those to land here.

This is the first commit, the one that added the original text. Now click Next to review the first change.

This, finally, is the effect we want to create: a granular edit, with an explanation and a color-coded diff, encapsulated in a link that you can give to a learner who can then click Next to step through a series of narrated edits.

Lather, rinse, repeat

To continue building the presentation, repeat Step 4 (above) once per edit. I’m doing that now.

… time passes …

OK, done. Here’s the final edited copy. To step through the edits, start here and use the Next button to advance step-by-step.

If this were a software project you’d merge the edits branch into the main branch and close the pull request. But you don’t need to worry about any of that. The edits branch, with its open pull request, is the final product, and the link to the first commit in the pull request is how you make it available to a learner who wants to review the presentation.

GitHub enables what I’ve shown here by wrapping the byzantine complexity of the underlying tool, Git, in a much friendlier interface. But what’s friendly to a programmer is still pretty overwhelming for an English teacher. I still envision another layer of packaging that would make this technique simpler for teachers and learners focused on the craft of writing and editing. Meanwhile, though, it’s possible to use GitHub to achieve a compelling result. Is it practical? That’s not for me to say, I’m way past being able to see this stuff through the eyes of a beginner. But if that’s you, and you’re motivated to give this a try, I would love to know whether you’re able to follow this recipe, and if so whether you think it could help you to help learners become better writers and editors.

How to rewrite a press release: a step-by-step guide

As a teaching fellow in grad school I helped undergrads improve their expository writing. Some were engineers, and I invited them to think about writing and editing prose in the same ways they thought about writing and editing code. Similar rules apply, with different names. Strunk and White say “omit needless words”; coders say “DRY” (don’t repeat yourself). Writers edit; coders refactor. I encouraged students to think about writing and editing prose not as a creative act (though it is one, as is coding) but rather as a method governed by rules that are straightforward to learn and mechanical to apply.

This week I applied those rules to an internal document that announces new software features. It’s been a long time since I’ve explained the method, and thanks to a prompt from Greg Wilson I’ll give it a try using another tech announcement I picked at random. Here is the original version.

I captured the transformations in a series of steps, and named each step in the version history of a Google Doc.

Step 1

The rewritten headline applies the following rules.

Lead with key benefits. The release features two: support for diplex-matched antennas and faster workflow. The original headline mentions only the first; I added the second.

Clarify modifiers. A phrase like “diplex matched antennas” is ambiguous. Does “matched” modify “diplex” or “antennas”? The domain is unfamiliar to me, but I suspected it should be “diplex-matched” and a web search confirmed that hunch.

Omit needless words. The idea of faster workflow appears in the original first paragraph as “new efficiencies aimed at streamlining antenna design workflows and shortening design cycles.” That’s a long, complicated, yet vague way of saying “enables designers to work faster.”

Step 2

The original lead paragraph was now just a verbose recap of the headline. So poof, gone.

Step 3

The original second paragraph, now the lead, needed a bit of tightening. Rules in play here:

Strengthen verbs. “NOUN is a NOUN that VERBs” weakens the verb. “NOUN, a NOUN, VERBs” makes it stronger.

Clarify modifiers. “matching network analysis” -> “matching-network analysis”. (As I look at it again now, I’d revise to “analysis of matching networks.”)

Break up long, weakly-linked sentences. The original was really two sentences linked weakly by “making it,” so I split them.

Omit needless words. A word that adds nothing, like “applications” here, weakens a sentence.

Strengthen parallelism. If you say “It’s ideal for X and Y” there’s no problem. But when X becomes “complex antenna designs that involve multi-state and multi-port aperture or impedance tuners,” and Y becomes “corporate feed networks with digital phase shifters,” then it helps to make the parallelism explicit: “It’s ideal for X and for Y.”

Step 4

Omit needless words. “builds on the previous framework with additional” -> “adds”.

Simplify. “capability to connect” -> “ability to connect”.

Show, don’t tell. A phrase like “time-saving options in the schematic editor’s interface” tells us that designers save time but doesn’t show us how. That comes next: “the capability to connect two voltage sources to a single antenna improves workflow efficiency.” The revision cites that as a shortcut.

Activate the sentence. “System and radiation efficiencies … can be effortlessly computed from a single schematic” makes efficiencies the subject and buries the agent (the designer) who computes them. The revision activates that passive construction. Similar rules govern the rewrite of the next paragraph.

Step 5

When I reread the original fourth paragraph I realized that the release wasn’t only touting faster workflow, but also better collaboration. So I adjusted the headline accordingly.

Step 6

Show, don’t tell. The original version tells, the new one shows.

Simplify. “streamline user input” -> “saves keystrokes” (which I might further revise to “clicks and keystrokes”).

Final result

Here’s the result of these changes.

I haven’t fully explained each step, and because the domain is unfamiliar I’ve likely missed some nuance. But I’m certain that the final version is clearer and more effective. I hope this step-by-step narration helps you see how and why the method works.

The Velvet Bandit’s COVID series

The Velvet Bandit is a local street artist whose work I’ve admired for several years. Her 15 minutes of fame happened last year when, as reported by the Press Democrat, Alexandria Ocasio-Cortez wore a gown with a “Tax the Rich” message that closely resembled a VB design.

I have particularly enjoyed a collection that I think of as the Velvet Bandit’s COVID series, which appeared on the boarded-up windows of the former Economy Inn here in Santa Rosa. The building is now under active renovation and the installation won’t last much longer, so I photographed it today and made a slideshow.

I like this image especially, though I have no idea what it means.

If you would like to buy some of her work, it’s available here. I gather sales have been brisk since l’affaire AOC!

Subtracting devices

People who don’t listen to podcasts often ask people who do: “When do you find time to listen?” For me it’s always on long walks or hikes. (I do a lot of cycling too, and have thought about listening then, but wind makes that impractical and cars make it dangerous.) For many years my trusty podcast player was one or another version of the Creative Labs MuVo which, as the ad says, is “ideal for dynamic environments.”

At some point I opted for the convenience of just using my phone. Why carry an extra, single-purpose device when the multi-purpose phone can do everything? That was OK until my Quixotic attachment to Windows Phone became untenable. Not crazy about either of the alternatives, I flipped a coin and wound up with an iPhone. Which, of course, lacks a 3.5mm audio jack. So I got an adapter, but now the setup was hardly “ideal for dynamic environments.” My headset’s connection to the phone was unreliable, and I’d often have to stop walking, reseat it, and restart the podcast.

If you are gadget-minded you are now thinking: “Wireless earbuds!” But no thanks. The last thing I need in my life is more devices to keep track of, charge, and sync with other devices.

I was about to order a new MuVo, and I might still; it’s one of my favorite gadgets ever. But on a recent hike, in a remote area with nobody else around, I suddenly realized I didn’t need the headset at all. I yanked it out, stuck the phone in my pocket, and could hear perfectly well. Bonus: Nothing jammed into my ears.

It’s a bit weird when I do encounter other hikers. Should I pause the audio or not when we cross paths? So far I mostly do, but I don’t think it’s a big deal one way or another.

Adding more devices to solve a device problem amounts to doing the same thing and expecting a different result. I want to remain alert to the possibility that subtracting devices may be the right answer.

There’s a humorous coda to this story. It wasn’t just the headset that was failing to seat securely in the Lightning port. Charging cables were also becoming problematic. A friend suggested a low-tech solution: use a toothpick to pull lint out of the socket. It worked! I suppose I could now go back to using my wired headset on hikes. But I don’t think I will.

What happened to simple, basic web hosting?

For a friend’s memorial I signed up to make a batch of images into a slideshow. All I wanted was the Simplest Possible Thing: a web page that would cycle through a batch of images. It’s been a while since I did something like this, so I looked around and didn’t find anything that seemed simple enough. The recipes I found felt like overkill. Here’s all I wanted to do:

  1. Put the images we’re gathering into a local folder
  2. Run one command to build slideshow.html
  3. Push the images plus slideshow.html to a web folder

Step 1 turned out to be harder than expected because a bunch of the images I got are in Apple’s HEIC format, so I had to find a tool that would convert those to JPG. Sigh.

For step 2 I wrote the script below. A lot of similar recipes you’ll find for this kind of thing will create a trio of HTML, CSS, and JavaScript files. That feels to me like overkill for something as simple as this; I want as few moving parts as possible, so the Python script bundles everything into slideshow.html, which is the only thing that needs to be uploaded (along with the images).

Step 3 was simple: I uploaded the JPGs and slideshow.html to a web folder.

Except, whoa, not so fast there, old-timer! True, it’s easy for me, I’ve maintained a personal web server for decades and I don’t think twice about pushing files to it. Once upon a time, when you signed up with an ISP, that was a standard part of the deal: you’d get web hosting, and would use an FTP client — or some kind of ISP-provided web app — to move files to your server.

As I realized a few years ago, that’s now a rare experience. It seems that for most people, it’s far from obvious how to push a chunk of basic web stuff to a basic web server. People know how to upload stuff to Google Drive, or WordPress, but those are not vanilla web hosting environments.

It’s a weird situation. The basic web platform has never been more approachable. Browsers have converged nicely on the core standards. Lots of people could write a simple app like this one. Many more could at least /use/ it. But I suspect it will be easier for many nowadays to install Python and run this script than to push its output to a web server.

I hate to sound like a Grumpy Old Fart. Nobody likes that guy. I don’t want to be that guy. So I’ll just ask: What am I missing here? Are there reasons why it’s no longer important or useful for most people to be able to use the most basic kind of web hosting?




import os

# collect the JPG filenames from the current directory
jpgs = [f for f in os.listdir() if f.endswith('.jpg')]

divs = ''

for i in jpgs:
  divs += f"""
<div class="slide">
  <img src="{i}">
</div>
"""


# Note: In a Python f-string, CSS/JS squiggies ({}) need to be doubled

html = f"""
<html>

<head>
  <title>My Title</title>
  <style>
    body {{ background-color: black }} 
    .slide {{ text-align: center; display: none; }}
    img {{ height: 100% }}
  </style>
</head>

<body>

<div id="slideshow">
<div role="list">
{divs}
</div>
</div>

<script>
const slides = document.querySelectorAll('.slide')

const time = 5000

slides[0].style.display = 'block';

let i = 0

setInterval( () => {{
  i++
  if (i === slides.length) {{
    i = 0
  }}
  for (let j = 0; j < slides.length; j++ ) {{  // show only the current slide, hide the rest
    if ( j === i ) {{
      slides[j].style.display = 'block'
    }} else {{
      slides[j].style.display = 'none'
    }}
    
  }}
}}, time)
</script>

</body>
</html>
"""

with open('slideshow.html', 'w') as f:
  f.write(html)

Appreciating “Just Have a Think”

Just Have a Think, a YouTube channel created by Dave Borlace, is one of my best sources for news about, and analysis of, the world energy transition. Here are some hopeful developments I’ve enjoyed learning about.

Solar Wind and Wave. Can this ocean hybrid platform nail all three?

New energy storage tech breathing life and jobs back into disused coal power plants

Agrivoltaics. An economic lifeline for American farmers?

Solar PV film roll. Revolutionary new production technology

All of Dave’s presentations are carefully researched and presented. A detail that has long fascinated me: how the show displays source material. Dave often cites IPCC reports and other sources that are, in raw form, PDF files. He spices up these citations with some impressive animated renderings. Here’s one from the most recent episode.

The progressive rendering of the chart in this example is an even fancier effect than I’ve seen before, and it prompted me to track down the original source. In that clip Dave cites IRENA, the International Renewable Energy Agency, so I visited their site, looked for the cited report, and found it on page 8 of World Energy Transitions Outlook 2022. That link might or might not take you there directly, if not you can scroll to page 8 where you’ll find the chart that’s been animated in the video.

The graphical finesse of Just Have a Think is only icing on the cake. The show reports a constant stream of innovations that collectively give me hope we might accomplish the transition and avoid worst-case scenarios. But still, I wonder. That’s just a pie chart in a PDF file. How did it become the progressive rendering that appears in the video?

In any case, and much more importantly: Dave, thanks for the great work you’re doing!

Capture the rain

It’s raining again today, and we’re grateful. This will help put a damper on what was shaping up to be a terrifying early start of fire season. But the tiny amounts won’t make a dent in the drought. The recent showers bring us to 24 inches of rain for the season, about 2/3 of normal. But 10 of those 24 inches came in one big burst on Oct 24.

Here are a bunch of those raindrops sailing down the Santa Rosa creek to the mouth of the Russian River at Jenner.

With Sam Learner’s amazing River Runner we can follow a drop that fell in the Mayacamas range as it makes its way to the ocean.

Until 2014 I’d only ever lived east of the Mississippi River, in Pennsylvania, Michigan, Maryland, Massachusetts, and New Hampshire. During those decades there may never have been a month with zero precipitation.

I still haven’t adjusted to a region where it can be dry for many months. In 2017, the year of the devastating Tubbs Fire, there was no rain from April through October.

California relies heavily on the dwindling Sierra snowpack for storage and timed release of water. Clearly we need a complementary method of storage and release, and this passage in Kim Stanley Robinson’s Ministry for the Future imagines it beautifully.

Typically the Sierra snowpack held about fifteen million acre-feet of water every spring, releasing it to reservoirs in a slow melt through the long dry summers. The dammed reservoirs in the foothills could hold about forty million acre-feet when full. Then the groundwater basin underneath the central valley could hold around a thousand million acre-feet; and that immense capacity might prove their salvation. In droughts they could pump up groundwater and put it to use; then during flood years they needed to replenish that underground reservoir, by capturing water on the land and not allow it all to spew out the Golden Gate.

Now the necessity to replumb the great valley for recharge had forced them to return a hefty percentage of the land to the kind of place it had been before Europeans arrived. The industrial agriculture of yesteryear had turned the valley into a giant factory floor, bereft of anything but products grown for sale; unsustainable ugly, devastated, inhuman, and this in a place that had been called the “Serengeti of North America,” alive with millions of animals, including megafauna like tule elk and grizzly bear and mountain lion and wolves. All those animals had been exterminated along with their habitat, in the first settlers’ frenzied quest to use the valley purely for food production, a kind of secondary gold rush. Now the necessity of dealing with droughts and floods meant that big areas of the valley were restored, and the animals brought back, in a system of wilderness parks or habitat corridors, all running up into the foothills that ringed the central valley on all sides.

The book, which Wikipedia charmingly classifies as cli-fi, grabbed me from page one and never let go. It’s an extraordinary blend of terror and hope. But this passage affected me in the most powerful way. As Marc Reisner’s Cadillac Desert explains, and as I’ve seen for myself, we’ve already engineered the hell out of California’s water systems, with less than stellar results.

Can we redo it and get it right this time? I don’t doubt our technical and industrial capacity. Let’s hope it doesn’t take an event like the one the book opens with — a heat wave in India that kills 20 million people in a week — to summon the will.

Life in the neighborhood

I’ve worked from home since 1998. All along I’ve hoped many more people would enjoy the privilege and share in the benefits. Now that it’s finally happening, and seems likely to continue in some form, let’s take a moment to reflect on an underappreciated benefit: neighborhood revitalization.

I was a child of the 1960s, and spent my grade school years in a newly-built suburb of Philadelphia. Commuter culture was well established by then, so the dads in the neighborhood were gone during the day. So were some of the moms, mine included, but many were at home and were able to keep an eye on us kids as we played in back yards after school. And our yards were special. A group of parents had decided not to fence them, thus creating what was effectively a private park. The games we played varied from season to season but always involved a group of kids roaming along that grassy stretch. Nobody was watching us most of the time. Since the kitchens all looked out on the back yards, though, there was benign surveillance. Somebody’s mom might be looking out at any given moment, and if things got out of hand, somebody’s mom would hear that.

For most kids, a generation later, that freedom was gone. Not for ours, though! They were in grade school when BYTE Magazine ended and I began my remote career. Our house became an after-school gathering place for our kids and their friends. With me in my front office, and Luann in her studio in the back, those kids enjoyed a rare combination of freedom and safety. We were mostly working, but at any given moment we could engage with them in ways that most parents never could.

I realized that commuter culture had, for several generations, sucked the daytime life out of neighborhoods. What we initially called telecommuting wasn’t just a way to save time, reduce stress, and burn less fossil fuel. It held the promise of restoring that daytime life.

All this came back to me powerfully at the height of the pandemic lockdown. Walking around the neighborhood on a weekday afternoon I’d see families hanging out, kids playing, parents working on landscaping projects and tinkering in garages, neighbors talking to one another. This was even better than my experience in the 2000s because more people shared it.

Let’s hold that thought. Even if many return to offices on some days of the week, I believe and hope that we’ve normalized working from home on other days. By inhabiting our neighborhoods more fully on weekdays, we can perhaps begin to repair a social fabric frayed by generations of commuter culture.

Meanwhile here is a question to ponder. Why do we say that we are working from and not working at home?

Remembering Diana

The other day Luann and I were thinking of a long-ago friend and realized we’d forgotten the name of that friend’s daughter. Decades ago she was a spunky blonde blue-eyed little girl; we could still see her in our minds’ eyes, but her name was gone.

“Don’t worry,” I said confidently, “it’ll come back to one of us.”

Sure enough, a few days later, on a bike ride, the name popped into my head. I’m sure you’ve had the same experience. This time around it prompted me to think about how that happens.

To me it feels like starting up a background search process that runs for however long it takes, then notifies me when the answer is ready. I know the brain isn’t a computer, and I know this kind of model is suspect, so I wonder what’s really going on.

– Why was I so sure the name would surface?

– Does a retrieval effort kick off neurochemical change that elaborates over time?

– Before computers, what model did people use to explain this phenomenon?

So far I’ve only got one answer. That spunky little girl was Diana.

The (appropriately) quantified self

A year after we moved to northern California I acquired a pair of shiny new titanium hip joints. There would be no more running for me. But I’m a lucky guy who gets to bike and hike more than ever amidst spectacular scenery that no-one could fully explore in a lifetime.

Although the osteoarthritis was more advanced on the right side, we opted for bilateral replacement because the left side wasn’t far behind. Things hadn’t felt symmetrical in the years leading up to the surgery, and that didn’t change. There’s always a sense that something’s different about the right side.

We’re pretty sure it’s not the hardware. X-rays show that the implants remain firmly seated, and there’s no measurable asymmetry. Something about the software has changed, but there’s been no way to pin down what’s different about the muscles, tendons, and ligaments on that side, whether there’s a correction to be made, and if so, how.

Last month, poking around on my iPhone, I noticed that I’d never opened the Health app. That’s because I’ve always been ambivalent about the quantified self movement. In college, when I left competitive gymnastics and took up running, I avoided tracking time and distance. Even then, before the advent of fancy tech, I knew I was capable of obsessive data-gathering and analysis, and didn’t want to go there. It was enough to just run, enjoy the scenery, and feel the afterglow.

When I launched the Health app, I was surprised to see that it had been counting my steps since I became an iPhone user 18 months ago. Really? I don’t recall opting into that feature.

Still, it was (of course!) fascinating to see the data and trends. And one metric in particular grabbed my attention: Walking Asymmetry.

Walking asymmetry is the percent of time that your steps with one foot are faster or slower than the other foot.

An even or symmetrical walk is often an important physical therapy goal when recovering from injury.

Here’s my chart for the past year.

I first saw this in mid-December when the trend was at its peak. What caused it? Well, it’s been rainy here (thankfully!), so I’ve been riding less. Maybe that was a factor?

Since then I haven’t biked more, though, and I’ve walked the usual mile or two most days, with longer hikes on weekends. Yet the data suggest that I’ve reversed the trend.

What’s going on here?

Maybe this form of biofeedback worked. Once aware of the asymmetry I subconsciously corrected it. But that doesn’t explain the November/December trend.

Maybe the metric is bogus. A phone in your pocket doesn’t seem like a great way to measure walking asymmetry. I’ve also noticed that my step count and distances vary, on days when I’m riding, in ways that are hard to explain.

I’d like to try some real gait analysis using wearable tech. I suspect that data recorded from a couple of bike rides, mountain hikes, and neighborhood walks could help me understand the forces at play, and that realtime feedback could help me balance those forces.

I wouldn’t want to wear it all the time, though. It’d be a diagnostic and therapeutic tool, not a lifestyle.

My own personal AWS S3 bucket

I’ve just rediscovered two digital assets that I’d forgotten about.

1. The Reddit username judell, which I created in 2005 and never used. When you visit the page it says “hmm… u/judell hasn’t posted anything” but also reports, in my Trophy Case, that I belong to the 15-year club.

2. The Amazon AWS S3 bucket named simply jon, which I created in 2006 for an InfoWorld blog post and companion column about the birth of Amazon Web Services. As Wikipedia’s timeline shows, AWS started in March of that year.

Care to guess the odds that I could still access both of these assets after leaving them in limbo for 15 years?

Spoiler alert: it was a coin flip.

I’ve had no luck with Reddit so far. The email account I signed up with no longer exists. The support folks kindly switched me to a current email but it’s somehow linked to Educational_Elk_7869 not to judell. I guess we may still get it sorted but the point is that I was not at all surprised by this loss of continuity. I’ve lost control of all kinds of digital assets over the years, including the above-cited InfoWorld article which only Wayback (thank you as always!) now remembers.

When I turned my attention to AWS S3 I was dreading a similar outcome. I’d gone to Microsoft not long after I made that AWS developer account; my early cloud adventures were all in Azure; could I still access those long-dormant AWS resources? Happily: yes.

Here’s the backstory from that 2006 blog post:

Naming

The name of the bucket is jon. The bucket namespace is global which means that as long as jon is owned by my S3 developer account, nobody else can use that name. Will this lead to a namespace land grab? We’ll see. Meanwhile, I’ve got mine, and although I may never again top Jon Stewart as Google’s #1 Jon, his people are going to have to talk to my people if they want my Amazon bucket.

I’m not holding my breath waiting for an offer. Bucket names never mattered in the way domain names do. Still, I would love to be pleasantly surprised!

My newfound interest in AWS is, of course, because Steampipe wraps SQL around a whole bunch of AWS APIs including the one for S3 buckets. So, for example, when exactly did I create that bucket? Of course I can log into the AWS console and click my way to the answer. But I’m all about SQL lately so instead I can do this.

> select name, arn, creation_date from aws_s3_bucket 
+-------+--------------------+---------------------+         
| name  | arn                | creation_date       |         
+-------+--------------------+---------------------+         
| jon   | arn:aws:s3:::jon   | 2006-03-16 08:16:12 |         
| luann | arn:aws:s3:::luann | 2007-04-26 14:47:45 |         
+-------+--------------------+---------------------+  

Oh, and there’s the other one I made for Luann the following year. These are pretty cool ARNs (Amazon Resource Names)! I should probably do something with them; the names you can get nowadays are more like Educational_Elk_7869.

Anyway I’m about to learn a great deal about the many AWS APIs that Steampipe can query, check for policy compliance, and join with the APIs of other services. Meanwhile it’s fun to recall that I wrote one of the first reviews of the inaugural AWS product and, in the process, laid claim to some very special S3 bucket names.

Query like it’s 2022

Monday will be my first day as community lead for Steampipe, a young open source project that normalizes APIs by way of Postgres foreign data wrappers. The project’s taglines are select * from cloud and query like it’s 1992; the steampipe.io home page nicely illustrates these ideas.

I’ve been thinking about API normalization for a long time. The original proposal for the World Wide Web says:

Databases

A generic tool could perhaps be made to allow any database which uses a commercial DBMS to be displayed as a hypertext view.

We ended up with standard ways for talking to databases — ODBC, JDBC — but not for expressing them on the web.

When I was at Microsoft I was bullish on OData, an outgrowth of Pablo Castro’s wonderful Project Astoria. Part of the promise was that every database-backed website could automatically offer basic API access that wouldn’t require API wrappers for everybody’s favorite programming language. The API was hypertext; a person could navigate it using links and search. Programs wrapped around that API could be useful, but meaningful interaction with data would be possible without them.

(For a great example of what that can feel like, jump into the middle of one of Simon Willison’s datasettes, for example san-francisco.datasettes.com, and start clicking around.)

Back then I wrote a couple of posts on this topic[1, 2]. Many years later OData still hasn’t taken the world by storm. I still think it’s a great idea and would love to see it, or something like it, catch on more broadly. Meanwhile Steampipe takes a different approach. Given a proliferation of APIs and programming aids for them, let’s help by providing a unifying abstraction: SQL.

I’ve done a deep dive into the SQL world over the past few years. The first post in a series I’ve been writing on my adventures with Postgres is what connected me to Steampipe and its sponsor (my new employer) Turbot. When you install Steampipe it brings Postgres along for the ride. Imagine what you could do with data flowing into Postgres from many different APIs and filling up tables you can view, query, join, and expose to tools and systems that talk to Postgres. Well, it’s going to be my job to help imagine, and explain, what’s possible in that scenario.
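
To make that a bit more concrete, here’s a minimal sketch of the kind of thing I mean, using only the aws_s3_bucket table shown earlier in this post. Because Steampipe’s foreign tables are ordinary Postgres relations, you can layer plain Postgres views over them and hand those views to anything that speaks Postgres. The view name here is just illustrative.

-- a sketch, not a recipe from the Steampipe docs
create view bucket_ages as
  select
    name,
    date_part('year', age(now(), creation_date)) as age_in_years
  from aws_s3_bucket;

select * from bucket_ages order by age_in_years desc;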

Meanwhile I need to give some thought to my Twitter tag line: patron saint of trailing edge technologies. It’s funny and it’s true. At BYTE I explored how software based on the Network News Transfer Protocol enabled my team to do things that we use Slack for today. At Microsoft I built a system for community-scale calendaring based on iCalendar. When I picked up NNTP and iCalendar they were already on the trailing edge. Yet they were, and especially in the case of iCalendar still are, capable of doing much more than is commonly understood.

Then of course came web annotation. Although Hypothesis recently shepherded it to W3C standardization, it goes all the way back to the Mosaic browser and is exactly the kind of generative tech that fires my imagination. With Hypothesis now well established in education, I hope others will continue to explore the breadth of what’s possible when every document workflow that needs to do so can readily connect people, activities, and data to selections in documents. If that’s of interest, here are some signposts pointing to scenarios I’ve envisioned and prototyped.

And now it’s SQL. For a long time I set it aside in favor of object, XML, and NoSQL stores. Coming back to it, by way of Postgres, has shown me that:

– Modern SQL is more valuable as a programming language than is commonly understood

– So is Postgres as a programming environment

The tagline query like it’s 1992 seems very on-brand for me. But maybe I should let go of the trailing-edge moniker. Nostalgia isn’t the best way to motivate fresh energy. Maybe query like it’s 2022 sets a better tone? In any case I’m very much looking forward to this next phase.

The Postgres REPL

R0ml Lefkowitz’s The Image of Postgres evokes the Smalltalk experience: reach deeply into a running system, make small changes, see immediate results. There isn’t yet a full-blown IDE for the style of Postgres-based development I describe in this series, though I can envision a VSCode extension that would provide one. But there is certainly a REPL (read-eval-print loop): it’s called psql, and it delivers the kind of immediacy that all REPLs do. In our case there’s also Metabase; it offers a complementary REPL that enhances its power as a lightweight app server.

In the Clojure docs it says:

The Clojure REPL gives the programmer an interactive development experience. When developing new functionality, it enables her to build programs first by performing small tasks manually, as if she were the computer, then gradually make them more and more automated, until the desired functionality is fully programmed. When debugging, the REPL makes the execution of her programs feel tangible: it enables the programmer to rapidly reproduce the problem, observe its symptoms closely, then improvise experiments to rapidly narrow down the cause of the bug and iterate towards a fix.

I feel the same way about the Python REPL, the browser’s REPL, the Metabase REPL, and now also the Postgres REPL. Every function and every materialized view in the analytics system begins as a snippet of code pasted into the psql console (or Metabase). Iteration yields successive results instantly, and those results reflect live data. In How is a Programmer Like a Pathologist Gilad Bracha wrote:

A live program is dynamic; it changes over time; it is animated. A program is alive when it’s running. When you work on a program in a text editor, it is dead.

Tudor Girba amplified the point in a tweet.

In a database-backed system there’s no more direct way to interact with live data than to do so in the database. The Postgres REPL is, of course, a very sharp tool. Here are some ways to handle it carefully.

Find the right balance for tracking incremental change

In Working in a hybrid Metabase / Postgres code base I described how version-controlled files — for Postgres functions and views, and for Metabase questions — repose in GitHub and drive a concordance of docs. I sometimes write code snippets directly in psql or Metabase, but mainly compose in a “repository” (telling word!) where those snippets are “dead” artifacts in a text editor. They come to life when pasted into psql.

A knock on Smalltalk was that it didn’t play nicely with version control. If you focus on the REPL aspect, you could say the same of Python or JavaScript. In any such case there’s a balance to be struck between iterating at the speed of thought and tracking incremental change. Working solo I’ve been inclined toward a fairly granular commit history. In a team context I’d want to leave a chunkier history but still record the ongoing narrative somewhere.

Make it easy to understand the scope and effects of changes

The doc concordance has been the main way I visualize interdependent Postgres functions, Postgres views, and Metabase questions. In Working with interdependent Postgres functions and materialized views I mentioned Laurenz Albe’s Tracking View Dependencies in Postgres. I’ve adapted the view dependency tracker he develops there, along with related work from others, to track function dependencies.
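
For a taste of what that kind of tracking builds on, here’s a single-level sketch (Laurenz Albe’s version handles recursion and edge cases this one ignores). Postgres records a view’s dependencies by way of its rewrite rule, so pg_depend plus pg_rewrite can answer “which views select from this relation?”

-- 'some_table' is a placeholder for the relation of interest
select distinct dependent.relname as dependent_view
from pg_depend d
join pg_rewrite r on r.oid = d.objid
join pg_class dependent on dependent.oid = r.ev_class
join pg_class source on source.oid = d.refobjid
where d.classid = 'pg_rewrite'::regclass
  and d.refclassid = 'pg_class'::regclass
  and source.relname = 'some_table'
  and dependent.relname <> source.relname;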

This tooling is still a work in progress, though. The concordance doesn’t yet include Postgres types, for example, nor the tables that are upstream from materialized views. My hypothetical VSCode extension would know about all the artifacts and react immediately when things change.

Make it easy to find and discard unwanted artifacts

Given a function or view named foo, I’ll often write and test a foo2 before transplanting changes back into foo. Because foo may depend on bar and call baz, I often wind up with bar2 and baz2 as well. These artifacts hang around in Postgres until you delete them, which I try to do as I go along.

If foo2 is a memoized function (see this episode), it can be necessary to delete the set of views that it’s going to recreate. I find these with a query.

select 
  'drop materialized view ' || matviewname || ';' as drop_stmt
from pg_matviews 
where matviewname ~* {{ pattern }}

That pattern might be question_and_answer_summary_for_group to find all views based on that function, or _6djxg2yk to find all views for a group, or even [^_]{8,8}$ to find all views made by memoized functions.
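
A similar catalog query, sketched here rather than copied from my scripts, lists the scratch functions left behind by that foo2 / bar2 / baz2 convention.

select 
  'drop function ' || p.oid::regprocedure || ';' as drop_stmt
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where n.nspname = 'public'
  and p.proname ~ '2$';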

I haven’t yet automated the discovery or removal of stale artifacts and references to them. That’s another nice-to-have for the hypothetical IDE.

The Image of Postgres

I’ll give R0ml the last word on this topic.

This is the BYTE magazine cover from August of 1981. In the 70s and the 80s, programming languages had this sort of unique perspective that’s completely lost to history. The way it worked: a programming environment was a virtual machine image, it was a complete copy of your entire virtual machine memory and that was called the image. And then you loaded that up and it had all your functions and your data in it, and then you ran that for a while until you were sort of done and then you saved it out. And this wasn’t just Smalltalk, Lisp worked that way, APL worked that way, it was kind of like Docker only it wasn’t a separate thing because everything worked that way and so you didn’t worry very much about persistence because it was implied. If you had a programming environment it saved everything that you were doing in the programming environment, you didn’t have to separate that part out. A programming environment was a place where you kept all your data and business logic forever.

So then Postgres is kind of like Smalltalk only different.

What’s the difference? Well we took the UI out of Smalltalk and put it in the browser. The rest of it is the same, so really Postgres is an application delivery platform, just like we had back in the 80s.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Metabase as a lightweight app server

In A virtuous cycle for analytics I said this about Metabase:

It’s all nicely RESTful. Interactive elements that can parameterize queries, like search boxes and date pickers, map to URLs. Queries can emit URLs in order to compose themselves with other queries. I came to see this system as a kind of lightweight application server in which to incubate an analytics capability that could later be expressed more richly.

Let’s explore that idea in more detail. Consider this query that finds groups created in the last week.

with group_create_days as (
  select
     to_char(created, 'YYYY-MM-DD') as day
  from "group"
  where created > now() - interval '1 week'
)
select 
  day,
  count(*)
from group_create_days
group by day 
order by day desc 

A Metabase user can edit the query and change the interval to, say, 1 month, but there’s a nicer way to enable that. Terms in double squigglies are Metabase variables. When you type {{interval}} in the query editor, the Variables pane appears.

Here I’m defining the variable’s type as text and providing the default value 1 week. The query sent to Postgres will be the same as above. Note that this won’t work if you omit ::interval. Postgres complains: “ERROR: operator does not exist: timestamp with time zone - character varying.” That’s because Metabase doesn’t support variables of type interval as required for date subtraction. But if you cast the variable to type interval it’ll work.
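
So the variable-ized query ends up looking something like this, a sketch reconstructed from the description above, assuming the variable is named interval:

with group_create_days as (
  select
     to_char(created, 'YYYY-MM-DD') as day
  from "group"
  where created > now() - {{interval}}::interval
)
select 
  day,
  count(*)
from group_create_days
group by day 
order by day desc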

That’s an improvement. A user of this Metabase question can now type 2 months or 1 year to vary the interval. But while Postgres’ interval syntax is fairly intuitive, this approach still requires people to guess what to type. So here’s a version that eliminates the guessing.

The variable type is now Field Filter; the filtered field is the created column of the group table; the widget type is Relative Date; the default is Last Month. Choosing other intervals is now a point-and-click operation. It’s less flexible — 3 weeks is no longer an option — but friendlier.

Metabase commendably provides URLs that capture these choices. The default in this case is METABASE_SERVER/question/1060?interval=lastmonth. For the Last Year option it becomes interval=lastyear.

Because all Metabase questions that use variables work this way, the notion of Metabase as rudimentary app server expands to sets of interlinked questions. In Working in a hybrid Metabase / Postgres code base I showed the following example.

A Metabase question, #600, runs a query that selects columns from the view top_20_annotated_domains_last_week. It interpolates one of those columns, domain, into an URL that invokes Metabase question #985 and passes the domain as a parameter to that question. In the results for question #600, each row contains a link to a question that reports details about groups that annotated pages at that row’s domain.
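
The mechanism is just string concatenation in SQL. Here’s a hedged sketch of the pattern, not the actual text of question #600; the parameter name passed to #985 is an assumption.

select
  domain,
  'https://METABASE_SERVER/question/985?domain=' || domain as group_details
from top_20_annotated_domains_last_week;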

This is really powerful stuff. Even without all the advanced capabilities I’ve been discussing in this series — pl/python functions, materialized views — you can do a lot more with the Metabase / Postgres combo than you might think.

For example, here’s an interesting idiom I’ve discovered. It’s often useful to interpolate a Metabase variable into a WHERE clause.

select * 
from dashboard_users
where email = {{ email }} 

You can make that into a fuzzy search using the case-insensitive regex-match operator ~*.

select * 
from dashboard_users
where email ~* {{ email }}

That’ll find a single address regardless of case; you can also find all records matching, say, ucsc.edu. But it requires the user to type some value into the input box. Ideally this query won’t require any input. If none is given, it lists all addresses in the table. If there is input it does a fuzzy match on that input. Here’s a recipe for doing that. Tell Metabase that {{ email }} is a required variable, and set its default to any. Then, in the query, do this:

select * 
from dashboard_users
where email ~*
  case 
    when {{ email }} = 'any' then ''
    else {{ email }}
  end

In the default case the matching operator binds to the empty string, so it matches everything and the query returns all rows. For any other input the operator binds to a value that drives a fuzzy search.

This is all very nice, you may think, but even the simplest app server can write to the database as well as read from it, and Metabase can’t. It’s ultimately just a tool that you point at a data warehouse to SELECT data for display in tables and charts. You can’t INSERT or UPDATE or ALTER or DELETE or CALL anything.

Well, it turns out that you can. Here’s a Metabase question that adds a user to the table.

select add_dashboard_user( {{email}} )

How can this possibly work? If add_dashboard_user were a Postgres procedure you could CALL it from psql, but in this context you can only SELECT.

We’ve seen the solution in Postgres set-returning functions that self-memoize as materialized views. A Postgres function written in pl/python can import and use a Python function from a plpython_helpers module. That helper function can invoke psql to CALL a procedure. So this is possible.
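
Here’s a minimal sketch of that arrangement, not the actual implementation. It inlines the psql call in a pl/python function instead of importing it from plpython_helpers, and it assumes a procedure named add_dashboard_user_proc; real code would also quote the email value properly and let connection details come from the environment.

create or replace function add_dashboard_user(_email text)
returns text as $$
  # a sketch only: shell out to psql and CALL the procedure
  import subprocess
  cmd = ['psql', '-c', "call add_dashboard_user_proc('%s')" % _email]
  result = subprocess.run(cmd, capture_output=True, text=True)
  return result.stderr if result.returncode else 'ok'
$$ language plpython3u;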

We’ve used Metabase for years. It provides a basic, general-purpose UX that’s deeply woven into the fabric of the company. Until recently we thought of it as a read-only system for analytics, so a lot of data management happens in spreadsheets that don’t connect to the data warehouse. It hadn’t occurred to me to leverage that same basic UX for data management too, and that’s going to be a game-changer. I always thought of Metabase as a lightweight app server. With some help from Postgres it turns out to be a more capable one than I thought.



Notes for an annotation SDK

While helping Hypothesis find its way to ed-tech it was my great privilege to explore ways of adapting annotation to other domains including bioscience, journalism, and scholarly publishing. Working across these domains showed me that annotation isn’t just an app you do or don’t adopt. It’s also a service you’d like to be available in every document workflow that connects people to selections in documents.

In my talk Weaving the Annotated Web I showcased four such services: Science in the Classroom, The Digital Polarization Project, SciBot, and ClaimChart. Others include tools to evaluate credibility signals, or review claims, in news stories.

As I worked through these and other scenarios, I accreted a set of tools for enabling any annotation-aware interaction in any document-oriented workflow. I’ve wanted to package these as a coherent software development kit; that hasn’t happened yet, but here are some of the ingredients that belong in such an SDK.

Creating an annotation from a selection in a document

Two core operations lie at the heart of any annotation system: creating a note that will bind to a selection in a document, and binding (anchoring) that note to its place in the document. A tool that creates an annotation reacts to a selection in a document by forming one or more selectors that describe the selection.

The most important selector is TextQuoteSelector. If I visit http://www.example.com and select the phrase “illustrative examples” and then use Hypothesis to annotate that selection, the payload sent from the client to the server includes this construct.

{
  "type": "TextQuoteSelector", 
  "exact": "illustrative examples", 
  "prefix": "n\n    This domain is for use in ", 
  "suffix": " in documents. You may use this\n"
}

The Hypothesis client formerly used an NPM module, dom-anchor-text-quote, to derive that info from a selection. It no longer uses that module, and the equivalent code that it does use isn’t separately available. But annotations created using TextQuoteSelectors formed by dom-anchor-text-quote interoperate with those created using the Hypothesis client, and I don’t expect that will change since Hypothesis needs to remain backwards-compatible with itself.

You’ll find something like TextQuoteSelector in any annotation system. It’s formally defined by the W3C here. In the vast majority of cases this is all you need to describe the selection to which an annotation should anchor.

There are, however, cases where TextQuoteSelector won’t suffice. Consider a document that repeats the same passage three times. Given a short selection in the first of those passages, how can a system know that an annotation should anchor to that one, and not the second or third? Another selector, TextPositionSelector (https://www.npmjs.com/package/dom-anchor-text-position), enables a system to know which passage contains the selection.

{
  "type": "TextPositionSelector", 
  "start": 51, 
  "end": 72
}

It records the start and end of the selection in the visible text of an HTML document. Here’s the HTML source of that web page.

<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>

Here is the visible text to which the TextQuoteSelector refers.

\n\n Example Domain\n This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\n More information…\n\n\n\n

The positions recorded by a TextPositionSelector can change for a couple of reasons. If the document is altered, it’s obvious that an annotation’s start and end numbers might change. Less obviously, that can happen even if the document’s text isn’t altered. A news website, for example, may inject different kinds of advertising-related text content from one page load to the next. In that case the positions for two consecutive Hypothesis annotations made on the same selection can differ. So while TextPositionSelector can resolve ambiguity, and provide hints to an annotation system about where to look for matches, the foundation is ultimately TextQuoteSelector.

If you try the first example in the README at https://github.com/judell/TextQuoteAndPosition, you can form your own TextQuoteSelector and TextPositionSelector from a selection in a web page. That repo exists only as a wrapper around the set of modules — dom-anchor-text-quote, dom-anchor-text-position, and wrap-range-text — needed to create and anchor annotations.

Building on these ingredients, HelloWorldAnnotated illustrates a common pattern.

  • Given a selection in a page, form the selectors needed to post an annotation that targets the selection.
  • Lead a user through an interaction that influences the content of that annotation.
  • Post the annotation.

Here is an example of such an interaction. It’s a content-labeling scenario in which a user rates the emotional affect of a selection. This is the kind of thing that can be done with the stock Hypothesis client, but awkwardly because users must reliably add tags like WeakNegative or StrongPositive to represent their ratings. The app prompts for those tags to ensure consistent use of them.

Although the annotation is created by a standalone app, the Hypothesis client can anchor it, display it, and even edit it.

And the Hypothesis service can search for sets of annotations that match the tags WeakNegative or StrongPositive.

There’s powerful synergy at work here. If your annotation scenario requires controlled tags, or a prescribed workflow, you might want to adapt the Hypothesis client to do those things. But it can be easier to create a standalone app that does exactly what you need, while producing annotations that interoperate with the Hypothesis system.

Anchoring an annotation to its place in a document

Using this same set of modules, a tool or system can retrieve an annotation from a web service and anchor it to a document in the place where it belongs. You can try the second example in the README at https://github.com/judell/TextQuoteAndPosition to see how this works.

For a real-world demonstration of this technique, see Science in the Classroom. It’s a project sponsored by The American Association for the Advancement of Science. Graduate students annotate research papers selected from the Science family of journals so that younger students can learn about the terminology, methods, and outcomes of scientific research.

Pre-Hypothesis, annotations on these papers were displayed using Learning Lens, a viewer that color-codes them by category.

Nothing about Learning Lens changed when Hypothesis came into the picture; it just provided a better way to record the annotations. Originally that was done as it’s often done in the absence of a formal way to describe annotation targets: by passing notes like “highlight the word ‘proximodistal’ in the first paragraph of the abstract, and attach this note to it.” This kind of thing happens a lot, and wherever it does there’s an opportunity to adopt a more rigorous approach. Nowadays at Science in the Classroom the annotators use Hypothesis to describe where notes should anchor, as well as what they should say. When an annotated page loads it searches Hypothesis for annotations that target the page, and inserts them using the same format that’s always been used to drive the Learning Lens. Tags assigned by annotators align with Learning Lens categories. The search looks only for notes from designated annotators, so nothing unwanted will appear.

An annotation-powered survey

The Credibility Coalition is “a research community that fosters collaborative approaches to understanding the veracity, quality and credibility of online information.” We worked with them on a project to test a set of signals that bear on the credibility of news stories. Examples of such signals include:

  • Title Representativeness (Does the title of an article accurately reflect its content?)
  • Sources (Does the article cite sources?)
  • Acknowledgement of uncertainty (Does the author acknowledge uncertainty, or the possibility things might be otherwise?)

Volunteers were asked these questions for each of a set of news stories. Many of the questions were yes/no or multiple choice and could have been handled by any survey tool. But some were different. What does “acknowledgement of uncertainty” look like? You know it when you see it, and you can point to examples. But how can a survey tool solicit answers that refer to selections in documents, and record their locations and contexts?

The answer was to create a survey tool that enabled respondents to answer such questions by highlighting one or more selections. Like the HelloWorldAnnotated example above, this was a bespoke client that guided the user through a prescribed workflow. In this case, that workflow was more complex. And because it was defined in a declarative way, the same app can be used for any survey that requires people to provide answers that refer to selections in web documents.

A JavaScript wrapper for the Hypothesis API

The HelloWorldAnnotated example uses functions from a library, hlib, to post an annotation to the Hypothesis service. That library includes functions for searching and posting annotations using the Hypothesis API. It also includes support for interaction patterns common to annotation apps, most of which occur in facet, a standalone tool that searches, displays, and exports sets of annotations. Supported interactions include:

– Authenticating with an API token

– Creating a picklist of groups accessible to the authenticated user

– Assembling and displaying conversation threads

– Parsing annotations

– Editing annotations

– Editing tags

In addition to facet, other tools based on this library include CopyAnnotations and TagRename.

A Python wrapper for the Hypothesis API

If you’re working in Python, hypothesis-api is an alternative API wrapper that supports searching for, posting, and parsing annotations.

Notifications

If you’re a publisher who embeds Hypothesis on your site, you can use a wildcard search to find annotations. But it would be helpful to be notified when annotations are posted. h_notify is a tool that uses the Hypothesis API to watch for annotations on individual or wildcard URLs, or from particular users, or in a specified group, or with a specified tag.

When an h_notify-based watcher finds notes in any of these ways, it can send alerts to a Slack channel, or to an email address, or add items to an RSS feed.

At Hypothesis we mainly rely on the Slack option. In this example, user nirgendheim highlighted the word “interacting” in a page on the Hypothesis website.

The watcher sent this notice to our #website channel in Slack.

A member of the support team (Hypothesis handle mdiroberts) saw it there and responded to nirgendheim as shown above. How did nirgendheim know that mdiroberts had responded? The core Hypothesis system sends you an email when somebody replies to one of your notes. h_notify is for bulk monitoring and alerting.

A tiny Hypothesis server

People sometimes ask about connecting the Hypothesis client to an alternate server in order to retain complete control over their data. It’s doable: you can follow the instructions here to build and run your own server, and some people and organizations do that. Depending on need, though, that can entail more effort, and more moving parts, than may be warranted.

Suppose for example you’re part of a team of investigative journalists annotating web pages for a controversial story, or a team of analysts sharing confidential notes on web-based financial reports. The documents you’re annotating are public, but the notes you’re taking in a Hypothesis private group are so sensitive that you’d rather not keep them in the Hypothesis service. You’d ideally like to spin up a minimal server for that purpose: small, simple, and easy to manage within your own infrastructure.

Here’s a proof of concept. This tiny server clocks in at just 145 lines of Python with very few dependencies. It uses Python’s batteries-included SQLite module for annotation storage. The web framework is Pyramid only because that’s what I’m familiar with; it could as easily be Flask, the ultra-light framework typically used for this sort of thing.
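
The actual storage schema lives in that proof of concept; as a rough guess at the minimal shape such a store needs (the column choices here are mine, not the repo’s), a single table keyed by annotation id goes a long way.

create table if not exists annotation (
  id       text primary key,  -- id returned to the client
  uri      text,              -- page the annotation targets, for search
  username text,              -- creator, for search
  payload  text               -- the full JSON payload from the client
);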

A tiny app wrapped around those ingredients is all you need to receive JSON payloads from a Hypothesis client, and return JSON payloads when the client searches for annotations to anchor to a page.

The service is dockerized and easy to deploy. To test it I used the fly.io speedrun to create an instance at https://summer-feather-9970.fly.dev. Then I made the handful of small tweaks to the Hypothesis client shown in client-patches.txt. My method for doing that, typical for quick proofs of concept that vary the Hypothesis client in some small way, goes like this:

  • Clone the Hypothesis client.
  • Edit gulpfile.js to say const IS_PRODUCTION_BUILD = false. This turns off minification so it’s possible to read and debug the client code.
  • Follow the instructions to run the client from a browser extension. After establishing a link between the client repo and browser-extension repo, as per those instructions, use this build command — make build SETTINGS_FILE=settings/chrome-prod.json — to create a browser extension that authenticates to the Hypothesis production service.
  • In a Chromium browser (e.g. Chrome or Edge or Brave) use chrome://extensions, click Load unpacked, and point to the browser-extension/build directory where you built the extension.

This is the easiest way to create a Hypothesis client in which to try quick experiments. There are tons of source files in the repos, but just a handful of bundles and loose files in the built extension. You can run the extension, search and poke around in those bundles, set breakpoints, make changes, and see immediate results.

In this case I only made the changes shown in client-patches.txt:

  • In options/index.html I added an input box to name an alternate server.
  • In options/options.js I sync that value to the cloud and also to the browser’s localStorage.
  • In the extension bundle I check localStorage for an alternate server and, if present, modify the API request used by the extension to show the number of notes found for a page.
  • In the sidebar bundle I check localStorage for an alternate server and, if present, modify the API requests used to search for, create, update, and delete annotations.

I don’t recommend this cowboy approach for anything real. If I actually wanted to use this tweaked client I’d create branches of the client and the browser-extension, and transfer the changes into the source files where they belong. If I wanted to share it with a close-knit team I’d zip up the extension so colleagues could unzip and sideload it. If I wanted to share more broadly I could upload the extension to the Chrome web store. I’ve done all these things, and have found that it’s feasible — without forking Hypothesis — to maintain branches that maintain small but strategic changes like this one. But when I’m aiming for a quick proof of concept, I’m happy to be a cowboy.

In any event, here’s the proof. With the tiny server deployed to summer-feather-9970.fly.dev, I poked that address into the tweaked client.

And sure enough, I could search for, create, reply to, update, and delete annotations using that 145-line SQLite-backed server.

The client still authenticates to Hypothesis in the usual way, and behaves normally unless you specify an alternate server. In that case, the server knows nothing about Hypothesis private groups. The client sees it as the Hypothesis public layer, but it’s really the moral equivalent of a private group. Others will see it only if they’re running the same tweaked extension and pointing to the same server. You could probably go quite far with SQLite but, of course, it’s easy to see how you’d swap it out for a more robust database like Postgres.

Signposts

I think of these examples as signposts pointing to a coherent SDK for weaving annotation into any document workflow. They show that it’s feasible to decouple and recombine the core operations: creating an annotation based on a selection in a document, and anchoring an annotation to its place in a document. Why decouple? Reasons are as diverse as the document workflows we engage in. The stock Hypothesis system beautifully supports a wide range of scenarios. Sometimes it’s helpful to replace or augment Hypothesis with a custom app that provides a guided experience for annotators and/or an alternative display for readers. The annotation SDK I envision will make it straightforward for developers to build solutions that leverage the full spectrum of possibility.

End notes

https://www.infoworld.com/article/3263344/how-web-annotation-will-transform-content-management.html

Weaving the annotated web

Working with interdependent Postgres functions and materialized views

In Working with Postgres types I showed an example of a materialized view that depends on a typed set-returning function. Because Postgres knows about that dependency, it won’t allow DROP FUNCTION foo. Instead it requires DROP FUNCTION foo CASCADE.
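
A quick illustration with throwaway names, not code from that earlier post:

create function one() returns setof int as $$ select 1 $$ language sql;
create materialized view just_one as (select * from one());

drop function one();          -- fails: just_one depends on one()
drop function one() cascade;  -- succeeds, and drops just_one too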

A similar thing happens with materialized views that depend on tables or other materialized views. Let’s build a cascade of views and consider the implications.

create materialized view v1 as (
  select 
    1 as number,
    'note_count' as label 
);
SELECT 1

select * from v1;

number | label
-------+-------
     1 | note_count

Actually, before continuing the cascade, let’s linger here for a moment. This is a table-like object created without using CREATE TABLE and without explicitly specifying types. But Postgres knows the types.

\d v1;

Materialized view "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

The read-only view can become a read-write table like so.

create table t1 as (select * from v1);
SELECT 1

select * from t1;

number | label
-------+-------
     1 | note_count

\d t1

Table "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

This ability to derive a table from a materialized view will come in handy later. It’s also just interesting to see how the view’s implicit types become explicit in the table.

OK, let’s continue the cascade.

create materialized view v2 as (
  select 
    number + 1 as number,
    label
  from v1
);
SELECT 1

select * from v2;

number | label
-------+-------
     2 | note_count

create materialized view v3 as (
  select 
    number + 1 as number,
    label
  from v2
);
SELECT 1

select * from v3;

number | label
-------+-------
     3 | note_count

Why do this? Arguably you shouldn’t. Laurenz Albe makes that case in Tracking view dependencies in PostgreSQL. Recognizing that it’s sometimes useful, though, he goes on to provide code that can track recursive view dependencies.

I use cascading views advisedly to augment the use of CTEs and functions described in Postgres functional style. Views that refine views can provide a complementary form of the chunking that aids reasoning in an analytics system. But that’s a topic for another episode. In this episode I’ll describe a problem that arose in a case where there’s only a single level of dependency from a table to a set of dependent materialized views, and discuss my solution to that problem.

Here’s the setup. We have an annotation table that’s reloaded nightly. On an internal dashboard we have a chart based on the materialized view annos_at_month_ends_for_one_year which is derived from the annotation table and, as its name suggests, reports annotation counts on a monthly cycle. At the beginning of the nightly load, this happens: DROP TABLE annotation CASCADE. So the derived view gets dropped and needs to be recreated as part of the nightly process. But that’s a lot of unnecessary work for a chart that only changes monthly.

Here are two ways to protect a view from a cascading drop of the table it depends on. Both reside in a SQL script, monthly.sql, that only runs on the first of every month. First, annos_at_month_ends_for_one_year.

drop materialized view annos_at_month_ends_for_one_year;
create materialized view annos_at_month_ends_for_one_year as (
  with last_days as (
    select 
      last_days_of_prior_months( 
        date(last_month_date() - interval '6 year')
      ) as last_day
    ),
    monthly_counts as (
      select
        to_char(last_day, '_YYYY-MM') as end_of_month,
        anno_count_between( 
          date(last_day - interval '1 month'), last_day
        ) as monthly_annos
      from last_days
    )
    select
      end_of_month,
      monthly_annos,
      sum(monthly_annos) over 
        (order by end_of_month asc rows 
           between unbounded preceding and current row
        ) as cumulative_annos
    from monthly_counts
) with data;

Because this view depends indirectly on the annotation table — by way of the function anno_count_between — Postgres doesn’t see the dependency. So the view isn’t affected by the cascading drop of the annotation table. It persists until, once a month, it gets dropped and recreated.
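
To make the indirection visible, here’s a hedged sketch of what anno_count_between might look like; the real definition isn’t shown in this post. To Postgres the function body is just text, so no dependency on the annotation table gets recorded for the function, or for the view that calls it.

create or replace function anno_count_between(_start date, _end date)
returns bigint as $$
  -- the reference to annotation is hidden inside the function body
  select count(*) from annotation
  where created >= _start and created < _end;
$$ language sql;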

What if you want Postgres to know about such a dependency, so that the view will participate in a cascading drop? You can do this.

create materialized view annos_at_month_ends_for_one_year as (
  with depends as (
    select * from annotation limit 1
  ),
  last_days as (
    -- ... as above ...
  ),
  monthly_counts as (
    -- ... as above ...
  )
  select
    *
  from monthly_counts
) with data;

The depends CTE doesn’t do anything relevant to the query, it just tells Postgres that this view depends on the annotation table.

Here’s another way to protect a view from a cascading drop. This expensive-to-build view depends directly on the annotation table but only needs to be updated monthly. So in this case, cumulative_annotations is a table derived from a temporary materialized view.

create materialized view _cumulative_annotations as (
  with data as (
    select 
      to_char(created, 'YYYY-MM') as created,
      count(*) as count
    from annotation
    group by 1
  ) 
  select 
    data.created, 
    sum(data.count) 
      over ( 
        order by data.created asc 
        rows between unbounded preceding and current row
      ) as cumulative_count
  from data
  order by data.created
);

drop table cumulative_annotations;

create table cumulative_annotations as (
  select * from _cumulative_annotations
);

drop materialized view _cumulative_annotations;

The table cumulative_annotations is only rebuilt once a month. It depends indirectly on the annotation table but Postgres doesn’t see that, so doesn’t include it in the cascading drop.

Here’s the proof.

-- create a table
create table t1 (number int);

insert into t1 (number) values (1);
INSERT 0 1

select * from t1;

number
-------
     1

-- derive a view from t1
create materialized view v1 as (select * from t1);
SELECT 1

select * from v1

number
-------
     1

-- try to drop t1
drop table t1;

ERROR: cannot drop table t1 because other objects depend on it
DETAIL: materialized view v1 depends on table t1
HINT: Use DROP ... CASCADE to drop the dependent objects too.

-- derive an independent table from t1 by way of a matview
drop materialized view v1;

create materialized view v1 as (select * from t1);
SELECT 1

create table t2 as (select * from v1);
SELECT 1

-- drop the matview
drop materialized view v1;

-- drop t1
drop table t1;

-- no complaint, and t2 still exists
select * from t2;

number
-------
     1

These are two ways I’ve found to protect a long-lived result set from the cascading drop of a short-lived table on which it depends. You can hide the dependency behind a function, or you can derive an independent table by way of a transient materialized view. I use them interchangeably, and don’t have a strong preference one way or another. Both lighten the load on the analytics server. Materialized views (or tables) that only need to change weekly or monthly, but were being dropped nightly by cascade from core tables, are now recreated only on their appropriate weekly or monthly cycles.



Working in a hybrid Metabase / Postgres code base

In this series I’m exploring how to work in a code base that lives in two places: Metabase questions that encapsulate chunks of SQL, and Postgres functions/procedures/views that also encapsulate chunks of SQL. To be effective working with this hybrid collection of SQL chunks, I needed to reason about their interrelationships. One way to do that entailed the creation of a documentation generator that writes a concordance of callers and callees.

Here’s the entry for the function sparklines_for_guid(_guid).

The called by column says that this function is called from two different contexts. One is a Metabase question, All courses: Weekly activity. If you’re viewing that question in Metabase, you’ll find that its SQL text is simply this:

select * from sparklines_for_guid( {{guid}} )

The same function call appears in a procedure, cache warmer, that preemptively memoizes the function for a set of the most active schools in the system. In either case, you can look up the function in the concordance, view its definition, and review how it’s called.

In the definition of sparklines_for_guid, names of other functions (like guid_hashed_view_exists) appear and are linked to their definitions. Similarly, names of views appearing in SELECT or JOIN contexts are linked to their definitions.

Here’s the entry for the function guid_hashed_view_exists. It is called by sparklines_for_guid as well as by functions that drive panels on the school dashboard. It links to the functions it uses: hash_for_guid and exists_view.

Here’s the entry for the view lms_course_groups which appears as a JOIN target in sparklines_for_guid. This central view is invoked — in SELECT or JOIN context — from many functions, from dependent views, and from Metabase questions.

Metabase questions can also “call” other Metabase questions. In A virtuous cycle for analytics I noted: “Queries can emit URLs in order to compose themselves with other queries.” Here’s an example of that.

This Metabase question (985) calls various linked functions, and is called by two other Metabase questions. Here is one of those.

Because this Metabase question (600) emits an URL that refers to 985, it links to the definition of 985. It also links to the view, top_annotated_domains_last_week, from which it SELECTs.

It was straightforward to include Postgres functions, views, and procedures in the concordance since these live in files that reside in the filesystem under source control. Metabase questions, however, live in a database — in our case, a Postgres database that’s separate from our primary Postgres analytics db. In order to extract them into a file I use this SQL snippet.

select
  r.id,
  m.name as dbname,
  r.name,
  r.description,
  r.dataset_query
from report_card r
join metabase_database m
  on m.id = cast(r.dataset_query::jsonb->>'database' as int)
where not r.archived
order by r.id;

The doc generator downloads that Metabase data as a CSV file, queries.csv, and processes it along with the files that contain the definitions of functions, procedures, and views in the Postgres data warehouse. It also emits queries.txt which is a more convenient way to diff changes from one commit to the next.
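
One way to pull that data down, assuming psql access to the Metabase application database, is psql’s \copy; it has to live on a single line, and the file name matches the one mentioned above.

\copy (select r.id, m.name as dbname, r.name, r.description, r.dataset_query from report_card r join metabase_database m on m.id = cast(r.dataset_query::jsonb->>'database' as int) where not r.archived order by r.id) to 'queries.csv' csv header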

This technique solved a couple of problems. First, when we were only using Metabase — unaugmented by anything in Postgres — it enabled us to put our Metabase SQL under source control and helped us visualize relationships among queries.

Later, as we augmented Metabase with Postgres functions, procedures, and views, it became even more powerful. Developing a new panel for a school or course dashboard means writing a new memoized function. That process begins as a Metabase question with SQL code that calls existing Postgres functions, and/or JOINs/SELECTs FROM existing Postgres views. Typically it then leads to the creation of new supporting Postgres functions and/or views. All this can be tested by internal users, or even invited external users, in Metabase, with the concordance available to help understand the relationships among the evolving set of functions and views.

When supporting functions and views are finalized, the SQL content of the Metabase question gets wrapped up in a memoized Postgres function that’s invoked from a panel of a dashboard app. At that point the concordance links the new wrapper function to the same set of supporting functions and views. I’ve found this to be an effective way to reason about a hybrid code base as features move from Metabase for prototyping to Postgres in production, while maintaining all the code under source control.

That foundation of source control is necessary, but maybe not sufficient, for a team to consider this whole approach viable. The use of two complementary languages for in-database programming will certainly raise eyebrows, and if it’s not your cup of tea I completely understand. If you do find it appealing, though, one thing you’ll wonder about next will be tooling. I work in VSCode nowadays, for which I’ve not yet found a useful extension for pl/pgsql or pl/python. With metaprogramming life gets even harder for aspiring pl/pgsql or pl/python VSCode extensions. I can envision them, but I’m not holding my breath awaiting them. Meanwhile, two factors enable VSCode to be helpful even without deep language-specific intelligence.

The first factor, and by far the dominant one, is outlining. In Products and capabilities I reflect on how I’ve never adopted an outlining product, but often rely on outlining capability infused into a product. In VSCode that’s “only” basic indentation-driven folding and unfolding. But I find it works remarkably well across SQL queries, views and functions that embed them, CTEs that comprise them, and pl/pgsql or pl/python functions called by them.

The second factor, nascent thus far, is GitHub Copilot. It’s a complementary kind of language intelligence that’s aware of, but not bounded by, what a file extension of .sql or .py implies. It can sometimes discern patterns that mix language syntaxes and offer helpful suggestions. That hasn’t happened often so far, but it’s striking when it does. I don’t yet know the extent to which it may be training me while I train it, or how those interactions might influence others. At this point I’m not a major Copilot booster, but I am very much an interested observer of and participant in the experiment.

All in all, I’m guardedly optimistic that existing or feasible tooling can enable individuals and teams to sanely navigate the hybrid corpus of source code discussed in this series. If you’re along for the ride, you’ll next wonder about debugging and monitoring a system built this way. That’s a topic for a future episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres functional style

My dual premises in this series are:

– Modern SQL is more valuable as a programming language than you might think (see Markus Winand’s Modern SQL: A lot has changed since SQL-92)

– Postgres is more valuable as a programming environment than you might think. (see R0ml Lefkowitz’s The Image of Postgres)

As the patron saint of trailing edge technology it is my duty to explore what’s possible at the intersection of these two premises. The topic for this episode is Postgres functional style. Clearly what I’ve been doing with the combo of pl/python and pl/pgsql is very far from pure functional programming. The self-memoization technique shown in episode 7 is all about mutating state (ed: this means writing stuff down somewhere). But it feels functional to me in the broader sense that I’m using functions to orchestrate behavior that’s expressed in terms of SQL queries.

To help explain what I mean, I’m going to unpack one of the Postgres functions in our library.

count_of_distinct_lms_students_from_to(_guid text, _from date, _to date)

This is a function that accepts a school id (aka guid), a start date, and an end date. Its job is to:

– Find all the courses (groups) for that school (guid)

– Filter to those created between the start and end date

– Find all the users in the filtered set of courses

– Filter to just students (i.e. omit instructors)

– Remove duplicate students (i.e., who are in more than one course)

– Return the count of distinct students at the school who annotated in the date range

The production database doesn’t yet store things in ways friendly to this outcome, so doing all this requires some heavy lifting in the analytics data warehouse. Here’s the function that orchestrates that work.

create function count_of_distinct_lms_students_from_to(_guid text, _from date, _to date) 
  returns bigint as $$
  declare count bigint;
  begin
 1  -- all groups active for the guid in the date range
 2  with groups as (
 3    select pubid from groups_for_guid(_guid)
 4    where group_is_active_from_to(pubid, _from, _to)
 5  ),
 6  -- usernames in those groups
 7  usernames_by_course as (
 8    select
 9      pubid,
10      (users_in_group(pubid)).username 
11  from groups 
12  ),
13  -- filtered to just students
14  students_by_course as (
15    select * from usernames_by_course
16    where not is_instructor(username, pubid)
17  )
18  select 
19    count (distinct username) 
20  from students_by_course
    into count;
    return count;
  end;
$$ language plpgsql;

If you think pl/pgsql is old and clunky, then you are welcome to do this in pl/python instead. There’s negligible difference between how they’re written and how fast they run. It’s the same chunk of SQL either way, and it exemplifies the functional style I’m about to describe.
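For comparison, here’s a minimal pl/python sketch of the same orchestration. The _py suffix is only to avoid a name collision with the version above; the SQL inside is the same pipeline.

create function count_of_distinct_lms_students_from_to_py(_guid text, _from date, _to date)
  returns bigint as $$
  # the same pipeline of CTEs, composed as a Python string and run via plpy
  sql = f"""
    with groups as (
      select pubid from groups_for_guid('{_guid}')
      where group_is_active_from_to(pubid, '{_from}', '{_to}')
    ),
    usernames_by_course as (
      select pubid, (users_in_group(pubid)).username from groups
    ),
    students_by_course as (
      select * from usernames_by_course
      where not is_instructor(username, pubid)
    )
    select count(distinct username) as n from students_by_course
  """
  return plpy.execute(sql)[0]['n']
$$ language plpython3u;

Either way, the function is a thin wrapper that hands a single SQL pipeline to the engine.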

Two kinds of chunking work together here: CTEs (aka common table expressions, aka WITH clauses) and functions. If you’ve not worked with SQL for a long time, as I hadn’t, then CTEs may be unfamiliar. I think of them as pipelines of table transformations in which each stage of the pipeline gets a name. In this example those names are groups (line 2), usernames_by_course (line 7), and students_by_course (line 14).

The pipeline phases aren’t functions that accept parameters, but I still think of them as being function-like in the sense that they encapsulate named chunks of behavior. The style I’ve settled into aims to make each phase of the pipeline responsible for a single idea (“groups active in the range”, “usernames in those groups”), and to express that idea in a short snippet of SQL.

As I’m developing one of these pipelines, I test each phase. To test the first phase, for example, I’d do this in psql or Metabase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
)
select * from groups;

And I’d spot-check to make sure the selected groups for that school really are in the date range. Then I’d check the next phase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
),
-- usernames in those groups
usernames_by_course as (
  select
    pubid,
    (users_in_group(pubid)).username 
  from groups 
)
select * from usernames_by_course;

After another sanity check against these results, I’d continue to the next phase, and eventually arrive at the final result. It’s the same approach I take with regular expressions. I am unable to visualize everything that’s happening in a complex regex. But I can reason effectively about a pipeline of matches that occur in easier-to-understand named steps.

Ideally each phase in one of these pipelines requires just a handful of lines of code: few enough to fit within the 7 ± 2 limit of working memory. Postgres functions make that possible. Here are the functions used in this 20-line chunk of SQL.

groups_for_guid(guid): Returns a table of course ids for a school.

group_is_active_from_to(pubid, _from, _to): Returns true if the group was created in the range.

users_in_group(pubid): Returns a table of user info for a course.

is_instructor(username, pubid): Returns true if that user is an instructor.

Two of these, groups_for_guid and users_in_group, are set-returning functions. As noted in Working with Postgres types, they have the option of returning an explicit Postgres type defined elsewhere, or an implicit Postgres type defined inline. As it happens, both do the latter.

create or replace function groups_for_guid(_guid text)
  returns table(
    pubid text
  ) as $$
create or replace function users_in_group (_pubid text)
  returns table (
    groupid text, 
    username text, 
    display_name text
  ) as $$

The other two, group_is_active_from_to and is_instructor, return boolean values.
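For a sense of their shape, here’s a minimal sketch of one of those boolean predicates. The lms_group table and its created column are hypothetical stand-ins; the real definition differs.

create function group_is_active_from_to(_pubid text, _from date, _to date)
  returns boolean as $$
  -- lms_group and created are illustrative names, not the real schema
  select exists (
    select 1
    from lms_group g
    where g.pubid = _pubid
      and g.created::date between _from and _to
  );
$$ language sql;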

All this feels highly readable to me now, but the syntax of line 10 took quite a while to sink in. It helps me to look at what users_in_group(pubid) does in a SELECT context.

select * from users_in_group('4VzA92Yy')

groupid   | username    | display_name
----------+-------------+----------------
4VzA92Yy  | 39vA94AsQp  | Brendan Nadeau

Here is an alternate way to write the usernames_by_course CTE at line 7.

-- usernames in those groups
usernames_by_course as (
  select
    g.pubid,
    u.username
from groups g 
join users_in_group(g.pubid) u on g.pubid = u.groupid
)
select * from usernames_by_course;

Both do exactly the same thing in very close to the same amount of time. Having mixed the two styles I’m leaning toward the first, but you could go either way or both. What matters more is the mental leverage you wield when writing CTEs and functions together to compose pipelines of transformations, and that others wield when reading and debugging.

I hope I’ve made the case for writing and reading. There’s a case to be made for debugging too, but that’s another episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres set-returning functions that self-memoize as materialized views

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection. In episode 5 I discussed metaprogramming, by which I mean using pl/python to compose and run SQL code. This episode features introspection, by which I mean taking advantage of Python’s inspect module to enable a pl/python function to discover its own name.

Why do that? In this context, so that the function can create a materialized view by joining its own name with the value of its first parameter. Here’s the example from episode 5.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    -- SQL THAT RETURNS A SETOF QUESTION_AND_ANSWER_FOR_GROUP
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

The function drives a panel on the course dashboard. An initial call to, say, questions_and_answers_for_group('P1mQaEEp'), creates the materialized view questions_and_answers_for_group_p1mqaeep and returns SELECT * from the view. Subsequent calls skip creating the view and just return SELECT * from it.
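A quick way to watch this happen from psql, using the group id from that example, is to call the function and then look for the view it leaves behind in the pg_matviews catalog view:

select count(*) from questions_and_answers_for_group('P1mQaEEp');

select matviewname
from pg_matviews
where matviewname = 'questions_and_answers_for_group_p1mqaeep';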

Note that even though the group name is mixed case, the view name created by Postgres is all lowercase. For example:

create materialized view test_AbC as (select 'ok') with data;
SELECT 1

\d test_AbC
Materialized view "public.test_abc"

I want to think of this as a form of capability injection, but it’s really more like a capability wrapper. The capability is memoization. A function endowed with it can run a SQL query and cache the resulting rows in a materialized view before returning them to a SQL SELECT context. The wrapper is boilerplate code that discovers the function’s name, checks for the existence of a corresponding view, and if it isn’t found, calls memoize_view_name(sql, view_name) to run an arbitrary chunk of SQL code whose result set matches the function’s type. So in short: this pattern wraps memoization around a set-returning pl/python function.

As noted in episode 5, although memoize_view_name is called from pl/python, it is not itself a pl/python function. It’s a normal Python function in a module that’s accessible to the instance of Python that the Postgres pl/python extension is using. In my case that module is just a few small functions in a file called plpython_helpers.py, installed (cough, copied) to the postgres user’s ~/.local/lib/python3.8/site-packages/plpython_helpers.py.

So far, there are only two critical functions in that module: get_caller_name() and memoize_view_name().

Here is get_caller_name().

import inspect
import re

def get_caller_name():
  base_view_name = inspect.stack()[1][3].replace('__plpython_procedure_', '')
  return re.sub(r'_\d+$', '', base_view_name)

The internal name for a pl/python function created by CREATE FUNCTION foo() looks like __plpython_procedure_foo_981048462. What get_caller_name() returns is just foo.

Here’s memoize_view_name().

import base64
import os

def memoize_view_name(sql, view_name):
  sql = sql.replace('\n', ' ')
  encoded_bytes = base64.b64encode(sql.encode('utf-8'))
  encoded_str = str(encoded_bytes, 'utf-8')
  cmd = f"""psql -d h_analytics -c "call memoizer('{encoded_str}', '{view_name}')" """
  result = os.system(cmd)
  print(f'memoize_view_name: {cmd} result: {result}')

Given a chunk of SQL and the name of a view, it converts newlines to spaces, base64-encodes the query text, and invokes psql to call a procedure, memoizer, that does the work of running the SQL query and creating the materialized view from those results. So for example the function that yields sparkline data for a school might look like sparkline_data_for_school('af513ee'), and produce the view sparkline_data_for_school_af513ee.

Why shell out to psql here? It may not be necessary; there may be a way to manage the transaction directly within the function, but if so I haven’t found it. I’m very far from being an expert on transaction semantics and will appreciate guidance here if anyone cares to offer it. Meanwhile, this technique seems to work well. memoizer is a Postgres procedure, not a function. Although “stored procedures” is the term that I’ve always associated with in-database programming, I went pretty far down this path using only CREATE FUNCTION, never CREATE PROCEDURE. When I eventually went there I found the distinction between functions and procedures to be a bit slippery. This StackOverflow answer matches what I’ve observed.

PostgreSQL 11 added stored procedures as a new schema object. You can create a new procedure by using the CREATE PROCEDURE statement.

Stored procedures differ from functions in the following ways:

  • Stored procedures do not have to return anything, and only return a single row when using INOUT parameters.

  • You can commit and rollback transactions inside stored procedures, but not in functions.

  • You execute a stored procedure using the CALL statement rather than a SELECT statement.

  • Unlike functions, procedures cannot be nested in other DML commands (SELECT, INSERT, UPDATE, DELETE).
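To make the contrast concrete, here’s a minimal sketch that isn’t part of our codebase; the ping table and record_ping procedure are hypothetical.

create table if not exists ping (pinged_at timestamptz);

create procedure record_ping() as $$
  begin
    insert into ping (pinged_at) values (now());
    commit;  -- legal in a procedure, not in a function
  end;
$$ language plpgsql;

call record_ping();        -- procedures are invoked with CALL
-- select record_ping();   -- would fail: a procedure can't be called in a SELECT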

Here is the memoizer procedure. It happens to be written in pl/python but could as easily have been written in pl/pgsql using the built-in Postgres decode function. Procedures, like functions, can be written in either language (or others) and share the common Postgres type system.

create procedure memoizer(_sql text, _view_name text) as $$
  import base64
  decoded_bytes = base64.b64decode(_sql)
  decoded_str = str(decoded_bytes, 'utf-8')
  create = f"""
    create materialized view if not exists {_view_name} as (
      {decoded_str}
    ) with data;
    """
  plpy.execute(create)
  permit = f"""
    grant select on {_view_name} to analytics;
  """
  plpy.execute(permit)
$$ language plpython3u;

There’s no plpy.commit() here because psql takes care of that automatically. Eventually I wrote other procedures, some of which do their own committing, but that isn’t needed here.

Of course it’s only possible to shell out to psql from a function because pl/python is an “untrusted” language extension. Recall from episode 1:

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Using Python’s os.system() to invoke psql is another of those superpowers. It’s not something I do lightly, and if there’s a better/safer way I’m all ears.

Meanwhile, this approach is delivering much value. We have two main dashboards, each of which displays a dozen or so panels. The school dashboard reports on annotation activity across all courses at a school. The course dashboard reports on the documents, and selections within documents, that instructors and students are discussing in the course’s annotation layer. Each panel that appears on the school or course dashboard is the output of a memoized function that is parameterized by a school or course id.

The data warehouse runs on a 24-hour cycle. Within that cycle, the first call to a memoized function takes just as long as it takes to run the SQL wrapped by the function. The cached view only comes into play when the function is called again during the same cycle. That can happen in a few different ways.

– A user reloads a dashboard, or a second user loads it.

– A panel expands or refines the results of another panel. For example, questions_and_answers_for_group() provides a foundation for a family of related functions including:

questions_asked_by_teacher_answered_by_student()

questions_asked_by_student_answered_by_teacher()

questions_asked_by_student_answered_by_student()

– A scheduled job invokes a function in order to cache its results before any user asks for them. For example, the time required to cache panels for school dashboards varies a lot. For schools with many active courses it can take minutes to run those queries, so preemptive memoization matters a lot. For schools with fewer active courses it’s OK to memoize on the fly. This method enables flexible cache policy. Across schools we can decide how many of the most-active ones to cache. Within a school, we can decide which courses to cache, e.g. most recent, or most active. The mechanism to display a dashboard panel is always the same function call. The caching done in support of that function is highly configurable.

Caches, of course, must be purged. Since these materialized views depend on core tables it was enough, early on, to do this for views depending on the annotation table.

drop table annotation cascade;

At a certain point, with a growing number of views built during each cycle, the cascade failed.

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

That wasn’t the answer. Instead we switched to enumerating views and dropping them individually. Again that afforded great flexibility. We can scan the names in the pg_matviews system table and match all the memoized views, or just those for a subset of schools, or just particular panels on school or course dashboards. Policies that govern the purging of cached views can be as flexible as those that govern their creation.
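The enumeration is just a query against pg_matviews, and the drop loop is a few lines of pl/pgsql. Here’s a sketch of the sort of thing we run; the name pattern is one illustrative family of memoized views.

do $$
declare
  v text;
begin
  for v in
    select matviewname
    from pg_matviews
    where matviewname like 'questions_and_answers_for_group_%'
  loop
    execute format('drop materialized view if exists %I', v);
  end loop;
end;
$$;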


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres and JSON: Finding document hotspots (part 1)

One of the compelling aspects of modern SQL is the JSON support built into modern engines, including Postgres. The documentation is well done, but I need examples to motivate my understanding of where and how and why to use such a capability. The one I’ll use in this episode is something I call document hotspots.

Suppose a teacher has asked her students to annotate Arthur Miller’s The Crucible. How can she find the most heavily-annotated passages? They’re visible in the Hypothesis client, of course, but may be sparsely distributed. She can scroll through the 154-page PDF document to find the hotspots, but it will be helpful to see a report that brings them together. Let’s do that.

The Hypothesis system stores annotations using a blend of SQL and JSON datatypes. Consider this sample annotation:

When the Hypothesis client creates that annotation it sends a JSON payload to the server. Likewise, when the client subsequently requests the annotation in order to anchor it to the document, it receives a similar JSON payload.

{
  "id": "VLUhcP1-EeuHn5MbnGgJ0w",
  "created": "2021-08-15T04:07:39.343275+00:00",
  "updated": "2021-08-15T04:07:39.343275+00:00",
  "user": "acct:judell@hypothes.is",
  "uri": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
  "text": "\"He is no one's favorite clergyman.\"  :-)\n\nhttps://www.thoughtco.com/crucible-character-study-reverend-parris-2713521",
  "tags": [],
  "group": "__world__",
  "permissions": {
    "read": [
      "group:__world__"
    ],
    "admin": [
      "acct:judell@hypothes.is"
    ],
    "update": [
      "acct:judell@hypothes.is"
    ],
    "delete": [
      "acct:judell@hypothes.is"
    ]
  },
  "target": [
    {
      "source": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
      "selector": [
        {
          "end": 44483,
          "type": "TextPositionSelector",
          "start": 44392
        },
        {
          "type": "TextQuoteSelector",
          "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.",
          "prefix": " sixty-six pound, Mr. Proctor! I",
          "suffix": " Giles: Aye, and well instructed"
        }
      ]
    }
  ],
  "document": {
    "title": [
      "The%20Crucible%20full%20text.pdf"
    ]
  },
  "links": {
    "html": "https://hypothes.is/a/VLUhcP1-EeuHn5MbnGgJ0w",
    "incontext": "https://hyp.is/VLUhcP1-EeuHn5MbnGgJ0w/ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
    "json": "https://hypothes.is/api/annotations/VLUhcP1-EeuHn5MbnGgJ0w"
  },
  "user_info": {
    "display_name": "Jon Udell"
  },
  "flagged": false,
  "hidden": false
}

The server mostly shreds this JSON into conventional SQL types. The tags array, for example, is hoisted out of the JSON into a SQL array-of-text. The expression to find its length is a conventional Postgres idiom: array_length(tags,1). Note the second parameter; array_length(tags) is an error, because Postgres arrays can be multidimensional. In this case there’s only one dimension but it’s still necessary to specify that.
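For example, a quick way to eyeball the most heavily tagged annotations (not a query from our dashboards, just an illustration of the idiom) is:

select id, array_length(tags, 1) as tag_count
from annotation
where array_length(tags, 1) > 0
order by tag_count desc
limit 10;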

A target_selectors column, though, is retained as JSON. These selectors define how an annotation anchors to a target selection in a document. Because selectors are used only by the Hypothesis client, which creates and consumes them in JSON format, there’s no reason to shred them into separate columns. In normal operation, selectors don’t need to be related to core tables. They can live in the database as opaque blobs of JSON.

For some analytic queries, though, it is necessary to peer into those blobs and relate their contents to core tables. There’s a parallel set of functions for working with JSON. For example, the target_selectors column corresponds to the target[0]['selector'] array in the JSON representation. The expression to find the length of that array is jsonb_array_length(target_selectors).

Here’s a similar expression that won’t work: json_array_length(target_selectors). Postgres complains that the function doesn’t exist.

ERROR: function json_array_length(jsonb) does not exist 
Hint: No function matches the given name and argument types.

In fact both functions, json_array_length and jsonb_array_length, exist. But Postgres knows the target_selectors column is of type jsonb, not json, which is the type json_array_length expects. What’s the difference between json and jsonb?

The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/12/datatype-json.html

Although I tend to use JSON to refer to data in a variety of contexts, the flavor of JSON in the Postgres queries, views, and functions I’ll discuss will always be jsonb. The input conversion overhead isn’t a problem for analytics work that happens in a data warehouse, and the indexing support is a tremendous enabler.
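As a sketch of what that indexing support looks like (whether or not the warehouse actually defines this particular index), a GIN index on the column enables fast containment queries with the @> operator:

-- the index name is illustrative
create index if not exists annotation_target_selectors_idx
  on annotation using gin (target_selectors);

-- count annotations that anchor with a TextQuoteSelector
select count(*)
from annotation
where target_selectors @> '[{"type": "TextQuoteSelector"}]';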

To illustrate some of the operators common to json and jsonb, here is a query that captures the target_selectors column from the sample annotation.

with example as (
  select 
    id,
    target_selectors as selectors
  from annotation
  where id = '54b52170-fd7e-11eb-879f-931b9c6809d3'
  )
  select * from example;

Here are some other queries against example.

select selectors from example;

[{"end": 44483, "type": "TextPositionSelector", "start": 44392}, { ... } ]

The result is a human-readable representation, but the type of selectors is jsonb.

select pg_typeof(selectors) from example;

jsonb

The array-indexing operator, ->, can yield the zeroth element of the array.

select selectors->0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result is again a human-readable representation of a jsonb type.

select pg_typeof(selectors->0) from example;

jsonb

Another array-indexing operator, ->>, can also yield the zeroth element of the array, but now as type text.

select selectors->>0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result looks the same, but the type is different.

select pg_typeof(selectors->>0) from example;

text

The -> and ->> operators can also index objects by their keys. These examples work with the object that is the zeroth element of the array.

select selectors->0->'type' from example;

"TextPositionSelector"

select pg_typeof(selectors->0->'type') from example;

jsonb

select selectors->0->>'type' from example;

TextPositionSelector

select pg_typeof(selectors->0->>'type') from example;

text

The Hypothesis system stores the location of a target (i.e., the selection in a document to which an annotation refers) in the target_selectors column we’ve been exploring. Among the selectors it records, TextQuoteSelector represents the selection as the exact highlighted text bracketed by snippets of context, and TextPositionSelector represents it as a pair of numbers that mark the beginning and end of the selection. When one range formed by that numeric pair is equal to another, it means two students have annotated the same selection. When a range contains another range, it means one student annotated the containing range, and another student made an overlapping annotation on the contained range. We can use these facts to surface hotspots where annotations overlap exactly or in nested fashion.

To start, let’s have a function to extract the start/end range from an annotation. In a conventional programming language you might iterate through the selectors in the target_selectors array looking for the one with the type TextPositionSelector. That’s possible in pl/pgsql and pl/python, but Postgres affords a more SQL-oriented approach. Given a JSON array, the function jsonb_array_elements returns a table-like object with rows corresponding to array elements.

select jsonb_array_elements(selectors) from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}
{"type": "TextQuoteSelector", "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.", "prefix": " sixty-six pound, Mr. Proctor! I", "suffix": " Giles: Aye, and well instructed"}

A function can convert the array to rows, select the row of interest, select the start and end values from the row, package the pair of numbers as an array, and return the array.

create function position_from_anno(_id uuid) returns numeric[] as $$
  declare range numeric[];
  begin
    with selectors as (
      select jsonb_array_elements(target_selectors) as selector
      from annotation
      where id = _id
    ),
    position as (
      select
        (selector->>'start')::numeric as startpos,
        (selector->>'end')::numeric as endpos
      from selectors
      where selector->>'type' = 'TextPositionSelector'
    )
    select array[p.startpos, p.endpos] 
    from position p
    into range;
    return range;
  end;  
$$ language plpgsql;

Using it for the sample annotation:

select position_from_anno('54b52170-fd7e-11eb-879f-931b9c6809d3')

position_from_anno
------------------
{44392,44483}

I’ll show how to use position_from_anno to find document hotspots in a later episode. The goal here is just to introduce an example, and to illustrate a few of the JSON functions and operators.

What’s most interesting, I think, is this part.

where selector->>'type' = 'TextPositionSelector'

Although the TextPositionSelector appears as the first element of the selectors array, that isn’t guaranteed. In a conventional language you’d have to walk through the array looking for it. SQL affords a declarative way to find an element in a JSON array.
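The same declarative move works outside a function too. Here’s a sketch of an ad hoc version, using the sample annotation’s id:

select
  (selector->>'start')::numeric as startpos,
  (selector->>'end')::numeric as endpos
from annotation,
  jsonb_array_elements(target_selectors) as selector
where id = '54b52170-fd7e-11eb-879f-931b9c6809d3'
  and selector->>'type' = 'TextPositionSelector';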


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

pl/python metaprogramming

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection.

Although this episode focuses on metaprogramming — by which I mean using Python to dynamically compose and run SQL queries — my favorite example combines all three aspects.

The context for the example is an analytics dashboard with a dozen panels, each driven by a pl/python function that’s parameterized by the id of a school or a course. So, for example, the Questions and Answers panel on the course dashboard is driven by a function, questions_and_answers_for_group(group_id), which wraps a SQL query that:

– calls another pl/python function, questions_for_group(group_id), to find notes in the group that contain question marks

– finds the replies to those notes

– builds a table that summarizes the question/answer pairs

Here’s the SQL wrapped by the questions_and_answers_for_group(group_id) function.

sql = f"""
  with questions as ( 
    select *
    from questions_for_group('{_group_id}')
    ),
  ids_and_refs as (
    select 
      id,
      unnest ("references") as ref
      from annotation
      where groupid = '{_group_id}'
  ),
  combined as (
    select
      q.*,
      array_agg(ir.id) as reply_ids
    from ids_and_refs ir
    inner join questions q on q.id = ir.ref
    group by q.id, q.url, q.title, q.questioner, q.question, q.quote
  ),
  unnested as (
    select
      c.url,
      c.title,
      c.quote,
      c.questioner,
      c.question,
      unnest(reply_ids) as reply_id
    from combined c
  )
  select distinct
    course_for_group('{_group_id}') as course,
    teacher_for_group('{_group_id}') as teacher,
    clean_url(u.url) as url,
    u.title,
    u.quote,
    u.questioner,
    (regexp_matches(u.question, '.+\?'))[1] as question,
    display_name_from_anno(u.reply_id) as answerer,
    text_from_anno(u.reply_id) as answer,
    app_host() || '/course/render_questions_and_answers/{_group_id}' as viewer
  from unnested u
  order by course, teacher, url, title, questioner, question
"""

This isn’t yet what I mean by pl/python metaprogramming. You could as easily wrap this SQL code in a pl/pgsql function. More easily, in fact, because in pl/pgsql you could just write _group_id instead of '{_group_id}'.

To get where we’re going, let’s zoom out and look at the whole questions_and_answers_for_group(group_id) function.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    <SEE ABOVE>
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

This still isn’t what I mean by metaprogramming. It introduces introspection — this is a pl/python function that discovers its own name and works with an eponymous materialized view — but that’s for a later episode.

It also introduces the use of Python modules by pl/python functions. A key thing to note here is that this is an example of what I call a memoizing function. When called it looks for a materialized view that captures the results of the SQL query shown above. If yes, it only needs to use a simple SELECT to return the cached result. If no, it calls memoize_view_name to run the underlying query and cache it in a materialized view that the next call to questions_and_answers_for_group(group_id) will use in a simple SELECT. Note that memoize_view_name is a special function that isn’t defined in Postgres using CREATE FUNCTION foo() like a normal pl/python function. Instead it’s defined using def foo() in a Python module called plpython_helpers. The functions there can do things — like create materialized views — that pl/python functions can’t. More about that in another episode.

The focus in this episode is metaprogramming, which is used in this example to roll up the results of multiple calls to questions_and_answers_for_group(group_id). That happens when the group_id refers to a course that has sections. If you’re teaching the course and you’ve got students in a dozen sections, you don’t want to look at a dozen dashboards; you’d much rather see everything on the primary course dashboard.

Here’s the function that does that consolidation.

create function consolidated_questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    get_caller_name,
    sql_for_consolidated_and_memoized_function_for_group
  )
  base_view_name = get_caller_name()
  sql = sql_for_consolidated_and_memoized_function_for_group(
    plpy, base_view_name, 'questions_and_answers_for_group', _group_id)
  sql += ' order by course, url, title, questioner, answerer'
  return plpy.execute(sql)
$$ language plpython3u;

This pl/python function not only memoizes its results as above, it also consolidates results for all sections of a course. The memoization happens here.

def sql_for_consolidated_and_memoized_function_for_group(plpy, 
    base_view_name, function, group_id):
  view_name = f'{base_view_name}_{group_id}'
  sql = f""" select exists_view('{view_name}') as exists """
  exists = row_zero_value_for_colname(plpy, sql, 'exists')
  if exists:
    sql = f""" select * from {view_name} """
  else:
    sql = consolidator_for_group_as_sql(plpy, group_id, function)
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return sql

The consolidation happens here, and this is finally what I think of as classical metaprogramming: using Python to compose SQL.

def consolidator_for_group_as_sql(plpy, _group_id, _function):
  sql = f"select type_for_group('{_group_id}') as type"
  type = row_zero_value_for_colname(plpy, sql, 'type')
  if type == 'section_group' or type == 'none':
    sql = f"select * from {_function}('{_group_id}')"
  if type == 'course_group' or type == 'course':
    sql = f"select has_sections('{_group_id}')"
    has_sections = row_zero_value_for_colname(plpy, sql, 'has_sections')
    if has_sections:
      sql = f"""
        select array_agg(group_id) as group_ids 
        from sections_for_course('{_group_id}')
      """
      group_ids = row_zero_value_for_colname(plpy, sql, 'group_ids')
      selects = [f"select * from {_function}('{_group_id}') "]
      for group_id in group_ids:
        selects.append(f"select * from {_function}('{group_id}')")
      sql = ' union '.join(selects)
    else:
      sql = f"select * from {_function}('{_group_id}')"
  return sql

If the inbound _group_id is p1mqaeep, the inbound _function is questions_and_answers_for_group, and the group has no sections, the SQL will just be select * from questions_and_answers_for_group('p1mqaeep').

If the group does have sections, then the SQL will instead look like this:

select * from questions_and_answers_for_group('p1mqaeep')
union 
select * from questions_and_answers_for_group('x7fe93ba')
union
select * from questions_and_answers_for_group('qz9a4b3d')

This is a very long-winded way of saying that pl/python is an effective way to compose and run arbitrarily complex SQL code. In theory you could do the same thing using pl/pgsql; in practice it would be insane to try. I’ve entangled the example with other aspects — modules, introspection — because that’s the real situation. pl/python’s maximal power emerges from the interplay of all three aspects. That said, it’s a fantastic way to extend Postgres with user-defined functions that compose and run SQL code.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

The Tao of Unicode Sparklines

I’ve long been enamored of the sparkline, a graphical device which its inventor Edward Tufte defines thusly:

A sparkline is a small intense, simple, word-sized graphic with typographic resolution. Sparklines mean that graphics are no longer cartoonish special occasions with captions and boxes, but rather sparkline graphics can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic.

Nowadays you can create sparklines in many tools including Excel and Google Sheets, both of which can use the technique to pack a summary of a series of numbers into a single cell. By stacking such cells vertically you can create views that compress vast amounts of information.

In A virtuous cycle for analytics I noted that we often use Metabase to display tables and charts based on extracts from our Postgres warehouse. I really wanted to use sparklines to summarize views of activity over time, but that isn’t yet an option in Metabase.

When Metabase is connected to Postgres, though, you can write Metabase questions that can not only call built-in Postgres functions but can also call user-defined functions. Can such a function accept an array of numbers and return a sparkline for display in the Metabase table viewer? Yes, if you use Unicode characters to represent the variable-height bars of a sparkline.

There’s a page at rosettacode.org devoted to Unicode sparklines based on this sequence of eight characters:

U+2581 LOWER ONE EIGHTH BLOCK
U+2582 LOWER ONE QUARTER BLOCK
U+2583 LOWER THREE EIGHTHS BLOCK
U+2584 LOWER HALF BLOCK
U+2585 LOWER FIVE EIGHTHS BLOCK
U+2586 LOWER THREE QUARTERS BLOCK
U+2587 LOWER SEVEN EIGHTHS BLOCK
U+2588 FULL BLOCK

Notice that 2581, 2582, and 2588 are narrower than the rest. I’ll come back to that at the end.

If you combine them into a string of eight characters you get this result:

▁▂▃▄▅▆▇█

Notice that the fourth and eighth characters in the sequence drop below the baseline. I’ll come back to that at the end too.

These characters can be used to define eight buckets into which numbers in a series can be quantized. Here are some examples from the rosettacode.org page:

“1 2 3 4 5 6 7 8 7 6 5 4 3 2 1” -> ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
“1.5, 0.5 3.5, 2.5 5.5, 4.5 7.5, 6.5” -> ▂▁▄▃▆▅█▇
“0, 1, 19, 20” -> ▁▁██
“0, 999, 4000, 4999, 7000, 7999” -> ▁▁▅▅██

To write a Postgres function that would do this, I started with the Python example from rosettacode.org:

bar = '▁▂▃▄▅▆▇█'
barcount = len(bar)
 
def sparkline(numbers):
    mn, mx = min(numbers), max(numbers)
    extent = mx - mn
    sparkline = ''.join(bar[min([barcount - 1,
                                 int((n - mn) / extent * barcount)])]
                        for n in numbers)
    return mn, mx, sparkline

While testing it I happened to try an unchanging sequence, [3, 3, 3, 3], which fails with a divide-by-zero error. In order to address that, and to unpack the algorithm a bit for readability, I arrived at this Postgres function:

create function sparkline(numbers bigint[]) returns text as $$

    def bar_index(num, _min, barcount, extent):
        index = min([barcount - 1, int((num - _min) / extent * barcount)])
        return index

    bars = '\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588'
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min 

    if extent == 0:  # avoid divide by zero if all numbers are equal
        extent = 1

    bar_count = len(bars)
    sparkline = ''
    for num in numbers:
        index = bar_index(num, _min, bar_count, extent)
        sparkline = sparkline + bars[index]

    return sparkline

$$ language plpython3u;

Here’s a psql invocation of the function:

analytics=# select sparkline(array[1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]);
    sparkline
-----------------
▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
(1 row)

And here’s an example based on actual data:

Each row represents a university course in which students and teachers are annotating the course readings. Each bar represents a week’s worth of activity. Their heights are not comparable from row to row; some courses do a lot of annotating and some not so much; each sparkline reports relative variation from week to week; the sum and weekly max columns report absolute numbers.

This visualization makes it easy to see that annotation was occasional in some courses and continuous in others. And when you scroll, the temporal axis comes alive; try scrolling this view to see what I mean.

We use the same mechanism at three different scales. One set of sparklines reports daily activity for students in courses; another rolls those up to weekly activity for courses at a school; still another rolls all those up to weekly activity for each school in the system.
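The query behind one of those rollups is essentially an array_agg feeding the sparkline function. Here’s a sketch; weekly_course_activity, with one row per course per week, is a hypothetical stand-in for the real warehouse view.

select
  course,
  sum(annotations) as total,
  max(annotations) as weekly_max,
  sparkline(array_agg(annotations::bigint order by week)) as sparkline
from weekly_course_activity   -- hypothetical rollup: one row per course per week
group by course;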

At the level of individual courses, the per-student sparkline views can show patterns of interaction. In the left example here, vertical bands of activity reflect students annotating for particular assignments. In the right example there may be a trace of such temporal alignment but activity is less synchronized and more continuous.

When we’re working in Metabase we can use its handy mini bar charts to contextualize the row-wise sums.

The sparkline-like mini bar chart shows a row’s sum relative to the max for the column. Here we can see that a course with 3,758 notes has about 1/4 the number of notes as the most note-heavy course at the school.

Because these Unicode sparklines are just strings of text in columns of SQL or HTML tables, they can participate in sort operations. In our case we can sort on all columns including ones not shown here: instructor name, course start date, number of students. But the default is to sort by the sparkline column which, because it encodes time, orders courses by the start of annotation activity.

The visual effect is admittedly crude, but it’s a good way to show certain kinds of variation. And it’s nicely portable. A Unicode sparkline looks the same in a psql console, an HTML table, or a tweet. The function will work in any database that can run it, using Python or another of the languages demoed at rosettacode.org. For example, I revisited the Workbench workflow described in A beautiful power tool to scrape, clean, and combine data and added a tab for Lake levels.

When I did that, though, the effect was even cruder than what I’ve been seeing in my own work.

In our scenarios, with longer strings of characters, the differences average out and things align pretty well; the below-the-baseline effect has been annoying but not a deal breaker. But the width variation in this example does feel like a deal breaker.

What if we omit the problematic characters U+2581 (too narrow) and U+2584/U+2588 (below baseline and too narrow)?

There are only 5 buckets into which to quantize numbers, and their heights aren’t evenly distributed. But for the intended purpose — to show patterns of variation — I think it’s sufficient in this case. I tried swapping the 5-bucket method into the function that creates sparklines for our dashboards but I don’t think I’ll switch. The loss of vertical resolution makes our longer sparklines less useful, and the width variation is barely noticeable.

Unicode evolves, of course, so maybe there will someday be a sequence of characters that’s friendlier to sparklines. Maybe there already is? If so please let me know, I’d love to use it.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

A beautiful power tool to scrape, clean, and combine data

Labels like “data scientist” and “data journalist” connote an elite corps of professionals who can analyze data and use it to reason about the world. There are elite practitioners, of course, but since the advent of online data a quarter century ago I’ve hoped that every thinking citizen of the world (and of the web) could engage in similar analysis and reasoning.

That’s long been possible for those of us with the ability to wrangle APIs and transform data using SQL, Python, or another programming language. But even for us it hasn’t been easy. When I read news stories that relate to the generation of electric power in California, for example, questions occur to me that I know I could illuminate by finding, transforming, and charting sources of web data:

– How can I visualize the impact of shutting down California’s last nuclear plant?

– What’s the relationship between drought and hydro power?

All the ingredients are lying around in plain sight, but the effort required to combine them winds up being more trouble than it’s worth. And that’s for me, a skilled longtime scraper and transformer of web data. For you — even if you’re a scientist or a journalist! — that may not even be an option.

Enter Workbench, a web app with the tagline: “Scrape, clean, combine and analyze data without code.” I’ve worked with tools in the past that pointed the way toward that vision. DabbleDB in 2005 (now gone) and Freebase Gridworks in 2010 (still alive as Open Refine) were effective ways to cut through data friction. Workbench carries those ideas forward delightfully. It enables me to fly through the boring and difficult stuff — the scraping, cleaning, and combining — in order to focus on what matters: the analysis.

Here’s the report that I made to address the questions I posed above. It’s based on a workflow that you can visit and explore as I describe it here. (If you create your own account you can clone and modify.)

The workflow contains a set of tabs; each tab contains a sequence of steps; each step transforms a data set and displays output as a table or chart. When you load the page the first tab runs, and the result of its last step is displayed. In this case that’s the first chart shown in the report:

As in a Jupyter notebook you can run each step individually. Try clicking step 1. You’ll see a table of data from energy.ca.gov. Notice that step 1 is labeled Concatenate tabs. If you unfurl it you’ll see that it uses another tab, 2001-2020 scraped, which in turn concatenates two other tabs, 2001-2010 scraped and 2011-2020 scraped. Note that I’ve helpfully explained that in the optional comment field above step 1.

Each of the two source tabs scrapes a table from the page at energy.ca.gov. As I note in the report, it wasn’t necessary to scrape those tables since the data are available as an Excel file that can be downloaded, then uploaded to Workbench (as I’ve done in the tab named energy.ca.gov xlsx). I scraped them anyway because that web page presents a common challenge: the data appear in two separate HTML tables. That’s helpful to the reader but frustrating to an analyst who wants to use the data. Rapid and fluid combination of scraped tables is grease for cutting through data friction; Workbench supplies that grease.

Now click step 2 in the first tab. It’s the last step, so you’re back to the opening display of the chart. Unfurl it and you’ll see the subset of columns included in the chart. I’ve removed some minor sources, like oil and waste heat, in order to focus on major ones. Several details are notable here. First: colors. The system provides a default palette but you can adjust it. Black wasn’t on the default palette but I chose that for coal.

Second, grand total. The data set doesn’t include that column, and it’s not something I needed here. But in some situations I’d want it, so the system offers it as a choice. That’s an example of the attention to detail that pervades every aspect of Workbench.

Third, Vega. See the triple-dot button above the legend in the chart? Click it, then select Open in Vega Editor, and when you get there, click Run. Today I learned that Vega is:

a declarative format for creating, saving, and sharing visualization designs. With Vega, visualizations are described in JSON, and generate interactive views using either HTML5 Canvas or SVG.

Sweet! I think I’ll use it in my own work to simplify what I’ve recently (and painfully) learned how to do with D3.js. It’s also a nice example of how Workbench prioritizes openness, reusability, and reproducibility in every imaginable way.

I use the chart as the intro to my report, which is made with an elegant block editor in which you can combine tables and charts from any of your tabs with snippets of text written in markdown. There I begin to ask myself questions, adding tabs to marshal supporting evidence and sourcing evidence from tabs into the report.

My first question is about the contribution that the Diablo Canyon nuclear plant has been making to the overall mix. In the 2020 percentages all major sources tab I start in step 1 by reusing the tab 2001-2020 scraped. Step 2 filters the columns to just the same set of major sources shown in the chart. I could instead apply that step in 2001-2020 scraped and avoid the need to select columns for the chart. Since I’m not sure how that decision might affect downstream analysis I keep all the columns. If I change my mind it’s easy to push the column selection upstream.

Workbench not only makes it possible to refactor a workflow; it practically begs you to. When things go awry, as they inevitably will, it’s no problem. You can undo and redo the steps in each tab! You won’t see that in the read-only view, but if you create your own account and duplicate my workflow in it, give it a try. With stepwise undo/redo, exploratory analysis becomes a safe and stress-free activity.

At step 2 of 2020 percentages all major sources we have rows for all the years. In thinking about Diablo Canyon’s contribution I want to focus on a single reference year so in step 3 I apply a filter that selects just the 2020 row. Here’s the UX for that.

In situations like this, where you need to select one or more items from a list, Workbench does all the right things to minimize tedium: search if needed, start from all or none depending on which will be easier, then keep or remove selected items, again depending on which will be easier.

In step 4 I include an alternate way to select just the 2020 row. It’s a Select SQL step that says select * from input where Year = '2020'. That doesn’t change anything here; I could omit either step 3 or step 4 without affecting the outcome; I include step 4 just to show that SQL is available at any point to transform the output of a prior step.

Which is fantastic, but wait, there’s more. In step 5 I use a Python step to do the same thing in terms of a pandas dataframe. Again, this doesn’t affect the outcome; I’m just showing that Python is available at any point to transform the output of a prior step. Providing equivalent methods for novices and experts, in a common interface, is extraordinarily powerful.

I’m noticing now, by the way, that step 5 doesn’t work if you’re not logged in. So I’ll show it to you here:
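
The gist, with a toy table standing in for the real one and without the wrapper Workbench puts around a Python step’s code, is a one-line pandas filter:

```python
import pandas as pd

# The same row filter as the SQL in step 4, expressed on a pandas dataframe.
# `table` is a toy stand-in for the prior step's output.
table = pd.DataFrame({"Year": ["2019", "2020"]})
table = table[table["Year"] == "2020"]    # keep just the 2020 row
```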

Step 6 transposes the table so we can reason about the fuel types. In steps 3-5 they’re columns; in step 6 they become rows. This is a commonly-needed maneuver. And while I might use the SQL in step 4 to do the row selection handled by the widget in step 3, I won’t easily accomplish the transposition that way. The Transpose step is one of the most powerful tools in the kit.

Notice at step 6 that the first column is named Year. That’s a common outcome of transposition and here necessitates step 7 in which I rename it to Fuel Type. There are two ways to do that. You can click the + Add Step button, choose the Rename columns option, drag the new step into position 7, open it, and do the renaming there.

But look:

You can edit anything in a displayed table. When I change Year to Fuel Type that way, the same step 7 that you can create manually appears automatically.

It’s absolutely brilliant.
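
For the code-minded, here’s the pandas counterpart of steps 6 and 7, transpose plus rename, with toy values standing in for the real table:

```python
import pandas as pd

# Steps 6 and 7 in pandas: transpose so the fuel types become rows, then rename
# the leftover header column to Fuel Type. Toy values; only the shape matters.
wide = pd.DataFrame({"Year": ["2020"], "Nuclear": [100], "Large Hydro": [200]})

long = (
    wide.set_index("Year")                       # keep Year out of the transposed values
        .T                                       # fuel types become the row index
        .reset_index()                           # ...and then a column, named "index" here
        .rename(columns={"index": "Fuel Type"})  # the counterpart of step 7's rename
)
long.columns.name = None                         # drop the stray axis label left behind
```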

In step 8 I use the Calculate step to add a new column showing each row’s percentage of the column sum. In SQL I’d have to think about that a bit. Here, as with so many routine operations, Workbench offers the solution directly:
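
For comparison, the same calculation in pandas is a one-liner once you have the column sum. This sketch uses toy values and assumed column names:

```python
import pandas as pd

# Step 8's "percentage of column sum" in pandas. Column names and values are
# toy stand-ins.
table = pd.DataFrame({
    "Fuel Type": ["Nuclear", "Large Hydro", "Other"],
    "2020": [100.0, 200.0, 700.0],
})
table["Percent"] = table["2020"] / table["2020"].sum() * 100
```

The SQL equivalent needs a subselect or a window function for the total, which is exactly the bit of thinking the Calculate step saves.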

Finally in step 9 I sort the table. The report includes it, and there I consider the question of Diablo Canyon’s contribution. According to my analysis nuclear power was 9% of the major sources I’ve selected, contributing 16,280 GWh in 2020. According to another energy.ca.gov page that I cite in the report, Diablo Canyon is the only remaining nuke plant in the state, producing “about 18,000 GWh.” That’s not an exact match but it’s close enough to give me confidence that reasoning about the nuclear row in the table applies to Diablo Canyon specifically.

Next I want to compare nuclear power to just the subset of sources that are renewable. That happens in the 2020 percentages renewable tab, the output of which is also included in the report. Step 1 begins with the output of 2020 percentages all major sources. In step 2 I clarify that the 2020 column is really 2020 GWh. In step 3 I remove the percent column in order to recalculate it. In step 4 I remove rows in order to focus on just nuclear and renewables. In step 5 I recalculate the percentages. And in step 6 I make the chart that also flows through to the report.
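
Sketched in pandas, with toy numbers and an assumed list of renewables, the tab boils down to a row filter followed by the same percentage calculation:

```python
import pandas as pd

# The 2020 percentages renewable tab in miniature: keep nuclear plus the
# renewable sources, then recompute each share against the smaller total.
# The set of renewables and all values here are assumptions / toy numbers.
table = pd.DataFrame({
    "Fuel Type": ["Nuclear", "Large Hydro", "Solar", "Wind", "Natural Gas"],
    "2020 GWh": [100.0, 200.0, 300.0, 400.0, 900.0],
})
keep = ["Nuclear", "Large Hydro", "Solar", "Wind"]          # assumed rows of interest
subset = table[table["Fuel Type"].isin(keep)].copy()
subset["Percent"] = subset["2020 GWh"] / subset["2020 GWh"].sum() * 100
```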

Now, as I look at the chart, I notice that the line for large hydro is highly variable and appears to correlate with drought years. In order to explore that correlation I look for data on reservoir levels and arrive at https://cdec.water.ca.gov/. I’d love to find a table that aggregates levels for all reservoirs statewide since 2001, but that doesn’t seem to be on offer. So I decide to use Lake Mendocino as a proxy. In step 1 I scrape an HTML table with monthly levels for the lake since 2001. In step 2 I delete the first row, which has data for only some of the months. In step 3 I rename the first column to Year in order to match what’s in the table I want to join with. In step 4 I convert the types of the month columns from text to numeric to enable calculation. In step 5 I calculate the average into a new column, Avg. In step 6 I select just Year and Avg.
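
Here’s the whole sequence as a pandas sketch, with a toy monthly table standing in for the scraped one (so step 1 is omitted) and assumed column names:

```python
import pandas as pd

# Steps 2-6 of the Lake Mendocino tab, sketched in pandas. The table, its column
# names, and its values are toy stand-ins for the scraped data.
levels = pd.DataFrame({
    "Water Year": ["partial", "2001", "2002"],   # assumed name for the first column
    "Jan": ["", "1000", "1100"],
    "Feb": ["", "1200", "900"],
})

levels = levels.iloc[1:].copy()                              # step 2: drop the partial first row
levels = levels.rename(columns={levels.columns[0]: "Year"})  # step 3: match the join key
months = levels.columns.drop("Year")
levels[months] = levels[months].apply(pd.to_numeric, errors="coerce")  # step 4: text to numbers
levels["Avg"] = levels[months].mean(axis=1)                  # step 5: row-wise average
levels = levels[["Year", "Avg"]]                             # step 6: keep just Year and Avg
```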

When I first try the join in step 8 it fails for a common reason that Workbench helpfully explains:

In the other table Year looks like ‘2001’, but the text scraped from energy.ca.gov looks like ‘2,001’. That’s a common glitch that can bring an exercise like this to a screeching halt. There’s probably a Workbench way to do this, but in step 7 I use SQL to reformat the values in the Year column, removing the commas to enable the join. While there I also rename the Avg column to Lake Mendocino Avg Level. Now in step 8 I can do the join.
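
For what it’s worth, the same fix is just as easy in a Python step. Here’s a sketch with toy tables standing in for the two tabs:

```python
import pandas as pd

# Strip the commas from the Year values that have them, rename Avg, then join.
# Both tables are toy stand-ins.
energy = pd.DataFrame({"Year": ["2,001", "2,002"], "Large Hydro": [100.0, 200.0]})
lake = pd.DataFrame({"Year": ["2001", "2002"], "Avg": [1000.0, 900.0]})

energy["Year"] = energy["Year"].str.replace(",", "", regex=False)   # '2,001' -> '2001'
lake = lake.rename(columns={"Avg": "Lake Mendocino Avg Level"})
joined = energy.merge(lake, on="Year")                              # step 8's join
```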

In step 9 I scale the values for Large Hydro into a new column, Scaled Large Hydro. Why? The chart I want to see will compare power generation in GWh (gigawatt hours) and lake levels in AF (acre-feet). These aren’t remotely compatible, but I don’t care; I’m just looking for comparable trends. Doubling the value for Large Hydro gets close enough for the comparison chart in step 10, which also flows through to the report.
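
In pandas terms the scaling is a one-liner; the only decision is the factor, and 2 happens to bring the curves into the same range (toy values again):

```python
import pandas as pd

# Step 9's scaling: double Large Hydro so the GWh curve and the acre-feet curve
# share a comparable range in the step-10 chart. Toy values.
joined = pd.DataFrame({
    "Year": ["2001", "2002"],
    "Large Hydro": [100.0, 200.0],
    "Lake Mendocino Avg Level": [250.0, 380.0],
})
joined["Scaled Large Hydro"] = joined["Large Hydro"] * 2
```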

All this adds up to an astonishingly broad, deep, and rich set of features. And I haven’t even talked about the Clean text step for tweaking whitespace, capitalization, and punctuation, or the Refine step for finding and merging clusters of similar values that refer to the same things. Workbench is also simply beautiful, as you can see from the screenshots here or by visiting my workflow. When I reviewed software products for BYTE and InfoWorld it was rare to encounter one that impressed me so thoroughly.

But wait, there’s more.

At the core of my workflow there’s a set of tabs; each is a sequence of steps; some of these produce tables and charts. Wrapped around the workflow there’s the report, into which I cherry-pick tables and charts for the story I’m telling.

There’s also another kind of wrapper: a lesson wrapped around the workflow. I could write a lesson that guides you through the steps I’ve described and checks that each yields the expected result. See Intro to data journalism for an example you can try. Again, it’s brilliantly well done.

So Workbench succeeds in three major ways. If you’re not a data-oriented professional, but you’re a thinking person who wants to use available web data to reason about the world, Workbench will help you power through the grunt work of scraping, cleaning, and combining that data so you can focus on analysis. If you aspire to become such a professional, and you don’t have a clue about how to do that grunt work, it will help you learn the ropes. And if you are one of the pros you’ll still find it incredibly useful.

Kudos to the Workbench team, and especially to core developers Jonathan Stray, Pierre Conti, and Adam Hooper, for making this spectacularly great software tool.