My own personal AWS S3 bucket

I’ve just rediscovered two digital assets that I’d forgotten about.

1. The Reddit username judell, which I created in 2005 and never used. When you visit the page it says “hmm… u/judell hasn’t posted anything” but also reports, in my Trophy Case, that I belong to the 15-year club.

2. The Amazon AWS S3 bucket named simply jon, which I created in 2006 for an InfoWorld blog post and companion column about the birth of Amazon Web Services. As Wikipedia’s timeline shows, AWS started in March of that year.

Care to guess the odds that I could still access both of these assets after leaving them in limbo for 15 years?

Spoiler alert: it was a coin flip.

I’ve had no luck with Reddit so far. The email account I signed up with no longer exists. The support folks kindly switched me to a current email but it’s somehow linked to Educational_Elk_7869 not to judell. I guess we may still get it sorted but the point is that I was not at all surprised by this loss of continuity. I’ve lost control of all kinds of digital assets over the years, including the above-cited InfoWorld article which only Wayback (thank you as always!) now remembers.

When I turned my attention to AWS S3 I was dreading a similar outcome. I’d gone to Microsoft not long after I made that AWS developer account; my early cloud adventures were all in Azure; could I still access those long-dormant AWS resources? Happily: yes.

Here’s the backstory from that 2006 blog post:

Naming

The name of the bucket is jon. The bucket namespace is global which means that as long as jon is owned by my S3 developer account, nobody else can use that name. Will this lead to a namespace land grab? We’ll see. Meanwhile, I’ve got mine, and although I may never again top Jon Stewart as Google’s #1 Jon, his people are going to have to talk to my people if they want my Amazon bucket.

I’m not holding my breath waiting for an offer. Bucket names never mattered in the way domain names do. Still, I would love to be pleasantly surprised!

My newfound interest in AWS is, of course, because Steampipe wraps SQL around a whole bunch of AWS APIs including the one for S3 buckets. So, for example, when exactly did I create that bucket? Of course I can log into the AWS console and click my way to the answer. But I’m all about SQL lately so instead I can do this.

> select name, arn, creation_date from aws_s3_bucket
+-------+--------------------+---------------------+
| name  | arn                | creation_date       |
+-------+--------------------+---------------------+
| jon   | arn:aws:s3:::jon   | 2006-03-16 08:16:12 |
| luann | arn:aws:s3:::luann | 2007-04-26 14:47:45 |
+-------+--------------------+---------------------+

Oh, and there’s the other one I made for Luann the following year. These are pretty cool ARNs (Amazon Resource Names)! I should probably do something with them; the names you can get nowadays are more like Educational_Elk_7869.

Anyway I’m about to learn a great deal about the many AWS APIs that Steampipe can query, check for policy compliance, and join with the APIs of other services. Meanwhile it’s fun to recall that I wrote one of the first reviews of the inaugural AWS product and, in the process, laid claim to some very special S3 bucket names.

Query like it’s 2022

Monday will be my first day as community lead for Steampipe, a young open source project that normalizes APIs by way of Postgres foreign data wrappers. The project’s taglines are select * from cloud and query like it’s 1992; the steampipe.io home page nicely illustrates these ideas.

I’ve been thinking about API normalization for a long time. The original proposal for the World Wide Web says:

Databases

A generic tool could perhaps be made to allow any database which uses a commercial DBMS to be displayed as a hypertext view.

We ended up with standard ways for talking to databases — ODBC, JDBC — but not for expressing them on the web.

When I was at Microsoft I was bullish on OData, an outgrowth of Pablo Castro’s wonderful Project Astoria. Part of the promise was that every database-backed website could automatically offer basic API access that wouldn’t require API wrappers for everybody’s favorite programming language. The API was hypertext; a person could navigate it using links and search. Programs wrapped around that API could be useful, but meaningful interaction with data would be possible without them.

(For a great example of what that can feel like, jump into the middle of one of Simon Willison’s datasettes, for example san-francisco.datasettes.com, and start clicking around.)

Back then I wrote a couple of posts on this topic[1, 2]. Many years later OData still hasn’t taken the world by storm. I still think it’s a great idea and would love to see it, or something like it, catch on more broadly. Meanwhile Steampipe takes a different approach. Given a proliferation of APIs and programming aids for them, let’s help by providing a unifying abstraction: SQL.

I’ve done a deep dive into the SQL world over the past few years. The first post in a series I’ve been writing on my adventures with Postgres is what connected me to Steampipe and its sponsor (my new employer) Turbot. When you install Steampipe it brings Postgres along for the ride. Imagine what you could do with data flowing into Postgres from many different APIs and filling up tables you can view, query, join, and expose to tools and systems that talk to Postgres. Well, it’s going to be my job to help imagine, and explain, what’s possible in that scenario.

Meanwhile I need to give some thought to my Twitter tag line: patron saint of trailing edge technologies. It’s funny and it’s true. At BYTE I explored how software based on the Net News Transfer Protocol enabled my team to do things that we use Slack for today. At Microsoft I built a system for community-scale calendaring based on iCalendar. When I picked up NNTP and iCalendar they were already on the trailing edge. Yet they were, and especially in the case of iCalendar still are, capable of doing much more than is commonly understood.

Then of course came web annotation. Although Hypothesis recently shepherded it to W3C standardization it goes all the way back to the Mosaic browser and is exactly the kind of generative tech that fires my imagination. With Hypothesis now well established in education, I hope others will continue to explore the breadth of what’s possible when every document workflow that needs to can readily connect people, activities, and data to selections in documents. If that’s of interest, here are some signposts pointing to scenarios I’ve envisioned and prototyped.

And now it’s SQL. For a long time I set it aside in favor of object, XML, and NoSQL stores. Coming back to it, by way of Postgres, has shown me that:

– Modern SQL is more valuable as a programming language than is commonly understood

– So is Postgres as a programming environment

The tagline query like it’s 1992 seems very on-brand for me. But maybe I should let go of the trailing-edge moniker. Nostalgia isn’t the best way to motivate fresh energy. Maybe query like it’s 2022 sets a better tone? In any case I’m very much looking forward to this next phase.

The Postgres REPL

R0ml Lefkowitz’s The Image of Postgres evokes the Smalltalk experience: reach deeply into a running system, make small changes, see immediate results. There isn’t yet a full-blown IDE for the style of Postgres-based development I describe in this series, though I can envision a VSCode extension that would provide one. But there is certainly a REPL (read-eval-print loop): it’s called psql, and it delivers the kind of immediacy that all REPLs do. In our case there’s also Metabase; it offers a complementary REPL that enhances its power as a lightweight app server.

In the Clojure docs it says:

The Clojure REPL gives the programmer an interactive development experience. When developing new functionality, it enables her to build programs first by performing small tasks manually, as if she were the computer, then gradually make them more and more automated, until the desired functionality is fully programmed. When debugging, the REPL makes the execution of her programs feel tangible: it enables the programmer to rapidly reproduce the problem, observe its symptoms closely, then improvise experiments to rapidly narrow down the cause of the bug and iterate towards a fix.

I feel the same way about the Python REPL, the browser’s REPL, the Metabase REPL, and now also the Postgres REPL. Every function and every materialized view in the analytics system begins as a snippet of code pasted into the psql console (or Metabase). Iteration yields successive results instantly, and those results reflect live data. In How is a Programmer Like a Pathologist Gilad Bracha wrote:

A live program is dynamic; it changes over time; it is animated. A program is alive when it’s running. When you work on a program in a text editor, it is dead.

Tudor Girba amplified the point in a tweet.

In a database-backed system there’s no more direct way to interact with live data than to do so in the database. The Postgres REPL is, of course, a very sharp tool. Here are some ways to handle it carefully.

Find the right balance for tracking incremental change

In Working in a hybrid Metabase / Postgres code base I described how version-controlled files — for Postgres functions and views, and for Metabase questions — repose in GitHub and drive a concordance of docs. I sometimes write code snippets directly in psql or Metabase, but mainly compose in a “repository” (telling word!) where those snippets are “dead” artifacts in a text editor. They come to life when pasted into psql.

A knock on Smalltalk was that it didn’t play nicely with version control. If you focus on the REPL aspect, you could say the same of Python or JavaScript. In any such case there’s a balance to be struck between iterating at the speed of thought and tracking incremental change. Working solo I’ve been inclined toward a fairly granular commit history. In a team context I’d want to leave a chunkier history but still record the ongoing narrative somewhere.

Make it easy to understand the scope and effects of changes

The doc concordance has been the main way I visualize interdependent Postgres functions, Postgres views, and Metabase questions. In Working with interdependent Postgres functions and materialized views I mentioned Laurenz Albe’s Tracking View Dependencies in Postgres. I’ve adapted the view dependency tracker he develops there, along with related work from others, to track function dependencies.

This tooling is still a work in progress, though. The concordance doesn’t yet include Postgres types, for example, nor the tables that are upstream from materialized views. My hypothetical VSCode extension would know about all the artifacts and react immediately when things change.

Make it easy to find and discard unwanted artifacts

Given a function or view named foo, I’ll often write and test a foo2 before transplanting changes back into foo. Because foo may depend on bar and call baz, I often wind up with bar2 and baz2 as well. These artifacts hang around in Postgres until you delete them, which I try to do as I go along.

If foo2 is a memoized function (see this episode), it can be necessary to delete the set of views that it’s going to recreate. I find these with a query.

select 
  'drop materialized view ' || matviewname || ';' as drop_stmt
from pg_matviews 
where matviewname ~* {{ pattern }}

That pattern might be question_and_answer_summary_for_group to find all views based on that function, or _6djxg2yk to find all views for a group, or even [^_]{8,8}$ to find all views made by memoized functions.
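The same generate-and-match logic is easy to check outside the database. Here’s an illustrative Python sketch (my own, not part of the actual tooling) that mimics the query above against a list of materialized view names:

```python
import re

def drop_statements(matview_names, pattern):
    """Mimic the pg_matviews query: case-insensitively match view
    names and emit the drop statements to run against Postgres."""
    rx = re.compile(pattern, re.IGNORECASE)
    return ["drop materialized view {};".format(n)
            for n in matview_names if rx.search(n)]

# Hypothetical view names; memoized functions append an 8-char suffix.
views = ["summary_foo2_abcd1234", "summary_foo_abcd1234", "other_view"]
stmts = drop_statements(views, "foo2")
```

The pattern "[^_]{8,8}$" would match both summary views here, since each ends with eight non-underscore characters, while "foo2" isolates only the views built by the experimental function.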

I haven’t yet automated the discovery or removal of stale artifacts and references to them. That’s another nice-to-have for the hypothetical IDE.

The Image of Postgres

I’ll give R0ml the last word on this topic.

This is the BYTE magazine cover from August of 1981. In the 70s and the 80s, programming languages had this sort of unique perspective that’s completely lost to history. The way it worked: a programming environment was a virtual machine image, it was a complete copy of your entire virtual machine memory and that was called the image. And then you loaded that up and it had all your functions and your data in it, and then you ran that for a while until you were sort of done and then you saved it out. And this wasn’t just Smalltalk, Lisp worked that way, APL worked that way, it was kind of like Docker only it wasn’t a separate thing because everything worked that way and so you didn’t worry very much about persistence because it was implied. If you had a programming environment it saved everything that you were doing in the programming environment, you didn’t have to separate that part out. A programming environment was a place where you kept all your data and business logic forever.

So then Postgres is kind of like Smalltalk only different.

What’s the difference? Well we took the UI out of Smalltalk and put it in the browser. The rest of it is the same, so really Postgres is an application delivery platform, just like we had back in the 80s.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Metabase as a lightweight app server

In A virtuous cycle for analytics I said this about Metabase:

It’s all nicely RESTful. Interactive elements that can parameterize queries, like search boxes and date pickers, map to URLs. Queries can emit URLs in order to compose themselves with other queries. I came to see this system as a kind of lightweight application server in which to incubate an analytics capability that could later be expressed more richly.

Let’s explore that idea in more detail. Consider this query that finds groups created in the last week.

with group_create_days as (
  select
     to_char(created, 'YYYY-MM-DD') as day
  from "group"
  where created > now() - interval '1 week'
)
select 
  day,
  count(*)
from group_create_days
group by day 
order by day desc 

A Metabase user can edit the query and change the interval to, say, 1 month, but there’s a nicer way to enable that. Terms in double squigglies are Metabase variables. When you type {{interval}} in the query editor, the Variables pane appears.

Here I’m defining the variable’s type as text and providing the default value 1 week; the where clause becomes created > now() - {{interval}}::interval, and the query sent to Postgres is the same as above. Note that this won’t work if you omit the ::interval cast. Postgres complains: “ERROR: operator does not exist: timestamp with time zone - character varying.” That’s because Metabase doesn’t support variables of type interval, as required for date subtraction; a text variable arrives as character varying. But if you cast the variable to type interval it’ll work.

That’s an improvement. A user of this Metabase question can now type 2 months or 1 year to vary the interval. But while Postgres’ interval syntax is fairly intuitive, this approach still requires people to make an intuitive leap. So here’s a version that eliminates the guessing.

The variable type is now Field Filter; the filtered field is the created column of the group table; the widget type is Relative Date; the default is Last Month. Choosing other intervals is now a point-and-click operation. It’s less flexible — 3 weeks is no longer an option — but friendlier.

Metabase commendably provides URLs that capture these choices. The default in this case is METABASE_SERVER/question/1060?interval=lastmonth. For the Last Year option it becomes interval=lastyear.

Because all Metabase questions that use variables work this way, the notion of Metabase as rudimentary app server expands to sets of interlinked questions. In Working in a hybrid Metabase / Postgres code base I showed the following example.

A Metabase question, #600, runs a query that selects columns from the view top_20_annotated_domains_last_week. It interpolates one of those columns, domain, into a URL that invokes Metabase question #985 and passes the domain as a parameter to that question. In the results for question #600, each row contains a link to a question that reports details about groups that annotated pages at that row’s domain.

This is really powerful stuff. Even without all the advanced capabilities I’ve been discussing in this series — pl/python functions, materialized views — you can do a lot more with the Metabase / Postgres combo than you might think.

For example, here’s an interesting idiom I’ve discovered. It’s often useful to interpolate a Metabase variable into a WHERE clause.

select * 
from dashboard_users
where email = {{ email }} 

You can make that into a fuzzy search using the case-insensitive regex-match operator ~*.

select * 
from dashboard_users
where email ~* {{ email }}

That’ll find a single address regardless of case; you can also find all records matching, say, ucsc.edu. But it requires the user to type some value into the input box. Ideally this query won’t require any input: if none is given, it lists all addresses in the table; if there is input, it does a fuzzy match on that input. Here’s a recipe for doing that. Tell Metabase that {{ email }} is a required variable, and set its default to any. Then, in the query, do this:

select * 
from dashboard_users
where email ~*
  case 
    when {{ email }} = 'any' then ''
    else {{ email }}
  end

In the default case the matching operator binds to the empty string, so it matches everything and the query returns all rows. For any other input the operator binds to a value that drives a fuzzy search.
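To see why the idiom works, here’s a Python sketch of the same logic, purely illustrative, using re in place of Postgres’ ~* operator:

```python
import re

def filter_emails(emails, pattern="any"):
    """Mimic the SQL idiom: the default 'any' binds the match to the
    empty string, which matches everything; any other input drives a
    case-insensitive fuzzy match, like ~* in Postgres."""
    effective = "" if pattern == "any" else pattern
    return [e for e in emails if re.search(effective, e, re.IGNORECASE)]

emails = ["alice@ucsc.edu", "bob@example.com", "Carol@UCSC.edu"]
everyone = filter_emails(emails)            # default: all rows
ucsc = filter_emails(emails, "ucsc.edu")    # fuzzy, case-insensitive
```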

This is all very nice, you may think, but even the simplest app server can write to the database as well as read from it, and Metabase can’t. It’s ultimately just a tool that you point at a data warehouse to SELECT data for display in tables and charts. You can’t INSERT or UPDATE or ALTER or DELETE or CALL anything.

Well, it turns out that you can. Here’s a Metabase question that adds a user to the table.

select add_dashboard_user( {{email}} )

How can this possibly work? If add_dashboard_user were a Postgres procedure you could CALL it from psql, but in this context you can only SELECT.

We’ve seen the solution in Postgres set-returning functions that self-memoize as materialized views. A Postgres function written in pl/python can import and use a Python function from a plpython_helpers module. That helper function can invoke psql to CALL a procedure. So this is possible.
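I won’t reproduce the plpython_helpers code here, but the shape of the workaround might look something like this sketch (the function names and the database name are assumptions, not the actual module’s interface):

```python
import subprocess

def build_call_sql(proc_name, *args):
    """Render a CALL statement, quoting string arguments."""
    rendered = ", ".join(
        str(a) if isinstance(a, (int, float))
        else "'{}'".format(str(a).replace("'", "''"))
        for a in args
    )
    return "call {}({});".format(proc_name, rendered)

def call_procedure(proc_name, *args, db="analytics"):
    """Shell out to psql to CALL a procedure -- the trick that lets a
    pl/python function, reachable only via SELECT, trigger a write."""
    cmd = ["psql", "-d", db, "-c", build_call_sql(proc_name, *args)]
    return subprocess.run(cmd, capture_output=True, text=True)
```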

We’ve used Metabase for years. It provides a basic, general-purpose UX that’s deeply woven into the fabric of the company. Until recently we thought of it as a read-only system for analytics, so a lot of data management happens in spreadsheets that don’t connect to the data warehouse. It hadn’t occurred to me to leverage that same basic UX for data management too, and that’s going to be a game-changer. I always thought of Metabase as a lightweight app server. With some help from Postgres it turns out to be a more capable one than I thought.



Notes for an annotation SDK

While helping Hypothesis find its way to ed-tech it was my great privilege to explore ways of adapting annotation to other domains including bioscience, journalism, and scholarly publishing. Working across these domains showed me that annotation isn’t just an app you do or don’t adopt. It’s also a service you’d like to be available in every document workflow that connects people to selections in documents.

In my talk Weaving the Annotated Web I showcased four such services: Science in the Classroom, The Digital Polarization Project, SciBot, and ClaimChart. Others include tools to evaluate credibility signals, or review claims, in news stories.

As I worked through these and other scenarios, I accreted a set of tools for enabling any annotation-aware interaction in any document-oriented workflow. I’ve wanted to package these as a coherent software development kit. That hasn’t happened yet, but here are some of the ingredients that belong in such an SDK.

Creating an annotation from a selection in a document

Two core operations lie at the heart of any annotation system: creating a note that will bind to a selection in a document, and binding (anchoring) that note to its place in the document. A tool that creates an annotation reacts to a selection in a document by forming one or more selectors that describe the selection.

The most important selector is TextQuoteSelector. If I visit http://www.example.com and select the phrase “illustrative examples” and then use Hypothesis to annotate that selection, the payload sent from the client to the server includes this construct.

{
  "type": "TextQuoteSelector", 
  "exact": "illustrative examples", 
  "prefix": "n\n    This domain is for use in ", 
  "suffix": " in documents. You may use this\n"
}
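Conceptually, forming such a selector is simple. Here’s a Python sketch that derives exact, prefix, and suffix from character offsets into a document’s visible text; the real dom-anchor-text-quote module works on DOM ranges, and the 32-character context window here is an assumption:

```python
def text_quote_selector(doc, start, end, context=32):
    """Build a TextQuoteSelector-like dict from a selection given as
    character offsets into the document's visible text."""
    return {
        "type": "TextQuoteSelector",
        "exact": doc[start:end],
        "prefix": doc[max(0, start - context):start],
        "suffix": doc[end:end + context],
    }

doc = "This domain is for use in illustrative examples in documents."
start = doc.index("illustrative")
sel = text_quote_selector(doc, start, start + len("illustrative examples"))
```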

The Hypothesis client formerly used an NPM module, dom-anchor-text-quote, to derive that info from a selection. It no longer uses that module, and the equivalent code that it does use isn’t separately available. But annotations created using TextQuoteSelectors formed by dom-anchor-text-quote interoperate with those created using the Hypothesis client, and I don’t expect that will change since Hypothesis needs to remain backwards-compatible with itself.

You’ll find something like TextQuoteSelector in any annotation system. It’s formally defined by the W3C here. In the vast majority of cases this is all you need to describe the selection to which an annotation should anchor.

There are, however, cases where TextQuoteSelector won’t suffice. Consider a document that repeats the same passage three times. Given a short selection in the first of those passages, how can a system know that an annotation should anchor to that one, and not the second or third? Another selector, TextPositionSelector (https://www.npmjs.com/package/dom-anchor-text-position), enables a system to know which passage contains the selection.

{
  "type": "TextPositionSelector",
  "start": 51,
  "end": 72
}

It records the start and end of the selection in the visible text of an HTML document. Here’s the HTML source of that web page.

<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>

Here is the visible text within which those positions are counted.

\n\n Example Domain\n This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\n More information…\n\n\n\n

The positions recorded by a TextPositionSelector can change for a couple of reasons. If the document is altered, it’s obvious that an annotation’s start and end numbers might change. Less obviously, that can happen even if the document’s text isn’t altered. A news website, for example, may inject different kinds of advertising-related text content from one page load to the next. In that case the positions for two consecutive Hypothesis annotations made on the same selection can differ. So while TextPositionSelector can resolve ambiguity, and provide hints to an annotation system about where to look for matches, the foundation is ultimately TextQuoteSelector.
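Here’s a Python sketch of that anchoring strategy: find every occurrence of the quoted text, and when there’s more than one, use the recorded position as a hint to choose among them. It illustrates the idea; it is not the actual Hypothesis anchoring code:

```python
def anchor(doc, quote, position_hint=None):
    """Find where a TextQuoteSelector's exact text anchors in a
    document. When the quote appears more than once, pick the
    occurrence closest to the TextPositionSelector's recorded start."""
    exact = quote["exact"]
    candidates, i = [], doc.find(exact)
    while i != -1:
        candidates.append(i)
        i = doc.find(exact, i + 1)
    if not candidates:
        return None
    if position_hint is None:
        return candidates[0]
    return min(candidates, key=lambda c: abs(c - position_hint))

# A document that repeats the same passage three times.
doc = "alpha beta gamma. alpha beta gamma. alpha beta gamma."
hit = anchor(doc, {"exact": "beta"}, position_hint=20)
```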

If you try the first example in the README at https://github.com/judell/TextQuoteAndPosition, you can form your own TextQuoteSelector and TextPositionSelector from a selection in a web page. That repo exists only as a wrapper around the set of modules — dom-anchor-text-quote, dom-anchor-text-position, and wrap-range-text — needed to create and anchor annotations.

Building on these ingredients, HelloWorldAnnotated illustrates a common pattern.

  • Given a selection in a page, form the selectors needed to post an annotation that targets the selection.
  • Lead a user through an interaction that influences the content of that annotation.
  • Post the annotation.

Here is an example of such an interaction. It’s a content-labeling scenario in which a user rates the emotional affect of a selection. This is the kind of thing that can be done with the stock Hypothesis client, but awkwardly because users must reliably add tags like WeakNegative or StrongPositive to represent their ratings. The app prompts for those tags to ensure consistent use of them.

Although the annotation is created by a standalone app, the Hypothesis client can anchor it, display it, and even edit it.

And the Hypothesis service can search for sets of annotations that match the tags WeakNegative or StrongPositive.

There’s powerful synergy at work here. If your annotation scenario requires controlled tags, or a prescribed workflow, you might want to adapt the Hypothesis client to do those things. But it can be easier to create a standalone app that does exactly what you need, while producing annotations that interoperate with the Hypothesis system.

Anchoring an annotation to its place in a document

Using this same set of modules, a tool or system can retrieve an annotation from a web service and anchor it to a document in the place where it belongs. You can try the second example in the README at https://github.com/judell/TextQuoteAndPosition to see how this works.

For a real-world demonstration of this technique, see Science in the Classroom. It’s a project sponsored by The American Association for the Advancement of Science. Graduate students annotate research papers selected from the Science family of journals so that younger students can learn about the terminology, methods, and outcomes of scientific research.

Pre-Hypothesis, annotations on these papers were displayed using Learning Lens, a viewer that color-codes them by category.

Nothing about Learning Lens changed when Hypothesis came into the picture; it just provided a better way to record the annotations. Originally that was done as it’s often done in the absence of a formal way to describe annotation targets: by passing notes like “highlight the word ‘proximodistal’ in the first paragraph of the abstract, and attach this note to it.” This kind of thing happens a lot, and wherever it does there’s an opportunity to adopt a more rigorous approach. Nowadays at Science in the Classroom the annotators use Hypothesis to describe where notes should anchor, as well as what they should say. When an annotated page loads it searches Hypothesis for annotations that target the page, and inserts them using the same format that’s always been used to drive the Learning Lens. Tags assigned by annotators align with Learning Lens categories. The search looks only for notes from designated annotators, so nothing unwanted will appear.

An annotation-powered survey

The Credibility Coalition is “a research community that fosters collaborative approaches to understanding the veracity, quality and credibility of online information.” We worked with them on a project to test a set of signals that bear on the credibility of news stories. Examples of such signals include:

  • Title Representativeness (Does the title of an article accurately reflect its content?)
  • Sources (Does the article cite sources?)
  • Acknowledgement of uncertainty (Does the author acknowledge uncertainty, or the possibility things might be otherwise?)

Volunteers were asked these questions for each of a set of news stories. Many of the questions were yes/no or multiple choice and could have been handled by any survey tool. But some were different. What does “acknowledgement of uncertainty” look like? You know it when you see it, and you can point to examples. But how can a survey tool solicit answers that refer to selections in documents, and record their locations and contexts?

The answer was to create a survey tool that enabled respondents to answer such questions by highlighting one or more selections. Like the HelloWorldAnnotated example above, this was a bespoke client that guided the user through a prescribed workflow. In this case, that workflow was more complex. And because it was defined in a declarative way, the same app can be used for any survey that requires people to provide answers that refer to selections in web documents.

A JavaScript wrapper for the Hypothesis API

The HelloWorldAnnotated example uses functions from a library, hlib, to post an annotation to the Hypothesis service. That library includes functions for searching and posting annotations using the Hypothesis API. It also includes support for interaction patterns common to annotation apps, most of which occur in facet, a standalone tool that searches, displays, and exports sets of annotations. Supported interactions include:

– Authenticating with an API token

– Creating a picklist of groups accessible to the authenticated user

– Assembling and displaying conversation threads

– Parsing annotations

– Editing annotations

– Editing tags

In addition to facet, other tools based on this library include CopyAnnotations and TagRename.

A Python wrapper for the Hypothesis API

If you’re working in Python, hypothesis-api is an alternative API wrapper that supports searching for, posting, and parsing annotations.
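Whatever wrapper you use, the underlying REST call is a simple GET against the Hypothesis search endpoint. Here’s a stdlib-only Python sketch; the endpoint and parameter names come from the Hypothesis API, but treat the function shapes as illustrative rather than as the hypothesis-api package’s interface:

```python
import json
import urllib.parse
import urllib.request

API = "https://api.hypothes.is/api/search"

def build_search_url(**params):
    """Compose a search URL; common parameters include uri, user,
    tag, group, and limit."""
    query = {k: v for k, v in params.items() if v is not None}
    return API + "?" + urllib.parse.urlencode(query)

def search_annotations(token=None, **params):
    """Fetch matching annotations. A token is only needed to see
    private or group annotations."""
    req = urllib.request.Request(build_search_url(**params))
    if token:
        req.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["rows"]
```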

Notifications

If you’re a publisher who embeds Hypothesis on your site, you can use a wildcard search to find annotations. But it would be helpful to be notified when annotations are posted. h_notify is a tool that uses the Hypothesis API to watch for annotations on individual or wildcard URLs, or from particular users, or in a specified group, or with a specified tag.

When an h_notify-based watcher finds notes in any of these ways, it can send alerts to a Slack channel, or to an email address, or add items to an RSS feed.

At Hypothesis we mainly rely on the Slack option. In this example, user nirgendheim highlighted the word “interacting” in a page on the Hypothesis website.

The watcher sent this notice to our #website channel in Slack.

A member of the support team (Hypothesis handle mdiroberts) saw it there and responded to nirgendheim as shown above. How did nirgendheim know that mdiroberts had responded? The core Hypothesis system sends you an email when somebody replies to one of your notes. h_notify is for bulk monitoring and alerting.

A tiny Hypothesis server

People sometimes ask about connecting the Hypothesis client to an alternate server in order to retain complete control over their data. It’s doable: you can follow the instructions here to build and run your own server, and some people and organizations do. Depending on need, though, that can entail more effort, and more moving parts, than may be warranted.

Suppose for example you’re part of a team of investigative journalists annotating web pages for a controversial story, or a team of analysts sharing confidential notes on web-based financial reports. The documents you’re annotating are public, but the notes you’re taking in a Hypothesis private group are so sensitive that you’d rather not keep them in the Hypothesis service. You’d ideally like to spin up a minimal server for that purpose: small, simple, and easy to manage within your own infrastructure.

Here’s a proof of concept. This tiny server clocks in at just 145 lines of Python with very few dependencies. It uses Python’s batteries-included SQLite module for annotation storage. The web framework is Pyramid, only because that’s what I’m familiar with; it could as easily be Flask, the ultra-light framework typically used for this sort of thing.

A tiny app wrapped around those ingredients is all you need to receive JSON payloads from a Hypothesis client, and return JSON payloads when the client searches for annotations to anchor to a page.
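To make that concrete, here's a sketch of the storage half of such a server: SQLite behind two functions that web routes (Pyramid, Flask, or otherwise) would call. The schema and names are my guesses for illustration, not the actual 145-line server.

```python
import json
import sqlite3

def connect(path=":memory:"):
    """Open the annotation store, creating the table on first use."""
    db = sqlite3.connect(path)
    db.execute("""
        create table if not exists annotation (
            id   integer primary key,
            uri  text not null,
            doc  text not null  -- full JSON payload from the client
        )""")
    return db

def create_annotation(db, payload):
    """Store a JSON payload posted by the Hypothesis client."""
    cur = db.execute(
        "insert into annotation (uri, doc) values (?, ?)",
        (payload["uri"], json.dumps(payload)))
    db.commit()
    return cur.lastrowid

def search_annotations(db, uri):
    """Return payloads for a page, as the client expects when anchoring."""
    rows = db.execute(
        "select doc from annotation where uri = ?", (uri,)).fetchall()
    return [json.loads(doc) for (doc,) in rows]
```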

The service is dockerized and easy to deploy. To test it I used the fly.io speedrun to create an instance at https://summer-feather-9970.fly.dev. Then I made the handful of small tweaks to the Hypothesis client shown in client-patches.txt. My method for doing that, typical for quick proofs of concept that vary the Hypothesis client in some small way, goes like this:

  • Clone the Hypothesis client.
  • Edit gulpfile.js to say const IS_PRODUCTION_BUILD = false. This turns off minification so it’s possible to read and debug the client code.
  • Follow the instructions to run the client from a browser extension. After establishing a link between the client repo and browser-extension repo, as per those instructions, use this build command — make build SETTINGS_FILE=settings/chrome-prod.json — to create a browser extension that authenticates to the Hypothesis production service.
  • In a Chromium browser (e.g. Chrome or Edge or Brave) use chrome://extensions, click Load unpacked, and point to the browser-extension/build directory where you built the extension.

This is the easiest way to create a Hypothesis client in which to try quick experiments. There are tons of source files in the repos, but just a handful of bundles and loose files in the built extension. You can run the extension, search and poke around in those bundles, set breakpoints, make changes, and see immediate results.

In this case I only made the changes shown in client-patches.txt:

  • In options/index.html I added an input box to name an alternate server.
  • In options/options.js I sync that value to the cloud and also to the browser’s localStorage.
  • In the extension bundle I check localStorage for an alternate server and, if present, modify the API request used by the extension to show the number of notes found for a page.
  • In the sidebar bundle I check localStorage for an alternate server and, if present, modify the API requests used to search for, create, update, and delete annotations.

I don’t recommend this cowboy approach for anything real. If I actually wanted to use this tweaked client I’d create branches of the client and the browser-extension, and transfer the changes into the source files where they belong. If I wanted to share it with a close-knit team I’d zip up the extension so colleagues could unzip and sideload it. If I wanted to share more broadly I could upload the extension to the Chrome web store. I’ve done all these things, and have found that it’s feasible — without forking Hypothesis — to maintain branches that carry small but strategic changes like this one. But when I’m aiming for a quick proof of concept, I’m happy to be a cowboy.

In any event, here’s the proof. With the tiny server deployed to summer-feather-9970.fly.dev, I poked that address into the tweaked client.

And sure enough, I could search for, create, reply to, update, and delete annotations using that 145-line SQLite-backed server.

The client still authenticates to Hypothesis in the usual way, and behaves normally unless you specify an alternate server. In that case, the server knows nothing about Hypothesis private groups. The client sees it as the Hypothesis public layer, but it’s really the moral equivalent of a private group. Others will see it only if they’re running the same tweaked extension and pointing to the same server. You could probably go quite far with SQLite but, of course, it’s easy to see how you’d swap it out for a more robust database like Postgres.

Signposts

I think of these examples as signposts pointing to a coherent SDK for weaving annotation into any document workflow. They show that it’s feasible to decouple and recombine the core operations: creating an annotation based on a selection in a document, and anchoring an annotation to its place in a document. Why decouple? Reasons are as diverse as the document workflows we engage in. The stock Hypothesis system beautifully supports a wide range of scenarios. Sometimes it’s helpful to replace or augment Hypothesis with a custom app that provides a guided experience for annotators and/or an alternative display for readers. The annotation SDK I envision will make it straightforward for developers to build solutions that leverage the full spectrum of possibility.

End notes

https://www.infoworld.com/article/3263344/how-web-annotation-will-transform-content-management.html

Weaving the annotated web

Working with interdependent Postgres functions and materialized views

In Working with Postgres types I showed an example of a materialized view that depends on a typed set-returning function. Because Postgres knows about that dependency, it won’t allow DROP FUNCTION foo. Instead it requires DROP FUNCTION foo CASCADE.

A similar thing happens with materialized views that depend on tables or other materialized views. Let’s build a cascade of views and consider the implications.

create materialized view v1 as (
  select 
    1 as number,
    'note_count' as label 
);
SELECT 1

select * from v1;

number | label
-------+-------
     1 | note_count

Actually, before continuing the cascade, let’s linger here for a moment. This is a table-like object created without using CREATE TABLE and without explicitly specifying types. But Postgres knows the types.

\d v1;

Materialized view "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

The read-only view can become a read-write table like so.

create table t1 as (select * from v1);
SELECT 1

select * from t1;

number | label
-------+-------
     1 | note_count

\d t1

Table "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

This ability to derive a table from a materialized view will come in handy later. It’s also just interesting to see how the view’s implicit types become explicit in the table.

OK, let’s continue the cascade.

create materialized view v2 as (
  select 
    number + 1 as number,
    label
  from v1
);
SELECT 1

select * from v2;

number | label
-------+-------
     2 | note_count

create materialized view v3 as (
  select 
    number + 1 as number,
    label
  from v2
);
SELECT 1

select * from v3;

number | label
-------+-------
     3 | note_count

Why do this? Arguably you shouldn’t. Laurenz Albe makes that case in Tracking view dependencies in PostgreSQL. Recognizing that it’s sometimes useful, though, he goes on to provide code that can track recursive view dependencies.

I use cascading views advisedly to augment the use of CTEs and functions described in Postgres functional style. Views that refine views can provide a complementary form of the chunking that aids reasoning in an analytics system. But that’s a topic for another episode. In this episode I’ll describe a problem that arose in a case where there’s only a single level of dependency from a table to a set of dependent materialized views, and discuss my solution to that problem.

Here’s the setup. We have an annotation table that’s reloaded nightly. On an internal dashboard we have a chart based on the materialized view annos_at_month_ends_for_one_year which is derived from the annotation table and, as its name suggests, reports annotation counts on a monthly cycle. At the beginning of the nightly load, this happens: DROP TABLE annotation CASCADE. So the derived view gets dropped and needs to be recreated as part of the nightly process. But that’s a lot of unnecessary work for a chart that only changes monthly.

Here are two ways to protect a view from a cascading drop of the table it depends on. Both reside in a SQL script, monthly.sql, that only runs on the first of every month. First, annos_at_month_ends_for_one_year.

drop materialized view annos_at_month_ends_for_one_year;
create materialized view annos_at_month_ends_for_one_year as (
  with last_days as (
    select 
      last_days_of_prior_months( 
        date(last_month_date() - interval '6 year')
      ) as last_day
    ),
    monthly_counts as (
      select
        to_char(last_day, '_YYYY-MM') as end_of_month,
        anno_count_between( 
          date(last_day - interval '1 month'), last_day
        ) as monthly_annos
      from last_days
    )
    select
      end_of_month,
      monthly_annos,
      sum(monthly_annos) over 
        (order by end_of_month asc rows 
           between unbounded preceding and current row
        ) as cumulative_annos
    from monthly_counts
) with data;

Because this view depends indirectly on the annotation table — by way of the function anno_count_between — Postgres doesn’t see the dependency. So the view isn’t affected by the cascading drop of the annotation table. It persists until, once a month, it gets dropped and recreated.

What if you want Postgres to know about such a dependency, so that the view will participate in a cascading drop? You can do this.

create materialized view annos_at_month_ends_for_one_year as (
  with depends as (
    select * from annotation limit 1
  ),
  last_days as (
    -- as above
  ),
  monthly_counts as (
    -- as above
  )
  select
    *
  from monthly_counts
) with data;

The depends CTE doesn’t do anything relevant to the query, it just tells Postgres that this view depends on the annotation table.

Here’s another way to protect a view from a cascading drop. This expensive-to-build view depends directly on the annotation table but only needs to be updated monthly. So in this case, cumulative_annotations is a table derived from a temporary materialized view.

create materialized view _cumulative_annotations as (
  with data as (
    select 
      to_char(created, 'YYYY-MM') as created,
      count(*) as count
    from annotation
    group by created
  ) 
  select 
    data.created, 
    sum(data.count) 
      over ( 
        order by data.created asc 
        rows between unbounded preceding and current row
      )
    from data
    order by data.created
) with data;

drop table cumulative_annotations;

create table cumulative_annotations as (
  select * from _cumulative_annotations
);

drop materialized view _cumulative_annotations;

The table cumulative_annotations is only rebuilt once a month. It depends indirectly on the annotation table but Postgres doesn’t see that, so doesn’t include it in the cascading drop.

Here’s the proof.

-- create a table
create table t1 (number int);

insert into t1 (number) values (1);
INSERT 0 1

select * from t1;

number
-------
     1

-- derive a view from t1
create materialized view v1 as (select * from t1);
SELECT 1

select * from v1;

number
-------
     1

-- try to drop t1
drop table t1;

ERROR: cannot drop table t1 because other objects depend on it
DETAIL: materialized view v1 depends on table t1
HINT: Use DROP ... CASCADE to drop the dependent objects too.

-- derive an independent table from t1 by way of a matview
drop materialized view v1;

create materialized view v1 as (select * from t1);
SELECT 1

create table t2 as (select * from v1);
SELECT 1

-- drop the matview
drop materialized view v1;

-- drop t1
drop table t1;

-- no complaint, and t2 still exists
select * from t2;

number
-------
     1

These are two ways I’ve found to protect a long-lived result set from the cascading drop of a short-lived table on which it depends. You can hide the dependency behind a function, or you can derive an independent table by way of a transient materialized view. I use them interchangeably, and don’t have a strong preference one way or another. Both lighten the load on the analytics server. Materialized views (or tables) that only need to change weekly or monthly, but were being dropped nightly by cascade from core tables, are now recreated only on their appropriate weekly or monthly cycles.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Working in a hybrid Metabase / Postgres code base

In this series I’m exploring how to work in a code base that lives in two places: Metabase questions that encapsulate chunks of SQL, and Postgres functions/procedures/views that also encapsulate chunks of SQL. To be effective working with this hybrid collection of SQL chunks, I needed to reason about their interrelationships. One way to do that entailed the creation of a documentation generator that writes a concordance of callers and callees.

Here’s the entry for the function sparklines_for_guid(_guid).

The called by column says that this function is called from two different contexts. One is a Metabase question, All courses: Weekly activity. If you’re viewing that question in Metabase, you’ll find that its SQL text is simply this:

select * from sparklines_for_guid( {{guid}} )

The same function call appears in a procedure, cache warmer, that preemptively memoizes the function for a set of the most active schools in the system. In either case, you can look up the function in the concordance, view its definition, and review how it’s called.

In the definition of sparklines_for_guid, names of other functions (like guid_hashed_view_exists) appear and are linked to their definitions. Similarly, names of views appearing in SELECT or JOIN contexts are linked to their definitions.

Here’s the entry for the function guid_hashed_view_exists. It is called by sparklines_for_guid as well as by functions that drive panels on the school dashboard. It links to the functions it uses: hash_for_guid and exists_view.

Here’s the entry for the view lms_course_groups which appears as a JOIN target in sparklines_for_guid. This central view is invoked — in SELECT or JOIN context — from many functions, from dependent views, and from Metabase questions.

Metabase questions can also “call” other Metabase questions. In A virtuous cycle for analytics I noted: “Queries can emit URLs in order to compose themselves with other queries.” Here’s an example of that.

This Metabase question (985) calls various linked functions, and is called by two other Metabase questions. Here is one of those.

Because this Metabase question (600) emits a URL that refers to 985, it links to the definition of 985. It also links to the view, top_annotated_domains_last_week, from which it SELECTs.

It was straightforward to include Postgres functions, views, and procedures in the concordance since these live in files that reside in the filesystem under source control. Metabase questions, however, live in a database — in our case, a Postgres database that’s separate from our primary Postgres analytics db. In order to extract them into a file I use this SQL snippet.

select
  r.id,
  m.name as dbname,
  r.name,
  r.description,
  r.dataset_query
from report_card r
join metabase_database m
  on m.id = cast(r.dataset_query::jsonb->>'database' as int)
where not r.archived
order by r.id;

The doc generator downloads that Metabase data as a CSV file, queries.csv, and processes it along with the files that contain the definitions of functions, procedures, and views in the Postgres data warehouse. It also emits queries.txt which is a more convenient way to diff changes from one commit to the next.
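The heart of such a doc generator is a cross-reference pass: scan each SQL chunk for the names of known functions and views, then invert the map to get callers. Here's a hedged sketch of that pass; the names and structure are illustrative, not the actual generator.

```python
import re

def called_names(sql, known_names):
    """Find which known functions/views a chunk of SQL references."""
    found = set()
    for name in known_names:
        # match the bare identifier, not substrings of longer names
        if re.search(rf"\b{re.escape(name)}\b", sql):
            found.add(name)
    return found

def concordance(chunks, known_names):
    """Map each named chunk to its callees, then invert to get callers."""
    callees = {name: called_names(sql, known_names - {name})
               for name, sql in chunks.items()}
    callers = {name: set() for name in known_names}
    for caller, names in callees.items():
        for name in names:
            callers.setdefault(name, set()).add(caller)
    return callees, callers
```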

This technique solved a couple of problems. First, when we were only using Metabase — unaugmented by anything in Postgres — it enabled us to put our Metabase SQL under source control and helped us visualize relationships among queries.

Later, as we augmented Metabase with Postgres functions, procedures, and views, it became even more powerful. Developing a new panel for a school or course dashboard means writing a new memoized function. That process begins as a Metabase question with SQL code that calls existing Postgres functions, and/or JOINs/SELECTs FROM existing Postgres views. Typically it then leads to the creation of new supporting Postgres functions and/or views. All this can be tested by internal users, or even invited external users, in Metabase, with the concordance available to help understand the relationships among the evolving set of functions and views.

When supporting functions and views are finalized, the SQL content of the Metabase question gets wrapped up in a memoized Postgres function that’s invoked from a panel of a dashboard app. At that point the concordance links the new wrapper function to the same set of supporting functions and views. I’ve found this to be an effective way to reason about a hybrid code base as features move from Metabase for prototyping to Postgres in production, while maintaining all the code under source control.

That foundation of source control is necessary, but maybe not sufficient, for a team to consider this whole approach viable. The use of two complementary languages for in-database programming will certainly raise eyebrows, and if it’s not your cup of tea I completely understand. If you do find it appealing, though, one thing you’ll wonder about next will be tooling. I work in VSCode nowadays, and I’ve not yet found a useful extension for pl/pgsql or pl/python. With metaprogramming in the mix, life gets even harder for aspiring pl/pgsql or pl/python VSCode extensions. I can envision them, but I’m not holding my breath awaiting them. Meanwhile, two factors enable VSCode to be helpful even without deep language-specific intelligence.

The first factor, and by far the dominant one, is outlining. In Products and capabilities I reflect on how I’ve never adopted an outlining product, but often rely on outlining capability infused into a product. In VSCode that’s “only” basic indentation-driven folding and unfolding. But I find it works remarkably well across SQL queries, views and functions that embed them, CTEs that comprise them, and pl/pgsql or pl/python functions called by them.

The second factor, nascent thus far, is GitHub Copilot. It’s a complementary kind of language intelligence that’s aware of, but not bounded by, what a file extension of .sql or .py implies. It can sometimes discern patterns that mix language syntaxes and offer helpful suggestions. That hasn’t happened often so far, but it’s striking when it does. I don’t yet know the extent to which it may be training me while I train it, or how those interactions might influence others. At this point I’m not a major Copilot booster, but I am very much an interested observer of and participant in the experiment.

All in all, I’m guardedly optimistic that existing or feasible tooling can enable individuals and teams to sanely navigate the hybrid corpus of source code discussed in this series. If you’re along for the ride, you’ll next wonder about debugging and monitoring a system built this way. That’s a topic for a future episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres functional style

My dual premises in this series are:

– Modern SQL is more valuable as a programming language than you might think (see Markus Winand’s Modern SQL: A lot has changed since SQL-92)

– Postgres is more valuable as a programming environment than you might think (see R0ml Lefkowitz’s The Image of Postgres)

As the patron saint of trailing edge technology it is my duty to explore what’s possible at the intersection of these two premises. The topic for this episode is Postgres functional style. Clearly what I’ve been doing with the combo of pl/python and pl/pgsql is very far from pure functional programming. The self-memoization technique shown in episode 7 is all about mutating state (ed: this means writing stuff down somewhere). But it feels functional to me in the broader sense that I’m using functions to orchestrate behavior that’s expressed in terms of SQL queries.

To help explain what I mean, I’m going to unpack one of the Postgres functions in our library.

count_of_distinct_lms_students_from_to(_guid text, _from date, _to date)

This is a function that accepts a school id (aka guid), a start date, and an end date. Its job is to:

– Find all the courses (groups) for that school (guid)

– Filter to those created between the start and end date

– Find all the users in the filtered set of courses

– Filter to just students (i.e. omit instructors)

– Remove duplicate students (i.e., who are in more than one course)

– Return the count of distinct students at the school who annotated in the date range

The production database doesn’t yet store things in ways friendly to this outcome, so doing all this requires some heavy lifting in the analytics data warehouse. Here’s the function that orchestrates that work.

create function count_of_distinct_lms_students_from_to(_guid text, _from date, _to date) 
  returns bigint as $$
  declare count bigint;
  begin
 1  -- all groups active for the guid in the date range
 2  with groups as (
 3    select pubid from groups_for_guid(_guid)
 4    where group_is_active_from_to(pubid, _from, _to)
 5  ),
 6  -- usernames in those groups
 7  usernames_by_course as (
 8    select
 9      pubid,
10      (users_in_group(pubid)).username 
11  from groups 
12  ),
13  -- filtered to just students
14  students_by_course as (
15    select * from usernames_by_course
16    where not is_instructor(username, pubid)
17  )
18  select 
19    count (distinct username) 
20  from students_by_course
    into count;
    return count;
  end;
$$ language plpgsql;

If you think pl/pgsql is old and clunky, then you are welcome to do this in pl/python instead. There’s negligible difference between how they’re written and how fast they run. It’s the same chunk of SQL either way, and it exemplifies the functional style I’m about to describe.

Two kinds of chunking work together here: CTEs (aka common table expressions, aka WITH clauses) and functions. If you’ve not worked with SQL for a long time, as I hadn’t, then CTEs may be unfamiliar. I think of them as pipelines of table transformations in which each stage of the pipeline gets a name. In this example those names are groups (line 2), usernames_by_course (line 7), and students_by_course (line 14).

The pipeline phases aren’t functions that accept parameters, but I still think of them as being function-like in the sense that they encapsulate named chunks of behavior. The style I’ve settled into aims to make each phase of the pipeline responsible for a single idea (“groups active in the range”, “usernames in those groups”), and to express that idea in a short snippet of SQL.

As I’m developing one of these pipelines, I test each phase. To test the first phase, for example, I’d do this in psql or Metabase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
)
select * from groups;

And I’d spot-check to make sure the selected groups for that school really are in the date range. Then I’d check the next phase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
),
-- usernames in those groups
usernames_by_course as (
  select
    pubid,
    (users_in_group(pubid)).username 
  from groups 
)
select * from usernames_by_course;

After another sanity check against these results, I’d continue to the next phase, and eventually arrive at the final result. It’s the same approach I take with regular expressions. I am unable to visualize everything that’s happening in a complex regex. But I can reason effectively about a pipeline of matches that occur in easier-to-understand named steps.

Ideally each phase in one of these pipelines requires just a handful of lines of code: few enough to fit within the 7 ± 2 limit of working memory. Postgres functions make that possible. Here are the functions used in this 20-line chunk of SQL.

groups_for_guid(guid): Returns a table of course ids for a school.

group_is_active_from_to(pubid, _from, _to): Returns true if the group was created in the range.

users_in_group(pubid): Returns a table of user info for a course.

is_instructor(username, pubid): Returns true if that user is an instructor.

Two of these, groups_for_guid and users_in_group, are set-returning functions. As noted in Working with Postgres types, they have the option of returning an explicit Postgres type defined elsewhere, or an implicit Postgres type defined inline. As it happens, both do the latter.

create or replace function groups_for_guid(_guid text)
  returns table(
    pubid text
  ) as $$
create or replace function users_in_group (_pubid text)
  returns table (
    groupid text, 
    username text, 
    display_name text
  ) as $$

The other two, group_is_active_from_to and is_instructor, return boolean values.

All this feels highly readable to me now, but the syntax of line 10 took quite a while to sink in. It helps me to look at what users_in_group(pubid) does in a SELECT context.

select * from users_in_group('4VzA92Yy')

groupid   | username    | display_name
----------+-------------+----------------
4VzA92Yy  | 39vA94AsQp  | Brendan Nadeau

Here is an alternate way to write the usernames_by_course CTE at line 7.

-- usernames in those groups
usernames_by_course as (
  select
    g.pubid,
    u.username
from groups g 
join users_in_group(g.pubid) u on g.pubid = u.groupid
)
select * from usernames_by_course;

Both do exactly the same thing in very close to the same amount of time. Having mixed the two styles I’m leaning toward the first, but you could go either way or both. What matters more is the mental leverage you wield when writing CTEs and functions together to compose pipelines of transformations, and that others wield when reading and debugging.

I hope I’ve made the case for writing and reading. There’s a case to be made for debugging too, but that’s another episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres set-returning functions that self-memoize as materialized views

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection. In episode 5 I discussed metaprogramming, by which I mean using pl/python to compose and run SQL code. This episode features introspection, by which I mean taking advantage of Python’s inspect module to enable a pl/python function to discover its own name.

Why do that? In this context, so that the function can create a materialized view by joining its own name with the value of its first parameter. Here’s the example from episode 5.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    -- SQL THAT RETURNS A SETOF QUESTION_AND_ANSWER_FOR_GROUP
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

The function drives a panel on the course dashboard. An initial call to, say, questions_and_answers_for_group('P1mQaEEp'), creates the materialized view questions_and_answers_for_group_p1mqaeep and returns SELECT * from the view. Subsequent calls skip creating the view and just return SELECT * from it.

Note that even though the group name is mixed case, the view name created by Postgres is all lowercase. For example:

create materialized view test_AbC as (select 'ok') with data;
SELECT 1

\d test_AbC
Materialized view "public.test_abc"

I want to think of this as a form of capability injection, but it’s really more like a capability wrapper. The capability is memoization. A function endowed with it can run a SQL query and cache the resulting rows in a materialized view before returning them to a SQL SELECT context. The wrapper is boilerplate code that discovers the function’s name, checks for the existence of a corresponding view, and if it isn’t found, calls memoize_view_name(sql, view_name) to run an arbitrary chunk of SQL code whose result set matches the function’s type. So in short: this pattern wraps memoization around a set-returning pl/python function.

As noted in episode 5, although memoize_view_name is called from pl/python, it is not itself a pl/python function. It’s a normal Python function in a module that’s accessible to the instance of Python that the Postgres pl/python extension is using. In my case that module is just a few small functions in a file called plpython_helpers.py, installed (cough, copied) to user postgres's ~/.local/lib/python3.8/site-packages/plpython_helpers.py.

So far, there are only two critical functions in that module: get_caller_name() and memoize_view_name().

Here is get_caller_name().

import inspect
import re

def get_caller_name():
  base_view_name = inspect.stack()[1][3].replace('__plpython_procedure_', '')
  return re.sub(r'_\d+$', '', base_view_name)

The internal name for a pl/python function created by CREATE FUNCTION foo() looks like __plpython_procedure_foo_981048462. What get_caller_name() returns is just foo.
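The name-stripping part can be exercised on its own. This standalone helper is hypothetical (the real get_caller_name() finds the internal name via inspect.stack(); here we pass it in directly), but the two transformations are the same.

```python
import re

# Hypothetical standalone version of the name cleanup: strip the
# __plpython_procedure_ prefix, then the trailing _<oid> suffix.
def base_name(internal_name):
    stripped = internal_name.replace('__plpython_procedure_', '')
    return re.sub(r'_\d+$', '', stripped)

name = base_name('__plpython_procedure_foo_981048462')  # 'foo'
```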

Here’s memoize_view_name().

import base64
import os

def memoize_view_name(sql, view_name):
  sql = sql.replace('\n', ' ')
  encoded_bytes = base64.b64encode(sql.encode('utf-8'))
  encoded_str = str(encoded_bytes, 'utf-8')
  cmd = f"""psql -d h_analytics -c "call memoizer('{encoded_str}', '{view_name}')" """
  result = os.system(cmd)
  print(f'memoize_view_name: {cmd} result: {result}')

Given a chunk of SQL and the name of a view, it converts newlines to spaces, base64-encodes the query text, and invokes psql to call a procedure, memoizer, that does the work of running the SQL query and creating the materialized view from those results. So for example the function that yields sparkline data for a school might look like sparkline_data_for_school('af513ee'), and produce the view sparkline_data_for_school_af513ee.
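The encoding step matters because the query text rides inside a quoted psql command line. A quick round trip shows why base64 is a safe carrier: its output alphabet contains no quotes or whitespace, and the memoizer procedure can recover the exact flattened query.

```python
import base64

sql = """select *
from questions_for_group('P1mQaEEp')"""

# flatten newlines, then base64-encode so the query travels safely
# inside the quoted psql command line
flattened = sql.replace('\n', ' ')
encoded_str = str(base64.b64encode(flattened.encode('utf-8')), 'utf-8')

# the memoizer procedure reverses the encoding before running the query
decoded_str = str(base64.b64decode(encoded_str), 'utf-8')
```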

Why shell out to psql here? It may not be necessary; there may be a way to manage the transaction directly within the function, but if so I haven’t found it. I’m very far from being an expert on transaction semantics and will appreciate guidance if anyone cares to offer it. Meanwhile, this technique seems to work well. Note that memoizer is a Postgres procedure, not a function. Although “stored procedures” is the term that I’ve always associated with in-database programming, I went pretty far down this path using only CREATE FUNCTION, never CREATE PROCEDURE. When I eventually went there I found the distinction between functions and procedures to be a bit slippery. This StackOverflow answer matches what I’ve observed.

PostgreSQL 11 added stored procedures as a new schema object. You can create a new procedure by using the CREATE PROCEDURE statement.

Stored procedures differ from functions in the following ways:

  • Stored procedures do not have to return anything, and only return a single row when using INOUT parameters.

  • You can commit and rollback transactions inside stored procedures, but not in functions.

  • You execute a stored procedure using the CALL statement rather than a SELECT statement.

  • Unlike functions, procedures cannot be nested in DML commands (SELECT, INSERT, UPDATE, DELETE).

Here is the memoizer procedure. It happens to be written in pl/python but could as easily have been written in pl/pgsql using the built-in Postgres decode function. Procedures, like functions, can be written in either language (or others) and share the common Postgres type system.

create procedure memoizer(_sql text, _view_name text) as $$
  import base64
  decoded_bytes = base64.b64decode(_sql)
  decoded_str = str(decoded_bytes, 'utf-8')
  create = f"""
    create materialized view if not exists {_view_name} as (
      {decoded_str}
    ) with data;
    """
  plpy.execute(create)
  permit = f"""
    grant select on {_view_name} to analytics;
  """
  plpy.execute(permit)
$$ language plpython3u;

There’s no plpy.commit() here because psql takes care of that automatically. Eventually I wrote other procedures, some of which do their own committing, but that isn’t needed here.

Of course it’s only possible to shell out to psql from a function because pl/python is an “untrusted” language extension. Recall from episode 1:

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Using Python’s os.system() to invoke psql is another of those superpowers. It’s not something I do lightly, and if there’s a better/safer way I’m all ears.

Meanwhile, this approach is delivering much value. We have two main dashboards, each of which displays a dozen or so panels. The school dashboard reports on annotation activity across all courses at a school. The course dashboard reports on the documents, and selections within documents, that instructors and students are discussing in the course’s annotation layer. Each panel that appears on the school or course dashboard is the output of a memoized function that is parameterized by a school or course id.

The data warehouse runs on a 24-hour cycle. Within that cycle, the first call to a memoized function takes just as long as it takes to run the SQL wrapped by the function. The cached view only comes into play when the function is called again during the same cycle. That can happen in a few different ways.

– A user reloads a dashboard, or a second user loads it.

– A panel expands or refines the results of another panel. For example, questions_and_answers_for_group() provides a foundation for a family of related functions including:

questions_asked_by_teacher_answered_by_student()

questions_asked_by_student_answered_by_teacher()

questions_asked_by_student_answered_by_student()

– A scheduled job invokes a function in order to cache its results before any user asks for them. For example, the time required to cache panels for school dashboards varies a lot. For schools with many active courses it can take minutes to run those queries, so preemptive memoization matters a lot. For schools with fewer active courses it’s OK to memoize on the fly. This method enables flexible cache policy. Across schools we can decide how many of the most-active ones to cache. Within a school, we can decide which courses to cache, e.g. most recent, or most active. The mechanism to display a dashboard panel is always the same function call. The caching done in support of that function is highly configurable.

Caches, of course, must be purged. Since these materialized views depend on core tables it was enough, early on, to do this for views depending on the annotation table.

drop table annotation cascade;

At a certain point, with a growing number of views built during each cycle, the cascade failed.

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

Raising max_locks_per_transaction wasn’t the answer. Instead we switched to enumerating views and dropping them individually. Again that afforded great flexibility. We can scan the names in the pg_matviews system table and match all the memoized views, or just those for a subset of schools, or just particular panels on school or course dashboards. Policies that govern the purging of cached views can be as flexible as those that govern their creation.
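A minimal sketch of that enumerate-and-drop approach, with made-up view names: in practice the list would come from select matviewname from pg_matviews, and the resulting statements would be executed one at a time rather than relying on a single giant CASCADE.

```python
import re

# Hypothetical view names, standing in for a query against pg_matviews.
matviews = [
    'questions_and_answers_for_group_p1mqaeep',
    'sparkline_data_for_school_af513ee',
    'unrelated_view',
]

# Memoized views follow the naming convention <function>_for_<scope>_<id>,
# so a pattern match picks them out while leaving other views alone.
memoized = re.compile(r'_for_(group|school)_\w+$')

drops = [f'drop materialized view if exists {name};'
         for name in matviews if memoized.search(name)]
```

Narrowing the regex (to a particular school id, or a particular panel prefix) narrows the purge accordingly.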


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres and JSON: Finding document hotspots (part 1)

One of the compelling aspects of modern SQL is the JSON support built into modern engines, including Postgres. The documentation is well done, but I need examples to motivate my understanding of where and how and why to use such a capability. The one I’ll use in this episode is something I call document hotspots.

Suppose a teacher has asked her students to annotate Arthur Miller’s The Crucible. How can she find the most heavily-annotated passages? They’re visible in the Hypothesis client, of course, but may be sparsely distributed. She can scroll through the 154-page PDF document to find the hotspots, but it will be helpful to see a report that brings them together. Let’s do that.

The Hypothesis system stores annotations using a blend of SQL and JSON datatypes. Consider the sample annotation below. When the Hypothesis client creates that annotation it sends a JSON payload to the server; likewise, when the client subsequently requests the annotation in order to anchor it to the document, it receives a similar JSON payload.

{
  "id": "VLUhcP1-EeuHn5MbnGgJ0w",
  "created": "2021-08-15T04:07:39.343275+00:00",
  "updated": "2021-08-15T04:07:39.343275+00:00",
  "user": "acct:judell@hypothes.is",
  "uri": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
  "text": "\"He is no one's favorite clergyman.\"  :-)\n\nhttps://www.thoughtco.com/crucible-character-study-reverend-parris-2713521",
  "tags": [],
  "group": "__world__",
  "permissions": {
    "read": [
      "group:__world__"
    ],
    "admin": [
      "acct:judell@hypothes.is"
    ],
    "update": [
      "acct:judell@hypothes.is"
    ],
    "delete": [
      "acct:judell@hypothes.is"
    ]
  },
  "target": [
    {
      "source": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
      "selector": [
        {
          "end": 44483,
          "type": "TextPositionSelector",
          "start": 44392
        },
        {
          "type": "TextQuoteSelector",
          "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.",
          "prefix": " sixty-six pound, Mr. Proctor! I",
          "suffix": " Giles: Aye, and well instructed"
        }
      ]
    }
  ],
  "document": {
    "title": [
      "The%20Crucible%20full%20text.pdf"
    ]
  },
  "links": {
    "html": "https://hypothes.is/a/VLUhcP1-EeuHn5MbnGgJ0w",
    "incontext": "https://hyp.is/VLUhcP1-EeuHn5MbnGgJ0w/ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
    "json": "https://hypothes.is/api/annotations/VLUhcP1-EeuHn5MbnGgJ0w"
  },
  "user_info": {
    "display_name": "Jon Udell"
  },
  "flagged": false,
  "hidden": false
}

The server mostly shreds this JSON into conventional SQL types. The tags array, for example, is hoisted out of the JSON into a SQL array-of-text. The expression to find its length is a conventional Postgres idiom: array_length(tags,1). Note the second parameter; array_length(tags) is an error, because Postgres arrays can be multidimensional. In this case there’s only one dimension but it’s still necessary to specify that.

A target_selectors column, though, is retained as JSON. These selectors define how an annotation anchors to a target selection in a document. Because selectors are used only by the Hypothesis client, which creates and consumes them in JSON format, there’s no reason to shred them into separate columns. In normal operation, selectors don’t need to be related to core tables. They can live in the database as opaque blobs of JSON.

For some analytic queries, though, it is necessary to peer into those blobs and relate their contents to core tables. There’s a parallel set of functions for working with JSON. For example, the target_selectors column corresponds to the target[0]['selector'] array in the JSON representation. The expression to find the length of that array is jsonb_array_length(target_selectors).

Here’s a similar expression that won’t work: json_array_length(target_selectors). Postgres complains that the function doesn’t exist.

ERROR: function json_array_length(jsonb) does not exist 
Hint: No function matches the given name and argument types.

In fact both functions, json_array_length and jsonb_array_length, exist. But Postgres knows the target_selectors column is of type jsonb, not json, which is the argument type json_array_length expects. What’s the difference between json and jsonb?

The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/12/datatype-json.html

Although I tend to use JSON to refer to data in a variety of contexts, the flavor of JSON in the Postgres queries, views, and functions I’ll discuss will always be jsonb. The input conversion overhead isn’t a problem for analytics work that happens in a data warehouse, and the indexing support is a tremendous enabler.

To illustrate some of the operators common to json and jsonb, here is a query that captures the target_selectors column from the sample annotation.

with example as (
  select 
    id,
    target_selectors as selectors
  from annotation
  where id = '54b52170-fd7e-11eb-879f-931b9c6809d3'
  )
  select * from example;

Here are some other queries against example.

select selectors from example;

[{"end": 44483, "type": "TextPositionSelector", "start": 44392}, { ... } ]

The result is a human-readable representation, but the type of selectors is jsonb.

select pg_typeof(selectors) from example;

jsonb

The array-indexing operator, ->, can yield the zeroth element of the array.

select selectors->0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result is again a human-readable representation of a jsonb type.

select pg_typeof(selectors->0) from example;

jsonb

Another array-indexing operator, ->>, can also yield the zeroth element of the array, but now as type text.

select selectors->>0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result looks the same, but the type is different.

select pg_typeof(selectors->>0) from example;

text
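A rough Python analogy for the distinction (this is an analogy, not Postgres itself): -> hands back a value that is still JSON and can be indexed further, while ->> hands back plain text.

```python
import json

selectors = json.loads(
    '[{"end": 44483, "type": "TextPositionSelector", "start": 44392}]'
)

arrow = selectors[0]            # like selectors->0: still structured, still indexable
arrow_text = json.dumps(arrow)  # like selectors->>0: just a string of text
```

Indexing arrow by key works, just as chaining -> does in SQL; trying the same on arrow_text would fail, just as chaining -> after ->> does.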

The -> and ->> operators can also index objects by their keys. These examples work with the object that is the zeroth element of the array.

select selectors->0->'type' from example;

"TextPositionSelector"

select pg_typeof(selectors->0->'type') from example;

jsonb

select selectors->0->>'type' from example;

TextPositionSelector

select pg_typeof(selectors->0->>'type') from example;

text

The Hypothesis system stores the location of a target (i.e., the selection in a document to which an annotation refers) in the target_selectors column we’ve been exploring. It records two kinds of selector. TextQuoteSelector represents the selection as the exact highlighted text bracketed by snippets of context. TextPositionSelector represents it as a pair of numbers that mark the beginning and end of the selection. When one range formed by that numeric pair is equal to another, two students have annotated the same selection. When a range contains another range, one student annotated the containing range, and another student made an overlapping annotation on the contained range. We can use these facts to surface hotspots where annotations overlap exactly or in nested fashion.
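The two range relationships are simple enough to state as code. Here is a sketch in plain Python, with ranges as (start, end) pairs; the function names are mine, not part of the Hypothesis system.

```python
# Ranges are (start, end) pairs taken from TextPositionSelector.
def same_selection(a, b):
    # two annotations anchor to exactly the same passage
    return a == b

def nested(outer, inner):
    # one annotation's selection lies strictly inside another's
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner
```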

To start, let’s have a function to extract the start/end range from an annotation. In a conventional programming language you might iterate through the selectors in the target_selectors array looking for the one with the type TextPositionSelector. That’s possible in pl/pgsql and pl/python, but Postgres affords a more SQL-oriented approach. Given a JSON array, the function jsonb_array_elements returns a table-like object with rows corresponding to array elements.

select jsonb_array_elements(selectors) from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}
{"type": "TextQuoteSelector", "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.", "prefix": " sixty-six pound, Mr. Proctor! I", "suffix": " Giles: Aye, and well instructed"}

A function can convert the array to rows, select the row of interest, select the start and end values from the row, package the pair of numbers as an array, and return the array.

create function position_from_anno(_id uuid) returns numeric[] as $$
  declare range numeric[];
  begin
    with selectors as (
      select jsonb_array_elements(target_selectors) as selector
      from annotation
      where id = _id
    ),
    position as (
      select
        (selector->>'start')::numeric as startpos,
        (selector->>'end')::numeric as endpos
      from selectors
      where selector->>'type' = 'TextPositionSelector'
    )
    select array[p.startpos, p.endpos] 
    from position p
    into range;
    return range;
  end;  
$$ language plpgsql;

Using it for the sample annotation:

select position_from_anno('54b52170-fd7e-11eb-879f-931b9c6809d3')

position_from_anno
------------------
{44392,44483}

I’ll show how to use position_from_anno to find document hotspots in a later episode. The goal here is just to introduce an example, and to illustrate a few of the JSON functions and operators.

What’s most interesting, I think, is this part.

where selector->>'type' = 'TextPositionSelector'

Although the TextPositionSelector appears as the first element of the selectors array, that isn’t guaranteed. In a conventional language you’d have to walk through the array looking for it. SQL affords a declarative way to find an element in a JSON array.
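For comparison, here is what that conventional-language walk looks like in Python: the generator expression plays the role of the SQL WHERE clause, finding the positional selector wherever it sits in the array.

```python
# Sample selectors array, order not guaranteed.
selectors = [
    {'type': 'TextQuoteSelector', 'exact': '...'},
    {'end': 44483, 'type': 'TextPositionSelector', 'start': 44392},
]

# Find the TextPositionSelector and package its start/end as a pair.
position = next(
    [s['start'], s['end']]
    for s in selectors
    if s['type'] == 'TextPositionSelector'
)
```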



pl/python metaprogramming

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection.

Although this episode focuses on metaprogramming — by which I mean using Python to dynamically compose and run SQL queries — my favorite example combines all three aspects.

The context for the example is an analytics dashboard with a dozen panels, each driven by a pl/python function that’s parameterized by the id of a school or a course. So, for example, the Questions and Answers panel on the course dashboard is driven by a function, questions_and_answers_for_group(group_id), which wraps a SQL query that:

– calls another pl/python function, questions_for_group(group_id), to find notes in the group that contain question marks

– finds the replies to those notes

– builds a table that summarizes the question/answer pairs

Here’s the SQL wrapped by the questions_and_answers_for_group(group_id) function.

sql = f"""
  with questions as ( 
    select *
    from questions_for_group('{_group_id}')
    ),
  ids_and_refs as (
    select 
      id,
      unnest ("references") as ref
      from annotation
      where groupid = '{_group_id}'
  ),
  combined as (
    select
      q.*,
      array_agg(ir.id) as reply_ids
    from ids_and_refs ir
    inner join questions q on q.id = ir.ref
    group by q.id, q.url, q.title, q.questioner, q.question, q.quote
  ),
  unnested as (
    select
      c.url,
      c.title,
      c.quote,
      c.questioner,
      c.question,
      unnest(reply_ids) as reply_id
    from combined c
  )
  select distinct
    course_for_group('{_group_id}') as course,
    teacher_for_group('{_group_id}') as teacher,
    clean_url(u.url) as url,
    u.title,
    u.quote,
    u.questioner,
    (regexp_matches(u.question, '.+\?'))[1] as question,
    display_name_from_anno(u.reply_id) as answerer,
    text_from_anno(u.reply_id) as answer,
    app_host() || '/course/render_questions_and_answers/{_group_id}' as viewer
  from unnested u
  order by course, teacher, url, title, questioner, question
"""

This isn’t yet what I mean by pl/python metaprogramming. You could as easily wrap this SQL code in a pl/pgsql function. More easily, in fact, because in pl/pgsql you could just write _group_id instead of '{_group_id}'.

To get where we’re going, let’s zoom out and look at the whole questions_and_answers_for_group(group_id) function.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    <SEE ABOVE>
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

This still isn’t what I mean by metaprogramming. It introduces introspection — this is a pl/python function that discovers its own name and works with an eponymous materialized view — but that’s for a later episode.

It also introduces the use of Python modules by pl/python functions. A key thing to note here is that this is an example of what I call a memoizing function. When called it looks for a materialized view that captures the results of the SQL query shown above. If yes, it only needs to use a simple SELECT to return the cached result. If no, it calls memoize_view_name to run the underlying query and cache it in a materialized view that the next call to questions_and_answers_for_group(group_id) will use in a simple SELECT. Note that memoize_view_name is a special function that isn’t defined in Postgres using CREATE FUNCTION foo() like a normal pl/python function. Instead it’s defined using def foo() in a Python module called plpython_helpers. The functions there can do things — like create materialized views — that pl/python functions can’t. More about that in another episode.

The focus in this episode is metaprogramming, which is used in this example to roll up the results of multiple calls to questions_and_answers_for_group(group_id). That happens when the group_id refers to a course that has sections. If you’re teaching the course and you’ve got students in a dozen sections, you don’t want to look at a dozen dashboards; you’d much rather see everything on the primary course dashboard.

Here’s the function that does that consolidation.

create function consolidated_questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    get_caller_name,
    sql_for_consolidated_and_memoized_function_for_group
  )
  base_view_name = get_caller_name()
  sql = sql_for_consolidated_and_memoized_function_for_group(
    plpy, base_view_name, 'questions_and_answers_for_group', _group_id)
  sql += ' order by course, url, title, questioner, answerer'
  return plpy.execute(sql)
$$ language plpython3u;

This pl/python function not only memoizes its results as above, but also consolidates results for all sections of a course. The memoization happens here.

def sql_for_consolidated_and_memoized_function_for_group(plpy, 
    base_view_name, function, group_id):
  view_name = f'{base_view_name}_{group_id}'
  sql = f""" select exists_view('{view_name}') as exists """
  exists = row_zero_value_for_colname(plpy, sql, 'exists')
  if exists:
    sql = f""" select * from {view_name} """
  else:
    sql = consolidator_for_group_as_sql(plpy, group_id, function)
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return sql

The consolidation happens here, and this is finally what I think of as classical metaprogramming: using Python to compose SQL.

def consolidator_for_group_as_sql(plpy, _group_id, _function):
  sql = f"select type_for_group('{_group_id}') as type"
  type = row_zero_value_for_colname(plpy, sql, 'type')
  if type == 'section_group' or type == 'none':
    sql = f"select * from {_function}('{_group_id}')"
  if type == 'course_group' or type == 'course':
    sql = f"select has_sections('{_group_id}')"
    has_sections = row_zero_value_for_colname(plpy, sql, 'has_sections')
    if has_sections:
      sql = f"""
        select array_agg(group_id) as group_ids 
        from sections_for_course('{_group_id}')
      """
      group_ids = row_zero_value_for_colname(plpy, sql, 'group_ids')
      selects = [f"select * from {_function}('{_group_id}') "]
      for group_id in group_ids:
        selects.append(f"select * from {_function}('{group_id}')")
      sql = ' union '.join(selects)
    else:
      sql = f"select * from {_function}('{_group_id}')"
  return sql

If the inbound _group_id is p1mqaeep, the inbound _function is questions_and_answers_for_group, and the group has no sections, the SQL will just be select * from questions_and_answers_for_group('p1mqaeep').

If the group does have sections, then the SQL will instead look like this:

select * from questions_and_answers_for_group('p1mqaeep')
union 
select * from questions_and_answers_for_group('x7fe93ba')
union
select * from questions_and_answers_for_group('qz9a4b3d')
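The UNION-building step, isolated from the database lookups, is just string composition. This standalone helper is hypothetical (consolidator_for_group_as_sql first queries Postgres for the group type and its sections; here the section ids are passed in directly), but the assembled SQL is the same.

```python
def consolidating_sql(function, course_group_id, section_group_ids):
    # one SELECT for the course group plus one per section, joined by UNION
    selects = [f"select * from {function}('{course_group_id}')"]
    for group_id in section_group_ids:
        selects.append(f"select * from {function}('{group_id}')")
    return ' union '.join(selects)

sql = consolidating_sql('questions_and_answers_for_group',
                        'p1mqaeep', ['x7fe93ba', 'qz9a4b3d'])
```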

This is a very long-winded way of saying that pl/python is an effective way to compose and run arbitrarily complex SQL code. In theory you could do the same thing using pl/pgsql; in practice it would be insane to try. I’ve entangled the example with other aspects (modules, introspection) because that’s the real situation. pl/python’s maximal power emerges from the interplay of all three aspects. That said, it’s a fantastic way to extend Postgres with user-defined functions that compose and run SQL code.



The Tao of Unicode Sparklines

I’ve long been enamored of the sparkline, a graphical device which its inventor Edward Tufte defines thusly:

A sparkline is a small, intense, simple, word-sized graphic with typographic resolution. Sparklines mean that graphics are no longer cartoonish special occasions with captions and boxes, but rather sparkline graphics can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic.

Nowadays you can create sparklines in many tools including Excel and Google Sheets, both of which can use the technique to pack a summary of a series of numbers into a single cell. By stacking such cells vertically you can create views that compress vast amounts of information.

In A virtuous cycle for analytics I noted that we often use Metabase to display tables and charts based on extracts from our Postgres warehouse. I really wanted to use sparklines to summarize views of activity over time, but that isn’t yet an option in Metabase.

When Metabase is connected to Postgres, though, you can write Metabase questions that can not only call built-in Postgres functions but can also call user-defined functions. Can such a function accept an array of numbers and return a sparkline for display in the Metabase table viewer? Yes, if you use Unicode characters to represent the variable-height bars of a sparkline.

There’s a page at rosettacode.org devoted to Unicode sparklines based on this sequence of eight characters:

U+2581 LOWER ONE EIGHTH BLOCK
U+2582 LOWER ONE QUARTER BLOCK
U+2583 LOWER THREE EIGHTHS BLOCK
U+2584 LOWER HALF BLOCK
U+2585 LOWER FIVE EIGHTHS BLOCK
U+2586 LOWER THREE QUARTERS BLOCK
U+2587 LOWER SEVEN EIGHTHS BLOCK
U+2588 FULL BLOCK

Notice that 2581, 2582, and 2588 are narrower than the rest. I’ll come back to that at the end.

If you combine them into a string of eight characters you get this result:

▁▂▃▄▅▆▇█

Notice that the fourth and eighth characters in the sequence drop below the baseline. I’ll come back to that at the end too.

These characters can be used to define eight buckets into which numbers in a series can be quantized. Here are some examples from the rosettacode.org page:

“1 2 3 4 5 6 7 8 7 6 5 4 3 2 1” -> ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
“1.5, 0.5 3.5, 2.5 5.5, 4.5 7.5, 6.5” -> ▂▁▄▃▆▅█▇
“0, 1, 19, 20” -> ▁▁██
“0, 999, 4000, 4999, 7000, 7999” -> ▁▁▅▅██

To write a Postgres function that would do this, I started with the Python example from rosettacode.org:

bar = '▁▂▃▄▅▆▇█'
barcount = len(bar)
 
def sparkline(numbers):
    mn, mx = min(numbers), max(numbers)
    extent = mx - mn
    sparkline = ''.join(bar[min([barcount - 1,
                                 int((n - mn) / extent * barcount)])]
                        for n in numbers)
    return mn, mx, sparkline

While testing it I happened to try an unchanging sequence, [3, 3, 3, 3], which fails with a divide-by-zero error. In order to address that, and to unpack the algorithm a bit for readability, I arrived at this Postgres function:

create function sparkline(numbers bigint[]) returns text as $$

    def bar_index(num, _min, bar_count, extent):
        index = min(bar_count - 1, int((num - _min) / extent * bar_count))
        return index

    bars = '\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588'
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min 

    if extent == 0:  # avoid divide by zero if all numbers are equal
        extent = 1

    bar_count = len(bars)
    sparkline = ''
    for num in numbers:
        index = bar_index(num, _min, bar_count, extent)
        sparkline = sparkline + bars[index]

    return sparkline

$$ language plpython3u;
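The same body runs as ordinary Python outside Postgres, which makes the flat-sequence edge case easy to verify: a constant series now renders as a row of minimal bars instead of dividing by zero.

```python
# Pure-Python port of the pl/python function body above.
def sparkline(numbers):
    bars = '\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588'
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min
    if extent == 0:  # all numbers equal: render a flat line instead of crashing
        extent = 1
    bar_count = len(bars)
    out = ''
    for num in numbers:
        index = min(bar_count - 1, int((num - _min) / extent * bar_count))
        out += bars[index]
    return out
```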

Here’s a psql invocation of the function:

analytics=# select sparkline(array[1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]);
    sparkline
-----------------
▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
(1 row)
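The same logic is easy to sanity-check outside Postgres. Here’s the body of the function as plain Python, including the guard for an unchanging sequence:

```python
def sparkline(numbers):
    """Quantize a list of numbers into eight Unicode block characters."""
    bars = '\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588'
    bar_count = len(bars)
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min
    if extent == 0:  # avoid divide by zero when all numbers are equal
        extent = 1
    return ''.join(
        bars[min(bar_count - 1, int((n - _min) / extent * bar_count))]
        for n in numbers)

print(sparkline([1, 2, 3, 4, 5, 6, 7, 8]))  # ▁▂▃▄▅▆▇█
print(sparkline([3, 3, 3, 3]))              # ▁▁▁▁ (no crash)
```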

And here’s an example based on actual data:

Each row represents a university course in which students and teachers are annotating the course readings. Each bar represents a week’s worth of activity. Their heights are not comparable from row to row; some courses do a lot of annotating and some not so much; each sparkline reports relative variation from week to week; the sum and weekly max columns report absolute numbers.

This visualization makes it easy to see that annotation was occasional in some courses and continuous in others. And when you scroll, the temporal axis comes alive; try scrolling this view to see what I mean.

We use the same mechanism at three different scales. One set of sparklines reports daily activity for students in courses; another rolls those up to weekly activity for courses at a school; still another rolls all those up to weekly activity for each school in the system.

At the level of individual courses, the per-student sparkline views can show patterns of interaction. In the left example here, vertical bands of activity reflect students annotating for particular assignments. In the right example there may be a trace of such temporal alignment but activity is less synchronized and more continuous.

When we’re working in Metabase we can use its handy mini bar charts to contextualize the row-wise sums.

The sparkline-like mini bar chart shows a row’s sum relative to the max for the column. Here we can see that a course with 3,758 notes has about 1/4 the number of notes as the most note-heavy course at the school.

Because these Unicode sparklines are just strings of text in columns of SQL or HTML tables, they can participate in sort operations. In our case we can sort on all columns including ones not shown here: instructor name, course start date, number of students. But the default is to sort by the sparkline column which, because it encodes time, orders courses by the start of annotation activity.
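Why does a plain string sort do that? Because the bar characters are consecutive codepoints, ordinary lexicographic comparison orders sparklines by their earliest differing week. With ascending sort, courses with more leading low bars (later-starting activity) come first. A small illustration with made-up sparklines:

```python
# Consecutive codepoints mean string comparison orders sparklines
# by the height of the earliest differing bar (i.e., by when
# activity ramps up). These three rows are illustrative only.
rows = ['▁▁▁▅▆▇', '▅▆▇▁▁▁', '▁▅▆▇▁▁']
print(sorted(rows))  # ['▁▁▁▅▆▇', '▁▅▆▇▁▁', '▅▆▇▁▁▁']
```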

The visual effect is admittedly crude, but it’s a good way to show certain kinds of variation. And it’s nicely portable. A Unicode sparkline looks the same in a psql console, an HTML table, or a tweet. The function will work in any database that can run it, using Python or another of the languages demoed at rosettacode.org. For example, I revisited the Workbench workflow described in A beautiful power tool to scrape, clean, and combine data and added a tab for Lake levels.

When I did that, though, the effect was even cruder than what I’ve been seeing in my own work.

In our scenarios, with longer strings of characters, the differences average out and things align pretty well; the below-the-baseline effect has been annoying but not a deal breaker. But the width variation in this example does feel like a deal breaker.

What if we omit the problematic characters U+2581 (too narrow) and U+2584/U+2588 (below baseline and too narrow)?

There are only 5 buckets into which to quantize numbers, and their heights aren’t evenly distributed. But for the intended purpose — to show patterns of variation — I think it’s sufficient in this case. I tried swapping the 5-bucket method into the function that creates sparklines for our dashboards but I don’t think I’ll switch. The loss of vertical resolution makes our longer sparklines less useful, and the width variation is barely noticeable.
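For reference, the 5-bucket variant is a small tweak to the function above: drop the three problematic characters and quantize into the five that remain. A sketch, assuming the same quantization logic as the 8-bucket version:

```python
# Five-bucket variant: omit U+2581 (too narrow) and U+2584/U+2588
# (below baseline), keeping the five remaining block characters.
bars5 = '\u2582\u2583\u2585\u2586\u2587'  # ▂▃▅▆▇

def sparkline5(numbers):
    _min, _max = min(numbers), max(numbers)
    extent = (_max - _min) or 1  # guard against an unchanging sequence
    return ''.join(
        bars5[min(len(bars5) - 1, int((n - _min) / extent * len(bars5)))]
        for n in numbers)

print(sparkline5([1, 2, 3, 4, 5]))  # ▂▃▅▆▇
```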

Unicode evolves, of course, so maybe there will someday be a sequence of characters that’s friendlier to sparklines. Maybe there already is? If so please let me know, I’d love to use it.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

A beautiful power tool to scrape, clean, and combine data

Labels like “data scientist” and “data journalist” connote an elite corps of professionals who can analyze data and use it to reason about the world. There are elite practitioners, of course, but since the advent of online data a quarter century ago I’ve hoped that every thinking citizen of the world (and of the web) could engage in similar analysis and reasoning.

That’s long been possible for those of us with the ability to wrangle APIs and transform data using SQL, Python, or another programming language. But even for us it hasn’t been easy. When I read news stories that relate to the generation of electric power in California, for example, questions occur to me that I know I could illuminate by finding, transforming, and charting sources of web data:

– How can I visualize the impact of shutting down California’s last nuclear plant?

– What’s the relationship between drought and hydro power?

All the ingredients are lying around in plain sight, but the effort required to combine them winds up being more trouble than it’s worth. And that’s for me, a skilled longtime scraper and transformer of web data. For you — even if you’re a scientist or a journalist! — that may not even be an option.

Enter Workbench, a web app with the tagline: “Scrape, clean, combine and analyze data without code.” I’ve worked with tools in the past that pointed the way toward that vision. DabbleDB in 2005 (now gone) and Freebase Gridworks in 2010 (still alive as Open Refine) were effective ways to cut through data friction. Workbench carries those ideas forward delightfully. It enables me to fly through the boring and difficult stuff — the scraping, cleaning, and combining — in order to focus on what matters: the analysis.

Here’s the report that I made to address the questions I posed above. It’s based on a workflow that you can visit and explore as I describe it here. (If you create your own account you can clone and modify.)

The workflow contains a set of tabs; each tab contains a sequence of steps; each step transforms a data set and displays output as a table or chart. When you load the page the first tab runs, and the result of its last step is displayed. In this case that’s the first chart shown in the report:

As in a Jupyter notebook you can run each step individually. Try clicking step 1. You’ll see a table of data from energy.ca.gov. Notice that step 1 is labeled Concatenate tabs. If you unfurl it you’ll see that it uses another tab, 2001-2020 scraped, which in turn concatenates two other tabs, 2001-2010 scraped and 2011-2020 scraped. Note that I’ve helpfully explained that in the optional comment field above step 1.

Each of the two source tabs scrapes a table from the page at energy.ca.gov. As I note in the report, it wasn’t necessary to scrape those tables since the data are available as an Excel file that can be downloaded, then uploaded to Workbench (as I’ve done in the tab named energy.ca.gov xlsx). I scraped them anyway because that web page presents a common challenge: the data appear in two separate HTML tables. That’s helpful to the reader but frustrating to an analyst who wants to use the data. Rapid and fluid combination of scraped tables is grease for cutting through data friction; Workbench supplies that grease.

Now click step 2 in the first tab. It’s the last step, so you’re back to the opening display of the chart. Unfurl it and you’ll see the subset of columns included in the chart. I’ve removed some minor sources, like oil and waste heat, in order to focus on major ones. Several details are notable here. First: colors. The system provides a default palette but you can adjust it. Black wasn’t on the default palette but I chose that for coal.

Second, grand total. The data set doesn’t include that column, and it’s not something I needed here. But in some situations I’d want it, so the system offers it as a choice. That’s an example of the attention to detail that pervades every aspect of Workbench.

Third, Vega. See the triple-dot button above the legend in the chart? Click it, then select Open in Vega Editor, and when you get there, click Run. Today I learned that Vega is:

a declarative format for creating, saving, and sharing visualization designs. With Vega, visualizations are described in JSON, and generate interactive views using either HTML5 Canvas or SVG.

Sweet! I think I’ll use it in my own work to simplify what I’ve recently (and painfully) learned how to do with D3.js. It’s also a nice example of how Workbench prioritizes openness, reusability, and reproducibility in every imaginable way.

I use the chart as the intro to my report, which is made with an elegant block editor in which you can combine tables and charts from any of your tabs with snippets of text written in markdown. There I begin to ask myself questions, adding tabs to marshal supporting evidence and sourcing evidence from tabs into the report.

My first question is about the contribution that the Diablo Canyon nuclear plant has been making to the overall mix. In the 2020 percentages all major sources tab I start in step 1 by reusing the tab 2001-2020 scraped. Step 2 filters the columns to just the same set of major sources shown in the chart. I could instead apply that step in 2001-2020 scraped and avoid the need to select columns for the chart. Since I’m not sure how that decision might affect downstream analysis I keep all the columns. If I change my mind it’s easy to push the column selection upstream.

Workbench not only makes it possible to refactor a workflow, it practically begs you to do that. When things go awry, as they inevitably will, it’s no problem. You can undo and redo the steps in each tab! You won’t see that in the read-only view but if you create your own account, and duplicate my workflow in it, give it a try. With stepwise undo/redo, exploratory analysis becomes a safe and stress-free activity.

At step 2 of 2020 percentages all major sources we have rows for all the years. In thinking about Diablo Canyon’s contribution I want to focus on a single reference year so in step 3 I apply a filter that selects just the 2020 row. Here’s the UX for that.

In situations like this, where you need to select one or more items from a list, Workbench does all the right things to minimize tedium: search if needed, start from all or none depending on which will be easier, then keep or remove selected items, again depending on which will be easier.

In step 4 I include an alternate way to select just the 2020 row. It’s a Select SQL step that says select * from input where Year = '2020'. That doesn’t change anything here; I could omit either step 3 or step 4 without affecting the outcome; I include step 4 just to show that SQL is available at any point to transform the output of a prior step.

Which is fantastic, but wait, there’s more. In step 5 I use a Python step to do the same thing in terms of a pandas dataframe. Again this doesn’t affect the outcome, I’m just showing that Python is available at any point to transform the output of a prior step. Providing equivalent methods for novices and experts, in a common interface, is extraordinarily powerful.

I’m noticing now, by the way, that step 5 doesn’t work if you’re not logged in. So I’ll show it to you here:
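The original showed that step as a screenshot. As a hedged reconstruction (a Workbench Python step, as I understand the product, defines a process function that receives and returns a pandas dataframe), it would look something like this, shown here with a toy dataframe rather than the real energy.ca.gov data:

```python
import pandas as pd

# Hypothetical reconstruction of the Workbench Python step: filter
# the input table down to the 2020 row, same as the SQL step's
# "select * from input where Year = '2020'".
def process(table):
    return table[table['Year'] == '2020']

# Standalone check with toy data (not the actual dataset):
df = pd.DataFrame({'Year': ['2019', '2020'], 'Value': [1.0, 2.0]})
print(process(df)['Year'].tolist())  # ['2020']
```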

Step 6 transposes the table so we can reason about the fuel types. In steps 3-5 they’re columns, in step 6 they become rows. This is a commonly-needed maneuver. And while I might use the SQL in step 4 to do the row selection handled by the widget in step 3, I won’t easily accomplish the transposition that way. The Transpose step is one of the most powerful tools in the kit.

Notice at step 6 that the first column is named Year. That’s a common outcome of transposition and here necessitates step 7 in which I rename it to Fuel Type. There are two ways to do that. You can click the + Add Step button, choose the Rename columns option, drag the new step into position 7, open it, and do the renaming there.

But look:

You can edit anything in a displayed table. When I change Year to Fuel Type that way, the same step 7 that you can create manually appears automatically.

It’s absolutely brilliant.

In step 8 I use the Calculate step to add a new column showing each row’s percentage of the column sum. In SQL I’d have to think about that a bit. Here, as is true for so many routine operations like this, Workbench offers the solution directly:
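In pandas terms, what that Calculate step computes is each row’s share of the column total; a minimal sketch with toy numbers (not the actual dataset):

```python
import pandas as pd

# Hedged sketch of the Calculate step: add a column holding each
# row's percentage of the column sum. Values are toy numbers.
df = pd.DataFrame({'Fuel Type': ['Nuclear', 'Solar', 'Wind'],
                   '2020 GWh': [10.0, 20.0, 70.0]})
df['percent'] = df['2020 GWh'] / df['2020 GWh'].sum() * 100
print(df['percent'].tolist())  # [10.0, 20.0, 70.0]
```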

Finally in step 9 I sort the table. The report includes it, and there I consider the question of Diablo Canyon’s contribution. According to my analysis nuclear power was 9% of the major sources I’ve selected, contributing 16,280 GWh in 2020. According to another energy.ca.gov page that I cite in the report, Diablo Canyon is the only remaining nuke plant in the state, producing “about 18,000 GWh.” That’s not an exact match but it’s close enough to give me confidence that reasoning about the nuclear row in the table applies to Diablo Canyon specifically.

Next I want to compare nuclear power to just the subset of sources that are renewable. That happens in the 2020 percentages renewable tab, the output of which is also included in the report. Step 1 begins with the output of 2020 percentages of all major sources. In step 2 I clarify that the 2020 column is really 2020 GWh. In step 3 I remove the percent column in order to recalculate it. In step 4 I remove rows in order to focus on just nuclear and renewables. In step 5 I recalculate the percentages. And in step 6 I make the chart that also flows through to the report.

Now, as I look at the chart, I notice that the line for large hydro is highly variable and appears to correlate with drought years. In order to explore that correlation I look for data on reservoir levels and arrive at https://cdec.water.ca.gov/. I’d love to find a table that aggregates levels for all reservoirs statewide since 2001, but that doesn’t seem to be on offer. So I decide to use Lake Mendocino as a proxy. In step 1 I scrape an HTML table with monthly levels for the lake since 2001. In step 2 I delete the first row which only has some months. In step 3 I rename the first column to Year in order to match what’s in the table I want to join with. In step 4 I convert the types of the month columns from text to numeric to enable calculation. In step 5 I calculate the average into a new column, Avg. In step 6 I select just Year and Avg.

When I first try the join in step 8 it fails for a common reason that Workbench helpfully explains:

In the other table Year looks like ‘2001’, but the text scraped from energy.ca.gov looks like ‘2,001’. That’s a common glitch that can bring an exercise like this to a screeching halt. There’s probably a Workbench way to do this, but in step 7 I use SQL to reformat the values in the Year column, removing the commas to enable the join. While there I also rename the Avg column to Lake Mendocino Avg Level. Now in step 8 I can do the join.
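The fix itself is a one-character transformation, done in the workflow with a SQL replace; the equivalent in plain Python, with illustrative values:

```python
# '2,001'-style values won't join against '2001', so strip the
# commas. (The workflow does this with SQL; same idea.)
years = ['2,001', '2,002', '2,020']
cleaned = [y.replace(',', '') for y in years]
print(cleaned)  # ['2001', '2002', '2020']
```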

In step 9 I scale the values for Large Hydro into a new column, Scaled Large Hydro. Why? The chart I want to see will compare power generation in GWh (gigawatt hours) and lake levels in AF (acre-feet). These aren’t remotely compatible but I don’t care, I’m just looking for comparable trends. Doubling the value for Large Hydro gets close enough for the comparison chart in step 10, which also flows through to the report.

All this adds up to an astonishingly broad, deep, and rich set of features. And I haven’t even talked about the Clean text step for tweaking whitespace, capitalization, and punctuation, or the Refine step for finding and merging clusters of similar values that refer to the same things. Workbench is also simply beautiful as you can see from the screen shots here, or by visiting my workflow. When I reviewed software products for BYTE and InfoWorld it was rare to encounter one that impressed me so thoroughly.

But wait, there’s more.

At the core of my workflow there’s a set of tabs; each is a sequence of steps; some of these produce tables and charts. Wrapped around the workflow there’s the report into which I cherrypick tables and charts for the story I’m telling there.

There’s also another kind of wrapper: a lesson wrapped around the workflow. I could write a lesson that guides you through the steps I’ve described and checks that each yields the expected result. See Intro to data journalism for an example you can try. Again, it’s brilliantly well done.

So Workbench succeeds in three major ways. If you’re not a data-oriented professional, but you’re a thinking person who wants to use available web data to reason about the world, Workbench will help you power through the grunt work of scraping, cleaning, and combining that data so you can focus on analysis. If you aspire to become such a professional, and you don’t have a clue about how to do that grunt work, it will help you learn the ropes. And if you are one of the pros you’ll still find it incredibly useful.

Kudos to the Workbench team, and especially to core developers Jonathan Stray, Pierre Conti, and Adam Hooper, for making this spectacularly great software tool.

Working with Postgres types

In episode 2 of this series I noted that the languages in which I’m writing Postgres functions share a common type system. It took me a while to understand how types work in the context of Postgres functions that can return sets of records and can interact with tables and materialized views.

Here is a set-returning function.

create function notes_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotation as $$
  begin
    return query 
      select * from annotation
      where userid = concat('acct:', _userid)
      and groupid = _groupid;
  end;
$$ language plpgsql;

In this case the type that governs the returned set has already been defined: it’s the schema for the annotation table.

Column Type
id uuid
created timestamp without time zone
updated timestamp without time zone
userid text
groupid text
text text
tags text[]
shared boolean
target_uri text
target_uri_normalized text
target_selectors jsonb
references uuid[]
extra jsonb
text_rendered text
document_id integer
deleted boolean

The function returns records matching a userid and groupid. I can now find the URLs of documents most recently annotated by me.

select
  target_uri
from notes_for_user_in_group('judell@hypothes.is', '__world__')
order by created desc
limit 3;

The Postgres response:

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 https://web.hypothes.is/help/formatting-annotations-with-markdown/
(3 rows)

You might wonder why the function’s parameters are prefixed with underscores. That’s because variables used in functions can conflict with names of columns in tables. Since none of our column names begin with underscore, it’s a handy differentiator. Suppose the function’s signature were instead:

create function notes_for_user_in_group(
  userid text, 
  groupid text)

Postgres would complain about a conflict:

ERROR:  column reference "userid" is ambiguous
LINE 2:       where userid = concat('acct:', userid)
                    ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

The table has userid and groupid columns that conflict with their eponymous variables. So for functions that combine variables and database values I prefix variable names with underscore.

Set-returning functions can be called in any SQL SELECT context. In the example above that context is psql, Postgres’ powerful and multi-talented REPL (read-eval-print loop). For an example of a different context, let’s cache the function’s result set in a materialized view.

create materialized view public_notes_for_judell as (
  select
    *
  from notes_for_user_in_group('judell@hypothes.is', '__world__')
  order by created desc
) with data;

Postgres reports success by showing the new view’s record count.

SELECT 3972

The view’s type is implicitly annotation; its schema matches the one shown above; selecting target_uri from the view is equivalent to selecting target_uri from the setof annotation returned from the function notes_for_user_in_group.

select
  target_uri
from public_notes_for_judell
limit 3;

The Postgres response is the same as above.

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 https://web.hypothes.is/help/formatting-annotations-with-markdown/
(3 rows)

It shows up a lot faster though! Every time you select the function’s result set, the wrapped query has to run. For this particular example that can take a few seconds. It costs the same amount of time to create the view. But once that’s done you can select its contents in milliseconds.

Now let’s define a function that refines notes_for_user_in_group by reporting the count of notes for each annotated document.

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns table (
  count bigint,
  userid text,
  groupid text,
  url text
) as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

Instead of returning a setof some named type, this function returns an anonymous table. I’ve aliased the set-returning function call notes_for_user_in_group as n and used the alias to qualify the names of selected columns. That avoids another naming conflict. If you write userid instead of n.userid in the body of the function and then call it, Postgres again complains about a conflict.

ERROR:  column reference "userid" is ambiguous
LINE 3:         userid,
                ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

Here’s a sample call to our new function.

select 
  *
from annotated_docs_for_user_in_group(
  'judell@hypothes.is',
  '__world__'
);
);

The result:

 count | userid             | groupid   | url
-------+--------------------+-----------+----------------------------------------------
   516 | judell@hypothes.is | __world__ | http://shakespeare.mit.edu/macbeth/full.html
    73 | judell@hypothes.is | __world__ | https://www.independent.co.uk/news/world/asia/india-floods-bangladesh-nepal-deaths-millions-homeless-latest-news-updates-a7919006.html
    51 | judell@hypothes.is | __world__ | https://www.usatoday.com/story/news/nation-now/2017/06/16/coconut-oil-isnt-healthy-its-never-been-healthy/402719001/

Now let’s create a view based on that function.

create materialized view url_counts_for_public_notes_by_judell as (
  select 
    *
  from annotated_docs_for_user_in_group(
    'judell@hypothes.is',
    '__world__'
  )
) with data;

Postgres says:

SELECT 1710

When you ask for the definition of that view using the \d command in psql:

\d url_counts_for_public_notes_by_judell

It responds with the same table definition used when creating the function.

 Column  |  Type  
---------+--------
 count   | bigint
 userid  | text
 groupid | text
 url     | text   

Behind the scenes Postgres has created this definition from the anonymous table returned by the function.

To revise the function so that it uses a named type, first create the type.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text
);

Postgres reports success:

CREATE TYPE

Now we can use that named type in the function. Since we’re redefining the function, first drop it.

drop function annotated_docs_for_user_in_group;

Uh oh. Postgres is unhappy about that.

ERROR:  cannot drop function annotated_docs_for_user_in_group(text,text) because other objects depend on it
DETAIL:  materialized view url_counts_for_public_notes_by_judell depends on function annotated_docs_for_user_in_group(text,text)
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

A view that depends on a function must be recreated when the function’s signature changes. I’ll say more about this in a future episode on set-returning functions that dynamically cache their results in materialized views. For now, since the view we just created is a contrived throwaway, just drop it along with the function by using CASCADE as Postgres recommends.

drop function annotated_docs_for_user_in_group cascade;

Postgres says:

NOTICE:  drop cascades to materialized view url_counts_for_public_notes_by_judell
DROP FUNCTION

Now we can recreate a version of the function that returns setof annotated_docs_for_user_in_group instead of an anonymous table(...)

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotated_docs_for_user_in_group as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

The results are the same as above. So why do it this way? In many cases I don’t. It’s extra overhead to declare a type. And just as a view can depend on a function, a function can depend on a type. To see why you might not want such dependencies, suppose we want to also track the most recent note for each URL.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text,
  most_recent_note timestamp
);

That won’t work.

ERROR:  type "annotated_docs_for_user_in_group" already exists

Dropping the type won’t work either.

ERROR:  cannot drop type annotated_docs_for_user_in_group because other objects depend on it                           
DETAIL:  function annotated_docs_for_user_in_group(text,text) depends on type annotated_docs_for_user_in_group
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

To redefine the type you have to do a cascading drop and then recreate functions that depend on the type. If any of those views depend on dropped functions, the drop cascades to them as well and they also must be recreated. That’s why I often write functions that return table(...) rather than setof TYPE. In dynamic languages it’s convenient to work with untyped bags of values; I find the same to be true when writing functions in Postgres.

Sometimes, though, it’s useful to declare and use types. In my experience so far it makes most sense to do that in Postgres when you find yourself writing the same returns table(...) statement in several related functions. Let’s say we want a function that combines the results of annotated_docs_for_user_in_group for some set of users.

create function annotated_docs_for_users_in_group(_userids text[], _groupid text)
  returns setof annotated_docs_for_user_in_group as $$
    begin
      return query 
        with userids as (
          select unnest(_userids) as userid
        )
        select 
          a.*
        from userids u
        join annotated_docs_for_user_in_group(u.userid, _groupid) a
          on a.userid = concat('acct:', u.userid);
    end;
$$ language plpgsql;

This new function uses the SQL WITH clause to create a common table expression (CTE) that converts an inbound array of userids into a transient table-like object, named userids, with one userid per row. The new function’s wrapped SQL then joins that CTE to the set returned from annotated_docs_for_user_in_group and returns the joined result.

(You can alternatively do this in a more procedural way by creating a loop variable and marching through the array to accumulate results. Early on I used that approach but in the context of Postgres functions I’ve come to prefer the more purely SQL-like set-oriented style.)

Sharing a common type between the two functions makes them simpler to write and easier to read. More importantly it connects them to one another and to all views derived from them. If I do decide to add most_recent_note to the type, Postgres will require me to adjust all depending functions and views so things remain consistent. That can be a crucial guarantee, and as we’ll see in a future episode it’s a key enabler of an advanced caching mechanism.


pl/pgsql versus pl/python? Here’s why I’m using both to write Postgres functions.

In A virtuous cycle for analytics I noted that our library of Postgres functions is written in two languages: Postgres’ built-in pl/pgsql and the installable alternative pl/python. These share a common type system and can be used interchangeably.

Here’s a pl/pgsql classifier that tries to match the name of a course against a list of patterns that characterize the humanities.

create function humanities_classifier(course_name text) returns boolean as $$
  begin
    return
      lower(course_name) ~ any(array[
        'psych',
        'religio',
        'soci'
        ]);
  end;
$$ language plpgsql;

# select humanities_classifier('Religious Studies 101') as match;
match
-----
t

# select humanities_classifier('Comparative Religions 200') as match;
match
-----
t

Here is that same classifier in Python.

create function humanities_classifier(course_name text) returns boolean as $$
  sql = f"""
    select lower('{course_name}') ~ any(array[
      'psych',
      'religio',
      'soci'
      ]) as match"""
  results = plpy.execute(sql)
  return results[0]['match']
$$ language plpython3u;

# select humanities_classifier('Religious Studies 101') as match;
match
-----
t

# select humanities_classifier('Comparative Religions 200') as match;
match
-----
t

The results are exactly the same. In this case, Python is only wrapping the SQL used in the original function and interpolating course_name into it. So why use pl/python here? I wouldn’t. The pl/pgsql version is cleaner and simpler because the SQL body doesn’t need to be quoted and course_name doesn’t need to be interpolated into it.

Here’s a more Pythonic version of the classifier.

create function humanities_classifier(course_name text) returns boolean as $$
  import re
  regexes = [
    'psych',
    'religio',
    'soci'
  ]
  matches = [r for r in regexes if re.search(r, course_name, re.I)]
  return len(matches)
$$ language plpython3u;

There’s no SQL here; this is pure Python. Is there any benefit to doing things this way? In this case probably not. The native Postgres idiom for matching a string against a list of regular expressions is cleaner and simpler than the Python technique shown here. A Python programmer will be more familiar with list comprehensions than with the Postgres any and ~ operators, but if you’re working in Postgres you’ll want to know about those, and use them not just in functions but in all SQL contexts.

What about performance? You might assume as I did that a pl/pgsql function is bound to be way faster than its pl/python equivalent. Let’s check that assumption. This SQL exercises both flavors of the function, which finds about 500 matches in a set of 30,000 names.

with matching_courses as (
  select humanities_classifier(name) as match
  from lms_course_groups
) 

select count(*) 
from matching_courses
where match;

Here are the results for three runs using each flavor of the function:

pl/pgsql:  159ms, 201ms, 125ms

pl/python: 290ms, 255ms, 300ms

The Python flavor is slower but not order-of-magnitude slower; I’ve seen cases where a pl/python function outperforms its pl/pgsql counterpart.

So, what is special about Python functions inside Postgres? In my experience so far there are three big reasons to use pl/python.

Python modules

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Here’s one of my favorite examples so far. Given a set of rows that count daily or weekly annotations for users in a group — so for weekly accounting each row has 52 columns — the desired result for the whole group is the element-wise sum of the rows. That’s not an easy thing in SQL but it’s trivial using numpy, and in pl/python it happens at database speed because there’s no need to transfer SQL results to an external Python program.
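Shrunk to three weeks instead of 52, and with made-up counts, the core of that numpy idiom looks like this; in a real pl/python function the rows would come from plpy.execute rather than a literal list:

```python
import numpy as np

# Each row counts one user's annotations per week.
# (3 weeks here for brevity; the real case has 52 columns.)
rows = [
    [1, 0, 2],
    [0, 3, 1],
    [2, 2, 2],
]

# The group's totals are the element-wise sum of the rows:
# a sum along axis 0.
totals = np.array(rows).sum(axis=0)
print(totals.tolist())  # [3, 5, 5]
```

The sum along axis 0 is the whole trick; no SQL gymnastics required.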

Metaprogramming

Functions can write and then run SQL queries. It’s overkill for simple variable interpolation; as shown above pl/pgsql does that handily without the cognitive overhead and visual clutter of poking values into a SQL string. For more advanced uses that compose queries from SQL fragments, though, pl/pgsql is hopeless. You can do that kind of thing far more easily, and more readably, in Python.
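As a hedged illustration of that fragment-composing style (the table and column names here are hypothetical, and real pl/python code should escape values with plpy.quote_literal rather than raw f-strings):

```python
def count_annotations_sql(group_id=None, since=None):
    """Compose a query from optional WHERE fragments.

    Table and column names are hypothetical; values should be
    properly quoted (e.g. plpy.quote_literal) in real code.
    """
    wheres = []
    if group_id is not None:
        wheres.append(f"group_id = '{group_id}'")
    if since is not None:
        wheres.append(f"created > '{since}'")
    sql = "select count(*) from annotation"
    if wheres:
        sql += " where " + " and ".join(wheres)
    return sql

print(count_annotations_sql(group_id='abc', since='2021-01-01'))
# select count(*) from annotation where group_id = 'abc' and created > '2021-01-01'
```

Each optional fragment appears only when its parameter is supplied; doing the same dance in pl/pgsql means nesting conditionals around concatenated strings.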

Introspection

A pl/python function can discover and use its own name. That’s the key enabler for a mechanism that memoizes the results of a function by creating a materialized view whose name combines the name of the function with the value of a parameter to the function. This technique has proven to be wildly effective.
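I’ll defer the full mechanism to a later installment, but the naming convention can be sketched in plain Python. One caveat: this assumes pl/python names the compiled code object '__plpython_procedure_<funcname>_<oid>', which is how a function can recover its own name; the function and parameter below are hypothetical.

```python
import re

def view_name_for(code_name, param):
    """Derive a materialized-view name from a pl/python code-object
    name plus a parameter value (names here are hypothetical)."""
    m = re.match(r'__plpython_procedure_(.+)_\d+$', code_name)
    base = m.group(1) if m else code_name
    return f"{base}_{param}"

print(view_name_for('__plpython_procedure_course_stats_54321', 'fall_2021'))
# course_stats_fall_2021
```

Given that derived name, the function can check for a materialized view called course_stats_fall_2021, create it if absent, and select from it otherwise.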

I’ll show examples of these scenarios in later installments of this series. For now I just want to explain why I’ve found these two ways of writing Postgres functions to be usefully complementary. The key points are:

– They share a common type system.

– pl/pgsql, despite its crusty old syntax, suffices for many things.

– pl/python leverages Python’s strengths where they are most strategic.

When I began this journey it wasn’t clear when you’d prefer one over the other, or why it might make sense to use both in complementary ways. This installment is what I’d like to have known when I started.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

A virtuous cycle for analytics

Suppose you’re a member of a team that runs a public web service. You need to help both internal and external users make sense of all the data that’s recorded as it runs. That’s been my role for the past few years; now it’s time to summarize what I’ve learned.

The web service featured in this case study is the Hypothesis web annotation system. The primary database, Postgres, stores information about users, groups, documents, courses, and annotations. Questions that our team needs to answer include:

– How many students created annotations last semester?

– In how many courses at each school?

Questions from instructors using Hypothesis in their courses include:

– Which passages in course readings are attracting highlights and discussion?

– Who is asking questions about those passages, and who is responding?

Early on we adopted a tool called Metabase that continues to be a pillar of our analytics system. When Metabase was hooked up to our Postgres database the team could start asking questions without leaning on developers. Some folks used the interactive query builder, while others went straight to writing SQL that Metabase passes through to Postgres.

Before long we had a large catalog of Metabase questions that query Postgres and display results as tables or charts that can be usefully arranged on Metabase dashboards. It’s all nicely RESTful. Interactive elements that can parameterize queries, like search boxes and date pickers, map to URLs. Queries can emit URLs in order to compose themselves with other queries. I came to see this system as a kind of lightweight application server in which to incubate an analytics capability that could later be expressed more richly.

Over time, and with growing amounts of data, early success with this approach gave way to two kinds of frustration: queries began to choke, and the catalog of Metabase questions became unmanageable. And so, in the time-honored tradition, we set up a data warehouse for analytics. Ours is another instance of Postgres that syncs nightly with the primary database. There are lots of other ways to skin the cat but it made sense to leverage ops experience with Postgres and I had a hunch that it would do well in this role.

To unthrottle the choking queries I began building materialized views that cache the results of Postgres queries. Suppose a query makes use of available indexes but still takes a few minutes, or maybe even an hour, to run. It still takes that long to build the corresponding materialized view, but once built other queries can use its results immediately. Metabase questions that formerly included chunks of SQL began reducing to select * from {viewname}.

This process continues to unfold in a highly productive way. Team members may or may not hit a performance wall as they try to use Metabase to answer their questions. When they do, we can convert the SQL text of a Metabase question to a Postgres materialized view that gets immediate results. Such views can join with others, and/or with underlying tables, in SQL SELECT contexts. The views become nouns in a language that expresses higher-order business concepts.

The verbs in this language turned out to be Postgres functions written in the native procedural language, pl/pgsql, and later also in its Python counterpart, pl/python. Either flavor can augment built-in Postgres library functions with user-defined functions that can return simple values, like numbers and strings, but can also return sets that behave in SQL SELECT contexts just like tables and views.

Functions were, at first, a way to reuse chunks of SQL that otherwise had to be duplicated across Metabase questions and Postgres CREATE MATERIALIZED VIEW statements. That made it possible to streamline and refactor both bodies of code and sanely maintain them.

To visualize what had now become a three-body system of sources in which Metabase questions, Postgres views, and Postgres functions can call (or link to) one another, I wrote a tool that builds a crosslinked concordance. That made it practical to reason effectively about the combined system.

Along the way I have learned how Postgres, and more broadly modern SQL, in conjunction with a tool like Metabase, can enable a team like ours to make sense of data. There’s plenty to say about the techniques I’ve evolved, and I aim to write them up over time. The details won’t interest most people, but here’s an outcome that might be noteworthy.

Team member: I had an idea that will help manage our communication with customers, and I’ve prototyped it in a Metabase question.

Toolsmith: Great! Here’s a Postgres function that encapsulates and refines your SQL. It’s fast enough for now, but if needed we can convert it into a materialized view. Now you can use that function in another Metabase question that projects your SQL across a set of customers that you can select.

That interaction forms the basis of a virtuous cycle: The team member formulates a question and does their best to answer it using Metabase; the toolsmith captures the intent and re-expresses it in a higher-level business language; the expanded language enables the team member to go farther in a next cycle.

We recognize this software pattern in the way application programmers who push a system to its limits induce systems programmers to respond with APIs that expand those limits. I suppose it’s harder to see when the application environment is Metabase and the systems environment is Postgres. But it’s the same pattern, and it is powerful.



Working With Intelligent Machines

In The Chess Master and the Computer, Garry Kasparov famously wrote:

The winner was revealed to be not a grandmaster with a state-of-the-art PC but a pair of amateur American chess players using three computers at the same time. Their skill at manipulating and “coaching” their computers to look very deeply into positions effectively counteracted the superior chess understanding of their grandmaster opponents and the greater computational power of other participants. Weak human + machine + better process was superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process.

The title of his subsequent TED talk sums it up nicely: Don’t fear intelligent machines. Work with them.

That advice resonates powerfully as I begin a second work week augmented by GitHub Copilot, a coding assistant based on OpenAI’s Generative Pre-trained Transformer (GPT-3). Here is Copilot’s tagline: “Your AI pair programmer: get suggestions for whole lines or entire functions right inside your editor.” If you’re not a programmer, a good analogy is Gmail’s offer of suggestions to finish sentences you’ve begun to type.

In mainstream news the dominant stories are about copyright (“Copilot doesn’t respect software licenses”), security (“Copilot leaks passwords”), and quality (“Copilot suggests wrong solutions”). Tech Twitter amplifies these and adds early hot takes about dramatic Copilot successes and flops. As I follow these stories, I’m thinking of another. GPT-3 is an intelligent machine. How can we apply Kasparov’s advice to work effectively with it?

Were I still a tech journalist I’d be among the first wave of hot takes. Now I spend most days in Visual Studio Code, the environment in which Copilot runs, working most recently on analytics software. I don’t need to produce hot takes; I can just leave Copilot running and reflect on notable outcomes.

Here was the first notable outcome. In the middle of writing some code I needed to call a library function that prints a date. In this case the language context was Python, but might as easily have been JavaScript or SQL or shell. Could I memorize the date-formatting functions for all these contexts? Actually, yes, I believe that’s doable and might even be beneficial. But that’s a topic for another day. Let’s stipulate that we can remember a lot more than we think we can. We’ll still need to look up many things, and doing a lookup is a context-switching operation that often disrupts flow.

In this example I would have needed a broad search to recall the name of the date-formatting function that’s available in Python: strftime. Then I’d have needed to search more narrowly to find the recipe for printing a date object in a format like Mon Jan 01. A good place for that search to land is https://strftime.org/, where Will McCutchen has helpfully summarized several dozen directives that govern the strftime function.

Here’s the statement I needed to write:

day = day.strftime('%a %b %d')

To prime Copilot I began with a comment:

# format day as Mon Jun 15

Copilot suggested the exact strftime incantation I needed.
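You can verify the lookup outside the editor; the date here is one I picked because it falls on a Monday:

```python
from datetime import date

day = date(2021, 6, 14)  # a Monday
print(day.strftime('%a %b %d'))  # Mon Jun 14
```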

Now it’s not hard to find a page like Will’s, and once you get there it’s not hard to pick out the needed directives. But when you’re in the flow of writing a function, avoiding that context switch doesn’t only save time. There is an even more profound benefit: it conserves attention and preserves flow.

A screencast embedded in the original post gives you a feel for the dynamic interaction. When I get as far as # format day as M, Copilot suggests MMDDYYY even before I write Mon, then adjusts as I do that. This tight feedback loop helps me explore the kinds of natural examples I can use to prime the system for the lookup.

For this particular pattern I’m not yet getting the same magical result in JavaScript, SQL, or shell contexts, but I expect that’ll change as Copilot watches me and others try the same example and arrive at the analogous solutions in these other languages.

I’m reminded of Language evolution with del.icio.us, from 2005, in which I explored the dynamics of the web’s original social bookmarking system. To associate a bookmarked resource with a shared concept you’d assign it a tag broadly used for that concept. Of course the tags we use for a given concept often vary. Your choice of cinema or movie or film was a way to influence the set of resources associated with your tag, and thus encourage others to use the same tag in the same way.

That kind of linguistic evolution hasn’t yet happened at large scale. I hope Copilot will become an environment in which it can. Intentional use of examples is one way to follow Kasparov’s advice for working well with intelligent systems.

Here’s a contrived Copilot session that suggests what I mean. The result I am looking for is the list [1, 2, 3, 4, 5].

l1 = [1, 2, 3]
l2 = [3, 4, 5]
# merge the two lists
l3 = l1 + l2 # no: [1, 2, 3, 3, 4, 5]
# combine as [1, 2, 3, 4, 5]
l3 = l1 + l2 # no: [1, 2, 3, 3, 4, 5]
# deduplicate the two lists
l1 = list(set(l1)) # no: [1, 2, 3]
# uniquely combine the lists
l3 = list(set(l1) | set(l2)) # yes: [1, 2, 3, 4, 5]
# merge and deduplicate the lists
l3 = list(set(l1 + l2)) # yes: [1, 2, 3, 4, 5]

The last two Copilot suggestions are correct; the final (and simplest) one would be my choice. If I contribute that choice to a public GitHub repository am I voting to reinforce an outcome that’s already popular? If I instead use the second comment (combine as [1, 2, 3, 4, 5]) am I instead voting for a show-by-example approach (like Mon Jun 15) that isn’t yet as popular in this case but might become so? It’s hard for me — and likely even for Copilot itself — to know exactly how Copilot works. That’s going to be part of the challenge of working well with intelligent machines. Still, I hope for (and mostly expect) a fruitful partnership in which our descriptions of intent will influence the mechanical synthesis even as it influences our descriptions.

Dear background mind: please think about Postgres and Clojure

I used a Lisp variant in my first programming job, so I have some appreciation for the “code as data” power of that language. Nowadays I’m building an analytics system that combines Postgres and two of its procedural languages, pl/pgsql and pl/python, with a sense-making tool called Metabase that’s written in a Lisp variant called Clojure.

In Postgres I’m able to wrap SQL queries in functions; they compose with other functions that do things like cache results in materialized views and aggregate subgroups. It all feels very dynamic and functional, which are two of Clojure’s main calling cards, so this makes me wonder about Clojure as another Postgres procedural language.

For the pl/python functions in my system, “code as data” looks like building SQL statements that swap variables into SQL templates and combine them. This string-building approach is an anti-pattern for Clojure folk. They don’t want to work with fragments of SQL text, they want to work with Clojure maps. Here’s an example from Honey SQL, the library that Metabase uses to build SQL texts from structured data.

(def sqlmap {:select [:a :b :c]
             :from   [:foo]
             :where  [:= :f.a "baz"]})

This seems like an important power to be able to wield in the hybrid programming environment that Postgres provides. I can imagine making very good use of it.

But I know very little yet about Clojure. Is this really an idea worth exploring? How else would it differ from what I’m now able to do in Python? To learn more about Clojure I’ve been watching talks by its creator, Rich Hickey. Most are tech-heavy but one of them isn’t like the others. In Hammock-driven development he lays out a process for creative problem solving that coordinates the waking mind (at the computer doing critical thinking and analysis) with the background mind (on the hammock doing abstraction, analogy, and synthesis). The coordination is explicit: You use the waking mind to feed work to the background mind which is “the solver of most non-trivial problems”; you weight your inputs to the background mind in order to influence their priority in its thinking.

I guess I’ve done that kind of thing implicitly from time to time, but never in such an intentional way. So, is Clojure-in-Postgres worth exploring? Perhaps by writing this I’ll prime my background mind and will receive more clarity in the morning.

Acknowledgement of uncertainty

In 2018 I built a tool to help researchers evaluate a proposed set of credibility signals intended to enable automated systems to rate the credibility of news stories.

Here are examples of such signals:

– Authors cite expert sources (positive)

– Title is clickbaity (negative)

And my favorite:

– Authors acknowledge uncertainty (positive)

Will the news ecosystem ever be able to label stories automatically based on automatic detection of such signals, and if so, should it? These are open questions. The best way to improve news literacy may be the SIFT method advocated by Mike Caulfield, which shifts attention away from intrinsic properties of individual news stories and advises readers to:

– Stop

– Investigate the source

– Find better coverage

– Trace claims, quotes, and media to original context

“The goal of SIFT,” writes Charlie Warzel in Don’t Go Down the Rabbit Hole, “isn’t to be the arbiter of truth but to instill a reflex that asks if something is worth one’s time and attention and to turn away if not.”

SIFT favors extrinsic signals over the intrinsic ones that were the focus of the W3C Credible Web Community Group. But intrinsic signals may yet play an important role, if not as part of a large-scale automated labeling effort then at least as another kind of news literacy reflex.

This morning, in How public health officials can convince those reluctant to get the COVID-19 vaccine, I read the following:

What made these Trump supporters shift their views on vaccines? Science — offered straight-up and with a dash of humility.

The unlikely change agent was Dr. Tom Frieden, who headed the Centers for Disease Control and Prevention during the Obama administration. Frieden appealed to facts, not his credentials. He noted that the theory behind the vaccine was backed by 20 years of research, that tens of thousands of people had participated in well-controlled clinical trials, and that the overwhelming share of doctors have opted for the shots.

He leavened those facts with an acknowledgment of uncertainty. He conceded that the vaccine’s potential long-term risks were unknown. He pointed out that the virus’s long-term effects were also uncertain.

“He’s just honest with us and telling us nothing is 100% here, people,” one participant noted.

Here’s evidence that acknowledgement of uncertainty really is a powerful signal of credibility. Maybe machines will be able to detect it and label it; maybe those labels will matter to people. Meanwhile, it’s something people can detect and do care about. Teaching students to value sources that acknowledge uncertainty, and discount ones that don’t, ought to be part of any strategy to improve news literacy.

The paradox of abundance

Several years ago I bought two 5-packs of reading glasses. There was a 1.75-diopter set for books, magazines, newspapers, and my laptop (when it’s in my lap), plus a 1.25-diopter set for the screens I look at when working in my Captain Kirk chair. They were cheap, and the idea was that they’d be an abundant resource. I could leave spectacles lying around in various places, there would always be a pair handy, no worries about losing them.

So of course I did lose them like crazy. At one point I bought another 5-pack but still, somehow, I’m down to a single 1.75 and a single 1.25. And I just realized it’s been that way for quite a while. Now that the resource is scarce, I value it more highly and take care to preserve it.

I’m sorely tempted to restock. It’s so easy! A couple of clicks and two more 5-packs will be here tomorrow. And they’re cheap, so what’s not to like?

For now, I’m resisting the temptation because I don’t like the effect such radical abundance has had on me. It’s ridiculous to lose 13 pairs of glasses in a couple of years. I can’t imagine how I’d explain that to my pre-Amazon self.

For now, I’m going to try to assign greater value to the glasses I do have, and treat them accordingly. And when I finally do lose them, I hope I’ll resist the one-click solution. I thought it was brilliant at the time, and part of me still does. But it just doesn’t feel good.

Original memories

Were it not for the Wayback Machine, a lot of my post-1995 writing would now be gone. Since the advent of online-only publications, getting published has been a lousy way to stay published. When pubs change hands, or die, the works of their writers tend to evaporate.

I’m not a great self-archivist, despite having better-than-average skills for the job. Many but not all of my professional archives are preserved — for now! — on my website. Occasionally, when I reach for a long-forgotten and newly-relevant item, only to find it 404, I’ll dig around and try to resurrect it. The forensic effort can be a big challenge; an even bigger one is avoiding self-blame.

The same thing happens with personal archives. When our family lived in New Delhi in the early 1960s, my dad captured thousands of images. Those color slides, curated in carousels and projected onto our living room wall in the years following, solidified the memories of what my five-year-old self had directly experienced. When we moved my parents to the facility where they spent their last years, one big box of those slides went missing. I try, not always successfully, to avoid blaming myself for that loss.

When our kids were little we didn’t own a videocassette recorder, which was how you captured home movies in that era. Instead we’d rent a VCR from Blockbuster every 6 months or so and spend the weekend filming. It turned out to be a great strategy. We’d set it on a table or on the floor, turn it on, and just let it run. The kids would forget it was there, and we recorded hours of precious daily life in episodic installments.

Five years ago our son-in-law volunteered the services of a friend of his to digitize those tapes, and brought us the MP4s on a thumb drive. I put copies in various “safe” places. Then we moved a couple of times, and when I reached for the digitized videos, they were gone. As were the original cassettes. This time around, there was no avoiding the self-blame. I beat myself up about it, and was so mortified that I hesitated to ask our daughter and son-in-law if they have safe copies. (Spoiler alert: they do.) Instead I’d periodically dig around in various hard drives, clouds, and boxes, looking for files or thumb drives that had to be there somewhere.

During this period of self-flagellation, I thought constantly about something I heard Roger Angell say about Carlton Fisk. Roger Angell was one of the greatest baseball writers, and Carlton Fisk one of the greatest players. One day I happened to walk into a bookstore in Harvard Square when Angell was giving a talk. In the Q and A, somebody asked: “What’s the most surprising thing you’ve ever heard a player say?”

The player was Carlton Fisk, and the surprise was his answer to the question: “How many times have you seen the video clip of your most famous moment?”

That moment is one of the most-watched sports clips ever: Fisk’s walk-off home run in game 6 of the 1975 World Series. He belts the ball deep to left field, it veers toward foul territory, he dances and waves it fair.

So, how often did Fisk watch that clip? Never.

Why not? He didn’t want to overwrite the original memory.

Of course we are always revising our memories. Photographic evidence arguably prevents us from doing so. Is that good or bad? I honestly don’t know. Maybe both.

For a while, when I thought those home videos were gone for good, I tried to convince myself that it was OK. The original memories live in my mind, I hold them in my heart, nothing can take them away, no recording can improve them.

Although that sort of worked, I was massively relieved when I finally fessed up to my negligence and learned that there are safe copies. For now, I haven’t requested them and don’t need to see them. It’s enough to know that they exist.

The Modesto Pile

Reading a collection of John McPhee stories, I found several that were new to me. The Duty of Care, published in the New Yorker in 1993, is about tires, and how we do or don’t properly recycle them. One form of reuse we’ve mostly abandoned is retreads. McPhee writes:

A retread is in no way inferior to a new tire, but new tires are affordable, and the retreaded passenger tire has descended to the status of a clip-on tie.

My dad wore clip-on ties. He also used retreaded tires, and I can remember visiting a shop on several occasions to have the procedure done.

Recently I asked a friend: “Whatever happened to retreaded tires?” We weren’t sure, but figured they’d gone away for good reasons: safety, reliability. But maybe not. TireRecappers and TreadWright don’t buy those arguments. Maybe retreads were always a viable option for our passenger fleet, as they still are for our truck fleet. And maybe, with better tech, they’re better than they used to be.

In Duty of Care, McPhee tells the story of the Modesto pile. It was, at the time, the world’s largest pile of scrap tires containing, by his estimate, 34 million tires.

You don’t have to stare long at that pile before the thought occurs to you that those tires were once driven upon by the Friends of the Earth. They are Environmental Defense Fund tires, Rainforest Action Network tires, Wilderness Society tires. They are California Natural Resources Federation tires, Save San Francisco Bay Association tires, Citizens for a Better Environment tires. They are Greenpeace tires, Sierra Club tires, Earth Island Institute tires. They are Earth First! tires!

(I love a good John McPhee list.)

The world’s largest pile of tires left a surprisingly small online footprint, but you can find the LA Times’ Massive Pile of Tires Fuels Controversial Energy Plan which describes the power plant — 41 million dollars, 14 megawatts, “the first of its kind in the United States and the largest in the world” — that McPhee visited when researching his story. I found it on Google Maps by following McPhee’s directions.

If you were to abandon your car three miles from the San Joaquin County line and make your way on foot southwest one mile…

You can see the power plant. There’s no evidence of tires, or trucks moving them, so maybe the plant, having consumed the pile, is retired. Fortunately it never caught fire; that would’ve made a hell of a mess.

According to Wikipedia, we’ve reduced our inventory of stockpiled tires by an order of magnitude from a peak of a billion around the time McPhee wrote that article. We burn most of them for energy, and turn some into ground rubber for such uses as paving and flooring. So that’s progress. But I can’t help but wonder about the tire equivalent of Amory Lovins’ negawatt: “A watt of energy that you have not used through energy conservation or the use of energy-efficient products.”

Could retreaded passenger tires be an important source of negawatts? Do we reject the idea just because they’re as unfashionable as clip-on ties? I’m no expert on the subject, obviously, but I suspect these things might be true.

A wider view

For 20 bucks or less, nowadays, you can buy an extra-wide convex mirror that clips onto your car’s existing rear-view mirror. We just tried one for the first time, and I’m pretty sure it’s a keeper. These gadgets claim to eliminate blind spots, and this one absolutely does. Driving down 101, I counted three seconds as a car passed through my driver’s-side blind spot. That’s a long time when you’re going 70 miles per hour; during that whole time I could see that passing car in the extended mirror.

Precious few gadgets spark joy for me. This one had me at hello. Not having to turn your head, avoiding the risk of not turning your head — these are huge benefits, quite possibly life-savers. For 20 bucks!

It got even better. As darkness fell, we wondered how it would handle approaching headlights. It’s not adjustable like the stock mirror, but that turns out not to be a problem. The mirror dims those headlights so they’re easy to look at. The same lights in the side mirrors are blinding by comparison.

I’ve been driving more than 40 years. This expanded view could have been made available at any point along the way. There’s nothing electronic or digital. It’s just a better idea that combines existing ingredients in a new way. That pretty much sums up my own approach to product development.

Finally, there’s the metaphor. Seeing around corners is a superpower I’ve always wanted. I used to love taking photos with the fisheye lens on my dad’s 35mm Exacta, now I love making panoramic views with my phone. I hate being blindsided, on the road and in life, by things I can’t see coming. I hate narrow-mindedness, and always reach for a wider view.

I’ll never overcome all my blind spots but it’s nice to chip away at them. After today, there will be several fewer to contend with.


Image: A wider view at sunset (geograph.org.uk, via Wikimedia Commons)

The New Freshman Comp

The column republished below, originally at http://www.oreillynet.com/pub/a/network/2005/04/22/primetime.html, was the vector that connected me to my dear friend Gardner Campbell. I’m resurrecting it here partly just to bring it back online, but mainly to celebrate the ways in which Gardner — a film scholar among many other things — is, right now, bringing his film expertise to the practice of online teaching.

In this post he reflects:

Most of the learning spaces I’ve been in provide very poorly, if at all, for the supposed magic of being co-located. A state-mandated prison-spec windowless classroom has less character than a well-lighted Zoom conference. A lectern with a touch-pad control for a projector-and-screen combo is much less flexible and, I’d argue, conveys much less human connection and warmth than I can when I share a screen on Zoom during a synchronous class, or see my students there, not in front of a white sheet of reflective material, but in the medium with me, lighting up the chat, sharing links, sharing the simple camaraderie of a hearty “good morning” as class begins.

And in this one he shares the course trailer (!) for Fiction into film: a study of adaptations of “Little Women”.

My 2005 column was a riff on a New York Times article, Is Cinema Studies the new MBA? It was perhaps a stretch, in 2005, to argue for cinema studies as an integral part of the new freshman comp. The argument makes a lot more sense now.


The New Freshman Comp

For many years I have alternately worn two professional hats: writer and programmer. Lately I find myself wearing a third hat: filmmaker. When I began making the films that I now call screencasts, my readers and I both sensed that this medium was different enough to justify the new name that we collaboratively gave it. Here’s how I define the difference. Film is a genre of storytelling that addresses the whole spectrum of human experience. Screencasting is a subgenre of film that can tell stories about the limited — but rapidly growing — slice of our lives that is mediated by software.

Telling stories about software in this audiovisual way is something I believe technical people will increasingly want to do. To explain why, let’s first discuss a more ancient storytelling mode: writing.

The typical reader of this column is probably, like me, a writer of both prose and code. Odds are you identify yourself as a coder more than as a writer. But you may also recently have begun blogging, in which case you’ve seen your writing muscles grow stronger with exercise.

Effective writing and effective coding are more closely related than you might think. Once upon a time I spent a year as a graduate student and teaching assistant at an Ivy League university. My program of study was science writing, but that was a tiny subspecialty within a larger MFA (master of fine arts) program dedicated to creative writing. That’s what they asked me to teach, and the notion terrified me. I had no idea what I’d say to a roomful of aspiring poets and novelists. As it turned out, though, many of these kids were in fact aspiring doctors, scientists, and engineers who needed humanities credits. So I decided to teach basic expository writing. The university’s view was that these kids had done enough of that in high school. Mine was that they hadn’t, not by a long shot.

I began by challenging their reverence for published work. Passages from books and newspapers became object lessons in editing, a task few of my students had ever been asked to perform in a serious way. They were surprised by the notion that you could improve material that had been professionally written and edited, then sold in bookstores or on newsstands. Who were they to mess with the work of the pros?

I, in turn, was surprised to find this reverent attitude even among the budding software engineers. They took it for granted that programs were imperfect texts, always subject to improvement. But they didn’t see prose in the same way. They didn’t equate refactoring a program with editing a piece of writing, as I did then and still do.

When I taught this class more than twenty years ago the term “refactoring” wasn’t commonly applied to software. Yet that’s precisely how I think about the iterative refinement of prose and of code. In both realms, we adjust vocabulary to achieve consistency of tone, and we transform structure to achieve economy of expression.

I encouraged my students to regard writing and editing as activities governed by engineering principles not unlike the ones that govern coding and refactoring. Yes, writing is a creative act. So is coding. But in both cases the creative impulse is expressed in orderly, calculated, even mechanical ways. This seemed to be a useful analogy. For technically-inclined students earning required humanities credits, it made the subject seem more relevant and at the same time more approachable.

In the pre-Internet era, none of us foresaw the explosive growth of the Internet as a textual medium. If you’d asked me then why a programmer ought to be able to write effectively, I’d have pointed mainly to specs and manuals. I didn’t see that software development was already becoming a global collaboration, that email and newsgroups were its lifeblood, and that the ability to articulate and persuade in the medium of text could be as crucial as the ability to design and build in the medium of code.

Nowadays, of course, software developers have embraced new tools of articulation and persuasion: blogs, wikis. I’m often amazed not only by the amount of writing that goes on in these forms, but also by its quality. Writing muscles do strengthen with exercise, and the game of collaborative software development gives them a great workout.

Not everyone drinks equally from this fountain of prose, though. Developers tend to write a great deal for other developers, but much less for those who use their software. Laziness is a factor; hubris even more so. We like to imagine that our software speaks for itself. And in some ways that’s true. Documentation is often only a crutch. If you have to explain how to use your software, you’ve failed.

It may, however, be obvious how to use a piece of software, and yet not at all obvious why to use it. I’ll give you two examples: Wikipedia and del.icio.us. Anyone who approaches either of these applications will immediately grasp their basic modes of use. That’s the easy part. The hard part is understanding what they’re about, and why they matter.

A social application works within an environment that it simultaneously helps to create. If you understand that environment, the application makes sense. Otherwise it can seem weird and pointless.

Paul Kedrosky, an investor, academic, and columnist, alluded to this problem on his blog last month:

Funny conversation I had with someone yesterday: We agreed that the thing that generally made us both persevere and keep trying any new service online, even if we didn’t get it the first umpteen times, was having Jon Udell post that said service was useful. After all, if Jon liked it then it had to be that we just hadn’t tried hard enough. [Infectious Greed]

I immodestly quote Paul’s remarks in order to revise and extend them. I agree that the rate-limiting factor for software adoption is increasingly not purchase, or installation, or training, but simply “getting it.” And while I may have a good track record for “getting it,” plenty of other people do too — the creators of new applications, obviously, as well as the early adopters. What’s unusual about me is the degree to which I am trained, inclined, and paid to communicate in ways that help others to “get it.”

We haven’t always seen the role of the writer and the role of the developer as deeply connected but, as the context for understanding software shifts from computers and networks to people and groups, I think we’ll find that they are. When an important application’s purpose is unclear on the first umpteen approaches, and when “getting it” requires hard work, you can’t fix the problem with a user-interface overhaul or a better manual. There needs to be an ongoing conversation about what the code does and, just as importantly, why. Professional communicators (like me) can help move things along, but everyone needs to participate, and everyone needs to be able to communicate effectively.

If you’re a developer struggling to evangelize an idea, I’d start by reiterating that your coding instincts can also help you become a better writer. Until recently, that’s where I’d have ended this essay too. But recent events have shown me that writing alone, powerful though it can be, won’t necessarily suffice.

I’ve written often — and, I like to think, cogently — about wikis and tagging. But my screencasts about Wikipedia and del.icio.us have had a profoundly greater impact than anything I’ve written on these topics. People “get it” when they watch these movies in ways that they otherwise don’t.

It’s undoubtedly true that an audiovisual narrative enters many 21st-century minds more easily, and makes a more lasting impression on those minds, than does a written narrative. But it’s also true that the interactive experience of software is fundamentally cinematic in nature. Because an application plays out as a sequence of frames on a timeline, a narrated screencast may be the best possible way to represent it and analyze it.

If you buy either or both of these explanations, what then? Would I really suggest that techies will become fluent storytellers not only in the medium of the written essay, but also in the medium of the narrated screencast? Actually, yes, I would, and I’m starting to find people who want to take on the challenge.

A few months ago I heard from Michael Tiller, who describes himself as a “mechanical engineer trapped in a computer scientist’s body.” Michael has had a long and passionate interest in Modelica, an open, object-oriented language for modeling mechanical, electrical, electronic, hydraulic, thermal, and control systems. He wanted to work with me to develop a screencast on this topic. But it’s far from my domains of expertise and, in the end, all he really needed was my encouragement. This week, Michael launched a website called Dynopsis.com that’s chartered to explore the intersection of engineering and information technologies. Featured prominently on the site is this 20-minute screencast in which he illustrates the use of Modelica in the context of the Dymola IDE.

This screencast was made with Windows Media Encoder 9, and without the help of any editing. After a couple of takes, Michael came up with a great overview of the Modelica language, the Dymola tool, and the modeling and simulation techniques that they embody. Since he is also author of a book on this subject, I asked Michael to reflect on these different narrative modes, and here’s how he responded on his blog:

If I were interested in teaching someone just the textual aspects of the Modelica language, this is exactly the approach I would take.

But when trying to teach or explain a medium that is visual, other tools can be much more effective. Screencasts are one technology that could really make an impact on the way some subjects are taught and I can see how these ideas could be extended much further. [Dynopsis: Learning by example: screencasts]

We’re just scratching the surface of this medium. Its educational power is immediately obvious, and over time its persuasive power will come into focus too. The New York Times recently asked: “Is cinema studies the new MBA?” I’ll go further and suggest that these methods ought to be part of the new freshman comp. Writing and editing will remain the foundation skills they always were, but we’ll increasingly combine them with speech and video. The tools and techniques are new to many of us. But the underlying principles — consistency of tone, clarity of structure, economy of expression, iterative refinement — will be familiar to programmers and writers alike.

The 3D splendor of the Sonoma County landscape

We’ve been here 6 years, and the magical Sonoma County landscape just keeps growing on me. I’ve written about the spectacular coastline, a national treasure just twenty miles to our west that’s pristine thanks to the efforts of a handful of environmental activists, notably Bill Kortum. Even closer, just ten miles to our east, lies the also spectacular Mayacamas range where today I again hiked Bald Mountain in Sugarloaf Ridge State Park.

If you ever visit this region and fancy a hike with great views, this is the one. (Ping me and I’ll go with you at the drop of a hat.) It’s not too challenging. The nearby Goodspeed Trail, leading up to Gunsight Notch on Hood Mountain, is more demanding, and in the end, less rewarding. Don’t get me wrong, the view from Gunsight — looking west over the Sonoma Valley and the Santa Rosa plain — is delightful. But the view from the top of Bald Mountain is something else again. On a clear day (which is most of them) you can spin a slow 360 and take in the Napa Valley towns of St. Helena and Calistoga to the east, Cobb and St. Helena mountains to the north, the Sonoma Valley and Santa Rosa plain to the west, then turn south to see Petaluma, Mt. Tamalpais, the top of the Golden Gate bridge, San Francisco, the San Pablo Bay, the tops of the tallest buildings in Oakland, and Mt. Diablo 51 miles away. Finally, to complete the loop, turn east again to look at the Sierra Nevada range 130 miles away.

The rugged topography you can see in that video occurs fractally everywhere around here. It’s taken a while to sink in, but I think I can finally explain what’s so fascinating about this landscape. It’s the sightlines. From almost anywhere, you’re looking at a 3D puzzle. I’m a relative newcomer, but I hike with friends who’ve lived their whole lives here, and from any given place, they are as likely as I am to struggle to identify some remote landmark. Everything looks different from everywhere. You’re always seeing multiple overlapping planes receding into the distance, like dioramas. And they change dramatically as you move around even slightly. Even just ten paces in any direction, or a slight change in elevation, can alter the sightlines completely and reveal or hide a distant landmark.

We’ve lived in flat places, we’ve lived in hilly places, but never until now in such a profoundly three-dimensional landscape. It is a blessing I will never take for granted.

My print edition superpower

When I walk around our neighborhood I see very few copies of our local newspaper on sidewalks, porches, and front lawns. We subscribe because it’s a lifelong habit, and because we like to support local journalism, and in my case because it’s a welcome reprieve from a long day of screentime.

This archaic habit has also, weirdly, become a kind of superpower. From time to time I meet people who are surprised when I know facts about them. These facts are public information, so why should it be surprising that I know them? Because they appear in the print edition of the newspaper. Consider the man in this photo.

We were out for a Sunday morning walk, carrying the newspaper, when we saw him in a cafe. We had just read an article about Whiskerino, the annual contest of fancy beards and mustaches in Petaluma. And here was the guy whose photo was featured in the article!

That was a nice coincidence, but the kicker is that he had no idea his picture was in the paper. Not being of newspaper-reading age, and not having friends of newspaper-reading age, he had to depend on elders to alert him. I guess we got to him before his parents did.

When a similar thing happened this weekend, it occurred to me that this could be a good marketing strategy for newspapers. Do you want to wield an information superpower? Do you want to amaze people by knowing facts that they can’t imagine you could possibly know? Subscribe to the print edition of your local newspaper!

Continental drift

In a 1999 interview David Bowie said that “the potential for what the Internet is going to do to society, both good and bad, is unimaginable.” I had no problem imagining the good, it was the bad where my imagination failed. The web of the late 90s was a cornucopia of wonder, delight, and inspiration. So was the blogosphere of the early 2000s. I know a lot of us are nostalgic for those eras, and depressed about how things have turned out. The bad is really quite bad, and sometimes I feel like there’s no way forward.

And then something wonderful happens. This time the spark was David Grinspoon aka @DrFunkySpoon. I’ve written before about a Long Now talk in which he posits that we might not just be living through the beginning of a geological epoch called the Anthropocene but rather — and far more profoundly — the dawn of an eon that he calls the Sapiezoic. Today he posted a stunning new visualization of plate tectonics.

As always when I think about plate tectonics, I’m reminded of the high school science teacher who introduced me to the topic. His name is John Ousey, and this happened almost 50 years ago. What always stuck with me is the way he presented it. Back then, plate tectonics was a new idea. As Mr. Ousey (he later became Dr. Ousey) described the continents sliding apart, I can still see the bemused look on his face. He was clearly wrestling with the concept, unsure whether to believe things really happen that way. That healthy skepticism, coupled with trust in the scientific process, made an indelible impression on me.

One of the wonders of the Internet is the ability to find people. It took some sleuthing, but I did find him and this ensued.

That’s the kind of magic that can still happen, that does happen all the time.

I learned for the first time that John Ousey’s introduction to plate tectonics came by way of “weekend courses taught by the great teacher/researcher Ehrling Dorf of Princeton” who was himself perhaps “not totally convinced.” Despite uncertainty, which he acknowledged, John Ousey was happy to share an important new idea with his Earth science class.

What a privilege to be able to thank him, after all these years, for being a great teacher who helped me form a scientific sensibility that has never mattered more than now. And to share a moment of appreciation for an extraordinary new visualization of the process once known as continental drift. Yes, there’s a dark side to our connected world, darker than I was once willing to imagine. But there is also so much light. It’s always helpful to consider deep geological time. That video shows a billion years of planetary churn. We’ve only been connected like this for 25 years. Maybe we’ll figure it out. For today, at least, I choose to believe that we will.

How and why to tell your story online, revisited

I wrote this essay in 2006 as part of a series of Internet explainers I did for New Hampshire Public Radio. It never aired for reasons lost to history, so I’m publishing this 15-year-old time capsule here for the first time. My motive is of course not purely archival. I’m also reminding myself why I should still practice now what I preached then.

How and why to tell your story online

Teens and twenty-somethings are flocking to social websites like MySpace and Facebook, where they post photos, music, and personal diaries. Parents, teachers, and cops wish they wouldn’t. It’s a culture war between generations, and right now everybody’s losing.

Kids: Listen up. Did you hear the story about the college student who didn’t get hired because of his Facebook page? Or the teenage girl whose MySpace blog told an attacker when she’d be home alone? These things happen very rarely, but they can happen. Realize that the words and pictures you publish online will follow you around for the rest of your lives. Realize that wrong choices can have embarrassing or even tragic consequences.

Now, grownups, it’s your turn to listen up. You’re right to worry about the kids. But there’s another side to the story. The new forms of Internet self-publishing — including social networks, blogs, podcasting, and video sharing — can be much more than narcissistic games. Properly understood and applied, they’re power tools for claiming identity, exerting influence, and managing reputation. Sadly, very few adults are learning those skills, and fewer still are teaching them.

It’s not enough to condemn bad online behavior. We’ve got to model good online behavior too — in schools, on the job, and in civic life. But we’re stuck in a Catch-22 situation. Kids, who intuit the benefits of the new social media, fail to appreciate the risks. Grownups, meanwhile, see only risks and no benefits.

There’s a middle ground here, and we need to approach it from both sides of the generation gap. The new reality is that, from now on, our lives will be documented online — perhaps by us, perhaps by others, perhaps by both. We may or may not influence what others will say about us. But we can surely control our own narratives, and shape them in ways that advance our personal, educational, professional, and civic agendas.

Your online identity is a lifelong asset. If you invest in it foolishly you’ll regret that. But failing to invest at all is equally foolish. The best strategy, as always, is to invest wisely.

Here’s a simple test to guide your strategy. Imagine someone searching Google for your name. That person might be a college admissions officer, a prospective employer, a new client, an old friend, or even a complete stranger. The reason for the search might be to evaluate your knowledge, interests, agenda, accomplishments, credentials, activities, or reputation.

What do you want that person to find? That’s what you should publish online.

To find three examples of what I mean, try searching the web for the following three names: Todd Suomela, Martha Burtis, Thomas Mahon. In each case, the first Google result points to a personal blog that narrates a professional life.

Todd Suomela is a graduate student at the University of Michigan. On his blog, Todd writes about what he’s learning, and about how his interests and goals are evolving. He hasn’t launched his professional career yet. But when he does, his habit of sharing the information resources he collects, and reflecting thoughtfully on his educational experience, will serve him well.

Martha Burtis is an instructional technologist at the University of Mary Washington. She and her team research and deploy the technologies that students, faculty, and staff use to learn, teach, and collaborate. On her blog, Martha writes about the tools and techniques she and her team are developing, assesses how her local academic community is making use of them, and thinks broadly about the future of education.

Thomas Mahon is a Savile Row tailor. His shop in London caters to people who can spend two thousand pounds on a classic handmade suit. I’ll never be in the market for one of those, but if I were I’d be fascinated by Mahon’s blog, EnglishCut.com, which tells you everything you might want to know about Savile Row past and present, about how Mahon practices the craft of bespoke tailoring, and about how to buy and care for the garments he makes.

For Todd and Martha and Thomas, the benefits of claiming their Net identities in these ways run wide and deep. Over time, their online narratives become autobiographies read by friends, colleagues, or clients, and just as importantly, read by people who may one day become friends, colleagues, or clients.

In most cases, of course, the words, pictures, audio, and video you might choose to publish online won’t attract many readers, listeners, or viewers. That’s OK. The point is that the people they do attract will be exactly the right people: those who share your interests and goals.

We’ve always used the term ‘social networking’ to refer to the process of finding and connecting with those people. And that process has always depended on a fabric of trust woven most easily in the context of local communities and face-to-face interaction.

But our interests and goals aren’t merely local. We face global challenges that compel us to collaborate on a global scale. Luckily, the new modes of social networking can reach across the Internet to include people anywhere and everywhere. But if we’re going to trust people across the Internet, we’ll need to be able to check their references. Self-published narrative is one crucial form of evidence. The public reaction to such narratives, readily discoverable thanks to search engines and citation indexes, is another.

Is this a strange new activity? From one perspective it is, and that’s why I can’t yet point to many other folks who’ve figured out appropriate and effective ways to be online, as Todd and Martha and Thomas have.

But from another perspective, Internet self-publishing is just a new way to do what we’ve been doing for tens of thousands of years: telling stories to explain ourselves to one another, and to make sense of our world.

The Image of Postgres

At the 2015 Postgres conference, the great IT philosopher Robert r0ml Lefkowitz delivered a talk entitled The Image of Postgres. Here’s the blurb.

How do you think about what a database is? If you think of a database as only a place to store your data, then perhaps it does not really matter what the internals of that database are; all you really need is a home for your data to be managed, nothing more.

If you think of a database as a place where you develop applications, then your expectations of your database software change. No longer do you only need data management capabilities, but you require processing functions, the ability to load in additional libraries, interface with other databases, and perhaps even additional language support.

If your database is just for storage, there are plenty of options. If your database is your development framework, you need Postgres.

Why? Well, let’s get philosophical.

For over a year, I’ve been using Postgres as a development framework. In addition to the core Postgres server that stores all the Hypothesis user, group, and annotation data, there’s now also a separate Postgres server that provides an interpretive layer on top of the raw data. It synthesizes and caches product- and business-relevant views, using a combination of PL/pgSQL and PL/Python. Data and business logic share a common environment. Although I didn’t make the connection until I watched r0ml’s talk, this setup harkens back to the 1980s when Smalltalk (and Lisp, and APL) were programming environments with built-in persistence. The “image” in r0ml’s title refers to the Smalltalk image, i.e. the contents of the Smalltalk virtual machine. It may also connote reputation, in the sense that our image of Postgres isn’t that of a Smalltalk-like environment, though r0ml thinks it should be, and my experience so far leads me to agree.
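The pattern I’m describing — raw tables alongside a derived, cached layer that lives in the database with the data — can be sketched in miniature. The real system uses Postgres with PL/pgSQL and PL/Python, and the table and column names below are invented for illustration; Python’s stdlib sqlite3 just keeps the sketch self-contained:

```python
import sqlite3

# Illustrative only: the actual system is Postgres, and these names
# (annotation, groupid) are stand-ins, not the real Hypothesis schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE annotation (id INTEGER PRIMARY KEY, groupid TEXT)")
db.executemany("INSERT INTO annotation (groupid) VALUES (?)",
               [("g1",), ("g1",), ("g2",)])

# The "interpretive layer": a product-relevant view synthesized from the
# raw data and cached as a table, refreshed on whatever schedule suits.
db.execute("""CREATE TABLE annotation_counts AS
              SELECT groupid, COUNT(*) AS n
              FROM annotation
              GROUP BY groupid""")

rows = dict(db.execute("SELECT groupid, n FROM annotation_counts"))
print(rows)  # {'g1': 2, 'g2': 1}
```

In Postgres the cached table would more naturally be a materialized view, and the synthesis logic would live in PL/pgSQL or PL/Python functions, but the shape is the same: data and the logic that interprets it share one environment.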

I started writing a book to document what I’ve learned and done with this idea. It’s been a struggle to find motivation because, well, being the patron saint of trailing-edge technologies is often lonely and unrewarding. A book on this particular topic is likely to appeal to very few people. Stored procedures? So last century! Yes, Python provides a modern hook, but I can almost guarantee that one of the comments on my first book — “has a vision, but too weird” — would come back around.

I’m tempted not to bother. Maybe I should just focus on completing and polishing the things the book would describe.

And yet, it’s hard to let go. This isn’t just a compelling idea, it’s delivering excellent results. I rewatched r0ml’s talk today and got fired up again. Does it resonate for you? Would you like to see the ideas developed? If you watch the talk, please let me know.

Here are some excerpts to pique your interest.

On databases vs file systems:

I submit that the difference between the database and a file system is that database is a platform for enforcing your business rules.

On ontology:

client: The business guys are always upset because they want to know how many customers we have and we can’t tell them.

r0ml: That doesn’t sound like a very hard problem. SELECT * from the customer table, right?

client: No you don’t understand the problem.

r0ml: OK, what’s the problem?

client: It depends what you mean by customer because if you’re selling cell phone insurance, is the customer the person who has the cell phone account? What if they have two handsets and they’re both insured? What if it’s a family account and there are kids on the plan, do they count as customers? What if it’s a business account and you have 1000 people covered but only 700 using?

r0ml: How many customers you have, that’s a business rule.

So figuring out what your schema is, and figuring out how you organize the stuff and what do you do in the database, that’s all part of enforcing your business rules.

You have to decide what these things mean.

It’s an ontological problem.

You have to classify your knowledge and then enforce your business rules.
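r0ml’s point can be made concrete with a tiny sketch. Suppose the business decides that a covered person counts as a customer only if they’re active; that decision then lives in the schema as a view, not in a meeting. The names and the rule here are invented for illustration, and sqlite3 stands in for Postgres to keep the example self-contained:

```python
import sqlite3

# Invented schema: the "business rule" is that only active covered
# people count as customers. The rule is encoded in the database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE covered_person (
                  id INTEGER PRIMARY KEY,
                  account TEXT NOT NULL,
                  active INTEGER NOT NULL CHECK (active IN (0, 1)))""")
db.executemany("INSERT INTO covered_person (account, active) VALUES (?, ?)",
               [("biz-1", 1), ("biz-1", 0), ("fam-1", 1)])

# "How many customers do we have" now has one answer, defined once:
db.execute("""CREATE VIEW customer AS
              SELECT * FROM covered_person WHERE active = 1""")
(n,) = db.execute("SELECT COUNT(*) FROM customer").fetchone()
print(n)  # 2
```

The CHECK constraint and the view are both doing what r0ml describes: classifying knowledge and enforcing the rules where the data lives.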

On n-tier architecture:

Let us think about the typical web application architecture. This architecture is called the three-tier architecture because it has four tiers. You have your browser, your web server, the thing that runs Python or PHP or JavaScript or Ruby or Java code, and then the database. And that’s always how you do it. And why do you do it that way? Well because that’s how everybody does it.

On Smalltalk and friends:

This is the BYTE magazine cover from August of 1981. In the 70s and the 80s, programming languages had this sort of unique perspective that’s completely lost to history. The way it worked: a programming environment was a virtual machine image, it was a complete copy of your entire virtual machine memory and that was called the image. And then you loaded that up and it had all your functions and your data in it, and then you ran that for a while until you were sort of done and then you saved it out. And this wasn’t just Smalltalk, Lisp worked that way, APL worked that way, it was kind of like Docker only it wasn’t a separate thing because everything worked that way and so you didn’t worry very much about persistence because it was implied. If you had a programming environment it saved everything that you were doing in the programming environment, you didn’t have to separate that part out. A programming environment was a place where you kept all your data and business logic forever.

So then Postgres is kind of like Smalltalk only different.

What’s the difference? Well we took the UI out of Smalltalk and put it in the browser. The rest of it is the same, so really Postgres is an application delivery platform, just like we had back in the 80s.

Why public phones still exist

My superpower has always been finding new uses for old tech. In the late 90s I dusted off the venerable NNTP server, which had been the backbone of the Usenet, and turned it into my team’s Slack. In the late 2000s I used iCalendar to make citywide event calendars. In the late 2010s I went deep into SQL.

It’s always intensely pragmatic. But also, I can’t deny, whimsical.

In that spirit, I offer you the public pay phone at the Pinnacles Visitor Center. I stayed in that campground on a road trip just before the election. Given the tense political and epidemiological situation, I’d promised to message home regularly. There was no cell service in the park so I headed over to the office. It was closed, so I sat on the bench and connected to their WiFi. Or tried to. You could connect, sometimes, but you couldn’t move any data. The router was clearly in need of a reboot.

The only option left was the public phone. I can’t remember the last time I used one. Most people alive today have, perhaps, never used one. But there it was, so I gave it a shot.

Once upon a time, you could pick up the handset, dial 0 for operator, and place a so-called collect (charge-reversed) call. Now dialing 0 gets you nowhere.

The instructions taped to the phone (in the 90s I’m guessing) say you can call an 800 number, or use a calling card. I remember calling cards, I had one once. Not a thing lately.

And then there was this: “Dial 611 for help.”

Me: 611

611: Hello, this is Steve.

Me: I’m at the Pinnacles Visitor Center trying to send a message.

Steve: Use the WiFi.

Me: I can’t, it’s broken.

Steve: Huh, that’s interesting. Let me see if I can reboot the router.

And he did. So there you have it. The public phone still provides a valuable service. Its mission has evolved over the years. Nowadays, it exists to summon Steve the IT guy who can fix the WiFi by turning it off and on again.

Works like a charm!