Postgres functional style

My dual premises in this series are:

– Modern SQL is more valuable as a programming language than you might think (see Markus Winand’s Modern SQL: A lot has changed since SQL-92)

– Postgres is more valuable as a programming environment than you might think. (see R0ml Lefkowitz’s The Image of Postgres)

As the patron saint of trailing edge technology it is my duty to explore what’s possible at the intersection of these two premises. The topic for this episode is Postgres functional style. Clearly what I’ve been doing with the combo of pl/python and pl/pgsql is very far from pure functional programming. The self-memoization technique shown in episode 7 is all about mutating state (ed: this means writing stuff down somewhere). But it feels functional to me in the broader sense that I’m using functions to orchestrate behavior that’s expressed in terms of SQL queries.

To help explain what I mean, I’m going to unpack one of the Postgres functions in our library.

count_of_distinct_lms_students_from_to(_guid text, _from date, _to date)

This is a function that accepts a school id (aka guid), a start date, and an end date. Its job is to:

– Find all the courses (groups) for that school (guid)

– Filter to those created between the start and end date

– Find all the users in the filtered set of courses

– Filter to just students (i.e. omit instructors)

– Remove duplicate students (i.e., who are in more than one course)

– Return the count of distinct students at the school who annotated in the date range

The production database doesn’t yet store things in ways friendly to this outcome, so doing all this requires some heavy lifting in the analytics data warehouse. Here’s the function that orchestrates that work.

create function count_of_distinct_lms_students_from_to(_guid text, _from date, _to date) 
  returns bigint as $$
  declare count bigint;
 1  -- all groups active for the guid in the date range
 2  with groups as (
 3    select pubid from groups_for_guid(_guid)
 4    where group_is_active_from_to(pubid, _from, _to)
 5  ),
 6  -- usernames in those groups
 7  usernames_by_course as (
 8    select
 9      pubid,
10      (users_in_group(pubid)).username 
11  from groups 
12  ),
13  -- filtered to just students
14  students_by_course as (
15    select * from usernames_by_course
16    where not is_instructor(username, pubid)
17  )
18  select 
19    count (distinct username) 
20  from students_by_course
    into count;
    return count;
$$ language plpgsql;

If you think pl/pgsql is old and clunky, then you are welcome to do this in pl/python instead. There’s negligible difference between how they’re written and how fast they run. It’s the same chunk of SQL either way, and it exemplifies the functional style I’m about to describe.

Two kinds of chunking work together here: CTEs (aka common table expressions, aka WITH clauses) and functions. If you’ve not worked with SQL for a long time, as I hadn’t, then CTEs may be unfamiliar. I think of them as pipelines of table transformations in which each stage of the pipeline gets a name. In this example those names are groups (line 2), usernames_by_course (line 7), and students_by_course (line 14).

The pipeline phases aren’t functions that accept parameters, but I still think of them as being function-like in the sense that they encapsulate named chunks of behavior. The style I’ve settled into aims to make each phase of the pipeline responsible for a single idea (“groups active in the range”, “usernames in those groups”), and to express that idea in a short snippet of SQL.

As I’m developing one of these pipelines, I test each phase. To test the first phase, for example, I’d do this in psql or Metabase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
select * from groups;

And I’d spot-check to make sure the selected groups for that school really are in the date range. Then I’d check the next phase.

-- all groups active for the guid in the date range
with groups as (
-- usernames in those groups
usernames_by_course as (
  from groups 
select * from usernames_by_course;

After another sanity check against these results, I’d continue to the next phase, and eventually arrive at the final result. It’s the same approach I take with regular expressions. I am unable to visualize everything that’s happening in a complex regex. But I can reason effectively about a pipeline of matches that occur in easier-to-understand named steps.

Ideally each phase in one of these pipelines requires just a handful of lines of code: few enough to fit within the 7 +- 2 limit of working memory. Postgres functions make that possible. Here are the functions used in this 20-line chunk of SQL.

groups_for_guid(guid): Returns a table of course ids for a school.

group_is_active_from_to(pubid, _from, _to): Returns true if the group was created in the range.

users_in_group(pubid): Returns a table of user info for a course.

is_instructor(username, pubid): Returns true if that user is an instructor.

Two of these, groups_for_guid and users_in_group, are set-returning functions. As noted in Working with Postgres types, they have the option of returning an explicit Postgres type defined elsewhere, or an implicit Postgres type defined inline. As it happens, both do the latter.

create or replace function groups_for_guid(_guid text)
  returns table(
    pubid text
  ) as $$
create or replace function users_in_group (_pubid text)
  returns table (
    groupid text, 
    username text, 
    display_name text
  ) as $$

The other two, group_is_active_from_to and is_instructor, return boolean values.

All this feels highly readable to me now, but the syntax of line 10 took quite a while to sink in. It helps me to look at what users_in_group(pubid) does in a SELECT context.

select * from users_in_group('4VzA92Yy')

groupid   | username    | display_name
4VzA92Yy  | 39vA94AsQp  | Brendan Nadeau

Here is an alternate way to write the usernames_by_course CTE at line 7.

-- usernames in those groups
usernames_by_course as (
from groups g 
join users_in_group(g.pubid) u on g.pubid = u.groupid
select * from usernames_by_course;

Both do exactly the same thing in very close to the same amount of time. Having mixed the two styles I’m leaning toward the first, but you could go either way or both. What matters more is the mental leverage you wield when writing CTEs and functions together to compose pipelines of transformations, and that others wield when reading and debugging.

I hope I’ve made the case for writing and reading. There’s a case to be made for debugging too, but that’s another episode.


Leave a Reply