The Postgres REPL

R0ml Lefkowitz’s The Image of Postgres evokes the Smalltalk experience: reach deeply into a running system, make small changes, see immediate results. There isn’t yet a full-blown IDE for the style of Postgres-based development I describe in this series, though I can envision a VSCode extension that would provide one. But there is certainly a REPL (read-eval-print loop): it’s called psql, and it delivers the kind of immediacy that all REPLs do. In our case there’s also Metabase; it offers a complementary REPL that enhances its power as a lightweight app server.

In the Clojure docs it says:

The Clojure REPL gives the programmer an interactive development experience. When developing new functionality, it enables her to build programs first by performing small tasks manually, as if she were the computer, then gradually make them more and more automated, until the desired functionality is fully programmed. When debugging, the REPL makes the execution of her programs feel tangible: it enables the programmer to rapidly reproduce the problem, observe its symptoms closely, then improvise experiments to rapidly narrow down the cause of the bug and iterate towards a fix.

I feel the same way about the Python REPL, the browser’s REPL, the Metabase REPL, and now also the Postgres REPL. Every function and every materialized view in the analytics system begins as a snippet of code pasted into the psql console (or Metabase). Iteration yields successive results instantly, and those results reflect live data. In How is a Programmer Like a Pathologist, Gilad Bracha wrote:

A live program is dynamic; it changes over time; it is animated. A program is alive when it’s running. When you work on a program in a text editor, it is dead.

Tudor Girba amplified the point in a tweet.

In a database-backed system there’s no more direct way to interact with live data than to do so in the database. The Postgres REPL is, of course, a very sharp tool. Here are some ways to handle it carefully.

Find the right balance for tracking incremental change

In Working in a hybrid Metabase / Postgres code base I described how version-controlled files — for Postgres functions and views, and for Metabase questions — repose in GitHub and drive a concordance of docs. I sometimes write code snippets directly in psql or Metabase, but mainly compose in a “repository” (telling word!) where those snippets are “dead” artifacts in a text editor. They come to life when pasted into psql.

A knock on Smalltalk was that it didn’t play nicely with version control. If you focus on the REPL aspect, you could say the same of Python or JavaScript. In any such case there’s a balance to be struck between iterating at the speed of thought and tracking incremental change. Working solo I’ve been inclined toward a fairly granular commit history. In a team context I’d want to leave a chunkier history but still record the ongoing narrative somewhere.

Make it easy to understand the scope and effects of changes

The doc concordance has been the main way I visualize interdependent Postgres functions, Postgres views, and Metabase questions. In Working with interdependent Postgres functions and materialized views I mentioned Laurenz Albe’s Tracking view dependencies in PostgreSQL. I’ve adapted the view dependency tracker he develops there, and related work from others, to track function dependencies.

This tooling is still a work in progress, though. The concordance doesn’t yet include Postgres types, for example, nor the tables that are upstream from materialized views. My hypothetical VSCode extension would know about all the artifacts and react immediately when things change.
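
Meanwhile, the heart of that tracking reduces to a join through the system catalogs. Here’s a minimal sketch, far simpler than Albe’s full recursive tracker, that lists the views depending directly on a given relation (the relation name annotation is just an example):

select distinct dependent.relname as dependent_view
from pg_depend d
join pg_rewrite r on r.oid = d.objid
join pg_class dependent on dependent.oid = r.ev_class
join pg_class source on source.oid = d.refobjid
where d.classid = 'pg_rewrite'::regclass
  and d.refclassid = 'pg_class'::regclass
  and source.relname = 'annotation'
  and dependent.relname <> source.relname;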

Make it easy to find and discard unwanted artifacts

Given a function or view named foo, I’ll often write and test a foo2 before transplanting changes back into foo. Because foo may depend on bar and call baz, I often wind up with bar2 and baz2 as well. These artifacts hang around in Postgres until you delete them, which I try to do as I go along.

If foo2 is a memoized function (see this episode), it can be necessary to delete the set of views that it’s going to recreate. I find these with a query.

select 
  'drop materialized view ' || matviewname || ';' as drop_stmt
from pg_matviews 
where matviewname ~* {{ pattern }}

That pattern might be question_and_answer_summary_for_group to find all views based on that function, or _6djxg2yk to find all views for a group, or even [^_]{8,8}$ to find all views made by memoized functions.
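
One way to run the generated drop statements, in psql, is \gexec, which executes each cell of a query result as a SQL statement. A minimal sketch, with the pattern inlined as a literal:

select format('drop materialized view %I', matviewname)
from pg_matviews
where matviewname ~* '_6djxg2yk'
\gexec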

I haven’t yet automated the discovery or removal of stale artifacts and references to them. That’s another nice-to-have for the hypothetical IDE.

The Image of Postgres

I’ll give R0ml the last word on this topic.

This is the BYTE magazine cover from August of 1981. In the 70s and the 80s, programming languages had this sort of unique perspective that’s completely lost to history. The way it worked: a programming environment was a virtual machine image, it was a complete copy of your entire virtual machine memory and that was called the image. And then you loaded that up and it had all your functions and your data in it, and then you ran that for a while until you were sort of done and then you saved it out. And this wasn’t just Smalltalk, Lisp worked that way, APL worked that way, it was kind of like Docker only it wasn’t a separate thing because everything worked that way and so you didn’t worry very much about persistence because it was implied. If you had a programming environment it saved everything that you were doing in the programming environment, you didn’t have to separate that part out. A programming environment was a place where you kept all your data and business logic forever.

So then Postgres is kind of like Smalltalk only different.

What’s the difference? Well we took the UI out of Smalltalk and put it in the browser. The rest of it is the same, so really Postgres is an application delivery platform, just like we had back in the 80s.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Working with interdependent Postgres functions and materialized views

In Working with Postgres types I showed an example of a materialized view that depends on a typed set-returning function. Because Postgres knows about that dependency, it won’t allow DROP FUNCTION foo. Instead it requires DROP FUNCTION foo CASCADE.

A similar thing happens with materialized views that depend on tables or other materialized views. Let’s build a cascade of views and consider the implications.

create materialized view v1 as (
  select 
    1 as number,
    'note_count' as label 
);
SELECT 1

select * from v1;

number | label
-------+-------
     1 | note_count

Actually, before continuing the cascade, let’s linger here for a moment. This is a table-like object created without using CREATE TABLE and without explicitly specifying types. But Postgres knows the types.

\d v1

Materialized view "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

The read-only view can become a read-write table like so.

create table t1 as (select * from v1);
SELECT 1

select * from t1;

number | label
-------+-------
     1 | note_count

\d t1

Table "public.v1"

Column  | Type 
--------+-----
number  | integer
label   | text

This ability to derive a table from a materialized view will come in handy later. It’s also just interesting to see how the view’s implicit types become explicit in the table.

OK, let’s continue the cascade.

create materialized view v2 as (
  select 
    number + 1 as number,
    label
  from v1
);
SELECT 1

select * from v2;

number | label
-------+-------
     2 | note_count

create materialized view v3 as (
  select 
    number + 1 as number,
    label
  from v2
);
SELECT 1

select * from v3;

number | label
-------+-------
     3 | note_count

Why do this? Arguably you shouldn’t. Laurenz Albe makes that case in Tracking view dependencies in PostgreSQL. Recognizing that it’s sometimes useful, though, he goes on to provide code that can track recursive view dependencies.

I use cascading views advisedly to augment the use of CTEs and functions described in Postgres functional style. Views that refine views can provide a complementary form of the chunking that aids reasoning in an analytics system. But that’s a topic for another episode. In this episode I’ll describe a problem that arose in a case where there’s only a single level of dependency from a table to a set of dependent materialized views, and discuss my solution to that problem.

Here’s the setup. We have an annotation table that’s reloaded nightly. On an internal dashboard we have a chart based on the materialized view annos_at_month_ends_for_one_year which is derived from the annotation table and, as its name suggests, reports annotation counts on a monthly cycle. At the beginning of the nightly load, this happens: DROP TABLE annotation CASCADE. So the derived view gets dropped and needs to be recreated as part of the nightly process. But that’s a lot of unnecessary work for a chart that only changes monthly.

Here are two ways to protect a view from a cascading drop of the table it depends on. Both reside in a SQL script, monthly.sql, that only runs on the first of every month. First, annos_at_month_ends_for_one_year.

drop materialized view annos_at_month_ends_for_one_year;
create materialized view annos_at_month_ends_for_one_year as (
  with last_days as (
    select 
      last_days_of_prior_months( 
        date(last_month_date() - interval '6 year')
      ) as last_day
    ),
    monthly_counts as (
      select
        to_char(last_day, '_YYYY-MM') as end_of_month,
        anno_count_between( 
          date(last_day - interval '1 month'), last_day
        ) as monthly_annos
      from last_days
    )
    select
      end_of_month,
      monthly_annos,
      sum(monthly_annos) over 
        (order by end_of_month asc rows 
           between unbounded preceding and current row
        ) as cumulative_annos
    from monthly_counts
) with data;

Because this view depends indirectly on the annotation table — by way of the function anno_count_between — Postgres doesn’t see the dependency. So the view isn’t affected by the cascading drop of the annotation table. It persists until, once a month, it gets dropped and recreated.
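
For concreteness, anno_count_between might look something like this hypothetical sketch (the created column is an assumption, and the real function may differ). What matters is that annotation appears only inside the function body, where Postgres records no view dependency.

create function anno_count_between(_from date, _to date)
  returns bigint as $$
  select count(*)
  from annotation
  where date(created) >= _from
    and date(created) < _to;
$$ language sql;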

What if you want Postgres to know about such a dependency, so that the view will participate in a cascading drop? You can do this.

create materialized view annos_at_month_ends_for_one_year as (
  with depends as (
    select * from annotation limit 1
  ),
  last_days as (
    ...
  ),
  monthly_counts as (
    ...
  )
  select
    *
  from monthly_counts
) with data;

The depends CTE doesn’t do anything relevant to the query; it just tells Postgres that this view depends on the annotation table.

Here’s another way to protect a long-lived result from a cascading drop. This expensive-to-build result depends directly on the annotation table but only needs to be updated monthly. So in this case, cumulative_annotations is a table derived from a temporary materialized view.

create materialized view _cumulative_annotations as (
  with data as (
    select
      to_char(created, 'YYYY-MM') as created,
      count(*) as count
    from annotation
    group by 1
  )
  select
    data.created,
    sum(data.count)
      over (
        order by data.created asc
        rows between unbounded preceding and current row
      )
  from data
  order by data.created
) with data;
drop table cumulative_annotations;

create table cumulative_annotations as (
  select * from _cumulative_annotations
);

drop materialized view _cumulative_annotations;

The table cumulative_annotations is only rebuilt once a month. It depends indirectly on the annotation table but Postgres doesn’t see that, so doesn’t include it in the cascading drop.

Here’s the proof.

-- create a table
create table t1 (number int);

insert into t1 (number) values (1);
INSERT 0 1

select * from t1;

number
-------
     1

-- derive a view from t1
create materialized view v1 as (select * from t1);
SELECT 1

select * from v1;

number
-------
     1

-- try to drop t1
drop table t1;

ERROR: cannot drop table t1 because other objects depend on it
DETAIL: materialized view v1 depends on table t1
HINT: Use DROP ... CASCADE to drop the dependent objects too.

-- derive an independent table from t1 by way of a matview
drop materialized view v1;

create materialized view v1 as (select * from t1);
SELECT 1

create table t2 as (select * from v1);
SELECT 1

-- drop the matview
drop materialized view v1;

-- drop t1
drop table t1;

-- no complaint, and t2 still exists
select * from t2;

number
-------
     1

These are two ways I’ve found to protect a long-lived result set from the cascading drop of a short-lived table on which it depends. You can hide the dependency behind a function, or you can derive an independent table by way of a transient materialized view. I use them interchangeably, and don’t have a strong preference one way or another. Both lighten the load on the analytics server. Materialized views (or tables) that only need to change weekly or monthly, but were being dropped nightly by cascade from core tables, are now recreated only on their appropriate weekly or monthly cycles.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Working in a hybrid Metabase / Postgres code base

In this series I’m exploring how to work in a code base that lives in two places: Metabase questions that encapsulate chunks of SQL, and Postgres functions, procedures, and views that also encapsulate chunks of SQL. To be effective working with this hybrid collection of SQL chunks, I needed to reason about their interrelationships. One way to do that was to create a documentation generator that writes a concordance of callers and callees.

Here’s the entry for the function sparklines_for_guid(_guid).

The called by column says that this function is called from two different contexts. One is a Metabase question, All courses: Weekly activity. If you’re viewing that question in Metabase, you’ll find that its SQL text is simply this:

select * from sparklines_for_guid( {{guid}} )

The same function call appears in a procedure, cache warmer, that preemptively memoizes the function for a set of the most active schools in the system. In either case, you can look up the function in the concordance, view its definition, and review how it’s called.

In the definition of sparklines_for_guid, names of other functions (like guid_hashed_view_exists) appear and are linked to their definitions. Similarly, names of views appearing in SELECT or JOIN contexts are linked to their definitions.

Here’s the entry for the function guid_hashed_view_exists. It is called by sparklines_for_guid as well as by functions that drive panels on the school dashboard. It links to the functions it uses: hash_for_guid and exists_view.

Here’s the entry for the view lms_course_groups which appears as a JOIN target in sparklines_for_guid. This central view is invoked — in SELECT or JOIN context — from many functions, from dependent views, and from Metabase questions.

Metabase questions can also “call” other Metabase questions. In A virtuous cycle for analytics I noted: “Queries can emit URLs in order to compose themselves with other queries.” Here’s an example of that.

This Metabase question (985) calls various linked functions, and is called by two other Metabase questions. Here is one of those.

Because this Metabase question (600) emits a URL that refers to 985, it links to the definition of 985. It also links to the view, top_annotated_domains_last_week, from which it SELECTs.
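
To make that concrete, the SQL of a question like 600 might look something like this hypothetical sketch; the Metabase host, the query parameter, and the column name are illustrative, not the real question:

select
  domain,
  'https://metabase.example.com/question/985?domain=' || domain as details
from top_annotated_domains_last_week;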

It was straightforward to include Postgres functions, views, and procedures in the concordance since these live in files that reside in the filesystem under source control. Metabase questions, however, live in a database — in our case, a Postgres database that’s separate from our primary Postgres analytics db. In order to extract them into a file I use this SQL snippet.

select
  r.id,
  m.name as dbname,
  r.name,
  r.description,
  r.dataset_query
from report_card r
join metabase_database m
  on m.id = cast(r.dataset_query::jsonb->>'database' as int)
where not r.archived
order by r.id;

The doc generator downloads that Metabase data as a CSV file, queries.csv, and processes it along with the files that contain the definitions of functions, procedures, and views in the Postgres data warehouse. It also emits queries.txt which is a more convenient way to diff changes from one commit to the next.
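
The download step can be a shell one-liner. Here’s a hedged sketch using server-side COPY; the database name metabase and the connection details are assumptions:

psql -d metabase -c "
  copy (
    select r.id, m.name as dbname, r.name, r.description, r.dataset_query
    from report_card r
    join metabase_database m
      on m.id = cast(r.dataset_query::jsonb->>'database' as int)
    where not r.archived
    order by r.id
  ) to stdout with csv header
" > queries.csv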

This technique solved a couple of problems. First, when we were only using Metabase — unaugmented by anything in Postgres — it enabled us to put our Metabase SQL under source control and helped us visualize relationships among queries.

Later, as we augmented Metabase with Postgres functions, procedures, and views, it became even more powerful. Developing a new panel for a school or course dashboard means writing a new memoized function. That process begins as a Metabase question with SQL code that calls existing Postgres functions, and/or JOINs/SELECTs FROM existing Postgres views. Typically it then leads to the creation of new supporting Postgres functions and/or views. All this can be tested by internal users, or even invited external users, in Metabase, with the concordance available to help understand the relationships among the evolving set of functions and views.

When supporting functions and views are finalized, the SQL content of the Metabase question gets wrapped up in a memoized Postgres function that’s invoked from a panel of a dashboard app. At that point the concordance links the new wrapper function to the same set of supporting functions and views. I’ve found this to be an effective way to reason about a hybrid code base as features move from Metabase for prototyping to Postgres in production, while maintaining all the code under source control.

That foundation of source control is necessary, but maybe not sufficient, for a team to consider this whole approach viable. The use of two complementary languages for in-database programming will certainly raise eyebrows, and if it’s not your cup of tea I completely understand. If you do find it appealing, though, one thing you’ll wonder about next is tooling. I work in VSCode nowadays, and I’ve not yet found a useful extension for pl/pgsql or pl/python. Metaprogramming makes life even harder for aspiring pl/pgsql or pl/python VSCode extensions. I can envision them, but I’m not holding my breath awaiting them. Meanwhile, two factors enable VSCode to be helpful even without deep language-specific intelligence.

The first factor, and by far the dominant one, is outlining. In Products and capabilities I reflect on how I’ve never adopted an outlining product, but often rely on outlining capability infused into a product. In VSCode that’s “only” basic indentation-driven folding and unfolding. But I find it works remarkably well across SQL queries, views and functions that embed them, CTEs that comprise them, and pl/pgsql or pl/python functions called by them.

The second factor, nascent thus far, is GitHub Copilot. It’s a complementary kind of language intelligence that’s aware of, but not bounded by, what a file extension of .sql or .py implies. It can sometimes discern patterns that mix language syntaxes and offer helpful suggestions. That hasn’t happened often so far, but it’s striking when it does. I don’t yet know the extent to which it may be training me while I train it, or how those interactions might influence others. At this point I’m not a major Copilot booster, but I am very much an interested observer of and participant in the experiment.

All in all, I’m guardedly optimistic that existing or feasible tooling can enable individuals and teams to sanely navigate the hybrid corpus of source code discussed in this series. If you’re along for the ride, you’ll next wonder about debugging and monitoring a system built this way. That’s a topic for a future episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres functional style

My dual premises in this series are:

– Modern SQL is more valuable as a programming language than you might think (see Markus Winand’s Modern SQL: A lot has changed since SQL-92)

– Postgres is more valuable as a programming environment than you might think. (see R0ml Lefkowitz’s The Image of Postgres)

As the patron saint of trailing edge technology it is my duty to explore what’s possible at the intersection of these two premises. The topic for this episode is Postgres functional style. Clearly what I’ve been doing with the combo of pl/python and pl/pgsql is very far from pure functional programming. The self-memoization technique shown in episode 7 is all about mutating state (ed: this means writing stuff down somewhere). But it feels functional to me in the broader sense that I’m using functions to orchestrate behavior that’s expressed in terms of SQL queries.

To help explain what I mean, I’m going to unpack one of the Postgres functions in our library.

count_of_distinct_lms_students_from_to(_guid text, _from date, _to date)

This is a function that accepts a school id (aka guid), a start date, and an end date. Its job is to:

– Find all the courses (groups) for that school (guid)

– Filter to those created between the start and end date

– Find all the users in the filtered set of courses

– Filter to just students (i.e. omit instructors)

– Remove duplicate students (i.e., who are in more than one course)

– Return the count of distinct students at the school who annotated in the date range

The production database doesn’t yet store things in ways friendly to this outcome, so doing all this requires some heavy lifting in the analytics data warehouse. Here’s the function that orchestrates that work.

create function count_of_distinct_lms_students_from_to(_guid text, _from date, _to date) 
  returns bigint as $$
  declare count bigint;
  begin
 1  -- all groups active for the guid in the date range
 2  with groups as (
 3    select pubid from groups_for_guid(_guid)
 4    where group_is_active_from_to(pubid, _from, _to)
 5  ),
 6  -- usernames in those groups
 7  usernames_by_course as (
 8    select
 9      pubid,
10      (users_in_group(pubid)).username 
11  from groups 
12  ),
13  -- filtered to just students
14  students_by_course as (
15    select * from usernames_by_course
16    where not is_instructor(username, pubid)
17  )
18  select 
19    count (distinct username) 
20  from students_by_course
    into count;
    return count;
  end;
$$ language plpgsql;
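
Calling it, with the sample school id and date range that appear later in this episode, looks like this:

select count_of_distinct_lms_students_from_to(
  '8anU0QwbgC2Cq:canvas-lms', '2021-01-01', '2021-05-01'
);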

If you think pl/pgsql is old and clunky, then you are welcome to do this in pl/python instead. There’s negligible difference between how they’re written and how fast they run. It’s the same chunk of SQL either way, and it exemplifies the functional style I’m about to describe.

Two kinds of chunking work together here: CTEs (aka common table expressions, aka WITH clauses) and functions. If you’ve not worked with SQL for a long time, as I hadn’t, then CTEs may be unfamiliar. I think of them as pipelines of table transformations in which each stage of the pipeline gets a name. In this example those names are groups (line 2), usernames_by_course (line 7), and students_by_course (line 14).

The pipeline phases aren’t functions that accept parameters, but I still think of them as being function-like in the sense that they encapsulate named chunks of behavior. The style I’ve settled into aims to make each phase of the pipeline responsible for a single idea (“groups active in the range”, “usernames in those groups”), and to express that idea in a short snippet of SQL.

As I’m developing one of these pipelines, I test each phase. To test the first phase, for example, I’d do this in psql or Metabase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
)
select * from groups;

And I’d spot-check to make sure the selected groups for that school really are in the date range. Then I’d check the next phase.

-- all groups active for the guid in the date range
with groups as (
  select pubid from groups_for_guid('8anU0QwbgC2Cq:canvas-lms')
  where group_is_active_from_to(pubid, '2021-01-01', '2021-05-01')
),
-- usernames in those groups
usernames_by_course as (
  select
    pubid,
    (users_in_group(pubid)).username 
  from groups 
)
select * from usernames_by_course;

After another sanity check against these results, I’d continue to the next phase, and eventually arrive at the final result. It’s the same approach I take with regular expressions. I am unable to visualize everything that’s happening in a complex regex. But I can reason effectively about a pipeline of matches that occur in easier-to-understand named steps.

Ideally each phase in one of these pipelines requires just a handful of lines of code: few enough to fit within the 7 ± 2 limit of working memory. Postgres functions make that possible. Here are the functions used in this 20-line chunk of SQL.

groups_for_guid(guid): Returns a table of course ids for a school.

group_is_active_from_to(pubid, _from, _to): Returns true if the group was created in the range.

users_in_group(pubid): Returns a table of user info for a course.

is_instructor(username, pubid): Returns true if that user is an instructor.

Two of these, groups_for_guid and users_in_group, are set-returning functions. As noted in Working with Postgres types, they have the option of returning an explicit Postgres type defined elsewhere, or an implicit Postgres type defined inline. As it happens, both do the latter.

create or replace function groups_for_guid(_guid text)
  returns table(
    pubid text
  ) as $$
create or replace function users_in_group (_pubid text)
  returns table (
    groupid text, 
    username text, 
    display_name text
  ) as $$

The other two, group_is_active_from_to and is_instructor, return boolean values.
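
For example, group_is_active_from_to might reduce to an EXISTS test. This is a hypothetical sketch; the lms_group table and created column are assumptions, not the real schema:

create function group_is_active_from_to(_pubid text, _from date, _to date)
  returns boolean as $$
  select exists (
    select 1
    from lms_group
    where pubid = _pubid
      and created between _from and _to
  );
$$ language sql;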

All this feels highly readable to me now, but the syntax of line 10 took quite a while to sink in. It helps me to look at what users_in_group(pubid) does in a SELECT context.

select * from users_in_group('4VzA92Yy');

groupid   | username    | display_name
----------+-------------+----------------
4VzA92Yy  | 39vA94AsQp  | Brendan Nadeau

Here is an alternate way to write the usernames_by_course CTE at line 7.

-- usernames in those groups
usernames_by_course as (
  select
    g.pubid,
    u.username
  from groups g
  join users_in_group(g.pubid) u on g.pubid = u.groupid
)
select * from usernames_by_course;

Both do exactly the same thing in very close to the same amount of time. Having mixed the two styles I’m leaning toward the first, but you could go either way or both. What matters more is the mental leverage you wield when writing CTEs and functions together to compose pipelines of transformations, and that others wield when reading and debugging.

I hope I’ve made the case for writing and reading. There’s a case to be made for debugging too, but that’s another episode.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres set-returning functions that self-memoize as materialized views

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection. In episode 5 I discussed metaprogramming, by which I mean using pl/python to compose and run SQL code. This episode features introspection, by which I mean taking advantage of Python’s inspect module to enable a pl/python function to discover its own name.

Why do that? In this context, so that the function can create a materialized view by joining its own name with the value of its first parameter. Here’s the example from episode 5.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    -- SQL THAT RETURNS A SETOF QUESTION_AND_ANSWER_FOR_GROUP
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

The function drives a panel on the course dashboard. An initial call to, say, questions_and_answers_for_group('P1mQaEEp'), creates the materialized view questions_and_answers_for_group_p1mqaeep and returns SELECT * from the view. Subsequent calls skip creating the view and just return SELECT * from it.
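
Under the hood, the existence check can be a simple probe of pg_matviews; here’s a hedged sketch of what exists_group_view amounts to (the real helper may differ):

select exists (
  select 1
  from pg_matviews
  where matviewname = 'questions_and_answers_for_group_p1mqaeep'
) as view_exists;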

Note that even though the group name is mixed case, the view name created by Postgres is all lowercase. For example:

create materialized view test_AbC as (select 'ok') with data;
SELECT 1

\d test_AbC
Materialized view "public.test_abc"

I want to think of this as a form of capability injection, but it’s really more like a capability wrapper. The capability is memoization. A function endowed with it can run a SQL query and cache the resulting rows in a materialized view before returning them to a SQL SELECT context. The wrapper is boilerplate code that discovers the function’s name, checks for the existence of a corresponding view, and if it isn’t found, calls memoize_view_name(sql, view_name) to run an arbitrary chunk of SQL code whose result set matches the function’s type. So in short: this pattern wraps memoization around a set-returning pl/python function.

As noted in episode 5, although memoize_view_name is called from pl/python, it is not itself a pl/python function. It’s a normal Python function in a module that’s accessible to the instance of Python that the Postgres pl/python extension is using. In my case that module is just a few small functions in a file called plpython_helpers.py, installed (cough, copied) to user postgres’s ~/.local/lib/python3.8/site-packages/plpython_helpers.py.

So far, there are only two critical functions in that module: get_caller_name() and memoize_view_name.

Here is get_caller_name().

import inspect, re

def get_caller_name():
  base_view_name = inspect.stack()[1][3].replace('__plpython_procedure_', '')
  return re.sub(r'_\d+$', '', base_view_name)

The internal name for a pl/python function created by CREATE FUNCTION foo() looks like __plpython_procedure_foo_981048462. What get_caller_name() returns is just foo.

Here’s memoize_view_name().

import base64, os

def memoize_view_name(sql, view_name):
  sql = sql.replace('\n', ' ')
  encoded_bytes = base64.b64encode(sql.encode('utf-8'))
  encoded_str = str(encoded_bytes, 'utf-8')
  cmd = f"""psql -d h_analytics -c "call memoizer('{encoded_str}', '{view_name}')" """
  result = os.system(cmd)
  print(f'memoize_view_name: {cmd} result: {result}')

Given a chunk of SQL and the name of a view, it converts newlines to spaces, base64-encodes the query text, and invokes psql to call a procedure, memoizer, that does the work of running the SQL query and creating the materialized view from those results. So for example the function that yields sparkline data for a school might look like sparkline_data_for_school('af513ee'), and produce the view sparkline_data_for_school_af513ee.

Why shell out to psql here? It may not be necessary; there may be a way to manage the transaction directly within the function, but if so I haven’t found it. I’m very far from being an expert on transaction semantics and will appreciate guidance here if anyone cares to offer it. Meanwhile, this technique seems to work well. memoizer is a Postgres procedure, not a function. Although “stored procedures” is the term that I’ve always associated with in-database programming, I went pretty far down this path using only CREATE FUNCTION, never CREATE PROCEDURE. When I eventually went there I found the distinction between functions and procedures to be a bit slippery. This StackOverflow answer matches what I’ve observed.

PostgreSQL 11 added stored procedures as a new schema object. You can create a new procedure by using the CREATE PROCEDURE statement.

Stored procedures differ from functions in the following ways:

  • Stored procedures do not have to return anything, and only return a single row when using INOUT parameters.

  • You can commit and rollback transactions inside stored procedures, but not in functions.

  • You execute a stored procedure using the CALL statement rather than a SELECT statement.

  • Unlike functions, procedures cannot be nested in other DDL commands (SELECT, INSERT, UPDATE, DELETE).

Here is the memoizer procedure. It happens to be written in pl/python but could as easily have been written in pl/pgsql using the built-in Postgres decode function. Procedures, like functions, can be written in either language (or others) and share the common Postgres type system.

create procedure memoizer(_sql text, _view_name text) as $$
  import base64
  decoded_bytes = base64.b64decode(_sql)
  decoded_str = str(decoded_bytes, 'utf-8')
  create = f"""
    create materialized view if not exists {_view_name} as (
      {decoded_str}
    ) with data;
    """
  plpy.execute(create)
  permit = f"""
    grant select on {_view_name} to analytics;
  """
  plpy.execute(permit)
$$ language plpython3u;

There’s no plpy.commit() here because psql takes care of that automatically. Eventually I wrote other procedures, some of which do their own committing, but that isn’t needed here.
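
When a procedure does manage its own transaction, plpy.commit() is available in pl/python procedures (though not in functions). A minimal hypothetical sketch; the procedure name and refresh step are illustrative:

create procedure refresh_view_and_commit(_view_name text) as $$
  plpy.execute(f'refresh materialized view {_view_name}')
  plpy.commit()
$$ language plpython3u;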

Of course it’s only possible to shell out to psql from a function because pl/python is an “untrusted” language extension. Recall from episode 1:

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Using Python’s os.system() to invoke psql is another of those superpowers. It’s not something I do lightly, and if there’s a better/safer way I’m all ears.

Meanwhile, this approach is delivering much value. We have two main dashboards, each of which displays a dozen or so panels. The school dashboard reports on annotation activity across all courses at a school. The course dashboard reports on the documents, and selections within documents, that instructors and students are discussing in the course’s annotation layer. Each panel that appears on the school or course dashboard is the output of a memoized function that is parameterized by a school or course id.

The data warehouse runs on a 24-hour cycle. Within that cycle, the first call to a memoized function takes just as long as it takes to run the SQL wrapped by the function. The cached view only comes into play when the function is called again during the same cycle. That can happen in a few different ways.

– A user reloads a dashboard, or a second user loads it.

– A panel expands or refines the results of another panel. For example, questions_and_answers_for_group() provides a foundation for a family of related functions including:

questions_asked_by_teacher_answered_by_student()

questions_asked_by_student_answered_by_teacher()

questions_asked_by_student_answered_by_student()

– A scheduled job invokes a function in order to cache its results before any user asks for them. For example, the time required to cache panels for school dashboards varies a lot. For schools with many active courses it can take minutes to run those queries, so preemptive memoization matters a lot. For schools with fewer active courses it’s OK to memoize on the fly. This method enables flexible cache policy. Across schools we can decide how many of the most-active ones to cache. Within a school, we can decide which courses to cache, e.g. most recent, or most active. The mechanism to display a dashboard panel is always the same function call. The caching done in support of that function is highly configurable.

Caches, of course, must be purged. Since these materialized views depend on core tables it was enough, early on, to do this for views depending on the annotation table.

drop table annotation cascade;

At a certain point, with a growing number of views built during each cycle, the cascade failed.

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

Raising max_locks_per_transaction wasn’t the answer. Instead we switched to enumerating views and dropping them individually. Again that afforded great flexibility. We can scan the names in the pg_matviews system table and match all the memoized views, or just those for a subset of schools, or just particular panels on school or course dashboards. Policies that govern the purging of cached views can be as flexible as those that govern their creation.
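
A minimal sketch of that enumerate-and-drop loop, assuming the memoized views are the ones whose names end with an 8-character group id suffix:

do $$
declare
  v text;
begin
  for v in
    select matviewname
    from pg_matviews
    where matviewname ~* '[^_]{8}$'
  loop
    execute format('drop materialized view if exists %I', v);
  end loop;
end;
$$;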


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

Postgres and JSON: Finding document hotspots (part 1)

One of the compelling aspects of modern SQL is the JSON support built into modern engines, including Postgres. The documentation is well done, but I need examples to motivate my understanding of where and how and why to use such a capability. The one I’ll use in this episode is something I call document hotspots.

Suppose a teacher has asked her students to annotate Arthur Miller’s The Crucible. How can she find the most heavily-annotated passages? They’re visible in the Hypothesis client, of course, but may be sparsely distributed. She can scroll through the 154-page PDF document to find the hotspots, but it will be helpful to see a report that brings them together. Let’s do that.

The Hypothesis system stores annotations using a blend of SQL and JSON datatypes. Consider a sample annotation of The Crucible.

When the Hypothesis client creates that annotation it sends a JSON payload to the server. Likewise, when the client subsequently requests the annotation in order to anchor it to the document, it receives a similar JSON payload:

{
  "id": "VLUhcP1-EeuHn5MbnGgJ0w",
  "created": "2021-08-15T04:07:39.343275+00:00",
  "updated": "2021-08-15T04:07:39.343275+00:00",
  "user": "acct:judell@hypothes.is",
  "uri": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
  "text": "\"He is no one's favorite clergyman.\"  :-)\n\nhttps://www.thoughtco.com/crucible-character-study-reverend-parris-2713521",
  "tags": [],
  "group": "__world__",
  "permissions": {
    "read": [
      "group:__world__"
    ],
    "admin": [
      "acct:judell@hypothes.is"
    ],
    "update": [
      "acct:judell@hypothes.is"
    ],
    "delete": [
      "acct:judell@hypothes.is"
    ]
  },
  "target": [
    {
      "source": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
      "selector": [
        {
          "end": 44483,
          "type": "TextPositionSelector",
          "start": 44392
        },
        {
          "type": "TextQuoteSelector",
          "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.",
          "prefix": " sixty-six pound, Mr. Proctor! I",
          "suffix": " Giles: Aye, and well instructed"
        }
      ]
    }
  ],
  "document": {
    "title": [
      "The%20Crucible%20full%20text.pdf"
    ]
  },
  "links": {
    "html": "https://hypothes.is/a/VLUhcP1-EeuHn5MbnGgJ0w",
    "incontext": "https://hyp.is/VLUhcP1-EeuHn5MbnGgJ0w/ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
    "json": "https://hypothes.is/api/annotations/VLUhcP1-EeuHn5MbnGgJ0w"
  },
  "user_info": {
    "display_name": "Jon Udell"
  },
  "flagged": false,
  "hidden": false
}

The server mostly shreds this JSON into conventional SQL types. The tags array, for example, is hoisted out of the JSON into a SQL array-of-text. The expression to find its length is a conventional Postgres idiom: array_length(tags,1). Note the second parameter; array_length(tags) is an error, because Postgres arrays can be multidimensional. In this case there’s only one dimension but it’s still necessary to specify that.
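
For example, a quick check of tag counts per annotation:

select id, array_length(tags, 1) as tag_count
from annotation
where array_length(tags, 1) is not null
limit 5;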

A target_selectors column, though, is retained as JSON. These selectors define how an annotation anchors to a target selection in a document. Because selectors are used only by the Hypothesis client, which creates and consumes them in JSON format, there’s no reason to shred them into separate columns. In normal operation, selectors don’t need to be related to core tables. They can live in the database as opaque blobs of JSON.

For some analytic queries, though, it is necessary to peer into those blobs and relate their contents to core tables. There’s a parallel set of functions for working with JSON. For example, the target_selectors column corresponds to the target[0]['selector'] array in the JSON representation. The expression to find the length of that array is jsonb_array_length(target_selectors).

Here’s a similar expression that won’t work: json_array_length(target_selectors). Postgres complains that the function doesn’t exist.

ERROR: function json_array_length(jsonb) does not exist 
Hint: No function matches the given name and argument types.

In fact both functions, json_array_length and jsonb_array_length, exist. But Postgres knows the target_selectors column is of type jsonb, not json which is the correct type for the json_array_length function. What’s the difference between json and jsonb?

The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/12/datatype-json.html

Although I tend to use JSON to refer to data in a variety of contexts, the flavor of JSON in the Postgres queries, views, and functions I’ll discuss will always be jsonb. The input conversion overhead isn’t a problem for analytics work that happens in a data warehouse, and the indexing support is a tremendous enabler.

To illustrate some of the operators common to json and jsonb, here is a query that captures the target_selectors column from the sample annotation.

with example as (
  select
    id,
    target_selectors as selectors
  from annotation
  where id = '54b52170-fd7e-11eb-879f-931b9c6809d3'
)
select * from example;

Here are some other queries against the example CTE.

select selectors from example;

[{"end": 44483, "type": "TextPositionSelector", "start": 44392}, { ... } ]

The result is a human-readable representation, but the type of selectors is jsonb.

select pg_typeof(selectors) from example;

jsonb

The array-indexing operator, ->, can yield the zeroth element of the array.

select selectors->0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result is again a human-readable representation of a jsonb type.

select pg_typeof(selectors->0) from example;

jsonb

Another array-indexing operator, ->>, can also yield the zeroth element of the array, but now as type text.

select selectors->>0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result looks the same, but the type is different.

select pg_typeof(selectors->>0) from example;

text

The -> and ->> operators can also index objects by their keys. These examples work with the object that is the zeroth element of the array.

select selectors->0->'type' from example;

"TextPositionSelector"

select pg_typeof(selectors->0->'type') from example;

jsonb

select selectors->0->>'type' from example;

TextPositionSelector

select pg_typeof(selectors->0->>'type') from example;

text

The Hypothesis system stores the location of a target (i.e., the selection in a document to which an annotation refers) in the target_selectors column we’ve been exploring. It records several kinds of selectors. TextQuoteSelector represents the selection as the exact highlighted text bracketed by snippets of context. TextPositionSelector represents it as a pair of numbers that mark the beginning and end of the selection. When one range formed by that numeric pair is equal to another, it means two students have annotated the same selection. When a range contains another range, it means one student annotated the containing range, and another student made an overlapping annotation on the contained range. We can use these facts to surface hotspots where annotations overlap exactly or in nested fashion.

To start, let’s have a function to extract the start/end range from an annotation. In a conventional programming language you might iterate through the selectors in the target_selectors array looking for the one with the type TextPositionSelector. That’s possible in pl/pgsql and pl/python, but Postgres affords a more SQL-oriented approach. Given a JSON array, the function jsonb_array_elements returns a table-like object with rows corresponding to array elements.

select jsonb_array_elements(selectors) from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}
{"type": "TextQuoteSelector", "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.", "prefix": " sixty-six pound, Mr. Proctor! I", "suffix": " Giles: Aye, and well instructed"}

A function can convert the array to rows, select the row of interest, select the start and end values from the row, package the pair of numbers as an array, and return the array.

create function position_from_anno(_id uuid) returns numeric[] as $$
  declare range numeric[];
  begin
    with selectors as (
      select jsonb_array_elements(target_selectors) as selector
      from annotation
      where id = _id
    ),
    position as (
      select
        (selector->>'start')::numeric as startpos,
        (selector->>'end')::numeric as endpos
      from selectors
      where selector->>'type' = 'TextPositionSelector'
    )
    select array[p.startpos, p.endpos] 
    from position p
    into range;
    return range;
  end;  
$$ language plpgsql;

Using it for the sample annotation:

select position_from_anno('54b52170-fd7e-11eb-879f-931b9c6809d3');

position_from_anno
------------------
{44392,44483}

I’ll show how to use position_from_anno to find document hotspots in a later episode. The goal here is just to introduce an example, and to illustrate a few of the JSON functions and operators.

What’s most interesting, I think, is this part.

where selector->>'type' = 'TextPositionSelector'

Although the TextPositionSelector appears as the first element of the selectors array, that isn’t guaranteed. In a conventional language you’d have to walk through the array looking for it. SQL affords a declarative way to find an element in a JSON array.
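
The same declarative pattern works ad hoc, outside a function. Here’s a hedged example that finds annotations anchored by a TextQuoteSelector:

select id
from annotation,
  lateral jsonb_array_elements(target_selectors) as selector
where selector->>'type' = 'TextQuoteSelector'
limit 5;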


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

pl/python metaprogramming

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection.

Although this episode focuses on metaprogramming — by which I mean using Python to dynamically compose and run SQL queries — my favorite example combines all three aspects.

The context for the example is an analytics dashboard with a dozen panels, each driven by a pl/python function that’s parameterized by the id of a school or a course. So, for example, the Questions and Answers panel on the course dashboard is driven by a function, questions_and_answers_for_group(group_id), which wraps a SQL query that:

– calls another pl/python function, questions_for_group(group_id), to find notes in the group that contain question marks

– finds the replies to those notes

– builds a table that summarizes the question/answer pairs

Here’s the SQL wrapped by the questions_and_answers_for_group(group_id) function.

sql = f"""
  with questions as ( 
    select *
    from questions_for_group('{_group_id}')
    ),
  ids_and_refs as (
    select 
      id,
      unnest ("references") as ref
      from annotation
      where groupid = '{_group_id}'
  ),
  combined as (
    select
      q.*,
      array_agg(ir.id) as reply_ids
    from ids_and_refs ir
    inner join questions q on q.id = ir.ref
    group by q.id, q.url, q.title, q.questioner, q.question, q.quote
  ),
  unnested as (
    select
      c.url,
      c.title,
      c.quote,
      c.questioner,
      c.question,
      unnest(reply_ids) as reply_id
    from combined c
  )
  select distinct
    course_for_group('{_group_id}') as course,
    teacher_for_group('{_group_id}') as teacher,
    clean_url(u.url) as url,
    u.title,
    u.quote,
    u.questioner,
    (regexp_matches(u.question, '.+\?'))[1] as question,
    display_name_from_anno(u.reply_id) as answerer,
    text_from_anno(u.reply_id) as answer,
    app_host() || '/course/render_questions_and_answers/{_group_id}' as viewer
  from unnested u
  order by course, teacher, url, title, questioner, question
"""

This isn’t yet what I mean by pl/python metaprogramming. You could as easily wrap this SQL code in a pl/pgsql function. More easily, in fact, because in pl/pgsql you could just write _group_id instead of '{_group_id}'.

To get where we’re going, let’s zoom out and look at the whole questions_and_answers_for_group(group_id) function.

create function questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    <SEE ABOVE>
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

This still isn’t what I mean by metaprogramming. It introduces introspection — this is a pl/python function that discovers its own name and works with an eponymous materialized view — but that’s for a later episode.

It also introduces the use of Python modules by pl/python functions. A key thing to note here is that this is an example of what I call a memoizing function. When called it looks for a materialized view that captures the results of the SQL query shown above. If yes, it only needs to use a simple SELECT to return the cached result. If no, it calls memoize_view_name to run the underlying query and cache it in a materialized view that the next call to questions_and_answers_for_group(group_id) will use in a simple SELECT. Note that memoize_view_name is a special function that isn’t defined in Postgres using CREATE FUNCTION foo() like a normal pl/python function. Instead it’s defined using def foo() in a Python module called plpython_helpers. The functions there can do things — like create materialized views — that pl/python functions can’t. More about that in another episode.

The focus in this episode is metaprogramming, which is used in this example to roll up the results of multiple calls to questions_and_answers_for_group(group_id). That happens when the group_id refers to a course that has sections. If you’re teaching the course and you’ve got students in a dozen sections, you don’t want to look at a dozen dashboards; you’d much rather see everything on the primary course dashboard.

Here’s the function that does that consolidation.

create function consolidated_questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    get_caller_name,
    sql_for_consolidated_and_memoized_function_for_group
  )
  base_view_name = get_caller_name()
  sql = sql_for_consolidated_and_memoized_function_for_group(
    plpy, base_view_name, 'questions_and_answers_for_group', _group_id)
  sql += ' order by course, url, title, questioner, answerer'
  return plpy.execute(sql)
$$ language plpython3u;

This pl/python function not only memoizes its results as above, but also consolidates results for all sections of a course. The memoization happens here.

def sql_for_consolidated_and_memoized_function_for_group(plpy, 
    base_view_name, function, group_id):
  view_name = f'{base_view_name}_{group_id}'
  sql = f""" select exists_view('{view_name}') as exists """
  exists = row_zero_value_for_colname(plpy, sql, 'exists')
  if exists:
    sql = f""" select * from {view_name} """
  else:
    sql = consolidator_for_group_as_sql(plpy, group_id, function)
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return sql

The consolidation happens here, and this is finally what I think of as classical metaprogramming: using Python to compose SQL.

def consolidator_for_group_as_sql(plpy, _group_id, _function):
  sql = f"select type_for_group('{_group_id}') as type"
  type = row_zero_value_for_colname(plpy, sql, 'type')
  if type == 'section_group' or type == 'none':
    sql = f"select * from {_function}('{_group_id}')"
  if type == 'course_group' or type == 'course':
    sql = f"select has_sections('{_group_id}')"
    has_sections = row_zero_value_for_colname(plpy, sql, 'has_sections')
    if has_sections:
      sql = f"""
        select array_agg(group_id) as group_ids 
        from sections_for_course('{_group_id}')
      """
      group_ids = row_zero_value_for_colname(plpy, sql, 'group_ids')
      selects = [f"select * from {_function}('{_group_id}') "]
      for group_id in group_ids:
        selects.append(f"select * from {_function}('{group_id}')")
      sql = ' union '.join(selects)
    else:
      sql = f"select * from {_function}('{_group_id}')"
  return sql

If the inbound _group_id is p1mqaeep, the inbound _function is questions_and_answers_for_group, and the group has no sections, the SQL will just be select * from questions_and_answers_for_group('p1mqaeep').

If the group does have sections, then the SQL will instead look like this:

select * from questions_and_answers_for_group('p1mqaeep')
union 
select * from questions_and_answers_for_group('x7fe93ba')
union
select * from questions_and_answers_for_group('qz9a4b3d')

This is a very long-winded way of saying that pl/python is an effective way to compose and run arbitrarily complex SQL code. In theory you could do the same thing using pl/pgsql; in practice it would be insane to try. I’ve entangled the example with other aspects — modules, introspection — because that’s the real situation. pl/python’s maximal power emerges from the interplay of all three aspects. That said, it’s a fantastic way to extend Postgres with user-defined functions that compose and run SQL code.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

The Tao of Unicode Sparklines

I’ve long been enamored of the sparkline, a graphical device which its inventor Edward Tufte defines thusly:

A sparkline is a small intense, simple, word-sized graphic with typographic resolution. Sparklines mean that graphics are no longer cartoonish special occasions with captions and boxes, but rather sparkline graphics can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic.

Nowadays you can create sparklines in many tools including Excel and Google Sheets, both of which can use the technique to pack a summary of a series of numbers into a single cell. By stacking such cells vertically you can create views that compress vast amounts of information.

In A virtuous cycle for analytics I noted that we often use Metabase to display tables and charts based on extracts from our Postgres warehouse. I really wanted to use sparklines to summarize views of activity over time, but that isn’t yet an option in Metabase.

When Metabase is connected to Postgres, though, you can write Metabase questions that can not only call built-in Postgres functions but can also call user-defined functions. Can such a function accept an array of numbers and return a sparkline for display in the Metabase table viewer? Yes, if you use Unicode characters to represent the variable-height bars of a sparkline.

There’s a page at rosettacode.org devoted to Unicode sparklines based on this sequence of eight characters:

U+2581 LOWER ONE EIGHTH BLOCK
U+2582 LOWER ONE QUARTER BLOCK
U+2583 LOWER THREE EIGHTHS BLOCK
U+2584 LOWER HALF BLOCK
U+2585 LOWER FIVE EIGHTHS BLOCK
U+2586 LOWER THREE QUARTERS BLOCK
U+2587 LOWER SEVEN EIGHTHS BLOCK
U+2588 FULL BLOCK

Notice that U+2581, U+2582, and U+2588 are narrower than the rest. I’ll come back to that at the end.

If you combine them into a string of eight characters you get this result:

▁▂▃▄▅▆▇█

Notice that the fourth and eighth characters in the sequence drop below the baseline. I’ll come back to that at the end too.

These characters can be used to define eight buckets into which numbers in a series can be quantized. Here are some examples from the rosettacode.org page:

“1 2 3 4 5 6 7 8 7 6 5 4 3 2 1” -> ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
“1.5, 0.5 3.5, 2.5 5.5, 4.5 7.5, 6.5” -> ▂▁▄▃▆▅█▇
“0, 1, 19, 20” -> ▁▁██
“0, 999, 4000, 4999, 7000, 7999” -> ▁▁▅▅██

To write a Postgres function that would do this, I started with the Python example from rosettacode.org:

bar = '▁▂▃▄▅▆▇█'
barcount = len(bar)
 
def sparkline(numbers):
    mn, mx = min(numbers), max(numbers)
    extent = mx - mn
    sparkline = ''.join(bar[min([barcount - 1,
                                 int((n - mn) / extent * barcount)])]
                        for n in numbers)
    return mn, mx, sparkline

While testing it I happened to try an unchanging sequence, [3, 3, 3, 3], which fails with a divide-by-zero error. In order to address that, and to unpack the algorithm a bit for readability, I arrived at this Postgres function:

create function sparkline(numbers bigint[]) returns text as $$

    def bar_index(num, _min, barcount, extent):
        index = min([barcount - 1, int((num - _min) / extent * barcount)])
        return index

    bars = '\u2581\u2582\u2583\u2584\u2585\u2586\u2587\u2588'
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min 

    if extent == 0:  # avoid divide by zero if all numbers are equal
        extent = 1

    bar_count = len(bars)
    sparkline = ''
    for num in numbers:
        index = bar_index(num, _min, bar_count, extent)
        sparkline = sparkline + bars[index]

    return sparkline

$$ language plpython3u;

Here’s a psql invocation of the function:

analytics=# select sparkline(array[1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]);
    sparkline
-----------------
▁▂▃▄▅▆▇█▇▆▅▄▃▂▁
(1 row)

And here’s an example based on actual data:

Each row represents a university course in which students and teachers are annotating the course readings. Each bar represents a week’s worth of activity. Bar heights are not comparable from row to row: some courses do a lot of annotating and some not so much, so each sparkline reports relative variation from week to week, while the sum and weekly max columns report absolute numbers.

This visualization makes it easy to see that annotation was occasional in some courses and continuous in others. And when you scroll, the temporal axis comes alive; try scrolling this view to see what I mean.

We use the same mechanism at three different scales. One set of sparklines reports daily activity for students in courses; another rolls those up to weekly activity for courses at a school; still another rolls all those up to weekly activity for each school in the system.

At the level of individual courses, the per-student sparkline views can show patterns of interaction. In the left example here, vertical bands of activity reflect students annotating for particular assignments. In the right example there may be a trace of such temporal alignment but activity is less synchronized and more continuous.

When we’re working in Metabase we can use its handy mini bar charts to contextualize the row-wise sums.

The sparkline-like mini bar chart shows a row’s sum relative to the max for the column. Here we can see that a course with 3,758 notes has about 1/4 the number of notes as the most note-heavy course at the school.

Because these Unicode sparklines are just strings of text in columns of SQL or HTML tables, they can participate in sort operations. In our case we can sort on all columns including ones not shown here: instructor name, course start date, number of students. But the default is to sort by the sparkline column which, because it encodes time, orders courses by the start of annotation activity.
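
Here’s a minimal sketch of that kind of sort; the table and column names are hypothetical:

-- hypothetical table: one row per course, weekly_counts is a bigint[]
select
  course_name,
  sparkline(weekly_counts) as activity
from course_weekly_counts
order by activity;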

The visual effect is admittedly crude, but it’s a good way to show certain kinds of variation. And it’s nicely portable. A Unicode sparkline looks the same in a psql console, an HTML table, or a tweet. The function will work in any database that can run it, using Python or another of the languages demoed at rosettacode.org. For example, I revisited the Workbench workflow described in A beautiful power tool to scrape, clean, and combine data and added a tab for Lake levels.

When I did that, though, the effect was even cruder than what I’ve been seeing in my own work.

In our scenarios, with longer strings of characters, the differences average out and things align pretty well; the below-the-baseline effect has been annoying but not a deal breaker. But the width variation in this example does feel like a deal breaker.

What if we omit the problematic characters U+2581 (too narrow) and U+2584/U+2588 (below baseline and too narrow)?
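
Here’s a sketch of that variant (the name sparkline5 is mine, not from the post):

create function sparkline5(numbers bigint[]) returns text as $$

    bars = '\u2582\u2583\u2585\u2586\u2587'
    _min, _max = min(numbers), max(numbers)
    extent = _max - _min

    if extent == 0:  # avoid divide by zero if all numbers are equal
        extent = 1

    bar_count = len(bars)
    sparkline = ''
    for num in numbers:
        index = min(bar_count - 1, int((num - _min) / extent * bar_count))
        sparkline = sparkline + bars[index]

    return sparkline

$$ language plpython3u;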

There are only 5 buckets into which to quantize numbers, and their heights aren’t evenly distributed. But for the intended purpose — to show patterns of variation — I think it’s sufficient in this case. I tried swapping the 5-bucket method into the function that creates sparklines for our dashboards but I don’t think I’ll switch. The loss of vertical resolution makes our longer sparklines less useful, and the width variation is barely noticeable.

Unicode evolves, of course, so maybe there will someday be a sequence of characters that’s friendlier to sparklines. Maybe there already is? If so, please let me know; I’d love to use it.



Working with Postgres types

In episode 2 of this series I noted that the languages in which I’m writing Postgres functions share a common type system. It took me a while to understand how types work in the context of Postgres functions that can return sets of records and can interact with tables and materialized views.

Here is a set-returning function.

create function notes_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotation as $$
  begin
    return query 
      select * from annotation
      where userid = concat('acct:', _userid)
      and groupid = _groupid;
  end;
$$ language plpgsql;

In this case the type that governs the returned set has already been defined: it’s the schema for the annotation table.

        Column         |            Type
-----------------------+-----------------------------
 id                    | uuid
 created               | timestamp without time zone
 updated               | timestamp without time zone
 userid                | text
 groupid               | text
 text                  | text
 tags                  | text[]
 shared                | boolean
 target_uri            | text
 target_uri_normalized | text
 target_selectors      | jsonb
 references            | uuid[]
 extra                 | jsonb
 text_rendered         | text
 document_id           | integer
 deleted               | boolean

The function returns records matching a userid and groupid. I can now find the URLs of documents most recently annotated by me.

select
  target_uri
from notes_for_user_in_group('judell@hypothes.is', '__world__')
order by created desc
limit 3;

The Postgres response:

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 https://web.hypothes.is/help/formatting-annotations-with-markdown/
(3 rows)

You might wonder why the function’s parameters are prefixed with underscores. That’s because variables used in functions can conflict with names of columns in tables. Since none of our column names begin with underscore, it’s a handy differentiator. Suppose the function’s signature were instead:

create function notes_for_user_in_group(
  userid text, 
  groupid text)

Postgres would complain about a conflict:

ERROR:  column reference "userid" is ambiguous
LINE 2:       where userid = concat('acct:', userid)
                    ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

The table has userid and groupid columns that conflict with their eponymous variables. So for functions that combine variables and database values I prefix variable names with an underscore.

Set-returning functions can be called in any SQL SELECT context. In the example above that context is psql, Postgres’ powerful and multi-talented REPL (read-eval-print loop). For an example of a different context, let’s cache the function’s result set in a materialized view.

create materialized view public_notes_for_judell as (
  select
    *
  from notes_for_user_in_group('judell@hypothes.is', '__world__')
  order by created desc
) with data;

Postgres reports success by showing the new view’s record count.

SELECT 3972

The view’s type is implicitly annotation; its schema matches the one shown above; selecting target_uri from the view is equivalent to selecting target_uri from the setof annotation returned from the function notes_for_user_in_group.

select
  target_uri
from public_notes_for_judell
limit 3;

The Postgres response is the same as above.

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 https://web.hypothes.is/help/formatting-annotations-with-markdown/
(3 rows)

It shows up a lot faster though! Every time you select the function’s result set, the wrapped query has to run. For this particular example that can take a few seconds. It costs the same amount of time to create the view. But once that’s done you can select its contents in milliseconds.
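
The cached contents are frozen at creation time, though; to bring the view up to date you rebuild it:

refresh materialized view public_notes_for_judell;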

Now let’s define a function that refines notes_for_user_in_group by reporting the count of notes for each annotated document.

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns table (
  count bigint,
  userid text,
  groupid text,
  url text
) as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

Instead of returning a setof some named type, this function returns an anonymous table. I’ve aliased the set-returning function call notes_for_user_in_group as n and used the alias to qualify the names of selected columns. That avoids another naming conflict. If you write userid instead of n.userid in the body of the function and then call it, Postgres again complains about a conflict.

ERROR:  column reference "userid" is ambiguous
LINE 3:         userid,
                ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

Here’s a sample call to our new function.

select 
  *
from annotated_docs_for_user_in_group(
  'judell@hypothes.is',
  '__world__'
);

The result:

 count |       userid       |  groupid  | url
-------+--------------------+-----------+------------
   516 | judell@hypothes.is | __world__ | http://shakespeare.mit.edu/macbeth/full.html
    73 | judell@hypothes.is | __world__ | https://www.independent.co.uk/news/world/asia/india-floods-bangladesh-nepal-deaths-millions-homeless-latest-news-updates-a7919006.html
    51 | judell@hypothes.is | __world__ | https://www.usatoday.com/story/news/nation-now/2017/06/16/coconut-oil-isnt-healthy-its-never-been-healthy/402719001/

Now let’s create a view based on that function.

create materialized view url_counts_for_public_notes_by_judell as (
  select 
    *
  from annotated_docs_for_user_in_group(
    'judell@hypothes.is',
    '__world__'
  )
) with data;

Postgres says:

SELECT 1710

When you ask for the definition of that view using the \d command in psql:

\d url_counts_for_public_notes_by_judell

It responds with the same table definition used when creating the function.

 Column  |  Type  
---------+--------
 count   | bigint
 userid  | text
 groupid | text
 url     | text   

Behind the scenes Postgres has created this definition from the anonymous table returned by the function.

To revise the function so that it uses a named type, first create the type.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text
);

Postgres reports success:

CREATE TYPE

Now we can use that named type in the function. Since we’re redefining the function, first drop it.

drop function annotated_docs_for_user_in_group;

Uh oh. Postgres is unhappy about that.

ERROR:  cannot drop function annotated_docs_for_user_in_group(text,text) because other objects depend on it
DETAIL:  materialized view url_counts_for_public_notes_by_judell depends on function annotated_docs_for_user_in_group(text,text)
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

A view that depends on a function must be recreated when the function’s signature changes. I’ll say more about this in a future episode on set-returning functions that dynamically cache their results in materialized views. For now, since the view we just created is a contrived throwaway, just drop it along with the function by using CASCADE as Postgres recommends.

drop function annotated_docs_for_user_in_group cascade;

Postgres says:

NOTICE:  drop cascades to materialized view url_counts_for_public_notes_by_judell
DROP FUNCTION

Now we can recreate a version of the function that returns setof annotated_docs_for_user_in_group instead of an anonymous table(...).

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotated_docs_for_user_in_group as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

The results are the same as above. So why do it this way? In many cases I don’t. It’s extra overhead to declare a type. And just as a view can depend on a function, a function can depend on a type. To see why you might not want such dependencies, suppose we want to also track the most recent note for each URL.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text,
  most_recent_note timestamp
);

That won’t work.

ERROR:  type "annotated_docs_for_user_in_group" already exists

Dropping the type won’t work either.

ERROR:  cannot drop type annotated_docs_for_user_in_group because other objects depend on it                           
DETAIL:  function annotated_docs_for_user_in_group(text,text) depends on type annotated_docs_for_user_in_group
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

To redefine the type you have to do a cascading drop and then recreate functions that depend on the type. If any of those views depend on dropped functions, the drop cascades to them as well and they also must be recreated. That’s why I often write functions that return table(...) rather than setof TYPE. In dynamic languages it’s convenient to work with untyped bags of values; I find the same to be true when writing functions in Postgres.

Sometimes, though, it’s useful to declare and use types. In my experience so far it makes most sense to do that in Postgres when you find yourself writing the same returns table(...) statement in several related functions. Let’s say we want a function that combines the results of annotated_docs_for_user_in_group for some set of users.

create function annotated_docs_for_users_in_group(_userids text[], _groupid text)
  returns setof annotated_docs_for_user_in_group as $$
    begin
      return query 
        with userids as (
          select unnest(_userids) as userid
        )
        select 
          a.*
        from userids u
        join annotated_docs_for_user_in_group(u.userid, _groupid) a
          on a.userid = concat('acct:', u.userid);
    end;
$$ language plpgsql;

This new function uses the SQL WITH clause to create a common table expression (CTE) that converts an inbound array of userids into a transient table-like object, named userids, with one userid per row. The new function’s wrapped SQL then joins that CTE to the set returned from annotated_docs_for_user_in_group and returns the joined result.

(You can alternatively do this in a more procedural way by creating a loop variable and marching through the array to accumulate results. Early on I used that approach but in the context of Postgres functions I’ve come to prefer the more purely SQL-like set-oriented style.)
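
For comparison, here’s a sketch of that procedural style (the _loop suffix is mine):

create function annotated_docs_for_users_in_group_loop(
  _userids text[],
  _groupid text)
returns setof annotated_docs_for_user_in_group as $$
  declare
    _userid text;
  begin
    -- march through the array, accumulating results
    foreach _userid in array _userids loop
      return query
        select * from annotated_docs_for_user_in_group(_userid, _groupid);
    end loop;
  end;
$$ language plpgsql;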

Sharing a common type between the two functions makes them simpler to write and easier to read. More importantly it connects them to one another and to all views derived from them. If I do decide to add most_recent_note to the type, Postgres will require me to adjust all depending functions and views so things remain consistent. That can be a crucial guarantee, and as we’ll see in a future episode it’s a key enabler of an advanced caching mechanism.


pl/pgsql versus pl/python? Here’s why I’m using both to write Postgres functions.

In A virtuous cycle for analytics I noted that our library of Postgres functions is written in two languages: Postgres’ built-in pl/pgsql and the installable alternative pl/python. These share a common type system and can be used interchangeably.

Here’s a pl/pgsql classifier that tries to match the name of a course against a list of patterns that characterize the humanities.

create function humanities_classifier(course_name text) returns boolean as $$
  begin
    return
      lower(course_name) ~ any(array[
        'psych',
        'religio',
        'soci'
        ]);
  end;
$$ language plpgsql;

# select humanities_classifier('Religious Studies 101') as match;
match
-----
t

# select humanities_classifier('Comparative Religions 200') as match;
match
-----
t

Here is that same classifier in Python.

create function humanities_classifier(course_name text) returns boolean as $$
  sql = f"""
    select lower('{course_name}') ~ any(array[
      'psych',
      'religio',
      'soci'
      ]) as match"""
  results = plpy.execute(sql)
  return results[0]['match']
$$ language plpython3u;

# select humanities_classifier('Religious Studies 101') as match;
match
-----
t

# select humanities_classifier('Comparative Religions 200') as match;
match
-----
t

The results are exactly the same. In this case, Python is only wrapping the SQL used in the original function and interpolating course_name into it. So why use pl/python here? I wouldn’t. The pl/pgsql version is cleaner and simpler because the SQL body doesn’t need to be quoted and course_name doesn’t need to be interpolated into it.

Here’s a more Pythonic version of the classifier.

create function humanities_classifier(course_name text) returns boolean as $$
  import re
  regexes = [
    'psych',
    'religio',
    'soci'
  ]
  matches = [r for r in regexes if re.search(r, course_name, re.I)]
  return len(matches) > 0
$$ language plpython3u;

There’s no SQL here; this is pure Python. Is there any benefit to doing things this way? In this case probably not. The native Postgres idiom for matching a string against a list of regular expressions is cleaner and simpler than the Python technique shown here. A Python programmer will be more familiar with list comprehensions than with the Postgres any and ~ operators, but if you’re working in Postgres you’ll want to know about those, and use them not just in functions but in all SQL contexts.
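
For example, the same idiom works in a plain SQL query against lms_course_groups, the table used in the timing test below:

select name
from lms_course_groups
where lower(name) ~ any(array['psych', 'religio', 'soci']);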

What about performance? You might assume, as I did, that a pl/pgsql function is bound to be way faster than its pl/python equivalent. Let’s check that assumption. This SQL exercises both flavors of the function; each finds about 500 matches in a set of 30,000 names.

with matching_courses as (
  select humanities_classifier(name) as match
  from lms_course_groups
) 

select count(*) 
from matching_courses
where match;

Here are the results for three runs using each flavor of the function:

pl/pgsql:  159ms, 201ms, 125ms
pl/python: 290ms, 255ms, 300ms

The Python flavor is slower but not order-of-magnitude slower; I’ve seen cases where a pl/python function outperforms its pl/pgsql counterpart.

So, what is special about Python functions inside Postgres? In my experience so far there are three big reasons to use pl/python.

Python modules

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Here’s one of my favorite examples so far. Given a set of rows that count daily or weekly annotations for users in a group — so for weekly accounting each row has 52 columns — the desired result for the whole group is the element-wise sum of the rows. That’s not an easy thing in SQL but it’s trivial using numpy, and in pl/python it happens at database speed because there’s no need to transfer SQL results to an external Python program.
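
Here’s a sketch of that element-wise sum; the table and column names are hypothetical:

create function weekly_totals_for_group(_group_id text)
returns bigint[] as $$
  import numpy as np
  # hypothetical table: one row per user, weekly_counts is a bigint[]
  sql = f"select weekly_counts from user_weekly_counts where group_id = '{_group_id}'"
  rows = plpy.execute(sql)
  arrays = [np.array(r['weekly_counts']) for r in rows]
  if not arrays:
    return []
  return np.sum(arrays, axis=0).tolist()  # element-wise sum across rows
$$ language plpython3u;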

Metaprogramming

Functions can write and then run SQL queries. It’s overkill for simple variable interpolation; as shown above pl/pgsql does that handily without the cognitive overhead and visual clutter of poking values into a SQL string. For more advanced uses that compose queries from SQL fragments, though, pl/pgsql is hopeless. You can do that kind of thing far more easily, and more readably, in Python.

Introspection

A pl/python function can discover and use its own name. That’s the key enabler for a mechanism that memoizes the results of a function by creating a materialized view whose name combines the name of the function with the value of a parameter to the function. This technique has proven to be wildly effective.
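
The get_caller_name helper used earlier isn’t shown in this series; here’s a speculative sketch that relies on pl/python’s internal frame naming (__plpython_procedure_<name>_<oid>):

# plpython_helpers.py (sketch; assumes pl/python's internal frame naming)
import traceback

def get_caller_name():
  # the pl/python frame is named __plpython_procedure_<funcname>_<oid>
  frame_name = traceback.extract_stack()[-2].name
  return frame_name.replace('__plpython_procedure_', '').rsplit('_', 1)[0]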

I’ll show examples of these scenarios in later installments of this series. For now I just want to explain why I’ve found these two ways of writing Postgres functions to be usefully complementary. The key points are:

– They share a common type system.

– pl/pgsql, despite its crusty old syntax, suffices for many things.

– pl/python leverages Python’s strengths where they are most strategic.

When I began this journey it wasn’t clear when you’d prefer one over the other, or why it might make sense to use both in complementary ways. This installment is what I’d like to have known when I started.



A virtuous cycle for analytics

Suppose you’re a member of a team that runs a public web service. You need to help both internal and external users make sense of all the data that’s recorded as it runs. That’s been my role for the past few years; now it’s time to summarize what I’ve learned.

The web service featured in this case study is the Hypothesis web annotation system. The primary database, Postgres, stores information about users, groups, documents, courses, and annotations. Questions that our team needs to answer include:

– How many students created annotations last semester?

– In how many courses at each school?

Questions from instructors using Hypothesis in their courses include:

– Which passages in course readings are attracting highlights and discussion?

– Who is asking questions about those passages, and who is responding?

Early on we adopted a tool called Metabase that continues to be a pillar of our analytics system. When Metabase was hooked up to our Postgres database the team could start asking questions without leaning on developers. Some folks used the interactive query builder, while others went straight to writing SQL that Metabase passes through to Postgres.

Before long we had a large catalog of Metabase questions that query Postgres and display results as tables or charts that can be usefully arranged on Metabase dashboards. It’s all nicely RESTful. Interactive elements that can parameterize queries, like search boxes and date pickers, map to URLs. Queries can emit URLs in order to compose themselves with other queries. I came to see this system as a kind of lightweight application server in which to incubate an analytics capability that could later be expressed more richly.

Over time, and with growing amounts of data, early success with this approach gave way to two kinds of frustration: queries began to choke, and the catalog of Metabase questions became unmanageable. And so, in the time-honored tradition, we set up a data warehouse for analytics. Ours is another instance of Postgres that syncs nightly with the primary database. There are lots of other ways to skin the cat but it made sense to leverage ops experience with Postgres and I had a hunch that it would do well in this role.

To unthrottle the choking queries I began building materialized views that cache the results of Postgres queries. Suppose a query makes use of available indexes but still takes a few minutes, or maybe even an hour, to run. It still takes that long to build the corresponding materialized view, but once built other queries can use its results immediately. Metabase questions that formerly included chunks of SQL began reducing to select * from {viewname}.

This process continues to unfold in a highly productive way. Team members may or may not hit a performance wall as they try to use Metabase to answer their questions. When they do, we can convert the SQL text of a Metabase question to a Postgres materialized view that gets immediate results. Such views can join with others, and/or with underlying tables, in SQL SELECT contexts. The views become nouns in a language that expresses higher-order business concepts.

The verbs in this language turned out to be Postgres functions written in the native procedural language, pl/pgsql, and later also in its Python counterpart, pl/python. Either flavor can augment built-in Postgres library functions with user-defined functions that can return simple values, like numbers and strings, but can also return sets that behave in SQL SELECT contexts just like tables and views.

Functions were, at first, a way to reuse chunks of SQL that otherwise had to be duplicated across Metabase questions and Postgres CREATE MATERIALIZED VIEW statements. That made it possible to streamline and refactor both bodies of code and sanely maintain them.

To visualize what had now become a three-body system of sources in which Metabase questions, Postgres views, and Postgres functions can call (or link to) one another, I wrote a tool that builds a crosslinked concordance. That made it practical to reason effectively about the combined system.

Along the way I have learned how Postgres, and more broadly modern SQL, in conjunction with a tool like Metabase, can enable a team like ours to make sense of data. There’s plenty to say about the techniques I’ve evolved, and I aim to write them up over time. The details won’t interest most people, but here’s an outcome that might be noteworthy.

Team member: I had an idea that will help manage our communication with customers, and I’ve prototyped it in a Metabase question.

Toolsmith: Great! Here’s a Postgres function that encapsulates and refines your SQL. It’s fast enough for now, but if needed we can convert it into a materialized view. Now you can use that function in another Metabase question that projects your SQL across a set of customers that you can select.

That interaction forms the basis of a virtuous cycle: The team member formulates a question and does their best to answer it using Metabase; the toolsmith captures the intent and re-expresses it in a higher-level business language; the expanded language enables the team member to go farther in a next cycle.

We recognize this software pattern in the way application programmers who push a system to its limits induce systems programmers to respond with APIs that expand those limits. I suppose it’s harder to see when the application environment is Metabase and the systems environment is Postgres. But it’s the same pattern, and it is powerful.

