Postgres set-returning functions that self-memoize as materialized views

In episode 2 I mentioned three aspects of pl/python that are reasons to use it instead of pl/pgsql: access to Python modules, metaprogramming, and introspection. In episode 5 I discussed metaprogramming, by which I mean using pl/python to compose and run SQL code. This episode features introspection, by which I mean taking advantage of Python’s inspect module to enable a pl/python function to discover its own name.

Why do that? In this context, so that the function can create a materialized view by joining its own name with the value of its first parameter. Here’s the example from episode 5.

questions_and_answers_for_group(_group_id text)
  returns setof question_and_answer_for_group as $$
  from plpython_helpers import (
    exists_group_view,
    get_caller_name,
    memoize_view_name
  )
  base_view_name = get_caller_name()
  view_name = f'{base_view_name}_{_group_id}'
  if exists_group_view(plpy, view_name):
    sql = f""" select * from {view_name} """
  else:
    sql = f"""
    -- SQL THAT RETURNS A SETOF QUESTION_AND_ANSWER_FOR_GROUP
    """
    memoize_view_name(sql, view_name)
    sql = f""" select * from {view_name} """
  return plpy.execute(sql)
$$ language plpython3u;

The function drives a panel on the course dashboard. An initial call to, say, questions_and_answers_for_group('P1mQaEEp'), creates the materialized view questions_and_answers_for_group_p1mqaeep and returns SELECT * from the view. Subsequent calls skip creating the view and just return SELECT * from it.

Note that the even though the group name is mixed case; the view name created by Postgres is all lowercase. For example:

create materialized view test_AbC as (select 'ok') with data;
SELECT 1

\d test_AbC
Materialized view "public.test_abc"

I want to think of this as a form of capability injection, but it’s really more like a capability wrapper. The capability is memoization. A function endowed with it can run a SQL query and cache the resulting rows in a materialized view before returning them to a SQL SELECT context. The wrapper is boilerplate code that discovers the function’s name, checks for the existence of a corresponding view, and if it isn’t found, calls memoize_view_name(sql, view_name) to run an arbitrary chunk of SQL code whose result set matches the function’s type. So in short: this pattern wraps memoization around a set-returning pl/python function.

As noted in episode 5, although memoize_view_name is called from pl/python, it is not itself a pl/python function. It’s a normal Python function in a module that’s accessible to the instance of Python that the Postgres pl/python extension is using. In my case that module is just a few small functions in file called plpython_helpers.py, installed (cough, copied) to user postgres‘s ~/.local/lib/python3.8/site-packages/plpython_helpers.py.

So far, there are only two critical functions in that module: get_caller_name() and memoize_view_name.

Here is get_caller_name().

import re
base_view_name = inspect.stack()[1][3].replace('__plpython_procedure_', '')
return re.sub(r'_\d+$', '', base_view_name)

The internal name for a pl/python function created by CREATE FUNCTION foo() looks like __plpython_procedure_foo_981048462. What get_caller_name() returns is just foo.

Here’s memoize_view_name().

def memoize_view_name(sql, view_name):
  sql = sql.replace('\n', ' ')
  encoded_bytes = base64.b64encode(sql.encode('utf-8'))
  encoded_str = str(encoded_bytes, 'utf-8')
  cmd = f"""psql -d h_analytics -c "call memoizer('{encoded_str}', '{view_name}')" """
  result = os.system(cmd)
  print(f'memoize_view_name: {cmd} result: {result}')

Given a chunk of SQL and the name of a view, it converts newlines to spaces, base64-encodes the query text, and invokes psql to call a procedure, memoizer, that does the work of running the SQL query and creating the materialized view from those results. So for example the function that yields sparkline data for a school might look like sparkline_data_for_school('af513ee'), and produce the view sparkline_data_for_school_af513ee.

Why shell out to psql here? It may not be necessary, there may be a way to manage the transaction directly within the function, but if so I haven’t found it. I’m very far from being an expert on transaction semantics and will appreciate guidance here if anyone cares to offer it. Meanwhile, this technique seems to work well. memoizer is a Postgres procedure, not a function. Although “stored procedures” is the term that I’ve always associated with in-database programming, I went pretty far down this path using only CREATE FUNCTION, never CREATE PROCEDURE. When I eventually went there I found the distinction between functions and procedures to be a bit slippery. This StackOverflow answer matches what I’ve observed.

PostgreSQL 11 added stored procedures as a new schema object. You can create a new procedure by using the CREATE PROCEDURE statement.

Stored procedures differ from functions in the following ways:

Stored procedures do not have to return anything, and only return a single row when using INOUT parameters.

You can commit and rollback transactions inside stored procedures, but not in functions.

You execute a stored procedure using the CALL statement rather than a SELECT statement.

Unlike functions, procedures cannot be nested in other DDL commands (SELECT, INSERT, UPDATE, DELETE).

Here is the memoizer procedure. It happens to be written in pl/python but could as easily have been written in pl/pgsql using the built-in Postgres decode function. Procedures, like functions, can be written in either language (or others) and share the common Postgres type system.

create procedure memoizer(_sql text, _view_name text) as $$
  import base64
  decoded_bytes = base64.b64decode(_sql)
  decoded_str = str(decoded_bytes, 'utf-8')
  create = f"""
    create materialized view if not exists {_view_name} as (
      {decoded_str}
    ) with data;
    """
  plpy.execute(create)
  permit = f"""
    grant select on {_view_name} to analytics;
  """
  plpy.execute(permit)
$$ language plpython3u;

There’s no plpy.commit() here because psql takes care of that automatically. Eventually I wrote other procedures, some of which do their own committing, but that isn’t needed here.

Of course it’s only possible to shell out to psql from a function because pl/python is an “untrusted” language extension. Recall from episode 1:

The ability to wield any of Python’s built-in or loadable modules inside Postgres brings great power. That entails great responsibility, as the Python extension is “untrusted” (that’s the ‘u’ in ‘plpython3u’) and can do anything Python can do on the host system: read and write files, make network requests.

Using Python’s os.system() to invoke psql is another of those superpowers. It’s not something I do lightly, and if there’s a better/safer way I’m all ears.

Meanwhile, this approach is delivering much value. We have two main dashboards, each of which displays a dozen or so panels. The school dashboard reports on annotation activity across all courses at a school. The course dashboard reports on the documents, and selections within documents, that instructors and students are discussing in the course’s annotation layer. Each panel that appears on the school or course dashboard is the output of a memoized function that is parameterized by a school or course id.

The data warehouse runs on a 24-hour cycle. Within that cycle, the first call to a memoized function takes just as long as it takes to run the SQL wrapped by the function. The cached view only comes into play when the function is called again during the same cycle. That can happen in a few different ways.

– A user reloads a dashboard, or a second user loads it.

– A panel expands or refines the results of another panel. For example, questions_and_answers_for_group() provides a foundation for a family of related functions including:

– questions_asked_by_teacher_answered_by_student()

– questions_asked_by_student_answered_by_teacher()

– questions_asked_by_student_answered_by_student()

– A scheduled job invokes a function in order to cache its results before any user asks for them. For example, the time required to cache panels for school dashboards varies a lot. For schools with many active courses it can take minutes to run those queries, so preemptive memoization matters a lot. For schools with fewer active courses it’s OK to memoize on the fly. This method enables flexible cache policy. Across schools we can decide how many of the most-active ones to cache. Within a school, we can decide which courses to cache, e.g. most recent, or most active. The mechanism to display a dashboard panel is always the same function call. The caching done in support of that function is highly configurable.

Caches, of course, must be purged. Since these materialized views depend on core tables it was enough, early on, to do this for views depending on the annotation table.

drop table annotation cascade;

At a certain point, with a growing number of views built during each cycle, the cascade failed.

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.

That wasn’t the answer. Instead we switched to enumerating views and dropping them individually. Again that afforded great flexibility. We can scan the names in the pg_matviews system table and match all the memoized views, or just those for a subset of schools, or just particular panels on school or course dashboards. Policies that govern the purging of cached views can be as flexible as those that govern their creation.

Postgres set-returning functions that self-memoize as materialized views

Like this:

Published by Jon Udell

2 thoughts on “Postgres set-returning functions that self-memoize as materialized views”

Leave a ReplyCancel reply

Share this:

Like this:

Published by Jon Udell

2 thoughts on “Postgres set-returning functions that self-memoize as materialized views”

Leave a ReplyCancel reply

Discover more from Jon Udell