Working with Postgres types

In episode 2 of this series I noted that the languages in which I’m writing Postgres functions share a common type system. It took me a while to understand how types work in the context of Postgres functions that can return sets of records and can interact with tables and materialized views.

Here is a set-returning function.

create function notes_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotation as $$
  begin
    return query 
      select * from annotation
      where userid = concat('acct:', _userid)
      and groupid = _groupid;
  end;
$$ language plpgsql;

In this case the type that governs the returned set has already been defined: it’s the schema for the annotation table.

Column	Type
id	uuid
created	timestamp without time zone
updated	timestamp without time zone
userid	text
groupid	text
text	text
tags	text[]
shared	boolean
target_uri	text
target_uri_normalized	text
target_selectors	jsonb
references	uuid[]
extra	jsonb
text_rendered	text
document_id	integer
deleted	boolean

The function returns records matching a userid and groupid. I can now find the URLs of documents most recently annotated by me.

select
  target_uri
from notes_for_user_in_group('judell@hypothes.is', '__world__')
order by created desc
limit 3;

The Postgres response:

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 Formatting Annotations with Markdown

 http://example.com                          
(3 rows)

You might wonder why the function’s parameters are prefixed with underscores. That’s because variables used in functions can conflict with names of columns in tables. Since none of our column names begin with underscore, it’s a handy differentiator. Suppose the function’s signature were instead:

create function notes_for_user_in_group(
  userid text, 
  groupid text)

Postgres would complain about a confict:

ERROR:  column reference "userid" is ambiguous
LINE 2:       where userid = concat('acct:', userid)
                    ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

The table has userid and groupid columns that conflict with their eponymous variables. So for functions that combine variables and database values I prefix variable names with underscore.

Set-returning functions can be called in any SQL SELECT context. In the example above that context is psql, Postgres’ powerful and multi-talented REPL (read-eval-print loop). For an example of a different context, let’s cache the function’s result set in a materialized view.

create materialized view public_notes_for_judell as (
  select
    *
  from notes_for_user_in_group('judell@hypothes.is', '__world__')
  order by created desc
) with data;

Postgres reports success by showing the new view’s record count.

SELECT 3972

The view’s type is implicitly annotation; its schema matches the one shown above; selecting target_uri from the view is equivalent to selecting target_uri from the setof annotation returned from the function notes_for_user_in_group.

select
  target_uri
from public_notes_for_judell
limit 3;

The Postgres response is the same as above.

                 target_uri                  
---------------------------------------------
 https://news.ycombinator.com/item?id=20020501
 https://www.infoworld.com/article/2886828/github-for-the-rest-of-us.html
 Formatting Annotations with Markdown

 http://example.com                          
(3 rows)

It shows up a lot faster though! Every time you select the function’s result set, the wrapped query has to run. For this particular example that can take a few seconds. It costs the same amount of time to create the view. But once that’s done you can select its contents in milliseconds.

Now let’s define a function that refines notes_for_user_in_group by reporting the count of notes for each annotated document.

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns table (
  count bigint,
  userid text,
  groupid text,
  url text
) as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

Instead of returning a setof some named type, this function returns an anonymous table. I’ve aliased the set-returning function call notes_for_user_in_group as n and used the alias to qualify the names of selected columns. That avoids another naming conflict. If you write userid instead of n.userid in the body of the function and then call it, Postgres again complains about a conflict.

ERROR:  column reference "userid" is ambiguous
LINE 3:         userid,
                ^
DETAIL:  It could refer to either a PL/pgSQL variable or a table column.

Here’s a sample call to our new function..

select 
  *
from annotated_docs_for_user_in_group(
  'judell',
  'hypothes.is',
  '__world__'
);

The result:

 count | userid             | groupid   | target_uri
-------+---------------------------------------------
   516 | judell@hypothes.is | __world__ | http://shakespeare.mit.edu/macbeth/full.html
    73 | judell@hypothes.is | __world__ | https://www.independent.co.uk/news/world/asia/india-floods-bangladesh-nepal-deaths-millions-homeless-latest-news-updates-a7919006.html
    51 | judell@hypothes.is | __world__ | https://www.usatoday.com/story/news/nation-now/2017/06/16/coconut-oil-isnt-healthy-its-never-been-healthy/402719001/

Now let’s create a view based on that function.

create materialized view url_counts_for_public_notes_by_judell as (
  select 
    *
  from annotated_docs_for_user_in_group(
    'judell@hypothes.is',
    '__world__'
  )
) with data;

Postgres says:

SELECT 1710

When you ask for the definition of that view using the \d command in psql:

\d url_counts_for_public_notes_by_judell

It responds with the same table definition used when creating the function.

 Column  |  Type  
---------+--------
 count   | bigint
 userid  | text
 groupid | text
 url     | text

Behind the scenes Postgres has created this definition from the anonymous table returned by the function.

To revise the function so that it uses a named type, first create the type.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text
);

Postgres reports success:

CREATE TYPE

Now we can use that named type in the function. Since we’re redefining the function, first drop it.

drop function annotated_docs_for_user_in_group;

Uh oh. Postgres is unhappy about that.

ERROR:  cannot drop function annotated_docs_for_user_in_group(text,text) because other objects depend on it
DETAIL:  materialized view url_counts_for_public_notes_by_judell depends on function annotated_docs_for_user_in_group(text,text)
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

A view that depends on a function must be recreated when the function’s signature changes. I’ll say more about this in a future episode on set-returning functions that dynamically cache their results in materialized views. For now, since the view we just created is a contrived throwaway, just drop it along with the function by using CASCADE as Postgres recommends.

drop function annotated_docs_for_user_in_group cascade;

Postgres says:

NOTICE:  drop cascades to materialized view url_counts_for_public_notes_by_judell
DROP FUNCTION

Now we can recreate a version of the function that returns setof annotated_docs_for_user_in_group instead of an anonymous table(...)

create function annotated_docs_for_user_in_group(
  _userid text, 
  _groupid text)
returns setof annotated_docs_for_user_in_group as $$
  begin
    return query
      select
        count(n.*) as anno_count,
        n.userid,
        n.groupid,
        n.target_uri
      from notes_for_user_in_group(_userid, _groupid) n
      group by n.userid, n.groupid, n.target_uri                          
      order by anno_count desc;
  end;
$$ language plpgsql;

The results are the same as above. So why do it this way? In many cases I don’t. It’s extra overhead to declare a type. And just as a view can depend on a function, a function can depend on a type. To see why you might not want such dependencies, suppose we want to also track the most recent note for each URL.

create type annotated_docs_for_user_in_group as (
  count bigint,
  userid text,
  groupid text,
  url text,
  most_recent_note timestamp
);

That won’t work.

ERROR:  type "annotated_docs_for_user_in_group" already exists

Dropping the type won’t work either.

ERROR:  cannot drop type annotated_docs_for_user_in_group because other objects depend on it                           
DETAIL:  function annotated_docs_for_user_in_group(text,text,text) depends on type annotated_docs_for_user_in_group    
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

To redefine the type you have to do a cascading drop and then recreate functions that depend on the type. If any of those views depend on dropped functions, the drop cascades to them as well and they also must be recreated. That’s why I often write functions that return table(...) rather than setof TYPE. In dynamic languages it’s convenient to work with untyped bags of values; I find the same to be true when writing functions in Postgres.

Sometimes, though, it’s useful to declare and use types. In my experience so far it makes most sense to do that in Postgres when you find yourself writing the same returns table(...) statement in several related functions. Let’s say we want a function that combines the results of annotated_docs_for_user_in_group for some set of users.

create function annotated_docs_for_users_in_group(_userids text[], _groupid text)
  returns setof annotated_docs_for_user_in_group as $$
    begin
      return query 
        with userids as (
          select unnest(_userids) as userid
        )
        select 
          a.*
        from userids u
        join annotated_docs_for_user_in_group(u.userid, _groupid) a
          on a.userid = concat('acct:', u.userid);
    end;
$$ language plpgsql;

This new function uses the SQL WITH clause to create a common table expression (CTE) that converts an inbound array of userids into a transient table-like object, named userids, with one userid per row. The new function’s wrapped SQL then joins that CTE to the set returned from annotated_docs_for_user_in_group and returns the joined result.

(You can alternatively do this in a more procedural way by creating a loop variable and marching through the array to accumulate results. Early on I used that approach but in the context of Postgres functions I’ve come to prefer the more purely SQL-like set-oriented style.)

Sharing a common type between the two functions makes them simpler to write and easier to read. More importantly it connects them to one another and to all views derived from them. If I do decide to add most_recent_note to the type, Postgres will require me to adjust all depending functions and views so things remain consistent. That can be a crucial guarantee, and as we’ll see in a future episode it’s a key enabler of an advanced caching mechanism.

Working with Postgres types

Like this:

Published by Jon Udell

4 thoughts on “Working with Postgres types”

Leave a ReplyCancel reply

Share this:

Like this:

Published by Jon Udell

4 thoughts on “Working with Postgres types”

Leave a ReplyCancel reply

Discover more from Jon Udell