Postgres and JSON: Finding document hotspots (part 1)

One of the compelling aspects of modern SQL is the JSON support built into modern engines, including Postgres. The documentation is well done, but I need examples to motivate my understanding of where and how and why to use such a capability. The one I’ll use in this episode is something I call document hotspots.

Suppose a teacher has asked her students to annotate Arthur Miller’s The Crucible. How can she find the most heavily-annotated passages? They’re visible in the Hypothesis client, of course, but may be sparsely distributed. She can scroll through the 154-page PDF document to find the hotspots, but it will be helpful to see a report that brings them together. Let’s do that.

The Hypothesis system stores annotations using a blend of SQL and JSON datatypes. Consider this sample annotation:

When the Hypothesis client creates that annotation it sends a JSON payload to the server. Likewise, when the client subsequently requests the annotation in order to anchor it to the document, it receives a similar JSON payload.

{
  "id": "VLUhcP1-EeuHn5MbnGgJ0w",
  "created": "2021-08-15T04:07:39.343275+00:00",
  "updated": "2021-08-15T04:07:39.343275+00:00",
  "user": "acct:judell@hypothes.is",
  "uri": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
  "text": "\"He is no one's favorite clergyman.\"  :-)\n\nhttps://www.thoughtco.com/crucible-character-study-reverend-parris-2713521",
  "tags": [],
  "group": "__world__",
  "permissions": {
    "read": [
      "group:__world__"
    ],
    "admin": [
      "acct:judell@hypothes.is"
    ],
    "update": [
      "acct:judell@hypothes.is"
    ],
    "delete": [
      "acct:judell@hypothes.is"
    ]
  },
  "target": [
    {
      "source": "https://ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
      "selector": [
        {
          "end": 44483,
          "type": "TextPositionSelector",
          "start": 44392
        },
        {
          "type": "TextQuoteSelector",
          "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.",
          "prefix": " sixty-six pound, Mr. Proctor! I",
          "suffix": " Giles: Aye, and well instructed"
        }
      ]
    }
  ],
  "document": {
    "title": [
      "The%20Crucible%20full%20text.pdf"
    ]
  },
  "links": {
    "html": "https://hypothes.is/a/VLUhcP1-EeuHn5MbnGgJ0w",
    "incontext": "https://hyp.is/VLUhcP1-EeuHn5MbnGgJ0w/ia800209.us.archive.org/17/items/TheCrucibleFullText/The%20Crucible%20full%20text.pdf",
    "json": "https://hypothes.is/api/annotations/VLUhcP1-EeuHn5MbnGgJ0w"
  },
  "user_info": {
    "display_name": "Jon Udell"
  },
  "flagged": false,
  "hidden": false
}

The server mostly shreds this JSON into conventional SQL types. The tags array, for example, is hoisted out of the JSON into a SQL array-of-text. The expression to find its length is a conventional Postgres idiom: array_length(tags,1). Note the second parameter; array_length(tags) is an error, because Postgres arrays can be multidimensional. In this case there’s only one dimension but it’s still necessary to specify that.

A target_selectors column, though, is retained as JSON. These selectors define how an annotation anchors to a target selection in a document. Because selectors are used only by the Hypothesis client, which creates and consumes them in JSON format, there’s no reason to shred them into separate columns. In normal operation, selectors don’t need to be related to core tables. They can live in the database as opaque blobs of JSON.

For some analytic queries, though, it is necessary to peer into those blobs and relate their contents to core tables. There’s a parallel set of functions for working with JSON. For example, the target_selectors column corresponds to the target[0]['selector'] array in the JSON representation. The expression to find the length of that array is jsonb_array_length(target_selectors).

Here’s a similar expression that won’t work: json_array_length(target_selectors). Postgres complains that the function doesn’t exist.

ERROR: function json_array_length(jsonb) does not exist 
Hint: No function matches the given name and argument types.

In fact both functions, json_array_length and jsonb_array_length, exist. But Postgres knows the target_selectors column is of type jsonb, not json which is the correct type for the json_array_length function. What’s the difference between json and jsonb?

The json and jsonb data types accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/12/datatype-json.html

Although I tend to use JSON to refer to data in a variety of contexts, the flavor of JSON in the Postgres queries, views, and functions I’ll discuss will always be jsonb. The input conversion overhead isn’t a problem for analytics work that happens in a data warehouse, and the indexing support is a tremendous enabler.

To illustrate some of the operators common to json and jsonb, here is a query that captures the target_selectors column from the sample annotation.

with example as (
  select 
    id,
    target_selectors as selectors
  from annotation
  where id = '54b52170-fd7e-11eb-879f-931b9c6809d3'
  )
  select * from example;

Here are some other queries against example

select selectors from example;

[{"end": 44483, "type": "TextPositionSelector", "start": 44392}, { ... } ]

The result is a human-readable representation, but the type of selectors is jsonb.

select pg_typeof(selectors) from example;

jsonb

The array-indexing operator, ->, can yield the zeroth element of the array.

select selectors->0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result is again a human-readable representation of a jsonb type.

select pg_typeof(selectors->0) from example;

jsonb

Another array-indexing operator, ->>, can also yield the zeroth element of the array, but now as type text.

select selectors->>0 from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}

The result looks the same, but the type is different.

select pg_typeof(selectors->>0) from example;

text

The -> and ->> operators can also index objects by their keys. These examples work with the object that is the zeroth element of the array.

select selectors->0->'type' from example;

"TextPositionSelector"

select pg_typeof(selectors->0->'type') from example;

jsonb

select selectors->0->>'type' from example;

TextPositionSelector

select pg_typeof(selectors->0->>'type') from example;

text

The Hypothesis system stores the location of a target (i.e, the selection in a document to which an annotation refers) in the target_selectors column we’ve been exploring. It records selectors. TextQuoteSelector represents the selection as the exact highlighted text bracketed by snippets of context. TextPositionSelector represents it as a pair of numbers that mark the beginning and end of the selection. When one range formed by that numeric pair is equal to another, it means two students have annotated the same selection. When a range contains another range, it means one student annotated the containing range, and another student made an overlapping annotation on the contained range. We can use these facts to surface hotspots where annotations overlap exactly or in nested fashion.

To start, let’s have a function to extract the start/end range from an annotation. In a conventional programming language you might iterate through the selectors in the target_selectors array looking for the one with the type TextPositionSelector. That’s possible in pl/pgsql and pl/python, but Postgres affords a more SQL-oriented approach. Given a JSON array, the function jsonb_array_elements returns a table-like object with rows corresponding to array elements.

select jsonb_array_elements(selectors) from example;

{"end": 44483, "type": "TextPositionSelector", "start": 44392}
{"type": "TextQuoteSelector", "exact": " am not some preaching farmer with a book under my arm; I am a graduate of Harvard College.", "prefix": " sixty-six pound, Mr. Proctor! I", "suffix": " Giles: Aye, and well instructed"}

A function can convert the array to rows, select the row of interest, select the start and end values from the row, package the pair of numbers as an array, and return the array.

create function position_from_anno(_id uuid) returns numeric[] as $$
  declare range numeric[];
  begin
    with selectors as (
      select jsonb_array_elements(target_selectors) as selector
      from annotation
      where id = _id
    ),
    position as (
      select
        selector->>'start'::numeric as startpos,
        selector->>'end'::numeric as endpos
      from selectors
      where selector->>'type' = 'TextPositionSelector'
    )
    select array[p.startpos, p.endpos] 
    from position p
    into range;
    return range;
  end;  
$$ language plpgsql;

Using it for the sample annotation:

select position_from_anno('54b52170-fd7e-11eb-879f-931b9c6809d3')

position_from_anno
------------------
{44392,44483}

I’ll show how to use position_from_anno to find document hotspots in a later episode. The goal here is just to introduce an example, and to illustrate a few of the JSON functions and operators.

What’s most interesting, I think, is this part.

where selector->>'type' = 'TextPositionSelector'

Although the TextPositionSelector appears as the first element of the selectors array, that isn’t guaranteed. In a conventional language you’d have to walk through the array looking for it. SQL affords a declarative way to find an element in a JSON array.


1 https://blog.jonudell.net/2021/07/21/a-virtuous-cycle-for-analytics/
2 https://blog.jonudell.net/2021/07/24/pl-pgsql-versus-pl-python-heres-why-im-using-both-to-write-postgres-functions/
3 https://blog.jonudell.net/2021/07/27/working-with-postgres-types/
4 https://blog.jonudell.net/2021/08/05/the-tao-of-unicode-sparklines/
5 https://blog.jonudell.net/2021/08/13/pl-python-metaprogramming/
6 https://blog.jonudell.net/2021/08/15/postgres-and-json-finding-document-hotspots-part-1/
7 https://blog.jonudell.net/2021/08/19/postgres-set-returning-functions-that-self-memoize-as-materialized-views/
8 https://blog.jonudell.net/2021/08/21/postgres-functional-style/
9 https://blog.jonudell.net/2021/08/26/working-in-a-hybrid-metabase-postgres-code-base/
10 https://blog.jonudell.net/2021/08/28/working-with-interdependent-postgres-functions-and-materialized-views/
11 https://blog.jonudell.net/2021/09/05/metabase-as-a-lightweight-app-server/
12 https://blog.jonudell.net/2021/09/07/the-postgres-repl/

2 thoughts on “Postgres and JSON: Finding document hotspots (part 1)

  1. Have you noticed any performance difference between modeling something as jsonb vs native sql types?

  2. Not that matters for what I’m doing. You can index jsonb, though, for example

    create index ix_annotation_anno_anno_id on public.annotation using btree (((anno ->> ‘id’::text)));

Leave a Reply