A virtuous cycle for analytics

Suppose you’re a member of a team that runs a public web service. You need to help both internal and external users make sense of all the data that’s recorded as the service runs. That’s been my role for the past few years; now it’s time to summarize what I’ve learned.

The web service featured in this case study is the Hypothesis web annotation system. The primary database, Postgres, stores information about users, groups, documents, courses, and annotations. Questions that our team needs to answer include:

– How many students created annotations last semester?

– In how many courses at each school?

Questions from instructors using Hypothesis in their courses include:

– Which passages in course readings are attracting highlights and discussion?

– Who is asking questions about those passages, and who is responding?

Early on we adopted a tool called Metabase that continues to be a pillar of our analytics system. When Metabase was hooked up to our Postgres database, the team could start asking questions without leaning on developers. Some folks used the interactive query builder; others went straight to writing SQL that Metabase passes through to Postgres.

Before long we had a large catalog of Metabase questions that query Postgres and display results as tables or charts that can be usefully arranged on Metabase dashboards. It’s all nicely RESTful. Interactive elements that can parameterize queries, like search boxes and date pickers, map to URLs. Queries can emit URLs in order to compose themselves with other queries. I came to see this system as a kind of lightweight application server in which to incubate an analytics capability that could later be expressed more richly.

Over time, and with growing amounts of data, early success with this approach gave way to two kinds of frustration: queries began to choke, and the catalog of Metabase questions became unmanageable. And so, in the time-honored tradition, we set up a data warehouse for analytics. Ours is another instance of Postgres that syncs nightly with the primary database. There are lots of other ways to skin the cat but it made sense to leverage ops experience with Postgres and I had a hunch that it would do well in this role.

To unthrottle the choking queries I began building materialized views that cache the results of Postgres queries. Suppose a query makes use of available indexes but still takes a few minutes, or maybe even an hour, to run. It still takes that long to build the corresponding materialized view, but once built, other queries can use its results immediately. Metabase questions that formerly included chunks of SQL began reducing to select * from {viewname}.
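Here’s a minimal sketch of the pattern. The table and view names (annotation, course, course_annotation_counts) are hypothetical stand-ins for our real schema:

```sql
-- Cache a slow per-course rollup as a materialized view.
-- (Illustrative names; not our actual schema.)
create materialized view course_annotation_counts as
  select c.id   as course_id,
         c.name as course_name,
         count(a.id) as annotation_count
  from course c
  join annotation a on a.course_id = c.id
  group by c.id, c.name;

-- Rebuild the cache when fresh data arrives, e.g. after the nightly sync.
refresh materialized view course_annotation_counts;

-- A Metabase question that formerly embedded the rollup SQL now reduces to:
select * from course_annotation_counts;
```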

This process continues to unfold in a highly productive way. Team members may or may not hit a performance wall as they try to use Metabase to answer their questions. When they do, we can convert the SQL text of a Metabase question into a Postgres materialized view that delivers results immediately. Such views can join with one another, and with underlying tables, in SQL SELECT contexts. The views become nouns in a language that expresses higher-order business concepts.
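Composing views works just as it would with tables. An illustrative query, assuming the hypothetical course_annotation_counts view above plus a second hypothetical view, school_courses, that maps courses to schools:

```sql
-- Views behave like tables in SELECT contexts, so they join freely.
select s.school_name,
       count(distinct cac.course_id) as active_courses
from course_annotation_counts cac
join school_courses s on s.course_id = cac.course_id
where cac.annotation_count > 0
group by s.school_name;
```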

The verbs in this language turned out to be Postgres functions written in the native procedural language, pl/pgsql, and later also in its Python counterpart, pl/python. Either flavor can augment built-in Postgres library functions with user-defined functions that can return simple values, like numbers and strings, but can also return sets that behave in SQL SELECT contexts just like tables and views.
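A sketch of what such a set-returning function can look like in pl/pgsql; the annotation table and its columns are hypothetical:

```sql
-- A set-returning function: callable in a SELECT just like a table or view.
create function annotations_in_range(start_date date, end_date date)
  returns table(user_id int, annotation_count bigint)
as $$
begin
  return query
    select a.user_id, count(*)
    from annotation a
    where a.created between start_date and end_date
    group by a.user_id;
end;
$$ language plpgsql;

-- Used like a view, but parameterized:
select * from annotations_in_range('2021-01-01', '2021-06-30');
```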

Functions were, at first, a way to reuse chunks of SQL that otherwise had to be duplicated across Metabase questions and Postgres CREATE MATERIALIZED VIEW statements. That made it possible to streamline and refactor both bodies of code and sanely maintain them.

To visualize what had now become a three-body system of sources in which Metabase questions, Postgres views, and Postgres functions can call (or link to) one another, I wrote a tool that builds a crosslinked concordance. That made it practical to reason effectively about the combined system.

Along the way I have learned how Postgres, and more broadly modern SQL, in conjunction with a tool like Metabase, can enable a team like ours to make sense of data. There’s plenty to say about the techniques I’ve evolved, and I aim to write them up over time. The details won’t interest most people, but here’s an outcome that might be noteworthy.

Team member: I had an idea that will help manage our communication with customers, and I’ve prototyped it in a Metabase question.

Toolsmith: Great! Here’s a Postgres function that encapsulates and refines your SQL. It’s fast enough for now, but if needed we can convert it into a materialized view. Now you can use that function in another Metabase question that projects your SQL across a set of customers that you can select.
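The “project your SQL across a set of customers” move can be sketched with a lateral join, assuming a hypothetical set-returning function customer_report and customer table:

```sql
-- Apply the encapsulated query once per selected customer.
select c.name, r.*
from customer c
cross join lateral customer_report(c.id) r
where c.id in (1, 2, 3);
```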

That interaction forms the basis of a virtuous cycle: The team member formulates a question and does their best to answer it using Metabase; the toolsmith captures the intent and re-expresses it in a higher-level business language; the expanded language enables the team member to go farther in a next cycle.

We recognize this software pattern in the way application programmers who push a system to its limits induce systems programmers to respond with APIs that expand those limits. I suppose it’s harder to see when the application environment is Metabase and the systems environment is Postgres. But it’s the same pattern, and it is powerful.

