When your database is an HTTP client

Here are three things I once hoped — but no longer expect — to outlive:

1. PDF files

2. passwords

3. The occasional need to scrape data from web pages

PDF files and passwords are topics for another day, but web scraping is timely. Today I was asked to corral data from the Steampipe blog, and the path of least resistance was (of course!) to extract it from the site.

I was all warmed up for the exercise because we’d just published a post dramatically entitled “Why build an HTTP client into a database? So you can ingest web data directly!” In that post I show three solutions enabled by the Net plugin’s net_http_request table.

Since the dawn of the web, scraping has worked this way: Use a script to fetch the data, then save it for retrieval and analysis. You might use the script language to query the data, or the query language of a database.

A couple of years ago I found a way to unify those ingredients: Run the script inside the database. You can do a lot with Postgres’ built-in procedural language, and even more if you activate Python inside Postgres. I went deeply into both and explained why in an earlier episode.

PL/Python was great for advanced uses, and I used it for a bit of web scraping too. It all worked fine, and I never thought to ask this question: “What if the database is the HTTP client, and SQL the common way to reason over data coming through that pipe?”

The examples in the post show what that’s like. In its simplest form you write a query like this.

select
    response_status_code,
    jsonb_pretty(response_headers) as headers
  from
    net_http_request
  where
    url = 'https://steampipe.io'

The result is like this.

+----------------------+-------------------------------------------------------+
| response_status_code | headers                                               |
+----------------------+-------------------------------------------------------+
| 200                  | {                                                     |
|                      |     "Age": [                                          |
|                      |         "45557"                                       |
|                      |     ],                                                |
|                      |     "Date": [                                         |
|                      |         "Fri, 09 Dec 2022 06:46:40 GMT"               |
|                      |     ],                                                |
|                      |     "Etag": [                                         |
|                      |         "W/\"614a142998557b388e053bfa4408cf70\""      |

The response_status_code is a regular Postgres column, the headers column is a JSONB column that you can index into (e.g. headers ->> 'etag'). If you also select the response_body column of the net_http_request table you’ll get another regular Postgres column containing the text of the web page. If it’s HTML text, you can use regular expressions to match patterns in it. If it’s JSON text you can use Postgres’ JSON functions to query and transform it.

You can join fields nested in the JSON with other Postgres columns. And those other columns can belong to tables populated in any of the ways Steampipe plugins populate tables: from JSON or GraphQL API endpoints, from CSV or Terraform or Yaml files, from anywhere really. As a developer writing Steampipe queries (and flowing results into dashboards) you see all of these sources as tables, you query them individually in all the ways Postgres can, and you join across diverse sources in a common way.

Of course web pages are structured in ways that regular expressions can’t easily grok. It’s easy to match links, but parsing HTML tag names and attributes is a job for a real parser. I’d made a start on an HTML plugin for Steampipe. There were already two tables: one to extract links from a web page, one to transform HTML tables to CSV format. So today, when tasked with tabulating blog metadata, I added a third table to enable these queries.

-- find the title

select
  page,
  tag_name,
  tag_content
from
  html_tag
where
  page = 'https://steampipe.io/blog/selective-select'
  and tag_name = 'title'

-- list the meta tags

select
  page,
  tag_name,
  tag_attrs
from
  html_tag
where
  page = 'https://steampipe.io/blog/selective-select'
  and tag_name = 'meta'

That’s a powerful way to reason over HTML data! It was easy for me to extend the HTML plugin in this way, and I assure you that I’m no 10x programmer. The Steampipe plugin SDK and the wonderful goquery package are doing all the heavy lifting. I just had to stitch the components together, and if you’re any kind of programmer, with or without Go experience, you could pretty readily do the same.

Posted in .

3 thoughts on “When your database is an HTTP client

Leave a Reply