Eric Radman : a Journal

Stateful Filtering

Part of the brilliance of SQL is that is declarative, however there are instances where imperative programming is necessary, and for this PostgeSQL supports embedded languages.

=# \dx pl*
                         List of installed extensions
    Name    | Version |   Schema   |                Description
------------+---------+------------+-------------------------------------------
 plpgsql    | 1.0     | pg_catalog | PL/pgSQL procedural language
 plpython3u | 1.0     | pg_catalog | PL/Python3U untrusted procedural language
(2 rows)

PL/pgSQL is somewhat awkward to use and more limited than Python, but it as a trusted language it poses a no security risk. (In PostgreSQL untrusted languages need to be activated by a member of the superuser role.)

Filtering a Table with PL/pgSQL

Sample data:

CREATE TABLE file_scan (
  owner text,
  path text
);
COPY file_scan (owner, path) FROM STDIN;
eradman Photos/2021
eradman Photos/2022
eradman Photos/2022/archive
eradman Photos/2022/new
eradman Videos
nobody  Videos/.temp
\.

If we wish to collapse this list of directories to a common path we might try comparing rows using a window function such as lag() or using a recursive query, but both of these are difficult because state needs to be carried across multiple rows.

CREATE FUNCTION summarize_scan()
  RETURNS SETOF file_scan
AS $$
DECLARE
  rec record;
  last_path text;
BEGIN
  FOR rec IN (SELECT * FROM file_scan ORDER BY path COLLATE "C")
  LOOP
    CONTINUE WHEN left(rec.path, length(last_path)) = last_path;
    last_path = rec.path + '/';
    RETURN NEXT rec;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT * FROM summarize_scan();

The result is a set of common prefixes

  owner  |    path
---------+-------------
 eradman | Photos/2021
 eradman | Photos/2022
 eradman | Videos
(3 rows)

Appending / to the last pathname is used to ensure that this function returns complete pathnames.

These techniques work equally well when an intermediate query result and functions that operate on this data is written to the pg_temp schema. In this way the data and the function only lives as long as the client session.

Filtering a Table with PL/Python

For privileged database users, the plpy module provides an alternate approach to solving problems that require keeping state.

CREATE FUNCTION summarize_scan_py()
  RETURNS SETOF file_scan
AS $$
import os

last_path = "/"
rows = []

for row in plpy.execute(""" SELECT * FROM file_scan ORDER BY path COLLATE "C" """):
    curpath = row['path']
    if os.path.commonpath([last_path, curpath]) == last_path:
        continue
    last_path = curpath
    rows.append(row)

return rows
$$ LANGUAGE plpython3u;

Results

  owner  |    path
---------+-------------
 eradman | Photos/2021
 eradman | Photos/2022
 eradman | Videos
(3 rows)

This pattern is useful whenever there is a need to tap into the larger feature set and libraries available to Python. Previous rows may be updated since the intermediate result is a Python list, however a single list will not work well for a very large result set.