Stateful Filtering
Part of the brilliance of SQL is that is declarative, however there are instances where imperative programming is necessary, and for this PostgeSQL supports embedded languages.
=# \dx pl* List of installed extensions Name | Version | Schema | Description ------------+---------+------------+------------------------------------------- plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language plpython3u | 1.0 | pg_catalog | PL/Python3U untrusted procedural language (2 rows)
PL/pgSQL
is somewhat awkward to use and more limited than Python, but it as a
trusted
language it poses a no security risk. (In PostgreSQL
untrusted
languages need to be activated by a member of the
superuser
role.)
Filtering a Table with PL/pgSQL
Sample data:
CREATE TABLE file_scan ( owner text, path text );
COPY file_scan (owner, path) FROM STDIN; eradman Photos/2021 eradman Photos/2022 eradman Photos/2022/archive eradman Photos/2022/new eradman Videos nobody Videos/.temp \.
If we wish to collapse this list of directories to a common path we might try
comparing rows using a
window function
such as
lag()
or using a
recursive query,
but both of these are difficult because state needs to be carried across multiple rows.
CREATE FUNCTION summarize_scan() RETURNS SETOF file_scan AS $$ DECLARE rec record; last_path text; BEGIN FOR rec IN (SELECT * FROM file_scan ORDER BY path COLLATE "C") LOOP CONTINUE WHEN left(rec.path, length(last_path)) = last_path; last_path = rec.path + '/'; RETURN NEXT rec; END LOOP; END; $$ LANGUAGE plpgsql;
SELECT * FROM summarize_scan();
The result is a set of common prefixes
owner | path ---------+------------- eradman | Photos/2021 eradman | Photos/2022 eradman | Videos (3 rows)
Appending
/
to the last pathname is used to ensure that this function returns complete
pathnames.
These techniques work equally well when an intermediate query result and
functions that operate on this data is written to the
pg_temp
schema. In this way the data and the function only lives as long as the
client session.
Filtering a Table with PL/Python
For privileged database users, the plpy module provides an alternate approach to solving problems that require keeping state.
CREATE FUNCTION summarize_scan_py() RETURNS SETOF file_scan AS $$
import os last_path = "/" rows = [] for row in plpy.execute(""" SELECT * FROM file_scan ORDER BY path COLLATE "C" """): curpath = row['path'] if os.path.commonpath([last_path, curpath]) == last_path: continue last_path = curpath rows.append(row) return rows
$$ LANGUAGE plpython3u;
Results
owner | path ---------+------------- eradman | Photos/2021 eradman | Photos/2022 eradman | Videos (3 rows)
This pattern is useful whenever there is a need to tap into the larger feature set and libraries available to Python. Previous rows may be updated since the intermediate result is a Python list, however a single list will not work well for a very large result set.