Iterative Large Updates

Among the many strong features of a relational database is the capability to adjust existing of records

UPDATE employees SET hired='2017-07-07' WHERE name='Kevin';

Even at a small scale there are some dangers. The WHERE clause is not mandatory so it helps to try it in a transaction first. And you can see the results as well

BEGIN;
UPDATE employees SET hired='2017-07-07' WHERE name='Kevin'
RETURNING employees.*;
ROLLBACK;

Once content with the results change the last line to COMMIT.

Suspend Side-Effects

If you are attempting to modify thousands or millions of rows, update triggers need to be accounted for. At the very least these triggers will cause bulk updates to run slowly. Fortunately it is possible to disarm these by setting

SET session_replication_role = replica;

A word of caution: this also disables system triggers such as foreign key constraints. You also need to be running as a user that includes the role Superuser.

Make a List

Freezing a list of records that should be targeted for an update allows you to run careful analysis on a group before and after a change is made

CREATE MATERIALIZED VIEW update_candidates AS (
  SELECT id
  FROM employees
  WHERE hired IS NULL
);

This allso allows one to run a series of updates without as much worry that one update will skew the results of subsequent queries.

Running Iterations

Rarely is it a good practive to run an update across five or ten million rows in a single transaction. This leaves far too much room for human error. One solution to this is to run an update on a small sample of records. psql(1) Allows you to set variables in an SQL script using \set var value or on the command line using the -v flag.

psql hrdb -v mod_value=1 -f /tmp/update.sql

With this value in place we can proceed with a trial that bins records into approximately 100 groups

BEGIN;
UPDATE products
SET in_stock='f'
WHERE id % 100 = :mod_value
AND in_stock IS NULL;
ROLLBACK;

Once we're content with this we can run all 100 iterations

for v in {0..99}
do
  echo "pass $v of 99"
  psql hrdb -v mod_value=$v -f /tmp/update.sql
  sleep 10
done

As this runs the administrator has the opportunity to see the new data going live progressively! If this change becomes suspect, the current update can be canceled with Ctrl-C and easily resumed by changing the starting number.

Range Updates

If you have a very large number of records, it may be nessesary to step through them in ranges. This has the advantage of being a very fast query for a btree, and limits the chunk size of each update.

WHERE id >= :min AND id < :max

The corresponding shell script might look like this

let min=0
while true
do
  max=$((min+100))
  echo "$min-$max"
  psql hrdb -v min=$min -v max=$max -f /tmp/update.sql
  sleep 10
  min=$max
done

If you are running a hot standby that is actively queried, keeping each update small is important to prevent lag.

Eric Radman : a Journal

Iterative Large Updates

Suspend Side-Effects

Make a List

Running Iterations

Range Updates