Iterative Large Updates
Among the many strong features of a relational database is the capability to adjust existing of records
UPDATE employees SET hired='2017-07-07' WHERE name='Kevin';
Even at a small scale there are some dangers. The
WHERE
clause is
not mandatory
so it helps to try it in a transaction first. And you can see the results as
well
BEGIN; UPDATE employees SET hired='2017-07-07' WHERE name='Kevin' RETURNING employees.*; ROLLBACK;
Once content with the results change the last line to
COMMIT
.
Suspend Side-Effects
If you are attempting to modify thousands or millions of rows, update triggers need to be accounted for. At the very least these triggers will cause bulk updates to run slowly. Fortunately it is possible to disarm these by setting
SET session_replication_role = replica;
A word of caution: this also disables system triggers such as foreign key constraints. You also need to be running as a user that includes the role Superuser.
Make a List
Freezing a list of records that should be targeted for an update allows you to run careful analysis on a group before and after a change is made
CREATE MATERIALIZED VIEW update_candidates AS ( SELECT id FROM employees WHERE hired IS NULL );
This allso allows one to run a series of updates without as much worry that one update will skew the results of subsequent queries.
Running Iterations
Rarely is it a good practive to run an update across five or ten million rows
in a single transaction. This leaves far too much room for human error. One
solution to this is to run an update on a small sample of records.
psql(1)
Allows you to set variables in an SQL script using
\set var value
or on the command line using the
-v
flag.
psql hrdb -v mod_value=1 -f /tmp/update.sql
With this value in place we can proceed with a trial that bins records into approximately 100 groups
BEGIN; UPDATE products SET in_stock='f' WHERE id % 100 = :mod_value AND in_stock IS NULL; ROLLBACK;
Once we're content with this we can run all 100 iterations
for v in {0..99} do echo "pass $v of 99" psql hrdb -v mod_value=$v -f /tmp/update.sql sleep 10 done
As this runs the administrator has the opportunity to see the new data going live progressively! If this change becomes suspect, the current update can be canceled with Ctrl-C and easily resumed by changing the starting number.
Range Updates
If you have a very large number of records, it may be nessesary to step through them in ranges. This has the advantage of being a very fast query for a btree, and limits the chunk size of each update.
WHERE id >= :min AND id < :max
The corresponding shell script might look like this
let min=0 while true do max=$((min+100)) echo "$min-$max" psql hrdb -v min=$min -v max=$max -f /tmp/update.sql sleep 10 min=$max done
If you are running a hot standby that is actively queried, keeping each update small is important to prevent lag.