Eric Radman : a Journal

Managing PostgreSQL Logical Replication

SQL databases provide a critical service by giving the application a consistent view of data, but logical replication makes it possible to trade in the guarantee that a query always reads the most recent write for other features. Instead of global behavior, logical replication allows a relational database to target particular features for distributed processing.

I may be mistaken, but I believe the ability to replicate a subset of data between databases that are otherwise unrelated has become a unique feature of relational databases.

Unlike statement-based replication, logical replication is PostgreSQL is based on the WAL, and enables strict guarantees. It's also a pluggable architecture. This document will cover the plugin that ships with PostgreSQL, which is called pgoutput.

Server Configuration

There are three pieces required to get logical replication working:

  1. A user with the replication attribute
  2. Set wal_level to logical
  3. A pg_hba.conf for the connecting host

PostgreSQL will not allow a subscription to be created until the upstream server meets these requirements.

Selecting Tables for Export

CREATE PUBLICATION localharvest_pub FOR TABLE farmer, product;

The psql(1) commands for some operations are not obvious, but you can get help with \?.

localharvest=> \dRp+
                            List of publications
       Name       |    Owner     | All tables | Inserts | Updates | Deletes
------------------+--------------+------------+---------+---------+---------
 localharvest_sub | localharvest | f          | t       | t       | t
(1 row)
Tables:
    "public.farmer"
    "public.product"

List subscriptions using \dRs. or

SELECT * FROM pg_stat_subscription;

Establishing a Connection

CREATE SUBSCRIPTION localharvest_sub
  CONNECTION 'host=db1 dbname=localharvest user=localharvest password=XXXXXX'
  PUBLICATION localharvest_pub;

Add WITH ( copy_data=false ) if the table already contains a snapshot of the data.

Tables must have a primary key, otherwise, you'll see this unhelpful error

ERROR:  relation "public.farmer" does not exist

Also, you can only connect to a master

ERROR:  could not create replication slot "localharvest_sub": ERROR:  logical decoding cannot be used while in recovery

Once a subscription is created a corresponding replication slot is created on the master

localharvest=# \x
localharvest=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-----------------
slot_name           | localharvest_sub
plugin              | pgoutput
slot_type           | logical
datoid              | 16478
database            | localharvest
temporary           | f
active              | t
active_pid          | 41867
xmin                |
catalog_xmin        | 606
restart_lsn         | 0/16C0280
confirmed_flush_lsn | 0/16C02B8

Triggers

Postgres will run triggers on a replicated table, if you set it to respond to remote updates

ALTER TABLE farmer ENABLE REPLICA TRIGGER;

Mirroring Table Definitions

The synchronization of table contents is handled automatically, providing you have created the tables in both places.

# publisher
pg_dump -s -t farmer -t product > replicated_tables.sql
# subscriber
psql -f replicated_tables.sql

If a column is added in one database, replication will be blocked but PostgreSQL will retry until the schema change has been applied to all of the subscribers as well.

Promoting a New Master (publisher)

If you promote a standby to a master and change the DNS name pointing to it the logical replication launcher will resolve the new name. There is one manual step however; physical replication does not copy replication slots, so in the case of failover we need to create a slot manually

SELECT pg_create_logical_replication_slot('localharvest_sub', 'pgoutput');

Shortly after this slot appears the subscriber will receive missing updates.

localharvest=# select * from pg_stat_subscription ;
-[ RECORD 1 ]---------+------------------------------
subid                 | 35039
subname               | localharvest_sub
pid                   | 74670
relid                 |
received_lsn          | 0/7001078
last_msg_send_time    | 2020-01-10 11:57:03.890308-05
last_msg_receipt_time | 2020-01-10 11:57:03.87312-05
latest_end_lsn        | 0/7001078
latest_end_time       | 2020-01-10 11:57:03.890308-05

Promoting a New Master (subscriber)

While a subscription is defined and is only active on the master, the standby servers have the configuration

localharvest=# SELECT * FROM pg_stat_subscription;
-[ RECORD 1 ]---------+-----------------
subid                 | 32817
subname               | localharvest_sub
pid                   |
relid                 |
received_lsn          |
last_msg_send_time    |
last_msg_receipt_time |
latest_end_lsn        |
latest_end_time       |

I see how this design was intended to help, but this is a insidious hazard: promoting a standby can grab and use the replication slot you intended to be used for another master!

Handling Discrepancies

Any time a subscription has been dropped and subsequently recreated there is the opportunity for missed updates. To find the difference between two we need to run a query to compare both sides

CREATE TEMPORARY TABLE missing_product_ids AS
  SELECT product_id
  FROM   product
  EXCEPT ALL
  SELECT product_id
  FROM   product_copy;

Then use this information to build a table containing the full contents of the missing records

CREATE TABLE missing AS
  SELECT product_copy.* FROM product_copy
  JOIN missing_product_ids USING (product_id);

Finally, dump the contents and restore to the subscriber

pg_dump -O $src_url -t missing > missing.sql
psql $dst_url -f missing.sql
INSERT INTO product SELECT * FROM missing;
DROP TABLE missing;

If you can accept some downtime, a simpler solution may be to drop the subscription; truncate tables, and create the subscription again.

Disconnecting Subscribers

If communication is functioning, all you have to do to unhook a subscriber is run DROP SUBSCRIPTION, and the corresponding replication slot will be removed from the master. If the subscriber cannot communicate with the master the slot must be removed manually

SELECT * FROM pg_replication_slots;
SELECT pg_drop_replication_slot('localharvest_sub');

To drop a subscription that no longer has a replication slot upstream

ALTER SUBSCRIPTION localharvest_sub DISABLE ;
ALTER SUBSCRIPTION localharvest_sub SET (slot_name=NONE);
DROP SUBSCRIPTION localharvest_sub ;

Watching Replication Slots

Since logical replication requires a replication slot, a wedged subscriber can prevent a master from vacuuming old WAL. To keep an eye on this I run a cron job that alerts if any of the replication slots are behind.keep an eye on this I run a cron job that alerts if any of the replication slots are behind.

#!/bin/sh
# Run as the postgres user

CHECK_SQL=" \
SELECT slot_name \
FROM pg_replication_slots \
WHERE pg_current_wal_flush_lsn() - confirmed_flush_lsn > 500 \
OR confirmed_flush_lsn IS NULL \
"

for slot in $(psql -At -c "$CHECK_SQL")
do
  echo "Log Sequence Numbers have diverged for a replication slot"
  echo
  psql <<-SQL
    SELECT slot_name, slot_type, database, temporary, active, restart_lsn,
      confirmed_flush_lsn,
      pg_current_wal_flush_lsn() - confirmed_flush_lsn AS lsn_delta
    FROM pg_replication_slots
    WHERE slot_name='${slot}'
  SQL
done

Diagnosing Jammed Replication

The easiest problem to debug is a key violation: if an entry with a unique index already exists Postgres will log an error.

Other failures, such as an exception in a trigger function may not produce an error. Instead the clue is that the logical replication worker repeatedly dies

LOG:  logical replication apply worker for subscription "localharvest_sub" has started
LOG:  background worker "logical replication worker" (PID 240271) exited with exit code 1

UPDATE and DELETE are silently discarded if a given record does not exist, so you can manually work around a problem by removing records and re-adding them on the subscriber.

Every condition will be need a customized approach, but in general, the steps are to transfer the offending/missing records to the subscriber using pg_dump and psql and apply them on the remote side manually. By updating records manually you have the opportunity to observe the client side error.

The subscriber may have extra columns, but it must have all of the same columns as the publisher. If you drop a required column before an update is pushed you will see an error like this

logical replication target relation "public.farmer" is missing some replicated columns

You can add the missing columns or (if possible) drop and recreate the tables. Then synchronize the data using

ALTER SUBSCRIPTION localharvest_sub REFRESH PUBLICATION;

Refreshing the publication will also copy the data for tables that were added the publication upstream.

Reconciling Tables

One of the persistent issues of managing logical replication is to find which rows exist in one table but not the other. By importing a snapshot of a table created with SELECT ... INTO we can discover missing records. There is more than one way to do this, one form is

SELECT * INTO TABLE farmers_missing
FROM farmers_copy
WHERE NOT EXISTS (
   SELECT
   FROM  farmers
   WHERE anton_job_id = farmers_copy.anton_job_id
);

Now we have a delta that we can use to fill in the missing records.

INSERT INTO farmers SELECT * FROM farmers_missing;

DROP TABLE farmers_copy;
DROP TABLE farmers_missing;

Multi-Master

Can we configure multi-master using PostgreSQL built-in replication? The answer is no:

  1. Built-in logical replication does not yet provide a means of implementing conflict resolution.
  2. No way native mechanism to control propagation of updates or avoid cycles. pglogical provides both row_filters and forward_origins.

Last updated on April 24, 2020