Eric Radman : a Journal

Streaming Backups

Sometimes you take up a task because it needs to be done even though no one will notice. Setting up backups with real-time updates and point-in-time recovery may go unnoticed for a period, but it is not a thankless task.

The first time something very bad happens you will have a good story to tell. This story includes the fact that we have proper backups, but it also includes the changes and updates that you make every day to enable applications to evolve and adapt. Daily changes are best made when conditions are calm.

Mistakes will happen, and when they do a full database replica with streaming updates provides the means of recovering. Not living in terror of data loss means you can get a good night's rest, and write better software during the day. If this software can be moved into production without too much fuss you will definitely be thanked.

Making a Copy

In times past rsync (along with invocations of pg_start_backup()/pg_stop_backup()) was the most effective way of bringing up a new standby server, but these days pg_basebackup is usually safer and more elegant. It operates by connecting as a client, so we need to configure the database server as a master first by amending postgresql.conf

# postgresql.conf
wal_level = hot_standby        # spelled 'replica' as of 9.6
max_wal_senders = 4            # concurrent replication connections allowed
wal_keep_segments = 100        # WAL segments retained for lagging replicas

The value of wal_keep_segments instructs the server to keep at least this number of WAL segments. This will allow a replica to catch up if it is disconnected. For the purpose of a backup it also allows us to sync up after the initial clone.
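With the default 16 MB segment size, 100 segments reserve roughly 1.6 GB of disk. Note that wal_level and max_wal_senders only take effect after a restart; once the server is back up the live values can be double-checked from psql:

psql -U postgres -c "SHOW wal_level"
psql -U postgres -c "SHOW wal_keep_segments"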

After a restart we can create a copy of the master using

pg_basebackup -U postgres -D backup

Hopefully you are taking this one step further to create an off-site replica. We can use the --max-rate option to rate-limit the stream over a WAN:

mkdir -m 0700 backup
cd backup
pg_basebackup -h primary -U postgres -D - -Ft -X fetch -z -vP \
       --max-rate=10M | tar -xzf -

The options supplied to pg_basebackup will stream the backup to STDOUT in tar format while displaying periodic progress updates on STDERR.

transaction log start point: 0/5000024 on timeline 1
42286/42286 kB (100%), 1/1 tablespace
transaction log end point: 0/50000DC
pg_basebackup: base backup completed

Note that everything is copied from the master, including configuration files and log files. Accessing the new backup is now as easy as starting a server on a copy of the cloned data directory

cp -R backup backup_copy
pg_ctl -D backup_copy start

Tailing pg_log/* should indicate that the server was able to reach a consistent state

LOG:  creating missing WAL directory "pg_wal/archive_status"
LOG:  redo starts at 36/750000C8
LOG:  consistent recovery state reached at 36/75751FC8
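
Once a consistent state is reached the copy can be queried like any other cluster. A trivial smoke test (assuming the clone is listening on the default port) might be

psql -h localhost -U postgres -c "SELECT count(*) FROM pg_database"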

Point-in-Time Recovery

Now that we have a clone, pg_receivewal can be used to stream transaction logs from the primary

pg_receivewal -h primary -U postgres -D backup/pg_wal
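
One caveat worth knowing: pg_receivewal writes the segment it is currently receiving with a .partial suffix, and recovery will not read it until the suffix is removed. If a restore needs the very latest records the file can be renamed by hand (segment name illustrative):

mv backup/pg_wal/000000010000000500000043.partial \
   backup/pg_wal/000000010000000500000043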

Running a replica will usually protect you from a hardware failure, but it will not protect you from operator error or a faulty application. Point-in-time recovery enables you to replay transactions up to a specific point in time. It is activated by defining recovery_target_time in recovery.conf; Postgres will then replay each transaction log until the specified timestamp is reached

# recovery.conf
restore_command = 'cp /pg_backup/pg_wal/%f "%p"'   # required even without an archive
recovery_target_time = '2015-06-29 08:00:00'

The restore_command is used to copy files from a transaction log archive. The examples above are not using the archive feature, but this command is still required. When the server is started the restore_command is executed and fails (because /pg_backup/pg_wal/ does not exist), so the WAL already present in pg_wal is used instead.

pg_ctl -D backup start

Postgres will record that it is running a point-in-time recovery to the specified time

LOG:  starting point-in-time recovery to 2015-06-29 08:00:00-04
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 36/756E53B8

Once the restore succeeds recovery.conf is renamed to recovery.done and the server starts up in master mode with a new timeline.
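
The new timeline is easy to confirm with pg_controldata, which reports the latest checkpoint's TimeLineID:

pg_controldata backup | grep -i timeline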

Establishing a Delayed Replica

Storing WAL using pg_receivewal works well for keeping transaction logs, but there is no way to apply these regularly, so a fresh full backup needs to be taken periodically. Starting a server in-place on the backup will replay the WAL up to date, but a new timeline will be established during this process, which will prevent us from receiving further WAL updates.

[I think] the most elegant solution to this is to run "live backups" using a delayed replica.

# recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary user=postgres'
recovery_min_apply_delay = '5d'
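
The effect of the delay can be observed on the replica itself; run on the standby, this query reports how far behind the last applied transaction is:

SELECT now() - pg_last_xact_replay_timestamp() AS apply_lag;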

Since these servers are mostly idling in between applying updates, a bunch of them can be run on a single host using different port numbers; my preference, though, is to stand them up in limited-footprint VMs in a way nearly identical to every other standby.
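
As a sketch of the single-host variant (directory name and port are illustrative), a second delayed standby only needs its own data directory and listen port:

pg_basebackup -h primary -U postgres -D standby_5d
# add the recovery.conf shown above, then start on an alternate port
pg_ctl -D standby_5d -o "-p 5433" start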

As of PostgreSQL 10, standby servers with an apply delay do not reconnect in a timely manner. A patch is available which ensures that delayed replicas resume streaming.

Verifying Delayed Replica Status

Before we can rest assured that we have a usable copy of a database on another server or at another site there is one final task: verifying that our standby servers are up to date. We can do this by comparing the log sequence number flushed to disk on the master with the corresponding record that should have been received on each standby.

This is the method installed by default on every standby and delayed standby managed by postgresql-declaration. pg_verify works by leveraging PostgreSQL's arithmetic for the pg_lsn type.

localharvest-writer=> SELECT pg_current_wal_flush_lsn();
localharvest-query=> SELECT '5/43C9D298' - pg_last_wal_receive_lsn() AS lsn_delta;

Alert if the standby server is more than some number of bytes behind the master.
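
A minimal check along these lines (host names and the 16 MB threshold are illustrative) can be scripted with psql:

#!/bin/sh
# compare the LSN flushed on the master with the LSN received on the
# standby and complain when the gap exceeds 16 MB
current=$(psql -At -h primary -U postgres \
    -c "SELECT pg_current_wal_flush_lsn()")
delta=$(psql -At -h standby -U postgres \
    -c "SELECT '$current'::pg_lsn - pg_last_wal_receive_lsn()")
[ "$delta" -gt 16777216 ] && echo "standby is $delta bytes behind"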

Last updated on January 22, 2018