Eric Radman : a Journal

Streaming Backups

Sometimes you take up a task because it needs to be done even though no one will notice. Setting up backups with real-time updates and point-in-time recovery may go unnoticed for a period, but it is not a thankless task.

The first time something very bad happens you will have a good story to tell. This story includes the fact that you have proper backups, but it also includes the changes and updates you make every day to enable applications to evolve and adapt. Daily changes are best made when conditions are calm.

Mistakes will happen, and when they do a full database replica with streaming updates provides the means of recovery. Not living in terror of data loss means you can get a good night's rest and write better software during the day. If this software can be moved into production without too much fuss, you will definitely be thanked.

Making a Copy

In times past rsync (bracketed with invocations of pg_start_backup() and pg_stop_backup()) was the most effective way of bringing up a new standby server, but these days pg_basebackup is usually safer and more elegant. It operates by connecting as a client, so we first need to configure the database server as a master by amending postgresql.conf

# postgresql.conf
wal_level = hot_standby
max_wal_senders = 4
wal_keep_segments = 100

The value of wal_keep_segments instructs the server to keep at least this many WAL segments on hand. This allows a replica to catch up if it is briefly disconnected, and for the purpose of a backup it also allows us to sync up after the initial clone.
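
pg_basebackup connects using the replication protocol, so pg_hba.conf must also permit a replication connection. A minimal sketch, assuming the host taking the backup lives on 10.0.0.0/24 (adjust the address and auth method for your network):

# pg_hba.conf
host    replication    postgres    10.0.0.0/24    md5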

After a restart (wal_level and max_wal_senders cannot be changed by a reload) we can create a copy of the master using

pg_basebackup -U postgres -D backup

Hopefully you are taking this one step further and creating an off-site replica. We can use the pv utility or the built-in --max-rate option to rate-limit the stream over a WAN:

mkdir -m 0700 backup
cd backup
pg_basebackup -h primary -U postgres -D - -Ft -x -z -vP \
       --max-rate=10M | tar -xzf -

The options supplied to pg_basebackup stream the results to STDOUT in tar format while displaying periodic progress updates on STDERR.

LOG:  transaction log start point: 0/5000024 on timeline 1
LOG:  42286/42286 kB (100%), 1/1 tablespace
LOG:  transaction log end point: 0/50000DC
LOG:  pg_basebackup: base backup completed

Note that everything is copied from the master, including configuration files and log files. Before cloning a busy server, be sure to trim the contents of pg_log/.

Accessing the new backup is now as easy as starting a server on the cloned data directory

cp -R backup backup_copy
pg_ctl -D backup_copy start
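
If the original server is still running on the same machine the copy cannot bind the same port; a quick sketch, assuming 5433 is free:

pg_ctl -D backup_copy -o "-p 5433" start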

Tailing pg_log/* should indicate that the server was able to reach a consistent state

LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  redo starts at 36/750000C8
LOG:  consistent recovery state reached at 36/75751FC8

Point-in-Time Recovery

Now that we have a clone, pg_receivexlog can be used to stream transaction logs from the primary

pg_receivexlog -h primary -U postgres -D backup/pg_xlog
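
To confirm the stream is attached, query pg_stat_replication on the primary; pg_receivexlog shows up as a WAL sender like any other standby (column names here match the 9.x series):

psql -h primary -U postgres -c \
       'SELECT application_name, state, sent_location FROM pg_stat_replication'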

Running a replica will usually protect you from a hardware failure, but it will not protect you from operator error or a faulty application. Point-in-time recovery enables you to replay transactions up to a specific moment. It is activated by defining recovery_target_time in recovery.conf; Postgres will then replay each transaction log until the specified timestamp is reached

# recovery.conf
recovery_target_time = '2015-06-29 08:00:00'
restore_command = 'cp /pg_backup/wals/%f %p'

The restore_command is used to copy files from a transaction log archive. The examples above are not using the archive feature, but this command is still required. When the server is started the restore_command is executed; it will fail (because /pg_backup/wals/ does not exist), and the WAL already present in pg_xlog will be used instead.
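
Although these examples do not archive WAL, for reference a primary that does would populate that directory with something along these lines (the stock pattern from the Postgres documentation, assuming /pg_backup/wals is reachable from the primary):

# postgresql.conf (on the primary)
archive_mode = on
archive_command = 'test ! -f /pg_backup/wals/%f && cp %p /pg_backup/wals/%f'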

pg_ctl -D backup start

Postgres will record that it is running a point-in-time recovery to the specified time

LOG:  starting point-in-time recovery to 2015-06-29 08:00:00-04
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 36/756E53B8

Once the restore succeeds recovery.conf is renamed to recovery.done and the server starts up in master mode on a new timeline.

All of this can be scripted and orchestrated; in my view this is exactly what Barman does, and it does it well.

Establishing a Delayed Replica

Storing WAL using pg_receivexlog works well for keeping transaction logs, but there is no way to apply these regularly, so a fresh full backup needs to be taken periodically. Starting a server in-place on the backup will replay the WAL up to date, but a new timeline will be established during this process, which will prevent us from receiving further WAL updates.

[I think] the most elegant solution to this is to run "live backups" using a delayed replica.

# recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary user=postgres'
recovery_min_apply_delay = '5d'

Since these servers are mostly idle in between applying updates, a bunch of them can be run on a single host using different port numbers; my preference is to stand them up in small-footprint VMs in a way nearly identical to every other standby.
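
When the delayed copy is actually needed, the rough procedure is to shorten or remove the delay and let the server roll forward; sketched here with a hypothetical delayed_standby data directory:

# 1. in recovery.conf: shorten or remove recovery_min_apply_delay
# 2. restart so the standby replays the withheld WAL
pg_ctl -D delayed_standby restart
# 3. promote once it has replayed past the point you care about
pg_ctl -D delayed_standby promote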

Daily Verification

One way to verify a physical replica is to measure the delta between transaction IDs on the master and the backup

backup_txid=$(ssh $host "psql -At -c 'SELECT txid_current()'")
# copy database and start in place
restore_txid=$(psql -At -c 'SELECT txid_current()')
echo $restore_txid $backup_txid | awk -f compare.awk > $host.verify/compare.out

Admittedly, relying on an arbitrary TXID delta (I use 6000) can raise false positives, but it will not miss a replica that has fallen behind.

BEGIN { xid_max_delta=6000; }
{
    # $1 is the txid from the restored copy, $2 the txid captured on the master
    if ($2 - $1 < xid_max_delta)
        print "xids",$0,"=> Pass";
    else
        print "xids",$0,"=> Fail";
}

Finally, notification and alerting can be handled by some combination of syslog and e-mail

logger -t "pg_verifybackup $host" -p local0.notice < $host.verify/compare.out
grep -q "Pass" $host.verify/compare.out
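
The exit status from grep can then trigger an e-mail; a sketch, where the recipient address is of course a placeholder:

grep -q "Pass" $host.verify/compare.out ||
    mail -s "pg_verifybackup failed: $host" ops@example.com < $host.verify/compare.out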

The pg-verify project provides a simple implementation of this process.

Last updated on January 11, 2017