Eric Radman : a Journal

Failover Orchestration for PostgreSQL

Ground Rules

The strong-consistency guarantees of a relational database do not generally lead to an architecture conducive to easy migration. Elaborate mechanisms can be conceived to handle hot failover with PostgreSQL, but if there is one lesson we can take away from Computer Science it is that the key to solving complex problems is to make assertions.

For this exercise we will use the following rules:

  1. Read-write connections are handled by a proxy (port 6432). By insisting that update operations run through a proxy we can instantly redirect transactions to a new backend while we wait for DNS to converge. The proxy also enables us to pause connections for the 5-8 seconds it takes to transition to a new master.
  2. Read-only connections are handled directly by postgres (port 5432). Applications that only need to read data will continue to operate against a standby even after it has been promoted to master, at least until DNS or some other mechanism shifts this load back to the new standby.
  3. Only one master is enabled at any given point in time. If we never have concurrent writes, timelines will not diverge, eliminating the need for a rewind. (A quick check for this assertion appears after this list.)
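
A quick way to assert rule 3 across the fleet is to ask every server whether it is in recovery; exactly one host should answer "f". This is a minimal sketch, assuming the pgdb* minion naming and the local postgres access used later on:

salt 'pgdb*' cmd.run 'psql -U postgres -Atc "SELECT pg_is_in_recovery()"' runas=postgres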

The Salt Framework

For our basic salt configuration we will add two entries to /etc/salt/master:

file_roots:
  base:
    - /home/eradman/salt

pillar_roots:
  base:
    - /home/eradman/salt/pillar

Here file_roots defines where to look for our state files, and pillar_roots defines where the files containing global pillar data are located.
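
To verify that the master picked up these roots you can ask the fileserver which environments it is able to serve; a quick sanity check, and the output will depend on which environments you have defined:

sudo salt-run fileserver.envs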

# top.sls
{{saltenv}}:
  'pgdb*':
    - postgres
# pillar/top.sls
{{saltenv}}:
  '*':
    - globals

# pillar/globals.sls
pg:
  mydb:
    writer: pgdb1
    query: pgdb2

pg_replication_map:
  pgdb2: pgdb1
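
Before wiring any states to this data it is worth confirming that the minions actually see it; pillar.get takes a colon-delimited path into the nested structure, using the host names defined above:

salt 'pgdb*' pillar.get pg:mydb:writer
salt 'pgdb2*' pillar.get pg_replication_map:pgdb2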

Now we will give salt some directions for managing our PostgreSQL instances using the static pillar data we defined.

# postgres/init.sls
{% if grains.host in [pillar.pg.mydb.query] %}
/pg_data/9.5/recovery.conf:
  file.managed:
    - user: postgres
    - group: postgres
    - source: salt://postgres/recovery.conf.j2
    - template: jinja
    - context:
        restore_cmd: 'scp {{pillar.pg_replication_map[grains.host]}}:/pg_data/9.5/pg_xlog/%f %p'

{% else %}
promote_new_master:
  cmd.run:
    - name: '/usr/pgsql-9.5/bin/pg_ctl promote -D /pg_data/9.5'
    - onlyif: 'test -e /pg_data/9.5/recovery.conf'
    - runas: postgres
{% endif %}

postgresql-9.5:
  service.running:
    - enable: True

Each standby will install recovery.conf, which is rendered using the context passed in from postgres/init.sls:

  # postgres/recovery.conf.j2
  standby_mode = 'on'
  primary_conninfo = 'user=postgres host={{ pillar.pg_replication_map[grains.host] }}'
  recovery_target_timeline = 'latest'
  restore_command = '{{restore_cmd}}'

Here a database is either a standby (query) or a master (writer). If it is a standby we install recovery.conf. recovery_target_timeline is an important parameter because it allows a standby to automatically follow the new timeline after a promotion.
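
One way to confirm that a standby really did follow the new timeline is to compare the checkpoint timeline reported on each host; pg_controldata prints it directly (the paths assume the PGDG 9.5 layout used above):

sudo -u postgres /usr/pgsql-9.5/bin/pg_controldata /pg_data/9.5 | grep TimeLineID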

If the host is not a standby but recovery.conf is found on the file system, then we know a transition is taking place and we promote the server. In either case we ensure that the service is running. We can preview the actions that will be taken on each host individually:

salt 'pgdb1*' state.highstate saltenv=eradman test=True
salt 'pgdb2*' state.highstate saltenv=eradman test=True

Redirection

PgBouncer has the very nice property of pausing connections if the backend (the PostgreSQL server) is not responding. By shutting down the former master we also avoid the danger of stray writes reaching the former master during this transition.
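
The same pause can be exercised by hand from PgBouncer's admin console, which is useful for rehearsing a failover; PAUSE waits for active queries to finish and holds new ones, RESUME lets them through again. The admin user shown here is an assumption and must be listed in admin_users:

psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "PAUSE;"
# ... promote the new master ...
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c "RESUME;"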

To reconfigure pgbouncer we rewrite pgbouncer.ini and signal it to reload its configuration:

# postgres/init.sls
/etc/pgbouncer/pgbouncer.ini:
  file.managed:
    - source: salt://postgres/pgbouncer.ini.j2
    - template: jinja

kill -HUP `cat /var/run/pgbouncer/pgbouncer.pid`:
  cmd.run:
    - runas: pgbouncer
    - onchanges:
      - file: /etc/pgbouncer/pgbouncer.ini

Now in pgbouncer.ini we add some conditional configuration to point to a local UNIX socket or to a remote host. In this way connections to the proxy are always routed to a server that has write access.

[databases]
; local unix connections
{% if grains.host in [pillar.pg.mydb.writer] -%}
database1 =
database2 =
{% endif %}

; redirections to current master
{%- if grains.host == pillar.pg.mydb.query %}
* = host={{ pillar.pg.mydb.writer }}
{%- endif %}
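
Once pgbouncer has reloaded it is easy to check where the proxy will send writes: SHOW DATABASES on the admin console lists the effective host for each entry, and a connection through port 6432 should report that it is not in recovery no matter which node it lands on. The user names here are assumptions and require matching pgbouncer auth entries:

psql -h pgdb2 -p 6432 -U pgbouncer pgbouncer -c "SHOW DATABASES;"
psql -h pgdb2 -p 6432 -U postgres database1 -c "SELECT pg_is_in_recovery();"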


Orchestration

Once we have the building blocks in place for connecting a standby to the current master and promoting a standby to a master, we need to execute these steps in the right order. A shell script could do the job, but Salt's state.orchestrate runner provides both sequential execution and the ability to specify dependencies. The orchestration below lives in postgres-failover/mydb.sls (the name invoked later), and it runs after the pillar roles have been swapped, so pg.mydb.query now names the former master and pg.mydb.writer names the standby being promoted.

stop_former_master:
  salt.function:
    - name: cmd.run
    - tgt: {{pillar.pg.mydb.query}}*
    - arg:
      - systemctl stop postgresql-9.5

promote_master:
  salt.state:
    - tgt: {{pillar.pg.mydb.writer}}*
    - highstate: True

rejigger_replicas:
  salt.state:
    - tgt: {{pillar.pg.mydb.query}}*
    - highstate: True

That is the core of it. I also like to finish with a command that gives a one-line report of the new master's WAL position:

mydb.writer_status:
  salt.function:
    - name: cmd.run
    - tgt: {{pillar.pg.mydb.writer}}*
    - arg:
      - |
        psql -U postgres <<SQL
           SELECT pg_xlogfile_name(pg_last_xlog_replay_location());
        SQL

That's everything! Initiate failover using:

$EDITOR pillar/globals.sls
sudo salt-run state.orchestrate postgres-failover.mydb saltenv=$USER
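
Concretely, initiating a failover just means swapping the roles in the pillar before running the orchestration. After promoting pgdb2 the contents of pillar/globals.sls would look something like this (written as a heredoc purely for illustration):

cat > pillar/globals.sls <<'YAML'
pg:
  mydb:
    writer: pgdb2
    query: pgdb1

pg_replication_map:
  pgdb1: pgdb2
YAML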

Log Messages

Transition to master:

LOG:  received promote request
LOG:  redo done at 32/39225840
LOG:  last completed transaction was at log time 2016-12-01 10:30:52.907917-05
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  MultiXact member wraparound protections are now enabled
LOG:  checkpoint starting: force
LOG:  autovacuum launcher started

Transition to standby:

LOG:  entering standby mode
LOG:  consistent recovery state reached at 32/392258B0
LOG:  invalid record length at 32/392258B0
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 32/39000000 on timeline 1
DETAIL:  End of WAL reached on timeline 1 at 32/392258B0.
LOG:  new target timeline is 2
LOG:  replication terminated by primary server
LOG:  restarted WAL streaming at 32/39000000 on timeline 2
LOG:  redo starts at 32/392258B0

A Replication View

Another aspect of replication and failover that is not obvious is how to collect meaningful information about the status of database replicas. Making this inquiry efficient becomes more important as the number of database servers grows.

Again Salt helps because we can build the list of involved hosts to run against:

{% set hosts = pillar.pg.mydb.values()|join("|") %}

Now we can find out the basic status of replication connections:

mydb_replication_status:
  salt.function:
    - name: cmd.run
    - tgt: '{{hosts}}'
    - tgt_type: pcre
    - arg:
      - |
        echo
        psql -U postgres <<SQL
          SELECT
            application_name, client_addr, state, sent_location, replay_location, sync_state
          FROM pg_stat_replication;
        SQL

Execute this by running the SLS:

$ sudo salt-run state.sls mydb-status

A more in-depth view of replay status can be obtained from the pg_control_checkpoint() function, which was introduced in 9.6:

mydb_replay_status:
  salt.function:
    - name: cmd.run
    - tgt: '{{hosts}}'
    - tgt_type: pcre
    - arg:
      - |
        echo
        psql -U postgres <<SQL
          SELECT
            pg_last_xlog_receive_location() receive_location,
            pg_last_xlog_replay_location() replay_location,
            redo_wal_file, timeline_id, next_xid,
            date_trunc('second', now()-checkpoint_time) AS last_checkpoint,
            CASE WHEN pg_is_in_recovery()='t'
              THEN date_trunc('seconds', now()-pg_last_xact_replay_timestamp())
            END AS last_xact_replay
          FROM pg_control_checkpoint();
        SQL

For standard standby servers this lets you verify that all of the database servers are on the same timeline and have not fallen behind; for delayed replicas you can verify that they are trailing by the prescribed delay. In this example mydb3 has recovery_min_apply_delay = '1d' set in recovery.conf:

      ID: mydb_replay_status
Function: salt.function
    Name: cmd.run
  Result: True
 Comment: Function ran successfully. Function cmd.run ran on mydb1, mydb2, mydb3.
 Started: 16:12:50.445661
Duration: 5072.736 ms
 Changes:

          mydb1:
           receive_location | replay_location |      redo_wal_file       | timeline_id |  next_xid  | last_checkpoint | last_xact_replay
          ------------------+-----------------+--------------------------+-------------+------------+-----------------+------------------
           EB/6FF5B540      | EB/6FF5B540     | 00000003000000EB0000006F |           3 | 0:33686826 | 00:08:21        | 00:00:12
          (1 row)

          mydb2:
           receive_location | replay_location |      redo_wal_file       | timeline_id |  next_xid  | last_checkpoint | last_xact_replay
          ------------------+-----------------+--------------------------+-------------+------------+-----------------+------------------
                            |                 | 00000003000000EB0000006F |           3 | 0:33686826 | 00:03:21        |
          (1 row)

          mydb3:
           receive_location | replay_location |      redo_wal_file       | timeline_id |  next_xid  | last_checkpoint | last_xact_replay
          ------------------+-----------------+--------------------------+-------------+------------+-----------------+------------------
           EB/6FF5B540      | EA/C87286A0     | 00000003000000EA000000C7 |           3 | 0:33656828 | 1 day 00:09:50  | 1 day 00:00:02
          (1 row)

Last updated on May 18, 2017