Eric Radman : a Journal

Failover Orchestration for PostgreSQL

Ground Rules

The strong-consistency guarantees of a relational database do not generally lead to an architecture conducive to easy migration. Elaborate mechanisms can be conceived to handle hot failover with PostgreSQL, but if there is one lesson we can take away from Computer Science it is that the key to solving complex problems is to make assertions.

For this exercise we will use the following rules:

  1. Read-write connections are handled by a proxy (port 6432). By insisting that update operations run through a proxy we can instantly redirect transactions to a new backend while we wait for DNS to converge. The proxy also enables us to pause connections for the 5-8 seconds it will take to transition to a new master.
  2. Read-only connections are handled directly by postgres (port 5432). Applications that only need to read data will continue to operate against a standby that has been promoted to master, at least until DNS or some other mechanism shifts this load back to the new standby. (Example connection strings for both ports follow this list.)
  3. Only one master is enabled at any given point in time. If we never have concurrent writes, timelines will not diverge, eliminating the need to run a rewind.
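
As a concrete illustration of rules 1 and 2, an application might carry two connection strings along these lines (the host names and user are hypothetical; only the ports matter):

# read-write: routed through pgbouncer, so it can be redirected during a failover
psql 'host=mydb-write.example.net port=6432 dbname=database1 user=app'

# read-only: connects to postgres directly and keeps working on a promoted standby
psql 'host=mydb-read.example.net port=5432 dbname=database1 user=app'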

The Salt Framework

For our basic salt configuration we will add two entries to /etc/salt/master:

file_roots:
  base:
    - /home/eradman/salt

pillar_roots:
  base:
    - /home/eradman/salt/pillar

Where file_roots defines where Salt looks for our state files, and pillar_roots defines where the pillar files that hold global variables are located.
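
After changing /etc/salt/master the master daemon needs to be restarted before the new roots take effect; on a systemd host that is simply:

sudo systemctl restart salt-master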

# top.sls
{{saltenv}}:
  'pgdb*':
    - postgres

# pillar/top.sls
{{saltenv}}:
  '*':
    - globals

# pillar/globals.sls
pg:
  mydb:
    writer: pgdb1
    query: pgdb2

pg_replication_map:
  pgdb2: pgdb1
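
To sanity-check that the minions see this data, the rendered pillar can be queried from the salt master (pillar.get and saltutil.refresh_pillar are standard execution modules):

salt 'pgdb*' saltutil.refresh_pillar
salt 'pgdb*' pillar.get pg:mydb:writer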

Now we will give salt some directions for managing our PostgreSQL instances using the static pillar data we defined.

# postgres/init.sls
{% if grains.host in [pillar.pg.mydb.query] %}
/pg_data/9.5/recovery.conf:
  file.managed:
    - user: postgres
    - group: postgres
    - source: salt://postgres/recovery.conf.j2
    - template: jinja
    - context:
      restore_cmd: 'scp {{pillar.pg_replication_map[grains.host]}}:/pg_data/9.5/pg_xlog/%f %p'

{% else %}
promote_new_master:
  cmd.run:
    - name: '/usr/pgsql-9.5/bin/pg_ctl promote -D /pg_data/9.5'
    - onlyif: 'test -e /pg_data/9.5/recovery.conf'
    - runas: postgres
{% endif %}

postgresql-9.5:
  service.running:
    - enable: True

Each standby will install recovery.conf, which will be rendered with the context passed in from postgres/init.sls:

# postgres/recovery.conf.j2
standby_mode = 'on'
primary_conninfo = 'user=postgres host={{ pillar.pg_replication_map[grains.host] }}'
recovery_target_timeline = 'latest'
restore_command = '{{restore_cmd}}'

Here a database is either a standby (query) or it is a master (writer). If it is a standby we install recovery.conf. recovery_target_timeline is an important parameter because it allows a standby to automatically follow a new timeline.

If the host is not a standby but recovery.conf is found on the file system then we know a transition is taking place and we promote the server. In any case we ensure that the service is running. We can test to see what actions will be taken on each host individually:

salt 'pgdb1*' state.highstate saltenv=eradman test=True
salt 'pgdb2*' state.highstate saltenv=eradman test=True
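
Independent of Salt, the role each host currently holds can be confirmed by asking PostgreSQL itself; pg_is_in_recovery() returns true on a standby and false on a master:

psql -U postgres -Atc 'SELECT pg_is_in_recovery()'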

Redirection

PgBouncer has the very nice property of pausing connections if the backend (the PostgreSQL server) is not responding. By shutting down the former master we also avoid the danger of stray writes during this transition.
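
The same pause-and-resume behavior can also be driven by hand from the PgBouncer admin console, which is handy for rehearsing a failover. This sketch assumes the connecting user is listed in admin_users:

psql -p 6432 -U postgres pgbouncer -c 'PAUSE'    # wait for active queries, then queue new ones
# ... promote the new master ...
psql -p 6432 -U postgres pgbouncer -c 'RESUME'   # release the queued connections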

To reconfigure pgbouncer we re-write pgbouncer.ini and signal a configuration change:

# postgres/init.sls
/etc/pgbouncer/pgbouncer.ini:
  file.managed:
    - source: salt://postgres/pgbouncer.ini.j2
    - template: jinja

kill -HUP `cat /var/run/pgbouncer/pgbouncer.pid`:
  cmd.run:
    - runas: pgbouncer
    - onchanges:
      - file: /etc/pgbouncer/pgbouncer.ini

Now in pgbouncer.ini we add some conditional configuration to point to a local UNIX socket or to a remote host. In this way connections to the proxy are always routed to a server that has write access.

; postgres/pgbouncer.ini.j2
[databases]
; local unix connections
{% if grains.host in [pillar.pg.mydb.writer] -%}
database1 =
database2 =
{% endif %}

; redirections to current master
{%- if grains.host == pillar.pg.mydb.query %}
* = host={{ pillar.pg.mydb.writer }}
{%- endif %}
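
The template above only covers the [databases] section. For reference, a minimal sketch of the rest of pgbouncer.ini that matches the port and pid file used here might look like the following (auth_type, auth_file and admin_users are assumptions to adapt for your site):

[pgbouncer]
listen_addr = *
listen_port = 6432
unix_socket_dir = /var/run/pgbouncer
pidfile = /var/run/pgbouncer/pgbouncer.pid
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
admin_users = postgres
pool_mode = session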


Orchestration

Once we have the building blocks in place for connecting a standby to the current master and promoting a standby to a master, we need to execute these steps in the right order. A shell script would do, or we can use Salt's state.orchestrate runner, which provides both sequential execution and the ability to specify dependencies.

# postgres-failover/mydb.sls
stop_former_master:
  salt.function:
    - name: cmd.run
    - tgt: {{pillar.pg.mydb.query}}*
    - arg:
      - systemctl stop postgresql-9.5

promote_master:
  salt.state:
    - tgt: {{pillar.pg.mydb.writer}}*
    - highstate: True

rejigger_replicas:
  salt.state:
    - tgt: {{pillar.pg.mydb.query}}*
    - highstate: True
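
The three steps above run in the order they are written, but the dependencies can also be made explicit with requisites if you prefer; for example, promotion could be tied to the shutdown of the former master (a sketch, not part of the orchestration file above):

promote_master:
  salt.state:
    - tgt: {{pillar.pg.mydb.writer}}*
    - highstate: True
    - require:
      - salt: stop_former_master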

That is the core of it. I also like to finish with a command that gives a one-line status of the new master's WAL position:

mydb.writer_status:
  salt.function:
    - name: cmd.run
    - tgt: {{pillar.pg.mydb.writer}}*
    - arg:
      - |
        psql -U postgres <<SQL
           SELECT pg_xlogfile_name(pg_last_xlog_replay_location());
        SQL

That's everything! Initiate a failover using:

$EDITOR pillar/globals.sls
sudo salt-run state.orchestrate postgres-failover.mydb saltenv=$USER
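
For the pillar defined earlier, "initiating failover" amounts to swapping the two roles, so after the edit globals.sls would read something like:

# pillar/globals.sls
pg:
  mydb:
    writer: pgdb2
    query: pgdb1

pg_replication_map:
  pgdb1: pgdb2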

Log Messages

Transition to master:

LOG:  received promote request
LOG:  redo done at 32/39225840
LOG:  last completed transaction was at log time 2016-12-01 10:30:52.907917-05
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
LOG:  MultiXact member wraparound protections are now enabled
LOG:  checkpoint starting: force
LOG:  autovacuum launcher started

Transition to standby:

LOG:  entering standby mode
LOG:  consistent recovery state reached at 32/392258B0
LOG:  invalid record length at 32/392258B0
LOG:  database system is ready to accept read only connections
LOG:  fetching timeline history file for timeline 2 from primary server
LOG:  started streaming WAL from primary at 32/39000000 on timeline 1
DETAIL:  End of WAL reached on timeline 1 at 32/392258B0.
LOG:  new target timeline is 2
LOG:  replication terminated by primary server
LOG:  restarted WAL streaming at 32/39000000 on timeline 2
LOG:  redo starts at 32/392258B0

Last updated on February 10, 2017