Eric Radman : a Journal

Patroni and Consul Service Discovery

Among the tricky aspects of automating database failover is the need to

  1. redirect existing connections to the new primary and
  2. update DNS to reflect the new topology

Patroni's integration with Consul can solve both of these problems: Patroni sets tags for each host indicating its role, Consul keys can be monitored for changes, and Consul itself can publish nameservice records.

The Postgres client library (libpq) tries each host in the order given, so we can prefer read-only queries with a fallback to the current primary by listing the read-only name first

postgresql://user@raddb-query.service.eradman.com,raddb-writer.service.eradman.com/db
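
Any libpq-based client accepts the same URI; psql, for example, will connect to the first of the two names that answers (the user and database names here are only placeholders)

$ psql 'postgresql://user@raddb-query.service.eradman.com,raddb-writer.service.eradman.com/db'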

Local Consul Agent

Patroni uses the DCS (distributed configuration store) as a coordination point, so a Consul cluster that is propagated to other datacenters using consul-replicate will not suffice. Instead, we need to establish a set of Consul servers that spans all datacenters, which I'll call the "global" DC to indicate that it is not a geographical site

# /etc/consul.d/consul.hcl
data_dir = "/var/lib/consul"
datacenter = "global"
domain = "consul.eradman.com"
encrypt = "xxxxxxxxxxxxxxxxxxxxxxxx"
primary_datacenter = "global"
retry_join = ["10.2.0.30", "10.3.0.30", "10.4.0.30"]
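
How the agent is supervised depends on the platform, but a minimal invocation only needs to point at this configuration directory

$ consul agent -config-dir=/etc/consul.d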

The membership should then show servers that maintain consensus across all datacenters

$ consul members
Node                      Address         Status  Type    Build  Protocol  DC      Segment
consul4.east.eradman.com  10.2.0.30:8301  alive   server  1.4.3  2         global  <all>
consul4.north.eradman.com 10.4.0.30:8301  alive   server  1.4.3  2         global  <all>
consul4.west.eradman.com  10.3.0.30:8301  alive   server  1.4.3  2         global  <all>
raddb1.east.eradman.com   10.2.0.90:8301  alive   client  1.4.3  2         global  <default>
raddb1.west.eradman.com   10.3.0.90:8301  alive   client  1.4.3  2         global  <default>

Patroni Configuration

The basic "timeout" values for Patroni include the following:

# patroni.yaml
bootstrap:
  dcs:
    ttl: 90
    loop_wait: 10
    retry_timeout: 90

Set dcs.ttl and dcs.retry_timeout to values longer than the time it takes Consul to complete an election. These values are read from the YAML configuration when a cluster is initialized; for existing clusters change them using patronictl edit-config.
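
As a sketch, assuming the configuration lives at /etc/patroni/patroni.yml, the timeouts on a running cluster can be raised non-interactively with the --set flag; check patronictl edit-config --help for the exact options your version supports

$ patronictl -c /etc/patroni/patroni.yml edit-config -s ttl=90 -s retry_timeout=90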

Tags and Prepared Queries

Since Patroni registers tags with each service, we can install a prepared query to enable name server resolution

$ curl -s http://127.0.0.1:8500/v1/query
[
  {
    "Name": "raddb-writer"
    },
    "Service": {
      "Service": "raddb",
      "OnlyPassing": true,
      "IgnoreCheckIDs": null,
      "Tags": [
        "primary"
      ]
    },
    "DNS": {
      "TTL": "30s"
    }
  }
]
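
A query definition such as this can be installed by POSTing it to the same endpoint; the filename here is only an example, and an ACL token may be required

$ curl -s -X POST -d @raddb-writer.json http://127.0.0.1:8500/v1/query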

With queries like this in place we can resolve each name separately

$ dig raddb-writer.query.consul.eradman.com +short
$ dig raddb-query.query.consul.eradman.com +short

Redirecting Connections with PgBouncer

Using a connection pooler provides some valuable capabilities, including transaction pooling, the ability to pause and resume connections, and the ability to redirect connections by retrying the backend. One means of reconfiguring PgBouncer is to use consul-template to regenerate its configuration; keys referenced in the template are automatically monitored for changes

# /etc/consul-template.d/reconfig-pgbouncer.conf
consul = {
  address = "localhost:8500"
  token = "..."
  retry = {
    attempts = 0
  }
}

template {
  source      = "/etc/pgbouncer/pgbouncer.ini.ctmpl"
  destination = "/var/lib/consul-template/pgbouncer.ini"
  wait {
    min = "10s"
    max = "30s"
  }
  exec {
    command = "sudo /usr/local/sbin/reconfig-pgbouncer"
  }
}
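
consul-template is then run as a daemon pointed at this configuration; how it is supervised is left to the init system

$ consul-template -config /etc/consul-template.d/reconfig-pgbouncer.conf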

A minimal pgbouncer configuration template includes

[databases]
* = host={{ key "postgresql/raddb/leader" }}

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
server_fast_close = 1
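
Once rendered, the [databases] section points at whatever host Patroni last recorded as the leader; the hostname below is only illustrative

[databases]
* = host=raddb1.east.eradman.com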

Since consul-template runs the command at startup as well as whenever a key changes, it can be used for maintenance tasks such as synchronizing users

#!/bin/sh
# /usr/local/sbin/reconfig-pgbouncer
trap 'echo "exit code $? on line $LINENO"; exit' ERR
cd /tmp

# install pgbouncer configuration, reload, and refresh backend connections
src=/var/lib/consul-template/pgbouncer.ini
if [ -s $src ]; then
  cp $src /etc/pgbouncer/pgbouncer.ini; rm $src
  sudo -u pgbouncer psql -h /tmp -p 6432 -c 'RELOAD'
fi

# refresh userlist
psql -Atq -U postgres <<SQL > /etc/pgbouncer/userlist.txt
  SELECT '"pgbouncer" "md5xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"' AS user_hash
  UNION
  SELECT concat('"', usename, '" "', passwd, '"')
  FROM pg_shadow
  WHERE usename NOT IN ('postgres')
  ORDER BY user_hash
SQL
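
Because consul-template invokes the script through sudo, the account it runs as needs a matching sudoers entry; assuming a dedicated consul-template user, something like

# /etc/sudoers.d/consul-template
consul-template ALL=(root) NOPASSWD: /usr/local/sbin/reconfig-pgbouncer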

pgBackRest

There are two places where we should check that the pgBackRest repository is initialized: first when the cluster is bootstrapped, and again when a new primary is promoted

# patroni.yaml
bootstrap:
  post_init: /var/lib/pgsql/bin/post_init.sh

postgresql:
  parameters:
    archive_command: 'pgbackrest --stanza=main archive-push %p'
    archive_mode: 'on'

  callbacks:
    on_role_change: /var/lib/pgsql/bin/post_init.sh

Note that on is quoted to prevent the YAML parser from loading this value as a boolean.

The script doing the housework after a failover should determine whether the backup location needs to be initialized from scratch or upgraded, based on the exit code of pgbackrest check

#!/bin/sh

backup_path=/pgbackup-0/$(hostname -s)
[ -d $backup_path ] || mkdir $backup_path

pgbackrest --stanza=main check
case $? in
    28|51)
        # these check failures indicate the stanza needs to be upgraded
        pgbackrest --stanza=main --log-level-console=info stanza-upgrade
        ;;
    55)
        # this check failure indicates the stanza has not been created yet
        pgbackrest --stanza=main --log-level-console=info stanza-create
        ;;
esac

An example cron schedule might look like this

0  15   *   *   0     [ $(psql -c "SELECT pg_is_in_recovery()" -At) = "t" ] || pgbackrest --type=full --stanza=main backup
0  15   *   *   1-6   [ $(psql -c "SELECT pg_is_in_recovery()" -At) = "t" ] || pgbackrest --type=incr --stanza=main backup

Major Version Upgrades

The mechanism for upgrading a Patroni cluster is a tool that you will need to build for yourself. To do this, stand up a test cluster, and repeatedly rebuild/upgrade it until the bugs are shaken out of your script. The basic steps are:

  1. Backup files (leader)
    pg_hba.conf, pg_ident.conf
  2. Stop patroni (all hosts)
    sudo rcctl stop patroni
  3. Erase consul state (leader)
    patronictl -c patroni.yml remove {0}
  4. Initialize new database (leader)
    /usr/pgsql-{0}/bin/initdb -k -D /pg_data/{0} --encoding=UTF8 --locale=en_US.UTF-8
  5. Run upgrade (leader)
    /usr/pgsql-{0}/bin/pg_upgrade -b /usr/pgsql-{1}/bin -B /usr/pgsql-{0}/bin -d /pg_data/{1} -D /pg_data/{0}
  6. Restore files (leader)
    cp /tmp/pg_hba.conf.saved /pg_data/{0}/pg_hba.conf
  7. Update patroni with new paths (all hosts)
    sed -i -e 's:/pg_data/{1}:/pg_data/{0}:g' -e 's:/usr/pgsql-{1}:/usr/pgsql-{0}:g' patroni.yml
  8. Start patroni (all hosts)
    sudo rcctl start patroni
  9. Analyze (leader)
    /usr/pgsql-{0}/bin/vacuumdb --all --analyze-in-stages
  10. Update backup repo
    pgbackrest --stanza=main --log-level-console=info stanza-upgrade
  11. Push updated configuration

The upgrade script depends on some state that is dynamic; we can query Patroni itself for the topology and drive the rest of the upgrade from that

#!/usr/bin/python

import sys
import requests

class PatroniCluster:
    def __init__(self, url):
        self.leader = None
        self.replicas = []

        # basic facts about the cluster from the Patroni REST API
        r = requests.get(url + "/").json()
        self.version = r['server_version'] // 10000
        self.next_version = self.version + 1
        self.database = r['patroni']['scope']

        # roles
        r = requests.get(url + "/cluster").json()
        for member in r['members']:
            hostname = str(member['name'])
            if member['role'] == 'leader':
                self.leader = hostname
            else:
                self.replicas.append(hostname)

        self.all_hosts = [self.leader] + self.replicas

    def info(self):
        print " version:", self.version, "->", self.next_version
        print "database:", self.database
        print "  leader:", self.leader
        print "replicas:", ",".join(self.replicas)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "usage: patroni-upgrade.py patroni_host"
        sys.exit(1)
    url = "http://{}:8008".format(sys.argv[1])

    cluster = PatroniCluster(url)
    cluster.info()
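
Pointing the script at any member prints the plan for the upgrade; the versions and hosts below are hypothetical

$ ./patroni-upgrade.py raddb1.east.eradman.com
 version: 11 -> 12
database: raddb
  leader: raddb1.east.eradman.com
replicas: raddb1.west.eradman.com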

Key-store Maintenance

Patroni periodically writes to the DCS, and if Consul is not responding properly all databases will be restarted in read-only mode. To avoid this during key-store maintenance you can disable failover on each cluster

patronictl pause
# DCS maintenance
patronictl resume

Restoring from a Backup

Begin by stopping all of the standby databases and finally the leader, then move the old data directory aside and run the restore

mv /pg_data/{14,14.old}
mkdir /pg_data/14
pgbackrest --stanza=main --log-level-console=warn restore

After the restore completes, start the database, resume WAL replay, and then stop it

export PATH=/usr/pgsql-14/bin:$PATH
pg_ctl -D /pg_data/14 start
psql -c 'select pg_wal_replay_resume()'
pg_ctl -D /pg_data/14 stop

Finalize by starting Patroni on the leader. Before bringing the standbys up, remove their data directories so that Patroni runs a full synchronization.
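
On each standby that amounts to something like the following, using the same paths as the example above

rm -rf /pg_data/14
sudo rcctl start patroni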

NearestN Service Discovery

If you want reads to be directed to the database with the lowest latency from the client, a Consul cluster spanning a WAN will not work. Since Patroni needs a global session, we need to write our own script to register and deregister services. First, disable Patroni's service registration

consul:
  register_service: false

Then we can run our own loop on each server that polls Patroni endpoints

#!/bin/sh

export CONSUL_HTTP_TOKEN="..."
database_name=$(awk '/^scope:/ {print $2}' /etc/patroni/patroni.yml)

health_check() {
    # succeeds only when Patroni reports this node in the given role
    curl -f -s -o /dev/null http://$(hostname):8008/$1
}

service_check() {
    # succeeds only when the local agent already has this service ID registered
    curl -f -s -o /dev/null http://127.0.0.1:8500/v1/agent/service/$1
}

while :
do
    for service in writer query any; do
        case $service in
            writer) endpoint='primary'   ;;
            query)  endpoint='replica'   ;;
            any)    endpoint='read-only' ;;
        esac
        service_id=$database_name-$service
        if health_check $endpoint; then
            service_check $service_id || consul services register -name $database_name -id $service_id -tag=$endpoint
        else
            service_check $service_id && consul services deregister -id $service_id
        fi
    done
    sleep 10
done
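
Whether the registrations took effect can be checked against the catalog; the output will look roughly like this, with tags varying by which health checks pass

$ consul catalog services -tags
consul
raddb    primary,read-only,replica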

Now we can write a prepared query that will scan each of the datacenters and pick the replica that is closest

{
  "Name": "raddb-query",
  "Service": {
    "Service": "raddb",
    "Failover": {
      "NearestN": 0,
      "Datacenters": [
        "east",
        "west",
        "south"
      ]
    },
    "OnlyPassing": true,
    "Near": "_agent",
    "Tags": [
      "read-only"
    ]
  },
  "DNS": {
    "TTL": "10s"
  }
}
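
As with the writer query, this definition can be installed by POSTing it to the query endpoint and then exercised with the dig commands shown earlier; the filename is again only an example

$ curl -s -X POST -d @raddb-query.json http://127.0.0.1:8500/v1/query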