Patroni and Consul Service Discovery
Among the tricky aspects of automating database failover is the need to
- redirect existing connections to the new primary and
- update DNS to reflect the new topology
Patroni's integration with Consul can solve both of these problems: Patroni sets tags for each host indicating its role, Consul keys can be monitored for changes, and Consul itself can publish nameservice records.
The Postgres client library (libpq) will try each host in order, so we can prefer read-only queries with a fallback to the current primary by listing the read-only name first
postgresql://user@raddb-query.service.eradman.com,raddb-writer.service.eradman.com/db
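Connection strings with multiple hosts require libpq 10 or later. The same mechanism can also enforce a writable session; as a sketch, a client that must reach the primary can list only the writer name and set the standard libpq parameter target_session_attrs:

$ psql "postgresql://user@raddb-writer.service.eradman.com/db?target_session_attrs=read-write"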
Local Consul Agent
Patroni uses the DCS (distributed configuration store) as a coordination point, so a Consul cluster that is propagated to other datacenters using consul-replicate will not suffice. Instead, we need to establish a set of Consul servers that spans all datacenters, which I'll call the "global" DC to indicate that it is not a geographical site
# /etc/consul.d/consul.hcl
data_dir = "/var/lib/consul"
datacenter = "global"
domain = "consul.eradman.com"
encrypt = "xxxxxxxxxxxxxxxxxxxxxxxx"
primary_datacenter = "global"
retry_join = ["10.2.0.30", "10.3.0.30", "10.4.0.30"]
The membership should then show servers that maintain consensus across all datacenters
$ consul members
Node                       Address         Status  Type    Build  Protocol  DC      Segment
consul4.east.eradman.com   10.2.0.30:8301  alive   server  1.4.3  2         global  <all>
consul4.north.eradman.com  10.4.0.30:8301  alive   server  1.4.3  2         global  <all>
consul4.west.eradman.com   10.3.0.30:8301  alive   server  1.4.3  2         global  <all>
raddb1.east.eradman.com    10.2.0.90:8301  alive   client  1.4.3  2         global  <default>
raddb1.west.eradman.com    10.3.0.90:8301  alive   client  1.4.3  2         global  <default>
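To double-check that these servers form a single raft quorum, ask Consul for the peer list; the leader and both followers should all appear:

$ consul operator raft list-peers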
Patroni Configuration
The basic "timeout" values for Patroni include the following:
# patroni.yaml
bootstrap:
  dcs:
    ttl: 90
    loop_wait: 10
    retry_timeout: 90
Set dcs.ttl and dcs.retry_timeout to a value that is longer than the time it takes Consul to complete an election. These values are read from the YAML configuration when a cluster is initialized, but for existing clusters change them using patronictl edit-config.
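As a sketch, recent versions of patronictl also accept -s/--set arguments, which makes this change scriptable (flag support may vary by version):

$ patronictl -c patroni.yml edit-config -s ttl=90 -s retry_timeout=90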
Tags and Prepared Queries
Since Patroni registers tags with each service, we can install a prepared query to enable name server resolution:
$ curl -s http://127.0.0.1:8500/v1/query
[
  {
    "Name": "raddb-writer",
    "Service": {
      "Service": "raddb",
      "OnlyPassing": true,
      "IgnoreCheckIDs": null,
      "Tags": [ "primary" ]
    },
    "DNS": { "TTL": "30s" }
  }
]
With these queries in place we can resolve each name separately
$ dig raddb-writer.query.consul.eradman.com +short
$ dig raddb-query.query.consul.eradman.com +short
Redirecting Connections with PgBouncer
Using a connection pooler provides some valuable capabilities, including transaction pooling, the ability to pause and resume connections, and the ability to redirect connections by retrying the backend. One means of reconfiguring PgBouncer is to use consul-template to regenerate its configuration; keys referenced in the template are automatically monitored for changes
# /etc/consul-template.d/reconfig-pgbouncer.conf
consul = {
  address = "localhost:8500"
  token = "..."
  retry = {
    attempts = 0
  }
}
template {
  source = "/etc/pgbouncer/pgbouncer.ini.ctmpl"
  destination = "/var/lib/consul-template/pgbouncer.ini"
  wait {
    min = "10s"
    max = "30s"
  }
  exec {
    command = "sudo /usr/local/sbin/reconfig-pgbouncer"
  }
}
A minimal PgBouncer configuration includes
[databases]
* = host={{ key "postgresql/raddb/leader" }}

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
server_fast_close = 1
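consul-template substitutes the key's current value when it renders the template; if postgresql/raddb/leader were set to raddb1.east.eradman.com (one of the hosts above), the rendered [databases] section would read:

[databases]
* = host=raddb1.east.eradman.com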
Since consul-template runs at startup as well as on a key change, it can be used for maintenance tasks such as synchronizing users
#!/bin/sh
# /usr/local/sbin/reconfig-pgbouncer
trap 'echo "exit code $? on line $LINENO"; exit' ERR
cd /tmp

# install pgbouncer configuration, reload, and refresh backend connections
src=/var/lib/consul-template/pgbouncer.ini
if [ -s $src ]; then
    cp $src /etc/pgbouncer/pgbouncer.ini; rm $src
    sudo -u pgbouncer psql -h /tmp -p 6432 -c 'RELOAD'
fi

# refresh userlist
psql -Atq -U postgres <<SQL > /etc/pgbouncer/userlist.txt
SELECT '"pgbouncer" "md5xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"' AS user_hash
UNION
SELECT concat('"', usename, '" "', passwd, '"')
FROM pg_shadow
WHERE usename NOT IN ('postgres')
ORDER BY user_hash
SQL
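Because consul-template invokes the reload script through sudo, the account it runs under needs a matching sudoers rule; a minimal sketch, assuming the daemon runs as a dedicated consul-template user:

# /etc/sudoers.d/consul-template
consul-template ALL=(root) NOPASSWD: /usr/local/sbin/reconfig-pgbouncer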
pgBackRest
There are two places where we should check that the pgBackRest repository is initialized: first when the cluster is bootstrapped, and again when a new primary is promoted
# patroni.yaml
bootstrap:
  post_init: /var/lib/pgsql/bin/post_init.sh
postgresql:
  parameters:
    archive_command: 'pgbackrest --stanza=main archive-push %p'
    archive_mode: 'on'
  callbacks:
    on_role_change: /var/lib/pgsql/bin/post_init.sh
Note that 'on' is quoted to prevent this value from being loaded as a Python boolean.
The script that does the housework after a failover should check whether the backup location needs to be initialized from scratch or upgraded, based on the exit codes of pgbackrest check
#!/bin/sh
backup_path=/pgbackup-0/$(hostname -s)
[ -d $backup_path ] || mkdir $backup_path

pgbackrest --stanza=main check
case $? in
28|51) pgbackrest --stanza=main --log-level-console=info stanza-upgrade ;;
55)    pgbackrest --stanza=main --log-level-console=info stanza-create ;;
esac
An example cron schedule might look like this:
0 15 * * 0    [ $(psql -c "SELECT pg_is_in_recovery()" -At) = "t" ] || pgbackrest --type=full --stanza=main backup
0 15 * * 1-6  [ $(psql -c "SELECT pg_is_in_recovery()" -At) = "t" ] || pgbackrest --type=incr --stanza=main backup
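To verify that backups and archived WAL are landing in the repository, query the stanza status:

$ pgbackrest --stanza=main info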
Major Version Upgrades
The mechanism for upgrading a Patroni cluster is a tool that you will need to build for yourself. To do this, stand up a test cluster and repeatedly rebuild/upgrade it until the bugs are shaken out of your script. The basic steps are as follows; the {0} and {1} markers are placeholders that the upgrade script fills in for each command:
- Backup files (leader)
  pg_hba.conf, pg_ident.conf
- Stop patroni (all hosts)
  sudo rcctl stop patroni
- Erase consul state (leader)
  patronictl -c patroni.yml remove {0}
- Initialize new database (leader)
  /usr/pgsql-{0}/bin/initdb -k -D /pg_data/{0} --encoding=UTF8 --locale=en_US.UTF-8
- Run upgrade (leader)
  /usr/pgsql-{0}/bin/pg_upgrade -b /usr/pgsql-{1}/bin -B /usr/pgsql-{0}/bin -d /pg_data/{1} -D /pg_data/{0}
- Restore files (leader)
  cp /tmp/pg_hba.conf.saved /pg_data/{0}/pg_hba.conf
- Update patroni with new paths (all hosts)
  sed -i -e 's:/pg_data/{0}:/pg_data/{1}:g' -e 's:/usr/pgsql-{0}:/usr/pgsql-{1}:g' patroni.yml
- Start patroni (all hosts)
  sudo rcctl start patroni
- Analyze (leader)
  /usr/pgsql-{0}/bin/vacuumdb --all --analyze-in-stages
- Update backup repo
  pgbackrest --stanza=main --log-level-console=info stanza-upgrade
- Push updated configuration
The upgrade script depends on some state that is dynamic; we can query Patroni itself for the topology and use it to drive the rest of the upgrade
#!/usr/bin/python3
import sys

import requests


class PatroniCluster:
    def __init__(self, url):
        r = requests.get(url + "/").json()
        self.version = r['server_version'] // 10000
        self.next_version = self.version + 1
        self.database = r['patroni']['scope']
        # roles
        self.leader = None
        self.replicas = []
        r = requests.get(url + "/cluster").json()
        for member in r['members']:
            hostname = str(member['name'])
            if member['role'] == 'leader':
                self.leader = hostname
            else:
                self.replicas.append(hostname)
        self.all_hosts = [self.leader] + self.replicas

    def info(self):
        print(" version:", self.version, "->", self.next_version)
        print("database:", self.database)
        print("  leader:", self.leader)
        print("replicas:", ",".join(self.replicas))


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: patroni-upgrade.py patroni_host")
        sys.exit(1)
    url = "http://{}:8008".format(sys.argv[1])
    cluster = PatroniCluster(url)
    cluster.info()
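Running this against any member prints the topology that the remaining steps will operate on; the output below is illustrative, assuming an 11-to-12 upgrade of the hosts shown earlier:

$ ./patroni-upgrade.py raddb1.east.eradman.com
 version: 11 -> 12
database: raddb
  leader: raddb1.east.eradman.com
replicas: raddb1.west.eradman.com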
Key-store Maintenance
Patroni periodically writes to the DCS, and if Consul is not responding properly all databases will be restarted in read-only mode. To avoid this, you can disable failover on each cluster during DCS maintenance
patronictl pause
# DCS maintenance
patronictl resume
Restoring from a Backup
Begin by stopping all of the standby databases and finally the leader. Then move the old data directory aside and run the restore
mv /pg_data/{14,14.old}
mkdir /pg_data/14
pgbackrest --stanza=main --log-level-console=warn restore
After the restore completes, start the database, resume WAL replay, and then stop it
export PATH=/usr/pgsql-14/bin:$PATH
pg_ctl -D /pg_data/14 start
psql -c 'select pg_wal_replay_resume()'
pg_ctl -D /pg_data/14 stop
Finalize by starting the Patroni leader. Before bringing the standbys up, remove their data directories so that Patroni runs a full synchronization.
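As a sketch, the sequence on each standby might look like this, reusing the rcctl service name from the upgrade steps:

sudo rcctl stop patroni
rm -rf /pg_data/14
sudo rcctl start patroni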
NearestN Service Discovery
If you want to select the replica with the lowest latency from the reader, a Consul cluster spanning a WAN will not work. Since Patroni needs a global session, we need to write our own script to register and deregister services. First, disable Patroni's service registration
consul:
  register_service: false
Then we can run our own loop on each server that polls the Patroni endpoints
#!/bin/sh
export CONSUL_HTTP_TOKEN="..."
database_name=$(awk '/^scope:/ {print $2}' /etc/patroni/patroni.yml)

health_check() {
    curl -f -s http://$(hostname):8008/$1
}
service_check() {
    curl -f -s http://127.0.0.1:8500/v1/agent/service/$1
}

while :
do
    for service in writer query any; do
        case $service in
        writer) endpoint='primary' ;;
        query)  endpoint='replica' ;;
        any)    endpoint='read-only' ;;
        esac
        service_id=$database_name-$service
        if health_check $endpoint; then
            service_check $service_id ||
                consul services register -name $database_name -id $service_id -tag=$endpoint
        else
            service_check $service_id &&
                consul services deregister -id $service_id
        fi
    done
    sleep 10
done
Now we can write a prepared query that will scan each of the datacenters and pick the closest replica
{ "Name": "raddb-query", "Service": { "Service": "raddb", "Failover": { "NearestN": 0, "Datacenters": [ "east", "west", "south" ] }, "OnlyPassing": true, "Near": "_agent", "Tags": [ "read-only" ] }, "DNS": { "TTL": "10s" } }