Eric Radman : a Journal

Synchronizing PostgreSQL Databases

Option 1: Triggers

At first triggers seem like the logical way to go, but it's not as easy as one might hope, because the text of the SQL statement to run against a slave database has to be reconstructed for every event. There is no such thing as a "calling query" (an analog of $@ in the shell world) that can simply be run against a remote database connection.
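For illustration, a trigger-based version would have to look something like the sketch below; the customers and replication_queue tables are hypothetical. The trigger only sees the NEW row, not the statement that produced it, so the SQL to replay has to be rebuilt column by column:

CREATE OR REPLACE FUNCTION queue_customer_insert() RETURNS trigger AS $$
BEGIN
    -- Only the NEW row is visible here; the original statement text is not,
    -- so the INSERT has to be reassembled by hand
    INSERT INTO replication_queue (stmt)
    VALUES ('INSERT INTO customers (id, name) VALUES ('
            || NEW.id || ', ' || quote_literal(NEW.name) || ');');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_replicate AFTER INSERT ON customers
    FOR EACH ROW EXECUTE PROCEDURE queue_customer_insert();

Multiply that by every table and every kind of statement and the appeal fades quickly.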

Option 2: Logging

In terms of simplicity this is definitely the way to go: make Postgres log all UPDATE, INSERT, and DELETE statements and replay them on backup databases. Several modifications need to be made to postgresql.conf:

redirect_stderr = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 15
log_line_prefix = ':%u:%d:'
log_statement = 'mod'

All of these settings are important. redirect_stderr, log_directory, and log_filename start logging to the directory ~/data/pg_log with timestamped log names. log_rotation_age effectively determines how often (in minutes) updates can be sent out, because we only want to scan and remove log files that are no longer being written to by the postmaster.

The string ':%u:%d:' in log_line_prefix sets up a format that a script can use to (a) identify the user that issued the query, (b) identify the database it was run against, and (c) recognize log entries that are notifications and not queries at all; those lines begin with :::.

Finally, log_statement = 'mod' enables logging for record changes only.
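With that prefix in place the log lines look something like the following (the statements are made up for illustration and the table names are hypothetical; the indented line shows how a multi-line statement continues with no prefix):

:system:system:LOG:  statement: UPDATE hosts SET last_seen = now() WHERE id = 4;
:::LOG:  checkpoints are occurring too frequently (14 seconds apart)
:roundcube:roundcube:LOG:  statement: INSERT INTO contacts (user_id, name)
        VALUES (1, 'Example Contact');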

The Log Parser

#!/usr/pkg/bin/ruby
#
# logparser.rb by Eric Radman
# Modified 2007-01-04

PG_LOGS = "/var/pgsql/data/pg_log"
PG_SYNC = "/home/system"

def parselog(logfile)
    puts "Processing '#{logfile}'"
    lf = File.new(logfile)
    sql = db = ""
    lf.each_line {|ln|
        if ln.slice(0,2) == "::"        # Not a query (no user), skip
            print '.'                   # - skip -
        elsif ln.slice(0,1) == ":"      # New statement
            flush_query db, sql         # Write query before next statement
            print '!'                   # - new statement -
            sql = ln.split(/statement:/)[1] || ""   # Prefixed lines with no statement (e.g. errors)
            db = ln.split(':')[2]
        else                            # Everything else is statement body
            print '-'                   # - statement continued -
            sql.concat(ln)
        end
    }
    flush_query db, sql                 # Write whatever was collected 'till EOF
    puts
    File.delete(logfile)
    puts "done"
end

def flush_query(db, sql)
    if sql.strip.length > 0 and db != "" then
        sql = sql.strip + ";\n"         # strip! returns nil if nothing was stripped, so build a new string
        File.open("#{PG_SYNC}/db_#{db}_updates.sql", "a") do |file|
            file.syswrite(sql)          # block form closes the file when done
        end
    end
end

if $0 == __FILE__
    lst = Dir.entries(PG_LOGS).delete_if {|i| i[0,11] != 'postgresql-'}.sort
    if lst.length > 1
        parselog "#{PG_LOGS}/#{lst[0]}"
    end
end

The block guarded by if $0 == __FILE__ runs when the script is executed directly. It checks whether there is more than one log file in the folder specified by PG_LOGS, and if so parses the oldest one (which postmaster is no longer writing to), appending the statements to a separate update file for each database in the folder PG_SYNC. Only one log is consumed per run, but running the script from cron more often than the rotation age keeps it caught up.

In my case the destination directory is not writable by the pgsql user that runs the log parser, so I pre-created the update files in /home/system with write access for that user.
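A one-time setup along these lines (run as root) matches the permissions shown in the listing further down; the file names follow the databases being replicated:

cd /home/system
touch db_system_updates.sql db_roundcube_updates.sql
chown pgsql:wheel db_system_updates.sql db_roundcube_updates.sql
chmod 660 db_system_updates.sql db_roundcube_updates.sql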

Scheduling Updates

Now set up the pgsql crontab:

*/6    *       *       *       *      /var/pgsql/data/pg_log/parselog.rb

and watch for mail to be generated with the result of the operation:

$ mail
Mail version 8.1 6/6/93.  Type ? for help.
"/var/mail/pgsql": 1 message 1 new
Message 1:
From root@teisprint.net Thu Jan 04 14:36:00 2007
Envelope-to: pgsql@localhost
Delivery-date: Thu, 04 Jan 2007 14:36:00 -0500
From: root@teisprint.net (Cron Daemon)
To: pgsql@localhost.teisprint.net
Subject: Cron <pgsql@am2800--nb0> /var/pgsql/data/pg_log/parselog.rb
...
Date: Thu, 04 Jan 2007 14:36:00 -0500

Processing '/var/pgsql/data/pg_log/postgresql-2007-01-04_141500.log'
!--------------------!--!------------!----!--------------------!-----------------
--------!------------!----!-------------------------!--!------------!----!-------
-------------------!----------------------!------------!----!--------------------
--!------------!----!---------------------!------------!----!--------------------
!------------!----!------------------------!------------!----!-------------------
----!------------------------!-------------------------
done

Assuming that the slave database server has the update files (pushed over NFS, SSH, etc.), a crontab can be set up to apply the updates:

*/20    *       *       *       *      /home/system/bin/sync-db.sh

The update files waiting to be applied:

-rw-r--r--   1 pgsql   wheel      0 Jan  3 16:19 db_pgsql_updates.sql
-rw-rw----   1 pgsql   wheel   2016 Jan  4 14:48 db_roundcube_updates.sql
-rw-rw----   1 pgsql   wheel   6128 Jan  4 14:48 db_system_updates.sql
-rw-r--r--   1 pgsql   wheel      0 Jan  3 16:19 db_tei_updates.sql

My update script:

#!/bin/sh
psql -U system system < ~/db_system_updates.sql
cat /dev/null > ~/db_system_updates.sql
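If psql fails (for example, the database is unreachable), the unconditional truncate throws the pending updates away. A slightly more defensive variant, not part of the original setup, only clears the file when psql succeeds:

#!/bin/sh
# Stop at the first error and only clear the queue if psql succeeded
if psql -v ON_ERROR_STOP=1 -U system system < ~/db_system_updates.sql; then
    cat /dev/null > ~/db_system_updates.sql
fi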

PostgreSQL doesn't allow the database password to be passed on the command line; the passwords are stored in ~/.pgpass.
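Each line of ~/.pgpass has the form hostname:port:database:username:password, and the file must be readable only by its owner (chmod 600) or libpq will ignore it. An entry for this setup might look like this, with a placeholder password:

localhost:5432:system:system:secret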

Not Done Yet

Okay, it couldn't be that easy, could it? It's not. The problem is that this scheme assumes that an INSERT statement run against the same database on two different servers will have the same result. One case where this is not true is when a column is filled in automatically from a sequence. To deal with this, your application must be coded to set those values explicitly:

conn = dbh.prepare("SELECT nextval('customers_id_seq')")
conn.execute()

Then pass the result of conn.fetch[0] to the function that adds the records instead of relying on nextval() being executed by the column's DEFAULT expression.
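Put together with Ruby DBI the pattern looks something like this; the database, table, and values are only illustrative:

require 'dbi'

dbh = DBI.connect('DBI:Pg:system', 'system')

# Reserve the next id explicitly instead of leaving it to the column DEFAULT
sth = dbh.prepare("SELECT nextval('customers_id_seq')")
sth.execute
id = sth.fetch[0]
sth.finish

# The id is now part of the statement text, so replaying the logged INSERT
# on the slave produces the same row
dbh.do("INSERT INTO customers (id, name) VALUES (#{id}, 'Example Customer')")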

Last updated on March 27, 2015