2024-11-09 Setting Up HA PostgreSQL with repmgr
As mentioned previously, I'd like to setup a database with high availability for a webapp. I will try PostgreSQL first, based on the Nebula overlay network I just set up.
Search "PostgreSQL", "HA" and "automatic failover" and browse through the abundant results. For example this gives some comprehensive introduction. There are three major choices: PAF, Patroni, and repmgr. As the goal is to have some quick hands-on experience on high availability setup for PostgreSQL. Considering only open-source solutions with low resource demand, I will try repmgr as my first pick, also because PAF has the weakness of a single point of failure regarding its monitoring node.
Installation
Look for on-line resources first, and see if there are results that can help us gain insights, for example this and this. As I have used PostgreSQL in many project setups before, so I follow the Official Image site and create the docker-compose.yaml as:networks:
  mynet:
    name: mynet
    external: true
services:
  postgres:
    restart: unless-stopped
    image: postgres:16
    container_name: postgres
    environment:
      POSTGRES_PASSWORD: mypass
      # POSTGRES_USER: postgres  # default
      # POSTGRES_DB: postgres    # default $POSTGRES_USER
      PGUSER: postgres           # or psql will use 'root'
    networks:
      - mynet
    ports:
      - '5432:5432'
    volumes:
      - "./data:/var/lib/postgresql/data"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-d", "postgres", "-U", "postgres"]
      interval: '10s'
      timeout: '10s'
      retries: 5
      start_period: '20s'
  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080I choose the postgres:16 image instead of postgres:17-alpine because repmgr currently supports up to PostgreSQL 16 and it is easier to install using Debian-based images.
After starting with Docker Compose, we can shell into the postgres container to access the database using psql:
$ docker-compose exec -it postgres bash
abd6bf2309ad:/# psql -U postgres
psql (16.4 (Debian 16.4-1.pgdg120+2))
Type "help" for help.
postgres=#Or just use psql to connect to the database on the host if you have installed it:
psql -h localhost -U postgresEither way, you can change the password for postgres, which is the super user for the setup above:
postgres=# \password postgres
Enter new password: <new-password>
postgres=# \qWe can also access the adminer site at http://192.168.x.x:8080 , here assuming the host is on the local LAN:

In actual setup, I may use another port instead of 5432 and skip adminer .
Via the VPN Overlay
Since I have created the mesh VPN  192.168.111.0/24 using Nebula, I'd like to  make sure I can connect to the database from other nebula nodes, one of which is behind NAT. However,
psql -h 192.168.111.x -U postgreshas no response. After checking the firewall settings on the machines, it turns out the firewall settings of Nebula I set up before are too strict, so I change the firewall rules in the configuration config.yml:
firewall:
  ...
  inbound:
    # Allow icmp between any nebula hosts
    - port: any
      proto: icmp
      host: any
    - port: any
      proto: tcp
      host: anyAfter restarting the network, the command above works.
Later I will re-design the Nebula nodes to have proper groups for better firewall setup. I will also bind the database port to the IP 192.168.111.x to make sure all database traffic get through the overlay network.repmgr
Now come to high availability. I use the same docker-compose.yaml on the nebula nodes – one of which is behind NAT. So assuming we have three nodes which we are going to install PostgreSQL HA using docker-compse.yaml:
node1 192.168.111.1 - primary
node2 192.168.111.2 - standby
node3 192.168.111.3 - standbyLog in to each node and see if you can ping each other via the 192.168.111.0/24 network.
On node1 which we will install the primary database:
mkdir postgres-ha
cd postgres-haTo get started, we can not use the docker-compose.yaml described above directly, since we need to install repmgr package in the container. Docker Compose will remove the containers and associated anonymous volumes once docker-compose down is executed. Although we can use docker-compose stop to avoid this, but it is not safe. I type docker-compose down too often.
For this, we need to create a image. See this for basic procedure. So first create a Dockerfile in the folder:
FROM postgres:16
RUN apt update -y && \
    apt install  -y \
      iputils-ping net-tools vim \
      postgresql-16-repmgrand create the docker-compose.yaml file:
networks:
  mynet:
    name: mynet
    external: true
services:
  postgres-ha:
    restart: unless-stopped
    build: .
    container_name: postgres-ha
    environment:
      POSTGRES_PASSWORD: mypass
      PGUSER: postgres
    networks:
      - mynet
    ports:
      - '5432:5432'
    volumes:
      - "./data:/var/lib/postgresql/data"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-d", "postgres", "-U", "postgres"]
      interval: '10s'
      timeout: '10s'
      retries: 5
      start_period: '20s'Launch the container docker-compose up -d, and shell into it, log in as postgres and create both the role repmgr and database repmgr :
$ docker-compose exec -it postgres-ha bash
root:/# su postgres
postgres:/$ createuser -s repmgr
postgres:/$ createdb repmgr -O repmgr
Edit the postgresql.conf file at /var/lib/postgres/data/ with the following changes:
shared_preload_libraries = 'repmgr'
wal_level = replica
archive_mode = on
archive_command = '/bin/true'
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on
listen_addresses = '*'Edit pg_hba.conf for the part of entries like this:
# "local" is for Unix domain socket connections only
local   all           all                                       peer
# IPv4 local connections:
host    all           all               127.0.0.1/32            scram-sha-256
# IPv6 local connections:
host    all           all               ::1/128                 scram-sha-256
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication   repmgr                                    trust
host    replication   repmgr            127.0.0.1/32            trust
host    replication   repmgr            192.168.111.0/24           trust
local   repmgr        repmgr                                    trust
host    repmgr        repmgr            127.0.0.1/32            trust
host    repmgr        repmgr            192.168.111.0/24           trustThe creation of the rolerepmgr, databaserepmgr, and the two configuration files only needs to be done on the primary database. Later they will be "cloned" from other nodes as standbys. Thepg_hba.conffile contains specification of client authorization. Later we will see it allows local or remote nodes to connect to the database without password.
Now we can stop the container:
docker-compose downWe need to create another configuration file – repmgr.conf in the /etc folder. Other places are also OK. We just need to make it persist. Create a local etc folder
mkdir etcand create the repmgr.conf file in it:
node_id=1
node_name=node1
conninfo='host=192.168.111.1 port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
failover=automatic
promote_command='repmgr -f /etc/repmgr.conf standby promote'
follow_command='repmgr -f /etc/repmgr.conf standby follow'
repmgrd_service_start_command='repmgrd -f /etc/repmgr.conf -d'
repmgrd_service_stop_command='kill $(cat /tmp/repmgrd.pid)'We need to bind this file to /etc/repmgr.conf in the container, so update the docker-compose.yaml :
...
  volumes:
    - "./data:/var/lib/postgresql/data"
    - type: bind
      source: ./etc/repmgr.conf
      target: /etc/repmgr.conf
...We can start the container now, and shell into it again:
docker-compose up -d
docker-compose exec -it postgres-ha bashcat /etc/repmgr.conf to see if it is bound correctly, and fix the ownership of /etc/repmgr.conf (may not be necessary):
chown root:root /etc/repmgr.conf
chmod a+r /etc/repmgr.confAt this stage make sure we can connect to the database with both user and database asrepmgr without password:
root:/# psql -h 192.168.111.1 -U repmgr -d repmgr
psql (16.4 (Debian 16.4-1.pgdg120+2))
Type "help" for help.
repmgr=# \qFinally we can start the primary server. First log in as postgres:
su postgresThen run the command:
postgres:/$ repmgr -f /etc/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registeredTo see if things go well, run the command below to see whether node1 is running as the primary:
postgres:/$ repmgr -f /etc/repmgr.conf cluster show
ID | Name   | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
...
 1 | node1  | primary | * running |          | default  | 100      | 1        | host=192.168.111.1 port=5432 user=repmgr dbname=repmgr connect_timeout=2Set up the Standby Database
Now that the primary database is running. Log on to the next node node2 and create the folder:
mkdir postgres-ha
cd postgres-haAlso copy the Dockerfile, docker-compose.yaml and etc/repmgr.conf files over. However, we need change the docker-compose.yaml 
services:
  postgres-ha:
    restart: unless-stopped
    build: .
    entrypoint: 'sleep infinity'
    container_name: postgres-ha
    ...That is, add an entry entrypoint: 'sleep infinity' to prevent PostgreSQL from running. We should modify etc/repmgr.conf too:
node_id=2
node_name=node2
conninfo='host=192.168.111.2 port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data'
failover=automatic
promote_command='repmgr -f /etc/repmgr.conf standby promote'
follow_command='repmgr -f /etc/repmgr.conf standby follow'
repmgrd_service_start_command='repmgrd -f /etc/repmgr.conf -d'
repmgrd_service_stop_command='kill $(cat /tmp/repmgrd.pid)'Now start the container and shell into it:
docker-compose up -d
docker-compose exec -it postgres-ha bashMake sure PostgreSQL is not running:
root:/# psql
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
	Is the server running locally and accepting connections on that socket?and that we can connect to the primary database without password:
root:/# psql -h 192.168.111.1 -p 5432 -U repmgr -d repmgr
psql (16.4 (Debian 16.4-1.pgdg120+2))
Type "help" for help.
repmgr=# \qIf you cannot connect, it is probably the settings inpg_hba.confon the primary database. Check if192.168.111.0/24is trusted.
Let's fix the ownership of the data folder /var/lib/postgresql/data first, make sure it is empty, and similarly for /etc/repmgr.conf:
chown postgres:postgres /var/lib/postgresql/data
ls /var/lib/postgresql/data
chown root:root /etc/repmgr.conf
chmod a+r /etc/repmgr.confAt this point we want to log in as postgres and clone the database from the primary (without local PostgresSQL running):
su postgresthen test if it is ready to clone:
repmgr -h 192.168.111.1 -p 5432 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone --dry-runIf OK, the message looks like:
...
INFO: would execute:
  pg_basebackup -l "repmgr base backup"  -D /var/lib/postgresql/data -h 192.168.111.1 -p 5432 -U repmgr -X stream
INFO: all prerequisites for "standby clone" are metand we can run the same command without the last --dry-run flag:
postgres:~$ repmgr -h 192.168.111.1 -p 5432 -U repmgr -d repmgr -f /etc/repmgr.conf standby clone
NOTICE: destination directory "/var/lib/postgresql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.111.1 port=5432 user=repmgr dbname=repmgr
...
HINT: after starting the server, you need to register this standby with "repmgr standby register"
postgres:~$At this stage, the database is cloned over. We can stop the container now:
docker-comppose downComment out the line entrypoint: 'sleep infinity' in docker-compose.yaml, and restart the container (be sure to add the --build flag to force a rebuild):
docker-compose up -d --buildShell into the container and you should be able to connect to local database without password:
$ docker-compose exec -it postgres-ha bash
root:/# psql
psql (16.4 (Debian 16.4-1.pgdg120+2))
Type "help" for help.
postgres=# \q
root:/# psql -h 192.168.111.2 -p 5432 -U repmgr -d repmgr
psql (16.4 (Debian 16.4-1.pgdg120+2))
Type "help" for help.
repmgr=# \qNow we can register to the cluster as a standby:
su postgresThen
postgres:/$ repmgr -f /etc/repmgr.conf standby register
INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered
postgres:/$Check the cluster status with:
postgres:/$ repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status     | Upstream | Location | Priority | Timeline | Connection string
  2 | node2 | standby |   running  | node1    | default  | 100      | 1        | host=192.168.111.2 port=5432 user=repmgr dbname=repmgr connect_timeout=2
  1 | node1 | primary | * running  |          | default  | 100      | 1        | host=192.168.111.1 port=5432 user=repmgr dbname=repmgr connect_timeout=2To Continue
The third node is set up similarly to node2, so that the cluster show command above should show three entries – one primary and two standbys. We need to make sure automatic failover is functioning:
- docker-compose downone standby and see the status, and how the overall cluster works, then turn it back on.
- docker-compose downthe primary, and see if a standby is promoted as a new primary, then try to turn the server back on.
In the second case what would happen? In my case there would be two primaries running, which does not seem OK. So I turn it off again and search the web for clues. It turns out all nodes are equal – once a node become a primary, the off-lined nodes should join as standbys (without starting PostgreSQL directly). Therefore if the repmgr node rejoin command does not work for some reason, one probably needs to repeat the above process from scratch (meaning backup the data folder elsewhere and empty it) and join as a standby.The setup so far is only part of the overall HA architecture. One big question is – how do clients access the cluster in a reliable and seamless way. There are more pieces to the puzzle – client-side multi-host connections and disconnect handling, adding tools such as PgPool or PgBouncer, etc. All these need to be considered and can only be left as future exercises.
