PostgreSQL High Availability on Single Hardware for Prototype and Learning

When students and engineers first encounter the concept of high availability, they often face a chicken-and-egg problem: they need to understand distributed systems to build them, but they need to build them to understand them. Running a PostgreSQL cluster on a single server solves this dilemma—it’s a learning laboratory where you can experiment with real HA concepts without the complexity and cost of managing multiple physical machines.

This article documents that learning journey, examining whether single-node clustering is a legitimate educational tool or just architectural theater that teaches bad habits.

1. The Educational Paradox

1.1 The Dilemma of Learning High Availability

Every database course teaches that high availability requires eliminating single points of failure. Yet for someone building their first HA system—whether for a university project, a proof of concept, or personal learning—procuring three physical servers (or even three cloud instances) is impractical.

This creates a fundamental learning barrier. How do you understand Patroni’s failover logic without seeing it fail over? How do you grasp replication lag without monitoring it? How do you learn to debug cluster state without having a cluster?

The reality is stark: running three PostgreSQL instances on one physical server shares every failure domain—power supply, CPU, memory, disk controller, network interface. If the server dies, everything dies together. From a pure HA perspective, this is pointless.

But from a learning perspective? It’s invaluable.

1.2 What We’re Actually Learning About

Before diving into implementation, let’s be honest about what single-node clustering teaches versus what it doesn’t:

Failure Type          | Example                  | Single Instance | Multi-Process Cluster      | What You Learn
----------------------+--------------------------+-----------------+----------------------------+--------------------------------
Hardware              | Disk failure, power loss | ✗               | ✗                          | Nothing—both fail
Process crash         | Segfault, OOM kill       | ✗ (5-30 min)    | ✓ (30 sec)                 | Automatic failover mechanics
Connection exhaustion | max_connections hit      | ✗               | ✓ (replicas available)     | Load distribution patterns
Query of death        | Recursive CTE crash      | ✗               | ✓ (failover)               | Blast radius containment
Config error          | Wrong parameter value    | ✗               | ✓ (replicas unaffected)    | Configuration management
Maintenance error     | Failed schema migration  | ✗               | ✓ (rollback via failover)  | Safe deployment strategies
Data corruption       | Shared disk corruption   | ✗               | ✗                          | Backup importance (hard lesson)

The pattern reveals something important: this setup teaches software-level resilience (roughly 60% of realistic failure scenarios) while doing nothing for hardware failures (the remaining 40%). For a prototype or demonstration environment, that trade is actually perfect—you learn the complex orchestration parts before worrying about physical infrastructure.

Think of it like learning to drive with a driving simulator. You’re not learning about tire mechanics or engine repair, but you are learning steering, braking, and navigation—skills that transfer directly to real driving.

2. The Architecture: Understanding Through Building

2.1 PostgreSQL’s Multi-Process Foundation

The first lesson in this laboratory is understanding why PostgreSQL can do this at all. PostgreSQL’s process model is pedagogically perfect for demonstrating isolation concepts.

Imagine you’re explaining processes to a student. You can show them this:

Your Laptop (16GB RAM, 8 CPU cores)
    ├── PostgreSQL Instance 1 (port 5432)
    │   ├── Postmaster process (PID 1001)
    │   ├── Backend processes (PID 1002-1050)
    │   ├── WAL writer (PID 1051)
    │   └── Memory: 2GB shared_buffers
    │
    ├── PostgreSQL Instance 2 (port 5433)
    │   ├── Postmaster process (PID 2001)
    │   ├── Backend processes (PID 2002-2050)
    │   └── Memory: 2GB shared_buffers
    │
    └── PostgreSQL Instance 3 (port 5434)
        ├── Postmaster process (PID 3001)
        └── Memory: 2GB shared_buffers

You can then demonstrate process isolation in real-time:

# Kill the primary's postmaster
$ kill -9 1001

# Watch the other instances keep running
$ ps aux | grep postgres
# Instances 2 and 3 still alive!

This is a tangible demonstration of operating system process isolation. Students can see with their own eyes that Instance 1 crashing doesn’t affect Instances 2 and 3. They’re sharing hardware, but the OS keeps them separate.

Try teaching this concept with a threaded database. You can’t. There’s no way to demonstrate thread isolation at this level because a catastrophic thread failure can corrupt shared memory and take everything down.

2.2 Patroni: Learning Distributed Coordination

The next learning objective is understanding how distributed systems coordinate. Patroni provides an excellent teaching tool because its configuration is transparent and its behavior is observable.

For a demonstration environment, you run three Patroni instances as separate processes:

$ ps aux | grep patroni

student  1001  0.5  1.2  python3 /usr/local/bin/patroni /etc/patroni/patroni1.yml
student  2001  0.4  1.1  python3 /usr/local/bin/patroni /etc/patroni/patroni2.yml
student  3001  0.4  1.1  python3 /usr/local/bin/patroni /etc/patroni/patroni3.yml

The beauty of this setup for learning is that all communication happens on localhost. You can watch network traffic, debug HTTP calls, and trace every interaction without dealing with firewall rules, network latency, or SSH key management.

Here’s what the coordination looks like in your demo environment:

Your Laptop - All on 127.0.0.1

etcd (port 2379)
    ↑ ↑ ↑
    | | |___ Patroni3 (REST API :8010) → PostgreSQL :5434
    | |_____ Patroni2 (REST API :8009) → PostgreSQL :5433
    |_______ Patroni1 (REST API :8008) → PostgreSQL :5432

For teaching purposes, the configuration demonstrates key distributed systems concepts while remaining simple:

# /etc/patroni/patroni1.yml
scope: postgres-cluster    # Logical cluster identifier
namespace: /db/            # Key-value namespace
name: node1                # Unique node identifier

restapi:
  listen: 127.0.0.1:8008   # Different per instance
  connect_address: 127.0.0.1:8008

etcd:
  host: 127.0.0.1:2379     # Coordination service

postgresql:
  listen: 127.0.0.1:5432   # Different per instance
  connect_address: 127.0.0.1:5432
  data_dir: /var/lib/postgresql/data1  # Isolated storage
  
  authentication:
    replication:
      username: replicator
      password: rep_pass
    superuser:
      username: postgres
      password: super_pass

Students can create all three configs by copying this file and changing only a handful of lines: the node name, the REST API and PostgreSQL listen/connect addresses, and the data directory. This teaches an important principle: distributed systems configuration should be templatable. When you eventually move to three physical servers, you change the same handful of values (replacing 127.0.0.1 with actual IPs).
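
To make the templating point concrete, here is a minimal sketch of a generator script (hypothetical, not part of the setup) that produces all three node configs from one Python dictionary. It assumes PyYAML is installed and writes the files into the current directory for copying to /etc/patroni/:

# generate_patroni_configs.py (hypothetical helper; pip install pyyaml)
import copy
import yaml

TEMPLATE = {
    'scope': 'postgres-cluster',
    'namespace': '/db/',
    'etcd': {'host': '127.0.0.1:2379'},
    'postgresql': {
        'authentication': {
            'replication': {'username': 'replicator', 'password': 'rep_pass'},
            'superuser': {'username': 'postgres', 'password': 'super_pass'},
        },
    },
}

for i in (1, 2, 3):
    cfg = copy.deepcopy(TEMPLATE)
    cfg['name'] = f'node{i}'
    rest_api = f'127.0.0.1:{8007 + i}'   # 8008, 8009, 8010
    postgres = f'127.0.0.1:{5431 + i}'   # 5432, 5433, 5434
    cfg['restapi'] = {'listen': rest_api, 'connect_address': rest_api}
    cfg['postgresql'].update({
        'listen': postgres,
        'connect_address': postgres,
        'data_dir': f'/var/lib/postgresql/data{i}',
    })
    with open(f'patroni{i}.yml', 'w') as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    print(f'wrote patroni{i}.yml')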

Using systemd with a service template demonstrates infrastructure-as-code concepts:

# /etc/systemd/system/patroni@.service
[Unit]
Description=Patroni PostgreSQL Cluster Manager (instance %i)
After=network.target etcd.service

[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni%i.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then starting the cluster teaches automation:

$ sudo systemctl enable patroni@{1,2,3}
$ sudo systemctl start patroni@{1,2,3}

# Check status
$ sudo systemctl status patroni@1
● patroni@1.service - Patroni (instance 1)
   Active: active (running)

Three services managed with one command—this is infrastructure thinking.

2.3 Understanding Patroni’s Logic

For students learning distributed systems, understanding Patroni’s behavior is crucial. Here’s the simplified logic each instance runs:

# Runs in a loop every 10 seconds

while True:
    # 1. Health check - is my database responding?
    pg_status = check_postgres_health()
    
    # 2. Heartbeat - tell others I'm alive (30-second TTL)
    etcd.set(f'/db/postgres-cluster/members/{my_name}', {
        'state': pg_status,
        'role': 'leader' if is_primary else 'replica',
        'xlog_location': get_wal_position()
    }, ttl=30)
    
    # 3. Leadership check - is there a leader?
    leader_key = etcd.get('/db/postgres-cluster/leader')
    
    # 4. Leader election - if no leader and I'm healthy, try to lead
    if not leader_key and pg_status == 'running':
        try:
            etcd.create('/db/postgres-cluster/leader', my_name, ttl=30)
            run_command('pg_ctl promote')
        except KeyAlreadyExists:
            pass  # Another node won the race
    
    # 5. Follower configuration - if someone else is leader, follow them
    if leader_key and leader_key != my_name:
        setup_replication_to(leader_key)
    
    sleep(10)

This code demonstrates several distributed systems concepts:

  • Heartbeating: Nodes prove they’re alive by periodically updating state
  • TTLs (Time To Live): If heartbeats stop, the node is considered dead
  • Leader election: Using atomic operations to elect a coordinator
  • Race conditions: Handling multiple nodes trying to become leader simultaneously
  • State machines: Nodes transition between follower and leader roles
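
The leader-election race in particular is easy to reproduce without any database at all. The toy sketch below stands in for etcd with a lock-protected dictionary: three threads race to create the leader key, and exactly one wins:

# election_race.py (toy demonstration only, no etcd involved)
import random
import threading
import time

store = {}                      # stands in for etcd's key space
store_lock = threading.Lock()   # stands in for etcd's atomicity guarantee

def atomic_create(key, value):
    """Create the key only if it does not already exist (like etcd's atomic create)."""
    with store_lock:
        if key in store:
            return False        # somebody else already holds the key
        store[key] = value
        return True

def node(name):
    time.sleep(random.random() / 100)   # nodes notice the missing leader at slightly different times
    if atomic_create('/db/postgres-cluster/leader', name):
        print(f'{name}: won the election, promoting myself')
    else:
        print(f'{name}: lost the race, following {store["/db/postgres-cluster/leader"]}')

threads = [threading.Thread(target=node, args=(f'node{i}',)) for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()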

In a demo environment, you can observe this in real-time:

# Watch the heartbeats
$ watch -n 1 'etcdctl get --prefix /db/postgres-cluster/'

# You'll see keys being updated every 10 seconds
/db/postgres-cluster/leader → "node1"
/db/postgres-cluster/members/node1 → {"state": "running", "role": "master"}
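
The same state is visible through each node's Patroni REST API, which some students find easier to script against than etcdctl. A minimal poller, assuming the requests library and the REST ports from this setup (the exact JSON fields can vary between Patroni versions):

# role_watcher.py (sketch; pip install requests)
import time
import requests

NODES = {'node1': 8008, 'node2': 8009, 'node3': 8010}

while True:
    status = []
    for name, port in NODES.items():
        try:
            info = requests.get(f'http://127.0.0.1:{port}/patroni', timeout=2).json()
            status.append(f"{name}={info.get('role', '?')}/{info.get('state', '?')}")
        except requests.RequestException:
            status.append(f'{name}=unreachable')
    print(time.strftime('%H:%M:%S'), '  '.join(status))
    time.sleep(2)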

2.4 Resource Overhead: Learning System Constraints

An important learning objective is understanding resource constraints. Running three instances on one machine makes trade-offs visible:

Component      | Processes | RAM Usage   | CPU Usage
---------------+-----------+-------------+----------
PostgreSQL x3  | 9         | 6-8 GB      | 40-60%
Patroni x3     | 3         | 200-300 MB  | 2-5%
etcd           | 1         | 100-200 MB  | 1-2%
HAProxy        | 1         | 30-50 MB    | 1-2%
Total          | 14        | 6.5-8.5 GB  | 44-69%
Remaining      |           | 7.5-9.5 GB  | 31-56%

Students can monitor this with standard Linux tools:

# View all processes
$ ps aux | grep -E 'postgres|patroni|etcd|haproxy' | grep -v grep

# Check memory
$ free -h
              total        used        free
Mem:           16Gi        8.2Gi        7.8Gi

# Monitor CPU
$ top -bn1 | grep -E 'postgres|patroni'

This teaches resource budgeting: you can’t just add infinite replicas because each one consumes real resources.
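
The same accounting can be scripted. A minimal sketch, assuming the psutil package (pip install psutil), sums resident memory per component by matching process names:

# resource_budget.py (sketch; pip install psutil)
from collections import defaultdict
import psutil

GROUPS = ('postgres', 'patroni', 'etcd', 'haproxy')
rss = defaultdict(int)
count = defaultdict(int)

for proc in psutil.process_iter(['name', 'cmdline', 'memory_info']):
    mem = proc.info.get('memory_info')
    if mem is None:
        continue  # process vanished or is not readable
    cmdline = ' '.join(proc.info.get('cmdline') or [proc.info.get('name') or ''])
    for group in GROUPS:
        if group in cmdline:
            rss[group] += mem.rss
            count[group] += 1
            break

for group in GROUPS:
    print(f"{group:<10} {count[group]:>3} processes  {rss[group] / 1024 ** 2:8.1f} MB resident")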

2.5 Streaming Replication: Watching Data Flow

One of the most powerful learning experiences is watching data replicate in real-time. PostgreSQL’s physical replication makes data flow visible.

First, you set up the replication user (Patroni automates this, but seeing it manually teaches the concept):

# On the primary
$ psql -h 127.0.0.1 -p 5432 -U postgres
postgres=# CREATE USER replicator REPLICATION LOGIN PASSWORD 'rep_pass';

Then you can observe the replication connections:

# On primary: Check who's connected for replication
$ psql -h 127.0.0.1 -p 5432 -U postgres -c "
SELECT 
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;"

 application_name | client_addr | state     | lag_mb 
------------------+-------------+-----------+--------
 node2            | 127.0.0.1   | streaming | 0.02
 node3            | 127.0.0.1   | streaming | 0.01

Now you can demonstrate replication in action:

# Write to primary
$ psql -h 127.0.0.1 -p 5432 -U postgres -c "
CREATE TABLE demo (id SERIAL, data TEXT, created_at TIMESTAMP DEFAULT NOW());
INSERT INTO demo (data) VALUES ('Hello from primary');"

# Read from replica (appears within milliseconds)
$ psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT * FROM demo;"
 id | data                | created_at
----+--------------------+----------------------------
  1 | Hello from primary | 2025-10-06 14:32:15.123456

# Try to write to replica (demonstrates read-only enforcement)
$ psql -h 127.0.0.1 -p 5433 -U postgres -c "INSERT INTO demo (data) VALUES ('test');"
ERROR:  cannot execute INSERT in a read-only transaction

This is pedagogically gold. Students can see:

  1. Data written to one database appears in another
  2. Replicas are read-only
  3. Replication happens almost instantly (lag_mb shows it)
  4. All of this on localhost without network complexity
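
To put a number on "almost instantly", a short script can write a marker row on the primary and poll a replica until it appears. This is only a sketch: it assumes psycopg2 and the superuser credentials from the Patroni config, and it creates a hypothetical repl_probe table just for the measurement:

# replication_latency.py (sketch; pip install psycopg2-binary)
import time
import uuid
import psycopg2

DSN = "host=127.0.0.1 dbname=postgres user=postgres password=super_pass"

primary = psycopg2.connect(DSN + " port=5432")
primary.autocommit = True
replica = psycopg2.connect(DSN + " port=5433")
replica.autocommit = True

with primary.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS repl_probe (marker TEXT, ts TIMESTAMPTZ DEFAULT now())")
time.sleep(1)  # let the CREATE TABLE itself replicate before we start timing

marker = str(uuid.uuid4())
with primary.cursor() as cur:
    start = time.monotonic()
    cur.execute("INSERT INTO repl_probe (marker) VALUES (%s)", (marker,))

# Poll the replica until the marker row becomes visible
while True:
    with replica.cursor() as cur:
        cur.execute("SELECT 1 FROM repl_probe WHERE marker = %s", (marker,))
        if cur.fetchone():
            break
    time.sleep(0.001)

print(f"row visible on the replica after {(time.monotonic() - start) * 1000:.1f} ms")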

2.6 etcd: The Coordination Service

Understanding distributed coordination is essential for systems engineers. etcd provides a simple key-value store that demonstrates coordination primitives.

For a demo setup:

# Install and configure etcd
$ sudo apt-get install etcd

# /etc/default/etcd
ETCD_NAME="demo-node"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_CLIENT_URLS="http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://127.0.0.1:2379"

# Start and verify
$ sudo systemctl start etcd
$ curl http://127.0.0.1:2379/version
{"etcdserver":"3.4.13"}

Students can then explore how Patroni uses etcd:

# View cluster state
$ etcdctl get --prefix /db/postgres-cluster/

/db/postgres-cluster/leader → "node1"
/db/postgres-cluster/members/node1 → {
    "conn_url": "postgres://127.0.0.1:5432/postgres",
    "state": "running",
    "role": "master",
    "timeline": 1
}
/db/postgres-cluster/members/node2 → {
    "conn_url": "postgres://127.0.0.1:5433/postgres",
    "state": "running",
    "role": "replica"
}

This teaches key-value store concepts, TTLs, and how distributed applications use coordination services.
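
A tiny watcher built on the same etcdctl command makes leader changes impossible to miss during the failover demo in the next section. This is only a sketch; it shells out to etcd v3's etcdctl and assumes the key layout shown above:

# leader_watch.py (sketch; shells out to etcd v3's etcdctl)
import subprocess
import time

KEY = '/db/postgres-cluster/leader'
last = None

while True:
    result = subprocess.run(
        ['etcdctl', 'get', KEY, '--print-value-only'],
        capture_output=True, text=True,
    )
    leader = result.stdout.strip() or '(no leader)'
    if leader != last:
        print(f"{time.strftime('%H:%M:%S')}  leader -> {leader}")
        last = leader
    time.sleep(1)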

2.7 Simulating Failover: The Ultimate Demo

The most impressive demonstration is simulating a failure and watching automatic recovery. Here’s the timeline students can observe:

# Terminal 1: Watch cluster state
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"

# Terminal 2: Watch etcd
$ watch -n 1 "etcdctl get --prefix /db/postgres-cluster/leader"

# Terminal 3: Kill the primary
$ sudo systemctl stop patroni@1

What happens (with timestamps):

T=0s:   Primary crashes (node1:5432 stops)

T=10s:  Patroni1 fails to refresh its heartbeat

T=30s:  etcd expires the /leader key (TTL exceeded)

T=31s:  Patroni2 and Patroni3 both detect: no leader!

T=32s:  Both try to acquire leadership (race condition demo)

T=33s:  etcd's atomic compare-and-swap gives lock to node2

T=34s:  Patroni2 promotes its PostgreSQL
        $ pg_ctl promote -D /var/lib/postgresql/data2

T=38s:  node2 completes promotion, becomes new primary

T=39s:  node3 reconfigures to replicate from node2

T=40s:  HAProxy health check detects node2 is primary

T=41s:  HAProxy routes port 5000 traffic to node2:5433

Total automated failover: ~40 seconds

Students can watch this entire sequence unfold in real-time across multiple terminals. It’s a distributed systems ballet happening on localhost.

Visual state changes they can observe:

Before:
    etcd: /leader → "node1"
    HAProxy :5000 → 127.0.0.1:5432
    psql -h localhost -p 5000  # works

During (T=30-38):
    etcd: /leader → (expired, then) "node2"
    HAProxy :5000 → No healthy backend
    psql -h localhost -p 5000  # connection refused (brief)

After:
    etcd: /leader → "node2"
    HAProxy :5000 → 127.0.0.1:5433
    psql -h localhost -p 5000  # works again, different backend
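
Students can also capture the outage window numerically. The probe below is a sketch (psycopg2 and the superuser credentials from the Patroni config assumed) that connects through HAProxy's write port once a second and reports when service drops and returns:

# failover_probe.py (sketch; pip install psycopg2-binary)
import time
import psycopg2

DSN = ("host=127.0.0.1 port=5000 dbname=postgres "
       "user=postgres password=super_pass connect_timeout=2")
outage_start = None

while True:
    try:
        conn = psycopg2.connect(DSN)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT inet_server_port(), pg_is_in_recovery()")
                backend_port, in_recovery = cur.fetchone()
        finally:
            conn.close()
        if outage_start is not None:
            print(f"service restored after {time.monotonic() - outage_start:.1f}s, "
                  f"now served by backend port {backend_port} (in_recovery={in_recovery})")
            outage_start = None
    except psycopg2.OperationalError:
        if outage_start is None:
            outage_start = time.monotonic()
            print("write endpoint down, starting the clock...")
    time.sleep(1)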

2.8 HAProxy: Teaching Load Balancing

HAProxy demonstrates health-check based routing. The configuration teaches proxy patterns:

# /etc/haproxy/haproxy.cfg

global
    maxconn 4096
    log /dev/log local0

defaults
    mode tcp
    timeout connect 5s
    timeout client 50s
    timeout server 50s

# Primary endpoint (read-write)
listen postgres_primary
    bind *:5000
    option tcplog
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    
    server node1 127.0.0.1:5432 maxconn 100 check port 8008
    server node2 127.0.0.1:5433 maxconn 100 check port 8009
    server node3 127.0.0.1:5434 maxconn 100 check port 8010

# Replica endpoint (read-only)
listen postgres_replicas
    bind *:5001
    option tcplog
    option httpchk OPTIONS /replica
    http-check expect status 200
    balance roundrobin
    
    server node1 127.0.0.1:5432 maxconn 100 check port 8008
    server node2 127.0.0.1:5433 maxconn 100 check port 8009
    server node3 127.0.0.1:5434 maxconn 100 check port 8010

# Statistics dashboard
listen stats
    bind *:7000
    mode http
    stats enable
    stats uri /

Students can observe health checks in action:

# HAProxy checks Patroni REST API every 3 seconds

# Node is primary:
$ curl http://127.0.0.1:8008/master
HTTP/1.0 200 OK

# Node is replica:
$ curl http://127.0.0.1:8008/master
HTTP/1.0 503 Service Unavailable

They can also write application code that uses the proxy:

# demo_app.py
import psycopg2

# Write connection always goes to current primary
write_conn = psycopg2.connect(
    host='localhost',
    port=5000,  # HAProxy routes to whoever is primary
    database='demo',
    user='demo_user'
)

# Read connection load-balances across replicas
read_conn = psycopg2.connect(
    host='localhost',
    port=5001,  # HAProxy round-robins to replicas
    database='demo',
    user='demo_user'
)

# Write
with write_conn.cursor() as cur:
    cur.execute("INSERT INTO demo (data) VALUES ('test')")
write_conn.commit()

# Read (might go to different replica each time)
with read_conn.cursor() as cur:
    cur.execute("SELECT * FROM demo")
    print(cur.fetchall())

The application doesn’t need to know which specific database is primary. HAProxy handles that complexity.
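
One thing the application does have to handle is the brief connection error while HAProxy switches backends during a failover. A common pattern is a small retry loop around writes; this is a sketch using psycopg2 and the demo credentials above, with arbitrary retry counts:

# retry_write.py (sketch; retry counts are arbitrary)
import time
import psycopg2

def run_write(sql, params=None, retries=5, delay=2):
    """Execute a write through HAProxy's primary port, retrying across a failover window."""
    last_error = None
    for _ in range(retries):
        try:
            conn = psycopg2.connect(host='localhost', port=5000,
                                    dbname='demo', user='demo_user',
                                    connect_timeout=2)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                conn.commit()
                return
            finally:
                conn.close()
        except psycopg2.OperationalError as exc:
            last_error = exc        # old primary gone or new one not ready yet
            time.sleep(delay)
    raise last_error

run_write("INSERT INTO demo (data) VALUES (%s)", ("written despite a failover",))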

3. Real Failure Scenarios: Educational Value

3.1 Demonstrating Process Crash Recovery

In a learning environment, you can safely crash processes to understand recovery:

# Find the primary's postmaster PID
$ ps aux | grep "postgres -D /var/lib/postgresql/data1" | grep -v grep
postgres  1001  0.5  /usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/data1

# Kill it violently (simulates a hard crash)
$ kill -9 1001

What students observe:

Single-Instance Behavior (for comparison):

# Postmaster is gone, so every backend process exits with it
# All connections dropped immediately
# Database stays down until someone notices and restarts it (5-30 minutes in practice)
# Plus 30-60 seconds of crash recovery once it comes back up
# Manual investigation required

Multi-Process Cluster Behavior:

# Primary crashes at T=0

T=10s:  Patroni health check fails on node1
        [ERROR] PostgreSQL not responding

T=30s:  Leader key expires in etcd

T=32s:  node2 wins election, promotes to primary
        [INFO] Promoting PostgreSQL instance

T=40s:  HAProxy detects new primary
        [INFO] Routing :5000 to 127.0.0.1:5433

# Applications automatically reconnect
# Downtime: 25-40 seconds, fully automated

3.2 Connection Exhaustion Demo

This demonstrates resource limits:

# Set low connection limit for demo
$ psql -h 127.0.0.1 -p 5432 -c "ALTER SYSTEM SET max_connections = 10"
$ sudo systemctl restart patroni@1

# Exhaust connections
$ for i in {1..10}; do
    psql -h 127.0.0.1 -p 5000 -c "SELECT pg_sleep(300)" &
done

# Try one more
$ psql -h 127.0.0.1 -p 5000 -c "SELECT 1"
FATAL:  sorry, too many clients already

With a cluster:

# Primary at :5000 is full, but replicas at :5001 still work
$ psql -h 127.0.0.1 -p 5001 -c "SELECT 1"
 ?column? 
----------
        1

This teaches about read/write separation and capacity planning.
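
In application code, the same lesson becomes a read path that falls back to the replica endpoint when the primary refuses new connections. A sketch, assuming psycopg2 and the demo credentials used earlier:

# read_fallback.py (sketch; ports from the HAProxy config)
import psycopg2

def read_query(sql, params=None):
    """Try the primary endpoint first, fall back to the replicas if it refuses connections."""
    for port in (5000, 5001):       # 5000 = primary, 5001 = load-balanced replicas
        try:
            conn = psycopg2.connect(host='localhost', port=port,
                                    dbname='demo', user='demo_user',
                                    connect_timeout=2)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except psycopg2.OperationalError:
            continue                # e.g. "too many clients already"; try the next endpoint
    raise RuntimeError("no endpoint could serve the query")

print(read_query("SELECT count(*) FROM demo"))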

3.3 Configuration Error Isolation

Demonstrate blast radius:

# Misconfigure only node1
$ psql -h 127.0.0.1 -p 5432 -c "ALTER SYSTEM SET shared_buffers = '999GB'"
$ sudo systemctl restart patroni@1
# node1 fails to start (impossible memory requirement)

# But node2 and node3 are unaffected
$ patronictl -c /etc/patroni/patroni1.yml list
| Member | Host           | Role    | State   |
+--------+----------------+---------+---------+
| node1  | 127.0.0.1:5432 | Replica | crashed |
| node2  | 127.0.0.1:5433 | Leader  | running |
| node3  | 127.0.0.1:5434 | Replica | running |

Students learn that configuration changes should be tested on replicas first.

4. Understanding Limitations: Honest Education

4.1 What This Setup Cannot Teach

Being honest about limitations is crucial for proper learning:

This setup CANNOT demonstrate protection against:

  • Server power loss (everything dies)
  • Kernel panic (everything dies)
  • Disk controller failure (all instances affected)
  • Network interface failure (all instances unreachable)
  • Physical security breach (single machine compromised)

The demonstration setup teaches:

  • Process-level fault tolerance
  • Automatic failover mechanics
  • Replication lag monitoring
  • Load distribution patterns
  • Configuration management
  • Health checking and routing
  • Distributed coordination

4.2 The Learning Principle: 60/40 Coverage

Here’s the honest assessment:

Database failure scenarios (100%):
    60%: Software/process/operational issues
        → Single-node cluster demonstrates ✓
    
    40%: Hardware/infrastructure failures
        → Single-node cluster useless ✗

For a prototype or learning environment, this is perfect. You learn the complex parts (coordination, replication, failover) without the infrastructure overhead (multiple machines, networking, monitoring).

4.3 Performance Trade-offs: Teaching Resource Constraints

Students should understand that HA isn’t free. Benchmarking teaches this:

# Install pgbench
$ sudo apt-get install postgresql-contrib

# Initialize test database
$ pgbench -h 127.0.0.1 -p 5432 -i -s 100 demo

# Single instance baseline
$ pgbench -h 127.0.0.1 -p 5432 -c 10 -j 2 -T 60 demo
tps = 3200 (including connections)

# Three-instance cluster (primary writes)
$ pgbench -h 127.0.0.1 -p 5000 -c 10 -j 2 -T 60 demo
tps = 2100 (66% of single, -34% overhead)

# Cluster replicas (read-only)
$ pgbench -h 127.0.0.1 -p 5001 -c 10 -j 2 -T 60 -S demo
tps = 2800 per replica × 2 = 5600 total reads

Analysis: Write throughput drops by roughly 34%, while aggregate read capacity grows to about 175% of the single-instance baseline (5,600 vs 3,200 TPS). This teaches about read-write tradeoffs in distributed systems.

5. The Learning Path: A Structured Approach

5.1 Four-Week Curriculum

This setup enables a structured learning progression:

Week 1: Setup and Architecture

  • Install all components
  • Understand process model
  • Read configuration files
  • Monitor cluster state with patronictl
  • Draw architecture diagrams

Week 2: Failure Simulation

  • Kill processes with kill -9
  • Watch automatic recovery
  • Measure failover times
  • Observe etcd key changes
  • Practice manual recovery with patronictl

Week 3: Replication and Performance

  • Monitor replication lag
  • Generate write load
  • Watch replicas catch up
  • Run pgbench benchmarks
  • Profile resource usage

Week 4: Application Integration

  • Write demo applications
  • Implement connection pooling
  • Handle failover in code
  • Test read/write splitting
  • Measure end-to-end latency

5.2 Hands-On Exercises

Exercise 1: Observing Failover

# Terminal 1
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"

# Terminal 2
$ watch -n 1 "psql -h localhost -p 5000 -c 'SELECT pg_is_in_recovery()'"

# Terminal 3
$ sudo systemctl stop patroni@1

# Observe the transitions, measure the time

Exercise 2: Replication Lag

# Terminal 1: Monitor lag
$ watch -n 1 "psql -h 127.0.0.1 -p 5432 -c '
SELECT application_name, 
       pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication'"

# Terminal 2: Generate load
$ pgbench -h 127.0.0.1 -p 5432 -c 20 -j 4 -T 300 demo

# Watch lag grow and shrink

Exercise 3: Configuration Changes

# Change setting on replica
$ psql -h 127.0.0.1 -p 5433 -c "
ALTER SYSTEM SET work_mem = '64MB'"
$ sudo systemctl restart patroni@2

# Verify primary unaffected
$ psql -h 127.0.0.1 -p 5432 -c "SHOW work_mem"

5.3 Migration to Production

The beauty of this learning approach is the path forward. When you’re ready for actual multi-server deployment:

Single Server (Learning):

postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
etcd:
  host: 127.0.0.1:2379

Three Servers (Production):

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432  # Real server IP
etcd:
  hosts: 10.0.1.10:2379,10.0.1.11:2379,10.0.1.12:2379

Only the IP addresses change. The concepts, commands, and mental models transfer directly.

6. Monitoring and Observability: Teaching DevOps

6.1 Essential Monitoring Queries

Part of the learning is understanding what to monitor:

-- Replication lag (run on primary)
SELECT 
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS total_lag_mb
FROM pg_stat_replication;

-- Connection count per database
SELECT 
    datname,
    count(*) as connections
FROM pg_stat_activity 
GROUP BY datname;

-- Active queries
SELECT 
    pid,
    usename,
    application_name,
    state,
    query_start,
    query
FROM pg_stat_activity 
WHERE state = 'active' 
AND pid != pg_backend_pid();

-- Cache hit ratio (should be >99%)
SELECT 
    sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) as cache_hit_ratio
FROM pg_stat_database;
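
These queries become more useful when something acts on them. A minimal lag-alerting sketch, assuming psycopg2 and the superuser credentials from the Patroni config (the 100 MB threshold is an arbitrary example):

# lag_alert.py (sketch; pip install psycopg2-binary)
import time
import psycopg2

LAG_THRESHOLD_MB = 100   # arbitrary demo threshold
DSN = "host=127.0.0.1 port=5432 dbname=postgres user=postgres password=super_pass"

while True:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT application_name,
                       pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
                FROM pg_stat_replication
            """)
            rows = cur.fetchall()
    finally:
        conn.close()
    if not rows:
        print("ALERT: no replicas connected to the primary!")
    for name, lag_mb in rows:
        status = "ALERT" if lag_mb > LAG_THRESHOLD_MB else "ok"
        print(f"{status}: {name} lag {lag_mb:.2f} MB")
    time.sleep(10)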

6.2 Port Reference for Documentation

Students should document their setup:

Port Reference for Demo Environment:
├── 2379: etcd coordination service
├── 5000: HAProxy primary (read-write)
├── 5001: HAProxy replicas (read-only load-balanced)
├── 5432: PostgreSQL instance 1
├── 5433: PostgreSQL instance 2
├── 5434: PostgreSQL instance 3
├── 7000: HAProxy statistics dashboard
├── 8008: Patroni instance 1 REST API
├── 8009: Patroni instance 2 REST API
└── 8010: Patroni instance 3 REST API

6.3 Useful Commands Cheat Sheet

# Cluster status
$ patronictl -c /etc/patroni/patroni1.yml list

# Manual failover (practice controlled failover)
$ patronictl -c /etc/patroni/patroni1.yml failover postgres-cluster

# Restart a node (safe, waits for replication to catch up)
$ patronictl -c /etc/patroni/patroni1.yml restart postgres-cluster node2

# Reinitialize a node (rebuild from primary)
$ patronictl -c /etc/patroni/patroni1.yml reinit postgres-cluster node3

# Check replication status
$ psql -h localhost -p 5432 -c "SELECT * FROM pg_stat_replication;"

# HAProxy statistics
$ curl http://localhost:7000

# etcd cluster health
$ etcdctl endpoint health

# View logs
$ journalctl -u patroni@1 -f
$ journalctl -u patroni@2 -f
$ journalctl -u patroni@3 -f

7. Project Ideas and Demonstrations

7.1 Academic Project Scenarios

This setup enables realistic project demonstrations:

Project 1: Automated Failover Demo

  • Simulate primary failure during live database operations
  • Measure recovery time
  • Document state transitions
  • Present findings with graphs

Project 2: Load Distribution Analysis

  • Generate mixed read/write workload
  • Compare single vs clustered performance
  • Analyze resource utilization
  • Create performance comparison charts

Project 3: Configuration Management Study

  • Test configuration changes on replicas first
  • Demonstrate safe rollback via failover
  • Document best practices
  • Build configuration templates

Project 4: Disaster Recovery Planning

  • Implement WAL archiving to external storage
  • Practice point-in-time recovery
  • Document recovery procedures
  • Create runbooks

7.2 Demo Scenarios for Presentations

5-Minute Demo: “Magic” Failover

# Setup
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"

# Show application running
$ while true; do 
    psql -h localhost -p 5000 -c "SELECT now(), pg_is_in_recovery()"; 
    sleep 1; 
done

# Kill primary (dramatic moment)
$ sudo systemctl stop patroni@1

# Watch automatic recovery (~40 seconds)
# Application reconnects automatically

10-Minute Demo: Complete Architecture Tour

  1. Show all processes running (ps aux)
  2. Explain configuration files
  3. Demonstrate writes replicating
  4. Show HAProxy routing
  5. Trigger failover
  6. Explain what happened

20-Minute Demo: Deep Dive

  1. Architecture overview
  2. Configuration walkthrough
  3. Normal operation monitoring
  4. Simulated failure
  5. Recovery analysis
  6. Performance comparison
  7. Q&A

7.3 Capture Metrics for Analysis

Create scripts to capture data for reports:

#!/bin/bash
# capture_metrics.sh - Run during experiments

while true; do
    # Cluster state
    patronictl -c /etc/patroni/patroni1.yml list >> cluster_state.log
    
    # Replication lag (unaligned, comma-separated so pandas can read it later)
    psql -h 127.0.0.1 -p 5432 -t -A -F',' -c "
        SELECT now(), application_name, 
               pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 
        FROM pg_stat_replication" >> replication_lag.log
    
    # Resource usage
    echo "$(date),$(top -bn1 | grep postgres | awk '{sum += $9} END {print sum}')" >> cpu_usage.log
    
    # Connection count
    psql -h 127.0.0.1 -p 5432 -t -c "SELECT count(*) FROM pg_stat_activity" >> connections.log
    
    sleep 5
done

Then analyze with Python:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
lag = pd.read_csv('replication_lag.log', names=['time', 'node', 'lag_mb'])
lag['time'] = pd.to_datetime(lag['time'])

# Plot replication lag over time
plt.figure(figsize=(12, 6))
for node in lag['node'].unique():
    node_data = lag[lag['node'] == node]
    plt.plot(node_data['time'], node_data['lag_mb'], label=node)

plt.xlabel('Time')
plt.ylabel('Replication Lag (MB)')
plt.title('Replication Lag During Load Test')
plt.legend()
plt.savefig('replication_lag_chart.png')

8. Understanding Trade-offs: Critical Thinking

8.1 The Honest Assessment

After working through this setup, students should understand:

What Single-Node Clustering IS:

  • A learning laboratory for distributed database concepts
  • Protection against software-level failures
  • Automated recovery for process crashes
  • Read scalability through replication
  • A prototype architecture that scales to production

What Single-Node Clustering IS NOT:

  • True high availability (hardware failures take everything down)
  • A replacement for proper backups
  • Protection against data center failures
  • Suitable for mission-critical production systems
  • A shortcut to avoid learning distributed systems properly

8.2 When This Approach Makes Sense

Ideal for:

  • University courses on distributed databases
  • Personal learning projects
  • Proof-of-concept demonstrations
  • Development/testing environments
  • Technical interviews (showing knowledge)
  • Documentation and tutorial creation
  • Architecture decision justification

Not suitable for:

  • Production systems requiring 99.9%+ uptime
  • Applications where data loss is unacceptable
  • Environments with compliance requirements (SOC2, HIPAA)
  • Systems handling financial transactions
  • Applications with SLA penalties

8.3 The Educational Value Proposition

The question isn’t “Does this provide real HA?” but rather “What does this teach?”

Skills Learned:

  • PostgreSQL streaming replication mechanics
  • Patroni configuration and operations
  • etcd distributed coordination
  • HAProxy load balancing and health checks
  • Linux process management
  • Systemd service configuration
  • Failover automation logic
  • Distributed consensus protocols
  • Resource constraint analysis
  • Performance benchmarking

Concepts Understood:

  • Leader election algorithms
  • Heartbeat mechanisms
  • Time-to-live (TTL) patterns
  • Atomic operations in distributed systems
  • Split-brain prevention
  • Quorum-based decisions
  • State machine replication
  • Read/write splitting
  • Connection pooling
  • Blast radius containment

These are production-grade skills learned in a risk-free environment.

9. Cost Analysis: Educational Economics

9.1 Learning Path Budget

Single Server Setup (Learning):

Local laptop: $0 (already owned)
VPS for practice: $5-10/month (Hetzner, DigitalOcean)
Total: ~$50 for 6 months of learning

Direct to Multi-Server (Learning on Production):

3 VPS instances: $15/month × 3 = $45/month
Load balancer: $10/month
Managed coordination: $20/month
Total: $75/month = $450 for 6 months

Learning cost difference: $400 saved, plus no risk of breaking production

9.2 Equipment Requirements

Minimum Laptop Specs:

  • 8 GB RAM (12 GB recommended)
  • 4 CPU cores
  • 50 GB free disk space
  • Linux or macOS (Windows WSL2 works)

Recommended VPS for Practice:

  • 16 GB RAM
  • 4-8 vCPUs
  • 100 GB SSD
  • Ubuntu 22.04 or similar
  • Cost: ~$40-60/month

This is accessible to students and hobbyists.

10. Documentation and Reporting

10.1 Lab Report Template

For academic projects, document systematically:

# PostgreSQL HA Cluster - Lab Report

## 1. Objective
Demonstrate understanding of PostgreSQL high availability through 
implementation and testing of a single-node clustered setup.

## 2. Architecture
[Include your architecture diagram]

## 3. Configuration
[Document key configuration decisions]

## 4. Experiments Performed

### Experiment 1: Failover Time Measurement
- Procedure: [steps taken]
- Observations: [what happened]
- Metrics: [failover time, connection drops]
- Analysis: [why it behaved this way]

### Experiment 2: Replication Lag Under Load
- Procedure: [load generation method]
- Observations: [lag behavior]
- Metrics: [graphs of lag over time]
- Analysis: [bottleneck identification]

## 5. Performance Comparison
[Benchmark results: single vs clustered]

## 6. Lessons Learned
[What worked, what didn't, what surprised you]

## 7. Limitations Identified
[Honest assessment of what this can't do]

## 8. Production Readiness Analysis
[What would need to change for production]

## 9. References
[Documentation and resources used]

10.2 Creating Effective Diagrams

Students should visualize what they built:

Architecture Diagram:

┌─────────────────────────────────────────────────────────────┐
│                    Single Server (Laptop/VPS)                │
│                                                               │
│  ┌──────────────┐                                            │
│  │   etcd       │←──────────────┐                            │
│  │  :2379       │                │                            │
│  └──────────────┘                │ Coordination              │
│         ↑ ↑ ↑                    │                            │
│         │ │ └──────────┐         │                            │
│         │ └─────┐      │         │                            │
│         │       │      │         │                            │
│  ┌──────┴────┐ ┌┴──────┴───┐ ┌──┴─────────┐                 │
│  │ Patroni 1 │ │ Patroni 2 │ │ Patroni 3  │                 │
│  │  :8008    │ │  :8009    │ │  :8010     │                 │
│  └─────┬─────┘ └─────┬─────┘ └─────┬──────┘                 │
│        │             │             │                          │
│        ↓             ↓             ↓                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │ PG :5432 │  │ PG :5433 │  │ PG :5434 │                   │
│  │ PRIMARY  │→→│ REPLICA  │  │ REPLICA  │  Replication      │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                   │
│       │             │             │                           │
│       └─────────────┴─────────────┘                          │
│                     ↑                                         │
│            ┌────────┴────────┐                               │
│            │    HAProxy       │                               │
│            │  :5000 (write)   │                               │
│            │  :5001 (read)    │                               │
│            └─────────┬────────┘                               │
└──────────────────────┼───────────────────────────────────────┘
                       │
                  Application

State Machine Diagram:

Normal Operation:
┌──────────┐     ┌──────────┐     ┌──────────┐
│  node1   │────→│  node2   │     │  node3   │
│ PRIMARY  │     │ REPLICA  │←────│ REPLICA  │
└──────────┘     └──────────┘     └──────────┘

Primary Fails:
┌──────────┐     ┌──────────┐     ┌──────────┐
│  node1   │     │  node2   │     │  node3   │
│  CRASHED │     │ REPLICA  │     │ REPLICA  │
└──────────┘     └──────────┘     └──────────┘
                       ↓ Election
After Failover:
┌──────────┐     ┌──────────┐     ┌──────────┐
│  node1   │     │  node2   │────→│  node3   │
│  DOWN    │     │ PRIMARY  │     │ REPLICA  │
└──────────┘     └──────────┘     └──────────┘

11. Troubleshooting Guide for Learners

11.1 Common Issues and Solutions

Issue 1: Patroni won’t start

# Check logs
$ journalctl -u patroni@1 -n 50

# Common causes:
# - PostgreSQL already running on that port
# - Data directory already exists
# - etcd not running
# - Configuration syntax error

# Solutions:
$ sudo systemctl stop postgresql  # Stop system PostgreSQL
$ sudo rm -rf /var/lib/postgresql/data1  # Clean data dir
$ sudo systemctl start etcd  # Ensure etcd is running
$ patroni --validate-config /etc/patroni/patroni1.yml  # Check config syntax

Issue 2: Replication not working

# Check replication user exists
$ psql -h 127.0.0.1 -p 5432 -c "\du"

# Check pg_hba.conf allows replication
$ cat /var/lib/postgresql/data1/pg_hba.conf | grep replication

# Check replication slots
$ psql -h 127.0.0.1 -p 5432 -c "SELECT * FROM pg_replication_slots;"

# Check for replication lag
$ psql -h 127.0.0.1 -p 5432 -c "SELECT * FROM pg_stat_replication;"

Issue 3: Failover doesn’t happen

# Check etcd is running
$ curl http://127.0.0.1:2379/version

# Check if leader key exists
$ etcdctl get /db/postgres-cluster/leader

# Check Patroni can reach etcd
$ curl http://127.0.0.1:8008/patroni

# Verify TTL settings in Patroni config
$ grep ttl /etc/patroni/patroni1.yml

11.2 Debugging Commands

# See all PostgreSQL processes
$ ps aux | grep postgres | grep -v grep

# Check which ports are listening
$ sudo netstat -tlnp | grep -E '5432|5433|5434|2379|8008|8009|8010'

# Watch cluster state continuously
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"

# Check disk space (replication needs space!)
$ df -h

# Monitor system resources
$ htop

# Check PostgreSQL logs
$ tail -f /var/log/postgresql/postgresql-14-main.log

11.3 Reset and Start Over

When learning, sometimes you need a clean slate:

#!/bin/bash
# reset_cluster.sh - Nuclear option for starting fresh

# Stop everything
sudo systemctl stop patroni@{1,2,3}
sudo systemctl stop etcd
sudo systemctl stop haproxy

# Clean data directories
sudo rm -rf /var/lib/postgresql/data{1,2,3}
sudo rm -rf /var/lib/etcd/*

# Recreate data directories
sudo -u postgres mkdir -p /var/lib/postgresql/data{1,2,3}

# Start etcd first
sudo systemctl start etcd
sleep 5

# Start Patroni instances
sudo systemctl start patroni@1
sleep 10
sudo systemctl start patroni@2
sudo systemctl start patroni@3
sleep 10

# Start HAProxy
sudo systemctl start haproxy

# Check status
patronictl -c /etc/patroni/patroni1.yml list

12. Advanced Experiments

12.1 Chaos Engineering Practice

Once comfortable, introduce controlled chaos:

Experiment: Network Partition Simulation

# Simulate network issues with tc (traffic control)
$ sudo tc qdisc add dev lo root netem delay 100ms loss 5%

# Watch how replication lag responds
$ watch -n 1 "psql -h 127.0.0.1 -p 5432 -c 'SELECT * FROM pg_stat_replication'"

# Remove network impairment
$ sudo tc qdisc del dev lo root

Experiment: Resource Exhaustion

# Fill up disk to see replication fail
$ dd if=/dev/zero of=/tmp/fillup bs=1M count=10000

# Watch replication break
$ patronictl -c /etc/patroni/patroni1.yml list

# Clean up
$ rm /tmp/fillup

Experiment: Split-Brain Prevention

# Try to manually promote two nodes simultaneously
$ psql -h 127.0.0.1 -p 5433 -c "SELECT pg_promote()" &
$ psql -h 127.0.0.1 -p 5434 -c "SELECT pg_promote()" &

# Watch Patroni/etcd prevent split-brain
$ watch -n 1 "etcdctl get /db/postgres-cluster/leader"

12.2 Performance Optimization Exercises

Exercise: Tuning for Read-Heavy Workload

# Baseline
$ pgbench -h localhost -p 5001 -c 10 -T 60 -S demo

# Increase replica connections
$ psql -h 127.0.0.1 -p 5433 -c "ALTER SYSTEM SET max_connections = 200"

# Increase cache
$ psql -h 127.0.0.1 -p 5433 -c "ALTER SYSTEM SET shared_buffers = '4GB'"

# Restart and retest
$ sudo systemctl restart patroni@2
$ pgbench -h localhost -p 5001 -c 10 -T 60 -S demo

Exercise: Monitoring Replication Lag Under Load

# lag_monitor.py
import psycopg2
import time
import csv

def get_lag():
    conn = psycopg2.connect("host=127.0.0.1 port=5432 dbname=postgres")
    cur = conn.cursor()
    cur.execute("""
        SELECT application_name, 
               pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
        FROM pg_stat_replication
    """)
    results = cur.fetchall()
    conn.close()
    return results

with open('lag_over_time.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'node', 'lag_mb'])
    
    while True:
        for node, lag in get_lag():
            writer.writerow([time.time(), node, lag])
        f.flush()  # keep the CSV readable while the monitor is still running
        time.sleep(1)

13. Conclusion: What Was Really Learned

13.1 The Honest Final Assessment

After completing this learning journey, students should be able to answer:

Question: Is single-node PostgreSQL clustering real high availability?

Answer: No—and understanding why is the point.

13.2 Skills Transferred to Production

What students learn here applies directly to real distributed systems:

Architectural Skills:

  • How to design for fault tolerance
  • When to use replication vs sharding
  • Trade-offs between consistency and availability
  • Importance of observability and monitoring

Operational Skills:

  • Reading and writing service configurations
  • Using systemd for service management
  • Debugging distributed systems
  • Interpreting logs and metrics

Database Skills:

  • PostgreSQL replication internals
  • Understanding WAL and LSN
  • Query routing strategies
  • Connection pooling patterns

Systems Skills:

  • Process vs thread models
  • Resource constraint analysis
  • Performance benchmarking
  • Capacity planning

13.3 The Migration Path Forward

The beauty of this approach: when ready for production, you already know how it works.

Learning Setup → Production Setup:

  • Same configuration structure (just change IPs)
  • Same monitoring queries
  • Same troubleshooting approach
  • Same operational commands
  • Same architectural patterns

What Changes in Production:

  • Three physical servers instead of one
  • Geographic distribution for real HA
  • Proper backup and disaster recovery
  • Monitoring and alerting infrastructure
  • Security hardening and access control

What Stays the Same:

  • Patroni orchestration logic
  • PostgreSQL replication mechanics
  • HAProxy routing patterns
  • etcd coordination approach
  • Application integration code

13.4 Final Wisdom

Single-node PostgreSQL clustering is a pedagogical tool, not a production solution. It’s a training environment where you can:

  • Break things safely
  • Learn from failures
  • Experiment with configurations
  • Understand distributed systems
  • Build transferable skills

For students, researchers, and engineers learning database administration, it’s perfect. For production systems requiring real high availability, it’s a starting point, not a destination.

The goal isn’t to run this in production. The goal is to understand distributed databases well enough that when you do build production systems, you know what you’re doing.

Quick Reference Card

Port Mapping

2379        etcd coordination
5000        HAProxy primary (read-write)
5001        HAProxy replicas (read-only)
5432-5434   PostgreSQL instances
7000        HAProxy statistics
8008-8010   Patroni REST APIs

Essential Commands

# Cluster status
$ patronictl -c /etc/patroni/patroni1.yml list

# Manual failover
$ patronictl -c /etc/patroni/patroni1.yml failover postgres-cluster

# Restart node
$ patronictl -c /etc/patroni/patroni1.yml restart postgres-cluster node2

# Check replication
$ psql -h localhost -p 5432 -c "SELECT * FROM pg_stat_replication;"

# HAProxy stats
$ curl http://localhost:7000

Monitoring Queries

-- Replication lag
SELECT application_name,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;

-- Connection count
SELECT count(*) FROM pg_stat_activity;

-- Active queries
SELECT pid, query, state 
FROM pg_stat_activity 
WHERE state = 'active';

Troubleshooting Checklist

□ Is etcd running? (curl http://127.0.0.1:2379/version)
□ Are all Patroni services running? (systemctl status patroni@{1,2,3})
□ Are PostgreSQL processes running? (ps aux | grep postgres)
□ Is there a leader? (etcdctl get /db/postgres-cluster/leader)
□ Is replication working? (psql query pg_stat_replication)
□ Is HAProxy routing correctly? (curl http://localhost:7000)
□ Is there disk space? (df -h)
□ Are there errors in logs? (journalctl -u patroni@1 -n 50)

This architecture serves as an excellent learning laboratory—a safe space to break things, understand distributed systems, and build real skills without the complexity and cost of managing multiple physical servers. When you’re ready to build production systems, you’ll already understand how they work.
