When students and engineers first encounter the concept of high availability, they often face a chicken-and-egg problem: they need to understand distributed systems to build them, but they need to build them to understand them. Running a PostgreSQL cluster on a single server solves this dilemma—it’s a learning laboratory where you can experiment with real HA concepts without the complexity and cost of managing multiple physical machines.
This article documents that learning journey, examining whether single-node clustering is a legitimate educational tool or just architectural theater that teaches bad habits.
1. The Educational Paradox
1.1 The Dilemma of Learning High Availability
Every database course teaches that high availability requires eliminating single points of failure. Yet for someone building their first HA system—whether for a university project, a proof of concept, or personal learning—procuring three physical servers (or even three cloud instances) is impractical.
This creates a fundamental learning barrier. How do you understand Patroni’s failover logic without seeing it fail over? How do you grasp replication lag without monitoring it? How do you learn to debug cluster state without having a cluster?
The reality is stark: running three PostgreSQL instances on one physical server shares every failure domain—power supply, CPU, memory, disk controller, network interface. If the server dies, everything dies together. From a pure HA perspective, this is pointless.
But from a learning perspective? It’s invaluable.
1.2 What We’re Actually Learning About
Before diving into implementation, let’s be honest about what single-node clustering teaches versus what it doesn’t:
Failure Type | Example | Single Instance | Multi-Process Cluster | What You Learn |
---|---|---|---|---|
Hardware | Disk failure, power loss | ✗ | ✗ | Nothing—both fail |
Process crash | Segfault, OOM kill | ✗ (5-30 min) | ✓ (30 sec) | Automatic failover mechanics |
Connection exhaustion | max_connections hit | ✗ | ✓ (replicas available) | Load distribution patterns |
Query of death | Recursive CTE crash | ✗ | ✓ (failover) | Blast radius containment |
Config error | Wrong parameter value | ✗ | ✓ (replicas unaffected) | Configuration management |
Maintenance error | Failed schema migration | ✗ | ✓ (rollback via failover) | Safe deployment strategies |
Data corruption | Shared disk corruption | ✗ | ✗ | Backup importance (hard lesson) |
The pattern reveals something important: this setup teaches software-level resilience, which accounts for roughly 60% of real-world failure scenarios, while doing nothing for the hardware failures that make up the remaining 40%. For a prototype or demonstration environment, that trade is actually perfect: you learn the complex orchestration parts before worrying about physical infrastructure.
Think of it like learning to drive with a driving simulator. You’re not learning about tire mechanics or engine repair, but you are learning steering, braking, and navigation—skills that transfer directly to real driving.
2. The Architecture: Understanding Through Building
2.1 PostgreSQL’s Multi-Process Foundation
The first lesson in this laboratory is understanding why PostgreSQL can do this at all. PostgreSQL’s process model is pedagogically perfect for demonstrating isolation concepts.
Imagine you’re explaining processes to a student. You can show them this:
Your Laptop (16GB RAM, 8 CPU cores)
├── PostgreSQL Instance 1 (port 5432)
│ ├── Postmaster process (PID 1001)
│ ├── Backend processes (PID 1002-1050)
│ ├── WAL writer (PID 1051)
│ └── Memory: 2GB shared_buffers
│
├── PostgreSQL Instance 2 (port 5433)
│ ├── Postmaster process (PID 2001)
│ ├── Backend processes (PID 2002-2050)
│ └── Memory: 2GB shared_buffers
│
└── PostgreSQL Instance 3 (port 5434)
├── Postmaster process (PID 3001)
└── Memory: 2GB shared_buffers
You can then demonstrate process isolation in real-time:
# Kill the primary's postmaster
$ kill -9 1001
# Watch the other instances keep running
$ ps aux | grep postgres
# Instances 2 and 3 still alive!
This is a tangible demonstration of operating system process isolation. Students can see with their own eyes that Instance 1 crashing doesn’t affect Instances 2 and 3. They’re sharing hardware, but the OS keeps them separate.
Try teaching this concept with a threaded database. You can’t. There’s no way to demonstrate thread isolation at this level because a catastrophic thread failure can corrupt shared memory and take everything down.
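The same point can be made without a database at all. Here is a minimal Python sketch (standard library only, Linux/macOS; all names are illustrative, not part of the cluster setup) that starts two worker processes, hard-kills one, and shows the other carries on untouched:
# process_isolation_demo.py - illustrative sketch of OS process isolation
import multiprocessing
import os
import signal
import time

def worker(name):
    # Each worker lives in its own OS process with its own memory space
    while True:
        print(f"{name} (pid {os.getpid()}) still alive")
        time.sleep(1)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=worker, args=("worker-1",))
    p2 = multiprocessing.Process(target=worker, args=("worker-2",))
    p1.start()
    p2.start()

    time.sleep(3)
    os.kill(p1.pid, signal.SIGKILL)   # simulate a hard crash of one process
    time.sleep(3)

    print(f"worker-1 alive: {p1.is_alive()}")   # False - it was killed
    print(f"worker-2 alive: {p2.is_alive()}")   # True - isolation at work
    p2.terminate()
    p2.join()
Killing worker-1 never touches worker-2, which is exactly the guarantee PostgreSQL's process model gives each instance on the same machine.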
2.2 Patroni: Learning Distributed Coordination
The next learning objective is understanding how distributed systems coordinate. Patroni provides an excellent teaching tool because its configuration is transparent and its behavior is observable.
For a demonstration environment, you run three Patroni instances as separate processes:
$ ps aux | grep patroni
student 1001 0.5 1.2 python3 /usr/local/bin/patroni /etc/patroni/patroni1.yml
student 2001 0.4 1.1 python3 /usr/local/bin/patroni /etc/patroni/patroni2.yml
student 3001 0.4 1.1 python3 /usr/local/bin/patroni /etc/patroni/patroni3.yml
The beauty of this setup for learning is that all communication happens on localhost. You can watch network traffic, debug HTTP calls, and trace every interaction without dealing with firewall rules, network latency, or SSH key management.
Here’s what the coordination looks like in your demo environment:
Your Laptop - All on 127.0.0.1
etcd (port 2379)
↑ ↑ ↑
| | |___ Patroni3 (REST API :8010) → PostgreSQL :5434
| |_____ Patroni2 (REST API :8009) → PostgreSQL :5433
|_______ Patroni1 (REST API :8008) → PostgreSQL :5432
For teaching purposes, the configuration demonstrates key distributed systems concepts while remaining simple:
# /etc/patroni/patroni1.yml
scope: postgres-cluster          # Logical cluster identifier
namespace: /db/                  # Key-value namespace
name: node1                      # Unique node identifier

restapi:
  listen: 127.0.0.1:8008         # Different per instance
  connect_address: 127.0.0.1:8008

etcd:
  host: 127.0.0.1:2379           # Coordination service

postgresql:
  listen: 127.0.0.1:5432         # Different per instance
  connect_address: 127.0.0.1:5432
  data_dir: /var/lib/postgresql/data1   # Isolated storage
  authentication:
    replication:
      username: replicator
      password: rep_pass
    superuser:
      username: postgres
      password: super_pass
Students can create all three configs by copying this file and changing just five lines. This teaches an important principle: distributed systems configuration should be templatable. When you eventually move to three physical servers, you change the same five values (replacing 127.0.0.1 with actual IPs).
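To make that concrete, here is a small helper script, a sketch rather than anything Patroni ships, that stamps out the three demo configs from one template so the per-node differences stay obvious:
# generate_patroni_configs.py - illustrative helper, not part of Patroni
# Writes patroni1.yml, patroni2.yml, patroni3.yml into the current directory;
# copy them to /etc/patroni/ afterwards.
NODES = {
    1: {"rest_port": 8008, "pg_port": 5432},
    2: {"rest_port": 8009, "pg_port": 5433},
    3: {"rest_port": 8010, "pg_port": 5434},
}

TEMPLATE = """\
scope: postgres-cluster
namespace: /db/
name: node{n}

restapi:
  listen: 127.0.0.1:{rest_port}
  connect_address: 127.0.0.1:{rest_port}

etcd:
  host: 127.0.0.1:2379

postgresql:
  listen: 127.0.0.1:{pg_port}
  connect_address: 127.0.0.1:{pg_port}
  data_dir: /var/lib/postgresql/data{n}
  authentication:
    replication:
      username: replicator
      password: rep_pass
    superuser:
      username: postgres
      password: super_pass
"""

for n, values in NODES.items():
    with open(f"patroni{n}.yml", "w") as f:
        f.write(TEMPLATE.format(n=n, **values))
    print(f"wrote patroni{n}.yml")
The node name, the two port numbers, and the data directory are the only values that vary; everything else is shared cluster policy.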
Using systemd with a service template demonstrates infrastructure-as-code concepts:
# /etc/systemd/system/patroni@.service
[Unit]
Description=Patroni PostgreSQL Cluster Manager (instance %i)
After=network.target etcd.service
[Service]
Type=simple
User=postgres
ExecStart=/usr/local/bin/patroni /etc/patroni/patroni%i.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Then starting the cluster teaches automation:
$ sudo systemctl enable patroni@{1,2,3}
$ sudo systemctl start patroni@{1,2,3}
# Check status
$ sudo systemctl status patroni@1
● patroni@1.service - Patroni (instance 1)
Active: active (running)
Three services managed with one command—this is infrastructure thinking.
2.3 Understanding Patroni’s Logic
For students learning distributed systems, understanding Patroni’s behavior is crucial. Here’s the simplified logic each instance runs:
# Simplified pseudocode of the loop each Patroni instance runs every 10 seconds
while True:
    # 1. Health check - is my database responding?
    pg_status = check_postgres_health()

    # 2. Heartbeat - tell others I'm alive (30-second TTL)
    etcd.set(f'/db/postgres-cluster/members/{my_name}', {
        'state': pg_status,
        'role': 'leader' if is_primary else 'replica',
        'xlog_location': get_wal_position(),
    }, ttl=30)

    # 3. Leadership check - is there a leader?
    leader_key = etcd.get('/db/postgres-cluster/leader')

    # 4. Leader election - if no leader and I'm healthy, try to lead
    if not leader_key and pg_status == 'running':
        try:
            etcd.create('/db/postgres-cluster/leader', my_name, ttl=30)
            run_command('pg_ctl promote')
        except KeyAlreadyExists:
            pass  # Another node won the race

    # 5. Follower configuration - if someone else is leader, follow them
    if leader_key and leader_key != my_name:
        setup_replication_to(leader_key)

    sleep(10)
This code demonstrates several distributed systems concepts:
- Heartbeating: Nodes prove they’re alive by periodically updating state
- TTLs (Time To Live): If heartbeats stop, the node is considered dead
- Leader election: Using atomic operations to elect a coordinator
- Race conditions: Handling multiple nodes trying to become leader simultaneously
- State machines: Nodes transition between follower and leader roles
In a demo environment, you can observe this in real-time:
# Watch the heartbeats
$ watch -n 1 'etcdctl get --prefix /db/postgres-cluster/'
# You'll see keys being updated every 10 seconds
/db/postgres-cluster/leader → "node1"
/db/postgres-cluster/members/node1 → {"state": "running", "role": "master"}
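The race in step 4 boils down to a single atomic "create this key only if it does not exist" operation. The following toy simulation (pure Python, no real etcd, all names illustrative) shows why only one of two competing replicas can ever win:
# leader_election_sim.py - toy model of the etcd leader race; no real etcd involved
import threading

class TinyKV:
    """Simulates etcd's 'create this key only if it does not exist' primitive."""
    def __init__(self):
        self._store = {}
        self._lock = threading.Lock()

    def create(self, key, value):
        # Atomic: either the key is absent and we write it, or the call fails
        with self._lock:
            if key in self._store:
                raise KeyError("KeyAlreadyExists")
            self._store[key] = value

    def get(self, key):
        with self._lock:
            return self._store.get(key)

kv = TinyKV()

def try_become_leader(node_name):
    try:
        kv.create("/db/postgres-cluster/leader", node_name)
        print(f"{node_name}: won the election, promoting")
    except KeyError:
        print(f"{node_name}: lost the race, staying a replica")

# Two healthy replicas notice the missing leader key at the same moment
threads = [threading.Thread(target=try_become_leader, args=(name,))
           for name in ("node2", "node3")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("leader is:", kv.get("/db/postgres-cluster/leader"))
Run it a few times: which node wins changes, but there is never more than one leader, which is the whole point of routing the election through an atomic key-value store.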
2.4 Resource Overhead: Learning System Constraints
An important learning objective is understanding resource constraints. Running three instances on one machine makes trade-offs visible:
Component | Processes | RAM Usage | CPU Usage |
---|---|---|---|
PostgreSQL x3 | 9 | 6-8 GB | 40-60% |
Patroni x3 | 3 | 200-300 MB | 2-5% |
etcd | 1 | 100-200 MB | 1-2% |
HAProxy | 1 | 30-50 MB | 1-2% |
Total | 14 | 6.5-8.5 GB | 44-69% |
Remaining (16 GB machine) | n/a | 7.5-9.5 GB | 31-56% |
Students can monitor this with standard Linux tools:
# View all processes
$ ps aux | grep -E 'postgres|patroni|etcd|haproxy' | grep -v grep
# Check memory
$ free -h
total used free
Mem: 16Gi 8.2Gi 7.8Gi
# Monitor CPU
$ top -bn1 | grep -E 'postgres|patroni'
This teaches resource budgeting: you can’t just add infinite replicas because each one consumes real resources.
2.5 Streaming Replication: Watching Data Flow
One of the most powerful learning experiences is watching data replicate in real-time. PostgreSQL’s physical replication makes data flow visible.
First, you set up the replication user (Patroni automates this, but seeing it manually teaches the concept):
# On the primary
$ psql -h 127.0.0.1 -p 5432 -U postgres
postgres=# CREATE USER replicator REPLICATION LOGIN PASSWORD 'rep_pass';
Then you can observe the replication connections:
# On primary: Check who's connected for replication
$ psql -h 127.0.0.1 -p 5432 -U postgres -c "
SELECT
application_name,
client_addr,
state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;"
application_name | client_addr | state | lag_mb
------------------+-------------+-----------+--------
node2 | 127.0.0.1 | streaming | 0.02
node3 | 127.0.0.1 | streaming | 0.01
Now you can demonstrate replication in action:
# Write to primary
$ psql -h 127.0.0.1 -p 5432 -U postgres -c "
CREATE TABLE demo (id SERIAL, data TEXT, created_at TIMESTAMP DEFAULT NOW());
INSERT INTO demo (data) VALUES ('Hello from primary');"
# Read from replica (appears within milliseconds)
$ psql -h 127.0.0.1 -p 5433 -U postgres -c "SELECT * FROM demo;"
id | data | created_at
----+--------------------+----------------------------
1 | Hello from primary | 2025-10-06 14:32:15.123456
# Try to write to replica (demonstrates read-only enforcement)
$ psql -h 127.0.0.1 -p 5433 -U postgres -c "INSERT INTO demo (data) VALUES ('test');"
ERROR: cannot execute INSERT in a read-only transaction
This is pedagogical gold. Students can see:
- Data written to one database appears in another
- Replicas are read-only
- Replication happens almost instantly (lag_mb shows it)
- All of this on localhost without network complexity
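To put a number on "almost instantly", a short psycopg2 script can commit a tagged row on the primary and poll a replica until it shows up. This is a rough sketch that reuses the demo table and the superuser credentials from the Patroni config above; adjust them to your setup:
# replication_delay_probe.py - time how long a committed row takes to appear on a replica
import time
import psycopg2

# Assumed connection strings following the demo layout; adjust user/password as needed
PRIMARY = "host=127.0.0.1 port=5432 dbname=postgres user=postgres password=super_pass"
REPLICA = "host=127.0.0.1 port=5433 dbname=postgres user=postgres password=super_pass"

marker = f"probe-{time.time()}"

# 1. Commit a uniquely tagged row on the primary (uses the demo table created earlier)
primary = psycopg2.connect(PRIMARY)
with primary.cursor() as cur:
    cur.execute("INSERT INTO demo (data) VALUES (%s)", (marker,))
primary.commit()
primary.close()
start = time.monotonic()

# 2. Poll the replica until the row becomes visible
replica = psycopg2.connect(REPLICA)
replica.autocommit = True          # each SELECT sees the latest replayed state
with replica.cursor() as cur:
    while True:
        cur.execute("SELECT 1 FROM demo WHERE data = %s", (marker,))
        if cur.fetchone():
            break
        time.sleep(0.01)
replica.close()

print(f"row visible on replica after ~{(time.monotonic() - start) * 1000:.0f} ms")
On an idle laptop the result is typically in the low milliseconds; rerunning it during a pgbench load makes the lag visible in a way the monitoring query alone never quite conveys.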
2.6 etcd: The Coordination Service
Understanding distributed coordination is essential for systems engineers. etcd provides a simple key-value store that demonstrates coordination primitives.
For a demo setup:
# Install and configure etcd
$ sudo apt-get install etcd
# /etc/default/etcd
ETCD_NAME="demo-node"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_CLIENT_URLS="http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://127.0.0.1:2379"
# Start and verify
$ sudo systemctl start etcd
$ curl http://127.0.0.1:2379/version
{"etcdserver":"3.4.13"}
Students can then explore how Patroni uses etcd:
# View cluster state
$ etcdctl get --prefix /db/postgres-cluster/
/db/postgres-cluster/leader → "node1"
/db/postgres-cluster/members/node1 → {
"conn_url": "postgres://127.0.0.1:5432/postgres",
"state": "running",
"role": "master",
"timeline": 1
}
/db/postgres-cluster/members/node2 → {
"conn_url": "postgres://127.0.0.1:5433/postgres",
"state": "running",
"role": "replica"
}
This teaches key-value store concepts, TTLs, and how distributed applications use coordination services.
2.7 Simulating Failover: The Ultimate Demo
The most impressive demonstration is simulating a failure and watching automatic recovery. Here’s the timeline students can observe:
# Terminal 1: Watch cluster state
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"
# Terminal 2: Watch etcd
$ watch -n 1 "etcdctl get --prefix /db/postgres-cluster/leader"
# Terminal 3: Kill the primary
$ sudo systemctl stop patroni@1
What happens (with timestamps):
T=0s: Primary crashes (node1:5432 stops)
T=10s: Patroni1 fails to refresh its heartbeat
T=30s: etcd expires the /leader key (TTL exceeded)
T=31s: Patroni2 and Patroni3 both detect: no leader!
T=32s: Both try to acquire leadership (race condition demo)
T=33s: etcd's atomic compare-and-swap gives lock to node2
T=34s: Patroni2 promotes its PostgreSQL
$ pg_ctl promote -D /var/lib/postgresql/data2
T=38s: node2 completes promotion, becomes new primary
T=39s: node3 reconfigures to replicate from node2
T=40s: HAProxy health check detects node2 is primary
T=41s: HAProxy routes port 5000 traffic to node2:5433
Total automated failover: ~40 seconds
Students can watch this entire sequence unfold in real-time across multiple terminals. It’s a distributed systems ballet happening on localhost.
Visual state changes they can observe:
Before:
etcd: /leader → "node1"
HAProxy :5000 → 127.0.0.1:5432
psql -h localhost -p 5000 # works
During (T=30-38):
etcd: /leader → (expired, then) "node2"
HAProxy :5000 → No healthy backend
psql -h localhost -p 5000 # connection refused (brief)
After:
etcd: /leader → "node2"
HAProxy :5000 → 127.0.0.1:5433
psql -h localhost -p 5000 # works again, different backend
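The ~40-second figure is easy to measure yourself: run a probe that attempts a query through the HAProxy write port once a second and logs when service drops and returns. A minimal sketch, assuming psycopg2 and the demo credentials used earlier:
# failover_probe.py - measure the outage window through the HAProxy write endpoint
import time
import psycopg2

# Assumed connection settings; point them at your demo database
DSN = "host=127.0.0.1 port=5000 dbname=postgres user=postgres password=super_pass connect_timeout=2"

outage_start = None
while True:
    try:
        conn = psycopg2.connect(DSN)
        conn.autocommit = True
        with conn.cursor() as cur:
            # inet_server_port() reveals which backend (5432/5433/5434) actually answered
            cur.execute("SELECT now(), inet_server_port()")
            ts, port = cur.fetchone()
        conn.close()
        if outage_start is not None:
            print(f"recovered after {time.monotonic() - outage_start:.1f}s, served by backend port {port}")
            outage_start = None
        else:
            print(f"ok  {ts}  backend port {port}")
    except psycopg2.OperationalError:
        if outage_start is None:
            outage_start = time.monotonic()
            print("write endpoint down, waiting for failover...")
    time.sleep(1)
Leave it running in a fourth terminal while you stop patroni@1 and the printed recovery time is your measured failover window, including the backend port switching from 5432 to 5433.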
2.8 HAProxy: Teaching Load Balancing
HAProxy demonstrates health-check based routing. The configuration teaches proxy patterns:
# /etc/haproxy/haproxy.cfg
global
    maxconn 4096
    log /dev/log local0

defaults
    mode tcp
    timeout connect 5s
    timeout client 50s
    timeout server 50s

# Primary endpoint (read-write)
listen postgres_primary
    bind *:5000
    option tcplog
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server node1 127.0.0.1:5432 maxconn 100 check port 8008
    server node2 127.0.0.1:5433 maxconn 100 check port 8009
    server node3 127.0.0.1:5434 maxconn 100 check port 8010

# Replica endpoint (read-only)
listen postgres_replicas
    bind *:5001
    option tcplog
    option httpchk OPTIONS /replica
    http-check expect status 200
    balance roundrobin
    server node1 127.0.0.1:5432 maxconn 100 check port 8008
    server node2 127.0.0.1:5433 maxconn 100 check port 8009
    server node3 127.0.0.1:5434 maxconn 100 check port 8010

# Statistics dashboard
listen stats
    bind *:7000
    mode http
    stats enable
    stats uri /
Students can observe health checks in action:
# HAProxy checks the Patroni REST API every 3 seconds
# Node is primary:
$ curl -i http://127.0.0.1:8008/master
HTTP/1.0 200 OK
# Node is replica:
$ curl -i http://127.0.0.1:8008/master
HTTP/1.0 503 Service Unavailable
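A few lines of Python can poll all three REST endpoints the same way HAProxy does and print each node's current role. This sketch uses only the standard library and the /master endpoint shown above:
# who_is_primary.py - ask each Patroni REST API the same question HAProxy asks
import urllib.request
import urllib.error

ENDPOINTS = {"node1": 8008, "node2": 8009, "node3": 8010}

for name, port in ENDPOINTS.items():
    url = f"http://127.0.0.1:{port}/master"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            status = resp.status          # 200 means this node is the primary
    except urllib.error.HTTPError as e:
        status = e.code                   # 503 means replica (or not healthy)
    except OSError:
        status = None                     # Patroni not reachable at all
    role = {200: "primary", 503: "replica"}.get(status, "unreachable")
    print(f"{name} (:{port}) -> HTTP {status}: {role}")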
They can also write application code that uses the proxy:
# demo_app.py
import psycopg2

# Write connection always goes to the current primary
write_conn = psycopg2.connect(
    host='localhost',
    port=5000,          # HAProxy routes to whoever is primary
    database='demo',
    user='demo_user'
)

# Read connection load-balances across replicas
read_conn = psycopg2.connect(
    host='localhost',
    port=5001,          # HAProxy round-robins to replicas
    database='demo',
    user='demo_user'
)

# Write
with write_conn.cursor() as cur:
    cur.execute("INSERT INTO demo (data) VALUES ('test')")
write_conn.commit()

# Read (might go to a different replica each time)
with read_conn.cursor() as cur:
    cur.execute("SELECT * FROM demo")
    print(cur.fetchall())
The application doesn’t need to know which specific database is primary. HAProxy handles that complexity.
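What the application does need is a retry path, because during the failover window the write connection drops. Here is a hedged sketch of that pattern (an illustrative helper, not a library API); note the idempotency caveat in the docstring:
# retry_writes.py - sketch of reconnect-and-retry around the HAProxy write endpoint
import time
import psycopg2

# Assumed connection settings for the demo database behind HAProxy
WRITE_DSN = "host=localhost port=5000 dbname=demo user=demo_user connect_timeout=2"

def execute_write(sql, params=None, attempts=30, delay=2):
    """Retry a write while the cluster fails over.

    Caveat: blindly retrying a non-idempotent write can duplicate it if the
    commit landed just before the connection dropped; real applications need
    idempotency keys or a check before retrying.
    """
    for attempt in range(1, attempts + 1):
        try:
            conn = psycopg2.connect(WRITE_DSN)
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                conn.commit()
                return
            finally:
                conn.close()
        except (psycopg2.OperationalError, psycopg2.InterfaceError) as exc:
            print(f"attempt {attempt}: primary unavailable ({exc}); retrying in {delay}s")
            time.sleep(delay)
    raise RuntimeError("cluster did not recover within the retry budget")

execute_write("INSERT INTO demo (data) VALUES (%s)", ("written across a failover",))
Run this while stopping patroni@1 and the insert eventually lands on the newly promoted primary without the caller ever knowing which node that is.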
3. Real Failure Scenarios: Educational Value
3.1 Demonstrating Process Crash Recovery
In a learning environment, you can safely crash processes to understand recovery:
# Find the primary's postmaster PID
$ ps aux | grep "postgres -D /var/lib/postgresql/data1"
postgres 1001 0.5 /usr/lib/postgresql/14/bin/postgres -D /var/lib/postgresql/data1
# Kill it violently (simulates segfault)
$ kill -9 1001
What students observe:
Single-Instance Behavior (for comparison):
[ERROR] server terminated by signal 11: Segmentation fault
[LOG] all server processes terminated; reinitializing
# All connections dropped immediately
# 30-60 second recovery time
# Manual investigation required
Multi-Process Cluster Behavior:
# Primary crashes at T=0
T=10s: Patroni health check fails on node1
[ERROR] PostgreSQL not responding
T=30s: Leader key expires in etcd
T=32s: node2 wins election, promotes to primary
[INFO] Promoting PostgreSQL instance
T=40s: HAProxy detects new primary
[INFO] Routing :5000 to 127.0.0.1:5433
# Applications automatically reconnect
# Downtime: 25-40 seconds, fully automated
3.2 Connection Exhaustion Demo
This demonstrates resource limits:
# Set low connection limit for demo
$ psql -h 127.0.0.1 -p 5432 -c "ALTER SYSTEM SET max_connections = 10"
$ sudo systemctl restart patroni@1
# Exhaust connections
$ for i in {1..10}; do
psql -h 127.0.0.1 -p 5000 -c "SELECT pg_sleep(300)" &
done
# Try one more
$ psql -h 127.0.0.1 -p 5000 -c "SELECT 1"
FATAL: sorry, too many clients already
With a cluster:
# Primary at :5000 is full, but replicas at :5001 still work
$ psql -h 127.0.0.1 -p 5001 -c "SELECT 1"
?column?
----------
1
This teaches about read/write separation and capacity planning.
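In application code the same lesson becomes a simple fallback rule: if a read-only query cannot get a connection on the write endpoint, send it to the replica endpoint instead. A minimal sketch, assuming the demo database and ports used throughout this article:
# read_fallback.py - route a read to the replicas when the primary pool is exhausted
import psycopg2

# DSNs are assumptions based on the demo layout; adjust database/user to your setup
PRIMARY_DSN = "host=127.0.0.1 port=5000 dbname=demo user=demo_user connect_timeout=2"
REPLICA_DSN = "host=127.0.0.1 port=5001 dbname=demo user=demo_user connect_timeout=2"

def run_read(sql):
    for dsn, label in ((PRIMARY_DSN, "primary"), (REPLICA_DSN, "replicas")):
        try:
            conn = psycopg2.connect(dsn)
        except psycopg2.OperationalError as exc:
            # e.g. "FATAL: sorry, too many clients already" - try the next endpoint
            print(f"{label} endpoint unavailable ({exc}); falling back")
            continue
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        finally:
            conn.close()
    raise RuntimeError("no endpoint could serve the read")

print(run_read("SELECT count(*) FROM demo"))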
3.3 Configuration Error Isolation
Demonstrate blast radius:
# Misconfigure only node1
$ psql -h 127.0.0.1 -p 5432 -c "ALTER SYSTEM SET shared_buffers = '999GB'"
$ sudo systemctl restart patroni@1
# node1 fails to start (impossible memory requirement)
# But node2 and node3 are unaffected
$ patronictl -c /etc/patroni/patroni1.yml list
| Member | Host | Role | State |
+--------+----------------+---------+---------+
| node1 | 127.0.0.1:5432 | Replica | crashed |
| node2 | 127.0.0.1:5433 | Leader | running |
| node3 | 127.0.0.1:5434 | Replica | running |
Students learn that configuration changes should be tested on replicas first.
4. Understanding Limitations: Honest Education
4.1 What This Setup Cannot Teach
Being honest about limitations is crucial for proper learning:
This setup CANNOT demonstrate protection against:
- Server power loss (everything dies)
- Kernel panic (everything dies)
- Disk controller failure (all instances affected)
- Network interface failure (all instances unreachable)
- Physical security breach (single machine compromised)
The demonstration setup teaches:
- Process-level fault tolerance
- Automatic failover mechanics
- Replication lag monitoring
- Load distribution patterns
- Configuration management
- Health checking and routing
- Distributed coordination
4.2 The Learning Principle: 60/40 Coverage
Here’s the honest assessment:
Database failure scenarios (100%):
60%: Software/process/operational issues
→ Single-node cluster demonstrates ✓
40%: Hardware/infrastructure failures
→ Single-node cluster useless ✗
For a prototype or learning environment, this is perfect. You learn the complex parts (coordination, replication, failover) without the infrastructure overhead (multiple machines, networking, monitoring).
4.3 Performance Trade-offs: Teaching Resource Constraints
Students should understand that HA isn’t free. Benchmarking teaches this:
# Install pgbench
$ sudo apt-get install postgresql-contrib
# Initialize test database
$ pgbench -h 127.0.0.1 -p 5432 -i -s 100 demo
# Single instance baseline
$ pgbench -h 127.0.0.1 -p 5432 -c 10 -j 2 -T 60 demo
tps = 3200 (including connections)
# Three-instance cluster (primary writes)
$ pgbench -h 127.0.0.1 -p 5000 -c 10 -j 2 -T 60 demo
tps = 2100 (66% of single, -34% overhead)
# Cluster replicas (read-only)
$ pgbench -h 127.0.0.1 -p 5001 -c 10 -j 2 -T 60 -S demo
tps = 2800 per replica × 2 = 5600 total reads
Analysis: Write throughput drops by about 34%, while aggregate read throughput rises to 175% of the single-instance baseline (a 75% gain). This teaches the read/write trade-offs inherent in distributed systems.
5. The Learning Path: A Structured Approach
5.1 Four-Week Curriculum
This setup enables a structured learning progression:
Week 1: Setup and Architecture
- Install all components
- Understand process model
- Read configuration files
- Monitor cluster state with patronictl
- Draw architecture diagrams
Week 2: Failure Simulation
- Kill processes with kill -9
- Watch automatic recovery
- Measure failover times
- Observe etcd key changes
- Practice manual recovery with patronictl
Week 3: Replication and Performance
- Monitor replication lag
- Generate write load
- Watch replicas catch up
- Run pgbench benchmarks
- Profile resource usage
Week 4: Application Integration
- Write demo applications
- Implement connection pooling
- Handle failover in code
- Test read/write splitting
- Measure end-to-end latency
5.2 Hands-On Exercises
Exercise 1: Observing Failover
# Terminal 1
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"
# Terminal 2
$ watch -n 1 "psql -h localhost -p 5000 -c 'SELECT pg_is_in_recovery()'"
# Terminal 3
$ sudo systemctl stop patroni@1
# Observe the transitions, measure the time
Exercise 2: Replication Lag
# Terminal 1: Monitor lag
$ watch -n 1 "psql -h 127.0.0.1 -p 5432 -c '
SELECT application_name,
pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication'"
# Terminal 2: Generate load
$ pgbench -h 127.0.0.1 -p 5432 -c 20 -j 4 -T 300 demo
# Watch lag grow and shrink
Exercise 3: Configuration Changes
# Change setting on replica
$ psql -h 127.0.0.1 -p 5433 -c "
ALTER SYSTEM SET work_mem = '64MB'"
$ sudo systemctl restart patroni@2
# Verify primary unaffected
$ psql -h 127.0.0.1 -p 5432 -c "SHOW work_mem"
5.3 Migration to Production
The beauty of this learning approach is the path forward. When you’re ready for actual multi-server deployment:
Single Server (Learning):
postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
etcd:
  host: 127.0.0.1:2379
Three Servers (Production):
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432   # Real server IP
etcd:
  hosts: 10.0.1.10:2379,10.0.1.11:2379,10.0.1.12:2379
Only the IP addresses change. The concepts, commands, and mental models transfer directly.
6. Monitoring and Observability: Teaching DevOps
6.1 Essential Monitoring Queries
Part of the learning is understanding what to monitor:
-- Replication lag (run on primary)
SELECT
application_name,
client_addr,
state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS total_lag_mb
FROM pg_stat_replication;
-- Connection count per database
SELECT
datname,
count(*) as connections
FROM pg_stat_activity
GROUP BY datname;
-- Active queries
SELECT
pid,
usename,
application_name,
state,
query_start,
query
FROM pg_stat_activity
WHERE state = 'active'
AND pid != pg_backend_pid();
-- Cache hit ratio (should be >99%)
SELECT
sum(blks_hit) * 100.0 / sum(blks_hit + blks_read) as cache_hit_ratio
FROM pg_stat_database;
6.2 Port Reference for Documentation
Students should document their setup:
Port Reference for Demo Environment:
├── 2379: etcd coordination service
├── 5000: HAProxy primary (read-write)
├── 5001: HAProxy replicas (read-only load-balanced)
├── 5432: PostgreSQL instance 1
├── 5433: PostgreSQL instance 2
├── 5434: PostgreSQL instance 3
├── 7000: HAProxy statistics dashboard
├── 8008: Patroni instance 1 REST API
├── 8009: Patroni instance 2 REST API
└── 8010: Patroni instance 3 REST API
6.3 Useful Commands Cheat Sheet
# Cluster status
$ patronictl -c /etc/patroni/patroni1.yml list
# Manual failover (practice controlled failover)
$ patronictl -c /etc/patroni/patroni1.yml failover postgres-cluster
# Restart a node (safe, waits for replication to catch up)
$ patronictl -c /etc/patroni/patroni1.yml restart postgres-cluster node2
# Reinitialize a node (rebuild from primary)
$ patronictl -c /etc/patroni/patroni1.yml reinit postgres-cluster node3
# Check replication status
$ psql -h localhost -p 5432 -c "SELECT * FROM pg_stat_replication;"
# HAProxy statistics
$ curl http://localhost:7000
# etcd cluster health
$ etcdctl endpoint health
# View logs
$ journalctl -u patroni@1 -f
$ journalctl -u patroni@2 -f
$ journalctl -u patroni@3 -f
7. Project Ideas and Demonstrations
7.1 Academic Project Scenarios
This setup enables realistic project demonstrations:
Project 1: Automated Failover Demo
- Simulate primary failure during live database operations
- Measure recovery time
- Document state transitions
- Present findings with graphs
Project 2: Load Distribution Analysis
- Generate mixed read/write workload
- Compare single vs clustered performance
- Analyze resource utilization
- Create performance comparison charts
Project 3: Configuration Management Study
- Test configuration changes on replicas first
- Demonstrate safe rollback via failover
- Document best practices
- Build configuration templates
Project 4: Disaster Recovery Planning
- Implement WAL archiving to external storage
- Practice point-in-time recovery
- Document recovery procedures
- Create runbooks
7.2 Demo Scenarios for Presentations
5-Minute Demo: “Magic” Failover
# Setup
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"
# Show application running
$ while true; do
    psql -h localhost -p 5000 -c "SELECT now(), pg_is_in_recovery()";
    sleep 1;
  done
# Kill primary (dramatic moment)
$ sudo systemctl stop patroni@1
# Watch automatic recovery (~40 seconds)
# Application reconnects automatically
10-Minute Demo: Complete Architecture Tour
- Show all processes running (ps aux)
- Explain configuration files
- Demonstrate writes replicating
- Show HAProxy routing
- Trigger failover
- Explain what happened
20-Minute Demo: Deep Dive
- Architecture overview
- Configuration walkthrough
- Normal operation monitoring
- Simulated failure
- Recovery analysis
- Performance comparison
- Q&A
7.3 Capture Metrics for Analysis
Create scripts to capture data for reports:
#!/bin/bash
# capture_metrics.sh - Run during experiments
while true; do
    # Cluster state
    patronictl -c /etc/patroni/patroni1.yml list >> cluster_state.log

    # Replication lag (unaligned, comma-separated so it can be loaded with pandas later)
    psql -h 127.0.0.1 -p 5432 -t -A -F',' -c "
        SELECT now(), application_name,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024
        FROM pg_stat_replication" >> replication_lag.log

    # Resource usage (summed %CPU across postgres processes)
    echo "$(date),$(top -bn1 | grep postgres | awk '{sum += $9} END {print sum}')" >> cpu_usage.log

    # Connection count
    psql -h 127.0.0.1 -p 5432 -t -A -c "SELECT count(*) FROM pg_stat_activity" >> connections.log

    sleep 5
done
Then analyze with Python:
import pandas as pd
import matplotlib.pyplot as plt

# Load data (replication_lag.log is comma-separated: timestamp, node, lag_mb)
lag = pd.read_csv('replication_lag.log', names=['time', 'node', 'lag_mb'])
lag['time'] = pd.to_datetime(lag['time'])

# Plot replication lag over time
plt.figure(figsize=(12, 6))
for node in lag['node'].unique():
    node_data = lag[lag['node'] == node]
    plt.plot(node_data['time'], node_data['lag_mb'], label=node)

plt.xlabel('Time')
plt.ylabel('Replication Lag (MB)')
plt.title('Replication Lag During Load Test')
plt.legend()
plt.savefig('replication_lag_chart.png')
8. Understanding Trade-offs: Critical Thinking
8.1 The Honest Assessment
After working through this setup, students should understand:
What Single-Node Clustering IS:
- A learning laboratory for distributed database concepts
- Protection against software-level failures
- Automated recovery for process crashes
- Read scalability through replication
- A prototype architecture that scales to production
What Single-Node Clustering IS NOT:
- True high availability (hardware failures take everything down)
- A replacement for proper backups
- Protection against data center failures
- Suitable for mission-critical production systems
- A shortcut to avoid learning distributed systems properly
8.2 When This Approach Makes Sense
Ideal for:
- University courses on distributed databases
- Personal learning projects
- Proof-of-concept demonstrations
- Development/testing environments
- Technical interviews (showing knowledge)
- Documentation and tutorial creation
- Architecture decision justification
Not suitable for:
- Production systems requiring 99.9%+ uptime
- Applications where data loss is unacceptable
- Environments with compliance requirements (SOC2, HIPAA)
- Systems handling financial transactions
- Applications with SLA penalties
8.3 The Educational Value Proposition
The question isn’t “Does this provide real HA?” but rather “What does this teach?”
Skills Learned:
- PostgreSQL streaming replication mechanics
- Patroni configuration and operations
- etcd distributed coordination
- HAProxy load balancing and health checks
- Linux process management
- Systemd service configuration
- Failover automation logic
- Distributed consensus protocols
- Resource constraint analysis
- Performance benchmarking
Concepts Understood:
- Leader election algorithms
- Heartbeat mechanisms
- Time-to-live (TTL) patterns
- Atomic operations in distributed systems
- Split-brain prevention
- Quorum-based decisions
- State machine replication
- Read/write splitting
- Connection pooling
- Blast radius containment
These are production-grade skills learned in a risk-free environment.
9. Cost Analysis: Educational Economics
9.1 Learning Path Budget
Single Server Setup (Learning):
Local laptop: $0 (already owned)
VPS for practice: $5-10/month (Hetzner, DigitalOcean)
Total: ~$50 for 6 months of learning
Direct to Multi-Server (Learning on Production):
3 VPS instances: $15/month × 3 = $45/month
Load balancer: $10/month
Managed coordination: $20/month
Total: $75/month = $450 for 6 months
Learning cost difference: $400 saved, plus no risk of breaking production
9.2 Equipment Requirements
Minimum Laptop Specs:
- 8 GB RAM (12 GB recommended)
- 4 CPU cores
- 50 GB free disk space
- Linux or macOS (Windows WSL2 works)
Recommended VPS for Practice:
- 16 GB RAM
- 4-8 vCPUs
- 100 GB SSD
- Ubuntu 22.04 or similar
- Cost: ~$40-60/month
This is accessible to students and hobbyists.
10. Documentation and Reporting
10.1 Lab Report Template
For academic projects, document systematically:
# PostgreSQL HA Cluster - Lab Report
## 1. Objective
Demonstrate understanding of PostgreSQL high availability through
implementation and testing of a single-node clustered setup.
## 2. Architecture
[Include your architecture diagram]
## 3. Configuration
[Document key configuration decisions]
## 4. Experiments Performed
### Experiment 1: Failover Time Measurement
- Procedure: [steps taken]
- Observations: [what happened]
- Metrics: [failover time, connection drops]
- Analysis: [why it behaved this way]
### Experiment 2: Replication Lag Under Load
- Procedure: [load generation method]
- Observations: [lag behavior]
- Metrics: [graphs of lag over time]
- Analysis: [bottleneck identification]
## 5. Performance Comparison
[Benchmark results: single vs clustered]
## 6. Lessons Learned
[What worked, what didn't, what surprised you]
## 7. Limitations Identified
[Honest assessment of what this can't do]
## 8. Production Readiness Analysis
[What would need to change for production]
## 9. References
[Documentation and resources used]
10.2 Creating Effective Diagrams
Students should visualize what they built:
Architecture Diagram:
┌─────────────────────────────────────────────────────────────┐
│ Single Server (Laptop/VPS) │
│ │
│ ┌──────────────┐ │
│ │ etcd │←──────────────┐ │
│ │ :2379 │ │ │
│ └──────────────┘ │ Coordination │
│ ↑ ↑ ↑ │ │
│ │ │ └──────────┐ │ │
│ │ └─────┐ │ │ │
│ │ │ │ │ │
│ ┌──────┴────┐ ┌┴──────┴───┐ ┌──┴─────────┐ │
│ │ Patroni 1 │ │ Patroni 2 │ │ Patroni 3 │ │
│ │ :8008 │ │ :8009 │ │ :8010 │ │
│ └─────┬─────┘ └─────┬─────┘ └─────┬──────┘ │
│ │ │ │ │
│ ↓ ↓ ↓ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ PG :5432 │ │ PG :5433 │ │ PG :5434 │ │
│ │ PRIMARY │→→│ REPLICA │ │ REPLICA │ Replication │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └─────────────┴─────────────┘ │
│ ↑ │
│ ┌────────┴────────┐ │
│ │ HAProxy │ │
│ │ :5000 (write) │ │
│ │ :5001 (read) │ │
│ └─────────┬────────┘ │
└──────────────────────┼───────────────────────────────────────┘
│
Application
State Machine Diagram:
Normal Operation:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ node1 │────→│ node2 │ │ node3 │
│ PRIMARY │ │ REPLICA │←────│ REPLICA │
└──────────┘ └──────────┘ └──────────┘
Primary Fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ node1 │ │ node2 │ │ node3 │
│ CRASHED │ │ REPLICA │ │ REPLICA │
└──────────┘ └──────────┘ └──────────┘
↓ Election
After Failover:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ node1 │ │ node2 │────→│ node3 │
│ DOWN │ │ PRIMARY │ │ REPLICA │
└──────────┘ └──────────┘ └──────────┘
11. Troubleshooting Guide for Learners
11.1 Common Issues and Solutions
Issue 1: Patroni won’t start
# Check logs
$ journalctl -u patroni@1 -n 50
# Common causes:
# - PostgreSQL already running on that port
# - Data directory already exists
# - etcd not running
# - Configuration syntax error
# Solutions:
$ sudo systemctl stop postgresql # Stop system PostgreSQL
$ sudo rm -rf /var/lib/postgresql/data1 # Clean data dir
$ sudo systemctl start etcd # Ensure etcd is running
$ patroni --validate /etc/patroni/patroni1.yml # Check config
Issue 2: Replication not working
# Check replication user exists
$ psql -h 127.0.0.1 -p 5432 -c "\du"
# Check pg_hba.conf allows replication
$ cat /var/lib/postgresql/data1/pg_hba.conf | grep replication
# Check replication slots
$ psql -h 127.0.0.1 -p 5432 -c "SELECT * FROM pg_replication_slots;"
# Check for replication lag
$ psql -h 127.0.0.1 -p 5432 -c "SELECT * FROM pg_stat_replication;"
Issue 3: Failover doesn’t happen
# Check etcd is running
$ curl http://127.0.0.1:2379/version
# Check if leader key exists
$ etcdctl get /db/postgres-cluster/leader
# Check Patroni can reach etcd
$ curl http://127.0.0.1:8008/patroni
# Verify TTL settings in Patroni config
$ grep ttl /etc/patroni/patroni1.yml
11.2 Debugging Commands
# See all PostgreSQL processes
$ ps aux | grep postgres | grep -v grep
# Check which ports are listening
$ sudo netstat -tlnp | grep -E '5432|5433|5434|2379|8008|8009|8010'
# Watch cluster state continuously
$ watch -n 1 "patronictl -c /etc/patroni/patroni1.yml list"
# Check disk space (replication needs space!)
$ df -h
# Monitor system resources
$ htop
# Check PostgreSQL logs
$ tail -f /var/log/postgresql/postgresql-14-main.log
11.3 Reset and Start Over
When learning, sometimes you need a clean slate:
#!/bin/bash
# reset_cluster.sh - Nuclear option for starting fresh
# Stop everything
sudo systemctl stop patroni@{1,2,3}
sudo systemctl stop etcd
sudo systemctl stop haproxy
# Clean data directories
sudo rm -rf /var/lib/postgresql/data{1,2,3}
sudo rm -rf /var/lib/etcd/*
# Recreate data directories
sudo -u postgres mkdir -p /var/lib/postgresql/data{1,2,3}
# Start etcd first
sudo systemctl start etcd
sleep 5
# Start Patroni instances
sudo systemctl start patroni@1
sleep 10
sudo systemctl start patroni@2
sudo systemctl start patroni@3
sleep 10
# Start HAProxy
sudo systemctl start haproxy
# Check status
patronictl -c /etc/patroni/patroni1.yml list
12. Advanced Experiments
12.1 Chaos Engineering Practice
Once comfortable, introduce controlled chaos:
Experiment: Network Partition Simulation
# Simulate network issues with tc (traffic control)
$ sudo tc qdisc add dev lo root netem delay 100ms loss 5%
# Watch how replication lag responds
$ watch -n 1 "psql -h 127.0.0.1 -p 5432 -c 'SELECT * FROM pg_stat_replication'"
# Remove network impairment
$ sudo tc qdisc del dev lo root
Experiment: Resource Exhaustion
# Fill up disk to see replication fail
$ dd if=/dev/zero of=/tmp/fillup bs=1M count=10000
# Watch replication break
$ patronictl -c /etc/patroni/patroni1.yml list
# Clean up
$ rm /tmp/fillup
Experiment: Split-Brain Prevention
# Try to manually promote two nodes simultaneously
$ psql -h 127.0.0.1 -p 5433 -c "SELECT pg_promote()" &
$ psql -h 127.0.0.1 -p 5434 -c "SELECT pg_promote()" &
# Watch Patroni/etcd prevent split-brain
$ watch -n 1 "etcdctl get /db/postgres-cluster/leader"
12.2 Performance Optimization Exercises
Exercise: Tuning for Read-Heavy Workload
# Baseline
$ pgbench -h localhost -p 5001 -c 10 -T 60 -S demo
# Increase replica connections
$ psql -h 127.0.0.1 -p 5433 -c "ALTER SYSTEM SET max_connections = 200"
# Increase cache
$ psql -h 127.0.0.1 -p 5433 -c "ALTER SYSTEM SET shared_buffers = '4GB'"
# Restart and retest
$ sudo systemctl restart patroni@2
$ pgbench -h localhost -p 5001 -c 10 -T 60 -S demo
Exercise: Monitoring Replication Lag Under Load
# lag_monitor.py
import psycopg2
import time
import csv

def get_lag():
    conn = psycopg2.connect("host=127.0.0.1 port=5432 dbname=postgres")
    cur = conn.cursor()
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
        FROM pg_stat_replication
    """)
    results = cur.fetchall()
    conn.close()
    return results

with open('lag_over_time.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['timestamp', 'node', 'lag_mb'])
    while True:
        for node, lag in get_lag():
            writer.writerow([time.time(), node, lag])
        f.flush()  # keep the file readable while the monitor is still running
        time.sleep(1)
13. Conclusion: What Was Really Learned
13.1 The Honest Final Assessment
After completing this learning journey, students should be able to answer:
Question: Is single-node PostgreSQL clustering real high availability?
Answer: No—and understanding why is the point.
13.2 Skills Transferred to Production
What students learn here applies directly to real distributed systems:
Architectural Skills:
- How to design for fault tolerance
- When to use replication vs sharding
- Trade-offs between consistency and availability
- Importance of observability and monitoring
Operational Skills:
- Reading and writing service configurations
- Using systemd for service management
- Debugging distributed systems
- Interpreting logs and metrics
Database Skills:
- PostgreSQL replication internals
- Understanding WAL and LSN
- Query routing strategies
- Connection pooling patterns
Systems Skills:
- Process vs thread models
- Resource constraint analysis
- Performance benchmarking
- Capacity planning
13.3 The Migration Path Forward
The beauty of this approach: when ready for production, you already know how it works.
Learning Setup → Production Setup:
- Same configuration structure (just change IPs)
- Same monitoring queries
- Same troubleshooting approach
- Same operational commands
- Same architectural patterns
What Changes in Production:
- Three physical servers instead of one
- Geographic distribution for real HA
- Proper backup and disaster recovery
- Monitoring and alerting infrastructure
- Security hardening and access control
What Stays the Same:
- Patroni orchestration logic
- PostgreSQL replication mechanics
- HAProxy routing patterns
- etcd coordination approach
- Application integration code
13.4 Final Wisdom
Single-node PostgreSQL clustering is a pedagogical tool, not a production solution. It’s a training environment where you can:
- Break things safely
- Learn from failures
- Experiment with configurations
- Understand distributed systems
- Build transferable skills
For students, researchers, and engineers learning database administration, it’s perfect. For production systems requiring real high availability, it’s a starting point, not a destination.
The goal isn’t to run this in production. The goal is to understand distributed databases well enough that when you do build production systems, you know what you’re doing.
Quick Reference Card
Port Mapping
2379 etcd coordination
5000 HAProxy primary (read-write)
5001 HAProxy replicas (read-only)
5432-5434 PostgreSQL instances
7000 HAProxy statistics
8008-8010 Patroni REST APIs
Essential Commands
# Cluster status
$ patronictl -c /etc/patroni/patroni1.yml list
# Manual failover
$ patronictl -c /etc/patroni/patroni1.yml failover postgres-cluster
# Restart node
$ patronictl -c /etc/patroni/patroni1.yml restart postgres-cluster node2
# Check replication
$ psql -h localhost -p 5432 -c "SELECT * FROM pg_stat_replication;"
# HAProxy stats
$ curl http://localhost:7000
Monitoring Queries
-- Replication lag
SELECT application_name,
pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;
-- Connection count
SELECT count(*) FROM pg_stat_activity;
-- Active queries
SELECT pid, query, state
FROM pg_stat_activity
WHERE state = 'active';
Troubleshooting Checklist
□ Is etcd running? (curl http://127.0.0.1:2379/version)
□ Are all Patroni services running? (systemctl status patroni@{1,2,3})
□ Are PostgreSQL processes running? (ps aux | grep postgres)
□ Is there a leader? (etcdctl get /db/postgres-cluster/leader)
□ Is replication working? (psql query pg_stat_replication)
□ Is HAProxy routing correctly? (curl http://localhost:7000)
□ Is there disk space? (df -h)
□ Are there errors in logs? (journalctl -u patroni@1 -n 50)
This architecture serves as an excellent learning laboratory—a safe space to break things, understand distributed systems, and build real skills without the complexity and cost of managing multiple physical servers. When you’re ready to build production systems, you’ll already understand how they work.