Your production database suddenly can’t be accessed. It’s gone—not slow, not corrupted, but gone. Maybe it was a cascade of failed drives. Maybe a misconfigured automation script. Possibly ransomware. Whatever the cause, your company’s entire data layer just evaporated, and every second of downtime costs thousands.
Now comes the question that’ll determine your career trajectory: When did you last test your backups?
This isn’t hypothetical. Back in 2023, a major cloud provider suffered a data loss incident affecting customer databases. Companies with solid backup strategies recovered within hours. Those who’d just trusted default settings without verification? Some never got their data back.
Database backups aren’t some best practice checkbox—they’re what separates a stressful Tuesday from updating your LinkedIn profile. But here’s something most teams won’t admit: they’re probably doing it wrong. Not obviously wrong, but subtly, dangerously wrong in ways that only show up during disasters.
Think of this article as your insurance policy. We’re going way beyond checklists into the actual architecture of PostgreSQL backup systems, exploring not just what to do, but why it works and when it fails.
What You’re Actually Backing Up (Spoiler: It’s Not What You Think)
Let’s kill a dangerous myth right now: a database isn’t just “data in tables.” That mental model? It’ll bite you hard during recovery.
PostgreSQL is layered and sophisticated—your business data is just one piece in a complex puzzle of files, logs, and metadata. Understanding this architecture isn’t academic fluff. It’s what separates a 15-minute restore from a 15-hour nightmare.
The Hidden Complexity Inside Your Data Directory
When PostgreSQL writes data, it operates like a meticulous librarian maintaining multiple catalogs at once. Your data/ directory contains way more than you’d expect:
Heap Files store your actual tables, but they’re not simple files. Each table gets segmented into 1GB chunks (filesystem limitations), organized into 8KB pages. Every page has a header, line pointers, and tuple data arranged just so. Corrupt a single byte in a page header? That entire page becomes unreadable—along with potentially hundreds of rows.
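To make the segmentation concrete, here’s the arithmetic as a small sketch. The 8KB page size and 1GB segment size are PostgreSQL’s compile-time defaults; the block number is purely hypothetical:

```bash
#!/usr/bin/env bash
# Where does a given page live on disk? (PostgreSQL defaults assumed:
# 8KB pages, 1GB segment files; the block number is hypothetical.)
PAGE_SIZE=8192
SEGMENT_SIZE=$((1024 * 1024 * 1024))
BLOCKS_PER_SEGMENT=$((SEGMENT_SIZE / PAGE_SIZE))  # 131072 pages per segment

block_location() {
  local block=$1
  # Segment 0 is the bare relfilenode file; later segments get a .N suffix
  echo "segment=$((block / BLOCKS_PER_SEGMENT)) offset=$(( (block % BLOCKS_PER_SEGMENT) * PAGE_SIZE ))"
}

block_location 131073  # one page into the table's ".1" segment file
```

Corrupting a single page header therefore damages exactly one 8KB window of one segment file, but every row stored in that window goes with it.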
Write-Ahead Logs (WAL) are where things get interesting. Before PostgreSQL commits any change to data files, it writes a description of that change to WAL segments. Seems redundant, right? It’s actually your durability guarantee. Database crashes mid-transaction? WAL replay reconstructs what should’ve happened. This is exactly why “just copying the data directory” while PostgreSQL runs is catastrophically wrong—you’ll capture an inconsistent state across files.
System Catalogs define what your database means. Table schemas, user permissions, function definitions—stored in special system tables. Lose these, and even if you recover the raw heap files, PostgreSQL won’t know how to interpret them.
Tablespaces let you spread data across multiple disks. Great for performance, terrible for naive backup scripts that only capture one directory.
Why does this matter? Different backup approaches capture these layers differently. A logical backup (pg_dump) reconstructs everything through SQL, losing the physical layout but gaining portability. A physical backup (pg_basebackup) copies the exact byte structure, preserving layout but tying you to specific PostgreSQL versions. Neither is universally “better”—they solve different problems.
The Great Divide: Logical vs. Physical Backups
There’s a fundamental tension in backup design: abstraction versus efficiency. Let’s dig into both sides.
Logical Backups: Beautiful Lies That Actually Save You
pg_dump tells a beautiful lie: it pretends your database is just SQL statements. It reads your tables, generates INSERT commands (or custom binary formats), and hands you a file that could theoretically rebuild your database anywhere.
This abstraction packs some serious power. You can:
- Restore to different PostgreSQL versions (upgrading in the process)
- Cherry-pick specific tables during recovery
- Move between different architectures (x86 to ARM)
- Apply schema transformations during the restore
But abstraction comes with a price tag, and at scale, that price is brutal.
Here’s a real story from a company running a 4TB PostgreSQL database. Their logical backup took 18 hours to complete. Worse? That 18-hour window held a transaction snapshot open, preventing VACUUM from reclaiming dead tuples. The production database bloated by hundreds of gigabytes during each backup cycle. They were literally damaging their database to protect it.
Why so slow? pg_dump has to read every table sequentially, parse every row, serialize it to SQL or binary format, and write it out. For large databases, you’re stuck with single-threaded CPU and I/O throughput bottlenecks.
# Even with parallelism, logical dumps hit CPU limits.
# --jobs runs parallel table dumps; --compress=9 trades CPU for size.
pg_dump -h localhost -U postgres \
  --format=directory \
  --jobs=8 \
  --compress=9 \
  --verbose \
  --file=/backup/mydb_$(date +%Y%m%d) \
  mydb
The --jobs flag helps by dumping tables in parallel, but you’re still serializing data through the PostgreSQL query executor. For a 100GB table, that means parsing millions of rows, each burning CPU cycles for format conversion.
When logical backups make sense: Database under 100GB, need version independence, perform schema migrations, or need selective table restoration. They’re also perfect for development/staging database clones where absolute consistency with production isn’t critical.
Physical Backups: The Raw Truth
Physical backups take the opposite path: copy the actual bytes PostgreSQL uses internally. pg_basebackup doesn’t care about SQL—it streams 8KB pages directly from data files.
This approach is dramatically faster. Instead of parsing and serializing, you’re doing what filesystems do best: copying blocks. A multi-terabyte database that’d take a day to dump logically might back up physically in 2-3 hours (limited only by network/disk bandwidth).
Catch? Physical backups are opinionated. You’re capturing PostgreSQL’s internal file format for a specific major version. Can’t restore a PostgreSQL 13 physical backup to PostgreSQL 15. Can’t selectively restore tables. It’s everything or nothing.
# Physical backup with critical flags:
#   -Fp  plain format (actual files)
#   -Xs  stream WAL during backup (CRITICAL!)
#   -P   show progress
#   -R   write recovery config
#   -C   create the replication slot named by -S
pg_basebackup -h primary.example.com \
  -D /backup/base_$(date +%Y%m%d_%H%M%S) \
  -U replication_user \
  -Fp -Xs -P -R \
  -C -S backup_slot_$(date +%s)
That -Xs flag deserves special attention. Without it, you’re in a race condition: PostgreSQL might recycle WAL segments your backup needs for consistency. With -Xs, WAL records generated during the backup stream alongside data files, guaranteeing consistency even for multi-hour backup windows.
The -R flag is equally clever—it writes standby.signal and connection parameters, so your backup is already configured to become a replica. Not just convenient; it’s a testable disaster recovery node you can spin up to verify backup integrity.
When physical backups make sense: Large databases (500GB+), need Point-In-Time Recovery (PITR), want fast backup/restore cycles, or maintain disaster recovery replicas.
The Secret Third Option: Continuous Archiving
Now we’re getting into territory where PostgreSQL’s architecture really shines. What if you could combine physical backup speed with transaction-level granularity? That’s continuous archiving.
Works like this: Take a physical base backup (your foundation). Then, instead of waiting for the next full backup, continuously archive every WAL segment PostgreSQL produces. Since WAL contains every database change, you’ve got a time machine—restore the base backup, replay WAL segments, stop at any transaction timestamp.
# postgresql.conf - The game-changing configuration
wal_level = replica # Generate enough WAL for replay
archive_mode = on # Enable archiving
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'
archive_timeout = 300 # Force segment switch every 5 min
That archive_command looks simple but it’s incredibly powerful. Every time PostgreSQL fills a WAL segment (typically 16MB), it executes this command. The test ! -f check prevents overwriting existing archives (critical for data safety). The cp copies the segment to your archive location.
In production, you’d swap that cp for something more robust—encryption, compression, remote transfer to cloud storage. But the principle stays the same: each 16MB segment gives you a few minutes of transaction history. Stack enough segments, and you can recover to within seconds of any failure point.
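As a sketch of that “more robust” direction, here’s one shape such a script could take. The paths are hypothetical and the compress-then-rename pattern is illustrative, not any particular tool’s implementation:

```bash
#!/usr/bin/env bash
# Hypothetical archive_command target: compress, never overwrite, and
# rename atomically so a restore never sees a half-written segment.
# postgresql.conf would invoke it as: archive_command = 'archive_wal.sh %p %f'
archive_wal() {
  local wal_path=$1 wal_name=$2
  local dest="${ARCHIVE_DIR:?set ARCHIVE_DIR}/${wal_name}.gz"
  # Same safety as `test ! -f`: an existing archive is never clobbered
  [ -e "$dest" ] && return 1
  gzip -c "$wal_path" > "${dest}.tmp" && mv "${dest}.tmp" "$dest"
}
```

Remote transfer (an `aws s3 cp`, say) would slot in after the compression step; the never-overwrite and atomic-rename guarantees are the parts worth keeping in any variant.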
The RPO game-changer: Your Recovery Point Objective shifts from “whenever the last backup ran” to “whenever the last WAL segment archived”—potentially seconds ago.
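The back-of-envelope math is worth doing for your own workload. Here the 16MB segment size is the PostgreSQL default, but the write rate is a made-up number you’d replace with a measurement:

```bash
#!/usr/bin/env bash
# Worst-case RPO and daily archive volume under continuous archiving.
# 16MB is the default WAL segment size; the write rate is hypothetical.
SEGMENT_MB=16
ARCHIVE_TIMEOUT_SEC=300      # archive_timeout from postgresql.conf above
WAL_RATE_MB_PER_MIN=4        # measure yours via pg_current_wal_lsn() deltas

echo "worst-case RPO: ${ARCHIVE_TIMEOUT_SEC}s"
SEGMENTS_PER_DAY=$(( WAL_RATE_MB_PER_MIN * 60 * 24 / SEGMENT_MB ))
echo "archive volume: ${SEGMENTS_PER_DAY} segments/day (~$(( SEGMENTS_PER_DAY * SEGMENT_MB / 1024 ))GB)"
```

At this (modest) rate the archive grows by a few gigabytes a day, which is why retention and cleanup policies matter as much as the archiving itself.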
Continuous archiving does introduce operational complexity though. What happens if your archive storage fills? If network issues prevent archiving? PostgreSQL will halt all write operations rather than lose data. You need monitoring, alerting, and automated cleanup policies.
This is where specialized tools really earn their keep.
pgBackRest: When Native Tools Fall Short
Here’s another story: the incident that converted me to pgBackRest. We were running a multi-terabyte PostgreSQL database with what seemed like a solid setup: daily pg_basebackup to a NAS, WAL archiving via custom scripts. Everything worked fine until the day we actually needed to restore.
The base backup was 72 hours old because of a failed job nobody noticed. Fine, we’d replay WAL. But our WAL archive had gaps—network hiccups caused some segments to never archive, and our monitoring didn’t catch it. We could recover to a point 48 hours in the past, but not to the 2-hour-old state we actually needed.
That failure cost hours of lost transactions.
Why pgBackRest Exists
pgBackRest isn’t just “better backup scripts”—it’s a fundamentally different architecture for backup management. The developers got something crucial: backups aren’t a single operation, they’re an ongoing system that needs verification, optimization, and recovery guarantees.
Multiple Repository Architecture lets you maintain several backup destinations simultaneously:
# /etc/pgbackrest/pgbackrest.conf
[global]
# Fast local backups for quick recovery
repo1-type=posix
repo1-path=/var/lib/pgbackrest
repo1-retention-full=4

# Cloud backups for disaster recovery
repo2-type=s3
repo2-s3-bucket=myorg-pgbackup-prod
repo2-s3-region=us-east-1
repo2-retention-full=12

# Geographic redundancy
repo3-type=s3
repo3-s3-bucket=myorg-pgbackup-dr
repo3-s3-region=eu-west-1
repo3-retention-full=12
With this setup, every backup hits three locations: local disk for 15-minute restores, primary cloud region for daily recovery scenarios, and secondary region for true disaster scenarios. Your entire primary region disappears? Your data survives.
The Incremental Backup Ladder
pgBackRest implements a three-tier backup strategy that’s way more sophisticated than “full or nothing”:
Full backups copy everything—your baseline. Large and slow but establish clean restore points.
Differential backups copy only files modified since the last full backup. Weekly fulls and daily differentials? Each differential captures one day’s changes. Restore needs the full backup plus latest differential—two operations max.
Incremental backups copy only files changed since the last backup of any type. Smallest and fastest, but restore requires replaying a chain: full + differential + multiple incrementals.
The magic shows up in the scheduling:
# Sunday: Full backup (baseline)
pgbackrest --stanza=main --type=full backup
# Monday-Saturday: Differential (one day's changes from Sunday)
pgbackrest --stanza=main --type=diff backup
# Every 4 hours: Incremental (changes since last backup)
pgbackrest --stanza=main --type=incr backup
This gives you 4-hour RPO with minimal storage overhead. A restore pulls the full, latest differential, and a few incrementals—typically done in 30 minutes for multi-terabyte databases.
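To see why the restore chain stays short, here’s the chain-selection logic as a sketch. The labels are hypothetical; F, D, and I stand for full, differential, and incremental:

```bash
#!/usr/bin/env bash
# Given a backup history (oldest first), which backups does a restore
# to the latest point need? F=full, D=differential, I=incremental.
restore_chain() {
  local -a history=("$@") chain=()
  local i j
  for (( i=${#history[@]}-1; i>=0; i-- )); do
    chain=("${history[i]}" "${chain[@]}")
    case ${history[i]%%:*} in
      F) break ;;                 # a full backup ends the chain
      D) # a differential depends only on its base full; skip anything older
         for (( j=i-1; j>=0; j-- )); do
           if [ "${history[j]%%:*}" = F ]; then
             chain=("${history[j]}" "${chain[@]}"); break
           fi
         done
         break ;;
    esac
  done
  echo "${chain[@]}"
}

restore_chain F:sun D:mon I:mon-04 I:mon-08
```

Notice that a differential resets the chain: however many incrementals ran before it, the restore never reaches past the most recent differential back to anything but its base full.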
Block-Level Incremental: The Hidden Gem
Here’s where pgBackRest gets genuinely clever. Traditional incremental backups compare file modification times. If a 10GB table file was touched, back up all 10GB again—even if only one page changed.
pgBackRest maintains a checksum manifest of every 8KB page in your database. On incremental backups, it:
- Scans each file
- Checksums individual pages
- Backs up only pages whose checksums changed
For databases with localized write patterns (super common in real workloads), this is transformative. Instead of backing up entire modified files, you back up individual dirty pages. I’ve seen 10GB “incremental” backups that would’ve been 500GB with traditional file-level approaches.
Parallel Everything
pgBackRest parallelizes aggressively:
# 8 parallel workers; lz4 keeps compression fast and cheap on CPU
pgbackrest --stanza=main \
  --type=full \
  --process-max=8 \
  --compress-type=lz4 \
  --compress-level=6 \
  backup
Each worker handles separate files, maximizing throughput. Compression happens in-stream per worker, distributing CPU load. On modern multi-core systems, this saturates available network or disk bandwidth—often hitting 2-4GB/sec backup speeds.
Point-In-Time Recovery: Your Actual Time Machine
PITR is where backup architecture gets almost magical. You can rewind your database to any transaction between your base backup and current WAL position.
Why does this matter? Consider these scenarios:
3:17 PM: A developer runs DELETE FROM orders without a WHERE clause. Mistake noticed at 3:19 PM. With PITR, you restore to 3:16:59 PM—before the deletion—losing only seconds of legitimate transactions.
Midnight: An automated process corrupts data, but you don’t discover it until 8 AM. With PITR, you restore to 11:59 PM, before corruption happened.
Legal discovery: You need your database state as it existed on a specific date six months ago. With archived WAL and base backups, you can reconstruct that exact state.
How PITR Actually Works
The mechanics are surprisingly elegant. When you initiate PITR:
# Restore base backup
pgbackrest --stanza=main \
--type=time \
--target="2025-11-11 14:30:00" \
--target-action=promote \
restore
Under the hood:
- pgBackRest restores the most recent base backup taken before your target time
- It configures PostgreSQL to replay the archived WAL segments
- PostgreSQL starts in recovery mode, replaying transactions sequentially
- At 14:30:00, PostgreSQL stops replay and promotes to primary
The --target-action=promote matters. You could also use pause, which stops recovery but leaves PostgreSQL read-only. This lets you inspect data before committing to the restore—useful when you’re not 100% certain of the right recovery point.
The Recovery Speed Problem
Here’s an uncomfortable truth about PITR: recovery speed is limited by WAL replay, which is single-threaded. Target recovery point is a week after your base backup? PostgreSQL has to replay a week’s worth of transactions—potentially billions of operations.
This is exactly why base backup frequency matters. A database doing 10,000 TPS (transactions per second) generates substantial WAL. Replaying 24 hours of WAL might take 4-6 hours. Replaying a week? Could take days.
Solution: More frequent base backups, or use the differential/incremental strategy to reduce WAL replay windows. Daily differential means you’re never replaying more than 24 hours of WAL.
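The estimate is simple enough to do in your head, but worth writing down. Both rates here are hypothetical; measure your own WAL volume and single-threaded replay throughput:

```bash
#!/usr/bin/env bash
# Rough WAL replay-time estimate. Both rates are assumptions; replace
# them with measurements from your own workload and hardware.
WAL_GB_PER_DAY=200        # hypothetical WAL volume
REPLAY_MB_PER_SEC=50      # hypothetical single-threaded replay throughput
DAYS_SINCE_BASE=7

TOTAL_MB=$(( WAL_GB_PER_DAY * 1024 * DAYS_SINCE_BASE ))
REPLAY_HOURS=$(( TOTAL_MB / REPLAY_MB_PER_SEC / 3600 ))
echo "~${REPLAY_HOURS} hours of replay to reach the target"
```

Halve `DAYS_SINCE_BASE` and the replay window halves with it, which is the whole argument for more frequent base backups.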
Testing: The Practice That Saves Careers
Here’s a fact that should terrify you: surveys show only 30% of companies regularly test database restores. The other 70%? They discover their backup failures during actual disasters.
I’ve interviewed dozens of DBAs with “resume-generating event” stories. Almost all share a theme: their backups worked fine for months until the day they didn’t. Configuration drift. A storage system that silently corrupted data. A restore process that worked on the test database but failed on the 10x larger production database due to memory constraints.
Automated Restore Testing
Your backup system should test itself. Here’s a production-grade verification script:
#!/bin/bash
# Automated weekly restore verification
STANZA="production"
VERIFY_PATH="/restore/verify_$(date +%s)"
VERIFY_PORT=5433
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
# Restore latest backup
pgbackrest --stanza=${STANZA} \
--pg1-path=${VERIFY_PATH} \
restore || {
curl -X POST ${SLACK_WEBHOOK} \
-d '{"text":"🚨 BACKUP RESTORE FAILED"}'
exit 1
}
# Start PostgreSQL on alternate port
pg_ctl -D ${VERIFY_PATH} -o "-p ${VERIFY_PORT}" start
# Wait for startup
sleep 10
# Run validation queries
psql -p ${VERIFY_PORT} -c "SELECT count(*) FROM critical_business_table;" \
|| {
curl -X POST ${SLACK_WEBHOOK} \
-d '{"text":"⚠️ Backup restored but validation failed"}'
pg_ctl -D ${VERIFY_PATH} stop
exit 1
}
# Verify recent data exists (catches incremental backup issues)
LATEST_TIMESTAMP=$(psql -p ${VERIFY_PORT} -t -c \
"SELECT max(created_at) FROM transactions;")
HOURS_OLD=$(( ($(date +%s) - $(date -d "${LATEST_TIMESTAMP}" +%s)) / 3600 ))
if [ ${HOURS_OLD} -gt 24 ]; then
curl -X POST ${SLACK_WEBHOOK} \
-d "{\"text\":\"⚠️ Restored backup is ${HOURS_OLD} hours old\"}"
fi
# Success notification
curl -X POST ${SLACK_WEBHOOK} \
-d '{"text":"✅ Weekly backup verification successful"}'
# Cleanup
pg_ctl -D ${VERIFY_PATH} stop
rm -rf ${VERIFY_PATH}
Run this weekly. The script:
- Restores the latest backup to a temporary location
- Starts PostgreSQL on a non-conflicting port
- Runs actual business queries to verify data integrity
- Checks that recent data exists (catches WAL archiving failures)
- Alerts via Slack on any issues
- Cleans up automatically
This isn’t theoretical—this script has caught numerous issues in production: backup corruption, configuration drift, storage failures, and WAL archiving gaps that would’ve been catastrophic to discover during actual disasters.
The Chaos Engineering Approach
Go even further: periodically perform unannounced restore drills. Pick a random day, spin up a restored database, have your team verify it’s usable. Time how long the restore takes. Document issues encountered. Update your runbooks.
Some teams take this to the extreme: they regularly destroy their primary database and restore from backup in production. Sounds insane, but it’s the ultimate confidence builder. If you can casually destroy and restore your production database, you know your backup system works.
(Don’t actually do this without extensive preparation, a maintenance window, and approval from people who can fire you. But the thought experiment? Valuable.)
Monitoring: Knowing Before It’s Too Late
Backups fail silently all the time. A cron job returns exit code 0 but writes corrupted data. Network issues cause gaps in WAL archiving. Your backup storage hits 95% full and will max out tomorrow.
You need comprehensive monitoring, not just “did the backup job complete.”
Critical Metrics Worth Tracking
Backup Age: How long since your last successful backup? Alert if it exceeds your RPO by 50%.
# Age of the most recent backup, from pgBackRest's JSON info output
pgbackrest --stanza=production info --output=json \
  | jq -r '.[0].backup[-1] | "\(.type) backup \(.label) finished \(.timestamp.stop | todate)"'
Last backup 30 hours old and your RPO is 24 hours? You have a problem right now.
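One way to encode that rule (the RPO-plus-50% threshold is the suggestion above; tune it for your environment):

```bash
#!/usr/bin/env bash
# Alert rule sketch: warn once the backup age passes the RPO,
# page on-call once it passes RPO + 50%.
backup_age_status() {
  local age_hours=$1 rpo_hours=$2
  local page_at=$(( rpo_hours * 3 / 2 ))
  if   [ "$age_hours" -gt "$page_at" ];   then echo critical
  elif [ "$age_hours" -gt "$rpo_hours" ]; then echo warning
  else                                         echo ok
  fi
}

backup_age_status 30 24   # a 30-hour-old backup against a 24-hour RPO
```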
WAL Archive Lag: The gap between current WAL position and last archived segment.
SELECT
    last_archived_wal,
    last_archived_time,
    now() - last_archived_time AS archive_lag,
    (SELECT count(*) * 16
       FROM pg_ls_archive_statusdir()
      WHERE name LIKE '%.ready') AS mb_behind
FROM pg_stat_archiver;
If mb_behind exceeds 500MB, you’ve got an archiving issue. PostgreSQL is generating WAL faster than you’re archiving it. Eventually, pg_wal/ fills up and your database halts writes.
Backup Size Trends: Sudden changes in backup size often signal issues.
Plot backup sizes over time. A 3TB database that suddenly produces 4TB backups might indicate:
- Excessive table bloat (VACUUM not running)
- Someone uploaded massive data without telling you
- Backup compression failing silently
Restore Test Success Rate: Track automated restore verification results. Testing weekly? You should have 52 data points per year. Your success rate should be 100%. Anything less indicates systematic issues.
Alert Fatigue vs. Alert Blindness
The challenge: too many alerts and teams ignore them. Too few, and critical issues slip by unnoticed.
Tiered alerting helps:
Critical (page on-call immediately):
- Backup failed for 12+ hours
- WAL archiving stopped
- Backup storage over 90% full
- Restore test failed
Warning (ticket during business hours):
- Backup duration increasing 30%+ week-over-week
- Backup storage over 70% full
- WAL archive lag over 1 hour
Info (dashboard only):
- Backup completed successfully
- Retention policy cleaned old backups
- Restore test successful
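One signal from those tiers, the storage-capacity check, written out as routing logic (a sketch; the 90%/70% cut-offs are the ones from the lists above):

```bash
#!/usr/bin/env bash
# Route the backup-storage-capacity signal to an alert tier,
# using the thresholds suggested above.
storage_alert_tier() {
  local pct_full=$1
  if   [ "$pct_full" -gt 90 ]; then echo critical   # page on-call
  elif [ "$pct_full" -gt 70 ]; then echo warning    # business-hours ticket
  else                              echo info       # dashboard only
  fi
}

storage_alert_tier 92
```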
Real-World Architecture Patterns
Let’s move from theory to practice with actual deployment patterns I’ve seen work (and fail) in production.
Cloud-Native: AWS RDS Example
You’re running RDS PostgreSQL. AWS handles automated backups, but most teams make a subtle mistake: trusting only RDS snapshots.
The problem: RDS snapshots are storage-level and crash-consistent. Got a bug that slowly corrupts application data? RDS snapshots will happily preserve that corruption. By the time you notice, all your snapshots are corrupted.
The solution: Supplement RDS snapshots with logical dumps to S3:
#!/bin/bash
# Daily logical backup to S3 (runs on an EC2 instance)
TIMESTAMP=$(date +%Y%m%d)
BUCKET="s3://mycompany-db-logical-backups"
# Dump to compressed custom format
pg_dump -h mydb.xxxx.rds.amazonaws.com \
-U postgres \
-Fc \
-f /tmp/dump_${TIMESTAMP}.backup \
mydb
# Upload to S3 with encryption
aws s3 cp /tmp/dump_${TIMESTAMP}.backup \
${BUCKET}/dump_${TIMESTAMP}.backup \
--storage-class STANDARD_IA \
--server-side-encryption AES256
# Cleanup
rm /tmp/dump_${TIMESTAMP}.backup
Now you’ve got two recovery paths:
- RDS snapshots for infrastructure failures (fast recovery)
- Logical dumps for data corruption issues (can restore to self-managed PostgreSQL if needed)
Cross-region replication: Don’t stop at cross-region read replicas. Copy your logical backups to a separate region:
# Replicate to DR region
aws s3 sync s3://mycompany-db-logical-backups-us-east-1 \
s3://mycompany-db-logical-backups-eu-west-1 \
--region eu-west-1
Self-Managed: The On-Premises Setup
You’re running PostgreSQL on physical hardware in your data center. Full control, but also all the responsibility.
Three-tier storage strategy:
┌─────────────────────────────────────────────┐
│ Tier 1: NVMe (Hot) │
│ - Last 3 days of full backups │
│ - Last 24 hours of WAL │
│ - Target: under 15 min restore │
└─────────────────────────────────────────────┘
↓ (Copy completed backups)
┌─────────────────────────────────────────────┐
│ Tier 2: NAS/SAN (Warm) │
│ - Last 4 weeks of full backups │
│ - Last 2 weeks of WAL │
│ - Target: under 2 hour restore │
└─────────────────────────────────────────────┘
↓ (Weekly sync)
┌─────────────────────────────────────────────┐
│ Tier 3: Object Storage or Tape (Cold) │
│ - 12 months of monthly fulls │
│ - Compliance retention │
│ - Target: under 24 hour restore │
└─────────────────────────────────────────────┘
The beauty here: common failures (mistaken DELETE, corruption detected within hours) restore from Tier 1 in minutes. Rare disasters (building fire, complete storage failure) recover from Tier 3, accepting longer RTO.
Kubernetes: The Containerized Challenge
Running PostgreSQL in Kubernetes introduces unique complexity. Pods are ephemeral, your data isn’t.
The pattern that works: StatefulSets with dedicated backup sidecars:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:14
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
        - name: backup
          mountPath: /backup
      # Backup sidecar
      - name: pgbackrest
        image: pgbackrest/pgbackrest:latest
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
          readOnly: true
        - name: backup
          mountPath: /backup
        command:
        - /bin/bash
        - -c
        - |
          # Configure pgBackRest
          cat > /etc/pgbackrest/pgbackrest.conf <<EOF
          [global]
          repo1-type=s3
          repo1-s3-bucket=${S3_BUCKET}
          repo1-s3-endpoint=s3.amazonaws.com
          repo1-s3-region=${AWS_REGION}

          [main]
          pg1-path=/var/lib/postgresql/data
          pg1-port=5432
          EOF

          # Run a backup every hour
          while true; do
            pgbackrest --stanza=main backup
            sleep 3600
          done
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
The sidecar pattern keeps backup logic separate from the database container, allowing independent updates and scaling.
The War Stories: What Can Go Wrong
Let me share some real failures I’ve witnessed (names and details changed to protect identities):
Story 1: The Silent Corruption
A financial services company ran nightly pg_dump backups for their transaction database. Everything looked perfect—backups completed successfully every night for two years.
Then they discovered a bug in their application logic that’d been slowly corrupting foreign key relationships for six months. When they tried to restore to a pre-corruption state, they discovered all their backups were taken using pg_dump, which exports data in a non-deterministic order. They couldn’t identify which transactions were affected.
Worse, their logical dumps didn’t capture the exact physical state of the database. Some corruption was invisible to pg_dump but present in the actual heap files.
The lesson: Logical backups work great for migrations, terrible for forensic recovery. They needed physical backups with WAL archiving for PITR.
Story 2: The Untested Restore
An e-commerce company had what looked like “perfect” backups: pgBackRest configured, monitoring in place, backups completing successfully. They’d never tested a restore though.
During a production incident, they attempted to restore. The backup server had been configured with hardcoded paths that existed on the backup server but not the production server. Restore failed. After fixing paths, they discovered the restore consumed 4x the RAM of their production server (due to PostgreSQL recovery buffers), causing OOM kills.
By the time they sorted out the restore, they’d been down for 11 hours. Their SLA was 4 hours.
The lesson: Test restores on production-identical infrastructure. Document actual resource consumption during recovery, not just during normal operations.
Story 3: The Retention Policy Disaster
A healthcare company needed to recover data from 18 months prior for a legal discovery request. Their backup retention policy kept 30 days of daily backups and 12 months of monthly backups.
The 18-month-old data fell between retention periods—it’d been automatically deleted. They had no recovery path. The legal consequences were… expensive.
The lesson: Understand your compliance requirements before configuring retention policies. When in doubt, keep backups longer than you think necessary. Storage is cheap compared to legal liability.
Story 4: The Encryption Key Fiasco
A startup implemented backup encryption to meet SOC2 requirements. They stored the encryption key in their infrastructure-as-code repository for convenience.
A disgruntled employee deleted the encryption key from the repository before leaving. When they needed to restore three months later, they discovered the encrypted backups were unrecoverable—the key was gone.
The company nearly went bankrupt. They recovered only by hiring a forensics firm that found an old key in Git history at a cost of six figures.
The lesson: Encryption keys require separate key management infrastructure. Use AWS KMS, Azure Key Vault, or HashiCorp Vault. Never store keys alongside encrypted data.
When Disaster Strikes: The Runbook
You’ve built the perfect backup system. Now it’s 3 AM and production is down. What do you actually do?
The Recovery Decision Tree
DATABASE FAILURE DETECTED
│
├─ Is the primary server accessible?
│ ├─ YES → Is data corrupted or just the instance?
│ │ ├─ Corrupted → PITR to before corruption
│ │ └─ Instance → Restart instance, verify integrity
│ └─ NO → Promote replica OR restore from backup
│
├─ Do you have a hot standby?
│ ├─ YES → Promote standby (fastest recovery)
│ └─ NO → Continue to backup restore
│
├─ How much data loss is acceptable?
│ ├─ None → PITR to latest possible moment
│ │ (requires base backup + all WAL segments)
│ └─ Some acceptable → Restore latest full backup
│ (faster, loses recent transactions)
│
└─ What's the target environment?
├─ Same server → Restore to original location
├─ Different server → Restore, update connection strings
└─ New infrastructure → Provision resources, then restore
The Actual Commands
Scenario 1: Promote Standby (Fastest)
# On standby server
pg_ctl promote -D /var/lib/postgresql/14/main
# Update application connection strings immediately
# Old primary becomes obsolete
Scenario 2: PITR After Accidental Deletion
# Stop PostgreSQL
systemctl stop postgresql
# Backup current data (just in case)
mv /var/lib/postgresql/14/main /var/lib/postgresql/14/main.old
# Restore to before the mistake
pgbackrest --stanza=main \
--type=time \
--target="2025-11-11 14:30:00" \
--target-action=promote \
restore
# Start PostgreSQL (will replay WAL to target time)
systemctl start postgresql
# Verify data is correct
psql -c "SELECT count(*) FROM important_table;"
# If correct, remove old data
# If wrong, stop and reassess
Scenario 3: Complete Infrastructure Failure
# Provision new server with same PostgreSQL version
# Install pgBackRest and configure with your repository settings
# Restore latest backup
pgbackrest --stanza=main \
--type=default \
restore
# Start PostgreSQL
systemctl start postgresql
# Verify integrity
psql -c "SELECT version();"
psql -c "SELECT count(*) FROM critical_business_table;"
# Update DNS or load balancer to point to new instance
# Monitor for issues
tail -f /var/log/postgresql/postgresql-14-main.log
The Post-Recovery Checklist
Recovery doesn’t end when PostgreSQL starts. You’ve got to verify everything actually works:
# 1. Check replication lag (if you have replicas)
psql -c "SELECT application_name, state, sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;"
# 2. Verify table counts match expectations
psql -c "SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC LIMIT 20;"
# 3. Check for bloat or corruption
psql -c "SELECT * FROM pg_stat_database WHERE datname = 'mydb';"
# 4. Test application connectivity
curl http://localhost:8080/health
# 5. Review PostgreSQL logs for errors
grep ERROR /var/log/postgresql/postgresql-14-main.log
Document everything: what failed, how long recovery took, what worked and what didn’t. This becomes your post-mortem and improves your next response.
The Compliance Conversation Nobody Wants to Have
If you work in healthcare, finance, or government, compliance isn’t optional—it’s legally mandated and auditors don’t mess around.
HIPAA Example
HIPAA requires specific things, not suggestions:
- Encryption at rest and in transit: Use pgBackRest with cipher options
- Access controls: Only authorized personnel can access backups
- Audit trails: Log all backup and restore operations
- Retention policies: Typically 6-7 years
- Periodic testing: Document restore tests quarterly
Configuration example:
[global]
repo1-type=s3
repo1-s3-bucket=mycompany-hipaa-backups
repo1-cipher-type=aes-256-cbc
# Supply the passphrase at runtime via the PGBACKREST_REPO1_CIPHER_PASS
# environment variable -- never hardcode repo1-cipher-pass here!
repo1-retention-full=84 # keep full backups for 12 weeks
repo1-retention-full-type=time # retention-full is measured in days
# Note: repo1-retention-archive counts backups, not days; WAL expires with
# the backups that need it. For the 7-year requirement, copy expiring
# backups to a separate archival repository before they age out.
# Enable detailed logging
log-level-console=info
log-level-file=debug
log-path=/var/log/pgbackrest
Fetch the encryption passphrase from AWS Secrets Manager or HashiCorp Vault and inject it into the process environment only at backup time; never write it into configuration files or leave it sitting in shell profiles in plaintext.
The Audit Trail
Compliance auditors will ask: “Prove your backups work.” Your answer should be:
“Here’s our automated restore test log showing successful weekly verifications for the past year. Here’s our retention policy configuration. Here’s our access log showing who touched backups and when. Here’s our encryption key rotation schedule.”
Not: “Uh, we run backups nightly and assume they work?”
The Economic Argument: Why Backups Pay for Themselves
CFOs love when technical decisions have clear ROI. Here’s the math that actually matters:
Scenario: E-commerce site processing $10M annually
- Average downtime cost: $5,000/hour (lost sales, customer churn, reputation damage)
- Without proper backups: 48-hour recovery from catastrophic failure
- Cost: $240,000
Investment in proper backup infrastructure:
- pgBackRest configuration and automation: $10,000 (one-time)
- Additional storage (3-2-1 strategy): $2,000/year
- Monthly restore testing time: $6,000/year
- Total first year: $18,000
With proper backups: 4-hour recovery from same failure
- Cost: $20,000
Savings after the first disaster: $240,000 vs. $20,000 in downtime = $220,000 gross, or $202,000 net of the $18,000 investment
And that’s a single incident. Over five years, assuming one major incident and several minor ones, proper backups save millions while costing tens of thousands.
The ROI is overwhelming. The real question isn’t “can we afford proper backups?” It’s “can we afford not to?”
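The arithmetic above is easy to rerun with your own figures. This sketch just restates the scenario's hypothetical numbers; every value is an assumption you should replace:

```python
# Back-of-envelope backup ROI, using the hypothetical figures from the
# scenario above. All numbers are assumptions -- substitute your own.

downtime_cost_per_hour = 5_000   # lost sales, churn, reputation damage
hours_without_backups = 48       # recovery time with no real strategy
hours_with_backups = 4           # recovery time with tested backups
first_year_investment = 18_000   # tooling, storage, testing time

cost_without = downtime_cost_per_hour * hours_without_backups
cost_with = downtime_cost_per_hour * hours_with_backups

gross_savings = cost_without - cost_with
net_savings = gross_savings - first_year_investment

print(f"Without backups:   ${cost_without:,}")    # $240,000
print(f"With backups:      ${cost_with:,}")       # $20,000
print(f"Gross savings:     ${gross_savings:,}")   # $220,000
print(f"Net of investment: ${net_savings:,}")     # $202,000
```

Run it once with pessimistic inputs and once with optimistic ones; if backups pay for themselves in both cases, the conversation with finance is short.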
The Human Element: Building a Backup Culture
Technology is actually the easy part. The hard part? Organizational culture.
The Blameless Post-Mortem
When backup failures occur, resist the urge to find someone to fire. Run blameless post-mortems instead:
INCIDENT: Production database corruption required restore
TIMELINE:
09:15 - Corruption detected in orders table
09:20 - Decision made to restore from backup
09:25 - Restore initiated from last night's backup
09:30 - Discovered backup was incomplete (script error)
10:45 - Located older backup that was valid
12:30 - Restore completed
13:00 - Service restored
WHAT WENT WRONG:
- Backup script had silent failure mode (exit code 0 despite errors)
- Backup verification only checked file existence, not integrity
- Recent backups were incomplete for 3 days without detection
WHAT WENT RIGHT:
- Team identified corruption quickly
- Escalation procedures worked
- Older backups were available
CORRECTIVE ACTIONS:
1. Update backup scripts to fail loudly (assigned: Alice, due: next week)
2. Implement pg_verifybackup in verification pipeline (assigned: Bob, due: 2 weeks)
3. Add backup integrity checks to monitoring (assigned: Charlie, due: 1 week)
Focus on systems, not people. Blame creates cultures where people hide mistakes, making systems less reliable.
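Corrective action 1 (fail loudly) is worth spelling out. Here's a minimal sketch of a wrapper that refuses to report success unless the backup command exited cleanly *and* produced a non-empty artifact -- the command and artifact path are placeholders, not a specific tool's API:

```python
import subprocess
import sys
from pathlib import Path

def run_backup(cmd: list[str], artifact: Path) -> None:
    """Run a backup command and fail loudly on any problem."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Never swallow a bad exit code -- surface stderr and die.
        sys.stderr.write(f"backup failed ({result.returncode}): {result.stderr}\n")
        raise SystemExit(1)
    if not artifact.exists() or artifact.stat().st_size == 0:
        # Exit code 0 is not enough: verify an artifact was actually produced.
        sys.stderr.write(f"backup produced no output at {artifact}\n")
        raise SystemExit(1)
    print(f"backup ok: {artifact} ({artifact.stat().st_size} bytes)")

# Usage (placeholder command -- in reality this wraps pg_dump or pgbackrest):
#   run_backup(["pg_dump", "-Fc", "-f", "/var/backups/db.dump", "mydb"],
#              Path("/var/backups/db.dump"))
```

The point is structural: a nonzero exit from your monitoring is the only state that means "backup succeeded," so silent failure modes like the one in the post-mortem can't accumulate for three days.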
The Backup Champion
Designate a “backup champion” in your organization. This person:
- Owns backup strategy and tooling
- Reviews backup policies quarterly
- Coordinates disaster recovery drills
- Evangelizes backup testing
This isn’t a full-time role, but it ensures backups don’t become “someone else’s problem.”
Training and Drills
Everyone who might touch production should know basic recovery procedures. Run quarterly drills:
Quarter 1: Simulate accidental table drop, restore via PITR
Quarter 2: Simulate server failure, restore to new infrastructure
Quarter 3: Simulate backup corruption, restore from older backup
Quarter 4: Full disaster recovery drill with all systems
Time each drill. Document issues. Update procedures. Make it routine, not heroic.
Closing Thoughts: The Backup Mindset
I’ll leave you with a mental model that’s served me well over the years:
Your database is already lost.
Adopt this assumption. Your primary database? It’s an ephemeral cache of your backups, not the source of truth. The backups are your real data—tested, verified, and recoverable.
This mental inversion changes everything:
- Backups become first-class systems requiring investment
- Testing becomes non-negotiable
- Recovery procedures get exercised regularly, not just during disasters
When you treat your production database as ephemeral, you build systems that can casually destroy and recreate it. And when you can casually destroy and recreate production, you know you’re truly protected.
Practical Next Steps
Here’s your action plan, regardless of where you’re starting:
This Week:
- Test your most recent backup. Actually restore it. Verify data integrity.
- Check your backup age monitoring. Do you alert if backups fail?
- Review retention policies. Do they meet your compliance requirements?
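The backup-age check in the list above can be automated. This sketch parses the JSON that `pgbackrest info --output=json` emits and flags stale backups; the embedded sample is abbreviated and the field layout is an assumption you should verify against your pgBackRest version:

```python
import json

MAX_AGE_HOURS = 26  # nightly backups: alert if the newest is over ~a day old

def newest_backup_age_hours(info_json: str, now: float) -> float:
    """Return the age in hours of the most recent backup in the first stanza."""
    stanzas = json.loads(info_json)
    backups = stanzas[0]["backup"]
    if not backups:
        return float("inf")  # no backups at all: treat as maximally stale
    latest_stop = max(b["timestamp"]["stop"] for b in backups)
    return (now - latest_stop) / 3600

# Abbreviated sample of `pgbackrest info --output=json` output (assumed
# layout -- check against your version before relying on it).
sample = json.dumps([{"name": "main",
                      "backup": [{"timestamp": {"start": 1700000000,
                                                "stop": 1700003600}}]}])

age = newest_backup_age_hours(sample, now=1700003600 + 10 * 3600)
print(f"newest backup is {age:.1f}h old")  # 10.0h
if age > MAX_AGE_HOURS:
    raise SystemExit("ALERT: backups are stale")
```

Wire the nonzero exit into whatever pages you (cron + alerting, Nagios, a Prometheus textfile collector) so a stale backup wakes someone up instead of waiting for a disaster to reveal it.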
This Month:
- Implement automated restore testing (use the script from earlier)
- Configure multiple backup repositories if you haven’t
- Document your recovery procedures in a runbook
This Quarter:
- Evaluate pgBackRest or Barman if you’re using custom scripts
- Implement continuous WAL archiving for PITR
- Run a disaster recovery drill with your team
This Year:
- Reach operational maturity level 3 or 4
- Build monitoring dashboards for backup metrics
- Integrate backup verification into your CI/CD pipeline
Start small. Every improvement counts. The gap between “we have backups” and “we have tested, verified, recoverable backups” is the gap between hope and confidence.
The Final Test
Here’s how you know if your backup system actually works:
Close your eyes and imagine your worst disaster scenario. The data center burns down. Ransomware encrypts everything. A disgruntled employee runs DROP DATABASE. Your cloud provider has a regional outage.
Now ask yourself: How confident are you that you’ll recover?
If the answer isn’t “completely confident, because we test this regularly,” you’ve got work to do.
But if you’ve built the systems described in this article—proper tooling, multiple repositories, automated testing, documented procedures—then you can answer with genuine confidence: “We’ll recover. It’ll be stressful, but we’ll recover.”
And that confidence, in the middle of a production disaster at 3 AM, is worth more than any technology investment you’ll ever make.
Resources Worth Your Time
Official Documentation:
- PostgreSQL Documentation: Backup and Restore
- pgBackRest Documentation
- Barman Documentation
Books:
- “PostgreSQL: Up and Running” by Regina Obe and Leo Hsu
- “The Art of PostgreSQL” by Dimitri Fontaine
- “Site Reliability Engineering” by Niall Murphy et al. (for disaster recovery context)
Tools Mentioned:
- pgBackRest: https://pgbackrest.org
- Barman: https://www.pgbarman.org
- WAL-G: https://github.com/wal-g/wal-g
- Patroni: https://github.com/zalando/patroni (for HA setups)
Community Resources:
- PostgreSQL Slack community
- pgsql-admin mailing list
- r/PostgreSQL
Remember: The best backup is the one that successfully restores when you need it. Everything else is just expensive storage.