Enhanced Minimal Downtime Upgrade & Zero Downtime Migration Strategy
PostgreSQL 11 β 15.9
π― Objective: Upgrade PostgreSQL from version 11 to 15.9 using a Blue-Green strategy with:
- β Zero downtime during data replication and testing
- β Minimal downtime during upgrade
- β Rollback strategy : promote original master if needed
- β Safety checks and monitoring
- β Connection management
π΅ Blue-Green Concept
- Blue = Current production server (PostgreSQL 11)
- Green = New target server (PostgreSQL 11 initially, PostgreSQL 15.9 after upgrade)
- Use streaming replication to keep Green synced with Blue
- Promote Green when ready
- Upgrade Green to PG15.9 after switchover
- If rollback is needed, re-promote Blue
β Enhanced Step-by-Step Plan
πΉ Phase 1: Prepare Green (New Server)
- Provision new hardware (Green)
- Install PostgreSQL 11 (same exact version as Blue)
- Match all configuration:
- postgresql.conf, pg_hba.conf, users, roles, extensions
- Create a replication user on Blue (e.g., replicator)
- Install PostgreSQL 15.9 binaries (for later upgrade)
- Set up connection management:
- Install PgBouncer or prepare DNS CNAME switching
- Document application connection string changes
πΉ Phase 2: Set Up Streaming Replication (PG11 β PG11)
On Blue (current master):
# postgresql.conf
wal_level = replica
max_wal_senders = 5
wal_keep_size = 8GB # Updated from wal_keep_segments for PG 11+
hot_standby = on
max_replication_slots = 5
# Create replication slot for better WAL retention
SELECT pg_create_physical_replication_slot('green_slot');
# pg_hba.conf
host replication replicator green_server_ip/32 md5
On Green (new server):
# Stop PostgreSQL
systemctl stop postgresql
# Clear data directory
rm -rf /var/lib/postgresql/11/main/*
# Perform base backup with replication slot
pg_basebackup -h blue_ip -D /var/lib/postgresql/11/main -U replicator -P -R -S green_slot
# Configure recovery settings in postgresql.conf
primary_slot_name = 'green_slot'
# Start PostgreSQL
systemctl start postgresql
β Now Green is a hot standby synced with Blue.
πΉ Phase 3: Initial Validation & Monitoring Setup
Monitor replication status:
-- On Blue: Check replication status
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
-- On Green: Check receiver status
SELECT status, receive_start_lsn, receive_start_tli, received_lsn, received_tli,
last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_time
FROM pg_stat_wal_receiver;
Basic validation:
- Perform read-only query tests on Green
- Validate user access and system compatibility
- Set up monitoring alerts for replication lag > 100MB
πΉ Phase 3.5: Pre-Switchover Validation
Critical safety checks before switchover:
- Dry-run upgrade check:
# Test pg_upgrade compatibility (don't actually upgrade yet)
sudo -u postgres /usr/lib/postgresql/15/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/11/main \
--new-datadir=/var/lib/postgresql/15/main \
--old-bindir=/usr/lib/postgresql/11/bin \
--new-bindir=/usr/lib/postgresql/15/bin \
--check
- Extension compatibility validation:
-- On Green: Check all extensions are PG15 compatible
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN (SELECT extname FROM pg_extension);
- Application connectivity test:
# Test read-only connection to Green
psql -h green_ip -U app_user -d app_db -c "SELECT version();"
- Replication lag verification:
-- Ensure lag is minimal (< 1MB) before switchover
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication WHERE application_name = 'green_slot';
- Prepare connection switchover:
- Stage application config changes
- Prepare PgBouncer config or DNS changes
- Document rollback connection procedures
πΉ Phase 4: Enhanced Manual Switchover to Green
Pre-switchover checklist:
- [ ] Replication lag < 1MB
- [ ] No replication errors in logs
- [ ] Application teams notified
- [ ] Rollback procedures documented
- [ ] Connection switching method prepared
- [ ] Connection termination procedures ready
Switchover procedure:
# 1. Set Blue to read-only mode (safer than stopping)
-- On Blue:
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();
# 2. Wait for Green to catch up completely
-- Monitor until lag = 0
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;
# 3. Promote Green server with verification
-- On Green:
pg_ctl promote -D /var/lib/postgresql/11/main
# 4. Verify promotion successful
-- On Green:
SELECT pg_is_in_recovery(); -- Should return false
# 5. Switch connections (choose one method):
# Option A: DNS CNAME switch
# Option B: PgBouncer config switch
# Option C: Application config deployment
# 6. Verify application functionality
-- Test critical application functions
-- Monitor error rates and performance
β Green is now the active master (PG11)
CRITICAL: Immediate Blue Sync Setup (Prevent Blue from becoming stale)
bash# Option 1: Set up reverse replication immediately (Recommended)
# This ensures Blue stays current for potential rollback
# On Green (new master), create replication user for Blue
CREATE
USER
blue_replicator REPLICATION LOGIN PASSWORD
'secure_password'
;
# Configure Green postgresql.conf for reverse replication
echo
"max_wal_senders = 6"
>>
/etc/postgresql/11/main/postgresql.conf
# Increase by 1
SELECT pg_reload_conf();
# Create replication slot for Blue
SELECT pg_create_physical_replication_slot(
'blue_sync_slot'
);
# Option 2: Alternative - Set up WAL archiving for Blue sync
# Configure if reverse replication is not preferred
echo
"archive_mode = on"
>>
/etc/postgresql/11/main/postgresql.conf
echo
"archive_command = 'rsync %p blue_server:/var/lib/postgresql/wal_archive/%f'"
>>
/etc/postgresql/11/main/postgresql.conf
SELECT pg_reload_conf();
πΉ Phase 5: Upgrade Green to PostgreSQL 15.9
CRITICAL: Connection Management Before Upgrade
# 1. Check for active connections (MANDATORY)
-- On Green: Verify no active connections before upgrade
SELECT pid, usename, application_name, client_addr, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();
# 2. If connections detected, terminate them safely
-- Option A: Graceful termination (preferred)
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();
-- Wait 30 seconds, then check again
-- Option B: Force termination if graceful fails
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();
# 3. Prevent new connections during upgrade
-- Update pg_hba.conf to reject new connections (backup original first)
cp /etc/postgresql/11/main/pg_hba.conf /etc/postgresql/11/main/pg_hba.conf.backup
echo "# UPGRADE MODE - NO CONNECTIONS ALLOWED" > /etc/postgresql/11/main/pg_hba.conf
echo "local all postgres peer" >> /etc/postgresql/11/main/pg_hba.conf
SELECT pg_reload_conf();
# 4. Final connection verification
SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid();
-- Should return 0 before proceeding
# 5. Stop PostgreSQL (this is the main downtime period)
systemctl stop postgresql
# 6. Run pg_upgrade (already validated in Phase 3.5)
sudo -u postgres /usr/lib/postgresql/15/bin/pg_upgrade \
--old-datadir=/var/lib/postgresql/11/main \
--new-datadir=/var/lib/postgresql/15/main \
--old-bindir=/usr/lib/postgresql/11/bin \
--new-bindir=/usr/lib/postgresql/15/bin \
--link # Use hard links for faster upgrade
# 7. Update systemd service to use PG15
systemctl stop postgresql@11-main
systemctl start postgresql@15-main
systemctl enable postgresql@15-main
# 8. Restore connection access
-- Restore original pg_hba.conf for PG15
cp /etc/postgresql/11/main/pg_hba.conf.backup /etc/postgresql/15/main/pg_hba.conf
SELECT pg_reload_conf();
# 9. Run post-upgrade scripts
sudo -u postgres ./analyze_new_cluster.sh
# 10. Test application connectivity and functionality
# 11. After validation: sudo -u postgres ./delete_old_cluster.sh
β± Downtime: ~5β10 minutes max
πΉ Phase 6: Enhanced Rollback Plan with Blue Data Sync
Option A: Immediate Rollback (if issues detected within 30 minutes)
bash# 1. Stop Green server
systemctl stop postgresql@15-main
# 2. Re-enable Blue server
-- On Blue:
ALTER SYSTEM RESET default_transaction_read_only;
SELECT pg_reload_conf();
# 3. Promote Blue (if needed)
pg_ctl promote -D /var/lib/postgresql/11/main
# 4. Switch connections back to Blue
# Use reverse of connection switching method from Phase 4
Option B: Blue Data Recovery (Prevent Blue from becoming stale)
B1: Continuous Blue Sync Strategy (Recommended)
bash# After Green promotion, immediately set up reverse replication
# This keeps Blue current with Green's data
# 1. On Green (new master PG15), create replication user
CREATE
USER
blue_replicator REPLICATION LOGIN PASSWORD
'secure_password'
;
# 2. Configure Green for replication
-- postgresql.conf on Green
wal_level
=
replica
max_wal_senders
=
5
wal_keep_size
=
8GB
max_replication_slots
=
5
# Create replication slot for Blue
SELECT pg_create_physical_replication_slot(
'blue_sync_slot'
);
# 3. Configure Blue as standby of Green
-- Stop Blue temporarily
systemctl stop postgresql@11-main
# Clear Blue data directory and sync from Green
rm
-rf /var/lib/postgresql/11/main/*
pg_basebackup -h green_ip -D /var/lib/postgresql/11/main -U blue_replicator -P -R -S blue_sync_slot
# Configure Blue's recovery settings
echo
"primary_slot_name = 'blue_sync_slot'"
>>
/var/lib/postgresql/11/main/postgresql.conf
# Start Blue as standby
systemctl start postgresql@11-main
B2: WAL Shipping Strategy (Alternative)
bash# Set up WAL archiving on Green to keep Blue synchronized
# 1. On Green, configure WAL archiving
-- postgresql.conf
archive_mode
=
on
archive_command
=
'rsync %p blue_server:/var/lib/postgresql/wal_archive/%f'
# 2. On Blue, configure WAL replay
-- recovery.conf or postgresql.conf (PG12+)
restore_command
=
'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_timeline
=
'latest'
Option C: Delayed Rollback with Data Recovery
C1: Point-in-Time Recovery Rollback
bash# If Blue became stale, use PITR to bring it current
# 1. Stop Green server
systemctl stop postgresql@15-main
# 2. Get the exact LSN where Green was promoted
-- Check Green
's promotion LSN from logs
grep "selected new timeline" /var/log/postgresql/postgresql-15-main.log
# 3. Perform PITR on Blue to catch up
-- On Blue: Configure recovery to specific point
echo "recovery_target_lsn = '
LSN_FROM_GREEN_PROMOTION
'" > /var/lib/postgresql/11/main/recovery.conf
echo "recovery_target_timeline = '
latest'"
>>
/var/lib/postgresql/11/main/recovery.conf
# 4. Start Blue with recovery
systemctl start postgresql@11-main
# 5. Promote Blue when recovery completes
pg_ctl promote -D /var/lib/postgresql/11/main
C2: Logical Replication Catch-up (For Large Data Gaps)
bash# Use logical replication to sync specific tables/changes
# 1. On Green, create publication for critical tables
CREATE PUBLICATION blue_catchup FOR ALL TABLES;
# 2. On Blue, create subscription
CREATE SUBSCRIPTION blue_sync
CONNECTION
'host=green_ip dbname=mydb user=replicator'
PUBLICATION blue_catchup;
# 3. Monitor synchronization
SELECT * FROM pg_stat_subscription;
# 4. When caught up, promote Blue
pg_ctl promote -D /var/lib/postgresql/11/main
Rollback Decision Matrix:
Time Since Switchover | Data Volume | Recommended Strategy |
---|---|---|
< 30 minutes | Any | Option A: Direct rollback |
30 min – 2 hours | Low-Medium | Option B1: Continuous sync |
30 min – 2 hours | High | Option B2: WAL shipping |
> 2 hours | Low | Option C2: Logical replication |
> 2 hours | High | Option C1: PITR recovery |
Enhanced Blue Sync Monitoring:
sql-- Monitor Blue sync status (when using continuous sync)
-- On Green:
SELECT
client_addr, application_name, state,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)
AS
lag_bytes
FROM
pg_stat_replication
WHERE
application_name
=
'blue_sync'
;
-- On Blue:
SELECT
status
, received_lsn, latest_end_lsn,
pg_wal_lsn_diff(latest_end_lsn, received_lsn)
AS
lag_bytes
FROM
pg_stat_wal_receiver;
πΉ Phase 7: Enhanced Monitoring & Validation
Post-upgrade monitoring (first 24 hours):
- Application error rates
- Query performance metrics
- Replication lag (if standby is re-established)
- Resource utilization
- Extension functionality
Performance validation:
-- Compare query performance before/after
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Check for any deprecated features
SELECT * FROM pg_stat_user_tables WHERE schemaname NOT IN ('information_schema', 'pg_catalog');
πΉ Phase 8: Finalization & Reuse
After confirming success:
- Clean up PostgreSQL 11 binaries on Green
- Optionally reconfigure Blue as:
- Hot standby (replica of PG15) – recommended
- Read-only server for reporting
- Backup/DR server
- Drop replication slot if not reusing:
SELECT pg_drop_replication_slot('green_slot');
β±οΈ Enhanced Downtime Summary
Step | Downtime | Risk Level |
---|---|---|
Streaming Replication Setup | 0 | Low |
Pre-switchover Validation | 0 | Low |
Promote Green | ~30 seconds | Medium |
Connection Switchover | ~1-2 minutes | Medium |
Connection Termination & Validation | ~2-3 minutes | Medium |
pg_upgrade Execution | ~5β10 minutes | High |
Total Planned Downtime | ~8-16 minutes | |
Rollback (if needed) | ~1β3 minutes | High |
π§ͺ Enhanced Key Commands Recap
Setup & Monitoring:
pg_basebackup -S slot_name
: Clone data with replication slotpg_create_physical_replication_slot()
: Better WAL retentionpg_stat_replication
: Monitor sync statuspg_wal_lsn_diff()
: Calculate replication lag- Blue sync commands: Reverse replication setup
- WAL archiving:
archive_mode
andrestore_command
- PITR recovery:
recovery_target_lsn
and timeline management
Validation & Safety:
pg_upgrade --check
: Dry-run validationpg_is_in_recovery()
: Verify promotion statusALTER SYSTEM SET default_transaction_read_only
: Safe read-only modepg_stat_activity
: Monitor active connectionspg_cancel_backend()
/pg_terminate_backend()
: Connection management
Upgrade & Control:
pg_ctl promote
: Manual promotionpg_upgrade --link
: Fast major version upgradeanalyze_new_cluster.sh
: Post-upgrade optimization
β Enhanced Benefits
- βοΈ Comprehensive safety checks before each critical step
- βοΈ Detailed monitoring and validation procedures
- βοΈ Improved connection management options
- βοΈ Enhanced rollback strategy with multiple scenarios
- βοΈ Blue data sync prevention – keeps Blue current for rollback
- βοΈ Multiple rollback options based on timing and data volume
- βοΈ Point-in-time recovery capabilities for delayed rollbacks
- βοΈ Logical replication for selective data synchronization
- βοΈ Replication slots for better WAL management
- βοΈ Read-only mode instead of stopping Blue (safer)
- βοΈ Pre-upgrade validation to catch issues early
- βοΈ Detailed runbooks to minimize human error
- βοΈ No external tools dependency (no HAProxy)
- βοΈ Full control over cutover and recovery process
π¨ Critical Success Factors
- Test the entire procedure in staging environment first
- Monitor replication lag continuously during setup
- Validate application compatibility with read-only Green before switchover
- Have clear communication plan with application teams
- Practice rollback procedures before go-live
- Set up proper monitoring for post-upgrade validation
π References & Documentation
Official PostgreSQL Documentation
- PostgreSQL 15 Release Notes: https://www.postgresql.org/docs/15/release-15.html
- pg_upgrade Documentation: https://www.postgresql.org/docs/15/pgupgrade.html
- Streaming Replication: https://www.postgresql.org/docs/15/warm-standby.html#STREAMING-REPLICATION
- pg_basebackup: https://www.postgresql.org/docs/15/app-pgbasebackup.html
- Replication Slots: https://www.postgresql.org/docs/15/warm-standby.html#STREAMING-REPLICATION-SLOTS
- Monitoring Statistics: https://www.postgresql.org/docs/15/monitoring-stats.html
PostgreSQL Upgrade Guides
- Major Version Upgrade Methods: https://www.postgresql.org/docs/15/upgrading.html
- Upgrade Best Practices: https://wiki.postgresql.org/wiki/Upgrading
- pg_upgrade Tutorial: https://www.postgresql.org/docs/15/pgupgrade.html
- Version Compatibility Matrix: https://www.postgresql.org/support/versioning/
High Availability & Replication
- High Availability Guide: https://www.postgresql.org/docs/15/high-availability.html
- Replication Configuration: https://www.postgresql.org/docs/15/runtime-config-replication.html
- WAL Configuration: https://www.postgresql.org/docs/15/runtime-config-wal.html
- Backup and Recovery: https://www.postgresql.org/docs/15/backup.html
PostgreSQL Configuration
- Configuration Parameters: https://www.postgresql.org/docs/15/runtime-config.html
- pg_hba.conf Documentation: https://www.postgresql.org/docs/15/auth-pg-hba-conf.html
- Client Authentication: https://www.postgresql.org/docs/15/client-authentication.html
Monitoring & Troubleshooting
- System Views: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-VIEWS
- pg_stat_replication: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-REPLICATION
- Activity Monitoring: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-ACTIVITY
- WAL Monitoring: https://www.postgresql.org/docs/15/functions-info.html#FUNCTIONS-INFO-MISC
Tools & Utilities
- PostgreSQL Utilities: https://www.postgresql.org/docs/15/reference-server.html
- pg_ctl Documentation: https://www.postgresql.org/docs/15/app-pg-ctl.html
- PostgreSQL System Catalogs: https://www.postgresql.org/docs/15/catalogs.html
Best Practices & Community Guides
- PostgreSQL Wiki: https://wiki.postgresql.org/wiki/Main_Page
- Upgrade Checklist: https://wiki.postgresql.org/wiki/Upgrade_Checklist
- Blue-Green Deployment Patterns: https://martinfowler.com/bliki/BlueGreenDeployment.html
- Database Reliability Engineering: https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/
Community Resources
- PostgreSQL Mailing Lists: https://www.postgresql.org/list/
- PostgreSQL Planet (Blogs): https://planet.postgresql.org/
- PostgreSQL Slack Community: https://postgres-slack.herokuapp.com/
- Stack Overflow PostgreSQL: https://stackoverflow.com/questions/tagged/postgresql
Vendor-Specific Guides
- AWS RDS PostgreSQL Upgrades: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html
- Google Cloud SQL Upgrades: https://cloud.google.com/sql/docs/postgres/upgrade-major-db-version
- Azure Database Upgrades: https://docs.microsoft.com/en-us/azure/postgresql/how-to-upgrade-using-dump-and-restore
Benchmarking & Performance
- pgbench Documentation: https://www.postgresql.org/docs/15/pgbench.html
- PostgreSQL Performance Tuning: https://wiki.postgresql.org/wiki/Performance_Optimization
- Query Performance: https://www.postgresql.org/docs/15/using-explain.html
Security & Compliance
- PostgreSQL Security: https://www.postgresql.org/docs/15/security.html
- SSL Configuration: https://www.postgresql.org/docs/15/ssl-tcp.html
- User Management: https://www.postgresql.org/docs/15/user-manag.html
π Additional Reading
Books
- “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu
- “PostgreSQL High Availability Cookbook” by Shaun Thomas
- “Mastering PostgreSQL” by Hans-JΓΌrgen SchΓΆnig
Blogs & Articles
- 2ndQuadrant Blog: https://www.2ndquadrant.com/en/blog/
- Percona PostgreSQL Blog: https://www.percona.com/blog/category/postgresql/
- EDB Blog: https://www.enterprisedb.com/blog
- Citus Data Blog: https://www.citusdata.com/blog/
Training & Certification
- PostgreSQL Training: https://www.postgresql.org/support/professional_training/
- EDB Training: https://www.enterprisedb.com/training
- Postgres Professional: https://postgrespro.com/education
π οΈ Tools & Scripts Repository
Useful Scripts
# Replication monitoring script
#!/bin/bash
# replication_monitor.sh
while true; do
psql -h primary_ip -c "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag FROM pg_stat_replication;"
sleep 10
done
# Connection checker script
#!/bin/bash
# check_connections.sh
psql -t -c "SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid();"
Monitoring Queries Collection
-- Save as monitoring_queries.sql
-- Active connections
SELECT pid, usename, application_name, client_addr, state, query_start
FROM pg_stat_activity WHERE state = 'active';
-- Replication lag
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
-- Database sizes
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database ORDER BY pg_database_size(datname) DESC;
π Support & Community
Getting Help
- PostgreSQL IRC: #postgresql on Freenode
- PostgreSQL Forums: https://www.postgresql.org/list/
- Reddit: https://www.reddit.com/r/PostgreSQL/
- Discord: PostgreSQL Community Server
Professional Support
- PostgreSQL Commercial Support: https://www.postgresql.org/support/professional_support/
- EDB Support: https://www.enterprisedb.com/support
- 2ndQuadrant Support: https://www.2ndquadrant.com/support/
Leave a Reply