Enhanced Minimal Downtime Upgrade & Zero Downtime Migration Strategy
PostgreSQL 11 β 15.9
π― Objective: Upgrade PostgreSQL from version 11 to 15.9 using a Blue-Green strategy with:
- β Zero downtime during data replication and testing
- β Minimal downtime during upgrade
- β Rollback strategy : promote original master if needed
- β Safety checks and monitoring
- β Connection management
π΅ Blue-Green Concept
- Blue = Current production server (PostgreSQL 11)
- Green = New target server (PostgreSQL 11 initially, PostgreSQL 15.9 after upgrade)
- Use streaming replication to keep Green synced with Blue
- Promote Green when ready
- Upgrade Green to PG15.9 after switchover
- If rollback is needed, re-promote Blue
β Enhanced Step-by-Step Plan
πΉ Phase 1: Prepare Green (New Server)
- Provision new hardware (Green)
- Install PostgreSQL 11 (same exact version as Blue)
- Match all configuration:
- postgresql.conf, pg_hba.conf, users, roles, extensions
- Create a replication user on Blue (e.g., replicator)
- Install PostgreSQL 15.9 binaries (for later upgrade)
- Set up connection management:
- Install PgBouncer or prepare DNS CNAME switching
- Document application connection string changes
πΉ Phase 2: Set Up Streaming Replication (PG11 β PG11)
On Blue (current master):
# postgresql.confwal_level = replicamax_wal_senders = 5wal_keep_size = 8GB # Updated from wal_keep_segments for PG 11+hot_standby = onmax_replication_slots = 5# Create replication slot for better WAL retentionSELECT pg_create_physical_replication_slot('green_slot');# pg_hba.confhost replication replicator green_server_ip/32 md5
On Green (new server):
# Stop PostgreSQLsystemctl stop postgresql# Clear data directoryrm -rf /var/lib/postgresql/11/main/*# Perform base backup with replication slotpg_basebackup -h blue_ip -D /var/lib/postgresql/11/main -U replicator -P -R -S green_slot# Configure recovery settings in postgresql.confprimary_slot_name = 'green_slot'# Start PostgreSQLsystemctl start postgresql
β Now Green is a hot standby synced with Blue.
πΉ Phase 3: Initial Validation & Monitoring Setup
Monitor replication status:
-- On Blue: Check replication statusSELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytesFROM pg_stat_replication;-- On Green: Check receiver statusSELECT status, receive_start_lsn, receive_start_tli, received_lsn, received_tli,last_msg_send_time, last_msg_receipt_time, latest_end_lsn, latest_end_timeFROM pg_stat_wal_receiver;
Basic validation:
- Perform read-only query tests on Green
- Validate user access and system compatibility
- Set up monitoring alerts for replication lag > 100MB
πΉ Phase 3.5: Pre-Switchover Validation
Critical safety checks before switchover:
- Dry-run upgrade check:
# Test pg_upgrade compatibility (don't actually upgrade yet)sudo -u postgres /usr/lib/postgresql/15/bin/pg_upgrade \--old-datadir=/var/lib/postgresql/11/main \--new-datadir=/var/lib/postgresql/15/main \--old-bindir=/usr/lib/postgresql/11/bin \--new-bindir=/usr/lib/postgresql/15/bin \--check
- Extension compatibility validation:
-- On Green: Check all extensions are PG15 compatibleSELECT name, default_version, installed_versionFROM pg_available_extensionsWHERE name IN (SELECT extname FROM pg_extension);
- Application connectivity test:
# Test read-only connection to Greenpsql -h green_ip -U app_user -d app_db -c "SELECT version();"
- Replication lag verification:
-- Ensure lag is minimal (< 1MB) before switchoverSELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytesFROM pg_stat_replication WHERE application_name = 'green_slot';
- Prepare connection switchover:
- Stage application config changes
- Prepare PgBouncer config or DNS changes
- Document rollback connection procedures
πΉ Phase 4: Enhanced Manual Switchover to Green
Pre-switchover checklist:
- [ ] Replication lag < 1MB
- [ ] No replication errors in logs
- [ ] Application teams notified
- [ ] Rollback procedures documented
- [ ] Connection switching method prepared
- [ ] Connection termination procedures ready
Switchover procedure:
# 1. Set Blue to read-only mode (safer than stopping)-- On Blue:ALTER SYSTEM SET default_transaction_read_only = on;SELECT pg_reload_conf();# 2. Wait for Green to catch up completely-- Monitor until lag = 0SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;# 3. Promote Green server with verification-- On Green:pg_ctl promote -D /var/lib/postgresql/11/main# 4. Verify promotion successful-- On Green:SELECT pg_is_in_recovery(); -- Should return false# 5. Switch connections (choose one method):# Option A: DNS CNAME switch# Option B: PgBouncer config switch# Option C: Application config deployment# 6. Verify application functionality-- Test critical application functions-- Monitor error rates and performance
β Green is now the active master (PG11)
CRITICAL: Immediate Blue Sync Setup (Prevent Blue from becoming stale)
bash# Option 1: Set up reverse replication immediately (Recommended)# This ensures Blue stays current for potential rollback# On Green (new master), create replication user for BlueCREATEUSERblue_replicator REPLICATION LOGIN PASSWORD'secure_password';# Configure Green postgresql.conf for reverse replicationecho"max_wal_senders = 6">>/etc/postgresql/11/main/postgresql.conf# Increase by 1SELECT pg_reload_conf();# Create replication slot for BlueSELECT pg_create_physical_replication_slot('blue_sync_slot');# Option 2: Alternative - Set up WAL archiving for Blue sync# Configure if reverse replication is not preferredecho"archive_mode = on">>/etc/postgresql/11/main/postgresql.confecho"archive_command = 'rsync %p blue_server:/var/lib/postgresql/wal_archive/%f'">>/etc/postgresql/11/main/postgresql.confSELECT pg_reload_conf();
πΉ Phase 5: Upgrade Green to PostgreSQL 15.9
CRITICAL: Connection Management Before Upgrade
# 1. Check for active connections (MANDATORY)-- On Green: Verify no active connections before upgradeSELECT pid, usename, application_name, client_addr, state, query_start, queryFROM pg_stat_activityWHERE state = 'active' AND pid <> pg_backend_pid();# 2. If connections detected, terminate them safely-- Option A: Graceful termination (preferred)SELECT pg_cancel_backend(pid)FROM pg_stat_activityWHERE state = 'active' AND pid <> pg_backend_pid();-- Wait 30 seconds, then check again-- Option B: Force termination if graceful failsSELECT pg_terminate_backend(pid)FROM pg_stat_activityWHERE state = 'active' AND pid <> pg_backend_pid();# 3. Prevent new connections during upgrade-- Update pg_hba.conf to reject new connections (backup original first)cp /etc/postgresql/11/main/pg_hba.conf /etc/postgresql/11/main/pg_hba.conf.backupecho "# UPGRADE MODE - NO CONNECTIONS ALLOWED" > /etc/postgresql/11/main/pg_hba.confecho "local all postgres peer" >> /etc/postgresql/11/main/pg_hba.confSELECT pg_reload_conf();# 4. Final connection verificationSELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid();-- Should return 0 before proceeding# 5. Stop PostgreSQL (this is the main downtime period)systemctl stop postgresql# 6. Run pg_upgrade (already validated in Phase 3.5)sudo -u postgres /usr/lib/postgresql/15/bin/pg_upgrade \--old-datadir=/var/lib/postgresql/11/main \--new-datadir=/var/lib/postgresql/15/main \--old-bindir=/usr/lib/postgresql/11/bin \--new-bindir=/usr/lib/postgresql/15/bin \--link # Use hard links for faster upgrade# 7. Update systemd service to use PG15systemctl stop postgresql@11-mainsystemctl start postgresql@15-mainsystemctl enable postgresql@15-main# 8. Restore connection access-- Restore original pg_hba.conf for PG15cp /etc/postgresql/11/main/pg_hba.conf.backup /etc/postgresql/15/main/pg_hba.confSELECT pg_reload_conf();# 9. Run post-upgrade scriptssudo -u postgres ./analyze_new_cluster.sh# 10. Test application connectivity and functionality# 11. After validation: sudo -u postgres ./delete_old_cluster.sh
β± Downtime: ~5β10 minutes max
πΉ Phase 6: Enhanced Rollback Plan with Blue Data Sync
Option A: Immediate Rollback (if issues detected within 30 minutes)
bash# 1. Stop Green serversystemctl stop postgresql@15-main# 2. Re-enable Blue server-- On Blue:ALTER SYSTEM RESET default_transaction_read_only;SELECT pg_reload_conf();# 3. Promote Blue (if needed)pg_ctl promote -D /var/lib/postgresql/11/main# 4. Switch connections back to Blue# Use reverse of connection switching method from Phase 4
Option B: Blue Data Recovery (Prevent Blue from becoming stale)
B1: Continuous Blue Sync Strategy (Recommended)
bash# After Green promotion, immediately set up reverse replication# This keeps Blue current with Green's data# 1. On Green (new master PG15), create replication userCREATEUSERblue_replicator REPLICATION LOGIN PASSWORD'secure_password';# 2. Configure Green for replication-- postgresql.conf on Greenwal_level=replicamax_wal_senders=5wal_keep_size=8GBmax_replication_slots=5# Create replication slot for BlueSELECT pg_create_physical_replication_slot('blue_sync_slot');# 3. Configure Blue as standby of Green-- Stop Blue temporarilysystemctl stop postgresql@11-main# Clear Blue data directory and sync from Greenrm-rf /var/lib/postgresql/11/main/*pg_basebackup -h green_ip -D /var/lib/postgresql/11/main -U blue_replicator -P -R -S blue_sync_slot# Configure Blue's recovery settingsecho"primary_slot_name = 'blue_sync_slot'">>/var/lib/postgresql/11/main/postgresql.conf# Start Blue as standbysystemctl start postgresql@11-main
B2: WAL Shipping Strategy (Alternative)
bash# Set up WAL archiving on Green to keep Blue synchronized# 1. On Green, configure WAL archiving-- postgresql.confarchive_mode=onarchive_command='rsync %p blue_server:/var/lib/postgresql/wal_archive/%f'# 2. On Blue, configure WAL replay-- recovery.conf or postgresql.conf (PG12+)restore_command='cp /var/lib/postgresql/wal_archive/%f %p'recovery_target_timeline='latest'
Option C: Delayed Rollback with Data Recovery
C1: Point-in-Time Recovery Rollback
bash# If Blue became stale, use PITR to bring it current# 1. Stop Green serversystemctl stop postgresql@15-main# 2. Get the exact LSN where Green was promoted-- Check Green's promotion LSN from logsgrep "selected new timeline" /var/log/postgresql/postgresql-15-main.log# 3. Perform PITR on Blue to catch up-- On Blue: Configure recovery to specific pointecho "recovery_target_lsn = 'LSN_FROM_GREEN_PROMOTION'" > /var/lib/postgresql/11/main/recovery.confecho "recovery_target_timeline = 'latest'">>/var/lib/postgresql/11/main/recovery.conf# 4. Start Blue with recoverysystemctl start postgresql@11-main# 5. Promote Blue when recovery completespg_ctl promote -D /var/lib/postgresql/11/main
C2: Logical Replication Catch-up (For Large Data Gaps)
bash# Use logical replication to sync specific tables/changes# 1. On Green, create publication for critical tablesCREATE PUBLICATION blue_catchup FOR ALL TABLES;# 2. On Blue, create subscriptionCREATE SUBSCRIPTION blue_syncCONNECTION'host=green_ip dbname=mydb user=replicator'PUBLICATION blue_catchup;# 3. Monitor synchronizationSELECT * FROM pg_stat_subscription;# 4. When caught up, promote Bluepg_ctl promote -D /var/lib/postgresql/11/main
Rollback Decision Matrix:
| Time Since Switchover | Data Volume | Recommended Strategy |
|---|---|---|
| < 30 minutes | Any | Option A: Direct rollback |
| 30 min – 2 hours | Low-Medium | Option B1: Continuous sync |
| 30 min – 2 hours | High | Option B2: WAL shipping |
| > 2 hours | Low | Option C2: Logical replication |
| > 2 hours | High | Option C1: PITR recovery |
Enhanced Blue Sync Monitoring:
sql-- Monitor Blue sync status (when using continuous sync)-- On Green:SELECTclient_addr, application_name, state,pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)ASlag_bytesFROMpg_stat_replicationWHEREapplication_name='blue_sync';-- On Blue:SELECTstatus, received_lsn, latest_end_lsn,pg_wal_lsn_diff(latest_end_lsn, received_lsn)ASlag_bytesFROMpg_stat_wal_receiver;
πΉ Phase 7: Enhanced Monitoring & Validation
Post-upgrade monitoring (first 24 hours):
- Application error rates
- Query performance metrics
- Replication lag (if standby is re-established)
- Resource utilization
- Extension functionality
Performance validation:
-- Compare query performance before/afterEXPLAIN (ANALYZE, BUFFERS) SELECT ...;-- Check for any deprecated featuresSELECT * FROM pg_stat_user_tables WHERE schemaname NOT IN ('information_schema', 'pg_catalog');
πΉ Phase 8: Finalization & Reuse
After confirming success:
- Clean up PostgreSQL 11 binaries on Green
- Optionally reconfigure Blue as:
- Hot standby (replica of PG15) – recommended
- Read-only server for reporting
- Backup/DR server
- Drop replication slot if not reusing:
SELECT pg_drop_replication_slot('green_slot');
β±οΈ Enhanced Downtime Summary
| Step | Downtime | Risk Level |
|---|---|---|
| Streaming Replication Setup | 0 | Low |
| Pre-switchover Validation | 0 | Low |
| Promote Green | ~30 seconds | Medium |
| Connection Switchover | ~1-2 minutes | Medium |
| Connection Termination & Validation | ~2-3 minutes | Medium |
| pg_upgrade Execution | ~5β10 minutes | High |
| Total Planned Downtime | ~8-16 minutes | |
| Rollback (if needed) | ~1β3 minutes | High |
π§ͺ Enhanced Key Commands Recap
Setup & Monitoring:
pg_basebackup -S slot_name: Clone data with replication slotpg_create_physical_replication_slot(): Better WAL retentionpg_stat_replication: Monitor sync statuspg_wal_lsn_diff(): Calculate replication lag- Blue sync commands: Reverse replication setup
- WAL archiving:
archive_modeandrestore_command - PITR recovery:
recovery_target_lsnand timeline management
Validation & Safety:
pg_upgrade --check: Dry-run validationpg_is_in_recovery(): Verify promotion statusALTER SYSTEM SET default_transaction_read_only: Safe read-only modepg_stat_activity: Monitor active connectionspg_cancel_backend()/pg_terminate_backend(): Connection management
Upgrade & Control:
pg_ctl promote: Manual promotionpg_upgrade --link: Fast major version upgradeanalyze_new_cluster.sh: Post-upgrade optimization
β Enhanced Benefits
- βοΈ Comprehensive safety checks before each critical step
- βοΈ Detailed monitoring and validation procedures
- βοΈ Improved connection management options
- βοΈ Enhanced rollback strategy with multiple scenarios
- βοΈ Blue data sync prevention – keeps Blue current for rollback
- βοΈ Multiple rollback options based on timing and data volume
- βοΈ Point-in-time recovery capabilities for delayed rollbacks
- βοΈ Logical replication for selective data synchronization
- βοΈ Replication slots for better WAL management
- βοΈ Read-only mode instead of stopping Blue (safer)
- βοΈ Pre-upgrade validation to catch issues early
- βοΈ Detailed runbooks to minimize human error
- βοΈ No external tools dependency (no HAProxy)
- βοΈ Full control over cutover and recovery process
π¨ Critical Success Factors
- Test the entire procedure in staging environment first
- Monitor replication lag continuously during setup
- Validate application compatibility with read-only Green before switchover
- Have clear communication plan with application teams
- Practice rollback procedures before go-live
- Set up proper monitoring for post-upgrade validation
π References & Documentation
Official PostgreSQL Documentation
- PostgreSQL 15 Release Notes: https://www.postgresql.org/docs/15/release-15.html
- pg_upgrade Documentation: https://www.postgresql.org/docs/15/pgupgrade.html
- Streaming Replication: https://www.postgresql.org/docs/15/warm-standby.html#STREAMING-REPLICATION
- pg_basebackup: https://www.postgresql.org/docs/15/app-pgbasebackup.html
- Replication Slots: https://www.postgresql.org/docs/15/warm-standby.html#STREAMING-REPLICATION-SLOTS
- Monitoring Statistics: https://www.postgresql.org/docs/15/monitoring-stats.html
PostgreSQL Upgrade Guides
- Major Version Upgrade Methods: https://www.postgresql.org/docs/15/upgrading.html
- Upgrade Best Practices: https://wiki.postgresql.org/wiki/Upgrading
- pg_upgrade Tutorial: https://www.postgresql.org/docs/15/pgupgrade.html
- Version Compatibility Matrix: https://www.postgresql.org/support/versioning/
High Availability & Replication
- High Availability Guide: https://www.postgresql.org/docs/15/high-availability.html
- Replication Configuration: https://www.postgresql.org/docs/15/runtime-config-replication.html
- WAL Configuration: https://www.postgresql.org/docs/15/runtime-config-wal.html
- Backup and Recovery: https://www.postgresql.org/docs/15/backup.html
PostgreSQL Configuration
- Configuration Parameters: https://www.postgresql.org/docs/15/runtime-config.html
- pg_hba.conf Documentation: https://www.postgresql.org/docs/15/auth-pg-hba-conf.html
- Client Authentication: https://www.postgresql.org/docs/15/client-authentication.html
Monitoring & Troubleshooting
- System Views: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-VIEWS
- pg_stat_replication: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-REPLICATION
- Activity Monitoring: https://www.postgresql.org/docs/15/monitoring-stats.html#MONITORING-STATS-ACTIVITY
- WAL Monitoring: https://www.postgresql.org/docs/15/functions-info.html#FUNCTIONS-INFO-MISC
Tools & Utilities
- PostgreSQL Utilities: https://www.postgresql.org/docs/15/reference-server.html
- pg_ctl Documentation: https://www.postgresql.org/docs/15/app-pg-ctl.html
- PostgreSQL System Catalogs: https://www.postgresql.org/docs/15/catalogs.html
Best Practices & Community Guides
- PostgreSQL Wiki: https://wiki.postgresql.org/wiki/Main_Page
- Upgrade Checklist: https://wiki.postgresql.org/wiki/Upgrade_Checklist
- Blue-Green Deployment Patterns: https://martinfowler.com/bliki/BlueGreenDeployment.html
- Database Reliability Engineering: https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/
Community Resources
- PostgreSQL Mailing Lists: https://www.postgresql.org/list/
- PostgreSQL Planet (Blogs): https://planet.postgresql.org/
- PostgreSQL Slack Community: https://postgres-slack.herokuapp.com/
- Stack Overflow PostgreSQL: https://stackoverflow.com/questions/tagged/postgresql
Vendor-Specific Guides
- AWS RDS PostgreSQL Upgrades: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.PostgreSQL.html
- Google Cloud SQL Upgrades: https://cloud.google.com/sql/docs/postgres/upgrade-major-db-version
- Azure Database Upgrades: https://docs.microsoft.com/en-us/azure/postgresql/how-to-upgrade-using-dump-and-restore
Benchmarking & Performance
- pgbench Documentation: https://www.postgresql.org/docs/15/pgbench.html
- PostgreSQL Performance Tuning: https://wiki.postgresql.org/wiki/Performance_Optimization
- Query Performance: https://www.postgresql.org/docs/15/using-explain.html
Security & Compliance
- PostgreSQL Security: https://www.postgresql.org/docs/15/security.html
- SSL Configuration: https://www.postgresql.org/docs/15/ssl-tcp.html
- User Management: https://www.postgresql.org/docs/15/user-manag.html
π Additional Reading
Books
- “PostgreSQL: Up and Running” by Regina Obe & Leo Hsu
- “PostgreSQL High Availability Cookbook” by Shaun Thomas
- “Mastering PostgreSQL” by Hans-JΓΌrgen SchΓΆnig
Blogs & Articles
- 2ndQuadrant Blog: https://www.2ndquadrant.com/en/blog/
- Percona PostgreSQL Blog: https://www.percona.com/blog/category/postgresql/
- EDB Blog: https://www.enterprisedb.com/blog
- Citus Data Blog: https://www.citusdata.com/blog/
Training & Certification
- PostgreSQL Training: https://www.postgresql.org/support/professional_training/
- EDB Training: https://www.enterprisedb.com/training
- Postgres Professional: https://postgrespro.com/education
π οΈ Tools & Scripts Repository
Useful Scripts
# Replication monitoring script#!/bin/bash# replication_monitor.shwhile true; dopsql -h primary_ip -c "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag FROM pg_stat_replication;"sleep 10done# Connection checker script#!/bin/bash# check_connections.shpsql -t -c "SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid();"
Monitoring Queries Collection
-- Save as monitoring_queries.sql-- Active connectionsSELECT pid, usename, application_name, client_addr, state, query_startFROM pg_stat_activity WHERE state = 'active';-- Replication lagSELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytesFROM pg_stat_replication;-- Database sizesSELECT datname, pg_size_pretty(pg_database_size(datname)) AS sizeFROM pg_database ORDER BY pg_database_size(datname) DESC;
π Support & Community
Getting Help
- PostgreSQL IRC: #postgresql on Freenode
- PostgreSQL Forums: https://www.postgresql.org/list/
- Reddit: https://www.reddit.com/r/PostgreSQL/
- Discord: PostgreSQL Community Server
Professional Support
- PostgreSQL Commercial Support: https://www.postgresql.org/support/professional_support/
- EDB Support: https://www.enterprisedb.com/support
- 2ndQuadrant Support: https://www.2ndquadrant.com/support/
Leave a Reply