PostgreSQL WAL Archiving — Mechanism, Implementation, Best Practice, and Real-World Case Studies

Here is a story. A database crashes because its hard drive fails. On inspection, the last backup ran at midnight. In the window between that backup and the crash, the system processed 10,000 customer orders, 50,000 product updates, and hundreds of thousands of user interactions. All of that data is gone. The phone starts ringing.

This nightmare scenario happens more often than you’d think, and it illustrates a fundamental truth about database management: backups alone are not enough. Traditional backups create snapshots at fixed points in time—midnight, noon, whenever you schedule them. But databases are living, breathing systems that change every second. The gap between backups is a vulnerability, a window of potential data loss that can sink businesses, violate regulations, and destroy customer trust.

PostgreSQL’s designers understood this problem deeply, which is why they built one of the most elegant solutions in database engineering: Write-Ahead Logging combined with continuous archiving. This system doesn’t just protect your data—it creates a time machine that lets you recover to any precise moment, down to the second.

Understanding PostgreSQL’s Durability Design

When you commit a transaction in PostgreSQL, something fascinating happens under the hood. The database doesn’t immediately scramble to update all the relevant data files scattered across your disk. That would be slow, requiring random disk access and complex coordination. Instead, PostgreSQL writes a compact description of what changed to a sequential log file called the Write-Ahead Log, or WAL. This write is fast—a single sequential disk operation—and once it’s complete, PostgreSQL considers your transaction durable and tells you it’s committed.

The actual data files get updated later, in the background, when it’s convenient. This might seem risky—what if the server crashes before those updates happen? That’s where WAL’s genius shines. When PostgreSQL restarts after a crash, it reads the WAL and replays all the committed changes that hadn’t yet made it to the data files. Your data is reconstructed perfectly, as if nothing happened.

This design gives PostgreSQL its remarkable combination of speed and reliability. But here’s where it gets even better: what if you could save that WAL somewhere safe? Not just keep it on the database server, but archive it continuously to remote storage? Then you wouldn’t just have crash recovery—you’d have a complete history of every change ever made to your database.

How WAL Archiving Enables Point-in-Time Recovery

WAL archiving takes PostgreSQL’s internal durability mechanism and extends it outward. As the database fills each WAL segment file (typically 16MB), it copies that completed segment to an archive location you specify. This archive could be a network share, cloud storage, or a dedicated backup server. The key insight is that each WAL segment is like a chapter in your database’s autobiography, recording everything that happened during a specific period.

Now imagine you have a full backup from Sunday night, plus all the WAL segments generated since then. If disaster strikes on Wednesday afternoon, you can restore that Sunday backup and then replay the archived WAL segments, watching your database reconstruct itself day by day, hour by hour, minute by minute. You can stop at any moment—say, two seconds before someone accidentally ran DELETE FROM customers WHERE true. This capability, called Point-in-Time Recovery or PITR, transforms how you think about data protection.

When Organizations Must Implement WAL Archiving

Some organizations discover they need WAL archiving the hard way. A financial services company loses six hours of trading data and faces regulatory penalties. An e-commerce platform can’t recover today’s orders after a storage failure. A SaaS company violates its 99.9% uptime SLA because it can’t restore recent customer data.

But forward-thinking organizations implement WAL archiving before disaster strikes. Financial institutions know that regulators expect complete transaction histories and the ability to restore to specific timestamps for auditing. E-commerce platforms understand that every lost order is both lost revenue and damaged customer trust. SaaS companies recognize that their entire business depends on convincing customers that their data is safe.

The question isn’t whether you need WAL archiving—it’s whether you can afford not to have it. If your database contains anything valuable (and why else would you have a database?), you need the protection that only continuous archiving provides.

1. Understanding WAL: The Foundation

Before we dive into archiving, we need to understand what we’re actually archiving. WAL isn’t just a technical detail—it’s the heart of PostgreSQL’s reliability, and understanding how it works will help you implement archiving correctly.

What WAL Really Means

Write-Ahead Logging sounds abstract, but the concept is simple and powerful. Consider keeping a diary of everything done each day. Instead of immediately reorganizing an entire filing cabinet after each activity, just write down what happened. Later, when there’s time, the filing cabinet can be updated based on diary entries. If something interrupts before finishing the cabinet update, no problem—the diary remains and work can resume where it left off.

PostgreSQL’s WAL is exactly this diary. Every change to the database—inserts, updates, deletes, even schema changes—gets recorded in the WAL before anything else happens. This sequential record serves three vital purposes that interlock to create PostgreSQL’s famous reliability.

First, WAL provides durability. When PostgreSQL tells you a transaction committed, that commitment is physically written to the WAL on disk. Even if the power fails one microsecond later, that transaction is safe. Second, WAL enables crash recovery. After any crash, PostgreSQL simply replays the WAL from the last checkpoint forward, perfectly reconstructing the database state. Third, WAL maintains consistency. Because all changes go through the WAL, PostgreSQL can ensure that partial transactions—those that didn’t complete before a crash—are never visible.

The Dance Between WAL and Checkpoints

Understanding checkpoints is crucial because they’re the counterpart to WAL in PostgreSQL’s durability strategy. While WAL records changes sequentially, checkpoints flush accumulated changes from memory to the actual data files.

PostgreSQL keeps recently-used data pages in shared memory buffers for fast access. When data is modified, PostgreSQL updates these in-memory copies, creating what are called “dirty buffers”—pages that differ from what’s on disk. Writing these scattered pages to disk randomly would be slow, so PostgreSQL delays the writes and batches them during checkpoints.

Here’s where the magic happens. Transactions write to WAL continuously, creating a fast sequential stream of changes. Periodically—every few minutes, controlled by configuration—PostgreSQL triggers a checkpoint. During the checkpoint, all those accumulated dirty buffers get written to the data files. Once a checkpoint completes successfully, PostgreSQL knows that all changes up to a certain point are safely on disk, which means older WAL segments are no longer needed for crash recovery.

This choreography between WAL and checkpoints gives PostgreSQL its characteristic performance. Commits are fast because they only require sequential WAL writes. The heavier random I/O of updating data files happens in the background during checkpoints, spread out over time to avoid performance spikes.
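
The knobs that control this cadence live in postgresql.conf. A sketch with common starting points (the values shown are the stock defaults, illustrative rather than tuning advice):

```
# postgresql.conf - checkpoint cadence (illustrative defaults)
checkpoint_timeout = 5min            # force a checkpoint at least this often
max_wal_size = 1GB                   # WAL volume that triggers an earlier checkpoint
checkpoint_completion_target = 0.9   # spread checkpoint I/O across the interval
```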

Inside a WAL Segment

WAL segments are physical files stored in PostgreSQL’s pg_wal directory (called pg_xlog before version 10). Each segment is 16MB by default, though this can be changed with initdb’s --wal-segsize option (before PostgreSQL 11, only at compile time). Why 16MB? It’s a balance: large enough to reduce the overhead of file operations, small enough to archive quickly.

The filenames look cryptic at first glance: 000000010000000000000042. But they encode important information in hexadecimal. The first eight characters represent the timeline ID, a concept we’ll explore deeply when discussing recovery. The next eight characters identify the logical log file number. The final eight characters specify the segment number within that logical file.

This naming scheme ensures segments are naturally sorted in chronological order. It also makes it immediately obvious which timeline a segment belongs to—critical information when you’re performing point-in-time recovery and need to ensure you’re following the correct history branch.
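
To make the encoding concrete, here is a small shell sketch that splits a sample filename into its three fields (the helper is purely illustrative; PostgreSQL only ever consumes the name as a whole):

```shell
# Decode a WAL segment filename into its three hex-encoded fields
SEG=000000010000000000000042
TLI=$(printf '%s' "$SEG" | cut -c1-8)    # timeline ID
LOG=$(printf '%s' "$SEG" | cut -c9-16)   # logical log file number
SEQ=$(printf '%s' "$SEG" | cut -c17-24)  # segment within that log file
printf 'timeline=%d log=%d segment=%d\n' "0x$TLI" "0x$LOG" "0x$SEQ"
# → timeline=1 log=0 segment=66
```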

Timeline IDs deserve special attention because they’re central to PostgreSQL’s recovery model. Every time you perform PITR and recover to a point in the past, PostgreSQL increments the timeline ID. This creates a fork in your database’s history. The old timeline continues to exist in your archives, but your database has branched off onto a new timeline. This prevents the catastrophic scenario where WAL from the old timeline accidentally gets mixed with the new one, which would corrupt your database.

2. WAL Archiving: Extending Durability Beyond the Server

Now that we understand WAL’s role in crash recovery, we can appreciate how archiving extends this protection across space and time.

What Archiving Actually Does

WAL archiving is conceptually straightforward: instead of letting PostgreSQL recycle WAL segments after checkpoints, we copy them somewhere safe first. PostgreSQL generates WAL segments continuously as transactions run. When a segment fills up (or when certain other conditions are met), PostgreSQL executes a command you specify—the archive_command—to copy that segment to your archive location. Only after your command reports success does PostgreSQL consider that segment safely archived and available for recycling.

This continuous stream of archived segments becomes an incremental backup. Unlike traditional backups that copy your entire database periodically, WAL archiving captures every single change, all the time. The segments are small (16MB each), fast to copy, and collectively represent a complete record of your database’s evolution.

Why This Matters More Than You Think

The importance of WAL archiving becomes clear when you think through various disaster scenarios. Suppose your database server’s storage controller fails and corrupts all your data files. Without archiving, you’re stuck restoring your last full backup—maybe last night’s, losing an entire day of work. With archiving, you restore that backup and then replay archived WAL segments, recovering everything up to the moment of failure.

Or consider a subtler scenario: a developer accidentally runs a destructive query at 2:47 PM. With just regular backups, your only option is restoring last night’s backup, losing all of today’s work including the good changes. With WAL archiving and PITR, you restore to 2:46:59 PM—one second before the mistake—preserving everything except the bad query.

WAL archiving also forms the foundation of high-availability architectures. Streaming replication, where standby servers stay synchronized with the primary, works by transferring WAL records. If replication breaks temporarily—say, due to a network outage—the standby can catch up by fetching archived WAL segments. Without archiving, a network outage lasting longer than the primary keeps WAL segments would create an unrecoverable gap, forcing you to rebuild the standby from scratch.

Archiving vs. Streaming Replication: Partners, Not Competitors

Organizations sometimes think they must choose between streaming replication and WAL archiving, but this is a false dichotomy. They’re complementary technologies that solve different problems.

Streaming replication keeps standby servers synchronized in near-real-time, typically with just milliseconds of lag. It’s perfect for high availability—if your primary fails, you can promote a standby almost instantly. Standbys can also serve read-only queries, distributing load. But streaming replication has a critical limitation: it requires a live network connection, and it doesn’t give you point-in-time recovery. If someone deletes the wrong data, that deletion replicates immediately to all your standbys. You can’t use streaming replication to “go back in time.”

WAL archiving fills these gaps. It provides the historical WAL you need for PITR. It works even when networks are down—segments queue locally and archive once connectivity returns. It protects against mistakes that affect all your servers simultaneously.

Smart organizations use both: streaming replication for high availability and read scaling, plus WAL archiving for disaster recovery and point-in-time recovery. This defense-in-depth approach ensures you can handle any scenario: hardware failure, network outage, data center disaster, or human error.

3. How WAL Archiving Works Under the Hood

Understanding the internal mechanics of archiving helps you configure it correctly and troubleshoot when things go wrong.

The Journey of a WAL Segment

Let’s follow a single WAL segment through its lifecycle. Your database is processing transactions, writing changes to the current active WAL segment. Eventually, this segment fills completely—it’s reached its 16MB limit. PostgreSQL performs what’s called a WAL switch: it closes the completed segment, marks it ready for archiving, and opens a fresh segment for new changes.

The archiver process, a background worker that’s always running when archiving is enabled, notices the new segment waiting in the archive-ready state. It constructs a shell command by taking your configured archive_command and replacing special placeholders: %p becomes the full path to the segment file, %f becomes just the filename. Then it spawns a shell and executes this command.

Your archive command—maybe a simple copy operation, maybe a sophisticated upload to cloud storage—does its work. If it succeeds, it must return exit code zero. PostgreSQL sees this success code and marks the segment as archived, making it eligible for recycling. If the command fails (returns any non-zero exit code), PostgreSQL makes a note of the failure and tries again later.

The Critical Importance of Exit Codes

This exit code handling is where many homegrown archiving solutions go wrong. Your archive_command must accurately report success or failure. Returning success when the archive actually failed—even in subtle ways—leads to data loss scenarios.

Consider a naive archive command: cp %p /archive/%f. If the /archive filesystem is full, cp fails and returns non-zero—good. But what if /archive is an NFS mount and the network is flaky? The copy might partially complete, or appear to succeed but be buffered in NFS caches that never flush. If your command returns zero in these cases, PostgreSQL thinks the segment is safely archived when it’s not.

This is why the test-before-copy idiom is so important: test ! -f /archive/%f && cp %p /archive/%f. This checks that a file with that name doesn’t already exist before copying. If it does exist, the command fails, preventing you from accidentally overwriting a good archive with a partial copy. It’s a simple safety check that prevents subtle corruption.
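
A minimal sketch of that idiom as a standalone helper, extended with a copy-then-rename step so a partially copied file can never sit in the archive under its final name (the function name and demo paths are invented for illustration):

```shell
# Test-and-copy, plus atomic rename: a partial copy only ever
# exists under a temporary name that recovery will never look for
archive_wal() {
  src=$1; dest=$2/$(basename "$src")
  [ ! -e "$dest" ] || return 1        # never overwrite an existing archive
  cp "$src" "$dest.tmp" || return 1   # copy under a temporary name
  mv "$dest.tmp" "$dest"              # rename is atomic within one filesystem
}

# Demo with throwaway files:
rm -rf /tmp/demo_wal /tmp/demo_dst
mkdir -p /tmp/demo_wal /tmp/demo_dst
echo demo > /tmp/demo_wal/000000010000000000000042
archive_wal /tmp/demo_wal/000000010000000000000042 /tmp/demo_dst
```

Calling it a second time for the same segment fails rather than overwriting, which is exactly the behavior PostgreSQL’s retry loop expects.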

When Archiving Fails: Retry, Backlog, and Crisis

PostgreSQL’s retry behavior is persistent but not aggressive. When an archive command fails, the archiver waits (typically 60 seconds by default, though this isn’t directly configurable) before trying again. It keeps retrying the same segment indefinitely. This persistence is usually good—transient issues like temporary network problems resolve themselves.

But there’s a catch: the archiver processes segments strictly sequentially. While segment 42 is failing repeatedly, segments 43, 44, 45, and beyond pile up in pg_wal, waiting their turn. If archiving stays broken long enough, pg_wal fills its filesystem. And when that filesystem is full, PostgreSQL has no choice but to halt all write operations. A running database that can’t write WAL is a database that can’t make any changes—effectively frozen.

This cascading failure mode is why monitoring archive success is absolutely critical. You need alerts that fire when archiving fails, not hours later when the filesystem fills. The warning signs are clear in the pg_stat_archiver view: failed_count starts rising, last_failed_time keeps updating, last_archived_time falls behind. Catch these signals early, and you can fix the problem before it becomes a crisis.
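
Alongside pg_stat_archiver, the backlog is also visible on disk: each completed-but-unarchived segment leaves a .ready marker in pg_wal/archive_status. A quick count (the data directory path is an assumption; adjust for your install):

```shell
# Each .ready file is a finished segment still waiting to be archived;
# a steadily growing count means archiving is falling behind
PGWAL=/var/lib/postgresql/13/main/pg_wal
READY=$(ls "$PGWAL"/archive_status/*.ready 2>/dev/null | wc -l | tr -d ' ')
echo "segments waiting to archive: $READY"
```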

Compression and Destinations: The Practical Decisions

Modern WAL archiving almost always involves compression. WAL segments compress remarkably well—70% to 90% size reduction is typical—because they contain structured, somewhat redundant data. The choice of compression algorithm matters.

Gzip is the old reliable: available everywhere, good compression ratios, but CPU-intensive and relatively slow. It’s fine for low-volume databases but can become a bottleneck in busy systems. LZ4 is the speed demon: extremely fast compression and decompression with moderate compression ratios. It’s ideal when you’re CPU-bound but have plenty of network bandwidth and storage. Zstd is the modern balanced choice: excellent compression ratios rivaling gzip but much faster, making it increasingly popular.
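
A rough way to see the effect is to compress a synthetic, highly repetitive file; real WAL is less uniform than this, so treat the result as an illustration of the pattern, not a benchmark (gzip is used here only because it is universally installed; lz4 and zstd have analogous CLIs):

```shell
# Build a ~15MB stand-in for a WAL segment out of repeated rows,
# then compare raw and gzip-compressed sizes
awk 'BEGIN { for (i = 0; i < 500000; i++) print "INSERT INTO orders VALUES (42)" }' \
  > /tmp/fake_seg
ORIG=$(wc -c < /tmp/fake_seg)
COMP=$(gzip -c /tmp/fake_seg | wc -c)
echo "original=$ORIG compressed=$COMP"
```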

Where you archive to shapes everything else about your setup. Local disk archiving is simple and fast but provides zero protection against hardware failure—if the server dies, so does your archive. NFS or other network storage protects against single-server failure but introduces network dependencies and potential performance bottlenecks. Cloud object storage—S3, Google Cloud Storage, Azure Blob—provides durability, geographic redundancy, and effectively unlimited capacity, but requires sophisticated tooling to handle uploads, retries, and authentication.

This is where specialized tools like WAL-G and pgBackRest shine. They abstract away the complexity of cloud uploads, handle retry logic, manage compression, and even encrypt segments automatically. Using these tools isn’t just convenient—it’s more reliable than most homegrown scripts could ever be.

4. Implementing WAL Archiving: From Theory to Practice

Understanding how archiving works conceptually is one thing; making it work reliably in production is another. Let’s walk through implementing archiving the right way.

Before You Begin: The Pre-Flight Checklist

Jumping straight into configuration without preparation is a recipe for problems. Your archive destination needs to be ready. Calculate how much storage you’ll need: monitor your database’s WAL generation rate for a few days, multiply by your planned retention period, and add generous overhead. Running out of archive space mid-crisis is not the time to discover you underestimated capacity.

Network reliability matters enormously if you’re archiving to remote storage. An unstable network doesn’t just slow archiving—it can cause repeated failures that cascade into system-wide problems. Test your network path thoroughly. If you’re using cloud storage, understand the API rate limits and pricing. Some cloud providers charge significantly for PUT operations; archiving generates many small writes.

Set up monitoring before you enable archiving, not after. You need to know immediately when archiving fails. Configure your monitoring system to track pg_stat_archiver, alert on failures, alert on growing backlogs, and alert on archive storage capacity. Discovering archiving has been broken for three days only when you need to restore is a nightmare you can avoid.

Basic Configuration: Getting Started

The minimal configuration to enable archiving is deceptively simple. In postgresql.conf, you need just three settings:

wal_level = replica
archive_mode = on  
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'

The wal_level = replica setting tells PostgreSQL to include enough information in WAL for archiving and replication. Lower levels exist for systems that need neither, but modern practice is to always use replica because the performance overhead is negligible and the flexibility is worth it.

The archive_mode = on setting activates the archiver background process. Without this, PostgreSQL generates WAL but never tries to archive it.

The archive_command is where you specify how to copy segments. The %p placeholder gets replaced with the full path to the segment file, and %f gets replaced with just the filename. The test ! -f check prevents overwriting existing archives—a critical safety measure.

After changing these settings, you must restart PostgreSQL. Unlike many configuration changes that can be reloaded, these require a restart because they affect fundamental server behavior.

Real-World Implementations: From Simple to Sophisticated

A local filesystem archive is the simplest approach, suitable for testing or very small deployments:

# Create the archive directory
mkdir -p /var/lib/postgresql/wal_archive
chown postgres:postgres /var/lib/postgresql/wal_archive
chmod 700 /var/lib/postgresql/wal_archive

# Configure PostgreSQL
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'

This works, but it’s fragile. If your server’s storage dies, you lose both your database and your archive. It’s fine for development or as a first layer in a multi-layered strategy, but never as your only archive.

NFS archiving adds some resilience by storing segments on a separate system:

# Mount NFS share (in /etc/fstab)
backup-server:/exports/postgres-archive /mnt/archive nfs hard,intr,rsize=8192,wsize=8192 0 0

# Configure PostgreSQL  
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

The hard mount option is important—it makes NFS operations wait indefinitely rather than failing on temporary network issues. The downside is that a completely dead NFS server can hang your archiver, but this is preferable to silently losing segments.

Cloud archiving with WAL-G represents the modern best practice for most organizations. First, install WAL-G and configure credentials:

# Install WAL-G (methods vary by OS)
wget https://github.com/wal-g/wal-g/releases/download/v2.0.0/wal-g-pg-ubuntu-20.04-amd64
chmod +x wal-g-pg-ubuntu-20.04-amd64
mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g

# Set up environment (in .bashrc or systemd unit file)
export AWS_REGION=us-east-1
export WALG_S3_PREFIX=s3://my-company-postgres-backups/production/wal
export WALG_COMPRESSION_METHOD=zstd

Then configure PostgreSQL simply:

archive_command = '/usr/local/bin/wal-g wal-push %p'

WAL-G handles everything: uploading to S3, compressing with zstd, retrying on transient failures, even encrypting if you configure it. The simplicity masks sophisticated error handling that would take hundreds of lines of shell script to replicate poorly.

pgBackRest offers even more sophistication, particularly for complex environments:

# Install pgBackRest
apt-get install pgbackrest

# Configure /etc/pgbackrest/pgbackrest.conf

[global]
repo1-type=s3
repo1-s3-bucket=my-company-backups
repo1-s3-region=us-east-1
repo1-s3-key=AKIAIOSFODNN7EXAMPLE
repo1-s3-key-secret=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
repo1-path=/postgres/production
repo1-retention-full=4
repo1-retention-diff=4

[production]
pg1-path=/var/lib/postgresql/13/main
pg1-port=5432

Then in PostgreSQL:

archive_command = 'pgbackrest --stanza=production archive-push %p'

pgBackRest coordinates archiving with base backups, manages retention automatically, and provides a unified interface for all backup operations.

Coordinating Base Backups with Archiving

WAL archiving without base backups is pointless—you need both. A base backup is a complete copy of your database at a specific point in time. The archived WAL segments let you replay changes from that point forward.

Taking a base backup with pg_basebackup is straightforward:

# Take a compressed base backup
pg_basebackup -h localhost -U postgres \
  -D /backups/base-$(date +%Y%m%d-%H%M%S) \
  -Ft -z -P \
  --wal-method=stream

This creates a tarball of your data directory. The --wal-method=stream option ensures necessary WAL is included in the backup, so recovery doesn’t depend on archived segments for that period.

The relationship between base backups and archiving is crucial: your base backups define the starting points for recovery, and archived WAL fills in everything that happens afterward. You need both retention policies to align. There’s no point keeping archived WAL from before your oldest base backup—it’s useless without a starting point. Similarly, keeping base backups without the corresponding WAL limits your recovery options.

Smart backup tools handle this coordination automatically. When pgBackRest expires an old base backup because it exceeds your retention period, it also removes the corresponding WAL segments. When you restore, it automatically fetches the correct base backup and the right WAL segments to reach your target recovery point.

5. Validation: Trust, But Verify

The most dangerous phrase in data protection is “I assume our backups work.” Assumptions about backups are worthless. You need proof, and the only proof is testing recovery.

Verifying Segments Are Actually Archiving

Start with the basics: are segments actually reaching your archive? This is surprisingly easy to overlook, especially if you don’t have monitoring set up yet.

# List recent archives
ls -lht /archive/ | head -20

# Or for cloud storage with WAL-G
wal-g wal-show

# Check PostgreSQL's own statistics
psql -c "SELECT archived_count, last_archived_time, 
         last_failed_time, failed_count 
         FROM pg_stat_archiver;"

You should see archived_count steadily increasing and last_archived_time staying current (within a few minutes). If last_archived_time is hours old, archiving is broken or your database is idle. If failed_count is anything but zero, investigate immediately.

Look at the segments themselves. The filenames should be sequential with no gaps. A missing segment in the sequence is a serious problem—recovery will fail when it reaches that gap. This usually indicates that an archive command reported success when it actually failed.
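
A minimal gap check can be scripted by treating the log-file and segment fields as one sequence number (this sketch assumes a single timeline and the default 16MB segment size, where segment numbers run 00 to FF within each logical log file; the demo directory and files are fabricated):

```shell
# Fabricate an archive with a hole in it: segment ...42 is missing
rm -rf /tmp/demo_archive
mkdir -p /tmp/demo_archive
touch /tmp/demo_archive/000000010000000000000040 \
      /tmp/demo_archive/000000010000000000000041 \
      /tmp/demo_archive/000000010000000000000043

# Walk the sorted names; flag any jump larger than one
gap=""
prev=-1
for f in $(ls /tmp/demo_archive | sort); do
  log=$(printf '%s' "$f" | cut -c9-16)
  seq=$(printf '%s' "$f" | cut -c17-24)
  n=$(( $(printf '%d' "0x$log") * 256 + $(printf '%d' "0x$seq") ))
  if [ "$prev" -ge 0 ] && [ "$n" -ne $((prev + 1)) ]; then
    echo "gap before $f"
    gap=$f
  fi
  prev=$n
done
# → gap before 000000010000000000000043
```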

Testing Point-in-Time Recovery End-to-End

Reading about PITR is one thing; successfully performing it is another. Test regularly on non-production systems that mirror your production setup.

Here’s a complete test scenario: Take a base backup right now and note the exact time. Create some test data that you’ll later verify is present. Wait a few minutes, then note another exact time—this will be your recovery target. Create more test data that you’ll verify is absent after recovery, proving you stopped at the right point.

Now perform the recovery. Stop your test PostgreSQL instance and replace its data directory with your base backup. Create the recovery configuration specifying your restore command and target time. Start PostgreSQL and watch the logs as it recovers.
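
On PostgreSQL 12 and later, that recovery configuration is a handful of settings in postgresql.conf plus an empty recovery.signal file in the data directory (the target time and paths below are examples):

```
# postgresql.conf - recovery target (example values)
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2024-03-06 14:46:59'
recovery_target_action = 'promote'   # open for writes once the target is reached

# then, before starting PostgreSQL:
# touch /var/lib/postgresql/13/main/recovery.signal
```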

What you’re looking for: PostgreSQL should start fetching archived WAL segments using your restore command. The logs should show it replaying transactions. Eventually, it should reach your target recovery time and stop. After recovery completes and PostgreSQL starts normally, verify your data: the first test data should be present, the second should be absent.

This test validates your entire backup and recovery pipeline: base backups are restorable, archived WAL is accessible and correctly ordered, your restore command works, recovery proceeds to the target time. Until you’ve successfully performed this test, you don’t truly have working backups.

Detecting Corruption and Gaps

WAL segments can become corrupted in transit or storage. Detecting this corruption before you need to restore is critical.

PostgreSQL includes pg_waldump, a tool that can read and validate WAL segments:

# Examine a specific segment
pg_waldump /archive/000000010000000000000042

# This shows every WAL record in the segment
# Errors indicate corruption

# Check a range of segments for continuity by giving a start
# and end segment
pg_waldump /archive/000000010000000000000040 /archive/000000010000000000000050

Modern backup tools include validation features that are easier to use:

# WAL-G checks segment integrity and timeline continuity
wal-g wal-verify integrity timeline

# pgBackRest checks entire backup sets
pgbackrest --stanza=production check

These tools verify not just that segments exist but that they’re complete, uncorrupted, and form a valid sequence. They catch problems before you discover them during a real emergency.

6. Best Practices: Learning from Hard-Won Experience

The difference between adequate archiving and excellent archiving comes down to following practices learned from real-world failures.

Making Archiving Bulletproof

Never, ever use rsync directly as your archive command. This advice appears everywhere in PostgreSQL documentation, yet people still do it and suffer the consequences. The problem is that rsync syncs files incrementally. If PostgreSQL calls your archive command while a segment is being written, rsync might see a partial file, copy it, and report success. Your archive now contains corrupted segments that will fail during recovery.

The correct approach is using atomic operations. The simple cp command is atomic for local filesystems—the file appears fully written or not at all. For remote archiving, use tools designed for the purpose. WAL-G and pgBackRest handle partial writes, network failures, and retry logic correctly.

Speaking of retries, avoid remote calls without sophisticated retry logic. A naive scp %p backup-server:/archive/%f fails on any transient network hiccup. PostgreSQL will retry, but each retry delays all subsequent segments. Better to use tools that implement exponential backoff, distinguish transient from permanent failures, and handle edge cases you haven’t thought of.
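
For contrast, here is a toy exponential-backoff wrapper (a deliberately simplified sketch that shows the shape of the logic; real tools also distinguish transient from permanent failures, cap total time, and log each attempt):

```shell
# Retry a command up to five times, doubling the wait each attempt
with_retries() {
  delay=1
  for attempt in 1 2 3 4 5; do
    "$@" && return 0          # success: stop retrying
    sleep "$delay"
    delay=$((delay * 2))      # back off: 1s, 2s, 4s, 8s
  done
  return 1                    # give up; PostgreSQL will retry the segment
}

# Demo with a local copy standing in for a remote transfer:
rm -f /tmp/retry_dst
echo payload > /tmp/retry_src
with_retries cp /tmp/retry_src /tmp/retry_dst
```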

Your archive destination must be more durable than your database server. This sounds obvious, but many organizations archive to disks in the same RAID array or virtual machines on the same hypervisor as their database. When that shared infrastructure fails, both database and archive are lost. Use truly independent storage: different hardware, different data center, or cloud storage with geographic redundancy.

Performance Optimization Without Compromise

Compression is almost always worth it. WAL compresses extremely well because it contains structured data with patterns and redundancy. A 16MB segment typically becomes 2-4MB with zstd, reducing storage costs and network bandwidth by 75-85%.

Choose your compression algorithm based on your bottleneck. If archive command execution time is your limiting factor (check with time in your script), you’re CPU-bound—use lz4 for fast compression. If network bandwidth or storage IOPS is the constraint, use zstd or gzip for better compression ratios. Profile your specific workload to find the sweet spot.

Archive filesystem performance directly impacts your database. A slow archive command causes WAL segments to accumulate in pg_wal. While PostgreSQL handles this gracefully up to a point, eventually checkpoint behavior changes, and performance degrades. Ensure your archive destination—whether local disk, NFS mount, or network connection to cloud storage—has sufficient throughput and low latency.

Size the pg_wal filesystem with room to spare. A good rule of thumb is three to five times your max_wal_size setting. This provides buffer space for temporary archiving delays without risking filesystem exhaustion. Monitor pg_wal size continuously and alert well before it approaches capacity.
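
A cron-friendly sketch of that alert (the function name and threshold are illustrative; point it at your pg_wal directory and wire the output into your alerting system):

```shell
# Warn when a directory exceeds a size budget given in megabytes
check_dir_budget() {
  used_mb=$(du -sm "$1" 2>/dev/null | cut -f1)
  [ "$used_mb" -lt "$2" ] || echo "WARNING: $1 uses ${used_mb}MB (budget ${2}MB)"
}

# Example invocation; replace /tmp with your pg_wal path
check_dir_budget /tmp 4096
```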

Security: Protecting Your Most Sensitive Data

WAL segments contain your actual data—every inserted row, every updated value, even deleted data appears in WAL. If you archive to untrusted storage (public cloud accounts, third-party backup services, potentially compromised systems), encryption isn’t optional.

Modern tools make encryption straightforward. WAL-G supports multiple encryption methods:

export WALG_PGP_KEY_PATH=/etc/wal-g/encryption.key
# Or for libsodium encryption
export WALG_LIBSODIUM_KEY='your-key-here'

pgBackRest offers similar encryption options. Enable them. The performance impact of encryption is negligible compared to network and compression overhead.

Transport security matters too. When uploading to S3, use HTTPS (which modern SDKs do by default). For SSH-based transfers, verify host keys and use key-based authentication rather than passwords. Configure IAM policies to restrict archive bucket access to only the necessary principals.

Monitoring: Your Early Warning System

You cannot manage what you don’t measure, and you cannot fix what you don’t know is broken. Monitor pg_stat_archiver continuously, not just when you remember to check.

Key metrics to track: archived_count should increase steadily (the rate depends on your transaction volume). failed_count should always stay at zero—any increase demands immediate investigation. last_archived_time should be recent, typically within the last few minutes unless your database is truly idle. last_failed_time should be null or far in the past.
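A quick health check against that view can pull all of these at once:

```sql
-- Archiver health at a glance; time_since_last_archive should stay small
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_time,
       now() - last_archived_time AS time_since_last_archive
FROM pg_stat_archiver;
```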

Set up alerts that fire immediately, not hours later. If archiving fails three times in a row, page someone. If last_archived_time is more than 10 minutes old during business hours, investigate. If pg_wal directory size exceeds 75% of capacity, that’s an emergency.

Alert on archive storage capacity too. Running out of space in your archive destination is a cascading failure waiting to happen. Set alerts at 70% capacity so you have time to add storage or clean up old segments gracefully.

Retention: Balancing Protection and Pragmatism

Infinite retention sounds safe but is neither practical nor necessary. Define retention policies based on actual recovery requirements, not vague fears.

Common patterns work well for most organizations. Keep all WAL segments for your operational recovery window—typically 7 to 30 days. This covers accidentally discovered problems and rolling back recent changes. Keep base backups at decreasing frequency: daily for a week, weekly for a month, monthly for a year. This provides both recent fine-grained recovery points and historical checkpoints.

The relationship between base backup retention and WAL retention is crucial: you can safely delete WAL older than your oldest base backup because it’s useless without a starting point. Good backup tools automate this coupling. When pgBackRest expires a base backup, it automatically removes the corresponding WAL segments. When WAL-G deletes old backups based on retention rules, it cleans up orphaned WAL.
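For a hand-rolled archive without such tooling, the coupling can be sketched in a few lines of shell. This is a minimal demo, not a production cleanup job: the directory and segment names are fabricated, and in practice the cutoff segment would come from the oldest base backup's .backup label file.

```shell
#!/bin/bash
# Sketch: prune archived WAL older than the segment the oldest base backup
# starts from. ARCHIVE_DIR and OLDEST_WAL are hypothetical demo values.
ARCHIVE_DIR=${ARCHIVE_DIR:-$(mktemp -d)}
OLDEST_WAL=${OLDEST_WAL:-000000010000000000000030}

# Demo setup: fabricate three archived segments around the cutoff
touch "$ARCHIVE_DIR"/00000001000000000000002F \
      "$ARCHIVE_DIR"/000000010000000000000030 \
      "$ARCHIVE_DIR"/000000010000000000000031

for f in "$ARCHIVE_DIR"/*; do
  seg=$(basename "$f")
  # On a single timeline, WAL file names sort in LSN order,
  # so a plain string comparison identifies older segments
  if [[ "$seg" < "$OLDEST_WAL" ]]; then
    rm -- "$f"
  fi
done

ls "$ARCHIVE_DIR"   # only segments at or after the cutoff remain
```

Note that this single-timeline sorting trick breaks down once PITR has created multiple timelines—another reason to prefer tools that track segments explicitly.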

Different data types might warrant different retention. Production transaction databases might keep 30 days of fine-grained recovery. Analytical databases loaded nightly might only need 7 days. Development databases might need just 24 hours. Tailor retention to actual requirements rather than applying one-size-fits-all policies.

7. Choosing Your Tools: A Practical Guide

The right tool for WAL archiving depends on your environment, team size, and operational maturity.

Native PostgreSQL: Simplicity and Control

Using just PostgreSQL’s built-in archive_command with shell commands is perfectly viable for simple setups. You write a script, PostgreSQL executes it, done. This approach gives you complete control and requires no additional software to install, maintain, or learn.

The limitations become apparent as complexity grows. You’re responsible for implementing compression, encryption, retry logic, error handling, and parallel operations. Your script needs to handle edge cases: what if the network times out? What if the destination is temporarily unavailable? What if the segment already exists? Each of these scenarios requires code you must write, test, and maintain.
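To make the edge cases concrete, here is a hedged sketch of what a defensive local-disk archive script might look like. The script name, destination, and invocation are assumptions for illustration; PostgreSQL would call it via something like archive_command = '/usr/local/bin/safe_archive.sh %p %f'.

```shell
#!/bin/bash
# Sketch of a defensive archive_command for local-disk archiving.
# ARCHIVE_DIR is a hypothetical destination (a temp dir for this demo).
ARCHIVE_DIR=${ARCHIVE_DIR:-$(mktemp -d)}

safe_archive() {
  local src=$1 name=$2 dest="$ARCHIVE_DIR/$2"
  # Segment already archived: succeed only if content is identical, so a
  # retried call is harmless but a conflicting file is never overwritten
  if [ -f "$dest" ]; then
    cmp -s "$src" "$dest" && return 0
    echo "safe_archive: $name exists with different content" >&2
    return 1
  fi
  # Copy under a temporary name first; mv within one filesystem is atomic,
  # so a crash mid-copy never leaves a partial file under the final name
  cp "$src" "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

PostgreSQL treats a zero exit status as "archived" and will retry the same segment later on any non-zero status, which is why the retry-safe and never-overwrite behaviors matter.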

For small deployments archiving to local or NFS storage, native commands work fine. Once you need cloud storage, encryption, or sophisticated error handling, the balance shifts toward specialized tools.

WAL-G: Cloud-Native Elegance

WAL-G emerged from the recognition that PostgreSQL’s backup ecosystem needed a modern, cloud-first tool. Written in Go, it’s a single binary with no dependencies—install it and you’re ready to go.

The design philosophy is “do one thing well.” WAL-G focuses specifically on archiving PostgreSQL (and MySQL) to cloud object storage. It speaks S3, Google Cloud Storage, Azure Blob Storage, and OpenStack Swift natively, handling authentication, retry logic, and rate limiting for you.

Configuration is refreshingly simple. Set environment variables specifying your cloud credentials and destination, then configure PostgreSQL to call wal-g wal-push %p. That’s it. WAL-G handles compression (you choose lz4, zstd, or brotli), encryption (PGP or libsodium), and upload with automatic retries.

Where WAL-G particularly shines is in recovery performance. Its wal-fetch command implements prefetching—while PostgreSQL is replaying one segment, WAL-G is already downloading the next several in the background. This dramatically speeds up recovery, especially when pulling from cloud storage with network latency.

The delta backup feature is another standout. Instead of copying your entire database for each base backup, WAL-G can copy only the pages that changed since the last backup. For large databases with relatively small daily changes, this cuts backup time and storage costs substantially.

WAL-G suits cloud-native organizations, startups optimizing for simplicity, and teams that want a lightweight, focused tool without enterprise complexity.

pgBackRest: Enterprise-Grade Comprehensive Solution

If WAL-G is a precision instrument, pgBackRest is a full backup orchestra. It handles not just WAL archiving but complete backup lifecycle management: full, incremental, and differential backups; retention policies with multiple retention types; parallel operations for speed; delta restore to minimize recovery time; backup validation and verification.

The architecture is more complex, reflecting its broader scope. You can run pgBackRest as a dedicated backup server managing multiple PostgreSQL instances, or configure it for direct repository access from each database server. It supports multiple repositories simultaneously—you might have local fast storage for quick restores and cloud storage for disaster recovery.

Configuration is correspondingly more involved. You create a stanza (pgBackRest’s term for a specific database backup configuration), define repository settings, specify retention rules, and schedule different backup types. The investment pays off in operational maturity—pgBackRest essentially codifies backup best practices into its design.

The retention policy system is particularly sophisticated. You can specify how many full backups to keep, how many differential backups per full backup, and how many days of WAL archives to maintain. pgBackRest manages these relationships automatically, ensuring you never delete a base backup that still has dependent differential backups or corresponding WAL.
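As a rough illustration, those retention rules live in pgBackRest's configuration file; the stanza name and values here are hypothetical:

```ini
# /etc/pgbackrest/pgbackrest.conf (illustrative values)
[global]
repo1-retention-full=4
repo1-retention-diff=3
repo1-retention-archive-type=full

[prod-db]
pg1-path=/var/lib/postgresql/16/main
```

With settings like these, pgBackRest keeps four full backups, three differentials per full, and retains WAL back to the oldest full backup it still holds.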

Delta restore is a game-changer for large databases. During restore, pgBackRest checksums each file and only copies files that differ from what’s already on disk. If you’re restoring to a system that already has most of the data (say, you’re recovering a recent corruption rather than building from scratch), delta restore can reduce restore time from hours to minutes.

pgBackRest fits organizations with multiple PostgreSQL instances, those with strict compliance requirements, and teams that value comprehensive backup management over simplicity.

Barman: Centralized Backup Management

Barman takes a different architectural approach: a dedicated backup server that manages backups for multiple PostgreSQL servers remotely. Think of it as your backup command center.

The model works well for organizations running dozens or hundreds of PostgreSQL instances. Instead of configuring backup separately on each database server, you centralize all backup operations on Barman servers. These servers pull backups via rsync or pg_basebackup, receive WAL via both archive_command and streaming replication (providing redundancy), and manage retention policies centrally.

Barman’s interface is command-line based but intuitive. You define servers in configuration files, then use Barman commands to take backups, list recovery points, perform restores, and check backup health. The barman check command validates your entire backup configuration, catching issues before they cause problems.

One unique feature is WAL streaming alongside traditional archiving. Barman can receive WAL via replication protocol while also accepting segments through archive_command. If streaming replication falters, archiving provides redundancy. This defense-in-depth approach maximizes reliability.
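A hedged sketch of a Barman server definition enabling both paths might look like this (host names and user names are illustrative):

```ini
; /etc/barman.d/pg-main.conf (illustrative server definition)
[pg-main]
description = "Primary production cluster"
conninfo = host=pg-main user=barman dbname=postgres
streaming_conninfo = host=pg-main user=streaming_barman
backup_method = postgres
; receive WAL over the replication protocol
streaming_archiver = on
; and also accept segments pushed via archive_command
archiver = on
```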

Geographic redundancy is straightforward with Barman—run Barman servers in multiple locations, each backing up your PostgreSQL instances. If one data center is lost, the other Barman server has complete backups.

Barman suits organizations managing fleets of PostgreSQL instances, those with dedicated database operations teams, and environments where centralized backup management provides operational advantages.

8. Real-World Case Studies: Learning from Experience

Theory teaches you how things should work. Experience teaches you how things actually work. These case studies, drawn from real implementations (with names changed), illustrate both success patterns and hard-learned lessons.

The SaaS Company: Cloud-First with WAL-G

CloudMetrics, a SaaS analytics platform, runs PostgreSQL on AWS EC2 instances with customer data totaling 800GB. Their architecture reflects modern cloud-native thinking: database servers on EC2, WAL archives on S3 with cross-region replication for disaster recovery, automated base backups every six hours.

They chose WAL-G specifically for its S3 integration and compression efficiency. Their configuration is deceptively simple:

# In postgresql.conf
wal_level = replica
archive_mode = on
archive_command = '/usr/local/bin/wal-g wal-push %p'
archive_timeout = 300

# Environment variables
WALG_S3_PREFIX=s3://cloudmetrics-postgres-prod/wal
WALG_COMPRESSION_METHOD=zstd
WALG_DELTA_MAX_STEPS=6
AWS_REGION=us-east-1

Their transaction rate generates about 500MB of WAL per hour during business hours, compressing to approximately 50MB with zstd. Each segment typically uploads to S3 in under two seconds. Monthly storage costs run about $50 for the primary region, plus $15 for cross-region replication to their disaster recovery region.

The real test came six months after implementation. A developer, debugging a production issue at 2:45 PM on a Tuesday, accidentally executed a destructive query against the wrong table. The mistake was discovered within minutes, but 35,000 rows had already been deleted—representing customer data that couldn’t be easily reconstructed.

The recovery process was almost anticlimactic. The database engineer identified the exact timestamp before the mistake (2:45:30 PM), stopped the database, restored the 2 AM base backup to a spare server, configured recovery to stop at 2:45:29 PM, and started PostgreSQL. Eight minutes later, after replaying 12 hours of WAL, the database was back online with all data intact. The corrupted production server was quickly rebuilt from this recovered instance.
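Assuming WAL-G and PostgreSQL 12 or later, the recovery setup for that procedure would look roughly like this (the timestamp is illustrative):

```ini
# postgresql.conf on the restore server; also create an empty
# recovery.signal file in the data directory to enter recovery mode
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2024-03-12 14:45:29'   # just before the bad query
recovery_target_action = 'promote'
```

When recovery reaches the target time, PostgreSQL stops replaying WAL and, with 'promote', opens the database for normal read-write operation.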

Total business impact: 15 minutes of downtime while they verified data integrity and redirected traffic. Without PITR, they would have lost an entire day of data or spent days attempting manual reconstruction from application logs.

The cost optimization they implemented is worth noting. S3 storage can get expensive at scale, so they configured S3 Intelligent-Tiering for their backup bucket. This automatically moves older WAL segments to cheaper storage tiers as they age. Combined with lifecycle policies that delete WAL older than their oldest base backup, they maintain 30-day recovery capability while keeping storage costs reasonable.

The Financial Institution: On-Premise with All the Safeguards

SecureBank operates PostgreSQL for transaction processing in their on-premise data center, with regulatory requirements demanding complete audit trails and precise recovery capabilities. Their architecture is conservative and defense-in-depth: physical servers with local SSDs, WAL archiving to enterprise NFS appliances, base backups every six hours, streaming replication to a hot standby in a secondary data center.

They learned several lessons the hard way. Three months after deploying WAL archiving, a network switch failure disconnected their NFS mount. PostgreSQL’s pg_wal directory filled the root filesystem over two hours, ultimately halting all write operations and bringing down their transaction processing system at 3 PM—peak business hours.

The immediate fix was ungraceful: they temporarily disabled archiving to let WAL recycle and restore write operations. But this left them without PITR protection during recovery, a situation their compliance team found unacceptable. The incident led to three permanent changes.

First, they moved pg_wal to a dedicated filesystem with 100GB capacity, isolated from root. This provided buffer space for temporary archiving failures without threatening the entire system. Second, they implemented network redundancy with bonded network interfaces to the NFS appliances, eliminating the single point of failure. Third, they deployed aggressive monitoring: alerts fire if archiving fails three consecutive times, if pg_wal exceeds 30GB, or if NFS mount latency spikes above 50ms.

Six months later, this investment proved its worth. A storage controller failure in the primary SAN corrupted database files during a write operation. The database crashed and refused to start, reporting data file corruption. Their recovery process, documented in detailed runbooks, activated immediately.

They stood up recovery on the standby server, restored the most recent base backup (four hours old), configured recovery to replay all available WAL, and watched as PostgreSQL meticulously reconstructed the database state. The recovery took three hours—mostly spent restoring 2TB of base backup data over the network—but succeeded completely. They lost only two minutes of transactions, the gap between the last successfully archived WAL segment and the corruption event.

The lesson they emphasize to peers: redundancy at every layer. Network redundancy prevented archive failures from cascading. Storage diversity (local SSDs, NFS appliances, tape backups) ensured no single failure destroyed everything. Streaming replication plus archiving meant they had multiple paths to recovery. And detailed, tested runbooks meant that at 3 AM when disaster struck, their on-call engineer knew exactly what to do.

The Startup: Bootstrapping on a VPS

FitTrack, a fitness app startup, ran PostgreSQL on a modest VPS—100GB disk, 8GB RAM, $40/month budget. They needed reliable backups but couldn’t afford elaborate infrastructure. Their solution shows how far simplicity and cleverness can take you.

Local disk archiving provided the first layer: fast, simple, no network dependencies. But a VPS host failure would lose everything, so they added a second layer: a cron job rsync’ing archived WAL to a cheap backup VPS at a different provider for $5/month.

# Primary archive to local disk
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'

#!/bin/bash
# /etc/cron.hourly/wal-sync: hourly sync to remote backup
rsync -av --remove-source-files /var/lib/postgresql/wal_archive/ \
      backup-vps:/postgres-wal/

The --remove-source-files flag deletes local segments after successful transfer, preventing local disk exhaustion. This creates a natural flow: segments archive locally (fast, no network dependency), then migrate to remote storage (reliable, separate infrastructure), then get deleted locally (free up space).

Disk space management was crucial. They implemented automatic cleanup that kept seven days locally and 30 days remotely:

# Daily cleanup of local archive
find /var/lib/postgresql/wal_archive/ -type f -mtime +7 -delete

# Emergency cleanup if disk usage critical
USAGE=$(df /var/lib/postgresql | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 85 ]; then
  find /var/lib/postgresql/wal_archive/ -type f -mtime +3 -delete
fi

This pragmatic approach gave them genuine PITR capability for under $50/month total infrastructure cost. When they needed to recover from a schema migration gone wrong, they restored to 30 seconds before the migration, saving hours of downtime and data reconstruction.

As they grew, they gradually upgraded: moving to managed PostgreSQL with built-in backups, adding monitoring, eventually implementing proper cloud storage with WAL-G. But the two-tier local-then-remote archiving pattern served them well during the critical early growth phase.

The Global Enterprise: Multi-Region DR Strategy

GlobalCommerce, an international e-commerce platform, operates PostgreSQL at scale: multi-terabyte databases, millions of transactions daily, SLAs promising 99.99% uptime. Their disaster recovery strategy combines every technology we’ve discussed into a comprehensive defense.

Their primary database runs in US-East with a hot standby in a separate availability zone for high availability. WAL archives continuously to S3 in US-East, with automatic cross-region replication to EU-West. A warm standby server runs in EU-West, normally fed via streaming replication from US-East but configured to fall back to fetching WAL from the local EU-West S3 bucket if streaming fails.

Primary (US-East-1a)
  ├─ Streaming → Hot Standby (US-East-1b)  [for HA]
  └─ WAL-G → S3 (US-East) → Replicate → S3 (EU-West)
                                            ↓
                        Warm Standby (EU-West) ← reads S3 when needed

This architecture provides multiple recovery paths. For routine failover, they promote the US-East-1b hot standby—near-instant with no data loss. For regional failures, they promote the EU-West warm standby, accepting a few minutes of data loss (the replication lag to EU-West) but maintaining service.

They test this quarterly, running full failover drills. During one test, they discovered a subtle timeline management issue. After promoting EU-West to primary and creating timeline 2, they forgot to update the old US-East primary’s recovery configuration to follow timeline 2 when rebuilding it as a standby. PostgreSQL refused to start, complaining about timeline mismatch.

The fix was simple but the lesson valuable: timeline management after failover requires attention. They now include explicit timeline configuration in their runbooks:

# When rebuilding old primary as standby after failover
# in recovery.conf (or postgresql.auto.conf in PG12+)
recovery_target_timeline = 'latest'
restore_command = 'wal-g wal-fetch %f %p'
primary_conninfo = 'host=new-primary port=5432'

The recovery_target_timeline = 'latest' setting tells PostgreSQL to automatically follow timeline branches, preventing manual timeline ID management errors.

Their architecture might seem over-engineered, but it proved its worth during an actual regional outage when AWS’s US-East experienced widespread failures. They executed their tested failover procedure, promoted EU-West to primary, redirected application traffic, and maintained service throughout the incident. When US-East recovered, they rebuilt it as a standby and fell back. Total customer-visible disruption: under five minutes, well within their SLA.

9. Troubleshooting: When Things Go Wrong

Even perfectly designed systems encounter problems. Knowing how to diagnose and fix issues quickly separates adequate operations from excellent ones.

When Archiving Gets Stuck

Archiving hasn’t happened in 30 minutes. The pg_stat_archiver view shows last_failed_time updating repeatedly, and failed_count climbing. The pg_wal directory is growing. What’s happening?

Start by checking PostgreSQL’s logs—they’ll show the exact error message from the failed archive_command. Common culprits include destination filesystem full, permission denied, network timeouts to remote storage, or bugs in custom archive scripts.

The destination-full scenario is particularly common and dangerous. Your archive location ran out of space, so the archive command fails. But PostgreSQL keeps generating WAL, filling pg_wal. If you don’t act quickly, pg_wal fills its filesystem, and PostgreSQL halts all writes.

The immediate fix depends on your situation. If you can quickly add storage to the archive destination, do that—PostgreSQL will automatically resume archiving and work through the backlog. If adding storage takes time, you might need to temporarily delete older archived WAL segments to free space (accepting that you’ve reduced your PITR window). In extremis, you can temporarily disable archiving with archive_command = '/bin/true', but this is dangerous—you’re now without PITR protection.

The long-term fix is implementing proper retention policies with automated cleanup, capacity monitoring with alerts well before you hit limits, and emergency procedures documented before you need them.

The Nightmare: Missing WAL Segments

Recovery fails with a chilling message: “could not find WAL segment 000000010000000000000042.” You check your archive—segment 42 is indeed missing. You have 41, you have 43, but 42 is gone.

This is serious. Recovery cannot proceed past this gap. You can recover to just before segment 42, but everything afterward is lost. How did this happen?

The most common cause is an archive command that reported success when it actually failed. Maybe it was a custom script with poor error handling that exited 0 even when the copy failed. Maybe it was rsync used incorrectly, appearing to succeed but copying a partial file that was later overwritten. Maybe someone manually deleted the segment from the archive, thinking it was safe.

Prevention is everything here because the cure is painful—you’ve lost data. Use atomic operations in your archive command. The test ! -f /archive/%f && cp %p /archive/%f pattern prevents overwriting existing segments and ensures the copy completes before success is reported. Never manually delete segments from an archive without careful analysis. Use tools like WAL-G or pgBackRest that track segments and detect gaps.

If you discover a gap, assess the damage. You can recover to just before the gap, losing transactions afterward. Or, if a streaming replication standby still has the missing segment in its pg_wal, you might be able to retrieve it from there. This is a last-resort recovery path, but it’s sometimes possible.

Timeline Confusion After Recovery

After performing PITR yesterday, recovering to a point before a bad schema change, today’s restoration attempt (testing the procedure) results in PostgreSQL reporting “requested timeline is not a child of this server’s history.” What went wrong?

This is timeline confusion. When PITR was executed yesterday, PostgreSQL created timeline 2, branching from timeline 1. The archive now contains both timeline 1 segments (the original history) and timeline 2 segments (the new history after recovery). Today’s restore attempt is mixing them up, trying to apply timeline 1 segments after timeline 2, which is invalid.

The solution is explicit timeline configuration. In your recovery configuration, always use:

recovery_target_timeline = 'latest'

This tells PostgreSQL to automatically follow timeline changes, applying segments from each timeline in the correct sequence. Alternatively, you can explicitly specify a timeline if you need to recover along a specific history branch, but 'latest' is usually what you want.

Understanding timelines conceptually helps avoid confusion. Each timeline is a distinct history of your database. When you go back in time via PITR and continue forward, you create a new timeline—a new branch of history. The old timeline still exists in your archives, but your database has moved to a new one. This prevents the catastrophic scenario of accidentally mixing WAL from different histories, which would corrupt your database.

Permission Problems During Restore

During a recovery attempt, the restore_command fails with “permission denied.” Checking confirms the archived WAL files exist and appear valid. What’s wrong?

The PostgreSQL process needs read access to archived segments during recovery. This seems obvious but is easy to overlook, especially if the archive destination has strict permissions or the restore is happening on a different server than where archiving occurred.

Check the basics: What user is PostgreSQL running as? Usually postgres, but not always. Do archived files have appropriate read permissions for that user? If archives are owned by root or another user, PostgreSQL might not be able to read them.

For local archives, the fix is straightforward:

chown -R postgres:postgres /path/to/archive/
chmod 700 /path/to/archive/
chmod 600 /path/to/archive/*

For remote archives accessed via commands like scp, ensure SSH keys are set up for the postgres user and the remote host is in known_hosts. For cloud storage with WAL-G or pgBackRest, verify credentials are configured for the postgres user, not just for root.

Test your restore command manually before attempting actual recovery:

# Become the postgres user
su - postgres

# Test retrieving a segment
cp /archive/000000010000000000000042 /tmp/test_segment
# Or with WAL-G
wal-g wal-fetch 000000010000000000000042 /tmp/test_segment

# Check exit code
echo $?  # Should be 0

If this manual test works but recovery still fails, check that the restore command in your recovery configuration is exactly correct, including paths and syntax.

Corrupted Segments: The Unrecoverable Failure

Recovery proceeds normally, replaying segment after segment, then suddenly stops with “invalid WAL record.” The segment appears present but is corrupted. You try to examine it with pg_waldump and get errors.

Corruption can happen during transfer (network errors, disk failures), in storage (bit rot, hardware problems), or through bugs in archive scripts. Whatever the cause, corrupted segments break recovery—you cannot replay past the corruption.

Your recovery options depend on what else you have. If you have streaming replication standbys, they might have uncorrupted copies of the segment. If you archived to multiple destinations, check alternate copies. If neither is available, you can only recover to just before the corrupted segment.

This is why validation is crucial. Tools like pg_waldump can verify segment integrity:

pg_waldump /archive/000000010000000000000042

If it completes without errors, the segment is valid. If it reports errors, corruption is present. Run this validation periodically on your archives—don’t wait until restore to discover corruption.

Modern backup tools include validation in their workflows. WAL-G’s wal-verify command checks segment integrity. pgBackRest’s check command validates entire backup sets, including WAL. Run these checks regularly, catch corruption early, and re-archive from source if possible.

10. Performance: Understanding the Impact

WAL archiving isn’t free—it consumes resources and affects database performance in subtle ways. Understanding these impacts helps you tune for optimal balance.

The Real Performance Story

Here’s the good news: WAL archiving doesn’t directly slow down transaction commits. The archive_command runs asynchronously in a background process. When your application commits a transaction, it waits for WAL to be written to disk but doesn’t wait for archiving. From the transaction’s perspective, archiving is invisible.

The indirect effects matter more. If archiving is slow or failing, WAL segments accumulate in pg_wal. These accumulated segments affect checkpoint behavior. PostgreSQL triggers checkpoints both on a schedule (checkpoint_timeout) and when WAL volume reaches max_wal_size. When segments can’t be recycled because they haven’t been archived, WAL volume grows beyond max_wal_size, triggering more frequent checkpoints.

Frequent checkpoints cause I/O spikes as PostgreSQL flushes dirty buffers to disk. These spikes affect transaction performance—not catastrophically, but measurably. Monitor your checkpoint patterns:

-- On PostgreSQL 17 and later, the checkpoint counters moved to pg_stat_checkpointer
SELECT 
  checkpoints_timed,
  checkpoints_req,
  checkpoint_write_time,
  checkpoint_sync_time,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend
FROM pg_stat_bgwriter;

Healthy systems show mostly checkpoints_timed (scheduled checkpoints) with few checkpoints_req (requested due to WAL volume). If checkpoints_req is high, investigate whether archiving lag is causing WAL accumulation.

Optimizing Archive Performance

The archive command itself needs tuning. Profile it to understand your bottleneck:

# Time a single archive operation
time cp /path/to/wal/segment /archive/segment

# Or with compression
time zstd -q /path/to/wal/segment -o /archive/segment.zst

If the operation takes more than a second or two, you have a problem. Archiving should be fast—segments are only 16MB. Slow archiving indicates storage performance issues, network problems, or inefficient compression.

Compression choice significantly affects performance. Test different algorithms with actual WAL segments:

# Test compression ratio and speed
time gzip -c segment > segment.gz
time lz4 -c segment > segment.lz4
time zstd -c segment > segment.zst

# Compare sizes
ls -lh segment*

Typically, lz4 is 3-5x faster than gzip but achieves 50-60% compression versus gzip’s 70-80%. Zstd offers near-gzip compression with near-lz4 speed, making it often the best choice.

Network performance matters enormously for remote archiving. Upload 16MB to your archive destination and measure the time:

time aws s3 cp testfile s3://your-bucket/test

If this takes more than a few seconds, network bandwidth or latency is your bottleneck. Consider network upgrades, using a closer cloud region, or implementing local buffering with background upload.

WAL Recycling and System Behavior

PostgreSQL’s WAL recycling is an elegant optimization, but its interaction with archiving deserves understanding. After successfully archiving a segment, PostgreSQL doesn’t delete it—it renames it to a future segment number. When writes eventually reach that segment, the pre-existing file is reused, avoiding repeated file creation overhead.

The number of segments in pg_wal reflects this recycling behavior. A healthy system typically maintains 3-10 segments: enough for immediate use and recycling, but not excessive. If you observe 50 or 100 segments, archiving is falling behind and recycling isn’t happening.

The min_wal_size and max_wal_size parameters control recycling behavior. PostgreSQL tries to keep at least min_wal_size worth of segments (for recycling) while triggering checkpoints when approaching max_wal_size. Tune these based on your transaction volume and archiving performance.
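As an illustrative starting point for a moderate write workload (tune against your own checkpoint statistics, not these numbers):

```ini
# postgresql.conf: illustrative sizing, not a universal recommendation
min_wal_size = 1GB          # floor of recycled segments kept for reuse
max_wal_size = 8GB          # soft limit that triggers a checkpoint when exceeded
checkpoint_timeout = 15min  # scheduled checkpoint interval
```

Watch pg_stat_bgwriter after changes: if requested checkpoints dominate, max_wal_size is still too small for your write rate.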

When WAL Bloats: Causes and Remedies

WAL bloating—excessive segment accumulation in pg_wal—has several causes beyond archiving failures. Replication slots prevent WAL recycling until downstream replicas consume those segments. If a replica is down or severely lagging, WAL accumulates indefinitely.

-- Check replication slot lag
SELECT 
  slot_name,
  slot_type,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

If retained_wal is large (gigabytes), investigate why that slot is lagging. If the replica is permanently gone, drop the slot to allow WAL recycling:

SELECT pg_drop_replication_slot('old_replica_slot');

Long-running transactions also retain WAL because those segments might be needed for crash recovery of the transaction. A transaction open for hours can prevent WAL recycling, causing bloat:

SELECT 
  pid,
  usename,
  state,
  now() - xact_start AS duration,
  query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY duration DESC;

If you find transactions running for hours, investigate why. Are they stuck? Did the application crash without releasing the connection? Terminate them if necessary, but understand the cause to prevent recurrence.
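When termination is warranted, PostgreSQL's built-in admin functions let you escalate gradually; try a polite query cancellation before killing the backend outright (the pid here is a placeholder taken from the pg_stat_activity query above):

```sql
-- Cancel the running query first; terminate the backend only if that fails
SELECT pg_cancel_backend(12345);      -- 12345 = pid from pg_stat_activity
SELECT pg_terminate_backend(12345);
```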

Prepared transactions from two-phase commit are another subtle cause. These hold WAL until explicitly committed or rolled back:

SELECT * FROM pg_prepared_xacts;

If you find old prepared transactions, resolve them by committing or rolling back:

COMMIT PREPARED 'transaction_id';
-- or, if the work should be discarded:
ROLLBACK PREPARED 'transaction_id';

11. Bringing It All Together

We’ve covered immense ground—from WAL fundamentals through archiving mechanics, implementation details, tool choices, real-world experiences, and troubleshooting. Let’s distill this into actionable wisdom.

The Non-Negotiable Truth

WAL archiving isn’t optional for production databases. Full stop. If your data has any value—business transactions, customer information, historical records—you need continuous protection. Base backups alone leave you vulnerable to losing everything between backup windows. WAL archiving fills those gaps, enabling recovery to any specific moment.

For regulated industries, this isn’t just best practice—it’s often legally required. Regulators expect you to demonstrate comprehensive backup and recovery capabilities. “We take daily backups” doesn’t satisfy auditors when they ask “Can you recover to 2:47 PM on March 15th?” With proper WAL archiving, you can.

Keep Your Archive Command Simple and Reliable

The archive_command is a single point of failure in your backup strategy. Every execution must succeed, or you risk data loss. This demands simplicity and reliability over cleverness.

Resist the temptation to build complex custom scripts unless you have specific needs that existing tools can’t meet. Simple is better: cp for local archiving, well-tested tools like WAL-G or pgBackRest for remote archiving. Every line of custom code is a potential bug, and bugs in archive_command create gaps in your archives.

When you must write custom archive logic, follow defensive patterns: test that destination files don’t exist before writing, verify copy completion before reporting success, log everything for troubleshooting, handle errors explicitly rather than failing silently.
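Those defensive patterns can be sketched as a small script (paths and names are illustrative; it would be invoked as `archive_command = '/usr/local/bin/archive_wal.sh %p %f'`):

```shell
# Hypothetical defensive archive script: never overwrite, verify the copy,
# log every outcome, and never report success unless the copy is proven good.
ARCHIVE_DIR=/mnt/wal_archive
LOG=/var/log/wal_archive.log

archive_wal() {
  src="$1"; name="$2"; dest="$ARCHIVE_DIR/$name"
  # Never overwrite: an existing file may be a segment archived earlier
  if [ -e "$dest" ]; then
    echo "$(date -u) REFUSED $name: destination already exists" >> "$LOG"
    return 1
  fi
  # Copy, then verify byte-for-byte before reporting success
  if cp "$src" "$dest" && cmp -s "$src" "$dest"; then
    echo "$(date -u) OK $name" >> "$LOG"
    return 0
  fi
  echo "$(date -u) FAILED $name" >> "$LOG"
  rm -f "$dest"    # don't leave a partial file masquerading as an archive
  return 1
}
```

The non-zero return on any failure is what tells PostgreSQL to retain the segment and retry, which is exactly the behavior you want.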

Test Relentlessly

An untested backup system is an untrustworthy backup system. You don’t know your backups work until you’ve successfully restored from them. Schedule regular recovery tests—monthly at minimum, weekly if you’re paranoid (and you should be).

Don’t just test the happy path. Test edge cases: recovering from compressed archives, recovering when some segments are missing, recovering to a different server, recovering under time pressure. Each test reveals gaps in your procedures, documentation, or automation.

Document everything you learn. Your recovery runbooks should be detailed enough that someone could follow them at 3 AM during an outage. Include specific commands, expected output, troubleshooting steps, and contact information for escalation.

Invest in Proper Tools

For toy databases or learning environments, native archive_command with shell scripts is fine. For anything approaching production use, invest in proper backup tools.

WAL-G gives you cloud-native archiving with compression, encryption, and intelligent retry logic in a single lightweight binary. pgBackRest provides enterprise-grade backup management with sophisticated retention policies and delta operations. Barman offers centralized backup administration for multiple PostgreSQL instances.

These tools encode years of hard-won experience. They handle edge cases you haven’t thought of, implement retry logic you’d take weeks to write, and provide operational maturity that custom scripts rarely achieve. The time you save in not building and maintaining custom backup infrastructure far exceeds any licensing or learning costs.

Build Layered Defense

No single backup technology is perfect. Streaming replication can fail if the primary server becomes corrupted. WAL archiving depends on storage that could itself fail. Base backups consume time to restore. Smart architectures layer these technologies.

Run streaming replication for high availability and near-instant failover. Implement WAL archiving for point-in-time recovery and protection against logical errors. Take base backups at appropriate frequencies to limit recovery time. Archive to multiple destinations—local for speed, cloud for durability. Test the entire stack regularly.

This layered approach means you can handle any scenario: hardware failure, data corruption, accidental deletions, data center disasters, even ransomware attacks if you maintain offline or immutable backups.

Monitor Everything

You cannot manage what you cannot measure. Instrument every component of your backup system: archive success rates, archive latency, storage capacity, backup ages, recovery test results. Alert on deviations from normal patterns before they become emergencies.

The pg_stat_archiver view is your primary monitoring source for archiving health. Export its metrics to your monitoring system and alert on anomalies. Supplement with storage monitoring (capacity, IOPS, latency), network monitoring for remote archives, and periodic validation of archived segments.
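The essentials can be pulled with a single query; a growing failed_count or a stale last_archived_time is the signal to alert on:

```sql
-- Archiver health at a glance, from the pg_stat_archiver view
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```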

When alerts fire, treat them seriously. A single failed archive is a yellow alert—investigate but not urgent. Three consecutive failures is red alert—page someone immediately. Storage nearing capacity is critical—add space now, not when you’re at 95%.
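That escalation policy is simple enough to encode directly in whatever alerting glue you run; a minimal sketch (the storage thresholds of 80% and 90% are assumed values, not from the text):

```python
def alert_level(consecutive_failures: int, storage_used_pct: float) -> str:
    """Map archiving health to an alert level per the policy above.
    Storage thresholds (80/90%) are illustrative assumptions."""
    if consecutive_failures >= 3 or storage_used_pct >= 90:
        return "red"       # page someone immediately
    if consecutive_failures >= 1 or storage_used_pct >= 80:
        return "yellow"    # investigate, but not yet urgent
    return "green"
```

Feeding it the failure streak derived from pg_stat_archiver and a disk-usage percentage gives your monitoring system one unambiguous severity to act on.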

The Final Word

Data is the lifeblood of modern organizations. Everything else—hardware, software, even people—can be replaced. But lost data is gone forever. WAL archiving, implemented thoughtfully and tested regularly, ensures your data survives any disaster.

The investment seems heavy: storage costs, tool licenses, engineering time, testing overhead. But consider the alternative: explaining to your CEO why you lost a day of transactions, facing regulatory penalties for audit failures, watching customers leave after a data loss incident. The cost of doing archiving right is trivial compared to the cost of doing it wrong.

Start today if you haven’t already. Enable WAL archiving with appropriate tools for your environment. Set up monitoring and alerts. Schedule your first recovery test. Document your procedures. Then sleep better knowing your data is truly protected.

The database you save may be your own.


Additional Resources and Further Reading

Official PostgreSQL Documentation
The PostgreSQL manual’s chapter on Continuous Archiving and Point-in-Time Recovery remains the definitive reference. It’s dense but comprehensive, covering every configuration parameter and corner case.

Tool Documentation
WAL-G, pgBackRest, and Barman all maintain excellent documentation with detailed examples. WAL-G’s GitHub repository includes configuration samples for every major cloud provider. pgBackRest’s website offers comprehensive guides covering everything from initial setup to advanced multi-repository configurations. Barman’s documentation includes architecture diagrams that help you understand its centralized backup model.

Community Wisdom
The PostgreSQL community mailing lists, particularly pgsql-general and pgsql-admin, contain decades of accumulated knowledge. Search the archives when you encounter problems—chances are someone else has faced and solved the same issue. The PostgreSQL Slack workspace and Discord servers offer real-time help from experienced practitioners.

Books Worth Reading
“PostgreSQL 14 Administration Cookbook” by Simon Riggs and Gianni Ciolli dedicates entire chapters to backup and recovery strategies, with practical recipes you can implement immediately. “PostgreSQL High Performance” by Gregory Smith, while focused on performance tuning, provides valuable context on how WAL and checkpoints affect system behavior.

Blogs and Case Studies
Many organizations publish detailed post-mortems of outages and recovery experiences. These real-world stories, often more valuable than official documentation, reveal the unexpected challenges and practical solutions that only emerge under pressure. Seek them out, learn from others’ mistakes, and share your own experiences to help the community.


Epilogue: A Story of Survival

One final story brings everything together.

A mid-sized software company—call them DataFlow—had been running PostgreSQL for three years without incident. They took nightly backups, monitored their systems, and felt reasonably confident about their data protection. They’d heard about WAL archiving but kept putting it off. “We’ll implement it next quarter,” became a running joke.

Then came the Tuesday that changed everything. At 11:47 AM, during a routine deployment, a configuration error triggered a cascading failure. Their primary database server’s storage array experienced controller corruption, spreading bad data across multiple volumes. By the time they realized what was happening and killed power, substantial portions of their database files were corrupted beyond recovery.

They had last night’s backup. Restoring it would lose 12 hours of customer data—thousands of transactions, user registrations, purchases, and updates. Their SLA promised 99.9% uptime and data durability. The potential liability was enormous.

Their CTO made an emergency call to a PostgreSQL consultant who asked one question: “Do you have WAL archiving?” The answer was no. The silence on the line spoke volumes.

They spent the next 48 hours in what they later called “digital archaeology”—attempting to reconstruct lost transactions from application logs, cached data, and even database queries logged by their monitoring system. They recovered perhaps 60% of the lost data. The other 40% was simply gone. Customer complaints poured in. Their Net Promoter Score plummeted. The incident cost them three major clients and hundreds of smaller ones.

Six months later, after implementing comprehensive WAL archiving with pgBackRest, after running monthly disaster recovery drills, after hiring a dedicated database reliability engineer, the CTO gave a conference talk about their experience. His conclusion resonated: “WAL archiving seemed like an expensive, complex overhead we could defer. Losing customer data turned out to be far more expensive and complex.”

The audience included an engineer from a startup who went back to her office that evening and spent four hours setting up WAL-G to archive to S3. Three weeks later, when a developer accidentally truncated a critical table, she performed PITR to 30 seconds before the mistake. Recovery time: 12 minutes. Data loss: zero. Cost of implementing WAL archiving: a few hours of work. Value delivered: immeasurable.


Next Steps

The knowledge needed to implement robust WAL archiving is now in hand. Here’s what to do next:

If there’s no WAL archiving yet:

  1. Choose a tool based on the environment (WAL-G for cloud, pgBackRest for enterprise, or native for simple local archiving)
  2. Configure a test environment that mirrors production
  3. Enable archiving and take a base backup
  4. Perform a complete point-in-time recovery test
  5. Document every step in a recovery runbook
  6. Once confident, deploy to production with comprehensive monitoring

If WAL archiving is already in place:

  1. When was the last recovery test? Schedule one for this week
  2. Review retention policies—do they align with actual recovery requirements?
  3. Check monitoring—would failures be detected within minutes if archiving failed?
  4. Examine pg_stat_archiver statistics—any warning signs?
  5. Consider whether current tools are still the best fit as the environment evolves

For everyone:

  1. Share this knowledge with the team—backup strategies are team responsibilities, not individual ones
  2. Schedule regular recovery drills—quarterly at minimum
  3. Document configurations and procedures—future teams will be grateful
  4. Stay engaged with the PostgreSQL community—learn from others’ experiences
  5. When (not if) facing a recovery scenario, write up the experience and share it

Acknowledgments and Closing Thoughts

The techniques and best practices described in this guide represent decades of collective wisdom from the PostgreSQL community. They’ve been refined through countless production deployments, learned through painful outages, and validated through successful recoveries.

PostgreSQL’s Write-Ahead Logging architecture is a testament to thoughtful engineering. The developers who designed it understood that databases aren’t just about storing data—they’re about preserving data through every imaginable failure scenario. WAL archiving extends that protection from a single server to your entire infrastructure.

Technology continues evolving. New tools emerge, cloud capabilities expand, and best practices adapt. But the fundamental principles remain constant: protect data continuously, test recovery procedures regularly, and never assume backups work without proof.

Data tells an organization’s story. It records every customer interaction, every transaction, every decision. Losing that story isn’t just a technical failure—it’s an organizational catastrophe. WAL archiving ensures that no matter what happens, the story continues.

Implement it thoughtfully. Test it relentlessly. Trust it confidently. Future teams will thank the current one.


About This Guide

This comprehensive guide covers PostgreSQL WAL archiving from fundamental concepts through advanced implementations. It’s intended for database administrators, site reliability engineers, and developers responsible for data durability. The examples are drawn from real-world deployments, the recommendations are battle-tested, and the advice is given in the spirit of helping readers avoid mistakes that others have made.

Version: 1.0
Last Updated: November 2025
PostgreSQL Versions Covered: 12-16

For corrections, suggestions, or sharing your own WAL archiving experiences, the PostgreSQL community welcomes your contributions. Learn from others, share your knowledge, and help make database operations more reliable for everyone.

Stay safe. Keep your data safer.
