Here’s a truth that keeps DBAs up at night: your database looks perfectly healthy right up until the moment it isn’t. By then, it’s too late.
I’ve seen it happen. A perfectly running system, humming along for months, then suddenly—gone. The postmortem always reveals the same thing: the warning signs were there all along. We just weren’t looking.
Let’s fix that.
Why Your Database Needs a Daily Checkup
Think about your last doctor’s visit. They didn’t just ask “feeling okay?” and send you home. They checked your pulse, blood pressure, temperature—your vital signs. Your database deserves the same treatment.
The difference between a great DBA and a firefighter? Great DBAs see problems coming. That replication lag creeping up over three days? That’s your database telling you something’s wrong. The connections slowly maxing out? Your app is leaking connections. The disk usage at 82%? You’ve got maybe a week before things get ugly.
Reactive management is expensive. Proactive monitoring is cheap insurance.
The 8 Vital Signs Every PostgreSQL Database Needs
Forget the hundred metrics your monitoring tool tracks. Most are noise. Here are the eight that actually matter:
1. Is Anyone Home? (Database Availability)
This sounds obvious, but you’d be surprised how many “monitoring systems” don’t actually check if the database is up. Can you connect? Great. Now check:
Connection usage is your first red flag. PostgreSQL has a hard limit on connections (max_connections). When you hit it, new connections fail. Period. No graceful degradation, no queue—just errors.
SELECT count(*) as current,
current_setting('max_connections')::int as max,
round(count(*) * 100.0 / current_setting('max_connections')::int, 2) || '%' as usage
FROM pg_stat_activity;
Here’s your cheat sheet:
- Below 70%? You’re fine.
- 70-85%? Start investigating.
- Above 85%? Drop everything and figure out why.
Idle connections are the silent killers. Especially “idle in transaction”—these are connections that started a transaction and then… nothing. They’re holding locks, blocking vacuum, wasting resources. If you see connections idle in transaction for more than 5 minutes, something’s broken in your application code.
SELECT pid, usename, state,
now() - state_change as idle_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '5 minutes';
Kill them. Fix the app. Move on.
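If you need to clear them by hand, pg_terminate_backend does it, and a server-side timeout prevents the recurrence. The 5-minute threshold mirrors the check above; tune it to your application:

```sql
-- Terminate sessions stuck idle in a transaction for more than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes';

-- Better: have PostgreSQL enforce this automatically (server-wide guardrail)
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
```

The timeout is the real fix for recurring offenders, but it will abort legitimate long transactions too, so pick a value your application can live with.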
2. Are Your Replicas Keeping Up? (Replication Status)
If you’re running replicas (and you should be), replication lag is existential. It answers one question: if your primary dies right now, how much data are you losing?
Two numbers matter:
- Byte lag: how much WAL data hasn’t been replayed yet
- Time lag: how old the data on your replica is
On the primary:
SELECT client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes,
replay_lag as time_lag
FROM pg_stat_replication;
On the standby:
SELECT now() - pg_last_xact_replay_timestamp() as lag;
The rules:
- Under 10MB / 10 seconds: Perfect
- 10-100MB / 10-60 seconds: Investigate
- Over 100MB / 60 seconds: This is bad. Really bad.
Why does lag happen? Usually it’s one of three things:
- Network is slow (especially for geo-distributed replicas)
- Standby hardware is weak
- Long queries on the standby are blocking replay
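If standby queries are the culprit, two standby-side settings control the trade-off between letting queries finish and keeping replay current. A sketch; the values depend on your workload:

```sql
-- On the standby: cancel conflicting queries after 30s so WAL replay can proceed
ALTER SYSTEM SET max_standby_streaming_delay = '30s';

-- On the standby: report the oldest running query back to the primary so
-- vacuum doesn't remove rows the standby still needs
-- (costs some extra bloat on the primary)
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```

Shorter delays keep lag down at the cost of cancelled standby queries; hot_standby_feedback does the opposite. Most setups pick one lever, not both.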
3. Will You Run Out of Disk Tomorrow? (Storage Health)
Running out of disk space is a special kind of panic. Writes fail. Transactions abort. Sometimes you get corruption. Fun times.
Check your database sizes:
SELECT datname,
pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY pg_database_size(datname) DESC;
But here’s what people forget: WAL files. If your replication falls behind or your archive command breaks, WAL files accumulate fast. I’ve seen the pg_wal directory grow 50GB overnight.
du -sh /var/lib/postgresql/*/pg_wal
Your filesystem should stay under 70% usage. At 85%, start worrying. At 90%, start deleting things.
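A frequent culprit behind runaway pg_wal is a replication slot whose consumer died: PostgreSQL retains WAL for it indefinitely. Worth checking alongside the du above:

```sql
-- Inactive slots pin WAL on disk; retained_wal shows how much
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

If a slot is inactive and its consumer is never coming back, drop it with pg_drop_replication_slot('slot_name') and the WAL gets recycled.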
Bloat is the other storage killer. Dead tuples that vacuum hasn’t cleaned up yet. They waste space and slow down queries:
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as bloat_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY bloat_pct DESC
LIMIT 10;
If you see tables with 20%+ bloat, vacuum isn’t keeping up with your write load.
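Autovacuum’s default trigger (roughly 20% dead tuples) is often too lazy for large, write-heavy tables. Per-table settings help; the table name and numbers here are illustrative:

```sql
-- Hypothetical hot table: vacuum at ~2% dead tuples instead of the 20% default
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold    = 1000
);
```

Lower scale factors mean more frequent, smaller vacuums, which is almost always cheaper than one giant cleanup.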
4. Is Your Cache Actually Helping? (Performance Metrics)
Your cache hit ratio tells you if PostgreSQL is using memory effectively or thrashing to disk constantly.
SELECT datname,
round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) as cache_hit_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
Above 95%? Excellent. Between 90-95%? Room for improvement. Below 90%? Your queries are slow and you know why.
Checkpoints are another thing people ignore until they become a problem. PostgreSQL flushes dirty buffers to disk periodically. Too often = too much I/O. Not often enough = long crash recovery.
SELECT checkpoints_timed,
checkpoints_req,
round(100.0 * checkpoints_req / NULLIF(checkpoints_timed + checkpoints_req, 0), 2) as forced_pct
FROM pg_stat_bgwriter;
If more than 10% of your checkpoints are “forced” (requested rather than timed), tune your checkpoint settings. (On PostgreSQL 17+, these counters moved to pg_stat_checkpointer, as num_timed and num_requested.)
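Tuning usually means giving PostgreSQL more room between checkpoints so they happen on the timer, not on WAL pressure. These values are a starting sketch, not gospel:

```sql
-- Fewer forced checkpoints: allow more WAL between them, spread the writes out
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();
```

The trade-off: more WAL between checkpoints means longer crash recovery, so don’t push max_wal_size up without thinking about your recovery time objective.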
5. What’s Taking So Long? (Query Performance)
You need pg_stat_statements. It has to be loaded at server start (add it to shared_preload_libraries and restart), then enabled per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
This extension tracks every query pattern—how many times it ran, how long it took, how much data it touched. It’s a gold mine.
Long-running queries are your immediate concern:
SELECT pid, usename,
now() - query_start as duration,
left(query, 60) as query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
Anything running over 5 minutes deserves investigation. Maybe it needs an index. Maybe it’s a bad query. Maybe someone’s doing a full table scan on your 500GB orders table.
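When you decide a query has to go, try the polite option first: pg_cancel_backend stops the query but keeps the session; pg_terminate_backend kills the whole connection.

```sql
-- Replace 12345 with the pid from the query above
SELECT pg_cancel_backend(12345);     -- cancel the running query only
SELECT pg_terminate_backend(12345);  -- drop the entire session if cancel fails
```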
Blocked queries mean lock contention:
SELECT blocked.pid AS blocked_pid,
blocked.usename,
blocking.pid AS blocking_pid,
blocking.usename AS blocker,
blocked.query AS blocked_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
One blocker can stop dozens of queries. Kill the blocker, clear the queue.
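Prevention beats cleanup: a lock_timeout on sessions that take strong locks (migrations, DDL) caps how long they can sit at the head of the queue. A sketch, with a hypothetical table:

```sql
-- In your migration session: give up after 5s instead of queueing
-- indefinitely and blocking every query behind you
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN note text;  -- fails fast if the lock isn't available
```

A failed migration you can retry at 3 AM is far cheaper than a production-wide lock pileup at noon.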
Top queries by total time:
SELECT calls,
round(total_exec_time::numeric, 2) as total_ms,
round(mean_exec_time::numeric, 2) as avg_ms,
left(query, 80)
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
These are your optimization targets. The queries consuming the most total database time give you the biggest return on investment.
6. Can You Actually Recover? (Backup Status)
Here’s a scary pattern: plenty of teams discover their backups don’t actually restore only when they finally test them—or worse, when they need them. Don’t be that team.
Check your WAL archiving:
SELECT archived_count,
failed_count,
last_archived_time,
last_failed_time
FROM pg_stat_archiver;
If failed_count isn’t zero, your point-in-time recovery is broken.
For your base backups, track:
- When was the last successful backup?
- How big was it?
- How long did it take?
Under 24 hours? Good. 24-48 hours? Concerning. Over 48 hours? Critical. Your RPO (recovery point objective) is now measured in days.
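The archiver view covers WAL, but base-backup age usually lives on the filesystem. A minimal sketch, assuming your backups land as *.tar.gz files in a directory of your choosing (the path, naming, and thresholds are all assumptions to adapt):

```shell
# check_backup_age DIR MAX_AGE_HOURS -> prints OK / WARNING / CRITICAL
check_backup_age() {
    local dir="$1" max_hours="$2"
    local latest
    # Newest backup file by modification time (GNU find)
    latest=$(find "$dir" -name '*.tar.gz' -printf '%T@ %p\n' 2>/dev/null \
        | sort -rn | head -1 | cut -d' ' -f2-)
    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in $dir"
    elif [ -n "$(find "$latest" -mmin +$((max_hours * 60)))" ]; then
        echo "WARNING: newest backup $latest exceeds ${max_hours}h"
    else
        echo "OK: newest backup is $latest"
    fi
}

# Demo on a scratch directory (stands in for your real backup path)
demo=$(mktemp -d)
touch "$demo/base_today.tar.gz"   # simulate a fresh backup
result=$(check_backup_age "$demo" 24)
echo "$result"
```

Wire the function into the health check script below it or alert on its output directly; size and duration tracking can hang off the same find.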
7. Is Vacuum Keeping Up? (Maintenance Status)
VACUUM is not optional. PostgreSQL’s MVCC architecture creates dead tuples with every UPDATE and DELETE. If vacuum doesn’t clean them up, your tables bloat, your queries slow down, and eventually you hit transaction ID wraparound protection. Yes, protection: PostgreSQL will stop accepting commands entirely until you vacuum the offending database.
Check which tables need attention:
SELECT schemaname, tablename,
n_live_tup as live,
n_dead_tup as dead,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 10;
Over 10% dead tuples? Warning. Over 20%? Critical. Over a million dead tuples? Why are you still reading this? Go fix it.
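For a table that’s already badly bloated, a manual vacuum is the immediate fix (table name illustrative):

```sql
-- Reclaim dead tuples now and refresh planner stats; VERBOSE shows progress
VACUUM (VERBOSE, ANALYZE) orders;
```

Note that plain VACUUM makes the space reusable but doesn’t shrink the file on disk; VACUUM FULL does, at the cost of an exclusive lock for the duration.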
8. Who’s Doing What? (Security & Access)
Know who’s connected:
SELECT usename,
count(*) as connections,
count(*) FILTER (WHERE ssl) as encrypted,
count(*) FILTER (WHERE NOT ssl) as unencrypted
FROM pg_stat_activity
GROUP BY usename;
All production connections should be encrypted. If you see unencrypted connections, you have a security problem.
Check your logs for failed login attempts:
grep "FATAL.*password authentication failed" /var/log/postgresql/*.log
A few failures? Fat fingers. A pattern of failures? Someone’s trying to break in.
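To see the pattern, aggregate failures per user. This sketch runs against an inline sample so you can see the shape of the output; point the grep at your real logs, and with %h in log_line_prefix you can group by client IP instead:

```shell
# Count failed password attempts per user from a sample log excerpt
# (real logs live under /var/log/postgresql/)
sample=$(mktemp)
cat > "$sample" <<'EOF'
2024-05-01 10:01:02 UTC FATAL:  password authentication failed for user "admin"
2024-05-01 10:01:05 UTC FATAL:  password authentication failed for user "admin"
2024-05-01 10:02:11 UTC FATAL:  password authentication failed for user "app"
2024-05-01 10:03:00 UTC LOG:  connection authorized: user=app database=prod
EOF

# Extract each failure, count per user, most-attacked first
summary=$(grep -o 'password authentication failed for user "[^"]*"' "$sample" \
    | sort | uniq -c | sort -rn)
echo "$summary"
```

Any user racking up a high count deserves a look at pg_hba.conf and the source addresses involved.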
Build a Health Check Script That Actually Works
Enough theory. Here’s a script that checks everything:
#!/bin/bash
# pg_health_check.sh
PGHOST="${PGHOST:-localhost}"
export PGHOST
REPORT_FILE="/var/log/postgresql/health_$(date +%Y%m%d).log"
log() {
echo "[$(date +"%Y-%m-%d %H:%M:%S")] $1" | tee -a "$REPORT_FILE"
}
# 1. Can we connect?
if ! psql -c "SELECT 1" > /dev/null 2>&1; then
log "ERROR: Database is DOWN"
exit 1
fi
log "✓ Database is UP"
# 2. Connection usage (-tA: tuples only, unaligned, no stray whitespace)
CONN_PCT=$(psql -tA -c "
SELECT round(count(*) * 100.0 / current_setting('max_connections')::int)
FROM pg_stat_activity;
")
log "Connections: ${CONN_PCT}%"
[ "$CONN_PCT" -gt 85 ] && log "WARNING: Connection usage critical"
# 3. Replication lag (primary only)
if [ "$(psql -tA -c 'SELECT NOT pg_is_in_recovery()')" = "t" ]; then
LAG=$(psql -tA -c "
SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
FROM pg_stat_replication;
")
[ -n "$LAG" ] && log "Replication lag: $((LAG / 1024 / 1024))MB"
fi
# 4. Disk space (-P: POSIX output, safe to parse)
DISK_PCT=$(df -P /var/lib/postgresql | awk 'NR==2 {print $5}' | tr -d '%')
log "Disk usage: ${DISK_PCT}%"
[ "$DISK_PCT" -gt 85 ] && log "WARNING: Disk space critical"
# 5. Cache hit ratio
psql -c "
SELECT datname,
round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) || '%' as cache_hit
FROM pg_stat_database
WHERE datname = current_database();
" | tee -a "$REPORT_FILE"
# 6. Long queries
LONG=$(psql -tA -c "
SELECT count(*) FROM pg_stat_activity
WHERE state != 'idle' AND now() - query_start > interval '5 minutes';
")
[ "$LONG" -gt 0 ] && log "WARNING: $LONG long-running queries"
# 7. Backup age
LAST_BACKUP=$(psql -tA -c "SELECT last_archived_time FROM pg_stat_archiver;")
log "Last archive: ${LAST_BACKUP:-none recorded}"
log "Health check complete"
Run it daily at 7 AM:
0 7 * * * /usr/local/bin/pg_health_check.sh
Set Up Alerts That Don’t Cry Wolf
Bad alerts are worse than no alerts. Here’s how to do it right:
Critical (wake someone up):
- Database down
- Replication lag > 100MB
- Disk > 90% full
- Connections maxed out
- Backup failed
Warning (email during business hours):
- Replication lag > 10MB
- Disk > 70% full
- Cache hit ratio < 95%
- Backup > 24 hours old
Info (just log it):
- Daily health check results
- Maintenance completions
- Normal operations
The key? Make critical alerts rare. If you’re getting paged three times a night, your thresholds are wrong.
Pick Your Tools
You don’t need everything. Start simple:
Just getting started?
- Built-in pg_stat_* views
- A daily health check script
- Email reports
Ready for more?
- Prometheus + postgres_exporter
- Grafana dashboards
- PagerDuty for alerts
Running at scale?
- Datadog or New Relic
- Full observability stack
- Automated remediation
On cloud?
- Use the built-in monitoring (RDS Performance Insights, Azure Monitor, etc.)
- It’s good enough and it’s already there
Make It Part of Your Culture
Technology is only half the battle. The other half is making sure people actually look at the data and act on it.
Weekly ritual: Review the past week’s metrics. Are there trends? Is something slowly getting worse? Did that optimization actually help?
Monthly capacity planning: Look at growth rates. When will you need more disk? More connections? More CPU?
After every incident: Did monitoring catch it? Did alerts fire? Could we have seen it coming? Update your monitors based on what you learn.
Document everything: a new team member should be able to understand what each alert means and how to respond. Write it down.
The Hard Truth
PostgreSQL won’t tell you it’s dying until it’s too late. It’ll keep accepting connections right up until it can’t. It’ll keep writing data until the disk is full. It’ll keep your replication broken until you notice weeks later during a failover attempt.
The eight vital signs above aren’t nice-to-haves. They’re your early warning system. Check them daily. Alert on the thresholds. Act on the warnings.
Your database is the heart of your application. Treat it like it matters.
Start with a simple health check script. Run it daily. Read the reports. Add monitoring tools as you grow. But start somewhere, start today.
Because the worst database failure is the one you didn’t see coming.