In PostgreSQL, performance problems rarely begin with a catastrophic event. More often, they start with just one slow query — a report, a dashboard filter, a batch job, or even a poorly parameterized ORM query. At first, this delay seems harmless. But as the system becomes busier, that same query begins consuming more CPU, holding locks longer, filling buffers, or causing I/O waits. Eventually, other queries start waiting. Applications slow down. Threads saturate. Connections queue. And the entire system can become unresponsive.
This chain reaction is what we call the snowball effect: a small inefficiency turning into a system-wide degradation.
The purpose of this guide is to help you:
- Detect slow queries early
- Analyze and understand why they happen
- Fix them with thoughtful tuning
- Prevent them from cascading into outages
We do this by combining monitoring, proactive design, and operational discipline.
1. Why Slow Queries Matter More Than You Think
A slow query is not just about latency; it affects shared resources:
- CPU – Excessive computation delays others
- Buffer Cache – Large scans evict hot data (shared_buffers pollution)
- Locks – Long-running transactions block critical updates
- Connections – High concurrency saturates process slots (max_connections)
- WAL Generation – Heavy writes can cause replication lag
The database engine treats queries as part of the same shared ecosystem. This is why one unoptimized query can slow down everything else.
Resource Contention Mechanics
PostgreSQL uses a process-per-connection model. Each backend:
- Allocates memory from work_mem for sorts and hashes
- Competes for shared buffer pool access
- Holds locks on rows, pages, and relation-level objects
- Contributes to WAL write pressure
When one query holds an AccessExclusiveLock on a table (e.g., during ALTER TABLE), all subsequent queries on that table queue behind it—creating a lock pile-up.
The earlier you detect these signals, the easier it is to stop the snowball.
2. Monitoring: Seeing the Problem Before It Explodes
pg_stat_statements – The Most Important Performance Tool
pg_stat_statements collects detailed execution statistics across all statements. It answers:
- Which queries consume the most cumulative time?
- Which queries have inconsistent performance?
- Which queries are called too often?
Rather than chasing slow queries one-by-one, this lets you address high-impact query patterns.
Enable it in postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
After restart, create the extension:
CREATE EXTENSION pg_stat_statements;
Key queries to run:
Top 10 queries by total time:
SELECT
substring(query, 1, 100) AS short_query,
round(total_exec_time::numeric, 2) AS total_time_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_time_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Queries with high variance (unpredictable performance):
SELECT
substring(query, 1, 100) AS short_query,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round((stddev_exec_time / mean_exec_time)::numeric, 2) AS coefficient_of_variation
FROM pg_stat_statements
WHERE calls > 100
AND stddev_exec_time > mean_exec_time * 0.5
ORDER BY stddev_exec_time DESC
LIMIT 10;
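The coefficient of variation from the query above can also be computed offline, for example on durations exported from logs. A minimal Python sketch (the sample durations are made up for illustration):

```python
from statistics import mean, stdev

def coefficient_of_variation(durations_ms):
    """CV = stddev / mean; values above ~0.5 suggest unstable performance."""
    m = mean(durations_ms)
    return stdev(durations_ms) / m if m > 0 else 0.0

# Stable query: similar duration on every call
stable = [10.1, 9.8, 10.3, 10.0, 9.9]
# Unstable query: occasionally spills to disk or waits on locks
unstable = [10.0, 9.5, 250.0, 11.0, 240.0]

print(round(coefficient_of_variation(stable), 2))
print(round(coefficient_of_variation(unstable), 2))
```

A high CV with a reasonable mean often points to intermittent causes (lock waits, cache misses, plan flips) rather than a uniformly bad plan.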
Most frequently called queries:
SELECT
substring(query, 1, 100) AS short_query,
calls,
round(total_exec_time::numeric, 2) AS total_time_ms,
round(mean_exec_time::numeric, 2) AS mean_time_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
pg_stat_activity – The Real-Time View
While pg_stat_statements shows history, pg_stat_activity shows what is happening right now:
- Long-running queries
- Idle-in-transaction sessions
- Queries waiting on locks
- Sessions consuming resources
During incidents, this is your first dashboard.
Find long-running queries:
SELECT
pid,
usename,
application_name,
client_addr,
state,
now() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;
Detect idle-in-transaction sessions:
SELECT
pid,
usename,
state,
now() - state_change AS idle_duration,
query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '10 minutes';
Identify blocking queries:
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement,
blocked_activity.application_name AS blocked_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Check table bloat:
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
n_dead_tup,
n_live_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;
Setting Up Alerts
Configure alerting thresholds:
- Query duration > 30s: Investigate immediately
- Idle-in-transaction > 5 min: Potential connection leak
- Active connections > 80% of max_connections: Connection pool saturation
- Dead tuple ratio > 20%: Autovacuum falling behind
- Lock wait time > 10s: Lock contention issue
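These thresholds can be encoded in a small check function fed by whatever monitoring pipeline you use. A sketch — the metric names and function are illustrative, not a PostgreSQL API:

```python
def check_thresholds(metrics):
    """Return alert messages for metrics exceeding the thresholds above.
    `metrics` is a plain dict produced by your own collection queries."""
    alerts = []
    if metrics.get("max_query_duration_s", 0) > 30:
        alerts.append("query running > 30s: investigate immediately")
    if metrics.get("idle_in_txn_s", 0) > 300:
        alerts.append("idle-in-transaction > 5 min: possible connection leak")
    if metrics.get("active_connections", 0) > 0.8 * metrics.get("max_connections", 100):
        alerts.append("connections > 80% of max_connections: pool saturation")
    if metrics.get("dead_tuple_ratio", 0) > 0.20:
        alerts.append("dead tuple ratio > 20%: autovacuum falling behind")
    if metrics.get("max_lock_wait_s", 0) > 10:
        alerts.append("lock wait > 10s: lock contention")
    return alerts

print(check_thresholds({"max_query_duration_s": 45, "max_connections": 100,
                        "active_connections": 90, "dead_tuple_ratio": 0.05}))
```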
3. Understanding How PostgreSQL Executes Queries
PostgreSQL’s query planner chooses how to execute a query based on statistics. But those statistics can be wrong — especially when tables grow or data distribution shifts.
EXPLAIN ANALYZE reveals:
- Whether the planner chose correct indexes
- Actual rows processed vs expected
- Join algorithms used
- Whether operations spilled to disk due to low memory
This is the heart of query optimization: we compare planner expectations to reality.
Typical signals you’ll notice:
| Symptom | Meaning | Likely Fix |
|---|---|---|
| Sequential scan on large table | Missing index or unselective filter | Create index or rewrite filter |
| Planner expected 10 rows; got 50,000 | Stale statistics | Run ANALYZE |
| Sort spills to disk | work_mem too small | Adjust work_mem (locally) |
| Nested Loop multiplying row count | Wrong join strategy | Use Hash Join or rewrite query |
Understanding this feedback loop is what turns slow query debugging into a method, not guesswork.
PostgreSQL’s query planner uses a cost-based optimizer. It estimates costs using:
cost = (page_cost × pages_read) + (cpu_cost × rows_processed)
Default values:
seq_page_cost = 1.0
random_page_cost = 4.0   (lower to 1.1–1.5 for SSDs)
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025
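A back-of-the-envelope version of this cost model in Python — deliberately simplified, since the real planner also accounts for index descent, correlation, and startup cost:

```python
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Simplified sequential-scan cost, mirroring the formula above."""
    return seq_page_cost * pages + cpu_tuple_cost * rows

def index_scan_cost(pages, rows, random_page_cost=4.0, cpu_tuple_cost=0.01):
    """Very rough index-scan cost: random page reads plus per-row CPU."""
    return random_page_cost * pages + cpu_tuple_cost * rows

# Hypothetical 10,000-page table with 1M rows:
# full scan vs. fetching 1% of rows from ~150 random pages
full = seq_scan_cost(10_000, 1_000_000)
selective = index_scan_cost(150, 10_000)
print(full, selective)  # the index scan wins only while it stays selective
```

This is why random_page_cost matters: on SSDs, lowering it makes index scans look cheaper, so the planner picks them more often.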
EXPLAIN ANALYZE Deep Dive
Run this for any suspicious query:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS)
SELECT ...;
Key metrics to examine:
- Planning vs Execution Time – high planning time (>50ms)? Consider prepared statements
- Rows Estimate vs Actual – large mismatch? Run ANALYZE table_name
- Buffer Statistics – Buffers: shared hit=X read=Y; a high read count means cache misses (increase shared_buffers or optimize the query)
- I/O Timings – enable track_io_timing to see actual disk read time
- Loop Count – a nested loop with 10,000+ loops suggests the wrong join method was chosen
Join Algorithm Selection
Nested Loop Join:
- Best for: Small outer table, indexed inner table
- Cost: O(N × log M)
- When planner chooses it: Expected few rows from outer relation
Hash Join:
- Best for: Large tables, equality conditions
- Cost: O(N + M)
- Requires: Enough work_mem to build the hash table
Merge Join:
- Best for: Pre-sorted data or useful indexes
- Cost: O(N + M)
- Requires: Sorted inputs on join key
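To make the complexity difference concrete, here is a toy hash join in Python. It is illustrative only — PostgreSQL's implementation additionally handles multiple batches, spilling to disk, and NULL semantics:

```python
from collections import defaultdict

def hash_join(outer_rows, inner_rows, outer_key, inner_key):
    """Build a hash table on the inner input, then probe it once per
    outer row: O(N + M) total, versus O(N × M) for an unindexed
    nested loop."""
    table = defaultdict(list)
    for row in inner_rows:                  # build phase (needs work_mem in PG)
        table[row[inner_key]].append(row)
    joined = []
    for row in outer_rows:                  # probe phase
        for match in table.get(row[outer_key], []):
            joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 30}]
users = [{"user_id": 1, "name": "ada"}, {"user_id": 3, "name": "bob"}]
print(hash_join(orders, users, "user_id", "user_id"))
```

The build phase is why work_mem matters: if the hash table does not fit, PostgreSQL splits the join into batches written to disk, and the O(N + M) advantage shrinks.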
Force join method (for testing):
SET enable_nestloop = off;
SET enable_hashjoin = off;
SET enable_mergejoin = off;
Common EXPLAIN Output Patterns
Problem: Sequential Scan on Large Table
Seq Scan on orders (cost=0.00..543521.00 rows=5000000 width=100)
Filter: (status = 'pending')
Rows Removed by Filter: 4950000
Fix: Create index on status if selective enough
Problem: Sort Spilling to Disk
Sort Method: external merge Disk: 245632kB
Fix: Increase work_mem or reduce result set
Problem: Wrong Row Estimate
Hash Join (cost=... rows=100 width=...)
Actual rows=125000 loops=1
Fix: Run ANALYZE, increase default_statistics_target
Problem: Bitmap Heap Scan with High Recheck
Bitmap Heap Scan on products
Recheck Cond: (price > 100 AND price < 200)
Rows Removed by Recheck: 45000
Heap Blocks: exact=1234 lossy=8900
Fix: Reduce lossy blocks with more work_mem or better index
4. Reducing Query Cost Through Schema and Query Design
Indexes Done Right
Indexes improve read performance by enabling selective access. But:
- Too many indexes slow writes.
- Wrong index order reduces usefulness.
- Expression and partial indexes can target specific workloads efficiently.
Good indexing is about precision, not quantity.
Index Types:
B-tree (default): General purpose, supports <, <=, =, >=, >, BETWEEN, IN
CREATE INDEX idx_orders_created ON orders(created_at);
Hash: Equality (=) only; can be smaller and faster than a B-tree for long keys
CREATE INDEX idx_users_email ON users USING hash(email);
GIN (Generalized Inverted Index): Full-text search, JSONB, arrays
CREATE INDEX idx_products_tags ON products USING gin(tags);
CREATE INDEX idx_documents_content ON documents USING gin(to_tsvector('english', content));
GiST (Generalized Search Tree): Geometric data, range types, full-text
CREATE INDEX idx_locations_geom ON locations USING gist(geom);
BRIN (Block Range Index): Time-series data, naturally ordered large tables
CREATE INDEX idx_events_timestamp ON events USING brin(timestamp);
Composite Indexes: Order matters! Put equality-filtered columns before range-filtered ones; only queries that filter on the leading column(s) can use the index efficiently.
-- Good for: WHERE status = 'active' AND created_at > '...'
CREATE INDEX idx_orders_status_created ON orders(status, created_at);
-- Not optimal for queries filtering only on created_at
Partial Indexes: Index subset of rows only:
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
Expression Indexes: Index computed values:
CREATE INDEX idx_users_lower_email ON users(lower(email));
-- Query must use same expression
SELECT * FROM users WHERE lower(email) = 'user@example.com';
Covering Indexes (INCLUDE): Avoid heap lookups:
CREATE INDEX idx_orders_user_id_covering ON orders(user_id)
INCLUDE (total_amount, created_at);
-- Can satisfy query without touching table
SELECT total_amount, created_at FROM orders WHERE user_id = 123;
Materialized Views for Heavy Aggregations
Expensive read queries (dashboards, analytics, daily summaries) can be precomputed.
This trades storage for speed and is often the simplest performance multiplier.
Create materialized view:
CREATE MATERIALIZED VIEW sales_daily_summary AS
SELECT
date_trunc('day', created_at) AS day,
product_id,
count(*) AS order_count,
sum(amount) AS total_amount
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2;
CREATE UNIQUE INDEX ON sales_daily_summary(day, product_id);
Refresh strategies:
Manual refresh:
REFRESH MATERIALIZED VIEW sales_daily_summary;
Concurrent refresh (no read lock):
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily_summary;
-- Requires unique index
Scheduled refresh via cron/pg_cron:
SELECT cron.schedule('refresh-summary', '0 1 * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily_summary');
Partitioning for Large Tables
Instead of one massive table, split by time or category.
This avoids scanning irrelevant data and makes queries predictable.
Range Partitioning (most common):
CREATE TABLE orders (
id bigint,
user_id bigint,
created_at timestamp,
amount numeric
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2024_q1 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
CREATE TABLE orders_2024_q2 PARTITION OF orders
FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');
-- Automatic pruning: queries with created_at filters only scan relevant partitions
List Partitioning:
CREATE TABLE events (
id bigint,
event_type text,
data jsonb
) PARTITION BY LIST (event_type);
CREATE TABLE events_login PARTITION OF events
FOR VALUES IN ('login', 'logout');
CREATE TABLE events_purchase PARTITION OF events
FOR VALUES IN ('purchase', 'refund');
Hash Partitioning:
CREATE TABLE users (
id bigint,
email text
) PARTITION BY HASH (id);
CREATE TABLE users_p0 PARTITION OF users
FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE users_p1 PARTITION OF users
FOR VALUES WITH (modulus 4, remainder 1);
-- ... p2, p3
Partition management automation:
-- Create next month partition
CREATE TABLE orders_2025_01 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
-- Detach old partitions
ALTER TABLE orders DETACH PARTITION orders_2023_q1;
-- Drop or archive
DROP TABLE orders_2023_q1;
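Generating these monthly DDL statements ahead of time is easy to script (and schedule via cron or pg_cron). A minimal sketch — the orders_YYYY_MM naming simply mirrors the examples above:

```python
from datetime import date

def monthly_partition_ddl(parent, year, month):
    """Generate CREATE TABLE ... PARTITION OF DDL for one month.
    Range partitions are half-open: FROM is inclusive, TO is exclusive."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{parent}_{start:%Y_%m}"
    return (f"CREATE TABLE {name} PARTITION OF {parent}\n"
            f"FOR VALUES FROM ('{start}') TO ('{end}');")

print(monthly_partition_ddl("orders", 2025, 1))
```

For production use, tools like pg_partman automate the same pattern, including retention and detach/drop of old partitions.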
5. Preventing the Snowball Effect at Runtime
Because of its process-per-connection model, PostgreSQL handles very high client concurrency poorly.
Use PgBouncer to limit active connections and queue excess load before it hits the database.
Connection Pooling with PgBouncer
Install and configure /etc/pgbouncer/pgbouncer.ini:
[databases]
mydb = host=localhost port=5432 dbname=mydb
[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Pool mode
pool_mode = transaction        # or session, statement

# Connection limits
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

# Timeouts
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 0
query_wait_timeout = 120

# Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
Pool sizing formula:
pool_size = ((core_count × 2) + effective_spindle_count)
For 4 cores + SSD: (4 × 2) + 1 = 9
Add 20% buffer: ~11-12 connections per pool
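The same heuristic as a tiny helper function — the 20% headroom figure is the rule of thumb from above, not a hard rule:

```python
def recommended_pool_size(core_count, spindle_count=1, buffer_pct=0.20):
    """Pool sizing heuristic: (cores × 2) + effective spindles,
    plus a headroom buffer. Returns (base, with_buffer)."""
    base = core_count * 2 + spindle_count
    return base, round(base * (1 + buffer_pct))

print(recommended_pool_size(4))  # 4 cores + 1 SSD "spindle"
```

Treat the result as a starting point: measure actual saturation (active backends, wait events) and adjust.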
Pool modes:
- Session: Connection held until client disconnects (most compatible)
- Transaction: Connection returned after transaction completes (recommended)
- Statement: Connection returned after each statement (breaks transactions)
Statement Timeouts
Global settings in postgresql.conf:
# Abort any statement taking more than 30s
statement_timeout = 30000 # milliseconds
# Close idle-in-transaction sessions after 5 min
idle_in_transaction_session_timeout = 300000
# Abort if lock not acquired in 10s
lock_timeout = 10000
# Close idle connections after 10 min
idle_session_timeout = 600000 # PostgreSQL 14+
Per-role timeouts:
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE reporting_user SET statement_timeout = '5min';
Per-session temporary override:
SET LOCAL statement_timeout = '60s';
-- Applies only to current transaction
Autovacuum Tuning
Key parameters in postgresql.conf:
# Enable autovacuum
autovacuum = on
# Autovacuum workers (1 per core up to 10)
autovacuum_max_workers = 4
# Aggressiveness
autovacuum_naptime = 10s # Check interval (default 1min)
autovacuum_vacuum_threshold = 50 # Base threshold
autovacuum_vacuum_scale_factor = 0.1 # 10% of table size
# When to trigger vacuum
# threshold + (scale_factor × table_size)
# For 1M row table: 50 + (0.1 × 1000000) = 100050 dead tuples
# Analyze settings
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.05
# Resource limits
autovacuum_vacuum_cost_delay = 2ms # Throttle I/O (lower = faster)
autovacuum_vacuum_cost_limit = 200 # Work units before delay
# Prevent wraparound
autovacuum_freeze_max_age = 200000000
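The trigger formula in the comments above, as a small calculator — useful when deciding per-table overrides:

```python
def autovacuum_trigger(live_tuples, threshold=50, scale_factor=0.1):
    """Dead tuples needed before autovacuum kicks in:
    threshold + scale_factor × table size (in live tuples)."""
    return threshold + int(scale_factor * live_tuples)

print(autovacuum_trigger(1_000_000))                      # default settings
print(autovacuum_trigger(1_000_000, scale_factor=0.05))   # high-churn override
```

Note how the default scale factor makes large tables wait for an enormous number of dead tuples — the main reason big, busy tables usually need a lower per-table scale_factor.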
Per-table vacuum tuning:
-- High-churn table: vacuum more frequently
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_threshold = 100
);
-- Large append-only table: vacuum less often
ALTER TABLE logs SET (
autovacuum_vacuum_scale_factor = 0.2,
autovacuum_vacuum_cost_delay = 0
);
Monitor autovacuum effectiveness:
SELECT
schemaname,
relname,
last_vacuum,
last_autovacuum,
vacuum_count,
autovacuum_count,
n_dead_tup,
n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
Query Result Caching
Application-level caching with TTL:
import json
import redis  # assumes a running Redis instance

cache = redis.Redis()

def get_user_stats(user_id):
    cache_key = f"user_stats:{user_id}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    # `db` is your application's database handle (placeholder)
    result = db.execute("SELECT ... FROM users WHERE id = %s", (user_id,))
    cache.setex(cache_key, 300, json.dumps(result))  # 5 min TTL
    return result
PostgreSQL prepared statements (plan caching):
PREPARE get_user_orders (int) AS
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 10;
EXECUTE get_user_orders(123);
6. When Performance Issues Happen: A Practical Workflow
| Phase | Time | Goal |
|---|---|---|
| Detection | 0–2 min | Confirm slowdown is real |
| Diagnosis | 2–10 min | Identify the slow or blocking query |
| Mitigation | 10–15 min | Cancel / terminate offenders to restore stability |
| Root Cause | 15–60 min | Analyze EXPLAIN / statistics to understand why |
| Permanent Fix | Hours–Days | Rewrite, index, refactor, or redesign |
Stabilize first. Optimize second. Refactor last.
Phase 1: Detection (0–2 min)
Check if database is actually the problem:
# CPU usage
top -p $(pgrep -d',' postgres)
# Active connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
# Lock waits
psql -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event IS NOT NULL;"
Phase 2: Diagnosis (2–10 min)
Identify the culprit query:
-- Longest running queries
SELECT pid, usename, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 5;
-- Queries causing blocking
-- (use blocking query from section 2)
Capture the query plan with auto_explain:
-- Core PostgreSQL cannot dump the plan of an already-running query.
-- Load auto_explain (session-level here; superuser required), then
-- re-run the suspect statement so its plan is logged:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
-- Re-run the query, then check the PostgreSQL logs for the EXPLAIN output
Phase 3: Mitigation (10–15 min)
Cancel gracefully (preferred):
SELECT pg_cancel_backend(PID);
-- Sends SIGINT, allows cleanup
Terminate forcefully (if cancel fails):
SELECT pg_terminate_backend(PID);
-- Sends SIGTERM, immediate abort
Cancel all queries from a specific user:
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE usename = 'reporting_user'
AND state = 'active';
Kill all idle-in-transaction:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND state_change < now() - interval '10 minutes';
Phase 4: Root Cause Analysis (15–60 min)
Get historical query performance:
SELECT
calls,
total_exec_time,
mean_exec_time,
stddev_exec_time,
query
FROM pg_stat_statements
WHERE query LIKE '%specific_table%'
ORDER BY total_exec_time DESC;
Check if statistics are stale:
SELECT
schemaname,
tablename,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
WHERE tablename = 'problematic_table';
-- Force statistics update
ANALYZE problematic_table;
Review table and index sizes:
SELECT
tablename,
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS total_size,
pg_size_pretty(pg_indexes_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) AS indexes_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC;
Phase 5: Permanent Fix (Hours–Days)
Missing index:
CREATE INDEX CONCURRENTLY idx_name ON table(column);
-- CONCURRENTLY prevents blocking writes
Query rewrite:
-- Before: OR conditions often defeat a simple index scan (a BitmapOr may still apply)
SELECT * FROM orders WHERE user_id = 123 OR user_id = 456;
-- After: Use IN or UNION
SELECT * FROM orders WHERE user_id IN (123, 456);
Materialized view:
CREATE MATERIALIZED VIEW expensive_report AS
SELECT ... FROM ... GROUP BY ...;
CREATE UNIQUE INDEX ON expensive_report(...);
Partition large table:
-- Convert existing table to partitioned
-- (requires data migration or pg_partman)
7. Sustaining Performance: Operational Discipline
Slow query management is not a one-time project, but a routine.
Weekly:
- Review pg_stat_statements
- Identify new expensive query shapes
Monthly:
- Check autovacuum effectiveness
- Reevaluate indexing patterns
Quarterly:
- Reassess capacity needs
- Review schema growth and future workload expectations
Every improvement compounds over time — just like every neglected inefficiency does.
Weekly Checklist
Review pg_stat_statements:
-- Save baseline for comparison
CREATE TABLE query_stats_baseline AS
SELECT now() AS snapshot_time, *
FROM pg_stat_statements;
-- Compare week-over-week
SELECT
current.query,
current.calls - COALESCE(baseline.calls, 0) AS calls_delta,
current.total_exec_time - COALESCE(baseline.total_exec_time, 0) AS time_delta
FROM pg_stat_statements current
LEFT JOIN query_stats_baseline baseline USING (queryid)
WHERE current.calls - COALESCE(baseline.calls, 0) > 1000
ORDER BY time_delta DESC
LIMIT 20;
Check for new slow queries:
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- 1 second
AND calls > 10
ORDER BY mean_exec_time DESC;
Review connection pool metrics:
psql -h pgbouncer_host -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
psql -h pgbouncer_host -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"
Monthly Checklist
Autovacuum effectiveness:
SELECT
schemaname || '.' || relname AS table_name,
n_dead_tup,
n_live_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 1) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY dead_pct DESC NULLS LAST
LIMIT 30;
Index usage analysis:
SELECT
schemaname,
relname,
indexrelname,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey'
AND pg_relation_size(indexrelid) > 1048576 -- > 1MB
ORDER BY pg_relation_size(indexrelid) DESC;
-- Drop unused indexes
-- DROP INDEX CONCURRENTLY unused_index_name;
Duplicate indexes:
SELECT
pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS size,
(array_agg(idx))[1] AS idx1,
(array_agg(idx))[2] AS idx2,
(array_agg(idx))[3] AS idx3,
(array_agg(idx))[4] AS idx4
FROM (
SELECT
indexrelid::regclass AS idx,
indrelid::regclass AS tbl,
(indrelid::text || E'\n' || indclass::text || E'\n' ||
indkey::text || E'\n' || COALESCE(indexprs::text, '') || E'\n' ||
COALESCE(indpred::text, '')) AS key
FROM pg_index
) sub
GROUP BY key, tbl
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;
Quarterly Checklist
Capacity planning:
-- Database growth rate
SELECT
pg_size_pretty(pg_database_size(current_database())) AS current_size,
-- Compare with previous quarter backup
'TODO: Calculate growth rate' AS growth_rate;
Review configuration:
-- Check if settings match hardware
SHOW shared_buffers; -- Should be 25% of RAM
SHOW effective_cache_size; -- Should be 50-75% of RAM
SHOW work_mem; -- Depends on concurrent queries
SHOW maintenance_work_mem; -- For VACUUM, CREATE INDEX
Partition maintenance:
-- Auto-create future partitions
-- Auto-archive/drop old partitions
-- Update partition retention policy
Automated Monitoring Setup
Example monitoring queries in Prometheus/Grafana:
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Long-running queries
SELECT count(*) FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '30 seconds';
-- Replication lag (for replicas)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));
-- Cache hit ratio (should be > 95%)
SELECT
round(100.0 * sum(blks_hit) / NULLIF(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database;
-- Transaction counters (cumulative — compute the per-second rate from
-- the delta between two samples)
SELECT
sum(xact_commit + xact_rollback) AS total_transactions
FROM pg_stat_database;
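The cache-hit-ratio formula can be sanity-checked offline — a sketch; in production the inputs come from blks_hit and blks_read in pg_stat_database:

```python
def cache_hit_ratio(blks_hit, blks_read):
    """Share of block requests served from shared_buffers, as a
    percentage; returns None when there is no traffic yet."""
    total = blks_hit + blks_read
    return round(100.0 * blks_hit / total, 2) if total else None

print(cache_hit_ratio(990_000, 10_000))  # healthy: above the 95% guideline
print(cache_hit_ratio(0, 0))             # no traffic yet
```

Remember these counters are also cumulative, so compare deltas between samples rather than lifetime totals when alerting.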
Conclusion
Slow queries are inevitable — but system-wide slowdowns are not.
By implementing comprehensive monitoring with pg_stat_statements and pg_stat_activity, deeply understanding query execution through EXPLAIN ANALYZE, applying strategic indexing and partitioning, and enforcing operational safeguards like connection pooling, timeouts, and autovacuum tuning, you can keep PostgreSQL responsive and resilient even as your data and workloads grow.
- Monitor proactively – Use pg_stat_statements to catch problems before users do
- Understand the planner – EXPLAIN ANALYZE is your diagnostic tool
- Index strategically – Right type, right columns, right order
- Partition wisely – For large tables with natural boundaries
- Pool connections – PostgreSQL wasn’t built for thousands of direct connections
- Set timeouts – Circuit breakers prevent cascading failures
- Maintain regularly – Autovacuum, analyze, and index maintenance are not optional
- Test in production-like environments – Statistics and data distribution matter
Performance is a continuous practice, not a one-time fix. Handled correctly, the database becomes not a bottleneck, but a competitive advantage.