Managing Slow Queries in PostgreSQL and Preventing the Snowball Effect

In PostgreSQL, performance problems rarely begin with a catastrophic event. More often, they start with just one slow query — a report, a dashboard filter, a batch job, or even a poorly parameterized ORM query. At first, this delay seems harmless. But as the system becomes busier, that same query begins consuming more CPU, holding locks longer, filling buffers, or causing I/O waits. Eventually, other queries start waiting. Applications slow down. Threads saturate. Connections queue. And the entire system can become unresponsive.

This chain reaction is what we call the snowball effect: a small inefficiency turning into a system-wide degradation.

The purpose of this guide is to help you:

  • Detect slow queries early
  • Analyze and understand why they happen
  • Fix them with thoughtful tuning
  • Prevent them from cascading into outages

We do this by combining monitoring, proactive design, and operational discipline.


1. Why Slow Queries Matter More Than You Think

A slow query is not just about latency; it affects shared resources:

  • CPU – Excessive computation delays others
  • Buffer Cache – Large scans evict hot data (shared_buffers pollution)
  • Locks – Long-running transactions block critical updates
  • Connections – High concurrency saturates process slots (max_connections)
  • WAL Generation – Heavy writes can cause replication lag

The database engine treats queries as part of the same shared ecosystem. This is why one unoptimized query can slow down everything else.

Resource Contention Mechanics

PostgreSQL uses a process-per-connection model. Each backend:

  • Allocates memory from work_mem for sorts/hashes
  • Competes for shared buffer pool access
  • Holds locks on rows, pages, and relation-level objects
  • Contributes to WAL write pressure

When one query holds an AccessExclusiveLock on a table (e.g., during ALTER TABLE), all subsequent queries on that table queue behind it—creating a lock pile-up.
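
One defensive habit, sketched below, is to run such DDL with a short lock_timeout so it fails fast instead of queuing every other query behind it (the added column is purely illustrative):

BEGIN;
SET LOCAL lock_timeout = '2s';  -- give up quickly instead of blocking the whole table
ALTER TABLE orders ADD COLUMN discount numeric;  -- illustrative DDL
COMMIT;
-- If the ALTER cannot acquire its lock within 2s it errors out; retry in a quieter window.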

The earlier you detect these signals, the easier it is to stop the snowball from forming.


2. Monitoring: Seeing the Problem Before It Explodes

pg_stat_statements – The Most Important Performance Tool

pg_stat_statements collects detailed execution statistics across all statements. It answers:

  • Which queries consume the most cumulative time?
  • Which queries have inconsistent performance?
  • Which queries are called too often?

Rather than chasing slow queries one-by-one, this lets you address high-impact query patterns.

Enable it in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000

After restart, create the extension:

CREATE EXTENSION pg_stat_statements;

Key queries to run:

Top 10 queries by total time:

SELECT 
  substring(query, 1, 100) AS short_query,
  round(total_exec_time::numeric, 2) AS total_time_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_time_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Queries with high variance (unpredictable performance):

SELECT 
  substring(query, 1, 100) AS short_query,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  round((stddev_exec_time / mean_exec_time)::numeric, 2) AS coefficient_of_variation
FROM pg_stat_statements
WHERE calls > 100 
  AND stddev_exec_time > mean_exec_time * 0.5
ORDER BY stddev_exec_time DESC
LIMIT 10;

Most frequently called queries:

SELECT 
  substring(query, 1, 100) AS short_query,
  calls,
  round(total_exec_time::numeric, 2) AS total_time_ms,
  round(mean_exec_time::numeric, 2) AS mean_time_ms
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;

pg_stat_activity – The Real-Time View

While pg_stat_statements shows history, pg_stat_activity shows what is happening right now:

  • Long-running queries
  • Idle-in-transaction sessions
  • Queries waiting on locks
  • Sessions consuming resources

During incidents, this is your first dashboard.


Find long-running queries:

SELECT 
  pid,
  usename,
  application_name,
  client_addr,
  state,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 minutes'
ORDER BY duration DESC;

Detect idle-in-transaction sessions:

SELECT 
  pid,
  usename,
  state,
  now() - state_change AS idle_duration,
  query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '10 minutes';

Identify blocking queries:

SELECT 
  blocked_locks.pid AS blocked_pid,
  blocked_activity.usename AS blocked_user,
  blocking_locks.pid AS blocking_pid,
  blocking_activity.usename AS blocking_user,
  blocked_activity.query AS blocked_statement,
  blocking_activity.query AS blocking_statement,
  blocked_activity.application_name AS blocked_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

Check table bloat:

SELECT 
  schemaname,
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  n_dead_tup,
  n_live_tup,
  round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

Setting Up Alerts

Configure alerting thresholds:

  • Query duration > 30s: Investigate immediately
  • Idle-in-transaction > 5 min: Potential connection leak
  • Active connections > 80% of max_connections: Connection pool saturation
  • Dead tuple ratio > 20%: Autovacuum falling behind
  • Lock wait time > 10s: Lock contention issue
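
As one example, the connection-saturation threshold above can be checked with a query like the following (a sketch; wire it into whatever alerting system you use):

SELECT 
  count(*) AS current_connections,
  current_setting('max_connections')::int AS max_connections,
  round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';  -- exclude background workers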

3. Understanding How PostgreSQL Executes Queries

PostgreSQL’s query planner chooses how to execute a query based on statistics. But those statistics can be wrong — especially when tables grow or data distribution shifts.

EXPLAIN ANALYZE reveals:

  • Whether the planner chose correct indexes
  • Actual rows processed vs expected
  • Join algorithms used
  • Whether operations spilled to disk due to low memory

This is the heart of query optimization: we compare planner expectations to reality.

Typical signals you’ll notice:

  • Sequential scan on a large table – missing index or unselective filter; fix: create an index or rewrite the filter
  • Planner expected 10 rows but got 50,000 – stale statistics; fix: run ANALYZE
  • Sort spills to disk – work_mem too small; fix: raise work_mem (locally)
  • Nested Loop multiplying row counts – wrong join strategy; fix: use a Hash Join or rewrite the query

Understanding this feedback loop is what turns slow query debugging into a method, not guesswork.

PostgreSQL’s query planner uses a cost-based optimizer. It estimates costs using:

cost = (page_cost × pages_read) + (cpu_cost × rows_processed)

Default values:

  • seq_page_cost = 1.0
  • random_page_cost = 4.0 (lower to 1.1-1.5 for SSD)
  • cpu_tuple_cost = 0.01
  • cpu_operator_cost = 0.0025
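
If your data lives on SSDs, the random_page_cost adjustment above can be applied cluster-wide; a minimal sketch:

-- Lower random_page_cost for SSD storage (assumes an SSD-backed data directory)
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();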

EXPLAIN ANALYZE Deep Dive

Run this for any suspicious query:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, COSTS) 
SELECT ...;

Key metrics to examine:

  1. Planning vs Execution Time
    • High planning time (>50ms)? Consider prepared statements
  2. Rows Estimate vs Actual
    • Large mismatch? Run ANALYZE table_name
  3. Buffer Statistics
    • Buffers: shared hit=X read=Y
    • High read means cache misses (increase shared_buffers or optimize query)
  4. I/O Timings (enable track_io_timing)
    • Shows actual disk read time
  5. Loop Count
    • Nested loop with 10,000+ loops? Wrong join method chosen
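
For the I/O timings in item 4, track_io_timing must be switched on; a small sketch (it adds slight timing overhead on some platforms):

ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
-- Subsequent EXPLAIN (ANALYZE, BUFFERS) output will include I/O timing lines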

Join Algorithm Selection

Nested Loop Join:

  • Best for: Small outer table, indexed inner table
  • Cost: O(N × log M)
  • When planner chooses it: Expected few rows from outer relation

Hash Join:

  • Best for: Large tables, equality conditions
  • Cost: O(N + M)
  • Requires: Enough work_mem to build hash table

Merge Join:

  • Best for: Pre-sorted data or useful indexes
  • Cost: O(N + M)
  • Requires: Sorted inputs on join key

Force a different join method for testing (disable only the strategy you want the planner to avoid, and RESET it afterwards):

SET enable_nestloop = off;   -- discourage nested loop joins
SET enable_hashjoin = off;   -- discourage hash joins
SET enable_mergejoin = off;  -- discourage merge joins
-- RESET enable_nestloop;    -- restore the default when done testing

Common EXPLAIN Output Patterns

Problem: Sequential Scan on Large Table

Seq Scan on orders (cost=0.00..543521.00 rows=5000000 width=100)
  Filter: (status = 'pending')
  Rows Removed by Filter: 4950000

Fix: Create index on status if selective enough

Problem: Sort Spilling to Disk

Sort Method: external merge  Disk: 245632kB

Fix: Increase work_mem or reduce result set
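
Rather than raising work_mem globally, it can be increased just for the offending statement; a sketch:

BEGIN;
SET LOCAL work_mem = '256MB';  -- applies only to this transaction
-- run the sorting/aggregating query here
COMMIT;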

Problem: Wrong Row Estimate

Hash Join (cost=... rows=100 width=...)
  Actual rows=125000 loops=1

Fix: Run ANALYZE, increase default_statistics_target

Problem: Bitmap Heap Scan with High Recheck

Bitmap Heap Scan on products
  Recheck Cond: (price > 100 AND price < 200)
  Rows Removed by Recheck: 45000
  Heap Blocks: exact=1234 lossy=8900

Fix: Reduce lossy blocks with more work_mem or better index


4. Reducing Query Cost Through Schema and Query Design

Indexes Done Right

Indexes improve read performance by enabling selective access. But:

  • Too many indexes slow writes.
  • Wrong index order reduces usefulness.
  • Expression and partial indexes can target specific workloads efficiently.

Good indexing is about precision, not quantity.

Index Types:

B-tree (default): General purpose, supports <, <=, =, >=, >, BETWEEN, IN

CREATE INDEX idx_orders_created ON orders(created_at);

Hash: Only equality (=); can be smaller than a B-tree when the indexed values are long

CREATE INDEX idx_users_email ON users USING hash(email);

GIN (Generalized Inverted Index): Full-text search, JSONB, arrays

CREATE INDEX idx_products_tags ON products USING gin(tags);
CREATE INDEX idx_documents_content ON documents USING gin(to_tsvector('english', content));

GiST (Generalized Search Tree): Geometric data, range types, full-text

CREATE INDEX idx_locations_geom ON locations USING gist(geom);

BRIN (Block Range Index): Time-series data, naturally ordered large tables

CREATE INDEX idx_events_timestamp ON events USING brin(timestamp);

Composite Indexes: Column order matters. Put columns compared with equality first, then range columns; "most selective first" is only a rough rule of thumb.

-- Good for: WHERE status = 'active' AND created_at > '...'
CREATE INDEX idx_orders_status_created ON orders(status, created_at);

-- Not optimal for queries filtering only on created_at

Partial Indexes: Index subset of rows only:

CREATE INDEX idx_orders_pending ON orders(created_at) 
WHERE status = 'pending';

Expression Indexes: Index computed values:

CREATE INDEX idx_users_lower_email ON users(lower(email));

-- Query must use same expression
SELECT * FROM users WHERE lower(email) = 'user@example.com';

Covering Indexes (INCLUDE): Avoid heap lookups:

CREATE INDEX idx_orders_user_id_covering ON orders(user_id) 
INCLUDE (total_amount, created_at);

-- Can satisfy query without touching table
SELECT total_amount, created_at FROM orders WHERE user_id = 123;

Materialized Views for Heavy Aggregations

Expensive read queries (dashboards, analytics, daily summaries) can be precomputed.
This trades storage for speed and is often the simplest performance multiplier.

Create materialized view:

CREATE MATERIALIZED VIEW sales_daily_summary AS
SELECT 
  date_trunc('day', created_at) AS day,
  product_id,
  count(*) AS order_count,
  sum(amount) AS total_amount
FROM orders
WHERE status = 'completed'
GROUP BY 1, 2;

CREATE UNIQUE INDEX ON sales_daily_summary(day, product_id);

Refresh strategies:

Manual refresh:

REFRESH MATERIALIZED VIEW sales_daily_summary;

Concurrent refresh (no read lock):

REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily_summary;
-- Requires unique index

Scheduled refresh via cron/pg_cron:

SELECT cron.schedule('refresh-summary', '0 1 * * *', 
  'REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily_summary');

Partitioning for Large Tables

Instead of one massive table, split by time or category.
This avoids scanning irrelevant data and makes queries predictable.

Range Partitioning (most common):

CREATE TABLE orders (
  id bigint,
  user_id bigint,
  created_at timestamp,
  amount numeric
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2024_q1 PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE TABLE orders_2024_q2 PARTITION OF orders
  FOR VALUES FROM ('2024-04-01') TO ('2024-07-01');

-- Automatic pruning: queries with created_at filters only scan relevant partitions

List Partitioning:

CREATE TABLE events (
  id bigint,
  event_type text,
  data jsonb
) PARTITION BY LIST (event_type);

CREATE TABLE events_login PARTITION OF events
  FOR VALUES IN ('login', 'logout');

CREATE TABLE events_purchase PARTITION OF events
  FOR VALUES IN ('purchase', 'refund');

Hash Partitioning:

CREATE TABLE users (
  id bigint,
  email text
) PARTITION BY HASH (id);

CREATE TABLE users_p0 PARTITION OF users
  FOR VALUES WITH (modulus 4, remainder 0);

CREATE TABLE users_p1 PARTITION OF users
  FOR VALUES WITH (modulus 4, remainder 1);
-- ... p2, p3

Partition management automation:

-- Create next month partition
CREATE TABLE orders_2025_01 PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Detach old partitions
ALTER TABLE orders DETACH PARTITION orders_2023_q1;

-- Drop or archive
DROP TABLE orders_2023_q1;

5. Preventing the Snowball Effect at Runtime

Because each connection is a separate backend process, PostgreSQL handles very high client concurrency poorly.
Use PgBouncer to cap the number of active connections and queue excess load before it reaches the database.

Connection Pooling with PgBouncer

Install and configure /etc/pgbouncer/pgbouncer.ini:

[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]

listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Pool mode
pool_mode = transaction  # or session, statement

# Connection limits
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

# Timeouts
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 0
query_wait_timeout = 120

# Logging
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1

Pool sizing formula:

pool_size = ((core_count × 2) + effective_spindle_count)

For 4 cores + SSD: (4 × 2) + 1 = 9
Add 20% buffer: ~11-12 connections per pool

Pool modes:

  • Session: Connection held until client disconnects (most compatible)
  • Transaction: Connection returned after each transaction completes (recommended for most applications; session-level features such as LISTEN/NOTIFY and session-scoped prepared statements do not work reliably in this mode)
  • Statement: Connection returned after each statement (breaks transactions)

Statement Timeouts

Global settings in postgresql.conf:

# Abort any statement taking more than 30s
statement_timeout = 30000  # milliseconds

# Close idle-in-transaction sessions after 5 min
idle_in_transaction_session_timeout = 300000

# Abort if lock not acquired in 10s
lock_timeout = 10000

# Close idle connections after 10 min
idle_session_timeout = 600000  # PostgreSQL 14+

Per-role timeouts:

ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE reporting_user SET statement_timeout = '5min';

Per-session temporary override:

SET LOCAL statement_timeout = '60s';
-- Applies only to current transaction

Autovacuum Tuning

Key parameters in postgresql.conf:

# Enable autovacuum
autovacuum = on

# Autovacuum workers (1 per core up to 10)
autovacuum_max_workers = 4

# Aggressiveness
autovacuum_naptime = 10s  # Check interval (default 1min)
autovacuum_vacuum_threshold = 50  # Base threshold
autovacuum_vacuum_scale_factor = 0.1  # 10% of table size

# When to trigger vacuum
# threshold + (scale_factor × table_size)
# For 1M row table: 50 + (0.1 × 1000000) = 100050 dead tuples

# Analyze settings
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.05

# Resource limits
autovacuum_vacuum_cost_delay = 2ms  # Throttle I/O (lower = faster)
autovacuum_vacuum_cost_limit = 200  # Work units before delay

# Prevent wraparound
autovacuum_freeze_max_age = 200000000

Per-table vacuum tuning:

-- High-churn table: vacuum more frequently
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,
  autovacuum_vacuum_threshold = 100
);

-- Large append-only table: vacuum less often
ALTER TABLE logs SET (
  autovacuum_vacuum_scale_factor = 0.2,
  autovacuum_vacuum_cost_delay = 0
);

Monitor autovacuum effectiveness:

SELECT 
  schemaname,
  relname,
  last_vacuum,
  last_autovacuum,
  vacuum_count,
  autovacuum_count,
  n_dead_tup,
  n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Query Result Caching

Application-level caching with TTL:

import json

import redis

cache = redis.Redis()

def get_user_stats(user_id):
    """Return cached stats if present; otherwise query the database and cache for 5 minutes."""
    cache_key = f"user_stats:{user_id}"
    cached = cache.get(cache_key)

    if cached:
        return json.loads(cached)

    # db is the application's database access layer (returns rows for the query)
    result = db.execute("SELECT ... FROM users WHERE id = %s", (user_id,))
    cache.setex(cache_key, 300, json.dumps(result))  # 5 min TTL
    return result

PostgreSQL prepared statements (plan caching):

PREPARE get_user_orders (int) AS
  SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 10;

EXECUTE get_user_orders(123);

6. When Performance Issues Happen: A Practical Workflow

The workflow, phase by phase (phase – time window – goal):

  • Detection – 0–2 min – Confirm slowdown is real
  • Diagnosis – 2–10 min – Identify the slow or blocking query
  • Mitigation – 10–15 min – Cancel / terminate offenders to restore stability
  • Root Cause – 15–60 min – Analyze EXPLAIN / statistics to understand why
  • Permanent Fix – Hours–Days – Rewrite, index, refactor, or redesign

Stabilize first. Optimize second. Refactor last.

Phase 1: Detection (0–2 min)

Check if database is actually the problem:

# CPU usage
top -p $(pgrep -d',' postgres)

# Active connections
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# Lock waits
psql -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event IS NOT NULL;"

Phase 2: Diagnosis (2–10 min)

Identify the culprit query:

-- Longest running queries
SELECT pid, usename, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 5;

-- Queries causing blocking
-- (use blocking query from section 2)

Get the query plan (PostgreSQL cannot show the plan of a query that is already running; reproduce it in your own session, or log the plans of future executions with auto_explain):

-- In your own session, load the auto_explain module (ships with PostgreSQL)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;

-- Re-run the suspect query, then check the PostgreSQL logs for the EXPLAIN output

Phase 3: Mitigation (10–15 min)

Cancel gracefully (preferred):

SELECT pg_cancel_backend(PID);
-- Sends SIGINT, allows cleanup

Terminate forcefully (if cancel fails):

SELECT pg_terminate_backend(PID);
-- Sends SIGTERM, immediate abort

Cancel all queries from a specific user:

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE usename = 'reporting_user' 
  AND state = 'active';

Kill all idle-in-transaction:

SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - interval '10 minutes';

Phase 4: Root Cause Analysis (15–60 min)

Get historical query performance:

SELECT 
  calls,
  total_exec_time,
  mean_exec_time,
  stddev_exec_time,
  query
FROM pg_stat_statements
WHERE query LIKE '%specific_table%'
ORDER BY total_exec_time DESC;

Check if statistics are stale:

SELECT 
  schemaname,
  relname,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'problematic_table';

-- Force statistics update
ANALYZE problematic_table;

Review table and index sizes:

SELECT 
  tablename,
  pg_size_pretty(pg_total_relation_size(tablename::regclass)) AS total_size,
  pg_size_pretty(pg_indexes_size(tablename::regclass)) AS indexes_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(tablename::regclass) DESC;

Phase 5: Permanent Fix (Hours–Days)

Missing index:

CREATE INDEX CONCURRENTLY idx_name ON table(column);
-- CONCURRENTLY prevents blocking writes

Query rewrite:

-- Before: OR conditions are often planned less efficiently (seq scan or BitmapOr)
SELECT * FROM orders WHERE user_id = 123 OR user_id = 456;

-- After: Use IN or UNION
SELECT * FROM orders WHERE user_id IN (123, 456);

Materialized view:

CREATE MATERIALIZED VIEW expensive_report AS
SELECT ... FROM ... GROUP BY ...;

CREATE UNIQUE INDEX ON expensive_report(...);

Partition large table:

-- Convert existing table to partitioned
-- (requires data migration or pg_partman)
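
There is no in-place conversion; one common approach, sketched below under the assumption of a maintenance window and an orders table partitioned by created_at, is to build a new partitioned table, backfill it, and swap names:

-- Sketch: any unique constraint on the new table must include the partition key
BEGIN;
CREATE TABLE orders_new (LIKE orders INCLUDING DEFAULTS)
  PARTITION BY RANGE (created_at);

CREATE TABLE orders_new_2024 PARTITION OF orders_new
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

INSERT INTO orders_new SELECT * FROM orders;  -- backfill (can also be batched)

ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;

-- Recreate indexes on the new parent, then drop orders_old once verified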

7. Sustaining Performance: Operational Discipline

Slow query management is not a one-time project, but a routine.

Weekly:

  • Review pg_stat_statements
  • Identify new expensive query shapes

Monthly:

  • Check autovacuum effectiveness
  • Reevaluate indexing patterns

Quarterly:

  • Reassess capacity needs
  • Review schema growth and future workload expectations

Every improvement compounds over time — just like every neglected inefficiency does.

Weekly Checklist

Review pg_stat_statements:

-- Save baseline for comparison
CREATE TABLE query_stats_baseline AS
SELECT now() AS snapshot_time, *
FROM pg_stat_statements;

-- Compare week-over-week
SELECT 
  current.query,
  current.calls - COALESCE(baseline.calls, 0) AS calls_delta,
  current.total_exec_time - COALESCE(baseline.total_exec_time, 0) AS time_delta
FROM pg_stat_statements current
LEFT JOIN query_stats_baseline baseline USING (queryid)
WHERE current.calls - COALESCE(baseline.calls, 0) > 1000
ORDER BY time_delta DESC
LIMIT 20;

Check for new slow queries:

SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 1000  -- 1 second
  AND calls > 10
ORDER BY mean_exec_time DESC;

Review connection pool metrics:

psql -h pgbouncer_host -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
psql -h pgbouncer_host -p 6432 -U pgbouncer pgbouncer -c "SHOW STATS;"

Monthly Checklist

Autovacuum effectiveness:

SELECT 
  schemaname || '.' || relname AS table_name,
  n_dead_tup,
  n_live_tup,
  round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 1) AS dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
ORDER BY dead_pct DESC NULLS LAST
LIMIT 30;

Index usage analysis:

SELECT 
  schemaname,
  relname,
  indexrelname,
  idx_scan,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%pkey'
  AND pg_relation_size(indexrelid) > 1048576  -- > 1MB
ORDER BY pg_relation_size(indexrelid) DESC;

-- Drop unused indexes
-- DROP INDEX CONCURRENTLY unused_index_name;

Duplicate indexes:

SELECT 
  pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS size,
  (array_agg(idx))[1] AS idx1,
  (array_agg(idx))[2] AS idx2,
  (array_agg(idx))[3] AS idx3,
  (array_agg(idx))[4] AS idx4
FROM (
  SELECT 
    indexrelid::regclass AS idx,
    indrelid::regclass AS tbl,
    (indrelid::text || E'\n' || indclass::text || E'\n' || 
     indkey::text || E'\n' || COALESCE(indexprs::text, '') || E'\n' || 
     COALESCE(indpred::text, '')) AS key
  FROM pg_index
) sub
GROUP BY key, tbl
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;

Quarterly Checklist

Capacity planning:

-- Database growth rate
SELECT 
  pg_size_pretty(pg_database_size(current_database())) AS current_size,
  -- Compare with previous quarter backup
  'TODO: Calculate growth rate' AS growth_rate;
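
One way to make the growth rate concrete, assuming you keep a small bookkeeping table (db_size_history here is hypothetical), is to snapshot the database size on a schedule and diff the snapshots:

CREATE TABLE IF NOT EXISTS db_size_history (
  snapshot_time timestamptz DEFAULT now(),
  size_bytes bigint
);

-- Run periodically (e.g. via pg_cron)
INSERT INTO db_size_history (size_bytes)
SELECT pg_database_size(current_database());

-- Growth over the last quarter
SELECT pg_size_pretty(max(size_bytes) - min(size_bytes)) AS quarterly_growth
FROM db_size_history
WHERE snapshot_time > now() - interval '3 months';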

Review configuration:

-- Check if settings match hardware
SHOW shared_buffers;  -- Should be 25% of RAM
SHOW effective_cache_size;  -- Should be 50-75% of RAM
SHOW work_mem;  -- Depends on concurrent queries
SHOW maintenance_work_mem;  -- For VACUUM, CREATE INDEX

Partition maintenance:

-- Auto-create future partitions
-- Auto-archive/drop old partitions
-- Update partition retention policy
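
A minimal sketch of the first item using pg_cron and dynamic SQL (it assumes the orders table is range-partitioned by month, as in section 4; adjust names and schedule to your setup):

SELECT cron.schedule('create-next-orders-partition', '0 0 25 * *', $$
  DO $body$
  DECLARE
    start_date date := date_trunc('month', now() + interval '1 month');
    end_date   date := date_trunc('month', now() + interval '2 months');
    part_name  text := 'orders_' || to_char(start_date, 'YYYY_MM');
  BEGIN
    -- Create next month's partition if it does not exist yet
    EXECUTE format(
      'CREATE TABLE IF NOT EXISTS %I PARTITION OF orders FOR VALUES FROM (%L) TO (%L)',
      part_name, start_date, end_date);
  END
  $body$;
$$);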

Automated Monitoring Setup

Example monitoring queries in Prometheus/Grafana:

-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries
SELECT count(*) FROM pg_stat_activity 
WHERE state = 'active' AND now() - query_start > interval '30 seconds';

-- Replication lag (for replicas)
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));

-- Cache hit ratio (should be > 95%)
SELECT 
  round(100.0 * sum(blks_hit) / NULLIF(sum(blks_hit + blks_read), 0), 2) AS cache_hit_ratio
FROM pg_stat_database;

-- Transaction counters (cumulative; derive transactions per second from deltas between scrapes)
SELECT 
  sum(xact_commit + xact_rollback) AS total_transactions
FROM pg_stat_database;

Conclusion

Slow queries are inevitable — but system-wide slowdowns are not.

By implementing comprehensive monitoring with pg_stat_statements and pg_stat_activity, deeply understanding query execution through EXPLAIN ANALYZE, applying strategic indexing and partitioning, and enforcing operational safeguards like connection pooling, timeouts, and autovacuum tuning, you can keep PostgreSQL responsive and resilient even as your data and workloads grow.

  1. Monitor proactively – Use pg_stat_statements to catch problems before users do
  2. Understand the planner – EXPLAIN ANALYZE is your diagnostic tool
  3. Index strategically – Right type, right columns, right order
  4. Partition wisely – For large tables with natural boundaries
  5. Pool connections – PostgreSQL wasn’t built for thousands of direct connections
  6. Set timeouts – Circuit breakers prevent cascading failures
  7. Maintain regularly – Autovacuum, analyze, and index maintenance are not optional
  8. Test in production-like environments – Statistics and data distribution matter

Performance is a continuous practice, not a one-time fix. Handled correctly, the database becomes not a bottleneck, but a competitive advantage.
