Improving Query Performance With Index

The query planner and optimizer adjust with SQL structure changes so can be impacted by changing your query. The planner generates a series of plan trees with many nodes, or steps. The trees are unique plans that vary the query steps order. Plans are best visualized through the output of the EXPLAIN function. The planner estimates the costs of each plan by using row counts and other metadata in the pg_tables schema. The optimizer chooses the best of these plans. Postgres uses time as its cost metric for optimization.

The query planner uses the pg_class table and pg_stats view to create its plan and cost estimates. They have an indicator of tables with indexes. They store information such as the number of null values, column width, and distinct values.

EXPLAIN provides a window into the query plan. Place EXPLAIN before any query to see the steps and cost estimates. EXPLAIN does not run the query; it only provides estimates.

Here we are going to examine the query plan for SELECT *  from test1 table which have 1 million rows with one column, then we will see the different between test1 without indexes and with indexes. To do so we have to create table named test1, and insert dummy data.

db_test=# CREATE TABLE test1 (a int);
CREATE TABLE
db_test=# INSERT INTO test1 (SELECT test.id
  FROM generate_series(1, 1000000) AS test (id));
INSERT 0 1000000

And then see the query plan on that query

db_test=# EXPLAIN SELECT * FROM test1;

                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on test1  (cost=0.00..14425.00 rows=1000000 width=4)
(1 row)

The query plan for the simple SELECT * query has one step. It is a sequential scan of the test1 table. This one step includes cost and size estimates. Sequential scan is the single step in this query plan. A sequential scan is a scan of all the rows in a table. The cost estimate of the sequential scan is dimensionless.  It used to compare two structures of a query with the same output. Also it cannot to compare queries with different output. The first cost number is 0. It describes the startup cost. The second number, 14425.00, describes the total time. The total time equals the startup time plus the run time. The size estimates cover rows and width. Rows provides an estimate of the rows the query examines in order to run. Width provides an estimate of the byte width of the total rows.

Next we will observe the major different between scanning table without index and with index.

These query will take several rows from table test1, and we add EXPLAIN in the beginning of the query to see the query plan:

db_test=# EXPLAIN SELECT * FROM test1 WHERE a < 500 AND a > 250;

                                QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=1000.00..11699.10 rows=241 width=4)
   Workers Planned: 2
   ->  Parallel Seq Scan on test1  (cost=0.00..10675.00 rows=100 width=4)
         Filter: ((a < 500) AND (a > 250))
(4 rows)

follow with EXPLAIN ANALYZE:

db_test=# EXPLAIN ANALYZE SELECT * FROM test1 WHERE a < 500 AND a > 250;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..11699.10 rows=241 width=4) (actual time=0.320..418.270 rows=249 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test1  (cost=0.00..10675.00 rows=100 width=4) (actual time=254.281..393.500 rows=83 loops=3)
         Filter: ((a < 500) AND (a > 250))
         Rows Removed by Filter: 333250
 Planning Time: 0.084 ms
 Execution Time: 418.335 ms
(8 rows)

And now create index on table test1, we will observe is there anything different running the same query before and after creating an index.

db_test=# CREATE INDEX a_idx ON test1 (a);
CREATE INDEX

Run the same query again on test1:

db_test=# EXPLAIN SELECT * FROM test1 WHERE a < 500 AND a > 250;

                                QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using a_idx on test1  (cost=0.42..14.25 rows=241 width=4)
   Index Cond: ((a < 500) AND (a > 250))
(2 rows)

With EXPLAIN ANALYZE:

db_test=# EXPLAIN ANALYZE SELECT * FROM test1 WHERE a < 500 AND a > 250;

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using a_idx on test1  (cost=0.42..14.25 rows=241 width=4) (actual
time=0.019..0.176 rows=249 loops=1)
   Index Cond: ((a < 500) AND (a > 250))
   Heap Fetches: 249
 Planning Time: 0.223 ms
 Execution Time: 0.212 ms
(5 rows)

Recall that a column with an index looks the same as a column without an index. The index exists on the back end to help performance. Soon after we create index on test1 table, the query plan is change from sequential scan to index scan which perform much faster than before. The easy way to see is the Execution time on test1 before having an index and after adding an index, from 418.335 millisecond down to 0.212 millisecond, that is a huge different there. Other things that obviously seen is the cost is drastically reduce from 11699.10 to 14.25, it also means that the back end process only access a feewer rows on that table.

From the example above we can conclude that having an index on specific column can boost the performance of the query, so having an index on a table is a good start for optimizing query performance. However if we take to many rows for a query to access the table like 75% of total rows, the query plan will go back to sequential scan and will not perform an index scan.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>