The cost of redundant indexes and how to identify and prevent them.
Indexes are a double-edged sword in query optimization: precise and lifesaving when wielded skillfully, but capable of dragging down a database’s performance when used carelessly. Almost any query can be sped up by adding the RIGHT indexes. You have a query with an execution time > 1s? Add an index on the columns in the query condition, and voilà! The execution time crashes to below 1ms. This is because an index is essentially a shortcut to the physical location of the data the query intends to retrieve.
EXPLAIN ANALYZE output of a query executed with a sequential scan.
EXPLAIN ANALYZE output of the same query executed with an index scan. The difference in cost and execution time is night and day.
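If you want to reproduce a comparison like this yourself, here’s a minimal sketch. The users table, its columns, and the row count are all made up for illustration:

-- Hypothetical table, large enough that the planner won't favor a seq scan by default
CREATE TABLE users (id bigint, email text);
INSERT INTO users
SELECT g, 'user' || g || '@example.com'
FROM generate_series(1, 1000000) g;
ANALYZE users;

-- No suitable index yet: expect a (parallel) sequential scan
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user42@example.com';

-- Add the right index and re-run: expect an index scan and a much lower cost
CREATE INDEX idx_users_email ON users (email);
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user42@example.com';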
In this article, I’ll focus on PostgreSQL. When PostgreSQL can’t find a suitable index for a query, it falls back to a sequential scan: it reads the main table (AKA the heap) page by page and, within each page, row by row, until it finds the row(s) that match the query condition. For tiny tables (a few hundred rows), this is trivial work, and the query planner often ignores indexes on them because a sequential scan is estimated to be cheaper than an index scan. But as the table grows, the limitations of a sequential scan show up as slow query execution times; that’s where indexing comes in.
The B+tree index is the most popular database index and the default in PostgreSQL when you run the CREATE INDEX statement without specifying an index access method. B+trees are structured in a way that makes it possible to eliminate most index entries from the search and reach the entry being looked for very quickly. The details of B+tree indexes and effective indexing are beyond the scope of this article, but I recommend reading Use The Index, Luke, a concise but well-rounded resource on indexing, and playing with this B+tree simulator to understand the structure of a B+tree.
The cost of indexes
Indexes speed up queries, but their existence comes with costs:
Disk utilization: An index is essentially a subset of the main table, structured differently. So it occupies its own disk space (duh). The size of an index depends on the index type (inverted indexes, such as GIN, typically use more space than B+trees) and the number of columns that make up the index.
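You can see exactly how much of your disk each index occupies with pg_relation_size; for example:

-- List indexes by on-disk size, largest first
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;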
Memory buffer utilization: PostgreSQL uses a finite memory buffer to cache the pages of relations (tables, indexes, etc.) and reduce disk I/O. Relation data on disk and in memory is segmented into pages (8 KB in size) and accessed page by page. During query execution, pages that contain the rows being scanned are first looked up in the buffer. If missing, they are loaded from disk and added to the buffer. If the buffer is full, a page replacement algorithm evicts rarely accessed pages in favor of more recently accessed ones. Index pages occupy this memory buffer too.
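If the pg_buffercache extension is available, you can inspect which relations (indexes included) currently occupy the buffer. This query is adapted from the example in the Postgres docs:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top 10 relations by number of 8 KB pages held in shared_buffers
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
  AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                            WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;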
Write overhead: An index contains data from the main table and a pointer (not a C/C++ pointer, by the way) to the exact physical location of the associated row in the main table. Because indexes need to stay in sync with the main table, table inserts and updates trigger index updates. INSERTs update all indexes on the table. Most UPDATEs also update all indexes, including indexes that don’t contain the column(s) being updated. This is because an update in PostgreSQL marks the existing row as dead and inserts a new version of the row, with the updated data, in a new location. So all indexes need to be updated with the location of the latest version of the row. If a table has 100 indexes, for example, all 100 will be updated when a row is inserted or updated. The exception is HOT (Heap Only Tuple) updates, which skip index updates when none of the indexed columns change and the new row version fits on the same page as the old one.
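You can check how often your tables actually benefit from HOT updates via pg_stat_user_tables:

-- Fraction of updates that were HOT (i.e., required no index maintenance)
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_update_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;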
Query planner decision fatigue: An abundance of options often requires more effort to make a decision and increases the probability of making the wrong one. That’s true for many things in life, and for the Postgres query planner. Postgres uses a cost-based planner to determine how to execute a query. The cost of every possible execution path is estimated using arbitrary units and the statistics collected during VACUUM and ANALYZE, and the cheapest path is selected. One factor that contributes to the overall cost of a path is the scan type (seq scan, index scan, index-only scan, bitmap scan, etc.) and, for index scans, the specific index used. More indexes can mean longer planning times, as the planner must evaluate every index relevant to the query. Also, because cost estimation is based on often-outdated single-column statistics, the planner’s estimates are sometimes off. For example, the planner may estimate index A to be slightly cheaper than index B and choose index A, when in reality index B offers better performance. The probability of this happening increases with the number of indexes.
The implication of having redundant indexes is the unnecessary incurrence of the costs listed above:
Wasted disk space.
Wasted memory buffer space, which may cause cache misses for pages of tables and actively used indexes, slowing down reads and writes.
Unnecessary write overhead since Postgres updates all the indexes in a table during inserts and updates (except for HOT updates).
Factors that could contribute to redundant index accumulation and how to prevent them
Premature indexing: Developers often speculate blindly on the present or future necessity of an index, creating it prematurely to speed up a query that isn’t slow. For new tables, (non-unique) indexes will likely be ignored by the query planner until the table grows to a certain size. And by the time the table reaches that size, the query the index was intended to speed up may have been removed or changed in the application. A good rule of thumb is to have comprehensive monitoring in place for SQL queries and only consider creating an index when there is a noticeable degradation in a query’s performance.
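If the pg_stat_statements extension is enabled, a query like this (column names are for Postgres 13+) is a reasonable starting point for that monitoring:

-- Slowest queries on average; only consider indexing what actually shows up here
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;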
Improper indexing: A lack of understanding of the index type, its structure, and how it’s traversed often leads to the creation of redundant indexes. In my experience, this happens mostly with multi-column indexes intended to speed up queries that filter by multiple columns. Indexing a single column is straightforward, but for multi-column indexes, the order of the columns is very important and affects both the usability of the index and the cost of using it. Knowing how to pick the ideal column order for a multi-column index is an important skill; the Use The Index, Luke book covers this.
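As a quick illustration of why column order matters, consider a hypothetical orders table. An index on (customer_id, order_date) serves queries that filter on the leading column, but not queries that filter on order_date alone:

CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- Can use the index (the filter includes the leading column):
SELECT * FROM orders WHERE customer_id = 42;
SELECT * FROM orders WHERE customer_id = 42 AND order_date >= '2024-01-01';

-- Can't traverse the index efficiently (leading column missing from the filter):
SELECT * FROM orders WHERE order_date >= '2024-01-01';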
Excessive indexing: Developers sometimes introduce a new index without checking whether an existing one could be modified to serve the query being optimized. Depending on the query, simply appending a column to an existing single-column index, or switching the column order of an existing multi-column index, can make one index serve both the existing queries and the new query you’re trying to optimize.
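Sticking with the hypothetical orders table: if an index on customer_id already exists and a new query also filters by status, extending the existing index is usually better than adding a second one:

-- Instead of adding a second index:
--   CREATE INDEX idx_orders_status ON orders (status);
-- extend the existing (hypothetical) idx_orders_customer so one index serves
-- both the old query (customer_id only) and the new one (customer_id + status):
DROP INDEX idx_orders_customer;
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);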
Lack of query testing: SQL provides a great tool for analyzing the execution plan and cost of a query: the EXPLAIN command. “Plan-reading is an art that requires some experience to master” — Postgres docs. The Postgres docs cover the basics of the EXPLAIN output. After creating an index to speed up a query, it’s important to run EXPLAIN (with ANALYZE for reads) against the query to verify that the index you created is actually used. Remember to run VACUUM (without FULL) before testing, so the visibility map is up to date and index-only scans can be considered.
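Reusing the hypothetical users table and index from earlier, verification could look like this:

VACUUM users;  -- update the visibility map so index-only scans are considered

EXPLAIN (ANALYZE, BUFFERS)
SELECT email FROM users WHERE email = 'user42@example.com';
-- Expect "Index Only Scan using idx_users_email on users" in the plan;
-- a "Seq Scan" here means the index isn't being used for this query.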
Removed/changed queries: The queries that indexes were introduced to speed up may since have been removed from the application, or changed in a way that makes the indexes no longer usable.
How to identify redundant indexes in Postgres
Thankfully, Postgres exposes a view of index statistics: pg_stat_user_indexes. It contains a row for each index in the database, with statistics on how many times the index has been accessed. You can run the following query to find non-unique indexes that have never been used:
SELECT i.relname, i.indexrelname, i.idx_scan
FROM pg_stat_user_indexes i
JOIN pg_index ix ON i.indexrelid = ix.indexrelid
WHERE i.idx_scan = 0
AND ix.indisunique = false -- Exclude unique indexes
ORDER BY i.relname;
The idx_scan column shows the number of times each index has been scanned since it was created (or since its statistics were last reset). So the 7 indexes in the result above have never been used.
I ran the same query on a production database for a large SaaS application and discovered hundreds of never-used indexes.
Postgres 16 added a last_idx_scan column, which shows the last time the index was scanned. That can be used to identify indexes that were used in the past but are now redundant. The previous query can be modified to also include indexes that haven’t been used in X days:
-- Indexes that have never been used at all or haven't been used in 30 days:
SELECT i.relname, i.indexrelname, i.idx_scan, i.last_idx_scan
FROM pg_stat_user_indexes i
JOIN pg_index ix ON i.indexrelid = ix.indexrelid
WHERE ix.indisunique = false -- Exclude unique indexes
AND (i.last_idx_scan IS NULL OR i.last_idx_scan < NOW() - INTERVAL '30 days')
ORDER BY i.relname;
After identifying unused indexes, you can remove them with the DROP INDEX command. Removing them will free up disk space, eliminate write overhead, and improve cache hit rates for the pages of active relations during reads and writes.
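On a busy production database, dropping with CONCURRENTLY avoids the exclusive table lock a plain DROP INDEX takes (index name hypothetical):

-- Avoids blocking concurrent reads/writes on the parent table;
-- note that it can't run inside a transaction block
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;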
Note: The pg_stat_user_indexes view only tracks index usage by client queries. Postgres internally uses unique indexes for unique-constraint checks and indexes on foreign keys for referential-integrity checks, and usage from those internal operations doesn’t show up in the view. The query above filters out unique indexes but leaves foreign-key indexes in, so you might want to consider removing only indexes that are neither unique nor on FK columns.
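One rough way to do that is to also skip indexes whose leading column matches the first column of a foreign key on the same table. This is only a heuristic sketch of my own (it ignores FK columns beyond the first), so review the results before dropping anything:

SELECT i.relname, i.indexrelname, i.idx_scan
FROM pg_stat_user_indexes i
JOIN pg_index ix ON i.indexrelid = ix.indexrelid
WHERE i.idx_scan = 0
AND ix.indisunique = false -- Exclude unique indexes
AND NOT EXISTS ( -- Exclude indexes that likely support a FK constraint
  SELECT 1
  FROM pg_constraint c
  WHERE c.conrelid = ix.indrelid
    AND c.contype = 'f'
    AND c.conkey[1] = ix.indkey[0] -- conkey is 1-based; indkey (int2vector) is 0-based
)
ORDER BY i.relname;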