A Sort in my Aggregate? No thanks!

When performing aggregation queries in Postgres, you may have noticed something unexpected: Postgres sometimes adds a Sort step to perform the aggregation. Although this may be perfectly reasonable in some conditions, it may also be a red flag and negatively affect the performance of your query, sometimes dramatically. I have seen hot analytics queries go from 5s to under 1s by guiding Postgres to remove that Sort step! Let’s figure out why it happens and how to solve the issue.

A little note before we start: do not worry about the considerations of this article if your tables contain less than 100,000 rows, as the practical impact would be negligible.

HashAggregate

For most common aggregation queries, the query plan will use the HashAggregate strategy. In that case, Postgres progressively builds a hash table in which the key represents the grouping columns and the value holds the aggregated result. It processes each row of the input table that matches the filters, updates the corresponding hash table entry, and only returns the contents of the hash table once the scan is complete.

Let’s examine how it would work with a very basic analytics query looking at the number of books per publisher in a database of books.

SELECT publisher, COUNT(*) FROM books GROUP BY publisher;

To perform this aggregation using a HashAggregate, Postgres goes row by row in no particular order and progressively updates a hash table in working memory, with the keys being hashes of the publisher column and the values being the running count of books. Let’s see it in action:

Books table (scanned in no particular order):
  1 ... Penguin
  2 ... O'Reilly
  3 ... Penguin
  4 ... Hachette
  5 ... O'Reilly

Work memory: a hash table keyed by publisher, initially empty; each scanned row creates or increments the entry for its publisher.
Completed groups: initially empty; nothing can be emitted until the whole table has been scanned, since any publisher may still appear again.

Note that the whole hash table (keys and values) has to fit in memory for this strategy to be efficient, because at any point Postgres may need to update the value of a previously encountered group. If the hash table does not fit in memory, Postgres has to write part of it to disk and keep updating it there, which is orders of magnitude slower.
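
You can see which strategy your own query uses with EXPLAIN: the aggregation strategy shows up as the top node of the plan. Here is a sketch for our example query; the plan shape is what Postgres prints, but the costs and row counts are placeholders and will depend entirely on your data:

EXPLAIN SELECT publisher, COUNT(*) FROM books GROUP BY publisher;

--  HashAggregate  (cost=... rows=... width=...)
--    Group Key: publisher
--    ->  Seq Scan on books  (cost=... rows=... width=...)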

Let’s now look at the second most common strategy for aggregation in Postgres.

GroupAggregate

Unlike HashAggregate, this strategy does not require much working memory. In fact, it only keeps one group at a time in memory, so its memory footprint is minimal. How is this possible? By working on data sorted by the grouping columns and iterating over the rows in that order. This little trick ensures that once all the rows of a given combination of grouping columns have been processed, no later row can belong to the same group, so Postgres can safely mark the group as final and add it to the results.

Let’s see that mechanism in action for the same query and rows as above:

Books table (sorted by publisher):
  4 ... Hachette
  2 ... O'Reilly
  5 ... O'Reilly
  1 ... Penguin
  3 ... Penguin

Work memory: only the current publisher and its running count, initially empty.
Completed groups: each group is emitted as soon as a row with a different publisher shows up (Hachette first, then O'Reilly, then Penguin).

This strategy is both memory and CPU efficient: it keeps very little in memory, never has to hash the keys, and does not repeatedly mutate the same entries. However, it requires sorted data. If the data is not already sorted, the query planner has to add a Sort node before the GroupAggregate one.

GroupAggregate is perfectly fine to use when the data is already sorted. It is in fact one of the few cases in which aggregations do benefit from indexes, because an index scan will return the rows in sorted order, so having an index on publisher would allow Postgres to use a GroupAggregate without having to sort the table during query execution.
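
As a sketch of that difference (books_publisher_idx is a name I am making up for the example, and the plans are illustrative): without a suitable index the GroupAggregate needs an explicit Sort, while with an index on publisher the planner can feed already-sorted rows straight into the GroupAggregate, if it estimates that path to be cheaper:

-- Without an index, a GroupAggregate plan needs an explicit Sort first:
--  GroupAggregate
--    Group Key: publisher
--    ->  Sort
--          Sort Key: publisher
--          ->  Seq Scan on books

CREATE INDEX books_publisher_idx ON books (publisher);

-- With the index, the rows can come out already ordered by publisher:
EXPLAIN SELECT publisher, COUNT(*) FROM books GROUP BY publisher;
--  GroupAggregate
--    Group Key: publisher
--    ->  Index Only Scan using books_publisher_idx on books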

How Postgres chooses which strategy to use

Postgres takes many considerations into account when picking a query plan, but we can abstract most of them away and say that for most aggregation queries the decision tree looks more or less like this:

Aggregation needed
  Is the data already sorted by the group key?
    Yes -> GroupAggregate
    No  -> Will the hash table fit in work_mem?
      Yes -> HashAggregate
      No  -> Sort + GroupAggregate

As we can see here, the crucial question in our case is “Will the hash table fit in work_mem?”. The work_mem parameter is configurable, so we have control over that side of the equation. The default value (4MB) is very low for modern machines, and it is one of the first parameters DBAs change when deploying a Postgres cluster. If you still have it set at 4MB, now is the time to change that! There is no magic rule for choosing the value [1]: it depends a lot on your workload (how many simultaneous queries you usually run, and how many of them need work_mem) and on your hardware. My recommendation is simply to test it at different levels, while making sure you don’t push your Postgres instance to use too much memory, as that can have very negative consequences for performance.
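
If you want to experiment, here is a minimal sketch of how to inspect and adjust work_mem (the 64MB value is purely a placeholder for the example, not a recommendation):

SHOW work_mem;

-- Per-session override, handy to test the effect on a single query:
SET work_mem = '64MB';

-- Cluster-wide default, written to postgresql.auto.conf:
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();  -- reload the configuration so the new default applies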

In many cases however, work_mem is not the issue at all. The other side of the equation is often more problematic: the size of the hash table. Of course, the actual size needed cannot fully be known in advance, so Postgres has to rely on estimations (i.e. guesses) for the size needed to store the hash table.

The approximate [2] formula is something like this:

estimated hash table size ≈ estimated number of groups * (size of the grouping keys + size of the aggregated values + per-entry overhead)

The number of groups is estimated based on column statistics for the columns involved in the grouping, together with the query’s filtering predicates. The average size of a group’s entry is in turn based on the aggregated values you are asking for and, if you are also returning column values, on column size statistics once again.

Let’s take our query from above and see how Postgres would estimate the size of the hash table. Postgres will look at the statistics of the publisher column and see that there are around 20,000 distinct values in our example table. It also sees that a publisher is on average around 19 bytes, and that the count is a bigint (8 bytes), so an approximate size for the hash table will be something like 20,000 * (8 + 19 + constant). The constant is not actually fully constant, and is not even a single value, so it gets a bit tricky; let’s say it’s 20 bytes for the example. The estimated size of the hash table is then 20,000 * (8 + 19 + 20) = 940,000 bytes, well below even the default work_mem.
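
You can look at those per-column statistics yourself in the pg_stats view; n_distinct and avg_width are the two numbers feeding the estimate above (your values will of course depend on your data and on when ANALYZE last ran):

SELECT n_distinct, avg_width
FROM pg_stats
WHERE tablename = 'books' AND attname = 'publisher';
-- n_distinct is reported as a negative number when Postgres stores it as a
-- fraction of the row count rather than as an absolute count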

When estimations go wrong

There are many reasons why estimations may be off, but in the context of aggregations, the most common one is when columns are not fully independent from each other. For instance, let’s say that publisher on its own is not enough and we want a bit more information, by looking not only at the publishing house but also at the genre of the book and at the imprint.

SELECT publisher, genre, imprint, COUNT(*) 
FROM books 
GROUP BY publisher, genre, imprint;

When estimating the number of groups, Postgres will look at the statistics of each column independently and multiply their distinct counts together. If we have 20,000 publishers, 50 genres and 40,000 imprints, Postgres will estimate that there will be 20,000 * 50 * 40,000 = 40,000,000,000 groups. Suddenly, the estimated size of the hash table goes from less than 1MB to more than 1TB! As far as Postgres is concerned, there is no way that fits in memory!

But in reality, an imprint is nearly always associated with a single publisher, and many imprints have only a few genres associated with them. If we run the query, we find out that there are actually only about 300,000 groups, which would have needed only about 14MB of memory, perfectly reasonable!
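
If you want to check the real number of groups on your own data, a simple (if brute-force) way is to count them directly; note that on a large table this is itself a full aggregation:

SELECT COUNT(*)
FROM (
  SELECT 1
  FROM books
  GROUP BY publisher, genre, imprint
) AS g;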

To tell Postgres about this correlation, we can create a multi-column statistics object:

CREATE STATISTICS books_pub_genre_imprint_stats (ndistinct) 
ON publisher, genre, imprint 
FROM books;

After creating such statistics (and running ANALYZE books; of course), Postgres will have internal statistics about the number of combinations between the 3 columns we are looking at, and thus will know that there are NOT 40,000,000,000 combinations, but only around 300,000.
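
A quick way to verify the effect is to re-run ANALYZE and then EXPLAIN, and check that the estimated number of groups has dropped and the plan has switched back to a HashAggregate (the plan below is an illustrative sketch; the actual costs and row counts depend on your data):

ANALYZE books;

EXPLAIN SELECT publisher, genre, imprint, COUNT(*)
FROM books
GROUP BY publisher, genre, imprint;

--  HashAggregate  (cost=... rows=... width=...)
--    Group Key: publisher, genre, imprint
--    ->  Seq Scan on books  (cost=... rows=... width=...)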

Creating statistics, like everything when it comes to databases, is not free. It adds some overhead to the ANALYZE cycle, so you should not go too wild, but it can be extremely useful if you frequently run aggregation queries on correlated columns.

Why it matters

The main issue with Postgres picking a Sort + GroupAggregate plan instead of a HashAggregate is that sorting large tables is not cheap. Postgres will most often have to sort that data on disk, because it is too large to fit in memory, and suddenly the Sort step can take the majority of the query time, significantly more than the table scan feeding the aggregation itself.
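
EXPLAIN ANALYZE makes this easy to spot: the Sort node reports its sort method, and an on-disk sort shows up as an external merge. Here is a sketch of what that looks like (keep in mind that EXPLAIN ANALYZE actually executes the query, and the numbers below are placeholders):

EXPLAIN (ANALYZE, BUFFERS)
SELECT publisher, genre, imprint, COUNT(*)
FROM books
GROUP BY publisher, genre, imprint;

--  GroupAggregate  (cost=... rows=...) (actual time=... rows=... loops=1)
--    Group Key: publisher, genre, imprint
--    ->  Sort  (cost=... rows=...) (actual time=... rows=... loops=1)
--          Sort Key: publisher, genre, imprint
--          Sort Method: external merge  Disk: ...kB
--          ->  Seq Scan on books  (actual time=... rows=... loops=1)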

Summary

If you see an expensive Sort step in your aggregation query plan, check that:

  • work_mem is correctly configured and not too low
  • The estimated number of groups is not too far from the actual number of groups
  • If the estimated number of groups is wildly inaccurate (not just off by some percentage, but by orders of magnitude), look at the columns you are grouping by and consider creating extended statistics to improve the query planning

Footnotes

  1. A useful rule of thumb from Christophe Pettus is (average freeable memory * 4) / max_connections. You can read his blog post to see why it’s not an absolute rule, or check out the great blog post from pganalyze on the topic.

  2. If you are curious about the full calculation, take a look at the source code.