GROUP BY vs DISTINCT

Every now and then I see queries that use GROUP BY without any aggregates, which is the same as using DISTINCT.

For example, this query:

select x, y
from test
group by x, y;

is equivalent to:

select distinct x, y
from test;

I always argued that using DISTINCT is the better way to write this query. Both do exactly the same thing, and any modern optimizer will generate the same execution plan. Plus: the DISTINCT documents the intention better (I think). A query that uses GROUP BY but no aggregates always begs the question of whether something was forgotten.
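For comparison, this is the kind of query the GROUP BY form suggests might have been intended: one with an actual aggregate (the count(*) here is just an illustrative choice, not from the original test):

select x, y, count(*)
from test
group by x, y;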

But recently I discovered that Postgres will run these two queries differently. Apparently GROUP BY can make use of parallel query execution, while DISTINCT cannot.
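To reproduce this, a table definition is needed; the original doesn't show one, but based on the columns used, a minimal version might look like this:

create table test (
  x integer,
  y integer
);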

The following inserts 25 million rows into our test table. Each column will contain the values 1 through 100, so each of the queries above yields 10000 rows (100 × 100 distinct combinations).

insert into test
select (random() * 99 + 1)::int, (random() * 99 + 1)::int
from generate_series(1, 25e6);
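The plans below look like output of EXPLAIN (ANALYZE, BUFFERS); assuming that, they can be reproduced with:

explain (analyze, buffers)
select distinct x, y
from test;

explain (analyze, buffers)
select x, y
from test
group by x, y;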

This is the execution plan for the DISTINCT query.

HashAggregate  (cost=485621.80..485721.80 rows=10000 width=8) (actual time=5635.753..5636.618 rows=10000 loops=1)
  Group Key: x, y
  Buffers: shared hit=110620
  ->  Seq Scan on test  (cost=0.00..360621.20 rows=25000120 width=8) (actual time=0.011..1059.319 rows=25000000 loops=1)
        Buffers: shared hit=110620
Planning Time: 0.040 ms
Execution Time: 5637.222 ms

However, with the GROUP BY query, Postgres decides that using four parallel workers is the better choice:

Group  (cost=206134.89..211124.28 rows=10000 width=8) (actual time=1891.418..1906.560 rows=10000 loops=1)
  Group Key: x, y
  Buffers: shared hit=23797
  ->  Gather Merge  (cost=206134.89..210924.28 rows=40000 width=8) (actual time=1891.417..1931.831 rows=50000 loops=1)
        Workers Planned: 4
        Workers Launched: 4
        Buffers: shared hit=110656
        ->  Sort  (cost=205134.84..205159.84 rows=10000 width=8) (actual time=1853.463..1853.987 rows=10000 loops=5)
              Sort Key: x, y
              Sort Method: quicksort  Memory: 853kB
              Worker 0:  Sort Method: quicksort  Memory: 853kB
              Worker 1:  Sort Method: quicksort  Memory: 853kB
              Worker 2:  Sort Method: quicksort  Memory: 853kB
              Worker 3:  Sort Method: quicksort  Memory: 853kB
              Buffers: shared hit=110656
              ->  Partial HashAggregate  (cost=204370.45..204470.45 rows=10000 width=8) (actual time=1847.461..1849.122 rows=10000 loops=5)
                    Group Key: x, y
                    Buffers: shared hit=110620
                    ->  Parallel Seq Scan on test  (cost=0.00..173120.30 rows=6250030 width=8) (actual time=0.022..364.508 rows=5000000 loops=5)
                          Buffers: shared hit=110620
Planning Time: 0.101 ms
Execution Time: 1936.683 ms

About 2 seconds vs. more than 5 seconds is a difference that does justify writing a query that might look like something was forgotten.

Relevant details of this installation are:

  • Postgres 11.2 on Windows 10
  • work_mem = 4MB
  • random_page_cost = 4
  • max_parallel_workers = 8
  • max_parallel_workers_per_gather = 4
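The parallelism settings can be checked on any installation with SHOW, or changed for the current session with SET, e.g.:

show max_parallel_workers_per_gather;
set max_parallel_workers_per_gather = 4;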

I don’t know whether this is intended or should be considered a bug in the optimizer.