JSON Path support in Postgres 12

2019-10-16

The SQL:2016 standard introduced JSON and various ways to query JSON values. Postgres has been supporting JSON for a long time but defined its own set of operators and functions to query JSON values.

With PostgreSQL 12, the standard JSON path query functionality is now also supported by Postgres. Although the functions accepting JSON path queries do not conform to the SQL standard, the parameters and behavior does.

Sample data set

Querying scalar values or JSON values is not made easier with JSON path. I have imported the posts of DBA Stackexchange.com into a Postgres 12 database and stored each post in a JSON value.

CREATE TABLE posts_json
(
   id    integer not null primary key,
   data  jsonb
);

CREATE INDEX idx_post_json_data ON posts_json USING gin (data);

A single post may look like this:

{
    "tags": [
        "postgresql",
        "query-performance",
        "parallelism"
    ],
    "score": 56,
    "title": "Can a single PostgreSQL query use multiple cores?",
    "parent_id": null,
    "view_count": 54233,
    "closed_date": null,
    "answer_count": 4,
    "post_type_id": 1,
    "comment_count": 0,
    "creation_date": "2013-12-14T14:13:46.783",
    "owner_user_id": 4843,
    "favorite_count": 11,
    "last_edit_date": "2017-02-15T00:51:24.703",
    "accepted_answer_id": 55044,
    "last_activity_date": "2018-09-16T19:11:24.837",
    "last_editor_user_id": 1192
}

Simple queries

If we want to find alls posts from a specific user_id, we can use this query:

SELECT *
FROM posts_json
WHERE data @> '{"owner_user_id": 4843}';

With the new JSON path support, this can also be written as:

SELECT *
FROM posts_json
WHERE jsonb_path_exists(data, '$.owner_user_id ? (@ == 4843)');

The expression $.owner_user_id accesses the top level key in the JSON value. And the ? operator then defines a filter that should be evaluated. The return value of the first expression is referenced using the @ that value is compared with the user ID we are looking for.

The biggest drawback of the new JSON path functions is, that they can’t be indexed. The execution plan for the first query looks like this:

Bitmap Heap Scan on posts_json  (cost=37.42..700.24 rows=183 width=36) (actual time=0.339..0.412 rows=61 loops=1)
  Recheck Cond: (data @> '{"owner_user_id": 4843}'::jsonb)
  Rows Removed by Index Recheck: 2
  Heap Blocks: exact=55
  ->  Bitmap Index Scan on idx_post_json_data  (cost=0.00..37.37 rows=183 width=0) (actual time=0.328..0.328 rows=63 loops=1)
        Index Cond: (data @> '{"owner_user_id": 4843}'::jsonb)
Planning Time: 0.083 ms
Execution Time: 0.439 ms

As you can see, the GIN index we defined on the whole JSON column is used. The table contains a bit over 180.000 rows.

The execution plan for the query with the JSON path expression, shows a Seq Scan on the table:

Seq Scan on posts_json  (cost=0.00..14205.06 rows=60962 width=36) (actual time=4.393..82.302 rows=61 loops=1)
  Filter: jsonb_path_exists(data, '$."owner_user_id"?(@ == 4843)'::jsonpath, '{}'::jsonb, false)
  Rows Removed by Filter: 182824
Planning Time: 0.073 ms
Execution Time: 82.318 ms

Unfortunately it does not even use a parallel seq scan for this. So if the table grows larger, this query will not scale.

Queries for containment of values

If we want to find out how many posts are tagged with postgresql, we can use the following query:

SELECT count(*)
FROM posts_json
WHERE data @> '{"tags":["postgresql"]}';

The equivalent query using JSON path is:

SELECT count(*)
FROM posts_json
WHERE jsonb_path_exists(data, '$.tags[*] ? (@ == "postgresql")')

Here we can see one very convenient way how to deal with array inside JSON path. The [*] implicitly expands all elements in the array and the condition @ == "postgresql" is applied for all values in the array.

Again, the first query using the @> operator can use the GIN index:

Aggregate  (cost=700.69..700.70 rows=1 width=8) (actual time=13.161..13.161 rows=1 loops=1)
  ->  Bitmap Heap Scan on posts_json  (cost=37.42..700.24 rows=183 width=0) (actual time=3.307..12.754 rows=10219 loops=1)
        Recheck Cond: (data @> '{"tags": ["postgresql"]}'::jsonb)
        Heap Blocks: exact=6840
        ->  Bitmap Index Scan on idx_post_json_data  (cost=0.00..37.37 rows=183 width=0) (actual time=2.612..2.612 rows=10219 loops=1)
              Index Cond: (data @> '{"tags": ["postgresql"]}'::jsonb)
Planning Time: 0.082 ms
Execution Time: 13.387 ms

And the JSON path query uses a Seq Scan:

Finalize Aggregate  (cost=13705.92..13705.93 rows=1 width=8) (actual time=69.788..69.788 rows=1 loops=1)
  ->  Gather  (cost=13705.60..13705.91 rows=3 width=8) (actual time=69.513..98.152 rows=4 loops=1)
        Workers Planned: 3
        Workers Launched: 3
        ->  Partial Aggregate  (cost=12705.60..12705.61 rows=1 width=8) (actual time=41.060..41.060 rows=1 loops=4)
              ->  Parallel Seq Scan on posts_json  (cost=0.00..12656.44 rows=19665 width=0) (actual time=0.060..40.847 rows=2555 loops=4)
                    Filter: jsonb_path_exists(data, '$."tags"[*]?(@ == "postgresql")'::jsonpath, '{}'::jsonb, false)
                    Rows Removed by Filter: 43167
Planning Time: 0.090 ms
Execution Time: 98.193 ms

Luckily this time Postgres at least uses a Parallel Seq Scan to improve performance.

More complex queries

One type of query that is notoriously cumbersome to write, is when you can’t use the containment operator e.g. because you want a LIKE condition on elements in an array (this is typically the moment where you should re-think the use of JSON, but that’s a different topic).

Just checking for the tag postgresql is not the complete picture, because there are also version specific tags, e.g. postgresql-11. The condition data @> '{"tags":["postgresql"]}' does not return those. If we want to count those as well, we need to unnest the array and use a LIKE operator on each element:

SELECT count(*)
FROM posts_json
WHERE EXISTS (SELECT *
              FROM jsonb_array_elements_Text(data -> 'tags') as x(tag)
              WHERE x.tag LIKE 'postgres%');

This kind of condition can be written in a much more compact way using JSON path:

SELECT count(*)
FROM posts_json
WHERE jsonb_path_exists(data, '$.tags[*] ? (@ starts with "postgres")');

Neither of those queries can use an index.

The JSON path expressions can also apply comparison expressions without the need to cast the value.

If we want to find all posts with a score higher than 100, we can use this:

SELECT *
FROM posts_json
WHERE (data ->> 'score')::int > 100;

With JSON path this can be written as:

SELECT count(*)
FROM posts_json
WHERE jsonb_path_exists(data, '$.score ? (@ > 100)');

The big advantage of the JSON path query is, that it’s lenient with regards to non-numeric values. With JSON we lose the ability to check the values for a specific data type, so it could happen that one post contains "score": "none". In that case the first query will fail with invalid input syntax for type integer: “none”.

The JSON path query will silently ignore those values.

Conclusion

JSON path offers a very powerful (albeit cumbersome) way to express conditions on the JSON values. The biggest disadvantage I see is, that they can’t use existing GIN indexes on the whole JSON column. Obviously it is still possible to index specific expressions.