Cookbook#

The cookbook demonstrates how to perform certain, frequently used patterns. Throughout the examples, we use the following setup:

In [1]: import postbound as pb

In [2]: pg_instance = pb.postgres.connect(config_file=".psycopg_connection")

In [3]: stats = pb.workloads.stats()

Cardinality estimation#

PostBOUND provides two main optimization pipelines to implement new cardinality estimators, the TextBookOptimizationPipeline and the MultiStageOptimizationPipeline. At a first glance, the TextBookOptimizationPipeline might seem to be the better suited one. However, this pipeline does not allow to “fill the gaps” as well, since it relies on a (simulated) plan enumerator for performance reasons. Therefore, plans generated by a textbook pipeline which does not specify its own enumerator can differ slightly from the plans that the native query optimizer would select (see the details of the default plan enumerator selection). In contrast, the MultiStageOptimizationPipeline allows to use only generated cardinality estimates as part of the PlanParameterization stage, even without specifying a join order or physical operators. This allows the “fill the gaps” approach to use the native query optimizer’s join ordering and operator selection strategies to take over.

Therefore, the recommended way to implement a new cardinality estimator is to use the CardinalityEstimator within a MultiStageOptimizationPipeline. If you should need to switch to a textbook pipeline later, you can do so without changing your estimator implementation. This is because the estimator class already provides default implementations to satisfy both interfaces.

Manual hinting#

You can easily generate hinted queries and execution plans using any of the fundamental optimizer data structures by talking directly to the HintService of your target Database:

In [4]: query = stats["q-10"]

In [5]: print(pb.qal.format_quick(query))
SELECT COUNT(*)
FROM comments AS c, posts AS p, users AS u
WHERE c.userid = u.id
  AND u.id = p.owneruserid
  AND c.creationdate >= CAST('2010-08-05 00:36:02' AS timestamp)
  AND c.creationdate <= CAST('2014-09-08 16:50:49' AS timestamp)
  AND p.viewcount >= 0
  AND p.viewcount <= 2897
  AND p.commentcount >= 0
  AND p.commentcount <= 16
  AND p.favoritecount >= 0
  AND p.favoritecount <= 10;

In [6]: operators = pb.PhysicalOperatorAssignment()

In [7]: operators.add(pb.JoinOperator.HashJoin, query.tables())
Out[7]: 
PhysicalOperatorAssignment
  Join operators:
    +- {u, c, p}: Hash Join

In [8]: operators
Out[8]: 
PhysicalOperatorAssignment
  Join operators:
    +- {u, c, p}: Hash Join

In [9]: hinted_query = pg_instance.hinting().generate_hints(query, physical_operators=operators)

In [10]: hinted_query
Out[10]: 
/*+
 HashJoin(u c p)
 */
SELECT COUNT(*)
FROM comments AS c, posts AS p, users AS u
WHERE c.userid = u.id
  AND u.id = p.owneruserid
  AND c.creationdate >= CAST('2010-08-05 00:36:02' AS timestamp)
  AND c.creationdate <= CAST('2014-09-08 16:50:49' AS timestamp)
  AND p.viewcount >= 0
  AND p.viewcount <= 2897
  AND p.commentcount >= 0
  AND p.commentcount <= 16
  AND p.favoritecount >= 0
  AND p.favoritecount <= 10;

In [11]: print(pg_instance.optimizer().query_plan(hinted_query).inspect())
Aggregate
  Estimated Cardinality=1, Estimated Cost=20226.35
  ->  Gather
        Estimated Cardinality=2, Estimated Cost=20226.34
        Parallel Workers=2
        ->  Aggregate
              Estimated Cardinality=3, Estimated Cost=19226.14
              ->  Nested Loop
                    Estimated Cardinality=4267287, Estimated Cost=15670.06
                    ->  Hash Join
                          Estimated Cardinality=215637, Estimated Cost=8747.24
                          ->  Seq Scan(comments AS c)
                                Estimated Cardinality=215637, Estimated Cost=7370.91
                          ->  Hash
                                Estimated Cardinality=71163, Estimated Cost=891.12
                                ->  Index Only Scan(users AS u)
                                      Estimated Cardinality=71163, Estimated Cost=891.12
                                      Index=users_pkey
                    ->  Memoize
                          Estimated Cardinality=3, Estimated Cost=0.79
                          ->  Index Scan(posts AS p)
                                Estimated Cardinality=3, Estimated Cost=0.78
                                Index=posts_owneruserid_fkey

Combined with the query transformation tools this is a powerful mechanism to obtain (partial) plans for arbitrary subqueries:

In [12]: subquery = pb.transform.extract_query_fragment(query, pb.TableReference("posts", "p"))

In [13]: print(pb.qal.format_quick(subquery))
SELECT COUNT(*)
FROM posts AS p
WHERE p.viewcount >= 0
  AND p.viewcount <= 2897
  AND p.commentcount >= 0
  AND p.commentcount <= 16
  AND p.favoritecount >= 0
  AND p.favoritecount <= 10;

In [14]: cards = pb.PlanParameterization()

In [15]: cards.add_cardinality(subquery.tables(), 42)
Out[15]: 
PlanParameterization
  Cardinalities:
    +- {p}: 42

In [16]: hinted_subquery = pg_instance.hinting().generate_hints(subquery, plan_parameters=cards)

Postgres Query Plans#

When working with Postgres, there are three basic ways to access query plans:

  1. You can retrieve the raw plan JSON using a plain execute_query()

  2. You can parse a raw plan into a PostgresExplainPlan, which is pretty much a 1:1 model of the raw plan with more expressive attribute access and some high-level access methods

  3. You can convert an explain into a proper normalized QueryPlan object

The conversion between the different formats works as follows:

In [17]: query = stats["q-10"]

In [18]: explain_query = pb.transform.as_explain(query)

In [19]: raw_plan = pg_instance.execute_query(explain_query)

In [20]: raw_plan
Out[20]: 
[{'Plan': {'Node Type': 'Aggregate',
   'Strategy': 'Plain',
   'Partial Mode': 'Finalize',
   'Parallel Aware': False,
   'Async Capable': False,
   'Startup Cost': 20226.34,
   'Total Cost': 20226.35,
   'Plan Rows': 1,
   'Plan Width': 8,
   'Plans': [{'Node Type': 'Gather',
     'Parent Relationship': 'Outer',
     'Parallel Aware': False,
     'Async Capable': False,
     'Startup Cost': 20226.13,
     'Total Cost': 20226.34,
     'Plan Rows': 2,
     'Plan Width': 8,
     'Workers Planned': 2,
     'Single Copy': False,
     'Plans': [{'Node Type': 'Aggregate',
       'Strategy': 'Plain',
       'Partial Mode': 'Partial',
       'Parent Relationship': 'Outer',
       'Parallel Aware': False,
       'Async Capable': False,
       'Startup Cost': 19226.13,
       'Total Cost': 19226.14,
       'Plan Rows': 1,
       'Plan Width': 8,
       'Plans': [{'Node Type': 'Nested Loop',
         'Parent Relationship': 'Outer',
         'Parallel Aware': False,
         'Async Capable': False,
         'Join Type': 'Inner',
         'Startup Cost': 1187.94,
         'Total Cost': 15670.06,
         'Plan Rows': 1422429,
         'Plan Width': 0,
         'Inner Unique': False,
         'Plans': [{'Node Type': 'Hash Join',
           'Parent Relationship': 'Outer',
           'Parallel Aware': True,
           'Async Capable': False,
           'Join Type': 'Inner',
           'Startup Cost': 1187.63,
           'Total Cost': 8747.24,
           'Plan Rows': 71879,
           'Plan Width': 8,
           'Inner Unique': True,
           'Hash Cond': '(c.userid = u.id)',
           'Plans': [{'Node Type': 'Seq Scan',
             'Parent Relationship': 'Outer',
             'Parallel Aware': True,
             'Async Capable': False,
             'Relation Name': 'comments',
             'Alias': 'c',
             'Startup Cost': 0.0,
             'Total Cost': 7370.91,
             'Plan Rows': 71879,
             'Plan Width': 4,
             'Filter': "((creationdate >= '2010-08-05 00:36:02'::timestamp without time zone) AND (creationdate <= '2014-09-08 16:50:49'::timestamp without time zone))"},
            {'Node Type': 'Hash',
             'Parent Relationship': 'Inner',
             'Parallel Aware': True,
             'Async Capable': False,
             'Startup Cost': 891.12,
             'Total Cost': 891.12,
             'Plan Rows': 23721,
             'Plan Width': 4,
             'Plans': [{'Node Type': 'Index Only Scan',
               'Parent Relationship': 'Outer',
               'Parallel Aware': True,
               'Async Capable': False,
               'Scan Direction': 'Forward',
               'Index Name': 'users_pkey',
               'Relation Name': 'users',
               'Alias': 'u',
               'Startup Cost': 0.29,
               'Total Cost': 891.12,
               'Plan Rows': 23721,
               'Plan Width': 4}]}]},
          {'Node Type': 'Memoize',
           'Parent Relationship': 'Inner',
           'Parallel Aware': False,
           'Async Capable': False,
           'Startup Cost': 0.3,
           'Total Cost': 0.79,
           'Plan Rows': 1,
           'Plan Width': 4,
           'Cache Key': 'c.userid',
           'Cache Mode': 'logical',
           'Plans': [{'Node Type': 'Index Scan',
             'Parent Relationship': 'Outer',
             'Parallel Aware': False,
             'Async Capable': False,
             'Scan Direction': 'Forward',
             'Index Name': 'posts_owneruserid_fkey',
             'Relation Name': 'posts',
             'Alias': 'p',
             'Startup Cost': 0.29,
             'Total Cost': 0.78,
             'Plan Rows': 1,
             'Plan Width': 4,
             'Index Cond': '(owneruserid = c.userid)',
             'Filter': '((viewcount >= 0) AND (viewcount <= 2897) AND (commentcount >= 0) AND (commentcount <= 16) AND (favoritecount >= 0) AND (favoritecount <= 10))'}]}]}]}]}]},
  'Settings': {'effective_cache_size': '6553MB',
   'effective_io_concurrency': '20',
   'maintenance_io_concurrency': '100'}}]

In [21]: postgres_plan = pb.postgres.PostgresExplainPlan(raw_plan)

In [22]: print(postgres_plan.inspect())
Aggregate(cost=20226.35 rows=1)
  <- Gather(cost=20226.34 rows=2)
    <- Aggregate(cost=19226.14 rows=1)
      <- Nested Loop(cost=15670.06 rows=1422429)
        <- Hash Join(cost=8747.24 rows=71879) Hash Cond: (c.userid = u.id)
          <- Seq Scan on c(cost=7370.91 rows=71879) Filter: ((creationdate >= '2010-08-05 00:36:02'::timestamp without time zone) AND (creationdate <= '2014-09-08 16:50:49'::timestamp without time zone))
          <- Hash(cost=891.12 rows=23721)
            <- Index Only Scan on u(cost=891.12 rows=23721)
        <- Memoize(cost=0.79 rows=1)
          <- Index Scan on p(cost=0.78 rows=1) Filter: ((viewcount >= 0) AND (viewcount <= 2897) AND (commentcount >= 0) AND (commentcount <= 16) AND (favoritecount >= 0) AND (favoritecount <= 10)) Index Cond: (owneruserid = c.userid)

In [23]: qep = postgres_plan.as_qep()

In [24]: print(qep.inspect())
Aggregate
  Estimated Cardinality=1, Estimated Cost=20226.35
  ->  Gather
        Estimated Cardinality=2, Estimated Cost=20226.34
        Parallel Workers=2
        ->  Aggregate
              Estimated Cardinality=3, Estimated Cost=19226.14
              ->  Nested Loop
                    Estimated Cardinality=4267287, Estimated Cost=15670.06
                    ->  Hash Join
                          Estimated Cardinality=215637, Estimated Cost=8747.24
                          ->  Seq Scan(comments AS c)
                                Estimated Cardinality=215637, Estimated Cost=7370.91
                          ->  Hash
                                Estimated Cardinality=71163, Estimated Cost=891.12
                                ->  Index Only Scan(users AS u)
                                      Estimated Cardinality=71163, Estimated Cost=891.12
                                      Index=users_pkey
                    ->  Memoize
                          Estimated Cardinality=3, Estimated Cost=0.79
                          ->  Index Scan(posts AS p)
                                Estimated Cardinality=3, Estimated Cost=0.78
                                Index=posts_owneruserid_fkey

JSON export#

To export arbitrary objets to JSON, PostBOUND provides a jsonize protocol. Essentially, all you need to do is a add a __json__ method to your class. This class can emit arbitrary objects that can either be JSON-serialized by Python’s standard JSON dump logic, or that provide a __json__ method themselves. To make sure that this method works, use the to_json() or to_json_dump() for the export. All of PostBOUND’s built-in JSON export does this automatically.

Miscellaneous utilities#

There are some general utilities that might make your life a little easier, mostly when it comes to working with one or multiple instances of some class.

Use enlist() if you need to make sure that you are always working with a list of objects. This function will wrap single objects into a list, but leave iterables as they are:

In [25]: pb.util.enlist(42)
Out[25]: [42]

In [26]: pb.util.enlist([1, 2, 3])
Out[26]: [1, 2, 3]

In [27]: pb.util.enlist("hello world")  # strings are treated as single objects
Out[27]: ['hello world']

simplify() can be thought of as the inverse. For iterables that only contain a single element, it provides that element. Otherwise, it leaves the iterable as it is. Therefore, this function should mostly be used in situations where you already know that you are dealing with a singleton list. The function is overloaded to provide type hints for precisely this situation.

In [28]: pb.util.simplify([42])
Out[28]: 42

In [29]: pb.util.simplify([1, 2, 3])
Out[29]: [1, 2, 3]

Use flatten() to flatten nested lists or iterables:

In [30]: pb.util.flatten([[1, 2], [3, 4]])
Out[30]: [1, 2, 3, 4]

Likewise, set_union() performs a union over multiple sets, thereby removing duplicates:

In [31]: pb.util.set_union([{1, 2, 3}, {2, 3, 4}, {4, 5}])
Out[31]: {1, 2, 3, 4, 5}