Cookbook#
The cookbook demonstrates how to perform certain, frequently used patterns. Throughout the examples, we use the following setup:
In [1]: import postbound as pb
In [2]: pg_instance = pb.postgres.connect(config_file=".psycopg_connection")
In [3]: stats = pb.workloads.stats()
Cardinality estimation#
PostBOUND provides two main optimization pipelines to implement new cardinality estimators, the
TextBookOptimizationPipeline and the MultiStageOptimizationPipeline. At a first glance,
the TextBookOptimizationPipeline might seem to be the better suited one. However, this pipeline does not
allow to “fill the gaps” as well, since it relies on a (simulated) plan enumerator for performance reasons. Therefore, plans
generated by a textbook pipeline which does not specify its own enumerator can differ slightly from the plans that the native
query optimizer would select (see the details of the default plan enumerator selection).
In contrast, the MultiStageOptimizationPipeline allows to use only generated cardinality estimates as part
of the PlanParameterization stage, even without specifying a join order or physical operators.
This allows the “fill the gaps” approach to use the native query optimizer’s join ordering and operator selection strategies to
take over.
Therefore, the recommended way to implement a new cardinality estimator is to use the
CardinalityEstimator within a MultiStageOptimizationPipeline. If you should need to
switch to a textbook pipeline later, you can do so without changing your estimator implementation. This is because the
estimator class already provides default implementations to satisfy both interfaces.
Manual hinting#
You can easily generate hinted queries and execution plans using any of the fundamental
optimizer data structures by talking directly to the HintService of
your target Database:
In [4]: query = stats["q-10"]
In [5]: print(pb.qal.format_quick(query))
SELECT COUNT(*)
FROM comments AS c, posts AS p, users AS u
WHERE c.userid = u.id
AND u.id = p.owneruserid
AND c.creationdate >= CAST('2010-08-05 00:36:02' AS timestamp)
AND c.creationdate <= CAST('2014-09-08 16:50:49' AS timestamp)
AND p.viewcount >= 0
AND p.viewcount <= 2897
AND p.commentcount >= 0
AND p.commentcount <= 16
AND p.favoritecount >= 0
AND p.favoritecount <= 10;
In [6]: operators = pb.PhysicalOperatorAssignment()
In [7]: operators.add(pb.JoinOperator.HashJoin, query.tables())
Out[7]:
PhysicalOperatorAssignment
Join operators:
+- {u, c, p}: Hash Join
In [8]: operators
Out[8]:
PhysicalOperatorAssignment
Join operators:
+- {u, c, p}: Hash Join
In [9]: hinted_query = pg_instance.hinting().generate_hints(query, physical_operators=operators)
In [10]: hinted_query
Out[10]:
/*+
HashJoin(u c p)
*/
SELECT COUNT(*)
FROM comments AS c, posts AS p, users AS u
WHERE c.userid = u.id
AND u.id = p.owneruserid
AND c.creationdate >= CAST('2010-08-05 00:36:02' AS timestamp)
AND c.creationdate <= CAST('2014-09-08 16:50:49' AS timestamp)
AND p.viewcount >= 0
AND p.viewcount <= 2897
AND p.commentcount >= 0
AND p.commentcount <= 16
AND p.favoritecount >= 0
AND p.favoritecount <= 10;
In [11]: print(pg_instance.optimizer().query_plan(hinted_query).inspect())
Aggregate
Estimated Cardinality=1, Estimated Cost=20226.35
-> Gather
Estimated Cardinality=2, Estimated Cost=20226.34
Parallel Workers=2
-> Aggregate
Estimated Cardinality=3, Estimated Cost=19226.14
-> Nested Loop
Estimated Cardinality=4267287, Estimated Cost=15670.06
-> Hash Join
Estimated Cardinality=215637, Estimated Cost=8747.24
-> Seq Scan(comments AS c)
Estimated Cardinality=215637, Estimated Cost=7370.91
-> Hash
Estimated Cardinality=71163, Estimated Cost=891.12
-> Index Only Scan(users AS u)
Estimated Cardinality=71163, Estimated Cost=891.12
Index=users_pkey
-> Memoize
Estimated Cardinality=3, Estimated Cost=0.79
-> Index Scan(posts AS p)
Estimated Cardinality=3, Estimated Cost=0.78
Index=posts_owneruserid_fkey
Combined with the query transformation tools this is a powerful mechanism to obtain (partial)
plans for arbitrary subqueries:
In [12]: subquery = pb.transform.extract_query_fragment(query, pb.TableReference("posts", "p"))
In [13]: print(pb.qal.format_quick(subquery))
SELECT COUNT(*)
FROM posts AS p
WHERE p.viewcount >= 0
AND p.viewcount <= 2897
AND p.commentcount >= 0
AND p.commentcount <= 16
AND p.favoritecount >= 0
AND p.favoritecount <= 10;
In [14]: cards = pb.PlanParameterization()
In [15]: cards.add_cardinality(subquery.tables(), 42)
Out[15]:
PlanParameterization
Cardinalities:
+- {p}: 42
In [16]: hinted_subquery = pg_instance.hinting().generate_hints(subquery, plan_parameters=cards)
Postgres Query Plans#
When working with Postgres, there are three basic ways to access query plans:
You can retrieve the raw plan JSON using a plain
execute_query()You can parse a raw plan into a
PostgresExplainPlan, which is pretty much a 1:1 model of the raw plan with more expressive attribute access and some high-level access methodsYou can convert an explain into a proper normalized
QueryPlanobject
The conversion between the different formats works as follows:
In [17]: query = stats["q-10"]
In [18]: explain_query = pb.transform.as_explain(query)
In [19]: raw_plan = pg_instance.execute_query(explain_query)
In [20]: raw_plan
Out[20]:
[{'Plan': {'Node Type': 'Aggregate',
'Strategy': 'Plain',
'Partial Mode': 'Finalize',
'Parallel Aware': False,
'Async Capable': False,
'Startup Cost': 20226.34,
'Total Cost': 20226.35,
'Plan Rows': 1,
'Plan Width': 8,
'Plans': [{'Node Type': 'Gather',
'Parent Relationship': 'Outer',
'Parallel Aware': False,
'Async Capable': False,
'Startup Cost': 20226.13,
'Total Cost': 20226.34,
'Plan Rows': 2,
'Plan Width': 8,
'Workers Planned': 2,
'Single Copy': False,
'Plans': [{'Node Type': 'Aggregate',
'Strategy': 'Plain',
'Partial Mode': 'Partial',
'Parent Relationship': 'Outer',
'Parallel Aware': False,
'Async Capable': False,
'Startup Cost': 19226.13,
'Total Cost': 19226.14,
'Plan Rows': 1,
'Plan Width': 8,
'Plans': [{'Node Type': 'Nested Loop',
'Parent Relationship': 'Outer',
'Parallel Aware': False,
'Async Capable': False,
'Join Type': 'Inner',
'Startup Cost': 1187.94,
'Total Cost': 15670.06,
'Plan Rows': 1422429,
'Plan Width': 0,
'Inner Unique': False,
'Plans': [{'Node Type': 'Hash Join',
'Parent Relationship': 'Outer',
'Parallel Aware': True,
'Async Capable': False,
'Join Type': 'Inner',
'Startup Cost': 1187.63,
'Total Cost': 8747.24,
'Plan Rows': 71879,
'Plan Width': 8,
'Inner Unique': True,
'Hash Cond': '(c.userid = u.id)',
'Plans': [{'Node Type': 'Seq Scan',
'Parent Relationship': 'Outer',
'Parallel Aware': True,
'Async Capable': False,
'Relation Name': 'comments',
'Alias': 'c',
'Startup Cost': 0.0,
'Total Cost': 7370.91,
'Plan Rows': 71879,
'Plan Width': 4,
'Filter': "((creationdate >= '2010-08-05 00:36:02'::timestamp without time zone) AND (creationdate <= '2014-09-08 16:50:49'::timestamp without time zone))"},
{'Node Type': 'Hash',
'Parent Relationship': 'Inner',
'Parallel Aware': True,
'Async Capable': False,
'Startup Cost': 891.12,
'Total Cost': 891.12,
'Plan Rows': 23721,
'Plan Width': 4,
'Plans': [{'Node Type': 'Index Only Scan',
'Parent Relationship': 'Outer',
'Parallel Aware': True,
'Async Capable': False,
'Scan Direction': 'Forward',
'Index Name': 'users_pkey',
'Relation Name': 'users',
'Alias': 'u',
'Startup Cost': 0.29,
'Total Cost': 891.12,
'Plan Rows': 23721,
'Plan Width': 4}]}]},
{'Node Type': 'Memoize',
'Parent Relationship': 'Inner',
'Parallel Aware': False,
'Async Capable': False,
'Startup Cost': 0.3,
'Total Cost': 0.79,
'Plan Rows': 1,
'Plan Width': 4,
'Cache Key': 'c.userid',
'Cache Mode': 'logical',
'Plans': [{'Node Type': 'Index Scan',
'Parent Relationship': 'Outer',
'Parallel Aware': False,
'Async Capable': False,
'Scan Direction': 'Forward',
'Index Name': 'posts_owneruserid_fkey',
'Relation Name': 'posts',
'Alias': 'p',
'Startup Cost': 0.29,
'Total Cost': 0.78,
'Plan Rows': 1,
'Plan Width': 4,
'Index Cond': '(owneruserid = c.userid)',
'Filter': '((viewcount >= 0) AND (viewcount <= 2897) AND (commentcount >= 0) AND (commentcount <= 16) AND (favoritecount >= 0) AND (favoritecount <= 10))'}]}]}]}]}]},
'Settings': {'effective_cache_size': '6553MB',
'effective_io_concurrency': '20',
'maintenance_io_concurrency': '100'}}]
In [21]: postgres_plan = pb.postgres.PostgresExplainPlan(raw_plan)
In [22]: print(postgres_plan.inspect())
Aggregate(cost=20226.35 rows=1)
<- Gather(cost=20226.34 rows=2)
<- Aggregate(cost=19226.14 rows=1)
<- Nested Loop(cost=15670.06 rows=1422429)
<- Hash Join(cost=8747.24 rows=71879) Hash Cond: (c.userid = u.id)
<- Seq Scan on c(cost=7370.91 rows=71879) Filter: ((creationdate >= '2010-08-05 00:36:02'::timestamp without time zone) AND (creationdate <= '2014-09-08 16:50:49'::timestamp without time zone))
<- Hash(cost=891.12 rows=23721)
<- Index Only Scan on u(cost=891.12 rows=23721)
<- Memoize(cost=0.79 rows=1)
<- Index Scan on p(cost=0.78 rows=1) Filter: ((viewcount >= 0) AND (viewcount <= 2897) AND (commentcount >= 0) AND (commentcount <= 16) AND (favoritecount >= 0) AND (favoritecount <= 10)) Index Cond: (owneruserid = c.userid)
In [23]: qep = postgres_plan.as_qep()
In [24]: print(qep.inspect())
Aggregate
Estimated Cardinality=1, Estimated Cost=20226.35
-> Gather
Estimated Cardinality=2, Estimated Cost=20226.34
Parallel Workers=2
-> Aggregate
Estimated Cardinality=3, Estimated Cost=19226.14
-> Nested Loop
Estimated Cardinality=4267287, Estimated Cost=15670.06
-> Hash Join
Estimated Cardinality=215637, Estimated Cost=8747.24
-> Seq Scan(comments AS c)
Estimated Cardinality=215637, Estimated Cost=7370.91
-> Hash
Estimated Cardinality=71163, Estimated Cost=891.12
-> Index Only Scan(users AS u)
Estimated Cardinality=71163, Estimated Cost=891.12
Index=users_pkey
-> Memoize
Estimated Cardinality=3, Estimated Cost=0.79
-> Index Scan(posts AS p)
Estimated Cardinality=3, Estimated Cost=0.78
Index=posts_owneruserid_fkey
JSON export#
To export arbitrary objets to JSON, PostBOUND provides a jsonize protocol. Essentially, all you need to do is a add a
__json__ method to your class. This class can emit arbitrary objects that can either be JSON-serialized by Python’s
standard JSON dump logic, or that provide a __json__ method themselves.
To make sure that this method works, use the to_json() or
to_json_dump() for the export. All of PostBOUND’s built-in JSON export does this automatically.
Miscellaneous utilities#
There are some general utilities that might make your life a little easier, mostly when it comes to working with one or multiple instances of some class.
Use enlist() if you need to make sure that you are always working with a list of objects.
This function will wrap single objects into a list, but leave iterables as they are:
In [25]: pb.util.enlist(42)
Out[25]: [42]
In [26]: pb.util.enlist([1, 2, 3])
Out[26]: [1, 2, 3]
In [27]: pb.util.enlist("hello world") # strings are treated as single objects
Out[27]: ['hello world']
simplify() can be thought of as the inverse. For iterables that only contain a single
element, it provides that element. Otherwise, it leaves the iterable as it is. Therefore, this function should mostly be used
in situations where you already know that you are dealing with a singleton list. The function is overloaded to provide type
hints for precisely this situation.
In [28]: pb.util.simplify([42])
Out[28]: 42
In [29]: pb.util.simplify([1, 2, 3])
Out[29]: [1, 2, 3]
Use flatten() to flatten nested lists or iterables:
In [30]: pb.util.flatten([[1, 2], [3, 4]])
Out[30]: [1, 2, 3, 4]
Likewise, set_union() performs a union over multiple sets, thereby removing duplicates:
In [31]: pb.util.set_union([{1, 2, 3}, {2, 3, 4}, {4, 5}])
Out[31]: {1, 2, 3, 4, 5}