Optimizer Abstraction#
To implement a new optimizer prototype in PostBOUND, you need to implement an appropriate
OptimizationPipeline. Pipelines are mental models for different optimizer architectures.
This section describes the most commonly used pipelines and how to use them. The final section contains a summary of the
core optimizer-related data structures. If you are interested in the practical implementation of the different optimization
stages, check out the dedicated prototyping guide.
Textbook Optimizer Pipeline#
The textbook optimizer pipeline models the traditional optimizer architecture consisting of plan enumerator, cost model,
and cardinality estimator.
The plan enumerator generates a set of candidate plans.
They are evaluated by the cost model and ultimately the plan with the lowest cost is selected.
The cost model uses cardinality estimates as its principal input to assess how much work each plan will require.
See the TextBookOptimizationPipeline for a full rundown of all available
methods.
Essentially, you can provide any combination of plan enumerator, cost model, and cardinality estimator.
Architecture of a traditional textbook query optimizer.#
The
PlanEnumeratoris responsible for constructing and returning the final plan to be executed. It can use any algorithm it sees fit, such as traditional dynamic programming, greedy algorithms, or cascades-style enumeration. The enumerator also controls when to ask the cost model and cardinality estimator for their estimates. The main artifact of the enumerator is the plainQueryPlan.The
CostModelestimates the cost of a plan. It receives the current plan along with the query as input and computes the execution cost of the plan. The plan must not compute the entire query, but can also be responsible for a smaller part of it. The cost model can query the cardinality estimator as it sees fit. By convention, if the cost model cannot estimate a plan or if a plan is otherwise invalid, inf costs should be returned. Currently, costs are expressed as plain Python floats.The
CardinalityEstimatortakes care of estimating the number of rows that an (intermediate) relation will contain. It receives the tables that form the intermediate along with the query as input. Similar to the cost model, the intermediate will typically be a subset of the entire query. The estimator can assume that all applicable filters and join conditions have already been applied to the intermediate. Cardinalities are modelled using theCardinalityclass, which also allows to express unknown and infinite cardinalities.
Tip
If you do not implement your own plan enumerator, PostBOUND will select a dynamic programming-based algorithm (see
DynamicProgrammingEnumerator) by
default.
If your target database happens to be a Postgres system, a DP algorithm specifically designed to mimic the Postgres
algorithm will be used (see PostgresDynProg).
In contrast to the multi-stage pipeline, we cannot simply let the target database system supply its own enumerator, because it is typically the enumerators job to request cost and cardinality estimates. This would mean that the target database system would need some way to call back into PostBOUND to obtain these estimates. While that is certainly an exciting possibility, we do currently not have the resources to investigate it further.
Multi-stage Optimizer Pipeline#
The multi-stage optimizer follows a sequential optimization approach. The pipeline starts by computing a join order for the given query. Based on this join order, the optimizer the computes the best physical operators for each join and scan operation. Since both stages are optional, this pipeline is particularly well suited for scenarios where only a portion of the optimization decisions should be overwritten and the native query optimizer should be used for the rest. For example, you can use the multi-stage pipeline to only compute a join order. In this case, the native optimizer has to select the best physical operators for each join and scan operation on its own. To further support this use-case, you can also generate additional parameters in the pipeline. These are used to restrict the native optimizer, e.g. by providing cardinality estimates for (some of the) relations. These estimates would then overwrite the native cardinality estimator.
Architecture of a multi-stage query optimizer.#
See the MultiStageOptimizationPipeline for a full rundown of all
available methods.
Essentially, you can provide any combination of join ordering, operator selection and plan parameterization:
The
JoinOrderOptimizationis responsible for computing theJoinTreeof the query. If this stage is skipped, the native optimizer has to perform its own join ordering.The
PhysicalOperatorSelectiondetermines the scan and join operators for each intermediate of the query. Those are encoded in thePhysicalOperatorAssignment. If this stage is skipped, the native optimizer has to select its own physical operators. If the join order stage is skipped, the selected operators will only be used if their corresponding intermediates are actually calculated.The
PlanParameterizationallows to generate metadata for the query plan. Currently, this includes cardinality estimates and parallel workers. How these parameters are used heavily depends on which other stages are used:If a join ordering is performed, the cardinality estimates affect the operator selection (unless the operator selection is also used). Otherwise, cardinality estimates are used to determine the best join tree.
If an operator selection is performed, the cardinality estimates can be used to determine whether a parallel computation of the operator is beneficial (unless parallel workers are selected explicitly). Otherwise, the cardinality estimates directly influence the operator selection.
If neither join ordering nor operator selection is performed, the supplied cardinalities overwrite the native cardinality estimator.
Tip
If your research area is cardinality estimation, both the textbook pipeline as well as the multi-stage pipeline can
be used to implement prototypes. However, if you are not interested in plan enumeration or cost model, it might be
a better idea to test your cardinality estimator using the
MultiStageOptimizationPipeline.
The reasoning behind this recommendation is that the textbook pipeline requires a plan enumerator which currently cannot be re-used from the target database system (because it is the enumerator’s job to request cost and cardinality estimates, see the textbook pipeline for details). While the default enumerator can mimic the actual enumerator pretty well for PostgreSQL, it is still a simplification.
On the other hand, the multi-stage pipeline just passes the cardinality estimates from the plan parameterization to the native optimizer. Therefore, the key idea is to use a multi-stage pipeline that does not perform any join ordering or operator selection. During the plan parameterization stage, estimates for all potential intermediates are generated.
See the Cookbook for a practical example. In short, by implementing your own cardinality estimator, you can use it both in a textbook pipeline as well as in a multi-stage pipeline.
Integrated optimization pipeline#
TODO
Fundamental data structures#
Query plans#
The QueryPlan is the central artifact of the optimizer. It represents the physical execution
plan of a query (or a part of it). There are three main ways to obtain a query plan:
query plans can be created manually just like any other Python object
query plans are the main output of the optimization pipelines.
the
Databaseprovides anOptimizerInterfacewhich in turn has aquery_plan()(andanalyze_plan()) method to the query plan that the database system would use to execute a specific queryquery plans can be constructed from the optimizer artifacts (see below) using
to_query_plan()
Use the normal JSON export tools to export query plans as JSON.
read_query_plan_json() can be used to load a plan back from its JSON
representation.
Optimizer artifacts#
The JoinTree is used to represent the join order of a query. Nodes can be annotated
to keep track of additional information. Other than constructing a tree manually, trees can be extracted from query plans
via jointree_from_plan(). Use the normal JSON export tools to export
join trees as JSON. read_jointree_json() can be used to load a tree back
from its JSON representation.
PhysicalOperatorAssignment and PlanParameterization can be used to reperesent
(partial) optimizer decisions.
Similar to join trees, these artifacts can be either constructed manually or extracted from query plans using
operators_from_plan() or parameters_from_plan().
Use the normal JSON export tools to export them as JSON and read_operator_assignment_json() or
read_plan_params_json() to load them from their JSON representation.
Tip
If you want to convert a query plan into all artifacts, you can use
explode_query_plan() as a convenience function.
Join tree, operator assignment and plan parameterization can also be used to generate hinted queries. See the Cookbook for details.
Optimizer utilities#
To aid the implementation of new optimizers, PostBOUND provides a number of commonly used data structures:
The JoinGraph can be used to keep track of relations that have already been joined
or that can be integrated into a join tree.
Other utilities are implemented as pre-defined optimization stages, see Existing Optimizer Implementations for details.