Optimization Pipelines

class postbound.OptimizationPipeline#

The optimization pipeline is the main tool to apply different strategies to optimize SQL queries.

Depending on the specific scenario, different concrete pipeline implementations exist. For example, the MultiStageOptimizationPipeline supports a multi-stage optimization design (e.g. join ordering followed by physical operator selection). Similarly, for optimization algorithms that perform join ordering and operator selection in one process, an IntegratedOptimizationPipeline is available. The TextBookOptimizationPipeline is modelled after the traditional interplay of cardinality estimator, cost model and plan enumerator. Lastly, to model approaches that incrementally improve query plans by correcting some previous optimization decisions (e.g. transforming a hash join to a nested loop join), the IncrementalOptimizationPipeline is provided. Consult the individual pipeline documentation for more details. This class only describes the basic interface that is shared by all pipeline implementations.

If in doubt about which pipeline implementation to use, start with the MultiStageOptimizationPipeline or the TextBookOptimizationPipeline, since they are the most flexible.

Training of Optimization Stages#

Some optimization strategies require training on the database, the workload or sample queries. For example, a learned cardinality estimator might need prior access to the statistics catalog of the target database or the database schema.

Each pipeline automatically analyzes the requirements of its optimization stages and passes the appropriate training data to them. This is tightly integrated into the benchmarking tools, which in turn analyze what each pipeline needs (based on its optimization stages) and provide the necessary training data to the pipeline. If you are using an optimization pipeline outside of the benchmarking utilities, you must call the training methods described below yourself.

The training process is organized in the following steps:

Training on the target database is performed first. This is handled by the train_on_database method. Use requires_data_training to check whether this step is necessary.
Training on the workload is performed next. The train_on_workload method is responsible for this step. Use requires_workload_training to check whether it is necessary.
Training on sample queries is performed after that. This is handled by the train_on_samples method. Use requires_sample_training to check whether this step is necessary.
Finally, some optimization stages might require training on actual query executions in an online fashion. The learn_from_feedback method takes care of this process. Use requires_online_training to check whether this step is necessary.
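
When a pipeline is used outside of the benchmarking utilities, the offline steps above can be driven by hand. The following is a minimal sketch under assumptions: train_offline and the _RecordingPipeline stand-in are hypothetical names introduced for illustration, and only the requires_* and train_on_* method names are taken from the interface described here. With a real pipeline, the stand-in would be replaced by the pipeline instance.

```python
from typing import Any


def train_offline(pipeline: Any, database: Any, workload: Any, samples: Any) -> dict:
    """Run the offline training steps in the documented order.

    Returns the per-stage metrics of every executed step, keyed by step name.
    """
    metrics: dict = {}
    if pipeline.requires_data_training():
        metrics["database"] = pipeline.train_on_database(database)
    if pipeline.requires_workload_training():
        metrics["workload"] = pipeline.train_on_workload(workload, database)
    if pipeline.requires_sample_training():
        metrics["samples"] = pipeline.train_on_samples(samples)
    return metrics


class _RecordingPipeline:
    """Hypothetical stand-in that only records which training steps were called."""

    def __init__(self) -> None:
        self.calls = []

    def requires_data_training(self) -> bool:
        return True

    def requires_workload_training(self) -> bool:
        return False  # pretend that no stage needs the workload

    def requires_sample_training(self) -> bool:
        return True

    def train_on_database(self, database):
        self.calls.append("database")
        return {"my_stage": {"step": "database"}}

    def train_on_workload(self, workload, database):
        self.calls.append("workload")
        return {}

    def train_on_samples(self, samples):
        self.calls.append("samples")
        return {"other_stage": {"step": "samples"}}


demo = _RecordingPipeline()
collected = train_offline(demo, database=None, workload=None, samples=None)
print(demo.calls)
```

Online training is left out of this sketch on purpose, since learn_from_feedback has to be called after each individual query execution rather than once up front.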

abstractmethod query_execution_plan(query: SqlQuery) → QueryPlan#

Applies the current pipeline configuration to obtain an optimized plan for the input query.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

An optimized query execution plan for the input query.

If the optimization strategies only provide partial optimization decisions (e.g. physical operators for a subset of the joins), it is up to the pipeline to fill the gaps in order to provide a complete execution plan. A typical approach could be to delegate this task to the optimizer of the target database by providing it the partial optimization information.

Return type:

QueryPlan

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

optimize_query(query: SqlQuery) → SqlQuery#

Applies the current pipeline configuration to optimize the input query.

This process also involves the generation of appropriate optimization information that enforces the selected optimization decision when the query is executed on an actual database system.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

A transformed query that encapsulates all the optimization decisions made by the pipeline. What this actually means depends on the selected optimization strategies, as well as specifics of the target database system:

Depending on the optimization strategy, the optimization decisions can range from simple operator selections (such as “no nested loop join for this join”) to entire physical query execution plans (consisting of a join order, as well as scan and join operators for all parts of the plan) and anything in between. For novel cardinality estimation approaches, the optimization info could also be structured such that the default cardinality estimates are overwritten.

Furthermore, the way the optimization info is expressed depends on the selected database system. Most systems do not allow a direct modification of the query optimizer’s implementation. Therefore, PostBOUND takes an indirect approach: it emits system-specific hints that enable corrections for individual optimizer decisions (such as disabling a specific physical operator). For example, PostgreSQL provides planner options such as SET enable_nestloop = 'off' to disable nested loop joins for all subsequent queries in the current connection. MySQL provides hints like BNL(R S) to recommend a block-nested loop join or hash join (depending on the MySQL version) to the optimizer for a specific join. These hints are inserted into comment blocks in the final SQL query. Likewise, some systems treat certain SQL keywords differently or provide their own extensions, which also makes it possible to modify the underlying plans. For example, when SQLite encounters the CROSS JOIN syntax in the FROM clause, it does not try to optimize the join order and instead uses the order in which the tables are specified.

Therefore, the resulting query will differ from the original input query in a number of ways. However, the produced result sets should still be equivalent. If this is not the case, something went severely wrong during query optimization. Take a look at the db module for more details on the database system support and the query generation capabilities.

Return type:

SqlQuery

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.
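
The PostgreSQL example from the description above can be illustrated at the string level. This is only a sketch: with_planner_settings is a hypothetical helper and not part of PostBOUND, whose query generation operates on SqlQuery objects (see the db module). Only the enable_nestloop option is taken from the text.

```python
def with_planner_settings(query: str, settings: dict) -> str:
    """Prefix a query with one PostgreSQL SET statement per planner option."""
    preamble = [f"SET {option} = '{value}';" for option, value in settings.items()]
    return "\n".join([*preamble, query])


hinted = with_planner_settings(
    "SELECT * FROM R JOIN S ON R.a = S.b;",
    {"enable_nestloop": "off"},
)
print(hinted)
```

Because such a setting affects all subsequent queries in the connection, it stays active until it is changed or reset.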

abstractmethod stages() → Collection[OptimizationStage]#

Provides all optimization stages that are part of the pipeline. The order of the stages is not relevant.

Return type:: Collection[OptimizationStage]

abstractmethod target_database() → Database#

Provides the current target database.

Return type:: Database

requires_data_training() → bool#

Checks whether any of the selected optimization stages requires training on the database.

Return type:: bool

train_on_database(database: Database) → Mapping[str, TrainingMetrics]#

Trains all optimization stages that require training on the database.

Returns:: A dictionary mapping each optimization stage to the metrics obtained during training. Optimization stages are identified by their name. If a stage does not require training on the database or has already completed its training, it is not included in the output.
Return type:: Mapping[str, TrainingMetrics]
Parameters:: database (Database)

requires_workload_training() → bool#

Checks whether any of the selected optimization stages requires training on the workload.

Return type:: bool

train_on_workload(workload: Workload, database: Database) → Mapping[str, TrainingMetrics]#

Trains all optimization stages that require training on the workload.

Returns:

A dictionary mapping each optimization stage to the metrics obtained during training. Optimization stages are identified by their name. If a stage does not require training on the workload or has already completed its training, it is not included in the output.

Return type:

Mapping[str, TrainingMetrics]

Parameters:

workload (Workload)
database (Database)

requires_sample_training() → bool#

Checks whether any of the selected optimization stages requires training on sample queries.

Return type:: bool

train_on_samples(samples: TrainingData | TrainingDataRepository) → Mapping[str, TrainingMetrics]#

Trains all optimization stages that require training on sample queries.

Returns:: A dictionary mapping each optimization stage to the metrics obtained during training. Optimization stages are identified by their name. If a stage does not require training on sample queries or has already completed its training, it is not included in the output.
Return type:: Mapping[str, TrainingMetrics]
Parameters:: samples (TrainingData | TrainingDataRepository)

sample_fit_completed() → bool#

Checks whether all optimization stages that require training on sample queries have completed their training.

Return type:: bool

requires_online_training() → bool#

Checks whether any of the selected optimization stages requires training on actual query executions.

Return type:: bool

learn_from_feedback(query: SqlQuery, result_set: Sequence[tuple], *, exec_time: float) → Mapping[str, TrainingMetrics]#

Trains all optimization stages that require training on actual query executions.

Parameters:

query (SqlQuery) – The query that has been executed
result_set (ResultSet) – The result set obtained from executing the query
exec_time (TimeMs) – The execution time of the query in milliseconds

Returns:

A dictionary mapping each optimization stage to the metrics obtained during training. Optimization stages are identified by their name. If a stage does not require training on actual query executions, it is not included in the output.

Return type:

Mapping[str, TrainingMetrics]

abstractmethod describe() → jsondict#

Generates a description of the current pipeline configuration.

This description is intended to transparently document which optimization strategies have been selected and how they have been instantiated. It can be JSON-serialized and will be included in the output of the benchmarking utilities.

Returns:: The actual description
Return type:: jsondict

class postbound.OptimizationStage(*args, **kwargs)#

Optimization stages are the core building blocks of optimization pipelines.

Each stage implements a different, pipeline-specific step of the optimization process. When developing a new optimizer prototype, you generally identify which mental optimizer architecture is most suitable for your needs (i.e. which pipeline you need to use) and then implement the relevant stages for that pipeline and your specific idea.

Optimization stages are generally “hooks” that extend the optimization process at specific points. If you do not care about a specific stage, you simply do not implement it. The pipeline will either skip the stage entirely or use a reasonable default implementation. However, there might be some stages that are required for a specific pipeline. Check the documentation of the pipeline you want to use for more details.

Customizing Your Pipeline#

When implementing a new optimization stage, the specific type of stage has at least one abstract method that you have to implement. This method is used to provide the core logic of the stage. For example, the JoinOrderOptimization stage requires you to implement the optimize_join_order method.

In addition, you should also override the describe method. This method provides a JSON-serializable description of the specific optimization strategy along with important parameters. This information is used (among others) for benchmarking to document precisely how a pipeline was set up, with the end goal of being able to debug and reproduce results. For example, you could provide sample sizes, learning rates, etc. in the description. The default implementation only provides a name for the stage.
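
As an illustration of such a description, consider the following sketch. SamplingEstimator and its parameters are hypothetical and not part of PostBOUND; the point is only that describe returns plain JSON types, so the result can be serialized directly.

```python
import json


class SamplingEstimator:
    """Hypothetical stage-like class; only the describe method is of interest here."""

    def __init__(self, sample_size: int, seed: int) -> None:
        self.sample_size = sample_size
        self.seed = seed

    def describe(self) -> dict:
        # Only plain JSON types (str, int, float, bool, list, dict, None) are used,
        # so the description can be serialized without a custom encoder.
        return {
            "name": "sampling-estimator",
            "sample_size": self.sample_size,
            "seed": self.seed,
        }


description = SamplingEstimator(sample_size=1000, seed=42).describe()
serialized = json.dumps(description)
print(serialized)
```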

Furthermore, you can implement the pre_check method to provide requirements that the input query or database system have to satisfy for the optimization stage to work properly. For example, if your hook only works for equi-join predicates, the pre-check can verify that the input query contains only such predicates. The benchmarking tools will make sure that only supported queries are passed to the stage.

Finally, optimization stages provide a number of methods related to training. Specifically, each stage can specify that it needs to be trained on the database, the workload, or some sort of training samples in order to work properly. This is again realized by a set of methods that you can implement to provide the actual training logic. The benchmarking tools will analyze the final optimization pipeline and make sure to initialize all of its stages with the appropriate kind of training data.

The training methods generally come in pairs: one method performs the actual training logic and another one indicates whether the training process has already been completed. If you implement one method of a pair, you always have to implement the other one as well. Otherwise, PostBOUND will raise an error when creating an instance of your stage. The training methods should return a TrainingMetrics object that contains information about the training process, e.g. how long it took, how many samples were used, etc. This information is included in the benchmark results for later analysis. PostBOUND does not make any assumptions about the kind of information that is included in the metrics object, so you should include whatever makes sense for your specific training process.

The entire training process is completely optional. If you do not require any kind of training, you don’t need to do anything.
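
One way the pairing rule for training methods and a reflection-based check for training support could be realized is sketched below. This is an illustration under assumptions, not PostBOUND's actual implementation: StageSketch and its subclasses are hypothetical, and the check simply compares which methods a subclass overrides.

```python
class StageSketch:
    """Hypothetical base class with an optional pair of training hooks."""

    def fit_database(self, database):
        raise NotImplementedError

    def database_fit_completed(self) -> bool:
        raise NotImplementedError

    @classmethod
    def _overrides(cls, method_name: str) -> bool:
        # A method counts as implemented if the subclass replaced the base version.
        return getattr(cls, method_name) is not getattr(StageSketch, method_name)

    @classmethod
    def requires_data_training(cls) -> bool:
        return cls._overrides("fit_database")

    def __init__(self) -> None:
        pair = ("fit_database", "database_fit_completed")
        implemented = [self._overrides(name) for name in pair]
        if any(implemented) and not all(implemented):
            raise TypeError(f"{type(self).__name__} must implement both of {pair}")


class UntrainedStage(StageSketch):
    """Does not need any training, so it implements neither method."""


class TrainedStage(StageSketch):
    def __init__(self) -> None:
        super().__init__()
        self._fitted = False

    def fit_database(self, database):
        self._fitted = True
        return {"tables_analyzed": 0}  # stand-in for a TrainingMetrics object

    def database_fit_completed(self) -> bool:
        return self._fitted


class BrokenStage(StageSketch):
    def fit_database(self, database):  # the matching completion check is missing
        return {}
```

In this sketch, creating a BrokenStage instance fails with a TypeError, while UntrainedStage and TrainedStage can be instantiated normally.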

Data-driven Training#

This kind of training gives you access to the target database. You can execute arbitrary queries, fetch statistics, analyze the schema, etc. For example, this can be used to implement new kinds of statistics for cardinality estimation. To use data-driven training, implement the fit_database and database_fit_completed methods.

Workload-based Training#

This kind of training gives you access to the entire workload of queries that will be optimized. Since this is a severe leak of the test set, you should only use it to extract general information about the workload. For example, many research ideas need to know which joins are executed in the workload, or which columns are used for specific filter predicates. To use workload-based training, implement the fit_workload and workload_fit_completed methods.

Sample-based Training#

This kind of training implements traditional offline training based on a set of pre-computed training samples. Samples can contain arbitrary information. For example, a learned cardinality estimator might require pairs of SQL queries and their actual cardinalities as training samples. Since multiple optimization stages might require similar training data, PostBOUND implements a flexible system to describe the required training samples. Therefore, you need to implement three methods to use sample-based training: the usual fit_samples and sample_fit_completed methods, as well as a sample_spec method. This method provides a description of the kind of information that is required for training. The benchmarking tools will analyze the available training data and make sure to provide the appropriate samples to the stage.
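
The idea of matching available training data against a description of the required samples can be sketched as follows. The representation is an assumption made for illustration: the required columns are modelled as a plain set of names and the samples as a list of dictionaries, and missing_columns is a hypothetical helper. The actual TrainingSpec and TrainingData classes may look different.

```python
def missing_columns(required: set, rows: list) -> set:
    """Return the required columns that are absent from at least one row."""
    missing = set()
    for row in rows:
        missing |= required - set(row)
    return missing


# Hypothetical requirements of a learned cardinality estimator:
# SQL queries together with their actual cardinalities.
required_columns = {"query", "cardinality"}

samples = [
    {"query": "SELECT * FROM R WHERE R.a < 10", "cardinality": 42},
    {"query": "SELECT * FROM S WHERE S.b = 1", "cardinality": 7},
]

print(missing_columns(required_columns, samples))
```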

Online Training#

This kind of training allows you to learn from the actual execution of past queries, live during the benchmark. The benchmarking tools will provide the executed query, its execution time, and the raw result set. This can be used to implement a wide range of reinforcement learning-style approaches. To use online training, implement the learn_from_feedback method. The uses_online_feedback method detects this via reflection and does not need to be implemented.
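
A minimal sketch of what an online learning hook could look like is given below. FeedbackTracker is a hypothetical stand-in that mirrors the learn_from_feedback signature but uses plain strings instead of SqlQuery objects and a dictionary instead of a TrainingMetrics object.

```python
from typing import Sequence


class FeedbackTracker:
    """Hypothetical stage-like class that remembers the fastest run per query."""

    def __init__(self) -> None:
        self.best_time_ms = {}

    def learn_from_feedback(
        self, query: str, result_set: Sequence[tuple], *, exec_time: float
    ) -> dict:
        previous = self.best_time_ms.get(query)
        improved = previous is None or exec_time < previous
        if improved:
            self.best_time_ms[query] = exec_time
        # Stand-in for a TrainingMetrics object.
        return {"improved": improved, "rows_seen": len(result_set)}


tracker = FeedbackTracker()
first = tracker.learn_from_feedback("SELECT 1", [(1,)], exec_time=12.5)
second = tracker.learn_from_feedback("SELECT 1", [(1,)], exec_time=20.0)
print(first, second)
```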

fit_database(database: Database) → TrainingMetrics#

Performs training based on the target database.

Notes

If this method is implemented, the database_fit_completed method has to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Parameters:: database (Database)
Return type:: TrainingMetrics

database_fit_completed() → bool#

Checks whether a data-driven optimization stage has already been trained on the target database.

Notes

If this method is implemented, the fit_database method has to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Return type:: bool

classmethod requires_data_training() → bool#

Checks whether this optimization stage supports data-driven training.

In contrast to database_fit_completed, this method does not check whether a specific instance of the optimization stage has already been trained on the target database, but rather whether the stage in general supports data-driven training.

Notes

This method uses reflection on the optimization stage and does not need to be implemented/overridden by the client.

Return type:: bool

fit_workload(queries: Workload, database: Database) → TrainingMetrics#

Performs training based on the entire workload of queries.

This method is automatically called by the benchmarking tools before the actual optimization process starts. The training process can include arbitrary interactions with the workload, e.g. analyzing which joins are executed, or which columns are used for specific filter predicates. Since this is a severe leak of the test set, it should only be used to extract general information about the workload.

Notes

If this method is implemented, the workload_fit_completed method has to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Parameters:

queries (Workload)
database (Database)

Return type:

TrainingMetrics

workload_fit_completed() → bool#

Checks whether a workload-driven optimization stage has already been trained on the target workload.

Notes

If this method is implemented, the fit_workload method has to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Return type:: bool

classmethod requires_workload_training() → bool#

Checks whether this optimization stage supports workload-driven training.

In contrast to workload_fit_completed, this method does not check whether a specific instance of the optimization stage has already been trained, but rather whether the stage in general supports workload-driven training.

Notes

This method uses reflection on the optimization stage and does not need to be implemented/overridden by the client.

Return type:: bool

fit_samples(samples: TrainingData) → TrainingMetrics#

Performs training based on a set of pre-computed training samples.

This method is automatically called by the benchmarking tools before the actual optimization process starts. It is completely up to the implementation to decide what to do with the training samples.

To make sure that the benchmarking tools provide the appropriate training samples, the sample_spec method is used. This method must describe the kind of data required for training.

Notes

If this method is implemented, the sample_spec and sample_fit_completed methods have to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Parameters:: samples (TrainingData)
Return type:: TrainingMetrics

sample_spec() → TrainingSpec#

Describes the structure of the training samples that are required to train this optimization stage.

PostBOUND uses a tabular model for training data. The TrainingSpec describes what columns need to be present in a dataset. However, we currently cannot enforce any specific semantics for the columns. This needs to be handled by the user. This is a pragmatic choice to prevent us from implementing a full meta-model of data sets and to provide a rather lightweight interface to the user.

Notes

If this method is implemented, the fit_samples and sample_fit_completed methods have to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Return type:: TrainingSpec

sample_fit_completed() → bool#

Checks whether a sample-driven optimization stage has already been trained.

Notes

If this method is implemented, the fit_samples and sample_spec methods have to be implemented as well. Otherwise, PostBOUND will raise an error when an instance of this stage is created.

Return type:: bool

classmethod requires_sample_training() → bool#

Checks whether this optimization stage supports sample-based training.

In contrast to sample_fit_completed, this method does not check whether a specific instance of the optimization stage has already been trained, but rather whether the stage in general supports sample-based training.

Notes

This method uses reflection on the optimization stage and does not need to be implemented/overridden by the client.

Return type:: bool

learn_from_feedback(query: SqlQuery, result_set: Sequence[tuple], *, exec_time: float) → TrainingMetrics#

Performs online learning based on the execution of a past query.

This method is automatically called by the benchmarking tools after each query is executed. Only valid runs are considered. If the query timed out or produced an error, this method will not be called.

Parameters:

query (SqlQuery) – The query that was executed, exactly as it was passed to the database system for execution.
result_set (ResultSet) – The raw result set that was returned by the database system after executing the query. This is not processed in any way, so it is up to the implementation to extract any relevant information from it.
exec_time (TimeMs) – The execution time of the query in milliseconds. This is measured directly by the benchmarking tools and will always be a valid number (i.e. not NaN, negative, nor infinite).

Return type:

TrainingMetrics

Notes

This method stands on its own and does not require any other methods to be implemented.

classmethod uses_online_feedback() → bool#

Checks whether this optimization stage supports online learning.

This method does not check whether a specific instance of the optimization stage has already learned from any feedback, but rather whether the stage in general supports online learning.

Notes

This method uses reflection on the optimization stage and does not need to be implemented/overridden by the client.

Return type:: bool

pre_check() → OptimizationPreCheck#

Provides requirements that the input query or database system has to satisfy for the optimizer to work properly.

Returns:: The check instance. Can be an empty check if no specific requirements exist.
Return type:: OptimizationPreCheck

describe() → jsondict#

Provides a JSON-serializable representation of the specific strategy, as well as important parameters.

Returns:: The description
Return type:: jsondict

See also

OptimizationPipeline.describe


Textbook Pipeline#

class postbound.TextBookOptimizationPipeline(target_db: Database)#

This pipeline is modelled after the traditional approach to query optimization as used in most real-world systems.

The optimizer consists of a cardinality estimator that calculates the size of intermediate results, a cost model that quantifies how expensive specific access paths for the intermediates are, and an enumerator that generates the intermediates in the first place.

To configure the pipeline, specific strategies for each of the three components have to be assigned.

Parameters:: target_db (Database) – The database for which the optimized queries should be generated.

target_database() → Database#

Provides the current target database.

Return type:: Database

setup_cardinality_estimator(estimator: CardinalityEstimator) → Self#

Configures the cardinality estimator of the optimizer.

Setting a new algorithm requires the pipeline to be built again.

Parameters:: estimator (CardinalityEstimator) – The estimator to be used
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: Self

setup_cost_model(cost_model: CostModel) → Self#

Configures the cost model of the optimizer.

Setting a new algorithm requires the pipeline to be built again.

Parameters:: cost_model (CostModel) – The cost model to be used
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: Self

setup_plan_enumerator(plan_enumerator: PlanEnumerator) → Self#

Configures the plan enumerator of the optimizer.

Setting a new algorithm requires the pipeline to be built again.

Parameters:: plan_enumerator (PlanEnumerator) – The enumerator to be used
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: Self

use(component: PlanEnumerator | CostModel | CardinalityEstimator) → Self#

Shortcut method to set up the pipeline. Delegates to the appropriate setup_XXX method.

Parameters:: component (PlanEnumerator | CostModel | CardinalityEstimator)
Return type:: Self

build() → Self#

Constructs the optimization pipeline.

This includes checking all strategies for compatibility with the target_db. Afterwards, the pipeline is ready to optimize queries.

Returns:: The current pipeline to allow for easy method-chaining.
Return type:: Self
Raises:: UnsupportedSystemError – If any of the selected optimization stages is not compatible with the target_db.
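
The configuration style described above (setup methods that return the pipeline, a use shortcut that picks the right setup method, and a final build) can be sketched with stand-ins. All class names below are hypothetical and exist only to keep the example self-contained; with PostBOUND, the real pipeline and component classes would take their place.

```python
class EnumeratorStub: ...
class CostModelStub: ...
class EstimatorStub: ...


class PipelineSketch:
    """Hypothetical stand-in that mirrors the setup/use/build methods described above."""

    def __init__(self) -> None:
        self.enumerator = None
        self.cost_model = None
        self.estimator = None
        self.built = False

    def setup_plan_enumerator(self, plan_enumerator):
        self.enumerator = plan_enumerator
        self.built = False  # a new component requires the pipeline to be built again
        return self

    def setup_cost_model(self, cost_model):
        self.cost_model = cost_model
        self.built = False
        return self

    def setup_cardinality_estimator(self, estimator):
        self.estimator = estimator
        self.built = False
        return self

    def use(self, component):
        # Delegate to the matching setup method based on the component type.
        if isinstance(component, EnumeratorStub):
            return self.setup_plan_enumerator(component)
        if isinstance(component, CostModelStub):
            return self.setup_cost_model(component)
        if isinstance(component, EstimatorStub):
            return self.setup_cardinality_estimator(component)
        raise TypeError(f"unsupported component: {type(component).__name__}")

    def build(self):
        self.built = True
        return self


# Returning self from every method allows the calls to be chained.
pipeline = PipelineSketch().use(EstimatorStub()).use(CostModelStub()).build()
```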

query_execution_plan(query: SqlQuery) → QueryPlan#

Applies the current pipeline configuration to obtain an optimized plan for the input query.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

An optimized query execution plan for the input query.

Return type:

QueryPlan

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

stages() → list[OptimizationStage]#

Provides all optimization stages that are part of the pipeline. The order of the stages is not relevant.

Return type:: list[OptimizationStage]

describe() → jsondict#

Generates a description of the current pipeline configuration.

Returns:: The actual description
Return type:: jsondict

class postbound.PlanEnumerator(*args, **kwargs)#

The plan enumerator traverses the space of different candidate plans and ultimately selects the optimal one.

Implement the generate_execution_plan method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod generate_execution_plan(query: SqlQuery, *, cost_model: CostModel, cardinality_estimator: CardinalityEstimator) → QueryPlan#

Computes the optimal plan to execute the given query.

Parameters:

query (SqlQuery) – The query to optimize
cost_model (CostModel) – The cost model to compare different candidate plans
cardinality_estimator (CardinalityEstimator) – The cardinality estimator to calculate the sizes of intermediate results

Returns:

The query plan

Return type:

QueryPlan

Notes

The precise generation “style” (e.g. top-down vs. bottom-up, complete plans vs. plan fragments, etc.) is completely up to the specific algorithm. Therefore, it is really hard to provide a more expressive interface for the enumerator beyond just generating a plan. Generally the enumerator should query the cost model to compare different candidates. The top-most operator of each candidate will usually not have a cost estimate set at the beginning and it is the enumerator’s responsibility to set the estimate correctly. The jointree.update_cost_estimate function can be used to help with this.
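
To make the interplay between enumerator and cost model more concrete, the following sketch shows one possible enumeration style: a greedy, left-deep search that asks a cost function to rank candidates. Everything here is a simplification under assumptions. Plans are nested tuples of table names instead of QueryPlan objects, a single cost_of callable stands in for the cost model and the cardinality estimator, and the statistics are made up.

```python
from itertools import combinations
from math import prod
from typing import Callable, Iterable, Union

Plan = Union[str, tuple]  # a base table name or a (left, right) join


def tables_of(plan: Plan) -> list:
    """Collect the base tables of a plan, left to right."""
    if isinstance(plan, str):
        return [plan]
    left, right = plan
    return tables_of(left) + tables_of(right)


def greedy_left_deep(tables: Iterable[str], cost_of: Callable[[Plan], float]) -> Plan:
    """Pick the cheapest pair first, then repeatedly append the cheapest next table.

    Expects at least two tables.
    """
    remaining = sorted(tables)
    plan: Plan = min(combinations(remaining, 2), key=cost_of)
    for table in plan:
        remaining.remove(table)
    while remaining:
        next_table = min(remaining, key=lambda t: cost_of((plan, t)))
        plan = (plan, next_table)
        remaining.remove(next_table)
    return plan


# Hypothetical statistics: base table sizes and one selectivity shared by all joins.
BASE_CARDINALITY = {"R": 1000, "S": 10, "T": 100}
JOIN_SELECTIVITY = 0.01


def estimated_rows(plan: Plan) -> float:
    tables = tables_of(plan)
    return prod(BASE_CARDINALITY[t] for t in tables) * JOIN_SELECTIVITY ** (len(tables) - 1)


def cost_of(plan: Plan) -> float:
    """Sum of the sizes of all intermediate results, including base table scans."""
    if isinstance(plan, str):
        return estimated_rows(plan)
    left, right = plan
    return estimated_rows(plan) + cost_of(left) + cost_of(right)


best_plan = greedy_left_deep(BASE_CARDINALITY, cost_of)
print(best_plan)
```

A greedy search like this does not guarantee an optimal plan. It is used here only because it keeps the example short.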

class postbound.CostModel(*args, **kwargs)#

The cost model estimates how expensive computing a certain query plan is.

Implement the estimate_cost method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm. In addition, use initialize and cleanup methods to implement any necessary setup and teardown logic for the current query.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod estimate_cost(query: SqlQuery, plan: QueryPlan) → float#

Computes the cost estimate for a specific plan.

The following conventions are used for the estimation: the root node of the plan will not have any cost set. However, all input nodes will have already been estimated by earlier calls to the cost model. Hence, while estimating the cost of the root node, all earlier costs will be available as inputs. It is further assumed that all nodes already have associated cardinality estimates. This method explicitly does not make any assumption regarding the relationship between query and plan. Specifically, it does not assume that the plan is capable of computing the entire result set nor a correct result set. Instead, the plan might just be a partial plan that computes a subset of the query (e.g. a join of some of the tables). It is the implementation’s responsibility to figure out the appropriate course of action.

It is not the responsibility of the cost model to set the estimate on the plan. This is the task of the enumerator (which can decide whether the plan should be considered any further).

Parameters:

query (SqlQuery) – The query being optimized
plan (QueryPlan) – The plan to estimate.

Returns:

The estimated cost

Return type:

Cost
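
The conventions described for estimate_cost can be illustrated with a small sketch. PlanNodeSketch is a hypothetical stand-in for a plan node, and the cost function (output size of the root plus the known costs of its inputs) is just one simple choice made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlanNodeSketch:
    """Hypothetical plan node; real QueryPlan objects carry more information."""

    cardinality: float
    children: list = field(default_factory=list)
    cost: Optional[float] = None  # unset until the enumerator stores an estimate


def estimate_cost(plan: PlanNodeSketch) -> float:
    """Cost of the root: its own output size plus the known costs of its inputs."""
    input_cost = 0.0
    for child in plan.children:
        if child.cost is None:
            raise ValueError("input nodes must already carry a cost estimate")
        input_cost += child.cost
    return plan.cardinality + input_cost


scan_r = PlanNodeSketch(cardinality=1000, cost=1000.0)
scan_s = PlanNodeSketch(cardinality=10, cost=10.0)
join = PlanNodeSketch(cardinality=100, children=[scan_r, scan_s])

candidate_cost = estimate_cost(join)  # the cost model only computes the value ...
join.cost = candidate_cost            # ... storing it is left to the enumerator
```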

initialize(target_db: Database, query: SqlQuery) → None#

Hook method that is called before the actual optimization process starts.

This method can be overridden to set up any necessary data structures, etc. and will be called before each query.

Parameters:

target_db (Database) – The database for which the optimized queries should be generated.
query (SqlQuery) – The query to be optimized

Return type:

None

cleanup() → None#

Hook method that is called after the optimization process has finished.

This method can be overridden to remove any temporary state that was specific to the last query being optimized and should not be shared with later queries.

Return type:: None

pre_check() → OptimizationPreCheck#

Provides requirements that the input query or database system has to satisfy for the optimizer to work properly.

Returns:: The check instance. Can be an empty check if no specific requirements exist.
Return type:: OptimizationPreCheck

class postbound.CardinalityEstimator(*args, **kwargs)#

The cardinality estimator calculates how many tuples specific operators will produce.

Implement the calculate_estimate method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm. In addition, use initialize and cleanup methods to implement any necessary setup and teardown logic for the current query.

Notes

If you only care about cardinality estimation, you should generally use this class in a MultiStageOptimizationPipeline instead of a TextBookOptimizationPipeline. This is because a multi-stage pipeline has a simple control flow from one stage to the next. This allows us to just generate cardinality estimates for all possible intermediates if no join order is given, or just for the intermediates defined in a specific join order otherwise. In contrast, the textbook pipeline is controlled by the plan enumerator which decides which plans to construct and by extension for which intermediates cardinality estimates are required. However, the framework implementation does not provide any way for the actual query optimizer of a database system to hook back into the framework to request such data. Therefore, we rely on emulating the behaviour of the actual plan enumerator of the target database system (unless an enumerator is explicitly provided). While our approximation for Postgres works quite well, it is not entirely accurate and other backends are much less supported.

The default implementations of all methods related to the ParameterGeneration request cardinality estimates either for all possible intermediate results (in the estimate_cardinalities method), or for exactly those intermediates that are defined in a specific join order (in the generate_plan_parameters method that implements the protocol of the ParameterGeneration class). Therefore, developers working on their own cardinality estimation algorithm only need to implement the calculate_estimate method. All related processes are provided by the generator with reasonable default strategies.

However, special care is required when considering cross products: depending on the setting, intermediates can either allow cross products at all stages (by passing allow_cross_products=True during instantiation) or disallow them entirely. The calculate_estimate method should act accordingly. Implementations of this class should pass the appropriate parameter value to the super __init__ method. If they support both scenarios, the parameter can also be exposed to the client. In either case, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

Parameters:: allow_cross_products (bool)
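The subclassing contract described above can be sketched as follows. This is only an illustration under simplifying assumptions: EstimatorBase is a hypothetical stand-in written for this example (it is not the real CardinalityEstimator), tables are plain strings, and the row counts and the fixed join selectivity are made up. Only the allow_cross_products parameter, the calculate_estimate method and the call to super().__init__ mirror the documented interface.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import Iterable


class EstimatorBase(ABC):
    """Hypothetical stand-in for the documented base class (not the real API)."""

    def __init__(self, *, allow_cross_products: bool = False) -> None:
        self.allow_cross_products = allow_cross_products

    @abstractmethod
    def calculate_estimate(self, query: str, intermediate: str | Iterable[str]) -> float:
        ...


class IndependenceEstimator(EstimatorBase):
    """Multiplies base table sizes and applies one fixed selectivity per join."""

    def __init__(self, row_counts: dict[str, int], join_selectivity: float = 0.01,
                 *, allow_cross_products: bool = False) -> None:
        # Forward the flag so that the base class can set up its internal state.
        super().__init__(allow_cross_products=allow_cross_products)
        self._row_counts = row_counts
        self._join_selectivity = join_selectivity

    def calculate_estimate(self, query: str, intermediate: str | Iterable[str]) -> float:
        # A single table may be passed directly instead of being wrapped in an iterable.
        tables = [intermediate] if isinstance(intermediate, str) else list(intermediate)
        estimate = 1.0
        for table in tables:
            estimate *= self._row_counts[table]
        # n tables are connected by n - 1 joins.
        return estimate * self._join_selectivity ** (len(tables) - 1)


estimator = IndependenceEstimator({"title": 1000, "cast_info": 5000})
single = estimator.calculate_estimate("SELECT ...", "title")
joined = estimator.calculate_estimate("SELECT ...", ["title", "cast_info"])
```

An estimator that supports both settings can simply expose allow_cross_products in its own constructor, as done here, and leave the choice to the client.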

abstractmethod calculate_estimate(query: SqlQuery, intermediate: TableReference | Iterable[TableReference]) → Cardinality#

Determines the cardinality of a specific intermediate.

Parameters:

query (SqlQuery) – The query being optimized
intermediate (TableReference | Iterable[TableReference]) – The intermediate for which the cardinality should be estimated. All filter predicates, etc. that are applicable to the intermediate can be assumed to be applied.

Returns:

The estimated cardinality of the specific intermediate

Return type:

Cardinality

initialize(target_db: Database, query: SqlQuery) → None#

Hook method that is called before the actual optimization process starts.

This method can be overridden to set up any necessary data structures, etc. and will be called before each query. The default implementation stores the target database and query as attributes for later use.

Parameters:

target_db (Database) – The database for which the optimized queries should be generated.
query (SqlQuery) – The query to be optimized

Return type:

None

cleanup() → None#

Hook method that is called after the optimization process has finished.

This method can be overridden to remove any temporary state that was specific to the last query being optimized and should not be shared with later queries.

The default implementation removes the references to the target database and query.

Return type:: None

generate_intermediates(query: SqlQuery) → Generator[frozenset[TableReference], None, None]#

Provides all intermediate results of a query.

The inclusion of cross-products between arbitrary tables can be configured via the allow_cross_products attribute.

Parameters:: query (SqlQuery) – The query for which to generate the intermediates
Yields:: Generator[frozenset[TableReference], None, None] – The intermediates
Return type:: Generator[frozenset[TableReference], None, None]

Warning

The default implementation of this method does not work for queries that naturally contain cross products. If such a query is passed, no intermediates with tables from different partitions of the join graph are yielded.
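To make the effect of allow_cross_products more concrete, the following sketch enumerates intermediates over a toy join graph. It is not the framework's implementation: the join graph is assumed to be a plain dictionary that maps each table name to its join partners, and the function merely borrows the name of the documented method.

```python
from __future__ import annotations

from collections.abc import Iterator
from itertools import combinations


def is_connected(tables: frozenset[str], join_graph: dict[str, set[str]]) -> bool:
    """Checks whether the tables form a single connected part of the join graph."""
    start = next(iter(tables))
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        # Only follow join edges that stay within the candidate intermediate.
        for neighbor in join_graph.get(current, set()) & tables:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen == tables


def generate_intermediates(join_graph: dict[str, set[str]], *,
                           allow_cross_products: bool = False) -> Iterator[frozenset[str]]:
    """Yields all table subsets, skipping disconnected ones unless cross products are allowed."""
    tables = sorted(join_graph)
    for size in range(1, len(tables) + 1):
        for candidate in combinations(tables, size):
            intermediate = frozenset(candidate)
            if allow_cross_products or is_connected(intermediate, join_graph):
                yield intermediate


# Chain-shaped join graph: a - b - c
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
connected_only = list(generate_intermediates(graph))
with_cross_products = list(generate_intermediates(graph, allow_cross_products=True))
```

In this example, the intermediate consisting of a and c is only produced when cross products are allowed, because the two tables are not directly joined.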

estimate_cardinalities(query: SqlQuery) → PlanParameterization#

Produces all cardinality estimates for a specific query.

The default implementation of this method delegates the actual estimation to the calculate_estimate method. It is called for each intermediate produced by generate_intermediates.

Parameters:: query (SqlQuery) – The query to optimize
Returns:: A parameterization containing cardinality hints for all intermediates. Other attributes of the parameterization are not modified.
Return type:: PlanParameterization

generate_plan_parameters(query: SqlQuery, join_order: JoinTree | None, operator_assignment: PhysicalOperatorAssignment | None) → PlanParameterization#

Executes the actual parameterization.

Parameters:

query (SqlQuery) – The query to optimize
join_order (Optional[JoinTree]) – The selected join order for the query.
operator_assignment (Optional[PhysicalOperatorAssignment]) – The selected operators for the query

Returns:

The parameterization. If for some reason no parameters can be determined, an empty parameterization can be returned

Return type:

PlanParameterization

Notes

Since this is the final stage of the optimization process, a number of special cases have to be handled:

the previous phases might not have determined any join order or operator assignment
there might not have been a physical operator selection, but only a join ordering (which potentially included an initial selection of physical operators)
there might not have been a join order optimization, but only a selection of physical operators
both join order and physical operators might have been optimized (in which case only the actual operator assignment matters, not any assignment contained in the join order)
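For a cardinality estimator, the main consequence of these cases is which intermediates need an estimate. The sketch below shows one way to derive them. It assumes a join order that is modelled as nested tuples of table names, which is a simplification for this example and not the JoinTree class. The helper names are hypothetical.

```python
from __future__ import annotations


def intermediates_of(join_order) -> list[frozenset[str]]:
    """Collects the table sets produced by every node of a nested-tuple join tree."""
    collected: list[frozenset[str]] = []

    def visit(node) -> frozenset[str]:
        if isinstance(node, str):  # base table
            tables = frozenset([node])
        else:  # join of two sub-trees
            left, right = node
            tables = visit(left) | visit(right)
        collected.append(tables)
        return tables

    visit(join_order)
    return collected


def required_intermediates(all_intermediates, join_order=None) -> list[frozenset[str]]:
    """Falls back to all intermediates if no join order has been determined."""
    if join_order is None:
        return list(all_intermediates)
    return intermediates_of(join_order)


order = (("a", "b"), "c")
needed = required_intermediates([], order)
fallback = required_intermediates([frozenset("a"), frozenset("b")], None)
```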

Multi-stage Optimization Pipeline#

class postbound.MultiStageOptimizationPipeline(target_db: Database)#

This optimization pipeline performs query optimization in separate phases.

The pipeline is organized in two large stages (join ordering and physical operator selection), which are accompanied by an initial pre-check and a final plan parameterization step. In total, those four individual steps completely specify the optimization settings that should be applied to an incoming query. For each of the steps, a general interface exists that must be implemented by the selected strategies.

The steps are applied in consecutive order and perform the following tasks:

the incoming query is checked for unsupported features
an optimized join order for the query is calculated
appropriate physical operators are determined, depending on the join order
the query plan (join order + physical operators) is further parameterized, for example with custom cardinality estimates

All steps are optional. If they are not specified, no operation will be performed at the specific stage. Effectively, this means that the query optimizer of the target database system needs to step in and “fill the gaps”. For example, if no join ordering is performed, the native optimizer needs to come up with a join order. However, the native optimizer will still use the selected physical operators to perform these joins. Likewise, specifying only a join order means that the native optimizer will select its own physical operators. If cardinalities are provided, they are used to guide the native optimizer. As an extreme case, one can skip join ordering and physical operator selection completely and only compute cardinality estimates in the parameterization step. This way, a different cardinality estimator can be simulated without using the TextBookOptimizationPipeline. This has the advantage that no default strategies for cost estimation and plan enumeration need to be simulated and the actual algorithms from the target database are used.

Once the optimization settings have been selected via the setup methods (or alternatively via the load_settings functionality), the pipeline has to be built using the build method. Afterwards, it is ready to optimize input queries.

A pipeline depends on a specific database system. This is necessary to produce the appropriate metadata for an input query (i.e. to apply the specifics that enforce the optimized query plan during query execution for the database system). This field can be changed between optimization calls to use the same pipeline for different systems.

As a shortcut, load_settings can be used to initialize a pipeline with pre-defined optimization strategies.

Parameters:: target_db (Database) – The database for which the optimized queries should be generated.
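The interplay of optional stages, method chaining and the explicit build step can be illustrated with a toy model. MiniPipeline below is a hypothetical class written for this example and does not reproduce the actual implementation. In particular, the generic setup method, the stages modelled as simple callables that return a hint string, and the RuntimeError for an unbuilt pipeline are all assumptions.

```python
from __future__ import annotations

from typing import Callable, Optional


class MiniPipeline:
    """Toy pipeline with optional stages (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._stages: dict[str, Optional[Callable[[str], str]]] = {
            "join_order": None,
            "operators": None,
            "parameters": None,
        }
        self._built = False

    def setup(self, slot: str, stage: Callable[[str], str]) -> MiniPipeline:
        if slot not in self._stages:
            raise KeyError(slot)
        self._stages[slot] = stage
        self._built = False  # every configuration change invalidates the build
        return self  # returning self enables method chaining

    def build(self) -> MiniPipeline:
        self._built = True
        return self

    def optimize_query(self, query: str) -> dict[str, str]:
        if not self._built:
            raise RuntimeError("the pipeline has to be built before optimizing queries")
        # Unconfigured stages contribute nothing; the native optimizer fills these gaps.
        return {slot: stage(query) for slot, stage in self._stages.items() if stage is not None}


# Only cardinalities are supplied; join order and operators are left to the database.
pipeline = MiniPipeline().setup("parameters", lambda query: "Card(a b)=42").build()
decisions = pipeline.optimize_query("SELECT ...")
```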

Examples

>>> pipeline = pb.MultiStageOptimizationPipeline(postgres_db)
>>> pipeline.load_settings(ues_settings)
>>> pipeline.build()
>>> pipeline.optimize_query(join_order_benchmark["1a"])

property target_db: Database#

The database for which optimized queries should be generated.

When assigning a new target database, the pipeline needs to be built again.

Returns:: The currently selected database system
Return type:: Database

property pre_check: OptimizationPreCheck | None#

An overarching check that should be applied to all queries before they are optimized.

This check complements the pre checks of the individual stages and can be used to enforce experiment-specific constraints.

Returns:: The current check, if any. Can also be an EmptyPreCheck instance.
Return type:: Optional[OptimizationPreCheck]

property join_order_enumerator: JoinOrderOptimization | None#

The selected join order optimization algorithm.

Returns:: The current algorithm, if any has been selected.
Return type:: Optional[JoinOrderOptimization]

property physical_operator_selection: PhysicalOperatorSelection | None#

The selected operator selection algorithm.

Returns:: The current algorithm, if any has been selected.
Return type:: Optional[PhysicalOperatorSelection]

property plan_parameterization: ParameterGeneration | None#

The selected parameterization algorithm.

Returns:: The current algorithm, if any has been selected.
Return type:: Optional[ParameterGeneration]

setup_query_support_check(check: OptimizationPreCheck) → Self#

Configures the pre-check that should be executed for each query.

This check will be combined with any additional checks that are required by the actual optimization strategies. Setting a new check requires the pipeline to be built again.

Parameters:: check (OptimizationPreCheck) – The new check
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: self

setup_join_order_optimization(enumerator: JoinOrderOptimization) → Self#

Configures the pipeline to obtain an optimized join order.

The actual strategy can either produce a purely logical join order, or an initial physical query execution plan that also specifies how the individual joins should be executed. All later stages are expected to work with these two cases.

Setting a new algorithm requires the pipeline to be built again.

Parameters:: enumerator (JoinOrderOptimization) – The new join order optimization algorithm
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: self

setup_physical_operator_selection(selector: PhysicalOperatorSelection) → Self#

Configures the algorithm to assign physical operators to the query.

This algorithm receives the input query as well as the join order (if there is one) as input. In a special case, this join order can also provide an initial assignment of physical operators. These settings can then be further adapted by the selected algorithm (or completely overwritten).

Setting a new algorithm requires the pipeline to be built again.

Parameters:: selector (PhysicalOperatorSelection) – The new operator selection algorithm
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: Self

setup_plan_parameterization(param_generator: ParameterGeneration) → Self#

Configures the algorithm to parameterize the query plan.

This algorithm receives the input query as well as the join order and the physical operators (if those have been determined yet) as input.

Setting a new algorithm requires the pipeline to be built again.

Parameters:: param_generator (ParameterGeneration) – The new parameterization algorithm
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: self

use(component: JoinOrderOptimization | PhysicalOperatorSelection | ParameterGeneration) → Self#

Shortcut method to set up the pipeline. Delegates to the appropriate setup_XXX method based on the type of the component.

Parameters:: component (JoinOrderOptimization | PhysicalOperatorSelection | ParameterGeneration)
Return type:: Self

load_settings(optimization_settings: OptimizationSettings) → Self#

Applies all the optimization settings from a pre-defined optimization strategy to the pipeline.

This is just a shorthand method to skip calling all setup methods individually for a fixed combination of optimization settings. After the settings have been loaded, they can be overwritten again using the setup methods.

Loading new presets requires the pipeline to be built again.

Parameters:: optimization_settings (OptimizationSettings) – The specific settings
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: self

build() → Self#

Constructs the optimization pipeline.

This includes filling all undefined optimization steps with empty strategies and checking all strategies for compatibility with the target_db. Afterwards, the pipeline is ready to optimize queries.

Returns:: The current pipeline to allow for easy method-chaining.
Return type:: self
Raises:: UnsupportedSystemError – If any of the selected optimization stages is not compatible with the target_db.

target_database() → Database#

Provides the current target database.

Return type:: Database

query_execution_plan(query: SqlQuery) → QueryPlan#

Applies the current pipeline configuration to obtain an optimized plan for the input query.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

An optimized query execution plan for the input query.

Return type:

QueryPlan

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

optimize_query(query: SqlQuery) → SqlQuery#

Applies the current pipeline configuration to optimize the input query.

This process also involves the generation of appropriate optimization information that enforces the selected optimization decision when the query is executed on an actual database system.

Parameters:

query (SqlQuery) – The query that should be optimized

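Returns:

The optimized query, including the optimization information that enforces the selected optimization decisions.

Return type: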

SqlQuery

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

stages() → list[OptimizationStage]#

Provides all optimization stages that are part of the pipeline. The order of the stages is not relevant.

Return type:: list[OptimizationStage]

describe() → jsondict#

Generates a description of the current pipeline configuration.

Returns:: The actual description
Return type:: jsondict

class postbound.JoinOrderOptimization(*args, **kwargs)#

The join order optimization generates a complete join order for an input query.

This is the first step in a multi-stage optimizer design. Implement the optimize_join_order method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod optimize_join_order(query: SqlQuery) → JoinTree | None#

Performs the actual join ordering process.

The join tree can be further annotated with an initial operator assignment, if that is an inherent part of the specific optimization strategy. However, this is generally discouraged and the multi-stage pipeline will discard such operators to prepare for the subsequent physical operator selection.

Other than the join order and operator assignment, the algorithm should add as much information to the join tree as possible, e.g. including join conditions and cardinality estimates that were calculated for the selected joins. This enables other parts of the optimization process to re-use that information.

Parameters:: query (SqlQuery) – The query to optimize
Returns:: The join order. If for some reason there is no valid join order for the given query (e.g. queries with just a single selected table), None can be returned. Otherwise, the selected join order has to be described using a JoinTree.
Return type:: Optional[JoinTree]
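As an illustration of this contract, the following sketch implements a simple greedy heuristic that always joins the smallest connected table next. It is a simplified stand-in rather than an algorithm shipped with the framework: tables are plain strings, the join order is expressed as nested tuples instead of a JoinTree, and the cardinalities and the join graph are passed in directly.

```python
from __future__ import annotations


def greedy_join_order(cardinalities: dict[str, int], join_graph: dict[str, set[str]]):
    """Builds a left-deep join order by repeatedly adding the smallest joinable table."""
    if len(cardinalities) < 2:
        return None  # a single table does not need a join order

    remaining = set(cardinalities)
    # Break ties by table name to keep the result deterministic.
    first = min(remaining, key=lambda table: (cardinalities[table], table))
    order = first
    joined = {first}
    remaining.remove(first)

    while remaining:
        connected = {table for table in remaining if join_graph.get(table, set()) & joined}
        # Fall back to a cross product if no remaining table can be joined.
        candidates = connected or remaining
        next_table = min(candidates, key=lambda table: (cardinalities[table], table))
        order = (order, next_table)
        joined.add(next_table)
        remaining.remove(next_table)

    return order


# Star-shaped join graph with a in the center
cards = {"a": 100, "b": 10, "c": 1000}
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
join_order = greedy_join_order(cards, graph)
no_order = greedy_join_order({"a": 100}, {"a": set()})
```

Here, b is chosen as the starting table, a is joined next since it is the only connected table, and c follows last.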

class postbound.PhysicalOperatorSelection(*args, **kwargs)#

The physical operator selection assigns scan and join operators to the tables of the input query.

This is the second stage in the two-phase optimization process, and takes place after the join order has been determined. Implement the select_physical_operators method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod select_physical_operators(query: SqlQuery, join_order: JoinTree | None) → PhysicalOperatorAssignment#

Performs the operator assignment.

Parameters:

query (SqlQuery) – The query to optimize
join_order (Optional[JoinTree]) – The selected join order of the query

Returns:

The operator assignment. If for some reason no operators can be assigned, an empty assignment can be returned

Return type:

PhysicalOperatorAssignment

Notes

The operator selection should handle a None join order gracefully. This can happen if the query does not require any joins (e.g. processing of a single table).

Depending on the specific optimization settings, it is also possible to raise an error if such a situation occurs and there is no reasonable way to deal with it.
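One way to handle a missing join order gracefully is to fall back to an assignment that only contains scan operators. The sketch below demonstrates this idea under several assumptions: the join order is modelled as nested tuples, the assignment is a plain dictionary keyed by table sets, and the operator names as well as the nlj_threshold parameter are hypothetical.

```python
from __future__ import annotations


def select_physical_operators(tables, join_order, cardinalities, *, nlj_threshold: int = 100):
    """Assigns scans to all tables and, if a join order exists, operators to its joins."""
    assignment = {frozenset([table]): "SeqScan" for table in tables}
    if join_order is None:
        # No joins to annotate, e.g. because the query only touches a single table.
        return assignment

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        left, right = node
        left_tables, right_tables = visit(left), visit(right)
        joined = left_tables | right_tables
        smaller_input = min(cardinalities[left_tables], cardinalities[right_tables])
        # Nested loops are only chosen if one input is estimated to be small.
        assignment[joined] = "NestedLoopJoin" if smaller_input <= nlj_threshold else "HashJoin"
        return joined

    visit(join_order)
    return assignment


cards = {
    frozenset("a"): 50,
    frozenset("b"): 1000,
    frozenset("c"): 5000,
    frozenset("ab"): 2000,
}
with_joins = select_physical_operators(["a", "b", "c"], (("a", "b"), "c"), cards)
scans_only = select_physical_operators(["a"], None, cards)
```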

class postbound.ParameterGeneration(*args, **kwargs)#

The parameter generation assigns additional metadata to a query plan.

Such parameters do not influence the previous choice of join order and physical operators directly, but affect their specific implementation. Therefore, this is an optional final step in a multi-stage optimization process. Implement the generate_plan_parameters method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod generate_plan_parameters(query: SqlQuery, join_order: JoinTree | None, operator_assignment: PhysicalOperatorAssignment | None) → PlanParameterization#

Executes the actual parameterization.

Parameters:

query (SqlQuery) – The query to optimize
join_order (Optional[JoinTree]) – The selected join order for the query.
operator_assignment (Optional[PhysicalOperatorAssignment]) – The selected operators for the query

Returns:

The parameterization. If for some reason no parameters can be determined, an empty parameterization can be returned

Return type:

PlanParameterization

Notes

Since this is the final stage of the optimization process, a number of special cases have to be handled:

the previous phases might not have determined any join order or operator assignment
there might not have been a physical operator selection, but only a join ordering (which potentially included an initial selection of physical operators)
there might not have been a join order optimization, but only a selection of physical operators
both join order and physical operators might have been optimized (in which case only the actual operator assignment matters, not any assignment contained in the join order)
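The precedence rule from the last case can be written down in a few lines. This is a minimal sketch that assumes both the explicit operator assignment and the operators embedded in the join order are available as plain dictionaries. The function name and the data model are hypothetical.

```python
from __future__ import annotations

from typing import Optional


def effective_operators(join_order_operators: Optional[dict[str, str]],
                        operator_assignment: Optional[dict[str, str]]) -> dict[str, str]:
    """Determines which operators a parameterization should assume."""
    if operator_assignment is not None:
        # An explicit operator selection always wins over operators from the join order.
        return dict(operator_assignment)
    if join_order_operators is not None:
        # Only join ordering took place, but it included an initial operator selection.
        return dict(join_order_operators)
    # Neither stage made a decision, so the native optimizer chooses the operators.
    return {}


from_join_order = {"a-b": "NestedLoopJoin"}
from_selection = {"a-b": "HashJoin"}

both = effective_operators(from_join_order, from_selection)
join_order_only = effective_operators(from_join_order, None)
nothing = effective_operators(None, None)
```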

Tip

The CardinalityEstimator can also be used as a ParameterGeneration.

Other Pipelines#

class postbound.IntegratedOptimizationPipeline(target_db: Database | None = None)#

This pipeline is intended for algorithms that calculate the entire query plan in a single process.

To configure the pipeline, use the setup_optimization_algorithm method followed by the build method (in line with the other pipelines).

Parameters:: target_db (Optional[Database], optional) – The database for which the optimized queries should be generated. If this is not given, the default database is extracted from the DatabasePool.

property target_db: Database#

The database for which optimized queries should be generated.

When assigning a new target database, the pipeline has to be built again.

Returns:: The currently selected database system
Return type:: Database

See also

CompleteOptimizationAlgorithm.pre_check

property optimization_algorithm: CompleteOptimizationAlgorithm | None#

The optimization algorithm is used each time a query should be optimized.

Returns:: The currently selected optimization algorithm, if any.
Return type:: Optional[CompleteOptimizationAlgorithm]

setup_optimization_algorithm(algorithm: CompleteOptimizationAlgorithm) → Self#

Configures the pipeline to use the given optimization algorithm.

Parameters:: algorithm (CompleteOptimizationAlgorithm) – The new optimization algorithm to use. No compatibility checks are performed at this point. They are carried out when the pipeline is built.
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: IntegratedOptimizationPipeline

use(algorithm: CompleteOptimizationAlgorithm) → Self#

Alias for setup_optimization_algorithm to keep a consistent interface across all pipelines.

Parameters:: algorithm (CompleteOptimizationAlgorithm)
Return type:: Self

build() → Self#

Constructs the optimization pipeline.

This includes checking the selected optimization algorithm for compatibility with the target_db. Afterwards, the pipeline is ready to optimize queries.

Returns:: The current pipeline to allow for easy method-chaining.
Return type:: IntegratedOptimizationPipeline
Raises:: UnsupportedSystemError – If the new optimization algorithm is not compatible with the current target database system.

See also

CompleteOptimizationAlgorithm.pre_check

query_execution_plan(query: SqlQuery) → QueryPlan#

Applies the current pipeline configuration to obtain an optimized plan for the input query.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

An optimized query execution plan for the input query.

Return type:

QueryPlan

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

stages() → list[OptimizationStage]#

Provides all optimization stages that are part of the pipeline. The order of the stages is not relevant.

Return type:: list[OptimizationStage]

target_database() → Database#

Provides the current target database.

Return type:: Database

describe() → jsondict#

Generates a description of the current pipeline configuration.

Returns:: The actual description
Return type:: jsondict

class postbound.CompleteOptimizationAlgorithm(*args, **kwargs)#

Constructs an entire query plan for an input query in one integrated optimization process.

This stage closely models the behaviour of traditional optimization algorithms, e.g. based on dynamic programming. Implement the optimize_query method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod optimize_query(query: SqlQuery) → QueryPlan#

Constructs the optimized execution plan for an input query.

Parameters:: query (SqlQuery) – The query to optimize
Returns:: The optimized query plan
Return type:: QueryPlan
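To give an impression of what an integrated algorithm might compute, the following sketch runs a textbook-style dynamic programming enumeration over table subsets. It is a generic illustration and not the framework's implementation. The assumptions are: plans are nested tuples of table names, cardinalities come from a fixed join selectivity, the cost of a plan is the sum of the cardinalities of its join results, and cross products are never considered.

```python
from __future__ import annotations

from itertools import combinations


def dp_optimize(row_counts: dict[str, int], join_graph: dict[str, set[str]],
                selectivity: float = 0.1):
    """Returns (cost, plan) for the cheapest bushy plan without cross products."""
    tables = sorted(row_counts)

    def cardinality(subset: frozenset[str]) -> float:
        result = selectivity ** (len(subset) - 1)
        for table in sorted(subset):
            result *= row_counts[table]
        return result

    # Base case: scanning a single table is free in this simplified cost model.
    best = {frozenset([table]): (0.0, table) for table in tables}

    for size in range(2, len(tables) + 1):
        for combo in combinations(tables, size):
            subset = frozenset(combo)
            for left_size in range(1, size):
                for left_combo in combinations(combo, left_size):
                    left = frozenset(left_combo)
                    right = subset - left
                    if left not in best or right not in best:
                        continue  # one side cannot be built without a cross product
                    if not any(join_graph.get(table, set()) & right for table in left):
                        continue  # no join predicate connects the two sides
                    cost = best[left][0] + best[right][0] + cardinality(subset)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))

    return best.get(frozenset(tables))


# Chain-shaped join graph: a - b - c
counts = {"a": 10, "b": 1000, "c": 100}
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
cost, plan = dp_optimize(counts, graph)
```

For this input, the enumeration joins a and b first because their join result is much smaller than the result of joining b and c.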

class postbound.IncrementalOptimizationPipeline(target_db: Database)#

This optimization pipeline can be thought of as a generalization of the MultiStageOptimizationPipeline.

Instead of only operating in two stages, an arbitrary number of optimization steps can be applied. During each step, an entire physical query execution plan is received as input and also produced as output. Therefore, partial operator assignments or cardinality estimates are not supported by this pipeline. The incremental nature probably makes it most useful for optimization strategies that continuously improve query plans.

Parameters:: target_db (Database) – The database for which the optimized queries should be generated.

property target_db: Database#

The database for which optimized queries should be generated.

When a new target database is selected, all optimization steps are checked for support of the new database.

Returns:: The currently selected database system
Return type:: Database
Raises:: UnsupportedSystemError – If any of the optimization steps or the initial plan generator cannot work with the target database

property initial_plan_generator: CompleteOptimizationAlgorithm | None#

Strategy to construct the first physical query execution plan to start the incremental optimization.

If no initial generator is selected, the initial plan will be derived from the optimizer of the target database.

Returns:: The current initial generator.
Return type:: Optional[CompleteOptimizationAlgorithm]
Raises:: UnsupportedSystemError – If the initial generator does not work with the current target_db

add_optimization_step(next_step: IncrementalOptimizationStep) → Self#

Expands the optimization pipeline by another stage.

The given step will be applied at the end of the pipeline. The very first optimization step receives an initial plan that has either been generated via the initial_plan_generator (if it has been set up), or by retrieving the query execution plan from the target_db.

Parameters:: next_step (IncrementalOptimizationStep) – The next optimization stage
Returns:: The current pipeline to allow for easy method-chaining.
Return type:: IncrementalOptimizationPipeline
Raises:: UnsupportedSystemError – If any of the optimization steps does not work with the target database

use(step: CompleteOptimizationAlgorithm | IncrementalOptimizationStep) → Self#

Shortcut method to set up the pipeline.

Parameters:: step (CompleteOptimizationAlgorithm | IncrementalOptimizationStep)
Return type:: Self

target_database() → Database#

Provides the current target database.

Return type:: Database

query_execution_plan(query: SqlQuery) → QueryPlan#

Applies the current pipeline configuration to obtain an optimized plan for the input query.

Parameters:

query (SqlQuery) – The query that should be optimized

Returns:

An optimized query execution plan for the input query.

Return type:

QueryPlan

Raises:

UnsupportedQueryError – If the selected optimization algorithms cannot be applied to the specific query, e.g. because it contains unsupported features.

stages() → list[OptimizationStage]#

Provides all optimization stages that are part of the pipeline. The order of the stages is not relevant.

Return type:: list[OptimizationStage]

describe() → jsondict#

Generates a description of the current pipeline configuration.

Returns:: The actual description
Return type:: jsondict

class postbound.IncrementalOptimizationStep(*args, **kwargs)#

Incremental optimization allows chaining different smaller optimization strategies.

Each step receives the query plan of its predecessor and can change its decisions in arbitrary ways. For example, this scheme can be used to gradually correct mistakes or risky decisions of individual optimizers.

Implement the optimize_query method to provide the actual optimization logic. The describe and pre_check methods should be overridden to provide metadata about the specific algorithm for benchmarking and to ensure that the input query and database system are compatible with the algorithm.

Notes

When implementing this class, make sure to call super().__init__ to ensure that all of the internal data is set up properly.

abstractmethod optimize_query(query: SqlQuery, current_plan: QueryPlan) → QueryPlan#

Determines the next query plan.

If no further optimization steps are configured in the pipeline, this is also the final query plan.

Parameters:

query (SqlQuery) – The query to optimize
current_plan (QueryPlan) – The execution plan that has so far been built by predecessor strategies. If this step is the first step in the optimization pipeline, this might also be a plan from the target database system

Returns:

The optimized plan

Return type:

QueryPlan
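The chaining of steps can be sketched with two small corrective strategies. Everything in this example is a simplified stand-in: Step is a hypothetical base class, a plan is modelled as a dictionary that maps a join label to its operator and estimated input size, and the thresholds are arbitrary.

```python
from __future__ import annotations

from abc import ABC, abstractmethod

# Hypothetical plan model: join label -> (operator name, estimated input rows)
Plan = dict[str, tuple[str, int]]


class Step(ABC):
    """Stand-in for an incremental optimization step (not the real base class)."""

    @abstractmethod
    def optimize_query(self, query: str, current_plan: Plan) -> Plan:
        ...


class PreferNestedLoopForTinyInputs(Step):
    def __init__(self, max_rows: int = 10) -> None:
        super().__init__()
        self.max_rows = max_rows

    def optimize_query(self, query: str, current_plan: Plan) -> Plan:
        # Build a new plan instead of mutating the plan of the predecessor.
        return {join: (("NestedLoopJoin", rows) if op == "HashJoin" and rows <= self.max_rows
                       else (op, rows))
                for join, (op, rows) in current_plan.items()}


class AvoidNestedLoopForLargeInputs(Step):
    def __init__(self, min_rows: int = 10_000) -> None:
        super().__init__()
        self.min_rows = min_rows

    def optimize_query(self, query: str, current_plan: Plan) -> Plan:
        return {join: (("HashJoin", rows) if op == "NestedLoopJoin" and rows >= self.min_rows
                       else (op, rows))
                for join, (op, rows) in current_plan.items()}


def run_steps(query: str, initial_plan: Plan, steps: list[Step]) -> Plan:
    """Feeds the plan of each step into its successor."""
    plan = initial_plan
    for step in steps:
        plan = step.optimize_query(query, plan)
    return plan


initial_plan = {"a-b": ("HashJoin", 5), "ab-c": ("NestedLoopJoin", 100_000)}
steps = [PreferNestedLoopForTinyInputs(), AvoidNestedLoopForLargeInputs()]
final_plan = run_steps("SELECT ...", initial_plan, steps)
```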

postbound.as_complete_algorithm(stage: JoinOrderOptimization | PhysicalOperatorSelection | ParameterGeneration, *, database: Database | None = None) → CompleteOptimizationAlgorithm#

Enables using a partial optimization stage in situations where a complete optimizer is expected.

This emulation is achieved by using the partial stage to obtain a partial query plan. The target database system is then tasked with filling the gaps to construct a complete execution plan.

Basically, this function is syntactic sugar for situations where a MultiStageOptimizationPipeline would be filled with only a single stage. Using as_complete_algorithm, the construction of an entire pipeline can be omitted. Furthermore, it can seem more natural to “convert” the stage into a complete algorithm in this case.

Parameters:

stage (JoinOrderOptimization | PhysicalOperatorSelection | ParameterGeneration) – The stage that should become a complete optimization algorithm
database (Optional[Database], optional) – The target database to execute the optimized queries in. This is required to fill the gaps of the partial query plans. If the database is omitted, it will be inferred based on the database pool.

Returns:

An emulated optimization algorithm for the optimization stage

Return type:

CompleteOptimizationAlgorithm
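Conceptually, this conversion is an adapter around the partial stage. The sketch below shows the idea with plain functions. It does not reflect how the framework implements it: as_complete, native_optimizer and the dictionary-based plan model are hypothetical names and simplifications made for this example.

```python
from __future__ import annotations

from typing import Callable


def native_optimizer(query: str, partial_plan: dict) -> dict:
    """Hypothetical stand-in for the target database filling in missing decisions."""
    defaults = {"join_order": ("a", "b"), "operators": {"a-b": "HashJoin"}}
    # Decisions of the partial stage take precedence over the defaults of the database.
    return {**defaults, **partial_plan}


def as_complete(stage: Callable[[str], dict],
                fill_gaps: Callable[[str, dict], dict]) -> Callable[[str], dict]:
    """Wraps a partial stage so that it always produces a complete plan."""

    def complete_algorithm(query: str) -> dict:
        partial_plan = stage(query)
        return fill_gaps(query, partial_plan)

    return complete_algorithm


def join_order_only(query: str) -> dict:
    """Partial stage that only decides on the join order."""
    return {"join_order": ("b", "a")}


algorithm = as_complete(join_order_only, native_optimizer)
complete_plan = algorithm("SELECT ...")
```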

Support Functionality#

class postbound.OptimizationSettings(*args, **kwargs)#

Captures related settings for the optimization pipeline to make them more easily accessible.

All components are optional, depending on the specific optimization scenario/approach.

query_pre_check() → OptimizationPreCheck | None#

The required query pre-check.

Returns:: The pre-check if one is necessary, or None otherwise.
Return type:: Optional[OptimizationPreCheck]

build_join_order_optimizer() → JoinOrderOptimization | None#

The algorithm that is used to obtain the optimized join order.

Returns:: The optimization strategy for the join order, or None if the scenario does not include a join order optimization.
Return type:: Optional[JoinOrderOptimization]

build_physical_operator_selection() → PhysicalOperatorSelection | None#

The algorithm that is used to determine the physical operators.

Returns:: The optimization strategy for the physical operators, or None if the scenario does not include an operator optimization.
Return type:: Optional[PhysicalOperatorSelection]

build_plan_parameterization() → ParameterGeneration | None#

The algorithm that is used to further parameterize the query plan.

Returns:: The parameter optimization strategy, or None if the scenario does not include such a stage.
Return type:: Optional[ParameterGeneration]
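A preset that only contributes a single stage could look roughly like the following sketch. The four method names follow the interface documented above, but the class does not inherit from the real OptimizationSettings, the returned component is a placeholder string instead of an actual stage, and collect_components is a hypothetical helper that mimics what a pipeline might do when loading the settings.

```python
from __future__ import annotations


class CardinalityOnlySettings:
    """Hypothetical preset that only provides a plan parameterization."""

    def query_pre_check(self):
        return None

    def build_join_order_optimizer(self):
        return None

    def build_physical_operator_selection(self):
        return None

    def build_plan_parameterization(self):
        return "my-cardinality-estimator"  # placeholder for a real stage object


def collect_components(settings) -> dict[str, object]:
    """Gathers all components of a preset and skips the ones that are not provided."""
    candidates = {
        "pre_check": settings.query_pre_check(),
        "join_order": settings.build_join_order_optimizer(),
        "operator_selection": settings.build_physical_operator_selection(),
        "plan_parameterization": settings.build_plan_parameterization(),
    }
    return {name: component for name, component in candidates.items() if component is not None}


configured = collect_components(CardinalityOnlySettings())
```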
