Dynamic Programming#
- class postbound.opt.dynprog.DynamicProgrammingEnumerator(*args, **kwargs)#
A very basic dynamic programming-based plan enumerator.
This enumerator is very basic because it does not implement any sophisticated pruning rules or traversal strategies and only focuses on a small subset of possible operators. It simply enumerates all possible access paths and join paths and picks the cheapest one. This should only serve as a starting point when lacking an actual decent enumerator implementation (see Limitation below). Its purpose is mainly to shield users that are only interested in the cost model or the cardinality estimator from having to implement their own enumerator in order to use the TextBookOptimizationPipeline. Notice that for experiments based on PostgreSQL, a much more sophisticated implementation is available with the PostgresDynProg enumerator (and this enumerator is automatically selected when using the textbook pipeline with a Postgres target database).
Limitations#
Only the cheapest access paths are considered, without taking sort orders into account. This prevents free merge join optimizations, i.e. if an access path is more expensive but already sorted, it will be discarded in favor of a cheaper alternative, even though a later merge join might become much cheaper due to the sort order.
No optimizations to intermediates are considered, i.e. no materialization or memoization of subplans.
Only the basic scan and join operators are considered. For scans, this includes sequential scan, index scan, index-only scan and bitmap scan. For joins, this includes nested loop join, hash join and sort merge join. These can be further restricted through the supported_scan_ops and supported_join_ops parameters.
Only simple SPJ queries are supported. Importantly, the query may not contain any set operations, subqueries, CTEs etc. All joins must be inner equijoins and no cross products are allowed.
Aggregations, sorting, etc. are not considered. In this way, the enumerator is comparable to the
join_search_hookof PostgreSQL. We assume that such “technicalities” are handled when creating appropriate hints for the target database or when executing the query on the target database at the latest.
- param supported_scan_ops:
The set of scan operators that should be considered during the enumeration. This should be a subset of the following operators: sequential scan, index scan, index-only scan, bitmap scan. If any other operators are included, these are simply never considered. By default all operators that are available on the target_db are allowed.
- type supported_scan_ops:
Optional[set[ScanOperator]], optional
- param supported_join_ops:
The set of join operators that should be considered during the enumeration. This should be a subset of the following operators: nested loop join, hash join, sort merge join. If any other operators are included, these are simply never considered. By default all operators that are available on the target_db are allowed.
- type supported_join_ops:
Optional[set[JoinOperator]], optional
- param target_db:
The target database system for which the optimization pipeline is intended. If not omitted, the database is inferred from the DatabasePool.
- type target_db:
Optional[Database], optional
- generate_execution_plan(query, *, cost_model, cardinality_estimator) QueryPlan#
Computes the optimal plan to execute the given query.
- Parameters:
query (SqlQuery) – The query to optimize
cost_model (CostModel) – The cost model to compare different candidate plans
cardinality_estimator (CardinalityEstimator) – The cardinality estimator to calculate the sizes of intermediate results
- Returns:
The query plan
- Return type:
Notes
The precise generation “style” (e.g. top-down vs. bottom-up, complete plans vs. plan fragments, etc.) is completely up to the specific algorithm. Therefore, it is really hard to provide a more expressive interface for the enumerator beyond just generating a plan. Generally the enumerator should query the cost model to compare different candidates. The top-most operator of each candidate will usually not have a cost estimate set at the beginning and it is the enumerator’s responsibility to set the estimate correctly. The jointree.update_cost_estimate function can be used to help with this.
- pre_check() OptimizationPreCheck#
Provides requirements that input query or database system have to satisfy for the optimizer to work properly.
- Returns:
The check instance. Can be an empty check if no specific requirements exist.
- Return type:
- describe() jsondict#
Provides a JSON-serializable representation of the specific strategy, as well as important parameters.
- Returns:
The description
- Return type:
jsondict
See also
OptimizationPipeline.describe
- Parameters:
supported_scan_ops (Optional[set[ScanOperator]])
supported_join_ops (Optional[set[JoinOperator]])
target_db (Optional[Database])
- class postbound.opt.dynprog.RelOptInfo(intermediate: frozenset[TableReference], pathlist: list[QueryPlan], partial_paths: list[QueryPlan], cheapest_path: QueryPlan | None, cheapest_partial_path: QueryPlan | None, cardinality: Cardinality)#
Simplified model of the RelOptInfo from the Postgres planner.
We only specify the fields that we truly care about (and that are not covered by other parts of PostBOUND), the rest is omitted.
For example, we don’t need to worry about equivalence classes, because the Postgres enumerator is responsible for expanding the query with all EQ predicates. Aftwards, we can use the query abstraction to determine available joins.
- Parameters:
intermediate (frozenset[TableReference])
pathlist (list[QueryPlan])
partial_paths (list[QueryPlan])
cheapest_path (QueryPlan | None)
cheapest_partial_path (QueryPlan | None)
cardinality (Cardinality)
- intermediate: frozenset[TableReference]#
The relation that is represented by this RelOptInfo.
This is simply the set of all tables that are part of the relation.
- pathlist: list[QueryPlan]#
All access paths that can be used to compute this relation (that we know of and care about).
In contrast to the original PG implementation, we don’t care about sorting this list. Retaining the sort order is mainly an implementation detail and optimization of PG.
- partial_paths: list[QueryPlan]#
All access paths that can be used to compute this relation with parallel workers. Otherwise the same as paths.
- cheapest_path: QueryPlan | None#
The cheapest access path that we have found.
Notice that this is only set after all paths for the RelOpt have been collected.
- cheapest_partial_path: QueryPlan | None#
The cheapest access path that we have found for parallel execution.
Notice that this is only set after all paths for the RelOpt have been collected.
- cardinality: Cardinality#
The estimated number of rows that are produced by this relation.
- postbound.opt.dynprog.Level#
The current level in the dynamic programming table.
- postbound.opt.dynprog.JoinRelLevel#
Alias for our dynamic programming table.
alias of
dict[int,list[RelOptInfo]]
- postbound.opt.dynprog.AddPathHook#
Hook method for users to get control in Postgres’ add_path() method.
The method is reponsible for storing a new candidate path in its RelOptInfo. It can decide, whether the path is actually worth storing or not. Furthermore, the method can also prune existing paths from the pathlist that are dominated by the new path.
All of these actions should be performed in-place by modifying the RelOptInfo object. No return value is expected.
The last boolean parameter indicates whether the path is a partial path (i.e. a path for parallel execution) or not. If it is, the path should be stored in the partial_paths list, instead of the standard pathlist. Likewise, all checks should be performed against the partial_paths list. If the hook should not handle partial paths, it can simply delegate to the standard implementation on the enumerator.
If need be, the method can also access the current state of the dynamic programmer. Specifically, the enumerator provides access to the query and database, the selected cost model and cardinality estimator, as well as to the current JoinRelLevel. Finally, the method is allowed to invoke the default path addition logic by calling the standard_add_path() method on the enumerator.
alias of
Callable[[PostgresDynProg,RelOptInfo,QueryPlan,bool],None]
- exception postbound.opt.dynprog.DPWarning#
- class postbound.opt.dynprog.PostgresDynProg(*args, **kwargs)#
Dynamic programming-based plan enumeration strategy that mimics the behavior of the Postgres query optimizer.
Postgres-style dynamic programming means two things: first, we use the Postgres pruning rules to reduce the search space. Second, we apply the same opinionated traversal rules. Most importantly, this concerns when we consider materialization or memoization of subplans. If some of the related operators are not allowed, the traversal rules are adjusted accordingly.
The implementation is based on a translation of the actual Postgres source code.
- Parameters:
supported_scan_ops (Optional[set[ScanOperators]], optional) – The scan operators that the enumerator is allowed to use. If omitted, all scan operators that are supported by the target database are used.
supported_join_ops (Optional[set[JoinOperators]], optional) – The join operators that the enumerator is allowed to use. If omitted, all join operators supported by the target database are used.
enable_materialize (bool, optional) – Whether the optimizer is allowed to insert materialization operators into the query plan. This is enabled by default.
enable_memoize (bool, optional) – Whether the optimizer is allowed to insert memoization operators into the query plan. This is enabled by default.
enable_sort (bool, optional) – Whether the optimizer is allowed to perform explicit sorts in the query plan. Notice that setting this to False only prevents optional sorts. For example, if the query contains an ORDER BY clause, the optimizer will still perform the required sorting. However, it will not perform any merge joins that require a different kind of sorting. This is enabled by default.
add_path_hook (Optional[AddPathHook], optional) – Optional function to implement custom path addition logic. See documentation on AddPathHook for more details.
target_db (Optional[PostgresInterface], optional) – The database on which the plans should be executed. This has to be a Postgres instance. If omitted, the database is inferred from the DatabasePool.
verbose (bool, optional) – Whether the enumerator should issue warnings if it encounters unexpected situations, e.g. if it rejects a path because it is illegal. This includes cases where the cost of some operators cannot be estimated by the target database system.
max_parallel_workers (Optional[int])
- infer_settings() None#
Sets all allowed operators according to the configuration of the target database.
For example, if the target database has index scans disabled, the enumerator will disable them as well.
- Return type:
None
- generate_execution_plan(query, *, cost_model, cardinality_estimator) QueryPlan#
Computes the optimal plan to execute the given query.
- Parameters:
query (SqlQuery) – The query to optimize
cost_model (CostModel) – The cost model to compare different candidate plans
cardinality_estimator (CardinalityEstimator) – The cardinality estimator to calculate the sizes of intermediate results
- Returns:
The query plan
- Return type:
Notes
The precise generation “style” (e.g. top-down vs. bottom-up, complete plans vs. plan fragments, etc.) is completely up to the specific algorithm. Therefore, it is really hard to provide a more expressive interface for the enumerator beyond just generating a plan. Generally the enumerator should query the cost model to compare different candidates. The top-most operator of each candidate will usually not have a cost estimate set at the beginning and it is the enumerator’s responsibility to set the estimate correctly. The jointree.update_cost_estimate function can be used to help with this.
- describe() jsondict#
Provides a JSON-serializable representation of the specific strategy, as well as important parameters.
- Returns:
The description
- Return type:
jsondict
See also
OptimizationPipeline.describe
- pre_check() OptimizationPreCheck#
Provides requirements that input query or database system have to satisfy for the optimizer to work properly.
- Returns:
The check instance. Can be an empty check if no specific requirements exist.
- Return type:
- standard_add_path(rel: RelOptInfo, path: QueryPlan, *, is_partial: bool = False) None#
Checks, whether a specific path is worthy of further consideration. If it is, the path is stored in the pathlist.
This method’s naming is exceptionally bad, but this the way it is named in the PG source code, so we stick with it.
On an abstract level, this method implements the following logic:
For each existing path in the pathlist, we check whether the new path dominates the existing one. If it does, we evict the existing path. It the existing path is better, we keep it and discard the new path.
To determine, whether one path dominates another, we compare the paths’ costs and sort orders. For one path to dominate the other one, it must be cheaper and at least as good sorted. If the paths are sorted differently, we keep them both.
This basic logic is executed for both regular paths and partial paths. Those are paths that can be executed in parallel and will eventually be merged with the main paths (once the parallel portion is finished). While “normal” paths (completely sequential ones or parallel paths that have finished their parallel portion) are stored in the pathlist, partial paths are stored in the partial_paths list.
- Parameters:
rel (RelOptInfo)
path (QueryPlan)
is_partial (bool)
- Return type:
None