Optimizer Package#
The optimizer module contains utilities to develop optimizers as well as simple optimization algorithms.
- class postbound.opt.CardinalityDistortion(*args, **kwargs)#
Decorator to simulate errors during cardinality estimation.
The distortion service uses cardinality estimates produced by an actual estimator and mofifies its estimations to simulate the effect of deviations and misestimates.
Behavior regarding cross products is inferred based on the behavior of the actual estimator.
- Parameters:
estimator (CardinalityEstimator) – The actual estimator that calculates the “correct” cardinalities.
distortion_factor (float) – How much the cardinalities are allowed to deviate from the original estimations. Values > 1 simulate overestimation whereas values < 1 simulate underestimation. For example, a distortion factor of 0.5 means that the final estimates can deviate at most half of the original cardinalities, pr a factor of 1.3 allows an overestimation of up to 30%.
distortion_strategy (Literal["fixed", "random"], optional) – How the estimation errors should be calculated. The default fixed strategy always applies the exact distrotion factor to the cardinalities. For example, an estimate of 1000 tuples would always become 1300 tuples with a distrotion factor of 1.3. On the other hand the random strategy allows any error between 1 and the desired factor and selects the specific distortion at random. For example, an estimate of 100 could become any cardinality between 50 and 100 tuples with a distortion factor of 0.5.
- calculate_estimate(query: SqlQuery, intermediate: TableReference | Iterable[TableReference]) Cardinality#
Determines the cardinality of a specific intermediate.
- Parameters:
query (SqlQuery) – The query being optimized
intermediate (TableReference | Iterable[TableReference]) – The intermediate for which the cardinality should be estimated. All filter predicates, etc. that are applicable to the intermediate can be assumed to be applied.
- Returns:
The estimated cardinality of the specific intermediate
- Return type:
- describe() dict#
Provides a JSON-serializable representation of the specific strategy, as well as important parameters.
- Returns:
The description
- Return type:
jsondict
See also
OptimizationPipeline.describe
- class postbound.opt.IndexInfo(column: ColumnReference, index_type: Literal['primary', 'secondary', 'none'], invalid: bool = False)#
This class captures relevant information about the availability of per-column indexes and their status.
The lifecycle of an index can be managed using the invalidate method. This indicates that an index can no longer be used for a specific join, for example because its column has become part of an intermediate result already. In contrast to many other types in PostBOUND, index information is a mutable structure and can be changed in-place.
The current implementation is only focused on indexes over a single column, multi-column indexes are not supported. Another limitation is that the specific type (i.e. data structure) of the index is not captured. If this information is important, it has to be maintained by the user.
- Parameters:
column (ColumnReference) – The column fr which the index is created
index_type (Literal["primary", "secondary", "none"]) – The kind of index that is maintained.
"none"indicates that there is no index on the column. This is a different concept from an index that exists, but cannot be used. The latter case is indicated via the invalid parameterinvalid (bool, optional) – Whether the index can still be used during query execution. Typically, this is true for relations that have not been included in any intermediate result and false afterwards.
- can_pk_fk_join(other: IndexInfo) bool#
Checks, whether two columns can be joined via a primary key/foreign key join.
This method does not restrict the direction of such a join, i.e. each column could act as the primary key or foreign key. Likewise, no datatype checks are performed and it is assumed that a database system would be able to actually join the two columns involved.
If indexes on any of the columns are no longer available, this check fails.
- Parameters:
other (IndexInfo) – The index information of the other column that should participate in the join
- Returns:
Whether a primary key/foreign key join could be executed between the columns.
- Return type:
bool
- property column: ColumnReference#
Get the column to which the index information belongs.
- Returns:
The column
- Return type:
- static generate_for(column: ColumnReference, db_schema: DatabaseSchema) IndexInfo#
Determines available indexes for a specific column.
- Parameters:
column (ColumnReference) – The column. It has to be connected to a valid, non-virtual table reference
db_schema (DatabaseSchema) – The schema of the database to which the column belongs.
- Returns:
_description_
- Return type:
- Raises:
base.UnboundColumnError – If the column is not associated with any table
- property index_type: Literal['primary', 'secondary', 'none']#
Get the kind of index that is in principle available on the column.
The index type does not contain any information about whether an index is actually usable for a specific join. It merely states whether an index has been defined.
- Returns:
The index type. Can be primary, secondary or none.
- Return type:
str
- invalidate() None#
Marks the index as invalid.
Once a table is included in an intermediate join result, the index structures of its columns most likely become invalid, and it is no longer possible to use the index to query for specific tuples (because the occurrences of the individual tuples are multiplied when executing the join). This method can be used to model the lifecycle of index structures within the course of the execution of a single query.
- Return type:
None
- is_indexed() bool#
Checks, whether there is any valid index defined for the column.
This check does not differentiate between primary key indexes and secondary indexes.
- Returns:
Whether this is a primary key or secondary index and ensures that it is still valid.
- Return type:
bool
- property is_invalid: bool#
Get whether the index is actually usable.
To determine whether an index can be used right now, this property has to be combined with the index_type value. If there never was an index on the column, is_valid might have been true from the get-go. To make this check easier, a number of utility methods exist.
- Returns:
Whether the index is usable if it exists. If there is no index on the column, the index cannot be interpreted in any meaningful way.
- Return type:
bool
- is_primary() bool#
Checks, whether this is a valid primary index.
- Returns:
Whether this is a primary key index and ensures that it is still valid.
- Return type:
bool
- is_secondary() bool#
Checks, whether this is a valid secondary index.
- Returns:
Whether this is a secondary index and ensures that it is still valid.
- Return type:
bool
- static no_index(column: ColumnReference) IndexInfo#
Creates index information that indicates the absence of an index.
- Parameters:
column (ColumnReference) – A column that does not have any index
- Returns:
The index information
- Return type:
- static primary_index(column: ColumnReference) IndexInfo#
Creates index information for a primary key index.
- Parameters:
column (ColumnReference) – The indexed column
- Returns:
The index information. The index is initialized as a valid index.
- Return type:
- static secondary_index(column: ColumnReference) IndexInfo#
Creates index information for a secondary index.
Foreign key indexes are often defined this way.
- Parameters:
column (ColumnReference) – The indexed column
- Returns:
The index information. The index is initialized as a valid index.
- Return type:
- class postbound.opt.JoinGraph(query: ImplicitSqlQuery, db_schema: DatabaseSchema | None = None, *, include_predicate_equivalence_classes: bool = False)#
The join graph models the connection between different tables in a query.
All tables that are referenced in the query are represented as the nodes in the graph. If two tables are joined via a join predicate in the SQL query, they will be linked with an edge in the join graph. This graph is further annotated by the join predicate. Additionally, the join graph stores index information for each of the relevant columns in the query.
In contrast to many other types in PostBOUND, a join graph is a mutable structure. It also models the current state of the optimizer once specific tables have been included in an intermediate join result.
The wording of the different join graph methods distinguishes three states of joins (and correspondingly tables):
a join might be free, if at least one of the corresponding tables have not been marked as joined, yet
a join might be available, if it is free and one of the tables is already included in some intermediate join
a join is consumed, if its no longer free. This occurs once the partner tables have both been marked as joined.
A further distinction is made between n:m joins and primary key/foreign key joins. Information about the available joins of each type can be queried easily and many methods are available in two variants: one that includes all possible joins and one that is only focused on primary key/foreign key joins. To determine the precise join types, the join graph needs to access the database schema. A n:m join is one were the column values of both join partners can appear an arbitrary number of times, corresponding to an n:m relation between the two tables.
By calling the mark_joined method, the state of individual joins and their corresponding tables might change. This also means that former primary key/foreign key joins might become n:m joins (which is the case exactly when the primary key table is inserted into an intermediate join result).
- Parameters:
query (ImplicitSqlQuery) – The query for which the join graph should be generated
db_schema (Optional[DatabaseSchema], optional) – The schema of the database on which the query should be executed. If this is
None, the database schema is inferred based on the DatabasePool.include_predicate_equivalence_classes (bool, optional) – Whether predicates from the same equivalence class should be added to the join graph as well, even if they do not exist in the original query. Consider two join conditions a = b and b = c. From these conditions, it follows that a = c by the transitive property. If include_predicate_equivalence_classes is true, that last predicate will also be added to the join graph.
Warning
If predicate equivalence classes are used, the optimization algorithm can potentially generate queries that contain additional predicates that were not present in the original query.
- all_joins() Iterable[tuple[TableReference, TableReference]]#
Provides all edges in the join graph, no matter whether they are available or not.
- Returns:
The possible joins in the graph. The assignment to the first or second component of the tuple is arbitrary
- Return type:
Iterable[tuple[TableReference, TableReference]]
- available_deep_pk_join_paths_for(fk_table: TableReference, ordering: Callable[[TableReference, dict], int] | None = None) Iterable[JoinPath]#
Provides all available pk/fk joins with the given table, as well as follow-up pk/fk joins.
In contrast to the available_pk_fk_joins_for method, this method does not only return direct joins between the foreign key table, but augments its output in the following way: suppose the foreign key table is pk/fk joined with a primary key table t. Then, this method also includes all joins of t with additional tables t’, such that t ⋈ t’ is once again a primary key/foreign key join, but this time with t acting as the foreign key and t’ as the primary key. This procedure is repeated for all t’ tables recursively until no more primary key/foreign key joins are available.
Essentially, this is equivalent to performing a breadth-first search on all (directed) primary key/foreign key joins, starting at the foreign key table. The sequence in which joins on the same level are placed into the resulting iterable can be customized via the ordering parameter. This callable receives the current primary key table and the edge data as input and produces a numerical position weight as output (smaller values meaning earlier placement). The provided edge data contains the join predicate under the
"predicate"key. Using the join predicate, the join partner (i.e. the foreign key table) can be retrieved.- Parameters:
fk_table (TableReference) – The foreign key table at which the search should be anchored.
ordering (Callable[[TableReference, dict], int] | None, optional) – How to sort different primary key join partners on the same level. Lower values mean earlier positioning. This defaults to
None, in which case an arbitrary ordering is used.
- Returns:
All deep primary key/foreign key join paths, starting at the fk_table
- Return type:
Iterable[JoinPath]
- available_join_paths(*, both_directions_on_initial: bool = False) Iterable[JoinPath]#
Provides all joins that can be executed in the current join graph.
The precise output of this method depends on the current state of the join graph: If the graph is still in its initial state (i.e. none of the tables is joined yet), all joins are provided. Otherwise, only those join paths are considered available, where one table is already joined, and the join partner is still free. The free table will be the target table in the join path whereas the joined table will be the start table.
- Parameters:
both_directions_on_initial (bool, optional) – Whether to include the join path R -> S as well as S -> R for initial join graphs, assuming there is a join between R and S in the graph.
- Returns:
All possible joins in the current graph.
- Return type:
Iterable[JoinPath]
- available_join_paths_for(table: TableReference) Iterable[JoinPath]#
Returns all possible joins of a specific table.
What constitutes a possible join depends on the state of the join graph: for an initial join graph, the only requirement is a valid join predicate between the tables. In all other cases, exactly one of the tables has to be free and the other table has to be consumed.
- Parameters:
table (TableReference) – The table that should be joined
- Returns:
All possible join paths for the table. This includes n:m joins as well as primary key/foreign key joins. In each join path the specified table will be the start table and the join partner will be the target table.
- Return type:
Iterable[JoinPath]
- available_n_m_join_paths(*, both_directions_on_initial: bool = False) Iterable[JoinPath]#
Provides exactly those join paths from available_join_paths that correspond to n:m joins.
The logic for initial and “dirty” join graphs is inherited from available_join_paths and can be further customized via the both_directions_on_initial parameter.
- Parameters:
both_directions_on_initial (bool, optional) – Whether to include the join path R -> S as well as S -> R for initial join graphs if R ⨝ S is an n:m join.
- Returns:
The available n:m joins
- Return type:
Iterable[JoinPath]
- available_pk_fk_joins_for(fk_table: TableReference) Iterable[JoinPath]#
Provides all available primary key/foreign key joins with a specific foreign key table.
This method does not restrict itself to available joins, but requires that at least one of the join parts is free.
- Parameters:
fk_table (TableReference) – The foreign key table. This will be the start table in all join paths.
- Returns:
All matching join paths. The start table of the path will be the foreign key table, whereas the primary key table will be the target table.
- Return type:
Iterable[JoinPath]
- clone() JoinGraph#
Provides a deep copy of the current join graph.
- Returns:
The copy. It can be safely modified without affecting the original join graph.
- Return type:
- contains_cross_products() bool#
Checks, whether there are any cross products in the input query.
A cross product is a join between tables without a restricting join predicate. Note that this is only the case if those tables are also not linked via a sequence of join predicates with other tables.
- Returns:
Whether the join graph contains at least one cross product.
- Return type:
bool
- contains_free_n_m_joins() bool#
Checks, whether there is at least one more free n:m join remaining in the graph.
- Returns:
Whether there are still n:m joins with at least one free table.
- Return type:
bool
- contains_free_tables() bool#
Checks, whether there is at least one more free tables remaining in the graph.
- Returns:
Whether there are still free tables in the join graph
- Return type:
bool
- count_consumed_tables() int#
Determines the number of tables that have been joined already.
This number might be 1 if only the initial tables has been selected, or 0 if the join graph is still in its initial state.
- Returns:
The number of joined tables
- Return type:
int
- free_tables() frozenset[TableReference]#
Provides all tables that have not been joined, yet.
- Returns:
The tables that are not consumed
- Return type:
frozenset[TableReference]
- initial() bool#
Checks, whether the join graph has already been modified.
- Returns:
Trueindicates that the join graph is still in its initial state, i.e. no table has been marked as joined, yet.- Return type:
bool
- is_available_join(first_table: TableReference, second_table: TableReference) bool#
Checks, whether the join between two tables is still available.
For initial join graphs, this check passes as long as there is a valid join predicate between the two given tables. In all other cases, one of the join partners has to be consumed, whereas the other partner has to be free.
- Parameters:
first_table (TableReference) – The first join partner
second_table (TableReference) – The second join partner
- Returns:
Whether there is a valid join between the given tables and whether this join is still available. The join direction and join type do not matter.
- Return type:
bool
- is_free_table(table: TableReference) bool#
Checks, whether a specific table is still free in this join graph.
If the table is not part of the join graph, an error is raised.
- Parameters:
table (TableReference) – The table to check
- Returns:
Whether the given table is still free
- Return type:
bool
- is_n_m_join(first_table: TableReference, second_table: TableReference) bool#
Checks, whether the join between the supplied tables is an n:m join.
This check does not require the indicated join to be available.
- Parameters:
first_table (TableReference) – The first join partner
second_table (TableReference) – The second join partner
- Returns:
Whether the join between the given tables is an n:m join
- Return type:
bool
Warning
In the current implementation, this check only works for (conjunctions of) binary join predicates. An error is raised for joins between multiple columns
- is_pk_fk_join(fk_table: TableReference, pk_table: TableReference) bool#
Checks, whether the join between the supplied tables is a primary key/foreign key join.
This check does not require the indicated join to be available.
- Parameters:
fk_table (TableReference) – The foreign key table
pk_table (TableReference) – The primary key table
- Returns:
Whether the join between the given tables is a primary key/foreign key join with the correct direction
- Return type:
bool
Warning
In the current implementation, this check only works for (conjunctions of) binary join predicates. An error is raised for joins between multiple columns
- join_components() Iterable[JoinGraph]#
Provides all components of the join graph.
A component is a subgraph of the original join graph, such that the subgraph is connected but there was no edge between nodes from different sub-graphs. This corresponds to the parts of the query that have to be joined via a cross product.
- Returns:
The components of the join graph, each as its own full join graph object
- Return type:
Iterable[JoinGraph]
- join_partners_from(table: TableReference, candidate_tables: Iterable[TableReference]) set[TableReference]#
Provides exactly those tables from a set of candidates that are joined with a specific given table.
This check does not require the joins in question to be available. The existence of a join edge is sufficient.
- Parameters:
table (TableReference) – The table that should be joined with the candidates
candidate_tables (Iterable[TableReference]) – Possible join partners for the table. Join type and direction do not matter
- Returns:
Those tables of the candidate_tables that can be joined with the partner table.
- Return type:
set[TableReference]
- join_predicates_between(first_tables: TableReference | Iterable[TableReference], second_tables: TableReference | Iterable[TableReference] | None = None) Collection[AbstractPredicate]#
Provides all join predicates between sets of tables.
This method operates in two modes: if only one set of tables is given, all join predicates for tables within that set are collected. If two sets are given, all join predicates for tables from both sets are collected, but not predicates from tables within the same set.
The status of the tables, as well as the join type, do not play a role in this check.
- Parameters:
first_tables (TableReference | Iterable[TableReference]) – The first set of candidate tables. Can optionally also be a single table, in which case the check is only performed for this table and the partner set
second_tables (Optional[TableReference | Iterable[TableReference]], optional) – The second set of candidate tables. By default this is
None, which results in collecting only join predicates for tables from first_tables. Can also be a single table, in which case the check is only performed for this table and the partner set
- Returns:
All join predicates
- Return type:
Collection[AbstractPredicate]
- joined_tables() frozenset[TableReference]#
Provides all non-free tables in the join graph.
- Returns:
The tables that have already been joined / consumed.
- Return type:
frozenset[TableReference]
- joins_tables(first_table: TableReference, second_table: TableReference) bool#
Checks, whether the join graph contains an edge between specific tables.
This check does not require the join in question to be available (this is what is_available_join is for).
- Parameters:
first_table (TableReference) – The first join partner
second_table (TableReference) – The second join partner
- Returns:
Whether there is any join predicate between the given tables. The direction or availability does not matter for this check
- Return type:
bool
- mark_joined(table: TableReference, join_edge: AbstractPredicate | None = None) None#
Updates the join graph to include a specific table in the intermediate result.
This procedure also changes the available index structures according to the kind of join that was executed. This is determined based on the current state of the join graph, the index structures, as well as the supplied join predicate. If no join predicate is supplied, it is inferred from the query predicates.
- Parameters:
table (TableReference) – The tables that becomes part of an intermediate result
join_edge (Optional[AbstractPredicate], optional) – The condition that is used to carry out the join. Defaults to
None, in which case the predicate is inferred from the predicates that have been supplied by the initial query.
- Return type:
None
- nx_graph() Graph#
Provides the underlying graph object for this join graph.
- Returns:
A deep copy of the raw join graph
- Return type:
nx.Graph
- class postbound.opt.JoinPath(start_table: TableReference, target_table: TableReference, join_condition: AbstractPredicate | None = None)#
A join path models the join between two tables.
Usually a path represents a join where one table is part of an intermediate result, whereas the other table is already part of an intermediate result. However, this is not required.
- Parameters:
start_table (TableReference)
target_table (TableReference)
join_condition (AbstractPredicate | None)
- start_table#
The first join partner involved in the join. This is the table that is already part of the intermediate result of the query
- Type:
- target_table#
The second join partner involved in the join. This is the table that is not yet part of any intermediate result. Thus, this is the table that should be joined next
- Type:
- join_condition#
The predicate that is used to actually join the target_table with the current intermediate result. Usually the predicate is restricted to the the join between start_table and target_table, but can also include additional join predicates over other tables in the intermediate results.
- Type:
Optional[AbstractPredicate], optional
- flip_direction() JoinPath#
Creates a new join path with the start and target tables reversed.
- Returns:
The new join path
- Return type:
- spans_table(table: TableReference) bool#
Checks, whether a specific table is either the start, or the target table in this path.
- Parameters:
table (TableReference) – The table to check
- Returns:
Whether the table is part of the join path. Notice that this check does not consider tables that are part of the join_condition.
- Return type:
bool
- tables() tuple[TableReference, TableReference]#
Provides the tables that are joined.
- Returns:
The tables
- Return type:
tuple[TableReference, TableReference]
Warning
The definition of this methods differs slightly from other definitions of the tables method that can be found in the query abstraction layer. The tables method for join paths really only focuses on start_table and target_table. If additional tables appear as part of the join_condition, they are ignored.
- class postbound.opt.PlanChangeEntry(change_type: Literal['tree-structure', 'join-direction', 'physical-op', 'card-est', 'cost-est', 'actual-card'], left_state: frozenset[TableReference] | ScanOperator | JoinOperator | IntermediateOperator | float, right_state: frozenset[TableReference] | ScanOperator | JoinOperator | IntermediateOperator | float, context: frozenset[TableReference] | None = None)#
Models a single diff between two join trees.
The compared join trees are referred two as the left tree and the right tree, respectively.
- Parameters:
change_type (Literal['tree-structure', 'join-direction', 'physical-op', 'card-est', 'cost-est', 'actual-card'])
left_state (frozenset[TableReference] | ScanOperator | JoinOperator | IntermediateOperator | float)
right_state (frozenset[TableReference] | ScanOperator | JoinOperator | IntermediateOperator | float)
context (frozenset[TableReference] | None)
- change_type#
Describes the precise difference between the trees. tree-structure indicates that the two trees are fundamentally different. This occurs when the join orders are not the same. join-direction means that albeit the join orders are the same, the roles in a specific join are reversed: the inner relation of one tree acts as the outer relation in the other one and vice-versa. physical-op means that two structurally identical nodes (i.e. same join or base table) differ in the assigned physical operator. card-est indicates that two structurally identifcal nodes (i.e. same join or base table) differ in the estimated cardinality, while cost-est does the same, just for the estimated cost.
- Type:
Literal[“tree-structure”, “join-direction”, “physical-op”, “card-est”]
- left_state#
Depending on the change_type this attribute describes the left tree. For example, for different tree structures, these are the tables in the left subtree, for different physical operators, this is the operator assigned to the node in the left tree and so on. For different join directions, this is the entire join node
- Type:
frozenset[TableReference] | PhysicalOperator | float
- right_state#
Equivalent attribute to left_state, just for the right tree.
- Type:
frozenset[TableReference] | PhysicalOperator | float
- context#
For different physical operators or cardinality estimates, this describes the intermediate that is different. This attribute is unset by default.
- Type:
Optional[frozenset[TableReference]], optional
- inspect() str#
Provides a human-readable string of the diff.
- Returns:
The diff
- Return type:
str
- class postbound.opt.PlanChangeset(changes: Collection[PlanChangeEntry])#
Captures an arbitrary amount of join tree diffs.
- Parameters:
changes (Collection[PlanChangeEntry])
- changes#
The diffs
- Type:
Collection[JointreeChangeEntry]
- inspect() str#
Provides a human-readable string of the entire diff.
The diff will typically contain newlines to separate individual entries.
- Returns:
The diff
- Return type:
str
- class postbound.opt.PreComputedCardinalities(*args, **kwargs)#
Re-uses existing cardinalities from an external data source.
The cardinalities have to be stored in a CSV file which follows a certain structure. Some details can be customized (e.g. column names). Most importantly, queries have to be identified via their labels. See parameters for details.
- Parameters:
workload (workloads.Workload) – The workload which was used to calculate the cardinalities. This is required to determine the query label based on an input query. Each hint generator can only support a specific workload.
lookup_table_path (str) – The file path to the CSV file containing the cardinalities.
include_cross_products (bool, optional) – Whether cardinality estimates for arbitrary cross products are contained in the CSV file and hence can be used during estimation. By default this is disabled.
default_cardinality (Optional[Cardinality], optional) – In case no cardinality estimate exists for a specific intermediate, a default cardinality can be used instead. In case no default value has been specified, an error would be raised. Notice that a
Nonevalue unsets the default. If the client should handle this situation instead, another value (e.g.Cardinality.unknown()has to be used). The default cardinality is not used if live fallback is enabled (see below).label_col (str, optional) – The column in the CSV file that contains the query labels. Defaults to label.
tables_col (str, optional) – The column in the CSV file that contains the (JSON serialized) tables that form the current intermediate result of the current query. Defaults to tables.
cardinality_col (str, optional) – The column in the CSV file that contains the actual cardinalities. Defaults to cardinality.
live_fallback (bool, optional) – Whether to fall back to a live database in case no cardinality estimate is found in the CSV file. This is off by default.
error_on_missing_card (bool, optional) – If live fallback is disabled and we did not find a cardinality estimate for a specific intermediate, we will raise an error by default. If this is not desired and missing values can be handled by the client, this behavior can be disabled with this parameter.
live_fallback_style (Literal["actual", "estimated"], optional) – In case the fallback is enabled, this customizes the calculation strategy. “actual” will calculate the true cardinality of the intermediate in question, whereas “estimated” (the default) will use the native optimizer to estimate the cardinality.
live_db (Optional[Database], optional) – The database system that should be used in case of a live fallback. If omitted, the database system is inferred from the database pool.
save_live_fallback_results (bool, optional) – Whether the cardinalities computed by the live fallback should be stored in the original file containing the lookup table. This is only used if live fallback is active and enabled by default.
- calculate_estimate(query: SqlQuery, intermediate: TableReference | Iterable[TableReference]) Cardinality#
Determines the cardinality of a specific intermediate.
- Parameters:
query (SqlQuery) – The query being optimized
intermediate (TableReference | Iterable[TableReference]) – The intermediate for which the cardinality should be estimated. All filter predicates, etc. that are applicable to the intermediate can be assumed to be applied.
- Returns:
The estimated cardinality of the specific intermediate
- Return type:
- describe() dict#
Provides a JSON-serializable representation of the specific strategy, as well as important parameters.
- Returns:
The description
- Return type:
jsondict
See also
OptimizationPipeline.describe
- class postbound.opt.PreciseCardinalities(*args, **kwargs)#
Cardinality “estimator” that calculates exact cardinalities.
These cardinalities are determined by actually executing the intermediate query plan and counting the number of result tuples. To speed up this potentially very costly computation, the estimator can store already calculated cardinalities in an intermediate cache. Notice that this cache is different from the query cache provided by the Database interface. The reason for this distinction is simple: the query result cache assumes static databases. If it connects to the same logical database at two different points in time (potentially after a data shift), the cached results will be out-of-date. On the other hand, the cardinality cache is transient and local to each estimator. Therefore, it will always calculate the current results, even when a data shift is simulated. Even when the same estimator is used while simulating a data shift, the cache can be reset manually without impacting caching of all other queries.
- Parameters:
database (Optional[Database], optional) – The database for which the estimates should be calculated. If omitted, the database system is inferred from the database pool.
enable_cache (bool, optional) – Whether cardinalities of intermediates should be cached for the lifetime of the estimator object. Defaults to False.
allow_cross_products (bool, optional) – Whether cardinality estimates for arbitrary cross products should be included.
- calculate_estimate(query: SqlQuery, intermediate: TableReference | Iterable[TableReference]) Cardinality#
Determines the cardinality of a specific intermediate.
- Parameters:
query (SqlQuery) – The query being optimized
intermediate (TableReference | Iterable[TableReference]) – The intermediate for which the cardinality should be estimated. All filter predicates, etc. that are applicable to the intermediate can be assumed to be applied.
- Returns:
The estimated cardinality of the specific intermediate
- Return type:
- describe() dict#
Provides a JSON-serializable representation of the specific strategy, as well as important parameters.
- Returns:
The description
- Return type:
jsondict
See also
OptimizationPipeline.describe
- class postbound.opt.TableInfo(free: bool, index_info: Collection[IndexInfo])#
This class captures information about the state of tables in the join graph.
- Parameters:
free (bool)
index_info (Collection[IndexInfo])
- free#
Whether the table is still free, i.e. is not a part of any intermediate join result.
- Type:
bool
- index_info#
Information about the indexes of all columns that belong to the table. If a column does not appear in this collection, it does not have any indexes, or the column is not relevant in the current join graph (i.e. because it does not appear in any join predicate)
- Type:
Collection[IndexInfo]
- postbound.opt.actual_plan_cost(query: SqlQuery, analyze_plan: QueryPlan, *, database: Database | None = None) float#
Utility to compute the true cost of a query plan based on the actual cardinalities.
- Parameters:
- Returns:
_description_
- Return type:
float
- postbound.opt.compare_query_plans(left: QueryPlan, right: QueryPlan) PlanChangeset#
Computes differences between two query execution plans.
- postbound.opt.explode_query_plan(query_plan: QueryPlan, *, card_source: Literal['estimated', 'actual'] = 'estimated') tuple[LogicalJoinTree, PhysicalOperatorAssignment, PlanParameterization]#
Extracts the join tree, physical operators, and plan parameters from a query plan.
- Parameters:
query_plan (QueryPlan) – The query plan to extract the information from
card_source (Literal["estimated", "actual"], optional) – Which cardinalities to use in the join tree and the plan parameters. Defaults to the estimated cardinalities.
- Returns:
The different components of the query plan
- Return type:
tuple[LogicalJoinTree, PhysicalOperatorAssignment, PlanParameterization]
- postbound.opt.join_depth(join_tree: JoinTree) dict[TableReference, int]#
Calculates for each base table in a join tree the join index when it was integrated into an intermediate result.
For joins of two base tables, the depth value is 1. If a table is joined with the intermediate result of the base table join, its depth is 2. Generally speaking, the depth of each table is 1 plus the maximum depth of any table in the intermediate result that the new table is joined with.
- Parameters:
join_tree (JoinTree) – The join tree for which the depths should be calculated.
- Returns:
A mapping from tables to their depth values.
- Return type:
dict[TableReference, int]
Examples
TODO add examples
- postbound.opt.jointree_similarity_bottomup(a: JoinTree, b: JoinTree) float#
Computes the similarity of two join trees based on a bottom-up approach.
- Parameters:
- Returns:
An artificial similarity score in [0, 1]. Higher values indicate larger similarity.
- Return type:
float
Notes
TODO: add discussion of the algorithm
- postbound.opt.jointree_similarity_topdown(a: JoinTree, b: JoinTree, *, symmetric: bool = False, gamma: float = 1.1) float#
Computes the similarity of two join trees using a top-down approach.
- Parameters:
a (JoinTree) – The first join tree
b (JoinTree) – The second join tree
symmetric (bool, optional) – Whether the calculation should be symmetric. If true, the occurence of joins in different branches is not penalized. See Notes for details.
gamma (float, optional) – The reinforcement factor to prioritize similarity of earlier (i.e. deeper) joins. The higher the value, the stronger the amplification, by default 1.1
- Returns:
An artificial similarity score in [0, 1]. Higher values indicate larger similarity.
- Return type:
float
Notes
TODO: add discussion of the algorithm
- postbound.opt.linearized_levenshtein_distance(a: JoinTree, b: JoinTree) int#
Computes the levenshtein distance of the table sequences of two join trees.
- Parameters:
- Returns:
The distance score. Higher values indicate larger distance.
- Return type:
int
References
- postbound.opt.possible_plans_bound(query: SqlQuery, *, join_operators: set[str] = {'hash join', 'nested-loop join', 'sort-merge join'}, scan_operators: set[str] = {'index scan', 'sequential scan'}) int#
Computes a quick upper bound on the maximum number of possible query execution plans for a given query.
This upper bound is a very coarse one, based on three assumptions:
any join sequence (even involving cross-products) of any form (i.e. right-deep, bushy, …) is allowed
the choice of scan operators and join operators can be varied freely
each table can be scanned using arbitrary operators
The number of real-world query execution plans will typically be much smaller, because cross-products are only used if really necessary and the selected join operator influences the scan operators and vice-versa.
- Parameters:
query (SqlQuery) – The query for which the bound should be computed
join_operators (set[str], optional) – The allowed join operators, by default {“nested-loop join”, “hash join”, “sort-merge join”}
scan_operators (set[str], optional) – The allowed scan operators, by default {“sequential scan”, “index scan”}
- Returns:
An upper bound on the number of possible query execution plans
- Return type:
int
- postbound.opt.read_jointree_json(json_data: dict | str) JoinTree#
Loads a jointree from its JSON representations.
- Parameters:
json_data (dict | str) – Either the JSON dictionary, or a string encoding of the dictionary (which will be parsed by json.loads).
- Returns:
The corresponding join tree
- Return type:
- postbound.opt.read_operator_assignment_json(json_data: dict | str) PhysicalOperatorAssignment#
Loads an operator assignment from its JSON representation.
- Parameters:
json_data (dict | str) – Either the JSON dictionary, or a string encoding of the dictionary (which will be parsed by json.loads).
- Returns:
The assignment
- Return type:
- postbound.opt.read_operator_json(json_data: dict | str) ScanOperator | JoinOperator | IntermediateOperator | ScanOperatorAssignment | JoinOperatorAssignment | None#
Reads a physical operator assignment from a JSON dictionary.
- Parameters:
json_data (dict | str) – Either the JSON dictionary, or a string encoding of the dictionary (which will be parsed by json.loads).
- Returns:
The parsed assignment. Whether it is a scan or join assignment is inferred from the JSON dictionary. If the input is empty or None, None is returned.
- Return type:
Optional[ScanOperators | JoinOperators | ScanOperatorAssignment | JoinOperatorAssignment]
- postbound.opt.read_plan_params_json(json_data: dict | str) PlanParameterization#
Loads a plan parameterization from its JSON representation.
- Parameters:
json_data (dict | str) – Either the JSON dictionary, or a string encoding of the dictionary (which will be parsed by json.loads).
- Returns:
The plan parameterization
- Return type:
- postbound.opt.read_query_plan_json(json_data: dict | str) QueryPlan#
Reads a query plan from its JSON representation.
- Parameters:
json_data (dict | str) – Either the JSON dictionary, or a string encoding of the dictionary (which will be parsed by json.loads).
- Returns:
The corresponding query plan
- Return type:
- postbound.opt.star_query_cardinality(query: SqlQuery, fact_table_pk_column: ColumnReference, *, database: Database | None = None, verbose: bool = False) int#
Utility function to manually compute the cardinality of a star query.
This function is intended for situations where the database is unable to compute the cardinality because the intermediates involved in the query become to large or the query plans are simply too bad. It operates by manually computing the number of output tuples for each of the entries in the fact table by sequentially joining the fact table with each dimension table.
- Parameters:
query (SqlQuery) – The query to compute the cardinality for. This is assumed to be a SELECT * query and the actual SELECT clause is ignored completely.
fact_table_pk_column (ColumnReference) – The fact table’s primary key column. All dimension tables must perform an equi-join on this column.
database (Optional[Database], optional) – The actual database. If this is omitted, the current database from the database pool is used.
verbose (bool, optional) – Whether progress information should be printed during the computation. If this is enabled, the function will report every 1000th value processed.
- Returns:
The cardinality (i.e. number of output tuples) of the query
- Return type:
int
Warning
Currently, this function works well for simple SPJ-based queries, more complicated features might lead to wrong results. Similarly, only pure star queries are supported, i.e. there has to be one central fact table and each dimension table performs exactly one equi-join with the fact table’s primary key. There may not be additional joins on the dimension tables. If such additional dimension joins exist, they have to be pre-processed (e.g. by introducing materialized views) and the query has to be rewritten to operate on the views instead. It is the user’s responsibility to ensure that the query is well-formed in these regards.
- postbound.opt.to_query_plan(join_tree: JoinTree, *, query: SqlQuery | None = None, physical_ops: PhysicalOperatorAssignment | None = None, plan_params: PlanParameterization | None = None, scan_op: ScanOperator | None = None, join_op: JoinOperator | None = None) QueryPlan#
Creates a query plan from a join tree.
This function operates in two different modes: physical operators can either be assigned to each node of the join tree individually using the physical_ops, or the same operator can be assigned to all scans and joins using the scan_op and join_op parameters. If the former approach is used, fallback/default operators can be provided to compensate missing operators in the assignment. Furthermore, plan_params can be used to inject custom cardinality estimates and parallel workers to the nodes.
If the supplied join_tree is a LogicalJoinTree, its cardinality estimates are used as a fallback if no estimate from the plan parameters is available.
Notice that the resulting query plan does not contain any DB-specific features. For example, assigning a hash join to an intermediate does not also insert a hash operator, as is done by some database systems.
- Parameters:
join_tree (JoinTree) – The join order to use for the query plan. If this is a logical join tree, the cardinality estimates can be added to the query plan if no more specific estimates are available through the plan_params.
query (Optional[SqlQuery], optional) – The query that is computed by the query plan. If this is supplied, it is used to compute join predicates and filters that can be computed at the various nodes of the query plan.
physical_ops (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for individual nodes of the join tree. If this is supplied, the scan_op and join_op parameters are used as a fallback if no assignment exists for a specific intermediate. Notice that parallel workers contained in the operator assignments are never used since this information should be made available through the plan_params.
plan_params (Optional[PlanParameterization], optional) – Optional cardinality estimates and parallelization info for the nodes of the join tree. If this is not supplied, cardinality estimates are inferred from a logical join tree or left as NaN otherwise.
scan_op (Optional[ScanOperator], optional) – The operator to assign to all scans in the query plan. If no physical_ops are given, this parameter has to be specified. If physical_ops are indeed given, this parameter is used as a fallback if no assignment exists for a specific scan.
join_op (Optional[JoinOperator], optional) – The operator to assign to all joins in the query plan. If no physical_ops are given, this parameter has to be specified. If physical_ops are indeed given, this parameter is used as a fallback if no assignment exists for a specific join.
- Returns:
The resulting query plan
- Return type:
- postbound.opt.update_plan(query_plan: QueryPlan, *, operators: PhysicalOperatorAssignment | None = None, params: PlanParameterization | None = None, simplify: bool = True) QueryPlan#
Assigns new operators and/or new estimates to a query plan, leaving the join order intact.
Notice that this update method is not particularly smart and only operates on a per-node basis. This means that high-level functions that are composed of multiple operators might not be updated properly. For example, Postgres represents a hash join as a combination of a hash operator (which builds the actual hash table) and a follow-up hash join operator (which performs the probing). If the update changes the hash join to a different join, the hash operator will still exist, likely leading to an invalid query plan. To circumvent such problems, the query plan is by default simplified before processing. Simplification removes all auxiliary non-join and non-scan operators, thereby effectively only leaving those nodes with a corresponding operator. But, there is no free lunch and the simplification might also remove some other important operators, such as using hash-based or sort-based aggregation operators. Therefore, simplification can be disabled by setting the simplify parameter to False.
- Parameters:
query_plan (QueryPlan) – The plan to update.
operators (Optional[PhysicalOperatorAssignment], optional) – The new operators to use. This can be a partial assignment, in which case only the operators that are present in the new assignment are used and all others are left unchanged. If this parameter is not given, no operators are updated.
params (Optional[PlanParameterization], optional) – The new parameters to use. This can be a partial assignment, in which case only the cardinalities/parallel workers in the new assignment are used and all others are left unchanged. If this parameter is not given, no parameters are updated.
simplify (bool, optional) – Whether to simplify the query plan before updating it. For a detailed discussion, see the high-level documentatio of this method. Simplifications is enabled by default.
- Returns:
The updated query plan
- Return type:
See also
QueryPlan.simplify
Simple Optimizers#
In addition to general utilities, we also ship a number of simple or commonly-used optimizer implementations. These should mostly be used as baselines or as part of a more complex optimizer. The following optimizers are provided: