Database Abstraction

Database Abstraction#

Module name: postbound.db

The db module provides tools to interact with various database instances.

Generally speaking, the interactions are bidirectional: on the one hand, common database concepts like retrieving statistical information, introspecting the logical schema or obtaining physical query execution plans are enabled through abstract interfaces and can be used by the optimizer modules. On the other hand, the db modules provides tools to enforce optimization decisions made as part of an optimization pipeline or other modules when executing the query on the actual database system.

Recall that PostBOUND does not interact with the query optimizer directly and instead relies on system-specific hints or other special properties of the target database system to influence the optimizer behaviour. Therefore, the optimization process usually terminates with transforming the original input query to a logically equivalent query that at the same time contains the necessary modifications for optimization.

The central entrypoint to all database interaction is the abstract Database class. This class is inherited by all supported database systems (currently PostgreSQL and MySQL). Each Database instace provides some basic functionality on its own (such as the ability to execute queries), but delegates most of the work to specific and tailored interfaces. For example, the DatabaseSchema models all access to the logical schema of a database and the OptimizerInterface encapsulates the functionality to retrieve cost estimates or phyiscal query execution plans. All of these interfaces are once again abstract and implemented according to the specifics of the actual database system.

Take a look at the individual interfaces for further information about their functionality and intended usage.

This module provide direct access to the Postgres interface along with a shortcut method to retrieve the current database (aptly called current_database). In the background, this method delegates to the DatabasePool. If you want to use the MySQL interface, make sure to install PostBOUND with MySQL support enabled and import mysql from the db package.

class postbound.db.Cursor(*args, **kwargs)#

Interface for database cursors that adhere to the Python Database API specification.

This is not a complete representation and only focuses on the parts of the specification that are important for PostBOUND right now. In the future, additional methods might get added.

This type is only intended to denote the expected return type of certain methods, the cursors themselves are supplied by the respective database integrations. There should be no need to implement one manually and all cursors should be compatible with this interface by default (since they are DB API 2.0 cursor objects).

See PEP 249 for details (https://peps.python.org/pep-0249/)

class postbound.db.Connection(*args, **kwargs)#

Interface for database connections that adhere to the Python Database API specification.

This is not a complete representation and only focuses on the parts of the specification that are important for PostBOUND right now. In the future, additional methods might get added.

This type is only intended to denote the expected return type of certain methods, the connections themselves are supplied by the respective database integrations. There should be no need to implement one manually and all connections should be compatible with this interface by default (since they are DB API 2.0 connection objects).

See PEP 249 for details (https://peps.python.org/pep-0249/)

class postbound.db.PrewarmingSupport(*args, **kwargs)#

Some databases might support adding specific tables to their shared buffer.

If so, they should implement this protocol to allow other parts of the framework to exploit this feature.

abstractmethod prewarm_tables(tables: TableReference | Iterable[TableReference] | None = None, *more_tables: TableReference, exclude_table_pages: bool = False, include_primary_index: bool = True, include_secondary_indexes: bool = True) → None#

Prepares the database buffer pool with tuples from specific tables.

Parameters:

tables (Optional[TableReference | Iterable[TableReference]], optional) – The tables that should be placed into the buffer pool
*more_tables (TableReference) – More tables that should be placed into the buffer pool, enabling a more convenient usage of this method. See examples for details on the usage.
exclude_table_pages (bool, optional) – Whether the table data (i.e. pages containing the actual tuples) should not be prewarmed. This is off by default, meaning that prewarming is applied to the data pages. This can be toggled on to only prewarm index pages (see include_primary_index and include_secondary_index).
include_primary_index (bool, optional) – Whether the pages of the primary key index should also be prewarmed. Enabled by default.
include_secondary_indexes (bool, optional) – Whether the pages for secondary indexes should also be prewarmed. Enabled by default.

Return type:

None

Notes

If the database should prewarm more table pages than can be contained in the shared buffer, the actual contents of the pool are not specified. All prewarming tasks might happen sequentially, in which case the first prewarmed relations will typically be evicted and only the last relations (tables or indexes) are retained in the shared buffer. The precise order in which the prewarming tasks are executed is not specified and depends on the actual relations.

Examples

>>> database.prewarm_tables([table1, table2])
>>> database.prewarm_tables(table1, table2)

class postbound.db.TimeoutSupport(*args, **kwargs)#

Marks database systems that support executing queries with a timeout.

execute_with_timeout(query: SqlQuery | str, *, timeout: float = 60.0) → Sequence[tuple] | None#

Executes a query with a specific timeout.

For query execution, we use the following rules in contrast to Database.execute_query:

We never make use of the database interfaces’ cache, even if it the query is contained in the cache
We never attempt to simplify the result set, even if this would be possible (e.g., for single-row result sets). This is more of a pragmatic decision to be able to indicate a timeout with None and distinguishing it from a valid result set of a single NULL tuple. Otherwise, we would have to resort to raising TimeoutError or similar strategies, which complicates the control flow for the caller.

Parameters:

query (SqlQuery | str) – The query to execute. If this contains hints or other special features, those will be treated normally.
timeout (float, optional) – The timeout in seconds. If the query takes longer (inlcuding all special treatment of the database interface), it will be cancelled. Defaults to 60 seconds.

Returns:

The result set of the query. If the query was cancelled, this will be None.

Return type:

Optional[ResultSet]

class postbound.db.StopwatchSupport(*args, **kwargs)#

Marks the database systems that support measurement of query execution times.

last_query_runtime() → float#

Get the runtime of the last executed query.

The execution time is measured from the moment the query is passed to the internal cursor (i.e. including sending the query to the database server), until the execution is finished. Therfore, it does not include the time required to transfer the result set back to the client.

Returns:: The runtime of the last executed query in seconds. If no query has been executed before, NaN is returned.
Return type:: float

time_query(query: SqlQuery | str, *, timeout: float | None = None) → float#

Determines the execution time of a query.

Parameters:

query (SqlQuery | str) – The query to execute.
timeout (Optional[float], optional) – Cancels the query execution if it takes longer than this number (in seconds). Notice that this parameter requires timeout support from the database system.

Returns:

The runtime of the query in seconds. The result set is ignored.

Return type:

float

Raises:

UnsupportedDatabaseFeatureError – If the database system does not support timeouts. You can use the TimeoutSupport protocol to check this beforehand.

exception postbound.db.QueryCacheWarning(msg: str)#

Warning to indicate that the query result cache was not found.

Parameters:: msg (str)
Return type:: None

class postbound.db.ForeignKeyRef(fk_col, referenced_col)#

fk_col#: Alias for field number 0

referenced_col#: Alias for field number 1

class postbound.db.DatabaseSchema(db: Database, *, prep_placeholder: str = '%s')#

This interface provides access to different information about the logical structure of a database.

In contrast to database statistics, schema information is much more standardized. PostBOUND therefore only takes on the role of a mediator to delegate requests to different parts of the schema to the approapriate - and sometimes system specific - metadata catalogs of the database systems. For each kind of schema information a dedicated query method exists. Take a look at these methods to understand the functionality provided by the database schema interface.

The database schema can be iterated over to obtain all tables in the current database. Key-based access is supported as follows: schema[table] provides a TableInfo that contains a summary of the tables structure (e.g. its columns, primary key, foreign keys, etc.). Similarly, schema[column] provides a ColumnInfo with detailed information about the column (e.g. its datatype, whether it is nullable, etc.). Generally speaking, you should pass a proper TableReference or ColumnReference. It is also possible to pass just a string, but in this case the schema has to guess whether you are referring to a table or a column.

Parameters:

db (Database) – The database for which the schema information should be read. This is required to obtain cursors that request the desired data.
prep_placeholder (str, optional) – The placeholder that is used for prepared statements. Some systems use ? as a placeholder, while others use %s (the default). This needs to be specified to ensure that the information_schema queries are correctly formatted.

Notes

Hint for implementors: the database schema contains no abstract methods that need to be overridden. All methods come with a default implementation that uses the information_schema to retrieve the necessary information. However, if the target database system does not support specific features of the information_schema, the corresponding methods need to be overridden to provide the necessary functionality. The documentation of each method details which parts of the information_schema it needs.

as_graph() → DiGraph#

Constructs a compact representation of the database schema.

The schema is expressed as a directed graph. Each table is represented as a node. Nodes contain the following attributes: - columns: a list of all columns in the table - data_type: a dictionary mapping each column to its data type - primary_key: the primary key of the table (if it exists, otherwise None)

In addition, edges are used to model foreign key constraints. Each edge points from the table that contains the foreign key (column y in the example in foreign_keys_on) to the table that is referenced by the foreign key (x in the example in foreign_keys_on). Edges contain an attribute foreign_keys with a list of the foreign key relationships. Each such constraint is described by a ForeignKeyRef.

Return type:: DiGraph

columns(table: TableReference | str) → Sequence[BoundColumnReference]#

Fetches all columns of the given table.

Parameters:: table (TableReference | str) – A table in the current schema
Returns:: All columns for the given table. Columns are ordered according to their position in the table. Will be empty if the table is not found or does not contain any columns.
Return type:: Sequence[BoundColumnReference]
Raises:: postbound.qal.VirtualTableError – If the given table is virtual (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.columns view.

datatype(column: ColumnReference, *, raw: bool = False) → str#

Retrieves the (physical) data type of a column.

The provided type can be a standardized SQL-type, but it can be a type specific to the concrete database system just as well. It is up to the user to figure this out and to react accordingly.

Parameters:

column (ColumnReference) – The colum to check
raw (bool, optional) – If enabled, the type will be returned as-is. Otherwise, it will be lower-cased. By default, lower-casing is enabled.

Returns:

The datatype. Will never be empty. The type will automatically be lower-cased unless raw is set.

Return type:

str

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.columns view.

foreign_keys_on(column: ColumnReference) → set[BoundColumnReference]#

Fetches all foreign key constraints that are specified on a specific column.

The provided columns are the target columns that are referenced by the foreign key constraint. E.g., suppose there are tables A and B with columns x and y. We specify a foreign key constraint on column y to ensure that all values in y reference a value in x. Then, calling this method on column y will return column x. If there are multiple foreign key constraints on the same column, all of them will be returned.

Parameters:

column (ColumnReference) – The column to check. All foreign keys that are “pointing from” this column to another column are returned.

Returns:

The columns that are “pointed to” by foreign key constraints on the given column. If no such foreign keys exist, an empty set is returned.

Return type:

set[BoundColumnReference]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

has_index(column: ColumnReference) → bool#

Checks, whether there is any index structure available on a column

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether any kind of index (primary, or secondary) is available for the column. Only compound indexes will fail this test.

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: the default implementation of this method (transitively) relies on the information_schema.table_constraints and information_schema.constraint_column_usage views. It assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. If this is not the case, a custom implementation needs to be supplied.

has_secondary_index(column: ColumnReference) → bool#

Checks, whether a secondary index is available for a specific column.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether a secondary index of any kind was created for the column. Compound indexes and primary key indexes fail this test.

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: The default implementation of this method assumes that each foreign key column and each column with a UNIQUE constraint has an associated index. If this should not be the case, a custom implementation needs to be supplied. Furthermore, the implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

indexes_on(column: ColumnReference) → set[str]#

Retrieves the names of all indexes of a specific column.

Parameters:

column (ColumnReference) – The column to check.

Returns:

The indexes. If no indexes are available, the set will be empty.

Return type:

set[str]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: the default implementation of this method assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. It provides the names of the corresponding constraints. The implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

is_nullable(column: ColumnReference) → bool#

Checks, whether a specific column may contain NULL values.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether the column may contain NULL values

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.columns view.

is_primary_key(column: ColumnReference) → bool#

Checks, whether a column is the primary key for its associated table.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether the column is the primary key of its table. If it is part of a compound primary key, this is False.

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.table_constraints and information_schema.constraint_column_usage views.

is_string_column(column: ColumnReference) → bool#

Checks, whether a specific column is of a string type.

Parameters:: column (ColumnReference)
Return type:: bool

is_view(table: TableReference | str) → bool#

Checks, whether a specific table is actually is a view.

Parameters:: table (TableReference | str) – The table to check. May not be a virtual table.
Returns:: Whether the table is a view
Return type:: bool
Raises:: ValueError – If the table was not found in the current database

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.tables view.

join_equivalence_classes() → Iterable[set[BoundColumnReference]]#

Calculates the quivalence classes of joinable columns in the database schema.

This method is similar to join_equivalence_keys, but returns the different equivalence classes instead of a mapping. See its documentation for more details.

See also

join_equivalence_keys

Return type:: Iterable[set[BoundColumnReference]]

join_equivalence_keys() → dict[BoundColumnReference, set[BoundColumnReference]]#

Calculates the equivalence classes of joinable columns in the database schema.

Two columns are considered joinable, if they are linked by a foreign key constraint. For example, consider a schema with three tables R, S and T with foreign keys R.a -> S.b and S.b -> T.c. Then, the columns R.a, S.b and T.c are all joinable and form an equivalence class. Likewise, the constraints R.a -> T.c and S.b -> T.c would establish the same equivalence class. On the other hand, the constraints R.a -> S.b and S.c -> T.d create two different equivalence classes.

Returns:: A mapping from each column to its equivalence class, i.e. the set of all columns that are joinable with it (including itself).
Return type:: dict[BoundColumnReference, set[BoundColumnReference]]

lookup_column(column: ColumnReference | str, candidate_tables: Iterable[TableReference], *, expect_match: bool = False) → TableReference | None#

Searches for a table that owns the given column.

Parameters:

column (ColumnReference | str) – The column that is being looked up
candidate_tables (Iterable[TableReference]) – Tables that could possibly own the given column
expect_match (bool, optional) – If enabled, an error is raised whenever no table is found. Otherwise None is returned. By default, this is disabled.

Returns:

The first of the candidate_tables that has a column of similar name.

Return type:

TableReference

Raises:

ValueError – If expect_match is enabled and none of the candidate tables has a column of the given name.

Notes

Hint for implementors: the default implementation of this method (transitively) relies on the information_schema.columns view.

primary_key_column(table: TableReference | str) → BoundColumnReference | None#

Determines the primary key column of a specific table.

Parameters:: table (TableReference | str) – The table to check
Returns:: The primary key if it exists, or None otherwise.
Return type:: Optional[BoundColumnReference]

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.table_constraints and information_schema.constraint_column_usage views.

tables(*, catalog: str = '', schema: str = '', include_system_tables: bool = False) → set[TableReference]#

Fetches all tables in the database.

Parameters:

catalog (str, optional) – The catalog (i.e. database name) to search for tables. By default, the current catalog is used.
schema (str, optional) – The schema to search for tables. By default, the current schema is used.
include_system_tables (bool, optional) – Whether system tables should also be included. By default, only user-defined tables are returned.

Returns:

All tables in the current schema, including materialized views, etc.

Return type:

set[TableReference]

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.tables view.

class postbound.db.DatabaseStatistics(db: Database, *, emulated: bool = True, enable_emulation_fallback: bool = True, cache_enabled: bool | None = True)#

The statistics interface provides unified access to table-level and column-level statistics.

There are two main challenges when implementing a generalized statistics interface for different database systems. The first one is the non-deterministic creation and maintenance of statistics by most database systems. This means that creating two identical databases on the same database system on the same machine might still yield different statistical values. This is because database systems oftentimes create statistics from random samples of column values to speed up computation. However, such variability hurts our efforts to enable reproducible experiments since different performances metrics might not be due to differences in the optimization algorithms but due to bad luck when creating the statistics (whether it is a good sign if an algorithm is that fragile to deviations in statistics is another question). The second main challenge is that different database systems maintain different statistics. Even though many statistics are considered quite “basic” by the research community, not all systems developers deemed all statistics necessary for their optimizer. Once again, this can severly hinder the application of an optimization algorithm if it relies on a basic statistic that just happens to not be available on the desired target database system.

To address both of these issues, the statistics interface operates in two different modes: in native mode it simply delegates all requests to statistical information to the corresponding catalogs of the database systems. Alternatively, the statistics interface can create the illusion of a normalized and standardized statistics catalogue. This so-called emulated mode does not rely on the statistics catalogs and issues equivalent SQL queries instead. For example, if a statistic on the number of distinct values of a column is requested, this emulated by running a SELECT COUNT(DISTINCT column) FROM table query.

The current mode can be customized using the boolean emulated property. If the statistics interface operates in native mode (i.e. based on the actual statistics catalog) and the user requests a statistic that is not available in the selected database system, the behavior depends on another attribute: enable_emulation_fallback. If this boolean attribute is True, an emulated statistic will be calculated instead. Otherwise, an UnsupportedDatabaseFeatureError is raised.

Since the live computation of emulated statistics can be costly, the statistics interface has its own cache_enabled attribute. It can be set to None to use the default caching behavior of the database system. However, if this attribute is set to True or False directly, caching will be used accordingly for all compute-intensive statistics operations (and only such operations). Once again, this only works because PostBOUND assumes the database to be immutable.

Parameters:

db (Database) – The database for which the schema information should be read. This is required to hook into the database cache and to obtain the cursors to actuall execute queries.
emulated (bool, optional) – Whether the statistics interface should operate in emulation mode. To enable reproducibility, this is True by default
enable_emulation_fallback (bool, optional) – Whether emulation should be used for unsupported statistics when running in native mode, by default True
cache_enabled (Optional[bool], optional) – Whether emulated statistics queries should be subject to caching, by default True. Set to None to use the caching behavior of the db

See also

postbound.postbound.OptimizationPipeline: The basic optimization process applied by PostBOUND

distinct_values(column: ColumnReference, *, emulated: bool | None = None, cache_enabled: bool | None = None) → int | None#

Legacy alias for num_distinct.

Deprecated since version v0.20.2: distinct_values() is deprecated due to the confusing name. Use the more aptly-named num_distinct(), which provides exactly the same interface and functionality.

Parameters:

column (ColumnReference)
emulated (bool | None)
cache_enabled (bool | None)

Return type:

int | None

min_max(column: ColumnReference, *, emulated: bool | None = None, cache_enabled: bool | None = None) → tuple[Any, Any] | None#

Provides (an estimate of) the minimum and maximum values in a column.

Parameters:

column (ColumnReference) – The column to check
emulated (Optional[bool], optional) – Whether to force emulation mode for this single call. Defaults to None which indicates that the emulation setting of the statistics interface should be used.
cache_enabled (Optional[bool], optional) – Whether to enable result caching in emulation mode. Defaults to None which indicates that the caching setting of the statistics interface should be used.

Returns:

A tuple of minimum and maximum value. If no such statistic exists, but the database system in principle maintains the statistic, None is returned. For example, this situation can occur if thec database system only maintains the min/max value if they are sufficiently far apart.

Return type:

Optional[tuple[Any, Any]]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

most_common_values(column: ColumnReference, *, k: int | None = 10, emulated: bool | None = None, cache_enabled: bool | None = None) → MostCommonValues#

Provides (an estimate of) the total number of occurrences of the k most frequent values of a column.

Parameters:

column (ColumnReference) – The column to check
k (Optional[int], optional) – The maximum number of most common values to return. Defaults to 10. If there are less values available, all of the available values will be returned. Setting this to None or a negative value will return the frequency of all distinct values.
emulated (Optional[bool], optional) – Whether to force emulation mode for this single call. Defaults to None which indicates that the emulation setting of the statistics interface should be used.
cache_enabled (Optional[bool], optional) – Whether to enable result caching in emulation mode. Defaults to None which indicates that the caching setting of the statistics interface should be used.

Returns:

The most common values in pairs of (value, frequency), starting with the highest frequency. Notice that this sequence can be empty if no values are available. This can happen if the database system in principle maintains this statistic but does considers the value distribution to uniform to make the maintenance worthwhile. Likewise, if less common values exist than the requested k value, only the available values will be returned (and the sequence will be shorter than k in that case).

Return type:

MostCommonValues

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

num_distinct(column: ColumnReference, *, emulated: bool | None = None, cache_enabled: bool | None = None) → int | None#

Provides (an estimate of) the total number of different column values of a specific column.

Parameters:

column (ColumnReference) – The column to check
emulated (Optional[bool], optional) – Whether to force emulation mode for this single call. Defaults to None which indicates that the emulation setting of the statistics interface should be used.
cache_enabled (Optional[bool], optional) – Whether to enable result caching in emulation mode. Defaults to None which indicates that the caching setting of the statistics interface should be used.

Returns:

The number of distinct values in the column. If no such statistic exists, but the database system in principle maintains the statistic, None is returned. For example, this situation can occur if the database system only maintains a distinct value count if the column values are distributed in a sufficiently diverse way.

Return type:

Optional[int]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

total_rows(table: TableReference, *, emulated: bool | None = None, cache_enabled: bool | None = None) → int | None#

Provides (an estimate of) the total number of rows in a table.

Parameters:

table (TableReference) – The table to check
emulated (Optional[bool], optional) – Whether to force emulation mode for this single call. Defaults to None which indicates that the emulation setting of the statistics interface should be used.
cache_enabled (Optional[bool], optional) – Whether to enable result caching in emulation mode. Defaults to None which indicates that the caching setting of the statistics interface should be used.

Returns:

The total number of rows in the table. If no such statistic exists, but the database system in principle maintains the statistic, None is returned. For example, this situation can occur if the database system only maintains a row count if the table has at least a certain size and the table in question did not reach that size yet.

Return type:

Optional[int]

Raises:

VirtualTableError – If the given table is virtual (e.g. subquery or CTE)

exception postbound.db.HintWarning(msg: str)#

Custom warning category for hinting-related problems.

Parameters:: msg (str)
Return type:: None

class postbound.db.HintService#

Provides the necessary tools to generate system-specific query instances based on optimizer decisions.

Hints are PostBOUNDs way to enforce that decisions made in the optimization pipeline are respected by the native query optimizer once the query is executed in an actual database system. The general documentation provides much more information about why this is necessary and how PostBOUND approaches query optimization and query generation.

Each database system has to implement this interface to be usable as part of an optimization pipeline.

See also

OptimizationPipeline.optimize_query: For a general introduction into the query optimization process

abstractmethod format_query(query: SqlQuery) → str#

Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.

This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.

Parameters:: query (SqlQuery) – The query that should be adapted for the database system
Returns:: An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.
Return type:: str

See also

postbound.qal: the query abstraction layer provided by PostBOUND

abstractmethod generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) → SqlQuery#

Transforms the input query such that the given optimization decisions are respected during query execution.

In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.

Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.

Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.

Parameters:

query (SqlQuery) – The query that should be transformed
plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.
join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.
physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.
plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.

Returns:

The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.

Return type:

SqlQuery

abstractmethod supports_hint(hint: ScanOperator | JoinOperator | IntermediateOperator | HintType) → bool#

Checks, whether the database system is capable of using the specified hint or operator

Parameters:: hint (PhysicalOperator | HintType) – The hint/feature to check
Returns:: Indicates whether the feature is supported by the specific database system.
Return type:: bool

class postbound.db.Histogram(bounds: Iterable[T], frequencies: Iterable[int], *, lower: T, bucket_interpolation: HistogramApproximation = 'bound')#

Histograms are a common way to approximate the distribution of values in a column.

Our implementation models a general histogram as a list of buckets. Each bucket is associated with a frequency, i.e. the number of elements that fall into the bucket. This might be an equi-depth or an equi-width histogram, but this is not a requirement.

Currently, we only provide two high-level methods for cardinality estimation: frequency_below and frequency_above. The former estimates the number of elements that are less than or equal to a given value, while the latter estimates the number of elements that are greater than a given value. If the queried value does not align perfectly with the bucket bounds, the behavior depends on the selected bucket_interpolation strategy:

“bound” is a conservative strategy. It bounds the frequency of the queried value by the frequency of the entire bucket that contains it.
“approx-uni” is a more aggressive strategy. It assumes that the values within each bucket are uniformly distributed and uses a simple linear interpolation to estimate the frequency of the queried value.

Histograms can be iterated over to access the individual (bucket bound, frequency) pairs. In addition, we provide __getitem__ access to get the i-th bucket bound and its frequency.

Parameters:

bounds (Sequence[T]) – The upper bounds of the buckets. These must be sorted in ascending order. The lower bound of the first bucket is given by the additional lower parameter. The upper bound of the final bucket is implicitly given by the final element in this list.
frequencies (Sequence[int]) – The frequencies of the buckets. The i-th element in this list corresponds to the i-th bucket.
lower (T) – The lower bound of the first bucket. This is required to be less than or equal to the first element in bounds.
bucket_interpolation (Literal["approx-uni", "bound"], optional) – The strategy to estimate the frequency of values that fall into a bucket, but are not exactly on the bucket bounds. The default is “bound”.

property bounds: Sequence[T]#: Get the upper bounds of all buckets.

property bucket_interpolation: HistogramApproximation#: Get the strategy to estimate the frequency of values that are not exactly on the bucket bounds.

property freq_per_bucket: int#

Get the average number of elements per bucket.

For equi-depth histograms, this number will be pretty accurate. For all other kinds of histograms, this number is really just an average.

frequency_above(value: T) → int#

Estimate the frequency of values that are greater than a given value.

Due to the way we model histograms, we cannot directly estimate the frequency of values greater or equal to a given value, but only the frequency of values that are strictly greater than it.

If the value does not align exactly with the bucket bounds, the behavior depends on the selected bucket_interpolation strategy. See the class documentation for details.

Parameters:: value (T)
Return type:: int

frequency_below(value: T) → int#

Estimate the frequency of values that are less than or equal to a given value.

If the value does not align exactly with the bucket bounds, the behavior depends on the selected bucket_interpolation strategy. See the class documentation for details.

Parameters:: value (T)
Return type:: int

property lower: T#: Get the lower bound of the first bucket (and thus of the entire histogram).

property n_buckets: int#: Get the number of buckets in the histogram.

property n_rows: int#: Get the total number of rows that are represented by the histogram.

property upper: T#: Get the upper bound of the last bucket (and thus of the entire histogram).

class postbound.db.MostCommonValues(mcvs: Iterable[tuple[T, int]])#

Most Common Values contain an ordered list of (column value, frequency) pairs.

The list is sorted in descending order of the frequencies. Ties are broken according to the column values (once again in descending order).

By default, Most Common Values acts as a drop-in replacement for a list of tuples, so all sequence-like methods like __iter__, __getitem__, etc. are implemented in terms of lists.

In addition, Most Common Values provide a number of utility functions like min_freq or frequency_of to extract commonly-used information.

Parameters:: mcvs (Sequence[tuple[T, int]]) – The value, frequency pairs in the list.

below(k: int) → MostCommonValues#

Provide the most common values after the k-th most frequent.

Parameters:: k (int)
Return type:: MostCommonValues

property frequencies: Sequence[int]#: Get the frequencies in the MCV list, starting with the highest frequency.

frequency_of(value: T, *, default: int | None = None, bound_missing: bool = False) → int | None#

Load the frequency of a specific value.

Parameters:

value (T) – The value to check
default (int, optional) – The frequency to return if the given value is not in the MCV list. If not provided, the behavior depends on the bound_missing parameter.
bound_missing (bool, optional) – If enabled, the frequency of missing values (i.e. values that are not in the MCV list) is assumed to be at least the frequency of the least common value in the MCV list. Otherwise, None is returned to indicate a missing value. If a default value is provided, this parameter is ignored. By default, this is disabled.

Return type:

Optional[int]

property k: int#: Get the number of values in the MCV list.

property max_freq: int#

Get the highest frequency in the MCV list.

This is equivalent to the frequency of the first value in the list.

property min_freq: int#

Get the lowest frequency in the MCV list.

This is equivalent to the frequeny of the final value in the list.

top(k: int) → MostCommonValues#

Provide the most common values up-to (and including) the k-th most frequent.

Parameters:: k (int)
Return type:: MostCommonValues

property values: Sequence[T]#: Get the values in the MCV list, ordered by their frequency (highest first).

class postbound.db.OptimizerInterface#

Provides high-level access to internal optimizer-related data for the database system.

Each funtionality is available through a dedicated method. Notice that not all database systems necessarily support all of this functions.

abstractmethod analyze_plan(query: SqlQuery) → QueryPlan#

Executes a specific query and provides the query execution plan supplemented with runtime information.

This respects all hints that potentially influence the optimization process.

Parameters:: query (SqlQuery) – The input query
Returns:: The corresponding execution plan. This plan will be an ANALYZE plan and contain all information that can be derived for the specific database system (e.g. cardinality estimates as well as true cardinality counts)
Return type:: QueryPlan

abstractmethod cardinality_estimate(query: SqlQuery | str) → Cardinality#

Queries the DBMS query optimizer for its cardinality estimate, instead of executing the query.

The cardinality estimate will correspond to the estimate for the final node. Therefore, running this method with aggregate queries is not particularly meaningful.

Parameters:: query (SqlQuery | str) – The input query
Returns:: The cardinality estimate of the native optimizer for the database system.
Return type:: Cardinality

abstractmethod cost_estimate(query: SqlQuery | str) → float#

Queries the DBMS query optimizer for the estimated cost of executing the query.

The cost estimate will correspond to the estimate for the final node. Typically, this cost includes the cost of all sub-operators as well.

Parameters:: query (SqlQuery | str) – The input query
Returns:: The cost estimate of the native optimizer for the database system.
Return type:: Cost

explain(query: SqlQuery | str) → QueryPlan#

Alias for query_plan.

Parameters:: query (SqlQuery | str)
Return type:: QueryPlan

explain_analyze(query: SqlQuery) → QueryPlan#

Alias for analyze_plan.

Parameters:: query (SqlQuery)
Return type:: QueryPlan

abstractmethod parse_plan(plan: Any, *, query: SqlQuery | None = None) → QueryPlan#

Transforms the system-specific EXPLAIN output into a standardized QueryPlan.

The optional query can be used to provide additional context for the plan. This can be used by some database systems for correct parsing (e.g. DuckDB), while others might not use it at all. Omitting the query parameter is acceptable.

This method is mainly inteded to parse a previously generated plan, or when the plan could not simply be obtained via query_plan or analyze_plan.

Parameters:

plan (Any)
query (SqlQuery | None)

Return type:

QueryPlan

abstractmethod query_plan(query: SqlQuery | str) → QueryPlan#

Obtains the query execution plan for a specific query.

This respects all hints that potentially influence the optimization process.

Parameters:: query (SqlQuery | str) – The input query
Returns:: The corresponding execution plan. This will never be an ANALYZE plan, but contain as much meaningful information as can be derived for the specific database system (e.g. regarding cardinality and cost estimates)
Return type:: QueryPlan

class postbound.db.DatabasePool#

The database pool allows different parts of the code base to easily obtain access to a database.

This is achieved by maintaining one global pool of database connections which is shared by the entire system. New database instances can be registered and retrieved via unique keys. As long as there is just a single database instance, it can be accessed via the current_database method.

The database pool implementation follows the singleton pattern. Use the static get_instance method to retrieve the database pool instance. All other functionality is provided based on that pool instance.

References

any_database() → Database#

Provides any database that is currently stored in the pool, provided there is at least one.

Return type:: Database

clear() → None#

Removes all currently registered databases from the pool.

Return type:: None

current_database() → Database#

Provides the database that is currently stored in the pool, provided there is just one.

Returns:: The only database in the pool
Return type:: Database
Raises:: ValueError – If there are multiple database instances registered in the pool

empty() → bool#

Checks, whether the database pool is currently emtpy (i.e. no database are registered).

Returns:: True if the pool is empty.
Return type:: bool

static get_instance() → DatabasePool#

Provides access to the singleton database pool, creating a new pool instance if necessary.

Returns:: The current pool instance
Return type:: DatabasePool

n_databases() → int#

Returns the number of databases currently registered in the pool.

Return type:: int

register_database(key: str, db: Database) → None#

Stores a new database in the pool.

This method is typically called by the connect methods of the respective database system implementations.

Parameters:

key (str) – A unique identifier under which the database can be retrieved
db (Database) – The database to store

Return type:

None

retrieve_database(key: str) → Database#

Provides the database that is registered under a specific key.

Parameters:: key (str) – The key that was previously used to register the database
Returns:: The corresponding database
Return type:: Database
Raises:: KeyError – If no database was registered under the given key.

exception postbound.db.UnsupportedDatabaseFeatureError(database: Database, feature: str | ScanOperator | JoinOperator | IntermediateOperator)#

Indicates that some requested feature is not supported by the database.

For example, PostgreSQL (at least up to version 15) does not capture minimum or maximum column values in its system statistics. Therefore, forcing the DBS to retrieve such information from its metadata could result in this error.

Parameters:

database (Database) – The database that was requested to provide the problematic feature
feature (str) – A textual description for the requested feature

Return type:

None

exception postbound.db.DatabaseServerError(message: str = '', context: object | None = None)#

Indicates an error caused by the database server occured while executing a database operation.

The error was not due to a mistake in the user input (such as an SQL syntax error or access privilege violation), but an implementation issue instead (such as out of memory during query execution).

Parameters:

message (str, optional) – A textual description of the error, e.g. out of memory. Can be left empty by default.
context (Optional[object], optional) – Additional context information for when the error occurred, e.g. the query that caused the error. Mainly intended for debugging purposes.

Return type:

None

exception postbound.db.DatabaseUserError(message: str = '', context: object | None = None)#

Indicates that a database operation failed due to an error on the user’s end.

The error could be due to an SQL syntax error, access privilege violation, etc.

Parameters:

message (str, optional) – A textual description of the error, e.g. no such table. Can be left empty by default.
context (Optional[object], optional) – Additional context information for when the error occurred, e.g. the query that caused the error. Mainly intended for debugging purposes.

Return type:

None

postbound.db.current_database() → Database#

Provides the current database from the DatabasePool.

Returns:: The current database instance. If there is not exactly one database in the pool, a ValueError is raised.
Return type:: Database

See also

DatabasePool.current_database

postbound.db.simplify_result_set(result_set: Sequence[tuple] | None) → Any#

Default implementation of the result set simplification logic outlined in Database.execute_query.

Parameters:: result_set (ResultSet | None) – Result set to simplify: each entry in the list corresponds to one row in the result set and each component of the tuples corresponds to one column in the result set
Returns:: The simplified result set: if the result set consists just of a single row, this row is unwrapped from the list. If the result set contains just a single column, this is unwrapped from the tuple. Both simplifications are also combined, such that a result set of a single row of a single column is turned into the single value. As a special case, an empty result set is turned into None.
Return type:: Any

postbound.db.ResultRow#: alias of tuple

Database Abstraction

Contents

Database Abstraction#

Submodules#