Core Functionality#
The core data structures and abstractions are available at the top-level of the package, i.e. simply via postbound
as in postbound.TableReference. We list the core abstractions here. In addition, several important components such as
the hinting infrastructure and optimization pipelines are described in separate documents:
Core Abstractions#
PostBOUND - A research framework for query optimization in relational database systems.
PostBOUND allows to rapidly prototype novel ideas in query optimization and to evaluate them in a transparent and reproducible manner. On a high level, the framework uses the following concepts:
optimization pipelines provide models for different optimizer architecures. Each pipeline provides different hooks (called optimization stages) where users can plug in their own optimization strategies.
database backends enable the evaluation of optimization pipelines on real-world database systems. Backends translate the optimization decisions into system-specific query hints that enforce the selected execution plan at runtime.
workloads and benchmarking utilities allow to evaluate optimization pipelines on popular benchmarks in a reproducible manner.
additional infrastructure modules handle the boilerplate part of query optimization, such as parsing and representing SQL queries, retrieving schema information, or obtaining database statistics.
Documentation#
Visit https://postbound.readthedocs.io/ for the full documentation of PostBOUND, including guides, tutorials and API references.
Package Structure#
Core data structures and utilities are globally available (e.g. OptimizationPipeline, Database, or SqlQuery)
SQL representation and parsing is handled by the qal and parser modules
Query modification is implemented in the transform module
The database interface is defined in the db module. postgres and duckdb provide concrete backends
Workloads and benchmarking utilities are available in workloads and bench
Simple optimization algorithms (e.g. dynamic-programming-based plan enumeration) are provided in the opt module
- class postbound.TableReference(full_name: str, alias: str = '', *, virtual: bool = False, schema: str = '')#
A table reference represents a database table.
It can either be a physical table, a CTE, or an entirely virtual query created via subqueries. Note that a table reference is indeed just a reference and not a 1:1 “representation” since each table can be sourced multiple times in a query. Therefore, in addition to the table name, each instance can optionally also contain an alias to distinguish between different references to the same table. In case of virtual tables, the full name will usually be empty and only the alias set. An exception are table references that refer to CTEs: their full name is set to the CTE name, the alias to the alias from the FROM clause (if present) and the table is still treated as virtual.
Table references can be sorted lexicographically. All instances should be treated as immutable objects.
- Parameters:
full_name (str) – The name of the table, corresponding to the name of a physical database table (or a view)
alias (str, optional) – Alternative name that is in queries to refer to the table, or to refer to the same table multiple times. Defaults to an empty string
virtual (bool, optional) – Whether the table is virtual or not. As a rule of thumb, virtual tables cannot be part of a FROM clause on their own, but need some sort of context. For example, the alias of a subquery is typically represented as a virtual table in PostBOUND. One cannot directly reference that alias in a FROM clause, without also specifying the subquery. Defaults to False since most tables will have direct physical counterparts.
schema (str, optional) – The schema in which the table is located. Defaults to an empty string if the table is in the default schema or the schema is unknown.
- Raises:
ValueError – If neither full name nor an alias are provided, or if a schema without a full name is provided.
- static create_virtual(alias: str, *, full_name: str = '', schema: str = '') TableReference#
Generates a new virtual table reference with the given alias.
- Parameters:
alias (str) – The alias of the virtual table. Cannot be None.
full_name (str, optional) – An optional full name for the entire table. This is mostly used to create references to CTE tables.
schema (str, optional) – The schema in which the table is located. Defaults to an empty string if the table is in the default schema or the schema is unknown.
- Returns:
The virtual table reference
- Return type:
- property full_name: str#
Get the full name of this table. If empty, alias is guaranteed to be set.
- Returns:
The name of the table
- Return type:
str
- property alias: str#
Get the alias of this table. If empty, the full name is guaranteed to be set.
The precise semantics of alias usage differ from database system to system. For example, in Postgres an alias shadows the original table name, i.e. once an alias is specified, it must be used to reference to the table and its columns.
- Returns:
The alias of the table
- Return type:
str
- property virtual: bool#
Checks whether this table is virtual. In this case, only the alias and not the full name is set.
- Returns:
Whether this reference describes a virtual table
- Return type:
bool
- property schema: str#
Get the schema in which this table is located.
- Returns:
The schema or an empty string if the schema is either unknown or the table is located in the default schema.
- Return type:
str
- identifier() str#
Provides a shorthand key that columns can use to refer to this table reference.
For example, a table reference for movie_companies AS mc would have mc as its identifier (i.e. the alias), whereas a table reference without an alias such as company_type would provide the full table name as its identifier, i.e. company_type.
- Returns:
The shorthand
- Return type:
str
- qualified_name() str#
Provides the fully qualified name (i.e. including the schema) of this table.
Notice that virtual tables do not have a qualified name, since they do not correspond to a physical table.
- Returns:
The qualified name, quoted as necessary.
- Return type:
str
- drop_alias() TableReference#
Removes the alias from the current table if there is one. Returns the tabel as-is otherwise.
- Returns:
This table, but without an alias. Since table references are immutable, the original reference is not modified
- Return type:
- Raises:
StateError – If this table is a virtual table, since virtual tables only have an alias and no full name.
- with_alias(alias: str) TableReference#
Creates a new table reference for the same table but with a different alias.
- Parameters:
alias (str) – The new alias
- Returns:
The updated table reference
- Return type:
- Raises:
StateError – If the current table does not have a full name.
- make_virtual() TableReference#
Creates a new virtual table reference for the same table.
- Returns:
The updated table reference
- Return type:
- class postbound.ColumnReference(name: str, table: TableReference | None = None)#
A column reference represents a specific column of a specific database table.
This reference always consists of the name of the “physical” column (see below for special cases). In addition, each column can be bound to the table to which it belongs by providing the associated table reference.
Column references can be sorted lexicographically and are designed as immutable data objects.
For situations where columns must be bound to a table, the BoundColumnReference subclass exists. It has the same interface like normal column references, but with one difference: its
tableproperty is guaranteed to always be a valid TableReference. Use the static assert_bound method to ensure that you are dealing with a bound reference. It returns a TypeGuard so type checkers should work with the narrowed type after a successful check.- Parameters:
name (str) – The name of the column. Cannot be empty.
table (Optional[TableReference], optional) – The table which provides the column. Can be None if the table is unknown.
- Raises:
ValueError – If the name is empty (or None)
- Return type:
Notes
A number of special cases arise when dealing with subqueries and common table expressions. The first one is the fact that columns can be bound to virtual tables, e.g. if they are exported by subqueries, etc. In the same vein, columns also do not always need to refer directly to physical columns. Consider the following example query:
WITH cte_table AS (SELECT foo.f_id, foo.a + foo.b AS 'sum' FROM foo) SELECT * FROM bar JOIN cte_table ON bar.b_id = cte_table.f_id WHERE cte_table.sum < 42
In this case, the CTE exports a column sum that is constructed based on two “actual” columns. Hence, the sum column itself does not have any physical representation but will be modelled as a column reference nevertheless.
Due to these situations, we use the full table reference when comparing columns. BUt this can have unintended side effects as well. For example, consider a physical column production_year of a title relation. In one query, the column is referenced as title.production_year, whereas in the other query, title is aliased and the column referenced as t.production_year. While both columns are bound to the title relation, only one of the table references contains an alias. Therefore, the table references do not compare equal and by extension the columns do not compare equal either. To mitigate this issue (at least to some extend), the drop_table_alias method can be used to obtain a “normalized” version of a column reference.
See also
BoundColumnReferenceA subclass of ColumnReference that is always bound to a valid table.
- static create(column: str, *, table: str) BoundColumnReference#
Shortcut method to create a column along with its table.
- Parameters:
column (str)
table (str)
- Return type:
BoundColumnReference
- static assert_bound(col: ColumnReference) TypeGuard[BoundColumnReference]#
Checks whether a specific column is bound to (any) table.
If it is, the column can be treated as an instance of BoundColumnReference and the type checker will be able to narrow the type accordingly. Most importantly, the table property of the column will be guaranteed to be a valid TableReference and not None.
Notes
Sadly, the current spec of the TypeGuard explicitly excludes self from the type check. Therefore, we cannot simply modify is_bound to return a TypeGuard and have to use an additional method. Ugh.
- Parameters:
col (ColumnReference)
- Return type:
TypeGuard[BoundColumnReference]
- property name: str#
Get the name of this column. This is guaranteed to be set and will never be empty
- Returns:
The name
- Return type:
str
- property table: TableReference | None#
Get the table to which this column belongs, if specified.
- Returns:
The table or None. The table can be an arbitrary reference, i.e. virtual or physical.
- Return type:
Optional[TableReference]
- is_bound() bool#
Checks, whether this column is bound to a table.
- Returns:
Whether a valid table reference is set
- Return type:
bool
See also
assert_boundA static method that performs the same check but also serves as a type guard to narrow the type to BoundColumnReference if the check is successful.
- belongs_to(table: TableReference) bool#
Checks, whether the column is part of the given table.
This check does not consult the schema of the actual database or the like, it merely matches the given table reference with the table attribute of this column.
- Parameters:
table (TableReference) – The table to check
- Returns:
Whether the table’s column is the same as the given one
- Return type:
bool
- bind_to(table: TableReference) BoundColumnReference#
Binds this column to a new table.
- Parameters:
table (TableReference) – The new table
- Returns:
The updated column reference, the original reference is not modified.
- Return type:
BoundColumnReference
- as_unbound() ColumnReference#
Removes the table binding from this column.
- Returns:
The updated column reference, the original reference is not modified.
- Return type:
- drop_table_alias() ColumnReference#
Removes the alias from the table this column is bound to, if there is one.
For example, if this column is t.production_year based on a table alias title as t, calling this method turns the column into title.production_year, effectively dropping the t alias.
This can be useful to obtain a “normalized” version of a column reference.
- Return type:
- class postbound.ScanOperator(*values)#
The scan operators supported by PostBOUND.
These can differ from the scan operators that are actually available in the selected target database system. The individual operators are chosen because they are supported by a wide variety of database systems and they are sufficiently different from each other.
- class postbound.JoinOperator(*values)#
The join operators supported by PostBOUND.
These can differ from the join operators that are actually available in the selected target database system. The individual operators are chosen because they are supported by a wide variety of database systems and they are sufficiently different from each other.
- class postbound.IntermediateOperator(*values)#
The intermediate operators supported by PostBOUND.
Intermediate operators are those that do not change the contents of their input relation, but only the way in which it is available. For example, a sort operator changes the order of the tuples.
- postbound.PhysicalOperator = postbound._core.ScanOperator | postbound._core.JoinOperator | postbound._core.IntermediateOperator#
Represent a PEP 604 union type
E.g. for int | str
- postbound.Cost = <class 'float'>#
Convert a string or number to a floating-point number, if possible.
- class postbound.Cardinality(value: int | float)#
Cardinlities represent the number of tuples/row in a relation.
Our cardinality model can be in one of three states:
A valid cardinality can be any non-negative integer. This is the default and most common state.
An unknown cardinality is represented by NaN.
A prohibitively large cardinality is represented by inf.
Basically, cardinality instances are just wrappers around their integer value that also catch the two special cases. Use cardinalities as you would use normal numbers. Notice that cardinalities are immutable, so all mathematical operators return a new cardinality instance.
To check for the state of a cardinality, you can either use is_valid() or the more specific isnan() and isinf(). Furthermore, a more expressive alias for isnan() exists in the form of is_unknown().
You can access the raw cardinality value vie the value property. However, this property requires that the cardinality is indeed in a valid state. If you want to handle invalid cardinalities yourself, get() returns a general float value (that can also be NaN or inf).
To construct valid cardinalities, it is probably easiest to just create a new instance and passing the desired value. The of() factory method can be used for better readability. Additionally, the unknown() and infinite() factory methods can be used to create cardinalities in the special states.
Lastly, cardinalities can be used in match statements. They match the following pattern: (is_valid, value). If the cardinality is invalid, the value is set to -1.
- Parameters:
value (int | float)
- static of(value: int | float | Cardinality) Cardinality#
Creates a new cardinality with a specific value. This is just a shorthand for Cardinality(value).
- Parameters:
value (int | float | Cardinality)
- Return type:
- static unknown() Cardinality#
Creates a new cardinality with an unknown value.
- Return type:
- static infinite() Cardinality#
Creates a new cardinality with an infinite value.
- Return type:
- property value: int#
Get the value wrapped by this cardinality instance. If the cardinality is invalid, a StateError is raised.
- isnan() bool#
Checks, whether cardinality value is NaN.
- Return type:
bool
- isinf() bool#
Checks, whether cardinality value is infinite.
- Return type:
bool
- is_unknown() bool#
Checks, whether cardinality value is unknown (i.e. NaN).
This is just a more expressive alias for isnan(). It is not a synonym for is_valid().
- Return type:
bool
- is_valid() bool#
Checks, whether this cardinality is valid, i.e. neither NaN nor infinite.
If the cardinality is valid, the value can be safely accessed via the value property.
- Return type:
bool
- get() float#
Provides the value of this cardinality.
In contrast to accessing the value property, this method always returns a float and does not raise an error for invalid cardinalities. Instead, it returns NaN for unknown cardinalities and inf for infinite cardinalities.
- Return type:
float
- class postbound.SqlQuery(*, select_clause: Select, from_clause: From | None = None, where_clause: Where | None = None, groupby_clause: GroupBy | None = None, having_clause: Having | None = None, orderby_clause: OrderBy | None = None, limit_clause: Limit | None = None, cte_clause: CommonTableExpression | None = None, hints: Hint | None = None, explain: Explain | None = None)#
Represents a plain SELECT query, providing direct access to the different clauses in the query.
At a basic level, PostBOUND differentiates between two types of queries:
implicit SQL queries specify all referenced tables in the FROM clause and the join predicates in the WHERE clause, e.g.
SELECT * FROM R, S WHERE R.a = S.b AND R.c = 42. This is the traditional way of writing SQL queries.explicit SQL queries use the JOIN ON syntax to reference tables, e.g.
SELECT * FROM R JOIN S ON R.a = S.b WHERE R.c = 42. This is the more “modern” way of writing SQL queries.
There is also a third possibility of mixing the implicit and explicit syntax. For each of these cases, designated subclasses exist. They all provide the same functionality and only differ in the (sub-)types of their FROM clauses. Therefore, these classes can be considered as “marker” types to indicate that at a certain point of a computation, a specific kind of query is required. The SqlQuery class acts as a superclass that specifies the general behaviour of all query instances and can act as the most general type of query.
To represent other types of SQL statements (e.g. DML statements), different classes have to be used. Notably, this also applies to set queries, i.e. queries containing UNION, INTERSECT, or EXCEPT clauses. These are represented by the SetQuery class. The reason for this distinction is a pragmatic one: most research in query optimization is currently concerned with single SELECT queries and the interface for a set query has to be quite different from that of an ordinary query. For example, there is no obvious way how to represent the predicates of a query with an EXCEPT clause. Therefore, optimizers that provide support for set queries have to explicitly state this in their interface. At the same time, pretty much all of PostBOUND’s code that uses queries operates on features that are common to both SqlQuery as well as SetQuery objects. Therefore, set queries can be passed even though the interface only specifies SqlQuery objects. This is just because set queries are a late addition to PostBOUND and simply do not have the time to re-visit all other method definitions to update the their signatures.
If you want to explicitly communicate that some method accepts both plain SQL queries as well as set queries, you can use the SelectStatement super type.
The clauses of each query can be accessed via properties. If a clause is optional, the absence of the clause is indicated through a None value. All additional behaviour of the queries is provided by the different methods. These are mostly focused on an easy introspection of the query’s structure.
Notice that PostBOUND does not enforce any semantics on the queries (e.g. regarding data types, access to values, the cardinality of subquery results, or the connection between different clauses). This has to be done by the user, or by the actual database system.
Limitations#
While the query abstraction is quite powerful, it is cannot represent the full range of SQL statements. Noteworthy limitations include:
no DDL or DML statements. The query abstraction is really only focused on queries, i.e. SELECT statements.
no recursive CTEs. While plain CTEs are supported, recursive CTEs are not. While this would be an easy addition, there simply was no need for it so far. If you need recursive CTEs, PRs are always welcome!
no support for GROUPING SETS, including CUBE() and ROLLUP(). Conceptually speaking, these would not be hard to add, but there simply was no need for them so far. If you need them, PRs are always welcome!
- param select_clause:
The SELECT part of the query. This is the only required part of a query. Notice however, that some database systems do not allow queries without a FROM clause.
- type select_clause:
Select
- param from_clause:
The FROM part of the query, by default None
- type from_clause:
Optional[From], optional
- param where_clause:
The WHERE part of the query, by default None
- type where_clause:
Optional[Where], optional
- param groupby_clause:
The GROUP BY part of the query, by default None
- type groupby_clause:
Optional[GroupBy], optional
- param having_clause:
The HAVING part of the query, by default None.
- type having_clause:
Optional[Having], optional
- param orderby_clause:
The ORDER BY part of the query, by default None
- type orderby_clause:
Optional[OrderBy], optional
- param limit_clause:
The LIMIT and OFFSET part of the query. In standard SQL, this is designated using the FETCH FIRST syntax. Defaults to None.
- type limit_clause:
Optional[Limit], optional
- param cte_clause:
The WITH part of the query, by default None
- type cte_clause:
Optional[CommonTableExpression], optional
- param hints:
The hint block of the query. Hints are not part of standard SQL and follow a completely system-specific syntax. Even their placement in within the query varies from system to system and from extension to extension. Defaults to None.
- type hints:
Optional[Hint], optional
- param explain:
The EXPLAIN part of the query. Like hints, this is not part of standard SQL. However, most systems provide EXPLAIN functionality. The specific features and syntax are quite similar, but still system specific. Defaults to None.
- type explain:
Optional[Explain], optional
Warning
See the Limitations section for unsupported SQL features.
- property cte_clause: CommonTableExpression | None#
Get the WITH clause of the query.
- Returns:
The WITH clause if it was specified, or None otherwise.
- Return type:
Optional[CommonTableExpression]
- property select_clause: Select#
Get the SELECT clause of the query. Will always be set.
- Returns:
The SELECT clause
- Return type:
- property from_clause: From | None#
Get the FROM clause of the query.
- Returns:
The FROM clause if it was specified, or None otherwise.
- Return type:
Optional[From]
- property where_clause: Where | None#
Get the WHERE clause of the query.
- Returns:
The WHERE clause if it was specified, or None otherwise.
- Return type:
Optional[Where]
- property groupby_clause: GroupBy | None#
Get the GROUP BY clause of the query.
- Returns:
The GROUP BY clause if it was specified, or None otherwise.
- Return type:
Optional[GroupBy]
- property having_clause: Having | None#
Get the HAVING clause of the query.
- Returns:
The HAVING clause if it was specified, or None otherwise.
- Return type:
Optional[Having]
- property orderby_clause: OrderBy | None#
Get the ORDER BY clause of the query.
- Returns:
The ORDER BY clause if it was specified, or None otherwise.
- Return type:
Optional[OrderBy]
- property limit_clause: Limit | None#
Get the combined LIMIT and OFFSET clauses of the query.
According to the SQL standard, these clauses should use the FETCH FIRST syntax. However, many systems use OFFSET and LIMIT instead.
- Returns:
The FETCH FIRST clause if it was specified, or None otherwise.
- Return type:
Optional[Limit]
- property hints: Hint | None#
Get the hint block of the query.
The hints can specify preparatory statements that have to be executed before the actual query is run in addition to the hints themselves.
- Returns:
The hint block if it was specified, or None otherwise.
- Return type:
Optional[Hint]
- property explain: Explain | None#
Get the EXPLAIN block of the query.
- Returns:
The EXPLAIN settings if specified, or None otherwise.
- Return type:
Optional[Explain]
- abstractmethod is_implicit() bool#
Checks, whether this query has an implicit FROM clause.
The implicit FROM clause only consists of the source tables that should be scanned for the query. No subqueries or joins are contained in the clause. All join predicates must be part of the WHERE clause.
- Returns:
Whether the query is implicit
- Return type:
bool
See also
ImplicitSqlQuery
- abstractmethod is_explicit() bool#
Checks, whether this query has an explicit FROM clause.
The explicit FROM clause exclusively makes use of the JOIN ON syntax to denote both the tables that should be scanned, and the predicates that should be used to join the tables together. Therefore, the WHERE clause should only consist of filter predicates on the base tables. However, this is not enforced and the contents of the ON conditions as well as the WHERE clause can be arbitrary predicates.
- Returns:
Whether the query is explicit
- Return type:
bool
See also
ExplicitSqlQuery
- is_explain() bool#
Checks, whether this query is an EXPLAIN query rather than a normal SQL query.
An EXPLAIN query is not executed like a normal
SELECT ...query. Instead of actually calculating a result set, the database system only provides a query plan. This plan is the execution plan that would be used, had the query been entered as a normal SQL query.- Returns:
Whether this query should be explained, rather than executed.
- Return type:
bool
- tables() set[TableReference]#
Provides all tables that are referenced at any point in the query.
This includes tables from all clauses. Virtual tables will be included and tables that are only scanned within subqueries are included as well. Notice however, that some database systems might not support subqueries to be put at arbitrary positions in the query (e.g. GROUP BY clause).
- Returns:
All tables that are referenced in the query.
- Return type:
set[TableReference]
- output_columns() Sequence[ColumnReference]#
Provides the columns that form the result relation of this query.
Columns are ordered according to their appearance in the SELECT clause and will not have a bound table associated with them. This is because the query result is “anonymous” and does not have a relation name associated with it. The columns are named according to the following rules:
If the expression has an alias, this name is used
If the expression is a simple column reference, the column name is used
Otherwise, a generic name is used, e.g. “column_1”, “column_2”, etc.
- Returns:
The columns. Their order matches the order in which they appear in the SELECT clause of the query.
- Return type:
Sequence[ColumnReference]
- columns() set[ColumnReference]#
Provides all columns that are referenced at any point in the query.
This includes columns from all clauses and does not account for renamed columns from subqueries. For example, consider the query
SELECT R.a, my_sq.b FROM R JOIN (SELECT b FROM S) my_sq ON R.a < my_sq.b. columns would return the following set:{R.a, S.b, my_sq.b}, even thoughmy_sq.bcan be considered as just an alias forS.b.- Returns:
All columns that are referenced in the query.
- Return type:
set[ColumnReference]
- columns_of(table: TableReference) set[ColumnReference]#
Provides all columns of a specific table that are referenced at any point in the query.
- Parameters:
table (TableReference) – The table to which the columns should belong
- Returns:
All columns of the specified table that are referenced in the query.
- Return type:
set[ColumnReference]
See also
columnsGet all columns that are referenced in the query
- predicates() QueryPredicates#
Provides all predicates in this query.
All predicates really means all predicates: this includes predicates that appear in the FROM clause, the WHERE clause, as well as any predicates from CTEs.
- Returns:
A predicates wrapper around the conjunction of all individual predicates.
- Return type:
- filters() Collection[AbstractPredicate]#
Alias for predicates().filters().
See also
QueryPredicates.filters- Return type:
Collection[AbstractPredicate]
- joins() Collection[AbstractPredicate]#
Alias for predicates().joins().
See also
QueryPredicates.joins- Return type:
Collection[AbstractPredicate]
- join_graph(*, merge_aliases: bool = False) Graph#
Alias for predicates().join_graph().
See also
QueryPredicates.join_graph- Parameters:
merge_aliases (bool)
- Return type:
Graph
- filters_for(table: TableReference) AbstractPredicate | None#
Alias for predicates().filters_for(table).
See also
QueryPredicates.filters_for- Parameters:
table (TableReference)
- Return type:
AbstractPredicate | None
- joins_for(table: TableReference) Collection[AbstractPredicate]#
Alias for predicates().joins_for(table).
See also
QueryPredicates.joins_for- Parameters:
table (TableReference)
- Return type:
Collection[AbstractPredicate]
- joins_between(table1: TableReference | Iterable[TableReference], table2: TableReference | Iterable[TableReference]) AbstractPredicate | None#
Alias for predicates().joins_between(table1, table2).
See also
QueryPredicates.joins_between- Parameters:
table1 (TableReference | Iterable[TableReference])
table2 (TableReference | Iterable[TableReference])
- Return type:
AbstractPredicate | None
- joins_tables(tables: TableReference | Iterable[TableReference], *more_tables: TableReference) bool#
Alias for predicates().joins_tables(tables, *more_tables).
See also
QueryPredicates.joins_tables- Parameters:
tables (TableReference | Iterable[TableReference])
more_tables (TableReference)
- Return type:
bool
- subqueries() Collection[SqlQuery]#
Provides all subqueries that are referenced in this query.
Notice that CTEs are ignored by this method, since they can be accessed directly via the cte_clause property.
- Returns:
All subqueries that appear in any of the “inner” clauses of the query
- Return type:
Collection[SqlQuery]
- clauses(*, skip: ~typing.Type | ~collections.abc.Iterable[~typing.Type] | None = <class 'NoneType'>) Sequence[BaseClause]#
Provides all the clauses that are defined (i.e. not None) in this query.
- Parameters:
skip (Optional[Type | Iterable[Type]], optional) – The clause types that should be skipped in the output. This can be a single type or an iterable of types.
- Returns:
The clauses. The current order of the clauses is as follows: hints, explain, cte, select, from, where, group by, having, order by, limit. Notice however, that this order is not strictly standardized and may change in the future. All clauses that are not specified on the query will be skipped.
- Return type:
Sequence[BaseClause]
- bound_tables() set[TableReference]#
Provides all tables that can be assigned to a physical or virtual table reference in this query.
Bound tables are those tables, that are selected in the FROM clause of the query, or a subquery. Conversely, unbound tables are those that have to be “injected” by an outer query, as is the case for dependent subqueries.
For example, the query
SELECT * FROM R, S WHERE R.a = S.bhas two bound tables: R and S. On the other hand, the querySELECT * FROM R WHERE R.a = S.bhas only bound R, whereas S has to be bound in a surrounding query.- Returns:
All tables that are bound (i.e. listed in the FROM clause or a CTE) of the query.
- Return type:
set[TableReference]
- unbound_tables() set[TableReference]#
Provides all tables that are referenced in this query but not bound.
While tables() provides all tables that are referenced in this query in any way, bound_tables restricts these tables. This method provides the complementary set to bound_tables i.e.
tables = bound_tables ⊕ unbound_tables.- Returns:
The unbound tables that have to be supplied as part of an outer query
- Return type:
set[TableReference]
- is_ordered() bool#
Checks, whether this query produces its result tuples in order.
- Returns:
Whether a valid ORDER BY clause was specified on the query.
- Return type:
bool
- is_dependent() bool#
Checks, whether all columns that are referenced in this query are provided by the tables from this query.
In order for this check to work, all columns have to be bound to actual tables, i.e. the tables attribute of all column references have to be set to a valid object.
- Returns:
Whether all columns belong to tables that are bound by this query
- Return type:
bool
- is_scalar() bool#
Checks, whether the query is guaranteed to provide a single scalar value as a result.
Scalar results can only be calculated by queries with a single projection in the SELECT clause and if that projection is an aggregate function, e.g. SELECT min(R.a) FROM R. However, there are other queries which could also be scalar “by chance”, e.g. SELECT R.b FROM R WHERE R.a = 1 if R.a is the primary key of R. Notice that such cases are not recognized by this method.
- Returns:
Whether the query will always return a single scalar value
- Return type:
bool
- is_set_query() bool#
Checks, whether this query is a set query.
A set query is a query that combines the results of two or more queries into a single result set. This can be done by combining the tuples from both sets using a UNION clause (which removes duplicates), or a UNION ALL clause (which retains duplicates). Alternatively, only tuples that are present in both sets can be retained using an INTERSECT clause. Finally, all tuples from the first result set that are not part of the second result set can be computed using an EXCEPT clause.
Notice that only one of the set operators can be used at a time, but the input query of one set operation can itself use another set operation.
- Returns:
Whether this query is a set query
- Return type:
bool
- contains_cross_product() bool#
Checks, whether this query has at least one cross product.
- Returns:
Whether this query has cross products.
- Return type:
bool
- iterexpressions() Iterable[SqlExpression]#
Provides access to all expressions that are directly contained in this query.
Nested expressions can be accessed from these expressions in a recursive manner (see the SqlExpression interface for details).
- Returns:
The expressions
- Return type:
Iterable[SqlExpression]
- itercolumns() Iterable[ColumnReference]#
Provides access to all column in this query.
In contrast to the columns method, duplicates are returned multiple times, i.e. if a column is referenced n times in this query, it will also be returned n times by this method. Furthermore, the order in which columns are provided by the iterable matches the order in which they appear in this query.
- Returns:
The columns
- Return type:
Iterable[ColumnReference]
- stringify(*, trailing_delimiter: bool = True) str#
Provides a string representation of this query.
The only difference to calling str directly, is that the stringify method provides control over whether a trailing delimiter should be appended to the query.
- Parameters:
trailing_delimiter (bool, optional) – Whether a delimiter should be appended to the query. Defaults to True.
- Returns:
A string representation of this query
- Return type:
str
- ast() str#
Provides a human-readable representation of the abstract syntax tree for this query.
The AST is a textual representation of the query that shows the structure of the query in a tree-like manner.
- Returns:
The abstract syntax tree of this query
- Return type:
str
- accept_visitor(clause_visitor: ClauseVisitor[VisitorResult], *args, **kwargs) dict[BaseClause, VisitorResult]#
Applies a visitor over all clauses in the current query.
Notice that since the visitor is applied to all clauses, it returns the results for each of them.
- Parameters:
clause_visitor (ClauseVisitor) – The visitor algorithm to use.
- Return type:
dict[BaseClause, VisitorResult]
- Parameters:
select_clause (Select)
from_clause (Optional[From])
where_clause (Optional[Where])
groupby_clause (Optional[GroupBy])
having_clause (Optional[Having])
orderby_clause (Optional[OrderBy])
limit_clause (Optional[Limit])
cte_clause (Optional[CommonTableExpression])
hints (Optional[Hint])
explain (Optional[Explain])
- postbound.parse_query(query: str, *, include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) SqlQuery#
- postbound.parse_query(query: str, *, accept_set_query: Literal[True], include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) SqlQuery | SetQuery
- postbound.parse_query(query: str, *, accept_set_query: Literal[False], include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) SqlQuery
Parses a query string into a proper SqlQuery object.
During parsing, the appropriate type of SQL query (i.e. with implicit, explicit or mixed FROM clause) will be inferred automatically. Therefore, this method can potentially return a subclass of SqlQuery.
Once the query has been transformed, a text-based binding process is executed. During this process, the referenced tables are normalized such that column references using the table alias are linked to the correct tables that are specified in the FROM clause (see the module-level documentation for an example). The parsing process can optionally also involve a binding process based on the schema of a live database. This is important for all remaining columns where the text-based parsing was not possible, e.g. because the column was specified without a table alias.
- Parameters:
query (str) – The query to parse
accept_set_query (bool, optional) – Whether set queries are a valid result of the parsing process. If this is False (the default), an error will be raised if the input query is a set query. This implies that the result of the parsing process is always a SqlQuery instance. Otherwise, the result can also be a SetQuery instance.
include_hints (bool, optional) – Whether to include hints in the parsed query. If this is True (the default), any preceding comments in the query text will be parsed as a hint block. Otherwise, these comments are simply ignored.
bind_columns (bool | None, optional) – Whether to use live binding. This does not control the text-based binding, which is always performed. If this parameter is None (the default), the global auto_bind_columns variable will be queried. Depending on its value, live binding will be performed or not.
db_schema (Optional[DatabaseSchema], optional) – For live binding, this indicates the database to use. If this is None (the default), the database will be tried to extract from the DatabasePool
- Returns:
The parsed SQL query.
- Return type:
- class postbound.Database(system_name: str, *, cache_enabled: bool = True)#
A Database is PostBOUND’s logical abstraction of physical database management systems.
It provides high-level access to internal functionality provided by such systems. More specifically, each Database instance supports the following functionality:
executing arbitrary SQL queries
retrieving schema information, most importantly about primary keys and foreign keys
accessing statistical information, such as most common values or the number of rows in a table
query formatting and generation of query hints to enforce optimizer decisions (join orders, operators, etc.)
introspection of the query optimizer to retrieve query execution plans, cost estimates, etc.
Notice, that all this information is by design read-only and functionality to write queries is intentionally not implemented (although one could issue INSERT/UPDATE/DELETE queries via the query execution functionality).
This restriction to read-only information enables the caching of query results to provide them without running a query over and over again. This is achieved by storing the results of past queries in a special JSON file, which is read upon creation of the Database instance. If this behavior is not desired, it can simply be turned off globally via the cache_enabled property, or on a per-method-call basis by setting the corresponding parameter. If no such parameter is available, the specific method does not make use of the caching mechanic.
Each database management system will need to implement this basic interface to enable PostBOUND to access the necessary information.
- Parameters:
system_name (str) – The name of the database system for which the connection is established. This is only really important to distinguish different instances of the interface in a convenient manner.
cache_enabled (bool, optional) – Whether complex queries that are executed against the database system should be cached. This is especially useful to emulate certain statistics that are not maintained by the specific database system (see DatabaseStatistics for details). If this is False, the query cache will not be loaded as well. Defaults to True.
Notes
When the __init__ method is called, the connection to the specific database system has to be established already, i.e. calling any of the public methods should provide a valid result. This is particularly important, because this method takes care of the cache initialization. This initialization in turn relies on identifying the correct cache file, which in turn depends on the system name, system version and database name of the connection.
- abstractmethod schema() DatabaseSchema#
Provides access to the underlying schema information of the database.
- Returns:
An object implementing the schema interface for the actual database system. This should normally be completely stateless.
- Return type:
- abstractmethod statistics() DatabaseStatistics#
Provides access to the current statistics of the database.
Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.
Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.
- Returns:
The statistics interface. Repeated calls to this method are guaranteed to provide the same object.
- Return type:
- abstractmethod hinting() HintService#
Provides access to the hint generation facilities for the current database system.
- Returns:
The hinting service. This should normally be completely stateless.
- Return type:
- abstractmethod optimizer() OptimizerInterface#
Provides access to optimizer-related functionality of the database system.
- Returns:
The optimizer interface. This should normally be completely stateless.
- Return type:
- Raises:
UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.
- abstractmethod execute_query(query: SqlQuery | str, *, cache_enabled: bool | None = None, raw: bool = False) Any#
Executes the given query and returns the associated result set.
- Parameters:
query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.
cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.
raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.
- Returns:
Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.
- Return type:
Any
Notes
This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!
The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.
If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.
- explain(query: SqlQuery, *, plan: QueryPlan | None = None, join_order: JoinTree | None = None, physical_ops: PhysicalOperatorAssignment | None = None, plan_params: PlanParameterization | None = None) QueryPlan#
Shortcut method to obtain the query plan for a given query.
Calling this method is equivalent to calling
.optimizer().query_plan()on the query, but it also allows to directly pass hints for the plan generation. These are passed as-is to the hinting backend.- Parameters:
query (SqlQuery)
plan (QueryPlan | None)
join_order (JoinTree | None)
physical_ops (PhysicalOperatorAssignment | None)
plan_params (PlanParameterization | None)
- Return type:
- abstractmethod database_name() str#
Provides the name of the (physical) database that the database interface is connected to.
- Returns:
The database name, e.g. imdb or tpc-h
- Return type:
str
- database_system_name() str#
Provides the name of the database management system that this interface is connected to.
- Returns:
The database system name, e.g. PostgreSQL
- Return type:
str
- abstractmethod database_system_version() Version#
Returns the release version of the database management system that this interface is connected to.
- Returns:
The version
- Return type:
- abstractmethod describe() jsondict#
Provides a representation of the current database connection as well as its system settings.
This description is intended to transparently document which customizations have been applied, thereby giving an idea of how the default query execution might have been affected. It can be JSON-serialized and will be included by most of the output of the utilities in the runner module of the experiments package.
- Return type:
jsondict
- abstractmethod reset_connection() Any#
Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.
Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.
- Returns:
Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.
- Return type:
Any
See also
- reset_cache() None#
Removes all results from the query cache. Useful for debugging purposes.
- Return type:
None
- abstractmethod cursor() Cursor#
Provides a cursor to execute queries and iterate over result sets manually.
- Returns:
A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.
- Return type:
References
- abstractmethod close() None#
Shuts down all currently open connections to the database.
- Return type:
None
- provides(support: Type) bool#
Checks, whether the database interface supports a specific protocol.
- Parameters:
support (Type)
- Return type:
bool
- property cache_enabled: bool#
Controls, whether the results of executed queries should be cached to prevent future re-execution.
If caching should be enabled later on and no cached data exists, the cache will be inflated from disk.
- class postbound.Workload(queries: dict[LabelType, SqlQuery], name: str = '', root: Path | None = None)#
A workload collects a number of queries (read: benchmark) and provides utilities to operate on them conveniently.
In addition to the actual queries, each query is annotated by a label that can be used to retrieve the query more nicely. E.g. for queries in the Join Order Benchmark, access by their index is supported - such as
job["1a"]. Labels can be arbitrary types as long as they are hashable. Since the workload inherits from dict, the label can be used directly to fetch the associated query (and will raiseKeyErrorinstances for unknown labels).Each workload can be given a name, which is mainly intended for readability in
__str__methods and does not serve a functional purpose. However, it may be good practice to use a normalized name that can be used in different contexts such as in file names, etc.When using methods that allow iteration over the queries, they will typically be returned in order according to the natural order of the query labels. However, since workloads can be shuffled randomly, this order can also be destroyed.
A workload is implemented as an immutable data object. Therefore, it is not possible/not intended to change the contents of a workload object later on. All methods that mutate the contents instead provide new workload instances.
- Parameters:
queries (dict[LabelType, SqlQuery]) – The queries that form the actual workload
name (str, optional) – A name that can be used to identify or represent the workload, by default
"".root (Optional[Path], optional) – The root directory that contains the workload queries. This is mainly used to somehow identify the workload when no name is given or the workload contents do not match the expected queries. Defaults to
None.
Notes
Workloads support many of the Python builtin-methods thanks to inheriting from
UserDict. Namely, the len, iter and in methods work as expected on the labels. Furthermore, multiple workload objects can be added, subtracted and intersected using set semantics. Subtraction and intersection also work based on individual labels.- static read(root_dir: str | Path, *, query_file_pattern: str = '*.sql', name: str = '', label_prefix: str = '', file_encoding: str = 'utf-8', bind_columns: bool = True, include_hints: bool = True, on_error: Literal['raise', 'warn', 'ignore'] = 'raise', verbose: bool = False) Workload[str]#
Reads all SQL queries from a specific directory into a workload object.
Deprecated since version 0.20.2: Use the read_workload function instead, which provides more flexibility and features. We will unify the workload API in 0.21.0 to make sure all workload loading functions are methods of the workloads module.
This method assumes that the queries are stored in individual files, one query per file. The query labels will be constructed based on the file name of the source files. For example, a query contained in file
q-1-1.sqlwill receive labelq-1-1(note that the trailing file extension is dropped). If the label_prefix is given, it will be inserted before the file name-based label.- Parameters:
root_dir (str) – Directory containing the individual query files
query_file_pattern (str, optional) – File name pattern that is shared by all query files. Only files matching the pattern will be read and each matching file is assumed to be a valid workload query. This is resolved as a glob expression. Defaults to
"*.sql"name (str, optional) – An optional name that can be used to identify the workload. Empty by default.
label_prefix (str, optional) – A prefix to add before each query label. Empty by default. Notice that the prefix will be prepended as-is, i.e. no separator character is inserted. If a separator is desired, it has to be part of the prefix.
file_encoding (str, optional) – The encoding of the query files. All files must share the same encoding. Defaults to UTF-8 encoding.
bind_columns (bool, optional) – Whether the parser should try to infer tables for all column references which do not use the explicit alias.column syntax. This requires an active database connection. If no such connection exists, this setting is ignored.
include_hints (bool, optional) – Whether the parser should try to infer and extract hint blocks from the queries
on_error (Literal["raise", "warn", "ignore"], optional) – How to react to parser errors. By default, the exception is propagated to the parent process.
verbose (bool, optional) – Whether progress information should be printed.
- Returns:
A workload consisting of all query files contained in the root directory.
- Return type:
Workload[str]
See also
Path.glob,parser.parse_query
- property name: str#
Provides the name of the workload.
- Returns:
The name or an empty string if no name has been specified.
- Return type:
str
- queries() Sequence[SqlQuery]#
Provides all queries in the workload in natural order (according to their labels).
If the natural order was manually destroyed, e.g. by shuffling, the shuffled order is used.
- Returns:
The queries
- Return type:
Sequence[SqlQuery]
- labels() Sequence[LabelType]#
Provides all query labels of the workload in natural order.
If the natural order was manually destroyed, e.g. by shuffling, the shuffled order is used.
- Returns:
The labels
- Return type:
Sequence[LabelType]
- entries() Sequence[tuple[LabelType, SqlQuery]]#
Provides all (label, query) pairs in the workload, in natural order of the query labels.
If the natural order was manually destroyed, e.g. by shuffling, the shuffled order is used.
- head() tuple[LabelType, SqlQuery] | None#
Provides the first query in the workload.
The first query is determined according to the natural order of the query labels by default. If that order was manually destroyed, e.g. by shuffling, the shuffled order is used.
There is no policy to break ties in the order. An arbitrary query can be returned in this case.
- with_labels(labels: Iterable[LabelType]) Workload[LabelType]#
Provides a new workload that contains only the queries with the specified labels.
- first(n: int) Workload[LabelType]#
Provides the first n queries of the workload, according to the natural order of the query labels.
If there are less than n queries in the workload, all queries will be returned. Similar to other methods that rely on some sort of ordering of the queries, if the natural order has been manually broken due to shuffling, the shuffled order is used instead.
- last(n: int) Workload[LabelType]#
Provides the last n queries of the workload, according to the natural order of the query labels.
If there are less than n queries in the workload, all queries will be returned. Similar to other methods that rely on some sort of ordering of the queries, if the natural order has been manually broken due to shuffling, the shuffled order is used instead.
- pick_random(n: int) Workload[LabelType]#
Constructs a new workload, consisting of randomly selected queries from this workload.
The new workload will once again be ordered according to the natural ordering of the labels.
- with_prefix(label_prefix: LabelType) Workload[LabelType]#
Filters the workload for all queries that have a lablel starting with a specific prefix.
This method requires that all label instances provide a startswith method (as is the case for simple string labels). Most significantly, this means that integer-based indexing does not work with for the prefix-based filter. The See Also section provides some means to mitigate this problem.
- Parameters:
label_prefix (LabelType) – The prefix to filter for
- Returns:
All queries of this workload that have a label with a matching prefix. Queries will be sorted according to the natural order of their labels again.
- Return type:
- Raises:
ValueError – If the prefix type does not provide a startswith method.
- filter_by(predicate: Callable[[LabelType, SqlQuery], bool]) Workload[LabelType]#
Provides all queries from the workload that match a specific predicate.
- Parameters:
predicate (Callable[[LabelType, SqlQuery], bool]) – The filter condition. All queries that pass the check are included in the new workload. The filter predicate receives the label and the query for each query in the input
- Returns:
All queries that passed the filter condition check. Queries will be sorted according to the natural order of their labels again.
- Return type:
- relabel(label_provider: Callable[[LabelType, SqlQuery], NewLabelType]) Workload[NewLabelType]#
Constructs a new workload, leaving the queries intact but replacing the labels.
The new workload will ordered according to the natural order of the new labels.
- Parameters:
label_provider (Callable[[LabelType, SqlQuery], NewLabelType]) – Replacement method that maps all old labels to the new label values. This method has to provide unique labels. If that is not the case, conflicts will be resolved but in an arbitrary way. The replacement receives the old label as well as the query as input and produces the new label value.
- Returns:
All queries of the current workload, but with new labels
- Return type:
- map(transformation: Callable, *args, **kwargs) Workload[LabelType]#
Constructs a new workload, leaving the labels intact but replacing the queries.
The new workload will ordered according to the natural order of the labels.
- Parameters:
transformation (Callable[[SqlQuery, ...], SqlQuery]) – Replacement method that maps all old queries to new query values. The replacement receives the old query as input and produces the new query value. Additional arguments can be passed to the transformation and will be forwarded.
*args – Additional positional arguments that should be forwarded to the transformation method. These are passed after the query argument.
**kwargs – Additional keyword arguments that should be forwarded to the transformation method.
- Returns:
All queries of the current workload, but with new query values
- Return type:
See also
transformif the transformation also needs access to the query labels, or if labels should be updated as well
- transform(transformation: Callable, *args, **kwargs) Workload[LabelType]#
Constructs a new workload, replacing both the labels and the queries.
The new workload will ordered according to the natural order of the new labels.
- Parameters:
transformation (Callable[[LabelType, SqlQuery, ...], tuple[LabelType, SqlQuery]]) – Replacement method that maps all old (label, query) pairs to new (label, query) pairs. If only queries should be updated according to their labels, the transformation is free to return the old label value. Additional arguments can be passed to the transformation and will be forwarded.
*args – Additional positional arguments that should be forwarded to the transformation method. These are passed after the label and query.
**kwargs – Additional keyword arguments that should be forwarded to the transformation method.
- Returns:
All queries of the current workload, but with new labels and query values
- Return type:
See also
mapif the transformation is purely query-based and does not need access to the labels