DuckDB Backend#
Module name: postbound.duckdb
- class postbound.duckdb.DuckDBInterface(db: Path, *, system_name: str = 'DuckDB', cache_enabled: bool = False)#
- Parameters:
db (Path)
system_name (str)
cache_enabled (bool)
- schema() DuckDBSchema#
Provides access to the underlying schema information of the database.
- Returns:
An object implementing the schema interface for the actual database system. This should normally be completely stateless.
- Return type:
- statistics() DatabaseStatistics#
Provides access to the current statistics of the database.
Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.
Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.
- Returns:
The statistics interface. Repeated calls to this method are guaranteed to provide the same object.
- Return type:
- hinting() HintService#
Provides access to the hint generation facilities for the current database system.
- Returns:
The hinting service. This should normally be completely stateless.
- Return type:
- optimizer() OptimizerInterface#
Provides access to optimizer-related functionality of the database system.
- Returns:
The optimizer interface. This should normally be completely stateless.
- Return type:
- Raises:
UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.
- execute_query(query: SqlQuery | str, *, cache_enabled: bool | None = None, raw: bool = False, timeout: float | None = None) Any#
Executes the given query and returns the associated result set.
- Parameters:
query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.
cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.
raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.
timeout (float | None)
- Returns:
Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.
- Return type:
Any
Notes
This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!
The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.
If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.
- database_name() str#
Provides the name of the (physical) database that the database interface is connected to.
- Returns:
The database name, e.g. imdb or tpc-h
- Return type:
str
- database_system_version() Version#
Returns the release version of the database management system that this interface is connected to.
- Returns:
The version
- Return type:
- cursor() DuckDBPyConnection#
Provides a cursor to execute queries and iterate over result sets manually.
- Returns:
A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.
- Return type:
References
- close() None#
Shuts down all currently open connections to the database.
- Return type:
None
- reset_connection() None#
Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.
Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.
- Returns:
Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.
- Return type:
Any
See also
Database.cursor
- describe() jsondict#
Provides a representation of the current database connection as well as its system settings.
This description is intended to transparently document which customizations have been applied, thereby giving an idea of how the default query execution might have been affected. It can be JSON-serialized and will be included by most of the output of the utilities in the runner module of the experiments package.
- Return type:
jsondict
- class postbound.duckdb.DuckDBSchema(db: DuckDBInterface)#
- Parameters:
db (DuckDBInterface)
- tables(*, include_system_tables: bool = False) set[TableReference]#
Fetches all user-defined tables that are contained in the current database.
- Parameters:
include_system_tables (bool, optional) – Whether system tables should also be included. By default, only user-defined tables are returned.
- Returns:
All tables in the current schema, including materialized views, etc.
- Return type:
set[TableReference]
Notes
Hint for implementors: the default implementation of this method relies on the information_schema.tables view.
- has_secondary_index(column: ColumnReference) bool#
Checks, whether a secondary index is available for a specific column.
- Parameters:
column (ColumnReference) – The column to check
- Returns:
Whether a secondary index of any kind was created for the column. Compound indexes and primary key indexes fail this test.
- Return type:
bool
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hints for implementors: The default implementation of this method assumes that each foreign key column and each column with a UNIQUE constraint has an associated index. If this should not be the case, a custom implementation needs to be supplied. Furthermore, the implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.
- indexes_on(column: ColumnReference) set[str]#
Retrieves the names of all indexes of a specific column.
- Parameters:
column (ColumnReference) – The column to check.
- Returns:
The indexes. If no indexes are available, the set will be empty.
- Return type:
set[str]
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hints for implementors: the default implementation of this method assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. It provides the names of the corresponding constraints. The implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.
- foreign_keys_on(column: ColumnReference) set[BoundColumnReference]#
Fetches all foreign key constraints that are specified on a specific column.
The provided columns are the target columns that are referenced by the foreign key constraint. E.g., suppose there are tables A and B with columns x and y. We specify a foreign key constraint on column y to ensure that all values in y reference a value in x. Then, calling this method on column y will return column x. If there are multiple foreign key constraints on the same column, all of them will be returned.
- Parameters:
column (ColumnReference) – The column to check. All foreign keys that are “pointing from” this column to another column are returned.
- Returns:
The columns that are “pointed to” by foreign key constraints on the given column. If no such foreign keys exist, an empty set is returned.
- Return type:
set[BoundColumnReference]
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
- class postbound.duckdb.DuckDBStatistics(db: DuckDBInterface, *, emulated: bool = False, enable_emulation_fallback: bool = True, cache_enabled: bool | None = True)#
- Parameters:
db (DuckDBInterface)
emulated (bool)
enable_emulation_fallback (bool)
cache_enabled (Optional[bool])
- postbound.duckdb.parse_duckdb_plan(raw_plan: dict | str, *, query: SqlQuery | None = None) QueryPlan#
Parses a DuckDB query plan from its JSON representation.
The query can be supplied to help with binding scan nodes to their corresponding tables (see Notes).
Notes
Once central limitation of DuckDB’s EXPLAIN plans is that they do not contain table aliases. Thus, if a table is referenced multiple times in a query, we cannot unambiguously bind scan nodes to their corresponding table. We employ some heuristics (e.g. trying to match filter predicates) to resolve this ambiguity, but there is no guarantee that this always works correctly.
- class postbound.duckdb.DuckDBOptimizer(db: DuckDBInterface)#
- Parameters:
db (DuckDBInterface)
- query_plan(query: SqlQuery | str) QueryPlan#
Obtains the query execution plan for a specific query.
This respects all hints that potentially influence the optimization process.
- analyze_plan(query: SqlQuery) QueryPlan#
- analyze_plan(query: SqlQuery, *, timeout: float) QueryPlan | None
Executes a specific query and provides the query execution plan supplemented with runtime information.
This respects all hints that potentially influence the optimization process.
- parse_plan(plan: Any, *, query: SqlQuery | None = None) QueryPlan#
Transforms the system-specific EXPLAIN output into a standardized QueryPlan.
The optional query can be used to provide additional context for the plan. This can be used by some database systems for correct parsing (e.g. DuckDB), while others might not use it at all. Omitting the query parameter is acceptable.
This method is mainly inteded to parse a previously generated plan, or when the plan could not simply be obtained via query_plan or analyze_plan.
- cardinality_estimate(query: SqlQuery | str) Cardinality#
Queries the DBMS query optimizer for its cardinality estimate, instead of executing the query.
The cardinality estimate will correspond to the estimate for the final node. Therefore, running this method with aggregate queries is not particularly meaningful.
- Parameters:
query (SqlQuery | str) – The input query
- Returns:
The cardinality estimate of the native optimizer for the database system.
- Return type:
- cost_estimate(query: SqlQuery | str) float#
Queries the DBMS query optimizer for the estimated cost of executing the query.
The cost estimate will correspond to the estimate for the final node. Typically, this cost includes the cost of all sub-operators as well.
- Parameters:
query (SqlQuery | str) – The input query
- Returns:
The cost estimate of the native optimizer for the database system.
- Return type:
Cost
- class postbound.duckdb.HintParts(settings: list[str], hints: list[str])#
Models the different kinds of optimizer hints that are supported by Postgres.
HintParts are designed to conveniently collect all kinds of hints in order to prepare the generation of a proper Hint clause.
See also
Hint- Parameters:
settings (list[str])
hints (list[str])
- settings: list[str]#
Settings are global to the current database connection and influence the selection of operators for all queries.
Typical examples include
SET enable_nestloop = 'off', which disables the usage of nested loop joins for all queries.
- hints: list[str]#
Hints are supplied by the quack_lab extension and influence optimizer decisions on a per-query basis.
Typical examples include the selection of a specific join order as well as the assignment of join operators to individual joins.
- static empty() HintParts#
Creates a new hint parts object without any contents.
- Returns:
A fresh plain hint parts object
- Return type:
- add(hint: str) None#
Adds a new hint.
This modifies the current object.
- Parameters:
hint (str) – The hint to add
- Return type:
None
- class postbound.duckdb.DuckDBHintService(db: DuckDBInterface)#
- Parameters:
db (DuckDBInterface)
- generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) SqlQuery#
Transforms the input query such that the given optimization decisions are respected during query execution.
In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.
Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.
Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.
- Parameters:
query (SqlQuery) – The query that should be transformed
plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.
join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.
physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.
plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.
- Returns:
The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.
- Return type:
- format_query(query: SqlQuery) str#
Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.
This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.
- Parameters:
query (SqlQuery) – The query that should be adapted for the database system
- Returns:
An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.
- Return type:
str
See also
postbound.qalthe query abstraction layer provided by PostBOUND
- supports_hint(hint: ScanOperator | JoinOperator | IntermediateOperator | HintType) bool#
Checks, whether the database system is capable of using the specified hint or operator
- Parameters:
hint (PhysicalOperator | HintType) – The hint/feature to check
- Returns:
Indicates whether the feature is supported by the specific database system.
- Return type:
bool
- postbound.duckdb.connect(db: str | Path, *, name: str = 'duckdb', refresh: bool = False, private: bool = False) DuckDBInterface#
Connects to a DuckDB database file.
- Parameters:
name (str, optional) – A name to identify the current connection if multiple connections to different DuckDB instances should be maintained. This is used to register the instance on the DatabasePool. Defaults to duckdb.
refresh (bool, optional) – If true, a new connection to the database will always be established, even if a connection to the same database is already pooled. The registration key will be suffixed to prevent collisions. By default, the current connection is re-used. If that is the case, no further information (e.g. config strings) is read and only the name is accessed.
private (bool, optional) – If true, skips registration of the new instance on the DatabasePool. Registration is performed by default.
db (str | Path)
- Return type: