DuckDB Backend#

Module name: postbound.duckdb

class postbound.duckdb.DuckDBInterface(db: Path, *, system_name: str = 'DuckDB', cache_enabled: bool = False)#
Parameters:
  • db (Path)

  • system_name (str)

  • cache_enabled (bool)

schema() DuckDBSchema#

Provides access to the underlying schema information of the database.

Returns:

An object implementing the schema interface for the actual database system. This should normally be completely stateless.

Return type:

DatabaseSchema

statistics() DatabaseStatistics#

Provides access to the current statistics of the database.

Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.

Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.

Returns:

The statistics interface. Repeated calls to this method are guaranteed to provide the same object.

Return type:

DatabaseStatistics

hinting() HintService#

Provides access to the hint generation facilities for the current database system.

Returns:

The hinting service. This should normally be completely stateless.

Return type:

HintService

optimizer() OptimizerInterface#

Provides access to optimizer-related functionality of the database system.

Returns:

The optimizer interface. This should normally be completely stateless.

Return type:

OptimizerInterface

Raises:

UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.

execute_query(query: SqlQuery | str, *, cache_enabled: bool | None = None, raw: bool = False, timeout: float | None = None) Any#

Executes the given query and returns the associated result set.

Parameters:
  • query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.

  • cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.

  • raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.

  • timeout (float | None)

Returns:

Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.

Return type:

Any

Notes

This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!

The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.

If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.

database_name() str#

Provides the name of the (physical) database that the database interface is connected to.

Returns:

The database name, e.g. imdb or tpc-h

Return type:

str

database_system_version() Version#

Returns the release version of the database management system that this interface is connected to.

Returns:

The version

Return type:

util.Version

cursor() DuckDBPyConnection#

Provides a cursor to execute queries and iterate over result sets manually.

Returns:

A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.

Return type:

Cursor

References

close() None#

Shuts down all currently open connections to the database.

Return type:

None

reset_connection() None#

Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.

Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.

Returns:

Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.

Return type:

Any

See also

Database.cursor

describe() jsondict#

Provides a representation of the current database connection as well as its system settings.

This description is intended to transparently document which customizations have been applied, thereby giving an idea of how the default query execution might have been affected. It can be JSON-serialized and will be included by most of the output of the utilities in the runner module of the experiments package.

Return type:

jsondict

class postbound.duckdb.DuckDBSchema(db: DuckDBInterface)#
Parameters:

db (DuckDBInterface)

tables(*, include_system_tables: bool = False) set[TableReference]#

Fetches all user-defined tables that are contained in the current database.

Parameters:

include_system_tables (bool, optional) – Whether system tables should also be included. By default, only user-defined tables are returned.

Returns:

All tables in the current schema, including materialized views, etc.

Return type:

set[TableReference]

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.tables view.

has_secondary_index(column: ColumnReference) bool#

Checks, whether a secondary index is available for a specific column.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether a secondary index of any kind was created for the column. Compound indexes and primary key indexes fail this test.

Return type:

bool

Raises:
  • postbound.qal.UnboundColumnError – If the column is not associated with any table

  • postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: The default implementation of this method assumes that each foreign key column and each column with a UNIQUE constraint has an associated index. If this should not be the case, a custom implementation needs to be supplied. Furthermore, the implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

indexes_on(column: ColumnReference) set[str]#

Retrieves the names of all indexes of a specific column.

Parameters:

column (ColumnReference) – The column to check.

Returns:

The indexes. If no indexes are available, the set will be empty.

Return type:

set[str]

Raises:
  • postbound.qal.UnboundColumnError – If the column is not associated with any table

  • postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: the default implementation of this method assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. It provides the names of the corresponding constraints. The implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

foreign_keys_on(column: ColumnReference) set[BoundColumnReference]#

Fetches all foreign key constraints that are specified on a specific column.

The provided columns are the target columns that are referenced by the foreign key constraint. E.g., suppose there are tables A and B with columns x and y. We specify a foreign key constraint on column y to ensure that all values in y reference a value in x. Then, calling this method on column y will return column x. If there are multiple foreign key constraints on the same column, all of them will be returned.

Parameters:

column (ColumnReference) – The column to check. All foreign keys that are “pointing from” this column to another column are returned.

Returns:

The columns that are “pointed to” by foreign key constraints on the given column. If no such foreign keys exist, an empty set is returned.

Return type:

set[BoundColumnReference]

Raises:
  • postbound.qal.UnboundColumnError – If the column is not associated with any table

  • postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

class postbound.duckdb.DuckDBStatistics(db: DuckDBInterface, *, emulated: bool = False, enable_emulation_fallback: bool = True, cache_enabled: bool | None = True)#
Parameters:
  • db (DuckDBInterface)

  • emulated (bool)

  • enable_emulation_fallback (bool)

  • cache_enabled (Optional[bool])

postbound.duckdb.parse_duckdb_plan(raw_plan: dict | str, *, query: SqlQuery | None = None) QueryPlan#

Parses a DuckDB query plan from its JSON representation.

The query can be supplied to help with binding scan nodes to their corresponding tables (see Notes).

Notes

Once central limitation of DuckDB’s EXPLAIN plans is that they do not contain table aliases. Thus, if a table is referenced multiple times in a query, we cannot unambiguously bind scan nodes to their corresponding table. We employ some heuristics (e.g. trying to match filter predicates) to resolve this ambiguity, but there is no guarantee that this always works correctly.

Parameters:
  • raw_plan (dict | str)

  • query (SqlQuery | None)

Return type:

QueryPlan

class postbound.duckdb.DuckDBOptimizer(db: DuckDBInterface)#
Parameters:

db (DuckDBInterface)

query_plan(query: SqlQuery | str) QueryPlan#

Obtains the query execution plan for a specific query.

This respects all hints that potentially influence the optimization process.

Parameters:

query (SqlQuery | str) – The input query

Returns:

The corresponding execution plan. This will never be an ANALYZE plan, but contain as much meaningful information as can be derived for the specific database system (e.g. regarding cardinality and cost estimates)

Return type:

QueryPlan

analyze_plan(query: SqlQuery) QueryPlan#
analyze_plan(query: SqlQuery, *, timeout: float) QueryPlan | None

Executes a specific query and provides the query execution plan supplemented with runtime information.

This respects all hints that potentially influence the optimization process.

Parameters:

query (SqlQuery) – The input query

Returns:

The corresponding execution plan. This plan will be an ANALYZE plan and contain all information that can be derived for the specific database system (e.g. cardinality estimates as well as true cardinality counts)

Return type:

QueryPlan

parse_plan(plan: Any, *, query: SqlQuery | None = None) QueryPlan#

Transforms the system-specific EXPLAIN output into a standardized QueryPlan.

The optional query can be used to provide additional context for the plan. This can be used by some database systems for correct parsing (e.g. DuckDB), while others might not use it at all. Omitting the query parameter is acceptable.

This method is mainly inteded to parse a previously generated plan, or when the plan could not simply be obtained via query_plan or analyze_plan.

Parameters:
Return type:

QueryPlan

cardinality_estimate(query: SqlQuery | str) Cardinality#

Queries the DBMS query optimizer for its cardinality estimate, instead of executing the query.

The cardinality estimate will correspond to the estimate for the final node. Therefore, running this method with aggregate queries is not particularly meaningful.

Parameters:

query (SqlQuery | str) – The input query

Returns:

The cardinality estimate of the native optimizer for the database system.

Return type:

Cardinality

cost_estimate(query: SqlQuery | str) float#

Queries the DBMS query optimizer for the estimated cost of executing the query.

The cost estimate will correspond to the estimate for the final node. Typically, this cost includes the cost of all sub-operators as well.

Parameters:

query (SqlQuery | str) – The input query

Returns:

The cost estimate of the native optimizer for the database system.

Return type:

Cost

class postbound.duckdb.HintParts(settings: list[str], hints: list[str])#

Models the different kinds of optimizer hints that are supported by Postgres.

HintParts are designed to conveniently collect all kinds of hints in order to prepare the generation of a proper Hint clause.

See also

Hint

Parameters:
  • settings (list[str])

  • hints (list[str])

settings: list[str]#

Settings are global to the current database connection and influence the selection of operators for all queries.

Typical examples include SET enable_nestloop = 'off', which disables the usage of nested loop joins for all queries.

hints: list[str]#

Hints are supplied by the quack_lab extension and influence optimizer decisions on a per-query basis.

Typical examples include the selection of a specific join order as well as the assignment of join operators to individual joins.

static empty() HintParts#

Creates a new hint parts object without any contents.

Returns:

A fresh plain hint parts object

Return type:

HintParts

add(hint: str) None#

Adds a new hint.

This modifies the current object.

Parameters:

hint (str) – The hint to add

Return type:

None

merge_with(other: HintParts) HintParts#

Combines the hints that are contained in this hint parts object with all hints in the other object.

This constructs new hint parts and leaves the current objects unmodified.

Parameters:

other (HintParts) – The additional hints to incorporate

Returns:

A new hint parts object that contains the hints from both source objects

Return type:

HintParts

class postbound.duckdb.DuckDBHintService(db: DuckDBInterface)#
Parameters:

db (DuckDBInterface)

generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) SqlQuery#

Transforms the input query such that the given optimization decisions are respected during query execution.

In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.

Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.

Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.

Parameters:
  • query (SqlQuery) – The query that should be transformed

  • plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.

  • join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.

  • physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.

  • plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.

Returns:

The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.

Return type:

SqlQuery

format_query(query: SqlQuery) str#

Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.

This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.

Parameters:

query (SqlQuery) – The query that should be adapted for the database system

Returns:

An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.

Return type:

str

See also

postbound.qal

the query abstraction layer provided by PostBOUND

supports_hint(hint: ScanOperator | JoinOperator | IntermediateOperator | HintType) bool#

Checks, whether the database system is capable of using the specified hint or operator

Parameters:

hint (PhysicalOperator | HintType) – The hint/feature to check

Returns:

Indicates whether the feature is supported by the specific database system.

Return type:

bool

postbound.duckdb.connect(db: str | Path, *, name: str = 'duckdb', refresh: bool = False, private: bool = False) DuckDBInterface#

Connects to a DuckDB database file.

Parameters:
  • name (str, optional) – A name to identify the current connection if multiple connections to different DuckDB instances should be maintained. This is used to register the instance on the DatabasePool. Defaults to duckdb.

  • refresh (bool, optional) – If true, a new connection to the database will always be established, even if a connection to the same database is already pooled. The registration key will be suffixed to prevent collisions. By default, the current connection is re-used. If that is the case, no further information (e.g. config strings) is read and only the name is accessed.

  • private (bool, optional) – If true, skips registration of the new instance on the DatabasePool. Registration is performed by default.

  • db (str | Path)

Return type:

DuckDBInterface