DuckDB Backend

DuckDB Backend#

Module name: postbound.duckdb

class postbound.duckdb.DuckDBInterface(db: Path, *, system_name: str = 'DuckDB', cache_enabled: bool = False)#

Parameters:

db (Path)
system_name (str)
cache_enabled (bool)

schema() → DuckDBSchema#

Provides access to the underlying schema information of the database.

Returns:: An object implementing the schema interface for the actual database system. This should normally be completely stateless.
Return type:: DatabaseSchema

statistics() → DatabaseStatistics#

Provides access to the current statistics of the database.

Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.

Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.

Returns:: The statistics interface. Repeated calls to this method are guaranteed to provide the same object.
Return type:: DatabaseStatistics

hinting() → HintService#

Provides access to the hint generation facilities for the current database system.

Returns:: The hinting service. This should normally be completely stateless.
Return type:: HintService

optimizer() → OptimizerInterface#

Provides access to optimizer-related functionality of the database system.

Returns:: The optimizer interface. This should normally be completely stateless.
Return type:: OptimizerInterface
Raises:: UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.

execute_query(query: SqlQuery | str, *, cache_enabled: bool | None = None, raw: bool = False, timeout: float | None = None) → Any#

Executes the given query and returns the associated result set.

Parameters:

query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.
cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.
raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.
timeout (float | None)

Returns:

Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.

Return type:

Any

Notes

This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!

The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.

If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.

database_name() → str#

Provides the name of the (physical) database that the database interface is connected to.

Returns:: The database name, e.g. imdb or tpc-h
Return type:: str

database_system_version() → Version#

Returns the release version of the database management system that this interface is connected to.

Returns:: The version
Return type:: util.Version

cursor() → DuckDBPyConnection#

Provides a cursor to execute queries and iterate over result sets manually.

Returns:: A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.
Return type:: Cursor

References

close() → None#

Shuts down all currently open connections to the database.

Return type:: None

reset_connection() → None#

Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.

Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.

Returns:: Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.
Return type:: Any

See also

Hint

Parameters:

settings (list[str])
hints (list[str])

settings: list[str]#

Settings are global to the current database connection and influence the selection of operators for all queries.

Typical examples include SET enable_nestloop = 'off', which disables the usage of nested loop joins for all queries.

hints: list[str]#

Hints are supplied by the quack_lab extension and influence optimizer decisions on a per-query basis.

Typical examples include the selection of a specific join order as well as the assignment of join operators to individual joins.

static empty() → HintParts#

Creates a new hint parts object without any contents.

Returns:: A fresh plain hint parts object
Return type:: HintParts

add(hint: str) → None#

Adds a new hint.

This modifies the current object.

Parameters:: hint (str) – The hint to add
Return type:: None

merge_with(other: HintParts) → HintParts#

Combines the hints that are contained in this hint parts object with all hints in the other object.

This constructs new hint parts and leaves the current objects unmodified.

Parameters:: other (HintParts) – The additional hints to incorporate
Returns:: A new hint parts object that contains the hints from both source objects
Return type:: HintParts

class postbound.duckdb.DuckDBHintService(db: DuckDBInterface)#

Parameters:: db (DuckDBInterface)

generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) → SqlQuery#

Transforms the input query such that the given optimization decisions are respected during query execution.

In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.

Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.

Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.

Parameters:

query (SqlQuery) – The query that should be transformed
plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.
join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.
physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.
plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.

Returns:

The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.

Return type:

SqlQuery

format_query(query: SqlQuery) → str#

Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.

This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.

Parameters:: query (SqlQuery) – The query that should be adapted for the database system
Returns:: An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.
Return type:: str

DuckDB Backend

Contents

DuckDB Backend#