DuckDB Interface#
The DuckDB backend provides support for query execution and plan hinting. It is implemented in the
duckdb module. Since DuckDB provides no support for optimizer hints that we rely on, we use a fork of
DuckDB called quacklab that adds support for the necessary hints.
Even with the necessary hints in place, some limitations remain due to internal design of DuckDB. These are outlined in
the limitation below.
Setup#
To setup DuckDB, use the duckdb-setup.sh script located in the db-support/duckdb directory of the PostBOUND
repository. This script will:
pull the most recent (or a selected) version of quacklab
compile DuckDB from source and create the Python wheel file
optionally install the wheel in the current Python environment
DuckDB uses uv <https://docs.astral.sh/uv/> for the build process. If this is not installed on your system, the setup script will install it automatically into a Python virtual environment. This environment can be controlled by a CLI parameter and will default to .venv in the DuckDB directory. It might be a good idea to point this to the venv that you use for your PostBOUND installation. In addition, you need a recent (C++ 17 compliant) C++ compiler on your system. Other dependencies such as CMake will be installed by the script as needed.
The entire setup process might take some time, so feel free to grab a cup of coffee after starting the script.
Tip
Once the setup is complete, you can import the database interface in your Python code directly by doing an
import quacklab. We use quacklab instead of duckdb to distinguish the modified version from the original
DuckDB. As a nice side effect, you can install vanilla DuckDB as well as quacklab in the same Python environment
without conflicts. Other than the name differences, the packages have the exact same API.
Once the setup is complete, you can create commonly-used database instances such as IMDB/JOB or Stats using the
workload-setup.py script. This script must be executed while the virtual environment containing the quacklab
installation is active.
The workload setup shell scripts serve the same purpose but require the DuckDB executable to be available on your PATH.
This is currently not the case when building quacklab due to limitations of the build system.
Usage#
The DuckDB backend is available from the duckdb module. You can create a connection to a DuckDB database
file using the connect() function. The returned database instance functions like any regular
Database interface:
import postbound as pb
duck_instance = pb.duckdb.connect("stats.duckdb")
stats = pb.workloads.stats()
duck_instance.execute_query(stats["q-1"])
You can also import the “raw” database interface in your Python code directly by doing an import quacklab. We use
quacklab instead of duckdb to distinguish the modified version from the original DuckDB. As a nice side effect, vanilla
DuckDB and quacklab can be installed side-by-side in the same Python environment without conflicts. Other than the name
differences, the packages have the exact same API.
Supported Backend Features#
Feature |
Status |
|---|---|
Query Execution |
fully supported |
Query Execution with timeouts |
fully supported |
Schema interface |
fully implemented |
Statistics interface |
fully implemented [1] |
EXPLAIN parsing and query plan extraction |
fully supported [2] |
EXPLAIN ANALYZE plans |
fully supported [2] |
Extraction of cardinality estimates |
fully supported |
Extraction of cost estimates |
not supported (see limitations) |
Plan hinting |
partially supported (see limitations) |
To obtain DuckDB query plans, you can either use the query_plan() method or parse
the EXPLAIN output manually using parse_duckdb_plan(). Both options yield the same results:
# obtain a query plan directly:
plan = duck_instance.optimizer().query_plan(stats["q-1"])
# this is equivalent to:
explain_query = pb.transform.as_explain(stats["q-1"])
raw_plan = duck_instance.execute_query(explain_query)
equivalent_plan = pb.duckdb.parse_duckdb_plan(raw_plan)
Optimizer Essentials#
DuckDB employs a sequential optimizer that performs complex algebraic transformations (e.g., subquery unnesting) on the logical query plan. The join order is determined using a dynamic programming algorithm based on [Moerkotte08]. It uses a simple “cost model” that sums the cardinality estimates of the input relations and the output cardinality of the intermediate. Based on this join order, physical operators are selected following a rule-based approach. Basically, hash joins are used whenever possible and nested loop joins are only used as a last resort.
Limitations#
The current implementation of the DuckDB execution engine imposes some strict limitations on the kind of hints we use reliably. In particular, the implementation of the physical operators is tightly coupled with their selection rules. For example, if the optimizer selects one join operator based on some property of the input query, the implementation of that operator will rely on this property being satisfied. Using a different operator typically leads to execution errors. As a consequence, operator hints can be specified, but should only be used with great care if at all. At the same time, this makes optimization approaches using a costs model much less appealing.
Based on these limitations, the following features can be used reliably with the DuckDB backend:
Join order hints and corresponding optimization strategies
Cardinality hints and corresponding estimation strategies
References#
Guido Moerkotte and Thomas Neumann: “Dynamic Programming Strikes Back” (SIGMOD 2008) DOI: 10.1145/1376616.1376672
footnotes