DuckDB Interface#

The DuckDB backend provides support for query execution and plan hinting. It is implemented in the duckdb module. Since DuckDB provides no support for optimizer hints that we rely on, we use a fork of DuckDB called quacklab that adds support for the necessary hints. Even with the necessary hints in place, some limitations remain due to internal design of DuckDB. These are outlined in the limitation below.

Setup#

To setup DuckDB, use the duckdb-setup.sh script located in the db-support/duckdb directory of the PostBOUND repository. This script will:

  1. pull the most recent (or a selected) version of quacklab

  2. compile DuckDB from source and create the Python wheel file

  3. optionally install the wheel in the current Python environment

DuckDB uses uv <https://docs.astral.sh/uv/> for the build process. If this is not installed on your system, the setup script will install it automatically into a Python virtual environment. This environment can be controlled by a CLI parameter and will default to .venv in the DuckDB directory. It might be a good idea to point this to the venv that you use for your PostBOUND installation. In addition, you need a recent (C++ 17 compliant) C++ compiler on your system. Other dependencies such as CMake will be installed by the script as needed.

The entire setup process might take some time, so feel free to grab a cup of coffee after starting the script.

Tip

Once the setup is complete, you can import the database interface in your Python code directly by doing an import quacklab. We use quacklab instead of duckdb to distinguish the modified version from the original DuckDB. As a nice side effect, you can install vanilla DuckDB as well as quacklab in the same Python environment without conflicts. Other than the name differences, the packages have the exact same API.

Once the setup is complete, you can create commonly-used database instances such as IMDB/JOB or Stats using the workload-setup.py script. This script must be executed while the virtual environment containing the quacklab installation is active. The workload setup shell scripts serve the same purpose but require the DuckDB executable to be available on your PATH. This is currently not the case when building quacklab due to limitations of the build system.

Usage#

The DuckDB backend is available from the duckdb module. You can create a connection to a DuckDB database file using the connect() function. The returned database instance functions like any regular Database interface:

import postbound as pb

duck_instance = pb.duckdb.connect("stats.duckdb")
stats = pb.workloads.stats()
duck_instance.execute_query(stats["q-1"])

You can also import the “raw” database interface in your Python code directly by doing an import quacklab. We use quacklab instead of duckdb to distinguish the modified version from the original DuckDB. As a nice side effect, vanilla DuckDB and quacklab can be installed side-by-side in the same Python environment without conflicts. Other than the name differences, the packages have the exact same API.

Supported Backend Features#

Feature

Status

Query Execution

fully supported

Query Execution with timeouts

fully supported

Schema interface

fully implemented

Statistics interface

fully implemented [1]

EXPLAIN parsing and query plan extraction

fully supported [2]

EXPLAIN ANALYZE plans

fully supported [2]

Extraction of cardinality estimates

fully supported

Extraction of cost estimates

not supported (see limitations)

Plan hinting

partially supported (see limitations)

To obtain DuckDB query plans, you can either use the query_plan() method or parse the EXPLAIN output manually using parse_duckdb_plan(). Both options yield the same results:

# obtain a query plan directly:
plan = duck_instance.optimizer().query_plan(stats["q-1"])

# this is equivalent to:
explain_query = pb.transform.as_explain(stats["q-1"])
raw_plan = duck_instance.execute_query(explain_query)
equivalent_plan = pb.duckdb.parse_duckdb_plan(raw_plan)

Optimizer Essentials#

DuckDB employs a sequential optimizer that performs complex algebraic transformations (e.g., subquery unnesting) on the logical query plan. The join order is determined using a dynamic programming algorithm based on [Moerkotte08]. It uses a simple “cost model” that sums the cardinality estimates of the input relations and the output cardinality of the intermediate. Based on this join order, physical operators are selected following a rule-based approach. Basically, hash joins are used whenever possible and nested loop joins are only used as a last resort.

Limitations#

The current implementation of the DuckDB execution engine imposes some strict limitations on the kind of hints we use reliably. In particular, the implementation of the physical operators is tightly coupled with their selection rules. For example, if the optimizer selects one join operator based on some property of the input query, the implementation of that operator will rely on this property being satisfied. Using a different operator typically leads to execution errors. As a consequence, operator hints can be specified, but should only be used with great care if at all. At the same time, this makes optimization approaches using a costs model much less appealing.

Based on these limitations, the following features can be used reliably with the DuckDB backend:

  • Join order hints and corresponding optimization strategies

  • Cardinality hints and corresponding estimation strategies

References#

[Moerkotte08]

Guido Moerkotte and Thomas Neumann: “Dynamic Programming Strikes Back” (SIGMOD 2008) DOI: 10.1145/1376616.1376672

footnotes