Cardinality Estimation Benchmark#
The Cardinality Estimation Benchmark (CEB) [1] provides a powerful framework to generate large workloads of SQL queries based on simple templates. The query generator splits the templates into sections for individual (groups of) predicates and allows these predicates to sample correlated values together, introduce dependencies to other predicate values, etc. However, the templates and their semantics are sadly not well-specified and its often unclear how specific properties interact and influence each other. Therefore, we ship our own generator implementation inspired by the CEB as part of the PostBOUND framework. In addition to the actual implementation, we also provide a detailed specification of the templates and their properties in the following section.
Tip
To use the CEB generator, you can either use the tools/ceb-generator.py script for a high-level interface, or
you can manually call generate_workload() from the
ceb module.
Generator Templates#
Each template is specified in a single TOML-file. A template has to contain the following sections:
A leading
titlekey of type String. It specifies the name of the current template and is mainly used for logging/debugging purposes.A
base_sqltable. Within the table, thesqlkey contains the final SQL query to generate (including the placeholder values). Thetable_aliasesis a required inline-table that maps each alias to a fully-qualified name. If no aliases are given, the values can simply be empty strings. For example, forSELECT * FROM title t,table_aliasesshould be{ t = "title" }, whereas forSELECT * FROM title, it should be{title = ""}. This slight inconsistency is necessary to accomodate for tables that are referenced with multiple aliases as well as tables that are referenced without an alias within the same query (which still is valid SQL..).
The base SQL query can contain placeholders in the form of <<placeholder>>. These placeholders are replaced by the actual
predicate values (see below).
Within the base_sql table, an array of tables called predicates contains the actual instructions to generate appropriate
values for the placeholders. A predicate is structured as follows:
Key |
Key type |
Allowed values |
Dependencies |
Description |
|---|---|---|---|---|
|
String, required |
arbitrary |
none |
An alias for the current predicate. The name can be referenced as a dependency for other predicates. |
|
Array of strings, optional |
|
none |
Specifies that values computed in the referenced predicates have to injected into the query that computes the current predicate values. Dependencies are assumed to form the same predicates as in the base SQL. |
|
Array of strings, required |
Placeholders from the base_sql query. |
none |
The placeholders of the base SQL whose values are computed by this predicate. Leading |
|
Array of strings, required |
(Qualified) column names |
none |
Output columns in the SQL query of this predicate that compute the given |
|
Array of strings, required |
|
none |
The predicate operands to be inserted into the base SQL. This key ensures that the selected values are escaped properly. The i-th operand corresponds to the i-th column in |
|
String, required |
|
none |
|
|
String, required |
|
none |
|
|
Array, optional |
arbitrary |
Required for |
The values to pick from. As a special case, Nested arrays can be used if multiple placeholders should be filled at the same time. For example, |
|
String, optional |
SQL query. |
Required for |
The SQL query that computes the |
|
Integer, optional |
arbitrary |
Ignored unless |
The index of the column in the SQL query’s SELECT clause or the element index in the list options that contains the weights for weighted sampling. Counting starts at 1. If this parameter is omitted, weights are inferred based on the number of occurrences of each value. If an index is supplied, the values are assumed to be unique. |
|
Integer, optional |
arbitrary |
Ignored unless |
The minimum number of values to insert into the IN predicate, defaults to 1. |
|
Integer, optional |
arbitrary, but |
Ignored unless |
The maximum number of values to insert into the IN predicate, defaults to the total number of values in the value list/SQL query. |
A limitation of CEB’s design is that IN predicates can only be sampled by themselves and might only be used in dependencies if the dependent predicates also use them as IN predicates. The core reason for this limitation is that it is unclear or not intuitive how an the values of an IN predicate should correpond to the values of other (scalar) placeholders computed by the query and what their correlation should be.
Examples#
All these examples are designed based on the IMDB schema/Join Order Benchmark [2].
title = "test-q1"
[base_sql]
sql = '''
SELECT min(t.title), min(t.production_year)
FROM title t
WHERE t.production_year >= <<T_PROD_YEAR>>
AND t.kind_id = <<T_KIND_ID>>
'''
table_aliases = { t = "title" }
[[predicates]]
name = "T_PROD_YEAR"
keys = ["T_PROD_YEAR"]
columns = ["t.production_year"]
pred_type = ">="
sampling_method = "uniform"
type = "list"
options = [1990, 1995, 2000, 2005, 2010]
[[predicates]]
name = "T_KIND_ID"
dependencies = ["T_PROD_YEAR"]
keys = ["T_KIND_ID"]
columns = ["t.kind_id"]
pred_type = "="
sampling_method = "weighted"
type = "sql"
sql = '''
SELECT kt.id
FROM kind_type kt
JOIN title t ON kt.id = t.kind_id
WHERE t.production_year >= <<T_PROD_YEAR>>
'''
title = "test-q2"
[base_sql]
sql = '''
SELECT min(t.title), min(t.production_year)
FROM title t
WHERE t.production_year >= <<T_PROD_YEAR>>
AND t.kind_id = <<T_KIND_ID>>
'''
table_aliases = { t = "title" }
[[predicates]]
name = "T_PROD_YEAR__KIND_ID"
keys = ["T_PROD_YEAR", "T_KIND_ID"]
columns = ["t.production_year", "t.kind_id"]
pred_type = [">=", "="]
sampling_method = "uniform"
type = "sql"
sql = '''SELECT DISTINCT t.production_year, t.kind_id FROM title t'''
title = "test-q3"
[base_sql]
sql = '''
SELECT min(t.title), min(t.production_year)
FROM title t
WHERE t.phonetic_code IN <<T_PHON_CODE>>
AND t.kind_id = <<T_KIND_ID>>
'''
table_aliases = { t = "title" }
[[predicates]]
name = "T_PHON_CODE"
keys = ["T_PHON_CODE"]
columns = ["t.phonetic_code"]
pred_type = ["IN"]
sampling_method = "weighted"
type = "sql"
sql = '''
SELECT phonetic_code
FROM title
WHERE phonetic_code IN (
SELECT phonetic_code
FROM title
GROUP BY phonetic_code
ORDER BY count(*) DESC
LIMIT 10
)'''
[[predicates]]
name = "T_KIND_ID"
keys = ["T_KIND_ID"]
columns = ["t.kind_id"]
pred_type = ["="]
sampling_method = "weighted"
weights_column = 2
type = "sql"
sql = '''SELECT kind_id, count(*) as "count" FROM title GROUP BY kind_id'''