Workloads Package#

Provides abstractions to represent entire query workloads and utilities to read some pre-defined instances.

The main abstraction provided by this class is the Workload. A number of utility functions to read collections of queries from different sources and input formats into workload objects exist as well. The pre-defined workloads include the Join Order Benchmark [1] (including JOB-light [2] and JOB-Complex [3]), Star Schema Benchmark [4], Stack Benchmark [5] and Stats Benchmark [6].

PostBOUND stores the workload queries in a dedicated directory, located relative to the user’s home directory at $HOME/.postbound/. If a workload is requested for the first time, it will be downloaded automatically. Therefore, the initial usage of a novel workload may take slightly longer than usual.

References

class postbound.workloads.LabelType#

The labels that are used to identify individual queries in a workload.

alias of TypeVar(‘LabelType’, bound=Hashable)

class postbound.workloads.NewLabelType#

In case of mutations of the workload labels, this denotes the new type of the labels after the mutation.

alias of TypeVar(‘NewLabelType’, bound=Hashable)

postbound.workloads.wrap_workload(queries: Iterable[SqlQuery], name: str = '', root: Path | str | None = None) Workload[int]#

Wraps a number of queries in a workload with numerical labels.

Parameters:
  • queries (Iterable[SqlQuery])

  • name (str)

  • root (Path | str | None)

Return type:

Workload[int]

postbound.workloads.read_workload(path: str | Path, name: str = '', *, query_file_pattern: str = '*.sql', recurse_subdirectories: bool = False, query_label_prefix: str = '', file_encoding: str = 'utf-8', bind_columns: bool = True, include_hints: bool = True, on_error: Literal['raise', 'warn', 'ignore'] = 'raise', verbose: bool = False) Workload[str]#

Loads a workload consisting of multiple different files, potentially scattered in multiple directories

The main advantage of this method over using Workload.read directly is the support for recursive directory layouts: it can traverse subdirectories relative to the specified root and collect all workload files in a recursive manner. If subdirectories are used, their names will be used as prefixes to the query label, which is still inferred from the query file name.

Parameters:
  • path (str) – The root directory containing the workload files. Each query is expected to be stored in its own file.

  • name (str, optional) – The name of the workload, by default “”

  • query_file_pattern (str, optional) – A glob pattern that all query files have to match. All files that match the pattern are assumed to be valid query files. Defaults to "*.sql"

  • recurse_subdirectories (bool, optional) – Whether query files in subdirectories should be read as well. Defaults to False, which emulates the behaviour of Workload.read

  • query_label_prefix (str, optional) – A global prefix that should be added to all labels, no matter their placement in subdirectories. Defaults to an empty string.

  • file_encoding (str, optional) – The encoding of the query files. All files must share a common encoding. Defaults to UTF-8

  • include_hints (bool, optional) – Whether the parser should try to infer and extract hint blocks from the queries

  • on_error (Literal["raise", "warn", "ignore"], optional) – How to react to parser errors. By default, the exception is propagated to the parent process.

  • verbose (bool, optional) – Whether progress information should be printed.

  • bind_columns (bool)

Returns:

The workload

Return type:

Workload[str]

postbound.workloads.read_batch_workload(filename: str, name: str = '', *, file_encoding: str = 'utf-8') Workload[int]#

Loads a workload consisting of multiple queries from a single file.

The input file has to contain one valid SQL query per line. While empty lines are skipped, any non-SQL line will raise an Error.

The workload will have numeric labels: the query in the first line will have label 1, the second one label 2 and so on.

Parameters:
  • filename (str) – The file to load. The extension does not matter, as long as it contains plain text and each query is placed on a single and separate line.

  • name (str, optional) – The name of the workload. If omitted, this defaults to the file name.

  • file_encoding (str, optional) – The encoding of the workload file. Defaults to UTF-8

Returns:

The workload

Return type:

Workload[int]

postbound.workloads.read_csv_workload(filename: str | Path, name: str = '', *, query_column: str = 'query', label_column: str | None = None, file_encoding: str = 'utf-8', pd_args: dict | None = None) Workload[str] | Workload[int]#

Loads a workload consisting of queries from a CSV column.

All queries are expected to be contained in the same column and each query is expected to be put onto its own row.

The column containing the actual queries can be configured via the query_column parameter. Likewise, the CSV file can already provide query labels in the label_column column. If this parameter is omitted, labels will be inferred based on the row number.

Parameters:
  • filename (str | Path) – The name of the CSV file to read. The extension does not matter, as long as the file can be read by the pandas CSV parser. The parser can receive additional arguments via the pd_args parameter.

  • name (str, optional) – The name of the workload. If omitted, this defaults to the file name.

  • query_column (str, optional) – The CSV column that contains the workload queries. All rows of that column will be read, by default “query”

  • label_column (Optional[str], optional) – The column containing the query labels. Each will receive a label from the label_column of the same row. If omitted, labels will be inferred based on the row number.

  • file_encoding (str, optional) – The encoding of the CSV file. Defaults to UTF-8.

  • pd_args (Optional[dict]) – Additional arguments to customize the behaviour of the pandas.read_csv method. They will be forwarded as-is. Consult the documentation of this method for more details on the allowed parameters and their functionality.

Returns:

The workload. It has string labels if label_column was provided, or numerical labels otherwise.

Return type:

Workload[str] | Workload[int]

See also

pandas.read_csv

postbound.workloads.generate_workload(queries: Iterable[SqlQuery], *, name: str = '', labels: dict[SqlQuery, LabelType] | None = None, workload_root: str | Path | None = None) Workload[LabelType]#

Wraps a number of queries in a workload object.

The queries can receive optional labels, and will receive numerical labels according to their position in the queries iterable if no explicit labels are provided (counting from 1).

The workload will be named according to the optional name parameter. If this fails, the name will be inferred from the optional workload_root. If this fails as well, an empty name will be used.

Parameters:
  • queries (Iterable[SqlQuery]) – The queries that should form the workload. This is only enumerated a single time, hence the iterable can “spent” its items.

  • name (str, optional) – The name of the workload, by default “”

  • labels (Optional[dict[SqlQuery, LabelType]], optional) – The labels of the workload queries. Defaults to None, in which case numerical labels will be used. In the first case the label type is inferred from the dictionary values. In the second case, it will be int.

  • workload_root (Optional[Path], optional) – The directory or file that originally contained the workload queries. Defaults to None if this is not known or not appropriate (e.g. for workloads that are read from a remote source)

Returns:

The workload

Return type:

Workload[LabelType]

postbound.workloads.job(*, flavor: Literal['default', 'light', 'complex'] = 'default', file_encoding: str = 'utf-8') Workload[str]#

Reads the Join Order Benchmark, with labels according to the original paper (e.g. 1a, 21c, etc.).

Parameters:
  • flavor (Literal["default", "light", "complex"], optional) – The flavor of the JOB benchmark to load. The default flavor is the original JOB benchmark. Use “light” or “complex” to load the respective variants.

  • file_encoding (str, optional) – The encoding of the query files, by default UTF-8.

Returns:

The workload

Return type:

Workload[str]

References

postbound.workloads.job_light(*, file_encoding: str = 'utf-8') Workload[str]#

Reads the JOB-light benchmark, with numeric query labels (q-1, q-2, q-3, …).

Parameters:

file_encoding (str, optional) – The encoding of the query files, by default UTF-8.

Returns:

The workload

Return type:

Workload[str]

References

postbound.workloads.job_complex(*, file_encoding: str = 'utf-8') Workload[str]#

Reads the JOB-complex benchmark, with numeric query labels (q-1, q-2, q-3, …).

Parameters:

file_encoding (str, optional) – The encoding of the query files, by default “utf-8”.

Returns:

The workload

Return type:

Workload[str]

References

postbound.workloads.ssb(*, file_encoding: str = 'utf-8', bind_columns: bool | None = None) Workload[str]#

Reads the Star Schema Benchmark, with labels according to the original data (e.g. q1-1, q3-2, etc.).

Parameters:
  • file_encoding (str, optional) – The encoding of the query files, by default UTF-8.

  • bind_columns (Optional[bool], optional) – Whether all columns in the queries should be bound to their respective tables. Since SSB does not use qualified columns names, these must be inferred from the database schema. Thus, an active database connection is required to bind the columns. By default, binding is attempted if there is an active database connection.

Returns:

The workload

Return type:

Workload[str]

References

postbound.workloads.stack(*, file_encoding: str = 'utf-8', bind_columns: bool | None = None) Workload[str]#

Reads the Stack Benchmark, as shipped with the PostBOUND repository.

Most queries use semi-numeric labels consisting of the context and the query number, e.g., q1/q1-001. However, some queries have completely random names, such as q16/fc8f97968b9fce81df4011c8175eada15541abe0. Still, all queries are grouped into one of 16 contexts.

Parameters:
  • file_encoding (str, optional) – The encoding of the query files, by default UTF-8.

  • bind_columns (Optional[bool], optional) – Whether all columns in the queries should be bound to their respective tables. Since Stack does not use qualified columns names in some queries, these must be inferred from the database schema. Thus, an active database connection is required to bind the columns. By default, binding is attempted if there is an active database connection.

Returns:

The workload.

Return type:

Workload[str]

References

postbound.workloads.stats(*, file_encoding: str = 'utf-8') Workload[str]#

Reads the Stats Benchmark, with semi-numeric query labels (e.g. q-1, q-2, etc.).

Parameters:

file_encoding (str, optional) – The encoding of the query files, by default UTF-8.

Returns:

The workload

Return type:

Workload[str]

References

postbound.workloads.fetch_workload(name: str, *, file_encoding: str = 'utf-8') Workload[str]#

Utility method to fetch a pre-defined workload by name.

Parameters:
  • name (str)

  • file_encoding (str)

Return type:

Workload[str]