Postgres Backend#
Module name: postbound.postgres
Contains the Postgres implementation of the Database interface.
In many ways the Postgres implementation can be thought of as the reference or blueprint implementation of the database interface. This is due to two main reasons: first up, Postgres’ capabilities follow a traditional architecture and its features cover most of the general aspects of query optimization (i.e. supported operators, join orders and statistics). Secondly, and on a more pragmatic note Potsgres was the first database system that was supported by PostBOUND and therefore a lot of the original Postgres interfaces eventually evolved into the more abstract database-independent interfaces.
- class postbound.postgres.PostgresSetting(parameter: str, value: object)#
Model for a single Postgres configuration such as SET enable_nestloop = ‘off’;.
This setting can be used directly as a replacement where a string is expected, or its different components can be accessed via the parameter and value attribute.
- Parameters:
parameter (str) – The name of the setting
value (object) – The setting’s current or desired value
- property parameter: str#
Gets the name of the setting.
- Returns:
The name
- Return type:
str
- property value: object#
Gets the current or desired value of the setting.
- Returns:
The raw, i.e. un-escaped value of the setting.
- Return type:
object
- update(value: object) PostgresSetting#
Creates a new setting with the same name but a different value.
- Parameters:
value (object) – The new value
- Returns:
The new setting
- Return type:
- class postbound.postgres.PostgresConfiguration(settings: Iterable[PostgresSetting])#
Model for a collection of different postgres settings that form a complete server configuration.
Each configuration is build of indivdual PostgresSetting objects. The configuration can be used directly as a replacement when a string is expected, or its different settings can be accessed individually - either through the accessor methods, or by using a dict-like syntax: calling
config[setting]with a string setting value will provide the matching PostgresSetting. Since the configuration also subclasses string, the precise behavior of __getitem__ depends on the argument type: string arguments provide settings whereas integer arguments result in specific characters. All other string methods are implemented such that the normal string behavior is retained. All additional behavior is part of new methods.- Parameters:
settings (Iterable[PostgresSetting]) – The settings that form the configuration.
Warning
Notice that while the configuration is a UserString, pyscopg currently does not support executing the configuration, i.e. executing
cursor.execute(config)will not work. Instead, the configuration has to be manually converted into a string first by calling str as incursor.execute(str(config)). This also applies to the execute_query() method of the PostgresInterface class, since it uses psycopg under the hood.- static load(*args, **kwargs) PostgresConfiguration#
Generates a new configuration based on (setting name, value) pairs.
- Parameters:
args – Ready-to-use PostgresSetting objects
kwargs – Additional settings
- Returns:
The configuration
- Return type:
- property settings: Sequence[PostgresSetting]#
Gets the settings that are part of the configuration.
- Returns:
The settings in the order in which they were originally specified.
- Return type:
Sequence[PostgresSetting]
- parameters() Sequence[str]#
Provides all setting names that are specified in this configuration.
- Returns:
The setting names in the order in which they were orignally specified.
- Return type:
Sequence[str]
- add(setting: PostgresSetting | str | None = None, value: object = None, **kwargs) PostgresConfiguration#
Creates a new configuration with additional settings.
The setting can be supplied either as a PostgresSetting object or as a key-value pair. The latter case allows both positional arguments, as well as as keyword arguments.
- Parameters:
setting (PostgresSetting | str) – The setting to add. This can either be a readily created PostgresSetting object or a string that will be used as the setting name. In the latter case, the value has to be supplied as well.
value (object) – The value of the setting. This is only used if setting is a string.
kwargs – If the setting is not specified as a string, nor as a PostgresSetting object, it has to be specified as keyword arguments. The keyword argument names are used as the setting names, the values are used as the setting values.
- Returns:
The updated configuration. The original config is not modified.
- Return type:
- remove(setting: PostgresSetting | str) PostgresConfiguration#
Creates a new configuration without a specific setting.
- Parameters:
setting (PostgresSetting) – The setting to remove
- Returns:
The updated configuration. The original config is not modified.
- Return type:
- update(setting: PostgresSetting | str, value: object) PostgresConfiguration#
Creates a new configuration with an updated setting.
- Parameters:
setting (PostgresSetting | str) – The setting to update. This can either be the raw setting name, or a PostgresSetting object. In either case, the updated value has to be supplied via the value parameter. (When supplying a PostgresSetting, only its name is used.)
value (object) – The updated value of the setting.
- Returns:
The updated configuration. The original config is not modified.
- Return type:
- as_dict() dict[str, object]#
Provides all settings as setting name -> setting value mappings.
- Returns:
The settings. Changes to this dictionary will not be reflected in the configuration object.
- Return type:
dict[str, object]
- class postbound.postgres.PostgresConfigInterface(pg_instance: PostgresInterface)#
A thin wrapper that provides read-only access to Postgres configuration settings using __getitem__ syntax.
- Parameters:
pg_instance (PostgresInterface)
- class postbound.postgres.PostgresInterface(connect_string: str, system_name: str = 'Postgres', *, application_name: str = 'PostBOUND', client_encoding: str = 'UTF8', cache_enabled: bool = False, debug: bool = False)#
Database implementation for PostgreSQL backends.
The config attribute provides read-only access to the current GUC values of the server.
- Parameters:
connect_string (str) – Connection string for psycopg to establish a connection to the Postgres server
system_name (str, optional) – Description of the specific Postgres server, by default Postgres
application_name (str, optional) – Identifier for the Postgres server. This will be the name that is shown in the server logs and process lists.
client_encoding (str, optional) – The client encoding to use for the connection, by default UTF8
cache_enabled (bool, optional) – Whether to enable caching of database queries, by default False
debug (bool, optional) – Whether additional debug information should be printed during database interaction. Defaults to False.
- schema() PostgresSchemaInterface#
Provides access to the underlying schema information of the database.
- Returns:
An object implementing the schema interface for the actual database system. This should normally be completely stateless.
- Return type:
- statistics() PostgresStatisticsInterface#
Provides access to the current statistics of the database.
Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.
Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.
- Returns:
The statistics interface. Repeated calls to this method are guaranteed to provide the same object.
- Return type:
- hinting() PostgresHintService#
Provides access to the hint generation facilities for the current database system.
- Returns:
The hinting service. This should normally be completely stateless.
- Return type:
- execute_query(query: SqlQuery | str, *, plan: QueryPlan | None = None, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None, cache_enabled: bool | None = None, raw: bool = False, timeout: float | None = None) Any#
Executes the given query and returns the associated result set.
- Parameters:
query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.
cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.
raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.
plan (QueryPlan | None)
join_order (JoinTree | None)
physical_operators (PhysicalOperatorAssignment | None)
plan_parameters (PlanParameterization | None)
timeout (float | None)
- Returns:
Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.
- Return type:
Any
Notes
This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!
The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.
If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.
- optimizer() PostgresOptimizer#
Provides access to optimizer-related functionality of the database system.
- Returns:
The optimizer interface. This should normally be completely stateless.
- Return type:
- Raises:
UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.
- database_name() str#
Provides the name of the (physical) database that the database interface is connected to.
- Returns:
The database name, e.g. imdb or tpc-h
- Return type:
str
- database_system_version() Version#
Returns the release version of the database management system that this interface is connected to.
- Returns:
The version
- Return type:
- backend_pid() int#
Provides the backend process ID of the current connection.
- Returns:
The process ID
- Return type:
int
- data_dir() Path#
Get the data directory of the Postgres server.
- Returns:
The data directory path
- Return type:
Path
- logfile() Path | None#
Get the log file of the (local) Postgres server.
- Return type:
Path | None
- describe() jsondict#
Provides a representation of the current database connection as well as its system settings.
This description is intended to transparently document which customizations have been applied, thereby giving an idea of how the default query execution might have been affected. It can be JSON-serialized and will be included by most of the output of the utilities in the runner module of the experiments package.
- Return type:
jsondict
- reset_connection() int#
Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.
Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.
- Returns:
Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.
- Return type:
Any
See also
Database.cursor
- cursor() Cursor#
Provides a cursor to execute queries and iterate over result sets manually.
- Returns:
A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.
- Return type:
References
- connection() Connection#
Provides the current database connection.
- Returns:
The connection
- Return type:
psycopg.Connection
- rollback_tx() None#
Perform a ROLLBACK on the current transaction (connection).
This should be executed if any errors occurred while running a query.
- Return type:
None
- obtain_new_local_connection() Connection#
Provides a new database connection to be used exclusively be the client.
The current connection maintained by the PostgresInterface is not affected by obtaining a new connection in any way.
- Returns:
The connection
- Return type:
psycopg.Connection
- close() None#
Shuts down all currently open connections to the database.
- Return type:
None
- prewarm_tables(tables: TableReference | Iterable[TableReference] | None = None, *more_tables: TableReference, exclude_table_pages: bool = False, include_primary_index: bool = True, include_secondary_indexes: bool = True) None#
Prepares the Postgres buffer pool with tuples from specific tables.
- Parameters:
tables (Optional[TableReference | Iterable[TableReference]], optional) – The tables that should be placed into the buffer pool
*more_tables (TableReference) – More tables that should be placed into the buffer pool, enabling a more convenient usage of this method. See examples for details on the usage.
exclude_table_pages (bool, optional) – Whether the table data (i.e. pages containing the actual tuples) should not be prewarmed. This is off by default, meaning that prewarming is applied to the data pages. This can be toggled on to only prewarm index pages (see include_primary_index and include_secondary_index).
include_primary_index (bool, optional) – Whether the pages of the primary key index should also be prewarmed. Enabled by default.
include_secondary_indexes (bool, optional) – Whether the pages for secondary indexes should also be prewarmed. Enabled by default.
- Return type:
None
Notes
If the database should prewarm more table pages than can be contained in the shared buffer, the actual contents of the pool are not specified. Since all prewarming tasks happen sequentially, the first prewarmed relations will typically be evicted and only the last relations (tables or indexes) are retained in the shared buffer. The precise order in which the prewarming tasks are executed is not specified and depends on the actual relations.
Examples
>>> pg.prewarm_tables([table1, table2]) >>> pg.prewarm_tables(table1, table2) >>> pg.prewarm_tables(query.tables())
- cooldown_tables(tables: TableReference | Iterable[TableReference] | None = None, *more_tables: TableReference, exclude_table_pages: bool = False, include_primary_index: bool = True, include_secondary_indexes: bool = True) None#
Removes tuples from specific tables from the Postgres buffer pool.
This method can be used to simulate a cold start for the next incoming query. It requires the pg_temperature extension that is part of the pg_lab project.
- Parameters:
tables (Optional[TableReference | Iterable[TableReference]], optional) – The tables that should be removed from the buffer pool
*more_tables (TableReference) – More tables that should be removed into the buffer pool, enabling a more convenient usage of this method. See examples for details on the usage.
exclude_table_pages (bool, optional) – Whether the table data (i.e. pages containing the actual tuples) should not be removed. This is off by default, meaning that the cooldown is applied to the data pages. This can be toggled on to only cooldown index pages (see include_primary_index and include_secondary_index).
include_primary_index (bool, optional) – Whether the pages of the primary key index should also be cooled down. Enabled by default.
include_secondary_indexes (bool, optional) – Whether the pages for secondary indexes should also be cooled down. Enabled by default.
- Return type:
None
Examples
>>> pg.cooldown_tables([table1, table2]) >>> pg.cooldown_tables(table1, table2) >>> pg.cooldown_tables(query.tables())
References
pg_lab : rbergm/pg_lab
- current_configuration(*, runtime_changeable_only: bool = False) PostgresConfiguration#
Provides all current configuration settings in the current Postgres connection.
- Parameters:
runtime_changeable_only (bool, optional) – Whether only such settings that can be changed at runtime should be provided. Defaults to False.
- Returns:
The current configuration.
- Return type:
- apply_configuration(configuration: PostgresConfiguration | PostgresSetting | str) None#
Changes specific configuration parameters of the Postgres server or current connection.
- Parameters:
configuration (PostgresConfiguration | PostgresSetting | str) – The desired setting values. If a string is supplied directly, it already has to be a valid setting update such as SET geqo = FALSE;.
- Return type:
None
- has_extension(extension_name: str, *, is_shared_object: bool = True) bool#
Checks, whether the current Postgres database has a specific extension loaded and active.
Extensions can be either created using the CREATE EXTENSION command, or by loading the shared object via LOAD. For the shared object-based check to work correctly, the Postgres server has to run in the same namespace as the PostBOUND client.
- Parameters:
extension_name (str) – The name of the extension to be checked. In case of shared objects, this should be equivalent to the name of said object. In this case, the suffix is optional.
is_shared_object (bool, optional) – Whether the extension is a shared object that is loaded into the Postgres server. By default this is set to True, which assumes that the extension is loaded as a shared object, rather than as a default extension.
- Returns:
Whether the extension is loaded and active in the current Postgres database.
- Return type:
bool
- class postbound.postgres.PostgresSchemaInterface(postgres_db: PostgresInterface)#
Database schema implementation for Postgres systems.
- Parameters:
postgres_db (PostgresInterface) – The database for which schema information should be retrieved
- tables(*, include_system_tables: bool = False, schema: str = 'public') set[TableReference]#
Fetches all user-defined tables that are contained in the current database.
- Parameters:
include_system_tables (bool, optional) – Whether system tables should also be included. By default, only user-defined tables are returned.
schema (str)
- Returns:
All tables in the current schema, including materialized views, etc.
- Return type:
set[TableReference]
Notes
Hint for implementors: the default implementation of this method relies on the information_schema.tables view.
- lookup_column(column: ColumnReference | str, candidate_tables: Iterable[TableReference], *, expect_match: bool = False) TableReference | None#
Searches for a table that owns the given column.
- Parameters:
column (ColumnReference | str) – The column that is being looked up
candidate_tables (Iterable[TableReference]) – Tables that could possibly own the given column
expect_match (bool, optional) – If enabled, an error is raised whenever no table is found. Otherwise None is returned. By default, this is disabled.
- Returns:
The first of the candidate_tables that has a column of similar name.
- Return type:
- Raises:
ValueError – If expect_match is enabled and none of the candidate tables has a column of the given name.
Notes
Hint for implementors: the default implementation of this method (transitively) relies on the information_schema.columns view.
- is_primary_key(column: ColumnReference) bool#
Checks, whether a column is the primary key for its associated table.
- Parameters:
column (ColumnReference) – The column to check
- Returns:
Whether the column is the primary key of its table. If it is part of a compound primary key, this is False.
- Return type:
bool
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hint for implementors: the default implementation of this method relies on the information_schema.table_constraints and information_schema.constraint_column_usage views.
- has_secondary_index(column: ColumnReference) bool#
Checks, whether a secondary index is available for a specific column.
- Parameters:
column (ColumnReference) – The column to check
- Returns:
Whether a secondary index of any kind was created for the column. Compound indexes and primary key indexes fail this test.
- Return type:
bool
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hints for implementors: The default implementation of this method assumes that each foreign key column and each column with a UNIQUE constraint has an associated index. If this should not be the case, a custom implementation needs to be supplied. Furthermore, the implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.
- indexes_on(column: ColumnReference) set[str]#
Retrieves the names of all indexes of a specific column.
- Parameters:
column (ColumnReference) – The column to check.
- Returns:
The indexes. If no indexes are available, the set will be empty.
- Return type:
set[str]
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hints for implementors: the default implementation of this method assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. It provides the names of the corresponding constraints. The implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.
- indexed_column(index: str, *, schema: str = 'public') BoundColumnReference | None#
Retrieves the column that is indexed by a specific index.
- Returns:
The column or None, if the index does not exist (in the given schema). For multi-indexes, i.e. indexes over multiple columns, this returns the first column only.
- Return type:
Optional[ColumnReference]
- Parameters:
index (str)
schema (str)
- foreign_keys_on(column: ColumnReference) set[BoundColumnReference]#
Fetches all foreign key constraints that are specified on a specific column.
The provided columns are the target columns that are referenced by the foreign key constraint. E.g., suppose there are tables A and B with columns x and y. We specify a foreign key constraint on column y to ensure that all values in y reference a value in x. Then, calling this method on column y will return column x. If there are multiple foreign key constraints on the same column, all of them will be returned.
- Parameters:
column (ColumnReference) – The column to check. All foreign keys that are “pointing from” this column to another column are returned.
- Returns:
The columns that are “pointed to” by foreign key constraints on the given column. If no such foreign keys exist, an empty set is returned.
- Return type:
set[BoundColumnReference]
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
- datatype(column: ColumnReference) str#
Retrieves the (physical) data type of a column.
The provided type can be a standardized SQL-type, but it can be a type specific to the concrete database system just as well. It is up to the user to figure this out and to react accordingly.
- Parameters:
column (ColumnReference) – The colum to check
- Returns:
The datatype. Will never be empty.
- Return type:
str
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hint for implementors: the default implementation of this method relies on the information_schema.columns view.
- is_nullable(column: ColumnReference) bool#
Checks, whether a specific column may contain NULL values.
- Parameters:
column (ColumnReference) – The column to check
- Returns:
Whether the column may contain NULL values
- Return type:
bool
- Raises:
postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)
Notes
Hint for implementors: the default implementation of this method relies on the information_schema.columns view.
- class postbound.postgres.PostgresStatisticsInterface(postgres_db: PostgresInterface, *, emulated: bool = True, enable_emulation_fallback: bool = True, cache_enabled: bool | None = True)#
Statistics implementation for Postgres systems.
- Parameters:
postgres_db (PostgresInterface) – The database instance for which the statistics should be retrieved
emulated (bool, optional) – Whether the statistics interface should operate in emulation mode. To enable reproducibility, this is True by default
enable_emulation_fallback (bool, optional) – Whether emulation should be used for unsupported statistics when running in native mode, by default True
cache_enabled (Optional[bool], optional) – Whether emulated statistics queries should be subject to caching, by default True. Set to None to use the caching behavior of the db
- n_buffered(table: TableReference | str) int#
Retrieves the number of buffered pages for the specified table.
The table can either be a base table or the name of an index.
Notes
The current implementation of this method relies on the pg_buffercache extension and works by scanning the entire buffer cache. Therefore, it incurs a slight overhead and should not be called inside of hot loops. Instead, you can use buffer_state to retrieve the number of buffered pages for all relations in a single pass.
See also
- Parameters:
table (TableReference | str)
- Return type:
int
- buffer_state(*, schema: str = 'public') dict[str, int]#
Retrieves the current buffer state for all relations in the given schema (public by default).
If a relation is not contained in the result, none of its pages are currently buffered.
The result contains all PG relations, i.e. including indexes and other non-table relations. Typically, primary key indexes are named <relation_name>_pkey. The name of secondary indexes depends on the schema.
See also
- Parameters:
schema (str)
- Return type:
dict[str, int]
- update_statistics(columns: ColumnReference | Iterable[ColumnReference] | None = None, *, tables: TableReference | Iterable[TableReference] | None = None, perfect_mcv: bool = False, perfect_n_distinct: bool = False, verbose: bool = False) None#
Instructs the Postgres server to update statistics for specific columns.
Notice that is one of the methods of the database interface that explicitly mutates the state of the database system.
- Parameters:
columns (Optional[ColumnReference | Iterable[ColumnReference]], optional) – The columns for which statistics should be updated. If no columns are given, columns are inferred based on the tables and all detected columns are used.
tables (Optional[TableReference | Iterable[TableReference]], optional) – The table for which statistics should be updated. If columns are given, this parameter is completely ignored. If no columns and no tables are given, all tables in the current database are used.
perfect_mcv (bool, optional) – Whether the database system should attempt to create perfect statistics. Perfect statistics means that for each of the columns MCV lists are created such that each distinct value is contained within the list. For large and diverse columns, this might lots of compute time as well as storage space. Notice, that the database system still has the ultimate decision on whether to generate MCV lists in the first place. Postgres also imposes a hard limit on the maximum allowed length of MCV lists and histogram widths.
perfect_n_distinct (bool, optional) – Whether to set the number of distinct values to its true value.
verbose (bool, optional) – Whether to print some progress information to standard error.
- Return type:
None
- postbound.postgres.PostgresOptimizerSettings = {IntermediateOperator.Materialize: 'enable_material', IntermediateOperator.Memoize: 'enable_memoize', IntermediateOperator.Sort: 'enable_sort', JoinOperator.HashJoin: 'enable_hashjoin', JoinOperator.NestedLoopJoin: 'enable_nestloop', JoinOperator.SortMergeJoin: 'enable_mergejoin', ScanOperator.BitmapScan: 'enable_bitmapscan', ScanOperator.IndexOnlyScan: 'enable_indexonlyscan', ScanOperator.IndexScan: 'enable_indexscan', ScanOperator.SequentialScan: 'enable_seqscan'}#
All (session-global) optimizer settings that modify the allowed physical operators.
- postbound.postgres.PGHintPlanOptimizerHints: dict[ScanOperator | JoinOperator | IntermediateOperator, str] = {IntermediateOperator.Memoize: 'Memoize', JoinOperator.HashJoin: 'HashJoin', JoinOperator.NestedLoopJoin: 'NestLoop', JoinOperator.SortMergeJoin: 'MergeJoin', ScanOperator.BitmapScan: 'BitmapScan', ScanOperator.IndexOnlyScan: 'IndexOnlyScan', ScanOperator.IndexScan: 'IndexOnlyScan', ScanOperator.SequentialScan: 'SeqScan'}#
All physical operators that can be enforced by pg_hint_plan.
These settings operate on a per-relation basis and overwrite the session-global optimizer settings.
References
- postbound.postgres.PGLabOptimizerHints: dict[ScanOperator | JoinOperator | IntermediateOperator, str] = {IntermediateOperator.Materialize: 'Material', IntermediateOperator.Memoize: 'Memo', JoinOperator.HashJoin: 'HashJoin', JoinOperator.NestedLoopJoin: 'NestLoop', JoinOperator.SortMergeJoin: 'MergeJoin', ScanOperator.BitmapScan: 'BitmapScan', ScanOperator.IndexOnlyScan: 'IdxScan', ScanOperator.IndexScan: 'IdxScan', ScanOperator.SequentialScan: 'SeqScan'}#
All physical operators that can be enforced by pg_lab.
These settings operate on a per-relation basis and overwrite the session-global optimizer settings.
References
- postbound.postgres.PostgresJoinHints = {JoinOperator.HashJoin, JoinOperator.NestedLoopJoin, JoinOperator.SortMergeJoin}#
All join operators that are supported by Postgres.
- postbound.postgres.PostgresScanHints = {ScanOperator.BitmapScan, ScanOperator.IndexOnlyScan, ScanOperator.IndexScan, ScanOperator.SequentialScan}#
All scan operators that are supported by Postgres.
- postbound.postgres.PostgresPlanHints = {HintType.BushyJoinOrder, HintType.Cardinality, HintType.JoinDirection, HintType.LinearJoinOrder, HintType.Operator, HintType.Parallelization}#
All non-operator hints supported by Postgres, that can be used to enforce additional optimizer behaviour.
- class postbound.postgres.PostgresExplainClause(original_clause: Explain)#
A specialized EXPLAIN clause implementation to handle Postgres custom syntax for query plans.
If ANALYZE is enabled, this also retrieves information about shared buffer usage (page hits and disk reads).
- Parameters:
original_clause (Explain) – The actual EXPLAIN clause. The new explain clause acts as a decorator around the original clause.
- class postbound.postgres.PostgresLimitClause(original_clause: Limit)#
A specialized LIMIT clause implementation to handle Postgres custom syntax for limits / offsets
- Parameters:
original_clause (Limit) – The actual LIMIT clause. The new limit clause acts as a decorator around the original clause.
- postbound.postgres.PostgresHintingBackend#
The hinting backend being used.
If pg_lab is available, this is the preferred extension. Otherwise, pg_hint_plan is used as a fallback. If the hint service is inactive, the backend is set to _none_.
alias of
Literal[‘pg_hint_plan’, ‘pg_lab’, ‘none’]
- class postbound.postgres.PostgresHintService(postgres_db: PostgresInterface)#
Postgres-specific implementation of the hinting capabilities.
Most importantly, this service implements a mapping from the abstract optimization descisions (join order + operators) to their counterparts in the hinting backend and integrates Postgres’ few deviations from standard SQL syntax (CAST expressions and LIMIT clauses).
The hinting service supports two different kinds of backends: pg_lab or pg_hint_plan. The former is the preferred option since it provides cardinality hints for base joins and does not require management of the GeQO optimizer.
Notice that by delegating the adaptation of Postgres’ native optimizer to the pg_hint_plan extension, a couple of undesired side-effects have to be accepted:
forcing a join order also involves forcing a specific join direction. Our implementation applies a couple of heuristics to mitigate a bad impact on performance
the extension only instruments the dynamic programming-based optimizer. If the geqo_threshold is reached and the genetic optimizer takes over, no modifications are applied. Therefore, it is best to disable GeQO while working with Postgres. At the same time, this means that certain scenarios like custom cardinality estimation for the genetic optimizer cannot currently be tested
- Parameters:
postgres_db (PostgresInterface) – A postgres database with an active hinting backend (pg_hint_plan or pg_lab)
- Raises:
ValueError – If the supplied postgres_db does not have a supported hinting backend enabled.
See also
_generate_pg_join_order_hintReferences
- property backend: Literal['pg_hint_plan', 'pg_lab', 'none']#
The hinting backend in use.
- generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) SqlQuery#
Transforms the input query such that the given optimization decisions are respected during query execution.
In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.
Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.
Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.
- Parameters:
query (SqlQuery) – The query that should be transformed
plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.
join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.
physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.
plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.
- Returns:
The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.
- Return type:
- format_query(query: SqlQuery) str#
Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.
This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.
- Parameters:
query (SqlQuery) – The query that should be adapted for the database system
- Returns:
An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.
- Return type:
str
See also
postbound.qalthe query abstraction layer provided by PostBOUND
- supports_hint(hint: ScanOperator | JoinOperator | IntermediateOperator | HintType) bool#
Checks, whether the database system is capable of using the specified hint or operator
- Parameters:
hint (PhysicalOperator | HintType) – The hint/feature to check
- Returns:
Indicates whether the feature is supported by the specific database system.
- Return type:
bool
- describe() dict[str, str]#
Provides a JSON-serializable description of the hint service.
- Returns:
Information about the hinting backend
- Return type:
dict[str, str]
- class postbound.postgres.PostgresOptimizer(postgres_instance: PostgresInterface)#
Optimizer introspection for Postgres.
- Parameters:
postgres_instance (PostgresInterface) – The database whose optimizer should be introspected
- query_plan(query: SqlQuery | str) QueryPlan#
Obtains the query execution plan for a specific query.
This respects all hints that potentially influence the optimization process.
- analyze_plan(query: SqlQuery) QueryPlan#
- analyze_plan(query: SqlQuery, *, timeout: float) QueryPlan | None
- analyze_plan(query: SqlQuery, *, timeout: Literal[None]) QueryPlan
Executes a specific query and provides the query execution plan supplemented with runtime information.
This respects all hints that potentially influence the optimization process.
- parse_plan(plan: Any, *, query: SqlQuery | None = None) QueryPlan#
Transforms the system-specific EXPLAIN output into a standardized QueryPlan.
The optional query can be used to provide additional context for the plan. This can be used by some database systems for correct parsing (e.g. DuckDB), while others might not use it at all. Omitting the query parameter is acceptable.
This method is mainly inteded to parse a previously generated plan, or when the plan could not simply be obtained via query_plan or analyze_plan.
- cardinality_estimate(query: SqlQuery | str) Cardinality#
Queries the DBMS query optimizer for its cardinality estimate, instead of executing the query.
The cardinality estimate will correspond to the estimate for the final node. Therefore, running this method with aggregate queries is not particularly meaningful.
- Parameters:
query (SqlQuery | str) – The input query
- Returns:
The cardinality estimate of the native optimizer for the database system.
- Return type:
- cost_estimate(query: SqlQuery | str) float#
Queries the DBMS query optimizer for the estimated cost of executing the query.
The cost estimate will correspond to the estimate for the final node. Typically, this cost includes the cost of all sub-operators as well.
- Parameters:
query (SqlQuery | str) – The input query
- Returns:
The cost estimate of the native optimizer for the database system.
- Return type:
Cost
- configure_operator(operator: ScanOperator | JoinOperator | IntermediateOperator, *, enabled: bool) None#
Enables or disables a specific physical operator for the current Postgres connection.
- Parameters:
operator (PhysicalOperator) – The operator to configure.
enabled (bool) – Whether the operator should be allowed or not.
- Return type:
None
References
https://www.postgresql.org/docs/current/runtime-config-query.html
- postbound.postgres.connect(*, name: str = 'postgres', application_name: str = 'PostBOUND', connect_string: str = '', config_file: str | Path = '', encoding: str = 'UTF8', cache_enabled: bool = False, refresh: bool = False, private: bool = False, debug: bool = False) PostgresInterface#
Convenience function to seamlessly connect to a Postgres instance.
This function obtains a connection to a Postgres database by trying the following methods in order:
if the connect-string is supplied directly via the connect_string parameter, this is used
the connect string is read from the config_file if this parameter is supplied. This file has to be located in the current working directory, but absolute and relative paths are supported. If the file does not exist, an error is raised.
the connect string is read from the default connection file .psycopg_connection in the current working directory
the connection parameters are read from the standard Postgres environment variables (e.g. PGDATABASE, PGHOST, …). This method is triggered via the presence of the PGDATABASE environment variable. Note that this method is generally discouraged due to its implicit and non-obvious nature. A warning is emitted if this method is used.
If none of these methods worked, an error is raised.
After a connection to the Postgres instance has been obtained, it is registered automatically on the current DatabasePool instance. This can be changed via the private parameter.
Config file formats#
The configuration file can be supplied in different formats. Currently supported are:
Plain text files (no extension), such as .psycopg_connection: The entire file contents are read as a single psycopg-compatible connect string
INI files (.ini): The file must contain exactly one section. All key-value pairs in this section are treated as connection parameters.
TOML files (.toml): The file is parsed as a TOML document. All top-level key-value pairs are treated as connection parameters.
JSON files (.json): The file is parsed as a JSON document. All top level key-value pairs are treated as connection parameters.
YAML files (.yml or .yaml): The file is parsed as a YAML document. All top-level key-value pairs are treated as connection parameters. This requires the PyYAML package to be installed.
- param name:
A name to identify the current connection if multiple connections to different Postgres instances should be maintained. This is used to register the instance on the DatabasePool. Defaults to postgres.
- type name:
str, optional
- param application_name:
Identifier for the Postgres server. This will be the name that is shown in the server logs and process lists.
- type application_name:
str, optional
- param connect_string:
A Psycopg-compatible connect string for the database. Supplying this parameter overwrites any other connection information
- type connect_string:
str, optional
- param config_file:
A file containing a Psycopg-compatible connect string for the database. This is the default and preferred method of connecting to a Postgres database. Defaults to .psycopg_connection See the section on config_file formats for supported file types. The appropriate parser is selected based on the file extension.
- type config_file:
str | Path, optional
- param encoding:
The client enconding of the connection. Defaults to UTF8.
- type encoding:
str, optional
- param cache_enabled:
Controls the default caching behaviour of the Postgres instance. Caching of general queries is disabled by default, whereas queries from the statistics interface are cached by default.
- type cache_enabled:
bool, optional
- param refresh:
If true, a new connection to the database will always be established, even if a connection to the same database is already pooled. The registration key will be suffixed to prevent collisions. By default, the current connection is re-used. If that is the case, no further information (e.g. config strings) is read and only the name is accessed.
- type refresh:
bool, optional
- param private:
If true, skips registration of the new instance on the DatabasePool. Registration is performed by default.
- type private:
bool, optional
- returns:
The Postgres database object
- rtype:
PostgresInterface
- raises ValueError:
If neither a config file nor a connect string was given, or if the connect file should be used but does not exist
References
- Parameters:
name (str)
application_name (str)
connect_string (str)
config_file (str | Path)
encoding (str)
cache_enabled (bool)
refresh (bool)
private (bool)
debug (bool)
- Return type:
- postbound.postgres.start(pgdata: str | Path = '', *, logfile: str | Path = '') None#
Starts a local Postgres server.
This function assumes that pg_ctl is available on the system PATH and either the server’s data directory is specified explicitly, or set via the PGDATA environment variable.
- Parameters:
pgdata (str | Path)
logfile (str | Path)
- Return type:
None
- postbound.postgres.stop(pgdata: str | Path = '', *, raise_on_error: bool = False) None#
Stops a running (local) Postgres server.
This function assumes that pg_ctl is available on the system PATH and either the server’s data directory is specified explicitly, or set via the PGDATA environment variable.
If the server cannot be stopped due to whatever reason, an error can be raised by setting the corresponding parameter. Otherwise, it is silently ignored.
- Parameters:
pgdata (str | Path)
raise_on_error (bool)
- Return type:
None
- postbound.postgres.is_running(pgdata: str | Path = '') bool#
Checks, whether a local Postgres server is currently running.
This function assumes that pg_ctl is available on the system PATH. A data directory can be supplied to check whether a server is running for the specific database. If pgdata is not supplied, the PGDATA environment variable is used as a fallback.
- Parameters:
pgdata (str | Path)
- Return type:
bool
- class postbound.postgres.ParallelQueryExecutor(connect_string: str, n_threads: int | None = None, *, timeout: int | None = None, verbose: bool = False)#
The ParallelQueryExecutor provides mechanisms to conveniently execute queries in parallel.
The parallel execution happens by maintaining a number of worker threads that execute the incoming queries. The number of input queries can exceed the worker pool size, potentially by a large margin. If that is the case, input queries will be buffered until a worker is available.
This parallel executor has nothing to do with the Database interface and acts entirely independently and Postgres-specific.
- Parameters:
connect_string (str) – Connection info to establish a network connection to the Postgres instance. Delegates to Psycopg
n_threads (Optional[int], optional) – The maximum number of parallel workers to use. If this is not specified, uses
os.cpu_count()many workers.timeout (Optional[int], optional) – The number of seconds to wait until an individual query is aborted. Timeouts do not affect other queries (both those running in parallel or those running afterwards on the same worker). In case of a timeout, the query’s entry in the result set will be None.
verbose (bool, optional) – Whether to print logging information during the query execution. This is off by default.
See also
Database,PostgresInterfaceReferences
- queue_query(query: SqlQuery | str) None#
Adds a new query to the queue, to be executed as soon as possible.
If a timeout was specified when creating the executor, this timeout will be applied to the query.
- Parameters:
query (SqlQuery | str) – The query to execute
- Return type:
None
- drain_queue(timeout: float | None = None, *, callback: Callable[[SqlQuery | str, Sequence[tuple] | None], None] | None = None) None#
Blocks, until all queries currently queued have terminated.
- Parameters:
timeout (Optional[float], optional) – The number of seconds to wait until the calculation is aborted. Defaults to None, which indicates no timeout, i.e. wait forever. Note that in contrast to the timeout specified when creating the executor, this timeout applies to the entire queue and not to individual queries. For example, one can set the per-query timeout to 1s which means that each query can be executed for at most 1 second. If an additional timeout of 10s is specified on the queue, the entire queue will be aborted if it takes longer than 10 seconds to complete.
callback (Optional[Callable[[SqlQuery | str, ResultSet | None], None]], optional) – A callback to be executed with each query that completes. The callback receives the query that was executed and the corresponding (raw) result set as arguments. If the query ran into a timeout, the result set is None.
- Raises:
TimeoutError or concurrent.futures.TimeoutError – If some queries have not completed after the given timeout.
- Return type:
None
- result_set() dict[str | SqlQuery, Sequence[tuple] | None]#
Provides the results of all queries that have terminated already, mapping query -> result set
- Returns:
The query results. The raw result sets are provided without any simplification. If the query timed out, the result set is None (in contrast to empty result sets like []).
- Return type:
dict[str | SqlQuery, ResultSet | None]
- close() None#
Terminates all worker threads. The executor is essentially useless afterwards.
- Return type:
None
- class postbound.postgres.TimeoutQueryExecutor(postgres_instance: PostgresInterface | None = None)#
The TimeoutQueryExecutor provides a mechanism to execute queries with a timeout attached.
If the query takes longer than the designated timeout, its execution is cancelled. The query execution itself is delegated to the PostgresInterface, so all its rules still apply. At the same time, using the timeout executor service can invalidate some of the state that is exposed by the database interface (see Warnings below). Therefore, the relevant variables should be refreshed once the timeout executor was used.
In addition to calling the execute_query method directly, the executor also implements __call__ for more convenient access. Both methods accept the same parameters.
- Parameters:
postgres_instance (Optional[PostgresInterface], optional) – Database to execute the queries. If omitted, this is inferred from the DatabasePool.
Warning
When a query gets cancelled due to the timeout being reached, the current cursor as well as database connection might be refreshed. Any direct references to these instances should no longer be used.
- execute_query(query: SqlQuery | str, timeout: float, **kwargs) Any#
Runs a query on the database connection, cancelling if it takes longer than a specific timeout.
- Parameters:
query (SqlQuery | str) – Query to execute
timeout (float) – Maximum query execution time in seconds.
**kwargs – Additional parameters to pass to the PostgresInterface.execute_query method.
- Returns:
The query result if it terminated timely. Rules from PostgresInterface.execute_query apply.
- Return type:
Any
- Raises:
TimeoutError – If the query execution was not finished after timeout seconds.
- close() None#
Closes any internal resources held by the timeout executor.
After calling this method, the timeout executor should no longer be used.
- Return type:
None
- postbound.postgres.PostgresExplainJoinNodes = {'Hash Join': JoinOperator.HashJoin, 'Merge Join': JoinOperator.SortMergeJoin, 'Nested Loop': JoinOperator.NestedLoopJoin}#
A mapping from Postgres EXPLAIN node names to the corresponding join operators.
- postbound.postgres.PostgresExplainScanNodes = {'Bitmap Heap Scan': ScanOperator.BitmapScan, 'Index Only Scan': ScanOperator.IndexOnlyScan, 'Index Scan': ScanOperator.IndexScan, 'Seq Scan': ScanOperator.SequentialScan}#
A mapping from Postgres EXPLAIN node names to the corresponding scan operators.
- postbound.postgres.PostgresExplainIntermediateNodes = {'Materialize': IntermediateOperator.Materialize, 'Memoize': IntermediateOperator.Memoize, 'Sort': IntermediateOperator.Sort}#
A mapping from Postgres EXPLAIN node names to the corresponding intermediate operators.
- postbound.postgres.NodeType#
All different nodes that can be created by Postgres.
This has been extracted directly from ExplainNode() in explain.c from the Postgres source code.
alias of
Literal[‘Result’, ‘ProjectSet’, ‘ModifyTable’, ‘Append’, ‘Merge Append’, ‘Recursive Union’, ‘BitmapAnd’, ‘BitmapOr’, ‘Nested Loop’, ‘Merge Join’, ‘Hash Join’, ‘Seq Scan’, ‘Sample Scan’, ‘Gather’, ‘Gather Merge’, ‘Index Scan’, ‘Index Only Scan’, ‘Bitmap Index Scan’, ‘Bitmap Heap Scan’, ‘Tid Scan’, ‘Tid Range Scan’, ‘Subquery Scan’, ‘Function Scan’, ‘Table Function Scan’, ‘Values Scan’, ‘CTE Scan’, ‘Named Tuplestore Scan’, ‘WorkTable Scan’, ‘Foreign Scan’, ‘Custom Scan’, ‘Materialize’, ‘Memoize’, ‘Sort’, ‘Incremental Sort’, ‘Group’, ‘Aggregate’, ‘WindowAgg’, ‘Unique’, ‘SetOp’, ‘LockRows’, ‘Limit’, ‘Hash’]
- class postbound.postgres.PostgresExplainNode(explain_data: dict)#
Simplified model of a plan node as provided by Postgres’ EXPLAIN output in JSON format.
Generally speaking, a node stores all the information about the plan node that we currently care about. This is mostly focused on optimizer statistics, along with some additional data. Explain nodes form a hierarchichal structure with each node containing an arbitrary number of child nodes. Notice that this model is very loose in the sense that no constraints are enforced and no sanity checking is performed. For example, this means that nodes can contain more than two children even though this can never happen in a real EXPLAIN plan. Similarly, the correspondence between filter predicates and the node typse (e.g. join filter for a join node) is not checked.
All relevant data from the explain node is exposed as attributes on the objects. Even though these are mutable, they should be thought of as read-only data objects.
- Parameters:
explain_data (dict) – The JSON data of the current explain node. This is parsed and prepared as part of the __init__ method.
- node_type#
The node type. This should never be empty or None, even though it is technically allowed.
- Type:
NodeType | None, default None
- cost#
The optimizer’s cost estimation for this node. This includes the cost of all child nodes as well. This should normally not be NaN, even though it is technically allowed.
- Type:
float, default NaN
- cardinality_estimate#
The optimizer’s estimation of the number of tuples that will be produced by this operator. This should normally not be NaN, even though it is technically allowed.
- Type:
float, default NaN
- execution_time#
For EXPLAIN ANALYZE plans, this is the actual total execution time of the node in seconds. For pure EXPLAIN plans, this is NaN
- Type:
float, default NaN
- true_cardinality#
For EXPLAIN ANALYZE plans, this is the average of the number of tuples that were actually produced for each loop of the node. For pure EXPLAIN plans, this is NaN
- Type:
float, default NaN
- loops#
For EXPLAIN ANALYZE plans, this is the number of times the operator was invoked. The number of invocations can mean a number of different things: for parallel operators, this normally matches the number of parallel workers. For scans, this matches the number of times a new tuple was requested (e.g. for an index nested-loop join the number of loops of the index scan part indicates how many times the index was probed).
- Type:
int, default 1
- relation_name#
The name of the relation/table that is processed by this node. This should be defined on scan nodes, but could also be present on other nodes.
- Type:
str | None, default None
- relation_alias#
The alias of the relation/table under which the relation was accessed in th equery plan. See relation_name.
- Type:
str | None, default None
- index_name#
The name of the index that was probed. This should be defined on index scans and index-only scans, but could also be present on other nodes.
- Type:
str | None, default None
- filter_condition#
A post-processing filter that is applied to all rows emitted by this operator. This is most important for scan operations with an attached filter predicate, but can also be present on some joins.
- Type:
str | None, default None
- index_condition#
The condition that is used to locate the matching tuples in an index scan or index-only scan
- Type:
str | None, default None
- join_filter#
The condition that is used to determine matching tuples in a join
- Type:
str | None, default None
- hash_condition#
The condition that is used to determine matching tuples in a hash join
- Type:
str | None, default None
- recheck_condition#
For lossy bitmap scans or bitmap scans based on lossy indexes, this is post-processing check for whether the produced tuples actually match the filter condition
- Type:
str | None, default None
- parent_relationship#
Describes the role that this node plays in relation to its parent. Common values are inner which denotes that this is the inner child of a join and outer which denotes the opposite.
- Type:
str | None, default None
- parallel_workers#
For parallel operators in EXPLAIN ANALYZE plans, this is the actual number of worker processes that were started. Notice that in total there is one additional worker. This process takes care of spawning the other workers and managing them, but can also take part in the input processing.
- Type:
int | float, default NaN
- sort_keys#
The columns that are used to sort the tuples that are produced by this node. This is most important for sort nodes, but can also be present on other nodes.
- Type:
list[str]
For EXPLAIN ANALYZE plans with BUFFERS enabled, this is the number of blocks/pages that where retrieved from disk while executing this node, including the reads of all its child nodes.
- Type:
float, default NaN
For EXPLAIN ANALYZE plans with BUFFERS enabled, this is the number of blocks/pages that where retrieved from the shared buffer while executing this node, including the hits of all its child nodes.
- Type:
float, default NaN
- temp_blocks_read#
For EXPLAIN ANALYZE blocks with BUFFERS enabled, this is the number of short-term data structures (e.g. hash tables, sorts) that where read by this node, including reads of all its child nodes.
- Type:
float, default NaN
- temp_blocks_written#
For EXPLAIN ANALYZE blocks with BUFFERS enabled, this is the number of short-term data structures (e.g. hash tables, sorts) that where written by this node, including writes of all its child nodes.
- Type:
float, default NaN
- plan_width#
The average width of the tuples that are produced by this node.
- Type:
float, default NaN
- children#
All child / input nodes for the current node
- Type:
list[PostgresExplainNode]
- static all_node_types() frozenset[Literal['Result', 'ProjectSet', 'ModifyTable', 'Append', 'Merge Append', 'Recursive Union', 'BitmapAnd', 'BitmapOr', 'Nested Loop', 'Merge Join', 'Hash Join', 'Seq Scan', 'Sample Scan', 'Gather', 'Gather Merge', 'Index Scan', 'Index Only Scan', 'Bitmap Index Scan', 'Bitmap Heap Scan', 'Tid Scan', 'Tid Range Scan', 'Subquery Scan', 'Function Scan', 'Table Function Scan', 'Values Scan', 'CTE Scan', 'Named Tuplestore Scan', 'WorkTable Scan', 'Foreign Scan', 'Custom Scan', 'Materialize', 'Memoize', 'Sort', 'Incremental Sort', 'Group', 'Aggregate', 'WindowAgg', 'Unique', 'SetOp', 'LockRows', 'Limit', 'Hash']]#
All node types that are currently recognized by PostBOUND.
- Return type:
frozenset[Literal[‘Result’, ‘ProjectSet’, ‘ModifyTable’, ‘Append’, ‘Merge Append’, ‘Recursive Union’, ‘BitmapAnd’, ‘BitmapOr’, ‘Nested Loop’, ‘Merge Join’, ‘Hash Join’, ‘Seq Scan’, ‘Sample Scan’, ‘Gather’, ‘Gather Merge’, ‘Index Scan’, ‘Index Only Scan’, ‘Bitmap Index Scan’, ‘Bitmap Heap Scan’, ‘Tid Scan’, ‘Tid Range Scan’, ‘Subquery Scan’, ‘Function Scan’, ‘Table Function Scan’, ‘Values Scan’, ‘CTE Scan’, ‘Named Tuplestore Scan’, ‘WorkTable Scan’, ‘Foreign Scan’, ‘Custom Scan’, ‘Materialize’, ‘Memoize’, ‘Sort’, ‘Incremental Sort’, ‘Group’, ‘Aggregate’, ‘WindowAgg’, ‘Unique’, ‘SetOp’, ‘LockRows’, ‘Limit’, ‘Hash’]]
- is_scan() bool#
Checks, whether the current node corresponds to a scan node.
For Bitmap index scans, which are multi-level scan operators, this is true for the heap scan part that takes care of actually reading the tuples according to the bitmap provided by the bitmap index scan operators.
- Returns:
Whether the node is a scan node
- Return type:
bool
- is_join() bool#
Checks, whether the current node corresponds to a join node.
- Returns:
Whether the node is a join node
- Return type:
bool
- is_analyze() bool#
Checks, whether this EXPLAIN plan is an EXPLAIN ANALYZE plan or a pure EXPLAIN plan.
The analyze variant does not only obtain the plan, but actually executes it. This enables the comparison of the optimizer’s estimates to the actual values. If a plan is an EXPLAIN ANALYZE plan, some attributes of this node receive actual values. These include execution_time, true_cardinality, loops and parallel_workers.
- Returns:
Whether the node represents part of an EXPLAIN ANALYZE plan
- Return type:
bool
- filter_conditions() dict[str, str]#
Collects all filter conditions that are defined on this node
- Returns:
A dictionary mapping the type of filter condition (e.g. index condition or join filter) to the actual filter value.
- Return type:
dict[str, str]
- inner_outer_children() Sequence[PostgresExplainNode]#
Provides the children of this node in a sequence of inner, outer if applicable.
For all nodes where this structure is not meaningful (e.g. intermediate nodes that operate on a single relation or scan nodes), the child nodes are returned as-is (e.g. as a list of a single child or an empty list).
- Returns:
The children of the current node in a unified format
- Return type:
Sequence[PostgresExplainNode]
- parse_table() TableReference | None#
Provides the table that is processed by this node.
- Returns:
The table being scanned. For non-scan nodes, or nodes where no table can be inferred, None will be returned.
- Return type:
Optional[TableReference]
- as_qep() QueryPlan#
Transforms the postgres-specific plan to a standardized QueryPlan instance.
Notice that this transformation is lossy since not all information from the Postgres plan can be represented in query execution plan instances. Furthermore, this transformation can be problematic for complicated queries that use special Postgres features. Most importantly, for queries involving subqueries, special node types and parent relationships can be contained in the plan, that cannot be represented by other parts of PostBOUND. If this method and the resulting query execution plans should be used on complex workloads, it is advisable to check the plans twice before continuing.
- Returns:
The equivalent query execution plan for this node
- Return type:
- Raises:
ValueError – If the node contains more than two children.
- inspect(*, _indentation: int = 0) str#
Provides a pretty string representation of the EXPLAIN sub-plan that can be printed.
- Parameters:
_indentation (int, optional) – This parameter is internal to the method and ensures that the correct indentation is used for the child nodes of the plan. When inspecting the root node, this value is set to its default value of 0.
- Returns:
A string representation of the EXPLAIN sub-plan.
- Return type:
str
- class postbound.postgres.PostgresExplainPlan(explain_data: dict | list[dict])#
Models an entire EXPLAIN plan produced by Postgres
In contrast to PostgresExplainNode, this includes additional parameters (planning time and execution time) for the entire plan, rather than just portions of it.
This class supports all methods that are specified on the general QueryPlan and returns the correct data for its actual plan.
- Parameters:
explain_data (dict | list[dict]) – The JSON data of the entire explain plan. This is parsed and prepared as part of the __init__ method.
- planning_time#
The time in seconds that the optimizer spent to build the plan
- Type:
float
- execution_time#
The time in seconds the query execution engine needed to calculate the result set of the query. This does not account for network time to transmit the result set.
- Type:
float
- query_plan#
The actual plan
- Type:
- property root: PostgresExplainNode#
Gets the root node of the actual query plan.
- is_analyze() bool#
Checks, whether this EXPLAIN plan is an EXPLAIN ANALYZE plan or a pure EXPLAIN plan.
The analyze variant does not only obtain the plan, but actually executes it. This enables the comparison of the optimizer’s estimates to the actual values. If a plan is an EXPLAIN ANALYZE plan, some attributes of this node receive actual values. These include execution_time, true_cardinality, loops and parallel_workers.
- Returns:
Whether the plan represents an EXPLAIN ANALYZE plan
- Return type:
bool
- as_qep() QueryPlan#
Provides the actual explain plan as a normalized query execution plan instance
For notes on pecularities of this method, take a look at the See Also section
- Returns:
The query execution plan
- Return type:
See also
- inspect() str#
Provides a pretty string representation of the actual plan.
- Returns:
A string representation of the plan
- Return type:
str
See also
- class postbound.postgres.WorkloadShifter(pg_instance: PostgresInterface)#
The shifter provides simple means to manipulate the current contents of a database.
Currently, such means only include the deletion of specific rows, but other tools could be added in the future.
- Parameters:
pg_instance (PostgresInterface) – The database to manipulate
- remove_random(table: TableReference | str, *, n_rows: int | None = None, row_pct: float | None = None, vacuum: bool = False) None#
Deletes tuples from a specific tables at random.
- Parameters:
table (TableReference | str) – The table from which to delete
n_rows (Optional[int], optional) – The absolute number of rows to delete. Defaults to None in which case the row_pct is used.
row_pct (Optional[float], optional) – The share of rows to delete. Value should be in range (0, 1). Defaults to None in which case the n_rows is used.
vacuum (bool, optional) – Whether the database should be vacuumed after deletion. This optimizes the page layout by compacting the pages and forces a refresh of all statistics.
- Raises:
ValueError – If no correct n_rows or row_pct values have been given.
- Return type:
None
Warning
Notice that deletions in the given table can trigger further deletions in other tables through cascades in the schema.
- remove_ordered(column: ColumnReference | str, *, n_rows: int | None = None, row_pct: float | None = None, ascending: bool = True, null_placement: Literal['first', 'last'] | None = None, vacuum: bool = False) None#
Deletes the smallest/largest tuples from a specific table.
- Parameters:
column (ColumnReference | str) – The column to infer the deletion order. Can be either a proper column reference including the containing table, or a fully-qualified column string such as _table.column_ .
n_rows (Optional[int], optional) – The absolute number of rows to delete. Defaults to None in which case the row_pct is used.
row_pct (Optional[float], optional) – The share of rows to delete. Value should be in range (0, 1). Defaults to None in which case the n_rows is used.
ascending (bool, optional) – Whether the first or the last rows should be deleted. NULL values are according to null_placement.
null_placement (Optional[Literal["first", "last"]], optional) – Where to put NULL values in the order. Using the default value of None treats NULL values as being the largest values possible.
vacuum (bool, optional) – Whether the database should be vacuumed after deletion. This optimizes the page layout by compacting the pages and forces a refresh of all statistics.
- Raises:
ValueError – If no correct n_rows or row_pct values have been given.
- Return type:
None
Warning
Notice that deletions in the given table can trigger further deletions in other tables through cascades in the schema.
- generate_marker_table(target_table: str, marker_pct: float = 0.5, *, target_column: str = 'id', marker_table: str | None = None, marker_column: str | None = None) None#
Generates a new table that can be used to store rows that should be deleted at a later point in time.
The marker table will be created if it does not exist already. It contains exactly two columns: one column for the marker index (an ascending integer value) and another column that stores the primary keys of rows that should be deleted from the target table. If the marker table exists already, all current markings (but not the marked rows themselves) are removed. Afterwards, the new rows to delete are selected at random.
By default, only the target table is a required parameter. All other parameters have default values or can be inferred from the target table. The marker index column is marker_idx.
- Parameters:
target_table (str) – The table from which rows should be removed
marker_pct (float) – The percentage of rows that should be included in the marker table. Allowed range is [0, 1].
target_column (str, optional) – The column that contains the values used to identify the rows to be deleted in the target table. Defaults to id.
marker_table (Optional[str], optional) – The name of the marker table that should store the row identifiers. Defaults to <target table name>_delete_markers.
marker_column (Optional[str], optional) – The name of the column in the marker table that should contain the target column values. Defaults to <target table name>_<target column name>.
- Return type:
None
See also
- export_marker_table(*, target_table: str | None = None, marker_table: str | None = None, out_file: str | Path | None = None) None#
Stores a marker table in a CSV file on disk.
This allows the marker table to be re-imported later on.
- Parameters:
target_table (Optional[str], optional) – The name of the target table for which the marker has been created. This can be used to infer the name of the marker table if the defaults have been used.
marker_table (Optional[str], optional) – The name of the marker table. Can be omitted if the default name has been used and target_table is specified.
out_file (Optional[str | Path], optional) – The name and path of the output CSV file to create. If omitted, the name will be <marker table name>.csv and the file will be placed in the current working directory. If specified, an absolute path must be used.
- Raises:
ValueError – If neither target_table nor marker_table are given.
- Return type:
None
See also
- import_marker_table(*, target_table: str | None = None, marker_table: str | None = None, target_column: str = 'id', marker_column: str | None = None, target_column_type: str | None = None, in_file: str | Path | None = None) None#
Loads the contents of a marker table from a CSV file from disk.
The table will be created if it does not exist already. If the marker table exists already, all current markings (but not the marked rows themselves) are removed. Afterwards, the new markings are imported.
- Parameters:
target_table (Optional[str], optional) – The name of the target table for which the marker has been created. This can be used to infer the name of the marker table if the defaults have been used.
marker_table (Optional[str], optional) – The name of the marker table. Can be omitted if the default name has been used and target_table is specified.
target_column (str, optional) – The column that contains the values used to identify the rows to be deleted in the target table. Defaults to id.
marker_table – The name of the marker table that should store the row identifiers. Defaults to <target table name>_delete_markers.
target_column_type (Optional[str], optional) – The datatype of the target column. If this parameter is not given, target_table has to be specified to infer the proper datatype from the schema metadata.
in_file (Optional[str | Path], optional) – The name and path of the CSV file to read. If omitted, the name will be <marker table name>.csv and the file will be loaded in the current working directory. If specified, an absolute path must be used.
marker_column (str | None)
- Raises:
ValueError – If neither target_table nor marker_table are given.
- Return type:
None
See also
- remove_marked(target_table: str, *, target_column: str = 'id', marker_table: str | None = None, marker_column: str | None = None, vacuum: bool = False) None#
Deletes rows according to their primary keys stored in a marker table.
- Parameters:
target_table (str) – The table from which the rows should be removed.
target_column (str, optional) – A column of the target table that is used to identify rows matching the marked rows to remove. Defaults to id.
marker_table (Optional[str], optional) – A table containing marks of the rows to delete. Defaults to <target table>_delete_markers.
marker_column (Optional[str], optional) – A column of the marker table that contains the values of the columns to remove. Defaults to <target table>_<target column>.
vacuum (bool, optional) – Whether the database should be vacuumed after deletion. This optimizes the page layout by compacting the pages and forces a refresh of all statistics.
- Return type:
None
See also