MySQL Backend

MySQL Backend#

Module name: postbound.mysql

Contains the MySQL implementation of the Database interface.

The current implementation has a number of limitations. Some are caused by fundamental restrictions of how MySQL optimizes and executes queries, while others are caused by the sheer implementation effort that would have to be invested to implement the corresponding feature in MySQL.

The most important restrictions are as follows:

No support for parsing EXPLAIN ANALYZE plans. Calling the corresponding MysqlOptimizer.analyze_plan method raises a NotImplementedError. This is because MySQL currently (i.e. as of version 8.0) only provides EXPLAIN ANALYZE plans in TREE output format, which is not exhaustively documented and appears fairly irregular. This makes parsing the output fairly hard.

Restrictions of the query hint generation: query execution in MySQL differs fundamentally from the way queries are executed in more traditional systems such as PostgreSQL or Oracle. MySQL makes heavy usage of clustered indexes, meaning that all tuples in a table are automatically stored in a B-Tree according to the primary key index. As a consequence, MySQL strongly favors the usage of (Index-) Nested Loop Joins during query execution and rarely resorts to other operators. In fact, the only fundamentally different join operator available is the Hash Join. This operator is only used if a equality join should be executed between columns that do not have an index available. Therefore, it is not possible to disable Nested Loop Joins entirely, nor can the usage of Hash Joins be enforced. Instead, query hints can only disable the usage of Hash Joins, or recommend their usage. But whether or not they are actually applied is up to the MySQL query optimizer. A similar thing happens for the join order: although MySQL provides a number of hints related to the join order optimization, these hints are not always enforced. More specifically, to the best of our knowledge, it is not possible to enforce the branches in the join order and MySQL heavily favors left-deep query plans. Therefore, the generation of join order hints only works for linear join orders for now.

class postbound.mysql.MysqlConnectionArguments(user: str, database: str, password: str = '', host: str = '127.0.0.1', port: int = 3306, use_unicode: bool = True, charset: str = 'utf8mb4', autocommit: bool = True, sql_mode: str = 'ANSI')#

Captures all relevant parameters that customize the way the connection to a MySQL instance is establised.

The only required parameters are the user that should connect to the database and the name of the database to connect to. See [1] for the different parameters’ meaning.

References

Parameters:

user (str)
database (str)
password (str)
host (str)
port (int)
use_unicode (bool)
charset (str)
autocommit (bool)
sql_mode (str)

parameters() → dict[str, str | int | bool]#

Provides all arguments in one neat dict.

Returns:: A mapping from parameter name to parameter value.
Return type:: dict[str, str | int | bool]

class postbound.mysql.MysqlInterface(connection_args: MysqlConnectionArguments, system_name: str = 'MySQL', *, cache_enabled: bool = True)#

MySQL-specific implementation of the general Database interface.

Parameters:

connection_args (MysqlConnectionArguments)
system_name (str)
cache_enabled (bool)

schema() → MysqlSchemaInterface#

Provides access to the underlying schema information of the database.

Returns:: An object implementing the schema interface for the actual database system. This should normally be completely stateless.
Return type:: DatabaseSchema

statistics() → MysqlStatisticsInterface#

Provides access to the current statistics of the database.

Implementing generalized statistics for a framework that supports multiple different physical database systems is much more complicated than it might seem at first. Therefore, different modes for the statistics provisioning exist. These modes can be changed by setting the properties of the interface. See the documentation of DatabaseStatistics for more details.

Repeated calls to this method are guaranteed to provide the same object. Therefore, changes to the statistics interface configuration are guaranteed to be persisted accross multiple accesses to the statistics system.

Returns:: The statistics interface. Repeated calls to this method are guaranteed to provide the same object.
Return type:: DatabaseStatistics

hinting() → HintService#

Provides access to the hint generation facilities for the current database system.

Returns:: The hinting service. This should normally be completely stateless.
Return type:: HintService

execute_query(query: SqlQuery | str, *, cache_enabled: bool | None = None, raw: bool = False) → Any#

Executes the given query and returns the associated result set.

Parameters:

query (SqlQuery | str) – The query to execute. If it contains a Hint with preparatory_statements, these will be executed beforehand. Notice that such statements are never subject to caching.
cache_enabled (Optional[bool], optional) – Controls the caching behavior for just this one query. The default value of None indicates that the “global” configuration of the database system should be used. Setting this parameter to a boolean value forces or deactivates caching for the specific query for the specific execution no matter what the “global” configuration is.
raw (bool, optional) – Whether the result set should be returned as-is. By default, the result set is simplified. Raw mode skips this step.

Returns:

Result set of the input query. This is a list of equal-length tuples in the most general case. Each component of the tuple corresponds to a specific column of the result set and each tuple corresponds to a row in the result set. However, many queries do not provide a 2-dimensional result set (e.g. COUNT(*) queries). In such cases, the nested structure of the result set makes it quite cumbersome to use. Therefore, this method tries to simplify the return value of the query for more convenient use (if raw mode is disabled). More specifically, if the query returns just a single row, this row is returned directly as a tuple. Furthermore, if the query returns just a single column, the values of that column are returned directly in a list. Both simplifications will also be combined, such that a result set of a single row of a single value will be returned as that single value directly. In all other cases, the result will be a list consisting of the different result tuples.

Return type:

Any

Notes

This method is mainly intended to execute read-only SQL queries. In fact, the only types of SQL queries that can be modelled by the query abstraction layer are precisely such read-only queries. However, if one really needs to execute mutating queries, they can be issued as plain text. Just remember that this behavior is heavily discouraged!

The precise behavior of this method depends on whether caching is enabled or not. If it is, the query will only be executed against the live database system, if it is not in the cache. Otherwise, the result will simply be retrieved. Caching can be enabled/disabled for just this one query via the cache_enabled switch. If this is not specified, caching depends on the cache_enabled property.

If caching should be used for this method, but is disabled at a database-level, the current cache will still be read and persisted. This ensures that all cached queries are properly saved and none of the previous cache content is lost.

optimizer() → OptimizerInterface#

Provides access to optimizer-related functionality of the database system.

Returns:: The optimizer interface. This should normally be completely stateless.
Return type:: OptimizerInterface
Raises:: UnsupportedDatabaseFeatureError – If the database system does not provide any sort of external access to the optimizer.

database_name() → str#

Provides the name of the (physical) database that the database interface is connected to.

Returns:: The database name, e.g. imdb or tpc-h
Return type:: str

database_system_version() → Version#

Returns the release version of the database management system that this interface is connected to.

Returns:: The version
Return type:: util.Version

server_mode() → str#

Provides the current settings in the sql_mode MySQL variable.

Returns:: The sql_mode value, exactly as it is returned by the server. Typically, this is a list of comma-separated features.
Return type:: str

describe() → dict#

Provides a representation of the current database connection as well as its system settings.

This description is intended to transparently document which customizations have been applied, thereby giving an idea of how the default query execution might have been affected. It can be JSON-serialized and will be included by most of the output of the utilities in the runner module of the experiments package.

Return type:: dict

reset_connection() → None#

Obtains a new network connection for the database. Useful for debugging purposes or in case of crashes.

Notice that resetting the connection can have unintended side-effects if other methods rely on the cursor object. After resetting, the former cursor object will probably no longer be valid. Therefore, this method should be used with caution.

Returns:: Backend-specific information about the new connection. E.g., this might be a connection ID or nothing at all.
Return type:: Any

See also

Database.cursor

cursor() → Cursor#

Provides a cursor to execute queries and iterate over result sets manually.

Returns:: A cursor compatible with the Python DB API specification 2.0 (PEP 249). The specific cursor type depends on the concrete database implementation however.
Return type:: Cursor

References

close() → None#

Shuts down all currently open connections to the database.

Return type:: None

class postbound.mysql.MysqlSchemaInterface(mysql_db: MysqlInterface)#

Parameters:: mysql_db (MysqlInterface)

lookup_column(column: ColumnReference | str, candidate_tables: Iterable[TableReference], *, expect_match: bool = False) → TableReference | None#

Searches for a table that owns the given column.

Parameters:

column (ColumnReference | str) – The column that is being looked up
candidate_tables (Iterable[TableReference]) – Tables that could possibly own the given column
expect_match (bool, optional) – If enabled, an error is raised whenever no table is found. Otherwise None is returned. By default, this is disabled.

Returns:

The first of the candidate_tables that has a column of similar name.

Return type:

TableReference

Raises:

ValueError – If expect_match is enabled and none of the candidate tables has a column of the given name.

Notes

Hint for implementors: the default implementation of this method (transitively) relies on the information_schema.columns view.

is_primary_key(column: ColumnReference) → bool#

Checks, whether a column is the primary key for its associated table.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether the column is the primary key of its table. If it is part of a compound primary key, this is False.

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.table_constraints and information_schema.constraint_column_usage views.

has_secondary_index(column: ColumnReference) → bool#

Checks, whether a secondary index is available for a specific column.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether a secondary index of any kind was created for the column. Compound indexes and primary key indexes fail this test.

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: The default implementation of this method assumes that each foreign key column and each column with a UNIQUE constraint has an associated index. If this should not be the case, a custom implementation needs to be supplied. Furthermore, the implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

indexes_on(column: ColumnReference) → set[str]#

Retrieves the names of all indexes of a specific column.

Parameters:

column (ColumnReference) – The column to check.

Returns:

The indexes. If no indexes are available, the set will be empty.

Return type:

set[str]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hints for implementors: the default implementation of this method assumes that primary keys, foreign keys and unique constraints are all associated with an index structure. It provides the names of the corresponding constraints. The implementation relies on the information_schema.table_constraints, information_schema.constraint_column_usage and information_schema.key_column_usage views.

foreign_keys_on(column: ColumnReference) → set[BoundColumnReference]#

Fetches all foreign key constraints that are specified on a specific column.

The provided columns are the target columns that are referenced by the foreign key constraint. E.g., suppose there are tables A and B with columns x and y. We specify a foreign key constraint on column y to ensure that all values in y reference a value in x. Then, calling this method on column y will return column x. If there are multiple foreign key constraints on the same column, all of them will be returned.

Parameters:

column (ColumnReference) – The column to check. All foreign keys that are “pointing from” this column to another column are returned.

Returns:

The columns that are “pointed to” by foreign key constraints on the given column. If no such foreign keys exist, an empty set is returned.

Return type:

set[BoundColumnReference]

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

datatype(column: ColumnReference) → str#

Retrieves the (physical) data type of a column.

The provided type can be a standardized SQL-type, but it can be a type specific to the concrete database system just as well. It is up to the user to figure this out and to react accordingly.

Parameters:

column (ColumnReference) – The colum to check
raw (bool, optional) – If enabled, the type will be returned as-is. Otherwise, it will be lower-cased. By default, lower-casing is enabled.

Returns:

The datatype. Will never be empty. The type will automatically be lower-cased unless raw is set.

Return type:

str

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.columns view.

is_nullable(column) → bool#

Checks, whether a specific column may contain NULL values.

Parameters:

column (ColumnReference) – The column to check

Returns:

Whether the column may contain NULL values

Return type:

bool

Raises:

postbound.qal.UnboundColumnError – If the column is not associated with any table
postbound.qal.VirtualTableError – If the table associated with the column is a virtual table (e.g. subquery or CTE)

Notes

Hint for implementors: the default implementation of this method relies on the information_schema.columns view.

class postbound.mysql.MysqlStatisticsInterface(mysql_db: MysqlInterface)#

Parameters:: mysql_db (MysqlInterface)

postbound.mysql.MysqlOptimizerHints = {JoinOperator.HashJoin: 'BNL', JoinOperator.NestedLoopJoin: 'NO_BNL', ScanOperator.BitmapScan: 'INDEX_MERGE', ScanOperator.IndexOnlyScan: 'INDEX', ScanOperator.IndexScan: 'INDEX', ScanOperator.SequentialScan: 'NO_INDEX'}#

//dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html

Type:: See https

postbound.mysql.MysqlSwitchableOptimizations = {JoinOperator.HashJoin: 'block_nested_loop'}#

//dev.mysql.com/doc/refman/8.0/en/switchable-optimizations.html

Type:: See https

class postbound.mysql.MysqlHintService(mysql_instance: MysqlInterface)#

Parameters:: mysql_instance (MysqlInterface)

generate_hints(query: SqlQuery, plan: QueryPlan | None = None, *, join_order: JoinTree | None = None, physical_operators: PhysicalOperatorAssignment | None = None, plan_parameters: PlanParameterization | None = None) → SqlQuery#

Transforms the input query such that the given optimization decisions are respected during query execution.

In the most common case this involves building a Hint clause that encodes the optimization decisions in a system-specific way. However, depending on the concrete database system, this might also involve a restructuring of certain parts of the query, e.g. the usage of specific join statements, the introduction of non-standard SQL statements, or a reordering of the FROM clause.

Notice that all optimization information is optional. If individual parameters are set to None, nothing has been enforced by PostBOUND’s optimization process and the native optimizer of the database system should “fill the gaps”.

Implementations of this method are required to adhere to operators for joins and scans as much as possible. However, there is no requirement to represent auxiliary nodes (e.g. sorts) if this is not possible or meaningful for the plan. As a rule of thumb, implementations should rate the integrity of the plan in the database higher than a perfect representation of the input data.

Parameters:

query (SqlQuery) – The query that should be transformed
plan (Optional[QueryPlan], optional) – The query execution plan. If this is given, all other parameters should be None. This essentially enforces the given query plan.
join_order (Optional[JoinTree], optional) – The sequence in which individual joins should be executed.
physical_operators (Optional[PhysicalOperatorAssignment], optional) – The physical operators that should be used for the query execution. In addition to selecting specific operators for specific joins or scans, this can also include disabling certain operators for the entire query.
plan_parameters (Optional[PlanParameterization], optional) – Additional parameters and metadata for the native optimizer of the database system. Probably the most important use-case of these parameters is the supply of cardinality estimates for different joins and scans. For example, these can be combined with a join order to influence the physical operators that the native optimizer chooses. Another scenario is to only supply such cardinality estimates and leave the join_order and physical_operators completely empty, which essentially simulates a different cardinality estimation algorithm for the query. Notice however, that in this scenario cardinality estimates for all possible intermediate results of the query have to be supplied. Otherwise, the native optimizer once again “fills the gaps” and uses its own estimates for the remaining intermediate results that it explores during plan enumeration. This would probably effectively break the estimation algorithm.

Returns:

The transformed query. It contains all necessary information to enforce the optimization decisions as best as possible. Notice that whether the native optimizer of the database system is obliged to respect the optimization decisions depends on the specific system. For example, for MySQL hints are really just hints and the optimizer is only encouraged to use specific operators but not forced to do so.

Return type:

SqlQuery

format_query(query: SqlQuery) → str#

Transforms the query into a database-specific string, mostly to incorporate deviations from standard SQL.

This method is necessary because the query abstraction layer is focused on modelling and unifying different parts of an SQL query. However, some database systems (cough .. MySQL .. cough) deviate from standard SQL syntax and express different parts of a query different. The most prominent example are older versions of MySQL that used double quotes for string values rather than the SQL standard single quotes. Therefore, the format_query method takes an abstract representation of an SQL query as input and turns it into a string representation that accounts for all such deviations.

Parameters:: query (SqlQuery) – The query that should be adapted for the database system
Returns:: An equivalent notation of the query that incorporates system-specific deviations from standard SQL. Notice that this query possibly can no longer be parsed by the query abstraction layer. It is a one-way process.
Return type:: str

MySQL Backend

Contents

MySQL Backend#