SQL Parser

SQL Parser#

Module name: postbound.parser

The parser constructs SqlQuery objects from query strings.

Other than the parsing itself, the process will also execute a basic column binding process. For example, consider a query like SELECT * FROM R WHERE R.a = 42. In this case, the binding only affects the column reference R.a and sets the table of that column to R. This binding based on column and table names is always performed.

If the table cannot be inferred based on the column name (e.g. for a query like SELECT * FROM R, S WHERE a = 42), a second binding phase can be executed. This binding needs a working database connection and queries the database schema to detect the correct tables for each column. Whether the second phase should also be executed by default can be configured system-wide by setting the auto_bind_columns variable.

Notes

Please beware that SQL parsing is a very challenging undertaking and there might be bugs in some lesser-used features. If you encounter any issues, please report them on the GitHub issue tracker. We test the parser based on some popular benchmarks, namely JOB and Stats to ensure that result sets from the raw SQL queries match result sets from the parsed queries. However, we cannot guarantee that the parser will work for all SQL queries.

The parsing itself is based on the pglast project that implements a SQL -> JSON/dict conversion, based on the actual Postgres query parser. Our parser implementation takes such a JSON representation as input and generates the more verbose structures of the qal. There exists a Jupyter notebook called PglastParsingTests in the tests directory that shows the output emitted by pglast for different SQL query features.

References

postbound.parser.auto_bind_columns: bool = True#: Indicates whether the parser should use the database catalog to obtain column bindings.

exception postbound.parser.ParserWarning#: A warning that is raised during parsing, but does not prevent successful parsing.

exception postbound.parser.ParserError(msg: str)#

An error that is raised when parsing fails.

Parameters:: msg (str)
Return type:: None

class postbound.parser.SchemaCache(schema: DatabaseSchema | None = None)#

A simple cache that stores the columns that belong to tables in our database schema.

The cache only queries the actual catalog of the database system, if the requested table has not been cached, yet.

Parameters:: schema (Optional[DatabaseSchema]) – The schema to cache. If not provided, the cache cannot resolve column bindings.

initialize_with(schema: DatabaseSchema | None) → None#

Sets the catalog if necessary

Parameters:: schema (DatabaseSchema | None)
Return type:: None

lookup_column(colname: str, candidate_tables: Iterable[TableReference]) → TableReference | None#

Resolves the table that defines a specific column.

If no catalog is available, this method will always return None.

Returns:

The table that defines the column. If there are multiple tables that could define the column, an arbitrary one is returned. If none of the candidates is the correct table, None is returned.

Return type:

Optional[TableReference]

Parameters:

colname (str)
candidate_tables (Iterable[TableReference])

columns_of(table: str | TableReference) → list[str]#

Provides the columns that belong to a specific table.

If no catalog is available, this method will always return an empty list.

Parameters:: table (str | TableReference)
Return type:: list[str]

class postbound.parser.QueryNamespace(*, parent: QueryNamespace | None = None, top_level: bool = False)#

The query namespace acts as the central service to resolve column bindings in a query.

It maintains a visibility map of all tables at a given point in the query and keeps track of the columns that form the result relation at the same points in time. This information is used to bind column references to the correct tables, including temporary virtual tables that alias existing physical columns.

The namespace protocol works as follows:

While parsing a query, the table sources (CTEs and FROM entries) should be handled first. Each source should be registered in the namespace using the register_table method.
When a subquery or CTE is encoutered, the open_nested method has to be called to open a new local namespace and track the virtual table correctly.
Once all tables are registered, the parser can handle the SELECT clause. Afterwards, determine_output_shape has to called to compute all columns that are part of the result relation of the current namespace. This method takes care of resolving SELECT * operations as necessary and requires that all input sources have already been registered and completely parsed, such that their output shapes are known.
While parsing the different clauses of the query, lookup_column and resolve_table can be used to determine the correct table references based on the sources that are currently available in the namespace.

Each namespace can be connected to a parent namespace, which in turn can provide additional CTEs, physical tables or subqueries (if the current namespace is for a LATERAL subquery). This allows the current namespace to check whether some column is actually provided by an outer scope if the namespace does not provide the column itself.

Parameters:

parent (Optional[QueryNamespace])
top_level (bool)

determine_output_shape(select_clause: Select | Iterable[ColumnReference | str] | None) → None#

Determines the columns that form the result relation of this namespace.

The result is only stored internally to allow parent namespaces to resolve column references correctly.

This method should only be called after all table sources from the current namespace are already registered in order to ensure that star expressions can be resolved correctly.

Parameters:: select_clause (Select | Iterable[ColumnReference | str] | None)
Return type:: None

register_table(table: TableReference) → None#

Adds a “physical” table to the current namespace.

In truth, the table does not need to be physical, it can also be a CTE that was defined in an outer namespace and is scanned here. “Physical” in this context means that the current namespace does not define the table itself.

Parameters:: table (TableReference)
Return type:: None

provides_column(name: str) → bool#

Checks, whether the current namespace has a specific column in its output relation.

Parameters:: name (str)
Return type:: bool

lookup_column(key: str) → TableReference | None#

Searches for the table that provies a specific column.

This table can be either virtual, i.e. a subquery or CTE (possibly from an outer namespace), or an actual physical table from the current database.

If no table is found , None is returned.

Parameters:: key (str)
Return type:: TableReference | None

resolve_table(key: str) → TableReference | None#

Searches for the table that is referenced by a specific key.

The table can be either provided by this namespace (as a physical table in the FROM clause, or defined through a subquery/CTE), or by an outer namspace.

Parameters:: key (str)
Return type:: TableReference | None

open_nested(*, alias: str = '', source: Literal['cte', 'subquery', 'setop', 'values', 'temporary']) → QueryNamespace#

Creates a new local namespace for a nested query.

Depending on the type of nested query, the namespace will be registered in different ways and used for different purposes (see parameters below).

Parameters:

alias (str, optional) – The name of the namespace. This is only relevant for CTEs and subqueries in the FROM clause.
source (Literal["cte", "subquery", "setop", "values", "temporary"]) –
The type of nested query. This value is used to determine the use of the subquery namespace as follows:
- ”cte”: The namespace is a CTE that is part of the query.
- ”subquery”: The namespace is a subquery in the FROM clause.
- ”setop”: The namespace is part of a set operation. No alias is required, but the namespace might be used to determine the output shape of the current namespace
- ”values”: The namespace is a temporary table that is part of a VALUES clause.
- ”temporary”: The namespace is a temporary table that is part of a subquery which is not used in the FROM clause, e.g. as a filter condition.

Return type:

QueryNamespace

postbound.parser.parse_query(query: str, *, include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) → SqlQuery#

postbound.parser.parse_query(query: str, *, accept_set_query: Literal[True], include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) → SqlQuery | SetQuery

postbound.parser.parse_query(query: str, *, accept_set_query: Literal[False], include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) → SqlQuery

Parses a query string into a proper SqlQuery object.

During parsing, the appropriate type of SQL query (i.e. with implicit, explicit or mixed FROM clause) will be inferred automatically. Therefore, this method can potentially return a subclass of SqlQuery.

Once the query has been transformed, a text-based binding process is executed. During this process, the referenced tables are normalized such that column references using the table alias are linked to the correct tables that are specified in the FROM clause (see the module-level documentation for an example). The parsing process can optionally also involve a binding process based on the schema of a live database. This is important for all remaining columns where the text-based parsing was not possible, e.g. because the column was specified without a table alias.

Parameters:

query (str) – The query to parse
accept_set_query (bool, optional) – Whether set queries are a valid result of the parsing process. If this is False (the default), an error will be raised if the input query is a set query. This implies that the result of the parsing process is always a SqlQuery instance. Otherwise, the result can also be a SetQuery instance.
include_hints (bool, optional) – Whether to include hints in the parsed query. If this is True (the default), any preceding comments in the query text will be parsed as a hint block. Otherwise, these comments are simply ignored.
bind_columns (bool | None, optional) – Whether to use live binding. This does not control the text-based binding, which is always performed. If this parameter is None (the default), the global auto_bind_columns variable will be queried. Depending on its value, live binding will be performed or not.
db_schema (Optional[DatabaseSchema], optional) – For live binding, this indicates the database to use. If this is None (the default), the database will be tried to extract from the DatabasePool

Returns:

The parsed SQL query.

Return type:

SqlQuery

postbound.parser.load_query(path: str | Path, *, accept_set_query: bool = False, include_hints: bool = True, bind_columns: bool | None = None, db_schema: DatabaseSchema | None = None) → SqlQuery | SetQuery#

Loads and parses a single query from a file.

All parameters other than the query path are forwarded to the parse_query method.

SQL Parser

Contents

SQL Parser#