Query Abstraction#

The query abstraction is used to represent SQL queries in a unified way, and to make accessing different parts of them easier. This section walks you through the most important parts of the query abstraction. Everything query-related is contained in the query abstraction layer, or qal for short.

In the next sections, we are going to use the following example query:

In [1]: import postbound as pb

In [2]: raw_query = """
   ...:     SELECT u.Id, u.DisplayName, avg(p.Score)
   ...:     FROM Users u
   ...:         JOIN Posts p ON u.Id = p.OwnerUserId
   ...:     WHERE p.PostTypeId = 2
   ...:         AND p.AcceptedAnswerId > 0
   ...:     GROUP BY u.Id, u.DisplayName
   ...:     ORDER BY avg(p.Score) DESC
   ...: """
   ...: 

You can parse this query into a proper SqlQuery object using the parse_query() function:

In [3]: query = pb.parse_query(raw_query)

In [4]: print(pb.qal.format_quick(query))
SELECT u.id,
  u.displayname,
  AVG(p.score)
FROM users AS u
  JOIN posts AS p ON u.id = p.owneruserid
WHERE p.posttypeid = 2
  AND p.acceptedanswerid > 0
GROUP BY u.id, u.displayname
ORDER BY AVG(p.score) DESC;

Basic query structure#

PostBOUND uses a query abstraction that consists of three main components:

  1. The top-level SqlQuery object, which represents an entire, ready-to-run SQL query

  2. Each query consists of one or multiple clauses, such as SELECT, FROM, WHERE, etc. These are represented by subclasses of the abstract BaseClause.

  3. Clauses are composed of expressions. These can be simple predicates, function calls, or even subqueries (which in turn contain SqlQuery instances). Expressions are represented by subclasses of the abstract SqlExpression.

Use the ast() method to inspect the structure of a query. For example, our example query looks like this:

In [5]: print(query.ast())
+-ExplicitSqlQuery
  +-Select
    + ColumnExpression [u.id]
    + ColumnExpression [u.displayname]
    +-FunctionExpression [AVG]
      + ColumnExpression [p.score]
  +-ExplicitFromClause
    +-JoinTableSource
      +-DirectTableSource [users AS u]
      +-DirectTableSource [posts AS p]
  +-Where
    +-CompoundPredicate [AND]
      +-BinaryPredicate [=]
        + ColumnExpression [p.posttypeid]
        +-StaticValueExpression [2]
      +-BinaryPredicate [>]
        + ColumnExpression [p.acceptedanswerid]
        +-StaticValueExpression [0]
  +-GroupBy
    + ColumnExpression [u.id]
    + ColumnExpression [u.displayname]
  +-OrderBy
    +-FunctionExpression [AVG]
      + ColumnExpression [p.score]

The query structure is quite flexible and we try to model a large portion of the scope of SQL features with it. For example, we support (recursive) CTEs, window functions, CASE expressions, or even qualified star expressions (e.g., SELECT t.* FROM t).

Important

The query parser is based on the actual Postgres parser (thanks to pglast!). While this means that we have pretty good coverage of the SQL standard, it also means that we cannot parse queries that Postgres does not understand.

Working with queries#

An important design decision of our query abstraction is that all queries are immutable. This means that once created, a query cannot be changed anymore. Instead, you need to create a new query object that contains your desired changes. The transform module has a large suite of functions that make these updates much easier:

In [6]: pb.transform.as_count_star_query(query)
Out[6]: 
SELECT COUNT(*)
FROM users AS u
  JOIN posts AS p ON u.id = p.owneruserid
WHERE p.posttypeid = 2
  AND p.acceptedanswerid > 0
GROUP BY u.id, u.displayname
ORDER BY AVG(p.score) DESC;

In [7]: pb.transform.drop_clause(query, pb.qal.Where)
Out[7]: 
SELECT u.id,
  u.displayname,
  AVG(p.score)
FROM users AS u
  JOIN posts AS p ON u.id = p.owneruserid
GROUP BY u.id, u.displayname
ORDER BY AVG(p.score) DESC;

In [8]: pb.transform.add_clause(query, pb.qal.Limit(limit=10))
Out[8]: 
SELECT u.id,
  u.displayname,
  AVG(p.score)
FROM users AS u
  JOIN posts AS p ON u.id = p.owneruserid
WHERE p.posttypeid = 2
  AND p.acceptedanswerid > 0
GROUP BY u.id, u.displayname
ORDER BY AVG(p.score) DESC
FETCH FIRST 10 ROWS ONLY;

All of the qal building blocks provide a visitor-based interface that allows you to traverse the query structure in a consistent way. These are defined in the ClauseVisitor and SqlExpressionVisitor. There is also a dedicated PredicateVisitor, even though predicates are just a special case of expressions. You can make use of Python multiple inheritance to create single visitor class that traverses an entire query.

Tip

Many parts of PostBOUND that represent queries or query plans have a tables() method and a columns() method. These methods return sets of all tables and columns that are referenced by the object and any “children” of it.

Working with joins and filters#

A core part of query optimization tasks is to analyze which join conditions and filter predicates are present in the query. You can either analyze queries manually and traverse the Where clause. At the same time, the query abstraction also provides QueryPredicates for a more high-level access:

In [9]: query.from_clause
Out[9]: FROM users AS u JOIN posts AS p ON u.id = p.owneruserid

In [10]: query.where_clause
Out[10]: WHERE p.posttypeid = 2 AND p.acceptedanswerid > 0

In [11]: query.predicates()
Out[11]: p.posttypeid = 2 AND p.acceptedanswerid > 0 AND u.id = p.owneruserid

The query predicates can be used to directly retrieve predicates that are relevant for specific tables, e.g.,

In [12]: query.predicates().joins()
Out[12]: {u.id = p.owneruserid}

In [13]: query.predicates().joins_for(pb.TableReference("users", "u"))
Out[13]: [u.id = p.owneruserid]

In [14]: query.predicates().filters_for(pb.TableReference("posts", "p"))
Out[14]: p.acceptedanswerid > 0 AND p.posttypeid = 2

In [15]: query.predicates().joins_between(
   ....:     pb.TableReference("users", "u"),
   ....:     pb.TableReference("posts", "p")
   ....: )
   ....: 
Out[15]: u.id = p.owneruserid

Attention

When traversing the query manually, don’t forget to also check the From clause! It might also contain join conditions that where specified with the JOIN ... ON ... syntax, e.g., SELECT * FROM t1 JOIN t2 ON t1.id = t2.id.

The query abstraction uses a full-blown recursive structure to represent predicates. While this approach allows for a large expressivity, it makes extracting specific bits of information a bit cumbersome. For example, to get any TableReference from a join predicate, one would need to do something like the following:

In [16]: full_pred = pb.util.collections.get_any(query.predicates().joins())

In [17]: full_pred.join_partners()
Out[17]: 
{(ColumnReference(name='owneruserid', table=TableReference(full_name='posts', alias='p', virtual=False, schema='')),
  ColumnReference(name='id', table=TableReference(full_name='users', alias='u', virtual=False, schema='')))}

In [18]: single_pred = pb.util.simplify(full_pred.join_partners())

In [19]: single_pred
Out[19]: 
(ColumnReference(name='owneruserid', table=TableReference(full_name='posts', alias='p', virtual=False, schema='')),
 ColumnReference(name='id', table=TableReference(full_name='users', alias='u', virtual=False, schema='')))

In [20]: any_table = single_pred[0]

In [21]: any_table
Out[21]: ColumnReference(name='owneruserid', table=TableReference(full_name='posts', alias='p', virtual=False, schema=''))

This is because the query abstraction needs to handle cases of complex conjunctiontive or disjunctive predicates accross multiple tables such as R.a = S.b OR R.a = T.c. However, such complicated structures do not occur in the commonly used benchmarks.

To ease the development experience, PostBOUND also has a simplified version of query predicates for cases where the predicates follow simple structures:

  • the SimpleFilter can be used for filter predicates that roughly match the structure <column> <operator> <value>.

  • the SimpleJoin can be used for join predicates that are plain inner equi-joins, i.e. <column 1> = <column 2>.

Since these simplifications only apply to a subset of all possible predicates, you need to check whether a predicate is actually of a supported form before creating the simplified version. See the class documentations for more details. Once you have obtained a simplified predicate, its components can be accessed in a more straightforward way:

In [22]: simple_filters = pb.qal.SimpleFilter.wrap_all(query.predicates().filters())

In [23]: filter_pred = pb.util.collections.get_any(simple_filters)

In [24]: filter_pred.column
Out[24]: ColumnReference(name='acceptedanswerid', table=TableReference(full_name='posts', alias='p', virtual=False, schema=''))

In [25]: filter_pred.operation
Out[25]: <LogicalOperator.Greater: '>'>

In [26]: filter_pred.value
Out[26]: 0

In [27]: simple_joins = pb.qal.SimpleJoin.wrap_all(query.predicates().joins())

In [28]: join_pred = pb.util.simplify(simple_joins)

In [29]: join_pred.lhs, join_pred.rhs
Out[29]: 
(ColumnReference(name='id', table=TableReference(full_name='users', alias='u', virtual=False, schema='')),
 ColumnReference(name='owneruserid', table=TableReference(full_name='posts', alias='p', virtual=False, schema='')))

Compare this output to the listing of the full AST above.

Attention

QueryPredicates also has a convenience method simplify() that returns simplified version of all predicates that can actually be simplified. However, if some predicates are more complicated than the simplification can handle, these are silently dropped form the result. Never forget to check all_simple() first to be sure you don’t lose any important predicates!

Many query optimizers derive equivalence classes from the query predicates to detect more worthwhile joins that are not explicitly listed in the query. You can do the same (currently somewhat clunkily) by adding all predicates that can be derived from equivalence classes to the query. Use determine_join_equivalence_classes() and generate_predicates_for_equivalence_classes() or the shorthand transformation add_ec_predicates().

DML and DDL queries#

Sadly, the SqlQuery abstraction is currently limited to plain SELECT queries. Since large portions of the code base rely on this assumption, it is unlikely to change in the future. This also applies to queries with set operations such as UNION or INTERSECT. These are handled by a dedicated SetQuery. See the class documentations for more details on how to use them.

If, at some point in the future, PostBOUND has proper support for DML or DDL queries, these will properly be represented by separate query classes similar to the SetQuery. To make clear that API functions can work with queries beyond plain SELECT, we use the SqlStatement. If you only want SELECT queries but are fine with set operations, use SelectStatement.

Relational algebra#

PostBOUND also provides a simple version of relational algebra. Check out the postbound.relalg module for more details. In short, use parse_relalg() to convert an SqlQuery to an equivalent tree of RelNode instances.

Note

The relational algebra is currently not integrated with the optimization pipelines. Instead, you can use it internalliy within the different optimization stages when calculating the query plan.