
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Home Page: https://trino.io

License: Apache License 2.0

Languages: Java 99.48%, Shell 0.04%, Python 0.01%, HTML 0.03%, CSS 0.02%, JavaScript 0.32%, ANTLR 0.08%, Dockerfile 0.01%, Groovy 0.01%, PLSQL 0.01%, Scala 0.01%, Smarty 0.01%

Topics: java, presto, hive, hadoop, big-data, sql, prestodb, database, databases, distributed-systems

trino's Introduction

Trino Logo

Trino is a fast distributed SQL query engine for big data analytics.

See the User Manual for deployment instructions and end user documentation.

Trino download Trino Slack Trino: The Definitive Guide book download

Development

See DEVELOPMENT for information about code style, development process, and guidelines.

See CONTRIBUTING for contribution requirements.

Security

See the project security policy for information about reporting vulnerabilities.

Build requirements

  • Mac OS X or Linux
  • Java 21.0.1+, 64-bit
  • Docker
    • Turn off SELinux or other systems that disable write access to the local checkout, so that containers can mount parts of the Trino source tree

Building Trino

Trino is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install -DskipTests

On the first build, Maven downloads all the dependencies from the internet and caches them in the local repository (~/.m2/repository), which can take a while, depending on your connection speed. Subsequent builds are faster.

Trino has a comprehensive set of tests that take a considerable amount of time to run, and are thus disabled by the above command. These tests are run by the CI system when you submit a pull request. We recommend only running tests locally for the areas of code that you change.
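
For example, to run only the tests of a single module, you can use Maven's -pl option (the module path shown here is just an illustration):

./mvnw test -pl core/trino-main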

Running Trino in your IDE

Overview

After building Trino for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Trino is a standard Maven project, you can easily import it into your IDE. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that JDK 21 is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 21

Running a testing server

The simplest way to run Trino for development is to run the TpchQueryRunner class. It will start a development version of the server that is configured with the TPCH connector. You can then use the CLI to execute queries against this server. Many other connectors have their own *QueryRunner class that you can use when working on a specific connector.

Running the full server

Trino comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: io.trino.server.DevelopmentServer
  • VM Options: -ea -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties -Djdk.attach.allowAttachSelf=true
  • Working directory: $MODULE_DIR$
  • Use classpath of module: trino-server-dev

The working directory should be the trino-server-dev subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.

If the VM options field does not appear in the dialog, select Modify options and enable Add VM options.

Running the CLI

Start the CLI to connect to the server and run SQL queries:

client/trino-cli/target/trino-cli-*-executable.jar
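
By default the CLI connects to a server at localhost:8080. If your development server listens on a different address, pass it explicitly with the --server option (the address below is just an example):

client/trino-cli/target/trino-cli-*-executable.jar --server localhost:8080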

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

Run a query against the TPCH connector:

SELECT * FROM tpch.tiny.region;

trino's People

Contributors

alexjo2144, arhimondr, cberner, dain, ebyhr, electrum, erichwang, findepi, findinpath, haozhun, hashhar, homar, kasiafi, kokosing, losipiuk, maciejgrzybek, martint, mosabua, nezihyigitbasi, nileema, nineinchnick, pettyjamesm, phd3, pnowojski, praveen2112, raghavsethi, raunaqmorarka, sopel39, wendigo, wenleix


trino's Issues

Add support for case sensitive identifiers

<delimited identifier> ::=
  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::=  <delimited identifier part>...
<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
  U <ampersand> <double quote> <Unicode delimiter body> <double quote>
      <Unicode escape specifier>
<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode delimiter body> ::=
  <Unicode identifier part>...
<Unicode identifier part> ::=
    <delimited identifier part>
  | <Unicode escape value>
24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an <identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is transliterated into the corresponding character or characters of CNF as follows:
Case:
   a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence U is defined by Unicode, then let j be the number of characters in U; the next j characters of CNF are U.
   b) Otherwise, the next character of CNF is Mi.
25) The case-normal form of the <identifier body> of a <regular identifier> is used for purposes such as and including determination of identifier equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.

...

27) Two <regular identifier>s are equivalent if the case-normal forms of their <identifier body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation IDC that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the <delimited identifier body> of the <delimited identifier> (with all occurrences of <quote> replaced by <quote symbol> and all occurrences of <doublequote symbol> replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and IDC, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

29) Two <delimited identifier>s are equivalent if their <delimited identifier body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

30) Two <Unicode delimited identifier>s are equivalent if their <Unicode delimiter body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

31) A <Unicode delimited identifier> and a <delimited identifier> are equivalent if their <Unicode delimiter body> and <delimited identifier body>, respectively, each considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

32) A <regular identifier> and a <Unicode delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the <Unicode delimiter body> of the <Unicode delimited identifier>, considered as the repetition of a <character string literal>, each specifying a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

The approach and design is being captured here: https://github.com/prestosql/presto/wiki/Delimited-Identifiers
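
As a quick illustration of the rules quoted above (table and column names are hypothetical): a regular identifier is compared via its case-normal form, while a delimited identifier is matched exactly:

SELECT orderkey FROM orders;    -- regular identifier: case-normalized, so orderkey, OrderKey, and ORDERKEY refer to the same column
SELECT "OrderKey" FROM orders;  -- delimited identifier: matches only a column whose name is exactly OrderKey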

Remove DISTINCT inside IN subquery expression

Add an optimization rule to remove unnecessary DISTINCT in this scenario:

SELECT ...
FROM t
WHERE c IN (SELECT DISTINCT ... FROM u)

Since the semi join already performs deduplication of the values in the subquery, that operation is unnecessary and results in an extra exchange and aggregation.

presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select custkey from tpch.tiny.customer);
                                                            Query Plan
-----------------------------------------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
             custkey_1 := tpch:custkey
presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select distinct custkey from tpch.tiny.customer);
                                           Query Plan
-------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [HASH]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - Aggregate(FINAL)[custkey_1] => [custkey_1:bigint]
         - LocalExchange[HASH] ("custkey_1") => custkey_1:bigint
             - RemoteSource[4] => [custkey_1:bigint]

 Fragment 4 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH [custkey_1]
     - Aggregate(PARTIAL)[custkey_1] => [custkey_1:bigint]
         - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
                 custkey_1 := tpch:custkey

Improve encoding of grouping sets in plan

Currently, the list of concrete grouping sets is expanded during analysis. This results in inefficiencies during analysis and planning due to redundant re-analysis and storage of the atoms of each grouping element: the analyzer allocates 2^#columns sets of expressions when analyzing the query, then analyzes each of those expressions and creates a GroupId node with that many entries.

Instead, the engine should only analyze the atomic units (columns of each grouping element) and should encode the expected grouping sets in the plan using an abstract description of the components instead of enumerating every combination of grouping columns. The enumeration should happen when the operator is instantiated.

I.e., the GroupId node should contain a list of grouping elements:

Elements = (CUBE set<symbol> | ROLLUP list<symbol> | SETS list<set<symbol>>)+
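
For example, for a query such as the following (hypothetical table t), the plan would only record the abstract element CUBE {a, b} rather than enumerating the 2^2 = 4 concrete grouping sets (), (a), (b), (a, b):

SELECT a, b, count(*)
FROM t
GROUP BY CUBE (a, b)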

Exponential processing time and memory usage in PredicatePushDown

Queries of this shape consume exponential amounts of memory and CPU:

WITH
t1 (v) AS (VALUES 1),
t2 AS( select if(v = 0, v, v) v from t1 ),
t3 AS( select if(v = 0, v, v) v from t2 ),
t4 AS( select if(v = 0, v, v) v from t3 ),
t5 AS( select if(v = 0, v, v) v from t4 ),
t6 AS( select if(v = 0, v, v) v from t5 ),
t7 AS( select if(v = 0, v, v) v from t6 ),
t8 AS( select if(v = 0, v, v) v from t7 ),
t9 AS( select if(v = 0, v, v) v from t8 ),
t10 AS( select if(v = 0, v, v) v from t9 ),
t11 AS( select if(v = 0, v, v) v from t10 ),
t12 AS( select if(v = 0, v, v) v from t11 ),
t13 AS( select if(v = 0, v, v) v from t12 ),
t14 AS( select if(v = 0, v, v) v from t13 ),
t15 AS( select if(v = 0, v, v) v from t14 ),
t16 AS( select if(v = 0, v, v) v from t15 )
select *
from t16
where v = 0

One possible short-term fix is to adjust the inlining heuristics to only inline if the expressions are trivial or appear only once (similar to how the InlineProjections rule works).

The longer-term fix is to move PredicatePushdown to iterative optimizer rules that simply avoid unproductive pushdown actions like the one caused by this query.

Query fails when there is column type mismatch between partition schema and table schema

Presto fails to read Parquet partitions if the decimal datatype doesn't match what is in the Hive metastore. Here is the error:

Query 20190130_224317_00018_w9d29 failed: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'sbnum' in table 'default.presto_test' is declared as type 'decimal(8,0)', but partition 'month=201812' declared column 'sbnum' as type 'decimal(6,0)'.
com.facebook.presto.spi.PrestoException: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'sbnum' in table 'default.presto_test' is declared as type 'decimal(8,0)', but partition 'month=201812' declared column 'sbnum' as type 'decimal(6,0)'.
                at com.facebook.presto.hive.HiveSplitManager.lambda$getPartitionMetadata$2(HiveSplitManager.java:315)
                at com.google.common.collect.Iterators$6.transform(Iterators.java:788)
                at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
                at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
                at com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1340)
                at com.facebook.presto.hive.ConcurrentLazyQueue.poll(ConcurrentLazyQueue.java:37)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:252)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:91)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:185)
                at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
                at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
                at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
                at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)

Here are the steps to reproduce this issue -

  1. Create partitioned table -
CREATE external TABLE `presto_test`(
  `sbnum` decimal(6,0),
  `abnum` decimal(8,0)
)
partitioned by (
month int
)
STORED AS PARQUET
LOCATION
  '/hive/presto_test'
;

2. Insert data into this table, creating a few partitions -

insert overwrite table presto_test partition (month=201801)
select sbnum, abnum from <tabname> limit 10
;

insert overwrite table presto_test partition (month=201802)
select sbnum, abnum from <tabname> limit 10
;

3. Access the table from Presto to ensure it works -

select count(1) from presto_test ;

4. Alter the table, changing the data type -

alter table presto_test change sbnum sbnum decimal (8,0) ;

5. Access the table again from Presto -

select count(1) from presto_test ; (should throw an error)

6. Insert one more partition -

insert overwrite table presto_test partition (month=201803)
select sbnum, abnum from <tabname> limit 10
;

7. Access the table from Presto again -

select count(1) from presto_test where month = 201802 ; (should work)

Function and type namespaces

Goals

  • Allow for coexistence of functions and types with similar names but different semantics (e.g., compatibility layer for other databases like Teradata, Oracle, MySQL, Vertica, PostgreSQL, etc)
  • Allow for tighter control of what functions are available in a given deployment
  • Make it possible to implement user-defined functions and types in the future

Features

  • Support for CURRENT_PATH, SET PATH
  • Configurable/mountable function and type plugins

Considerations and open questions

  • Views need to record CURRENT_PATH in definition
  • Need to define migration strategy for re-packaging existing function library
  • Functions should be resolved to a handle during analysis, just like tables.
  • Function resolution should be transaction aware.
  • Function metadata can be fetched using a handle on the coordinator.
  • Function implementations can be retrieved using the handle on the coordinator and workers.
  • Optimizers can look up functions using a fully specified signature.
  • Function handles should be transported to workers without requiring a full rewrite of the plan IR.
  • The SPI interface should be easy to implement.

TODOs

  • Introduce proper function abstractions to engine (FunctionManager, FunctionHandle, etc). #196
  • Simplify function resolution
    • Extract resolution logic from FunctionRegistry.
    • Resolve functions using simpler metadata interface.
    • Move resolution to be inside analyzer (like table resolution)
  • Simplify function instance usage
    • Move call convention resolution to be inside of FunctionManager
    • Callers should really only need to deal with conventions and method handles
  • Finish SQL PATH implementation
    • Move existing function registration to system catalog
    • Add namespace abstraction
    • Update function resolution to understand path semantics
  • Update connector SPI to support function resolution
  • Improve function handle transport to workers

References

4.34 SQL-paths

An SQL-path is a list of one or more <schema name>s that determines the search order for one of the following:
— The subject routine of a <routine invocation> whose <routine name> does not contain a <schema name>.
— The user-defined type when the <path-resolved user-defined type name> does not contain a <schema name>.

4.43.3 SQL-session properties

An SQL-session has an SQL-path that is used to effectively qualify unqualified <routine name>s [...]
The SQL-path is initially set to an implementation-defined value, but can subsequently be changed by the successful execution of a <set path statement>.
The text defining the SQL-path can be referenced by using the <general value specification> CURRENT_PATH.

6.4 <value specification> and <target specification>

  1. The declared type of CURRENT_USER, CURRENT_ROLE, SESSION_USER, SYSTEM_USER, CURRENT_CATALOG, CURRENT_SCHEMA, and CURRENT_PATH is character string. Whether the character string is fixed-length or variable-length, and its length if it is fixed-length or maximum length if it is variable-length, are implementation-defined. The character set of the character string is SQL_IDENTIFIER. The declared type collation is the character set collation of SQL_IDENTIFIER, and the collation derivation is implicit.
  2. The value specified by CURRENT_PATH is a <schema name list> where <catalog name>s are <delimited identifier>s and the <unqualified schema name>s are <delimited identifier>s. Each <schema name> is separated from the preceding <schema name> by a <comma> with no intervening <space>s. The schemas referenced in this <schema name list> are those referenced in the SQL-path of the current SQL-session context, in the order in which they appear in that SQL-path.

10.3 <path specification>

Function

Specify an order for searching for an SQL-invoked routine.

Format

<path specification> ::= PATH <schema name list>
<schema name list> ::= <schema name> [ { <comma> <schema name> }... ]
<schema name> ::= [ <catalog name> <period> ] <unqualified schema name>
<unqualified schema name> ::= <identifier>
<catalog name> ::= <identifier>

Syntax Rules

  1. No two <schema name>s contained in <schema name list> shall be equivalent.

19.8 <set path statement>

Function

Set the SQL-path used to determine the subject routine of <routine invocation>s with unqualified <routine name>s in <preparable statement>s that are prepared in the current SQL-session by an <execute immediate statement> or a <prepare statement> and in <direct SQL statement>s that are invoked directly. The SQL-path remains the current SQL-path of the SQL-session until another SQL-path is successfully set.

Format

<set path statement> ::= SET <SQL-path characteristic>
<SQL-path characteristic> ::= PATH <value specification>

Syntax Rules

  1. The declared type of the <value specification> shall be a character string type.

Access Rules

None.

General Rules

  1. Let S be <value specification> and let V be the character string that is the value of TRIM ( BOTH ' ' FROM S )
    a) If V does not conform to the Format and Syntax Rules of a <schema name list>, then an exception condition is raised: invalid schema name list specification.
    b) The SQL-path of the current SQL-session is set to V.
    NOTE 724 — A <set path statement> that is executed between a <prepare statement> and an <execute statement> has no effect on the prepared statement.

Conformance Rules

  1. Without Feature S071, “SQL paths in function and type name resolution”, Conforming SQL language shall not contain a <set path statement>.
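
A sketch of how these statements might be used, with hypothetical catalog and schema names (the exact engine syntax may differ from the spec's character-string form):

SET PATH example_catalog.schema_one, other_catalog.schema_two;
SELECT CURRENT_PATH;
-- unqualified function and type names are then resolved by searching example_catalog.schema_one first, then other_catalog.schema_two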

Should information_schema.columns.COMMENT be renamed to COLUMN_COMMENT?

MySQL's information_schema.columns table exposes a column_comment column, while Presto exposes a comment column; see https://dev.mysql.com/doc/refman/5.7/en/columns-table.html

I couldn't find this column in ANSI SQL, nor in Oracle, Teradata, MSSQL.

It looks like Hue depends on this column: https://github.com/cloudera/hue/blob/master/desktop/libs/notebook/src/notebook/connectors/jdbc.py#L213

This column was added to Presto about 5 years ago (with prestodb/presto#1214). In the original PR, I do not see any discussion about compliance with existing database systems or ANSI SQL.
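
For reference, a query against the current column might look like this (schema and table names are hypothetical); under the proposal, the comment column would instead be exposed as column_comment:

SELECT column_name, comment
FROM information_schema.columns
WHERE table_schema = 'default' AND table_name = 'orders';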

Adding a short-term solution for pushing down partial aggregation into the connector

Greetings,

Following #18, we have created a simple partial aggregation push-down prototype by creating a new PlanOptimizer. It was added at the end of the optimizers' list (here), matching the partial AggregationNode and passing down the aggregations (specified by their FunctionCall qualified name and the relevant column handles) into the ConnectorTableLayoutHandle of the underlying ScanNode (as is currently done for predicate pushdown by the PickLayouts optimizer).

Currently, we keep the original partial aggregation node in the query plan, since we don't support all aggregations in our connector and would like to fall back to the Presto implementation for unsupported aggregations.
In order to keep the implementation generic, we have added a new API to the Metadata interface that lets the new optimizer optionally add aggregations to an existing TableLayoutHandle.

We would like to ask whether our approach makes sense.
If so, we would like to help the effort of adding this feature (a partial aggregation pushdown API) to mainline Presto.

Also, we are not sure whether we are adding the new optimizer at the correct stage of PlanOptimizers. Maybe we should make it part of PickLayouts?

Incorrect FILTER clause evaluation order

According to the SQL specification, the FILTER clause for an aggregation query should be applied to the rows in the group before the arguments to the aggregation are evaluated.

Currently, expressions in the aggregation arguments are evaluated before the filter, leading to queries failing when they shouldn't.

For example:

select 
    sum(1 / a) filter (where a <> 0), 
    sum(a) filter (where true)
from (values (1), (0)) t(a)

fails with "/ by zero" error.

This is the plan it produces:

 Fragment 0 [SINGLE]
     Output layout: [sum, sum_3]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[_col0, _col1]
     │   Layout: [sum:bigint, sum_3:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   _col0 := sum
     │   _col1 := sum_3
     └─ Aggregate(FINAL)
        │   Layout: [sum_3:bigint, sum:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   sum_3 := sum("sum_7")
        │   sum := sum("sum_6")
        └─ LocalExchange[SINGLE] ()
           │   Layout: [sum_7:row(bigint, boolean, bigint, boolean), sum_6:row(bigint, boolean, bigint, boolean)]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ Aggregate(PARTIAL)
              │   Layout: [sum_7:row(bigint, boolean, bigint, boolean), sum_6:row(bigint, boolean, bigint, boolean)]
              │   sum_7 := sum("a") (mask = expr_1)
              │   sum_6 := sum("expr_2") (mask = expr_0)
              └─ Project[]
                 │   Layout: [expr_2:bigint, a:bigint, expr_1:boolean, expr_0:boolean]
                 │   Estimates: {rows: 2 (44B), cpu: 54, memory: 0B, network: 0B}
                 │   expr_2 := CAST((1 / "field") AS bigint)
                 │   a := CAST("field" AS bigint)
                 │   expr_1 := true
                 │   expr_0 := ("field" <> 0)
                 └─ LocalExchange[ROUND_ROBIN] ()
                    │   Layout: [field:integer]
                    │   Estimates: {rows: 2 (10B), cpu: 10, memory: 0B, network: 0B}
                    └─ Values
                           Layout: [field:integer]
                           Estimates: {rows: 2 (10B), cpu: 0, memory: 0B, network: 0B}
                           (1)
                           (0)

References:

10.9 <aggregate function>

d) If AF is immediately contained in a <set function specification> SFS, then a group of a grouped table of the aggregation query of SFS.

  1. Let T be the argument source of AF.
  2. Case:
    a) If <filter clause> is specified, then the <search condition> is effectively evaluated for each row of T. Let T1 be the collection of rows of T for which the result of the <search condition> is True.
    b) Otherwise, let T1 be T.
  3. If <general set function> is specified, then:
    a) Let TX be the single-column table that is the result of applying the <value expression> to each row of T1 and eliminating null values

Cannot create empty Hive partition for non-bucketed table

The system.create_empty_partition procedure fails because the source directory does not exist (no files were written) and thus the rename fails. SemiTransactionalHiveMetastore.Committer#prepareAddPartition should probably unconditionally create currentPath before checking if it is equal to targetPath.
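
For reference, the failing procedure call looks roughly like this (schema, table, and partition values are hypothetical):

CALL system.create_empty_partition(
    schema_name => 'default',
    table_name => 'presto_test',
    partition_columns => ARRAY['month'],
    partition_values => ARRAY['201901']);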

Optimize union all of similar aggregations

Queries of the following shape:

SELECT 
   a1, ..., an,
   agg1(...),
   agg2(...)
FROM t
WHERE f1(...)

UNION ALL

SELECT 
   b1, ..., bn,
   agg3(...),
   agg4(...)
FROM t
WHERE f2(...)

...

(a1, ..., an, b1, ..., bn are constants)

can be optimized to process all the data in a single pass to reduce the number of stages in the query execution.

One possible rewrite could be:

SELECT * 
FROM UNNEST( (
    SELECT ARRAY[
        ROW(a1, ..., an, agg1(...) FILTER (WHERE f1(...)), agg2(...) FILTER (WHERE f1(...))),
        ROW(b1, ..., bn, agg3(...) FILTER (WHERE f2(...)), agg4(...) FILTER (WHERE f2(...))),
        ...]
    FROM t) ) u

Warn when CAST of varchar(x) -> varchar(y) results in truncation

Per the spec (6.13 General Rules 11.c):

11) If TD is variable-length character string or large object character string, then let MLTD be the maximum length in characters of TD.

c) If SD is fixed-length character string, variable-length character string, or large object character string, then

ii) If the length in characters of SV is larger than LTD, then TV is the first LTD characters of SV. If any of the remaining characters of SV are non-<space> characters, then a completion condition is raised: warning — string data, right truncation.
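
For example, a cast like the following truncates silently today, whereas per the rule above it should raise the "string data, right truncation" warning (the comment shows the expected truncated value):

SELECT CAST('truncated' AS varchar(5));
-- result: 'trunc'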

PrestoS3FileSystem#rename does not preserve SSE properties

Extracted from: prestodb/presto#9916

PrestoS3FileSystem#rename should preserve SSE properties while copying files. Problematic line is: PrestoS3FileSystem.java:407: s3.copyObject. According to S3 Javadoc:

     * By default, all object metadata for the source object except
     * <b>server-side-encryption</b>, <b>storage-class</b> and
     * <b>website-redirect-location</b> are copied to the new destination
     * object, unless new object metadata in the specified
     * {@link CopyObjectRequest} is provided.

On the other hand, maybe PrestoS3FileSystem#rename should throw UnsupportedOperationException? It seems that it shouldn't be used at all because for S3 there should be no temporary objects when inserting into table (see: HiveWriteUtils#isS3FileSystem).

EXPLAIN CREATE TABLE AS SELECT creates a table, should be side-effect free

presto> use memory.default;
USE
presto:default> explain create table x as select 1 c;
                                                                                                                               Query Plan
-------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[rows] => [rows:bigint]
     - TableCommit[memory:INSTANCE:MemoryOutputTableHandle{table=MemoryTableHandle{connectorId=memory, schemaName=default, tableName=x, tableId=0, columnHandles=...
         - LocalExchange[SINGLE] () => partialrows:bigint, fragment:varbinary
             - RemoteExchange[GATHER] => partialrows:bigint, fragment:varbinary
                 - TableWriter => [partialrows:bigint, fragment:varbinary]
                         c := expr
                     ..................

[...]

presto:default> explain create table x as select 1 c;
Query 20180205_101139_00002_ijv9f failed: line 1:9: Destination table 'memory.default.x' already exists

select * from x;
 c
---
(0 rows)

EXPLAIN renders the query plan without actually running the query, so EXPLAIN CREATE TABLE ... AS SELECT ... should do just that.
Currently, it creates the table (but doesn't populate it).

Ref prestodb/presto#9896

Release notes for 301

Dain Sundstrom

  • all checked
  • 01-22 16:59:11 14db5b5 Add trace token to query events
  • 01-22 18:54:06 17112e0 Update node list on workers
  • 01-22 18:54:06 33c0fc0 Stop DiscoveryNodeManager in tests
  • 01-22 18:54:06 86d780a Only fire node change event on change
  • 01-22 18:54:06 b950d0f Add test for DiscoveryNodeManager listener

David Phillips

  • all checked
  • 01-23 00:10:15 33e41f3 Remove old Maven extensions
  • 01-23 00:10:15 7ef63e6 Use ValidationAssertions in TestPowerOfTwoValidator
  • 01-23 00:10:15 a5a7722 Update Sphinx plugin to 2.1
  • 01-23 00:10:15 accd094 Update to Airbase 90
  • 01-23 00:10:15 e261486 Fix undeclared dependency used for HTTP constants
  • 01-23 00:10:15 f0e4aa6 Fix deprecated usage of catchingAsync
  • 01-23 00:10:48 eb4f141 Use Xenial environment on Travis
  • 01-25 00:02:23 8456488 Move CLA bot config to clabot-config repo
  • 01-25 15:24:53 c408a78 Add current user security mode for views
  • 01-25 22:25:54 238f4b6 Update to Hive 2.3.4
  • 01-25 22:25:54 486a8f1 Fix value comparisons in TestHiveFileFormats
  • 01-25 22:25:54 507307e Remove unused Parquet classes
  • 01-25 22:39:27 95122b5 Improve error message for unsupported reduce_agg state type
  • 01-26 17:21:03 435fe92 Remove usages of TypeSignature from client
  • 01-26 17:21:03 904d511 Replace PrestoWarning with new client Warning
  • 01-26 17:21:03 950a56c Remove SPI as dependency for JDBC driver
  • 01-26 17:21:03 cbbf251 Convert JDBC driver to ClientTypeSignature
  • 01-26 17:21:03 d5dfb31 Convert JDBC driver from TimeZoneKey to ZoneId

Grzegorz Kokosiński

  • all checked
  • 01-25 00:04:33 b9a1a7e Limit access to ParametricAggregation fields
  • 01-28 03:22:46 1f284a1 Prefer SchemaTablePrefix#getSchema to create optional schema name
  • 01-28 03:22:46 5b63c49 Use Optional for nullable fields in SchemaTablePrefix
  • 01-28 03:22:46 73cc391 Prefer SchemaTablePrefix#toSchemaTableName
  • 01-28 05:03:23 efe8ca0 Add suppressed exceptions to exception that is thrown
  • 01-28 05:04:01 311abbb Extend JdbcSplit with additional predicate field
  • 01-28 05:10:15 9825bc8 Configure defaults for coordinator in DistributedQueryRunner in one place
  • 01-29 01:52:32 5368334 Fix formatting
  • 01-29 01:52:33 d13fbf9 Check if Presto is running before erasing RPM
  • 01-29 01:55:58 53ed213 Log test query durarion only if it run longer than 1s
  • 01-29 01:55:58 96cee95 Log Presto node role when creating distributed testing cluster
  • 01-29 01:55:58 bee9579 Log testing Presto server URL when it starts
  • 01-29 01:55:58 e77d698 Do not log test query uri
  • 01-29 07:20:30 120c28f Allow JDBC connector to define procedure
  • 01-30 01:55:54 fa65182 Add support for getTableStatistics to base-jdbc connector
  • 01-30 01:58:54 68393b0 Make use of hive temp staging directory configurable
  • 01-30 01:58:54 9227286 Make hive temporary staging directory location configurable

Karol Sobczak

  • all checked
  • 01-25 01:47:23 9305d99 Add write_hive_metastore_recording procedure
  • 01-25 05:21:20 47ba96e Add missing withGeneratedName to RecordingHiveMetastore exporter
  • 01-28 01:43:00 eb7a512 Add metadata recorder support to Glue
  • 01-29 02:52:01 eb7f677 Make dummy HiveEventClient use DEBUG log instead of INFO
  • 01-29 04:25:50 0481f18 Product tests for SHOW ROLES
  • 01-29 04:25:50 0546433 Add docs for SHOW ROLES
  • 01-29 04:25:50 07c5496 Introduce .information_schema.roles table
  • 01-29 04:25:50 0ba3120 Add grantor to HivePrivilegeInfo
  • 01-29 04:25:50 0beeb77 Add grantor_type and grantee_type columns to table_privileges
  • 01-29 04:25:50 0c2fa44 Make ThriftMetastoreUtil.list*Roles methods to return Stream
  • 01-29 04:25:50 0c3406c Add SHOW ROLE GRANTS syntax
  • 01-29 04:25:50 0f2e2e3 Move parsePrivilege to MetastoreUtil
  • 01-29 04:25:50 18603de Reorder methods in HivePrivilegeInfo
  • 01-29 04:25:50 1d11c06 Make list*TablePrivileges to return Stream
  • 01-29 04:25:50 20cf120 Fix listing privileges when admin role is set
  • 01-29 04:25:50 21ceed6 Catalog access control for roles
  • 01-29 04:25:50 23e3f31 Add Grant/Revoke/List roles authorization to the SPI
  • 01-29 04:25:50 2438df8 Assign admin role to subset of users in FileHiveMetastore
  • 01-29 04:25:50 266fa20 Implement listRoleGrants() in Hive
  • 01-29 04:25:50 26b421a Remove redundant checkDatabasePermission methods
  • 01-29 04:25:50 31e9d26 Reorder methods in SqlStandardAccessControl
  • 01-29 04:25:50 34ffbaa Accept ROLE in GRANT/REVOKE Privileges statements
  • 01-29 04:25:50 3dca463 Implement Grant/Revoke/ListApplicableRoles in Hive
  • 01-29 04:25:50 42ba503 Introduce CREATE ROLE and DROP ROLE statements
  • 01-29 04:25:50 4b90148 Add listRoleGrants to the SPI
  • 01-29 04:25:50 53b13c9 Refactor canCreateView security checks
  • 01-29 04:25:50 5d9615c Add SHOW CURRENT ROLES
  • 01-29 04:25:50 5db2f71 Remove unused metastore statistics beans
  • 01-29 04:25:50 5f9c74d Introduce GRANT/REVOKE roles statements
  • 01-29 04:25:50 62ed7d2 Introduce access control for GRANT/REVOKE ROLE
  • 01-29 04:25:50 671d653 Introduce ENABLED_ROLES view
  • 01-29 04:25:50 6910a27 Enumerate roles until enabled role is found
  • 01-29 04:25:50 709de9d Simplify checkTablePermission
  • 01-29 04:25:50 729f646 Fix rewrite SHOW GRANTS as a SELECT query
  • 01-29 04:25:50 7427e3a Extract role name constants
  • 01-29 04:25:50 745e4b8 Expose Create/Drop/List roles methods in SPI
  • 01-29 04:25:50 747c276 Introduce APPLICABLE_ROLES view
  • 01-29 04:25:50 776b453 Move PrincipalType to presto-spi
  • 01-29 04:25:50 8234650 Store catalog selected roles in Identity
  • 01-29 04:25:50 8464cbc Access control for SHOW ROLE GRANTS and SHOW CURRENT ROLES
  • 01-29 04:25:50 87063ee More product tests for SET ROLE
  • 01-29 04:25:50 91b38e4 Implement Create/Drop/List roles in Hive connector
  • 01-29 04:25:50 942cde9 Rename getGrantOptionForPrivilege to hasGrantOptionForPrivilege
  • 01-29 04:25:50 958edea Rewrite SHOW ROLES as a select query
  • 01-29 04:25:50 a20e9af Introduce isTableOwner method for readability
  • 01-29 04:25:50 aecd3b2 Set admin role before setting system properties in storage format test
  • 01-29 04:25:50 bbc0233 Make HiveQueryRunner default user have admin role
  • 01-29 04:25:50 be8189c Implement SET ROLE in Hive Connector
  • 01-29 04:25:50 c603474 Consider enabled roles for permissions
  • 01-29 04:25:50 c8d8733 Implement SHOW ROLE GRANTS rewrite
  • 01-29 04:25:50 cae5a86 Add access control checks for SHOW ROLES
  • 01-29 04:25:50 cb2da7b Refactor GRANT/REVOKE in Hive
  • 01-29 04:25:50 ce26168 Add docs for SHOW ROLE GRANTS
  • 01-29 04:25:50 d455185 Remove hive privilege null check
  • 01-29 04:25:50 dd9b64c Introduce SET ROLE statement
  • 01-29 04:25:50 de9890c Refactor HivePrivilegeInfo
  • 01-29 04:25:50 e5f2784 Allow all for admin role
  • 01-29 04:25:50 e8f81d9 Prepare metastore interface to accept ROLE for GRANT/REVOKE
  • 01-29 04:25:50 eadfb43 Remove 'IN catalog' syntax from role management commands
  • 01-29 04:25:50 f147d32 Add SHOW ROLES to the parser
  • 01-29 04:25:50 f5fe03d Document role management
  • 01-29 04:25:50 f5fe24d Introduce ConnectorIdentity
  • 01-29 04:25:50 f958f81 Refactor RecordingHiveMetastore to make it consistent between releases
  • 01-29 04:25:50 fb230fb Implement SET ROLE
  • 01-29 04:25:50 fb93b6e Fix listing privileges of tables owned by an user
  • 01-29 11:41:03 00aa33b Fix creating non-bucketed empty partition
  • 01-29 11:41:03 82574c8 ANALYZE statement: Parser
  • 01-29 11:41:03 82bc87f Add error message on column count mismatch when making partition name
  • 01-29 11:41:03 b747ecb ANALYZE statement: Implement ANALYZE in Hive connector
  • 01-29 11:41:03 be61588 Validate no partition value is null when making partition name
  • 01-29 11:41:03 df3f4df ANALYZE statement: Analyzer, planner and execution
  • 01-29 11:41:03 ecd0c0f Move getPartitionLocation to MetastoreUtil
  • 01-29 11:41:03 f800115 Fix partition column range statistics for empty partitions
  • 01-29 11:50:23 0c7d4ed Reduce parquet logging in Hive tests
  • 01-29 11:51:10 23f8793 Use processedInput* metrics in PipelineContext#getInput*

Martin Traverso

  • all checked
  • 01-22 15:20:16 bb0fd12 Update travis slack notification token
  • 01-25 14:17:15 d04152e Remove deprecated.legacy-log-function feature toggle
  • 01-25 14:17:15 f6044e6 Add standard log function
  • 01-25 14:27:45 ac02f3d Simplify coalesce expressions
  • 01-25 14:27:45 cf17d08 Flatten coalesce expressions
  • 01-25 19:31:27 a63829d Remove legacy array_agg semantics
  • 01-27 12:58:31 7c1250a Enable fusion of operators for queries with SYSTEM sampling

Piotr Findeisen

  • all checked
  • 01-24 12:52:31 0495b79 Add TYPE_NAME to JdbcTypeHandle
  • 01-24 12:52:44 a13d5d7 Remove unsupported types on write path in JDBC connectors
  • 01-24 12:52:44 b6faf92 Remove unsupported types on read path in JDBC connectors
  • 01-24 12:54:40 1455bcf Add section explaining legacy and new timestamp
  • 01-26 09:27:49 23e709f Fix port allocation race condition in EmbeddedZookeeper
  • 01-26 09:27:49 6d2cbce Use KafkaServer directly
  • 01-26 09:27:49 a881632 Code cleanup
  • 01-26 09:27:49 b0f1795 Update Kafka client to 0.10.2.2
  • 01-26 09:27:49 b410577 Fix port allocation race condition in EmbeddedKafka
  • 01-29 02:08:04 3fef5b6 Report top memory consumers when local memory limit is exceeded
  • 01-29 02:49:52 6dada93 Update to docker images version 11
  • 01-29 02:49:52 a1c4802 Support LDAP with anonymous bind disabled

Dynamic filtering

fact_table - a large fact table, partitioned on date_key
d_date - a small dimension table, with significant correlation between its columns and date_key.

Currently for the following query:

SELECT 
    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE 
    b.year = 2017;

the larger probe table is fully scanned. After scanning the build table (the right side), Presto could extract the information that for b.year = 2017 there is only a small number of matching b.date_key values. This information could be used to narrow down the table scan. In other words, the above query could be dynamically "rewritten" to the equivalent of:

SELECT 
    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE a.date_key IN (20170101,20170102, etc...)

Design doc: https://docs.google.com/document/d/1TOlxS8ZAXSIHR5ftHbPsgUkuUA-ky-odwmPdZARJrUQ/edit

TIMESTAMP behaviour does not match sql standard

Problem description

See below.

Roadmap

(Note: since GH doesn't send updates for edits, when adding a bullet please be sure to add a comment too.)

Original problem description

It seems like the meaning of the TIMESTAMP and TIMESTAMP WITH TIMEZONE datatypes in Presto is not at all what is specified by the SQL standard (and what other databases do).

This is my understanding of SQL 2003 standard (4.6.2 Datetimes):

TIMESTAMP WITH TIMEZONE represents an absolute point in time. Typically, databases store it internally as seconds since the epoch in some fixed timezone (usually UTC). When querying TIMESTAMP WITH TIMEZONE data, the values are presented to the user in the session timezone (the session timezone is used just for presentation purposes).

TIMESTAMP does not represent a specific point in time, but rather a reading of a wall clock and calendar. Selecting values from a TIMESTAMP column should return the same result set regardless of the client's session timezone.

Presto's semantics, however, are different:

TIMESTAMP seems to do what TIMESTAMP WITH TIMEZONE should.

TIMESTAMP WITH TIMEZONE encodes explicit timezone information into each value stored in a table. The SQL standard does not define a type like that, and it does not seem very practical. Assuming that values selected from TIMESTAMP WITH TIMEZONE are presented to the user in the session timezone anyway, the per-row timezone information can be stripped away and all values can be stored in some arbitrary fixed timezone (e.g., UTC).

Please comment on the semantics. It seems wrong. Why was this choice made? It is hard to believe that it was not done intentionally.

@haozhun clarifying comments

1

@losipiuk I agree with you that Timestamp w/o TZ in Presto is broken. I do NOT agree that Timestamp w TZ should behave like Instant. I believe it should also have an associated time zone. (In other words, I believe Timestamp w TZ is implemented correctly today.) Below is an excerpt of something I wrote early last year that summarizes the current behavior and my understanding.


To summarize how things work today:

  • Timestamp w TZ = DateTime in joda = ZonedDateTime in java8
  • Timestamp w/o TZ = Instant in joda = Instant in java8.
    • In other words, Timestamp w/o TZ represents an instant in time, they just don't have a timezone associated with them.
    • When you print them out for human consumption, you need to resort to user's session time zone.
    • When you turn it into a Timestamp w TZ, you just stick that time zone to it without changing the instant.

The way I understand it

  • Timestamp w TZ = DateTime in joda = ZonedDateTime in java8
  • Timestamp w/o TZ = LocalDateTime in joda = LocalDateTime in java8.
    • In other words, Timestamp w/o TZ represents a specific Year/Month/Day/Hour/Minute/Second. But it doesn't represent a specific instant. (Side note: this concept does not need expensive representation. It can still be represented as millis/nanos since epoch, observing only chronology rules and no tz rules)
    • When you print it out, it should be printed out as is.
    • If you want to turn it into a Timestamp w TZ, you need to resort to the user's session time zone for the missing piece of information (and this is not always possible because Timestamp w TZ has gaps).

Here is the reason I believe the first understanding is inconsistent. I can only think of one possible interpretation for the other 3 concepts:

  • Time w TZ = no single class correspondence in joda = OffsetTime in java8
    • Offset and Zone are different in that only a constant offset exists for OffsetX; a political zone is not allowed.
    • ZonedTime doesn't make sense and doesn't exist. ZonedDateTime and OffsetDateTime both exist.
  • Time w/o TZ = LocalTime in joda = LocalTime in java8
  • Date = LocalDate in joda = LocalDate in java8

Note the inconsistency between the interpretations of Timestamp w/o TZ and Time w/o TZ if we adopt the first interpretation of Timestamp w/o TZ (Instant vs LocalTime), whereas under the second interpretation they are consistent (LocalDateTime vs LocalTime).

I went to SQL spec for the definitive answer:

  • Abbreviations
    • SV is the source value, TV is the target value
    • UTC is the UTC component of SV or TV (if and only if the source or target has time zone)
    • TZ is the timezone displacement of SV or TV (if and only if the source or target has time zone)
    • STZD is the SQL-session default time zone displacement
  • To convert Timestamp w/o TZ to Timestamp w/ TZ: TV.UTC = SV - STZD; TV.TZ = STZD
  • To convert Timestamp w/ TZ to Timestamp w/o TZ: TV = SV.UTC + SV.TZ

I believe these two rules prove that the SQL spec agrees with my interpretation. Let's consider a cast from Timestamp w/o TZ to Timestamp w/ TZ:

  • Cast from Timestamp w/o TZ
    • Given, SV = 3600000 millis (1 hour)
  • Cast to Timestamp w TZ in MPK
    • STZD = America/Los_Angeles
    • TV.UTC = 3600000 - (-28800000) = 32400000, TV.TZ=America/Los_Angeles
    • When written out in human format, TV is: 1970-01-01 01:00:00 America/Los_Angeles, TV.UTC is 1970-01-01 09:00:00
  • Cast to Timestamp w TZ in Shanghai
    • STZD = Asia/Shanghai
    • TV.UTC = 3600000 - 28800000 = -25200000, TV.TZ=Asia/Shanghai
    • When written out in human format, TV is: 1970-01-01 01:00:00 Asia/Shanghai, TV.UTC is 1969-12-31 17:00:00

Under the first interpretation, these two casts should have yielded equal results. Under the second interpretation, they would produce different results. The rule in the SQL spec produces two different results.

Lastly, a side note from me: both interpretations can produce results that depend on the user's session time zone:

  • Under current Presto interpretation of Timestamp w/o TZ:
    • Same storage representation. Let's say 60000 milliseconds.
    • Prints out as 1969-12-31 16:01:00 in MPK, and 1970-01-01 08:01:00 in China. They are different.
    • When cast to Timestamp w TZ, become 1969-12-31 16:01:00 America/Los_Angeles in MPK, and 1970-01-01 08:01:00 Asia/Shanghai in China. They are equal.
  • Under SQL spec interpretation of Timestamp w/o TZ:
    • Same storage representation. Let's say 60000 milliseconds.
    • Prints out as 1970-01-01 00:01:00 in MPK, and 1970-01-01 00:01:00 in China. They are the same.
    • When cast to Timestamp w TZ, become 1970-01-01 00:01:00 America/Los_Angeles in MPK, and 1970-01-01 00:01:00 Asia/Shanghai in China. They are different.

Under the SQL spec, cast from timestamp w/o TZ to timestamp w/ TZ can produce different results based on user time zone. As a result, I guess this cast probably should NOT have been implicit.

2

@losipiuk, @dain, and I reached agreement:

  • Timestamp with Timezone in Presto is implemented properly today (like DateTime in joda, ZonedDateTime in jdk8).
  • Timestamp in Presto is like Instant in joda/jdk8 today. It should be like LocalDateTime in joda/jdk8.
  • Extracting hour from 2016-01-01 12:00:00 <TZ> should return 12 no matter what <TZ> is put in the template (see the example after this list).
  • As part of fixing Timestamp in Presto, we should remove implicit coercion from Timestamp to Timestamp with Timezone because the result value is environment dependent.
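
A minimal example of the agreed-upon semantics (this shows the expected result, not necessarily the behavior at the time this issue was filed):

SELECT extract(hour FROM TIMESTAMP '2016-01-01 12:00:00');
-- expected: 12, regardless of the session time zone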

Notes

Ticket migrated from prestodb/presto#7122, prestodb/presto#10326

Sampled partition stats have greatly underestimated NDVs for some columns

Incorporated from: prestodb/presto#11959 (discussion there)

tpch_sf100. orders (partitioned):

presto> show stats for hive.tpch_sf100_orc_part.orders;
  column_name  |      data_size      | distinct_values_count | nulls_fraction |   row_count    | low_value  | high_value
---------------+---------------------+-----------------------+----------------+----------------+------------+------------
 orderkey      | NULL                |               62881.0 |            0.0 | NULL           | 162        | 599999942
 custkey       | NULL                |               62881.0 |            0.0 | NULL           | 2          | 14999998
 orderstatus   |      1.4999663244E8 |                   3.0 |            0.0 | NULL           | NULL       | NULL
 totalprice    | NULL                |               62881.0 |            0.0 | NULL           | 851.17     | 571417.48
 orderpriority |     1.25999070546E9 |                   5.0 |            0.0 | NULL           | NULL       | NULL
 clerk         |      2.2499494866E9 |               49276.0 |            0.0 | NULL           | NULL       | NULL
 shippriority  | NULL                |                   1.0 |            0.0 | NULL           | 0          | 0
 comment       | 7.275173681759999E9 |               62881.0 |            0.0 | NULL           | NULL       | NULL
 orderdate     | NULL                |                2406.0 |            0.0 | NULL           | 1992-01-01 | 1998-08-02
 NULL          | NULL                | NULL                  | NULL           | 1.4999663244E8 | NULL       | NULL
(10 rows)
presto> show stats for hive.tpch_sf100_orc.orders;
  column_name  |      data_size       | distinct_values_count | nulls_fraction | row_count | low_value  | high_value
---------------+----------------------+-----------------------+----------------+-----------+------------+------------
 orderkey      | NULL                 |                 1.5E8 |            0.0 | NULL      | 1          | 600000000
 custkey       | NULL                 |           1.0165457E7 |            0.0 | NULL      | 1          | 14999999
 orderstatus   |                1.5E8 |                   3.0 |            0.0 | NULL      | NULL       | NULL
 totalprice    | NULL                 |           3.4038282E7 |            0.0 | NULL      | 811.73     | 591036.15
 orderdate     | NULL                 |                2449.0 |            0.0 | NULL      | 1992-01-01 | 1998-08-02
 orderpriority | 1.2600248040000002E9 |                   5.0 |            0.0 | NULL      | NULL       | NULL
 clerk         |               2.25E9 |              100816.0 |            0.0 | NULL      | NULL       | NULL
 shippriority  | NULL                 |                   1.0 |            0.0 | NULL      | 0          | 0
 comment       |        7.275038709E9 |          1.12463191E8 |            0.0 | NULL      | NULL       | NULL
 NULL          | NULL                 | NULL                  | NULL           |     1.5E8 | NULL       | NULL

Should we extrapolate NDVs instead? It seems that partitions are often different chunks of data, so their NDVs don't overlap.
Alternatively, we could store an HLL state per column as an auxiliary partition property and calculate the extrapolation based on merged HLLs.

TABLESAMPLE SYSTEM prevents scan/filter/project operator fusion

The TABLESAMPLE node in the plan for the following query prevents operator fusion:

presto> explain (type distributed) select length(name) from tpch.tiny.nation TABLESAMPLE system (50);
                                      Query Plan                                       
---------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]                                                                   
     Output layout: [length]                                                           
     Output partitioning: SINGLE []                                                    
     Grouped Execution: false                                                          
     - Output[_col0] => [length:bigint]                                                
             _col0 := length                                                           
         - RemoteSource[1] => [length:bigint]                                          
                                                                                       
 Fragment 1 [SOURCE]                                                                   
     Output layout: [length]                                                           
     Output partitioning: SINGLE []                                                    
     Grouped Execution: false                                                          
     - Project[] => [length:bigint]                                                    
             length := "length"("name")                                                
         - Sample[SYSTEM: 0.5] => [name:varchar(25)]                                   
             - TableScan[tpch:nation:sf0.01, grouped = false] => [name:varchar(25)]    
                     Cost: {rows: 25 (302B), cpu: 302.00, memory: 0.00, network: 0.00} 
                     name := tpch:name                                                 

Automated stress tests

@martint, @electrum, @dain Can you share your experience with stress testing? What do you think needs to be tested:

  • what data and queries?
  • what workload (concurrency)?
  • how long should it run to catch production-like issues?
  • how much hardware do we need?
  • how can this be automated?
  • what needs to be verified (memory leaks, garbage collection, correctness)?

Are there any companies that would participate in this by providing "real" data and queries, or even hardware?

I was thinking about spinning up a cluster (or clusters) on AWS and running the above frequently (e.g., once a week). It would be good to do this from the beginning: we currently have few such tests, so automating them should be easy. Then we could iteratively improve it.

Add hidden $partition column to Hive connector

It would help with computing or filtering on the raw partition names that other Hive-related tools provide, e.g. ds=2017-05-23/type=abc/ts=2017-05-23+21%3A45%3A99.

It should support pushdown for partition pruning.
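
For illustration, usage could look like the following sketch (the table name is hypothetical; the hidden column must be quoted):

-- Count rows per raw partition name
SELECT "$partition", count(*)
FROM hive.web.request_logs
GROUP BY "$partition";

-- With pushdown, a filter on the raw partition name should prune partitions
SELECT *
FROM hive.web.request_logs
WHERE "$partition" = 'ds=2017-05-23/type=abc';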

  • Add initial $partition column (#3582)
  • Support pushdown for partition pruning

Optimize joins with "similar" subqueries

Optimize queries of the following shape:

SELECT * 
FROM (
   SELECT k, agg1, agg2
   FROM t
   GROUP BY k
) a
JOIN (
   SELECT k, agg3, agg4
   FROM t
   GROUP BY k
) b
ON (a.k = b.k)

as

SELECT k, agg1, agg2, agg3, agg4
FROM t
GROUP BY k

More generally, if we know k contains distinct values, optimize

SELECT * 
FROM (
   SELECT k, v1, v2
   FROM t
) a
JOIN (
   SELECT k, v3, v4
   FROM t
) b
ON (a.k = b.k)

as

SELECT k, v1, v2, v3, v4 
FROM t

Release notes for 302

Dain Sundstrom

  • all checked
  • 01-31 15:18:24 10546ea Reformat ParquetTester
  • 01-31 15:18:24 bf01fcd Add dynamic batch sizing in Parquet reader
  • 01-31 15:27:30 99bd6ba Rename Context method to match field names
  • 01-31 15:27:30 bacf59f Add physical* and network* metrics
  • 01-31 15:27:30 e81eefe Remove unused methods
  • 01-31 16:30:15 1fad470 Disable ConnectorIdentity#equals and #hashCode
  • 01-31 16:30:15 4a4bf3f Improve ConnectorIdentity#toString
  • 02-04 14:45:12 6903d9b Add CLI option --trace-token to set the trace token #56.
  • 02-04 16:55:23 f7fe35c Fix starvation while waiting for minimum number of workers
  • 02-05 11:41:04 d6fb5e0 Add peak task user memory usage to query stats
  • 02-05 12:02:33 d7d6166 Add spilled data to query stats, CLI, and Web UI
  • 02-05 12:32:15 7ff3b40 Allow function implementation choices to have different dependencies
  • 02-06 09:06:30 275bb36 Remove system pool entirely

David Phillips

  • all checked
  • 01-31 17:15:52 0b55a32 Remove unused class ConnectorAnalyzeMetadata
  • 02-02 01:31:23 3fc4aff Use class reference for BlockMissingException
  • 02-02 16:17:58 02711a8 Move ScheduledSplit to execution package
  • 02-02 16:17:58 6082254 Move OutputBuffers to execution.buffer package
  • 02-02 16:17:58 8ca92f2 Move TaskSource to execution package
  • 02-02 16:37:56 d983340 Cleanup code in ORC module
  • 02-04 14:26:47 81d77ca Fix grammar in error messages for CDF functions
  • 02-04 15:30:50 20fb7d8 Document hive.recursive-directories
  • 02-05 14:04:02 c2f23eb Rename StageExecutionStrategy to StageExecutionDescriptor
  • 02-05 15:32:28 8ba4f26 Rename methods in StageExecutionDescriptor
  • 02-05 17:04:29 30f7812 Improve performance for positioned reads on S3
  • 02-05 17:04:29 5c57d92 Throw IOException for unrecoverable S3 errors
  • 02-05 22:18:08 0a543f6 Add option not to delete target directory on failure
  • 02-05 22:18:08 14cc065 Remove redundant null check
  • 02-06 11:07:04 4575f98 Introduce DynamicConfigurationProvider
  • 02-06 11:07:04 4daf538 Add support for extra credential in JDBC driver
  • 02-06 11:07:04 50224b3 Add HiveGcsModule
  • 02-06 11:07:04 70f733f Rename HdfsConfigurationUpdater to HdfsConfigurationInitializer
  • 02-06 11:07:04 9a09af1 Add credentials in Identity and ConnectorIdentity
  • 02-06 11:07:04 de7b516 Add support for extra credential in Presto CLI
  • 02-06 11:07:04 fa02de3 Add header for extra credential

Grzegorz Kokosiński

  • all checked
  • 02-01 02:13:30 1d62598 Remove PlanNodeMatcher
  • 02-01 02:13:30 431e230 Static import DEFAULT_MATCHER
  • 02-01 02:13:30 4efe280 Add context parameter to custom pattern matching functions
  • 02-01 02:13:30 5d57936 Make Matcher to return the Stream of Match
  • 02-01 02:13:30 6375f4c Remove Matcher, move match to Pattern
  • 02-01 02:13:30 642330b Make Pattern#previous to be Optional
  • 02-01 02:13:30 68e22ec Add missing requireNonNull to matching related constructors
  • 02-01 02:13:30 70800a9 Remove unused variable
  • 02-01 02:13:30 87ed0f3 Remove value from Match
  • 02-01 02:13:30 e0d02ad Inline Patterns implementations from DefaultMatcher
  • 02-01 12:34:13 5d384b2 Add Elasticsearch connector
  • 02-04 13:38:30 b59e583 Make Git quiet when running tests in Travis
  • 02-05 01:32:25 34858db Handle S3 empty directories as directories

Karol Sobczak

  • all checked
  • 02-01 10:11:38 ca11249 Add physical* and internalNetwork* stats to QueryStatistics
  • 02-04 06:45:39 c3b113e Do not reset configuration properties to default values
  • 02-04 06:45:39 fce297a Use passed featuresConfig in LocalQueryRunner
  • 02-04 09:00:10 1a867a6 Expose plan node stats and cost in event listener
  • 02-04 09:00:10 552cf2a Document that some QueryStatistics field are not backward compatible
  • 02-05 02:22:38 c3675ee Fix Javadoc for QueryStatistics#planNodeStatsAndCosts
  • 02-06 04:41:11 0a9b804 Revoke memory after initial output page has been produced in tests
  • 02-06 04:41:11 1492b3f Return spill future directly
  • 02-06 04:41:11 5fef5aa Allow memory revoke only during operator finish phase
  • 02-06 04:41:11 79744f2 Convert revocable memory to user memory on hash aggregation finish
  • 02-06 04:41:11 e69b668 Produce more than single page in testHashAggregation

Martin Traverso

  • all checked
  • 02-04 11:37:33 d5dfc99 Preserve quotedness in QualifiedName
  • 02-04 16:46:08 7569071 Add approx_distinct for boolean
  • 02-04 21:10:28 890a6a0 Clarify handling of constant properties
  • 02-04 21:10:28 c9f8a23 Propagate partition properties for full outer join
  • 02-05 06:56:16 12c9b50 Support boolean type in EXPLAIN IO
  • 02-05 06:56:16 efc95d6 Rename classes to match coding style
  • 02-05 23:24:22 7567baf Add SphericalGeography type

Piotr Findeisen

  • all checked
  • 02-01 00:24:27 0c72574 Remove TODO related to stats caching
  • 02-01 00:24:27 363a3c3 Capture TaskManagerConfig on a field in LocalQueryRunner
  • 02-01 00:24:27 dc1d66f Add Exchange before GroupId to improve Partial Aggregation
  • 02-01 03:24:50 eddafb5 Fix merge conflict
  • 02-01 07:16:18 1116011 Rename VarcharType#getLengthSafe to #getBoundedLength
  • 02-01 07:16:18 8af7888 Remove deprecated VarcharType#getLength method
  • 02-01 07:16:58 27a1b0e Minor code fixes in AbstractTestParquetReader
  • 02-01 07:16:58 d7210b7 Fix information_schema.columns queries for Kafka, Redis
  • 02-02 13:10:02 ddf2421 Document note about LDAPS and JDK Upgrade
  • 02-04 05:57:10 a873894 Remove redundant usages of StringBuilder
  • 02-04 05:57:10 c290b17 Import String.join statically
  • 02-04 08:57:57 656be2c Make service principal hostname configurable in Kerberos auth
  • 02-04 12:12:52 157ad3a Fix inconsistencies in hostname configuration property for Kerberos auth
  • 02-05 06:14:47 1fa2cba Document RecordCursor#getReadTimeNanos
  • 02-05 06:14:47 c48c8b6 Remove incorrect getReadTimeNanos() implementations
  • 02-05 07:45:50 0ebbea6 Check overlapping on Domain directly in TupleDomainParquetPredicate
  • 02-05 07:45:50 1f4caa8 Make #getDomain return nonnull value
  • 02-05 07:45:50 7296700 Assume Domain from file cannot be isNone()
  • 02-05 07:45:50 7d2f5a9 Use known Presto Type instead of reconstructing
  • 02-05 07:45:50 8e3bf91 Remove redundant else
  • 02-05 07:45:50 972fdec Remove unused ParquetTypeUtils#getPrestoType method

Simplify JOIN planning

The JOIN planning code in RelationPlanner currently has a bunch of complicated logic that attempts to derive and extract the terms of equality comparisons, etc. Now that JoinNode can contain an arbitrary expression as the join criteria, we should simplify this code to not do that, and leave it to the optimizers to push down predicates as necessary.

Implement MULTISET type, syntax, and functions

4.10.3 Multisets

A multiset is an unordered collection. Since a multiset is unordered, there is no ordinal position to reference individual elements of a multiset.

A multiset type is a <collection type>. If MT is some multiset type with element type EDT, then every value of MT is a multiset of EDT.

Let M1 and M2 be multisets of EDT. M1 and M2 are identical if and only if M1 and M2 have the same cardinality n, and for each element x in M1, the number of elements of M1 that are identical to x, including x itself, equals the number of elements of M2 that are identical to x.

Let n1 be the cardinality of M1 and let n2 be the cardinality of M2. M1 is a submultiset of M2 if, for each element x of M1, the number of elements of M1 that are not distinct from x, including x itself, is less than or equal to the number of elements of M2 that are not distinct from x.
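
For intuition, a few values illustrating the identity and submultiset rules above (the MULTISET constructor is the standard's enumeration syntax; none of this is executable in Trino today):

MULTISET [1, 2, 2]   -- identical to MULTISET [2, 1, 2]: same cardinality, same per-element counts
MULTISET [1, 2]      -- a submultiset of MULTISET [1, 2, 2]
MULTISET [2, 2, 2]   -- not a submultiset of MULTISET [1, 2, 2]: 2 occurs three times vs. twice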

IS DISTINCT FROM bug with ObjectIdFunctions

This is a copy of the issue from the original prestodb repo: prestodb/presto#11869

I'm getting the error:

com.facebook.presto.metadata.OperatorNotFoundException: 'IS DISTINCT FROM' cannot be applied to ObjectId, ObjectId

when trying to select from a view that pulls data from a MongoDB connector despite 'IS DISTINCT FROM' not being used in the view. I'm assuming there's some background logic that happens with joins/aggregations that uses the 'IS DISTINCT FROM' command.

The SELECT works in Presto version 0.188 but fails in version 0.203 when testing a Presto upgrade.

Ambiguous declarations for concat() function

The concat function allows the following forms:

concat(array(T), array(T))
concat(T, array(T))
concat(array(T), T)

This causes problems when one of the arguments is array[].

If T is array(U) in the second form, the invocation looks like concat(array(U), array(array(U))). A call like concat(array[1,2], array[]) is ambiguous, since array[] is a valid value for both array(T) and array(array(U)). A similar issue happens for the third form.

For example:

ARRAY[1,2,3] || ARRAY[]

satisfies both:

  • array(e) || array(e), where e is integer
  • e || array(e), where e is array(integer)

We originally added it for convenience, but, in hindsight, it turns out to have been a mistake. We should leave only the variants that accept two or more arrays and introduce dedicated functions for single-element append/prepend.

Incorporate WorkProcessor in operators

Issue tracking the effort to support:

  • cross-operator lazy pages (starting from source operators)
  • cleaning up/simplifying the contract between operators via WorkProcessor pipelines
  • providing a base for further improvements (e.g., on-stack rows without Page materialization, Graal)

The advantage of cross-operator lazy pages is that we can avoid IO when queries are highly selective. This requires that significant processing happens in the source stage, but that is increasingly the case with improvements like the CBO ("broadcast joins") or grouped execution.

Stages are:

  • Stage 1
  • base PageProcessor on WorkProcessor
  • Stage 2
  • internally base ScanFilterAndProject on WorkProcessor. The pipeline would look as follows:
split singleton -> [flatMap] -> pages source
                -> [transform] -> page processor 
                -> [transform] -> merge pages

or if split is cursor based

split singleton -> [flatMap] -> cursor source -> [transform] -> merge pages
  • internally base FilterAndProject on WorkProcessor. The pipeline would look as follows:
page buffer -> [transform] -> page processor -> [transform] -> [merge pages]
  • Stage 3
  • create interface for operators that are based on WorkProcessor pipelines
  • create a standardized abstract operator class for operators that are internally based on WorkProcessor pipelines
  • combine operators that are based on WorkProcessors via dedicated "gluing" operator
  • base TopNOperator on WorkProcessor pipelines (fast data exploration!)

See some previous discussion in the old repo: prestodb/presto#12096

Support "array by table subquery" constructor

From the spec:

<array value constructor> ::=
    <array value constructor by enumeration>
  | <array value constructor by query>

<array value constructor by query> ::=
  ARRAY <table subquery>
  1. If <array value constructor by query> is specified, then
    a) The <query expression> QE simply contained in the <table subquery> shall be of degree 1 (one). Let
    ET be the declared type of the column in the result of <query expression>.
    b) The declared type of the <array value constructor by query> is array with element type ET and maximum
    cardinality equal to the implementation-defined maximum cardinality IMDC for such array types.
  1. The result of <array value constructor by query> is determined as follows:
    a) QE is evaluated, producing a table T. Let N be the number of rows in T.
    b) If N is greater than IMDC, then an exception condition is raised: data exception — array data, right truncation.
    c) The result of <array value constructor by query> is an array of N elements such that for all i, 1 (one) ≤ i ≤ N, the value of the i-th element is the value of the only column in the i-th row of T.
    NOTE 167 — The ordering of the array elements is effectively determined by the General Rules of Subclause 7.13, “<query expression>”.

In simple terms,

SELECT ARRAY (SELECT x FROM t)

is equivalent to

SELECT (SELECT array_agg(x) FROM t)
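
For example, under the proposed syntax the two queries below would produce equivalent single-row, single-column results; the second form already works today:

-- Proposed (not yet supported):
SELECT ARRAY (SELECT name FROM tpch.tiny.nation WHERE regionkey = 0);

-- Existing equivalent:
SELECT (SELECT array_agg(name) FROM tpch.tiny.nation WHERE regionkey = 0);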

Efficient Adaptive Partial Aggregations

Extracted from: prestodb/presto#10305 (see discussion there)

Problems

  1. A high-cardinality GROUP BY can result in one row per group in the partial aggregation, which can make the query significantly more expensive than having no partial aggregation at all.
  2. There is no partial aggregation for distinct aggregations, which can cause slow queries since all distinct calculations happen on a few machines (see the example after this list).
  3. Flushing the partial aggregation is all or nothing, which removes the benefit for skewed groups (high row reduction for the big groups).
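
A minimal sketch of problem 2, using TPC-H data: the distinct computation for each of the three orderstatus groups is concentrated on a handful of nodes, because no partial aggregation is applied to DISTINCT inputs.

SELECT orderstatus, count(DISTINCT custkey)
FROM tpch.sf100.orders
GROUP BY orderstatus;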

Proposal

Change partial aggregation to detect ineffective partial aggregations and switch to a mode where the raw input is output instead of intermediate results. In addition, the partial should be able to make this decision on a per-group basis so groups with a high row count can still get the benefits of partial aggregations.

For distinct aggregations in groups that are performing partial aggregation, the operator would determine whether a row is distinct for any of the aggregates and, if so, output the raw input along with flag columns marking which aggregates the row is distinct for.

The net result is that the aggregation may output rows for three distinct cases:

  1. intermediate results for an enabled group
  2. raw input for a disabled group
  3. distinct raw input for distinct aggregates (in an enabled group)

The output descriptor for a partial aggregation would look something like this:

  1. row_type: either intermediate, raw, or distinct
  2. intermediate outputs for non-distinct aggregations; null if row_type is not intermediate
  3. raw input columns; null if row_type is intermediate. For the distinct row_type, only inputs to distinct aggregations may be present.
  4. distinct flags; if set, the input row should be processed by the related aggregation. Null if row_type is not distinct.

PrestoS3FileSystem breaks read-after-write strict S3 consistency

The method com.facebook.presto.hive.s3.PrestoS3FileSystem#create checks for file existence when overwrite == false. This existence check breaks strict read-after-write S3 consistency, per the S3 documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in
your S3 bucket in all regions with one caveat. The caveat is that if you make a
HEAD or GET request to the key name (to find if the object exists) before
creating the object, Amazon S3 provides eventual consistency for read-after-write. 

The existence check itself is weak, as the file might be created while data is being gathered in the staging directory for upload.

I propose to remove the check, or to use an S3 LIST request instead, which does not break read-after-write consistency.

FYI: @findepi

JMX MBeans are not accessible

When a JMX MBean is exported via:

newExporter(binder).export(GlueHiveMetastore.class);

it is not listed, but when exported via:

newExporter(binder).export(GlueHiveMetastore.class)
  .as(generator -> generator.generatedNameOf(GlueHiveMetastore.class));

it is listed under presto.plugin.hive.metastore.glue.

FYI: @electrum

Add support for FETCH/OFFSET syntax

In terms of syntax, this is what the spec says:

A <query expression> can also optionally contain a <result offset clause>, which may limit the cardinality of the derived table by removing a specified number of rows from the beginning of the derived table. If a <query expression> contains both an <order by clause> and a <result offset clause>, then the rows in the derived table are first sorted according to the <order by clause> and then limited by dropping the number of rows specified in the <result offset clause> from the beginning of the result produced by the <query expression>. If the cardinality of the result of an evaluation of a <query expression> is less than the offset value specified by a <result offset clause>, then the derived table is empty.

And

<query expression> ::=
  [ <with clause> ] <query expression body>
      [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]

<result offset clause> ::=
  OFFSET <offset row count> { ROW | ROWS }

<fetch first clause> ::=
  FETCH { FIRST | NEXT } [ <fetch first quantity> ] { ROW | ROWS } { ONLY | WITH TIES }

<fetch first quantity> ::=
    <fetch first row count>
  | <fetch first percentage>

<offset row count> ::=
  <simple value specification>

<fetch first row count> ::=
  <simple value specification>

<fetch first percentage> ::=
  <simple value specification> PERCENT
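
For illustration, queries using this syntax would look like the following (ORDER BY is needed for WITH TIES to be meaningful):

-- Skip the first 5 rows, then return at most 10
SELECT name
FROM tpch.tiny.nation
ORDER BY name
OFFSET 5 ROWS
FETCH FIRST 10 ROWS ONLY;

-- Return the 5 smallest regionkeys, plus any rows tied with the last one
SELECT name, regionkey
FROM tpch.tiny.nation
ORDER BY regionkey
FETCH FIRST 5 ROWS WITH TIES;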

Quantify stats/cost calculator quality. Probabilistically adjust models as a longer term goal

Incorporated from: prestodb/presto#11615

Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one, so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.

Cumulative cost quality

Let's call cumulative estimated query q cost as:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call actual query cost as:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of those are random variables with respect to q.

Observations:

  • f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimation error.
  • We expect f to be a linear transformation (otherwise the sum of subplan estimates won't be the estimate of the whole plan).
  • E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). Thus we have data points composed of pairs:

<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>

for i = 1...N.

Those points will look similar to the scatter plot below:
[figure points1: scatter of (EstimatedXXX(q), CostXXX(q)) points]

Let's transform those points by dividing each yi by xi. We now get:

<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>

Now those points will look like the plot below:
[figure points2: normalized points (EstimatedXXX(q), CostXXX(q)/EstimatedXXX(q))]

We can now perform regression analysis, since we should have satisfied the conditions for OLS regression (the errors come from the same distribution and have constant variance).
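
One plausible formalization of this regression, a sketch assuming the relative errors \varepsilon_i are i.i.d. with constant variance:

y_i' = \frac{\mathrm{CostXXX}(q_i)}{\mathrm{EstimatedXXX}(q_i)}, \qquad y_i' = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad x_i = \mathrm{EstimatedXXX}(q_i)

OLS then fits \beta_0 and \beta_1, and the residual magnitude and coefficient intervals correspond to the quality metrics listed below.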

We can derive quantifiable model properties like:

  • OLS error magnitude
  • regression statistical significance test (if errors have normal distribution)
  • confidence and prediction intervals (if errors have normal distribution)

Such metrics would allow us to compare the quality of cost estimates.

Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will accumulate error. To account for that, we could use plan complexity (e.g., number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.

Individual operator cost quality

Let's call operator o estimated cost as:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call actual operator cost as:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call operator estimated input data size as:
EstimatedInputSize(o).
Let's call operator measured input data size as:
MeasuredInputSize(o).
All of those are random variables with respect to o.

Observations:

  • an operator's estimated input data size will most likely not match its actual input data size. This affects the values of OperatorEstimatedXXX(o) and OperatorCostXXX(o). Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)

Intuitively, these tell us how much of a given resource the operator needs per input byte. We can now define the operator estimation error as:

OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)

Its distribution will look like the plot below:
[figure Error: distribution of OperatorEstimationErrorXXX]

We can now:

  • derive variance of the OperatorEstimationErrorXXX
  • derive the mean of the OperatorEstimationErrorXXX, i.e. how much (per input byte) the operator cost is under- or overestimated.
  • if OperatorEstimationErrorXXX has a normal distribution, we can draw more sophisticated conclusions (e.g., if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by a known amount).

Note that we should perform such analysis for each operator type as they have different profiles.

Stats quality

A similar approach can be applied to plan nodes that filter data (e.g., actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and of the rules that perform predicate estimation.
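
As a manual baseline for this kind of comparison, EXPLAIN ANALYZE already shows estimates alongside actual runtime statistics for each plan node, so the estimated vs. actual filtering factor of a single query can be inspected by hand:

EXPLAIN ANALYZE
SELECT count(*)
FROM tpch.sf1.orders
WHERE orderdate >= DATE '1995-01-01';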

Adjusting model

The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.

However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, we could/should:

  • introduce variables for multiplying estimated operator cost so that they match measured cost.
  • specifically choose the operator memory cost factor to be large, so that for the majority of queries we overestimate memory (to be on the safe side)
  • multiply cumulative plan costs so that we slightly overestimate. Such a factor should account for plan complexity, as simpler plans should have more accurate costs

Having such variables, we could adjust the model for different hardware, data, and workloads. Specifically, we could tune the model for each client so that the CBO produces better plans.

Plan Of Attack

The steps are as follows:

  • For each operator, expose cumulative and individual stats/costs in the query info (v1/query) JSON so that they can be gathered and analyzed. This would be used to evaluate the model on a predefined set of queries (e.g., TPC-H/TPC-DS) during benchmarks.

Note that operators don't map 1-1 to plan nodes. For instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.

  • Expose estimated cost/stats in EventListener so that we can gather and store historical data points.
  • Add variables to the stats/cost models so that we can adjust the model for particular hardware, data, and workloads.

Named queries (WITH) should be evaluated only once

Currently, Trino evaluates named queries (WITH clause) by inlining them wherever they are referenced in the body of the main query. This can cause problems if the query is non-deterministic, since every inlined instance might produce different results.
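
For example, with inlining each reference to r below is planned and evaluated independently, so the comparison can return false, even though per the spec both references should identify the same transient table:

WITH r AS (SELECT random() AS x)
SELECT a.x = b.x
FROM r a, r b;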

According to the spec:

1) If a non-recursive <with clause> is specified, then:
    ...
    b) For every <with list element> WLE, let WQN be the <query name> immediately contained in WLE. Let WQE be 
       the <query expression> simply contained in WLE. Let WLT be the table resulting from evaluation of WQE, with 
       each column name replaced by the corresponding element of the <with column list>, if any, immediately contained in WLE.
    c) Every <table reference> contained in <query expression> that specifies WQN identifies WLT.

This says that each named query should be evaluated once and the result "stored" in a table WLT. Thereafter, any references to the named query directly identify that WLT table.

WLT in the definition above is what the spec calls "transient table":

4.15.4 Transient tables
A transient table is a named table that may come into existence implicitly during the evaluation of a <query expression> or the execution of a trigger. A transient table is identified by a <query name> if it arises during the evaluation of a <query expression>, or by a <transition table name> if it arises during the execution of a trigger. Such tables exist only for the duration of the executing SQL-statement containing the <query expression> or for the duration of the executing trigger.

Allow connectors to participate in query optimization

This is an umbrella issue to track various projects related to allowing connectors to participate in query optimization. The long-term vision is for plugins to provide custom rules that can transform subplans into plugin-specific operations. This requires a set of steps:

  • decouple the AST from the current IR (PlanNode tree): #13184
  • get rid of visitor-based optimizers
  • revamp the IR and optimizer to support a fully exploratory optimizer
  • allow connectors to provide optimizer rules

In the short term, we can introduce special-purpose mechanisms to the SPI and engine to enable the following behaviors:

  • push down complex filters
  • push down projections (e.g, row/array/map dereference, pre-calculated virtual columns)
  • push down aggregations
  • push down joins
  • expose additional filters (e.g. for row-level authorization)
  • expose more complex data organizations (e.g., custom partitioning schemes)

Document describing the high-level approach: https://github.com/prestosql/presto/wiki/Pushdown-of-complex-operations

Plan:

  • #13184
  • Simplify and hide notion of TableLayout: #363
  • Deprecate table layouts: #420
    • Support Constraint-based pushdown to transition from Table Layout: #541
  • Define simplified expression language (some work in progress: #402)
  • Provide TableHandle to PageSource/RecordCursor providers: #442
  • Add pushXXXIntoConnector + Rule pairs
    • Limit: #421
    • Filter: #402, #7994
      • Use canonical function name in expression pushdown (e.g. pass substring instead of substr)
    • Projection: #676
    • Aggregation (#6613)
    • Join (#6620)
    • Sample: #753
    • Top N
      • engine & SPI (#4249)
      • JDBC connectors #4769
      • Push PARTIAL TopN into TableScan #7028
      • Push TopN through Project into TableScan rule #7029
      • Elasticsearch #4803
  • Integrate into connectors
    • Hive
      • Parquet + ORC - Migrate #187 to new framework
      • Optimize Top N queries over partition keys #3050
    • JDBC connectors
  • Remove TableLayout-related APIs: #781
  • SchemaTableName should not be required for synthetic ConnectorTableHandle returned from ConnectorMetadata.apply methods #6694
  • Support connectors advanced pushdown in EXPLAIN IO #6695
