
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Home Page: https://trino.io

License: Apache License 2.0

Languages: Java 99.48%, Shell 0.04%, Python 0.01%, HTML 0.03%, CSS 0.02%, JavaScript 0.32%, ANTLR 0.08%, Dockerfile 0.01%, Groovy 0.01%, PLSQL 0.01%, Scala 0.01%, Smarty 0.01%

Topics: java, presto, hive, hadoop, big-data, sql, prestodb, database, databases, distributed-systems

trino's Introduction

Trino Logo

Trino is a fast distributed SQL query engine for big data analytics.

See the User Manual for deployment instructions and end user documentation.

Trino download Trino Slack Trino: The Definitive Guide book download

Development

See DEVELOPMENT for information about code style, development process, and guidelines.

See CONTRIBUTING for contribution requirements.

Security

See the project security policy for information about reporting vulnerabilities.

Build requirements

  • Mac OS X or Linux
  • Java 21.0.1+, 64-bit
  • Docker
    • Turn off SELinux or other systems that disable write access to the local checkout, so that containers can mount parts of the Trino source tree

Building Trino

Trino is a standard Maven project. Simply run the following command from the project root directory:

./mvnw clean install -DskipTests

On the first build, Maven downloads all the dependencies from the internet and caches them in the local repository (~/.m2/repository), which can take a while, depending on your connection speed. Subsequent builds are faster.

Trino has a comprehensive set of tests that take a considerable amount of time to run, and are thus disabled by the above command. These tests are run by the CI system when you submit a pull request. We recommend only running tests locally for the areas of code that you change.
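
For example, to run only the tests of a single module, you can use Maven's -pl option (the module path shown here is just an illustration):

./mvnw test -pl core/trino-main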

Running Trino in your IDE

Overview

After building Trino for the first time, you can load the project into your IDE and run the server. We recommend using IntelliJ IDEA. Because Trino is a standard Maven project, you can easily import it into your IDE. In IntelliJ, choose Open Project from the Quick Start box or choose Open from the File menu and select the root pom.xml file.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

  • Open the File menu and select Project Structure
  • In the SDKs section, ensure that JDK 21 is selected (create one if none exist)
  • In the Project section, ensure the Project language level is set to 21

Running a testing server

The simplest way to run Trino for development is to run the TpchQueryRunner class. It will start a development version of the server that is configured with the TPCH connector. You can then use the CLI to execute queries against this server. Many other connectors have their own *QueryRunner class that you can use when working on a specific connector.

Running the full server

Trino comes with sample configuration that should work out-of-the-box for development. Use the following options to create a run configuration:

  • Main Class: io.trino.server.DevelopmentServer
  • VM Options: -ea -Dconfig=etc/config.properties -Dlog.levels-file=etc/log.properties -Djdk.attach.allowAttachSelf=true
  • Working directory: $MODULE_DIR$
  • Use classpath of module: trino-server-dev

The working directory should be the trino-server-dev subdirectory. In IntelliJ, using $MODULE_DIR$ accomplishes this automatically.

If the VM options field does not appear in the dialog, select Modify options and enable Add VM options.

Running the CLI

Start the CLI to connect to the server and run SQL queries:

client/trino-cli/target/trino-cli-*-executable.jar
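
By default the CLI connects to a server at localhost:8080. If your development server listens on a different address, pass it explicitly with the --server option (the address below is just an example):

client/trino-cli/target/trino-cli-*-executable.jar --server localhost:8080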

Run a query to see the nodes in the cluster:

SELECT * FROM system.runtime.nodes;

Run a query against the TPCH connector:

SELECT * FROM tpch.tiny.region;

trino's People

Contributors

alexjo2144, arhimondr, cberner, dain, ebyhr, electrum, erichwang, findepi, findinpath, haozhun, hashhar, homar, kasiafi, kokosing, losipiuk, maciejgrzybek, martint, mosabua, nezihyigitbasi, nileema, nineinchnick, pettyjamesm, phd3, pnowojski, praveen2112, raghavsethi, raunaqmorarka, sopel39, wendigo, wenleix


trino's Issues

Add support for case sensitive identifiers

<delimited identifier> ::=
  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::=  <delimited identifier part>...
<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
  U <ampersand> <double quote> <Unicode delimiter body> <double quote>
      <Unicode escape specifier>
<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode delimiter body> ::=
  <Unicode identifier part>...
<Unicode identifier part> ::=
    <delimited identifier part>
  | <Unicode escape value>
24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an <identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is transliterated into the corresponding character or characters of CNF as follows:
Case:
   a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence U is defined by Unicode, then let j be the number of characters in U; the next j characters of CNF are U.
   b) Otherwise, the next character of CNF is Mi.
25) The case-normal form of the <identifier body> of a <regular identifier> is used for purposes such as and including determination of identifier equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.

...

27) Two <regular identifier>s are equivalent if the case-normal forms of their <identifier body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation IDC that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the <delimited identifier body> of the <delimited identifier> (with all occurrences of <quote> replaced by <quote symbol> and all occurrences of <doublequote symbol> replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and IDC, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

29) Two <delimited identifier>s are equivalent if their <delimited identifier body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

30) Two <Unicode delimited identifier>s are equivalent if their <Unicode delimiter body>s, considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

31) A <Unicode delimited identifier> and a <delimited identifier> are equivalent if their <Unicode delimiter body> and <delimited identifier body>, respectively, each considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

32) A <regular identifier> and a <Unicode delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the <Unicode delimiter body> of the <Unicode delimited identifier>, considered as the repetition of a <character string literal>, each specifying a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

The approach and design is being captured here: https://github.com/prestosql/presto/wiki/Delimited-Identifiers
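
As a quick illustration of the rules quoted above (table and column names are hypothetical): a regular identifier is compared via its case-normal form, while a delimited identifier is matched exactly:

SELECT orderkey FROM orders;    -- regular identifier: case-normalized, so orderkey, OrderKey, and ORDERKEY refer to the same column
SELECT "OrderKey" FROM orders;  -- delimited identifier: matches only a column whose name is exactly OrderKey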

Remove DISTINCT inside IN subquery expression

Add an optimization rule to remove unnecessary DISTINCT in this scenario:

SELECT ...
FROM t
WHERE c IN (SELECT DISTINCT ... FROM u)

Since the semi join already performs deduplication of the values in the subquery, that operation is unnecessary and results in an extra exchange and aggregation.

presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select custkey from tpch.tiny.customer);
                                                            Query Plan
-----------------------------------------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
             custkey_1 := tpch:custkey
presto> explain (type distributed) select count(*) from tpch.tiny.orders where custkey in (select distinct custkey from tpch.tiny.customer);
                                           Query Plan
-------------------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]
     Output layout: [count]
     Output partitioning: SINGLE []
     - Output[_col0] => [count:bigint]
             _col0 := count
         - Aggregate(FINAL) => [count:bigint]
                 count := "count"("count_13")
             - LocalExchange[SINGLE] () => count_13:bigint
                 - RemoteSource[1] => [count_13:bigint]

 Fragment 1 [HASH]
     Output layout: [count_13]
     Output partitioning: SINGLE []
     - Aggregate(PARTIAL) => [count_13:bigint]
             count_13 := "count"(*)
         - FilterProject[filterPredicate = "expr_8"] => []
             - SemiJoin[custkey = custkey_1] => [custkey:bigint, expr_8:boolean]
                 - RemoteSource[2] => [custkey:bigint]
                 - LocalExchange[SINGLE] () => custkey_1:bigint
                     - RemoteSource[3] => [custkey_1:bigint]

 Fragment 2 [tpch:orders:15000]
     Output layout: [custkey]
     Output partitioning: HASH [custkey]
     - TableScan[tpch:tpch:orders:sf0.01, originalConstraint = true] => [custkey:bigint]
             custkey := tpch:custkey

 Fragment 3 [HASH]
     Output layout: [custkey_1]
     Output partitioning: HASH (replicate nulls) [custkey_1]
     - Aggregate(FINAL)[custkey_1] => [custkey_1:bigint]
         - LocalExchange[HASH] ("custkey_1") => custkey_1:bigint
             - RemoteSource[4] => [custkey_1:bigint]

 Fragment 4 [SOURCE]
     Output layout: [custkey_1]
     Output partitioning: HASH [custkey_1]
     - Aggregate(PARTIAL)[custkey_1] => [custkey_1:bigint]
         - TableScan[tpch:tpch:customer:sf0.01, originalConstraint = true] => [custkey_1:bigint]
                 custkey_1 := tpch:custkey

Improve encoding of grouping sets in plan

Currently, the list of concrete grouping sets is expanded during analysis. This results in inefficiencies during analysis and planning due to redundant re-analysis and storage of the atoms of each grouping element: the analyzer allocates 2^#columns sets of expressions when analyzing the query, then analyzes each of those expressions and creates a GroupId node with that many entries.

Instead, the engine should only analyze the atomic units (columns of each grouping element) and should encode the expected grouping sets in the plan using an abstract description of the components instead of enumerating every combination of grouping columns. The enumeration should happen when the operator is instantiated.

I.e., the GroupId node should contain a list of grouping elements:

Elements = (CUBE set<symbol> | ROLLUP list<symbol> | SETS list<set<symbol>>)+
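
For example, for a query such as the following (hypothetical table t), the plan would only record the abstract element CUBE {a, b} rather than enumerating the 2^2 = 4 concrete grouping sets (), (a), (b), (a, b):

SELECT a, b, count(*)
FROM t
GROUP BY CUBE (a, b)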

Exponential processing time and memory usage in PredicatePushDown

Queries of this shape consume exponential amounts of memory and CPU:

WITH
t1 (v) AS (VALUES 1),
t2 AS( select if(v = 0, v, v) v from t1 ),
t3 AS( select if(v = 0, v, v) v from t2 ),
t4 AS( select if(v = 0, v, v) v from t3 ),
t5 AS( select if(v = 0, v, v) v from t4 ),
t6 AS( select if(v = 0, v, v) v from t5 ),
t7 AS( select if(v = 0, v, v) v from t6 ),
t8 AS( select if(v = 0, v, v) v from t7 ),
t9 AS( select if(v = 0, v, v) v from t8 ),
t10 AS( select if(v = 0, v, v) v from t9 ),
t11 AS( select if(v = 0, v, v) v from t10 ),
t12 AS( select if(v = 0, v, v) v from t11 ),
t13 AS( select if(v = 0, v, v) v from t12 ),
t14 AS( select if(v = 0, v, v) v from t13 ),
t15 AS( select if(v = 0, v, v) v from t14 ),
t16 AS( select if(v = 0, v, v) v from t15 )
select *
from t16
where v = 0

One possible short-term fix is to adjust the inlining heuristics to only inline if the expressions are trivial or appear only once (similar to how the InlineProjections rule works).

The longer-term fix is to move PredicatePushdown to iterative optimizer rules that simply avoid unproductive pushdown actions like the one caused by this query.

Query fails when there is column type mismatch between partition schema and table schema

Presto fails to read Parquet partitions if the decimal datatype doesn't match what is in the Hive metastore. Here is the error:

Query 20190130_224317_00018_w9d29 failed: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'sbnum' in table 'default.presto_test' is declared as type 'decimal(8,0)', but partition 'month=201812' declared column 'sbnum' as type 'decimal(6,0)'.
com.facebook.presto.spi.PrestoException: There is a mismatch between the table and partition schemas. The types are incompatible and cannot be coerced. The column 'sbnum' in table 'default.presto_test' is declared as type 'decimal(8,0)', but partition 'month=201812' declared column 'sbnum' as type 'decimal(6,0)'.
                at com.facebook.presto.hive.HiveSplitManager.lambda$getPartitionMetadata$2(HiveSplitManager.java:315)
                at com.google.common.collect.Iterators$6.transform(Iterators.java:788)
                at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
                at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
                at com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1340)
                at com.facebook.presto.hive.ConcurrentLazyQueue.poll(ConcurrentLazyQueue.java:37)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:252)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:91)
                at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:185)
                at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
                at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
                at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
                at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)

Here are the steps to reproduce this issue -

  1. Create partitioned table -
CREATE external TABLE `presto_test`(
  `sbnum` decimal(6,0),
  `abnum` decimal(8,0)
)
partitioned by (
month int
)
STORED AS PARQUET
LOCATION
  '/hive/presto_test'
;

2. Insert data into this table, creating a few partitions -

insert overwrite table presto_test partition (month=201801)
select sbnum, abnum from <tabname> limit 10
;

insert overwrite table presto_test partition (month=201802)
select sbnum, abnum from <tabname> limit 10
;

3. Access the table from Presto to ensure it works -

select count(1) from presto_test ;

4. Alter the table, changing the data type -

alter table presto_test change sbnum sbnum decimal (8,0) ;

5. Access the table again from Presto -

select count(1) from presto_test ; (should throw an error)

6. Insert one more partition -

insert overwrite table presto_test partition (month=201803)
select sbnum, abnum from <tabname> limit 10
;

7. Access the table from Presto again -

select count(1) from presto_test where month = 201802 ; (should work)

Function and type namespaces

Goals

  • Allow for coexistence of functions and types with similar names but different semantics (e.g., compatibility layer for other databases like Teradata, Oracle, MySQL, Vertica, PostgreSQL, etc)
  • Allow for tighter control of what functions are available in a given deployment
  • Make it possible to implement user-defined functions and types in the future

Features

  • Support for CURRENT_PATH, SET PATH
  • Configurable/mountable function and type plugins

Considerations and open questions

  • Views need to record CURRENT_PATH in definition
  • Need to define migration strategy for re-packaging existing function library
  • Functions should be resolved to a handle during analysis, just like tables.
  • Function resolution should be transaction aware.
  • Function metadata can be fetched using a handle on the coordinator.
  • Function implementations can be retrieved using the handle on the coordinator and workers.
  • Optimizers can look up functions using a fully specified signature.
  • Function handles should be transported to workers without requiring a full rewrite of the plan IR.
  • The SPI interface should be easy to implement.

TODOs

  • Introduce proper function abstractions to engine (FunctionManager, FunctionHandle, etc). #196
  • Simplify function resolution
    • Extract resolution logic from FunctionRegistry.
    • Resolve functions using simpler metadata interface.
    • Move resolution to be inside analyzer (like table resolution)
  • Simplify function instance usage
    • Move call convention resolution to be inside of FunctionManager
    • Callers should really only need to deal with conventions and method handles
  • Finish SQL PATH implementation
    • Move existing function registration to system catalog
    • Add namespace abstraction
    • Update function resolution to understand path semantics
  • Update connector SPI to support function resolution
  • Improve function handle transport to workers

References

4.34 SQL-paths

An SQL-path is a list of one or more <schema name>s that determines the search order for one of the following:
— The subject routine of a <routine invocation> whose <routine name> does not contain a <schema name>.
— The user-defined type when the <path-resolved user-defined type name> does not contain a <schema name>.

4.43.3 SQL-session properties

An SQL-session has an SQL-path that is used to effectively qualify unqualified <routine name>s [...]
The SQL-path is initially set to an implementation-defined value, but can subsequently be changed by the successful execution of a <set path statement>.
The text defining the SQL-path can be referenced by using the <general value specification> CURRENT_PATH.

6.4 <value specification> and <target specification>

  1. The declared type of CURRENT_USER, CURRENT_ROLE, SESSION_USER, SYSTEM_USER, CURRENT_CATALOG, CURRENT_SCHEMA, and CURRENT_PATH is character string. Whether the character string is fixed-length or variable-length, and its length if it is fixed-length or maximum length if it is variable-length, are implementation-defined. The character set of the character string is SQL_IDENTIFIER. The declared type collation is the character set collation of SQL_IDENTIFIER, and the collation derivation is implicit.
  2. The value specified by CURRENT_PATH is a <schema name list> where <catalog name>s are <delimited identifier>s and the <unqualified schema name>s are <delimited identifier>s. Each <schema name> is separated from the preceding <schema name> by a <comma> with no intervening <space>s. The schemas referenced in this <schema name list> are those referenced in the SQL-path of the current SQL-session context, in the order in which they appear in that SQL-path.

10.3 <path specification>

Function

Specify an order for searching for an SQL-invoked routine.

Format

<path specification> ::= PATH <schema name list>
<schema name list> ::= <schema name> [ { <comma> <schema name> }... ]
<schema name> ::= [ <catalog name> <period> ] <unqualified schema name>
<unqualified schema name> ::= <identifier>
<catalog name> ::= <identifier>

Syntax Rules

  1. No two <schema name>s contained in <schema name list> shall be equivalent.

19.8 <set path statement>

Function

Set the SQL-path used to determine the subject routine of <routine invocation>s with unqualified <routine name>s in <preparable statement>s that are prepared in the current SQL-session by an <execute immediate statement> or a <prepare statement> and in <direct SQL statement>s that are invoked directly. The SQL-path remains the current SQL-path of the SQL-session until another SQL-path is successfully set.

Format

<set path statement> ::= SET <SQL-path characteristic>
<SQL-path characteristic> ::= PATH <value specification>

Syntax Rules

  1. The declared type of the <value specification> shall be a character string type.

Access Rules

None.

General Rules

  1. Let S be <value specification> and let V be the character string that is the value of TRIM ( BOTH ' ' FROM S )
    a) If V does not conform to the Format and Syntax Rules of a <schema name list>, then an exception condition is raised: invalid schema name list specification.
    b) The SQL-path of the current SQL-session is set to V.
    NOTE 724 — A <set path statement> that is executed between a <prepare statement> and an <execute statement> has no effect on the prepared statement.

Conformance Rules

  1. Without Feature S071, “SQL paths in function and type name resolution”, Conforming SQL language shall not contain a <set path statement>.
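
A sketch of how these statements might be used, with hypothetical catalog and schema names (the exact engine syntax may differ from the spec's character-string form):

SET PATH example_catalog.schema_one, other_catalog.schema_two;
SELECT CURRENT_PATH;
-- unqualified function and type names are then resolved by searching example_catalog.schema_one first, then other_catalog.schema_two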

Should information_schema.columns.COMMENT be renamed to COLUMN_COMMENT?

MySQL's information_schema.columns table exposes a column_comment column, while Presto exposes a comment column; see https://dev.mysql.com/doc/refman/5.7/en/columns-table.html

I couldn't find this column in ANSI SQL, nor in Oracle, Teradata, MSSQL.

It looks like Hue depends on this column: https://github.com/cloudera/hue/blob/master/desktop/libs/notebook/src/notebook/connectors/jdbc.py#L213

This column was added to Presto about 5 years ago (with prestodb/presto#1214). In the original PR, I do not see any discussion about compliance with existing database systems or ANSI SQL.
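
For reference, a query against the current column might look like this (schema and table names are hypothetical); under the proposal, the comment column would instead be exposed as column_comment:

SELECT column_name, comment
FROM information_schema.columns
WHERE table_schema = 'default' AND table_name = 'orders';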

Adding a short-term solution for pushing down partial aggregation into the connector

Greetings,

Following #18, we have created a simple partial aggregation push-down prototype by creating a new PlanOptimizer. It was added at the end of the optimizers' list (here), matching the partial AggregationNode and passing down the aggregations (specified by their FunctionCall qualified name and the relevant column handles) into the ConnectorTableLayoutHandle of the underlying ScanNode (as is currently done for predicate pushdown by the PickLayouts optimizer).

Currently, we keep the original partial aggregation node in the query plan, since we don't support all aggregations in our connector and would like to fall back to the Presto implementation for unsupported aggregations.
In order to keep the implementation generic, we have added a new API to the Metadata interface that lets the new optimizer optionally add aggregations to an existing TableLayoutHandle.

We would like to ask whether our approach makes sense.
If so, we would like to help the effort of adding this feature (a partial aggregation pushdown API) to mainline Presto.

Also, we are not sure whether we are adding the new optimizer at the correct stage of PlanOptimizers. Maybe we should make it part of PickLayouts?

Incorrect FILTER clause evaluation order

According to the SQL specification, the FILTER clause for an aggregation query should be applied to the rows in the group before the arguments to the aggregation are evaluated.

Currently, expressions in the aggregation arguments are evaluated before the filter, leading to queries failing when they shouldn't.

For example:

select 
    sum(1 / a) filter (where a <> 0), 
    sum(a) filter (where true)
from (values (1), (0)) t(a)

fails with "/ by zero" error.

This is the plan it produces:

 Fragment 0 [SINGLE]
     Output layout: [sum, sum_3]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[_col0, _col1]
     │   Layout: [sum:bigint, sum_3:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   _col0 := sum
     │   _col1 := sum_3
     └─ Aggregate(FINAL)
        │   Layout: [sum_3:bigint, sum:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   sum_3 := sum("sum_7")
        │   sum := sum("sum_6")
        └─ LocalExchange[SINGLE] ()
           │   Layout: [sum_7:row(bigint, boolean, bigint, boolean), sum_6:row(bigint, boolean, bigint, boolean)]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ Aggregate(PARTIAL)
              │   Layout: [sum_7:row(bigint, boolean, bigint, boolean), sum_6:row(bigint, boolean, bigint, boolean)]
              │   sum_7 := sum("a") (mask = expr_1)
              │   sum_6 := sum("expr_2") (mask = expr_0)
              └─ Project[]
                 │   Layout: [expr_2:bigint, a:bigint, expr_1:boolean, expr_0:boolean]
                 │   Estimates: {rows: 2 (44B), cpu: 54, memory: 0B, network: 0B}
                 │   expr_2 := CAST((1 / "field") AS bigint)
                 │   a := CAST("field" AS bigint)
                 │   expr_1 := true
                 │   expr_0 := ("field" <> 0)
                 └─ LocalExchange[ROUND_ROBIN] ()
                    │   Layout: [field:integer]
                    │   Estimates: {rows: 2 (10B), cpu: 10, memory: 0B, network: 0B}
                    └─ Values
                           Layout: [field:integer]
                           Estimates: {rows: 2 (10B), cpu: 0, memory: 0B, network: 0B}
                           (1)
                           (0)

References:

10.9 <aggregate function>

d) If AF is immediately contained in a <set function specification> SFS, then a group of a grouped table of the aggregation query of SFS.

  1. Let T be the argument source of AF.
  2. Case:
    a) If <filter clause> is specified, then the <search condition> is effectively evaluated for each row of T. Let T1 be the collection of rows of T for which the result of the <search condition> is True.
    b) Otherwise, let T1 be T.
  3. If <general set function> is specified, then:
    a) Let TX be the single-column table that is the result of applying the <value expression> to each row of T1 and eliminating null values

Cannot create empty Hive partition for non-bucketed table

The system.create_empty_partition procedure fails because the source directory does not exist (no files were written) and thus the rename fails. SemiTransactionalHiveMetastore.Committer#prepareAddPartition should probably unconditionally create currentPath before checking if it is equal to targetPath.
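
For reference, the failing procedure call looks roughly like this (schema, table, and partition values are hypothetical):

CALL system.create_empty_partition(
    schema_name => 'default',
    table_name => 'presto_test',
    partition_columns => ARRAY['month'],
    partition_values => ARRAY['201901']);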

Optimize union all of similar aggregations

Queries of the following shape:

SELECT 
   a1, ..., an,
   agg1(...),
   agg2(...)
FROM t
WHERE f1(...)

UNION ALL

SELECT 
   b1, ..., bn,
   agg3(...),
   agg4(...)
FROM t
WHERE f2(...)

...

(a1, ..., an, b1, ..., bn are constants)

can be optimized to process all the data in a single pass to reduce the number of stages in the query execution.

One possible rewrite could be:

SELECT * 
FROM UNNEST( (
    SELECT ARRAY[
        ROW(a1, ..., an, agg1(...) FILTER (WHERE f1(...)), agg2(...) FILTER (WHERE f1(...))),
        ROW(b1, ..., bn, agg3(...) FILTER (WHERE f2(...)), agg4(...) FILTER (WHERE f2(...))),
        ...]
    FROM t) ) u

Warn when CAST of varchar(x) -> varchar(y) results in truncation

Per the spec (6.13 General Rules 11.c):

11) If TD is variable-length character string or large object character string, then let MLTD be the maximum length in characters of TD.

c) If SD is fixed-length character string, variable-length character string, or large object character string, then

ii) If the length in characters of SV is larger than LTD, then TV is the first LTD characters of SV. If any of the remaining characters of SV are non-<space> characters, then a completion condition is raised: warning — string data, right truncation.
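
For example, a cast like the following truncates silently today, whereas per the rule above it should raise the "string data, right truncation" warning (the comment shows the expected truncated value):

SELECT CAST('truncated' AS varchar(5));
-- result: 'trunc'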

PrestoS3FileSystem#rename does not preserve SSE properties

Extracted from: prestodb/presto#9916

PrestoS3FileSystem#rename should preserve SSE properties while copying files. Problematic line is: PrestoS3FileSystem.java:407: s3.copyObject. According to S3 Javadoc:

     * By default, all object metadata for the source object except
     * <b>server-side-encryption</b>, <b>storage-class</b> and
     * <b>website-redirect-location</b> are copied to the new destination
     * object, unless new object metadata in the specified
     * {@link CopyObjectRequest} is provided.

On the other hand, maybe PrestoS3FileSystem#rename should throw UnsupportedOperationException? It seems that it shouldn't be used at all because for S3 there should be no temporary objects when inserting into table (see: HiveWriteUtils#isS3FileSystem).

EXPLAIN CREATE TABLE AS SELECT creates a table, should be side-effect free

presto> use memory.default;
USE
presto:default> explain create table x as select 1 c;
                                                                                                                               Query Plan
-------------------------------------------------------------------------------------------------------------------------------------------------------
 - Output[rows] => [rows:bigint]
     - TableCommit[memory:INSTANCE:MemoryOutputTableHandle{table=MemoryTableHandle{connectorId=memory, schemaName=default, tableName=x, tableId=0, columnHandles=...
         - LocalExchange[SINGLE] () => partialrows:bigint, fragment:varbinary
             - RemoteExchange[GATHER] => partialrows:bigint, fragment:varbinary
                 - TableWriter => [partialrows:bigint, fragment:varbinary]
                         c := expr
                     ..................

[...]

presto:default> explain create table x as select 1 c;
Query 20180205_101139_00002_ijv9f failed: line 1:9: Destination table 'memory.default.x' already exists

select * from x;
 c
---
(0 rows)

EXPLAIN renders the query plan without actually running the query, so EXPLAIN CREATE TABLE ... AS SELECT ... should do just that.
Currently, it creates the table (but doesn't populate it).

Ref prestodb/presto#9896

Release notes for 301

Dain Sundstrom

  • all checked
  • 01-22 16:59:11 14db5b5 Add trace token to query events
  • 01-22 18:54:06 17112e0 Update node list on workers
  • 01-22 18:54:06 33c0fc0 Stop DiscoveryNodeManager in tests
  • 01-22 18:54:06 86d780a Only fire node change event on change
  • 01-22 18:54:06 b950d0f Add test for DiscoveryNodeManager listener

David Phillips

  • all checked
  • 01-23 00:10:15 33e41f3 Remove old Maven extensions
  • 01-23 00:10:15 7ef63e6 Use ValidationAssertions in TestPowerOfTwoValidator
  • 01-23 00:10:15 a5a7722 Update Sphinx plugin to 2.1
  • 01-23 00:10:15 accd094 Update to Airbase 90
  • 01-23 00:10:15 e261486 Fix undeclared dependency used for HTTP constants
  • 01-23 00:10:15 f0e4aa6 Fix deprecated usage of catchingAsync
  • 01-23 00:10:48 eb4f141 Use Xenial environment on Travis
  • 01-25 00:02:23 8456488 Move CLA bot config to clabot-config repo
  • 01-25 15:24:53 c408a78 Add current user security mode for views
  • 01-25 22:25:54 238f4b6 Update to Hive 2.3.4
  • 01-25 22:25:54 486a8f1 Fix value comparisons in TestHiveFileFormats
  • 01-25 22:25:54 507307e Remove unused Parquet classes
  • 01-25 22:39:27 95122b5 Improve error message for unsupported reduce_agg state type
  • 01-26 17:21:03 435fe92 Remove usages of TypeSignature from client
  • 01-26 17:21:03 904d511 Replace PrestoWarning with new client Warning
  • 01-26 17:21:03 950a56c Remove SPI as dependency for JDBC driver
  • 01-26 17:21:03 cbbf251 Convert JDBC driver to ClientTypeSignature
  • 01-26 17:21:03 d5dfb31 Convert JDBC driver from TimeZoneKey to ZoneId

Grzegorz Kokosiński

  • all checked
  • 01-25 00:04:33 b9a1a7e Limit access to ParametricAggregation fields
  • 01-28 03:22:46 1f284a1 Prefer SchemaTablePrefix#getSchema to create optional schema name
  • 01-28 03:22:46 5b63c49 Use Optional for nullable fields in SchemaTablePrefix
  • 01-28 03:22:46 73cc391 Prefer SchemaTablePrefix#toSchemaTableName
  • 01-28 05:03:23 efe8ca0 Add suppressed exceptions to exception that is thrown
  • 01-28 05:04:01 311abbb Extend JdbcSplit with additional predicate field
  • 01-28 05:10:15 9825bc8 Configure defaults for coordinator in DistributedQueryRunner in one place
  • 01-29 01:52:32 5368334 Fix formatting
  • 01-29 01:52:33 d13fbf9 Check if Presto is running before erasing RPM
  • 01-29 01:55:58 53ed213 Log test query durarion only if it run longer than 1s
  • 01-29 01:55:58 96cee95 Log Presto node role when creating distributed testing cluster
  • 01-29 01:55:58 bee9579 Log testing Presto server URL when it starts
  • 01-29 01:55:58 e77d698 Do not log test query uri
  • 01-29 07:20:30 120c28f Allow JDBC connector to define procedure
  • 01-30 01:55:54 fa65182 Add support for getTableStatistics to base-jdbc connector
  • 01-30 01:58:54 68393b0 Make use of hive temp staging directory configurable
  • 01-30 01:58:54 9227286 Make hive temporary staging directory location configurable

Karol Sobczak

  • all checked
  • 01-25 01:47:23 9305d99 Add write_hive_metastore_recording procedure
  • 01-25 05:21:20 47ba96e Add missing withGeneratedName to RecordingHiveMetastore exporter
  • 01-28 01:43:00 eb7a512 Add metadata recorder support to Glue
  • 01-29 02:52:01 eb7f677 Make dummy HiveEventClient use DEBUG log instead of INFO
  • 01-29 04:25:50 0481f18 Product tests for SHOW ROLES
  • 01-29 04:25:50 0546433 Add docs for SHOW ROLES
  • 01-29 04:25:50 07c5496 Introduce .information_schema.roles table
  • 01-29 04:25:50 0ba3120 Add grantor to HivePrivilegeInfo
  • 01-29 04:25:50 0beeb77 Add grantor_type and grantee_type columns to table_privileges
  • 01-29 04:25:50 0c2fa44 Make ThriftMetastoreUtil.list*Roles methods to return Stream
  • 01-29 04:25:50 0c3406c Add SHOW ROLE GRANTS syntax
  • 01-29 04:25:50 0f2e2e3 Move parsePrivilege to MetastoreUtil
  • 01-29 04:25:50 18603de Reorder methods in HivePrivilegeInfo
  • 01-29 04:25:50 1d11c06 Make list*TablePrivileges to return Stream
  • 01-29 04:25:50 20cf120 Fix listing privileges when admin role is set
  • 01-29 04:25:50 21ceed6 Catalog access control for roles
  • 01-29 04:25:50 23e3f31 Add Grant/Revoke/List roles authorization to the SPI
  • 01-29 04:25:50 2438df8 Assign admin role to subset of users in FileHiveMetastore
  • 01-29 04:25:50 266fa20 Implement listRoleGrants() in Hive
  • 01-29 04:25:50 26b421a Remove redundant checkDatabasePermission methods
  • 01-29 04:25:50 31e9d26 Reorder methods in SqlStandardAccessControl
  • 01-29 04:25:50 34ffbaa Accept ROLE in GRANT/REVOKE Privileges statements
  • 01-29 04:25:50 3dca463 Implement Grant/Revoke/ListApplicableRoles in Hive
  • 01-29 04:25:50 42ba503 Introduce CREATE ROLE and DROP ROLE statements
  • 01-29 04:25:50 4b90148 Add listRoleGrants to the SPI
  • 01-29 04:25:50 53b13c9 Refactor canCreateView security checks
  • 01-29 04:25:50 5d9615c Add SHOW CURRENT ROLES
  • 01-29 04:25:50 5db2f71 Remove unused metastore statistics beans
  • 01-29 04:25:50 5f9c74d Introduce GRANT/REVOKE roles statements
  • 01-29 04:25:50 62ed7d2 Introduce access control for GRANT/REVOKE ROLE
  • 01-29 04:25:50 671d653 Introduce ENABLED_ROLES view
  • 01-29 04:25:50 6910a27 Enumerate roles until enabled role is found
  • 01-29 04:25:50 709de9d Simplify checkTablePermission
  • 01-29 04:25:50 729f646 Fix rewrite SHOW GRANTS as a SELECT query
  • 01-29 04:25:50 7427e3a Extract role name constants
  • 01-29 04:25:50 745e4b8 Expose Create/Drop/List roles methods in SPI
  • 01-29 04:25:50 747c276 Introduce APPLICABLE_ROLES view
  • 01-29 04:25:50 776b453 Move PrincipalType to presto-spi
  • 01-29 04:25:50 8234650 Store catalog selected roles in Identity
  • 01-29 04:25:50 8464cbc Access control for SHOW ROLE GRANTS and SHOW CURRENT ROLES
  • 01-29 04:25:50 87063ee More product tests for SET ROLE
  • 01-29 04:25:50 91b38e4 Implement Create/Drop/List roles in Hive connector
  • 01-29 04:25:50 942cde9 Rename getGrantOptionForPrivilege to hasGrantOptionForPrivilege
  • 01-29 04:25:50 958edea Rewrite SHOW ROLES as a select query
  • 01-29 04:25:50 a20e9af Introduce isTableOwner method for readability
  • 01-29 04:25:50 aecd3b2 Set admin role before setting system properties in storage format test
  • 01-29 04:25:50 bbc0233 Make HiveQueryRunner default user have admin role
  • 01-29 04:25:50 be8189c Implement SET ROLE in Hive Connector
  • 01-29 04:25:50 c603474 Consider enabled roles for permissions
  • 01-29 04:25:50 c8d8733 Implement SHOW ROLE GRANTS rewrite
  • 01-29 04:25:50 cae5a86 Add access control checks for SHOW ROLES
  • 01-29 04:25:50 cb2da7b Refactor GRANT/REVOKE in Hive
  • 01-29 04:25:50 ce26168 Add docs for SHOW ROLE GRANTS
  • 01-29 04:25:50 d455185 Remove hive privilege null check
  • 01-29 04:25:50 dd9b64c Introduce SET ROLE statement
  • 01-29 04:25:50 de9890c Refactor HivePrivilegeInfo
  • 01-29 04:25:50 e5f2784 Allow all for admin role
  • 01-29 04:25:50 e8f81d9 Prepare metastore interface to accept ROLE for GRANT/REVOKE
  • 01-29 04:25:50 eadfb43 Remove 'IN catalog' syntax from role management commands
  • 01-29 04:25:50 f147d32 Add SHOW ROLES to the parser
  • 01-29 04:25:50 f5fe03d Document role management
  • 01-29 04:25:50 f5fe24d Introduce ConnectorIdentity
  • 01-29 04:25:50 f958f81 Refactor RecordingHiveMetastore to make it consistent between releases
  • 01-29 04:25:50 fb230fb Implement SET ROLE
  • 01-29 04:25:50 fb93b6e Fix listing privileges of tables owned by an user
  • 01-29 11:41:03 00aa33b Fix creating non-bucketed empty partition
  • 01-29 11:41:03 82574c8 ANALYZE statement: Parser
  • 01-29 11:41:03 82bc87f Add error message on column count mismatch when making partition name
  • 01-29 11:41:03 b747ecb ANALYZE statement: Implement ANALYZE in Hive connector
  • 01-29 11:41:03 be61588 Validate no partition value is null when making partition name
  • 01-29 11:41:03 df3f4df ANALYZE statement: Analyzer, planner and execution
  • 01-29 11:41:03 ecd0c0f Move getPartitionLocation to MetastoreUtil
  • 01-29 11:41:03 f800115 Fix partition column range statistics for empty partitions
  • 01-29 11:50:23 0c7d4ed Reduce parquet logging in Hive tests
  • 01-29 11:51:10 23f8793 Use processedInput* metrics in PipelineContext#getInput*

Martin Traverso

  • all checked
  • 01-22 15:20:16 bb0fd12 Update travis slack notification token
  • 01-25 14:17:15 d04152e Remove deprecated.legacy-log-function feature toggle
  • 01-25 14:17:15 f6044e6 Add standard log function
  • 01-25 14:27:45 ac02f3d Simplify coalesce expressions
  • 01-25 14:27:45 cf17d08 Flatten coalesce expressions
  • 01-25 19:31:27 a63829d Remove legacy array_agg semantics
  • 01-27 12:58:31 7c1250a Enable fusion of operators for queries with SYSTEM sampling

Piotr Findeisen

  • all checked
  • 01-24 12:52:31 0495b79 Add TYPE_NAME to JdbcTypeHandle
  • 01-24 12:52:44 a13d5d7 Remove unsupported types on write path in JDBC connectors
  • 01-24 12:52:44 b6faf92 Remove unsupported types on read path in JDBC connectors
  • 01-24 12:54:40 1455bcf Add section explaining legacy and new timestamp
  • 01-26 09:27:49 23e709f Fix port allocation race condition in EmbeddedZookeeper
  • 01-26 09:27:49 6d2cbce Use KafkaServer directly
  • 01-26 09:27:49 a881632 Code cleanup
  • 01-26 09:27:49 b0f1795 Update Kafka client to 0.10.2.2
  • 01-26 09:27:49 b410577 Fix port allocation race condition in EmbeddedKafka
  • 01-29 02:08:04 3fef5b6 Report top memory consumers when local memory limit is exceeded
  • 01-29 02:49:52 6dada93 Update to docker images version 11
  • 01-29 02:49:52 a1c4802 Support LDAP with anonymous bind disabled

Dynamic filtering

fact_table - a large fact table, partitioned on date_key
d_date - a small dimension table, with significant correlation between its columns and date_key.

Currently for the following query:

SELECT 
    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE 
    b.year = 2017;

the larger probe table is fully scanned. After scanning the build table (the right side), Presto could extract the information that for b.year = 2017 there is only a small number of matching b.date_key values. This information could be used to narrow down the table scan. In other words, the above query could be dynamically "rewritten" to the equivalent of:

SELECT 
    count(*)
FROM 
    fact_table a 
JOIN d_date b 
ON a.date_key = b.date_key 
WHERE a.date_key IN (20170101,20170102, etc...)

Design doc: https://docs.google.com/document/d/1TOlxS8ZAXSIHR5ftHbPsgUkuUA-ky-odwmPdZARJrUQ/edit

TIMESTAMP behaviour does not match sql standard

Problem description

See below.

Roadmap

(Note: since GH doesn't send updates for edits, when adding a bullet please be sure to add a comment too.)

Original problem description

It seems like the meaning of the TIMESTAMP and TIMESTAMP WITH TIMEZONE datatypes in Presto is not at all what is specified by the SQL standard (and what other databases do).

This is my understanding of SQL 2003 standard (4.6.2 Datetimes):

TIMESTAMP WITH TIMEZONE represents an absolute point in time. Typically, databases store it internally as seconds since the epoch in some fixed timezone (usually UTC). When querying TIMESTAMP WITH TIMEZONE data, the values are presented to the user in the session timezone (the session timezone is used just for presentation purposes).

TIMESTAMP does not represent a specific point in time, but rather a reading of a wall clock and calendar. Selecting values from a TIMESTAMP column should return the same result set regardless of the client's session timezone.

Presto's semantics, however, are different:

TIMESTAMP seems to do what TIMESTAMP WITH TIMEZONE should.

TIMESTAMP WITH TIMEZONE encodes explicit timezone information into each value stored in a table. The SQL standard does not define a type like that, and it does not seem very practical. Assuming that values selected from TIMESTAMP WITH TIMEZONE are presented to the user in the session timezone anyway, the per-row timezone information can be stripped away and all values can be stored in some arbitrary fixed timezone (e.g., UTC).

Please comment on the semantics. It seems wrong. Why was this choice made? It is hard to believe that it was not done intentionally.

@haozhun clarifying comments

1

@losipiuk I agree with you that Timestamp w/o TZ in Presto is broken. I do NOT agree that Timestamp w TZ should behave like Instant. I believe it should also have an associated time zone. (In other words, I believe Timestamp w TZ is implemented correctly today.) Below is an excerpt of something I wrote early last year that summarizes the current behavior and my understanding.


To summarize how things work today:

  • Timestamp w TZ = DateTime in joda = ZonedDateTime in java8
  • Timestamp w/o TZ = Instant in joda = Instant in java8.
    • In other words, Timestamp w/o TZ represents an instant in time, they just don't have a timezone associated with them.
    • When you print them out for human consumption, you need to resort to user's session time zone.
    • When you turn it into a Timestamp w TZ, you just stick that time zone to it without changing the instant.

The way I understand it

  • Timestamp w TZ = DateTime in joda = ZonedDateTime in java8
  • Timestamp w/o TZ = LocalDateTime in joda = LocalDateTime in java8.
    • In other words, Timestamp w/o TZ represents a specific Year/Month/Day/Hour/Minute/Second. But it doesn't represent a specific instant. (Side note: this concept does not need expensive representation. It can still be represented as millis/nanos since epoch, observing only chronology rules and no tz rules)
    • When you print it out, it should be printed out as is.
    • If you want to turn it into a Timestamp w TZ, you need to resort to the user's session time zone for the missing piece of information (and this is not always possible because Timestamp w TZ has gaps).

Here is the reason I believe the first understanding is inconsistent. I can only think of one possible interpretation for the other 3 concepts:

  • Time w TZ = no single class correspondence in joda = OffsetTime in java8
    • Offset and Zone are different in that only a constant offset exists for OffsetX; a political zone is not allowed.
    • ZonedTime doesn't make sense and doesn't exist. ZonedDateTime and OffsetDateTime both exist.
  • Time w/o TZ = LocalTime in joda = LocalTime in java8
  • Date = LocalDate in joda = LocalDate in java8

Note the inconsistency between the interpretations of Timestamp w/o TZ and Time w/o TZ if we adopt the first interpretation of Timestamp w/o TZ (Instant vs LocalTime), whereas under the second interpretation they are consistent (LocalDateTime vs LocalTime).

I went to SQL spec for the definitive answer:

  • Abbreviations
    • SV is the source value, TV is the target value
    • UTC is the UTC component of SV or TV (if and only if the source or target has time zone)
    • TZ is the timezone displacement of SV or TV (if and only if the source or target has time zone)
    • STZD is the SQL-session default time zone displacement
  • To convert Timestamp w/o TZ to Timestamp w/ TZ: TV.UTC = SV - STZD; TV.TZ = STZD
  • To convert Timestamp w/ TZ to Timestamp w/o TZ: TV = SV.UTC + SV.TZ

I believe these two rules prove that the SQL spec agrees with my interpretation. Let's consider a cast from Timestamp w/o TZ to Timestamp w/ TZ:

  • Cast from Timestamp w/o TZ
    • Given, SV = 3600000 millis (1 hour)
  • Cast to Timestamp w TZ in MPK
    • STZD = America/Los_Angeles
    • TV.UTC = 3600000 - (-28800000) = 32400000, TV.TZ=America/Los_Angeles
    • When written out in human format, TV is: 1970-01-01 01:00:00 America/Los_Angeles, TV.UTC is 1970-01-01 09:00:00
  • Cast to Timestamp w TZ in Shanghai
    • STZD = Asia/Shanghai
    • TV.UTC = 3600000 - 28800000 = -25200000, TV.TZ=Asia/Shanghai
    • When written out in human format, TV is: 1970-01-01 01:00:00 Asia/Shanghai, TV.UTC is 1969-12-31 17:00:00

Under the first interpretation, these two casts should have yielded equal results. Under the second interpretation, they would produce different results. The rule in the SQL spec produces two different results.

Lastly, a side note from me: both interpretations can produce results that depend on the user's session time zone:

  • Under current Presto interpretation of Timestamp w/o TZ:
    • Same storage representation. Let's say 60000 milliseconds.
    • Prints out as 1969-12-31 16:01:00 in MPK, and 1970-01-01 08:01:00 in China. They are different.
    • When cast to Timestamp w TZ, become 1969-12-31 16:01:00 America/Los_Angeles in MPK, and 1970-01-01 08:01:00 Asia/Shanghai in China. They are equal.
  • Under SQL spec interpretation of Timestamp w/o TZ:
    • Same storage representation. Let's say 60000 milliseconds.
    • Prints out as 1970-01-01 00:01:00 in MPK, and 1970-01-01 00:01:00 in China. They are the same.
    • When cast to Timestamp w TZ, become 1970-01-01 00:01:00 America/Los_Angeles in MPK, and 1970-01-01 00:01:00 Asia/Shanghai in China. They are different.

Under the SQL spec, cast from timestamp w/o TZ to timestamp w/ TZ can produce different results based on user time zone. As a result, I guess this cast probably should NOT have been implicit.

2

@losipiuk, @dain, and I reached agreement:

  • Timestamp with Timezone in Presto is implemented properly today (like DateTime in joda, ZonedDateTime in jdk8).
  • Timestamp in Presto is like Instant in joda/jdk8 today. It should be like LocalDateTime in joda/jdk8.
  • Extracting hour from 2016-01-01 12:00:00 <TZ> should return 12 no matter what <TZ> is put in the template (see the example after this list).
  • As part of fixing Timestamp in Presto, we should remove implicit coercion from Timestamp to Timestamp with Timezone because the result value is environment dependent.
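
A minimal example of the agreed-upon semantics (this shows the expected result, not necessarily the behavior at the time this issue was filed):

SELECT extract(hour FROM TIMESTAMP '2016-01-01 12:00:00');
-- expected: 12, regardless of the session time zone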

Notes

Ticket migrated from prestodb/presto#7122, prestodb/presto#10326

Sampled partition stats have greatly underestimated NDVs for some columns

Incorporated from: prestodb/presto#11959 (discussion there)

tpch_sf100. orders (partitioned):

presto> show stats for hive.tpch_sf100_orc_part.orders;
  column_name  |      data_size      | distinct_values_count | nulls_fraction |   row_count    | low_value  | high_value
---------------+---------------------+-----------------------+----------------+----------------+------------+------------
 orderkey      | NULL                |               62881.0 |            0.0 | NULL           | 162        | 599999942
 custkey       | NULL                |               62881.0 |            0.0 | NULL           | 2          | 14999998
 orderstatus   |      1.4999663244E8 |                   3.0 |            0.0 | NULL           | NULL       | NULL
 totalprice    | NULL                |               62881.0 |            0.0 | NULL           | 851.17     | 571417.48
 orderpriority |     1.25999070546E9 |                   5.0 |            0.0 | NULL           | NULL       | NULL
 clerk         |      2.2499494866E9 |               49276.0 |            0.0 | NULL           | NULL       | NULL
 shippriority  | NULL                |                   1.0 |            0.0 | NULL           | 0          | 0
 comment       | 7.275173681759999E9 |               62881.0 |            0.0 | NULL           | NULL       | NULL
 orderdate     | NULL                |                2406.0 |            0.0 | NULL           | 1992-01-01 | 1998-08-02
 NULL          | NULL                | NULL                  | NULL           | 1.4999663244E8 | NULL       | NULL
(10 rows)
presto> show stats for hive.tpch_sf100_orc.orders;
  column_name  |      data_size       | distinct_values_count | nulls_fraction | row_count | low_value  | high_value
---------------+----------------------+-----------------------+----------------+-----------+------------+------------
 orderkey      | NULL                 |                 1.5E8 |            0.0 | NULL      | 1          | 600000000
 custkey       | NULL                 |           1.0165457E7 |            0.0 | NULL      | 1          | 14999999
 orderstatus   |                1.5E8 |                   3.0 |            0.0 | NULL      | NULL       | NULL
 totalprice    | NULL                 |           3.4038282E7 |            0.0 | NULL      | 811.73     | 591036.15
 orderdate     | NULL                 |                2449.0 |            0.0 | NULL      | 1992-01-01 | 1998-08-02
 orderpriority | 1.2600248040000002E9 |                   5.0 |            0.0 | NULL      | NULL       | NULL
 clerk         |               2.25E9 |              100816.0 |            0.0 | NULL      | NULL       | NULL
 shippriority  | NULL                 |                   1.0 |            0.0 | NULL      | 0          | 0
 comment       |        7.275038709E9 |          1.12463191E8 |            0.0 | NULL      | NULL       | NULL
 NULL          | NULL                 | NULL                  | NULL           |     1.5E8 | NULL       | NULL

Should we extrapolate NDVs instead? It seems that partitions are often different chunks of data, so their NDVs don't overlap.
Alternatively, we could store an HLL state per column as an auxiliary partition property and calculate the extrapolation based on merged HLLs.

TABLESAMPLE SYSTEM prevents scan/filter/project operator fusion

The TABLESAMPLE node in the plan for the following query prevents operator fusion:

presto> explain (type distributed) select length(name) from tpch.tiny.nation TABLESAMPLE system (50);
                                      Query Plan                                       
---------------------------------------------------------------------------------------
 Fragment 0 [SINGLE]                                                                   
     Output layout: [length]                                                           
     Output partitioning: SINGLE []                                                    
     Grouped Execution: false                                                          
     - Output[_col0] => [length:bigint]                                                
             _col0 := length                                                           
         - RemoteSource[1] => [length:bigint]                                          
                                                                                       
 Fragment 1 [SOURCE]                                                                   
     Output layout: [length]                                                           
     Output partitioning: SINGLE []                                                    
     Grouped Execution: false                                                          
     - Project[] => [length:bigint]                                                    
             length := "length"("name")                                                
         - Sample[SYSTEM: 0.5] => [name:varchar(25)]                                   
             - TableScan[tpch:nation:sf0.01, grouped = false] => [name:varchar(25)]    
                     Cost: {rows: 25 (302B), cpu: 302.00, memory: 0.00, network: 0.00} 
                     name := tpch:name                                                 

Automated stress tests

@martint, @electrum, @dain Can you share your experience with stress testing? What do you think needs to be tested:

  • what data and queries?
  • what workload (concurrency)?
  • how long should it run to catch production-like issues?
  • how much hardware do we need?
  • how can this be automated?
  • what needs to be verified (memory leaks, garbage collection, correctness)?

Are there any companies that would participate in this by providing "real" data and queries, or even hardware?

I was thinking about spinning up a cluster (or clusters) on AWS and running the above frequently (e.g., once a week). It would be good to do this from the beginning: we currently have few such tests, so automating them should be easy. Then we could iteratively improve it.

Add hidden $partition column to Hive connector

It would help with computing or filtering on the raw partition names that other Hive-related tools provide, e.g. ds=2017-05-23/type=abc/ts=2017-05-23+21%3A45%3A99.

It should support pushdown for partition pruning.
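
For illustration, usage could look like the following sketch (the table name is hypothetical; the hidden column must be quoted):

-- Count rows per raw partition name
SELECT "$partition", count(*)
FROM hive.web.request_logs
GROUP BY "$partition";

-- With pushdown, a filter on the raw partition name should prune partitions
SELECT *
FROM hive.web.request_logs
WHERE "$partition" = 'ds=2017-05-23/type=abc';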

  • Add initial $partition column (#3582)
  • Support pushdown for partition pruning

Optimize joins with "similar" subqueries

Optimize queries of the following shape:

SELECT * 
FROM (
   SELECT k, agg1, agg2
   FROM t
   GROUP BY k
) a
JOIN (
   SELECT k, agg3, agg4
   FROM t
   GROUP BY k
) b
ON (a.k = b.k)

as

SELECT k, agg1, agg2, agg3, agg4
FROM t
GROUP BY k

More generally, if we know k contains distinct values, optimize

SELECT * 
FROM (
   SELECT k, v1, v2
   FROM t
) a
JOIN (
   SELECT k, v3, v4
   FROM t
) b
ON (a.k = b.k)

as

SELECT k, v1, v2, v3, v4 
FROM t

Release notes for 302

Dain Sundstrom

  • all checked
  • 01-31 15:18:24 10546ea Reformat ParquetTester
  • 01-31 15:18:24 bf01fcd Add dynamic batch sizing in Parquet reader
  • 01-31 15:27:30 99bd6ba Rename Context method to match field names
  • 01-31 15:27:30 bacf59f Add physical* and network* metrics
  • 01-31 15:27:30 e81eefe Remove unused methods
  • 01-31 16:30:15 1fad470 Disable ConnectorIdentity#equals and #hashCode
  • 01-31 16:30:15 4a4bf3f Improve ConnectorIdentity#toString
  • 02-04 14:45:12 6903d9b Add CLI option --trace-token to set the trace token #56.
  • 02-04 16:55:23 f7fe35c Fix starvation while waiting for minimum number of workers
  • 02-05 11:41:04 d6fb5e0 Add peak task user memory usage to query stats
  • 02-05 12:02:33 d7d6166 Add spilled data to query stats, CLI, and Web UI
  • 02-05 12:32:15 7ff3b40 Allow function implementation choices to have different dependencies
  • 02-06 09:06:30 275bb36 Remove system pool entirely

David Phillips

  • all checked
  • 01-31 17:15:52 0b55a32 Remove unused class ConnectorAnalyzeMetadata
  • 02-02 01:31:23 3fc4aff Use class reference for BlockMissingException
  • 02-02 16:17:58 02711a8 Move ScheduledSplit to execution package
  • 02-02 16:17:58 6082254 Move OutputBuffers to execution.buffer package
  • 02-02 16:17:58 8ca92f2 Move TaskSource to execution package
  • 02-02 16:37:56 d983340 Cleanup code in ORC module
  • 02-04 14:26:47 81d77ca Fix grammar in error messages for CDF functions
  • 02-04 15:30:50 20fb7d8 Document hive.recursive-directories
  • 02-05 14:04:02 c2f23eb Rename StageExecutionStrategy to StageExecutionDescriptor
  • 02-05 15:32:28 8ba4f26 Rename methods in StageExecutionDescriptor
  • 02-05 17:04:29 30f7812 Improve performance for positioned reads on S3
  • 02-05 17:04:29 5c57d92 Throw IOException for unrecoverable S3 errors
  • 02-05 22:18:08 0a543f6 Add option not to delete target directory on failure
  • 02-05 22:18:08 14cc065 Remove redundant null check
  • 02-06 11:07:04 4575f98 Introduce DynamicConfigurationProvider
  • 02-06 11:07:04 4daf538 Add support for extra credential in JDBC driver
  • 02-06 11:07:04 50224b3 Add HiveGcsModule
  • 02-06 11:07:04 70f733f Rename HdfsConfigurationUpdater to HdfsConfigurationInitializer
  • 02-06 11:07:04 9a09af1 Add credentials in Identity and ConnectorIdentity
  • 02-06 11:07:04 de7b516 Add support for extra credential in Presto CLI
  • 02-06 11:07:04 fa02de3 Add header for extra credential

Grzegorz Kokosiński

  • all checked
  • 02-01 02:13:30 1d62598 Remove PlanNodeMatcher
  • 02-01 02:13:30 431e230 Static import DEFAULT_MATCHER
  • 02-01 02:13:30 4efe280 Add context parameter to custom pattern matching functions
  • 02-01 02:13:30 5d57936 Make Matcher to return the Stream of Match
  • 02-01 02:13:30 6375f4c Remove Matcher, move match to Pattern
  • 02-01 02:13:30 642330b Make Pattern#previous to be Optional
  • 02-01 02:13:30 68e22ec Add missing requireNonNull to matching related constructors
  • 02-01 02:13:30 70800a9 Remove unused variable
  • 02-01 02:13:30 87ed0f3 Remove value from Match
  • 02-01 02:13:30 e0d02ad Inline Patterns implementations from DefaultMatcher
  • 02-01 12:34:13 5d384b2 Add Elasticsearch connector
  • 02-04 13:38:30 b59e583 Make Git quiet when running tests in Travis
  • 02-05 01:32:25 34858db Handle S3 empty directories as directories

Karol Sobczak

  • all checked
  • 02-01 10:11:38 ca11249 Add physical* and internalNetwork* stats to QueryStatistics
  • 02-04 06:45:39 c3b113e Do not reset configuration properties to default values
  • 02-04 06:45:39 fce297a Use passed featuresConfig in LocalQueryRunner
  • 02-04 09:00:10 1a867a6 Expose plan node stats and cost in event listener
  • 02-04 09:00:10 552cf2a Document that some QueryStatistics field are not backward compatible
  • 02-05 02:22:38 c3675ee Fix Javadoc for QueryStatistics#planNodeStatsAndCosts
  • 02-06 04:41:11 0a9b804 Revoke memory after initial output page has been produced in tests
  • 02-06 04:41:11 1492b3f Return spill future directly
  • 02-06 04:41:11 5fef5aa Allow memory revoke only during operator finish phase
  • 02-06 04:41:11 79744f2 Convert revocable memory to user memory on hash aggregation finish
  • 02-06 04:41:11 e69b668 Produce more than single page in testHashAggregation

Martin Traverso

  • all checked
  • 02-04 11:37:33 d5dfc99 Preserve quotedness in QualifiedName
  • 02-04 16:46:08 7569071 Add approx_distinct for boolean
  • 02-04 21:10:28 890a6a0 Clarify handling of constant properties
  • 02-04 21:10:28 c9f8a23 Propagate partition properties for full outer join
  • 02-05 06:56:16 12c9b50 Support boolean type in EXPLAIN IO
  • 02-05 06:56:16 efc95d6 Rename classes to match coding style
  • 02-05 23:24:22 7567baf Add SphericalGeography type

Piotr Findeisen

  • all checked
  • 02-01 00:24:27 0c72574 Remove TODO related to stats caching
  • 02-01 00:24:27 363a3c3 Capture TaskManagerConfig on a field in LocalQueryRunner
  • 02-01 00:24:27 dc1d66f Add Exchange before GroupId to improve Partial Aggregation
  • 02-01 03:24:50 eddafb5 Fix merge conflict
  • 02-01 07:16:18 1116011 Rename VarcharType#getLengthSafe to #getBoundedLength
  • 02-01 07:16:18 8af7888 Remove deprecated VarcharType#getLength method
  • 02-01 07:16:58 27a1b0e Minor code fixes in AbstractTestParquetReader
  • 02-01 07:16:58 d7210b7 Fix information_schema.columns queries for Kafka, Redis
  • 02-02 13:10:02 ddf2421 Document note about LDAPS and JDK Upgrade
  • 02-04 05:57:10 a873894 Remove redundant usages of StringBuilder
  • 02-04 05:57:10 c290b17 Import String.join statically
  • 02-04 08:57:57 656be2c Make service principal hostname configurable in Kerberos auth
  • 02-04 12:12:52 157ad3a Fix inconsistencies in hostname configuration property for Kerberos auth
  • 02-05 06:14:47 1fa2cba Document RecordCursor#getReadTimeNanos
  • 02-05 06:14:47 c48c8b6 Remove incorrect getReadTimeNanos() implementations
  • 02-05 07:45:50 0ebbea6 Check overlapping on Domain directly in TupleDomainParquetPredicate
  • 02-05 07:45:50 1f4caa8 Make #getDomain return nonnull value
  • 02-05 07:45:50 7296700 Assume Domain from file cannot be isNone()
  • 02-05 07:45:50 7d2f5a9 Use known Presto Type instead of reconstructing
  • 02-05 07:45:50 8e3bf91 Remove redundant else
  • 02-05 07:45:50 972fdec Remove unused ParquetTypeUtils#getPrestoType method

Simplify JOIN planning

The JOIN planning code in RelationPlanner currently has a bunch of complicated logic that attempts to derive and extract the terms of equality comparisons, etc. Now that JoinNode can contain an arbitrary expression as the join criteria, we should simplify this code to not do that, and leave it to the optimizers to push down predicates as necessary.

Implement MULTISET type, syntax, and functions

4.10.3 Multisets

A multiset is an unordered collection. Since a multiset is unordered, there is no ordinal position to reference individual elements of a multiset.

A multiset type is a <collection type>. If MT is some multiset type with element type EDT, then every value of MT is a multiset of EDT.

Let M1 and M2 be multisets of EDT. M1 and M2 are identical if and only if M1 and M2 have the same cardinality n, and for each element x in M1, the number of elements of M1 that are identical to x, including x itself, equals the number of elements of M2 that are identical to x.

Let n1 be the cardinality of M1 and let n2 be the cardinality of M2. M1 is a submultiset of M2 if, for each element x of M1, the number of elements of M1 that are not distinct from x, including x itself, is less than or equal to the number of elements of M2 that are not distinct from x.
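
For intuition, a few values illustrating the identity and submultiset rules above (the MULTISET constructor is the standard's enumeration syntax; none of this is executable in Trino today):

MULTISET [1, 2, 2]   -- identical to MULTISET [2, 1, 2]: same cardinality, same per-element counts
MULTISET [1, 2]      -- a submultiset of MULTISET [1, 2, 2]
MULTISET [2, 2, 2]   -- not a submultiset of MULTISET [1, 2, 2]: 2 occurs three times vs. twice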

IS DISTINCT FROM bug with ObjectIdFunctions

This is a copy of the issue from the original prestodb repo: prestodb/presto#11869

I'm getting the error:

com.facebook.presto.metadata.OperatorNotFoundException: 'IS DISTINCT FROM' cannot be applied to ObjectId, ObjectId

when trying to select from a view that pulls data from a MongoDB connector despite 'IS DISTINCT FROM' not being used in the view. I'm assuming there's some background logic that happens with joins/aggregations that uses the 'IS DISTINCT FROM' command.

The SELECT works in Presto version 0.188 but fails in version 0.203 when testing a Presto upgrade.

Ambiguous declarations for concat() function

The concat function allows the following forms:

concat(array(T), array(T))
concat(T, array(T))
concat(array(T), T)

This causes problems when one of the arguments is array[].

If T is array(U) in the second form, the invocation looks like concat(array(U), array(array(U))). A call like concat(array[1,2], array[]) is ambiguous, since array[] is a valid value for both array(T) and array(array(U)). A similar issue happens for the third form.

For example:

ARRAY[1,2,3] || ARRAY[]

satisfies both:

  • array(e) || array(e), where e is integer
  • e || array(e), where e is array(integer)

We originally added it for convenience, but, in hindsight, it turns out to have been a mistake. We should leave only the variants that accept two or more arrays and introduce dedicated functions for single-element append/prepend.

Incorporate WorkProcessor in operators

Issue tracking the effort to support:

  • cross-operator lazy pages (starting from source operators)
  • cleaning up/simplifying the contract between operators via WorkProcessor pipelines
  • providing a base for further improvements (e.g., on-stack rows without Page materialization, Graal)

The advantage of cross-operator lazy pages is that we can avoid IO when queries are highly selective. This requires that significant processing happens in the source stage, but that is increasingly the case with improvements like the CBO ("broadcast joins") or grouped execution.

Stages are:

  • Stage 1
  • base PageProcessor on WorkProcessor
  • Stage 2
  • internally base ScanFilterAndProject on WorkProcessor. The pipeline would look as follows:
split singleton -> [flatMap] -> pages source
                -> [transform] -> page processor 
                -> [transform] -> merge pages

or if split is cursor based

split singleton -> [flatMap] -> cursor source -> [transform] -> merge pages
  • internally base FilterAndProject on WorkProcessor. The pipeline would look as follows:
page buffer -> [transform] -> page processor -> [transform] -> [merge pages]
  • Stage 3
  • create interface for operators that are based on WorkProcessor pipelines
  • create a standardized abstract operator class for operators that are internally based on WorkProcessor pipelines
  • combine operators that are based on WorkProcessors via dedicated "gluing" operator
  • base TopNOperator on WorkProcessor pipelines (fast data exploration!)

See some previous discussion in the old repo: prestodb/presto#12096

Support "array by table subquery" constructor

From the spec:

<array value constructor> ::=
    <array value constructor by enumeration>
  | <array value constructor by query>

<array value constructor by query> ::=
  ARRAY <table subquery>
  1. If <array value constructor by query> is specified, then
    a) The <query expression> QE simply contained in the <table subquery> shall be of degree 1 (one). Let
    ET be the declared type of the column in the result of <query expression>.
    b) The declared type of the <array value constructor by query> is array with element type ET and maximum
    cardinality equal to the implementation-defined maximum cardinality IMDC for such array types.
  1. The result of <array value constructor by query> is determined as follows:
    a) QE is evaluated, producing a table T. Let N be the number of rows in T.
    b) If N is greater than IMDC, then an exception condition is raised: data exception — array data, right truncation.
    c) The result of <array value constructor by query> is an array of N elements such that for all i, 1 (one) ≤ i ≤ N, the value of the i-th element is the value of the only column in the i-th row of T.
    NOTE 167 — The ordering of the array elements is effectively determined by the General Rules of Subclause 7.13, “<query expression>”.

In simple terms,

SELECT ARRAY (SELECT x FROM t)

is equivalent to

SELECT (SELECT array_agg(x) FROM t)
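
For example, under the proposed syntax the two queries below would produce equivalent single-row, single-column results; the second form already works today:

-- Proposed (not yet supported):
SELECT ARRAY (SELECT name FROM tpch.tiny.nation WHERE regionkey = 0);

-- Existing equivalent:
SELECT (SELECT array_agg(name) FROM tpch.tiny.nation WHERE regionkey = 0);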

Efficient Adaptive Partial Aggregations

Extracted from: prestodb/presto#10305 (see discussion there)

Problems

  1. A high-cardinality GROUP BY can result in one row per group in the partial aggregation, which can make the query significantly more expensive than having no partial aggregation at all.
  2. There is no partial aggregation for distinct aggregations, which can cause slow queries since all distinct calculations happen on a few machines (see the example after this list).
  3. Flushing the partial aggregation is all or nothing, which removes the benefit for skewed groups (high row reduction for the big groups).
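
A minimal sketch of problem 2, using TPC-H data: the distinct computation for each of the three orderstatus groups is concentrated on a handful of nodes, because no partial aggregation is applied to DISTINCT inputs.

SELECT orderstatus, count(DISTINCT custkey)
FROM tpch.sf100.orders
GROUP BY orderstatus;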

Proposal

Change partial aggregation to detect ineffective partial aggregations and switch to a mode where the raw input is output instead of intermediate results. In addition, the partial should be able to make this decision on a per-group basis so groups with a high row count can still get the benefits of partial aggregations.

For distinct aggregations in groups that are performing partial aggregation, the operator would determine whether a row is distinct for any of the aggregates and, if so, output the raw input along with flag columns marking which aggregates the row is distinct for.

The net result is that the aggregation may output rows for three distinct cases:

  1. intermediate results for an enabled group
  2. raw input for a disabled group
  3. distinct raw input for distinct aggregates (in an enabled group)

The output descriptor for a partial aggregation would look something like this:

  1. row_type: either intermediate, raw, or distinct
  2. intermediate outputs for non-distinct aggregations; null if row_type is not intermediate
  3. raw input columns; null if row_type is intermediate. For the distinct row_type, only inputs to distinct aggregations may be present.
  4. distinct flags; if set, the input row should be processed by the related aggregation. Null if row_type is not distinct.

PrestoS3FileSystem breaks read-after-write strict S3 consistency

The method com.facebook.presto.hive.s3.PrestoS3FileSystem#create checks for file existence when overwrite == false. This existence check breaks strict read-after-write S3 consistency, per the S3 documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in
your S3 bucket in all regions with one caveat. The caveat is that if you make a
HEAD or GET request to the key name (to find if the object exists) before
creating the object, Amazon S3 provides eventual consistency for read-after-write. 

The existence check itself is weak, as the file might be created while data is being gathered in the staging directory for upload.

I propose to remove the check, or to use an S3 LIST request instead, which does not break read-after-write consistency.

FYI: @findepi

JMX MBeans are not accessible

When a JMX MBean is exported via:

newExporter(binder).export(GlueHiveMetastore.class);

it is not listed, but when exported via:

newExporter(binder).export(GlueHiveMetastore.class)
  .as(generator -> generator.generatedNameOf(GlueHiveMetastore.class));

it is listed under presto.plugin.hive.metastore.glue.

FYI: @electrum

Add support for FETCH/OFFSET syntax

In terms of syntax, this is what the spec says:

A <query expression> can also optionally contain a <result offset clause>, which may limit the cardinality of the derived table by removing a specified number of rows from the beginning of the derived table. If a <query expression> contains both an <order by clause> and a <result offset clause>, then the rows in the derived table are first sorted according to the <order by clause> and then limited by dropping the number of rows specified in the <result offset clause> from the beginning of the result produced by the <query expression>. If the cardinality of the result of an evaluation of a <query expression> is less than the offset value specified by a <result offset clause>, then the derived table is empty.

And

<query expression> ::=
  [ <with clause> ] <query expression body>
      [ <order by clause> ] [ <result offset clause> ] [ <fetch first clause> ]

<result offset clause> ::=
  OFFSET <offset row count> { ROW | ROWS }

<fetch first clause> ::=
  FETCH { FIRST | NEXT } [ <fetch first quantity> ] { ROW | ROWS } { ONLY | WITH TIES }

<fetch first quantity> ::=
    <fetch first row count>
  | <fetch first percentage>

<offset row count> ::=
  <simple value specification>

<fetch first row count> ::=
  <simple value specification>

<fetch first percentage> ::=
  <simple value specification> PERCENT
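
For illustration, queries using this syntax would look like the following (ORDER BY is needed for WITH TIES to be meaningful):

-- Skip the first 5 rows, then return at most 10
SELECT name
FROM tpch.tiny.nation
ORDER BY name
OFFSET 5 ROWS
FETCH FIRST 10 ROWS ONLY;

-- Return the 5 smallest regionkeys, plus any rows tied with the last one
SELECT name, regionkey
FROM tpch.tiny.nation
ORDER BY regionkey
FETCH FIRST 5 ROWS WITH TIES;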

Quantify stats/cost calculator quality. Probabilistically adjust models as a longer term goal

Incorporated from: prestodb/presto#11615

Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one, so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.

Cumulative cost quality

Let's call cumulative estimated query q cost as:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call actual query cost as:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of those are random variables with respect to q.

Observations:

  • f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimation error.
  • We expect f to be a linear transformation (otherwise the sum of subplan estimates won't be the estimate of the whole plan).
  • E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). Thus we have data points composed of pairs:

<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>

for i = 1...N.

Those points will look similar to the scatter plot below:
[figure points1: scatter of (EstimatedXXX(q), CostXXX(q)) points]

Let's transform those points by dividing each yi by xi. We now get:

<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>

Now those points will look like the plot below:
[figure points2: normalized points (EstimatedXXX(q), CostXXX(q)/EstimatedXXX(q))]

We can now perform regression analysis, since we should have satisfied the conditions for OLS regression (the errors come from the same distribution and have constant variance).
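
One plausible formalization of this regression, a sketch assuming the relative errors \varepsilon_i are i.i.d. with constant variance:

y_i' = \frac{\mathrm{CostXXX}(q_i)}{\mathrm{EstimatedXXX}(q_i)}, \qquad y_i' = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad x_i = \mathrm{EstimatedXXX}(q_i)

OLS then fits \beta_0 and \beta_1, and the residual magnitude and coefficient intervals correspond to the quality metrics listed below.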

We can derive quantifiable model properties like:

  • OLS error magnitude
  • regression statistical significance test (if errors have normal distribution)
  • confidence and prediction intervals (if errors have normal distribution)

Such metrics would allow us to compare the quality of cost estimates.

Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will accumulate error. To account for that, we could use plan complexity (e.g., number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.

Individual operator cost quality

Let's call operator o estimated cost as:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call actual operator cost as:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call operator estimated input data size as:
EstimatedInputSize(o).
Let's call operator measured input data size as:
MeasuredInputSize(o).
All of those are random variables with respect to o.

Observations:

  • an operator's estimated input data size will most likely not match its actual input data size. This affects the values of OperatorEstimatedXXX(o) and OperatorCostXXX(o). Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)

Intuitively, these tell us how much of a given resource the operator needs per input byte. We can now define the operator estimation error as:

OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)

Its distribution will look like the plot below:
[figure Error: distribution of OperatorEstimationErrorXXX]

We can now:

  • derive variance of the OperatorEstimationErrorXXX
  • derive the mean of the OperatorEstimationErrorXXX, i.e. how much (per input byte) the operator cost is under- or overestimated.
  • if OperatorEstimationErrorXXX has a normal distribution, we can draw more sophisticated conclusions (e.g., if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by a known amount).

Note that we should perform such analysis for each operator type as they have different profiles.

Stats quality

A similar approach can be applied to plan nodes that filter data (e.g., actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and of the rules that perform predicate estimation.
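
As a manual baseline for this kind of comparison, EXPLAIN ANALYZE already shows estimates alongside actual runtime statistics for each plan node, so the estimated vs. actual filtering factor of a single query can be inspected by hand:

EXPLAIN ANALYZE
SELECT count(*)
FROM tpch.sf1.orders
WHERE orderdate >= DATE '1995-01-01';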

Adjusting model

The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.

However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, we could/should:

  • introduce variables for multiplying estimated operator cost so that they match measured cost.
  • specifically choose the operator memory cost factor to be large, so that for the majority of queries we overestimate memory (to be on the safe side)
  • multiply cumulative plan costs so that we slightly overestimate. Such a factor should account for plan complexity, as simpler plans should have more accurate costs

Having such variables, we could adjust the model for different hardware, data, and workloads. Specifically, we could tune the model for each client so that the CBO produces better plans.

Plan Of Attack

The steps are as follows:

  • For each operator, expose cumulative and individual stats/costs in the query info (v1/query) JSON so that they can be gathered and analyzed. This would be used to evaluate the model on a predefined set of queries (e.g., TPC-H/TPC-DS) during benchmarks.

Note that operators don't map 1-1 to plan nodes. For instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.

  • Expose estimated cost/stats in EventListener so that we can gather and store historical data points.
  • Add variables to the stats/cost models so that we can adjust the model for particular hardware, data, and workloads.

Named queries (WITH) should be evaluated only once

Currently, Trino evaluates named queries (WITH clause) by inlining them wherever they are referenced in the body of the main query. This can cause problems if the query is non-deterministic, since every inlined instance might produce different results.
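
For example, with inlining each reference to r below is planned and evaluated independently, so the comparison can return false, even though per the spec both references should identify the same transient table:

WITH r AS (SELECT random() AS x)
SELECT a.x = b.x
FROM r a, r b;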

According to the spec:

1) If a non-recursive <with clause> is specified, then:
    ...
    b) For every <with list element> WLE, let WQN be the <query name> immediately contained in WLE. Let WQE be 
       the <query expression> simply contained in WLE. Let WLT be the table resulting from evaluation of WQE, with 
       each column name replaced by the corresponding element of the <with column list>, if any, immediately contained in WLE.
    c) Every <table reference> contained in <query expression> that specifies WQN identifies WLT.

This says that each named query should be evaluated once and the result "stored" in a table WLT. Thereafter, any references to the named query directly identify that WLT table.

WLT in the definition above is what the spec calls "transient table":

4.15.4 Transient tables
A transient table is a named table that may come into existence implicitly during the evaluation of a <query expression> or the execution of a trigger. A transient table is identified by a <query name> if it arises during the evaluation of a <query expression>, or by a <transition table name> if it arises during the execution of a trigger. Such tables exist only for the duration of the executing SQL-statement containing the <query expression> or for the duration of the executing trigger.

Allow connectors to participate in query optimization

This is an umbrella issue to track various projects related to allowing connectors to participate in query optimization. The long-term vision is for plugins to provide custom rules that can transform subplans into plugin-specific operations. This requires a set of steps:

  • decouple the AST from the current IR (PlanNode tree): #13184
  • get rid of visitor-based optimizers
  • revamp the IR and optimizer to support a fully exploratory optimizer
  • allow connectors to provide optimizer rules

In the short term, we can introduce special-purpose mechanisms to the SPI and engine to enable the following behaviors:

  • push down complex filters
  • push down projections (e.g, row/array/map dereference, pre-calculated virtual columns)
  • push down aggregations
  • push down joins
  • expose additional filters (e.g. for row-level authorization)
  • expose more complex data organizations (e.g., custom partitioning schemes)

Document describing the high-level approach: https://github.com/prestosql/presto/wiki/Pushdown-of-complex-operations

Plan:

  • #13184
  • Simplify and hide notion of TableLayout: #363
  • Deprecate table layouts: #420
    • Support Constraint-based pushdown to transition from Table Layout: #541
  • Define simplified expression language (some work in progress: #402)
  • Provide TableHandle to PageSource/RecordCursor providers: #442
  • Add pushXXXIntoConnector + Rule pairs
    • Limit: #421
    • Filter: #402, #7994
      • Use canonical function name in expression pushdown (e.g. pass substring instead of substr)
    • Projection: #676
    • Aggregation (#6613)
    • Join (#6620)
    • Sample: #753
    • Top N
      • engine & SPI (#4249)
      • JDBC connectors #4769
      • Push PARTIAL TopN into TableScan #7028
      • Push TopN through Project into TableScan rule #7029
      • Elasticsearch #4803
  • Integrate into connectors
    • Hive
      • Parquet + ORC - Migrate #187 to new framework
      • Optimize Top N queries over partition keys #3050
    • JDBC connectors
  • Remove TableLayout-related APIs: #781
  • SchemaTableName should not be required for synthetic ConnectorTableHandle returned from ConnectorMetadata.apply methods #6694
  • Support connectors advanced pushdown in EXPLAIN IO #6695
