Comments (6)
Hi @andygrove, I've run this in local Ballista:
SELECT count(distinct c1) as cnt_distinct FROM aggregate_test_100
and the result is as expected:
+--------------+
| cnt_distinct |
+--------------+
| 5            |
+--------------+
from arrow-ballista.
I checked the backtrace:
2: datafusion_physical_expr::aggregate::build_in::create_aggregate_expr
   at ./datafusion/physical-expr/src/aggregate/build_in.rs:75:13
3: datafusion::physical_plan::planner::create_aggregate_expr_with_name
   at ./datafusion/core/src/physical_plan/planner.rs:1347:13
4: datafusion::physical_plan::planner::create_aggregate_expr
   at ./datafusion/core/src/physical_plan/planner.rs:1390:5
5: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}::{{closure}}
   at ./datafusion/core/src/physical_plan/planner.rs:525:29
It seems odd, but I don't see any Ballista modules in it.
Related: apache/arrow-datafusion#3250
> Hi @andygrove Ive run in local ballista
The issue is specific to distributed mode, because it is the serde code that has the hard-coded value.
> > Hi @andygrove Ive run in local ballista
> The issue is specific to distributed mode because it is the serde that has the hard-coded value
Is there any documentation on how to run the Ballista tests in distributed mode? Perhaps it's part of CI now?
@andygrove @comphead
I analyzed the problem and found that SELECT count(distinct c1) as cnt_distinct FROM aggregate_test_100 also returns the expected result in distributed mode. The reason is that DataFusion's optimizer rewrites a query with a single distinct aggregate into a group by, so no DistinctCount expression reaches the physical plan:
Example SQL: select count(distinct c_name) from customer_1;
// Logical plan before optimization:
Projection: COUNT(DISTINCT customer_1.c_name)
  Aggregate: groupBy=[[]], aggr=[[COUNT(DISTINCT customer_1.c_name)]]
    TableScan: customer_1 projection=[c_name]
// Logical plan after optimization:
Projection: COUNT(DISTINCT customer_1.c_name)
  Projection: COUNT(alias1) AS COUNT(DISTINCT customer_1.c_name)
    Aggregate: groupBy=[[]], aggr=[[COUNT(alias1)]]
      Aggregate: groupBy=[[customer_1.c_name AS alias1]], aggr=[[]]
        TableScan: customer_1 projection=[c_name]
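To make the rewrite concrete, here is a minimal sketch in plain Rust (standard library only, no DataFusion APIs) of why the two plans give the same answer. The function names and the sample column are made up for illustration; the inner step stands in for the group-by Aggregate and the outer step for COUNT(alias1).

```rust
use std::collections::HashSet;

/// Direct evaluation of COUNT(DISTINCT col); NULLs (None) are ignored.
fn count_distinct_direct(col: &[Option<&str>]) -> usize {
    col.iter().flatten().collect::<HashSet<_>>().len()
}

/// Evaluation that mirrors the optimized plan (hypothetical helper).
fn count_distinct_via_group_by(col: &[Option<&str>]) -> usize {
    // Inner Aggregate: groupBy=[[c_name AS alias1]], aggr=[[]]
    // One output row per distinct value; NULL forms a group of its own.
    let groups: HashSet<Option<&str>> = col.iter().copied().collect();
    // Outer Aggregate: groupBy=[[]], aggr=[[COUNT(alias1)]]
    // COUNT(expr) skips rows where the expression is NULL.
    groups.iter().filter(|v| v.is_some()).count()
}

fn main() {
    // Illustrative data, not taken from customer_1.
    let c_name = [Some("a"), Some("b"), None, Some("a"), Some("c"), None];
    println!("direct   = {}", count_distinct_direct(&c_name));
    println!("group by = {}", count_distinct_via_group_by(&c_name));
}
```

Because the rewritten plan only needs a plain COUNT over grouped rows, it never has to build or serialize a DistinctCount expression.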
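The actual problem is that Ballista does not support DistinctCount when the query is not a single-distinct query, that is, when the distinct aggregate appears alongside other aggregates and the rewrite above is not applied.
Example SQL: select count(distinct c_name), max(c_name) from customer_1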
[2022-10-18T06:32:14Z ERROR ballista_core::execution_plans::distributed_query] Job 3N8dtpp failed: Error planning job 3N8dtpp: NotImplemented("Aggregate function not supported: DistinctCount { name: \"COUNT(DISTINCT customer_1.c_name)\", data_type: Int64, state_data_types: [Utf8], exprs: [Column { name: \"c_name\", index: 0 }] }")
DataFusionError(ArrowError(ExternalError(Execution("Job 3N8dtpp failed: Error planning job 3N8dtpp: NotImplemented(\"Aggregate function not supported: DistinctCount { name: \\\"COUNT(DISTINCT customer_1.c_name)\\\", data_type: Int64, state_data_types: [Utf8], exprs: [Column { name: \\\"c_name\\\", index: 0 }] }\")"))))
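The state_data_types: [Utf8] field in the error suggests that a distinct count carries the distinct values themselves as intermediate state, which is what would have to be serialized between stages. The following is only a rough sketch of that idea in plain Rust, under the assumption that each partition keeps a set of values and the final stage merges the sets; the function names are hypothetical and none of this is Ballista's or DataFusion's actual code.

```rust
use std::collections::HashSet;

/// Partial stage (hypothetical): the set of non-NULL values seen in one partition.
fn partial_state(partition: &[Option<&str>]) -> HashSet<String> {
    partition.iter().flatten().map(|s| s.to_string()).collect()
}

/// Final stage (hypothetical): union the partial sets, then count the result.
fn final_count(states: Vec<HashSet<String>>) -> usize {
    let mut merged: HashSet<String> = HashSet::new();
    for state in states {
        merged.extend(state);
    }
    merged.len()
}

fn main() {
    // Two illustrative partitions that share the value "b".
    let p1 = [Some("a"), Some("b"), None];
    let p2 = [Some("b"), Some("c")];

    let s1 = partial_state(&p1);
    let s2 = partial_state(&p2);

    // Adding per-partition counts would count "b" twice.
    let naive_sum = s1.len() + s2.len();
    let merged = final_count(vec![s1, s2]);

    println!("naive sum of partial counts = {naive_sum}");
    println!("count after merging states  = {merged}");
}
```

The point of the sketch is that partial counts cannot simply be added across partitions, so the intermediate state has to travel between stages.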
I will submit a PR for this :)