Comments (7)
SAS-2710
from soda-core.
I'm having the same issues. It looks like updating to a more recent version of dask-sql should solve it.
from soda-core.
Additional issue related to the current pinning of dask-sql.
For context, dask-sql has an indirect dependency on pydantic through fastapi.
dask-sql>=2023.0.0,<2023.6.0 is incompatible with pydantic 2. This conflicts with the soda-core pydantic>=2.0.0,<3.0.0 requirement introduced here, and released in soda-core 3.1.4.
It turns out that dask-sql>=2022.10.0,<2023.0.0 was still compatible with pydantic 2 (by chance rather than by deliberate decision, since these versions were released in late 2022, before pydantic 2), so in effect using soda-core>=3.1.4 forces users onto the outdated dask-sql 2022.10.1 or 2022.12.2 through the pydantic 2 dependency.
dask-sql>=2023.8.0 introduces compatibility with pydantic 2 (deliberately this time, through this fastapi upgrade), so loosening the <2023.6.0 restriction would allow us to actually use dask-sql>=2023.0.0.
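The version ranges described in this comment can be written down as a small lookup, which makes the gap between the ranges explicit. This is only a sketch of the claims made above, not anything taken from soda-core or dask-sql: the function names `parse` and `works_with_pydantic2` are hypothetical, and versions the comment does not mention are reported as unknown (`None`) rather than guessed.

```python
from typing import Optional, Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Turn '2023.8.0' into (2023, 8, 0) so versions compare numerically."""
    year, month, patch = (int(part) for part in version.split("."))
    return (year, month, patch)


def works_with_pydantic2(dask_sql_version: str) -> Optional[bool]:
    """Encode the compatibility claims from the comment above.

    Returns True/False where the comment makes a claim, and None for
    versions it does not cover (assumption: unknown, not incompatible).
    """
    v = parse(dask_sql_version)
    if parse("2022.10.0") <= v < parse("2023.0.0"):
        return True  # compatible by chance, per the comment
    if parse("2023.0.0") <= v < parse("2023.6.0"):
        return False  # reported as incompatible with pydantic 2
    if v >= parse("2023.8.0"):
        return True  # deliberate compatibility via the fastapi upgrade
    return None  # not covered by the comment


if __name__ == "__main__":
    # Versions named in the thread.
    for candidate in ["2022.10.1", "2022.12.2", "2023.8.0"]:
        print(candidate, works_with_pydantic2(candidate))
```

For real dependency resolution a proper specifier parser is the better tool; the tuple comparison here is only enough for the plain `YYYY.M.P` versions discussed in this thread.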
from soda-core.
I looked into this thoroughly, and while it is indeed possible to finally get these requirements straight, dask-sql introduced issues into its SQL execution engine.
One example: our test suite runs a SodaCL check avg_length for a text column. The latest versions of dask-sql (starting from 2023.10.0) return a wrong result: the length of the series (i.e. the number of rows) instead of the length of the string values.
To add to that, our scientific package still depends on pandas<2, while we upgraded core to pydantic 2 in the meantime.
I only see two options right now:
- wait for dask-sql to fix the issues
- accept the dask-sql bugs, temporarily skip the offending tests in the test suite, and announce the issues to users with a potential workaround of using an old version of dask-sql.
I don't particularly like either option, but it seems the second is the way forward, as we are locking users out.
from soda-core.
One more update: it seems the character length issue only appears when running checks via Soda. Running the same query with vanilla dask-sql on the latest version gives the correct result. I have been trying to investigate why, but I haven't found the cause yet. Soda does very little with dask: it just creates a dask-sql Context and saves it as one of the data sources, and everything else is passed straight through. Yet the same query on the same dataframe gets parsed differently by dask-sql.
from soda-core.
Steps to reproduce:
- Install soda-core-pandas-dask
- force upgrade dask and dask-sql: pip install dask==2024.2.0 dask_sql==2024.1.0
- run the scripts below
Soda script: returns 2 as the result (which is the number of rows). I debugged dask-sql, and this is how it translates the SQL into dataframe operations:
'Aggregate: groupBy=[[]], aggr=[[AVG(length(employee.email))]]\n TableScan: employee projection=[email]'
import pandas as pd
from soda.scan import Scan
scan = Scan()
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})
scan.add_dask_dataframe(dataset_name="employee", dask_df=df_employee, data_source_name="orders")
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")
checks = """
checks for employee:
- avg_length(email) > 10
"""
scan.add_sodacl_yaml_str(checks)
scan.set_verbose(True)
scan.execute()
print(scan.get_logs_text())
Vanilla dask-sql: same dataframe, same SQL as generated by Soda, returns 17.0, but it gets translated into
'Aggregate: groupBy=[[]], aggr=[[AVG(character_length(employee.email))]]\n TableScan: employee projection=[email]'
(the difference is length vs character_length).
from dask_sql import Context
import pandas as pd
c = Context()
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})
c.create_table("employee", df_employee)
print(df_employee.dtypes)
res = c.sql(
    """SELECT
    AVG(LENGTH(email))
    FROM employee"""
).compute()
print(res)
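The two results reported above (2 from the Soda script, 17.0 from vanilla dask-sql) correspond to two different quantities: the number of rows and the average string length. The plain-Python sketch below only illustrates that difference; it does not reproduce dask-sql's internals. The sample addresses are hypothetical stand-ins, since the ones in the thread appear only as placeholders, and the helper names `row_count` and `avg_string_length` are made up for this example.

```python
# Hypothetical sample values; each address happens to be 17 characters long.
emails = ["alice@example.com", "bobby@example.com"]


def row_count(values):
    """Length of the column itself, i.e. the number of rows."""
    return len(values)


def avg_string_length(values):
    """Mean of the per-value string lengths, which avg_length is meant to report."""
    return sum(len(value) for value in values) / len(values)


if __name__ == "__main__":
    print("row count:", row_count(emails))
    print("average string length:", avg_string_length(emails))
```

With a check such as avg_length(email) > 10, the first quantity fails on a two-row table while the second passes, which is why the wrong translation shows up as a failing check.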
from soda-core.
I was able to make progress here (thanks @baturayo for the help!), and the linked PR now allows dask-sql 2023.8.0, which seems to be the sweet spot between pydantic 2 support and the fewest breaking changes, issues, and regressions. Released in Core 3.2.2.
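When trying out the pins discussed in this thread, it can help to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the list of package names is an assumption based on the packages mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the discussion above (an assumption, adjust as needed).
    names = ["soda-core", "dask", "dask-sql", "pydantic", "pandas"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```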
from soda-core.