Unpin dask-sql about soda-core HOT 7 CLOSED

jochenchrist commented on May 23, 2024

Unpin dask-sql

from soda-core.

Comments (7)

tools-soda commented on May 23, 2024

SAS-2710

from soda-core.

philiprj commented on May 23, 2024

I'm having the same issue. It looks like updating to a more recent version of dask-sql should solve it.

from soda-core.

julienvantyghem commented on May 23, 2024

Additional issue related to the current pinning of dask-sql

For context, dask-sql has an indirect dependency on pydantic through fastapi.

dask-sql>=2023.0.0,<2023.6.0 is incompatible with pydantic 2. This conflicts with the soda-core pydantic >= 2.0.0,<3.0.0 requirement introduced here, and released in soda-core 3.1.4.
It turns out that dask-sql >= 2022.10.0,<2023.0.0 is still compatible with pydantic 2 (by chance rather than by deliberate decision, since these versions were released in late 2022, before pydantic 2), so using soda-core >= 3.1.4 effectively forces us onto the outdated dask-sql 2022.10.1 or 2022.12.2 through the pydantic 2 dependency.

dask-sql >= 2023.8.0 introduces compatibility (deliberately this time, through this fastapi upgrade) with pydantic 2, so loosening the < 2023.6.0 restriction would allow us to actually use dask-sql >= 2023.0.0
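
The version ranges described above can be summarized in a small lookup. This is only a sketch that encodes the claims made in this comment, not something taken from soda-core or dask-sql. The names `parse_version` and `pydantic2_status` are hypothetical, and versions from 2023.6.0 up to (but not including) 2023.8.0 are not characterized in this thread, so they are reported as unknown.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '2023.8.0' into (2023, 8, 0)."""
    return tuple(int(part) for part in version.split("."))


def pydantic2_status(dask_sql_version: str) -> str:
    """Classify a dask-sql version using only the ranges stated in this thread."""
    v = parse_version(dask_sql_version)
    if (2022, 10, 0) <= v < (2023, 0, 0):
        return "compatible (by chance)"
    if (2023, 0, 0) <= v < (2023, 6, 0):
        return "incompatible"
    if v >= (2023, 8, 0):
        return "compatible (deliberate)"
    # Assumption: anything else is simply not discussed in the thread.
    return "unknown"


for candidate in ("2022.12.2", "2023.4.1", "2023.8.0"):
    print(candidate, "->", pydantic2_status(candidate))
```

Plain tuple comparison is enough here because the versions in question are simple dotted numbers; pre-release or local version suffixes would need a real version parser.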

from soda-core.

m1n0 commented on May 23, 2024

I looked into this thoroughly, and while it is indeed possible to finally get these requirements straight, dask-sql introduced issues into its SQL execution engine.
One example: our test suite runs a SodaCL check avg_length for a text column - the latest versions of dask-sql (starting from 2023.10.0) return a wrong result - the length of the series (i.e. # of rows) instead of the length of the string values.

To add to that, our scientific package still depends on pandas<2, while we upgraded core to pydantic 2 in the meantime.

I only see two options right now:

1. Wait for dask-sql to fix the issues.
2. Accept the dask-sql bugs, skip the offending tests in the test suite temporarily, and announce the issues to users with a potential workaround of using an old version of dask-sql.

I don't particularly like either option, but it seems 2. is the way forward, as we are locking users out.
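
As a rough illustration of the "skip offending tests temporarily" option, a test can be skipped based on the installed dask-sql version. This is a sketch using only the standard library, not the approach actually taken in soda-core. The test class, the helper names, and the placeholder test body are hypothetical, and the 2023.10.0 threshold is the one reported earlier in this comment.

```python
import re
import unittest
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

# First dask-sql release reported in this thread to return wrong avg_length results.
FIRST_AFFECTED = (2023, 10, 0)


def version_tuple(text: str) -> Tuple[int, ...]:
    """Turn '2023.10.0' into (2023, 10, 0), keeping the first three numeric parts."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def is_affected(installed: Optional[Tuple[int, ...]]) -> bool:
    """True when dask-sql is installed at a version at or above the threshold."""
    return installed is not None and installed >= FIRST_AFFECTED


def installed_dask_sql() -> Optional[Tuple[int, ...]]:
    """Return the installed dask-sql version as a tuple, or None if it is absent."""
    try:
        return version_tuple(version("dask-sql"))
    except PackageNotFoundError:
        return None


class AvgLengthCheckTest(unittest.TestCase):  # hypothetical test case
    @unittest.skipIf(
        is_affected(installed_dask_sql()),
        "avg_length returns the row count on dask-sql >= 2023.10.0",
    )
    def test_avg_length(self):
        # Placeholder body: a real test would run the SodaCL check through a scan.
        values = ["abcd", "ab"]
        self.assertEqual(sum(len(v) for v in values) / len(values), 3.0)
```

Keeping the version comparison in a plain function makes the skip condition easy to verify on its own, and the skip reason documents why the test is disabled so it can be re-enabled once the upstream issue is resolved.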

from soda-core.

m1n0 commented on May 23, 2024

One more update - it seems the character length issue only appears when running checks via Soda, not when running the same query with the latest version of vanilla dask-sql. I have been trying to investigate why, but I haven't found the cause yet. Soda's use of dask is very thin: it just creates a dask Context and saves it as one of the data sources, and everything else is passed through unchanged - yet the same query on the same dataframe gets parsed differently by dask-sql.

from soda-core.

m1n0 commented on May 23, 2024

Steps to reproduce:

1. Install soda-core-pandas-dask.
2. Force-upgrade dask and dask-sql: pip install dask==2024.2.0 dask_sql==2024.1.0
3. Run the scripts below.

Soda script - returns 2 as the result (which is the number of rows). I debugged dask-sql, and this is how it translates the SQL into dataframe operations:
'Aggregate: groupBy=[[]], aggr=[[AVG(length(employee.email))]]\n TableScan: employee projection=[email]'

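import pandas as pd
from soda.scan import Scan

scan = Scan()

# Two-row dataset with a single text column.
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})

# Register the dataframe as dataset "employee" in the "orders" data source.
scan.add_dask_dataframe(dataset_name="employee", dask_df=df_employee, data_source_name="orders")

scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")

# SodaCL check on the average string length of the email column.
checks = """
checks for employee:
    - avg_length(email) > 10
"""

scan.add_sodacl_yaml_str(checks)

scan.set_verbose(True)
scan.execute()

# The verbose logs include the generated SQL and the measured value.
print(scan.get_logs_text())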

Vanilla dask-sql script - same DF, same SQL as generated by Soda, returns 17.0, but here it gets translated into
'Aggregate: groupBy=[[]], aggr=[[AVG(character_length(employee.email))]]\n TableScan: employee projection=[email]'
(the difference is length vs character_length)

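from dask_sql import Context
import pandas as pd

c = Context()

# Same two-row dataset as in the Soda script.
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})

# Register the dataframe as table "employee" in the dask-sql context.
c.create_table("employee", df_employee)

print(df_employee.dtypes)


# Same aggregate query that Soda generates for avg_length(email).
res = c.sql(
    """SELECT
            AVG(LENGTH(email))
        FROM employee"""
).compute()
print(res)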
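
To make the two results above concrete, here is a dependency-free sketch of the two quantities being confused: the row count and the average string length. The sample values and function names are hypothetical placeholders, not the ones from the scripts, and this does not reproduce dask-sql's planner. It only shows why a row count of 2 fails the avg_length(email) > 10 check while the true average passes.

```python
from typing import List

# Hypothetical sample values standing in for the email column.
emails: List[str] = ["alice@example.com", "bob@example.org"]


def series_length(values: List[str]) -> int:
    """Number of rows: the value reported via Soda in the failing case."""
    return len(values)


def average_string_length(values: List[str]) -> float:
    """Mean number of characters per value: what avg_length is meant to measure."""
    return sum(len(v) for v in values) / len(values)


def check_passes(measured: float, threshold: float = 10) -> bool:
    """Mirror of the 'avg_length(email) > 10' check from the Soda script."""
    return measured > threshold


print("row count:", series_length(emails), check_passes(series_length(emails)))
print("avg length:", average_string_length(emails), check_passes(average_string_length(emails)))
```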

from soda-core.

m1n0 commented on May 23, 2024

I was able to make progress here (thanks @baturayo for the help!), and the linked PR now allows dask-sql 2023.8.0, which seems to be a sweet spot between pydantic 2 support and the fewest breaking changes, issues, and regressions. Released in Core 3.2.2.

from soda-core.
