
Unpin dask-sql (soda-core issue, CLOSED)

jochenchrist commented on May 23, 2024
Unpin dask-sql

from soda-core.

Comments (7)

tools-soda commented on May 23, 2024

SAS-2710

philiprj commented on May 23, 2024

I'm having the same issues. It looks like updating to a more recent version of dask-sql should solve them.

julienvantyghem commented on May 23, 2024

Additional issue related to the current pinning of dask-sql

For context, dask-sql has an indirect dependency on pydantic through fastapi.

dask-sql>=2023.0.0,<2023.6.0 is incompatible with pydantic 2. This conflicts with the soda-core pydantic>=2.0.0,<3.0.0 requirement introduced here and released in soda-core 3.1.4.
It turns out that dask-sql>=2022.10.0,<2023.0.0 was still compatible with pydantic 2 (by chance rather than by deliberate decision, since those versions were released in late 2022, before pydantic 2). So, through the pydantic 2 dependency, soda-core>=3.1.4 effectively forces users onto the outdated dask-sql 2022.10.1 or 2022.12.2.

dask-sql >= 2023.8.0 introduces compatibility (deliberately this time, through this fastapi upgrade) with pydantic 2, so loosening the < 2023.6.0 restriction would allow us to actually use dask-sql >= 2023.0.0
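
The compatibility picture described in this comment can be summarised as a small lookup. This is only a sketch of what the comment states: the function names and the tuple-based version parsing are assumptions made for illustration, and any range the comment does not mention (for example 2023.6.0 up to 2023.8.0) is reported as unknown rather than guessed.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a simple dotted version such as '2023.8.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pydantic2_compatible(dask_sql_version: str) -> Optional[bool]:
    """Return True/False per the ranges in the comment, or None if not covered."""
    v = parse_version(dask_sql_version)
    if (2022, 10, 0) <= v < (2023, 0, 0):
        return True   # compatible by chance (released before pydantic 2)
    if (2023, 0, 0) <= v < (2023, 6, 0):
        return False  # stated as incompatible with pydantic 2
    if v >= (2023, 8, 0):
        return True   # compatible deliberately, via the fastapi upgrade
    return None       # range not discussed in the comment


# Sample version strings for illustration; not all are necessarily real releases.
for candidate in ("2022.12.2", "2023.4.0", "2023.6.0", "2023.8.0"):
    print(candidate, pydantic2_compatible(candidate))
```

Returning None for uncovered ranges keeps the sketch honest about what the thread actually establishes.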

m1n0 commented on May 23, 2024

I looked into this thoroughly, and while it is indeed possible to finally get these requirements straight, dask-sql has introduced issues into its SQL execution engine.
One example: our test suite runs a SodaCL avg_length check on a text column. The latest versions of dask-sql (starting from 2023.10.0) return a wrong result: the length of the series (i.e. the number of rows) instead of the average length of the string values.
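
To make the difference between the two results concrete, here is a minimal plain-Python sketch. It does not use dask-sql or Soda at all; the sample strings and the helper names are hypothetical and only illustrate "average length of the values" versus "number of rows".

```python
from statistics import fmean

# Hypothetical sample values, chosen only for illustration.
emails = ["alice@example.com", "bobby@example.com"]


def avg_length(values):
    """Expected behaviour: the mean of the per-value string lengths."""
    return fmean([len(value) for value in values])


def row_count(values):
    """What the wrong result corresponds to: the length of the column itself."""
    return len(values)


print(avg_length(emails))  # average number of characters per value
print(row_count(emails))   # number of rows, unrelated to string length
```

With a check such as avg_length(email) > 10, the second number would fail the check even though every value is longer than 10 characters.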

On top of that, our scientific package still depends on pandas<2, while we upgraded core to pydantic 2 in the meantime.

I only see two options right now:

  1. wait for dask-sql to fix issues
  2. accept the dask-sql bugs, temporarily skip the offending tests in the test suite, and announce the issues to users with a potential workaround of using an old version of dask-sql.

I don't particularly like either option, but option 2 seems to be the way forward, as we are currently locking users out.

m1n0 commented on May 23, 2024

One more update: the character length issue seems to appear only when running checks via Soda, not when running the same query directly with the latest version of vanilla dask-sql. I have been trying to investigate why, but I haven't found the cause yet. Soda does very little on top of dask: it just creates a dask Context and saves it as one of the data sources, and everything else is passed straight through. Yet the same query on the same DataFrame gets parsed differently by dask-sql.

m1n0 commented on May 23, 2024

Steps to reproduce:

  1. Install soda-core-pandas-dask
  2. Force-upgrade dask and dask-sql: pip install dask==2024.2.0 dask_sql==2024.1.0
  3. run the scripts below
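
Before running the scripts, it can help to confirm which versions actually ended up installed, since the forced upgrade overrides the pins. The following is a small standard-library sketch; the helper name and the list of distribution names are assumptions chosen for illustration, and the name under which each package is registered may differ in a given environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names assumed for illustration; adjust to the environment.
for name in ("soda-core", "dask", "dask-sql", "pydantic", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Including this output in a bug report makes it easier to tell which dask-sql release produced a given result.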

Soda script: returns 2 as the result (which is the number of rows). I debugged dask-sql, and this is how it translates the SQL into DataFrame operations:
'Aggregate: groupBy=[[]], aggr=[[AVG(length(employee.email))]]\n TableScan: employee projection=[email]'

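import pandas as pd
from soda.scan import Scan

scan = Scan()

# Two-row DataFrame with a single text column
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})

# Register the DataFrame as dataset "employee" in data source "orders"
scan.add_dask_dataframe(dataset_name="employee", dask_df=df_employee, data_source_name="orders")

scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")

# SodaCL check on the average string length of the email column
checks = """
checks for employee:
    - avg_length(email) > 10
"""

scan.add_sodacl_yaml_str(checks)

scan.set_verbose(True)
scan.execute()

# The logs include the generated SQL and the measured value
print(scan.get_logs_text())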

Vanilla dask-sql script: same DataFrame, same SQL as generated by Soda. It returns 17.0, but the query gets translated into
'Aggregate: groupBy=[[]], aggr=[[AVG(character_length(employee.email))]]\n TableScan: employee projection=[email]'
(the difference is length vs. character_length).

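from dask_sql import Context
import pandas as pd

c = Context()

# Same two-row DataFrame as in the Soda script
df_employee = pd.DataFrame({"email": ["[email protected]", "[email protected]"]})

# Register the DataFrame under the table name used in the query
c.create_table("employee", df_employee)

print(df_employee.dtypes)

# Same aggregate query that Soda generates for the avg_length check
res = c.sql(
    """SELECT
            AVG(LENGTH(email))
        FROM employee"""
).compute()
print(res)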

m1n0 commented on May 23, 2024

I was able to make progress here (thanks @baturayo for the help!), and the linked PR now allows dask-sql 2023.8.0, which seems to be a sweet spot: it supports pydantic 2 while introducing the fewest breaking changes, issues, and regressions. Released in Core 3.2.2.
