sodadata / soda-spark

Soda Spark is a PySpark library that helps you with testing your data in Spark DataFrames.
Home Page: https://docs.soda.io
License: Apache License 2.0
No need for it anymore, as we do not have to specify the database.
Explain the API for the scan functionality in the README.
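A minimal sketch of what such a README example could look like, based on the scan.execute call used elsewhere in these issues (the table name, metrics, and tests in the YAML string are illustrative and assume soda-sql's scan YAML keys):

from pyspark.sql import SparkSession
from sodaspark import scan

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "shipped"), (2, None)], ["id", "status"])

# The scan definition is a soda-sql scan YAML, here given as a string.
scan_definition = """
table_name: demo
metrics:
- row_count
- missing_count
tests:
- row_count > 0
"""

scan_result = scan.execute(scan_definition, df)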
I get an issue when running a scan that has SQL metrics configured while passing the Soda Cloud client to the execute call. It works fine when running the scan without the client. The execution just hangs; when I cancel it, I see this error message:
^CERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 475, in send_command
answer = smart_decode(self.stream.readline()[:-1])
RuntimeError: reentrant call inside <_io.BufferedReader name=3>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 503, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 475, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/context.py", line 292, in signal_handler
self.cancelAllJobs()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/context.py", line 1195, in cancelAllJobs
self._jsc.sc().cancelAllJobs()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1309, in __call__
return_value = get_return_value(
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/protocol.py", line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o14.sc
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 503, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
--- Logging error ---
Traceback (most recent call last):
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 682, in _run_sql_metric_failed_rows
self.sampler.save_sample_to_local_file_with_limit(resolved_sql, temp_file, failed_limit)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/sampler.py", line 207, in save_sample_to_local_file_with_limit
row = cursor.fetchone()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodaspark/scan.py", line 150, in fetchone
row = self._df.first()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1617, in first
return self.head()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1603, in head
rs = self.head(1)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1605, in head
return self.take(n)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 744, in take
return self.limit(num).collect()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 693, in collect
sock_info = self._jdf.collectToPython()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1309, in __call__
return_value = get_return_value(
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/protocol.py", line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o3900.collectToPython
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 1083, in emit
msg = self.format(record)
File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 927, in format
return fmt.format(record)
File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 663, in format
record.message = record.getMessage()
File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 367, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/Users/albin/dev/soda/pipeline-demo/soda-spark-test/soda-spark-tests.py", line 21, in <module>
scan_result = scan.execute(scan_definition, df_ecommerce, soda_server_client=soda_server_client)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodaspark/scan.py", line 294, in execute
scan.execute()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 94, in execute
self._query_sql_metrics_and_run_tests()
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 497, in _query_sql_metrics_and_run_tests
self._query_sql_metrics_and_run_tests_base(self.scan_yml.sql_metric_ymls)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 515, in _query_sql_metrics_and_run_tests_base
self._run_sql_metric_failed_rows(sql_metric, resolved_sql, scan_column)
File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 736, in _run_sql_metric_failed_rows
logger.exception(f'Could not perform sql metric failed rows \n{resolved_sql}', e)
Message: 'Could not perform sql metric failed rows \nselect order_id as failed_orders\nfrom orders\nwhere ship_date < order_date;\n'
Arguments: (Py4JError('An error occurred while calling o3900.collectToPython'),)
Maybe with a tool like towncrier.
Use setup.cfg instead
The GitHub workflow we have now shows the release jobs (as cancelled) in the PR. By creating a separate workflow for the release, we eliminate this clutter. Still, the release workflow should only run when the tests have passed.
In the YAML you can define which columns to exclude. Test if this feature works.
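A sketch of such a test, assuming the scan YAML accepts soda-sql's excluded_columns key (the key and the column name are assumptions, not verified against this repo):

from sodaspark import scan

# Hypothetical scan definition that excludes one column from the scan.
scan_definition = """
table_name: demo
excluded_columns:
- secret_column
metrics:
- row_count
- missing_count
"""

scan_result = scan.execute(scan_definition, df)
# Expected: no measurements for secret_column in the result.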
We have pinned versions for soda-sql; we should make this a version range.
In the column metrics, the metric-groups configuration can be set. Test if this works.
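A sketch of a scan definition exercising this, assuming soda-sql's metric_groups key under a column (the key names and group names are assumptions):

from sodaspark import scan

scan_definition = """
table_name: demo
columns:
  id:
    metric_groups:
    - missing
    - duplicates
"""

scan_result = scan.execute(scan_definition, df)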
It would be great if we could also implement the failed rows processor for Soda Spark:
https://github.com/sodadata/soda-sql/blob/main/core/sodasql/scan/failed_rows_processor.py
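A rough sketch of what a user-defined processor might look like, assuming the interface is the single process method from the linked file (the class name and the context keys used below are assumptions):

from sodasql.scan.failed_rows_processor import FailedRowsProcessor

class LogFailedRowsProcessor(FailedRowsProcessor):
    def process(self, context: dict) -> dict:
        # Inspect the failed rows handed over by the scan and return a
        # message that ends up in the scan results.
        sample_name = context.get("sample_name")
        return {"message": f"processed failed rows for {sample_name}"}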
Test if column metrics work. Testing for a random selection should suffice.
The version is retrieved here. I do not think this works, since setup.py does not work.
The execute function should return the ScanResults. We now do a conversion to a data frame, for the measurements only. Since this is not complete - e.g. test results are missing - we would like to return the scan results.
Still keeping #23 open for discussion from the community about the preferred behavior of the execute.
Register data frame as temporary view and compute metric using a SQL statement.
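In plain PySpark this boils down to something like the following (the view name and the metric are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (None,)], ["id"])

# Register the data frame as a temporary view ...
df.createOrReplaceTempView("soda_scan_input")

# ... and compute a metric with a SQL statement against that view.
row = spark.sql(
    "SELECT COUNT(*) AS row_count, COUNT(id) AS id_values FROM soda_scan_input"
).first()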
Use the one from GDD kick-start-python
When a data set is large and has a big number of columns, the scan function scan.execute(scan_definition, df) fails with a Spark OOM issue in the master due to the collection part of the metrics. A more meaningful message here would help to avoid misleading the developer and let them know that the final result is too large and should be either filtered or split.
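A hypothetical sketch of such a guard (the helper name is made up, and whether the OOM surfaces as a Py4JJavaError depends on the setup):

from py4j.protocol import Py4JJavaError

def collect_measurements(df):
    try:
        return df.collect()
    except Py4JJavaError as exception:
        # Translate a driver-side OOM into a message that tells the
        # developer what to do about it.
        if "OutOfMemoryError" in str(exception):
            raise MemoryError(
                "The metric results are too large to collect; "
                "filter or split the data frame before scanning."
            ) from exception
        raise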
Generate the docs using a GitHub workflow. Preferably similar to Soda data.
Add a test that validates that the scan results are an empty list.
Use setup.cfg instead of setup.py to define installation.
ERROR: Failed building wheel for sasl
ERROR: Command errored out with exit status 1:
command: /databricks/python3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-_6sr1coa/install-record.txt --single-version-externally-managed --compile --install-headers /databricks/python3/include/site/python3.8/sasl
cwd: /tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/
Complete output (29 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/sasl
copying sasl/__init__.py -> build/lib.linux-x86_64-3.8/sasl
running egg_info
writing sasl.egg-info/PKG-INFO
writing dependency_links to sasl.egg-info/dependency_links.txt
writing requirements to sasl.egg-info/requires.txt
writing top-level names to sasl.egg-info/top_level.txt
reading manifest file 'sasl.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'sasl.egg-info/SOURCES.txt'
copying sasl/saslwrapper.cpp -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.h -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.pyx -> build/lib.linux-x86_64-3.8/sasl
running build_ext
building 'sasl.saslwrapper' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/sasl
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Isasl -I/databricks/python3/include -I/usr/include/python3.8 -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-3.8/sasl/saslwrapper.o
In file included from sasl/saslwrapper.cpp:629:
sasl/saslwrapper.h:22:10: fatal error: sasl/sasl.h: No such file or directory
22 | #include <sasl/sasl.h>
| ^~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /databricks/python3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-_6sr1coa/install-record.txt --single-version-externally-managed --compile --install-headers /databricks/python3/include/site/python3.8/sasl Check the logs for full command output.
As discussed in this soda-sql issue, we prefer to use the Warehouse methods to execute SQL statements, since we have control over this API. However, to unblock sodadata/soda-sql#240 we implemented the Connection and Cursor in sodadata/soda-sql#239. After issue 479 is resolved, we can replace the Connection and Cursor with a Warehouse implementation for Spark.
See this page for connection details
First investigate what the schema of the data frame would look like.
It is not needed due to setup.cfg
Given a scan YAML file, apply a scan to a data frame.
def scan(df: DataFrame, scan_yaml: Union[str, Path]):
    ...
TBD:
Use doctest to test examples in the docstring.
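A self-contained example of the pattern, using a toy helper rather than the real scan function; running python -m doctest on the module would then execute the example:

def row_count(df) -> int:
    """Return the number of rows in the data frame.

    >>> from pyspark.sql import SparkSession
    >>> spark = SparkSession.builder.getOrCreate()
    >>> row_count(spark.createDataFrame([(1,), (2,)], ["id"]))
    2
    """
    return df.count()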
We have implemented the sql_fetchone method, but the others will not work for Spark.
After a measurement is calculated - given a certain metric - the response to the user should be given as a Spark data frame.
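A minimal sketch of that conversion, assuming a measurement carries a metric name, an optional column name and a value (the field names are assumptions):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

measurements = [
    Row(metric="row_count", column_name=None, value=2),
    Row(metric="missing_count", column_name="id", value=0),
]
measurements_df = spark.createDataFrame(measurements)
measurements_df.show()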
This bug is due to the new release of markupsafe (2.1.0), in which soft_unicode has been removed. The dependencies need to be adapted (e.g. by pinning markupsafe<2.1.0).
What is the preferred behavior for the scan.execute from a user perspective?
At this moment the execute returns a Spark data frame. The data frame contains the measurements of the scan_results.
Maybe it does not make sense to return this as a Spark data frame: we could return the scan result object. However, that is maybe not an object the user expects, as it is Soda internal.
We will use soda-sql to generate the SQL that computes a metric.
A user reported a failure after reading CSVs in Spark and running the scan on them:
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
scan.execute(scan_definition, df)
The solution was to explicitly add the schema. This issue asks to investigate whether we could also use schema inference.
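A sketch of that workaround, reusing the reader options from the snippet above (the column names are taken from the SQL metric earlier in this list and are illustrative):

from pyspark.sql.types import DateType, IntegerType, StructField, StructType

schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("ship_date", DateType(), True),
])

df = spark.read.format(file_type) \
    .schema(schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)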
When running a scan on a table that is in another database than default, the temporary view in scan.execute cannot be created because of the prefix in the table name. The scan runs fine without creating this temporary view.
I'm wondering if the creation of the temporary view in the scan.execute function is necessary. To the relatively naked eye it seems unnecessary. Why is it here? Are there situations in which the table is only readable when a temporary view is created in this way?
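A minimal reproduction of the naming clash (the database and table names are hypothetical); Spark rejects qualified names for temporary views:

df = spark.table("analytics.orders")

# Fails: a temporary view name cannot contain a database prefix.
df.createOrReplaceTempView("analytics.orders")

# Works: an unqualified view name.
df.createOrReplaceTempView("orders")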
See the validity formats.
Similar to the soda-sql repo, have a job that publishes the package to PyPI.
After being able to compute one measurement from one metric we would like to do the same for any given set of metrics.
Test if custom metrics work
With tox we have the flexibility to easily test against multiple Python versions.
After being able to compute a measurement, implement the API for running a test
Soda Spark scan never finishes when Samples are enabled:
samples:
  table_limit: 50
  failed_limit: 50
When removed, it works.
Title says it all