sodadata / soda-spark

Soda Spark is a PySpark library that helps you test your data in Spark DataFrames

Home Page: https://docs.soda.io

License: Apache License 2.0

Python 100.00%
spark pyspark data-engineering data-quality data-observability data-testing soda-sql python

soda-spark's People

Contributors

abhishek-khare, anilkulkarni87, jczuurmond, jmarien, shannywu, vijaykiran


soda-spark's Issues

Using the Soda Client and SQL Metrics does not work

I run into an issue when running a scan that has SQL metrics configured and passing the Soda Cloud client to the execute call. It works fine when running the scan without the client. The execution just hangs; I need to cancel it, and then I see this error message:

^CERROR:root:Exception while sending command.                                   
Traceback (most recent call last):
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 475, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
RuntimeError: reentrant call inside <_io.BufferedReader name=3>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 503, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 475, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/context.py", line 292, in signal_handler
    self.cancelAllJobs()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/context.py", line 1195, in cancelAllJobs
    self._jsc.sc().cancelAllJobs()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o14.sc

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/clientserver.py", line 503, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
--- Logging error ---
Traceback (most recent call last):
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 682, in _run_sql_metric_failed_rows
    self.sampler.save_sample_to_local_file_with_limit(resolved_sql, temp_file, failed_limit)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/sampler.py", line 207, in save_sample_to_local_file_with_limit
    row = cursor.fetchone()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodaspark/scan.py", line 150, in fetchone
    row = self._df.first()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1617, in first
    return self.head()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1603, in head
    rs = self.head(1)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 1605, in head
    return self.take(n)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 744, in take
    return self.limit(num).collect()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/dataframe.py", line 693, in collect
    sock_info = self._jdf.collectToPython()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o3900.collectToPython

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/opt/homebrew/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/Users/albin/dev/soda/pipeline-demo/soda-spark-test/soda-spark-tests.py", line 21, in <module>
    scan_result = scan.execute(scan_definition, df_ecommerce, soda_server_client=soda_server_client)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodaspark/scan.py", line 294, in execute
    scan.execute()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 94, in execute
    self._query_sql_metrics_and_run_tests()
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 497, in _query_sql_metrics_and_run_tests
    self._query_sql_metrics_and_run_tests_base(self.scan_yml.sql_metric_ymls)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 515, in _query_sql_metrics_and_run_tests_base
    self._run_sql_metric_failed_rows(sql_metric, resolved_sql, scan_column)
  File "/Users/albin/dev/py/env/2.1.0b20/lib/python3.9/site-packages/sodasql/scan/scan.py", line 736, in _run_sql_metric_failed_rows
    logger.exception(f'Could not perform sql metric failed rows \n{resolved_sql}', e)
Message: 'Could not perform sql metric failed rows \nselect order_id as failed_orders\nfrom orders\nwhere ship_date < order_date;\n'
Arguments: (Py4JError('An error occurred while calling o3900.collectToPython'),)
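
For reference, a minimal sketch of the invocation that triggers this, reconstructed from the call stack above; the scan YAML and the client construction are placeholders, not the exact code:

from pyspark.sql import SparkSession
from sodaspark import scan

spark = SparkSession.builder.getOrCreate()
df_ecommerce = spark.table("orders")  # placeholder source data

# Approximate scan YAML with a failed-rows SQL metric.
scan_definition = """
table_name: orders
sql_metrics:
  - type: failed_rows
    name: failed_orders
    sql: |
      select order_id as failed_orders
      from orders
      where ship_date < order_date
"""

soda_server_client = ...  # a configured soda-sql SodaServerClient (construction omitted)

# Works without the client; hangs when the client is passed along:
scan_result = scan.execute(scan_definition, df_ecommerce, soda_server_client=soda_server_client)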

Let `execute` return the ScanResults

The execute function should return the ScanResults. Currently we convert only the measurements to a data frame. Since this is not complete - e.g. test results are missing - we would like to return the scan results.

Still keeping #23 open for discussion with the community about the preferred behaviour of execute.
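
A sketch of what the proposed signature could look like, assuming soda-sql's ScanResult is returned as-is (the create_scan helper below is hypothetical):

from pyspark.sql import DataFrame
from sodasql.scan.scan_result import ScanResult

def execute(scan_definition: str, df: DataFrame, **kwargs) -> ScanResult:
    scan = create_scan(scan_definition, df, **kwargs)  # hypothetical internal helper
    scan.execute()
    return scan.scan_result  # measurements and test results, no data frame conversion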

Error message when using a DataFrame with 100s of columns

When a dataset is large and has a big number of columns, the scan function scan.execute(scan_definition, df) fails with a Spark OOM error on the driver, caused by the collection of the metrics. A more meaningful message here would help avoid misleading the developer and let them know that the final result is too large and should be either filtered or split.
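
One way to surface a clearer message, sketched under the assumption that the driver-side OOM arrives as a Py4JJavaError during the metrics collection (variable names are illustrative):

from py4j.protocol import Py4JJavaError

try:
    measurements = metrics_df.collect()  # the driver-side collection that blows up
except Py4JJavaError as error:
    if "OutOfMemoryError" in str(error):
        raise RuntimeError(
            "Scan results are too large to collect on the driver; "
            "filter the DataFrame or split the scan over column subsets."
        ) from error
    raise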

Fails to install on Azure Databricks Cluster

Library installation attempted on the driver node of cluster 0531-095737-pc8ifbl4 and failed. Please refer to the following error message to fix the library or contact Databricks support. Error Code: DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: org.apache.spark.SparkException: Process List(/databricks/python/bin/pip, install, soda-spark, --disable-pip-version-check) exited with code 1. ERROR: Command errored out with exit status 1:
command: /databricks/python3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-0uq392_j
cwd: /tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/
Complete output (29 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/sasl
copying sasl/__init__.py -> build/lib.linux-x86_64-3.8/sasl
running egg_info
writing sasl.egg-info/PKG-INFO
writing dependency_links to sasl.egg-info/dependency_links.txt
writing requirements to sasl.egg-info/requires.txt
writing top-level names to sasl.egg-info/top_level.txt
reading manifest file 'sasl.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'sasl.egg-info/SOURCES.txt'
copying sasl/saslwrapper.cpp -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.h -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.pyx -> build/lib.linux-x86_64-3.8/sasl
running build_ext
building 'sasl.saslwrapper' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/sasl
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Isasl -I/databricks/python3/include -I/usr/include/python3.8 -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-3.8/sasl/saslwrapper.o
In file included from sasl/saslwrapper.cpp:629:
sasl/saslwrapper.h:22:10: fatal error: sasl/sasl.h: No such file or directory
22 | #include <sasl/sasl.h>
| ^~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

ERROR: Failed building wheel for sasl
ERROR: Command errored out with exit status 1:
command: /databricks/python3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-_6sr1coa/install-record.txt --single-version-externally-managed --compile --install-headers /databricks/python3/include/site/python3.8/sasl
cwd: /tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/
Complete output (29 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/sasl
copying sasl/__init__.py -> build/lib.linux-x86_64-3.8/sasl
running egg_info
writing sasl.egg-info/PKG-INFO
writing dependency_links to sasl.egg-info/dependency_links.txt
writing requirements to sasl.egg-info/requires.txt
writing top-level names to sasl.egg-info/top_level.txt
reading manifest file 'sasl.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'sasl.egg-info/SOURCES.txt'
copying sasl/saslwrapper.cpp -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.h -> build/lib.linux-x86_64-3.8/sasl
copying sasl/saslwrapper.pyx -> build/lib.linux-x86_64-3.8/sasl
running build_ext
building 'sasl.saslwrapper' extension
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/sasl
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -Isasl -I/databricks/python3/include -I/usr/include/python3.8 -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-3.8/sasl/saslwrapper.o
In file included from sasl/saslwrapper.cpp:629:
sasl/saslwrapper.h:22:10: fatal error: sasl/sasl.h: No such file or directory
22 | #include <sasl/sasl.h>
| ^~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /databricks/python3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hk_a28h0/sasl_22bdc11526b24a309f12b898eb2ce262/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-_6sr1coa/install-record.txt --single-version-externally-managed --compile --install-headers /databricks/python3/include/site/python3.8/sasl Check the logs for full command output.
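
The root cause is visible in the compile step: the sasl dependency builds a C extension that includes <sasl/sasl.h>, a header shipped with the Cyrus SASL development package (libsasl2-dev on Debian/Ubuntu-based images). Installing that package before pip runs, for example via a cluster init script that executes `apt-get install -y libsasl2-dev`, should let the wheel build; this is a general property of the sasl package rather than a fix verified on Databricks here.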

Preferred `scan.execute` behaviour (API)

What is the preferred behaviour of scan.execute from a user perspective?

At this moment execute returns a Spark data frame. The data frame contains the measurements of the scan_results.

Maybe it does not make sense to return this as a Spark data frame:

  • we are doing an extra Python -> Spark conversion.
  • if changes are made to the scan results, then we need to update the logic here too.

We could return the scan result object. However, that may not be an object the user expects, as it is Soda-internal.
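
To make the trade-off concrete, a sketch of both options from the user's side (the ScanResult attribute names are assumptions based on soda-sql):

# Current behaviour: measurements come back as a Spark data frame.
measurements_df = scan.execute(scan_definition, df)
measurements_df.show()  # metrics only; test results are lost in the conversion

# Alternative: return the soda-sql ScanResult directly.
scan_result = scan.execute(scan_definition, df)
print(scan_result.measurements)  # no extra Python -> Spark conversion
print(scan_result.test_results)  # complete, but a Soda-internal type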

Scan fails ~when schema is not added explicitly~ when implicit schema has a column with a space

A user reported a failure after reading CSVs in Spark and running the scan on the result:

# Read the CSV with an inferred schema; the inferred header may contain
# column names with spaces, which makes the scan fail.
df = (
    spark.read.format(file_type)            # e.g. "csv"
    .option("inferSchema", infer_schema)    # e.g. "true"
    .option("header", first_row_is_header)  # e.g. "true"
    .option("sep", delimiter)               # e.g. ","
    .load(file_location)
)

scan.execute(scan_definition, df)

The solution was to explicitly add the schema.
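
For reference, a sketch of that workaround with an illustrative schema:

from pyspark.sql.types import StructType, StructField, StringType, DateType

# Spelling out the schema avoids inferred column names that contain spaces.
schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("order_date", DateType(), True),
])

df = (
    spark.read.format(file_type)
    .schema(schema)
    .option("header", first_row_is_header)
    .option("sep", delimiter)
    .load(file_location)
)

scan.execute(scan_definition, df)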

This issue asks us to investigate whether we could also support the inferred schema.

Unnecessary temporary view in scan.execute?

When running a scan on a table that is in a database other than default, the temporary view in scan.execute cannot be created because of the database prefix in the table name. The scan runs fine without creating this temporary view.

I'm wondering whether the creation of the temporary view in the scan.execute function is necessary. At first glance it seems unnecessary. Why is it there? Are there situations in which the table is only readable when a temporary view is created this way?
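
A minimal illustration of the failure mode, with made-up names; Spark rejects qualified temporary view names, so registering the data frame under its database-prefixed table name fails:

df = spark.table("sales_db.orders")

# Fails: temporary view names may not contain a database prefix.
df.createOrReplaceTempView("sales_db.orders")

# An unqualified name works, and the scan also runs without the view.
df.createOrReplaceTempView("orders")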

Use tox for testing

With tox we gain the flexibility to easily test against multiple Python versions.
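
A minimal tox.ini sketch for that setup (the Python versions and test command are assumptions):

[tox]
envlist = py37, py38, py39

[testenv]
deps =
    pytest
    pyspark
commands =
    pytest tests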
