sodadata / soda-sql
Soda SQL and Soda Spark have been deprecated and replaced by Soda Core: docs.soda.io/soda-core/overview.html
Home Page: https://docs.soda.io/
License: Apache License 2.0
Describe the bug
After adding a list of valid_values to my scan.yml, running soda scan started throwing exceptions.
FROM "ORDERS"
| Scan failed
Traceback (most recent call last):
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/scan.py", line 78, in execute
self._query_aggregations()
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/scan.py", line 211, in _query_aggregations
query_result_tuple = self.warehouse.sql_fetchone(sql)
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/warehouse.py", line 26, in sql_fetchone
return sql_fetchone(self.connection, sql)
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/db.py", line 20, in sql_fetchone
return sql_fetchone_description(connection, sql)[0]
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/db.py", line 33, in sql_fetchone_description
cursor.execute(sql)
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 603, in execute
Error.errorhandler_wrapper(self.connection, self,
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py", line 125, in errorhandler_wrapper
cursor.errorhandler(connection, cursor, error_class, error_value)
File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py", line 85, in default_errorhandler
raise error_class(
snowflake.connector.errors.ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 20 at position 53 unexpected ')'.
| Measurement(metric='schema', column_name=None, value=[{'name': 'O_COMMENT', 'type': 'TEXT'}, {'name': 'O_ORDERPRIORITY', 'type': 'TEXT'}, {'name': 'O_TOTALPRICE', 'type': 'NUMBER'}, {'name': 'O_CUSTKEY', 'type': 'NUMBER'}, {'name': 'O_ORDERSTATUS', 'type': 'TEXT'}, {'name': 'O_SHIPPRIORITY', 'type': 'NUMBER'}, {'name': 'O_CLERK', 'type': 'TEXT'}, {'name': 'O_ORDERKEY', 'type': 'NUMBER'}, {'name': 'O_ORDERDATE', 'type': 'DATE'}], group_values=None)
| 1 measurements computed
| 0 tests executed
| All is good. No tests failed.
To Reproduce
Steps to reproduce the behavior:
Edit scan.yml to contain a list of valid_values
Run soda scan ...
Expected behavior
I was expecting my valid_values to be taken into account when computing the invalid_values metric, but an exception was thrown during the scan.
Context
scan.yml
table_name: ORDERS
metrics:
- row_count
- missing_count
- missing_percentage
- values_count
- values_percentage
- invalid_count
- invalid_percentage
- min
- max
- avg
- sum
columns:
O_ORDERSTATUS:
valid_values:
- O
- F
- P
tests:
row_count_min: row_count > 0
no_invalid_values: invalid_count == 0
OS: Linux (5.10)
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake
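For reference, a minimal sketch (hypothetical, not the actual soda-sql SQL generator) of the condition I would expect valid_values to translate into for the invalid_count metric; column and value names are taken from the scan.yml above:
# Hypothetical sketch only; the SQL soda-sql actually generates may differ.
valid_values = ["O", "F", "P"]  # from scan.yml above
in_list = ", ".join(f"'{v}'" for v in valid_values)
# invalid = value present but not in the declared valid set
invalid_condition = f"O_ORDERSTATUS IS NOT NULL AND O_ORDERSTATUS NOT IN ({in_list})"
invalid_count_sql = f'SELECT COUNT(CASE WHEN {invalid_condition} THEN 1 END) FROM "ORDERS"'
print(invalid_count_sql)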
Update the documentation to include the list of possible validity formats, and their definitions, which can be used in soda-sql tests.
Current list, copied from the source code (https://github.com/sodadata/soda-sql/blob/main/sodasql/scan/validity.py):
'number_whole'
'number_decimal_point'
'number_decimal_comma'
'number_percentage'
'date_eu'
'date_us'
'date_inverse'
'time'
'uuid'
'email'
'phone'
'credit_card_number'
'ip_address'
Thanks
Describe the bug
It is expected that Soda SQL detects null and the empty string as missing values out of the box, plus any additionally declared missingValues. For data set up like this:
tableBuilder
.createColumn(COL_MISSING2, "VARCHAR(255)")
.scanConfiguration("missingValues", Lists.of("missing"))
.values(
"''",
"'missing'"
);
the missing count should be 2 (1 for "", and 1 for "missing"), but we get 1 since only "missing" is detected.
Generated SQL:
COUNT(CASE WHEN NOT (missing2 IS NULL OR missing2 IN ('missing')) THEN 1 END),
Warehouse Type: PostgreSQL
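A minimal sketch of how I would expect the missing condition to be assembled, treating NULL and the empty string as missing out of the box on top of the declared missingValues (hypothetical helper, not the actual soda-sql code):
def missing_condition(column, missing_values):
    # Hypothetical: NULL and '' are missing by default; declared missingValues are added on top.
    literals = ["''"] + [f"'{v}'" for v in missing_values]
    return f"{column} IS NULL OR {column} IN ({', '.join(literals)})"

# missing_condition("missing2", ["missing"])
# -> "missing2 IS NULL OR missing2 IN ('', 'missing')"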
Describe the bug
Our documentation currently says:
name: my_bigquery_project
connection:
type: bigquery
account_info: <PATH TO YOUR BIGQUERY ACCOUNT INFO JSON FILE>
dataset: sodasql
...
but our code expects account_info_json instead.
OS:
Python Version: *
Soda SQL Version: main
Warehouse Type: BigQuery
Hello there,
Describe the bug
soda scan SUM aggregate overflows
snowflake.connector.errors.ProgrammingError: 100058 (22000): Value overflow in a SUM aggregate
To Reproduce
Steps to reproduce the behavior:
soda scan on a big table
OS: Unix
Python Version: 3.8.7
Soda SQL Version: 2.0.0b10
Warehouse Type: Snowflake
Describe the bug
We do not get any histogram calculated for the percentage column types
tableBuilder.createColumn("varchar_percent_", "VARCHAR(10)")
.values(
"'8%'",
"null",
"'3--%'",
"'nopct'",
"'6%'",
"'6%'",
"'6%'",
"'77 %'")
.check(column -> {
ColumnProfile profile = column.getProfile();
assertThat(profile.getSourceType(), is("character varying"));
assertThat(profile.getHistogram(), notNullValue());
});
Warehouse Type: PostgreSQL
The test configuration files have the hostname, database name, etc. hard-coded: https://github.com/sodadata/soda-sql/blob/main/tests/warehouses/redshift_cfg.yml#L2
I suggest moving all the required connection details to env_vars.
Probably only documentation & templates.
As a data engineer, I want to know how to configure a Soda SQL scan as a step in my Airflow job.
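A minimal sketch of one way to do this, simply shelling out to the soda CLI from a BashOperator (assumes Airflow 2.x; the warehouse directory, table name, and schedule are placeholders):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="soda_sql_scan_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The scan's non-zero exit code on failed tests fails the Airflow task.
    soda_scan_orders = BashOperator(
        task_id="soda_scan_orders",
        bash_command="cd /opt/soda/my_warehouse && soda scan . ORDERS",
    )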
For this PostgreSQL table:
CREATE TABLE test_table (
varchar_with_nulls_ VARCHAR(255),
text_custom_missing_ TEXT,
char_eu_date_ CHAR(10),
varchar_percent_ VARCHAR(10),
double_precision_ DOUBLE PRECISION,
integer_ INTEGER,
bytea_all_nulls_ BYTEA,
timestamp_ TIMESTAMP,
time_ TIME,
date_ DATE,
json_ JSON
);
profiling fails since the data type JSON is not supported:
14:13:46.322 INFO PostgresCo | STDERR: STATEMENT: WITH group_by_value AS (
14:13:46.322 INFO PostgresCo | STDERR: SELECT
14:13:46.322 INFO PostgresCo | STDERR: json_ AS value,
14:13:46.322 INFO PostgresCo | STDERR: COUNT(*) AS frequency
14:13:46.322 INFO PostgresCo | STDERR: FROM "public"."test_table"
14:13:46.322 INFO PostgresCo | STDERR: WHERE NOT (json_ IS NULL)
14:13:46.322 INFO PostgresCo | STDERR: GROUP BY json_
14:13:46.322 INFO PostgresCo | STDERR: )
14:13:46.322 INFO PostgresCo | STDERR: SELECT COUNT(*),
14:13:46.322 INFO PostgresCo | STDERR: COUNT(CASE WHEN frequency = 1 THEN 1 END),
14:13:46.322 INFO PostgresCo | STDERR: SUM(frequency)
14:13:46.322 INFO PostgresCo | STDERR: FROM group_by_value
14:13:46.326 INFO SodaScanne | STDOUT: Scan failed
14:13:46.326 INFO SodaScanne | STDOUT: Traceback (most recent call last):
14:13:46.326 INFO SodaScanne | STDOUT: File "/usr/local/lib/python3.8/site-packages/sodasql/scan/scan.py", line 77, in execute
14:13:46.326 INFO SodaScanne | STDOUT: self._query_group_by_value()
14:13:46.326 INFO SodaScanne | STDOUT: File "/usr/local/lib/python3.8/site-packages/sodasql/scan/scan.py", line 275, in _query_group_by_value
14:13:46.327 INFO SodaScanne | STDOUT: query_result_tuple = self.warehouse.sql_fetchone(sql)
14:13:46.327 INFO SodaScanne | STDOUT: File "/usr/local/lib/python3.8/site-packages/sodasql/scan/warehouse.py", line 27, in sql_fetchone
14:13:46.327 INFO SodaScanne | STDOUT: return sql_fetchone(self.connection, sql)
14:13:46.327 INFO SodaScanne | STDOUT: File "/usr/local/lib/python3.8/site-packages/sodasql/scan/db.py", line 20, in sql_fetchone
14:13:46.327 INFO SodaScanne | STDOUT: return sql_fetchone_description(connection, sql)[0]
14:13:46.327 INFO SodaScanne | STDOUT: File "/usr/local/lib/python3.8/site-packages/sodasql/scan/db.py", line 33, in sql_fetchone_description
14:13:46.327 INFO SodaScanne | STDOUT: cursor.execute(sql)
14:13:46.327 INFO SodaScanne | STDOUT: psycopg2.errors.UndefinedFunction: could not identify an equality operator for type json
14:13:46.328 INFO SodaScanne | STDOUT: LINE 7: GROUP BY json_
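A hedged sketch of one possible workaround for the generated group-by query: cast the JSON column to text before grouping, since PostgreSQL has no equality operator for json (whether a text cast is acceptable for frequency counting is an open question):
# Hypothetical rewrite of the failing query with the JSON column cast to text.
column = "json_"
group_expr = f"{column}::text"
sql = f"""WITH group_by_value AS (
  SELECT {group_expr} AS value,
         COUNT(*) AS frequency
  FROM "public"."test_table"
  WHERE {column} IS NOT NULL
  GROUP BY {group_expr}
)
SELECT COUNT(*),
       COUNT(CASE WHEN frequency = 1 THEN 1 END),
       SUM(frequency)
FROM group_by_value"""
print(sql)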
In "Check your CLI installation" on page https://docs.soda.io/soda-sql/#/5_min_tutorial, the commands listed include verify, but the output of the soda command on 2.0.0b2 doesn't:
soda/samples/hello-soda via 🐍 v3.8.5 (venv385)
❯ soda
Usage: soda [OPTIONS] COMMAND [ARGS]...
Soda CLI version 2.0.0b2
Options:
--help Show this message and exit.
Commands:
create Creates a new warehouse directory and prepares credentials in
your...
init Finds tables in the warehouse and based on the contents, creates...
scan Computes all measurements and runs all tests on one table.
soda/samples/hello-soda via 🐍 v3.8.5 (venv385)
❯ soda verify
Usage: soda [OPTIONS] COMMAND [ARGS]...
Try 'soda --help' for help.
Error: No such command 'verify'.
There is no need to add __init__.py files (since Python 3.3); they can be removed.
Describe the bug
If a field has an invalid value but the other values allow analytical type deduction, we should detect the obviously invalid values.
tableBuilder.createColumn("char_eu_date_", "CHAR(10)")
.values(
"'16/04/2020'",
"null",
"'16/04/2020'",
"'16/04/2020'",
"'16/04/2020'",
"'16/04/2020'",
"'invalid value'",
"'16/04/2020'")
.check(column -> {
ColumnProfile profile = column.getProfile();
assertThat(profile.getSourceType(), is("character"));
assertThat(profile.getInvalidValuesCount(), is(1l)); // currently we get 0
assertThat(profile.getInvalidValuesPercentage(), is(12.5f)); // currently we get 0
assertThat(profile.getDistinctValuesCount(), is(1l)); // currently we get 2
});
Warehouse Type: PostgreSQL
Describe the bug
Some metrics that are expected & supported do not arrive for the varchar column type.
tableBuilder.createColumn("varchar_with_nulls_", "VARCHAR(255)")
.values(
"null",
"'a'")
.check(column -> {
ColumnProfile profile = column.getProfile();
assertThat(profile.getMin(), is("a"));
assertThat(profile.getMins(), is(Lists.of("a")));
assertThat(profile.getMax(), is("a"));
assertThat(profile.getMaxs(), is(Lists.of("a")));
});
Describe the bug
After running soda init on a Snowflake warehouse, the process ends with an exception:
Traceback (most recent call last):
[...]
File "/home/dirk/work/soda/code/soda-sql/sodasql/cli/cli.py", line 201, in init
if warehouse and warehouse.connection and not warehouse.connection.closed:
AttributeError: 'SnowflakeConnection' object has no attribute 'closed'
To Reproduce
Steps to reproduce the behavior:
soda init
Expected behavior
To close without throwing an exception
Context
OS: Arch Linux (5.10)
Python Version: 3.8
Soda SQL Version: 2.0.0b6
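A minimal sketch of a more defensive cleanup in the CLI, assuming not every connector exposes a closed attribute (getattr keeps it connector-agnostic; this is a sketch, not the actual cli.py code):
# Hypothetical, connector-agnostic cleanup: don't assume the DB-API connection
# has a `closed` attribute (the Snowflake connection does not).
def close_quietly(warehouse):
    connection = getattr(warehouse, "connection", None)
    if connection is None:
        return
    try:
        if not getattr(connection, "closed", False):
            connection.close()
    except Exception:
        # Closing is best-effort during CLI shutdown.
        pass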
Describe the bug
If a field has an invalid value but the other values allow analytical type deduction, we should detect the obviously invalid values.
tableBuilder.createColumn("varchar_percent_", "VARCHAR(10)")
.values(
"'8%'",
"null",
"'3--%'",
"'nopct'",
"'6%'",
"'6%'",
"'6%'",
"'77 %'")
.check(column -> {
ColumnProfile profile = column.getProfile();
assertThat(profile.getSourceType(), is("character varying"));
assertThat(profile.getValidValuesCount(), is(5l)); // we get 7
assertThat(profile.getInvalidValuesCount(), is(2l)); // we get 0
assertThat(profile.getInvalidValuesPercentage(), is(25f)); // we get 0
assertThat(profile.getDistinctValuesCount(), is(3l)); // we get 5
assertThat(profile.getUniqueValuesCount(), is(2l)); // we get 4
});
Warehouse Type: PostgreSQL
Describe the bug
Installed the latest version of soda-sql, but I'm no longer able to run any of the commands because of an import error:
(.venv) [dirk@meuk]$ soda --help
Traceback (most recent call last):
File "/home/dirk/work/soda/code/meuk/.venv/bin/soda", line 5, in <module>
from sodasql.cli.cli import main
ModuleNotFoundError: No module named 'sodasql.cli'
To Reproduce
Steps to reproduce the behavior:
soda --help
soda create ./sql snowflake
Context
n/a
OS: Linux
Python Version: 3.8
Soda SQL Version: 2.0.0b7
Warehouse Type: na
Running scripts/recreate_venv.sh shows errors because pip-compile is not in dev-requirements.in:
soda-sql on main [!] via 🐍 v3.8.5 took 1m4s
❯ scripts/recreate_venv.sh
Collecting pip
Using cached pip-20.3.3-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.1.1
Uninstalling pip-20.1.1:
Successfully uninstalled pip-20.1.1
Successfully installed pip-20.3.3
ERROR: Invalid requirement: ''
scripts/recreate_venv.sh: line 14: pip-compile: command not found
scripts/recreate_venv.sh: line 15: pip-compile: command not found
This is caused by line 13 of scripts/recreate_venv.sh.
Describe the bug
I am trying to get soda-sql to connect to some AWS Athena tables and stumbled into an issue:
/Users/lgo/dev/soda-sql/venv/bin/python /Users/lgo/dev/soda-sql/venv/bin/soda init
| 2.0.0b10
| Initializing warehouse.yml ...
| Querying warehouse for tables
| Directory tables already exists
| Exception: TODO override and implement this abstract method
Traceback (most recent call last):
File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/cli/cli.py", line 171, in init
scan_initializer.initialize_scan_ymls()
File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/cli/scan_initializer.py", line 45, in initialize_scan_ymls
self.warehouse.dialect.sql_tables_metadata_query())
File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 109, in sql_tables_metadata_query
raise RuntimeError('TODO override and implement this abstract method')
RuntimeError: TODO override and implement this abstract method
Process finished with exit code 0
Not sure why I end up here; AWS Athena support is supposed to work in this beta, right? Besides this stack trace, it would be good if the CLI returned a non-zero exit code in case of a fatal error like this.
Context
My warehouse.yml:
name: athena
connection:
type: athena
database: xxxxx_reporting_db
access_key_id: env_var(AWS_ACCESS_KEY_ID)
secret_access_key: env_var(AWS_SECRET_ACCESS_KEY)
role_arn:
region: us-east-1
staging_dir: s3://xxxxx-reporting-db/db/
My env_vars.yml:
athena:
AWS_ACCESS_KEY_ID: 'xxxxxxxxxxxxxx'
AWS_SECRET_ACCESS_KEY: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
OS: Mac OS X
Python Version: Python 3.8.6
Soda SQL Version: 2.0.0b10
Warehouse Type: athena
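For what it's worth, Athena does expose information_schema, so a hedged sketch of what a tables-metadata query for the Athena dialect could look like is shown below (hypothetical, not the actual soda-sql implementation; the database name comes from warehouse.yml):
# Hypothetical sketch of an Athena tables-metadata query.
def athena_tables_metadata_query(database):
    return (
        "SELECT table_name "
        "FROM information_schema.tables "
        f"WHERE table_schema = '{database.lower()}'"
    )

# athena_tables_metadata_query("xxxxx_reporting_db")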
Hi guys,
I've tried using Soda SQL with Snowflake, but I've had an issue with soda init.
First, I ran soda create . which created the default warehouse.yml.
I modified it to look like the following:
name: .
connection:
type: snowflake
username: MYUSERNAME
password: env_var(SNOWFLAKE_PWD)
account: MYACCOUNTNAME
database: MYDB
warehouse: MYWH
schema: MYSCHEMA
In ~/.soda/env_vars.yml I have:
.:
SNOWFLAKE_PWD: '<my secret password>'
When I run soda init . I get the following error:
/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/snowflake/connector/options.py:78: UserWarning: You have an incompatible version of 'pyarrow' installed (2.0.0), please install a version that adheres to: 'pyarrow<0.18.0,>=0.17.0; extra == "pandas"'
warn_incompatible_dep('pyarrow', _installed_pyarrow_version.version, _expected_pyarrow_version)
| Querying warehouse for tables
| Exception: TODO override and implement this abstract method
Traceback (most recent call last):
File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/cli/cli.py", line 196, in init
warehouse.dialect.sql_tables_metadata_query())
File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 98, in sql_tables_metadata_query
raise RuntimeError('TODO override and implement this abstract method')
RuntimeError: TODO override and implement this abstract method
Apart from the pyarrow and lzma warnings, it seems that the sql_tables_metadata_query method is not implemented for Snowflake. I have not looked into the Soda SQL code, though.
Currently tests/demo/print_demodata_sql_script.py produces SQL statements to (re)create a table called demodata.
The goal is eventually to be able to load this demo data into new accounts.
In order to do this, we need to improve on the demo data.
We should end up with 2 or 3 datasets. One incremental, one snapshot and maybe another one.
Each of the tables should have a realistic name, e.g. customer_transactions.
The demo data scans should be generated over a 7 day period.
The following data issues should be included:
Step 1 is to collect all potential data issues and turn them into 2 or 3 realistic datasets, in which it is really easy to show how Soda captures each issue. Let's list all the potential data issues we want to include in the demo data below.
When soda scan --help is run, the help text says:
... Non zero exist code means tests ...
should be
... Non zero exit code means tests ...
Hi team,
There seems to be an error in the soda-sql:tests doc: https://docs.soda.io/soda-sql/documentation/tests.html. More specifically, the sample snippet in the column test seems to be wrong.
In the scan.yml the tests and validity_format don't need a "-":
Expected working snippet:
table_name: yourtable
metrics:
- row_count
- missing_percentage
- invalid_percentage
columns:
start_date:
validity_format: date_eu
tests:
- invalid_percentage < 2.0
Thanks :)
We have various kinds and levels of testing, in some cases even using the real warehouse types we support.
At the "acceptance level" of the soda CLI command we currently test only against a PostgreSQL database (unit test source file: https://github.com/sodadata/soda-sql/blob/main/tests/local/warehouse/cli/test_cli.py). This test connects to a PostgreSQL database and inserts some data (the same SQL data that the tutorial uses: https://github.com/sodadata/soda-sql/blob/main/tests/demo/demodata.sql).
We should be able to generate test data using Dialect
support for each warehouse we currently use. We should commit this test data for usage in tutorials and in tests, but we should be able to re-run that script later on if we want to add more data types, different data etc.
We should add tests like above-mentioned PostgreSQL test which would run against a particular warehouse, inserting the SQL data generated by the script and verifying metric calculation works as expected.
We should be able to add tutorial pages in our documentation that refer to the demo data we have created, making sure the steps are identical as in our test.
We currently think that GitHub Discussions serves best as the communication medium for our community.
But Slack is also on our radar. If you think we should add a Slack community given that we already have GitHub Discussions, vote this issue up by adding a +1 reaction to this post.
Describe the bug
Failure to install the CLI
ERROR: Could not find a version that satisfies the requirement pyarrow<0.18.0,>=0.17.0
ERROR: No matching distribution found for pyarrow<0.18.0,>=0.17.0
ERROR: Could not find a version that satisfies the requirement snowflake-connector-python==2.3.6
ERROR: No matching distribution found for snowflake-connector-python==2.3.6
To Reproduce
Steps to reproduce the behavior:
OS: Docker
Python Version: 3.9.1
Soda SQL Version: latest
Here are metrics that were pushed previously but that I didn't find in the list of default data metrics currently being pushed from soda-sql:
Please note that the metric NonUniqueValuesCount is apparently not used by the frontend, so if we don't need it any more I can just remove it from the list. The others are still supported & expected.
The 5 min tutorial link mentioned in the README file returns a 403 Access Denied error.
https://docs.soda.io/soda-sql/getting-started/5_min_tutorial
Only relevant for soda-sql users with a Soda Cloud connection
Add an include / exclude columns feature which makes it possible to prevent certain columns from being pushed to Soda Cloud.
Make a note in the docs that we recommend not uploading Personal (Sensitive) Data to the cloud, and include an explanation of which data is possibly uploaded to Soda Cloud, for example the values of min and max.
Describe the bug
I've set up a project which includes a custom SQL metric, but I noticed the columns/variables are always converted to, or expected to be, uppercase.
Take for example the following SQL Metric:
sql: |
SELECT
COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
FROM ORDERS
tests:
open_orders: order_o == 0
processing_orders: order_p > 0
When running soda scan it produces the following output:
(...)
| Executing SQL query:
SELECT
COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
FROM ORDERS
| SQL took 0:00:00.083167
| SQL metric 0 ORDER_P -> 38543
| Query measurement: ORDER_P = 38543
| SQL metric 0 ORDER_O -> 732044
| Query measurement: ORDER_O = 732044
| Test table(ORDERS) expression(row_count == 0) failed
(...)
| stddev(O_ORDERKEY) = 1732051.3849205659
| ORDER_P = 38543
| ORDER_O = 732044
| 28 measurements computed
| 3 tests executed
| 3 of 3 tests failed:
| table(ORDERS) sql_metric(0) expression(order_o == 0) cause order_o=None
| table(ORDERS) sql_metric(0) expression(order_p > 0) cause order_p=None
| table(ORDERS) expression(row_count == 0) cause row_count=1500000
As you can see, order_p and order_o have been converted to uppercase metrics, which causes my tests to fail.
To Reproduce
Steps to reproduce the behavior:
soda scan
None.
Context
table_name: ORDERS
metrics:
- row_count
- sum
columns:
O_ORDERSTATUS:
valid_values:
- O
- F
sql_metrics:
- sql: |
SELECT
COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
FROM ORDERS
tests:
open_orders: order_o == 0
processing_orders: order_p > 0
tests:
dataset_size: row_count == 0
OS: Linux
Python Version: 3.8
Soda SQL Version: 2.0.0b8
Warehouse Type: Snowflake
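A minimal sketch of a workaround on the evaluation side, resolving measurement names case-insensitively before evaluating the test expressions (hypothetical helper, not the actual soda-sql test evaluator; eval is used here purely for illustration):
# Hypothetical: measurements come back uppercased from Snowflake,
# while test expressions use the lowercase aliases from the SQL.
def evaluate_test(expression, measurements):
    scope = {name.lower(): value for name, value in measurements.items()}
    return eval(expression, {}, scope)

print(evaluate_test("order_o == 0", {"ORDER_P": 38543, "ORDER_O": 732044}))  # False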
Why are Athena parameters used as defaults for a Redshift connection?
Wouldn't it be better to use parameters defined in https://docs.soda.io/soda-sql/#/warehouse_types?id=redshift ?
Dependabot can check the repository for outdated dependencies and for security vulnerabilities.
See the links below for more info:
https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates
https://docs.github.com/en/github/administering-a-repository/keeping-your-dependencies-updated-automatically
https://github.com/marketplace/dependabot-preview
Describe the bug
I was trying to create some custom SQL metrics for demo purposes, but it seems like soda scan isn't able to resolve their paths correctly.
The output I'm seeing:
(...)
| Scanning ORDERS in . ...
| sql_metric_path scan.yml does not exist
| sql_metric_path orderstatus_count.yml does not exist
| Executing SQL query:
It seems like the directory path is not applied when looking up the files, which is why they are reported as missing?
To Reproduce
Steps to reproduce the behavior:
See scan.yml and orderstatus_count.yml below for an example of the setup
Run soda scan . ORDERS from within the project directory
Context
scan.yml
table_name: ORDERS
metrics:
- row_count
- missing_count
- missing_percentage
- values_count
- values_percentage
- invalid_count
- invalid_percentage
- min
- max
- avg
- sum
tests:
row_count_min: row_count > 0
orderstatus_count.yml
sql: |
SELECT
COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
FROM ORDERS
tests:
open_orders: order_o == 0
processing_orders: order_p > 0
OS: Linux 5.10
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake
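A minimal sketch of the path resolution I would expect, i.e. resolving sql_metric files relative to the directory containing the table's scan.yml rather than to the current working directory (hypothetical, not the actual scan builder code):
from pathlib import Path

def resolve_sql_metric_path(scan_yml_path, sql_metric_file):
    # Hypothetical: interpret sql_metric paths relative to the scan.yml location.
    base_dir = Path(scan_yml_path).resolve().parent
    candidate = base_dir / sql_metric_file
    if not candidate.exists():
        raise FileNotFoundError(f"sql_metric_path {candidate} does not exist")
    return candidate

# resolve_sql_metric_path("tables/ORDERS/scan.yml", "orderstatus_count.yml")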
See if we can apply this: https://hakibenita.com/sql-anomaly-detection
Since we are going to send measurements from soda-sql, we need to be careful if metricName is identical to one of our own default metrics (for example, but not limited to, rowCount/row_count):
sql: |
SELECT count(*) as row_count
FROM CUSTOMER_TRANSACTIONS
WHERE country = 'US'
tests:
total_volume_greater_than: row_count > 5000
I think Soda SQL should forbid these cases by just throwing an exception at runtime. The backend can't actually know whether a row_count metric that comes in is an unintended override of the default metric.
We can later on introduce namespaces, which will keep metrics in separate namespaces and allow customers to use even conflicting SQL metric names.
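A minimal sketch of the kind of guard this could be when parsing a custom SQL metric, assuming a fixed set of reserved default metric names (the set below is illustrative, not the actual soda-sql list):
# Hypothetical validation of custom SQL metric names.
RESERVED_METRIC_NAMES = {
    "row_count", "missing_count", "missing_percentage", "values_count",
    "invalid_count", "invalid_percentage", "min", "max", "avg", "sum",
}

def validate_sql_metric_names(metric_names):
    clashes = RESERVED_METRIC_NAMES.intersection(n.lower() for n in metric_names)
    if clashes:
        raise ValueError(f"Custom SQL metric names clash with default metrics: {sorted(clashes)}")

# validate_sql_metric_names(["row_count"])  # raises ValueError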
Currently, building the local version uses a hardcoded __dev__ as the version; it would be better to use the git SHA/ref as the version.
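A minimal sketch of deriving a local version string from git at build time (assumes git is available in the build environment and falls back to a placeholder otherwise; the version format is illustrative):
import subprocess

def local_version():
    # Hypothetical: use the short git SHA for local builds.
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True
        ).strip()
        return f"0.0.0+dev.{sha}"
    except (OSError, subprocess.CalledProcessError):
        return "0.0.0+dev"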
Folder structure to be discussed
Libraries use dashes in their names, not underscores. The folder names indicated below all use underscores.
+ src
+ sodasql (why was this again?)
+ setup.py
+ sodasql_cloud
+ setup.py
+ sodasql_cloud
+ tests
+ sodasql_scan
+ setup.py
+ sodasql_scan
+ tests
+ sodasql_cli
+ setup.py
+ sodasql_cli
+ tests
+ sodasql_airflow
+ connectors
+ sodasql_postgres
+ setup.py
+ sodasql_postgres
+ tests
+ sodasql_snowflake
+ sodasql_athena
+ sodasql_redshift
+ sodasql_bigquery
+ sodasql_spark
Also, before starting I would like to understand how we'll deal with versioning the individual libs and how this impacts the release process.
TODO: investigate if dialects can be extracted. Passing data from the core lib to the dialects is no problem, but the dialects should not invoke methods on core code; that would cause circular dependencies. To be investigated before starting this.
Currently only the properties are listed in docs/scan.md.
The goal is that this page becomes the reference doc for scan yaml files.
Add some more context describing the properties.
Describe the bug
The test execution result reports None cause None=None when using named tests. This only seems to happen when using 'named tests' like:
tests:
dataset_size: row_count == 0
| 62 measurements computed
| 1 tests executed
| 1 of 1 tests failed:
| None cause None=None
When providing a list of 'anonymous' tests it correctly reports which tests failed.
| 62 measurements computed
| 1 tests executed
| 1 of 1 tests failed:
| table(ORDERS) expression(row_count == 0) cause row_count=1500000
To Reproduce
Steps to reproduce the behavior:
Use the scan.yml below
Run soda scan ...
Context
scan.yml
table_name: ORDERS
metrics:
- row_count
tests:
dataset_size: row_count == 0
OS: Linux 5.10
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake
Hi team,
Soda SQL doesn't show the names of the tests executed during a scan. It does so for measurements, though. Please see the output below:
(.venv) mathissedestrooper@Mathisses-MacBook-Pro soda-sql-project % soda scan ./snowflake-soda-warehouse ORDERS
| 2.0.0b5
| Scanning ORDERS in ./snowflake-soda-warehouse ...
| sql_metric_path scan.yml does not exist
| Executing SQL query:
(...)
| missing_count(O_ORDERDATE) = 0
| values_percentage(O_ORDERDATE) = 100.0
| 62 measurements computed
| 1 tests executed
| All is good. No tests failed.
It would be handy to show the test names and results instead of just showing x tests executed and x failed. This would allow the user to immediately see what's wrong and act upon it accordingly.
Thanks!
As discussed earlier, last weekend's code restructure should allow for more freedom when defining a soda-sql project directory structure. To accomplish this, we no longer make 'directory' assumptions and instead allow developers to inline their sql_metrics in the scan YML file.
In order to prevent scan YML files from getting too large, we should also support 'macro functions' which allow developers to include multiple files or read the contents of a file. As proposed, we should consider adding support for either one, or both, of the following functions:
include(./path/*.yml)
This include() function can be used in combination with sql_metrics: to include the contents of a list of files. The first argument can be a path containing wildcards or a directory, in which case all yml files in it will be read and parsed.
read(./path/query.sql)
The read() function allows reading the contents of a given file and using it as the property's value. This can, for example, be used with the sql_metrics[].sql property when referring to a metric .sql query file.
An example of a scan YML file which uses read() could then look like:
// ./tables/orders.yml
table_name: CUSTOMERS
metrics:
- row_count
sql_metrics:
# Use inline custom metrics
- sql: |
SELECT
COUNT(CASE WHEN O_COUNTRY IS "NL" THEN 1 END) as dutch_customers,
FROM CUSTOMERS
tests:
no_dutch_customers: dutch_customers == 0
# Use inline custom metrics where the SQL is defined in a separate file
- sql: read(./sql/very_complex_sql.sql)
tests:
order_size: order_size > 1000
tests:
dataset_size: row_count == 0
An example using include() would look like:
// ./tables/orders.yml
table_name: ORDERS
metrics:
- row_count
- sum
columns:
O_ORDERSTATUS:
valid_values:
- O
- F
sql_metrics: include(./metrics/orders/*.yml)
tests:
dataset_size: row_count == 0
// ./metrics/orders/my_metric.yml
sql: |
SELECT
COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
FROM ORDERS
tests:
open_orders: order_o == 0
processing_orders: order_p > 0
Up for discussion: do we want both read() and include(), or do we want to merge them into one? My two cents: merging into one might create more automagical behavior and can therefore be less preferred.
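A minimal sketch of how the two macros could be resolved while parsing the scan YML (hypothetical; uses glob for include() and a plain file read for read(), and assumes PyYAML for parsing):
import glob
from pathlib import Path

import yaml  # PyYAML, assumed available

def resolve_include(pattern, base_dir="."):
    # include(./path/*.yml): parse every matching YAML file into a list of sql_metrics.
    paths = sorted(glob.glob(str(Path(base_dir) / pattern)))
    return [yaml.safe_load(Path(p).read_text()) for p in paths]

def resolve_read(path, base_dir="."):
    # read(./path/query.sql): use the file contents as the property value.
    return (Path(base_dir) / path).read_text()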
Describe the bug
When scanning an Athena table, I get the following error:
...
| Property schema does not exist in warehouse.yml.connection
| Scan failed: warehouse.yml configuration errors:
[error] Property schema does not exist in warehouse.yml.connection
Traceback (most recent call last):
File "/Users/lgo/dev/soda-sql/sodasql/cli/cli.py", line 325, in scan
scan = scan_builder.build()
File "/Users/lgo/dev/soda-sql/sodasql/scan/scan_builder.py", line 76, in build
parser.assert_no_warnings_or_errors()
File "/Users/lgo/dev/soda-sql/sodasql/scan/parser.py", line 120, in assert_no_warnings_or_errors
raise AssertionError(f'{self.description} configuration errors: \n '
AssertionError: warehouse.yml configuration errors:
[error] Property schema does not exist in warehouse.yml.connection
To Reproduce
Steps to reproduce the behavior:
Run soda scan ... on one of the tables
OS: Mac OS X
Python Version: 3.8.6
Soda SQL Version: main branch at current HEAD
Warehouse Type: athena
Describe the bug
Although it is not expected, we are evaluating the metrics for an unknown column type.
tableBuilder.createColumn("bytea_all_nulls_", "BYTEA")
.values(
"null",
"null",
"null",
"null",
"null",
"null",
"null",
"null")
.check(column -> {
ColumnProfile profile = column.getProfile();
assertThat(profile.getSourceType(), is("bytea"));
// When analyticalType == unknown, then all profiling except the types above are null
assertThat(profile.getMissingValuesCount(), nullValue());
assertThat(profile.getMissingValuesPercentage(), nullValue());
assertThat(profile.getValidValuesCount(), nullValue());
assertThat(profile.getInvalidValuesCount(), nullValue());
assertThat(profile.getInvalidValuesPercentage(), nullValue());
assertThat(profile.getDistinctValuesCount(), nullValue());
assertThat(profile.getUniqueValuesCount(), nullValue());
assertThat(profile.getMin(), nullValue());
assertThat(profile.getMins(), nullValue());
assertThat(profile.getMax(), nullValue());
assertThat(profile.getMaxs(), nullValue());
assertThat(profile.getHistogram(), nullValue());
});
Warehouse Type: PostgreSQL