sodadata / soda-core Goto Github PK
View Code? Open in Web Editor NEW:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
Home Page: https://go.soda.io/core-docs
License: Apache License 2.0
:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
Home Page: https://go.soda.io/core-docs
License: Apache License 2.0
Send schema check results to Soda Cloud
For details, see also https://github.com/sodadata/soda-sql-v3/issues/180
Currently we fail with "no module named ..."
➜ soda scan -ef environments.yml -cf actor-checks.yml
Couldn't execute scan
| connection to server at "localhost" (::1), port 5432 failed: Connection refused
| Is the server running on that host and accepting TCP/IP connections?
| connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
| Is the server running on that host and accepting TCP/IP connections?
INFO | Reading SodaCL file "environments.yml"
INFO | Reading SodaCL file "actor-checks.yml"
ERROR | Couldn't execute scan: connection to server at "localhost" (::1), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
e.g.
filter EVENT_TABLE_% [hourly]:
where: TIMESTAMP '${ts_start}' <= event_time AND event_time > dateadd(hour, -1, created_at)
for each table T:
tables:
- EVENT_TABLE_% [hourly]
checks:
- row_count > 0
-
via https://soda-community.slack.com/archives/C038FFU79J5/p1648393195437059
Port histogram metrics as to be used in profiling.
Trying to use a warn in a check that uses a statistical metric and it SKIPPED the check.
checks for sodatest_customers_b7580920:
warn:
when min(size) >= 0
Soda Core 0.0.1
Skipping section "checks for sodatest_customers_b7580920" because content is not a list
+-> line=1,col=1 in checks.yaml
Scan summary:
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
Skipping section "checks for sodatest_customers_b7580920" because content is not a list
+-> line=1,col=1 in checks.yaml
I tried this one as well:
checks for sodatest_customers_b7580920:
warn:
- when stddev(size) between 3 and 4
Soda Core 0.0.1
Skipping section "checks for sodatest_customers_b7580920" because content is not a list
+-> line=1,col=1 in checks.yaml
Scan summary:
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
Skipping section "checks for sodatest_customers_b7580920" because content is not a list
+-> line=1,col=1 in checks.yaml
Same error appears when I attempt to use warn
with a freshness check:
checks for sodatest_customers_b7580920:
warn:
when freshness using last_name < 1h
These valid formats backwards in soda/sodacl/format_cfg.py
. date eu should be DD/MM/YY and date us should be MM/DD/YY.
"date eu": rf"^{s}{month}[-\./]{day}[-\./]{year}{s}$",
"date us": rf"^{s}{day}[-\./]{month}[-\./]{year}{s}$",
Similar support for Athena as in Soda SQL https://docs.soda.io/soda/warehouse_types.html#amazon-athena
I get that the output displays a sample of the rows that failed the test, right? If so, could the label read "Failed samples"?
Additionally, why does the output list all columns? Could it be limited to just the columns I specified in the check?? (probably not, but I'm asking anyway!)
If there are tables with, like, 50 columns or something, what does the output look like?
checks for sodatest_orders_f7532be6:
- reference from (customer_country, customer_zip) to sodatest_customers_b7580920 (country, zip)
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Sample 2022-03-03 11:09:05.365801/local_postgres_sodasql/local_postgres_sodasql.reference[customer_country,customer_zip]:
id customer_id_nok customer_id_ok customer_country customer_zip text
---- --------------- -------------- ---------------- ------------ -------
'O3' 'ID1' 'ID2' 'BE' '2000' 'three'
'O4' 'ID1' 'BE' 'four'
'O5' 'ID98' 'ID4' '2360' 'five'
'O6' 'ID99' 'ID1' 'UK' '2360' 'six'
'ID3' 'seven'
Scan summary:
1/1 check FAILED:
sodatest_orders_f7532be6 in local_postgres_sodasql
reference from (customer_country, customer_zip) to sodatest_customers_b7580920 (country, zip) [FAILED]
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
Using the only wildcard, %
, with the for each
check does not include the datasource schema resulting in a query which scans all tables within the information_schema.tables
table.
QUERY PRODUCE...
SELECT table_name
FROM information_schema.tables
WHERE (lower(table_name) like '%');
SHOULD BE...
SELECT table_name
FROM information_schema.tables
WHERE (lower(table_name) like '%')
AND lower(table_schema) = '[SCHEMA_NAME]';
Soda Scan
soda scan -d raw_stripe stitch_stripe.yml
Configuration
data_source raw_stripe:
type: snowflake
connection:
username: ${SNOWFLAKE_USER}
password: ${SNOWFLAKE_PASSWORD}
account: ${SNOWFLAKE_ACCOUNT}
database: RAW
schema: STRIPE
warehouse: my_warehouse
role: my_role
Check
for each table T:
tables:
- raw_stripe.charges # works
- raw_stripe.% # works but queries all tables
- '%' # works but queries all tables
checks:
- row_count > 0
Result
Queries [RANDOM]
tables not in the schema which results in execution errors
Soda Core 3.0.0b4
Reading configuration file "configuration.yml"
Adding SodaCL dir checks/raw/
Reading SodaCL file "checks/raw/stitch_stripe.yml"
Scan execution starts
Query raw_stripe.for_each_table_T[0]:
SELECT table_name
FROM information_schema.tables
WHERE (lower(table_name) like '%')
Query raw_stripe.[RANDOM_TABLE].aggregation[0]:
SELECT
COUNT(*)
FROM [RANDOM_TABLE].
Query execution error in raw_stripe.[RANDOM_TABLE]..aggregation[0]: 002003 (42S02): SQL compilation error:
Object '[RANDOM_TABLE].' does not exist or not authorized.
[...] -- ALL other tables in `information_schema.tables`
Testing checks with quotes on table names and column names.
row_count
check with quotes on table name.schema
check without quotes on table name.schema
check with quotes on table name.Set up:
Test 001 - row_count
check with table name in quotes; works as expected
checks.yml:
checks for "dim_account":
- row_count > 0
configuration.yml
data_source adventureworks:
type: postgres
connection:
host: db
username: postgres
password: secret
database: postgres
schema: public
Scan input:
soda scan -d adventureworks -V -c configuration.yml checks.yml
Scan output:
Soda Core 0.0.1
Reading configuration file "configuration.yml"
Reading SodaCL file "checks.yml"
Scan execution starts
Query adventureworks."dim_account".aggregation[0]:
SELECT
COUNT(*)
FROM "dim_account"
Scan summary:
1/1 query OK
adventureworks."dim_account".aggregation[0] [OK] 0:00:00.034581
1/1 check PASSED:
"dim_account" in adventureworks
row_count > 0 [PASSED]
check_value: 99
All is good. No failures. No warnings. No errors.
Test 002 - schema
check; works as expected
checks.yml:
checks for dim_account:
- schema:
warn:
when required column missing: [sombrero]
configuration.yml
as above
Scan input:
soda scan -d adventureworks -V -c configuration.yml checks.yml
Scan output:
Soda Core 0.0.1
Reading configuration file "configuration.yml"
Reading SodaCL file "checks.yml"
Scan execution starts
Query adventureworks.dim_account.schema[dim_account]:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE lower(table_name) = 'dim_account'
AND lower(table_catalog) = 'postgres'
AND lower(table_schema) = 'public'
ORDER BY ORDINAL_POSITION
Scan summary:
1/1 query OK
adventureworks.dim_account.schema[dim_account] [OK] 0:00:00.118499
1/1 check WARNED:
dim_account in adventureworks
schema [WARNED]
missing_column_names = [sombrero]
schema_measured = [account_key integer, parent_account_key integer, account_code_alternate_key integer, parent_account_code_alternate_key integer, account_description character varying, account_type character varying, operator character varying, custom_members character varying, value_type character varying, custom_member_options character varying]
Only 1 warning. 0 failure. 0 errors. 0 pass.
Test 003 - schema
check with table name in quotes; error
checks.yml:
checks for "dim_account":
- schema:
warn:
when required column missing: [sombrero]
configuration.yml
as above
Scan input:
soda scan -d adventureworks -V -c configuration.yml checks.yml
Scan output:
Soda Core 0.0.1
Reading configuration file "configuration.yml"
Reading SodaCL file "checks.yml"
Scan execution starts
Query adventureworks."dim_account".schema["dim_account"]:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE lower(table_name) = '"dim_account"'
AND lower(table_catalog) = 'postgres'
AND lower(table_schema) = 'public'
ORDER BY ORDINAL_POSITION
Metrics schema were not computed for check schema
Scan summary:
1/1 query OK
adventureworks."dim_account".schema["dim_account"] [OK] 0:00:00.097976
1/1 check NOT EVALUATED:
"dim_account" in adventureworks
schema [NOT EVALUATED]
1 checks not evaluated.
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
Metrics schema were not computed for check schema
Log when configuration files and SodaCL files are being loaded when -v is present. (using logs.debug)
This might already the case for SodaCL files: to be double checked.
But I suspect it may not be done for configuration files.
Invalid check "columns types": no viable alternative at input 'columnstypes'
+-> checks.yml line=4,col=5 "checks for actor"."[1]"(columns types)
Couldn't detect table check type of 'columns types'
+-> checks.yml line=4,col=5 "checks for actor"."[1]"(columns types)
INFO | Reading SodaCL file "environments.yml"
INFO | Reading SodaCL file "checks.yml"
ERROR | Invalid check "columns types": no viable alternative at input 'columnstypes' checks.yml line=4,col=5 "checks for actor"."[1]"(columns types)
ERROR | Couldn't detect table check type of 'columns types' checks.yml line=4,col=5 "checks for actor"."[1]"(columns types)
See docs/soda_checks_yaml.md. L69 find the TODO
Consider if we should push it to the user to define the right variables and avoid clashes between the variable names when comparing?
So will we map the variables of 1 table to the other as the user makes the reference in the check? Or do we force the user to submit the right variables in the scan as input.
@vijaykiran can you describe the problem?
Is it the same as #1150 ?
➜ soda scan -ef environments.yml -cf actor-checks.yml
Couldn't add SodaCL file actor-checks.yml
| 'warehouse'
INFO | Reading SodaCL file "environments.yml"
INFO | Reading SodaCL file "actor-checks.yml"
ERROR | Couldn't add SodaCL file actor-checks.yml: 'warehouse'
INFO | Scan results:
INFO | All is good. No checks failed or warned.
The for each table T:
check works for <table_name>
but not <schema.table_name>
unlike, and as expected, the checks for
check which supports <schema.table_name>
Example Checks.yml
# Not working...
for each table T:
tables:
- android.application_installed
checks:
- row_count > 0
# Working...
checks for android.application_installed:
- row_count > 0
Result - for each table T:
$ soda scan -d raw -c configuration.yml -V checks/checks.yml
Soda Core 3.0.0b4
Reading configuration file "configuration.yml"
Reading SodaCL file "checks/checks.yml"
Cross data source table sets not yet supported. android.application_installed refers to non default data source android. Default data source is raw
Scan execution starts
Query raw.for_each_table_T[0]:
SELECT table_name
FROM information_schema.tables
Query raw.APPLICATION_INSTALLED.aggregation[0]:
Result - checks for:
$ soda scan -d raw -c configuration.yml -V checks/checks.yml
Soda Core 3.0.0b4
Reading configuration file "configuration.yml"
Reading SodaCL file "checks/checks.yml"
Scan execution starts
Query raw.android.application_installed.aggregation[0]:
SELECT
COUNT(*)
FROM android.application_installed
Switch from the built-in re regex library to another external library for regex resolving.
compatible API, enhanced capability and standard unicode support
Given the checks.yml
configuration:
checks for bitcoin:
- min(usd):
warn: when < 20000
fail: when < 10000
expected check name:
min(usd) fails when < 10000
received check name:
min(usd)
A thought: would it be valuable to show the check_value
of a statistical calculation when a test passes? I like seeing the value when tests fail. Is that possible?
checks for sodatest_customers_b7580920:
- min(size) >= 0
- max(size) <= 5000
- avg(size) between -5 and 1000
- sum(distance) = 100
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Scan summary:
2/4 checks PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
max(size) <= 5000 [PASSED]
avg(size) between -5 and 1000 [PASSED]
2/4 checks FAILED:
sodatest_customers_b7580920 in local_postgres_sodasql
min(size) >= 0 [FAILED]
check_value: -3.0
sum(distance) = 100 [FAILED]
check_value: 4011
Oops! 2 failures. 0 warnings. 0 errors. 2 pass.
When running the following checks with a valid Soda Cloud connection I get an error:
checks for CUSTOMERS:
# Detailed schema expectations control
- schema:
fail:
when required column missing: ['id', 'size', 'distance']
when forbidden column present: ['email_address', 'pii_%']
when wrong column type:
id: varchar
distance: integer
when wrong column index:
id: 0
# And/or simple notifications on any schema change:
warn:
when schema changes: any
The error returned is:
Unknown error while executing scan.
| 'NoneType' object has no attribute 'get'
| Stacktrace:
| Traceback (most recent call last):
| File "/home/dirk/work/soda/code/meuk/data-council-workshop/.venv/lib/python3.8/site-packages/soda/scan.py", line 339, in execute
| historic_value = historic_data.get(historic_descriptor)
| AttributeError: 'NoneType' object has no attribute 'get'
@tombaeyens you asked me to file an issue with the situation I had a full on traceback
my check was written like so with an intentional typo to see what the parser says:
checks for new_fact_currency_rate:
- countsame as fact_currency_rate
This gave:
root@0c07d91bb019:/sodacl# soda scan -d adventureworks -c configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Invalid check "countsame as fact_currency_rate": mismatched input 'as' expecting {'between', 'not', 'warn', 'fail', 'pass', '!=', '<>', '<=', '>=', '=', '<', '>'}
+-> line=9,col=5 in checks.yaml
+-> See https://go.soda.io/SECTHEADER
No threshold specified for check "countsame as fact_currency_rate"
+-> line=9,col=5 in checks.yaml
Unknown error while executing scan.
| 'NoneType' object has no attribute 'column_name'
| Stacktrace:
| Traceback (most recent call last):
| File "/usr/local/lib/python3.8/dist-packages/soda_core/scan.py", line 327, in execute
| self.__create_check(check_cfg, data_source_scan, partition)
| File "/usr/local/lib/python3.8/dist-packages/soda_core/scan.py", line 480, in __create_check
| check.ensure_metrics_and_queries(data_source_scan=data_source_scan, partition=partition)
| File "/usr/local/lib/python3.8/dist-packages/soda_core/execution/metric_check.py", line 84, in ensure_metrics_and_queries
| metric = data_source_scan.resolve_metric(metric)
| File "/usr/local/lib/python3.8/dist-packages/soda_core/execution/data_source_scan.py", line 46, in resolve_metric
| metric.ensure_query()
| File "/usr/local/lib/python3.8/dist-packages/soda_core/execution/numeric_query_metric.py", line 183, in ensure_query
| self.partition.ensure_query_for_metric(self)
| File "/usr/local/lib/python3.8/dist-packages/soda_core/execution/partition.py", line 58, in ensure_query_for_metric
| sql_aggregation_expression = metric.get_sql_aggregation_expression()
| File "/usr/local/lib/python3.8/dist-packages/soda_core/execution/numeric_query_metric.py", line 149, in get_sql_aggregation_expression
| values_expression = self.column.column_name
| AttributeError: 'NoneType' object has no attribute 'column_name'
Ensure we have test coverage for most of the common data types.
environments.yml
warehouse dvdrental:
type: postgres
connection:
host: localhost
port: '5432'
username: sodasql
password:
database: dvdrental
schema: public
checks.yml
warehouse: dvdrental
checks for actor:
- count > 0
Scan command output
➜ soda scan -ef environments.yml -cf actor-checks.yml
Query execution error in dvdrental.actor.aggregation[0]: relation "actor" does not exist
LINE 3: FROM actor
^
SELECT
COUNT(*)
FROM actor
| relation "actor" does not exist
| LINE 3: FROM actor
| ^
Metrics count were not computed for check count > 0
INFO | Reading SodaCL file "environments.yml"
INFO | Reading SodaCL file "actor-checks.yml"
INFO | Query dvdrental.actor.aggregation[0]:
SELECT
COUNT(*)
FROM actor
ERROR | Query execution error in dvdrental.actor.aggregation[0]: relation "actor" does not exist
LINE 3: FROM actor
^
SELECT
COUNT(*)
FROM actor: relation "actor" does not exist
LINE 3: FROM actor
^
ERROR | Metrics count were not computed for check count > 0
INFO | Scan results:
DEBUG | 1 queries ERROR
DEBUG | dvdrental.actor.aggregation[0] | ERROR | 0:00:00.003656
SELECT
COUNT(*)
FROM actor
DEBUG | relation "actor" does not exist
DEBUG | LINE 3: FROM actor
DEBUG | ^
INFO | 1 checks not evaluated. See logs above for details.
INFO | Apart from the checks that have not been evaluated, no other checks failed or warned.
@vijaykiran can you elaborate?
This check passed, as expected, but I don’t understand why the Sample
information appears in the output.
checks for sodatest_customers_b7580920:
- duplicates(id) = 0
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Sample 2022-03-02 15:22:36.421468/local_postgres_sodasql/sodatest_customers_b7580920/local_postgres_sodasql.sodatest_customers_b7580920.id.duplicates:
id frequency
-- ---------
Scan summary:
1/1 check PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
duplicates(id) = 0 [PASSED]
All is good. No failures. No warnings. No errors.
For this check, I left out the column name to see what would happen and it errored as expected but maybe the error message could be improved?
'NoneType' object has no attribute 'column_name'
+ a stacktrace is... not super useful. Could it be shortened to just "Missing column name"?
(Also, need to correct the docs for the metric: should be 'stddev' instead of 'stdev'.)
checks for sodatest_customers_b7580920:
- stdev > 4
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Unknown error while executing scan.
| 'NoneType' object has no attribute 'column_name'
| Stacktrace:
| Traceback (most recent call last):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/scan.py", line 323, in execute
| self.__create_check(check_cfg, data_source_scan, partition)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/scan.py", line 470, in __create_check
| check = Check.create(
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/check.py", line 37, in create
| return MetricCheck(check_cfg, data_source_scan, partition, column)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/metric_check.py", line 79, in __init__
| metric = data_source_scan.resolve_metric(metric)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/data_source_scan.py", line 46, in resolve_metric
| metric.ensure_query()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/numeric_query_metric.py", line 126, in ensure_query
| self.partition.ensure_query_for_metric(self)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/partition.py", line 56, in ensure_query_for_metric
| sql_aggregation_expression = metric.get_sql_aggregation_expression()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/execution/numeric_query_metric.py", line 92, in get_sql_aggregation_expression
| values_expression = self.column.column_name
| AttributeError: 'NoneType' object has no attribute 'column_name'
The UI Datasets page, https://cloud.soda.io/datasets/overview, doesn't format well for long dataset
names (full screen)
Safari
For each, when used as designed, works perfectly:
for each table T:
tables:
- sodatest_customers_b7580920
- sodatest_orders_f7532be6
checks:
- count > 0
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Scan summary:
2/2 checks PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
count > 0 [PASSED]
sodatest_orders_f7532be6 in local_postgres_sodasql
count > 0 [PASSED]
All is good. No failures. No warnings. No errors.
But, because I’m a dummy, I was initially using the For Each check to check columns in a single table instead of tables/dataset in data source. Even though it CLEARLY says tables in the checks file. Duh. But the CLI output showed me all tests passed everything was fine, even though there was no way it could’ve found non-existent table names like id
and size
. Could we add something so that for each errors out if the input are non-table names? For me and the other stoopids of the world?
for each table T:
tables:
- id
- size
checks:
- count > 0
scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Scan summary:
All is good. No failures. No warnings. No errors.
Using soda-core-postgres, 3.0.0b4.
Python 3.9.2
I have created a configuration.yml
file in my ~/.soda
directory. If I try to use the following scan command that ought to implicitly find the file, it produces an error.
soda scan -d aws_postgres_retail checks.yml
Soda Core 3.0.0b4
No configuration file specified nor found on ~/.soda/configuration.yml
Unknown error while executing scan.
| dictionary changed size during iteration
| Stacktrace:
| Traceback (most recent call last):
| File "/Users/Janet/.pyenv/versions/3.9.2/lib/python3.9/site-packages/soda/scan.py", line 267, in execute
| for data_source_scan_cfg in self._sodacl_cfg.data_source_scan_cfgs.values():
| RuntimeError: dictionary changed size during iteration
But if I explicitly identify the configuration.yml and path, it works just fine.
soda scan -d aws_postgres_retail -c ~/.soda/configuration.yml checks.yml
Soda Core 3.0.0b4
Scan summary:
1/1 check PASSED:
orders in aws_postgres_retail
row_count > 0 [PASSED]
All is good. No failures. No warnings. No errors.
This check works as expected:
checks for sodatest_customers_b7580920:
- schema:
warn:
when required column missing: [sombrero]
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Scan summary:
1/1 check WARNED:
sodatest_customers_b7580920 in local_postgres_sodasql
schema [WARNED]
missing_column_names = [sombrero]
schema_measured = [id character varying, size double precision, sizetxt character varying, distance integer, pct character varying, cat character varying, country character varying, zip character varying, email character varying, date date, ts timestamp without time zone, ts_with_tz timestamp with time zone]
Only 1 warning. 0 failure. 0 errors. 0 pass.
But when I purposely messed up my syntax by adding an errant when
, the error message is misleading in pointing out what exactly is wrong. The colon in line 5 isn’t the problem, the aimless extra when is the issue. More precise error message?
checks for sodatest_customers_b7580920:
- schema:
warn:
when
when required column missing: [sombrero]
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
YAML syntax error
| mapping values are not allowed here
| in "<unicode string>", line 5, column 37:
| when required column missing: [sombrero]
| ^ (line: 5)
| Stacktrace:
| Traceback (most recent call last):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/common/parser.py", line 65, in _parse_yaml_str
| file_dict = yaml.load(yaml_str)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/main.py", line 434, in load
| return constructor.get_single_data()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 119, in get_single_data
| node = self.composer.get_single_node()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 76, in get_single_node
| document = self.compose_document()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 99, in compose_document
| node = self.compose_node(None, None)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 141, in compose_node
| node = self.compose_sequence_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 185, in compose_sequence_node
| node.value.append(self.compose_node(node, index))
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 113, in compose_node
| if self.parser.check_event(AliasEvent):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/parser.py", line 146, in check_event
| self.current_event = self.state()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/parser.py", line 629, in parse_block_mapping_value
| if self.scanner.check_token(ValueToken):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 1796, in check_token
| self._gather_comments()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 1838, in _gather_comments
| self.fetch_more_tokens()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 288, in fetch_more_tokens
| return self.fetch_value()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 657, in fetch_value
| raise ScannerError(
| ruamel.yaml.scanner.ScannerError: mapping values are not allowed here
| in "<unicode string>", line 5, column 37:
| when required column missing: [sombrero]
| ^ (line: 5)
+-> line=5,col=37 in checks.yaml
Scan summary:
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 0 pass.
ERRORS:
YAML syntax error
| mapping values are not allowed here
| in "<unicode string>", line 5, column 37:
| when required column missing: [sombrero]
| ^ (line: 5)
| Stacktrace:
| Traceback (most recent call last):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/soda_core/soda_core/common/parser.py", line 65, in _parse_yaml_str
| file_dict = yaml.load(yaml_str)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/main.py", line 434, in load
| return constructor.get_single_data()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/constructor.py", line 119, in get_single_data
| node = self.composer.get_single_node()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 76, in get_single_node
| document = self.compose_document()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 99, in compose_document
| node = self.compose_node(None, None)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 141, in compose_node
| node = self.compose_sequence_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 185, in compose_sequence_node
| node.value.append(self.compose_node(node, index))
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 143, in compose_node
| node = self.compose_mapping_node(anchor)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 223, in compose_mapping_node
| item_value = self.compose_node(node, item_key)
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/composer.py", line 113, in compose_node
| if self.parser.check_event(AliasEvent):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/parser.py", line 146, in check_event
| self.current_event = self.state()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/parser.py", line 629, in parse_block_mapping_value
| if self.scanner.check_token(ValueToken):
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 1796, in check_token
| self._gather_comments()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 1838, in _gather_comments
| self.fetch_more_tokens()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 288, in fetch_more_tokens
| return self.fetch_value()
| File "/Users/Janet/Documents/GitHub/soda-sql-v3/.venv/lib/python3.8/site-packages/ruamel/yaml/scanner.py", line 657, in fetch_value
| raise ScannerError(
| ruamel.yaml.scanner.ScannerError: mapping values are not allowed here
| in "<unicode string>", line 5, column 37:
| when required column missing: [sombrero]
| ^ (line: 5)
+-> line=5,col=37 in checks.yaml
Given the checks.yml
configuration
checks for bitcoin:
- row_count between 0 and 1000000
- min(usd):
warn: when < 20000
fail: when < 10000
- max(usd):
warn: when > 70000
fail: when > 100000
execution of soda scan
will create 3 new checks.
After changing the threshold of the first check and leaving the other 2 checks unchanged in checks.yml
:
checks for bitcoin:
- row_count:
warn: when between 20 and 100
- min(usd):
warn: when < 20000
fail: when < 10000
- max(usd):
warn: when > 70000
fail: when > 100000
all of a sudden min(usd)
and max(usd)
get new identity resulting in creation of new checks in soda-cloud
and you end up with 6 checks instead of 4.
A couple of unexpected things with this check. I copy+pasted this out of the docs, changing only the column name and adding the extra check for count > 0
.
checks for sodatest_customers_b7580920:
- count > 0
- invalid_percent(country) < 1%:
missing values: [N/A, No value, null]
And the output confused me.
Only strings or only numbers allowed in missing values: [‘N/A’, ‘No value’, None]
Of course, it has no correct missing values to check against, but I would expect that if the missing values errors out, the check that uses the missing values would also error out, or at least not pass.
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Only strings or only numbers allowed in missing values: ['N/A', 'No value', None]
+-> line=4,col=7 in checks.yaml
Counting invalid without valid specification does not make sense
Scan summary:
2/2 checks PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
count > 0 [PASSED]
invalid_percent(country) < 1% [PASSED]
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 2 pass.
ERRORS:
Only strings or only numbers allowed in missing values: ['N/A', 'No value', None]
+-> line=4,col=7 in checks.yaml
This one worked, though, without the square brackets. Though now it occurs to me that the square brackets are perhaps not needed in the check, but were in the docs to communicate "these are your options for the value to put here".
checks for sodatest_customers_b7580920:
- count > 0
- invalid_percent(country) < 1%:
missing values: N/A
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Counting invalid without valid specification does not make sense
Scan summary:
2/2 checks PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
count > 0 [PASSED]
invalid_percent(country) < 1% [PASSED]
All is good. No failures. No warnings. No errors.
And this did not work as expected. Though I was hoping the error message might tell me what I think I have already deduced, that you can only use a string for missing values, not a list.
checks for sodatest_customers_b7580920:
- count > 0
- invalid_percent(country) < 1%:
missing values: N/A, None
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Counting invalid without valid specification does not make sense
Scan summary:
2/2 checks PASSED:
sodatest_customers_b7580920 in local_postgres_sodasql
count > 0 [PASSED]
invalid_percent(country) < 1% [PASSED]
All is good. No failures. No warnings. No errors.
running this without scan_execution_date being available as a variable
checks for fact_internet_sales:
- freshness using ship_date with scan_execution_date < 12h45
it uses current scan run time instead of complaining about missing variable
This check fails:
checks for retail_orders:
- invalid_count(discount) < 1:
valid format: positive percentage
With this exception:
(.venv) $soda scan -V -d aws_postgres_retail orders.yml
Soda Core 0.0.1
Reading configuration file "/Users/albin/.soda/configuration.yml"
Reading SodaCL file "orders.yml"
Scan execution starts
Query aws_postgres_retail.retail_orders.aggregation[0]:
SELECT
COUNT(CASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? *(\d+([\.,]\d+)?|([\.,]\d+)) *% *$') THEN 1 END)
FROM retail_orders
Query execution error in aws_postgres_retail.retail_orders.aggregation[0]: operator does not exist: double precision ~ unknown
LINE 2: ...ASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? ...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
SELECT
COUNT(CASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? *(\d+([\.,]\d+)?|([\.,]\d+)) *% *$') THEN 1 END)
FROM retail_orders
+-> operator does not exist: double precision ~ unknown
LINE 2: ...ASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? ...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Metrics invalid_count were not computed for check invalid_count(discount) < 1
Scan summary:
1/1 query ERROR
aws_postgres_retail.retail_orders.aggregation[0] [ERROR] 0:00:00.617506
SELECT
COUNT(CASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? *(\d+([\.,]\d+)?|([\.,]\d+)) *% *$') THEN 1 END)
FROM retail_orders
operator does not exist: double precision ~ unknown
LINE 2: ...ASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? ...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
1/1 check NOT EVALUATED:
retail_orders in aws_postgres_retail
invalid_count(discount) < 1 [NOT EVALUATED]
check_value: None
1 checks not evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Query execution error in aws_postgres_retail.retail_orders.aggregation[0]: operator does not exist: double precision ~ unknown
LINE 2: ...ASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? ...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
SELECT
COUNT(CASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? *(\d+([\.,]\d+)?|([\.,]\d+)) *% *$') THEN 1 END)
FROM retail_orders
+-> operator does not exist: double precision ~ unknown
LINE 2: ...ASE WHEN NOT (discount IS NULL) AND NOT (discount ~ '^ *\+? ...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Metrics invalid_count were not computed for check invalid_count(discount) < 1
(.venv) $
Freshness check sends the value as str(time delta)
this should be milliseconds
{
"identity": "check-test-workshop_ds-ORDERS-daily-created_at-Freshness-bfdcb2cd",
"name": "freshness using created_at < 2h",
"type": "metricThreshold",
"definition": "checks for ORDERS [daily]:\n freshness using created_at < 2h",
"location": {
"filePath": "checks/02_advanced_and_cool.yml",
"line": 34,
"col": 5
},
"dataSource": "workshop_ds",
"table": "ORDERS",
"column": "created_at",
"metrics": [
"metric-test-workshop_ds-ORDERS-daily-created_at-max"
],
"outcome": "fail",
"diagnostics": {
"value": "27 days, 2:29:20.044015",
"maxColumnTimestamp": "2022-02-24 16:25:52",
"maxColumnTimestampUtc": "2022-02-24 16:25:52+00:00",
"nowVariableName": "NOW",
"nowTimestamp": "2022-03-23T18:55:12.044015",
"nowTimestampUtc": "2022-03-23 18:55:12.044015+00:00",
"freshness": "27 days, 2:29:20.044015"
}
}
On purpose, I attempted to evaluate for avg
on a varchar
column. Naturally, there was an error, but the error message doesn’t explicitly tell me, “You can’t calculate average on varchar content, stoopid.” Could we replace the query execution error with something that will more directly lead me to my error? I’m afraid the HINT
didn’t help…
checks for sodatest_customers_b7580920:
- avg(pct) between 1 and 10
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Query execution error in local_postgres_sodasql.sodatest_customers_b7580920.aggregation[0]: function avg(character varying) does not exist
LINE 2: AVG(pct)
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
SELECT
AVG(pct)
FROM sodatest_customers_b7580920
+-> function avg(character varying) does not exist
LINE 2: AVG(pct)
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
Metrics avg were not computed for check avg(pct) between 1 and 10
Scan summary:
1/1 check NOT EVALUATED:
sodatest_customers_b7580920 in local_postgres_sodasql
avg(pct) between 1 and 10 [NOT EVALUATED]
check_value: None
1 checks not evaluated. See logs for details.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Query execution error in local_postgres_sodasql.sodatest_customers_b7580920.aggregation[0]: function avg(character varying) does not exist
LINE 2: AVG(pct)
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
SELECT
AVG(pct)
FROM sodatest_customers_b7580920
+-> function avg(character varying) does not exist
LINE 2: AVG(pct)
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
Metrics avg were not computed for check avg(pct) between 1 and 10
For an integer column, this worked as expected.
checks for sodatest_customers_b7580920:
- avg(distance) between 1 and 10
soda scan -d local_postgres_sodasql -c ~/.soda/configuration.yaml -ch checks.yaml
Soda Core 0.0.1
Scan summary:
1/1 check FAILED:
sodatest_customers_b7580920 in local_postgres_sodasql
avg(distance) between 1 and 10 [FAILED]
check_value: 445.6666666666667
Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
soda scan -d does_not_exist -c configuration.yml checks.yml
results in
Soda Core 3.0.0b4
Unknown error while executing scan.
| dictionary changed size during iteration
| Stacktrace:
| Traceback (most recent call last):
| File "/Users/m1n0/dev/soda/soda-sql/soda-core/soda/core/soda/scan.py", line 267, in execute
| for data_source_scan_cfg in self._sodacl_cfg.data_source_scan_cfgs.values():
| RuntimeError: dictionary changed size during iteration
For the CLI output for checks status:
Only
? Only 1 warning. 0 failure. 0 error. 0 pass.
If there are more than one warnings, the Only
is still there: Only 2 warnings. 0 failure. 0 errors. 0 pass.
Are 50
warnings still Only
? 100
? Maybe we could use a different word, like Note
. (However, the failures trump the warnings in order of precedence so 1 failure and 2 warnings seems fine. Oops! 1 failure. 2 warnings. 0 errors. 0 pass.
)0 warnings
and sometimes it says No warnings
— is that intentional? All is good. No failures. No warnings. No errors.
Oops! 1 failure. 0 warnings. 0 errors. 0 pass.
I suppose I am advocating for consistency when the status is 0/none.Oops! 1 failures.
0 pass.
2 pass.
0 failure
Why does the word
Oops!appear when there is a failure? It implies that something is wrong when, really, a failed test means that something is working correctly/as expected! Maybe instead:
Note` ?Similar support for Hive as in Soda SQL https://docs.soda.io/soda/warehouse_types.html#apache-hive-experimental
All data sources have reserved keywords, if these are used for table/column names we have to quote them. Can we just quote everything to get around the issue?
Given this SodaCL
checks for fact_product_inventory:
- reference from (product_key) to dim_product (product_key)
the output of running soda scan results in 111455 lines because there are too many rows in fact_product_inventory with a product_key that's not there in dim_product. This makes it very inconvenient to see which keys are missing.
...
483 20140616 2014-06-16 '$36.80' 0 0 4
483 20140617 2014-06-17 '$36.62' 0 0 4
483 20140618 2014-06-18 '$36.78' 0 0 4
483 20140619 2014-06-19 '$36.88' 0 0 4
483 20140620 2014-06-20 '$36.86' 0 0 4
483 20140621 2014-06-21 '$36.84' 0 0 4
483 20140622 2014-06-22 '$37.01' 0 0 4
483 20140623 2014-06-23 '$37.13' 0 0 4
483 20140624 2014-06-24 '$37.29' 0 0 4
483 20140625 2014-06-25 '$37.19' 0 0 4
483 20140626 2014-06-26 '$37.13' 0 0 4
483 20140627 2014-06-27 '$37.13' 0 0 4
483 20140628 2014-06-28 '$37.26' 0 0 4
483 20140629 2014-06-29 '$37.16' 0 0 4
483 20140630 2014-06-30 '$37.22' 0 0 4
...
Suggestion: only show the distinct columns (in this case product_key) that are violating the referential integrity check.
Context: https://soda-community.slack.com/archives/C032BDZN3HV/p1645463084048419
Soda Cloud does not support check results without table
property. This property is filled with checks for X
and is not present in "standalone" checks with no table property e.g.
checks:
....
Options:
Disallow checks without table association
This seems very suboptimal as such checks have been identified as useful
Detect a table from the checks automatically
Seems unreliable and too much work
Change cloud to accept checks with no table.
This is technically easy, but more challenging from product point of view - the UI would have to deal with such checks appropriately.
for each table T:
tables:
- fact_%
- exclude fact_sales_quota
checks:
- count > 0
Expected - all fact_*
except fact_sales_quota
should be included.
Actual - all fact_*
are included, exclude is not taken into account
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.