waterdipai / datachecks
Open Source Data Quality Monitoring.
Home Page: https://datachecks.io
License: Apache License 2.0
Tell us about the documentation you'd like us to add or update
As a user, I want to have a quick start guide, so that I can run the project as quickly and easily as possible.
As a user, I want to generate combined metrics.
A combined metric is a special type of metric that is generated by combining two previously defined metrics. Below are examples of combined metrics:
metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us
  - name: count_us_parts_valid
    metric_type: row_count
    resource: product_db.products
  - name: combined_metric_example
    metric_type: combined
    expression: sum(count_us_parts, count_us_parts_valid)
  - name: combined_metric_example
    metric_type: combined
    expression: div(sum(count_us_parts, count_us_parts_valid), count_us_parts_not_valid)
  - name: combined_metric_example_percentage
    metric_type: combined
    expression: mul(div(count_us_parts, count_us_parts_valid), 100)
The supported expression functions are:
div()
sum()
mul()
sub()
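The expression DSL above maps naturally onto Python. Below is a minimal, hypothetical sketch of how such expressions could be evaluated against already-computed metric values; the real implementation would likely parse the expression rather than use eval, and the operator semantics (binary sum, etc.) are assumptions.

```python
def evaluate_combined(expression, metric_values):
    """Evaluate a combined-metric expression such as
    mul(div(a, b), 100) against computed metric values.
    Hypothetical sketch: assumes only the four documented
    operators and metric names appear in the expression."""
    ops = {
        "sum": lambda a, b: a + b,  # sum(x, y) -> x + y
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    # eval with builtins disabled; names resolve to operators or metric values
    return eval(expression, {"__builtins__": {}}, {**ops, **metric_values})

values = {"count_us_parts": 50, "count_us_parts_valid": 200}
result = evaluate_combined(
    "mul(div(count_us_parts, count_us_parts_valid), 100)", values
)
# result is 25.0: (50 / 200) * 100
```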
Tell us about the problem you're trying to solve
As a user, I want to generate null count metric in a data column.
Describe the solution you’d like
Null count is a data quality metric that measures the number of null records in a dataset.
Example
Input
E_ID | First_Name | Last_Name |
---|---|---|
101 | Harry | Gomez |
102 | James | Watson |
103 | NULL | Parker |
104 | Christi | William |
105 | Ellen | Evans |
Output:
Null count for the First_Name field
null_count = 1
Explanation: There is one null record in the table: the First_Name value for E_ID 103 is NULL.
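In Python terms, the null count for a field can be sketched as follows (illustrative only; the real metric would run as a COUNT query against the data source):

```python
def null_count(rows, field):
    """Count rows where the given field is NULL (None)."""
    return sum(1 for row in rows if row.get(field) is None)

employees = [
    {"E_ID": 101, "First_Name": "Harry", "Last_Name": "Gomez"},
    {"E_ID": 102, "First_Name": "James", "Last_Name": "Watson"},
    {"E_ID": 103, "First_Name": None, "Last_Name": "Parker"},
    {"E_ID": 104, "First_Name": "Christi", "Last_Name": "William"},
    {"E_ID": 105, "First_Name": "Ellen", "Last_Name": "Evans"},
]
# null_count(employees, "First_Name") -> 1
```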
Tell us about the problem you're trying to solve
As a user, I want to generate duplicate count metric in a data column.
Describe the solution you’d like
Duplicate count is a data quality metric that measures the number of identical or highly similar records in a dataset, highlighting potential data redundancy or errors.
Example
Input
E_ID | First_Name | Last_Name |
---|---|---|
101 | Harry | Gomez |
102 | James | Watson |
101 | Helen | Parker |
104 | Christi | William |
105 | Ellen | Evans |
Output:
duplicate_count = 1
Explanation: There is one duplicate record in the table: the E_ID value 101 appears twice.
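A minimal Python sketch of the duplicate count over a column (illustrative; the production metric would be computed in SQL):

```python
from collections import Counter

def duplicate_count(values):
    """Count distinct values that occur more than once."""
    return sum(1 for count in Counter(values).values() if count > 1)

e_ids = [101, 102, 101, 104, 105]
# duplicate_count(e_ids) -> 1  (only 101 repeats)
```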
Describe the bug
npm install in the ui directory fails with "Module not found" errors (see logs below).
Steps to reproduce the behavior:
cd ui
npm install -f
Logs and additional context
....
ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31
ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 9:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31
ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31
ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 11:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31
ERROR in ../../../node_modules/@mui/material/internal/svg-icons/ErrorOutline.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 19:0-66 123:27-43
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31
...
continues
Tell us about the problem you're trying to solve
Currently, while running inspect, datachecks refers to a single configuration file. Often this file can become very large. We want to split the configuration into several smaller files so that maintaining these configuration files becomes easier.
Describe the solution you’d like
While running the inspect command, the user can pass a directory containing all configuration files. Datachecks will read every file and aggregate all metric and data source configurations.
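One way the aggregation could work, sketched with plain dicts (file discovery and YAML parsing are omitted, and merge_configs is a hypothetical helper name, not the project's actual API):

```python
def merge_configs(parsed_files):
    """Aggregate data source and metric definitions from several
    parsed configuration files into a single configuration."""
    merged = {"data_sources": [], "metrics": []}
    for cfg in parsed_files:
        merged["data_sources"].extend(cfg.get("data_sources", []))
        merged["metrics"].extend(cfg.get("metrics", []))
    return merged

# In practice each dict would come from parsing one YAML file in the directory.
file_a = {"data_sources": [{"name": "postgres"}]}
file_b = {"metrics": [{"name": "count_of_products"}]}
combined = merge_configs([file_a, file_b])
```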
Describe the bug
When schema is not specified under data_sources in config.yaml, it defaults to None. This None value is then passed to SQLAlchemy's create_engine function, which does not handle it properly; the resulting connection is misconfigured and tables are not loaded.
Expected behavior
There should be a proper condition handling the case where schema is None, instead of passing it directly to the create_engine function.
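A minimal sketch of such a guard, assuming the schema is injected via the Postgres search_path connect option (the helper name is illustrative, not the project's actual code):

```python
def build_connect_args(schema):
    """Return connect_args for create_engine, adding the search_path
    option only when a schema is actually configured."""
    if schema:
        return {"options": f"-csearch_path={schema}"}
    return {}

# Usage sketch:
# engine = create_engine(url, connect_args=build_connect_args(schema))
```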
To Reproduce
Step 1: Set up the sample SQL given in https://docs.datachecks.io/getting_started/ in PostgreSQL.
Step 2: Copy the config.yaml given in the documentation and customize it for your database; mine follows.
data_sources:
  - name: postgres
    type: postgres
    connection:
      host: 127.0.0.1
      port: 5432
      username: postgres
      password: password
      database: postgres
metrics:
  - name: count_of_products
    metric_type: row_count
    resource: postgres.products
    validation:
      threshold: "> 0 & < 1000"
  - name: max_product_price_in_india
    metric_type: max
    resource: postgres.products.price
    filters:
      where: "country_code = 'IN'"
    validation:
      threshold: "< 190"
Step 3: Run the project using the command poetry run datachecks inspect -C config.yaml
Screenshots
When schema is set to None
Logs and additional context
Describe the bug
HTML report slack icon is routing to https://www.linkedin.com/company/datachecks/
Expected behavior
HTML report slack icon should route to
https://join.slack.com/t/datachecks/shared_invite/zt-1zqsigy4i-s5aadIh2mjhdpVWU0PstPg
Tell us about the problem you're trying to solve
As a user, I want to generate the stddev of the numeric column.
Describe the solution you’d like
Tasks:
Tell us about the problem you're trying to solve
New integration for databricks.
Describe the solution you’d like
We will use databricks-sql-connector
to make the connection with databricks.
Tasks:
Tell us about the problem you're trying to solve
As a user, I want to generate profiling metrics for all the tables and columns for the search index data sources i.e. Opensearch
Describe the solution you’d like
When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.
Below are the steps to do it.
Tell us about the problem you're trying to solve
As a user, I want to profile all the indices and fields for OpenSearch. To do this, we need to get all the index and field information from OpenSearch.
Tell us about the problem you're trying to solve
As a user, I want to raise the validation error for data quality violations.
Describe the solution you’d like
Tell us about the problem you're trying to solve
As a user, I want to generate metrics for different table segments.
For example, a table with one categorical column named C1 has values V1, V2, and V3. We want to get all the auto-metrics for the 3 sets of values.
Describe the solution you’d like
We need to provide configuration to the inspect class to generate segments.
tables:
table:
segments:
- name: n1
where: c1 == v1
Tell us about the problem you're trying to solve
As a user, I want to write custom SQL query to generate metric.
Describe the solution you’d like
To be detailed out
Tell us about the problem you're trying to solve
As a user, I want to generate null percentage metric in a data column.
Describe the solution you’d like
Null percentage metrics reveal missing data, a vital facet of completeness metrics, ensuring data sets are whole and reliable.
Example
Input:
E_ID | First_Name | Last_Name |
---|---|---|
101 | Harry | Gomez |
102 | James | Watson |
103 | NULL | Parker |
104 | Christi | William |
105 | Ellen | Evans |
Output:
Null percentage for First Name Field
null_percentage = 20%
Explanation: There is one null record in the table, i.e. column EMPLOYEE_ID(103).
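The example computation can be sketched in Python (illustrative only; the actual metric would be computed in SQL against the data source):

```python
def null_percentage(rows, field):
    """Percentage of rows where the given field is NULL (None)."""
    nulls = sum(1 for row in rows if row.get(field) is None)
    return 100.0 * nulls / len(rows)

employees = [
    {"E_ID": 101, "First_Name": "Harry"},
    {"E_ID": 102, "First_Name": "James"},
    {"E_ID": 103, "First_Name": None},
    {"E_ID": 104, "First_Name": "Christi"},
    {"E_ID": 105, "First_Name": "Ellen"},
]
# null_percentage(employees, "First_Name") -> 20.0  (1 null out of 5 rows)
```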
Describe the bug
The metric config isn't affecting the results.
Logs and additional context
qualified_table_name = self.qualified_table_name(table)
query = f"SELECT COUNT(*) FROM {qualified_table_name} AS row_count"
As a user, I want to generate the min metric for a numeric column.
Tell us about the problem you're trying to solve
As a user, I want to identify data distribution imbalances of the numeric column.
Describe the solution you’d like
Skew metric in data quality measures the extent of asymmetry or distortion in the distribution of data values. It helps assess the balance and uniformity of data distribution.
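As a sketch of the math (the production metric would be computed by the data source; this uses the population skewness, i.e. the third standardized moment):

```python
import statistics

def skewness(values):
    """Population skewness: the third standardized moment.
    Zero for a perfectly symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# A symmetric sample has zero skew; a long right tail gives positive skew.
```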
Tell us about the problem you're trying to solve
As a user, I want to generate the kurtosis of the data column.
Describe the solution you’d like
Kurtosis is a data quality metric that measures the "tailedness" of a dataset's probability distribution, i.e. how peaked or flat it is relative to a normal distribution.
Describe the bug
Filter clause is not applying for row_count metric
Expected behavior
- name: in_search_staging_row_count
  metric_type: row_count
  resource: search_consumer_pgsql.country_product
  filters:
    where: "country_code = 'IND'"
This should filter results based on the where clause, but the filter is not applied.
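The expected behavior can be sketched as appending the configured filter to the generated query (illustrative helper, not the project's actual implementation):

```python
def row_count_query(qualified_table_name, where=None):
    """Build the row_count query, applying the filter when configured."""
    query = f"SELECT COUNT(*) FROM {qualified_table_name}"
    if where:
        query += f" WHERE {where}"
    return query

# row_count_query("country_product", "country_code = 'IND'")
# -> "SELECT COUNT(*) FROM country_product WHERE country_code = 'IND'"
```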
New integration for MySQL.
We will use Mysql-python to make the connection with MySQL.
Tasks:
Currently, the configuration key for the type of the metric is metric_type. To make it uniform with the data_source name, we will rename it to type.
Current state:
metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us
Change it to:
metrics:
  - name: count_us_parts
    type: document_count
    resource: search_datastore.product_data_us
New integration for Elasticsearch.
We will use the Elasticsearch Python client to make the connection with Elasticsearch.
Tasks:
Describe the bug
The freshness metric for opensearch and elasticsearch data sources is not accepting field_name.
Tell us about the problem you're trying to solve
As a user, I want to generate profiling metrics for all the tables and columns for the data sources.
Describe the solution you’d like
When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.
Below are the steps to do it.
Tell us about the problem you're trying to solve
New integration for Redshift.
Describe the solution you’d like
We will use Redshift sqlalchemy to make the connection with Redshift.
Tell us about the problem you're trying to solve
As a user, I want to store the metrics in the database, so that we have historical data for metrics.
Describe the solution you’d like
We will use Elasticsearch for our metric store. The database repository will
Set up docs on the custom gh-pages.
Tell us about the problem you're trying to solve
As a user, I want to generate the sum of the numeric column.
Describe the solution you’d like
Tell us about the problem you're trying to solve
After implementing the combined metric, we need to make changes in CLI output table
Describe the solution you’d like
Updated structure for the CLI metric table
Metric Name | Data Source | Metric Type | Value |
---|---|---|---|
n1 | d1 | max | 1 |
n2 | d1 | min | 3 |
n3 | d2 | max | 2 |
cm1 | | combined | 4 |
Tell us about the problem you're trying to solve
As a user, I want to calculate the nth root of the product of n data values for the data column.
Describe the solution you’d like
The geometric mean metric in data quality is a statistical measure that calculates the nth root of the product of n data values, often used to assess the central tendency of a dataset.
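Python's standard library already exposes this computation directly (Python 3.8+):

```python
from statistics import geometric_mean

# nth root of the product of n values: for [2, 8], (2 * 8) ** (1/2) == 4.0
value = geometric_mean([2, 8])
```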
Tell us about the problem you're trying to solve
As a user, I want to see better-formatted output from CLI
Describe the solution you’d like
Metrics information will be shown as a table in CLI.
We will use the rich library to build and show the table in the command line.
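Before wiring up rich, the intended layout can be sketched with plain string formatting (rich's Table would add borders, alignment, and color on top of the same row data):

```python
def render_metric_table(rows):
    """Render metric rows as an aligned plain-text table."""
    headers = ("Metric Name", "Data Source", "Metric Type", "Value")
    all_rows = [headers] + [tuple(str(c) for c in r) for r in rows]
    widths = [max(len(r[i]) for r in all_rows) for i in range(len(headers))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
        for row in all_rows
    )

output = render_metric_table([("n1", "d1", "max", 1), ("n2", "d1", "min", 3)])
```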
Tell us about the problem you're trying to solve
As a user, I want to generate distinct count for the data column.
Describe the solution you’d like
A distinct count metric in data quality measures the number of unique values within a dataset, ensuring accuracy and completeness.
Example
Input:
E_ID | First_Name | Last_Name | City |
---|---|---|---|
001 | Harry | Gomez | NY |
002 | James | Watson | LA |
003 | Helen | Parker | NY |
004 | Christi | William | OH |
005 | Ellen | Evans | OH |
Output:
distinct_count = 3
Explanation: We are calculating the distinct count of values in the "City" column. The expected output is 3 because there are three unique values in the dataset: NY, LA, and OH.
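The example boils down to a set cardinality (illustrative sketch; the real metric would be a COUNT(DISTINCT ...) in SQL):

```python
def distinct_count(rows, field):
    """Number of unique values in the given column."""
    return len({row[field] for row in rows})

cities = [
    {"E_ID": 1, "City": "NY"},
    {"E_ID": 2, "City": "LA"},
    {"E_ID": 3, "City": "NY"},
    {"E_ID": 4, "City": "OH"},
    {"E_ID": 5, "City": "OH"},
]
# distinct_count(cities, "City") -> 3
```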
Tell us about the problem you're trying to solve
As a user, I want to generate an HTML report for all the metrics, so that it is easily shareable.
Describe the solution you’d like
To the inspect command we will add an optional parameter --report, which will generate a single-page HTML report for all the metrics.
The Inspect class will have another method, generate_report, which will take a file name and generate a dashboard for all the metrics.
Tell us about the problem you're trying to solve
What are you trying to do?
Describe the solution you’d like
A clear and concise description of what you want to see happen.
Describe the alternative you’ve considered or used
As a user, I want to generate an average for the numeric field of a table.
Tell us about the problem you're trying to solve
As a user, I want to generate the reciprocal of the average of the reciprocals of data values for the data column.
Describe the solution you’d like
The Harmonic mean metric in data quality is a statistical measure used to assess the quality of data by calculating the reciprocal of the average of the reciprocals of data values.
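This, too, is available directly in Python's standard library:

```python
from statistics import harmonic_mean

# reciprocal of the mean of reciprocals: 3 / (1/1 + 1/4 + 1/4) == 2.0
value = harmonic_mean([1, 4, 4])
```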
Describe the bug
When the data source reference name is wrong in the resource selector, Datachecks should capture the error and provide a better error code.
Expected behavior
The run exits with a proper error code and log message.
To Reproduce
Add data sources in the configuration like below:
data_sources:
  - name: search_datastore # Data source name
    type: opensearch # Data source type is OpenSearch
    connection:
      host: 127.0.0.1
      port: 9205
      username: !ENV ${OS_USER} # Username to use for authentication ENV variables
      password: !ENV ${OS_PASS}
Add metrics like below:
metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore1.product_data_us
Screenshots
Logs and additional context
If applicable, any other context, logs, etc. here.
MkDocs will be used for the Datachecks documentation. The basic MkDocs framework needs to be set up.
Tell us about the problem you're trying to solve
New integration for BigQuery.
Describe the solution you’d like
We will use sqlalchemy-bigquery to make the connection with BigQuery.
Describe the bug
The average and variance metrics are rendered unusable in PostgreSQL.
Tell us about the problem you're trying to solve
As a user, I want to generate variance metric.
Describe the solution you’d like
Variance in data quality measures the degree of variability or dispersion in a dataset, indicating how spread out the data points are from the mean.
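Python's standard library distinguishes population and sample variance, which is worth pinning down when defining the metric:

```python
from statistics import pvariance, variance

data = [1, 2, 3, 4, 5]
population = pvariance(data)  # divides by n      -> 2.0
sample = variance(data)       # divides by n - 1  -> 2.5
```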
Describe the bug
The configured schema is not propagating to the Postgres connection. The current state is:
url = URL.create(
    drivername="postgresql",
    username=self.data_connection.get("username"),
    password=self.data_connection.get("password"),
    host=self.data_connection.get("host"),
    port=self.data_connection.get("port"),
    database=self.data_connection.get("database"),
)
engine = create_engine(url)
The schema name should be part of engine creation.
Expected behavior
The Postgres data source should take the schema name from the configuration, and the data source should connect to only that schema.
- name: search_staging_db # Data source name
  type: postgres # Data source type is Postgres
  connection:
    host: 127.0.0.1
    port: 5422
    username: !ENV ${DB2_USER} # Username to use for authentication ENV variables
    password: !ENV ${DB2_PASS} # Password to use for authentication ENV variables
    database: dc_db_2
    schema: custom_name
The code should accommodate the schema name:
schema = self.data_connection.get("schema")
# ....
engine = create_engine(
    url,
    connect_args={'options': f'-csearch_path={schema}'},
    isolation_level="AUTOCOMMIT"
)
Tell us about the problem you're trying to solve
While connecting to the Postgres database, the process should gather all tables and column schemas.
This will help to generate the auto-metrics for all the tables and columns.
Describe the solution you’d like
SQLAlchemy has an API to get all schemas; we will use it.
Getting table names
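SQLAlchemy's inspector (e.g. get_table_names() and get_columns()) wraps the Postgres catalog; the underlying query can be sketched as below (the helper name is hypothetical, and interpolating the schema is for illustration only — real code should bind it as a parameter):

```python
def columns_query(schema="public"):
    """Build a query that lists every table, column, and data type
    in the given Postgres schema."""
    return (
        "SELECT table_name, column_name, data_type "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' "
        "ORDER BY table_name, ordinal_position"
    )
```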