waterdipai / datachecks

128 stars · 2 watchers · 18 forks · 4.56 MB

Open Source Data Quality Monitoring.

Home Page: https://datachecks.io

License: Apache License 2.0

Languages: Python 91.90% · TypeScript 6.64% · CSS 0.94% · JavaScript 0.38% · Makefile 0.08% · Dockerfile 0.07%
Topics: data-engineering, data-validation, dataops, dataquality, metrics, mlops, postgresql, python, data-governance, data-observability

datachecks's People

Contributors

anu-ra-g, datageek00, dependabot[bot], driptanil, fabriciodadosbr, niyasrad, pulak0717, ryuk-me, subhankarb, weryzebra-yue


datachecks's Issues

docs: add getting started page

Tell us about the documentation you'd like us to add or update
As a user, I want to have a quick start guide, so that I can run the project as quickly and easily as possible.

feat: implement combined metric

Tell us about the problem you're trying to solve

As a user, I want to generate combined metrics.

Describe the solution you’d like

A combined metric is a special type of metric that is generated by combining previously defined metrics. Below are examples of combined metrics:

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us
  - name: count_us_parts_valid
    metric_type: row_count
    resource: product_db.products
  - name: combined_metric_example
    metric_type: combined
    expression: sum(count_us_parts, count_us_parts_valid)
  - name: combined_metric_example
    metric_type: combined
    expression: div(sum(count_us_parts, count_us_parts_valid), count_us_parts_not_valid)
  - name: combined_metric_example_percentage
    metric_type: combined
    expression: mul(div(count_us_parts, count_us_parts_valid), 100)

Available Functions:

  • div()
  • sum()
  • mul()
  • sub()

Tasks:

  • Extract functions from the expression string.
  • Define a Combined metric class to calculate the metric
  • Configuration extraction for combined metrics
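
Below is a minimal sketch of the first two tasks (extracting functions from the expression string and evaluating them against already-computed metric values); the evaluate function, the FUNCTIONS table, and the metric_values lookup are illustrative, not the final implementation.

import re

FUNCTIONS = {
    "sum": lambda *args: sum(args),
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def evaluate(expression: str, metric_values: dict) -> float:
    # Recursively evaluate a combined-metric expression string.
    expression = expression.strip()
    match = re.match(r"^(\w+)\((.*)\)$", expression)
    if match and match.group(1) in FUNCTIONS:
        name, inner = match.groups()
        # Split the top-level arguments, respecting nested parentheses.
        args, depth, start = [], 0, 0
        for i, ch in enumerate(inner):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                args.append(inner[start:i])
                start = i + 1
        args.append(inner[start:])
        return FUNCTIONS[name](*(evaluate(a, metric_values) for a in args))
    # Leaf node: a numeric literal or a reference to a defined metric.
    try:
        return float(expression)
    except ValueError:
        return metric_values[expression]

For example, evaluate("mul(div(count_us_parts, count_us_parts_valid), 100)", {"count_us_parts": 50, "count_us_parts_valid": 200}) returns 25.0.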

fix: duplicate metric name should throw error

Describe the bug
In the configuration file, there is no duplication check logic for the metric name

Expected behavior
All the metric names must be unique.

Screenshots
It should throw an error if the same name is used for another metric.
(screenshot: Screenshot 2023-09-24 at 10 31 54 AM)
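
A minimal sketch of the missing check, assuming the parsed configuration yields a list of metric dictionaries; the function name and exception type are illustrative.

def validate_unique_metric_names(metric_configs: list) -> None:
    # Raise on the first metric name that has already been seen.
    seen = set()
    for config in metric_configs:
        name = config["name"]
        if name in seen:
            raise ValueError(f"Duplicate metric name found: {name!r}")
        seen.add(name)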

feat: Implement Null Count metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a null count metric for a data column.

Describe the solution you’d like

Null count is a data quality metric that measures the number of null records in a dataset.

Example

Input

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
103 NULL Parker
104 Christi William
105 Ellen Evans

Output:
Null count for the First_Name field:
null_count = 1

Explanation: There is one null record in the table: the First_Name value for E_ID 103.
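
A hedged sketch of the SQL a datasource might issue for this metric; the helper and table names are assumptions.

def build_null_count_query(table: str, field: str) -> str:
    # Count the rows where the field is NULL.
    return f"SELECT COUNT(*) AS null_count FROM {table} WHERE {field} IS NULL"

For the example above, build_null_count_query("employees", "First_Name") counts the single NULL in First_Name.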

feat: Implement Duplicate Count metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a duplicate count metric for a data column.

Describe the solution you’d like

Duplicate count is a data quality metric that measures the number of identical or highly similar records in a dataset, highlighting potential data redundancy or errors.

Example

Input

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
101 Helen Parker
104 Christi William
105 Ellen Evans

Output:

duplicate_count = 1

Explanation: There is one duplicated value in the table: E_ID 101 appears twice.
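
A hedged sketch of a duplicate-count query for a SQL datasource; the helper and table names are assumptions. Each value that occurs more than once contributes one to the total, matching the example above.

def build_duplicate_count_query(table: str, field: str) -> str:
    # Count the values that appear more than once in the field.
    return (
        f"SELECT COUNT(*) AS duplicate_count FROM "
        f"(SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1) AS dup"
    )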

fix(ui): frontend dependency issues

Description of what the bug is.

  1. npm install requires a forced install (-f).
  2. npm install emits many module-resolution errors.

Steps to reproduce the behavior:

cd ui
npm install -f

Logs and additional context

....


ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 9:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 11:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/ErrorOutline.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 19:0-66 123:27-43
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

... (log continues)

feat: include multiple configuration files in an inspect run

Tell us about the problem you're trying to solve
Currently, while running inspect, datachecks reads a single configuration file. Often this file becomes very large. We want to split the configuration into several smaller files so that maintaining them becomes easier.

Describe the solution you’d like

While running the inspect command, the user can pass a directory containing all configuration files. Datachecks will read every file and aggregate all metric and data source configurations, as sketched below.
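
A minimal sketch of the aggregation, assuming the configuration files are YAML and that the metrics and data_sources sections can simply be concatenated; the function name is illustrative.

from pathlib import Path

import yaml

def load_configuration_directory(directory: str) -> dict:
    # Merge every *.yaml file in the directory into one configuration.
    merged = {"data_sources": [], "metrics": []}
    for path in sorted(Path(directory).glob("*.yaml")):
        config = yaml.safe_load(path.read_text()) or {}
        merged["data_sources"].extend(config.get("data_sources", []))
        merged["metrics"].extend(config.get("metrics", []))
    return merged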

fix: schema defaulting to None instead of "public" in postgres database integration

Describe the bug
When schema is not set in the data_sources section of config.yaml, it defaults to None. That None is passed straight to SQLAlchemy's create_engine function, which does not handle it properly; the resulting connection is broken and tables are not loaded.

Expected behavior
There should be a proper condition that handles a None schema instead of passing it directly to the create_engine function.
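
One possible fix, sketched: fall back to "public" when the schema key is missing, instead of passing None through to create_engine. This mirrors the connect_args approach shown in the schema-connection issue further down.

schema = self.data_connection.get("schema") or "public"
engine = create_engine(
    url,
    connect_args={"options": f"-csearch_path={schema}"},
)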

To Reproduce
Step 1: Set up the sample SQL given in https://docs.datachecks.io/getting_started/ in PostgreSQL
Step 2: Copy the config.yaml given in the documentation and customize it for your database name; mine follows.

data_sources:
  - name: postgres
    type: postgres
    connection:
      host: 127.0.0.1
      port: 5432
      username: postgres
      password: password
      database: postgres
metrics:
  - name: count_of_products
    metric_type: row_count
    resource: postgres.products
    validation:
      threshold: "> 0 & < 1000"
  - name: max_product_price_in_india
    metric_type: max
    resource: postgres.products.price
    filters:
      where: "country_code = 'IN'"
    validation:
      threshold: "< 190"

Step 3: Run the project using the command poetry run datachecks inspect -C config.yaml

Screenshots
When schema is set to None: (screenshots omitted)

When schema is made public: (screenshots omitted)


feat: Standard Deviation metric for Numeric Field

Tell us about the problem you're trying to solve

As a user, I want to generate the standard deviation of a numeric column.

Describe the solution you’d like

Tasks:

  • Add Standard Deviation metric
  • Add docs for the metric

feat: support for databricks

Tell us about the problem you're trying to solve

New integration for databricks.

Describe the solution you’d like
We will use databricks-sql-connector to make the connection with databricks.

Tasks:

  • Add integration for Databricks
  • Add docs for the new integration
  • Manually test all the metrics for Databricks, as an integration test will not be available.
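
A hedged connection sketch using databricks-sql-connector; the hostname, HTTP path, token, and table are placeholders.

from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
)
with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM <catalog>.<schema>.<table>")
    print(cursor.fetchone())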

feat: Automatic Datasource profiling for Search Index Datasources

Tell us about the problem you're trying to solve

As a user, I want to generate profiling metrics for all the tables and columns of search index data sources, e.g. OpenSearch.

Describe the solution you’d like

When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.

Below are the steps to do it; a sketch follows the list.

  • Get all the tables for the data source.
  • Get all the columns for the table
  • Generate profiles for the numeric columns
  • Generate profiles for the text columns
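
A minimal sketch of the discovery steps using opensearch-py; the field-type buckets and profiling calls are placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "127.0.0.1", "port": 9205}])

# Steps 1-2: get every index and its field mappings.
mappings = client.indices.get_mapping(index="*")
for index_name, mapping in mappings.items():
    fields = mapping["mappings"].get("properties", {})
    for field_name, field_info in fields.items():
        field_type = field_info.get("type")
        # Steps 3-4: dispatch to numeric or text profilers by field type.
        if field_type in ("integer", "long", "float", "double"):
            pass  # generate the numeric profile (min, max, avg, ...)
        elif field_type in ("text", "keyword"):
            pass  # generate the text profile (distinct count, null count, ...)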

feat: OpenSearch auto-detecting indices and fields

Tell us about the problem you're trying to solve

As a user, I want to profile all the indices and fields for OpenSearch. To do that, we need to fetch all index and field information from OpenSearch.

feat: Table Segmentation

Tell us about the problem you're trying to solve

As a user, I want to generate metrics for different table segments.
For example, a table with one categorical column named C1 has values V1, V2, and V3. We want to get all the auto-metrics for each of the three value segments.

Describe the solution you’d like

We need to provide configuration to the inspect class to generate segments.

tables:
  table:
    segments:
      - name: n1
        where: c1 == v1

feat: custom sql metric

Tell us about the problem you're trying to solve
As a user, I want to write a custom SQL query to generate a metric.

Describe the solution you’d like

To be detailed out

feat: Implement Null percentage metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a null percentage metric for a data column.

Describe the solution you’d like

Null percentage metrics reveal missing data, a vital facet of completeness metrics, ensuring data sets are whole and reliable.

Example

Input:

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
103 NULL Parker
104 Christi William
105 Ellen Evans

Output:

Null percentage for the First_Name field:
null_percentage = 20%
Explanation: There is one null record out of five in the table: the First_Name value for E_ID 103.
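
A hedged sketch of the corresponding query; multiplying by 100.0 keeps the division in floating point. The helper and table names are assumptions.

def build_null_percentage_query(table: str, field: str) -> str:
    # Percentage of rows where the field is NULL.
    return (
        f"SELECT SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END) * 100.0 "
        f"/ COUNT(*) AS null_percentage FROM {table}"
    )

For the example above, one NULL out of five rows yields 20.0.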

fix(datasource): include `qualified_table_name` to all table calls

Describe the bug

  • The change for including a function call to represent the qualified table name was initiated in a previous commit.
  • However, some of the datasource query functions have not been updated accordingly; this still needs to be implemented.

Logs and additional context

  • Sample implementation of this would be,
qualified_table_name = self.qualified_table_name(table)
query = f"SELECT COUNT(*) AS row_count FROM {qualified_table_name}"

Numeric Metric Min

As a user, I want to generate the numeric min metric.

Description

  • It is a type of field metric.
  • This metric will be calculated for both transactional databases and search engines.

feat: Implement Skew metrics

Tell us about the problem you're trying to solve

As a user, I want to identify data distribution imbalances of the numeric column.

Describe the solution you’d like

Skew metric in data quality measures the extent of asymmetry or distortion in the distribution of data values. It helps assess the balance and uniformity of data distribution.

feat: Implement Kurtosis mean metrics

Tell us about the problem you're trying to solve

As a user, I want to generate the kurtosis of a data column.

Describe the solution you’d like

Kurtosis is a data quality metric that measures the degree of peakedness or flatness of a dataset's probability distribution.

fix: in row_count filter option is not applied

Describe the bug
The filter clause is not applied for the row_count metric.

Expected behavior

  - name: in_search_staging_row_count
    metric_type: row_count
    resource: search_consumer_pgsql.country_product
    filters:
      where: "country_code = 'IND'"

This should filter results based on the where clause, but the filter is not being applied. A sketch of the expected query construction follows.
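
A minimal sketch of the expected behavior: the configured filter should be appended as a WHERE clause before the row-count query is executed. The function name is illustrative.

def build_row_count_query(table: str, where_filter=None) -> str:
    query = f"SELECT COUNT(*) FROM {table}"
    if where_filter:
        query += f" WHERE {where_filter}"
    return query

build_row_count_query("country_product", "country_code = 'IND'") should therefore produce SELECT COUNT(*) FROM country_product WHERE country_code = 'IND'.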

feat: Mysql Datasource Integration

Tell us about the problem you're trying to solve

New integration for Mysql.

Describe the solution you’d like

We will use MySQL-python to make the connection with MySQL.

Tasks:

  • Add integration for MySQL
  • Add docs for the new integration
  • Integration test for all the metrics for MySQL
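
A hedged connection sketch; MySQL-python exposes the MySQLdb module, and the keyword names (passwd, db) follow that library. The credentials shown are placeholders.

import MySQLdb

connection = MySQLdb.connect(
    host="127.0.0.1",
    port=3306,
    user="root",
    passwd="<password>",
    db="<database>",
)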

refactor: change input configuration from metric_type to type

Currently, the configuration key for the type of the metric is metric_type.

To make it uniform with the data_source name, we will rename it to type.

Current state:

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us

Change it to:

metrics:
  - name: count_us_parts
    type: document_count
    resource: search_datastore.product_data_us

fix: duplicate datasource name should throw error

Describe the bug
In the configuration file, there is no duplication check logic for data source names.

Expected behavior
All data source names must be unique.

Screenshots
(screenshot: Screenshot 2023-09-24 at 10 28 32 AM)

feat: Elasticsearch datasource integration

Tell us about the problem you're trying to solve

New integration for Elasticsearch.

Describe the solution you’d like

We will use the Elasticsearch Python client to make the connection with Elasticsearch.

Tasks:

  • Add integration for Elasticsearch
  • Add docs for the new integration
  • Integration test for all the metrics for Elasticsearch

[FEATURE] Automatic Datasource profiling for SQL Datasources

Tell us about the problem you're trying to solve
As a user, I want to generate profiling metrics for all the tables and columns for the data sources.

Describe the solution you’d like
When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.

Below are the steps to do it; a sketch follows the list.

  1. Get all the tables for the data source.
  2. Get all the columns for the table
  3. Generate profiles for the numeric columns
  4. Generate profiles for the text columns
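
A minimal sketch of the discovery loop using SQLAlchemy's inspection API; the connection URL and the profiling calls are placeholders.

from sqlalchemy import create_engine, inspect
from sqlalchemy.types import Integer, Numeric, String

engine = create_engine("postgresql://user:password@127.0.0.1:5432/postgres")
inspector = inspect(engine)

for table_name in inspector.get_table_names():            # step 1
    for column in inspector.get_columns(table_name):      # step 2
        column_type = column["type"]
        if isinstance(column_type, (Integer, Numeric)):
            pass  # step 3: numeric profile (min, max, avg, stddev, ...)
        elif isinstance(column_type, String):
            pass  # step 4: text profile (distinct count, null count, ...)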

feat: Store metrics in Database

Tell us about the problem you're trying to solve

As a user, I want to store the metrics in the database, so that we have historical data for metrics.

Describe the solution you’d like

We will use Elasticsearch for our metric store. The database repository will:

  • Store metrics in the index
  • Read metrics from the index
  • Create an index mapping for the metric store
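
A hedged sketch of the repository's three responsibilities, assuming the elasticsearch Python client (8.x API); the index name and mapping are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://127.0.0.1:9200")
METRIC_INDEX = "datachecks_metrics"

# Create an index mapping for the metric store.
if not es.indices.exists(index=METRIC_INDEX):
    es.indices.create(
        index=METRIC_INDEX,
        mappings={
            "properties": {
                "metric_name": {"type": "keyword"},
                "value": {"type": "double"},
                "timestamp": {"type": "date"},
            }
        },
    )

# Store a metric in the index.
es.index(
    index=METRIC_INDEX,
    document={
        "metric_name": "count_us_parts",
        "value": 125,
        "timestamp": "2023-09-24T10:00:00Z",
    },
)

# Read metrics back from the index.
hits = es.search(index=METRIC_INDEX, query={"term": {"metric_name": "count_us_parts"}})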

feat: Sum metric for Numeric Field

Tell us about the problem you're trying to solve

As a user, I want to generate the sum of a numeric column.

Describe the solution you’d like

feat: update CLI table output

Tell us about the problem you're trying to solve

After implementing the combined metric, we need to update the CLI output table.

Describe the solution you’d like

Updated structure for the CLI metric table

Metric Name   Data Source   Metric Type   Value
n1            d1            max           1
n2            d1            min           3
n3            d2            max           2
cm1                         combined      4

feat: Implement Geometric mean metrics

Tell us about the problem you're trying to solve

As a user, I want to calculate the nth root of the product of n data values for the data column.

Describe the solution you’d like

The geometric mean metric in data quality is a statistical measure that calculates the nth root of the product of n data values, often used to assess the central tendency of a dataset.
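
Most SQL engines have no geometric-mean aggregate, but for positive values it can be computed as EXP(AVG(LN(x))), since the log of the geometric mean is the arithmetic mean of the logs. A hedged sketch, with the helper and table names as assumptions:

def build_geometric_mean_query(table: str, field: str) -> str:
    # exp(mean(ln(x))) == (x1 * x2 * ... * xn) ** (1/n) for positive x.
    return (
        f"SELECT EXP(AVG(LN({field}))) AS geometric_mean "
        f"FROM {table} WHERE {field} > 0"
    )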

feat: Improve CLI Output

Tell us about the problem you're trying to solve

As a user, I want to see better-formatted output from CLI

Describe the solution you’d like

Metrics information will be shown as a table in CLI.

We will use the rich library to build and show the table in the command line.
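
A minimal sketch with the rich library, matching the column layout planned in the CLI-table issue above; the rows are illustrative.

from rich.console import Console
from rich.table import Table

table = Table(title="Metrics")
for column in ("Metric Name", "Data Source", "Metric Type", "Value"):
    table.add_column(column)
table.add_row("n1", "d1", "max", "1")
table.add_row("cm1", "", "combined", "4")

Console().print(table)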

feat: Implement Distinct Count metrics.

Tell us about the problem you're trying to solve

As a user, I want to generate a distinct count for a data column.

Describe the solution you’d like

A distinct count metric in data quality measures the number of unique values within a dataset, ensuring accuracy and completeness.

Example

Input:

E_ID First_Name Last_Name City
001 Harry Gomez NY
002 James Watson LA
003 Helen Parker NY
004 Christi William OH
005 Ellen Evans OH

Output:

distinct_count = 3

Explanation: We are calculating the distinct count of values in the "City" column. The expected output is 3 because there are three unique values in the dataset: NY, LA, and OH.

feat: HTML report framework

Tell us about the problem you're trying to solve

As a user, I want to generate an HTML report for all the metrics, so that it is easily shareable.

Describe the solution you’d like

  • CLI: while running the inspect command, an optional --report parameter will generate a single-page HTML report for all the metrics.
  • Programmatic: the Inspect class will have another method, generate_report, which will take a file name and generate a dashboard for all the metrics.

feat: API metrics

Tell us about the problem you're trying to solve

What are you trying to do?

Describe the solution you’d like

A clear and concise description of what you want to see happen.

Describe the alternative you’ve considered or used

Numeric Metric Average

As a user, I want to generate an average for the numeric field of a table.

Description

  • It is a type of field metric.
  • This metric will be calculated for both transactional databases and search engines.

feat: Implement Harmonic mean metrics

Tell us about the problem you're trying to solve

As a user, I want to generate the reciprocal of the average of the reciprocals of data values for the data column.

Describe the solution you’d like

The Harmonic mean metric in data quality is a statistical measure used to assess the quality of data by calculating the reciprocal of the average of the reciprocals of data values.

fix: error handling for a wrong data source name referenced in the resource field

Describe the bug

When the data source name referenced in the resource selector is wrong, the error should be captured and a better error message and code provided.

Expected behavior
The run exits with a proper error code and log message.
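
A minimal sketch of the expected check, run while resolving each metric's resource; the exception type is illustrative.

def resolve_data_source(resource: str, data_sources: dict):
    # The data source name is the first dotted segment of the resource.
    data_source_name = resource.split(".")[0]
    if data_source_name not in data_sources:
        raise ValueError(
            f"Unknown data source {data_source_name!r} referenced in "
            f"resource {resource!r}; available: {sorted(data_sources)}"
        )
    return data_sources[data_source_name]

With the reproduction configuration below, the resource search_datastore1.product_data_us would fail fast with a clear message instead of an unhandled error.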

To Reproduce
Add data sources in the configuration like below

data_sources:
  - name: search_datastore        # Data source name
    type: opensearch              # Data source type is OpenSearch
    connection:
      host: 127.0.0.1
      port: 9205
      username: !ENV ${OS_USER}   # Username to use for authentication ENV variables
      password: !ENV ${OS_PASS}

Add metrics like below

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore1.product_data_us

Screenshots
(screenshot omitted)


feat: support for BigQuery

Tell us about the problem you're trying to solve

New integration for Bigquery.

Describe the solution you’d like

We will use sqlalchemy-bigquery to make the connection with bigquery.

  • Add integration for BigQuery
  • Add docs for the new integration
  • Manually test all the metrics for BigQuery, as an integration test will not be available.
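
A hedged sketch of engine creation with sqlalchemy-bigquery; the project and dataset names are placeholders.

from sqlalchemy import create_engine

# sqlalchemy-bigquery registers the bigquery:// dialect with SQLAlchemy.
engine = create_engine("bigquery://<gcp-project>/<dataset>")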

feat: Implement Variance metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a variance metric.

Describe the solution you’d like

Variance in data quality measures the degree of variability or dispersion in a dataset, indicating how spread out the data points are from the mean.

fix: postgres datasource not able to connect to schema other than public

Describe the bug
The configured schema is not propagated to the Postgres connection. The current state is:

url = URL.create(
    drivername="postgresql",
    username=self.data_connection.get("username"),
    password=self.data_connection.get("password"),
    host=self.data_connection.get("host"),
    port=self.data_connection.get("port"),
    database=self.data_connection.get("database"),
)
engine = create_engine(url)

The schema name should be part of engine creation.

Expected behavior
The Postgres datasource should take the schema name from the configuration, and it should connect only to that schema.

  - name: search_staging_db       # Data source name
    type: postgres                # Data source type is Postgres
    connection:
      host: 127.0.0.1
      port: 5422
      username: !ENV ${DB2_USER}  # Username to use for authentication ENV variables
      password: !ENV ${DB2_PASS}  # Password to use for authentication ENV variables
      database: dc_db_2
      schema: custom_name

The code should accommodate the schema name

schema = self.data_connection.get("schema")
#....
engine = create_engine(
    url,
    connect_args={'options': f'-csearch_path={schema}'},
    isolation_level="AUTOCOMMIT"
)

[FEATURE] Postgres auto detecting table and column schema

Tell us about the problem you're trying to solve

While connecting to the Postgres database, the process should gather all tables and column schemas.
This will help to generate the auto-metrics for all the tables and columns.

Describe the solution you’d like

SQLAlchemy has an inspection API that returns all schemas, table names, and columns; we will use it (see the sketch under the automatic SQL datasource profiling issue above).
