waterdipai / datachecks

128 stars · 2 watchers · 18 forks · 4.56 MB

Open Source Data Quality Monitoring.

Home Page: https://datachecks.io

License: Apache License 2.0

Languages: Python 91.90% · TypeScript 6.64% · CSS 0.94% · JavaScript 0.38% · Makefile 0.08% · Dockerfile 0.07%
Topics: data-engineering, data-validation, dataops, dataquality, metrics, mlops, postgresql, python, data-governance, data-observability

datachecks's People

Contributors

anu-ra-g, datageek00, dependabot[bot], driptanil, fabriciodadosbr, niyasrad, pulak0717, ryuk-me, subhankarb, weryzebra-yue


datachecks's Issues

docs: add getting started page

Tell us about the documentation you'd like us to add or update
As a user, I want to have a quick start guide, so that I can run the project as quickly and easily as possible.

feat: implement combined metric

Tell us about the problem you're trying to solve

As a user, I want to generate combined metrics.

Describe the solution you’d like

A combined metric is a special type of metric that is generated by combining previously defined metrics. Below are examples of combined metrics:

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us
  - name: count_us_parts_valid
    metric_type: row_count
    resource: product_db.products
  - name: combined_metric_example
    metric_type: combined
    expression: sum(count_us_parts, count_us_parts_valid)
  - name: combined_metric_example
    metric_type: combined
    expression: div(sum(count_us_parts, count_us_parts_valid), count_us_parts_not_valid)
  - name: combined_metric_example_percentage
    metric_type: combined
    expression: mul(div(count_us_parts, count_us_parts_valid), 100)

Available Functions:

  • div()
  • sum()
  • mul()
  • sub()

Tasks:

  • Extract functions from the expression string.
  • Define a Combined metric class to calculate the metric
  • Configuration extraction for combined metrics
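
Below is a minimal sketch of the first two tasks (extracting functions from the expression string and evaluating them against already-computed metric values); the evaluate function, the FUNCTIONS table, and the metric_values lookup are illustrative, not the final implementation.

import re

FUNCTIONS = {
    "sum": lambda *args: sum(args),
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b,
}

def evaluate(expression: str, metric_values: dict) -> float:
    # Recursively evaluate a combined-metric expression string.
    expression = expression.strip()
    match = re.match(r"^(\w+)\((.*)\)$", expression)
    if match and match.group(1) in FUNCTIONS:
        name, inner = match.groups()
        # Split the top-level arguments, respecting nested parentheses.
        args, depth, start = [], 0, 0
        for i, ch in enumerate(inner):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                args.append(inner[start:i])
                start = i + 1
        args.append(inner[start:])
        return FUNCTIONS[name](*(evaluate(a, metric_values) for a in args))
    # Leaf node: a numeric literal or a reference to a defined metric.
    try:
        return float(expression)
    except ValueError:
        return metric_values[expression]

For example, evaluate("mul(div(count_us_parts, count_us_parts_valid), 100)", {"count_us_parts": 50, "count_us_parts_valid": 200}) returns 25.0.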

fix: duplicate metric name should throw error

Describe the bug
In the configuration file, there is no duplication check logic for the metric name

Expected behavior
All the metric names must be unique.

Screenshots
It should throw an error if the same name is used for another metric.
(screenshot: Screenshot 2023-09-24 at 10 31 54 AM)
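
A minimal sketch of the missing check, assuming the parsed configuration yields a list of metric dictionaries; the function name and exception type are illustrative.

def validate_unique_metric_names(metric_configs: list) -> None:
    # Raise on the first metric name that has already been seen.
    seen = set()
    for config in metric_configs:
        name = config["name"]
        if name in seen:
            raise ValueError(f"Duplicate metric name found: {name!r}")
        seen.add(name)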

feat: Implement Null Count metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a null count metric for a data column.

Describe the solution you’d like

Null count is a data quality metric that measures the number of null records in a dataset.

Example

Input

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
103 NULL Parker
104 Christi William
105 Ellen Evans

Output:
Null count for the First_Name field:
null_count = 1

Explanation: There is one null record in the table: the First_Name value for E_ID 103.
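
A hedged sketch of the SQL a datasource might issue for this metric; the helper and table names are assumptions.

def build_null_count_query(table: str, field: str) -> str:
    # Count the rows where the field is NULL.
    return f"SELECT COUNT(*) AS null_count FROM {table} WHERE {field} IS NULL"

For the example above, build_null_count_query("employees", "First_Name") counts the single NULL in First_Name.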

feat: Implement Duplicate Count metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a duplicate count metric for a data column.

Describe the solution you’d like

Duplicate count is a data quality metric that measures the number of identical or highly similar records in a dataset, highlighting potential data redundancy or errors.

Example

Input

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
101 Helen Parker
104 Christi William
105 Ellen Evans

Output:

duplicate_count = 1

Explanation: There is one duplicated value in the table: E_ID 101 appears twice.
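
A hedged sketch of a duplicate-count query for a SQL datasource; the helper and table names are assumptions. Each value that occurs more than once contributes one to the total, matching the example above.

def build_duplicate_count_query(table: str, field: str) -> str:
    # Count the values that appear more than once in the field.
    return (
        f"SELECT COUNT(*) AS duplicate_count FROM "
        f"(SELECT {field} FROM {table} GROUP BY {field} HAVING COUNT(*) > 1) AS dup"
    )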

fix(ui): frontend dependency issues

Description of what the bug is.

  1. npm install requires a forced install (-f).
  2. npm install emits many module-resolution errors.

Steps to reproduce the behavior:

cd ui
npm install -f

Logs and additional context

....


ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/CheckBoxOutlineBlank.js 9:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Checkbox/Checkbox.js 13:0-82 66:38-62
@ ../../../node_modules/@mui/material/Checkbox/index.js 3:0-37 3:0-37
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 43:0-46 1011:143-151 1696:84-92 1724:269-277
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/Close.js 11:0-48
Module not found: Error: Can't resolve 'react/jsx-runtime' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 21:0-52 161:140-149
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

ERROR in ../../../node_modules/@mui/material/internal/svg-icons/ErrorOutline.js 3:0-31
Module not found: Error: Can't resolve 'react' in '/Users/ayushmaster/node_modules/@mui/material/internal/svg-icons'
@ ../../../node_modules/@mui/material/Alert/Alert.js 19:0-66 123:27-43
@ ../../../node_modules/@mui/material/Alert/index.js 3:0-34 3:0-34
@ ../../../node_modules/material-react-table/dist/esm/material-react-table.esm.js 54:0-40 1130:142-147
@ ./src/pages/Metrics/Metrics.tsx 14:0-58 17:32-50
@ ./src/App.tsx 5:0-46 12:41-48
@ ./src/index.tsx 2:0-24 8:28-31

... (log continues)

feat: include multiple configuration files in an inspect run

Tell us about the problem you're trying to solve
Currently, while running inspect, datachecks reads a single configuration file. Often this file becomes very large. We want to split the configuration into several smaller files so that maintaining them becomes easier.

Describe the solution you’d like

While running the inspect command, the user can pass a directory containing all configuration files. Datachecks will read every file and aggregate all metric and data source configurations, as sketched below.
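
A minimal sketch of the aggregation, assuming the configuration files are YAML and that the metrics and data_sources sections can simply be concatenated; the function name is illustrative.

from pathlib import Path

import yaml

def load_configuration_directory(directory: str) -> dict:
    # Merge every *.yaml file in the directory into one configuration.
    merged = {"data_sources": [], "metrics": []}
    for path in sorted(Path(directory).glob("*.yaml")):
        config = yaml.safe_load(path.read_text()) or {}
        merged["data_sources"].extend(config.get("data_sources", []))
        merged["metrics"].extend(config.get("metrics", []))
    return merged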

fix: schema defaulting to None instead of "public" in postgres database integration

Describe the bug
When schema is not set in the data_sources section of config.yaml, it defaults to None. That None is passed straight to SQLAlchemy's create_engine function, which does not handle it properly; the resulting connection is broken and tables are not loaded.

Expected behavior
There should be a proper condition that handles a None schema instead of passing it directly to the create_engine function.
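
One possible fix, sketched: fall back to "public" when the schema key is missing, instead of passing None through to create_engine. This mirrors the connect_args approach shown in the schema-connection issue further down.

schema = self.data_connection.get("schema") or "public"
engine = create_engine(
    url,
    connect_args={"options": f"-csearch_path={schema}"},
)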

To Reproduce
Step 1: Set up the sample SQL given in https://docs.datachecks.io/getting_started/ in PostgreSQL
Step 2: Copy the config.yaml given in the documentation and customize it for your database name; mine follows.

data_sources:
  - name: postgres
    type: postgres
    connection:
      host: 127.0.0.1
      port: 5432
      username: postgres
      password: password
      database: postgres
metrics:
  - name: count_of_products
    metric_type: row_count
    resource: postgres.products
    validation:
      threshold: "> 0 & < 1000"
  - name: max_product_price_in_india
    metric_type: max
    resource: postgres.products.price
    filters:
      where: "country_code = 'IN'"
    validation:
      threshold: "< 190"

Step 3: Run the project using the command poetry run datachecks inspect -C config.yaml

Screenshots
When schema is set to None: (screenshots omitted)

When schema is made public: (screenshots omitted)


feat: Standard Deviation metric for Numeric Field

Tell us about the problem you're trying to solve

As a user, I want to generate the standard deviation of a numeric column.

Describe the solution you’d like

Tasks:

  • Add Standard Deviation metric
  • Add docs for the metric

feat: support for databricks

Tell us about the problem you're trying to solve

New integration for databricks.

Describe the solution you’d like
We will use databricks-sql-connector to make the connection with databricks.

Tasks:

  • Add integration for Databricks
  • Add docs for the new integration
  • Manually test all the metrics for Databricks, as an integration test will not be available.
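
A hedged connection sketch using databricks-sql-connector; the hostname, HTTP path, token, and table are placeholders.

from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
)
with connection.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM <catalog>.<schema>.<table>")
    print(cursor.fetchone())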

feat: Automatic Datasource profiling for Search Index Datasources

Tell us about the problem you're trying to solve

As a user, I want to generate profiling metrics for all the tables and columns of search index data sources, e.g. OpenSearch.

Describe the solution you’d like

When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.

Below are the steps to do it; a sketch follows the list.

  • Get all the tables for the data source.
  • Get all the columns for the table
  • Generate profiles for the numeric columns
  • Generate profiles for the text columns
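
A minimal sketch of the discovery steps using opensearch-py; the field-type buckets and profiling calls are placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "127.0.0.1", "port": 9205}])

# Steps 1-2: get every index and its field mappings.
mappings = client.indices.get_mapping(index="*")
for index_name, mapping in mappings.items():
    fields = mapping["mappings"].get("properties", {})
    for field_name, field_info in fields.items():
        field_type = field_info.get("type")
        # Steps 3-4: dispatch to numeric or text profilers by field type.
        if field_type in ("integer", "long", "float", "double"):
            pass  # generate the numeric profile (min, max, avg, ...)
        elif field_type in ("text", "keyword"):
            pass  # generate the text profile (distinct count, null count, ...)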

feat: OpenSearch auto-detecting indices and fields

Tell us about the problem you're trying to solve

As a user, I want to profile all the indices and fields for OpenSearch. To do that, we need to fetch all index and field information from OpenSearch.

feat: Table Segmentation

Tell us about the problem you're trying to solve

As a user, I want to generate metrics for different table segments.
For example, a table with one categorical column named C1 has values V1, V2, and V3. We want to get all the auto-metrics for each of the three value segments.

Describe the solution you’d like

We need to provide configuration to the inspect class to generate segments.

tables:
  table:
    segments:
      - name: n1
        where: c1 == v1

feat: custom sql metric

Tell us about the problem you're trying to solve
As a user, I want to write a custom SQL query to generate a metric.

Describe the solution you’d like

To be detailed out

feat: Implement Null percentage metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a null percentage metric for a data column.

Describe the solution you’d like

Null percentage metrics reveal missing data, a vital facet of completeness metrics, ensuring data sets are whole and reliable.

Example

Input:

E_ID First_Name Last_Name
101 Harry Gomez
102 James Watson
103 NULL Parker
104 Christi William
105 Ellen Evans

Output:

Null percentage for the First_Name field:
null_percentage = 20%
Explanation: There is one null record out of five in the table: the First_Name value for E_ID 103.
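
A hedged sketch of the corresponding query; multiplying by 100.0 keeps the division in floating point. The helper and table names are assumptions.

def build_null_percentage_query(table: str, field: str) -> str:
    # Percentage of rows where the field is NULL.
    return (
        f"SELECT SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END) * 100.0 "
        f"/ COUNT(*) AS null_percentage FROM {table}"
    )

For the example above, one NULL out of five rows yields 20.0.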

fix(datasource): include `qualified_table_name` to all table calls

Describe the bug

  • The change for including a function call to represent the qualified table name was initiated in a previous commit.
  • However, some of the datasource query functions have not been updated accordingly; this still needs to be implemented.

Logs and additional context

  • Sample implementation of this would be,
qualified_table_name = self.qualified_table_name(table)
query = f"SELECT COUNT(*) AS row_count FROM {qualified_table_name}"

Numeric Metric Min

As a user, I want to generate the numeric min metric.

Description

  • It is a type of field metric.
  • This metric will be calculated for both transactional databases and search engines.

feat: Implement Skew metrics

Tell us about the problem you're trying to solve

As a user, I want to identify data distribution imbalances of the numeric column.

Describe the solution you’d like

Skew metric in data quality measures the extent of asymmetry or distortion in the distribution of data values. It helps assess the balance and uniformity of data distribution.

feat: Implement Kurtosis mean metrics

Tell us about the problem you're trying to solve

As a user, I want to generate the kurtosis of a data column.

Describe the solution you’d like

Kurtosis is a data quality metric that measures the degree of peakedness or flatness of a dataset's probability distribution.

fix: in row_count filter option is not applied

Describe the bug
The filter clause is not applied for the row_count metric.

Expected behavior

  - name: in_search_staging_row_count
    metric_type: row_count
    resource: search_consumer_pgsql.country_product
    filters:
      where: "country_code = 'IND'"

This should filter results based on the where clause, but the filter is not being applied. A sketch of the expected query construction follows.
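
A minimal sketch of the expected behavior: the configured filter should be appended as a WHERE clause before the row-count query is executed. The function name is illustrative.

def build_row_count_query(table: str, where_filter=None) -> str:
    query = f"SELECT COUNT(*) FROM {table}"
    if where_filter:
        query += f" WHERE {where_filter}"
    return query

build_row_count_query("country_product", "country_code = 'IND'") should therefore produce SELECT COUNT(*) FROM country_product WHERE country_code = 'IND'.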

feat: Mysql Datasource Integration

Tell us about the problem you're trying to solve

New integration for Mysql.

Describe the solution you’d like

We will use MySQL-python to make the connection with MySQL.

Tasks:

  • Add integration for MySQL
  • Add docs for the new integration
  • Integration test for all the metrics for MySQL
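
A hedged connection sketch; MySQL-python exposes the MySQLdb module, and the keyword names (passwd, db) follow that library. The credentials shown are placeholders.

import MySQLdb

connection = MySQLdb.connect(
    host="127.0.0.1",
    port=3306,
    user="root",
    passwd="<password>",
    db="<database>",
)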

refactor: change input configuration from metric_type to type

Currently, the configuration key for the type of the metric is metric_type.

To make it uniform with the data_source name, we will rename it to type.

Current state:

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore.product_data_us

Change it to:

metrics:
  - name: count_us_parts
    type: document_count
    resource: search_datastore.product_data_us

fix: duplicate datasource name should throw error

Describe the bug
In the configuration file, there is no duplication check logic for data source names.

Expected behavior
All data source names must be unique.

Screenshots
(screenshot: Screenshot 2023-09-24 at 10 28 32 AM)

feat: Elasticsearch datasource integration

Tell us about the problem you're trying to solve

New integration for Elasticsearch.

Describe the solution you’d like

We will use the Elasticsearch Python client to make the connection with Elasticsearch.

Tasks:

  • Add integration for Elasticsearch
  • Add docs for the new integration
  • Integration test for all the metrics for Elasticsearch

[FEATURE] Automatic Datasource profiling for SQL Datasources

Tell us about the problem you're trying to solve
As a user, I want to generate profiling metrics for all the tables and columns for the data sources.

Describe the solution you’d like
When provided a configuration for a data source, profiling metrics will be generated for all the tables and columns.

Below are the steps to do it; a sketch follows the list.

  1. Get all the tables for the data source.
  2. Get all the columns for the table
  3. Generate profiles for the numeric columns
  4. Generate profiles for the text columns
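
A minimal sketch of the discovery loop using SQLAlchemy's inspection API; the connection URL and the profiling calls are placeholders.

from sqlalchemy import create_engine, inspect
from sqlalchemy.types import Integer, Numeric, String

engine = create_engine("postgresql://user:password@127.0.0.1:5432/postgres")
inspector = inspect(engine)

for table_name in inspector.get_table_names():            # step 1
    for column in inspector.get_columns(table_name):      # step 2
        column_type = column["type"]
        if isinstance(column_type, (Integer, Numeric)):
            pass  # step 3: numeric profile (min, max, avg, stddev, ...)
        elif isinstance(column_type, String):
            pass  # step 4: text profile (distinct count, null count, ...)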

feat: Store metrics in Database

Tell us about the problem you're trying to solve

As a user, I want to store the metrics in the database, so that we have historical data for metrics.

Describe the solution you’d like

We will use Elasticsearch for our metric store. The database repository will:

  • Store metrics in the index
  • Read metrics from the index
  • Create an index mapping for the metric store
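
A hedged sketch of the repository's three responsibilities, assuming the elasticsearch Python client (8.x API); the index name and mapping are illustrative.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://127.0.0.1:9200")
METRIC_INDEX = "datachecks_metrics"

# Create an index mapping for the metric store.
if not es.indices.exists(index=METRIC_INDEX):
    es.indices.create(
        index=METRIC_INDEX,
        mappings={
            "properties": {
                "metric_name": {"type": "keyword"},
                "value": {"type": "double"},
                "timestamp": {"type": "date"},
            }
        },
    )

# Store a metric in the index.
es.index(
    index=METRIC_INDEX,
    document={
        "metric_name": "count_us_parts",
        "value": 125,
        "timestamp": "2023-09-24T10:00:00Z",
    },
)

# Read metrics back from the index.
hits = es.search(index=METRIC_INDEX, query={"term": {"metric_name": "count_us_parts"}})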

feat: Sum metric for Numeric Field

Tell us about the problem you're trying to solve

As a user, I want to generate the sum of a numeric column.

Describe the solution you’d like

feat: update CLI table output

Tell us about the problem you're trying to solve

After implementing the combined metric, we need to update the CLI output table.

Describe the solution you’d like

Updated structure for the CLI metric table

Metric Name   Data Source   Metric Type   Value
n1            d1            max           1
n2            d1            min           3
n3            d2            max           2
cm1                         combined      4

feat: Implement Geometric mean metrics

Tell us about the problem you're trying to solve

As a user, I want to calculate the nth root of the product of n data values for the data column.

Describe the solution you’d like

The geometric mean metric in data quality is a statistical measure that calculates the nth root of the product of n data values, often used to assess the central tendency of a dataset.
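
Most SQL engines have no geometric-mean aggregate, but for positive values it can be computed as EXP(AVG(LN(x))), since the log of the geometric mean is the arithmetic mean of the logs. A hedged sketch, with the helper and table names as assumptions:

def build_geometric_mean_query(table: str, field: str) -> str:
    # exp(mean(ln(x))) == (x1 * x2 * ... * xn) ** (1/n) for positive x.
    return (
        f"SELECT EXP(AVG(LN({field}))) AS geometric_mean "
        f"FROM {table} WHERE {field} > 0"
    )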

feat: Improve CLI Output

Tell us about the problem you're trying to solve

As a user, I want to see better-formatted output from CLI

Describe the solution you’d like

Metrics information will be shown as a table in CLI.

We will use the rich library to build and show the table in the command line.
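
A minimal sketch with the rich library, matching the column layout planned in the CLI-table issue above; the rows are illustrative.

from rich.console import Console
from rich.table import Table

table = Table(title="Metrics")
for column in ("Metric Name", "Data Source", "Metric Type", "Value"):
    table.add_column(column)
table.add_row("n1", "d1", "max", "1")
table.add_row("cm1", "", "combined", "4")

Console().print(table)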

feat: Implement Distinct Count metrics.

Tell us about the problem you're trying to solve

As a user, I want to generate a distinct count for a data column.

Describe the solution you’d like

A distinct count metric in data quality measures the number of unique values within a dataset, ensuring accuracy and completeness.

Example

Input:

E_ID First_Name Last_Name City
001 Harry Gomez NY
002 James Watson LA
003 Helen Parker NY
004 Christi William OH
005 Ellen Evans OH

Output:

distinct_count = 3

Explanation: We are calculating the distinct count of values in the "City" column. The expected output is 3 because there are three unique values in the dataset: NY, LA, and OH.

feat: HTML report framework

Tell us about the problem you're trying to solve

As a user, I want to generate an HTML report for all the metrics, so that it is easily shareable.

Describe the solution you’d like

  • CLI: while running the inspect command, an optional --report parameter will generate a single-page HTML report for all the metrics.
  • Programmatic: the Inspect class will have another method, generate_report, which will take a file name and generate a dashboard for all the metrics.

feat: API metrics

Tell us about the problem you're trying to solve

What are you trying to do?

Describe the solution you’d like

A clear and concise description of what you want to see happen.

Describe the alternative you’ve considered or used

Numeric Metric Average

As a user, I want to generate an average for the numeric field of a table.

Description

  • It is a type of field metric.
  • This metric will be calculated for both transactional databases and search engines.

feat: Implement Harmonic mean metrics

Tell us about the problem you're trying to solve

As a user, I want to generate the reciprocal of the average of the reciprocals of data values for the data column.

Describe the solution you’d like

The Harmonic mean metric in data quality is a statistical measure used to assess the quality of data by calculating the reciprocal of the average of the reciprocals of data values.

fix: error handling for a wrong data source name referenced in the resource field

Describe the bug

When the data source name referenced in the resource selector is wrong, the error should be captured and a better error message and code provided.

Expected behavior
The run exits with a proper error code and log message.
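
A minimal sketch of the expected check, run while resolving each metric's resource; the exception type is illustrative.

def resolve_data_source(resource: str, data_sources: dict):
    # The data source name is the first dotted segment of the resource.
    data_source_name = resource.split(".")[0]
    if data_source_name not in data_sources:
        raise ValueError(
            f"Unknown data source {data_source_name!r} referenced in "
            f"resource {resource!r}; available: {sorted(data_sources)}"
        )
    return data_sources[data_source_name]

With the reproduction configuration below, the resource search_datastore1.product_data_us would fail fast with a clear message instead of an unhandled error.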

To Reproduce
Add data sources in the configuration like below

data_sources:
  - name: search_datastore        # Data source name
    type: opensearch              # Data source type is OpenSearch
    connection:
      host: 127.0.0.1
      port: 9205
      username: !ENV ${OS_USER}   # Username to use for authentication ENV variables
      password: !ENV ${OS_PASS}

Add metrics like below

metrics:
  - name: count_us_parts
    metric_type: document_count
    resource: search_datastore1.product_data_us

Screenshots
(screenshot omitted)


feat: support for BigQuery

Tell us about the problem you're trying to solve

New integration for Bigquery.

Describe the solution you’d like

We will use sqlalchemy-bigquery to make the connection with bigquery.

  • Add integration for BigQuery
  • Add docs for the new integration
  • Manually test all the metrics for BigQuery, as an integration test will not be available.
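
A hedged sketch of engine creation with sqlalchemy-bigquery; the project and dataset names are placeholders.

from sqlalchemy import create_engine

# sqlalchemy-bigquery registers the bigquery:// dialect with SQLAlchemy.
engine = create_engine("bigquery://<gcp-project>/<dataset>")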

feat: Implement Variance metrics

Tell us about the problem you're trying to solve

As a user, I want to generate a variance metric.

Describe the solution you’d like

Variance in data quality measures the degree of variability or dispersion in a dataset, indicating how spread out the data points are from the mean.

fix: postgres datasource not able to connect to schema other than public

Describe the bug
The configured schema is not propagated to the Postgres connection. The current state is:

url = URL.create(
    drivername="postgresql",
    username=self.data_connection.get("username"),
    password=self.data_connection.get("password"),
    host=self.data_connection.get("host"),
    port=self.data_connection.get("port"),
    database=self.data_connection.get("database"),
)
engine = create_engine(url)

The schema name should be part of engine creation.

Expected behavior
The Postgres datasource should take the schema name from the configuration, and it should connect only to that schema.

  - name: search_staging_db       # Data source name
    type: postgres                # Data source type is Postgres
    connection:
      host: 127.0.0.1
      port: 5422
      username: !ENV ${DB2_USER}  # Username to use for authentication ENV variables
      password: !ENV ${DB2_PASS}  # Password to use for authentication ENV variables
      database: dc_db_2
      schema: custom_name

The code should accommodate the schema name

schema = self.data_connection.get("schema")
#....
engine = create_engine(
    url,
    connect_args={'options': f'-csearch_path={schema}'},
    isolation_level="AUTOCOMMIT"
)

[FEATURE] Postgres auto detecting table and column schema

Tell us about the problem you're trying to solve

While connecting to the Postgres database, the process should gather all tables and column schemas.
This will help to generate the auto-metrics for all the tables and columns.

Describe the solution you’d like

SQLAlchemy has an inspection API that returns all schemas, table names, and columns; we will use it (see the sketch under the automatic SQL datasource profiling issue above).
