
re-data / dbt-re-data

re_data - fix data issues before your users & CEO discover them 😊

Home Page: https://getre.io

License: Other

Python 87.58% Shell 1.56% Makefile 10.85%
data-quality dbt data-monitoring data-observability data-testing dbt-packages sql

dbt-re-data's Introduction

What is re_data?

re_data is an open-source data reliability framework for the modern data stack. 😊

Currently, re_data focuses on observing a dbt project together with its underlying data warehouse (Postgres, BigQuery, Snowflake, or Redshift).

Data transformations in re_data are implemented and exposed as models & macros in this dbt package.

Live demo

Check out our live demo of what re_data can do for you 😊

Getting started

Check our docs! 🙂

Join the re_data community on Slack (we are very responsive there)

Check out more info, issues, etc. in the main re-data repo

Community

Say hi to us! 🙂

Contributing

Any contributions are greatly appreciated! Most of our documentation and GitHub issues are managed in the primary re-data repo. See the Contributing section in re-data for details.

dbt-re-data's People

Contributors

akshaykarle, bachng2017, davidzajac1, dejii, enriquecastellano, famazak, maciejklimek, mateuszklimek, mrprigun, rafaelgaleazzo-bicyclehealth, redpin-pankaj, samgans, sergey-vdovin, suelai, yu-iskw, z3z1ma, zendesk-sova


dbt-re-data's Issues

[FEATURE] Make re_data work in case of no creation time mark in table columns

Tell us about the problem you're trying to solve

Currently, for re_data to work, a table needs a column that records when each row was added. That is not always possible (although having such a column is very likely good practice). Right now a table without it is simply skipped and no metrics are computed for it. It would be nice to compute whatever is possible in this case.

Describe the solution you’d like

If the table has an incremental index, that column can be used for filtering instead of creation_time. This requires saving, on each re_data run, the last index seen together with the current timestamp.

If there is no incremental index or column, many stats (last_day and so on) cannot be computed, but it is still possible to detect schema changes and compute custom metrics. re_data most likely should enable that.

The idea is not fully worked out, and no decision has been made on whether it is worth implementing.
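The incremental-index idea above can be sketched in plain Python. This is an illustration only, not re_data code: `Watermark`, `rows_since`, and the `id` column are hypothetical names, and rows are modeled as dictionaries rather than warehouse records.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List, Mapping, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Watermark:
    """Hypothetical state saved after each run: last index seen and when."""
    last_index: int
    recorded_at: datetime


def rows_since(
    rows: Sequence[Mapping[str, Any]],
    index_column: str,
    previous: Optional[Watermark],
    now: Optional[datetime] = None,
) -> Tuple[List[Mapping[str, Any]], Optional[Watermark]]:
    """Return rows added since the previous run and the watermark to save."""
    now = now or datetime.now(timezone.utc)
    floor = previous.last_index if previous else None
    # Without a previous watermark every row counts as new.
    new_rows = [row for row in rows if floor is None or row[index_column] > floor]
    if not new_rows:
        # Nothing new: keep the old watermark unchanged.
        return new_rows, previous
    newest = max(row[index_column] for row in new_rows)
    return new_rows, Watermark(last_index=newest, recorded_at=now)
```

Each run would pass in the watermark saved by the previous run and persist the one returned, so the stored timestamp approximates when rows up to that index first appeared.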

[FEATURE] Add null percentage metric

Tell us about the problem you're trying to solve

Compute a metric holding the percentage of nulls (optionally nulls and empty values) in the daily data of a given column in the table.

Describe the solution you’d like

  • Add a new metric variable count_nulls_pr and add an expression for it in column_metric_expressions.
  • Add the new metric to the list of supported metrics.
  • Add a new model for this metric.
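As a rough illustration of what the proposed metric would compute, here is a minimal Python sketch. The function name `nulls_percent` and the `count_empty` flag are hypothetical; the real metric would be a SQL expression in the package.

```python
from typing import Optional, Sequence


def nulls_percent(values: Sequence[Optional[str]], count_empty: bool = False) -> Optional[float]:
    """Percentage of null values, optionally counting empty strings too."""
    total = len(values)
    if total == 0:
        # No rows in the window: the percentage is undefined.
        return None
    matching = sum(
        1 for value in values
        if value is None or (count_empty and value == "")
    )
    return 100.0 * matching / total
```

Returning `None` for an empty window avoids a division by zero and mirrors how a SQL aggregate over no rows would typically yield null.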

[BUG] BigQuery syntax error when running "dbt run --models package:re_data"

Describe the bug
I run this:
dbt run --models package:re_data --vars '{ "re_data:time_window_start": "2021-01-01 00:00:00", "re_data:time_window_end": "2021-01-02 00:00:00" }'
using a BigQuery DB.
I get the following error message:
16:01:26 | 3 of 18 ERROR creating incremental model dbt_chansen_re.re_data_base_metrics [ERROR in 31.59s]
Database Error in model re_data_base_metrics (models/intermediate/re_data_base_metrics.sql)
  Syntax error: Expected ")" but got ":" at [330:21]
  compiled SQL at target/run/re_data/models/intermediate/re_data_base_metrics.sql

Lines 315-345 of the compiled SQL are pasted below (code highlighting does not work well on it because of the "`" characters in the SQL):

as computed_on
union all
select
'dbt_chansen.active_periods' as table_name,
'user_id' as column_name,
'min' as metric,
null::integer as value,
cast('2021-01-02 00:00:00' as timestamp)
as time_window_start,
cast('2021-01-03 00:00:00' as timestamp)
as time_window_end,
current_timestamp
as computed_on union all # <---- this is line 330
select
'dbt_chansen.active_periods' as table_name,
'user_id' as column_name,
'max' as metric,
null::integer as value,
cast('2021-01-02 00:00:00' as timestamp)
as time_window_start,
cast('2021-01-03 00:00:00' as timestamp)
as time_window_end,
current_timestamp
as computed_on union all

To Reproduce
I am unable to provide this due to proprietary data.
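One plausible reading of this error, not confirmed in the report, is that the compiled SQL uses `null::integer`. The `::` cast shorthand is Postgres-style syntax that BigQuery's parser does not accept, which would explain the unexpected ":". A minimal sketch of rendering the typed null per warehouse follows; `typed_null` and `INTEGER_TYPE` are hypothetical names, and the real fix would live in the package's Jinja macros.

```python
# Hypothetical mapping from adapter name to its integer type name.
# Adapters not listed fall back to the ANSI-style "integer".
INTEGER_TYPE = {"bigquery": "int64"}


def typed_null(adapter: str) -> str:
    """Render a typed NULL with the portable cast(... as ...) form."""
    type_name = INTEGER_TYPE.get(adapter, "integer")
    return f"cast(null as {type_name})"
```

The explicit `cast(null as ...)` form is accepted by Postgres as well, so it could replace the `::` shorthand without a per-warehouse branch for the cast syntax itself.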

Loosen pin on dbt_utils dependency

dbt_utils is currently at version 0.9.2, but re_data pins it as follows:

packages:
  - package: dbt-labs/dbt_utils
    version: [">=0.7.0", "<0.9.0"]

This makes it tricky to combine re_data with other packages.

Can these requirements be loosened? The pin above resolves to 0.8.6.
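To make the conflict concrete, the sketch below checks a version against a list of range specifiers like the one in the pin. It is a simplified stand-in, not dbt's resolver: `satisfies` and `parse_version` are hypothetical helpers that handle plain dotted numeric versions only, with no pre-release tags.

```python
import operator
import re
from typing import Sequence, Tuple

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_SPEC = re.compile(r"^(>=|<=|==|>|<)(\d+(?:\.\d+)*)$")


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '0.9.2' into (0, 9, 2) so tuples compare component-wise."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, specifiers: Sequence[str]) -> bool:
    """True when the version meets every specifier in the list."""
    candidate = parse_version(version)
    for spec in specifiers:
        match = _SPEC.match(spec.strip())
        if match is None:
            raise ValueError(f"unsupported specifier: {spec!r}")
        op, bound = match.groups()
        if not _OPS[op](candidate, parse_version(bound)):
            return False
    return True
```

With the pin `[">=0.7.0", "<0.9.0"]`, version 0.8.6 passes and 0.9.2 fails, which is why a project that needs dbt_utils 0.9.x cannot install alongside this range.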

[BUG] save_test_history macro doesn't store compiled_sql after dbt-core>=v1.3.0

Describe the bug
The save_test_history macro stores compiled_sql as null when used with dbt-core>=v1.3.0. This is because the compiled_sql field on the node was renamed to compiled_code in dbt-core v1.3.0 (the release that introduced Python models).

Expected behavior
The save_test_history macro should store compiled_sql irrespective of dbt version.
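A version-independent lookup could try the new field name first and fall back to the old one. The sketch below shows the idea in plain Python with a dictionary standing in for the dbt node; the actual macro is written in Jinja, and `compiled_text` is a hypothetical helper name.

```python
from typing import Any, Mapping, Optional


def compiled_text(node: Mapping[str, Any]) -> Optional[str]:
    """Return the compiled query under either the new or the old field name."""
    # compiled_code is the name used from dbt-core v1.3.0 on;
    # compiled_sql is the name used by earlier versions.
    for key in ("compiled_code", "compiled_sql"):
        value = node.get(key)
        if value:
            return value
    return None
```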

To Reproduce

  • Install the re_data dbt package (version [">=0.10.0", "<0.11.0"]) on a dbt project with dbt-core>=v1.3.0
  • Run tests on models, sources or seeds (with re_data_monitored: true config)
  • Check re_data_test_history table in your target database, or look at Compiled SQL tab on your re_data overview site. You will see no compiled sql available.


[FEATURE] Add ability to implement support for different DB (Spark/Presto/etc.)

Tell us about the problem you're trying to solve

Adding support for Spark or Presto is currently hard because there is no single place where DB-specific logic is implemented. A clear directory split between databases would be really good to have.

Describe the solution you’d like

Move DB-specific logic in macros to separate directories, and define a standard set of macros that each DB needs to implement to work properly.
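The proposed structure amounts to a per-database registry of implementations for a fixed set of macros. Here is a minimal Python sketch of that pattern. All names (`implements`, `dispatch`, `missing_macros`, and the `epoch_seconds` macro) are hypothetical, and in a dbt package this would be expressed with Jinja macros rather than Python.

```python
from typing import Callable, Dict, List, Tuple

# (macro name, adapter name) -> function rendering a SQL fragment.
_REGISTRY: Dict[Tuple[str, str], Callable[..., str]] = {}


def implements(macro: str, adapter: str) -> Callable:
    """Register a function as one adapter's implementation of a macro."""
    def register(func: Callable[..., str]) -> Callable[..., str]:
        _REGISTRY[(macro, adapter)] = func
        return func
    return register


def dispatch(macro: str, adapter: str, *args: str) -> str:
    """Call the adapter's implementation, failing loudly if it is absent."""
    try:
        func = _REGISTRY[(macro, adapter)]
    except KeyError:
        raise NotImplementedError(f"{macro} is not implemented for {adapter}") from None
    return func(*args)


def missing_macros(adapter: str) -> List[str]:
    """List macros that some adapter implements but this one does not."""
    required = {macro for macro, _ in _REGISTRY}
    return sorted(m for m in required if (m, adapter) not in _REGISTRY)


@implements("epoch_seconds", "postgres")
def _postgres_epoch_seconds(column: str) -> str:
    return f"extract(epoch from {column})"


@implements("epoch_seconds", "bigquery")
def _bigquery_epoch_seconds(column: str) -> str:
    return f"unix_seconds({column})"
```

A helper like `missing_macros("spark")` would give a contributor the checklist of what a new database still has to implement.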

[FEATURE] Customized empty values

Tell us about the problem you're trying to solve

Currently a value is recognized as empty only when a text field holds the empty string "". Sometimes it would be useful to also treat "N/A", "no value", and other custom markers as empty.

Describe the solution you’d like

This could be added as an env variable for the dbt project, defaulting to the empty string.
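A minimal sketch of the configurable check, assuming the set of empty markers arrives as a plain collection of strings. The names `is_empty` and `DEFAULT_EMPTY_VALUES` are hypothetical, and nulls are deliberately left to a separate null metric.

```python
from typing import Collection, Optional

# Default mirrors the current behavior: only the empty string is empty.
DEFAULT_EMPTY_VALUES = ("",)


def is_empty(value: Optional[str], empty_values: Collection[str] = DEFAULT_EMPTY_VALUES) -> bool:
    """True when the value matches one of the configured empty markers."""
    if value is None:
        # Nulls are not counted here; they belong to null metrics.
        return False
    return value in empty_values
```

For example, passing `{"", "N/A", "no value"}` as `empty_values` would make all three markers count as empty while leaving the default behavior unchanged for everyone else.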

[FEATURE] Detecting schema changes

Tell us about the problem you're trying to solve

If any schema changes happen in monitored tables, I would like to know about them.

Describe the solution you’d like

Add a table containing all schema changes that have happened, and possibly also an alert when one occurs.
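Detecting changes comes down to comparing two snapshots of a table's columns. The sketch below assumes each snapshot is a mapping from column name to type name. `schema_changes` and the change labels are hypothetical, not the package's actual model.

```python
from typing import List, Mapping, Optional, Tuple

# (change kind, column name, previous type, current type)
Change = Tuple[str, str, Optional[str], Optional[str]]


def schema_changes(previous: Mapping[str, str], current: Mapping[str, str]) -> List[Change]:
    """Compare two column snapshots and list what changed between them."""
    changes: List[Change] = []
    for column in sorted(previous.keys() - current.keys()):
        changes.append(("column_removed", column, previous[column], None))
    for column in sorted(current.keys() - previous.keys()):
        changes.append(("column_added", column, None, current[column]))
    for column in sorted(previous.keys() & current.keys()):
        if previous[column] != current[column]:
            changes.append(("type_changed", column, previous[column], current[column]))
    return changes
```

Each returned tuple could be written as one row of the proposed schema-changes table, and a non-empty result could trigger the alert.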

[BUG] Unable to install re_data for Greenplum database

Describe the bug
There is an error when running the command dbt run --models package:re_data

Expected behavior
re_data models installed for Greenplum database.

To Reproduce
Steps to reproduce the behavior:

  • Configure dbt project for Greenplum database
  • Run command dbt run --models package:re_data


Logs and additional context

...
14:27:46 Completed with 1 error and 0 warnings:
14:27:46  
14:27:46  Database Error in model re_data_last_stats (models/metrics/for_anomalies/re_data_last_stats.sql)
14:27:46    column "re_data_base_metrics.value" must appear in the GROUP BY clause or be used in an aggregate function
14:27:46    LINE 12:         avg(value)  over(partition by table_name, column_nam...
14:27:46                         ^
14:27:46    compiled SQL at target/run/re_data/models/metrics/for_anomalies/re_data_last_stats.sql
...

There are four places in the models/metrics/for_anomalies/re_data_last_stats.sql file that are hardcoded for the PostgreSQL database.

[image]

Greenplum is based on PostgreSQL 9.4 and has the same syntax, so the code should be changed to something like this:

[image]

[BUG] Not detecting removed table

Describe the bug

When a table is removed, re_data doesn't notice it and breaks on the next run.
The table should be removed from re_data_tables (and information about the removed table should be recorded).
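One way to picture the expected behavior is to reconcile the monitored list against what the warehouse currently reports before computing anything. A minimal sketch, with `reconcile_tables` as a hypothetical helper and table names as plain strings:

```python
from typing import Iterable, List, Sequence, Tuple


def reconcile_tables(monitored: Sequence[str], existing: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split monitored tables into those still present and those removed."""
    present = set(existing)
    kept = [table for table in monitored if table in present]
    removed = [table for table in monitored if table not in present]
    return kept, removed
```

The `kept` list would drive the run, while `removed` could be logged so the disappearance of a table is visible instead of causing a failure.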

[FEATURE] Distinct values count metric

Tell us about the problem you're trying to solve

Add a distinct values count metric to re_data.

Describe the solution you’d like

The number of distinct values, computed for all text columns in the DB.

Describe the alternative you’ve considered or used

If it's too expensive to compute that for every column, we may consider doing this only when a specific check is added to the DB. This may be needed for big DBs.
