re-data / dbt-re-data


97.0 4.0 39.0 4.22 MB

re_data - fix data issues before your users & CEO discover them 😊

Home Page: https://getre.io

License: Other

Python 87.58% Shell 1.56% Makefile 10.85%

data-quality dbt data-monitoring data-observability data-testing dbt-packages sql

dbt-re-data's Introduction

What is re_data?

re_data is an open-source data reliability framework for the modern data stack. 😊

Currently, re_data focuses on observing dbt projects, together with the underlying data warehouse (Postgres, BigQuery, Snowflake, or Redshift).

Data transformations in re_data are implemented and exposed as models & macros in this dbt package.

Live demo

Check out our live demo of what re_data can do for you 😊

Getting started

Check our docs! 🙂

Join the re_data community on Slack (we are very responsive there)

Find more info, issues, etc. in the master repo

Community

Say hi to us! 🙂

Contributing

Any contributions are greatly appreciated! Most of our documentation and GitHub issues are managed in the primary re-data repo. See the Contributing section in re-data for details.

dbt-re-data's People

mateuszklimek maxenceroux dejii samgans sergey-vdovin stjordanis katsugeneration codep-ai donnyzhao z3z1ma gilbertabakahadjei athangachamy engr-krooozy rwatts3 sovalinux liepieshov ugemir hhhphp123 skwaugh mahdiqb shakahl maciejklimek mrprigun akshaykarle kevin811103 ahmedrad rafaelgaleazzo-bicyclehealth smook1980 enriquecastellano gthesheep bachng2017 redpin-pankaj eamontaaffe suelai famazak datateer stevewithington

dbt-re-data's Issues

[FEATURE] Make re_data work in case of no creation time mark in table columns

Tell us about the problem you're trying to solve

Currently, re_data only works on tables that have a column recording when each record was added. That is not always possible (although having such a column is most likely very good practice). Today such a table is simply skipped and no metrics are computed for it; it would be nice to compute whatever is possible in this case.

Describe the solution you’d like

If the table has an incremental index, that column can be used for filtering instead of creation_time. This requires saving, on each re_data run, the last created index together with the current timestamp.

If there is no incremental index or column, many stats (last_day etc.) cannot be computed, but it is still possible to detect schema changes and compute custom metrics. re_data should most likely enable that.

There is no clear idea or decision yet on whether this is worth implementing.
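
The incremental-index idea above could look roughly like the sketch below. It is only an illustration and not re_data code: it uses Python's built-in sqlite3 as a stand-in warehouse, and the table names (events, run_watermarks) and the record_run function are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table events (id integer primary key, payload text);
    -- hypothetical bookkeeping table: one row per monitoring run
    create table run_watermarks (last_id integer, recorded_at text);
""")

def record_run(conn):
    """Return rows added since the previous run, then store the new watermark."""
    prev = conn.execute(
        "select coalesce(max(last_id), 0) from run_watermarks"
    ).fetchone()[0]
    new_rows = conn.execute(
        "select id, payload from events where id > ? order by id", (prev,)
    ).fetchall()
    current_max = conn.execute("select coalesce(max(id), 0) from events").fetchone()[0]
    # Save the last seen index together with the current timestamp.
    conn.execute(
        "insert into run_watermarks values (?, ?)",
        (current_max, datetime.now(timezone.utc).isoformat()),
    )
    return new_rows

conn.executemany("insert into events (payload) values (?)", [("a",), ("b",)])
first = record_run(conn)    # rows with id 1 and 2
conn.execute("insert into events (payload) values ('c')")
second = record_run(conn)   # only the row with id 3
```

Each run then sees only the rows whose index is above the previously stored watermark, which plays the role that the creation_time filter plays today.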

[FEATURE] Add null percentage metric

Tell us about the problem you're trying to solve

Compute a metric with the percentage of nulls (optionally nulls and empty values) in the daily data of a given column in the table.

Describe the solution you’d like

Add a new metric variable count_nulls_pr and add an expression for it in column_metric_expressions.
Add this new metric to the list of supported metrics.
Add a new model for this metric.
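
As a rough illustration of the expression such a metric would need, here is a minimal sketch using Python's built-in sqlite3. The users table and the nulls_percent helper are made up for the example, and the real metric would live in a dbt macro rather than in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (name text)")
conn.executemany("insert into users values (?)", [("ann",), (None,), ("",), ("bob",)])

def nulls_percent(conn, table, column, include_empty=False):
    """Percentage of NULL (optionally NULL or empty-string) values in a column.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    condition = f"{column} is null"
    if include_empty:
        condition += f" or {column} = ''"
    query = (
        f"select 100.0 * sum(case when {condition} then 1 else 0 end) / count(*) "
        f"from {table}"
    )
    return conn.execute(query).fetchone()[0]

only_nulls = nulls_percent(conn, "users", "name")
nulls_and_empty = nulls_percent(conn, "users", "name", include_empty=True)
```

With one NULL and one empty string among four rows, only_nulls is 25.0 and nulls_and_empty is 50.0.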

We manage issues in the master repo of this project: https://github.com/re-data/re-data/issues

[BUG] lack of namespace in fivetran_utils macros

Describe the bug
The percentile and json_extract macros lack a namespace, so they cannot be extended (for example by dbt_re_data_trino).

Expected behavior
The namespace is evaluated correctly in dbt_project.yml.

To Reproduce
Current code on GitHub:
https://github.com/re-data/dbt-re-data/blob/main/macros/utils/fivetran_utils/percentile.sql#L8
https://github.com/re-data/dbt-re-data/blob/main/macros/utils/fivetran_utils/json_extract.sql#L8

[BUG] BigQuery syntax error when running "dbt run --models package:re_data"

Describe the bug
I run this:
dbt run --models package:re_data --vars \
'{ "re_data:time_window_start": "2021-01-01 00:00:00", "re_data:time_window_end": "2021-01-02 00:00:00" }'
using a BigQuery DB.
I get the following error message:
16:01:26 | 3 of 18 ERROR creating incremental model dbt_chansen_re.re_data_base_metrics [ERROR in 31.59s]
Database Error in model re_data_base_metrics (models/intermediate/re_data_base_metrics.sql)
  Syntax error: Expected ")" but got ":" at [330:21]
  compiled SQL at target/run/re_data/models/intermediate/re_data_base_metrics.sql

Lines 315-345 of the compiled SQL are pasted below (code highlighting does not work well on it because of the "`" in the SQL):

as computed_on
union all
select
'dbt_chansen.active_periods' as table_name,
'user_id' as column_name,
'min' as metric,
null::integer as value,
cast('2021-01-02 00:00:00' as timestamp)
as time_window_start,
cast('2021-01-03 00:00:00' as timestamp)
as time_window_end,
current_timestamp
as computed_on union all # <---- this is line 330
select
'dbt_chansen.active_periods' as table_name,
'user_id' as column_name,
'max' as metric,
null::integer as value,
cast('2021-01-02 00:00:00' as timestamp)
as time_window_start,
cast('2021-01-03 00:00:00' as timestamp)
as time_window_end,
current_timestamp
as computed_on union all

To Reproduce
I am unable to provide this due to proprietary data.
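
One plausible cause, judging only from the pasted SQL, is the null::integer cast: the :: operator is PostgreSQL syntax, while BigQuery expects cast(... as ...). The sketch below does not run BigQuery; it uses Python's built-in sqlite3 purely as a stand-in for an engine that lacks ::, to show that the cast(...) form is the more portable spelling. The runs helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def runs(sql):
    """Return True if the engine accepts the statement, False on a SQL error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False

postgres_style = "select null::integer as value"          # PostgreSQL-only cast
portable_style = "select cast(null as integer) as value"  # standard SQL cast

postgres_ok = runs(postgres_style)
portable_ok = runs(portable_style)
```

Here postgres_ok ends up False and portable_ok ends up True, which mirrors the kind of syntax error reported above.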

Loosen pin on dbt_utils dependency

dbt_utils is now at version 0.9.2; however, re_data pins it as follows:

packages:
  - package: dbt-labs/dbt_utils
    version: [">=0.7.0", "<0.9.0"]

This makes it tricky to combine re_data with other packages.

Can these requirements be loosened? With the pin above, 0.8.6 gets installed.
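
To make the conflict concrete, here is a simplified sketch of how the pinned range excludes 0.9.2. It is not dbt's actual resolver (which also handles pre-releases and other cases); the parse and satisfies helpers are hypothetical and only cover plain numeric versions.

```python
import re

def parse(version):
    """Turn '0.9.2' into a comparable tuple (0, 9, 2)."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version, specs):
    """Check a version against specifiers such as '>=0.7.0' or '<0.9.0'."""
    ops = {
        ">=": lambda a, b: a >= b,
        "<=": lambda a, b: a <= b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
        "==": lambda a, b: a == b,
    }
    for spec in specs:
        op, bound = re.fullmatch(r"(>=|<=|==|>|<)\s*([\d.]+)", spec).groups()
        if not ops[op](parse(version), parse(bound)):
            return False
    return True

pin = [">=0.7.0", "<0.9.0"]
latest_allowed = satisfies("0.9.2", pin)    # outside the pinned range
older_allowed = satisfies("0.8.6", pin)     # inside the pinned range
```

Any other package that requires dbt_utils 0.9.x therefore cannot be installed alongside this pin.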

[BUG] save_test_history macro doesn't store compiled_sql after dbt-core>=v1.3.0

Describe the bug
The save_test_history macro stores compiled_sql as null when used with dbt-core>=v1.3.0. This is because the compiled_sql field in node was renamed to compiled_code in dbt-core v1.3.0 (i.e. with the release of Python models).

Expected behavior
The save_test_history macro should store compiled_sql irrespective of dbt version.

To Reproduce

Install the re_data dbt package (version [">=0.10.0", "<0.11.0"]) in a dbt project with dbt-core>=v1.3.0.
Run tests on models, sources or seeds (with the re_data_monitored: true config).
Check the re_data_test_history table in your target database, or look at the Compiled SQL tab on your re_data overview site. No compiled SQL will be available.
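
A version-independent lookup could fall back from the new field name to the old one. The sketch below shows the idea on plain Python dicts standing in for dbt nodes; the compiled_text helper is hypothetical, and the actual fix would be written in the macro's Jinja.

```python
def compiled_text(node):
    """Return the compiled SQL from a node dict, whichever key it uses."""
    # dbt-core >= 1.3.0 uses "compiled_code"; earlier versions use "compiled_sql".
    return node.get("compiled_code") or node.get("compiled_sql")

old_node = {"compiled_sql": "select 1"}    # shape before dbt-core 1.3.0
new_node = {"compiled_code": "select 1"}   # shape from dbt-core 1.3.0 on

old_result = compiled_text(old_node)
new_result = compiled_text(new_node)
```

Both lookups return "select 1", so the stored value no longer depends on which dbt version produced the node.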

Screenshots

Logs and additional context

[FEATURE] Add ability to implement support for different DB (Spark/Presto/etc.)

Tell us about the problem you're trying to solve

Adding support for Spark or Presto would currently be hard, because there is no single place where DB-specific logic is implemented. A clear directory split between databases would be really good to have.

Describe the solution you’d like

Move DB-specific logic in macros to separate directories, and define a standard set of macros that each DB needs to implement to work properly.
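
The "standard macros with per-DB implementations" idea can be sketched as a small registry with a default fallback, similar in spirit to dbt's adapter.dispatch. Everything named here (IMPLEMENTATIONS, implements, dispatch, typed_null) is hypothetical and only illustrates the structure.

```python
IMPLEMENTATIONS = {}

def implements(macro, adapter="default"):
    """Register a function as the implementation of a macro for one adapter."""
    def register(func):
        IMPLEMENTATIONS[(macro, adapter)] = func
        return func
    return register

def dispatch(macro, adapter):
    """Pick the adapter-specific implementation, falling back to the default."""
    return IMPLEMENTATIONS.get((macro, adapter), IMPLEMENTATIONS[(macro, "default")])

@implements("typed_null")
def default_typed_null(sql_type):
    return f"cast(null as {sql_type})"

@implements("typed_null", adapter="postgres")
def postgres_typed_null(sql_type):
    return f"null::{sql_type}"

postgres_sql = dispatch("typed_null", "postgres")("integer")
spark_sql = dispatch("typed_null", "spark")("integer")   # no override: default is used
```

A new database then only needs to register overrides for the macros whose default does not work for it.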

[FEATURE] Customized empty values

Tell us about the problem you're trying to solve

Currently, a value is recognized as empty only when a text field contains the empty string: "". Sometimes it would be useful to also treat "N/A", "no value", and other custom options as empty.

Describe the solution you’d like

This could be added as an env variable for the dbt project, which defaults to the empty string.
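
A configurable list of empty values might behave like the sketch below, again using sqlite3 as a stand-in. The contacts table, the count_empty helper, and its empty_values parameter are hypothetical; the default mirrors today's behaviour of treating only "" as empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table contacts (phone text)")
conn.executemany(
    "insert into contacts values (?)",
    [("555-0100",), ("",), ("N/A",), ("no value",), (None,)],
)

def count_empty(conn, table, column, empty_values=("",)):
    """Count rows whose value is one of the configured 'empty' markers."""
    placeholders = ", ".join("?" for _ in empty_values)
    # Identifiers are interpolated directly, so pass trusted names only.
    query = f"select count(*) from {table} where {column} in ({placeholders})"
    return conn.execute(query, tuple(empty_values)).fetchone()[0]

default_count = count_empty(conn, "contacts", "phone")
custom_count = count_empty(conn, "contacts", "phone", ("", "N/A", "no value"))
```

With the default setting only the empty string is counted (1 row); with the custom list the count rises to 3. NULLs are not counted in either case.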

[FEATURE] Detecting schema changes

Tell us about the problem you're trying to solve

If any schema changes happen in monitored tables, I would like to know about them.

Describe the solution you’d like

Add a table containing all schema changes that have happened, and possibly also an alert when one occurs.
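
One way to produce the rows for such a table is to diff two snapshots of a table's columns. This is a minimal sketch under that assumption; the schema_changes function, the change labels, and the snapshot format ({column: type}) are all hypothetical.

```python
def schema_changes(previous, current):
    """Compare two {column: type} snapshots and list the differences."""
    changes = []
    for column in sorted(previous.keys() | current.keys()):
        before, after = previous.get(column), current.get(column)
        if before is None:
            changes.append(("column_added", column, None, after))
        elif after is None:
            changes.append(("column_removed", column, before, None))
        elif before != after:
            changes.append(("type_changed", column, before, after))
    return changes

yesterday = {"id": "integer", "email": "text", "age": "text"}
today = {"id": "integer", "age": "integer", "created_at": "timestamp"}

detected = schema_changes(yesterday, today)
```

Here detected holds three entries: a type change on age, an added created_at column, and a removed email column. An alert could be raised whenever the list is non-empty.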

[BUG] Unable to install re_data for Greenplum database

Describe the bug
There is an error when running the command dbt run --models package:re_data

Expected behavior
re_data models are installed on the Greenplum database.

To Reproduce
Steps to reproduce the behavior:

Configure dbt project for Greenplum database
Run command dbt run --models package:re_data

Screenshots

Logs and additional context

...
14:27:46 Completed with 1 error and 0 warnings:
14:27:46  
14:27:46  Database Error in model re_data_last_stats (models/metrics/for_anomalies/re_data_last_stats.sql)
14:27:46    column "re_data_base_metrics.value" must appear in the GROUP BY clause or be used in an aggregate function
14:27:46    LINE 12:         avg(value)  over(partition by table_name, column_nam...
14:27:46                         ^
14:27:46    compiled SQL at target/run/re_data/models/metrics/for_anomalies/re_data_last_stats.sql
...

There are four places in the models/metrics/for_anomalies/re_data_last_stats.sql file that are hardcoded for the PostgreSQL database.

Greenplum is based on PostgreSQL 9.4 and has the same syntax, so it should be changed to something like this:

[BUG] Not detecting removed table

Describe the bug

When a table is removed, re_data doesn't notice it and breaks on the next run.
The table should be removed from re_data_tables, and information about the removed table should be recorded.
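
Detecting removed tables amounts to comparing the monitored list against what currently exists in the warehouse. A minimal sketch, with a hypothetical reconcile_tables helper and made-up table names:

```python
def reconcile_tables(monitored, existing):
    """Split monitored tables into those still present and those that were dropped."""
    existing = set(existing)
    still_present = [t for t in monitored if t in existing]
    removed = [t for t in monitored if t not in existing]
    return still_present, removed

monitored = ["shop.orders", "shop.customers", "shop.legacy_log"]
existing = ["shop.orders", "shop.customers", "shop.payments"]

still_present, removed = reconcile_tables(monitored, existing)
```

Only still_present would be processed on the next run, while removed (here shop.legacy_log) could be logged so that the removal is visible to users.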

[FEATURE] Distinct values count metric

Tell us about the problem you're trying to solve

Add a distinct values count metric to re_data.

Describe the solution you’d like

The number of distinct values, computed for all text columns in the DB.

Describe the alternative you’ve considered or used

If it is too expensive to compute this for every column, we may consider computing it only when a specific check is added for the DB. This may be needed for big DBs.
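
The opt-in alternative could be sketched as computing count(distinct ...) only for an explicit list of columns. The example uses sqlite3 as a stand-in; the orders table and the distinct_counts helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (status text, country text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [("new", "PL"), ("paid", "PL"), ("paid", "US"), (None, "US")],
)

def distinct_counts(conn, table, columns):
    """Count distinct non-NULL values for the explicitly requested columns."""
    # Identifiers are interpolated directly, so pass trusted names only.
    selects = ", ".join(f"count(distinct {c}) as {c}" for c in columns)
    row = conn.execute(f"select {selects} from {table}").fetchone()
    return dict(zip(columns, row))

counts = distinct_counts(conn, "orders", ["status", "country"])
```

Both columns have two distinct values here; note that count(distinct ...) skips NULLs, so the NULL status is not counted.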
