
great-expectations / great_expectations


Always know what to expect from your data.

Home Page: https://docs.greatexpectations.io/

License: Apache License 2.0

Python 99.29% Jupyter Notebook 0.09% CSS 0.04% Lua 0.03% Dockerfile 0.01% Shell 0.03% Jinja 0.42% JavaScript 0.10%
pipeline-tests dataquality datacleaning datacleaner data-science data-profiling pipeline pipeline-testing cleandata dataunittest

great_expectations's People

Contributors

abegong, alexsherstinsky, anhollis, anthonyburdi, austiezr, ayirplm, aylr, billdirks, cdkini, cselig, dependabot[bot], derekma73, donaldheppner, eugmandel, jcampbell, joshua-stauffer, kenwade4, kilo59, kwcanuck, kyleaton, nathanfarmer, petermoyer, rachel-reverie, roblim, shinnnyshinshin, spbail, szecsip, talagluck, trangpham, tyler-hoffman

great_expectations's Issues

Proposal: Use a WeightedPartitions for distributional expectations

{
  "partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
  "weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}

Partitions specifies the lower bound for each partition.
Weights specifies the total mass within each partition. (lower_bound <= value < upper_bound)

  • The number of entries in partition and weight lists must be equal.
  • For convenience, partitions are always sorted in ascending order.
  • Weights must sum to exactly 1.0
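
The three rules above can be checked mechanically. The following is a minimal sketch, not part of the project: the function name `validate_weighted_partitions` and the `tolerance` parameter are assumptions made for illustration. It compares the sum to 1.0 within a small tolerance, because floating-point sums rarely land on exactly 1.0.

```python
import math


def validate_weighted_partitions(obj, tolerance=1e-9):
    """Raise ValueError if obj breaks any of the three rules listed above.

    Hypothetical helper; the name and the tolerance are illustrative only.
    """
    partitions = obj["partitions"]
    weights = obj["weights"]

    # Rule 1: both lists must have the same number of entries.
    if len(partitions) != len(weights):
        raise ValueError("partitions and weights must have the same length")

    # Rule 2: lower bounds must be in ascending order.
    if any(a > b for a, b in zip(partitions, partitions[1:])):
        raise ValueError("partitions must be sorted in ascending order")

    # Rule 3: weights must sum to 1.0 (checked within a tolerance).
    if not math.isclose(math.fsum(weights), 1.0, abs_tol=tolerance):
        raise ValueError("weights must sum to 1.0")

    return True


example = {
    "partitions": [0.0, 0.1, 0.3, 0.6, 1.0],
    "weights": [0.4, 0.05, 0.25, 0.3, 0.0],
}
is_valid = validate_weighted_partitions(example)
```

Whether the check should be exact or tolerant is a design choice the proposal leaves open; the tolerance here is only one option.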

Note: Are there JSON-serializable versions of inf and -inf?
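
On the question above: strict JSON has no literal for infinity. Python's `json` module writes the non-standard tokens `Infinity` and `-Infinity` by default and raises `ValueError` when `allow_nan=False` is passed. The sketch below shows that behaviour and one possible workaround, encoding infinite bounds as the strings "inf" and "-inf". The helpers `encode_bound` and `decode_bound` are hypothetical, and the string encoding is an assumption, not a project decision.

```python
import json
import math

bounds = [float("-inf"), 0.0, 0.5]

# Default behaviour: non-standard tokens that strict JSON parsers may reject.
lenient_text = json.dumps(bounds)

# Strict behaviour: serialization is refused outright.
try:
    json.dumps(bounds, allow_nan=False)
    strict_accepts_inf = True
except ValueError:
    strict_accepts_inf = False


def encode_bound(x):
    """Hypothetical helper: represent infinite bounds as the strings 'inf' / '-inf'."""
    return repr(x) if math.isinf(x) else x


def decode_bound(x):
    """Hypothetical helper: turn the string form back into a float."""
    return float(x) if isinstance(x, str) else x


strict_text = json.dumps([encode_bound(b) for b in bounds], allow_nan=False)
round_tripped = [decode_bound(b) for b in json.loads(strict_text)]
```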

Putting weights and partitions together into a single object has several advantages:

  • Can be passed/returned through true_value and other parameters
  • Simpler to test
  • Less prone to accidental separation in exploratory workflows

Using a PDF instead of a CDF has some advantages, too:

  • Unified representation for categorical and continuous data
  • More user-friendly graphs
  • Still information-complete for calculating CDFs
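
As a small illustration of the last point, a running total of the weights recovers the cumulative mass at the end of each partition. This is a sketch only; `weights_to_cdf` is a hypothetical name.

```python
from itertools import accumulate


def weights_to_cdf(weights):
    """Return the cumulative mass reached at the end of each partition."""
    return list(accumulate(weights))


cdf = weights_to_cdf([0.4, 0.05, 0.25, 0.3, 0.0])
```

The final entry should be 1.0 (up to floating-point rounding) whenever the weights sum to 1.0.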

Create issues from this issue. :)

Notes from July 10th call:

Custom expectations:

  • Standardize on column_map_expectation and column_aggregate_expectation
  • Drop column_elementwise_expectation for now. (If we discover we need it, we can add it back in.)
  • Add better worked examples in the docs.
  • We also need to add a prototyping syntax for expectations that doesn't require subclassing and decorators. Something along the lines of:
dataset.expect_function_to_be_elementwise_true('column', function)
   => assert df['column'].apply(function).all()
dataset.expect_function_to_be_true('column', function)
   => assert function(df['column']) == True
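
To make the prototyping idea concrete, here is a dependency-free sketch. `PrototypeDataset` is a hypothetical stand-in that stores columns as plain lists rather than a pandas DataFrame, and returning a dictionary with a `success` key is an assumption; only the two method names come from the notes above.

```python
class PrototypeDataset:
    """Hypothetical stand-in for a dataset: maps column names to lists of values."""

    def __init__(self, columns):
        self.columns = columns

    def expect_function_to_be_elementwise_true(self, column, function):
        # Apply the function to every value; succeed only if all results are truthy.
        results = [bool(function(value)) for value in self.columns[column]]
        return {"success": all(results)}

    def expect_function_to_be_true(self, column, function):
        # Apply the function once to the whole column.
        return {"success": bool(function(self.columns[column]))}


dataset = PrototypeDataset({"age": [21, 35, 48]})
elementwise = dataset.expect_function_to_be_elementwise_true("age", lambda v: v >= 18)
whole_column = dataset.expect_function_to_be_true("age", lambda col: max(col) < 40)
```

Here `elementwise` succeeds because every age is at least 18, while `whole_column` fails because the maximum age is 48.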

Output formats

  • Make it clear in the docs: output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
  • Bring the docs up to date (e.g. true_value for column_aggregate_expectations).
  • Change include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
  • Think about including row_index_list as a return value. (This gets complicated in some non-pandas systems.)
  • What about error messages and handling in expectations?

Feature/unit test refactor

Unit tests have been refactored and converted to work in Python 3. See commit comments for specific details.

Add more thorough unit tests for...

Add more thorough unit tests for:

  • expect_table_row_count_to_be_between
  • expect_table_row_count_to_equal
  • expect_column_values_to_be_dateutil_parseable
  • expect_column_values_to_be_valid_json
  • expect_column_stdev_to_be_between

What additional logic should we pack into Expectation decorators?

What are all the generic parameters that Expectations should accept?

All Expectations

  • output_format
  • include_kwargs
  • catch_exceptions
  • exclude_null_values?

For column_map_expectations

  • mostly

For column_aggregate_expectations

  • confidence_threshold

What other logic can we include?

  • Input validation
  • Output validation
    • Is JSON serializable
    • Has expected fields, etc.
  • Docstring propagation...
  • Create and append the Expectation to the dataset
  • Logic for de-duplication/updating Expectations
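
A rough sketch of how a decorator could carry several of these responsibilities is shown below. It is not the project's implementation: the wrapped function's signature, the result fields, and the example expectation `expect_values_to_be_positive` are all assumptions. It covers `mostly`, `include_kwargs`, `catch_exceptions`, input validation, a JSON-serializability check, and docstring propagation, and leaves out `output_format` and appending to the dataset.

```python
import functools
import json


def column_map_expectation(func):
    """Hypothetical decorator: func(values, **kwargs) returns one bool per value."""

    @functools.wraps(func)  # docstring propagation
    def wrapper(values, mostly=None, include_kwargs=False,
                catch_exceptions=False, **kwargs):
        # Input validation for the generic parameter.
        if mostly is not None and not 0 <= mostly <= 1:
            raise ValueError("mostly must be between 0 and 1")
        try:
            flags = list(func(values, **kwargs))
            fraction = sum(flags) / len(flags) if flags else 1.0
            success = fraction >= mostly if mostly is not None else all(flags)
            result = {"success": success}
        except Exception as exc:
            if not catch_exceptions:
                raise
            result = {"success": False, "exception": str(exc)}
        if include_kwargs:
            result["kwargs"] = dict(kwargs, mostly=mostly)
        # Output validation: json.dumps raises TypeError if result is not serializable.
        json.dumps(result)
        return result

    return wrapper


@column_map_expectation
def expect_values_to_be_positive(values):
    """Each value should be greater than zero."""
    return [v > 0 for v in values]


strict = expect_values_to_be_positive([1, 2, -3, 4])
lenient = expect_values_to_be_positive([1, 2, -3, 4], mostly=0.75, include_kwargs=True)
```

With three of four values positive, the call without `mostly` fails, while `mostly=0.75` passes.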

Should it be easy to simultaneously create many expectations?

Consider the case of something like the following:

for column in df.columns:
    df.expect_column_mean_to_be_between(column, min, max)

Currently this works, but the output is only visible if we wrap the expectation call in print(). Even then, we cannot tell which column each result refers to unless we also force the full dictionary returned by the expectation to be printed. Is this a useful pattern?
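
One way to keep track of which result belongs to which column is to collect the returned dictionaries keyed by column name. The sketch below uses a hypothetical stand-in for the expectation (a plain function over a list of values) and made-up sample data; the `true_value` field follows the naming used elsewhere in these issues.

```python
from statistics import mean


def expect_column_mean_to_be_between(values, min_value, max_value):
    """Hypothetical stand-in that returns a result dictionary."""
    observed = mean(values)
    return {"success": min_value <= observed <= max_value, "true_value": observed}


# Made-up sample data: column name -> list of values.
table = {"height": [1.6, 1.8, 1.7], "weight": [60, 85, 250]}

# Key each result by column name so it is clear which column it describes.
results = {
    column: expect_column_mean_to_be_between(values, 0, 100)
    for column, values in table.items()
}
failed_columns = [column for column, result in results.items() if not result["success"]]
```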

How should we implement distributional expectations within the new expectation decorators?

Distributional expectations are different from all the other @column_aggregate_expectations:

They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.

In addition to a true_value, they should also return a confidence_value:

{
  success : boolean,
  true_value : partitioned_weights,
  confidence_value : float on [0,1]
}

The difference does not arise because these are expectations about distributions. It arises because they are statistical assumptions.

How should we capture this in our expectations?

Option 1: Create a @column_statistical_expectation
Option 2: Add parameters to the distributional expectations to give them the

I lean towards (1). Jury-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.
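
A minimal sketch of what Option 1 might look like follows. Everything here beyond the decorator name and the three result fields is an assumption: the wrapped function is taken to return a `(true_value, confidence_value)` pair, success is taken to mean `confidence_value >= confidence_threshold`, and the example expectation uses one minus the total variation distance as a placeholder score, which is not a real statistical test.

```python
import bisect
import functools


def column_statistical_expectation(func):
    """Hypothetical decorator: func returns (true_value, confidence_value)."""

    @functools.wraps(func)
    def wrapper(values, *, confidence_threshold, **kwargs):
        # Keyword-only with no default, so confidence_threshold is mandatory.
        if not 0 <= confidence_threshold <= 1:
            raise ValueError("confidence_threshold must be on [0, 1]")
        true_value, confidence_value = func(values, **kwargs)
        return {
            "success": confidence_value >= confidence_threshold,
            "true_value": true_value,
            "confidence_value": confidence_value,
        }

    return wrapper


@column_statistical_expectation
def expect_column_weights_to_match(values, partition):
    """Placeholder comparison of observed and expected weights (non-empty values assumed)."""
    bounds, expected = partition["partitions"], partition["weights"]
    counts = [0] * len(bounds)
    for value in values:
        # Index of the last lower bound that is <= value.
        index = bisect.bisect_right(bounds, value) - 1
        if index >= 0:
            counts[index] += 1
    observed = [count / len(values) for count in counts]
    # Placeholder score: 1 minus the total variation distance, NOT a p-value.
    distance = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    return {"partitions": bounds, "weights": observed}, 1.0 - distance


result = expect_column_weights_to_match(
    [0.1, 0.2, 0.6, 0.9],
    partition={"partitions": [0.0, 0.5], "weights": [0.5, 0.5]},
    confidence_threshold=0.9,
)
```

Calling the expectation without `confidence_threshold` raises a `TypeError`, which is one way to enforce that the argument is not optional.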

@jcampbell, @dgmiller Thoughts?
