great-expectations / great_expectations
Always know what to expect from your data.
Home Page: https://docs.greatexpectations.io/
License: Apache License 2.0
{
"partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
"weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}
Partitions: specifies the lower bound for each partition.
Weights: specifies the total mass within each partition (lower_bound <= value < upper_bound).
Note: is there a JSON-serializable version of inf and -inf?
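As a sketch of those constraints (illustrative only, not the library's actual API), a partition object should have one weight per partition bound, sorted bounds, and weights summing to 1:

```python
# Sketch: validating the proposed partition/weights object, where
# "partitions" lists the lower bound of each bin and "weights" is the
# mass in the half-open bin [lower_bound, upper_bound). Names and the
# function itself are hypothetical.

def validate_partition_object(partition_object):
    partitions = partition_object["partitions"]
    weights = partition_object["weights"]
    # One weight per bound; the trailing 0.0 weight covers values at or
    # above the final bound.
    assert len(partitions) == len(weights)
    # Bounds must be sorted for the half-open-interval semantics to work.
    assert partitions == sorted(partitions)
    # Weights are a probability mass function and must sum to 1.
    assert abs(sum(weights) - 1.0) < 1e-9
    return True

example = {
    "partitions": [0.0, 0.1, 0.3, 0.6, 1.0],
    "weights": [0.4, 0.05, 0.25, 0.3, 0.0],
}
validate_partition_object(example)  # returns True
```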
Putting weights and partitions together into a single object has several advantages (e.g. for true_value and other parameters). Using a PDF instead of a CDF has some advantages, too.
I would have expected it to throw a TypeError, instead it uses the value of the numeric data.
I think of the "length" of an int or float as meaningless. As such, I would expect this expectation to only work for strings.
Currently, a user can create an expectation using parameters that are not json serializable but not be aware of the error until attempting to save the config.
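A minimal sketch of checking serializability eagerly at creation time, so the error surfaces immediately rather than at save time (the helper name is illustrative, not the library's actual implementation):

```python
import json

# Hypothetical eager check: round-trip the expectation kwargs through
# json.dumps at creation time instead of waiting for config save.

def ensure_json_serializable(kwargs):
    try:
        json.dumps(kwargs)
    except TypeError as e:
        raise TypeError(
            "Expectation arguments must be JSON-serializable: %s" % e
        )

ensure_json_serializable({"min_value": 1, "max_value": 10})  # fine
try:
    ensure_json_serializable({"values": {1, 2, 3}})  # sets are not JSON
except TypeError:
    print("caught non-serializable argument at creation time")
```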
Close #39 with improvements for documentation, unit tests, and bug fixes.
Right now, there's no way to programmatically reference them from Expectations.
Notes from July 10th call:

- Keep column_map_expectation and column_aggregate_expectation; drop column_elementwise_expectation for now. (If we discover we need it, we can add it back in.)
- dataset.expect_function_to_be_elementwise_true('column', function) => assert(df.column.apply(function) == [True] * len(column))
- dataset.expect_function_to_be_true('column', function) => assert(function(df.column) == True)
- output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
- true_value for aggregate_column_expectations; include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
- row_index_list as a return value. (This gets complicated in some non-pandas systems.)
- Unit tests have been refactored and converted to work in Python 3. See commit comments for specific details.
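The two proposed function expectations above can be sketched with plain pandas, assuming a pandas-backed dataset (all names illustrative):

```python
import pandas as pd

# Sketch of the proposed semantics from the call notes:
# - elementwise: the function is applied to every value in the column
# - whole-column: the function receives the Series and returns one boolean

df = pd.DataFrame({"column": [2, 4, 6]})

def is_even(x):
    return x % 2 == 0

# expect_function_to_be_elementwise_true: every value must satisfy is_even.
elementwise_success = bool(df["column"].apply(is_even).all())

# expect_function_to_be_true: one function call over the whole column.
column_success = bool(df["column"].sum() == 12)

print(elementwise_success, column_success)  # True True
```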
For example, if an expectation is written on a column that does not exist, currently that expectation will immediately be added, even if it is never even evaluated.
Docstrings updated for version 0.1
Returns False:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, include_config=True)['success']
Returns True:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, max_value=1, include_config=True)['success']
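A sketch of the behavior one would expect instead, where an omitted max_value means "no upper bound" (illustrative stand-in, not the library's code):

```python
# Hypothetical reference implementation: a missing bound should be
# treated as unbounded, so min_value=1 alone succeeds on a fully
# unique column.

def proportion_of_unique_values_between(values, min_value=None,
                                        max_value=None):
    proportion = len(set(values)) / float(len(values))
    above_min = min_value is None or proportion >= min_value
    below_max = max_value is None or proportion <= max_value
    return {"success": above_min and below_max, "true_value": proportion}

ids = [1, 2, 3, 4]  # fully unique, so proportion == 1.0
print(proportion_of_unique_values_between(ids, min_value=1)["success"])  # True
```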
expect_table_row_count_to_equal should be changed to DataSet.table_expectation
expect_table_row_count_to_be_between should be changed to DataSet.table_expectation
expect_column_values_to_be_subset should be changed to DataSet.expectation
update decorators in base.py to use the new python 3 logic
A bit of sugar:
dataset.column_name.expect_something(arg1, arg2)
should evaluate to dataset.expect_something(column_name, arg1, arg2)
...and ipython should be able to autocomplete the expectation on tab.
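One way to get that sugar is a small attribute proxy: unknown attributes on the dataset resolve to a column proxy that binds the column name as the first argument. This is a sketch under that assumption, not the library's implementation; all class names are illustrative.

```python
# Hypothetical proxy-based sugar: dataset.column_name.expect_something(a, b)
# delegates to dataset.expect_something("column_name", a, b).

class _ColumnProxy(object):
    def __init__(self, dataset, column_name):
        self._dataset = dataset
        self._column_name = column_name

    def __getattr__(self, name):
        method = getattr(self._dataset, name)

        def wrapper(*args, **kwargs):
            # Bind the column name as the first positional argument.
            return method(self._column_name, *args, **kwargs)

        return wrapper

class Dataset(object):
    def __getattr__(self, name):
        # Only called when normal lookup fails, so real expectation
        # methods still resolve (and tab-complete) normally.
        return _ColumnProxy(self, name)

    def expect_something(self, column, arg1, arg2):
        return {"success": True, "column": column, "args": (arg1, arg2)}

ds = Dataset()
result = ds.my_column.expect_something(1, 2)
print(result["column"])  # my_column
```

ipython tab completion on the proxy itself would still need work (e.g. a `__dir__` that lists the dataset's expectation methods), since the proxy resolves attributes lazily.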
nose recommends that new projects use a different framework for Python 3 support, and we want to be as broadly compatible as possible.
Running unit tests currently requires a mix of python -m unittest tests (for things converted to unittest) and nosetests (for those not). We need to finish the conversion and add instructions to the developer/contributor docs.
...which is very broken.
Distributional expectations need:

- Decide on expected behavior for (and implement) expect_column_numerical_distribution_to_be and expect_column_frequency_distribution_to_be
- Add more thorough unit tests for expect_table_row_count_to_be_between, expect_table_row_count_to_equal, expect_column_values_to_be_dateutil_parseable, expect_column_values_to_be_valid_json, and expect_column_stdev_to_be_between
What are all the generic parameters that Expectations should accept?

- All Expectations
- For column_map_expectations
- For column_aggregate_expectations

What other logic can we include?

- Input validation
- Output validation
- Docstring propagation...
- Create and append the Expectation to the dataset
- Logic for de-duplication/updating Expectations
Consider the case of something like the following:
for column in df.columns:
    df.expect_column_mean_to_be_between(column, min, max)
Currently, this will work, but we would need to wrap the expectation statement in print() to see the output at all, and even then we cannot see which column the expectation was about, unless we also coerce printing of the dictionary returned by the expectation. Is this a useful pattern?
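One pattern that sidesteps the print() problem is collecting each result keyed by column name. This is a sketch with a stubbed expectation (the stub is hypothetical, written here as a plain function for illustration):

```python
import pandas as pd

# Collect results per column so they can be inspected after the loop,
# instead of relying on print() inside it.

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def expect_column_mean_to_be_between(frame, column, min_value, max_value):
    # Stub of the real expectation, for illustration only.
    mean = frame[column].mean()
    return {"success": bool(min_value <= mean <= max_value),
            "true_value": mean}

results = {}
for column in df.columns:
    results[column] = expect_column_mean_to_be_between(df, column, 0, 5)

# mean(a) == 2 is in [0, 5]; mean(b) == 20 is not.
print(results["a"]["success"], results["b"]["success"])  # True False
```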
[Replication example needed]
We need to either fix this, or document and own it.
Also, we should write tests against this.
At this stage in the decorator refactor, this is the single biggest source of uncertainty for me.
Lots of the expectations don't implement suppress_exceptions.
Not this:
@DataSet.column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
But this:
@column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
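A minimal sketch of what a module-level column_expectation decorator could look like (names and the bookkeeping step are illustrative, not the library's actual decorator):

```python
import functools

# Hypothetical module-level decorator: wraps the raw expectation so shared
# bookkeeping lives in one place and the decorator can be referenced
# without going through the DataSet class.

def column_expectation(func):
    @functools.wraps(func)
    def wrapper(self, column, *args, **kwargs):
        result = func(self, column, *args, **kwargs)
        # Shared post-processing, e.g. tagging the expectation type.
        result.setdefault("expectation_type", func.__name__)
        return result
    return wrapper

class DataSet(object):
    def __init__(self, data):
        self.data = data

    @column_expectation
    def expect_column_values_to_be_unique(self, column, mostly=None,
                                          suppress_exceptions=False):
        values = self.data[column]
        return {"success": len(set(values)) == len(values)}

ds = DataSet({"id": [1, 2, 3]})
out = ds.expect_column_values_to_be_unique("id")
print(out["success"], out["expectation_type"])
```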
Currently, the API doesn't suggest parallelism will necessarily exist in the expectations that will be implemented in classes that inherit from DataSet
The code is 90% redundant. Isn't there some way to refactor these?
Currently, ensure_json_serializable uses isinstance in a way that does not work in both Python 2 and 3. Essentially this pins the project to Python 2 in this version (since @abegong added the unicode type back to the check).
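A version-agnostic way to handle the str/unicode split is to build the type tuple from the interpreter version. This is a sketch of the idea, not the library's exact code:

```python
import sys

# Build the string-type tuple once: Python 2 must accept both str and
# unicode, while Python 3 only has str.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (str, unicode)  # noqa: F821 -- Python 2 only

def is_serializable_scalar(value):
    # Hypothetical helper: scalars that json.dumps handles natively.
    return isinstance(value, string_types + (int, float, bool, type(None)))

print(is_serializable_scalar("abc"), is_serializable_scalar({1}))  # True False
```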
Distributional expectations are different from all the other @column_aggregate_expectations:

They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.

In addition to a true_value, they should also return a confidence_value:

{
    success : boolean,
    true_value : partitioned_weights,
    confidence_value : float on [0,1]
}
The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.
How should we capture this in our expectations?
Option 1: Create a @column_statistical_expectation
Option 2: Add parameters to the distributional expectations to give them the
I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.
@jcampbell, @dgmiller Thoughts?
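A sketch of the Option 1 return shape, using a crude agreement score between observed and expected weights in place of a real statistical test (the function name follows the issue; the scoring rule is purely illustrative):

```python
# Hypothetical statistical expectation: returns success, true_value, and
# confidence_value, with a required confidence_threshold. The agreement
# score here (1 - total_variation_distance) is a stand-in for a proper
# statistical test, used only to show the proposed return shape.

def expect_column_frequency_distribution_to_be(observed_weights,
                                               expected_weights,
                                               confidence_threshold):
    # Total absolute deviation between the weight vectors, mapped onto
    # [0, 1] where 1.0 means identical distributions.
    deviation = sum(abs(o - e) for o, e in
                    zip(observed_weights, expected_weights))
    confidence_value = max(0.0, 1.0 - deviation / 2.0)
    return {
        "success": confidence_value >= confidence_threshold,
        "true_value": observed_weights,
        "confidence_value": confidence_value,
    }

result = expect_column_frequency_distribution_to_be(
    [0.5, 0.5], [0.4, 0.6], confidence_threshold=0.8)
print(result["success"])  # True
```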
expect_column_value_lengths_to_be_between uses exclusive boundaries, so you can't specify that all values are of the same length. For example:
drg.expect_column_value_lengths_to_be_between(column=" Average Covered Charges ", min_value=9, max_value=9)
will return:
{'exception_list': [
'$105929.47',
'$101282.03',
'$146892.00',
...
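The expected behavior is inclusive bounds, so that min_value == max_value pins an exact length. A sketch of that behavior (illustrative stand-in, not the library's code):

```python
# Hypothetical inclusive-bounds version: with min_value=9 and max_value=9,
# every 9-character value passes and only other lengths are exceptions.

def expect_value_lengths_between(values, min_value, max_value):
    exceptions = [v for v in values
                  if not (min_value <= len(v) <= max_value)]
    return {"success": not exceptions, "exception_list": exceptions}

result = expect_value_lengths_between(
    ["$12345.67", "$98765.43", "$105929.47"], 9, 9)
print(result["exception_list"])  # ['$105929.47'] -- 10 characters, not 9
```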
great_expectations my_dataset.csv my_expectations.json --output_format=BOOLEAN_ONLY --catch_exceptions=False --include_config=True