great-expectations / great_expectations
Always know what to expect from your data.
Home Page: https://docs.greatexpectations.io/
License: Apache License 2.0
{
"partitions" : [0.0, 0.1, 0.3, 0.6, 1.0],
"weights" : [0.4, 0.05, 0.25, 0.3, 0.0]
}
Partitions: specifies the lower bound for each partition.
Weights: specifies the total mass within each partition (lower_bound <= value < upper_bound).
Note: is there a JSON-serializable version of inf and -inf?
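As a sketch of those constraints (illustrative only, not the library's actual API), a partition object should have one weight per partition bound, sorted bounds, and weights summing to 1:

```python
# Sketch: validating the proposed partition/weights object, where
# "partitions" lists the lower bound of each bin and "weights" is the
# mass in the half-open bin [lower_bound, upper_bound). Names and the
# function itself are hypothetical.

def validate_partition_object(partition_object):
    partitions = partition_object["partitions"]
    weights = partition_object["weights"]
    # One weight per bound; the trailing 0.0 weight covers values at or
    # above the final bound.
    assert len(partitions) == len(weights)
    # Bounds must be sorted for the half-open-interval semantics to work.
    assert partitions == sorted(partitions)
    # Weights are a probability mass function and must sum to 1.
    assert abs(sum(weights) - 1.0) < 1e-9
    return True

example = {
    "partitions": [0.0, 0.1, 0.3, 0.6, 1.0],
    "weights": [0.4, 0.05, 0.25, 0.3, 0.0],
}
validate_partition_object(example)  # returns True
```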
Putting weights and partitions together into a single object has several advantages (e.g. for true_value and other parameters). Using a PDF instead of a CDF has some advantages, too.
I would have expected it to throw a TypeError, instead it uses the value of the numeric data.
I think of the "length" of an int or float as meaningless. As such, I would expect this expectation to only work for strings.
Currently, a user can create an expectation using parameters that are not json serializable but not be aware of the error until attempting to save the config.
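A minimal sketch of checking serializability eagerly at creation time, so the error surfaces immediately rather than at save time (the helper name is illustrative, not the library's actual implementation):

```python
import json

# Hypothetical eager check: round-trip the expectation kwargs through
# json.dumps at creation time instead of waiting for config save.

def ensure_json_serializable(kwargs):
    try:
        json.dumps(kwargs)
    except TypeError as e:
        raise TypeError(
            "Expectation arguments must be JSON-serializable: %s" % e
        )

ensure_json_serializable({"min_value": 1, "max_value": 10})  # fine
try:
    ensure_json_serializable({"values": {1, 2, 3}})  # sets are not JSON
except TypeError:
    print("caught non-serializable argument at creation time")
```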
Close #39 with improvements for documentation, unit tests, and bug fixes.
Right now, there's no way to programmatically reference them from Expectations.
Notes from July 10th call:

- Keep column_map_expectation and column_aggregate_expectation; drop column_elementwise_expectation for now. (If we discover we need it, we can add it back in.)
- dataset.expect_function_to_be_elementwise_true('column', function) => assert(df.column.apply(function) == [True] * len(column))
- dataset.expect_function_to_be_true('column', function) => assert(function(df.column) == True)
- output_format is categorical, not strictly ordered. This makes the output API more flexible and extensible.
- true_value for aggregate_column_expectations; include_lineage to include_kwargs. Also make it clear that expectations have only kwargs, no args.
- row_index_list as a return value. (This gets complicated in some non-pandas systems.)
- Unit tests have been refactored and converted to work in Python 3. See commit comments for specific details.
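The two proposed function expectations above can be sketched with plain pandas, assuming a pandas-backed dataset (all names illustrative):

```python
import pandas as pd

# Sketch of the proposed semantics from the call notes:
# - elementwise: the function is applied to every value in the column
# - whole-column: the function receives the Series and returns one boolean

df = pd.DataFrame({"column": [2, 4, 6]})

def is_even(x):
    return x % 2 == 0

# expect_function_to_be_elementwise_true: every value must satisfy is_even.
elementwise_success = bool(df["column"].apply(is_even).all())

# expect_function_to_be_true: one function call over the whole column.
column_success = bool(df["column"].sum() == 12)

print(elementwise_success, column_success)  # True True
```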
For example, if an expectation is written on a column that does not exist, currently that expectation will immediately be added, even if it is never even evaluated.
Docstrings updated for version 0.1
Returns False:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, include_config=True)['success']
Returns True:
data_set.expect_column_proportion_of_unique_values_to_be_between(column="ID_COLUMN", min_value=1, max_value=1, include_config=True)['success']
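A sketch of the behavior one would expect instead, where an omitted max_value means "no upper bound" (illustrative stand-in, not the library's code):

```python
# Hypothetical reference implementation: a missing bound should be
# treated as unbounded, so min_value=1 alone succeeds on a fully
# unique column.

def proportion_of_unique_values_between(values, min_value=None,
                                        max_value=None):
    proportion = len(set(values)) / float(len(values))
    above_min = min_value is None or proportion >= min_value
    below_max = max_value is None or proportion <= max_value
    return {"success": above_min and below_max, "true_value": proportion}

ids = [1, 2, 3, 4]  # fully unique, so proportion == 1.0
print(proportion_of_unique_values_between(ids, min_value=1)["success"])  # True
```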
expect_table_row_count_to_equal should be changed to DataSet.table_expectation
expect_table_row_count_to_be_between should be changed to DataSet.table_expectation
expect_column_values_to_be_subset should be changed to DataSet.expectation
update decorators in base.py to use the new python 3 logic
A bit of sugar:
dataset.column_name.expect_something(arg1, arg2)
should evaluate to dataset.expect_something(column_name, arg1, arg2)
...and ipython should be able to autocomplete the expectation on tab.
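One way to get that sugar is a small attribute proxy: unknown attributes on the dataset resolve to a column proxy that binds the column name as the first argument. This is a sketch under that assumption, not the library's implementation; all class names are illustrative.

```python
# Hypothetical proxy-based sugar: dataset.column_name.expect_something(a, b)
# delegates to dataset.expect_something("column_name", a, b).

class _ColumnProxy(object):
    def __init__(self, dataset, column_name):
        self._dataset = dataset
        self._column_name = column_name

    def __getattr__(self, name):
        method = getattr(self._dataset, name)

        def wrapper(*args, **kwargs):
            # Bind the column name as the first positional argument.
            return method(self._column_name, *args, **kwargs)

        return wrapper

class Dataset(object):
    def __getattr__(self, name):
        # Only called when normal lookup fails, so real expectation
        # methods still resolve (and tab-complete) normally.
        return _ColumnProxy(self, name)

    def expect_something(self, column, arg1, arg2):
        return {"success": True, "column": column, "args": (arg1, arg2)}

ds = Dataset()
result = ds.my_column.expect_something(1, 2)
print(result["column"])  # my_column
```

ipython tab completion on the proxy itself would still need work (e.g. a `__dir__` that lists the dataset's expectation methods), since the proxy resolves attributes lazily.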
nose recommends that new projects use a different framework for Python 3 support, and we want to be as broadly compatible as possible.
Running unit tests currently requires a mix of python -m unittest tests (for things converted to unittest) and nosetests (for those not). We need to finish the conversion and add instructions to the developer/contributor docs.
...which is very broken.
Distributional expectations need:

- Decide on expected behavior for (and implement) expect_column_numerical_distribution_to_be and expect_column_frequency_distribution_to_be
- Add more thorough unit tests for expect_table_row_count_to_be_between, expect_table_row_count_to_equal, expect_column_values_to_be_dateutil_parseable, expect_column_values_to_be_valid_json, and expect_column_stdev_to_be_between
What are all the generic parameters that Expectations should accept?

- All Expectations
- For column_map_expectations
- For column_aggregate_expectations

What other logic can we include?

- Input validation
- Output validation
- Docstring propagation...
- Create and append the Expectation to the dataset
- Logic for de-duplication/updating Expectations
Consider the case of something like the following:
for column in df.columns:
    df.expect_column_mean_to_be_between(column, min, max)
Currently, this will work, but we would need to wrap the expectation statement in print() to see the output at all, and even then we cannot see which column the expectation was about, unless we also coerce printing of the dictionary returned by the expectation. Is this a useful pattern?
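One pattern that sidesteps the print() problem is collecting each result keyed by column name. This is a sketch with a stubbed expectation (the stub is hypothetical, written here as a plain function for illustration):

```python
import pandas as pd

# Collect results per column so they can be inspected after the loop,
# instead of relying on print() inside it.

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def expect_column_mean_to_be_between(frame, column, min_value, max_value):
    # Stub of the real expectation, for illustration only.
    mean = frame[column].mean()
    return {"success": bool(min_value <= mean <= max_value),
            "true_value": mean}

results = {}
for column in df.columns:
    results[column] = expect_column_mean_to_be_between(df, column, 0, 5)

# mean(a) == 2 is in [0, 5]; mean(b) == 20 is not.
print(results["a"]["success"], results["b"]["success"])  # True False
```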
[Replication example needed]
We need to either fix this, or document and own it.
Also, we should write tests against this.
At this stage in the decorator refactor, this is the single biggest source of uncertainty for me.
Lots of the expectations don't implement suppress_exceptions.
Not this:
@DataSet.column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
But this:
@column_expectation
def expect_column_values_to_be_unique(self, column, mostly=None, suppress_exceptions=False):
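A minimal sketch of what a module-level column_expectation decorator could look like (names and the bookkeeping step are illustrative, not the library's actual decorator):

```python
import functools

# Hypothetical module-level decorator: wraps the raw expectation so shared
# bookkeeping lives in one place and the decorator can be referenced
# without going through the DataSet class.

def column_expectation(func):
    @functools.wraps(func)
    def wrapper(self, column, *args, **kwargs):
        result = func(self, column, *args, **kwargs)
        # Shared post-processing, e.g. tagging the expectation type.
        result.setdefault("expectation_type", func.__name__)
        return result
    return wrapper

class DataSet(object):
    def __init__(self, data):
        self.data = data

    @column_expectation
    def expect_column_values_to_be_unique(self, column, mostly=None,
                                          suppress_exceptions=False):
        values = self.data[column]
        return {"success": len(set(values)) == len(values)}

ds = DataSet({"id": [1, 2, 3]})
out = ds.expect_column_values_to_be_unique("id")
print(out["success"], out["expectation_type"])
```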
Currently, the API doesn't suggest parallelism will necessarily exist in the expectations that will be implemented in classes that inherit from DataSet
The code is 90% redundant. Isn't there some way to refactor these?
Currently, ensure_json_serializable uses isinstance in a way that does not work in both Python 2 and 3. Essentially this pins the project to Python 2 in this version (since @abegong added the unicode type back to the check).
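A version-agnostic way to handle the str/unicode split is to build the type tuple from the interpreter version. This is a sketch of the idea, not the library's exact code:

```python
import sys

# Build the string-type tuple once: Python 2 must accept both str and
# unicode, while Python 3 only has str.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (str, unicode)  # noqa: F821 -- Python 2 only

def is_serializable_scalar(value):
    # Hypothetical helper: scalars that json.dumps handles natively.
    return isinstance(value, string_types + (int, float, bool, type(None)))

print(is_serializable_scalar("abc"), is_serializable_scalar({1}))  # True False
```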
Distributional expectations are different from all the other @column_aggregate_expectations:

They need to accept a confidence_threshold argument, similar to mostly for column_map_expectations. Unlike mostly, confidence_threshold isn't optional.

In addition to a true_value, they should also return a confidence_value:

{
    success : boolean,
    true_value : partitioned_weights,
    confidence_value : float on [0,1]
}
The difference isn't fundamentally because these are expectations about distributions. The difference is because these are statistical assumptions.
How should we capture this in our expectations?
Option 1: Create a @column_statistical_expectation
Option 2: Add parameters to the distributional expectations to give them the
I lean towards (1). Jerry-rigging extra fields and parameters in (2) seems like it could get sticky pretty fast. And statistical expectations are a pattern that I expect to use more in the future.
@jcampbell, @dgmiller Thoughts?
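A sketch of the Option 1 return shape, using a crude agreement score between observed and expected weights in place of a real statistical test (the function name follows the issue; the scoring rule is purely illustrative):

```python
# Hypothetical statistical expectation: returns success, true_value, and
# confidence_value, with a required confidence_threshold. The agreement
# score here (1 - total_variation_distance) is a stand-in for a proper
# statistical test, used only to show the proposed return shape.

def expect_column_frequency_distribution_to_be(observed_weights,
                                               expected_weights,
                                               confidence_threshold):
    # Total absolute deviation between the weight vectors, mapped onto
    # [0, 1] where 1.0 means identical distributions.
    deviation = sum(abs(o - e) for o, e in
                    zip(observed_weights, expected_weights))
    confidence_value = max(0.0, 1.0 - deviation / 2.0)
    return {
        "success": confidence_value >= confidence_threshold,
        "true_value": observed_weights,
        "confidence_value": confidence_value,
    }

result = expect_column_frequency_distribution_to_be(
    [0.5, 0.5], [0.4, 0.6], confidence_threshold=0.8)
print(result["success"])  # True
```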
expect_column_value_lengths_to_be_between uses exclusive boundaries, so you can't specify that all values are of the same length. For example:
drg.expect_column_value_lengths_to_be_between(column=" Average Covered Charges ", min_value=9, max_value=9)
will return:
{'exception_list': [
'$105929.47',
'$101282.03',
'$146892.00',
...
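The expected behavior is inclusive bounds, so that min_value == max_value pins an exact length. A sketch of that behavior (illustrative stand-in, not the library's code):

```python
# Hypothetical inclusive-bounds version: with min_value=9 and max_value=9,
# every 9-character value passes and only other lengths are exceptions.

def expect_value_lengths_between(values, min_value, max_value):
    exceptions = [v for v in values
                  if not (min_value <= len(v) <= max_value)]
    return {"success": not exceptions, "exception_list": exceptions}

result = expect_value_lengths_between(
    ["$12345.67", "$98765.43", "$105929.47"], 9, 9)
print(result["exception_list"])  # ['$105929.47'] -- 10 characters, not 9
```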
great_expectations my_dataset.csv my_expectations.json --output_format=BOOLEAN_ONLY --catch_exceptions=False --include_config=True