galileo-galilei / kedro-pandera
A kedro plugin to use pandera in your kedro projects
Home Page: https://kedro-pandera.readthedocs.io/en/latest/
License: Apache License 2.0
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?
The more I think about the importance of data contracts, the more ensuring coverage checks as part of a team's workflow feels like a natural evolution of this pattern.
The way I see this, there are two standards a user should aim for:
Code to copy-paste is worth a thousand words!
It is possible to infer the schema of a dataframe with the kedro pandera infer -d example_iris_data command. Runtime validation is performed in before_node_run. This means we validate only datasets which are loaded (e.g. inputs or intermediate outputs). We should also validate terminal outputs before saving them.
Users expect all datasets to be validated once.
Create an after_dataset_saved or an after_node_run hook which checks whether the dataset is a terminal output of a pipeline before validating it.
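The "terminal output" check itself is simple set arithmetic over the pipeline graph; kedro's Pipeline.outputs() computes this for real pipelines. A plain-Python sketch of the logic (the helper name is hypothetical):

```python
def terminal_outputs(nodes):
    """Given (inputs, outputs) name pairs for each node, return the
    datasets that are produced by some node but consumed by none,
    i.e. the pipeline's terminal outputs."""
    produced, consumed = set(), set()
    for inputs, outputs in nodes:
        consumed.update(inputs)
        produced.update(outputs)
    return produced - consumed

# e.g. raw -> clean -> model: only "model" is terminal
nodes = [(["raw"], ["clean"]), (["clean"], ["model"])]
result = terminal_outputs(nodes)
```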
I believe there is a leftover print statement in this method at this line:
When loading a pandera DataFrameModel this isn't very verbose. But you can use the same resolver to also load a pandera DataFrameSchema defined in Python, in which case this print statement outputs the entire schema and causes clutter. Can the print statement be removed or changed to a debug log message?
TBD
TBD
TBD
TBD
Pandera can not only be used to validate a dataframe but also to convert the dtypes of the dataframe according to the schema.
The schema.validate function returns the validated dataframe with the converted dtypes. We can update the input dataframe with the validated dataframe, so the nodes receive a dataframe that is both validated and converted according to the schema.
Add an additional configuration parameter which allows defining, per dataset, whether to only validate or also convert the dataset.
If a dataset is also configured to be converted, we can forward the converted dataset in the hook.
A global parameter can be defined to specify the default behaviour for all datasets which use a pandera schema.
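A minimal sketch of what forwarding the converted dataframe could look like, relying on the fact that kedro updates a node's inputs with whatever dict before_node_run returns. The hook class and the per-dataset "coerce" flag are hypothetical, not the plugin's actual API (and a real plugin would decorate the method with kedro's @hook_impl):

```python
class PanderaCoercionHook:
    """Hypothetical sketch: validate inputs and forward the coerced
    dataframe returned by schema.validate back into the node."""

    def before_node_run(self, node, catalog, inputs):
        overwritten = {}
        for name, data in inputs.items():
            dataset = catalog._datasets.get(name)
            metadata = getattr(dataset, "metadata", None) or {}
            pandera_meta = metadata.get("pandera") or {}
            schema = pandera_meta.get("schema")
            # hypothetical per-dataset flag: validate AND forward coerced data
            if schema is not None and pandera_meta.get("coerce", False):
                overwritten[name] = schema.validate(data)
        # kedro merges the returned dict into the node's inputs
        return overwritten or None
```

A global default could be read from the project's config and used when the per-dataset flag is absent.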
Kedro 0.19.0 had a breaking change.
The current hook still references catalog._data_sets, which does not work with kedro >= 0.19.0.
Pipeline runs and validates dataset
Pipeline does not run successfully and shows error
Enable data checking in Jupyter Notebook.
Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...
interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Notice you can already do something like this (ugly but still easy):
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
With the same logic, maybe a CLI command kedro pandera validate would be helpful too, I guess you sometimes just want to check a new dataset quickly.
This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.
In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.
It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.
In addition to schema.validate, pandera also supports decorators for pipelines. This requires inspecting the function signature and then parsing which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline; we should start with notebooks first.)
There are a few options:
- a catalog.validate method on the DataCatalog class - requires a change in settings.py to enable it
- a kedro_pandera.validate(catalog, schema) function
TBD
yaml or python schemas are very explicit, but hard to show to managers / stakeholders / business teams. Being able to convert schemas to prettier and more organized HTML documents would definitely help documentation efforts and consistency. It would be great if kedro-pandera could generate these docs automatically.
quoting @datajoely
Again dbt has had this for years and it's just a no brainer, we could easily generate static docs describing what data is in the catalog, associated metadata and tests.
There is also an obvious integration point with enterprise catalogs like Alation/Colibra/Amundsen
Dataset documentation is a much required feature to interact with non technical teams.
Add a kedro pandera doc CLI command which would perform the conversion of all datasets with schemas.
The real question lies in the responsibility of generating the HTML from a schema. This likely belongs to pandera itself.
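To make the idea concrete, a toy sketch of the kind of conversion meant, regardless of which package it ends up in. It assumes a plain {column: dtype} mapping instead of a real pandera schema object, and the helper name is made up:

```python
def schema_to_html(columns):
    """Render a {column_name: dtype_string} mapping as a minimal HTML table.
    Illustrative only: a real implementation would walk a pandera schema's
    columns, checks, and descriptions."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{dtype}</td></tr>"
        for name, dtype in columns.items()
    )
    return (
        "<table><tr><th>column</th><th>dtype</th></tr>"
        + rows
        + "</table>"
    )

html = schema_to_html({"sepal_length": "float64", "species": "str"})
```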
A factory dataset with schema defined isn't validated by kedro-pandera
Unable to validate datasets defined as a factory
For this catalog entry:
"{foo}_feature":
type: pandas.ParquetDataset
filepath: data/04_feature/{foo}_feature.parquet
metadata:
pandera:
schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.GenericFeatureSchema}
The factory dataset should be validated.
The dataset isn't validated (in my case it's the output dataset). Removing the factory specification fixes the problem.
Looking into the source code, the line dataset = catalog._datasets.get(name) returns None for a factory dataset, which makes metadata become None too. That stops the validation.
It is a bigger issue with the catalog and dataset factories.
I managed to fix the issue by wrapping the code inside the for loop:
for name, data in datasets.items():
    if catalog.exists(name):
        dataset = catalog._datasets.get(name)
        metadata = getattr(dataset, "metadata", None)
        ...
That makes the dataset pop up in catalog._datasets, and it gets validated properly.
Another workaround I can think of is moving from the before/after_node_run hooks to before/after_dataset_loaded, but I'm not 100% sure that it will work.
Yes
Instead of failing immediately when one check fails, pandera supports performing all checks before failing.
Make debugging easier by getting all errors in a single run
Pass kwargs to schema.validate() through a config file or a dataset.metadata extra key, e.g.:
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/iris.csv
  metadata:
    pandera:
      schema: ${pa.yaml: _iris_schema}
      validation_kwargs:
        lazy: True
This key can ultimately support all the arguments available in the validate method: https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.api.pandas.container.DataFrameSchema.validate.html
I'd like to validate data from the CLI.
When I have changed a filepath in my catalog, I'd like to be able to validate this new dataset before running a whole pipeline.
Create a kedro pandera validate --dataset <dataset_name> command which will load and validate the data.
Pandera accepts validating against one schema or another
Be compatible with pandera
Accept a list of schemas in metadata?
iris:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schemas:
        - schema1: <schema1>
        - schema2: <schema2>
I'd like to add a "schema preview" (maybe with a toggle button, as for code?) in kedro-viz as it already exists for code or dataset, see:
This would help document datasets directly from code.
Documenting data pipelines and comprehensive checks is hard, and kedro-viz is a great tool to show what exists in the code. I think it would be really useful to have "self-documented" pipelines and to enhance collaboration and maintenance.
Absolutely no idea how to extend kedro-viz, happy to hear suggestions here :)
Necessary before a PyPI release, waiting for kedro==0.18.13
Some kedro-datasets do not have a metadata parameter. This causes kedro-pandera to throw an error, even if there is no schema validation for the affected dataset.
The bug prevents me from using datasets without a metadata parameter, disrupting my data pipeline. Even some official kedro datasets are missing this parameter (e.g. ManagedTableDataset, EagerPolarsDataset).
Steps to reproduce: create a project from the spaceflights-pandas starter, use a custom dataset from kedro-datasets, and remove the metadata parameter from the custom dataset. The custom dataset should work without throwing an error, even if it lacks a metadata parameter.
An AttributeError is thrown:
AttributeError: 'CSVDataset' object has no attribute 'metadata'
Code causing the error:
catalog._datasets[name].metadata is not None
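A defensive getattr would avoid the crash for datasets that don't expose a metadata attribute at all. A sketch of the pattern (the helper name is hypothetical, not the plugin's actual code):

```python
def dataset_has_pandera_schema(dataset):
    """Safely check for a pandera schema on any dataset, including
    datasets that don't have a metadata attribute."""
    metadata = getattr(dataset, "metadata", None)
    if not metadata:
        return False
    return metadata.get("pandera") is not None

class NoMetadataDataset:  # stands in for e.g. EagerPolarsDataset
    pass

assert dataset_has_pandera_schema(NoMetadataDataset()) is False
```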
TBD
TBD
TBD
A kedro run --pandera.off CLI flag would be nice, but it is not currently possible to add flags to the CLI via plugins.
This plugin's requirements pin kedro to version <0.19. I'm using a newer kedro version and would like to use this plugin with it.
The latest change validating output data (after_node_run()) fails on MemoryDataset outputs: catalog._datasets doesn't contain MemoryDataset entries.
Steps to reproduce: use the spaceflights-pandas starter with kedro-pandera installed from main. Expected: no error.
INFO Running node: split_data_node: node.py:361
split_data([model_input_table;params:model_options]) ->
[X_train;X_test;y_train;y_test]
WARNING There are 3 nodes that have not run. runner.py:214
You can resume the pipeline run from the nearest nodes with persisted inputs by
adding the following argument to your previous command:
--from-nodes "split_data_node"
Traceback (most recent call last):
File "/Desktop/Projects/new-spaceflights/.venv/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 233, in main
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 477, in <lambda>
lambda: oldcall(hook_name, hook_impls, caller_kwargs, firstresult)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
raise exception.with_traceback(exception.__traceback__)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro_pandera/framework/hooks/pandera_hook.py", line 95, in after_node_run
self._validate_datasets(node, catalog, outputs)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro_pandera/framework/hooks/pandera_hook.py", line 58, in _validate_datasets
metadata = getattr(catalog._datasets[name], "metadata", None)
KeyError: 'X_train'
Python 3.10.13
kedro 0.19.6
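Using dict .get instead of indexing would make the lookup tolerant of transient in-memory outputs like X_train, which never appear in catalog._datasets. A sketch of the pattern (the helper is hypothetical, not the plugin's code):

```python
def get_pandera_metadata(catalog, name):
    """Return the pandera metadata for a dataset, or None.

    .get() avoids the KeyError for transient MemoryDataset outputs that
    are absent from catalog._datasets; getattr avoids the AttributeError
    for datasets without a metadata attribute."""
    dataset = catalog._datasets.get(name)
    metadata = getattr(dataset, "metadata", None)
    if not metadata:
        return None
    return metadata.get("pandera")
```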
I want to be able to run a pipeline with fake data generated from a dataset schema, mainly for pipeline unit testing or debugging with a small dataset.
Unit testing for data pipelines is hard, and this may be a helpful solution.
- a kedro pandera dryrun --pipeline <pipeline_name> (name to be defined) command which would generate data for all input datasets and run the pipeline, thanks to pandera data synthesis
- a PanderaRunner, to run the pipeline with kedro run --runner=PanderaRunner --pipeline <pipeline_name>. The advantage is sticking to the kedro CLI and eventually enabling "composition" with other logic; the drawback is that this solution is not compatible with a custom config file we may introduce.

By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors within the df.pandera.errors attribute.
e.g.
df = metadata["pandera"]["schema"].validate(df)
df.pandera.errors
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x30ae9c550>, {'SCHEMA': defaultdict(<class 'list'>, {'WRONG_DATATYPE': [{'schema': 'IrisPySparkSchema', 'column': 'sepal_length', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'sepal_width', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_width' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_length', 'check': "dtype('StringType()')", 'error': "expected column 'petal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_width', 'check': "dtype('StringType()')", 'error': "expected column 'petal_width' to have type StringType(), got DoubleType()"}]})})
As per the pandera documentation:
This design decision is based on the expectation that most use cases for pyspark SQL dataframes entail a production ETL setting. In these settings, pandera prioritizes completing the production load and saving the data quality issues for downstream rectification.
Currently, validating pyspark DataFrames directly is not possible, except by manually inspecting the pandera.errors attribute.
To enforce immediate error raising during validation, one can set lazy=False when calling the validation method: metadata["pandera"]["schema"].validate(data, lazy=False). This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off using the environment variable export PANDERA_VALIDATION_ENABLED=false, as mentioned in the docs and #27.
In addition to the YAML API, we should support the class-based API DataFrameModel (pydantic-style).
TBD
TBD
TBD