galileo-galilei / kedro-pandera
A kedro plugin to use pandera in your kedro projects
Home Page: https://kedro-pandera.readthedocs.io/en/latest/
License: Apache License 2.0
Opening the floor for feature request discussion: what do you want to see in this plugin? What should it do, and what shouldn't it do? Why is it important to you?
The more I think about the importance of data contracts, the more ensuring coverage checks as part of a team's workflow feels like a natural evolution of this pattern.
The way I see this, there are two standards a user should aim for:
Code to copy-paste is worth a thousand words!
It is possible to infer the schema of a dataframe with the kedro pandera infer -d example_iris_data command. Runtime validation is performed in before_node_run. This means we validate only datasets which are loaded (e.g. inputs or intermediate outputs). We should also validate terminal outputs before saving them.
Users expect all datasets to be validated once.
Create an after_dataset_saved or an after_node_run hook which checks whether the dataset is a terminal output of a pipeline before validating it.
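The "terminal output" check itself is simple set arithmetic over the pipeline graph; kedro's Pipeline.outputs() computes this for real pipelines. A plain-Python sketch of the logic (the helper name is hypothetical):

```python
def terminal_outputs(nodes):
    """Given (inputs, outputs) name pairs for each node, return the
    datasets that are produced by some node but consumed by none,
    i.e. the pipeline's terminal outputs."""
    produced, consumed = set(), set()
    for inputs, outputs in nodes:
        consumed.update(inputs)
        produced.update(outputs)
    return produced - consumed

# e.g. raw -> clean -> model: only "model" is terminal
nodes = [(["raw"], ["clean"]), (["clean"], ["model"])]
result = terminal_outputs(nodes)
```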
I believe there is a leftover print statement in this method at this line:
When loading a pandera DataFrameModel this isn't very verbose. But you can use the same resolver to also load a pandera DataFrameSchema defined in Python, in which case this print statement outputs the entire schema and causes clutter. Can the print statement be removed or changed to a debug log message?
TBD
TBD
TBD
TBD
Pandera can not only be used to validate a dataframe but also to convert the dtypes of the dataframe according to the schema.
The schema.validate function returns the validated dataframe with the converted dtypes. We can update the input dataframe with the validated dataframe, so the nodes receive a dataframe that is both validated and converted according to the schema.
Add an additional configuration parameter which allows defining, per dataset, whether to only validate or also convert the dataset.
If a dataset is also configured to be converted, we can forward the converted dataset in the hook.
A global parameter can be defined to specify the default behaviour for all datasets which use a pandera schema.
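A minimal sketch of what forwarding the converted dataframe could look like, relying on the fact that kedro updates a node's inputs with whatever dict before_node_run returns. The hook class and the per-dataset "coerce" flag are hypothetical, not the plugin's actual API (and a real plugin would decorate the method with kedro's @hook_impl):

```python
class PanderaCoercionHook:
    """Hypothetical sketch: validate inputs and forward the coerced
    dataframe returned by schema.validate back into the node."""

    def before_node_run(self, node, catalog, inputs):
        overwritten = {}
        for name, data in inputs.items():
            dataset = catalog._datasets.get(name)
            metadata = getattr(dataset, "metadata", None) or {}
            pandera_meta = metadata.get("pandera") or {}
            schema = pandera_meta.get("schema")
            # hypothetical per-dataset flag: validate AND forward coerced data
            if schema is not None and pandera_meta.get("coerce", False):
                overwritten[name] = schema.validate(data)
        # kedro merges the returned dict into the node's inputs
        return overwritten or None
```

A global default could be read from the project's config and used when the per-dataset flag is absent.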
Kedro 0.19.0 had a breaking change.
The current hook still references catalog._data_sets, which does not work with kedro >= 0.19.0.
Pipeline runs and validates dataset
Pipeline does not run successfully and shows error
Enable data checking in Jupyter Notebook.
Quotes from #12
I wonder if we could introduce a catalog.validate() method which runs tests declared in the catalog? This is why I've designed the snippet above in a somewhat generic validators structure
We could even allow for a prototyping workflow with catalog.validate("dataset_name", SchemaClass)...
interactive workflow: I'd love to have something like this too. The design should be thought through in detail. Notice you can already do something like this (ugly but still easy):
data = catalog.load("dataset_name")
catalog._data_sets["dataset_name"].metadata.pandera.schema.validate(data)
With the same logic, maybe a CLI command kedro pandera validate would be helpful too, I guess you sometimes just want to check a new dataset quickly.
This enables offline data checking and makes data validation easier within a notebook. It shouldn't even require a full Kedro project; a config file + DataCatalog + a pyproject.toml may be enough to make it work.
In general, pandera supports two modes: the class-based API and the object-based API. Do we have a preference for which API to support first? Maybe it's trivial to support both.
It is already possible to validate data against a given schema defined in the catalog with the pandera metadata key.
In addition to schema.validate, pandera also supports decorators for pipelines. This requires inspecting the function signature and then parsing which datasets are registered with a data check. (Out of scope: it only matters when you have a pipeline; we should start with notebooks first.)
There are a few options:
- a catalog.validate method on the DataCatalog class - requires a change in settings.py to enable it
- a kedro_pandera.validate(catalog, schema) function
TBD
yaml or python schemas are very explicit, but hard to show to managers / stakeholders / business teams. Being able to convert schemas to prettier and more organized HTML documents would definitely help documentation efforts and consistency. It would be great if kedro-pandera could generate these docs automatically.
quoting @datajoely
Again dbt has had this for years and it's just a no brainer, we could easily generate static docs describing what data is in the catalog, associated metadata and tests.
There is also an obvious integration point with enterprise catalogs like Alation/Colibra/Amundsen
Dataset documentation is a much required feature to interact with non technical teams.
Add a kedro pandera doc CLI command which would perform the conversion of all datasets with schemas.
The real question lies in the responsibility of generating the HTML from a schema. This likely belongs to pandera itself.
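To make the idea concrete, a toy sketch of the kind of conversion meant, regardless of which package it ends up in. It assumes a plain {column: dtype} mapping instead of a real pandera schema object, and the helper name is made up:

```python
def schema_to_html(columns):
    """Render a {column_name: dtype_string} mapping as a minimal HTML table.
    Illustrative only: a real implementation would walk a pandera schema's
    columns, checks, and descriptions."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{dtype}</td></tr>"
        for name, dtype in columns.items()
    )
    return (
        "<table><tr><th>column</th><th>dtype</th></tr>"
        + rows
        + "</table>"
    )

html = schema_to_html({"sepal_length": "float64", "species": "str"})
```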
A factory dataset with schema defined isn't validated by kedro-pandera
Unable to validate datasets defined as a factory
For this catalog entry:
"{foo}_feature":
type: pandas.ParquetDataset
filepath: data/04_feature/{foo}_feature.parquet
metadata:
pandera:
schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.GenericFeatureSchema}
The factory dataset should be validated.
The dataset isn't validated (in my case it's the output dataset). Removing the factory specification fixes the problem.
Looking into the source code, the line dataset = catalog._datasets.get(name) returns None for a factory dataset, which makes metadata become None too. That stops the validation.
It is a bigger issue with the catalog and dataset factories.
I managed to fix the issue by wrapping the code inside the for loop:
for name, data in datasets.items():
    if catalog.exists(name):
        dataset = catalog._datasets.get(name)
        metadata = getattr(dataset, "metadata", None)
        ...
That makes the dataset pop up in catalog._datasets, and it gets validated properly.
Another workaround I can think of is moving from the before/after_node_run hooks to before/after_dataset_loaded, but I'm not 100% sure that it will work.
Yes
Instead of failing immediately when one check fails, pandera supports performing all checks before failing.
Make debugging easier by getting all errors in a single run
Pass kwargs to schema.validate() through a config file or a dataset.metadata extra key, e.g.:
iris:
  type: pandas.CSVDataSet
  filepath: /path/to/iris.csv
  metadata:
    pandera:
      schema: ${pa.yaml: _iris_schema}
      validation_kwargs:
        lazy: True
This key can ultimately support all the arguments available in the validate method: https://pandera.readthedocs.io/en/stable/reference/generated/methods/pandera.api.pandas.container.DataFrameSchema.validate.html
I'd like to validate data from the CLI.
When I have changed a filepath in my catalog, I'd like to be able to validate this new dataset before running a whole pipeline.
Create a kedro pandera validate --dataset <dataset_name> command which will load and validate the data.
Pandera accepts validating against one schema or another
Be compatible with pandera
Accept a list of schemas in metadata?
iris:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schemas:
        - schema1: <schema1>
        - schema2: <schema2>
I'd like to add a "schema preview" (maybe with a toggle button, as for code?) in kedro-viz as it already exists for code or dataset, see:
This would help document datasets directly from code.
Documenting data pipelines and comprehensive checks is hard, and kedro-viz is a great tool to show what exists in the code. I think it would be really useful to have "self-documented" pipelines and to enhance collaboration and maintenance.
Absolutely no idea how to extend kedro-viz, happy to hear suggestions here :)
Necessary before a PyPI release, waiting for kedro==0.18.13
Some kedro-datasets do not have a metadata parameter. This causes kedro-pandera to throw an error, even if there is no schema validation for the affected dataset.
The bug prevents me from using datasets without a metadata parameter, disrupting my data pipeline. Even some official kedro datasets are missing this parameter (e.g. ManagedTableDataset, EagerPolarsDataset).
Steps to reproduce: create a project from the spaceflights-pandas starter, use a custom dataset from kedro-datasets, and remove the metadata parameter from the custom dataset. The custom dataset should work without throwing an error, even if it lacks a metadata parameter.
An AttributeError is thrown:
AttributeError: 'CSVDataset' object has no attribute 'metadata'
Code causing the error:
catalog._datasets[name].metadata is not None
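A defensive getattr would avoid the crash for datasets that don't expose a metadata attribute at all. A sketch of the pattern (the helper name is hypothetical, not the plugin's actual code):

```python
def dataset_has_pandera_schema(dataset):
    """Safely check for a pandera schema on any dataset, including
    datasets that don't have a metadata attribute."""
    metadata = getattr(dataset, "metadata", None)
    if not metadata:
        return False
    return metadata.get("pandera") is not None

class NoMetadataDataset:  # stands in for e.g. EagerPolarsDataset
    pass

assert dataset_has_pandera_schema(NoMetadataDataset()) is False
```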
TBD
TBD
TBD
A kedro run --pandera.off CLI flag would be nice, but it is not currently possible to add flags to the CLI via plugins.
This plugin's requirements pin kedro to version <0.19. I'm using a newer kedro version and would like to use this plugin with it.
The latest change validating output data (after_node_run()) fails on MemoryDataset outputs: catalog._datasets doesn't contain MemoryDataset entries.
Steps to reproduce: use the spaceflights-pandas starter with kedro-pandera installed from main. Expected: no error.
INFO Running node: split_data_node: node.py:361
split_data([model_input_table;params:model_options]) ->
[X_train;X_test;y_train;y_test]
WARNING There are 3 nodes that have not run. runner.py:214
You can resume the pipeline run from the nearest nodes with persisted inputs by
adding the following argument to your previous command:
--from-nodes "split_data_node"
Traceback (most recent call last):
File "/Desktop/Projects/new-spaceflights/.venv/bin/kedro", line 8, in <module>
sys.exit(main())
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro/framework/cli/cli.py", line 233, in main
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 477, in <lambda>
lambda: oldcall(hook_name, hook_impls, caller_kwargs, firstresult)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
raise exception.with_traceback(exception.__traceback__)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
res = hook_impl.function(*args)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro_pandera/framework/hooks/pandera_hook.py", line 95, in after_node_run
self._validate_datasets(node, catalog, outputs)
File "/Desktop/Projects/new-spaceflights/.venv/lib/python3.10/site-packages/kedro_pandera/framework/hooks/pandera_hook.py", line 58, in _validate_datasets
metadata = getattr(catalog._datasets[name], "metadata", None)
KeyError: 'X_train'
Python 3.10.13
kedro 0.19.6
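Using dict .get instead of indexing would make the lookup tolerant of transient in-memory outputs like X_train, which never appear in catalog._datasets. A sketch of the pattern (the helper is hypothetical, not the plugin's code):

```python
def get_pandera_metadata(catalog, name):
    """Return the pandera metadata for a dataset, or None.

    .get() avoids the KeyError for transient MemoryDataset outputs that
    are absent from catalog._datasets; getattr avoids the AttributeError
    for datasets without a metadata attribute."""
    dataset = catalog._datasets.get(name)
    metadata = getattr(dataset, "metadata", None)
    if not metadata:
        return None
    return metadata.get("pandera")
```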
I want to be able to run a pipeline with fake data generated from a dataset schema, mainly for pipeline unit testing or debugging with a small dataset.
Unit testing for data pipelines is hard, and this may be a helpful solution.
- a kedro pandera dryrun --pipeline <pipeline_name> (name to be defined) command which would generate data for all input datasets and run the pipeline, thanks to pandera data synthesis
- a PanderaRunner, to run the pipeline with kedro run --runner=PanderaRunner --pipeline <pipeline_name>. The advantage is sticking to the kedro CLI and eventually enabling "composition" with other logic; the drawback is that this solution is not compatible with a custom config file we may introduce.

By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors within the df.pandera.errors attribute.
e.g.
df = metadata["pandera"]["schema"].validate(df)
df.pandera.errors
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x30ae9c550>, {'SCHEMA': defaultdict(<class 'list'>, {'WRONG_DATATYPE': [{'schema': 'IrisPySparkSchema', 'column': 'sepal_length', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'sepal_width', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_width' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_length', 'check': "dtype('StringType()')", 'error': "expected column 'petal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_width', 'check': "dtype('StringType()')", 'error': "expected column 'petal_width' to have type StringType(), got DoubleType()"}]})})
As per the pandera documentation:
This design decision is based on the expectation that most use cases for pyspark SQL dataframes entail a production ETL setting. In these settings, pandera prioritizes completing the production load and saving the data quality issues for downstream rectification.
Currently, validating pyspark DataFrames directly is not possible, except by manually inspecting the pandera.errors attribute.
To enforce immediate error raising during validation, one can set lazy=False when calling the validation method: metadata["pandera"]["schema"].validate(data, lazy=False). This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off using the environment variable export PANDERA_VALIDATION_ENABLED=false, as mentioned in the docs and #27.
In addition to the YAML API, we should support the class-based API DataFrameModel (pydantic-style).
TBD
TBD
TBD