jakobgm / patito

A data modelling layer built on top of polars and pydantic

License: MIT License

Python 100.00%

patito's Issues

Release 0.6.0 wasn't published to PyPI

Thanks for this great library; I'm excited to see progress continue. I noticed that the release GitHub Actions pipeline failed for 0.6.0 (see https://github.com/JakobGM/patito/actions/runs/8066691480).

It would be great to get this release published, since it now supports pydantic v2.

Bug: Field constraints not evaluated on structs

This project looks really interesting, and I look forward to seeing how it develops. The struct support seems quite good, but it doesn't seem to support field constraints at the moment. Here's an example:

import patito as pt
import polars as pl


class Struct(pt.Model):
    x: int
    y: int
    z: int = pt.Field(lt=2)


class MyModel(pt.Model):
    struct_col: Struct
    list_struct_col: list[Struct]


df = pl.DataFrame(
    {
        "struct_col": [{"x": 1, "y": 2, "z": 3}],
        "list_struct_col": [[{"x": 1, "y": 2, "z": 3}, {"x": 1, "y": 2, "z": 3}]],
    }
)

MyModel.validate(df)

This passes validation even though every z value is 3, which violates the lt=2 constraint. I expected to receive errors for both columns. Can pt.Field be improved to validate constraints inside structs as well?

Fail to import Patito

Hi there.

Recently I ran into a problem when running import patito as pt, which raises a TypeError:

In [1]: import patito as pt
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 1
----> 1 import patito as pt

File ~/Documents/GitHub/ServerlessPolars/.venv/lib/python3.10/site-packages/patito/__init__.py:4
      1 """Patito, a data-modelling library built on top of polars and pydantic."""
      2 from polars import Expr, Series, col
----> 4 from patito import exceptions, sql
      5 from patito.exceptions import ValidationError
      6 from patito.polars import DataFrame, LazyFrame

File ~/Documents/GitHub/ServerlessPolars/.venv/lib/python3.10/site-packages/patito/exceptions.py:6
      1 """Module containing all custom exceptions raised by patito."""
      3 import pydantic
----> 6 class ValidationError(pydantic.ValidationError):
      7     """Exception raised when dataframe does not match schema."""
     10 class ErrorWrapper(pydantic.error_wrappers.ErrorWrapper):

TypeError: type 'pydantic_core._pydantic_core.ValidationError' is not an acceptable base type

Before this issue, I had been using patito successfully in my project. After a closer look, it seems that the current patito release is not compatible with the recently introduced pydantic v2. In pyproject.toml, pydantic is constrained to >= 1.7.0, which lets Poetry automatically install pydantic 2.0.3, and I think this causes the issue.

Could you take a look? Many thanks.
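
Until a release with pydantic v2 support is available, one workaround is to keep pydantic on a 1.x version. The snippet below is a minimal sketch of a guard that detects the incompatible major version before patito is imported. The helper names are hypothetical and not part of patito or pydantic.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.0.3'."""
    return int(version.split(".")[0])


def is_pydantic_v1(version: str) -> bool:
    """True when the given pydantic version string is a 1.x release."""
    return major_version(version) == 1


def installed_pydantic_version():
    """Return the installed pydantic version, or None if it is not installed."""
    try:
        return metadata.version("pydantic")
    except metadata.PackageNotFoundError:
        return None


print(is_pydantic_v1("1.10.13"))  # True
print(is_pydantic_v1("2.0.3"))    # False
```

With Poetry, a dependency constraint such as pydantic = ">=1.7.0,<2.0.0" should express the same pin declaratively.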

[BUG]: Remove `.collect(eager=True)`

This syntax is no longer supported by polars and should be removed.

Reprex (df is a simple financial time series from OpenBB):

test = StocksBaseModel.DataFrame(df).cast()

TypeError: LazyFrame.collect() got an unexpected keyword argument 'eager'

bug: `Model.examples` returns columns in reverse order when you pass a list

When you pass a list of values, the columns are returned in reverse order. However, if you pass a single value, the order is correct. See the examples below:

import patito as pt
class Test(pt.Model):
    a:str
    b:str
    
Test.examples({'a':'1', 'b':'2'})

shape: (1, 2)

a	b
str	str
"1"	"2"

This doesn't keep the order:

import patito as pt
class Test(pt.Model):
    a:str
    b:str
    
Test.examples({'a':['1'], 'b':['2']})

shape: (1, 2)

b	a
str	str
"2"	"1"

bug: TypeError: LazyFrame.collect() got an unexpected keyword argument '_eager' when using function patito.Model.examples()

It seems like patito.Model.examples() is not working. For testing I just installed Patito in a new venv. Running the following code

import patito as pt

class Bids(pt.Model):
    offer_id: str = pt.Field(alias='Offer-ID', unique=True)
    country_eic: str = pt.Field(alias='Country-EIC')
    power: int = pt.Field(alias="Power")


if __name__=='__main__':

    print(Bids.examples({"Offer-ID": ["a1","a2","a3"]}))

gives me the following error

Traceback (most recent call last):
  File "C:\Users\ANON\python\test_proj\models.py", line 27, in <module>
    print(Bids.examples({"Offer-ID": ["a1","a2","a3"]}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\patito\pydantic.py", line 1033, in examples
    return DataFrame().with_columns(series).with_columns(unique_series)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\patito\polars.py", line 703, in with_columns
    return cast(DF, super().with_columns(*exprs, **named_exprs))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\polars\dataframe\frame.py", line 8270, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LazyFrame.collect() got an unexpected keyword argument '_eager'

With pip install patito following packages were installed:

patito==0.5.1
polars==0.20.4
pydantic==1.10.13
typing_extensions==4.9.0

Syntax for specifying missing columns

Currently, a type specification of Optional[int] means that a column must be of integer type but may contain nulls.

We currently don't support a syntax to specify that it is allowed that a column is missing.

One current workaround is to specify Foo.validate(df, allow_missing_columns=True), where allow_missing_columns is passed on to _find_errors as a kwarg (we should add this as an explicit parameter).

The following example contains a suggestion for how we could allow missing columns (see c). It is one that @JakobGM came up with last year.

import patito as pt
from typing import Optional

class Foo(pt.Model):
    a: int # only ints
    b: Optional[int] # mix ints and nulls
    c: int = None # column may be missing, but if it's there it must be an int - but this fails a type check
    d: Optional[int] = None # column may be missing, but if it's there it must be an int

An alternative would be to use pt.Field / ColumnInfo, and do something like the following, which I might like better, just because it will pass type checks.

class Foo(pt.Model):
    c: int = pt.Field(allow_missing=True)

I am very open to ideas here. Does anyone have a suggestion? Tagging a few possibly-interested parties, @brendancooley, @dsgibbons, @ion-elgreco
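
To make the distinction between "may contain nulls" and "may be missing" concrete, here is a minimal sketch that models the two properties as independent flags. ColumnSpec and find_column_errors are hypothetical names for illustration only; they are not patito's ColumnInfo or _find_errors, and plain dictionaries of lists stand in for a dataframe. Treating d as both nullable and optional is my reading of the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical per-column rules; not patito's actual ColumnInfo."""

    nullable: bool = False       # individual values may be None
    allow_missing: bool = False  # the whole column may be absent


def find_column_errors(columns: dict, specs: dict) -> list:
    """Return readable errors for missing columns and unexpected nulls."""
    errors = []
    for name, spec in specs.items():
        if name not in columns:
            if not spec.allow_missing:
                errors.append(f"{name}: missing column")
            continue
        if not spec.nullable and any(value is None for value in columns[name]):
            errors.append(f"{name}: unexpected null")
    return errors


specs = {
    "a": ColumnSpec(),                                   # a: int
    "b": ColumnSpec(nullable=True),                      # b: Optional[int]
    "c": ColumnSpec(allow_missing=True),                 # c: may be missing
    "d": ColumnSpec(nullable=True, allow_missing=True),  # d: may be missing or null
}

print(find_column_errors({"a": [1, 2], "b": [1, None]}, specs))  # []
```

Keeping the two flags separate avoids overloading Optional[...] or a None default with a second meaning, which is the ambiguity the proposals above try to resolve.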

Add support for struct columns?

annotations not inherited when inheriting a model

This is going to cause an issue when you want to validate the list column:

Reproducible example:

class Test(pt.Model):
    col: list[str]

class InhTest(Test):
    pass

df = InhTest.examples({
    "col":[['Hello']]
})
InhTest.validate(df)

print(Test.__annotations__)
print(InhTest.__annotations__)
{'col': list[str]}
{}

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[21], line 4
      1 df = InhTest.examples({
      2     "col":[['Hello']]
      3 })
----> 4 InhTest.validate(df)

File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/pydantic.py:707, in Model.validate(cls, dataframe)
    662 @classmethod
    663 def validate(
    664     cls,
    665     dataframe: Union["pd.DataFrame", pl.DataFrame],
    666 ) -> None:
    667     """
    668     Validate the schema and content of the given dataframe.
    669 
   (...)
    705           Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
    706     """
--> 707     validate(dataframe=dataframe, schema=cls)

File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/validators.py:316, in validate(dataframe, schema)
    313 else:
    314     polars_dataframe = cast(pl.DataFrame, dataframe)
--> 316 errors = _find_errors(dataframe=polars_dataframe, schema=schema)
    317 if errors:
    318     raise ValidationError(errors=errors, model=schema)

File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/validators.py:153, in _find_errors(dataframe, schema)
    150 if not isinstance(dtype, pl.List):
    151     continue
--> 153 annotation = schema.__annotations__[column]  # type: ignore[unreachable]
    155 # Retrieve the annotation of the list itself,
    156 # dewrapping any potential Optional[...]
    157 list_type = _dewrap_optional(annotation)

KeyError: 'col'
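
The empty dictionary is standard Python behaviour: __annotations__ only holds the annotations declared directly on a class, not those of its parents. The sketch below reproduces this with plain classes (no patito involved) and shows that typing.get_type_hints merges annotations along the inheritance chain. Whether patito should use get_type_hints or pydantic's own field metadata for this lookup is an open assumption here.

```python
from typing import get_type_hints


class Test:
    col: list[str]


class InhTest(Test):
    pass


# Annotations declared directly on the subclass: none.
own = InhTest.__dict__.get("__annotations__", {})

# get_type_hints walks the method resolution order and merges the results.
merged = get_type_hints(InhTest)

print(own)     # {}
print(merged)  # {'col': list[str]}
```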

Feature request: refer to other columns when using comparators

The comparators lt, gt, le and ge are all defined with respect to a single floating-point value. It would be even better if we could define these with respect to other columns. Something like:

import patito as pt

class model(pt.Model):
    x_min: int
    x_max: int = pt.Field(gt=pt.col("x_min"))

I imagine this may be difficult if the underlying functionality is not supported in Pydantic, but I'd love to see this feature!

dtypes don't serialize properly when using nested models with alias generators

Problem

When using Pydantic alias generators, nested types are not serializing properly for dtypes.

My use case is that I have data coming from an API that is in camelCase. I validate against that format using the to_camel() alias generator. Serialized data should always be in snake case.

The bug can be reproduced with the following code:

import patito as pt
from pydantic import AliasGenerator, ConfigDict
from pydantic.alias_generators import to_camel, to_snake

class BaseModel(pt.Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(
            validation_alias=to_camel,
            serialization_alias=to_snake,
        ),
        populate_by_name=True,
    )

class NestedModel(BaseModel):
    nested_field: int

class ParentModel(BaseModel):
    parent_field: int
    nested_model: NestedModel

When calling dtypes on NestedModel, things are serialized properly:

In [3]: NestedModel.dtypes
Out[3]: {'nested_field': Int64}

However, when calling dtypes on ParentModel, the columns for NestedModel are back to camelCase:

In [14]: ParentModel.dtypes
Out[14]: {'parent_field': Int64, 'nested_model': Struct({'nestedField': Int64})}

Serialization works as expected (can be initialized with camelCase):

In [16]: foo = ParentModel(parent_field=1, nested_model={'nestedField': 2})

In [17]: foo.model_dump()
Out[17]: {'parent_field': 1, 'nested_model': {'nested_field': 2}}

Solution

I've recently updated all my dependencies and am not sure if this is a new issue or one that already existed. I have a branch where I've added the above code as an initial test and have played with the mode="serialization" flag for model_dump_json(), but so far I haven't figured out the issue.

That branch is linked below.

It's worth noting that, without populate_by_name=True set on the model config, camelCase fields will fail validation. I think this is a newer flag, as well as the mode option for model dumping.
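
For illustration, the expected result can be described as applying the serialization alias recursively to every level of the schema, not only the top level. The sketch below does this on plain nested dictionaries, with strings standing in for polars dtypes and a nested dict standing in for a Struct. The to_snake function here is a simplified stand-in written for this example, not pydantic's implementation, and normalize_schema is a hypothetical helper.

```python
import re


def to_snake(name: str) -> str:
    """Convert camelCase to snake_case (simplified stand-in, not pydantic's)."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def normalize_schema(schema: dict) -> dict:
    """Apply to_snake to every key, descending into nested struct schemas."""
    return {
        to_snake(key): (normalize_schema(value) if isinstance(value, dict) else value)
        for key, value in schema.items()
    }


dtypes = {"parent_field": "Int64", "nested_model": {"nestedField": "Int64"}}
print(normalize_schema(dtypes))
# {'parent_field': 'Int64', 'nested_model': {'nested_field': 'Int64'}}
```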

References

My WIP branch: https://github.com/timoguin/patito/blob/fix/dtype-casing-bug-when-using-aliases/
New test (that is currently passing but shouldn't be): https://github.com/timoguin/patito/blob/fix/dtype-casing-bug-when-using-aliases/tests/test_dummy_data.py#L187
Attempt to use mode="serialization": https://github.com/timoguin/patito/blob/fix/dtype-casing-bug-when-using-aliases/src/patito/_pydantic/schema.py#L30

[FEAT] Allow creation of Model.DataFrame from polars.DataFrame/LazyFrame

Right now, you can only create a model-aware DataFrame from either a dict representation or from pandas. It would be great for the workflow to be able to do this with polars objects as well.

from openbb_terminal.stocks import stocks_helper as stocks
import polars as pl

df = stocks.load(
    symbol="AAPL",
    start_date="1950-01-01",
    end_date="2023-10-16",
) 

df = pl.from_pandas(df, include_index=True).lazy()

# THIS DOESNT WORK
df = StocksBaseModel.LazyFrame(df)

TypeError: DataFrame constructor called with unsupported type 'LazyFrame' for the `data` parameter

Support for polars Date and Datetime types

Great package.
Unfortunately, I work with time series data a lot (example below), and I am getting a runtime error like:

RuntimeError: no validator found for <class '__main__.PolarsDate'>, see `arbitrary_types_allowed` in Config

Example

import patito as pt
from polars import DataFrame, Datetime, Date
from pydantic import validate_arguments


@validate_arguments(config=dict(arbitrary_types_allowed=True))
class ScoresTable(pt.Model):
    id: int = pt.Field(ge=0, unique=True)
    local_date: Date
    utc_dt: Datetime(time_unit='us', time_zone='UTC')
    values: float

Refactor of Field / FieldCI / ColumnInfo

@brendancooley (and others), I want to refactor the Field function in order to:

Have a statically (pyright) parseable docstring
Explicitly pass arguments like gt, dtype, constraints etc rather than use args/kwargs
Properly be able to serialize and deserialize types, satisfying any pydantic type requirements.

The suggestion

I'm considering encoding all the relevant Field parameters inside ColumnInfo, and ensuring that ColumnInfo can serialize (using field_serializer) and deserialize (through validators) all parameters.

This means that we can type-safely pass all these arguments to pydantic's Field using pydantic.fields.Field(json_schema_extra=column_info.model_dump()).

Then, at validation time, we would reconstruct the ColumnInfo object for each column using ColumnInfo.model_validate(some_patito_model.model_fields["some_field"]), and be able to relatively easily use these objects for validation.

However

The only things that I'm a bit unsure about:

Should we use ColumnInfo for fields like gt, which already exist in pydantic's Field? Or do as we currently do and pop them off? I'm leaning towards keeping them within the ColumnInfo just to have all the logic in one place.
If we do use ColumnInfo like I suggest in the previous point, are there cases where we shouldn't do that?

Let me know if this seems unclear! Writing this while alternating entertaining a 2.5 year old and a 3 month old 😅

Somehow add typing to the exposed polars dataframe methods on a patito dataframe

Hello hello! Long time no see @JakobGM, @thomasaarholt 😄 Cool project!

I'm interested in strong typing on the exposed polars.DataFrame methods such as get_column(). Take for example the following snippet:

def time_filter_positions(cow_pos: pl.DataFrame):
    min_gsn = int(cow_pos.get_column("gsn").min())
    max_gsn = int(cow_pos.get_column("gsn").max())
    min_date = cow_pos.get_column("timestamp").min()
    max_date = cow_pos.get_column("timestamp").max()
    # ----SNIP----

This upsets the type checker, because min is typed as (method) def min() -> (PythonLiteral | None), and a ... | None value cannot be passed into int() without an additional assertion. What I want to do is something like this:

from datetime import datetime

import patito as pt
import polars as pl

class MyDf(pt.Model):
    gsn: int = pt.Field(unique=True)
    timestamp: datetime = pt.Field(dtype=pl.Datetime)

def time_filter_positions(cow_pos: MyDf.DataFrame):
    min_gsn = int(cow_pos.get_column("gsn").min())
    max_gsn = int(cow_pos.get_column("gsn").max())
    min_date = cow_pos.get_column("timestamp").min()
    max_date = cow_pos.get_column("timestamp").max()

and have it just work ™️. So the patito dataframe should somehow add typing to the get_column() method: check that the input string is a Literal that matches one of the columns, and then define the output of the get_column call to be of the correct type. I have no idea if this is possible at all, but it would lead to a fantastic developer experience, where your type checker and autocomplete become contextually aware when using patito dataframes! What do you guys think? Is this in scope for this project, or completely far out? Or is this a level of typing that ought to be implemented in polars itself?
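
Until column-aware typing exists, the "| None" complaint can be handled with a small narrowing helper. This is only a sketch of a workaround; not_none is a hypothetical helper, not something patito or polars provides, and a literal 5 stands in for the result of a call such as get_column("gsn").min().

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def not_none(value: Optional[T]) -> T:
    """Return value unchanged, raising if it is None.

    The return annotation lets a type checker treat the result as T
    rather than Optional[T].
    """
    if value is None:
        raise ValueError("expected a value, got None")
    return value


# Stand-in for: int(not_none(cow_pos.get_column("gsn").min()))
min_gsn = int(not_none(5))
print(min_gsn)  # 5
```

Compared with an inline assert, the helper keeps the narrowing in one expression and fails with an explicit error on an empty or all-null column.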

pt.Field does not respect alias= argument

Perhaps I misunderstand the way that Field(alias='name') should work in patito, but I was surprised by these errors:

>>> from typing import Literal
>>> 
>>> import patito as pt
>>> import polars as pl
>>> 
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True, alias='prod')
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...     demand_percentage: float
...
>>> valid_product_df = pl.DataFrame(
...     {
...         "product_id": [64, 11],
...         "name": ["Pizza", "Cereal"],
...         "temperature_zone": ["frozen", "dry"],
...         "demand_percentage": [0.07, 0.16],
...     }
... )
>>>
>>> Product.validate(valid_product_df) # No errors
>>>
>>> also_valid_product_df = pl.DataFrame(
...     {
...         "prod": [64, 11],
...         "name": ["Pizza", "Cereal"],
...         "temperature_zone": ["frozen", "dry"],
...         "demand_percentage": [0.07, 0.16],
...     }
... )
>>>
>>> Product.validate(also_valid_product_df) #Surprise!
Traceback (most recent call last):
[...]
patito.exceptions.DataFrameValidationError: 2 validation errors for Product
product_id
  Missing column (type=type_error.missingcolumns)
prod
  Superfluous column (type=type_error.superfluouscolumns)

From the way that Field aliases work in pydantic, I thought that 'prod' would be interpreted as 'product_id'. Is this expected behaviour? If so, perhaps a clarifying line in the docs about renaming columns would help.

Thanks for your work with patito - it's shaping up great! I really like it and it's massively helping my projects!

Env details

Python 3.9.19
patito 0.6.1
polars 0.20.31
pydantic 2.7.4

bug: Returned model loses all field definitions after using `rename`, `with_fields`, `select` & `drop`

In the example below, I create id with dtype UInt16. However, all the models returned by rename, with_fields, select and drop lose the field definitions that were provided in pt.Field.

import patito as pt
import polars as pl

class Product(pt.Model):
    id: int = pt.Field(unique=True, dtype=pl.UInt16)
    row_id: int
print(Product.dtypes)

{'id': UInt16, 'row_id': Int64}

With_fields

print(Product.with_fields().dtypes)
{'id': Int64, 'row_id': Int64}

Rename

print(Product.rename({'row_id':'row'}).dtypes)
{'id': Int64, 'row': Int64}

Drop

print(Product.drop('row_id').dtypes)
{'id': Int64}

@JakobGM @thomasaarholt

Allow constraints columnar-wise with multi-column support

Currently, constraints are evaluated row-wise, but it would be nice to have the option to add column-wise constraints, including constraints that span multiple columns.

This way you could, for example, check uniqueness across a combination of columns.
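
As an illustration of what a multi-column uniqueness check means, the sketch below finds key combinations that occur more than once. It uses plain lists of dictionaries instead of a dataframe, and duplicate_keys is a hypothetical helper rather than a patito feature. The expression pl.struct("id", "line_id").is_unique() quoted in another issue on this page expresses the same idea in polars.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur more than once across rows."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


rows = [
    {"id": 1, "line_id": 1},
    {"id": 1, "line_id": 2},
    {"id": 1, "line_id": 1},
]
print(duplicate_keys(rows, ["id", "line_id"]))  # [(1, 1)]
```

Note that id alone is not unique in this data while the pair (id, line_id) is unique except for one repeated row, which is exactly the case a single-column unique=True cannot describe.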

Problem while validating `list` type object

It seems that the _find_errors method, which validates the content of a polars.DataFrame against a patito.Model, does not handle lists of objects properly when they are wrapped in Optional.

Here is a small reproducible code:

    from typing import Optional

    import patito
    import polars


    class Inner(patito.Model):

        name: str
        reliability: bool
        level: int


    class Outer(patito.Model):

        id: str
        code: str
        label: str
        inner_types: Optional[list[Inner]]


    df = polars.DataFrame(
        {
            "id": [1, 2, 3],
            "code": ["A", "B", "C"],
            "label": ["a", "b", "c"],
            "inner_types": [
                [{"name": "a", "reliability": True, "level": 1}],
                [{"name": "b", "reliability": False, "level": 2}],
                None,
            ],
        }
    )
    df = Outer.DataFrame(df).cast().derive()
    df.validate()

Here is the traceback I get:

Traceback (most recent call last):
  File "/home/user/__init__.py", line 1093, in <module>
    df.validate()
  File "/home/user/.venv/lib/python3.9/site-packages/patito/polars.py", line 584, in validate
    self.model.validate(dataframe=self, columns=columns, **kwargs)
  File "/home/user/.venv/lib/python3.9/site-packages/patito/pydantic.py", line 476, in validate
    validate(
  File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 445, in validate
    errors = _find_errors(
  File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 303, in _find_errors
    list_struct_errors = _find_errors(
  File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 143, in _find_errors
    schema_subset = columns or schema.columns
AttributeError: type object 'list' has no attribute 'columns'

Could it be that lists of objects that aren't primitive types are not handled by the method? Or that lists wrapped in Optional are not handled?

Implement generating examples on list types

Creating examples for list types such as list[str] is currently not supported. It would be useful to have.

ORM style functionality (or recommendation of a package that'd work well with it?)

I'd love to declare a patito model, and then with a bit more code be able to generate that table in a database.

I've considered both SQLModel and Ormar recently.

Any recommendations for anything that might play nice?

How to validate an ordered categorical column?

This topic probably belongs in a discussion forum but I couldn't find one for patito. Please let me know if there is a better place to ask this.

I would like to use patito to validate a dataframe with a categorical column with known categories where the order of the categories is important. What I have done so far is as follows:

from typing import Literal, get_args

import patito as pt
import polars as pl


class MyModel(pt.Model):
    my_col: Literal["a", "b"]


my_dtype = pl.Enum([*get_args(MyModel.model_fields["my_col"].annotation)])

good_df = pl.DataFrame({"my_col": pl.Series(["b", "a"], dtype=my_dtype)})
bad_df = pl.DataFrame(
    {"my_col": pl.Series(["b", "a"], dtype=pl.Enum(["b", "a"]))}
)

MyModel.validate(good_df)
MyModel.validate(bad_df)

This passes for good_df and fails for bad_df as expected. However I'm not 100% sure that this is the intended use of Literal in a patito model, and it was a little awkward to get the correctly ordered categories to put in my custom dtype so I thought I'd ask to see if there's a better (or just different) way to do this.

`read_csv` does not use model dtypes when an alias generator is used

I'm trying out Patito on some real data and as far as I can tell it is inferring the column types (rather than using the model-specified field types) when using the read_csv helper, even though the docs suggest that they're used here:

Read CSV and apply correct column name and types from model.

I think this is happening due to the alias generator not being used to map the columns to the respective fields at this step (and this explains why converting to the data models first and then to DataFrames does work).

This would be nicer if fixed and done automatically.

In this case I find that on a dataset of a few thousand rows, with a column with a mix of numeric and alphanumeric values, if by chance the first few aren't alphanumeric then it gets inferred to be numeric (in this case int).

File attached for reproducibility:

stops.txt (can also be found here)

My model definition is:

from __future__ import annotations
from tubeulator.utils.string_conv import to_camel_case
from pydantic import AliasGenerator, ConfigDict
from enum import Enum

from patito import Model

class LocationTypeEnum(Enum):
    Stop = "0"
    Station = "1"
    # EntranceOrExit = "2"
    GenericNode = "3"
    # BoardingArea = "4"

class Stop(Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(validation_alias=to_camel_case),
    )

    StopId: str
    StopCode: str | None = None
    StopName: str
    # StopDesc: str = None
    StopLat: float | None = None
    StopLon: float | None = None
    LocationType: LocationTypeEnum
    ParentStation: str | None = None
    LevelId: str | None = None
    PlatformCode: str | None = None

The alias generator ensures that all the fields are aliased appropriately using the callable provided: to_camel_case.

Definition of `to_camel_case`:

import re

__all__ = ["to_camel_case", "to_pascal_case"]


def replace_multi_with_single(string: str, char="_") -> str:
    """
    Replace multiple consecutive occurrences of `char` with a single one.
    """
    rep = char + char
    while rep in string:
        string = string.replace(rep, char)

    return string


def to_camel_case(string: str) -> str:
    """
    Convert a string to Camel Case.

    Examples::

        >>> to_camel_case("ModeName")
        'modeName'
        >>> to_camel_case("a_b_c")
        'aBC'

    """
    string = replace_multi_with_single(string.replace("-", "_").replace(" ", "_"))

    return string[0].lower() + re.sub(
        r"(?:_)(.)",
        lambda m: m.group(1).upper(),
        string[1:],
    )

>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt")
                                                             
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/patito/polars.py", line 913, in read_csv
    df = cls.model.DataFrame._from_pydf(pl.read_csv(*args, **kwargs)._df)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
    return function(*args, **kwargs)
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/io/csv/functions.py", line 397, in read_csv
    df = pl.DataFrame._read_csv(
  File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
    self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `A` as dtype `i64` at column 'platform_code' (column number 10)
The current offset in the file is 162305 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.

Original error: ```remaining bytes non-empty```

Works when the schema length is increased:

>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt", infer_schema_length=10000, dtypes=Stop.dtypes)
shape: (6_384, 10)
┌───────────┬───────────────┬──────────────────────┬──────────┬───┬───────────────┬──────────┬───────────────────────────────────┬──────────┐
│ stop_code ┆ platform_code ┆ stop_name            ┆ stop_lon ┆ … ┆ location_type ┆ level_id ┆ stop_id                           ┆ stop_lat │
│ ---       ┆ ---           ┆ ---                  ┆ ---      ┆   ┆ ---           ┆ ---      ┆ ---                               ┆ ---      │
│ str       ┆ str           ┆ str                  ┆ f64      ┆   ┆ i64           ┆ str      ┆ str                               ┆ f64      │
╞═══════════╪═══════════════╪══════════════════════╪══════════╪═══╪═══════════════╪══════════╪═══════════════════════════════════╪══════════╡
│ HUBABW    ┆ null          ┆ Abbey Wood           ┆ null     ┆ … ┆ 1             ┆ null     ┆ HUBABW                            ┆ null     │
│ null      ┆ null          ┆ Outside Abbey Wood   ┆ null     ┆ … ┆ 3             ┆ null     ┆ HUBABW-Outside                    ┆ null     │
│ null      ┆ null          ┆ Bus                  ┆ 0.12128  ┆ … ┆ 3             ┆ L#1      ┆ HUBABW-1001001-Bus-5              ┆ 51.49238 │
...
│ null      ┆ 2             ┆ Westbound Platform 2 ┆ null     ┆ … ┆ 0             ┆ null     ┆ 910GBKRVS-Plat02-WB-london-overg… ┆ null     │
└───────────┴───────────────┴──────────────────────┴──────────┴───┴───────────────┴──────────┴───────────────────────────────────┴──────────┘

I tried passing the dtypes argument like the error message suggested but nothing happened, at which point I realised of course the column names get transformed by the alias generator when ingesting as Pydantic models.

The columns should be set as the correct types by applying the alias generator [or otherwise using per-field aliases] on the dtypes it passes through in the pt.DataFrame.read_csv method with an associated pt.Model class.

Solutions

This will only ever be a problem when has_header is True and if there's a model config specifying an alias_generator.

I put together a PR to contribute this feature:

It will also need to handle per-field aliases (not implemented initially).

Nested literals not evaluated

Nested literals in a list are not properly evaluated for invalid values, see example below:

import patito as pt
import polars as pl
from typing import Literal
class TestModel(pt.Model):
    foo: list[Literal['abc']]  = pt.Field(dtype=pl.List(pl.Utf8))
    
    
df = pl.DataFrame({
    "foo": [['wrong']]
})

TestModel.validate(df)

This should actually throw an error
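
For reference, the check being asked for can be sketched with the standard typing helpers: unwrap list[...] to reach the inner Literal, collect its allowed values, and test every element of every row. This is a plain-Python illustration with nested lists standing in for a polars list column; invalid_list_values is a hypothetical helper and not how patito validates.

```python
from typing import Literal, get_args, get_origin


def invalid_list_values(annotation, column):
    """Return values in a list column that list[Literal[...]] does not allow."""
    (inner,) = get_args(annotation)  # the Literal[...] inside list[...]
    if get_origin(inner) is not Literal:
        raise TypeError("expected a list[Literal[...]] annotation")
    allowed = set(get_args(inner))
    return [value for row in column for value in row if value not in allowed]


print(invalid_list_values(list[Literal["abc"]], [["wrong"], ["abc"]]))  # ['wrong']
```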

Validation bug with `pl.Categorical` ordering argument

Hi, I'm trying to define a patito.Model with a list of polars.Categorical values, but validation fails because of the ordering parameter, which defaults to ordering='physical'.

Here is the model example:

from enum import Enum
from typing import List

import patito
import polars


class Provenance(str, Enum):

    A = "A"
    B = "B"
    C = "C"

class Schema(patito.Model):

    index: int
    databases: List[Provenance] = patito.Field(dtype=polars.List(polars.Categorical))

The problem arises also when I use a typing.Literal instead of an Enum for the Provenance item.

Here is the error raised when validating the model:

databases
  Polars dtype List(Categorical(ordering='physical')) does not match model field type. (type=type_error.columndtype)

I'm forced to use polars.Categorical for now, as polars.Enum isn't yet supported by patito.

Do you have suggestions to fix this error?

RuntimeError: BindingsError: "Value(\"serialize not supported for this 'opaque' function\")"

I get this error when I import a class that inherits a custom struct-based constraint from another class.

Reproducible with:

import patito as pt
import polars as pl


class Test(pt.Model):
    id: int = pt.Field(constraints=pl.struct("id", "line_id").is_unique())
    line_id: int = pt.Field(dtype=pl.UInt32)


class TestExtraField(Test):
    extra_field: str

It is also reproducible with:

import copy
import patito as pt
import polars as pl

copy.deepcopy(pl.struct("id", "line_id").is_unique())

bug: pt.Field applying validation checks also on None values

edit: I see this is inherited from Pydantic, so the root issue is probably there.

I have a df with the following field

class Table(pt.Model):
    score: float | None = pt.Field(ge=0, le=1)

The score can be None, but if it's a float it should be between 0 and 1. However, patito applies these validation checks to all rows, completely ignoring the fact that I allow the value to also be None. This results in the following error:

ValidationError: 1 validation errors for Table
score
  13183196 rows with out of bound values. (type=value_error.rowvalue)
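The behaviour the reporter expects can be stated as a small plain-Python sketch: a bound check on an optional field should only consider the non-None values. `out_of_bound_count` is a hypothetical helper written for illustration; it is not part of patito or Pydantic.

```python
from typing import Iterable, Optional


def out_of_bound_count(
    values: Iterable[Optional[float]], ge: float, le: float
) -> int:
    """Count values outside [ge, le], treating None as valid rather than out of bounds."""
    return sum(1 for v in values if v is not None and not (ge <= v <= le))


scores = [0.2, None, 1.0, 1.5, None]
print(out_of_bound_count(scores, ge=0, le=1))
```

Under this reading, only the 1.5 entry counts as an out-of-bound row; the None entries are skipped.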

Restriction of `Field` are not applied to the `examples` function

The following code raises a ValidationError because Product.examples() assigns the float -0.5 to a, even though a is restricted to be >= 0.0:

import patito as pt
class Product(pt.Model):
    a: float = pt.Field(ge=0.0)


if __name__ == "__main__":
    p = Product.examples()
    print(p["a"][0])  # "-0.5"
    Product.validate(Product.examples())  # ValidationError

Conversion between Polars -> Patito DataFrames and back

The functionality of this package is awesome, but for the use case my team and I have, it's rendered essentially useless by the fact that patito.polars.DataFrames can't be converted back to polars.polars.DataFrames. This feature would be a huge help!

Bug: Multiple constraints are incorrectly evaluated with OR, not AND

When reading the pt.Field documentation for constraints, I assumed that the constraints would AND with each other. However, the behavior actually seems to be OR. This should be made clearer in the documentation. I think AND is more intuitive, but if OR was the intention, then it should be made clearer for users.

Here is an example:

import patito as pt
import polars as pl


class Line(pt.Model):
    """ A point 'x' between 0 and 1, with some 'width'. Applying 'width' at the point 'x' should not 
    extend beyond the [0, 1] interval.
    """
    x: float = pt.Field(ge=0, le=1)
    width: float = pt.Field(
        constraints=[
            (pt.col("x") - 0.5 * pt.col("width")) >= 0,
            (pt.col("x") + 0.5 * pt.col("width")) <= 1,
        ]
    )


Line.validate(pl.DataFrame({"x": [0.5], "width": [1.0]}))  # passes as expected, since 0.5 - 0.5 >= 0 and 0.5 + 0.5 <= 1
Line.validate(pl.DataFrame({"x": [0.4], "width": [1.0]}))  # passes, even though 0.4 - 0.5 < 0
Line.validate(pl.DataFrame({"x": [0.5], "width": [1.1]}))  # fails as expected (since 0.5 - 0.55 < 0 **and** 0.5 + 0.55 > 1)
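The difference between the two readings can be checked by hand. The following plain-Python sketch evaluates the same two constraints on the three rows above, once combined with AND and once with OR; the lambdas and the `passes_and` / `passes_or` helpers are illustrative only and do not use patito.

```python
from typing import Callable, List

Row = dict
Constraint = Callable[[Row], bool]

# The two constraints from the Line model, written as plain predicates.
constraints: List[Constraint] = [
    lambda r: r["x"] - 0.5 * r["width"] >= 0,
    lambda r: r["x"] + 0.5 * r["width"] <= 1,
]


def passes_and(row: Row) -> bool:
    """A row is valid only if every constraint holds."""
    return all(c(row) for c in constraints)


def passes_or(row: Row) -> bool:
    """A row is valid if at least one constraint holds."""
    return any(c(row) for c in constraints)


rows = [
    {"x": 0.5, "width": 1.0},
    {"x": 0.4, "width": 1.0},
    {"x": 0.5, "width": 1.1},
]
for row in rows:
    print(row, "AND:", passes_and(row), "OR:", passes_or(row))
```

The OR results (pass, pass, fail) are consistent with the behaviour reported above, while the AND results (pass, fail, fail) are what the reporter expected. Combining both expressions into a single constraint with `&` might serve as a workaround, though that is untested here.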

Unable to install with polars-lts-cpu

Hi there, big fan of this project; I've been thinking about its implications. Congrats on the recent Pydantic 2.0 re-launch! 🎉

I'm just wondering whether there is an oversight here with regard to polars-lts-cpu (for hardware without certain CPU instructions) and how that might be represented in the package dependencies.

I looked around the Polars repo and came across this issue

pola-rs/polars#12880

In the thread, the following advice is given by Tim Stephenson:

Listing polars as a dependency in a package seems to be an oversight based off of how different x86 binaries for newer/older cpu's are given completely different package names. Potentially the answer from the polars team is that you shouldn't make python packages which depend on polars, only using polars as an end-user. Or you should only compile from source when deploying your package onto an old Xeon server.

I interpret this to mean you should not be putting polars as a package dependency, which is a surprise to say the least!

Perhaps instead it would be possible to make polars an extra, and polars-lts-cpu another extra? I don't suppose it's desirable to ship a patito-lts-cpu package

I'd be interested to see any other suggestions.

If you don't think it's worth the effort to change (and should be left to the user to sort out) then I'd understand too.

Allow `derived_from` to be used in `Model.examples()`

Currently, derived_from is not used when you create an example from the model. It would be helpful if an example could also be generated automatically for these columns. For example, derived_from can be a concatenation of foo and bar, giving us foobar.

[Feature] Add pre/post validators

Hi, I discovered your project one week ago, thanks a lot for the work. I was doing something similar for data validation at my company and saw the speed of polars so decided to switch the backend to polars + patito instead of pure python + pydantic 🤗

Btw, I need the pre/post validation from pydantic to be able to manipulate the data before it even gets validated (e.g. transform a|b|c into a list of str [a, b, c] before evaluation).

Is it something you had in mind for this package, and/or could I contribute to it by adding this feature? @JakobGM @thomasaarholt @brendancooley

(Another issue talking about it: #42)

[FEAT] Ignore validation errors

Is there a way to configure patito Models to ignore certain validation errors (e.g. type_error.superfluoscolumns) in the model class rather than catching and checking errors?

Are pydantic BeforeValidator and AfterValidator annotations supported?

Thanks!

[FEAT] Upgrade to `Pydantic v2`

This will bring multiple benefits.

Use cases:

multiple values allowed for aliases with AliasChoices and AliasGenerator

I need to validate a df that can have multiple names for a specific column. Let's say it could be col1 or col[1-10].

I have a df that is created, and the column names can be a few different things but the same column, and I need a way to validate all the possible names. I've tried multiple things, custom validators, creating a model dynamically, and they have yet to work.
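Until aliases with multiple choices are available, one possible approach is to normalise the column name before validating. The sketch below is plain Python and makes two assumptions that are not in the request itself: that "col[1-10]" means col1 through col10, and that exactly one such column is present. `ACCEPTED` and `rename_to_canonical` are hypothetical names.

```python
import re
from typing import Dict, List

# Assumption: "col[1-10]" in the request means col1 through col10.
ACCEPTED = re.compile(r"col(?:[1-9]|10)")


def rename_to_canonical(columns: List[str], canonical: str = "col") -> Dict[str, str]:
    """Map the single column whose name matches ACCEPTED to the canonical name."""
    matches = [name for name in columns if ACCEPTED.fullmatch(name)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching column, found {matches}")
    return {matches[0]: canonical}


print(rename_to_canonical(["id", "col7"]))
```

The returned mapping has the shape accepted by the polars DataFrame.rename method, so the frame could be renamed first and then validated against a model that declares only the canonical column.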