jakobgm / patito
A data modelling layer built on top of polars and pydantic
License: MIT License
Thanks for this great library; I'm excited to see progress continue. I noticed that the release GitHub Actions pipeline failed for 0.6.0: https://github.com/JakobGM/patito/actions/runs/8066691480.
It would be great to get this release published, since it now supports pydantic >2.
This project looks really interesting, and I look forward to seeing how it develops. The struct support seems quite good, but it doesn't seem to support field constraints at the moment. Here's an example:
import patito as pt
import polars as pl

class Struct(pt.Model):
    x: int
    y: int
    z: int = pt.Field(lt=2)

class MyModel(pt.Model):
    struct_col: Struct
    list_struct_col: list[Struct]

df = pl.DataFrame(
    {
        "struct_col": [{"x": 1, "y": 2, "z": 3}],
        "list_struct_col": [[{"x": 1, "y": 2, "z": 3}, {"x": 1, "y": 2, "z": 3}]],
    }
)
MyModel.validate(df)
This passes validation, despite all of the z fields being greater than 2. I expect to receive errors for both of the columns. Can pt.Field be improved to do more validation inside of structs?
Hi there.
Recently, I ran into a problem: import patito as pt raises a TypeError:
In [1]: import patito as pt
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 1
----> 1 import patito as pt
File ~/Documents/GitHub/ServerlessPolars/.venv/lib/python3.10/site-packages/patito/__init__.py:4
1 """Patito, a data-modelling library built on top of polars and pydantic."""
2 from polars import Expr, Series, col
----> 4 from patito import exceptions, sql
5 from patito.exceptions import ValidationError
6 from patito.polars import DataFrame, LazyFrame
File ~/Documents/GitHub/ServerlessPolars/.venv/lib/python3.10/site-packages/patito/exceptions.py:6
1 """Module containing all custom exceptions raised by patito."""
3 import pydantic
----> 6 class ValidationError(pydantic.ValidationError):
7 """Exception raised when dataframe does not match schema."""
10 class ErrorWrapper(pydantic.error_wrappers.ErrorWrapper):
TypeError: type 'pydantic_core._pydantic_core.ValidationError' is not an acceptable base type
Before this issue, I had already used Patito successfully in my project. After taking a look, it seems that the current Patito release is not compatible with the recently introduced Pydantic v2. In pyproject.toml, pydantic is constrained to >= 1.7.0, which lets Poetry automatically install Pydantic 2.0.3, and I think this causes the issue.
Could you take a look? Many thanks.
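As a quick diagnostic for the incompatibility described above, it can help to check which Pydantic major version the resolver actually installed. This is only a rough sketch using the standard library; the helper names major_version and installed_major are made up for illustration and are not part of patito or pydantic.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.0.3'."""
    return int(version.split(".")[0])


def installed_major(package: str) -> Optional[int]:
    """Major version of an installed package, or None if it is not installed."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


# Prints 1 or 2 depending on the environment, or None if pydantic is absent.
print(installed_major("pydantic"))
```

If this reports 2 while the installed patito release still subclasses pydantic.ValidationError as in the traceback, pinning pydantic below 2 in the project's own dependencies is one possible stopgap.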
This syntax is no longer supported with Polars. This should be removed.
Reprex:
df is a simple financial time series from OpenBB.
test = StocksBaseModel.DataFrame(df).cast()
TypeError: LazyFrame.collect() got an unexpected keyword argument 'eager'
When you pass a list of values, the columns are returned in reversed order. However, if you pass a single value, they are in the correct order. See the examples below:
import patito as pt

class Test(pt.Model):
    a: str
    b: str

Test.examples({'a': '1', 'b': '2'})
a | b |
---|---|
str | str |
"1" | "2" |
This doesn't keep the order:
import patito as pt

class Test(pt.Model):
    a: str
    b: str

Test.examples({'a': ['1'], 'b': ['2']})
b | a |
---|---|
str | str |
"2" | "1" |
It seems like patito.Model.examples() is not working. For testing I just installed Patito in a new venv. Running the following code
import patito as pt

class Bids(pt.Model):
    offer_id: str = pt.Field(alias='Offer-ID', unique=True)
    country_eic: str = pt.Field(alias='Country-EIC')
    power: int = pt.Field(alias="Power")

if __name__ == '__main__':
    print(Bids.examples({"Offer-ID": ["a1", "a2", "a3"]}))
gives me the following error
Traceback (most recent call last):
File "C:\Users\ANON\python\test_proj\models.py", line 27, in
print(Bids.examples({"Offer-ID": ["a1","a2","a3"]}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\patito\pydantic.py", line 1033, in examples
return DataFrame().with_columns(series).with_columns(unique_series)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\patito\polars.py", line 703, in with_columns
return cast(DF, super().with_columns(*exprs, **named_exprs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\ANON\python\test_proj\venv\Lib\site-packages\polars\dataframe\frame.py", line 8270, in with_columns
return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: LazyFrame.collect() got an unexpected keyword argument '_eager'
With pip install patito, the following packages were installed:
Currently, a type specification of Optional[int] means that a column must be of integer type but may contain nulls.
We currently don't support a syntax to specify that a column is allowed to be missing.
One current workaround is to specify Foo.validate(df, allow_missing_columns=True), where allow_missing_columns is passed on to _find_errors as a kwarg (we should add this as an explicit parameter).
The following example contains a suggestion for how we could allow missing columns (see c). It is one that @JakobGM came up with last year.
import patito as pt
from typing import Optional

class Foo(pt.Model):
    a: int  # only ints
    b: Optional[int]  # mix ints and nulls
    c: int = None  # column may be missing, but if it's there it must be an int - but this fails a type check
    d: Optional[int] = None  # column may be missing, but if it's there it must be an int
An alternative would be to use pt.Field / ColumnInfo, and do something like the following, which I might like better, just because it will pass type checks.
class Foo(pt.Model):
    c: int = pt.Field(allow_missing=True)
I am very open to ideas here. Does anyone have a suggestion? Tagging a few possibly-interested parties, @brendancooley, @dsgibbons, @ion-elgreco
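To make the proposal a bit more concrete, the missing-column check itself could be sketched as below. This is plain Python rather than patito's actual validator; the function name find_missing_columns and its allow_missing parameter are hypothetical and only mirror the pt.Field(allow_missing=True) idea above.

```python
def find_missing_columns(model_columns, frame_columns, allow_missing=()):
    """Return model columns absent from the frame, skipping those allowed to be missing."""
    present = set(frame_columns)
    optional = set(allow_missing)
    return [
        name
        for name in model_columns
        if name not in present and name not in optional
    ]


# "c" is declared as allowed to be missing, so only "b" is reported.
missing = find_missing_columns(["a", "b", "c"], ["a"], allow_missing=["c"])
print(missing)  # ['b']
```

With per-field metadata, allow_missing would be collected from the model's fields instead of being passed in by hand.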
This is going to cause an issue when you want to validate the list column:
Reproducible example:
import patito as pt

class Test(pt.Model):
    col: list[str]

class InhTest(Test):
    pass

df = InhTest.examples({
    "col": [['Hello']]
})
InhTest.validate(df)
print(Test.__annotations__)
print(InhTest.__annotations__)
{'col': list[str]}
{}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[21], line 4
1 df = InhTest.examples({
2 "col":[['Hello']]
3 })
----> 4 InhTest.validate(df)
File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/pydantic.py:707, in Model.validate(cls, dataframe)
662 @classmethod
663 def validate(
664 cls,
665 dataframe: Union["pd.DataFrame", pl.DataFrame],
666 ) -> None:
667 """
668 Validate the schema and content of the given dataframe.
669
(...)
705 Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
706 """
--> 707 validate(dataframe=dataframe, schema=cls)
File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/validators.py:316, in validate(dataframe, schema)
313 else:
314 polars_dataframe = cast(pl.DataFrame, dataframe)
--> 316 errors = _find_errors(dataframe=polars_dataframe, schema=schema)
317 if errors:
318 raise ValidationError(errors=errors, model=schema)
File ~/<redacted>/.venv/lib/python3.10/site-packages/patito/validators.py:153, in _find_errors(dataframe, schema)
150 if not isinstance(dtype, pl.List):
151 continue
--> 153 annotation = schema.__annotations__[column] # type: ignore[unreachable]
155 # Retrieve the annotation of the list itself,
156 # dewrapping any potential Optional[...]
157 list_type = _dewrap_optional(annotation)
KeyError: 'col'
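The KeyError comes from looking only at the subclass's own annotations, which are empty when the subclass declares no fields. A possible direction, shown here with plain classes instead of pt.Model so it runs without patito, is to resolve annotations across the whole inheritance chain with typing.get_type_hints. Whether patito should fix it this way is an assumption on my part.

```python
from typing import get_type_hints


class Test:
    col: list[str]


class InhTest(Test):
    pass


# Annotations declared directly in the subclass body: none.
own = InhTest.__dict__.get("__annotations__", {})

# get_type_hints walks the MRO, so inherited fields are included.
resolved = get_type_hints(InhTest)

print(own)       # {}
print(resolved)  # {'col': list[str]}
```

Looking up the column in the resolved mapping instead of schema.__annotations__ would avoid the KeyError for inherited list columns.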
The comparators lt, gt, le and ge are all defined with respect to a single floating-point value. It would be even better if we could define these with respect to other columns. Something like:
import patito as pt

class model(pt.Model):
    x_min: int
    x_max: int = pt.Field(gt=pt.col("x_min"))
I imagine this may be difficult if the underlying functionality is not supported in Pydantic, but I'd love to see this feature!
When using Pydantic alias generators, nested types are not serializing properly for dtypes.
My use case is that I have data coming from an API that is in camelCase. I validate against that format using the to_camel() alias generator. Serialized data should always be in snake case.
The bug can be reproduced with the following code:
import patito as pt
from pydantic import AliasGenerator, ConfigDict
from pydantic.alias_generators import to_camel, to_snake

class BaseModel(pt.Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(
            validation_alias=to_camel,
            serialization_alias=to_snake,
        ),
        populate_by_name=True,
    )

class NestedModel(BaseModel):
    nested_field: int

class ParentModel(BaseModel):
    parent_field: int
    nested_model: NestedModel
When calling dtypes on NestedModel, things are serialized properly:
In [3]: NestedModel.dtypes
Out[3]: {'nested_field': Int64}
However, when calling dtypes on ParentModel, the columns for NestedModel are back to camelCase:
In [14]: ParentModel.dtypes
Out[14]: {'parent_field': Int64, 'nested_model': Struct({'nestedField': Int64})}
Serialization works as expected (can be initialized with camelCase):
In [16]: foo = ParentModel(parent_field=1, nested_model={'nestedField': 2})
In [17]: foo.model_dump()
Out[17]: {'parent_field': 1, 'nested_model': {'nested_field': 2}}
I've recently updated all my dependencies and am not sure if this is a new issue or one that already existed. I have a branch where I've added the above code as an initial test and have played with the mode="serialization" flag for model_dump_json(), but so far I haven't figured out the issue.
That branch is linked below.
It's worth noting that, without populate_by_name=True set on the model config, camelCase fields will fail validation. I think this is a newer flag, as well as the mode option for model dumping.
mode="serialization": https://github.com/timoguin/patito/blob/fix/dtype-casing-bug-when-using-aliases/src/patito/_pydantic/schema.py#L30
Right now, you can only create a model-aware DataFrame from either a dict representation or from pandas. It would be great for workflow to be able to do this with polars objects.
from openbb_terminal.stocks import stocks_helper as stocks
import polars as pl
df = stocks.load(
    symbol="AAPL",
    start_date="1950-01-01",
    end_date="2023-10-16",
)
df = pl.from_pandas(df, include_index=True).lazy()
# THIS DOESNT WORK
df = StocksBaseModel.LazyFrame(df)
TypeError: DataFrame constructor called with unsupported type 'LazyFrame' for the `data` parameter
Great package.
Unfortunately, I work with time series data a lot (example below), and I am getting a runtime error like:
RuntimeError: no validator found for <class '__main__.PolarsDate'>, see `arbitrary_types_allowed` in Config
Example
import patito as pt
from polars import DataFrame, Datetime, Date
from pydantic import validate_arguments

@validate_arguments(config=dict(arbitrary_types_allowed=True))
class ScoresTable(pt.Model):
    id: int = pt.Field(ge=0, unique=True)
    local_date: Date
    utc_dt: Datetime(time_unit='us', time_zone='UTC')
    values: float
@brendancooley (and others), I want to refactor the Field function in order to:
gt, dtype, constraints etc. rather than use args/kwargs
I'm considering encoding all the relevant Field parameters inside ColumnInfo, and ensuring that ColumnInfo can serialize (using field_serializer) and deserialize (through validators) all parameters.
This means that we can type-safely pass all these arguments to pydantic's Field using pydantic.fields.Field(json_schema_extra=column_info.model_dump()).
Then, at validation time, we would reconstruct the ColumnInfo object for each column using ColumnInfo.model_validate(some_patito_model.model_fields["some_field"]), and be able to relatively easily use these objects for validation.
The only things that I'm a bit unsure about:
ColumnInfo for fields like gt, which already exist in pydantic's Field? Or do as we currently do and pop them off? I'm leaning towards keeping them within the ColumnInfo just to have all the logic in one place.
ColumnInfo like I suggest in the previous point, are there cases where we shouldn't do that?
Let me know if this seems unclear! Writing this while alternating entertaining a 2.5 year old and a 3 month old 😅
Hello hello! Long time no see @JakobGM, @thomasaarholt 😄 Cool project!
I'm interested in strong typing on the exposed polars.DataFrame methods such as get_column(). Take for example the following snippet:
import polars as pl

def time_filter_positions(cow_pos: pl.DataFrame):
    min_gsn = int(cow_pos.get_column("gsn").min())
    max_gsn = int(cow_pos.get_column("gsn").max())
    min_date = cow_pos.get_column("timestamp").min()
    max_date = cow_pos.get_column("timestamp").max()
    # ----SNIP----
This upsets the type checker, because min is typed as (method) def min() -> (PythonLiteral | None), and a .. | None cannot be passed into int() without an additional assertion. What I want to do is something like this:
import patito as pt
import polars as pl

class MyDf(pt.Model):
    gsn: int = pt.Field(unique=True)
    timestamp = pt.Field(pl.Datetime)

def time_filter_positions(cow_pos: MyDf.DataFrame):
    min_gsn = int(cow_pos.get_column("gsn").min())
    max_gsn = int(cow_pos.get_column("gsn").max())
    min_date = cow_pos.get_column("timestamp").min()
    max_date = cow_pos.get_column("timestamp").max()
and have it just work ™️. So the patito dataframe should somehow add typing to the get_column() method: check that the input string is a Literal that matches one of the columns, and then define the output of the get_column call to be of the correct type. I have no idea if this is possible at all, but it would lead to a fantastic developer experience, where your type checker and autocomplete become contextually aware when using patito dataframes! What do you guys think? In scope for this project? Or completely far out? Or is this a level of typing that ought to be implemented in Polars itself?
Perhaps I misunderstand the way that Field(alias='name') should work in patito, but I was surprised by these errors:
>>> from typing import Literal
>>>
>>> import patito as pt
>>> import polars as pl
>>>
>>> class Product(pt.Model):
...     product_id: int = pt.Field(unique=True, alias='prod')
...     name: str
...     temperature_zone: Literal["dry", "cold", "frozen"]
...     demand_percentage: float
...
>>> valid_product_df = pl.DataFrame(
...     {
...         "product_id": [64, 11],
...         "name": ["Pizza", "Cereal"],
...         "temperature_zone": ["frozen", "dry"],
...         "demand_percentage": [0.07, 0.16],
...     }
... )
>>>
>>> Product.validate(valid_product_df) # No errors
>>>
>>> also_valid_product_df = pl.DataFrame(
...     {
...         "prod": [64, 11],
...         "name": ["Pizza", "Cereal"],
...         "temperature_zone": ["frozen", "dry"],
...         "demand_percentage": [0.07, 0.16],
...     }
... )
>>>
>>> Product.validate(also_valid_product_df) #Surprise!
Traceback (most recent call last):
[...]
patito.exceptions.DataFrameValidationError: 2 validation errors for Product
product_id
Missing column (type=type_error.missingcolumns)
prod
Superfluous column (type=type_error.superfluouscolumns)
From the way that Field aliases work in pydantic I thought that 'prod' would be interpreted as 'product_id'. Is this expected behaviour? Because if so perhaps a clarifying line in the docs to rename columns would help?
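Until the intended behaviour is clarified, one workaround is to rename aliased columns to their field names before validating. A minimal sketch of that mapping step in plain Python follows; the helper rename_aliases is hypothetical and the alias table is written out by hand rather than read from the model.

```python
def rename_aliases(columns, alias_to_field):
    """Replace known aliases with model field names, leaving other columns untouched."""
    return [alias_to_field.get(name, name) for name in columns]


columns = ["prod", "name", "temperature_zone", "demand_percentage"]
renamed = rename_aliases(columns, {"prod": "product_id"})
print(renamed)
# ['product_id', 'name', 'temperature_zone', 'demand_percentage']
```

On an actual polars frame the same idea would be also_valid_product_df.rename({"prod": "product_id"}) before calling Product.validate.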
Thanks for your work with patito - it's shaping up great! I really like it and it's massively helping my projects!
Env details
Python 3.9.19
patito 0.6.1
polars 0.20.31
pydantic 2.7.4
In the example below, you can see I create id with UInt16; however, all the models produced by rename, with_fields, select & drop lose the field definitions that were provided in pt.Field.
class Product(pt.Model):
    id: int = pt.Field(unique=True, dtype=pl.UInt16)
    row_id: int
print(Product.dtypes)
{'id': UInt16, 'row_id': Int64}
With_fields
print(Product.with_fields().dtypes)
{'id': Int64, 'row_id': Int64}
Rename
print(Product.rename({'row_id':'row'}).dtypes)
{'id': Int64, 'row': Int64}
Drop
print(Product.drop('row_id').dtypes)
{'id': Int64}
Currently the constraints are row-wise, but it would be nice to have the option to add columnar type of constraints or multi-columnar.
This way you can do unique check on multiple columns for example.
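For illustration, a multi-column uniqueness check of the kind requested could look roughly like this in plain Python. The function name duplicated_keys and the sample rows are made up; this is not how patito validates today.

```python
from collections import Counter


def duplicated_keys(rows, key_columns):
    """Return the key tuples that occur more than once across the given columns."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


rows = [
    {"id": 1, "line_id": 1},
    {"id": 1, "line_id": 2},
    {"id": 1, "line_id": 1},
]
print(duplicated_keys(rows, ["id", "line_id"]))  # [(1, 1)]
```

Note that id alone is not unique here, while the (id, line_id) pair is violated only once, which is exactly the distinction a per-column unique=True cannot express.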
It seems that the _find_errors method used to validate the content of a polars.DataFrame based on a patito.Model does not handle lists of objects properly when they are wrapped in Optional.
Here is a small reproducible example:
from typing import Optional
import patito
import polars
class Inner(patito.Model):
    name: str
    reliability: bool
    level: int

class Outer(patito.Model):
    id: str
    code: str
    label: str
    inner_types: Optional[list[Inner]]

df = polars.DataFrame(
    {
        "id": [1, 2, 3],
        "code": ["A", "B", "C"],
        "label": ["a", "b", "c"],
        "inner_types": [
            [{"name": "a", "reliability": True, "level": 1}],
            [{"name": "b", "reliability": False, "level": 2}],
            None,
        ],
    }
)
df = Outer.DataFrame(df).cast().derive()
df.validate()
Here is the traceback I get:
Traceback (most recent call last):
File "/home/user/__init__.py", line 1093, in <module>
df.validate()
File "/home/user/.venv/lib/python3.9/site-packages/patito/polars.py", line 584, in validate
self.model.validate(dataframe=self, columns=columns, **kwargs)
File "/home/user/.venv/lib/python3.9/site-packages/patito/pydantic.py", line 476, in validate
validate(
File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 445, in validate
errors = _find_errors(
File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 303, in _find_errors
list_struct_errors = _find_errors(
File "/home/user/.venv/lib/python3.9/site-packages/patito/validators.py", line 143, in _find_errors
schema_subset = columns or schema.columns
AttributeError: type object 'list' has no attribute 'columns'
It might come from the fact that lists of objects that aren't primitive types aren't well understood by the method? Or that lists wrapped in Optional aren't well understood?
Currently, creating examples for a list[str] field, for example, is not supported. It could be useful to have.
I'd love to declare a patito model, and then with a bit more code be able to generate that table in a database.
I've considered both SQLModel and Ormar recently.
Any recommendations for anything that might play nice?
This topic probably belongs in a discussion forum but I couldn't find one for patito. Please let me know if there is a better place to ask this.
I would like to use patito to validate a dataframe with a categorical column with known categories where the order of the categories is important. What I have done so far is as follows:
from typing import Literal, get_args
import patito as pt
import polars as pl
class MyModel(pt.Model):
    my_col: Literal["a", "b"]

my_dtype = pl.Enum([*get_args(MyModel.model_fields["my_col"].annotation)])
good_df = pl.DataFrame({"my_col": pl.Series(["b", "a"], dtype=my_dtype)})
bad_df = pl.DataFrame(
    {"my_col": pl.Series(["b", "a"], dtype=pl.Enum(["b", "a"]))}
)
MyModel.validate(good_df)
MyModel.validate(bad_df)
This passes for good_df and fails for bad_df as expected. However, I'm not 100% sure that this is the intended use of Literal in a patito model, and it was a little awkward to get the correctly ordered categories to put in my custom dtype, so I thought I'd ask to see if there's a better (or just different) way to do this.
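One way to make the category extraction less awkward is a small helper around typing.get_args that also unwraps Optional. This is only a sketch using the standard library; literal_categories is a hypothetical name, not something patito provides.

```python
import types
from typing import Literal, Optional, Union, get_args, get_origin

# Union[X, None] and X | None can report different origins depending on the
# Python version, so accept both.
_UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def literal_categories(annotation):
    """Return the values of a Literal annotation in declaration order."""
    if get_origin(annotation) in _UNION_ORIGINS:
        # Unwrap Optional[Literal[...]] by taking the single non-None member.
        members = [arg for arg in get_args(annotation) if arg is not type(None)]
        if len(members) == 1:
            annotation = members[0]
    if get_origin(annotation) is not Literal:
        raise TypeError(f"{annotation!r} is not a Literal annotation")
    return list(get_args(annotation))


print(literal_categories(Literal["a", "b"]))            # ['a', 'b']
print(literal_categories(Optional[Literal["b", "a"]]))  # ['b', 'a']
```

The result can be passed to pl.Enum(...) in the same way as in the snippet above, with MyModel.model_fields["my_col"].annotation as the input.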
I'm trying out Patito on some real data and as far as I can tell it is inferring the column types (rather than using the model-specified field types) when using the read_csv helper, even though the docs suggest that they're used here:
Read CSV and apply correct column name and types from model.
I think this is happening due to the alias generator not being used to map the columns to the respective fields at this step (and this explains why converting to the data models first and then to DataFrames does work).
This would be nicer if fixed and done automatically.
In this case I find that on a dataset of a few thousand rows, with a column containing a mix of numeric and alphanumeric values, if by chance the first few values aren't alphanumeric then the column gets inferred to be numeric (in this case int).
File attached for reproducibility:
My model definition is:
from __future__ import annotations
from tubeulator.utils.string_conv import to_camel_case
from pydantic import AliasGenerator, ConfigDict
from enum import Enum
from patito import Model
class LocationTypeEnum(Enum):
    Stop = "0"
    Station = "1"
    # EntranceOrExit = "2"
    GenericNode = "3"
    # BoardingArea = "4"

class Stop(Model):
    model_config = ConfigDict(
        alias_generator=AliasGenerator(validation_alias=to_camel_case),
    )
    StopId: str
    StopCode: str | None = None
    StopName: str
    # StopDesc: str = None
    StopLat: float | None = None
    StopLon: float | None = None
    LocationType: LocationTypeEnum
    ParentStation: str | None = None
    LevelId: str | None = None
    PlatformCode: str | None = None
The alias generator ensures that all the fields are aliased appropriately using the callable provided: to_camel_case.
import re
__all__ = ["to_camel_case", "to_pascal_case"]
def replace_multi_with_single(string: str, char="_") -> str:
    """
    Replace multiple consecutive occurrences of `char` with a single one.
    """
    rep = char + char
    while rep in string:
        string = string.replace(rep, char)
    return string

def to_camel_case(string: str) -> str:
    """
    Convert a string to Camel Case.

    Examples::

        >>> to_camel_case("ModeName")
        'modeName'
        >>> to_camel_case("a_b_c")
        'aBC'
    """
    string = replace_multi_with_single(string.replace("-", "_").replace(" ", "_"))
    return string[0].lower() + re.sub(
        r"(?:_)(.)",
        lambda m: m.group(1).upper(),
        string[1:],
    )
>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/patito/polars.py", line 913, in read_csv
df = cls.model.DataFrame._from_pydf(pl.read_csv(*args, **kwargs)._df)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 134, in wrapper
return function(*args, **kwargs)
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/io/csv/functions.py", line 397, in read_csv
df = pl.DataFrame._read_csv(
File "/home/louis/miniconda3/envs/tubeulator/lib/python3.11/site-packages/polars/dataframe/frame.py", line 655, in _read_csv
self._df = PyDataFrame.read_csv(
polars.exceptions.ComputeError: could not parse `A` as dtype `i64` at column 'platform_code' (column number 10)
The current offset in the file is 162305 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `A` to the `null_values` list.
Original error: ```remaining bytes non-empty```
Works when the schema length is increased:
>>> Stop.DataFrame.read_csv("../data/stationdata/gtfs/stops.txt", infer_schema_length=10000, dtypes=Stop.dtypes)
shape: (6_384, 10)
┌───────────┬───────────────┬──────────────────────┬──────────┬───┬───────────────┬──────────┬───────────────────────────────────┬──────────┐
│ stop_code ┆ platform_code ┆ stop_name ┆ stop_lon ┆ … ┆ location_type ┆ level_id ┆ stop_id ┆ stop_lat │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ ┆ i64 ┆ str ┆ str ┆ f64 │
╞═══════════╪═══════════════╪══════════════════════╪══════════╪═══╪═══════════════╪══════════╪═══════════════════════════════════╪══════════╡
│ HUBABW ┆ null ┆ Abbey Wood ┆ null ┆ … ┆ 1 ┆ null ┆ HUBABW ┆ null │
│ null ┆ null ┆ Outside Abbey Wood ┆ null ┆ … ┆ 3 ┆ null ┆ HUBABW-Outside ┆ null │
│ null ┆ null ┆ Bus ┆ 0.12128 ┆ … ┆ 3 ┆ L#1 ┆ HUBABW-1001001-Bus-5 ┆ 51.49238 │
...
│ null ┆ 2 ┆ Westbound Platform 2 ┆ null ┆ … ┆ 0 ┆ null ┆ 910GBKRVS-Plat02-WB-london-overg… ┆ null │
└───────────┴───────────────┴──────────────────────┴──────────┴───┴───────────────┴──────────┴───────────────────────────────────┴──────────┘
I tried passing the dtypes argument like the error message suggested but nothing happened, at which point I realised that of course the column names get transformed by the alias generator when ingesting as Pydantic models.
The columns should be set as the correct types by applying the alias generator [or otherwise using per-field aliases] on the dtypes it passes through in the pt.DataFrame.read_csv method with an associated pt.Model class.
This will only ever be a problem when has_header is True and if there's a model config specifying an alias_generator.
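The re-keying step described above can be sketched without patito or polars. Here dtypes are plain strings, to_camel_case is a condensed copy of the generator quoted earlier, and dtypes_by_alias is a hypothetical helper name; the real fix would operate on the model's dtypes mapping.

```python
import re


def to_camel_case(string: str) -> str:
    """Condensed version of the alias generator quoted above."""
    string = re.sub(r"_+", "_", string.replace("-", "_").replace(" ", "_"))
    return string[0].lower() + re.sub(
        r"_(.)", lambda m: m.group(1).upper(), string[1:]
    )


def dtypes_by_alias(field_dtypes, alias_generator):
    """Re-key a {field name: dtype} mapping by the alias of each field."""
    return {alias_generator(name): dtype for name, dtype in field_dtypes.items()}


# Dtypes are plain strings here so the sketch runs without polars installed.
field_dtypes = {"StopId": "str", "PlatformCode": "str", "StopLat": "float"}
print(dtypes_by_alias(field_dtypes, to_camel_case))
# {'stopId': 'str', 'platformCode': 'str', 'stopLat': 'float'}
```

The re-keyed mapping is what would be handed to the CSV reader so that the header names, not the field names, select the dtypes. Which generator produces names matching a given file's header still has to be checked against that file.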
I put together a PR to contribute this feature:
It will also need to handle per-field aliases (not implemented initially).
Nested literals in a list are not properly evaluated for invalid values, see example below:
import patito as pt
import polars as pl
from typing import Literal
class TestModel(pt.Model):
    foo: list[Literal['abc']] = pt.Field(dtype=pl.List(pl.Utf8))

df = pl.DataFrame({
    "foo": [['wrong']]
})
TestModel.validate(df)
This should actually throw an error
Hi, I'm trying to define a patito.Model with a list of polars.Categorical, but when validating there is an error about the ordering parameter, which defaults to ordering='physical'.
Here is the model example:
from enum import Enum
from typing import List
import patito
import polars

class Provenance(str, Enum):
    A = "A"
    B = "B"
    C = "C"

class Schema(patito.Model):
    index: int
    databases: List[Provenance] = patito.Field(dtype=polars.List(polars.Categorical))
The problem also arises when I use a typing.Literal instead of an Enum for the Provenance item.
Here is the error raised when validating the model:
databases
Polars dtype List(Categorical(ordering='physical')) does not match model field type. (type=type_error.columndtype)
I'm forced to use polars.Categorical for now, as polars.Enum isn't supported by patito yet.
Do you have suggestions to fix this error?
I get this error when I import a class that inherits a custom struct-based constraint from another class.
Reproducible with:
class Test(pt.Model):
    id: int = pt.Field(constraints=pl.struct("id", "line_id").is_unique())
    line_id: int = pt.Field(dtype=pl.UInt32)

class TestExtraField(Test):
    extra_field: str
Also reproducible with:
import copy
import patito as pt
import polars as pl
copy.deepcopy(pl.struct("id", "line_id").is_unique())
edit: I see this is inherited from Pydantic, so the root issue is probably there.
I have a df with the following field
class Table(pt.Model):
    score: float | None = pt.Field(ge=0, le=1)
The score can be None, but if it's a float it should be between 0 and 1. However, patito applies these validation checks to all rows, completely ignoring the fact that I also allow it to be None. This results in the following error:
ValidationError: 1 validation errors for Table
score
13183196 rows with out of bound values. (type=value_error.rowvalue)
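The behaviour I would expect is that nulls are skipped by the bound checks. A rough pure-Python illustration of that rule follows; count_out_of_bounds is a made-up helper and says nothing about how patito implements the check internally.

```python
def count_out_of_bounds(values, ge=None, le=None):
    """Count non-null values outside [ge, le]; None entries are skipped."""

    def violates(value):
        return (ge is not None and value < ge) or (le is not None and value > le)

    return sum(1 for value in values if value is not None and violates(value))


scores = [0.2, None, 1.5, None, 0.9]
print(count_out_of_bounds(scores, ge=0, le=1))  # 1
```

Only 1.5 is reported here; the two None entries are neither valid nor invalid with respect to the bounds.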
The following code raises a ValidationError because a is assigned -0.5 as a float by Product.examples(), even though a is restricted to be >=0.0:
import patito as pt

class Product(pt.Model):
    a: float = pt.Field(ge=0.0)

if __name__ == "__main__":
    p = Product.examples()
    print(p["a"][0])  # "-0.5"
    Product.validate(Product.examples())  # ValidationError
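A bound-aware way of picking the example value might look like the sketch below. It is not patito's generator: example_float is a hypothetical function, and the -0.5 default is only taken from the value observed in the output above.

```python
def example_float(ge=None, gt=None, le=None, lt=None, default=-0.5):
    """Return `default` if it satisfies the bounds, otherwise a value inside them."""

    def ok(value):
        return (
            (ge is None or value >= ge)
            and (gt is None or value > gt)
            and (le is None or value <= le)
            and (lt is None or value < lt)
        )

    if ok(default):
        return default
    lower = ge if ge is not None else gt
    upper = le if le is not None else lt
    if lower is not None and upper is not None:
        candidate = (lower + upper) / 2
    elif lower is not None:
        candidate = lower + 1.0
    else:
        candidate = upper - 1.0
    if not ok(candidate):
        raise ValueError("no example value satisfies the given bounds")
    return float(candidate)


print(example_float(ge=0.0))  # 1.0
```

With such a rule, Product.validate(Product.examples()) would no longer fail on the generated value of a.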
The functionality of this package is awesome, but for the use case my team and I have, it's rendered essentially useless by the fact that patito.polars.DataFrames can't be reverted back to polars.DataFrames. This feature would be a huge help!
When reading the pt.Field documentation for constraints, I assumed that the constraints would be ANDed with each other. However, the behavior actually seems to be OR. I think AND is more intuitive, but if OR was the intention, then it should be made clearer for users in the documentation.
Here is an example:
import patito as pt
import polars as pl

class Line(pt.Model):
    """ A point 'x' between 0 and 1, with some 'width'. Applying 'width' at the point 'x' should not
    extend beyond the [0, 1] interval.
    """
    x: float = pt.Field(ge=0, le=1)
    width: float = pt.Field(
        constraints=[
            (pt.col("x") - 0.5 * pt.col("width")) >= 0,
            (pt.col("x") + 0.5 * pt.col("width")) <= 1,
        ]
    )

Line.validate(pl.DataFrame({"x": [0.5], "width": [1.0]}))  # passes as expected, since 0.5 - 0.5 >= 0 and 0.5 + 0.5 <= 1
Line.validate(pl.DataFrame({"x": [0.4], "width": [1.0]}))  # passes, even though 0.4 - 0.5 < 0
Line.validate(pl.DataFrame({"x": [0.5], "width": [1.1]}))  # fails as expected (since 0.5 - 0.55 < 0 **and** 0.5 + 0.55 > 1)
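To spell out the AND semantics I expected, here are the same three rows evaluated in plain Python. The function width_fits is just an illustration of the intended rule, not patito code.

```python
def width_fits(x, width):
    """True only if both interval constraints hold (AND semantics)."""
    constraints = [
        x - 0.5 * width >= 0,
        x + 0.5 * width <= 1,
    ]
    # any(constraints) would reproduce the OR behaviour observed above.
    return all(constraints)


for x, width in [(0.5, 1.0), (0.4, 1.0), (0.5, 1.1)]:
    print(x, width, width_fits(x, width))
```

Under AND, only the first row is valid; the second row is the one that currently slips through.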
Hi there, big fan of this project. I've been thinking about the implications. Congrats on the recent Pydantic 2.0 re-launch! 🎉
I'm just wondering whether there is an oversight here with regard to polars-lts-cpu (for hardware without certain CPU instructions) and how that might be represented in package dependencies.
I looked around the Polars repo and came across this issue
In the thread, the following advice is given by Tim Stephenson:
Listing polars as a dependency in a package seems to be an oversight based off of how different x86 binaries for newer/older cpu's are given completely different package names. Potentially the answer from the polars team is that you shouldn't make python packages which depend on polars, only using polars as an end-user. Or you should only compile from source when deploying your package onto an old Xeon server.
I interpret this to mean you should not be putting polars as a package dependency, which is a surprise to say the least!
Perhaps instead it would be possible to make polars an extra, and polars-lts-cpu another extra? I don't suppose it's desirable to ship a patito-lts-cpu package.
I'd be interested to see any other suggestions.
If you don't think it's worth the effort to change (and should be left to the user to sort out) then I'd understand too.
Currently, derived_from is not used when you create an example from the model. It would be helpful if it could also generate an example automatically for these columns. For example, derived_from can be a concatenation of foo and bar, giving us foobar.
Hi, I discovered your project one week ago, thanks a lot for the work. I was doing something similar for data validation at my company and saw the speed of polars so decided to switch the backend to polars + patito instead of pure python + pydantic 🤗
Btw, I need the pre/post validation from pydantic to be able to manipulate data before it even gets validated (e.g. transform a|b|c into a list of str [a, b, c] before evaluation).
Is it something you had in mind for this package, and/or could I contribute to it by adding this feature? @JakobGM @thomasaarholt @brendancooley
(Another issue talking about it: #42)
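For reference, the kind of pre-validation transform I have in mind is tiny. A plain-Python sketch follows; split_pipe_list and the tags column are made-up names for illustration.

```python
def split_pipe_list(value):
    """Turn 'a|b|c' into ['a', 'b', 'c']; pass existing sequences through as lists."""
    if isinstance(value, str):
        return [part.strip() for part in value.split("|")]
    return list(value)


raw_rows = [{"tags": "a|b|c"}, {"tags": "d"}]
prepared = [{**row, "tags": split_pipe_list(row["tags"])} for row in raw_rows]
print(prepared)  # [{'tags': ['a', 'b', 'c']}, {'tags': ['d']}]
```

On a polars column the equivalent expression would be pl.col("tags").str.split("|"), applied before the frame is validated against a list[str] field.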
Is there a way to configure patito Models to ignore certain validation errors (e.g. type_error.superfluouscolumns) in the model class rather than catching and checking errors?
Thanks!
This will allow multiple benefits.
Use cases:
AliasChoices and AliasGenerator
I need to validate a df that can have multiple names for a specific column. Let's say it could be col1 or col[1-10].
I have a df that is created, and the column names can be a few different things but the same column, and I need a way to validate all the possible names. I've tried multiple things, such as custom validators and creating a model dynamically, and they have yet to work.
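As a stopgap, the column names could be normalised before validation. The sketch below assumes that col[1-10] means "col followed by an integer from 1 to 10"; the names COLUMN_PATTERN, find_matching_columns and normalise_columns are hypothetical.

```python
import re

# Assumption: "col[1-10]" means "col" followed by an integer from 1 to 10.
COLUMN_PATTERN = re.compile(r"col(?:[1-9]|10)")


def find_matching_columns(columns, pattern=COLUMN_PATTERN):
    """Return the column names that fully match the accepted pattern."""
    return [name for name in columns if pattern.fullmatch(name)]


def normalise_columns(columns, canonical="col", pattern=COLUMN_PATTERN):
    """Rename exactly one accepted column to the canonical name."""
    matches = find_matching_columns(columns, pattern)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching column, found {matches}")
    return [canonical if name == matches[0] else name for name in columns]


print(normalise_columns(["id", "col7"]))  # ['id', 'col']
```

The model can then declare a single field named col, and the frame is renamed to the normalised names before calling validate.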