Comments (9)
Thanks for opening this, it highlights an important distinction between the libraries. `msgspec.Struct` constructors do no runtime type validation (unlike those of `pydantic.BaseModel`). The argument for this is that type annotations + static analysis tools (or tests) should be sufficient for ensuring the correctness of your own code; you shouldn't have to pay the cost of runtime type checking for code you already know to be correct. As such, `msgspec` (currently) only supports type-level constraints that can be statically validated by tools like `mypy` or `pyright`. A constraint like `>= 0` can't be statically validated, so it's not clear what `msgspec` should do here.
Right now I have a few ideas:

- Don't support these features. I personally don't need them. If we think of `msgspec` as comparable to Rust's serde or Go's `encoding/json`, then we're already feature complete without them; those tools also don't handle value validation.
- We do add support, but the checks are only applied during decoding, not in the constructors. I'm pretty adamant that constructors should always be fast. Instead, we'd add a new function for running validators on existing data. This might look like:
```python
class PositivePoint(msgspec.Struct):
    x: Annotated[int, Ge(0)]
    y: Annotated[int, Ge(0)]

# No value validation happens here
x = PositivePoint(1, 2)

# This means that this won't error
bad = PositivePoint(-1, 0)

# Value validation checks must be explicitly run; this raises a ValidationError
msgspec.validate(bad)

points = [PositivePoint(1, 2), PositivePoint(3, 4), PositivePoint(5, 6)]

# For non-struct types, a type can be passed in explicitly
msgspec.validate(points, type=list[PositivePoint])

# Validators are also automatically run during decoding. This would raise a ValidationError
msgspec.json.decode(b'{"x": -1, "y": 0}', type=PositivePoint)
```
This lets the user control when (potentially expensive) validation checks are run, and by default only runs them when processing external inputs. If you know your own internal code is correct, there's no need to pay the runtime cost.
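These proposed semantics can be sketched in pure Python with stdlib `typing` introspection. The `Ge` marker and `validate` helper below are hypothetical stand-ins (not the msgspec API); the point is that construction stays cheap and constraint checks only run when explicitly requested:

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class Ge:
    """Hypothetical stand-in for an annotated-types style Ge constraint."""
    bound: int


class ValidationError(Exception):
    pass


def validate(obj):
    """Check Annotated constraints on obj's fields.

    The constructor never runs these checks, so the validation cost is
    only paid when the caller asks for it.
    """
    hints = get_type_hints(type(obj), include_extras=True)
    for name, hint in hints.items():
        if get_origin(hint) is Annotated:
            _, *meta = get_args(hint)
            value = getattr(obj, name)
            for m in meta:
                if isinstance(m, Ge) and not value >= m.bound:
                    raise ValidationError(f"{name} = {value} violates >= {m.bound}")
    return obj


@dataclass
class PositivePoint:
    x: Annotated[int, Ge(0)]
    y: Annotated[int, Ge(0)]


p = PositivePoint(1, 2)     # no validation cost here
validate(p)                 # passes
bad = PositivePoint(-1, 0)  # constructs fine
# validate(bad) would raise ValidationError
```

A real implementation would of course live in C and cache the compiled checks, but the split between cheap construction and explicit validation is the same.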
As for how to spell the validation checks, I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem). I also personally find the annotated type syntax to be a bit verbose, and if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional `Field` type anyway to encode that. Idk, there doesn't seem to be a concise way of spelling field annotations in Python currently.
Thoughts?
---
I think supporting these sorts of validations natively is important both for user experience and performance. I think providing validation and json schema is what made Pydantic successful, so there are obviously a lot of users that want these features in their data parsing/validation library of choice.
I 100% agree not validating constructors by default is the right move. I like your example where validation only happens explicitly or during decoding.
> I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem)
Yup, totally a chicken <-> egg problem. That said, Hypothesis and Pydantic v2 already plan on supporting `annotated-types`. That means FastAPI, etc. will support it, hence there will soon be a relatively large ecosystem. And you could always make it an optional feature: only support it if installed, if an extra is selected, etc.
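The optional-dependency pattern mentioned here is usually spelled as a guarded import. A generic sketch (not msgspec code; `annotated_types.BaseMetadata` is the real base class the package exports):

```python
# Support annotated-types constraints only when the package is installed.
try:
    import annotated_types  # optional extra
    HAS_ANNOTATED_TYPES = True
except ImportError:
    HAS_ANNOTATED_TYPES = False


def interpret_metadata(meta):
    """Interpret one metadata object if annotated-types is available.

    Returns True if the object was recognized and handled, False otherwise.
    Unknown metadata is ignored, as PEP 593 recommends.
    """
    if HAS_ANNOTATED_TYPES and isinstance(meta, annotated_types.BaseMetadata):
        ...  # translate into a fast internal check here
        return True
    return False
```

This keeps the hard dependency out of the core library while still letting constraint-aware installs get the feature.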
> I also personally find the annotated type syntax to be a bit verbose
One way to mitigate this is type aliases, which work great for many common constraints:

```python
from annotated_types import IsDigit

PhoneNumber = IsDigit[str]
```
The same could be done for `Positive` (`YearsOfExperience = Positive[int]`).
But yes, as soon as users need to input data (`Ge(0)`) it can get a bit verbose. PEP 593 explicitly calls this out, and I hope that at some point they'll add support for easier parametrization (`Ge[int, 0]`, so that you don't have to import and use `Annotated` everywhere).
> if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional Field type anyway to encode that
One thing you could do is make a `Field` object with an `__iter__` method that can be unpacked into multiple constraints:

```python
from msgspec import Field

PhoneNumber = Annotated[str, *Field(pattern=r"", description="cel", min_len=123)]
```

(see https://github.com/annotated-types/annotated-types/blob/main/annotated_types/__init__.py#L132-L154)

This is only valid in Python >3.10, though.
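A minimal runnable sketch of that unpackable `Field` idea, with hypothetical `Pattern`/`MinLen` markers (mirroring the `GroupedMetadata` approach linked above, not any real msgspec API):

```python
from dataclasses import dataclass
from typing import Annotated, get_args


@dataclass(frozen=True)
class Pattern:
    regex: str


@dataclass(frozen=True)
class MinLen:
    n: int


class Field:
    """Bundles several constraints plus extra metadata.

    Iterating a Field yields its constraints one by one, so on 3.11+
    `Annotated[str, *Field(...)]` flattens into separate metadata entries,
    while older Pythons keep the grouped form and consumers expand it.
    """

    def __init__(self, *, pattern=None, min_len=None, description=None):
        self._meta = []
        if pattern is not None:
            self._meta.append(Pattern(pattern))
        if min_len is not None:
            self._meta.append(MinLen(min_len))
        self.description = description

    def __iter__(self):
        return iter(self._meta)


# Grouped form, valid on any PEP 593-capable Python:
PhoneNumber = Annotated[str, Field(pattern=r"\d+", min_len=7)]

base, field = get_args(PhoneNumber)
constraints = list(field)  # expands to [Pattern('\\d+'), MinLen(7)]
```

Either spelling leaves the same flat list of constraint objects for an interpreter to walk, which is what makes the two forms interchangeable.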
In summary, yes, this isn't perfect, but I think it's the best option we have right now that doesn't do wonky stuff like mess with default values or create a library-specific way to attach metadata. And I believe the situation will get better in the future (if there is a future PEP to make it less verbose, if an ecosystem of common metadata arises, once everyone is on Python 3.11+ and can unpack metadata, etc.)
---
One thing I'll note about the annotated-types design that might not be immediately obvious: the validation implementation is not encoded into the metadata, which would allow msgspec to interpret `Ge(0)`, `Predicate(str.islower)`, or `Pattern(r"Foo")` into some fast C code. And if users keep a `Decoder` object around, I think it should be possible to "compile" a decoder ahead of time so that there is zero overhead if no constraints are included.
---
Hmmm, ok. A few questions:

- Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
- What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
- Can we get by without the `validate` function for now, only supporting validating constraints on decode?
---
> Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
I think it would be good for a lot of use cases, but certainly not all. Why do you want to avoid supporting arbitrary python callables?
> What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
I don't have a survey or anything, but I'd say `Len`, `Ge`/`Le`/`Gt`/`Lt`, and `Pattern` are probably the most common.
> Can we get by without the `validate` function for now, only supporting validating constraints on decode?
I'd say so, yes. I think that's how it will be used most of the time.
---
> Why do you want to avoid supporting arbitrary python callables?
Nothing against it in particular. However, the current way the internal type tree is represented makes supporting these tricky. All the builtin constraints can mostly only be applied to one type (and cases where they can apply to multiple are already explicitly forbidden in msgspec due to ambiguity). This makes our internal representation of these constraints easy.
If we support arbitrary user-defined constraints, then we'd need to modify our internals much more to be able to represent e.g.

```python
class Example:
    a: Annotated[int, Predicate(custom_1)] | Annotated[str, Predicate(custom_2)]
```
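For illustration, resolving a union like that at validation time means dispatching on the runtime type first and only then applying that branch's predicates, which is exactly the per-member bookkeeping being discussed. A stdlib-only sketch with a hypothetical `Predicate` marker:

```python
from dataclasses import dataclass
from typing import Annotated, Union, get_args, get_origin


@dataclass(frozen=True)
class Predicate:
    func: object


def check_union(value, union_hint):
    """Find the union member matching value's runtime type, then run
    that member's predicates. Each branch carries its own constraints,
    so the type tree must keep them separated per member."""
    for member in get_args(union_hint):
        if get_origin(member) is Annotated:
            base, *meta = get_args(member)
        else:
            base, meta = member, []
        if isinstance(value, base):
            return all(m.func(value) for m in meta if isinstance(m, Predicate))
    raise TypeError(f"{type(value).__name__} not in union")


Hint = Union[
    Annotated[int, Predicate(lambda v: v >= 0)],
    Annotated[str, Predicate(str.islower)],
]

assert check_union(3, Hint)
assert not check_union("ABC", Hint)
```

In a flat representation that maps "one constraint set per field" this per-branch pairing doesn't fit, which is presumably the internal complexity referred to above.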
---
Not sure I completely follow, but it sounds like it's internal complexity, which is a fair reason. Maybe custom predicates can be saved for later? I think they are a super valuable feature, but I do think the basic `Ge`-style ones are an easier place to start.
---
Yeah, it'd mess with the internals in a way that's doable but would require more thought. I don't think implementing any of the builtin constraints will require much of a refactor - the main issue there is how users spell them (tying in to #125).
---
Posting here because this is specific to `annotated-types`, although it is also related to the discussion in #125.
I think I at least found a solution for the discussion about `Annotated[..., Field(...)]` vs. `Annotated[..., *Field(...)]`: annotated-types/annotated-types@f59cf6d
With that change, anything that is compatible with `annotated-types` would be able to interpret `Annotated[..., Field(...)]`. So, for example, you could define a type `Annotated[int, msgspec.Field(ge=0)]` and Hypothesis or even Pydantic would be able to interpret it.
You could also choose to use the same object for the `foo: int = Field(ge=0)` syntax (that's what Pydantic does). Hypothesis and the rest of the `annotated-types` ecosystem would not recognize that form, but you'd give users the option to choose their preferred syntax.
---