Comments (9)
Thanks for opening this, it highlights an important distinction between the libraries. `msgspec.Struct` constructors do no runtime type validation (unlike those of `pydantic.BaseModel`). The argument for this is that type annotations + static analysis tools (or tests) should be sufficient for ensuring the correctness of your own code; you shouldn't have to pay the cost of runtime type checking for code you already know to be correct. As such, `msgspec` (currently) only supports type-level constraints that can be statically validated by tools like `mypy` or `pyright`. A constraint like `>= 0` can't be statically validated, so it's not clear what `msgspec` should do here.
Right now I have a few ideas:

- Don't support these features. I personally don't need them. If we think of `msgspec` as comparable to Rust's serde or Go's `encoding/json`, then we're already feature complete without them; those tools also don't handle value validation.
- We do add support, but the checks are only applied during decoding, not in the constructors. I'm pretty adamant that constructors should always be fast. Instead, we'd add a new function for running validators on existing data. This might look like:
```python
class PositivePoint(msgspec.Struct):
    x: Annotated[int, Ge(0)]
    y: Annotated[int, Ge(0)]

# No value validation happens here
x = PositivePoint(1, 2)

# This means that this won't error
bad = PositivePoint(-1, 0)

# Value validation checks must be explicitly run; this raises a ValidationError
msgspec.validate(bad)

points = [PositivePoint(1, 2), PositivePoint(3, 4), PositivePoint(5, 6)]

# For non-struct types, a type can be passed in explicitly
msgspec.validate(points, type=list[PositivePoint])

# Validators are also automatically run during decoding. This would raise a ValidationError
msgspec.json.decode(b'{"x": -1, "y": 0}', type=PositivePoint)
```
This lets the user control when (potentially expensive) validation checks are run, and by default only runs them when processing external inputs. If you know your own internal code is correct, there's no need to pay the runtime cost.
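These proposed semantics can be sketched in pure Python with stdlib `typing` introspection. The `Ge` marker and `validate` helper below are hypothetical stand-ins (not the msgspec API); the point is that construction stays cheap and constraint checks only run when explicitly requested:

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_origin, get_type_hints


@dataclass(frozen=True)
class Ge:
    """Hypothetical stand-in for an annotated-types style Ge constraint."""
    bound: int


class ValidationError(Exception):
    pass


def validate(obj):
    """Check Annotated constraints on obj's fields.

    The constructor never runs these checks, so the validation cost is
    only paid when the caller asks for it.
    """
    hints = get_type_hints(type(obj), include_extras=True)
    for name, hint in hints.items():
        if get_origin(hint) is Annotated:
            _, *meta = get_args(hint)
            value = getattr(obj, name)
            for m in meta:
                if isinstance(m, Ge) and not value >= m.bound:
                    raise ValidationError(f"{name} = {value} violates >= {m.bound}")
    return obj


@dataclass
class PositivePoint:
    x: Annotated[int, Ge(0)]
    y: Annotated[int, Ge(0)]


p = PositivePoint(1, 2)     # no validation cost here
validate(p)                 # passes
bad = PositivePoint(-1, 0)  # constructs fine
# validate(bad) would raise ValidationError
```

A real implementation would of course live in C and cache the compiled checks, but the split between cheap construction and explicit validation is the same.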
As for how to spell the validation checks, I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem). I also personally find the annotated type syntax to be a bit verbose, and if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional `Field` type anyway to encode that. Idk, there doesn't seem to be a concise way of spelling field annotations in Python currently.
Thoughts?
---
I think supporting these sorts of validations natively is important both for user experience and performance. I think providing validation and json schema is what made Pydantic successful, so there are obviously a lot of users that want these features in their data parsing/validation library of choice.
I 100% agree not validating constructors by default is the right move. I like your example where validation only happens explicitly or during decoding.
> I'm -0.1 on relying on an external library here, unless that library was commonly used in the ecosystem (I recognize this is a chicken <-> egg problem)
Yup, totally a chicken <-> egg problem. That said, Hypothesis and Pydantic v2 already plan on supporting `annotated-types`. That means FastAPI, etc. will support it, hence there will soon be a relatively large ecosystem. And you could always make it an optional feature: only support it if installed, if an extra is selected, etc.
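The optional-dependency pattern mentioned here is usually spelled as a guarded import. A generic sketch (not msgspec code; `annotated_types.BaseMetadata` is the real base class the package exports):

```python
# Support annotated-types constraints only when the package is installed.
try:
    import annotated_types  # optional extra
    HAS_ANNOTATED_TYPES = True
except ImportError:
    HAS_ANNOTATED_TYPES = False


def interpret_metadata(meta):
    """Interpret one metadata object if annotated-types is available.

    Returns True if the object was recognized and handled, False otherwise.
    Unknown metadata is ignored, as PEP 593 recommends.
    """
    if HAS_ANNOTATED_TYPES and isinstance(meta, annotated_types.BaseMetadata):
        ...  # translate into a fast internal check here
        return True
    return False
```

This keeps the hard dependency out of the core library while still letting constraint-aware installs get the feature.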
> I also personally find the annotated type syntax to be a bit verbose
One way to mitigate this is type aliases, which work great for many common constraints:

```python
from annotated_types import IsDigit

PhoneNumber = IsDigit[str]
```
The same could be done for `Positive` (`YearsOfExperience = Positive[int]`).
But yes, as soon as users need to input data (`Ge(0)`) it can get a bit verbose. PEP 593 explicitly calls this out, and I hope that at some point they'll add support for easier parametrization (`Ge[int, 0]`, so that you don't have to import and use `Annotated` everywhere).
> if we want to support other metadata on fields (e.g. title, description, ...), we might need an additional Field type anyway to encode that
One thing you could do is make a `Field` object with an `__iter__` method that can be unpacked into multiple constraints:

```python
from msgspec import Field

PhoneNumber = Annotated[str, *Field(pattern=r"", description="cel", min_len=123)]
```

(see https://github.com/annotated-types/annotated-types/blob/main/annotated_types/__init__.py#L132-L154)

This is only valid in Python >3.10, though.
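A minimal runnable sketch of that unpackable `Field` idea, with hypothetical `Pattern`/`MinLen` markers (mirroring the `GroupedMetadata` approach linked above, not any real msgspec API):

```python
from dataclasses import dataclass
from typing import Annotated, get_args


@dataclass(frozen=True)
class Pattern:
    regex: str


@dataclass(frozen=True)
class MinLen:
    n: int


class Field:
    """Bundles several constraints plus extra metadata.

    Iterating a Field yields its constraints one by one, so on 3.11+
    `Annotated[str, *Field(...)]` flattens into separate metadata entries,
    while older Pythons keep the grouped form and consumers expand it.
    """

    def __init__(self, *, pattern=None, min_len=None, description=None):
        self._meta = []
        if pattern is not None:
            self._meta.append(Pattern(pattern))
        if min_len is not None:
            self._meta.append(MinLen(min_len))
        self.description = description

    def __iter__(self):
        return iter(self._meta)


# Grouped form, valid on any PEP 593-capable Python:
PhoneNumber = Annotated[str, Field(pattern=r"\d+", min_len=7)]

base, field = get_args(PhoneNumber)
constraints = list(field)  # expands to [Pattern('\\d+'), MinLen(7)]
```

Either spelling leaves the same flat list of constraint objects for an interpreter to walk, which is what makes the two forms interchangeable.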
In summary, yes, this isn't perfect, but I think it's the best option we have right now that doesn't do wonky stuff like mess with default values or create a library-specific way to attach metadata. And I believe the situation will get better in the future (if there is a future PEP to make it less verbose, if an ecosystem of common metadata arises, once everyone is on Python 3.11+ and can unpack metadata, etc.)
---
One thing I'll note about the annotated-types design that might not be immediately obvious: the validation implementation is not encoded into the metadata, which would allow msgspec to interpret `Ge(0)`, `Predicate(str.islower)`, or `Pattern(r"Foo")` into some fast C code. And if users keep a `Decoder` object around, I think it should be possible to "compile" a decoder ahead of time so that there is zero overhead if no constraints are included.
---
Hmmm, ok. A few questions:

- Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
- What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
- Can we get by without the `validate` function for now, only supporting validating constraints on decode?
---
> Is supporting only the builtin constraints that are part of jsonschema (or similar) sufficient? Can we avoid supporting arbitrary python callables and still have most users be happy?
I think it would be good for a lot of use cases, but certainly not all. Why do you want to avoid supporting arbitrary python callables?
> What constraints are most common in your experience? How often does a field need a constraint as opposed to just a type?
I don't have a survey or anything, but I'd say `Len`, `Ge`/`Le`/`Gt`/`Lt`, and `Pattern` are probably the most common.
> Can we get by without the `validate` function for now, only supporting validating constraints on decode?
I'd say so, yes. I think that's how it will be used most of the time.
---
> Why do you want to avoid supporting arbitrary python callables?
Nothing against it in particular. However, the current way the internal type tree is represented makes supporting these tricky. All the builtin constraints can mostly only be applied to one type (and cases where they can apply to multiple are already explicitly forbidden in msgspec due to ambiguity). This makes our internal representation of these constraints easy.
If we support arbitrary user-defined constraints, then we'd need to modify our internals much more to be able to represent e.g.

```python
class Example:
    a: Annotated[int, Predicate(custom_1)] | Annotated[str, Predicate(custom_2)]
```
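For illustration, resolving a union like that at validation time means dispatching on the runtime type first and only then applying that branch's predicates, which is exactly the per-member bookkeeping being discussed. A stdlib-only sketch with a hypothetical `Predicate` marker:

```python
from dataclasses import dataclass
from typing import Annotated, Union, get_args, get_origin


@dataclass(frozen=True)
class Predicate:
    func: object


def check_union(value, union_hint):
    """Find the union member matching value's runtime type, then run
    that member's predicates. Each branch carries its own constraints,
    so the type tree must keep them separated per member."""
    for member in get_args(union_hint):
        if get_origin(member) is Annotated:
            base, *meta = get_args(member)
        else:
            base, meta = member, []
        if isinstance(value, base):
            return all(m.func(value) for m in meta if isinstance(m, Predicate))
    raise TypeError(f"{type(value).__name__} not in union")


Hint = Union[
    Annotated[int, Predicate(lambda v: v >= 0)],
    Annotated[str, Predicate(str.islower)],
]

assert check_union(3, Hint)
assert not check_union("ABC", Hint)
```

In a flat representation that maps "one constraint set per field" this per-branch pairing doesn't fit, which is presumably the internal complexity referred to above.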
---
Not sure I completely follow, but it sounds like it's internal complexity, which is a fair reason. Maybe custom predicates can be saved for later? I think they are a super valuable feature, but I do think the basic `Ge`-style ones are an easier place to start.
---
Yeah, it'd mess with the internals in a way that's doable but would require more thought. I don't think implementing any of the builtin constraints will require much of a refactor - the main issue there is how users spell them (tying in to #125).
---
Posting here because this is specific to `annotated-types`, although it is also related to the discussion in #125.
I think I at least found a solution for the discussion about `Annotated[..., Field(...)]` vs. `Annotated[..., *Field(...)]`: annotated-types/annotated-types@f59cf6d
With that change, anything that is compatible with `annotated-types` would be able to interpret `Annotated[..., Field(...)]`. So, for example, you could define a type `Annotated[int, msgspec.Field(ge=0)]` and Hypothesis or even Pydantic would be able to interpret it.
You could also choose to use the same object for the `foo: int = Field(ge=0)` syntax (that's what Pydantic does). Hypothesis and the rest of the `annotated-types` ecosystem would not recognize that form, but you'd give users the option to choose their preferred syntax.
---