Comments (8)
Hey @mishamsk
First of all, congratulations on your project publication! I'll consider using it when I need to do some parsing stuff.
When I was developing discriminator feature, I thought about adding an attribute to Discriminator
that would keep a tag generator function. In the current implementation it's hardcoded to take the tag from a dataclass field which you specifed in Discriminator
object. For example, if we have Discriminator(field="__type", include_subtypes=True)
, the following variant registration code will be generated:
for variant in iter_all_subclasses(__main__.Node):
try:
variants_map[variant.__dict__['__type']] = variant
except KeyError:
continue
By introducing a tag generator function we can change it. For example, if we have Discriminator(field="__type", variant_tag_getter=lambda cls: cls.__name__, include_subtypes=True)
the following variant registration code will be generated:
for variant in iter_all_subclasses(__main__.Node):
try:
variants_map[variant_tag_getter(variant)] = variant
except KeyError:
continue
I'm bad at naming but I guess that variant_tag_getter
will more clearly indicate the purpose instead of variant_tagger_fn
which may imply the "post serialize hook" functionality.
The question is, do we need or will we need functionality of automatic adding tags to all variants on serialization, so we could avoid using __post_serialize__
hooks. We can add a bool parameter add_tag_on_serialization
(just a working name) to Discriminator
for this but let's keep it for a separate issue.
What do you think? Maybe I'm missing something and you have your own vision of solving this issue.
from mashumaro.
@Fatal1ty first thanks for the wishes! it took me way to much time;-)
rgd your design suggestion, I have a couple of thoughts.
First, rgd my specific use case. I think it's a distinct usage pattern from a "model a messaging protocol and parse". Instead it's more of a "pickle" use case. I am controlling both serialization and deserialization, so do not care about validations or trying to decode data originating from some external system. I have a hunch that only this kind of usage pattern where one can expect a literal class name in serialized data. None of the public specs, except the ones used for interoperability, would ever dictate to store a literal class name in serialized data, because class name is obviously implementation specific.
So if you think that mashumaro is commonly used for such things, it may be worth implementing explicitly, instead of a "generic" approach. But you'd know better if this is the case. My usage pattern may be rare.
Rgd your proposal for the universal solution. I'd definitely say that whatever it will be, it's going to be the harder part to grasp for users. I had to read mashumaro's code to understand where this "getter" stuff will pop-up. I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field
value, and allow callable?
I do think that using tag
instead of field
would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...
from mashumaro.
I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field value, and allow callable?
Assuming we are using JSON serialization format, we need two things:
- A JSON key name that holds a tag value in the serialized data. We will take this tag value on deserialization and find it in the "tag → class" mapping to determine which variant class should be deserialized.
- A function that returns a tag value for a class so that we could go through the class hierarchy and create "tag → class" mapping.
Currently we can configure only first item through Discriminator "field" attribute. The second item is hardcoded:
mashumaro/mashumaro/core/meta/types/unpack.py
Lines 347 to 350 in c0a4cf8
Semantically this function (I called it variant_tagger_fn
in my proposal) returns a value for the class level attribute with the name discriminator.field
:
def variant_tagger_fn(cls):
return cls.__dict__[discriminator.field]
In your case you don't have class level attributes containing unique literals, so you could use the following function instead to get dynamic tag values:
def variant_tagger_fn(cls):
return cls.__name__ # or something more practical
I do think that using tag instead of field would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...
msgspec has its tag_field
parameter which semantically is the same as discriminator.field
in mashumaro. The difference is that by default msgspec consider the class name as a tag value, and mashumaro consider the value of a specific class level attribute as a tag value. My proposal is to allow the modification of this "tagger".
from mashumaro.
Oh, sorry, I wasn't thinking straight. Yes this makes total sense.
Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:
- No attribute - assume subclass based discrimination
- Plain string - current logic, implicitly assume that the provide string is a key in
__dict__
(which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that) - A tuple of a string & callable - the new logic, which could also solve for slotted classes
I may be biased, but the word variant
doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn
is better than not having anything at all;-)
sample code - slotted dataclasses do not have __dict__
on class itself too!
from dataclasses import dataclass
from typing import ClassVar
@dataclass(slots=True)
class Test:
x: ClassVar[int] = 0
y: int = 0
print("__dict__" in dir(Test)) # False
print("__dict__" in dir(Test())) # False
from mashumaro.
@mishamsk Sorry for delay, I was busy at work.
Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:
- No attribute - assume subclass based discrimination
- Plain string - current logic, implicitly assume that the provide string is a key in
__dict__
(which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that)- A tuple of a string & callable - the new logic, which could also solve for slotted classes
It will work but I don't see any point to make one argument to accept different value types instead of using two arguments. It will also be complicated to explain the use of this parameter in the documentation. Let's make it simple.
I may be biased, but the word variant doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn is better than not having anything at all;-)
I agree that it doesn't sound very well but I didn't manage to find a better word. I do think this complexity has roots in using discriminator for both subclass deserialization and for discriminated union deserialization. If it was only for subclass deserilizaiton then I would called it subclass_tagger_fn
. In my defense, I found some articles where the term "variant" is used for the same things:
- https://www.cs.umd.edu/projects/cyclone/online-manual/main-screen004.html
- https://docs.pydantic.dev/latest/usage/types/unions/#discriminated-unions-aka-tagged-unions
- https://tech.preferred.jp/en/blog/python-exhaustive-union-match/
- https://v2.ocaml.org/manual/polyvariant.html
- current logic, implicitly assume that the provide string is a key in
__dict__
(which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that
I grabbed my head and almost believed it, but it made me go to check. Fortunately for us, your assumption is only partially correct. The truth is slotted dataclass instances don't have __dict__
attribute but class objects do have it. Compare this:
print("__dict__" in dir(Test)) # False
print(Test.__dict__) # {'__module__': '__main__', '__annotations__': {'x': typing.ClassVar[int], ...
You might be curious why I even rely on __dict__
attribute when retrieving a tag. The reason is that I wanted to support intermediate nodes in class tree that don't have explicit tag defined and are only used as a base class for their nodes. If we were using getattr(variant_cls, tag)
instead of variant_cls.__dict__[tag]
we would get a value inherited from an ancestor for those intermediate nodes and overwrite a class registered for the tag.
from mashumaro.
nothing to apologize for! appreciate additional insights.
Sorry for the false alarm, didn’t know that dir had completely different behavior based on what’s being inspected.
Regarding the naming, as I’ve said, I do not have a strong opinion. You’ve build & maintained this library all along, so you’ll know better what works for the community. The only thing I can add, is that I’ve been trying to find other libs that supported auto-subclass (de)serialization, and couldn’t find any. Seems odd to me, that there is nothing in-between pickle and strictly typed validation/serialization (pydantic, msgspec, cattrs etc). So having such a feature in mashumaro would further set it apart!
Also, this change seems relatively simple. Probably even simpler than #136. I am on vacation, so am away from coding for the sake of the family, but I may get some time to draft a PR when I am back if you haven’t done it yourself by then.
from mashumaro.
It was easy to implement :)
from mashumaro.
@Fatal1ty nice! you are definitely moving faster than I do;-)
from mashumaro.
Related Issues (20)
- Using Union with int/float casts to whichever appears first HOT 3
- Not parsing Generics correctly HOT 2
- Unserializable field in 3.12 if defined as a Generic TypeVar with mixin bounds HOT 6
- Allow propagation of class based discriminator settings to subclasses HOT 3
- Reject extra keys on deserialization HOT 7
- Investigate support for recursive Union types HOT 2
- Supports `numpy.ndarray` type for `orjson` HOT 2
- Add support for PEP 695
- Union type of [int | float] not serialized correctly HOT 2
- Add an alternative way to assign a field alias with annotations
- Add support for PEP 696
- Annotated SerializationStrategy used as a field serialization strategy leads to RecursionError
- Suppress warning about Union type HOT 4
- Take description from docstring HOT 3
- Incorrect type annotation for the result of `to_json` in `DataClassORJSONMixin`
- Different results between standard library json and orjson HOT 3
- Python 3.12.4 Compatibility
- Some tests error out with typing_extensions>=4.12
- More issues on Python 3.12.4 HOT 5
- [BUG] to_msgpack doesn't behave properly with Discriminator(include_subtypes=True) HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from mashumaro.