Comments (13)
Hi @hhcs9527
If we just replace the to_json / from_json function with DataClassJSONMixin ones, the speed remains the same
So that I can help you, could you provide more details about how you did it? I just replaced DataClassJsonMixin
from dataclasses-json with DataClassJSONMixin
from mashumaro and got significant improvement in from_json / to_json:
Source code
from dataclasses import dataclass
from mashumaro.mixins.json import DataClassJSONMixin
from timeit import timeit
from dataclasses_json import DataClassJsonMixin
@dataclass
class StockPosition1(DataClassJSONMixin):
ticker: str
name: str
balance: int
@dataclass
class StockPosition2(DataClassJsonMixin):
ticker: str
name: str
balance: int
t1 = timeit(
"loader(data)",
globals={
"loader": StockPosition1.from_json,
"data": '{"ticker": "AAPL", "name": "Apple", "balance": 42}',
},
)
print('mashumaro from_json', t1)
t2 = timeit(
"loader(data)",
globals={
"loader": StockPosition2.from_json,
"data": '{"ticker": "AAPL", "name": "Apple", "balance": 42}',
},
)
print('dataclasses_json from_json', t2)
obj1 = StockPosition1("AAPL", "Apple", 42)
obj2 = StockPosition2("AAPL", "Apple", 42)
t1 = timeit("dumper()", globals={"dumper": obj1.to_json})
print('mashumaro to_json', t1)
t2 = timeit("dumper()", globals={"dumper": obj2.to_json})
print('dataclasses_json to_json', t2)
mashumaro from_json 1.439956832997268
dataclasses_json from_json 21.087273874989478
mashumaro to_json 1.4008548330166377
dataclasses_json to_json 13.31320291600423
the reason I believe is the difference in how dataclasses_json implements the to_dict / from_dict
AFAIK, dataclasses-json doesn't compile loader / dumper for specific schema as mashumaro does, which is what makes it so slow.
from mashumaro.
@Fatal1ty Thanks for the help.
the following example shows the concept of my problem.
I have a object from @DataClass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.
But I have no idea how to retrieve the data in DataClassJsonMixin to DataClassJSONMixin => which I believe is the main reason that DataClassJSONMixin is fast.
I've tried several ways
- init the object as a property of DataClassJSONMixin => fail
- replace the function in DataClassJsonMixin to_json / to_dict / from_json / from_dict with DataClassJSONMixin => works, which I have no idea why. However, in a real-life example, performance doesn't work like that.
- replace with customize serialization strategy => no idea how to serialize, all I know is to call DataClassJsonMixin.to_dict => but it's just not fast
Here is my example for way 2.
speed : my_portfolio3 (mix one) > my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)
import json
from enum import Enum
from typing import List, Any, Union, Generic, TypeVar, Tuple, cast
from dataclasses import dataclass, field
from dataclasses_json import dataclass_json, DataClassJsonMixin
from mashumaro.mixins.json import DataClassJSONMixin
from mashumaro.types import SerializationStrategy
T = TypeVar("T")
def timeDiff(obj):
import time
start = time.time()
json_str = obj.to_json()
end = time.time()
print("to_json cost:", (end - start)*10000)
start = time.time()
type(obj).from_json(json_str)
end = time.time()
print("from_json cost:", (end - start)*10000)
print()
class Currency(Enum):
USD = "USD"
EUR = "EUR"
@dataclass
class CurrencyPosition(DataClassJSONMixin):
currency: Currency
balance: float
@dataclass
class StockPosition(DataClassJSONMixin):
ticker: str
name: str
balance: int
@dataclass
class Portfolio(DataClassJSONMixin):
currencies: List[CurrencyPosition]
stocks: List[StockPosition]
my_portfolio = Portfolio(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
json_string = my_portfolio.to_json()
# print("my_portfolio og: ", json_string)
# print(json_string == Portfolio.from_json(my_portfolio.to_json()).to_json())
@dataclass_json
@dataclass
class CurrencyPosition2:
currency: Currency
balance: float
@dataclass_json
@dataclass
class Portfolio2():
currencies: List[CurrencyPosition]
stocks: List[StockPosition]
my_portfolio2 = Portfolio2(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
timeDiff(my_portfolio)
timeDiff(my_portfolio2)
my_portfolio3 = Portfolio2(
currencies=[
CurrencyPosition(Currency.USD, 238.67),
CurrencyPosition(Currency.EUR, 361.84),
],
stocks=[
StockPosition("AAPL", "Apple", 10),
StockPosition("AMZN", "Amazon", 10),
]
)
Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json
timeDiff(cast(DataClassJsonMixin, my_portfolio3))
Hope this helps, thanks!
from mashumaro.
Hope this helps, thanks!
Yes, thank you, this example shed some light here, but you made a few mistakes. Let me explain.
Portfolio2.to_dict = DataClassJSONMixin.to_dict Portfolio2.from_dict = DataClassJSONMixin.from_dict Portfolio2.to_json = DataClassJSONMixin.to_json Portfolio2.from_json = DataClassJSONMixin.from_json
You can't simply replace class attributes to the ones from DataClassJSONMixin
class because the base class methods from_dict
/ to_dict
do nothing, they need to be built for Portfolio2
specifically. If you add print
statement to you timeDiff
you will see that print(obj.to_json())
is {}
and print(type(obj).from_json(json_str))
is <mashumaro.mixins.json.DataClassJSONMixin object at 0x...>
which is not you were expected and which is the reason why my_portfolio3 (mix one) is the fastest in your inaccurate but still a benchmark.
@dataclass_json @dataclass class Portfolio2(): currencies: List[CurrencyPosition] stocks: List[StockPosition]
The second point. You're mixing Portfolio2 with CurrencyPosition and StockPosition not with CurrencyPosition2 and StockPosition2. This works for dataclasses_json because it fortunately assumes your dataclasses have methods with the same names that mashumaro introduces. But if you're going to mix the libraries it will definitely be error prone.
my_portfolio (DataClassJSONMixin) > my_portfolio2 (DataClassJsonMixin)
I strongly believe this is a typo and you actually didn't get this result. I run your code on my laptop with Python 3.11 and got the following numbers:
to_json cost: 0.11205673217773438
from_json cost: 0.17881393432617188
to_json cost: 2.0599365234375
from_json cost: 2.582073211669922
to_json cost: 0.030994415283203125
from_json cost: 0.030994415283203125
So, mashumaro is roughly 18x faster on to_json
and 14x faster on from_json
.
I have a object from @DataClass and DataClassJsonMixin. However, due to the speed issue, we would like to transform such object to DataClassJSONMixin.
Based on all the above I'd like to ask you, why don't you just remove @dataclass_json
decorator or DataClassJsonMixin
base usage and replace it with DataClassJSONMixin
(or even DataClassORJSONMixin
for even better performance)?
from mashumaro.
You can't simply replace class attributes to the ones from
DataClassJSONMixin
class because the base class methodsfrom_dict
/to_dict
do nothing, they need to be built forPortfolio2
specifically
Just for the reference, it's not documented and subjected to changes but specific from_dict
/ to_dict
/ from_json
/ to_json
methods can be compiled and added to Portfolio2
by using internal CodeBuilder
that is used by mashumaro mixins. So, instead of this:
Portfolio2.to_dict = DataClassJSONMixin.to_dict
Portfolio2.from_dict = DataClassJSONMixin.from_dict
Portfolio2.to_json = DataClassJSONMixin.to_json
Portfolio2.from_json = DataClassJSONMixin.from_json
You're better to do this:
from mashumaro.core.meta.code.builder import CodeBuilder
cb = CodeBuilder(Portfolio2) # if you need to call from_dict / to_dict in your code
cb.add_pack_method()
cb.add_unpack_method()
cb = CodeBuilder(
Portfolio2, format_name="json", encoder=json.dumps, decoder=json.loads
) # if you need to call from_json / to_json in your code
cb.add_pack_method()
cb.add_unpack_method()
It will give Portfolio2
the same (de)serialization speed as for Portfolio
but in real life you will be forced to do this for all dataclasses you depend on until the related issue is solved:
from mashumaro.
Hi @Fatal1ty ,
Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)
Not sure what you mean by the last paragraph.
from mashumaro.
Thanks for the advice, I just tried the methods, and it works! (which is the way I preferred, but the performance in the real-world case still has some issues)
Can you share more details about your performance issues?
P.S. I found your recent changes in flytekit and I haven't delved deeply into the internals of flytekit, but as far as I understand, to_python_value
/ to_literal
are called every time on values of the same type. If this is the case, then I strongly recommend not to do so, since compilation is expensive and must be performed once per data type. You should cache it somehow. With the upcoming release you will be able to check if the data type doesn't have internal method __mashumaro_from_json__
/ __mashumaro_to_json__
and compile it if needed, but it's a dirty hack that has its own cost.
Anyway, I'm still struggling to understand why you're trying to mix dataclasses_json with mashumaro in this weird way? To help me better understand your motivation, I will quote my question:
Based on all the above I'd like to ask you, why don't you just remove @dataclass_json decorator or DataClassJsonMixin base usage and replace it with DataClassJSONMixin (or even DataClassORJSONMixin for even better performance)?
I found chapter "Using Custom Python Objects" in flytekit documentation that says that you support dataclasses with dataclass_json mixed functionality. So, why not extend dataclasses support either to dataclasses with @dataclass_json
decorator (or DataClassJsonMixin
since it's interchangeable) or to dataclasses with mashumaro mixins?
from mashumaro.
You're right, we should start at the bottom to replace the DataClassJsonMixin with DataClassJSONMixin.
That's what I'm going to do, thanks for the help along the way.
Thanks for the kind support and suggestion, also the amazing implementation for DataClassJSONMixin!
I think this issue is resolved, thanks.
from mashumaro.
By the way, can you point out the path where you implement the mashumaro_from_json for any type?
Does it mean the compile, also does the compile happen in the create object process?
Thanks for the clarification.
from mashumaro.
By the way, can you point out the path where you implement the mashumaro_from_json for any type?
To be more accurate, method __mashumaro_from_json__
is assigned only to dataclass types, not to any type. CodeBuilder constructs the code of this method line by line specifically for a dataclass that it sees for the first time here. But I should emphasize that __mashumaro_*__
methods were introduced recently and are only available in master branch at this moment. I'm going to make a release in the next few days.
Does it mean the compile, also does the compile happen in the create object process?
The moment when compilation occurs depends on the use of the library. Let me elaborate this.
- The most common case case β a user has a dataclass with a mixin provided by mashumaro. This dataclass doesn't have forward references and lazy compilation is disabled (default behaviour). In this case methods
from_*
/to_*
are being compiled and assigned to the dataclass at import time. So, when you have a reference to this dataclass type or dataclass instance you can use itsfrom_*
/to_*
methods. - The same case as above but this time a dataclass has forward references (e.g. it has a field of a type that is declared below and is not yet evaluated) or lazy compilation is explicitly enabled. In this case the real method compilation will be delayed until it's called for the first time. So, when you have a reference to this dataclass type or dataclass instance you can still use its
from_*
/to_*
methods. To pull this off there will be one-shot methodfrom_*
/to_*
that will build the real method and replace itself with it. It can be dangerous if you keep a reference to this one-shot method because you will end up with repeated compilation on every call. To avoid this it's recommended either not to save a reference to a dataclass method, or use a lambda function just in case. - Manual compilation using internal CodeBuilder. It's pretty straightforward here. When you call
add_pack_method
/add_unpack_method
your dataclass will get new methods compiled and assigned. So, it's not recommended to do it every time in the name of performance.
Also, you can be interested in this issue:
While it's not done yet you can add support for dataclasses without any mixins as follows:
import json
from dataclasses import dataclass
from typing import Type, TypeVar
from mashumaro.core.meta.code.builder import CodeBuilder
T = TypeVar("T")
class JSONDecoder:
def __init__(self):
self._cache = {}
def from_json(self, type_: Type[T], data: str) -> T:
type_id = id(type_)
method = self._cache.get(type_id)
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
self._cache[type_id] = method
return method(data)
def from_json2(self, type_: Type[T], data: str) -> T:
method = getattr(type_, "__mashumaro_from_json__")
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
return method(data)
It has a small additional performance cost but is still faster than dataclasses-json:
Variant without mixins and with getattr
turned out to be even faster for json comparing to using DataClassJSONMixin
. It happens because DataClassJSONMixin
doesn't compile a specific from_json
method (as DataClassORJSONMixin does), instead it calls a specific compiled from_dict
method underhood.
Source code
import json
from dataclasses import dataclass
from typing import Type, TypeVar
from timeit import timeit
from mashumaro.core.meta.code.builder import CodeBuilder
from dataclasses_json import DataClassJsonMixin
from mashumaro.mixins.json import DataClassJSONMixin
T = TypeVar("T")
class JSONDecoder:
def __init__(self):
self._cache = {}
def from_json(self, type_: Type[T], data: str) -> T:
type_id = id(type_)
method = self._cache.get(type_id)
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
self._cache[type_id] = method
return method(data)
def from_json2(self, type_: Type[T], data: str) -> T:
method = getattr(type_, "__mashumaro_from_json__")
if method:
return method(data)
cb = CodeBuilder(type_, format_name="json", decoder=json.loads)
cb.add_unpack_method()
method = getattr(type_, "__mashumaro_from_json__")
return method(data)
@dataclass
class MyClass1:
x: int
@dataclass
class MyClass2(DataClassJSONMixin):
x: int
@dataclass
class MyClass3(DataClassJsonMixin):
x: int
json_decoder = JSONDecoder()
data = '{"x": "42"}'
json_decoder.from_json(MyClass1, data) # warmup
print(
"without mixins (1)",
timeit(
"decoder(cls, data)",
globals={
"decoder": json_decoder.from_json,
"data": data,
"cls": MyClass1,
},
),
)
print(
"without mixins (2)",
timeit(
"decoder(cls, data)",
globals={
"decoder": json_decoder.from_json2,
"data": data,
"cls": MyClass1,
},
),
)
print(
"DataClassJSONMixin",
timeit(
"decoder(data)",
globals={
"decoder": MyClass2.from_json,
"data": data,
},
),
)
print(
"DataClassJsonMixin",
timeit(
"decoder(data)",
globals={
"decoder": MyClass3.from_json,
"data": data,
},
),
)
without mixins (1) 1.086791625013575
without mixins (2) 1.0074725000013132
DataClassJSONMixin 1.0947887079964858
DataClassJsonMixin 13.494744666997576
I hope this will help you make the right decision.
P.S. You are free to close this issue if everything was answered.
from mashumaro.
Thanks for the explanation.
Does mashumaro support generated data class from its own JSON schema?
from mashumaro.
Does mashumaro support generated data class from its own JSON schema?
If you mean dataclass code generation from JSON schema, then answer is no. Itβs a job for another library as I see it.
from mashumaro.
Do you know which library? Thanks.
from mashumaro.
Do you know which library? Thanks.
Take a look at datamodel-code-generator
from mashumaro.
Related Issues (20)
- Allow customizing additionalProperties in jsonschemas HOT 3
- Enable generating jsonschema field restrictions HOT 7
- When generating a jsonschema, use "description" in field metadata HOT 1
- When using "all_refs=True" to generate a jsonschema, validation fails because the top-level object has invalid reference url HOT 2
- field metadata "description" does not show up in built json output HOT 1
- collections.OrderedDict is not hashable HOT 10
- Allow discrimination based on a key that is not part of the model HOT 8
- Raise the inner exception when parsing nested dataclasses HOT 2
- Performance of json_schema generation HOT 5
- collections.OrderedDict is not hashable when using inside typing_extensions.Annotated on Python < 3.9
- Overridden serialization methods doesn't work in JSON Schema for optional fields and inner dataclasses
- Third party default is not reflected in the generated JSON Schema
- Add support for ForwardRef HOT 6
- Drop Python 3.7 support
- Allow basic types used as serialization methods to be reflected in the JSON schema
- Switch to ruff from flake8
- Comprehension inlining on Python 3.12.0rc1 breaks some tests HOT 1
- Add support for Python 3.12
- Add support for PEP 696 HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from mashumaro.