Comments (14)
I think the "shape" system I briefly described in TkTech/pysimdjson#74 (comment) would work for the case you're describing. The shape is compiled into a simple bytecode in Python, and run in Cython/C. For many repeated structure transformations it's usually 1 or 2 orders of magnitude faster.
If you could provide a few examples of exactly how you're trying to transform the JSON, I can use it as an example use-case and ensure it performs well.
from aiodynamo.
Or https://github.com/TeskaLabs/cysimdjson for that matter.
Then, if that's promising, we could re-code dy2py in Cython.
Just letting us supply our own `loads`/`dumps` would be a great start; plugging in ujson or orjson is a large speedup in JSON parsing, and it's essentially free.
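A minimal sketch of what pluggable (de)serializers could look like. The `Client` class, its parameters, and `parse_response` are hypothetical names for illustration, not aiodynamo's actual API; the defaults stay on the stdlib so nothing changes unless the caller opts in:

```python
import json
from typing import Any, Callable

class Client:
    """Hypothetical client accepting user-supplied JSON loads/dumps callables."""

    def __init__(
        self,
        loads: Callable[[str], Any] = json.loads,
        dumps: Callable[[Any], str] = json.dumps,
    ) -> None:
        self.loads = loads
        self.dumps = dumps

    def parse_response(self, body: str) -> Any:
        # The client never hardcodes json.loads; it calls whatever was injected.
        return self.loads(body)

# Default stdlib behaviour:
client = Client()
assert client.parse_response('{"Count": 1}') == {"Count": 1}

# Drop-in replacement (note orjson.dumps returns bytes, hence the decode):
# import orjson
# client = Client(loads=orjson.loads, dumps=lambda o: orjson.dumps(o).decode())
```

The key design point is that the library calls `self.loads`/`self.dumps` everywhere instead of the module-level `json` functions, so swapping parsers requires zero changes to call sites.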
@Tinche nice stab, but alas, Python's json is good enough. What's slow is the conversion from Amazon-style `{"foo": {"S": "bar"}}` to native Python `{"foo": "bar"}`.
It may seem trivial, but it turns out that processing millions of conversions like this takes time.
A table scan (or query) delivers thousands of items, and each item often has dozens of fields, some of them compound (arrays, mappings), so in the end there are simply too many operations to perform at the Python level.
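For concreteness, the conversion in question looks roughly like this. This is a hedged sketch covering only the common attribute tags, not aiodynamo's actual implementation, which also handles binary types, number sets, and error cases:

```python
def deserialize(attr_value):
    """Convert one DynamoDB attribute value, e.g. {"S": "bar"}, to a native Python value."""
    # Each attribute value is a single-key dict: the key is the type tag.
    (tag, value), = attr_value.items()
    if tag == "S":
        return value
    if tag == "N":
        # DynamoDB sends numbers as strings.
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [deserialize(v) for v in value]
    if tag == "M":
        return {k: deserialize(v) for k, v in value.items()}
    if tag == "SS":
        return set(value)
    raise ValueError(f"unsupported tag {tag!r}")

item = {"foo": {"S": "bar"}, "n": {"N": "42"}, "tags": {"L": [{"S": "a"}]}}
assert {k: deserialize(v) for k, v in item.items()} == {"foo": "bar", "n": 42, "tags": ["a"]}
```

With dozens of fields per item and thousands of items per scan, this recursion runs millions of times, which is why it dominates over JSON parsing itself.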
@dimaqq I don't know how to optimize the JSON format Dynamo expects, but that's another issue. I do know ujson is significantly faster than the stdlib json, and there's a good chance it can be used as a drop-in with zero changes. So that's a free win.
There's a set of benchmarks in this project, some don't require a dynamo server: https://github.com/HENNGE/aiodynamo/tree/master/benchmarks/deserialize
@Tinche would you like to run these against the main branch and against a temporary branch that uses a different library and post the results?
@dimaqq Sure, sounds like a plan.
I went with orjson in this case. I actually adapted the signing benchmark, since that's the one that touches JSON.
```
$ pyperf compare_to before.json after.json
Mean +- std dev: [before] 71.6 us +- 1.7 us -> [after] 58.1 us +- 1.1 us: 1.23x faster
```
Seems worthwhile to me.
the signing benchmark is valid (e.g. if someone loads a lot of data into Dynamo), but not the most important one, in my opinion.
the important bit is deserialisation, because that is the bottleneck in query/scan operations, and it cannot be trivially parallelised.
the deserialise benchmark doesn't even use JSON; rather, it's about Python code mangling AWS-style dicts into Python-style dicts. thus the important benchmark is query, and that needs to be run against well-provisioned cloud Dynamo :pain:
(alternatively, we could set up a dummy HTTP server that returns precooked JSON responses... we don't have that now, but it could be done).
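A dummy server like that can be sketched with the stdlib alone. The payload shape below is a made-up stand-in for a DynamoDB Query/Scan reply, not an existing aiodynamo fixture:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Precooked response standing in for a DynamoDB Query/Scan reply (hypothetical payload).
PAYLOAD = json.dumps({"Items": [{"foo": {"S": "bar"}}] * 1000, "Count": 1000}).encode()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):  # DynamoDB's HTTP API uses POST
        # Drain the request body, then always return the same canned response.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.send_header("Content-Type", "application/x-amz-json-1.0")
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):  # keep benchmark output clean
        pass

# Bind an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
req = urllib.request.Request(url, data=b"{}", method="POST")
body = urllib.request.urlopen(req).read()
assert json.loads(body)["Count"] == 1000
server.shutdown()
```

A benchmark can then point the client at `server.server_port` and measure deserialisation without the network or a real Dynamo instance in the loop.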
I'll see if I can run some benchmarks
Early results:
```
$ pyperf compare_to aiodynamo_mock-without-orjson.json aiodynamo_mock-with-orjson.json
Mean +- std dev: [aiodynamo_mock-without-orjson] 108 ms +- 5 ms -> [aiodynamo_mock-with-orjson] 110 ms +- 5 ms: 1.02x slower
```
basically there's no difference for query or scan, which are dominated by deserialisation rather than JSON parsing.
(tested with mock http client and ~3MB response)
I mean yeah, I guess if you use a 3 megabyte payload the majority of the time is spent elsewhere. The vast majority of my payloads are much, much smaller though, so JSON decoding does play a role.
We can talk about optimizing stuff in https://github.com/HENNGE/aiodynamo/blob/963a6baecb7782fb5820179e2ec0c041a527d02e/src/aiodynamo/utils.py in another ticket. I might be able to help shave off some microseconds on some of these, which adds up.
I imagine network latency would dominate for small payloads.
Consider AWS's built-in monitoring: server-side GetItem latency is typically 4~15 ms, and latencies under 1 ms aren't even reported.
Sure, but I can't do anything about network latency. Whereas if my JSON processing is a little more efficient, there's a little less latency for the endpoint, and my asyncio service is free to dedicate time to the next request sooner. After running asyncio at scale, I've found CPU time to be a very precious resource.
Look, I'm not going to pretend this is going to be a major speedup. I am going to claim it's a very minor win that's essentially free. If you folks feel that's not the case (either it's not a win at all or it's not essentially free) I can respect that decision, it's your call after all :)
I can tell you personally, the first thing I do for our services is look at replacing the use of the stdlib json with something else, since it's very easy to do and does provide a small speedup. (Unless the service is running Pypy, that's another story.)
I think this is pretty good now.