Comments (20)
This is an interesting topic... Let me try to give you some insights.
First of all, we need to define "fast". There are two possible objectives here: one is bandwidth (how many predictions we can run on given hardware over a fixed period of time), the other is latency (how much time it takes to get the answer). There is a third metric that is relevant to us: efficiency, which is how much resource is necessary to run one prediction.
tract's approach so far has been to focus on efficiency first. It fits well with the embedded approach, where resource is by definition limited. It's a good place to start: an efficiency gain is a net gain. It translates directly into both latency and bandwidth gains. It also translates to energy savings. I've been heard saying that efficiency gains help save the planet, and I'm only half joking.
Now multi-threading. Or concurrency, or parallelism. This is a different approach: throwing more resources at the problem. At best this comes at constant efficiency, and we are in the linear scalability domain: for instance, we allocate two cores to the prediction engine and get the latency divided by 2. But it's not easy.
Let's review the approaches:
- external concurrency: one model, one thread per prediction. Near-linear scalability: linear bandwidth gains, and latency is not impacted if done right (with just the right number of tasks). Quite natural for the server scenario: one query = one prediction = one thread = one core. Queries are enqueued if the core/thread pool is entirely busy.
As a matter of fact, merging queries in batches (of N) can help efficiency a bit.
How about internal concurrency now? Well, it's harder to discuss, as the network characteristics come into play. I will assume the focus here is on getting better latency.
- Slicing the input will not work in the general case, but may work for some networks or some parts of networks. Convolutions, for instance, are splittable over the spatial and/or temporal dimensions. It will help with latency if the split happens on a spatial dimension. LSTMs will typically block the split: their matrix products couple the whole depth together, and each time-axis loop iteration depends on the result of the previous one. So this approach needs some work so that an execution planner can figure out when it is possible to do something, and decide whether it is worth it.
- Network parallelism. That one is relatively easy to do, as this parallelism is expressed naturally in the network graph. In the current implementation of tract, it could be done entirely by modifying the SimplePlan. Whether or not it would be helpful depends on the network though; some networks are mostly linear. In the case of LSTMs, it may help the input multiplication part (as these are quite expensive and actually parallel). The matrix*vector product happening on C could also be parallelized, but as it tends to be a shorter operation, it is less likely to compensate for the switching overhead.
- Intra-op parallelization. Not too difficult to implement, as it is limited to the scope of one given operator; there is no network-wise logic. It will only be relevant for big and expensive operations (typically matrix-matrix products and convolutions). There is actually some work that can be done here: as tract's application field has been real-time voice on embedded devices, the matrix multipliers are geared towards smallish operands, and there is a lot that can be done to improve their efficiency on bigger matrices, even before discussing parallelization...
For the sake of completeness, there is another way to split networks that is relevant in real-time streaming applications: splitting a network roughly in two in the middle (assuming the interface is not too complicated). While this will not reduce latency in the best case, it may help if the network gets late at some point, by allowing it to work on two (or more) time slices in a pipelined fashion.
from tract.
The FusedSpec vector can be shared, as it's read-only. Then we craft a FusedKerSpec for each kernel call that contains a pointer to an OutputStoreKer. So each thread must own its own OutputStoreKer. It should be OK, as the OutputStoreKer is created inside the ScratchSpace buffer which, as already discussed, has to be per-thread.
Does that make sense?
@ro99 let's move the im2col issue to #570 and try to stick to general multithreading here.
Thanks for these insights! I hadn't really considered bandwidth vs. latency. That's an interesting distinction and yes, I would say I was referring to low latency as "fast".
One thing I didn't catch from your comment: do you actually plan to implement internal multithreading in tract, or will it stay single-core, safe to use from multiple threads, and focused purely on efficiency for the foreseeable future?
Either way, I noticed that I can try implementing multithreading like 0.) in tractjs without any changes to tract, by just slicing the input across the batch dimension externally to tract. I will try implementing that (see bminixhofer/tractjs#13), which should also give an impression of how ready wasm-bindgen is for multithreading.
My plans are not clear in the document because they're not clear in my head.
First, I will always put efficiency first. I will only do any internal parallelism if I am convinced the overhead is reasonable, or if there is an opt-out (or if it is opt-in).
Then there is the "work" thing :) Some of these approaches are a lot of work. If they start to make sense for my company, I will have no problem allocating work hours to them, but that's not necessarily the case right now.
So knowing that:
0/ is perfectly valid, and trivial or easy to do outside of tract. I am committed to keeping it possible to have a single RunnableModel used by several worker threads to run multiple predictions. The benefit is currently a bit limited for networks with unfixed dimensions, but I am determined to fix that soon-ish. So yeah, feel free to play with MT in tract.js, or on top of it; that's the right place to do it.
1/ Slicing. I don't think this one is really interesting.
2/ Parallel ops: this one is so easy that I may try it just for fun. Then who knows. But I will definitely keep the mono-thread runner too. The gains will obviously be limited to the network's parallelism, so some networks will not see a lot of improvement here.
3/ Intra-op: I can, and I think I will, allocate working time over the coming months to efficiency gains on computation, with a focus on bigger matrix multiplications. If what I have in mind works, it should improve performance on basically all networks while staying within reach of what I can achieve by myself. Once this is done, and only when it's done, it will make sense to move on and start spreading computations over multiple cores. Once again, if it happens it will have to be opt-in.
Thanks for the clarification! That makes a lot of sense.
Regarding opt-out / opt-in and keeping the focus on efficiency: Some popular libraries like ndarray (ndarray feature flags) have parallelism via rayon behind a feature flag, so that might be one solution.
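A Cargo feature gate for optional rayon-based parallelism, in the style ndarray uses, could look roughly like this (hypothetical feature and dependency names, just to illustrate the mechanism):

```toml
# Cargo.toml — sketch of an opt-in parallelism feature gate
[features]
default = []
multithread = ["rayon"]   # enabled with: cargo build --features multithread

[dependencies]
rayon = { version = "1", optional = true }
```

Code paths would then be selected with `#[cfg(feature = "multithread")]`, keeping the single-threaded runner as the default.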
I cannot fully agree that spawning multiple threads with multiple distinct inference tasks is the most efficient way to scale up, because almost all computers we use now have a superscalar design with caches (L1, L2, L3, etc.), and multiple threads running distinct tasks will race on the cache, leading to inefficient cache usage. There are multiple GEMM solutions which are highly optimized for multiple threads. Yes, it is obvious that batching brings higher latency. I would suggest considering adding support for batched multithreaded operations (GEMM, Conv, etc.) as an option, in case someone does not care about latency.
Do we have anything new regarding approach (3)? I am working on a network that I think would greatly benefit from intra-op parallelization.
Currently this network (basically a U-Net with a ResNet34 backbone) is suffering a lot during the decoding process:
[...]
6258.218 ms/i 14.2% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
4288.437 ms/i 9.7% FMA(F32) 92493840384 21.568 GF/s ┃┣ 172 LirMatMulUnary 450.matmatmul
Params(F32) 99792 ┃┃ ━━━ 1,99,1024,1024,F32
5647.222 ms/i 12.8% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
4684.515 ms/i 10.6% FMA(F32) 92493840384 19.745 GF/s ┃┣ 174 LirMatMulUnary 452.matmatmul
Params(F32) 99792 ┃┃ ━━━ 1,99,1024,1024,F32
┗┓
384.967 ms/i 0.9% ┣┻ 175 Add 454
0.005 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
464.897 ms/i 1.1% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
94.854 ms/i 0.2% FMA(F32) 103809024 1.094 GF/s ┣ 178 LirMatMulUnary output.matmul.matmatmul
Params(F32) 1584 ━━━ 1,1,1024,1024,F32
Most time consuming operations
* LirMatMulUnary 54 nodes: 21051.949 ms/i 47.6%
* Im2col 48 nodes: 19414.745 ms/i 43.9%
* MoveAxis 10 nodes: 2066.150 ms/i 4.7%
* MatMatMulPack 6 nodes: 645.305 ms/i 1.5%
* Add 22 nodes: 444.749 ms/i 1.0%
* Concat 5 nodes: 403.741 ms/i 0.9%
* MaxPool 1 nodes: 95.982 ms/i 0.2%
* Max 21 nodes: 49.004 ms/i 0.1%
* Mul 5 nodes: 42.148 ms/i 0.1%
* Reshape 6 nodes: 0.025 ms/i 0.0%
* Source 1 nodes: 0.002 ms/i 0.0%
The bottleneck happens at nodes 171-174, I think due to the bigger matrix computations. But just for comparison, onnxruntime takes about 20s to execute this model, while tract takes about 120s for the same. So maybe there is room for improvement here?
Nope, I did not find the time to try to improve intra-op (3) multi-threading yet. Some approaches are relatively easy; if anybody wants to give it a shot, I'm happy to provide guidance.
What kind of system did you pull these numbers from? I guess Intel, but how many cores, what type and frequency?
The system these numbers come from:
OS: macOS 11.4 20F71 x86_64
Host: MacBookPro13,3
Kernel: 20.5.0
CPU: Intel i7-6700HQ (8) @ 2.60GHz
GPU: Intel HD Graphics 530, AMD Radeon Pro 450
Memory: 5833MiB / 16384MiB
I am willing to give it a shot, but I would need some guidance indeed.
OK, so 20 GF/s is about what you can expect from one core on this CPU; at least there is nothing suspect here. You have 8 virtual cores, so we can expect a speedup of somewhere between 4 and 8 (depending on whether we're saturating memory access and/or the actual multiplier ports), and that should put us in the same ballpark as ONNX.
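The arithmetic behind that estimate can be checked against the profile numbers for node 172 above:

```rust
// Back-of-the-envelope speedup check, using the FMA count and single-core
// throughput reported by the profile for node 172 (LirMatMulUnary).
fn main() {
    let fma_per_run = 92_493_840_384.0_f64; // FMA(F32) count from the profile
    let one_core_fs = 21.568e9; // measured single-core throughput
    let t1 = fma_per_run / one_core_fs; // seconds on one core
    assert!((t1 - 4.288).abs() < 0.01); // matches the 4288.437 ms/i reading
    // a 4x..8x speedup on 8 virtual cores would land between these bounds:
    println!("1 core: {:.2}s, parallel: {:.2}s..{:.2}s", t1, t1 / 8.0, t1 / 4.0);
}
```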
So a good place to start is https://github.com/sonos/tract/blob/main/linalg/src/frame/mmm/mmm.rs#L268 . This doubly nested loop goes all over the output matrix, tile per tile. I think we can ignore the other single-level loops around it for starters (they deal with incomplete tiles on the right and bottom borders of the output matrix). So the first idea would be to create a task for each of these pairs of for_valid + kernel calls, put them in a pool executor of some sort, then wait for all of them to be done.
The main issue will be the scratch space: we need each thread to use a different one. I think the best way would be to introduce an intermediary object that holds a ThreadLocal. We will also need to do something about the prepare() call: it must happen once at the beginning of the operation in each thread. So one way would be to stop reusing the ScratchSpace between multiplications (tract-core explicitly calls allocate_scratch_space() at model startup, then run_with_scratch_space). I do need this optimisation for models which have a lot of smallish products, but we are trying to address the opposite case here, so reallocating the scratch space when we are going for a multi-threaded big op may just be OK performance-wise.
I suggest you do that first without worrying too much about breaking tract-core & co and my optimised use case: try to find something that works :) tract_linalg has good coverage of tests and benches, so you can experiment there directly. When we are satisfied, we can look for ways to make it opt-in: I will want it behind a --feature gate at least, as I don't want to add dozens of dependencies, so let's not worry about that too much from the start.
Thanks for the hints :) What about non_linear? Can we just put it behind an Arc, or should it be protected by a mutex..?
I am asking because of the way that matrix C is changed. I see that when we call for_valid, we store a *mut pointer to it in the OutputStoreKer. I am assuming that it is later changed when we call kernel (at the asm level).
I do not have experience with fused kernel ops, so I am not sure how safe it is to share non_linear in this context.
I am doing some tests using crossbeam and rayon, but I am not getting a lot of gain (1.5x and 1.3x respectively).
These are the links (e.g. crossbeam and e.g. rayon) in case you are interested.
rayon:
4907.905 ms/i 19.4% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
1279.879 ms/i 5.1% ┃┣ 172 LirMatMulUnary 450.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
4356.551 ms/i 17.2% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
1246.369 ms/i 4.9% ┃┣ 174 LirMatMulUnary 452.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
┗┓
108.617 ms/i 0.4% ┣┻ 175 Add 454
0.004 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
408.258 ms/i 1.6% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
42.977 ms/i 0.2% ┣ 178 LirMatMulUnary output.matmul.matmatmul
━━━ 1,1,1024,1024,F32
Most time consuming operations
* Im2col 48 nodes: 15725.109 ms/i 62.3%
* LirMatMulUnary 54 nodes: 6306.407 ms/i 25.0%
* MoveAxis 10 nodes: 1965.124 ms/i 7.8%
* MatMatMulPack 6 nodes: 568.207 ms/i 2.2%
* Concat 5 nodes: 374.580 ms/i 1.5%
* Add 22 nodes: 160.570 ms/i 0.6%
* MaxPool 1 nodes: 81.083 ms/i 0.3%
* Max 21 nodes: 45.211 ms/i 0.2%
* Mul 5 nodes: 31.548 ms/i 0.1%
* Reshape 6 nodes: 0.024 ms/i 0.0%
* Source 1 nodes: 0.001 ms/i 0.0%
crossbeam
4623.234 ms/i 16.7% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
1601.748 ms/i 5.8% ┃┣ 172 LirMatMulUnary 450.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
4680.580 ms/i 16.9% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
1571.774 ms/i 5.7% ┃┣ 174 LirMatMulUnary 452.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
┗┓
88.731 ms/i 0.3% ┣┻ 175 Add 454
0.005 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
398.826 ms/i 1.4% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
95.072 ms/i 0.3% ┣ 178 LirMatMulUnary output.matmul.matmatmul
━━━ 1,1,1024,1024,F32
Most time consuming operations
* Im2col 48 nodes: 16815.331 ms/i 60.9%
* LirMatMulUnary 54 nodes: 7557.583 ms/i 27.4%
* MoveAxis 10 nodes: 1953.166 ms/i 7.1%
* MatMatMulPack 6 nodes: 580.876 ms/i 2.1%
* Concat 5 nodes: 383.284 ms/i 1.4%
* Add 22 nodes: 151.883 ms/i 0.5%
* MaxPool 1 nodes: 80.546 ms/i 0.3%
* Max 21 nodes: 59.685 ms/i 0.2%
* Mul 5 nodes: 35.659 ms/i 0.1%
* Reshape 6 nodes: 0.032 ms/i 0.0%
* Source 1 nodes: 0.001 ms/i 0.0%
I am wondering if there is anything we can do about Im2col (LirMatMulUnary's time consumption decreased, but Im2col's increased).
Also, do you think there are any other points that we can change to take advantage of multithreading?
Oh! I had missed your previous message. Congrats, it's a nice speedup on LirMatMulUnary already.
I don't think there is anything that structurally prevents Im2Col from being run in parallel. That said, the code is quite complicated, with specific implementations for diverse ranks and padding modes.
The first question is: are you falling into one of the optimised cases (Valid1D, Valid2D, Patched2D), or are you getting the generic Patcher in Im2Col? I suspect you are hitting the generic case; otherwise Im2Col should not be that prominent (even before the parallel matmul).
By the way, have you checked the chunk numbers in your split? You very rightly put the multithreading hook between the two for-loops, but I would have thought a chunk of 1 to be reasonable (so one row / task).
Yeah, I tried with a chunk of 1 as well, but the results are almost the same. Regarding your question, I am falling into the Patcher::Padded2d case. Not sure what that means (performance-wise) though.
OK, this is a bit surprising. Can you share your model so I can have a look? We can probably try to make Padded2D concurrent at the ci loop level, but we would need to duplicate the writers and adjust their offsets accordingly.
tract's Im2Col actually does both the classic Im2Col and packing at the same time. Basically, the Writers come from linalg and are responsible for writing the data elements and keeping track of the right place in the packed tensor (the packing bit). The Patcher is responsible for scanning the input data (actually implementing the Im2Col part).
Sure, here is the link to the model.
In some cases, internal multithreading could be helpful. E.g. with batch_size = 1, max_length = 512, tract takes about 1.9 seconds to finish the forward pass of Albert, while the Python wrapper of ONNX takes 30ms (with sess_options.intra_op_num_threads = 6) to finish. Things get worse when incoming traffic is large, as the model is shared.
Do you have any idea on how can I speed up the whole process? @kali