Comments (20)
This is an interesting topic... Let me try to give you some insights.
First of all, we need to define "fast". There are two possible objectives here: one is bandwidth (how many predictions we can run on given hardware over a fixed period of time), the other is latency (how much time it takes to get the answer). There is a third metric that is relevant to us: efficiency, which is how much resource is necessary to run one prediction.
tract's approach so far has been to focus on efficiency first. It fits well with the embedded approach, where resource is by definition limited. It's a good place to start: an efficiency gain is a net gain. It translates directly into both latency and bandwidth gains. It also translates to energy savings. I've been heard saying that efficiency gains help save the planet, and I'm only half joking.
Now multi-threading. Or concurrency, or parallelism. This is a different approach: throwing more resources at the problem. At best this comes at constant efficiency, and we are in the linear scalability domain: for instance, we allocate two cores to the prediction engine and get the latency divided by 2. But it's not easy.
Let's review the approaches:
- external concurrency: one model, one thread per prediction. Near-linear scalability: linear bandwidth gains, and latency is not impacted if done right (with just the right number of tasks). Quite natural for the server scenario: one query = one prediction = one thread = one core. Queries are enqueued if the core/thread pool is entirely busy.
As a matter of fact, merging queries in batches (of N) can help efficiency a bit.
How about internal concurrency now? Well, it's harder to discuss, as the network characteristics come into play. I will assume the focus here is on getting better latency.
- Slicing the input will not work in the general case, but may work for some networks or some parts of networks. Convolutions, for instance, are splittable over the spatial and/or temporal dimensions. It will help with latency if the split happens on a spatial dimension. LSTMs will typically block the split: their matrix products couple the whole depth together, and each time-axis loop iteration depends on the result of the previous one. So this approach needs some work so that an execution planner can figure out when it is possible to do something, and decide whether it is worth it.
- Network parallelism. That one is relatively easy to do, as this parallelism is expressed naturally in the network graph. In the current implementation of tract, it could be done entirely by modifying the SimplePlan. Whether or not it would be helpful depends on the network though; some networks are mostly linear. In the case of LSTMs, it may help the input multiplication part (as these are quite expensive and actually parallel). The matrix*vector product happening on C could also be parallelized, but as it tends to be a shorter operation, it is less likely to compensate for the switching overhead.
- Intra-op parallelization. Not too difficult to implement, as it is limited to the scope of one given operator; there is no network-wise logic. It will only be relevant for big and expensive operations (typically matrix-matrix products and convolutions). There is actually some work that can be done here: as tract's application field has been real-time voice on embedded devices, the matrix multipliers are geared towards smallish operands, and there is a lot that can be done to improve their efficiency on bigger matrices, even before discussing parallelization...
For the sake of completeness, there is another way to split networks that is relevant in real-time streaming applications: splitting a network roughly in two in the middle (assuming the interface is not too complicated). While this will not reduce latency in the best case, it may help if the network gets late at some point, by allowing it to work on two (or more) time slices in a pipelined fashion.
from tract.
The FusedSpec vector can be shared, as it's read-only. Then we craft a FusedKerSpec for each kernel call that contains a pointer to an OutputStoreKer. So each thread must own its own OutputStoreKer. It should be OK, as the OutputStoreKer is created inside the ScratchSpace buffer which, as already discussed, has to be per-thread.
Does that make sense?
@ro99 let's move the im2col issue to #570 and try to stick to general multithreading here.
Thanks for these insights! I hadn't really considered bandwidth vs. latency. That's an interesting distinction and yes, I would say I was referring to low latency as "fast".
One thing I didn't catch from your comment: do you actually plan to implement internal multithreading in tract, or will it stay single-core, safe to use from multiple threads, and focused purely on efficiency for the foreseeable future?
Either way, I noticed that I can try implementing multithreading like 0.) in tractjs without any changes to tract, by just slicing the input across the batch dimension externally to tract. I will try implementing that (see bminixhofer/tractjs#13), which should also give an impression of how ready wasm-bindgen is for multithreading.
My plans are not clear in the document because they're not clear in my head.
First, I will always put efficiency first. I will only do any internal parallelism if I am convinced the overhead is reasonable, or if there is an opt-out (or if it is opt-in).
Then there is the "work" thing :) Some of these approaches are a lot of work. If they start to make sense for my company, I will have no problem allocating work hours to them, but that's not necessarily the case right now.
So knowing that:
0/ is perfectly valid, and trivial or easy to do outside of tract. I am committed to keeping it possible to have a single RunnableModel used by several worker threads to run multiple predictions. The benefit is currently a bit limited for networks with unfixed dimensions, but I am determined to fix that soon-ish. So yeah, feel free to play with MT in tract.js, or on top of it; that's the right place to do it.
1/ Slicing. I don't think this one is really interesting.
2/ Parallel ops: this one is so easy that I may try it just for fun. Then who knows. But I will definitely keep the mono-thread runner too. The gains will obviously be limited to the network's parallelism, so some networks will not see a lot of improvement here.
3/ Intra-op: I can, and I think I will, allocate working time over the coming months to efficiency gains on computation, with a focus on bigger matrix multiplications. If what I have in mind works, it should improve performance on basically all networks while staying within reach of what I can achieve by myself. Once this is done, and only when it's done, it will make sense to move on and start spreading computations over multiple cores. Once again, if it happens it will have to be opt-in.
Thanks for the clarification! That makes a lot of sense.
Regarding opt-out / opt-in and keeping the focus on efficiency: Some popular libraries like ndarray (ndarray feature flags) have parallelism via rayon behind a feature flag, so that might be one solution.
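A Cargo feature gate for optional rayon-based parallelism, in the style ndarray uses, could look roughly like this (hypothetical feature and dependency names, just to illustrate the mechanism):

```toml
# Cargo.toml — sketch of an opt-in parallelism feature gate
[features]
default = []
multithread = ["rayon"]   # enabled with: cargo build --features multithread

[dependencies]
rayon = { version = "1", optional = true }
```

Code paths would then be selected with `#[cfg(feature = "multithread")]`, keeping the single-threaded runner as the default.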
I cannot fully agree that spawning multiple threads with multiple distinct inference tasks is the most efficient way to scale up, because almost all computers we use now have a superscalar design with caches (L1, L2, L3, etc.), and multiple threads running distinct tasks will race on the cache, leading to inefficient cache usage. There are multiple GEMM solutions which are highly optimized for multiple threads. Yes, it is obvious that batching brings higher latency. I would suggest considering adding support for batched multithreaded operations (GEMM, Conv, etc.) as an option, in case someone does not care about latency.
Do we have anything new regarding approach (3)? I am working on a network that I think would greatly benefit from intra-op parallelization.
Currently this network (basically a U-Net with a ResNet34 backbone) is suffering a lot during the decoding process:
[...]
6258.218 ms/i 14.2% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
4288.437 ms/i 9.7% FMA(F32) 92493840384 21.568 GF/s ┃┣ 172 LirMatMulUnary 450.matmatmul
Params(F32) 99792 ┃┃ ━━━ 1,99,1024,1024,F32
5647.222 ms/i 12.8% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
4684.515 ms/i 10.6% FMA(F32) 92493840384 19.745 GF/s ┃┣ 174 LirMatMulUnary 452.matmatmul
Params(F32) 99792 ┃┃ ━━━ 1,99,1024,1024,F32
┗┓
384.967 ms/i 0.9% ┣┻ 175 Add 454
0.005 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
464.897 ms/i 1.1% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
94.854 ms/i 0.2% FMA(F32) 103809024 1.094 GF/s ┣ 178 LirMatMulUnary output.matmul.matmatmul
Params(F32) 1584 ━━━ 1,1,1024,1024,F32
Most time consuming operations
* LirMatMulUnary 54 nodes: 21051.949 ms/i 47.6%
* Im2col 48 nodes: 19414.745 ms/i 43.9%
* MoveAxis 10 nodes: 2066.150 ms/i 4.7%
* MatMatMulPack 6 nodes: 645.305 ms/i 1.5%
* Add 22 nodes: 444.749 ms/i 1.0%
* Concat 5 nodes: 403.741 ms/i 0.9%
* MaxPool 1 nodes: 95.982 ms/i 0.2%
* Max 21 nodes: 49.004 ms/i 0.1%
* Mul 5 nodes: 42.148 ms/i 0.1%
* Reshape 6 nodes: 0.025 ms/i 0.0%
* Source 1 nodes: 0.002 ms/i 0.0%
The bottleneck happens at nodes 171-174, I think due to the bigger matrix computations. But just for comparison, onnxruntime takes about 20s to execute this model, while tract takes about 120s for the same. So maybe there is room for improvement here?
Nope, I did not find the time to try to improve intra-op (3) multi-threading yet. Some approaches are relatively easy; if anybody wants to give it a shot, I'm happy to provide guidance.
What kind of system did you pull these numbers from? I guess Intel, but how many cores, what type and frequency?
The system these numbers come from:
OS: macOS 11.4 20F71 x86_64
Host: MacBookPro13,3
Kernel: 20.5.0
CPU: Intel i7-6700HQ (8) @ 2.60GHz
GPU: Intel HD Graphics 530, AMD Radeon Pro 450
Memory: 5833MiB / 16384MiB
I am willing to give it a shot, but I would need some guidance indeed.
OK, so 20 GF/s is about what you can expect from one core on this CPU; at least there is nothing suspect here. You have 8 virtual cores, so we can expect a speedup of somewhere between 4 and 8 (depending on whether we're saturating memory access and/or the actual multiplier ports), and that should put us in the same ballpark as ONNX.
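The arithmetic behind that estimate can be checked against the profile numbers for node 172 above:

```rust
// Back-of-the-envelope speedup check, using the FMA count and single-core
// throughput reported by the profile for node 172 (LirMatMulUnary).
fn main() {
    let fma_per_run = 92_493_840_384.0_f64; // FMA(F32) count from the profile
    let one_core_fs = 21.568e9; // measured single-core throughput
    let t1 = fma_per_run / one_core_fs; // seconds on one core
    assert!((t1 - 4.288).abs() < 0.01); // matches the 4288.437 ms/i reading
    // a 4x..8x speedup on 8 virtual cores would land between these bounds:
    println!("1 core: {:.2}s, parallel: {:.2}s..{:.2}s", t1, t1 / 8.0, t1 / 4.0);
}
```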
So a good place to start is https://github.com/sonos/tract/blob/main/linalg/src/frame/mmm/mmm.rs#L268 . This doubly nested loop goes all over the output matrix, tile per tile. I think we can ignore the other single-level loops around it for starters (they deal with incomplete tiles on the right and bottom borders of the output matrix). So the first idea would be to create a task for each of these pairs of for_valid + kernel calls, put them in a pool executor of some sort, then wait for all of them to be done.
The main issue will be the scratch space: we need each thread to use a different one. I think the best way would be to introduce an intermediary object that holds a ThreadLocal. We will also need to do something about the prepare() call: it must happen once at the beginning of the operation in each thread. So one way would be to stop reusing the ScratchSpace between multiplications (tract-core explicitly calls allocate_scratch_space() at model startup, then run_with_scratch_space). I do need this optimisation for models which have a lot of smallish products, but we are trying to address the opposite case here, so reallocating the scratch space when we are going for a multi-threaded big op may just be OK performance-wise.
I suggest you do that first without worrying too much about breaking tract-core & co and my optimised use case: try to find something that works :) tract_linalg has good coverage of tests and benches, so you can experiment there directly. When we are satisfied, we can look for ways to make it opt-in: I will want it behind a --feature gate at least, as I don't want to add dozens of dependencies, so let's not worry about that too much from the start.
Thanks for the hints :) What about non_linear? Can we just put it behind an Arc, or should it be protected by a mutex..?
I am asking because of the way that matrix C is changed. I see that when we call for_valid, we store a *mut pointer to it in the OutputStoreKer. I am assuming that it is later changed when we call kernel (at the asm level).
I do not have experience with fused kernel ops, so I am not sure how safe it is to share non_linear in this context.
I am doing some tests using crossbeam and rayon, but I am not getting a lot of gain (1.5x and 1.3x respectively).
These are the links (e.g. crossbeam and e.g. rayon) in case you are interested.
rayon:
4907.905 ms/i 19.4% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
1279.879 ms/i 5.1% ┃┣ 172 LirMatMulUnary 450.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
4356.551 ms/i 17.2% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
1246.369 ms/i 4.9% ┃┣ 174 LirMatMulUnary 452.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
┗┓
108.617 ms/i 0.4% ┣┻ 175 Add 454
0.004 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
408.258 ms/i 1.6% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
42.977 ms/i 0.2% ┣ 178 LirMatMulUnary output.matmul.matmatmul
━━━ 1,1,1024,1024,F32
Most time consuming operations
* Im2col 48 nodes: 15725.109 ms/i 62.3%
* LirMatMulUnary 54 nodes: 6306.407 ms/i 25.0%
* MoveAxis 10 nodes: 1965.124 ms/i 7.8%
* MatMatMulPack 6 nodes: 568.207 ms/i 2.2%
* Concat 5 nodes: 374.580 ms/i 1.5%
* Add 22 nodes: 160.570 ms/i 0.6%
* MaxPool 1 nodes: 81.083 ms/i 0.3%
* Max 21 nodes: 45.211 ms/i 0.2%
* Mul 5 nodes: 31.548 ms/i 0.1%
* Reshape 6 nodes: 0.024 ms/i 0.0%
* Source 1 nodes: 0.001 ms/i 0.0%
crossbeam
4623.234 ms/i 16.7% ┃┣ 171 Im2col 450.im2col
┃┃ ━━━ 1,934282998,F32
1601.748 ms/i 5.8% ┃┣ 172 LirMatMulUnary 450.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
4680.580 ms/i 16.9% ┃┣ 173 Im2col 452.im2col
┃┃ ━━━ 1,934282998,F32
1571.774 ms/i 5.7% ┃┣ 174 LirMatMulUnary 452.matmatmul
┃┃ ━━━ 1,99,1024,1024,F32
┗┓
88.731 ms/i 0.3% ┣┻ 175 Add 454
0.005 ms/i 0.0% ┣ 176 Reshape output.reshape_input
┃ ━━━ 1,99,1048576,F32
398.826 ms/i 1.4% ┣ 177 MatMatMulPack output.matmul.pack
┃ ━━━ 1,103809222,F32
95.072 ms/i 0.3% ┣ 178 LirMatMulUnary output.matmul.matmatmul
━━━ 1,1,1024,1024,F32
Most time consuming operations
* Im2col 48 nodes: 16815.331 ms/i 60.9%
* LirMatMulUnary 54 nodes: 7557.583 ms/i 27.4%
* MoveAxis 10 nodes: 1953.166 ms/i 7.1%
* MatMatMulPack 6 nodes: 580.876 ms/i 2.1%
* Concat 5 nodes: 383.284 ms/i 1.4%
* Add 22 nodes: 151.883 ms/i 0.5%
* MaxPool 1 nodes: 80.546 ms/i 0.3%
* Max 21 nodes: 59.685 ms/i 0.2%
* Mul 5 nodes: 35.659 ms/i 0.1%
* Reshape 6 nodes: 0.032 ms/i 0.0%
* Source 1 nodes: 0.001 ms/i 0.0%
I am wondering if there is anything we can do about Im2col (LirMatMulUnary's time consumption decreased, but Im2col's increased).
Also, do you think there are any other points that we can change to take advantage of multithreading?
Oh! I had missed your previous message. Congrats, it's a nice speedup on LirMatMulUnary already.
I don't think there is anything that structurally prevents Im2Col from being run in parallel. That said, the code is quite complicated, with specific implementations for diverse ranks and padding modes.
The first question is: are you falling into one of the optimised cases (Valid1D, Valid2D, Patched2D), or are you getting the generic Patcher in Im2Col? I suspect you are hitting the generic case; otherwise Im2Col should not be that prominent (even before the parallel matmul).
By the way, have you checked the chunk numbers in your split? You very rightly put the multithreading hook between the two for-loops, but I would have thought a chunk of 1 to be reasonable (so one row / task).
Yeah, I tried with a chunk of 1 as well, but the results are almost the same. Regarding your question, I am falling into the Patcher::Padded2d case. Not sure what that means (performance-wise) though.
OK, this is a bit surprising. Can you share your model so I can have a look? We can probably try to make Padded2D concurrent at the ci loop level, but we would need to duplicate the writers and adjust their offsets accordingly.
tract's Im2Col actually does both the classic Im2Col and packing at the same time. Basically, the Writers come from linalg and are responsible for writing the data elements and keeping track of the right place in the packed tensor (the packing bit). The Patcher is responsible for scanning the input data (actually implementing the Im2Col part).
Sure, here is the link to the model.
In some cases, internal multithreading could be helpful. E.g. with batch_size = 1, max_length = 512, tract takes about 1.9 seconds to finish the forward pass of Albert, while the Python wrapper of ONNX takes 30ms (with sess_options.intra_op_num_threads = 6) to finish. Things get worse when incoming traffic is large, as the model is shared.
Do you have any idea on how can I speed up the whole process? @kali