
Comments (9)

adamreichold commented on August 18, 2024

Generally speaking, parallelization does have a non-negligible overhead, and hence whether it actually results in speed-ups depends on the amount of work done per item. In nested loops, it is often the case that not parallelizing all levels is optimal. From past experience, I would recommend starting with parallelizing the outermost loop and then moving inwards, i.e. start with the largest work items and repeatedly reduce their size while increasing their count, as long as that pays off.

In your case, I see a speed-up from parallelizing the two outer levels, but it is minimal compared to just the outermost level which I would probably stick to here:

> hyperfine ./serial ./parallel-outermost ./parallel-two-levels 
Benchmark 1: ./serial
  Time (mean ± σ):       2.6 ms ±   0.2 ms    [User: 1.7 ms, System: 1.3 ms]
  Range (min … max):     2.3 ms …   3.6 ms    773 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./parallel-outermost
  Time (mean ± σ):       1.4 ms ±   0.1 ms    [User: 3.1 ms, System: 4.6 ms]
  Range (min … max):     1.2 ms …   2.1 ms    919 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 3: ./parallel-two-levels
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 3.2 ms, System: 4.3 ms]
  Range (min … max):     1.1 ms …   1.9 ms    1093 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  ./parallel-two-levels ran
    1.01 ± 0.10 times faster than ./parallel-outermost
    1.91 ± 0.19 times faster than ./serial

i.e.

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Parallelize only the outermost loop; the inner loops stay sequential.
    let result = (0..512).into_par_iter().map(|x| {
        (0..512).map(|y| {
            (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();

    black_box(result);
}

Finally, maybe you could expand on what you would consider a helpful change here? Did you find the documentation lacking somewhere?

P.S.: As an aside, as hyperfine warns, it would probably be nicer to time multiple iterations of the above, as the runtime of a single one is below 5 ms.
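For example, one could repeat the workload enough times that the total runtime is well above that limit; a rough sketch (the iteration count of 1000 is just an illustrative assumption):

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Repeat the same computation many times so the measured runtime is
    // comfortably above hyperfine's ~5 ms calibration limit.
    for _ in 0..1_000 {
        let result = (0..512).into_par_iter().map(|x| {
            (0..512).map(|y| {
                (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>();
        black_box(result);
    }
}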


bergkvist commented on August 18, 2024

My point is that it is surprising that iter(iter(par_iter)) has 400x more overhead than par_iter(par_iter(par_iter)) in the example, and that setting RAYON_NUM_THREADS=1 is not enough to make it behave similarly to the sequential version (being 100x slower still).

Imagine that the inner scope is within another function used in a lot of places in a code base, initially sequential, but later made parallel.

As you slowly start introducing rayon into the code base, you might see big performance improvements in some places, while unexpectedly also getting 100x slowdowns in other parts of your code.
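For reference, here is a minimal sketch of the iter(iter(par_iter)) variant I am describing, assuming it mirrors the example above with only the tiny innermost loop made parallel:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Only the innermost (and tiny) loop is a parallel iterator; the two
    // outer loops stay sequential.
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();

    black_box(result);
}

Running this with and without RAYON_NUM_THREADS=1 is what produces the surprising numbers mentioned above.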


cuviper commented on August 18, 2024

The problem is that the work you're starting in parallel is tiny, not that there are iterators on the outside.

(0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()

Even when forced to a single thread, it's still sending that to the thread pool and waiting for the result, and I'm not surprised at all if that thread synchronization is more costly than the actual work.
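As a rough sketch, even restricting the pool to a single thread from code (equivalent to RAYON_NUM_THREADS=1; the explicit builder call and the toy workload are only illustrative here) does not remove that dispatch:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Restrict the global pool to one worker thread, equivalent to setting
    // RAYON_NUM_THREADS=1 in the environment.
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build_global()
        .expect("global thread pool was already initialized");

    // The tiny workload is still handed to the pool and synchronized on,
    // which can easily cost more than the work itself.
    let result = (0..16usize).into_par_iter().map(|z| black_box(z * 2)).sum::<usize>();
    black_box(result);
}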


bergkvist commented on August 18, 2024

Yes, I intentionally crafted the example to have very little compute to make the overhead as visible as possible. Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?


adamreichold commented on August 18, 2024

> Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version.

In your example of some utility function encapsulating some parallel processing, you would probably need to avoid excessive splitting using something like with_min_len if you know that your work items might be too small to benefit from parallelization on their own.


cuviper commented on August 18, 2024

> Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

> I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version.

Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.
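A minimal sketch of that fast path, assuming an explicit pool and install (both only for illustration): the closure runs on a worker thread, so the inner par_iter is already inside the pool and does not have to be handed over from an external thread:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().build().unwrap();

    // install() runs the closure on one of the pool's worker threads, so the
    // nested par_iter below starts executing directly on that thread.
    let result = pool.install(|| {
        (0..16usize).into_par_iter().map(|z| black_box(z * 3)).sum::<usize>()
    });

    black_box(result);
}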


bergkvist commented on August 18, 2024

Oh, I didn't know about with_min_len - this seems like a great way to build parallelized functions with less surprising performance behaviour when used in various contexts:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    // with_min_len(512) means ranges shorter than 512 items are never split,
    // so for n = 16 the work runs without any splitting.
    (0..n).into_par_iter().with_min_len(512).map(|z| black_box(x*y*z)).sum::<usize>()
}

Timings (notice I'm using microseconds not milliseconds here):

$ cargo run --release
elapsed: 6562 us
result: 2053539102720

That said, it seems like the following still has less overhead:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    if n < 512 {
        (0..n).map(|z| black_box(x*y*z)).sum::<usize>()
    } else {
        (0..n).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
    }
}

Timings:

$ cargo run --release
elapsed: 1890 us
result: 2053539102720


bergkvist commented on August 18, 2024

> Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.

If I understand correctly, this seems to be right, judging by the following code (where we simply wrap the entire iter(iter(par_iter)) in a rayon::join):

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let (result, _) = rayon::join(|| {
        (0..512).map(|x| {
            (0..512).map(|y| {
                (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>()
    }, || {
        // intentionally empty: the join only serves to run the first closure
        // on a pool worker thread, so the inner par_iters are already in the pool
    });
    
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

And here we can see that RAYON_NUM_THREADS=1 is 170x faster than the default multi-threaded run, presumably because the tasks don't need to be sent around and are moved onto the single-threaded thread pool all at once by the rayon::join:

$ cargo run --release
elapsed: 1764 ms
result: 2053539102720

$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 13 ms
result: 2053539102720

I'm guessing the reason the multi-threaded example is still somewhat slow is that the work is moved around in the thread pool a lot.


ar37-rs commented on August 18, 2024

Probably there's an analogy here with choosing between a bulldozer and a scoop. Of course, the scoop is much faster for tiny tasks.

