
Comments (9)

adamreichold commented on August 18, 2024

Generally speaking, parallelization does have a non-negligible overhead, and hence whether it actually results in speed-ups depends on the amount of work done per item. In nested loops, it is often the case that not parallelizing all levels is optimal. From past experience, I would recommend starting with parallelizing the outermost loop and then moving inwards, i.e. start with the largest work items and repeatedly reduce their size while increasing their count, as long as that pays off.

In your case, I see a speed-up from parallelizing the two outer levels, but it is minimal compared to just the outermost level which I would probably stick to here:

> hyperfine ./serial ./parallel-outermost ./parallel-two-levels 
Benchmark 1: ./serial
  Time (mean ± σ):       2.6 ms ±   0.2 ms    [User: 1.7 ms, System: 1.3 ms]
  Range (min … max):     2.3 ms …   3.6 ms    773 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./parallel-outermost
  Time (mean ± σ):       1.4 ms ±   0.1 ms    [User: 3.1 ms, System: 4.6 ms]
  Range (min … max):     1.2 ms …   2.1 ms    919 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 3: ./parallel-two-levels
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 3.2 ms, System: 4.3 ms]
  Range (min … max):     1.1 ms …   1.9 ms    1093 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  ./parallel-two-levels ran
    1.01 ± 0.10 times faster than ./parallel-outermost
    1.91 ± 0.19 times faster than ./serial

i.e.

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Parallelize only the outermost loop; the inner loops stay sequential.
    let result = (0..512).into_par_iter().map(|x| {
        (0..512).map(|y| {
            (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();

    black_box(result);
}

Finally, maybe you could expand on what you would consider a helpful change here? Did you find the documentation lacking somewhere?

P.S.: As an aside, as hyperfine warns, it would probably be nicer to time multiple iterations of the above, as the runtime of a single one is below 5 ms.
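For example, one could repeat the workload enough times that the total runtime is well above that limit; a rough sketch (the iteration count of 1000 is just an illustrative assumption):

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Repeat the same computation many times so the measured runtime is
    // comfortably above hyperfine's ~5 ms calibration limit.
    for _ in 0..1_000 {
        let result = (0..512).into_par_iter().map(|x| {
            (0..512).map(|y| {
                (0..16).map(|z| black_box(x*y*z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>();
        black_box(result);
    }
}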


bergkvist commented on August 18, 2024

My point is that it is surprising that iter(iter(par_iter)) has 400x more overhead than par_iter(par_iter(par_iter)) in the example, and that setting RAYON_NUM_THREADS=1 is not enough to make it behave similarly to the sequential version (being 100x slower still).

Imagine that the inner scope is within another function used in a lot of places in a code base, initially sequential, but later made parallel.

As you slowly start introducing rayon into the code base, you might see big performance improvements in some places, while unexpectedly also getting 100x slowdowns in other parts of your code.
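For reference, here is a minimal sketch of the iter(iter(par_iter)) variant I am describing, assuming it mirrors the example above with only the tiny innermost loop made parallel:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Only the innermost (and tiny) loop is a parallel iterator; the two
    // outer loops stay sequential.
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
        }).sum::<usize>()
    }).sum::<usize>();

    black_box(result);
}

Running this with and without RAYON_NUM_THREADS=1 is what produces the surprising numbers mentioned above.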


cuviper commented on August 18, 2024

The problem is that the work you're starting in parallel is tiny, not that there are iterators on the outside.

(0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()

Even when forced to a single thread, it's still sending that to the thread pool and waiting for the result, and I'm not surprised at all if that thread synchronization is more costly than the actual work.
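As a rough sketch, even restricting the pool to a single thread from code (equivalent to RAYON_NUM_THREADS=1; the explicit builder call and the toy workload are only illustrative here) does not remove that dispatch:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    // Restrict the global pool to one worker thread, equivalent to setting
    // RAYON_NUM_THREADS=1 in the environment.
    rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build_global()
        .expect("global thread pool was already initialized");

    // The tiny workload is still handed to the pool and synchronized on,
    // which can easily cost more than the work itself.
    let result = (0..16usize).into_par_iter().map(|z| black_box(z * 2)).sum::<usize>();
    black_box(result);
}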


bergkvist commented on August 18, 2024

Yes, I intentionally crafted the example to have very little compute to make the overhead as visible as possible. Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?


adamreichold commented on August 18, 2024

> Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version.

In your example of some utility function encapsulating some parallel processing, you would probably need to avoid excessive splitting using something like with_min_len if you know that your work items might be too small to benefit from parallelization on their own.


cuviper commented on August 18, 2024

> Why is it that parallel iterators in the outer scopes remove the overhead of the par_iter in the inner scope?

> I guess the outer parallel iterators produce tasks which keep the thread pool occupied and prevent the inner tiny iterators from being split at all, hence getting closer to the sequential version.

Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.
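A minimal sketch of that fast path, assuming an explicit pool and install (both only for illustration): the closure runs on a worker thread, so the inner par_iter is already inside the pool and does not have to be handed over from an external thread:

use std::hint::black_box;

use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().build().unwrap();

    // install() runs the closure on one of the pool's worker threads, so the
    // nested par_iter below starts executing directly on that thread.
    let result = pool.install(|| {
        (0..16usize).into_par_iter().map(|z| black_box(z * 3)).sum::<usize>()
    });

    black_box(result);
}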


bergkvist commented on August 18, 2024

Oh, I didn't know about with_min_len - this seems like a great way to build parallelized functions with less surprising performance behaviour when used in various contexts:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    // with_min_len(512) means ranges shorter than 512 items are never split,
    // so for n = 16 the work runs without any splitting.
    (0..n).into_par_iter().with_min_len(512).map(|z| black_box(x*y*z)).sum::<usize>()
}

Timings (notice I'm using microseconds not milliseconds here):

$ cargo run --release
elapsed: 6562 us
result: 2053539102720

That said, it seems like the following still has less overhead:

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let result = (0..512).map(|x| {
        (0..512).map(|y| {
            my_sum_function(x, y, 16)
        }).sum::<usize>()
    }).sum::<usize>();
    println!("elapsed: {} us", t0.elapsed().as_micros());
    println!("result: {result}");
}

fn my_sum_function(x: usize, y: usize, n: usize) -> usize {
    if n < 512 {
        (0..n).map(|z| black_box(x*y*z)).sum::<usize>()
    } else {
        (0..n).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
    }
}

Timings:

$ cargo run --release
elapsed: 1890 us
result: 2053539102720


bergkvist commented on August 18, 2024

> Also, the inner iterators will hit the fast path of already being within the pool, so they can start executing directly.

If I understand correctly, this seems to be right, judging by the following code (where we simply wrap the entire iter(iter(par_iter)) in a rayon::join):

use rayon::prelude::*;
use std::hint::black_box;

fn main() {
    let t0 = std::time::Instant::now();
    let (result, _) = rayon::join(|| {
        (0..512).map(|x| {
            (0..512).map(|y| {
                (0..16).into_par_iter().map(|z| black_box(x*y*z)).sum::<usize>()
            }).sum::<usize>()
        }).sum::<usize>()
    }, || {
        // intentionally empty: the join only serves to run the first closure
        // on a pool worker thread, so the inner par_iters are already in the pool
    });
    
    println!("elapsed: {} ms", t0.elapsed().as_millis());
    println!("result: {result}");
}

And here we can see that RAYON_NUM_THREADS=1 is 170x faster than the default multi-threaded run, presumably because the tasks don't need to be sent around and are moved onto the single-threaded thread pool all at once by the rayon::join:

$ cargo run --release
elapsed: 1764 ms
result: 2053539102720

$ RAYON_NUM_THREADS=1 cargo run --release
elapsed: 13 ms
result: 2053539102720

I'm guessing the reason the multi-threaded example is still somewhat slow is that the work is moved around in the thread pool a lot.


ar37-rs commented on August 18, 2024

Probably there's an analogy here with choosing between a bulldozer and a scoop. Of course, the scoop is much faster for tiny tasks.

