
Threading (dftk.jl, 26 comments, closed)

juliamolsim commented on May 29, 2024
Threading

from dftk.jl.

Comments (26)

mfherbst commented on May 29, 2024

I agree, the new partr framework seems to be the direction people are heading for threading support in the lower-level libraries too, and it only makes sense to follow along with it, especially since users of our code could do all sorts of things on top. Regarding @strided: I think that will really only be helpful in a few places (e.g. in the application of the non-local projectors) where a lot of classical array operations happen on all the bands at once. We'll have to benchmark, of course.

antoine-levitt commented on May 29, 2024

So, I did some very basic experiments. For a system with 400,000 plane waves, FFTW's own threading doesn't seem to do much: setting both the FFTW and BLAS thread counts to the number of cores on my computer gave me only a 20% speedup. So we should either do #9 or do our own threading.

mfherbst commented on May 29, 2024

Hmm, 20% is surprisingly little, but maybe I misunderstood what you did.

Could you perhaps commit a small benchmark script? I think it would be good to have a few "benchmark cases", or to integrate with https://github.com/JuliaCI/PkgBenchmark.jl, so that one can track performance better. What do you think?
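For reference, PkgBenchmark's convention (a sketch based on my understanding of the package; the FFT benchmark entry itself is purely illustrative) is a benchmark/benchmarks.jl file that defines a BenchmarkTools suite named SUITE:

```julia
# benchmark/benchmarks.jl -- sketch of the PkgBenchmark convention:
# define a BenchmarkTools suite named SUITE; PkgBenchmark then runs it
# via benchmarkpkg("DFTK") and can compare results across commits.
using BenchmarkTools
using FFTW

SUITE = BenchmarkGroup()

# Illustrative entry: time one 3D FFT of roughly the size discussed here.
let x = randn(ComplexF64, 128, 128, 128), p = plan_fft(x)
    SUITE["fft"] = @benchmarkable $p * $x
end
```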

antoine-levitt commented on May 29, 2024

That's set_num_threads for both FFTW and BLAS, set to the maximum number of cores vs. 1. Benchmarking is easy: take any example and make more of it (e.g. set up a supercell). I don't think we need to set up performance tracking, because essentially the only thing that matters right now is how we do the FFTs and how many of them we do, which is simpler to track by hand. The top priorities right now are the convergence criteria for the eigensolver (we do way too many iterations per SCF step; by comparison, ABINIT by default does 8 in the first two iterations and then 4) and batching/threading the FFTs.
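The experiment described above can be sketched like this (a minimal illustration, not the actual script; the 400,000-plane-wave system is stood in for by a 128³ box):

```julia
# Time one large FFT with library threading off, then on.
# Note: FFTW picks up set_num_threads at plan creation, so the plan
# must be (re)built after changing the thread count.
using FFTW
using LinearAlgebra: BLAS

x = randn(ComplexF64, 128, 128, 128)

FFTW.set_num_threads(1)
BLAS.set_num_threads(1)
p1 = plan_fft(x)
p1 * x                      # warm up
t1 = @elapsed p1 * x

FFTW.set_num_threads(Sys.CPU_THREADS)
BLAS.set_num_threads(Sys.CPU_THREADS)
pN = plan_fft(x)
pN * x                      # warm up
tN = @elapsed pN * x

println("threaded speedup: ", round(t1 / tN; digits = 2))
```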

antoine-levitt commented on May 29, 2024

@mfherbst can you try the following benchmarking script on the machine you have? https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f

Here I get:

4 threads
N=128, M=40
Single FFT: no threads
  26.611 ms (0 allocations: 0 bytes)
Single FFT: threads
  15.158 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  1.080 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  631.769 ms (112 allocations: 8.06 KiB)
Multiple FFTs: auto, no threads
  1.083 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  696.880 ms (3281 allocations: 272.52 KiB)
Multiple FFTs: manual_threaded, threads
  679.694 ms (3323 allocations: 275.73 KiB)
Multiple FFTs: auto, threads
  633.797 ms (39 allocations: 3.33 KiB)

So the good news is that all methods of parallelization are essentially the same. The bad news is that they all suck :-) It looks like FFTs are almost memory-bound, and so do not benefit much from parallelization (at least on my machine). That's on Julia 1.3. I'd test on the lab's cluster, but I'm getting proxy errors...

mfherbst commented on May 29, 2024

My machine (Julia 1.3, FFTW):

4 threads
N=128, M=40
Single FFT: no threads
  19.447 ms (0 allocations: 0 bytes)
Single FFT: threads
  9.175 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
  792.030 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  373.974 ms (110 allocations: 8.03 KiB)
Multiple FFTs: auto, no threads
  792.610 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  391.993 ms (3243 allocations: 271.92 KiB)
Multiple FFTs: manual_threaded, threads
  377.248 ms (3318 allocations: 275.66 KiB)
Multiple FFTs: auto, threads
  375.433 ms (40 allocations: 3.34 KiB)

mfherbst commented on May 29, 2024

Cluster08 (Julia 1.2, MKL):

16 threads
N=128, M=40
Single FFT: no threads
  43.748 ms (0 allocations: 0 bytes)
Single FFT: threads
  6.878 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.774 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  418.949 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, no threads
  1.781 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  370.278 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

and (again Julia 1.2, MKL):

4 threads
N=128, M=40
Single FFT: no threads
  39.283 ms (0 allocations: 0 bytes)
Single FFT: threads
  11.205 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.751 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  584.677 ms (111 allocations: 7.97 KiB)
Multiple FFTs: auto, no threads
  1.765 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  549.380 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  298.015 ms (107 allocations: 7.80 KiB)
Multiple FFTs: auto, threads
  496.712 ms (0 allocations: 0 bytes)

antoine-levitt commented on May 29, 2024

clustern20 (with Julia 1.1; I can't make 1.3 work with the proxy for some reason):

16 threads
N=128, M=40
Single FFT: no threads
  32.266 ms (0 allocations: 0 bytes)
Single FFT: threads
  4.336 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.386 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  151.490 ms (53 allocations: 3.23 KiB)
Multiple FFTs: auto, no threads
  1.396 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  248.748 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  208.934 ms (56 allocations: 3.17 KiB)
Multiple FFTs: auto, threads
  143.142 ms (0 allocations: 0 bytes)
32 threads
N=128, M=40
Single FFT: no threads
  32.257 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.193 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
  1.361 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.536 ms (23 allocations: 1.42 KiB)
Multiple FFTs: auto, no threads
  1.550 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  151.108 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
  156.481 ms (24 allocations: 1.48 KiB)
Multiple FFTs: auto, threads
  150.511 ms (0 allocations: 0 bytes)

That's much better. I think that's consistent with FFTs being memory-limited, with memory bandwidth scaling differently across machines.

Takeaways: oversubscription is fine, and FFTW's own threading doesn't do better than outer threading. So my suggestion is to plan for a single threaded FFT (like we do now, by setting FFTW.set_num_threads to JULIA_NUM_THREADS) and add our own threading on top of that. That was fine on 1.1 and should be even better on 1.3. Pity I can't test it on the cluster...

mfherbst commented on May 29, 2024

Be careful with the 32-thread numbers on cluster 20... it has hyper-threading enabled, so effectively it's only 16 cores.

antoine-levitt commented on May 29, 2024

Yeah, I know; that was basically to test oversubscription.

mfherbst commented on May 29, 2024

Julia 1.3 has changed the way the registries are updated in a way that seems to ignore the proxy settings... I've had the same issues.

mfherbst commented on May 29, 2024

For FFTW I think you are right, but for MKL's FFT the picture seems to be different.

antoine-levitt commented on May 29, 2024

A bit, but maybe the results are just too noisy. Can you run the 16-thread test again? I want to see if

Multiple FFTs: manual_threaded, threads
  325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
  287.693 ms (0 allocations: 0 bytes)

should be trusted or not.

mfherbst commented on May 29, 2024

Another run:

Multiple FFTs: manual_threaded, threads
  312.535 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, threads
  261.955 ms (0 allocations: 0 bytes)

and yet one more:

Multiple FFTs: manual_threaded, threads
  368.037 ms (180 allocations: 15.16 KiB)
Multiple FFTs: auto, threads
  305.578 ms (0 allocations: 0 bytes)

and on another machine (cc09):

Multiple FFTs: manual_threaded, threads
  211.597 ms (173 allocations: 14.36 KiB)
Multiple FFTs: auto, threads
  147.225 ms (0 allocations: 0 bytes)

mfherbst commented on May 29, 2024

The difference is similar in each case: 50 to 60 ms.

antoine-levitt commented on May 29, 2024

Hm. So the results are inconsistent, but always in the same direction. I'm tempted to ignore it... We really should see what this does on 1.3 (or even better, master). There are a few open issues on the Julia GitHub about proxies; I posted in one, but proxies are a uniform pain.

antoine-levitt commented on May 29, 2024

But really, what this all shows is that a single FFT is already pretty well parallelized. That means we can just ignore all this and not do any threading at all (i.e. what we have now), and it'll be within a factor of 2 of optimal (at least for these sizes). If we just add @threads to the for loop over the FFTs, we'll probably be optimal (or very close, especially with the post-1.2 improvements to threading). Then we should run a large-ish computation on the cluster, see if new bottlenecks appear, and maybe add threading accordingly.
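The "add @threads to the for loop over the FFTs" idea can be sketched as follows; the array layout and the names (psi with the band index last, a preallocated out) are illustrative, not DFTK's actual code:

```julia
using FFTW
using LinearAlgebra: mul!

# Apply a serial 3D FFT plan to each band on its own Julia thread.
# psi and out are 4D complex arrays with the band index last.
function fft_bands!(out, psi, plan)
    Threads.@threads for n in 1:size(psi, 4)
        @views mul!(out[:, :, :, n], plan, psi[:, :, :, n])
    end
    return out
end
```

With FFTW.set_num_threads(1) this gives one serial transform per Julia thread; leaving library threading on as well reproduces the oversubscribed case benchmarked above.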

antoine-levitt commented on May 29, 2024

For the proxy issues, see Julia issue 33111; that fixed it for me.

antoine-levitt commented on May 29, 2024

So 1.3 improves the manual_threaded for me:

16 threads
N=128, M=40
Single FFT: no threads
  32.412 ms (0 allocations: 0 bytes)
Single FFT: threads
  3.564 ms (298 allocations: 26.22 KiB)
Multiple FFTs: manual, no threads
  1.423 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  155.690 ms (194 allocations: 16.94 KiB)
Multiple FFTs: auto, no threads
  1.415 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  177.217 ms (12010 allocations: 1.03 MiB)
Multiple FFTs: manual_threaded, threads
  143.499 ms (12359 allocations: 1.04 MiB)
Multiple FFTs: auto, threads
  173.176 ms (453 allocations: 37.64 KiB)
32 threads
N=128, M=40
Single FFT: no threads
  34.014 ms (0 allocations: 0 bytes)
Single FFT: threads
  2.989 ms (588 allocations: 52.25 KiB)
Multiple FFTs: manual, no threads
  1.442 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
  156.606 ms (306 allocations: 28.81 KiB)
Multiple FFTs: auto, no threads
  1.451 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
  170.377 ms (23837 allocations: 2.05 MiB)
Multiple FFTs: manual_threaded, threads
  154.102 ms (24331 allocations: 2.08 MiB)
Multiple FFTs: auto, threads
  144.152 ms (622 allocations: 51.50 KiB)

Still a slight edge for FFTW's auto threading on 32 cores, but that changes from benchmark to benchmark, and when I repeated it, manual_threaded was faster. So let's go with #77 and not bother too much.

mfherbst commented on May 29, 2024

I agree. Especially since this keeps more control on our end and opens the way to integrating with the threading developments happening in Julia in the future.

antoine-levitt commented on May 29, 2024

OK, let's close this one for now then. We can revisit according to profiling.

antoine-levitt commented on May 29, 2024

One thing to note is that FFTW defaults to no threading. Let's keep that manual for now, but note for later that we have to call FFTW.set_num_threads and BLAS.set_num_threads. Also, FFTW's thread count is fixed at plan creation.
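Since the thread count is baked in when a plan is created, the order of calls matters. A minimal sketch, assuming the FFTW.jl API discussed here:

```julia
using FFTW

x = randn(ComplexF64, 128, 128, 128)

FFTW.set_num_threads(1)
p_serial = plan_fft(x)      # this plan will always run single-threaded

FFTW.set_num_threads(4)
p_threaded = plan_fft(x)    # only plans created after the call use 4 threads
```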

mfherbst commented on May 29, 2024

That is not true; for me it does thread by default.

mfherbst commented on May 29, 2024

See https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L59. This is activated if nthreads() > 1, and I have export JULIA_NUM_THREADS=4 set by default, which I think is the way to go for this issue.

antoine-levitt commented on May 29, 2024

Oh, you're absolutely right; I stopped reading at https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L41. They're really confident oversubscription is not a problem, then!

mfherbst commented on May 29, 2024

Indeed. I just saw that, too.
