Comments (26)
I agree the new partr framework seems to be the way people are heading for threading support in the lower libraries too, and it only makes sense to follow along with it. Especially since users of our code could do all sorts of things on top. Regarding @strided: I think that will really only be helpful in a few places (e.g. in the application of the non-local projectors) where a lot of classical array operations happen on all the bands at once. We'll have to benchmark, of course.
from dftk.jl.
So I did some very basic experiments. For a system with 400,000 plane waves, FFTW's own threading doesn't seem to do much: setting both FFTW and BLAS threads to the number of cores on my computer gave me a 20% speedup. So we should either do #9 or do our own threading.
from dftk.jl.
Hmm, 20% is surprisingly little, but maybe I misunderstand what you did.
Could you perhaps commit a small benchmark script? I think it would be good to have a few "benchmark cases" or to integrate with https://github.com/JuliaCI/PkgBenchmark.jl such that one can track performance better. What do you think?
from dftk.jl.
That's set_num_threads for both FFTW and BLAS, set to the max number of cores vs. 1. Benchmarking is easy: take any example and have more of it (e.g. set supercell). I don't think we need to set up performance tracking, because essentially the only thing that matters right now is how we do the FFTs and how many of them we do, which is simpler to track by hand. The top priorities right now are the convergence criteria for the eigensolver (we do way too many iterations per SCF step; by comparison, abinit by default does 8 in the first two iterations, and then 4), and batching/threading the FFTs.
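As a sketch, the thread settings compared in the experiment above (max cores vs. 1) look roughly like this, assuming the standard FFTW.jl and LinearAlgebra.BLAS APIs:

```julia
using FFTW, LinearAlgebra

# Set both libraries' thread pools to the Julia thread count;
# the serial baseline would use set_num_threads(1) instead.
nt = Threads.nthreads()
FFTW.set_num_threads(nt)
BLAS.set_num_threads(nt)
```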
from dftk.jl.
@mfherbst can you try the following benchmarking script on the machine you have? https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f
Here I get
4 threads
N=128, M=40
Single FFT: no threads
26.611 ms (0 allocations: 0 bytes)
Single FFT: threads
15.158 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
1.080 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
631.769 ms (112 allocations: 8.06 KiB)
Multiple FFTs: auto, no threads
1.083 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
696.880 ms (3281 allocations: 272.52 KiB)
Multiple FFTs: manual_threaded, threads
679.694 ms (3323 allocations: 275.73 KiB)
Multiple FFTs: auto, threads
633.797 ms (39 allocations: 3.33 KiB)
So the good news is that all methods of parallelization are essentially the same. The bad news is that they all suck :-) It looks like FFTs are almost memory-bound, and so do not benefit much from parallelization (at least on my machine). That's on julia 1.3. I'd test on the lab's cluster, but I'm getting proxy errors...
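For readers without access to the gist, the variants in the output above can be reconstructed roughly as follows. This is a hypothetical sketch, not the actual benchmark script; the names `ψ`, `manual`, `manual_threaded`, and `auto` are illustrative.

```julia
using FFTW, Base.Threads

N, M = 16, 4                                  # small sizes here; the runs above use N=128, M=40
ψ = [rand(ComplexF64, N, N, N) for _ in 1:M]  # M "bands", one 3D array each
plan = plan_fft(ψ[1])                         # serial plan unless FFTW threads are enabled

# "manual": sequential loop over bands, one FFT each
manual(ψ, plan) = [plan * ψk for ψk in ψ]

# "manual_threaded": the same loop parallelized over bands with Julia threads
function manual_threaded(ψ, plan)
    out = Vector{Array{ComplexF64,3}}(undef, length(ψ))
    @threads for k in 1:length(ψ)
        out[k] = plan * ψ[k]
    end
    out
end

# "auto" presumably stacks the bands into one 4D array and lets a single
# batched plan transform the first three dimensions at once.
Ψ = cat(ψ...; dims=4)
auto(Ψ) = plan_fft(Ψ, 1:3) * Ψ
```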
from dftk.jl.
My machine (julia 1.3, fftw)
4 threads
N=128, M=40
Single FFT: no threads
19.447 ms (0 allocations: 0 bytes)
Single FFT: threads
9.175 ms (78 allocations: 6.66 KiB)
Multiple FFTs: manual, no threads
792.030 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
373.974 ms (110 allocations: 8.03 KiB)
Multiple FFTs: auto, no threads
792.610 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
391.993 ms (3243 allocations: 271.92 KiB)
Multiple FFTs: manual_threaded, threads
377.248 ms (3318 allocations: 275.66 KiB)
Multiple FFTs: auto, threads
375.433 ms (40 allocations: 3.34 KiB)
from dftk.jl.
Cluster08 (julia 1.2, MKL)
16 threads
N=128, M=40
Single FFT: no threads
43.748 ms (0 allocations: 0 bytes)
Single FFT: threads
6.878 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
1.774 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
418.949 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, no threads
1.781 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
370.278 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
287.693 ms (0 allocations: 0 bytes)
and (again 1.2, MKL)
4 threads
N=128, M=40
Single FFT: no threads
39.283 ms (0 allocations: 0 bytes)
Single FFT: threads
11.205 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
1.751 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
584.677 ms (111 allocations: 7.97 KiB)
Multiple FFTs: auto, no threads
1.765 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
549.380 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
298.015 ms (107 allocations: 7.80 KiB)
Multiple FFTs: auto, threads
496.712 ms (0 allocations: 0 bytes)
from dftk.jl.
clustern20 (with julia 1.1, I can't make 1.3 work with the proxy for some reason):
16 threads
N=128, M=40
Single FFT: no threads
32.266 ms (0 allocations: 0 bytes)
Single FFT: threads
4.336 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
1.386 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
151.490 ms (53 allocations: 3.23 KiB)
Multiple FFTs: auto, no threads
1.396 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
248.748 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
208.934 ms (56 allocations: 3.17 KiB)
Multiple FFTs: auto, threads
143.142 ms (0 allocations: 0 bytes)
32 threads
N=128, M=40
Single FFT: no threads
32.257 ms (0 allocations: 0 bytes)
Single FFT: threads
3.193 ms (0 allocations: 0 bytes)
Multiple FFTs: manual, no threads
1.361 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
156.536 ms (23 allocations: 1.42 KiB)
Multiple FFTs: auto, no threads
1.550 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
151.108 ms (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, threads
156.481 ms (24 allocations: 1.48 KiB)
Multiple FFTs: auto, threads
150.511 ms (0 allocations: 0 bytes)
That's much better. I think that's consistent with FFTs being memory-limited, with memory performance scaling differently on different machines.
Takeaways: oversubscription is fine, and FFTW doesn't do better than outer threading. So my suggestion is to plan for a single threaded FFT at a time (like we do now), by setting FFTW.set_num_threads to JULIA_NUM_THREADS, and to add our own threading on top of that. That was fine on 1.1, and should be even better on 1.3. Pity I can't test it on the cluster...
from dftk.jl.
Be careful with the 32 threads on cluster 20: it has hyperthreading enabled, so effectively it's only 16 cores.
from dftk.jl.
Yeah, I know; that was basically to test oversubscription.
from dftk.jl.
Julia 1.3 has changed the way the registries are updated, in a way that seems to ignore the proxy settings... I've had the same issues.
from dftk.jl.
For FFTW I think you are right, but for MKL's FFT the picture seems to be different.
from dftk.jl.
A bit, but maybe the results are too noisy. Can you run the 16-thread test again? I want to see if
Multiple FFTs: manual_threaded, threads
325.279 ms (183 allocations: 15.50 KiB)
Multiple FFTs: auto, threads
287.693 ms (0 allocations: 0 bytes)
should be trusted or not.
from dftk.jl.
Another run:
Multiple FFTs: manual_threaded, threads
312.535 ms (182 allocations: 15.58 KiB)
Multiple FFTs: auto, threads
261.955 ms (0 allocations: 0 bytes)
and yet one more:
Multiple FFTs: manual_threaded, threads
368.037 ms (180 allocations: 15.16 KiB)
Multiple FFTs: auto, threads
305.578 ms (0 allocations: 0 bytes)
and on another machine (cc09):
Multiple FFTs: manual_threaded, threads
211.597 ms (173 allocations: 14.36 KiB)
Multiple FFTs: auto, threads
147.225 ms (0 allocations: 0 bytes)
from dftk.jl.
The difference is similar in each case: 50 to 60 ms.
from dftk.jl.
Hm. So the results are inconsistent, but always in the same direction. I'm tempted to ignore it... We really should see what it does with 1.3 (or, even better, master). There are a few open issues on the julia github about proxies; I posted in one, but proxies are a uniform pain.
from dftk.jl.
But really, what this all shows is that a single FFT is already pretty well parallelized. Meaning that we can just ignore this and not do any threading at all (i.e. what we have now), and it'll be within a factor of 2 of optimal (at least for these sizes). If we just add @threads in the for loop of the FFTs, we'll probably be optimal (or very close, especially with post-1.2 improvements to threading). Then we should run a large-ish computation on the cluster, see if new bottlenecks appear, and maybe add threading accordingly.
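The suggestion above amounts to something like the following sketch. The function name `fft_bands!` is illustrative, not an existing DFTK function; executing one precomputed plan concurrently on different arrays is safe, since FFTW's new-array execute functions are thread-safe.

```julia
using FFTW, LinearAlgebra, Base.Threads

# Keep the per-band FFT loop and just put @threads in front of it.
function fft_bands!(out, ψ, plan)
    @threads for k in 1:length(ψ)
        mul!(out[k], plan, ψ[k])   # in-place application of the plan to band k
    end
    out
end
```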
from dftk.jl.
For proxy issues, see julia issue 33111; that fixed it for me.
from dftk.jl.
So 1.3 improves the manual_threaded for me:
16 threads
N=128, M=40
Single FFT: no threads
32.412 ms (0 allocations: 0 bytes)
Single FFT: threads
3.564 ms (298 allocations: 26.22 KiB)
Multiple FFTs: manual, no threads
1.423 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
155.690 ms (194 allocations: 16.94 KiB)
Multiple FFTs: auto, no threads
1.415 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
177.217 ms (12010 allocations: 1.03 MiB)
Multiple FFTs: manual_threaded, threads
143.499 ms (12359 allocations: 1.04 MiB)
Multiple FFTs: auto, threads
173.176 ms (453 allocations: 37.64 KiB)
32 threads
N=128, M=40
Single FFT: no threads
34.014 ms (0 allocations: 0 bytes)
Single FFT: threads
2.989 ms (588 allocations: 52.25 KiB)
Multiple FFTs: manual, no threads
1.442 s (80 allocations: 5.00 KiB)
Multiple FFTs: manual_threaded, no threads
156.606 ms (306 allocations: 28.81 KiB)
Multiple FFTs: auto, no threads
1.451 s (0 allocations: 0 bytes)
Multiple FFTs: manual, threads
170.377 ms (23837 allocations: 2.05 MiB)
Multiple FFTs: manual_threaded, threads
154.102 ms (24331 allocations: 2.08 MiB)
Multiple FFTs: auto, threads
144.152 ms (622 allocations: 51.50 KiB)
Still a slight edge for auto FFTW on 32 cores, but that changes from benchmark to benchmark, and when I repeated it, manual_threaded was faster. So let's go with #77 and not bother too much.
from dftk.jl.
I agree. Especially since this keeps more control on our end and opens the way to integrate with the developments happening in Julia in the future.
from dftk.jl.
OK, let's close this one for now then. We can revisit according to profiling.
from dftk.jl.
One thing is that FFTW defaults to no threading. Let's keep that manual for now, but note for later that we have to call FFTW.set_num_threads and BLAS.set_num_threads. Also, FFTW threading occurs at plan creation.
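The plan-creation point in code: the FFTW thread count is captured when a plan is created, not when it is executed, so set_num_threads must come before plan_fft. A minimal sketch:

```julia
using FFTW

x = rand(ComplexF64, 64, 64, 64)

FFTW.set_num_threads(1)
p_serial = plan_fft(x)        # will always execute single-threaded

FFTW.set_num_threads(Threads.nthreads())
p_threaded = plan_fft(x)      # only plans created from here on use the new count

# p_serial is unaffected by the later set_num_threads call;
# both plans compute the same transform.
```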
from dftk.jl.
That is not true. For me it does.
from dftk.jl.
See https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L59. This is activated if nthreads() > 1, and I have export JULIA_NUM_THREADS=4 set by default, which I think is the way to go with this issue.
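Concretely, the setup described above is just environment configuration (4 threads here is the value from the comment, not a recommendation):

```shell
# Start Julia with multiple threads so that FFTW.jl's __init__ check
# (nthreads() > 1, linked above) turns on FFTW's own threading.
export JULIA_NUM_THREADS=4
julia -e 'using FFTW; println(Threads.nthreads())'
```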
from dftk.jl.
Oh, you're absolutely right, I stopped at https://github.com/JuliaMath/FFTW.jl/blob/master/src/FFTW.jl#L41. They're really confident oversubscription is not a problem then!
from dftk.jl.
Indeed. I just saw that, too.
from dftk.jl.