
Comments (26)

gsitaram commented on June 11, 2024

In the mpirun command, you can call a wrapper script that assigns a different GPU device to each of the 2 ranks before pxgemm_miniapp is invoked. I believe NCCL does not allow multiple ranks to use the same GPU device.
For instance, if you were using OpenMPI, the wrapper script might look like this:

    #!/bin/bash
    # Assign a distinct GPU to each local MPI rank (OpenMPI exports this variable).
    export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
    # Replace the wrapper process with the wrapped command and its arguments.
    exec "$@"

and you would call the miniapp as follows:

mpirun -np 2 ./wrapper.sh ./pxgemm_miniapp <args>
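
To confirm the binding, here is a quick sanity check (a sketch, assuming OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK to each rank's environment); each rank should report a different device:

    # each rank should print a different CUDA_VISIBLE_DEVICES value
    mpirun -np 2 ./wrapper.sh bash -c 'echo "rank ${OMPI_COMM_WORLD_LOCAL_RANK}: CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'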


gsitaram commented on June 11, 2024

This is interesting. I do not see the same thing, so this tells me that it could have something to do with your MPI configuration. I am not an expert here, so I will let others chime in.


kabicm commented on June 11, 2024

Since there are 10 repetitions, the output contains the timings array sorted from the fastest to the slowest run, in ms (the first entry is the minimum, the last the maximum).

This means the fastest run took 1068 ms and the slowest 10368 ms. It is indeed quite unusual to see that much overhead; usually the difference is much less substantial.


tcarneirop commented on June 11, 2024

That's true. I'm now trying to build on another computer.

For the moment, this is what I have on the second computer:

mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 32768 -n 32768 -k 32768 --type=double -r 4
COSMA TIMES [ms] = 5811 5814 5814 6523

mpirun -np 2 ./wrapper.sh ./cosma_miniapp -m 32768 -n 32768 -k 32768 --type=double -r 4

COSMA TIMES [ms] = 4278 4305 4306 10292

Shows some speedup, right?

But I'll rebuild COSMA with the instructions you sent. I'm waiting for the reservation to be ready, and I'll send you feedback.

Many thanks!


tcarneirop commented on June 11, 2024

Thank you for your reply. I'm going to test it and I'll send you feedback.


tcarneirop commented on June 11, 2024

Hello,

Using the wrapper I can see 1 process per GPU, thanks!

However, I have no idea how to interpret the outputs.

For instance, my system has 2 GPUs and I launched the miniapp as follows:

mpirun -np 2 -npernode 2 -machinefile $PBS_NODEFILE ./wrapper.sh ./pxgemm_miniapp -m 18000 -n 18000 -k 18000 --p_grid=1,1 --block_a=1024,1024 --block_b=1024,1024 --block_c=1024,1024 --type=double --algorithm=cosma -r 1

What I see is the same output twice:

Running PDGEMM on the following problem

(2x)
GLOBAL MAT. SIZES

A = 18000 x 18000
B = 18000 x 18000
C = 18000 x 18000

...
...
COSMA TIMES [ms] = 8758
COSMA TIMES [ms] = 10471

....

So, it looks like two single-GPU instances of the miniapp are executed, and the load is not divided among the GPU devices.

Is this correct? Are there additional MPI flags/pxgemm parameters to set in order to use multiple GPUs?

Best regards


gsitaram commented on June 11, 2024

Can you try setting --p_grid=2,1?


tcarneirop commented on June 11, 2024

Sure!

mpirun -np 2 ./wrapper.sh ./pxgemm_miniapp -m 18000 -n 18000 -k 18000 --p_grid=2,1 --block_a=1024,1024 --block_b=1024,1024 --block_c=1024,1024 --type=double --algorithm=cosma -r 1

..........................

COSMA(pxgemm_miniapp.cpp): warning: number of processors in the grid must be equal to P, setting grid to 1xP instead.
COSMA(pxgemm_miniapp.cpp): warning: number of processors in the grid must be equal to P, setting grid to 1xP instead.
Running PDGEMM on the following problem:
...
...
...
COSMA TIMES [ms] = 3562
COSMA TIMES [ms] = 3522
..............................

Are the multi-GPU build instructions enough? Or should I also build with GPU-aware MPI?

Thanks!


tcarneirop commented on June 11, 2024

OK, let me try to rebuild everything and see what I get.

On your side, do you see just one output?

Is your OpenMPI the official NVIDIA build of OpenMPI?


gsitaram commented on June 11, 2024

I run with an OpenMPI that I built from source. I also run on AMD GPUs, but on your system you should try the OpenMPI that is distributed with NVIDIA's HPC toolkit. You don't need GPU-aware MPI support to use NCCL/RCCL, but having it shouldn't hurt.

You can also check whether Intel MPI has some parameters that result in such behavior (duplication of tasks instead of one parallel task with multiple ranks).
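
A quick way to check what the launcher is actually starting (a minimal sketch, assuming OpenMPI's OMPI_COMM_WORLD_* environment variables): a healthy 2-rank launch prints "rank 0 of 2" and "rank 1 of 2"; if the variables come out empty, the mpirun being used probably does not match the MPI the binary was built against.

    # healthy launch: "rank 0 of 2" and "rank 1 of 2"
    mpirun -np 2 bash -c 'echo "rank ${OMPI_COMM_WORLD_RANK} of ${OMPI_COMM_WORLD_SIZE}"'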


tcarneirop commented on June 11, 2024

OK, so there is an MPI in the source? Maybe it would be interesting to use that instead of the modules from the system.

In this case, I'm using the OpenMPI from a module; the Intel one does not have that local-rank variable to help with the GPU affinity.


tcarneirop commented on June 11, 2024

Btw, I'm using -DCOSMA_SCALAPACK=MKL instead of CUSTOM. Is that OK?


tcarneirop commented on June 11, 2024

A bit more info...
This behavior can also be seen with cosma_miniapp.

It looks like things are executed on the first GPU and then on the next one.

$ mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 1056

mpirun -np 2 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 2
Overlap of communication and computation: ON.
Communication-thread policy (for overlap): busy-waiting (using blocking one-sided MPI).
Divisions strategy:
parallel (k / 2)
Required memory per rank (in #elements): 671088640
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 3532


kabicm commented on June 11, 2024

@tcarneirop I see that in the first example the overlap of comm. and comp. is off, while in the other one it's on. Can you run both examples with the overlap of communication and computation turned off, by setting export COSMA_OVERLAP_COMM_AND_COMP=OFF, just so that we have consistent configurations?


tcarneirop commented on June 11, 2024

Hello, thanks for your reply.

I tried it and things are the same. It launches 4 MPI processes, each on a different GPU, but it looks like each one is solving a 16k x 16k matrix multiplication instead of sharing the load:

mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 3729

vs

mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 1100

My configuration:

ml CMake/3.24.3-GCCcore-11.3.0
ml GCC/11.3.0
ml imkl
ml OpenMPI/4.1.4-GCC-11.3.0

ml CUDA/11.7.0

export NCCL_ROOT=home/user/nccl/
export NCCL_LIB_DIR=home/user/nccl/lib/
export NCCL_INCLUDE_DIR=home/user/nccl/include/
export COSMA_OVERLAP_COMM_AND_COMP=OFF

cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=ON -DCMAKE_INSTALL_PREFIX=~/cosmaAGAIN ..

Btw, the wrapper.sh is the one provided a few answers ago.

Thanks!!


tcarneirop commented on June 11, 2024

Nothing? =(


kabicm commented on June 11, 2024

Hi @tcarneirop! I think I know what the problem might be.

However, to be sure, can you please also try running the same thing on the GPU, but without NCCL and without CUDA-aware MPI?

So, in the cmake command you would have the following: -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF.


tcarneirop commented on June 11, 2024

Thanks for your reply!

Sure, just one question -- which CUDA-aware MPI should I load?

Thanks!


kabicm commented on June 11, 2024

You should just use the standard MPI, without the CUDA-aware part, so you can also add -DCOSMA_WITH_GPU_AWARE_MPI=OFF to cmake, but that is the default value anyway.

We simply want to see the results of COSMA with the GPU backend, but without NCCL and without CUDA-aware MPI.


tcarneirop commented on June 11, 2024

Hello @kabicm

Thanks for your reply!

OK, with -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF -DCOSMA_WITH_GPU_AWARE_MPI=OFF:

mpirun -np 1 ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)


COSMA TIMES [ms] = 1081

mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 3020


My modules:
ml CMake/3.24.3-GCCcore-11.3.0
ml GCC/11.3.0
ml imkl
ml OpenMPI/4.1.4-GCC-11.3.0
ml CUDA/11.7.0

Thanks!!


kabicm commented on June 11, 2024

Thanks a lot for the detailed benchmarking!

I see two possibilities here:

  • I just realized that the number of repetitions is set to 1. The initial run always allocates the memory pools (both CPU and GPU), so its timing includes all the one-time overheads. For this reason, it would be better to set -r 3 or higher: the subsequent runs reuse the pre-allocated memory, so we can see the actual compute time without the overheads.
  • This might still be too small a matrix size to fully utilize the 4 GPUs, so you should try larger sizes as well. On a P100, I was running m=n=k=30k (double precision) on a single GPU. If there is not enough work, a single GPU might be faster. (See the example command right after this list.)
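
For instance, a combined run along those lines might look like this (a sketch reusing the wrapper from earlier; even then, the first reported time will still include the one-time pool allocation):

    # more repetitions and a larger problem; later runs reuse the memory pools
    mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 30000 -n 30000 -k 30000 --type=double -r 5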

I am looking forward to seeing how it went!


tcarneirop commented on June 11, 2024

Thanks!!

So should I continue with
-DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF -DCOSMA_WITH_GPU_AWARE_MPI=OFF

Trying right away!


kabicm commented on June 11, 2024

Exactly, I would turn off NCCL and GPU-aware MPI for the time being. Great, thanks!


tcarneirop commented on June 11, 2024

Now trying
mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 10
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)

(sorry for using 16k, now running with something bigger)

But I just don't get the output -- how should I read these numbers?
...
...
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)

COSMA TIMES [ms] = 1068 1083 1170 1236 1295 3184 9896 10132 10139 10368

?


kabicm commented on June 11, 2024

This already looks much better!

For fine-tuning, you can also try playing with the tile sizes for GPUs (see https://github.com/eth-cscs/COSMA#tunable-parameters). Once the matrix dimensions are split among the ranks, each rank further splits its local matrices into tiles and pipelines these tiles to the GPU. You can choose the sizes of these tiles, in number of elements, by setting:

export COSMA_GPU_MAX_TILE_M=4000
export COSMA_GPU_MAX_TILE_N=4000
export COSMA_GPU_MAX_TILE_K=4000

This means all the tiles would be 4k x 4k elements.

You can also try larger sizes, e.g. setting all the tiles to 8k. The optimal tile size depends on the GPUs you are using; for the P100, the optimal sizes were 4-5k.

You can just set these environment variables before running; you don't have to recompile the code, as they are read at runtime.
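
For instance, one could sweep a few candidate tile sizes in a loop (a sketch; on multi-node runs the variables may need to be forwarded explicitly, e.g. with OpenMPI's mpirun -x):

    # try square tiles of 2k, 4k and 8k elements per side; no rebuild needed
    for t in 2000 4000 8000; do
        export COSMA_GPU_MAX_TILE_M=$t COSMA_GPU_MAX_TILE_N=$t COSMA_GPU_MAX_TILE_K=$t
        mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 36000 -n 36000 -k 36000 --type=double -r 4
    done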

Just in case you want to experiment with it!


tcarneirop commented on June 11, 2024

Hello @kabicm

many thanks!!

It looks like we've got some speedup:


mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 36000 -n 36000 -k 36000 --type=double -r 4
COSMA TIMES [ms] = 16278 16278 16279 16766


mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 36000 -n 36000 -k 36000 --type=double -r 4
COSMA TIMES [ms] = 9173 9184 9282 12963


Question --

32k on one gpu uses -- 1719MiB / 40960MiB

on 26769MiB / 40960MiB

Is that OK, or is there something strange?


As I'm in a hurry, I could not yet test with the settings you suggested.

As soon as I get another time slot on the cluster, I'll re-run the experiments.

Thanks again,
We have to buy you a coffee =)

