eth-cscs / cosma
Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
License: BSD 3-Clause "New" or "Revised" License
As observed by @pseewald, when COSMA is compiled with the GPU backend, there is a large overhead for small matrices.
The overhead comes from the cost of CPU<->GPU transfers, which doesn't pay off for small matrices. In this case, it is better to perform the local multiplications directly on the CPU.
The library grid2grid is outdated and should be replaced with COSTA. COSTA is a highly efficient communication-optimal algorithm for matrix redistribution with the option to transpose or scale (multiply by a scalar) the matrix on the fly. Concretely, it implements the routine:
sub(B) = beta * sub(B) + alpha * sub(op(A)) ; op=N, T or C; sub = submatrix,
where the operation op corresponds to no operation (N), transpose (T) or conjugate (C).
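As a rough illustration of what this routine computes locally (ignoring the distributed layout and the submatrix offsets), the sketch below applies B = beta * B + alpha * op(A) to small row-major matrices. The function name `scale_and_add` and the `Op` enum are made up for this example and are not part of COSTA's API; only N and T are shown, for real-valued data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: not COSTA's API.
enum class Op { N, T };

// Computes B = beta * B + alpha * op(A) for dense row-major matrices.
// B has shape rows x cols; A has shape rows x cols for Op::N
// and cols x rows for Op::T.
inline void scale_and_add(std::vector<double>& B, const std::vector<double>& A,
                          std::size_t rows, std::size_t cols,
                          double alpha, double beta, Op op) {
    assert(A.size() == rows * cols && B.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // For Op::T, element (i, j) of op(A) is element (j, i) of A,
            // and A is stored with `rows` columns.
            const double a = (op == Op::N) ? A[i * cols + j] : A[j * rows + i];
            B[i * cols + j] = beta * B[i * cols + j] + alpha * a;
        }
    }
}
```

The real routine additionally redistributes the data between two different process grids, which this local sketch does not attempt to model.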
Decrease the overhead of the first run by avoiding the initialization of elements in the memory pool.
Dear developers, I am trying to link against COSMA by following the "USE COSMA in 30 seconds" instructions. However, it tells me grid2grid is not found. I saw that you replaced grid2grid with COSTA. Should the libraries to link against be updated as well? If so, which library should be linked? I see libcosta.a, libcosta_prefixed_scalapack.a, and libcosta_scalapack.a.
Thank you so much.
The memory pool reserves the elements but doesn't default-initialize them, which is undefined behaviour according to the standard.
For a related discussion see electronic-structure/SIRIUS#621
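One common way to get storage without paying for initialization while staying within the standard is an allocator adaptor that default-initializes elements instead of value-initializing them. The sketch below is a generic C++ idiom and not COSMA's memory pool; `default_init_allocator`, `pool_vector` and `grow_pool` are names chosen here for illustration. With such an allocator, `resize()` makes the new elements part of the vector, so indexing them is well defined, while trivially constructible types such as `double` are not zero-filled. Reading an element before writing it is still not allowed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor: construct() with no arguments default-initializes
// (a no-op for double) instead of value-initializing (zero-filling).
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;

public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };

    using A::A;

    template <typename U>
    void construct(U* ptr) noexcept(
        std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;  // default-initialization
    }

    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        traits::construct(static_cast<A&>(*this), ptr,
                          std::forward<Args>(args)...);
    }
};

template <typename T>
using pool_vector = std::vector<T, default_init_allocator<T>>;

// Grows the pool to at least `n` elements without zero-filling the new ones.
inline void grow_pool(pool_vector<double>& pool, std::size_t n) {
    if (n > pool.size()) {
        pool.resize(n);
    }
    assert(pool.size() >= n);
}
```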
Over the past few weeks we've been adding conda packages for spla, spfft and sirius and I'm now in the process of integrating them into the cp2k one.
COSMA may be somewhat of a fringe case for conda since conda binaries are unlikely to be run on really large-scale machines (for maximum performance you would compile it yourself) but it could still be useful for reproducibility and quick testing.
One could start with an MPI-only build, which I think would be straightforward; it is also possible to create GPU-enabled conda packages, but there is little documentation, so this would probably better be left to a later stage.
Let me know what you think!
The current code assumes that at least 2 MPI processes are used. If only 1 process is used, there are no divisions, so the strategy is empty, which causes a segfault when the matrices are created. Specifically, when the matrices are created, the buffers are created as well, and the segfault occurs in compute_buff_sizes at the line n_buckets_[step] == 1. n_buckets_ has size equal to strategy_->size, which is zero in this case, so the access is out of bounds.
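A minimal sketch of the kind of guard that would avoid this out-of-range access, assuming a per-step bucket vector like n_buckets_. The free function `step_has_single_bucket` and its signature are hypothetical; the actual fix in COSMA may look different.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true only if `step` is a valid index and that
// step uses exactly one bucket. With a single MPI process the strategy has no
// steps, so n_buckets is empty and every index is out of range.
inline bool step_has_single_bucket(const std::vector<int>& n_buckets,
                                   std::size_t step) {
    if (step >= n_buckets.size()) {
        return false;  // covers the empty-strategy (1 process) case
    }
    return n_buckets[step] == 1;
}
```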
I tried to compile the source code using the given CMakeLists.txt in an Intel+Linux environment, relying on Intel oneAPI for the C/C++, Fortran and MPI compilers and for the MKL library.
Using the provided CMakeLists.txt I got this error
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Could NOT find MPI (missing: MPI_CXX_FOUND)
Reason given by package: MPI component 'C' was requested, but language C is not enabled. MPI component 'Fortran' was requested, but language Fortran is not enabled.
Hence, I modified the file to include the required languages. If I do so, it successfully loads MPI_C and MPI_Fortran, but it is totally unable to load MPI_CXX, as shown by this error
-- Found MPI_C: /home/sw/openMPI/4.0.3/lib/libmpi.so (found version "3.1")
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
-- Found MPI_Fortran: /home/sw/openMPI/4.0.3/lib/libmpi_usempif08.so (found version "3.1")
CMake Error at /home/sw/cmake/3.19.1/share/cmake-3.19/Modules/FindPackageHandleStandardArgs.cmake:218 (message):
Could NOT find MPI (missing: MPI_CXX_FOUND) (found version "3.1")
Even when I try to skip the configuration of MPI_CXX, the CMake script completely fails to load the MKL library.
Is there any update to the CMake configuration that allows this library to be compiled easily?
You do not specify the path to the CUDA libraries. Therefore, one can't install COSMA with CUDA support.
When running
cmake -DCOSMA_BLAS=CUDA ..
Gives
- Selected BLAS backend for COSMA: CUDA
-- Selected SCALAPACK backend for COSMA: OFF
-- cxxopts version 2.2.0
CUDA not found. Please specify CUDA_PATH variable.
CMake Error at /usr/local/lib/python2.7/dist-packages/cmake/data/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find CUBLAS (missing: CUDA_TOOLKIT_INCLUDE CUDA_CUDART_LIB)
Call Stack (most recent call first):
/usr/local/lib/python2.7/dist-packages/cmake/data/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
libs/Tiled-MM/cmake/FindCUBLAS.cmake:51 (find_package_handle_standard_args)
libs/Tiled-MM/CMakeLists.txt:30 (find_package)
I get a segfault in COSMA for CP2K RPA on a system of 256 water molecules. The backtrace is:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#1 0x5529c30 in _ZN5cosma6Mapper5ownerERNS_10Interval2DE
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/mapper.cpp:362
#2 0x5529c30 in _ZN5cosma6Mapper15get_layout_gridEv
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/mapper.cpp:401
#3 0x550b504 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1_iiS5_
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/COSMA/src/cosma/cosma_pxgemm.cpp:190
#4 0x246aad2 in __cp_fm_basic_linalg_MOD_cp_fm_gemm
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/fm/cp_fm_basic_linalg.F:449
#5 0xf27025 in __cp_gemm_interface_MOD_cp_gemm
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/cp_gemm_interface.F:136
#6 0x1d59ede in contract_s_to_q
at /scratch/snx3000/pseewald/rpa_benchmarks_cosma/cp2k/src/rpa_util.F:810
The calculation was run on Piz Daint gpu partition / 1024 nodes with 12 OMP threads per rank.
CP2K input file:
H2O-RI-RPA-TZ-COSMA.inp.txt
I used COSMA commit 374ddb4
I'm seeing the following weak scaling performance when using COSMA configured with OpenBLAS:
Num Nodes | Avg Exec Time (ms) |
---|---|
1 | 2867 |
2 | 3026 |
4 | 3208 |
8 | 3090 |
16 | 5659 |
32 | 5730 |
64 | 5741 |
128 | 5893 |
256 | 5889 |
On each node, I'm using 20 threads for OpenMP. The initial problem size / command line is env OMP_NUM_THREADS=20 COSMA_OVERLAP_COMM_AND_COMP=ON jsrun -b none -c ALL_CPUS -g ALL_GPUS -r 1 -n 1 /g/g15/yadav2/cosma/build/miniapp/cosma_miniapp -r 10 -m 8192 -n 8192 -k 8192. I'm on commit c7bdab9.
I'm not sure what happened at 16 nodes that caused the performance dip -- is something like this expected?
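For reference, weak-scaling efficiency is usually reported as the single-node time divided by the N-node time. A small sketch of that arithmetic follows; the helper name is made up for this example.

```cpp
#include <cassert>

// Weak-scaling efficiency: time on one node divided by time on N nodes,
// with the per-node problem size held constant. 1.0 is ideal.
inline double weak_scaling_efficiency(double t_one_node_ms,
                                      double t_n_nodes_ms) {
    assert(t_one_node_ms > 0.0 && t_n_nodes_ms > 0.0);
    return t_one_node_ms / t_n_nodes_ms;
}
```

With the numbers from the table, weak_scaling_efficiency(2867, 3090) is about 0.93 at 8 nodes, while weak_scaling_efficiency(2867, 5659) is about 0.51 at 16 nodes.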
Hi! I'm trying to test out COSMA with OpenBLAS and running into problems. Importantly, I'm not using the CMake install of OpenBLAS, as using it gives me issues in a separate project; I'm using the standard make-based install of OpenBLAS. I used the following steps to compile COSMA:
mkdir build
cmake -DCOSMA_BLAS=OPENBLAS -DCMAKE_INSTALL_PREFIX=../install -DOPENBLAS_LIBRARIES=<path to my openblas.so> ..
make -j cosma_miniapp
I get out the errors:
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, double, double const*, int, double const*, int, double, double*, int)':
blas.cpp:(.text+0x4c): undefined reference to `cblas_dgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, double, double const*, int, double const*, int, double, double*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, std::complex<double>, std::complex<double> const*, int, std::complex<double> const*, int, std::complex<double>, std::complex<double>*, int)':
blas.cpp:(.text+0xdc): undefined reference to `cblas_zgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, void const*, void const*, int, void const*, int, void const*, void*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, float, float const*, int, float const*, int, float, float*, int)':
blas.cpp:(.text+0x14c): undefined reference to `cblas_sgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, int, float const*, int, float, float*, int)'
../src/cosma/libcosma.a(blas.cpp.o): In function `cosma::gemm(int, int, int, std::complex<float>, std::complex<float> const*, int, std::complex<float> const*, int, std::complex<float>, std::complex<float>*, int)':
blas.cpp:(.text+0x1d8): undefined reference to `cblas_cgemm(CBLAS_ORDER, CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, void const*, void const*, int, void const*, int, void const*, void*, int)'
collect2: error: ld returned 1 exit status
I've been playing around with the final link command here but to no avail. Any suggestions?
The mirroring on gitlab fails after force pushing on github. I believe the problem is connected to the following:
https://gitlab.com/gitlab-org/gitlab/issues/1063
I cannot push to the protected branch on gitlab.
@haampie do you have some idea how this could be fixed?
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: comp
System call: unlink(2) /dev/shm/osc_rdma.comp.59100001.9
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
[comp:35304] Read -1, expected 2048, errno = 14
[comp:35304] Read -1, expected 2048, errno = 14
[comp:35304] *** An error occurred in MPI_Accumulate
[comp:35304] *** reported by process [1494220801,3]
[comp:35304] *** on win rdma window 9
[comp:35304] *** MPI_ERR_OTHER: known error not in list
[comp:35304] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[comp:35304] *** and potentially your MPI job)
<end of output>
Test time = 4.24 sec
----------------------------------------------------------
Test Failed.
"test.multiply" end time: Jun 24 19:41 MSK
"test.multiply" time elapsed: 00:00:04
CMake options: -DCMAKE_INSTALL_PREFIX=/usr -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM -DBUILD_SHARED_LIBS=ON -DMPIEXEC_PREFLAGS='--oversubscribe'
Hey @haampie!
It seems CI/CD complains about a missing OpenMP flag. COSMA itself doesn't use OpenMP, but one of its dependencies (COSTA) has OpenMP as a dependency.
This part hasn't been changed since the last time it worked, so I don't understand what's the issue. Locally and on daint it works for me without problems. Can you please have a look?
Hi,
test.scalar_matmul is probably broken: it never terminates with any BLAS flavor (OPENBLAS, MKL, or CUDA). CMake options are:
cmake ../cosma \
-DCMAKE_INSTALL_PREFIX=/usr \
-DBUILD_SHARED_LIBS=ON \
-DCOSMA_BLAS=CUSTOM \
-DCOSMA_SCALAPACK=CUSTOM \
-DMPIEXEC_PREFLAGS='--oversubscribe'
Compilers and libraries: GCC 10.2.0, OpenMPI, CUDA 11.1.105, OpenBLAS 0.3.12, MKL 2020.4.304, ScaLAPACK 2.1.0
Built with clang, MKL + MPICH:
cmake ${src_dir} -G Ninja \
-D CMAKE_EXPORT_COMPILE_COMMANDS=ON \
-D CMAKE_INSTALL_PREFIX="~/software/cosma" \
-D CMAKE_BUILD_TYPE=RelWithDebInfo \
-D COSMA_WITH_TESTS=ON \
-D COSMA_WITH_APPS=ON \
-D COSMA_WITH_BENCHMARKS=OFF \
-D COSMA_WITH_PROFILING=OFF \
-D COSMA_BLAS=MKL \
-D COSMA_SCALAPACK=MKL \
Dependencies were built with spack (MKL/MPICH):
w6chlh7 [email protected]~ilp64+shared threads=none
e3ni7ed [email protected]~argobots~fortran+hwloc+hydra+libxml2+pci+romio~slurm~verbs+wrapperrpath device=ch3 netmod=tcp patches=eb982de3366d48cbc55eb5e0df43373a45d9f51df208abf0835a72dc6c0b4774 pmi=pmi
h5qfs34 [email protected]~cairo~cuda~gl~libudev+libxml2~netloc~nvml+pci+shared
gt474jm [email protected]
ptvhtfi [email protected]~python
The test case:
mpiexec -n 16 tests/test.pdgemm --gtest_filter=Default/PdgemmTestWithParams.pdgemm/29
Error message:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 186245 RUNNING AT T480s
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
... and also have a version number in the directory name inside the tarball
Dear COSMA developers,
I have been working at a GPU hackathon and testing COSMA (commit e034ddd) through the ScaLAPACK API with the FHIaims code on the Ascent cluster (the training cluster for Summit). With the COSMA GPU version, I was able to get a 7x speedup for pzgemm (matrix size 3312 * 3312) (36 Power CPU cores + 6 V100 GPUs vs. 36 Power CPU cores) (36 MPI ranks with OMP_THREADS=1). That is great. However, I have also seen some GPU errors after the job finished. The errors only appear if I link my code against COSMA. @kabicm suggests it could be something that happens during the cleanup stage. Unfortunately, I no longer have access to the cluster, so I am not able to provide a minimal example to reproduce the error. We will try to get access to the Summit cluster later.
This is the error I saw.
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU ERROR
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU ERROR
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU ERROR
Here are my build, link, and job submission scripts.
set -e
module purge
module load gcc/7.4.0
module load essl
module load cuda/10.1.243
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack
module load cmake
export CUDA_PATH=$CUDA_DIR
export CC=mpicc
export CXX=mpicxx
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM -DCMAKE_INSTALL_PREFIX=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy ..
make VERBOSE=0 -j 16
make install
set(CMAKE_C_COMPILER "mpicc" CACHE STRING "")
set(CMAKE_C_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -funwind-tables -fopenmp" CACHE STRING "")
set(CMAKE_Fortran_COMPILER "mpif90" CACHE STRING "")
set(CMAKE_Fortran_FLAGS "-O2 -g -fbacktrace -mcpu=power9 -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")
set(Fortran_MIN_FLAGS "-O0 -g -fbacktrace -ffree-line-length-none -funwind-tables -fopenmp" CACHE STRING "")
set(USE_CUDA ON CACHE BOOL "")
set(CMAKE_CUDA_FLAGS "-O2 -g -DAdd_ -arch=sm_70" CACHE STRING "")
set(USE_MPI ON CACHE BOOL "")
set(USE_SCALAPACK ON CACHE BOOL "")
set(USE_LIBXC OFF CACHE BOOL "")
set(USE_iPI OFF CACHE BOOL "")
set(USE_SPGLIB OFF CACHE BOOL "")
SET(INC_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/include" CACHE STRING "")
set(LIB_PATHS "/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64 $ENV{OLCF_ESSL_ROOT}/lib64 $ENV{OLCF_CUDA_ROOT}/lib64" CACHE STRING "")
set(LIBS "cosma_pxgemm cosma costa_scalapack costa Tiled-MM scalapack essl lapack cublas cudart" CACHE STRING "")
#!/bin/bash
#BSUB -P GEN157
#BSUB -W 2:00
#BSUB -nnodes 1
#BSUB -alloc_flags gpumps
#BSUB -J aims-gw
#BSUB -o aims.%J
#BSUB -N [email protected]
module purge
module load gcc/7.4.0 spectrum-mpi/10.3.1.2-20200121 cuda/10.1.243 essl/6.1.0-2 netlib-lapack/3.8.0 netlib-scalapack/2.0.2
module load nsight-systems/2021.2.1.58
export COSMA_GPU_MEMORY_PINNING=OFF
export COSMA_GPU_STREAMS=1
export COSMA_GPU_MAX_TILE_M=500
export COSMA_GPU_MAX_TILE_N=500
export COSMA_GPU_MAX_TILE_K=500
#export LD_LIBRARY_PATH=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/cosma/install_yy/lib64
#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_2/aims.210427.scalapack.mpi.x
#bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_3/aims.210427.scalapack.mpi.x
bin=/ccsopen/home/yaoyi92/proj_dir_aims-gw/yaoyi92/FHIaims_gw_gpu/build_gcc_cuda10_cosma/aims.210427.scalapack.mpi.x
export OMP_NUM_THREADS=1
ulimit -s unlimited
jsrun -n 6 -a 6 -c 6 -g 1 -r 6 $bin > aims.out
Best wishes,
Yi
At the moment, it is assumed that each rank uses a single GPU. However, we should also be able to deal with multiple GPUs being available to the same MPI rank.
Currently, the beta parameter in strategy is of type double (and thus not templated). However, there is no need for strategy to contain this parameter. Strategy has to be decoupled from the command line parser (options), and beta should be removed from strategy.
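One possible shape of that decoupling, as a sketch only: the strategy describes the problem and nothing else, while alpha and beta are passed to the multiplication and share the element type of the matrices. `StrategySketch` and `multiply_sketch` are hypothetical names, and the naive triple loop stands in for the real distributed algorithm.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the strategy: it only describes the problem
// shape, not any scalar values.
struct StrategySketch {
    std::size_t m, n, k;
};

// alpha and beta travel with the call and have the matrix element type,
// so complex scaling factors work without touching the strategy.
template <typename Scalar>
void multiply_sketch(const StrategySketch& s,
                     const std::vector<Scalar>& A,  // m x k, row-major
                     const std::vector<Scalar>& B,  // k x n, row-major
                     std::vector<Scalar>& C,        // m x n, row-major
                     Scalar alpha, Scalar beta) {
    assert(A.size() == s.m * s.k);
    assert(B.size() == s.k * s.n);
    assert(C.size() == s.m * s.n);
    for (std::size_t i = 0; i < s.m; ++i) {
        for (std::size_t j = 0; j < s.n; ++j) {
            Scalar acc{};
            for (std::size_t p = 0; p < s.k; ++p) {
                acc += A[i * s.k + p] * B[p * s.n + j];
            }
            C[i * s.n + j] = beta * C[i * s.n + j] + alpha * acc;
        }
    }
}
```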
Please, include a CMAKE option to build COSMA as a shared library.
The current implementation allocates one large piece of memory from which all the buffers are taken. This has its advantages, but it requires all the memory to be available as one contiguous piece, which might not be possible in case of fragmentation (see here).
We want to switch to using a proper memory pool.
I'm testing out COSMA with GPU support on a single node with GPUs, and I'm not seeing the performance I would expect.
1 GPU:
COSMA TIMES [ms] = 1562 1657 2133 2390 6865
2 GPU:
COSMA TIMES [ms] = 1544 2710 3374 3626 6060
4 GPU:
COSMA TIMES [ms] = 805 832 1456 3142 6419
I expect to:
I'm on the current master, and running the miniapp with (-n and -r are how many ranks to run on a node)
OMP_NUM_THREADS=6 COSMA_OVERLAP_COMM_AND_COMP=ON jsrun -n 4 -c 6 -g 1 -b none -r 4 ./miniapp/cosma_miniapp -m 16384 -n 16384 -k 16384 -r 5
I build cosma with:
cmake -DCOSMA_BLAS=CUDA -DCMAKE_INSTALL_PREFIX=../ ..
The array range_to_rank_ in mapper.hpp should hold values of type std::size_t to avoid a type mismatch.
The CP2K scaling test found an erroneous arithmetic operation at interval.cpp:41. Presumably, the other divisions in that file are also at risk.
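An "erroneous arithmetic operation" typically points to an integer division or modulo by zero. The sketch below shows one way such divisions could be guarded when an interval is split into parts. It is an assumption about the kind of code involved, not a copy of interval.cpp; `subinterval_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <stdexcept>
#include <utility>

// Hypothetical sketch: split the closed range [first, last] into `divisor`
// nearly equal parts and return the bounds of part `index`.
// The divisor is validated before any division or modulo is performed.
inline std::pair<int, int> subinterval_sketch(int first, int last,
                                              int divisor, int index) {
    if (divisor <= 0 || index < 0 || index >= divisor) {
        throw std::invalid_argument("invalid divisor or index");
    }
    const int length = last - first + 1;
    const int base = length / divisor;
    const int rem = length % divisor;
    // The first `rem` parts receive one extra element each.
    const int start = first + index * base + std::min(index, rem);
    const int size = base + (index < rem ? 1 : 0);
    return {start, start + size - 1};
}
```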
It would be nice to have support for hipBLAS, as this would allow testing the HIP code path on an NVIDIA device.
In return, support for rocBLAS could be dropped.
Even though the -s option is marked as optional in the documentation, cosma_miniapp terminates with a domain_error exception if this option is not given a value.
Please add a default value or check if the value of the option is available before accessing it.
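The second suggestion could look roughly like the sketch below, which is independent of the option parser the miniapp actually uses. `option_value` is a hypothetical helper that returns an empty optional when the flag is missing, so the caller can fall back to a default instead of throwing.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch: returns the value following `flag`, or std::nullopt
// if the flag is absent or has no value after it.
inline std::optional<std::string> option_value(
    const std::vector<std::string>& args, const std::string& flag) {
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == flag) {
            return args[i + 1];
        }
    }
    return std::nullopt;
}
```

A caller would then write something like option_value(args, "-s").value_or(""), treating the empty string as "no value given".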
on Alps/Eiger:
$ cmake --install . --prefix ../install.cosma
-- Install configuration: "Release"
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/blas.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/memory_pool.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/random_generator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/pinned_buffers.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/local_multiply.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/timer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/multiply.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/cinterface.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/context.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/environment_variables.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/statistics.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/mpi_mapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/profiler.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/matrix.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/blacs.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/cosma_pxgemm.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/mapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/scalapack.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/one_sided_communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/two_sided_communicator.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/strategy.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/buffer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/math_utils.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/pxgemm_params.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/cosma/interval.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaConfig.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaConfigVersion.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/FindMKL.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/pkgconfig/cosma.pc
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridTargets.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridTargets-release.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/transform.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/communication_data.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/memory_utils.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/mpi_type_wrapper.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/profiler.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/ranks_reordering.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/cantor_mapping.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid_cover.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/transformer.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid2D.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/block.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/grid_layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/comm_volume.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/scalapack_layout.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/tiling_manager.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/include/grid2grid/interval.hpp
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridConfig.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/grid2grid/grid2gridConfigVersion.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libgrid2grid.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma_pxgemm.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/libcosma_pxgemm_cpp.a
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaTargets.cmake
-- Installing: /users/timuel/work/cp2k/build.cosma/../install.cosma/lib64/cmake/cosma/cosmaTargets-release.cmake
-- Installing: /usr/local/bin/test.mapper
CMake Error at tests/cmake_install.cmake:60 (file):
file INSTALL cannot copy file
"/users/timuel/work/cp2k/build.cosma/tests/test.mapper" to
"/usr/local/bin/test.mapper": Permission denied.
Call Stack (most recent call first):
cmake_install.cmake:67 (include)
The pdgemm wrapper gives wrong results, when pdgemm is invoked with the following parameters:
// ****************************************
// * INPUT PARAMETERS BEGIN *
// ****************************************
// * global dimensions *
// ***********************
// matrix A
int ma = 1280; // rows
int na = 1280; // cols
// matrix B
int mb = 1280; // rows
int nb = 1280; // cols
// matrix C
int mc = 1280; // rows
int nc = 1280; // cols
// ***********************
// * block sizes *
// ***********************
// matrix A
int bma = 32; // rows
int bna = 32; // cols
// matrix B
int bmb = 32; // rows
int bnb = 32; // cols
// matrix C
int bmc = 32; // rows
int bnc = 32; // cols
// ***********************
// * submatrices ij *
// ***********************
// matrix A
int ia = 1; // rows
int ja = 545; // cols
// matrix B
int ib = 513; // rows
int jb = 545; // cols
// matrix C
int ic = 1; // rows
int jc = 513; // cols
// ***********************
// * problem size *
// ***********************
int m = 512;
int n = 32;
int k = 736;
// ***********************
// * transpose flags *
// ***********************
char trans_a = 'N';
char trans_b = 'T';
// ***********************
// * scaling flags *
// ***********************
double alpha = 1.0;
double beta = 1.0;
// ***********************
// * leading dims *
// ***********************
int lld_a = 640;
int lld_b = 640;
int lld_c = 640;
// ***********************
// * proc grid *
// ***********************
int p = 2; // rows
int q = 4; // cols
// ***********************
// * proc srcs *
// ***********************
// matrix A
int src_ma = 0; // rows
int src_na = 0; // cols
// matrix B
int src_mb = 0; // rows
int src_nb = 0; // cols
// matrix C
int src_mc = 0; // rows
int src_nc = 0; // cols
// ****************************************
// * INPUT PARAMETERS END *
// ****************************************
I realized I cannot reopen issue #87, so I am copying the question here. Sorry for the confusion.
=====
Hi @kabicm ,
I am able to redo the test again and the problem still exists with v2.5.0 and the master branch. Does COSMA use a wrapper over MPI_Finalize()? I notice a similar issue on Summit here with another code that is wrapping over MPI_Finalize LLNL/Caliper#392.
If that's the case, my question becomes: Is it possible to manually finalize COSMA?
A related question: after I call COSMA, does it keep holding GPU memory after the gemm calls? Is it possible to control that GPU memory? I guess I am looking for something like initializing and finalizing a COSMA environment around a certain code region, so that the GPU memory is freed outside that region.
Best wishes,
Yi
Stumbled across this:
COSMA/src/cosma/cosma_pxgemm.cpp
Line 276 in d0ab1dd
should probably use push_back(), as the vector is default initialised in line 250.
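To illustrate the general point (without reproducing the code at the referenced lines): a default-constructed std::vector is empty, so elements have to be appended rather than written by index. `collect_sketch` is a made-up function for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The vector starts empty (default-constructed), so elements must be
// appended; writing through operator[] here would be out of bounds.
inline std::vector<int> collect_sketch(int n) {
    std::vector<int> values;                        // size() == 0
    values.reserve(static_cast<std::size_t>(n));    // capacity only
    for (int i = 0; i < n; ++i) {
        values.push_back(i);                        // not: values[i] = i;
    }
    assert(values.size() == static_cast<std::size_t>(n));
    return values;
}
```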
Hi,
Could you please add an option to build COSMA using COSTA as an external library?
Anton K.
I try to build and install COSMA on piz daint:
source scripts/piz_daint_gpu.sh
then
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCMAKE_INSTALL_PREFIX=<installation dir>/cosma ..
make -j 8
make install
All library files are copied to the location <installation dir>/cosma/lib64, but libgrid2grid.a is copied to <installation dir>/cosma/lib.
When the memory usage is tight (no alignment, pool growing factor=1), the segfault occurs in the following case:
cosma::pxgemm_params<double>{
// matrix dimensions
1280, 1280, // matrix A
1280, 1280, // matrix B
1280, 1280, // matrix C
// block sizes
32, 32, // matrix A
32, 32, // matrix B
32, 32, // matrix C
// submatrices ij
1, 545, // matrix A
513, 545, // matrix B
1, 513, // matrix C
// problem size
512, 32, 736,
// transpose flags
'N', 'T',
// scaling flags
1.0, 1.0,
// leading dims
640, 640, 640,
// proc grid
2, 4, 'R',
// proc srcs
0, 0, // matrix A
0, 0, // matrix B
0, 0 // matrix C
}
This happens only when beta != 0. It is caused by swapping reduce_buffer with buffer_i when reduce_buffer.size() < buffer_i.size().
When compiling COSMA using CUDA 9, and the following instructions:
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=CUSTOM ..
make -j
The compilation fails with:
/COSMA/benchmarks/gpu_gemm_cublas.cpp:5:10: fatal error: cublasLt.h: No such file or directory
#include <cublasLt.h>
^~~~~~~~~~~~
compilation terminated.
The header cublasLt.h is indeed not included in CUDA 9.
Compiling with CUDA 10, however, works fine.
COSMA is able to deal with limited memory, but the available memory per rank should be specified by the user. If not specified, infinite memory is assumed. We want to add an environment variable that will specify the amount of available memory.
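A sketch of how such a variable could be read, assuming a hypothetical name (COSMA_SKETCH_MEMORY_LIMIT_ELEMENTS) and a limit expressed as a number of elements. If the variable is unset or malformed, the largest representable value is returned to stand for "infinite" memory.

```cpp
#include <cerrno>
#include <cstdlib>
#include <limits>

// Hypothetical variable name; the issue does not specify one.
constexpr const char* kMemoryLimitEnvVar = "COSMA_SKETCH_MEMORY_LIMIT_ELEMENTS";

// Returns the per-rank memory limit in number of elements, or the largest
// representable value ("infinite") if the variable is unset or invalid.
inline long long available_memory_elements() {
    const long long unlimited = std::numeric_limits<long long>::max();
    const char* raw = std::getenv(kMemoryLimitEnvVar);
    if (raw == nullptr) {
        return unlimited;
    }
    char* end = nullptr;
    errno = 0;
    const long long value = std::strtoll(raw, &end, 10);
    // Reject empty input, trailing garbage, overflow and non-positive limits.
    if (end == raw || *end != '\0' || errno == ERANGE || value <= 0) {
        return unlimited;
    }
    return value;
}
```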
Hello! I am trying out COSMA on Summit. I built COSMA with -DCOSMA_BLAS=CUDA -DCOSMA_WITH_PROFILING=ON using GCC 8.1, CUDA 10.1 and IBM Spectrum MPI 10.3.
I am testing the miniapp as follows, using 3 nodes with 6 MPI ranks and 6 GPUs per node:
cosma/build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -P 18
The result I get is
Strategy = Matrix dimensions (m, n, k) = (1000, 1000, 1000)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 166668
Available memory per rank (in #elements): 9223372036854775807
_p_ REGION CALLS THREAD WALL %
_p_ total - 0.005 0.005 100.0
_p_ multiply - 0.004 0.004 86.8
_p_ computation 1 0.004 0.004 86.8
_p_ other 2 0.000 0.000 0.0
_p_ preprocessing - 0.001 0.001 13.2
_p_ communicators 1 0.001 0.001 12.8
_p_ matrices - 0.000 0.000 0.4
_p_ mapper - 0.000 0.000 0.3
_p_ coordinates 3 0.000 0.000 0.2
_p_ sizes 3 0.000 0.000 0.1
_p_ layout 3 0.000 0.000 0.0
_p_ buffer 3 0.000 0.000 0.0
_p_ allocation 2 0.000 0.000 0.0
COSMA TIMES [ms] = 5
I am not sure why the number of processors is reported as 1 instead of 18.
Running the pdgemm_miniapp returns the following error at the end: Attempting to use an MPI routine after finalizing MPICH.
Instead of pinning and unpinning buffers within each local_multiply call, amortize the cost by reusing pinned buffers from the previous multiplication where possible. This can be done through the context.
Example:
[2/58] Building CXX object src/cosma/CMakeFiles/cosma_pxgemm_cpp.dir/scalapack.cpp.o
In file included from /home/teonnik/code/cosma/src/cosma/scalapack.cpp:1:
In file included from /home/teonnik/code/cosma/src/cosma/scalapack.hpp:5:
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:158:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
local_grid_coord() = default;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:155:21: note: default constructor of 'local_grid_coord' is implicitly deleted because field 'el_coord' has no default constructor
elem_grid_coord el_coord;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:192:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
matrix_grid() = default;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:189:16: note: default constructor of 'matrix_grid' is implicitly deleted because field 'matrix_dimension' has no default constructor
matrix_dim matrix_dimension;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:213:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
local_blocks() = default;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:207:15: note: default constructor of 'local_blocks' is implicitly deleted because field 'block_dimension' has no default constructor
block_dim block_dimension;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:244:5: warning: explicitly defaulted default constructor is implicitly deleted [-Wdefaulted-function-deleted]
data_layout() = default;
^
/home/teonnik/code/cosma/libs/grid2grid/src/grid2grid/scalapack_layout.hpp:239:16: note: default constructor of 'data_layout' is implicitly deleted because field 'matrix_dimension' has no default constructor
matrix_dim matrix_dimension;
^
4 warnings generated.
Full report: https://pastebin.com/hPfZkcTk
The cmake version should be in sync with the git version to ensure the correct file names and sonames of shared libraries.
The test cases are: 10, 11, 13, 21, 22 and 24
Setup: GCC 10.2.0, Open MPI 4.0.5 and Intel MKL 2020.4.304 (versions as they appear in the library paths of the backtrace below)
Error message for test case 10:
2: [T480s:485190] *** Process received signal ***
2: [T480s:485188] *** Process received signal ***
2: [T480s:485188] Signal: Segmentation fault (11)
2: [T480s:485188] Signal code: Address not mapped (1)
2: [T480s:485188] Failing at address: 0xffffffff82a03518
2: [T480s:485188] [T480s:485190] Signal: Segmentation fault (11)
2: [T480s:485190] Signal code: Address not mapped (1)
2: [T480s:485190] Failing at address: 0xffffffffaa9cc258
2: [T480s:485190] [ 0] /usr/lib/libpthread.so.0(+0x140f0)[0x7ffba86950f0]
2: [T480s:485190] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7ffbaf83cd17]
2: [T480s:485190] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_blacs_openmpi_lp64.so(MKLMPI_Comm_size+0x2a)[0x7ffba846643a]
2: [T480s:485190] [ 3] [ 0] /usr/lib/libpthread.so.0(+0x140f0)[0x7f09e01740f0]
2: [T480s:485188] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7f09e731bd17]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(PB_CpgemmMPI+0x15c)[0x7ffbaff869bc]
2: [T480s:485190] [ 4] [T480s:485188] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_blacs_openmpi_lp64.so(MKLMPI_Comm_size+0x2a)[0x7f09dff4543a]
2: [T480s:485188] [ 3] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(pdgemm_+0xda7)[0x7ffbaffebd27]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(PB_CpgemmMPI+0x15c)[0x7f09e7a659bc]
2: [T480s:485188] [ 4] [T480s:485190] [ 5] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(pdgemm_+0xda7)[0x7f09e7acad27]
2: [T480s:485188] [ 5] ./test.pdgemm(+0x2153d)[0x555d81b1c53d]
2: [T480s:485188] [ 6] ./test.pdgemm(+0x21f5c)[0x555d81b1cf5c]
2: [T480s:485188] [ 7] ./test.pdgemm(+0x15153)[0x555d81b10153]
2: [T480s:485188] [ 8] ./test.pdgemm(+0x532d7)[0x555d81b4e2d7]
2: [T480s:485188] [ 9] ./test.pdgemm(+0x494b9)[0x555d81b444b9]
2: [T480s:485188] [10] ./test.pdgemm(+0x49785)[0x555d81b44785]
2: [T480s:485188] [11] ./test.pdgemm(+0x49915)[0x555d81b44915]
2: [T480s:485188] [12] ./test.pdgemm(+0x4a033)[0x555d81b45033]
2: [T480s:485188] [13] ./test.pdgemm(+0x53857)[0x555d81b4e857]
2: [T480s:485188] [14] ./test.pdgemm(+0x4a174)[0x555d81b45174]
2: [T480s:485188] [15] ./test.pdgemm(+0xf1a3)[0x555d81b0a1a3]
2: [T480s:485188] [16] /usr/lib/libc.so.6(__libc_start_main+0xf2)[0x7f09dfa3d152]
2: [T480s:485188] [17] ./test.pdgemm(+0x1010e)[0x555d81b0b10e]
2: [T480s:485188] *** End of error message ***
2: ./test.pdgemm(+0x2153d)[0x55aaaa49053d]
2: [T480s:485190] [ 6] ./test.pdgemm(+0x21f5c)[0x55aaaa490f5c]
2: [T480s:485190] [ 7] ./test.pdgemm(+0x15153)[0x55aaaa484153]
2: [T480s:485190] [ 8] ./test.pdgemm(+0x532d7)[0x55aaaa4c22d7]
2: [T480s:485190] [ 9] ./test.pdgemm(+0x494b9)[0x55aaaa4b84b9]
2: [T480s:485190] [T480s:485187] *** Process received signal ***
2: [T480s:485189] *** Process received signal ***
2: [T480s:485189] Signal: Segmentation fault (11)
2: [T480s:485189] Signal code: Address not mapped (1)
2: [T480s:485189] Failing at address: 0xffffffff96ebf318
2: [T480s:485187] Signal: Segmentation fault (11)
2: [T480s:485187] Signal code: Address not mapped (1)
2: [T480s:485187] Failing at address: 0x69f95bc8
2: [T480s:485187] [ 0] /usr/lib/libpthread.so.0(+0x140f0)[0x7fe8df4050f0]
2: [T480s:485187] [ 1] [T480s:485189] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7fe8e65acd17]
2: [T480s:485187] [10] ./test.pdgemm(+0x49785)[0x55aaaa4b8785]
2: [T480s:485190] [11] ./test.pdgemm(+0x49915)[0x55aaaa4b8915]
2: [T480s:485190] [12] ./test.pdgemm(+0x4a033)[0x55aaaa4b9033]
2: [T480s:485190] [13] [ 2] ./test.pdgemm(+0x53857)[0x55aaaa4c2857]
2: [T480s:485190] [14] ./test.pdgemm(+0x4a174)[0x55aaaa4b9174]
2: [T480s:485190] [15] ./test.pdgemm(+0xf1a3)[0x55aaaa47e1a3]
2: [T480s:485190] [16] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_blacs_openmpi_lp64.so(MKLMPI_Comm_size+0x2a)[0x7fe8df1d643a]
2: [T480s:485187] [ 3] [ 0] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(PB_CpgemmMPI+0x15c)[0x7fe8e6cf69bc]
2: [T480s:485187] [ 4] /usr/lib/libpthread.so.0(+0x140f0)[0x7f36e61e10f0]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(pdgemm_+0xda7)[0x7fe8e6d5bd27]
2: [T480s:485187] [ 5] ./test.pdgemm(+0x2153d)[0x558b68caa53d]
2: [T480s:485187] [ 6] ./test.pdgemm(+0x21f5c)[0x558b68caaf5c]
2: [T480s:485187] [ 7] ./test.pdgemm(+0x15153)[0x558b68c9e153]
2: [T480s:485187] [ 8] ./test.pdgemm(+0x532d7)[0x558b68cdc2d7]
2: [T480s:485187] [ 9] ./test.pdgemm(+0x494b9)[0x558b68cd24b9]
2: [T480s:485187] [10] ./test.pdgemm(+0x49785)[0x558b68cd2785]
2: [T480s:485187] [11] ./test.pdgemm(+0x49915)[0x558b68cd2915]
2: [T480s:485187] [12] ./test.pdgemm(+0x4a033)[0x558b68cd3033]
2: [T480s:485187] [13] ./test.pdgemm(+0x53857)[0x558b68cdc857]
2: [T480s:485187] [14] ./test.pdgemm(+0x4a174)[0x558b68cd3174]
2: [T480s:485187] [15] ./test.pdgemm(+0xf1a3)[0x558b68c981a3]
2: [T480s:485187] [16] /usr/lib/libc.so.6(__libc_start_main+0xf2)[0x7fe8decce152]
2: [T480s:485187] [17] ./test.pdgemm(+0x1010e)[0x558b68c9910e]
2: [T480s:485187] *** End of error message ***
2: [T480s:485189] [ 1] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/openmpi-4.0.5-hb4m5wzvl2chgf47y6vsuxwgbqtnzflj/lib/libmpi.so.40(PMPI_Comm_size+0x37)[0x7f36ed388d17]
2: [T480s:485189] [ 2] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_blacs_openmpi_lp64.so(MKLMPI_Comm_size+0x2a)[0x7f36e5fb243a]
2: [T480s:485189] [ 3] /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(PB_CpgemmMPI+0x15c)[0x7f36edad29bc]
2: [T480s:485189] [ 4] /usr/lib/libc.so.6(__libc_start_main+0xf2)[0x7ffba7f5e152]
2: /home/teonnik/code/spack/opt/spack/linux-arch-skylake/gcc-10.2.0/intel-mkl-2020.4.304-vengwtbz3klxn4tmjegweaggvcgo6qut/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_scalapack_lp64.so(pdgemm_+0xda7)[0x7f36edb37d27]
2: [T480s:485190] [17] ./test.pdgemm(+0x1010e)[0x55aaaa47f10e]
2: [T480s:485190] *** End of error message ***
2: [T480s:485189] [ 5] ./test.pdgemm(+0x2153d)[0x559c951be53d]
2: [T480s:485189] [ 6] ./test.pdgemm(+0x21f5c)[0x559c951bef5c]
2: [T480s:485189] [ 7] ./test.pdgemm(+0x15153)[0x559c951b2153]
2: [T480s:485189] [ 8] ./test.pdgemm(+0x532d7)[0x559c951f02d7]
2: [T480s:485189] [ 9] ./test.pdgemm(+0x494b9)[0x559c951e64b9]
2: [T480s:485189] [10] ./test.pdgemm(+0x49785)[0x559c951e6785]
2: [T480s:485189] [11] ./test.pdgemm(+0x49915)[0x559c951e6915]
2: [T480s:485189] [12] ./test.pdgemm(+0x4a033)[0x559c951e7033]
2: [T480s:485189] [13] ./test.pdgemm(+0x53857)[0x559c951f0857]
2: [T480s:485189] [14] ./test.pdgemm(+0x4a174)[0x559c951e7174]
2: [T480s:485189] [15] ./test.pdgemm(+0xf1a3)[0x559c951ac1a3]
2: [T480s:485189] [16] /usr/lib/libc.so.6(__libc_start_main+0xf2)[0x7f36e5aaa152]
2: [T480s:485189] [17] ./test.pdgemm(+0x1010e)[0x559c951ad10e]
2: [T480s:485189] *** End of error message ***
2: --------------------------------------------------------------------------
2: Primary job terminated normally, but 1 process returned
2: a non-zero exit code. Per user-direction, the job has been aborted.
2: --------------------------------------------------------------------------
2: --------------------------------------------------------------------------
2: mpiexec noticed that process rank 3 with PID 0 on node T480s exited on signal 11 (Segmentation fault).
2: --------------------------------------------------------------------------
1/1 Test #2: test.pdgemm ......................***Failed 4.26 sec
Hi,
I get OOM events when working with quite large matrix dimensions.
I am working with commit a7c6bb3
on Intel Xeon Platinum 8360Y (72 cores per node and 256 GB RAM)
using the following libraries: intel-mpi, intel-mkl, and intel-mpi (not too old versions).
When working on 2 of my 72 core machines and executing the following command:
srun /u/airmler/src/COSMA/miniapp/cosma_miniapp -m 36044 -n 36044 -k 36044 -r 5
(which is identical to mpirun -np 144 $EXE, when working with another wrapper),
I obtain an OOM event (slurm kills the job because some process is OOM).
However, the program reports in stdout:
Required memory per rank (in #elements): 171415254
which is around 1.3GB. This is way less than the 3.7GB per rank I should have available.
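The 1.3GB figure can be checked with a short calculation. This is only a sketch: it assumes that each element is an 8-byte double, and the helper names required_bytes and to_gb are hypothetical and not part of COSMA.

```cpp
#include <cstdint>

// Bytes needed for `elements` values of `bytes_per_element` bytes each.
// Assumption: the elements are doubles (8 bytes) unless stated otherwise.
inline std::uint64_t required_bytes(std::uint64_t elements,
                                    std::uint64_t bytes_per_element = sizeof(double)) {
    return elements * bytes_per_element;
}

// Convert a byte count to decimal gigabytes (1 GB = 1e9 bytes).
inline double to_gb(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}
```

With the reported 171415254 elements this gives 1371322032 bytes, i.e. about 1.37 GB per rank, which matches the estimate above.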
What helps is reducing the memory COSMA is allowed to use. With
export COSMA_CPU_MAX_MEMORY=1100
the job runs without problems.
On this cluster, it is not so easy to profile the memory consumption. Are you aware of any load imbalances? Is the reported "required memory per rank" reliable?
===
Note: it is not important for me that this issue is resolved. I simply want to share this behavior with you.
I am sure you can adapt the matrix dimensions so that you can reproduce the problem on your cluster.
Otherwise I could try to run some suggested examples on "my" machine.
Best regards.
Hi,
I ran into a few minor errors while trying to compile with intel/17.0.4 and impi/17.3.
The first one is in grid2grid: kabicm/grid2grid/issues/10
The others are:
src/cosma/strategy.cpp(244): error: no instance of constructor "std::vector<_Tp, _Alloc>::vector [with _Tp=std::tuple<long long, int, char>, _Alloc=std::allocator<std::tuple<long long, int, char>>]" matches the argument list
argument types are: ({...}, {...}, {...})
std::vector<dim_pair> dims = {{m, divm, 'B'}, {n, divn, 'A'}, {k, divk, 'C'}};
src/cosma/strategy.cpp(278): error: copy-list-initialization cannot use a constructor marked "explicit"
return {memory_A, memory_B, memory_C};
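Both errors look like cases where the standard library in use marks the std::tuple constructor as explicit, so a braced list cannot be converted implicitly. One possible workaround is to build the tuples with std::make_tuple. The sketch below is not the project's actual fix: the function names make_dims and make_memory are hypothetical, dim_pair is taken from the error message, and the types of m, n, k and the memory values are assumed to be long long.

```cpp
#include <tuple>
#include <vector>

// Element type as reported in the compiler error message.
using dim_pair = std::tuple<long long, int, char>;

// Hypothetical stand-in for the initialization at strategy.cpp(244):
// each tuple is constructed explicitly instead of from a braced list.
inline std::vector<dim_pair> make_dims(long long m, long long n, long long k,
                                       int divm, int divn, int divk) {
    return {std::make_tuple(m, divm, 'B'),
            std::make_tuple(n, divn, 'A'),
            std::make_tuple(k, divk, 'C')};
}

// Hypothetical stand-in for the return statement at strategy.cpp(278):
// std::make_tuple avoids copy-list-initialization of the return value.
inline std::tuple<long long, long long, long long>
make_memory(long long memory_A, long long memory_B, long long memory_C) {
    return std::make_tuple(memory_A, memory_B, memory_C);
}
```

This form should compile whether or not the tuple constructor is explicit, since no implicit conversion from a braced list is needed.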
A+,
Alex
In the GPU backend, Tiled-MM, the device id is always set to 0. Instead, it should be set only if it has not already been set by the user.