MPI problems with srun (eckit issue, closed)

ecmwf commented on August 20, 2024
MPI problems with srun

from eckit.

Comments (9)

wdeconinck commented on August 20, 2024

Can you check first that eckit has MPI enabled in the CMake configuration log? I guess so but just to be sure.

Then, as to how eckit detects that it is running in an MPI environment: at run time it checks for the following environment variables to choose between the true MPI implementation and a serial MPI stub implementation.
https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L43-L45

I should mention that I have been able to use srun in the past without issues, provided that SLURM_EXPORT_ENV is set with export SLURM_EXPORT_ENV=ALL, which is required to propagate environment variables to the program launched by srun.

If that doesn't work, you can force eckit to choose the "parallel" (MPI) implementation by setting the environment variable:
export ECKIT_MPI_FORCE=parallel

mmiesch commented on August 20, 2024

Thanks @wdeconinck for the helpful information. It turns out that MPI was enabled, and setting SLURM_EXPORT_ENV=ALL did not help. But the export ECKIT_MPI_FORCE=parallel command does seem to work!

mmiesch commented on August 20, 2024

Though this does work for parallel applications, it's not an ideal solution for unit testing with ctest, of course. Setting this variable helps the parallel tests pass, but it breaks all the serial tests. So, if you have any other ideas on how one might get around this, please let me know. Thanks again.

wdeconinck commented on August 20, 2024

Ideally, the fix is to figure out which environment variable we can add to the list that eckit checks for (e.g. OMPI_COMM_WORLD_SIZE, ALPS_APP_PE, PMI_SIZE).
To figure it out, one could run:

srun -n 1 env

and see which environment variable srun adds that we can detect MPI with. You can then test it by adding it to the code here https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L43-L45. If it works, we can then patch eckit.
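As a rough illustration of the inspection step above, the same check can be made from inside a program instead of reading the output of env. This is only a sketch: env_entries_with_prefix is a hypothetical helper, not part of eckit, and it assumes a POSIX system where the global environ is available.

```cpp
#include <string>
#include <vector>

extern char** environ;  // POSIX: the process environment as "NAME=value" strings

// Hypothetical helper (not part of eckit): collect every environment entry
// whose name starts with the given prefix, e.g. "SLURM_" or "PMI_".
std::vector<std::string> env_entries_with_prefix(const std::string& prefix) {
    std::vector<std::string> matches;
    for (char** entry = environ; entry != nullptr && *entry != nullptr; ++entry) {
        const std::string text(*entry);
        // compare() on the leading prefix.size() characters acts as "starts with"
        if (text.compare(0, prefix.size(), prefix) == 0) {
            matches.push_back(text);
        }
    }
    return matches;
}
```

Calling env_entries_with_prefix("SLURM_") from a small program launched with srun -n 1, and printing the result, would list the candidates that srun injects on a given system.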

Another workaround, an ugly hack that I have used and that might satisfy you for now, is to create an srun wrapper called e.g. parallel_run containing the lines:

export ECKIT_MPI_FORCE=parallel
srun "$@"

Then CMake-configure with -DMPIEXEC_EXECUTABLE=<path/to/parallel_run> -DMPIEXEC_NUMPROC_FLAG='-n'

mmiesch commented on August 20, 2024

Thanks @wdeconinck - the slurm output env variables are listed here - confirmed with the srun -n1 env command you suggested. I think the best bet would likely be SLURM_NTASKS.

I will edit the code as suggested and let you know if it works.

Thanks for the workaround suggestion but I think this would have the same problem as before - all serial jobs would fail because they would invoke the Parallel subclass of eckit::mpi::Comm, right?

mmiesch commented on August 20, 2024

Thanks again @wdeconinck - your suggested solution works. I could do a formal pull request but it might be easier for you to just apply the patch on your end.

Here is the code that works:

        else if (::getenv("OMPI_COMM_WORLD_SIZE") ||  // OpenMPI
                 ::getenv("ALPS_APP_PE") ||           // Cray PE
                 ::getenv("PMI_SIZE") ||              // Intel
                 ::getenv("SLURM_NTASKS")) {          // slurm srun
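For readers without the eckit source at hand, the decision that this condition takes part in can be sketched as a standalone function. This is a minimal sketch, not eckit's actual code: choose_comm_backend is a hypothetical name, and the sketch assumes that the ECKIT_MPI_FORCE override is checked before the launcher variables (as the leading else if in the patch suggests) and that "serial" is the fallback.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not eckit's API): decide which communicator
// implementation to use, based only on environment variables.
std::string choose_comm_backend() {
    // Assumption: an explicit override wins over auto-detection.
    if (const char* forced = std::getenv("ECKIT_MPI_FORCE")) {
        return forced;
    }
    // Variables named in the thread as set by MPI launchers.
    const char* launcher_vars[] = {
        "OMPI_COMM_WORLD_SIZE",  // OpenMPI
        "ALPS_APP_PE",           // Cray PE
        "PMI_SIZE",              // Intel
        "SLURM_NTASKS",          // slurm srun
    };
    for (const char* name : launcher_vars) {
        if (std::getenv(name) != nullptr) {
            return "parallel";
        }
    }
    // Assumption: with no launcher variable present, fall back to the serial stubs.
    return "serial";
}
```

Note that this style of detection only tests for the presence of a variable, not its value, so any environment in which SLURM_NTASKS is set selects the parallel path.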

wdeconinck commented on August 20, 2024

Thank you @mmiesch, the modification has been submitted upstream as a PR and should make its way in soon.

mmiesch commented on August 20, 2024

Great - Thanks again @wdeconinck for your help.

wdeconinck commented on August 20, 2024

@tlmquintino This issue can be closed now (I don't have permissions).
