pegasus-bridle


This project contains scripts that ease training models on the DFKI GPU cluster.

Why should I use this setup?

  • One-time setup of the execution environment.
  • Persistent cache. This is useful, for instance, when working with Hugging Face to cache models and dataset preprocessing steps.
  • If an environment variables file .env is found in the current working directory, all contained variables are exported automatically and are available inside the Slurm job.

💥 IMPORTANT 💥

This approach requires some manual housekeeping. Since the cache is persisted (by default to /netscratch/$USER/.cache_slurm), that needs to be cleaned up from time to time. It is also recommended to remove Conda environments when they are not needed anymore.
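
That housekeeping could look roughly like the following sketch (the cache path mirrors the default above, but the 30-day threshold and the find-based cleanup are assumptions; a throwaway directory is used here so the commands can be tried safely):

```shell
#!/usr/bin/env bash
# Housekeeping sketch: inspect the persisted cache and delete stale entries.
# For safe illustration this runs against a throwaway directory; on the
# cluster you would set CACHE_DIR=/netscratch/$USER/.cache_slurm instead.
set -euo pipefail

CACHE_DIR=$(mktemp -d)
touch -d "40 days ago" "$CACHE_DIR/stale.bin"   # mock an old cache entry
touch "$CACHE_DIR/fresh.bin"                    # mock a recent one

# 1. See how much space the cache occupies.
du -sh "$CACHE_DIR"

# 2. List files untouched for more than 30 days (inspect before deleting).
find "$CACHE_DIR" -type f -mtime +30 -print

# 3. Delete them once you are sure.
find "$CACHE_DIR" -type f -mtime +30 -delete
```

Unused conda environments can similarly be dropped with conda env remove -n {env}.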

Overview

To train models at the cluster, we first need to set up a respective python environment. Then, we can call a wrapper script that will start a Slurm job with a selected Enroot image and execute the command we passed to it within the job. In the following, this is described in detail.

Setup the working environment

  1. Install Miniconda
    1. Download the miniconda setup script using the following command:
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    2. Install Miniconda by running the setup script: bash ./miniconda.sh (no need to make it executable when invoking it via bash)
      IMPORTANT: It is recommended to use /netscratch/$USER/miniconda3 as install location.
      Note: If you want to choose another install directory, just adapt the respective environment variable HOST_CONDA_ENVS_DIR in the .env file (see below).
    3. Switch to the conda-forge channel. Because of license restrictions, we have to use conda-forge and disable the default conda channel.
      1. Add conda-forge as the highest-priority channel: conda config --add channels conda-forge
      2. Disable the default conda channel: conda config --remove channels defaults
  2. Setup a conda environment
    1. Create a conda environment, e.g. using: conda create -n {env} python=3.9 (replace {env} with a name of your choice)
    2. Either start a screen session or make sure you are in bash (type bash in the terminal) so that the conda commands are available.
    3. Activate the environment: conda activate {env}
    4. Install any required python packages. We recommend using the PyPI cache installed on the cluster, e.g.:
      pip install --no-cache --index-url http://pypi-cache/index --trusted-host pypi-cache <package>
      
  3. Get this code and cd into it:
    git clone https://github.com/DFKI-NLP/pegasus-bridle.git && cd pegasus-bridle
  4. Prepare the Slurm setup environment variable file
    1. Copy the example file: cp .env.example .env
    2. Adapt .env and ensure that the respective paths exist at the host and create them if necessary (especially for HOST_CACHEDIR)
    3. Make sure the image you are using contains a conda installation.
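
Step 4.2 can be sketched as follows. This is a minimal illustration: the throwaway .env written here stands in for your real one, while HOST_CACHEDIR and HOST_CONDA_ENVS_DIR are the variable names used by .env.example:

```shell
#!/usr/bin/env bash
# Sketch: create any host-side paths referenced in .env before first use.
# A throwaway .env is generated so the sketch runs anywhere; on the cluster
# you would source your actual .env instead.
set -euo pipefail

workdir=$(mktemp -d)
cat > "$workdir/.env" <<EOF
HOST_CACHEDIR=$workdir/.cache_slurm
HOST_CONDA_ENVS_DIR=$workdir/miniconda3
EOF

# set -a exports every variable assigned while sourcing the file,
# then mkdir -p creates any directories that do not exist yet.
set -a; . "$workdir/.env"; set +a
mkdir -p "$HOST_CACHEDIR" "$HOST_CONDA_ENVS_DIR"
echo "cache dir ready: $HOST_CACHEDIR"
```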

Note: It is also possible to have multiple Slurm setup environment variable files, e.g. one for each of your deep learning projects. In this case, put the content of the .env file into a file .pegasus-bridle.env that should be located in the directory from where you start the wrapper script (e.g. your project directory). The wrapper script will automatically detect this file and use it instead of the .env file in the pegasus-bridle directory.
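
The lookup order described in this note can be illustrated with a small shell sketch (this mirrors the described behaviour, not the actual wrapper.sh code; BRIDLE_DIR is a made-up stand-in for the pegasus-bridle checkout directory):

```shell
# Sketch of the env-file lookup order: a project-local .pegasus-bridle.env
# takes precedence over the .env in the pegasus-bridle directory.
pick_env_file() {
  if [ -f "./.pegasus-bridle.env" ]; then
    echo "./.pegasus-bridle.env"   # project-local file wins
  else
    echo "$BRIDLE_DIR/.env"        # fall back to the repo's own .env
  fi
}

BRIDLE_DIR=$(mktemp -d)            # mock pegasus-bridle checkout
project=$(mktemp -d)               # mock project directory
cd "$project"
echo "without local file: $(pick_env_file)"
touch .pegasus-bridle.env
echo "with local file:    $(pick_env_file)"
```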

Executing the code

  1. Activate the conda environment at the host: conda activate {env}.
    Note: This is just required to pass the conda environment name to the wrapper script. You can also set a fixed name by directly overwriting the environment variable CONDA_ENV in the .env file (see above).
  2. Run wrapper.sh from anywhere, e.g. your project directory, and pass the command to execute as its arguments:
    bash path/to/pegasus-bridle/wrapper.sh command with arguments
    
    Example Usage:
    bash /home/$USER/projects/pegasus-bridle/wrapper.sh python src/train.py +trainer.fast_dev_run=true
    

Notes:

  • If an environment variables file .env is found in the current working directory (this is not the .env file you have created for the Slurm setup), all contained variables are exported automatically and are available inside the Slurm job.
  • For more details about the Slurm cluster, refer to the DFKI cluster documentation.
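
The auto-export behaviour from the first note can be illustrated like this (a simplified stand-in for what the wrapper does; WANDB_MODE and MY_FLAG are made-up example variables):

```shell
# Illustration: every VAR=value line in a .env found in the working
# directory becomes an environment variable visible inside the job.
workdir=$(mktemp -d)
cd "$workdir"
printf 'WANDB_MODE=offline\nMY_FLAG=1\n' > .env

# set -a marks every variable assigned while sourcing for export,
# which makes the file's contents visible to child processes.
set -a
. ./.env
set +a

echo "WANDB_MODE=$WANDB_MODE"
```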

For interactive mode

  1. Run bash path/to/interactive.sh from your project directory
  2. [OPTIONAL] Activate the conda environment inside the Slurm job:
    1. Execute source /opt/conda/bin/activate
    2. Activate conda environment: conda activate {env}

Note: This uses the same environment variables as the wrapper.sh. You may modify them before starting an interactive session, especially variables related to resource allocation.


pegasus-bridle's Issues

Workaround/Update for setting up Conda env on the container

The current .env file requires conda to be installed in the container before a custom environment can be used. This can be bypassed by mounting the entire conda directory on netscratch into the container, rather than just the env folder.

Suggested fix to .env.example file:

# change this if you have installed miniconda to another location
HOST_CONDA_ENVS_DIR=/netscratch/$USER/miniconda3/
# change this if you use another Enroot image with a different conda location
CONTAINER_CONDA_ENVS_DIR=/opt/conda/

By mounting the entire conda folder, the initial source /opt/conda/bin/activate in activate_and_execute.sh cannot fail, allowing the custom environment to be activated when needed.
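
A small pre-flight check along these lines could catch a broken mount source before job submission (the check itself is a suggestion, run here against a mock directory; HOST_CONDA_ENVS_DIR is the variable from the snippet above):

```shell
# Sketch: verify that the host conda directory from .env exists and
# contains bin/activate, so that mounting it over /opt/conda cannot leave
# the container without an activate script. A mock directory stands in
# for the real /netscratch install here.
HOST_CONDA_ENVS_DIR="${HOST_CONDA_ENVS_DIR:-$(mktemp -d)/miniconda3}"
mkdir -p "$HOST_CONDA_ENVS_DIR/bin"
touch "$HOST_CONDA_ENVS_DIR/bin/activate"   # mock of the real script

if [ -f "$HOST_CONDA_ENVS_DIR/bin/activate" ]; then
  echo "ok: $HOST_CONDA_ENVS_DIR can back the /opt/conda mount"
else
  echo "missing activate script under $HOST_CONDA_ENVS_DIR" >&2
fi
```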

adjust step "Prepare the Slurm setup environment variable file" in the readme

We should not recommend creating the .env file in the pegasus-bridle directory, because this is error-prone when executing an example command from that directory. Instead, we should just mention that .pegasus-bridle.env (also a copy of .env.example) needs to be placed in the working directory from which the final commands are executed.

Environment vars file per project

Currently it is necessary to adjust the .env file in the pegasus-bridle directory for proper resource management. This is annoying if you have, e.g., two setups/projects that use pegasus-bridle but with different resource requirements: you then need to change the .env file back and forth whenever you switch setups.

Idea: Iff a .pegasus-bridle.env is available in the current execution directory, load that one instead of the .env from the pegasus-bridle directory.

Update paths

https://pegasus.dfki.de/posts/maintenance_23_24/ states that some directory locations have changed, e.g. /netscratch/enroot has moved to /enroot. Paths in the scripts should be updated according to the status update.

Main changes probably:

  • update /netscratch/enroot to /enroot (mostly used as an srun argument, e.g. --container-image=/enroot/...)
  • add /fscratch as an image mount, i.e. something like --container-mounts=/netscratch:/netscratch,/fscratch/$USER:/fscratch/$USER,/ds:/ds,...
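
Taken together, the two changes might look like this in the srun invocation (a non-runnable fragment; the image name and trailing options are placeholders, not values from the actual scripts):

```shell
# Hypothetical srun call after the path update; <image> and "..." are
# placeholders for the image name and the remaining wrapper options.
srun \
  --container-image=/enroot/<image>.sqsh \
  --container-mounts=/netscratch:/netscratch,/fscratch/$USER:/fscratch/$USER,/ds:/ds \
  ...
```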
