mig-parted's Introduction

MIG Partition Editor for NVIDIA GPUs

MIG (short for Multi-Instance GPU) is a mode of operation in the newest generation of NVIDIA Ampere GPUs. It allows one to partition a GPU into a set of "MIG Devices", each of which appears to the software consuming them as a mini-GPU with a fixed partition of memory and a fixed partition of compute resources. Please refer to the MIG User Guide for a detailed explanation of MIG and the features it provides.

The MIG Partition Editor (nvidia-mig-parted) is a tool designed for system administrators to make working with MIG partitions easier.

It allows administrators to declaratively define a set of possible MIG configurations they would like applied to all GPUs on a node. At runtime, they then point nvidia-mig-parted at one of these configurations, and nvidia-mig-parted takes care of applying it. In this way, the same configuration file can be spread across all nodes in a cluster, and a runtime flag (or environment variable) can be used to decide which of these configurations to actually apply to a node at any given time.

As an example, consider the following configuration for an NVIDIA DGX-A100 node (found in the examples/config.yaml file of this repo):

version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false

  all-enabled:
    - devices: all
      mig-enabled: true
      mig-devices: {}

  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7

  all-2g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.10gb": 3

  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2

  all-balanced:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1

  custom-config:
    - devices: [0,1,2,3]
      mig-enabled: false
    - devices: [4]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
    - devices: [5]
      mig-enabled: true
      mig-devices:
        "2g.10gb": 3
    - devices: [6]
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
    - devices: [7]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1

Each of the sections under mig-configs is user-defined, with custom labels used to refer to them. For example, the all-disabled label refers to the MIG configuration that disables MIG for all GPUs on the node. Likewise, the all-1g.5gb label refers to the MIG configuration that slices all GPUs on the node into 1g.5gb devices. Finally, the custom-config label defines a completely custom configuration which disables MIG on the first 4 GPUs on the node, and applies a mix of MIG devices across the rest.
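
As a rough sanity check on a configuration like the one above, the compute slices requested by each mig-devices mix can be added up. The Python sketch below is illustrative only and is not part of nvidia-mig-parted. It assumes that the leading number of a profile name (the 3 in 3g.20gb) is its compute-slice count and that an A100 exposes 7 compute slices, as described in the MIG User Guide. The function name is hypothetical. Passing this check is necessary but not sufficient, because the driver also enforces memory and placement rules.

```python
import re

# Assumption: an A100 exposes 7 compute slices (see the MIG User Guide).
MAX_COMPUTE_SLICES = 7

def compute_slices_used(mig_devices):
    """Sum the compute slices requested by a mig-devices mapping.

    mig_devices maps profile names such as "1g.5gb" to instance counts.
    """
    total = 0
    for profile, count in mig_devices.items():
        # The number before "g" is taken to be the compute-slice count.
        match = re.match(r"(\d+)g\.\d+gb", profile)
        if match is None:
            raise ValueError(f"unrecognized MIG profile name: {profile!r}")
        total += int(match.group(1)) * count
    return total

# The all-balanced mix from examples/config.yaml.
all_balanced = {"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1}
used = compute_slices_used(all_balanced)
print(used, used <= MAX_COMPUTE_SLICES)
```

For all-balanced this adds up to 2x1 + 1x2 + 1x3 = 7 slices, which fills the GPU exactly.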

Using this tool, the following commands can be run to apply each of these configs in turn:

$ nvidia-mig-parted apply -f examples/config.yaml -c all-disabled
$ nvidia-mig-parted apply -f examples/config.yaml -c all-1g.5gb
$ nvidia-mig-parted apply -f examples/config.yaml -c all-2g.10gb
$ nvidia-mig-parted apply -f examples/config.yaml -c all-3g.20gb
$ nvidia-mig-parted apply -f examples/config.yaml -c all-balanced
$ nvidia-mig-parted apply -f examples/config.yaml -c custom-config

The currently applied configuration can then be looked up with:

$ nvidia-mig-parted export
version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices:
      1g.5gb: 2
      2g.10gb: 1
      3g.20gb: 1

And asserted with:

$ nvidia-mig-parted assert -f examples/config.yaml -c all-balanced
Selected MIG configuration currently applied

$ echo $?
0

$ nvidia-mig-parted assert -f examples/config.yaml -c all-1g.5gb
ERRO[0000] Assertion failure: selected configuration not currently applied

$ echo $?
1
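
Because assert reports its result through the exit status, it is easy to call from a script. The Python sketch below is one way to do so. The helper name is hypothetical, it assumes nvidia-mig-parted is on the PATH, and it relies only on the exit codes shown above (0 when the selected configuration is applied, non-zero otherwise). The run parameter exists purely so the function can be exercised without a GPU.

```python
import subprocess

def config_is_applied(config_file, label, run=subprocess.run):
    """Return True if `nvidia-mig-parted assert` exits with status 0.

    Any non-zero status is reported as False, so this does not
    distinguish a failed assertion from other errors.
    """
    result = run(
        ["nvidia-mig-parted", "assert", "-f", config_file, "-c", label],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A caller might use config_is_applied("examples/config.yaml", "all-balanced") to decide whether an apply step is needed at all.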

Note: The nvidia-mig-parted tool alone does not take care of making sure that your node is in a state where MIG mode changes and MIG device configurations will apply cleanly. Moreover, it does not ensure that MIG device configurations will persist across node reboots.

To help with this, a systemd service and a set of support scripts have been developed to wrap nvidia-mig-parted and provide these much desired features. Please see the README.md under deployments/systemd for more details.

Installing nvidia-mig-parted

At the moment, there is no common distribution platform for nvidia-mig-parted. However, we do build deb, rpm, and tarball packages and distribute them as assets with every release. Please see the releases page of the repository to download and install them.

To build from source, please follow one of the methods below.

Use docker with go install:

docker run \
    --rm \
    -v $(pwd):/dest \
    golang:1.20.1 \
    sh -c "
    go install github.com/NVIDIA/mig-parted/cmd/nvidia-mig-parted@latest
    mv /go/bin/nvidia-mig-parted /dest/nvidia-mig-parted
    "

Run go install directly (requires Go 1.16 or later):

GOBIN=$(pwd) go install github.com/NVIDIA/mig-parted/cmd/nvidia-mig-parted@latest

Clone the repo and build it:

git clone http://github.com/NVIDIA/mig-parted
cd mig-parted
go build ./cmd/nvidia-mig-parted

When followed exactly, any of these methods should generate a binary called nvidia-mig-parted in your current directory. Once this is done, it is advised that you move this binary to a directory on your PATH, so you can follow the commands below verbatim.

Quick Start

Before going into the details of every possible option for nvidia-mig-parted, it's useful to walk through a few examples of its most common usage. All commands below use the example configuration file found under examples/config.yaml of this repo.

Apply a specific MIG config from a configuration file

nvidia-mig-parted apply -f examples/config.yaml -c all-1g.5gb

Apply only the MIG mode settings of a config

nvidia-mig-parted apply --mode-only -f examples/config.yaml -c all-1g.5gb

Apply a MIG config with debug output

nvidia-mig-parted -d apply -f examples/config.yaml -c all-1g.5gb

Apply a one-off MIG config without a configuration file

cat <<EOF | nvidia-mig-parted apply -f -
version: v1
mig-configs:
  all-1g.5gb:
  - devices: all
    mig-enabled: true
    mig-devices:
      1g.5gb: 7
EOF

Apply a one-off MIG config to only change the MIG mode

cat <<EOF | nvidia-mig-parted apply --mode-only -f -
version: v1
mig-configs:
  whatever:
  - devices: all
    mig-enabled: true
    mig-devices: {}
EOF
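
Both one-off documents above can also be generated from a script rather than typed into a heredoc. The Python sketch below builds the YAML text for a single config that applies to all GPUs and can pipe it to nvidia-mig-parted apply -f - on standard input, mirroring the commands above. The function names are hypothetical, and the YAML is assembled by hand only to keep the example free of third-party dependencies.

```python
import subprocess

def one_off_config(label, mig_devices):
    """Build a v1 config document with one entry that targets all GPUs."""
    lines = [
        "version: v1",
        "mig-configs:",
        f"  {label}:",
        "  - devices: all",
        "    mig-enabled: true",
    ]
    if mig_devices:
        lines.append("    mig-devices:")
        for profile, count in mig_devices.items():
            # Quote profile names, as examples/config.yaml does.
            lines.append(f'      "{profile}": {count}')
    else:
        # An empty mapping enables MIG without creating any devices.
        lines.append("    mig-devices: {}")
    return "\n".join(lines) + "\n"

def apply_one_off(label, mig_devices):
    """Pipe the generated document to `nvidia-mig-parted apply -f -`."""
    subprocess.run(
        ["nvidia-mig-parted", "apply", "-f", "-"],
        input=one_off_config(label, mig_devices),
        text=True,
        check=True,
    )

# Print the document only; nothing is applied at import time.
print(one_off_config("all-1g.5gb", {"1g.5gb": 7}), end="")
```

Keeping document generation separate from the call to the tool makes it possible to inspect or log the exact YAML before anything is applied.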

Export the current MIG config

nvidia-mig-parted export

Assert a specific MIG configuration is currently applied

nvidia-mig-parted assert -f examples/config.yaml -c all-1g.5gb

Assert the MIG mode settings of a MIG configuration are currently applied

nvidia-mig-parted assert --mode-only -f examples/config.yaml -c all-1g.5gb

Assert a one-off MIG config without a configuration file

cat <<EOF | nvidia-mig-parted assert -f -
version: v1
mig-configs:
  all-1g.5gb:
  - devices: all
    mig-enabled: true
    mig-devices: 
      1g.5gb: 7
EOF

Assert the MIG mode setting of a one-off MIG config

cat <<EOF | nvidia-mig-parted assert --mode-only -f -
version: v1
mig-configs:
  whatever:
  - devices: all
    mig-enabled: true
    mig-devices: {}
EOF

mig-parted's Issues

Fail to install using docker

Environment: Ubuntu 20.04 with Go 1.20.6 from snap

Using docker with go install:

>     -v $(pwd):/dest \
>     golang:1.16.4 \
>     sh -c "
>     go install github.com/NVIDIA/mig-parted/cmd@latest
>     mv /go/bin/cmd /dest/nvidia-mig-parted
>     "

Unable to find image 'golang:1.16.4' locally
1.16.4: Pulling from library/golang
d960726af2be: Pull complete 
e8d62473a22d: Pull complete 
8962bc0fad55: Pull complete 
65d943ee54c1: Pull complete 
f2253e6fbefa: Pull complete 
6d7fa7c7d5d3: Pull complete 
e2e442f7f89f: Pull complete 
Digest: sha256:8a106c4b4005efb43c0ba4bb5763b84742c7e222bad5a8dff73cc9f7710c64ee
Status: Downloaded newer image for golang:1.16.4
go: downloading github.com/NVIDIA/mig-parted v0.5.3
go install github.com/NVIDIA/mig-parted/cmd@latest: module github.com/NVIDIA/mig-parted@latest found (v0.5.3), but does not contain package github.com/NVIDIA/mig-parted/cmd
mv: cannot stat '/go/bin/cmd': No such file or directory

gpu-operator: creating a MIG compute instance (CI) fails with Insufficient Resources

mig-config.yaml

    mig-configs:
      custom-config: 
        - devices: [0]
          mig-enabled: false
        - devices: [1]     
          mig-enabled: true   
          mig-devices:
            "7g.80gb": 1    
        - devices: [2]
          mig-enabled: true
          mig-devices:
            "2g.20gb": 3
        - devices: [3]      
          mig-enabled: true 
          mig-devices:
            "3g.40gb": 1
            "4g.40gb": 1    
        - devices: [4]      
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "4g.40gb": 1    
        - devices: [5]      
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "4g.40gb": 1
        - devices: [6]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "4g.40gb": 1 
        - devices: [7]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 1
            "2g.20gb": 1
            "4g.40gb": 1 

nvidia-smi mig -lcip -gi 0

+--------------------------------------------------------------------------------------+
| Compute instance profiles:                                                           |
| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |
|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |
|         ID                                                          CE    JPEG       |
|======================================================================================|
|   0      0       MIG 1c.7g.80gb       0      0/7           14        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 2c.7g.80gb       1      0/3           28        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 3c.7g.80gb       2      0/2           42        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 4c.7g.80gb       3      0/1           56        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 7g.80gb          4*     0/1           98        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   1      0       MIG 1c.7g.80gb       0      0/7           14        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   1      0       MIG 2c.7g.80gb       1      0/3           28        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   1      0       MIG 3c.7g.80gb       2      0/2           42        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   1      0       MIG 4c.7g.80gb       3      0/1           56        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   1      0       MIG 7g.80gb          4*     0/1           98        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+

nvidia-smi mig -lci -gi 0

+--------------------------------------------------------------------+
| Compute instances:                                                 |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                         |
|====================================================================|
|   0      0       MIG 7g.80gb          4         0          0:7     |
+--------------------------------------------------------------------+
|   1      0       MIG 7g.80gb          4         0          0:7     |
+--------------------------------------------------------------------+

nvidia-smi mig -cci 2 -gi 0

Unable to create a compute instance on GPU  0 GPU instance ID  0 using profile 2: Insufficient Resources
Failed to create compute instances: Insufficient Resources

I want to create a MIG 3c.7g.80gb compute instance, but the command reports Insufficient Resources. How can I solve this?

Installing `nvidia-mig-parted` fails.

Dear all,

thank you for your work!
We used the following installation instructions to build nvidia-mig-parted:

docker run \
    -v $(pwd):/dest \
    golang:1.15 \
    sh -c "
    GO111MODULE=off go get -u github.com/NVIDIA/mig-parted/cmd/nvidia-mig-parted
    GOBIN=/dest     go install github.com/NVIDIA/mig-parted/cmd/nvidia-mig-parted
    "

I think that exactly this statement worked fine a few weeks ago, but currently it fails with:

src/github.com/NVIDIA/mig-parted/cmd/util/util.go:100:18: undefined: os.ReadFile

Is this a known problem, or did I make a mistake? (I tested it in multiple locations; the error was always the os.ReadFile one shown above.)

Best regards,

Maik

Build fails with: cannot find package "github.com/NVIDIA/mig-parted/cmd/apply"

I am using Ubuntu 18.04, and when I try to build the mig-parted tool I get this error: cannot find package "github.com/NVIDIA/mig-parted/cmd/apply" in any of:

user@host-1:~/slurm-instal/mig-parted$ go build ./cmd/nvidia-mig-parted
cmd/nvidia-mig-parted/main.go:22:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/apply" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/apply (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/apply (from $GOPATH)
cmd/nvidia-mig-parted/main.go:23:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/assert" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/assert (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/assert (from $GOPATH)
cmd/nvidia-mig-parted/main.go:24:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/checkpoint" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/checkpoint (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/checkpoint (from $GOPATH)
cmd/nvidia-mig-parted/main.go:25:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/export" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/export (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/export (from $GOPATH)
cmd/nvidia-mig-parted/main.go:26:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/restore" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/restore (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/restore (from $GOPATH)
cmd/nvidia-mig-parted/main.go:27:2: cannot find package "github.com/NVIDIA/mig-parted/cmd/util" in any of:
/usr/src/github.com/NVIDIA/mig-parted/cmd/util (from $GOROOT)
/home/user/go/src/github.com/NVIDIA/mig-parted/cmd/util (from $GOPATH)
cmd/nvidia-mig-parted/main.go:28:2: cannot find package "github.com/sirupsen/logrus" in any of:
/usr/src/github.com/sirupsen/logrus (from $GOROOT)
/home/user/go/src/github.com/sirupsen/logrus (from $GOPATH)
cmd/nvidia-mig-parted/main.go:29:2: cannot find package "github.com/urfave/cli/v2" in any of:
/usr/src/github.com/urfave/cli/v2 (from $GOROOT)
/home/user/go/src/github.com/urfave/cli/v2 (from $GOPATH)

oneshot service causes boot failures due to lack of timeout

nvidia-mig-manager.service is Type=oneshot.
DefaultTimeoutStartSec is not applied to oneshot services, so the entire system can hang at boot while waiting for nvidia-mig-manager.service to complete.

Boot failure is worse than a failed / degraded service.

A TimeoutStartSec should be added to this to at least allow the system to boot in a degraded state (for debug / recovery without OOB BMC / IPMI / KVM).

The root cause may be #11, but adding a timeout will make this more resilient.

How to access a MIG Device ID programmatically

Hi @klueska, I am looking into how to assign a GPU that has been partitioned with MIG from inside a Python script in which I want to run a PyTorch model.

We typically do it this way in TorchServe. If an A100 GPU is partitioned into two MIG devices such as "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/0/0" and "MIG-GPU-63feeb45-94c6-b9cb-78ea-98e9b7a5be6b/1/0", what would be a good way to handle it? Is there any tool available that provides this information?

This MIG GPU ID is not available through the CUDA utilities in PyTorch.

I appreciate your thoughts.
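
One possible approach, offered as a sketch rather than an answer from the maintainers: nvidia-smi -L lists each MIG device together with its UUID, and the MIG User Guide describes selecting a MIG device for a CUDA process by placing that identifier in CUDA_VISIBLE_DEVICES before CUDA is initialized. The parsing below assumes the "(UUID: MIG-...)" form that nvidia-smi -L prints for MIG devices. All function names are hypothetical.

```python
import os
import re
import subprocess

# Assumption: `nvidia-smi -L` prints MIG devices as "... (UUID: MIG-<id>)".
_MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")

def parse_mig_uuids(listing):
    """Extract MIG device identifiers from `nvidia-smi -L` output text."""
    return _MIG_UUID.findall(listing)

def list_mig_uuids():
    """Run `nvidia-smi -L` and return the MIG identifiers it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)

def select_mig_device(uuid):
    """Restrict this process to one MIG device.

    Must run before CUDA is initialized, for example before the first
    torch.cuda call; the device then appears to PyTorch as cuda:0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = uuid
```

A worker process could call select_mig_device(list_mig_uuids()[0]) at startup and then load its model on cuda:0 as usual.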

Partitions aren't created, but getting "MIG configuration applied successfully" message

I have installed the mig-parted tool as root on a node. I am able to run the sample commands listed in the readme page, getting the "MIG configuration applied successfully" message after applying different configurations from the config.yaml file. However, the partitions do not seem to be created, as checked by both nvidia-smi and nvidia-mig-parted export (mig-devices returns "{}"). Do you have any guidance on what could be going on here?

MIG partitioning leading to nvidia_a100_3g.39gb instead of 3g.40gb partition for NVIDIA driver versions 535.x and 545.x

Hi,

I have a number of servers with 4 A100 GPUs each. I have MIG-partitioned each GPU using the 'all-balanced' profile and manage them through Slurm.

$ cat /etc/nvidia-mig-manager/config.yaml
...
  all-balanced:
...
    # H100-80GB, H800-80GB, A100-80GB, A800-80GB
    - device-filter: ["0x233110DE", "0x233010DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
      devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
...

With NVIDIA driver 495.x, I could partition them as follows without any issues.

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.40gb : 1

However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.39gb : 1

I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:

$ slurmd -G 

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: gres/gpu: _merge_system_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):24-31  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-efa8d929-9af6-5083-af99-f1ceefb8b29a
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):8-15  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-44b932cc-40b5-5e7b-b01b-7e342ecfcb64
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):56-63  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 UniqueId:MIG-e3ab25b5-7be9-5d4b-940d-63841fead660
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):40-47  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 UniqueId:MIG-60b3faac-01f1-5bc1-be5c-c53e2c4e0d82
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=31 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31 Cores=24-31 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=166 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 Cores=8-15 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=301 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap300,/dev/nvidia-caps/nvidia-cap301 Cores=56-63 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=436 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap435,/dev/nvidia-caps/nvidia-cap436 Cores=40-47 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=85 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 Cores=24-31 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=220 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 Cores=8-15 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=56-63 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=490 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap489,/dev/nvidia-caps/nvidia-cap490 Cores=40-47 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=94 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 Cores=24-31 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=229 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap228,/dev/nvidia-caps/nvidia-cap229 Cores=8-15 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=364 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap363,/dev/nvidia-caps/nvidia-cap364 Cores=56-63 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=499 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap498,/dev/nvidia-caps/nvidia-cap499 Cores=40-47 CoreCnt=64 Links=0,0,0,-1

I have tried using nvidia-mig-manager versions [0.5.3, 0.5.4.1 and 0.5.5] and I see the same behavior as long as the NVIDIA driver version is 535 or 545. I haven't tried 505, 515, 525.

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi
Wed Dec 13 00:32:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                   On |
| N/A   25C    P0    51W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                   On |
| N/A   25C    P0    48W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

=== w/ NVIDIA driver 545.x ===

$  nvidia-smi
Wed Dec 13 00:29:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:01:00.0 Off |                   On |
| N/A   26C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:41:00.0 Off |                   On |
| N/A   26C    P0              49W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:81:00.0 Off |                   On |
| N/A   27C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0              48W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Looking at the memory of the different partitions, the 10GB and 20GB partitions are the same regardless of the NVIDIA driver version, but the "40GB" partitions are slightly smaller with driver version 545.x (40192 MiB) than with driver version 495.x (40448 MiB):

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |

=== w/ NVIDIA driver 545.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |    37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |    37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |    37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |    37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |

This is perhaps the reason why the partition is reported as 3g.39gb instead of 3g.40gb. Since we already have lots of GPUs with 3g.40gb partitions and people are trained to use them, having to hack things by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience. So, we would appreciate any guidance in resolving this issue.
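The arithmetic behind that guess can be checked with a short sketch. It assumes the profile label is obtained by rounding the reported MiB value to the nearest whole GiB; that rounding rule is an assumption made for illustration, not something confirmed by the driver or mig-parted, and `rounded_gib` is a hypothetical helper.

```python
import math

MIB_PER_GIB = 1024


def rounded_gib(mib: int) -> int:
    """Round a MiB value to the nearest whole GiB (halves round up).

    Assumption: this mirrors how the memory part of the label is derived.
    """
    return math.floor(mib / MIB_PER_GIB + 0.5)


# Memory sizes reported by nvidia-smi for GI 2 under each driver version.
for driver, mib in (("495.x", 40448), ("545.x", 40192)):
    gib = mib / MIB_PER_GIB
    print(f"driver {driver}: {mib} MiB = {gib:.2f} GiB -> 3g.{rounded_gib(mib)}gb")
```

Under that assumption, 40448 MiB is exactly 39.5 GiB and rounds up to 40, while 40192 MiB is 39.25 GiB and rounds down to 39, which would match the change in the reported label.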

Thanks a lot.

Enable golangci-lint in repo

We should:

  • Add a golangci-lint configuration
  • Update the Makefile to trigger golangci-lint
  • Enable the golangci-lint checks in PRs

mmap error for most operations on Debian 10

Hi,

I tried to use mig-parted on a Debian 10 system with 6 A100 GPUs installed and backports kernel 5.10 (5.10.0-0.bpo.8-amd64 #1 SMP Debian 5.10.46-4~bpo10+1 (2021-08-07) x86_64 GNU/Linux) as well as kernel 4.19, and got the following error:

# nvidia-mig-parted -d assert --config-file /etc/nvidia-mig-manager/config.yaml --selected-config all-disabled
DEBU[0000] Parsing config file...                       
DEBU[0000] Selecting specific MIG config...             
DEBU[0000] Asserting MIG mode configuration...          
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20F110DE                          
DEBU[0000]     Asserting MIG mode: Disabled             
DEBU[0000] Error checking MIG capable: error opening bar0 MMIO resource: failed to open file for mmio: failed to mmap file: invalid argument
 
FATA[0000] Assertion failure: selected configuration not currently applied
# nvidia-mig-parted -d apply --config-file /etc/nvidia-mig-manager/config.yaml --selected-config all-disabled 
[...]
DEBU[0001] Applying MIG mode change...                  
DEBU[0001] Walking MigConfig for (devices=all)          
DEBU[0001]   GPU 0: 0x20F110DE                          
DEBU[0001]     MIG capable: true                        
DEBU[0001]     Current MIG mode: Disabled               
DEBU[0001]     Updating MIG mode: Disabled              
DEBU[0001]     Mode change pending: false               
DEBU[0001]   GPU 1: 0x20F110DE                          
DEBU[0001]     MIG capable: true                        
DEBU[0001]     Current MIG mode: Disabled               
DEBU[0001]     Updating MIG mode: Disabled              
DEBU[0001]     Mode change pending: false               
DEBU[0001]   GPU 2: 0x20F110DE                          
DEBU[0001]     MIG capable: true                        
DEBU[0001]     Current MIG mode: Disabled               
DEBU[0001]     Updating MIG mode: Disabled              
DEBU[0001]     Mode change pending: false               
DEBU[0001]   GPU 3: 0x20F110DE                          
DEBU[0001]     MIG capable: true                        
DEBU[0001]     Current MIG mode: Disabled               
DEBU[0001]     Updating MIG mode: Disabled              
DEBU[0001]     Mode change pending: false               
DEBU[0001]   GPU 4: 0x20F110DE                          
DEBU[0001]     MIG capable: true                        
DEBU[0001]     Current MIG mode: Enabled                
DEBU[0001]     Updating MIG mode: Disabled              
DEBU[0003]     Mode change pending: false               
DEBU[0003]   GPU 5: 0x20F110DE                          
DEBU[0003]     MIG capable: true                        
DEBU[0003]     Current MIG mode: Enabled                
DEBU[0003]     Updating MIG mode: Disabled              
DEBU[0014]     Mode change pending: false               
DEBU[0014] Checking current MIG device configuration... 
DEBU[0014] Walking MigConfig for (devices=all)          
DEBU[0014]   GPU 0: 0x20F110DE                          
DEBU[0014] Running pre-apply-config hook
[...]
DEBU[0014] Applying MIG device configuration...         
DEBU[0014] Walking MigConfig for (devices=all)          
DEBU[0014]   GPU 0: 0x20F110DE                          
DEBU[0014] Running apply-exit hook
[....]
FATA[0015] Error checking MIG capable: error opening bar0 MMIO resource: failed to open file for mmio: failed to mmap file: invalid argument

The MIG mode change (equivalent to nvidia-smi -mig 0) works fine, but the assertion always fails with this error, and setting up MIG instances doesn't work.
I tried to debug the mmap failure but couldn't find anything obvious.
I also used strace to show the call:

strace -e trace=%memory nvidia-mig-parted -d assert --config-file /etc/nvidia-mig-manager/vrvis.yaml --selected-config all-disabled
brk(NULL)                               = 0x1255000
mmap(NULL, 25926, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f904a11f000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f904a11d000
mmap(NULL, 132288, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f904a0fc000
mmap(0x7f904a102000, 61440, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f904a102000
mmap(0x7f904a111000, 24576, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7f904a111000
mmap(0x7f904a117000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7f904a117000
mmap(0x7f904a119000, 13504, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f904a119000
mmap(NULL, 16656, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f904a0f7000
mmap(0x7f904a0f8000, 4096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f904a0f8000
mmap(0x7f904a0f9000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f904a0f9000
mmap(0x7f904a0fa000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f904a0fa000
mmap(NULL, 1837056, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9049f36000
mprotect(0x7f9049f58000, 1658880, PROT_NONE) = 0
mmap(0x7f9049f58000, 1343488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7f9049f58000
mmap(0x7f904a0a0000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16a000) = 0x7f904a0a0000
mmap(0x7f904a0ed000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f904a0ed000
mmap(0x7f904a0f3000, 14336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f904a0f3000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9049f33000
mprotect(0x7f904a0ed000, 16384, PROT_READ) = 0
mprotect(0x7f904a0fa000, 4096, PROT_READ) = 0
mprotect(0x7f904a117000, 4096, PROT_READ) = 0
mprotect(0x878000, 4096, PROT_READ)     = 0
mprotect(0x7f904a14d000, 4096, PROT_READ) = 0
munmap(0x7f904a11f000, 25926)           = 0
brk(NULL)                               = 0x1255000
brk(0x1276000)                          = 0x1276000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9049ef3000
mmap(NULL, 131072, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9049ed3000
mmap(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9049dd3000
mmap(NULL, 8388608, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90495d3000
mmap(NULL, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90455d3000
mmap(NULL, 536870912, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90255d3000
mmap(0xc000000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(0xc000000000, 67108864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 33554432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90235d3000
mmap(NULL, 2165768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90233c2000
mmap(0x7f9049ed3000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9049ed3000
mmap(0x7f9049e53000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9049e53000
mmap(0x7f90499d9000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f90499d9000
mmap(0x7f9047603000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9047603000
mmap(0x7f9035753000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9035753000
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90232c2000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90232b2000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90232a2000
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9022aa1000
mprotect(0x7f9022aa2000, 8388608, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f90222a0000
mprotect(0x7f90222a1000, 8388608, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9021a9f000
mprotect(0x7f9021aa0000, 8388608, PROT_READ|PROT_WRITE) = 0
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f9020a5d000
mprotect(0x7f9020a5e000, 8388608, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9020a1d000
DEBU[0000] Parsing config file...                       
DEBU[0000] Selecting specific MIG config...             
DEBU[0000] Asserting MIG mode configuration...          
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
mmap(NULL, 1439992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f90208bd000
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20F110DE                          
DEBU[0000]     Asserting MIG mode: Disabled             
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f902086d000
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
--- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=6578, si_uid=0} ---
mmap(NULL, 16777216, PROT_READ, MAP_SHARED, 3, 0) = -1 EINVAL (Invalid argument)
DEBU[0000] Error checking MIG capable: error opening bar0 MMIO resource: failed to open file for mmio: failed to mmap file: invalid argument
 
FATA[0000] Assertion failure: selected configuration not currently applied 
+++ exited with 1 +++
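To check whether the failing call can be reproduced outside the tool, the final mmap from the trace (16777216 bytes, PROT_READ, MAP_SHARED, offset 0) can be repeated by hand. The sketch below is only a diagnostic aid under stated assumptions: the file is opened read-only (the trace filters out open calls, so the real flags are unknown), the BAR0 resource is assumed to be the standard sysfs `resource0` file, and the PCI address is a placeholder. `try_mmap_readonly` is a hypothetical helper, not part of mig-parted.

```python
import mmap
import os


def try_mmap_readonly(path: str, length: int) -> str:
    """Open `path` read-only and map `length` bytes with PROT_READ and MAP_SHARED.

    Returns "ok" on success, otherwise a short description of the failure.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return f"open failed: {exc}"
    try:
        mapping = mmap.mmap(fd, length, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
    except (OSError, ValueError) as exc:
        return f"mmap failed: {exc}"
    finally:
        # Python duplicates the descriptor for the mapping, so closing is safe.
        os.close(fd)
    mapping.close()
    return "ok"


if __name__ == "__main__":
    # Placeholder PCI address; substitute the GPU's address from `lspci -D`.
    bar0 = "/sys/bus/pci/devices/0000:01:00.0/resource0"
    print(try_mmap_readonly(bar0, 16 * 1024 * 1024))
```

If this also reports an invalid-argument error when run as root, the problem probably lies in the kernel or platform configuration rather than in mig-parted itself.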

Thanks for your help,
Valentin

Missing feature: show the spec of a MIG profile

It would be nice if we could use the tool to query the contents of a configuration file (the second command is the most important):

# nvidia-mig-parted show -f examples/config.yaml
version: v1
mig-configs:
  - all-disabled
  - all-enabled
  -  ...

# nvidia-mig-parted show -f examples/config.yaml -c all-balanced
version: v1
all-balanced:
  - devices: all
    mig-enabled: true
    mig-devices:
      "1g.5gb": 2
      "2g.10gb": 1
      "3g.20gb": 1

Along with nvidia-mig-parted export, this would allow a script to detect whether applying a profile would change the state of the mig-enabled property.
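A minimal sketch of the comparison such a script might perform is shown here. It assumes both the selected profile (from the proposed `show` command) and the current state (from `export`) have already been parsed into Python lists of entries shaped like the config file, with `devices` either `"all"` or a list of indices. The function names and the GPU count are hypothetical.

```python
from typing import Dict, List

Spec = Dict[str, object]


def mig_mode_by_gpu(specs: List[Spec], gpu_count: int) -> Dict[int, bool]:
    """Expand a list of config entries into {gpu index: mig-enabled}."""
    modes: Dict[int, bool] = {}
    for spec in specs:
        devices = spec["devices"]
        indices = range(gpu_count) if devices == "all" else devices
        for index in indices:
            modes[index] = bool(spec["mig-enabled"])
    return modes


def gpus_changing_mode(selected: List[Spec], exported: List[Spec], gpu_count: int) -> List[int]:
    """Return the GPU indices whose mig-enabled value would change.

    A GPU missing from the exported state is treated as changing.
    """
    wanted = mig_mode_by_gpu(selected, gpu_count)
    current = mig_mode_by_gpu(exported, gpu_count)
    return sorted(i for i, enabled in wanted.items() if current.get(i) != enabled)


# Example data: the all-balanced profile against a made-up 4-GPU state.
all_balanced = [{
    "devices": "all",
    "mig-enabled": True,
    "mig-devices": {"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1},
}]
current_state = [
    {"devices": [0, 1], "mig-enabled": False},
    {"devices": [2, 3], "mig-enabled": True, "mig-devices": {}},
]
print(gpus_changing_mode(all_balanced, current_state, gpu_count=4))  # prints [0, 1]
```

An empty result would mean the profile can be applied without toggling MIG mode on any GPU.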

Issue with systemd-based deployment

Hello,

We're using the recommended systemd manifest that's present as part of this repository. Upon close inspection, it seems like the ExecStart invocation contains a "-" prefix. This makes it so that even if the command fails, the systemd service will still say that it's active. Is there a reason for this? I think in almost all cases, it makes sense to make the systemd service fail if the underlying command fails. What do you think?

Thanks in advance.

Issue when installing the systemd install.sh script

Hello Support,
I am having some issues with the systemd installation script. When I run the install.sh script, I get this error:

(base) root@mpigpu06dal:/opt/mig-parted/deployments/systemd# ./install.sh
Unable to find image 'golang:1.16.4' locally
1.16.4: Pulling from library/golang
d960726af2be: Pull complete
e8d62473a22d: Pull complete
8962bc0fad55: Pull complete
65d943ee54c1: Pull complete
f2253e6fbefa: Pull complete
6d7fa7c7d5d3: Pull complete
e2e442f7f89f: Pull complete
Digest: sha256:8a106c4b4005efb43c0ba4bb5763b84742c7e222bad5a8dff73cc9f7710c64ee
Status: Downloaded newer image for golang:1.16.4
go: downloading github.com/NVIDIA/mig-parted v0.5.0
go install github.com/NVIDIA/mig-parted/cmd@latest: github.com/NVIDIA/mig-parted@v0.5.0: verifying module: github.com/NVIDIA/mig-parted@v0.5.0: Get "https://sum.golang.org/lookup/github.com/!n!v!i!d!i!a/mig-parted@v0.5.0": read tcp 172.17.0.2:53664->34.64.4.81:443: read: connection reset by peer
mv: cannot stat '/go/bin/cmd': No such file or directory

Where is /go/bin/cmd?

Rename path to container Makefile

Instead of deployments/gpu-operator/Makefile, it would be better to use deployments/container/Makefile.

Note that this may require a generalization since the deployment is currently only consumed from the GPU Operator.

How to enable MIG on a single GPU

My environment has 8 A100 GPUs, and I'm trying to enable MIG on GPU number 7, but it doesn't work.

cat <<EOF | nvidia-mig-parted apply -f -
version: v1
mig-configs:
  all-1g.5gb:
  - devices: 7
    mig-enabled: true
    mig-devices:
      1g.5gb: 7
EOF

Error:

FATA[0000] Error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: invalid string input for 'devices': 7

What should I do? Does mig-parted only support operations on all GPUs?
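For comparison, the custom-config example in the README writes `devices` either as `all` or as a list such as `[0,1,2,3]` or `[4]`, so `devices: [7]` is the form that matches it. The sketch below illustrates that distinction only; it is a hypothetical check written for this example, not mig-parted's actual parser, and the accepted shapes are inferred from the README example and the error message above.

```python
from typing import List, Union


def check_devices(value: object) -> Union[str, List[int]]:
    """Accept "all" or a list of GPU indices; reject anything else.

    Assumption: these are the only two shapes the config format allows.
    """
    if value == "all":
        return "all"
    if isinstance(value, list) and all(
        isinstance(i, int) and not isinstance(i, bool) for i in value
    ):
        return value
    raise ValueError(
        f"invalid input for 'devices': {value!r} (use 'all' or a list such as [7])"
    )


print(check_devices("all"))  # the form used by the all-* profiles
print(check_devices([7]))    # a single GPU written as a one-element list
```

Calling `check_devices(7)` with a bare integer raises a `ValueError` in this sketch, mirroring the shape of the parse error reported above.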

Support ARM with pre-built packages

Hello there,

I would like to use mig-parted on a GraceHopper ARM64 system. Unfortunately, there is no pre-built package that can be installed out of the box. I built the ubuntu20.04 package myself, and it appears to work as expected. Having packages that can simply be downloaded would be much easier.

Regards
Sebastian

Does this work with vGPU?

We would like to use this project to configure MIG backed vGPUs on A100s in an OpenStack cloud. We are using vGPU host driver version 525.60.12. As I understand it, I have to enable SR-IOV on the A100s in order to expose the vGPUs to OpenStack / KVM using the /usr/lib/nvidia/sriov-manage included with the vGPU driver. However, once I do that, mig-parted starts acting up:

# nvidia-mig-parted export
FATA[0000] Error checking MIG capable: error getting device handle: Invalid Argument 

Is this scenario not supported or am I doing something wrong?
