
nebuly-ai / nos


Module to automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elastic quotas. Effortless optimization at its finest!

Home Page: https://www.nebuly.com/

License: Apache License 2.0

Languages: Go 95.53%, Makefile 1.81%, Smarty 1.46%, Dockerfile 1.02%, Shell 0.18%
Topics: gpu, kubernetes, optimization

nos's Introduction

Nebuly AI cover logo

Hi all, here is Nebuly!

Join our new community on Discord ✨ for a chat about AI optimization, or, if you want to know more, feel free to connect with us on LinkedIn.

nos's People

Contributors

5cat, dependabot[bot], emilecourthoud, nickpetrovic, telemaco019, windowsxp-beta


nos's Issues

Partitioner renders malformed device-plugin ConfigMap value which breaks GFD, causing Pods to be Pending forever

In internal/partitioning/mps/partitioner.go, the ToPluginConfig function uses the Config struct from the github.com/NVIDIA/k8s-device-plugin/api/config/v1 package. This struct nests other structs by value rather than by pointer, which causes the YAML/JSON Marshal functions to render "empty" structs as empty maps/objects instead of omitting them. This results in the following config value (note the timeSlicing field):

flags:
  failOnInitError: null
  gdsEnabled: null
  migStrategy: none
  mofedEnabled: null
resources:
  gpus: null
sharing:
  mps:
    failRequestsGreaterThanOne: true
    resources:
    - devices:
      - "0"
      memoryGB: 10
      name: nvidia.com/gpu
      rename: gpu-10gb
      replicas: 2
  timeSlicing: {}
version: v1

This behavior is explained here.

There is a custom Unmarshal function that is executed when the sharing.timeSlicing field exists in the raw config, but it returns an error when the field is empty, which is exactly what we see in the config example above. See the code here:

	resources, exists := ts["resources"]
	if !exists {
		return fmt.Errorf("no resources specified")
	}

GFD uses this package to read the device-plugin config created by the partitioner. When a new partitioning config is applied, the empty timeSlicing field causes the code above to crash the GFD container with a no resources specified error, until timeSlicing: {} is removed from the ConfigMap, which resolves the error.

I think it makes sense to fix this in nebuly-ai/k8s-device-plugin by removing these checks and forking GFD to use that version, as well as changing the structs to use pointers when nesting other structs so that proper YAML is rendered.
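For illustration, here is a minimal, self-contained sketch (hypothetical type names, not the actual device-plugin structs; encoding/json is used as a stand-in for the YAML/JSON marshalling described above) showing why a nested struct held by value renders as an empty object while a nil pointer is omitted:

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical structs mirroring the shape of the problem; these are NOT the
// actual k8s-device-plugin types.
type TimeSlicing struct {
	Resources []string `json:"resources,omitempty"`
}

type SharingByValue struct {
	// Nested by value: `omitempty` never drops a zero-valued struct,
	// so this always marshals as "timeSlicing":{}.
	TimeSlicing TimeSlicing `json:"timeSlicing,omitempty"`
}

type SharingByPointer struct {
	// Nested by pointer: a nil pointer is omitted entirely.
	TimeSlicing *TimeSlicing `json:"timeSlicing,omitempty"`
}

func main() {
	v, _ := json.Marshal(SharingByValue{})
	p, _ := json.Marshal(SharingByPointer{})
	fmt.Println(string(v)) // {"timeSlicing":{}}
	fmt.Println(string(p)) // {}
}

With pointer fields (or a marshaller aware of zero values), the empty timeSlicing block would simply not be emitted, and the custom Unmarshal check quoted above would never be triggered.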

Metrics-exporter setup; How to go about it?

I came across the metrics exporter, but I am not able to set it up.
The errors are:

{"level":"info","ts":1679291005.7844253,"msg":"reading metrics file","metricsFile":""}
{"level":"error","ts":1679291005.7844558,"msg":"failed to read metrics file","error":"open : no such file or directory","stacktrace":"main.main\n\t/workspace/cmd/metricsexporter/metricsexporter.go:62\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Can someone please point me to the right way to set this up? We need per-Pod GPU utilization metrics.

resource request key format

To request a GPU slice with 10 GB of memory, the following key/value is used:

resources:
  limits:
    nvidia.com/gpu-10gb: 1

Would it make more sense to do the following instead?

resources:
  limits:
    nvidia.com/gpu-memory: "10Gi"
    nvidia.com/gpu: 1

Handle GPU partitioning mode changes on the same Node (MIG<>MPS)

Problem description

When changing the partitioning mode of a node from MPS to MIG, the nvidia-device-plugin crashes, and therefore any new MIG device created by nos is never exposed to k8s as a resource.

How to reproduce

  1. Enable MPS partitioning on a node (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mps")
  2. Create a Pod requesting MPS resources (for instance nvidia.com/gpu-10gb)
  3. After the requested MPS resources are created and the Pod is scheduled on the node, delete the Pod and change the node's GPU partitioning mode to MIG (kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mig")
  4. Create a Pod requesting MIG resources (for instance nvidia.com/mig-1g.10gb)

Expected behaviour

After step 4, the MIG resources are created automatically and the Pod is scheduled on the node

Actual behaviour

After step 4, the MIG devices are created on the GPU, however the nvidia-device-plugin Pod crashes with error Cannot find configuration named <config-name>, where <config-name> is the name of the configuration set by nos during step 2.

pod stuck pending at resource overuse

Hi,

I am allocating only 1 GB of the 24 GB of available memory shown in my node's labels to the GPU operator. I also have another GPU device plugin (the default one) in my cluster, but I have set up the necessary affinity configuration to prevent both from running. Basically, my Pod (the sleep Pod shared in the documentation) gets stuck in Pending with the reason resource overuse and never gets scheduled. The MPS server occupies even less than 1 GB on my GPU and appears to be running in the nvidia-smi output.

I have followed the steps in the docs about user ID 1000 and the necessary gpu-operator configuration (MIG strategy mixed, etc.).

Any help would be much appreciated.

NAMESPACE                NAME                                                          READY   STATUS      RESTARTS      AGE
calico-apiserver         calico-apiserver-6dd8b8765c-7nm86                             1/1     Running     0             23h
calico-apiserver         calico-apiserver-6dd8b8765c-fp6bx                             1/1     Running     0             23h
calico-system            calico-kube-controllers-5c8ddb5dcf-tv4fw                      1/1     Running     0             23h
calico-system            calico-node-hzxml                                             1/1     Running     0             23h
calico-system            calico-typha-d6688954-g547t                                   1/1     Running     0             23h
calico-system            csi-node-driver-4qfps                                         2/2     Running     0             23h
default                  gpu-feature-discovery-nqrtb                                   1/1     Running     0             85m
default                  gpu-operator-787cd6f58-xn68k                                  1/1     Running     0             85m
default                  gpu-pod                                                       0/1     Completed   0             3h35m
default                  mps-partitioning-example                                      0/1     Pending     0             3m16s
default                  nvidia-container-toolkit-daemonset-dj7xv                      1/1     Running     0             85m
default                  nvidia-cuda-validator-4pmjv                                   0/1     Completed   0             56m
default                  nvidia-dcgm-exporter-pwfwb                                    1/1     Running     0             85m
default                  nvidia-device-plugin-daemonset-7p4b7                          1/1     Running     0             85m
default                  nvidia-operator-validator-fr897                               1/1     Running     0             85m
default                  release-name-node-feature-discovery-gc-5cbdb95596-9p5bn       1/1     Running     0             88m
default                  release-name-node-feature-discovery-master-788d855b45-fsz56   1/1     Running     0             88m
default                  release-name-node-feature-discovery-worker-dgcn5              1/1     Running     0             39m
kube-system              coredns-5dd5756b68-tgdgf                                      1/1     Running     0             23h
kube-system              coredns-5dd5756b68-wlxq2                                      1/1     Running     0             23h
kube-system              etcd-selin-csl                                                1/1     Running     1553          23h
kube-system              kube-apiserver-selin-csl                                      1/1     Running     30            23h
kube-system              kube-controller-manager-selin-csl                             1/1     Running     0             23h
kube-system              kube-proxy-lslfg                                              1/1     Running     0             23h
kube-system              kube-scheduler-selin-csl                                      1/1     Running     35            23h
nebuly-nvidia            nvidia-device-plugin-1698187396-r7tpf                         3/3     Running     0             32m
node-feature-discovery   nfd-6q9tl                                                     2/2     Running     0             14m
node-feature-discovery   nfd-master-85f4bc48cf-dlw4q                                   1/1     Running     0             42m
node-feature-discovery   nfd-worker-wln6p                                              1/1     Running     2 (42m ago)   42m
tigera-operator          tigera-operator-94d7f7696-ff7kf                               1/1     Running     0             23h
selin@selin-csl:~$ kubectl describe node selin-csl
Name:               selin-csl
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-cstate.enabled=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=6
                    feature.node.kubernetes.io/cpu-model.id=85
                    feature.node.kubernetes.io/cpu-model.vendor_id=Intel
                    feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
                    feature.node.kubernetes.io/cpu-pstate.status=active
                    feature.node.kubernetes.io/cpu-pstate.turbo=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=6.2.0-34-generic
                    feature.node.kubernetes.io/kernel-version.major=6
                    feature.node.kubernetes.io/kernel-version.minor=2
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-0300_1002.present=true
                    feature.node.kubernetes.io/pci-0300_10de.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=selin-csl
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nos.nebuly.com/gpu-partitioning=mps
                    nvidia.com/cuda.driver.major=535
                    nvidia.com/cuda.driver.minor=113
                    nvidia.com/cuda.driver.rev=01
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=2
                    nvidia.com/gfd.timestamp=1698184228
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=Precision-5820-Tower
                    nvidia.com/gpu.memory=24576
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-TITAN-RTX
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=mixed

doc: wrong make targets

While building nos I found the docs to be a bit outdated.
The make targets in this part should all begin with docker-.

nvidia-cuda-mps-server consistently hangs at the "creating worker thread" log

I am using nvidia-cuda-mps-server for GPU virtualization (GPU is V100), and the plugin comes from Nebuly-NVIDIA. The CUDA client is k8s.gcr.io/cuda-vector-add:v0.1. After the CUDA client starts as a container, the nvidia-cuda-mps-server process consistently hangs at the "creating worker thread" log, and the client does not print any logs. Where could the problem be? Is it possible that my GPU card does not support MPS, or is it an issue with the client?

Steps to reproduce the issue

  1. Install the Nebuly-NVIDIA plugin: https://github.com/nebuly-ai/k8s-device-plugin
  2. Start a Pod whose image is "k8s.gcr.io/cuda-vector-add:v0.1"
  3. The client does not print any logs, and nvidia-cuda-mps-server hangs at the "creating worker thread" log
(Screenshots attached: WechatIMG284, WechatIMG285, WechatIMG286, WechatIMG288.)

Nebuly k8s-device-plugin not starting on GKE

Hi, I'm trying to set up MPS partitioning on GKE, but I can't get the k8s-device-plugin to work. The plugin gets installed correctly, but it never starts any driver pods.

Cluster data:

  • K8s Rev: v1.24.11-gke.1000
  • GPU used: Nvidia L4
  • nvidia-device-plugin-0.13.0

The node only has the following taints:

Taints:             nvidia.com/gpu=present:NoSchedule

It's also properly labeled as

nos.nebuly.com/gpu-partitioning=mps

The regular NVIDIA device plugin worked just fine before I pushed it off the GPU nodes with nodeSelectors on the default DaemonSet injected by GKE.

The Nebuly plugin, however, is stuck at 0 pods:

k get ds -n nebuly-nvidia

NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                         AGE
nvidia-device-plugin-1684693222   0         0         0       0            0           nos.nebuly.com/gpu-partitioning=mps   33m

Your documentation mentions that, in order to avoid duplicate drivers on nodes, we can configure affinity on the pre-existing NVIDIA driver to avoid scheduling both on the same nodes. I've done that for the GKE driver DaemonSet, but that results in a container that's always stuck in creating. Not a big deal, but I just want to confirm that this is expected.

Here's what pods I currently have on the GPU node:

  Namespace                   Name                                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 fluentbit-gke-nzm72                                        100m (2%)     0 (0%)      200Mi (1%)       500Mi (3%)     23m
  kube-system                 gke-metrics-agent-ghmdm                                    8m (0%)       0 (0%)      110Mi (0%)       110Mi (0%)     23m
  kube-system                 kube-proxy-gke-xxx-gke-workspace-gpu-95e23864-6fwc    100m (2%)     0 (0%)      0 (0%)           0 (0%)         23m
  kube-system                 nvidia-gpu-device-plugin-x6l9c                             50m (1%)      0 (0%)      50Mi (0%)        50Mi (0%)      23m
  kube-system                 pdcsi-node-dbxns                                           10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     23m

Is there anything I'm doing incorrectly here? Afaik it's not possible to remove the default NVIDIA driver from the cluster, as it's automatically injected by GKE. Please let me know if there's anything I can do to solve this; I'd love to start using your stuff. Thanks a lot for your time.

Support mixed MIG+MPS dynamic partitioning

Description

Currently, when enabling Dynamic GPU Partitioning on a node, it is possible to choose only between MIG or MPS by adding one of the following labels: nos.nebuly.com/gpu-partitioning: "mig" or nos.nebuly.com/gpu-partitioning: "mps".

It would be nice to have a third dynamic partitioning option that mixes MIG and MPS. This would be particularly useful for further partitioning MIG devices with MPS, as often the smallest available MIG device on a GPU is way larger than the resources required by the workloads.

For instance, the smallest MIG profile for NVIDIA-A100-SXM4-80GB is 1g.10gb, which provides 10GB of GPU memory. However, since many workloads require less than 10GB of GPU memory, this leads to inefficiencies.

Right now the alternative is to partition GPUs using MPS, which allows the creation of GPU slices of arbitrary size. However, MPS does not provide full workload isolation. Using MPS on top of MIG would enable finer-grained partitioning without compromising too much workload isolation, as only the workloads sharing the same MIG partition wouldn't be fully isolated.

Proposed solution

Add the possibility to label a node with nos.nebuly.com/gpu-partitioning: "mixed". For nodes with this label, nos should automatically use MPS for partitioning the smallest available MIG devices according to the requested resources.

Cannot use entire gpu memory

Hi,

I have an A100-PCIE-40GB GPU and I am trying to use nos MPS dynamic partitioning.
The issue is that there seem to be some problems with the total capacity calculation.
For example, when I try to run 2 Pods that each request nvidia.com/gpu-20gb: 1, one of them always stays in Pending,
while I am able to schedule 1 Pod requesting nvidia.com/gpu-20gb plus another 2 Pods requesting nvidia.com/gpu-10gb.
I have hit this issue of not fully using the GPU memory with some other combinations as well.

Does anyone have any idea?
Any help would be much appreciated.

Usage with Karpenter?

Karpenter doesn't like the custom resource requests from nebuly, as it uses nvidia.com/gpu to map Pods to instance types with GPUs available. I'm interested in a solution, which would effectively enable simple serverless GPUs with high utilization.

Multi-tenant Elastic Resource Quota

It would be nice if we could configure the quotas to be shared only among namespaces of the same tenant, not across tenants. For example, a namespace named "nos-deployment-1" would only share resources with other namespaces starting with the same tenant name, "nos".

Limiting GPU Resource Usage per Docker Container with MPS Daemon

I've been utilizing the MPS (Multi-Process Service) daemon to manage resource usage limits for processes using the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variables, and it's been working well. However, I've encountered a scenario that I'm not sure how to address. I'm curious if there's a way to apply these limits collectively to an entire Docker container.

For example, if we set CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=0=1000MB in the container's environment variables, launching two processes results in each having its own limit, effectively allowing them to use a total of 2000MB combined. Is there a mechanism or strategy to enforce the total limit across the entire container so that, in my case, two applications together cannot exceed the 1000MB limit?

Has anyone tackled this issue before, or is there a way to ensure that the collective limit applies to the whole Docker container, restricting the total resource usage to, for example, 1000MB as per my example?

How to configure sharing.mps for individual nodes

sharing:
  mps: 
    failRequestsGreaterThanOne: true
    resources:
      - name: nvidia.com/gpu
        rename: nvidia.com/gpu-2gb
        memoryGB: 2
        replicas: 2
        devices: ["0"]

Can this configuration be applied to individual nodes as specified above?

KubeFlow Integration

Surface hooks to Kubeflow for granular model scheduling on the appropriate K8s cluster vGPU node resources.

MPS server not serving any request after connecting with wrong user ID

Problem description

MPS Server requires the clients to run with the same user ID, which is 1000 by default. If a container requesting MPS resources runs with a different user ID, the MPS server refuses the request and the container cannot access the GPU. This behaviour is expected.

However, after that happens, any new container running with user 1000 and requesting MPS resources incurs the same problem.

How to replicate

  1. Create a Pod requesting MPS resources and running with a user ID different from 1000 (in this case 0):
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1
  2. Create a new Pod requesting MPS resources, this time running with user ID 1000:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-2
spec:
  hostIPC: true
  restartPolicy: OnFailure
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  containers:
  - name: cuda-test
    image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
         nvidia.com/gpu-2gb: 1

Expected behaviour

The first Pod, running as user 0, should not be able to access the GPU. The second Pod, running as user 1000, should instead be able to access the requested GPU slice.

Actual behaviour

Both Pods get stuck when requesting GPU access, as the MPS server enqueues the requests and never serves them.
These are the logs from the MPS server running in the device plugin when the Pod running as user 1000 tries to connect to the GPU:

nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] Accepting connection...
nvidia-mps-server [2023-02-28 09:31:00.573 Control    54] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list

Temporary solution

Restart the MPS server running on the node by restarting the device-plugin Pod on that node.

Question about mps sever occupied GPU memory

I found that the MPS server always occupies 27 MB of memory on my GPU (Tesla V100-32GB). If I allocate gpu-16gb and then try to allocate another 16 GB, it fails because the GPU does not have enough memory left.

GPU Partitioning annotations are not properly cleaned up

Description

Uninstalling nos or disabling dynamic GPU partitioning on a node does not remove the node annotations set by nos.

Current behaviour

The following annotations are set by nos on the Nodes for which automatic GPU partitioning is enabled:

  • nos.nebuly.com/status-gpu-<index>-<mig-profile>-free: <quantity>
  • nos.nebuly.com/status-gpu-<index>-<mig-profile>-used: <quantity>
  • nos.nebuly.com/spec-gpu-<index>-<mig-profile>: <quantity>

After uninstalling nos or disabling dynamic GPU partitioning on a certain node, these annotations are not removed. Consequently, the next time nos is installed on the cluster or dynamic GPU partitioning is enabled on the node, nos might apply the previous desired GPU partitioning state.

Desired behaviour

Uninstalling nos or disabling GPU partitioning on a node should clean up all the annotations previously set by nos.
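For illustration, here is a minimal client-go sketch (a hypothetical helper, not part of nos) of the kind of cleanup that could be run on uninstall, using the annotation prefixes listed above:

package main

import (
	"context"
	"os"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cleanupNosAnnotations strips the nos partitioning annotations (the
// status-gpu-* and spec-gpu-* keys listed above) from the given node.
func cleanupNosAnnotations(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	prefixes := []string{
		"nos.nebuly.com/status-gpu-",
		"nos.nebuly.com/spec-gpu-",
	}

	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	for key := range node.Annotations {
		for _, prefix := range prefixes {
			if strings.HasPrefix(key, prefix) {
				delete(node.Annotations, key)
				break
			}
		}
	}

	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// NODE_NAME is assumed to be injected, e.g. via the downward API.
	if err := cleanupNosAnnotations(context.Background(), client, os.Getenv("NODE_NAME")); err != nil {
		panic(err)
	}
}

Running something like this when nos is uninstalled, or when the gpu-partitioning label is removed from a node, would prevent stale partitioning state from being re-applied on the next install.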

Support running on nodes with host-installed GPU drivers

Nos is currently broken on systems where GPU drivers are pre-installed on the host, for example on AKS. The symptom is the gpu-agent Pod not starting due to the missing /run/nvidia path on the host.

According to the NVIDIA DRA driver documentation, the /run/nvidia folder is provided by the driver container. When drivers are installed on the host instead of via a container, the path is missing and has to be symlinked to the host root manually:

Ensure your NVIDIA driver installation is rooted at /run/nvidia/driver

For deployments running a driver container this is a noop.
The driver container should already mount the driver installation at /run/nvidia/driver.

For deployments running with a host-installed driver, the following is sufficient to meet this requirement:

mkdir -p /run/nvidia
sudo ln -s / /run/nvidia/driver

NOTE: This is only currently necessary due to a limitation of how our CDI
generation library works. This restriction will be removed very soon.

To implement support for host-installed drivers, we could simply mount the host's / as /run/nvidia/driver inside the gpu-agent container.

Elastic Resource Quota for non-AI workloads

I operate a regular DevOps platform for microservices which doesn't use or need GPU.

However, a GPU is a prerequisite for nos.

The Elastic Resource Quota feature really caught my attention. It would be a great help.

Is there a way to use it without having to enable GPU?

Demo gpu sharing for mps does not start inferencing after downloading pytorch_model.bin

Starting Prometheus server on port 8000...
Running benchmark...
Downloading (…)cessor_config.json: 100%|██████████| 292/292 [00:00<00:00, 27.0kB/s]
Downloading (…)config.json: 100%|██████████| 4.13k/4.13k [00:00<00:00, 244kB/s]
Downloading (…)pytorch_model.bin: 100%|██████████| 123M/123M [11:58<00:00, 171kB/s]

The line Running inference... is never printed, so I assume there is some problem when the model is loaded onto the GPU. Here is the MPS server log:

==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:31:59.303 Other   138] Initializing server process
[2024-07-30 02:31:59.339 Server   138] Creating server context on device 0 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:31:59.401 Server   138] Creating server context on device 1 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:31:59.456 Server   138] Created named shared memory region /cuda.shm.3e8.8a.1

==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:31:59.456 Control    58] NEW SERVER 138: Ignoring connection from user

==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:31:59.456 Server   138] Active Threads Percentage set to 0.0
[2024-07-30 02:32:36.506 Server   138] Server Priority set to 0
[2024-07-30 02:32:36.506 Server   138] Server has started
[2024-07-30 02:32:36.506 Server   138] Destroy server context on device 0
[2024-07-30 02:32:36.545 Server   138] Destroy server context on device 1

==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.581 Control    58] Server 138 exited with status 0
[2024-07-30 02:32:36.581 Control    58] Starting new server 144 for user 1000

==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.601 Other   144] Startup
[2024-07-30 02:32:36.601 Other   144] Connecting to control daemon on socket: /tmp/nvidia-mps/control

==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.601 Control    58] Accepting connection...

==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.601 Other   144] Initializing server process
[2024-07-30 02:32:36.641 Server   144] Creating server context on device 0 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:32:36.704 Server   144] Creating server context on device 1 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:32:36.768 Server   144] Created named shared memory region /cuda.shm.3e8.90.1

==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.768 Control    58] NEW SERVER 144: Ready

==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.768 Server   144] Active Threads Percentage set to 100.0
[2024-07-30 02:32:36.768 Server   144] Server Priority set to 0
[2024-07-30 02:32:36.768 Server   144] Server has started
[2024-07-30 02:32:36.768 Server   144] Received new client request
[2024-07-30 02:32:36.799 Server   144] Worker created
[2024-07-30 02:32:36.799 Server   144] Creating worker thread
[2024-07-30 02:32:36.799 Server   144] Waiting for current clients to finish

==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.847 Control    58] Accepting connection...
[2024-07-30 02:32:36.848 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:37:55.850 Control    58] Accepting connection...
[2024-07-30 02:37:55.850 Control    58] User did not send valid credentials
[2024-07-30 02:37:55.850 Control    58] Accepting connection...
[2024-07-30 02:37:55.851 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:41:25.952 Control    58] Accepting connection...
[2024-07-30 02:41:25.952 Control    58] User did not send valid credentials
[2024-07-30 02:41:25.952 Control    58] Accepting connection...
[2024-07-30 02:41:25.952 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:42:55.872 Control    58] Accepting connection...
[2024-07-30 02:42:55.872 Control    58] User did not send valid credentials
[2024-07-30 02:42:55.872 Control    58] Accepting connection...
[2024-07-30 02:42:55.872 Control    58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-07-30 02:49:23.964 Control    58] Accepting connection...
[2024-07-30 02:49:23.964 Control    58] User did not send valid credentials
[2024-07-30 02:49:23.964 Control    58] Accepting connection...
[2024-07-30 02:49:23.964 Control    58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-07-30 02:50:09.170 Control    58] Accepting connection...
[2024-07-30 02:50:09.247 Control    58] User did not send valid credentials
[2024-07-30 02:50:09.247 Control    58] Accepting connection...
[2024-07-30 02:50:09.247 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:51:05.370 Control    58] Accepting connection...
[2024-07-30 02:51:05.370 Control    58] User did not send valid credentials
[2024-07-30 02:51:05.370 Control    58] Accepting connection...
[2024-07-30 02:51:05.370 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:52:51.748 Control    58] Accepting connection...
[2024-07-30 02:52:51.749 Control    58] User did not send valid credentials
[2024-07-30 02:52:51.749 Control    58] Accepting connection...
[2024-07-30 02:52:51.749 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:54:55.658 Control    58] Accepting connection...
[2024-07-30 02:54:55.658 Control    58] User did not send valid credentials
[2024-07-30 02:54:55.658 Control    58] Accepting connection...
[2024-07-30 02:54:55.658 Control    58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:57:06.983 Control    58] Accepting connection...
[2024-07-30 02:57:06.984 Control    58] User did not send valid credentials
[2024-07-30 02:57:06.984 Control    58] Accepting connection...
[2024-07-30 02:57:06.984 Control    58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list

mig-agent pod failure

Hi,

I am seeing the error below in the nebuly-nos-nebuly-nos-mig-agent pod.

{"level":"info","ts":1678537855.2905262,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1678537855.2921324,"logger":"setup","msg":"Initializing NVML client"}
{"level":"info","ts":1678537855.2921576,"logger":"setup","msg":"Checking MIG-enabled GPUs"}
{"level":"info","ts":1678537855.450721,"logger":"setup","msg":"Cleaning up unused MIG resources"}
{"level":"error","ts":1678537855.5242505,"logger":"setup","msg":"unable to initialize agent","error":"[code: generic err: unable to get allocatable resources from Kubelet gRPC socket: rpc error: code = Unimplemented desc = unknown method GetAllocatableResources for service v1.PodResourcesLister]","stacktrace":"main.main\n\t/workspace/migagent.go:119\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

Please let me know what information is needed from my side; any pointers on why we get this error would be greatly appreciated.

Thanks

MIG partitioning on multi-GPU nodes breaks when there are no MIG devices

On nodes with multiple GPUs with MIG mode enabled, if a GPU does not have any MIG resource then the device plugin fails to advertise GPU resources to k8s. When this happens, the GPU slices created by nos on the node GPUs never become available in k8s.

This is due to the NVIDIA Device Plugin raising an error if a GPU has MIG mode enabled but no MIG devices.

We can solve this by making nos initialize MIG-enabled GPUs with an arbitrary MIG geometry.

Cluster autoscaling with nos

Hi, I wanted to use nos as an autoscaler too, scaling GPU nodes in and out within the cluster while using MPS. Since nos already watches resource requests and availability, it should be possible to add nodes to the cluster depending on the requested resources, leading to additional cost savings on top of higher GPU utilization.

Is this feature part of the roadmap? If not, could someone familiar with nos point out the best way to implement this within nos?

NOS MPS leaves GPUs on node in exclusive mode

In my use-case I am often enabling and disabling NOS on individual nodes by adding/removing the label nos.nebuly.com/gpu-partitioning=mps. After labeling the node, NOS will change the GPU mode to exclusive. However, after removing the label, the GPU remains in exclusive mode.

Expected behavior: NOS should revert the GPU mode to whatever it was when it started or to default.

Workaround: Change back to default mode (or whatever mode you want) after removing the label. Do this for all GPUs. For example, to change the mode on GPU 0 back to default use the following.

nvidia-smi -i 0 -c 0
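If the node has several GPUs, a small sketch like the one below (just shelling out to nvidia-smi, not part of nos, and requiring root privileges) resets the compute mode on all of them:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// List the GPU indexes known to the driver.
	out, err := exec.Command("nvidia-smi", "--query-gpu=index", "--format=csv,noheader").Output()
	if err != nil {
		panic(err)
	}

	// Reset each GPU's compute mode to DEFAULT (-c 0).
	for _, idx := range strings.Fields(string(out)) {
		if err := exec.Command("nvidia-smi", "-i", idx, "-c", "0").Run(); err != nil {
			fmt.Printf("failed to reset compute mode on GPU %s: %v\n", idx, err)
		}
	}
}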

Unable to pull

Unable to pull:

helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
  --version 0.1.2 \
  --namespace nebuly-nos \
  --generate-name \
  --create-namespace

GPU Ram limit invalid

OS: Ubuntu 22.04 LTS
Driver Version: 520.61.05
CUDA Version: 11.8
GPU: A100 * 2
nos installed exactly as in the official docs (only nos installed)
Using the MPS sharing method (20 GB GPU RAM per Pod)

When the Pod is assigned to GPU 1, its GPU memory does not seem to be limited: it eats all 80 GB of GPU RAM (but everything works fine when the Pod is assigned to GPU 0), and I can't figure out why.
(These two pictures were taken at different times)

7g.79gb does not work as expected.

Using gpu-operator (Helm chart 23.9.1) and nos (Helm chart 0.1.2).

I have an issue with nvidia.com/mig-7g.79gb. When I specify it, nos creates the MIG configuration as expected, but the resource seems to be exposed as nvidia.com/mig-7g.80gb, as shown in the log below from the nvidia-device-plugin.

I0312 23:04:34.682199       1 server.go:165] Starting GRPC server for 'nvidia.com/mig-7g.80gb'
I0312 23:04:34.682673       1 server.go:117] Starting to serve 'nvidia.com/mig-7g.80gb' on /var/lib/kubelet/device-plugins/nvidia-mig-7g.80gb.sock
I0312 23:04:34.684745       1 server.go:125] Registered device plugin for 'nvidia.com/mig-7g.80gb' with Kubelet

Additionally, the labels created on the node look like this

(Screenshot of the node labels: Screenshot_20240312_162714)

But the issue is that, because we specified nvidia.com/mig-7g.79gb, the Pod stays in Pending. Note the config below (all the other commented-out nvidia examples below work, except 7g.79gb).

---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-test-7g80g
spec:
  template:
    spec:
      runtimeClassName: nvidia
      restartPolicy: Never
      containers:
      - name: nvidia
        image: nvidia/cuda:12.3.2-devel-ubuntu22.04
        command: ["sleep", "12000"]
        resources:
          limits:
            nvidia.com/mig-7g.79gb: 1
            #nvidia.com/mig-1g.10gb: 1
            #nvidia.com/mig-2g.20gb: 1
            #nvidia.com/mig-4g.40gb: 1

I tried adding 7g.80gb to allowedGeometries, but it did not work as expected. I briefly looked at the code and found https://github.com/nebuly-ai/nos/blob/main/pkg/gpu/mig/known_configs.go#L93, so I am not sure if I missed something, or whether there is a way to get the desired behavior.

nebuly-nvidia-device plugin crash on new partitioning / config change

I deployed nos with the nebuly-nvidia device plugin in MPS partitioning mode.
Whenever I apply a Deployment/Pods that requires the GPU partitioner to change the GPU partitioning, the nebuly-nvidia device plugin crashes.

I tried to follow what's happening, and this is my guess:

  1. A new Deployment gets applied; the GPU partitioner checks the pending Pods to see whether the partitioning needs to change.
  2. The GPU partitioner performs the new partitioning, writes it to a config, and references the new config in the node label nvidia.com/device-plugin.config.
  3. At the same time, the nebuly device plugin is triggered by the label change and tries to read the new config referenced by the label.
  4. The referenced config does not exist (yet?). Maybe this is a timing issue, i.e. the config takes a second to become active?
  5. The non-existing config causes the nebuly device plugin to crash. Because this happens every time a new partitioning is necessary, after some time we run into the k8s CrashLoopBackOff, meaning that the restart of the nebuly device plugin takes 5 minutes. After 5 minutes and the restart, the new partitioning becomes active and the pending Pods start quickly with access to their configured MPS GPU fractions.

Here is the log output of the nebuly-nvidia device plugin. You can see that at 13:05 I deployed a Deployment with a Pod requesting nvidia.com/gpu-2gb, which triggered a new partitioning and caused the crash:

kubectl logs pod/nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia --follow
Defaulted container "nvidia-device-plugin-sidecar" out of: nvidia-device-plugin-sidecar, nvidia-mps-server, nvidia-device-plugin-ctr, set-compute-mode (init), set-nvidia-mps-volume-permissions (init), nvidia-device-plugin-init (init)
W0801 13:02:37.159120     270 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:02:37Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Updating to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Successfully updated to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Sending signal 'hangup' to 'nvidia-device-plugin'"
time="2024-08-01T13:02:37Z" level=info msg="Successfully sent signal"
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:05:02Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517497"
time="2024-08-01T13:05:02Z" level=info msg="Error: specified config vm125-1722517497 does not exist"

It is still working overall, but like this it always takes 5 minutes for my Pods to start when the partitioning changes :(
