Join our new community on Discord ✨ for a chat about AI optimization, or, if you want to know more, feel free to connect with us on LinkedIn.
Module to automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elastic quotas. Effortless optimization at its finest!
Home Page: https://www.nebuly.com/
License: Apache License 2.0
In internal/partitioning/mps/partitioner.go, the ToPluginConfig function uses the Config struct from the github.com/NVIDIA/k8s-device-plugin/api/config/v1 package. This struct contains nested structs but does not use struct pointers, which causes the YAML/JSON Marshal functions to render "empty" structs as empty maps/objects instead of omitting them. This results in the following config value (take a look at timeSlicing):
flags:
  failOnInitError: null
  gdsEnabled: null
  migStrategy: none
  mofedEnabled: null
resources:
  gpus: null
sharing:
  mps:
    failRequestsGreaterThanOne: true
    resources:
    - devices:
      - "0"
      memoryGB: 10
      name: nvidia.com/gpu
      rename: gpu-10gb
      replicas: 2
  timeSlicing: {}
version: v1
This behavior is explained here.
There is a custom Unmarshal function that is executed when the sharing.timeSlicing field exists in the raw config, but it throws an error when the field is empty, exactly as we see in the above config example. See the code here:
resources, exists := ts["resources"]
if !exists {
return fmt.Errorf("no resources specified")
}
GFD uses this package to read the device-plugin config created by the partitioner. When a new partitioning config is applied, the empty timeSlicing field in it causes the above code to crash the GFD container with a no resources specified error, until timeSlicing: {} is removed from the ConfigMap, which resolves the error.
I think it makes sense to fix this issue in nebuly-ai/k8s-device-plugin by removing the checks and forking GFD to use that fork, as well as tweaking the structs to use pointers for nested structs so that proper YAML is rendered.
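For reference, here is a minimal, self-contained sketch of the pointer fix with hypothetical types (not the actual device-plugin structs). It uses encoding/json because Kubernetes-style YAML marshalling (e.g. sigs.k8s.io/yaml) goes through the JSON tags and behaves the same way:

package main

import (
	"encoding/json"
	"fmt"
)

type TimeSlicing struct {
	Resources []string `json:"resources,omitempty"`
}

type SharingByValue struct {
	// Struct value: omitempty never fires, so an empty TimeSlicing still
	// marshals as {}, which is exactly the "timeSlicing: {}" rendered above.
	TimeSlicing TimeSlicing `json:"timeSlicing,omitempty"`
}

type SharingByPointer struct {
	// Pointer: a nil TimeSlicing is omitted from the output entirely.
	TimeSlicing *TimeSlicing `json:"timeSlicing,omitempty"`
}

func main() {
	v, _ := json.Marshal(SharingByValue{})
	fmt.Println(string(v)) // {"timeSlicing":{}}

	p, _ := json.Marshal(SharingByPointer{})
	fmt.Println(string(p)) // {}
}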
I came across the metrics exporter, but I am not able to set it up. The errors are:
{"level":"info","ts":1679291005.7844253,"msg":"reading metrics file","metricsFile":""}
{"level":"error","ts":1679291005.7844558,"msg":"failed to read metrics file","error":"open : no such file or directory","stacktrace":"main.main\n\t/workspace/cmd/metricsexporter/metricsexporter.go:62\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
Can someone please point me to how to set this up? We need per-Pod GPU utilization metrics.
I continued to build gpu-agent and found that ts_agent.yaml in the config should apparently be gpu_agent.yaml?
To request a GPU with 10 GB of memory, the following key/value is used:
resources:
  limits:
    nvidia.com/gpu-10gb: 1
Would it make more sense to do the following instead?
resources:
  limits:
    nvidia.com/gpu-memory: "10Gi"
    nvidia.com/gpu: 1
When changing the partitioning mode of a node from MPS to MIG, the nvidia-device-plugin crashes, and therefore any new MIG device created by nos is never exposed to k8s as a resource.
1. Label the node for MPS partitioning: kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mps"
2. Create a Pod requesting nvidia.com/gpu-10gb
3. Re-label the node for MIG partitioning: kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mig"
4. Create a Pod requesting nvidia.com/mig-1g.10gb
Expected behavior: after step 4, the MIG resources are created automatically and the Pod is scheduled on the node.
Actual behavior: after step 4, the MIG devices are created on the GPU, but the nvidia-device-plugin Pod crashes with the error Cannot find configuration named <config-name>, where <config-name> is the name of the configuration set by nos during step 2.
Hi,
I am allocating only 1 GB of the 24 GB of available memory shown in my node's labels to the GPU operator. I also have another GPU device plugin (the default one) in my cluster, but I have made the necessary affinity configurations to prevent both from running. Basically, my Pod (the sleep Pod shared in the documentation) gets stuck in Pending with resource overuse given as the reason, and never gets scheduled. The MPS server occupies even less than 1 GB on my GPU, and it appears to be running in the output of nvidia-smi.
I have followed the steps in the doc about user mode 1000 and the necessary gpu-operator config arrangements (MIG mode mixed, etc.).
Any help would be much appreciated.
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-6dd8b8765c-7nm86 1/1 Running 0 23h
calico-apiserver calico-apiserver-6dd8b8765c-fp6bx 1/1 Running 0 23h
calico-system calico-kube-controllers-5c8ddb5dcf-tv4fw 1/1 Running 0 23h
calico-system calico-node-hzxml 1/1 Running 0 23h
calico-system calico-typha-d6688954-g547t 1/1 Running 0 23h
calico-system csi-node-driver-4qfps 2/2 Running 0 23h
default gpu-feature-discovery-nqrtb 1/1 Running 0 85m
default gpu-operator-787cd6f58-xn68k 1/1 Running 0 85m
default gpu-pod 0/1 Completed 0 3h35m
default mps-partitioning-example 0/1 Pending 0 3m16s
default nvidia-container-toolkit-daemonset-dj7xv 1/1 Running 0 85m
default nvidia-cuda-validator-4pmjv 0/1 Completed 0 56m
default nvidia-dcgm-exporter-pwfwb 1/1 Running 0 85m
default nvidia-device-plugin-daemonset-7p4b7 1/1 Running 0 85m
default nvidia-operator-validator-fr897 1/1 Running 0 85m
default release-name-node-feature-discovery-gc-5cbdb95596-9p5bn 1/1 Running 0 88m
default release-name-node-feature-discovery-master-788d855b45-fsz56 1/1 Running 0 88m
default release-name-node-feature-discovery-worker-dgcn5 1/1 Running 0 39m
kube-system coredns-5dd5756b68-tgdgf 1/1 Running 0 23h
kube-system coredns-5dd5756b68-wlxq2 1/1 Running 0 23h
kube-system etcd-selin-csl 1/1 Running 1553 23h
kube-system kube-apiserver-selin-csl 1/1 Running 30 23h
kube-system kube-controller-manager-selin-csl 1/1 Running 0 23h
kube-system kube-proxy-lslfg 1/1 Running 0 23h
kube-system kube-scheduler-selin-csl 1/1 Running 35 23h
nebuly-nvidia nvidia-device-plugin-1698187396-r7tpf 3/3 Running 0 32m
node-feature-discovery nfd-6q9tl 2/2 Running 0 14m
node-feature-discovery nfd-master-85f4bc48cf-dlw4q 1/1 Running 0 42m
node-feature-discovery nfd-worker-wln6p 1/1 Running 2 (42m ago) 42m
tigera-operator tigera-operator-94d7f7696-ff7kf 1/1 Running 0 23h
selin@selin-csl:~$ kubectl describe node selin-csl
Name: selin-csl
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-cstate.enabled=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=85
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
feature.node.kubernetes.io/cpu-pstate.status=active
feature.node.kubernetes.io/cpu-pstate.turbo=true
feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBA=true
feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
feature.node.kubernetes.io/cpu-rdt.RDTMON=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=6.2.0-34-generic
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=2
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-0300_1002.present=true
feature.node.kubernetes.io/pci-0300_10de.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=ubuntu
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
kubernetes.io/arch=amd64
kubernetes.io/hostname=selin-csl
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
nos.nebuly.com/gpu-partitioning=mps
nvidia.com/cuda.driver.major=535
nvidia.com/cuda.driver.minor=113
nvidia.com/cuda.driver.rev=01
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=2
nvidia.com/gfd.timestamp=1698184228
nvidia.com/gpu-driver-upgrade-state=upgrade-done
nvidia.com/gpu.compute.major=7
nvidia.com/gpu.compute.minor=5
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=turing
nvidia.com/gpu.machine=Precision-5820-Tower
nvidia.com/gpu.memory=24576
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-TITAN-RTX
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=mixed
While building nos I found the doc was a bit outdated: the make targets in this part should all begin with docker-.
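For example (the target names here are an assumption based on the docker- prefix convention, not verified against the current Makefile):

make docker-build
make docker-push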
Steps to reproduce the issue
1. Install the Nebuly NVIDIA plugin: https://github.com/nebuly-ai/k8s-device-plugin
2. Start a Pod whose image is "k8s.gcr.io/cuda-vector-add:v0.1"
3. The client does not print any logs, and nvidia-cuda-mps-server hangs at the "creating worker thread" log
Hi, I'm trying to set up MPS partitioning on GKE, but I can't get the k8s-device-plugin to work. The plugin gets installed correctly, but it never starts any driver pods.
Cluster data:
The node only has the following taints:
Taints: nvidia.com/gpu=present:NoSchedule
It's also properly labeled as
nos.nebuly.com/gpu-partitioning=mps
The regular nvidia device plugin worked just fine until I pushed it out with nodeSelectors on the default daemonset injected by GKE.
The nebuly plugin however is stuck at 0 pods:
k get ds -n nebuly-nvidia
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
nvidia-device-plugin-1684693222 0 0 0 0 0 nos.nebuly.com/gpu-partitioning=mps 33m
Your documentation mentions that, in order to avoid duplicate drivers on nodes, we can configure affinity on the pre-existing nvidia driver to avoid scheduling both on the same nodes. I've done that for the GKE driver daemonset, but that results in a container that's always stuck in a creating state. Not a big deal, but I just want to confirm that this is expected.
Here's what pods I currently have on the GPU node:
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system fluentbit-gke-nzm72 100m (2%) 0 (0%) 200Mi (1%) 500Mi (3%) 23m
kube-system gke-metrics-agent-ghmdm 8m (0%) 0 (0%) 110Mi (0%) 110Mi (0%) 23m
kube-system kube-proxy-gke-xxx-gke-workspace-gpu-95e23864-6fwc 100m (2%) 0 (0%) 0 (0%) 0 (0%) 23m
kube-system nvidia-gpu-device-plugin-x6l9c 50m (1%) 0 (0%) 50Mi (0%) 50Mi (0%) 23m
kube-system pdcsi-node-dbxns 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 23m
Is there anything I'm doing incorrectly here? AFAIK it's not possible to remove the default nvidia driver from the cluster, as it's automatically injected by GKE. Please let me know if there's anything I can do to solve this; I'd love to start using your stuff. Thanks a lot for your time.
Hi, I was building the gpu-partitioner and ran into an issue saying lstat /the/path/to/nos/config/gpupartitioner/default/gpu_partitioner_metrics_service.yaml.yaml: no such file or directory.
Then I located the cause:
Currently, when enabling dynamic GPU partitioning on a node, it is possible to choose only between MIG or MPS by adding one of the following labels: nos.nebuly.com/gpu-partitioning: "mig" or nos.nebuly.com/gpu-partitioning: "mps".
It would be nice to have a third dynamic partitioning option that mixes MIG and MPS. This would be particularly useful for further partitioning MIG devices with MPS, as often the smallest available MIG device on a GPU is way larger than the resources required by the workloads.
For instance, the smallest MIG profile for NVIDIA-A100-SXM4-80GB is 1g.10gb, which provides 10GB of GPU memory. However, since many workloads require less than 10GB of GPU memory, this leads to inefficiencies.
Right now the alternative is to partition GPUs using MPS, which allows the creation of GPU slices of arbitrary size. However, MPS does not provide full workload isolation. Using MPS on top of MIG would enable finer-grained partitioning without compromising too much workload isolation, as only the workloads sharing the same MIG partition wouldn't be fully isolated.
Add the possibility to label a node with nos.nebuly.com/gpu-partitioning: "mixed". For nodes with this label, nos should automatically use MPS for partitioning the smallest available MIG devices according to the requested resources.
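A sketch of how enabling the proposed mode could look, mirroring the existing label commands (the mixed value is part of this proposal, not an existing option):

kubectl label nodes <node> "nos.nebuly.com/gpu-partitioning=mixed"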
Hi,
I have an A100-PCIE-40GB GPU and I am trying to use nos MPS dynamic partitioning.
The issue is that it seems to have some problems with total capacity calculation. For example, when I try to run 2 Pods that each request nvidia.com/gpu-20gb: 1, one of them always stays in Pending, while I am able to schedule 1 Pod requesting nvidia.com/gpu-20gb plus another 2 Pods requesting nvidia.com/gpu-10gb.
I have run into this issue of not fully using the GPU memory with some other combinations as well.
Does anyone have any idea? It would be much appreciated.
Karpenter doesn't like the custom resource requests from nebuly, as it uses nvidia.com/gpu to map Pods to instance types with GPUs available. I'm interested in a solution, which would effectively enable simple serverless GPUs with high utilization.
It would be nice if we could configure the quotas to be shared only among the same tenant, not across tenants. For example, a namespace named "nos-deployment-1" would only share resources with namespaces starting with the same tenant prefix, "nos".
I've been using the MPS (Multi-Process Service) daemon to manage resource usage limits for processes via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE and CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variables, and it's been working well. However, I've encountered a scenario that I'm not sure how to address. I'm curious whether there's a way to apply these limits collectively to an entire Docker container.
For example, if we set CUDA_MPS_PINNED_DEVICE_MEM_LIMIT=0=1000MB in the container's environment variables, launching two processes results in each having its own limit, effectively allowing them to use a total of 2000MB combined. Is there a mechanism or strategy to enforce the total limit across the entire container so that, in my case, two applications together cannot exceed the 1000MB limit?
Has anyone tackled this issue before, or is there a way to ensure that the collective limit applies to the whole Docker container, restricting the total resource usage to, for example, 1000MB?
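For context, a minimal sketch of the setup described above (image and values are illustrative). Both variables are read per CUDA client process, so each process launched in this container gets its own 1000MB budget instead of sharing one:

apiVersion: v1
kind: Pod
metadata:
  name: mps-limit-example
spec:
  containers:
    - name: cuda-app
      image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
      env:
        # Per-process limits: two CUDA processes can use 2x1000MB in total.
        - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
          value: "50"
        - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
          value: "0=1000MB"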
sharing:
  mps:
    failRequestsGreaterThanOne: true
    resources:
      - name: nvidia.com/gpu
        rename: nvidia.com/gpu-2gb
        memoryGB: 2
        replicas: 2
        devices: ["0"]
Can this configuration be applied to individual nodes as specified above?
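Not an authoritative answer, but the upstream NVIDIA device plugin supports multiple named configs selected per node through the nvidia.com/device-plugin.config node label (the same label that appears in the sidecar logs further below), so a per-node setup might look like this hypothetical Helm values sketch:

config:
  default: "default"
  map:
    default: |-
      version: v1
    mps-2gb: |-
      version: v1
      sharing:
        mps:
          failRequestsGreaterThanOne: true
          resources:
            - name: nvidia.com/gpu
              rename: nvidia.com/gpu-2gb
              memoryGB: 2
              replicas: 2
              devices: ["0"]

A node would then be opted in with kubectl label nodes <node> nvidia.com/device-plugin.config=mps-2gb.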
Surface hooks to Kubeflow for granular model scheduling on appropriate K8s cluster vGPU node resources.
MPS Server requires the clients to run with the same user ID, which is 1000 by default. If a container requesting MPS resources runs with a different user ID, the MPS server refuses the request and the container cannot access the GPU. This behaviour is expected.
However, after that happens, any new container running with user 1000 and requesting MPS resources runs into the same problem.
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  hostIPC: true
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          nvidia.com/gpu-2gb: 1
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-2
spec:
  hostIPC: true
  restartPolicy: OnFailure
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
  containers:
    - name: cuda-test
      image: "pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime"
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          nvidia.com/gpu-2gb: 1
The first Pod, which does not run as user 1000, should not be able to access the GPU. The second Pod, running as user 1000, should instead be able to access the requested GPU slice.
Instead, both Pods get stuck when requesting GPU access, as the MPS server enqueues the requests and never serves them.
These are the logs from the MPS server running in the device plugin when the Pod running as user 1000 tries to connect to the GPU:
nvidia-mps-server [2023-02-28 09:31:00.573 Control 54] Accepting connection...
nvidia-mps-server [2023-02-28 09:31:00.573 Control 54] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
Workaround: restart the MPS server running on the node by restarting the device-plugin Pod on that node.
I found that the MPS server always occupies 27 MB of memory on my GPU (Tesla V100-32GB): if I allocate gpu-16gb and then allocate another 16 GB, it fails because the GPU doesn't have enough memory.
Uninstalling nos or disabling dynamic GPU partitioning on a node does not remove the node annotations set by nos.
The following annotations are set by nos on the Nodes for which automatic GPU partitioning is enabled:
nos.nebuly.com/status-gpu-<index>-<mig-profile>-free: <quantity>
nos.nebuly.com/status-gpu-<index>-<mig-profile>-used: <quantity>
nos.nebuly.com/spec-gpu-<index>-<mig-profile>: <quantity>
After uninstalling nos or disabling dynamic GPU partitioning on a certain node, these annotations are not removed. Consequently, the next time nos is installed on the cluster or dynamic GPU partitioning is enabled on the node, nos might apply the previous desired GPU partitioning state.
Uninstalling nos or disabling GPU partitioning on a node should clean up all the annotations previously set by nos.
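Until that happens, a manual cleanup sketch: kubectl removes an annotation when its key is suffixed with a dash. The keys below just follow the pattern above; substitute the actual index and profile values from your node.

kubectl annotate node <node> "nos.nebuly.com/spec-gpu-0-1g.10gb-"
kubectl annotate node <node> "nos.nebuly.com/status-gpu-0-1g.10gb-free-"
kubectl annotate node <node> "nos.nebuly.com/status-gpu-0-1g.10gb-used-"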
nos is currently broken on systems where GPU drivers are pre-installed on the host, for example on AKS. The symptom is the gpu-agent Pod not starting due to the missing /run/nvidia path on the host.
According to the NVIDIA DRA driver documentation, the /run/nvidia folder is provided via the driver container. When drivers are installed on the host instead of via the container, the path is missing and has to be symlinked to the host root manually:
Ensure your NVIDIA driver installation is rooted at /run/nvidia/driver.
For deployments running a driver container this is a noop. The driver container should already mount the driver installation at /run/nvidia/driver.
For deployments running with a host-installed driver, the following is sufficient to meet this requirement:
mkdir -p /run/nvidia
sudo ln -s / /run/nvidia/driver
NOTE: This is only currently necessary due to a limitation of how our CDI generation library works. This restriction will be removed very soon.
To implement support for host-installed drivers, we can simply mount the host's / as /run/nvidia/driver inside the gpu-agent container.
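A sketch of what that could look like in the gpu-agent Pod spec (field placement is illustrative, not taken from the actual nos chart):

volumes:
  - name: driver-root
    hostPath:
      path: /            # host-installed drivers are rooted at the host's /
      type: Directory
containers:
  - name: gpu-agent
    volumeMounts:
      - name: driver-root
        mountPath: /run/nvidia/driver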
I operate a regular DevOps platform for microservices that doesn't use or need GPUs. However, a GPU is a prerequisite.
The Elastic Resource Quota feature really caught my attention; it would be a great help.
Is there a way to use it without having to enable GPU support?
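For reference, this is roughly what a GPU-free quota would look like (the ElasticQuota schema here is assumed from the nos documentation; whether it works without the GPU components enabled is exactly the open question):

apiVersion: nos.nebuly.com/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-team-a
  namespace: team-a
spec:
  min:
    cpu: 2
    memory: 4Gi
  max:
    cpu: 10
    memory: 16Gi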
Starting Prometheus server on port 8000...
Running benchmark...
Downloading (…)cessor_config.json: 100%|██████████| 292/292 [00:00<00:00, 27.0kB/s]
Downloading (…)config.json: 100%|██████████| 4.13k/4.13k [00:00<00:00, 244kB/s]
Downloading (…)pytorch_model.bin: 100%|██████████| 123M/123M [11:58<00:00, 171kB/s]
The line Running inference... is never printed, so I assume there is some problem when the model is loaded onto the GPU. Here is the MPS server log:
==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:31:59.303 Other 138] Initializing server process
[2024-07-30 02:31:59.339 Server 138] Creating server context on device 0 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:31:59.401 Server 138] Creating server context on device 1 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:31:59.456 Server 138] Created named shared memory region /cuda.shm.3e8.8a.1
==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:31:59.456 Control 58] NEW SERVER 138: Ignoring connection from user
==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:31:59.456 Server 138] Active Threads Percentage set to 0.0
[2024-07-30 02:32:36.506 Server 138] Server Priority set to 0
[2024-07-30 02:32:36.506 Server 138] Server has started
[2024-07-30 02:32:36.506 Server 138] Destroy server context on device 0
[2024-07-30 02:32:36.545 Server 138] Destroy server context on device 1
==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.581 Control 58] Server 138 exited with status 0
[2024-07-30 02:32:36.581 Control 58] Starting new server 144 for user 1000
==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.601 Other 144] Startup
[2024-07-30 02:32:36.601 Other 144] Connecting to control daemon on socket: /tmp/nvidia-mps/control
==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.601 Control 58] Accepting connection...
==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.601 Other 144] Initializing server process
[2024-07-30 02:32:36.641 Server 144] Creating server context on device 0 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:32:36.704 Server 144] Creating server context on device 1 (NVIDIA GeForce RTX 2080 Ti)
[2024-07-30 02:32:36.768 Server 144] Created named shared memory region /cuda.shm.3e8.90.1
==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.768 Control 58] NEW SERVER 144: Ready
==> /tmp/nvidia-mps/server.log <==
[2024-07-30 02:32:36.768 Server 144] Active Threads Percentage set to 100.0
[2024-07-30 02:32:36.768 Server 144] Server Priority set to 0
[2024-07-30 02:32:36.768 Server 144] Server has started
[2024-07-30 02:32:36.768 Server 144] Received new client request
[2024-07-30 02:32:36.799 Server 144] Worker created
[2024-07-30 02:32:36.799 Server 144] Creating worker thread
[2024-07-30 02:32:36.799 Server 144] Waiting for current clients to finish
==> /tmp/nvidia-mps/control.log <==
[2024-07-30 02:32:36.847 Control 58] Accepting connection...
[2024-07-30 02:32:36.848 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:37:55.850 Control 58] Accepting connection...
[2024-07-30 02:37:55.850 Control 58] User did not send valid credentials
[2024-07-30 02:37:55.850 Control 58] Accepting connection...
[2024-07-30 02:37:55.851 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:41:25.952 Control 58] Accepting connection...
[2024-07-30 02:41:25.952 Control 58] User did not send valid credentials
[2024-07-30 02:41:25.952 Control 58] Accepting connection...
[2024-07-30 02:41:25.952 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:42:55.872 Control 58] Accepting connection...
[2024-07-30 02:42:55.872 Control 58] User did not send valid credentials
[2024-07-30 02:42:55.872 Control 58] Accepting connection...
[2024-07-30 02:42:55.872 Control 58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-07-30 02:49:23.964 Control 58] Accepting connection...
[2024-07-30 02:49:23.964 Control 58] User did not send valid credentials
[2024-07-30 02:49:23.964 Control 58] Accepting connection...
[2024-07-30 02:49:23.964 Control 58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
[2024-07-30 02:50:09.170 Control 58] Accepting connection...
[2024-07-30 02:50:09.247 Control 58] User did not send valid credentials
[2024-07-30 02:50:09.247 Control 58] Accepting connection...
[2024-07-30 02:50:09.247 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:51:05.370 Control 58] Accepting connection...
[2024-07-30 02:51:05.370 Control 58] User did not send valid credentials
[2024-07-30 02:51:05.370 Control 58] Accepting connection...
[2024-07-30 02:51:05.370 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:52:51.748 Control 58] Accepting connection...
[2024-07-30 02:52:51.749 Control 58] User did not send valid credentials
[2024-07-30 02:52:51.749 Control 58] Accepting connection...
[2024-07-30 02:52:51.749 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:54:55.658 Control 58] Accepting connection...
[2024-07-30 02:54:55.658 Control 58] User did not send valid credentials
[2024-07-30 02:54:55.658 Control 58] Accepting connection...
[2024-07-30 02:54:55.658 Control 58] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2024-07-30 02:57:06.983 Control 58] Accepting connection...
[2024-07-30 02:57:06.984 Control 58] User did not send valid credentials
[2024-07-30 02:57:06.984 Control 58] Accepting connection...
[2024-07-30 02:57:06.984 Control 58] NEW CLIENT 0 from user 0: Server is not ready, push client to pending list
Hi,
I am seeing the below error in the nebuly-nos-nebuly-nos-mig-agent Pod.
{"level":"info","ts":1678537855.2905262,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1678537855.2921324,"logger":"setup","msg":"Initializing NVML client"}
{"level":"info","ts":1678537855.2921576,"logger":"setup","msg":"Checking MIG-enabled GPUs"}
{"level":"info","ts":1678537855.450721,"logger":"setup","msg":"Cleaning up unused MIG resources"}
{"level":"error","ts":1678537855.5242505,"logger":"setup","msg":"unable to initialize agent","error":"[code: generic err: unable to get allocatable resources from Kubelet gRPC socket: rpc error: code = Unimplemented desc = unknown method GetAllocatableResources for service v1.PodResourcesLister]","stacktrace":"main.main\n\t/workspace/migagent.go:119\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
Please let me know what information is needed from my side; any pointers on why we get this error would also be greatly appreciated.
Thanks
On nodes with multiple GPUs with MIG mode enabled, if a GPU does not have any MIG resource then the device plugin fails to advertise GPU resources to k8s. When this happens, the GPU slices created by nos on the node's GPUs never become available in k8s.
This is due to the NVIDIA Device Plugin raising an error if a GPU has MIG mode enabled but no MIG device.
We can solve this by making nos initialize GPUs that have MIG mode enabled with an arbitrary MIG geometry.
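The manual equivalent, as a sketch (profile ID 19 is 1g.5gb on A100; adjust -i and -cgi for your hardware): creating any MIG device on the otherwise empty GPU gives the device plugin something to advertise.

sudo nvidia-smi mig -i 0 -cgi 19 -C    # create one GPU instance plus its default compute instance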
Hi, I wanted to use nos as an autoscaler too, scaling GPU nodes in and out within the cluster while using MPS. Since nos already watches resource requests and availability, it should be possible to add nodes to the cluster depending on the resources requested, leading to additional cost savings on top of higher GPU utilization.
Is this feature part of the roadmap? Or could someone familiar with nos help direct the best way to implement this within nos?
In my use case I often enable and disable nos on individual nodes by adding/removing the label nos.nebuly.com/gpu-partitioning=mps. After labeling the node, nos changes the GPU compute mode to exclusive. However, after removing the label, the GPU remains in exclusive mode.
Expected behavior: nos should revert the GPU mode to whatever it was when it started, or to the default.
Workaround: change back to default mode (or whatever mode you want) after removing the label. Do this for all GPUs (a loop over all GPUs is sketched below). For example, to change the mode on GPU 0 back to default, use the following:
nvidia-smi -i 0 -c 0
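To reset all GPUs in one go, a small loop sketch (assumes nvidia-smi supports the index query; run with root privileges):

for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
  nvidia-smi -i "$i" -c 0
done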
Unable to pull:
helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
  --version 0.1.2 \
  --namespace nebuly-nos \
  --generate-name \
  --create-namespace
OS : Ubuntu 22.04 LTS
Driver Version: 520.61.05
CUDA Version: 11.8
GPU: A100 * 2
nos installed the same as the official doc (only nos installed), using the MPS sharing method (20 GB of GPU RAM per Pod).
When a Pod is assigned to GPU 1, its GPU memory doesn't seem to be limited: it eats all 80 GB of GPU RAM (but everything works fine when the Pod is assigned to GPU 0), and I can't find out why.
(These two pictures were taken at different times)
(https://github.com/nebuly-ai/nos/assets/25812692/f25d6376-9eaa-4443-bbce-2a5437c72506)
Using gpu-operator (helm 23.9.1) and nos (helm 0.1.2).
I have an issue with nvidia.com/mig-7g.79gb: specifying it causes nos to create the MIG configuration as expected, but the resource seems to be advertised as nvidia.com/mig-7g.80gb, as shown in the log below from nvidia-device-plugin.
I0312 23:04:34.682199 1 server.go:165] Starting GRPC server for 'nvidia.com/mig-7g.80gb'
I0312 23:04:34.682673 1 server.go:117] Starting to serve 'nvidia.com/mig-7g.80gb' on /var/lib/kubelet/device-plugins/nvidia-mig-7g.80gb.sock
I0312 23:04:34.684745 1 server.go:125] Registered device plugin for 'nvidia.com/mig-7g.80gb' with Kubelet
Additionally, the labels created on the node look like this:
But the issue is that, because we specified nvidia.com/mig-7g.79gb, the Pod stays in Pending. Note the config below (all the other nvidia examples commented out below work, except 7g.79gb):
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-test-7g80g
spec:
  template:
    spec:
      runtimeClassName: nvidia
      restartPolicy: Never
      containers:
        - name: nvidia
          image: nvidia/cuda:12.3.2-devel-ubuntu22.04
          command: ["sleep", "12000"]
          resources:
            limits:
              nvidia.com/mig-7g.79gb: 1
              #nvidia.com/mig-1g.10gb: 1
              #nvidia.com/mig-2g.20gb: 1
              #nvidia.com/mig-4g.40gb: 1
I tried adding 7g.80gb to allowedGeometries, but it did not work as expected. I briefly looked at the code (see https://github.com/nebuly-ai/nos/blob/main/pkg/gpu/mig/known_configs.go#L93), so I'm not sure if I missed something or if there is a way to get the desired behavior.
MPS requires at least CUDA 11.5 (https://developer.nvidia.com/blog/revealing-new-features-in-the-cuda-11-5-toolkit); with a CUDA version lower than that, the MPS server hangs and does not serve any requests.
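A quick way to check the CUDA version supported by the installed driver (it is printed in the nvidia-smi header):

nvidia-smi | grep "CUDA Version"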
I deployed nos with the nebuly-nvidia device plugin in MPS partitioning mode.
Whenever I deploy a Deployment/Pods that require a change of GPU partitioning by the GPU partitioner, the nebuly-nvidia device plugin crashes.
I tried to follow what's happening, and this is my guess:
Here is the output of the logs of the nebuly-nvidia device plugin. You can see that at 13:05 I deployed a Deployment with a Pod requesting nvidia.com/gpu-2gb, which triggered a new partitioning and caused the crash:
kubectl logs pod/nvidia-device-plugin-1722514861-rrdhz -n nebuly-nvidia --follow
Defaulted container "nvidia-device-plugin-sidecar" out of: nvidia-device-plugin-sidecar, nvidia-mps-server, nvidia-device-plugin-ctr, set-compute-mode (init), set-nvidia-mps-volume-permissions (init), nvidia-device-plugin-init (init)
W0801 13:02:37.159120 270 client_config.go:608] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:02:37Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Updating to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Successfully updated to config: vm125-1722517133"
time="2024-08-01T13:02:37Z" level=info msg="Sending signal 'hangup' to 'nvidia-device-plugin'"
time="2024-08-01T13:02:37Z" level=info msg="Successfully sent signal"
time="2024-08-01T13:02:37Z" level=info msg="Waiting for change to 'nvidia.com/device-plugin.config' label"
time="2024-08-01T13:05:02Z" level=info msg="Label change detected: nvidia.com/device-plugin.config=vm125-1722517497"
time="2024-08-01T13:05:02Z" level=info msg="Error: specified config vm125-1722517497 does not exist"
I mean, it is still working, but with this it always takes 5 minutes for my Pods to start when the partitioning changes :(