Comments (10)
A reboot is only necessary on CSPs with passthrough virtualization but no GPU reset capabilities (e.g. GCP). Setting the envvar WITH_REBOOT=true on the k8s-mig-manager container should give you the behavior you want.
from mig-parted.
It does not work. I have tried setting the envvar WITH_REBOOT=true, but the code does not seem to be right: https://github.com/NVIDIA/mig-parted/blob/main/deployments/gpu-operator/reconfigure-mig.sh#L469. Currently, with WITH_REBOOT=true the node is rebooted only if this apply command fails. But the command succeeds, so we need a manual reboot.
These are the logs. You can see that it fails in the end: the code only checks whether the first apply (the MIG mode change) fails, and that step passes. The final MIG config apply then fails, and the node needs a reboot to recover.
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-04-25T05:42:13-11:00" level=debug msg="Parsing config file..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Selecting specific MIG config..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Running apply-start hook"
time="2022-04-25T05:42:13-11:00" level=debug msg="Checking current MIG mode..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:13-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-04-25T05:42:13-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:13-11:00" level=debug msg="Running pre-apply-mode hook"
time="2022-04-25T05:42:13-11:00" level=debug msg="Applying MIG mode change..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:13-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:13-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Updating MIG mode: Enabled"
time="2022-04-25T05:42:16-11:00" level=debug msg=" Mode change pending: true"
time="2022-04-25T05:42:16-11:00" level=debug msg="At least one mode change pending"
time="2022-04-25T05:42:16-11:00" level=debug msg="Resetting GPUs..."
time="2022-04-25T05:42:16-11:00" level=debug msg=" NVIDIA kernel module loaded"
time="2022-04-25T05:42:16-11:00" level=debug msg=" Using nvidia-smi to perform GPU reset"
time="2022-04-25T05:42:20-11:00" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-04-25T05:42:20-11:00" level=debug msg="Parsing config file..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Selecting specific MIG config..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Running apply-start hook"
time="2022-04-25T05:42:20-11:00" level=debug msg="Checking current MIG mode..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:20-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-04-25T05:42:20-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:20-11:00" level=debug msg="Running pre-apply-mode hook"
time="2022-04-25T05:42:20-11:00" level=debug msg="Applying MIG mode change..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:20-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:20-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Updating MIG mode: Enabled"
time="2022-04-25T05:42:23-11:00" level=debug msg=" Mode change pending: true"
time="2022-04-25T05:42:23-11:00" level=debug msg="At least one mode change pending"
time="2022-04-25T05:42:23-11:00" level=debug msg="Resetting GPUs..."
time="2022-04-25T05:42:23-11:00" level=debug msg=" NVIDIA kernel module loaded"
time="2022-04-25T05:42:23-11:00" level=debug msg=" Using nvidia-smi to perform GPU reset"
time="2022-04-25T05:42:26-11:00" level=debug msg="Checking current MIG device configuration..."
time="2022-04-25T05:42:26-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:26-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:26-11:00" level=debug msg="Running pre-apply-config hook"
time="2022-04-25T05:42:26-11:00" level=debug msg="Applying MIG device configuration..."
time="2022-04-25T05:42:26-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:26-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:26-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:26-11:00" level=debug msg="Running apply-exit hook"
time="2022-04-25T05:42:26-11:00" level=fatal msg="Unable to apply MIG config with MIG mode disabled"
Restarting any GPU clients previously shutdown on the host by restarting their systemd services
Starting kubelet.service
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/worker-1 unlabeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/worker-1 unlabeled
time="2022-04-25T16:42:26Z" level=error msg="Error: exit status 1"
time="2022-04-25T16:42:26Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
from mig-parted.
The above issue is seen when both enabling and disabling MIG. A node reboot is needed to proceed further.
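The pattern in the logs above (a GPU reset is reported, yet the next run still sees "Current MIG mode: Disabled" while asserting "Enabled") can be spotted mechanically. The following is a minimal Python sketch for triaging captured logs, not part of mig-parted. The function name `mismatch_after_reset` is hypothetical, and it assumes the log lines keep the `msg="..."` format shown above.

```python
import re

# Extracts the msg="..." field from a logrus-style line like those above.
MSG_RE = re.compile(r'msg="(.*)"\s*$')


def mismatch_after_reset(log_lines):
    """Return (asserted, current) for the first MIG mode mismatch seen
    after a GPU reset was logged, or None if there is no such mismatch.

    A mismatch before any reset is expected (that is why a mode change
    is applied), so only mismatches after "Resetting GPUs..." count.
    """
    reset_seen = False
    asserted = None
    for line in log_lines:
        match = MSG_RE.search(line)
        if not match:
            continue  # plain echo lines from the script, not logrus output
        # Some messages end with a literal backslash-n; drop it.
        msg = match.group(1).replace("\\n", "").strip()
        if msg.startswith("Resetting GPUs"):
            reset_seen = True
        elif msg.startswith("Asserting MIG mode:"):
            asserted = msg.split(":", 1)[1].strip()
        elif msg.startswith("Current MIG mode:") and asserted is not None:
            current = msg.split(":", 1)[1].strip()
            if reset_seen and current != asserted:
                return (asserted, current)
            asserted = None  # only compare the first reading per assertion
    return None
```

Fed the log lines from this comment, a non-None result would indicate that the reset did not take effect even though the first apply step reported success.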
from mig-parted.
If the apply command is succeeding, then (for whatever reason) it thinks the GPU reset completed successfully when it didn't actually reset the GPU.
What environment are you running this in? As I mentioned before, a reboot is only necessary if the reset fails, which in your setup appears to be failing silently and returning a successful exit code.
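To make the distinction concrete, here is a small Python sketch of the two decision rules being discussed: rebooting only when the apply command exits non-zero, versus also rebooting when the observed MIG mode does not match the requested one after a reportedly successful reset. This is an illustration only and not the logic of reconfigure-mig.sh. Both function names and all parameters are hypothetical, and how the observed mode is obtained is left out.

```python
def reboot_on_exit_code_only(apply_exit_code, with_reboot):
    """Rule described in this thread: reboot only if the apply step fails."""
    return with_reboot and apply_exit_code != 0


def reboot_with_mode_check(apply_exit_code, desired_mode, observed_mode, with_reboot):
    """Stricter rule (an assumption, not the project's code): also reboot
    when apply reported success but the GPU is not in the desired mode."""
    if not with_reboot:
        return False
    if apply_exit_code != 0:
        return True
    return observed_mode != desired_mode


# Scenario from the logs: apply exits 0, but the mode is still Disabled.
print(reboot_on_exit_code_only(0, True))
print(reboot_with_mode_check(0, "Enabled", "Disabled", True))
```

Under the first rule a silent reset failure never triggers a reboot, which matches the behavior reported above. The second rule catches it because it does not rely on the exit code alone.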
from mig-parted.
HX 3.0 VMware vCenter environment.
from mig-parted.
I have a vague memory of this coming up in vSphere before, and there were some incorrect VBIOS settings or something.
@shivamerla, @supertetelman, @kpouget
Does this ring a bell for you?
from mig-parted.
I do recall seeing this as some sort of misconfiguration, but I thought that had been addressed in updated documentation. I will dig around and try to find an answer.
@dogra-gopal, are you using GPU passthrough in this environment or is this vGPU? If you are using vGPU, MIG configuration changes need to be done through VMware and cannot be controlled through the mig-parted tool.
from mig-parted.
@supertetelman I am using GPU passthrough.
from mig-parted.
Verified that this issue is fixed with the latest version, v0.4.2, so closing this. Thanks @klueska
from mig-parted.
Yes, we added this to work around this issue:
e761afb
It's still unclear why the GPU reset silently fails, but at least we are able to get past it with this workaround.
from mig-parted.