
Comments (10)

klueska commented on August 20, 2024

A reboot is only necessary on CSPs with passthrough virtualization but no GPU reset capabilities (e.g. GCP). Setting the envvar WITH_REBOOT=true on the k8s-mig-manager container should give you the behavior you want.

from mig-parted.

dogra-gopal commented on August 20, 2024

It does not work. I have tried setting the envvar WITH_REBOOT=true, and the code does not seem to be right: https://github.com/NVIDIA/mig-parted/blob/main/deployments/gpu-operator/reconfigure-mig.sh#L469. Currently, with WITH_REBOOT=true the node is rebooted only if this apply command fails. But it succeeds, so a manual reboot is needed.
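The control flow described here can be sketched roughly as follows. This is a simplified illustration, not the actual code of reconfigure-mig.sh: `apply_mode` and `reboot_node` are hypothetical stand-ins, and the stub deliberately reports success without changing anything, to mimic a GPU reset that fails silently.

```shell
#!/usr/bin/env bash
# Simplified sketch of the reported behavior; not the real reconfigure-mig.sh.
# apply_mode and reboot_node are hypothetical stand-ins for illustration.

WITH_REBOOT=true
REBOOTED=false
MIG_MODE=Disabled   # simulated GPU state

apply_mode() {
    # Simulates a reset that fails silently: the mode is left unchanged,
    # yet the command still reports success.
    return 0
}

reboot_node() {
    # Stand-in for the real reboot; only records that it was requested.
    REBOOTED=true
}

# The reboot is tied to the exit status of the apply step only.
if ! apply_mode && [ "${WITH_REBOOT}" = "true" ]; then
    reboot_node
fi

echo "mode=${MIG_MODE} rebooted=${REBOOTED}"
```

Because the apply step exits with status 0, the reboot branch is never taken even though the mode is still Disabled.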

These are the logs. You can see that it fails in the end: the code only checks whether the first apply (the MIG mode change) fails, which passes, but the final apply of the selected MIG config fails, and the node needs a reboot to recover.

Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-04-25T05:42:13-11:00" level=debug msg="Parsing config file..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Selecting specific MIG config..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Running apply-start hook"
time="2022-04-25T05:42:13-11:00" level=debug msg="Checking current MIG mode..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:13-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-04-25T05:42:13-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:13-11:00" level=debug msg="Running pre-apply-mode hook"
time="2022-04-25T05:42:13-11:00" level=debug msg="Applying MIG mode change..."
time="2022-04-25T05:42:13-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:13-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:13-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:13-11:00" level=debug msg=" Updating MIG mode: Enabled"
time="2022-04-25T05:42:16-11:00" level=debug msg=" Mode change pending: true"
time="2022-04-25T05:42:16-11:00" level=debug msg="At least one mode change pending"
time="2022-04-25T05:42:16-11:00" level=debug msg="Resetting GPUs..."
time="2022-04-25T05:42:16-11:00" level=debug msg=" NVIDIA kernel module loaded"
time="2022-04-25T05:42:16-11:00" level=debug msg=" Using nvidia-smi to perform GPU reset"
time="2022-04-25T05:42:20-11:00" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-04-25T05:42:20-11:00" level=debug msg="Parsing config file..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Selecting specific MIG config..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Running apply-start hook"
time="2022-04-25T05:42:20-11:00" level=debug msg="Checking current MIG mode..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:20-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Asserting MIG mode: Enabled"
time="2022-04-25T05:42:20-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:20-11:00" level=debug msg="Running pre-apply-mode hook"
time="2022-04-25T05:42:20-11:00" level=debug msg="Applying MIG mode change..."
time="2022-04-25T05:42:20-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:20-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:20-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Current MIG mode: Disabled"
time="2022-04-25T05:42:20-11:00" level=debug msg=" Updating MIG mode: Enabled"
time="2022-04-25T05:42:23-11:00" level=debug msg=" Mode change pending: true"
time="2022-04-25T05:42:23-11:00" level=debug msg="At least one mode change pending"
time="2022-04-25T05:42:23-11:00" level=debug msg="Resetting GPUs..."
time="2022-04-25T05:42:23-11:00" level=debug msg=" NVIDIA kernel module loaded"
time="2022-04-25T05:42:23-11:00" level=debug msg=" Using nvidia-smi to perform GPU reset"
time="2022-04-25T05:42:26-11:00" level=debug msg="Checking current MIG device configuration..."
time="2022-04-25T05:42:26-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:26-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:26-11:00" level=debug msg="Running pre-apply-config hook"
time="2022-04-25T05:42:26-11:00" level=debug msg="Applying MIG device configuration..."
time="2022-04-25T05:42:26-11:00" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-04-25T05:42:26-11:00" level=debug msg=" GPU 0: 0x20B510DE"
time="2022-04-25T05:42:26-11:00" level=debug msg=" MIG capable: true\n"
time="2022-04-25T05:42:26-11:00" level=debug msg="Running apply-exit hook"
time="2022-04-25T05:42:26-11:00" level=fatal msg="Unable to apply MIG config with MIG mode disabled"
Restarting any GPU clients previously shutdown on the host by restarting their systemd services
Starting kubelet.service
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/worker-1 unlabeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/worker-1 unlabeled
time="2022-04-25T16:42:26Z" level=error msg="Error: exit status 1"
time="2022-04-25T16:42:26Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

dogra-gopal commented on August 20, 2024

The above issue is seen when both enabling and disabling MIG. A node reboot is needed to proceed further.

klueska commented on August 20, 2024

If the apply command is succeeding, then (for whatever reason) it thinks the GPU reset completed successfully when it didn’t actually reset the GPU.

What environment are you running this in? As I mentioned before, a reboot is only necessary if the reset fails, which in your setup appears to be failing silently and returning a successful exit code.
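One way to check for a reset that silently failed is to compare the current and pending MIG modes afterwards. The sketch below assumes the `mig.mode.current` and `mig.mode.pending` query fields of nvidia-smi; `mode_change_pending` is a hypothetical helper, not part of mig-parted, and the sample line is made up to match the state shown in the logs above.

```shell
#!/usr/bin/env bash
# Sketch: detect a MIG mode change that did not take effect after a reset.
# mode_change_pending is a hypothetical helper, not part of mig-parted.

# Takes one line of "current, pending", in the form assumed to be printed by:
#   nvidia-smi --query-gpu=mig.mode.current,mig.mode.pending --format=csv,noheader
# Returns 0 if the pending mode differs from the current one.
mode_change_pending() {
    current="${1%%,*}"        # text before the first comma
    pending="${1#*,}"         # text after the first comma
    pending="${pending# }"    # drop the space that follows the comma
    [ "${current}" != "${pending}" ]
}

# Sample line only; on a real node this would come from nvidia-smi.
line="Disabled, Enabled"

if mode_change_pending "${line}"; then
    echo "MIG mode change still pending: reset did not take effect"
else
    echo "MIG mode is up to date"
fi
```

A mismatch after a reset that reported success would point to the kind of silent failure discussed here.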

dogra-gopal commented on August 20, 2024

HX 3.0 VMware vCenter environment.

klueska commented on August 20, 2024

I have a vague memory of this coming up in vSphere before, and there were some incorrect VBIOS settings or something.

@shivamerla, @supertetelman, @kpouget
Does this ring a bell for you?

supertetelman commented on August 20, 2024

I do recall seeing this as some sort of misconfiguration, but I thought that had been addressed in updated documentation. I will dig around and try to find an answer.

@dogra-gopal, are you using GPU passthrough in this environment or is this vGPU? If you are using vGPU, MIG configuration changes need to be done through VMware and cannot be controlled through the mig-parted tool.

dogra-gopal commented on August 20, 2024

@supertetelman I am using GPU passthrough.

dogra-gopal commented on August 20, 2024

Verified that this issue is fixed with the latest version, v0.4.2, so closing this. Thanks @klueska.

klueska commented on August 20, 2024

Yes, we added this to work around this issue:
e761afb

It's still unclear why the GPU reset silently fails, but at least we are able to get past it with this workaround.
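A general way to get past a reset that reports success without taking effect is to verify the mode after applying it, and to treat a failed verification as a reason to reboot. The sketch below illustrates that idea only. It is an assumption about the general approach, not a reproduction of the commit referenced above, and `apply_mode`, `assert_mode` and `reboot_node` are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch of "verify after apply"; the stubs simulate a silently failing reset.
# apply_mode, assert_mode and reboot_node are hypothetical stand-ins. A real
# script might call `nvidia-mig-parted apply --mode-only` and
# `nvidia-mig-parted assert --mode-only` in their place (an assumption).

WITH_REBOOT=true
REBOOTED=false
DESIRED_MODE=Enabled
MIG_MODE=Disabled   # simulated GPU state

# Reports success but changes nothing.
apply_mode() { return 0; }

# Checks the actual (simulated) state against the desired one.
assert_mode() { [ "${MIG_MODE}" = "${DESIRED_MODE}" ]; }

# Simulated reboot: in the thread, a reboot is what brings the node
# into the desired state.
reboot_node() { REBOOTED=true; MIG_MODE="${DESIRED_MODE}"; }

# Reboot if either the apply or the follow-up check fails.
if ! { apply_mode && assert_mode; } && [ "${WITH_REBOOT}" = "true" ]; then
    reboot_node
fi

echo "mode=${MIG_MODE} rebooted=${REBOOTED}"
```

Here the apply step still exits with status 0, but the follow-up check fails, so the reboot branch is taken.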
