
Comments (11)

klueska commented on June 11, 2024

How did you set up MPS?


ysz-github commented on June 11, 2024

How did you set up MPS?

I haven't configured MPS in the YAML; I just requested GPU resources the same way as in time-slicing mode. How should I set it up? Thank you!


ysz-github commented on June 11, 2024

How did you set up MPS?

The settings to enable CUDA MPS are as follows:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
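
For reference, a minimal pod spec that requests one of these shared replicas might look like the sketch below; the pod, container, and image names are placeholders, and each nvidia.com/gpu request maps to one of the 10 MPS replicas configured above.

apiVersion: v1
kind: Pod
metadata:
  name: mps-test                  # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app                # placeholder name
    image: my-cuda-image:latest   # placeholder image running a CUDA workload
    resources:
      limits:
        nvidia.com/gpu: 1         # one of the 10 MPS replicas configured above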


elezar commented on June 11, 2024

@ysz-github do you have an example application / podspec that you're using to confirm this?

Could you also please confirm your driver version? We are investigating an issue where setting the device memory limits by UUID is not having the desired effect.


aphrodite1028 commented on June 11, 2024

I have the same issue using MPS with a Docker CUDA process; the driver is 535.129.03 and the nvdp (k8s-device-plugin) version is 0.15.0-rc1.


elezar commented on June 11, 2024

There is a known issue with 0.15.0-rc.1 where memory limits were not correctly applied. This will be addressed in v0.15.0-rc.2 which we will release soon.


aphrodite1028 commented on June 11, 2024

There is a known issue with 0.15.0-rc.1 where memory limits were not correctly applied. This will be addressed in v0.15.0-rc.2 which we will release soon.

OK, understood. Thanks for your reply!


elezar commented on June 11, 2024

@aphrodite1028 @ysz-github we have just released https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0-rc.2 which should address this issue. Please let us know if you're still experiencing problems.


aphrodite1028 commented on June 11, 2024

@aphrodite1028 @ysz-github we have just released https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0-rc.2 which should address this issue. Please let us know if you're still experiencing problems.

I found https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L77-L85.

If I do not set the CUDA_VISIBLE_DEVICES env var and start nvidia-cuda-mps-control -d and nvidia-cuda-mps-control, then limiting device memory fails and nvidia-cuda-mps-server is not found in the container. If I set it up again myself, ignoring the mps-control-daemon DaemonSet config, it succeeds on the host machine but hits a segmentation fault in the container.

How do I set the device memory limit for a client in the container?

The driver version is 535.129.03 and the GPU is an RTX A6000.

Also, when I deploy with Helm in k8s, I get an error like "linux mounts: path /run/nvidia/mps is mounted on /run but it is not a shared mount" when mountPropagation is set:

        volumeMounts:
        - mountPath: /mps
          mountPropagation: Bidirectional
          name: mps-root
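
For context, mountPropagation: Bidirectional requires that the host path's parent mount (/run here, per the error message) be a shared mount on the node, which is what the "not a shared mount" error is complaining about. A sketch of the host-path volume this volumeMount presumably pairs with, assuming the /run/nvidia/mps path from the error message:

      volumes:
      - name: mps-root              # matches the volumeMount name above
        hostPath:
          path: /run/nvidia/mps     # assumed from the error message; lives under /run on the host
          type: DirectoryOrCreate

Making /run a shared (rshared) mount on the host before the DaemonSet starts is the usual prerequisite for this kind of bidirectional mount.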


klueska commented on June 11, 2024

@aphrodite1028 You shouldn't need to do anything special in your user container. The system starts the MPS server for all GPUs on the machine, and your client will be forced to make use of it.

These lines set the upper limit on the pinned device memory and thread percentage consumable by the client.
https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L111-L122

You can manually adjust the pinned memory limit and thread percentage to something smaller than this using the envvars when you start your container (but you can't set it to something larger).
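
For illustration, a client container could lower those limits itself via the standard CUDA MPS client environment variables (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE). A minimal sketch with placeholder names and arbitrary values, which must not exceed the defaults set by the control daemon:

    containers:
    - name: cuda-app                # placeholder name
      image: my-cuda-image:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
      - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
        value: "0=2G"               # cap pinned device memory on device 0 at 2 GiB
      - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
        value: "10"                 # cap the active thread percentage at 10%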


aphrodite1028 commented on June 11, 2024

@aphrodite1028 You shouldn't need to do anything special in your user container. The system starts the MPS server for all GPUs on the machine, and your client will be forced to make use of it.

These lines set the upper limit on the pinned device memory and thread percentage consumable by the client. https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L111-L122

You can manually adjust the pinned memory limit and thread percentage to something smaller than this using the envvars when you start your container (but you can't set it to something larger).

Thanks for your reply.

Does the MPS pinned device memory limit require a particular driver version? Looking at man nvidia-cuda-mps-control on driver 470, I could not find the set_default_device_pinned_mem_limit command.

