Comments (5)
Thanks for digging that up. I don't quite recall why I had that nil
check present there. Looking at the existing code, it obviously doesn't make sense.
I have created NVIDIA/go-gpuallocator#23 to address this in go-gpuallocator.
from k8s-device-plugin.
Looking at the Filter() call that is returning all devices when required is empty, this should only happen if required is nil:
// Filter filters out the selected devices from the list.
// If the supplied list of uuids is nil, no filtering is performed.
// Note that the specified uuids must exist in the list of devices.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
	if uuids == nil {
		return d, nil
	}
	filtered := []*Device{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}
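The distinction matters because Go treats a nil slice and an empty, non-nil slice differently under a `== nil` comparison, even though both have length zero. A minimal sketch, using simplified stand-in `Device` and `DeviceList` types (assumptions for illustration, not the real go-gpuallocator definitions), shows how the nil check above handles the two cases:

```go
package main

import "fmt"

// Device and DeviceList are simplified stand-ins for the go-gpuallocator types.
type Device struct{ UUID string }

type DeviceList []*Device

// Filter mirrors the logic quoted above, including the nil check.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
	if uuids == nil {
		return d, nil
	}
	filtered := []*Device{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	devices := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}

	all, _ := devices.Filter(nil)         // nil: no filtering, all devices returned
	none, _ := devices.Filter([]string{}) // empty but non-nil: nothing is selected

	fmt.Println(len(all), len(none)) // prints "2 0"
}
```

With this version, only a nil argument returns every device; an empty non-nil slice yields an empty result.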
Why is required == nil in this case?
Looking at the type definition in k8s:
type ContainerPreferredAllocationRequest struct {
	// List of available deviceIDs from which to choose a preferred allocation
	AvailableDeviceIDs []string `protobuf:"bytes,1,rep,name=available_deviceIDs,json=availableDeviceIDs,proto3" json:"available_deviceIDs,omitempty"`
	// List of deviceIDs that must be included in the preferred allocation
	MustIncludeDeviceIDs []string `protobuf:"bytes,2,rep,name=must_include_deviceIDs,json=mustIncludeDeviceIDs,proto3" json:"must_include_deviceIDs,omitempty"`
	// Number of devices to include in the preferred allocation
	AllocationSize       int32    `protobuf:"varint,3,opt,name=allocation_size,json=allocationSize,proto3" json:"allocation_size,omitempty"`
	XXX_NoUnkeyedLiteral struct{} `json:"-"`
	XXX_sizecache        int32    `json:"-"`
}
We see that the json encoding includes omitempty. Does this affect the protobuf?
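For context, the `json` struct tag is only read by `encoding/json`; the protobuf wire encoding is driven by the `protobuf` tag. Separately, the proto3 wire format records no presence for repeated fields, so an empty list is simply not transmitted and the receiver will typically see a nil slice either way. The sketch below uses only `encoding/json` with a hypothetical `request` struct as an analogy for that round trip; it does not exercise the protobuf library itself:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// request is a hypothetical stand-in that reuses the json tag style shown above.
type request struct {
	MustIncludeDeviceIDs []string `json:"must_include_deviceIDs,omitempty"`
}

// roundTrip encodes the request to JSON and decodes it into a fresh value.
func roundTrip(in request) (string, request, error) {
	data, err := json.Marshal(in)
	if err != nil {
		return "", request{}, err
	}
	var out request
	err = json.Unmarshal(data, &out)
	return string(data), out, err
}

func main() {
	// Start from an empty, non-nil slice.
	encoded, decoded, err := roundTrip(request{MustIncludeDeviceIDs: []string{}})
	if err != nil {
		panic(err)
	}
	// The field is omitted on encode, so it comes back as nil on decode.
	fmt.Println(encoded, decoded.MustIncludeDeviceIDs == nil) // prints "{} true"
}
```

The point of the analogy is that a receiver cannot rely on distinguishing "empty" from "absent" once the value has crossed an encoding boundary that drops empty lists.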
@klueska can you think of any reason why we should treat nil and []string{} differently from the perspective of the device plugin?
Assuming we could consider required == nil equivalent to required == []string{}, we could apply the following diff:
diff --git a/internal/rm/nvml_manager.go b/internal/rm/nvml_manager.go
index 56f05429..a00d41cf 100644
--- a/internal/rm/nvml_manager.go
+++ b/internal/rm/nvml_manager.go
@@ -73,6 +73,9 @@ func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]Res
 // GetPreferredAllocation runs an allocation algorithm over the inputs.
 // The algorithm chosen is based both on the incoming set of available devices and various config settings.
 func (r *nvmlResourceManager) GetPreferredAllocation(available, required []string, size int) ([]string, error) {
+	if required == nil {
+		required = []string{}
+	}
 	return r.getPreferredAllocation(available, required, size)
 }
to address this issue?
Would you be able to confirm this?
It looks like the Filter() function was added as part of:
https://github.com/NVIDIA/go-gpuallocator/pull/13/files#diff-7a10395c66058f91191f5b9ac49321a6e95a8332ea4098d3438e0e19b2b02fdeR151
The old logic that had this loop was:
+// Create a list of Devices from the specific set of GPU uuids passed in.
+func NewDevicesFrom(uuids []string) ([]*Device, error) {
+	devices, err := NewDevices()
+	if err != nil {
+		return nil, err
+	}
+
+	filtered := []*Device{}
+	for _, uuid := range uuids {
+		for _, device := range devices {
+			if device.UUID == uuid {
+				filtered = append(filtered, device)
+				break
+			}
+		}
+		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
+			return nil, fmt.Errorf("no device with uuid: %v", uuid)
+		}
+	}
+
+	return filtered, nil
+}
In general, it's completely reasonable for required to be nil (in fact this is the common case), because the kubelet usually doesn't want to decide in advance which GPUs must be included in the returned list.
That said, I think we could safely change:
	if uuids == nil {
		return d, nil
	}
to
	if len(uuids) == 0 {
		return d, nil
	}
without any issues.
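A sketch of Filter with that change applied, assuming simplified stand-in `Device` and `DeviceList` types rather than the real go-gpuallocator definitions. Because `len` of a nil slice is 0, the single length check covers both the nil and the empty case:

```go
package main

import "fmt"

// Device and DeviceList are simplified stand-ins for the go-gpuallocator types.
type Device struct{ UUID string }

type DeviceList []*Device

// Filter returns the devices matching uuids, in the order the uuids are given.
// A nil or empty uuids slice means "no filtering" and returns the full list.
func (d DeviceList) Filter(uuids []string) (DeviceList, error) {
	if len(uuids) == 0 {
		return d, nil
	}
	filtered := []*Device{}
	for _, uuid := range uuids {
		for _, device := range d {
			if device.UUID == uuid {
				filtered = append(filtered, device)
				break
			}
		}
		// If the last appended device is not this uuid, the uuid was not found.
		if len(filtered) == 0 || filtered[len(filtered)-1].UUID != uuid {
			return nil, fmt.Errorf("no device with uuid: %v", uuid)
		}
	}
	return filtered, nil
}

func main() {
	devices := DeviceList{{UUID: "GPU-0"}, {UUID: "GPU-1"}}

	fromNil, _ := devices.Filter(nil)
	fromEmpty, _ := devices.Filter([]string{})

	fmt.Println(len(fromNil), len(fromEmpty)) // prints "2 2"
}
```

With the length check in place, callers no longer need to normalize nil to an empty slice (or the reverse) before calling Filter.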