
k8s-rdma-device-plugin's Introduction

RDMA device plugin for Kubernetes

Introduction

k8s-rdma-device-plugin is a Kubernetes device plugin for managing RDMA devices.

RDMA (remote direct memory access) is a high-performance network protocol with the following major advantages:

  • Zero-copy

    Applications can transfer data without involving the network software stack; data is sent and received directly to and from application buffers without being copied between network layers.

  • Kernel bypass

    Applications can perform data transfer directly from userspace without the need to perform context switches.

  • No CPU involvement

    Applications can access remote memory without consuming any CPU on the remote machine. The remote machine's memory is read without any intervention from a remote process (or processor), and the caches of the remote CPU(s) are not filled with the accessed memory content.

You can read this post to get more information about RDMA.

This plugin allows you to use RDMA devices in containers in a Kubernetes cluster. It can also work with sriov-cni to provide high-performance network connections for distributed applications, especially GPU-based distributed applications such as TensorFlow and Spark.

Quick Start

Build

Install the libibverbs package. For CentOS:

# yum install libibverbs-devel -y

Then run build:

# ./build 
# ls bin
k8s-rdma-device-plugin

Work with Kubernetes

  • Preparing RDMA node

Install the ibverbs libraries, then start kubelet with --feature-gates=DevicePlugins=true.

  • Run device plugin daemon process

# bin/k8s-rdma-device-plugin -master eth1 -v 4
INFO[0000] Fetching devices.                            
DEBU[0000] RDMA device list: [{{mlx4_1 uverbs1 /sys/class/infiniband_verbs/uverbs1 /sys/class/infiniband/mlx4_1} eth2} {{mlx4_3 uverbs3 /sys/class/infiniband_verbs/uverbs3 /sys/class/infiniband/mlx4_3} eth4} {{mlx4_2 uverbs2 /sys/class/infiniband_verbs/uverbs2 /sys/class/infiniband/mlx4_2} eth3} {{mlx4_4 uverbs4 /sys/class/infiniband_verbs/uverbs4 /sys/class/infiniband/mlx4_4} eth5}] 
INFO[0000] Starting FS watcher.                         
INFO[0000] Starting OS watcher.                         
INFO[0000] Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock 
INFO[0000] Registered device plugin with Kubelet
...

Or deploy it as a DaemonSet:

# kubectl -n kube-system apply -f rdma-device-plugin.yml
# kubectl -n kube-system get pods
rdma-device-plugin-daemonset-2wbdv         1/1       Running   0          14m
rdma-device-plugin-daemonset-7pwf7         1/1       Running   0          14m

  • Run RDMA container

apiVersion: v1
kind: Pod
metadata:
  name: rdma-pod
spec:
  containers:
    - name: rdma-container
      image: mellanox/mofed421_docker:noop
      securityContext:
        capabilities:
          add: ["ALL"]
      resources:
        limits:
          tencent.com/rdma: 1 # requesting 1 RDMA device

Dockerfile for mellanox/mofed421_docker:noop:

FROM mellanox/mofed421_docker:latest

CMD ["/bin/sleep", "360000"]

TODO

Share RDMA device for the containers in the same pod

Generally speaking, for RoCE with Kubernetes, all containers in the same pod should share the same RDMA devices. Kubernetes does not currently support this.

Work with sriov-cni plugin

Kubernetes calls the device plugin (DP) when admitting a pod, and calls the CNI plugin when creating the sandbox container. We need a way to pass RDMA device information from the DP to the CNI plugin. Refer to issue 32.

Work with NVIDIA GPU plugin

For high performance, k8s-rdma-device-plugin and the NVIDIA device plugin should be coordinated so that the RDMA devices and GPU devices allocated to the same container are located under the same PCIe switch.
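
One way to reason about "under the same PCIe switch" is to compare the resolved sysfs paths of the devices: the more leading path components two devices share, the closer they sit in the PCIe hierarchy. The sketch below is only a heuristic illustration, not something the plugin implements. The helper names sharedPCIPath and closestGPU and the example paths are hypothetical; on a real node the paths would come from resolving symlinks such as /sys/class/infiniband/<dev>/device.

```go
package main

import (
	"fmt"
	"strings"
)

// sharedPCIPath returns the leading path components that two resolved
// sysfs device paths have in common. (Hypothetical helper.)
func sharedPCIPath(a, b string) []string {
	as := strings.Split(strings.Trim(a, "/"), "/")
	bs := strings.Split(strings.Trim(b, "/"), "/")
	var shared []string
	for i := 0; i < len(as) && i < len(bs); i++ {
		if as[i] != bs[i] {
			break
		}
		shared = append(shared, as[i])
	}
	return shared
}

// closestGPU returns the GPU path that shares the longest prefix with the
// RDMA device path, or "" if gpuPaths is empty. (Hypothetical helper.)
func closestGPU(rdmaPath string, gpuPaths []string) string {
	best, bestLen := "", -1
	for _, g := range gpuPaths {
		if n := len(sharedPCIPath(rdmaPath, g)); n > bestLen {
			best, bestLen = g, n
		}
	}
	return best
}

func main() {
	// Made-up example paths; real ones depend on the node's topology.
	rdma := "/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:08.0/0000:83:00.0"
	gpus := []string{
		"/sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0",
		"/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:10.0/0000:84:00.0",
	}
	fmt.Println(closestGPU(rdma, gpus))
}
```

A longest-common-prefix comparison does not prove that two devices hang off the same switch; it only ranks candidates, so a real scheduler would still need the two plugins to exchange this information.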

k8s-rdma-device-plugin's People

Contributors

carmark, hustcat, iwita, panpan0000, zhouzijiang

k8s-rdma-device-plugin's Issues

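Question about integrating k8s RDMA with a Ceph IB network

@carmark
Hello, I just watched your live stream on Xiaoxiang (小象) and happened to find you in the community. I would like to ask a few questions about integrating k8s RDMA with a Ceph IB network.

What I know so far (corrections are welcome if I have misunderstood anything):

  1. An rdma device plugin is needed to map the IB NIC into the container (the same principle as the nvidia device plugin)
  2. With the container network, docker can only map one NIC (either IB or Ethernet), so hostNetwork is the only option
  3. To use RDMA, both the K8S Node Network and the Ceph Public Network must have, or be, an IB network (leaving other RDMA/RoCE options aside)

If all of the above holds, there are a few problems (corrections are welcome if I have misunderstood anything):

  1. If a K8S Pod uses the container network, wouldn't it be unable to access Ceph? (Ceph's Public Network is an IB network, while the container network is Ethernet)
  2. Ceph officially recommends separating the Public Network and the Cluster Network. If the Public Network uses IB, what should the Cluster Network use? Another IB network?

Note: my ideal scenario is that k8s can access Ceph both over RDMA/IB and over Ethernet.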

Question: how does rdma-device-plugin mount the InfiniBand driver into the container?

Hi, we installed rdma-device-plugin on our clusters and found that the InfiniBand driver is mounted into our container at /dev/infiniband once we specify the rdma resource in our pod resources.

But I am curious how rdma-device-plugin mounts the InfiniBand driver into the container.

I am looking at the Allocate implementation:

func (m *RdmaDevicePlugin) Allocate(ctx context.Context, r *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	devs := m.devs
	response := pluginapi.AllocateResponse{ContainerResponses: []*pluginapi.ContainerAllocateResponse{}}
	log.V(1).Infof("Request IDs: %v", r.ContainerRequests)
	for _, container := range r.ContainerRequests {
		var devicesList []*pluginapi.DeviceSpec
		for _, id := range container.DevicesIDs {
			if !deviceExists(devs, id) {
				return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
			}
			var devPath string
			if dev, ok := m.devices[id]; ok {
				// TODO: to function
				devPath = fmt.Sprintf("/dev/infiniband/%s", dev.RdmaDevice.DevName)
			} else {
				continue
			}
			ds := &pluginapi.DeviceSpec{
				ContainerPath: devPath,
				HostPath:      devPath,
				Permissions:   "rw",
			}
			devicesList = append(devicesList, ds)
		}
		// for /dev/infiniband/rdma_cm
		rdma_cm_paths := []string{
			"/dev/infiniband/rdma_cm",
		}
		for _, dev := range rdma_cm_paths {
			devicesList = append(devicesList, &pluginapi.DeviceSpec{
				ContainerPath: dev,
				HostPath:      dev,
				Permissions:   "rw",
			})
		}
		response.ContainerResponses = append(response.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Devices: devicesList,
		})
	}
	return &response, nil
}

Only devicesList is set in the response:

response.ContainerResponses = append(response.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Devices: devicesList,
		})

ContainerAllocateResponse is defined as follows. If Mounts is not set in the pluginapi.ContainerAllocateResponse above, how is the driver mounted into the container?

type ContainerAllocateResponse struct {
	// List of environment variable to be set in the container to access one of more devices.
	Envs map[string]string `` /* 149-byte string literal not displayed */
	// Mounts for the container.
	Mounts []*Mount `protobuf:"bytes,2,rep,name=mounts,proto3" json:"mounts,omitempty"`
	// Devices for the container.
	Devices []*DeviceSpec `protobuf:"bytes,3,rep,name=devices,proto3" json:"devices,omitempty"`
	// Container annotations to pass to the container runtime
	Annotations          map[string]string `` /* 163-byte string literal not displayed */
	XXX_NoUnkeyedLiteral struct{}          `json:"-"`
	XXX_sizecache        int32             `json:"-"`
}
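
As far as the device plugin API goes, entries in Devices are not bind mounts: the kubelet hands each DeviceSpec to the container runtime, which exposes the host device node at ContainerPath. That is most likely why /dev/infiniband appears in the container even though Mounts is empty. The sketch below uses a local stand-in type rather than the real pluginapi package, and the helper name deviceSpecs is hypothetical; it only shows what the Allocate code above ends up putting into Devices for one uverbs device.

```go
package main

import "fmt"

// DeviceSpec is a local stand-in that mirrors the three fields of
// pluginapi.DeviceSpec used by Allocate above. It is not the real type.
type DeviceSpec struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

// deviceSpecs builds one spec per uverbs device name plus the shared
// rdma_cm node, following the same pattern as Allocate. (Hypothetical helper.)
func deviceSpecs(devNames []string) []*DeviceSpec {
	paths := make([]string, 0, len(devNames)+1)
	for _, name := range devNames {
		paths = append(paths, "/dev/infiniband/"+name)
	}
	paths = append(paths, "/dev/infiniband/rdma_cm")

	specs := make([]*DeviceSpec, 0, len(paths))
	for _, p := range paths {
		// Host and container paths are identical, with read/write access.
		specs = append(specs, &DeviceSpec{ContainerPath: p, HostPath: p, Permissions: "rw"})
	}
	return specs
}

func main() {
	for _, s := range deviceSpecs([]string{"uverbs1"}) {
		fmt.Printf("%s -> %s (%s)\n", s.HostPath, s.ContainerPath, s.Permissions)
	}
}
```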

IB sharing within Pod

Hi @hustcat ,

I saw in the TODO that work is still needed to support containers within a pod sharing RDMA devices. Is the same true for InfiniBand? If so, could you elaborate on the steps k8s would need to take to support this feature?

ibv_devinfo output "Failed to open device"

I configured 8 VFs, and the following is the ibv_devinfo output in the demo container:

Failed to open device
Failed to open device
Failed to open device
Failed to open device
Failed to open device
hca_id: mlx4_2
        transport:                      InfiniBand (0)
        fw_ver:                         2.40.7000
        node_guid:                      0014:0500:8cc6:cd0a
        sys_image_guid:                 248a:0703:00e5:3d43
        vendor_id:                      0x02c9
        vendor_part_id:                 4100
        hw_ver:                         0x1
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 2
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

Failed to open device
Failed to open device
Failed to open device

Is that normal? It seems that ibv_devinfo is looking for VFs that are not assigned to the container.

I ran a simple Open MPI program and got an error message showing that Open MPI is looking for mlx4_1, but the VF assigned to this pod is mlx4_2.

Thanks!

No devices found

[root@SCSP00596 k8s-rdma-device-plugin]# ./bin/k8s-rdma-device-plugin -master ens2f0
I0520 10:03:24.135872   30847 main.go:31] Fetching devices.
I0520 10:03:24.137327   30847 main.go:39] No devices found.

Then I added some logging to the code, which printed the following:

GetDevices... ens2f0
ibvDevList:  [{rxe0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rxe0}] <nil>
netDevList:  [enp129s1] <nil>
rdma resourceFile: /sys/class/infiniband/rxe0/device/resource
netdev resourceFile: /sys/class/net/enp129s1/device/resource

So why does the byte comparison of the 'resource' files fail?

plugin fail to register

Hi @hustcat ,

I am facing the error below. How can I overcome it?

Error:

2018/04/29 21:22:02 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: write unix /var/lib/kubelet/device-plugins/rdma.sock->@: write: broken pipe"
ERRO[0008] Could not register device plugin: rpc error: code = Unimplemented desc = unknown service deviceplugin.Registration
INFO[0008] Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
INFO[0008] Received signal "interrupt", shutting down.

Logs:

kubelet start arguments, which contain the feature gate details.
The kubelet version is 1.10.2.

● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sun 2018-04-29 21:20:42 IDT; 9min ago
Docs: http://kubernetes.io/docs/
Main PID: 3313 (kubelet)
CGroup: /system.slice/kubelet.service
└─3313 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --cluster-dns=10.96.0.10 --cluster-domain=cluster.local --authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt --cadvisor-port=0 --cgroup-driver=cgroupfs --rotate-certificates=true --cert-dir=/var/lib/kubelet/pki --feature-gates=DevicePlugins=true

device-plugin unix socket
ls -l /var/lib/kubelet/device-plugins/
total 4
-rw------- 1 root root 48 Apr 29 21:20 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Apr 29 21:20 kubelet.sock
[root@reg-l-vrt-41009 repeated_tasks]#

Failure to initialize the plugin

Hi, @hustcat ,

I am currently working on a project involving Kubernetes and RDMA-enabled containers. I was very happy to find your RDMA device plugin project on GitHub because, if it worked, it would solve a lot of my problems, and I'm very grateful that you published it.

Unfortunately, when I tried to deploy the daemonset as described in your README, it produced the following error:

2018-09-06T14:26:37.916726-04:00 tporch2.lab2-skae 500c35b3b7bd[1580]: time="2018-09-06T18:26:37Z" level=info msg="Fetching devices."
2018-09-06T14:26:37.917634-04:00 tporch2.lab2-skae 500c35b3b7bd[1580]: time="2018-09-06T18:26:37Z" level=error msg="Error to get IB device: open /sys/class/net/flannel.1/device/resource: no such file or directory"

I realized that it is attributed to the following piece of code in rdma.go:

for _, d := range ibvDevList {
	for _, n := range netDevList {
		dResource, err := getRdmaDeviceResoure(d.Name)
		if err != nil {
			return nil, err
		}
		nResource, err := getNetDeviceResoure(n)
		if err != nil {
			return nil, err
		}

		// the same device
		if bytes.Compare(dResource, nResource) == 0 {
			devs = append(devs, Device{
				RdmaDevice: d,
				NetDevice:  n,
			})
		}
	}
}

Several entries in /sys/class/net (docker virtual devices and flannel) don't have a device/resource file, which causes this error.

# ls -al /sys/class/net/
total 0
drwxr-xr-x  2 root root 0 Sep 6 14:27 .
drwxr-xr-x 74 root root 0 Sep 6 14:27 ..
lrwxrwxrwx  1 root root 0 Sep 6 14:27 cni0 -> ../../devices/virtual/net/cni0
lrwxrwxrwx  1 root root 0 Sep 6 14:27 docker0 -> ../../devices/virtual/net/docker0
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth0 -> ../../devices/pci0000:00/0000:00:1c.0/0000:05:00.0/net/eth0
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth1 -> ../../devices/pci0000:00/0000:00:1c.0/0000:05:00.1/net/eth1
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth2 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/net/eth2
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth3 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/net/eth3
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth4 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.0/net/eth4
lrwxrwxrwx  1 root root 0 Sep 6 14:27 eth5 -> ../../devices/pci0000:00/0000:00:02.0/0000:02:00.1/net/eth5
lrwxrwxrwx  1 root root 0 Sep 6 14:27 flannel.1 -> ../../devices/virtual/net/flannel.1
lrwxrwxrwx  1 root root 0 Sep 6 14:27 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx  1 root root 0 Sep 6 14:27 veth22bb6ca5 -> ../../devices/virtual/net/veth22bb6ca5
root@tporch2:~/projects/tporch# ls -al /sys/class/net/*/device
lrwxrwxrwx 1 root root 0 Sep 5 14:30 /sys/class/net/eth0/device -> ../../../0000:05:00.0
lrwxrwxrwx 1 root root 0 Sep 5 14:30 /sys/class/net/eth1/device -> ../../../0000:05:00.1
lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth2/device -> ../../../0000:03:00.0
lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth3/device -> ../../../0000:03:00.0
lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth4/device -> ../../../0000:02:00.0
lrwxrwxrwx 1 root root 0 Sep 5 14:31 /sys/class/net/eth5/device -> ../../../0000:02:00.1

I understand that changing the code to check for the presence of the file before doing the comparison would fix the problem, but I wonder how you dealt with this when you tested your code. Did you not use docker and flannel (or another CNI)? You surely must have had some virtual devices in your configuration, no? I would much appreciate an answer before I start hacking the code. :)
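
A minimal sketch of such a presence check, assuming the comparison is done on the raw bytes of device/resource as in the loop quoted above. The helper names readResource, matchNetDevices and runDemo are hypothetical and not part of the plugin; the demo builds a throwaway directory that mimics /sys/class/net, with one device that has a resource file and one virtual device that does not.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readResource reads <classDir>/<name>/device/resource. It reports ok=false
// with a nil error when the file does not exist, as is the case for virtual
// interfaces such as flannel.1 or docker0. (Hypothetical helper.)
func readResource(classDir, name string) (data []byte, ok bool, err error) {
	data, err = os.ReadFile(filepath.Join(classDir, name, "device", "resource"))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

// matchNetDevices returns the net devices whose resource file is
// byte-identical to want, skipping devices that have no resource file.
func matchNetDevices(netClassDir string, names []string, want []byte) ([]string, error) {
	var matched []string
	for _, n := range names {
		got, ok, err := readResource(netClassDir, n)
		if err != nil {
			return nil, err
		}
		if !ok {
			continue // virtual device: nothing to compare
		}
		if bytes.Equal(got, want) {
			matched = append(matched, n)
		}
	}
	return matched, nil
}

// runDemo exercises matchNetDevices against a temporary fake sysfs tree.
func runDemo() ([]string, error) {
	root, err := os.MkdirTemp("", "netclass")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	res := []byte("0x00000000fb000000 0x00000000fbffffff 0x0000000000140204\n")
	if err := os.MkdirAll(filepath.Join(root, "eth2", "device"), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(root, "eth2", "device", "resource"), res, 0o644); err != nil {
		return nil, err
	}
	// flannel.1 has no device/ subdirectory at all.
	if err := os.MkdirAll(filepath.Join(root, "flannel.1"), 0o755); err != nil {
		return nil, err
	}
	return matchNetDevices(root, []string{"flannel.1", "eth2"}, res)
}

func main() {
	matched, err := runDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(matched)
}
```

Treating a missing file as "no match" rather than as a fatal error keeps real I/O failures visible while letting device discovery continue past virtual interfaces.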

Bug: cannot find device resource

This plugin may work well with a single NIC, but it does not work with NIC channel bonding.
rdma.go checks whether the resource files are equal. For a bonding interface, the resource file is not at /sys/class/net/%s/device/resource.
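
One possible direction, sketched here under the assumption that the Linux bonding driver is in use: a bond exposes its member interfaces as a space-separated list in /sys/class/net/<bond>/bonding/slaves, and each member is a physical interface that may have its own device/resource file. The helper names parseBondSlaves and bondMembers are hypothetical and not part of the plugin.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// parseBondSlaves splits the content of a bonding/slaves file into
// interface names. (Hypothetical helper.)
func parseBondSlaves(content string) []string {
	return strings.Fields(content)
}

// bondMembers returns the member interfaces of a bond. If the interface has
// no bonding/slaves file, it is assumed not to be a bond and is returned
// unchanged as the only element. (Hypothetical helper.)
func bondMembers(netClassDir, name string) ([]string, error) {
	data, err := os.ReadFile(filepath.Join(netClassDir, name, "bonding", "slaves"))
	if errors.Is(err, fs.ErrNotExist) {
		return []string{name}, nil
	}
	if err != nil {
		return nil, err
	}
	return parseBondSlaves(string(data)), nil
}

func main() {
	// "bond0" is only an example name; on a host without it, the name is
	// returned as-is.
	members, err := bondMembers("/sys/class/net", "bond0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(members)
}
```

The resource comparison could then run against each member interface instead of the bond itself.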

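Questions about RDMA and this plugin

Hello hustcat,
Thank you very much for your work; it matches exactly the problem we have run into recently.
I am not very familiar with networking, so I have the following questions:

  1. Is RDMA supported only on IB? I have heard that some high-end Ethernet also supports it. Can your plugin work with RDMA-capable Ethernet?
  2. At the end of the introduction you list working with the GPU plugin as a TODO. Does that mean GPU-to-GPU communication cannot currently use RDMA? For example, on a physical machine I can easily do GPU communication over RDMA through Open MPI. Can I achieve the same inside a K8s container with your plugin installed?
  3. If question 2 works, would it be somewhat less efficient than direct communication between physical machines, because the virtual network layer (such as calico) adds some overhead?
  4. It looks like your plugin has already solved the problem of k8s being unable to mount devices, but the introduction still mentions that k8s support is needed. What is the impact of this?
    Thanks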

build github.com/hustcat/k8s-rdma-device-plugin: cannot load

k8s-rdma-device-plugin git:(master) ✗ ./build
Checking gofmt...
Building plugins
build github.com/hustcat/k8s-rdma-device-plugin: cannot load k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1alpha: cannot find module providing package k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1alpha

CVEs found in this project

vendor/golang.org/x/text/internal/language/parse.go is matching CVE-2020-28852
vendor/golang.org/x/text/internal/language/language.go is matching CVE-2020-28851
vendor/golang.org/x/text/transform/transform.go is matching CVE-2020-14040
vendor/golang.org/x/text/internal/language/parse.go is matching CVE-2021-38561
vendor/golang.org/x/text/language/parse.go is matching CVE-2021-38561
vendor/golang.org/x/text/internal/language/language.go is matching CVE-2021-38561
vendor/golang.org/x/net/http/httpguts/httplex.go is matching CVE-2021-31525

Please address these.
