nvidia / deepops


Tools for building GPU clusters

License: BSD 3-Clause "New" or "Revised" License

Shell 69.80% Python 17.80% Dockerfile 3.65% Jupyter Notebook 1.39% C 0.25% Mustache 0.39% Lua 0.42% Jinja 5.57% Smarty 0.73%

deepops's Introduction

DeepOps

Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs.

Overview

The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as NVIDIA DGX Systems). DeepOps may also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

  • An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster running Kubernetes where DeepOps scripts are used to deploy KubeFlow and connect NFS storage
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm or Kubernetes
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Latest release: DeepOps 23.08 Release

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally functional but may change significantly between releases.

Deployment Requirements

Provisioning System

The provisioning system is used to orchestrate the running of all playbooks, and one is needed when instantiating Kubernetes or Slurm clusters. Tested and supported operating systems include:

  • NVIDIA DGX OS 4, 5
  • Ubuntu 18.04 LTS, 20.04 LTS, 22.04 LTS
  • CentOS 7, 8

Cluster System

The cluster nodes must meet the requirements described by Slurm or Kubernetes. A cluster node may also serve as the provisioning system, but this is not required. Tested and supported operating systems include:

  • NVIDIA DGX OS 4, 5
  • Ubuntu 18.04 LTS, 20.04 LTS, 22.04 LTS
  • CentOS 7, 8

You may also install a supported operating system on all servers via a third-party solution (e.g. MAAS or Foreman) or use the provided OS install container.

Kubernetes

Kubernetes (K8s) is an open-source system for automating the deployment, scaling, and management of containerized applications. DeepOps instantiates the Kubernetes cluster with Kubespray, which runs on bare metal and most clouds, using Ansible for provisioning and orchestration. Kubespray is a good choice for those who are familiar with Ansible, have existing Ansible deployments, or want to run a Kubernetes cluster across multiple platforms. It performs generic configuration-management tasks from the "OS operators" Ansible world, plus initial K8s clustering (with networking plugins included) and control-plane bootstrapping. DeepOps provides additional playbooks for orchestration and optimization of GPU environments.

Consult the DeepOps Kubernetes Deployment Guide for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.
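
For orientation, the deployment ultimately comes down to running the DeepOps Kubernetes cluster playbook against your Ansible inventory, along the lines of the sketch below. This is assembled from commands quoted in the issue reports further down this page; treat the deployment guide as the authoritative source for the exact flags and inventory path.

ansible-playbook -i config/inventory -b playbooks/k8s-cluster.yml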

For more information on Kubernetes in general, refer to the official Kubernetes docs.

Slurm

Slurm is an open-source cluster resource management and job scheduling system that strives to be simple, scalable, portable, fault-tolerant, and interconnect-agnostic. Slurm has currently been tested only under Linux.

As a cluster resource manager, Slurm provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. Slurm cluster instantiation is achieved through SchedMD.

Consult the DeepOps Slurm Deployment Guide for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the official Slurm docs.

Hybrid clusters

DeepOps does not test or support a configuration where both Kubernetes and Slurm are deployed on the same physical cluster.

NVIDIA Bright Cluster Manager is recommended as an enterprise solution which enables managing multiple workload managers within a single cluster, including Kubernetes, Slurm, Univa Grid Engine, and PBS Pro.

DeepOps does not test or support a configuration where nodes run heterogeneous operating systems. Additional modifications are needed if you plan to use unsupported operating systems such as RHEL.

Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.

Consult the Virtual DeepOps Deployment Guide to build a GPU-enabled virtual cluster with DeepOps.
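
As a rough sketch of the flow, taken from the virtual-cluster issue reports further down this page (so the file names are those reported by users rather than guaranteed by the guide): copy the Vagrantfile for your OS into place, then run the startup script.

cp virtual/Vagrantfile-ubuntu virtual/Vagrantfile
./virtual/vagrant_startup.sh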

Updating DeepOps

To update from a previous version of DeepOps to a newer release, please consult the DeepOps Update Guide.

Copyright and License

This project is released under the BSD 3-clause license.

Issues

NVIDIA DGX customers should file an NVES ticket via NVIDIA Enterprise Services.

Otherwise, bugs and feature requests can be made by filing a GitHub Issue.

Contributing

To contribute, please issue a signed pull request against the master branch from a local fork. See the contribution document for more information.
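
If "signed" here refers to the Signed-off-by trailer (an assumption; the contribution document is authoritative), commits can be signed off with git's -s flag:

git commit -s -m "describe the change"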

deepops's People

Contributors

0leaf, ajdecon, avolkov1, chychen, crd477, crtierney42, dholt, elgalu, hightoxicity, iamadrigal, issamsaid, itzsimpl, khoa-ho, lukeyeager, mboglesby, michael-balint, mkunin-work, nuka137, nvfattyrichness, nvhans, phogan-nvidia, ryanolson, satindern, scottesandiego, seyong-um, supertetelman, tuttlebr, tvincereed, vtlrazin, yangatgithub

deepops's Issues

Grafana does not Display GPU Metrics (Ubuntu 18.04, GTX 1060)

I tried KONG on my laptop (Ubuntu 18.04, GTX 1060 6 GiB) but I cannot get Grafana to display GPU information. CPU, memory, and I/O information are displayed without any issue, and the resnet-gpu-pod example works fine for me. For the setup, I use just one master node without a worker node; for now, this will be the setting for our GPU servers as well.

The link I follow: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-install-guide/index.html

The steps I took are in the attached doc.txt file, and the node exporter metrics are in the node_exporter_metrics.txt file. Below is my Grafana screenshot.

[Grafana screenshot]

master nodes labeled incorrectly

Some recent update has caused master nodes to be mislabeled, i.e.

node-role.kubernetes.io/master= should be node-role.kubernetes.io/master=true

Workaround:

kubectl label nodes --all --overwrite node-role.kubernetes.io/master=true

Simplify monitoring deployment

The monitoring section needs another once-over to simplify deployment, auto-detect endpoints, and auto-generate/deploy dashboards and alerting.

PXE boot phase (pixiecore) finishing on OK on imgfetch --name ready http://mgmtip:81/_/booting?mac=00%3A00%3A00%3A00%3A00%3A00

pixiecore is currently serving:

#!ipxe
kernel --name kernel http://mgmtip:81/_/file?name=ka431WTrnWITmnBXhXVgvDdXkyFj-0U_QQdisp483ClJhLdqjHhnenoUL9458ylvN84McvgRB6ioIX1nTY4YG8jjkPqonGTVa1x3e8y8lC5N05qs9QjKHsfjb3a86_SYu5wBw5WyCkjqPx4xngqa&type=kernel&mac=00%3A00%3A00%3A00%3A00%3A00
initrd --name initrd0 http://mgmtip:81/_/file?name=zFaDKSwrlxyZH3XOY-BTRr8f5JieKkTikgPPGR4c51dULsy8ZmKTERR3AUbFZ720nj66VM51vYZSvESFf1w7GEl2VFOOSnTMI46VnyNPF0FEoBGF7U0vs7ApxkxKBvq-wWvhLwDfLq9P4LlnFHVwl2xS4g%3D%3D&type=initrd&mac=00%3A00%3A00%3A00%3A00%3A00
imgfetch --name ready http://mgmtip:81/_/booting?mac=00%3A00%3A00%3A00%3A00%3A00 ||
imgfree ready ||
boot kernel initrd=initrd0 quiet url=http://mgmtip:81/_/file?name=i6d4oz7hmc9ICZzU9Lfm-5Hg8ZQf7OifqKKf-X44heR8uVLG0jv7Jkgd_U9aGU-DcvdVqYuwEHZKTIVM11pCaLRmnPwjAoA%3D ramdisk_size=100000 locale=fr_FR.UTF-8 auto=true priority=critical kbd-chooser/method=fr netcfg/choose_interface=enp1s0f1 netcfg/get_hostname=my-hostname netcfg/dhcp_timeout=120

when my DGX fetches: http://mgmtip:81/_/ipxe?arch=1&mac=00%3A00%3A00%3A00%3A00%3A00

And I am following the logic here: https://github.com/google/netboot/blob/cc33920b4f3296801a64d731d269978116f40d92/pixiecore/http.go#L202

But the DGX prints on the client side:
...

http://mgmtip:81/_/booting?mac=00%3A00%3A00%3A00%3A00%3A00... OK

This HTTP call returns a 200 response code with a "# Booting" body.

Then nothing happens on this server; it is currently stuck on:

http://mgmtip:81/_/booting?mac=00%3A00%3A00%3A00%3A00%3A00... OK

Do you know what could explain this weird behavior?

Thanks

kubectl create secret docker-registry fails to accept documented arguments

Hello,

I recently pulled 19.03 and set up a temporary k8s cluster to test the setup process. I tried following the usage documentation here: https://github.com/NVIDIA/deepops/blob/master/docs/kubernetes-usage.md. I reached the point where I was ready to register the NGC docker registry so I could more easily deploy NVIDIA containers to the cluster, but at step 2 of the documentation the CLI seems to be fighting me. The command I am running, minus my email and docker password:

kubectl create secret docker-registry nvcr.dgxkey –-docker-server=nvcr.io.com --docker-username=\$oauthtoken --docker-password=DOCKER_PASSWORD --docker-email=DOCKER_EMAIL

The CLI prints the following error:

error: exactly one NAME is required, got 2
See 'kubectl create secret docker-registry -h' for help and examples.

Yes, I checked the help documentation; looking at the usage section:
Usage: kubectl create secret docker-registry NAME --docker-username=user --docker-password=password --docker-email=email [--docker-server=string] [--from-literal=key1=value1] [--dry-run] [options]

I see that the --docker-server parameter is optional. This led me to omit it and instead pass the value I had been giving --docker-server as the NAME parameter, and it works. This was just a little annoying, so I figured I would make an issue explaining it.
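
For reference, a corrected form of the command might look like the sketch below. It assumes the registry address should be nvcr.io, that every flag uses a plain double hyphen (the quoted command above contains an en-dash in –-docker-server, which kubectl would likely parse as a second NAME argument), and that the password and email values are placeholders.

kubectl create secret docker-registry nvcr.dgxkey \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=DOCKER_PASSWORD \
  --docker-email=DOCKER_EMAIL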

Scheduled Jenkins job to perform test with CentOS

Right now we do a Jenkins build for each PR using Ubuntu as the base OS, but we want to support CentOS as well.

The build already takes ~24 minutes for a successful run, and we have limited test resources, so I don't necessarily want to do both for each PR. Would it be possible to have Jenkins run the same build for CentOS on some regular schedule, say 2-4 times a day, to catch regressions? Or provide a way to trigger a CentOS build ad-hoc for a given PR?

Questions: ipxe boot from enp1s0f1 (second interface) / air gapped cluster ubuntu

The doc retrieves the MAC address of the first interface with ipmitool... This is not a blocker for me, since I already have an OS running on my DGX nodes and can get the MAC address of enp1s0f1 (the second interface, and the only interface up/plugged in our environment), but it would be great to have an ipmi command that can fetch any other interface's address, if possible (I did not find one in the DGX user guide; maybe I missed it). I manually tested the DHCP server running on my master (launched using the helm ipxie chart) from the OS currently running on my DGX using dhclient, and I can get my static lease. But the ipxe install is currently failing. Are the ipmi commands suitable for installing the OS from the second network interface?

Another question: I used the netboot Ubuntu installer (indicated in the doc for DGX nodes), but I have no route to the internet (the aim is to only use proxies to go outside). Will this block a proper OS installation with the ubuntu-installer?

I hoped I could test the ipxe boot using Vagrant, but it seems you are using a DGX box. Is a virtual-environment ipxe boot for DGX something you plan to implement?

Many thanks for your help.

Slurm playbook is missing a required cpu package

On an up-to-date DGX Station, Slurm was failing to run any batch jobs and was continually transitioning into a draining state. This was a single-node system where the DGX Station was also the login node.

The error was from the cpu-setup script and can be seen below:

root@dgxstation:/shared# bash /etc/slurm/prolog-exclusive.d/50-cpu-setup
+ type cpupower
cpupower is /usr/bin/cpupower
+ cpupower frequency-info
+ grep -e 'governors: Not Available'
WARNING: cpupower not found for kernel 4.15.0-42

  You may need to install the following packages for this specific kernel:
    linux-tools-4.15.0-42-generic
    linux-cloud-tools-4.15.0-42-generic

  You may also want to install one of the following packages to keep up to date:
    linux-tools-generic
    linux-cloud-tools-generic
+ cpupower frequency-set -g performance
WARNING: cpupower not found for kernel 4.15.0-42

  You may need to install the following packages for this specific kernel:
    linux-tools-4.15.0-42-generic
    linux-cloud-tools-4.15.0-42-generic

  You may also want to install one of the following packages to keep up to date:
    linux-tools-generic
    linux-cloud-tools-generic

The missing package that resolves the issue:
apt-get install -y linux-tools-$(uname -r)

This potentially looks like a dupe of #158 which has been closed/resolved.

Running off of master @ 6f7353d.

Installed slurm via ansible-playbook -i config/inventory playbooks/slurm-cluster.yml

Virtual deepops failed with libvirt on Ubuntu 16.04

Hello.
When I tried to set up a DeepOps cluster in virtual mode for testing, it failed with the libvirt provider while launching the Vagrant boxes. What I did:

step1: clone deepops repository into local
step2: copy virtual/Vagrantfile-ubuntu to virtual/Vagrantfile
step3: run virtual/vagrant_startup.sh

It looks like the dependency packages (e.g. the host-manager plugin) were installed successfully, but vagrant up failed. I have pasted the execution log from the vagrant up call here.

+ vagrant up --provider=libvirt
Bringing machine 'virtual-mgmt' up with 'libvirt' provider...
Bringing machine 'virtual-login' up with 'libvirt' provider...
Bringing machine 'virtual-gpu01' up with 'libvirt' provider...
==> virtual-mgmt: An error occurred. The error will be shown after all tasks complete.
==> virtual-gpu01: An error occurred. The error will be shown after all tasks complete.
==> virtual-login: An error occurred. The error will be shown after all tasks complete.
An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.

An error occurred while executing the action on the 'virtual-mgmt'
machine. Please handle this error then try again:

There was error while creating libvirt storage pool: Call to virStoragePoolDefineXML failed: operation failed: Storage source conflict with pool: 'images'

An error occurred while executing the action on the 'virtual-login'
machine. Please handle this error then try again:

There was error while creating libvirt storage pool: Call to virStoragePoolDefineXML failed: operation failed: Storage source conflict with pool: 'images'

An error occurred while executing the action on the 'virtual-gpu01'
machine. Please handle this error then try again:

There was error while creating libvirt storage pool: Call to virStoragePoolDefineXML failed: operation failed: Storage source conflict with pool: 'images'

Any idea how to fix this problem?

Alert about spawning DHCP server into the cluster via helm (dgxie service)

While using helm to spawn the DHCP server (dgxie service), we lost the docker daemon on the master for an unknown reason; after that, the kubelet could no longer keep critical services like the apiserver running or start them. A reboot allowed docker and the kubelet to recover.
But some worker nodes lost their IPs (they were unable to renew their leases during the incident), and we ended up with an unhealthy ceph cluster. Looking at the dgxie pod state after the reboot, the pod was stuck in the ContainerCreating state due to the partial ceph failure (its volume claim was stuck).
Finally, everything recovered after replacing the volume claims with empty volumes at dgxie helm service creation: the dgxie service could then start, the nodes recovered their IPs, the ceph cluster went healthy, and volume claims could be satisfied again.

So two things here:

  • Spawning the dgxie service into the k8s cluster should offer an HA mechanism (DHCP is a critical service)
  • We should work out dgxie's true storage requirements and ensure dgxie storage resiliency

Per OS Settings

@dholt @lukeyeager

Proposal:

Add OS-specific group_vars files in config/group_vars, e.g.

  • config/group_vars/coreos.yml
  • config/group_vars/ubuntu.yml
  • config/group_vars/ubuntu1804.yml

In config/inventory, add OS groups where hosts can be listed

[coreos]
core01
core02
core03

[ubuntu1804]
login # this machine would inherit from both ubuntu and ubuntu1804 group vars

[ubuntu:children]
ubuntu1804

This has worked well for my CoreOS + Ubuntu hybrid cluster.
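
A minimal sketch of what one of the proposed per-OS files might contain, written here via a shell heredoc; the variable shown is purely illustrative and is not defined by DeepOps.

mkdir -p config/group_vars
cat > config/group_vars/ubuntu1804.yml <<'EOF'
# Applied to every host in the [ubuntu1804] inventory group (illustrative only)
extra_packages:
  - linux-tools-generic
EOF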

Centos7 / 'git' command fails in ansible-playbook but not at shell

The check to make sure 'kubespray' is up to date fails when executing from the playbook but not at the command line. Both commands are run at the same directory level.

$  ansible-playbook -i k8s-config/hosts.ini -b playbooks/k8s-cluster.yml -K -k -u deepops
SSH password: 
SUDO password[defaults to SSH password]: 

PLAY [all] ****************************************************************************

TASK [Install Python for Ansible] *****************************************************
changed: [nuc]
changed: [gpu1]
changed: [nv1]

PLAY [127.0.0.1] **********************************************************************

TASK [make sure kubespray is at the correct version] **********************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": ["git", "submodule", "update", "--init"], "delta": "0:00:00.011728", "end": "2019-03-01 11:49:31.484712", "msg": "non-zero return code", "rc": 1, "start": "2019-03-01 11:49:31.472984", "stderr": "You need to run this command from the toplevel of the working tree.", "stderr_lines": ["You need to run this command from the toplevel of the working tree."], "stdout": "", "stdout_lines": []}
	to retry, use: --limit @/home/deepops/deepops/playbooks/k8s-cluster.retry

PLAY RECAP ****************************************************************************
127.0.0.1                  : ok=0    changed=0    unreachable=0    failed=1   
gpu1                       : ok=1    changed=1    unreachable=0    failed=0   
nuc                        : ok=1    changed=1    unreachable=0    failed=0   
nv1                        : ok=1    changed=1    unreachable=0    failed=0   

[deepops@nuc deepops]$ git submodule update --init
Submodule path 'kubespray': checked out 'ea41fc5e742daf525bf4f23f0709b2008eeb49fb'

I commented out the check in the playbook as a workaround.

$ ansible --version
ansible 2.7.8
  config file = /home/deepops/deepops/ansible.cfg
  configured module search path = [u'/home/deepops/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Oct 30 2018, 23:45:53) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

Add NGC containers to Kubeflow

We have a ./scripts/k8s_deploy_kubeflow.sh script to launch Kubeflow, but it'd be great to treat NGC containers as first-class citizens that you can launch from the spawn interface.

Rapids/Dask deployment doesn't work out of the box

A couple of things that I think we should modify:

  • The build fails here; I think it needs a path, i.e. a "." at the end of the command:

    docker build -t dask-rapids

  • The deployment should not depend on a load balancer, since not everyone has the ability to hand out free IP ranges in their environment.

  • The deployment should work out of the box, i.e. not require manually building and pushing a container. We have the "deepops" account on docker hub we could use, though the containers should ideally come from NGC or another public source if possible.

switch to prometheus operator

This might help with #19.
Also, this would be a more elegant way to deploy monitoring.
If you'd like this added, I'd be happy to contribute.

Update docker version

The Docker version needs to be one supported by the NVIDIA container runtime.

Newer versions of Docker are unsupported by Kubernetes, but they seem to work.

Permit to use an internal NTP server for node time syncing

In an air-gapped cluster, k8s nodes do not sync themselves against NTP servers.
Time sync is needed, for example, for ceph.

As a quick workaround, I performed the following tasks (a more persistent sketch follows the list):

  • ansible all -k -b -a "apt-get install ntp ntpdate"
  • ansible all -k -b -a "ntpd 10.0.7.10"
  • ansible all -k -b -a "timedatectl set-ntp true"
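
A slightly more persistent sketch of the same idea, not part of DeepOps: it assumes the nodes run the classic ntp daemon and that 10.0.7.10 is the internal NTP server, as in the workaround above.

ansible all -k -b -m lineinfile -a "path=/etc/ntp.conf line='server 10.0.7.10 iburst'"
ansible all -k -b -m service -a "name=ntp state=restarted"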

Convert dgxie config-map to inline config

Use Helm config to supply the extra dnsmasq configuration via an inline config map instead of a separate file, e.g.:

dnsmasq: |-
  dhcp-host=00:01:02:03:04:05,login01,192.168.1.10
  dhcp-host=01:02:03:04:05:06,dgx01,192.168.1.20
  cname=registry,mgmt01
  cname=registry.local,mgmt01.local

Automatically generate machine file, dhcp file, and inventory file

There are several config files that all use the same information such as hostnames, mac addresses, interfaces, etc.

Could we create a config file where users define each server (type, IP, MAC, interfaces, and hostname) and then use that file to automatically generate the other config files used by DGXie?

A lot of the confusion people have had has been around generating these files.
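
A hypothetical sketch of the idea, generating a dnsmasq dhcp-host entry and an Ansible inventory line from a single per-host record; neither the record format nor the output file names are defined by DeepOps, and the host values are copied from the dgxie example above.

while read -r host mac ip; do
  echo "dhcp-host=${mac},${host},${ip}" >> generated-dnsmasq.conf
  echo "${host} ansible_host=${ip}" >> generated-inventory.ini
done <<'EOF'
login01 00:01:02:03:04:05 192.168.1.10
dgx01 01:02:03:04:05:06 192.168.1.20
EOF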

k8s-gpu-plugin.yml - throws syntax error

Trying the deepops-refactor branch on CentOS 7 before bringing it to the customer. Any insight on whether the issue below is really a syntax issue or actually OS-related?

Install Process
Install a supported operating system (Ubuntu/RHEL)

$ ansible-playbook -v -i k8s-config/hosts.ini -b playbooks/k8s-cluster.yml
Using /home/deepops/deepops-refactor/ansible.cfg as config file

ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.

The error appears to have been in '/home/deepops/deepops-refactor/playbooks/k8s-gpu-plugin.yml': line 17, column 7, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

        name: openshift
    - name: create GPU device plugin
      ^ here


k8s-gpu-plugin.yml

---

- hosts: kube-master
  become: true
  become_method: sudo
  tasks:
    - name: install pip
      apt:
        name: "{{ item }}"
        state: present
      with_items:
        - "python-pip"
      when: (ansible_facts['distribution'] == 'Ubuntu')
    - name: install openshift python client for k8s_raw module
      pip:
        name: openshift
    - name: create GPU device plugin
      k8s:
        state: present
        definition: "{{ lookup('url', 'https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml', split_lines=False) }}"
      run_once: true
  tags:
    - k8s_gpu_device_plugin

Virtual DeepOps - dgx01 final reboot with no IP address

With the virtual deployment, the deployment fails because the final reboot of the dgx01 VM leaves it without an IP address.

The problem is that /etc/network/interfaces specifies the address for eth1, but there is no such interface (only enp6..).

I fixed it by adding net.ifnames=0 to the kernel command line in ansible/roles/slurm/tasks/compute.yml:

- name: add cgroups to grub options
  lineinfile:
    dest: /etc/default/grub
    regexp: "^GRUB_CMDLINE_LINUX="
    line: 'GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1 net.ifnames=0"'
  register: update_grub

Requirements.yml, "unxnn.ansible_users"

I think you need to change “unxnn.ansible_users” to “unxnn.users” in the Requirements.yml file.

Is it okay that the version changes as below?

fca2c5a6962db7696c32a9d319d388cc8623c4
--> 56f9436e0d1e92e743e960b978ac8b72e070

Nvidia OS Server 4.0.5 Compatibility

Hi,

Are there any known compatibility issues using DeepOps with the latest NVIDIA OS Server? I want to upgrade my DGX-2 to 4.0.5, but before I proceed I want to make sure DeepOps has been tested in this environment.

Thank you.

Best,
Noah

Create an optional section to setup and install Kubeflow

Kubeflow is fairly simple to set up and integrate with Kubernetes. It is a very popular tool for deploying and managing DL jobs in Kubernetes clusters.

It would be beneficial to create/integrate additional Kubeflow playbooks so that users can launch Jupyter notebooks more easily at the end of a DeepOps install.

Slurm role issues

Some outstanding issues with the Slurm role:

  • Controller/login node needs reboot before slurm works
  • "cpufreq" prolog failed, need to check for binary/pkg
  • Option to allow other users to ssh without a job
  • Not run for multi-node: /etc/slurm/prolog.d/50-exclusive-parts
    • Job CPUs > System CPUs
  • Tags no longer work

Vagrant and Virsh issues if virt* installed by vagrant_setup

If the virt packages are installed as part of vagrant_startup.sh, then the calling shell isn't aware of its membership in the libvirt group. This means that the steps in https://github.com/NVIDIA/deepops/blob/master/virtual/README.md after ./vagrant_startup.sh won't work, and indeed the virsh list at the end of that script doesn't work either. (See attached log.)

vagrant_startup_sh_NoVirtManager.log

The script should probably check the group membership and inform the user that they need to log out and back in before proceeding. The README should potentially be updated as well with a note that the user may need to log out and back in to pick up the libvirt membership before attempting the other vagrant ssh calls, etc.
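
A sketch of the kind of check suggested here; the group name may be libvirtd rather than libvirt on some distributions.

if ! id -nG | grep -qw libvirt; then
  echo "Current session is not in the 'libvirt' group." >&2
  echo "Log out and back in (or run 'newgrp libvirt') before running vagrant or virsh." >&2
fi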

K8S NVIDIA plugin validation step throws an error.

If you are following the setup guide step by step, the k8s NVIDIA plugin validation step will fail with the error below.

This is due to the latest nvidia/cuda Docker image requiring newer drivers than what is installed via DeepOps.

A simple workaround is to run with --image=nvidia/cuda:10.0-runtime

~$ kubectl run gpu-test --rm -t -i --restart=Never --image=nvidia/cuda --limits=nvidia.com/gpu=1 -- nvidia-smi
pod "gpu-test" deleted
pod default/gpu-test terminated (ContainerCannotRun)
OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=GPU-0f04e528-7346-af7f-c217-804407855015 --compute --utility --require=cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=11889 /var/lib/docker/overlay2/715ad56593e9de9f0ca90efe9177760f53346fbbabc6caeaac41aa7e895fce3f/merged]\\\\nnvidia-container-cli: requirement error: invalid expression\\\\n\\\"\"": unknown
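
Putting the suggested workaround together with the command above, the validation step would look something like this; the image tag is the one suggested above, so pick whichever CUDA runtime tag matches your installed driver.

kubectl run gpu-test --rm -t -i --restart=Never --image=nvidia/cuda:10.0-runtime --limits=nvidia.com/gpu=1 -- nvidia-smi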

Replace all hard-coded URLs with variables

In order to support offline or air-gapped installations, DeepOps needs to account for the case where no host has access to the Internet. Scripts that would normally download files from GitHub, Canonical, Google, etc. will instead need to use local mirrors at a user-specified address.

Opening this issue to track replacement of hard-coded URLs in our scripts and Ansible playbooks with variables that the user can override.
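
Once the URLs are variables, an air-gapped site could point them at a local mirror at run time, for example via an extra-vars override; the variable name below is hypothetical and only illustrates the pattern.

ansible-playbook -i config/inventory -b playbooks/k8s-cluster.yml \
  -e deepops_mirror_base=http://mirror.internal.example.com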
