
nvgpu's Introduction

nvgpu - NVIDIA GPU tools

It provides information about GPUs and their availability for computation.

Often we want to train an ML model on one of the GPUs installed in a multi-GPU machine. Since TensorFlow by default allocates all GPU memory, only one such process can use a GPU at a time. Unfortunately, nvidia-smi provides only a text interface with information about GPUs. This package wraps it with an easier-to-use CLI and Python interface.

It is a quick-and-dirty solution that calls nvidia-smi and parses its output. One or more GPUs can be reported as available for computation based on relative memory usage, i.e. it is fine with Xorg taking a few MB.

In addition there is a fancy table of GPUs with more information obtained via the Python bindings to NVML.

For easier monitoring of multiple machines it's possible to deploy agents (that provide the GPU information in JSON over a REST API) and show the aggregated status in a web application.

Installing

For a user:

pip install nvgpu

or to the system:

sudo -H pip install nvgpu

Usage examples

Command-line interface:

# grab all available GPUs
CUDA_VISIBLE_DEVICES=$(nvgpu available)

# grab at most one available GPU
CUDA_VISIBLE_DEVICES=$(nvgpu available -l 1)

Print pretty colored table of devices, availability, users, processes:

$ nvgpu list
    status    type                 util.      temp.    MHz  users    since    pids    cmd
--  --------  -------------------  -------  -------  -----  -------  ---------------  ------  --------
 0  [ ]       GeForce GTX 1070      0 %          44    139                          
 1  [~]       GeForce GTX 1080 Ti   0 %          44    139  alice    2 days ago       19028   jupyter
 2  [~]       GeForce GTX 1080 Ti   0 %          44    139  bob      14 hours ago     8479    jupyter
 3  [~]       GeForce GTX 1070     46 %          54   1506  bob      7 days ago       20883   train.py
 4  [~]       GeForce GTX 1070     35 %          64   1480  bob      7 days ago       26228   evaluate.py
 5  [!]       GeForce GTX 1080 Ti   0 %          44    139  ?                         9305
 6  [ ]       GeForce GTX 1080 Ti   0 %          44    139

Or shortcut:

$ nvl

Python API:

import nvgpu

nvgpu.available_gpus()
# ['0', '2']

nvgpu.gpu_info()
[{'index': '0',
  'mem_total': 8119,
  'mem_used': 7881,
  'mem_used_percent': 97.06860450794433,
  'type': 'GeForce GTX 1070',
  'uuid': 'GPU-3aa99ee6-4a9f-470e-3798-70aaed942689'},
 {'index': '1',
  'mem_total': 11178,
  'mem_used': 10795,
  'mem_used_percent': 96.57362676686348,
  'type': 'GeForce GTX 1080 Ti',
  'uuid': 'GPU-60410ded-5218-7b06-9c7a-124b77a22447'},
 {'index': '2',
  'mem_total': 11178,
  'mem_used': 10789,
  'mem_used_percent': 96.51994990159241,
  'type': 'GeForce GTX 1080 Ti',
  'uuid': 'GPU-d0a77bd4-cc70-ca82-54d6-4e2018cfdca6'},
  ...
]
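The two calls can be combined to pin the current process to the least-used available GPU before a deep-learning framework initializes CUDA. A minimal sketch, not part of the package itself; it relies only on the fields shown in the gpu_info() output above:

import os
import nvgpu

# GPUs that nvgpu considers free (relative memory usage below its threshold)
free = set(nvgpu.available_gpus())              # e.g. {'0', '2'}

# among those, pick the one with the lowest memory usage
candidates = [g for g in nvgpu.gpu_info() if g['index'] in free]
if not candidates:
    raise RuntimeError('no available GPU found')
best = min(candidates, key=lambda g: g['mem_used_percent'])

# must be set before TensorFlow/PyTorch initialize CUDA
os.environ['CUDA_VISIBLE_DEVICES'] = best['index']
print('using GPU %s (%s)' % (best['index'], best['type']))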

Web application with agents

There are multiple nodes. Agents gather info from the GPUs and provide it in JSON via a REST API. The master collects the info from the other nodes and displays it on an HTML page. By default agents also display their own status.

Agent

FLASK_APP=nvgpu.webapp flask run --host 0.0.0.0 --port 1080

Master

List the agents in a config file. An agent is specified either by a URL to a remote machine or by 'self' for direct access to the local machine. Remove 'self' if the machine itself does not have any GPU. The default is AGENTS = ['self'], so that agents also display their own status; set AGENTS = [] to avoid this.

# nvgpu_master.cfg
AGENTS = [
         'self', # node01 - master - direct access without using HTTP
         'http://node02:1080',
         'http://node03:1080',
         'http://node04:1080',
]
Run the master with the config file:

NVGPU_CLUSTER_CFG=/path/to/nvgpu_master.cfg FLASK_APP=nvgpu.webapp flask run --host 0.0.0.0 --port 1080

Open the master in the web browser: http://node01:1080.

Installing as a service

On Ubuntu with systemd we can install the agent/master as a service to be run automatically at system start.

# create an unprivileged system user
sudo useradd -r nvgpu

Copy nvgpu-agent.service to:

sudo vi /etc/systemd/system/nvgpu-agent.service

List the agents in the configuration file for the master:

sudo vi /etc/nvgpu.conf
AGENTS = [
         # direct access without using HTTP
         'self',
         'http://node01:1080',
         'http://node02:1080',
         'http://node03:1080',
         'http://node04:1080',
]

Set up and start the service:

# enable for automatic startup at boot
sudo systemctl enable nvgpu-agent.service
# start
sudo systemctl start nvgpu-agent.service 
# check the status
sudo systemctl status nvgpu-agent.service
# check the service
open http://localhost:1080
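Besides opening the page in a browser, the service can be checked from Python as well. A minimal sketch, assuming only that the agent/master serves its status page at the root URL as described above:

# quick health check of a running nvgpu agent/master
import requests

resp = requests.get('http://localhost:1080/', timeout=5)
resp.raise_for_status()                 # raises if the status code is not 2xx
print('agent is up, returned %d bytes of status page' % len(resp.content))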

Author

TODO

  • order GPUs by priority (decreasing power, decreasing free memory)

nvgpu's People

Contributors

bzamecnik, felix-petersen, kristian-georgiev, ronmckay, seonbeomkim, tahlor


nvgpu's Issues

json.decoder.JSONDecodeError

I tried to set up multiple agents (1 agent, 1 master).

The agent works perfectly well, but when I query the master, I receive the following error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
X.X.X.X - - [03/Dec/2019 17:08:55] "GET / HTTP/1.1" 500 -

AttributeError: 'module' object has no attribute 'Cryptography_HAS_SSL_ST'

Apr 16 09:21:29 mlok systemd[1]: Started NVGPU agent.
Apr 16 09:21:29 mlok bash[1327]:  * Serving Flask app "nvgpu.webapp"
Apr 16 09:21:29 mlok bash[1327]:  * Environment: production
Apr 16 09:21:29 mlok bash[1327]:    WARNING: Do not use the development server in a production environment.
Apr 16 09:21:29 mlok bash[1327]:    Use a production WSGI server instead.
Apr 16 09:21:29 mlok bash[1327]:  * Debug mode: off
Apr 16 09:21:29 mlok bash[1327]: Traceback (most recent call last):
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/bin/flask", line 11, in <module>
Apr 16 09:21:29 mlok bash[1327]:     sys.exit(main())
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 894, in main
Apr 16 09:21:29 mlok bash[1327]:     cli.main(args=args, prog_name=name)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 557, in main
Apr 16 09:21:29 mlok bash[1327]:     return super(FlaskGroup, self).main(*args, **kwargs)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 717, in main
Apr 16 09:21:29 mlok bash[1327]:     rv = self.invoke(ctx)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 1137, in invoke
Apr 16 09:21:29 mlok bash[1327]:     return _process_result(sub_ctx.command.invoke(sub_ctx))
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 956, in invoke
Apr 16 09:21:29 mlok bash[1327]:     return ctx.invoke(self.callback, **ctx.params)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 555, in invoke
Apr 16 09:21:29 mlok bash[1327]:     return callback(*args, **kwargs)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/decorators.py", line 64, in new_func
Apr 16 09:21:29 mlok bash[1327]:     return ctx.invoke(f, obj, *args, **kwargs)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 555, in invoke
Apr 16 09:21:29 mlok bash[1327]:     return callback(*args, **kwargs)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 767, in run_command
Apr 16 09:21:29 mlok bash[1327]:     app = DispatchingApp(info.load_app, use_eager_loading=eager_loading)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 293, in __init__
Apr 16 09:21:29 mlok bash[1327]:     self._load_unlocked()
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 317, in _load_unlocked
Apr 16 09:21:29 mlok bash[1327]:     self._app = rv = self.loader()
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 372, in load_app
Apr 16 09:21:29 mlok bash[1327]:     app = locate_app(self, import_name, name)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/flask/cli.py", line 235, in locate_app
Apr 16 09:21:29 mlok bash[1327]:     __import__(module_name)
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/nvgpu/webapp.py", line 23, in <module>
Apr 16 09:21:29 mlok bash[1327]:     from nvgpu.master import gather_reports, format_reports_to_ansi
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/nvgpu/master.py", line 1, in <module>
Apr 16 09:21:29 mlok bash[1327]:     import requests
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/requests/__init__.py", line 95, in <module>
Apr 16 09:21:29 mlok bash[1327]:     from urllib3.contrib import pyopenssl
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/urllib3/contrib/pyopenssl.py", line 46, in <module>
Apr 16 09:21:29 mlok bash[1327]:     import OpenSSL.SSL
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-19.0.0-py2.7.egg/OpenSSL/__init__.py", line 8, in <module>
Apr 16 09:21:29 mlok bash[1327]:     from OpenSSL import crypto, SSL
Apr 16 09:21:29 mlok bash[1327]:   File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-19.0.0-py2.7.egg/OpenSSL/SSL.py", line 194, in <module>
Apr 16 09:21:29 mlok bash[1327]:     if _lib.Cryptography_HAS_SSL_ST:
Apr 16 09:21:29 mlok bash[1327]: AttributeError: 'module' object has no attribute 'Cryptography_HAS_SSL_ST'

pip install -U is a little presumptuous isn't it?

I really appreciate the nvgpu tool! Thank you.
Your installation instructions direct users to the -U / --upgrade pip flag. Considering that this may upgrade a user's environment in substantial ways they may not prefer, is it really necessary or appropriate? Users who aren't aware of what the -U flag does could end up with an environment different from what they wanted, when all they wanted was to add the nvgpu functionality. I'm guessing (hoping) it resolves issues you've experienced, but it seems like those could be solved a different way. Can you let us know why you've recommended it?

gpu_info issue

There is a bug in the gpu_info() method when the GPU is a Titan.

not working

if gpu is 'GPU 0: TITAN X (Pascal) (UUID: GPU-cd2c447b-916f-e0e0-1054-63e07d7110e5)'
re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups()
-> AttributeError: 'NoneType' object has no attribute 'groups'

working

if gpu is 'GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-db5b6266-4f82-d11f-067f-71dff805c1e4)' 
re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups()
-> ('0', 'NVIDIA GeForce RTX 3060', 'GPU-db5b6266-4f82-d11f-067f-71dff805c1e4')

solution

re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu) -> re.match('GPU ([0-9]+): ([^.]+) \(UUID: ([^)]+)\)', gpu)

if gpu is 'GPU 0: TITAN X (Pascal) (UUID: GPU-cd2c447b-916f-e0e0-1054-63e07d7110e5)'
re.match('GPU ([0-9]+): ([^.]+) \(UUID: ([^)]+)\)', gpu).groups()
-> ('0', 'TITAN X (Pascal)', 'GPU-cd2c447b-916f-e0e0-1054-63e07d7110e5')

if gpu is 'GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-db5b6266-4f82-d11f-067f-71dff805c1e4)' 
re.match('GPU ([0-9]+): ([^.]+) \(UUID: ([^)]+)\)', gpu).groups()
-> ('0', 'NVIDIA GeForce RTX 3060', 'GPU-db5b6266-4f82-d11f-067f-71dff805c1e4')
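The proposed pattern can be verified with a short snippet that reproduces the two examples above (a sketch, not the library code itself):

import re

# [^.]+ instead of [^(]+ so that device names containing parentheses
# (e.g. 'TITAN X (Pascal)') are captured as well
pattern = r'GPU ([0-9]+): ([^.]+) \(UUID: ([^)]+)\)'

lines = [
    'GPU 0: TITAN X (Pascal) (UUID: GPU-cd2c447b-916f-e0e0-1054-63e07d7110e5)',
    'GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-db5b6266-4f82-d11f-067f-71dff805c1e4)',
]
for line in lines:
    print(re.match(pattern, line).groups())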

Add memory info

Great repo !

I think one piece of possibly missing information is the GPU memory used (as a percentage, like GPU utilization).

For example, sometimes my process needs only 30% of the memory, so I can start several processes on the same GPU. It would be useful to be able to see the remaining free memory for a given GPU.
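For the Python API, the remaining free memory can already be derived from the fields that gpu_info() returns (see the usage example above). A sketch:

import nvgpu

for g in nvgpu.gpu_info():
    free_mb = g['mem_total'] - g['mem_used']
    print('GPU %s: %d MiB free (%.1f %% used)'
          % (g['index'], free_mb, g['mem_used_percent']))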

nvgpu in Docker cannot see processes in host OS from container :facepalm:

psutil cannot see process details (user, creation time, command) of processes on the host OS - by the very definition of a container these are separated. This is a design fault of the dockerized solution and I'm not …

 nvidia-docker run --rm bzamecnik/nvgpu nvl
    status    type                 util.      temp.    MHz  users    since    pids    cmd
--  --------  -------------------  -------  -------  -----  -------  -------  ------  -----
 0  [!]       GeForce GTX 1070       0 %         45   1582  ?                 27363   
 1  [ ]       GeForce GTX 1080 Ti    0 %         48   1480                            
 2  [ ]       GeForce GTX 1080 Ti    0 %         54   1480                            
 3  [!]       GeForce GTX 1070       0 %         43   1582  ?                 38128   
 4  [!]       GeForce GTX 1070       0 %         44   1506  ?                 11107   
 5  [!]       GeForce GTX 1080 Ti    0 %         52   1480  ?                 13318   
 6  [!]       GeForce GTX 1080 Ti    0 %         54   1480  ?                 13318 

How can I get other stats from the API?

Currently the output looks like:

[{'index': '0',
  'mem_total': 8119,
  'mem_used': 7881,
  'mem_used_percent': 97.06860450794433,
  'type': 'GeForce GTX 1070',
  'uuid': 'GPU-3aa99ee6-4a9f-470e-3798-70aaed942689'}]

Is there a way to include more stats such as GPU temperature or load, like so?

[{'index': '0',
  'load': '20%',
  'temperature': '35°C',
  'mem_total': 8119,
  'mem_used': 7881,
  'mem_used_percent': 97.06860450794433,
  'type': 'GeForce GTX 1070',
  'uuid': 'GPU-3aa99ee6-4a9f-470e-3798-70aaed942689'}]
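Temperature and utilization are not part of the documented gpu_info() output, but they can be read directly from NVML via pynvml, which nvgpu already uses for nvl. A hedged sketch of how the desired fields could be gathered alongside the memory stats:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # .gpu and .memory in %
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)            # .total / .used in bytes
        print({'index': str(i),
               'load': '%d %%' % util.gpu,
               'temperature': '%d °C' % temp,
               'mem_used_percent': 100.0 * mem.used / mem.total})
finally:
    pynvml.nvmlShutdown()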

Potential licensing issue

Hi,

As part of a license check in some of our codebases, we found that nvgpu is licensed under MIT but one of its dependencies, ansi2html, is an LGPLv3+ package. Seeing that ansi2html is only used in webapp.py and that the webapp is not a crucial part of this library, would it be possible to package it as a PyPI extra, so that it is only installed through something like pip install nvgpu[webapp]?

Thank you!

Try to parse jenkins user from its env var

It might be nice to see the actual user running a job via Jenkins. It can be done e.g. with a little help from the jobs (setting an env variable or putting the user name into the command).

psutil: AttributeError: 'module' object has no attribute '_exceptions'

Traceback (most recent call last):
  File "/usr/local/bin/nvl", line 11, in <module>
    sys.exit(pretty_list_gpus())
  File "/usr/local/lib/python2.7/dist-packages/nvgpu/list_gpus.py", line 67, in pretty_list_gpus
    df = device_table()
  File "/usr/local/lib/python2.7/dist-packages/nvgpu/list_gpus.py", line 60, in device_table
    rows = [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python2.7/dist-packages/nvgpu/list_gpus.py", line 33, in device_status
    except psutil._exceptions.NoSuchProcess:
AttributeError: 'module' object has no attribute '_exceptions'
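The private psutil._exceptions module is no longer exposed in newer psutil releases; the public exception class psutil.NoSuchProcess works across versions. A sketch of that kind of fix (the surrounding code is paraphrased, not the project's actual source):

import psutil

def safe_username(pid):
    """Return the owner of a process, or None if it disappeared meanwhile."""
    try:
        return psutil.Process(pid).username()
    except psutil.NoSuchProcess:  # public name instead of psutil._exceptions.NoSuchProcess
        return None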

Install nvgpu error !

I got an error when installing nvgpu on Windows. I found a similar problem reported elsewhere, where the advice was to install on Ubuntu, so I tried it in WSL and it was successful. Can anyone help me install it on Windows?

Partially used gpu not displayed by `available_gpus`

Hi,
I think there might be a bug in the method available_gpus:
(screenshot of nvl output)
As you can see, my GPU 0 is only partially used, but it is not returned by available_gpus().
I wonder if it is designed this way or if it is some kind of bug.

Permission denied on Windows

Python 3.7.3
Windows 10
nvgpu version: 0.8.0
NVIDIA driver version 441.66
CUDA version: 10.2.89

Error message:

C:\Users\zz>nvl
Traceback (most recent call last):
  File "d:\soft\miniconda3\lib\site-packages\psutil\_pswindows.py", line 716, in wrapper
    return fun(self, *args, **kwargs)
  File "d:\soft\miniconda3\lib\site-packages\psutil\_pswindows.py", line 926, in username
    domain, user = cext.proc_username(self.pid)
PermissionError: [WinError 5] Access is denied. (拒绝访问。)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\soft\miniconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "d:\soft\miniconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\soft\Miniconda3\Scripts\nvl.exe\__main__.py", line 9, in <module>
  File "d:\soft\miniconda3\lib\site-packages\nvgpu\list_gpus.py", line 107, in pretty_list_gpus
    rows = device_statuses()
  File "d:\soft\miniconda3\lib\site-packages\nvgpu\list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "d:\soft\miniconda3\lib\site-packages\nvgpu\list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "d:\soft\miniconda3\lib\site-packages\nvgpu\list_gpus.py", line 36, in device_status
    users.append(proc.username())
  File "d:\soft\miniconda3\lib\site-packages\psutil\__init__.py", line 815, in username
    return self._proc.username()
  File "d:\soft\miniconda3\lib\site-packages\psutil\_pswindows.py", line 718, in wrapper
    raise convert_oserror(err, pid=self.pid, name=self._name)
psutil.AccessDenied: psutil.AccessDenied (pid=1252)

AttributeError: 'str' object has no attribute 'decode'

In the last month or so, new installations of nvgpu have resulted in nvl being non-functional. Instead of giving the expected results showing GPU utilization, I've been getting this error:

Traceback (most recent call last):
  File "/home/{userid}/.conda/envs/my_conda_env/bin/nvl", line 8, in <module>
    sys.exit(pretty_list_gpus())
  File "/home/{userid}/.conda/envs/my_conda_env/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 107, in pretty_list_gpus
    rows = device_statuses()
  File "/home/{userid}/.conda/envs/my_conda_env/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/{userid}/.conda/envs/my_conda_env/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/{userid}/.conda/envs/my_conda_env/lib/python3.9/site-packages/nvgpu/list_gpus.py", line 18, in device_status
    device_name = device_name.decode('UTF-8')
AttributeError: 'str' object has no attribute 'decode'

I found that pip-uninstalling pynvml (which removes v11.5.0) and running pip install pynvml==11.4.1 restores functionality. So one solution might be to add < 11.5.0 to the pynvml entry in install_requires in setup.py line 19; I didn't actually test that. Instead, I conda install -c conda-forge pynvml==11.4.1 into my conda environment before I pip install nvgpu.

I am running nvgpu using python v3.9.16 on a Linux server running cuda v11.4 drivers.
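Until the dependency is pinned, a version-agnostic workaround in application code is to decode only when NVML returns bytes. A sketch (not the package's actual fix), assuming pynvml is installed and at least one GPU is present:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)
# pynvml <= 11.4.1 returns bytes, 11.5.0 returns str -- decode only when needed
if isinstance(name, bytes):
    name = name.decode('utf-8')
print(name)
pynvml.nvmlShutdown()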

nvl over time?

I share some multi-GPU machines with a group of people. Sometimes running nvl (or nvgpu.available_gpus()) just once misses that someone else is interactively working on one of the GPUs, because the check happened to miss their sporadic usage. It'd be nice to have the option to get an nvl result that somehow averages the information over a user-definable period of time.
Alternately, if the nvl output were compatible with the watch command I could do watch -n 3 nvl. Instead I get characters watch can't display nicely.
(screenshot of the garbled nvl output under watch)
Anyway, great tool! Thank you!

Windows support

In the case of Windows, the file "nvidia-smi.exe" can be found at
"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe".

This can be used to make nvgpu run on Windows as well, as is done for example in PyTorch.
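A sketch of how such a lookup could work, using the path quoted in this issue and falling back to PATH (the helper name is hypothetical):

import os
import shutil

# path quoted in this issue; it may differ between driver versions
_WINDOWS_NVSMI = r'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'

def find_nvidia_smi():
    """Return a path to nvidia-smi usable for subprocess calls."""
    if os.name == 'nt' and os.path.isfile(_WINDOWS_NVSMI):
        return _WINDOWS_NVSMI
    return shutil.which('nvidia-smi') or 'nvidia-smi'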

AttributeError: 'NoneType' object has no attribute 'groups'

>>> nvgpu.gpu_info()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jz748/anaconda3/lib/python3.7/site-packages/nvgpu/__init__.py", line 8, in gpu_info
    gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() for gpu in gpus]
  File "/home/jz748/anaconda3/lib/python3.7/site-packages/nvgpu/__init__.py", line 8, in <listcomp>
    gpu_infos = [re.match('GPU ([0-9]+): ([^(]+) \(UUID: ([^)]+)\)', gpu).groups() for gpu in gpus]
AttributeError: 'NoneType' object has no attribute 'groups'

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.14       Driver Version: 430.14       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:01:00.0 Off |                  N/A |
| 27%   45C    P0    81W / 250W |      0MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 00000000:0B:00.0 Off |                  N/A |
| 27%   45C    P0    57W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:0D:00.0 Off |                  N/A |
| 27%   45C    P0    81W / 250W |      0MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 980 Ti  Off  | 00000000:0E:00.0 Off |                  N/A |
| 17%   45C    P0    66W / 250W |      0MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TIT...  Off  | 00000000:0F:00.0 Off |                  N/A |
| 28%   45C    P0    82W / 250W |      0MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 1080    Off  | 00000000:10:00.0 Off |                  N/A |
| 29%   39C    P0    37W / 180W |      0MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TIT...  Off  | 00000000:11:00.0 Off |                  N/A |
| 27%   44C    P0    77W / 250W |      0MiB /  6083MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[BUG] gpu_info incorrect if statement for specific driver version

With my specific combination of NVIDIA driver and CUDA version, the function gpu_info has an incompatible/incorrect if-else statement. It turns out that in this case, nvidia-smi prints in the following manner:

Mon May  2 18:25:57 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P0    23W / 300W |     11MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+  

This output should therefore be handled by the if branch rather than the else branch.

Handle errors from NVL more gracefully.

Logs:

  File "/usr/local/lib/python2.7/dist-packages/nvgpu/list_gpus.py", line 60, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/usr/local/lib/python2.7/dist-packages/nvgpu/list_gpus.py", line 14, in device_status
    handle = nv.nvmlDeviceGetHandleByIndex(device_index)
  File "/usr/local/lib/python2.7/dist-packages/pynvml.py", line 946, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python2.7/dist-packages/pynvml.py", line 405, in _nvmlCheckReturn
    raise NVMLError(ret)
NVMLError_GpuIsLost: GPU is lost

It fails with HTTP 500; it should instead report the error NVMLError_GpuIsLost: GPU is lost.
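One way to report such per-device failures instead of letting the whole request fail is to catch pynvml's NVMLError around the per-device calls. A sketch (not the project's code), assuming nvmlInit() has already been called:

import pynvml as nv

def safe_device_status(device_index):
    """Return basic device info, or an error marker instead of raising."""
    try:
        handle = nv.nvmlDeviceGetHandleByIndex(device_index)
        return {'index': device_index, 'name': nv.nvmlDeviceGetName(handle)}
    except nv.NVMLError as e:
        # e.g. NVMLError_GpuIsLost: GPU is lost -- shown in the table instead of a 500
        return {'index': device_index, 'error': str(e)}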
