
pynvml's Introduction

gpuopenanalytics.github.io

The Website for the GPU Open Analytics Initiative (GoAi)

See it at www.gpuopenanalytics.com.

pynvml's People

Contributors

arvoelke, jacobtomlinson, jakirkham, kenhester, ksangeek, mrocklin, quasiben, rjzamora


pynvml's Issues

adding support for CUDA_VISIBLE_DEVICES which is currently ignored

I understand that this is a Python binding to NVML, which ignores CUDA_VISIBLE_DEVICES, but perhaps this variable could be respected in pynvml? Otherwise we end up with inconsistent behavior between PyTorch (TF?) and pynvml.

For example, on this setup I have card 0 (24 GB) and card 1 (8 GB).

If I run:

CUDA_VISIBLE_DEVICES=1 python -c "import pynvml; pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
25447170048

which is the output for card 0, even though I was expecting output for card 1.

The expected output is what I get if I explicitly pass the system ID to NVML:

python -c "import pynvml;pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(1); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
8513978368

So in the first snippet I get the wrong card: card 0, rather than card 1 indexed as 0.

The conflict with PyTorch happens when I call id = torch.cuda.current_device(), which returns 0 with CUDA_VISIBLE_DEVICES="1". I hope my explanation of where the problem lies is clear.

pynvml could respect CUDA_VISIBLE_DEVICES if the latter is set.

Of course, if this is attempted, we can't just change the normal behavior, as it would break people's code. Perhaps, if pynvml.nvmlInit(respect_cuda_visible_devices=True) were passed, it could magically remap the id argument of nvmlDeviceGetHandleByIndex to the corresponding ID in CUDA_VISIBLE_DEVICES. So in the very first snippet above, nvmlDeviceGetHandleByIndex(0) would actually be called for id=1, as that is the 0th device relative to `CUDA_VISIBLE_DEVICES="1"`.

So the nvmlDeviceGetHandleByIndex() argument would become an index with respect to CUDA_VISIBLE_DEVICES; e.g. `CUDA_VISIBLE_DEVICES="1,0"` would reverse the IDs.

Thank you!

Meanwhile I added the following workaround to my software:

    import os
    [...]
    if id is None:
        id = torch.cuda.current_device()
    # if CUDA_VISIBLE_DEVICES is used automagically remap the id since pynvml ignores this env var
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
        id = ids[id] # remap
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(id)
        [...]

If someone needs this as a helper wrapper, you can find it here:
https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33

NVLINK Ask

I am interested in two nvlink capabilities I call from the command line:

  • nvidia-smi nvlink -g 1 -i 2

    • I believe the above asks for the total data transferred with counter 1 on GPU 2 (note: there are two counters, 0 and 1)
  • nvidia-smi nvlink -r 1

  • This call resets all the counters of type 1 on all GPUs

Could these calls be exposed in pynvml?
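For what it's worth, NVML already exposes per-link utilization counter calls that map roughly onto those nvidia-smi flags, and pynvml wraps them as nvmlDeviceGetNvLinkUtilizationCounter and nvmlDeviceResetNvLinkUtilizationCounter. A hedged sketch (assuming permissions allow counter access, and that GPU index 2 and counter 1 match the CLI example above):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(2)  # GPU 2, as in `nvidia-smi nvlink -g 1 -i 2`

# Read counter 1 on each NVLink link of this GPU.
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        rx_tx = pynvml.nvmlDeviceGetNvLinkUtilizationCounter(handle, link, 1)
        print(f"link {link}: counter 1 = {rx_tx}")
    except pynvml.NVMLError:
        break  # link not present or not supported

# Reset counter 1 on each link of this GPU (the CLI `-r 1` does this for every GPU).
for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
    try:
        pynvml.nvmlDeviceResetNvLinkUtilizationCounter(handle, link, 1)
    except pynvml.NVMLError:
        break

pynvml.nvmlShutdown()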

cc @thomcom @shwina

What information is valuable?

Here is what the nvidia-smi API in this library produces for a single GPU. Which of this information is useful, and what might we want to build dashboards out of?

In [1]: import nvidia_smi

In [2]: nvsmi = nvidia_smi.nvidia_smi.getInstance()

In [3]: nvsmi.DeviceQuery()['gpu'][0]
Out[3]:
{'id': '0000:06:00.0',
 'product_name': 'Tesla V100-SXM2-32GB',
 'product_brand': 'Tesla',
 'display_mode': 'Enabled',
 'display_active': 'Disabled',
 'persistence_mode': 'Enabled',
 'accounting_mode': 'Disabled',
 'accounting_mode_buffer_size': '4000',
 'driver_model': {'current_dm': 'N/A', 'pending_dm': 'N/A'},
 'serial': '0321918171737',
 'uuid': 'GPU-96ab329d-7a1f-73a8-a9b7-18b4b2855f92',
 'minor_number': '0',
 'vbios_version': '88.00.43.00.04',
 'multigpu_board': 'No',
 'board_id': '0x600',
 'inforom_version': {'img_version': 'G503.0204.00.02',
  'oem_object': '1.1',
  'ecc_object': '5.0',
  'pwr_object': 'N/A'},
 'gpu_operation_mode': {'current_gom': 'N/A', 'pending_gom': 'N/A'},
 'pci': {'pci_bus': '06',
  'pci_device': '00',
  'pci_domain': '0000',
  'pci_device_id': '1DB510DE',
  'pci_bus_id': '0000:06:00.0',
  'pci_sub_system_id': '124910DE',
  'pci_gpu_link_info': {'pcie_gen': {'max_link_gen': '3',
    'current_link_gen': '3'},
   'link_widths': {'max_link_width': '16x', 'current_link_width': '16x'}},
  'pci_bridge_chip': {'bridge_chip_type': 'N/A', 'bridge_chip_fw': 'N/A'},
  'replay_counter': '0',
  'tx_util': 0,
  'tx_util_unit': 'KB/s',
  'rx_util': 0,
  'rx_util_unit': 'KB/s'},
 'fan_speed': 'N/A',
 'fan_speed_unit': '%',
 'performance_state': 'P0',
 'clocks_throttle': {'clocks_throttle_reason_gpu_idle': 'Active',
  'clocks_throttle_reason_applications_clocks_setting': 'Not Active',
  'clocks_throttle_reason_sw_power_cap': 'Not Active',
  'clocks_throttle_reason_hw_slowdown': 'Not Active',
  'clocks_throttle_reason_unknown': 'N/A'},
 'fb_memory_usage': {'total': 32510.5,
  'used': 0.0,
  'free': 32510.5,
  'unit': 'MiB'},
 'bar1_memory_usage': {'total': 32768.0,
  'used': 2.50390625,
  'free': 32765.49609375,
  'unit': 'MiB'},
 'compute_mode': 'Default',
 'utilization': {'gpu_util': 0,
  'memory_util': 0,
  'encoder_util': 0,
  'decoder_util': 0,
  'unit': '%'},
 'ecc_mode': {'current_ecc': 'Enabled', 'pending_ecc': 'Enabled'},
 'ecc_errors': {'volatile': {'single_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'},
   'double_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'}},
  'aggregate': {'single_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'},
   'double_bit': {'device_memory': 0,
    'register_file': 0,
    'l1_cache': 0,
    'l2_cache': 0,
    'texture_memory': 'N/A',
    'total': '0'}}},
 'retired_pages': {'multiple_single_bit_retirement': None,
  'double_bit_retirement': None,
  'pending_retirement': 'No'},
 'temperature': {'gpu_temp': 31,
  'gpu_temp_max_threshold': 90,
  'gpu_temp_slow_threshold': 87,
  'unit': 'C'},
 'power_readings': {'power_management': 'Supported',
  'power_draw': 43.235,
  'power_limit': 300.0,
  'default_power_limit': 300.0,
  'enforced_power_limit': 300.0,
  'min_power_limit': 150.0,
  'max_power_limit': 300.0,
  'power_state': 'P0',
  'unit': 'W'},
 'clocks': {'graphics_clock': 135,
  'sm_clock': 135,
  'mem_clock': 877,
  'unit': 'MHz'},
 'applications_clocks': {'graphics_clock': 1290,
  'mem_clock': 877,
  'unit': 'MHz'},
 'default_applications_clocks': {'graphics_clock': 1290,
  'mem_clock': 877,
  'unit': 'MHz'},
 'max_clocks': {'graphics_clock': 1530,
  'sm_clock': 1530,
  'mem_clock': 877,
  'unit': 'MHz'},
 'clock_policy': {'auto_boost': 'N/A', 'auto_boost_default': 'N/A'},
 'supported_clocks': [{'current': 877,
   'unit': 'MHz',
   'supported_graphics_clock': [1530,
    1522,
    1515,
    1507,
    1500,
    1492,
    1485,
    1477,
    1470,
    1462,
    1455,
    1447,
    1440,
    1432,
    1425,
    1417,
    1410,
    1402,
    1395,
    1387,
    1380,
    1372,
    1365,
    1357,
    1350,
    1342,
    1335,
    1327,
    1320,
    1312,
    1305,
    1297,
    1290,
    1282,
    1275,
    1267,
    1260,
    1252,
    1245,
    1237,
    1230,
    1222,
    1215,
    1207,
    1200,
    1192,
    1185,
    1177,
    1170,
    1162,
    1155,
    1147,
    1140,
    1132,
    1125,
    1117,
    1110,
    1102,
    1095,
    1087,
    1080,
    1072,
    1065,
    1057,
    1050,
    1042,
    1035,
    1027,
    1020,
    1012,
    1005,
    997,
    990,
    982,
    975,
    967,
    960,
    952,
    945,
    937,
    930,
    922,
    915,
    907,
    900,
    892,
    885,
    877,
    870,
    862,
    855,
    847,
    840,
    832,
    825,
    817,
    810,
    802,
    795,
    787,
    780,
    772,
    765,
    757,
    750,
    742,
    735,
    727,
    720,
    712,
    705,
    697,
    690,
    682,
    675,
    667,
    660,
    652,
    645,
    637,
    630,
    622,
    615,
    607,
    600,
    592,
    585,
    577,
    570,
    562,
    555,
    547,
    540,
    532,
    525,
    517,
    510,
    502,
    495,
    487,
    480,
    472,
    465,
    457,
    450,
    442,
    435,
    427,
    420,
    412,
    405,
    397,
    390,
    382,
    375,
    367,
    360,
    352,
    345,
    337,
    330,
    322,
    315,
    307,
    300,
    292,
    285,
    277,
    270,
    262,
    255,
    247,
    240,
    232,
    225,
    217,
    210,
    202,
    195,
    187,
    180,
    172,
    165,
    157,
    150,
    142,
    135]}],
 'processes': None,
 'accounted_processes': None}

cc @seibert @kkraus14 @sklam @randerzander

Higher-level application

PyNVML bindings are great for doing all GPU information management from Python, but they are almost entirely an identical copy of the C API. This can be a barrier for Python users, who need to find out from the NVML API documentation what the API provides, what types need to be passed, etc. We currently utilize PyNVML in both Distributed and Dask-CUDA, but there's also some overlap that leads to code duplication.

I feel one way to reduce code duplication and make it easier for new users, and thus make things overall better, is to provide a "High-level PyNVML library" that takes care of the basic needs for users. For example, I would imagine something like the following (but not limited to) to be available (implementation omitted for simplicity):

from typing import Optional, Union


class Handle:
    """A handle to a GPU device.

    Parameters
    ----------
    index: int, optional
        Integer representing the CUDA device index to get a handle to.
    uuid: bytes or str, optional
        UUID of a CUDA device to get a handle to.

    Raises
    ------
    ValueError
        If neither `index` nor `uuid` are specified or if both are specified.
    """
    def __init__(
        self, index: Optional[int] = None, uuid: Optional[Union[bytes, str]] = None
    ):
        ...

    @property
    def free_memory(self) -> int:
        """
        Free memory of the CUDA device.
        """

    @property
    def total_memory(self) -> int:
        """
        Total memory of the CUDA device.
        """

    @property
    def used_memory(self) -> int:
        """
        Used memory of the CUDA device.
        """

There would be more to cover than the above, such as getting the number of available GPUs in the system, whether a GPU currently has a context created, whether a handle refers to a MIG instance or a physical GPU, etc. Additionally, we would have simple, generally useful tools, for example a small tool I wrote long ago to measure NVLink bandwidth and peak memory, and whatever else fits in the scope of a "High-level PyNVML library" that can make our users' lives easier.
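For illustration only, here is a minimal sketch of how the memory properties above could wrap the existing bindings; the constructor behavior and error handling are assumptions, not a settled design:

import pynvml

class Handle:
    """Illustrative sketch: a thin object wrapper over the raw bindings."""

    def __init__(self, index=None, uuid=None):
        if (index is None) == (uuid is None):
            raise ValueError("Specify exactly one of `index` or `uuid`")
        pynvml.nvmlInit()
        if index is not None:
            self._handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        else:
            uuid = uuid.encode() if isinstance(uuid, str) else uuid
            self._handle = pynvml.nvmlDeviceGetHandleByUUID(uuid)

    @property
    def free_memory(self) -> int:
        return pynvml.nvmlDeviceGetMemoryInfo(self._handle).free

    @property
    def total_memory(self) -> int:
        return pynvml.nvmlDeviceGetMemoryInfo(self._handle).total

    @property
    def used_memory(self) -> int:
        return pynvml.nvmlDeviceGetMemoryInfo(self._handle).used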

So to begin this discussion I would like to know how people like @rjzamora and @kenhester feel about this idea. Would this be something that would fit in the scope of this project? Are there any impediments to adding such a library within the scope of this project/repository?

Also cc @quasiben for vis.

BUG: Output of `nvml.nvmlDeviceGetComputeRunningProcesses` reports `pid` as `usedGpuMemory` and `usedGpuMemory` as `pid`

Description

When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of all the processes using a particular GPU (in this case, GPU 0), I saw erroneous results. When compared with nvidia-smi in the terminal, usedGpuMemory contained the value of the process ID, while the pid field contained the used GPU memory rather than the process ID; the values were swapped. Sometimes other fields in the process object contained the process ID or GPU memory values, so the field values output by nvmlDeviceGetComputeRunningProcesses were shuffled overall. Investigation is warranted to ensure nvmlDeviceGetComputeRunningProcesses consistently provides correct output.

Code for reproducing the bug

import time

import pynvml.nvml as nvml
import multiprocess as mp
import torch

def main():
    event = mp.Event()
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    event.set()
    profiling_process.join()
    profiling_process.close()

def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break

def _multiprocess_task(num: int):
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape

Environment

torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03

What is help_query_gpu.txt for?

I'm excited to see this library being revived!

The help_query_gpu.txt file seems to actually be a tar.gz file containing:

pynvml/nvidia_smi.py
pynvml/PKG-INFO
pynvml/pynvml.py
pynvml/README.txt
pynvml/setup.py

What is this file for, and why is it included in the package data list in setup.py?

Failure On Install

I'm trying to install pynvml from source and am getting a strange error:

(cudf-dev) bzaitlen@dgx15:~/GitRepos/pynvml$ pip install -e .
Obtaining file:///home/nfs/bzaitlen/GitRepos/pynvml
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/nfs/bzaitlen/GitRepos/pynvml/setup.py", line 20, in <module>
        version=versioneer.get_version(),
      File "/home/nfs/bzaitlen/GitRepos/pynvml/versioneer.py", line 1480, in get_version
        return get_versions()["version"]
      File "/home/nfs/bzaitlen/GitRepos/pynvml/versioneer.py", line 1420, in get_versions
        assert cfg.tag_prefix is not None, "please set versioneer.tag_prefix"
    AssertionError: please set versioneer.tag_prefix
    ----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /home/nfs/bzaitlen/GitRepos/pynvml/

PR #13 adds versioneer, but maybe something wonky happened in the setup?

Drop requirements.txt

Currently this package has a requirements file that includes packages like pytest and pip. These probably aren't actual requirements, but are instead used for development.

I was confused for a moment and thought that these were actual requirements, but checked setup.py and setup.cfg and found that they weren't mentioned, so maybe this isn't an issue.

Maybe these should be renamed to requirements-dev.txt or something for clarity?

Not a big deal though, I was just briefly confused and thought I'd raise an issue. Please feel free to ignore.

[pytest] test_nvmlSystemGetDriverVersion() and test_nvmlSystemGetNVMLVersion() fail

Describe the bug
test_nvmlSystemGetDriverVersion() and test_nvmlSystemGetNVMLVersion() fail because the reported version number cannot be cast to a float, as the test cases expect.

Steps/Code to reproduce bug
Running pytest reports these failures in the pynvml/tests/test_nvml.py file:

________________________ test_nvmlSystemGetNVMLVersion _________________________

nvml = None

    def test_nvmlSystemGetNVMLVersion(nvml):
        vsn = 0.0
>       vsn = float(pynvml.nvmlSystemGetDriverVersion().decode())
E       ValueError: could not convert string to float: '440.33.01'

test_nvml.py:57: ValueError
_______________________ test_nvmlSystemGetDriverVersion ________________________

nvml = None

    def test_nvmlSystemGetDriverVersion(nvml):
        vsn = 0.0
>       vsn = float(pynvml.nvmlSystemGetDriverVersion().decode())
E       ValueError: could not convert string to float: '440.33.01'

test_nvml.py:71: ValueError
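One possible way to make these assertions robust (a sketch, not necessarily the fix the project will want) is to stop casting the whole dotted string to float and only check the leading component:

def test_nvmlSystemGetDriverVersion(nvml):
    vsn = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(vsn, bytes):
        vsn = vsn.decode()
    # '440.33.01' cannot be cast to float, but its major component can
    assert int(vsn.split(".")[0]) > 0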

NVML Permissions

I'm filing this issue not for pynvml to resolve, but rather to raise awareness among the pynvml devs of permissioning issues around NVML.

NVML can be administered in such a way as to prevent non-admin/root users from accessing parts of the library. This prevents NVML from querying performance counters. For example:

(base) bzaitlen@dgx15:~$ nvidia-smi nvlink -g 0
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-96ab329d-7a1f-73a8-a9b7-18b4b2855f92)
NVML: Unable to get the NvLink link utilization counter control for link 0: Insufficient Permissions

I found this page on admin permissioning, which explains how to set/unset admin privileges.

Another way to easily check the status of permissions is to look for RmProfilingAdminOnly in driver params:

(base) bzaitlen@dgx15:~$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 1

How to get the process pid in docker?

I get the process PID in Docker through nvmlDeviceGetComputeRunningProcesses(), but it is the PID on the host machine, which differs from the PID inside the Docker container. Is there a way to get the PID inside the container instead of the host PID?

Using versioneer

Would be good to use versioneer here to simplify the process of making a release. This would make sure the version matches that of the git tag without having to manually bump it in several places.

virtual GPU has brand number 10, which is not in the list

My results for
nvidia-smi --query-gpu=name --format=csv
are:

name
GRID V100DX-16Q

When running:
nvidia_smi.getInstance().DeviceQuery()

I get:

Traceback (most recent call last):
  File "/home/e161081/git/comps/qalgdlinfra/QAlgDLInfra/sandbox/daniel/allegro_examples/allegro_related_scripts/gpu_debug.py", line 5, in <module>
    a = nvidia_smi.getInstance().DeviceQuery()
  File "/home/e161081/git/comps/qalgdlinfra/venv/lib/python3.8/site-packages/pynvml/smi.py", line 1886, in DeviceQuery
    brandName = NVSMI_BRAND_NAMES[nvmlDeviceGetBrand(handle)]
KeyError: 10
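A hypothetical guard in smi.py (illustrative, not the actual fix) would be to fall back to a generic label when nvmlDeviceGetBrand() returns an ID missing from NVSMI_BRAND_NAMES, such as the 10 reported for this vGPU:

brand_id = nvmlDeviceGetBrand(handle)
# .get() avoids the KeyError for brand IDs the table does not know about
brandName = NVSMI_BRAND_NAMES.get(brand_id, "Unknown")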

nvmlDeviceGetName throws UnicodeDecodeError invalid start byte

Running the following code on WSL2 throws the error mentioned in the title:

from pynvml import *

nvmlInit()  # initialize NVML before querying devices
handle = nvmlDeviceGetHandleByIndex(0)
print(nvmlDeviceGetName(handle))

Stacktrace:

File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.9/site-packages/pynvml/nvml.py", line 1744, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Whereas nvidia-smi command returns info without issues:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03              Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   35C    P8             16W /  370W |     947MiB /  24576MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

If I try to decode the output of nvmlDeviceGetName using utf-16 codec, this is the string:
'闸膠\uf88e肑要郸膐\uf889낑ꂀ釸膠\uf8a5ꂜ꾁駸膐\uf8a3ꂔꂀ雸膀\uf894낌ꂀ軸肐グ'

pynvml version 11.5.0

possibly integrating pynvx for Darwin as a substitute

One of the ipyexperiments users said pynvml didn't work on Darwin, but pynvx offers the same API, so we made a transparent wrapper to make things work.

This is totally up to you, but I thought perhaps this can be integrated too:
https://github.com/stas00/ipyexperiments/blob/a60a23141e2ecb11c0554f8be535fe6ccb831604/ipyexperiments/utils/pynvml_gate.py

I'd totally understand if you'd say that this is outside the scope of pynvml, in which case please feel free to close this issue as I have a workaround in place.

Moreover, I realize that if pynvml is not available, there is nothing that can really be done inside pynvml itself, other than perhaps generating a dummy pynvml package on Darwin that declares pynvx as a dependency and loads it transparently as if it were pynvml.

What do you think?

I have no such need myself since I'm not on Darwin.

When failing on NVMLError exception, bug in handling

When this line in smi.py fails with an exception:

nvmlDeviceGetSupportedMemoryClocks(handle)

the following line fails with "TypeError: list indices must be integers or slices, not str":

except NVMLError as err:
    supportedClocks['Error'] = nvidia_smi.__handleError(err)

because supportedClocks is defined as a list.
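One possible sketch of a fix (assuming smi.py keeps supportedClocks as a list) is to append the error entry instead of indexing the list with a string key:

try:
    memClocks = nvmlDeviceGetSupportedMemoryClocks(handle)
except NVMLError as err:
    # supportedClocks is a list, so record the error as a list entry
    supportedClocks.append({'Error': nvidia_smi.__handleError(err)})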

Does pynvml support `nvidia-smi --query-gpu=reset_status.reset_required`?

We are trying to find out what the pynvml equivalent is for the following command:

nvidia-smi --query-gpu=reset_status.reset_required --format=csv

We saw that most nvidia-smi --query-gpu= attributes are supported in the smi module, but the reset_status.reset_required attribute is not in the supported NVSMI_QUERY_GPU list. Could you let us know whether it is possible to get reset_status.reset_required from pynvml? If not, do you plan to support it in the future? Thank you!

Build multi-figure dashboard

In https://github.com/rjzamora/jupyterlab-bokeh-server/tree/pynvml @rjzamora has a Bokeh server with several individual pages, each with a single full-page figure. These pages were designed to be embedded into JupyterLab.

Some consumers of this information may not want to use JupyterLab (for example, all of the CUDA programmers out there) and so we may also want to have a more dashboardy page that includes several plots laid out nicely. I imagine that this might increase the use of these dashboards.

This requires someone to take a look at the plots that we currently have and then arrange them nicely onto the page using either Bokeh layout, or just standard HTML/CSS.

cc @jacobtomlinson, you might find this interesting to get your feet wet with GPU things.

Readme.md import example

Hello,

The example that imports the module (from pynvml import *) should instead use import pynvml, or import a specific function or class, to follow Python best practices.

The form from module import * can cause namespace collisions in people's code and should be avoided whenever possible.
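For example, the README snippet could use an explicit namespace, something along these lines:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print(pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()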

Uninitialized Error

This could be a Dask issue, but I am getting an error related to pynvml when launching a Dask cluster:

distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/nanny.py", line 674, in run
    await worker
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 1016, in start
    await self._register_with_scheduler()
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 811, in _register_with_scheduler
    metrics=await self.get_metrics(),
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 740, in get_metrics
    result = await result
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 3406, in gpu_metric
    result = yield offload(nvml.real_time)
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/utils.py", line 1489, in offload
    return (yield _offload_executor.submit(fn, *args, **kwargs))
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in real_time
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in <listcomp>
    "utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 1347, in nvmlDeviceGetUtilizationRates
    check_return(ret)
  File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 366, in check_return
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized

cc @mrocklin
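For context, NVMLError_Uninitialized generally means nvmlInit() was never called in the process doing the query. A minimal guard (just a sketch, not necessarily how distributed should solve it) would be:

import pynvml

_nvml_initialized = False

def _ensure_nvml():
    """Initialize NVML once per process before any device queries."""
    global _nvml_initialized
    if not _nvml_initialized:
        pynvml.nvmlInit()
        _nvml_initialized = True

def gpu_utilization():
    _ensure_nvml()
    count = pynvml.nvmlDeviceGetCount()
    return [
        pynvml.nvmlDeviceGetUtilizationRates(pynvml.nvmlDeviceGetHandleByIndex(i)).gpu
        for i in range(count)
    ]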

How to get the maximally used GPU memory during a period?

Dear all,

Thanks very much for sharing such a great tool.

When running a deep learning model, I want to measure the maximum GPU memory used during a period (e.g., 10 minutes).

How can I achieve this goal?

Any comments would be highly appreciated.

Kindest regards,
Jun
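A simple approach with pynvml (a sketch only; polling can miss short-lived spikes) is to sample nvmlDeviceGetMemoryInfo in a loop and keep the maximum:

import time
import pynvml

def poll_peak_used_memory(device_index=0, duration_s=600, interval_s=1.0):
    """Sample used GPU memory for duration_s seconds and return the peak in bytes."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    peak = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        peak = max(peak, pynvml.nvmlDeviceGetMemoryInfo(handle).used)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return peak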

`nvmlSystemGetProcessName` fails for WSL

nvmlSystemGetProcessName https://docs.nvidia.com/deploy/nvml-api/group__nvmlSystemQueries.html#group__nvmlSystemQueries_1gf37b04cea12ef2fcf6a463fed1b983b2

Based on the documentation, it returns a string encoded in ANSI.

However, it is automatically decoded as UTF-8:

return res.decode()

For this reason the function fails on WSL, where the returned process name is ANSI-encoded. ANSI encoding is not well supported on Linux, though.

A proposed solution would be to add a special case for Windows (one already exists in other parts of the library) or to set errors="ignore" for this specific function call (as it appears to be the only one documented as ANSI-encoded).
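A rough sketch of that second option (names are illustrative; res stands for the raw bytes returned by the underlying NVML call):

def _decode_process_name(res: bytes) -> str:
    try:
        return res.decode()
    except UnicodeDecodeError:
        # lossy fallback for ANSI-encoded process names (e.g. on WSL)
        return res.decode(errors="ignore")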

Minimal Example

from pynvml.smi import nvidia_smi as smi

import torch
torch.randn(1000).to("cuda")
_instance = smi.getInstance()
# fails here
device = _instance.DeviceQuery()
device["gpu"][0]["processes"]

Wrong exn for wsl

For unsupported WSL calls, downstream projects expect an Unsupported exception but receive Unknown, and thus fail fast instead of proceeding with partial NVIDIA support; see rapidsai/cudf#9955 and dask/distributed#5628.

Instead, WSL calls should recognize that they're unsupported and pass that along. I'm guessing this may be difficult, based on the linked issues.

This falls under the general pynvml wsl support issue: #26

Windows 10 location of Nvidia-smi and nvml

On my current version of Windows 10 with driver version 461.55 (Quadro RTX 4000), nvml.dll (and nvidia-smi) are not in the location that this library expects:

nvml_lib = CDLL(os.path.join(os.getenv("ProgramFiles", "C:/Program Files"), "NVIDIA Corporation/NVSMI/nvml.dll"))

PS C:\> Get-Command nvidia-smi

CommandType     Name                                               Version    Source
-----------     ----                                               -------    ------
Application     nvidia-smi.exe                                     10.0.1001… C:\WINDOWS\system32\nvidia-smi.exe

PS C:\> Get-Command nvml.dll

CommandType     Name                                               Version    Source
-----------     ----                                               -------    ------
Application     nvml.dll                                           10.0.1001… C:\WINDOWS\system32\nvml.dll

Does anyone know if this is where the NVIDIA drivers will put nvidia-smi and nvml.dll going forward, or is this something specific to my setup?

In either case, would it be useful to have an optional parameter passed into nvmlInit() with the location of nvml.dll?

something along the lines of:

def nvmlInit(nvmlPath:str = None):
    """Initialize NVML.
    Uses nvmlInit_v2() from the underlying NVML library.
    Args:
        nvmlPath: the absolute path to the nvml library that will be used instead of looking in standard known locations.
    Returns:
        None
    """
    def _load_nvml_library(nvmlPath = None):
        """
        Load the library if it isn't loaded already
        """
        global nvml_lib

        if (nvml_lib == None):
            # lock to ensure only one caller loads the library
            lib_load_lock.acquire()

            try:
                # ensure the library still isn't loaded
                if (nvml_lib == None):
                    try:
                        if (nvmlPath != None):
                            # load nvml from the supplied path
                            nvml_lib = CDLL(nvmlPath)
                        elif (sys.platform[:3] == "win"):
                            # cdecl calling convention
                            # load nvml.dll from %ProgramFiles%/NVIDIA Corporation/NVSMI/nvml.dll
                            nvml_lib = CDLL(os.path.join(os.getenv("ProgramFiles", "C:/Program Files"), "NVIDIA Corporation/NVSMI/nvml.dll"))
                        else:
                            # assume linux
                            nvml_lib = CDLL("libnvidia-ml.so.1")
                    except OSError as ose:
                        check_return(NVML_ERROR_LIBRARY_NOT_FOUND)
                    if (nvml_lib == None):
                        check_return(NVML_ERROR_LIBRARY_NOT_FOUND)
            finally:
                # lock is always freed
                lib_load_lock.release()
    _load_nvml_library(nvmlPath = nvmlPath)

    #
    # Initialize the library
    #
    fn = get_func_pointer("nvmlInit_v2")
    ret = fn()
    check_return(ret)

    # Atomically update refcount
    global nvml_lib_refcount
    lib_load_lock.acquire()
    nvml_lib_refcount += 1
    lib_load_lock.release()
    return None

Support for WSL (Windows Subsystem for Linux)

Opening this issue to track WSL (Windows Subsystem for Linux). Currently this is not supported as NVML is not supported on WSL (based on reading this note from the NVIDIA WSL docs).

  1. NVIDIA Management Library (NVML) APIs are not supported. Consequently, nvidia-smi may not be functional in WSL 2.

However, once NVML is supported on WSL, it would be nice to try using/testing pynvml there.

For more general RAPIDS support of WSL, please see issue ( rapidsai/cudf#28 ).

undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

When I run the following code to get GPU process information:

import psutil
import pynvml  # import the NVML bindings

UNIT = 1024 * 1024

pynvml.nvmlInit()
gpuDeriveInfo = pynvml.nvmlSystemGetDriverVersion()

gpuDeviceCount = pynvml.nvmlDeviceGetCount()

for i in range(gpuDeviceCount):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)  # get the handle for GPU i
    pidAllInfo = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)  # processes running on GPU i
    for pidInfo in pidAllInfo:
        pidUser = psutil.Process(pidInfo.pid).username()
        print("process pid:", pidInfo.pid, "user:", pidUser,
              "GPU memory used:", pidInfo.usedGpuMemory / UNIT, "MB")  # memory used by this pid

pynvml.nvmlShutdown()  # shut down NVML when finished

but I get errors like this:

Traceback (most recent call last):
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 782, in _nvmlGetFunctionPointer
    _nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/ctypes/__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/ctypes/__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/nvidia-430/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gpu_info.py", line 21, in <module>
    pidAllInfo = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)#获取所有GPU上正在运行的进程信息
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 2223, in nvmlDeviceGetComputeRunningProcesses
    return nvmlDeviceGetComputeRunningProcesses_v2(handle);
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
  File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 785, in _nvmlGetFunctionPointer
    raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found

Here is the nvidia-smi information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 20%   26C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 20%   28C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 20%   24C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 20%   27C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The version of nvidia-ml-py is 11.495.46. So why did this happen?

[pytest] nvlink related tests fail on a machine without nvlink

Describe the bug
I see that the tests for NVLink-related APIs fail on a machine without NVLink, e.g. test_nvml_nvlink_properties(). Looking at the return code pynvml.nvml.NVMLError_NotSupported, it is clear that the failure is due to the absence of NVLink.
Opening this issue to check whether there is a better way to handle these in the tests, or whether it's too much work to bother about; one possible approach is sketched after the failure output below.

Steps/Code to reproduce bug
pytest reports these kinds of failures for NVLink-related test cases:

__________________________________ test_nvml_nvlink_properties ___________________________________

ngpus = 2
handles = [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfc80>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfb70>]

    def test_nvml_nvlink_properties(ngpus, handles):
        for i in range(ngpus):
            for j in range(pynvml.NVML_NVLINK_MAX_LINKS):
>               version = pynvml.nvmlDeviceGetNvLinkVersion(handles[i], j)

test_nvml.py:238:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:2021: in nvmlDeviceGetNvLinkVersion
    check_return(ret)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ret = 3

    def check_return(ret):
        if (ret != NVML_SUCCESS):
>           raise NVMLError(ret)
E           pynvml.nvml.NVMLError_NotSupported: Not Supported

../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:366: NVMLError_NotSupported
------------------------------------- Captured stdout setup --------------------------------------
[2 GPUs]
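One possible approach (a sketch of how the test could skip gracefully, not necessarily what the maintainers prefer) is to catch NVMLError_NotSupported and call pytest.skip:

import pytest
import pynvml

def test_nvml_nvlink_properties(ngpus, handles):
    for i in range(ngpus):
        for j in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                version = pynvml.nvmlDeviceGetNvLinkVersion(handles[i], j)
            except pynvml.nvml.NVMLError_NotSupported:
                pytest.skip("NVLink is not supported on this machine")
            assert version > 0  # illustrative assertion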
