The Website for the GPU Open Analytics Initiative (GoAi)
See it at www.gpuopenanalytics.com.
Provide Python access to the NVML library for GPU diagnostics
License: BSD 3-Clause "New" or "Revised" License
I understand that this is a Python binding to NVML, which ignores `CUDA_VISIBLE_DEVICES`, but perhaps this setting could be respected in pynvml? Otherwise we end up with inconsistent behavior between PyTorch (TensorFlow too?) and pynvml.
For example, on this setup I have card 0 (24GB) and card 1 (8GB).
If I run:
CUDA_VISIBLE_DEVICES=1 python -c "import pynvml; pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(0); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
25447170048
which is the output for card 0, even though I was expecting output for card 1. The expected output is what I get if I explicitly pass the system ID to NVML:
python -c "import pynvml;pynvml.nvmlInit(); handle = pynvml.nvmlDeviceGetHandleByIndex(1); print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)"
8513978368
So the first snippet gives me the wrong card: I get card 0, rather than card 1 indexed as 0th. The conflict with PyTorch happens when I call `id = torch.cuda.current_device()`, which returns 0 with `CUDA_VISIBLE_DEVICES="1"`. I hope it is clear where my problem lies.
pynvml could respect `CUDA_VISIBLE_DEVICES` if the latter is set.
Of course, if this is attempted, we can't just change the default behavior, as that would break people's code. But perhaps, if `pynvml.nvmlInit(respect_cuda_visible_devices=True)` is passed, it could transparently remap the `id` argument of `nvmlDeviceGetHandleByIndex` to the corresponding ID in `CUDA_VISIBLE_DEVICES`. So in the very first snippet above, `nvmlDeviceGetHandleByIndex(0)` would actually be called with `id=1`, since that is the 0th entry relative to `CUDA_VISIBLE_DEVICES="1"`. The `nvmlDeviceGetHandleByIndex()` argument would then become an index with respect to `CUDA_VISIBLE_DEVICES`; e.g. `CUDA_VISIBLE_DEVICES="1,0"` would reverse the IDs.
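The proposed remapping boils down to an index lookup into the parsed environment variable. A minimal sketch (the helper name `remap_device_index` is hypothetical, and UUID-style entries in `CUDA_VISIBLE_DEVICES` are not handled):

```python
import os

def remap_device_index(index):
    """Map a CUDA-visible device index to the physical NVML index,
    honoring CUDA_VISIBLE_DEVICES when it is set."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        # env var unset or empty: CUDA and NVML indices already match
        return index
    ids = [int(i) for i in visible.split(",")]
    return ids[index]
```

With `CUDA_VISIBLE_DEVICES="1"`, `remap_device_index(0)` returns 1; with `"1,0"` the two indices are reversed.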
Thank you!
Meanwhile I added the following workaround to my software:
import os
[...]
if id is None:
    id = torch.cuda.current_device()
# if CUDA_VISIBLE_DEVICES is used, automagically remap the id since pynvml ignores this env var
if "CUDA_VISIBLE_DEVICES" in os.environ:
    ids = list(map(int, os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")))
    id = ids[id]  # remap
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(id)
[...]
If someone needs this as a helper wrapper, you can find it here:
https://github.com/stas00/ipyexperiments/blob/3db0bbac2e2e6f1873b105953d9a7b3b7ca491b1/ipyexperiments/utils/mem.py#L33
I am interested in two nvlink capabilities I call from the command line:
nvidia-smi nvlink -g 1 -i 2
nvidia-smi nvlink -r 1
The latter resets all the counters of type 1 on all GPUs.
Could these calls be exposed in pynvml?
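The NVML C API does provide matching entry points (`nvmlDeviceGetNvLinkUtilizationCounter` and `nvmlDeviceResetNvLinkUtilizationCounter`), so a Python wrapper for the reset case could be sketched as below. This is a sketch, not the library's API: the `nvml` module is injected only so the helper can be exercised with a stub when no NVLink hardware is present, and `max_links=6` assumes the V100-era `NVML_NVLINK_MAX_LINKS` value.

```python
def reset_nvlink_counters(nvml, handle, counter=1, max_links=6):
    """Reset NVLink utilization counters of the given type on one GPU,
    roughly mirroring `nvidia-smi nvlink -r <counter>` for that device.

    Links that are absent or don't support counters are skipped.
    Returns the list of link indices that were actually reset.
    """
    reset = []
    for link in range(max_links):
        try:
            nvml.nvmlDeviceResetNvLinkUtilizationCounter(handle, link, counter)
            reset.append(link)
        except Exception:
            # e.g. NVMLError_NotSupported / invalid link on this GPU
            continue
    return reset
```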
Here is what the nvidia-smi API in this library produces for a single GPU. Which of this information is useful, and what might we want to build dashboards out of?
In [1]: import nvidia_smi
In [2]: nvsmi = nvidia_smi.nvidia_smi.getInstance()
In [3]: nvsmi.DeviceQuery()['gpu'][0]
Out[3]:
{'id': '0000:06:00.0',
'product_name': 'Tesla V100-SXM2-32GB',
'product_brand': 'Tesla',
'display_mode': 'Enabled',
'display_active': 'Disabled',
'persistence_mode': 'Enabled',
'accounting_mode': 'Disabled',
'accounting_mode_buffer_size': '4000',
'driver_model': {'current_dm': 'N/A', 'pending_dm': 'N/A'},
'serial': '0321918171737',
'uuid': 'GPU-96ab329d-7a1f-73a8-a9b7-18b4b2855f92',
'minor_number': '0',
'vbios_version': '88.00.43.00.04',
'multigpu_board': 'No',
'board_id': '0x600',
'inforom_version': {'img_version': 'G503.0204.00.02',
'oem_object': '1.1',
'ecc_object': '5.0',
'pwr_object': 'N/A'},
'gpu_operation_mode': {'current_gom': 'N/A', 'pending_gom': 'N/A'},
'pci': {'pci_bus': '06',
'pci_device': '00',
'pci_domain': '0000',
'pci_device_id': '1DB510DE',
'pci_bus_id': '0000:06:00.0',
'pci_sub_system_id': '124910DE',
'pci_gpu_link_info': {'pcie_gen': {'max_link_gen': '3',
'current_link_gen': '3'},
'link_widths': {'max_link_width': '16x', 'current_link_width': '16x'}},
'pci_bridge_chip': {'bridge_chip_type': 'N/A', 'bridge_chip_fw': 'N/A'},
'replay_counter': '0',
'tx_util': 0,
'tx_util_unit': 'KB/s',
'rx_util': 0,
'rx_util_unit': 'KB/s'},
'fan_speed': 'N/A',
'fan_speed_unit': '%',
'performance_state': 'P0',
'clocks_throttle': {'clocks_throttle_reason_gpu_idle': 'Active',
'clocks_throttle_reason_applications_clocks_setting': 'Not Active',
'clocks_throttle_reason_sw_power_cap': 'Not Active',
'clocks_throttle_reason_hw_slowdown': 'Not Active',
'clocks_throttle_reason_unknown': 'N/A'},
'fb_memory_usage': {'total': 32510.5,
'used': 0.0,
'free': 32510.5,
'unit': 'MiB'},
'bar1_memory_usage': {'total': 32768.0,
'used': 2.50390625,
'free': 32765.49609375,
'unit': 'MiB'},
'compute_mode': 'Default',
'utilization': {'gpu_util': 0,
'memory_util': 0,
'encoder_util': 0,
'decoder_util': 0,
'unit': '%'},
'ecc_mode': {'current_ecc': 'Enabled', 'pending_ecc': 'Enabled'},
'ecc_errors': {'volatile': {'single_bit': {'device_memory': 0,
'register_file': 0,
'l1_cache': 0,
'l2_cache': 0,
'texture_memory': 'N/A',
'total': '0'},
'double_bit': {'device_memory': 0,
'register_file': 0,
'l1_cache': 0,
'l2_cache': 0,
'texture_memory': 'N/A',
'total': '0'}},
'aggregate': {'single_bit': {'device_memory': 0,
'register_file': 0,
'l1_cache': 0,
'l2_cache': 0,
'texture_memory': 'N/A',
'total': '0'},
'double_bit': {'device_memory': 0,
'register_file': 0,
'l1_cache': 0,
'l2_cache': 0,
'texture_memory': 'N/A',
'total': '0'}}},
'retired_pages': {'multiple_single_bit_retirement': None,
'double_bit_retirement': None,
'pending_retirement': 'No'},
'temperature': {'gpu_temp': 31,
'gpu_temp_max_threshold': 90,
'gpu_temp_slow_threshold': 87,
'unit': 'C'},
'power_readings': {'power_management': 'Supported',
'power_draw': 43.235,
'power_limit': 300.0,
'default_power_limit': 300.0,
'enforced_power_limit': 300.0,
'min_power_limit': 150.0,
'max_power_limit': 300.0,
'power_state': 'P0',
'unit': 'W'},
'clocks': {'graphics_clock': 135,
'sm_clock': 135,
'mem_clock': 877,
'unit': 'MHz'},
'applications_clocks': {'graphics_clock': 1290,
'mem_clock': 877,
'unit': 'MHz'},
'default_applications_clocks': {'graphics_clock': 1290,
'mem_clock': 877,
'unit': 'MHz'},
'max_clocks': {'graphics_clock': 1530,
'sm_clock': 1530,
'mem_clock': 877,
'unit': 'MHz'},
'clock_policy': {'auto_boost': 'N/A', 'auto_boost_default': 'N/A'},
'supported_clocks': [{'current': 877,
'unit': 'MHz',
'supported_graphics_clock': [1530,
1522,
1515,
1507,
1500,
1492,
1485,
1477,
1470,
1462,
1455,
1447,
1440,
1432,
1425,
1417,
1410,
1402,
1395,
1387,
1380,
1372,
1365,
1357,
1350,
1342,
1335,
1327,
1320,
1312,
1305,
1297,
1290,
1282,
1275,
1267,
1260,
1252,
1245,
1237,
1230,
1222,
1215,
1207,
1200,
1192,
1185,
1177,
1170,
1162,
1155,
1147,
1140,
1132,
1125,
1117,
1110,
1102,
1095,
1087,
1080,
1072,
1065,
1057,
1050,
1042,
1035,
1027,
1020,
1012,
1005,
997,
990,
982,
975,
967,
960,
952,
945,
937,
930,
922,
915,
907,
900,
892,
885,
877,
870,
862,
855,
847,
840,
832,
825,
817,
810,
802,
795,
787,
780,
772,
765,
757,
750,
742,
735,
727,
720,
712,
705,
697,
690,
682,
675,
667,
660,
652,
645,
637,
630,
622,
615,
607,
600,
592,
585,
577,
570,
562,
555,
547,
540,
532,
525,
517,
510,
502,
495,
487,
480,
472,
465,
457,
450,
442,
435,
427,
420,
412,
405,
397,
390,
382,
375,
367,
360,
352,
345,
337,
330,
322,
315,
307,
300,
292,
285,
277,
270,
262,
255,
247,
240,
232,
225,
217,
210,
202,
195,
187,
180,
172,
165,
157,
150,
142,
135]}],
'processes': None,
'accounted_processes': None}
PyNVML bindings are great for doing all GPU information management from Python, but they are almost entirely an identical copy of the C API. This can be a barrier for Python users, who need to find out from the NVML API documentation what the API provides, what the appropriate types to pass are, etc. We currently utilize PyNVML in both Distributed and Dask-CUDA, but there's also some overlap that leads to code duplication.
I feel one way to reduce code duplication and make it easier for new users, and thus make things overall better, is to provide a "High-level PyNVML library" that takes care of the basic needs for users. For example, I would imagine something like the following (but not limited to) to be available (implementation omitted for simplicity):
from typing import Optional, Union

class Handle:
    """A handle to a GPU device.

    Parameters
    ----------
    index: int, optional
        Integer representing the CUDA device index to get a handle to.
    uuid: bytes or str, optional
        UUID of a CUDA device to get a handle to.

    Raises
    ------
    ValueError
        If neither `index` nor `uuid` is specified, or if both are specified.
    """
    def __init__(
        self, index: Optional[int] = None, uuid: Optional[Union[bytes, str]] = None
    ): ...

    @property
    def free_memory(self) -> int:
        """Free memory of the CUDA device."""

    @property
    def total_memory(self) -> int:
        """Total memory of the CUDA device."""

    @property
    def used_memory(self) -> int:
        """Used memory of the CUDA device."""
There would be more than the above to be covered, such as getting the number of available GPUs in the system, whether a GPU has a context currently created, if a handle is MIG or physical GPU, etc. Additionally, we would have simple tools that are generally useful, for example a small tool I wrote long ago to measure NVLink bandwidth and peak memory, and whatever else fits in the scope of a "High-level PyNVML library" that can make our users' lives easier.
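For concreteness, here is a minimal sketch of how the memory properties could delegate to the existing low-level bindings. The `nvml` constructor argument is only there so the sketch can be exercised without a GPU; a real implementation would call into pynvml directly and would also need to handle NVML initialization.

```python
from typing import Optional, Union

class SimpleHandle:
    """Sketch of the proposed high-level handle (memory queries only)."""

    def __init__(self, index: Optional[int] = None,
                 uuid: Optional[Union[bytes, str]] = None, nvml=None):
        if (index is None) == (uuid is None):
            raise ValueError("specify exactly one of `index` or `uuid`")
        self._nvml = nvml
        if index is not None:
            self._handle = nvml.nvmlDeviceGetHandleByIndex(index)
        else:
            self._handle = nvml.nvmlDeviceGetHandleByUUID(uuid)

    @property
    def free_memory(self) -> int:
        """Free memory of the CUDA device, in bytes."""
        return self._nvml.nvmlDeviceGetMemoryInfo(self._handle).free

    @property
    def total_memory(self) -> int:
        """Total memory of the CUDA device, in bytes."""
        return self._nvml.nvmlDeviceGetMemoryInfo(self._handle).total

    @property
    def used_memory(self) -> int:
        """Used memory of the CUDA device, in bytes."""
        return self._nvml.nvmlDeviceGetMemoryInfo(self._handle).used
```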
So to begin this discussion I would like to know how people like @rjzamora and @kenhester feel about this idea. Would this be something that would fit in the scope of this project? Are there any impediments to adding such a library within the scope of this project/repository?
Also cc @quasiben for vis.
When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of all the processes using a particular GPU (in this case, GPU 0), I saw erroneous results being reported. When compared with nvidia-smi in the terminal, usedGpuMemory contained the value of the process ID, while the pid field, rather than containing the process ID, contained the used GPU memory: the values were swapped. Sometimes other fields in the process object contained the process ID or GPU memory values, so the field values of the objects returned by nvmlDeviceGetComputeRunningProcesses were shuffled overall. Investigation is warranted to ensure nvmlDeviceGetComputeRunningProcesses consistently provides correct output.
import time

import multiprocess as mp
import pynvml.nvml as nvml
import torch

def main():
    event = mp.Event()
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    event.set()
    profiling_process.join()
    profiling_process.close()

def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break

def _multiprocess_task(num: int):
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape

if __name__ == '__main__':
    main()
torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03
I'm excited to see this library being revived!
The help_query_gpu.txt
file seems to actually be a tar.gz file containing:
pynvml/nvidia_smi.py
pynvml/PKG-INFO
pynvml/pynvml.py
pynvml/README.txt
pynvml/setup.py
What is this file for, and why is it included in the package data list in setup.py?
I'm trying to install pynvml from source and getting a strange error:
(cudf-dev) bzaitlen@dgx15:~/GitRepos/pynvml$ pip install -e .
Obtaining file:///home/nfs/bzaitlen/GitRepos/pynvml
ERROR: Complete output from command python setup.py egg_info:
ERROR: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/nfs/bzaitlen/GitRepos/pynvml/setup.py", line 20, in <module>
version=versioneer.get_version(),
File "/home/nfs/bzaitlen/GitRepos/pynvml/versioneer.py", line 1480, in get_version
return get_versions()["version"]
File "/home/nfs/bzaitlen/GitRepos/pynvml/versioneer.py", line 1420, in get_versions
assert cfg.tag_prefix is not None, "please set versioneer.tag_prefix"
AssertionError: please set versioneer.tag_prefix
----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /home/nfs/bzaitlen/GitRepos/pynvml/
PR #13 adds versioneer, but maybe something wonky happened in the setup?
Currently this package has a requirements file that includes packages like pytest and pip. These probably aren't actual requirements, but are instead used for development.
I was confused for a moment and thought that these were actual requirements, but checked setup.py and setup.cfg and found that they weren't mentioned, so maybe this isn't an issue.
Maybe these should be renamed to requirements-dev.txt or something for clarity?
Not a big deal though, I was just briefly confused and thought I'd raise an issue. Please feel free to ignore.
Describe the bug
test_nvmlSystemGetDriverVersion() and test_nvmlSystemGetNVMLVersion() fail because the reported version number cannot be cast to float as the test cases expect.
Steps/Code to reproduce bug
Executing the pytest reports these failures in pynvml/tests/test_nvml.py
test file -
________________________ test_nvmlSystemGetNVMLVersion _________________________
nvml = None
def test_nvmlSystemGetNVMLVersion(nvml):
vsn = 0.0
> vsn = float(pynvml.nvmlSystemGetDriverVersion().decode())
E ValueError: could not convert string to float: '440.33.01'
test_nvml.py:57: ValueError
_______________________ test_nvmlSystemGetDriverVersion ________________________
nvml = None
def test_nvmlSystemGetDriverVersion(nvml):
vsn = 0.0
> vsn = float(pynvml.nvmlSystemGetDriverVersion().decode())
E ValueError: could not convert string to float: '440.33.01'
test_nvml.py:71: ValueError
I'm filing this issue not for pynvml to resolve, but rather to raise awareness among the pynvml devs of permissioning issues around NVML.
NVML can be administered in such a way as to prevent non-admin/root users from accessing the library. This would prevent NVML from querying performance counters. For example:
(base) bzaitlen@dgx15:~$ nvidia-smi nvlink -g 0
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-96ab329d-7a1f-73a8-a9b7-18b4b2855f92)
NVML: Unable to get the NvLink link utilization counter control for link 0: Insufficient Permissions
I found this page on admin permissioning which explains how to set/unset admin privileges.
Another way to easily check the status of permissions is to look for RmProfilingAdminOnly
in driver params:
(base) bzaitlen@dgx15:~$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 1
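The same check can be done from Python by parsing the params file. A small sketch (the helper name is hypothetical; parsing is kept pure so it can be exercised without the driver present):

```python
def profiling_admin_only(params_text):
    """Return True if RmProfilingAdminOnly is set to 1 in the text of
    /proc/driver/nvidia/params (lines look like 'Key: value')."""
    for line in params_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "RmProfilingAdminOnly":
            return value.strip() == "1"
    return False  # key absent: assume unrestricted

# typical use on a Linux host with the NVIDIA driver loaded:
# with open("/proc/driver/nvidia/params") as f:
#     print(profiling_admin_only(f.read()))
```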
I get the process PID in Docker through nvmlDeviceGetComputeRunningProcesses(), but it is the PID on the host machine, which is different from the PID inside the container. Is there a way to get the PID inside Docker instead of the PID on the host machine?
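One possible approach on Linux is to translate the host PID via the `NSpid:` line in `/proc/<pid>/status`, which lists the PID in each nested PID namespace (kernel >= 4.1). The parsing is shown as a pure helper (name hypothetical); it has to run on the host, or with the host's /proc visible:

```python
def pids_in_namespaces(status_text):
    """Extract the NSpid entries from the text of /proc/<pid>/status.

    The first entry is the PID in the host (outermost) namespace; the
    last is the PID inside the innermost namespace (e.g. the container).
    Returns [] if the field is missing.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(p) for p in line.split()[1:]]
    return []

# typical use, run on the host, for a host PID reported by NVML:
# with open(f"/proc/{host_pid}/status") as f:
#     container_pid = pids_in_namespaces(f.read())[-1]
```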
Would be good to use versioneer here to simplify the process of making a release. This would make sure the version matches that of the git tag
without having to manually bump it in several places.
My results for
nvidia-smi --query-gpu=name --format=csv
are:
name
GRID V100DX-16Q
When running:
nvidia_smi.getInstance().DeviceQuery()
I get:
Traceback (most recent call last):
File "/home/e161081/git/comps/qalgdlinfra/QAlgDLInfra/sandbox/daniel/allegro_examples/allegro_related_scripts/gpu_debug.py", line 5, in <module>
a = nvidia_smi.getInstance().DeviceQuery()
File "/home/e161081/git/comps/qalgdlinfra/venv/lib/python3.8/site-packages/pynvml/smi.py", line 1886, in DeviceQuery
brandName = NVSMI_BRAND_NAMES[nvmlDeviceGetBrand(handle)]
KeyError: 10
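The crash is an unguarded dict lookup: the driver reports a brand enum value (10 here) that the NVSMI_BRAND_NAMES table in smi.py predates; newer NVML headers added several vGPU-related brand values beyond the original 0-6 range. Until the table is extended, a tolerant lookup avoids the KeyError (the table below is a truncated stand-in for illustration):

```python
# truncated stand-in for pynvml.smi.NVSMI_BRAND_NAMES
NVSMI_BRAND_NAMES = {0: "Unknown", 1: "Quadro", 2: "Tesla", 3: "NVS",
                     4: "Grid", 5: "GeForce", 6: "Titan"}

def brand_name(brand_code):
    """Fall back to 'Unknown' for enum values newer than the table,
    instead of raising KeyError."""
    return NVSMI_BRAND_NAMES.get(brand_code, "Unknown")
```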
See https://conda-forge.org/#contribute
@rjzamora can you take a look at doing this? My hope is that the docs above make it somewhat straightforward. Either @quasiben or @jakirkham should be able to help if things get complex (which wouldn't be surprising).
Currently this is 8.0.1. It would be good to bump it to 8.0.2 to match the tag.
Running the following code on WSL2 throws the error mentioned in the title:
from pynvml import *
nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
print(nvmlDeviceGetName(handle))
Stacktrace:
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.9/site-packages/pynvml/nvml.py", line 1744, in wrapper
return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
Whereas the nvidia-smi command returns info without issues:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 35C P8 16W / 370W | 947MiB / 24576MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
If I try to decode the output of nvmlDeviceGetName using the utf-16 codec, this is the string:
'闸膠\uf88e肑要郸膐\uf889낑ꂀ釸膠\uf8a5ꂜ꾁駸膐\uf8a3ꂔꂀ雸膀\uf894낌ꂀ軸肐グ'
pynvml version 11.5.0
One of ipyexperiments' users said pynvml didn't work on Darwin, but there is pynvx, which has the same API, so we made a transparent wrapper to make things work.
This is totally up to you, but I thought perhaps this can be integrated too:
https://github.com/stas00/ipyexperiments/blob/a60a23141e2ecb11c0554f8be535fe6ccb831604/ipyexperiments/utils/pynvml_gate.py
I'd totally understand if you said this is outside the scope of pynvml, in which case please feel free to close this issue, as I have a workaround in place.
Moreover, I realize that if pynvml is not available then there is nothing that can really be done inside pynvml itself, other than perhaps publishing a dummy pynvml package on Darwin which declares pynvx as a dependency and loads it transparently as if it were pynvml.
What do you think?
I have no such need myself since I'm not on Darwin.
When nvmlDeviceGetSupportedMemoryClocks(handle) fails in smi.py with an exception, the following lines fail with "TypeError: list indices must be integers or slices, not str":
except NVMLError as err:
    supportedClocks['Error'] = nvidia_smi.__handleError(err)
because supportedClocks is defined as a list.
We are trying to find out what the pynvml equivalent is for the following command:
nvidia-smi --query-gpu=reset_status.reset_required --format=csv
We saw that most nvidia-smi --query-gpu= attributes are supported in the smi module, but the reset_status.reset_required attribute is not among the supported NVSMI_QUERY_GPU entries. Could you let us know whether it is possible to get reset_status.reset_required from pynvml? If not, do you plan to support it in the future? Thank you!
In https://github.com/rjzamora/jupyterlab-bokeh-server/tree/pynvml @rjzamora has a Bokeh server with several individual pages, each with a single full-page figure. These pages were designed to be embedded into JupyterLab.
Some consumers of this information may not want to use JupyterLab (for example, all of the CUDA programmers out there) and so we may also want to have a more dashboardy page that includes several plots laid out nicely. I imagine that this might increase the use of these dashboards.
This requires someone to take a look at the plots that we currently have and then arrange them nicely onto the page using either Bokeh layout, or just standard HTML/CSS.
cc @jacobtomlinson, you might find this interesting to get your feet wet with GPU things.
Hello,
In the example, the module import (from pynvml import *) should be import pynvml, or an import of the specific functions or classes needed, per Python best practices. The form from module import * can cause namespace collisions within people's code and should be avoided whenever possible.
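A quick standard-library illustration of that collision risk: after a star import, a builtin can be shadowed silently.

```python
from math import *  # pulls every public math name into this namespace

# The builtin pow(2, 2) returns the int 4, but math.pow has silently
# shadowed it and returns a float instead.
result = pow(2, 2)
print(type(result))  # <class 'float'>, not <class 'int'>
```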
This could be a dask issue but I am getting an error related to pynvml when launching a dask cluster:
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/nanny.py", line 674, in run
await worker
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 1016, in start
await self._register_with_scheduler()
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 811, in _register_with_scheduler
metrics=await self.get_metrics(),
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 740, in get_metrics
result = await result
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/worker.py", line 3406, in gpu_metric
result = yield offload(nvml.real_time)
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/utils.py", line 1489, in offload
return (yield _offload_executor.submit(fn, *args, **kwargs))
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/nfs/bzaitlen/miniconda3/envs/cudf-dev/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in real_time
"utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
File "/home/nfs/bzaitlen/GitRepos/distributed/distributed/diagnostics/nvml.py", line 11, in <listcomp>
"utilization": [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles],
File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 1347, in nvmlDeviceGetUtilizationRates
check_return(ret)
File "/home/nfs/bzaitlen/GitRepos/pynvml/pynvml/nvml.py", line 366, in check_return
raise NVMLError(ret)
pynvml.nvml.NVMLError_Uninitialized: Uninitialized
cc @mrocklin
Will it be possible to get GPU usage for a given PID from this binding?
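Partially, yes: NVML reports per-process memory via nvmlDeviceGetComputeRunningProcesses, so filtering that list by PID covers memory. A small sketch (the helper name is hypothetical; per-PID compute utilization, as opposed to memory, would instead need NVML's accounting-mode APIs):

```python
def used_gpu_memory_for_pid(processes, pid):
    """Sum usedGpuMemory (bytes) over process entries matching `pid`.

    `processes` is the list returned by
    pynvml.nvmlDeviceGetComputeRunningProcesses(handle).
    Returns None if the PID has no context on this device.
    """
    matches = [p.usedGpuMemory for p in processes if p.pid == pid]
    return sum(matches) if matches else None
```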
Dear all,
Thanks very much for sharing such a great tool.
When running a deep learning model, I want to measure the maximum GPU memory used during a period (e.g., 10 mins).
How can I achieve this goal?
Any comments would be highly appreciated.
Kindest regards,
Jun
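One common approach is a background loop that polls nvmlDeviceGetMemoryInfo(handle).used and keeps the running maximum. A sketch with the sampler passed in, so it can be exercised without a GPU; note that polling can miss very short allocation spikes:

```python
import time

def track_peak(sample_used_bytes, duration_s, interval_s=0.1):
    """Poll `sample_used_bytes()` for `duration_s` seconds and return the
    maximum value observed.

    With pynvml, the sampler would be something like
    lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used
    """
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, sample_used_bytes())
        time.sleep(interval_s)
    return peak
```

In practice this would run in a separate thread or process alongside the model, the same way nvidia-smi polls in a terminal.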
nvmlSystemGetProcessName
https://docs.nvidia.com/deploy/nvml-api/group__nvmlSystemQueries.html#group__nvmlSystemQueries_1gf37b04cea12ef2fcf6a463fed1b983b2
According to the documentation, it returns a string encoded in ANSI. However, it is automatically decoded as UTF-8:
Line 1744 in 43a7803
For this reason the function fails on WSL, where the returned process name is in ANSI; ANSI encoding is not well supported on Linux.
A proposed solution would be a special case for Windows (such cases already exist in other parts of the library), or setting errors="ignore" for this specific call (it appears to be the only function documented as returning ANSI).
Minimal Example
from pynvml.smi import nvidia_smi as smi
import torch
torch.randn(1000).to("cuda")
_instance = smi.getInstance()
# fails here
device = _instance.DeviceQuery()
device["gpu"][0]["processes"]
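Pending a proper fix in the library, callers can guard the decode themselves. A tolerant helper might look like this (the name is hypothetical):

```python
def safe_decode(raw, encoding="utf-8"):
    """Decode NVML's raw bytes, dropping undecodable bytes instead of
    raising UnicodeDecodeError (as seen with ANSI process names on WSL)."""
    if isinstance(raw, str):
        return raw
    return raw.decode(encoding, errors="ignore")
```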
For unsupported WSL calls, downstream projects expect the exception Unsupported but receive Unknown, and thus fail fast instead of proceeding with partial NVIDIA support: see rapidsai/cudf#9955 and dask/distributed#5628.
Instead, WSL calls should recognize that they are unsupported and pass that along. I'm guessing this may be difficult, based on the linked issues.
This falls under the general pynvml WSL support issue: #26
On my current version of windows 10 with driver version 461.55 (Quadro RTX 4000), nvml.dll (and nvidia-smi) are not in the location that this library is expecting.
Line 731 in 7c78212
PS C:\> Get-Command nvidia-smi
CommandType Name Version Source
----------- ---- ------- ------
Application nvidia-smi.exe 10.0.1001… C:\WINDOWS\system32\nvidia-smi.exe
PS C:\> Get-Command nvml.dll
CommandType Name Version Source
----------- ---- ------- ------
Application nvml.dll 10.0.1001… C:\WINDOWS\system32\nvml.dll
Does anyone know if this is where the NVIDIA drivers will be putting nvidia-smi and nvml going forward, or is this something specific to my setup?
In either case, would it be useful to have an optional parameter passed into nvmlInit() with the location of nvml.dll, something along the lines of:
def nvmlInit(nvmlPath: str = None):
    """Initialize NVML.

    Uses nvmlInit_v2() from the underlying NVML library.

    Args:
        nvmlPath: absolute path to the nvml library to use instead of
            looking in the standard known locations.

    Returns:
        None
    """
    _load_nvml_library(nvmlPath=nvmlPath)
    #
    # Initialize the library
    #
    fn = get_func_pointer("nvmlInit_v2")
    ret = fn()
    check_return(ret)
    # Atomically update refcount
    global nvml_lib_refcount
    lib_load_lock.acquire()
    nvml_lib_refcount += 1
    lib_load_lock.release()
    return None

def _load_nvml_library(nvmlPath=None):
    """
    Load the library if it isn't loaded already
    """
    global nvml_lib
    if nvml_lib is None:
        # lock to ensure only one caller loads the library
        lib_load_lock.acquire()
        try:
            # ensure the library still isn't loaded
            if nvml_lib is None:
                try:
                    if nvmlPath is not None:
                        # load nvml from the supplied path
                        nvml_lib = CDLL(nvmlPath)
                    elif sys.platform[:3] == "win":
                        # cdecl calling convention
                        # load nvml.dll from %ProgramFiles%/NVIDIA Corporation/NVSMI/nvml.dll
                        nvml_lib = CDLL(os.path.join(
                            os.getenv("ProgramFiles", "C:/Program Files"),
                            "NVIDIA Corporation/NVSMI/nvml.dll"))
                    else:
                        # assume linux
                        nvml_lib = CDLL("libnvidia-ml.so.1")
                except OSError:
                    check_return(NVML_ERROR_LIBRARY_NOT_FOUND)
                if nvml_lib is None:
                    check_return(NVML_ERROR_LIBRARY_NOT_FOUND)
        finally:
            # lock is always freed
            lib_load_lock.release()
Opening this issue to track WSL (Windows Subsystem for Linux). Currently this is not supported as NVML is not supported on WSL (based on reading this note from the NVIDIA WSL docs).
- NVIDIA Management Library (NVML) APIs are not supported. Consequently, nvidia-smi may not be functional in WSL 2.
However when NVML is supported on Windows, it would be nice to try using/testing pynvml on Windows.
For more general RAPIDS support of WSL, please see issue ( rapidsai/cudf#28 ).
When I run the following code to get GPU process information:
import psutil
import pynvml  # import the package

UNIT = 1024 * 1024

pynvml.nvmlInit()
gpuDeriveInfo = pynvml.nvmlSystemGetDriverVersion()
gpuDeviceCount = pynvml.nvmlDeviceGetCount()
for i in range(gpuDeviceCount):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)  # get the handle for GPU i; later calls go through it
    pidAllInfo = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)  # info on all processes running on this GPU
    for pidInfo in pidAllInfo:
        pidUser = psutil.Process(pidInfo.pid).username()
        print("process pid:", pidInfo.pid, "user:", pidUser,
              "GPU memory used:", pidInfo.usedGpuMemory / UNIT, "MB")  # memory used by this pid
pynvml.nvmlShutdown()  # finally, shut down the management library
but I get errors like this:
Traceback (most recent call last):
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 782, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/ctypes/__init__.py", line 361, in __getattr__
func = self.__getitem__(name)
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/ctypes/__init__.py", line 366, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/nvidia-430/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "gpu_info.py", line 21, in <module>
pidAllInfo = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)#获取所有GPU上正在运行的进程信息
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 2223, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v2(handle);
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 2191, in nvmlDeviceGetComputeRunningProcesses_v2
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v2")
File "/mnt/data0/home/dengjinhong/miniconda3/envs/python3/lib/python3.6/site-packages/pynvml.py", line 785, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.NVMLError_FunctionNotFound: Function Not Found
Here is the nvidia-smi
information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 20% 26C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 20% 28C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:82:00.0 Off | N/A |
| 20% 24C P8 9W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 20% 27C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
The version of nvidia-ml-py is 11.495.46. So why did this happen?
Dear
Pynvml is no longer in sync with NVML; more specifically, it is not possible to access the MIG APIs: https://docs.nvidia.com/deploy/nvml-api/group__nvmlMultiInstanceGPU.html#group__nvmlMultiInstanceGPU_1g15e07cc6230a2d90c5bc85de85261ef7
Would it be possible to add these?
BR, Pieterjan
Did this error occur due to an incompatibility between pynvml and the driver version? My driver version is 450.80.02.
I have noticed there are a few other implementations of Python NVML bindings. It seems worthwhile to compare these to see what they offer and how they differ (particularly from what is here).
We might want to follow this directory structure in the future: https://stackoverflow.com/a/5998845/616616
As pointed out to me by @mt-jones, the NVML C API currently provides NvLink functions, like nvmlDeviceGetNvLinkUtilizationCounter.
PyNVML wrappers for these functions would be useful to multi-GPU users.
Describe the bug
I see that the tests for NVLink-related APIs fail on a machine without NVLink, e.g. test_nvml_nvlink_properties(). Looking at the return code pynvml.nvml.NVMLError_NotSupported, it is clear that the failure is due to the absence of NVLink.
Opening this issue to check if there is a better way to handle these in the tests. Or is it too much work to bother about?
Steps/Code to reproduce bug
pytest reports these kinds of failures for nvlink related testcases -
__________________________________ test_nvml_nvlink_properties ___________________________________
ngpus = 2
handles = [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfc80>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfb70>]
def test_nvml_nvlink_properties(ngpus, handles):
for i in range(ngpus):
for j in range(pynvml.NVML_NVLINK_MAX_LINKS):
> version = pynvml.nvmlDeviceGetNvLinkVersion(handles[i], j)
test_nvml.py:238:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:2021: in nvmlDeviceGetNvLinkVersion
check_return(ret)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ret = 3
def check_return(ret):
if (ret != NVML_SUCCESS):
> raise NVMLError(ret)
E pynvml.nvml.NVMLError_NotSupported: Not Supported
../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:366: NVMLError_NotSupported
------------------------------------- Captured stdout setup --------------------------------------
[2 GPUs]
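One lightweight way to handle this in the test suite is to translate NotSupported into a skip rather than a failure. A sketch using only the standard library (pytest treats unittest.SkipTest as a skip; with pynvml, `exc_type` would be pynvml.nvml.NVMLError_NotSupported):

```python
import functools
import unittest

def skip_if_not_supported(exc_type):
    """Decorator: re-raise the given 'not supported' exception type as a
    test skip instead of a failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except exc_type as err:
                raise unittest.SkipTest(f"skipped: {err}")
        return wrapper
    return decorator
```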