Comments (13)
@nghtm, try running the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly.
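For convenience, the sample workload on that page boils down to the following (copied from the linked docs; the ubuntu image tag may differ):

# Prints nvidia-smi output from inside the container if the
# runtime is wired up correctly.
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi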
The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.
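Concretely, the tag breaks down like this (illustrative):

3.3.5 - 3.4.0 - ubuntu22.04
  |       |         |
  |       |         +-- base image
  |       +------------ DCGM-Exporter version
  +--------------------- DCGM library version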
Thank you for the response, helpful info on versions. :-)
When I try running this container with DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04 (the host's dcgmi -v also reports 3.3.5), it fails and causes nvidia-smi to report errors on GPU 0. Prior to running the container, nvidia-smi showed all GPUs as healthy. I examined nvidia-bug-report and found the following message:
Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
For GPU 0, which shows ERR!, the NVSMI log shows:
==============NVSMI LOG==============
Timestamp : Tue Apr 30 21:16:21 2024
Driver Version : 535.161.08
CUDA Version : 12.2
Attached GPUs : 8
GPU 00000000:00:16.0
Product Name : NVIDIA A10G
Product Brand : Unknown Error
Product Architecture : Ampere
Display Mode : N/A
Display Active : N/A
Persistence Mode : Enabled
Addressing Mode : None
MIG Mode
Current : Unknown Error
Pending : Unknown Error
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1652222014738
GPU UUID : Unknown Error
Minor Number : 0
VBIOS Version : Unknown Error
MultiGPU Board : N/A
Board ID : N/A
Board Part Number : 900-2G133-A840-100
GPU Part Number : 2237-892-A1
FRU Part Number : N/A
Module ID : Unknown Error
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : 535.161.08
GPU Virtualization Mode
Virtualization Mode : N/A
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : N/A
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x16
Domain : 0x0000
Device Id : 0x223710DE
Bus Id : 00000000:00:16.0
Sub System Id : 0x152F10DE
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Device Current : N/A
Device Max : N/A
Host Max : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : Unknown Error
Replay Number Rollovers : Unknown Error
Tx Throughput : Unknown Error
Rx Throughput : Unknown Error
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : Unknown Error
Performance State : Unknown Error
Clocks Event Reasons : N/A
Sparse Operation Mode : Unknown Error
FB Memory Usage
Total : 23028 MiB
Reserved : 512 MiB
Used : 0 MiB
Free : 22515 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : N/A
Average FPS : N/A
Average Latency : N/A
FBC Stats
Active Sessions : N/A
Average FPS : N/A
Average Latency : N/A
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : Unknown Error
Temperature
GPU Current Temp : Unknown Error
GPU T.Limit Temp : Unknown Error
GPU Shutdown T.Limit Temp : Unknown Error
GPU Slowdown T.Limit Temp : Unknown Error
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating T.Limit Temp : Unknown Error
GPU Power Readings
Power Draw : N/A
Current Power Limit : 670166.31 W
Requested Power Limit : 0.00 W
Default Power Limit : Unknown Error
Min Power Limit : Unknown Error
Max Power Limit : Unknown Error
Module Power Readings
Power Draw : Unknown Error
Current Power Limit : Unknown Error
Requested Power Limit : 0.00 W
Default Power Limit : Unknown Error
Min Power Limit : Unknown Error
Max Power Limit : Unknown Error
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : Unknown Error
Memory : Unknown Error
Default Applications Clocks
Graphics : Unknown Error
Memory : Unknown Error
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Max Customer Boost Clocks
Graphics : Unknown Error
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : Unknown Error
Fabric
State : N/A
Status : N/A
Processes : None
You need to install and configure the NVIDIA Container Toolkit. It seems it is not configured correctly, which is why you see the error:
Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
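If the toolkit packages are already present, the documented way to (re)configure the Docker runtime is (a sketch of the standard setup steps):

# Point Docker's daemon config at the nvidia runtime, then restart:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker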
Thanks for the response.
nvidia-container-toolkit is installed.
ubuntu@ip-10-1-5-148:/var/log$ dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.15.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.15.0-1 amd64 NVIDIA Container Toolkit Base
ubuntu@ip-10-1-5-148:/var/log$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
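A quick related check (a sketch; assumes the Docker daemon is running) is whether Docker actually registered the nvidia runtime:

# Should list "nvidia" among the available runtimes:
docker info | grep -i runtime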
Sounds like I will need to debug this further. I will report back if I determine a root cause.
We are installing nvidia-container-toolkit on the node via this script.
The docker configuration defaults to:
{
"data-root": "/opt/dlami/nvme/docker/data-root"
}
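For comparison, once the toolkit has configured Docker, /etc/docker/daemon.json typically also carries a runtimes entry along these lines (an illustrative sketch; the data-root value is from above):

{
  "data-root": "/opt/dlami/nvme/docker/data-root",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime"
    }
  }
}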
But I can typically run NVIDIA commands via Docker with this configuration. For example, sudo docker run --rm --gpus all ubuntu nvidia-smi works. However, when I try launching the dcgm-exporter container and tailing the docker logs, it fails after about one minute:
docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Trying to go back to the base dcgm-exporter container, which uses /etc/dcgm-exporter/dcp-metrics-included.csv instead of the custom CSV file I have written, to see if that fixes the container:
sudo docker run -d --rm \
--gpus all \
--net host \
--cap-add SYS_ADMIN \
nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
-f /etc/dcgm-exporter/dcp-metrics-included.csv
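For reference, those metric files are just CSV triples of DCGM field, Prometheus type, and help text; a minimal custom file might look like this (field names from DCGM's standard field list, the last one being a profiling field of the kind that is failing here):

# Format: DCGM field, Prometheus metric type, help string
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.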
For reference, this is the install script for dcgm-exporter that has been causing the container failures on g5.48xlarge (A10 GPUs).
It seems to be working without issues on H100s, so perhaps some of the custom metrics are not available on A10s (just a hypothesis).
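One way to test that hypothesis is to watch a profiling field directly on each node type; field 1001 is DCGM_FI_PROF_GR_ENGINE_ACTIVE (a sketch, assuming host-side dcgmi as above):

# Sample one DCP/profiling field five times; on a node where the
# profiling module is broken this should fail immediately.
dcgmi dmon -e 1001 -c 5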
Reproduced the error when running the container on A10 GPUs, while it works on H100 GPUs.
On A10s, the docker logs show:
ubuntu@ip-10-1-5-148:~$ docker logs ca88122482d5
time="2024-04-30T23:14:28Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:14:28Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:14:29Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:14:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-30T23:14:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:15:06Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
On H100s, the docker logs show:
ubuntu@ip-10-1-22-213:~$ docker logs 01a9236f1495
time="2024-04-30T23:05:43Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:05:43Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:05:43Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:05:43Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T23:05:43Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:05:46Z" level=info msg="Pipeline starting"
time="2024-04-30T23:05:46Z" level=info msg="Starting webserver"
level=info ts=2024-04-30T23:05:46.033Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-04-30T23:05:46.034Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400
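On the healthy H100 nodes the exporter can then be scraped as usual (port 9400, per the logs above):

curl -s localhost:9400/metrics | head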
Reporting findings from today:
- H100 nodes (8x GPU): no issues; all versions of DCGM-Exporter appear to be working.
- A10 nodes (8x GPU): the older version 2.1.4-2.3.1-ubuntu20.04 works, but all versions above 3.1.6-3.1.3-ubuntu20.04 are failing; docker logs show the following:
level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
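Until the root cause is clear, a workaround on the A10 nodes is to pin the last known-good tag (same run command as earlier in the thread):

sudo docker run -d --rm --gpus all --net host --cap-add SYS_ADMIN \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.3.1-ubuntu20.04 \
  -f /etc/dcgm-exporter/dcp-metrics-included.csv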
Root cause determined: it is an issue with the open-source ("OS") build of NVIDIA driver 535.161.08 on the g5.48xlarge (8x A10G) instances together with NVIDIA DCGM 3.3.5-3.4.0-ubuntu22.04.
We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08, or by using 2.1.4-2.3.1-ubuntu20.04, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the open-source driver on g5.48xlarge, showing up as GSP errors in dmesg.
This is similar to what the issue reporter saw here: awslabs/amazon-eks-ami#1523
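For anyone hitting the same thing, a quick way to check which driver flavor is loaded (a sketch; the license strings follow NVIDIA's module packaging):

# The open kernel module reports "Dual MIT/GPL"; the proprietary
# one reports "NVIDIA".
modinfo nvidia | grep -i license
# GSP-related driver failures surface in the kernel log:
sudo dmesg | grep -i gsp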
Anyway, thanks for the help and the quick responses.
@nghtm Thank you for the update. I am closing the issue as solved.