
Comments (13)

nvvfedorov commented on June 26, 2024

@nghtm, try running the sample workload as suggested here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html#running-a-sample-workload-with-docker. This will tell us whether the NVIDIA runtime is configured correctly.
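
For reference, the sample workload from that page is essentially (a minimal check, assuming the toolkit packages are installed):

    sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If the runtime is configured correctly, this prints the same nvidia-smi table you see on the host.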


glowkey commented on June 26, 2024

The first set of numbers in the DCGM-Exporter version corresponds to the DCGM library version used in the container and in testing (3.3.5 in your case). The second set of numbers (3.4.0) corresponds to the DCGM-Exporter version itself. However, DCGM follows semver compatibility guidelines, so any 3.x version should be compatible.


nghtm commented on June 26, 2024

Thank you for the response, helpful info on versions. :-)

When I run this container with DCGM_EXPORTER_VERSION=3.3.5-3.4.0-ubuntu22.04 (dcgmi -v reports 3.3.5), it fails and causes nvidia-smi to throw errors on GPU 0. Prior to running the container, nvidia-smi showed all GPUs as healthy. I examined nvidia-bug-report and found the following message:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"

For GPU 0, which shows ERR!, the NVSMI log shows:

==============NVSMI LOG==============

Timestamp                                 : Tue Apr 30 21:16:21 2024
Driver Version                            : 535.161.08
CUDA Version                              : 12.2

Attached GPUs                             : 8
GPU 00000000:00:16.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : Unknown Error
    Product Architecture                  : Ampere
    Display Mode                          : N/A
    Display Active                        : N/A
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : N/A
    Accounting Mode Buffer Size           : N/A
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1652222014738
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : N/A
    Board ID                              : N/A
    Board Part Number                     : 900-2G133-A840-100
    GPU Part Number                       : 2237-892-A1
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 535.161.08
    GPU Virtualization Mode
        Virtualization Mode               : N/A
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x00
        Device                            : 0x16
        Domain                            : 0x0000
        Device Id                         : 0x223710DE
        Bus Id                            : 00000000:00:16.0
        Sub System Id                     : 0x152F10DE
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : Unknown Error
        Replay Number Rollovers           : Unknown Error
        Tx Throughput                     : Unknown Error
        Rx Throughput                     : Unknown Error
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : Unknown Error
    Performance State                     : Unknown Error
    Clocks Event Reasons                  : N/A
    Sparse Operation Mode                 : Unknown Error
    FB Memory Usage
        Total                             : 23028 MiB
        Reserved                          : 512 MiB
        Used                              : 0 MiB
        Free                              : 22515 MiB
    BAR1 Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    FBC Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : Unknown Error
    Temperature
        GPU Current Temp                  : Unknown Error
        GPU T.Limit Temp                  : Unknown Error
        GPU Shutdown T.Limit Temp         : Unknown Error
        GPU Slowdown T.Limit Temp         : Unknown Error
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : Unknown Error
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : 670166.31 W
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Module Power Readings
        Power Draw                        : Unknown Error
        Current Power Limit               : Unknown Error
        Requested Power Limit             : 0.00 W
        Default Power Limit               : Unknown Error
        Min Power Limit                   : Unknown Error
        Max Power Limit                   : Unknown Error
    Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Default Applications Clocks
        Graphics                          : Unknown Error
        Memory                            : Unknown Error
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : Unknown Error
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : Unknown Error
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None


nvvfedorov commented on June 26, 2024

You need to install and configure the NVIDIA Container Toolkit. It seems that it is not configured correctly, which is why you see the error:

Apr 30 21:13:04 ip-10-1-5-148 dockerd[10261]: time="2024-04-30T21:13:04.829815111Z" level=error msg="restartmanger wait error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
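
If the packages are installed but Docker was never configured to use the runtime, the usual fix is (a standard sequence, assuming Docker as the container engine):

    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker

The first command registers the nvidia runtime in /etc/docker/daemon.json; the restart makes Docker pick it up.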


nghtm commented on June 26, 2024

Thanks for the response.

nvidia-container-toolkit is installed.

ubuntu@ip-10-1-5-148:/var/log$ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit               1.15.0-1                              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.15.0-1                              amd64        NVIDIA Container Toolkit Base
ubuntu@ip-10-1-5-148:/var/log$ cat /etc/nvidia-container-runtime/config.toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]

[nvidia-container-runtime.modes]

Sounds like I will need to debug this further. I will report back if I determine a root cause.


nghtm commented on June 26, 2024

We are installing nvidia-container-toolkit on the node via this script:

The docker configuration defaults to:

{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}

But I can typically run NVIDIA commands via Docker with this configuration; for example, sudo docker run --rm --gpus all ubuntu nvidia-smi works.
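
As an additional sanity check (cheap, not definitive), docker info should list the nvidia runtime if it is registered:

    docker info | grep -i runtimes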

However, when I launch the dcgm-exporter container and follow the docker logs, it fails after about a minute:

docker logs 92c05c0f81ba
time="2024-04-30T22:16:32Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T22:16:32Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T22:16:33Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T22:16:33Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T22:16:33Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T22:16:33Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T22:17:15Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"


nghtm commented on June 26, 2024

Trying to go back to the base dcgm-exporter configuration, which uses /etc/dcgm-exporter/dcp-metrics-included.csv instead of the custom CSV file I have written, to see if that fixes the container:

    sudo docker run -d --rm \
       --gpus all \
       --net host \
       --cap-add SYS_ADMIN \
       nvcr.io/nvidia/k8s/dcgm-exporter:${DCGM_EXPORTER_VERSION}-ubuntu20.04 \
       -f /etc/dcgm-exporter/dcp-metrics-included.csv 
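
If the container stays up, the endpoint can be checked directly (the exporter listens on :9400 by default, and the container runs with --net host):

    curl -s localhost:9400/metrics | head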


nghtm commented on June 26, 2024

For reference, this is the install script for dcgm-exporter that has been causing the container failures on g5.48xlarge (A10G GPUs):

https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.sh


nghtm commented on June 26, 2024

It seems to be working without issues on H100s, so perhaps some of the custom metrics are not available on A10Gs (just a hypothesis).


nghtm commented on June 26, 2024

The error repeats when running the container on A10G GPUs, but it works on H100 GPUs.

On A10Gs, the docker logs show:

ubuntu@ip-10-1-5-148:~$ docker logs ca88122482d5
time="2024-04-30T23:14:28Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:14:28Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:14:29Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:14:29Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time="2024-04-30T23:14:29Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:14:29Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:15:06Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

On H100s, the docker logs show:

ubuntu@ip-10-1-22-213:~$ docker logs 01a9236f1495
time="2024-04-30T23:05:43Z" level=info msg="Starting dcgm-exporter"
time="2024-04-30T23:05:43Z" level=info msg="DCGM successfully initialized!"
time="2024-04-30T23:05:43Z" level=info msg="Collecting DCP Metrics"
time="2024-04-30T23:05:43Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-golden-metrics.csv'"
time="2024-04-30T23:05:43Z" level=info msg="Initializing system entities of type: GPU"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-04-30T23:05:45Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-04-30T23:05:46Z" level=info msg="Pipeline starting"
time="2024-04-30T23:05:46Z" level=info msg="Starting webserver"
level=info ts=2024-04-30T23:05:46.033Z caller=tls_config.go:313 msg="Listening on" address=[::]:9400
level=info ts=2024-04-30T23:05:46.034Z caller=tls_config.go:316 msg="TLS is disabled." http2=false address=[::]:9400


nghtm commented on June 26, 2024

Reporting findings from today:

H100 nodes (8x GPU): no issues; all versions of the DCGM exporter appear to be working.
A10G nodes (8x GPU): the older version 2.1.4-2.3.1-ubuntu20.04 works, but all versions above 3.1.6-3.1.3-ubuntu20.04 fail; docker logs show the following:

level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
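
For anyone reproducing this, a rough sketch of such a version sweep (the tag list is illustrative, not exhaustive; only the tags named above were actually tested here):

    #!/bin/bash
    # Sweep a few dcgm-exporter tags and report which ones stay up.
    # No --rm, so 'docker logs' still works after a crash.
    for TAG in 2.1.4-2.3.1-ubuntu20.04 3.1.6-3.1.3-ubuntu20.04 3.3.5-3.4.0-ubuntu22.04; do
        ID=$(sudo docker run -d --gpus all --net host --cap-add SYS_ADMIN \
            "nvcr.io/nvidia/k8s/dcgm-exporter:${TAG}")
        sleep 90  # the profiling failure appears roughly a minute after start
        if [ "$(sudo docker inspect -f '{{.State.Running}}' "${ID}")" = "true" ]; then
            echo "${TAG}: still running"
        else
            echo "${TAG}: exited"
            sudo docker logs "${ID}" 2>&1 | tail -n 1
        fi
        sudo docker rm -f "${ID}" >/dev/null
    done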


nghtm commented on June 26, 2024

Root cause determined: it is an issue with the OS-supplied version of NVIDIA driver 535.161.08 on g5.48xlarge (8x A10G) instances together with NVIDIA DCGM 3.3.5-3.4.0-ubuntu22.04.

We were able to run DCGM-Exporter by installing the proprietary driver 535.161.08 or by using 2.1.4-2.3.1-ubuntu20.04, but 3.3.5-3.4.0-ubuntu22.04 failed consistently with the OS driver on g5.48xlarge, showing GSP errors in dmesg.
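
For anyone hitting the same thing, two quick checks that make the driver flavor and the GSP failures visible (a sketch; the exact modinfo license strings can vary with driver packaging):

    # "Dual MIT/GPL" indicates the open kernel modules; "NVIDIA" the proprietary driver
    modinfo nvidia | grep -i license
    # GSP-related failures show up as NVRM / GSP lines in the kernel log
    sudo dmesg | grep -iE 'nvrm|gsp'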

Similar to the issue reported here: awslabs/amazon-eks-ami#1523

Anyway, thanks for the help and the quick responses.


nvvfedorov commented on June 26, 2024

@nghtm Thank you for the update. I am closing the issue as solved.

