
azhpc-images's Introduction

Build Status

OS Version Status Badge
Ubuntu 20.04 Build Status
Ubuntu 22.04 Build Status
AlmaLinux 8.7 Build Status

Azure HPC/AI VM Images

This repository houses a collection of scripts meticulously crafted for installing High-Performance Computing (HPC) and Artificial Intelligence (AI) libraries, along with tools essential for building Azure HPC/AI images. Whether you're provisioning compute-intensive workloads or crafting advanced AI models in the cloud, these scripts streamline the process, ensuring efficiency and reliability in your deployments.

The following are the currently supported HPC/AI VM images available in the Azure Marketplace:

  • Ubuntu-HPC 22.04 (microsoft-dsvm:ubuntu-hpc:2204:latest)
  • Ubuntu-HPC 20.04 (microsoft-dsvm:ubuntu-hpc:2004:latest)
  • AlmaLinux-HPC 8.7 (almalinux:almalinux-hpc:8_7-hpc-gen2:latest)

How to Use

The high-level steps to create your own HPC images using our repository are below (an Azure CLI sketch of these steps follows the list):

  1. Deploy a VM (tutorial).
  2. Run install.sh (pick the corresponding install.sh in our repository for your OS, e.g., Ubuntu 22.04).
  3. Generate an image from the VM (tutorial).
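
For illustration, here is a hedged sketch of these steps with the Azure CLI. The resource group, VM name, size, and base image URN are placeholders, and the install.sh path for Ubuntu 22.04 is assumed by analogy with the repository's Ubuntu 20.04 layout; adjust to your subscription and OS.

# 1. Deploy a build VM (placeholder names; pick a size appropriate for the target image).
az group create --name hpc-image-rg --location westeurope
az vm create --resource-group hpc-image-rg --name hpc-image-build \
    --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts-gen2:latest \
    --size Standard_HB120rs_v3 --generate-ssh-keys

# 2. On the VM, clone this repository and run the install script that matches your OS.
git clone https://github.com/Azure/azhpc-images.git
cd azhpc-images/ubuntu/ubuntu-22.x/ubuntu-22.04-hpc   # assumed path; pick the directory for your OS
sudo ./install.sh

# 3. Deprovision, generalize, and capture the image.
sudo waagent -deprovision+user -force
az vm deallocate --resource-group hpc-image-rg --name hpc-image-build
az vm generalize --resource-group hpc-image-rg --name hpc-image-build
az image create --resource-group hpc-image-rg --name my-ubuntu-hpc-image --source hpc-image-build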

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

azhpc-images's People

Contributors

abhamidipati-msft, anshuljainansja, basnijholt, christinetai0607, darkwhite29, edwardsp, hmeiland, jithinjosepkl, joe-atzinger, jonshelley, liquidpt, lorisercole, ltalirz, microsoft-github-operations[bot], microsoftopensource, moes1, mvrequa, petersatsuse, rafsalas19, samtkaplan, sandeeponcall, shivanispatel, souvik-de, themorey, tlcyr4, vermagit, vgamayunov, xpillons, yakovdyadkin

azhpc-images's Issues

Topology XML file for an Azure cluster with 4 nodes, each with 4 K80 GPUs

I am following the Azure MLOps Pipeline Template Git repo to create a CV project. I am using it to fine-tune CIFAR-10 with a pretrained ResNet50 network.

So, the file below is what was produced for me by the template automatically. However, it doesn't seem to be correct for the cluster setting I have.

I have a GPU cluster that has 4 nodes, each node has 4 K80 GPUs.
Could you please share with me the ndv4-top.xml file for this topology?

<!-- This topology file was copied from https://github.com/Azure/azhpc-images/blob/master/common/network-tuning.sh -->
<system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
      <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000d:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0107:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000e:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0108:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
</system>


VFs of Mellanox MT27710 Family [ConnectX-4 Lx] go missing when listed with ip link show, but are visible via lspci by bus address

Hi Team,

Please help me resolve this situation:

There are many servers with inconsistent VF counts: some servers have 61, some have 62, and others have 63.

We have seen that the VFs fluctuate only for one PF, i.e. enp134s5, and the missing VFs are enp134s5f5 and enp134s5f6.

However, those VFs (enp134s5f5, enp134s5f6) are listed by the lspci command. Please find the logs below.

[root@overcloud-dl380vprobesriovperformancecompute-chc4b-c00-8 ~]# ethtool -i enp134s5
driver: mlx5_core
version: 5.0-2.1.8
firmware-version: 14.29.1016 (HP_2420110034)
expansion-rom-version:
bus-info: 0000:86:05.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Based on the bus-info, when checking the partition information, we can see that the PCI addresses for partitions 5 and 6 exist:
[root@overcloud-dl380vprobesriovperformancecompute-chc4b-c00-8 ~]# lspci |grep 86:05
86:05.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.4 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.5 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.6 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
86:05.7 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]

And those VFs are not listed in the ip link show command output:

logicalname: enp134s5 driverversion: 5.0-2.1.8
logicalname: enp134s5f1 driverversion: 5.0-2.1.8
logicalname: enp134s5f2 driverversion: 5.0-2.1.8
logicalname: enp134s5f3 driverversion: 5.0-2.1.8
logicalname: enp134s5f4 driverversion: 5.0-2.1.8
logicalname: enp134s5f7 driverversion: 5.0-2.1.8

Upon investigating, we found that those VFs do not exist. The interesting thing is: if they were not created, then why are they listed by the lspci command?

[root@overcloud-dl380vprobesriovperformancecompute-chc4b-c00-8 ~]# ifconfig enp134s5f5
enp134s5f5: error fetching interface information: Device not found
[root@overcloud-dl380vprobesriovperformancecompute-chc4b-c00-8 ~]# ifconfig enp134s5f6
enp134s5f6: error fetching interface information: Device not found

Please help me to get an answer to this issue.

Thanks & Regards,
Saleemmalik,
+91-7815981336.

Mellanox IB Device Missing on NC24r & NC24rs_v2

I've used the scripts to install onto an Ubuntu 18.04 image, but when launching an NC24r or NC24rs_v2 instance with the image I cannot see any Mellanox InfiniBand device in lspci (and RDMA tests like ib_write_bw fail to find the device). The Mellanox InfiniBand device shows up properly in lspci and works just fine on NC24rs_v3 using this same image.

Is there something I need to do to get the InfiniBand devices working on NC24r and NC24rs_v2 instances with Ubuntu 18.04?

Thanks!

AMD FFTW and glibc 2.29 dependency

The AMD FFTW library in the following CentOS-HPC images has a dependency on glibc 2.29.

  • OpenLogic:CentOS-HPC:7.6:7.6.2020100800
  • OpenLogic:CentOS-HPC:7_6gen2:7.6.2020100801
  • OpenLogic:CentOS-HPC:7.7:7.7.2020100700
  • OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020100701
  • OpenLogic:CentOS-HPC:8_1:8.2.2020100700
  • OpenLogic:CentOS-HPC:8_1-gen2:8.2.2020100701

ldd /opt/amd/fftw/lib/libfftw3f.so
/opt/amd/fftw/lib/libfftw3f.so: /lib64/libm.so.6: version 'GLIBC_2.29' not found (required by /opt/amd/fftw/lib/libfftw3f.so)

Recommendation: Please use the previous FFTW version (https://github.com/amd/amd-fftw/releases/download/2.0/aocl-fftw-centos-2.0.tar.gz). This will be fixed in the next image update.

MST does not load

Hello,
I'm trying to configure the Mellanox HCAs in an HC44rs (CentOS 8.1), but things are not clicking and MST does not load correctly. There are many moving pieces, so let me report what I see along with my specific questions:
1 - ifconfig reports the Ethernet and loopback interfaces, but the IB one is a bit weird:

  ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.16.1.26  netmask 255.255.0.0  broadcast 172.16.255.255
        inet6 fe80::215:5dff:fd33:ff23  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:09:28:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 19  bytes 1414 (1.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I have never seen this warning message and don't know what it means or if it's causing any issue.
2 - The OS sees the adapter f014:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] and the IB drivers seem to be loaded.
3 - According to the Mellanox documentation for SR-IOV, the grub configuration file must contain intel_iommu=on and iommu=pt but I don't see either in the cfg file.
4 - OpenSM starts but I don't see any configuration file so it's unclear if the virtualization is enabled.
5 - As mentioned, MST just doesn't start:

# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
Unloading MST PCI module (unused) - Success
Unloading MST PCI configuration module (unused) - Success
# mst status
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded

PCI Devices:
------------

        No devices were found.

Thank you.

OpenMPI on CentOS-HPC 7.9: unable to use ucx because `UCP worker does not support MPI_THREAD_MULTIPLE`

I am unable to use the UCX messaging layer together with OpenMPI for an application that supports MPI + OpenMP parallelization strategies.

While the UCX pml component is found, initializing the component fails:

mpirun --mca pml ucx --mca pml_base_verbose 20 application
...
[hc44-low-2:20718] mca: base: components_register: registering framework pml components
[hc44-low-2:20718] mca: base: components_register: found loaded component ucx
[hc44-low-2:20718] mca: base: components_register: component ucx register function successful
[hc44-low-2:20718] mca: base: components_open: opening pml components
[hc44-low-2:20718] mca: base: components_open: found loaded component ucx
[hc44-low-2:20718] mca: base: components_open: component ucx open function successful
[hc44-low-2:20718] select: initializing pml component ucx
[hc44-low-2:20718] select: init returned failure for component ucx
...

When I export OMPI_MCA_pml_ucx_verbose=10, I am notified that

[hc44-low-2:21571] pml_ucx.c:325 UCP worker does not support MPI_THREAD_MULTIPLE. PML UCX could not be selected

This happens even when I export OMP_NUM_THREADS=1 (I guess this is independent of whether multiple threads are actually used).

I read in openucx/ucx#5284 (comment) that I may need UCX to be built with the --enable-mt option.

Would it be possible to have the UCX that ships with CentOS-HPC built with the --enable-mt option?

Or is this already the case and I am barking up the wrong tree here?
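
As a quick hedged check, ucx_info reports the configure flags the installed UCX was built with, so you can see whether --enable-mt is present before digging further; HPC-X also ships a separate multi-threaded build under its ucx/mt directory (the paths below are illustrative, not confirmed for this image):

# Does the default UCX advertise multi-threading support?
ucx_info -v | grep -o -- '--enable-mt' || echo "no --enable-mt in this build"
# HPC-X's multi-threaded UCX variant, if present (adjust the glob to the installed HPC-X version).
ls /opt/hpcx-*/ucx/mt/bin/ucx_info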

CentOS 7.7 HPC on Standard NC24rs_v3 - GPUs/IB devices missing

Hello,

I'm asking my initial question from openlogic/AzureBuildCentOS#92 :

I'm using CentOS 7.7 HPC image as a source for a packer build.
Within the first steps I'm running lspci. Unfortunately it doesn't include all 4 NVidia GPUs, nor the Mellanox IB device.
VM type is: Standard NC24rs_v3 (24 vcpus, 448 GiB memory), region westeurope - my understanding is that it should support SR-IOV.

azure-arm: + lspci
    azure-arm: 0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
    azure-arm: 0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
    azure-arm: 0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
    azure-arm: 0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
    azure-arm: 0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
    azure-arm: 3130:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
    azure-arm: + source /etc/os-release

Using CentOS 7.7 HPC on HB60rs lists the Mellanox IB device:

    azure-arm: + lspci
    azure-arm: 0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
    azure-arm: 0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
    azure-arm: 0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
    azure-arm: 0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
    azure-arm: 0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
    azure-arm: be1a:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

Thanks.

[OMAZUREMDS-8.4.0:ERROR] - centos 7 HPC image

Hello,

I'm starting my packer builds based on image "OpenLogic:CentOS7-HPC:7_9-gen2:latest".

Packer is started in debug mode so I can log in in parallel.
/var/log/messages has loads of these messages; I'm not sure why they are raised, and azsec-monitor is even installed although it should not be there.

<snip> rsyslogd: [OMAZUREMDS-8.4.0:ERROR] <SetupConnectionWithMdsd>: error at connect(). socket_file='/var/run/mdsd/default_json.socket' errno=No such file or directory [v8.24.0-57.el7_9]

if yum list installed azsec-monitor >/dev/null 2>&1; then yum remove -y azsec-monitor; fi

Inexistent hardcoded paths in HPC-X libtool archive files

When linking with libtool against the HPC-X provided in the marketplace image, the build fails due to a hardcoded path in the dependency_libs metadata of the libtool archive files included with HPC-X. Since this path (/hpc/local/oss/...) does not exist in the image, the linker cannot find the target .la files required to complete the dependency resolution, so the build fails.

Here is a list of all the .la files in HPC-X referencing the /hpc/local/oss/... path:

clusterkit/lib/libcuda_wrapper.la
hcoll/lib/hcoll/hmca_gpu_cuda.la
hcoll/lib/hcoll/hmca_bcol_nccl.la
hcoll/debug/lib/hcoll/hmca_gpu_cuda.la
hcoll/debug/lib/hcoll/hmca_bcol_nccl.la
nccl_rdma_sharp_plugin/lib/libnccl-net.la
ompi/lib/libmpi_usempi_ignore_tkr.la
ompi/lib/libmpi_mpifh.la
ompi/lib/libmpi_usempif08.la
ompi/tests/ipm-2.0.6/lib/libipmf.la
sharp/lib/libsharp_coll_cuda_wrapper.la
sharp/debug/lib/libsharp_coll_cuda_wrapper.la
ucx/mt/lib/ucx/libucx_perftest_cuda.la
ucx/mt/lib/ucx/libuct_xpmem.la
ucx/mt/lib/ucx/libuct_cuda.la
ucx/mt/lib/ucx/libuct_cuda_gdrcopy.la
ucx/lib/ucx/libucx_perftest_cuda.la
ucx/lib/ucx/libuct_xpmem.la
ucx/lib/ucx/libuct_cuda.la
ucx/lib/ucx/libuct_cuda_gdrcopy.la
ucx/prof/lib/ucx/libucx_perftest_cuda.la
ucx/prof/lib/ucx/libuct_xpmem.la
ucx/prof/lib/ucx/libuct_cuda.la
ucx/prof/lib/ucx/libuct_cuda_gdrcopy.la
ucx/debug/lib/ucx/libucx_perftest_cuda.la
ucx/debug/lib/ucx/libuct_xpmem.la
ucx/debug/lib/ucx/libuct_cuda.la
ucx/debug/lib/ucx/libuct_cuda_gdrcopy.la

Two potential solutions:

  1. Fix the paths to point to the correct directories in the provided stack
  2. Remove all .la files and let the linker resolve the dependencies at link time

The second solution would be preferred in order to allow customers to correctly link against any library, not only the ones provided in the HPC stack.
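
Here is a hedged sketch of option 2 (the /opt/hpcx-* glob is an assumption about where HPC-X is installed; consider backing the files up instead of deleting them if you may want to revert):

# List the libtool archives that still reference the non-existent /hpc/local/oss build prefix.
find /opt/hpcx-* -name '*.la' -exec grep -l '/hpc/local/oss' {} +
# Remove them so the linker resolves the shared libraries directly.
find /opt/hpcx-* -name '*.la' -exec grep -l '/hpc/local/oss' {} + | xargs rm -f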

Two versions of ucx installed on CentOS-HPC image?

We are using a CycleCloud cluster that is running the latest OpenLogic:CentOS-HPC:7_9 image.

We are experiencing some UCX-related issues with OpenMPI, and while looking into those I noticed that there are two different versions of ucx_info installed on the image:

One in /usr/bin/ucx_info

$ ucx_info -v
# UCT version=1.11.1 revision c58db6b
# configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-cuda --with-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --with-cuda=/usr/local/cuda-11.2

and one in /opt/hpcx-v2.9.0-gcc9.2.0-MLNX_OFED_LINUX-5.4-1.0.3.0-redhat7.9-x86_64/ucx/bin/ucx_info, which is the UCX that OpenMPI is linked against:

$ /opt/hpcx-v2.9.0-gcc9.2.0-MLNX_OFED_LINUX-5.4-1.0.3.0-redhat7.9-x86_64/ucx/bin/ucx_info -v
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --with-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc9.2.0-MLNX_OFED_LINUX-5.4-1.0.3.0-redhat7.9-x86_64/ucx

I'm new to this - could someone please explain what is the purpose of these two independent UCX installations?

Thanks a lot!

Centos 8.2 Sku/Urn tag in "az vm image list" output points to 8_1

Hi,

I just ran az vm image list and was wondering why CentOS 8.2 is listed under SKU 8_1-gen2.

$ az vm image list --offer=CentOS-HPC --sku="gen2" --all --output table
Offer       Publisher    Sku       Urn                                           Version
----------  -----------  --------  --------------------------------------------  --------------
CentOS-HPC  OpenLogic    7_6gen2   OpenLogic:CentOS-HPC:7_6gen2:7.6.20200302     7.6.20200302
CentOS-HPC  OpenLogic    7_6gen2   OpenLogic:CentOS-HPC:7_6gen2:7.6.2020042001   7.6.2020042001
CentOS-HPC  OpenLogic    7_6gen2   OpenLogic:CentOS-HPC:7_6gen2:7.6.2020062901   7.6.2020062901
CentOS-HPC  OpenLogic    7_6gen2   OpenLogic:CentOS-HPC:7_6gen2:7.6.2020100801   7.6.2020100801
CentOS-HPC  OpenLogic    7_7-gen2  OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020031801  7.7.2020031801
CentOS-HPC  OpenLogic    7_7-gen2  OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020042001  7.7.2020042001
CentOS-HPC  OpenLogic    7_7-gen2  OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020043001  7.7.2020043001
CentOS-HPC  OpenLogic    7_7-gen2  OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020062601  7.7.2020062601
CentOS-HPC  OpenLogic    7_7-gen2  OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020100701  7.7.2020100701
CentOS-HPC  OpenLogic    8_1-gen2  OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020041301  8.1.2020041301
CentOS-HPC  OpenLogic    8_1-gen2  OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020043001  8.1.2020043001
CentOS-HPC  OpenLogic    8_1-gen2  OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020062401  8.1.2020062401
CentOS-HPC  OpenLogic    8_1-gen2  OpenLogic:CentOS-HPC:8_1-gen2:8.2.2020100701  8.2.2020100701

Ubuntu-hpc 18.04 latest image slowness/lag reported on NFS File share

Hello,

There seems to be some lag/slowness in accessing the NFS file system on the latest Ubuntu-hpc 18.04 image which was released to the Azure marketplace on 15/03/2023.

Image URN: microsoft-dsvm:ubuntu-hpc:1804:18.04.2023031501
Kernel version:

unaazureuser@ip-0A0A000D:~$ uname -r
5.4.0-1104-azure

NFS File system Details:

azureuser@ip-0A0A000B:~$ df -h | grep /shared    --> Azure File Share
premstortest23.file.core.windows.net:/premstortest23/test  1.0T  1.5G 1023G   1% /shared

azureuser@ip-0A0A000B:~$ df -h | grep /data   --> ANF 
10.10.2.4:/testvol                                         100G  256K  100G   1% /data

File Access:

azureuser@ip-0A0A000D:~$ time ls /shared/home/azureuser/
a  foo  new.txt  sample.txt

real    0m0.635s
user    0m0.004s
sys     0m0.001s

azureuser@ip-0A0A000D:~$ time ls /data/
a.txt  b.txt

real    0m0.483s
user    0m0.000s
sys     0m0.003s

azureuser@ip-0A0A000D:~$ time touch /data/c.txt

real    0m0.186s
user    0m0.002s
sys     0m0.000s

azureuser@ip-0A0A000D:~$ time mkdir /shared/home/azureuser/test

real    0m0.614s
user    0m0.001s
sys     0m0.003s

On the older Ubuntu-HPC marketplace image, no lag is observed. This is causing slowness in the customer's application.

Old Ubuntu-hpc Image and Kernel version:

microsoft-dsvm:ubuntu-hpc:1804:18.04.2022121201
azureuser@ip-0A0A000B:~$ uname -r
5.4.0-1098-azure

azureuser@ip-0A0A000B:~$ time ls /shared/home/azureuser/
a  foo  new.txt  sample.txt  test

real    0m0.008s
user    0m0.001s
sys     0m0.000s
azureuser@ip-0A0A000B:~$ time ls /data/
a.txt  b.txt  c.txt

real    0m0.003s
user    0m0.001s
sys     0m0.000s

Please check on this to fix the kernel issue.

stack size limits missing from limits.conf

Any chance we could extend the limits.conf change to include setting the stack size to unlimited? For example:

# set limits for HPC apps
cat << EOF >> /etc/security/limits.conf
*               hard    memlock         unlimited
*               soft    memlock         unlimited
*               hard    nofile          65535
*               soft    nofile          65535
*               hard    stack           unlimited
*               soft    stack           unlimited
EOF

Simpler maintenance for versions and software BOM

The versions installed by the scripts are spread across many files. This means a lot of work to provide newer versions.

A central file for the versions would reduce the work needed to change the scripts. It would in addition help when new minor versions of a distribution appear.

A second benefit would be that such a "config" file could act as a software bill of materials, which is needed to have a secure software supply chain.

The idea is to have all variables, versions, urls and checksums in one file, which gets sourced in install.sh.
I have a working prototype for SLE HPC nearly ready.

Here is a small snippet of how it could look:

 --snip--
 # SLE Version
 # it can be generated, but it is better to go through this file
 # and manually set the right versions of the various sections
 # #source /etc/os-release
 # #export SLE_DOTV=${VERSION_ID}
 # #export SLE_MAJOR=${VERSION_ID%.*}
 export SLE_DOTV=15.4
 export SLE_MAJOR=15
 ...
 # azcopy
 export AZVERSION="10.16.2"
 RELEASE_TAG="release20221108"
 #
 export AZTARBALL="azcopy_linux_amd64_${AZVERSION}.tar.gz"
 export AZCOPY_DOWNLOAD_URL="https://azcopyvnext.azureedge.net/${RELEASE_TAG}/${AZTARBALL}"
 
 # HPC-X
 # pls accept the EULA
 export HPCX_VERSION="2.12"
 export HPCX_DOWNLOAD_URL=https://content.mellanox.com/hpc/hpc-x/v${HPCX_VERSION}/hpcx-v${HPCX_VERSION}-gcc-inbox-suse${SLE_DOTV}-cuda11-gdrcopy2-nccl${HPCX_VERSION}-x86_64.tbz
 export HPCX_CHKSUM="bc315d3b485d13c97cd174ef5c9cba5c2fa1fbc3e5175f96f1a406a6c0699bdb"
 ...
 --snip--

Centos-HPC 7.9 cuda broken for NC24 instances

Hi, we have a question about the best approach to get a working CUDA installation for the NC24 instances with the CentOS-HPC 7.9 image.

The NC24 (v1) instances come with 4 Tesla K80 GPUs (sm_37).

Contrary to NVIDIA's own documentation, which states that the 470 driver branch still supports sm_37, the GPUs are not recognized on the CentOS-HPC 7.9 image with the 470.82.01 driver (image version 20220112):

$ nvidia-smi
No devices were found

A colleague of ours successfully installed CUDA 11.2 and the 460 series driver on the CentOS 7.6 image (which still works with the K80s). However, we noticed that the CentOS-HPC 7.6 image no longer receives updates, so we would like to avoid moving our entire cluster to it just for the GPU support on the NC24 instances.

We tried downgrading the CUDA version on the CentOS-HPC 7.9 image to 11.2.2 with driver 460.32.03 but it was a little tricky since there are a number of NVIDIA components and some yum excludes (relevant parts of the image setup: [1,2]).

We really need the NC24 v1 instances, and were wondering what you would suggest as the best way forward.

E.g. one thing that could make our life easier would be to release a CentOS-HPC 7.9 image without CUDA that we could use to install a version that works with the K80s.

Of course it would be even better if the CentOS-HPC 7.9 image could support them out of the box. I'm not a CUDA expert and I don't understand this apparent discrepancy with NVIDIA's documentation - I also read here that the K80 should be supported up to 470.103.01.
Perhaps it is something else about the current CentOS-HPC image 7.9 that prevents the GPUs from being seen by nvidia-smi?

[1] https://github.com/Azure/azhpc-images/blob/7a9c492621e081f9c3fa36b3f35c0a6ffffced52/common/install_nvidiagpudriver.sh
[2] https://github.com/Azure/azhpc-images/blob/7a9c492621e081f9c3fa36b3f35c0a6ffffced52/centos/centos-7.x/common/install_nvidiagpudriver.sh

cc @matt-chan

amlfs-lustre-client-2.15.1-29-gbae0abe not found

When I run this on Ubuntu 20.04, the command at azhpc-images/ubuntu/common/install_lustre_client.sh:9,

apt-get install -y amlfs-lustre-client-${LUSTRE_VERSION}=$(uname -r)

fails with:

E: Version '5.16.0' for 'amlfs-lustre-client-2.15.1-29-gbae0abe' was not found
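
The pin to $(uname -r) assumes that a package version matching the running kernel is published in the configured repository. A hedged way to see what is actually available before installing (package name taken from the error above):

# List every published version of this Lustre client package.
apt-cache madison amlfs-lustre-client-2.15.1-29-gbae0abe
# Or search for all amlfs client packages to find one built for your kernel.
apt-cache search amlfs-lustre-client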

Include Lustre client in all HPC images

In an effort to provide a more cohesive story to our customers, we'd like to offer a Lustre client image.

Maintaining this client and keeping it current with the Lustre server versions has been a historical pain point for customers. Offering an HPC image with the Lustre client installed would be a huge plus for our customers.

Below are some links on how to install the client. The DKMS link will be of particular interest for the CentOS 7.x HPC images. Whamcloud (the current owner/maintainer of Lustre) no longer offers RPMs for CentOS 7.7.

http://wiki.lustre.org/Installing_the_Lustre_Software#Lustre_Client_Software_Installation
http://wiki.lustre.org/Installing_the_Lustre_Software#Using_DKMS

Our team is building its own client image for the time being for testing. We are using Ansible to do that. Below is a link to the playbook that we use for Centos images.

https://dev.azure.com/msazure/One/_git/Avere-laaso-dev?version=GBmaster&path=%2Fsrc%2Fansible%2Flclient.centos.playbook.yaml

The relevant snippet in that playbook is pasted below for convenience:
- name: remove /etc/yum.repos.d/lustre.repo
  become: true
  file:
    name: /etc/yum.repos.d/lustre.repo
    state: absent
- name: generate /etc/yum.repos.d/lustre.repo
  become: true
  lineinfile:
    path: /etc/yum.repos.d/lustre.repo
    line: "{{ item }}"
    create: true
    owner: root
    group: root
    mode: 0o444
  with_items:
    - "[lustre-client]"
    - "name=lustre-client"
    - "baseurl={{ lustre_repo_centos_client }}"
    - "enabled=0"
    - "gpgcheck=0"
- name: install Lustre CentOS DKMS packages
  become: true
  yum:
    enablerepo: lustre-client
    state: present
    update_cache: true
    name:
      - lustre-client-dkms
      - lustre-client
- name: verify lustre install
  become: true
  when: true
  command: "modprobe -v lustre"

simplify the run-tests.sh

run-tests.sh has by now a very complex structure and carries many variables for the different distros and versions.

As the scripts write the component versions to a file, that file could be used as a source for many of the variables used in run-tests.sh. It may need a bit of a closer look, but I think it's worth the effort to make things easier in the end.

AZHPC CentOS 8.3 Image - release date

Hi,

From the GitHub repo, I can see that code has been written to support CentOS 8.3, but I can't find the image in the Azure Marketplace.
Can you please share when you are planning to release the CentOS 8.3 image?

Thanks,
Eliott

How to specify the azhpc-images when allocating a new VM

Because I have the same issue as #28: how do I specify the Canonical-based VM image? Or, if I want CentOS 8.1, how do I specify it when allocating the VM?

When I set OpenLogic:CentOS-HPC:8_1-gen2:latest there, it comes back with the error:

Platform Image id /subscriptions/f4b71ca8-2442-461d-ba11-c2b5bcb828cc/providers/Microsoft.Compute/locations/southcentralus/publishers/OpenLogic/artifacttypes/vmimage/offers/CentoOS-HPC/skus/8_1-gen2/versions/latest is invalid or does not exist
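
Note that the error above shows the offer spelled CentoOS-HPC, so the URN the platform received did not match the marketplace offer. A hedged example of verifying and passing the URN with the Azure CLI (resource group, VM name, and size are placeholders):

# Verify that the URN exists in the marketplace.
az vm image show --urn OpenLogic:CentOS-HPC:8_1-gen2:latest
# Create the VM with the exact publisher:offer:sku:version string.
az vm create --resource-group my-rg --name my-hpc-vm \
    --image OpenLogic:CentOS-HPC:8_1-gen2:latest --size Standard_HB60rs --generate-ssh-keys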

HPCX and IMPI modules not compatible with Lmod

Lmod is failing to load HPCX and IMPI modules.
Repro :

  • yum install Lmod

  • module use /usr/share/Modules/modulefiles

  • module load mpi/hpcx
    Lmod has detected the following error: The following module(s) are unknown: "/opt/hpcx-v2.7.4-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat7.8-x86_64/modulefiles/hpcx"

  • module load mpi/impi
    Lmod has detected the following error: The following module(s) are unknown: "/opt/intel/impi/2018.4.274/intel64/modulefiles/mpi"
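
The error suggests Lmod is being handed the absolute module path stored inside the mpi/hpcx and mpi/impi wrappers. As a hedged workaround (not an official fix), adding the HPC-X modulefiles directory to the module search path and loading the module by name may work; the /opt path below is taken from the error above and must match the HPC-X version installed in your image:

# Make Lmod aware of the HPC-X modulefiles directory (adjust the version in the path).
module use /opt/hpcx-v2.7.4-gcc-MLNX_OFED_LINUX-5.2-1.0.4.0-redhat7.8-x86_64/modulefiles
# Load HPC-X directly by the module name found in that directory.
module load hpcx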

yum commands result in Killed in AlmaLinux

I was testing updating the image using yum, and no matter what yum command I use the result is always Killed:

$ sudo yum update
Failed to set locale, defaulting to C.UTF-8
AlmaLinux 8 - BaseOS                                                                                                                                                               4.2 MB/s | 5.2 MB     00:01    
AlmaLinux 8 - AppStream                                                                                                                                                            9.7 MB/s |  11 MB     00:01    
AlmaLinux 8 - Extras                                                                                                                                                                23 kB/s |  19 kB     00:00    
AlmaLinux 8 - PowerTools                                                                                                                                                           3.4 MB/s | 3.1 MB     00:00    
Azure Lustre Packages                                                                                                                                                              4.1 MB/s | 3.0 MB     00:00    
cuda-rhel8-x86_64                                                                                                                                                                  4.5 MB/s | 2.1 MB     00:00    
Extra Packages for Enterprise Linux 8 - x86_64                                                                                                                                     8.9 MB/s |  14 MB     00:01    
Killed

I tried yum clean and then running the update again, but no joy. I was wondering if this is a known issue.
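
A bare Killed message often means the kernel OOM killer terminated the process; that is only an assumption for this image, but it is cheap to check in the kernel log right after the failure:

# Look for out-of-memory kills around the time the yum command died.
sudo dmesg -T | grep -i -E 'out of memory|killed process'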

ibpanic after install in Ubuntu 20.04

I installed the azhpc-images on

  • Ubuntu 20.04
  • Standard ND96amsr A100 v4 machine

with the latest commit (c8db6de).

However, when I run

$ ibstat

it raises an error:

ibpanic: [769778] main: stat of IB device 'mlx5_an0' failed: No such file or directory

It seems the machine has mlx5_ib0 to mlx5_ib7, so why does it try to find mlx5_an0?
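
As a hedged diagnostic (standard infiniband-diags / iproute2 commands, not specific to this image), you can list the device names the stack actually exposes before querying one explicitly:

# List the InfiniBand devices ibstat knows about.
ibstat -l
# RDMA link view, which also shows the mlx5_* naming and port state.
rdma link show
# Query a specific device instead of letting ibstat iterate over all of them.
ibstat mlx5_ib0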

run_tests cannot be run as root

To run the test script you need to run it with sudo, otherwise the lspci command is not found.
However, when running with sudo, mpirun fails because it runs as root :-)

mpirun has detected an attempt to run as root.

So either change the script to call commands with sudo only where they need it, or add the right options to allow root to run mpirun (not the best one); see the sketch below.
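
For reference, Open MPI does provide switches for the second option; a hedged sketch (whether the test script should use them is a separate question):

# Explicitly allow mpirun to run as root (Open MPI).
mpirun --allow-run-as-root -np 2 hostname
# Or set the equivalent environment variables once instead of editing every mpirun call.
export OMPI_ALLOW_RUN_AS_ROOT=1
export OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1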

MOFED support on CX3-Pro cards

CX3-Pro cards are not supported in newer Mellanox OFED versions, and these cards are supported through Mellanox OFED LTS version (4.9-0.1.7.0). For more information, see Linux Drivers.

Thus, the latest Azure Marketplace HPC images, which have Mellanox OFED 5.1 and later, do not support ConnectX-3 Pro cards. Please check the Mellanox OFED version in the HPC image before using it on VMs with ConnectX-3 Pro cards.
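
A hedged way to check which MOFED version an image ships before relying on a ConnectX-3 Pro card (ofed_info is installed together with MOFED):

# Prints the short version string, e.g. MLNX_OFED_LINUX-5.1-... or 4.9-... for the LTS branch.
ofed_info -s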

The following are the latest CentOS-HPC images that support ConnectX-3 Pro cards:

  • OpenLogic:CentOS-HPC:7.6:7.6.2020062900
  • OpenLogic:CentOS-HPC:7_6gen2:7.6.2020062901
  • OpenLogic:CentOS-HPC:7.7:7.7.2020062600
  • OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020062601
  • OpenLogic:CentOS-HPC:8_1:8.1.2020062400
  • OpenLogic:CentOS-HPC:8_1-gen2:8.1.2020062401

Standard NC24rs_v3 - ib0 not available

Hello,

I've provisioned Standard NC24rs_v3 using CentOS-HPC 7.7 and 7.8.
In either case ib0 isn't configured properly and isn't available.

Is there an issue with the OFED installation?

# lspci
0001:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0002:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0003:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
0004:00:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
86fc:00:02.0 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]

ib0 isn't configured at all, and ibstatus errors out:

ibstatus
Fatal error:  device '*': sys files not found (/sys/class/infiniband/*/ports)
uname -a
Linux pkrvm226m165fgz 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

So I decided to install MLNX OFED and perform a reboot.
After that ib0 is configured - it looks like the Mellanox adapter is used for IB and ethernet though.

4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP group default qlen 256
    link/infiniband a0:00:6c:20:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:24 brd 00:ff:ff:ff:ff:12:40:1b:80:06:00:00:00:00:00:00:ff:ff:ff:ff
    inet 172.16.1.27/16 brd 172.16.255.255 scope global ib0
       valid_lft forever preferred_lft forever
    inet6 fe80::215:5dff:fd33:ff24/64 scope link
       valid_lft forever preferred_lft forever

Breaking change in IMPI module files

There is a breaking change in the latest IMPI module files, which removed all the MPI_* variables; please consider adding them back for compatibility.

Consider cross-linking between this repo and the image descriptions on Azure Marketplace

If I understand correctly, this repository is used to build the CentOS-HPC image by OpenLogic/Rogue Wave Software and the Ubuntu-based HPC and AI image by Microsoft on the Azure Marketplace.

This connection can be found e.g. in the Azure documentation but there is no direct link to this repository from the description of the images on the Azure marketplace, nor is there a link in this repository to those images.

I would suggest adding links in both directions in order to clarify this.

`mpiifort` not working in Almalinux 8.7

The mpiifort command does not work in the AlmaLinux 8.7 image, as ifort was not configured in the oneAPI setup.

No Intel compiler is provided by default with the present configuration. It would be good to include them, as several software stacks are much more efficient with the Intel compilers than with the GNU ones.

rpm's installed outside yum

When adding a new RPM through yum, it gives the following warning:
Warning: RPMDB altered outside of yum.
This can be solved by running sudo yum history sync, e.g. in centos/centos-7.x/centos-7.9-hpc/clear_history.sh.

`azhpc-build` script fails but resources are created, leading to unintended charges

I recently encountered an issue while using the AZHPC script (azhpc-build) on Azure. When the script fails to execute due to some error (for example, provisioning failing due to an invalid parameter), it still creates resources like VM scale sets, vnets, and others. As a result, users are charged for these resources even though the script execution was unsuccessful. The Azure portal does not show any virtual machines running, which can lead users to believe that they are not being charged. Users may therefore incur unexpected costs, which can be frustrating and negatively impact their experience.

Expected behavior:

The azhpc-build script should delete the resources created when it fails to execute completely to avoid unintentional charges to the user. The AZHPC script should not create and charge for resources when it fails to execute.

Any way to reliably detect infiniband eth1?

On RDMA-enabled instances (both with SR-IOV and without), there is a NIC that shows up as eth1. This is before installing any drivers (just off of the base Ubuntu 18 image). Azure IMDS also shows this second NIC in the interface array, although there is no IP associated with it.

I'm not sure if this is an IPoIB NIC, but regardless it interferes with our interface naming because we also attach a secondary NIC to the VM. On non-RDMA instances this secondary NIC shows up as eth1, but on RDMA-enabled instances it gets bumped down to eth2 because of this InfiniBand NIC.

Is there any reliable way to create a generic udev rule/init.d script that will be baked into our image that can detect the rdma eth1 so that we can rename it? We need this rule to work before the network is up so it can't rely on IMDS or the SharedConfig.xml.
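
As a hedged sketch only (the match key is an assumption about how the RDMA NIC is exposed, not a verified rule for these instance types), a udev rule can pin a name based on the driver of the underlying device instead of probe order:

# /etc/udev/rules.d/60-rename-rdma.rules (hypothetical)
# Give any interface backed by the Mellanox mlx4/mlx5 driver a stable name so it never claims eth1.
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlx4_core|mlx5_core", NAME="rdma0"

On non-SR-IOV instances the RDMA interface may be exposed through a different driver, so the DRIVERS match (or an address/PCI-path match) would need to be confirmed on the target VM size before baking this into an image.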

Broken yum update on CentOS7.9

The current azhop CentOS7.9 image has issues when one tries to perform yum update. The problem seems to be that there are incompatible packages in the image definition.

$ sudo yum update
--> Processing Conflict: clamav-filesystem-0.103.8-3.el7.noarch conflicts clamav < 0.103.8-3.el7
--> Finished Dependency Resolution
Error: Package: clamav-0.103.7-1.el7.x86_64 (@epel)
           Requires: clamav-filesystem = 0.103.7-1.el7
           Removing: clamav-filesystem-0.103.7-1.el7.noarch (@epel)
               clamav-filesystem = 0.103.7-1.el7
           Updated By: clamav-filesystem-0.103.8-3.el7.noarch (epel)
               clamav-filesystem = 0.103.8-3.el7
Error: clamav-filesystem conflicts with clamav-0.103.7-1.el7.x86_64
Error: Package: glibc-devel-2.17-325.el7_9.x86_64 (@updates-openlogic)
           Requires: glibc-headers = 2.17-325.el7_9
           Removing: glibc-headers-2.17-325.el7_9.x86_64 (@updates-openlogic)
               glibc-headers = 2.17-325.el7_9
           Updated By: glibc-headers-2.17-326.el7_9.x86_64 (updates-openlogic)
               glibc-headers = 2.17-326.el7_9
           Available: glibc-headers-2.17-317.el7.x86_64 (base)
               glibc-headers = 2.17-317.el7
           Available: glibc-headers-2.17-322.el7_9.x86_64 (updates)
               glibc-headers = 2.17-322.el7_9
           Available: glibc-headers-2.17-323.el7_9.x86_64 (updates)
               glibc-headers = 2.17-323.el7_9
           Available: glibc-headers-2.17-324.el7_9.x86_64 (updates)
               glibc-headers = 2.17-324.el7_9
Error: Package: clamav-update-0.103.7-1.el7.x86_64 (@epel)
           Requires: clamav-filesystem = 0.103.7-1.el7
           Removing: clamav-filesystem-0.103.7-1.el7.noarch (@epel)
               clamav-filesystem = 0.103.7-1.el7
           Updated By: clamav-filesystem-0.103.8-3.el7.noarch (epel)
               clamav-filesystem = 0.103.8-3.el7
 You could try using --skip-broken to work around the problem
** Found 56 pre-existing rpmdb problem(s), 'yum check' output follows:
WALinuxAgent-2.7.3.0-1_ol001.el7.noarch is a duplicate with WALinuxAgent-2.5.0.2-1_ol001.el7.noarch
at-3.1.13-25.el7_9.x86_64 is a duplicate with at-3.1.13-24.el7.x86_64
clamav-0.103.7-1.el7.x86_64 has missing requires of clamav-lib = ('0', '0.103.7', '1.el7')
clamav-0.103.8-3.el7.x86_64 is a duplicate with clamav-0.103.7-1.el7.x86_64
clamav-0.103.8-3.el7.x86_64 has missing requires of clamav-filesystem = ('0', '0.103.8', '3.el7')
clamav-filesystem-0.103.7-1.el7.noarch has installed conflicts clamav > ('0', '0.103.7', '1.el7'): clamav-0.103.8-3.el7.x86_64
clamav-update-0.103.7-1.el7.x86_64 has missing requires of clamav-lib = ('0', '0.103.7', '1.el7')
clamav-update-0.103.8-3.el7.x86_64 is a duplicate with clamav-update-0.103.7-1.el7.x86_64
clamav-update-0.103.8-3.el7.x86_64 has missing requires of clamav-filesystem = ('0', '0.103.8', '3.el7')
cloud-init-19.4-7.el7.centos.6.x86_64 is a duplicate with cloud-init-19.4-7.el7.centos.5.x86_64
diffutils-3.3-6.el7_9.x86_64 is a duplicate with diffutils-3.3-5.el7.x86_64
dkms-3.0.10-1.el7.noarch is a duplicate with dkms-3.0.3-1.el7.noarch
glibc-devel-2.17-325.el7_9.x86_64 has missing requires of glibc = ('0', '2.17', '325.el7_9')
glibc-devel-2.17-326.el7_9.x86_64 is a duplicate with glibc-devel-2.17-325.el7_9.x86_64
glibc-devel-2.17-326.el7_9.x86_64 has missing requires of glibc-headers = ('0', '2.17', '326.el7_9')
glibc-headers-2.17-325.el7_9.x86_64 has missing requires of glibc = ('0', '2.17', '325.el7_9')
1:grub2-efi-x64-2.02-0.87.0.2.el7.centos.11.x86_64 has missing requires of grub2-tools = ('1', '2.02', '0.87.0.2.el7.centos.11')
1:grub2-pc-2.02-0.87.0.2.el7.centos.11.x86_64 has missing requires of grub2-tools = ('1', '2.02', '0.87.0.2.el7.centos.11')
1:grub2-tools-2.02-0.87.el7.centos.7.x86_64 has missing requires of grub2-common = ('1', '2.02', '0.87.el7.centos.7')
1:grub2-tools-2.02-0.87.el7.centos.7.x86_64 has missing requires of grub2-tools-minimal = ('1', '2.02', '0.87.el7.centos.7')
1:grub2-tools-extra-2.02-0.87.0.2.el7.centos.11.x86_64 has missing requires of grub2-tools = ('1', '2.02', '0.87.0.2.el7.centos.11')
gzip-1.5-11.el7_9.x86_64 is a duplicate with gzip-1.5-10.el7.x86_64
libsmbclient-4.10.16-24.el7_9.x86_64 has missing requires of samba-common = ('0', '4.10.16', '24.el7_9')
libwbclient-4.10.16-20.el7_9.x86_64 has missing requires of samba-client-libs = ('0', '4.10.16', '20.el7_9')
libwbclient-4.10.16-24.el7_9.x86_64 is a duplicate with libwbclient-4.10.16-20.el7_9.x86_64
mdadm-4.1-9.el7_9.x86_64 is a duplicate with mdadm-4.1-8.el7_9.x86_64
2:microcode_ctl-2.1-73.15.el7_9.x86_64 is a duplicate with 2:microcode_ctl-2.1-73.11.el7_9.x86_64
rsync-3.1.2-12.el7_9.x86_64 is a duplicate with rsync-3.1.2-10.el7.x86_64
rsyslog-8.24.0-57.el7_9.3.x86_64 is a duplicate with rsyslog-8.24.0-57.el7_9.1.x86_64
samba-client-libs-4.10.16-24.el7_9.x86_64 has missing requires of samba-common = ('0', '4.10.16', '24.el7_9')
samba-common-libs-4.10.16-24.el7_9.x86_64 has missing requires of samba-common = ('0', '4.10.16', '24.el7_9')
sssd-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sssd-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-ipa = ('0', '1.16.5', '10.el7_9.15')
sssd-ad-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sssd-ad-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-krb5-common = ('0', '1.16.5', '10.el7_9.15')
sssd-client-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_idmap = ('0', '1.16.5', '10.el7_9.14')
sssd-client-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_nss_idmap = ('0', '1.16.5', '10.el7_9.14')
sssd-client-1.16.5-10.el7_9.15.x86_64 is a duplicate with sssd-client-1.16.5-10.el7_9.14.x86_64
sssd-common-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_autofs(x86-64) = ('0', '1.16.5', '10.el7_9.14')
sssd-common-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_idmap(x86-64) = ('0', '1.16.5', '10.el7_9.14')
sssd-common-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_sudo(x86-64) = ('0', '1.16.5', '10.el7_9.14')
sssd-common-pac-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sssd-ipa-1.16.5-10.el7_9.14.x86_64 has missing requires of libipa_hbac(x86-64) = ('0', '1.16.5', '10.el7_9.14')
sssd-ipa-1.16.5-10.el7_9.14.x86_64 has missing requires of libsss_idmap = ('0', '1.16.5', '10.el7_9.14')
sssd-ipa-1.16.5-10.el7_9.14.x86_64 has missing requires of sssd-common-pac = ('0', '1.16.5', '10.el7_9.14')
sssd-krb5-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sssd-krb5-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-krb5-common = ('0', '1.16.5', '10.el7_9.15')
sssd-ldap-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sssd-ldap-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-krb5-common = ('0', '1.16.5', '10.el7_9.15')
sssd-proxy-1.16.5-10.el7_9.15.x86_64 has missing requires of sssd-common = ('0', '1.16.5', '10.el7_9.15')
sysstat-10.1.5-20.el7_9.x86_64 is a duplicate with sysstat-10.1.5-19.el7.x86_64
systemd-219-78.el7_9.5.x86_64 has missing requires of systemd-libs = ('0', '219', '78.el7_9.5')
systemd-devel-219-78.el7_9.7.x86_64 has missing requires of systemd = ('0', '219', '78.el7_9.7')
systemd-python-219-78.el7_9.7.x86_64 has missing requires of systemd = ('0', '219', '78.el7_9.7')
systemd-sysv-219-78.el7_9.7.x86_64 has missing requires of systemd = ('0', '219', '78.el7_9.7')
tuned-2.11.0-12.el7_9.noarch is a duplicate with tuned-2.11.0-11.el7_9.noarch

Using sudo yum update --skip-broken indeed works around the problem but does not seem to be the best solution.

I suppose that changing to another image such as AlmaLinux might be best. Is that image MPI- and InfiniBand-compatible? From what I see in the documentation, CentOS images are still the recommended ones for HPC jobs. Or do any of the present images work well with MPI and InfiniBand?

Thanks for the help!

Intel debris left in /root dir

root@ip-0A0E0004:~# ls -lR /root/intel/
/root/intel/:
total 4
-rw-r--r-- 1 root root 1 Apr 21 00:05 isip

On the Ubuntu 18 image there is an "intel" directory left in the /root dir. Can we clean up the /root dir at some point?

copy_kvp_client.sh Connection refused

[13:11:27.578]+ ../../../common/copy_kvp_client.sh
[13:11:27.578]--2023-11-04 13:11:29-- https://raw.githubusercontent.com/microsoft/lis-test/master/WS2012R2/lisa/tools/KVP/kvp_client.c
[13:11:27.578]Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 0.0.0.0, ::
[13:11:27.578]Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|0.0.0.0|:443... failed: Connection refused.
[13:11:27.578]Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|::|:443... failed: Connection refused.
[13:11:58.577]root@rjz-MSI:~/azhpc-images/ubuntu/ubuntu-20.x/ubuntu-20.04-hpc# ping raw.githubusercontent.com
[13:11:58.577]PING raw.githubusercontent.com (127.0.0.1) 56(84) bytes of data.
[13:11:58.577]64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.031 ms
[13:11:59.605]64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.030 ms
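
In the log above, raw.githubusercontent.com resolves to 0.0.0.0/:: and then pings as 127.0.0.1, which points at local name resolution (an /etc/hosts override or a filtering DNS resolver) rather than at copy_kvp_client.sh itself. A hedged way to confirm:

# What does the local resolver return for the host the script downloads from?
getent hosts raw.githubusercontent.com
# Is there a manual override in /etc/hosts?
grep raw.githubusercontent.com /etc/hosts
# Compare against a public resolver (dig is in dnsutils/bind-utils).
dig +short raw.githubusercontent.com @1.1.1.1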

nfs read-ahead setting causes Lustre slowness

While troubleshooting an issue with very slow read performance of a Lustre filesystem (~100x slower), we found that it is caused by the NFS read-ahead setting applied by the HPC images.

This is how it is done in hpc-tuning.sh:

cat > /etc/udev/rules.d/90-nfs-readahead.rules <<EOM
SUBSYSTEM=="bdi",
ACTION=="add",
PROGRAM="/usr/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes",
ATTR{read_ahead_kb}="15380"
EOM

The value ATTR{read_ahead_kb}=15380 seems to be applied to all devices, not only NFS, including Lustre mounts.

This is how the issue can be seen - with a 10 MB block size the performance drops dramatically compared to 1 MB:

root@ubuntu2004:~# echo 3 > /proc/sys/vm/drop_caches
root@ubuntu2004:~# dd if=/lfs2/test/ubuntu2004/test of=/dev/null bs=1M
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0562267 s, 1.9 GB/s
root@ubuntu2004:~# echo 3 > /proc/sys/vm/drop_caches
root@ubuntu2004:~# dd if=/lfs2/test/ubuntu2004/test of=/dev/null bs=10M
10+0 records in
10+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 2.50094 s, 41.9 MB/s

Checking the read-ahead value for lustre - it is set to 15380:

root@ubuntu2004:~# cat /sys/devices/virtual/bdi/lustrefs-ffff9b44c8196800/read_ahead_kb
15380

Setting read-ahead to 0 fixes the slowdown:

root@ubuntu2004:~# echo 0 > /sys/devices/virtual/bdi/lustrefs-ffff9b44c8196800/read_ahead_kb
root@ubuntu2004:~# echo 3 > /proc/sys/vm/drop_caches
root@ubuntu2004:~# dd if=/lfs2/test/ubuntu2004/test of=/dev/null bs=10M
10+0 records in
10+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0857173 s, 1.2 GB/s

Can the udev rule be written in a way that it only affects NFS mounts and is not applied to all devices?
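
For what it's worth, the heredoc above writes each comma-separated fragment on its own line, and udev rules are line-based, so the final ATTR{read_ahead_kb} line appears to act as a standalone rule with no match conditions, which would explain it being applied to every bdi device. A hedged sketch of the same rule kept on a single line, so the attribute is only set when the NFS check in PROGRAM succeeds (this mirrors the original logic; only the line breaks and heredoc quoting differ):

# Quoting the delimiter keeps $kernel literal for udev instead of being expanded by the shell.
cat > /etc/udev/rules.d/90-nfs-readahead.rules <<'EOM'
SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/usr/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="15380"
EOM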

Missing drivers on Ubuntu 20.04 HPC Image

I created a Standard ND96asr_v4 VM with 8 x NVIDIA A100s. After SSHing in, I tried to get GPU info with nvidia-smi and it failed with:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I double-checked that I actually have GPUs:

$ lspci | grep -i NVIDIA
0001:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
0002:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
0003:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
0004:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
0005:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
0006:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
0007:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
0008:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
0009:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
000a:00:00.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
000b:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
000c:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
000d:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
000e:00:00.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)

Then, after coming to this repository and reading the README, I thought "Oh, maybe I just need to load an MPI module for everything to work":

$module load mpi
$ mpirun -np 8 \
>     --allow-run-as-root \
>     --map-by ppr:8:node \
>     -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
>     -mca coll_hcoll_enable 0 \
>     -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
>     -x UCX_TLS=tcp \
>     -x UCX_NET_DEVICES=eth0 \
>     -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
>     -x NCCL_SOCKET_IFNAME=eth0 \
>     -x NCCL_DEBUG=WARN \
>     -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
>     /opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G
# nThread 1 nGpus 1 minBytes 1024 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
<image>: Test CUDA failure common.cu:732 'no CUDA-capable device is detected'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
<image>: Test CUDA failure common.cu:732 'no CUDA-capable device is detected'
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[48042,1],4]
  Exit code:    2
--------------------------------------------------------------------------

At this point it appears that something is wrong with the installed drivers. Should I try reinstalling them or is there something obvious that I am missing?
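
A few hedged checks before reinstalling (generic NVIDIA/DKMS commands, not something specific to this image): confirm whether the kernel module is loaded and whether DKMS built the driver for the running kernel.

# Is the nvidia kernel module loaded?
lsmod | grep -i nvidia
# Did DKMS build the driver for the kernel you are actually running?
dkms status
uname -r
# If the module exists but is not loaded, try loading it and re-running nvidia-smi.
sudo modprobe nvidia && nvidia-smi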

CentOS 8 EOL

Hi,
currently the CentOS 8 HPC images are based on a product that will be gone in 2-3 months (as in, it will no longer receive any updates). Are you planning to move to CentOS Stream 8? Or a different RHEL derivative (Rocky/Alma/whatever)?

Greetings
Klaas
