MST does not load (azhpc-images issue, closed)

azure commented on July 26, 2024
MST does not load

from azhpc-images.

Comments (4)

vermagit commented on July 26, 2024

ib0 is up and has IP assigned. This looks fine.
What does ibv_devinfo look like?
This should be good to run jobs over IB.

I suspect the artifacts you see are a result of SR-IOV virtualization and perhaps some security restrictions in a multi-tenant environment. I'll let someone else chime in on that.
What do you want to use the mst tools for?

afernandezody commented on July 26, 2024

Hi @vermagit,
You're probably right as the jobs seem to be running OK on the IB transport (maybe my 1st attempt from about 15 days ago failed for some other reason). The outputs are:

sudo ifconfig 
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.4  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::20d:3aff:fe74:185  prefixlen 64  scopeid 0x20<link>
        ether 00:0d:3a:74:01:85  txqueuelen 1000  (Ethernet)
        RX packets 829  bytes 442371 (432.0 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 940  bytes 198060 (193.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::f434:19e9:be4e:2020  prefixlen 64  scopeid 0x20<link>
        ether 00:15:5d:33:ff:12  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 58  bytes 9042 (8.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 2044
        inet 172.16.1.9  netmask 255.255.0.0  broadcast 172.16.255.255
        inet6 fe80::215:5dff:fd33:ff12  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 20:00:09:28:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 20  bytes 1490 (1.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I still don't understand the warning message, but maybe it has no effect (I still have to test on several VMs once the quota is approved).

sudo lspci 
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
26ee:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]

sudo ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.26.0206
        node_guid:                      0015:5dff:fe33:ff12
        sys_image_guid:                 9803:9b03:0094:25c8
        vendor_id:                      0x02c9
        vendor_part_id:                 4120
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 2
                        port_lid:               699
                        port_lmc:               0x00
                        link_layer:             InfiniBand

As for what I want mst for, the answer is nothing specific. I'm trying to understand how to handle MLNX adapters on SR-IOV but didn't foresee that mst would be disabled.

vermagit commented on July 26, 2024

lspci shows device is there, ifconfig shows interface with IP assigned, ibv_devinfo looks good too.

Others will chime in here on your questions on iommu and mst, and why some MLNX tools are restricted.

jithinjosepkl commented on July 26, 2024

1 - ifconfig reports the ethernet and the loop, but the IB is a bit weird
I have never seen this warning message and don't know what it means or if it's causing any issue.

You can safely ignore this warning. As long as the link status is shown as "LinkUp" in ibstat/ibstatus, you are good to go. Running IB-level benchmarks (ib_send_lat/ib_send_bw) is a good sanity check.
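A minimal sketch of that link check, assuming the same output formats seen earlier in this thread. The helper names `ib_port_active` and `ib_link_up` are hypothetical, and the sysfs path in the usage comment assumes device `mlx5_0`, port 1, as reported by ibv_devinfo above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any MLNX or Azure tooling.

# Succeeds (exit 0) if the ibv_devinfo output on stdin reports an active port,
# i.e. contains a line such as "state: PORT_ACTIVE (4)".
ib_port_active() {
    grep -q 'state:[[:space:]]*PORT_ACTIVE'
}

# Succeeds if the physical-state text on stdin reads "LinkUp"
# (the sysfs phys_state file contains e.g. "5: LinkUp").
ib_link_up() {
    grep -q 'LinkUp'
}

# Example usage on a VM with the IB stack installed:
#   ibv_devinfo | ib_port_active && echo "port active"
#   ib_link_up < /sys/class/infiniband/mlx5_0/ports/1/phys_state && echo "link up"
```

Both helpers only read text from stdin, so they can be tried against saved command output before running them on a live VM.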

If you are planning on running MPI jobs, you typically don't need IPoIB. Your MPI library will use the verbs interface for IB communication (unless you explicitly select the TCP/IP channel of the MPI library).

But if you have an app that uses TCP/IP for communication and you want to run over IB, you can use IPoIB. As long as the ib0 interface has a valid 172.16.X.X address, you are good to go with IPoIB.
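The address check can be sketched as follows, assuming the interface is named ib0 and the address appears after the word "inet", as in the ifconfig output earlier in this thread. The function name `ipoib_addr` is hypothetical, and `grep -o` assumes GNU or BSD grep.

```shell
#!/bin/sh
# Hypothetical helper: prints the first IPv4 address in 172.16.0.0/16 found
# after "inet" on stdin, and exits non-zero if there is none.
ipoib_addr() {
    grep -Eo 'inet (addr:)?172\.16\.[0-9]+\.[0-9]+' \
        | head -n 1 \
        | grep -Eo '172\.16\.[0-9]+\.[0-9]+'
}

# Example usage:
#   ifconfig ib0 | ipoib_addr || echo "ib0 has no 172.16.X.X address"
#   ip -4 addr show dev ib0 | ipoib_addr
```

Matching only on text that follows "inet" keeps the broadcast address (172.16.255.255 in the output above) from being reported as the interface address.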

2 - The OS sees the adapter f014:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] and the IB drivers seem to be loaded.

Yes.

3 - According to the Mellanox documentation for SR-IOV, the grub configuration file must contain intel_iommu=on and iommu=pt but I don't see either in the cfg file.

This is related to host configuration. You don't need to do any special configuration on VMs.

4 - OpenSM starts but I don't see any configuration file so it's unclear if the virtualization is enabled.

You don't need to start or configure OpenSM on Azure HPC VMs. OpenSM runs on a dedicated device (UFM) in each Azure HPC cluster.

5 - As mentioned, MST just doesn't start:

As Aman mentioned in the previous comment, this is due to a security limitation.
