Comments (4)
ib0 is up and has IP assigned. This looks fine.
What does ibv_devinfo look like?
This should be good to run jobs over IB.
I suspect the artifacts you see are a result of SR-IOV virtualization and perhaps some security restrictions for a multi-tenant environment. I'll let others chime in on that.
What are you looking for with mst tools?
from azhpc-images.
Hi @vermagit,
You're probably right, as the jobs seem to be running OK over the IB transport (my first attempt, about 15 days ago, may have failed for some other reason). The outputs are:
sudo ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.0.4 netmask 255.255.255.0 broadcast 10.0.0.255
inet6 fe80::20d:3aff:fe74:185 prefixlen 64 scopeid 0x20<link>
ether 00:0d:3a:74:01:85 txqueuelen 1000 (Ethernet)
RX packets 829 bytes 442371 (432.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 940 bytes 198060 (193.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::f434:19e9:be4e:2020 prefixlen 64 scopeid 0x20<link>
ether 00:15:5d:33:ff:12 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 58 bytes 9042 (8.8 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044
inet 172.16.1.9 netmask 255.255.0.0 broadcast 172.16.255.255
inet6 fe80::215:5dff:fd33:ff12 prefixlen 64 scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
infiniband 20:00:09:28:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 20 bytes 1490 (1.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
I still don't understand the warning message, but maybe it has no effect (I still have to test on several VMs once the quota is approved).
sudo lspci
0000:00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (AGP disabled) (rev 03)
0000:00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 01)
0000:00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
0000:00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
0000:00:08.0 VGA compatible controller: Microsoft Corporation Hyper-V virtual VGA
26ee:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
sudo ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.26.0206
node_guid: 0015:5dff:fe33:ff12
sys_image_guid: 9803:9b03:0094:25c8
vendor_id: 0x02c9
vendor_part_id: 4120
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 699
port_lmc: 0x00
link_layer: InfiniBand
As for what I want mst for, the answer is nothing specific. I'm trying to understand how to handle MLNX adapters under SR-IOV, but I didn't foresee that mst would be disabled.
lspci shows device is there, ifconfig shows interface with IP assigned, ibv_devinfo looks good too.
Others will chime in here on your questions on iommu and mst, and why some MLNX tools are restricted.
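For anyone who wants to script the ibv_devinfo check discussed above, here is a minimal sketch in Python. It only parses text in the `key: value` layout shown in this thread; the function names `port_states` and `all_ports_active` are made up for illustration, and the sample is a trimmed copy of the output posted earlier, not a general specification of what ibv_devinfo prints.

```python
# Minimal sketch: parse ibv_devinfo-style text and check that ports are active.
# Assumption: the output uses "key: value" lines as in the sample from this thread.

SAMPLE = """\
hca_id: mlx5_0
    transport: InfiniBand (0)
    fw_ver: 16.26.0206
    node_guid: 0015:5dff:fe33:ff12
    phys_port_cnt: 1
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            link_layer: InfiniBand
"""


def port_states(text):
    """Return one dict per port with hca_id, port, state, active_mtu, link_layer."""
    ports = []
    hca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so GUID values keep their colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            hca = value
            current = None
        elif key == "port":
            current = {"hca_id": hca, "port": int(value)}
            ports.append(current)
        elif current is not None and key in ("state", "active_mtu", "link_layer"):
            current[key] = value
    return ports


def all_ports_active(ports):
    """True if at least one port was found and every port reports PORT_ACTIVE."""
    return bool(ports) and all(
        p.get("state", "").startswith("PORT_ACTIVE") for p in ports
    )


if __name__ == "__main__":
    for p in port_states(SAMPLE):
        print(p["hca_id"], p["port"], p.get("state"), p.get("link_layer"))
    print("all active:", all_ports_active(port_states(SAMPLE)))
```

On a real VM you would feed it the captured output of `ibv_devinfo` instead of `SAMPLE`.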
1 - ifconfig reports the Ethernet and loopback interfaces, but the IB entry looks a bit odd
I have never seen this warning message and don't know what it means or if it's causing any issue.
You can safely ignore this warning. As long as the link status is shown as "LinkUp" in ibstat/ibstatus, you are good to go. Running IB-level benchmarks (ib_send_lat/ib_send_bw) is a good sanity check.
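As a rough illustration of the "LinkUp" check, the sketch below reads the per-port `state` and `phys_state` files that the Linux RDMA stack exposes under `/sys/class/infiniband`. It assumes those files contain strings such as `4: ACTIVE` and `5: LinkUp`; the helper names are hypothetical and this is not a replacement for ibstat/ibstatus.

```python
# Minimal sketch: check IB link status from sysfs instead of ibstat.
# Assumption: state/phys_state files look like "4: ACTIVE" and "5: LinkUp".
from pathlib import Path


def read_ib_port_states(root="/sys/class/infiniband"):
    """Return {"<device>/<port>": (state, phys_state)} for every port found."""
    results = {}
    base = Path(root)
    if not base.is_dir():
        return results  # no IB devices visible (or not a Linux host)
    for port_dir in sorted(base.glob("*/ports/*")):
        try:
            state = (port_dir / "state").read_text().strip()
            phys_state = (port_dir / "phys_state").read_text().strip()
        except OSError:
            continue  # skip ports whose files cannot be read
        device = port_dir.parent.parent.name
        results[f"{device}/{port_dir.name}"] = (state, phys_state)
    return results


def is_link_up(state, phys_state):
    """True when the logical state is ACTIVE and the physical state is LinkUp."""
    return state.endswith("ACTIVE") and phys_state.endswith("LinkUp")


if __name__ == "__main__":
    ports = read_ib_port_states()
    if not ports:
        print("no InfiniBand ports found")
    for name, (state, phys_state) in ports.items():
        verdict = "OK" if is_link_up(state, phys_state) else "CHECK"
        print(f"{name}: state={state!r} phys_state={phys_state!r} -> {verdict}")
```

If this reports OK for the port, running ib_send_lat/ib_send_bw between two VMs is still the better end-to-end sanity check.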
If you are planning on running MPI jobs, you typically don't need IPoIB. Your MPI library will use the verbs interface for IB communication (unless you explicitly use the TCP/IP channel of MPI library).
But if you have an app that uses TCP/IP for communication and you want to run over IB, you can use IPoIB. As long as the ib0 interface has a valid 172.16.X.X address, you are good to go with IPoIB.
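To make the "valid 172.16.X.X address" condition concrete, here is a small sketch using Python's standard `ipaddress` module. The /16 prefix is an assumption taken from the netmask 255.255.0.0 in the ifconfig output earlier in this thread, and `is_valid_ipoib_address` is just an illustrative name.

```python
# Minimal sketch: check that an address falls in the 172.16.X.X range
# mentioned above. The /16 prefix is assumed from the ifconfig netmask.
import ipaddress

IPOIB_NETWORK = ipaddress.ip_network("172.16.0.0/16")


def is_valid_ipoib_address(addr, network=IPOIB_NETWORK):
    """Return True if addr parses as an IP address inside the given network."""
    try:
        return ipaddress.ip_address(addr) in network
    except ValueError:
        return False  # not an IP address at all


if __name__ == "__main__":
    for candidate in ("172.16.1.9", "10.0.0.4"):
        print(candidate, is_valid_ipoib_address(candidate))
```

Here 172.16.1.9 is the ib0 address and 10.0.0.4 the eth0 address from the ifconfig output posted above.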
2 - The OS sees the adapter f014:00:02.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function] and the IB drivers seem to be loaded.
Yes.
3 - According to the Mellanox documentation for SR-IOV, the grub configuration file must contain intel_iommu=on and iommu=pt, but I don't see either in the cfg file.
This is related to host configuration. You don't need to do any special configuration on VMs.
4 - OpenSM starts but I don't see any configuration file so it's unclear if the virtualization is enabled.
You don't need to start or configure OpenSM on Azure HPC VMs. OpenSM already runs for each Azure HPC cluster on a dedicated device (UFM).
5 - As mentioned, MST just doesn't start:
As Aman mentioned in the previous comment, this is due to a security limitation.