validate-dcb's Introduction

⭐ More by the Microsoft Core Networking team

Find more from the Core Networking team using the MSFTNet topic

What's New in v2.2

For more information, please see What's New

Getting Started
  1. Learn the Tool
  2. Customize your Config
  3. Initiate Testing

Description

Validate-DCB v2.1 is a PowerShell-based unit test tool that allows you to:

   ✔️ Validate the expected configuration on one to N number of systems or clusters

   ✔️ Validate the configuration meets best practices

Additional benefits include:

   ✔️ The configuration doubles as DCB documentation for the expected configuration of your systems.

   ✔️ Answer "What Changed?" when faced with an operational issue (see Test Results)

   ✔️ [New with version 2] Deploy the configuration to nodes

ℹ️ Note: This tool does not modify your system unless you specify the -Deploy parameter. As such, you can re-validate the configuration as many times as desired.

Overview

RDMA over Converged Ethernet (RoCE) requires Data Center Bridging (DCB) technologies to make a network fabric lossless. The configuration requirements are complex and error-prone, requiring exact configuration and adherence to best practices across:

   ➡️ Each Windows Node

   ➡️ Each network port RDMA traffic passes through on the fabric

This tool aims to validate the DCB configuration on the Windows nodes by taking an expected configuration as input and unit testing each Windows system against it.

Important: The validation of the network fabric is out-of-scope for this tool

Here's a quick introductory video from Microsoft Premier Field Engineer, Jan Mortensen.

Scenarios

Validate-DCB will provide configuration validation for one or more nodes or clusters across a variety of scenarios including:

   ➡️ Native RDMA Adapters (Mode 1)

   ➡️ Host vNIC RDMA (Mode 2) with vNICs in the parent partition

   ➡️ Combination scenarios with both Native RDMA and Host Virtual NICs

   ➡️ Multiple virtual switches with RDMA enabled adapters

⚠️ For step-by-step configuration instructions, please see the Converged NIC Guide. Alternatively, you can use the deployment options in version 2

Test Overview

Test Types

Currently all tests in Validate-DCB are unit tests. That is, they break down and check individual configuration items one by one, rather than a holistic or functional test. In the future, we may incorporate integration/acceptance testing.

Tests are broken down into two types:

   ➡️ Global - Tests the TestHost, each SUT, and the Configuration File for prerequisites. If anything fails here, Validate-DCB will not move on to the actual DCB tests.

   ➡️ Modal - Tests each SUT for RDMA and configuration best practices

For more information, please see Test Details

Test Results

Testing with Azure DevOps and a CI/CD pipeline

Besides the on-screen feedback provided by the tool, results of the tests are stored in NUnitXML format in the \Results folder. These results can be kept for historical reference and can take part in a CI/CD pipeline as shown in Building a Continuous Integration and Continuous Deployment pipeline with DSC
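
Because the results are plain XML, they can be inspected with PowerShell's built-in XML support. The following is a minimal sketch, assuming the result file follows the usual NUnit layout of test-case elements with a result attribute. The inline sample and its test names are made up for illustration; in practice you would load a real file from the \Results folder instead.

```powershell
# Illustrative sample only; a real file from the \Results folder has many more entries.
$sampleXml = @'
<test-results total="2" failures="1">
  <test-suite name="Modal Unit">
    <results>
      <test-case name="Interface status must be Up" result="Success" success="True" />
      <test-case name="VMSwitch binding must be enabled" result="Failure" success="False" />
    </results>
  </test-suite>
</test-results>
'@

$results = [xml]$sampleXml

# Select every test-case element, at any depth, whose result attribute is Failure.
$failedTests = @($results.SelectNodes("//test-case[@result='Failure']"))

# Print the name attribute of each failing test case.
$failedTests | ForEach-Object { $_.GetAttribute('name') }
```

To use this against real output, replace the here-string with something like [xml](Get-Content -Path <path to a result file>).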

Simple report using Power BI

You can also use Power BI to make displaying results easy. For more information, please see Using the Results or see this video from Microsoft Premier Field Engineer, Jan Mortensen.

Interpreting Test Results

Validate-DCB may not work with other languages. In this case, use the tests as guidance on how to verify your configuration.

How Test Output is Constructed

Tests are constructed hierarchically. Describe blocks contain one or more Context blocks, and Context blocks contain one or more tests. This is Pester terminology and is outside the scope of this documentation. Pester is a PowerShell-based unit testing framework included inbox with Windows 10, Windows Server 2016, and Windows Server 2019.

While we have future plans to include more sections, currently the only two possible describe blocks are:

   ➡️ [Global Unit] tests requirements or prerequisites to run the modal tests

   ➡️ [Modal Unit] tests a node's configuration or best practices

A context block is a group of one or more tests. For example, Validate-DCB may test a physical NetAdapter's Advanced Properties including the VLANID or NetworkDirect (RDMA in driver terms) settings. These would be grouped in the same context.

Describe or Context Titles

Each Describe, Context, and Test includes a title enclosed in square brackets [ ]. Information inside these square brackets is intended to guide you to the necessary details to either resolve a failing test, or understand what just passed. Let's use this as an example:



➡️ Describing [Modal Unit] contains unit tests for the RDMA modes of operation (NDK mode 1 or 2)

➡️ Context can be broken down as follows:

   ↪️ [Modal Unit] – The describe block this Context is within

   ↪️ [VMSwitch.RDMAEnabledAdapters] – The section of the config file currently being tested.

   ↪️ [SUT: TK5-3WP07R0511] – The hostname of the current System Under Test

In this example, the current context is used for testing an adapter that is expected to be enabled for RDMA and connected to a VMSwitch.

This adapter exists below the VMSwitch section of the configuration file.

Note: During runtime, a variable named $ConfigData contains the information from the config file. With a debugger attached, you can walk the variable like this:

   [DBG]: PS C:\> $ConfigData.AllNodes.VMSwitch.RDMAEnabledAdapters

Passing Tests

If your system passes a test you will see green text similar to this:

+ [SUT: TK5-3WP07R0511]-[VMSwitch: VMSTest]-[RDMAEnabledAdapter: RoCE-01]-[Noun: NetAdapter] Interface status must be "Up"

Using the test above as an example, you can interpret this passing test as:

▶️ The SUT named TK5-3WP07R0511

   ↪️ is expecting the RDMAEnabledAdapter named RoCE-01

    ↪️ intended to back the VMSwitch named VMSTest

     ✔️ to have an interface operation status of "Up"

You can verify this using the PowerShell noun identified in the test (in the example, this is NetAdapter).
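
As a small illustration of how the title maps to a verification command, the sketch below splits the passing-test title above into its bracketed parts and builds the matching Get-<Noun> command line. The variable names are hypothetical and this is not part of Validate-DCB; it only shows how to read the title.

```powershell
# Test title copied from the passing-test example above.
$title = '[SUT: TK5-3WP07R0511]-[VMSwitch: VMSTest]-[RDMAEnabledAdapter: RoCE-01]-[Noun: NetAdapter] Interface status must be "Up"'

# Collect each "[Key: Value]" pair into an ordered dictionary.
$parts = [ordered]@{}
foreach ($match in [regex]::Matches($title, '\[([^:\]]+):\s*([^\]]+)\]')) {
    $parts[$match.Groups[1].Value] = $match.Groups[2].Value
}

# The noun maps to the Get-<Noun> cmdlet you would run against the SUT.
$command = 'Get-{0} -Name {1} -CimSession {2}' -f $parts['Noun'], $parts['RDMAEnabledAdapter'], $parts['SUT']
$command
```

Running the resulting Get-NetAdapter command from the TestHost and checking the Status column should show the same value the test evaluated.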

    

Failing Tests

If your system is incorrectly configured, the test will provide an error message on-screen.

Unlike most PowerShell scripts, red error messages do not indicate an exception or failing code. Rather, they (typically) indicate a failing test. In other words, they highlight something you need to fix.

Failing tests give information to identify the misconfiguration. In the failing test shown below (red output), the RDMAEnabledAdapter named RoCE-02 on SUT named TK5-3WP07R0511 was expected to be attached to the VMSwitch named VMSTest.

As you can see above, the Enabled property corresponding to the:

  

By running Get-NetAdapterBinding on the SUT you can see this for yourself.

Here's another video from Microsoft Premier Field Engineer, Jan Mortensen, who reviews and validates errors found with Validate-DCB

Reviewing the Tests

You may also find it useful to review the code generating the failing test. To do this, navigate in the folder structure to the file and line specified in the test failure, for example:

This message identifies the file and line number of the failing test.

  

Now navigate to the file and review the code.

  

If you’re still stuck and want to review the variables during runtime, you can set a breakpoint on the line above the one specified in the test failure (the test failed at line 490, so the breakpoint is set at line 489, as shown here):

  

⚠️ If searching for a test in the code, please be aware that parentheses typically indicate variables that are being expanded. All other test descriptions should be searchable.

For example, in this test description the exact driver version is specific to a particular NIC manufacturer (in this case 1.90.19240.0) and therefore, you cannot search for this in the test as it’s an expanded variable.

Resolving Test Failures

To complete our example above, we need to resolve the configuration issue. To do this, we'll attach the adapter(s) to the VMSwitch so the binding is now enabled.

Getting Started

Installation

Validate-DCB is now published in the PowerShell Gallery. Please use Install-Module Validate-DCB from a system with internet connectivity.

For disconnected systems, use Save-Module -Name Validate-DCB -Path c:\temp\Validate-DCB then move the modules in c:\temp\Validate-DCB to your disconnected system. Here's a video from Microsoft Premier Field Engineer, Jan Mortensen.

Requirements

  • TestHost: Windows 10, Windows Server 2016, or Windows Server 2019. The TestHost can also be a SUT if it is the appropriate OS.

  • System Under Test (SUT): Windows Server 2016 or Windows Server 2019

  • Configuration File: This is a file that defines the expected configuration on the SUTs.

Configuration File

Regardless of the scenario, you need a configuration file to define the expected configuration on your systems. Validate-DCB then checks that each system matches the expected configuration. With Validate-DCB v2.1 we recommend using the user interface to create the configuration for you. To do this, run Validate-DCB without parameters. For more information on customizing your own file, please see: Customize your Config
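
To give a feel for the shape of the data, here is a partial sketch of the structure that ends up in $ConfigData at runtime. It uses only key names that appear elsewhere in this document; the values are illustrative, the nesting is an assumption, and a real configuration contains more keys than shown. Treat the wizard or the example configurations as the authoritative source for the full layout.

```powershell
# Partial sketch: only keys mentioned elsewhere in this document are shown.
$ConfigData = @{
    AllNodes = @(
        @{
            NodeName = 'TK5-3WP07R0511'
            VMSwitch = @(
                @{
                    RDMAEnabledAdapters = @(
                        @{ Name = 'RoCE-01' ; VMNetworkAdapter = 'SMB01' ; VLANID = '101' ; JumboPacket = 9014 }
                        @{ Name = 'RoCE-02' ; VMNetworkAdapter = 'SMB02' ; VLANID = '101' ; JumboPacket = 9014 }
                    )
                }
            )
        }
    )
    NonNodeData = @{
        NetQos = @(
            @{ Template = 'Cluster' ; BandwidthPercentage = 1 }
        )
    }
}

# Member enumeration walks the nested arrays of hashtables.
$ConfigData.AllNodes.VMSwitch.RDMAEnabledAdapters.Name
```

The last line lists the adapter names from every node and virtual switch, which is a quick way to confirm the file describes the adapters you expect.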

Running Validate-DCB

To begin testing, complete the wizard mentioned in the previous section or run Validate-DCB -ConfigFilePath <Path to your configuration file>.ps1 if you have an existing configuration file you wish to use.

Additionally, you can connect Validate-DCB with your Azure Automation account to first deploy the configuration (then validate).

ℹ️ Note: For full parameter help use: Get-Help Validate-DCB

Here are a few tips on the parameters:

  • TestScope: Determines which describe block runs, so you can run only certain describe blocks. Use Global if you just want to set up a test host or validate that your systems are ready to be tested. Use Modal if you already know that all the prerequisites are met.

  • LaunchUI: Launches a user interface that helps create a configuration file.

  • ExampleConfig: Selects one of the predefined configuration files that test a system in Mode 1 or Mode 2. For more information on the example configurations, please see Examples. For details about the configuration for these modes, please review the Converged NIC Guide.

  • ConfigFilePath: Specifies the path to a custom configuration file.

  • ContinueOnFailure: If a test fails in one of the Describe blocks, Validate-DCB exits before moving to the next Describe block so you can correct the issue. Use this parameter to attempt all tests even if a test failure is detected.

  • Deploy: Deploys the configuration to all specified nodes before validating the configuration.
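
The parameters can be combined. The sketch below uses splatting to run only the Global tests against a custom configuration file and to keep going after a failure. The file path is hypothetical, and the call is guarded so the snippet does nothing on a machine where the module is not installed.

```powershell
# Hypothetical path; replace it with your own configuration file.
$validateParams = @{
    ConfigFilePath    = 'C:\temp\MyConfig.ps1'
    TestScope         = 'Global'
    ContinueOnFailure = $true
}

# Only invoke the tool when the Validate-DCB command is available on this machine.
if (Get-Command -Name Validate-DCB -ErrorAction SilentlyContinue) {
    Validate-DCB @validateParams
}
```

Keeping the arguments in a hashtable makes it easy to switch TestScope to Modal once the prerequisites pass.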

validate-dcb's Issues

DCBX Willing setting on Intel X722

[-] [SUT: BFICLHOST3]-[RDMAEnabledAdapters: SET01]-[Noun: NetQosDcbxSetting] interfaces DCBX 'Willing' option should be false 74ms
  Expected: {false}
  But was:  {}
  558:                     ($actNetQoSState.NetQosDcbxSettingInterfaces | Where-Object InterfaceAlias -like $thisRDMAEnabledAdapter.Name).Willing | Should Be 'false'
  at <ScriptBlock>, C:\Program Files\WindowsPowerShell\Modules\Validate-DCB\20191128.2.2.82\tests\unit\modal.unit.tests.ps1: line 558

The Intel X722 does not appear to have this capability...?

Enabled                    : True
Capabilities               :                       Hardware     Current
                                                   --------     -------
                             MacSecBypass        :
                             DcbxSupport         :
                             NumTCs(Max/ETS/PFC) :

HardwareCapabilities       :
CurrentCapabilities        :
OperationalTrafficClasses  : Not Available
OperationalFlowControl     : Not Available
OperationalClassifications : Not Available
RemoteTrafficClasses       : Not Available
RemoteFlowControl          : Not Available
RemoteClassifications      : Not Available
OperationalSettings        :
RemoteSettings             :
AdminStatus                :
ifAlias                    : xxxx
InterfaceAlias             : xxxx
ifDesc                     : Intel(R) Ethernet Connection X722 for 10GbE SFP+
Caption                    : MSFT_NetAdapterQosSettingData 'Intel(R) Ethernet Connection X722 for 10GbE SFP+'
Description                : Intel(R) Ethernet Connection X722 for 10GbE SFP+
ElementName                : Intel(R) Ethernet Connection X722 for 10GbE SFP+
InstanceID                 : {E69F7F95-0474-4AD7-BF71-5530AC4247A9}
InterfaceDescription       : Intel(R) Ethernet Connection X722 for 10GbE SFP+

From PROSet PowerShell.

 Get-IntelNetAdapterStatus -Name "*x722*" -Status DCB
 Get-IntelNetAdapterStatus : The specified device does not support DCB, has DCB disabled, or Intel's implementation of DCB is not installed.
 At line:1 char:1
 + Get-IntelNetAdapterStatus -Name "*x722*" -Status DCB
 + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     + CategoryInfo          : NotSpecified: (:) [Get-IntelNetAdapterStatus], Exception
     + FullyQualifiedErrorId : System.Exception,Intel.PowerShell.Network.Adapter.GetIntelNetAdapterStatus

If this is the case, which it seems to be, then Intel RNICs may need an exclusion to this test.

ContinueOnFailure not working properly

In the default test scope, the ContinueOnFailure parameter doesn't work because the condition

    If ($GlobalResults.FailedCount -ne 0) {

doesn't account for the parameter.
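
One way to express the intended behavior is sketched below. The function name is hypothetical and this is not the module's actual code; it only shows a stop condition that also consults the switch.

```powershell
function Test-StopAfterGlobal {
    param(
        [int]$FailedCount,
        [switch]$ContinueOnFailure
    )
    # Stop only when something failed and the caller did not ask to continue.
    return (($FailedCount -ne 0) -and (-not $ContinueOnFailure))
}

Test-StopAfterGlobal -FailedCount 2
Test-StopAfterGlobal -FailedCount 2 -ContinueOnFailure
```

The first call reports that execution should stop; the second reports that it should continue despite the failures.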

Config File must contain the AllNodes Hashtable

This test fails on multi-node clusters since AllNodes is defined as an array in the config file.

$AllNodes = @()

$configData.AllNodes.GetType()

IsPublic IsSerial Name                                     BaseType
-------- -------- ----                                     --------
True     True     Object[]                                 System.Array

The test needs to look at the first element in the array. Or look through the array. Or check whether it's an array. Or some combination of the above.

Example fix:

    ### Verify configData contains the AllNodes HashTable
    It "[Config File]-[AllNodes] Config File must contain the AllNodes Hashtable" {
        $configData.AllNodes[0] | Should BeOfType System.Collections.Hashtable
    }
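
A variant that looks through the whole array, rather than only the first element, could be sketched as follows. The helper name is hypothetical and it is not part of the module; it accepts either a single hashtable or an array of hashtables.

```powershell
function Test-AllNodesShape {
    param($AllNodes)
    # Wrap in @() so a single hashtable and an array of hashtables are handled alike.
    $nodes = @($AllNodes)
    if ($nodes.Count -eq 0) { return $false }
    foreach ($node in $nodes) {
        # Any element that is not a hashtable makes the whole check fail.
        if ($node -isnot [System.Collections.Hashtable]) { return $false }
    }
    return $true
}

Test-AllNodesShape -AllNodes @( @{ NodeName = 'Node01' }, @{ NodeName = 'Node02' } )
```

The boolean result could then be piped to the existing Should assertion in place of the type check on the first element.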

SMB Client RDMA Capable being mis-reported

When executing Validate-DCB, I am receiving the following message:

      [-] [SUT: hci01]-[SMB Adapter: Storage1]-[Noun: SMBServerNetworkInterface] SMB Client must report RDMA Capable 11ms
        Expected $true, but got $null.
        875:                     (($SMBServerNetworkInterface | Where-Object InterfaceIndex -eq $NetAdapter.IfIndex) | Select-Object -first 1).RdmaCapable | Should be $true
        at <ScriptBlock>, C:\Program Files\WindowsPowerShell\Modules\Validate-DCB\20210802.2.2.117\tests\unit\modal.unit.tests.ps1: line 875

However, when inspecting the properties of the adapter, it appears the above is being incorrectly reported.

Get-NetAdapter

Name                      InterfaceDescription                    ifIndex Status       MacAddress
----                      --------------------                    ------- ------       ----------
Compute-2                 HPE Ethernet 10/25Gb 2-port 640FLR...#2      17 Up           04-09-73-...
Storage1                  Hyper-V Virtual Ethernet Adapter             16 Up           00-15-5D-...
StorageReplica2           Hyper-V Virtual Ethernet Adapter #4          41 Up           00-15-5D-...
Embedded LOM 1 Port 3     HPE Ethernet 1Gb 4-port 331i Adapter         13 Disconnected B8-83-03-...
Compute-1                 HPE Ethernet 10/25Gb 2-port 640FLR-S...      12 Up           04-09-73-...
Embedded LOM 1 Port 4     HPE Ethernet 1Gb 4-port 331i Adapter #3      10 Disconnected B8-83-03-...
Embedded LOM 1 Port 2     HPE Ethernet 1Gb 4-port 331i Adapter #4       8 Disconnected B8-83-03-...
Management-1              HPE Ethernet 1Gb 4-port 331i Adapter #2       7 Up           B8-83-03-...
StorageReplica1           Hyper-V Virtual Ethernet Adapter #3          37 Up           00-15-5D-...
Storage2                  Hyper-V Virtual Ethernet Adapter #2          33 Up           00-15-5D-...

Get-SmbClientNetworkInterface

Interface Index RSS Capable RDMA Capable Speed   IpAddresses
--------------- ----------- ------------ -----   -----------
13              False       False        0  bps  {fe80::d03b:d964:2eea:f57f}
10              False       False        0  bps  {fe80::d4c0:2632:efc9:d40c}
8               False       False        0  bps  {fe80::45f4:fd34:f8b1:b679}
17              False       False        25 Gbps {}
12              False       False        25 Gbps {}
16              True        True         25 Gbps {fe80::ddd9:560:bbe1:dff3, 172.100.0.4}
33              True        True         25 Gbps {fe80::b8be:b1cb:f721:f5e6, 172.102.0.4}
37              True        True         25 Gbps {fe80::2d42:f3e:5e69:97f3, 172.101.0.4}
41              True        True         25 Gbps {fe80::a51a:264a:2743:56c7, 172.103.0.4}
9               False       False        10 Gbps {fe80::4525:a4fc:7150:2a46, 169.254.1.184}
7               True        False        1 Gbps  {fe80::d1b7:daef:a181:9e88, 10.40.219.53, 10.40...

NetAdapter Name for the virtual NIC is named the same as the VMNetworkAdapter name

When executing Validate-DCB, using the UI as the mechanism for parameter input, I have tried many different combinations including renaming adapters etc. however I cannot overcome this error. I fear that this is further causing additional errors.

Configuration information resulting in the error:

  • Server hardware includes 2 physical ports which are teamed into a virtual switch "management_switch"
  • Physical ports are named Management-1 and Management-2
  • VMNetworkAdapters for the management OS have been created for Management, Storage1, and Storage2

Configuration attempt 1 to bypass the error:

  • Change the name of the NetAdapters from 'vEthernet (Storage1)' and 'vEthernet (Storage2)' to Storage1 and Storage2 respectively
  • This results in the same error.

Why fixed Live Migration limit of 750 MBps?

In Tests/unit/modal.unit.tests.ps1, for a solution whose network adapters have a connection speed greater than 10GbE, the test looks for a fixed maximum limit of 750 MBps. The article https://techcommunity.microsoft.com/t5/failover-clustering/optimizing-hyper-v-live-migrations-on-an-hyperconverged/ba-p/396609 is cited as justification. However, that article states "The testing conducted tested different bandwidth limits on a dual 10 Gpbs RDMA enabled NIC and measured failures under stress conditions and found that throttling live migration to 750 MB achieved the highest level of availability to the system. On a system with higher bandwidth, you may be able to throttle to a value higher than 750 MB." Microsoft provides more specific guidance at https://docs.microsoft.com/en-us/azure-stack/hci/concepts/host-network-requirements#traffic-bandwidth-allocation, which illustrates that a fixed 750 MBps is not always the right value.

Here is an example of the appropriate calculation (and settings):

`
# Aggregate link speed of the SMB adapters, in Gbps
$aggregateLinkSpeed = ($smbNIC1.TransmitLinkSpeed + $smbNIC2.TransmitLinkSpeed) / 1000000000
$smbBandwidthAllocationPercent = .5
$smbBandwidthLimit = $aggregateLinkSpeed * $smbBandwidthAllocationPercent
# Live migration share of the SMB bandwidth, in Gbps
$liveMigrationBandwidthLimit = $smbBandwidthLimit * .29
$liveMigrationMaxMigrationLimit = 2

if ($liveMigrationBandwidthLimit -lt 5) {
    $migrationPerformanceOption = "Compression"
    Set-VMHost -VirtualMachineMigrationPerformanceOption $migrationPerformanceOption
}
else {
    $migrationPerformanceOption = "SMB"
    Set-VMHost -VirtualMachineMigrationPerformanceOption $migrationPerformanceOption
    # Convert Gbps to bytes per second
    Set-SmbBandwidthLimit -CimSession $server -Category LiveMigration -BytesPerSecond (($liveMigrationBandwidthLimit / 8) * 1000000000)
}
`
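
To make the arithmetic concrete, the sketch below wraps the same calculation in a function that only computes values and changes nothing on the host. The function and parameter names are hypothetical; the 50% SMB share, the 29% live migration share, and the 5 Gbps threshold are taken from the snippet above.

```powershell
function Get-LiveMigrationLimit {
    param(
        [double[]]$LinkSpeedsGbps,           # per-NIC link speed in Gbps
        [double]$SmbAllocation = 0.5,        # share of aggregate bandwidth given to SMB
        [double]$LiveMigrationShare = 0.29   # share of SMB bandwidth given to live migration
    )
    $aggregateGbps = ($LinkSpeedsGbps | Measure-Object -Sum).Sum
    $limitGbps = $aggregateGbps * $SmbAllocation * $LiveMigrationShare

    # Below 5 Gbps the snippet above selects Compression instead of SMB.
    $option = if ($limitGbps -lt 5) { 'Compression' } else { 'SMB' }

    [pscustomobject]@{
        LimitGbps         = [math]::Round($limitGbps, 2)
        BytesPerSecond    = [long][math]::Round($limitGbps / 8 * 1e9)
        PerformanceOption = $option
    }
}

$dual25 = Get-LiveMigrationLimit -LinkSpeedsGbps 25, 25
$dual10 = Get-LiveMigrationLimit -LinkSpeedsGbps 10, 10
$dual25, $dual10 | Format-Table
```

For two 25 Gbps ports this works out to 7.25 Gbps (906,250,000 bytes per second) with the SMB option, while two 10 Gbps ports yield 2.9 Gbps and fall back to Compression.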

Incorrect Mellanox Driver Version Shown for Dell EMC Solutions for Microsoft Azure Stack HCI Solution

When using Microsoft Validate-DCB on Dell EMC Solutions for Microsoft Azure Stack HCI Solutions with Mellanox NICs, we see a recommendation to install a version of the Mellanox driver that has not been certified or even available for download from Dell.

https://github.com/microsoft/Validate-DCB/blob/master/helpers/drivers/drivers.psd1
@{ IHV = 'Mellanox' ; DriverFileName = 'mlx5.sys' ; MinimumDriverVersion = '2.60.21096.0' } # ConnectX-4

Is there a way to add a check if Dell EMC Solutions for Microsoft Azure Stack HCI Solution use the driver version shown in the Support Matrix for Microsoft HCI Solutions?

Add BIOS settings verification

Hello,

The script does not check BIOS settings. This can lead to situations where, after following the steps, the script reports no errors but DCB is still not enabled.

Can you please include BIOS settings verification? The image shows one example from the Dell BIOS settings.

DCBX DELL Example

VLAN on host pNIC?

This test starts on line 256 of global.unit.tests.ps1.

    ### Verify each VMSwitch.RDMAEnabledAdapter includes the VLANID property from Get-NetAdapterAdvancedProperty
    It "[Config File]-[AllNodes.VMSwitch.RDMAEnabledAdapters]-[Node: $($thisNode.NodeName)]-[Entry: $($thisRDMAEnabledAdapter.Name)]-[Noun: NetAdapterAdvancedProperty] Must include the VLANID property for each entry" {
        $thisRDMAEnabledAdapter.VLANID | Should not BeNullOrEmpty
    }

Based on the example config files this is checking the pNIC for a VLANID. But shouldn't the pNIC be in TRUNK mode?

Based on this example:

    RDMAEnabledAdapters = @(
        @{ Name = 'RoCE-01' ; VMNetworkAdapter = 'SMB01' ; VLANID = '101' ; JumboPacket = 9014 }
        @{ Name = 'RoCE-02' ; VMNetworkAdapter = 'SMB02' ; VLANID = '101' ; JumboPacket = 9014 }
    )

The test runs against RoCE-01, which is the pNIC. The VLAN test should be run against the affinitized VMNetworkAdapter, SMB*.

Error during new Live Migration Limit test

Seeing an error running Validate-DCB on servers, not sure how to troubleshoot

`
[+] [SUT: servername]-[SMB Adapter: STORE-D]-[Noun: SMBServerNetworkInterface] SMB Client must report RDMA Capable 2ms
[-] Should have an Live Migration limit of 750 MBps 1ms
ParameterBindingException: A positional parameter cannot be found that accepts argument '1'.
at <ScriptBlock>, C:\Program Files\WindowsPowerShell\Modules\Validate-DCB\20191128.2.2.82\tests\unit\modal.unit.tests.ps1: line 902

`

Verify Cluster Bandwidth Percentage should account for Nic speed

The 1% cluster bandwidth percentage applies to 25GbE NICs. If 10GbE NICs are present, the cluster bandwidth percentage should be 2%.

    ### Verify Cluster Bandwidth Percentage is equal to 1%
    It "[Config File]-[NonNodeData.NetQos]-[Noun: NetQosTrafficClass] Cluster BandwidthPercentage must be 1%" {
        ($ConfigData.NonNodeData.NetQos.GetEnumerator().Where{ $_.Template -eq 'Cluster' }).BandwidthPercentage | Should be 1
    }
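
A speed-aware expectation could be sketched like this. The helper name is hypothetical, and treating every speed below 25 Gbps like 10 Gbps is an assumption; the issue only states the values for 10GbE and 25GbE.

```powershell
function Get-ExpectedClusterBandwidthPercentage {
    param([double]$LinkSpeedGbps)
    # Assumption: anything slower than 25 Gbps is treated like the 10 Gbps case.
    if ($LinkSpeedGbps -ge 25) { 1 } else { 2 }
}

Get-ExpectedClusterBandwidthPercentage -LinkSpeedGbps 25
Get-ExpectedClusterBandwidthPercentage -LinkSpeedGbps 10
```

The test could then compare the configured BandwidthPercentage against this value instead of the fixed 1.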

Set default LB mode

Currently, the load balancing mode is only checked if it is specified in the config file. The test should be updated to check for the recommended mode when none is specified in the config file.
