datacenter / aci-pre-upgrade-validation-script

A script to run validations to detect potential issues that may cause an ACI fabric upgrade to fail

Home Page: https://datacenter.github.io/ACI-Pre-Upgrade-Validation-Script/

License: Apache License 2.0

Python 100.00%

aci-pre-upgrade-validation-script's Introduction

Quick Start

  1. Copy aci-preupgrade-validation-script.py to your APIC (suggested path: /data/techsupport)
  2. On your APIC, run cd /data/techsupport then python aci-preupgrade-validation-script.py
  3. Provide a user name and password (admin level privileges are recommended)
  4. Select the target version (the version needs to be on APIC)
  5. Follow recommendations for all checks that have been flagged as FAIL or MANUAL CHECK REQUIRED

Introduction

The goal of this script is to provide an automated list of proactive checks to run before performing an ACI fabric upgrade. Each check is documented on this page with a detailed explanation of why the issue should be resolved before upgrading.

Check out ACI Pre-Upgrade Validation Script Document for details of the script.

Check out ACI Upgrade Guide for details of ACI upgrades in general.

Failure to address an affected issue before an upgrade is known to cause challenges during or post upgrade.

For every check that has been flagged as FAIL, a general recommended action has been provided to guide next steps. There is also a summary with the number of checks that matched a given status.

aci-pre-upgrade-validation-script's People

Contributors

ehaminian, jeestr4d, kshcheku, monrog2, takishida, welkin-he, wilsonbc2

aci-pre-upgrade-validation-script's Issues

Validate target image integrity to avoid corrupted image

The current image validation makes sure all APICs have the same md5sum for the target image, which assumes that the customer has validated the image against the CCO release MD5 before uploading it to the APIC.

It would be great to enhance this script to use the CCO release md5sum as the single source of truth for validation. This would help catch corrupted images and the upgrade failures they cause.
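
A minimal sketch of such a comparison, assuming the published MD5 has already been obtained by some other means. The function names and the `expected_md5` parameter are hypothetical and are not part of the script:

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches_published_md5(path, expected_md5):
    """Return True when the local image's MD5 equals the published value.

    expected_md5 is assumed to be the checksum copied from the release page.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte images.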

CSCwh81430 for N9K-C93108TC-FX3P

(use upvote 👍 for attention)

Validation Type

[X] - Other

What needs to be validated

Because of CSCwh81430, customers with N9K-C93108TC-FX3P are advised not to upgrade or reload until further notice.

Why it needs to be validated

Reloading or upgrading N9K-C93108TC-FX3P can trigger an outage because the link would not come up.

Relax `Features that need to be Disabled prior to Upgrade` check for pre-installed apps

Currently this check will flag on every app that is enabled, which includes the pre-installed apps.

Pre-installed apps typically cannot be disabled, only removed. We have received confirmation that upgrading with pre-installed apps enabled should not pose an issue to the upgrade.

As such, this check needs to be relaxed for such apps including:

  • Cisco_NIBASE (ACI version 4.2+)
  • Nexus Insights Cloud Connector (ACI Version 5.2+)
  • Cisco_ApicVision
  • Base Package apps:
    • Search
    • APIC Postman
    • Contract Viewer
    • VisuDash

fvUplinkOrderCont parameter 'active' should not be empty

Validation Type

[ ] - Fault

[ x ] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

For the MO 'fvUplinkOrderCont', the property 'active' should not be empty. If it is empty, changing the vSwitch policy associated with the VMM domain can cause an outage on any EPG port-groups (PGs) with an empty 'active' field: on vCenter (VC), these PGs have all their uplinks moved to unused, matching the blank 'active' uplinks field.


node-3# moquery fvUplinkOrderCont | egrep '(\#|^dn|^active|^standby)'
# fvUplinkOrderCont
active      :
dn          : uni/tn-OPERATIONS/ap-AP-SHARED-SVCS/epg-EPG-SHARED-SVCS-TEMPLATES/rsdomAtt-[uni/vmmp-VMware/dom-VDS-USDEN01]/uplinkorder
standby     :
# fvUplinkOrderCont
active      :
dn          : uni/tn-OPERATIONS/ap-AP-SHARED-SVCS/epg-EPG-SHARED-SVCS-INTERNAL/rsdomAtt-[uni/vmmp-VMware/dom-VDS-USDEN01]/uplinkorder
standby     :
# fvUplinkOrderCont
active      :
dn          : uni/tn-OPERATIONS/ap-AP-SHARED-SVCS/epg-EPG-SHARED-SVCS-PUBLIC/rsdomAtt-[uni/vmmp-VMware/dom-VDS-USDEN01]/uplinkorder
standby     : 

Why it needs to be validated

A customer may hit an outage under these conditions:

  • fvUplinkOrderCont is created with a blank 'active' field. This can happen in older versions that do not have the fix for CSCvr96408.
  • The vCenter-side PGs are affected, but instead of fixing the config on ACI (setting the active uplink field to the required values in fvUplinkOrderCont), the config is fixed on VC by manually placing the required uplinks in 'active' on the PG.
  • The ACI side still has an empty 'active' in fvUplinkOrderCont.
  • This incorrect config will be carried over during the upgrade.

If any fvUplinkOrderCont has an empty 'active' field, this needs to be brought to the customer's attention. The ACI configuration should be fixed to avoid issues in the future.
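
The requested check could be sketched roughly as follows. This assumes the objects arrive as a list of dicts in the usual APIC class-query JSON shape (class name, then `attributes`); the function name and the sample DNs are hypothetical:

```python
def find_empty_active_uplinks(uplink_order_conts):
    """Return the DNs of fvUplinkOrderCont objects whose 'active' is empty."""
    flagged = []
    for mo in uplink_order_conts:
        attrs = mo["fvUplinkOrderCont"]["attributes"]
        # Treat a missing or whitespace-only value the same as an empty one.
        if not attrs.get("active", "").strip():
            flagged.append(attrs["dn"])
    return flagged


# Hypothetical sample data for illustration only.
sample = [
    {"fvUplinkOrderCont": {"attributes": {
        "dn": "uni/tn-T1/ap-AP1/epg-EPG1/rsdomAtt-[uni/vmmp-VMware/dom-D1]/uplinkorder",
        "active": "", "standby": ""}}},
    {"fvUplinkOrderCont": {"attributes": {
        "dn": "uni/tn-T1/ap-AP1/epg-EPG2/rsdomAtt-[uni/vmmp-VMware/dom-D1]/uplinkorder",
        "active": "1,2", "standby": "3"}}},
]
flagged_dns = find_empty_active_uplinks(sample)
```

Here only the first sample object would be reported, since the second has a populated 'active' list.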

`L3Out Subnets (F0467 prefix-entry-already-in-use)` does not work on older switch versions

(use upvote 👍 for attention)

Validation Type

[F0467(prefix-entry-already-in-use)] - Fault

[Same subnet with import-Security scope configured on different L3outs in same vrf] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

Same subnet with import-Security scope configured on different L3outs in same vrf

Why it needs to be validated

After an upgrade from 3.2(5e) to 5.2(5c), the F0467 (prefix-entry-already-in-use) fault was raised because the same subnet with import-security scope was configured on different L3Outs in the same VRF. This configuration is not supported in the 5.2 release. Depending on the network design and configuration, the issue may have traffic impact.

Additional context

[Check 20/43] L3Out Subnets (F0467 prefix-entry-already-in-use) failed to identify the issue on the 3.2(5e) release in my case.

NewValidation: Compare apicca.crt and apicca.key before upgrade.

(use upvote 👍 for attention)

Validation Type

[ missing Cert can cause spine reload due to nxqmtt crash in 5.2.x] - Fault

[ check apic cert corruption] - Config

[ CSCvy35257] - Bug

[ ] - Other

What needs to be validated

Verify that both commands return the same value. If they match, the check passes; otherwise, recover the cert before the upgrade.
openssl rsa -modulus -noout -in /securedata/apicca/apicca.key
openssl x509 -modulus -noout -in /securedata/apicca/apicca.crt
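
As a rough sketch, the comparison of the two command outputs could look like this. It assumes the outputs have already been captured as text and contain the `Modulus=` line that `openssl ... -modulus` prints; the function names are hypothetical:

```python
def extract_modulus(openssl_output):
    """Return the hex modulus from 'openssl ... -modulus' output, or None."""
    for line in openssl_output.splitlines():
        if line.startswith("Modulus="):
            return line.split("=", 1)[1].strip().upper()
    return None


def key_matches_cert(key_output, cert_output):
    """True only when both outputs carry a modulus and the values are equal."""
    key_modulus = extract_modulus(key_output)
    cert_modulus = extract_modulus(cert_output)
    return key_modulus is not None and key_modulus == cert_modulus
```

Returning False when either modulus is missing means a failed or empty command output is treated as a mismatch rather than silently passing.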

Why it needs to be validated

It can cause a spine crash on 5.2.6e due to CSCwc74242.

New Validation Request: whether many interface override policies configured before upgrading to 5.2(4d) or above

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[X] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

Customers who implemented interface configuration via the CLI may have created many interface override policies.

Why it needs to be validated

If a customer upgrades the fabric from any version prior to 5.2(4d) to 5.2(4d) or above, this type of configuration can cause APIC shard-20 to take hours to complete postUpgradeCb. If a switch upgrade is triggered before the APIC completes it, the upgraded switches will experience a bootstrap delay and will be unable to fetch the required access policies until the APIC finishes.

This validation is designed to identify whether the fabric has such a configuration so that the customer can be alerted.

Add check for F606755 as part of CSCwb67893

Fault F606755 should be checked before an upgrade to see whether any L3Out configuration has failed.

POD6-APIC-4# moquery -c faultInst -f 'fault.Inst.code=="F606755"'
Total Objects shown: 1

fault.Inst

code : F606755
ack : no
cause : fsm-failed
changeSet :
childAction :
created : 2022-04-21T13:36:16.485+00:00
delegated : no
descr : [FSM:FAILED]: Configure EpP for EPg EXT_NET_NETBACKUP(TASK:ifc:policymgr:FvEPgEpP)
dn : uni/tn-PRODUCTION/out-NETBACKUP_L3OUT/instP-EXT_NET_NETBACKUP/fault-F606755
domain : infra
highestSeverity : major
lastTransition : 2022-04-21T13:36:16.485+00:00
lc : raised
modTs : never
occur : 1
origSeverity : major
prevSeverity : major
rn : fault-F606755
rule : fsm-ep-pfsm-fail
severity : major
status :
subject : task-ifc-policymgr-fvepg-epp
type : config
uid :

If this fault is raised, it could be because of a duplicated node profile/logical interface profile with an EIGRP neighborship. The duplication raises this fault, and after an upgrade to a version with PD it can cause PM to crash. This has a cascading effect and should be added as a check.

FabricDomain Name check is failing when the target version is 6.0(2h)

(use upvote 👍 for attention)
Describe the bug
FabricDomain Name check (fabricdomain_name_check()) fails with an error when the target version is 6.0(2h).

Script output

--- omit ---

You have chosen version "6.0(2h)"

--- omit ---

[Check 46/47] FabricDomain Name...
              Error: list indices must be integers or slices, not str...                                                          ERROR !!

To Reproduce
Steps to reproduce the behavior such as:

  1. Upload 6.0(2h) to an APIC
  2. Run the script
  3. Select the target version as 6.0(2h)

Expected behavior
This check should result in FAIL - OUTAGE WARNING!! instead of ERROR when the target version is 6.0(2h).

Additional context

This is failing with an error because the logic is not handling the API query response correctly. There should not be imdata as it's already stripped off in icurl().

         fabricDomain = controller['imdata'][0]['topSystem']['attributes']['fabricDomain']

The pytest is not failing because the test data fed to this function was formed to work with this wrong handling, but is different from the actual input.
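
A minimal sketch of the corrected access, assuming (as described above) that the value handed to the check is already the list that used to sit under `imdata`. The function name and sample data are hypothetical:

```python
def get_fabric_domain(controllers):
    """Read fabricDomain from the first topSystem object.

    controllers is assumed to be the list of objects with the 'imdata'
    wrapper already stripped, so it is indexed by position, not by key.
    """
    return controllers[0]["topSystem"]["attributes"]["fabricDomain"]


# Hypothetical sample shaped like the stripped response.
sample_controllers = [{"topSystem": {"attributes": {"fabricDomain": "fab#1"}}}]
fabric_domain = get_fabric_domain(sample_controllers)
```

Test data for the function would need the same stripped shape, otherwise the unit test passes while the real input fails.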

bgp_peer_loopback_check validation

Describe the bug
The "bgp_peer_loopback_check" validation fails to detect instances where a BGP loopback is manually created without utilizing the "Use Router ID as Loopback Address" checkbox.

Script output
[Check 33/49] BGP Peer Profile at node level without Loopback... FAIL - OUTAGE WARNING!!
  Tenant  L3Out  Node Profile    Pod  Node  Recommended Action
  ------  -----  ------------    ---  ----  ------------------
  tenant  l3     l3_nodeProfile  1    101   Configure a loopback or configure bgpPeerP under interfaces instead of nodes

To Reproduce
Steps to reproduce the behavior:

  1. Create a BGP l3out
  2. Create a loopback manually without "Use Router ID as Loopback Address" under node profile
  3. Create a BGP peering sourced from that loopback

Expected behavior
In this scenario, the script should not issue an outage warning because the configuration is valid.

Additional context

    for l3extLNodeP_child in l3extLNodeP['l3extLNodeP']['children']:   #<<<<<<<<<<<<<<<<<<<<<<<< Outer "for" loop
        if not l3extLNodeP_child.get('l3extRsNodeL3OutAtt'):
            continue
        if l3extLNodeP_child['l3extRsNodeL3OutAtt']['attributes']['rtrIdLoopBack'] == 'yes':
            continue
        if l3extLNodeP_child['l3extRsNodeL3OutAtt'].get('children'):
            for rsnode_child in l3extLNodeP_child['l3extRsNodeL3OutAtt']['children']: #<<<<<<<<<<<<<<<<<<<<<<<< Inner "for" loop
                if rsnode_child.get('l3extLoopBackIfP'):
                    continue                                                          #<<<<<<<<<<<<<<<<<<<<<<<< The "continue" statement interrupts the inner "for" loop, allowing the code to proceed to the "data.append" section and trigger an outage warning, even if a loopback has been detected.

        # No loopbacks are configured for this node even though it has bgpPeerP
        name = re.search(name_regex, l3extLNodeP['l3extLNodeP']['attributes']['dn'])
        dn = re.search(node_regex, l3extLNodeP_child['l3extRsNodeL3OutAtt']['attributes']['tDn'])
        data.append([
            name.group('tenant'), name.group('l3out'), name.group('nodep'),
            dn.group('pod'), dn.group('node'), recommended_action])
if not data:
    result = PASS
print_result(title, result, msg, headers, data)
return result
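
One way the loop could be restructured so that a detected loopback actually skips the node is sketched below. It covers only the loop shown above, assumes the same nested dict shape, and uses a hypothetical function name; it is not the maintainers' fix:

```python
def nodes_missing_loopback(l3extLNodeP):
    """Return tDn values of nodes with neither a router-ID loopback
    nor a manually configured l3extLoopBackIfP child."""
    missing = []
    for child in l3extLNodeP["l3extLNodeP"].get("children", []):
        rsnode = child.get("l3extRsNodeL3OutAtt")
        if not rsnode:
            continue
        if rsnode["attributes"]["rtrIdLoopBack"] == "yes":
            continue
        # Decide per node first, then skip at the outer-loop level.
        has_manual_loopback = any(
            c.get("l3extLoopBackIfP") for c in rsnode.get("children", [])
        )
        if has_manual_loopback:
            continue
        missing.append(rsnode["attributes"]["tDn"])
    return missing
```

Computing `has_manual_loopback` with `any()` avoids relying on a `continue` inside the inner loop, which only affects that inner loop.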

Request to add check for CSCvv30303

CSCvv30303 - ACI: Same subnet under BD and EPG with different scope is a misconfig

Conditions:

A given subnet is defined under both an EPG and its corresponding BD, but the two subnet definitions have different combinations of "scope" values [private, public, shared].

This config is, and has always been, considered a misconfiguration.

Workaround:

Ensure that the subnet scope on the bridge domain and EPG are the same.
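
A rough sketch of the comparison, assuming the BD and EPG subnets have already been collected into dicts that map the subnet IP to its scope string, and that the scope is a comma-separated list such as "public,shared". Both the input shape and the function name are assumptions for illustration:

```python
def find_scope_mismatches(bd_subnets, epg_subnets):
    """Return (ip, bd_scope, epg_scope) for subnets defined on both the BD
    and the EPG whose scope sets differ."""
    mismatches = []
    for ip, epg_scope in sorted(epg_subnets.items()):
        bd_scope = bd_subnets.get(ip)
        if bd_scope is None:
            continue  # subnet exists only under the EPG
        # Compare as sets so that ordering of the flags does not matter.
        if set(bd_scope.split(",")) != set(epg_scope.split(",")):
            mismatches.append((ip, bd_scope, epg_scope))
    return mismatches
```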

fvnsEncapBlk.role=internal caused vlan deletion post upgrade and outage.

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[X] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

For fvnsEncapBlk.role=internal, since ACI 4.2(6m) or 5.2(1g), the corresponding stpAllocEncapBlkDef is deleted after the upgrade. As a result, those VLANs are deleted from the related switches. If they were previously used for any user tenant, an outage can be expected.

Why it needs to be validated

An outage can be triggered after the upgrade if the customer is not aware of this behavior change, introduced by [CSCvw33061].

Additional context

This behavior change is publicly release-noted and documented; however, we see customers still facing this trouble.

NewValidation: Add check for mgmtSubnet configuration when upgrading APICs from 4.2 to 5.2

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[ x] - Config

[ x] - Bug

[ ] - Other

What needs to be validated

  1. identify which mgmtSubnet(s) are configured
    apic1# moquery -c mgmtSubnet
    Total Objects shown: 1

mgmt.Subnet

ip : 10.1.1.0/24
annotation :
childAction :
descr :
dn : uni/tn-mgmt/extmgmt-default/instp-out-of-band-mgmt-external/subnet-[10.1.1.0/24]
extMngdBy :
lcOwn : local
modTs : 2018-08-29T07:47:00.348+00:00
name :
nameAlias :
rn : subnet-[10.1.1.0/24] <<<<<<<<<<<
status :
uid : 15374
userdom : all

  2. check if from/to version is 4.2(7) to 5.2(7)

Why it needs to be validated

In 4.2(7), due to CSCvz96117, the configured mgmtSubnet is not honored when an OOB contract is configured. This was resolved in 5.2(3e)+. If a customer does not have 0.0.0.0/0 (or whichever specific subnet they will be managing their fabric from) configured and proceeds with the upgrade, they will be unable to reach the APICs via SSH.

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvz96117

Additional context

Switches are not affected by this change in behavior, only management access to controllers

VNID Mismatch check fails on newer APICs running py3

(use upvote 👍 for attention)
Describe the bug
VNID Mismatch check fails on newer APICs running py3.

Script output

[Check 31/45] VNID Mismatch... 
              Error: '<' not supported between instances of 'dict' and 'dict'...                                                  ERROR !!

To Reproduce
Run the script on an APIC which has the following conditions

  • APIC is running a newer version such as 6.0(2) that only has py3.
  • APIC meets the condition to fail VNID Mismatch (i.e. it would result in Fail - Outage).

Expected behavior
The check should print the VNIDs that are mismatched across nodes.

Additional context
N/A
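
For reference, the error above is what Python 3 raises when dicts are compared with `<`, for example when a list of dicts is sorted without a key (Python 2 allowed this). Whether sorting is the exact operation in this check is an assumption; the field names below are hypothetical:

```python
# Hypothetical rows, shaped like per-node VNID records.
rows = [
    {"node": "102", "vnid": "15000"},
    {"node": "101", "vnid": "15001"},
]

# sorted(rows) would raise TypeError on Python 3 because dicts are unorderable.
# Sorting on an explicit key behaves the same on Python 2 and Python 3.
sorted_rows = sorted(rows, key=lambda row: (row["node"], row["vnid"]))
```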

Route-target Type validation for GOLF

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[x] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

For every CTX listed by l3ext.GlobalCtxName, if bgp.RtTarget.rt of that VRF starts with "extended:", the current version is older than 4.2(1a), and the target version is equal to or newer than 4.2(1a), we should alert the customer to a possible outage before the upgrade.

Why it needs to be validated

We have seen customers with GOLF deployed hit an outage after upgrading from 3.2 to 4.2, caused by the export route-target not being deployed to BGP.

Additional context

The enforcement of route target type was set since 4.2(1i).
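
The condition described above can be sketched as follows. The version parser and function names are hypothetical helpers, not the script's own, and the boundary version is a parameter because this issue mentions both 4.2(1a) and 4.2(1i):

```python
import re


def parse_version(version):
    """Turn a version like '4.2(1a)' into a comparable tuple (4, 2, 1, 'a')."""
    m = re.match(r"(\d+)\.(\d+)\((\d+)([a-z]*)\)", version)
    if not m:
        raise ValueError("unrecognized version: %s" % version)
    return (int(m.group(1)), int(m.group(2)), int(m.group(3)), m.group(4))


def golf_rt_at_risk(current, target, route_targets, boundary="4.2(1a)"):
    """True when the upgrade crosses the boundary and any route target of a
    GOLF VRF starts with 'extended:'."""
    edge = parse_version(boundary)
    crossing = parse_version(current) < edge <= parse_version(target)
    return crossing and any(rt.startswith("extended:") for rt in route_targets)
```

An upgrade that starts at or above the boundary is not flagged, since the enforcement would already be in effect on the current version.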

eventmgr_db_defect_check should check CSCvt07565

Describe the enhancement
eventmgr_db_defect_check in the current code checks whether the APIC version is affected by CSCvn20175.
However, there is another eventmgr DB defect, CSCvt07565, which was fixed in later releases.
For example, although version 3.2.6 passes the current eventmgr_db_defect_check, it can be affected by CSCvt07565 and the upgrade may fail.
My suggestion is that eventmgr_db_defect_check check whether the APIC version is affected by CSCvt07565.

Current behavior/output
eventmgr_db_defect_check checks if APIC version is affected by CSCvn20175 or not.

Suggested behavior/output
eventmgr_db_defect_check checks if APIC version is affected by CSCvt07565 or not.
We would not need to change the current recommended_action.

recommended_action = 'Contact Cisco TAC to check the DB size via root'

Check `acidiag verifyapic` for cert validity if cver below 3.2.7f

The goal would be to catch failures in acidiag verifyapic and address them before upgrading to versions 3.2.7f+.

In the example below, we would flag PID:UCSC-C220-M3S.

acidiag verifyapic
openssl_check: certificate details
subject= CN=XXXXXXXXXX,serialNumber=PID:UCSC-C220-M3S  SN:XXXXXXXXXX
issuer= CN=Cisco Manufacturing CA,O=Cisco Systems
notBefore=Nov 11 15:32:41 2021 GMT
notAfter=May 14 20:25:41 2029 GMT
openssl_check: passed
openssl_check: certificate details
subject= /serialNumber=PID:UCSC-C220-M3S  SN:XXXXXXXXXX/CN=XXXXXXXXXX
Certificate doesn't match APIC format
apic_cert_format_check: failed

Add check for CSCvz84036 - Disabling EECDH base cipher causes nginx.conf to be invalid

A new check is required for CSCvz84036 - Disabling 'EECDH' cipher fails nginx validation blocking new cert if 'EECDH+XX+XX' ciphers enabled.

When a customer disables the base cipher, their nginx.conf file becomes invalid. Nginx and the UI continue to work on the current version; however, after the upgrade the UI will be unavailable until the issue is resolved.

A check is needed for DN "uni/fabric/comm-default/https/cph-EECDH" with "state: enabled".

apic1# moquery -d uni/fabric/comm-default/https/cph-EECDH
Total Objects shown: 1

# comm.Cipher
id           : EECDH
annotation   :
childAction  :
dn           : uni/fabric/comm-default/https/cph-EECDH
extMngdBy    :
lcOwn        : local
modTs        : 2023-01-31T21:42:58.332-05:00
rn           : cph-EECDH
state        : enabled
status       :
uid          : 0
userdom      : all
weak         : no
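
A minimal sketch of reading that state, assuming the commCipher objects are available as a list of dicts in class-query JSON shape. The function name is hypothetical, and treating any state other than "enabled" as the condition to flag is an interpretation of the request above:

```python
def eecdh_base_cipher_enabled(ciphers):
    """Return True/False for the EECDH base cipher state, or None when the
    cph-EECDH object is not present in the input."""
    for mo in ciphers:
        attrs = mo["commCipher"]["attributes"]
        if attrs["id"] == "EECDH":
            return attrs["state"] == "enabled"
    return None
```

A caller would flag the fabric when this returns False, and could report the check as not applicable when it returns None.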

Add check for CSCwb80058 - Reduced number of uplinks

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[ x] - Config

[ ] - Bug

[ ] - Other

What needs to be validated

  1. Identify the model of switches that had their uplink information changed under CSCwb80058

moquery -c fabricNode

Look for these 4 models and check whether their role is leaf. The moquery above gives you both.
southlake N9K-C93240YC-FX2/N9K-C93300YC-FX2
Duvel N9K-C9364D-GX2A
redhorse N9K-C9364C-GX
st_archer N9K-C93360YC-FX2

  2. For each model, get the node ID (say node-105).
    Check the number of uplinks for each model

moquery -c eqptPortP | grep node-105 | wc

If the count is > 56 then raise the alarm of a possible issue post upgrade.

Why it needs to be validated

When the switch is upgraded, there is a possibility that some of the uplinks will not function as uplinks. Which uplinks are affected is random.
You will also get a fault:
Jul 25 11:01:30 Tier1_Leaf %LOG_LOCAL0-4-SYSTEM_MSG [F2981][raised][portp-policy-limit-exceeded][warning][sys/ops/slot-lcslot-1/portpol-14/fault-F2981] PortP policy limit exceeded

Additional context

The fix is to reduce the number of uplinks to 56 or less for tier 1 leaves.
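
The per-node count from step 2 could be sketched like this, mirroring the `grep node-105 | wc` approach. It assumes the eqptPortP DNs have already been collected as strings that contain `node-<id>`; the function name and the DN suffix used in the example are hypothetical:

```python
import re
from collections import Counter


def nodes_over_portp_limit(portp_dns, limit=56):
    """Count eqptPortP objects per node and return {node_id: count} for
    every node whose count exceeds the limit."""
    counts = Counter()
    for dn in portp_dns:
        m = re.search(r"node-(\d+)", dn)
        if m:
            counts[m.group(1)] += 1
    return {node: n for node, n in counts.items() if n > limit}


# Hypothetical DNs: 57 entries on node 105, 10 entries on node 106.
example_dns = ["topology/pod-1/node-105/example-portp-%d" % i for i in range(57)]
example_dns += ["topology/pod-1/node-106/example-portp-%d" % i for i in range(10)]
over_limit = nodes_over_portp_limit(example_dns)
```

The result would still need to be restricted to the affected leaf models listed in step 1.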

APIC SSD fault not raising due to CSCvx28453

Hi Team,

I have enhanced the apic_ssd_check function so that we can catch the scenario where APIC SSD faults are not raised due to CSCvx28453.

Please review and comment.

cheers,
welkin

NewValidation:need the check to validate CSCwf80352

This is a request to implement a check in the pre-upgrade script for CSCwf80352.

In short, if the Fabric Name contains "#" or ";", the conversion script will truncate it, meaning the APIC will have a new fabric name in sam.config after the upgrade. When a switch reboots during the upgrade, it cannot join the fabric because it receives the truncated fabric name, which does not match the DME config.

Will update the defect with more info upon confirming with the DE team. Thanks!
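
The character test itself is small. A minimal sketch, with a hypothetical function name and assuming the fabric name has already been read as a string:

```python
def fabric_name_has_unsafe_chars(fabric_name):
    """True when the fabric name contains '#' or ';'."""
    return any(ch in fabric_name for ch in "#;")
```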

NewValidation: CSCwf00416 ACI 16.0.2 leaf sends manually configured intf description as port description in lldp instead of dn

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[ ] - Config

[ x] - Bug

[ ] - Other

What needs to be validated

[CSCwf00416] ACI 16.0.2 leaf sends manually configured intf description as port description in lldp instead of dn

  1. Check if there is any interface profile that has a desc configured. If there is one, then we need to find out whether that interface profile is tied to a VMM domain
    moquery -c infraAccPortP | egrep "^dn|desc"

  2. Get a list of all the VMM domains that have on-demand attachment. This issue affects only VMM domains with on-demand resolution
    a-apic1# moquery -c fvRsDomAtt | egrep -B 16 "instrImedcy : lazy" | grep "dn*.*vmmp"

  3. Find the AEP associated with vmm domain. This will help get the policy group tied to the AEP.
    a-apic1# moquery -c infraRtDomP | grep ^dn | grep vmmp

  4. Get the policy group associated with the specific VMM AEPs
    a-apic1# moquery -c infraRtAttEntP | grep ^dn

  5. See which interface profile has the specific policy group associated with the VMM AEPs and whether there is a description associated with that profile
    a-apic1# moquery -c infraAccPortP -x rsp-subtree=full | egrep -A 18 "infra.AccPortP|infra.RsAccBaseGrp" | egrep "^dn|^desc|^ dn|^ tDn"

  6. If there is a description associated with that profile, then we need to flag it.

Why it needs to be validated

VLANs will not be deployed on the leaf.

Verify [Check 43/43] APIC CA Cert Validation works in 3.2 release

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[check apic cert corruption] - Config

[CSCvy35257] - Bug

[ ] - Other

What needs to be validated

Verify that both commands return the same value. If they match, the check passes; otherwise, recover the cert before the upgrade.
openssl rsa -noout -modulus -in /securedata/apicca/apicca.key | openssl md5
openssl x509 -noout -modulus -in /securedata/apicca/apicca.crt | openssl md5

Why it needs to be validated

The F1419 (service failed) fault is raised after the APIC upgrade.

Additional context

Check 43 failed to identify the issue on the 3.2(5e) release. Please help verify.

Validate vnsLDevCtx with context=""

Validation Type

[x] - Config
An empty context for vnsLDevCtx is not supported and can cause unexpected forwarding issues.

What needs to be validated

If vnsLDevCtx.context=="", raise an alert before upgrading.

Why it needs to be validated

To prevent svg outage post upgrade

bootflash fail over 50% if node already staged

I ran the script after a spine had downloaded the new image.
The spine then held both the running image and the new image.
The script showed a FAIL because the spine's bootflash was now over 50%.
I think this is a false positive.

Gathering APIC Versions from Firmware Repository...

What is the Target Version? : 2

[Check 7/37] Switches are all in Active state... PASS
[Check 8/37] NTP Status... PASS
[Check 9/37] Firmware/Maintenance Groups when crossing 4.0 Release... Versions not applicable N/A
[Check 10/37] Features that need to be Disabled prior to Upgrade... PASS
[Check 11/37] Switch Upgrade Group Guidelines... PASS
[Check 12/37] APIC Disk Space Usage (F1527, F1528, F1529 equipment-full)... PASS
[Check 13/37] Switch Node /bootflash usage... FAIL - UPGRADE FAILURE!!
  Pod-ID  Node-ID  Utilization    Alert
  ------  -------  -----------    -----
  1       1001     58.4411409341  Over 50% usage! Contact Cisco TAC for Support

VWNSPN1001# ls -l
total 6563128
-rw-rw-rw- 1 root root 4977087 Mar 25 03:02 CpuUsage.Log
-rw-rw-rw- 1 root root 1776793780 Nov 5 2019 aci-n9000-dk9.14.1.2u.bin
-rw-rw-rw- 1 root root 1961654688 Mar 25 22:00 aci-n9000-dk9.14.2.7f.bin
-rw-r--r-- 1 root root 1500113199 Aug 29 2019 auto-k
-rw-r--r-- 1 root root 1471706292 Nov 5 2019 auto-s
-rw-rw-rw- 1 root root 2 Nov 5 2019 diag_bootup
-rw-r--r-- 1 root root 54 Mar 25 22:50 disk_log.txt
-rw-rw-rw- 1 root root 7 Mar 25 22:00 imgDnldStatus
-rw-rw-rw- 1 root root 693 Nov 5 2019 libmon.logs
drwxr-xr-x 4 root root 4096 Aug 29 2019 lxc
-rw-r--r-- 1 root root 4383581 Mar 25 22:50 mem_log.txt
-rw-r--r-- 1 root root 724404 Nov 5 2019 mem_log.txt.old.gz
-rw-r--r-- 1 root root 180202 Mar 25 22:50 mts_buffer_log.log
-rw-rw-rw- 1 root root 41461 Feb 18 04:29 urib_api_log.txt

VWNSPN1001# show version
Cisco Nexus Operating System (NX-OS) Software
TAC support: http://www.cisco.com/tac
Documents: http://www.cisco.com/en/US/products/ps9372/tsd_products_support_series_home.html
Copyright (c) 2002-2014, Cisco Systems, Inc. All rights reserved.
The copyrights to certain works contained in this software are
owned by other third parties and used and distributed under
license. Certain components of this software are licensed under
the GNU General Public License (GPL) version 2.0 or the GNU
Lesser General Public License (LGPL) Version 2.1. A copy of each
such license is available at
http://www.opensource.org/licenses/gpl-2.0.php and
http://www.opensource.org/licenses/lgpl-2.1.php

Software
BIOS: version 05.35
kickstart: version 14.1(2u) [build 14.1(2u)]
system: version 14.1(2u) [build 14.1(2u)]
PE: version 4.1(2u)
BIOS compile time: 05/10/2019
kickstart image file is: /bootflash/aci-n9000-dk9.14.1.2u.bin
kickstart compile time: 10/30/2019 11:58:04 [10/30/2019 11:58:04]
system image file is: /bootflash/auto-s
system compile time: 10/30/2019 11:58:04 [10/30/2019 11:58:04]

Hardware
cisco N9K-C9364C ("supervisor")
Intel(R) Xeon(R) CPU D-1526 @ 1.80GHz with 32695296 kB of memory.
Processor Board ID FDO22350P3G

Device name: VWNSPN1001
bootflash: 125029376 kB

Kernel uptime is 871 day(s), 02 hour(s), 49 minute(s), 44 second(s)

Last reset at 26000 usecs after Tue Nov 05 21:01:55 2019 UTC
Reason: reset-by-installer
System version: 14.1(2m)
Service: Upgrade

plugin
Core Plugin, Ethernet Plugin
VWNSPN1001#

Check 38 Eventmgr DB size defect susceptibility FAIL if version is 3.2(10)

Check 38 results in FAIL if the current version is 3.2(10).

3.2(9h)

fab1-apic1# python /data/techsupport/aci-preupgrade-validation-script.py
...
Checking current APIC version (switch nodes are assumed to be on the same version)...3.2(9h)
...
[Check 38/42] Eventmgr DB size defect susceptibility...                                                                               PASS

3.2(10e) or later 3.2(10)

fab1-apic1# python /data/techsupport/aci-preupgrade-validation-script.py
...
Checking current APIC version (switch nodes are assumed to be on the same version)...3.2(10g)
...
[Check 38/42] Eventmgr DB size defect susceptibility...                                                           FAIL - UPGRADE FAILURE!!
  Potential Defect  Recommended Action
  ----------------  ------------------
  CSCvn20175        Contact Cisco TAC to check the DB size via root

Warn about F0467 - Encap already in use

Warn about encap in use so that any conflicting config can be deleted before the upgrade. This might prevent the wrong vlan from being pushed to an interface after an upgrade.

NewValidation: CSCwb08081 Set clause not applying if prefix list is empty or not explicitly matched

(use upvote 👍 for attention)

Validation Type

[ ] - Fault

[X ] - Config

[ ] - Bug

[ ] - Other

What needs to be validated


Step 1

moquery -c bgpPeer | grep dn | grep -v overlay-1 | cut -d "/" -f3,8

This should give you the border leaf node ID, VRF, and the peer IP.

Step 2 - For each of the BL run the below command
show ip bgp nei vrf | grep "Inbound route-map"

show ip bgp nei 10.103.3.234 vrf 695365052:USER | grep "Inbound route-map"

Leaf-104# show ip bgp nei 10.103.3.234 vrf 695365052:USER | grep "Inbound route-map"
Inbound route-map configured is imp-l3out-L3OUT_SDWAN_SOHO-peer-2359297, handle obtained

This will give you the route map name

Step 3: Check the route map

Leaf-104# show route-map imp-l3out-L3OUT_SDWAN_SOHO-peer-2359297
route-map imp-l3out-L3OUT_SDWAN_SOHO-peer-2359297, permit, sequence 4601
Match clauses:
community (community-list filter): peer32773-2359297-exc-ext-in-SDWAN-SOHO-TO-USER-IN1COMMUNITY_SDWAN_SOHO1COMMUNITY_SDWAN_SOHO-rgcom
Set clauses:
local-preference 200
community 65100:101 65100:500 65100:601 65381:4 additive

Step 4:
Check if the route map has a community match clause only
If yes then you are susceptible to this issue

Why it needs to be validated

Inbound routes will not have the right attributes set, which can affect other downstream actions matching on the set attributes.

Check for F1394 - (This fault occurs when a port is down and is in use for fabric)

Hello,
Suggested enhancement.
I had a customer who ran the pre-upgrade script, and it didn't flag any of the F1394 faults that they apparently had.
For some reason F1394 has minor severity. I am not sure I agree with this: in this case the customer didn't give it the attention it deserved, which resulted in the upgrade maintenance window being postponed once the fault was found.

F1394: fltEthpmIfPortDownFabric
Severity: minor
Explanation: This fault occurs when a port is down and is in use for fabric

ENH: Add script version details to output for pre-upgrade-validation script

Use case:
Recently a customer encountered an issue on upgrade where the eventmgr DB size check passed due to faulty logic within the pre-upgrade-validation script. This is likely due to the bug reported under issue #44; however, we are not able to prove which version of the script the customer was using.

Feature request:
Add a script version/compile date to the script output so that we can identify which version of the script was run.

Check 26/36 - HW Programming Failure: Will return "API call failed! " on pre 4.1(1) versions

When running the script, you may see the following:

[Check 26/36] HW Programming Failure (F3544 L3Out Prefixes, F3545 Contracts, actrl-resource-unavailable)...
Error: API call failed! Check debug log... ERROR !!

There is a debug.log within the .tgz output bundle with the following traces:

[2021-08-13 14:26:59.488-0400 INFO icurl:506 ] cmd = icurl -gs http://127.0.0.1:7777/api/class/faultInst.json?query-target-filter=or(eq(faultInst.code,"F3544"),eq(faultInst.code,"F3545"))
[2021-08-13 14:26:59.540-0400 DEBUG icurl:508 ] response: {"totalCount":"1","imdata":[{"error":{"attributes":{"code":"301","text":"Incorrect filter format for faultInst.code, value 'F3544' is not valid"}}}]}
[2021-08-13 14:26:59.541-0400 ERROR <module>:2125] API call failed! Check debug log

The core of the issue here is that fault code F3544 did not exist until 4.1(1).

A "fix" for this check would be to return a more meaningful message if currentversion < 4.1(1) to let the user know that this check will not work on their version.
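
As an illustrative sketch, the error object in the response shown above can be detected and its text surfaced instead of a generic "API call failed". The function name is hypothetical, and the sample is the response body from the debug log:

```python
import json


def api_error_text(raw_response):
    """Return the APIC error text from a class-query response, or None when
    the response carries no error object."""
    for item in json.loads(raw_response).get("imdata", []):
        if "error" in item:
            return item["error"]["attributes"]["text"]
    return None


# Response body copied from the debug.log trace above.
raw = ('{"totalCount":"1","imdata":[{"error":{"attributes":{"code":"301",'
       '"text":"Incorrect filter format for faultInst.code, '
       'value \'F3544\' is not valid"}}}]}')
message = api_error_text(raw)
```

A check could then print this text, or a note that the fault code does not exist on the running version, rather than ending in ERROR.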

get_target_version pulls in the analytics image

Enter username for APIC login          : admin
Enter password for corresponding User  : 

Checking current APIC version (switch nodes are assumed to be on the same version)...4.2(3l)

Gathering APIC Versions from Firmware Repository...

[1]: aci-analyticsagent-dk9.3.5.1.23.bin
[2]: aci-apic-dk9.4.2.7l.bin

What is the Target Version?     : 

We only want APIC images to be selectable.
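A rough sketch of the filtering, assuming that APIC image file names always start with `aci-apic-dk9.` as in the listing above. The function name `apic_images_only` is hypothetical and this is not how get_target_version is actually written.

```python
# Assumption: APIC firmware images use this file name prefix.
APIC_IMAGE_PREFIX = "aci-apic-dk9."


def apic_images_only(filenames):
    """Keep only the firmware repository entries that look like APIC images."""
    return [name for name in filenames if name.startswith(APIC_IMAGE_PREFIX)]


if __name__ == "__main__":
    repo = ["aci-analyticsagent-dk9.3.5.1.23.bin", "aci-apic-dk9.4.2.7l.bin"]
    # Number only the selectable entries, as the prompt does.
    for index, name in enumerate(apic_images_only(repo), start=1):
        print("[%d]: %s" % (index, name))
```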

"name 'file' is not defined" is shown in some checks.

Describe the bug
I executed aci-preupgrade-validation-script.py on 6.0(2h).
I got name 'file' is not defined as an exception in some checks.

Script output
Like below

Checking current APIC version (switch nodes are assumed to be on the same version)...6.0(2h)

Gathering APIC Versions from Firmware Repository...

[1]: aci-apic-dk9.6.0.3d.bin

What is the Target Version?     : 1

You have chosen version "aci-apic-dk9.6.0.3d.bin"

Collecting VPC Node IDs...

[Check  1/47] APIC Target version image and MD5 hash...
              Checking apic1......                                                                                                ERROR !!
                                                                                                                  FAIL - UPGRADE FAILURE!!
  APIC        Firmware  md5sum  Failure                     Recommended Action
  ----        --------  ------  -------                     ------------------
  apic1       -         -       name 'file' is not defined  -                                              <<<<<<<<<<<<<<<<<<<<<<<<<

To Reproduce
Steps to reproduce the behavior such as:

  1. Add APIC image from APIC GUI > Admin > Firmware > Images > Actions > Add firmware
  2. Confirm that the image has been added.
  3. Run aci-preupgrade-validation-script.py
  4. Select the added image.

Expected behavior
Users do not get this exception, name 'file' is not defined.

Additional context
I found that the file function is used in def start_log of class Connection.
file is a built-in function in Python 2 and was removed in Python 3.
Python 3 is used in version 6.0(2h).

apic1# acidiag version
6.0.2h

apic1# python -V
Python 3.8.10

Please see the following documents.

https://docs.python.org/2.7/library/functions.html#file
https://docs.python.org/3.8/library/functions.html

The Python 2 doc above recommends using the open function over the file function:

When opening a file, it’s preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing isinstance(f, file)).

Could you please use open or other alternatives instead of file?
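For illustration, a small sketch of the replacement is shown below. The helper name `open_log_file` is hypothetical and does not reproduce the real start_log method; it only shows that open() is available as a built-in on both Python 2 and Python 3, whereas file() raises NameError on Python 3.

```python
import os
import tempfile


def open_log_file(path):
    """Open a log file for appending using open(), which exists in
    both Python 2 and Python 3 (file() exists only in Python 2)."""
    return open(path, "a")


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "session.log")
    log = open_log_file(log_path)
    log.write("ssh session started\n")
    log.close()
    print("log written to %s" % log_path)
```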

Check 27/49 mentions a non-existent class name.

Describe the bug
The script mentions class actrlPrefix. But such class doesn't seem to exist.

Script output
[Check 27/49] HW Programming Failure (F3544 L3Out Prefixes, F3545 Contracts, actrl-resource-unavailable)... MANUAL CHECK REQUIRED
Object Class Recommended Action
#------------------------------
actrlPrefix Check that "operSt" are set to "enabled". F3544 does not exist on this version.
actrlRule Check that "operSt" are set to "enabled". F3545 does not exist on this version.

To Reproduce
Nothing specific.

Expected behavior
We expect to find class actrlPrefix, but we don't see it anywhere in ACI versions 3.2, 4.2 or 5.2.

Additional context
This is probably an incorrect class name that needs to be updated. The closest relevant class name we could find is actrlPfxEntry, but we are not sure whether it is the right class to check.

We are testing with ACI 3.2 trying to upgrade to 4.2.

Upgrading to 5.2(4) from 4.x can cause /32 static routes to not be installed in FIB

This seems to be an intended check that the software performs.
If it sees a /32 static route that overlaps with a BD subnet range on the leaf, it will not install the route into the FIB (the route is still installed in the RIB).
The BL with the static route is not affected, but other leaf nodes that receive this update from BGP are.
No fault is raised as of now, and I have already opened the issue with the developers.
This seems to be related to https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/Cisco-ACI-Floating-L3Out.html#Cisco_Concept.dita_2752bf03-c688-42c0-8721-4d463bd2c4e6
CSCwb91766

[Check 14/36] Standby APIC Disk Space Usage: Check for cntrlSbstState != erased to ignore removed standby APICs

There is an issue where standby APICs that were removed but not cleaned up properly get pulled in by the script, causing this check to report a failure:

[Check 14/36] Standby APIC Disk Space Usage... FAIL - UPGRADE FAILURE!!
SN OOB Mount Point Current Usage % Recommended Action
  0.0.0.0           -            -                failed to login to host
  192.168.12.20/24  -            -                failed to login to host

The script needs to check that the standby APIC is not erased before building the list. Alternatively, I think we can check whether it is "approved":

# infra.SnNode
id             : 20
addr           : 10.0.0.20
adminSt        : out-of-service
annotation     :
apicMode       : standby
chassis        : 2e408b08-f944-11eb-b690-c72bece1f1f6
childAction    :
cntrlSbstState : erased        <---     These ones should be ignored 
dn             : topology/pod-1/node-2/av/serial-20
extMngdBy      :
lcOwn          : local
mbSn           : FCH1822V1PU
modTs          : 2021-08-19T16:26:25.716+00:00
name           :
nameAlias      :
oobGwIpAddr    : 10.122.141.97
oobGwIpv6Addr  : ::
oobIpAddr      : 10.122.141.110/27
oobIpv6Addr    : ::
operSt         : unregistered
podId          : 2
rn             : serial-20
routableIpAddr : 0.0.0.0
status         :
uid            : 0
version        : 4.2(3l)
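The filtering could be sketched roughly as follows. It assumes the standby APIC objects have already been fetched and reduced to plain attribute dictionaries like the one shown above; the function name `usable_standby_apics` is hypothetical and the second sample entry is made up for the example.

```python
def usable_standby_apics(sn_nodes):
    """Return standby APIC entries that have not been erased.

    sn_nodes is assumed to be a list of attribute dictionaries
    with the same keys as the object dump above.
    """
    return [
        node for node in sn_nodes
        if node.get("apicMode") == "standby"
        and node.get("cntrlSbstState") != "erased"
    ]


if __name__ == "__main__":
    nodes = [
        # Values taken from the dump above: a removed standby APIC.
        {"id": "20", "apicMode": "standby", "cntrlSbstState": "erased",
         "oobIpAddr": "10.122.141.110/27"},
        # Hypothetical healthy standby APIC, for illustration only.
        {"id": "21", "apicMode": "standby", "cntrlSbstState": "approved",
         "oobIpAddr": "192.0.2.21/24"},
    ]
    for node in usable_standby_apics(nodes):
        print("will check standby APIC %s at %s" % (node["id"], node["oobIpAddr"]))
```

Filtering on `!= "erased"` keeps every other state in scope, while matching only `"approved"` would be stricter. Which one is safer depends on the states a healthy standby APIC can report, which this sketch does not settle.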

Check whether VLAN programming is set as "external (on the wire)"; if not, raise a warning

(use upvote 👍 for attentions)

Validation Type

[X] - Config

A customer was hit by this during an upgrade because of an internal VLAN configuration, which caused a complete network outage for more than 12 hours.

If you are upgrading to Cisco APIC release 4.2(6o), 4.2(7l), 5.2(1g), or later, ensure that any VLAN encapsulation blocks that you are explicitly using for leaf switch front panel VLAN programming are set as "external (on the wire)." If these VLAN encapsulation blocks are instead set to "internal," the upgrade causes the front panel port VLAN to be removed, which can result in a datapath outage.

Release note : https://www.cisco.com/c/en/us/td/docs/dcn/aci/apic/5x/release-notes/cisco-apic-release-notes-527.html
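A possible shape for such a check is sketched below. It rests on two assumptions that should be verified against the object model: that VLAN encapsulation blocks are represented by class fvnsEncapBlk, and that the setting described above is exposed as a `role` attribute with values "external" and "internal". The function name and the sample DNs are hypothetical.

```python
def find_internal_encap_blocks(encap_blocks):
    """Return the DNs of encap blocks whose role is "internal".

    encap_blocks is assumed to be a list of attribute dictionaries
    (assumed class: fvnsEncapBlk, assumed attribute: role).
    """
    return [
        blk.get("dn", "unknown")
        for blk in encap_blocks
        if blk.get("role") == "internal"
    ]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    blocks = [
        {"dn": "uni/infra/vlanns-[pool-a]-static/from-[vlan-100]-to-[vlan-199]",
         "role": "external"},
        {"dn": "uni/infra/vlanns-[pool-b]-static/from-[vlan-200]-to-[vlan-299]",
         "role": "internal"},
    ]
    for dn in find_internal_encap_blocks(blocks):
        print("WARNING: encap block set to internal: %s" % dn)
```

A script cannot tell by itself which blocks are used for front panel VLAN programming, so any hit would most likely have to be reported as a warning or a manual check rather than a hard failure.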

telemetryStatsServerP object check does not capture all the failure cases

The current condition only flags the case where telemetryStatsServerP exists with collectorLocation == "apic" and the current version is older than 4.2(4d), so upgrades that start from a later version are never flagged.

Our customer upgraded from 4.2(7v) to 5.2(7f); however, the svcconfig file on the switch became corrupt because "collectorLocation="

    if cversion.older_than("4.2(4d)") and tversion.newer_than("5.2(2d)"):
        if not isinstance(telemetryStatsServerP_json, list):
            telemetryStatsServerP_json = icurl('class', 'telemetryStatsServerP.json')
        for serverp in telemetryStatsServerP_json:
            if serverp["telemetryStatsServerP"]["attributes"].get("collectorLocation") == "apic":
                result = FAIL_O
                data.append([str(cversion), str(tversion), 'telemetryStatsServerP.collectorLocation = "apic" Found'])

    print_result(title, result, msg, headers, data, recommended_action=recommended_action, doc_url=doc_url)
    return result

Multiple checks are failing due to script not able to login to host

(use upvote 👍 for attentions)
Describe the bug
The script is unable to log in to the APIC controllers to check target version image compatibility and APIC SSD health.
Tested on three different ACI fabrics with different numbers of controllers and APIC versions. Same error output every time.

Script output
You have chosen version "aci-apic-dk9.5.2.7f.bin"

Collecting VPC Node IDs...

[Check 1/45] APIC Target version image and MD5 hash...
Checking APIC1...... ERROR !!
Checking APIC3...... ERROR !!
Checking APIC2...... ERROR !!
FAIL - UPGRADE FAILURE!!
  APIC   Firmware  md5sum  Failure                                                   Recommended Action
  ----   --------  ------  -------                                                   ------------------
  APIC1  -         -       ls command via ssh failed due to:failed to login to host  -
  APIC2  -         -       ls command via ssh failed due to:failed to login to host  -
  APIC3  -         -       ls command via ssh failed due to:failed to login to host  -

[Check 15/45] APIC SSD Health...
Checking APIC1...... ERROR !!
Checking APIC2...... ERROR !!
Checking APIC3...... ERROR !!
FAIL - UPGRADE FAILURE!!
Pod Node Storage Unit % lifetime remaining Recommended Action
1 APIC1 - - - failed to login to host
2 APIC2 - - - failed to login to host
3 APIC3 - - - failed to login to host

To Reproduce

  1. Run the script on version 3.2.7k with target version 5.2.7f
  2. Run the script on version 4.2.7f with target version 5.2.7f
  3. Run the script on version 5.2.7f with target version 5.2.7g

All of the above were tested with a local admin account.

Need to add manual check for flow control behavior change after an upgrade

If ACI leaf switches are upgraded from an older 13.x release to a newer 15.2 release, VPCs connected to devices that have link-level flow control in the auto/desirable state can go down with the error "vpc port channel mis-config due to vpc links in the 2 switches connected to different partners". Fault F0518 will also be raised.

The root cause is that ACI software in older releases incorrectly signals a far-end device with flow control in the auto/desirable state to enable send/transmit flow control. After an upgrade, the behavior is corrected, which leads to the problem.

More information is available in the following issue, which is documented for Catalyst switches connected to ACI leaf switches:
CSCvo27498.
