In the e2e tests, we are creating pod and attaching a volume through automation. B

Mount timed out, about kubernetes-sigs/ibm-powervs-block-csi-driver

Comments (32)

Madhan-SWE commented on July 25, 2024

While the expected NodeStageVolume call is in progress, another NodeStageVolume call is made for a different volume in the cluster. This could have caused the issue.

I1118 03:32:30.005319       1 node.go:92] NodeStageVolume: called with args {VolumeId:21a0ff66-bf01-45a3-add1-b4f4982854f9 PublishContext:map[wwn:6005076810830198a000000000000d09] StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-41aacd4d-9d53-4168-bbc0-77cf56596e26/globalmount VolumeCapability:mount:<fs_type:"ext4" mount_flags:"rw" > access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1637056010413-8081-powervs.csi.ibm.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:40.031333       1 node.go:356] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:40.034375       1 node.go:356] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:40.035256       1 node.go:92] NodeStageVolume: called with args {VolumeId:1659d58d-5ae9-432c-8797-363e448a696d PublishContext:map[wwn:60050768108181d628000000000044ec] StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d79ec46a-461b-4422-81c6-b99f4752c0e0/globalmount VolumeCapability:mount:<fs_type:"ext3" > access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[storage.kubernetes.io/csiProvisionerIdentity:1636700261501-8081-powervs.csi.ibm.com] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:41.953376       1 node.go:356] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:54.032953       1 node.go:356] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I1118 03:32:54.036012       1 node.go:356] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

from ibm-powervs-block-csi-driver.

Madhan-SWE commented on July 25, 2024

The NodeStageVolume method calls RescanSCSIBus() from the mounter.
After that call, nothing more is printed in the node plugin logs.

Explored the RescanSCSIBus() method: it runs the script /usr/bin/rescan-scsi-bus.sh to scan the SCSI bus. Ran the script manually; while scanning the SCSI bus for the 4th host, the script got stuck and the terminal is still hanging. This is why there are no logs after the RescanSCSIBus() call.
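
One way to keep a hung rescan script from blocking the caller forever is to bound the wait. The following is a minimal sketch, not the driver's code: `runWithDeadline` is a hypothetical helper, and `sleep` stands in for /usr/bin/rescan-scsi-bus.sh. It assumes the caller can tolerate abandoning a child that never exits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithDeadline (hypothetical helper) starts the command and waits at most
// timeout for it to exit. On expiry it returns an error that wraps
// context.DeadlineExceeded.
func runWithDeadline(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext sends SIGKILL to the child once ctx expires.
	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start %s: %w", name, err)
	}

	// Wait in a separate goroutine: a child that does not react to SIGKILL
	// would otherwise keep cmd.Wait, and therefore the caller, blocked.
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		if ctx.Err() != nil {
			return fmt.Errorf("%s exceeded %s: %w", name, timeout, ctx.Err())
		}
		return err
	case <-ctx.Done():
		// The waiting goroutine stays parked until the child really exits.
		return fmt.Errorf("%s exceeded %s: %w", name, timeout, ctx.Err())
	}
}

func main() {
	// "sleep 5" is only a stand-in for a script that hangs.
	err := runWithDeadline(200*time.Millisecond, "sleep", "5")
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

A timeout like this only unblocks the caller; it does not remove a child process that the kernel refuses to kill, so stuck processes can still pile up on the node.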

Madhan-SWE commented on July 25, 2024

Commented out the call to RescanSCSIBus(), since its result is not used in the plugin.
When NodeStageVolume is called, it internally calls GetDevicePath to get the device path from the WWN.

GetDevicePath internally calls the Attach method from the fibrechannel library.

The following call chain triggered the issue:
NodeStageVolume -> mounter.GetDevicePath -> fibrechannel.Attach -> fibrechannel.searchDisk -> fibrechannel.scsiHostRescan

With RescanSCSIBus() commented out, scsiHostRescan gets stuck instead, and there are no logs after this method call.

scsiHostRescan lists the directories under /sys/class/scsi_host/ (host1, host2, host3, host4, and so on) and writes to the scan file in each one.
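
The rescan step described above can be sketched roughly as follows. This is an illustration, not the fibrechannel library's code: `rescanHosts` and `makeFakeSysfs` are hypothetical names, the root directory is a parameter so the sketch can run against a temporary tree instead of /sys/class/scsi_host/, and the `- - -` payload is the usual sysfs wildcard for "all channels, targets, and LUNs".

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rescanHosts writes the wildcard "- - -" to the scan file of every entry
// under root and returns the paths it wrote to.
func rescanHosts(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, fmt.Errorf("list %s: %w", root, err)
	}
	var written []string
	for _, e := range entries {
		scan := filepath.Join(root, e.Name(), "scan")
		// On a real sysfs tree this write returns only after the kernel has
		// finished scanning that host, so a stuck host blocks the caller.
		if err := os.WriteFile(scan, []byte("- - -"), 0o666); err != nil {
			return written, fmt.Errorf("write %s: %w", scan, err)
		}
		written = append(written, scan)
	}
	return written, nil
}

// makeFakeSysfs builds a temporary tree with two host directories, standing
// in for /sys/class/scsi_host/.
func makeFakeSysfs() (string, error) {
	root, err := os.MkdirTemp("", "scsi_host")
	if err != nil {
		return "", err
	}
	for _, h := range []string{"host0", "host1"} {
		if err := os.Mkdir(filepath.Join(root, h), 0o755); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := makeFakeSysfs()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(root)

	written, err := rescanHosts(root)
	fmt.Println("scan files written:", len(written), "error:", err)
}
```

Because each write is synchronous, one unresponsive host is enough to stall the whole loop, which matches the hang seen at host4/scan.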

The method got stuck while writing to host4/scan.
Couldn't write to the file manually because of permission issues.

Madhan-SWE commented on July 25, 2024

Since scsiHostRescan is failing, tried to format the newly created disk manually.
The format command also hung and did not return.

[root@madhan-1-kube-1-22-2 ~]# mkfs -t ext4 -F -m0 /dev/dm-26
mke2fs 1.45.6 (20-Mar-2020)

Madhan-SWE commented on July 25, 2024

Checking the format status also hangs.

[root@madhan-1-kube-1-22-2 powervs-csi-driver]# blkid -p -s TYPE -s PTTYPE -o export /dev/dm-26

Madhan-SWE commented on July 25, 2024

Around 886 host rescans were running in the background.

[root@madhan-1-kube-1-22-2 ~]# ps aux | grep rescan-scsi-bus.sh | wc -l
886

Couldn't force kill the rescan script.

[root@madhan-1-kube-1-22-2 ~]# ps -eaf  | grep rescan | grep 4191027
root     4191027 3631367  0 Nov18 ?        00:00:01 /bin/bash /usr/bin/rescan-scsi-bus.sh
[root@madhan-1-kube-1-22-2 ~]# kill -9 4191027
[root@madhan-1-kube-1-22-2 ~]# ps -eaf  | grep rescan | grep 4191027
root     4191027 3631367  0 Nov18 ?        00:00:01 /bin/bash /usr/bin/rescan-scsi-bus.sh
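
A process that survives `kill -9` is often in uninterruptible sleep (state `D`), although the thread does not confirm that this was the case here. As a hedged sketch, the state can be read from the `State:` line of /proc/<pid>/status; `procState` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procState extracts the one-letter state code (R, S, D, Z, ...) from the
// text of a /proc/<pid>/status file. The second result is false when no
// State: line is present.
func procState(status string) (string, bool) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "State:") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, "State:"))
		if len(fields) == 0 {
			return "", false
		}
		return fields[0], true
	}
	return "", false
}

func main() {
	// Inspect the current process; replace "self" with a PID to check
	// another process such as a stuck rescan script.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("read status:", err)
		return
	}
	state, ok := procState(string(data))
	fmt.Println("state:", state, "found:", ok)
}
```

A `D` state means the process is blocked inside the kernel, typically on I/O, and signals are not delivered until that call returns.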

mkumatag commented on July 25, 2024

[root@madhan-1-kube-1-22-2 ~]# ps -eaf | grep rescan | grep 4191027
root 4191027 3631367 0 Nov18 ? 00:00:01 /bin/bash /usr/bin/rescan-scsi-bus.sh

What child processes are running as part of this script? pstree may help.

Madhan-SWE commented on July 25, 2024

[root@madhan-1-kube-1-22-2 ~]# ps -eaf | grep rescan | grep 4191027
root 4191027 3631367 0 Nov18 ? 00:00:01 /bin/bash /usr/bin/rescan-scsi-bus.sh

What child processes are running as part of this script? pstree may help.

Restarted the host. Will check the child processes if the script is still running after the restart.

Madhan-SWE commented on July 25, 2024

Restarted the host, and all the background processes were killed.
/usr/bin/rescan-scsi-bus.sh now runs without any issues.

Redeployed the CSI driver.
The controller plugin always fails during Bluemix authentication with a timeout error.

Controller plugin logs:


[root@madhan-1-kube-1-22-2 powervs-csi-driver]# kubectl logs powervs-csi-controller-77ff978f87-5twfw -c powervs-plugin --follow
I1119 08:30:01.255610       1 driver.go:68] Driver: powervs.csi.ibm.com Version: v0.0.2
I1119 08:30:01.255669       1 controller.go:60] retrieving node info from metadata service
I1119 08:30:01.255683       1 metadata.go:27] retrieving instance data from kubernetes api
I1119 08:30:01.257671       1 metadata.go:32] kubernetes api is available
I1119 08:30:01.284241       1 controller.go:65] Metadata: &{cloudInstanceId:7845d372-d4e1-46b8-91fc-41051c984601 pvmInstanceId:9552c51d-5916-4ce5-a061-1e8bd7315ca8}
I1119 08:30:01.285915       1 controller.go:66] Cloud instance id: 7845d372-d4e1-46b8-91fc-41051c984601
I1119 08:30:01.286090       1 controller.go:68] apikey : ===========================================
I1119 08:30:01.286107       1 powervs.go:128] API Key: ===========================================
I1119 08:30:01.286124       1 powervs.go:130] session ERROR: <nil>, bxSess &{Config:0xc00062a0e0}
I1119 08:30:31.287004       1 powervs.go:136] Authentication ERROR: Post "https://iam.cloud.ibm.com/identity/token": dial tcp: i/o timeout
panic: Post "https://iam.cloud.ibm.com/identity/token": dial tcp: i/o timeout

goroutine 1 [running]:
github.com/ppc64le-cloud/powervs-csi-driver/pkg/driver.newControllerService(0xc0000a0230)
	/root/e2etest/powervs-csi-driver/pkg/driver/controller.go:71 +0x55c
github.com/ppc64le-cloud/powervs-csi-driver/pkg/driver.NewDriver({0xc00056ff38, 0x4, 0x4})
	/root/e2etest/powervs-csi-driver/pkg/driver/driver.go:92 +0x2b8
main.main()
	/root/e2etest/powervs-csi-driver/cmd/main.go:33 +0x1a0

Error snippet:

        err = authenticateAPIKey(bxSess)
        klog.V(4).Infof("Authentication ERROR: %+v", err)
        if err != nil {
                return nil, err
        }
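
Note that the snippet logs "Authentication ERROR" before checking `err`, so the line is printed even when authentication succeeds (compare `session ERROR: <nil>` in the logs above). A minimal sketch of a guarded variant follows; `authenticateSession` and `errDialTimeout` are hypothetical names, the standard `log` package stands in for klog, and the callback stands in for `authenticateAPIKey(bxSess)`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDialTimeout is a stand-in for the timeout seen in the controller logs.
var errDialTimeout = errors.New("dial tcp: i/o timeout")

// authenticateSession runs the supplied authentication step and logs only on
// failure, wrapping the error so callers can still inspect the cause.
func authenticateSession(authenticate func() error) error {
	if err := authenticate(); err != nil {
		log.Printf("authentication failed: %v", err)
		return fmt.Errorf("authenticate API key: %w", err)
	}
	return nil
}

func main() {
	err := authenticateSession(func() error { return errDialTimeout })
	fmt.Println("wraps timeout:", errors.Is(err, errDialTimeout))
}
```

Wrapping with `%w` keeps the original cause reachable through `errors.Is`, which helps when deciding whether a failure is worth retrying rather than panicking.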

The node plugin runs the same snippet on the same node without any errors.
The same version of the controller plugin runs without any issues in the new cluster.

Madhan-SWE commented on July 25, 2024

The powervs-plugin container in the controller pod couldn't connect to the internet.
Tried to install ping and ping Google,
but apt-get update itself fails because of name resolution errors.

[root@madhan-1-kube-1-22-2 powervs-csi-driver]# kubectl exec -it  powervs-csi-controller-77ff978f87-6h6x7  -c powervs-plugin /bin/sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
# apt-get update
0% [Working]

0% [Connecting to deb.debian.org] [Connecting to security.debian.org]
0% [Connecting to deb.debian.org] [Connecting to security.debian.org]
0% [Connecting to deb.debian.org] [Connecting to security.debian.org]

Err:1 http://deb.debian.org/debian buster InRelease                  
  Temporary failure resolving 'deb.debian.org'
Err:2 http://security.debian.org/debian-security buster/updates InRelease
  Temporary failure resolving 'security.debian.org'
  
0% [Connecting to deb.debian.org]
0% [Connecting to deb.debian.org]

Err:3 http://deb.debian.org/debian buster-updates InRelease
  Temporary failure resolving 'deb.debian.org'
Reading package lists... Done    
W: Failed to fetch http://deb.debian.org/debian/dists/buster/InRelease  Temporary failure resolving 'deb.debian.org'
W: Failed to fetch http://security.debian.org/debian-security/dists/buster/updates/InRelease  Temporary failure resolving 'security.debian.org'
W: Failed to fetch http://deb.debian.org/debian/dists/buster-updates/InRelease  Temporary failure resolving 'deb.debian.org'
W: Some index files failed to download. They have been ignored, or old ones used instead.
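
The apt-get output shows name resolution failures, while the controller log earlier shows `dial tcp: i/o timeout`. A rough way to tell these apart in Go is to classify the returned error. This is only a sketch: `classify`, `probe`, and `sampleDNSErr` are hypothetical names, and the sample error is constructed by hand rather than captured from the cluster.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// sampleDNSErr mimics a dial error caused by a failed name lookup.
var sampleDNSErr error = &net.OpError{
	Op:  "dial",
	Net: "tcp",
	Err: &net.DNSError{
		Err:         "temporary failure in name resolution",
		Name:        "deb.debian.org",
		IsTemporary: true,
	},
}

// classify sorts a network error into "ok", "dns", "timeout", or "other".
func classify(err error) string {
	if err == nil {
		return "ok"
	}
	// Check for DNS first: lookup failures arrive wrapped in *net.OpError.
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return "dns"
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return "timeout"
	}
	return "other"
}

// probe dials addr (for example "iam.cloud.ibm.com:443") and classifies the
// outcome. It is not called from main so the sketch stays offline.
func probe(addr string, timeout time.Duration) string {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err == nil {
		conn.Close()
	}
	return classify(err)
}

func main() {
	fmt.Println(classify(sampleDNSErr))
	fmt.Println(classify(errors.New("unrelated failure")))
}
```

Running `probe` once against a hostname and once against a raw IP address from inside the pod would indicate whether only DNS is broken or outbound traffic is blocked altogether.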

Madhan-SWE commented on July 25, 2024

A controller plugin using the same image runs without any errors in another cluster.

Madhan-SWE commented on July 25, 2024

@mkumatag, any suggestions for debugging this further?

mkumatag commented on July 25, 2024

@mkumatag, any suggestions for debugging this further?

This could be a generic issue with the deployed Calico/DNS pods. Can you check whether you can reach the outside network with the ping command? Worst case, you may just need to restart the Calico/CoreDNS pods.

Madhan-SWE commented on July 25, 2024

Modified the plugin image to install iputils-ping.
Tried pinging 8.8.8.8 from the container; as expected, the ping fails.

[root@madhan-1-kube-1-22-2 powervs-csi-driver]# kubectl exec powervs-csi-controller-77ff978f87-ngzfs -c powervs-plugin -- ping 8.8.8.8

command terminated with exit code 137
[root@madhan-1-kube-1-22-2 powervs-csi-driver]#

Planning to restart calico and coredns pods.

Madhan-SWE commented on July 25, 2024

Calico has some network issues in the latest version.
Applying the fix below, provided by @bkhadars, solved the issue.

systemctl start docker

iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT

# Flush all rules in the filter table
iptables -F

# Delete all user-defined chains
iptables -X

# Zero all packet and byte counters
iptables -Z
# Flush and delete the chains in the nat, mangle, and raw tables
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -t raw -F
iptables -t raw -X

systemctl restart docker

Madhan-SWE commented on July 25, 2024

Ran the e2e test cases to reproduce and analyse the issues.
There were no issues for 30 minutes. Then the cluster started showing the format issue: #14

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Madhan-SWE commented on July 25, 2024

/remove-lifecycle rotten

Madhan-SWE commented on July 25, 2024

/reopen

Madhan-SWE commented on July 25, 2024

/ptal

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Madhan-SWE commented on July 25, 2024

/remove-lifecycle stale

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Madhan-SWE commented on July 25, 2024

/remove-lifecycle stale

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Madhan-SWE commented on July 25, 2024

/remove-lifecycle rotten

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented on July 25, 2024

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
