Research scalability testing for CSI drivers and add scale tests for the PowerVS CSI driver.

Add scale tests (kubernetes-sigs/ibm-powervs-block-csi-driver)

Comments (30)

Madhan-SWE commented on July 25, 2024

Scalability Notes:

Scalability:

1. Horizontal scaling - controller plugin (Deployment)
   - Need to enable autoscaling for the controller pod when resource usage exceeds 80%
   - Need to find how many volume requests a single controller replica can handle before usage reaches 80%
   - Need to find how many replicas the plugin can run without errors

2. Vertical scaling - node plugin (DaemonSet)
   - Since it is a DaemonSet (one pod per node), vertical scaling is the default
   - Need to find how many volume requests per node the plugin can handle

Metrics for testing:

Time of provisioning: The amount of time it takes to provision and attach 1000 volumes
   - average of all volume times
   - median of times
   - need to visualize the metrics
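
A minimal sketch of the summary statistics listed above, assuming the per-volume provision-and-attach durations have already been collected in seconds. The function name `summarize` and the sample values are hypothetical, not measurements from this driver.

```python
import statistics

def summarize(durations_s):
    """Return count, mean, median and max of per-volume provisioning times (seconds)."""
    if not durations_s:
        raise ValueError("no durations recorded")
    return {
        "count": len(durations_s),
        "mean_s": statistics.mean(durations_s),
        "median_s": statistics.median(durations_s),
        "max_s": max(durations_s),
    }

# Hypothetical sample: seconds from PVC creation until the pod reports Running.
sample = [12.0, 15.0, 11.0, 40.0, 14.0]
summary = summarize(sample)
print(summary)
```

Reporting the median next to the mean helps here because a single slow volume (40s in the sample) pulls the mean up while the median stays close to the typical case.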

Micro-benchmarking: Need to benchmark every operation of the plugin, e.g. ControllerPublishVolume, CreateVolume, NodePublishVolume, NodeStageVolume
   - can be saved for later

Benchmarking for Kubernetes:
   - If the plugin just listens to kubelet logs, then it is not expected to create any issues
   - If Kubernetes services make requests during volume operations, we should also consider how many requests the affected service can handle
   - Even if we set Kubernetes aside, creating 100 pods will definitely affect the etcd service

Network metrics latency testing:
 - latency testing: since the calls are gRPC calls, need to see how much time the plugin takes to respond to each request
 - no need to measure I/O and throughput, as these should be tested as part of the PowerVS cloud platform
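
One way to collect the per-request latencies mentioned above is to wrap each call in a timer and report percentiles. This is a sketch only: `fake_create_volume` is a hypothetical stand-in for a real gRPC call to the plugin, and the helper names are illustrative.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Latencies in seconds, keyed by operation name.
latencies = defaultdict(list)

@contextmanager
def timed(operation):
    """Record the wall-clock duration of the enclosed call under `operation`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latencies[operation].append(time.perf_counter() - start)

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def fake_create_volume():
    # Hypothetical stand-in for a real gRPC request to the plugin.
    time.sleep(0.01)

for _ in range(5):
    with timed("CreateVolume"):
        fake_create_volume()

print(len(latencies["CreateVolume"]), percentile(latencies["CreateVolume"], 95))
```

The same `timed` wrapper could be reused for the micro-benchmarks of each operation, since it only needs an operation name.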

PowerVS cloud volume testing:
 - No need to benchmark the PowerVS cloud itself (e.g. how long the cloud takes to create disks), as the goal is to test the CSI driver



Tool for testing: kube-burner
  - https://github.com/cloud-bulldozer/kube-burner
  - Kube-burner is a tool aimed at stressing Kubernetes clusters by creating or deleting a high quantity of objects
  - can create the required number of pods and PVCs

We can also use OpenShift4-tools for benchmarking: https://github.com/RobertKrawitz/OpenShift4-tools

from ibm-powervs-block-csi-driver.

Madhan-SWE commented on July 25, 2024

/assign

Madhan-SWE commented on July 25, 2024

Need to use an existing tool for benchmarking.
Need to focus on the below operations:

Time of provisioning: The amount of time it takes to provision and attach 100 volumes
   - average of all volume times
   - median of times
   - need to visualize the metrics

Madhan-SWE commented on July 25, 2024

Create workloads using kube-burner with the maximum disk size (100GB) and see how the system behaves.

mkumatag commented on July 25, 2024

> Create workloads using kube-burner with the maximum disk size (100GB) and see how the system behaves.

Check the maximum volume size supported by PowerVS in the IBM Cloud PowerVS docs, then try creating a volume of that size and time it.

Madhan-SWE commented on July 25, 2024

The maximum supported volume size is not mentioned in the docs.
However, the UI shows a maximum size of 2TB.

Madhan-SWE commented on July 25, 2024

2000GB is the maximum supported Disk size. Confirmed by Bobby.

Madhan-SWE commented on July 25, 2024

Used the command mkfs -t xfs /dev/dev-name to format the disk.
The table below shows the run times:

Disk Size	Run time
500GB	1m9s
1000GB	2m26s
1500GB	3m40s
2000GB	4m17s

Increasing the disk size definitely increases the time needed to stage and publish the volume,
which in turn increases the time before the workloads are running.
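
To make the trend in the table easier to compare, the run times can be converted to seconds and normalized per 100GB. The sketch below only uses the four measurements from the table; the helper `parse_duration` is illustrative.

```python
import re

def parse_duration(text):
    """Convert a duration such as '1m9s' or '45s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not text:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds

# Disk size in GB -> mkfs run time, copied from the table above.
mkfs_runs = {500: "1m9s", 1000: "2m26s", 1500: "3m40s", 2000: "4m17s"}

for size_gb, run_time in mkfs_runs.items():
    seconds = parse_duration(run_time)
    print(f"{size_gb}GB: {seconds}s, {seconds / size_gb * 100:.1f}s per 100GB")
```

With these four data points the cost stays at roughly 13 to 15 seconds per 100GB, so the mkfs time grows approximately linearly with disk size.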

The table above shows the run time for mkfs.
@mkumatag, do you think we should track the run time in the plugin for the operations below?

Controller Plugin:

CreateVolume

Node Plugin:

blkid
mkfs
mount

mkumatag commented on July 25, 2024

> Increasing the disk size definitely increases the time needed to stage and publish the volume,
> which in turn increases the time before the workloads are running.

What's the time it takes to create the volume?

mkumatag commented on July 25, 2024

And will this time impact the timeout value here - #32 (comment)

Madhan-SWE commented on July 25, 2024

Tried to attach a 2000GB disk to a pod using kube-burner (number of workloads = 1).
Create is very fast (less than 5s) and the attach took 20s.

Madhan-SWE commented on July 25, 2024

> And will this time impact the timeout value here - #32 (comment)

We've given a 100s timeout to csi-provisioner. We don't really need this long a timeout.

The current temporary wait loop in the code usually does not reach a 2nd iteration (unless more disks are being created), and each iteration takes 5s.

This means that in most cases the created disk is available within 5 to 10s.
This is not conclusive yet; we still need to create 100+ disks at a time and see how the system behaves.
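
For illustration, a wait loop of the kind described above (fixed polling interval, overall timeout) can be sketched as follows. This is not the driver's actual code: the function name, the `is_available` callback and the injectable `sleep` are assumptions, and only the 5s interval and 100s timeout are taken from the comment.

```python
import time

def wait_until_available(is_available, interval_s=5.0, timeout_s=100.0, sleep=time.sleep):
    """Poll `is_available()` every `interval_s` seconds until it returns True.

    Returns the number of polls used; raises TimeoutError once another
    full interval would exceed `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if is_available():
            return polls
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"volume not available after {timeout_s}s")
        sleep(interval_s)

# Hypothetical check that succeeds on the second poll; sleep is stubbed out
# so the example finishes immediately.
responses = iter([False, True])
polls = wait_until_available(lambda: next(responses), sleep=lambda s: None)
print(polls)
```

Counting the polls per volume during a 100+ disk run would show directly how often the loop goes beyond its first or second iteration, which is the data needed to decide on a smaller timeout.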

Madhan-SWE commented on July 25, 2024

20 volumes/iteration of 10GB each took 7 min/iteration; successfully tested with 80 volumes.
30 volumes/iteration of 1GB each took 7 min/iteration; successfully tested with 150 volumes.
60 volumes/iteration of 1GB each took 60+ min/iteration; successfully tested with 120 volumes. Formatting and mounting 60 volumes at the same time takes a long time on the nodes.

mkumatag commented on July 25, 2024

> 20 volumes/iteration of 10GB each took 7 min/iteration; successfully tested with 80 volumes.
> 30 volumes/iteration of 1GB each took 7 min/iteration; successfully tested with 150 volumes.
> 60 volumes/iteration of 1GB each took 60+ min/iteration; successfully tested with 120 volumes. Formatting and mounting 60 volumes at the same time takes a long time on the nodes.

Please mention the number of workers we are trying with.

Madhan-SWE commented on July 25, 2024

> > 20 volumes/iteration of 10GB each took 7 min/iteration; successfully tested with 80 volumes.
> > 30 volumes/iteration of 1GB each took 7 min/iteration; successfully tested with 150 volumes.
> > 60 volumes/iteration of 1GB each took 60+ min/iteration; successfully tested with 120 volumes. Formatting and mounting 60 volumes at the same time takes a long time on the nodes.
>
> Please mention the number of workers we are trying with.

The number of worker nodes used in this test is 3.

Need to use the taint feature and schedule volumes on a single node of the cluster using kube-burner.

Madhan-SWE commented on July 25, 2024

Used a taint to schedule all the workloads on a single node.
Number of worker nodes: 3

No of workloads/iteration	Run time/iteration	No of iterations
10	~3m	5
20	~3m	5
30	~7m	4
50	~10m	2
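
The table above can be turned into an approximate per-minute rate to compare the iteration sizes. The sketch uses only the table values; since the run times are approximate, the rates are rough as well.

```python
# (workloads per iteration, approximate minutes per iteration) from the table above.
runs = [(10, 3), (20, 3), (30, 7), (50, 10)]

def workloads_per_minute(workloads, minutes):
    """Average number of workloads completed per minute within one iteration."""
    return workloads / minutes

rates = {w: round(workloads_per_minute(w, m), 2) for w, m in runs}
for workloads, rate in rates.items():
    print(f"{workloads} workloads/iteration: ~{rate} workloads/min")
```

On these numbers the single node is most efficient at 20 workloads per iteration (about 6.7 per minute) and slows down for the larger batches.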

Tried to attach 127+ volumes to a single node (127 is the maximum) by running 20 workloads/iteration.
On the 6th iteration, pods were stuck in the Pending state.
Only 110 pods could be scheduled on the node.

[root@madhan-multinode-kubernetes-1 ~]# kubectl get pods --all-namespaces -o wide | grep madhan-multinode-kubernetes-2 | wc -l
110

Pods were in Pending state as there were too many pods on the node.

[root@madhan-multinode-kubernetes-1 ~]# kubectl describe pod app-6-8 -n kube-system-test
Name:         app-6-8
Namespace:    kube-system-test
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  25m                 default-scheduler  0/4 nodes are available: 4 persistentvolumeclaim "powervs-claim-6-8" not found.
  Warning  FailedScheduling  53s (x23 over 25m)  default-scheduler  0/4 nodes are available: 1 Too many pods, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 node(s) had taint {key1: value1}, that the pod didn't tolerate.

Couldn't test attaching 127 volumes to the node because Kubernetes does not allow more than 110 pods per node by default.

Solution
Need to modify the template so that each pod has 2 volumes, then retry with kube-burner.
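
The arithmetic behind the 2-volumes-per-pod choice can be checked with a short sketch. The 127-volume and 110-pod limits come from the comment above; the count of 10 pods already running on the node is a hypothetical figure, and the function name is illustrative.

```python
import math

def min_volumes_per_pod(target_volumes, max_pods_per_node, other_pods_on_node=0):
    """Smallest volumes-per-pod count that fits `target_volumes` within the pod limit."""
    free_slots = max_pods_per_node - other_pods_on_node
    if free_slots <= 0:
        raise ValueError("no pod slots left on the node")
    return math.ceil(target_volumes / free_slots)

# 127 volumes and 110 pods per node are taken from the comment above;
# 10 pre-existing pods on the node is a hypothetical figure.
per_pod = min_volumes_per_pod(127, 110, other_pods_on_node=10)
pods_needed = math.ceil(127 / per_pod)
print(per_pod, pods_needed)
```

With 2 volumes per pod, 64 pods are enough to reach 127 volumes, which leaves room under the 110-pod limit for the system pods already on the node.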

Madhan-SWE commented on July 25, 2024

Any format that takes more than 1 minute is not acceptable. Check with the Power team.

Madhan-SWE commented on July 25, 2024

Check the audit logs and control plane logs from the storage side. (May need to ask the Power storage team for the logs for the expected timeframe.)

k8s-triage-robot commented on July 25, 2024

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented on July 25, 2024

/lifecycle rotten

k8s-triage-robot commented on July 25, 2024

/close

k8s-ci-robot commented on July 25, 2024

@k8s-triage-robot: Closing this issue.

Madhan-SWE commented on July 25, 2024

/reopen

k8s-ci-robot commented on July 25, 2024

@Madhan-SWE: Reopened this issue.

Madhan-SWE commented on July 25, 2024

/remove-lifecycle rotten

k8s-triage-robot commented on July 25, 2024

/lifecycle stale

Madhan-SWE commented on July 25, 2024

/remove-lifecycle stale

k8s-triage-robot commented on July 25, 2024

/lifecycle stale

Madhan-SWE commented on July 25, 2024

Scale tests have been added to the repo: https://github.com/kubernetes-sigs/ibm-powervs-block-csi-driver/tree/main/tests/scale/kube-burner
This issue can be closed.
A new issue can be opened in the future to re-test and document new results.

Madhan-SWE commented on July 25, 2024

/remove-lifecycle stale
