
devops-nirvana / kubernetes-volume-autoscaler

251 stars · 4 watchers · 28 forks · 426 KB

Autoscaling volumes for Kubernetes (with the help of Prometheus)

License: Apache License 2.0

Languages: Python 94.40%, Shell 2.92%, Makefile 1.98%, Dockerfile 0.70%
Topics: devops, kubernetes, devops-nirvana, autoscaling, persistentvolumeclaim, automation, persistentvolume, ebs-volumes, aws, cloud-native

kubernetes-volume-autoscaler's People

Contributors

24601, andrewfarley, anthea-w, martin31821, pblgomez, piec, sindvero


kubernetes-volume-autoscaler's Issues

Customer-reported issue: updated/resized max size is not detected

There appears to be a bug in Prometheus Server which causes kubelet_volume_stats_capacity_bytes to not be updated properly after a resize. Note: we may need to file a bug against metrics-server or Prometheus.

After further investigation, it appears that the kube_persistentvolume_capacity_bytes metric, which is tied to the PV and not the PVC, is updated correctly; in theory we could read the updated value from there instead, but I believe the stale PVC metric is a bug that should be fixed in Prometheus.
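As a quick way to see the discrepancy, the two capacity metrics can be compared in the Prometheus UI after a resize (a sketch; the PVC/PV label values are placeholders, and kube_persistentvolume_capacity_bytes comes from kube-state-metrics):

    # Per-PVC capacity reported by the kubelet (the one observed to go stale):
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="my-pvc"}
    # Per-PV capacity from kube-state-metrics (observed to update correctly):
    kube_persistentvolume_capacity_bytes{persistentvolume="pvc-<uuid>"}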

[Screenshot: 2022-03-07 at 10:00 AM]

Random feature ideas to consider (see here if you wish to contribute)

  • Listen/watch for events on the PV/PVC, or read from Prometheus, to verify that the resize actually happens; log and/or Slack the result accordingly
  • Catch WHY a resize failed and make sure to log the reason / send it to Slack / emit it as a Kubernetes event
  • Check whether the StorageClass has allowVolumeExpansion set, to help ensure the expansion will succeed; act on it and/or report it (see the sketch after this list)
  • Add a full Helm chart values documentation table in markdown (tie into adding docs for Universal Helm Charts)
  • Build and publish the Helm chart in a GitHub Action, and push the static YAML as well, to make things easier and automated
  • Add badges to the README for GitHub Actions success/failure
  • Add test coverage to ensure the software keeps working as intended going forward
  • Add per-PVC annotations to (re)direct Slack notifications to different webhooks and/or channels and/or message prefixes/suffixes
  • Add better examples of best practices when using Helm (e.g. as a subchart)
  • Test and add working examples of using this on other cloud providers (Azure / Google Cloud)
  • Auto-detect (or let the user choose) the provider (e.g. AWS/Google) and set per-provider defaults (e.g. wait time, min/max disk size, minimum disk increment, etc.)
  • Digest all changes made in a single loop into one big message to prevent Slack spam
  • Make it possible to autoscale nodes' root volumes? This would require significant engineering, as it means talking to the provider APIs and needing an IAM role/OIDC/access key to scale the root volume up. Likely very challenging, but would be ideal to handle. Consider writing a dedicated new service for this instead.
  • Add a filter letting users specify which StorageClasses to support, defaulting to all ("*")
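A minimal sketch of the StorageClass check mentioned above, using only standard kubectl:

    # Print allowVolumeExpansion for every StorageClass; volumes on classes
    # where this is not "true" cannot be expanded in place.
    kubectl get storageclass -o custom-columns=NAME:.metadata.name,EXPANDABLE:.allowVolumeExpansion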

NOTE: If anyone wants us to prioritize any of these issues, please say something.

Trigger on inodes count

Hi,

Sometimes a volume is full because there are no free inodes left, while the capacity usage is still low. This happens when you have lots of small files, for example.
The feature request is to add a new resize condition based on the free/total inode ratio (based on the kubelet_volume_stats_inodes* metrics, for example).
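A minimal PromQL sketch of such a condition, reusing the standard kubelet inode metrics and the project's default 80% threshold:

    # Fire when more than 80% of a volume's inodes are in use:
    kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes > 0.80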

Cheers,
Pierre

Can't use multiplier suffix (Gi, Ti) on `SCALE_UP_MAX_SIZE`

Is your feature request related to a problem? Please describe.

Tried to use scale_up_max_size: 300Gi in the chart's values.yaml, but it didn't work:

2023-11-23T11:48:38.529068781Z Traceback (most recent call last):
2023-11-23T11:48:38.529130552Z   File "/app/./main.py", line 4, in <module>
2023-11-23T11:48:38.529156651Z     from helpers import INTERVAL_TIME, PROMETHEUS_URL, DRY_RUN, VERBOSE, get_settings_for_prometheus_metrics, is_integer_or_float, print_human_readable_volume_dict
2023-11-23T11:48:38.529169656Z   File "/app/helpers.py", line 34, in <module>
2023-11-23T11:48:38.529186090Z     SCALE_UP_MAX_SIZE = int(getenv('SCALE_UP_MAX_SIZE') or 16000000000000)           # How many bytes is the maximum disk size that we can resize up, default is 16TB for EBS volumes in AWS (in bytes, so 16000000000000)
2023-11-23T11:48:38.529293525Z                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-11-23T11:48:38.529335419Z ValueError: invalid literal for int() with base 10: '300Gi'

Describe the solution you'd like

Be able to use suffixes, as already supported by convert_storage_to_bytes() in helpers.py:

def convert_storage_to_bytes(storage):
    # BinarySI == Ki | Mi | Gi | Ti | Pi | Ei
    if storage.endswith('Ki'):
        return int(storage.replace("Ki","")) * 1024
    if storage.endswith('Mi'):
        return int(storage.replace("Mi","")) * 1024 * 1024
    if storage.endswith('Gi'):
        return int(storage.replace("Gi","")) * 1024 * 1024 * 1024
    if storage.endswith('Ti'):
        return int(storage.replace("Ti","")) * 1024 * 1024 * 1024 * 1024
    if storage.endswith('Pi'):
        return int(storage.replace("Pi","")) * 1024 * 1024 * 1024 * 1024 * 1024
    if storage.endswith('Ei'):
        return int(storage.replace("Ei","")) * 1024 * 1024 * 1024 * 1024 * 1024 * 1024
    # decimalSI == m | k | M | G | T | P | E | "" (this last one is the fallthrough at the end)
    if storage.endswith('k'):
        return int(storage.replace("k","")) * 1000
    if storage.endswith('K'):
        return int(storage.replace("K","")) * 1000
    if storage.endswith('m'):
        return int(storage.replace("m","")) * 1000 * 1000
    if storage.endswith('M'):
        return int(storage.replace("M","")) * 1000 * 1000
    if storage.endswith('G'):
        return int(storage.replace("G","")) * 1000 * 1000 * 1000
    if storage.endswith('T'):
        return int(storage.replace("T","")) * 1000 * 1000 * 1000 * 1000
    if storage.endswith('P'):
        return int(storage.replace("P","")) * 1000 * 1000 * 1000 * 1000 * 1000
    if storage.endswith('E'):
        return int(storage.replace("E","")) * 1000 * 1000 * 1000 * 1000 * 1000 * 1000
    # decimalExponent == e | E (in the middle of two integers)
    lowercaseDecimalExponent = storage.split('e')
    uppercaseDecimalExponent = storage.split('E')
    if len(lowercaseDecimalExponent) > 1 or len(uppercaseDecimalExponent) > 1:
        return int(float(str(format(float(storage)))))
    # If none of the above match, it should just be an integer value (in bytes)
    return int(storage)
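A minimal sketch of a possible fix, assuming helpers.py routes the env var through the function above instead of a bare int() cast (a hypothetical one-liner, not the actual patch):

    # Hypothetical change in helpers.py: parse suffixes like '300Gi' too.
    SCALE_UP_MAX_SIZE = convert_storage_to_bytes(getenv('SCALE_UP_MAX_SIZE') or '16000000000000')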

Grafana Dashboard

Is there a pre-made Grafana dashboard (or something your team uses) that I can use to visualize the Volume Autoscaler metrics? If there is one, could you please share the dashboard JSON file?

Autoscaling to a size below the current size, and PVC size not human-readable

Sometimes the autoscaler tries to resize a PVC to a size below its current size, raising an error:

Volume infra.data-nfs-server-provisioner-1637948923-0 is 85% in-use of the 80Gi available
  BECAUSE it is above 80% used
  ALERT has been for 1306 period(s) which needs to at least 5 period(s) to scale
  AND we need to scale it immediately, it has never been scaled previously
  RESIZING disk from 86G to 20G
  Exception raised while trying to scale up PVC infra.data-nfs-server-provisioner-1637948923-0 to 20000000000 ...
(422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e69b53c3-d332-4925-b9ea-afa7570297a9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'b64e47c9-2a4e-48ae-83bc-355685b6c007', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e5841496-62d0-426a-a987-4b26ec143a20', 'Date': 'Sat, 22 Oct 2022 16:58:07 GMT', 'Content-Length': '520'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PersistentVolumeClaim \"data-nfs-server-provisioner-1637948923-0\" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value","reason":"Invalid","details":{"name":"data-nfs-server-provisioner-1637948923-0","kind":"PersistentVolumeClaim","causes":[{"reason":"FieldValueForbidden","message":"Forbidden: field can not be less than previous value","field":"spec.resources.requests.storage"}]},"code":422}


FAILED requesting to scale up `infra.data-nfs-server-provisioner-1637948923-0` by `10%` from `86G` to `20G`, it was using more than `80%` disk space over the last `78360 seconds`

I'm using the helm chart version 1.0.3 (same image tag)

Another issue: the autoscaler resized another PVC from 13Gi to 14173392076, which is not human-readable like before. It's not a serious issue, but it's still confusing. The autoscaler also sent the Slack alert for this PVC twice, several hours apart.

kubelet_volume_stats_available_bytes metric is not available in prometheus

Describe the bug
The volume-autoscaler controller doesn't detect any PVC, even though I have a test pod running with a volume attached.
After checking the available Prometheus metrics, kubelet_volume_stats_available_bytes and kubelet_volume_stats_capacity_bytes are not exported with the default Prometheus configuration.
I have checked the code, and these two metrics are needed by the fetch_pvcs_from_prometheus function in the helpers.py file.
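A quick check to confirm the metric is absent (a sketch against the same in-cluster Prometheus URL as step 2 below; run it from a pod that can reach the service):

    # An empty "result" list means the kubelet volume metrics are not being scraped:
    curl -s 'http://prometheus-server:80/api/v1/query?query=kubelet_volume_stats_available_bytes'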

To Reproduce
Steps to reproduce the behavior

  1. Install prometheus: https://artifacthub.io/packages/helm/prometheus-community/prometheus
  2. Install volume-autoscaler: helm upgrade --install volume-autoscaler devops-nirvana/volume-autoscaler --set "prometheus_url=http://prometheus-server:80"
  3. Create test pod: kubectl apply -f https://raw.githubusercontent.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/master/examples/simple-pod-with-pvc.yaml
  4. The volume-autoscaler controller doesn't detect any PVC.

Expected behavior
Volume-autoscaler should retrieve metrics that are available in Prometheus; if that's not possible, explain how to configure the required additional metrics in Prometheus.

Extra Information Requested

  • Kubernetes Version: 1.26
  • Prometheus Version: v2.47.0

All volume expansion alerts arrive at the same timestamp in Slack. Why?

Hello,

I am trying the autoscaler, and the solution is working as expected: the controller successfully increases the volume per the settings below. However, I am receiving the related alerts at the same timestamp. It seems that when the volume is finally increased, all the previous warnings arrive at once (for reference, 2 alerts arrived at 1:30 PM in the snapshot below). The snapshot shows 2 alerts, but the same behaviour occurred on subsequent occasions.

Is this a bug or the expected behaviour of the controller? If any settings should be changed, please advise.

[Screenshot: two Slack alerts arriving at 1:30 PM]

Settings:

    - name: PROMETHEUS_URL
      value: <PROMETHEUS_URL>
    - name: SLACK_WEBHOOK_URL
      value: <SLACK_WEBHOOK_URL>
    - name: SLACK_CHANNEL
      value: <SLACK_CHANNEL>
    - name: INTERVAL_TIME
      value: "60"
    - name: SCALE_AFTER_INTERVALS
      value: "5"
    - name: SCALE_ABOVE_PERCENT
      value: "80"
    - name: SCALE_UP_PERCENT
      value: "20"
    - name: SCALE_UP_MIN_INCREMENT
      value: "1000000000"
    - name: SCALE_UP_MAX_INCREMENT
      value: "16000000000000"
    - name: SCALE_UP_MAX_SIZE
      value: "16000000000000"
    - name: SCALE_COOLDOWN_TIME
      value: "22200"
    - name: DRY_RUN
    - name: PROMETHEUS_LABEL_MATCH
    - name: HTTP_TIMEOUT
      value: "15"
    - name: VERBOSE
      value: "false"
    - name: VICTORIAMETRICS_MODE
      value: "false"

Add support for custom headers in calls to Prometheus API

Is your feature request related to a problem? Please describe.

When using Prometheus in agent mode, Volume Autoscaler fails with the following error:

Prometheus query failed with code: error
Prometheus Error: unavailable with Prometheus Agent

When Mimir or Cortex is used as the Prometheus datastore, the X-Scope-OrgID header must be added to API requests.

Currently, Volume Autoscaler cannot be used to query Mimir running in multi-tenant mode.

Describe the solution you'd like

I'd like the option to provide the X-Scope-OrgID header. Ideally we could provide a map of arbitrary headers in the Helm chart's values.yaml, but any way to set X-Scope-OrgID would meet the current need.
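A minimal sketch of what the header injection could look like on the autoscaler's Prometheus calls, assuming a hypothetical PROMETHEUS_HEADERS env var (neither the variable name nor its JSON format is an existing option):

    import json
    from os import getenv
    import requests

    # Hypothetical: e.g. PROMETHEUS_HEADERS='{"X-Scope-OrgID": "tenant-1"}'
    PROMETHEUS_HEADERS = json.loads(getenv('PROMETHEUS_HEADERS') or '{}')

    response = requests.get(
        "http://prometheus-server:80/api/v1/query",
        params={'query': 'kubelet_volume_stats_available_bytes'},
        headers=PROMETHEUS_HEADERS,  # forwarded on every Prometheus API call
        timeout=15,
    )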

Describe alternatives you've considered

The alternatives are to...

  • Run Prometheus in non-agent mode. This has its own drawbacks
  • Run Mimir in single-tenant mode. This isn't always possible
  • Add some sort of translation layer/reverse proxy to inject the X-Scope-OrgID header

Customize Slack message

Is your feature request related to a problem? Please describe.
It would be nice to be able to customize the message sent to Slack. It currently lacks, e.g., the Kubernetes cluster name, so with multiple clusters running similar apps you can't tell where the scale-up event happened. Or, at least for now, optionally add the cluster name via values.yaml.

Exception while trying to describe all PVCs

Describe the bug
Checking the app (which was working before), it now throws an error:

TypeError: argument of type 'NoneType' is not iterable

To Reproduce
Steps to reproduce the behavior

  1. Step 1
  2. Step 2
  3. Step 3

Expected behavior
To be able to describe pvc

Extra Information Requested

  • Kubernetes Version: 1.24
  • Prometheus Version: kube-prometheus-stack v45.18.0
  • Enable "Verbose" mode in helm chart, and copy/paste the values printed therein
-------------------------------------------------------------------------------------------------------------
               Volume Autoscaler - Configuration
-------------------------------------------------------------------------------------------------------------
             Prometheus URL: http://kube-prometheus-stack-prometheus:9090
         Prometheus Version: 2.42.0
          Prometheus Labels: {}
    Interval to query usage: every 60 seconds
             Scale up after: 5 intervals (300 seconds total)
     Scale above percentage: disk is over 80% full
 Scale up minimum increment: 1000000000 bytes, or 1G
 Scale up maximum increment: 16000000000000 bytes, or 16T
      Scale up maximum size: 16000000000000 bytes, or 16T
        Scale up percentage: 20% of current disk size
          Scale up cooldown: only resize every 22200 seconds
               Verbose Mode: is ENABLED
                    Dry Run: is Disabled
 HTTP Timeouts for k8s/prom: is 15 seconds
-------------------------------------------------------------------------------------------------------------
Exception while trying to describe all PVCs
Traceback (most recent call last):
  File "/app/./main.py", line 64, in <module>
    pvcs_in_kubernetes = describe_all_pvcs(simple=True)
  File "/app/helpers.py", line 335, in describe_all_pvcs
    output_objects["{}.{}".format(item.metadata.namespace,item.metadata.name)] = convert_pvc_to_simpler_dict(item)
  File "/app/helpers.py", line 305, in convert_pvc_to_simpler_dict
    if 'volume.autoscaler.kubernetes.io/last-resized-at' in pvc.metadata.annotations:
TypeError: argument of type 'NoneType' is not iterable
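The traceback suggests pvc.metadata.annotations is None for PVCs that have no annotations at all. A minimal sketch of a guard in convert_pvc_to_simpler_dict() (a hypothetical fix, not the actual patch):

    # Hypothetical guard: fall back to an empty dict when a PVC carries no annotations.
    annotations = pvc.metadata.annotations or {}
    if 'volume.autoscaler.kubernetes.io/last-resized-at' in annotations:
        ...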

Support victoriametrics instead of prometheus

Is your feature request related to a problem? Please describe.

We're using VictoriaMetrics to monitor large parts of our infrastructure, including Kubernetes.
Unfortunately, I've run into the issue that volume-autoscaler doesn't seem to support this.
The problem is that volume-autoscaler tries to read the Prometheus build info and version from endpoints that VictoriaMetrics doesn't have.

Describe the solution you'd like

I propose adding an extra env variable plus a Helm option to skip the version check and set a Prometheus-compatible version manually.
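A sketch of what this could look like as a deployment env override (the VICTORIAMETRICS_MODE variable appears in the settings listing earlier on this page; its exact semantics are an assumption here):

    # Hypothetical override to skip the Prometheus version probe:
    - name: VICTORIAMETRICS_MODE
      value: "true"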

Multiarch image?

Hello,

Thanks for this project, which is very useful to me.

Did you consider making a multiarch image? I'm interested in arm64 support.

I just checked, and the build passes without modification:

$ docker buildx build --platform linux/arm64/v8,linux/amd64 -t xyz:1 .

Edit: I've also verified that it starts correctly (at runtime) on arm64.

Cheers,
Pierre
