devops-nirvana / kubernetes-volume-autoscaler Goto Github PK
Autoscaling volumes for Kubernetes (with the help of Prometheus)
License: Apache License 2.0
There appears to be a bug in Prometheus Server which causes kubelet_volume_stats_capacity_bytes not to be updated properly in Prometheus after a resize. Note: we may need to file a bug against the metrics-server or Prometheus.
After further investigation, it appears that the kube_persistentvolume_capacity_bytes metric, which is tied to the PV rather than the PVC, is updated correctly, and we could (in theory) look there for the updated value instead; but I believe this to be a bug which should be fixed in Prometheus.
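Under the assumption above, a rough cross-check between the two metrics might look like the following; `capacity_is_stale` is a hypothetical helper for illustration, not project code.

```python
def capacity_is_stale(pvc_capacity_bytes: int, pv_capacity_bytes: int,
                      tolerance: float = 0.01) -> bool:
    """Flag the kubelet PVC-level capacity (kubelet_volume_stats_capacity_bytes)
    as stale when it lags the PV-level kube_persistentvolume_capacity_bytes
    by more than the given tolerance fraction."""
    if pv_capacity_bytes <= 0:
        return False
    gap = (pv_capacity_bytes - pvc_capacity_bytes) / pv_capacity_bytes
    return gap > tolerance
```

If the PVC-level series lags after a resize, this check would let a caller fall back to the PV-level value until Prometheus catches up.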
NOTE: If anyone wants us to prioritize any of these issues, please say something.
Hi,
Sometimes a volume is full because there is no free inode left, while the capacity usage is still low. This happens when you have lots of small files for example.
The feature request would be to add a new resize condition based on the inodes free/count ratio (based on kubelet_volume_stats_inodes*
metrics for example)
Cheers,
Pierre
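A minimal sketch of what such an inode-based condition could look like, assuming the kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_used values have already been fetched; the function names are illustrative, not part of the project:

```python
def inodes_used_percent(inodes_used: int, inodes_total: int) -> float:
    """Percentage of inodes consumed on the volume."""
    if inodes_total <= 0:
        return 0.0
    return 100.0 * inodes_used / inodes_total

def should_scale_for_inodes(inodes_used: int, inodes_total: int,
                            scale_above_percent: float = 80.0) -> bool:
    """True when inode exhaustion alone justifies a resize, even if
    byte-level capacity usage is still low (e.g. many small files)."""
    return inodes_used_percent(inodes_used, inodes_total) > scale_above_percent
```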
Tried to use `scale_up_max_size: 300Gi` in the values.yaml of the chart, but it didn't work:
Traceback (most recent call last):
  File "/app/./main.py", line 4, in <module>
    from helpers import INTERVAL_TIME, PROMETHEUS_URL, DRY_RUN, VERBOSE, get_settings_for_prometheus_metrics, is_integer_or_float, print_human_readable_volume_dict
  File "/app/helpers.py", line 34, in <module>
    SCALE_UP_MAX_SIZE = int(getenv('SCALE_UP_MAX_SIZE') or 16000000000000) # How many bytes is the maximum disk size that we can resize up, default is 16TB for EBS volumes in AWS (in bytes, so 16000000000000)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '300Gi'
Be able to use suffixes as defined in helpers.py
Kubernetes-Volume-Autoscaler/helpers.py
Lines 200 to 241 in f7b773f
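A minimal sketch of the kind of suffix parsing being requested (the project's helpers.py has its own converter at the lines referenced above; this one is illustrative only):

```python
# Kubernetes-style quantity suffixes, decimal (SI) and binary (IEC).
SUFFIXES = {
    'K': 1000, 'M': 1000**2, 'G': 1000**3, 'T': 1000**4,
    'Ki': 1024, 'Mi': 1024**2, 'Gi': 1024**3, 'Ti': 1024**4,
}

def quantity_to_bytes(value: str) -> int:
    """Convert '300Gi' / '16T' / '12345' into an integer byte count.
    Check two-character suffixes before one-character ones so 'Gi'
    is not mistaken for 'G'."""
    value = value.strip()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return int(float(value[:-len(suffix)]) * SUFFIXES[suffix])
    return int(value)
```

With something like this in place, a line such as `SCALE_UP_MAX_SIZE = quantity_to_bytes(getenv('SCALE_UP_MAX_SIZE') or '16T')` would accept both `300Gi` and raw byte counts (this is a sketch of the idea, not the project's actual fix).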
Is there a pre-made Grafana dashboard (or something your team uses) that I can use to visualize the volume autoscaler's metrics? If there is one, could you please share the Grafana dashboard JSON file?
Sometimes the autoscaler tries to resize a PVC to a size below its current size, which raises an error.
Volume infra.data-nfs-server-provisioner-1637948923-0 is 85% in-use of the 80Gi available
BECAUSE it is above 80% used
ALERT has been for 1306 period(s) which needs to at least 5 period(s) to scale
AND we need to scale it immediately, it has never been scaled previously
RESIZING disk from 86G to 20G
Exception raised while trying to scale up PVC infra.data-nfs-server-provisioner-1637948923-0 to 20000000000 ...
(422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e69b53c3-d332-4925-b9ea-afa7570297a9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'b64e47c9-2a4e-48ae-83bc-355685b6c007', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'e5841496-62d0-426a-a987-4b26ec143a20', 'Date': 'Sat, 22 Oct 2022 16:58:07 GMT', 'Content-Length': '520'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"PersistentVolumeClaim \"data-nfs-server-provisioner-1637948923-0\" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value","reason":"Invalid","details":{"name":"data-nfs-server-provisioner-1637948923-0","kind":"PersistentVolumeClaim","causes":[{"reason":"FieldValueForbidden","message":"Forbidden: field can not be less than previous value","field":"spec.resources.requests.storage"}]},"code":422}
FAILED requesting to scale up `infra.data-nfs-server-provisioner-1637948923-0` by `10%` from `86G` to `20G`, it was using more than `80%` disk space over the last `78360 seconds`
I'm using the helm chart version 1.0.3 (same image tag)
Another issue: the autoscaler resized another PVC from 13Gi to 14173392076, which is not human-readable as it was before. It's not a serious issue, but it is still confusing. The autoscaler also sent the Slack alert for this PVC twice, several hours apart.
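One way to avoid both the rejected downsize and the unreadable byte counts, sketched here as hypothetical helpers rather than the project's actual code: refuse any resize that is not strictly upward, and format byte counts for humans before logging or alerting.

```python
from typing import Optional

def clamp_new_size(current_bytes: int, proposed_bytes: int,
                   max_bytes: int) -> Optional[int]:
    """Cap a proposed resize at the configured maximum, and refuse to go
    below the current size (the Kubernetes API rejects shrinks with a 422).
    Returns None when no valid upward resize is possible."""
    new_size = min(proposed_bytes, max_bytes)
    if new_size <= current_bytes:
        return None  # skip quietly instead of sending a forbidden request
    return new_size

def human_readable(num_bytes: float) -> str:
    """Render 14173392076 as '13.2Gi' rather than a raw byte count."""
    for suffix in ('', 'Ki', 'Mi', 'Gi'):
        if abs(num_bytes) < 1024:
            return f"{num_bytes:.1f}{suffix}"
        num_bytes /= 1024
    return f"{num_bytes:.1f}Ti"
```

In the failing example above, clamping 86G against a 20G maximum yields no valid target, so the controller would skip the resize instead of hitting the 422.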
Describe the bug
The volume-autoscaler controller doesn't detect any PVC. I have a test pod running with a volume attached.
After checking the available Prometheus metrics, kubelet_volume_stats_available_bytes and kubelet_volume_stats_capacity_bytes are not exported with the default Prometheus configuration.
I have checked the code, and these two metrics are required by the function fetch_pvcs_from_prometheus in the helpers.py file.
To Reproduce
Steps to reproduce the behavior
helm upgrade --install volume-autoscaler devops-nirvana/volume-autoscaler --set "prometheus_url=http://prometheus-server:80"
kubectl apply -f https://raw.githubusercontent.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/master/examples/simple-pod-with-pvc.yaml
Expected behavior
Volume-autoscaler should retrieve the metrics that are available in Prometheus; if that is not possible, it should explain how to configure the required additional metrics in Prometheus.
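A quick way to check whether the two required metrics are being scraped at all; the default Prometheus URL and helper names below are assumptions, not project code.

```python
import json
import urllib.request

REQUIRED = ("kubelet_volume_stats_available_bytes",
            "kubelet_volume_stats_capacity_bytes")

def missing_metrics(metric_names) -> list:
    """Return the required metrics absent from an iterable of metric names."""
    present = set(metric_names)
    return [m for m in REQUIRED if m not in present]

def fetch_metric_names(prometheus_url="http://prometheus-server:80"):
    """Pull every known metric name via Prometheus's label-values API."""
    url = f"{prometheus_url}/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=15) as resp:
        return json.load(resp)["data"]
```

If both metrics come back missing, the usual cause is that Prometheus is not scraping the kubelet targets, which is where the kubelet_volume_stats_* series originate.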
Extra Information Requested
Hello,
I am trying out the autoscaler, and the solution is working as expected: the controller successfully increases the volume per the settings defined below. However, I receive the related alerts at the same timestamp. It seems that when the volume is finally increased, all the previous warnings arrive at once (reference: two alerts arrived at 1:30 PM in the snapshot below). Please note the snapshot shows two alerts, but the same behaviour occurred on later occasions as well.
Is this a bug or the expected behaviour of the controller? If there are any settings that should be changed, please advise.
Settings:

```yaml
- name: PROMETHEUS_URL
  value: <PROMETHEUS_URL>
- name: SLACK_WEBHOOK_URL
  value: <SLACK_WEBHOOK_URL>
- name: SLACK_CHANNEL
  value: <SLACK_CHANNEL>
- name: INTERVAL_TIME
  value: "60"
- name: SCALE_AFTER_INTERVALS
  value: "5"
- name: SCALE_ABOVE_PERCENT
  value: "80"
- name: SCALE_UP_PERCENT
  value: "20"
- name: SCALE_UP_MIN_INCREMENT
  value: "1000000000"
- name: SCALE_UP_MAX_INCREMENT
  value: "16000000000000"
- name: SCALE_UP_MAX_SIZE
  value: "16000000000000"
- name: SCALE_COOLDOWN_TIME
  value: "22200"
- name: DRY_RUN
- name: PROMETHEUS_LABEL_MATCH
- name: HTTP_TIMEOUT
  value: "15"
- name: VERBOSE
  value: "false"
- name: VICTORIAMETRICS_MODE
  value: "false"
```
Is your feature request related to a problem? Please describe.
When using Prometheus in agent mode, Volume Autoscaler fails with the following error:
Prometheus query failed with code: error
Prometheus Error: unavailable with Prometheus Agent
When Mimir or Cortex is used as a Prometheus datastore, it requires the X-Scope-OrgID header to be added to API requests.
References:
Currently, Volume Autoscaler cannot be used to query Mimir running in multi-tenant mode.
Describe the solution you'd like
I'd like the option to provide the X-Scope-OrgID header. Ideally we could provide a map of arbitrary headers in the helm chart's values.yaml, but any way to set X-Scope-OrgID would meet the current need.
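A sketch of what header support might look like, assuming a hypothetical PROMETHEUS_ORG_ID environment variable (the real option name, if this is implemented, may differ):

```python
import os
import urllib.parse
import urllib.request

def prometheus_request(prometheus_url: str, query: str) -> urllib.request.Request:
    """Build a Prometheus query request, attaching the X-Scope-OrgID header
    when PROMETHEUS_ORG_ID (a hypothetical env var) is set, as required by
    multi-tenant Mimir/Cortex."""
    url = f"{prometheus_url}/api/v1/query?query={urllib.parse.quote(query)}"
    req = urllib.request.Request(url)
    org_id = os.getenv("PROMETHEUS_ORG_ID")
    if org_id:
        req.add_header("X-Scope-OrgID", org_id)
    return req
```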
Describe alternatives you've considered
The alternatives are to...

X-Scope-OrgID header

Is your feature request related to a problem? Please describe.
It'd be nice to be able to customize the message sent to Slack; it currently lacks, e.g., the Kubernetes cluster name, so with multiple clusters running similar apps you can't tell where the scale-up event happened. At the very least, optionally adding the cluster name via values.yaml would help.
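A minimal sketch of the idea, assuming a hypothetical CLUSTER_NAME environment variable surfaced through the chart's values.yaml:

```python
import os

def slack_message(body: str) -> str:
    """Prefix Slack alerts with a cluster identifier so multi-cluster
    installs are distinguishable. CLUSTER_NAME is a hypothetical env var,
    not an existing project option."""
    cluster = os.getenv("CLUSTER_NAME")
    return f"[{cluster}] {body}" if cluster else body
```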
Describe the bug
Checking the app (it was working before), it now throws an error:
TypeError: argument of type 'NoneType' is not iterable
To Reproduce
Steps to reproduce the behavior
Expected behavior
To be able to describe PVCs.
Extra Information Requested
-------------------------------------------------------------------------------------------------------------
Volume Autoscaler - Configuration
-------------------------------------------------------------------------------------------------------------
Prometheus URL: http://kube-prometheus-stack-prometheus:9090
Prometheus Version: 2.42.0
Prometheus Labels: {}
Interval to query usage: every 60 seconds
Scale up after: 5 intervals (300 seconds total)
Scale above percentage: disk is over 80% full
Scale up minimum increment: 1000000000 bytes, or 1G
Scale up maximum increment: 16000000000000 bytes, or 16T
Scale up maximum size: 16000000000000 bytes, or 16T
Scale up percentage: 20% of current disk size
Scale up cooldown: only resize every 22200 seconds
Verbose Mode: is ENABLED
Dry Run: is Disabled
HTTP Timeouts for k8s/prom: is 15 seconds
-------------------------------------------------------------------------------------------------------------
Exception while trying to describe all PVCs
Traceback (most recent call last):
File "/app/./main.py", line 64, in <module>
pvcs_in_kubernetes = describe_all_pvcs(simple=True)
File "/app/helpers.py", line 335, in describe_all_pvcs
output_objects["{}.{}".format(item.metadata.namespace,item.metadata.name)] = convert_pvc_to_simpler_dict(item)
File "/app/helpers.py", line 305, in convert_pvc_to_simpler_dict
if 'volume.autoscaler.kubernetes.io/last-resized-at' in pvc.metadata.annotations:
TypeError: argument of type 'NoneType' is not iterable
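The traceback suggests that pvc.metadata.annotations is None for PVCs that have no annotations at all. A defensive sketch of the check (illustrative, not the project's actual fix):

```python
LAST_RESIZED_ANNOTATION = 'volume.autoscaler.kubernetes.io/last-resized-at'

def last_resized_at(pvc):
    """Read the last-resized annotation defensively: the Kubernetes client
    returns None (not an empty dict) when a PVC has no annotations, so
    `in pvc.metadata.annotations` raises TypeError on such PVCs."""
    annotations = pvc.metadata.annotations or {}
    return annotations.get(LAST_RESIZED_ANNOTATION)
```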
Is your feature request related to a problem? Please describe.
We're using VictoriaMetrics to monitor large parts of our infrastructure, including Kubernetes.
Unfortunately I've run into the issue that the volume-autoscaler doesn't seem to support it.
The problem is that the volume-autoscaler tries to read the Prometheus build info and version from endpoints which VictoriaMetrics doesn't have.
Describe the solution you'd like
I propose adding an extra environment variable plus a helm option to skip the version check and set a Prometheus-compatible version manually.
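A sketch of the proposed escape hatch, with PROMETHEUS_VERSION_OVERRIDE as a hypothetical variable name:

```python
import os

def prometheus_version(fetch_version) -> str:
    """Skip the Prometheus build-info probe (an endpoint VictoriaMetrics
    doesn't expose) when an override is configured.
    PROMETHEUS_VERSION_OVERRIDE is a hypothetical env var name;
    fetch_version is the existing probe, passed as a callable."""
    override = os.getenv("PROMETHEUS_VERSION_OVERRIDE")
    if override:
        return override
    return fetch_version()
```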
Hello,
Thanks for this project which is very useful to me.
Did you consider making a multi-arch image? I'm interested in arm64 support.
I just checked, and the build passes without modification:
$ docker buildx build --platform linux/arm64/v8,linux/amd64 -t xyz:1 .
Edit: I've also checked that it starts correctly (at runtime) on arm64.
Cheers,
Pierre