backube / snapscheduler
Scheduled snapshots for Kubernetes persistent volumes
Home Page: https://backube.github.io/snapscheduler/
License: GNU Affero General Public License v3.0
Describe the feature you'd like to have.
Upgrade operator-sdk to v0.16.0
What is the value to the end user? (why is it a priority?)
How will we know we have a good solution? (acceptance criteria)
Additional context
Changes: https://github.com/operator-framework/operator-sdk/releases/tag/v0.16.0
Describe the feature you'd like to have.
E2E should test the functionality a bit better. For example, #80 was caused by not handling a nil template, but this doesn't get tested by the simple smoke test that currently exists.
What is the value to the end user? (why is it a priority?)
How will we know we have a good solution? (acceptance criteria)
Enhance the e2e to ensure:
Additional context
OLM, being largely OpenShift-specific, leaves out a large portion of the Kubernetes universe. Providing a Helm chart gives everyone else an installation path.
Describe the feature you'd like to have.
external-snapshotter just released v2.0.1 which backported kubernetes-csi/external-snapshotter#240. This fixes the import problems w/ go modules.
gometalinter is being archived. The recommended replacement is: https://github.com/golangci/golangci-lint
This would establish a maximum latency after which a scheduled snapshot would be skipped.
This could be helpful in the case where the scheduler was down for an extended period, preventing a sudden burst of snapshots. For this to be useful, there needs to be a default (e.g., 5min)
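A hypothetical sketch of how this might be expressed on the CR; the maxLatency field does not exist, and its name and default are assumptions:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: hourly
spec:
  schedule: "0 * * * *"
  # Hypothetical field: if a scheduled snapshot is more than this late
  # (e.g., the operator was down), skip it rather than creating a burst
  # of catch-up snapshots. A default of 5m is suggested above.
  maxLatency: 5m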
Prior to an official release, the CRD version should be updated to either beta or v1.
Describe the bug
I tried creating a schedule without specifying a volumesnapshotclass because I had a default snapshot class, but the schedule pod kept crashing.
Steps to reproduce
Expected behavior
Schedule should start creating snapshots.
Actual results
Schedule pod was crashing.
Additional context
Schedule yaml: https://pastebin.com/y0LgPraz
Pod logs: https://pastebin.com/qtk98VEi
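For reference, a minimal sketch of the kind of schedule described, relying on the cluster's default VolumeSnapshotClass rather than naming one; field names follow the project's CR shape, though the apiVersion may differ by release:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
  namespace: myns            # placeholder namespace
spec:
  schedule: "0 0 * * *"
  retention:
    maxCount: 7
  # No snapshotTemplate.snapshotClassName is set; the default
  # VolumeSnapshotClass is expected to be used.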
Describe the feature you'd like to have.
A way to have timezone-based schedules.
What is the value to the end user? (why is it a priority?)
This is a good-to-have feature.
How will we know we have a good solution? (acceptance criteria)
Additional context
With GitHub Actions now GA, it would be good to consolidate CI from Travis to GH.
This may "just work" given different manifests.
Describe the feature you'd like to have.
Currently, the .spec.disabled
field isn't hinted in the CSV, so the current state of whether the SnapshotSchedule is enabled/disabled isn't easily seen.
What is the value to the end user? (why is it a priority?)
Since it's not shown unless the user looks at the yaml version of the object, it's easy to miss that the schedule is currently disabled, leading to confusion about why snapshots are not being taken.
How will we know we have a good solution? (acceptance criteria)
.spec.disabled
should be hinted as urn:alm:descriptor:com.tectonic.ui:checkbox
Additional context
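A sketch of the CSV hint this would require, following the OLM descriptor conventions; the exact placement within the project's CSV is an assumption:

# In the ClusterServiceVersion, under spec.customresourcedefinitions:
customresourcedefinitions:
  owned:
    - kind: SnapshotSchedule
      name: snapshotschedules.snapscheduler.backube
      version: v1                       # whichever version the CSV ships
      specDescriptors:
        - path: disabled
          displayName: Disabled
          description: Suspend snapshot creation for this schedule
          x-descriptors:
            - urn:alm:descriptor:com.tectonic.ui:checkbox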
Where should docs be hosted? Options: readthedocs, gh-pages, md in docs/ directory
The operator version is supposed to be printed at startup, but 0.0.0
is printed instead. The version is not being properly set at compile time.
Describe the feature you'd like to have.
After deploying, the Helm chart should print the version of the chart, application, and image that was deployed.
What is the value to the end user? (why is it a priority?)
While it's fairly easy to look at the Deployment to verify the app/image version, it would be nice to get that upfront.
How will we know we have a good solution? (acceptance criteria)
Additional context
Add:
While not a SLO, this is probably still reasonable practice to follow.
When generating snapshot names, the name of the PVC may need to be truncated to stay within naming limits
robfig/cron supports shortcuts like "@hourly" in addition to the standard 5-field cronspec. Currently, the validation only supports the latter, and will reject the shortcut specifications.
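For illustration, two alternative values for spec.schedule that express the same hourly schedule; only the second is accepted by the current validation:

schedule: "@hourly"       # robfig/cron shortcut; currently rejected
schedule: "0 * * * *"     # equivalent 5-field cronspec; currently accepted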
Change the licensing on snapscheduler/pkg/apis/snapscheduler/*
to dual AGPL & Apache-2.0 to permit the API types to be imported into other Apache licensed code.
This should permit other code that wants to manipulate the SnapshotSchedule CRs to gain access to the necessary types by directly importing from this repo.
Note: there were breaking changes in v0.11.0: https://github.com/operator-framework/operator-sdk/blob/master/doc/migration/version-upgrade-guide.md#v011x
As a part of this work, go back through the upgrade guide and make sure updates have been made since 0.9.0.
Since the operator supports leader election, its deployment should be replica 2 to ensure upgrades and re-scheduling go smoothly.
The deployment should also provide resource requests for proper scheduling.
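A sketch of the corresponding Helm values; the numbers mirror the chart defaults shown in the HelmRelease later on this page and are not a sizing recommendation:

replicaCount: 2          # leader election makes an active/standby pair safe
resources:
  requests:
    cpu: 10m
    memory: 100Mi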
Describe the feature you'd like to have.
Currently, snapscheduler doesn't provide any metrics related to the snapshots attempted/created. It would be good to provide some stats that could be monitored and alerted on.
What is the value to the end user? (why is it a priority?)
Users that depend on having snapshots to protect their data should have a way to monitor whether those snapshots are being successfully created
How will we know we have a good solution? (acceptance criteria)
Additional context
cc: @prasanjit-enginprogam
Describe the bug
Below are several doc bugs/suggestions from @ShyamsundarR
Describe the bug
Seen in the operator log:
E1211 18:34:58.078937 1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:34:59.081894 1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:35:00.084973 1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
Steps to reproduce
Expected behavior
Those messages shouldn't be present.
Actual results
The above error messages are seen in the operator pod logs.
Additional context
This is seen on OpenShift 4.2 and 4.3-ci builds with snapscheduler v1.0.0, installed via Chart v1.0.1
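These errors look like the operator's service account lacks watch permission on PVCs; a sketch of the kind of RBAC rule that would address it, assuming that is indeed the cause:

rules:
  - apiGroups: [""]                          # core API group
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch"]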
Describe the feature you'd like to have.
The code should be updated to use VolumeSnapshot version v1beta1 instead of the current v1alpha1.
What is the value to the end user? (why is it a priority?)
Kubernetes v1.17 switched from v1alpha1 to v1beta1, so in order to support kube versions going forward, the snap version needs to be updated
Additional context
It appears that the alpha API was removed entirely from the snapshotter, meaning there is no grace period for this change... Unless there's a webhook someplace to make the conversion.
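For reference, a minimal v1beta1 VolumeSnapshot as the operator would need to create it (names are placeholders):

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mypvc-hourly-202001010000          # placeholder snapshot name
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder class name
  source:
    persistentVolumeClaimName: mypvc       # v1alpha1 used a typed object reference here instead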
Describe the feature you'd like to have.
Kube 1.16 brought apiextensions to v1, and it appears v1beta1 will no longer be available in 1.19.
The SnapScheduler operator needs to be deployed w/ the proper version depending on the cluster version.
What is the value to the end user? (why is it a priority?)
It is desirable for users to continue to have access to SnapScheduler as Kubernetes moves to 1.19+.
How will we know we have a good solution? (acceptance criteria)
Deployment of the operator is currently supported both via Helm and OLM. This should continue to be the case.
Additional context
Currently, Helm deployment is supported back to 1.13, and OLM is back to 1.17. Ideally, we can preserve that even with this switch.
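The packaging change is mostly the CRD manifest itself; a minimal sketch of the apiextensions.k8s.io/v1 form, with the version name and schema trimmed for brevity:

apiVersion: apiextensions.k8s.io/v1        # 1.13-1.15 clusters still need the v1beta1 manifest
kind: CustomResourceDefinition
metadata:
  name: snapshotschedules.snapscheduler.backube
spec:
  group: snapscheduler.backube
  scope: Namespaced
  names:
    kind: SnapshotSchedule
    plural: snapshotschedules
  versions:
    - name: v1
      served: true
      storage: true
      schema:                              # a structural schema is mandatory in v1
        openAPIV3Schema:
          type: object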
Describe the feature you'd like to have.
We are using a scheduler to create regular (hourly) snapshots of a source volume. We have a StatefulSet that needs to create a volume (using the "volumeClaimTemplate" option). It would be very helpful if we could have the most recent snapshot created be named with a '-latest' suffix instead of the timestamp.
What is the value to the end user? (why is it a priority?)
When using snapshots as part of a StatefulSet VolumeClaimTemplate, Kubernetes fails if you try to update the DataSource part of the VolumeClaim template. Say, for example, to use a more recent snapshot. The workaround is a bit dangerous and error-prone: delete the StatefulSet with --cascade=false
and then re-create it with the newer snapshot ID.
If we knew we could always expect our optimal snapshot to be suffixed with '-latest' (or something else obvious), we wouldn't need to change the datasource in the StatefulSet's VolumeClaimTemplate.
How will we know we have a good solution? (acceptance criteria)
When we can increase the replica count of a StatefulSet and have it always use the most recent snapshot as its datasource for the PVC.
Additional context
Not sure if this is a bit of an edge case, but it would be super-useful for things like blockchains. Each node has a huge ledger volume. The closer to "now" that we can get when provisioning the volumes for additional pods, the faster the pods can be up and ready.
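For context, a sketch of the StatefulSet volumeClaimTemplate whose dataSource currently has to be edited by hand; names and sizes are placeholders, and the '-latest' name is the requested behavior, not an existing one:

volumeClaimTemplates:
  - metadata:
      name: ledger
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 500Gi
      dataSource:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: ledger-latest     # hypothetical stable name instead of a timestamped snapshot name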
Prevent pending jobs from piling up due to failures and long running pods.
Describe the feature you'd like to have.
Add a Validating Webhook to improve the validity checking of the SnapshotSchedule CR.
What is the value to the end user? (why is it a priority?)
The current openapi validation handles some validation, but not all field validation can be adequately expressed using openapi. Errors currently missed by openapi validation will only be discovered after the operator tries to reconcile, with error reporting limited to the .status
field of the affected CR.
By adding a validating webhook, more (all?) errors can be found when the object is first created, so the create/update can be immediately rejected, providing more timely feedback to the user. This will also improve error reporting capabilities via the web console.
How will we know we have a good solution? (acceptance criteria)
.spec.schedule should be validated by parsing it to ensure it's a valid cronspec (i.e., all values are in range).
.spec.retention.expires should properly parse as a time.Duration.
Additional context
This should be viewed as an enhancement on top of the openapi validation, not a substitute. The openapi validation should still be the first line of defense since those rules are exposed in the CRD, whereas the webhook's rules are opaque to the user.
Starting point (from this tutorial at KubeConNA 2019):
https://github.com/jpbetz/KoT/tree/master/admission
Describe the feature you'd like to have.
Document how to use ResourceQuota to limit the number of snapshots that can be created.
https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota
What is the value to the end user? (why is it a priority?)
It's easy to create a snapshotschedule that results in many snapshots being created, potentially exhausting the storage of the system or incurring high cloud provider costs.
How will we know we have a good solution? (acceptance criteria)
Additional context
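A minimal example of such a quota (the limit is arbitrary):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: snapshot-count
  namespace: myns                # placeholder namespace
spec:
  hard:
    # Object-count quota on VolumeSnapshot objects in this namespace
    count/volumesnapshots.snapshot.storage.k8s.io: "50"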
Describe the feature you'd like to have.
The Helm chart should limit the nodes for the snapscheduler operator to amd64/linux hosts (as opposed to Windows or other architectures).
What is the value to the end user? (why is it a priority?)
The operator will fail to start on anything other than amd64/linux.
How will we know we have a good solution? (acceptance criteria)
Additional context
The arch and os labels: kubernetes.io/arch
and kubernetes.io/os
first appeared in 1.14. Prior to that, they were beta.kubernetes.io/*
. As long as we continue to support 1.13, we need to ensure it continues to work w/ the beta label also.
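A sketch of the selector the chart would need to set, including the pre-1.14 beta labels mentioned above:

nodeSelector:
  kubernetes.io/arch: amd64
  kubernetes.io/os: linux
  # On 1.13 clusters the equivalent labels are:
  #   beta.kubernetes.io/arch: amd64
  #   beta.kubernetes.io/os: linux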
Describe the feature you'd like to have.
The lifetime of the snapshot should be independent of the namespace of the PVC.
What is the value to the end user? (why is it a priority?)
Snapshots that reside in the same namespace as the primary data don't protect against accidental namespace deletion.
How will we know we have a good solution? (acceptance criteria)
Additional context
Describe the feature you'd like to have.
Kube 1.18 has been released and Kind images are available. Update CI and gating tests to include 1.18.
What is the value to the end user? (why is it a priority?)
How will we know we have a good solution? (acceptance criteria)
Additional context
https://github.com/operator-framework/operator-sdk/releases/tag/v0.13.0
generate crds
and/or directly using openapi-gen
"*"
General housekeeping items that need to be done prior to a 1.0 release:
The internal state machine for the scheduler is exposed in the .status
portion of the CR. This should really be removed since it's internal implementation detail and not meant to be part of the API.
Describe the bug
A couple days ago, changes were made to the csi-driver-host-path repo that have broken the CI testing of SnapScheduler. The hostpath driver install is failing while setting up the CI environment.
Steps to reproduce
Run ./hack/setup-kind-cluster.sh
Expected behavior
Actual results
Additional context
It should be possible to override the snapshotclass on a per-PVC basis by setting an annotation on the PVC.
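A sketch of what that could look like; the annotation key is hypothetical and does not exist today:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypvc
  annotations:
    # Hypothetical annotation overriding the schedule's snapshotClassName for this PVC only
    snapscheduler.backube/snapshotclass: csi-fast-snapclass
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi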
In decreasing priority:
I'm trying to run this in EKS with version 1.17, but I get the following error:
kubectl describe po/snapscheduler-864d84f9-gnzwb -n snapscheduler
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned snapscheduler/snapscheduler-864d84f9-gnzwb to ip-10-27-7-136.eu-west-2.compute.internal
Normal Pulled 23m (x5 over 25m) kubelet, ip-10-27-7-136.eu-west-2.compute.internal Container image "quay.io/backube/snapscheduler:1.1.1" already present on machine
Normal Created 23m (x5 over 25m) kubelet, ip-10-27-7-136.eu-west-2.compute.internal Created container snapscheduler
Warning Failed 23m (x5 over 25m) kubelet, ip-10-27-7-136.eu-west-2.compute.internal Error: failed to start container "snapscheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/manager\": stat /manager: no such file or directory": unknown
Warning BackOff 7s (x112 over 24m) kubelet, ip-10-27-7-136.eu-west-2.compute.internal Back-off restarting failed container
Using the following HelmRelease
---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: snapscheduler
  namespace: fluxcd
  annotations:
    flux.weave.works/automated: "false"
spec:
  targetNamespace: snapscheduler
  helmVersion: v3
  releaseName: snapscheduler
  chart:
    repository: ***
    name: snapscheduler
    version: 1.2.1
  values:
    replicaCount: 2
    image:
      repository: quay.io/backube/snapscheduler
      tagOverride: ""
      pullPolicy: IfNotPresent
    imagePullSecrets: []
    nameOverride: ""
    fullnameOverride: ""
    serviceAccount:
      create: true
    podSecurityContext: {}
    securityContext: {}
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
    nodeSelector:
    tolerations: []
    affinity: {}
During periodic (Idle state) reconcile, delete snapshots whose metadata.creationTimestamp
is older than time.Now() - spec.retention.expires
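For reference, the retention field this acts on; the value must parse as a Go time.Duration:

spec:
  retention:
    # Snapshots whose creationTimestamp is older than now minus this duration
    # are deleted during the idle reconcile described above.
    expires: "168h"   # e.g., 7 days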
Describe the bug
The e2e tests flake frequently in the cleanup stage... "timeout waiting for condition".
There had been some sort of race condition regarding finalizers in the snapshotter... it could be that, or it could just be that the timeout is too short.
Steps to reproduce
Expected behavior
Actual results
Additional context
Examples:
Kubebuilder docs: https://book.kubebuilder.io/beyond_basics/generating_crd.html
This is on hold pending sdk support
Describe the feature you'd like to have.
Upgrade golangci-lint to 1.23.1
There should be a maximum of spec.retention.maxCount snapshots. During idle reconcile, prune the snapshots carrying the label snapscheduler.backube/schedule: <schedule_name>, grouped by spec.source.name (the source PVC name), deleting the oldest by metadata.creationTimestamp until there are at most maxCount left.
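For reference, the field and label involved (the schedule name is a placeholder):

# On the SnapshotSchedule:
spec:
  retention:
    maxCount: 10      # keep at most this many snapshots per source PVC
# On each VolumeSnapshot created by the operator:
metadata:
  labels:
    snapscheduler.backube/schedule: hourly   # <schedule_name>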
Describe the bug
I wanted to scrape the snapscheduler metrics from Prometheus, but the metrics don't seem to be working.
Steps to reproduce
I had a vanilla install of snapscheduler:
$ kubectl describe service/snapscheduler-metrics -n abcns
Name: snapscheduler-metrics
Namespace: cloudops
Labels: name=snapscheduler
Annotations: <none>
Selector: name=snapscheduler
Type: ClusterIP
IP Families: <none>
IP: 172.20.219.85
IPs: <none>
Port: http-metrics 8383/TCP
TargetPort: 8383/TCP
Endpoints: <none>
Port: cr-metrics 8686/TCP
TargetPort: 8686/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
$
Using port-forwarding, it's giving me a timeout error.
kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8686
kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8383
Expected behavior
I should be able to see metrics
Actual results
I'm getting the error below:
error: timed out waiting for the condition
Please help