backube / snapscheduler

Scheduled snapshots for Kubernetes persistent volumes

Home Page: https://backube.github.io/snapscheduler/

License: GNU Affero General Public License v3.0

Languages: Dockerfile 2.61%, Shell 13.04%, Go 65.47%, Makefile 15.91%, Ruby 0.49%, Mustache 2.48%
Topics: csi, data-protection, kubernetes, kubernetes-operator, persistent-volume, scheduled-snapshots, storage

snapscheduler's People

Contributors

dependabot[bot], dschunack, henriklundahl, johnstrunk, mergify[bot], mjhuber, renovate-bot, scoof, tgip-work, wjentner

snapscheduler's Issues

Improve e2e

Describe the feature you'd like to have.
E2E should test the functionality a bit better. For example, #80 was caused by not handling a nil template, but this doesn't get tested by the simple smoke test that currently exists.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)
Enhance the e2e to ensure:

  • nil template is ok
  • labels get passed through
  • class name is properly respected
  • selectors work properly

Additional context

Add startingDeadlineSeconds

This would establish a maximum latency after which a scheduled snapshot would be skipped.
This could be helpful in the case where the scheduler was down for an extended period, preventing a sudden burst of snapshots. For this to be useful, there needs to be a default (e.g., 5min)
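
A minimal sketch of how the proposed field might look on a SnapshotSchedule; startingDeadlineSeconds does not exist yet, the other field names follow the project's documented CR, and the apiVersion may differ by release:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: hourly
  namespace: myns
spec:
  schedule: "0 * * * *"        # hourly, on the hour
  # Proposed field -- does not exist yet. Skip a snapshot if it could not be
  # started within 5 minutes of its scheduled time.
  startingDeadlineSeconds: 300
  retention:
    maxCount: 24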

Update CRD to v1

Prior to an official release, the CRD version should be updated to either beta or v1.

Schedule is not able to detect the default snapshotclass that is present.

Describe the bug
I tried creating a schedule without adding a volumesnapshotclass because I had a default snapshot class, but the schedule pod kept crashing.
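
For reference, a schedule of roughly this shape triggers the problem: snapshotTemplate.snapshotClassName is omitted so the cluster's default VolumeSnapshotClass should be used. This is a sketch based on the project's documented CR fields, not the reporter's exact YAML; the apiVersion may differ by release.

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
  namespace: myns
spec:
  schedule: "0 0 * * *"
  retention:
    maxCount: 7
  # No snapshotTemplate.snapshotClassName -- the default VolumeSnapshotClass
  # should be picked up automatically.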

Steps to reproduce

  1. Create schedule without specifying snapshotclass
  2. Check for schedule pods

Expected behavior
Schedule should start creating snapshots.

Actual results
Schedule pod was crashing.

Additional context
Schedule yaml: https://pastebin.com/y0LgPraz
Pod logs: https://pastebin.com/qtk98VEi

Adding timezone-preferred schedule

Describe the feature you'd like to have.
A way to have timezone-based schedules.

What is the value to the end user? (why is it a priority?)
This is a good-to-have feature.

How will we know we have a good solution? (acceptance criteria)

Additional context

Move CI to GH-actions

With GitHub Actions now GA, it would be good to consolidate CI from Travis to GH.

Expose "disabled" field in OLM UI

Describe the feature you'd like to have.
Currently, the .spec.disabled field isn't hinted in the CSV, so the current state of whether the SnapshotSchedule is enabled/disabled isn't easily seen.

What is the value to the end user? (why is it a priority?)
Since it's not shown unless the user looks at the yaml version of the object, it's easy to miss that the schedule is currently disabled, leading to confusion about why snapshots are not being taken.

How will we know we have a good solution? (acceptance criteria)
.spec.disabled should be hinted as urn:alm:descriptor:com.tectonic.ui:checkbox
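
A minimal sketch of the descriptor, assuming the standard ClusterServiceVersion owned-CRD layout:

# In the CSV, under spec.customresourcedefinitions.owned for SnapshotSchedule:
specDescriptors:
  - path: disabled
    displayName: Disabled
    description: Temporarily disable this snapshot schedule
    x-descriptors:
      - urn:alm:descriptor:com.tectonic.ui:checkbox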

Additional context

Add documentation

  • How to install
  • How to use
    • Features & CR fields

Where should docs be hosted? Options: readthedocs, gh-pages, md in docs/ directory

Print version of chart & application when deploying

Describe the feature you'd like to have.
After deploying, the Helm chart should print the version of the chart, application, and image that was deployed.

What is the value to the end user? (why is it a priority?)
While it's fairly easy to look at the Deployment to verify the app/image version, it would be nice to get that upfront.

How will we know we have a good solution? (acceptance criteria)

Additional context

Properly truncate PVC names

When generating snapshot names, the name of the PVC may need to be truncated to stay within naming limits

Integration w/ OLM & operatorhub

  • Generation of OLM artifacts
  • Installation via OLM w/ instructions added to docs
  • Artifacts (text/icons etc.) for OperatorHub
  • Submit to operatorhub

Documentation enhancements

  • Suggestions for labeling: app-centric vs. schedule-centric
  • 404 page should encourage filing an issue
  • Developer docs
    • Building & make targets
    • Prereqs: sdk, go/gvm, golangci-lint
    • Upgrading the SDK
    • Editing the docs
    • Development prioritization
      • Project board
      • Bugs before features
      • Roadmap: cluster-scope operator, cluster-scope schedule, schedules for SCs

Change licensing on APIs

Change the licensing on snapscheduler/pkg/apis/snapscheduler/* to dual AGPL & Apache-2.0 to permit the API types to be imported into other Apache licensed code.

This should permit other code that wants to manipulate the SnapshotSchedule CRs to gain access to the necessary types by directly importing from this repo.

Improve Deployment spec

Since the operator supports leader election, its Deployment should run 2 replicas to ensure upgrades and re-scheduling go smoothly.

The deployment should also provide resource requests for proper scheduling.
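
A sketch of the relevant Deployment fields (selector and labels omitted for brevity; the request values are placeholders, not tuned recommendations):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapscheduler
spec:
  replicas: 2                  # leader election allows >1 replica; smooths upgrades/rescheduling
  template:
    spec:
      containers:
        - name: snapscheduler
          resources:
            requests:
              cpu: 10m         # placeholder values
              memory: 100Mi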

Snapshot metrics

Describe the feature you'd like to have.
Currently, snapscheduler doesn't provide any metrics related to the snapshots attempted/created. It would be good to provide some stats that could be monitored and alerted on.

What is the value to the end user? (why is it a priority?)
Users that depend on having snapshots to protect their data should have a way to monitor whether those snapshots are being successfully created

How will we know we have a good solution? (acceptance criteria)

Additional context
cc: @prasanjit-enginprogam

Misc doc bugs

Describe the bug
Below are several doc bugs/suggestions from @ShyamsundarR

  • In https://backube.github.io/snapscheduler/install.html:
    • Spell correct "teh": "This page provides instructions to deploy the snapscheduler operator. The operator is cluster-scoped, but its resources are namespaced. This means, a single instance of teh "
  • https://backube.github.io/snapscheduler/usage.html:
    • When I click "go back to installation", it asks me to save a file (Firefox)
  • Clarify the semantics of expiration. State specifically that when both time & count are specified, expiration is the soonest/most restrictive.

Failed to watch *v1.PersistentVolumeClaim

Describe the bug
Seen in the operator log:

E1211 18:34:58.078937       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:34:59.081894       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:35:00.084973       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)

Steps to reproduce

  • Create a test namespace
  • Create a PVC that is WaitForFirstConsumer
  • Create a snapshotschedule that selects the above PVC
  • Wait for the schedule to fire

Expected behavior
Those error messages shouldn't be present.

Actual results
The above error messages are seen in the operator pod logs.

Additional context

  • This doesn't seem to cause any ill effects other than the error messages

This is seen on OpenShift 4.2 and 4.3-ci builds with snapscheduler v1.0.0, installed via Chart v1.0.1

Update VolumeSnapshot to v1beta1

Describe the feature you'd like to have.
The code should be updated to use VolumeSnapshot version v1beta1 instead of the current v1alpha1.

What is the value to the end user? (why is it a priority?)
Kubernetes v1.17 switched from v1alpha1 to v1beta1, so in order to support kube versions going forward, the snapshot API version needs to be updated.

Additional context
It appears that the alpha API was removed entirely from the snapshotter, meaning there is no grace period for this change... Unless there's a webhook someplace to make the conversion.

See: https://github.com/kubernetes-csi/external-snapshotter
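
For reference, the v1beta1 object shape; this is a sketch of the upstream API (names are examples), not snapscheduler output:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: data-201912111800
  namespace: myns
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass   # example class name
  source:
    persistentVolumeClaimName: data                 # v1beta1 references the source PVC by name here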

Move CRD to apiextensions.k8s.io/v1

Describe the feature you'd like to have.
Kube 1.16 brought apiextensions to v1, and it appears v1beta1 will no longer be available in 1.19.
The SnapScheduler operator needs to be deployed w/ the proper version depending on the cluster version.
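
The main structural change in apiextensions.k8s.io/v1 is that each served version must carry a structural OpenAPI schema; a minimal sketch (the real schema and version names are omitted/placeholder here):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: snapshotschedules.snapscheduler.backube
spec:
  group: snapscheduler.backube
  names:
    kind: SnapshotSchedule
    plural: snapshotschedules
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:                    # required per version in v1
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              x-kubernetes-preserve-unknown-fields: true   # placeholder; real schema omitted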

What is the value to the end user? (why is it a priority?)
It is desirable for users to continue to have access to SnapScheduler as Kubernetes moves to 1.19+.

How will we know we have a good solution? (acceptance criteria)
Deployment of the operator is currently supported both via Helm and OLM. This should continue to be the case.

Additional context
Currently, Helm deployment is supported back to 1.13, and OLM is back to 1.17. Ideally, we can preserve that even with this switch.

Allow snapshot names to be rotated with most recent being named "_latest"

Describe the feature you'd like to have.
We are using a scheduler to create regular (hourly) snapshots of a source volume. We have a stateful set that needs to create a volume (using the "volumeClaimTemplate" option). It would be very helpful if we could have the most recently created snapshot be named '-latest' instead of the timestamp.

What is the value to the end user? (why is it a priority?)
When using snapshots as part of a StatefulSet VolumeClaimTemplate, Kubernetes fails if you try to update the DataSource part of the VolumeClaim template. Say, for example, to use a more recent snapshot. The workaround is a bit dangerous and error-prone: delete the StatefulSet with --cascade=false and then re-create it with the newer snapshot ID.

If we knew we could always expect our optimal snapshot to be suffixed with '-latest' (or something else obvious), we wouldn't need to change the datasource in the StatefulSet's VolumeClaimTemplate.

How will we know we have a good solution? (acceptance criteria)
When we can increase the replica count of a StatefulSet and have it always use the most recent snapshot as its datasource for the PVC.

Additional context
Not sure if this is a bit of an edge case, but it would be super-useful for things like blockchains. Each node has a huge ledger volume. The closer to "now" that we can get when provisioning the volumes for additional pods, the faster the pods can be up and ready.

Add validation via Admission Webhook

Describe the feature you'd like to have.
Add a Validating Webhook to improve the validity checking of the SnapshotSchedule CR.

What is the value to the end user? (why is it a priority?)
The current openapi validation handles some validation, but not all field validation can be adequately expressed using openapi. Errors currently missed by openapi validation will only be discovered after the operator tries to reconcile, with error reporting limited to the .status field of the affected CR.
By adding a validating webhook, more (all?) errors can be found when the object is first created, so the create/update can be immediately rejected, providing more timely feedback to the user. This will also improve error reporting capabilities via the web console.

How will we know we have a good solution? (acceptance criteria)

  • .spec.schedule should be validated by parsing it to ensure it's a valid cronspec (i.e., all values are in range)
  • .spec.retention.expires should properly parse as a time.Duration

Additional context
This should be viewed as an enhancement on top of the openapi validation, not a substitute. The openapi validation should still be the first line of defense since those rules are exposed in the CRD, whereas the webhook's rules are opaque to the user.

Starting point (from this tutorial at KubeConNA 2019):
https://github.com/jpbetz/KoT/tree/master/admission
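
A rough sketch of the registration side; the webhook name, service, and path are hypothetical, and the validation logic itself would live in the operator:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: snapscheduler-validation            # hypothetical
webhooks:
  - name: validate.snapscheduler.backube    # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    rules:
      - apiGroups: ["snapscheduler.backube"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["snapshotschedules"]
    clientConfig:
      service:
        name: snapscheduler-webhook          # hypothetical Service
        namespace: snapscheduler
        path: /validate-snapshotschedule     # hypothetical path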

Document how to set snapshot quotas

Describe the feature you'd like to have.
Document how to use ResourceQuota to limit the number of snapshots that can be created.
https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota
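
An object-count quota of roughly this form should work; the resource name follows the count/<resource>.<group> convention from the linked docs:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: snapshot-quota
  namespace: myns
spec:
  hard:
    # Limit the number of VolumeSnapshot objects in this namespace
    count/volumesnapshots.snapshot.storage.k8s.io: "50"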

What is the value to the end user? (why is it a priority?)
It's easy to create a snapshotschedule that results in many snapshots being created, potentially exhausting the storage of the system or incurring high cloud provider costs.

How will we know we have a good solution? (acceptance criteria)

  • Admins will understand that it is best-practice to limit the number of snaps
  • There will be instructions how to do so
  • There will be at least 1 example

Additional context

Add node selector to run operator only on linux hosts

Describe the feature you'd like to have.
The Helm chart should limit the nodes for the snapscheduler operator to amd64/linux hosts (as opposed to Windows or other architectures).

What is the value to the end user? (why is it a priority?)
The operator will fail to start on anything other than amd64/linux.

How will we know we have a good solution? (acceptance criteria)

Additional context
The arch and os labels: kubernetes.io/arch and kubernetes.io/os first appeared in 1.14. Prior to that, they were beta.kubernetes.io/*. As long as we continue to support 1.13, we need to ensure it continues to work w/ the beta label also.
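
On 1.14+ clusters, the chart would render something like the fragment below; pre-1.14 clusters would need the beta.kubernetes.io/* equivalents:

# Pod spec fragment
nodeSelector:
  kubernetes.io/os: linux      # beta.kubernetes.io/os before 1.14
  kubernetes.io/arch: amd64    # beta.kubernetes.io/arch before 1.14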

Protect against accidental namespace deletion

Describe the feature you'd like to have.
The lifetime of the snapshot should be independent of the namespace of the PVC.

What is the value to the end user? (why is it a priority?)
Snapshots that reside in the same namespace as the primary data don't protect against accidental namespace deletion.

How will we know we have a good solution? (acceptance criteria)

  • If a namespace containing the primary data is deleted, the scheduled snapshots should survive.
  • All the other scheduling features should continue to work as intended (e.g., expiring old snaps)

Additional context

  • There seem to be (at least) two approaches:
    • Transfer the VolumeSnapshot object (by re-binding the VSC) to a different namespace
    • Change the deletion policy of the snapshot content such that it is retained even if the VS is deleted (see the sketch below)
  • The work on volume namespace transfer may be useful here kubernetes/enhancements#1555
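
The second approach could build on the snapshot API's deletionPolicy; a sketch (class name and driver are examples):

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass-retain       # example name
driver: hostpath.csi.k8s.io        # example driver
deletionPolicy: Retain             # keep the VolumeSnapshotContent even if the
                                   # VolumeSnapshot (or its namespace) is deleted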

Test against kube 1.18 in CI

Describe the feature you'd like to have.
Kube 1.18 has been released and Kind images are available. Update CI and gating tests to include 1.18.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

Housekeeping for 1.0 release

General housekeeping items that need to be done prior to a 1.0 release:

  • Add a CODE_OF_CONDUCT.md (linked from README.md)
  • Add CONTRIBUTING.md
  • Add SECURITY.md
  • Add issue templates (bug, enhancement)
  • Add PR template
  • Link to CoC, Contrib, Security from README.md
  • Add CHANGELOG.md

Remove state machine from `.status`

The internal state machine for the scheduler is exposed in the .status portion of the CR. This should be removed since it's an internal implementation detail and not meant to be part of the API.

Fix CI

Describe the bug
A couple of days ago, changes were made to the csi-driver-host-path repo that have broken the CI testing of SnapScheduler. The hostpath driver install is failing while setting up the CI environment.

Steps to reproduce
Run ./hack/setup-kind-cluster.sh

Expected behavior

Actual results

Additional context

Allow overriding snapshotclass per PVC

It should be possible to override the snapshotclass on a per-PVC basis by setting an annotation on the PVC.

In decreasing priority:

  1. PVC annotation
  2. class from the schedule
  3. cluster default
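
A sketch of what the PVC side might look like; the annotation key is hypothetical since the feature doesn't exist yet:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: myns
  annotations:
    # Hypothetical annotation key for the proposed per-PVC override
    snapscheduler.backube/snapshot-class: my-other-snapclass
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi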

Error: failed to start container "snapscheduler": Error response from daemon: OCI runtime create failed

I'm trying to run this in EKS with Kubernetes version 1.17, but I get the following error:

kubectl describe po/snapscheduler-864d84f9-gnzwb -n snapscheduler

Events:
  Type     Reason     Age                 From                                                Message
  ----     ------     ----                ----                                                -------
  Normal   Scheduled  <unknown>           default-scheduler                                   Successfully assigned snapscheduler/snapscheduler-864d84f9-gnzwb to ip-10-27-7-136.eu-west-2.compute.internal
  Normal   Pulled     23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Container image "quay.io/backube/snapscheduler:1.1.1" already present on machine
  Normal   Created    23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Created container snapscheduler
  Warning  Failed     23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Error: failed to start container "snapscheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/manager\": stat /manager: no such file or directory": unknown
  Warning  BackOff    7s (x112 over 24m)  kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Back-off restarting failed container

Using the following HelmRelease

---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: snapscheduler
  namespace: fluxcd
  annotations:
    flux.weave.works/automated: "false"
spec:
  targetNamespace: snapscheduler
  helmVersion: v3
  releaseName: snapscheduler
  chart:
    repository: ***
    name: snapscheduler
    version: 1.2.1
  values:
    replicaCount: 2

    image:
      repository: quay.io/backube/snapscheduler
      tagOverride: ""
      pullPolicy: IfNotPresent

    imagePullSecrets: []
    nameOverride: ""
    fullnameOverride: ""

    serviceAccount:
      create: true

    podSecurityContext: {}
    securityContext: {}
    resources:
      requests:
        cpu: 10m
        memory: 100Mi

    nodeSelector:
    tolerations: []
    affinity: {}

Expire snapshots based on time

During periodic (Idle state) reconcile, delete snapshots whose metadata.creationTimestamp is older than time.Now() - spec.retention.expires
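
For reference, the retention field this operates on; the duration string is parsed as a Go duration (e.g., "168h" for one week), and the apiVersion may differ by release:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
spec:
  schedule: "0 0 * * *"
  retention:
    expires: "168h"     # snapshots older than one week get deleted during reconcile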

Fix e2e flakes

Describe the bug
The e2e tests flake frequently in the cleanup stage... "timeout waiting for condition".
There had been some sort of race condition regarding finalizers in the snapshotter; it could be that, or it could just be that the timeout is too short.

Steps to reproduce

  • Run the e2e
  • They fail randomly (kube version doesn't seem to matter much).

Expected behavior

Actual results

Additional context
Examples:

Expire snapshots based on count

There should be a maximum of spec.retention.maxCount snapshots. During idle reconcile:

  • Enumerate the snaps w/ label snapscheduler.backube/schedule: <schedule_name>
  • Group by spec.source.name (source PVC name)
  • Delete the oldest according to metadata.creationTimestamp until there are at most maxCount left
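
The corresponding retention field and grouping label might look like this (a sketch; the label key is taken from the issue text above, and the apiVersion may differ by release):

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: hourly
spec:
  schedule: "0 * * * *"
  retention:
    maxCount: 24        # keep at most 24 snapshots per source PVC

# Snapshots created by this schedule carry the grouping label, e.g.:
#   snapscheduler.backube/schedule: hourly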

Not able to fetch snapscheduler metrics

Describe the bug
I wanted to scrape the snapscheduler metrics with Prometheus, but the metrics don't seem to be working.

Steps to reproduce
I had a vanilla install of snapscheduler:

$ kubectl describe service/snapscheduler-metrics -n abcns
Name:              snapscheduler-metrics
Namespace:         cloudops
Labels:            name=snapscheduler
Annotations:       <none>
Selector:          name=snapscheduler
Type:              ClusterIP
IP Families:       <none>
IP:                172.20.219.85
IPs:               <none>
Port:              http-metrics  8383/TCP
TargetPort:        8383/TCP
Endpoints:         <none>
Port:              cr-metrics  8686/TCP
TargetPort:        8686/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
$

Using port-forwarding, it's giving me a timeout error.

kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8686
kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8383

Expected behavior
I should be able to see metrics

Actual results
I get the error below:

error: timed out waiting for the condition

Please help
