backube / snapscheduler

Scheduled snapshots for Kubernetes persistent volumes

Home Page: https://backube.github.io/snapscheduler/

License: GNU Affero General Public License v3.0

Languages: Dockerfile 2.61%, Shell 13.04%, Go 65.47%, Makefile 15.91%, Ruby 0.49%, Mustache 2.48%
Topics: csi, data-protection, kubernetes, kubernetes-operator, persistent-volume, scheduled-snapshots, storage

snapscheduler's People

Contributors

dependabot[bot], dschunack, henriklundahl, johnstrunk, mergify[bot], mjhuber, renovate-bot, scoof, tgip-work, wjentner

snapscheduler's Issues

Improve e2e

Describe the feature you'd like to have.
E2E should test the functionality a bit better. For example, #80 was caused by not handling a nil template, but this doesn't get tested by the simple smoke test that currently exists.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)
Enhance the e2e to ensure:

  • nil template is ok
  • labels get passed through
  • class name is properly respected
  • selectors work properly

Additional context

Add startingDeadlineSeconds

This would establish a maximum latency after which a scheduled snapshot would be skipped.
This could be helpful in the case where the scheduler was down for an extended period, preventing a sudden burst of snapshots. For this to be useful, there needs to be a default (e.g., 5min)
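
A minimal sketch of how the proposed field might look on a SnapshotSchedule; startingDeadlineSeconds does not exist yet, the other field names follow the project's documented CR, and the apiVersion may differ by release:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: hourly
  namespace: myns
spec:
  schedule: "0 * * * *"        # hourly, on the hour
  # Proposed field -- does not exist yet. Skip a snapshot if it could not be
  # started within 5 minutes of its scheduled time.
  startingDeadlineSeconds: 300
  retention:
    maxCount: 24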

Update CRD to v1

Prior to an official release, the CRD version should be updated to either beta or v1.

Schedule is not able to detect the default snapshotclass that is present.

Describe the bug
I tried creating a schedule without adding a volumesnapshotclass because I had a default snapshot class, but the schedule pod kept crashing.
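
For reference, a schedule of roughly this shape triggers the problem: snapshotTemplate.snapshotClassName is omitted so the cluster's default VolumeSnapshotClass should be used. This is a sketch based on the project's documented CR fields, not the reporter's exact YAML; the apiVersion may differ by release.

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
  namespace: myns
spec:
  schedule: "0 0 * * *"
  retention:
    maxCount: 7
  # No snapshotTemplate.snapshotClassName -- the default VolumeSnapshotClass
  # should be picked up automatically.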

Steps to reproduce

  1. Create schedule without specifying snapshotclass
  2. Check for schedule pods

Expected behavior
Schedule should start creating snapshots.

Actual results
Schedule pod was crashing.

Additional context
Schedule yaml: https://pastebin.com/y0LgPraz
Pod logs: https://pastebin.com/qtk98VEi

Adding timezone-preferred schedule

Describe the feature you'd like to have.
A way to have timezone-based schedules.

What is the value to the end user? (why is it a priority?)
This is a good-to-have feature.

How will we know we have a good solution? (acceptance criteria)

Additional context

Move CI to GH-actions

With GitHub Actions now GA, it would be good to consolidate CI from Travis to GH.

Expose "disabled" field in OLM UI

Describe the feature you'd like to have.
Currently, the .spec.disabled field isn't hinted in the CSV, so the current state of whether the SnapshotSchedule is enabled/disabled isn't easily seen.

What is the value to the end user? (why is it a priority?)
Since it's not shown unless the user looks at the yaml version of the object, it's easy to miss that the schedule is currently disabled, leading to confusion about why snapshots are not being taken.

How will we know we have a good solution? (acceptance criteria)
.spec.disabled should be hinted as urn:alm:descriptor:com.tectonic.ui:checkbox
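
A minimal sketch of the descriptor, assuming the standard ClusterServiceVersion owned-CRD layout:

# In the CSV, under spec.customresourcedefinitions.owned for SnapshotSchedule:
specDescriptors:
  - path: disabled
    displayName: Disabled
    description: Temporarily disable this snapshot schedule
    x-descriptors:
      - urn:alm:descriptor:com.tectonic.ui:checkbox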

Additional context

Add documentation

  • How to install
  • How to use
    • Features & CR fields

Where should docs be hosted? Options: readthedocs, gh-pages, md in docs/ directory

Print version of chart & application when deploying

Describe the feature you'd like to have.
After deploying, the Helm chart should print the version of the chart, application, and image that was deployed.

What is the value to the end user? (why is it a priority?)
While it's fairly easy to look at the Deployment to verify the app/image version, it would be nice to get that upfront.

How will we know we have a good solution? (acceptance criteria)

Additional context

Properly truncate PVC names

When generating snapshot names, the name of the PVC may need to be truncated to stay within naming limits

Integration w/ OLM & operatorhub

  • Generation of OLM artifacts
  • Installation via OLM w/ instructions added to docs
  • Artifacts (text/icons etc.) for OperatorHub
  • Submit to operatorhub

Documentation enhancements

  • Suggestions for labeling: app-centric vs. schedule-centric
  • 404 page should encourage filing an issue
  • Developer docs
    • Building & make targets
    • Prereqs: sdk, go/gvm, golangci-lint
    • Upgrading the SDK
    • Editing the docs
    • Development prioritization
      • Project board
      • Bugs before features
      • Roadmap: cluster-scope operator, cluster-scope schedule, schedules for SCs

Change licensing on APIs

Change the licensing on snapscheduler/pkg/apis/snapscheduler/* to dual AGPL & Apache-2.0 to permit the API types to be imported into other Apache licensed code.

This should permit other code that wants to manipulate the SnapshotSchedule CRs to gain access to the necessary types by directly importing from this repo.

Improve Deployment spec

Since the operator supports leader election, its Deployment should run 2 replicas to ensure upgrades and re-scheduling go smoothly.

The deployment should also provide resource requests for proper scheduling.
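
A sketch of the relevant Deployment fields (selector and labels omitted for brevity; the request values are placeholders, not tuned recommendations):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapscheduler
spec:
  replicas: 2                  # leader election allows >1 replica; smooths upgrades/rescheduling
  template:
    spec:
      containers:
        - name: snapscheduler
          resources:
            requests:
              cpu: 10m         # placeholder values
              memory: 100Mi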

Snapshot metrics

Describe the feature you'd like to have.
Currently, snapscheduler doesn't provide any metrics related to the snapshots attempted/created. It would be good to provide some stats that could be monitored and alerted on.

What is the value to the end user? (why is it a priority?)
Users that depend on having snapshots to protect their data should have a way to monitor whether those snapshots are being successfully created

How will we know we have a good solution? (acceptance criteria)

Additional context
cc: @prasanjit-enginprogam

Misc doc bugs

Describe the bug
Below are several doc bugs/suggestions from @ShyamsundarR

  • In https://backube.github.io/snapscheduler/install.html:
    • Spell correct "teh": "This page provides instructions to deploy the snapscheduler operator. The operator is cluster-scoped, but its resources are namespaced. This means, a single instance of teh "
  • https://backube.github.io/snapscheduler/usage.html:
    • When I click "go back to installation", it asks me to save a file (Firefox)
  • Clarify the semantics of expiration. State specifically that when both time & count are specified, expiration is the soonest/most restrictive.

Failed to watch *v1.PersistentVolumeClaim

Describe the bug
Seen in the operator log:

E1211 18:34:58.078937       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:34:59.081894       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)
E1211 18:35:00.084973       1 reflector.go:283] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:98: Failed to watch *v1.PersistentVolumeClaim: unknown (get persistentvolumeclaims)

Steps to reproduce

  • Create a test namespace
  • Create a PVC that is WaitForFirstConsumer
  • Create a snapshotschedule that selects the above PVC
  • Wait for the schedule to fire

Expected behavior
Those error messages shouldn't be present.

Actual results
The above error messages are seen in the operator pod logs.

Additional context

  • This doesn't seem to cause any ill effects other than the error messages

This is seen on OpenShift 4.2 and 4.3-ci builds with snapscheduler v1.0.0, installed via Chart v1.0.1

Update VolumeSnapshot to v1beta1

Describe the feature you'd like to have.
The code should be updated to use VolumeSnapshot version v1beta1 instead of the current v1alpha1.

What is the value to the end user? (why is it a priority?)
Kubernetes v1.17 switched from v1alpha1 to v1beta1, so in order to support kube versions going forward, the snapshot API version needs to be updated.

Additional context
It appears that the alpha API was removed entirely from the snapshotter, meaning there is no grace period for this change... Unless there's a webhook someplace to make the conversion.

See: https://github.com/kubernetes-csi/external-snapshotter
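
For reference, the v1beta1 object shape; this is a sketch of the upstream API (names are examples), not snapscheduler output:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: data-201912111800
  namespace: myns
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass   # example class name
  source:
    persistentVolumeClaimName: data                 # v1beta1 references the source PVC by name here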

Move CRD to apiextensions.k8s.io/v1

Describe the feature you'd like to have.
Kube 1.16 brought apiextensions to v1, and it appears v1beta1 will no longer be available in 1.19.
The SnapScheduler operator needs to be deployed w/ the proper version depending on the cluster version.
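
The main structural change in apiextensions.k8s.io/v1 is that each served version must carry a structural OpenAPI schema; a minimal sketch (the real schema and version names are omitted/placeholder here):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: snapshotschedules.snapscheduler.backube
spec:
  group: snapscheduler.backube
  names:
    kind: SnapshotSchedule
    plural: snapshotschedules
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:                    # required per version in v1
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              x-kubernetes-preserve-unknown-fields: true   # placeholder; real schema omitted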

What is the value to the end user? (why is it a priority?)
It is desirable for users to continue to have access to SnapScheduler as Kubernetes moves to 1.19+.

How will we know we have a good solution? (acceptance criteria)
Deployment of the operator is currently supported both via Helm and OLM. This should continue to be the case.

Additional context
Currently, Helm deployment is supported back to 1.13, and OLM is back to 1.17. Ideally, we can preserve that even with this switch.

Allow snapshot names to be rotated with most recent being named "_latest"

Describe the feature you'd like to have.
We are using a scheduler to create regular (hourly) snapshots of a source volume. We have a stateful set that needs to create a volume (using the "volumeClaimTemplate" option). It would be very helpful if we could have the most recently created snapshot be named '-latest' instead of the timestamp.

What is the value to the end user? (why is it a priority?)
When using snapshots as part of a StatefulSet VolumeClaimTemplate, Kubernetes fails if you try to update the DataSource part of the VolumeClaim template. Say, for example, to use a more recent snapshot. The workaround is a bit dangerous and error-prone: delete the StatefulSet with --cascade=false and then re-create it with the newer snapshot ID.

If we knew we could always expect our optimal snapshot to be suffixed with '-latest' (or something else obvious), we wouldn't need to change the datasource in the StatefulSet's VolumeClaimTemplate.

How will we know we have a good solution? (acceptance criteria)
When we can increase the replica count of a StatefulSet and have it always use the most recent snapshot as its datasource for the PVC.

Additional context
Not sure if this is a bit of an edge case, but it would be super-useful for things like blockchains. Each node has a huge ledger volume. The closer to "now" that we can get when provisioning the volumes for additional pods, the faster the pods can be up and ready.

Add validation via Admission Webhook

Describe the feature you'd like to have.
Add a Validating Webhook to improve the validity checking of the SnapshotSchedule CR.

What is the value to the end user? (why is it a priority?)
The current openapi validation handles some validation, but not all field validation can be adequately expressed using openapi. Errors currently missed by openapi validation will only be discovered after the operator tries to reconcile, with error reporting limited to the .status field of the affected CR.
By adding a validating webhook, more (all?) errors can be found when the object is first created, so the create/update can be immediately rejected, providing more timely feedback to the user. This will also improve error reporting capabilities via the web console.

How will we know we have a good solution? (acceptance criteria)

  • .spec.schedule should be validated by parsing it to ensure it's a valid cronspec (i.e., all values are in range)
  • .spec.retention.expires should properly parse as a time.Duration

Additional context
This should be viewed as an enhancement on top of the openapi validation, not a substitute. The openapi validation should still be the first line of defense since those rules are exposed in the CRD, whereas the webhook's rules are opaque to the user.

Starting point (from this tutorial at KubeConNA 2019):
https://github.com/jpbetz/KoT/tree/master/admission
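
A rough sketch of the registration side; the webhook name, service, and path are hypothetical, and the validation logic itself would live in the operator:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: snapscheduler-validation            # hypothetical
webhooks:
  - name: validate.snapscheduler.backube    # hypothetical
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    rules:
      - apiGroups: ["snapscheduler.backube"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["snapshotschedules"]
    clientConfig:
      service:
        name: snapscheduler-webhook          # hypothetical Service
        namespace: snapscheduler
        path: /validate-snapshotschedule     # hypothetical path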

Document how to set snapshot quotas

Describe the feature you'd like to have.
Document how to use ResourceQuota to limit the number of snapshots that can be created.
https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota
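
An object-count quota of roughly this form should work; the resource name follows the count/<resource>.<group> convention from the linked docs:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: snapshot-quota
  namespace: myns
spec:
  hard:
    # Limit the number of VolumeSnapshot objects in this namespace
    count/volumesnapshots.snapshot.storage.k8s.io: "50"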

What is the value to the end user? (why is it a priority?)
It's easy to create a snapshotschedule that results in many snapshots being created, potentially exhausting the storage of the system or incurring high cloud provider costs.

How will we know we have a good solution? (acceptance criteria)

  • Admins will understand that it is best-practice to limit the number of snaps
  • There will be instructions how to do so
  • There will be at least 1 example

Additional context

Add node selector to run operator only on linux hosts

Describe the feature you'd like to have.
The Helm chart should limit the nodes for the snapscheduler operator to amd64/linux hosts (as opposed to Windows or other architectures).

What is the value to the end user? (why is it a priority?)
The operator will fail to start on anything other than amd64/linux.

How will we know we have a good solution? (acceptance criteria)

Additional context
The arch and os labels: kubernetes.io/arch and kubernetes.io/os first appeared in 1.14. Prior to that, they were beta.kubernetes.io/*. As long as we continue to support 1.13, we need to ensure it continues to work w/ the beta label also.
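
On 1.14+ clusters, the chart would render something like the fragment below; pre-1.14 clusters would need the beta.kubernetes.io/* equivalents:

# Pod spec fragment
nodeSelector:
  kubernetes.io/os: linux      # beta.kubernetes.io/os before 1.14
  kubernetes.io/arch: amd64    # beta.kubernetes.io/arch before 1.14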

Protect against accidental namespace deletion

Describe the feature you'd like to have.
The lifetime of the snapshot should be independent of the namespace of the PVC.

What is the value to the end user? (why is it a priority?)
Snapshots that reside in the same namespace as the primary data don't protect against accidental namespace deletion.

How will we know we have a good solution? (acceptance criteria)

  • If a namespace containing the primary data is deleted, the scheduled snapshots should survive.
  • All the other scheduling features should continue to work as intended (e.g., expiring old snaps)

Additional context

  • There seem to be (at least) two approaches:
    • Transfer the VolumeSnapshot object (by re-binding the VSC) to a different namespace
    • Change the deletion policy of the snapshot content such that it is retained even if the VS is deleted (see the sketch below)
  • The work on volume namespace transfer may be useful here kubernetes/enhancements#1555
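
The second approach could build on the snapshot API's deletionPolicy; a sketch (class name and driver are examples):

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass-retain       # example name
driver: hostpath.csi.k8s.io        # example driver
deletionPolicy: Retain             # keep the VolumeSnapshotContent even if the
                                   # VolumeSnapshot (or its namespace) is deleted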

Test against kube 1.18 in CI

Describe the feature you'd like to have.
Kube 1.18 has been released and Kind images are available. Update CI and gating tests to include 1.18.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

Housekeeping for 1.0 release

General housekeeping items that need to be done prior to a 1.0 release:

  • Add a CODE_OF_CONDUCT.md (linked from README.md)
  • Add CONTRIBUTING.md
  • Add SECURITY.md
  • Add issue templates (bug, enhancement)
  • Add PR template
  • Link to CoC, Contrib, Security from README.md
  • Add CHANGELOG.md

Remove state machine from `.status`

The internal state machine for the scheduler is exposed in the .status portion of the CR. This should be removed since it's an internal implementation detail and not meant to be part of the API.

Fix CI

Describe the bug
A couple of days ago, changes were made to the csi-driver-host-path repo that have broken the CI testing of SnapScheduler. The hostpath driver install is failing while setting up the CI environment.

Steps to reproduce
Run ./hack/setup-kind-cluster.sh

Expected behavior

Actual results

Additional context

Allow overriding snapshotclass per PVC

It should be possible to override the snapshotclass on a per-PVC basis by setting an annotation on the PVC.

In decreasing priority:

  1. PVC annotation
  2. class from the schedule
  3. cluster default
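
A sketch of what the PVC side might look like; the annotation key is hypothetical since the feature doesn't exist yet:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: myns
  annotations:
    # Hypothetical annotation key for the proposed per-PVC override
    snapscheduler.backube/snapshot-class: my-other-snapclass
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi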

Error: failed to start container "snapscheduler": Error response from daemon: OCI runtime create failed

I'm trying to run this in EKS with Kubernetes version 1.17, but I get the following error:

kubectl describe po/snapscheduler-864d84f9-gnzwb -n snapscheduler

Events:
  Type     Reason     Age                 From                                                Message
  ----     ------     ----                ----                                                -------
  Normal   Scheduled  <unknown>           default-scheduler                                   Successfully assigned snapscheduler/snapscheduler-864d84f9-gnzwb to ip-10-27-7-136.eu-west-2.compute.internal
  Normal   Pulled     23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Container image "quay.io/backube/snapscheduler:1.1.1" already present on machine
  Normal   Created    23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Created container snapscheduler
  Warning  Failed     23m (x5 over 25m)   kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Error: failed to start container "snapscheduler": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/manager\": stat /manager: no such file or directory": unknown
  Warning  BackOff    7s (x112 over 24m)  kubelet, ip-10-27-7-136.eu-west-2.compute.internal  Back-off restarting failed container

Using the following HelmRelease

---
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: snapscheduler
  namespace: fluxcd
  annotations:
    flux.weave.works/automated: "false"
spec:
  targetNamespace: snapscheduler
  helmVersion: v3
  releaseName: snapscheduler
  chart:
    repository: ***
    name: snapscheduler
    version: 1.2.1
  values:
    replicaCount: 2

    image:
      repository: quay.io/backube/snapscheduler
      tagOverride: ""
      pullPolicy: IfNotPresent

    imagePullSecrets: []
    nameOverride: ""
    fullnameOverride: ""

    serviceAccount:
      create: true

    podSecurityContext: {}
    securityContext: {}
    resources:
      requests:
        cpu: 10m
        memory: 100Mi

    nodeSelector:
    tolerations: []
    affinity: {}

Expire snapshots based on time

During periodic (Idle state) reconcile, delete snapshots whose metadata.creationTimestamp is older than time.Now() - spec.retention.expires
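
For reference, the retention field this operates on; the duration string is parsed as a Go duration (e.g., "168h" for one week), and the apiVersion may differ by release:

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: daily
spec:
  schedule: "0 0 * * *"
  retention:
    expires: "168h"     # snapshots older than one week get deleted during reconcile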

Fix e2e flakes

Describe the bug
The e2e tests flake frequently in the cleanup stage... "timeout waiting for condition".
There had been some sort of race condition regarding finalizers in the snapshotter; it could be that, or it could just be that the timeout is too short.

Steps to reproduce

  • Run the e2e
  • They fail randomly (kube version doesn't seem to matter much).

Expected behavior

Actual results

Additional context
Examples:

Expire snapshots based on count

There should be a maximum of spec.retention.maxCount snapshots. During idle reconcile:

  • Enumerate the snaps w/ label snapscheduler.backube/schedule: <schedule_name>
  • Group by spec.source.name (source PVC name)
  • Delete the oldest according to metadata.creationTimestamp until there are at most maxCount left
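
The corresponding retention field and grouping label might look like this (a sketch; the label key is taken from the issue text above, and the apiVersion may differ by release):

apiVersion: snapscheduler.backube/v1
kind: SnapshotSchedule
metadata:
  name: hourly
spec:
  schedule: "0 * * * *"
  retention:
    maxCount: 24        # keep at most 24 snapshots per source PVC

# Snapshots created by this schedule carry the grouping label, e.g.:
#   snapscheduler.backube/schedule: hourly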

Not able to fetch snapscheduler metrics

Describe the bug
I wanted to scrape the snapscheduler metrics with Prometheus, but the metrics don't seem to be working.

Steps to reproduce
I had a vanilla install of snapscheduler:

$ kubectl describe service/snapscheduler-metrics -n abcns
Name:              snapscheduler-metrics
Namespace:         cloudops
Labels:            name=snapscheduler
Annotations:       <none>
Selector:          name=snapscheduler
Type:              ClusterIP
IP Families:       <none>
IP:                172.20.219.85
IPs:               <none>
Port:              http-metrics  8383/TCP
TargetPort:        8383/TCP
Endpoints:         <none>
Port:              cr-metrics  8686/TCP
TargetPort:        8686/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>
$

Using port-forwarding, it's giving me a timeout error.

kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8686
kubectl --kubeconfig ABC.config port-forward svc/snapscheduler-metrics -n cloudops 9100:8383

Expected behavior
I should be able to see metrics

Actual results
I get the error below:

error: timed out waiting for the condition

Please help
