Support "auto" schedules about k8up HOT 18 CLOSED

k8up-io commented on May 28, 2024
Support "auto" schedules

Comments (18)

Schnitzel commented on May 28, 2024

So to be clear: the backup schedules would still be generated by Lagoon? Or do you also want the smart schedules for the backups? Asking because in that case we'd have to implement the night limit, too.

The reason we want to implement the auto schedules is first and foremost the prune and check schedules: Lagoon currently creates them for each namespace, and since multiple namespaces share an S3 repository/bucket, we have buckets that run 50 prunes/checks a week because there are 50 namespaces. As Lagoon has no way to do this only once per week, we would like to use @weekly in Lagoon for the prune and check schedules and let the operator handle the rest (i.e. the operator should only run it once per week per repository).

While we need this for the prunes and checks, the suggestion to use it for the backup schedules as well came from VSHN; it is not a requirement from the Lagoon side, as Lagoon can already schedule the backups nicely within a specific timeframe.
So in the end:

  • Lagoon will use the auto feature only for the prune and checks
  • Lagoon will still use its internal system to schedule the backups
  • k8up is not expected to have the feature of the night limit right now

If k8up ever adds the night-limit feature in the future, that would be great: Lagoon could then just say @nightly and k8up does the rest (i.e. schedule the jobs only during a specific time of day). But this is really not required right now.

So I think an easy, stable randomization is perfectly fine for now. It can be added to all crontab definitions within a Schedule object. Where and how exactly it is used (prunes, checks, backups) is in the end the k8up user's choice anyway (Lagoon will only use it for prunes and checks).

mhutter commented on May 28, 2024

Kubernetes CronJobs support @hourly etc. as a schedule specification: https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#schedule

Kidswiss commented on May 28, 2024

The cron library used by k8up also supports those keywords. https://pkg.go.dev/github.com/robfig/cron/v3#hdr-Predefined_schedules

But those are more or less just aliases for fixed points in time; @daily, for example, always fires at midnight. The idea behind the auto schedules is to trigger at any time within the given frame. For example, daily should mean that the job runs at least once every day, sometime between 00:00 and 23:59. When exactly is determined by the operator.
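
For illustration, a minimal sketch (using the robfig/cron/v3 API linked above) showing that the predefined keywords resolve to fixed points in time:

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// "@daily" is just an alias for "0 0 * * *": it always resolves to midnight.
	sched, err := cron.ParseStandard("@daily")
	if err != nil {
		panic(err)
	}
	// Every user of "@daily" gets the same next activation time.
	fmt.Println(sched.Next(time.Now()))
}
```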

This feature is intended mostly for jobs that need exclusive access to the backup repository like prune and check. They don't have any impact on the applications and can thus run whenever no backups are running.

I'll amend the description with this information.

ccremer commented on May 28, 2024

I have refined the feature request with some acceptance criteria, which serve as requirements for this feature.
I'm unsure about:

  • randomization of schedules: Is this a good approach? How exactly should the operator determine "optimal" schedules?
  • If randomization is acceptable: Should we keep the @hourly (etc.) in order to use the predefined hard-coded schedules? How to distinguish from randomized ones? If yes, is @random-hourly or similar the better approach?
  • Make the generated schedule persistent, or regenerate it every time K8up starts up and keep the schedules in memory? If persisted, should it be stored in the status field, or should K8up convert the "auto" schedule into the actually-used schedule and update the spec? (I would advise using the status field.)

@tobru and @Schnitzel please review the criteria

Schnitzel commented on May 28, 2024

How exactly should the operator determine "optimal" schedules?

I'm not sure randomizing is the correct way, as randomization could still cause a lot of backups to run at the same time. Since the operator knows how many schedules are planned for a specific timeframe, it could be a bit cleverer than just random: check how many schedules there are overall and spread them across the whole timeframe. An example: if we have 60 schedules, all @hourly, it starts one every minute; if we have 120 schedules with @hourly, it starts two every minute.

If it wants to be even fancier: check how long the schedules took the last time (the assumption being that the jobs take roughly the same time on every run) and use that to calculate the execution time, to make sure we don't have too many jobs running simultaneously. For example, if we have 60 schedules and 20 of them each take 5 minutes while the remaining 40 all take 1 minute, make sure we don't start the 5-minute ones just 1 minute after each other, as we would then have 5 jobs running at the same time.

I'm aware the intelligent and fancy versions are probably complex to implement, so maybe for the first implementation we just randomize; I would be fine with that.
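
A minimal sketch of the simple spreading idea (a hypothetical helper, not existing k8up code): the i-th of n @hourly schedules is placed at minute i*60/n, which yields one start per minute for 60 schedules and two per minute for 120:

```go
// spreadMinute returns the minute-of-hour for the i-th of n @hourly schedules,
// spreading them evenly across the 60-minute frame (i counts from 0).
func spreadMinute(i, n int) int {
	return (i * 60 / n) % 60
}
```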

If randomization is acceptable: Should we keep the @hourly (etc.) in order to use the predefined hard-coded schedules? How to distinguish from randomized ones? If yes, is @random-hourly or similar the better approach?

I think yes, we should probably come up with new naming for the spread-out schedules; something like @hourly-smart would make sense (assuming it's not actually randomized, see the first question)

Make the generated schedule persistent, or regenerate it every time K8up starts up

I don't have strong feelings here; we just need to make sure the operator can handle a restart while backups are happening. I'm not sure we can guarantee that a backup actually happens if we calculate the start point every time the operator starts, as it could calculate a start point for a minute of the hour that has already passed.

one question for this AC:

Given a schedulable K8up object that needs exclusive access to a backup repository
When the same non-standardized predefined cron syntax is specified that targets the same backup repository
Then ignore the duplicated schedule of the same job type

I agree that this applies to the check and prune job types, but for backups couldn't this cause the backup to be skipped? Or do we assume here that a backup does not need exclusive access to the backup repository? Maybe we should then specify which job types need exclusive access?

ccremer commented on May 28, 2024

check how many schedules there are overall and spread them across the whole timeframe. An example: if we have 60 schedules, all @hourly, it starts one every minute; if we have 120 schedules with @hourly, it starts two every minute.

At first this sounds reasonable, but what if new schedules are created? Or a bunch of schedules are created/updated in a batch? Should the operator reschedule every single schedule to balance them out again? I think this could cause a big reconciliation loop and a hit on K8s API performance.
If we have, say, 120 schedules and the 121st gets created, it wouldn't fit anywhere without rebalancing all the others, correct? Assuming schedules are deleted and (re)created over time (CI/CD, Helm, production rollouts, etc.), without a rebalance on every update this might as well look randomized anyway.

Unless we are talking about balancing just the schedules that target the same backend (bucket) and randomizing the schedules that use different targets/backends. Then the effect of rebalancing would be limited, provided there aren't hundreds of schedules using the same global bucket...

check how long the schedules took the last time

I'm not sure we track job durations in the operator itself... but I don't know the code too well yet.

I think yes, we should probably come up with new naming for the spread-out schedules; something like @hourly-smart would make sense (assuming it's not actually randomized, see the first question)

@hourly-smart sounds good; that leaves us the freedom of implementation (randomized vs. some intelligence).

Maybe we should then specify which job types need exclusive access?

In the code, currently (master branch), the following job types are exclusive: Archive, Check, Prune, Restore. We should make that explicit.
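
A sketch of how that could be made explicit (hypothetical helper; the set mirrors the master-branch state mentioned above, which the next comments argue should shrink):

```go
// isExclusive reports whether a job type needs exclusive access to the
// backup repository. The set reflects the master branch at the time of
// writing: Archive, Check, Prune, Restore.
func isExclusive(jobType string) bool {
	switch jobType {
	case "archive", "check", "prune", "restore":
		return true
	}
	return false
}
```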

I'm not sure we can guarantee that a backup actually happens if we calculate the start point every time the operator starts, as it could calculate a start point for a minute of the hour that has already passed.

To me it sounds like we have to persist the resulting schedule from the "smart intelligence", otherwise this exact scenario is going to happen.

Kidswiss commented on May 28, 2024

Unless we are talking about balancing just the schedules that target the same backend (bucket) and randomizing the schedules that use different targets/backends. Then the effect of rebalancing would be limited, provided there aren't hundreds of schedules using the same global bucket...

I feel like the main pain point this feature should solve is the scheduling complications caused by exclusive jobs, and grouping the smart schedules by repository should be a good way to do that. We can treat all schedules with the same endpoint as one "smart schedule set": for example, if there are multiple @hourly-smart schedules for repositories a and b, the exclusive jobs can run against a and b at the same time without any issues. Then we'd only have to spread the jobs more or less smartly within each such set.
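
A sketch of that grouping (hypothetical types, not K8up's actual API): schedules that share a repository end up in the same set, and spreading only happens within a set:

```go
// ScheduleRef is a hypothetical, simplified view of a K8up schedule entry.
type ScheduleRef struct {
	Name       string
	Repository string // e.g. "s3:https://endpoint/bucket"
	Spec       string // e.g. "@hourly-smart"
}

// groupByRepository builds one "smart schedule set" per repository, so that
// exclusive jobs only compete with schedules targeting the same repository.
func groupByRepository(schedules []ScheduleRef) map[string][]ScheduleRef {
	sets := make(map[string][]ScheduleRef)
	for _, s := range schedules {
		sets[s.Repository] = append(sets[s.Repository], s)
	}
	return sets
}
```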

In the code, currently (master branch), the following job types are exclusive: Archive, Check, Prune, Restore. We should make that explicit.

I just checked the upstream code, as Restore and Archive (which is just a series of restores) being defined as exclusive seemed wrong: the exclusivity of the jobs should reflect restic's own exclusivity definitions. Restore is currently not an exclusive operation: https://github.com/restic/restic/blob/master/cmd/restic/cmd_restore.go#L105. But prune is: https://github.com/restic/restic/blob/master/cmd/restic/cmd_prune.go#L145, as is check: https://github.com/restic/restic/blob/master/cmd/restic/cmd_check.go#L189. I'll open a GitHub issue and, if I find the time, a PR to correct that, because having Restore exclusive in the operator will block a lot of backups unnecessarily.

To me it sounds like we have to persist the resulting schedule from the "smart intelligence", otherwise this exact scenario is going to happen.

My suggestion: save the scheduled time in the status field of the Schedule objects, tracked per job type that is scheduled smartly. Of course the field would need to be updated for each schedule affected by changes.
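
A sketch of what that status field could look like (field names are illustrative, not K8up's actual API):

```go
// ScheduleStatus is a hypothetical status sub-resource for a Schedule object.
type ScheduleStatus struct {
	// EffectiveSchedules maps a job type (backup, prune, check, ...) to the
	// concrete cron spec generated from its smart/random schedule, so the
	// generated time survives operator restarts.
	EffectiveSchedules map[string]string `json:"effectiveSchedules,omitempty"`
}
```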

ccremer commented on May 28, 2024

I feel like the main pain point this feature should solve is the scheduling complications caused by exclusive jobs, and grouping the smart schedules by repository should be a good way to do that. We can treat all schedules with the same endpoint as one "smart schedule set": for example, if there are multiple @hourly-smart schedules for repositories a and b, the exclusive jobs can run against a and b at the same time without any issues. Then we'd only have to spread the jobs more or less smartly within each such set.

Works for me. Should we impose a limit for each set? E.g. if there are 60 schedules in the same set (think of the global repository), we would stop trying to rebalance every schedule evenly. This way we can break the reconciliation loop and avoid performance hits.

My suggestion: save the scheduled time in the status field of the Schedule objects, tracked per job type that is scheduled smartly. Of course the field would need to be updated for each schedule affected by changes.

Yes, my idea also. But that is an implementation detail for me :)

Kidswiss commented on May 28, 2024

Works for me. Should we impose a limit for each set? E.g. if there are 60 schedules in the same set (think of the global repository), we would stop trying to rebalance every schedule evenly. This way we can break the reconciliation loop and avoid performance hits.

The question is: what should happen if we reach the limit? We can't just drop them afterwards; they still need to be scheduled somehow.

Yes, my idea also. But that is an implementation detail for me :)

Sure, it's meant as an implementation idea :)

ccremer commented on May 28, 2024

The question is: what should happen if we reach the limit? We can't just drop them afterwards; they still need to be scheduled somehow.

I thought about randomizing new schedules: the 61st schedule would be scheduled randomly within the given timeframe, leaving the 60 others untouched (they remain evenly balanced). Once the count is at or below 60 again, we start rebalancing again.
Implementation-wise, I presume we need to fetch a list of schedules and filter them by the schedule spec anyway. But if the list is longer than some amount x, we can skip the smart rebalancing logic and assign a random schedule instead. This should avoid many follow-up reconciliations.
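
A sketch of that cutoff (hypothetical helper; 60 slots and the even spreading are just this example's numbers):

```go
package schedule // hypothetical package

import "math/rand"

// pickMinute spreads the first maxSmart schedules evenly across the hour and
// randomizes any schedule beyond the cap, so e.g. a 61st schedule never
// forces a rebalance of the existing, evenly balanced 60.
func pickMinute(index, maxSmart int, rng *rand.Rand) int {
	if index >= maxSmart {
		return rng.Intn(60) // over the cap: random minute, others untouched
	}
	return index * 60 / maxSmart // under the cap: evenly spread
}
```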

ccremer commented on May 28, 2024

Hm, the general problem with rebalancing is that there's a risk of skipping a run. E.g. if a schedule was due to run at 2pm and rebalancing at 1.30pm moves it back to 1pm, it won't run at all that day (or within whatever the timeframe is)...
I'm not sure that's acceptable. And without rebalancing, schedules will look randomized over time anyway, at which point we might as well just use random schedules from the beginning.
Or are there any other ideas?

Kidswiss commented on May 28, 2024

I think the rebalancing of already scheduled jobs makes this whole thing much more complex.

Here's another idea:

For any given smart schedule (hourly, daily, monthly, etc.) we define slots; for example, it could be each minute for hourly or every 10 minutes for daily (an implementation detail, maybe make it configurable...). Then we start to fill up the slots, save them in the CRs themselves, and don't touch them again until the schedule type changes (e.g. from daily-smart to weekly-smart, in which case they have to be re-scheduled). If all the slots are used up, we'd start reusing slots for non-exclusive jobs. If the job is exclusive and no slot is free, we'd have to reject it and inform the user (admission webhook?). This way we don't have to worry about the shuffling and potentially skipping existing schedules.

To achieve maximum spread we could fill the slots alternately: the first schedule gets the first slot, the second schedule gets the last slot, the third the next slot from the front, etc. (see the sketch below). This way we leave margins for exclusive jobs, so they won't block too many other jobs.

To clarify: these slots would be per repository, so checks and prunes would be deduplicated anyway.
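
A sketch of the alternating slot assignment (hypothetical helper): filling from both ends leaves margins between early assignments:

```go
// assignSlot fills slots alternately from both ends: the 1st schedule gets
// the first slot, the 2nd the last, the 3rd the second, the 4th the
// second-to-last, and so on (order counts from 0).
func assignSlot(order, slotCount int) int {
	if order%2 == 0 {
		return order / 2
	}
	return slotCount - 1 - order/2
}
```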

Schnitzel commented on May 28, 2024

Mhh, I'm realizing that we're opening a can of worms with the idea of the smart system. Maybe let's initially just use a simple randomized schedule that is built so that it always generates the same minute (in the case of @hourly-smart). Here is what we do at Lagoon, where we have a similar issue with cronjobs (customers say they want a cronjob running every hour, and we need to make sure that not all cronjobs in the cluster run at the same time):

the code is here: https://github.com/amazeeio/lagoon/blob/main/images/kubectl-build-deploy-dind/scripts/convert-crontab.sh
and here is an example of how it is used:
https://github.com/amazeeio/lagoon/blob/f2fe153eab7da3c932f8b66b5c7136115a8c1bdc/images/kubectl-build-deploy-dind/build-deploy-docker-compose.sh#L775
(Yes, we currently use this for generating the k8up schedules as well, but Lagoon has no understanding of the repositories, hence we cannot do any deduplication for the check/prune schedules; that is why I would like to move this logic into the k8up operator.)

In the end, the code does this:

  1. Generate a numeric value from the namespace of the environment (example: echo "drupal-example-master" | cksum | cut -f 1 -d " ", which generates the number 1053081383) and use this as the seed for the modulus.
  2. Take that value modulo the spread of the timeframe (in the case of @hourly this would be 60 minutes): echo $((1053081383 % 60)) generates 23, so for this namespace the cron will always run at minute 23 (see the sketch after this list).
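
A Go sketch of the same idea (CRC-32 stands in for cksum here; the two checksums are not bit-identical, but the property is the same: equal input always yields the same minute):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// stableMinute maps a namespace to a fixed minute-of-hour by hashing it and
// taking the result modulo the spread (60 for @hourly-style schedules).
func stableMinute(namespace string) int {
	return int(crc32.ChecksumIEEE([]byte(namespace)) % 60)
}

func main() {
	fmt.Println(stableMinute("drupal-example-master")) // same minute on every run
}
```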

The code gets even a bit fancier: you can define from which hour to which hour you want the number generated, so you can say M H(3-6) * * 6, which generates a cron schedule with any minute of the hour (0-59) and an hour of 3-6 (so 3, 4, 5, or 6). We did this in order to have crons always run during the night.
I don't think that's currently necessary for us here, as the operator should realize that an exclusive job is already running and not start another one. Example: if the prune were to run at the same time as a backup, it would first finish the prune and then start the backup later.

Kidswiss commented on May 28, 2024

@Schnitzel So to be clear: the backup schedules would still be generated by Lagoon? Or do you also want the smart schedules for the backups? Asking because in that case we'd have to implement the night limit, too.

I don't think that's currently necessary for us here, as the operator should realize that an exclusive job is already running and not start another one. Example: if the prune were to run at the same time as a backup, it would first finish the prune and then start the backup later.

That's correct, the operator prioritizes the exclusive jobs like prune and check and starts them first if a backup is also in the queue.

I agree that a truly smart scheduler will be rather complex, so randomising the schedules as you suggested sounds reasonable.

So to summarize:

  • When a "smart" schedule is defined, we should do a stable randomisation, similar to how Lagoon handles it.
  • If a prune or check is "smartly" scheduled multiple times against the same repository, it should only be registered once, i.e. deduplicated.
  • The operator ensures with its own locking that backups are delayed while an exclusive job is running.

@Schnitzel did I miss something?

ccremer commented on May 28, 2024

During implementation of this feature in #186 I found a special case: monthly. @monthly-random will produce a schedule with a day-of-month between 1 and 27, even though some months have 30 or 31 days. This prevents schedules from being skipped when the current month has fewer days than specified in the day-of-month field. Otherwise, alerts may fire if the rule is something like "fire if the last backup is more than 30 days old".
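
For illustration, the clamping boils down to something like this (hypothetical helper):

```go
package schedule // hypothetical package

import "math/rand"

// randomDayOfMonth picks a day-of-month that exists in every month: 1..27 as
// described above, so a generated schedule is never skipped in a short month.
func randomDayOfMonth(rng *rand.Rand) int {
	return 1 + rng.Intn(27) // uniform in 1..27
}
```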

@Schnitzel is that acceptable?

Schnitzel commented on May 28, 2024

@ccremer Sure, no problem. I don't expect Lagoon will use @monthly anyway :) and even if it does, that's OK.

ccremer commented on May 28, 2024

One PR down. K8up now supports randomized schedules 🎉
Story ain't quite done yet: the deduplication feature is not yet implemented.

ccremer commented on May 28, 2024

As discussed internally, we have decided to move deduplication of the Check and Prune jobs into its own story in #214 and close this one.
