fiaas / fiaas-deploy-daemon

fiaas-deploy-daemon is the core component of the FIAAS platform

Home Page: https://fiaas.github.io/

License: Apache License 2.0

Python 96.26% HTML 1.11% Shell 1.58% Dockerfile 0.20% Smarty 0.85%
fiaas kubernetes kubernetes-operator python

fiaas-deploy-daemon's People

Contributors

adriengentil, amanya, biochimia, birgirst, bjornakr, blockjesper, cdemonchy, fiunchinho, gardleopard, gerardsegarra, gregjones, henrik242, herodes1991, imorato, j-boivie, j18e, mortenlj, ninaangelvik, oyvindio, pergon, perj, portega-adv, rarruda, sa-mao, simenb, soellman, srvaroa, stigkj, tg90nor, xavileon


fiaas-deploy-daemon's Issues

Reduce memory requested for datadog sidecar

At the moment, the datadog integration requests 2GB of RAM for the pod. From metrics in the cluster, we can see that the maximum actually used is around 220MB, so we'd like to reduce this, as the savings across all the pods being run would be quite large.

We could either reduce the hardcoded value, perhaps initially to 1GB and later to 500MB, or we could introduce deploy-daemon configuration for this. Any preference?

Change datadog api key env var to "DD_API_KEY"

Consider changing the environment variable in datadog.py that the Datadog API key is stored in, to enable compatibility with the Datadog v6 agent. According to the v5 agent docs, the v5 agent will read the API key from either API_KEY or DD_API_KEY, whereas the v6 agent only accepts DD_API_KEY.

Support NodeSelectors

We are using fiaas to deploy our API and it does the job, but now we would also like to deploy different types of applications, like streaming, ML or batch applications, that require pods to be scheduled on specific nodes. Taints are not sufficient, as they do not restrict API pods from being scheduled on those nodes.
Therefore we would like to be able to add nodeSelectors to our pods.

The idea would be to add a field like this to the spec file:

  nodeSelector:
    disktype: ssd
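
If this were added, the field would presumably be passed straight through to the pod template of the generated Deployment, roughly like this (a sketch of the intended outcome, not existing behaviour):

# Sketch: the nodeSelector from the fiaas spec copied onto the generated pod template
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd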

We would really like to have this feature, wdyt?

Access Control Mechanism

Currently, applications that expose HTTP(S) endpoints are open to the internet by default. It would be good to have a way to limit their exposure to a particular pre-determined scope expressed as part of the application spec. The scopes could be pre-defined sets of hostnames and/or IP ranges, arranged as an IP whitelist on the ingress resource or used to provision multiple ingress resources with different access control.
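
For illustration only (the issue doesn't prescribe a mechanism), an IP-range scope could end up rendered as a whitelist annotation on the generated Ingress, here shown with the nginx ingress controller's annotation:

# Sketch: a pre-defined scope rendered as an IP whitelist on the Ingress
# (the annotation shown is specific to the nginx ingress controller)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8,192.168.0.0/16"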

Wrong deployment status for 1 replica and singleton=false

Given a FIAAS application with this configuration:

...
replicas:
  singleton: false
  minimum: 1
  maximum: 1
...

Running successfully in the cluster:

$ kubectl get deployment
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
example   1/1     1            1           4m30s
$ kubectl get pods
NAME                       READY   STATUS    RESTARTS   AGE
example-68b794f65f-tnqfv   1/1     Running   0          4m30s
$ kubectl get application-statuses -l "app=example" -oyaml
  result: SUCCESS

Now, I release a new version of my application but the new replica fails to start:

$ kubectl get deployment
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
example   1/1     1            1           11m
$ kubectl get pods
NAME                       READY   STATUS             RESTARTS   AGE
example-68b794f65f-tnqfv   1/1     Running            0          11m
example-fcb8db6b9-9t255    0/1     CrashLoopBackOff   4          2m45s
$ kubectl get application-statuses -l "app=example" -oyaml
  result: SUCCESS

I would expect the fiaas deployment daemon to flag the deployment as a failure instead of a success.

pip install -r requirements.txt fails with dependency conflict on six package

Can probably be fixed by requiring six == 1.14.0

ERROR: Cannot install fiaas-deploy-daemon[dev] 1.0-SNAPSHOT, pinject 0.14.1, docker 4.0.2, -r requirements.txt (line 3), k8s 0.15.0, flake8-print 3.1.0 and tox 3.14.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    fiaas-deploy-daemon[dev] 1.0-SNAPSHOT depends on six==1.12.0
    pinject 0.14.1 depends on six>=1.7.3
    docker 4.0.2 depends on six>=1.4.0
    fiaas-deploy-daemon 1.0-SNAPSHOT depends on six==1.12.0
    k8s 0.15.0 depends on six==1.12.0
    flake8-print 3.1.0 depends on six
    tox 3.14.5 depends on six<2 and >=1.14.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Expire ApplicationStatus after a given time-to-live

Currently FIAAS keeps the last 10 ApplicationStatus resources for each Application. When running in a cluster with many applications there quickly ends up being a lot (thousands) of resources. I think it is mostly the last 2 or 3 statuses that are relevant in practice. Statuses are less relevant the older they are.
Maybe it makes sense to implement a more advanced cleanup policy where FIAAS deletes ApplicationStatus resources that are older than some duration, and only keeps maybe the last 2 or 3 for each application?

Avoid triggering redeploy of every application on fiaas-deploy-daemon startup

When fiaas-deploy-daemon starts up, the CRD watcher will trigger updates in the managed (deployment, service, etc.) resources corresponding to each application resource. Usually this will be a no-op since the resource will be "updated" with the same state that it already has, and the control loops within Kubernetes won't actually do anything. However, if a new version of fiaas-deploy-daemon with a slight change in behavior (e.g. if a label is added on a resource) is deployed, it might trigger an actual update of all the managed resources.
This means that we can get into situations where a fiaas-deploy-daemon upgrade triggers a rolling upgrade of all applications at roughly the same time, which can cause several issues that together may affect service availability.

In our setup it would be ideal to replicate the behavior that using the pipeline consumer has, where deployments are only triggered by the deployment orchestrating system, and not have fiaas-deploy-daemon potentially trigger redeploy of already deployed applications when it starts up.

Some solutions for this can be to:

  1. Skip deploying if there already exists an ApplicationStatus whose fiaas/deployment_id matches that of the Application resource (see the sketch after this list). fiaas/deployment_id should be unique for each externally triggered deployment. This should not be too difficult to implement, and keeps the ability to trigger a full deploy of every application to propagate changes in behavior, by simply removing the latest ApplicationStatus resource for each Application before updating fiaas-deploy-daemon.
  2. Use the watch bookmark API (https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks) to try to only read Application resource updates that we haven't seen before. This isn't a huge change, but it requires that state (the most recent bookmark) is kept somewhere. That complicates things a bit. Additionally this feature is only GA in Kubernetes 1.17, which is more recent than the version we're on, so it won't solve the issue for us right now.
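
A minimal sketch of option 1; the resources are modelled as plain dicts here for illustration, whereas the real code would use fiaas-deploy-daemon's k8s models:

# Sketch of option 1: skip redeploy when the most recent ApplicationStatus
# already covers this deployment_id.
def should_deploy(application, latest_status):
    """Return True if this Application event represents a deployment not handled yet."""
    deployment_id = application["metadata"]["labels"].get("fiaas/deployment_id")
    if latest_status is None:
        return True  # no previous status recorded for this application
    seen_id = latest_status["metadata"]["labels"].get("fiaas/deployment_id")
    return seen_id != deployment_id


# Example: a status carrying the same deployment_id means the startup resync can skip it
app = {"metadata": {"labels": {"fiaas/deployment_id": "679a8b83"}}}
status = {"metadata": {"labels": {"fiaas/deployment_id": "679a8b83"}}}
assert not should_deploy(app, status)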

Clean up environment variables passed to apps

When fiaas deploys an application, a number of environment variables are set, from a number of sources. Some of these have been "marked" as deprecated (by virtue of a comment in the source code) for ages, some are somewhat weird FINN legacy stuff and some are useful but badly named.

I'd like to do a clean up, but it probably requires a detailed plan for how to go about it, so that users have time to update their applications.

In addition to the ones FDD sets directly, we have the environment variables that come from global config. Because we at some point thought it would be best to add a prefix to all environment variables supplied by fiaas, we added copies of all of them with the FIAAS_ prefix, with the intention to remove the non-prefixed ones later. We never followed through, and I think that's a good thing.

I'd like to clean this up, following these principles:

  • Environment variables from global config are kept as is. It's up to the cluster operator to ensure that they are unique enough.
  • Environment variables set by FDD for internal reasons, should have the FIAAS_ prefix.
  • Environment variables that are a legacy from FINN should be removed or replaced where needed (see table below)

This leads to me wanting to make these changes (this might also solve #12, since we remove the propagated variables completely):

| Name | Value | What to do with it |
| --- | --- | --- |
| ARTIFACT_NAME | Name of application | Rename to FIAAS_APPLICATION_NAME. (For FINN this requires some work, since ARTIFACT_NAME is used "everywhere" in common libraries.) |
| CONSTRETTO_TAGS | Legacy FINN support | Remove (for FINN: move to global config) |
| FINN_ENV | Same as FIAAS_ENVIRONMENT, legacy from FINN. Considered deprecated for ages. | Remove |
| IMAGE | Name of deployed image. Could be useful for applications. | Rename to FIAAS_IMAGE |
| LOG_FORMAT | Logging format (plain/json). Propagated from FDD's own config. Legacy from FINN. | Remove (for FINN: move to global config) |
| LOG_STDOUT | Logging destination (true: stdout, false: file). Hardcoded to true. Legacy from FINN. | Remove (for FINN: move to global config) |
| VERSION | Version of deployed image (i.e. tag of container). Might be useful, but we already have cases of name collision. | Rename to FIAAS_IMAGE_VERSION |
| FIAAS_ENVIRONMENT | Propagated from FDD's config. Should describe the environment. | Remove. If cluster operators need it, they can put it in global config. |
| FIAAS_INFRASTRUCTURE | Propagated from FDD's config. Should describe the infrastructure of the current cluster. Legacy from FINN and marked as deprecated. | Remove |
| FIAAS_LIMITS_CPU | Comes from resource limits | Keep as is |
| FIAAS_LIMITS_MEMORY | Comes from resource limits | Keep as is |
| FIAAS_REQUESTS_CPU | Comes from resource requests | Keep as is |
| FIAAS_REQUESTS_MEMORY | Comes from resource requests | Keep as is |
| FIAAS_NAMESPACE | Namespace the app is in. The namespace is available as a file, but not everyone knows this, so having it as an environment variable is useful. | Keep as is |
| FIAAS_POD_NAME | Name of the pod. This value is available as the hostname, but it's easier to just take it from the environment. | Keep as is |

Support for multiple Ingresses

Short version: We have a growing number of users who would like multiple Ingresses for their fiaas apps, because they want different annotations on them that control visibility/access. Adding 'annotations' as a property of the items in the 'ingresses' list allows us to support this while keeping things simple and compatible with the current behaviour, by using the 'annotations' field as the distinguisher for when to group and when to separate hosts/paths into an Ingress object.

ingress:
  - host: myservice-internal.myns-pro.cluster.com
    annotations:
      kubernetes.io/ingress.class: internal
      nginx.ingress.kubernetes.io/whitelist-source-range: "10.x.y.z/32"
  - host: myservice-public.myns-pro.cluster.com
    annotations:
      kubernetes.io/ingress.class: public
      nginx.ingress.kubernetes.io/whitelist-source-range: "215.a.b.c/32"

I've written up a longer version that explains the behaviour in a little more detail, with examples showing the impact (or lack of impact in the cases for backwards-compatibility): https://docs.google.com/document/d/12-hrIgxMEYJDw8qMeUzDar8xa_iV4Tc2CqfpVRo9WoE/edit?usp=sharing
(It's public for commenting/suggesting, let me know what emails to add if you'd like to edit)

Include failure events from Deployment/ReplicaSet/Pod in case of timeout waiting to become 'ready'

We've seen users having problems debugging deploy errors in a couple of cases:

  • errors with liveness/readiness checks
  • errors deploying with resource requests/limits higher than cluster allows

My idea is to capture these events when fiaas decides the deployment has failed due to a timeout, and attach them to the ApplicationStatus object, in a similar way to what is done for logs, so they can be displayed in the mast UI.

Deploy daemon in frozen state

We've had a deploy daemon that somehow ended up in a state where it appeared frozen, not triggering new deployments. Checking kubectl logs for the fdd pod just showed the last successful deployment, after which the pod seemingly stopped logging altogether.

[2019-08-02 14:20:16,637|   INFO] Saving result SUCCESS for my_service deployment_id=679a8b83-6476-4eef-9019-9fd51c56b4dc [fiaas_deploy_daemon.crd.status|Scheduler]

After restarting the fdd pod deployments resumed normally.

Upgrade to python 3

Work is underway to remove integration with Kafka that uses python 2.7 libraries.
Once that is complete fiaas-deploy-daemon can be upgraded to python 3.

Create a Changelog when selecting a new stable version

When we mark a new version as stable, we should generate a Changelog (commits to master are probably good enough, similar to the one generated by publish) between this version and the previous stable version.

This changelog should probably be made visible in Skipper, but the first iteration could be to just add it as a file next to the metadata json-file.

Support separate service accounts per application

FIAAS only supports using the default service account in Deployment objects to map the k8s API token inside the pod (if instructed to do so).

AWS EKS uses service accounts to inject IAM roles and provide fine-grained authorization to the AWS API, similar functionality to what kiam provides (but using service accounts instead of annotations on pods).

To support this use case, FIAAS should allow binding different service accounts per application. This could work as follows:

  • FDD will have a new setting (separateServiceAccountPerApplication) that will create a service account with the application name (if it doesn't exist), following the FIAAS naming convention.
  • Deployment objects will bind this new service account instead of the default account.

If the cluster operator wants to provide IAM roles for a given application, the operator may configure the app-specific service account accordingly, following https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
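
For illustration, with such a setting enabled the generated Deployment could reference the per-application service account roughly like this (a sketch of the proposal for an app named example, not existing behaviour):

# Sketch: Deployment generated with separateServiceAccountPerApplication enabled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  template:
    spec:
      serviceAccountName: example   # instead of "default"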

Thoughts?

Propagation of config flags as environment variables can lead to config changes not taking effect on redeploy

When fiaas-deploy-daemon deploys itself, it sets some environment variables based on the internal state of its config object (see below). Since the config object is based on data from environment variables and the cluster_config.yaml file being merged (in that order), this can lead to a configuration change in the cluster_config.yaml file not being picked up since a value is already set in the corresponding environment variable.

def _build_fiaas_env(config):
    env = {
        "FIAAS_INFRASTRUCTURE": config.infrastructure,  # DEPRECATED. Remove in the future.
        "LOG_STDOUT": "true",
        "LOG_FORMAT": config.log_format,
        "CONSTRETTO_TAGS": "kubernetes",  # DEPRECATED. Remove in the future.
    }
    if config.environment:
        env.update({
            "FINN_ENV": config.environment,  # DEPRECATED. Remove in the future.
            "FIAAS_ENVIRONMENT": config.environment,
            "CONSTRETTO_TAGS": ",".join(("kubernetes-{}".format(config.environment), "kubernetes", config.environment)),
        })
    return env

The typical way this occurs is when bootstrapping fiaas-deploy-daemon via Skipper. When this is done, the fiaas-deploy-daemon configmap, which usually provides the cluster_config.yaml file, is not mounted in the Pod. This means that the default values for flags like FIAAS_ENVIRONMENT and LOG_FORMAT are used. To change these later, just changing the configmap is not enough; to make the config change actually take effect you have to change the configmap, then remove the corresponding environment variables from the fiaas-deploy-daemon deployment resource, which triggers a redeploy that then applies the changes.

This behavior might be improved by not propagating the state from the internal config object at all when fiaas-deploy-daemon is deploying itself.

Refactor the Lifecycle signals

We use blinker to handle the various lifecycle events in fiaas-deploy-daemon. These are deploy initiated, started, failed and success. Initially, we used to send the AppSpec to these signal handlers, but changed to individual parameters when we introduced the initiated signal. This is because at the time of initiated, we don't have everything we need to create an AppSpec.

As time has passed we have added more data, and the list of parameters to send on each signal is now getting long and complicated. We are also sending the exact same set of parameters on each signal, and all signal handlers effectively handle each signal in the same way, with only a string parameter indicating which signal was actually received.

I suggest to refactor the signals as follows:

  • Collapse all signals to a single signal, with an additional parameter state. The state parameter indicates which of the four states just happened.
  • Create a namedtuple in the lifecycle module called Subject which is meant to describe the subject of the signal. This object should contain all the current parameters sent to the signals
  • Create the Subject when sending the initiated signal
  • Attach the Subject to AppSpec when creating it, so that subsequent signals can use the already created Subject

This will simplify quite a bit of code. It will move the logic for selecting data to send to the signals to one place, and the signal handling will be simpler.
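
A minimal sketch of the proposed shape, using blinker (which fiaas-deploy-daemon already uses); the Subject fields and state names here are invented for illustration, the real ones would be the parameters and states described above:

from collections import namedtuple

from blinker import signal

# Hypothetical field list; the real Subject would carry the parameters currently
# sent to the individual signals.
Subject = namedtuple("Subject", ["app_name", "namespace", "deployment_id"])

# A single lifecycle signal; the state parameter says which of the four states happened.
DEPLOY_STATUS_CHANGED = signal("deploy_status_changed")


def initiate(subject):
    DEPLOY_STATUS_CHANGED.send(None, state="initiate", subject=subject)


def success(subject):
    DEPLOY_STATUS_CHANGED.send(None, state="success", subject=subject)


# Handlers no longer need to know which of four signals fired
@DEPLOY_STATUS_CHANGED.connect
def handle_lifecycle(sender, state, subject):
    print("{} for {} ({})".format(state, subject.app_name, subject.deployment_id))


initiate(Subject("example", "default", "d-1"))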

"Extensions" for fiaas applications

We're receiving a number of requests from users who want fiaas to work in different ways for their use cases, most recently (and most vocally) for some kind of "sidecar support" for their apps. We haven't been able to come up with a version of this feature that meets what seems to be a very disparate set of requirements without exposing the full Deployment object as an API. So we've been looking at other ways to let us, and our users, experiment with this feature and others, perhaps arriving at something that could be incorporated into the core project, or perhaps staying as an 'extension' living alongside the core.

So we've come up with a proposal for allowing the behaviour of fiaas, specifically the creation of the k8s objects, to be pluggable: between fiaas creating the object based on its internal logic and the Application configuration, and it being saved to the k8s API, we'd like to make an HTTP request to an external service, passing it the object that's about to be created along with the Application configuration.

The request would be to ${extensionServiceUrl}/fiaas/deploy/${Kind} (e.g. /fiaas/deploy/Ingress), and the body would contain serialized versions of the particular k8s object (e.g. the Ingress) and the AppSpec.

The service will be able to make some modifications to that object before returning it, and the modified version is what fiaas will save to k8s. We imagine that the 'extensions' part of the current spec, perhaps necessitating the 'reservation' of some key there, would be where the logic for the modification comes from - it would then be up to the fiaas-admins for a cluster to explain/document this extra functionality for their users.

The changes to fiaas will be the introduction of a URL for this 'extension service', which will be optional, along with a new component that, given that base-URL, will be able to make HTTP requests. This component will then be called for each of the objects created (Deployment, Ingress, Service, HPA), and it will respond with the (potentially) modified object.

To allow for an implementation of the extension service to care only about a specific type, and allow some future extensibility, the service returning 404 will be ignored/skipped. Other errors will be logged and trigger failure, similar to validation problems inside fiaas, for example.
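
A minimal sketch of the client side of this hook, using the requests library; the payload key names and URL are placeholders, since the actual API contract is, as noted below, still being worked out:

import requests

EXTENSION_SERVICE_URL = "http://extension-service"  # would come from fiaas-deploy-daemon config


def apply_extension(kind, k8s_object, app_spec):
    """Offer an object to the extension service before it is saved to the k8s API.

    Returns the (possibly modified) object; a 404 means the service is not
    interested in this kind, so the original object is used unchanged.
    """
    url = "{}/fiaas/deploy/{}".format(EXTENSION_SERVICE_URL, kind)
    response = requests.post(url, json={"object": k8s_object, "application": app_spec}, timeout=10)
    if response.status_code == 404:
        return k8s_object  # extension does not handle this kind; skip
    response.raise_for_status()  # other errors are logged and fail the deploy
    return response.json()["object"]


# Usage: deployment = apply_extension("Deployment", deployment_dict, app_spec_dict)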

This first iteration should allow us to do some of the experimentation we'd like to do. If it seems useful enough, then being able to authenticate with the extension-service using a shared token (equivalent to how it's done for the usage-reporting) would be a likely short-term 2nd step. And adding extra hooks, perhaps before and after the full deploy-process, to allow dynamically modifying the Application config, or to cause some extra objects to be created, could be interesting too, though we don't have concrete use-cases for those right now.

We'll be opening PRs before too long, which can be used to discuss specifics of, for example, the API-contract between fiaas and the extension-service that we're still not 100% clear on, but I'm opening the issue now to see if there's any feedback on the idea itself.

Remove pipeline consumer

Removing the pipeline consumer part of fiaas-deploy-daemon has been planned for a while. Tracking it here for visibility.

The first step of this is implemented in #63

Delete ApplicationStatus resources when deleting Application

When deleting an Application resource, all managed resources are deleted, except for the ApplicationStatus resources created on deployment. I think it makes sense to also delete the ApplicationStatus resources, since deleting an application typically means that it is decommissioned.

Support custom TLS certificate

FIAAS supports certificate provisioning with Let's Encrypt. It is possible to plug in custom certificates, but that requires manual steps to set up a custom/manually managed ingress resource.

It should be possible to plug in custom TLS certificates without manual steps.

Labels and annotations on ApplicationStatus resources are overwritten on update

fiaas-deploy-daemon overwrites labels and annotations on the ApplicationStatus resources it manages when it updates them. We have a use case where it would be useful to store some information in a label or annotation on ApplicationStatus resources, to avoid duplicate reporting to the system that orchestrates deployments, so it would help if annotations and labels set on the ApplicationStatus persisted across updates.

Application spec should be verified before being deployed

FIAAS supports multiple versions of application specs. When users upgrade their application specs from one version to the next, it is easy to supply values that are not valid, causing failures when deploying. By introducing a verification step, in the form of linting the application spec before deploying it, it would be possible to raise issues more explicitly and give users feedback earlier in the process.

e2e test suite Kubernetes version cleanup

It has been a while since the e2e test suite's Kubernetes version list was last updated, and some of the Kubernetes versions used there are getting somewhat dated, so I think it would be a good idea to see if it is possible to remove some of the older versions from the test suite. Run time for the e2e tests is getting longer, and more tests mean more complexity and more flakiness in the test suite as a whole. There is also some maintenance upkeep in keeping the older kind images working, since there are some certificates for cluster-internal use which tend to expire one year after image build time (see #45, #116).
From our perspective, I think it would be okay to keep only 1.16 and 1.18 (added in #147) out of the currently tested versions. Are there any Kubernetes versions we can remove from the test suite now? Which Kubernetes versions would you prefer to keep testing against?

e2e tests fail because kind images for 1.12 and 1.14 have expired certificates

The CI build for #105 fails because the kind clusters used for e2e tests with Kubernetes versions 1.12 and 1.14 fail to start. I looked into it, and it appears to be the same issue with expired certificates that was fixed for Kubernetes 1.9 in #45. We probably need a similar fix for these Kubernetes versions too.

I found this in the kubelet logs from the kind docker container:

E0922 09:27:13.978134     405 bootstrap.go:264] Part of the existing bootstrap client certificate is expired: 2020-09-10 23:17:42 +0000 UTC

TypeError while trying to deploy via pipeline consumer

The pipeline consumer raises TypeError when trying to process deployments:

  File "/usr/local/lib/python2.7/site-packages/fiaas_deploy_daemon/base_thread.py", line 29, in _logging_target
    self()
  File "/usr/local/lib/python2.7/site-packages/fiaas_deploy_daemon/pipeline/consumer.py", line 66, in __call__
    self._handle_message(deploy_counter, message, message_counter)
  File "/usr/local/lib/python2.7/site-packages/fiaas_deploy_daemon/pipeline/consumer.py", line 76, in _handle_message
    deployment_id=self._deployment_id(event))
  File "/usr/local/lib/python2.7/site-packages/fiaas_deploy_daemon/lifecycle.py", line 39, in initiate
    subject = Subject(*args, **kwargs)
TypeError: __new__() takes exactly 7 arguments (4 given)

Teams and tags are not supported when deploying using CRDs

In the kafka deploy method used at FINN, we support setting teams and tags labels on the deployed objects. This is used to track metrics and ownership across deployments.

When deploying using CRDs, we have not implemented support for these labels, and instead hardcode them to empty lists.

For FINN to move to CRD deployments, this needs to be supported. For us to get to python3 (#6) we need to get rid of the kafka code, which can only happen when FINN has completed the move to CRDs.

Default local/private endpoints

Some HTTP endpoints exposed by applications deployed by FIAAS should generally not be available from the internet/outside the Kubernetes cluster. Typically this will be prometheus metrics endpoints, health endpoints and so on. Currently, the defaults for these endpoints provided by FIAAS are grouped under /_/. One way to solve this could therefore be to adopt the convention that /_/* is considered private by the ingress controller.
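
For illustration only (the issue doesn't prescribe a mechanism), with the nginx ingress controller such a convention could be enforced via a server-snippet annotation on the generated Ingress:

# Sketch: block external access to the conventional private path prefix /_/
# (annotation and snippet are specific to the nginx ingress controller)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location ~ ^/_/ {
        deny all;
      }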

Proposal: Multiple config-map support for fiaas

We are using config-maps with fiaas today, but it can be tedious at times to update and keep track of all our config-maps when making changes. For example, for our database config we use a different port in dev and production, and it also requires some difference in the ssl part of the connection string. I'm wondering if this could be solved using only a single config-map named e.g. database-config, applied to all our apps via fiaas.yml, something like this:

---
version: 3
admin_access: true
replicas:
  maximum: 1
  minimum: 1
resources:
  requests:
    memory: 128Mi
ports:
  - target_port: 5000
healthchecks:
  liveness:
    http:
      path: /healthz
metrics:
  prometheus:
    path: /internal-backstage/prometheus
config_maps:
    - team-db-config
    - team-kafka-config

The default can stay the same, so that a config-map with the same name as the application is applied by default. And a default for config_maps in fiaas.yml could be something like:
config_maps: []

As far as I'm aware multiple config-maps can be applied to a pod like this:

spec:
  containers:
    - name: example-pod
      envFrom:
      - configMapRef:
          name: config-1
      - configMapRef:
          name: config-2

Let me know what you think, I would love this feature 👍

Option to disable service links environment variables

k8s injects a bunch of service link variables so that other services within a namespace can be reached via env vars. This feature has been deprecated for a long time. However, those environment variables are injected anyway, which makes some services fail (a specific example being nodejs apps).

Disabling this environment variable injection is already possible since k8s 1.13 via the enableServiceLinks flag in the PodSpec.

It would be helpful to make this flag configurable (possibly defaulting to false, given that the feature is deprecated and shouldn't be used).
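
For reference, the PodSpec field in question looks like this (a plain Kubernetes example, not fiaas configuration):

# Sketch: disabling service link env var injection on the generated pod template
spec:
  template:
    spec:
      enableServiceLinks: false
      containers:
        - name: example
          image: example:latest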

Support integration with datadog agent daemonset

The current datadog support in fiaas is based around running a dd agent sidecar in each application pod. According to the datadog documentation this may not be the optimal way (https://docs.datadoghq.com/agent/kubernetes/#installation)
"Note: Only one Datadog Agent shound run on each node; a sidecar per pod is not generally recommended and may not function as expected."

When a cluster operator has deployed the datadog agent as a daemonset, there is a dd agent pod running on each node. To enable APM and tracing from application pods, a couple of environment variables need to be set to enable reporting to the dd agent, but currently they cannot be expressed/set via fiaas without making some changes.

Right now it is possible to work around this by manually patching the deployment object generated by fiaas.

spec:
  template:
    spec:
      containers:
      - env:
        - name: DD_AGENT_HOST
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: DATADOG_TRACE_AGENT_HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: myimage
        name: myapp

Would it make sense to extend the datadog integration to support the behavior described above where the datadog agent host could be exposed to the application and the sidecar avoided altogether?

e2e tests are hard to debug

With the switch to using the kind container for e2e tests, we have a new type of instability in our e2e tests, where we get connection reset when communicating with kind. Since we don't capture any logs, that's all we know.

We should look into capturing logs from the kind container, so that we have more insights into why we get connection reset. Then maybe we can find solutions to avoid connection issues in the future, making the tests more stable.

When we copy liveness to readiness, liveness should have increased failure threshold

When a user only specifies a liveness check, our current behavior is to copy it to the readiness check.
Some blogs advocate that in the case where your liveness and readiness checks are the same, the liveness check should have a higher failure threshold.

This will allow your app to be considered unready and taken out of rotation, while still giving it some more tries to get ready before it is killed. This would help in cases where high load from the start makes it harder to come up.
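
A sketch of the intended outcome for a copied HTTP check; the failureThreshold values are placeholders, not decided in this issue:

# Sketch: same HTTP check used for both probes, but the liveness probe is
# allowed more failures before the pod is killed (threshold values are examples)
readinessProbe:
  httpGet:
    path: /healthz
    port: 5000
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz
    port: 5000
  failureThreshold: 10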

Allow users to disable ingress

It's impossible to disable ingress for particular deployments.

One use case would be when multiple services are deployed using FIAAS and only one has to be exposed (have ingress).

CI builds for PRs from forks don't run e2e tests

CI builds for PRs from forks don't run e2e tests because secrets aren't exposed in builds for forked PRs, and they are necessary for docker login, which currently happens before the e2e test run.
It would be good to change the CI config to make it possible to have full CI builds, including e2e tests, for forked PRs too. It is probably not a good idea to expose secrets for those builds, but rather to change how the build is done for PRs. Docker login shouldn't be strictly necessary to run the e2e tests, since they shouldn't need to push images, only build them.

Transition from extensions/v1beta1 Deployment to apps/v1 Deployment

FIAAS currently uses extensions/v1beta1 Deployment. This resource is deprecated in Kubernetes 1.16, and will be removed in Kubernetes 1.18.
Code paths that handle deployments need to be updated to use apps/v1 Deployment resources to be compatible with Kubernetes 1.18 and later.

Generalize secret support in FIAAS

FIAAS currently supports importing secrets into pods in two different ways:

  • native k8s secrets (mounting the secret with the same name as the FIAAS app)
  • importing from strongbox via an init container

It would be helpful to generalize how the init container for Strongbox works and define a very simple interface for it, so we could swap in any other secrets implementation. By doing so, we could support other secret management tools like AWS Secrets Manager or Vault.
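
As an illustration of what such a generic contract could look like (image name and mount path below are invented for the sketch, and this is not the existing Strongbox integration):

# Sketch: a generic "secrets init container" contract -- any implementation
# writes secrets into a shared volume that the application container mounts
spec:
  template:
    spec:
      initContainers:
        - name: secrets-init
          image: my-secrets-provider:latest   # pluggable implementation
          volumeMounts:
            - name: secrets
              mountPath: /var/run/secrets/fiaas   # hypothetical convention
      containers:
        - name: example
          image: example:latest
          volumeMounts:
            - name: secrets
              mountPath: /var/run/secrets/fiaas
              readOnly: true
      volumes:
        - name: secrets
          emptyDir:
            medium: Memory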

Support namespace-level 'issuer' for TLS

Currently, FIAAS supports configuring ingresses for TLS using Let's Encrypt via the ClusterIssuer feature of cert-manager. To allow for custom domains using DNS challenges (necessary for provisioning valid certificates in a multi-cluster setup, where the HTTP challenge doesn't work well), we'd like to allow users to also configure their apps to use an Issuer instead (these exist inside the same namespace). For cert-manager, this requires the Ingress to have a different annotation (certmanager.k8s.io/cluster-issuer vs. certmanager.k8s.io/issuer). Currently in FIAAS this is fixed, and only the value of the annotation can be changed (which allows switching between letsencrypt and letsencrypt-staging, for example).

The least intrusive way to enable this feature seems to be extending the configuration at the fiaas-deploy-daemon level, where we can introduce a mapping between domain suffixes and the issuer annotation to use, so the app config won't need to change.

The related config for TLS issuing would look something like:

tls-certificate-issuer: letsencrypt
tls-certificate-issuer-type-overrides:
  - foo.example.com=certmanager.k8s.io/issuer
  - bar.example.com=certmanager.k8s.io/issuer
tls-certificate-issuer-type-default: certmanager.k8s.io/cluster-issuer

Where tls-certificate-issuer currently exists, and remains the same.
tls-certificate-issuer-type-overrides is a new key and contains a mapping between domain suffixes and the annotation to use when configuring TLS for an ingress (a host matches an entry if it equals it or is a sub-domain of it, i.e. ends with the entry preceded by an additional '.' separator; in the case of multiple matches, the most specific (i.e. longest) entry in the list is used).
tls-certificate-issuer-type-default is another new key, and decides which issuer-type should be used when the domain doesn't match any configured in the previous section; the default will be to continue using cluster-issuer.
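
A minimal sketch of that matching rule, as a plain-Python illustration of the described behaviour rather than the eventual implementation:

# Sketch: pick the issuer annotation for a host based on configured suffix overrides
def issuer_type_for_host(host, overrides, default="certmanager.k8s.io/cluster-issuer"):
    matches = [
        (suffix, annotation)
        for suffix, annotation in overrides.items()
        if host == suffix or host.endswith("." + suffix)
    ]
    if not matches:
        return default
    # The most specific (longest) matching suffix wins
    return max(matches, key=lambda match: len(match[0]))[1]


overrides = {
    "foo.example.com": "certmanager.k8s.io/issuer",
    "bar.example.com": "certmanager.k8s.io/issuer",
}
assert issuer_type_for_host("api.foo.example.com", overrides) == "certmanager.k8s.io/issuer"
assert issuer_type_for_host("something.example.net", overrides) == "certmanager.k8s.io/cluster-issuer"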

This feature will also require a related change in how ingresses are created: currently, the ingress annotation to configure TLS is added after the creation of multiple ingresses, so multiple hosts requiring different issuers would currently end up on the same Ingress, which won't work correctly. As part of this change we will therefore also make the annotations added to configure TLS be taken into account when deciding whether to create one or multiple ingresses.

This change will be completely transparent and have no impact for fiaas-admins not configuring tls-certificate-issuer-type-overrides.

Transition to apiextensions.k8s.io/v1 CustomResourceDefinition

The CustomResourceDefinition resource is out of beta in Kubernetes 1.16. apiextensions.k8s.io/v1beta1 CustomResourceDefinition, which FIAAS currently uses, will be removed in Kubernetes v1.22.

Some changes are necessary to transition FIAAS to apiextensions.k8s.io/v1 CustomResourceDefinition; we need to determine exactly what these are, and how to implement them in FIAAS.

Use API v2 for HPA creation

FDDEP-0010: Switch API version for HPA object creation to v2beta2

Summary

Currently, when the HPA object is created, it uses API v1, which supports only a very limited set of metrics for scaling the deployment (pod CPU utilization and memory). We need to switch the API version to v2 to allow users to use an extended set of metrics (e.g. provided by Prometheus) for deployment autoscaling.

Motivation

  1. This change will allow extending the PAAS spec so users can scale their deployments based on relevant metrics.
  2. It will solve the "sidecar problem" (i.e. with HPA v1, CPU utilization is an average across all containers within a pod, which may lead to improper scaling when sidecars lower the average pod CPU utilization).

Goals

The goal is just to switch HPA to v2, which will allow us to:

  • maintain backward compatibility, i.e. the current PAAS spec for replicas will work exactly as before in terms of scaling logic
  • add the possibility to scale on various resource metrics
  • allow extending the FIAAS ecosystem with scaling based on business metrics

Non-Goals

  • add support for custom metrics
  • update PAAS spec

Proposal

User stories

Sidecar problem

As a user, when I have a pod with sidecar(s) (for example the datadog agent) and specify a CPU threshold of 75% for scaling: if the datadog agent's CPU usage is 0 while the application container's CPU usage is 90, the average will be below 75 and scaling up won't happen.

Business metrics

As a user I want to scale my deployment based on a relevant business metric (e.g. request rate). With HPA v1 I simply can't specify such a metric.

⚠️ Caveats ⚠️

HPA API v2 was introduced in k8s version 1.9+, so for older cluster versions backward compatibility will be broken if we just switch to API v2.

Mitigations

There are several possible options for mitigating the caveat listed above:

  1. We can simply state in the release notes that the new version includes breaking changes and that older k8s versions are no longer supported.
  2. We can "branch" the development and make tagged releases.
  3. Rather than just switching to API v2, we can maintain backward compatibility by adding cluster version detection that uses API v1 for older clusters and v2 for newer ones.

Design details

💡 Ideally we should probably switch to using the official k8s client library instead of the fiaas one, but that would require too much refactoring and is out of the scope of this proposal.

  1. We need to add models for autoscaling v2 objects in fiaas/k8s
  2. We need to update the FDD autoscaler deployer to use the new models
  3. We need to keep supporting the current paas.yaml spec, so we need to add a transformation of the current metrics specification to the v2 format, i.e. targetCPUUtilizationPercentage: XX should be transformed to the v2 metrics spec (a Python sketch of this transformation follows below):
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: XX
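
A minimal sketch of that transformation in plain Python (dict in, list of dicts out); the real implementation would use the k8s models mentioned in point 1:

# Sketch: turn an autoscaling v1-style CPU target into the v2 metrics list
def cpu_target_to_v2_metrics(target_cpu_utilization_percentage):
    return [
        {
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {
                    "type": "Utilization",
                    "averageUtilization": target_cpu_utilization_percentage,
                },
            },
        }
    ]


# Example: targetCPUUtilizationPercentage: 50 becomes the v2 metrics spec shown above
assert cpu_target_to_v2_metrics(50)[0]["resource"]["target"]["averageUtilization"] == 50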

Dependencies

Implementation history

  • 2020-10-13: Proposal open for discussion

Add option to set a configurable default ingressclass for ingresses

It could be useful to be able to set a default ingressclass for ingress resources managed by fiaas-deploy-daemon via the kubernetes.io/ingress.class annotation.

When moving to the networking.k8s.io/v1 ingress API (#40) it becomes possible to mark an IngressClass resource as the default at the cluster level. Even with this option, it could still be useful to be able to configure a (different) default at the fiaas-deploy-daemon instance level. (When moving to the v1 ingress API, the fiaas-deploy-daemon default ingress class should likely be set via the ingressClassName field and not the annotation.)

Support for progressDeadlineSeconds in Deployments

Kubernetes has the notion of a progress deadline (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#progress-deadline-seconds) to understand, via a condition, whether a deployment is making progress. Unless specified, that timeout is set to 10m.

Deployment tools in general (kubectl, spinnaker, argo, etc.) use this condition to signal that a deployment failed. However, FIAAS only checks desired vs ready pods, which is an incomplete view of what k8s is surfacing.

This issue is to discuss whether FIAAS needs to follow this same pattern and consider the progress condition in deployments.

Some possibilities could be:

  • include a timeout in the ApplicationSpec to create the deployment with a given timeout (defaulting to 0, so it's backwards compatible with current behaviour).
  • ignore the condition, as FIAAS does today, and rely exclusively on desired vs ready pods.
  • others?
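
For reference, the field in question sits directly on the Deployment spec; a plain Kubernetes example (not fiaas configuration):

# Sketch: a Deployment that is flagged as failing to progress after 10 minutes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  progressDeadlineSeconds: 600   # the Kubernetes default when not specified
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example:latest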
