operate-first / continuous-deployment


Continuous Deployment

License: GNU General Public License v3.0

Topics: aicoe, gitops, hacktoberfest, hacktoberfest2022, helm, kubernetes, operate-first, redhat

continuous-deployment's People

Contributors

accorvin, anishasthana, durandom, harshad16, hemajv, humairak, larsks, martinpovolny, mscherer, oindrillac, tumido


continuous-deployment's Issues

Move/Update Docs from aicoe-cd to this repo

We have a bunch of docs in the aicoe-cd repo that should be moved here. The docs that are team-specific should remain in the aicoe-cd repo; the rest can be moved here.

An example of docs not to be moved: Permissions.

The docs also contain links that point into the aicoe-cd repo; these should be converted to point here where applicable (do not use relative paths for those links, use the full URL, so that mkdocs works).

The docs structure should remain the same and be found in the root directory.

Move sops yaml in root to dev folder

The sops.yaml in the root folder is a dev sops yaml for testing/demo purposes; its private key is exposed, so it should not be used for anything other than encrypting things that are not confidential. Keeping it in the root folder could result in files in the moc-cnv overlay getting encrypted with this exposed gpg key, which would be ... not good.
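
A hedged sketch of how the .sops.yaml creation rules could be scoped once the dev key lives in a dev folder; the path regexes and key fingerprints below are placeholders, not the actual configuration:

creation_rules:
  # Demo/testing key (its private half is public knowledge), matched only under dev/
  - path_regex: dev/.*\.enc\.yaml$
    pgp: "0000000000000000000000000000000000000000"   # placeholder fingerprint
  # Real key(s) for everything else, e.g. the moc-cnv overlay
  - path_regex: overlays/moc-cnv/.*\.enc\.yaml$
    pgp: "1111111111111111111111111111111111111111"   # placeholder fingerprint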

ArgoCD Apps of Apps architecture proposal

ArgoCD is live on MOC and we have created the appropriate projects/permissions to start deploying applications.

Before we do that I think we should have a brief discussion on how we should set up our applications on argocd declaratively.

For Context:
ArgoCD Application manifests are a declarative way to manage ArgoCD applications in git. Traditionally we've stored these alongside the ArgoCD deployment manifests, like IDH has done here.

This has been fine in the past since we controlled the deployment of ArgoCD and had merge access to the repo where the applications were stored. So if we wanted to onboard a new app, we'd make a PR with the application manifest and someone on our team would merge it (see this PR as an example).

But now we have a situation where MOC manages ArgoCD here. It was added to this repo because it's a cluster-wide ArgoCD that can be used to manage cluster resources as well.

The Problem:
If we applied our current practice, we'd store our app manifests here. The problem is that we don't have merge access to this repo, and it wouldn't really make much sense for the people who manage the infrastructure to also handle PRs that don't pertain directly to cluster management.

Proposed Solution
To resolve this dilemma, I'd like to put forth the following suggestion for how we can organize our repositories/ArgoCD applications:

[image: diagram of the proposed repository / ArgoCD application layout]

The Infra Repo is analogous to the moc-cnv-sandbox repo here, but could be replaced with another repo as well.

The idea here is that all our operate-first/thoth/data-science ArgoCD Applications would go in the opf-argocd-apps repo. Then we'd have an App of Apps, i.e. the OPF Parent App, that manages all these apps (sketched below). This way we can add new applications declaratively to ArgoCD without having to make PRs to the Infra Repo (or moc-cnv-sandbox). Operate First admins would manage the opf-argocd-apps repo. Any other ArgoCD applications that manage cluster resources like clusterrolebindings, operator subscriptions, etc. can remain in the infra repo, since those are a concern for cluster admins. We would direct any user of MOC who wants to use ArgoCD to manage their apps to add their ArgoCD apps to the opf-argocd-apps repo.
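
As a rough sketch of what the OPF Parent App could look like (the repo URL, path, and project name here are assumptions for illustration, not settled decisions):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opf-parent-app
  namespace: argocd
spec:
  project: operate-first            # the restricted Operate First ArgoCD project
  source:
    repoURL: https://github.com/operate-first/opf-argocd-apps   # hypothetical repo
    targetRevision: HEAD
    path: applications              # directory holding the child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Adding a new app would then just be a PR adding one more Application manifest under the applications/ path of that repo.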

Pros:

  • MOC admins/ops are not bombarded with PRs for ArgoCD app onboarding
  • Operate First maintainers can handle the PRs unhindered
  • The "OPF-ArgoCD-Apps" repo can be leveraged by CRC/Quicklab/other OCP clusters to quickly set up ArgoCD ODH/Thoth/etc. applications.

Cons:
One concern here is that there is no way to automatically enforce that Applications in the opf-argocd-apps repo belong to the Operate First ArgoCD project (see diagram). Why is this a problem? Because we use ArgoCD projects to restrict what types of resources applications in that project can deploy. For example, ArgoCD apps in the Infra Apps project in the diagram can deploy clusterrolebindings, operators, etc. So while the OPF Parent App cannot deploy clusterrolebindings itself, because it belongs to the Operate First ArgoCD project, it could deploy another ArgoCD Application that belongs to Infra Apps, and that app could deploy clusterrolebindings.

You can read more about this issue here. The individual there used admission hooks to get around this, but I don't think we want to go there just yet. My suggestion is that we begin by enforcing this at the PR level, and maybe transition to catching it in CI, until there's a proper solution upstream.
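
For context on why project membership matters, here is a rough sketch of a restricted AppProject; the project name, repo pattern, and namespace pattern are placeholders, not the actual MOC configuration:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: operate-first          # placeholder project name
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/operate-first/*    # placeholder repo pattern
  destinations:
    - namespace: "opf-*"                     # placeholder namespace pattern
      server: https://kubernetes.default.svc
  # No cluster-scoped resources allowed: apps in this project cannot create
  # ClusterRoleBindings, CRDs, operator Subscriptions, etc.
  clusterResourceWhitelist: []

Note that Application resources themselves are namespace-scoped, so a project that blocks cluster resources does not by itself stop a child Application from being created in another, less restricted project, which is the loophole described above.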

Use a global project to template group permissions

As described here we can have a common global project to inherit all permissions from. Since most of the permissions are essentially copied and pasted, this would make it significantly easier to account for what permissions each team has when deploying via ArgoCD onto their teams' respective projects.
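
If I'm reading the upstream global-projects feature right, this would be configured in the argocd-cm ConfigMap roughly as below; the project name and label key are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  globalProjects: |
    # Projects carrying this label inherit restrictions from the opf-global project
    - projectName: opf-global                  # placeholder shared project
      labelSelector:
        matchExpressions:
          - key: opf.dev/inherit-global        # placeholder label key
            operator: In
            values: ["true"]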

[discussion] Kustomize Plugin Directory

I would like to start a discussion about using kustomize plugins to extend the functionality of kustomize to meet our needs. Some existing problems right now:

  • KafkaTopics require creating a new resource file with a lot of repetition, so 100 topics require 100 new files
    • because of this we had to use helm for one of our other repos
  • The ConfigMap generator does not allow you to extend an existing configmap's data fields (you have to overwrite the whole data field if you fetch a configmap from another kustomize base)
    • the ability to do this would allow us to move more non-cluster-specific configuration from aicoe-cd to this repo
  • We can't sops-encrypt inline, for example in configmaps
  • Some deployments require fetching existing information like SA tokens to populate the manifest post-deployment; this requires you to deploy, then fetch tokens, then update manifests, then re-deploy (ideally we should deploy only once)

All of these issues (save for the last one maybe) can be easily solved by writing some quick Kustomize plugins.

I was thinking we would include a subdirectory in this repo called kustomize_plugins/operate-first/v1/... where we write and add these plugins.

We would then include these plugins as part of the ArgoCD image.
To use these plugins locally, you would just copy this plugin folder to your $XDG_CONFIG_HOME/kustomize/plugin/ folder.

The plugins themselves are very straightforward and easy to write (often only ~50-100 lines), and can be written in bash/go/python or essentially any language (we would likely just use python, something like this).
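
As a rough sketch of how one of these generator plugins would be consumed (the plugin kind and file name are hypothetical), an overlay's kustomization.yaml would reference a plugin-config file via the generators field:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

generators:
  # kustomize matches the apiVersion/kind inside this file against the plugin
  # directory layout ($XDG_CONFIG_HOME/kustomize/plugin/<group>/<version>/<lowercased-kind>/)
  # and invokes that plugin; builds need the --enable_alpha_plugins flag.
  - kafka-topics-generator.yaml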

What do you guys think?

Determine how to gate changes between environments

We need to figure out a good way to gate/promote changes between environments. Right now we generally default to just updating the base manifest or all the overlays at the same time.

Assume that we have the following sets of manifests (and environments), all managed by ArgoCD
dev -> stage -> prod-1, prod-2

Our current options are:

OPTION 1: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> deploy and test changes in prod-*

Pros:

  • We will be manually promoting changes and requiring manual verification, so if runbooks are well defined this should be relatively safe

Cons:

  • Manual verification can still result in errors
  • We aren't using ArgoCD to the fullest
  • There will be periods of time where what is in git does not match what is actually deployed

OPTION 2: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> Create a new tagged release after a number of changes are merged into master -> Update the tagged release that the prod-* ArgoCD apps point to (see the sketch after the pros/cons below).

Pros:

  • Very easy to tell what is deployed in prod environment (we're basing the deployment on a tagged release)
  • Difficult to accidentally push changes to prod.

Cons:

  • While lots of small changes may not result in issues, one large change can result in applications crashing
  • Slower velocity (not necessarily a bad thing)
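
A minimal sketch of what option 2 looks like on the ArgoCD side, assuming a prod overlay path and an example tag (both placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app-prod-1
  namespace: argocd
spec:
  project: operate-first
  source:
    repoURL: https://github.com/operate-first/continuous-deployment
    path: manifests/overlays/prod-1      # placeholder path
    targetRevision: v0.3.0               # prod pinned to a tag; stage would track HEAD/master
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app

Promotion then means a PR that bumps targetRevision, which keeps git and the cluster in sync.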

OPTION 3: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to prod-* overlays -> deploy and verify changes in prod-* (This means you basically have 3 copies of the same manifests)

Pros:

  • Very deliberate process for promoting changes.

Cons:

  • Having three copies of the same manifests goes against the whole idea of overlays and bases. This isn't necessarily bad, but it can be an anti-pattern.
  • Harder to determine what is different between environments

OPTION 4: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to revert changes in stage manifests and update the base manifests -> deploy and verify changes in prod-*

Pros:

  • Addresses one of the flaws of 3 so that we re-use manifests as much as possible
  • Very deliberate process for promoting changes

Cons:

  • Revert commits mean that history will be a little messy

OPTION 5: Test and verify changes in dev -> PR to update base manifests -> deploy in stage -> Use ArgoCD resource hooks to verify changes (see the sketch after the pros/cons below) -> trigger sync in prod-* manifests

Pros:

  • Automated, so it will require minimal input from developers once changes are merged in

Cons:

  • Will require larger investment to write tests for applications
  • This might be a slight mis-use of resource hooks, so we should investigate that
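
For option 5, a rough sketch of a verification hook, assuming a hypothetical smoke-test image and script:

apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync                    # run the Job after each sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded # clean up on success
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: quay.io/example/opf-smoke-tests:latest       # placeholder image
          command: ["/bin/sh", "-c", "./run-smoke-tests.sh"]  # placeholder test script

If the Job fails, the sync in stage is reported as failed, which could serve as the signal gating the sync in prod-*.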

OPTION 6: Similar to above, but use something like keptn for the change gating instead

Pros:

  • Automated
  • Based on SLIs and SLOs, so it will force us to think more about the health of our services

Cons:

  • Will require a lot of investment for getting started

Deploy as many components as possible on quicklab cluster

The purpose of this exercise is to identify logical groupings of namespaces : components.

From previous attempts we know we cannot deploy all ODH components onto a single quicklab instance with 3 workers.

Therefore, this task should result in a grouping of components and their deployments such that we can deploy these groups onto a quicklab. Documentation should follow.

The end result should be a set of N quicklab clusters that contain all components of ODH.

Update ArgoCD to 1.8.3

Issues #90 and #84 require steps similar to upgrading ArgoCD itself, and 1.8.3 introduces some fixes we could benefit from. I think we can tackle all three at the same time.

unable to oc login during quicklab setup

I am trying to set up a new Quicklab cluster following the instructions provided and I am facing an issue in step 10 here.

As I try to oc login into my cluster, I get the following error:

[ochatter@ochatter ~]$ oc login upi-0.ochattertest2.lab.upshift.rdu2.redhat.com:6443
The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): n

error: The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com

@tumido can you please help me understand what could be going wrong here?

Move argocd manifests to apps repo?

I think this repo has become a bit of a snowflake: we store all our cluster resources in the apps repo, but make an exception for the cluster RBAC/CRDs for ArgoCD because they go in this repo. I'm thinking we should just move all of the ArgoCD stuff into an ArgoCD app in the apps repo and keep this repo alive only for the Dockerfile + tag releases. We'd also move the docs, so that we have one less repo to worry about when debating where a certain piece of documentation lives. WDYT?

Edit: Laying down plan for migration here:

  • first move the argocd manifests folder, and store cluster resources in the cluster-scope app
  • move scripts to the apps repo scripts folder (update: we'll just remove these and delegate fresh-install steps to markdown)
  • move docs to the apps docs folder
  • update the downstream aicoe-cd repo manifests to point to the apps/argocd/base path
  • remove manifests/scripts/docs from this repo
  • rename this repo to a more appropriate name, since all it will house is the ArgoCD image

relationship between `continuous-deployment` and its implementation-specific incarnations

We want to maintain one upstream version of https://github.com/operate-first/continuous-deployment and make it easy for downstream users to build on top of the knowledge collected upstream. While we have control over our own downstream, e.g. CRC or QuickLab, we will not have control over a 3rd party downstream.

The amount of change introduced by a downstream may also vary: from just changing a key in KSOPS to replacing KSOPS with Vault.

For a downstream user, it should be easy to follow the documentation, without any context switching to different repositories.
It should also be really easy to incorporate all changes from upstream, once upstream introduces new best practices.

The original idea was for this continuous-deployment repo to be the upstream with no implementation specifics, and to let the other targets be forks of it.

E.g. continuous-deployment <--upstream_of-- continuous-deployment-crc

Unfortunately you can't fork into the same account/org (https://github.community/t/alternatives-to-forking-into-the-same-account/10200)

I suggest creating a new or duplicated repo continuous-deployment-crc and handling the rebasing without GitHub, as explained in https://stackoverflow.com/questions/45748400/git-fork-repo-to-same-organization

The downside is that we don't get a nice UI showing how many commits each repo is ahead of/behind the other.

Thoughts?

Write a KafkaTopic generator kustomize plugin

Write a kustomize plugin, kafkatopicgenerator:

When given an input like:

apiVersion: operatefirst/v1
kind: KafkaTopicGenerator
metadata:
  name: kafka-topics
  namespace: mynamespace
clusterName: dev
defaults:
  partitions: 2
  replicas: 3
topics:
  - topicName: exampletopicA
    partitions: 4
    config:
      retention.bytes: 853333300
      retention.ms: 172800000
  - topicName: exampletopicB

It should generate the following KafkaTopic manifests:

---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicA
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 4
  replicas: 3
  topicName: exampletopicA
  config:
    retention.bytes: 853333300
    retention.ms: 172800000
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicB
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 2
  replicas: 3
  topicName: exampletopicB

The plugin should be added under operate-first/cd/kustomizePlugins/v1/kafkatopicgenerator.
Include a readme.md in this folder covering an example use case like the one above, plus installation instructions.
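
For clarity, a sketch (the file name is assumed) of how the generator config above would be consumed from a kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  # kafka-topics.yaml would contain the KafkaTopicGenerator config shown above;
  # `kustomize build --enable_alpha_plugins` then emits the KafkaTopic manifests.
  - kafka-topics.yaml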

kustomize stdout mess running under toolbox

When running kustomize from the toolbox using the `run` command, I get an informational message on stdout, which is a problem if I want to pipe the output to oc or kubectl or anywhere else.

Compare:

$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null  | head
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev
---

vs:

$ toolbox run --container of-toolbox-v0.1.0 kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
2020/10/05 09:46:56 Attempting plugin load from '/usr/share/.config/kustomize/plugin/viaduct.ai/v1/ksops/ksops.so'
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev

Using `toolbox enter` is fine too:

$ toolbox enter --container of-toolbox-v0.1.0
...
...
$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev
---

So this is probably something that needs to be reported back to toolbox. Or is this expected behavior?

Add monitoring and Long Term Storage for Metrics

We can set up individual application monitoring using the ODH monitoring stack (Prometheus and Grafana operators), and for long-term storage of metrics we can use the Observatorium stack to set up the Thanos infrastructure (https://github.com/observatorium).

Proposed Solution:

  1. Deploy ODH Prometheus and ODH Grafana + the applications to monitor
  2. Determine whether we need one Prometheus instance to monitor all namespaces or individual per-namespace Prometheus instances
  3. Deploy Observatorium (disable the Loki setup)
  4. Update the ODH Prometheus manifest to remote-write to Observatorium (a rough sketch follows)
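
A minimal sketch of step 4, assuming a Prometheus Operator CR; the Observatorium receive URL, namespace, and auth mechanism are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: odh-monitoring            # placeholder name
  namespace: opf-monitoring       # placeholder namespace
spec:
  remoteWrite:
    # Forward all scraped series to the Thanos receive endpoint in Observatorium
    - url: https://observatorium-api.example.com/api/metrics/v1/opf/api/v1/receive
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token   # placeholder auth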

@HumairAK @anishasthana @durandom

Helm-Secrets plugin no longer there

The helm-secrets plugin no longer seems to be available in the repo-server pods. This causes helm builds that use secrets to fail to render manifests, which in turn instructs ArgoCD to prune those resources.

Aggregate Logging solution for operate-first deployments

We need to determine how to aggregate logs for applications running on our clusters, whether they be Thoth or ODH.

Two options here are:

  1. Use OpenShift as the logging layer to grab all STDOUT and send it via fluentd to something like Elasticsearch.
    1.1 The problem here is that Elasticsearch is not a suitable solution for all our users, so we should figure out exactly why it isn't suitable and document it.
  2. Use an open source project such as Loki or Graylog to grab logs and visualize them.
    A key point here is that integrating with the logging should be simple for whichever solution we choose.

I think starting with a POC using Loki makes the most sense, as there are already other teams at Red Hat using Loki.

Reduce manual steps for deploying ArgoCD

There are a number of manual steps involved in deploying ArgoCD; most of these are due to adding OpenShift authentication.

Brief primer on adding OpenShift OAuth:

OpenShift allows you to use a service account as an OAuth client in order to authenticate against the OAuth server. ArgoCD uses the Dex identity service to implement auth. We combine the two to achieve OpenShift auth. This essentially means we need a bunch of steps to give ArgoCD the token of the SA that's acting as the OAuth client, and we need to give the SA a redirect link to call back to. All these steps are captured in this script.

We should minimize the need for this script as much as possible. Some areas that can be automated:

  • Add a dynamic redirect here; see more info here.
  • Get rid of the callback here, as it's not needed; from the ArgoCD docs:
    • "Argo CD will automatically use the correct redirectURI"
    • link to relevant page

We should also monitor this PR; once it makes it into a release we should upgrade, update this line to a secret reference, and then manually create the dex-server service account secret as described here. The token will then automatically be picked up and we won't need to hard-code it.

This should eliminate most of the steps in the actual ArgoCD deployment bits of the script.
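
For context, a rough sketch of what the Dex OpenShift connector config in argocd-cm looks like under the SA-as-OAuth-client pattern described above; the hostnames, namespace, and secret reference are assumptions, not our actual values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: aicoe-argocd-dev
data:
  url: https://argocd.example.com            # placeholder external URL
  dex.config: |
    connectors:
      - type: openshift
        id: openshift
        name: OpenShift
        config:
          issuer: https://kubernetes.default.svc
          # The dex-server SA acting as the OAuth client
          clientID: system:serviceaccount:aicoe-argocd-dev:argocd-dex-server
          # Today the SA token ends up here by hand; the secret-reference change
          # above is what would let this be picked up automatically.
          clientSecret: $dex.openshift.clientSecret   # placeholder secret reference
          redirectURI: https://argocd.example.com/api/dex/callback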

Helm-Secrets deprecated, switch to new repo

As per this repo here

Please note, this project is no longer being maintained. There is an active fork jkroepke/helm-secrets and we will also contribute our future changes to it.

We should switch to the one here.

Enable anonymous read access

Currently an unauthenticated user sees:
[image: screenshot of the ArgoCD UI as currently seen by an unauthenticated user]

We need to fix this, since we are multi-cluster now and we might not want user parity between clusters (for the workshops, for example).
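
A minimal sketch of the relevant settings, assuming the stock argocd-cm / argocd-rbac-cm ConfigMaps in the argocd namespace: enable anonymous access and give anonymous users a read-only role.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  users.anonymous.enabled: "true"     # allow unauthenticated access
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly       # anonymous (and unmatched) users get read-only access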
