operate-first / continuous-deployment


Continuous Deployment

License: GNU General Public License v3.0

Topics: aicoe, gitops, hacktoberfest, hacktoberfest2022, helm, kubernetes, operate-first, redhat

continuous-deployment's People

Contributors

accorvin, anishasthana, durandom, harshad16, hemajv, humairak, larsks, martinpovolny, mscherer, oindrillac, tumido


continuous-deployment's Issues

Move/Update Docs from aicoe-cd to this repo

We have a bunch of docs in the aicoe-cd repo that should be moved here. The docs that are team-specific should remain in the aicoe-cd repo; the rest can be moved here.

An example of docs not to be moved: Permissions.

The docs also contain links that point into the aicoe-cd repo; these should be converted to point here where applicable (do not use relative paths for those links, use the full URL, so that mkdocs works).

The docs structure should remain the same and be found in the root directory.

Move sops yaml in root to dev folder

The sops.yaml in the root folder is a dev sops yaml for testing/demo purposes; its private key is exposed, so it should not be used for anything other than encrypting things that are not confidential. Keeping it in the root folder could result in files in the moc-cnv overlay getting encrypted with this exposed gpg key, which would be ... not good.
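
A hedged sketch of how the .sops.yaml creation rules could be scoped once the dev key lives in a dev folder; the path regexes and key fingerprints below are placeholders, not the actual configuration:

creation_rules:
  # Demo/testing key (its private half is public knowledge), matched only under dev/
  - path_regex: dev/.*\.enc\.yaml$
    pgp: "0000000000000000000000000000000000000000"   # placeholder fingerprint
  # Real key(s) for everything else, e.g. the moc-cnv overlay
  - path_regex: overlays/moc-cnv/.*\.enc\.yaml$
    pgp: "1111111111111111111111111111111111111111"   # placeholder fingerprint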

ArgoCD Apps of Apps architecture proposal

ArgoCD is live on MOC and we have created the appropriate projects/permissions to start deploying applications.

Before we do that I think we should have a brief discussion on how we should set up our applications on argocd declaratively.

For Context:
ArgoCD Application manifests are a declarative way to manage ArgoCD applications in git. Traditionally we've stored these alongside the ArgoCD deployment manifests, like IDH has done here.

This has been fine in the past since we controlled the deployment of ArgoCD and had merge access to the repo where the applications were stored. So if we wanted to onboard a new app, we'd make a PR with the application manifest and someone on our team would merge it (see this PR as an example).

But now we have a situation where MOC manages ArgoCD here. It was added to this repo because it's a cluster-wide ArgoCD that can be used to manage cluster resources as well.

The Problem:
If we applied our current practice, we'd store our app manifests here. The problem is that we don't have merge access to this repo, and it wouldn't really make much sense for the people who manage the infrastructure to also handle PRs that don't pertain directly to cluster management.

Proposed Solution
To resolve this dilemma, I'd like to put forth the following suggestion for how we can organize our repositories/ArgoCD applications:

[image: diagram of the proposed repository / ArgoCD application layout]

The Infra Repo is analogous to the moc-cnv-sandbox repo here, but could be replaced with another repo as well.

The idea here is that all our operate-first/thoth/data-science ArgoCD Applications would go in the opf-argocd-apps repo. Then we'd have an App of Apps, i.e. the OPF Parent App, that manages all these apps (sketched below). This way we can add new applications declaratively to ArgoCD without having to make PRs to the Infra Repo (or moc-cnv-sandbox). Operate First admins would manage the opf-argocd-apps repo. Any other ArgoCD applications that manage cluster resources like clusterrolebindings, operator subscriptions, etc. can remain in the infra repo, since those are a concern for cluster admins. We would direct any user of MOC who wants to use ArgoCD to manage their apps to add their ArgoCD apps to the opf-argocd-apps repo.
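
As a rough sketch of what the OPF Parent App could look like (the repo URL, path, and project name here are assumptions for illustration, not settled decisions):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opf-parent-app
  namespace: argocd
spec:
  project: operate-first            # the restricted Operate First ArgoCD project
  source:
    repoURL: https://github.com/operate-first/opf-argocd-apps   # hypothetical repo
    targetRevision: HEAD
    path: applications              # directory holding the child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Adding a new app would then just be a PR adding one more Application manifest under the applications/ path of that repo.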

Pros:

  • MOC admins/ops are not bombarded with PRs for ArgoCD app onboarding
  • Operate First maintainers can handle the PRs unhindered
  • The "OPF-ArgoCD-Apps" repo can be leveraged by CRC/Quicklab/other OCP clusters to quickly set up ArgoCD ODH/Thoth/etc. applications.

Cons:
One concern here is that there is no way to automatically enforce that Applications in the opf-argocd-apps repo belong to the Operate First ArgoCD project (see diagram). Why is this a problem? Because we use ArgoCD projects to restrict what types of resources applications in that project can deploy. For example, ArgoCD apps in the Infra Apps project in the diagram can deploy clusterrolebindings, operators, etc. So while the OPF Parent App cannot deploy clusterrolebindings itself, because it belongs to the Operate First ArgoCD project, it could deploy another ArgoCD Application that belongs to Infra Apps, and that app could deploy clusterrolebindings.

You can read more about this issue here. The individual there used admission hooks to get around this, but I don't think we want to go there just yet. My suggestion is that we begin by enforcing this at the PR level, and maybe transition to catching it in CI, until there's a proper solution upstream.
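
For context on why project membership matters, here is a rough sketch of a restricted AppProject; the project name, repo pattern, and namespace pattern are placeholders, not the actual MOC configuration:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: operate-first          # placeholder project name
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/operate-first/*    # placeholder repo pattern
  destinations:
    - namespace: "opf-*"                     # placeholder namespace pattern
      server: https://kubernetes.default.svc
  # No cluster-scoped resources allowed: apps in this project cannot create
  # ClusterRoleBindings, CRDs, operator Subscriptions, etc.
  clusterResourceWhitelist: []

Note that Application resources themselves are namespace-scoped, so a project that blocks cluster resources does not by itself stop a child Application from being created in another, less restricted project, which is the loophole described above.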

Use a global project to template group permissions

As described here we can have a common global project to inherit all permissions from. Since most of the permissions are essentially copied and pasted, this would make it significantly easier to account for what permissions each team has when deploying via ArgoCD onto their teams' respective projects.
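
If I'm reading the upstream global-projects feature right, this would be configured in the argocd-cm ConfigMap roughly as below; the project name and label key are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  globalProjects: |
    # Projects carrying this label inherit restrictions from the opf-global project
    - projectName: opf-global                  # placeholder shared project
      labelSelector:
        matchExpressions:
          - key: opf.dev/inherit-global        # placeholder label key
            operator: In
            values: ["true"]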

[discussion] Kustomize Plugin Directory

I would like to start a discussion about using kustomize plugins to extend the functionality of kustomize to meet our needs. Some existing problems right now:

  • KafkaTopics require creating a new resource file with a lot of repetition, so 100 topics require 100 new files
    • because of this we had to use helm for one of our other repos
  • The ConfigMap generator does not allow you to extend an existing configmap's data fields (you have to overwrite the whole data field if you fetch a configmap from another kustomize base)
    • the ability to do this would allow us to move more non-cluster-specific configuration from aicoe-cd to this repo
  • We can't sops-encrypt inline, for example in configmaps
  • Some deployments require fetching existing information like SA tokens to populate the manifest post-deployment; this requires you to deploy, then fetch tokens, then update manifests, then re-deploy (ideally we should deploy only once)

All of these issues (save for the last one maybe) can be easily solved by writing some quick Kustomize plugins.

I was thinking we would include a subdirectory in this repo called kustomize_plugins/operate-first/v1/... where we write and add these plugins.

We would then include these plugins as part of the ArgoCD image.
To use these plugins locally, you would just copy this plugin folder to your $XDG_CONFIG_HOME/kustomize/plugin/ folder.

The plugins themselves are very straightforward and easy to write (often only ~50-100 lines), and can be written in bash/go/python or essentially any language (we would likely just use python, something like this).
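
As a rough sketch of how one of these generator plugins would be consumed (the plugin kind and file name are hypothetical), an overlay's kustomization.yaml would reference a plugin-config file via the generators field:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

generators:
  # kustomize matches the apiVersion/kind inside this file against the plugin
  # directory layout ($XDG_CONFIG_HOME/kustomize/plugin/<group>/<version>/<lowercased-kind>/)
  # and invokes that plugin; builds need the --enable_alpha_plugins flag.
  - kafka-topics-generator.yaml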

What do you guys think?

Determine how to gate changes between environments

We need to figure out a good way to gate/promote changes between environments. Right now we generally default to just updating the base manifest or all the overlays at the same time.

Assume that we have the following sets of manifests (and environments), all managed by ArgoCD
dev -> stage -> prod-1, prod-2

Our current options are:

OPTION 1: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> deploy and test changes in prod-*

Pros:

  • We will be manually promoting changes and requiring manual verification, so if runbooks are well defined this should be relatively safe

Cons:

  • Manual verification can still result in errors
  • We aren't using ArgoCD to the fullest
  • There will be periods of time where what is in git does not match what is actually deployed

OPTION 2: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> Create a new tagged release after a number of changes are merged into master -> Update the tagged release that the prod-* ArgoCD apps point to (see the sketch after the pros/cons below).

Pros:

  • Very easy to tell what is deployed in prod environment (we're basing the deployment on a tagged release)
  • Difficult to accidentally push changes to prod.

Cons:

  • While lots of small changes may not result in issues, one large change can result in applications crashing
  • Slower velocity (not necessarily a bad thing)
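
A minimal sketch of what option 2 looks like on the ArgoCD side, assuming a prod overlay path and an example tag (both placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app-prod-1
  namespace: argocd
spec:
  project: operate-first
  source:
    repoURL: https://github.com/operate-first/continuous-deployment
    path: manifests/overlays/prod-1      # placeholder path
    targetRevision: v0.3.0               # prod pinned to a tag; stage would track HEAD/master
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app

Promotion then means a PR that bumps targetRevision, which keeps git and the cluster in sync.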

OPTION 3: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to prod-* overlays -> deploy and verify changes in prod-* (This means you basically have 3 copies of the same manifests)

Pros:

  • Very deliberate process for promoting changes.

Cons:

  • Having three copies of the same manifests goes against the whole idea of overlays and bases. This isn't necessarily bad, but it can be an anti-pattern.
  • Harder to determine what is different between environments

OPTION 4: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to revert changes in stage manifests and update the base manifests -> deploy and verify changes in prod-*

Pros:

  • Addresses one of the flaws of 3 so that we re-use manifests as much as possible
  • Very deliberate process for promoting changes

Cons:

  • Revert commits mean that history will be a little messy

OPTION 5: Test and verify changes in dev -> PR to update base manifests -> deploy in stage -> Use ArgoCD resource hooks to verify changes (see the sketch after the pros/cons below) -> trigger sync in prod-* manifests

Pros:

  • Automated, so it will require minimal input from developers once changes are merged in

Cons:

  • Will require larger investment to write tests for applications
  • This might be a slight mis-use of resource hooks, so we should investigate that
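
For option 5, a rough sketch of a verification hook, assuming a hypothetical smoke-test image and script:

apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync                    # run the Job after each sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded # clean up on success
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: quay.io/example/opf-smoke-tests:latest       # placeholder image
          command: ["/bin/sh", "-c", "./run-smoke-tests.sh"]  # placeholder test script

If the Job fails, the sync in stage is reported as failed, which could serve as the signal gating the sync in prod-*.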

OPTION 6: Similar to above, but use something like keptn for the change gating instead

Pros:

  • Automated
  • Based on SLIs and SLOs, so it will force us to think more about the health of our services

Cons:

  • Will require a lot of investment for getting started

Deploy as many components as possible on quicklab cluster

The purpose of this exercise is to identify logical groupings of namespaces : components.

From previous attempts we know we cannot deploy all ODH components onto a single quicklab instance with 3 workers.

Therefore, this task should result in a grouping of components and their deployments such that we can deploy these groups onto a quicklab. Documentation should follow.

The end result should be a set of N quicklab clusters that contain all components of ODH.

Update ArgoCD to 1.8.3

Issues #90 and #84 require steps similar to upgrading ArgoCD itself, and 1.8.3 introduces some fixes we could benefit from. I think we can tackle all three at the same time.

unable to oc login during quicklab setup

I am trying to set up a new Quicklab cluster following the instructions provided and I am facing an issue in step 10 here.

As I try to oc login into my cluster, I get the following error:

[ochatter@ochatter ~]$ oc login upi-0.ochattertest2.lab.upshift.rdu2.redhat.com:6443
The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): n

error: The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com

@tumido can you please help me understand what could be going wrong here?

Move argocd manifests to apps repo?

I think this repo has become a bit of a snowflake: we store all our cluster resources in the apps repo, but make an exception for the cluster RBAC/CRDs for ArgoCD because they go in this repo. I'm thinking we should just move all of the ArgoCD stuff into an ArgoCD app in the apps repo and keep this repo alive only for the Dockerfile + tag releases. We'd also move the docs, so that we have one less repo to worry about when debating where a certain piece of documentation lives. WDYT?

Edit: Laying down plan for migration here:

  • first move the argocd manifests folder, and store cluster resources in the cluster-scope app
  • move scripts to the apps repo scripts folder (update: we'll just remove these and delegate fresh-install steps to markdown)
  • move docs to the apps docs folder
  • update the downstream aicoe-cd repo manifests to point to the apps/argocd/base path
  • remove manifests/scripts/docs from this repo
  • rename this repo to a more appropriate name, since all it will house is the ArgoCD image

relationship between `continuous-deployment` and its implementation-specific incarnations

We want to maintain one upstream version of https://github.com/operate-first/continuous-deployment and make it easy for downstream users to build on top of the knowledge collected upstream. While we have control over our own downstream, e.g. CRC or QuickLab, we will not have control over a 3rd party downstream.

The amount of change introduced by a downstream may also vary: from just changing a key in KSOPS to replacing KSOPS with Vault.

For a downstream user, it should be easy to follow the documentation, without any context switching to different repositories.
It should also be really easy to incorporate all changes from upstream, once upstream introduces new best practices.

The original idea was for this continuous-deployment repo to be the upstream with no implementation specifics, and to let the other targets be forks of it.

E.g. continuous-deployment <--upstream_of-- continuous-deployment-crc

Unfortunately you can't fork into the same account/org (https://github.community/t/alternatives-to-forking-into-the-same-account/10200)

I suggest creating a new or duplicated repo continuous-deployment-crc and handling the rebasing without GitHub, as explained in https://stackoverflow.com/questions/45748400/git-fork-repo-to-same-organization

The downside is that we don't get a nice UI showing how many commits each repo is ahead of/behind the other.

Thoughts?

Write a KafkaTopic generator kustomize plugin

Write a kustomize plugin, kafkatopicgenerator:

When given an input like:

apiVersion: operatefirst/v1
kind: KafkaTopicGenerator
metadata:
  name: kafka-topics
  namespace: mynamespace
clusterName: dev
defaults:
  partitions: 2
  replicas: 3
topics:
  - topicName: exampletopicA
    partitions: 4
    config:
      retention.bytes: 853333300
      retention.ms: 172800000
  - topicName: exampletopicB

It should generate the following KafkaTopic manifests:

---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicA
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 4
  replicas: 3
  topicName: exampletopicA
  config:
    retention.bytes: 853333300
    retention.ms: 172800000
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicB
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 2
  replicas: 3
  topicName: exampletopicB

The plugin should be added under operate-first/cd/kustomizePlugins/v1/kafkatopicgenerator.
Include a readme.md in this folder covering an example use case like the one above, plus installation instructions.
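
For clarity, a sketch (the file name is assumed) of how the generator config above would be consumed from a kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  # kafka-topics.yaml would contain the KafkaTopicGenerator config shown above;
  # `kustomize build --enable_alpha_plugins` then emits the KafkaTopic manifests.
  - kafka-topics.yaml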

kustomize stdout mess running under toolbox

When running kustomize from the toolbox using the `run` command, I get an informational message on stdout, which is a problem if I want to pipe the output to oc or kubectl or anywhere else.

Compare:

$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null  | head
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev
---

vs:

$ toolbox run --container of-toolbox-v0.1.0 kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
2020/10/05 09:46:56 Attempting plugin load from '/usr/share/.config/kustomize/plugin/viaduct.ai/v1/ksops/ksops.so'
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev

Using `toolbox enter` is fine too:

$ toolbox enter --container of-toolbox-v0.1.0
...
...
$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: application-controller
    app.kubernetes.io/name: argocd-application-controller
    app.kubernetes.io/part-of: argocd
  name: argocd-application-controller
  namespace: aicoe-argocd-dev
---

So this is probably something that needs to be reported back to toolbox. Or is this expected behavior?

Add monitoring and Long Term Storage for Metrics

We can set up individual application monitoring using the ODH monitoring stack (Prometheus and Grafana operators), and for long-term storage of metrics we can use the Observatorium stack to set up the Thanos infrastructure (https://github.com/observatorium).

Proposed Solution:

  1. Deploy ODH Prometheus and ODH Grafana + the applications to monitor
  2. Determine whether we need one Prometheus instance to monitor all namespaces or individual per-namespace Prometheus instances
  3. Deploy Observatorium (disable the Loki setup)
  4. Update the ODH Prometheus manifest to remote-write to Observatorium (a rough sketch follows)
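
A minimal sketch of step 4, assuming a Prometheus Operator CR; the Observatorium receive URL, namespace, and auth mechanism are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: odh-monitoring            # placeholder name
  namespace: opf-monitoring       # placeholder namespace
spec:
  remoteWrite:
    # Forward all scraped series to the Thanos receive endpoint in Observatorium
    - url: https://observatorium-api.example.com/api/metrics/v1/opf/api/v1/receive
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token   # placeholder auth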

@HumairAK @anishasthana @durandom

Helm-Secrets plugin no longer there

The helm-secrets plugin no longer seems to be available in the repo-server pods. This causes helm builds that use secrets to fail to render manifests, which in turn instructs ArgoCD to prune those resources.

Aggregate Logging solution for operate-first deployments

We need to determine how to aggregate logs for applications running on our clusters, whether they be Thoth or ODH.

Two options here are:

  1. Use OpenShift as the logging layer to grab all STDOUT and send it via fluentd to something like Elasticsearch.
    1.1 The problem here is that Elasticsearch is not a suitable solution for all our users, so we should figure out exactly why it isn't suitable and document it.
  2. Use an open source project such as Loki or Graylog to grab logs and visualize them.
    A key point here is that integrating with the logging should be simple for whichever solution we choose.

I think starting with a POC using Loki makes the most sense, as there are already other teams at Red Hat using Loki.

Reduce manual steps for deploying ArgoCD

There are a number of manual steps involved in deploying ArgoCD; most of these are due to adding OpenShift authentication.

Brief primer on adding OpenShift OAuth:

OpenShift allows you to use a service account as an OAuth client in order to authenticate against the OAuth server. ArgoCD uses the Dex identity service to implement auth. We combine the two to achieve OpenShift auth. This essentially means we need a bunch of steps to give ArgoCD the token of the SA that's acting as the OAuth client, and we need to give the SA a redirect link to call back to. All these steps are captured in this script.

We should minimize the need for this script as much as possible. Some areas that can be automated:

  • Add a dynamic redirect here; see more info here.
  • Get rid of the callback here, as it's not needed; from the ArgoCD docs:
    • "Argo CD will automatically use the correct redirectURI"
    • link to relevant page

We should also monitor this PR; once it makes it into a release we should upgrade, update this line to a secret reference, and then manually create the dex-server service account secret as described here. The token will then automatically be picked up and we won't need to hard-code it.

This should eliminate most of the steps in the actual ArgoCD deployment bits of the script.
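
For context, a rough sketch of what the Dex OpenShift connector config in argocd-cm looks like under the SA-as-OAuth-client pattern described above; the hostnames, namespace, and secret reference are assumptions, not our actual values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: aicoe-argocd-dev
data:
  url: https://argocd.example.com            # placeholder external URL
  dex.config: |
    connectors:
      - type: openshift
        id: openshift
        name: OpenShift
        config:
          issuer: https://kubernetes.default.svc
          # The dex-server SA acting as the OAuth client
          clientID: system:serviceaccount:aicoe-argocd-dev:argocd-dex-server
          # Today the SA token ends up here by hand; the secret-reference change
          # above is what would let this be picked up automatically.
          clientSecret: $dex.openshift.clientSecret   # placeholder secret reference
          redirectURI: https://argocd.example.com/api/dex/callback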

Helm-Secrets deprecated, switch to new repo

As per this repo here

Please note, this project is no longer being maintained. There is an active fork jkroepke/helm-secrets and we will also contribute our future changes to it.

We should switch to the one here.

Enable anonymous read access

Currently an unauthenticated user sees:
[image: screenshot of the ArgoCD UI as currently seen by an unauthenticated user]

We need to fix this, since we are multi-cluster now and we might not want user parity between clusters (for the workshops, for example).
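
A minimal sketch of the relevant settings, assuming the stock argocd-cm / argocd-rbac-cm ConfigMaps in the argocd namespace: enable anonymous access and give anonymous users a read-only role.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  users.anonymous.enabled: "true"     # allow unauthenticated access
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly       # anonymous (and unmatched) users get read-only access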
