operate-first / continuous-deployment
Continuous Deployment
License: GNU General Public License v3.0
Don't fork the upstream ODH manifest (or the internal DH manifest); instead, reference the upstream ODH manifest as a remote base.
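A minimal sketch of what referencing the upstream manifests as a remote base could look like in a kustomization.yaml; the repository URL, path, and ref below are placeholders, not the actual ODH layout:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Pull the ODH manifests straight from the upstream repo instead of a fork.
  # URL, path, and ref are illustrative only.
  - github.com/opendatahub-io/odh-manifests//odh-common/base?ref=master
patchesStrategicMerge:
  # Environment-specific tweaks still live locally in the overlay.
  - odh-resource-limits.yaml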
We want quicklab/crc deployments to replicate how we deploy on MOC as closely as possible. This means using the ODH repo to deploy the manifests using ArgoCD. There may be some issues with resource usage; part of this issue should be investigating how to reconcile such problems.
This issue should build on the issue here and follow on from operate-first/apps#9
We have a bunch of docs in the aicoe-cd repo that should be moved here. The docs that are team specific should remain in the aicoe-cd repo; the rest can be moved here.
An example of docs not to be moved: Permissions.
The docs also reference links that point into the aicoe-cd repo; these should be converted to point here where applicable (do not use relative paths for those links; use the full URL, which is needed for mkdocs to work).
The docs structure should remain the same and be found in the root directory.
The sops.yaml in the root folder is a dev sops yaml for testing/demo purposes. Its private key is exposed, so it should not be used for anything other than encrypting things that are not confidential. Keeping it in the root folder could result in files in the moc-cnv overlay being encrypted with this exposed GPG key, which would be ... not good.
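One way to reduce that risk, sketched below with placeholder fingerprints and paths, is to scope the dev key via creation_rules in the sops config so that files under the moc-cnv overlay can never match it:

creation_rules:
  # Only demo/testing files may use the exposed dev key (placeholder fingerprint).
  - path_regex: examples/.*\.enc\.yaml$
    pgp: "0000000000000000000000000000000000000000"
  # Everything else, including the moc-cnv overlay, must use the real key.
  - path_regex: manifests/overlays/.*\.enc\.yaml$
    pgp: "1111111111111111111111111111111111111111"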
ArgoCD is live on MOC and we have created the appropriate projects/permissions to start deploying applications.
Before we do that I think we should have a brief discussion on how we should set up our applications on argocd declaratively.
For Context:
ArgoCD Application manifests are a declarative way to manage ArgoCD applications in git. Traditionally we've stored these alongside ArgoCD deployment manifests, like IDH has done here.
This has been fine in the past since we controlled the deployment of ArgoCD and had merge access to the repo where the applications were stored. So if we wanted to onboard a new app, we would make a PR with the application manifest and someone on our team would merge it (see this PR as an example).
But now we have a situation where MOC manages Argocd here. It was added to this repo because it's a cluster-wide argocd that can be used to manage cluster resources as well.
The Problem:
If we applied our current practice, we'd store our app manifests here. The problem is that we don't have merge access to this repo, and it wouldn't really make much sense for the people who manage the infrastructure to also handle PRs that don't pertain directly to cluster management.
Proposed Solution
To reconcile this dilemma I'd like to put forth the following suggestion for how we can organize our repository/ArgoCD applications:
The Infra Repo is analogous to the moc-cnv-sandbox repo here, but could be replaced with another repo as well.
The idea here is that all our operate-first/thoth/data-science ArgoCD Applications would go in the opf-argocd-apps repo. Then we'd have an App of Apps, i.e. the OPF Parent App, that manages all these apps. This way we can add new applications declaratively to ArgoCD without having to make PRs to the Infra Repo (or moc-cnv-sandbox). Operate First admins would manage the opf-argocd-apps repo. Any other ArgoCD applications that manage cluster resources like clusterrolebindings or operator subscriptions etc. can remain in the infra repo, since that's a concern for cluster admins. We would direct any user of MOC who wants to use ArgoCD to manage their apps to add their ArgoCD apps to the opf-argocd-apps repo.
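For illustration, a rough sketch of what the OPF Parent App could look like as an app-of-apps; the repo URL, path, and project name are placeholders, not settled decisions:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: opf-parent-app
  namespace: argocd
spec:
  # The ArgoCD project that constrains what this app may deploy.
  project: operate-first
  source:
    # Placeholder: the opf-argocd-apps repo holding the child Application manifests.
    repoURL: https://github.com/operate-first/opf-argocd-apps.git
    targetRevision: HEAD
    path: applications
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true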
Pros:
Cons:
One concern here is that there is no way to automatically enforce that Applications in the opf-argocd-apps repo belong to the Operate First ArgoCD project (see diagram). Why is this a problem? Because we use ArgoCD projects to restrict what types of resources applications in that project can deploy. For example, ArgoCD apps in the Infra Apps project in the diagram can deploy clusterrolebindings, operators, etc. So while the OPF Parent App cannot deploy clusterrolebindings because it belongs to the Operate First ArgoCD project, it could deploy another ArgoCD application that belongs to Infra Apps, and that ArgoCD app could deploy clusterrolebindings.
You can read more about this issue here. The individual there used admission hooks to get around this but I don't think we want to go there just yet. My suggestion is we begin by enforcing this at the PR level, and transition to maybe catching this in CI until there's a proper solution upstream.
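For context, a sketch of how a restrictive AppProject could look; the names and whitelists below are illustrative, not the actual MOC configuration:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: operate-first
  namespace: argocd
spec:
  description: Operate First team applications (illustrative)
  sourceRepos:
    - https://github.com/operate-first/*
  destinations:
    - server: https://kubernetes.default.svc
      namespace: opf-*
  # No cluster-scoped resources are whitelisted, so apps in this project cannot
  # create clusterrolebindings; those stay with the Infra Apps project.
  clusterResourceWhitelist: []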
As described here, we can have a common global project from which all permissions are inherited. Since most of the permissions are essentially copied and pasted, this would make it significantly easier to account for what permissions each team has when deploying via ArgoCD onto their team's respective projects.
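A sketch of how such a global project could be wired up via the argocd-cm ConfigMap's globalProjects setting; the label key and project names are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  globalProjects: |
    # Any AppProject carrying this label inherits the rules of opf-global.
    - labelSelector:
        matchExpressions:
          - key: operate-first.cloud/inherit-global
            operator: In
            values:
              - "true"
      projectName: opf-global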
cd -- the repo name is a bit vague
Suggestions?
I would like to start a discussion about using kustomize plugins to extend the functionality of kustomize to meet our needs. Some existing problems right now:
base)
aicoe-cd to this repo
All of these issues (save for the last one, maybe) can be easily solved by writing some quick Kustomize plugins.
I was thinking we would include a subdirectory in this repo called kustomize_plugins/operate-first/v1/... where we write and add these plugins.
We would then include these plugins as part of the ArgoCD image.
In order to use these plugins locally, you would just cp this plugin folder into your $XDG_CONFIG_HOME/kustomize/plugin/ folder.
The plugins themselves are very straightforward and easy to write (often only ~50-100 lines), and can be written in bash/go/python or essentially any language (we would likely just use python, something like this).
What do you guys think?
Copy all content from https://github.com/AICoE/aicoe-cd to this repo, with any implementation details stripped, e.g. cluster names, dev, prod, etc.
The idea is to use this repo as an upstream repo, which is agnostic of the target environment.
The only app to be included should be the argocd guestbook example
We need to figure out a good way to gate/promote changes between environments. Right now we generally default to just updating the base manifest or all the overlays at the same time.
Assume that we have the following sets of manifests (and environments), all managed by ArgoCD
dev -> stage -> prod-1, prod-2
Our current options are:
OPTION 1: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> deploy and test changes in prod-*
Pros:
Cons:
OPTION 2: Test and verify changes in dev -> update the base manifests via PR -> deploy and test changes in stage -> create a new tagged release after a number of changes are merged into master -> update the tagged release that the prod-* ArgoCD apps point to (see the sketch after this list for pinning a prod overlay to a tag).
Pros:
Cons:
OPTION 3: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to prod-* overlays -> deploy and verify changes in prod-* (this means you basically have 3 copies of the same manifests)
Pros:
Cons:
OPTION 4: Test and verify changes in dev -> PR to stage overlays -> deploy and verify changes in stage -> PR to revert changes in the stage manifests and update the base manifests -> deploy and verify changes in prod-*
Pros:
Cons:
OPTION 5: Test and verify changes in dev -> PR to update the base manifests -> deploy in stage -> use ArgoCD resource hooks to verify changes -> trigger sync in prod-* manifests
Pros:
Cons:
OPTION 6: Similar to above, but use something like keptn for the change gating instead
Pros:
Cons:
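To make OPTION 2 a bit more concrete, here is a sketch of a prod overlay whose remote base is pinned to a tagged release; the repo path and tag are placeholders:

# manifests/overlays/prod-1/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # prod-1 consumes the base manifests at a fixed tag, while dev/stage track master.
  - github.com/operate-first/continuous-deployment//manifests/base?ref=v0.2.0
patchesStrategicMerge:
  - prod-resource-limits.yaml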
The purpose of this exercise is to identify logical groupings of namespaces to components.
From previous attempts we know we cannot deploy all ODH components onto a single quicklab instance with 3 workers.
Therefore, this task should result in a grouping of components and their deployments such that we can deploy these groups onto a quicklab. Documents should follow.
The end result should be a set of N quicklab clusters that contain all components of ODH.
to deploy ODH
Create an implementation specific fork of https://github.com/operate-first/cd to target quicklab clusters
Create an implementation specific fork of https://github.com/operate-first/cd to target https://code-ready.github.io/crc/
Currently the ArgoCD image is built from this Dockerfile here: https://github.com/AICoE/aicoe-cd/blob/master/Dockerfile
We want to move this dockerfile to this repo and have builds triggered on tagged releases from this repo.
This ticket should be followed up with the maintainers of the aicoe-cd repo to have the Dockerfile removed from that repo.
I am trying to set up a new Quicklab cluster following the instructions provided and I am facing an issue in step 10 here.
As I try to oc login into my cluster, I get the following error:
[ochatter@ochatter ~]$ oc login upi-0.ochattertest2.lab.upshift.rdu2.redhat.com:6443
The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com
You can bypass the certificate check, but any data you send to the server could be intercepted by others.
Use insecure connections? (y/n): n
error: The server is using a certificate that does not match its hostname: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, 172.30.0.1, not upi-0.ochattertest2.lab.upshift.rdu2.redhat.com
@tumido can you please help me understand what could be going wrong here?
Add an image stream pulling image from aicoe-aiops/categorical-encoding#2 into ODH deployed via O1.
Create an implementation specific fork of https://github.com/operate-first/cd to target MOC CNV cluster
To remove the dependency on installing kustomize or sops locally, as mentioned in https://aicoe.github.io/aicoe-cd/setup_argocd_dev_enviornment/, can we have those in containers and add shell aliases?
Let's have a toolbox container for this based on Fedora... see https://github.com/thoth-station/thoth-toolbox
Update images to use local image streams and set:
referencePolicy:
  type: Local
so that we don't get rate limited by Docker Hub.
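A sketch of an image stream tag using the Local reference policy; the image name and tag are placeholders:

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: argocd
  namespace: argocd
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example/argocd:latest   # placeholder image reference
      # Local makes pods pull the image from the internal registry,
      # avoiding repeated external pulls and registry rate limits.
      referencePolicy:
        type: Local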
We should identify how long it takes to deploy argocd + odh + components on a dev ocp cluster (crc or quicklab) and whether this is an acceptable time frame.
If it is not, this should result in a concrete solution (e.g. a script) in the form of another issue.
ArgoCD attempts to verify the GPG keys that all commits are signed with, resulting in an error stating that a given GPG key is invalid. While one option is to push all contributor public keys to ArgoCD, this is not needed, as GitHub/GitLab already perform these checks for us. The change to disable it looks like a fairly straightforward patch too.
https://argoproj.github.io/argo-cd/user-guide/gpg-verification/
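Assuming verification is being enforced through the project's signatureKeys, a sketch of the patch could be as small as clearing that list on the affected AppProject (the project name below is a placeholder):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: operate-first
  namespace: argocd
spec:
  # An empty signatureKeys list means ArgoCD does not verify commit signatures
  # for applications in this project.
  signatureKeys: []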
I think this repo has become a bit of a snowflake: we store all our cluster resources in the apps repo but make an exception for the cluster RBAC/CRDs for ArgoCD, because they go in this repo. I'm thinking we should just move all of the ArgoCD stuff into an ArgoCD app in the apps repo, and keep this repo alive only for the Dockerfile + tagged releases. We would also move the docs, so that we have one less repo to worry about when debating where a certain piece of documentation lives. WDYT?
Edit: Laying down plan for migration here:
aicoe-cd repo manifests to point to the apps/argocd/base path
based on the quicklab steps
We want to essentially imitate the aicoe-cd CI setup for this repo, using the same configuration files as in that repo.
See: https://github.com/AICoE/aicoe-ci#setting-aicoe-ci-on-github-organizationrepository on how to do this.
Instead of Docker Hub, use Quay for the ArgoCD images; find them here
We want to maintain one upstream version of https://github.com/operate-first/continuous-deployment and make it easy for downstream users to build on top of the knowledge collected upstream. While we have control over our own downstream, e.g. CRC or QuickLab, we will not have control over a 3rd party downstream.
The amount of change introduced by a downstream may also vary: from just changing a key in KSOPS to replacing KSOPS with Vault.
For a downstream user, it should be easy to follow the documentation, without any context switching to different repositories.
It should also be really easy to incorporate all changes from upstream, once upstream introduces new best practices.
The original idea was to have this continous-deployment repo be the upstream with no implementation specifics and let other targets be forks of it.
E.g. continous-deployment <--upstream_of-- continous-deployment-crc
Unfortunately you can't fork into the same account/org (https://github.community/t/alternatives-to-forking-into-the-same-account/10200)
I suggest creating a new or duplicate repo continous-deployment-crc and handling the rebasing without GH, just as explained in https://stackoverflow.com/questions/45748400/git-fork-repo-to-same-organization
The downside is that we don't get a nice UI showing how many commits each repo is ahead/behind the other one.
Thoughts?
Write a kustomize plugin kafkatopicgenerator:
When given an input like:
apiVersion: operatefirst/v1
kind: KafkaTopicGenerator
metadata:
  name: kafka-topics
  namespace: mynamespace
clusterName: dev
defaults:
  partitions: 2
  replicas: 3
topics:
  - topicName: exampletopicA
    partitions: 4
    config:
      retention.bytes: 853333300
      retention.ms: 172800000
  - topicName: exampletopicB
Should generate kafka topic manifest:
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicA
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 4
  replicas: 3
  topicName: exampletopicA
  config:
    retention.bytes: 853333300
    retention.ms: 172800000
---
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  name: exampletopicB
  namespace: mynamespace
  labels:
    strimzi.io/cluster: dev
    template: kafka-topics-template
spec:
  partitions: 2
  replicas: 3
  topicName: exampletopicB
The plugin should be added under operate-first/cd/kustomizePlugins/v1/kafkatopicgenerator.
Include a readme.md in this folder with an example use case like the one above, and installation instructions.
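Once the plugin exists, an overlay could invoke it roughly like this (the file name is illustrative; plugins still need --enable_alpha_plugins as elsewhere in this repo):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  # The KafkaTopicGenerator config shown above, saved alongside the overlay.
  - kafka-topics-generator.yaml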
When running kustomize from the toolbox using the run command, I get an informational message on stdout, which is a problem if I want to pipe the output to oc or kubectl or anywhere else.
Compare:
$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/component: application-controller
app.kubernetes.io/name: argocd-application-controller
app.kubernetes.io/part-of: argocd
name: argocd-application-controller
namespace: aicoe-argocd-dev
---
vs:
$ toolbox run --container of-toolbox-v0.1.0 kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
2020/10/05 09:46:56 Attempting plugin load from '/usr/share/.config/kustomize/plugin/viaduct.ai/v1/ksops/ksops.so'
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/component: application-controller
app.kubernetes.io/name: argocd-application-controller
app.kubernetes.io/part-of: argocd
name: argocd-application-controller
namespace: aicoe-argocd-dev
Using toolbox enter is OK too:
$ toolbox enter --container of-toolbox-v0.1.0
...
...
$ kustomize build manifests/overlays/dev --enable_alpha_plugins 2>/dev/null | head
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app.kubernetes.io/component: application-controller
app.kubernetes.io/name: argocd-application-controller
app.kubernetes.io/part-of: argocd
name: argocd-application-controller
namespace: aicoe-argocd-dev
---
So this is probably something that needs to be reported back to toolbox. Or is it an expected behavior?
We can set up individual application monitoring using the ODH monitoring stack (the Prometheus and Grafana operators), and for long-term storage of metrics we can use the Observatorium stack to set up the Thanos infrastructure (https://github.com/observatorium).
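For the per-application piece, a sketch of a ServiceMonitor that the Prometheus operator could pick up; names, labels, and the port are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: opf-monitoring
  labels:
    team: operate-first          # placeholder label for the Prometheus CR to select on
spec:
  selector:
    matchLabels:
      app: example-app           # placeholder: the Service exposing the metrics port
  endpoints:
    - port: metrics
      interval: 30s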
Proposed Solution:
The helm-secrets plugin doesn't seem to be available in the repo-server pods. This results in helm builds that use secrets failing to build manifests, and in ArgoCD being instructed to prune those resources.
We need to determine how to aggregate logs for applications running on our clusters, whether they be Thoth or ODH.
Two options here are:
I think starting with a POC using Loki makes most sense as there are already other teams at Red Hat using Loki.
There are a number of manual steps involved in deploying ArgoCD; most of these are due to adding OpenShift authentication.
Brief primer on adding Openshift Oauth:
OpenShift allows you to use a service account as an OAuth client in order to authenticate against the OAuth server. ArgoCD uses the Dex identity service for implementing auth. We combine the two to achieve OpenShift auth. This essentially means we need to do a bunch of steps to give ArgoCD the SA token that's acting as the OAuth client, and we need to give the SA a redirect URI to call back to. All these steps are captured in this script.
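For reference, a sketch of the service account acting as the OAuth client; the SA name, namespace, and redirect URI below are placeholders rather than the values the script uses:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-dex-server
  namespace: argocd
  annotations:
    # Registers this SA as an OAuth client; OpenShift allows callbacks to this
    # redirect URI after login. The hostname is a placeholder.
    serviceaccounts.openshift.io/oauth-redirecturi.argocd: https://argocd.example.com/api/dex/callback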
We should minimize the need for this script as much as possible. Some areas that can be automated:
Argo CD will automatically use the correct redirectURI
We should also monitor this PR; once it makes it into a release we should update and change this line to a secret reference, then manually create the dex-server service account secret as described here; the token will then automatically be picked up and we won't need to hard-code it.
This should reduce most of the steps in the actual argocd deployment bits in the script.
Currently OpenShift 4 in Quicklab is lacking persistent storage: there are no PVs available on the cluster. Let's hope the Quicklab team can help us out with that: