palladius / clouddeploy-platinum-path

Cloud Build + Cloud Deploy from the ground up

License: Apache License 2.0

Shell 77.80% Procfile 0.15% Python 2.16% Dockerfile 1.12% Ruby 3.97% Makefile 7.34% HCL 5.47% Perl 0.47% JavaScript 1.52%

clouddeploy-platinum-path's Introduction

palladius

My personal Palladius gem

clouddeploy-platinum-path's Issues

Can't create a VM within the ILB network: Subnetwork must have purpose=PRIVATE.

gcloud compute instances create sol0-pvt-connect --zone=$REGION-b \
    --machine-type=e2-small --network-interface=subnet=dmarzi-proxy,no-address \
    --maintenance-policy=MIGRATE --provisioning-model=STANDARD --service-account=$PROJECT_NUMBER-compute@developer.gserviceaccount.com \
    --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
    --tags=http-server,https-server \
    --create-disk=auto-delete=yes,boot=yes,device-name=sol0-pvt-connect,image=projects/ubuntu-os-cloud/global/images/ubuntu-minimal-2204-jammy-v20220712,mode=rw,size=100,type=projects/cicd-platinum-test002/zones/europe-west1-b/diskTypes/pd-balanced \
    --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --reservation-affinity=any
# ERROR: (gcloud.compute.instances.create) Could not fetch resource:
# - Subnetwork must have purpose=PRIVATE.
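
The error message says VM interfaces are only accepted on subnets with purpose=PRIVATE, so one way to diagnose this is to read the purpose of dmarzi-proxy before creating the instance. This is a minimal sketch, assuming the subnet lives in $REGION; the helper names subnet_purpose and subnet_allows_vms are hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Sketch: check a subnet's purpose before trying to attach a VM to it.
# subnet_purpose and subnet_allows_vms are hypothetical helper names.

# Print the purpose field of a subnet as reported by gcloud.
subnet_purpose() {
  local subnet="$1" region="$2"
  gcloud compute networks subnets describe "$subnet" \
    --region="$region" --format='value(purpose)'
}

# Succeed only for the purpose named in the error message above.
subnet_allows_vms() {
  [ "$1" = "PRIVATE" ]
}
```

It could be used as: subnet_allows_vms "$(subnet_purpose dmarzi-proxy "$REGION")" || echo "pick a different subnet for the VM".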

CD Stuck to 8 versions ago for app01

My app has been upgraded this morning from 2.19 to 2.20.
However, the dev service still points to v2.14. I've deleted the Service and the pod, and I've restored the Service via Cloud Deploy, but nothing changed: it is still 2.14.

Let me show you.

2.20

[DEBUG] DEBUG has been enabled! Please change to DEBUG=FALSE in your .env.sh to remove this. Some impotant fields:
[DEBUG] PROJECT_ID:        'cicd-platinum-test002'
[DEBUG] ACCOUNT:           '[email protected]'
[DEBUG] GITHUB_REPO_OWNER: 'palladius'
[DEBUG] GCLOUD_REGION:     'europe-west1'
[DEBUG] GKE_REGION:        'europe-west1'
Getting status of my 8 apps them all WOW:
== app01: python web app ==
HTTP_RESPONSE: [DEV] 200 for http://34.78.131.190:8080/ ''
HTTP_RESPONSE: [STAG] 200 for http://34.76.157.232:8080/        ''
HTTP_RESPONSE: [CANA] 200 for http://34.140.176.245:8080/       ''
HTTP_RESPONSE: [PROD] 200 for http://35.187.20.43:8080/ ''

The freshest is dev; look at http://34.78.131.190:8080/

=> App01 (🐍) v2.14 Hello world from Skaffold in python! [...] app01 (🐍) v2.14 TARGET=None

(Yes I've tried reloading and curling - same result)

01-set-up-GKE-clusters.sh fails in envs where the default network isn't automatically created

Script currently assumes default network exists and fails if it doesn't. It might be worth adding gcloud compute networks create default to the script.

Also, the command takes a while to run, so I would probably use --async and then have users run a "watch" command.
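
The suggestion above could be sketched as a guard that creates the network only when it is missing. This is a sketch under assumptions: ensure_default_network is a hypothetical helper, and auto subnet mode is assumed because that is how the stock default network is laid out.

```shell
#!/usr/bin/env bash
# Sketch: create the "default" VPC network only if it does not exist yet.
# ensure_default_network is a hypothetical helper, not part of the repo.

ensure_default_network() {
  local project="$1"
  if gcloud compute networks describe default --project="$project" >/dev/null 2>&1; then
    echo "network 'default' already exists in $project"
  else
    # Auto mode creates one subnet per region (assumption: this matches
    # what the cluster-creation script expects).
    gcloud compute networks create default \
      --project="$project" --subnet-mode=auto
  fi
}
```

Calling ensure_default_network "$PROJECT_ID" at the top of the script would make the step safe to re-run.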

Automate Cloud Build github repo import

This solution is currently not feasible.
There is an alpha solution which I started working on, but I want to wait until it no longer requires project allowlisting (at the moment I can try it, but other users can't).

Can't get step 06 (Cloud Build locally) to work

[BUG] 'Unable to list repositories' when trying to connect GitHub repo to Cloud Build

A colleague just noticed that this is blocking the GitHub connection.

Issue is tracked here: https://issuetracker.google.com/issues/251424997

It manifests as an "Unable to list repositories" error at step (3) Select repository, as described in the bug above.

I know engineers are working on it. Please follow progress in the public bug.

[sol1] Migrate sol1 single-cluster to support dev/staging which are, actually, in a single cluster

Talking to Daniel I realized that the sol1 has always been Single Cluster, not MC!

No biggie, seems like the best thing to do is to refactor the code to support Dev/Staging, which are already in a single cluster (DEV).

Also, they are 1 pod each, so demonstrating proper traffic splitting seems less hypocritical (canary/prod instead have an 80/20 split as of now).

Possible complication: currently DEV/STAG are in two different namespaces. We might have to CHANGE that if it doesn't work. In that case I need to rename staging to $PODNAME-staging in kustomize, which will potentially break other stuff. SIGH.

First invocation of ./00-init.sh returns an error

ricc@cloudshell:~/clouddeploy-platinum-path (cloud-ops-sandbox-1808198420)$ ./00-init.sh
Created [cicd-platinum-02].
Activated [cicd-platinum-02].
ERROR: (gcloud.config.configurations.activate) Cannot activate configuration [cicd-platinum-02], it does not exist.
ERROR: (gcloud.config.configurations.create) Cannot create configuration [cicd-platinum-02], it already exists.
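
One way to avoid the activate/create clash shown above is to check whether the configuration exists first and then run exactly one of the two commands. This is a minimal sketch; ensure_gcloud_config is a hypothetical helper, and how 00-init.sh actually sequences these calls is not shown in this issue.

```shell
#!/usr/bin/env bash
# Sketch: idempotent "create or activate" for a gcloud configuration.
# ensure_gcloud_config is a hypothetical helper name.

ensure_gcloud_config() {
  local name="$1"
  if gcloud config configurations describe "$name" >/dev/null 2>&1; then
    # Already there: just switch to it.
    gcloud config configurations activate "$name"
  else
    # "create" also activates the new configuration by default,
    # as the "Created / Activated" lines in the log above show.
    gcloud config configurations create "$name"
  fi
}
```

With this shape, running ensure_gcloud_config cicd-platinum-02 twice in a row should take the create branch once and the activate branch afterwards.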

[demo] Create dynamic rollout

Rob Edward wrote: To round out the narrative we use when it comes to progressive rollout and SLI/SLO (well error budget) monitoring adding visibility in Ops Suite to support and monitor the rollout etc. Although it sounds like Cloud Deploy will expose something like this in a few months.

My idea:

I can do that; I just need to abstract the YAML in script 15 (currently hard-coded to 22/78%) into a separate parametric YAML (maybe under templates/ )?
Once done, I could write a 17.sh script which does some timing like:

# "template" stands for the parametric YAML; kubectl reads it from stdin via "-f -"
sleep 1
sed 's/MYPERCENTAGE/10/g' template | kubectl apply -f -
sleep 5
sed 's/MYPERCENTAGE/50/g' template | kubectl apply -f -
sleep 5
sed 's/MYPERCENTAGE/90/g' template | kubectl apply -f -
sleep 5
sed 's/MYPERCENTAGE/100/g' template | kubectl apply -f -

just an idea..
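
The same idea can be written as a loop so that the steps and the pause become parameters. This is only a sketch: render_split, rollout and the TEMPLATE variable are hypothetical names, and the template is assumed to contain the literal placeholder MYPERCENTAGE.

```shell
#!/usr/bin/env bash
# Sketch: step a traffic percentage through a parametric YAML template.
# render_split, rollout and TEMPLATE are hypothetical names.

# render_split PCT FILE: print FILE with MYPERCENTAGE replaced by PCT.
render_split() {
  local pct="$1" template="$2"
  sed "s/MYPERCENTAGE/${pct}/g" "$template"
}

# rollout PAUSE_SECONDS PCT...: apply each percentage in order,
# sleeping PAUSE_SECONDS after every step.
rollout() {
  local pause="$1"; shift
  local pct
  for pct in "$@"; do
    render_split "$pct" "${TEMPLATE:-template}" | kubectl apply -f -
    sleep "$pause"
  done
}
```

For example, TEMPLATE=templates/split.yaml followed by rollout 5 10 50 90 100 would walk through the four steps listed above; the path is illustrative.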

[sol1] default backend - 404 for everything :/

I'm sad.

Everything seems correctly set up:

HTTPRoute
Gateway API
Service is pointing to all right things.

But still, when I curl the IP I get a 404.

[Solution2] all good but my Gateways dont get public IP :(

Seems like solution1 is now complete. I create everything correctly; the only issue is that my GKE LoadBalancers
are unable to pull a public IP.
When I did this with Daniel for the first time, it took a while (a few hours) to get it up and running, so since it's 18:00 I'll wait until tomorrow morning. Still, pasting the output:

bin/kubectl-canary get gateways | egrep "sol1-app"
[CANA] sol1-app01-eu-w1-ext-gw           sol1-app01-eu-w1-gke-l7-gxlb                           27m
[CANA] sol1-app02-eu-w1-ext-gw           sol1-app02-eu-w1-gke-l7-gxlb                           15m

and again:

🐼 bin/kubectl-canary describe gateways.gateway.networking.k8s.io sol1-app01-eu-w1-ext-gw
[DEBUG] DEBUG has been enabled! Please change to DEBUG=FALSE in your .env.sh to remove this. Some impotant fields:
[DEBUG] PROJECT_ID:        'cicd-platinum-test002'
[DEBUG] ACCOUNT:           '[email protected]'
[DEBUG] GITHUB_REPO_OWNER: 'palladius'
[DEBUG] GCLOUD_REGION:     'europe-west1'
[DEBUG] GKE_REGION:        'europe-west1'
[CANA] W0718 18:17:39.258998 1315977 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
[CANA] To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
[CANA] Name:         sol1-app01-eu-w1-ext-gw
[CANA] Namespace:    default
[CANA] Labels:       <none>
[CANA] Annotations:  <none>
[CANA] API Version:  gateway.networking.k8s.io/v1alpha2
[CANA] Kind:         Gateway
[CANA] Metadata:
[CANA] Creation Timestamp:  2022-07-18T15:48:55Z
[CANA] Generation:          1
[CANA] Managed Fields:
[CANA] API Version:  gateway.networking.k8s.io/v1alpha2
[CANA] Fields Type:  FieldsV1
[CANA] fieldsV1:
[CANA] f:metadata:
[CANA] f:annotations:
[CANA] .:
[CANA] f:kubectl.kubernetes.io/last-applied-configuration:
[CANA] f:spec:
[CANA] .:
[CANA] f:gatewayClassName:
[CANA] f:listeners:
[CANA] .:
[CANA] k:{"name":"http"}:
[CANA] .:
[CANA] f:allowedRoutes:
[CANA] .:
[CANA] f:kinds:
[CANA] f:namespaces:
[CANA] .:
[CANA] f:from:
[CANA] f:name:
[CANA] f:port:
[CANA] f:protocol:
[CANA] Manager:         kubectl-client-side-apply
[CANA] Operation:       Update
[CANA] Time:            2022-07-18T15:48:55Z
[CANA] Resource Version:  8797104
[CANA] UID:               abca00d1-c099-48ee-af3c-8227eec271f0
[CANA] Spec:
[CANA] Gateway Class Name:  sol1-app01-eu-w1-gke-l7-gxlb
[CANA] Listeners:
[CANA] Allowed Routes:
[CANA] Kinds:
[CANA] Group:  gateway.networking.k8s.io
[CANA] Kind:   HTTPRoute
[CANA] Namespaces:
[CANA] From:  Same
[CANA] Name:      http
[CANA] Port:      80
[CANA] Protocol:  HTTP
[CANA] Status:
[CANA] Conditions:
[CANA] Last Transition Time:  1970-01-01T00:00:00Z
[CANA] Message:               Waiting for controller
[CANA] Reason:                NotReconciled
[CANA] Status:                Unknown
[CANA] Type:                  Scheduled
[CANA] Events:                    <none>
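
The output above shows the Gateway referencing the class sol1-app01-eu-w1-gke-l7-gxlb. If no GatewayClass with that exact name exists in the cluster, no controller picks the Gateway up, which would be consistent with the "Waiting for controller" status; this is only one possible cause and is not confirmed by the output. A sketch of the check, with gateway_class_exists as a hypothetical helper:

```shell
#!/usr/bin/env bash
# Sketch: verify that a Gateway's gatewayClassName matches an existing GatewayClass.
# gateway_class_exists is a hypothetical helper name.

gateway_class_exists() {
  local gateway="$1" namespace="${2:-default}"
  local wanted known class
  # Class name the Gateway asks for.
  wanted="$(kubectl get gateways.gateway.networking.k8s.io "$gateway" \
    -n "$namespace" -o jsonpath='{.spec.gatewayClassName}')"
  # Space-separated names of all GatewayClasses in the cluster.
  known="$(kubectl get gatewayclasses.gateway.networking.k8s.io \
    -o jsonpath='{.items[*].metadata.name}')"
  for class in $known; do
    [ "$class" = "$wanted" ] && return 0
  done
  echo "GatewayClass '$wanted' not found; available: ${known:-none}" >&2
  return 1
}
```

Running gateway_class_exists sol1-app01-eu-w1-ext-gw against the canary cluster would show whether the class name is the mismatch.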

./08-cloud-deploy-setup.sh fails to upload skaffold.yaml

Problem

Following the docs linearly, I get the following errors when running ./08-cloud-deploy-setup.sh.

Error 1

2022-07-11T16:26:18.246728928Z Copying file:///workspace/rendered/skaffold.yaml [Content-Type=application/octet-stream]...
2022-07-11T16:26:18.427942055Z AccessDeniedException: 403 [email protected] does not have storage.objects.create access to the Google Cloud Storage object.
2022-07-11T16:26:18.453183206Z CommandException: 1 file/object could not be transferred.
2022-07-11T16:26:18.981071868Z ERROR
2022-07-11T16:26:18.981105981Z ERROR: could not upload /workspace/rendered/skaffold.yaml to gs://us-central1.deploy-artifacts.plat-path-00.appspot.com/app01-20220711-1625-v2-3-fa8aa86c7c3244e3b050e9c030241a20/canary/; err = gsutil exited with non-zero status: 1
2022-07-11T16:26:19.142109184Z / [0/1 files][ 0.0 B/ 543.0 B] 0% Done

Error 1 Fix

Granting [email protected] the Storage Object Creator role fixes the issue.

Error 2

2022-07-11T16:42:47.972602063Z starting build "d763b2ea-b6b5-486e-987c-738fb5864c41"
2022-07-11T16:42:47.972660475Z FETCHSOURCE
2022-07-11T16:42:47.972979712Z BUILD
2022-07-11T16:42:48.607019999Z Pulling image: gcr.io/k8s-skaffold/skaffold:v1.37.2-lts
2022-07-11T16:42:49.557503177Z v1.37.2-lts: Pulling from k8s-skaffold/skaffold
2022-07-11T16:42:49.562615333Z Digest: sha256:0bde2b09928ce891f4e1bfb8d957648bbece9987ec6ef3678c6542196e64e71a
2022-07-11T16:42:49.565943896Z Status: Downloaded newer image for gcr.io/k8s-skaffold/skaffold:v1.37.2-lts
2022-07-11T16:42:49.573509460Z gcr.io/k8s-skaffold/skaffold:v1.37.2-lts
2022-07-11T16:42:54.820596796Z Copying gs://plat-path-00_clouddeploy_us-central1/source/1657557764.190844-f107cc873b384a01835d89bb1df1c4ca.tgz...
2022-07-11T16:42:54.883669804Z / [0 files][ 0.0 B/216.2 KiB] / [1 files][216.2 KiB/216.2 KiB]
2022-07-11T16:42:54.883684643Z Operation completed over 1 objects/216.2 KiB.
2022-07-11T16:42:56.022407681Z profile selection ["production"] did not match those defined in any configurations. Check that values specified in the "--profile" or "-p" flags are valid profile names.
2022-07-11T16:42:56.486140146Z ERROR
2022-07-11T16:42:56.486173988Z ERROR: build step 0 "gcr.io/k8s-skaffold/skaffold:v1.37.2-lts" failed: step exited with non-zero status: 1

Error 2 Fix

Change clouddeploy.template.yaml to reference "prod" instead of "production", so that the profile name matches one defined in skaffold.yaml.
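
A quick pre-flight check for this kind of mismatch could look like the sketch below. It is a crude text match rather than a YAML parser (it would also match a "- name:" entry outside the profiles list), and skaffold_has_profile is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: check that a profile name appears as "- name: <profile>" in a skaffold.yaml.
# skaffold_has_profile is a hypothetical helper; this is a text match, not YAML parsing.

skaffold_has_profile() {
  local profile="$1" file="${2:-skaffold.yaml}"
  grep -Eq "^[[:space:]]*-[[:space:]]+name:[[:space:]]*[\"']?${profile}[\"']?[[:space:]]*$" "$file"
}
```

For instance, skaffold_has_profile production apps/app01/skaffold.yaml || echo "profile not defined" would have flagged the problem before the release was created; the file path here is illustrative.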

Error 3

2022-07-11T18:30:38.047032735Z Fetching cluster endpoint and auth data.
2022-07-11T18:30:38.131125194Z ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=403, message=Required "container.clusters.get" permission(s) for "projects/plat-path-00/locations/us-central1/clusters/cicd-dev".
2022-07-11T18:30:38.747207944Z ERROR
2022-07-11T18:30:38.747248722Z ERROR: build step 0 "gcr.io/k8s-skaffold/skaffold:v1.37.2-lts" failed: step exited with non-zero status: 1

Error 3 Fix

Granting [email protected] the Kubernetes Engine Developer role fixes the issue.
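
The role grants from the Error 1 and Error 3 fixes could be scripted roughly as follows. This is a sketch, not the repo's setup code: grant_deploy_roles is a hypothetical helper, and the service account address must be supplied by the caller because it is redacted in the logs above.

```shell
#!/usr/bin/env bash
# Sketch: grant Storage Object Creator and Kubernetes Engine Developer
# to one service account. grant_deploy_roles is a hypothetical helper;
# the service account email is an input, since it is redacted in the logs.

grant_deploy_roles() {
  local project="$1" sa_email="$2" role
  for role in roles/storage.objectCreator roles/container.developer; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="serviceAccount:${sa_email}" \
      --role="$role" >/dev/null
  done
}
```

Usage would be grant_deploy_roles plat-path-00 "$SA_EMAIL", where SA_EMAIL holds the account named in the 403 messages.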

Error 4

2022-07-11T18:50:11.982941521Z Fetching cluster endpoint and auth data.
2022-07-11T18:50:12.228428165Z kubeconfig entry generated for cicd-dev.
2022-07-11T18:50:13.210498953Z Starting deploy...
2022-07-11T18:50:18.973523079Z - service/app02-kuruby created
2022-07-11T18:50:20.485026404Z - Warning: Autopilot set default resource requests for Deployment default/app02-kuruby, as resource requests were not specified. See http://g.co/gke/autopilot-defaults.
2022-07-11T18:50:20.485711420Z - deployment.apps/app02-kuruby created
2022-07-11T18:50:20.487990601Z Waiting for deployments to stabilize...
2022-07-11T18:50:27.532443368Z - deployment/app02-kuruby: no nodes available to schedule pods
2022-07-11T18:50:27.532455203Z - pod/app02-kuruby-68f787f7-mcpwj: no nodes available to schedule pods
2022-07-11T18:53:04.685079362Z - deployment/app02-kuruby: container app02-kuruby is waiting to start: ricc-app01-ruby-example can't be pulled
2022-07-11T18:53:04.685090778Z - pod/app02-kuruby-68f787f7-mcpwj: container app02-kuruby is waiting to start: ricc-app01-ruby-example can't be pulled
2022-07-11T18:53:04.685092281Z - deployment/app02-kuruby failed. Error: container app02-kuruby is waiting to start: ricc-app01-ruby-example can't be pulled.
2022-07-11T18:53:04.686693150Z 1/1 deployment(s) failed
2022-07-11T18:53:04.956703356Z ERROR
2022-07-11T18:53:04.956730985Z ERROR: build step 0 "gcr.io/k8s-skaffold/skaffold:v1.37.2-lts" failed: step exited with non-zero status: 1

Error 4 Fix

TBD

[Solution2] currently broken

Solution 2 is currently broken.

Looks like Canary works for both Python and Ruby, but Prod has zero selectors.

Move namespace logic from app01/02 to common components/ folder

Is there a reason why the namespace part was put in the apps: https://github.com/palladius/clouddeploy-platinum-path/blob/main/apps/app01/k8s/overlays/dev/namespace.yaml
and not consolidated in the base components/ part (https://github.com/palladius/clouddeploy-platinum-path/tree/main/components/common/dev)?

I don't see a reason to do this within apps instead of on component-level. And I would strongly recommend lifting this into component like you are proposing. 👍

[sol1] Currently supporting single cluster

Talked to Alex who pointed out this cannot work in MC mode.
Talked to Daniel who confirmed sol1 does NOT support MC.

I see three solutions:

NOOP: do nothing and just update the docs.
Move to the DEV cluster, where I actually DO have 2 targets and can demonstrate that. I just need to add 4 pods to staging rather than to prod. Beware of the doppel-namespace though ;)
Fix and make it MC instead.

Cleanup Kustomize

3 problems with kustomize:

code is scattered across 2 dirs and should be in one only: see below
code was just a test for me to learn, and 2 weeks later it is destined to become the number one repo on GitHub :) I need to do some cleanup and make sure it all works.
Does it make sense to use kustomize at all? I mean, the ONLY use which is REALLY important is: canary 9 pods, prod 1 pod, and a common selector for canary and prod. Could I just create 4 copies of the k8s manifests to increase readability?

I'd love a kustomize expert to help me on this.

scattered code (1)

ricc@ricc:~/git/clouddeploy-platinum-path$ 🐼 find . -name components
./components
./apps/components

Script 15.sh works WELL but it's not really multi-tenant

Took me a week to fix it, but when I finally got it to work last night, I noticed that script 15-solution2-xlb-GFE3-traffic-split.sh is not 100% multi-tenant.

While it's now able to load-balance with traffic splitting for a generic app01 // app02, some artifacts (like FWD_RULE) have a fixed name, which doesn't allow me to make them parametric in APP_ID. This means that once you call it with app01, it creates GCP infrastructure to route to app01; then you call it with app02 and it changes that same infrastructure to route to app02.
There's currently NO WAY to route both differently and separately.

Things I need to change in 15-solution2-xlb-GFE3-traffic-split.sh:

URLMAP to add APPXX
VirtualHosts to add APPXX

To keep things easy, I will enforce this best practice: every new name shall be:

new_var := $APPXX-$OLDNAME

and to make this even obviouser ;) I will rename the variables from

OLD_VAR_NAME => OLD_VAR_NAME_MTSUFFIX

for ease of grepping.
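
The naming rule above is small enough to capture in one helper. This is a sketch; mt_name and the variable names in the usage note are hypothetical, not taken from script 15.

```shell
#!/usr/bin/env bash
# Sketch of the naming rule: new name = <APP_ID>-<old name>.
# mt_name is a hypothetical helper name.

mt_name() {
  local app_id="$1" old_name="$2"
  printf '%s-%s\n' "$app_id" "$old_name"
}
```

A renamed variable would then be set as URLMAP_NAME_MTSUFFIX="$(mt_name "$APP_ID" "$URLMAP_NAME")", so every per-app resource name carries the app prefix and every multi-tenant variable is easy to grep by its suffix.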

Autotagging in CB script2 (currently coded but broken)

[from bielski:]

The image tagging in cloud-build/02-dev-to-staging-auto-promo script fails.

Why does it fail?
In order to recreate an existing tag (which the used gcloud command seems to do), the service account would need artifactregistry.tags.delete permission which is not included in the artifactregistry.writer role.
There are multiple tags that are potentially being overwritten:
latest
latest-cb2
$DOCKER_IMAGE_VERSION is the same as vSUPERDUPER_MAGIC_VERSION which also creates a conflict.

Solutions:
a) give an additional role to cloud build svc acc (roles/artifactregistry.repoAdmin)
b) prevent tagging duplicates (which would prevent us from using latest tag)
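
Option (b) could be sketched as a filter that drops every tag already present on the image before tagging. This is a sketch under assumptions: new_tags is a hypothetical helper, and how the list of existing tags is obtained is left to the caller. As noted in (b), a moving tag such as latest would never be updated this way.

```shell
#!/usr/bin/env bash
# Sketch for option (b): only add tags that do not exist yet, so no tag
# has to be deleted and recreated. new_tags is a hypothetical helper.

# new_tags EXISTING WANTED...: print the WANTED tags missing from EXISTING,
# where EXISTING is a whitespace-separated list of current tags.
new_tags() {
  local existing tag
  existing=" $(printf '%s' "$1" | tr -s '[:space:]' ' ') "
  shift
  for tag in "$@"; do
    case "$existing" in
      *" $tag "*) ;;                 # already present: skip it
      *) printf '%s\n' "$tag" ;;     # missing: safe to create
    esac
  done
}
```

For example, new_tags "latest v2.19" latest v2.20 prints only v2.20.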

add to ComputeSvcAccount 2 roles [from Alex]

$ gcloud projects get-iam-policy ricc-cicd  \
--flatten="bindings[].members" \
--format='table(bindings.role)' \
--filter="bindings.members:[email protected]"

ROLE
roles/artifactregistry.reader
roles/container.developer                     <= was missing
roles/container.nodeServiceAgent
roles/storage.objectAdmin                     <= was missing

am I right Alex?

HurricaneAlex: remove Deployments and leave only Services for sol123

My code is broken. I was adding Deployments which pointed to non-hydrated images for my solutions. This was silly.
I just need to point Services at my existing 2 apps (8 apps actually).
Why am I trying to substitute Cloud Deploy with bash?!? :P

This bug tracks this execution since it's about:

remove from prod all broken manifests
patching troth solutions removing deployment
pushing to prod the new manifests

So quite complex :)