testing's Introduction

cert-manager project logo


cert-manager

cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters, and simplifies the process of obtaining, renewing and using those certificates.

It supports issuing certificates from a variety of sources, including Let's Encrypt (ACME), HashiCorp Vault, and Venafi TPP / TLS Protect Cloud, as well as local in-cluster issuance.

cert-manager also ensures certificates remain valid and up to date, attempting to renew certificates at an appropriate time before expiry to reduce the risk of outages and remove toil.
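As an illustration, here is a minimal sketch of the Certificate resource that cert-manager introduces (the names are hypothetical placeholders, and an Issuer must already exist in the namespace):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com              # hypothetical name
  namespace: default
spec:
  secretName: example-com-tls    # Secret in which the signed certificate is stored
  dnsNames:
    - example.com
  issuerRef:
    name: my-issuer              # assumes this Issuer already exists
    kind: Issuer

cert-manager then keeps the named Secret populated with a valid key pair, re-issuing it as expiry approaches.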

cert-manager high level overview diagram

Documentation

Documentation for cert-manager can be found at cert-manager.io.

For the common use-case of automatically issuing TLS certificates for Ingress resources, see the cert-manager nginx-ingress quick start guide.

For a more comprehensive guide to issuing your first certificate, see our getting started guide.

Installation

Installation is documented on the website, with a variety of supported methods.

Developing cert-manager

We actively welcome contributions and we support both Linux and macOS environments for development.

Different platforms have different requirements; we document everything on our Building cert-manager website page.

Note in particular that macOS has several extra requirements, to ensure that modern tools are installed and available. Read the page before getting started!

Troubleshooting

If you encounter any issues whilst using cert-manager, we have a number of ways to get help:

If you believe you've found a bug and cannot find an existing issue, feel free to open a new issue! Be sure to include as much information as you can about your environment.

Community

The cert-manager-dev Google Group is used for project wide announcements and development coordination. Anybody can join the group by visiting here and clicking "Join Group". A Google account is required to join the group.

Meetings

We have several public meetings which any member of our Google Group is more than welcome to join!

Check out the details on our website. Feel free to drop in and ask questions, chat with us or just to say hi!

Contributing

We welcome pull requests with open arms! There's a lot of work to do here, and we're especially concerned with ensuring the longevity and reliability of the project. The contributing guide will help you get started.

Coding Conventions

Code style guidelines are documented on the coding conventions page of the cert-manager website. Please try to follow those guidelines if you're submitting a pull request for cert-manager.

Importing cert-manager as a Module

โš ๏ธ Please note that cert-manager does not currently provide a Go module compatibility guarantee. That means that most code under pkg/ is subject to change in a breaking way, even between minor or patch releases and even if the code is currently publicly exported.

The lack of a Go module compatibility guarantee does not affect API version guarantees under the Kubernetes Deprecation Policy.

For more details see Importing cert-manager in Go on the cert-manager website.

The import path for cert-manager versions 1.8 and later is github.com/cert-manager/cert-manager.

For all versions of cert-manager before 1.8, including minor and patch releases, the import path is github.com/jetstack/cert-manager.

Security Reporting

Security is the number one priority for cert-manager. If you think you've found a security vulnerability, we'd love to hear from you.

Follow the instructions in SECURITY.md to make a report.

Changelog

Every release on GitHub has a changelog, and we also publish release notes on the website.

History

cert-manager is loosely based upon the work of kube-lego and has borrowed some wisdom from other similar projects such as kube-cert-manager.

Logo design by Zoe Paterson


testing's Issues

Update triage-party to fix the "similar" bug

One annoying thing I noticed yesterday while triaging on https://triage.build-infra.jetstack.net is that the similar label sometimes appears more than 50 times:

(screenshot: similar-label-bug)

The issue, tracked in google/triage-party#196, was solved in google/triage-party#204 and is included in Triage Party v1.3.0; we are running v1.2.1. I searched for an upgrade process but there does not seem to be a documented one, presumably because Triage Party relies on a cache that can simply be reconstructed after the upgrade.

Automate bumping Prow itself

This can be split into a two stage process:

  1. Add a postsubmit that runs bazel run //prow/cluster:production.apply - this will be the first step, and what ensures the repo is in sync with the actual Prow cluster

  2. Add a periodic job that looks up the latest Prow image tag, updates all relevant manifests, and creates a PR to bump image tags across the repo.

Once we have these two, we can configure the periodic to run on some schedule, and this way bumping Prow should be as simple as /lgtm and /approve 😄
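A sketch of the step-1 postsubmit (the job name and builder image are hypothetical):

postsubmits:
  jetstack/testing:
    - name: post-testing-deploy-prow      # hypothetical job name
      branches:
        - ^master$
      decorate: true
      spec:
        containers:
          - image: gcr.io/example/bazel   # hypothetical image with bazel installed
            command:
              - bazel
              - run
              - //prow/cluster:production.apply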

Fix failing Pod row on cert-manager-website-update-index

Currently all cert-manager-website-update-index jobs are marked as failing due to the new TestGrid Pod row.

We fixed this issue for jobs running in the build-infra-workers cluster by enabling the GCS reporter on Prow's crier and manually creating some RBAC for crier in that cluster so the reporter can gather the necessary information (a temporary solution whilst we haven't yet automated the test infra setup).

cert-manager-website-update-index runs in our 'trusted' cluster, so it is perhaps a better solution to disable the Pod row on that particular job than to hack together some temporary RBAC there.

The row can be disabled by setting the disable_prowjob_analysis field on the particular test group in the TestGrid config, like it's done here.
However, Jetstack's TestGrid config is autogenerated by transfigure from our Prow config, job configs and a base TestGrid config, and I was not able to find a way to get this field generated with our current setup.

I have opened a PR against kubernetes/test-infra that would allow doing this via a ProwJob annotation.
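For reference, a sketch of what the relevant TestGrid test group could look like with the field set (the group name and GCS prefix are hypothetical):

test_groups:
  - name: cert-manager-website-update-index    # hypothetical test group name
    gcs_prefix: jetstack-logs/logs/cert-manager-website-update-index
    disable_prowjob_analysis: true              # hides the failing Pod row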

An alternative solution would be to create RBAC for Prow's crier in the trusted cluster.

Add more verify scripts to hack/

We need verify scripts for:

  • running gazelle
  • running kazel
  • ensuring the repo builds (aka bazel build //...)
  • running gofmt
  • checking boilerplate headers
  • verifying config (should call checkconfig at a particular path, or run the docker image)

Plank is still looking for GKE cluster

We removed the gke cluster from Prow's config, but it appears that Plank is still looking for it and throwing errors:

{"component":"plank","error":"errors syncing: [error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\"]","file":"prow/plank/controller.go:170","func":"k8s.io/test-infra/prow/plank.(*Controller).Start","level":"error","msg":"Error syncing.","severity":"error","time":"2021-05-28T08:27:58Z"}
{"component":"plank","duration":"57.521714ms","file":"prow/plank/controller.go:172","func":"k8s.io/test-infra/prow/plank.(*Controller).Start","level":"info","msg":"Synced","severity":"info","time":"2021-05-28T08:27:58Z"}
{"component":"plank","error":"errors syncing: [error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\" error starting pod : unknown cluster alias \"gke\"]","file":"prow/plank/controller.go:170","func":"k8s.io/test-infra/prow/plank.(*Controller).Start","level":"error","msg":"Error syncing.","severity":"error","time":"2021-05-28T08:28:28Z"}
{"component":"plank","duration":"34.927807ms","file":"prow/plank/controller.go:172","func":"k8s.io/test-infra/prow/plank.(*Controller).Start","level":"info","msg":"Synced","severity":"info","time":"2021-05-28T08:28:28Z"}

See related discussion on Slack https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1622138455202100

Configure Bazel build cache

As more of our jobs move over to Bazel, we should consider enabling the build caching feature to speed up tests.

This shouldn't be too difficult to enable, but we'll need to consider how it works in a multi-repo context.

No kind logs when a job times out after 2 hours

When the Prow runner times out (2 hours), two unwanted behaviors occur:

  1. The container logs (e.g., pebble) aren't uploaded to the "artifacts" because make kind-logs does not seem to be called,
  2. The Prow UI doesn't show the test cases that passed and failed because Ginkgo does not produce any XML output.

Initially, I thought I had forgotten to set a trap for SIGINT, which gets sent when the timeout is reached, because Ginkgo seemed to keep going without stopping. But in reality, these two symptoms are due to the runner script not retransmitting SIGINT to its children when it receives it. This is because bash ignores signals while a foreground child process started with command "$@" is executing.

Here is what a Prow job looks like:

command: /tools/entrypoint
args:
  - runner
  - bash
  - -c
  - |
    apt-get install jq -y >/dev/null
    make -j vendor-go e2e-ci K8S_VERSION=1.23

Let us mimic what runner does:

$ bash -c 'echo $$; command bash -c "trap \"echo CLEANING UP\" EXIT; sleep 1000"' &
# <----------------------------------------------------------------------------->
#                 Mimicks the command "runner".
#                  <------------------------------------------------------------>
#                  Mimicks the command "bash -c apt-get..."
#                  that is run inside the "runner" script with
#                  the line "command $@".

It should output the PID of the inner bash process:

2896855

Now, type:

kill -s INT 2896855

Nothing happens; the bash script is still running. The outer bash script has not passed the SIGINT signal down to its children.

But with SIGTERM, it seems to work:

$ kill -s TERM 2896855
CLEANING UP
[1]  + 2896855 terminated  bash -c

After fixing this issue, I suggest reducing the overall timeout from 2 hours to 40 minutes.

Misconfiguration causing many errors in the hook logs

$ kubectl logs deploy/hook | jq | fgrep -C 10 error
...
{
  "component": "hook",
  "error": "Post \"http://needs-rebase\": dial tcp 10.31.251.8:80: connect: connection refused",
  "event-GUID": "6b8398f0-88d9-11ec-9083-d6a9f0825356",
  "event-type": "pull_request",
  "external-plugin": "needs-rebase",
  "file": "prow/hook/server.go:225",
  "func": "k8s.io/test-infra/prow/hook.(*Server).demuxExternal.func1",
  "level": "error",
  "msg": "Error dispatching event to external plugin.",
  "severity": "error",
  "time": "2022-02-08T12:20:27Z"
}
{
  "component": "hook",
  "error": "Post \"http://cherrypick\": dial tcp 10.31.255.164:80: connect: connection refused",
  "event-GUID": "6b8398f0-88d9-11ec-9083-d6a9f0825356",
  "event-type": "pull_request",
  "external-plugin": "cherrypick",
  "file": "prow/hook/server.go:225",
  "func": "k8s.io/test-infra/prow/hook.(*Server).demuxExternal.func1",
  "level": "error",
  "msg": "Error dispatching event to external plugin.",
  "severity": "error",
  "time": "2022-02-08T12:20:33Z"
}

Porting all 'legacy' bootstrap based jobs to this repo

In order to make this repo the absolute authority, and to deprecate our custom changes made in jetstack/test-infra, we need to update all of our base images to reference this repository and execute the bootstrap.py script from here.

You can see an example of this here: https://github.com/jetstack/testing/blob/master/images/minikube-in-go/runner#L21-L22

In the meantime, old jobs will continue to clone test-infra and use the presubmit configs defined in that repo.

Individual branch protection rules for cert-manager repo branches

Currently we apply the same rules for all jetstack/cert-manager branches here. This can sometimes cause issues, e.g. when needing to cherry-pick into a release that does not support e2e tests against the same version of Kubernetes as the default presubmit e2e tests run for PRs to master.
We could instead specify the rules for each currently supported branch individually, as in the sketch below.
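A sketch of per-branch rules in Prow's branch-protection config (the branch names and status-check contexts are hypothetical):

branch-protection:
  orgs:
    jetstack:
      repos:
        cert-manager:
          branches:
            master:
              required_status_checks:
                contexts:
                  - pull-cert-manager-e2e-v1-21   # hypothetical context
            release-1.1:
              required_status_checks:
                contexts:
                  - pull-cert-manager-e2e-v1-18   # hypothetical context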

/kind cleanup

RBAC Rules for trusted prowjobs

We have different workloads that run in different clusters, e.g. trusted. This YAML snippet applies the required roles so that services talking from other clusters can read, update and delete CI jobs that have run in those clusters.

I've attached it here for now since we don't have a place to store or apply these changes to another cluster. This YAML snippet has already been applied to the -trusted and the -gke cluster.

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: test-pods
  name: "sinker"
rules:
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - list
    - watch
    - patch
    - get
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: test-pods
  name: "sinker"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: "sinker"
subjects:
- kind: User
  name: "client"
  apiGroup: rbac.authorization.k8s.io

Migration plan

Documenting the steps involved with the migration:

  • Make this repository authoritative for Prow deployment (e.g. deploying Prow components to the prow cluster)
  • Copying all existing images and job configuration to this repo (bootstrap images still referencing test-infra repo)
  • Switch update-config plugin to work on this repo instead of test-infra (i.e. make this repo authoritative for Prow config as well)
  • Update all base images to reference jetstack/testing repo
    • cert-manager
    • navigator
    • test-infra/testing
    • tarmak
  • Moving gubernator deployment into this repo #4

I'll also be incrementally adding more pre/post submits for verification and automation for this repo too. Initially, it will be to simply run bazel test //... within the repo.

In future I'd like to explore auto-deploying the Prow components on push to master (versioned via the Bazel WORKSPACE file). This is now done.

I'd also like to find ways we can auto-push build images from this repository too. To do this, however, we'll need some way to detect when a new image needs to be built, as our build images are not built by Bazel. (ref #1)

/cc @simonswine

Onboard with CertManager prow cluster for running CI tests for jniebuhr/aws-pca-issuer

Is your feature request related to a problem? Please describe.
Onboard with CertManager prow cluster for running CI tests for jniebuhr/aws-pca-issuer

Describe the solution you'd like
We would like to onboard with CertManager prow cluster for running CI tests which are checked in as part of the repo jniebuhr/aws-pca-issuer

Describe alternatives you've considered
N/A
Additional context
N/A

/kind feature
Related to cert-manager/cert-manager#3675, cert-manager/cert-manager#3670

Move infra image building off Docker

Currently all images in ./images are built using Docker.
Some of the downsides:

  • slow builds
  • requires running image build pods as privileged
  • somewhat complex Docker setup (though this is also needed for the e2e tests, which set up a kind cluster)
  • building images on Kubernetes with Docker is generally not recommended (@BeckyPauley pointed me at this doc)

We could instead use a more lightweight build mechanism such as ko, which is already used for a number of internal projects at Jetstack, or a 'kube native' image build mechanism like kaniko; a rough sketch follows below.

Note: we should aim to use similar tooling across cert-manager projects
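As one possible direction, a rough sketch of a kaniko build pod (the pod name, Dockerfile path and destination registry are hypothetical; kaniko builds from a Dockerfile without a Docker daemon, so no privileged mode is needed):

apiVersion: v1
kind: Pod
metadata:
  name: build-image-kaniko    # hypothetical
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=images/example/Dockerfile                     # hypothetical path
        - --context=git://github.com/jetstack/testing.git            # build context cloned from git
        - --destination=eu.gcr.io/example-registry/example:latest    # hypothetical destination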

The group cert-manager-dev-alerts does not receive testgrid alerts

We set up cert-manager-dev-alerts@googlegroups.com as the alert email for testgrid in #425. Unfortunately, @wallrj noticed this morning that testgrid has been failing during the night and no email was sent to cert-manager-dev-alerts.

We might have forgotten to whitelist the sender email that is used to send emails to cert-manager-dev-alerts@googlegroups.com, but I don't have the permission to see the group settings.

@munnerz Could we whitelist this email? I'm not sure which email it is, though.

By the way, would it make sense to have @jetstack/team-cert-manager set as admins of this group?

Configuring Peribolos for Github org management

We can use Peribolos for GitHub org management, allowing us to manage who is a member of various organisations through config applied by bots.

You can see an example of the config for this here: https://github.com/kubernetes/org/tree/master/config

@simonswine I know you've expressed a desire to manage a lot of this with Terraform instead - keen to hear your thoughts here.

This would be good at least to get set up for the cert-manager org, which is currently managed completely manually; a config sketch is below.
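A sketch of what the Peribolos config could look like, following the kubernetes/org layout (all membership lists here are hypothetical):

orgs:
  cert-manager:
    name: cert-manager
    admins:
      - munnerz        # hypothetical admin list
    members:
      - sgtcodfish     # hypothetical member list
      - wallrj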

Venafi Issuer tests are consistently failing with "common name test-common-name-bvxqgardto is not allowed in this policy"

TL;DR: On Friday 15 Sept 2022, the policy folder \VED\Policy was changed, which affected the policy folder used during the cert-manager e2e tests (\VED\Policy\Jetstack). Before the change, there was no restriction on the common name or DNS names. After the change, the common name and DNS names became restricted to example.com or *.example.com. On 18 Sept 2022, I removed the example.com domain restriction in \VED\Policy which fixed the build failures.

Testgrid has been notifying us of the Venafi tests failing over and over:

(screenshot from 2022-09-17 20:48)

The tests seem to be failing with this error:

common name test-common-name-bvxqgardto is not allowed in this policy: ^([\p{L}\p{N}-*]+\.)*example\.com$

This error comes from the vcert library, in the SimpleValidateCertificateRequest function. Vcert fetches the policies attached to the policy folder \VED\Policy\Jetstack.

It might be due to a change in the policies of the folder \VED\Policy\Jetstack in TPP. I looked at the JSON file produced by vcert getpolicy to see:

TOKEN=$(vcert getcred -u $VENAFI_TPP_URL --username $VENAFI_TPP_USERNAME --password $VENAFI_TPP_PASSWORD --client-id=$VENAFI_TPP_CLIENT_ID --scope='certificate:manage;configuration:manage' --format json | jq -r .access_token)
vcert getpolicy -u $VENAFI_TPP_URL -t $TOKEN -z 'Jetstack'

which returns:

{
  "users": [
    "jetstack-platform",
    "jetstack_user"
  ],
  "policy": {
    "domains": [
      "example.com"
    ],
    "wildcardAllowed": true,
    "certificateAuthority": "\\VED\\Policy\\Administration\\CAs\\Microsoft CA Web Server 1 Year",
    "keyPair": {
      "reuseAllowed": true
    },
    "subjectAltNames": {
      "dnsAllowed": true,
      "ipAllowed": false,
      "emailAllowed": false,
      "uriAllowed": false,
      "upnAllowed": false
    }
  },
  "defaults": {
    "keyPair": {
      "keyType": "RSA",
      "rsaKeySize": 2048,
      "serviceGenerated": false
    },
    "autoInstalled": true
  }
}

This enforces that common names and DNS names end with example.com. But our tests submit CSRs with a common name that doesn't end with example.com; I looked at old tests, and the common names we have been using never end with example.com. They always look something like this:

test-common-name-tkewxmuslq
test-common-name-ucgwdafybr

I also looked in the UI to see whether this policy is inherited from the root folder \VED\Policy or not. The answer is yes: the domain example.com is enforced from the root folder:

(screenshot from 2022-09-17 21:03)

This (seemingly new) example.com policy seems to have been added to \VED\Policy on Friday 15 Sept 2022.

I learned that TPP relies on a "Log Server", and that log server (which is a feature of the SQL database as I understand it) allows us to audit everything that happens in TPP. So I RDP'd into the VM, and opened the application "Venafi Configuration Admin" as suggested here. I had to "enable" something first:

(screenshot from 2022-09-17 21:22)

I think the "events" I am looking for are "updates to the Admin UI":

(screenshot from 2022-09-17 21:36)

Unfortunately, the tab "All events" shows an error. Creating custom filters (e.g., filtering on "updates to Admin UI") shows the same error:

(screenshot from 2022-09-18 10:10)

Fix up job template for jetstack/testing repo when using decorated jobs

Links to gubernator are broken, namely because the decorated job log uploader uploads to:

https://jetstack-build-infra.appspot.com/build/jetstack-logs/pr-logs/pull/testing/13/pull-testing-verify-config/5/

and the job template refers to:

https://jetstack-build-infra.appspot.com/build/jetstack-logs/pr-logs/pull/jetstack_testing/13/pull-testing-verify-config/5/

(i.e. with jetstack_ added).

We should work out how the decorator uploader determines what path to use, and potentially update the job template accordingly.

We can no longer see ProwJob's JobHistory

See here: it seems that a job's history cannot be viewed via TestGrid.

The error failed to get job history: could not instantiate storage bucket: bucket "jetstack-logs" not in allowed list ([kubernetes-jenkins]); you may allow it by including it in deck.additional_allowed_buckets. suggests adding jetstack-logs bucket to deck.additional_allowed_buckets here, but doing that (and ensuring deck loads the updated config) seemed to have no effect.
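For reference, the change the error message suggests would look something like this (a sketch; deck's other settings omitted):

deck:
  additional_allowed_buckets:
    - jetstack-logs    # allow deck to read job history from this bucket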

April 12th, 2021: End of grace period for storage bucket validation; additional buckets have to be allowed by adding them to the deck.additional_allowed_buckets list.

The above is from Prow's changelog.

Automation for pushing build images (images/)

Right now we have a series of ad-hoc Makefiles for building and pushing our test images.

We should have a unified way to do this.

It will be painful to use Bazel here, as Bazel's rules_docker does not support common statements like RUN, nor does it support Dockerfiles.

We should come up with some simple automation that can be used to ensure all images are up to date. Once this is done, we can potentially make this a task executed by Prow itself as a postsubmit.

Update Plank config

A couple of error messages in Plank's logs suggest we may need to change some flags/config settings:

{"component":"plank","file":"prow/config/config.go:1540","func":"k8s.io/test-infra/prow/config.(*Config).validateComponentConfig","level":"warning","msg":"configuring the 'gcs/' storage provider suffix in the job url prefix is now deprecated, please configure the job url prefix without the suffix as it's now appended automatically. Handling of the old configuration will be removed in September 2020","severity":"warning","time":"2021-05-28T11:45:15Z"}
{"component":"plank","file":"prow/config/config.go:875","func":"k8s.io/test-infra/prow/config.(*Deck).Validate","level":"warning","msg":"rerun_auth_config will be deprecated in July 2020, and it will be replaced with rerun_auth_configs['*'].","severity":"warning","time":"2021-05-28T11:45:15Z"}
{"component":"plank","duration":"29.839452ms","file":"prow/plank/controller.go:172","func":"k8s.io/test-infra/prow/plank.(*Controller).Start","level":"info","msg":"Synced","severity":"info","time":"2021-05-28T11:45:28Z"}

/kind cleanup

Migrate cert-manager testgrid integration from Transfigure to Config Merger

Transfigure is deprecated in favor of Config Merger. See Migration for details.

Transfigure also sometimes fails with false errors caused by problems in kubernetes/test-infra, preventing us from merging unrelated changes to this repo.

Istio have already completed the migration; see their issue.

TODO

  • Add Configurator jobs (like this example) to our Prow instance.
  • In the same PR, in this repository, add your instance to the mergelists and delete the gen-config.yaml file that Transfigure has been maintaining.
  • Delete any leftover Transfigure jobs and close any leftover Transfigure PRs.

Allow adding newly required presubmits that aren't run against older branches

See https://kubernetes.slack.com/archives/CDEQJ0Q8M/p1648471110452349 for context

Currently, if we add a new presubmit which is required, it will become required for all PRs against all branches in the repo, including backports against older branches where the new presubmit may be irrelevant or unable to succeed.

We should fix this situation.
Perhaps it is possible to specify against which branches a job should be required by using contexts? Prow job configs also support a branches field; see the sketch below.
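A sketch of scoping a presubmit to particular branches in the job config (the job name and image are hypothetical):

presubmits:
  jetstack/cert-manager:
    - name: pull-cert-manager-new-check    # hypothetical job
      always_run: true
      decorate: true
      branches:
        - ^master$    # regex; the job never triggers (or blocks) on release-* backports
      spec:
        containers:
          - image: golang:1.17    # hypothetical image
            command:
              - make
              - verify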

Document how updates to job config files make it to the Prow cluster

As I understand it, this works via the config_updater plugin, which watches the files in jobs/ and updates a job-config ConfigMap in Prow's build cluster.
However, lately we've had weird issues where, e.g. after new jobs are added or some file names have been changed, the correct job configs are not propagated to the cluster. See e.g. cert-manager/csi-driver-spiffe#2 (comment) and the weird output of the update from PR #591, which renamed a file.
We should:

  1. investigate whether the config updater works as it should
  2. document how it works
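For reference, a sketch of what the config_updater stanza in plugins.yaml typically looks like (the paths and ConfigMap names are assumptions based on a common layout):

config_updater:
  maps:
    config/config.yaml:     # hypothetical path to the main Prow config
      name: config
    config/plugins.yaml:    # hypothetical path to the plugin config
      name: plugins
    jobs/**/*.yaml:         # all job configs fan into a single ConfigMap
      name: job-config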

Some Prow periodic jobs are no longer running

It seems that since we made some changes to our Prow config (on Friday) a bunch of periodic jobs are no longer being run:

- jetstack-cert-manager-master/ci-cert-manager-bazel-experimental
- jetstack-cert-manager-master/ci-cert-manager-e2e-v1-16
- jetstack-cert-manager-master/ci-cert-manager-e2e-v1-17
- jetstack-cert-manager-master/ci-cert-manager-e2e-v1-18
- jetstack-cert-manager-previous/ci-cert-manager-previous-e2e-v1-16
- jetstack-cert-manager-previous/ci-cert-manager-previous-e2e-v1-19
- jetstack-cert-manager-previous/ci-cert-manager-previous-previous-experimental
- jetstack-cert-manager-next/ci-cert-manager-next-bazel
- jetstack-cert-manager-next/ci-cert-manager-next-bazel-experimental

See e.g. the master/v1.18 last report from 04/09.

/kind bug

Investigate why TestGrid config no longer gets generated

Our TestGrid config is in the kubernetes/test-infra repo here.

We have some automation (which needs documentation) that re-generates the config and opens PRs with the changes.

It seems to be broken: I added a new ProwJob in #487, but did not see a new PR against test-infra.

/kind bug

Error when building images build aborted: no such package '@test_infra//images/bootstrap/barnacle'

I tried using bazel to build the images in #90 but it failed as follows:

$ bazel run //images/bazelbuild

Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: SHA256 (https://github.com/bazelbuild/rules_docker/archive/v0.5.1.tar.gz) = 29d109605e0d6f9c892584f07275b8c9260803bf0c6fcb7de2623b2bedc910bd
ERROR: /home/richard/go/src/github.com/jetstack/testing/images/BUILD.bazel:23:1: no such package '@test_infra//images/bootstrap/barnacle': Error cloning repository: git@github.com:jetstack/test-infra.git: Auth cancel caused by git@github.com:jetstack/test-infra.git: Auth cancel caused by Auth cancel and referenced by '//images:barnacle'
ERROR: Analysis of target '//images/bazelbuild:bazelbuild' failed; build aborted: no such package '@test_infra//images/bootstrap/barnacle': Error cloning repository: git@github.com:jetstack/test-infra.git: Auth cancel caused by git@github.com:jetstack/test-infra.git: Auth cancel caused by Auth cancel
INFO: Elapsed time: 302.258s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (29 packages loaded)
FAILED: Build did NOT complete successfully (29 packages loaded)
$ bazel version
Build label: 0.16.1- (@non-git)
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Mon Aug 13 16:44:43 2018 (1534178683)
Build timestamp: 1534178683
Build timestamp as int: 1534178683

Test against Kubernetes v1.22

We would like cert-manager v1.5 to be deployable on Kubernetes v1.22.

Because Kubernetes v1.22 will be released shortly before cert-manager v1.5, there probably won't yet be a kind image, so we should build our own and start testing cert-manager master once it is usable with v1.22 (cert-manager/cert-manager#3390).

Error: Forbidden: IAM Service Account Credentials API has not been used in project 771478705899 before or it is disabled

In https://prow.build-infra.jetstack.net/view/gs/jetstack-logs/logs/post-testing-upload-testgrid-config/1522560873900544000

2022/05/06 12:56:28 could not write config: close: Post "https://storage.googleapis.com/upload/storage/v1/b/jetstack-testgrid/o?alt=json&name=config&prettyPrint=false&projection=full&uploadType=multipart": compute: Received 403 `Unable to generate access token; IAM returned 403 Forbidden: IAM Service Account Credentials API has not been used in project 771478705899 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/iamcredentials.googleapis.com/overview?project=771478705899 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
This error could be caused by a missing IAM policy binding on the target IAM service account.
For more information, refer to the Workload Identity documentation:
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

I suppose we should enable this API in Terraform.

Document how to test ProwJobs locally

We have some docs here https://github.com/jetstack/testing#testing-locally, but at the moment these instructions are not working (the local cluster gets spun up, but the ProwJob pod always fails; this might be because our Prow config uses some deprecated options?).

We should look at https://github.com/kubernetes/test-infra/blob/master/prow/build_test_update.md#how-to-test-a-prowjob and add concrete examples how we can use that testing strategy against our ProwJobs.

Add images verify step

We should add a dind-based verify step that ensures that the images in images/ can be built okay.

This will involve:

  1. a dind presubmit
  2. as an optimisation, a script that compares the changes in the PR to determine which images need to be re-tested
  3. a script to run make build on the affected images.

This will make it easier to verify that images are buildable before merging; a sketch of such a presubmit follows below.

This issue does not track how we automatically push these images (see #1)
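A rough sketch of the dind presubmit (the job name, image and script are hypothetical; run_if_changed gives the change-comparison optimisation from step 2):

presubmits:
  jetstack/testing:
    - name: pull-testing-verify-images    # hypothetical job
      run_if_changed: ^images/            # only run when image sources change
      decorate: true
      spec:
        containers:
          - image: eu.gcr.io/example/docker-in-docker    # hypothetical dind image
            command:
              - ./hack/verify-images.sh                  # hypothetical script from step 3
            securityContext:
              privileged: true    # required for Docker-in-Docker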

Deploying gubernator

We run Gubernator alongside Prow as part of our build infrastructure.

The only changes we've had to make to the test-infra repo in our fork in order to deploy Gubernator are to the config.yaml file.

We should move this config file over to this repo, and work out how we can deploy gubernator as stored in our own fork of test-infra, whilst keeping the configuration in this repository.

Investigate whether we can make ProwJobs configs more DRY

See #683 (comment) for context

Perhaps we could use a simple YAML templater like ytt and generate the job configs in the postsubmit that updates the Prow config in the cluster; a small sketch is below.

If we do this, we should make sure that the solution is easy to use, doesn't require people to learn a complicated tool or language, and that a project that doesn't require templating can still add a plain YAML job config.
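As an illustration of how little syntax a templater needs, a ytt sketch that stamps out one e2e presubmit per Kubernetes version (assumes a separate data-values file defining k8s_versions; all names are hypothetical):

#@ load("@ytt:data", "data")

presubmits:
  jetstack/cert-manager:
    #@ for v in data.values.k8s_versions:
    - name: #@ "pull-cert-manager-e2e-v" + v.replace(".", "-")
      always_run: true
    #@ end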

Try out rootless containers for running jobs

Right now, all of our Prow jobs are running in a pod that runs as UID 0 as a privileged process (for accessing the host's devices, such as /sys/fs/cgroup) with the SYS_ADMIN capability (for using clone(2) and unshare(2), I assume).

securityContext:
  privileged: true
  capabilities:
    add: ["SYS_ADMIN"]

We could improve on this and remove the UID 0 requirement by running the pods as non-privileged users. For that, we can rely on the "cri in userns" feature of containerd or docker; a sketch of one possible direction is below.
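Purely as a sketch of where this could end up (note: the hostUsers field is the Kubernetes-native user-namespace knob added in v1.25, which did not exist when this issue was filed, and the UID is arbitrary):

spec:
  hostUsers: false          # assumption: cluster and runtime support user namespaces
  containers:
    - name: runner
      securityContext:
        privileged: false   # no longer privileged
        runAsUser: 1000     # arbitrary non-zero UID
        runAsNonRoot: true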

🚧 Note that this issue does not relate to the fact that dockershim will be removed in Kubernetes 1.24. This change does not affect us since we are not accessing the docker socket present on the host (instead, we run our own docker daemon in each of the job pods).

Testgrid error: podinfo.json not found please install prow's gcs reporter

All the testgrid reports are marked failed, with an error message:

podinfo.json not found please install prow's gcs reporter

https://testgrid.k8s.io/jetstack-cert-manager-master#ci-cert-manager-bazel

Maybe this is the reporter, but there is very little documentation in those files.
