openshift-psap / ci-artifacts

OpenShift PSAP-team CI Artifacts

License: Apache License 2.0

Shell 16.95% Dockerfile 0.20% Python 73.27% Jinja 4.28% Makefile 0.94% Cuda 0.23% C++ 1.84% RobotFramework 1.04% Jupyter Notebook 1.11% JavaScript 0.13%
ci ansible operator openshift gpu nfd


ci-artifacts's Issues

Generic command for installing operators from OperatorHub

./run_toolbox.py nfd_operator deploy_from_operatorhub

./run_toolbox.py gpu_operator deploy_from_operatorhub
${OPERATOR_CHANNEL:-}
${OPERATOR_VERSION:-}
--namespace ${OPERATOR_NAMESPACE}

The behaviour of these two commands is very similar, so I think it shouldn't be hard to rewrite ./run_toolbox.py gpu_operator deploy_from_operatorhub into a generic command, something like:

deploy_from_operatorhub
--catalog=certified-operators
--name=gpu-operator
--namespace=... # optional
--channel=v1.9.0 # optional, can use defaultChannel
--csv-name=... # optional, can use defaultCSV
--deploy-default-cr=True

This would allow us to install any operator from the command line.
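A minimal sketch of what this generic command could look like as a shell helper. The function names, and the fallback to the PackageManifest defaultChannel/currentCSV fields when no channel/CSV is given, are illustrative assumptions, not the actual implementation:

```shell
#!/bin/bash
# Hypothetical helpers: resolve the channel/CSV defaults from the PackageManifest
# when they are not specified on the command line.

resolve_channel() {  # $1=operator name, $2=requested channel (may be empty)
    if [ -n "$2" ]; then echo "$2"; return; fi
    oc get "packagemanifests/$1" -n openshift-marketplace \
       -o jsonpath='{.status.defaultChannel}'
}

resolve_csv() {  # $1=operator name, $2=channel, $3=requested CSV (may be empty)
    if [ -n "$3" ]; then echo "$3"; return; fi
    oc get "packagemanifests/$1" -n openshift-marketplace \
       -o jsonpath="{.status.channels[?(@.name==\"$2\")].currentCSV}"
}

deploy_from_operatorhub() {  # $1=catalog, $2=name, $3=namespace, $4=channel, $5=csv
    local channel csv
    channel=$(resolve_channel "$2" "$4")
    csv=$(resolve_csv "$2" "$channel" "$5")
    cat <<EOF | oc apply -f-
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: $2
  namespace: ${3:-openshift-operators}
spec:
  channel: $channel
  name: $2
  source: $1
  sourceNamespace: openshift-marketplace
  startingCSV: $csv
EOF
}
```

With this shape, deploy_from_operatorhub certified-operators gpu-operator-certified "" v1.9.0 would subscribe to the v1.9.0 channel with its current default CSV.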

#300 gpu_operator_run_gpu-burn: make gpu-burn execution easier to reproduce
#301 benchmarking: make the execution easier to reproduce

This could be rewritten with these two PRs (^^^) in mind, to make the install easy to reproduce from the execution artifacts.

Detect when the GPU Operator fails because of a cluster upgrade

Up to and including v1.6.2 of the GPU Operator, OpenShift cluster upgrades are not supported, because the driver DaemonSet receives the RHEL_VERSION as a DaemonSet/Pod environment variable.

This makes the driver Pod unable to build the nvidia driver, because it cannot fetch any package.
Example of lines from the driver logs:

+ echo -e 'Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.18.0-240.22.1.el8_3.x86_64\n'

el8_3 means that a RHEL 8.3 kernel is running, but

dnf -q -y --releasever=8.2

This --releasever=8.2 shows that the DaemonSet is configured with RHEL_VERSION=8.2.


It would be easy to check whether the OCP_VERSION in the driver DaemonSet matches the ocp_release variable that we already capture in the Ansible playbooks.

This test can be integrated in the diagnose.sh script.
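A sketch of what that check in diagnose.sh could look like; the DaemonSet name, namespace, and container index are assumptions, not values taken from the operator:

```shell
#!/bin/bash
# Sketch: compare the OCP_VERSION env of the driver DaemonSet (names assumed)
# with the cluster version captured by the playbooks.

driver_daemonset_ocp_version() {
    oc get daemonset/nvidia-driver-daemonset -n gpu-operator-resources \
       -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="OCP_VERSION")].value}'
}

check_driver_ocp_version() {  # $1=cluster version (the captured 'ocp_release')
    local ds_version
    ds_version=$(driver_daemonset_ocp_version)
    if [ "$ds_version" != "$1" ]; then
        echo "MISMATCH: driver DaemonSet has OCP_VERSION=$ds_version, cluster runs $1" >&2
        return 1
    fi
}
```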

Test the cluster upgrade support of the GPU Operator

The GPU Operator should seamlessly support upgrades of the OpenShift cluster (at least in a forthcoming release).

We want to be able to test this upgrade support in the nightly CI.

OpenShift Prow doesn't support our upgrade use case, which is:

  1. install and test the GPU Operator as usual
  2. upgrade the cluster
  3. test the GPU Operator

(Operators are usually preinstalled in OpenShift, or straightforward to install via OperatorHub, but the GPU Operator requires the deployment of the entitlement, the scale-up of the cluster with a GPU node, the deployment of NFD ...).

This ticket will track the development of this feature.

set_scale.sh: cannot specify the source machineset

Currently, ./toolbox/cluster/set_scale.sh cannot be customized to decide which MachineSet will be used to derive the new MachineSet:

- name: Get the names of an existing worker machinesets (of any instance type)
  command:
    oc get machinesets -n openshift-machine-api -o
    jsonpath='{range .items[?(@.spec.template.metadata.labels.machine\.openshift\.io/cluster-api-machine-role=="worker")]}{.metadata.name}{"\n"}{end}'
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout

It would be nice to have the ability to easily override oc_get_machinesets.stdout, to specify which machineset to use as a base.

Quick and dirty example of what I did to work around that:

- name: Get the names of an existing worker machinesets (of any instance type)
  command:
    echo kpouget-20210519-kf6rn-worker-eu-central-1b
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout

The reason for this workaround is that the instance-type I want isn't available in eu-central-1a, only in 1b.
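One possible shape for the override, sketched as shell: an environment variable (the MACHINESET_BASE name is hypothetical) that short-circuits the discovery performed by the Ansible task above:

```shell
#!/bin/bash
# Hypothetical override: use $MACHINESET_BASE when set, otherwise fall back
# to the worker-machineset discovery the role performs today.
get_base_machineset() {
    if [ -n "$MACHINESET_BASE" ]; then
        echo "$MACHINESET_BASE"
        return
    fi
    oc get machinesets -n openshift-machine-api -o \
       jsonpath='{range .items[?(@.spec.template.metadata.labels.machine\.openshift\.io/cluster-api-machine-role=="worker")]}{.metadata.name}{"\n"}{end}'
}
```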

Include "p3.2xlarge" GPU instance during CI

Include the "p3.2xlarge" GPU instance as well during CI along with the cheaper "g4dn.xlarge"

When we onboard performance tests in the CI, we should run the training workloads on the p3 instance and the inference workloads on the g4dn instance.

oc isn't part of the CI image

Currently we download oc, (kubectl), helm and operator-sdk as part of the precheck() call of build/root/usr/local/bin/run.

I think it would be better to fetch these binaries when building the image.

Implement some unit testing for toolbox scripts

The toolbox scripts are used to test the deployment of the GPU Operator, so most of their code (Ansible playbooks and roles) is tested before merging a new PR into the master branch (/test gpu-operator-e2e) and in the nightly testing:


But the GPU Operator testing doesn't cover 100% of the toolbox features, and some flags and code branches might be left untested. This is for instance what happened with toolbox/entitlement/test.sh, which got broken when no flag was passed (see the fix in cf8a276), as that code path isn't executed in the GPU Operator testing.

This ticket will track the progress of the design and development of unit tests.

Detect when the nightly CI fails because of a cluster shutdown

Every once in a while, the nightly testing fails because the cluster becomes unreachable:

roles/capture_environment/tasks/main.yml:8
TASK: capture_environment : Store OpenShift YAML version
----- FAILED ----
msg: non-zero return code

<command> oc version -oyaml > /logs/artifacts/233800__cluster__capture_environment/ocp_version.yml

<stderr> The connection to the server api.ci-op-75fhpdb3-3c6fc.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
----- FAILED ----

This kind of failure is independent from the GPU Operator testing, and it should be made clear in the CI-Dashboard (the Prow infrastructure restarts the testing when this happens). An orange dot could do the job, with a "cluster issue detected" label.

To detect this, the must-gather script could simply create a cluster-down file when oc version doesn't work.
The presence of this file would tell the ci-dashboard to set the orange flag.
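A minimal sketch of the guard in the must-gather script; the artifacts directory is taken as a parameter here, and the cluster-down file name comes from the proposal above:

```shell
#!/bin/bash
# Create a 'cluster-down' marker file when the API server is unreachable.
mark_cluster_down() {  # $1=artifacts directory
    if ! oc version >/dev/null 2>&1; then
        touch "$1/cluster-down"
    fi
}
# Usage in must-gather: mark_cluster_down "$ARTIFACT_DIR"
```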

GPU Operator: run gpu-operator test_operatorhub should not specify the exact operator version

Currently, we are nightly testing the released versions of the GPU Operator with this entrypoint:

run gpu-operator test_operatorhub 1.8.0 v1.8

where 1.8.0 specifies the version of the operator to be installed, and v1.8 specifies the OLM channel.

It would be better to only specify the channel, so that the latest minor version is installed:

run gpu-operator test_operatorhub v1.8

When the gpu_operator_deploy_from_operatorhub role was originally written (before v1.7), the GPU Operator only had a stable channel, so the channel could be omitted, and the full version had to be specified.
Now that NVIDIA switched to a dedicated channel per 1.X release, we could update the role/entrypoint to test "the latest minor available for a given channel".
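Resolving "the latest minor available for a given channel" could be a one-line PackageManifest query; a sketch, assuming the channels[].currentCSV field exposed by OLM:

```shell
#!/bin/bash
# Sketch: map an OLM channel to its newest CSV, instead of pinning the version.
latest_csv_for_channel() {  # $1=channel, eg v1.8
    oc get packagemanifests/gpu-operator-certified -n openshift-marketplace \
       -o jsonpath="{.status.channels[?(@.name==\"$1\")].currentCSV}"
}
```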

In addition, the ci-dashboard was recently updated to show the exact version of the GPU Operator being installed & tested (instead of a hard-coded value).

Operator install from OperatorHub fails because the package is not found

From time to time, in the CI, NFD or the GPU Operator fail to be installed from OperatorHub, because the PackageManifest is not found:

<command> oc get packagemanifests/gpu-operator-certified -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "gpu-operator-certified" not found
<command> oc get packagemanifests/nfd -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "nfd" not found
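One mitigation would be to retry the lookup before giving up, since the catalog Pods may simply not be serving the manifest yet; a sketch (retry counts and delays are arbitrary):

```shell
#!/bin/bash
# Retry the PackageManifest lookup instead of failing on the first NotFound.
wait_for_packagemanifest() {  # $1=package name, $2=retries, $3=delay (s)
    local i
    for i in $(seq 1 "${2:-30}"); do
        if oc get "packagemanifests/$1" -n openshift-marketplace >/dev/null 2>&1; then
            return 0
        fi
        sleep "${3:-10}"
    done
    echo "PackageManifest $1 never appeared" >&2
    return 1
}
```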

Measure the code coverage of the presubmit tests

In parallel with implementing unit testing (#92), it would be interesting to measure the code coverage of the GPU Operator testing + unit testing, to ensure that all the playbook tasks are executed at least once as part of the presubmit testing.

This ticket will track the design and development of this task.

Delete 'release-4.x' branches

I created this issue to discuss the topic and understand whether the idea makes sense.

Currently, the code of this repository isn't specific to any version of OpenShift, so I would suggest getting rid of the release-4.x branches and using a simpler workflow, with a master branch defining the way to test the GPU Operator on all the OpenShift releases.

See this commit kpouget/release@3790f84 for the patch that should be applied to the openshift-release repository.

Prow CI: Upgrade config not using predefined steps

Currently, the cluster upgrade testing is performed "manually" in the cluster_upgrade_to_image role.

This simple playbook only waits for the end of the upgrade, but doesn't perform any other kind of test.

The reason for this choice is that

  1. the Prow CI steps for upgrading the cluster do not support running custom repository tests, which is mandatory for the GPU Operator (entitlement, installation of dependencies, initial deployment and validation of the GPU Operator).
  2. we wanted to be able to rapidly validate NVIDIA's implementation of the cluster upgrade support.

In the future, it would be important to move to a proper CI upgrade step.

Add the ability to entitle only GPU nodes

Currently, the entitlement is performed cluster-wide, so all the nodes of the cluster have to be rebooted when the entitlement is deployed.

In order to avoid rebooting nodes that do not require entitlement, we need to update

  1. the MachineConfig resources to target only a specific set of nodes
  2. the MachineSet to apply a label to the node when it gets created (instead of relying on NFD to discover that it has a GPU)
  3. the entitlement test pod, to make sure it lands on an entitled node.

I think it would be good to keep the existing behavior as the default for the toolbox commands, but add a --label ... flag to support this optimization.

Ensure that a GPU node is available before deploying the GPU Operator

Currently, before deploying the GPU Operator on the CI, we do this:

prepare_cluster_for_gpu_operator() {
    entitle
    toolbox/scaleup_cluster.sh
    toolbox/nfd/deploy_from_operatorhub.sh
}

but we never test that NFD correctly labels the nodes and that GPU nodes are indeed available.


So prepare_cluster_for_gpu_operator should be extended with:

toolbox/nfd/wait_for_gpu_nodes.sh

that would wait 5 minutes for a node with one of these labels to show up:

var gpuNodeLabels = map[string]string{
	"feature.node.kubernetes.io/pci-10de.present":      "true",
	"feature.node.kubernetes.io/pci-0302_10de.present": "true",
	"feature.node.kubernetes.io/pci-0300_10de.present": "true",
}
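A sketch of what toolbox/nfd/wait_for_gpu_nodes.sh could do; this polls only the first of the three labels, and the poll interval and error message are illustrative:

```shell
#!/bin/bash
# Wait (default 5 min) for an NFD-labeled NVIDIA GPU node to appear.
wait_for_gpu_nodes() {  # $1=timeout in seconds
    local timeout=${1:-300}
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        if oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
              -o name 2>/dev/null | grep -q .; then
            return 0
        fi
        sleep 15
    done
    echo "No GPU node appeared within ${timeout}s" >&2
    return 1
}
```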

entitlement: using the same content for the entitlement.pem and entitlement-key.pem isn't safe

As per this issue openshift-psap/blog-artifacts#6, using the same content for entitlement.pem and entitlement-key.pem isn't safe,

as confirmed by this command:

$ NAME=key
$ podman run --rm -it -v $KEY:/etc/pki/entitlement/$NAME.pem registry.access.redhat.com/ubi8-minimal:8.3-298 bash -x -c "cp /etc/pki/entitlement/$NAME.pem /etc/pki/entitlement/$NAME-key.pem; microdnf install kernel-devel"

+ cp /etc/pki/entitlement/key.pem /etc/pki/entitlement/key-key.pem
+ microdnf install kernel-devel
Downloading metadata...
Downloading metadata...
Downloading metadata...
Downloading metadata...
error: cannot update repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried; Last error: Curl error (58): Problem with the local SSL certificate for https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml [unable to set private key file: '/etc/pki/entitlement/key-key-key.pem' type PEM]
  • NAME=entite --> doesn't work
  • NAME=entitlement --> works

Learn about Ansible tags and refactor the roles to use them

I am watching an Ansible course as part of Red Hat Day of Learning, and they explain the concept of Ansible tags: executing only the tasks matching a command-line --tags name, or skipping those matching --skip-tags name2.

This feature could be useful for us; it needs to be investigated.
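As a sketch, tags on the tasks of the existing nv_gpu role (the tag names here are made up) would let a caller run only one step from the command line:

```yaml
# Illustrative only: select tasks from the CLI, eg
#   ansible-playbook playbook.yml --tags nfd
#   ansible-playbook playbook.yml --skip-tags gpu-operator
- name: Install NFD-operator from OperatorHub
  include_tasks: roles/nv_gpu/tasks/install_nfd.yml
  tags: [nfd]

- name: Install GPU-operator from OperatorHub
  include_tasks: roles/nv_gpu/tasks/install_nv.yml
  tags: [gpu-operator]
```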

Create a must-gather image for the GPU Operator

OpenShift allows capturing key information about the cluster with the must-gather command. This command allows passing a custom image, eg:

oc adm must-gather --image=quay.io/kubevirt/must-gather:latest --dest-dir=/tmp/must  

See this document for an explanation about the design, the main script and the secondary scripts.

The requirements for the image are simple:

To provide your own must-gather image, it must....

  • Must have a zero-arg, executable file at /usr/bin/gather that does your default gathering
  • Must produce data to be copied back at /must-gather. The data must not contain any sensitive data. We don't strip PII information, only secret information.
  • Must produce a text /must-gather/version that indicates the product (first line) and the version (second line, major.minor.micro.qualifier), so that programmatic analysis can be developed.
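A minimal sketch of a /usr/bin/gather entrypoint satisfying these three requirements; the product/version strings and the gathered resources are placeholders, not the actual must-gather content:

```shell
#!/bin/bash
# Minimal gather script: writes the mandatory version file, then dumps
# a few GPU-Operator-related resources into the destination directory.
gather() {  # $1=destination directory (must-gather mounts /must-gather)
    mkdir -p "$1"
    printf "gpu-operator-must-gather\n0.0.1\n" > "$1/version"  # product, version
    oc get clusterpolicies -o yaml > "$1/clusterpolicies.yaml" 2>&1 || true
    oc get pods -n gpu-operator-resources -o yaml > "$1/gpu_pods.yaml" 2>&1 || true
}
# gather /must-gather
```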

ci-artifacts as a PSAP toolbox

This issue tracks the progress of the PSAP toolbox:

GPU Operator

  • Deploy from OperatorHub
    • allow deploying an older version #76
toolbox/gpu-operator/deploy_from_operatorhub.sh
toolbox/gpu-operator/undeploy_from_operatorhub.sh
  • Deploy from helm
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_with_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_with_helm.sh
  • Deploy from a custom commit.
toolbox/gpu-operator/deploy_from_commit.sh <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: 
toolbox/gpu-operator/deploy_from_commit.sh https://github.com/NVIDIA/gpu-operator.git master
  • Run the GPU Operator deployment validation tests
toolbox/gpu-operator/run_ci_checks.sh
  • Run GPU Burst to validate that the GPUs can run workloads

  • Capture GPU operator possible issues (entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, ...)

    • already partly done inside the CI, but we should improve the toolbox aspect

NFD

  • Deploy the NFD operator from OperatorHub:
toolbox/nfd/deploy_from_operatorhub.sh
toolbox/nfd/undeploy_from_operatorhub.sh
  • Control the channel to use from the command-line

  • Test the NFD deployment #78

    • test with the NFD if GPU nodes are available
    • wait with the NFD for GPU nodes to become available #78
toolbox/nfd/has_gpu_nodes.sh
toolbox/nfd/wait_gpu_nodes.sh

Cluster

  • Add a GPU node on AWS
./toolbox/scaleup_cluster.sh
  • Specify a machine type in the command-line, and skip scale-up if a node with the given machine-type is already present
./toolbox/scaleup_cluster.sh <machine-type>
  • Entitle the cluster by passing a PEM file, checking if the keys should be concatenated or not, etc., and do nothing if the cluster is already entitled
toolbox/entitlement/deploy.sh --pem /path/to/pem
toolbox/entitlement/deploy.sh --machine-configs /path/to/machineconfigs
toolbox/entitlement/undeploy.sh
toolbox/entitlement/test.sh
toolbox/entitlement/wait.sh
  • Capture all the clues required to understand entitlement issues
toolbox/entitlement/inspect.sh
  • Deployment of an entitled cluster
    • already coded, but we need to integrate this repo within the toolbox
    • deploy a cluster with 1 master node

CI

  • Build the image used for the Prow CI testing, and run a given command in the Pod
Usage:   toolbox/local-ci/deploy.sh <ci command> <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: toolbox/local-ci/deploy.sh 'run gpu-ci' https://github.com/openshift-psap/ci-artifacts.git master

toolbox/local-ci/cleanup.sh

documentation: add roles/*/README.md descriptions

To improve the reusability, and potentially allow the use of our roles in Ansible Galaxy, we need to add README.md files in the different roles we create, describing their input parameters, dependencies, etc.

Cannot modify the GPU Operator ClusterPolicy before deploying it

Currently, the GPU Operator ClusterPolicy is fetched from the ClusterServiceVersion alm-example and instantiated right away.

However, in some cases, the default content is not the one we desire. See for instance this unmerged commit, where we need to set the repoConfig stanza when running with OCP 4.8 (using RHEL beta repositories).

    toolbox/gpu-operator/deploy_from_operatorhub.sh 
    [...]

    if oc version | grep -q "Server Version: 4.8"; then
        echo "Running on OCP 4.8, enabling RHEL beta repository"
        ./toolbox/gpu-operator/set_repo-config.sh --rhel-beta
    fi

    toolbox/gpu-operator/wait_deployment.sh

Another example would be when we want to customize the operator or operand image path to use custom ones.

The GPU Operator DaemonSets are never updated once created, so if they are created with the wrong values, the DaemonSets will never be fixed.

The hack above works (hopefully) because the driver container will fail to deploy without the right repoConfig configuration, so it's safe to manually delete it after the update; but in the general case, the driver container should never be deleted once running, as the nvidia driver cannot be removed from the kernel while other processes (workload or operand) use it.


We should find a way to allow patching the ClusterPolicy before deploying it. The solution should be generic, so that any kind of modification can be performed during the deployment.
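One generic shape for this: extract the ClusterPolicy from the CSV's alm-examples annotation, pipe it through a caller-supplied jq expression, and only then create it. A sketch; the function name and arguments are hypothetical, and it assumes jq is available:

```shell
#!/bin/bash
# Patch the default ClusterPolicy before instantiating it.
deploy_patched_clusterpolicy() {  # $1=CSV name, $2=namespace, $3=jq patch expr
    oc get "csv/$1" -n "$2" \
       -o "jsonpath={.metadata.annotations['alm-examples']}" \
        | jq ".[0] | ${3:-.}" \
        | oc apply -f-
}
# Example (repoConfig stanza for OCP 4.8 / RHEL beta repositories):
# deploy_patched_clusterpolicy gpu-operator-certified.v1.8.2 openshift-operators \
#     '.spec.driver.repoConfig = {"configMapName": "repo-config"}'
```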

Refactor the ansible roles

We are currently using some roles (eg nv_gpu) to perform many different tasks, depending on the flags we activate:

  - name: Install NFD-operator from OperatorHub
    include_tasks: roles/nv_gpu/tasks/install_nfd.yml
    when: install_nfd_operator_from_hub == "yes"

  - name: Wait for NFD-labeled GPU nodes to appear
    include_tasks: roles/nv_gpu/tasks/test_nfd_gpu.yml
    when: nfd_test_gpu_nodes == "yes"

  - name: Install GPU-operator from OperatorHub
    include_tasks: roles/nv_gpu/tasks/install_nv.yml
    when: install_gpu_operator_from_hub == "yes"

I don't think this is the way Ansible roles are supposed to be used, as it causes many tasks to be shown as "skipped" but still visible in the logs.

This ticket will track the refactoring of these big roles into smaller chunks, doing only one task (=one role per toolbox script, more or less).

Use the bundle deployment to test a specific commit of the GPU Operator

Currently, when we want to test a specific commit of the GPU Operator, we internally use the GPU Operator helm-chart to configure and deploy the resources.

test_commit() {
    CI_IMAGE_GPU_COMMIT_CI_REPO="${1:-https://github.com/NVIDIA/gpu-operator.git}"
    CI_IMAGE_GPU_COMMIT_CI_REF="${2:-master}"

    CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID="ci-image"

    echo "Using Git repository ${CI_IMAGE_GPU_COMMIT_CI_REPO} with ref ${CI_IMAGE_GPU_COMMIT_CI_REF}"

    prepare_cluster_for_gpu_operator
    toolbox/gpu-operator/deploy_from_commit.sh "${CI_IMAGE_GPU_COMMIT_CI_REPO}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_REF}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID}"
    validate_gpu_operator_deployment
}

This works properly; however, it would be better to test the bundle resources, as that is the method that will be used to deploy on OpenShift, including in the nightly testing of the master branch.

Documentation

Update the README, and also write proper documentation about the available playbooks.

Typos

./roles/entitlement_test_wait_deployment/defaults/main/config.yml:2: successfull ==> successful
./roles/gpu_operator_run_gpu-burn/tasks/main.yml:53: Instanciate ==> Instantiate
./roles/nfd_test_wait_labels/tasks/main.yml:4: quering ==> querying

GPU Operator: test PROXY configuration

We currently do not have any test validating the GPU Operator connected to the Internet through a Cluster PROXY.

This configuration appeared to be buggy in GPU Operator 1.8.0 and 1.8.1, due to non-deterministic ordering of the driver-container env entries, leading to a constant update of the driver DaemonSet and recreation of the Pods.

We should have a test case covering this use case, maybe running once a week, maybe with an in-cluster proxy relay as a first step.

Ansible-lint tests only modified files

By default, ansible-lint only tests the files modified by the PR, and hence is never run over the full repository.

$ ansible-lint -v --force-color -c config/ansible-lint.yml playbooks roles
INFO     Discovering files to lint: git ls-files -z

vs

$ ansible-lint -v --force-color -c config/ansible-lint.yml $(find . -name '*.yml')
# .ansible-lint
warn_list:  # or 'skip_list' to silence them completely
  - internal-error  # Unexpected internal error
  - syntax-check  # Ansible syntax check failed
Finished with 40 failure(s), 0 warning(s) on 195 files.

We need to have a look at these warnings/errors and fix them.

Fix yaml Lints

After running yamllint, some lint errors were found:

 playbooks/gpu-burst.yml
  5:17      warning  truthy value should be true or false  (truthy)
  9:19      error    trailing spaces  (trailing-spaces)
playbooks/nvidia-gpu-operator-ci.yml
  11:1      error    trailing spaces  (trailing-spaces)
playbooks/openshift-psap-ci.yml
  12:1      error    trailing spaces  (trailing-spaces)
  23:1      error    trailing spaces  (trailing-spaces)
roles/check_deps/tasks/main.yml
  6:18      warning  truthy value should be true or false  (truthy)
  12:18     warning  truthy value should be true or false  (truthy)
  18:18     warning  truthy value should be true or false  (truthy)
  21:8      error    trailing spaces  (trailing-spaces)
  28:44     error    trailing spaces  (trailing-spaces)
  29:1      error    trailing spaces  (trailing-spaces)
  32:14     error    trailing spaces  (trailing-spaces)
roles/nv_gpu/files/001_namespace.yaml
  5:1       error    too many blank lines (1 > 0)  (empty-lines)
roles/nv_gpu/files/003_operator_sub.yaml
  5:36      error    trailing spaces  (trailing-spaces)
  8:31      error    trailing spaces  (trailing-spaces)
  9:30      error    trailing spaces  (trailing-spaces)
  10:41     error    trailing spaces  (trailing-spaces)
roles/nv_gpu/tasks/ci_checks.yml
  5:32      error    trailing spaces  (trailing-spaces)
  6:15      error    trailing spaces  (trailing-spaces)
  13:46     error    trailing spaces  (trailing-spaces)
  21:11     error    trailing spaces  (trailing-spaces)
  27:1      error    too many blank lines (1 > 0)  (empty-lines)
roles/nv_gpu/tasks/install.yml
  4:37      error    trailing spaces  (trailing-spaces)
  11:1      error    trailing spaces  (trailing-spaces)
  14:41     error    trailing spaces  (trailing-spaces)
  21:1      error    trailing spaces  (trailing-spaces)
  24:52     error    trailing spaces  (trailing-spaces)
  31:1      error    trailing spaces  (trailing-spaces)
  34:41     error    trailing spaces  (trailing-spaces)
  41:1      error    trailing spaces  (trailing-spaces)
roles/nv_gpu/tasks/main.yml
  8:1       error    trailing spaces  (trailing-spaces)
  9:26      error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/ci_checks.yml
  5:32      error    trailing spaces  (trailing-spaces)
  6:15      error    trailing spaces  (trailing-spaces)
  13:24     error    trailing spaces  (trailing-spaces)
  19:15     error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/main.yml
  17:1      error    trailing spaces  (trailing-spaces)
  50:68     error    trailing spaces  (trailing-spaces)
  53:37     error    trailing spaces  (trailing-spaces)
  57:60     error    trailing spaces  (trailing-spaces)
  62:50     error    trailing spaces  (trailing-spaces)
  80:1      error    trailing spaces  (trailing-spaces)
  81:26     error    trailing spaces  (trailing-spaces)
roles/openshift_nfd/tasks/uninstall_nfd.yml
  14:32     error    trailing spaces  (trailing-spaces)
  18:1      error    too many blank lines (1 > 0)  (empty-lines)
roles/openshift_node/tasks/aws.yml
  80:17     error    trailing spaces  (trailing-spaces)
  106:1     error    trailing spaces  (trailing-spaces)
  112:1     error    trailing spaces  (trailing-spaces)
roles/openshift_node/tasks/main.yml
  2:63      error    trailing spaces  (trailing-spaces)
roles/openshift_node/tasks/scaleup_checks.yml
  19:9      error    trailing spaces  (trailing-spaces)
  26:133    error    trailing spaces  (trailing-spaces)
  33:38     error    trailing spaces  (trailing-spaces)
  34:8      error    trailing spaces  (trailing-spaces)
  35:47     error    trailing spaces  (trailing-spaces)
  36:11     error    too many spaces after colon  (colons)
  37:25     error    trailing spaces  (trailing-spaces)
  41:89     error    trailing spaces  (trailing-spaces)
  48:40     error    trailing spaces  (trailing-spaces)
  49:8      error    trailing spaces  (trailing-spaces)
  50:49     error    trailing spaces  (trailing-spaces)
  51:11     error    too many spaces after colon  (colons)
  52:25     error    trailing spaces  (trailing-spaces)
roles/openshift_odh/tasks/install_required_pkgs.yml
  8:11      warning  truthy value should be true or false  (truthy)
  16:11     warning  truthy value should be true or false  (truthy)
  36:17     warning  truthy value should be true or false  (truthy)
  41:11     warning  truthy value should be true or false  (truthy)
  74:19     warning  truthy value should be true or false  (truthy)
  79:13     warning  truthy value should be true or false  (truthy)
roles/openshift_odh/tasks/main.yml
  21:43     error    trailing spaces  (trailing-spaces)
  27:38     error    trailing spaces  (trailing-spaces)
  45:39     error    trailing spaces  (trailing-spaces)
roles/openshift_odh/tasks/uninstall_odh.yml
  21:80     error    trailing spaces  (trailing-spaces)
  25:32     error    trailing spaces  (trailing-spaces)
roles/openshift_sro/tasks/main.yml
  17:42     error    trailing spaces  (trailing-spaces)
  20:37     error    trailing spaces  (trailing-spaces)
roles/openshift_sro/tasks/uninstall_sro.yml
  14:32     error    trailing spaces  (trailing-spaces) 

Create a full weekly suite for the PSAP operators suite

We should have a BIG test that runs weekly, with chaos testing and other best-practice test paths.

The idea would be to:

  • run the basic installs, let it run for a couple of minutes
  • run a very small ML perf benchmark
  • run a chaos run, randomly deleting components from the operators and monitoring if the operators recover from it
  • run a scale up and scale down test (GPU and NFD test)
  • run a cluster upgrade

This will test and stress the PSAP operators against common real-world scenarios, so we can be prepared.

Test OpenShift upgrade with GPU workload running

Currently, for the upgrade scenario, we ...

  1. install and test the GPU Operator
  2. trigger the cluster upgrade
  3. test the GPU Operator.

We need to add two steps:

  • 2.5 start a long running GPU workload (gpu burn, but without waiting for completion)
  • 3.5 test what happened to the workload (eg, wait for it to be restarted and running)

Use Ansible role "template" files instead of custom sed replacement

We need to have a look at the template built-in feature and see how we could use it instead of sed:

- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
  shell:
    set -o pipefail;
    cat {{ gpu_operator_operatorhub_sub }}
    | sed 's|{{ '{{' }} startingCSV {{ '}}' }}|{{ gpu_operator_csv_name }}|'
    | oc apply -f-
  args:
    warn: false # don't warn about using sed here

This change was "anticipated", as we already use the jinja2 template style for our template files, eg:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: "{{ startingCSV }}"

Example:

- name: Store the CSV version
  set_fact:
    startingCSV: "{{ gpu_operator_csv_name }}"

- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
  template:
    src: "{{ gpu_operator_operatorhub_sub }}"
    dest: "{{ artifact_extra_logs_dir }}/gpu_operator_sub.yml"

TODO: list of the roles where the sed transformation should be removed:

  • nfd_deploy
  • local-ci_deploy
  • gpu_operator_run_gpu-burn
  • gpu_operator_deploy_custom_commit
  • gpu_operator_deploy_from_operatorhub

Allow scaling a cluster up and down with N nodes

Currently, toolbox/cluster/scaleup.sh [instance-type] allows only adding new MachineSets with a given instance-type.
For testing the GPU Operator support of scale-up and scale-down, we would need to be able to add new GPU nodes to a cluster, and potentially scale it to 0.

So,

  • 1/ toolbox/cluster/scaleup.sh should be extended (or complemented with another command) to be able to set the number of desired nodes of a given instance type, eg:
toolbox/cluster/scaleup.sh <instance-type> # make sure that machines with <instance-type> are available
toolbox/cluster/scaleup.sh <instance-type> N # make sure that N machines with <instance-type> are available
  • 2/ the nightly CI entrypoint should be extended to ensure that this capability works properly in the GPU Operator, with tests like:
Deploy the GPU Operator when 0 GPU nodes are available
Scale-up the cluster to 1 GPU node, make sure that the GPU of the nodes gets available
Scale-up the cluster to 2 GPU nodes, make sure that the 2 GPUs get available
Scale the cluster back down to 1 GPU node, make sure that the extra node disappears
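Point 1/ could boil down to scaling the MachineSet matching the instance type; a sketch, assuming AWS (the providerSpec jsonpath below is the AWS layout):

```shell
#!/bin/bash
# Scale the MachineSet of a given instance type to N replicas.
scale_machineset_to() {  # $1=instance-type, $2=desired replicas
    local ms
    ms=$(oc get machinesets -n openshift-machine-api \
            -o "jsonpath={.items[?(@.spec.template.spec.providerSpec.value.instanceType==\"$1\")].metadata.name}")
    if [ -z "$ms" ]; then
        echo "No MachineSet found for instance type $1" >&2
        return 1
    fi
    oc scale "machineset/$ms" -n openshift-machine-api --replicas="$2"
}
# scale_machineset_to g4dn.xlarge 2
```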

GPU Operator: Prepare a disconnected driver-container POC

Disconnected / air-gapped environments cannot access the Internet, only a limited set of image registries / package mirrors.

This POC will demonstrate how to build a custom GPU Operator driver image (with Internet access), then build and load the nvidia driver module on GPU nodes, without Internet access.

ci-artifacts maintenance overview

Some things I have in mind for improving/fixing ci-artifacts:

  • Fix the GPU Operator deploy_from_operatorhub to work with v1.9.0-beta and v1.9.0 when released

    • the single-namespace + entitlement-free new features change the way the GPU Operator is deployed, I addressed it for master branch testing, but I couldn't do it for OperatorHub deployment until released by NVIDIA (beta was released last week)
    • Fixed with #289
  • Update gpu_operator_set_namespace to use ClusterPolicy.status.namespace (see PR)

    • will be simpler than the code I wrote before this PR was merged
    • EDIT: WONT FIX, oc get pod -l app.kubernetes.io/component=gpu-operator -A -ojsonpath={.items[].metadata.namespace} is simple enough
  • Enable testing the GPU Operator v1.9 (when released, ie > 2021-12-03)

  • Call hack/must-gather.sh script instead of custom scripts

  • Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"

    • this image is 100% static, never updated; there's no need to rebuild it for every master GPU Operator test
    • this image is duplicated in NFD master test
    • takes 8 minutes to build
2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021  23:32:23 +0000 (0:00:00.697)       0:00:08.016 ******* 
2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built
  • Double check the alert-testing of the GPU Operator master branch

    • I'm doubtful about what happens with the driver-cannot-be-built alert wrt entitlement-free deployments
  • Refresh the versions used for the GPU Operator upgrade testing (currently only 4.6 --> 4.7)

  • Confirm the fate of testing the GPU Operator on OCP 4.6 clusters

    • EUS release
  • Enable testing of the GPU Operator on OCP 4.10

  • Improve the GPU Operator and rewrite gpu_operator_get_csv_version

    • I think it's becoming critical that the GPU Operator image exposes the GPU Operator version (v1.x.y) and the git commit used to build it
    • in the CI I already include the commit hash + commit date in the master-branch bundle version (eg, 21.11.25-git.57914a2), but that's not enough as this information isn't part of the operator image (recently the CI was using the same outdated image for a week and we failed to notice it until we had to test custom GPU Operator PRs)
    • once this is done, update this role to avoid fetching the information from the CSV, whenever possible (won't be backported)
