openshift-psap / ci-artifacts
OpenShift PSAP-team CI Artifacts
License: Apache License 2.0
./run_toolbox.py nfd_operator deploy_from_operatorhub
./run_toolbox.py gpu_operator deploy_from_operatorhub
Both commands take the same kind of parameters (${OPERATOR_CHANNEL:-}, ${OPERATOR_VERSION:-}, --namespace ${OPERATOR_NAMESPACE}), and their behaviour must be very similar, so I think it shouldn't be hard to rewrite ./run_toolbox.py gpu_operator deploy_from_operatorhub into a generic command, something like:
deploy_from_operatorhub
--catalog=certified-operators
--name=gpu-operator
--namespace=... # optional
--channel=v1.9.0 # optional, can use defaultChannel
--csv-name=... # optional, can use defaultCSV
--deploy-default-cr=True
This would allow us to install any operator from the command line.
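For illustration, here is a minimal sketch (with illustrative names, assuming the standard OLM objects) of what the generic command could do under the hood; a real implementation would also need to create the OperatorGroup and handle --deploy-default-cr:

#! /bin/bash
set -euo pipefail

CATALOG=certified-operators
NAME=gpu-operator-certified
NAMESPACE=${NAMESPACE:-openshift-operators}

# fall back to the PackageManifest defaults when channel/CSV are not provided
CHANNEL=${CHANNEL:-$(oc get packagemanifests/$NAME -n openshift-marketplace \
    -o jsonpath='{.status.defaultChannel}')}
CSV=${CSV:-$(oc get packagemanifests/$NAME -n openshift-marketplace \
    -o jsonpath="{.status.channels[?(@.name==\"$CHANNEL\")].currentCSV}")}

# create the Subscription with the resolved values
cat <<EOF | oc apply -f-
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: $NAME
  namespace: $NAMESPACE
spec:
  channel: "$CHANNEL"
  name: $NAME
  source: $CATALOG
  sourceNamespace: openshift-marketplace
  startingCSV: "$CSV"
EOF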
#300 gpu_operator_run_gpu-burn: make gpu-burn execution easier to reproduce
#301 benchmarking: make the execution easier to reproduce
The command could be rewritten with these two PRs (^^^) in mind, to make the install easy to reproduce from the execution artifacts.
Until v1.6.2 (included) of the GPU Operator, OpenShift cluster upgrade is not supported, because the driver DaemonSet receives the RHEL_VERSION
as a DaemonSet/Pod environment variable.
This makes the driver Pod unable to build the nvidia driver, because it cannot fetch any package.
Example of lines from the driver logs:
+ echo -e 'Starting installation of NVIDIA driver version 460.32.03 for Linux kernel version 4.18.0-240.22.1.el8_3.x86_64\n'
el8_3 means that a RHEL 8.3 kernel is running, but the --releasever=8.2 in
dnf -q -y --releasever=8.2
shows that the DaemonSet is configured with RHEL_VERSION=8.2.
It would be easy to check that the OCP_VERSION in the driver DaemonSet matches the ocp_release variable that we already capture in the Ansible playbooks.
This test could be integrated into the diagnose.sh script.
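A minimal sketch of such a check (the DaemonSet name, namespace and env-variable location are assumptions based on a default GPU Operator deployment):

# compare the driver DaemonSet OCP_VERSION hint against the running cluster
ds_ocp_version=$(oc get daemonset/nvidia-driver-daemonset -n gpu-operator-resources \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="OCP_VERSION")].value}')
cluster_version=$(oc get clusterversion/version \
    -o jsonpath='{.status.desired.version}' | cut -d. -f1-2)

if [ "$ds_ocp_version" != "$cluster_version" ]; then
    echo "WARNING: driver OCP_VERSION ($ds_ocp_version) != cluster version ($cluster_version)"
fi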
The GPU Operator should seamlessly support the upgrade of the OpenShift cluster (in a forthcoming release, at least).
We want to be able to test this upgrade support in the nightly CI.
OpenShift Prow doesn't support our upgrade use case:
(operators are usually preinstalled in OpenShift, or straightforward to install via OperatorHub, but the GPU Operator requires the deployment of the entitlement, the scale-up of the cluster with GPU nodes, and the deployment of NFD ...).
This ticket will track the development of this feature.
Currently, ./toolbox/cluster/set_scale.sh cannot be customized to decide which MachineSet will be used to derive the new MachineSet:
- name: Get the names of the existing worker machinesets (of any instance type)
  command:
    oc get machinesets -n openshift-machine-api
    -o jsonpath='{range .items[?(@.spec.template.metadata.labels.machine\.openshift\.io/cluster-api-machine-role=="worker")]}{.metadata.name}{"\n"}{end}'
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout
It would be nice to have the ability to easily override oc_get_machinesets.stdout to specify which MachineSet to use as a base.
Quick and dirty example of what I did to work around that:
- name: Get the names of the existing worker machinesets (of any instance type)
  command:
    echo kpouget-20210519-kf6rn-worker-eu-central-1b
  register: oc_get_machinesets
  failed_when: not oc_get_machinesets.stdout
The reason for that is that the instance-type I want isn't available in eu-central-1a, only in 1b.
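A cleaner option could look like this (a sketch; BASE_MACHINESET is a hypothetical override variable, not an existing toolbox flag):

if [ -n "${BASE_MACHINESET:-}" ]; then
    # the caller pinned the base MachineSet explicitly
    echo "$BASE_MACHINESET"
else
    # fall back to the first worker MachineSet, as today
    oc get machinesets -n openshift-machine-api \
       -o jsonpath='{range .items[?(@.spec.template.metadata.labels.machine\.openshift\.io/cluster-api-machine-role=="worker")]}{.metadata.name}{"\n"}{end}' \
       | head -1
fi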
Include the "p3.2xlarge" GPU instance as well during CI along with the cheaper "g4dn.xlarge"
When we onboard performance tests in the CI, we should run the training workloads on the p3 instance and the inference workloads in the g4dn instance
The following branches are being fast-forwarded from the current development branch (release-4.8) as placeholders for future releases. No merging is allowed into these release branches until they are unfrozen for production release.
release-4.9
Contact the Test Platform or Automated Release teams for more information.
Currently we download oc (kubectl), helm and operator-sdk as part of the precheck() call of build/root/usr/local/bin/run.
I think it would be better to fetch these binaries when building the image.
The toolbox scripts are used to test the deployment of the GPU operator, so most of their code (ansible playbooks and roles) is tested before merging a new PR in the master branch (\test gpu-operator-e2e) and in the nightly testing.
But the GPU operator testing doesn't cover 100% of the toolbox features, and some flags and code branches might be left untested. This is for instance what happened with toolbox/entitlement/test.sh, which got broken when no flag was passed (see the fix in cf8a276), as that code path isn't executed in the GPU Operator testing.
This ticket will track the progress of the design and development of unit tests.
Every once in a while, the nightly testing fails because the cluster becomes unreachable:
roles/capture_environment/tasks/main.yml:8
TASK: capture_environment : Store OpenShift YAML version
----- FAILED ----
msg: non-zero return code
<command> oc version -oyaml > /logs/artifacts/233800__cluster__capture_environment/ocp_version.yml
<stderr> The connection to the server api.ci-op-75fhpdb3-3c6fc.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
----- FAILED ----
This kind of failure is independent from the GPU Operator testing, and it should be made clear in the CI-Dashboard (the Prow infrastructure restarts the testing when this happens). An orange dot could do the job, with a label like "cluster issue detected".
To detect this, the must-gather script could simply create a file named cluster-down when oc version doesn't work.
The presence of this file would tell the ci-dashboard to set the orange flag.
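A minimal sketch of the detection (the artifacts path is an assumption):

# flag an unreachable cluster for the ci-dashboard
if ! oc version >/dev/null 2>&1; then
    touch "${ARTIFACT_DIR:-/logs/artifacts}/cluster-down"
fi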
Currently, we are nightly testing the released versions of the GPU Operator with this entrypoint:
run gpu-operator test_operatorhub 1.8.0 v1.8
where 1.8.0 specifies the version of the operator to be installed, and v1.8 specifies the OLM channel.
It would be better to only specify the channel, so that the latest minor version is installed:
run gpu-operator test_operatorhub v1.8
When the gpu_operator_deploy_from_operatorhub role was originally written (before v1.7), the GPU Operator only had a stable channel, so the channel could be omitted, and the full version had to be specified.
Now that NVIDIA switched to a dedicated channel per 1.X release, we could update the role/entrypoint to test "the latest minor available for a given channel".
In addition, the ci-dashboard was recently updated to show the exact version of the GPU Operator being installed & tested (instead of a hard-coded value).
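Note that with OLM, the Subscription reports back the exact version it resolved for the channel, which the ci-dashboard could read; a sketch:

# the exact CSV that OLM installed for the requested channel
oc get subscription/gpu-operator-certified -n openshift-operators \
   -o jsonpath='{.status.installedCSV}'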
From time to time, in the CI, NFD or the GPU Operator fail to be installed from OperatorHub, because the PackageManifest is not found:
<command> oc get packagemanifests/gpu-operator-certified -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "gpu-operator-certified" not found
<command> oc get packagemanifests/nfd -n openshift-marketplace
<stderr> Error from server (NotFound): packagemanifests.packages.operators.coreos.com "nfd" not found
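Until the root cause is understood, a simple mitigation could be to retry the lookup for a few minutes instead of failing on the first NotFound; a sketch:

# the marketplace catalog sometimes takes a while to (re)populate
for _ in $(seq 1 30); do
    oc get packagemanifests/gpu-operator-certified -n openshift-marketplace && break
    sleep 10
done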
In parallel with implementing unit testing (#92), it would be interesting to measure the code coverage of the GPU Operator testing + unit testing, to ensure that all the playbook tasks are executed at least once as part of the presubmit testing.
This ticket will track the design and development of this task.
I created this issue to discuss the topic and understand if the idea makes sense.
Currently, the code of this repository isn't specific to any version of OpenShift, so I would suggest getting rid of the release-4.x branches and using a simpler workflow, with a master branch defining the way to test the GPU Operator on all the OpenShift releases.
See this commit kpouget/release@3790f84 for the patch that should be applied to the openshift-release repository.
Currently, the cluster upgrade testing is performed "manually" in the cluster_upgrade_to_image role.
This simple playbook only waits for the end of the upgrade, but doesn't perform any other kind of test.
The reason for this choice is that
In the future, it would be important to move to a proper CI upgrade step.
Currently, the entitlement is performed cluster-wide, so all the nodes of the cluster have to be rebooted when the entitlement is deployed.
In order to avoid rebooting nodes that do not require the entitlement, we need to update:
- the MachineConfig resources, to target only a specific set of nodes
- the MachineSet, to apply a label to the node when it gets created (instead of relying on NFD to discover that it has a GPU)
I think it would be good to keep the existing behavior as the default for the toolbox commands, but add a --label ... flag to support this optimization.
This avoids going through an ssh connection to run a local task.
This works by default on my system (Fedora 32), but I've seen issues where people had to manually allow ssh localhost to work without a password.
Currently, before deploying the GPU Operator on the CI, we do this:
prepare_cluster_for_gpu_operator() {
    entitle
    toolbox/scaleup_cluster.sh
    toolbox/nfd/deploy_from_operatorhub.sh
}
but we never test that NFD correctly labels the nodes, and that GPU nodes are indeed available.
So prepare_cluster_for_gpu_operator should be extended with toolbox/nfd/wait_for_gpu_nodes.sh, which would wait 5 minutes for a node with one of these labels to show up:
var gpuNodeLabels = map[string]string{
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-0302_10de.present": "true",
"feature.node.kubernetes.io/pci-0300_10de.present": "true",
}
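A sketch of what the script could do (the polling period is illustrative, and a complete version would check all three labels, not just the first):

for _ in $(seq 1 30); do  # 30 x 10s = 5 minutes
    if oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name | grep -q .; then
        echo "GPU node found"
        exit 0
    fi
    sleep 10
done
echo "ERROR: no GPU node appeared within 5 minutes"
exit 1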
As per this issue openshift-psap/blog-artifacts#6, using the same content for entitlement.pem and entitlement-key.pem isn't safe, as confirmed by this command:
$ NAME=key
$ podman run --rm -it -v $KEY:/etc/pki/entitlement/$NAME.pem registry.access.redhat.com/ubi8-minimal:8.3-298 bash -x -c "cp /etc/pki/entitlement/$NAME.pem /etc/pki/entitlement/$NAME-key.pem; microdnf install kernel-devel"
+ cp /etc/pki/entitlement/key.pem /etc/pki/entitlement/key-key.pem
+ microdnf install kernel-devel
Downloading metadata...
Downloading metadata...
Downloading metadata...
Downloading metadata...
error: cannot update repo 'rhel-8-for-x86_64-baseos-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried; Last error: Curl error (58): Problem with the local SSL certificate for https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml [unable to set private key file: '/etc/pki/entitlement/key-key-key.pem' type PEM]
NAME=entite --> doesn't work
NAME=entitlement --> works

I am watching an Ansible course as part of Red Hat Day of Learning, where they explain the concept of Ansible tags: execute only the tasks matching a command-line --tags name, or skip those matching --skip-tags name2.
This feature could be useful for us; it needs to be investigated.
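For example (the playbook is real, but the tag names are hypothetical, none exist in the repository yet):

ansible-playbook playbooks/nvidia-gpu-operator-ci.yml --tags entitlement
ansible-playbook playbooks/nvidia-gpu-operator-ci.yml --skip-tags gpu-burn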
OpenShift allows capturing key information about the cluster with the must-gather
command. This command allows passing a custom image, eg:
oc adm must-gather --image=quay.io/kubevirt/must-gather:latest --dest-dir=/tmp/must
See this document for an explanation about the design, the main script and the secondary scripts.
The requirements for the image are simple:
To provide your own must-gather image, it must:
- have a zero-arg, executable file at /usr/bin/gather that does your default gathering
- produce data to be copied back at /must-gather. The data must not contain any sensitive data. We don't strip PII information, only secret information.
- produce a text file /must-gather/version that indicates the product (first line) and the version (second line, major.minor.micro.qualifier), so that programmatic analysis can be developed.
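A minimal sketch of a gather script satisfying these requirements (the gathered resources and the version string are illustrative):

#! /bin/bash
mkdir -p /must-gather
# product name (first line) and version (second line)
printf 'ci-artifacts\n0.0.1\n' > /must-gather/version
# default gathering: illustrative resources only
oc get clusterpolicy -oyaml > /must-gather/clusterpolicy.yaml
oc get pods -n gpu-operator-resources -oyaml > /must-gather/gpu-operator-pods.yaml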
This issue tracks the progress of the PSAP toolbox:
toolbox/gpu-operator/deploy_from_operatorhub.sh
toolbox/gpu-operator/undeploy_from_operatorhub.sh
toolbox/gpu-operator/list_version_from_helm.sh
toolbox/gpu-operator/deploy_with_helm.sh <helm-version>
toolbox/gpu-operator/undeploy_with_helm.sh
toolbox/gpu-operator/deploy_from_commit.sh <git repository> <git reference> [gpu_operator_image_tag_uid]
Example:
toolbox/gpu-operator/deploy_from_commit.sh https://github.com/NVIDIA/gpu-operator.git master
toolbox/gpu-operator/run_ci_checks.sh
Run GPU Burn to validate that the GPUs can run workloads
Capture GPU operator possible issues (entitlement, NFD labelling, operator deployment, state of resources in gpu-operator-resources, ...)
toolbox/nfd/deploy_from_operatorhub.sh
toolbox/nfd/undeploy_from_operatorhub.sh
Control the channel to use from the command-line
Test the NFD deployment #78
toolbox/nfd/has_gpu_nodes.sh
toolbox/nfd/wait_gpu_nodes.sh
./toolbox/scaleup_cluster.sh
./toolbox/scaleup_cluster.sh <machine-type>
toolbox/entitlement/deploy.sh --pem /path/to/pem
toolbox/entitlement/deploy.sh --machine-configs /path/to/machineconfigs
toolbox/entitlement/undeploy.sh
toolbox/entitlement/test.sh
toolbox/entitlement/wait.sh
toolbox/entitlement/inspect.sh
Usage: toolbox/local-ci/deploy.sh <ci command> <git repository> <git reference> [gpu_operator_image_tag_uid]
Example: toolbox/local-ci/deploy.sh 'run gpu-ci' https://github.com/openshift-psap/ci-artifacts.git master
toolbox/local-ci/cleanup.sh
To improve the reusability of our roles, and potentially allow their use in Ansible Galaxy, we need to add README.md files to the different roles we create, and describe their input parameters, dependencies, etc.
Currently, the GPU Operator ClusterPolicy is fetched from the ClusterServiceVersion alm-examples annotation and instantiated right away.
However, in some cases, the default content is not the one we desire. See for instance this unmerged commit, where we need to set the repoConfig stanza when running with OCP 4.8 (using RHEL beta repositories).
toolbox/gpu-operator/deploy_from_operatorhub.sh
[...]
if oc version | grep -q "Server Version: 4.8"; then
    echo "Running on OCP 4.8, enabling RHEL beta repository"
    ./toolbox/gpu-operator/set_repo-config.sh --rhel-beta
fi
toolbox/gpu-operator/wait_deployment.sh
Another example would be when we want to customize the operator or operand image path to use custom ones.
The GPU Operator DaemonSets are never updated once created, so if they are created with the wrong values, the DaemonSets will never be fixed.
The hack above works (hopefully) because the driver container will fail to deploy without the right repoConfig configuration, so it's safe to manually delete it after the update; but in the general case, the driver container should never be deleted once running, as the nvidia driver cannot be removed from the kernel while other processes (workload or operand) are using it.
We should find a way to allow patching the ClusterPolicy before deploying it. The solution should be generic, so that any kind of modification can be performed during the deployment.
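One possible generic approach (a sketch; the jq patch, the CSV variable and the assumption that the ClusterPolicy is the first alm-example are all illustrative): extract the ClusterPolicy from the CSV alm-examples annotation, patch it with an arbitrary expression, then apply it:

# extract, patch, then apply the ClusterPolicy
oc get csv/"$GPU_OPERATOR_CSV" -n openshift-operators \
   -o jsonpath='{.metadata.annotations.alm-examples}' \
   | jq '.[0] | .spec.driver.repoConfig.configMapName = "repo-config"' \
   | oc apply -f-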
We are currently using some roles (eg nv_gpu) to perform many different tasks, depending on the flags we activate:

- name: Install NFD-operator from OperatorHub
  include_tasks: roles/nv_gpu/tasks/install_nfd.yml
  when: install_nfd_operator_from_hub == "yes"

- name: Wait for NFD-labeled GPU nodes to appear
  include_tasks: roles/nv_gpu/tasks/test_nfd_gpu.yml
  when: nfd_test_gpu_nodes == "yes"

- name: Install GPU-operator from OperatorHub
  include_tasks: roles/nv_gpu/tasks/install_nv.yml
  when: install_gpu_operator_from_hub == "yes"
I don't think this is the way Ansible roles are supposed to be used, as it causes many tasks to be shown as "skipped", but still visible in the logs.
This ticket will track the refactoring of these big roles into smaller chunks, each doing only one task (= one role per toolbox script, more or less).
Currently, when we want to test a specific commit of the GPU Operator, we internally use the GPU Operator helm-chart to configure and deploy the resources.
test_commit() {
    CI_IMAGE_GPU_COMMIT_CI_REPO="${1:-https://github.com/NVIDIA/gpu-operator.git}"
    CI_IMAGE_GPU_COMMIT_CI_REF="${2:-master}"
    CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID="ci-image"

    echo "Using Git repository ${CI_IMAGE_GPU_COMMIT_CI_REPO} with ref ${CI_IMAGE_GPU_COMMIT_CI_REF}"

    prepare_cluster_for_gpu_operator
    toolbox/gpu-operator/deploy_from_commit.sh "${CI_IMAGE_GPU_COMMIT_CI_REPO}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_REF}" \
                                               "${CI_IMAGE_GPU_COMMIT_CI_IMAGE_UID}"
    validate_gpu_operator_deployment
}
This is working properly; however, it would be better to test the bundle resources, as that is the method that will be used to deploy on OpenShift, including in the nightly testing of the master branch.
Explore if we can have a bot to publish a weekly PR status and the PRs that need some attention.
Update the README, and also write proper documentation about the available playbooks.
./roles/entitlement_test_wait_deployment/defaults/main/config.yml:2: successfull ==> successful
./roles/gpu_operator_run_gpu-burn/tasks/main.yml:53: Instanciate ==> Instantiate
./roles/nfd_test_wait_labels/tasks/main.yml:4: quering ==> querying
Can you guys also think of a release management and testing plan for ci-artifacts? E.g., how often will we cut a release? What will be the criteria for cutting a release? etc.
We have a template for that in NFD; maybe we could use the same for ci-artifacts:
https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/.github/ISSUE_TEMPLATE/new-release.md
We currently do not have any test validating the GPU Operator connected to the Internet through a cluster-wide proxy.
This configuration appeared to be buggy in the GPU Operator 1.8.0 and 1.8.1, due to the non-deterministic ordering of the driver-container env entries, leading to a constant update of the driver DaemonSet and recreation of the Pods.
We should have a test case covering this use case, maybe running once a week, maybe with an in-cluster proxy relay as a first step.
By default, ansible-lint only tests the files modified by the PR, and hence it never ran over the full repository.
$ ansible-lint -v --force-color -c config/ansible-lint.yml playbooks roles
INFO Discovering files to lint: git ls-files -z
vs
$ ansible-lint -v --force-color -c config/ansible-lint.yml $(find . -name '*.yml')
# .ansible-lint
warn_list: # or 'skip_list' to silence them completely
- internal-error # Unexpected internal error
- syntax-check # Ansible syntax check failed
Finished with 40 failure(s), 0 warning(s) on 195 files.
we need to have a look at these warnings/errors and fix them.
After running yamllint, some lint errors were found:
playbooks/gpu-burst.yml
5:17 warning truthy value should be true or false (truthy)
9:19 error trailing spaces (trailing-spaces)
playbooks/nvidia-gpu-operator-ci.yml
11:1 error trailing spaces (trailing-spaces)
playbooks/openshift-psap-ci.yml
12:1 error trailing spaces (trailing-spaces)
23:1 error trailing spaces (trailing-spaces)
roles/check_deps/tasks/main.yml
6:18 warning truthy value should be true or false (truthy)
12:18 warning truthy value should be true or false (truthy)
18:18 warning truthy value should be true or false (truthy)
21:8 error trailing spaces (trailing-spaces)
28:44 error trailing spaces (trailing-spaces)
29:1 error trailing spaces (trailing-spaces)
32:14 error trailing spaces (trailing-spaces)
roles/nv_gpu/files/001_namespace.yaml
5:1 error too many blank lines (1 > 0) (empty-lines)
roles/nv_gpu/files/003_operator_sub.yaml
5:36 error trailing spaces (trailing-spaces)
8:31 error trailing spaces (trailing-spaces)
9:30 error trailing spaces (trailing-spaces)
10:41 error trailing spaces (trailing-spaces)
roles/nv_gpu/tasks/ci_checks.yml
5:32 error trailing spaces (trailing-spaces)
6:15 error trailing spaces (trailing-spaces)
13:46 error trailing spaces (trailing-spaces)
21:11 error trailing spaces (trailing-spaces)
27:1 error too many blank lines (1 > 0) (empty-lines)
roles/nv_gpu/tasks/install.yml
4:37 error trailing spaces (trailing-spaces)
11:1 error trailing spaces (trailing-spaces)
14:41 error trailing spaces (trailing-spaces)
21:1 error trailing spaces (trailing-spaces)
24:52 error trailing spaces (trailing-spaces)
31:1 error trailing spaces (trailing-spaces)
34:41 error trailing spaces (trailing-spaces)
41:1 error trailing spaces (trailing-spaces)
roles/nv_gpu/tasks/main.yml
8:1 error trailing spaces (trailing-spaces)
9:26 error trailing spaces (trailing-spaces)
roles/openshift_nfd/tasks/ci_checks.yml
5:32 error trailing spaces (trailing-spaces)
6:15 error trailing spaces (trailing-spaces)
13:24 error trailing spaces (trailing-spaces)
19:15 error trailing spaces (trailing-spaces)
roles/openshift_nfd/tasks/main.yml
17:1 error trailing spaces (trailing-spaces)
50:68 error trailing spaces (trailing-spaces)
53:37 error trailing spaces (trailing-spaces)
57:60 error trailing spaces (trailing-spaces)
62:50 error trailing spaces (trailing-spaces)
80:1 error trailing spaces (trailing-spaces)
81:26 error trailing spaces (trailing-spaces)
roles/openshift_nfd/tasks/uninstall_nfd.yml
14:32 error trailing spaces (trailing-spaces)
18:1 error too many blank lines (1 > 0) (empty-lines)
roles/openshift_node/tasks/aws.yml
80:17 error trailing spaces (trailing-spaces)
106:1 error trailing spaces (trailing-spaces)
112:1 error trailing spaces (trailing-spaces)
roles/openshift_node/tasks/main.yml
2:63 error trailing spaces (trailing-spaces)
roles/openshift_node/tasks/scaleup_checks.yml
19:9 error trailing spaces (trailing-spaces)
26:133 error trailing spaces (trailing-spaces)
33:38 error trailing spaces (trailing-spaces)
34:8 error trailing spaces (trailing-spaces)
35:47 error trailing spaces (trailing-spaces)
36:11 error too many spaces after colon (colons)
37:25 error trailing spaces (trailing-spaces)
41:89 error trailing spaces (trailing-spaces)
48:40 error trailing spaces (trailing-spaces)
49:8 error trailing spaces (trailing-spaces)
50:49 error trailing spaces (trailing-spaces)
51:11 error too many spaces after colon (colons)
52:25 error trailing spaces (trailing-spaces)
roles/openshift_odh/tasks/install_required_pkgs.yml
8:11 warning truthy value should be true or false (truthy)
16:11 warning truthy value should be true or false (truthy)
36:17 warning truthy value should be true or false (truthy)
41:11 warning truthy value should be true or false (truthy)
74:19 warning truthy value should be true or false (truthy)
79:13 warning truthy value should be true or false (truthy)
roles/openshift_odh/tasks/main.yml
21:43 error trailing spaces (trailing-spaces)
27:38 error trailing spaces (trailing-spaces)
45:39 error trailing spaces (trailing-spaces)
roles/openshift_odh/tasks/uninstall_odh.yml
21:80 error trailing spaces (trailing-spaces)
25:32 error trailing spaces (trailing-spaces)
roles/openshift_sro/tasks/main.yml
17:42 error trailing spaces (trailing-spaces)
20:37 error trailing spaces (trailing-spaces)
roles/openshift_sro/tasks/uninstall_sro.yml
14:32 error trailing spaces (trailing-spaces)
Currently, we only test the successful deployment of the GPU Operator on the cluster, but we do not run any GPU payload on it.
We should extend the playbooks to run GPU Burn, and find a way to run it on every GPU available (ie, not only the first one of the node).
https://github.com/openshift-psap/gpu-burn/blob/master/gpu-burn.yaml
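A sketch of how the available GPUs could be discovered, so gpu-burn can be scheduled on all of them instead of only the first one (nvidia.com/gpu is the resource exposed by the GPU Operator device plugin):

# list how many GPUs each node exposes
oc get nodes \
   -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.capacity.nvidia\.com/gpu}{"\n"}{end}'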
We should have a BIG test that runs weekly, with chaos testing and other best-practice test paths.
The idea would be to:
This will test and stress the PSAP operators against common real-world scenarios, so we can be prepared.
Currently, for the upgrade scenario, we ...
We need to add two steps:
(gpu burn, but without waiting for completion)

We need to have a look at the template built-in feature and see how we could use it instead of using sed:
- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
shell:
set -o pipefail;
cat {{ gpu_operator_operatorhub_sub }}
| sed 's|{{ '{{' }} startingCSV {{ '}}' }}|{{ gpu_operator_csv_name }}|'
| oc apply -f-
args:
warn: false # don't warn about using sed here
This change was "anticipated", as we already use the jinga2
remplate style for our template files, eg:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: "{{ startingCSV }}"
Example:

- name: Store the CSV version
  set_fact:
    startingCSV: "{{ gpu_operator_csv_name }}"

- name: "Create the OperatorHub subscription for {{ gpu_operator_csv_name }}"
  template:
    src: "{{ gpu_operator_operatorhub_sub }}"
    dest: "{{ artifact_extra_logs_dir }}/gpu_operator_sub.yml"
TODO: list of the roles with the sed transformation removed:
nfd_deploy
local-ci_deploy
gpu_operator_run_gpu-burn
gpu_operator_deploy_custom_commit
gpu_operator_deploy_from_operatorhub
Currently, toolbox/cluster/scaleup.sh [instance-type] allows only adding new MachineSets with a given instance-type.
For testing the GPU Operator support of scale-up and scale-down, we would need to be able to add new GPU nodes to a cluster, and potentially scale it to 0.
So, toolbox/cluster/scaleup.sh should be extended (or complemented with another command) to be able to set the number of desired nodes of a given instance type, eg (see the sketch after the scenario list below):
toolbox/cluster/scaleup.sh <instance-type>    # make sure that machines with <instance-type> are available
toolbox/cluster/scaleup.sh <instance-type> N  # make sure that N machines with <instance-type> are available
Deploy the GPU Operator when 0 GPU nodes are available
Scale up the cluster to 1 GPU node, make sure that the GPU of the node becomes available
Scale up the cluster to 2 GPU nodes, make sure that the 2 GPUs become available
Scale back down to 1 GPU node, make sure that the extra node disappears
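A sketch of the underlying command for the N-replicas variant (AWS-specific: the instanceType location in the providerSpec is an assumption tied to the AWS provider):

INSTANCE_TYPE=$1
REPLICAS=${2:-1}
# find the MachineSet matching the requested instance type, then scale it
MACHINESET=$(oc get machinesets -n openshift-machine-api \
    -o jsonpath="{.items[?(@.spec.template.spec.providerSpec.value.instanceType==\"$INSTANCE_TYPE\")].metadata.name}")
oc scale machineset/"$MACHINESET" -n openshift-machine-api --replicas="$REPLICAS"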
Disconnected / air-gapped environments cannot access the Internet, only a limited set of image registries / package mirrors.
This POC will demonstrate how to build a custom GPU Operator driver image (with Internet access), then build and load the nvidia driver module on GPU nodes, without Internet access.
Some things I have in mind for improving/fixing ci-artifacts:

- Fix the GPU Operator deploy_from_operatorhub to work with v1.9.0-beta, and with v1.9.0 when released
  (done for the master branch testing, but I couldn't do it for the OperatorHub deployment until released by NVIDIA; the beta was released last week)
- Update gpu_operator_set_namespace to use ClusterPolicy.status.namespace (see PR)
  WONT FIX: oc get pod -l app.kubernetes.io/component=gpu-operator -A -ojsonpath={.items[].metadata.namespace} is simple enough
- Enable testing the GPU Operator v1.9 (when released, ie > 2021-12-03)
- Call the hack/must-gather.sh script instead of custom scripts
- Turn the image helper BuildConfig into a simple Dockerfile + quay.io "build on master-merge"
  (the master GPU Operator test currently waits for the helper image to be built:
  2021-11-28 23:32:23,960 p=90 u=psap-ci-runner n=ansible | Sunday 28 November 2021 23:32:23 +0000 (0:00:00.697) 0:00:08.016 *******
  2021-11-28 23:32:24,402 p=90 u=psap-ci-runner n=ansible | TASK: gpu_operator_bundle_from_commit : Wait for the helper image to be built)
- Double check the alert-testing of the GPU Operator master branch
- Refresh the versions used for the GPU Operator upgrade testing: master currently only runs the 4.6 --> 4.7 upgrade testing, and both versions are not supported anymore
- Confirm the fate of testing the GPU Operator on OCP 4.6 clusters
- Enable testing of the GPU Operator on OCP 4.10
- Improve the GPU Operator master-branch version reporting: rewrite gpu_operator_get_csv_version to show the bundle version (eg, 21.11.25-git.57914a2), but that's not enough, as this information isn't part of the operator image (recently the CI was using the same outdated image for a week, and we failed to notice it until we had to test custom GPU Operator PRs)
- The ad-hoc / messy arg parsing each script does now is getting a bit out of hand; maybe we should consider transforming the toolbox from a bunch of shell scripts into something like https://click.palletsprojects.com/en/8.0.x/ or https://github.com/google/python-fire (or anything else we like from this list). This will be more maintainable and give us --help for free.