
Comments (5)

ArangoGutierrez commented on August 28, 2024

The build folder is for things we want in the container image; for this idea I would prefer a simple Bash script at the project root.
But I like the idea.

from ci-artifacts.

kpouget commented on August 28, 2024

@ArangoGutierrez I think this toolbox can be really useful for people dealing with NFD & GPU Operator, so we should strive to make it more complete and convenient to use. Some ideas of what should be added:

  • undeploy NFD from OperatorHub

  • better control of NFD channel

  • capture possible entitlement issues (list MC resources, make sure that cert/key/rhsm are in the right place, test dnf install kernel-devel, ...)

  • capture possible GPU Operator issues (entitlement ^^, NFD labelling, operator deployment, state of resources in gpu-operator-resources, ...)

  • deployment of an entitled cluster (https://gitlab.com/kpouget_psap/deploy-cluster)

In addition, I think the build/root/usr/local/bin/run image entry-point should be updated to rely on the toolbox, so that we don't break the toolbox by mistake.


kpouget commented on August 28, 2024

For reference, here are the current functionalities of the toolbox:

toolbox/gpu-operator/deploy_from_operatorhub.sh
toolbox/gpu-operator/undeploy_from_operatorhub.sh
toolbox/gpu-operator/deploy_from_commit.sh
toolbox/gpu-operator/undeploy_from_commit.sh
toolbox/gpu-operator/run_ci_checks.yml
toolbox/deploy_nfd_from_operatorhub.sh
toolbox/scaleup_gpu_node.sh


kpouget commented on August 28, 2024

Patch cfa402a is a step toward making the toolbox useful for people debugging issues related to NFD and the GPU Operator: it saves into a dedicated directory all the debugging logs that help to understand the state of the cluster, so that the files can then be retrieved and analyzed locally.

(The patch generates artifact files with GPU Operator-related logs, to better understand CI failures whenever they occur.)
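
As a rough illustration of that idea (this is not the content of cfa402a), a capture helper could run each diagnostic command and store its output in its own file under an artifact directory. The `capture_cmd` function, the `ARTIFACT_DIR` variable, its default path and the `.log` naming are all assumptions made for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: none of these names come from the ci-artifacts repository.

# Directory receiving the captured files (assumed name, can be overridden).
ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/toolbox-artifacts}"

# capture_cmd NAME CMD [ARGS...]
# Runs CMD and stores its stdout and stderr in $ARTIFACT_DIR/NAME.log.
# A failing command is recorded in the log instead of aborting the capture.
capture_cmd() {
    local name="$1"
    shift
    mkdir -p "$ARTIFACT_DIR"
    if ! "$@" > "$ARTIFACT_DIR/$name.log" 2>&1; then
        echo "command failed: $*" >> "$ARTIFACT_DIR/$name.log"
    fi
}

# Possible usage against a cluster (requires a logged-in `oc`):
#   capture_cmd nodes oc get nodes -o wide
#   capture_cmd gpu_operator_pods oc get pods -n gpu-operator-resources
```

Recording failures instead of stopping means one missing resource does not prevent the remaining logs from being collected.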


kpouget commented on August 28, 2024

🔲 Entitle the cluster by passing a PEM file (checking whether files should be concatenated or not, etc.), and do nothing if the cluster is already entitled.

@ArangoGutierrez do you think you can take care of that?

I would like to get rid of this:

entitle() {
    # TODO: this should become an ansible playbook properly waiting
    # for the deployment of the entitlement

    if [[ "${SKIP_ENTITLEMENT:-n}" == y ]]; then
        echo "Skipping entitlement (SKIP_ENTITLEMENT=${SKIP_ENTITLEMENT})"
        return
    fi

    PSAP_ENTITLEMENT="/var/run/psap-entitlement-secret/01-cluster-wide-machineconfigs.yaml"
    if [ ! -f "$PSAP_ENTITLEMENT" ]; then
        echo "FATAL: PSAP entitlement resources not found ($PSAP_ENTITLEMENT)"
        exit 1
    fi

    if ! oc get mc/50-entitlement-pem -oname 2>/dev/null | grep -q 50-entitlement-pem; then
        echo "===> Apply entitlement manifests from ${PSAP_ENTITLEMENT} <=="
        oc create -f "${PSAP_ENTITLEMENT}" && sleep 300
    else
        echo "===> Entitlement already instantiated <=="
    fi
}

and replace it with something like this in the CI:

PSAP_ENTITLEMENT="/var/run/psap-entitlement-secret/01-cluster-wide-machineconfigs.yaml"
./toolbox/entitle_cluster.sh --machine-configs "$PSAP_ENTITLEMENT"

or outside of the CI:

./toolbox/entitle_cluster.sh --pem /path/to/key.pem

which would make sure that the entitlement is correctly created and applied.
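
A minimal sketch of what such a script's core could look like is given below. Only the two flags and the `mc/50-entitlement-pem` check come from the snippets above; the function names, messages and return codes are assumptions, and both the PEM-to-MachineConfig conversion and the wait for the rollout are deliberately left out because the thread does not specify them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of toolbox/entitle_cluster.sh, not the actual script.

# Succeeds when the entitlement MachineConfig already exists.
is_entitled() {
    oc get mc/50-entitlement-pem -oname >/dev/null 2>&1
}

# entitle_cluster (--machine-configs FILE | --pem FILE)
entitle_cluster() {
    local mode="" file=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --machine-configs|--pem)
                if [[ $# -lt 2 ]]; then
                    echo "FATAL: $1 requires a file argument" >&2
                    return 2
                fi
                mode="$1"
                file="$2"
                shift 2
                ;;
            *)
                echo "FATAL: unknown argument '$1'" >&2
                return 2
                ;;
        esac
    done

    if [[ -z "$mode" ]]; then
        echo "usage: entitle_cluster (--machine-configs FILE | --pem FILE)" >&2
        return 2
    fi
    if [[ ! -f "$file" ]]; then
        echo "FATAL: entitlement file not found ($file)" >&2
        return 1
    fi

    # Idempotency: do nothing if the cluster is already entitled.
    if is_entitled; then
        echo "Cluster already entitled, nothing to do"
        return 0
    fi

    case "$mode" in
        --machine-configs)
            # Waiting for the MachineConfig rollout is not covered here.
            oc create -f "$file"
            ;;
        --pem)
            # Building MachineConfig resources from a PEM file is unspecified.
            echo "FATAL: --pem is not implemented in this sketch" >&2
            return 3
            ;;
    esac
}
```

Keeping the check in a separate `is_entitled` function makes the "already entitled" path easy to exercise without touching a cluster, for instance by overriding `oc` with a shell function.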

