
ezdemo's Issues

configure AD users on picasso

It would be good to configure ad_admin1 and ad_user1 on picasso.
As an ansible newbie, I'm not sure of the best way to structure the scripts.

Here's my attempt ...

root@1e2f223abb02:/app/server# cat ansible/routines/configure_picasso.yml
### Configure Picasso
- hosts: "{{ (groups['controllers'] | first) | default([])}}"
  tasks:
    - name: read cluster id
      shell: "hpecp k8scluster list -o text | cut -d' ' -f1"
      register: cluster_id

    - name: get cluster
      shell: "hpecp k8scluster get {{ cluster_id.stdout }} -o json"
      register: cluster_json
      ignore_errors: True

    - set_fact:
        cluster: "{{ cluster_json.stdout | from_json }}"
    - set_fact:
        firstmaster_id: "{{ (cluster | json_query(jmesquery)) | first }}"
      vars:
        jmesquery: "k8shosts_config[?role=='master'].node"

    - shell: "hpecp k8sworker get {{ firstmaster_id }} -o json"
      register: firstmaster_json
    - set_fact:
        firstmasterip: "{{ (firstmaster_json.stdout | from_json) | json_query('ipaddr') }}"

    - name: prepare tenants
      shell: |-
        function retry {
          local n=1
          local max=20
          local delay=30
          while true; do
            "$@" && break || {
              if [[ $n -lt $max ]]; then
                ((n++))
                echo "Command failed. Attempt $n/$max:"
                sleep $delay;
              else
                echo "The command has failed after $n attempts." >&2
                exit 1
              fi
            }
          done
        }
        export SCRIPTPATH="/opt/bluedata/bundles/hpe-cp*"
        export MASTER_NODE_IP={{ firstmasterip }}
        export LOG_FILE_PATH=/tmp/register_k8s_prepare.log
        retry ${SCRIPTPATH}/startscript.sh --action prepare_dftenants
        export LOG_FILE_PATH=/tmp/register_k8s_configure.log
        [[ $(tail -1 ${LOG_FILE_PATH} 2> /dev/null ) == "The action configure_dftenants completed successfully." ]] || echo yes | ${SCRIPTPATH}/startscript.sh --action configure_dftenants
        export LOG_FILE_PATH=/tmp/register_k8s_register.log
        [[ $(tail -1 ${LOG_FILE_PATH} 2> /dev/null ) == "The action register_dftenants completed successfully." ]] || expect <<EOF
          set timeout 1800
          spawn $(realpath ${SCRIPTPATH})/startscript.sh --action register_dftenants
          expect ".*Enter Site Admin username: " { send "admin\r" }
          expect "admin\r\nEnter Site Admin password: " { send "{{ admin_password }}\r" }
          expect eof
        EOF
      register: result
      # retries: 15
      # delay: 60
      # until: result is not failed

- name: configure picasso DF users
  hosts: localhost
  tasks:

    - name: mapr password
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo get secret system -o yaml | grep MAPR_PASSWORD | head -1 | awk '{print $2}' | base64 --decode"
      register: mapr_password

    - name: maprlogin
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- bash -c 'echo {{ mapr_password.stdout }} | maprlogin password'"

    - name: add ad_admin1
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- maprcli acl edit -type cluster -user ad_admin1:fc"

    - name: add ad_user1
      shell: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo exec admincli-0 -- maprcli acl edit -type cluster -user ad_user1:login"
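The two ACL tasks above could also be collapsed into a single loop (a sketch; the `kubectl` var just factors out the repeated prefix):

```yaml
- name: grant cluster ACLs to AD users
  shell: "{{ kubectl }} exec admincli-0 -- maprcli acl edit -type cluster -user {{ item.user }}:{{ item.perm }}"
  vars:
    kubectl: "kubectl --kubeconfig {{ ansible_env.HOME }}/.kube/config -n dfdemo"
  loop:
    - { user: ad_admin1, perm: fc }
    - { user: ad_user1, perm: login }
```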

Provide usage instructions if no/wrong parameter provided

It would be good to print usage information when the user provides a wrong (or no) parameter, e.g. in 00-run_all.sh:

#!/usr/bin/env bash

set -euo pipefail

if [[ $# -ne 1 ]] || ! echo "aws azure kvm vmware" | grep -w -q "${1}"; then
   echo "Usage: ${0} aws|azure|kvm|vmware"
   exit 1
fi

...
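An alternative shape that keeps the check reusable across the other entry scripts (a sketch, not taken from the repo):

```shell
#!/usr/bin/env bash
set -euo pipefail

usage() {
  echo "Usage: ${0} aws|azure|kvm|vmware" >&2
  return 1
}

# Validate the first CLI argument; ${1:-} avoids an unbound-variable
# error under `set -u` when no argument is given at all.
validate_target() {
  case "${1:-}" in
    aws|azure|kvm|vmware) echo "${1}" ;;
    *) usage ;;
  esac
}
```

00-run_all.sh would then start with `target="$(validate_target "${1:-}")"`, and `set -e` aborts the script when validation fails.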

UI progress reporting and continuity

Instead of showing just the output of the run(s), we should have a nicer/simpler/friendlier component to report the progress.

This could be a progress bar configured with,

  • expected completion time vs processing time, or
  • total steps vs current step (steps are not good indicators on how long the run will take)
    And should be able to pick up where it left (ie, page restart).

Parallel execution for tasks

Some tasks can run in parallel, e.g. "add workers" and "configure DF node", or "create k8scluster" and "create DF".
Currently, Samba (running in a docker container within ad_server) is started as an async task and is never checked for completion/errors. It would be nice to have a method to submit tasks in the background, and to check/wait just before the job that depends on them (create tenant should check create cluster, install mapr should check AD integration, etc.).
Ansible has an async feature, and I guess it is the best option to implement this (we need to ensure we submit and check these jobs on the same hosts, etc.).
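With Ansible's async/poll mechanism, this could look roughly like the sketch below (task and register names are made up, and the submit and status-check tasks must target the same host):

```yaml
- name: create k8scluster (submitted in background)
  shell: ./create_k8scluster.sh   # hypothetical long-running command
  async: 3600   # give it up to an hour
  poll: 0       # fire and forget
  register: k8s_job

# ... independent tasks (e.g. create DF) run in the meantime ...

- name: wait for k8scluster before creating the tenant
  async_status:
    jid: "{{ k8s_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 120
  delay: 30
```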

ovirt Support

Should I open a new branch for ovirt and vmware, or just check the changes into the existing repo?
We can still adapt the UI for it later; what do you think?

provide password option for admin user

Could we provide an option for users to set the admin password? E.g.

{
  "aws_access_key": "",
  "aws_secret_key": "",
  "is_mlops": false,
  "is_df": false,
  "user": "",
  "project_id": "",
  "admin_password": "ChangeMe!!"
}
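If the key lands in config.json alongside the existing ones, the install scripts could pick it up like this (a sketch; python3 is used for the JSON parsing since it is already on the path for ansible, and the file name and fallback behavior are assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print admin_password from the given config file, or an empty string
# when the key is absent (the caller can then fall back to a default).
read_admin_password() {
  python3 - "${1}" <<'PY'
import json, sys

with open(sys.argv[1]) as f:
    cfg = json.load(f)
print(cfg.get("admin_password", ""))
PY
}
```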

SSL is incorrectly setup

Feedback from eng ...

Chris Snow, You did not install the environment correctly if you want to have the root CA guarantee TLS transactions....

See the install options:
           
--ssl-cert : Absolute path to the SSL certificate.       
--ssl-priv-key : Absolute path to the SSL certificate's private key.        
--ssl-ca-data : Absolute path to the SSL CA certificate data file path (optional).

We should be using ...

--ssl-cert=/etc/pki/tls/certs/cert.pem 
--ssl-priv-key=/etc/pki/tls/private/key.pem 
--ssl-ca-data=/etc/pki/tls/certs/minica.pem

# If you install ECP in this fashion, you will not see the insecure TLS connection.

See EZESC-1160 (internal Jira)

Create MLOPS SCC entry with ansible

To update the MLOPS SCC configuration:

POST /api/v2/k8scluster/{cluster_id}/kubectl

With payload data:

data = {
    "op": {kubectl_op}, // "create", "apply", "delete"
    "data": { 
      "apiVersion": "", 
      "kind": "", 
      "metadata": { 
        "namespace": "", 
        "name": "", 
        "labels":{ 
           "kubedirector.hpe.com/cmType": "source-control", 
           "createdByUser": "", 
           "createdByRole": "", 
           "parentConfiguration": "" 
         }
       }, 
      "data":{
         "sourceControlName": "",
         "type": "github | bitbucket", 
         "repoURL": "", 
         "authType": "token | password", 
         "branch": "", 
         "workingDirectory": "", 
         "proxyProtocol": "", 
         "proxyHostname": "", 
         "proxyPort": "", 
         "username": "", 
         "email": "", 
         "token": "", 
         "description": "" 
      }
   }
}

more information to follow on:

data.apiVersion
data.kind
data.metadata.labels.parentConfiguration (how do we retrieve this)?

E.g. Parent

{
  "method": "post",
  "apiurl": "https://127.0.0.1:8080",
  "timeout": 239,
  "data": {
    "kubectl_op": "create",
    "cluster_href": "/api/v2/k8scluster/1",
    "payload": {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "namespace": "k8s-tenant-1",
        "name": "abc",
        "labels": {
          "kubedirector.hpe.com/cmType": "source-control",
          "createdByUser": "6",
          "createdByRole": "Admin"
        }
      },
      "data": {
        "type": "github",
        "repoURL": "git@github.com:hpe-container-platform-community/example_active_directory_server.git",
        "authType": "token",
        "branch": "main",
        "workingDirectory": "",
        "proxyProtocol": "",
        "proxyHostname": "",
        "proxyPort": "",
        "description": ""
      }
    }
  },
  "op": "source_control_action"
}

Example child

{
  "method": "post",
  "apiurl": "https://127.0.0.1:8080",
  "timeout": 239,
  "data": {
    "kubectl_op": "create",
    "cluster_href": "/api/v2/k8scluster/1",
    "payload": {
      "apiVersion": "v1",
      "kind": "ConfigMap",
      "metadata": {
        "namespace": "k8s-tenant-1",
        "name": "mysccchild",
        "labels": {
          "kubedirector.hpe.com/cmType": "source-control",
          "createdByUser": "22",
          "createdByRole": "Member",
          "parentConfiguration": "myscc"
        }
      },
      "data": {
        "type": "github",
        "repoURL": "git@github.com:hpe-container-platform-community/example_active_directory_server.git",
        "authType": "token",
        "branch": "main",
        "workingDirectory": "",
        "proxyProtocol": "",
        "proxyHostname": "",
        "proxyPort": "",
        "username": "mygitusername",
        "email": "[email protected]",
        "token": "mygittoken",
        "description": ""
      }
    }
  },
  "op": "source_control_action"
}
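Scripting the call could look like the sketch below. Only the payload assembly executes; the request line is left as a comment because the session-header convention and login flow are assumptions about the platform's REST API, not something stated in this issue.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build the body for POST /api/v2/k8scluster/{cluster_id}/kubectl,
# mirroring the "parent" example above (values hard-coded for brevity).
build_scc_payload() {
  cat <<'EOF'
{
  "op": "create",
  "data": {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {
      "namespace": "k8s-tenant-1",
      "name": "myscc",
      "labels": {
        "kubedirector.hpe.com/cmType": "source-control",
        "createdByUser": "6",
        "createdByRole": "Admin"
      }
    },
    "data": {
      "type": "github",
      "authType": "token",
      "branch": "main"
    }
  }
}
EOF
}

# Assumed request shape (the X-BDS-SESSION header name is an assumption):
#   curl -k -X POST "https://127.0.0.1:8080/api/v2/k8scluster/1/kubectl" \
#        -H "X-BDS-SESSION: ${SESSION}" -H "Content-Type: application/json" \
#        -d "$(build_scc_payload)"
```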

lost internet connection during "drain gpu nodes" unable to restart job

ansible output ...

TASK [drain gpu nodes] *********************************************************
failed: [localhost] (item={'changed': True, 'stdout': 'ip-10-1-0-176.eu-west-1.compute.internal', 'stderr': '', 'rc': 0, 'cmd': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', 'start': '2022-02-09 10:03:24.206134', 'end': '2022-02-09 10:03:25.481530', 'delta': '0:00:01.275396', 'msg': '', 'invocation': {'module_args': {'_raw_params': 'kubectl get nodes -o json | jq -r \'.items[] | select( .status.addresses[].address == "10.1.0.176") | .metadata.name\'', '_uses_shell': True, 'warn': False, 'stdin_add_newline': True, 'strip_empty_ends': True, 'argv': None, 'chdir': None, 'executable': None, 'creates': None, 'removes': None, 'stdin': None}}, 'stdout_lines': ['ip-10-1-0-176.eu-west-1.compute.internal'], 'stderr_lines': [], 'failed': False, 'item': '10.1.0.176', 'ansible_loop_var': 'item'}) => {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl drain --ignore-daemonsets \"ip-10-1-0-176.eu-west-1.compute.internal\"", "delta": "0:00:01.718773", "end": "2022-02-09 10:03:27.463665", "item": {"ansible_loop_var": "item", "changed": true, "cmd": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "delta": "0:00:01.275396", "end": "2022-02-09 10:03:25.481530", "failed": false, "invocation": {"module_args": {"_raw_params": "kubectl get nodes -o json | jq -r '.items[] | select( .status.addresses[].address == \"10.1.0.176\") | .metadata.name'", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "stdin_add_newline": true, "strip_empty_ends": true, "warn": false}}, "item": "10.1.0.176", "msg": "", "rc": 0, "start": "2022-02-09 10:03:24.206134", "stderr": "", "stderr_lines": [], "stdout": "ip-10-1-0-176.eu-west-1.compute.internal", "stdout_lines": ["ip-10-1-0-176.eu-west-1.compute.internal"]}, "msg": "non-zero return code", "rc": 1, "start": "2022-02-09 
10:03:25.744892", "stderr": "error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...\n\nThere are pending nodes to be drained:\n ip-10-1-0-176.eu-west-1.compute.internal\nerror: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4", "stderr_lines": ["error: unable to drain node \"ip-10-1-0-176.eu-west-1.compute.internal\", aborting command...", "", "There are pending nodes to be drained:", " ip-10-1-0-176.eu-west-1.compute.internal", "error: cannot delete Pods with local storage (use --delete-emptydir-data to override): istio-system/grafana-784c89f4cf-rk6g4"], "stdout": "node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned", "stdout_lines": ["node/ip-10-1-0-176.eu-west-1.compute.internal already cordoned"]}

I'm wondering if it is possible to handle this issue?
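The drain itself fails because grafana uses emptyDir local storage, and kubectl's own error message names the missing flag. Adding it, plus a retry loop, would also make the task safe to re-run after a dropped connection. A sketch (the loop/register names are assumptions about the surrounding playbook):

```yaml
- name: drain gpu nodes
  shell: "kubectl drain --ignore-daemonsets --delete-emptydir-data {{ item.stdout }}"
  loop: "{{ gpu_node_names.results }}"   # hypothetical register from the lookup task
  register: drain_result
  until: drain_result is not failed
  retries: 5
  delay: 30
```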

Ability to share state between local machine and docker image

I'm wondering whether a simple fix for this requirement could be to mount two volumes and use two rsync processes.

One process would copy config files allowing users to provide only the files they need to change, e.g.

If, on the host (rsync source), you had only ./app/server/aws/config.json, rsync's default behavior would be to copy just the files present on the source without deleting the other files in the destination.

rsync could also be used to create a backup of the entire docker /app folder to the local machine?

Thoughts?

Some changes needed to 03-install.sh?

I'm not sure if these issues are because I corrupted erdincka/ezdemo:latest with the github action...

I had to create the group_vars folder:

[[ -d ./ansible/group_vars/ ]] || mkdir ./ansible/group_vars
echo "ansible_ssh_common_args: ${SSH_OPTS}" > ./ansible/group_vars/all.yml

gateway_pub_dns was throwing an error, and relative path for key was failing unless in the right directory

### TODO: Move to ansible task
SSH_CONFIG="
Host *
  StrictHostKeyChecking no
Host hpecp_gateway
  # Hostname ${gateway_pub_dns}
  Hostname ${GATW_PUB_DNS[0]}
  # IdentityFile generated/controller.prv_key
  IdentityFile /app/server/generated/controller.prv_key
...

I had to create the ~/.ssh folder:

mkdir -p ~/.ssh && chmod 700 ~/.ssh
echo "${SSH_CONFIG}" > ~/.ssh/ssh_config ## TODO: move to ansible, delete on destroy

For some reason, the ssh client wasn't using ~/.ssh/ssh_config by default; the per-user file ssh actually reads is ~/.ssh/config (ssh_config is the system-wide file under /etc/ssh), so I had to use ~/.ssh/config instead:

mkdir -p ~/.ssh && chmod 700 ~/.ssh
echo "${SSH_CONFIG}" > ~/.ssh/config ## TODO: move to ansible, delete on destroy

Do these changes make sense? Shall I add them?

Enable HA within Ansible runs

Selecting High Availability option creates 2 gateways & 3 controllers, but installation only configures 2 gateways. Need to update the process within Ansible to configure & add these 2 additional controllers into the cluster (as EPIC workers) and then enable HA.
No Cluster IP or floating VIP required in this setup.

setup gitea ... /root/.kube/config: no such file or directory

TASK [setup gitea] *************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "./setup_gitea.sh 'kubectl --kubeconfig /root/.kube/config -n k8s-tenant-1'", "delta": "0:00:00.183659", "end": "2022-01-23 18:24:44.196309", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2022-01-23 18:24:44.012650", "stderr": "error: stat /root/.kube/config: no such file or directory", "stderr_lines": ["error: stat /root/.kube/config: no such file or directory"], "stdout": "", "stdout_lines": []}

I think the /root/.kube directory needs to be created, e.g.

ansible/refresh.yml ...

...
  - name: update kubeadmin config
    shell: |-
      [[ -d ~/.kube ]] || mkdir ~/.kube
      while : ; do
        hpecp k8scluster admin_kube_config {{ item }} > ~/.kube/config
        [ $(wc -l ~/.kube/config | cut -d' ' -f1) -lt 5 ] || break
        sleep 10
      done
    with_items: "{{ cluster_ids }}"

Lookup AWS instance types

With terraform it is possible to lookup instance types. This will allow deployment to continue (possibly at a higher cost) if the preferred instance type is not supported in the selected region and availability zone. E.g.

data "aws_ec2_instance_type_offering" "example" {
  filter {
    name   = "instance-type"
    values = ["t2.micro", "t3.micro"]
  }

  preferred_instance_types = ["t3.micro", "t2.micro"]
}
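The lookup result can then feed the instance resource directly; the data source exports the first available preferred type via its `instance_type` attribute (a sketch; resource and variable names are illustrative):

```hcl
resource "aws_instance" "controller" {
  # resolves to "t3.micro" where offered, otherwise "t2.micro"
  instance_type = data.aws_ec2_instance_type_offering.example.instance_type
  ami           = var.ami_id # illustrative variable
}
```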

Source:

Also, aws_ec2_instance_types. E.g.

data "aws_ec2_instance_types" "test" {
  filter {
    name   = "auto-recovery-supported"
    values = ["true"]
  }

  filter {
    name   = "network-info.encryption-in-transit-supported"
    values = ["true"]
  }

  filter {
    name   = "instance-storage-supported"
    values = ["true"]
  }

  filter {
    name   = "instance-type"
    values = ["g5.2xlarge", "g5.4xlarge"]
  }
}

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/data-sources/ec2_instance_types
