
chaos-charts's Introduction

Chaos-Charts

Slack Channel GitHub Workflow Docker Pulls GitHub issues Twitter Follow YouTube Channel

This repository hosts the Litmus Chaos Charts. A set of related chaos faults is bundled into a Chaos Chart. Chaos Charts are classified into the following categories.

Kubernetes Chaos

Chaos faults that apply to Kubernetes resources are classified in this category. The following chaos faults are supported for Kubernetes:

| Fault Name | Description | Link |
| --- | --- | --- |
| Container Kill | Kill one container in the application pod | container-kill |
| Disk Fill | Fill the ephemeral storage of the pod | disk-fill |
| Docker Service Kill | Kill the Docker service on the target node | docker-service-kill |
| Kubelet Service Kill | Kill the kubelet service on the target node | kubelet-service-kill |
| Node CPU Hog | Stress the CPU of the target node | node-cpu-hog |
| Node Drain | Drain the target node | node-drain |
| Node IO Stress | Stress the IO of the target node | node-io-stress |
| Node Memory Hog | Stress the memory of the target node | node-memory-hog |
| Node Restart | Restart the target node | node-restart |
| Node Taint | Taint the target node | node-taint |
| Pod Autoscaler | Scale the replicas of the target application | pod-autoscaler |
| Pod CPU Hog | Stress the CPU of the target pod | pod-cpu-hog |
| Pod Delete | Delete the target pods | pod-delete |
| Pod DNS Spoof | Spoof DNS requests to desired target hostnames | pod-dns-spoof |
| Pod DNS Error | Fail DNS requests of the target pod | pod-dns-error |
| Pod IO Stress | Stress the IO of the target pod | pod-io-stress |
| Pod Memory Hog | Stress the memory of the target pod | pod-memory-hog |
| Pod Network Latency | Induce network latency in the target pod | pod-network-latency |
| Pod Network Corruption | Induce network packet corruption in the target pod | pod-network-corruption |
| Pod Network Duplication | Induce network packet duplication in the target pod | pod-network-duplication |
| Pod Network Loss | Induce network loss in the target pod | pod-network-loss |
| Pod Network Partition | Disrupt network connectivity to Kubernetes pods | pod-network-partition |
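
As a quick illustration of how these faults are consumed, below is a minimal ChaosEngine sketch that runs the pod-delete fault against an nginx deployment. It assumes the pod-delete ChaosExperiment and a pod-delete-sa service account with suitable RBAC already exist in the target namespace; all names are illustrative.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: nginx
spec:
  engineState: 'active'
  appinfo:
    appns: 'nginx'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # total chaos duration (in sec)
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # interval (in sec) between successive pod deletions
            - name: CHAOS_INTERVAL
              value: '10'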

Application Chaos

While the chaos faults under the Kubernetes category offer the ability to induce chaos into Kubernetes resources, it is difficult to analyze and conclude whether the induced chaos exposed a weakness in a given application. The application-specific chaos faults therefore include checks on pre-conditions and on expected outcomes after the chaos injection. The result of a chaos fault is determined by matching the observed outcome against the expected outcome.

| Fault Category | Description | Link |
| --- | --- | --- |
| Spring Boot Faults | Injects faults in Spring Boot applications | Spring Boot Faults |
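
The pre-condition and outcome checks mentioned above are usually expressed as probes on the ChaosEngine. The fragment below is a minimal sketch of an httpProbe that asserts an application endpoint keeps returning 200 throughout the chaos; the URL and timing values are illustrative, and exact probe field names can differ slightly between Litmus releases.

experiments:
  - name: pod-delete
    spec:
      probe:
        # assert the app endpoint stays reachable for the whole chaos window
        - name: check-frontend-url
          type: httpProbe
          httpProbe/inputs:
            url: http://front-end.sock-shop.svc.cluster.local:80
            method:
              get:
                criteria: ==          # compare the response code
                responseCode: "200"
          mode: Continuous
          runProperties:
            probeTimeout: 5
            interval: 2
            retry: 1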

Platform Chaos

Chaos faults that inject chaos into platform and infrastructure resources are classified into this category. Since the management of platform resources varies significantly across providers, Chaos Charts may be maintained separately for each platform (for example: AWS, GCP, Azure, VMware, etc.).

The following chaos faults are classified in this category:

| Fault Category | Description | Link |
| --- | --- | --- |
| AWS Faults | AWS platform-specific chaos | AWS Faults |
| Azure Faults | Azure platform-specific chaos | Azure Faults |
| GCP Faults | GCP platform-specific chaos | GCP Faults |
| VMWare Faults | VMware platform-specific chaos | VMWare Faults |

Installation Steps for Chart Releases

Note: Supported from release 3.0.0

  • To install the chaos faults from a specific chart for a given release, execute the following commands with the desired <release_version>, <chart_name>, and <namespace>:
## downloads and unzips the released source
tar -zxvf <(curl -sL https://github.com/litmuschaos/chaos-charts/archive/<release_version>.tar.gz)

## installs the chaosexperiment resources
find chaos-charts-<release_version> -name experiments.yaml | grep <chart_name> | xargs kubectl apply -n <namespace> -f
  • For example, to install the Kubernetes fault chart bundle for release 3.0.0, in the sock-shop namespace, run:
tar -zxvf <(curl -sL https://github.com/litmuschaos/chaos-charts/archive/3.0.0.tar.gz)
find chaos-charts-3.0.0 -name experiments.yaml | grep kubernetes | xargs kubectl apply -n sock-shop -f
  • If you would like to install a specific fault, replace experiments.yaml in the above command with fault.yaml and grep for the fault's relative path within the parent chart. For example, to install only the pod-delete fault, run:
find chaos-charts-3.0.0 -name fault.yaml | grep 'kubernetes/pod-delete' | xargs kubectl apply -n sock-shop -f
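
To confirm the installation, you can list the ChaosExperiment custom resources that were created (a quick sanity check; use whichever namespace you installed into):

## lists the installed chaosexperiment resources in the target namespace
kubectl get chaosexperiments -n sock-shop

## shows the definition of a single fault, e.g. pod-delete
kubectl describe chaosexperiment pod-delete -n sock-shop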

License

FOSSA Status

chaos-charts's People

Contributors

aditya109, ajeshbaby, amitbhatt818, amityt, avaakash, dharmaanu, fossabot, gdsoumya, ghoshankur1983, gprasath, iassurewipro, imrajdas, ishangupta-ds, ispeakc0de, jonsy13, litmusbot, namkyu1999, navinjoy, neelanjan00, oumkale, rahulchheda, sanjay1611, saranya-jena, shreyangi, sumitnagal, torumakabe, uditgaurav, umamukkara, w3aman, williamhyzhang

chaos-charts's Issues

Can't pull image "litmuschaos/ansible-runner:1.13.0" for litmus test-case coreDNS-pod-delete

We can't find the image litmuschaos/ansible-runner:1.13.0 on Docker Hub, due to which we are getting:
Back-off pulling image "litmuschaos/ansible-runner:1.13.0"
Error: ImagePullBackOff

docker pull litmuschaos/ansible-runner:1.13.0
Error response from daemon: manifest for litmuschaos/ansible-runner:1.13.0 not found: manifest unknown: manifest unknown

The image is used in:

image: "litmuschaos/ansible-runner:1.13.0"

(feat): Add permission field for experiments

Feature request

Why we need this

Add a permissions field to the experiment spec. This field will contain the apiGroups, resources, and verbs that define the permissions required to run the experiment.

The experiment spec will look like:

  permissions:
    apiGroups:
      - ""
      - "extensions"
      - "apps"
      - "batch"
      - "litmuschaos.io"
    resources:
      - "daemonsets"
      - "deployments"
      - "replicasets"
      - "jobs"
      - "pods"
      - "pods/exec"
      - "events"
      - "chaosengines"
      - "chaosexperiments"
      - "chaosresults"
    verbs:
      - "*"

Add non-root images for all k8s executors for chaos workflows.

Currently, all the images require root access to run k8s commands during workflow execution steps; using an Alpine-based variant of the image can help resolve the issue.

Replace: lachlanevenson/k8s-kubectl with alpine/k8s:1.18.2

The following block is part of the description obtained by running kubectl describe pod argowf-chaos-pod-delete-1613893961-2346069701 -n litmus:

main:
    Container ID:  docker://363021a6285bca9e528a264dcb841bb6e412d425c8b726d9508bdc571b8a3c71
    Image:         lachlanevenson/k8s-kubectl
    Image ID:      docker-pullable://lachlanevenson/k8s-kubectl@sha256:5ef0345fc57be2dd836b1ac2822b4ca56ad4f0dbe92b4847574bd1bffcac08a5
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      kubectl apply -f /tmp/pod-delete.yaml -n litmus
    State:          Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:345: starting container process caused "chdir to cwd (\"/root\") set in config.json failed: permission denied": unknown
      Exit Code:    126
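
For reference, a sketch of what the non-root variant of this step could look like in the Argo Workflow template, using the suggested alpine/k8s image plus a writable working directory so the chdir-to-/root failure above is avoided (the UID and image tag are assumptions, not a verified fix):

  - name: run-chaos
    container:
      image: alpine/k8s:1.18.2          # suggested replacement image
      command: ["sh", "-c"]
      args: ["kubectl apply -f /tmp/pod-delete.yaml -n litmus"]
      workingDir: /tmp                  # avoid chdir to /root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000                 # any non-root UID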

Reporting is done using the manage.py file, which is too powerful for the reporting piece

Hi Team,
Reporting is done using manage.py, which is a wrapper over the whole set of experiments. I don't want to use it just to get the experiment report; simply calling a process should be fine. A user with manage.py can do anything as an owner of an experiment, whereas if I only want reporting for an experiment, I should be able to do that without using the whole of manage.py.

Reporting currently captures the describe details of experiments and puts them into a tabular picture format.
It should also include other details like error logs, experiment status, and deviation.

Example for running custom workflows

I am looking for examples like running k6 in the workflow of pod-delete, but I am finding it very hard to add them. It would be nice if we could get examples of such workflows in the ChaosHub or documentation. Can someone please share one if you use them?

ChaosEngine not received in correct format when retrieving it via graphql query getYamlData

From the LitmusChaos hub we are receiving the ChaosEngine in the format below when requesting it via the GraphQL query "getYamlData". I guess it retrieves it from https://hub.litmuschaos.io/api/chaos/2.14.0?file=charts/generic/pod-delete/engine.yaml

metadata:
  name: nginx-chaos
  namespace: default

But when using this format, we are not able to create a workflow; we get the error below:
empty chaos experiment name

From the commit https://github.com/litmuschaos/litmus/commit/f416059afa7d845c8b8438cbfb6e8b4691dcec4f#diff-ddef3ecf514a4d1fe675145c[…]4ba8f1e07fc2e16b3e24be17340fc8

we can see that the name was changed to "GenerateName". Could you please fix this in ChaosHub?

Pod DNS error and Pod DNS spoof litmus tests validations and TOTAL_CHAOS_DURATION issue

For the Pod DNS Error Litmus experiment, we followed the steps (https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-dns-error/#ramp-time) to generate chaos for the target hostname (nginx), and we also ran a shell script to validate whether the chaos was injected. But we were not able to identify the chaos injection for the application pods, since DNS for the hostname was still working during the entire duration of chaos.

We also wanted to debug this further by increasing TOTAL_CHAOS_DURATION to a higher value (like 300 seconds), but even after increasing the chaos duration, the chaos experiment completes within 30-40 seconds. Can you please confirm if there is any other configuration we can use to increase the chaos duration, or how we can validate the chaos experiment? We also noticed similar behavior for the Pod DNS Spoof experiment.

Ability to choose docker repository on example workflows

At the moment, when trying the example workflows, there is no way of defining the Docker registry that the applications and other images come from; with the Docker pull rate limit, this can cause the experiments to fail due to image pulls. Please provide a way to choose a custom repository for all images, not just the standard Litmus images.

It would also be helpful to know where the Docker images come from before running the experiments, to check that the cluster can reach all of the required images.

AWS Auth Failure

I'm trying to execute ec2-terminate-by-id experiment.

I have followed all the steps in the document and executed the experiment.

I got the following error after execution:
time="2021-06-23T02:15:12Z" level=error msg="failed to get the ec2 instance status, err: AuthFailure: AWS was not able to validate the provided access credentials\n\tstatus code: 401, request id: 92cc03a9-7fc3-4011-a511-844a10f66a7b"

I have verified that /tmp/cloud_config file is present and have valid access credentials.

I would like to know what else I can look at to resolve this issue.

Thank you

Node-restart-helper pod failed!

I get the following error inside the node-restart-helper pod:

Warning: Permanently added '10.0.80.173' (ED25519) to the list of known hosts.
Load key "/data/ssh-privatekey": invalid format
Permission denied, please try again.
Permission denied, please try again.
Received disconnect from 10.0.80.173 port 22:2: Too many authentication failures
Disconnected from 10.0.80.173 port 22

I created the id-rsa secret from the SSH key which I use to SSH to the server, but I am getting this error. Please help.

err: Unable to Get the chaosengine

My environment is pretty simple. I have an Azure k8s cluster and followed this guide https://istio.io/latest/docs/setup/getting-started/ to set up my environment. In other words, the namespace has Istio injection enabled.

Litmus is set up, and I was trying to run the pod-delete experiment. The chaos runner pod is created, but I saw errors. The "pod-delete" pod is also created, but the target pod was not deleted. When looking at the logs with k logs -f pod-delete-4y48y4-ql64c -c pod-delete-4y48y4, I saw this error:

Unable to initialise probes details from chaosengine, err: Unable to Get the chaosengine, err: Get \"https://10.0.0.1:443/apis/litmuschaos.io/v1alpha1/namespaces/default/chaosengines/bookinfo-chaos\": dial tcp 10.0.0.1:443: connect: connection refused

This is my engine YAML: https://drive.google.com/file/d/1HAaMLamHS3BZP6SDNtD_-YO1pdl46vnh/view?usp=sharing

Litmus version is 1.9.0

Tested AKS

I tested some generic experiments on Azure Kubernetes Service. They worked well. Are there any processes or guidelines for adding AKS to the tested platforms? Thanks in advance.

Installation process for experiments

Hello,
how do you feel about making the installation process for the experiments in this repo more Helm-native?
Like helm install --version chart-name, etc.?

I would like to install the charts using a declarative approach (the helmfile tool), but it feels strange to use a handful of bash commands without actually using Helm :)

Cannot start the experiment disk-fill

Cannot start the experiment disk-fill. Would you please help have a look at the issue?

time="2022-12-23T03:44:48Z" level=info msg="Helper Name: disk-fill"
time="2022-12-23T03:44:48Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-12-23T03:44:48Z" level=info msg="container ID of nginx container, containerID: e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70"
time="2022-12-23T03:44:48Z" level=error msg="du: /diskfill/e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70: No such file or directory\n"
time="2022-12-23T03:44:48Z" level=fatal msg="helper pod failed, err: exit status 1"```

DNS SPOOF Error helper pods are failing

We are running the pod-dns-spoof Litmus experiment with a cron-style ChaosSchedule as follows:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: schedule-pod-dns-spoof
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval:
          # schedule the chaos at every 1 minutes
          minute:
            everyNthMinute: 1
  engineTemplateSpec:
    engineState: 'active'
    annotationCheck: 'false'
    components:
      runner:
        # resource requirements for the runner pod
        resources:
          requests:
            cpu: "50m"
            memory: "64Mi"
          limits:
            cpu: "100m"
            memory: "128Mi"
    appinfo:
      appns: "default"
      applabel: "app=productpage"
      appkind: "deployment"
    chaosServiceAccount: pod-dns-spoof-sa
    jobCleanUpPolicy: 'delete'
    experiments:
      - name: pod-dns-spoof
        spec:
          components:
            env:
              # map of host names
              - name: SPOOF_MAP
                value: '{"reviews":"spoofabc.com"}'
              - name: TOTAL_CHAOS_DURATION
                value: '60'
              - name: TARGET_CONTAINER
                value: 'productpage'
              - name: 'PODS_AFFECTED_PERC'
                value: '100'

The helper pods are going into the error state; we noticed the following logs for the helper pods:

2022-08-11 19:15:59.481 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Helper Name: dns-chaos"
2022-08-11 19:15:59.482 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="[PreReq]: Getting the ENV variables"
2022-08-11 19:15:59.584 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Container ID: 1c53401df500"
2022-08-11 19:15:59.681 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="[Info]: Container ID=1c53401df500 has process PID=12290"
2022-08-11 19:15:59.681 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="/bin/bash -c sudo TARGET_PID=12290 CHAOS_TYPE=spoof SPOOF_MAP='{\"reviews\":\"spoofabc.com\"}' TARGET_HOSTNAMES='' CHAOS_DURATION=60 MATCH_SCHEME=exact nsutil -p -n -t 12290 -- dns_interceptor"
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="DNS Interceptor Port" port=53
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Upstream DNS Server" server="10.80.0.10:53"
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Chaos Error Targets" targets="[]"
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Chaos Spoof Map" spoof_map="map[reviews:spoofabc.com]"
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Chaos type" chaos-type=spoof
2022-08-11 19:15:59.747 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Target match scheme" match-scheme=exact
2022-08-11 19:15:59.812 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=info msg="Error String: The connection to the server 10.80.0.1:443 was refused - did you specify the right host or port?\n"
2022-08-11 19:15:59.813 IST
pod-dns-spoof
time="2022-08-11T13:45:59Z" level=fatal msg="helper pod failed, err: unable to annotate the schedule-pod-dns-spoof-1660225507-pod-dns-spoof chaosresult, err: exit status 1"
2022-08-11 19:16:01.317 IST
pod-dns-spoof-cpg9ce
time="2022-08-11T13:46:01Z" level=info msg="pod-dns-spoof-helper-vidwtf helper pod is in Running state"
2022-08-11 19:16:03.327 IST
pod-dns-spoof-cpg9ce
time="2022-08-11T13:46:03Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-dns-spoof-helper-vidwtf Status=Running
2022-08-11 19:16:05.397 IST
istio-proxy
2022-08-11T13:46:05.396337Z warning envoy config StreamSecrets gRPC config stream closed: 13,
2022-08-11 19:16:05.397 IST
istio-proxy
2022-08-11T13:46:05.396398Z warning envoy config StreamSecrets gRPC config stream closed: 13,
2022-08-11 19:16:05.397 IST
istio-proxy
2022-08-11T13:46:05.396441Z warning envoy config StreamAggregatedResources gRPC config stream closed: 13,

After the first chaos schedule is started, we see that the chaos is injected; however, it does not target the specific hostname mentioned in the SPOOF_MAP and instead affects all hostnames for the target application pod (app=productpage), so the target application pod is not able to call any other hostname. We had the following questions:

  1. How can we target a specific hostname?
  2. What is causing the helper pods to fail?
  3. Even after stopping the chaos, the application pods do not recover and are unable to reach any hostname. We need to manually restart the application for recovery; how can we ensure the application pods are automatically recovered after the chaos stops?

Disk Fill experiment failed

Hi experts,
I hit this error when I tried to run the Disk Fill experiment. Could you help have a look? Thanks.

time="2022-12-30T09:15:59Z" level=info msg="[Fill]: Filling ephemeral storage, size: 134218KB"
time="2022-12-30T09:15:59Z" level=info msg="dd: {sudo dd if=/dev/urandom of=/proc/8706/root/home/diskfill bs=256K count=524}"
time="2022-12-30T09:16:01Z" level=fatal msg="helper pod failed, err: could not fill ephemeral storage\n --- at /litmus-go/chaoslib/litmus/disk-fill/helper/disk-fill.go:140 (diskFill) ---\nCaused by: {\"source\":\"disk-fill-helper-z6x26\",\"errorCode\":\"CHAOS_INJECT_ERROR\",\"reason\":\"524+0 records in\\n524+0 records out\\n\",\"target\":\"{podName: nginx-75d8d4bf8-wsndb, namespace: litmus, container: nginx}\"}"

Best Regards,
Lin

Disk Fill experiment failed with error "du: /diskfill/{containerId} no such file or directory"

Hi expert,
The Disk Fill experiment failed when I tried to run it on a cluster that uses the containerd container runtime.

time="2022-12-23T03:44:48Z" level=info msg="Helper Name: disk-fill"
time="2022-12-23T03:44:48Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-12-23T03:44:48Z" level=info msg="container ID of nginx container, containerID: e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70"
time="2022-12-23T03:44:48Z" level=error msg="du: /diskfill/e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70: No such file or directory\n"
time="2022-12-23T03:44:48Z" level=fatal msg="helper pod failed, err: exit status 1"```

I have found some information in litmus\mkdocs\docs\experiments\troubleshooting\experiments.md, so what is the correct CONTAINER_PATH env I should provide? Could you help have a look? Thanks.
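
For what it's worth, on containerd clusters the pod-level experiments generally need the container runtime and socket path overridden via the ChaosEngine env; the sketch below shows the commonly used variables (the values are assumptions, check the defaults for your node OS and Litmus release):

experiments:
  - name: disk-fill
    spec:
      components:
        env:
          # tell the helper which container runtime to talk to
          - name: CONTAINER_RUNTIME
            value: 'containerd'
          # containerd socket on the node (the path may differ per distro)
          - name: SOCKET_PATH
            value: '/run/containerd/containerd.sock'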

Best Regards,
Lin

Spring-boot experiments failing with same error

Hi team,

I'm able to work through Postman locally and get the Chaos Monkey URLs working, but when I deploy the code and run any Spring Boot experiment, the experiment fails. Using curl, I'm able to hit the endpoints in the deployment as well. I've tried running in serial and parallel, and regardless of which is used, I get the following error:

CHAOS_INJECT_ERROR: failed to call the chaos monkey api to start assault Post 'http://{cluster_ip}:{app_port}/actuator/chaosmonkey/assaults/runtime/attack': EOF, target: {podName: {correct pod name}, namespace: {correct_ns}}

appinfo is not passed while running the pod-delete experiment

I am using Chaos Chart version 3.1.0 on AKS. The installation and setup verifications are fine, as per the document below:
Getting Started with Litmus

But when I am running the following experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: nginx
spec:
  appinfo:
    appns: 'nginx'
    applabel: 'app=nginx'
    appkind: 'deployment'
  # It can be true/false
  #annotationCheck: 'false'
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  chaosServiceAccount: pod-delete-sa
  # It can be delete/retain
  jobCleanUpPolicy: 'retain'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'

            # set chaos interval (in sec) as desired
            - name: CHAOS_INTERVAL
              value: '10'
              
            # pod failures without '--force' & default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'

             ## percentage of total pods to target
            - name: PODS_AFFECTED_PERC
              value: ''

The respective RBAC:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: nginx
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: nginx
  labels:
    name: pod-delete-sa
rules:
- apiGroups: [""]
  resources: ["pods","events"]
  verbs: ["create","list","get","patch","update","delete","deletecollection"]
- apiGroups: [""]
  resources: ["pods/exec","pods/log","replicationcontrollers"]
  verbs: ["create","list","get"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create","list","get","delete","deletecollection"]
- apiGroups: ["apps"]
  resources: ["deployments","statefulsets","daemonsets","replicasets"]
  verbs: ["list","get"]
- apiGroups: ["apps.openshift.io"]
  resources: ["deploymentconfigs"]
  verbs: ["list","get"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["list","get"]
- apiGroups: ["litmuschaos.io"]
  resources: ["chaosengines","chaosexperiments","chaosresults"]
  verbs: ["create","list","get","patch","update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: nginx
  labels:
    name: pod-delete-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
- kind: ServiceAccount
  name: pod-delete-sa
  namespace: nginx

The test always fails with the error:
Fail Step: [chaos]: Failed inside the chaoslib, err: please provide one of the appLabel or TARGET_PODS
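
One possible way to unblock (not a root-cause fix) is to name the pods explicitly via the TARGET_PODS env, which the error message mentions as the alternative to appLabel; the pod name below is illustrative:

experiments:
  - name: pod-delete
    spec:
      components:
        env:
          # comma-separated list of pod names to target instead of appLabel
          - name: TARGET_PODS
            value: 'nginx-xxxxxxxxx-xxxxx'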

GCP vm disk loss requires server in terminated state

While trying out the GCP VM disk loss experiment (https://litmuschaos.github.io/litmus/experiments/categories/gcp/gcp-vm-disk-loss/), it requires the server to be in a terminated state for the test to complete. I get the following error while executing this test:

time="2022-08-01T09:24:38Z" level=info msg="[Chaos]: Detaching the disk volume from the instance"
time="2022-08-01T09:24:38Z" level=error msg="Chaos injection failed, err: disk detachment failed, err: googleapi: Error 400: Invalid resource usage: 'To detach the boot disk, the instance must be in TERMINATED state.'., invalidResourceUsage"

Is there a way to force-detach the disk from the server while the server is still in the running state? We want to test this experiment on a GKE node, and we can't stop a node of the GKE node pool to execute this experiment.

Security vulnerabilities found on alexeiled/stress-ng:latest-ubuntu

Details:

      ## It is used in pumba lib only    
      - name: STRESS_IMAGE
        value: 'alexeiled/stress-ng:latest-ubuntu'
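
Until the default image is bumped, it can be overridden per engine; a sketch, assuming a patched stress-ng image mirrored to your own registry (the registry and tag below are hypothetical):

experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          # point the pumba lib at a locally scanned stress-ng image
          - name: STRESS_IMAGE
            value: 'registry.example.com/stress-ng:patched'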

Cannot start the pod-cpu-hog experiment.

Cannot start the pod-cpu-hog experiment. Would you please help have a look at the issue? The Litmus version is 2.14.0.

C41K60Q0XJ:nginx$ k get pod -n litmus
NAME                                        READY   STATUS      RESTARTS   AGE
chaos-exporter-d767fcf5-hsbgq               1/1     Running     0          18h
chaos-operator-ce-7cf6cc79b4-xwh99          1/1     Running     0          18h
chaos-stress-cpu-1671764370-276557960       0/2     Completed   0          9m55s
chaos-stress-cpu-1671764370-670059491       0/2     Completed   0          10m
event-tracker-5ddb594676-9dhtj              1/1     Running     0          18h
litmusportal-auth-server-57899b796d-2tvvq   1/1     Running     0          18h
litmusportal-frontend-857dc974b7-n4jv4      1/1     Running     0          18h
litmusportal-server-59f79479db-92qmb        1/1     Running     0          78m
mongo-0                                     1/1     Running     0          18h
pod-cpu-hog-0hytghxz-runner                 0/1     Completed   0          9m51s
pod-cpu-hog-helper-xhsnlr                   0/1     Error       0          9m26s
pod-cpu-hog-mfbudy-skzzc                    0/1     Completed   0          9m46s
subscriber-5cbcb4df94-22cg7                 1/1     Running     0          18h
workflow-controller-8c548f686-zmk9c         1/1     Running     0          18h
C41K60Q0XJ:nginx I068888$ k logs pod-cpu-hog-helper-xhsnlr -n litmus
time="2022-12-23T03:01:21Z" level=info msg="Helper Name: stress-chaos"
time="2022-12-23T03:01:21Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-12-23T03:01:21Z" level=info msg="container ID of nginx container, containerID: e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70"
time="2022-12-23T03:01:21Z" level=error msg="[docker]: **Failed to run docker inspect: []\nError: No such object: e15be7fdcc7442be032e1caf56d1e5bb41852bb07e395f2aaf88e5dd2588ff70\n"**
time="2022-12-23T03:01:21Z" level=fatal msg="helper pod failed, err: exit status 1"
C41K60Q0XJ:nginx I068888$ k logs pod-cpu-hog-0hytghxz-runner -n litmus
W1223 03:00:45.714442       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2022-12-23T03:00:45Z" level=info msg="Experiments details are as follows" Experiments List="[pod-cpu-hog]" Engine Name=pod-cpu-hog-0hytghxz appLabels="app=nginx" appNs=nginx appKind=deployment Service Account Name=litmus-admin Engine Namespace=litmus
time="2022-12-23T03:00:45Z" level=info msg="Getting the ENV Variables"
time="2022-12-23T03:00:45Z" level=info msg="Preparing to run Chaos Experiment: pod-cpu-hog"
time="2022-12-23T03:00:46Z" level=info msg="Started Chaos Experiment Name: pod-cpu-hog, with Job Name: pod-cpu-hog-mfbudy"
time="2022-12-23T03:01:42Z" level=info msg="Chaos Pod Completed, Experiment Name: pod-cpu-hog, with Job Name: pod-cpu-hog-mfbudy"
time="2022-12-23T03:01:44Z" level=info msg="Chaos Engine has been updated with result, Experiment Name: pod-cpu-hog"
time="2022-12-23T03:01:44Z" level=info msg="[skip]: skipping the job deletion as jobCleanUpPolicy is set to {}"
C41K60Q0XJ:nginx I068888$ k logs pod-cpu-hog-mfbudy-skzzc -n litmus
time="2022-12-23T03:00:58Z" level=info msg="Experiment Name: pod-cpu-hog"
time="2022-12-23T03:00:58Z" level=info msg="[PreReq]: Getting the ENV for the pod-cpu-hog experiment"
time="2022-12-23T03:01:00Z" level=info msg="[PreReq]: Updating the chaos result of pod-cpu-hog experiment (SOT)"
time="2022-12-23T03:01:02Z" level=info msg="The application information is as follows" Namespace=nginx Label="app=nginx" App Kind=deployment
time="2022-12-23T03:01:02Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2022-12-23T03:01:02Z" level=info msg="[Status]: Checking whether application containers are in ready state"
time="2022-12-23T03:01:02Z" level=info msg="[Status]: The Container status are as follows" Pod=nginx-deployment1-5bcb875959-gnxnl Readiness=true container=nginx
time="2022-12-23T03:01:04Z" level=info msg="[Status]: Checking whether application pods are in running state"
time="2022-12-23T03:01:04Z" level=info msg="[Status]: The status of Pods are as follows" Status=Running Pod=nginx-deployment1-5bcb875959-gnxnl
time="2022-12-23T03:01:06Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel PodsAffectedPerc=0 CPU Core=1 CPU Load Percentage=100
time="2022-12-23T03:01:06Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2022-12-23T03:01:06Z" level=info msg="[Info]: Target pods list for chaos, [nginx-deployment1-5bcb875959-gnxnl]"
time="2022-12-23T03:01:06Z" level=info msg="[Info]: Details of application under chaos injection" PodName=nginx-deployment1-5bcb875959-gnxnl NodeName=ip-10-180-6-87.eu-central-1.compute.internal ContainerName=nginx
time="2022-12-23T03:01:06Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2022-12-23T03:01:24Z" level=info msg="pod-cpu-hog-helper-xhsnlr helper pod is in Running state"
time="2022-12-23T03:01:26Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2022-12-23T03:01:26Z" level=info msg="helper pod status: Running"
time="2022-12-23T03:01:26Z" level=info msg="[Status]: The running status of Pods are as follows" Pod=pod-cpu-hog-helper-xhsnlr Status=Failed
time="2022-12-23T03:01:27Z" level=error msg="[Error]: CPU hog failed, err: helper pod failed"
