
Extension for injecting faults into k6 tests

Home Page: https://k6.io/docs/javascript-api/xk6-disruptor/

License: GNU Affero General Public License v3.0


xk6-disruptor
"Like Unit Testing, but for Reliability"

xk6-disruptor is an extension that adds fault injection capabilities to Grafana k6. It implements ideas from the Chaos Engineering discipline and enables Grafana k6 users to test their system's reliability under turbulent conditions.

⚠️ Important: xk6-disruptor is in the alpha stage and undergoing active development. We do not guarantee API compatibility between releases; your k6 scripts may need to be updated on each release until this extension reaches v1.0.

Why xk6-disruptor?

xk6-disruptor is purpose-built to provide the best experience for developers trying to make their systems more reliable:

  • Everything as code.

    • No need to learn a new DSL.
    • Developers can use their usual development IDE.
    • Facilitates test reuse and sharing.
  • Fast to adopt with no day-two surprises.

    • No need to deploy and maintain a fleet of agents or operators.
  • Easy to extend and integrate with other types of tests.

    • No need to try to glue multiple tools together to get the job done.

Also, this project has been built to be a good citizen in the Grafana k6 ecosystem by:

  • Working well with other extensions.
  • Working well with k6's core concepts and features.

You can see this in the following example:

import { PodDisruptor } from 'k6/x/disruptor';

export default function () {
    // Create a new pod disruptor with a selector
    // that matches pods from the "default" namespace with the label "app=my-app"
    const disruptor = new PodDisruptor({
        namespace: "default",
        select: { labels: { app: "my-app" } },
    });

    // Disrupt the targets by injecting HTTP faults into them for 30 seconds
    const fault = {
        averageDelay: 500,
        errorRate: 0.1,
        errorCode: 500,
    };
    disruptor.injectHTTPFaults(fault, "30s");
}

Features

At this time, the project is intended to test systems running in Kubernetes; other platforms are not supported.

It offers an API for creating disruptors that target one specific type of component (e.g., Pods) and are capable of injecting different kinds of faults, such as errors in HTTP requests served by that component. Currently, disruptors exist for Pods and Services; others will be introduced in the future, as well as additional types of faults for the existing disruptors.

Use cases

The main use case for xk6-disruptor is to test the resiliency of an application to diverse types of disruptions by reproducing their effects without reproducing their root causes. For example, it can inject delays in the HTTP requests an application makes to a service without having to stress or interfere with the infrastructure (network, nodes) on which the service runs, or affect other workloads in unexpected ways.

In this way, xk6-disruptor makes reliability tests repeatable and predictable while limiting their blast radius. These are essential characteristics for incorporating such tests into the test suites of applications deployed on shared infrastructures such as staging environments.

Learn more

Check the get started guide for instructions on how to install and use xk6-disruptor.

The examples section in the documentation presents examples of using xk6-disruptor for injecting faults in different scenarios.

If you encounter any bugs or unexpected behavior, please search the currently open GitHub issues first, and create a new one if it doesn't exist yet.

The Roadmap presents the project's goals for the coming months regarding new functionalities and enhancements.

If you are interested in contributing to the development of this project, check the contributing guide.

xk6-disruptor's People

Contributors

actions-user, codebien, dgzlopes, mcandeia, nickandreev, pablochacin, ppcano, renovate[bot], roobre


xk6-disruptor's Issues

Add support to inject faults in more protocols (e.g. Redis)

Presently the disruptor only supports fault injection for the HTTP protocol. However, many microservice applications are adopting gRPC.

Additionally, the ability to inject faults in database connections (e.g., Redis, MySQL) is relevant for many use cases.

Failed to Inject HTTP fault: Command terminated with exit code 1

I'm not able to inject HTTP faults using the PodDisruptor. I get the following error:

INFO[0006] target: ["querier-54f7cf5487-tpm5x"]          source=console
ERRO[0007] GoError: command terminated with exit code 1
        at reflect.methodValueCall (native)
        at file:///Users/dgzlopes/go/src/github.com/grafana/xk6-disruptor/examples/pod_disruptor.js:26:34(32)
        at native  executor=per-vu-iterations scenario=default source=stacktrace

This is the script that I'm using:

import { PodDisruptor } from 'k6/x/disruptor';

const selector = {
    namespace: 'demo',
    select: {
        labels: {
            name: 'querier',
        },
    },
}

const fault = {
    averageDelay: 100,
    errorRate: 0.1,
    errorCode: 500
}

export default function () {
    const disruptor = new PodDisruptor(selector)
    const targets = disruptor.targets()
    if (targets.length != 1) {
        throw new Error("expected list to have one target")
    }

    console.log("target: " + JSON.stringify(targets))
    disruptor.injectHTTPFaults(fault, 30)
}

The pod I'm targeting is running properly. Also, the ephemeral container is there:

Ephemeral Containers:
  xk6-agent:
    Container ID:   docker://aca764daeee2fd701d756ad33f9ab96e71498cdf9ddc7d8e1dc9cbb210b573b2
    Image:          ghcr.io/grafana/xk6-disruptor-agent
    Image ID:       docker-pullable://ghcr.io/grafana/xk6-disruptor-agent@sha256:1b3d8a8f7d4e9d28fcaf55b831a5d43cdfb145a6bd8fc8970499a3acedabd438
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 15 Nov 2022 12:27:09 +0100
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>

I logged what we are passing to the exec helper and got this:

querier-54f7cf5487-tpm5x [xk6-disruptor-agent http -d 30s -a 100 -v 0 -e 500 -r 0.1 -b ]

There are no logs on the ephemeral container (or at least, I couldn't see them).

My local setup:

  • Kubernetes v1.25.2 (Docker Desktop - Mac M1)
  • Built from the latest code in main.

Backport disruptor to Kubernetes < 1.23

Currently xk6-disruptor requires Kubernetes 1.23 or higher. This dependency is mostly due to the changes introduced in this version to the ephemeral containers API. However, there are still many Kubernetes clusters running previous versions that cannot use the disruptor.
It would therefore be convenient to backport the implementation of the disruptor to these versions.

Redesign JavaScript Fault injection API

The xk6-disruptor API is built around the concept of disruptors that inject faults. Presently, each disruptor implements one method for each type of fault it injects, and different disruptors can implement the same method if they are able to inject the same type of fault. For example, ServiceDisruptor and PodDisruptor both implement the injectHTTPFaults method for injecting HTTP faults.

Regarding the documentation, having the same method implemented by multiple disruptors introduces redundancy, as the description of the method signature and parameters is the same. There may be some minor differences, as in the case of how the PodDisruptor and ServiceDisruptor handle the port in the HTTPFault parameter.

As more fault types and more disruptors are added, this duplication is expected to grow. For example, both a PodDisruptor and a NodeDisruptor could implement an InjectNetworkFault method for injecting network-level disruptions.

The fault injection API could be simplified using a generic InjectFault function implemented by all disruptors. This function receives the description of the fault as an object.

Disruptor.injectFault(type, fault, duration, options)

The documentation for each disruptor class must list which types of faults it supports and document any differences in the way it handles these faults. The fault object is documented separately, and this description is shared by all disruptors supporting it.
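A rough sketch of how such a generic entry point could dispatch faults, in plain JavaScript. The class names, handler registration scheme, and return values below are illustrative assumptions, not part of the current xk6-disruptor API:

```javascript
// Hypothetical sketch of a generic fault injection entry point.
class Disruptor {
  constructor() {
    // Each disruptor registers handlers only for the fault types it supports.
    this.handlers = {};
  }

  injectFault(type, fault, duration, options = {}) {
    const handler = this.handlers[type];
    if (!handler) {
      throw new Error(`fault type not supported by this disruptor: ${type}`);
    }
    return handler(fault, duration, options);
  }
}

class PodDisruptor extends Disruptor {
  constructor() {
    super();
    // Illustrative handler: a real implementation would talk to the agent.
    this.handlers["http"] = (fault, duration) =>
      `injecting http fault for ${duration}`;
  }
}
```

With this shape, a disruptor that does not support a fault type fails fast with a clear error, and the documentation of each fault object can live in a single place.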

Document differences between the HTTP fault injection for PodDisruptor and ServiceDisruptor

The documentation for the injectHTTPFaults methods of the PodDisruptor and ServiceDisruptor does not explain the differences between these two methods in how they treat the port parameter in the httpFault.

The PodDisruptor interprets the port parameter as the container port in the Pod, while the ServiceDisruptor interprets it as the port in the Service. In many cases these two interpretations coincide, as the service's port matches the pod's container port. But this is not always the case.

To better understand the issue, consider the following example of a pod exposed as a service. Notice the pod exposes port 80 while the service exposes port 8080 and maps it to port 80 in the Pod.

Pod:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80

Service:

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  ports:
  - port: 8080
    targetPort: 80
  selector:
    app: nginx

Consider now the following HttpFault definition:

{
  port: 80,
  errorRate: 0.1,
  errorCode: 500
}

If this fault is injected using a PodDisruptor, the results are as expected: the traffic to the Pod's port 80 is disrupted. If applied to a ServiceDisruptor, we get: error injecting fault: the service does not expose the given port: 80.

To make this fault work for a service disruptor, we must change the port to 8080.
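The port translation a ServiceDisruptor would need can be sketched as follows. The function name and the simplified Service spec shape are illustrative assumptions, not the actual implementation:

```javascript
// Sketch: map a service port to the container (target) port it forwards to.
function resolveTargetPort(serviceSpec, port) {
  const definition = serviceSpec.ports.find((p) => p.port === port);
  if (!definition) {
    throw new Error(`the service does not expose the given port: ${port}`);
  }
  // If targetPort is omitted, Kubernetes defaults it to the service port.
  return definition.targetPort !== undefined ? definition.targetPort : definition.port;
}
```

For the nginx example above, resolving port 8080 yields container port 80, while resolving port 80 fails with the same kind of error the disruptor reports.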

Don't expose node ports by default in e2e tests

Since #89 e2e tests no longer rely on exposing a node port for each service in the kind cluster used by the tests.

However, when launching multiple e2e tests in parallel there are errors due to conflicts with exposed ports:

agent_e2e_test.go:76: failed to create cluster config: command "docker run --name e2e-xk6-agent-control-plane --hostname e2e-xk6-agent-control-plane --label io.x-k8s.kind.role=control-plane --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro -e KIND_EXPERIMENTAL_CONTAINERD_SNAPSHOTTER --detach --tty --label io.x-k8s.kind.cluster=e2e-xk6-agent --net kind --restart=on-failure:1 --init=false --publish=0.0.0.0:32080:32080/TCP --publish=0.0.0.0:32081:32081/TCP --publish=0.0.0.0:32082:32082/TCP --publish=0.0.0.0:32083:32083/TCP --publish=0.0.0.0:32084:32084/TCP --publish=0.0.0.0:32085:32085/TCP --publish=0.0.0.0:32086:32086/TCP --publish=0.0.0.0:32087:32087/TCP --publish=0.0.0.0:32088:32088/TCP --publish=0.0.0.0:32089:32089/TCP --publish=127.0.0.1:33093:6443/TCP -e KUBECONFIG=/etc/kubernetes/admin.conf kindest/node:v1.24.0@sha256:406fd86d48eaf4c04c7280cd1d2ca1d61e7d0d61ddef0125cb097bc7b82ed6a1" failed with error: exit status 125

Implement mechanism for preventing concurrent fault injections

One current limitation of the xk6-disruptor is that concurrent injection of faults into the same targets (e.g. pods, nodes) should be prevented because it can have unexpected results.

One possible approach (the simplest) to mitigate this risk would be to limit concurrent executions by using an "only-once" kind of executor in k6, as proposed in this issue. However, besides the drawback of requiring changes in the core k6 implementation, this approach does not entirely solve the issue of users inadvertently executing multiple concurrent fault injections due to misconfiguration of their tests or the execution of multiple tests that have the same targets.

Therefore, it would be desirable to implement a mechanism that prevents the concurrent execution of fault injections on the same targets. Such a mechanism would be conceptually a lock on the targets. This means that concurrent fault injections may occur only if their targets do not overlap.

Some possible implementations could be:

  1. Adding annotations to the targets that specify an expiration time, for example k6.io/disruptor/target-lock: <timestamp>. The main advantage of this approach is its simplicity. However, it has a major drawback: it cannot guarantee atomicity. Two disruptors can start annotating two overlapping sets of targets concurrently and end up with a mix of annotations from both disruptors. In this case, both disruptors would have to roll back their annotations, making the locking logic complex. One possible solution would be to add a higher-level lock that prevents two disruptors from running concurrently. See the comment at the end of the issue regarding the implementation of such a lock.

  2. Creating a disruption CRD. This CRD would contain the targets of a disruptor as well as an expiration time. This resource would serve as a lock on the targets until the given expiration. The main drawback is that it requires checking all existing CRDs to see if any match the same targets before creating a new disruption. Moreover, this process must be executed by only one disruptor at a time, so it also requires a higher-level lock object to serialize this process. Additionally, this approach requires the creation of a disruption CRD, which is undesirable as it increases the operational complexity of using the xk6-disruptor.

  3. Using an operator. Similar to the previous approach, each disruption creates a CRD, which describes the targets. The operator processes these CRDs and validates whether any other existing CRD overlaps the same targets, updating the status of the CRD to valid or rejected. The disruptor can check this status before continuing. The main advantage of this approach is that it is a well-known pattern. Additionally, it opens the possibility of running the fault injection from the operator instead of the disruptor extension. However, it requires the deployment of an operator and the CRD, increasing the operational complexity.

Some of the alternatives described above may require a high-level lock that prevents two disruptors from running concurrently, to avoid race conditions. Implementing such a lock may require the creation of a CRD in the target environment. This is undesirable, as it increases the operational complexity of using the xk6-disruptor.
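Whichever mechanism is chosen, it ultimately reduces to an overlap check on target sets: two fault injections may run concurrently only if their targets are disjoint. A minimal sketch of that check (target names are illustrative):

```javascript
// Returns true if any requested target is already held by a running injection.
function targetsOverlap(current, requested) {
  const held = new Set(current);
  return requested.some((target) => held.has(target));
}
```

For example, an injection targeting ["pod-b", "pod-c"] would be rejected while another injection holds ["pod-a", "pod-b"], but one targeting only ["pod-c"] could proceed.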

Redesign ServiceDisruptor API to avoid redundancies in fault injection methods

The ServiceDisruptor is a helper class that offers a convenient way of disrupting pods that back a service.
Its implementation simply wraps a PodDisruptor that is created by taking the selector of the target service and using it as a pod selector.

Both the ServiceDisruptor and the PodDisruptor implement the same fault injection methods, for example injectHTTPFaults.

This design is convenient because the ServiceDisruptor can modify or enhance the behavior of such methods. For example, it takes the port definition from the service as a target for injecting HTTP faults, while for the PodDisruptor this port must be specified.

However, duplicating the fault injection methods in the Pod and Service disruptors creates significant duplication of effort in the documentation. Presently only one method is duplicated, but as more faults are implemented (e.g. killing a random pod), those methods will also be duplicated.

When multiple pods are selected: Requests are being client-side throttled

When trying to instantiate a PodDisruptor, with a selector that touches 15 pods, I get the following messages from time to time:

I1114 11:50:30.814928   48040 request.go:601] Waited for 4.391415667s due to client-side throttling, not priority and fairness, request: PATCH:https://kubernetes.docker.internal:6443/api/v1/namespaces/k6-cloud-crocospans/pods/grafana-agent-metrics-0/ephemeralcontainers

I wonder if we could be more gentle with our request pattern. Also, I wonder if this could be a problem in huge namespaces.

Can't create PodDisruptor: the server could not find the requested resource

As the title says, I'm not able to use the PodDisruptor.

I get the following error:

ERRO[0005] GoError: error creating PodDisruptor: the server could not find the requested resource
        at disrupt (file:///Users/dgzlopes/go/src/github.com/grafana/xk6-disruptor/examples/httpbin/disrupt-pod.js:28:23(14))
        at native  executor=shared-iterations scenario=disrupt source=stacktrace

After some digging, it looks like the error is generated when we try to apply the patch:

The pod name that we are passing seems correct. The patch we pass is the following one:

{
   "spec":{
      "ephemeralContainers":[
         {
            "image":"ghcr.io/grafana/xk6-disruptor-agent",
            "imagePullPolicy":"IfNotPresent",
            "name":"xk6-agent",
            "resources":{
               
            },
            "securityContext":{
               "capabilities":{
                  "add":[
                     "NET_ADMIN"
                  ]
               }
            },
            "stdin":true,
            "tty":true
         }
      ]
   }
}

The pod runs correctly, and I can interact with it using Kubectl.

My local setup:

  • Kubernetes 1.22.5 (Docker Desktop - Mac M1)
  • Built from the latest code in main.

Dependency Disruptor

It is a common use case to test the effect of known patterns of behavior in external dependencies (services that are not under the control of the organization). Using the xk6-disruptor, this could be accomplished by implementing a Dependency Disruptor, which instead of disrupting a service (or a group of pods), disrupts the requests these pods make to other services.

This could be implemented using an approach similar to the one used by the existing disruptors: injecting a transparent proxy, but in this case for outgoing requests.

This approach will work well if the service is a dependency for a small set of pods (for example, the pods that back an internal service) but will not work well if many different pods (e.g. many different internal services) use this external dependency.

From the implementation perspective, the two main blockers for this functionality are:

  1. TLS termination. For external services, the most common scenario is to use encrypted communications using TLS. In this case, the disruptor cannot modify the response (e.g. the status code). Moreover, the traffic cannot be intercepted using a simple proxy because the handshaking would fail. Using eBPF may open some alternatives.

  2. How to identify the IP address(es) of the dependency. Currently, the disruptor uses iptables to redirect the traffic to the proxy that injects the faults. In the case of the dependency disruptor the traffic going to the external service is the one that must be intercepted. However, the IP address of this external dependency may not be known at the time the disruptor agent is installed, or it can change during the execution of the disruption (for example, if the external dependency uses DNS load balancing).

Implement grpc interface to disruptor agent

Presently the communication between the xk6-disruptor extension and the xk6-disruptor-agent running in the target Pods is implemented by executing a CLI command in the agent's container. This approach has some important advantages:

  • It facilitates testing the agent without any client-side component, as it can be invoked from the command line.
  • It does not require the agent to be accessible outside the Kubernetes cluster (the exec command uses the Kubernetes API server as a gateway).

However, it also has some important drawbacks:

  • Mapping the interface defined in the xk6-disruptor extension into the corresponding commands in the agent is error prone and may create subtle errors, such as different expectations regarding default values for parameters.
  • Error handling is quite limited when compared with the rich error handling capabilities of gRPC.
  • It makes it difficult to test the communication, as the Kubernetes client does not provide a way to mock the execution of commands.
  • Some disruptions may drop the network connection to the target (e.g. simulating node disconnection), causing the exec command to fail.

Even though the existing command-line model will likely remain the default interface, it would be convenient to also implement this communication using gRPC and use it when running tests inside the Kubernetes cluster, for example using the k6-operator or k6 Cloud.

Document how to run xk6-disruptor tests using the k6-operator

In some settings, users may not be allowed to run tests that use the xk6-disruptor because they lack the permissions for running containers with the security privileges needed by the xk6-disruptor-agent (see the get started guide for more details).

One alternative is to install the k6-operator using a service account that has such permissions.

Document the issue and the solution using k6-operator, including the permissions required by the service account.

Requires:

Support multiple platforms for agent image

Presently the process for generating the agent image has two steps in which the architecture and operating system must be considered:

  1. When building the agent binary, the architecture and operating system are taken from the defaults of the build platform.
  2. The image uses an Alpine base image and expects a Linux binary for the agent.

These two factors create multiple issues that prevent supporting multiple platforms:

  • In the CI, the agent is compiled for the amd64 architecture and the Linux operating system. This image is not compatible with clusters running on the arm64 architecture.

  • If the agent is built on macOS, the resulting binary will be for the darwin operating system. This binary is incompatible with the image, which expects a Linux binary.

Therefore, the build process should:

  • Generate a linux binary regardless of the build platform, for both the arm64 and amd64 architectures
  • Generate an image that supports arm64 and amd64 architectures using these binaries

This may be the cause of the second issue reported in #62

Disruptor agent not working on MacOS on ARM chips

There are some issues regarding the compatibility of the disruptor agent with test clusters deployed on macOS running on ARM chips.

  1. Iptables support on QEMU
    As reported by @dgzlopes iptables currently doesn't work under QEMU emulation on M1 laptops (docker/for-mac#6297 (comment))

  2. The agent fails to execute in the container

When injecting a fault, the script fails with this error message:

INFO[0000] target: ["hotrod-6cb64465cc-ts6qw"]           source=console
ERRO[0000] GoError: error invoking agent: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "052394429b3ee2563b9124e274b55117fb7b2af5556e6001ae7a4e82bc4c8388": OCI runtime exec failed: exec failed: unable to start container process: exec /usr/bin/xk6-disruptor-agent: exec format error: unknown 

        at reflect.methodValueCall (native)
        at file:///Users/dgzlopes/go/src/github.com/grafana/xk6-disruptor/examples/pod_disruptor.js:25:34(32)
        at native  executor=per-vu-iterations scenario=default source=stacktrace

Add validations to the API

Currently the objects passed from JS to the extension are not properly validated. In particular, it is not validated that only valid fields are passed. This may cause errors due to misspelled field names or invalid structures, as reported in #41.

This kind of validation is somewhat at odds with the philosophy of "duck typing", common in the JS community, which precludes strict type validation between objects, but the potential issues raised by not having it outweigh this limitation.

These validations should relate only to structure, not to the values of fields, which should be validated in the corresponding disruptors.

Validations

  • Validate selectors #41
  • Validate options
  • Validate faults
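A structural validation of this kind could be sketched as a check against a hand-written list of allowed field names. The HTTP fault field list below is illustrative and may not match the actual fault definition:

```javascript
// Illustrative list of allowed fields for an HTTP fault object.
const httpFaultFields = ["averageDelay", "errorRate", "errorCode", "port"];

// Reject any object containing fields outside the allowed list,
// so misspelled fields fail loudly instead of being silently ignored.
function validateFields(object, allowed) {
  const unknown = Object.keys(object).filter((field) => !allowed.includes(field));
  if (unknown.length > 0) {
    throw new Error(`invalid field(s): ${unknown.join(", ")}`);
  }
}
```

With this check in place, a misspelled field such as erorRate would produce an immediate error naming the offending field, rather than a fault that silently does nothing.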

Create interface with Goja for disruptor API

Presently the Disruptor interface in golang is directly exposed to JS. This creates several inconveniences:

  • Struct fields are passed with the wrong type.
  • If a struct field is misspelled, this is not detected as an error, creating a confusing experience for users (see #41).
  • There is no support for a Duration type in JS, so parameters and struct fields that specify a duration must be defined either as integers with a predefined unit (seconds, milliseconds) or as strings converted to a duration in the Disruptor code.

In order to address the issues mentioned above, it would be convenient to create an interface that is aware of the goja type system and validates the parameters received, making any necessary conversion.

Define durations as strings in the API

Presently the API defines several function parameters and struct fields that specify durations. In part due to the issues described in #45, these arguments have been defined as integers with an implicit time unit. This can be inconvenient for users, who must remember which unit is used.

It would offer a better experience to define all durations as strings with an explicit time unit, such as 100ms or 30s, and convert them to the required unit internally.
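The conversion could be sketched as follows; the supported units ("ms", "s", "m") and the function name are assumptions for illustration:

```javascript
// Parse a duration string with an explicit unit into milliseconds.
function parseDurationMillis(duration) {
  const match = /^(\d+)(ms|s|m)$/.exec(duration);
  if (!match) {
    throw new Error(`invalid duration: ${duration}`);
  }
  const value = Number(match[1]);
  const unitMillis = { ms: 1, s: 1000, m: 60000 };
  return value * unitMillis[match[2]];
}
```

Rejecting bare integers (no unit) forces scripts to be explicit, avoiding the implicit-unit ambiguity described above.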

Refactor CI build/release logic into script

The logic for building and releasing the xk6-disruptor is presently embedded in the CI workflow.
As this logic becomes more complex (e.g. adding support for multiple target architectures), it becomes harder to maintain in the CI.
Therefore it would be convenient to move the logic to a script that implements all the required steps and invoke that script from the CI.

Implement helper function to generate disruption scenarios

As described in this issue, presently there is an important limitation in the execution of disruptor functions: they should not be executed concurrently over a set of overlapping targets.

In practice, this means that the scenarios triggering the disruptions should be executed by one VU and, in most cases, in only one iteration, using a shared-iterations executor as shown below:

export const options = {
    scenarios: {
        // This scenario injects the faults
        faults: {
            executor: 'shared-iterations',
            iterations: 1,
            vus: 1,
            exec: "disrupt",
            startTime: "30s",
        },
    },
}
As discussed in this other issue, one way to minimize this risk is to provide a helper function that returns a scenario configuration following these restrictions, not allowing invalid configurations (such as using another executor or multiple VUs). The code above would be equivalent to the code below using this helper function:

import { disruptScenario } from 'k6/x/disruptor';

export const options = {
    scenarios: {
        // This scenario injects the faults
        faults: disruptScenario({
            exec: "disrupt",
            startTime: "30s",
        }),
    },
}
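A minimal sketch of what such a helper could do, in plain JavaScript: the name disruptScenario comes from the proposal above, but the merge-and-reject behavior is an assumption, not a committed design:

```javascript
// Build a scenario configuration that is safe for fault injection:
// force a single VU running a single shared iteration, and reject
// any attempt to override the reserved fields.
function disruptScenario(options = {}) {
  const reserved = ["executor", "iterations", "vus"];
  const invalid = Object.keys(options).filter((key) => reserved.includes(key));
  if (invalid.length > 0) {
    throw new Error(`option(s) cannot be overridden: ${invalid.join(", ")}`);
  }
  return Object.assign({}, options, {
    executor: "shared-iterations",
    iterations: 1,
    vus: 1,
  });
}
```

Rejecting overrides of the reserved fields, rather than silently replacing them, makes misconfigured tests fail at parse time instead of injecting faults concurrently.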

Validate target port when injecting protocol faults to Pods

Presently, the target port is not validated when injecting protocol faults into a pod. If the port is not used by any container of the Pod, the fault injection will not have any effect, but neither will the user receive any error or warning.

It would be desirable to check that the target port is exposed by at least one of the containers in the Pod. This validation must be made for each target Pod, as they may have been started independently or been created by different deployments.
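The proposed check could be sketched as follows, assuming simplified pod spec objects; the shapes and function names are illustrative:

```javascript
// Check whether any container in the pod spec declares the given port.
function podExposesPort(podSpec, port) {
  return podSpec.containers.some((container) =>
    (container.ports || []).some((p) => p.containerPort === port)
  );
}

// Validate the target port against every target pod, failing with the
// name of the first pod that does not expose it.
function validateTargetPort(pods, port) {
  for (const pod of pods) {
    if (!podExposesPort(pod.spec, port)) {
      throw new Error(`target pod ${pod.metadata.name} does not expose port ${port}`);
    }
  }
}
```

Running the check per pod matters because, as noted above, pods matched by the same selector may come from different deployments with different container specs.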

Update github actions

The GitHub pipelines are generating warnings due to actions that depend on a deprecated Node.js version:

Node.js 12 actions are deprecated. Please update the following actions to use Node.js 16: actions/upload-artifact@v2. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/.
 
Node.js 12 actions are deprecated. Please update the following actions to use Node.js 16: actions/download-artifact@v2. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/.

Add e2e tests running test scripts

Presently, integration tests cover the golang code. Additional tests are required to ensure the JS extension works properly.
These tests should:

  1. Launch a test cluster
  2. Set up resources (e.g. deploy pods, expose services)
  3. Execute a test script

Steps 1 and 2 can reuse existing test setup utils.

Step 3 requires a mechanism for executing a script and validating the results. Two approaches could be used here:

  • Run the xk6-disruptor binary as a process, passing the test script from a file. This approach makes it difficult to check the result (the output needs to be parsed).
  • Set up a test goja environment and load and execute the script (see for example this test). This approach has the drawback that it does not really test the final extension (for example, the initialization code).

Context conflict with xk6-browser

When running a script that injects faults using the disruptor with a custom build that also includes the xk6-browser extension, the following error is reported:

ERRO[0000] GoError: error creating ServiceDisruptor: error creating ServiceDisruptor: client rate limiter Wait returned an error: context canceled

This happens even if the script does not use the browser extension (e.g. this demo script).

Running a script with this custom build (k6 + xk6-disruptor + xk6-browser) that uses the browser but not the disruptor works as expected.

Use tagged agent image instead of latest when injecting agent

Presently, when the xk6-disruptor extension injects the agent into a target, it uses the latest image.

As we move towards versioned releases, it would be necessary that each xk6-disruptor version uses the corresponding agent image to ensure compatibility between both components.

Moving in this direction, the CI/CD workflow is prepared to publish the agent's image when a new version is released, using the release version as the tag.

Making the extension reference this version could be easily implemented by:

  1. Adding a Version constant somewhere in the code base (e.g. pkg/internal/constants) and updating it when a new version of the extension is released (as is done in k6, for example)
  2. Using this constant as the tag when referencing the agent's image

The simplicity of this approach has a limitation: once a version is released (say, v0.1.0), the main branch may receive updates to the agent image (published with the tag latest by the CI), but the extension will still reference the last released version.

The challenge is therefore how to ensure that when the extension is built from a given version it uses the corresponding agent image, but when built from the main branch, the latest agent image is used.

The selected solution should work both when building the extension in the CI and when building it locally; therefore it must rely exclusively on information available at compile time.
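
A minimal sketch of steps 1 and 2 above, assuming for illustration that the constant lives in a main package (the real location, e.g. pkg/internal/constants, and the exact image repository path are assumptions): defaulting the version to "latest" keeps main-branch builds on the latest agent image, while release builds override it at compile time via -ldflags, which satisfies the compile-time-only constraint.

```go
package main

import "fmt"

// Version defaults to "latest" so builds from the main branch keep
// using the latest agent image. Release builds override it at compile
// time, using only information available then:
//
//	go build -ldflags "-X main.Version=v0.1.0"
var Version = "latest"

// AgentImage returns the agent image reference tagged with the
// extension's version (the repository path is illustrative).
func AgentImage() string {
	return "ghcr.io/grafana/xk6-disruptor-agent:" + Version
}

func main() {
	fmt.Println(AgentImage())
}
```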

Abstract out agent execution environment

The disruptor agent is responsible for injecting faults in the disruption target. The reliability of the agent is critical to ensure the chaos tests do not disrupt the target in unpredictable ways or that its effects last beyond the duration of the fault injection. Therefore, the testability of the agent is a key requirement.

However, in order to inject faults, the agent must interact with the target's execution environment, for example by executing commands to modify the configuration or to perform other tasks.

These kinds of interactions are difficult to test, because the functions the Go standard library offers for interacting with the execution environment (mostly under the os package) are not provided as interfaces that can be mocked.

One alternative that has already been applied for the execution of commands is to provide abstractions for these functions that can be easily mocked in tests.

This approach can be extended to:

  • Environment variables
  • Process execution
  • CLI arguments
  • Access to the file system
  • Console output
  • Signal handling
  • Process exit
  • Process execution lock

This same approach has been used in k6 by introducing a global state.
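
As a sketch of the idea (the interface shape and names are illustrative, not the extension's actual abstraction), an interface can front the os package in production while a fake stands in for it in tests:

```go
package main

import (
	"fmt"
	"os"
)

// Environment abstracts the pieces of the execution environment the
// agent touches. A minimal sketch: a real interface would also cover
// the file system, signals, process exit, and so on.
type Environment interface {
	Args() []string
	Lookup(name string) (string, bool)
}

// osEnvironment is the production implementation backed by the os package.
type osEnvironment struct{}

func (osEnvironment) Args() []string { return os.Args }

func (osEnvironment) Lookup(name string) (string, bool) { return os.LookupEnv(name) }

// fakeEnvironment is a test double with predefined values, so agent
// logic can be tested without touching the real OS.
type fakeEnvironment struct {
	args []string
	vars map[string]string
}

func (f fakeEnvironment) Args() []string { return f.args }

func (f fakeEnvironment) Lookup(name string) (string, bool) {
	v, ok := f.vars[name]
	return v, ok
}

func main() {
	var env Environment = fakeEnvironment{
		args: []string{"agent", "--duration", "30s"},
		vars: map[string]string{"TARGET_POD": "httpbin-0"},
	}
	target, _ := env.Lookup("TARGET_POD")
	fmt.Println(env.Args()[1], target)
}
```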

Add template for injected errors

Some APIs expect that, in case of an error, a body is returned with details about the error. Usually it takes the form of a JSON object with an error code and a human-readable description. Presently, the http_proxy does not return any body when a fault is injected, causing unexpected errors in the application.
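
A sketch of what such a template could look like, using the standard text/template package; the field names and JSON shape are assumptions for illustration, not the extension's actual format:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// defaultErrorBody is a hypothetical template for the body returned
// together with an injected error.
const defaultErrorBody = `{"error": {{.Code}}, "message": "{{.Message}}"}`

// injectedError carries the data rendered into the template.
type injectedError struct {
	Code    int
	Message string
}

// renderErrorBody fills the template with the injected error code and
// a human-readable description.
func renderErrorBody(tmpl string, e injectedError) (string, error) {
	t, err := template.New("errorBody").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, e); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	body, err := renderErrorBody(defaultErrorBody, injectedError{Code: 500, Message: "fault injected"})
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

Letting users override the template would accommodate APIs with different error-body conventions.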

ServiceDisruptor does not resolve target port if service uses port name

If a service specifies the target port using its name, the ServiceDisruptor is not able to resolve it. As a result, the agent uses the default target port (80).

How to reproduce:

  1. Deploy a pod that exposes port 8080 with the name http.
  2. Expose the pod with a service that maps port 80 to the target port http.
  3. Inject faults into port 80 of the service using a ServiceDisruptor. The faults are not injected, because the disruptor uses the default target port (80) in the pod instead of the exposed port 8080.

See the manifest below for deploying and exposing the pod and the script for injecting faults.

Manifests
kind: Service
apiVersion: v1
metadata:
  name: httpbin
  namespace: httpbin
spec:
  selector:
      app: httpbin
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: http
---
kind: Namespace
apiVersion: v1
metadata:
  name: httpbin
---
kind: Pod
apiVersion: v1
metadata:
  name: httpbin
  namespace: httpbin
  labels:
     app: httpbin
spec:
  containers:
  - name: httpbin
    image: mccutchen/go-httpbin
    command: ["go-httpbin", "--port", "8080"]
    ports:
    - name: http
      containerPort: 8080
Test script
import { ServiceDisruptor } from 'k6/x/disruptor';
import http from 'k6/http';

export default function (data) {
    http.get(`http://${__ENV.SVC_IP}/status/200`);
}

export function disrupt(data) {
    if (__ENV.SKIP_FAULTS == "1") {
        return
    }

    const disruptor = new ServiceDisruptor("httpbin", "httpbin")

    // delay traffic from one random replica of the deployment
    const fault = {
        port: 80,
        averageDelay: 200,
        errorCode: 500,
        errorRate: 0.1
    }
    const opts = {
            proxyPort: 8000
    }
    disruptor.injectHTTPFaults(fault, 30, opts)
}

export const options = {
    scenarios: {
        load: {
            executor: 'constant-arrival-rate',
            rate: 100,
            preAllocatedVUs: 10,
            maxVUs: 100,
            exec: "default",
            startTime: '0s',
            duration: "30s",
        },
        disrupt: {
            executor: 'shared-iterations',
            iterations: 1,
            vus: 1,
            exec: "disrupt",
            startTime: "0s",
        },
    }
}
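
The missing resolution step can be sketched as a lookup of the named port among the pod's container ports; the simplified struct below stands in for the Kubernetes API type:

```go
package main

import "fmt"

// containerPort is a simplified stand-in for the Kubernetes API type
// corev1.ContainerPort.
type containerPort struct {
	Name string
	Port int
}

// resolveTargetPort resolves a service's targetPort, given by name, to
// the container port number. A sketch of the resolution step the
// ServiceDisruptor is missing: for the manifest above it maps "http"
// to 8080 instead of falling back to the default port 80.
func resolveTargetPort(name string, ports []containerPort) (int, error) {
	for _, p := range ports {
		if p.Name == name {
			return p.Port, nil
		}
	}
	return 0, fmt.Errorf("no container port named %q", name)
}

func main() {
	ports := []containerPort{{Name: "http", Port: 8080}}
	port, err := resolveTargetPort("http", ports)
	if err != nil {
		panic(err)
	}
	fmt.Println(port)
}
```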

Remove circular dependency in Kubernetes e2e tests

Currently, the Kubernetes helper package's tests use, to some extent, helper functions provided by that same package for test setup, creating a kind of circular dependency between the tests and the package.

A better alternative would be to use other tools for test setup and for checking test conditions.

One possibility would be to run the tests as pods in the cluster, using scripts and CLI tools such as kubectl. In this way, the tests would automate actions an operator could perform.

One open question is how to check the results of the test from the Job execution.

Unit tests fail on macOS

Unit tests for the process utils fail when running on macOS. This is due to the use of a full path to the true and false commands, which differs between Linux and macOS environments.

--- FAIL: Test_Exec (0.00s)
    --- FAIL: Test_Exec/do_not_return_output (0.01s)
        process_test.go:61: error: fork/exec /bin/true: no such file or directory
        process_test.go:65: unexpected error fork/exec /bin/true: no such file or directory
FAIL
FAIL    github.com/grafana/xk6-disruptor/pkg/utils/process      0.392s
FAIL
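
One way to make these tests portable would be to resolve the command through the PATH environment variable instead of hard-coding /bin/true; a sketch using only the standard library:

```go
package main

import (
	"fmt"
	"os/exec"
)

// resolveCommand looks up a command through PATH, so tests do not
// depend on /bin/true vs /usr/bin/true, which differ between Linux
// and macOS installations.
func resolveCommand(name string) (string, error) {
	return exec.LookPath(name)
}

func main() {
	path, err := resolveCommand("true")
	if err != nil {
		fmt.Println("not found:", err)
		return
	}
	fmt.Println(path)
}
```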

Failed to run script: Invalid apiVersion

As seen in the screenshot, I'm getting this error while trying to run my script!
[Screenshot from 2023-02-02 14-44-23]

It is true that my cluster version is too old (1.21) and outside the requirements of the disruptor extension, but we aren't displaying the error message that #49 should trigger.

I'm running on Linux with the v0.1.3 binary (downloaded from the releases page). The cluster is EKS v1.21.

Allow the selection of urls to target an http disruption

When defining an HTTPFault, allow the specification of a list of URL paths the fault should affect. Requests not matching any of the paths should not be affected.

As many APIs have URLs that include variables (e.g. user or product IDs), the specification should allow wildcards or placeholders in the paths:

const fault = {
    select: ["books/*"],
    delay: 100,
}

As the fault specification also allows defining a list of paths to be excluded, there is the potential for conflicts: a path that matches a selection pattern but also an exclusion pattern. In this case, the exclusion should override the selection and the request should not be affected by the fault.
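
The selection/exclusion logic can be sketched as follows; this uses path.Match-style wildcards, while the pattern syntax the extension finally adopts may differ:

```go
package main

import (
	"fmt"
	"path"
)

// shouldDisrupt reports whether a request path is affected by a fault,
// given selection and exclusion patterns. Exclusions override
// selections, and an empty selection matches every path.
func shouldDisrupt(reqPath string, include, exclude []string) bool {
	matches := func(patterns []string) bool {
		for _, pattern := range patterns {
			if ok, _ := path.Match(pattern, reqPath); ok {
				return true
			}
		}
		return false
	}
	if matches(exclude) {
		return false
	}
	return len(include) == 0 || matches(include)
}

func main() {
	include := []string{"books/*"}
	exclude := []string{"books/new"}
	fmt.Println(shouldDisrupt("books/123", include, exclude)) // selected
	fmt.Println(shouldDisrupt("books/new", include, exclude)) // exclusion overrides
	fmt.Println(shouldDisrupt("users/42", include, exclude))  // not selected
}
```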

xk6-disruptor-agent does not terminate if test is cancelled

When a fault is injected in a Pod, xk6-disruptor executes the xk6-disruptor-agent as a process in the xk6-agent container attached to that Pod. It is this process that injects the disruption in the Pod.

If the k6 test is cancelled, this process is not cancelled and the disruption continues in effect until the process ends.

Running the test again while this process is still running will have unpredictable results.

To address this issue we need to:

  • Handle the termination of the test. This is presently not possible due to grafana/k6#2432.
  • Cancel the execution of the command. This may require cancelable exec connections, which are planned for Kubernetes client 0.26.

Implement fault injection for grpc services

gRPC is becoming increasingly popular as a protocol for microservice applications, and in particular for infrastructure services. Therefore, it would be convenient to add support for injecting faults into requests to gRPC services, following a pattern similar to the one used for injecting faults into HTTP requests.

Add selector validation

Right now, we don't validate that the selector the user passes is correct.

I spent some time trying to use the following selector:

const selector = {
	namespace: 'k6-cloud-test',
	labels: {
		name: 'querier',
	},
}

And it didn't work as I expected! It matched only the namespace and silently ignored the labels, because the schema was wrong: labels must be nested under a select key rather than placed at the top level.

We should point that out to the users.
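
One way to fail loudly on a misplaced key is strict decoding; the sketch below mirrors the expected schema with illustrative Go field names and rejects unknown fields (an assumption about how validation could be done, not the extension's current code):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// podSelector mirrors the expected schema: labels are nested under
// "select", not placed at the top level.
type podSelector struct {
	Namespace string `json:"namespace"`
	Select    struct {
		Labels map[string]string `json:"labels"`
	} `json:"select"`
}

// parseSelector decodes user-supplied options and rejects any field
// that is not part of the schema, so a misplaced "labels" key fails
// loudly instead of being silently ignored.
func parseSelector(raw []byte) (*podSelector, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var s podSelector
	if err := dec.Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	// The selector from the report above: "labels" at the top level.
	bad := []byte(`{"namespace":"k6-cloud-test","labels":{"name":"querier"}}`)
	if _, err := parseSelector(bad); err != nil {
		fmt.Println("rejected:", err)
	}
}
```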

Prevent multiple xk6-disruptor-agent commands from being executed on a pod

As described in #82, if the test fails, the xk6-disruptor-agent command is not cancelled. If the test is re-executed, it will start another command and the results are unpredictable.

Therefore, until #82 is properly addressed, it would be convenient to prevent multiple executions of commands in the same target pod. This would also partially address #26.

Relax lint errcheck settings

Current lint settings enforce some rather strict rules, such as not allowing error checks to be ignored when closing streams in a defer function. Therefore, the following is not allowed:

defer f.Close()

and must be replaced by:

defer func() {
        _ = f.Close()
}()
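
Assuming golangci-lint is the linter in use (an assumption; the project's actual lint setup may differ), such calls can be exempted with a config fragment like this hypothetical one, instead of wrapping every deferred Close:

```yaml
# Hypothetical .golangci.yml fragment: allow ignoring the error
# returned by Close on anything implementing io.Closer.
linters-settings:
  errcheck:
    exclude-functions:
      - (io.Closer).Close
```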

Obtain a new context for each disruptor instance.

After the changes introduced in the way context is managed in k6, it is no longer safe to store a context obtained from vu.Context(), because it is cancelled after each iteration. The init code is also considered an "iteration".

Therefore, the extensions should get a new context from the VU each time they need one.

Currently, the disruptor gets a context in the init context and passes it to the Kubernetes.New method, which stores it in the Kubernetes struct:

// NewModuleInstance returns a new instance of the disruptor module for each VU.
func (*RootModule) NewModuleInstance(vu modules.VU) modules.Instance {
	k8s, err := kubernetes.New(vu.Context())
	if err != nil {
		common.Throw(vu.Runtime(), fmt.Errorf("error creating Kubernetes helper: %w", err))
	}

	return &ModuleInstance{
		vu:  vu,
		k8s: k8s,
	}
}

This behavior is incorrect. It has already manifested as a bug when compiling a custom binary that includes both xk6-disruptor and xk6-browser, because xk6-browser requires k6 v0.41.1, which includes this change.

The disruptor should instead keep a reference to the VU on which an instance is created and use this reference to get a context when needed. For example, when creating a new instance of the PodDisruptor:

func (m *ModuleInstance) newPodDisruptor(c goja.ConstructorCall) *goja.Object {
	rt := m.vu.Runtime()

	disruptor, err := api.NewPodDisruptor(rt, c, m.k8s)
	if err != nil {
		common.Throw(rt, fmt.Errorf("error creating PodDisruptor: %w", err))
	}
	return disruptor
}

This pattern should then be propagated to every call to Kubernetes, instead of reusing the context stored in the struct.

Redesign JavaScript API implementation

The xk6-disruptor API is built around the concept of disruptors that inject faults. Presently, each disruptor implements one method for each type of fault it injects and different disruptors can implement the same method if they are able to inject the same type of fault. For example, ServiceDisruptor and PodDisruptor both implement the InjectHTTPFault method for injecting HttpFaults.

This API is exposed to the JS code by a series of adapters that validate and convert the objects received from JavaScript, delegate the execution to the corresponding disruptor, and return the result or raise an error.

Replicating the methods for fault injection in multiple disruptors creates significant duplication in the implementation of the JS API: each adapter must re-implement the methods of the disruptor that it wraps.

Following the example above, the adapters for PodDisruptor and ServiceDisruptor must both implement the method InjectHTTPFault. This duplication will grow as more fault types and disruptors are added; for example, it is to be expected that both the PodDisruptor and a future NodeDisruptor will implement an InjectNetworkFaults method for injecting network-level faults.
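
One direction for the redesign could be a per-fault-type interface, so a single adapter wraps any disruptor able to inject that fault; the sketch below uses illustrative names and signatures, not the extension's actual API:

```go
package main

import "fmt"

// HTTPFaultInjector is implemented by any disruptor that can inject
// HTTP faults. A single JS adapter can then wrap the interface rather
// than re-implementing InjectHTTPFault per disruptor.
type HTTPFaultInjector interface {
	InjectHTTPFault(errorRate float64) error
}

type podDisruptor struct{}

func (podDisruptor) InjectHTTPFault(rate float64) error {
	fmt.Printf("pod: injecting HTTP faults (errorRate=%.2f)\n", rate)
	return nil
}

type serviceDisruptor struct{}

func (serviceDisruptor) InjectHTTPFault(rate float64) error {
	fmt.Printf("service: injecting HTTP faults (errorRate=%.2f)\n", rate)
	return nil
}

// jsAdapter validates arguments once and delegates to whichever
// disruptor it wraps, removing the per-disruptor duplication.
type jsAdapter struct{ d HTTPFaultInjector }

func (a jsAdapter) InjectHTTPFault(rate float64) error {
	if rate < 0 || rate > 1 {
		return fmt.Errorf("errorRate must be in [0, 1], got %v", rate)
	}
	return a.d.InjectHTTPFault(rate)
}

func main() {
	for _, d := range []HTTPFaultInjector{podDisruptor{}, serviceDisruptor{}} {
		if err := (jsAdapter{d}).InjectHTTPFault(0.1); err != nil {
			panic(err)
		}
	}
}
```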

HTTPBin example: Setup step is flaky

I tried to run the HTTPBin example and ran into some problems.

First, the 10s timeout was hit while waiting for the pod to be ready:

ERRO[0013] aborting test. Pod httpbin not ready after 10 seconds
        at setup (file:///Users/dgzlopes/go/src/github.com/grafana/xk6-disruptor/examples/httpbin/disrupt-pod.js:24:8(36))
        at native  hint="script exception"

I changed the timeout and re-ran the script after this happened, and got another error:

ERRO[0003] GoError: namespaces "httpbin-ns" already exists
        at reflect.methodValueCall (native)
        at setup (file:///Users/dgzlopes/go/src/github.com/grafana/xk6-disruptor/examples/httpbin/disrupt-pod.js:18:14(7))
        at native  hint="script exception"

If the test finishes abruptly, the teardown phase isn't executed, and the namespace and its contents from the previous run hang around. I removed them manually and ran the test again. Another error:

ERRO[0064] setup() execution timed out after 60 seconds  hint="You can increase the time limit via the setupTimeout option"

Right. Self-explanatory 😄

Also: grafana/xk6-kubernetes#79
