xlab-uiuc / acto

Push-Button End-to-End Testing of Kubernetes Operators and Controllers

License: Apache License 2.0

Python 78.71% Makefile 0.03% Go 17.44% C 0.31% Shell 3.51%
kubernetes kubernetes-operator operator software-reliability system-reliability

acto's People

Contributors

312hzeng, chasing311, essoz, github-actions[bot], kashuncheng, kennn98, kevchentw, kevin85421, lhan0123, manvik-uiuc, markintoshz, marshtompsxd, mittal1787, pan-ziyue, qsdrqs, shreesha00, shuaiwang516, spedoske, srikarvanavasam, taham0, thrivikramanv, thuzxj, tianyin, twinisland, tylergu, tz-zzz, unw9527, whentojump, xinze-zheng, zyzuiuc


acto's Issues

Meeting summary 04/07

In today's meeting, we mainly discussed the following topics:

  • Test case generation
  • Input space pruning

Test case generation

Previously, we ran tests by randomly selecting a field and assigning a random value to it. This results in many redundant tests and alarms, and we don't know how well we have explored the input space.
To address this problem, I changed Acto to first generate a list of test cases and then execute them one by one.

Listing all the fields

The input is essentially a JSON instance, so we can view the input as a tree.
The object nodes and array nodes in the JSON are the inner nodes of the tree.
The basic types, e.g. Number, String, and Boolean, are the outer (leaf) nodes of the tree.

We consider both the outer nodes and the inner nodes as fields.
For outer nodes, this is straightforward: they have concrete values, and we want to test different values for them.
For inner nodes, their characteristics also affect the operator's behavior. For example, for an array field, arrays with 0, 1, or 3 items may trigger different operator behavior.
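
To make the tree view concrete, here is a minimal Python sketch (not Acto's actual implementation) that walks a CR-like dict and enumerates every field path, treating both inner nodes (objects/arrays) and outer nodes (basic values) as fields:

def list_fields(node, path=()):
    # Every node, inner or outer, is reported as a field path.
    fields = [path]
    if isinstance(node, dict):
        for key, child in node.items():
            fields += list_fields(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            fields += list_fields(child, path + (index,))
    return fields

# Example: a tiny CR-like input.
cr = {"spec": {"replicas": 2, "tolerations": [{"key": "dedicated"}]}}
for field_path in list_fields(cr):
    print(field_path)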

Generate test cases for each field

Once we have the list of all fields, we can generate test cases for each of them.
Acto uses heuristics to generate test cases for each of the fields depending on their type.

  • For Integer fields, we generate 1) increase and 2) decrease test cases. For example, the increase test case increments the existing value of the field.
  • For String fields, we generate 1) a change test case. It simply changes the existing value to a different one.
  • For Boolean fields, we generate 1) Toggle-on and 2) Toggle-off test cases. The Toggle-on test case changes the value from False to True.
  • For Array fields, we generate 1) Pop-item and 2) Push-item test cases. The Pop-item test case pops an item from the existing array value.
  • For Object fields, we simply generate a Delete test case that changes the existing value to null.

You may notice that these test cases require some preconditions to run. For example, to run the Pop-item test case of an array field, the field's current value needs to contain at least one item.
So each test case has three callbacks: precondition, mutator, and setup. The precondition callback checks whether the precondition of this test case is satisfied. The mutator changes the value to exercise the actual test case. If the precondition is not satisfied, we call the setup callback to satisfy it, so that we can exercise this test case next time.
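
As a minimal sketch (hypothetical names, not Acto's real classes), a test case can bundle the three callbacks, and the runner applies the mutator only when the precondition holds, otherwise it runs the setup to prepare for a later attempt:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TestCase:
    precondition: Callable[[Any], bool]  # is the test applicable to the current value?
    mutator: Callable[[Any], Any]        # produce the new value that exercises the test
    setup: Callable[[Any], Any]          # adjust the value so the precondition holds next time

# Pop-item test for an array field: requires at least one existing item.
array_pop = TestCase(
    precondition=lambda value: isinstance(value, list) and len(value) > 0,
    mutator=lambda value: value[:-1],
    setup=lambda value: (value or []) + [{}],  # push a placeholder item first
)

def run(test, current_value):
    if test.precondition(current_value):
        return test.mutator(current_value)  # exercise the test case
    return test.setup(current_value)        # satisfy the precondition for next time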

Input space pruning

Using the test case generation described above, we parsed 1332 fields for rabbitmq-operator and generated 2861 test cases in total. Each test case takes ~3 minutes to run on average, so exhausting all the test cases for rabbitmq would take 18 days on a single machine. We need to prune the input space.

  1. The first idea is that some fields in rabbitmq's input come from generic Kubernetes functionality, e.g. Affinity and Toleration. These fields are simply copied over to a template when the operator constructs the statefulSet spec. We can safely prune the child fields of Affinity because their test cases would not exercise any different logic in the operator. With this pruning technique, we can prune 128 fields.
  2. Another observation is that the field spec.override.statefulSet.spec has 1109 fields, which make up 85% of the test cases. However, spec.override.statefulSet.spec is only exercised in the following statement, where the variable podSpecOverride corresponds to the field spec.override.statefulSet.spec:
patch, err := json.Marshal(podSpecOverride)
patchedJSON, err := strategicpatch.StrategicMergePatch(originalPodSpec, patch, corev1.PodSpec{})

This means that we are using 85% of the test cases to test just one functionality in the program.
Through discussion, I think we can find which field corresponds to a single functionality in the program and calculate the cost of testing that functionality. If the cost is too high, we can prune the field.
For example, from the code shown above, we can learn that the field spec.override.statefulSet.spec corresponds to one functionality, since the operator only accesses the spec.override.statefulSet.spec level and never accesses any of its child fields. We can then calculate the number of test cases needed to test this functionality and decide whether it is worth the effort.

Getting an intuitive terminology instead of 'schema'

Goal: an easy-to-understand term that can be used for all of the following objects.
I want a general term for these different objects because I observe that they share the same structure (all of them describe how the data should be constructed). In our implementation, their classes inherit from the same parent class, and I need to properly name these classes and the parent class.

{
  "type": "object",
  "minProperties": 2,
  "properties": {
    "first_name": { "type": "string" },
    "last_name": { "type": "string" },
    "birthday": { "type": "string", "format": "date" },
  }
}
{
  "type": "array",
  "items": {
    "type": "number"
  },
  "minItems": 2
}
{
  "type": "number",
  "default": 20,
  "multipleOf" : 10
}

All of these objects are describing the organization of the data and guiding how the data should be constructed.
My initial proposal was to use the term schema. But during the meeting, the term schema seemed confusing to folks: it only makes sense for complex structures to have a schema, and it was confusing to call

{
  "type": "number",
  "default" : 20,
  "multipleOf" : 10
}

a schema, because this only describes a number.

@tianyin proposed to use constraint. I think it makes sense to call the items constraints, for example, in the first object, "type": "object", "minProperties": 2, "properties": ... can all be called constraints. Then the entire object could be called a ConstraintSet?

@marshtompsxd proposed to use property. I think property is a similar term to constraint. For example, "type": "array" is a property of this object. But I think it is a little weird to call the entire object a "property".

Build a basic prototype

@tylergu As we discussed after the meeting, I think the highest priority for now is to build a basic prototype that can work for a few operators (other than RabbitMQ). Without that, I feel some discussions are less effective due to the lack of understanding of the strawman and the data:

  • I can't tell whether certain problems are important or not. For example, there are a few discussions on the oracles (e.g., "what to do if we can't map a field from CR->service"). Are those really important problems, or are they imaginary? What are the examples that we failed to map?
  • I can't tell what the tradeoffs and benefits are (e.g., "what are the benefits of constructing internal objects over mutating YAML?").

Where is acto now?

The fact is that there is no "a" in acto right now:

  • The current acto does NOT have automatic input generation -- all the values were hardcoded by @tylergu based on reading documents. In other words, it can't even be applied to a second operator.

  • For the oracle, there are a lot of proposals including using a state machine proposed by @kevin85421 and leveraging idempotency of state transition proposed by @tylergu. Those are certainly nice to have, but even the simplest oracle (the state diff) does not work now (see #12).

What's the most important next step?

IMHO, it's always fun to chat about new ideas and potential completeness/soundness problems. But the most important thing is to build a solid prototype and experiment with it on multiple operators. Only by doing that can we gain the experience of what does not work, develop the understanding of what will work better, and identify opportunities to improve the testing technology. Otherwise, we will keep cycling on some very hard but less important problems. Oftentimes, those problems do not matter in practice.

What are the blockers?

Based on the meeting, I feel the following two things are blockers that we have to fix:

  • #12 to achieve automatic equivalence checks for state diffs
  • The way to generate many values to keep CPUs busy (I don't have a strong opinion on whether yaml is better than objects, and I'm much less interested in that debate than in a quick hack that is able to function, which can always be refined into a systematic solution).

I would highly suggest we focus on addressing these two issues and run stuff in an automatic fashion.

Initial plan

It is hard to come up with the list of operators we want to study, so we decided to study one operator (spark-operator) first. By studying the first operator, we hope to:

  • Gain experience in finding bug reports that are of interest to us, e.g. what keywords to use to query GitHub or JIRA
  • Have a better idea of what operators we are interested in
  • Have a better idea of how to profile each bug report, e.g. root cause, triggering

After studying one or two operators, we should be able to

  1. finalize our list of operators to study
  2. construct a pool of bug reports
  3. randomly sample from the pool and dig into each bug report

Action Plan before 04/08/2022

  • Prune the test case space
    • First manually prune
    • Then think about how to automate
  • Make oracle extensible
  • Inspect cass-operator's result

List of next steps

We discussed several interesting next steps to do at this stage:

Input space pruning 1 - Prune the fields that are simply copied over to a template

Cluster management systems like Kubernetes provide generic functionality for managing applications, for example Affinity and PersistentVolume. Operators enable users to manage their applications with just one application-specific input (the CR). In this input, they still allow users to specify the generic functionality provided by Kubernetes. In the operator's logic, these generic fields are simply handed over to Kubernetes. If we can identify such fields in the operator's input, we can prune their subfields.

For example, in the rabbitmq-operator's code, spec.Affinity is simply copied over to a field when creating the podTemplateSpec for statefulSet. In this case, we can prune all the children of the field spec.Affinity in rabbitmq's CR.

Input space pruning 2 - Prune the fields that are too expensive to test

We observe that rabbitmq-operator's input has 1323 fields in total. Of these, 1109 fields are under spec.override.statefulSet.spec, because this field contains the complete schema of the statefulSet.spec resource. But spec.override.statefulSet.spec is only used as a patch for a strategic merge patch on the existing statefulSet, as shown in the code below, where podSpecOverride corresponds to the spec.override.statefulSet.spec field:

patch, err := json.Marshal(podSpecOverride)
patchedJSON, err := strategicpatch.StrategicMergePatch(originalPodSpec, patch, corev1.PodSpec{})

It is too expensive to spend 99% of the test cases on testing this single functionality. We can use program analysis to identify the fields that the operator directly accesses, and then compute the cost of testing each such field; if a field is too expensive to test, we should prune it.

Cass-operator also has a field called spec.podSpecTemplate which has the entire schema of statefulSet's podTemplate. This spec.podSpecTemplate has ~1000 fields.
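
A rough sketch of this cost-based pruning rule (a sketch only; it assumes a field tree whose nodes have path/children attributes and a set of directly accessed paths obtained from program analysis):

def subtree_test_case_count(field, cases_per_field=2):
    # Number of test cases needed to cover a field and all of its descendants.
    return cases_per_field * (1 + sum(
        subtree_test_case_count(child, cases_per_field) for child in field.children))

def prune(field, directly_accessed_paths, budget):
    # Returns the set of field paths to keep.
    # If the operator only accesses field.path itself (never its children) and
    # testing the whole subtree exceeds the budget, keep the field but drop its children.
    if field.path in directly_accessed_paths and subtree_test_case_count(field) > budget:
        return {field.path}
    kept = {field.path}
    for child in field.children:
        kept |= prune(child, directly_accessed_paths, budget)
    return kept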

Reduce false alarms

Running several test cases

Currently, each of our test cases changes only one field. If we could run several test cases at the same time, we could greatly reduce the testing time.

There are two potential challenges:
1. There are dependencies among the fields
2. Changing multiple fields at a time could complicate the oracle

Run the rabbitmq-operator/cass-operator systematically

We need to run rabbitmq-operator/cass-operator with the new input generation. We can first run them by manually pruning the input space. The results will show us how many false alarms we have.

Test plan generation

  1. Generating good/bad values: As discussed two weeks ago, operators' inputs go through two levels of checks.
    1. The first level is server-side validation, which uses the schema and the validation webhook. If the input cannot pass this level, it gets rejected directly without reaching the operator code. Acto can recognize when an input is rejected by the server side, and tries its best to generate inputs that pass the first-level check.
    2. The second level is in the operator's logic. When the operator receives the input, it performs some sanity checks. The challenge here is that Acto cannot tell whether the input fails the second-level check, which causes some false alarms.
  2. Back and forth testing: The idea of back-and-forth testing is that, given the declarative nature of operators, if the same input is submitted to the operator, the deployed application should be exactly the same. During testing, we can revert to an input we submitted before and compare the two applications produced by the operator. If they are not the same, we have found a bug. A sketch of this check follows below.
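
A minimal sketch of the back-and-forth check (the submit_cr, collect_system_state, and canonicalize helpers are assumptions for illustration, not existing Acto APIs):

def back_and_forth_check(inputs, submit_cr, collect_system_state, canonicalize):
    # Under the level-triggering assumption, submitting the same input twice
    # must converge to the same system state; a mismatch indicates a bug.
    first_input = inputs[0]
    submit_cr(first_input)
    state_before = canonicalize(collect_system_state())

    for cr in inputs[1:]:          # walk through the other test inputs
        submit_cr(cr)

    submit_cr(first_input)         # revert back to the earlier input
    state_after = canonicalize(collect_system_state())
    return state_before == state_after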

Testing delete, and creating 2nd application

Currently we only test different inputs in one CR. It is also possible to test deleting the CR and recreating it. We can also test inputs in two CRs; for example, in rabbitmq's case, we would create two rabbitmq clusters.

Parallelize Acto.

Acto is based on Kind clusters, which use docker containers to virtualize clusters, so it is possible to have multiple Kubernetes clusters running different test cases on the same machine. I noticed that not all the cores are efficiently used while running Acto, so it might be beneficial to explore running Acto in a multi-cluster setting.
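
A rough sketch of what such a multi-cluster driver could look like (a sketch only; it assumes the kind CLI is installed and that a run_test_cases(kubeconfig, cases) entry point exists, which is not Acto's current API):

import subprocess
from multiprocessing import Pool

def run_worker(args):
    name, test_cases = args
    kubeconfig = f"/tmp/{name}.kubeconfig"
    # Each worker gets its own Kind cluster and kubeconfig.
    subprocess.run(["kind", "create", "cluster", "--name", name,
                    "--kubeconfig", kubeconfig], check=True)
    try:
        run_test_cases(kubeconfig, test_cases)  # assumed Acto entry point
    finally:
        subprocess.run(["kind", "delete", "cluster", "--name", name], check=True)

def run_parallel(all_test_cases, workers=4):
    # Split the test cases across workers, one Kind cluster per worker.
    chunks = [all_test_cases[i::workers] for i in range(workers)]
    jobs = [(f"acto-{i}", chunk) for i, chunk in enumerate(chunks)]
    with Pool(workers) as pool:
        pool.map(run_worker, jobs)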

Automate the CRD generation into the pipeline.

Some operators do not provide a complete CRD that fully reflects the input structure defined in the API types. In this case, we can use kubebuilder to automatically generate the CRD for the operator. We need to incorporate this option into Acto's pipeline.

Changing storageClassName in CR does not change the PVC resource

Describe the bug

I was trying to change the persistence/storageClassName field in my rabbitmq-cluster's CR, but changing persistence/storageClassName has no effect on the PVC used by the statefulSet.
The persistence/storageClassName field was initially not specified, so the operator used the default storage class "standard". Then I created a new storage class following the instructions here: https://github.com/rabbitmq/cluster-operator/blob/main/docs/examples/production-ready/ssd-gke.yaml, and changed persistence/storageClassName from null to ssd. This change failed silently.

To reproduce

Steps to reproduce this behavior:

  1. Deploy cluster-operator using the command:
    kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. Deploy rabbitmqCluster with the following yaml file:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: sample
spec:
  image: "rabbitmq:3.8.21-management"
  persistence:
    storage: 20Gi
  replicas: 2
  3. Create a new storage class:
    kubectl apply -f https://github.com/rabbitmq/cluster-operator/blob/main/docs/examples/production-ready/ssd-gke.yaml
  4. Change the rabbitmqCluster's persistence/storageClassName to ssd and apply:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: sample
spec:
  image: "rabbitmq:3.8.21-management"
  persistence:
    storage: 20Gi
    storageClassName: ssd
  replicas: 2
  5. Check the PVC used by the rabbitmq cluster; the storageClassName change is not applied. It is still "standard" instead of "ssd":
"spec": {
    "accessModes": [
        "ReadWriteOnce"
    ],
    "resources": {
        "requests": {
            "storage": "20Gi"
        }
    },
    "storageClassName": "standard",
    "volumeMode": "Filesystem",
    "volumeName": "pvc-43d66309-2090-4de3-bc82-9edcd0a69361"
}

Version and environment information

  • rabbitmq: rabbitmq:3.8.21-management
  • rabbitmq cluster operator: rabbitmqoperator/cluster-operator:1.10.0
  • Kubernetes: 1.21.1
  • Running on Kind cluster

Additional context

This bug occurs because the operator only reconciles the PVC's storage capacity, but does not reconcile the storageClassName here: https://github.com/rabbitmq/cluster-operator/blob/d657ffb516f948aaffd252794e3ed5e75e352d3d/controllers/reconcile_persistence.go#L15.

A possible fix is to create the desired storage type and migrate the data over, or to report an error message like the PVC scale-down case here: https://github.com/rabbitmq/cluster-operator/blob/d657ffb516f948aaffd252794e3ed5e75e352d3d/internal/scaling/scaling.go#L51

Action Plan before 02/25/2022

Before 02/18/2022:

  • Finish implementation of collecting the system state to also include the custom resource
    • DeepDiff between dict
  • Go through the noisy fields in the system state delta to see if there are some application-specific fields
    • Depending on the observation, figure out the solution
  • Implement the solution for the system convergence problem. Try listening to events first
  • Filter out benign error msgs

Before 02/25/2022:

  • System state delta oracle
    • canonicalization of field path
    • Compare diff
    • Examples of field match, but not captured
  • Input generation
    • For easy types, they can be automated easily
    • For string, solve common fields first
    • For application specific string, expose an interface

Exploration action plan

Focusing on rabbitmq-operator

  1. Understand the configuration that can be passed into the operator.
    • What kinds of configuration parameters does it have
    • How can the parameters be set
  2. Understand the tests for the operator
    • How do the existing tests test the operator
    • What framework do they use
    • What kind of oracles do they use?
  3. How can we find the rabbitmq #741 bug?
    • Can the existing test be modified to find this bug?
    • What input is needed to trigger this bug, and what oracle is needed to detect?

Action Plan before 03/25/2022

  • #43
  • Experiment with bad values in addition to good values #42
  • Write scripts to reduce effort of inspecting rabbitmq results
  • Normalize dictionary before state oracle

Action Plan before 01/28/2022

Before 01/23

  • Prepare slides for Redhat meeting
    • Present the three ideas that we are interested in
    • Prepare discussion questions

Before 01/28

  • Investigate testing techniques used in rabbitmq operator
  • Inspect more bugs in rabbitmq operator

The e2e test case matrix

Currently our e2e testing explores the space by doing a random walk: after each test, we randomly select a field and randomly assign a value to it. I want to propose a more systematic approach for exploring our input space.

Assumption: The operator implements level-triggering, that is, the operator only needs to observe the final input to drive the system to the desired state. Under this assumption, we get the following property: for any input x, there is only one correct system state corresponding to x.

The problem we want to solve: In the operator's reconciliation loop, it reads two inputs: 1) the current system state and 2) the cr.spec, which is the desired state. The operator's responsibility includes not only deploying the correct system from scratch (when the current system state is empty), but also driving the current system to the desired state when users reconfigure. We want to test the operator's ability to drive the system to the desired spec no matter what the current system state is.

Test case matrix:

(figure: the test case matrix, with entries of the form (system_state(X), Input Y))
We can use this test case matrix to represent all the test cases we want to test. We use the system_state(X) function to denote the correct system state we collect after we submit input X.

Running tests

Once we have this test case matrix, we can start running tests to fill it in.
For example, if we first run the test trial A->B->C, we fill in (system_state(A), Input B) and (system_state(B), Input C) in the matrix.

What is the oracle

When running the trial A->B->C, we collect the system states after each test and save them as trial_1{A,B,C}.
Then we run another trial B->A->C; in this trial we test (system_state(B), A) and (system_state(A), C) in the matrix, and collect the system states after each test as trial_2{A,B,C}. We can then compare the system states collected in trial_2 with the ones collected in trial_1. Since we assume the operator implements level-triggering, these states should agree.
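
A small sketch of this cross-trial check (assuming each trial's results are stored as a dict mapping an input's name to the canonicalized system state collected after submitting it):

def cross_trial_oracle(trial_1, trial_2):
    # Under level-triggering, system_state(X) must not depend on the order in
    # which inputs were applied, so the states for a shared input must match.
    mismatches = []
    for input_name in trial_1.keys() & trial_2.keys():
        if trial_1[input_name] != trial_2[input_name]:
            mismatches.append(input_name)
    return mismatches  # non-empty -> a potential bug (or a noisy field)

# Example: the trials A->B->C and B->A->C share the inputs A, B, and C.
# mismatches = cross_trial_oracle(states_from_trial_1, states_from_trial_2)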

Efficiency concern

If we want to test all the test cases in this matrix, the complexity is O(n^2), where n is the number of different inputs.
It is impossible for us to run all the test cases, but we can do test prioritization and assign weights to each edge.

After all, this test case matrix is just a representation. We can still do a random walk here and only run a certain number of tests, but we will be able to avoid redundant tests and have an additional oracle to use.

Action Plan before 04/15/2022

  • Write an algorithm to express how we want to do the analysis (separate the program analysis into two problems: mapping and analysis)
    • Mapping is the mapping between a field in the input and a variable in the program
    • Analysis is the algorithm for pruning fields, assuming we have the mapping

A note on Friday's discussion on value generation

We discussed the different information sources we can leverage for value generation.

I know @Yicheng-Lu-llll and @kevin85421 perhaps have some black magic in mind. But, in this issue, let me write down the generation based on value constraints (e.g., data type, data range, semantic types such as image and filepath, etc).

To be able to generate input CR diff, there are two problems to solve:

  1. The CR structure/definition
  2. The value

Structure

Structure is typically a simpler problem to solve, because what we need is a definition of the structure. The API definition, described in #15, has a complete program definition of the structure.

The CRD could also have the definition, but @tylergu finds that in some projects, the CRD is a partial definition, rather than a complete one.

There is a debate on whether to use CRD or API definition (which I address below). But, no matter whether we start from CRD or from API definition, the problem is straightforward.

Value

The essential problem for generating different good and bad values is to learn the constraints of the value. For an operator, there are many different information sources we can aggregate and leverage, including:

  • Document (currently, in acto, the values are "hand-coded" based on the docs; see #13)
  • CRD
  • API definition
  • Source code (source code perhaps provides the most precise and complete information)

If we look at prior research papers, prior work leverages all the above information.

  • Document -- you can find work that uses NLP to learn the constraints, e.g., ConfSeer and PracExtractor
  • CRD and API definition -- this is the most straightforward thing to do; Marcel van Lohuizen told me they even did it in GCL.
  • Source code -- there are A LOT OF papers on learning rules from source code, e.g., our own work Spex

The ideas are all there, the main question is how to apply them to build a practical system.

Tradeoff

Each information source has different tradeoffs. One needs much more expensive analysis than the other.

The practice is always to start from the simplest to the hardest, so that we can always understand the cost-vs-benefits -- a great question asked by @wangchen615 during the meeting -- "what do you gain by using API definitions over CRD?"

@tylergu later provided examples of why API definitions are likely more complete than CRDs. On the other hand, he also agrees that the CRD could have different information.

It is clear that CRD is much cheaper to use (it's an independent YAML file) than API definitions (Go source code which needs constructors).

So, the agreement is to start from the CRD and then (or meanwhile) investigate how to use the API definition.

This could help answer @wangchen615's question about what additional benefit the API definition brings over the CRD.

Also, given the highest priority being #13 , a CRD-based input generation can lead to a quick prototype to make the CPUs busy.

Input generation proposal

The general idea of the input generation is to generate items recursively. The root of our input generation is the CR.spec object.

Then we will have a generic generate() function. Example:

import random

def generate(schema):
    # Leaf (outer) fields: return a concrete random value.
    if schema == int:
        return random.randint(0, 10)
    if schema == bool:
        return random.choice([True, False])
    # Composite (inner) fields: schema.fields maps child names to child schemas.
    ret = {}
    for name, child_schema in schema.fields.items():
        ret[name] = generate(child_schema)
    return ret

Consider the following API definition (Go types):

type RabbitmqClusterSpec struct {
	Replicas *int32 `json:"replicas,omitempty"`
	Image string `json:"image,omitempty"`
	Service RabbitmqClusterServiceSpec `json:"service,omitempty"`
}

type RabbitmqClusterServiceSpec struct {
	Type corev1.ServiceType `json:"type,omitempty"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

Acto will call generate(RabbitmqClusterSpec), which will generate an integer and a string, and call generate(RabbitmqClusterServiceSpec), which in turn generates a corev1.ServiceType and a map of string to string.

Some fields have constraints among their children; we can override the generate function for these fields to express those constraints, as in the sketch below.
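
For example (a hypothetical override, just to illustrate the idea), a resources-like field whose children must satisfy a cross-field constraint can ship its own generator instead of the generic one:

import random

def generate_resources():
    # Custom generator for a 'resources' field: requests must not exceed limits.
    limit_cpu = random.choice([1, 2, 4])
    request_cpu = random.randint(1, limit_cpu)  # enforce requests <= limits
    return {
        "limits": {"cpu": limit_cpu},
        "requests": {"cpu": request_cpu},
    }

# Map field paths to custom generators; all other fields fall back to generate().
OVERRIDES = {("spec", "resources"): generate_resources}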

The input generation is going to be a non-trivial amount of work. We want to learn the structure of the input CR and then generate the inputs structurally. Even after we have the framework to generate inputs structurally, we still won't be able to generate inputs fully automatically without some human guidance. I don't think we will ever be able to fully automate input generation; our goal should be to reduce the human effort for input generation as much as possible.

Deciding between CRD and API definition

There are two options to use as the guidance for learning the input structure: CRD or API definition. Here are my thoughts on the trade-offs:

  1. Using the CRD relies on a complete CRD translation from the API definition. If the CRD is automatically generated using some framework, then the information we can get from the CRD is almost the same as from the API definition; basically, the CRD will contain the entire structure of the input. But there are cases where the CRD is handwritten by developers, and the developers do not fully describe the CRD according to the API definition (see the operators written by Percona: https://github.com/percona/percona-xtradb-cluster-operator/blob/7f2d8575a3fc8de00019bdbb090bc543831b7ae9/deploy/crd.yaml#L24). In the xtradb-cluster-operator's case, we have zero information about how to generate the spec, because the spec is simply described as an object.
    • If we use CRD as the guidance, the information is possibly incomplete.
    • If we use API definition, we are guaranteed to have the complete information about the input structure.
  2. Using the API definition can be hard, because generating Go structs could be an engineering-heavy task. But I think there is definitely a way to do this. In fact, generating objects in this way is a very basic technique used for spec-based input generation in software engineering: TestEra, Korat.
    • If we use CRD as the guidance, we need to develop a way to parse the CRD and understand its structure.
    • If we use the API definition as the guidance, some engineering work needs to be done to parse the Go struct types. The same kind of work has been done for Java in TestEra and Korat.
  3. When we are testing several operators, different operators have different input structures, but they usually share some common fields, for example the corev1.* structs provided by Kubernetes. If we can recognize the common fields, we can reuse generation rules for them.
    • If we use the CRD as the guidance, we can infer the fields from their children and their names; if two fields have the exact same children and name, they are very likely to be the same type.
    • If we use the API definition as the guidance, we know exactly what type is that field.

Disabling affinity rule in override.statefulset does not get applied to statefulSet

Describe the bug

I was trying to patch /spec/override/statefulSet/spec/template/spec/affinity to null to delete the affinity rule that was previously specified for the rabbitmqCluster under /spec/affinity. It seems this null value does not get propagated into the Go value when unmarshalling, so the affinity rule deletion is not applied to the statefulSet.

To reproduce

Steps to reproduce this behavior:

  1. Deploy cluster-operator using the command:
    kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. Deploy rabbitmqCluster with the following yaml file:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: sample
spec:
  image: "rabbitmq:3.8.21-management"
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - sample
          topologyKey: kubernetes.io/hostname
  persistence:
    storage: 20Gi
  replicas: 2
  3. Patch the rabbitmqCluster using the patch file with the command: kubectl patch rabbitmqCluster sample --type merge --patch-file patch.yaml

patch.yaml:

spec:
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            affinity: null
  4. Check the rabbitmqCluster spec; the affinity in override is omitted:
"override": {
    "statefulSet": {
        "spec": {
            "template": {
                "spec": {
                    "containers": []
                }
            }
        }
    }
}

and in the statefulSet spec, the affinity rule is not deleted.

The root cause is that override.statefulset.spec.template is first unmarshalled into *corev1.PodSpec, and then the operator marshals it back to JSON format so that it can apply a JSON strategic merge. However, the affinity field is omitted due to the omitempty rule in the corev1 API, so the patch no longer has affinity: null in it. The essential point missed here is that, for a JSON patch P, marshal(unmarshal(P)) does not necessarily equal P.
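
To illustrate the marshal(unmarshal(P)) pitfall in a language-neutral way (this is a Python analogue, not the operator's Go code; Go's omitempty behavior is simulated by dropping unset/None fields on re-serialization):

import json

def unmarshal(patch_json, known_fields):
    # Like unmarshalling into a struct: a JSON null and an absent key both
    # end up as None, so the distinction between them is lost.
    data = json.loads(patch_json)
    return {field: data.get(field) for field in known_fields}

def marshal(obj):
    # Simulates omitempty: unset/None fields are dropped from the output.
    return json.dumps({k: v for k, v in obj.items() if v is not None})

patch = '{"containers": [], "affinity": null}'
print(marshal(unmarshal(patch, ["containers", "affinity"])))
# {"containers": []} -- the explicit "affinity": null is gone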

Version and environment information

  • rabbitmq: rabbitmq:3.8.21-management
  • rabbitmq cluster operator: rabbitmqoperator/cluster-operator:1.10.0
  • Kubernetes: 1.21.1
  • Running on Kind cluster

Additional context

This is very similar to an issue reported before (rabbitmq/cluster-operator#741), but the fix was specific to securityContext. This problem needs a systematic fix (e.g. marshalling the override field into raw JSON format), because there are dozens of other fields in the corev1 API that could cause the same issue.

Action Plan before 02/11/2022

  1. Check result from kubectl.
    • Fix the invalid CR problem. Revert to previous CR if there is an error. - Top Priority
  2. Record delta(CR)-> delta(state)
    • Find all the k8s objects created by the application
      • Check all the object under that namespace
      • Check ownerReference
    • Develop a way to compare the states
      • Use unique identifier: name to know which two objects are the same object
  3. Prepare slides for IBM meeting - before 02/04

Things learnt from Spark-on-k8s-operator

I haven't finished Spark-on-k8s-operator, but here is a running list of things I have learnt from it so far:

  • This operator is very different from the operators I saw in Sieve. It does not manage a SparkCluster; instead, it manages individual Spark applications. Users submit SparkApplications to the operator, and the operator starts the driver pods and executors for the application.
  • The GitHub issue tag is not a very good filter for finding bug reports in this repository, as the developers did not carefully tag the issues in the first place. My experience is that being referenced by a pull request gives an issue a higher chance of being a bug. If the issue is a bug, the pull request message usually contains the word fix.
  • Some issues are actually bugs but have no linked pull request for the fix. The maintainer simply replies "Fixed in version XXX".
  • There are issues that are failures, but the fault is not in the code; instead it is in the user's behavior, for example an incorrect configuration.

Weird behavior when getting the custom resource object from k8s server

We get the custom resource object from the k8s server after each test to compute the delta. When I was inspecting the deltas, I noticed some strange changes:

This is the initial CR yaml we submit:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: hello-world
spec:
  resources:
    requests:
      cpu: 1
      memory: 4Gi
    limits:
      cpu: 1
      memory: 4Gi
  tls:
    caSecretName: null
    disableNonTLSListeners: false
    secretName: null
  skipPostDeploySteps: false
  tolerations: null

After submitting this CR yaml, the custom resource object we get from the k8s server is:

"resources": {
    "limits": {
        "cpu": "1",
        "memory": "4Gi"
    },
    "requests": {
        "cpu": "1",
        "memory": "4Gi"
    }
},
"secretBackend": {},
"service": {
    "type": "ClusterIP"
},
"terminationGracePeriodSeconds": 604800,
"tls": {}

Note that the cpu fields under resources/limits and resources/requests are strings, and tls is empty even though we specified disableNonTLSListeners: false in our CR yaml. The response also omits skipPostDeploySteps, even though our CR yaml input has skipPostDeploySteps: false.

Then, if we change the resources/limits/cpu to 2, and submit the changed CR yaml, the custom resource object we get from k8s server will be:

"resources": {
    "limits": {
        "cpu": 2,
        "memory": "4Gi"
    },
    "requests": {
        "cpu": 1,
        "memory": "4Gi"
    }
},
"secretBackend": {},
"service": {
    "type": "ClusterIP"
},
"skipPostDeploySteps": false,
"terminationGracePeriodSeconds": 604800,
"tls": {
    "disableNonTLSListeners": false
}

Note that the cpu fields under resources/limits and resources/requests are integers now, and it now reflects disableNonTLSListeners: false and "skipPostDeploySteps": false.

It's hard for me to figure out why this happens, and it currently interferes with our system state delta.

One guess I have is that false is probably the omitempty value for bool variables, so these fields are omitted when the server dumps the CR spec. When we change the resources/limits/cpu field, it somehow triggers the server to dump the CR spec in a different way.

[Acto-1] [Bug Fix] candidates.yaml

I ran Acto with the following command:

python3 acto.py --candidates data/rabbitmq-operator/candidates.yaml --seed data/rabbitmq-operator/cr.yaml --operator data/rabbitmq-operator/operator.yaml --duration 1

The python script generates new YAML files periodically (mutate-0.yaml, mutate-1.yaml, ...). However, I did not see any CR in my "rabbitmq-system" namespace.

kubectl get RabbitmqCluster -n rabbitmq-system

Hence, I applied all 8 generated YAML files manually (mutate-0.yaml ~ mutate-7.yaml). None of them could be deployed successfully, as shown in the following figures. Kubernetes tells us the problems are located in candidates.yaml, due to type errors.

(two screenshots: kubectl reporting type errors located in candidates.yaml)

Initial test results

I ran Acto for 3 hours. It ran 65 tests and produced 25 alarms.

All of the alarms are from our system state oracle.

For 19 out of the 25 alarms, Acto didn't find any matching field in the system state deltas for the input delta.
For 6 out of the 25 alarms, Acto found some matching fields, but the value change is different.


1 true alarm

See here: #39

18 false alarms caused by no matching field in system state deltas

3 are caused by changing a complex object: when we change a complex object, the changes are reflected at a lower level of the system state.
Concretely, consider the following example, where we changed secretBackend from null to a new object.

"root['spec']['secretBackend']": {
        "prev": null,
        "curr": {
            "vault": {
                    "annotations": {
                        "key": "random"
                    }
            }
        }
}

Then we have the following system state delta:

"root['test-cluster-server']['spec']['template']['metadata']['annotations']['key']": {
    "prev": null,
    "curr": "random"
}
...

Acto tries to find a matching field based on the input delta's path ['spec']['secretBackend'], but the system state delta is at a lower level.
In the system state delta, the path is ...['annotations']['key']. To match these two fields, we need to flatten the dict in the input delta before field matching.
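
A small sketch of that flattening step: expand the nested value in the input delta into individual (path, value) pairs before matching.

def flatten(value, prefix=()):
    # Flatten a nested dict into {path tuple: leaf value}.
    if not isinstance(value, dict):
        return {prefix: value}
    flat = {}
    for key, child in value.items():
        flat.update(flatten(child, prefix + (key,)))
    return flat

input_delta_curr = {"vault": {"annotations": {"key": "random"}}}
print(flatten(input_delta_curr, ("spec", "secretBackend")))
# {('spec', 'secretBackend', 'vault', 'annotations', 'key'): 'random'}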

10 are caused by performing a change that is rejected by the operator: scaling down and shrinking a volume, key-value delimiter not found.

1 is caused by changing from a default value to null (this is effectively no change, but we were not aware of the default value). We need to be aware of default values.

1 is caused by a field that does not affect the application's state (configuration of the operator itself).

2 are caused by a bug in our input generation.

1 needs further inspection.


6 false alarms caused by value mismatch

3 are caused by a lack of canonicalization when comparing dictionaries: easy to fix by canonicalizing field names when comparing dicts.

Canonicalization is needed when comparing dictionaries:
requiredDuringSchedulingIgnoredDuringExecution != required_during_scheduling_ignored_during_execution

"root['spec']['affinity']['podAntiAffinity']": {
        "prev": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {
                                "matchExpressions":...
                        }
                    }
            ]
        },
        "curr": null
}
"root['test-cluster-server-0']['spec']['affinity']['pod_anti_affinity']": {
    "prev": {
            "required_during_scheduling_ignored_during_execution": [
                {
                        "label_selector": {
                            "match_expressions":...
                        }
                }
            ]
    },
    "curr": null
}

2 are caused by comparing null to a default value: we need to be aware of default values when comparing with null.

"root['spec']['image']": {
      "prev": null,
      "curr": "random"
}

resulted in:

"root['test-cluster-server']['spec']['template']['spec']['containers'][0]['image']": {
      "prev": "rabbitmq:3.8.21-management",
      "curr": "random"
}

1 is caused by 0 != '0': easy to fix

Modifying Service's annotations in CR does not delete removed annotations

Describe the bug

I was trying to modify the service's annotations via /spec/service/annotations. After changing a key-value pair in annotations from key1: value1 to key2: value2, I noticed that key2: value2 is added under the Service's metadata/annotations correctly, but key1: value1 is still present.

To Reproduce

Steps to reproduce this behavior:

  1. Deploy cluster-operator using the command:
    kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
  2. Deploy rabbitmqCluster with the following yaml file:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: sample
spec:
  image: "rabbitmq:3.8.21-management"
  persistence:
    storage: 20Gi
  replicas: 2
  service:
    annotations:
      key1: value1
    type: ClusterIP
  3. Use kubectl apply to apply the following changed yaml file. Note that /spec/service/annotations is changed from key1: value1 to key2: value2:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: sample
spec:
  image: "rabbitmq:3.8.21-management"
  persistence:
    storage: 20Gi
  replicas: 2
  service:
    annotations:
      key2: value2
    type: ClusterIP
  4. Use kubectl get services sample -o yaml to get the state of the deployed service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    key1: value1
    key2: value2
  creationTimestamp: "2022-02-15T23:40:25Z"
  labels:
    app.kubernetes.io/component: rabbitmq
    app.kubernetes.io/name: sample
    app.kubernetes.io/part-of: rabbitmq
...

Note that under the /metadata/annotations field, both key1: value1 and key2: value2 are present.

Expected behavior
After applying the change, the Service should not have the key1: value1 annotation.

Version and environment information

  • RabbitMQ: rabbitmq:3.8.21-management
  • RabbitMQ Cluster Operator: rabbitmqoperator/cluster-operator:1.10.0
  • Kubernetes: 1.21.1
  • Cloud provider or hardware configuration: Kind

Meeting summary 02/10/2022

Here are the main points we discussed during our meeting:

  1. We built a working end-to-end testing pipeline
    1. Keep changing the CR and submit to k8s
    2. Collect system states - fetch k8s objects and dump into dict
    3. Check the operator log for error msg - search for keyword ‘error’
    4. Using Kind to mock testing environment
  2. Input generation automation
    1. Currently we predefine a bunch of valid values for some interesting fields. This is a manual process, but we want to automate this.
    2. The CRD specifies the types of each field, and sometimes specifies some constraints on the field (e.g. min:0). We can possibly use this to guide the input generation.
    3. The api that the operator provides (types.go). This defines the exact object each field is.
    4. We don’t need to generate completely valid inputs. It’s interesting to see if the operator can handle the invalid inputs too.
    5. Future problem: One operator may have multiple CRDs, there could be dependencies among the fields across the CRDs.
  3. How to determine when the system converges
    1. Problem: After submitting the input CR, the system needs some time to do the operations. We need some indication that the system converged, and then we check the system state and error log.
    2. Solution # 1: check for any state changes. If the state does not change for some period of time, then we determine that the system converged
    3. Solution # 2: Listen to all events in the test namespace. If no new events occur for some period of time, we determine the system converges. We can even use the events as an oracle.
  4. Preliminary result
    1. Ran the tests for 6 hours, 105 tests. Only used the error log as the oracle.
    2. Got 32 errors. Errors are in two categories:
      1. “error… shrinking persistent volumes is not supported”
      2. “error… the object has been modified; please apply your changes to the latest version and try again…”
    3. Problem: Errors could be benign
    4. Solution: Filter out the benign error messages. For the second error message, it’s from k8s server, and it’s a finite set of these error messages.
  5. Oracle
    1. Currently we only check the error log.
    2. WIP: We also want to collect the system state(pods, statefulsets, deployments, services, configmaps, custom resources, etc).
    3. By collecting the system states, we want to construct pairs: (delta(input), delta(system state)). Ultimately, we want to infer the delta(system state) from the delta(input).
    4. Problem: delta(system state) is very noisy. It contains a lot of nondeterministic fields.
    5. Action: Go through the noisy fields. Check if there are application-specific noisy fields. Noisy fields like timestamp and resourceVersion are easy to handle.
    6. Possible solution: Deploy the operator and submit the CR twice. Check which fields differ between the two runs; these fields are very likely to be nondeterministic.

After the discussion, here are the action items:

  • Finish implementation of collecting the system state to also include the custom resource.
  • Go through the noisy fields in the system state delta to see if there are some application-specific fields.
    • Depending on the further observation on the noisy fields, need to discuss what is the best solution for it
  • Implement the solution for the system convergence problem. Try solution # 2 first, since the solution # 1 requires the solution for system state noise.
  • Implement the fix for benign error messages: use a filter to ignore these errors

On novelty

I know @tylergu has been thinking about @wangchen615's question on the novelty of the project since last Thursday. I love the question, which really pushes the students to think harder and deeper. On the other hand, let me spend some time clarifying novelty in the context of systems research. Oftentimes, the "novelty" argument is confused or even abused.

Vijay Chidambaram (whom many of you like and worked with during SOSP'21) wrote a great summary of novelty
https://twitter.com/vj_chidambaram/status/1395086227204939780
with which I can't agree more. Let me quote the points:

  • Sometimes novelty lies in an interesting solution, sometimes novelty lies in an interesting new problem.
  • There is a lot of value and contribution in the careful synthesis of known techniques to tackle a new problem. This is basically how all the systems research I know has been done.

So, as we have all laughed at some bad examples I show in Siebel 3111 this afternoon, it is ridiculous to use a microscope to look at one piece of a large system and say, "hey, that piece is not novel." Literally, if you do that, you will find no systems research is novel.

Novelty of The Problem

We have an extremely novel problem -- AFAIK this is among the first work addressing reliability of operation programs (i.e., operators) of large-scale infrastructures. I have been wanting to do such a project for a long time since I was working at FB and helped build two of their DR operators (Taiji and Maelstrom). But, I didn't find a good way. Apart from the overhead of engaging with 5 teams to look at their operator code, academically it is hard to be generic by looking at one company's 5 operators. A bigger blocker was that FB didn't have a unified control-plane framework/API (at least when I was there), so it's easier to study the operator code and bugs, but hard to build something generally applicable and highly impactful.

The Kubernetes operators provide a golden opportunity, as we have all seen and been excited about.

We already have a beautiful story to tell, which we almost wrote for Sieve. We later went on a different route when writing the Sieve paper (which is also successful). We are going to tell the story for Acto this time!

In summary, if done successfully, Acto will be the first research addressing reliability of operation programs (operators) of large-scale cloud systems and will be the first fully automatic testing tool for unmodified Kubernetes controllers.

Novelty of The Solution

I have made the point of the novelty in synthesis in the preamble. Now let me clarify a few things.

Novelty of Input Generation

I hope to clarify that our current input generation based on CR definitions is NOT novel.

I discussed input generation in #16. One point I made is that there are many information sources where we can extract semantics from and each requires different techniques with different tradeoffs.

A more important point I made is that we have to build Rome brick by brick. It's hard to think about the dome without building the groundwork. That's the reason I have been pushing the team to first build the very basic input generation, and it turns out that even the basic version is nontrivial. Once we have the basic version that can correctly generate the structure and some constraints, let's experiment with it to understand how we can do better using more novel techniques. In this way, we can justify our novelty as more than a Rococo decoration whose sole purpose is fooling junior reviewers.

In other words, in my own experience, novelty comes from deep understanding and careful evaluation; novelty without understanding and evaluation is likely to be useless.

And, I have never run into problems coming up with novelty -- almost all the failures in the past were caused by the team not being able to deliver the understanding.

Novelty of Oracles

The oracle part of the project is inherently novel. I believe I shared with you my fascination with intent-driven networking, such as the Robotron work from FB and intentionnet from the network verification folks. And, it's incredibly exciting to explore whether it's ever possible to apply the same principle to software programs. If there is a way to do that, the best bet is Kubernetes, due to its declarative design. It would be a dream if we can exploit the declarative nature and build automated "state-centric" oracles. I hope you see the significance!

Certainly, "declarative design" does not make the problem anything easy as shown by the many practical problems we have already encountered (noises, canonicalization, and reasoning about high-order semantics).

Note that I have been pushing the team to focus more on the input generation than on the oracle, not at all because the oracle problem has been solved, but because the effectiveness of the oracles needs to be understood with large-scale test results driven by the generated inputs.

F.A.Q.

1. Will Acto be scooped by Sieve, because Sieve is the first work on Kubernetes operators?

Not at all! First, the two papers have drastically different perspectives. Sieve focuses on controllers from a cluster manager's perspective and Acto focuses on operators from an operation perspective. Sieve reasons about behavior under faults, and Acto checks correctness and expectation without faults.

Certainly, there are great synergies between Acto and Sieve -- from a very high level, Acto can generate workloads for Sieve which is written manually now.

2. What conferences can we target?

Any systems conference, such as OSDI/SOSP, ASPLOS, EuroSys, USENIX ATC, etc., depending on the quality of the work and the solidness of the evaluation. Timeline-wise, ASPLOS looks like a good one.

3. You are talking about operator code -- but Kubernetes operators << operators!

Indeed, I'm talking about operators at large. But, shouldn't a moonshot be done step by step?

IMO, if we can build a solution for all Kubernetes operators, it is already a biiiiig deal and highly impactful! @wangchen615 can confirm that :)

And, after we know how to build an effective solution for all Kubernetes operators, we can certainly think about more unbounded operators (for which we may need to connect with companies like FB that write proprietary operators).

4. Coming from a SE background, I don't see novelty of SE techniques -- Darko has done a number of mutation testing research and has the Korat tool for generating objects (even Alloy has been used for many systems).

Let's not submit to SE conferences then // joking

More seriously, I believe Darko will be very happy if some of the ideas of Korat or his mutation testing work can find their souls in our work, though I'm skeptical. I hope we view research, at least systems research, as a process of building on the shoulders of giants to solve new, important problems, rather than denying prior work and calculating credits.

Zookeeper-operator does not restart pods when static configs are changed

Description

When I change the fields under spec.config, zookeeper-operator does not issue a rolling update to reflect the changed config. For example, if I change the spec.config.commitLogCount field to 100, the operator reconciles the configMap to reflect the change, so in the pod, /conf/zoo.cfg (where the configMap is mounted) has commitLogCount set to 100. But /data/conf/zoo.cfg, which is the config actually used by zookeeper, still has commitLogCount set to the default value of 500.

Steps to reproduce:

  1. Deploy a simple zookeeper cluster with the following yaml:
apiVersion: "zookeeper.pravega.io/v1beta1"
kind: "ZookeeperCluster"
metadata:
  name: "zookeeper"
spec:
  replicas: 3
  2. Change the zookeeper yaml to configure commitLogCount to 100:
apiVersion: "zookeeper.pravega.io/v1beta1"
kind: "ZookeeperCluster"
metadata:
  name: "zookeeper"
spec:
  replicas: 3
  config:
    commitLogCount: 100
  3. Observe that the pods didn't restart, then go into a pod and check the config used by zookeeper in /data/conf/zoo.cfg:
metricsProvider.exportJvmInfo=true
dataDir=/data
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
syncLimit=2
commitLogCount=500
metricsProvider.httpPort=7000
snapSizeLimitInKb=4194304
standaloneEnabled=false
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
initLimit=10
minSessionTimeout=4000
snapCount=10000
admin.serverPort=8080
autopurge.purgeInterval=1
maxSessionTimeout=40000
maxCnxns=0
globalOutstandingLimit=1000
reconfigEnabled=true
skipACL=yes
autopurge.snapRetainCount=3
tickTime=2000
quorumListenOnAllIPs=false
preAllocSize=65536
maxClientCnxns=60
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.100000010
root@zookeeper-0:/data/conf# cat zoo.cfg | grep con
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
reconfigEnabled=true
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.100000010

Importance

should-have

Location

Zookeeper-operator is missing the functionality to restart the pods when config is changed.

Suggestions for an improvement

We suggest attaching a hash of the config as an annotation on the zookeeper statefulSet's pod template, so that when the config changes, the changed annotation triggers the statefulSet's rolling update.
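
A small sketch of this suggestion (illustrative only, not the zookeeper-operator's Go code; the annotation key is hypothetical): hash the rendered config deterministically and attach it to the pod template, so any config change alters the template and forces a rolling update.

import hashlib
import json

def config_hash(config):
    # Deterministic hash of the config (sorted keys -> stable serialization).
    serialized = json.dumps(config, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()

def annotate_pod_template(statefulset, config):
    annotations = (statefulset.setdefault("spec", {})
                              .setdefault("template", {})
                              .setdefault("metadata", {})
                              .setdefault("annotations", {}))
    # Any change to the config changes this annotation, which changes the pod
    # template and therefore triggers the statefulSet rolling update.
    annotations["zookeeper.config/hash"] = config_hash(config)
    return statefulset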

Update acto_helm for new error messages

From Tyler

"I recently made a change to include more information when returning errors: b68a87c
This change may need you to change some code in the acto_helm.py like I did to acto.py in the commit."

[Bug Report] (cass-operator) The newly spawned pods do not use the up-to-date server-config-init under certain conditions

@kevchentw and I found a weird behavior of cass-operator. We recognized the behavior as a bug after discussing it with @tylergu.

What did you do?

As mentioned in k8ssandra/cass-operator#150, the resource configuration of server-config-init can be set via configBuilderResources. We realized that, under certain conditions, the new pods do not necessarily use the most up-to-date resource configuration. For example, as shown in the steps below, when we scale up the cluster and change the resource configuration at the same time, the newly spawned pod does not use the updated configBuilderResources.

We also found that even if we explicitly separate the configBuilderResources change and scale-up change into two steps, if there are pods that are not ready yet during the two operations, the statefulSet won't get updated immediately and the newly spawned pods still use the old configBuilderResources.

# Step1: Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.7.1/cert-manager.yaml

# Step2: Install operator
kubectl apply -f init.yaml
kubectl apply --force-conflicts --server-side -k 'github.com/k8ssandra/cass-operator/config/deployments/cluster?ref=v1.10.3'

# Step3: Apply custom resource
kubectl apply -f cr1_spec.yaml

# Step4: Check CR for the "Config Builder Resources" field => The field is the same as cr1_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter

#  Config Builder Resources:
#    Requests:
#      Cpu:     512m
#      Memory:  100m

# Step5: Check statefulset for the resource request config of server-config-init => Same as cr1_spec.yaml
kubectl describe statefulsets.apps cluster1-cassandra-datacenter-default-sts

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m

# Step6: Update cassandra-datacenter
kubectl apply -f cr2_spec.yaml

# Step7: Check CR for the "Config Builder Resources" field => The field is the same as cr2_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter

#  Config Builder Resources:
#    Requests:
#      Cpu:     1024m
#      Memory:  200m

# Step8: Check Pods for the resource request config of server-config-init => Not the same as cr2_spec.yaml
kubectl describe pod cluster1-cassandra-datacenter-default-sts-0

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m

kubectl describe pod cluster1-cassandra-datacenter-default-sts-1

#   server-config-init:
#    Image:      datastax/cass-config-builder:1.0.4-ubi7
#    Port:       <none>
#    Host Port:  <none>
#    Requests:
#      cpu:     512m
#      memory:  100m
  • init.yaml
 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   # Changing the name to server-storage is the only change we have made compared to upstream
   name: server-storage
 provisioner: rancher.io/local-path
 volumeBindingMode: WaitForFirstConsumer
 reclaimPolicy: Delete
  • cr1_spec.yaml
 apiVersion: cassandra.datastax.com/v1beta1
 kind: CassandraDatacenter
 metadata:
   name: cassandra-datacenter
 spec:
   clusterName: cluster1
   config:
     cassandra-yaml:
       authenticator: org.apache.cassandra.auth.PasswordAuthenticator
       authorizer: org.apache.cassandra.auth.CassandraAuthorizer
       role_manager: org.apache.cassandra.auth.CassandraRoleManager
     jvm-options:
       initial_heap_size: 800M
       max_heap_size: 800M
   configBuilderResources:
     requests:
       memory: 100m
       cpu: 512m
   managementApiAuth:
     insecure: {}
   serverType: cassandra
   serverVersion: 3.11.7
   size: 1
   storageConfig:
     cassandraDataVolumeClaimSpec:
       accessModes:
       - ReadWriteOnce
       resources:
         requests:
           storage: 3Gi
       storageClassName: server-storage
  • cr2_spec.yaml (Update size and configBuilderResources)
 apiVersion: cassandra.datastax.com/v1beta1
 kind: CassandraDatacenter
 metadata:
   name: cassandra-datacenter
 spec:
   clusterName: cluster1
   config:
     cassandra-yaml:
       authenticator: org.apache.cassandra.auth.PasswordAuthenticator
       authorizer: org.apache.cassandra.auth.CassandraAuthorizer
       role_manager: org.apache.cassandra.auth.CassandraRoleManager
     jvm-options:
       initial_heap_size: 800M
       max_heap_size: 800M
   configBuilderResources:
     requests:
       memory: 200m
       cpu: 1024m
   managementApiAuth:
     insecure: {}
   serverType: cassandra
   serverVersion: 3.11.7
   size: 2
   storageConfig:
     cassandraDataVolumeClaimSpec:
       accessModes:
       - ReadWriteOnce
       resources:
         requests:
           storage: 3Gi
       storageClassName: server-storage

Did you expect to see something different?

In Step 8, I expected that server-config-init should have the same resource request configuration as cr2_spec.yaml. As mentioned in k8ssandra/cass-operator#150, the resource configuration of server-config-init can be set via configBuilderResources. We updated the variable size in cr2_spec.yaml, so the number of Pods increases from 1 to 2. The new Pod triggers the init containers, which run before app containers (k8s doc). Hence, at least the new Pod should reflect the new resource requirements specified in cr2_spec.yaml (memory: 200m, cpu: 1024m).

Environment

  • Cass Operator version:

    docker.io/k8ssandra/cass-operator:v1.10.3

  • Kubernetes version information:

    Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
    
  • Kubernetes cluster kind:

minikube start --vm-driver=docker --cpus 4 --memory 4096 --kubernetes-version v1.21.0

Root cause

The function ReconcileAllRacks updates the Replica number of the StatefulSet (i.e. size in cr1_spec.yaml and cr2_spec.yaml) before updating the podTemplate in the StatefulSet. Hence, when we update the fields size and configBuilderResources at the same time, the new Pod is created with a stale podTemplate.

To elaborate, the function ReconcileAllRacks updates the Replica number of the StatefulSet at reconcile_racks.go#L2416 and updates the podTemplate in the StatefulSet at reconcile_racks.go#L2440. In my opinion, we need to swap the order of L2416 and L2440 so that the new Pod is spawned with the new podTemplate.

Action Plan before 02/04/2022

Develop a basic framework for end-to-end testing framework

Pipeline: construct test environment -> Deploy operator -> test input generation -> submit to operator -> check results -> ...

Action Items:

Using Python for faster development

  • Construct testing environment - Kind cluster, borrow from sieve - by 01/31
  • Deploy operator - deploy scripts framework, borrow from sieve - by 01/31
  • Test input generation - as the first step, no fancy techniques will be used for test input generation. My proposal is to provide an example CR file; our framework mutates one field at a time and continuously feeds it to the operator. - by 02/02
    • Need to develop a routine to accept a JSON file, understand its structure, and mutate one field.
      • How to understand the JSON file: There are two ways in my mind; the easier one is to just treat it as a map, the other one is to unmarshal it into a Go struct (the API definition provided by the developers). The latter one is definitely more powerful, but it's hard to tell if we need it or not. For now, I will implement it by treating the JSON as a map.
      • Select which parameter to mutate: random for now. How to select field randomly? Need to google a bit, but should be easy.
      • How to mutate: hardcode a bunch of valid values for each parameter.
      • So essentially what I am going to do is to hardcode some valid values for all the parameters. Each time this routine is invoked, it picks a field randomly, selects a random value, and overwrites it in the original JSON input (see the sketch after this list).
  • Test output checking & recording - invoked after each input - by 02/02
    • For now, check the operator log and look for error messages. k8s client library to get log, and use string pattern matching
    • Grab the system status (for future)
      • Use the k8s client library
      • Need to specify which objects to Get. How to figure out what objects are owned by rabbitmq cluster?
  • Start running tests and get results by next meeting
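A minimal sketch of the proposed mutation routine, assuming hardcoded candidate values per field; the candidate pool and field paths below are illustrative, not the real schema:

# Sketch: treat the CR as a plain dict, pick a field at random, and overwrite
# it with one of the hardcoded candidate values for that field.
import copy
import random

CANDIDATE_VALUES = {
    ("spec", "replicas"): [1, 3, 5],
    ("spec", "image"): ["rabbitmq:3.8.21-management", "rabbitmq:latest"],
}


def mutate(cr: dict) -> dict:
    mutated = copy.deepcopy(cr)
    path, candidates = random.choice(list(CANDIDATE_VALUES.items()))
    node = mutated
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = random.choice(candidates)
    return mutated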

A summary on the Monday meeting with Kevin and Kuan-Yin

@marshtompsxd and @tianyin met with Kevin (@kevin85421) and Kuan-Yin (@kevchentw) today.

We understand more about the proposal of using a state-machine based approach to check event sequences (mentioned in #14).

I think the consensus is that we lack killer bugs that would support the advanced oracle. Moreover, automatic input generation (see #15) is still required.

The decision is that we will all work together to build the basic testing approaches in acto, experiment with them on multiple operators, and evaluate the results. During the process, we will accumulate the experience and insights to develop the targeted techniques. We will put aside the state machine.

It's great that we have finally reached a point of working together.

The plan is that @kevin85421 @kevchentw and @Yicheng-Lu-llll will each port an operator of their choices. @kevchentw and @Yicheng-Lu-llll want to use an operator in Sieve (the list can be found here: https://github.com/sieve-project/sieve/tree/main/examples).

Then, they will also help develop the oracles (#14) and input generation (#15) as they need them to experiment with their operators.

Progress summary 04/21

Today's meeting was cancelled, so I want to write this issue to sync up on the progress.
There is actually a lot to sync; this summary mainly contains two parts:

  1. Progress update. We ran the updated Acto on rabbitmq-operator and cass-operator
    1. For rabbitmq-operator, we successfully reproduced two previous bugs, and found two new bugs.
    2. For cass-operator, we found two new bugs.
  2. Discussion.
    1. I mainly want to discuss the program analysis for input space pruning; this is the part I most need help with.
      • I wonder if the reasoning about the two pruning strategies makes sense.
      • The concrete next step to implement the program analysis (e.g. which tool I should use, AST or llvm...)
    2. Back-and-forth testing. A new testing oracle we can use based on the declarative nature of k8s operators.

Testing results for rabbitmq-operator

  • Used the updated infrastructure to test rabbitmq’s cluster-operator
  • While we are currently working on automating the input space pruning, I manually pruned the “spec/affinity”, “spec/toleration”, and “spec/override” fields.
  • In total, Acto ran 93 test cases over 45 fields in 7 hours 38 minutes.
  • It produced 61 alarms, 7 of them are true alarms, 45 of them are false alarms, 9 of them are unsupported operations.

45 false alarms

  • 12 are due to format issues when doing value comparison, e.g. 1 compared to “1”. They can be fixed by using a heuristic for value comparison (Yuxuan is working on this; a possible canonicalization sketch follows this list)
  • 13 are due to dependency among the fields. E.g. if spec.vault.PKIIssuerPath is empty, then spec.vault.altName has no effect.
  • 3 are due to the field being configuration of the operator itself (the field does not map to a state in the application). All 3 of these false alarms are due to the same field
  • 4 are due to problems in the input generation. When Acto changes the input, it changes the default value to null. Then essentially the operator still uses the same default value since the value is null
  • 3 are due to invalid inputs. For example, for the field secretName, the value has to be the name of an existing secret object in the cluster.
  • 4 are due to default value when doing value comparison. When changing the input field from null to some value, the application state changes from default to some value. Because null != default when Acto compares the value change, it reports alarms.
  • 2 are caused by the string value getting modified by the operator. In the input, the string field has value "key1=value1;key2=value2", and in the application state, operator changes it to "key2=value2;key1=value1"
  • Besides the common reasons we discussed before, we discovered a couple of new reasons causing false alarms:
    • 3 of them are caused by determining system convergence too early
    • 1 is caused by wrong matching
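A possible canonicalization heuristic for the format-related false alarms; this is a sketch covering only the cases observed so far (numeric strings and reordered "key=value" lists), not Acto's actual implementation:

# Sketch: canonicalize values before comparison so that 1 == "1" and
# "key1=value1;key2=value2" == "key2=value2;key1=value1".
def canonicalize(value):
    if isinstance(value, str):
        stripped = value.strip()
        try:
            return float(stripped)  # numeric strings compare equal to numbers
        except ValueError:
            pass
        if "=" in stripped and ";" in stripped:
            # Order-insensitive comparison for ';'-separated key=value lists.
            return frozenset(part for part in stripped.split(";") if part)
        return stripped
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float)):
        return float(value)
    return value


def values_match(input_value, state_value) -> bool:
    return canonicalize(input_value) == canonicalize(state_value)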

7 true alarms

Potential new bug

It depends on whether this bug should be considered a duplicate or a different bug.
Acto found a bug that's very similar to the annotation bug (in the previous bug, users can add annotations to the service by specifying the spec.service.annotation field, but when they delete the values, the annotations are not removed from the service).
Here, the annotation is specified in a different place (spec.secretBackend.vault.annotations) and is added to the pod instead of the service.
The root causes of the two bugs are the same, because both code paths call the same utility function, which contains the defect.

New bug related to client-go's reflector

  • Acto sets the persistence.storage as ".07893E985.90504"
  • The value satisfies the regex specified in the CRD
  • But an error message from reflector shows in operator’s log
    E0421 07:19:52.910713       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta1.RabbitmqCluster: failed to list *v1beta1.RabbitmqCluster: v1beta1.RabbitmqClusterList.Items: []v1beta1.RabbitmqCluster: v1beta1.RabbitmqCluster.Spec: v1beta1.RabbitmqClusterSpec.Rabbitmq: v1beta1.RabbitmqClusterConfigurationSpec.Persistence: v1beta1.RabbitmqClusterPersistenceSpec.Storage: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|985.90504"},"rabbitm|..., bigger context ...|de":{},"persistence":{"storage":".07893E985.90504"},"rabbitmq":{"additionalConfig":"cluster_partitio|...
  • Although the value satisfies the regex, the unmarshalerDecoder is unable to unmarshal the value. There seems to be an inconsistency between the regex required by the decoder and the regex specified in rabbitmq’s CRD
  • Since the rabbitmq-operator’s CRD is automatically generated by the kubebuilder, this might be a bug in kubebuilder.
  • Need further investigation to find whose fault it is.

Testing result for cass-operator

  • In total, Acto ran 330 test cases over 157 fields for 25 hours.
  • It produced 115 alarms; 7 of them are true alarms, 77 of them are false alarms, and the rest are still being inspected.
  • 28/77 false alarms are caused by format, e.g. 1 != "1"
  • 9/77 false alarms are caused by wrong matching (the field is spec.tolerations[0].key, and it’s matched to another irrelevant field whose name is also key.)
  • 10/77 false alarms are caused by invalid input (secretName has to refer to an existing secret)
  • 9/77 false alarms are caused by default != null
    ...

Two new bugs found in cass-operator

The newly spawned pods do not use the up-to-date server-config-init under certain conditions

  1. Acto first submits
spec:
  configBuilderResources:
    requests:
      memory: 100m
      cpu: 512m
  size: 1

, which results in the following application state:

server-config-init:
    Image:      datastax/cass-config-builder:1.0.4-ubi7
    Port:       <none>
    Host Port:  <none>
    Requests:
      cpu:     512m
      memory:  100m
  2. Then Acto changes the input to
spec:
  configBuilderResources:
    requests:
      memory: 1000m
      cpu: 512m
  size: 1

, but the application state does not change.

nodeAffinityLabels

  1. Acto first submits input:
spec:
    nodeAffinityLabels:
        mqvoj: chxieobujr

which results in the application state:

affinity:
   node_affinity:
       required_during_scheduling_ignored_during_execution:
           node_selector_terms:
               match_expressions:
                   key: mqvoj
                   operator: In
                   values: chxieobujr

Note that this is a bad input, because no node in the cluster has the label mqvoj: chxieobujr. This renders the application unavailable because all the pods become unschedulable.
  2. Then Acto tries to delete this label:

spec:
    nodeAffinityLabels:
        mqvoj: null

But the application state does not change. The operator waits for all the pods to become ready before updating the statefulSet, but it is stuck because all the pods are unschedulable due to the bad nodeAffinityLabels value. As a result, the user cannot remove the bad value, because the operator keeps waiting for the pods to become ready before it updates the statefulSet.

Program analysis for input space pruning

As described in #65, we have two pruning strategies to prune the input space.
We can use program analysis to implement the pruning strategies.
The program analysis can be separated into two parts:

  1. Figure out the mapping between the variable and the input fields.
  2. Assuming we have the mapping, what is the algorithm for the program analysis

Let's first assume we have the perfect mapping between variables and the input fields.

1. For the first pruning strategy, we have a concrete pattern to look for.

Motivation for this pruning strategy: When designing the operator's input, operator developers allow users to specify some Kubernetes-general functionalities, e.g. affinity, persistent volumes... Inside the operator's logic, the operator does not handle these fields itself; instead, it simply passes them over to Kubernetes' other controllers. For such fields, we can safely prune their children.

To implement this strategy concretely, we are looking for specific patterns:

resource := corev1.someResource{
	affinity = spec.affinity,
	tolerations = spec.tolerations,
	… = …
}

In the source code, we are looking for struct initializations. A struct initialization contains a list of keyValueExpr. (An example of a keyValueExpr is affinity = spec.affinity, where affinity is the key and spec.affinity is the value.)
We loop through the keyValueExpr list and check whether the value is a SelectorNode. (A SelectorNode is a type of AST node; it specifically matches the pattern someStruct.someField. For example, spec.affinity is a SelectorNode, because it accesses the affinity field inside the spec variable.)
If the value is a SelectorNode, we then check its expression. If this SelectorNode is accessing a field of the input variable, then we have found a field for pruning. In the example above, spec.affinity and spec.tolerations are both SelectorNodes, and they both access a field of the input, because the variable spec maps to the input.
After finding such SelectorNodes, we can use the expression and the mapping to compute the exact field that the SelectorNode is accessing. After getting the exact field, we can prune its children.

2. Second pruning strategy

I am not very confident about the reasoning behind the second pruning strategy.
Motivation for this pruning strategy: Consider the simplified code snippet below from rabbitmq-operator:

mergedSpec = json.JsonStrategicMerge(spec, rabbitmq.spec.override.spec)

newSts.spec = mergedSpec

In the source code, we see that the field spec.override.spec from the input is being accessed by the operator’s logic. So by exercising the field spec.override.spec in our test cases, we are testing this part of logic in the operator.
However, the field rabbitmq.spec.override.spec has ~1000 subfields, and it’s too expensive to exhaustively test all of them.

This example motivates us to come up with a cost model for fields in the input.
The high-level idea is that there is a cost and a benefit associated with testing each part of the input.
The cost of testing a part of the input is simply the number of test cases that need to run.
The benefit of testing a part of the input is the increase in our coverage of the operator logic.
In the previous example, we are paying ~3000 test cases to only exercise ~2% of the operator logic.

But it's hard for us to compute the exact amount of coverage gained by testing a field, and we only need a very rough estimate to prune most of the fields. So as the first step, we can treat all field accesses equally: when a field is accessed by the operator's source code, we consider that testing this field gains 1 unit of logic coverage.
Then by analyzing the source code, we can find all the fields that are directly accessed by the operator's source code. By calculating how many test cases are needed for testing each of the fields, we can get a cost model.

To implement this analysis concretely, we need to get all the fields that are directly accessed by the operator's source code.
We can achieve this by traversing the AST of the operator code and finding all SelectorNodes. If a SelectorNode accesses a field in the input, we consider that this field exercises 1 unit of the operator logic. Then we compute the number of test cases needed to test this field; if the number is larger than our threshold, we prune this field.

The logic of finding all the directly accessed fields can be expressed as below:

functionSrc <- source code for each function
mapping <- mapping between the variable and the fields in the input
accessed_fields <- buffer for return the result

procedure GET_ACCESSED_FIELDS(functionSrc, mapping, accessed_fields):

    inputs = getInputs(functionSrc, mapping)  // get the variables in this function that are considered as part of the input

    // traverse the AST of this function
    astNodes = parseAST(functionSrc)

    for node in astNodes:
        if node is Selector:  // node is in the format of a.b.c 
            if node[0] in inputs:
                rootField = getField(node[0], mapping)  // use the mapping to get the corresponding field
                field = rootField
                for key in node:
                    field = field[key]

                accessed_fields.append(field)  // append the field to the list

Back and forth testing

How Acto currently runs tests: it continuously changes the input and uses the oracle to check whether the operator's output is correct. When it reports an alarm, it restarts from the seed input and continues with the remaining test cases.

The idea of back-and-forth testing is based on the declarative nature of Kubernetes operators, where users specify the desired state in the CR.
This declarative nature means that no matter what the previous condition of the application is, if the user submits input A, the resulting application state should always be the same.
This enables a new oracle. When Acto reports an alarm and starts a new testing round, it can start from some other previously applied input instead of the seed input. For example, Acto can start from Input A again, which gives us two application states produced by Input A at different times. Acto can then compare these two application states; if they are different, Acto reports an alarm.
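A minimal sketch of this oracle, assuming each testing round records the (input, converged system state) pair; the names below are illustrative:

# Sketch: remember the converged system state for each input that has been
# applied, and raise an alarm if re-applying the same input ever converges
# to a different state.
import json
from typing import Optional


class BackAndForthOracle:
    def __init__(self):
        self._seen_states = {}  # canonical input -> canonical converged state

    @staticmethod
    def _canonical(obj) -> str:
        return json.dumps(obj, sort_keys=True, default=str)

    def check(self, cr_input: dict, converged_state: dict) -> Optional[str]:
        key = self._canonical(cr_input)
        state = self._canonical(converged_state)
        if key in self._seen_states and self._seen_states[key] != state:
            return "alarm: the same input converged to a different system state"
        self._seen_states[key] = state
        return None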

Action Plan before 03/18/2022

  • Stabilize input generation implementation
  • Run massive testing and collect results.
  • Although we have several open problems right now, I will prioritize them based on the results of the test run.
    • Fix meaningless input mutation
    • System healthy oracle
    • Handle the alarms caused by rejected inputs. (We need to either be able to tell whether the input is rejected by the operator, or expose an interface to avoid invalid changes.)

Action Plan before 03/11/2022

  • Count domain-specific/application-specific fields in rabbitmq CRD
  • Clarify terminologies [schema, properties, constraint]
  • Implement mutate() after gen(), then integrate it into the test pipeline.
  • Although the values we generate might not make sense at this point, after integrating this into the pipeline, run tests against rabbitmq and see how things go.
  • Start thinking about implementing heuristics

Action Plan before 04/01/2022

  • Generate test plans instead of doing stateless random walk
    • Systematically generate a fixed number of test cases per field (covering different change types)
    • When a field's test cases are exhausted, this field will no longer be selected
  • Make oracle heuristics extensible
  • Port 2 more operators (discuss with Kevin, Kuan-Yin, Yicheng on cass-operator's helm)
  • Merge support of helm

Summary of the oracle designs

Let me write some thoughts on the oracle designs.

There are four ideas we have.

  1. O1: Checking error messages in operator/application logs (the traditional oracles)
  2. O2: State diff -- check whether the CR diff is reflected in the state diff
  3. O3: State idempotency -- check whether the same state-transitions result in the same states; See #18
  4. O4: State machine -- learn a state machine first and check whether the valid state is in the state machine (@kevin85421 is investigating it so I'm not going to discuss it; I love the idea but also communicated my skepticism).

O3 is proposed by @tylergu (I have asked him to write an issue to organize his thoughts). But it assumes that the system state is strictly idempotent for every CR change, and that assumption seems wrong based on the discussion today. I think we should look into the idempotency semantics of Kubernetes and understand whether certain behaviors are expected or buggy. We should create an issue for that. Moreover, O3 is more costly, as it has a >O(N) overhead.

O1 and O2 are basic oracles.

@tylergu could you point me to the code where O1 is implemented?

In fact, O1 is not that easy to do, despite the simple idea :) I had a lot of gray hairs when trying to write a common parser for different application logs (due to the format).
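For reference, here is a minimal sketch of O1, assuming the official kubernetes Python client and simple pattern matching on log lines; the namespace, label selector, and error patterns are placeholders:

# Sketch of O1: pull the operator's log and flag lines that look like errors.
import re

from kubernetes import client, config

ERROR_PATTERNS = [re.compile(p) for p in (r"\berror\b", r"\bpanic\b", r"\bfailed\b")]


def check_operator_log(namespace="rabbitmq-system",
                       label_selector="app.kubernetes.io/name=rabbitmq-cluster-operator"):
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    alarms = []
    for pod in pods.items:
        log = core.read_namespaced_pod_log(pod.metadata.name, namespace)
        for line in log.splitlines():
            if any(p.search(line.lower()) for p in ERROR_PATTERNS):
                alarms.append((pod.metadata.name, line))
    return alarms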

O2 is blocked by #12 and we have to solve it in a good way. I think building O2 is the next thing to achieve. See #13

Meeting summary 03/24

In this meeting, we went through the problems and progress during the past two weeks:

Stabilizing Acto's testing - being able to run for a long time

Goal: To be able to run Acto for an overnight-run or a week-run.
Problem: Acto pulls images too frequently, both for the operator image and the application image, causing ImagePullBackOff
Solution:
- For the operator image, we can preload the image into the Kind cluster and change the pull policy to IfNotPresent.
- For the application image, the workaround is to provide an argument option to preload some frequently used images.

Input space

Rabbitmq-operator has 1323 fields in its CR.

If we assume no dependency among the fields, only allow one field change at a time, and test three different values for each field, we would need to run 3289 tests.

The exploration strategy Acto currently uses is a random walk: at each step, Acto selects a random field and a random value for that field. This strategy causes us to run a lot of redundant tests.

Comments: We have a huge input space to explore; it is interesting to see how to reduce it.
We could also explore more systematically: instead of a stateless random walk, we can remember which fields have already been explored and bias towards the fields that have not.

Solving the nested value comparison

Problem: The value in the delta could be nested, causing problems when doing value comparison.

Solution: Flatten the dict before comparison
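A minimal sketch of the flattening step (not Acto's exact implementation):

# Sketch: flatten a nested dict/list delta into path -> leaf-value pairs so
# that two deltas can be compared leaf by leaf.
def flatten(value, path=()):
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            out.update(flatten(child, path + (key,)))
        return out
    if isinstance(value, list):
        out = {}
        for index, child in enumerate(value):
            out.update(flatten(child, path + (index,)))
        return out
    return {path: value}

# Example: {'vault': {'annotations': {'key': 'random'}}}
# flattens to {('vault', 'annotations', 'key'): 'random'}.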

Comments:
- We should make heuristics easily extensible. We may add other heuristics later (e.g. solving the format problem in value comparison), and users can implement their own heuristics.
- This problem is essentially a problem of matching input deltas to system state deltas. There is some possibly related work on object-relational mapping: https://github.com/Frankkkkk/pykorm

Trying bad values

When the CR yaml is submitted, it goes through two levels of checks. The first level is the Kubernetes API server: since the CRD was previously registered with the server, the server uses the schema in the CRD to validate the CR yaml. If the CR yaml fails the check here, it is rejected by the server without even reaching the operator code, and the user is prompted with an error message. The second level is the operator itself, which receives the CR only after it passes the server-side validation.

Our goal should be testing the operator with both good values and the bad values that can pass the server-side check. By testing the operators with bad values, we can test if the operators could handle these bad values properly. If the operators cannot handle bad values properly, there would be a failure.

In fact, Acto currently is already generating a lot of bad values that pass the server-side check.

Generating CRD using Kubebuilder

Problem: There was this previous discussion on whether to use the CRD or the API to help the input generation.
Acto currently relies on the schema in the CRD to generate structure-correct inputs. The quality of the inputs we generate largely depends on the quality of the schema in the CRD.

We found that some operator developers only specify a very opaque schema in their CRDs. In this case, the Go API definition contains the correct structure information, so we also want to take advantage of the information in the API definition.

Solution: Kubebuilder has a feature for generating CRDs automatically, and this feature is cleanly separated out as a CLI.
I was able to generate the CRD for Percona's mongodb-operator with a single command, without modifying the source code.

Next step

  • Port 2 more operators: choice(cass-operator, redis-operator, ibm-cloud-operator...)
  • Explore the input space more systematically, and think about how to prune it.
  • Make oracle heuristics easily extensible, as we most likely would add more heuristics later, and allow users to extend it too

We need to talk about helm and the 10-operator porting plan

hey guys, I'm looking at the recent commits/PRs. I think we really need to talk.

It seems to me that right now the team is split by two different efforts/groups who do not talk.

The lack of communication leads to some nonsense, e.g., we have redundant pieces of code

and issues like #46 -- why are we doing this? It's a waste of time.

I asked @tylergu about helm and he seems not eager to use helm. On the other hand, his work is mostly on the testing side. I think we should talk and decouple the deployment and testing. I'm very open to helm or other deployment practice, as long as it can help our project or save our own time.

I also notice @Yicheng-Lu-llll's message on slack,

Goal for next week:
1, Yicheng Lu: fix bugs when using candidates.yaml
2, Kuan-Yin Chen and Kai-Hsun Chen:helm chart support to ten operator

I'm confused why we want to scale to 10 operators at this moment.

Likely, I do not understand @kevin85421 and @kevchentw 's plan. But, let me speak from a 598 course project perspective. This is a testing project and your work will be evaluated by the test effectiveness (e.g., how many new bugs you find). So, you would really want to reach to the testing phase to have results soon. We discussed the input generation before and I think I convinced you that we should start from building basic techniques rather than jumping to very advanced ones :) I think @tylergu has implemented some basic ones which are already able to find some new bugs (https://github.com/xlab-uiuc/acto/blob/main/bugs.md).

So, at this point, we should either work hard on improving the testing technique currently implemented in Acto (either the input generation or the oracle), or evaluate the existing techniques on a few other operators. But I don't see us doing either.

In fact, I don't know whether acto_helm.py could use the testing code @tylergu implemented.

This reflects the sad fact that there's really a lack of communication between the two groups, which I think is very bad for both groups. Let me speak for 598 -- even if you just run what @tylergu builds on a few other operators and are able to find new bugs, I would think it's a cool project (luckily @tylergu is not in 598 so you can take all the credits). But if we only port 10 operators without doing the testing or finding anything useful, the 598 project won't end too well.

So my takeaways are:

  • COMMUNICATE and WORK TOGETHER!
  • Focus on the testing part and figure out the scope of 598

Let me know your thoughts.

Terminologies for JSON schema

To facilitate future discussion, I want to clarify some terminologies that are frequently used when we are dealing with OpenAPI schemas (based on JSON Schema).

Primitive types for Value:

A value in JSON can be one of the six primitive types:

  • null
  • boolean: A "true" or "false" value
  • object: An unordered set of properties mapping a string to a value - basically a dict in python
  • array: An ordered list of values - a list in python
  • number: float/int in python
  • string

property

A property stands for a string-value pair in an object. Here, object specifically refers to the JSON primitive type above: a dict in python.
Consider this person object:

{
  "age": 15,
  "profession": "student"
}

"age": 15 is a property of this object.

Schema keywords:

JSON schema objects have keywords.
Example:
We have a JSON schema for a person:

{
  "type": "object",
  "properties": {
    "age": { "type": "number" },
    "profession": { "type": "string" }
  },
  "maxProperties": 2
}

"type" is a keyword, "properties" is also a keyword. "maxProperties" is an assertion keyword that make sure the value does not have more than 2 children.

Broadly speaking, keywords mostly fall into these categories:

  • identifiers: control schema identification through setting the schema's canonical URI and/or changing how the base URI is determined
    Example: id
  • assertions: produce a boolean result when applied to an instance
    Example: maxProperties, type
  • annotations: attach information to an instance for application use
    Example: description, default
    ...

Schema:

JSON Schema is a JSON media type for defining the structure of JSON data.
Examples of JSON schema:

{
  "type": "object",
  "minProperties": 2,
  "properties": {
    "first_name": { "type": "string" },
    "last_name": { "type": "string" },
    "birthday": { "type": "string", "format": "date" },
  }
}
{
  "type": "array",
  "items": {
    "type": "number"
  },
  "minItems": 2
}

Some keywords take schemas themselves, allowing JSON Schemas to be nested.
For example, the items keyword in the array schema shown above takes another schema.
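As a quick illustration of how such a schema is used, here is a small sketch with the third-party jsonschema Python package (one common way to validate an instance against a JSON schema):

# Sketch: validate JSON instances against the person schema from above.
# Requires `pip install jsonschema`.
from jsonschema import ValidationError, validate

person_schema = {
    "type": "object",
    "properties": {
        "age": {"type": "number"},
        "profession": {"type": "string"},
    },
    "maxProperties": 2,
}

validate(instance={"age": 15, "profession": "student"}, schema=person_schema)  # passes

try:
    validate(instance={"age": "fifteen"}, schema=person_schema)
except ValidationError as err:
    print(err.message)  # 'fifteen' is not of type 'number'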

Operator repo selection

There are discussions on Slack about the repo selection.

It's indeed non-trivial to select repos that can cover multiple different dimensions.

What I suggest we do is collect as much meta-information as possible so we can make an informed selection together.

Currently, I don't have much information to make a call or comment based on
https://github.com/xlab-uiuc/k8s-operator-bugs/blob/main/k8s-operator-repos.md

What I hope to see is more like a spreadsheet (you can use google spreadsheet) that summarizes many potential repos so we can just sit together and select.

Use cert-manager and enable validation webhook for cass-operator

I finished inspecting the cass-operator results, and I realized that cass-operator uses a validation webhook which prevents changes to many fields. I noticed that cert-manager and the validation webhook were not enabled in the current deployment config. A lot of false alarms in the test results could be prevented by enabling the validation webhook, since it would reject many invalid changes.

Action Plan before 03/04/2022

Before 03/04/2022:

  • Wrap up oracle implementation and try running it, it should be able to reproduce the service annotation bug.
  • Implement input generation - Use CRD first
    • Implement a way to extract the schema out of the CRD, and represent the schema as an object in python.
    • Implement in Acto to consume the schema to generate inputs automatically (integer, bool, enum).

Meeting summary

  • [high pri] 101 tests/mutations is small → run a day run for 24 hours or a week
    • 4 hours and machine is overloaded → infra is not solid
    • Could be resource leak
    • After 4 hours, you don’t have enough CPU/memory
  • [high pri] Run a daily or a weekly run (thousands of tests)
    • Back and forth
  • [mid-pri] Experiment with bad values in addition to good values
    • If the value is bad, will it break the operator?
    • Operator rejects it and does not misbehave.
    • Some generated good values → bad
    • Some generated bad values → good
  • [high-pri] Violate the constraints -- if they are rejected upfront, confirm the place that checks constraints
  • [low-pri] Want to test the 2nd and 3rd operator [RabbitMQ] → many of our observations are kinda biased by RabbitMQ
    • Confirm that our infra is solid and can work with more operators
    • We can have more results to understand the problems
  • [mid-pri] Investigate a way to reduce inspection effort
    • Codify some signatures so we don’t investigate the same problem twice.
  • #43

cass-operator becomes partially inoperable if replaceNodes has a wrong pod name

What happened?
cass-operator loses part of its functionality after the user mistakenly supplied a wrong pod name under the spec.replaceNodes field. For example, the operator fails to decommission nodes or do a rolling restart. This is because, in the reconciliation loop, the operator requeues the request if status.NodeReplacements is nonempty before reconciling many other functionalities. Since the pod name under status.NodeReplacements does not exist, the operator is never able to clear status.NodeReplacements. This causes the operator to become partially nonfunctional.

Did you expect to see something different?
The operator should be robust and still able to reconcile other functionalities even when the user submits a wrong pod name for spec.replaceNodes. Alternatively, the operator should do a sanity check on spec.replaceNodes to prevent itself from getting stuck.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the cass-operator
  2. Deploy the CassandraDatacenter using this yaml, kubectl apply -f sample.yaml:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  3. Provide an invalid pod name for spec.replaceNodes, kubectl apply -f sample.yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
  - rtiisajufx
  4. Request a rolling restart, kubectl apply -f sample.yaml:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
  - rtiisajufx
  rollingRestartRequested: true
  5. Observe that the cluster does not restart

Environment

  • Cass Operator version:

    docker.io/k8ssandra/cass-operator:v1.9.0

  • Kubernetes version information:

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

    kind

  • Manifests:

See above yaml manifests in the reproduce steps

Anything else we need to know?:
The bug is caused by the operator repeatedly requeueing the request while status.NodeReplacements cannot be cleared due to the wrong pod name. We suggest sanitizing the spec.replaceNodes field to make the operator robust.

New oracle needed to detect if the changed config is rolled out

The bug in zookeeper-operator (#59) is caused by the config being translated into a configMap object while the application is not restarted to reflect the changed config. koperator also had a very similar bug.

Our current oracles are not able to detect such bugs. In the input delta, Acto sees that a config field has changed, and then tries to find a matching delta in the application state. It successfully finds a matching delta in the configMap object and thus passes the check.

To detect such bugs, I can think of two ways:

  1. We first recognize which field represents the application config. Then, when this field changes, we apply an additional oracle to check whether the application restarted or not. This is a heuristic, because not all applications/configurations require a restart to roll out the changed config.
  2. The second way is to provide an interface that allows users to do special checking for certain fields. Users could register a callback function for the config field as an extension of the oracle. The callback function is invoked as part of the oracle when a test case changes the config field, and inside the callback the user implements the mechanism to check whether the changed config is applied to the application (see the sketch below).
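A minimal sketch of what the second option's interface could look like; the registration API, field-path format, and example check are assumptions, not Acto's current code:

# Sketch: let users register per-field callbacks that run as an extra oracle
# whenever a test case changes that field.
from typing import Callable, Dict, Optional, Tuple

FieldPath = Tuple[str, ...]
# A check receives the new value and the observed system state, and returns
# an error message if the changed config was not rolled out.
FieldCheck = Callable[[object, dict], Optional[str]]

_field_checks: Dict[FieldPath, FieldCheck] = {}


def register_field_check(path: FieldPath, check: FieldCheck) -> None:
    _field_checks[path] = check


def run_field_checks(input_delta: Dict[FieldPath, object], system_state: dict):
    alarms = []
    for path, new_value in input_delta.items():
        check = _field_checks.get(path)
        if check is not None:
            message = check(new_value, system_state)
            if message is not None:
                alarms.append((path, message))
    return alarms


# Example: a user-provided check that a changed zookeeper config value actually
# shows up in the config the application is running with (however the user
# chooses to collect that state).
def commit_log_count_rolled_out(new_value, system_state) -> Optional[str]:
    running_config = system_state.get("zookeeper_running_config", {})
    if str(running_config.get("commitLogCount")) != str(new_value):
        return "commitLogCount change was not rolled out to the application"
    return None


register_field_check(("spec", "config", "commitLogCount"), commit_log_count_rolled_out)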

Test results

I ran Acto on rabbitmq-operator for 101 test cases and got 39 alarms. 2 of them are true alarms corresponding to #39; 37 are false alarms.

9 of the 37 false alarms are caused by unsupported operations.

For example, scaling down or shrinking a PVC volume. Because these changes are rejected by the operator, there is no system delta and Acto was unable to find a matching field.

13 of 37 false alarms are caused by complex deltas.

There are two cases:

  1. The input delta is a dictionary containing many subfields; these subfields map to several different fields in the system delta
    e.g.:
"root['spec']['secretBackend']": {
        "prev": null,
        "curr": {
            "vault": {
                    "annotations": {
                        "key": "random"
                    }
            }
        }
}

, in the system delta, the corresponding field is:

"root['test-cluster-server']['spec']['template']['metadata']['annotations']['key']": {
    "prev": null,
    "curr": "random"
}
  2. The second case is when the input delta and the matched system delta are both dictionaries. When comparing the dictionaries, we also need to canonicalize their keys.

4 out of the 37 false alarms are caused by comparing number with string, e.g.: 1 != '1'

4 are caused by comparing null with default value.

For example,
Input delta:

"root['spec']['image']": {
      "prev": null,
      "curr": "random"
}

matches with:

"root['test-cluster-server']['spec']['template']['spec']['containers'][0]['image']": {
      "prev": "rabbitmq:3.8.21-management",
      "curr": "random"
}

"rabbitmq:3.8.21-management" is the default value used by the operator when image is null.

The remaining categories have smaller counts:

1 caused by invalid input
2 caused by unchanged input
1 caused by unused field
2 caused by the field being config of operator
