xlab-uiuc / acto
Push-Button End-to-End Testing of Kubernetes Operators and Controllers
License: Apache License 2.0
In today's meeting, we mainly discussed the following topics:
Previously, we ran tests by randomly selecting fields and assigning random values to the selected fields. This results in many redundant tests and alarms, and we don't know how well we have explored the input space.
To address this problem, I changed Acto to first generate a list of test cases, and execute them one by one.
The input is essentially a JSON instance, so we can consider the input as a tree.
The object nodes and array nodes in the JSON are essentially the inner nodes of the tree.
The basic types, e.g. Number, String, Boolean, are the outer nodes of the tree.
We consider both the outer nodes and inner nodes as fields.
For outer nodes, it's straightforward because they have concrete values, and we want to test different values for them.
For inner nodes, their characteristics also affect the operators' behavior. For example, for an array field, arrays with 0, 1, or 3 items may trigger different operator behavior.
Once we have the list of all fields, we can generate test cases for each of them.
Acto uses heuristics to generate test cases for each of the fields depending on their type.
You may notice that these test cases require some preconditions to run. For example, to run the Pop-item test case for array fields, there needs to be at least one item in the field's current value.
So each test case has three callbacks: precondition, mutator, setup. The precondition callback will check if the precondition of this test case is satisfied or not. The mutator will change the value to exercise the actual test case. If the precondition is not satisfied, we will call the setup callback to satisfy the precondition, so that we can exercise this test case next time.
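Concretely, here is a minimal Python sketch of this three-callback structure, using the array Pop-item test case as the example (the names TestCase, array_pop_precondition, etc. are illustrative, not Acto's actual API):

class TestCase:
    # A test case bundles the three callbacks: precondition, mutator, setup.
    def __init__(self, precondition, mutator, setup):
        self.precondition = precondition  # returns True if the case can run on the current value
        self.mutator = mutator            # transforms the current value to exercise the test
        self.setup = setup                # changes the value so the precondition holds next time

def array_pop_precondition(value):
    return value is not None and len(value) > 0

def array_pop_mutator(value):
    return value[:-1]                     # pop the last item

def array_pop_setup(value):
    return (value or []) + [{}]           # ensure at least one item exists

pop_item = TestCase(array_pop_precondition, array_pop_mutator, array_pop_setup)

def run(test_case, current_value):
    # Run the mutator if the precondition holds; otherwise run setup so the
    # test case can be exercised in the next round.
    if test_case.precondition(current_value):
        return test_case.mutator(current_value), True
    return test_case.setup(current_value), False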
By using the test case generation mentioned above, we parsed 1332 fields for rabbitmq-operator, and we generated 2861 test cases in total. On average, each test case takes ~3 minutes to run; to exhaust all the test cases for rabbitmq, we would need to run for 18 days on a single machine. We need to prune the input space.
spec.override.statefulSet.spec has 1109 fields, which makes up 85% of the test cases. However, this spec.override.statefulSet.spec field is only exercised in the following statement, where the variable podSpecOverride corresponds to the field spec.override.statefulSet.spec:
patch, err := json.Marshal(podSpecOverride)
patchedJSON, err := strategicpatch.StrategicMergePatch(originalPodSpec, patch, corev1.PodSpec{})
This means that we are using 85% of the test cases to test just one functionality in the program.
Through discussion, I think we can find which field corresponds to which functionality in the program, and calculate the cost of testing that functionality. If the cost is too high, we can prune the field.
For example, in the code shown above, we can learn that the field spec.override.statefulSet.spec corresponds to one functionality, since the operator only accesses the spec.override.statefulSet.spec level and does not access any children fields of spec.override.statefulSet.spec. Then we calculate the number of test cases needed to test this functionality and decide if we should spend the effort to test it.
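As a rough illustration, this cost-based pruning could be a simple pass over the parsed schema tree. The sketch below assumes we already know which field paths the operator accesses directly; count_fields_under, prune, and the budget numbers are all hypothetical:

def count_fields_under(schema_tree, path):
    # Count the number of fields (test targets) in the subtree rooted at path.
    node = schema_tree
    for key in path:
        node = node[key]
    if not isinstance(node, dict):
        return 1
    return 1 + sum(count_fields_under(node, [key]) for key in node)

def prune(schema_tree, accessed_paths, cases_per_field=3, budget=100):
    # Return the accessed fields whose subtrees are too expensive to test
    # exhaustively, e.g. spec.override.statefulSet.spec with 1109 fields
    # would far exceed the budget.
    pruned = []
    for path in accessed_paths:
        cost = count_fields_under(schema_tree, path) * cases_per_field
        if cost > budget:
            pruned.append(path)
    return pruned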
Goal: An easy-to-understand terminology that can be used for all the following objects:
I want to have a general terminology for these different objects because I observe that they share the same structure (all of them are trying to describe how the data should be constructed). And in our implementation, their classes inherit from the same parent class; I need to properly name their classes and the parent class.
{
"type": "object",
"minProperties": 2,
"properties": {
"first_name": { "type": "string" },
"last_name": { "type": "string" },
"birthday": { "type": "string", "format": "date" },
}
}
{
"type": "array",
"items": {
"type": "number"
},
"minItems": 2
}
{
"type": "number",
"default": 20,
"multipleOf" : 10
}
All of these objects are describing the organization of the data and guiding how the data should be constructed.
My initial proposal was to use the term schema. But during the meeting, it seemed that the term schema was confusing to folks. It only makes sense for complex structures to have a schema, and it was confusing to call
{
"type": "number",
"default" : 20,
"multipleOf" : 10
}
a schema, because this only describes a number.
@tianyin proposed to use constraint. I think it makes sense to call the items constraints; for example, in the first object, "type": "object", "minProperties": 2, and "properties": ... can all be called constraints. Then the entire object could be called a ConstraintSet?
@marshtompsxd proposed to use property. I think property is a similar term to constraint. For example, "type": "array" is a property of this object. But I think it is a little weird to call the entire object a "property".
@tylergu As we discussed after the meeting, I think the highest priority for now is to build a basic prototype that can work for a few operators (other than RabbitMQ). Without that, I feel some discussions are less effective due to the lack of understanding on the strawman and the data:
Do we have acto now? The fact is that there is no acto in acto now:
The current acto
does NOT have an automatic input generation -- all the values were hardcoded by @tylergu based on reading documents. In other words, it can't even be applied to the second operator.
For the oracle, there are a lot of proposals including using a state machine proposed by @kevin85421 and leveraging idempotency of state transition proposed by @tylergu. Those are certainly nice to have, but even the simplest oracle (the state diff) does not work now (see #12).
IMHO, it's always fun to chat about new ideas and potential completeness/soundness problems. But the most important thing is to build a solid prototype and experiment with it on multiple operators. Only by doing that can we gain the experience of what does not work, develop the understanding of what will work better, and identify opportunities to improve the test technologies. Otherwise, we will keep cycling on some very hard but less important problems. Oftentimes, some of those problems do not matter in practice.
Based on the meeting, I feel the following two things are blockers that we have to fix:
I would highly suggest we focus on addressing these two issues and run stuff in an automatic fashion.
It is hard to come up with the list of operators we want to study, so we decided to study one operator (spark-operator) first. By studying the first operator, we hope to:
After studying one or two operators, we should be able to
So that we can deal with operators that do not provide a fine-grained CR yaml file (like the Percona ones).
@marshtompsxd will ask Percona why they don't provide one.
If we know how to automatically generate the CR yaml, we can contribute it to them, and of course use it for Acto.
We discussed several interesting next steps to do at this stage:
Cluster management systems like Kubernetes provide some generic functionalities for managing applications, for example, Affinity and PersistentVolume. The operators enable users to manage their applications with just one application-specific input (CR). In this input, they still allow users to specify these generic functionalities provided by Kubernetes. Then, in the operators' logic, they simply hand these generic fields over to Kubernetes. If we can identify such fields in the operator's input, then we can prune their subfields.
For example, in the rabbitmq-operator's code, spec.Affinity is simply copied over to a field when creating the podTemplateSpec for the statefulSet. In this case, we can prune all the children of the field spec.Affinity in rabbitmq's CR.
We observe that rabbitmq-operator's input has 1323 fields in total. Out of the 1323 fields, 1109 are under the field spec.override.statefulSet.spec, because this field contains the complete schema of the statefulSet.spec resource. But this spec.override.statefulSet.spec field is only used as a patch to conduct a JSON strategic merge patch on the existing statefulSet, as shown in the code below, where podSpecOverride corresponds to the spec.override.statefulSet.spec field:
patch, err := json.Marshal(podSpecOverride)
patchedJSON, err := strategicpatch.StrategicMergePatch(originalPodSpec, patch, corev1.PodSpec{})
It is too expensive to spend ~85% of the test cases to test only this single functionality. We can do program analysis to identify the fields that the operator directly accesses, and then compute the cost of testing each such field; if a field is too expensive to test, we should prune it.
Cass-operator also has a field called spec.podSpecTemplate, which contains the entire schema of the statefulSet's podTemplate. This spec.podSpecTemplate has ~1000 fields.
Currently, our test cases change only one field at a time. If we can run several test cases at the same time, we can largely reduce the testing time.
There are two potential challenges:
1. There are dependencies among the fields
2. Changing multiple fields at a time could complicate the oracle
We need to run rabbitmq-operator/cass-operator with the new input generation. We can first run them with a manually pruned input space. The results will show us how many false alarms we have.
Currently, we only test different inputs in one CR. It's also possible to test deleting the CR and recreating it. We can also test inputs in two CRs; for example, in rabbitmq's case, we would be creating two rabbitmq clusters.
Acto is based on Kind clusters, which use docker containers to virtualize clusters. So it is possible to have multiple Kubernetes clusters running different test cases on the same machine. I noticed that not all the cores are used efficiently while running Acto, so it might be beneficial to explore running Acto in a multi-cluster setting.
Some operators do not provide a complete CRD that fully reflects the input structure defined in the API types. We can use kubebuilder to automatically generate CRD for operators in this case. We need to incorporate this option into Acto's pipeline.
I was trying to change the persistence/storageClassName field in my rabbitmq-cluster's CR, but changing persistence/storageClassName has no effect on the PVC used by the statefulSet.
The persistence/storageClassName field was initially not specified, so the operator used the default storage class "standard". Then I created a new storage class following the instructions here: https://github.com/rabbitmq/cluster-operator/blob/main/docs/examples/production-ready/ssd-gke.yaml, and changed persistence/storageClassName from null to ssd. This change failed silently.
Steps to reproduce this behavior:
kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: sample
spec:
image: "rabbitmq:3.8.21-management"
persistence:
storage: 20Gi
replicas: 2
kubectl apply -f https://github.com/rabbitmq/cluster-operator/blob/main/docs/examples/production-ready/ssd-gke.yaml
Change persistence/storageClassName to ssd and apply:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: sample
spec:
image: "rabbitmq:3.8.21-management"
persistence:
storage: 20Gi
storageClassName: ssd
replicas: 2
"spec": {
"accessModes": [
"ReadWriteOnce"
],
"resources": {
"requests": {
"storage": "20Gi"
}
},
"storageClassName": "standard",
"volumeMode": "Filesystem",
"volumeName": "pvc-43d66309-2090-4de3-bc82-9edcd0a69361"
}
This bug occurs because the operator only reconciles the PVC's storage capacity, but does not reconcile the storageClassName here: https://github.com/rabbitmq/cluster-operator/blob/d657ffb516f948aaffd252794e3ed5e75e352d3d/controllers/reconcile_persistence.go#L15.
A possible fix is to create the desired storage type and migrate the data over, or to report an error message like PVC scale-down does here: https://github.com/rabbitmq/cluster-operator/blob/d657ffb516f948aaffd252794e3ed5e75e352d3d/internal/scaling/scaling.go#L51
Before 02/18/2022:
Before 02/25/2022:
Focusing on rabbitmq-operator
Currently, our e2e testing explores the space by doing a random walk: after each test, we randomly select a field and randomly assign a value to it. I want to propose a more systematic approach for exploring our input space.
Assumption: The operator implements level-triggering, that is, the operator only needs to observe the final input to drive the system to the desired state. Under this assumption, we get this property: for any input x, there is only one correct system state corresponding to x.
The problem we want to solve: In the operators' reconciliation loop, they read two inputs: 1) the current system state, and 2) the cr.spec, which is the desired state. The responsibility of operators includes not only deploying the correct system from scratch (when the current system state is empty), but also driving the current system to the desired state when users reconfigure. We want to test the operators' ability to drive the system to the desired spec no matter what the current system state is.
We can use a test case matrix to represent all the test cases we want to run. We use the system_state(X) function to denote the correct system state we collect after we submit input X.
Once we get this test case matrix, we can start running the tests to fulfill this matrix.
For example, if we first run a test trial A->B->C, we fulfill (system_state(A), Input B) and (system_state(B), Input C) in the matrix.
When running the trial A->B->C, we collect states after each test and save them as trial_1{A,B,C}.
Then we run another trial B->A->C; in this trial we test (system_state(B), A) and (system_state(A), C) in the matrix, and collect the system states after each test as trial_2{A,B,C}. We can then compare the system states collected in trial_2 with the ones collected in trial_1. Since we assume that the operator implements level-triggering, these states should agree.
If we want to run all the test cases in this matrix, we have O(n^2) complexity, where n is the number of different inputs.
It is impossible for us to run all the test cases, but we can do some test prioritization and assign weights to each edge.
After all, this test case matrix is just a representation. We can still do a random walk here and only run a certain number of tests, but we will be able to avoid redundant tests and have an additional oracle to use. A sketch of the bookkeeping follows below.
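A minimal sketch of this bookkeeping in Python (the names states_by_input, covered, and collect_state are hypothetical, not Acto code):

states_by_input = {}   # input -> system state observed after submitting it
covered = set()        # matrix cells (previous input, next input) already tested

def record_trial(trial, collect_state):
    # Run one trial (a sequence of inputs), fill the matrix cells it covers,
    # and check the level-triggering property along the way.
    prev = None
    for inp in trial:
        state = collect_state(inp)    # submit input, wait for convergence, dump state
        if prev is not None:
            covered.add((prev, inp))  # this matrix cell is now fulfilled
        # Level-triggering oracle: the state after input X must not depend on history.
        if inp in states_by_input and states_by_input[inp] != state:
            print(f"alarm: input {inp} produced diverging system states")
        states_by_input[inp] = state
        prev = inp

# trial_1 = ["A", "B", "C"] covers cells (A,B), (B,C);
# trial_2 = ["B", "A", "C"] covers (B,A), (A,C) and cross-checks the states for A and C.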
We discussed the different information sources we can leverage for value generation.
I know @Yicheng-Lu-llll and @kevin85421 perhaps have some black magic in mind. But, in this issue, let me write down the generation based on value constraints (e.g., data type, data range, semantic types such as image and filepath, etc).
To be able to generate input CR diff, there are two problems to solve:
Structure is typically the simpler problem to solve, because what we need is a definition of the structure. The API definition, described in #15, is a complete programmatic definition of the structure.
The CRD could also have the definition, but @tylergu finds that in some projects, the CRD is a partial definition, rather than a complete one.
There is a debate on whether to use CRD or API definition (which I address below). But, no matter whether we start from CRD or from API definition, the problem is straightforward.
The essential problem for generating different good and bad values is to learn the constraints of the value. For an operator, there are many different information sources we can aggregate and leverage, including:
In the current acto, the values are "hand-coded" based on the docs (see #13). If we look at prior research papers, prior work leverages all of the above information.
The ideas are all there; the main question is how to apply them to build a practical system.
Each information source has different tradeoffs. One needs much more expensive analysis than the other.
The practice is always to start from the simplest and move to the hardest, so that we can always understand the cost-vs-benefit -- a great question asked by @wangchen615 during the meeting was, "what do you gain by using API definitions over CRD?"
@tylergu later provided examples of why API definitions are likely more complete than CRDs. On the other hand, he also agrees that CRDs could have different information.
It is clear that the CRD is much cheaper to use (it's an independent YAML file) than API definitions (Go source code which needs constructors).
So, the agreement is to start from the CRD and then (or meanwhile) investigate how to use API definitions.
This could help answer @wangchen615's question about what additional benefit the API definition brings over the CRD.
Also, given the highest priority being #13 , a CRD-based input generation can lead to a quick prototype to make the CPUs busy.
The general idea of the input generation is to generate items recursively. The root of our input generation is the CR.spec object.
Then we will have a generic generate() function. Example:
import random

def generate(schema):
    # Leaf types: pick a concrete random value.
    if schema.type == "integer":
        return random.randint(0, 1000)
    if schema.type == "boolean":
        return random.choice([True, False])
    # Object types: recursively generate every child field.
    ret = {}
    for name, child in schema.fields.items():
        ret[name] = generate(child)
    return ret
Consider the following API type definition:
type RabbitmqClusterSpec struct {
Replicas *int32 `json:"replicas,omitempty"`
Image string `json:"image,omitempty"`
Service RabbitmqClusterServiceSpec `json:"service,omitempty"`
}
type RabbitmqClusterServiceSpec struct {
Type corev1.ServiceType `json:"type,omitempty"`
Annotations map[string]string `json:"annotations,omitempty"`
}
Acto will call generate(RabbitmqClusterSpec), which will generate an integer and a string, and call generate(RabbitmqClusterServiceSpec), which in turn generates a corev1.ServiceType and a map of string to string.
Some fields have constraints among their children; we can override the generate function for these fields to express those constraints.
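For example, a hedged sketch of such an override (the ResourceSchema class and the requests-must-not-exceed-limits constraint are illustrative assumptions, not actual Acto code):

import random

class ResourceSchema:
    # Schema node whose children are constrained: requests must not exceed limits.
    def generate(self):
        limit_cpu = random.choice([1, 2, 4])
        request_cpu = random.choice([c for c in [1, 2, 4] if c <= limit_cpu])
        return {
            "limits": {"cpu": limit_cpu},
            "requests": {"cpu": request_cpu},  # constraint: requests <= limits
        }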
The input generation is going to be a non-trivial amount of work. We want to learn the structure of the input CR and then generate the inputs structurally. Even after we have the framework to generate inputs structurally, we still won't be able to generate the inputs fully automatically without some human guidance. I don't think we will ever be able to fully automate the input generation; our goal should be to reduce the human effort for input generation as much as possible.
There are two options to use as the guidance for learning the input structure: CRD or API definition. Here are my thoughts on the trade-offs:
I was trying to patch /spec/override/statefulSet/spec/template/spec/affinity to null to delete the affinity rule that was previously specified for the rabbitmqCluster under /spec/affinity. It seems this null value does not get propagated into the Go value when unmarshalling. This causes the affinity rule deletion not to be applied to the statefulSet.
Steps to reproduce this behavior:
kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: sample
spec:
image: "rabbitmq:3.8.21-management"
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- sample
topologyKey: kubernetes.io/hostname
persistence:
storage: 20Gi
replicas: 2
kubectl patch rabbitmqCluster sample --type merge --patch-file patch.yaml
patch.yaml:
spec:
override:
statefulSet:
spec:
template:
spec:
containers: []
affinity: null
"override": {
"statefulSet": {
"spec": {
"template": {
"spec": {
"containers": []
}
}
}
}
}
and in the statefulSet spec, the affinity rule is not deleted.
The root cause is that override.statefulset.spec.template is first unmarshalled into *corev1.PodSpec, and then the operator marshals it back to JSON so that it can apply a JSON strategic merge. However, the affinity field is omitted due to the omitempty rule in the corev1 API, so the patch no longer has affinity: null in it. The essential thing that's missed here is that, for a JSON patch P, marshal(unmarshal(P)) does not necessarily equal P.
A very similar issue (rabbitmq/cluster-operator#741) was reported before, but the fix was specifically for securityContext. This problem needs a systematic fix (e.g., marshal the override field into raw JSON format), because there are dozens of other fields in the corev1 API that could cause the same issue.
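A minimal Python sketch of this round-trip pitfall (drop_empty only simulates Go's omitempty marshalling behavior; it is an illustrative stand-in, not the operator's code):

import json

def drop_empty(obj):
    # Simulate marshal(unmarshal(P)) through a struct whose fields are all
    # tagged omitempty: null/empty members disappear from the re-marshalled JSON.
    if isinstance(obj, dict):
        return {k: drop_empty(v) for k, v in obj.items() if v not in (None, [], {})}
    return obj

patch = json.loads('{"affinity": null, "containers": []}')
round_tripped = drop_empty(patch)
print(round_tripped)            # {} -- the "affinity": null deletion is silently lost
assert round_tripped != patch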
Haven't finished Spark-on-k8s-operator, but here is a running list of things I learnt from it so far:
We get the custom resource object from the k8s server after each test to compute the delta. When I was inspecting the deltas, I noticed some strange changes:
This is the initial CR yaml we submit:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: hello-world
spec:
resources:
requests:
cpu: 1
memory: 4Gi
limits:
cpu: 1
memory: 4Gi
tls:
caSecretName: null
disableNonTLSListeners: false
secretName: null
skipPostDeploySteps: false
tolerations: null
After submitting this CR yaml, the custom resource object we get from the k8s server is:
"resources": {
"limits": {
"cpu": "1",
"memory": "4Gi"
},
"requests": {
"cpu": "1",
"memory": "4Gi"
}
},
"secretBackend": {},
"service": {
"type": "ClusterIP"
},
"terminationGracePeriodSeconds": 604800,
"tls": {}
Note that the cpu fields under resources/limits and resources/requests are strings, and tls is empty even though we specified disableNonTLSListeners: false in our CR yaml. The response also omits skipPostDeploySteps, while our CR yaml input has skipPostDeploySteps: false.
Then, if we change resources/limits/cpu to 2 and submit the changed CR yaml, the custom resource object we get from the k8s server becomes:
"resources": {
"limits": {
"cpu": 2,
"memory": "4Gi"
},
"requests": {
"cpu": 1,
"memory": "4Gi"
}
},
"secretBackend": {},
"service": {
"type": "ClusterIP"
},
"skipPostDeploySteps": false,
"terminationGracePeriodSeconds": 604800,
"tls": {
"disableNonTLSListeners": false
}
Note that the cpu fields under resources/limits and resources/requests are integers now, and the object now reflects disableNonTLSListeners: false and "skipPostDeploySteps": false.
It's hard for me to figure out why this happens, and it currently interferes with our system state delta computation.
One guess I have is that false is probably the empty value for omitempty on bool fields, so such fields are omitted when the server dumps the CR spec. When we change the resources/limits/cpu field, it triggers the server to dump the CR spec in a different way.
I ran Acto with the following command:
python3 acto.py --candidates data/rabbitmq-operator/candidates.yaml --seed data/rabbitmq-operator/cr.yaml --operator data/rabbitmq-operator/operator.yaml --duration 1
The python script generates new YAML files periodically (mutate-0.yaml, mutate-1.yaml, ...). However, I did not see any CR in my "rabbitmq-system" namespace.
kubectl get RabbitmqCluster -n rabbitmq-system
Hence, I applied all 8 generated YAML files manually (mutate-0.yaml ~ mutate-7.yaml). None of them could be deployed successfully, as shown in the following figures. Kubernetes tells us the problems are located in candidates.yaml due to type errors.
I ran Acto for 3 hours. It ran 65 tests and produced 25 alarms.
All of the alarms are from our system state oracle.
For 19 out of 25: Acto didn't find any matching field in the system state deltas for the input delta.
For 6 out of 25 alarms: Acto found some matching fields, but the value changes were different.
See here: #39
3 are caused by changing a complex object: when we change a complex object to null, changes are reflected at a lower level.
Concretely, consider the following example, where we changed secretBackend from null to a new object:
"root['spec']['secretBackend']": {
"prev": null,
"curr": {
"vault": {
"annotations": {
"key": "random"
}
}
}
}
Then we have the following system state delta:
"root['test-cluster-server']['spec']['template']['metadata']['annotations']['key']": {
"prev": null,
"curr": "random"
}
...
Acto tries to find a matching field based on the input delta's path ['spec']['secretBackend'], but the system state delta is at a lower level.
In the system state delta, the path is ...['annotations']['key']. To match these two fields, we need to flatten the dict in the input delta before field matching.
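A small sketch of such a flattening step (Python; flatten_delta is an illustrative name, not Acto's actual function):

def flatten_delta(value, path=""):
    # Expand a nested dict value into (path, leaf value) pairs so that an input
    # delta at ['spec']['secretBackend'] can be matched against lower-level
    # system state deltas.
    if isinstance(value, dict):
        items = []
        for key, child in value.items():
            items.extend(flatten_delta(child, path + f"['{key}']"))
        return items
    return [(path, value)]

# flatten_delta({"vault": {"annotations": {"key": "random"}}})
# -> [("['vault']['annotations']['key']", 'random')]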
10 are caused by a change which is rejected by the operator: scale down and shrink volume, key-value delimiter not found.
1 is caused by changing from a default value to null (this is effectively no change, but we were not aware of the default value). We need to be aware of default values.
1 is caused by a field that does not affect the application's state (configuration of the operator itself)
2 are caused by a bug in our input generation
1 needs further inspection
3 are caused by lack of canonicalization when comparing dictionaries: easy to fix, canonicalize fields when comparing dicts (see the sketch after the example below).
We need canonicalization when comparing dictionaries, e.g.:
requiredDuringSchedulingIgnoredDuringExecution != required_during_scheduling_ignored_during_execution
"root['spec']['affinity']['podAntiAffinity']": {
"prev": {
"requiredDuringSchedulingIgnoredDuringExecution": [
{
"labelSelector": {
"matchExpressions":...
}
}
]
},
"curr": null
}
"root['test-cluster-server-0']['spec']['affinity']['pod_anti_affinity']": {
"prev": {
"required_during_scheduling_ignored_during_execution": [
{
"label_selector": {
"match_expressions":...
}
}
]
},
"curr": null
}
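Canonicalization could be as simple as normalizing every key to snake_case before comparison; a minimal sketch (the function names are illustrative):

import re

def to_snake_case(name):
    # requiredDuringSchedulingIgnoredDuringExecution
    # -> required_during_scheduling_ignored_during_execution
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

def canonicalize(value):
    # Recursively normalize all dict keys so camelCase and snake_case compare equal.
    if isinstance(value, dict):
        return {to_snake_case(k): canonicalize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value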
2 are caused by comparing null to a default value: we need to be aware of default values when comparing with null
"root['spec']['image']": {
"prev": null,
"curr": "random"
}
resulted in:
"root['test-cluster-server']['spec']['template']['spec']['containers'][0]['image']": {
"prev": "rabbitmq:3.8.21-management",
"curr": "random"
}
1 is caused by 0 != '0': easy to fix
I was trying to modify the service's annotations via /spec/service/annotations. After changing a key-value pair in annotations from key1: value1 to key2: value2, I noticed that key2: value2 is added under the Service's metadata/annotations correctly, but key1: value1 is still present.
Steps to reproduce this behavior:
kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/latest/download/cluster-operator.yml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: sample
spec:
image: "rabbitmq:3.8.21-management"
persistence:
storage: 20Gi
replicas: 2
service:
annotations:
key1: value1
type: ClusterIP
Use kubectl apply to apply the following changed yaml file. Note that /spec/service/annotations is changed from key1: value1 to key2: value2:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
name: sample
spec:
image: "rabbitmq:3.8.21-management"
persistence:
storage: 20Gi
replicas: 2
service:
annotations:
key2: value2
type: ClusterIP
Run kubectl get services sample -o yaml to get the state of the deployed Service:
apiVersion: v1
kind: Service
metadata:
annotations:
key1: value1
key2: value2
creationTimestamp: "2022-02-15T23:40:25Z"
labels:
app.kubernetes.io/component: rabbitmq
app.kubernetes.io/name: sample
app.kubernetes.io/part-of: rabbitmq
...
Note that under the /metadata/annotations field, both key1: value1 and key2: value2 are present.
Expected behavior
After applying the change, the Service should not have the key1: value1 annotation.
We are adding more operators to acto; the following is the tracker:
https://docs.google.com/spreadsheets/d/1qeMk4m8D8fgJdI61QJ67mBHZ9m3gCD-axcJB567z5FM/edit#gid=0
Here are the main points we discussed during our meeting:
After the discussion, here are the action items:
I know @tylergu has been thinking about @wangchen615's question on the novelty of the project since last Thursday. I love the question, which really pushes the students to think hard and deeper. On the other hand, let me spend some time to clarify novelty in the context of systems research. Oftentimes, the "novelty" argument is confused or even abused.
Vijay Chidambaram (whom many of you like and worked with during SOSP'21) wrote a great summary of novelty, https://twitter.com/vj_chidambaram/status/1395086227204939780, with which I can't agree more. Let me quote the points:
So, as we have all laughed at some bad examples I showed in Siebel 3111 this afternoon, it is ridiculous to use a microscope to look at one piece of a large system and say, "hey, that piece is not novel." Literally, if you do that, you will find no systems research is novel.
We have an extremely novel problem -- AFAIK this is among the first work addressing reliability of operation programs (i.e., operators) of large-scale infrastructures. I have been wanting to do such a project for a long time since I was working at FB and helped build two of their DR operators (Taiji and Maelstrom). But, I didn't find a good way. Apart from the overhead of engaging with 5 teams to look at their operator code, academically it is hard to be generic by looking at one company's 5 operators. A bigger blocker was that FB didn't have a unified control-plane framework/API (at least when I was there), so it's easier to study the operator code and bugs, but hard to build something generally applicable and highly impactful.
Kubernetes operators provide a golden opportunity, as we have all seen and been excited about.
We already have a beautiful story to tell, which we almost wrote for Sieve. We later went down a different route when writing the Sieve paper (which is also successful). We are going to tell the story for Acto this time!
In summary, if done successfully, Acto will be the first research addressing reliability of operation programs (operators) of large-scale cloud systems, and will be the first fully automatic testing tool for unmodified Kubernetes controllers.
I have made the point of the novelty in synthesis in the preamble. Now let me clarify a few things.
I hope to clarify that our current input generation based on CR definitions is NOT novel.
I discussed input generation in #16. One point I made is that there are many information sources where we can extract semantics from and each requires different techniques with different tradeoffs.
A more important point I made is that we have to build Rome brick by brick. It's hard to think about the dome without building the groundwork. That's the reason I have been pushing the team to first build the very basic input generation, and it turns out that even the basic one is nontrivial. Once we have the basic pieces which can correctly generate the structure and some constraints, let's experiment with them to understand how we can do better using more novel techniques. In this way, we can justify our novelty as more than a Rococo decoration with the sole purpose of fooling junior reviewers.
In other words, in my own experience, novelty comes from deep understanding and careful evaluation; novelty without understanding and evaluation is likely to be useless.
And, I have never run into problems coming up with novelty -- almost all the failures in the past were caused by the team not being able to deliver the understanding.
The oracle part of the project is inherently novel. I believe I shared with you my fascination with intent-driven networking, such as the Robotron work from FB and Intentionet from the network verification folks. It's incredibly exciting to explore whether it's ever possible to apply the same principle to software programs. If there is a way to do that, the best bet is Kubernetes due to its declarative design. It will be a dream if we can exploit the declarative nature and build automated "state-centric" oracles. I hope you see the significance!
Certainly, "declarative design" does not make the problem anything easy as shown by the many practical problems we have already encountered (noises, canonicalization, and reasoning about high-order semantics).
Note that I have been pushing the team to focus more on the input generation over the oracle, not at all because the oracle problem has been solved, but because the effectiveness of oracles needs to be understood with large-scale test results driven by the generated inputs.
1. Will Acto be scooped by Sieve, because Sieve is the first work on Kubernetes operators?
Not at all! First, the two papers have drastically different perspectives: Sieve focuses on controllers from a cluster manager's perspective, while Acto focuses on operators from an operation perspective. Sieve reasons about behavior under faults, and Acto checks correctness and expectations without faults.
Certainly, there are great synergies between Acto and Sieve -- at a very high level, Acto can generate workloads for Sieve, which are written manually now.
2. What conferences can we target?
Any systems conference, such as OSDI/SOSP, ASPLOS, EuroSys, USENIX ATC, etc., depending on the quality of the work and the solidness of the evaluation. Timeline-wise, ASPLOS looks like a good one.
3. You are talking about operator code -- but Kubernetes operators << operators!
Indeed, I'm talking about operators at large. But, shouldn't a moonshot be done step by step?
IMO, if we can build a solution for every Kubernetes operator, it is already a biiiiig deal and highly impactful! @wangchen615 can confirm that :)
And, after we know how to build an effective solution for all Kubernetes operators, we can certainly think about more unbounded operators (for which we may need to connect with companies like FB who write proprietary operators).
4. Coming from a SE background, I don't see novelty of SE techniques -- Darko has done a number of mutation testing research and has the Korat tool for generating objects (even Alloy has been used for many systems).
Let's not submit to SE conferences then // joking
More seriously, I believe Darko will be very happy if some of the ideas of Korat or his mutation testing research can find their souls in our work, though I'm skeptical. I hope we view research, at least systems research, as a process of standing on the shoulders of prior work to solve new, important problems, rather than denying prior work and calculating the credits.
When I change the fields under spec.config, zookeeper-operator does not issue a rolling update to reflect the changed config. For example, if I change the spec.config.commitLogCount field to 100, the operator reconciles the configMap to reflect the change. So in the pod, /conf/zoo.cfg, which is where the configMap is mounted, has the commitLogCount field set to 100. But /data/conf/zoo.cfg, which is the config actually used by zookeeper, still has commitLogCount set to the default value of 500.
apiVersion: "zookeeper.pravega.io/v1beta1"
kind: "ZookeeperCluster"
metadata:
name: "zookeeper"
spec:
replicas: 3
apiVersion: "zookeeper.pravega.io/v1beta1"
kind: "ZookeeperCluster"
metadata:
name: "zookeeper"
spec:
replicas: 3
config:
commitLogCount: 100
/data/conf/zoo.cfg:
metricsProvider.exportJvmInfo=true
dataDir=/data
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
syncLimit=2
commitLogCount=500
metricsProvider.httpPort=7000
snapSizeLimitInKb=4194304
standaloneEnabled=false
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
initLimit=10
minSessionTimeout=4000
snapCount=10000
admin.serverPort=8080
autopurge.purgeInterval=1
maxSessionTimeout=40000
maxCnxns=0
globalOutstandingLimit=1000
reconfigEnabled=true
skipACL=yes
autopurge.snapRetainCount=3
tickTime=2000
quorumListenOnAllIPs=false
preAllocSize=65536
maxClientCnxns=60
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.100000010
root@zookeeper-0:/data/conf# cat zoo.cfg | grep con
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
reconfigEnabled=true
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.100000010
should-have
Zookeeper-operator is missing the functionality to restart the pods when config is changed.
We suggest attaching the hash of the config as an annotation to zookeeper's statefulSet template, so that when the config changes, the changed annotation triggers the statefulSet's rolling update.
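A minimal sketch of the suggested fix in Python pseudo-code (the annotation key zookeeper.pravega.io/config-hash and the helper names are assumptions for illustration):

import hashlib
import json

def config_hash(config):
    # Stable hash of the zookeeper config that is rendered into the configMap.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def reconcile_statefulset(sts, config):
    # Stamping the hash into the pod template's annotations means any config
    # change alters the template, which triggers the statefulSet's rolling update.
    annotations = sts["spec"]["template"]["metadata"].setdefault("annotations", {})
    annotations["zookeeper.pravega.io/config-hash"] = config_hash(config)
    return sts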
From Tyler
"I recently made a change to include more information when returning errors: b68a87c
This change may require you to change some code in acto_helm.py, like I did to acto.py in the commit."
@kevchentw and I found a weird behavior of cass-operator. We recognized the behavior as a bug after discussion with @tylergu.
As mentioned in k8ssandra/cass-operator#150, the resource configuration of server-config-init can be set by configBuilderResources. We realized that under certain conditions, the new pods do not necessarily use the most up-to-date resource configuration. For example, as shown in the steps below, when we scale up the cluster and change the resource configuration at the same time, the newly spawned pod does not use the updated configBuilderResources.
We also found that even if we explicitly separate the configBuilderResources change and the scale-up change into two steps, if there are pods that are not ready yet during the two operations, the statefulSet won't get updated immediately and the newly spawned pods still use the old configBuilderResources.
# Step1: Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.7.1/cert-manager.yaml
# Step2: Install operator
kubectl apply -f init.yaml
kubectl apply --force-conflicts --server-side -k 'github.com/k8ssandra/cass-operator/config/deployments/cluster?ref=v1.10.3'
# Step3: Apply custom resource
kubectl apply -f cr1_spec.yaml
# Step4: Check CR for the "Config Builder Resources" field => The field is the same as cr1_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter
# Config Builder Resources:
# Requests:
# Cpu: 512m
# Memory: 100m
# Step5: Check statefulset for the resource request config of server-config-init => Same as cr1_spec.yaml
kubectl describe statefulsets.apps cluster1-cassandra-datacenter-default-sts
# server-config-init:
# Image: datastax/cass-config-builder:1.0.4-ubi7
# Port: <none>
# Host Port: <none>
# Requests:
# cpu: 512m
# memory: 100m
# Step6: Update cassandra-datacenter
kubectl apply -f cr2_spec.yaml
# Step7: Check CR for the "Config Builder Resources" field => The field is the same as cr2_spec.yaml
kubectl describe cassandradatacenters.cassandra.datastax.com cassandra-datacenter
# Config Builder Resources:
# Requests:
# Cpu: 1024m
# Memory: 200m
# Step8: Check Pods for the resource request config of server-config-init => Not the same as cr2_spec.yaml
kubectl describe pod cluster1-cassandra-datacenter-default-sts-0
# server-config-init:
# Image: datastax/cass-config-builder:1.0.4-ubi7
# Port: <none>
# Host Port: <none>
# Requests:
# cpu: 512m
# memory: 100m
kubectl describe pod cluster1-cassandra-datacenter-default-sts-1
# server-config-init:
# Image: datastax/cass-config-builder:1.0.4-ubi7
# Port: <none>
# Host Port: <none>
# Requests:
# cpu: 512m
# memory: 100m
init.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
# Changing the name to server-storage is the only change we have made compared to upstream
name: server-storage
provisioner: rancher.io/local-path
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
cr1_spec.yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
name: cassandra-datacenter
spec:
clusterName: cluster1
config:
cassandra-yaml:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
role_manager: org.apache.cassandra.auth.CassandraRoleManager
jvm-options:
initial_heap_size: 800M
max_heap_size: 800M
configBuilderResources:
requests:
memory: 100m
cpu: 512m
managementApiAuth:
insecure: {}
serverType: cassandra
serverVersion: 3.11.7
size: 1
storageConfig:
cassandraDataVolumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 3Gi
storageClassName: server-storage
cr2_spec.yaml (update size and configBuilderResources):
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
name: cassandra-datacenter
spec:
clusterName: cluster1
config:
cassandra-yaml:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
role_manager: org.apache.cassandra.auth.CassandraRoleManager
jvm-options:
initial_heap_size: 800M
max_heap_size: 800M
configBuilderResources:
requests:
memory: 200m
cpu: 1024m
managementApiAuth:
insecure: {}
serverType: cassandra
serverVersion: 3.11.7
size: 2
storageConfig:
cassandraDataVolumeClaimSpec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 3Gi
storageClassName: server-storage
In Step 8, I expected that server-config-init should have the same resource request configuration as cr2_spec.yaml. As mentioned in k8ssandra/cass-operator#150, the resource configuration of server-config-init can be set by configBuilderResources. We updated the variable size in cr2_spec.yaml, and thus the number of Pods increases from 1 to 2. The new Pod triggers the init containers, which run before app containers (k8s doc). Hence, at least the new Pod should reflect the new resource requirements specified in cr2_spec.yaml (memory: 200m, cpu: 1024m).
Cass Operator version:
docker.io/k8ssandra/cass-operator:v1.10.3
Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:11:29Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
minikube start --vm-driver=docker --cpus 4 --memory 4096 --kubernetes-version v1.21.0
The function ReconcileAllRacks updates the Replica number of the StatefulSet (i.e., size in cr1_spec.yaml and cr2_spec.yaml) before updating the podTemplate in the StatefulSet. Hence, when we update the fields size and configBuilderResources at the same time, the new Pod is created with a stale podTemplate.
To elaborate, the function ReconcileAllRacks updates the Replica number of the StatefulSet at reconcile_racks.go#L2416 and updates the podTemplate in the StatefulSet at reconcile_racks.go#L2440. In my opinion, we need to swap the order of L2416 and L2440, and then the new Pod will be spawned with the new podTemplate.
Develop a basic framework for end-to-end testing
Pipeline: construct test environment -> deploy operator -> generate test input -> submit to operator -> check results -> ...
Use Python for faster development
@marshtompsxd and @tianyin met with Kevin (@kevin85421) and Kuan-Yin (@kevchentw) today.
We understand more about the proposal of using a state-machine based approach to check event sequences (mentioned in #14).
I think the consensus is that we lack killer bugs that would support the advanced oracle. Moreover, automatic input generation (see #15) is still required.
The decision is that we will all work together to build the basic testing approaches in acto, experiment with them on multiple operators, and evaluate the results. During the process, we will accumulate the experience and insights to develop targeted techniques. We will put aside the state machine.
It's great that we finally reached a point of working together.
The plan is that @kevin85421, @kevchentw, and @Yicheng-Lu-llll will each port an operator of their choice. @kevchentw and @Yicheng-Lu-llll want to use operators from Sieve (the list can be found here: https://github.com/sieve-project/sieve/tree/main/examples).
Then, they will also help develop the oracles (#14) and input generation (#15) as they need them to experiment with their operators.
Today's meeting was cancelled, so I want to write this issue to sync up on the progress.
There is actually a lot to sync; this summary mainly contains two parts:
If spec.vault.PKIIssuerPath is empty, then spec.vault.altName has no effect. (Depends on whether this bug should be considered a duplicate bug or a different bug.)
Acto found a bug that's very similar to the annotation bug. (In the previous bug, users can only add annotations to the service by specifying the spec.service.annotation field; when users delete the values, the annotations are not removed from the service.)
In this new bug, the annotation is added in a different place (spec.secretBackend.vault.annotations), and it is added to the pod instead of the service.
The root cause of the two bugs is the same, because both code paths call the same utility function, which contains the root cause.
Acto generated persistence.storage as ".07893E985.90504", which results in the following error:
E0421 07:19:52.910713 1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta1.RabbitmqCluster: failed to list *v1beta1.RabbitmqCluster: v1beta1.RabbitmqClusterList.Items: []v1beta1.RabbitmqCluster: v1beta1.RabbitmqCluster.Spec: v1beta1.RabbitmqClusterSpec.Rabbitmq: v1beta1.RabbitmqClusterConfigurationSpec.Persistence: v1beta1.RabbitmqClusterPersistenceSpec.Storage: unmarshalerDecoder: quantities must match the regular expression '^([+-]?[0-9.]+)([eEinumkKMGTP]*[-+]?[0-9]*)$', error found in #10 byte of ...|985.90504"},"rabbitm|..., bigger context ...|de":{},"persistence":{"storage":".07893E985.90504"},"rabbitmq":{"additionalConfig":"cluster_partitio|...
spec.tolerations[0].key, and it's matched to another irrelevant field whose name is also key.)
spec:
configBuilderResources:
requests:
memory: 100m
cpu: 512m
size: 1
which results in the following application state:
server-config-init:
Image: datastax/cass-config-builder:1.0.4-ubi7
Port: <none>
Host Port: <none>
Requests:
cpu: 512m
memory: 100m
spec:
configBuilderResources:
requests:
memory: 1000m
cpu: 512m
size: 1
but the application state does not change.
spec:
nodeAffinityLabels:
mqvoj: chxieobujr
which results in the application state:
affinity:
node_affinity:
required_during_scheduling_ignored_during_execution:
node_selector_terms:
match_expressions:
key: mqvoj
operator: In
values: chxieobujr
Note that this is a bad input, because no node in the cluster has the label mqvoj: chxieobujr. It renders the application unavailable because all the pods become unschedulable.
4. Then Acto tries to delete this label:
spec:
nodeAffinityLabels:
mqvoj: null
But the application state does not change. This is because the operator wants to wait for all the pods to become ready before updating the application state, but it is stuck because all the pods are unschedulable due to the bad nodeAffinityLabels value. And the user is unable to remove the value, because the operator insists on waiting for the pods to become ready before updating the statefulSet.
As described in #65, we have two pruning strategies to prune the input space.
We can use program analysis to implement the pruning strategies.
The program analysis can be separated into two parts:
Let's first assume we have the perfect mapping between variables and the input fields.
Motivation for this pruning strategy: When designing the operator's input, operator developers allow users to specify some Kubernetes-generic functionalities, e.g., affinity, persistent volumes, etc. Inside the operator's logic, the operator does not handle these fields; instead, it simply passes them over to Kubernetes' other controllers to handle. For such fields, we can safely prune their children.
To implement this strategy concretely, we are looking for specific patterns:
resource := corev1.someResource{
    Affinity:    spec.Affinity,
    Tolerations: spec.Tolerations,
    ...
}
In the source code, we look for struct initializations. Inside a struct initialization is a list of keyValueExprs. (An example of a keyValueExpr is Affinity: spec.Affinity, where Affinity is the key and spec.Affinity is the value.)
We loop through the keyValueExpr list and check if the value is a SelectorNode. (A SelectorNode is a type of AST node; it specifically has the pattern someStruct.someField. For example, spec.Affinity is a SelectorNode, because it accesses the Affinity field of the spec variable.)
If the value is a SelectorNode, we then check its expression. If the SelectorNode is accessing a field of the input variable, then we have found a field for pruning. In the example shown above, spec.Affinity and spec.Tolerations are both SelectorNodes, and they are both accessing a field of the input, because the variable spec maps to the input.
After finding such SelectorNodes, we can use the expression and the mapping to calculate the exact field that each SelectorNode is accessing. After getting the exact field, we can prune the children of that field.
I am not very confident about the reasoning behind the second pruning strategy.
Motivation for this pruning strategy: Consider the simplified code snippet below from rabbitmq-operator:
mergedSpec = json.JsonStrategicMerge(spec, rabbitmq.spec.override.spec)
newSts.spec = mergedSpec
In the source code, we see that the field spec.override.spec from the input is accessed by the operator's logic. So by exercising the field spec.override.spec in our test cases, we are testing this part of the operator's logic.
However, the field rabbitmq.spec.override.spec has ~1000 subfields, and it's too expensive to exhaustively test all of them.
This example motivates us to come up with a cost model for fields in the input.
The high-level idea is that testing a part of the input has both a cost and a benefit associated with it.
The cost of testing a part of the input is simply the number of test cases that need to run.
The benefit of testing a part of the input is the coverage it adds on the operator logic.
In the previous example, we are paying ~3000 test cases to exercise only ~2% of the operator logic.
But it's hard for us to compute the exact amount of coverage gained by testing a field, and we only need a very rough estimate to prune most of the fields. So as a first step, we can treat all field accesses equally: when a field is accessed by the operator's source code, we consider that testing this field gains 1 unit of logic coverage.
Then, by analyzing the source code, we can find all the fields that are directly accessed by the operator's source code. By calculating how many test cases are needed for testing each of these fields, we get a cost model.
To implement this analysis concretely, we need to get all the fields that are directly accessed by the operator's source code.
We can achieve this by traversing the AST of the operator code and finding all SelectorNodes in the AST. If a SelectorNode accesses a field in the input, we consider that testing this field exercises 1 unit of the operator logic. Then we compute the number of test cases needed to test this field; if the number is larger than our threshold, we prune the field.
The logic of finding all the directly accessed fields can be expressed as below:
functionSrc <- source code for each function
mapping <- mapping between variables and the fields in the input
accessed_fields <- buffer for returning the result
procedure GET_ACCESSED_FIELDS(functionSrc, mapping, accessed_fields):
    inputs = getInputs(functionSrc, mapping) // get the variables in this function that are considered part of the input
    // traverse the AST of this function
    astNodes = parseAST(functionSrc)
    for node in astNodes:
        if node is Selector: // node is in the format of a.b.c
            if node[0] in inputs:
                field = getField(node[0], mapping) // use the mapping to get the corresponding root field
                for key in node[1:]: // walk the remaining selector keys
                    field = field[key]
                accessed_fields.append(field) // record the accessed field
How Acto currently runs tests:
It continuously changes the input and uses the oracle to check if the operator's output is correct. When it reports an alarm, it restarts from the seed input and continues with the remaining test cases.
The idea of back-and-forth testing is based on the declarative nature of Kubernetes operators, where in the CR, the users specify the desired input.
This declarative nature means that no matter what the previous condition of the application is, if the user submits input A, the resulting application state should always be the same.
This idea enables us to have a new oracle. When Acto reports an alarm and starts a new testing round, instead of the seed input, Acto can start with some other previous input. For example, as shown in the diagram below:
Acto can start with Input A instead, and it will then have two application states for Input A taken at different times. Acto can compare these two application states; if they are different, Acto reports an alarm.
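A minimal sketch of this extra oracle (Python; collect_state is a hypothetical helper that applies an input and dumps the resulting application state):

seen_states = {}   # input id -> application state recorded the first time it ran

def check_revisit(input_id, collect_state):
    # Apply input_id again and compare against the state it produced before.
    state = collect_state(input_id)
    if input_id in seen_states and seen_states[input_id] != state:
        # Level-triggering violated: same desired spec, different actual state.
        print(f"alarm: revisiting {input_id} produced a different state")
    seen_states[input_id] = state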
mutate() after gen(), then integrate it into the test pipeline.
TODO: I will modify the input generation to make sure the namespace is correct.
Let me write some thoughts on the oracle designs.
There are four ideas we have.
O1: Checking error messages in operator/application logs (the traditional oracles)
O2: State diff -- check whether the CR diff is reflected in the state diff
O3: State idempotency -- check whether the same state transitions result in the same states; see #18
O4: State machine -- learn a state machine first and check whether the valid state is in the state machine (@kevin85421 is investigating it so I'm not going to discuss it; I love the idea but have also communicated my skepticism).
O3 is proposed by @tylergu (I have asked him to write an issue to organize his thoughts). But it assumes that the system state is strictly idempotent for every CR change, and that assumption seems wrong based on the discussion today. I think we should look into the idempotency semantics of Kubernetes and understand whether certain behavior is expected or buggy. We should create an issue for that. Moreover, O3 is much more costly, as it has a >O(N) overhead.
O1 and O2 are basic oracles.
@tylergu, could you point me to the code where O1 is implemented?
In fact, O1 is not that easy to do, despite the simple idea :) I had a lot of gray hairs when trying to write a common parser for different application logs (due to the formats).
O2 is blocked by #12, and we have to solve it in a good way. I think building O2 is the next thing to achieve. See #13
In this meeting, we went through the problems and progress during the past two weeks:
Goal: To be able to run Acto for an overnight-run or a week-run.
Problem: Acto pulls images too frequently, both for the operator image and the application image, causing ImagePullBackOff
Solution:
- For the operator image, we can preload the operator into the Kind cluster and change the pull policy to be IfNotPresent.
- For the application image, the workaround is to provide an argument option to preload some frequently used images.
Rabbitmq-operator has 1323 fields in its CR.
If we assume no dependency among the fields, only allow one field change at a time, and test three different values for each field, we would need to run 3289 tests.
The exploration strategy Acto currently uses is a random walk: at each step, Acto selects a random field and a random value for that field. This exploration strategy causes us to run a lot of redundant tests.
Comments: We have a huge input space to explore; it's interesting to see how to reduce the exploration space.
We could also explore in a more systematic way: instead of a stateless random walk, we can remember which fields have already been explored and bias towards the fields that have not been explored before.
Problem: The value in the delta could be nested, causing problems when doing value comparison.
Solution: Flatten the dict before comparison
Comments:
- We should make heuristics easily extensible. We may add other heuristics later (e.g. solving the format problem in value comparison), and users can implement their own heuristics.
- This problem is essentially a problem of matching input deltas to system state deltas. There is some possibly related work on object-relational mapping: https://github.com/Frankkkkk/pykorm
When the CR yaml is submitted, it goes through two levels of checks. The first level is the Kubernetes API server: since the CRD was previously registered with the server, the server uses the schema in the CRD to validate the CR yaml. If the CR yaml fails the check here, it is rejected by the server without even reaching the operator code, and the user is prompted with an error message.
Our goal should be to test the operator with both good values and bad values that can pass the server-side check. By testing the operators with bad values, we can check whether the operators handle them properly. If the operators cannot handle bad values properly, there would be a failure.
In fact, Acto currently is already generating a lot of bad values that pass the server-side check.
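As an illustration (using the jsonschema package; the schema fragment is made up for illustration, not copied from any CRD), the server-side check only validates structure, so a syntactically valid but semantically meaningless value passes:

from jsonschema import validate

# The schema only says replaceNodes is an array of strings, so a random
# string that names no real pod passes the server-side validation and
# reaches the operator. (Schema fragment is illustrative.)
schema = {
    "type": "object",
    "properties": {
        "replaceNodes": {"type": "array", "items": {"type": "string"}},
    },
}
validate({"replaceNodes": ["rtiisajufx"]}, schema)  # passes the structural check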
Problem: There was a previous discussion on whether to use the CRD or the API definition to help with input generation.
Acto currently relies on the schema in the CRD to generate structurally correct inputs. The quality of the inputs we generate largely depends on the quality of the schema in the CRD.
We found that some operator developers specify only very opaque schemas in their CRDs. In those cases, the API definition contains the correct structure information, so we also want to take advantage of it.
Solution: Kubebuilder has a feature for generating CRDs automatically, and this feature is cleanly separated out as a CLI.
I was able to generate the CRD for Percona's mongodb-operator with one command line, without modifying the source code.
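For reference, a sketch of driving that CLI from Python (the checkout path and output directory are placeholders; the flags follow controller-tools' controller-gen conventions, so double-check them against the version you install):

import subprocess

# Generate the CRD from the operator's Go types using controller-gen,
# the CLI that Kubebuilder separates out. Paths here are placeholders.
subprocess.run(
    ["controller-gen", "crd", "paths=./...",
     "output:crd:artifacts:config=config/crd/bases"],
    check=True,
    cwd="/path/to/percona-server-mongodb-operator",
)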
hey guys, I'm looking at the recent commits/PRs. I think we really need to talk.
It seems to me that right now the team is split into two different efforts/groups who do not talk: helm and operator porting. The lack of communication leads to some nonsense, e.g., we have redundant pieces of code and issues like #46 -- why are we doing this? It's a waste of time.
I asked @tylergu about helm and he seems not eager to use it. On the other hand, his work is mostly on the testing side. I think we should talk and decouple the deployment and the testing. I'm very open to helm or other deployment practices, as long as they help our project or save us time.
I also noticed @Yicheng-Lu-llll's message on Slack:
Goal for next week:
1. Yicheng Lu: fix bugs when using candidates.yaml
2. Kuan-Yin Chen and Kai-Hsun Chen: helm chart support for ten operators
I'm confused about why we want to scale to 10 operators at this moment.
Likely, I do not understand @kevin85421 and @kevchentw's plan. But let me speak from a 598 course project perspective. This is a testing project, and your work will be evaluated by its test effectiveness (e.g., how many new bugs you find). So you really want to reach the testing phase and get results soon. We discussed input generation before, and I think I convinced you that we should start by building basic techniques rather than jumping to very advanced ones :) I think @tylergu has implemented some basic ones which are already able to find new bugs (https://github.com/xlab-uiuc/acto/blob/main/bugs.md).
So, at this point, we should either work hard on improving the testing technique currently implemented in Acto (the input generation or the oracle), or evaluate the existing techniques on a few other operators. But I don't see us doing either.
In fact, I don't know whether acto_helm.py could even use the testing code @tylergu implemented.
This reflects the sad fact that there is a real lack of communication between the two groups, which I think is bad for both. Let me speak for 598 -- even if you just ran what @tylergu built on a few other operators and were able to find new bugs, I would think it's a cool project (luckily @tylergu is not in 598, so you can take all the credit). But if we only port 10 operators without doing the testing or finding anything useful, the 598 project won't end well.
So my takeaways are:
Let me know your thoughts.
To facilitate future discussion, I want to clarify some terminology that is frequently used when we deal with OpenAPI schemas (which are based on JSON Schema).
A value in JSON can be one of the six primitive types:
- object (dict in python)
- array (list in python)
- number (float/int in python)
- string (str in python)
- boolean (bool in python)
- null (None in python)

property:
A property is a string-value pair in an object. Here, object specifically refers to the JSON primitive type above: dict in python.
Consider this person object:
{
  "age": 15,
  "profession": "student"
}
"age": 15 is a property of this object.
keywords:
JSON schema objects have keywords.
Example:
We have a JSON schema for a person:
{
  "type": "object",
  "properties": {
    "age": { "type": "number" },
    "profession": { "type": "string" }
  },
  "maxProperties": 2
}
"type" is a keyword, and "properties" is also a keyword. "maxProperties" is an assertion keyword that makes sure the value does not have more than 2 properties.
Broadly speaking, keywords mostly fall into these categories:
- identifiers: id
- assertions: maxProperties, type
- annotations: description, default
JSON Schema is a JSON media type for defining the structure of JSON data.
Examples of JSON schema:
{
  "type": "object",
  "minProperties": 2,
  "properties": {
    "first_name": { "type": "string" },
    "last_name": { "type": "string" },
    "birthday": { "type": "string", "format": "date" }
  }
}
{
  "type": "array",
  "items": {
    "type": "number"
  },
  "minItems": 2
}
Some keywords take schemas themselves, allowing JSON Schemas to be nested. For example, the items keyword in the array schema shown above takes another schema.
There are discussions on Slack about the repo selection.
It's indeed non-trivial to select repos that can cover multiple different dimensions.
What I suggest we do is collect as much meta-information as possible so we can make an informed selection together.
Currently, I don't have enough information to make a call or comment based on https://github.com/xlab-uiuc/k8s-operator-bugs/blob/main/k8s-operator-repos.md
What I hope to see is more like a spreadsheet (you can use a Google spreadsheet) that summarizes many potential repos so we can just sit together and select.
I finished inspecting the cass-operator results and realized that cass-operator uses a validation webhook which prevents changes to many fields. I noticed that cert-manager and the validation webhook were not enabled in the current deployment config. Many of the false alarms in the test result could be prevented by enabling the validation webhook, since it would reject the invalid changes up front.
Before 03/04/2022:
What happened?
cass-operator loses part of its functionality after the user mistakenly supplies a wrong pod name under the spec.replaceNodes field. For example, the operator can neither decommission nodes nor do a rolling restart. This is because, in the reconciliation loop, the operator requeues the request if status.NodeReplacements is nonempty before reconciling many other functionalities. Since the pod name under status.NodeReplacements does not exist, the operator is never able to clear status.NodeReplacements. This causes the operator to become partially nonfunctional.
Did you expect to see something different?
The operator should be robust and still able to reconcile other functionalities even when the user submits a wrong pod name for spec.replaceNodes. Alternatively, the operator should do a sanity check on spec.replaceNodes to prevent itself from getting stuck.
How to reproduce it (as minimally and precisely as possible):
1. kubectl apply -f sample.yaml:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
2. Add a non-existent pod name under spec.replaceNodes, kubectl apply -f sample.yaml:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
3. Request a rolling restart via spec.rollingRestartRequested, kubectl apply -f sample.yaml:
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: cassandra-datacenter
spec:
  clusterName: cluster1
  serverType: cassandra
  serverVersion: 3.11.7
  size: 1
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 3Gi
      storageClassName: server-storage
  replaceNodes:
    - rtiisajufx
  rollingRestartRequested: true
Environment
Cass Operator version:
docker.io/k8ssandra/cass-operator:v1.9.0
Kubernetes version information:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes cluster kind:
kind
Manifests:
See the YAML manifests in the reproduction steps above.
Anything else we need to know?:
The bug occurs because the operator keeps requeueing the request while status.NodeReplacements cannot be cleared due to the wrong pod name. We suggest sanitizing the spec.replaceNodes field to make the operator robust.
The bug in zookeeper #59 is caused by the config being translated into a configMap object without the application being restarted to pick up the changed configs. koperator also had a very similar bug.
Our current oracles cannot detect such bugs. In the input delta, Acto would see that a config field changed. Acto then tries to find a matching delta in the system state; it successfully finds one in the configMap object and thus passes the check.
To detect such bugs, I can think of two ways:
For example, scaling down or shrinking a PVC volume: because these changes are rejected by the operator, there is no system delta and Acto is unable to find a matching field.
There are two cases.
In the first case, the input delta value is nested. For example, for the input delta:
"root['spec']['secretBackend']": {
"prev": null,
"curr": {
"vault": {
"annotations": {
"key": "random"
}
}
}
}
the corresponding field in the system delta is:
"root['test-cluster-server']['spec']['template']['metadata']['annotations']['key']": {
"prev": null,
"curr": "random"
}
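Using the flatten() sketch from earlier in these notes, the nested curr value can be reduced to its leaf paths, at which point the leaf value lines up with the system delta (illustrative, not Acto's actual matching code):

# Reusing flatten() from the sketch above: flattening the nested
# input-delta value exposes the leaf that the system delta records.
nested_curr = {"vault": {"annotations": {"key": "random"}}}
leaves = dict(flatten(nested_curr, path="root['spec']['secretBackend']"))
# -> {"root['spec']['secretBackend']['vault']['annotations']['key']": 'random'}
# The leaf value 'random' matches the system delta's curr value.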
In the second case, the previous value in the system state is a default filled in by the operator. For example, the input delta:
"root['spec']['image']": {
"prev": null,
"curr": "random"
}
matches with:
"root['test-cluster-server']['spec']['template']['spec']['containers'][0]['image']": {
"prev": "rabbitmq:3.8.21-management",
"curr": "random"
}
"rabbitmq:3.8.21-management" is the default value used by the operator when image
is null.
- 1 caused by invalid input
- 2 caused by unchanged input
- 1 caused by an unused field
- 2 caused by the field being a config of the operator