cloud-bulldozer / benchmark-operator
The Chuck Norris of cloud benchmarks
License: Apache License 2.0
Move Operator and workloads to shared namespace...
Looks like 2 plays got duplicated.
Also, the couchbase PR currently requires a toggle in the spec section of its CR to indicate whether the cluster is okd or ocp, and the deployment makes the required changes accordingly. I think this toggle should live at a higher level, like cleanup, so that other workloads/infra can make use of it as well.
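As a rough illustration of lifting the toggle out of the workload args (the field name and placement are assumptions, not the current schema):

```yaml
spec:
  # hypothetical: platform toggle promoted to the top level of the CR,
  # alongside fields like cleanup, so any workload/infra role can consult it
  platform: ocp   # or okd
  workload:
    name: couchbase
```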
Hi,
I was trying to tweak main.yaml to understand how fio-bench works, but no change made to main.yaml is reflected once I deploy the CR.
For example:
Even if I just change the name of the ConfigMap to fio-test_shekhar and redeploy the CR, the ConfigMap is still created with the name fio-test.
- name: Generate fio test
  k8s:
    definition:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: fio-test_shekhar
        namespace: '{{ meta.namespace }}'
      data:
        fiojob: "{{ lookup('template', 'job.fio.seq_write') }}"
  when: fio.clients > 0
oc get configmap
NAME DATA AGE
benchmark-operator-lock 0 3h
fio-test 1 1h
Am I missing something??
Leaving pin_node blank caused the client pod not to be created. Adding a value such as pin_node: "ip-10-0-134-143"
in ripsaw_v1alpha1_pgbench_cr.yaml fixed the problem, and the test ran as expected after this change.
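For context, the relevant portion of the CR looked roughly like this once the fix was applied (structure inferred from the oc describe output; the exact field layout is an assumption):

```yaml
spec:
  workload:
    name: pgbench
    args:
      databases:
        - host: 172.30.86.214
          user: user8LH
          db_name: sampledb
          # leaving pin_node empty triggers the NoneType templating error
          pin_node: "ip-10-0-134-143"
```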
$ oc get benchmark
NAME TYPE AGE
pgbench-benchmark pgbench 56s
[ec2-user@ip-172-31-14-128 ripsaw-dustin]$ oc describe benchmark pgbench-benchmark
Name: pgbench-benchmark
Namespace: ripsaw
Labels: <none>
Annotations: <none>
API Version: ripsaw.cloudbulldozer.io/v1alpha1
Kind: Benchmark
Metadata:
Creation Timestamp: 2019-07-08T23:39:49Z
Generation: 1
Resource Version: 91783
Self Link: /apis/ripsaw.cloudbulldozer.io/v1alpha1/namespaces/ripsaw/benchmarks/pgbench-benchmark
UID: ad4debce-a1d9-11e9-acf0-0268146ce15c
Spec:
Workload:
Args:
Clients:
4
8
cmd_flags:
Databases:
db_name: sampledb
Host: 172.30.86.214
Password: wTRfg5vxpmtfkYKA
pin_node: <nil>
Port: <nil>
User: user8LH
init_cmd_flags:
run_time: 300
Samples: 2
scaling_factor: 30
Threads: 4
Timeout: 5
Transactions: <nil>
Name: pgbench
Status:
Conditions:
Last Transition Time: 2019-07-08T23:40:53Z
Message: Running reconciliation
Reason: Running
Status: False
Type: Running
Ansible Result:
Changed: 2
Completion: 2019-07-08T23:40:56.93001
Failures: 1
Ok: 5
Skipped: 6
Last Transition Time: 2019-07-08T23:40:57Z
Message: An unhandled exception occurred while running the lookup plugin 'template'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unexpected templating type error occurred on (---
kind: Job
apiVersion: batch/v1
metadata:
  name: '{{ meta.name }}-pgbench-client-{{ item.0|int + 1 }}'
  namespace: '{{ operator_namespace }}'
spec:
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        app: pgbench-client
    spec:
      containers:
        - name: benchmark
          image: "quay.io/cloud-bulldozer/pgbench:latest"
          command: ["/bin/sh", "-c"]
          args:
            - "export PGPASSWORD='{{ item.1.password }}';
              export pgbench_auth='-h {{ item.1.host }} -p {% if item.1.port is defined and item.1.port|int > 0 %} {{ item.1.port }} {% else %} {{ db_port }} {% endif %} -U {{ item.1.user }}';
              echo 'Init Database {{ item.1.host }}/{{ item.1.db_name }}';
              pgbench $pgbench_auth -i -s {{ pgbench.scaling_factor }} {{ pgbench.init_cmd_flags }} {{ item.1.db_name }};
              if [ $? -eq 0 ]; then
                echo 'Waiting for start signal...';
                redis-cli -h {{ bo.resources[0].status.podIP }} lpush pgb_client_ready {{ item.0|int }};
                while true; do
                  if [[ $(redis-cli -h {{ bo.resources[0].status.podIP }} get pgb_start) =~ 'true' ]]; then
                    echo 'GO!';
                    {% for clients in pgbench.clients %}
                    echo '';
                    echo 'Running PGBench with {{ clients }} clients on database {{ item.1.host }}/{{ item.1.db_name }}';
                    for i in `seq 1 {{ pgbench.samples|int }}`; do
                      echo \"Begin test sample $i of {{ pgbench.samples }}...\";
                      pgbench $pgbench_auth -c {{ clients }} -j {{ pgbench.threads }} {% if pgbench.transactions is defined and pgbench.transactions|int > 0 %} -t {{ pgbench.transactions }} {% elif pgbench.run_time is defined and pgbench.run_time|int > 0 %} -T {{ pgbench.run_time }} {% endif %} -s {{ pgbench.scaling_factor }} {{ pgbench.cmd_flags }} {{ item.1.db_name }};
                    done;
                    {% endfor %}
                  else
                    continue;
                  fi;
                  break;
                done;
              fi"
      restartPolicy: OnFailure
      {% if item.1.pin_node is defined and item.1.pin_node|length and item.1.pin_node is not sameas "" %}
      nodeSelector:
        kubernetes.io/hostname: '{{ item.1.pin_node }}'
      {% endif %}
): object of type 'NoneType' has no len()
Reason: Failed
Status: True
Type: Failure
Events: <none>
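The failure traces to the nodeSelector guard in the template: with pin_node: <nil> in the CR, item.1.pin_node is defined but is None, so the |length filter raises "object of type 'NoneType' has no len()". A None-safe version of the guard might look like this (a sketch, not the merged fix):

```jinja
{% if item.1.pin_node is defined and item.1.pin_node is not none and item.1.pin_node | length > 0 %}
      nodeSelector:
        kubernetes.io/hostname: '{{ item.1.pin_node }}'
{% endif %}
```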
$subject
We should look at publishing the operator, if not now then at least sometime in the future, and operator-courier seems like a good tool to help with that.
Currently Ripsaw attempts to manage other operators, which has been a struggle.
We wanted to have the application be managed via the Operator framework for testing the workload, and to provide an end to end solution.
It causes complexity that we continuously have to work around, such as broken CI, which slows down our ability to accept PRs.
Instead of Ripsaw managing other operators, we could have our CI/Testing use statefulsets to deploy applications like mongodb, and we simply update our YCSB CR to point at the mongodb deployed with statefulsets.
This is particularly important for infra roles. It may not be possible to create one test CR for a role that could be expected to pass on all of K8s upstream, OCP 3.0, and OCP 4.0. If the CI system is going to automate testing of multiple environment types, then we need the ability to create test CRs that are applicable to the specifics of the environment.
Currently, the URLs and versions for the CB operator and pods are defined within the role, and therefore coded into the operator image. We should allow for overriding default values with the CR.
There is a handy k8s shortcut that allows you to kubectl/oc create -f <directory>, which loads all of the yaml files in that directory. The deployment instructions could be simplified by leveraging this shortcut, but the current deploy/ structure is problematic since the CRD needs to be loaded before the operator. If the CRD were in the deploy/ directory, then setup would be as simple as kubectl/oc create -f deploy and the operator would be up and running.
If the existing structure follows some established standard, then we could at least simplify the deployment instructions to say:
# oc create -f deploy/crds/bench_v1alpha1_bench_crd.yaml
# oc create -f deploy
In the CI test.sh script, we call the cleanup_resources function followed by the wait_clean function.
...
source tests/common.sh
trap cleanup_resources EXIT
wait_clean
...
The wait_clean function effectively checks up to 30 times for the operator pod to no longer be running.
function wait_clean {
  for i in {1..30}; do
    if [ `kubectl get pods --namespace ripsaw | grep bench | wc -l` -ge 1 ]; then
      sleep 5
    else
      break
    fi
  done
}
However, the cleanup_resources function does not include commands to stop the operator pod.
function cleanup_resources {
  echo "Exiting after cleanup of resources"
  kubectl delete -f resources/crds/ripsaw_v1alpha1_ripsaw_crd.yaml
  kubectl delete -f deploy
  marketplace_cleanup
}
This results in the wait_clean function simply running through its 30 iterations, doing nothing.
I'm not entirely sure of the intention, so I don't want to propose a solution, but I will point out that the cleanup_operator_resources function is defined and does include the commands to delete the operator pod; however, this function is not actually used anywhere.
function cleanup_operator_resources {
  delete_operator
  cleanup_resources
  wait_clean
}
Need a sweet new logo for Ripsaw!
fio best practice is (generally) to sync and drop caches with every iteration/sample. This is usually done with:
sync; echo 3 > /proc/sys/vm/drop_caches
We could add this as a bash command to each test sequence, or we could include it directly in the fio job file with:
exec_prerun=sync; echo 3 > /proc/sys/vm/drop_caches
It might be worthwhile to make this optional in the CR, on the off chance someone wants to test without it.
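A sketch of how the toggle might surface in the CR (the drop_caches key is hypothetical, not an existing field):

```yaml
spec:
  workload:
    name: fio
    args:
      # hypothetical flag: when true, inject
      # 'exec_prerun=sync; echo 3 > /proc/sys/vm/drop_caches' into the jobfile
      drop_caches: true
```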
We should look at running stockpile/ other tool to collect cluster metadata before we trigger the workload, instead of having a harness run stockpile/other_tool to collect data, I'd propose that we trigger the playbook/role from within the operator itself.
Possible cons of this would be:
The value of threads: in the CR is not being used. Below is how the CR is configured and the output after the CR is deployed.
Pushing fio to find system limits usually involves ramping up the parallelism of worker jobs. You can do this via threads per worker (which should also be implemented as a list; separate issue), but it is often valuable, particularly for a distributed storage SUT, to increase the total number of workers in order to get past any per-worker bottlenecks and properly saturate the storage system.
Implementing the servers key in the CR as a list would allow us to iterate through server counts as an outer loop to samples. This is a bit complicated with the current implementation of the fio-d role, as we first run an ansible task to spin up the number of workers as fio server pods before executing the fio job task. So we would need to be able to spin up servers, run a job, increment the number of servers, and then run the subsequent job from the servers list.
One option is to use the highest value from the servers list and spin up a number of server pods equal to that maximum value once, so that only the fio job task needs to be looped. The downside is that the server pods unused by the lower-value jobs will still be consuming resources in the k8s system and could potentially skew the results.
Another option may be to implement this loop at the playbook.yaml level, similar to the original proposed method for implementing the samples feature.
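For illustration, the servers key as a list might look like this in the CR (a sketch; the current schema takes a single integer):

```yaml
spec:
  workload:
    name: fio-d
    args:
      samples: 3
      # hypothetical: iterate the fio job over increasing server counts
      servers:
        - 2
        - 4
        - 8
```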
We currently use the key pin_server across a few roles to identify the k8s node to which a pod or pods will be pinned with a nodeSelector in the template. I think the name of this key is a little confusing: since the value you pass to pin_server is the name of a k8s node to which you are pinning, on the surface I would argue that it should be called pin_node instead (which is what I used in the pgbench role).
However, in the uperf and iperf roles there is also a pin_client key, so there it is clearer that the key name refers to what you are pinning, not where you are pinning it.
The keys might be clearer in all cases as something like pin_server_to_node and pin_client_to_node, but perhaps I'm overthinking it and we should just address this in documentation. To me, it is valuable to make the usage of the CR structure as self-explanatory as possible in order to enhance usability, which will better drive adoption and limit the pings we get answering howto questions.
PVCs should have unique names between runs. Currently these are claim-{{ item }}, which can at the very least cause some annoyances when deleting and then re-applying the CR, as the operator will trigger creation of a PVC with the same name as one that is still being deleted.
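One hypothetical fix is to fold a per-run identifier into the claim name (the run_uuid variable here is an assumption, not an existing role variable):

```jinja
# hypothetical: suffix each claim with a per-run id so a re-applied CR
# never collides with a PVC that is still terminating
name: 'claim-{{ run_uuid }}-{{ item }}'
```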
Implement size- vs. time-based runs. Currently the CR and templates allow for a time-based run in the jobfile only. A nice-to-have feature would be an option to run based on file size.
Workers/threads should be a list to loop through.
File size should be a list to loop through.
Merge jobfile templates into one. I don't see a good reason to have multiple templates. We should simply be adjusting values in the templates based on the type of job and the specific parameters provided in the CR.
Provide jobname in a better way. It seems like a weird thing to name explicitly as a value in the CR. Maybe jobname can just match job in the CR.
Merge fio and fio-d into one role.
Implement some way of balancing server pods across nodes.
Need to consider how a RWX PV would be tested.
Git rid of redundant pin boolean. Simply providing a pin host or not should be sufficient. Also... Does pinning even make sense in a distributed test? The way this is implemented means all server pods would go to the single pin host.
The filesize in the CR should be passed with the scale, just as the fio command would accept it (i.e., 2g instead of 2).
The jobname in the CR is redundant to job and can be removed.
Also please add CI as admin, as this would enable us to use many more features.
Following the merge of #149, the ycsb load phase is not optional. Making it skippable can be useful when rerunning the same ycsb workloads for testing/debugging purposes.
Since we have the pod-to-pod tests in place, it would be good to have a way to do serviceIP-to-serviceIP tests as well.
Github workflow has the following issues:
I propose that we move to a Gerrithub workflow, as it addresses the above-mentioned problems and also provides additional benefits such as
The only drawback I see is that some of us will need to adapt to the new workflow.
Facilitate fio-d to run a custom job that can be provided as a URL.
I still feel like we have some structural issues that make ripsaw harder to consume than it should be. Exactly what the best practices are to define and structure operator resources seems to be up for debate, but I would recommend we do something similar to rook.io where the operator and all of its k8s resource dependencies are deployed via 2 yaml files -- common.yaml (which houses the namespace definition, CRD, and all RBAC) and operator.yaml. I might even suggest that we simply converge all of this into one operator.yaml file.
I think it is also a bit confusing the way the CR files are named and placed in the structure. I would suggest creating an examples/ directory at the repo root and simplifying the naming convention of the files to just <workload_name>.yaml, <infra_name>.yaml, and <workload_name>-<infra_name>.yaml, as appropriate.
We need to do this in order to get out of maintaining the separate operator_store_results.yaml and the result and pre-result roles. The current implementation for uperf should serve as the design example.
A baseline test that I've commonly used is to fully saturate the network connections between nodes in order to reveal any bottlenecks that may affect higher-layer tests. I'd like to see us add a "mesh" mode to the uperf test in which we intelligently determine the number of schedulable workers and run tests between all nodes simultaneously.
We want to have a single UPerf CR that iterates through multiple UPerf scenarios, for example:
apiVersion: benchmark.example.com/v1alpha1
kind: Benchmark
metadata:
  name: uperf-benchmark
  namespace: ripsaw
spec:
  workload:
    # cleanup: true
    name: uperf
    args:
      hostnetwork: true
      pin: true
      pin_server: "master-0"
      pin_client: "master-1"
      rerun: 1
      pair: 1
      protos:
        - tcp
      test_type: stream
      nthr: 2
      sizes:
        - 16384
        - 1024
      runtime: 30
The above test will iterate through 16384 and 1024 message sizes.
There are two approaches I see, one of which I have implemented and tested.
Option 1: Build uperf XML to iterate through the tests.
However, the output isn't obvious, meaning:
TX worklist success Sent workorder
Handshake phase 2 with 192.168.111.20 done
Completed handshake phase 2
Starting 2 threads running profile:ripsaw-test ... 0.00 seconds
TX command [UPERF_CMD_NEXT_TXN, 0] to 192.168.111.20
Txn1 0 / 0.00(s) = 0 0op/s
Txn1 0 / 1.00(s) = 0 2op/s
TX command [UPERF_CMD_NEXT_TXN, 1] to 192.168.111.20
Txn2 0 / 0.00(s) = 0 0op/s
Txn2 2.75GB / 1.00(s) = 23.57Gb/s 179816op/s
Txn2 5.49GB / 2.00(s) = 23.55Gb/s 179686op/s
Txn2 8.23GB / 3.00(s) = 23.54Gb/s 179623op/s
Txn2 10.97GB / 4.00(s) = 23.54Gb/s 179597op/s
Txn2 13.71GB / 5.00(s) = 23.54Gb/s 179592op/s
Txn2 16.46GB / 6.01(s) = 23.54Gb/s 179575op/s
Txn2 19.20GB / 7.01(s) = 23.54Gb/s 179566op/s
Txn2 21.94GB / 8.01(s) = 23.53Gb/s 179551op/s
Txn2 24.68GB / 9.01(s) = 23.53Gb/s 179549op/s
Txn2 27.39GB / 10.01(s) = 23.50Gb/s 179310op/s
Txn2 30.13GB / 11.01(s) = 23.51Gb/s 179335op/s
Txn2 32.87GB / 12.01(s) = 23.51Gb/s 179357op/s
Txn2 35.61GB / 13.01(s) = 23.51Gb/s 179362op/s
Txn2 38.36GB / 14.01(s) = 23.51Gb/s 179383op/s
Txn2 41.10GB / 15.01(s) = 23.51Gb/s 179392op/s
Txn2 43.83GB / 16.02(s) = 23.51Gb/s 179359op/s
Txn2 46.58GB / 17.02(s) = 23.51Gb/s 179373op/s
Txn2 49.32GB / 18.02(s) = 23.51Gb/s 179378op/s
Txn2 52.06GB / 19.02(s) = 23.51Gb/s 179391op/s
Txn2 54.80GB / 20.02(s) = 23.51Gb/s 179397op/s
Txn2 57.54GB / 21.02(s) = 23.51Gb/s 179400op/s
Txn2 60.28GB / 22.02(s) = 23.52Gb/s 179409op/s
Txn2 63.03GB / 23.02(s) = 23.52Gb/s 179417op/s
Txn2 65.77GB / 24.02(s) = 23.52Gb/s 179418op/s
Txn2 68.51GB / 25.02(s) = 23.52Gb/s 179424op/s
Txn2 71.25GB / 26.02(s) = 23.52Gb/s 179429op/s
Txn2 73.99GB / 27.03(s) = 23.52Gb/s 179428op/s
Txn2 76.70GB / 28.03(s) = 23.51Gb/s 179357op/s
Txn2 79.45GB / 29.03(s) = 23.51Gb/s 179365op/s
Sending signal SIGUSR2 to 140611952076544
Sending signal SIGUSR2 to 140611943683840
called out
Txn2 82.19GB / 30.23(s) = 23.35Gb/s 178180op/s
TX command [UPERF_CMD_NEXT_TXN, 2] to 192.168.111.20
Txn3 0 / 0.00(s) = 0 0op/s
Txn3 0 / 1.00(s) = 0 2op/s
TX command [UPERF_CMD_NEXT_TXN, 3] to 192.168.111.20
Txn4 0 / 0.00(s) = 0 0op/s
Txn4 0 / 1.00(s) = 0 2op/s
TX command [UPERF_CMD_NEXT_TXN, 4] to 192.168.111.20
Txn5 0 / 0.00(s) = 0 0op/s
Txn5 1.61GB / 1.00(s) = 13.80Gb/s 1685114op/s
Txn5 3.23GB / 2.00(s) = 13.85Gb/s 1690938op/s
Txn5 4.81GB / 3.00(s) = 13.77Gb/s 1681184op/s
Txn5 6.47GB / 4.00(s) = 13.88Gb/s 1694944op/s
Txn5 8.09GB / 5.00(s) = 13.89Gb/s 1695700op/s
Txn5 9.73GB / 6.01(s) = 13.91Gb/s 1698501op/s
Txn5 11.32GB / 7.01(s) = 13.88Gb/s 1694143op/s
Txn5 12.90GB / 8.01(s) = 13.84Gb/s 1689175op/s
Txn5 14.52GB / 9.01(s) = 13.85Gb/s 1690120op/s
Txn5 16.14GB / 10.01(s) = 13.85Gb/s 1690853op/s
Txn5 17.76GB / 11.01(s) = 13.86Gb/s 1691405op/s
Txn5 19.41GB / 12.01(s) = 13.88Gb/s 1694451op/s
Txn5 21.04GB / 13.01(s) = 13.89Gb/s 1695563op/s
Txn5 22.67GB / 14.01(s) = 13.90Gb/s 1696544op/s
Txn5 24.28GB / 15.01(s) = 13.89Gb/s 1695637op/s
Txn5 25.91GB / 16.02(s) = 13.89Gb/s 1696059op/s
Txn5 27.53GB / 17.02(s) = 13.90Gb/s 1696469op/s
Txn5 29.18GB / 18.02(s) = 13.91Gb/s 1698319op/s
Txn5 30.82GB / 19.02(s) = 13.92Gb/s 1699048op/s
Txn5 32.46GB / 20.02(s) = 13.93Gb/s 1700361op/s
Txn5 34.10GB / 21.02(s) = 13.93Gb/s 1700785op/s
Txn5 35.68GB / 22.02(s) = 13.92Gb/s 1699024op/s
Txn5 37.29GB / 23.02(s) = 13.91Gb/s 1698154op/s
Txn5 38.93GB / 24.02(s) = 13.92Gb/s 1699317op/s
Txn5 40.57GB / 25.03(s) = 13.93Gb/s 1699966op/s
Txn5 42.21GB / 26.03(s) = 13.93Gb/s 1700781op/s
Txn5 43.86GB / 27.03(s) = 13.94Gb/s 1701440op/s
Txn5 45.46GB / 28.03(s) = 13.93Gb/s 1700898op/s
Txn5 47.10GB / 29.03(s) = 13.94Gb/s 1701272op/s
Sending signal SIGUSR2 to 140611952076544
Sending signal SIGUSR2 to 140611943683840
It isn't obvious to the reader which test is which (I know what is what, but to a user this isn't obvious).
Option 2: Instead of having uperf iterate through the tests, create a unique client for each workload.
I personally prefer Option 2, but I want to get feedback before I go down the route of implementing this...
Add a feature to tag workload results and upload them to a central repository (likely an object store). Could be configurable as a private repo, or could default to a public one where we can collect broad result sets for analysis.
While I was working on the smallfile operator, I faced a lot of issues due to the container's non-root privileges, so the solution we arrived at was to create an SCC and an associated SA, allowing the pods to run as root. But the difference was that I was not using a PVC while developing the smallfile operator. I don't think we would really see this issue when working with a PVC mounted on a mountpath. Hence, I am creating this issue to have a discussion and reach a majority-based opinion.
Hi,
Joe asked me to create an issue for this. Many of the default YCSB benchmark configurations use a zipfian distribution for reads, which will greatly favor reads from cache rather than from disk. It would be worth creating several additional benchmark configurations that test random read distributions in addition to zipfian.
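For reference, the distribution is selected via YCSB's requestdistribution workload property; a uniform-read variant of a core workload might look like this (proportions and counts are illustrative):

```
# illustrative YCSB workload snippet: uniform instead of zipfian reads
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=1000000
operationcount=1000000
readproportion=0.95
updateproportion=0.05
requestdistribution=uniform
```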
We are facing a few issues related to namespaces and contexts. Ultimately, it makes sense for all resources created by the benchmark operator to exist in a single namespace. It seems that other operator projects are explicit about this -- defining the namespace directly in their deployment and config files.
We have begun hard-coding the 'benchmark' namespace in deploy files, but not for the operator.yaml file (so currently the operator itself will deploy to $current_context), and we rely on {{ meta.namespace }} across the roles.
It seems sub-optimal to scatter-shot the hard-coding of the 'benchmark' namespace across the many files where this context needs to be set. What other options do we have?
This is to avoid race condition as multiple tests can be run simultaneously.
There can be cases where the infra/workloads need to create additional resources, such as RBAC objects. Since these are not required for the operator itself, it wouldn't make sense to mandate that users create them. There are two ways to approach this problem:
What do you think is the best way to move forward?
A discussion with Alex Calhoun got me thinking about the fio-result.json produced by ripsaw fio-bench, specifically the "All clients" section. Does this section mean anything? All the fio benchmarks I've used in the past have just given you throughput numbers that were the sum of the individual fio processes (pods in your case) running at the same time on the same workload. But ripsaw is running fio jobs serially (one at a time) for rw={read,write,randread,randwrite} rather than in parallel. So how can ripsaw get a valid result by adding up throughputs for jobs running at different times? I don't think it can. Also, the latency numbers in there are not useful, because you can't calculate system-wide percentiles from per-process percentiles.
We could separate out the JSON elements for different workloads (run at different times) and aggregate those results in a meaningful way. Perhaps it would be easier to just run each fio workload as a separate fio client job, with output files in a separate subdirectory. As a result, "All clients" now has meaning because all fio pods in fio-result.json were running at the same time on the same workload, so throughput aggregation now makes sense. The separation of workload results into separate directories would make it easier for anything/anyone else that is analyzing those results. For example, they could be easily worked on by ACT (Alex's elastic search injector) or a browbench scribe, or whatever equivalent data analysis tool is being used, in parallel with the test run. Each directory would have the exact parameters and inputs fed to fio, + all the fio logs and JSON that was output by fio as a result. This is similar to what pbench-fio does in its directory structure also.
Finally, as both CBT and pbench-fio do, we could add a layer to this directory structure for multiple samples of the same workload (combination of fio parameters) so that %deviation can be calculated from it. This allows for ripsaw to be extended to run multiple samples of the same data point, which would be vital for demonstrating that results are accurate. At this point, ripsaw output would become much more similar to CBT and pbench-fio in directory structure and analysis code overlap could be increased. For example, pbench-fio has a directory structure that indicates workload at top and sample at the next level, such as:
/var/lib/pbench-agent/fio__2019.05.24T14.25.38/1-read-4KiB/sample1/
All this would lead to a natural reuse of existing analysis tools, such as grafana dashboards, with ripsaw. HTH -ben
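The workload-then-sample layout described above could be sketched as follows (paths are illustrative, mirroring the pbench-fio example):

```shell
# illustrative: create a per-workload, per-sample results tree like pbench-fio's,
# so each sample directory holds the fio inputs, logs, and JSON for one run
base="$(mktemp -d)/fio__2019.05.24T14.25.38"
for sample in 1 2 3; do
  mkdir -p "${base}/1-read-4KiB/sample${sample}"
done
ls "${base}/1-read-4KiB"
```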
Right now, unless you explicitly pin the clients and servers, uperf may (and often does) schedule the client and server component pods onto the same host. This leads to network testing only of the loopback of the host, which I do not think is generally useful.
I believe we should implement anti-affinity for the associated client and server pods so that by default they run on different nodes and therefore test real network connections.
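A minimal sketch of what that could look like in the pod template (the label name is an assumption):

```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: uperf   # hypothetical label shared by client and server pods
          topologyKey: kubernetes.io/hostname
```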
YCSB Supports creating multiple clients to load the database, we should look to implement this.
I think we should manage our own CR, allowing us to set our own status with k8s_status... Different states that I think make sense:
1. Staging - Infra being built
2. Init - Workload is loading data (optional)
3. Running - Workload is running
4. Complete - Workload completed
5. (Optional) Failed - Workload failed (possible to have multiple failed states to help debug).
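A minimal sketch of setting one of these states with the k8s_status module (the status field name is an assumption):

```yaml
- name: Mark benchmark as Running
  k8s_status:
    api_version: ripsaw.cloudbulldozer.io/v1alpha1
    kind: Benchmark
    name: '{{ meta.name }}'
    namespace: '{{ meta.namespace }}'
    status:
      state: Running   # hypothetical status key; would cycle through the states above
```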
With the fio jobfile templated in roles/fio-bench/templates/job.fio.j2, we either have to restrict fio parameters to a subset that we are prepared to support, or we have to template out every possible fio parameter.
In order to support full flexibility of fio jobs, it might make more sense to pull the job file out as something that is user-provided as part of the CR or an include of the CR. Off hand, it seems like there would still need to be an amount of templating, so I'm not immediately sure how to import a raw standard-formatted fio jobfile.
Another option proposed is to split fio into two different types/roles -- One would be a basic "fio-simple-bench" in which the jobfile would remain templated in the role with limited parameters provided to the user to adjust in the CR. The second would be a "fio-user-bench" in which the fio jobfile would be provided as part of the CR outside of the role.
Just a simple hack to help with CI log output readability. I've added the following to each test script:
figlet $(basename $0)
which will output something like this at the beginning of each script:
_ _ _ _ _
| |_ ___ ___| |_ | |__ _ _ _____ _| | ___| |__
| __/ _ \/ __| __| | '_ \| | | |/ _ \ \ /\ / / | / __| '_ \
| || __/\__ \ |_ | |_) | |_| | (_) \ V V /| |_\__ \ | | |
\__\___||___/\__|___|_.__/ \__, |\___/ \_/\_/ |_(_)___/_| |_|
|_____| |___/
But these commands are disabled right now because figlet is not available on the CI systems.
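One way to keep the banners without requiring figlet would be a fallback wrapper like this (a sketch; the banner function name is my own, not something in the repo):

```shell
# hedged sketch: print a banner per test script, falling back to a plain
# echo when figlet is not installed on the CI node
banner() {
  if command -v figlet >/dev/null 2>&1; then
    figlet "$1"
  else
    echo "===== $1 ====="
  fi
}

banner "$(basename "$0")"
```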
Right now our custom resource looks like:
spec:
  uperf:
    # To disable uperf, set pairs to 0
    pair: 1
    proto: tcp
    test_type: stream
    nthr: 2
    size: 16384
    runtime: 10
  fio:
    # To disable fio, set clients to 0
    clients: 0
    jobname: test-write
    bs: 4k
    iodepth: 4
    runtime: 57
    rw: write
    filesize: 1
We only have a limited definition. I would like to move to a list:
spec:
  infra:
    couchbase:
      servers: 1
  workloads:
    - uperf:
        # To disable uperf, set pairs to 0
        pair: 1
        ...
    - fio:
        # To disable fio, set clients to 0
        clients: 0
        jobname: test-write
    - ycsb:
        ...
Which would allow us to have a pipeline of workloads that we iterate through.
Also, defining the infrastructure (ie databases we will run against) allows us to kick off infrastructure and run many different workloads against it, vs just always relying on YCSB...
Identify incorrect/invalid CRs, i.e. validate the CR parameters, before starting to run the benchmark and hitting errors. An example: if a user provides a storage class that's not available on the cluster, then an error message should be displayed without entering the logic of the particular benchmark, instead of hitting errors while creating the pods/PVs needed for that benchmark.
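The storage-class case could be handled with a pre-flight check along these lines (a sketch; the function is hypothetical, not part of the operator):

```shell
# hedged pre-flight sketch: fail fast if the storage class named in the CR
# does not exist, instead of erroring later during pod/PV creation
check_storageclass() {
  if kubectl get storageclass "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "ERROR: storage class '$1' not found on this cluster" >&2
  return 1
}
```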
I would like to run the fio CR over a predefined PVC attached to an existing pod.
This is mainly needed for testing efforts not specific to benchmarking.
Right now we are defining the job/pods in the tasks.
We should consider jinja templates for two reasons:
It would be good to have a CI based on minikube as in https://blog.travis-ci.com/2017-10-26-running-kubernetes-on-travis-ci-with-minikube
Per aakarshg I've rolled together a simple setup of the minikube/shift environments for CI. This gets triggered near the start of the test.sh script using blackknight's setup/install playbooks depending on the NODE_LABELS.
Current work is here: https://github.com/dry923/ripsaw/blob/miniinstall/tests/start_mini.sh
A few outstanding questions: