
marcomicera / kubemarks

☸ Kubernetes periodic benchmarking tool with Prometheus Pushgateway results exposer

License: MIT

Languages: Shell 60.97%, Dockerfile 39.03%
Topics: kubernetes, benchmark, prometheus-pushgateway


kubemarks's People

Contributors

frisso, marcomicera


Forkers

clix-dev-llc

kubemarks's Issues

CronJob results retrieval

When benchmarks are launched from a CronJob, results are stored inside dk8s-cronjob containers. The final results collector must be able to retrieve a container's results as soon as its job finishes.
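A possible retrieval flow, sketched in shell. It assumes the benchmark container prints its results to stdout as its last step; the app label and file names are illustrative, not taken from the repo:

# Wait for the most recent kubemarks job to complete, then grab its output.
job=$(kubectl get jobs -l app=kubemarks \
  --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
kubectl wait --for=condition=complete "$job" --timeout=600s
# kubectl logs accepts a job reference and resolves it to one of its pods.
kubectl logs "$job" > results.txt

Note that kubectl cp cannot read files out of a completed pod (it is exec-based), which is why this sketch falls back to stdout.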

Cannot resolve Python dependencies in dk8s-cronjob image

The dk8s-cronjob image fails to resolve PerfKitBenchmarker's dependencies; every pip download breaks on name resolution:

Installing PerfKitBenchmarker dependencies...
Collecting absl-py (from -r requirements.txt (line 14))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277990>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277f50>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d3671d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367310>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Could not find a version that satisfies the requirement absl-py (from -r requirements.txt (line 14)) (from versions: )
No matching distribution found for absl-py (from -r requirements.txt (line 14))
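The [Errno -3] errors point at DNS inside the container rather than at PyPI. A quick, hedged way to check whether in-cluster name resolution works at all (the test image and pod names are illustrative):

# Run a throwaway pod and try to resolve PyPI from inside the cluster.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- nslookup pypi.org

# Compare the pod's resolver config against the cluster DNS service.
kubectl exec <a-dk8s-cronjob-pod> -- cat /etc/resolv.conf
kubectl -n kube-system get pods -l k8s-app=kube-dns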

Adopt OpenMetrics as a result format

The exposition formats page of the Prometheus documentation shows the text format details:

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]

An example would be:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

Commit 1a5a5a7 in my PerfKitBenchmarker fork added the append mode to the CSV results writer.
Another writer could export this data into the OpenMetrics format.
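For reference, a Pushgateway already accepts this text format over plain HTTP, so such a writer would only need to emit lines like the ones above and push them. A hedged curl sketch (host, job name, and metric are made up):

# Push one text-format sample under the grouping key job=kubemarks/instance=pod-0.
cat <<'EOF' | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/kubemarks/instance/pod-0
# TYPE kubemarks_netperf_throughput_mbps gauge
kubemarks_netperf_throughput_mbps{benchmark="netperf"} 941.23
EOF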

cluster_boot and redis won't start

Wrong flags are being used.
For cluster_boot:

perfkitbenchmarker.errors.UnrecognizedOption: Unrecognized options were found in cluster_boot: cluster_boot_time_reboot.

For redis:

perfkitbenchmarker.errors.MissingOption: Required options were missing from redis.vm_groups.clients: vm_spec.
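A hedged workaround sketch: drop the unrecognized cluster_boot flag from the run, and supply the missing vm_spec through a benchmark config file. --benchmarks and --benchmark_config_file are standard PKB flags, but the Kubernetes vm_spec contents below are guesses, not verified against the provider:

# Hypothetical config adding the vm_spec that redis.vm_groups.clients lacks.
cat > redis_config.yaml <<'EOF'
redis:
  vm_groups:
    clients:
      vm_spec:
        Kubernetes:
          image: ubuntu:16.04  # assumption: field name and value unverified
EOF
./pkb.py --benchmarks=redis --benchmark_config_file=redis_config.yaml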

Uniform benchmark results

  • Parse .pkb log files and retrieve the essential data
    • Benchmark results
    • Pod info
    • Physical machine info
  • Create a single .csv file with the results of different experiments (see the sketch below)
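A minimal merge sketch, assuming each run directory already contains a results.csv produced by the CSV writer (paths and layout are assumptions):

#!/bin/sh
# Concatenate per-run CSV files into one, keeping a single header row.
out=all_results.csv
first=1
for f in results/*/results.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"
    first=0
  else
    tail -n +2 "$f" >> "$out"  # skip the header of subsequent files
  fi
done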

Change repo name

One of these:

  • benchetes: BENCHmarks on kubernETES
  • kubemarks: KUBErnetes benchMARKS

Expose other info to the Pushgateway

To choose from this list:

  • unit: measurement unit
  • run_uri: benchmark execution ID
  • sample_uri: sample ID
  • cloud: always Kubernetes
  • data_disk_0_num_stripes
  • data_disk_0_size
  • data_disk_0_type: always emptyDir
  • data_disk_count: always 1
  • direct: always 0
  • directory: always /scratch
  • end_fsync
  • filename
  • filesize
  • fio_job
  • image
  • invalidate
  • iodepth
  • ioengine
  • kernel_release
  • max
  • mean
  • min
  • node_name
  • num_cpus: CPU(s) column for lscpu entries
  • numa_node_count
  • os_info
  • os_type
  • overwrite
  • p1
  • p10
  • p20
  • p30
  • p40
  • p5
  • p50
  • p60
  • p70
  • p80
  • p90
  • p95
  • p99
  • p99.5
  • p99.9
  • p99.95
  • p99.99
  • perfkitbenchmarker_version
  • randrepeat
  • run_number
  • rw
  • size
  • stddev
  • tcp_congestion_control: always cubic
  • vm_count
  • workload_mode
  • zone

Only for lscpu commands

  • Architecture
  • BogoMIPS
  • Byte Order
  • CPU MHz
  • CPU family
  • CPU max MHz
  • CPU min MHz
  • CPU op-mode(s)
  • CPU(s): num_cpus is not set for lscpu entries, but this field is
  • Core(s) per socket
  • Flags: huge list like:
    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
    
  • L1d cache
  • L1i cache
  • L2 cache
  • L3 cache
  • Model
  • Model name
  • NUMA node(s)
  • NUMA node0 CPU(s)
  • On-line CPU(s) list
  • Socket(s)
  • Stepping
  • Thread(s) per core
  • Vendor ID
  • Virtualization
  • bw_agg
  • These fields are strangely always set to 0:
    • bw_dev
    • bw_max
    • bw_mean
    • bw_min

dk8s-cronjob images don't have access to kubectl

CronJobs use the dk8s-cronjob image. They must provide the kubectl path to PerfKitBenchmarker in order for it to work:
https://github.com/marcomicera/distributed-k8s/blob/5e4c7a79b2b9712a16585891daba993da690eced/start.sh#L35
These Docker images do not have access to the same kubectl command as the machine on which start_cron.sh was launched in the first place.

2019-10-28 19:26:23,983 f786558f MainThread INFO     Flag values:
...
--kubectl=
...

Exception: Please provide path to kubectl tool using --kubectl flag. Exiting.
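One way out, sketched as the shell steps the image build could run; the kubectl version is an assumption, and the pod would still need in-cluster credentials or a mounted kubeconfig:

# Bake a static kubectl binary into the dk8s-cronjob image at build time...
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.16.3/bin/linux/amd64/kubectl
install -m 0755 kubectl /usr/local/bin/kubectl

# ...and point PerfKitBenchmarker at it (the flag named by the error above).
./pkb.py --kubectl=/usr/local/bin/kubectl --benchmarks=cluster_boot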

Duplicate Pushgateway entry with multiple pods

Upon finishing a benchmark involving more than one VM (pod), PerfKit runs into this error:

Traceback (most recent call last):
  File "pkb/pkb.py", line 21, in <module>
    sys.exit(Main())
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1209, in Main
    return RunBenchmarks()
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1122, in RunBenchmarks
    collector.PublishSamples()
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 1108, in PublishSamples
    publisher.PublishSamples(self.samples)
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 582, in PublishSamples
    registry=self.registry).labels(*(label_values + metadata_label_values))
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 324, in __init__
    labelvalues=labelvalues,
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 107, in __init__
    registry.register(self)
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/registry.py", line 29, in register
    duplicates))
ValueError: Duplicated timeseries in CollectorRegistry: set(['boot_time_seconds'])

PushgatewayPublisher needs to keep a dictionary of Gauges, so that it can reuse the ones it has already created for exposing metrics instead of registering a new Gauge under the same name.

Group one run's results into a single folder

PerfKitBenchmarker expects different benchmarks to write their results to different folders: it refuses to run an experiment if its results folder name is longer than 12 characters, plus other constraints I did not go through in detail.
The problem is that this makes it impossible to group the results of different benchmarks belonging to a single PKB run.

Periodic benchmarks

Benchmarks might be executed periodically using CronJobs.
In practice, this could be done by wrapping the commands currently issued by PerfKitBenchmarker in the following way:

kubectl run hello --schedule="*/1 * * * *" --restart=OnFailure --image=busybox -- /bin/sh -c "date; echo Hello from the Kubernetes cluster"

Also, a ConfigMap should contain a list of benchmarks to execute, as well as their frequency.
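A hedged sketch of that setup; names, schedule, and image are assumptions:

# Store the benchmark list and frequency in a ConfigMap...
kubectl create configmap kubemarks-config \
  --from-literal=benchmarks=netperf,fio \
  --from-literal=schedule='*/30 * * * *'

# ...and run them from a CronJob rather than from ad-hoc pods.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: kubemarks
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: kubemarks
            image: marcomicera/dk8s-cronjob  # image name is an assumption
EOF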
