
marcomicera / kubemarks

☸ Kubernetes periodic benchmarking tool with Prometheus Pushgateway results exposer

License: MIT

Languages: Shell 60.97%, Dockerfile 39.03%
Topics: kubernetes, benchmark, prometheus-pushgateway


kubemarks's People

Contributors

frisso, marcomicera


Forkers

clix-dev-llc

kubemarks's Issues

CronJob results retrieval

When benchmarks are launched from a CronJob, results are stored inside dk8s-cronjob containers. The final results collector must be able to retrieve a container's results as soon as its job finishes.
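A possible retrieval flow, sketched in shell. It assumes the benchmark container prints its results to stdout as its last step; the app label and file names are illustrative, not taken from the repo:

# Wait for the most recent kubemarks job to complete, then grab its output.
job=$(kubectl get jobs -l app=kubemarks \
  --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
kubectl wait --for=condition=complete "$job" --timeout=600s
# kubectl logs accepts a job reference and resolves it to one of its pods.
kubectl logs "$job" > results.txt

Note that kubectl cp cannot read files out of a completed pod (it is exec-based), which is why this sketch falls back to stdout.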

Cannot resolve Python dependencies in dk8s-cronjob image

The dk8s-cronjob image fails to resolve PerfKitBenchmarker's dependencies; every pip download breaks on name resolution:

Installing PerfKitBenchmarker dependencies...
Collecting absl-py (from -r requirements.txt (line 14))
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277990>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d277f50>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367090>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d3671d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd56d367310>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',)': /simple/absl-py/
  Could not find a version that satisfies the requirement absl-py (from -r requirements.txt (line 14)) (from versions: )
No matching distribution found for absl-py (from -r requirements.txt (line 14))
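The [Errno -3] errors point at DNS inside the container rather than at PyPI. A quick, hedged way to check whether in-cluster name resolution works at all (the test image and pod names are illustrative):

# Run a throwaway pod and try to resolve PyPI from inside the cluster.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- nslookup pypi.org

# Compare the pod's resolver config against the cluster DNS service.
kubectl exec <a-dk8s-cronjob-pod> -- cat /etc/resolv.conf
kubectl -n kube-system get pods -l k8s-app=kube-dns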

Adopt OpenMetrics as a result format

The exposition formats page of the Prometheus documentation shows the text format details:

metric_name [
  "{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]

An example would be:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693

Commit 1a5a5a7 in my PerfKitBenchmarker fork added the append mode to the CSV results writer.
Another writer could export this data into the OpenMetrics format.
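For reference, a Pushgateway already accepts this text format over plain HTTP, so such a writer would only need to emit lines like the ones above and push them. A hedged curl sketch (host, job name, and metric are made up):

# Push one text-format sample under the grouping key job=kubemarks/instance=pod-0.
cat <<'EOF' | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/kubemarks/instance/pod-0
# TYPE kubemarks_netperf_throughput_mbps gauge
kubemarks_netperf_throughput_mbps{benchmark="netperf"} 941.23
EOF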

cluster_boot and redis won't start

Wrong flags are being used.
For cluster_boot:

perfkitbenchmarker.errors.UnrecognizedOption: Unrecognized options were found in cluster_boot: cluster_boot_time_reboot.

For redis:

perfkitbenchmarker.errors.MissingOption: Required options were missing from redis.vm_groups.clients: vm_spec.
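A hedged workaround sketch: drop the unrecognized cluster_boot flag from the run, and supply the missing vm_spec through a benchmark config file. --benchmarks and --benchmark_config_file are standard PKB flags, but the Kubernetes vm_spec contents below are guesses, not verified against the provider:

# Hypothetical config adding the vm_spec that redis.vm_groups.clients lacks.
cat > redis_config.yaml <<'EOF'
redis:
  vm_groups:
    clients:
      vm_spec:
        Kubernetes:
          image: ubuntu:16.04  # assumption: field name and value unverified
EOF
./pkb.py --benchmarks=redis --benchmark_config_file=redis_config.yaml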

Uniform benchmark results

  • Parse .pkb log files and retrieve the essential data
    • Benchmark results
    • Pod info
    • Physical machine info
  • Create a single .csv file with the results of different experiments (see the sketch below)
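A minimal merge sketch, assuming each run directory already contains a results.csv produced by the CSV writer (paths and layout are assumptions):

#!/bin/sh
# Concatenate per-run CSV files into one, keeping a single header row.
out=all_results.csv
first=1
for f in results/*/results.csv; do
  if [ "$first" -eq 1 ]; then
    cat "$f" > "$out"
    first=0
  else
    tail -n +2 "$f" >> "$out"  # skip the header of subsequent files
  fi
done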

Change repo name

One of these:

  • benchetes: BENCHmarks on kubernETES
  • kubemarks: KUBErnetes benchMARKS

Expose other info to the Pushgateway

To choose from this list:

  • unit: measurement unit
  • run_uri: benchmark execution ID
  • sample_uri: sample ID
  • cloud: always Kubernetes
  • data_disk_0_num_stripes
  • data_disk_0_size
  • data_disk_0_type: always emptyDir
  • data_disk_count: always 1
  • direct: always 0
  • directory: always /scratch
  • end_fsync
  • filename
  • filesize
  • fio_job
  • image
  • invalidate
  • iodepth
  • ioengine
  • kernel_release
  • max
  • mean
  • min
  • node_name
  • num_cpus: CPU(s) column for lscpu entries
  • numa_node_count
  • os_info
  • os_type
  • overwrite
  • p1
  • p10
  • p20
  • p30
  • p40
  • p5
  • p50
  • p60
  • p70
  • p80
  • p90
  • p95
  • p99
  • p99.5
  • p99.9
  • p99.95
  • p99.99
  • perfkitbenchmarker_version
  • randrepeat
  • run_number
  • rw
  • size
  • stddev
  • tcp_congestion_control: always cubic
  • vm_count
  • workload_mode
  • zone

Only for lscpu commands

  • Architecture
  • BogoMIPS
  • Byte Order
  • CPU MHz
  • CPU family
  • CPU max MHz
  • CPU min MHz
  • CPU op-mode(s)
  • CPU(s): num_cpus is not set for lscpu entries, but this field is
  • Core(s) per socket
  • Flags: huge list like:
    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
    
  • L1d cache
  • L1i cache
  • L2 cache
  • L3 cache
  • Model
  • Model name
  • NUMA node(s)
  • NUMA node0 CPU(s)
  • On-line CPU(s) list
  • Socket(s)
  • Stepping
  • Thread(s) per core
  • Vendor ID
  • Virtualization
  • bw_agg
  • These fields are strangely always set to 0:
    • bw_dev
    • bw_max
    • bw_mean
    • bw_min

dk8s-cronjob images don't have access to kubectl

CronJobs use the dk8s-cronjob image. They must provide the kubectl path to PerfKitBenchmarker in order for it to work:
https://github.com/marcomicera/distributed-k8s/blob/5e4c7a79b2b9712a16585891daba993da690eced/start.sh#L35
These Docker images do not have access to the same kubectl command as the machine on which start_cron.sh was launched in the first place.

2019-10-28 19:26:23,983 f786558f MainThread INFO     Flag values:
...
--kubectl=
...

Exception: Please provide path to kubectl tool using --kubectl flag. Exiting.
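One way out, sketched as the shell steps the image build could run; the kubectl version is an assumption, and the pod would still need in-cluster credentials or a mounted kubeconfig:

# Bake a static kubectl binary into the dk8s-cronjob image at build time...
curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.16.3/bin/linux/amd64/kubectl
install -m 0755 kubectl /usr/local/bin/kubectl

# ...and point PerfKitBenchmarker at it (the flag named by the error above).
./pkb.py --kubectl=/usr/local/bin/kubectl --benchmarks=cluster_boot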

Duplicate Pushgateway entry with multiple pods

Upon finishing a benchmark involving more than one VM (pod), PerfKit runs into this error:

Traceback (most recent call last):
  File "pkb/pkb.py", line 21, in <module>
    sys.exit(Main())
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1209, in Main
    return RunBenchmarks()
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/pkb.py", line 1122, in RunBenchmarks
    collector.PublishSamples()
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 1108, in PublishSamples
    publisher.PublishSamples(self.samples)
  File "/home/root/distributed-k8s/pkb/perfkitbenchmarker/publisher.py", line 582, in PublishSamples
    registry=self.registry).labels(*(label_values + metadata_label_values))
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 324, in __init__
    labelvalues=labelvalues,
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/metrics.py", line 107, in __init__
    registry.register(self)
  File "/usr/local/lib/python2.7/dist-packages/prometheus_client/registry.py", line 29, in register
    duplicates))
ValueError: Duplicated timeseries in CollectorRegistry: set(['boot_time_seconds'])

PushgatewayPublisher needs to keep a dictionary of Gauges, so that it can reuse the ones it has already created for exposing metrics instead of registering a new Gauge under the same name.

Group one run's results into a single folder

PerfKitBenchmarker expects different benchmarks to write their results to different folders: it refuses to run an experiment if its results folder name is longer than 12 characters, plus other constraints I did not go through in detail.
The problem is that this makes it impossible to group the results of different benchmarks belonging to a single PKB run.

Periodic benchmarks

Benchmarks might be executed periodically using CronJobs.
In practice, this could be done by wrapping the commands currently issued by PerfKitBenchmarker in the following way:

kubectl run hello --schedule="*/1 * * * *" --restart=OnFailure --image=busybox -- /bin/sh -c "date; echo Hello from the Kubernetes cluster"

Also, a ConfigMap should contain a list of benchmarks to execute, as well as their frequency.
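A hedged sketch of that setup; names, schedule, and image are assumptions:

# Store the benchmark list and frequency in a ConfigMap...
kubectl create configmap kubemarks-config \
  --from-literal=benchmarks=netperf,fio \
  --from-literal=schedule='*/30 * * * *'

# ...and run them from a CronJob rather than from ad-hoc pods.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: kubemarks
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: kubemarks
            image: marcomicera/dk8s-cronjob  # image name is an assumption
EOF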
