This project forked from iwankgb/owca

Orchestration-aware Workload Collocation Agent - a daemon that can help you collocate your workloads.

License: Apache License 2.0

Python 97.63% Dockerfile 1.77% Makefile 0.60%

WCA - Workload Collocation Agent

This software is pre-production and should not be deployed to production servers.

The goal of Workload Collocation Agent is to reduce interference between collocated tasks and to increase task density while ensuring quality of service for high-priority tasks. The chosen approach enables real-time resource isolation management, so that high-priority jobs meet their Service Level Objectives (SLOs) and best-effort jobs use as many idle resources as possible.

Resource usage can be increased by:

  • collocating best-effort and high-priority tasks to exploit resources that are underutilized by high-priority applications,
  • collocating tasks that do not compete for shared resources on the platform.

docs/overview.png

WCA abstracts the compute node, workloads, monitoring and resource allocation. An externally provided algorithm is responsible for the resource allocation or anomaly detection logic. WCA and the algorithm exchange information about current resource usage, isolation actuations and detected anomalies. WCA stores information about detected anomalies, resource allocation and platform utilization metrics in remote storage such as Kafka.

The diagram below puts WCA in the context of a cluster and monitoring infrastructure:

docs/context.png

For context regarding Mesos see this document and for Kubernetes see this document.

See WCA Architecture 1.7.pdf for further details.


WCA is targeted at and tested on CentOS 7.5.

Note: for a full production installation, please follow this detailed installation guide.

# Install required software.
sudo yum install epel-release -y
sudo yum install git python36 make which -y
python3.6 -m ensurepip --user
python3.6 -m pip install --user pipenv
export PATH=$PATH:~/.local/bin

# Clone the repository & build.
git clone https://github.com/intel/workload-collocation-agent
cd workload-collocation-agent

export LC_ALL=en_US.utf8 #required for centos docker image
make venv
make wca_package

# Prepare tasks manually (only cgroups are required)
sudo mkdir /sys/fs/cgroup/{cpu,cpuacct,perf_event}/task1

# Example of running the agent in measurements-only mode with a predefined static list of tasks.
sudo dist/wca.pex --config $PWD/configs/extra/static_measurements.yaml --root

# Example of static allocation with predefined rules on a predefined list of tasks.
sudo dist/wca.pex --config $PWD/configs/extra/static_allocator.yaml --root

Running those commands outputs metrics in Prometheus format to standard error like this:

# HELP cache_misses Linux Perf counter for cache-misses per container.
# TYPE cache_misses counter
cache_misses{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0.0 1554139418146

# HELP cpu_usage_per_cpu [1/USER_HZ] Logical CPU usage in 1/USER_HZ (usually 10ms).Calculated using values based on /proc/stat
# TYPE cpu_usage_per_cpu counter
cpu_usage_per_cpu{cores="4",cpu="0",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 5103734 1554139418146
cpu_usage_per_cpu{cores="4",cpu="1",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 6860714 1554139418146

# HELP cpu_usage_per_task [ns] cpuacct.usage (total kernel and user space)
# TYPE cpu_usage_per_task counter
cpu_usage_per_task{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0 1554139418146

# HELP instructions Linux Perf counter for instructions per container.
# TYPE instructions counter
instructions{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0.0 1554139418146

# HELP memory_usage [bytes] Total memory used by platform in bytes based on /proc/meminfo and uses heuristic based on linux free tool (total - free - buffers - cache).
# TYPE memory_usage gauge
memory_usage{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 6407118848 1554139418146

# TYPE wca_tasks gauge
wca_tasks{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 1 1554139418146

# TYPE wca_up counter
wca_up{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 1554139418.146581 1554139418146
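
Each non-comment line above follows the Prometheus text exposition format: a metric name, a set of labels in braces, a value and an optional timestamp in milliseconds. The following is a minimal sketch of splitting such a line into its parts using only the Python standard library. The names Sample and parse_sample are hypothetical and not part of WCA, and the parser deliberately ignores corner cases such as unescaping label values.

```python
import re
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    """One parsed metric line (hypothetical helper type, not a WCA class)."""
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


# metric_name{label="value",...} value [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Optional[Sample]:
    """Parse one line; return None for blank lines and # HELP / # TYPE comments."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = _SAMPLE_RE.match(line)
    if match is None:
        raise ValueError('not a metric sample: %r' % line)
    # Label values are kept as written (escape sequences are not decoded).
    labels = dict(_LABEL_RE.findall(match.group('labels') or ''))
    ts = match.group('ts')
    return Sample(
        name=match.group('name'),
        labels=labels,
        value=float(match.group('value')),
        timestamp_ms=int(ts) if ts is not None else None,
    )


sample = parse_sample(
    'cpu_usage_per_task{cores="4",cpus="8",task_id="task1"} 0 1554139418146'
)
print(sample.name, sample.labels['task_id'], sample.value)
```

A helper like this can be handy for quick checks of the agent's output during development; for anything beyond that, a maintained Prometheus client or parser is the safer choice.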

When reconfigured, other built-in components make it possible to:

  • store those metrics in Kafka,
  • integrate with Mesos or Kubernetes,
  • enable anomaly detection,
  • or enable anomaly prevention (allocation) to mitigate interference between workloads.

WCA introduces a simple but extensible mechanism to inject dependencies into classes and build a complete software stack of components. The main control loop of WCA is based on the Runner base class, which implements a single blocking run method. Depending on the Runner class used, WCA runs in a different execution mode (e.g. detection, allocation).

Refer to the full list of components for further reference.

Available runners:

  • MeasurementRunner is a simple runner that only collects data, without calling the detection or allocation API.
  • DetectionRunner implements a loop that calls the detect function at regular, configurable intervals. See detection API for details.
  • AllocationRunner implements a loop that calls the allocate function at regular, configurable intervals. See allocation API for details.

Conceptually, a Runner reads the state of the system (both metrics and workloads), passes this information to an external component (an algorithm), logs the algorithm's input and output using an implementation of Storage, and allocates resources if instructed.
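
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WCA's implementation: ListStorage, ToyDetectionRunner, the read_state and detect callables and the max_iterations limit are all hypothetical, and the allocation step is left out for brevity.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


class Runner(ABC):
    """A runner exposes a single blocking run() method."""

    @abstractmethod
    def run(self) -> None:
        ...


@dataclass
class ListStorage:
    """Hypothetical in-memory stand-in for a Storage implementation."""
    records: List[dict] = field(default_factory=list)

    def store(self, record: dict) -> None:
        self.records.append(record)


@dataclass
class ToyDetectionRunner(Runner):
    """Hypothetical runner: read state, call the algorithm, log input and output."""
    read_state: Callable[[], Dict[str, Metrics]]        # task id -> metrics
    detect: Callable[[Dict[str, Metrics]], List[str]]   # returns anomalous task ids
    storage: ListStorage
    interval: float = 1.0
    max_iterations: int = 3  # illustration only, so that the example terminates

    def run(self) -> None:
        for _ in range(self.max_iterations):
            state = self.read_state()           # 1. read metrics and workloads
            anomalies = self.detect(state)      # 2. pass them to the algorithm
            self.storage.store(                 # 3. log algorithm input and output
                {'input': state, 'anomalies': anomalies}
            )
            time.sleep(self.interval)           # 4. wait for the next interval


def read_state() -> Dict[str, Metrics]:
    return {'task1': {'cpu_usage': 0.9}, 'task2': {'cpu_usage': 0.1}}


def detect(state: Dict[str, Metrics]) -> List[str]:
    return [task for task, m in state.items() if m['cpu_usage'] > 0.8]


storage = ListStorage()
ToyDetectionRunner(read_state, detect, storage, interval=0.0, max_iterations=2).run()
print(storage.records[0]['anomalies'])
```

An allocation-style runner would add one more step after the algorithm call, applying the returned decisions to the tasks' resources.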

The following snippet is an example configuration of a runner:

runner: !SomeRunner
    node: !SomeNode
    callback_component: !ClassImplementingCallback
    storage: !SomeStorage

After starting WCA with the above configuration, an instance of the class SomeRunner will be created. The instance's properties will be set to:

  • node - to an instance of SomeNode
  • callback_component - to an instance of ClassImplementingCallback
  • storage - to an instance of SomeStorage
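
To make the mapping from tags to objects concrete, here is a minimal sketch of how a nested description could be turned into an object graph. It does not use WCA's loader or a YAML library: the spec dictionary merely mimics what a parsed configuration might look like, and build, CLASSES and the empty placeholder classes are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SomeNode:
    pass


@dataclass
class ClassImplementingCallback:
    pass


@dataclass
class SomeStorage:
    pass


@dataclass
class SomeRunner:
    node: SomeNode
    callback_component: ClassImplementingCallback
    storage: SomeStorage


# Hypothetical lookup table from tag name to class.
CLASSES = {
    cls.__name__: cls
    for cls in (SomeRunner, SomeNode, ClassImplementingCallback, SomeStorage)
}


def build(spec: Any) -> Any:
    """Recursively turn {'tag': name, 'fields': {...}} into an instance."""
    if isinstance(spec, dict) and 'tag' in spec:
        kwargs = {key: build(value) for key, value in spec.get('fields', {}).items()}
        return CLASSES[spec['tag']](**kwargs)
    return spec  # plain values (numbers, strings) are passed through unchanged


# Stand-in for the parsed form of the configuration shown above.
spec = {
    'tag': 'SomeRunner',
    'fields': {
        'node': {'tag': 'SomeNode'},
        'callback_component': {'tag': 'ClassImplementingCallback'},
        'storage': {'tag': 'SomeStorage'},
    },
}

runner = build(spec)
print(runner)
```

The nested components are constructed first and then passed to the enclosing class as keyword arguments, which is why each property ends up holding a ready-to-use instance.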

The configuration mechanism makes it possible to:

  • Create and configure complex Python objects (e.g. DetectionRunner, MesosNode, KafkaStorage) using YAML tags.
  • Inject dependencies (with type checking support) into constructed objects using dataclass annotations.
  • Register external classes using the -r command-line argument or the wca.config.register decorator API. This makes it possible to extend WCA with new functionality (more information here) and is used to provide external components with, for example, anomaly detection logic such as Platform Resource Manager.

See external detector example for more details.

The following built-in components are available (stable API):

  • MesosNode provides workload discovery on a Mesos cluster node where the Mesos containerizer is used (see the docs here).
  • KubernetesNode provides workload discovery on a Kubernetes cluster node (see the docs here).
  • MeasurementRunner implements a simple loop that reads the state of the system, encodes this information as metrics and stores them.
  • DetectionRunner extends MeasurementRunner, additionally implements an anomaly detection callback and encodes anomalies as metrics to enable alerting and analysis. See Detection API for more details.
  • AllocationRunner extends MeasurementRunner and additionally implements a resource allocation callback. See Allocation API for more details.
  • NOPAnomalyDetector is a dummy "no operation" detector that returns neither metrics nor anomalies. See Detection API for more details.
  • NOPAllocator is a dummy "no operation" allocator that returns neither metrics nor anomalies and does not configure resources. See Allocation API for more details.
  • KafkaStorage logs metrics to the Kafka streaming platform using configurable topics.
  • LogStorage logs metrics to standard error or to a file at a configurable location.
  • SSL enables secure communication with external components (more information here).

The following built-in components are available as provisional API:

  • StaticNode to support a static list of tasks (does not require a full orchestration software stack),
  • StaticAllocator to support simple rule-based logic for resource allocation.

Officially supported third-party components:

Warning: these components run as ordinary Python classes, without any isolation and with the privileges of the WCA process, so there is no built-in protection against malicious external components. For security reasons, please use only built-in and officially supported components. More about security here.

The project contains Dockerfiles together with helper scripts for preparing reference workloads to be run on a Mesos cluster using the Aurora framework.

To enable validation of anomaly detection algorithms, the workloads are prepared to:

  • provide a continuous stream of Application Performance Metrics using wrappers (all workloads),
  • simulate varying load (patches that generate a sine-like pattern of requests per second are available for YCSB and rpc-perf).
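
As an illustration of what a sine-like pattern of requests per second means, the sketch below computes a target rate as a function of time. It is not taken from the YCSB or rpc-perf patches; the function name sine_rps and all of its parameters are assumptions chosen for this example.

```python
import math


def sine_rps(t_seconds: float, base_rps: float,
             amplitude: float, period_seconds: float) -> float:
    """Target requests per second at time t for a sine-shaped load.

    The rate oscillates around base_rps by +/- amplitude with the given
    period and is clamped at zero so that it never becomes negative.
    """
    phase = 2.0 * math.pi * t_seconds / period_seconds
    return max(0.0, base_rps + amplitude * math.sin(phase))


# One full period sampled every 15 seconds (hypothetical values).
for t in range(0, 61, 15):
    print(t, round(sine_rps(t, base_rps=1000, amplitude=500, period_seconds=60)))
```

A load generator driven by such a function gives a detector a predictable, repeating signal, which makes it easier to tell load-induced changes in the metrics apart from interference.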

See the workloads directory for the list of supported applications and load generators.

Contributors

ppalucki, felidadae, mmucek95, damenus, psykulsk, maciej-wisniewski, thuang6, wangjialei-a, squall0gd, not7cd, gryf
