WCA - Workload Collocation Agent

Table of Contents

Introduction
Getting started
Configuration
Components
Workloads
Further reading

This software is pre-production and should not be deployed to production servers.

Introduction

The goal of Workload Collocation Agent is to reduce interference between collocated tasks and increase task density while ensuring quality of service for high-priority tasks. The chosen approach enables real-time management of resource isolation, so that high-priority jobs meet their Service Level Objective (SLO) and best-effort jobs use as many idle resources as possible.

Resource usage can be increased by:

collocating best-effort and high-priority tasks to exploit resources that are underutilized by high-priority applications,
collocating tasks that do not compete for shared resources on the platform.

WCA abstracts the compute node, workloads, monitoring and resource allocation. An externally provided algorithm is responsible for resource allocation or anomaly detection logic. WCA and the algorithm exchange information about current resource usage, isolation actuations or detected anomalies. WCA stores information about detected anomalies, resource allocation and platform utilization metrics in remote storage such as Kafka.

The diagram below puts WCA in the context of a cluster and monitoring infrastructure:

For context regarding Mesos see this document and for Kubernetes see this document.

See WCA Architecture 1.7.pdf for further details.

Getting started

WCA is targeted at and tested on CentOS 7.5.

Note: for a full production installation, please follow this detailed installation guide.

# Install required software.
sudo yum install epel-release -y
sudo yum install git python36 make which -y
python3.6 -m ensurepip --user
python3.6 -m pip install --user pipenv
export PATH=$PATH:~/.local/bin

# Clone the repository & build.
git clone https://github.com/intel/workload-collocation-agent
cd workload-collocation-agent

export LC_ALL=en_US.utf8  # required for the CentOS Docker image
make venv
make wca_package

# Prepare tasks manually (only cgroups are required)
sudo mkdir /sys/fs/cgroup/{cpu,cpuacct,perf_event}/task1

# Example of running agent in measurements-only mode with predefined static list of tasks
sudo dist/wca.pex --config $PWD/configs/extra/static_measurements.yaml --root

# Example of static allocation with predefined rules on predefined list of tasks.
sudo dist/wca.pex --config $PWD/configs/extra/static_allocator.yaml --root

Running those commands outputs metrics in Prometheus format to standard error like this:

# HELP cache_misses Linux Perf counter for cache-misses per container.
# TYPE cache_misses counter
cache_misses{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0.0 1554139418146

# HELP cpu_usage_per_cpu [1/USER_HZ] Logical CPU usage in 1/USER_HZ (usually 10ms).Calculated using values based on /proc/stat
# TYPE cpu_usage_per_cpu counter
cpu_usage_per_cpu{cores="4",cpu="0",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 5103734 1554139418146
cpu_usage_per_cpu{cores="4",cpu="1",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 6860714 1554139418146

# HELP cpu_usage_per_task [ns] cpuacct.usage (total kernel and user space)
# TYPE cpu_usage_per_task counter
cpu_usage_per_task{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0 1554139418146

# HELP instructions Linux Perf counter for instructions per container.
# TYPE instructions counter
instructions{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1",task_id="task1"} 0.0 1554139418146

# HELP memory_usage [bytes] Total memory used by platform in bytes based on /proc/meminfo and uses heuristic based on linux free tool (total - free - buffers - cache).
# TYPE memory_usage gauge
memory_usage{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 6407118848 1554139418146

# TYPE wca_tasks gauge
wca_tasks{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 1 1554139418146

# TYPE wca_up counter
wca_up{cores="4",cpus="8",host="gklab-126-081",wca_version="0.1.dev655+g586f259.d20190401",sockets="1"} 1554139418.146581 1554139418146
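
Because the output follows the Prometheus text format, a single sample line can be split into name, labels, value and timestamp with a few lines of Python. The sketch below is illustrative only: parse_sample is a hypothetical helper, not part of WCA, and it assumes simple lines like the ones shown above (escaped characters inside label values are not unescaped).

```python
import re

# Shape of a sample line: name{label="value",...} value [timestamp_ms]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+'
    r'(?P<value>\S+)(?:\s+(?P<ts>-?\d+))?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Parse one sample line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError('not a sample line: %r' % line)
    ts = match.group('ts')
    return {
        'name': match.group('name'),
        'labels': dict(LABEL_RE.findall(match.group('labels') or '')),
        'value': float(match.group('value')),
        'timestamp_ms': int(ts) if ts is not None else None,
    }
```

Applied to the cpu_usage_per_task line above, this yields a dictionary whose labels include task_id set to task1, which is enough for quick ad-hoc filtering of the agent's output by task.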

When reconfigured, other built-in components allow you to:

store those metrics in Kafka,
integrate with Mesos or Kubernetes,
enable anomaly detection,
or enable anomaly prevention (allocation) to mitigate interference between workloads.

Configuration

WCA introduces a simple but extensible mechanism to inject dependencies into classes and build a complete software stack of components. The WCA main control loop is based on the Runner base class, which implements a single blocking run method. Depending on the Runner class used, WCA runs in a different execution mode (e.g. detection, allocation).

Refer to the full list of Components for further reference.

Available runners:

MeasurementRunner is a simple runner that only collects data without calling the detection/allocation API.
DetectionRunner implements a loop that calls the detect function at regular, configurable intervals. See detection API for details.
AllocationRunner implements a loop that calls the allocate function at regular, configurable intervals. See allocation API for details.

Conceptually, a Runner reads the state of the system (both metrics and workloads), passes that information to an external component (an algorithm), logs the algorithm's input and output using an implementation of Storage, and allocates resources if instructed.
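
The read, decide, log and allocate cycle described above can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not WCA's actual Runner code: run_loop, ListStorage and the callables passed in are all hypothetical names.

```python
import time
from typing import Callable, Dict, List


class ListStorage:
    """Hypothetical storage that keeps records in memory."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def store(self, record: dict) -> None:
        self.records.append(record)


def run_loop(read_state: Callable[[], Dict[str, float]],
             algorithm: Callable[[Dict[str, float]], Dict[str, float]],
             apply_allocations: Callable[[Dict[str, float]], None],
             storage: ListStorage,
             iterations: int,
             interval_s: float = 0.0) -> None:
    """Repeat the conceptual cycle: read, decide, log, allocate, wait."""
    for _ in range(iterations):
        state = read_state()            # metrics and workloads
        decision = algorithm(state)     # external component (an algorithm)
        storage.store({'input': state, 'output': decision})
        if decision:                    # allocate only if instructed
            apply_allocations(decision)
        time.sleep(interval_s)
```

Keeping the algorithm behind a plain callable mirrors the separation described in the text: the loop owns measurement and actuation, while the decision logic can be swapped out.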

The following snippet is an example configuration of a runner:

runner: !SomeRunner
    node: !SomeNode
    callback_component: !ClassImplementingCallback
    storage: !SomeStorage

After starting WCA with the above configuration, an instance of the class SomeRunner will be created. The instance's properties will be set to:

node - to an instance of SomeNode
callback_component - to an instance of ClassImplementingCallback
storage - to an instance of SomeStorage
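
In plain Python terms, the resulting object graph is roughly what the sketch below builds by hand. The class names are the placeholders from the configuration above; their empty bodies and the no-op run method are assumptions made only so the example is self-contained, and this is not how WCA itself constructs objects.

```python
from dataclasses import dataclass


@dataclass
class SomeNode:
    """Placeholder for a node component."""


@dataclass
class ClassImplementingCallback:
    """Placeholder for the algorithm component."""


@dataclass
class SomeStorage:
    """Placeholder for a storage component."""


@dataclass
class SomeRunner:
    # Each annotated field corresponds to one key in the YAML mapping.
    node: SomeNode
    callback_component: ClassImplementingCallback
    storage: SomeStorage

    def run(self) -> None:
        """Stand-in for the blocking run method (does nothing here)."""


# Equivalent of loading the example configuration.
runner = SomeRunner(
    node=SomeNode(),
    callback_component=ClassImplementingCallback(),
    storage=SomeStorage(),
)
```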

The configuration mechanism allows you to:

Create and configure complex Python objects (e.g. DetectionRunner, MesosNode, KafkaStorage) using YAML tags.
Inject dependencies (with type checking support) into constructed objects using dataclass annotations.
Register external classes using the -r command line argument or the wca.config.register decorator API. This makes it possible to extend WCA with new functionality (more information here) and is used to plug in external components that carry, for example, anomaly detection logic, such as Platform Resource Manager.

See external detector example for more details.
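
To make the idea of registration plus type-checked injection more concrete, here is a minimal standalone sketch. It does not use or reproduce wca.config.register; the register and build functions, the registry dictionary and ThresholdDetector are all hypothetical, and the type check only covers fields annotated with plain classes.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict

_REGISTRY: Dict[str, type] = {}


def register(cls):
    """Record a class under its name so a configuration tag can refer to it."""
    _REGISTRY[cls.__name__] = cls
    return cls


def build(tag: str, **kwargs: Any):
    """Instantiate a registered dataclass after checking argument types."""
    cls = _REGISTRY[tag]
    for field in fields(cls):
        # Only plain classes are checked; typing generics are skipped.
        if field.name in kwargs and isinstance(field.type, type):
            if not isinstance(kwargs[field.name], field.type):
                raise TypeError('%s.%s expects %s, got %r' % (
                    tag, field.name, field.type.__name__, kwargs[field.name]))
    return cls(**kwargs)


@register
@dataclass
class ThresholdDetector:
    """Hypothetical external component with two typed settings."""
    threshold: float
    metric_name: str = 'cache_misses'
```

With this in place, build('ThresholdDetector', threshold=0.5) returns a configured instance, while passing a string as the threshold raises TypeError before the object is created.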

Components

The following built-in components are available (stable API):

MesosNode provides workload discovery on a Mesos cluster node where the Mesos containerizer is used (see the docs here)
KubernetesNode provides workload discovery on a Kubernetes cluster node (see the docs here)
MeasurementRunner implements a simple loop that reads the state of the system, encodes this information as metrics and stores them,
DetectionRunner extends MeasurementRunner and additionally implements an anomaly detection callback and encodes anomalies as metrics to enable alerting and analysis. See Detection API for more details.
AllocationRunner extends MeasurementRunner and additionally implements a resource allocation callback. See Allocation API for more details.
NOPAnomalyDetector is a dummy "no operation" detector that returns neither metrics nor anomalies. See Detection API for more details.
NOPAllocator is a dummy "no operation" allocator that returns neither metrics nor anomalies and does not configure resources. See Detection API for more details.
KafkaStorage logs metrics to the Kafka streaming platform using configurable topics.
LogStorage logs metrics to standard error or to a file at a configurable location.
SSL enables secure communication with external components (more information here).

The following built-in components are available as provisional API:

StaticNode to support a static list of tasks (does not require a full orchestration software stack),
StaticAllocator to support simple rule-based logic for resource allocation.

Officially supported third-party components:

Intel "Platform Resource Manager" plugin - a machine-learning-based component for both anomaly detection and allocation.

Warning: These components run as ordinary Python classes, without any isolation and with the WCA process's privileges, so there is no built-in protection against malicious external components. For security reasons, please use only built-in and officially supported components. More about security here.

Workloads

The project contains Dockerfiles together with helper scripts for preparing reference workloads to be run on a Mesos cluster using the Aurora framework.

To enable validation of anomaly detection algorithms, the workloads are prepared to:

provide a continuous stream of Application Performance Metrics using wrappers (all workloads),
simulate varying load (patches that generate a sine-like pattern of requests per second are available for YCSB and rpc-perf).
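
A sine-like request rate of the kind mentioned above can be described by a single formula. The function below is only an illustration of that shape; it is not taken from the YCSB or rpc-perf patches, and its name and parameters (base_rps, amplitude_rps, period_s) are assumptions.

```python
import math


def sine_rps(t_s: float, base_rps: float, amplitude_rps: float,
             period_s: float) -> float:
    """Target requests per second at time t_s for a sine-shaped load."""
    rps = base_rps + amplitude_rps * math.sin(2.0 * math.pi * t_s / period_s)
    # A request rate cannot be negative, so clamp at zero.
    return max(0.0, rps)
```

A load generator could call such a function once per second and throttle itself to the returned rate, giving the detector a predictable, repeating load pattern against which to validate reported anomalies.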

See the workloads directory for the list of supported applications and load generators.
