intel / cri-resource-manager

Kubernetes Container Runtime Interface (CRI) proxy service with hardware-resource-aware workload placement policies

License: Apache License 2.0

Shell 16.86% Makefile 1.88% Go 79.04% Dockerfile 0.09% C 0.25% CSS 0.03% HTML 0.03% JavaScript 0.55% Python 1.23% Roff 0.04%
kubernetes container-runtime-interface hardware-topology

cri-resource-manager's Introduction

CRI Resource Manager for Kubernetes*

Welcome!

See our Documentation site for detailed documentation.

cri-resource-manager's People

Contributors

ahsan518, arskama, askervin, bart0sh, dependabot[bot], dodan, dougtw, fmuyassarov, huangrui666, intel-k8s-bot, intelkevinputnam, ipuustin, jukkar, kad, klihub, marquiz, mmucek95, mythi, okartau, ppalucki, testwill, wpross, yugar-1


cri-resource-manager's Issues

README improvements

I'm going through the README for the first time and I have a few suggestions for improvements:

  • I would not put the CRI message dumper at the top of the README. This makes it feel like it is the main use case for the project, while it's not. IMHO it's an interesting but marginal use of this project and should be mentioned in the README, just not as one of the first things people will read.

  • How can I actually deploy cri-resource-manager? I was expecting a few practical, complete examples of how I could deploy, install and run cri-resource-manager. I am not sure if I have to build a webhook image, if I need to start the crm agent or not, etc. A few sequential steps on how to kick things off would already be quite beneficial, but something like a container-based deployment tool or a kubectl apply -f https://github.com/intel/cri-resource-manager/deploy/crm.yaml command line would be even better.

  • I am also missing an overall picture of the project and how it interacts with Kubernetes, the kubelet and CRI-O/containerd. A high-level, illustrated, one-paragraph architecture overview would help.

I hope this helps. I feel this project is very valuable for the k8s ecosystem, and I would love to see the README make it shine a little bit more ;-)

agent: define a sensible strategy for dealing with configuration errors.

It is quite clear how to deal with configuration delivery errors: keep retrying indefinitely and preferably propagate the failure back to the source, in other words to the ConfigMap.

It is not clear, however, how to deal with application-level configuration errors. The agent has no knowledge of the application/configuration semantics, nor is it in control of, or even notified about, system configuration changes.
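
A minimal Go sketch of the delivery-error side of this strategy (indefinite retry with capped backoff, with the latest outcome reported back through a callback that could, for instance, update a status field on the ConfigMap); all names here are hypothetical and only illustrate the idea:

package main

import (
	"errors"
	"fmt"
	"time"
)

// applyConfig stands in for pushing a new configuration to cri-resmgr.
func applyConfig(attempt int) error {
	if attempt < 3 {
		return errors.New("config server not reachable")
	}
	return nil
}

// deliverWithRetry retries indefinitely with capped backoff and reports
// the latest outcome via the report callback (which, in the agent, could
// propagate the error back to the source ConfigMap).
func deliverWithRetry(report func(error)) {
	backoff := time.Second
	for attempt := 1; ; attempt++ {
		err := applyConfig(attempt)
		report(err)
		if err == nil {
			return
		}
		time.Sleep(backoff)
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	deliverWithRetry(func(err error) {
		if err != nil {
			fmt.Println("delivery failed, will retry:", err)
			return
		}
		fmt.Println("configuration delivered")
	})
}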

RDT: "standardized" mechanism for out-of-band resource control enforcement

Inspired by #14 (review)

There would need to be a standard way of adding out-of-band (i.e. outside of CRI) resource control enforcement points, such as RDT or blockio (or other) cgroup controls. Some design goals:

  • adding new resource control types should be easy
  • using the resource controls from policies should be simple, avoiding duplicate/boilerplate code

One possible scheme would be to use cache.Container (and/or Pod) for storing the desired state, with resource-manager-level callbacks for enforcing the desired state(s). E.g. in the case of RDT there would be Container.SetRdtClass() for setting/storing the desired RDT class, and a separate callback in resource-manager (registered by the RDT handler) would do the enforcement.
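
A minimal Go sketch of what that split could look like; the type, method and interface names below (Container, SetRdtClass, Controller, PostUpdate) are hypothetical stand-ins for illustration, not the project's actual API:

package main

import "fmt"

// Container is a stand-in for cache.Container; only the fields needed
// by the sketch are shown.
type Container struct {
	ID       string
	RdtClass string
}

// SetRdtClass stores the desired RDT class; it does not enforce anything.
func (c *Container) SetRdtClass(class string) { c.RdtClass = class }

// Controller is an out-of-band enforcement hook registered with the
// resource manager (hypothetical interface).
type Controller interface {
	Name() string
	// PostUpdate is called after a policy decision and enforces the
	// desired state stored in the container.
	PostUpdate(c *Container) error
}

// rdtController enforces the RDT class outside of CRI, e.g. by moving
// the container's tasks into the matching resctrl group.
type rdtController struct{}

func (r *rdtController) Name() string { return "rdt" }
func (r *rdtController) PostUpdate(c *Container) error {
	fmt.Printf("assigning container %s to resctrl class %q\n", c.ID, c.RdtClass)
	return nil // a real implementation would write under /sys/fs/resctrl
}

func main() {
	controllers := []Controller{&rdtController{}}

	// A policy only stores the desired state...
	c := &Container{ID: "abc123"}
	c.SetRdtClass("Guaranteed")

	// ...and the resource manager runs the registered enforcement hooks.
	for _, ctrl := range controllers {
		if err := ctrl.PostUpdate(c); err != nil {
			fmt.Printf("controller %s failed: %v\n", ctrl.Name(), err)
		}
	}
}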

cpuallocator: improve allocation heuristics

Bring more NUMA-awareness to the cpuallocator (implemented in pkg/cpuallocator/). Discussing/reviewing the cpu allocation logic with @klihub we realized that the allocator is too simple, resulting in clearly non-optimal results. This concerns especially takeIdleCores(), which should (we think) try to more aggressively and intelligently pack workloads in a topology-aware manner.

The cpuallocator would need to be improved with additional tightest-fit allocation rules beyond the current topology socket/core/thread hierarchy to get to a more realistic socket/die/NUMA node/core/thread hierarchy (a rough sketch of the intended ordering follows the list below):

  • try allocating a full die if the number of requested cpus matches exactly
  • try allocating a full NUMA node if the number of cpus matches exactly
  • only then fall back to allocating mere full cores or threads, and with these as well
    • try taking sub-NUMA node number of cores/threads from a single NUMA node,
    • try taking sub-die number of cores/threads from a single die
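
A rough Go sketch of that ordering; the topology representation and helper names are made up for illustration and do not match the actual pkg/cpuallocator API:

package main

import "fmt"

// node is a hypothetical topology unit (die, NUMA node, ...) with the
// number of free CPUs it contains.
type node struct {
	kind string
	free int
}

// pickExact returns a unit of the given kind whose free CPU count
// matches the request exactly, if one exists.
func pickExact(units []node, kind string, want int) (node, bool) {
	for _, u := range units {
		if u.kind == kind && u.free == want {
			return u, true
		}
	}
	return node{}, false
}

// allocate illustrates the proposed ordering: full die, then full NUMA
// node, then fall back to cores/threads from a single NUMA node or die.
func allocate(units []node, want int) string {
	if u, ok := pickExact(units, "die", want); ok {
		return fmt.Sprintf("take full die (%d cpus)", u.free)
	}
	if u, ok := pickExact(units, "numa", want); ok {
		return fmt.Sprintf("take full NUMA node (%d cpus)", u.free)
	}
	// Fallback: prefer fitting inside a single NUMA node, then a single die.
	for _, kind := range []string{"numa", "die"} {
		for _, u := range units {
			if u.kind == kind && u.free >= want {
				return fmt.Sprintf("take %d cpus from one %s", want, kind)
			}
		}
	}
	return "no tight fit found"
}

func main() {
	topology := []node{
		{kind: "die", free: 16},
		{kind: "numa", free: 8},
		{kind: "numa", free: 8},
	}
	fmt.Println(allocate(topology, 8))  // exact NUMA node fit
	fmt.Println(allocate(topology, 16)) // exact die fit
	fmt.Println(allocate(topology, 4))  // sub-NUMA fallback
}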

AVX512: use annotations for workload placement

Being able to utilize object metadata information (e.g., from CRI) to do proper avx512 workload placement in advance could save us from doing avx512 activity monitoring + rebalancing.

Challenges: NFD labels, for instance, are node affinity labels and are not visible in pod metadata.

ConfigMap node overrides for all options.

Currently, the configuration allows per-node overrides only for policy-specific configuration. We need this generally for all configuration, not just policies.

Cache Metrics

CRI-RM must provide data on cache hits and misses.

Properly detect and act on container crashes.

We used to piggyback on the ListContainers requests periodically sent by kubelet to detect state changes of containers. We stopped doing that at one point and, as a side-effect, we now don't always properly detect when a container crashes. This eventually leads to resources being allocated to crashed containers and new container creation requests failing with insufficient resources.

I think one way to trigger/test this is to run a container that constantly leaks memory, with a memory limit set, until the OOM-killer decides to kill it. At this point, if the container's CPU request is more than the amount of remaining free CPU when the container was still running, the kubelet's attempt to (re)create the crashed container will fail with an error about insufficient available CPU.

go get fails to import cri-resource-manager

Here is how it happens:

$ go get github.com/intel/cri-resource-manager/pkg/sysfs
go: finding github.com/intel/cri-resource-manager v0.2.0
go: finding github.com/intel/cri-resource-manager/pkg latest
go: finding github.com/intel/cri-resource-manager/pkg/sysfs latest
go: finding github.com/intel/cri-resource-manager v0.2.0
go: downloading github.com/intel/cri-resource-manager v0.2.0
verifying github.com/intel/cri-resource-manager@v0.2.0: github.com/intel/cri-resource-manager@v0.2.0: reading https://sum.golang.org/lookup/github.com/intel/cri-resource-manager@v0.2.0: 410 Gone

The actual error is this:

not found: unzip /tmp/gopath/pkg/mod/cache/download/github.com/intel/cri-resource-manager/@v/v0.2.0.zip: malformed file path "pkg/sysfs/testdata/sys/devices/pci0000:00/0000:00:02.0/class": invalid char ':'

I'll try to come up with a fix soon.

RDT: non-overlapping cache allocations

Currently, there is no way of specifying non-overlapping L3 schemas using the relative (percentage) notation. We might want to specify something like this (using the percentage notation):

L3:          XXXXXXXXXXXXXXXXXXXX
Guaranteed:  xxxxxxxxxx
Burstable              xxxxxxxxxx
Besteffort                  xxxxx

For this we would need both exclusive non-overlapping, and, non-exclusive overlapping definitions at the same time. One way to achieve this would be to add one extra level of hierarchy (i.e. groups) to the class configuration. Cache resources could not overlap on the group level. The scheme described above could be achieved with a configuration something like this:

resctrlGroups:
  exclusive:
    l3allocation: "50%"
    classes:
      Guaranteed:
        l3schema:
          default: "100%"
  shared:
    l3allocation: "50%"
    classes:
      Burstable:
        l3schema:
          default: "100%"
      BestEffort:
        l3schema:
          default: "50%"

In this scheme, memory bandwidth (MB) would probably allow over-committing, i.e. each group could have e.g. 100%.

Default blockio/RDT class from pod/container annotations

There should be a way to set the default blockio and/or RDT class on per pod/container level through annotations.

  • the mechanism should set the default class of the container
  • it should be a generic (not policy specific) mechanism
  • there should be configuration options to specify the annotation(s) to watch (see the sketch below)
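
A minimal Go sketch of resolving a per-container default class from a configurable annotation prefix; the annotation keys used here (rdtclass.example.io/pod, rdtclass.example.io/container.<name>) are purely hypothetical, not the project's actual annotations:

package main

import "fmt"

// classFromAnnotations resolves the default class for a container.
// A container-scoped annotation ("<prefix>/container.<name>") wins over
// the pod-scoped one ("<prefix>/pod"); both patterns are assumptions.
func classFromAnnotations(annotations map[string]string, prefix, container string) (string, bool) {
	if v, ok := annotations[prefix+"/container."+container]; ok {
		return v, true
	}
	if v, ok := annotations[prefix+"/pod"]; ok {
		return v, true
	}
	return "", false
}

func main() {
	// The annotation prefix to watch would come from configuration.
	prefix := "rdtclass.example.io"

	podAnnotations := map[string]string{
		"rdtclass.example.io/pod":            "Burstable",
		"rdtclass.example.io/container.main": "Guaranteed",
	}

	for _, name := range []string{"main", "sidecar"} {
		if class, ok := classFromAnnotations(podAnnotations, prefix, name); ok {
			fmt.Printf("container %s -> default RDT class %q\n", name, class)
		}
	}
}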

/var/run/cri-resmgr must be created before running cri-resmgr

When launching cri-resmgr after a reboot, it fails to run:

D: [   cri/server   ] waiting for server to become ready...                                                           
E: [   cri-resmgr   ] failed to start resource manager: resource-manager: failed to start configuration server: failed to listen to socket: listen unix /var/run/cri-resmgr/cri-resmgr-config.sock: bind: no such file or directory

but works after sudo mkdir /var/run/cri-resmgr.
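
One straightforward fix sketch (not necessarily how the project will end up solving it): create the socket directory before binding, since /var/run is typically a tmpfs and therefore empty after every reboot. The socket path below comes from the error message; the helper name is illustrative:

package main

import (
	"log"
	"net"
	"os"
	"path/filepath"
)

func listenUnix(socketPath string) (net.Listener, error) {
	// Make sure the parent directory (e.g. /var/run/cri-resmgr) exists.
	if err := os.MkdirAll(filepath.Dir(socketPath), 0o755); err != nil {
		return nil, err
	}
	// Remove a stale socket left behind by a previous run, if any.
	os.Remove(socketPath)
	return net.Listen("unix", socketPath)
}

func main() {
	l, err := listenUnix("/var/run/cri-resmgr/cri-resmgr-config.sock")
	if err != nil {
		log.Fatalf("failed to listen to socket: %v", err)
	}
	defer l.Close()
	log.Printf("listening on %s", l.Addr())
}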

avx512 eBPF elf vs kernel version checking

We need to re-evaluate the kernel version vs. eBPF ELF version checking.

Let's not check for an exact match but instead build in a 'host kernel needs to be at least version X' value and accept any host kernel >= that version.
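
A hedged sketch of such an "at least version X" check, reading the running kernel version from /proc/sys/kernel/osrelease; the minimum version used in main() is only an example value, not the real requirement:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether the running kernel is at least major.minor.
func kernelAtLeast(major, minor int) (bool, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		return false, err
	}
	// osrelease looks like "5.15.0-91-generic"; keep only major.minor.
	parts := strings.SplitN(strings.TrimSpace(string(raw)), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel version %q", raw)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	min, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	return maj > major || (maj == major && min >= minor), nil
}

func main() {
	// Example only: the real minimum would be whatever the eBPF ELF requires.
	ok, err := kernelAtLeast(4, 18)
	if err != nil {
		fmt.Println("version check failed:", err)
		return
	}
	fmt.Println("kernel is recent enough:", ok)
}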

RDT: support for CDP

Support CDP (code and data prioritization) schemas (i.e. separate L3data and L3code).

policy-level tri-state (on/off/auto) knob for rdt

Per-policy control over whether RDT must be on, should be turned off, or is automatically taken into use if it is available. The policy MUST fail to start if RDT cannot be initialized according to this configuration (in other words, if it must be on but is not available).
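
A minimal sketch of how such a tri-state knob could look in Go; the option name and accepted values are assumptions for illustration:

package main

import (
	"errors"
	"fmt"
)

// RdtMode is a hypothetical tri-state policy option for RDT.
type RdtMode string

const (
	RdtOn   RdtMode = "on"   // RDT must be available, otherwise fail
	RdtOff  RdtMode = "off"  // never use RDT
	RdtAuto RdtMode = "auto" // use RDT if available, otherwise continue
)

// checkRdt decides whether the policy may start, given the configured
// mode and whether RDT was successfully initialized.
func checkRdt(mode RdtMode, rdtAvailable bool) (useRdt bool, err error) {
	switch mode {
	case RdtOff:
		return false, nil
	case RdtAuto:
		return rdtAvailable, nil
	case RdtOn:
		if !rdtAvailable {
			return false, errors.New("rdt: required by policy but not available")
		}
		return true, nil
	default:
		return false, fmt.Errorf("invalid rdt mode %q", mode)
	}
}

func main() {
	for _, mode := range []RdtMode{RdtOn, RdtOff, RdtAuto} {
		use, err := checkRdt(mode, false) // pretend RDT is unavailable
		fmt.Printf("mode=%s use=%v err=%v\n", mode, use, err)
	}
}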

pkg/metrics to Prometheus

We want to expose the metrics gathered by collectors registered to pkg/metrics on the CRI-RM Prometheus HTTP endpoint.
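
A minimal sketch of serving collectors from a dedicated registry over an HTTP endpoint with the standard prometheus/client_golang library; the metric name and port are placeholders, and the real pkg/metrics wiring will differ:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry to which pkg/metrics collectors would be added.
	registry := prometheus.NewRegistry()

	// Placeholder collector standing in for the real pkg/metrics collectors.
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "cri_resmgr_example_requests_total",
		Help: "Example counter exposed through the metrics endpoint.",
	})
	registry.MustRegister(requests)
	requests.Inc()

	// Expose everything in the registry on /metrics.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8891", nil))
}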

Default config startup failed in case RDT is enabled only with MBM

Supported schemata:

$ cat /sys/fs/resctrl/schemata 
MB:0=100;1=100;2=100;3=100
$
I: [resource-manager] using auto-discovered resctrl-path "/sys/fs/resctrl"
D: [       rdt      ] writing schemata "L3:0=1;1=1;2=1;3=1\nMB:0=100;1=100;2=100;3=100\n"
E: [   cri-resmgr   ] failed to create resource manager instance: resource-manager: failed to create resource manager: rdt: configuration failed: write /sys/fs/resctrl/cri-resmgr.BestEffort/schemata: invalid argument

Documentation about how to start cri-resmgr is outdated

Top-level README.md refers to non-existent cri-resmgr command line flags:

For full message dumping you start the CRI relay like this:

  ./cmd/cri-resmgr/cri-resmgr -policy null -dump 'reset,full:.*' -dump-file /tmp/cri.dump

Same for topology-aware policy README.md:

You can activate the topology-aware policy by setting the --policy option of cri-resmgr to topology-aware. For instance like this:

cri-resmgr --policy topology-aware --reserved-resources cpu=750m

The documentation should be updated to refer to the current method for configuring the server.

metrics: relay correlated metrics to Prometheus

Collected raw metrics are already relayed to Prometheus. Rebalancing will probably produce a number of high-level, domain-specific metrics, more suitable for decision making within cri-resmgr. These metrics should be exported to Prometheus as well.
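
A hedged sketch of exporting one such derived, domain-specific metric alongside the raw ones; the metric name, label and value here are made up for illustration:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// colocationPressure is a hypothetical high-level metric derived from raw
// per-container samples, intended to drive rebalancing decisions.
var colocationPressure = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cri_resmgr_colocation_pressure",
		Help: "Derived pressure score per topology zone (hypothetical).",
	},
	[]string{"zone"},
)

func main() {
	prometheus.MustRegister(colocationPressure)

	// In cri-resmgr this would be computed from correlated raw metrics;
	// here we just set an example value for one zone.
	colocationPressure.WithLabelValues("numa0").Set(0.42)

	fmt.Println("derived metric registered and updated")
}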

RDT Metrics

CRI-RM must provide data on:

  • memory bandwidth
  • cache occupancy

Fix broken native resource requests/limits estimation heuristics.

Our native resource requests/limits estimation heuristics are now broken; the worst offender is the memory request estimation, which is really badly off.

The algorithm should take into account the pod's/container's apparent QoS class, which can be deduced from the pod's cgroup parent path. Fix/improve the estimation to correctly handle the straightforward cases (Guaranteed, BestEffort QoS classes) and work reasonably for the rest (Burstable QoS class).
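
A rough sketch of deducing the apparent QoS class from the cgroup parent path, assuming the usual kubelet layout (besteffort and burstable pods under their own sub-hierarchies, guaranteed pods directly under kubepods):

package main

import (
	"fmt"
	"strings"
)

// qosFromCgroupParent deduces the apparent QoS class of a pod from its
// cgroup parent path, assuming the usual kubelet cgroup layout.
func qosFromCgroupParent(cgroupParent string) string {
	path := strings.ToLower(cgroupParent)
	switch {
	case strings.Contains(path, "besteffort"):
		return "BestEffort"
	case strings.Contains(path, "burstable"):
		return "Burstable"
	default:
		// Guaranteed pods live directly under the kubepods slice/directory.
		return "Guaranteed"
	}
}

func main() {
	for _, p := range []string{
		"/kubepods/besteffort/pod1234",
		"/kubepods/burstable/pod5678",
		"/kubepods/pod9abc",
	} {
		fmt.Printf("%-35s -> %s\n", p, qosFromCgroupParent(p))
	}
}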

Documentation overhaul

We need to go through all the documentation, restructure it and supplement missing areas, before the next release. Motivated by #171

Backlog (please append new items as you encounter areas to fix):

  • #321 README should provide references to all the sub-documents
  • #320 update RDT documentation, better describing partitions, classes, configuration, exclusivity etc.
  • #121 (Documentation: “how to write policies”)
  • #159 (Documentation about how to start cri-resmgr is outdated)
  • #171 (Readme Improvements)
  • #318 document (per policy) handling of isolcpus
