intel / cri-resource-manager

Kubernetes Container Runtime Interface (CRI) proxy service with hardware-resource-aware workload placement policies

License: Apache License 2.0

Shell 16.86% Makefile 1.88% Go 79.04% Dockerfile 0.09% C 0.25% CSS 0.03% HTML 0.03% JavaScript 0.55% Python 1.23% Roff 0.04%
kubernetes container-runtime-interface hardware-topology

cri-resource-manager's Introduction

CRI Resource Manager for Kubernetes*

Welcome!

See our Documentation site for detailed documentation.

cri-resource-manager's People

Contributors

ahsan518, arskama, askervin, bart0sh, dependabot[bot], dodan, dougtw, fmuyassarov, huangrui666, intel-k8s-bot, intelkevinputnam, ipuustin, jukkar, kad, klihub, marquiz, mmucek95, mythi, okartau, ppalucki, testwill, wpross, yugar-1


cri-resource-manager's Issues

README improvements

I'm going through the README for the first time and I have a few suggestions for improvements:

  • I would not put the CRI message dumper at the top of the README. This makes it feel like it is the main use case for the project, while it's not. IMHO it's an interesting but marginal use of this project and should be mentioned in the README, just not as one of the first things people will read.

  • How can I actually deploy cri-resource-manager? I was expecting a few practical, complete examples of how I could deploy, install and run cri-resource-manager. I am not sure if I have to build a webhook image, if I need to start the crm agent or not, etc. A few sequential steps on how to kick things off would already be quite beneficial, but something like a container-based deployment tool or a kubectl apply -f https://github.com/intel/cri-resource-manager/deploy/crm.yaml command line would be even better.

  • I am also missing an overall picture of the project and how it interacts with Kubernetes, the kubelet and CRI-O/containerd. A high-level, illustrated, one-paragraph architecture overview would help.

I hope this helps. I feel this project is very valuable for the k8s ecosystem, and I would love to see the README make it shine a little bit more ;-)

agent: define a sensible strategy for dealing with configuration errors.

It is quite clear how to deal with configuration delivery errors: keep retrying indefinitely and preferably propagate the failure back to the source, in other words to the ConfigMap.

It is not clear, however, how to deal with application-level configuration errors. The agent has no knowledge of the application/configuration semantics, nor is it in control of, or even notified about, system configuration changes.
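
A minimal Go sketch of the delivery-error side of this strategy (indefinite retry with capped backoff, with the latest outcome reported back through a callback that could, for instance, update a status field on the ConfigMap); all names here are hypothetical and only illustrate the idea:

package main

import (
	"errors"
	"fmt"
	"time"
)

// applyConfig stands in for pushing a new configuration to cri-resmgr.
func applyConfig(attempt int) error {
	if attempt < 3 {
		return errors.New("config server not reachable")
	}
	return nil
}

// deliverWithRetry retries indefinitely with capped backoff and reports
// the latest outcome via the report callback (which, in the agent, could
// propagate the error back to the source ConfigMap).
func deliverWithRetry(report func(error)) {
	backoff := time.Second
	for attempt := 1; ; attempt++ {
		err := applyConfig(attempt)
		report(err)
		if err == nil {
			return
		}
		time.Sleep(backoff)
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	deliverWithRetry(func(err error) {
		if err != nil {
			fmt.Println("delivery failed, will retry:", err)
			return
		}
		fmt.Println("configuration delivered")
	})
}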

RDT: "standardized" mechanism for out-of-band resource control enforcement

Inspired by #14 (review)

There would need to be a standard way of adding out-of-band (i.e. outside of CRI) resource control enforcement points, such as RDT or blockio (or other) cgroup controls. Some design goals:

  • adding new resource control types should be easy
  • using the resource controls from policies should be simple, avoiding duplicate/boilerplate code

One possible scheme would be to use cache.Container (and/or Pod) for storing the desired state, with resource-manager-level callbacks for enforcing the desired state(s). E.g. in the case of RDT there would be Container.SetRdtClass() for setting/storing the desired RDT class, and a separate callback in resource-manager (registered by the RDT handler) would do the enforcement.
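
A minimal Go sketch of what that split could look like; the type, method and interface names below (Container, SetRdtClass, Controller, PostUpdate) are hypothetical stand-ins for illustration, not the project's actual API:

package main

import "fmt"

// Container is a stand-in for cache.Container; only the fields needed
// by the sketch are shown.
type Container struct {
	ID       string
	RdtClass string
}

// SetRdtClass stores the desired RDT class; it does not enforce anything.
func (c *Container) SetRdtClass(class string) { c.RdtClass = class }

// Controller is an out-of-band enforcement hook registered with the
// resource manager (hypothetical interface).
type Controller interface {
	Name() string
	// PostUpdate is called after a policy decision and enforces the
	// desired state stored in the container.
	PostUpdate(c *Container) error
}

// rdtController enforces the RDT class outside of CRI, e.g. by moving
// the container's tasks into the matching resctrl group.
type rdtController struct{}

func (r *rdtController) Name() string { return "rdt" }
func (r *rdtController) PostUpdate(c *Container) error {
	fmt.Printf("assigning container %s to resctrl class %q\n", c.ID, c.RdtClass)
	return nil // a real implementation would write under /sys/fs/resctrl
}

func main() {
	controllers := []Controller{&rdtController{}}

	// A policy only stores the desired state...
	c := &Container{ID: "abc123"}
	c.SetRdtClass("Guaranteed")

	// ...and the resource manager runs the registered enforcement hooks.
	for _, ctrl := range controllers {
		if err := ctrl.PostUpdate(c); err != nil {
			fmt.Printf("controller %s failed: %v\n", ctrl.Name(), err)
		}
	}
}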

cpuallocator: improve allocation heuristics

Bring more NUMA-awareness to the cpuallocator (implemented in pkg/cpuallocator/). Discussing/reviewing the cpu allocation logic with @klihub we realized that the allocator is too simple, resulting in clearly non-optimal results. This concerns especially takeIdleCores(), which should (we think) try to more aggressively and intelligently pack workloads in a topology-aware manner.

The cpuallocator would need to be improved with additional tightest-fit allocation rules beyond the current topology socket/core/thread hierarchy to get to a more realistic socket/die/NUMA node/core/thread hierarchy (a rough sketch of the intended ordering follows the list below):

  • try allocating a full die if the number of requested cpus matches exactly
  • try allocating a full NUMA node if the number of cpus matches exactly
  • only then fall back to allocating mere full cores or threads, and with these as well
    • try taking sub-NUMA node number of cores/threads from a single NUMA node,
    • try taking sub-die number of cores/threads from a single die
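
A rough Go sketch of that ordering; the topology representation and helper names are made up for illustration and do not match the actual pkg/cpuallocator API:

package main

import "fmt"

// node is a hypothetical topology unit (die, NUMA node, ...) with the
// number of free CPUs it contains.
type node struct {
	kind string
	free int
}

// pickExact returns a unit of the given kind whose free CPU count
// matches the request exactly, if one exists.
func pickExact(units []node, kind string, want int) (node, bool) {
	for _, u := range units {
		if u.kind == kind && u.free == want {
			return u, true
		}
	}
	return node{}, false
}

// allocate illustrates the proposed ordering: full die, then full NUMA
// node, then fall back to cores/threads from a single NUMA node or die.
func allocate(units []node, want int) string {
	if u, ok := pickExact(units, "die", want); ok {
		return fmt.Sprintf("take full die (%d cpus)", u.free)
	}
	if u, ok := pickExact(units, "numa", want); ok {
		return fmt.Sprintf("take full NUMA node (%d cpus)", u.free)
	}
	// Fallback: prefer fitting inside a single NUMA node, then a single die.
	for _, kind := range []string{"numa", "die"} {
		for _, u := range units {
			if u.kind == kind && u.free >= want {
				return fmt.Sprintf("take %d cpus from one %s", want, kind)
			}
		}
	}
	return "no tight fit found"
}

func main() {
	topology := []node{
		{kind: "die", free: 16},
		{kind: "numa", free: 8},
		{kind: "numa", free: 8},
	}
	fmt.Println(allocate(topology, 8))  // exact NUMA node fit
	fmt.Println(allocate(topology, 16)) // exact die fit
	fmt.Println(allocate(topology, 4))  // sub-NUMA fallback
}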

AVX512: use annotations for workload placement

Being able to utilize object metadata information (e.g., from CRI) to do proper avx512 workload placement in advance could save us from doing avx512 activity monitoring + rebalancing.

Challenges: NFD labels, for instance, are node affinity labels and are not visible in pod metadata.

ConfigMap node overrides for all options.

Currently, the configuration allows per-node overrides only for policy-specific configuration. We need this generally for all configuration, not just policies.

Cache Metrics

CRI-RM must provide data on cache hits and misses.

Properly detect and act on container crashes.

We used to piggyback on the ListContainers requests periodically sent by kubelet to detect state changes of containers. We stopped doing that at one point and, as a side-effect, we now don't always properly detect when a container crashes. This eventually leads to resources being allocated to crashed containers and new container creation requests failing with insufficient resources.

I think one way to trigger/test this is to run a container that constantly leaks memory, with a memory limit set, until the OOM-killer decides to kill it. At this point, if the container's CPU request is more than the amount of remaining free CPU when the container was still running, the kubelet's attempt to (re)create the crashed container will fail with an error about insufficient available CPU.

go get fails to import cri-resource-manager

Here is how it happens:

$ go get github.com/intel/cri-resource-manager/pkg/sysfs
go: finding github.com/intel/cri-resource-manager v0.2.0
go: finding github.com/intel/cri-resource-manager/pkg latest
go: finding github.com/intel/cri-resource-manager/pkg/sysfs latest
go: finding github.com/intel/cri-resource-manager v0.2.0
go: downloading github.com/intel/cri-resource-manager v0.2.0
verifying github.com/intel/cri-resource-manager@v0.2.0: github.com/intel/cri-resource-manager@v0.2.0: reading https://sum.golang.org/lookup/github.com/intel/cri-resource-manager@v0.2.0: 410 Gone

The actual error is this:

not found: unzip /tmp/gopath/pkg/mod/cache/download/github.com/intel/cri-resource-manager/@v/v0.2.0.zip: malformed file path "pkg/sysfs/testdata/sys/devices/pci0000:00/0000:00:02.0/class": invalid char ':'

I'll try to come up with a fix soon.

RDT: non-overlapping cache allocations

Currently, there is no way of specifying non-overlapping L3 schemas using the relative (percentage) notation. We might want to specify something like this (using the percentage notation):

L3:          XXXXXXXXXXXXXXXXXXXX
Guaranteed:  xxxxxxxxxx
Burstable              xxxxxxxxxx
Besteffort                  xxxxx

For this we would need both exclusive non-overlapping, and, non-exclusive overlapping definitions at the same time. One way to achieve this would be to add one extra level of hierarchy (i.e. groups) to the class configuration. Cache resources could not overlap on the group level. The scheme described above could be achieved with a configuration something like this:

resctrlGroups:
  exclusive:
    l3allocation: "50%"
    classes:
      Guaranteed:
        l3schema:
          default: "100%"
  shared:
    l3allocation: "50%"
    classes:
      Burstable:
        l3schema:
          default: "100%"
      BestEffort:
        l3schema:
          default: "50%"

In this scheme, memory bandwidth (MB) would probably allow over-committing, i.e. each group could have e.g. 100%.

Default blockio/RDT class from pod/container annotations

There should be a way to set the default blockio and/or RDT class on per pod/container level through annotations.

  • the mechanism should set the default class of the container
  • it should be a generic (not policy specific) mechanism
  • there should be configuration options to specify the annotation(s) to watch (see the sketch below)
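
A minimal Go sketch of resolving a per-container default class from a configurable annotation prefix; the annotation keys used here (rdtclass.example.io/pod, rdtclass.example.io/container.<name>) are purely hypothetical, not the project's actual annotations:

package main

import "fmt"

// classFromAnnotations resolves the default class for a container.
// A container-scoped annotation ("<prefix>/container.<name>") wins over
// the pod-scoped one ("<prefix>/pod"); both patterns are assumptions.
func classFromAnnotations(annotations map[string]string, prefix, container string) (string, bool) {
	if v, ok := annotations[prefix+"/container."+container]; ok {
		return v, true
	}
	if v, ok := annotations[prefix+"/pod"]; ok {
		return v, true
	}
	return "", false
}

func main() {
	// The annotation prefix to watch would come from configuration.
	prefix := "rdtclass.example.io"

	podAnnotations := map[string]string{
		"rdtclass.example.io/pod":            "Burstable",
		"rdtclass.example.io/container.main": "Guaranteed",
	}

	for _, name := range []string{"main", "sidecar"} {
		if class, ok := classFromAnnotations(podAnnotations, prefix, name); ok {
			fmt.Printf("container %s -> default RDT class %q\n", name, class)
		}
	}
}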

/var/run/cri-resmgr must be created before running cri-resmgr

When launching cri-resmgr after a reboot, it fails to run:

D: [   cri/server   ] waiting for server to become ready...                                                           
E: [   cri-resmgr   ] failed to start resource manager: resource-manager: failed to start configuration server: failed to listen to socket: listen unix /var/run/cri-resmgr/cri-resmgr-config.sock: bind: no such file or directory

but works after sudo mkdir /var/run/cri-resmgr.
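
One straightforward fix sketch (not necessarily how the project will end up solving it): create the socket directory before binding, since /var/run is typically a tmpfs and therefore empty after every reboot. The socket path below comes from the error message; the helper name is illustrative:

package main

import (
	"log"
	"net"
	"os"
	"path/filepath"
)

func listenUnix(socketPath string) (net.Listener, error) {
	// Make sure the parent directory (e.g. /var/run/cri-resmgr) exists.
	if err := os.MkdirAll(filepath.Dir(socketPath), 0o755); err != nil {
		return nil, err
	}
	// Remove a stale socket left behind by a previous run, if any.
	os.Remove(socketPath)
	return net.Listen("unix", socketPath)
}

func main() {
	l, err := listenUnix("/var/run/cri-resmgr/cri-resmgr-config.sock")
	if err != nil {
		log.Fatalf("failed to listen to socket: %v", err)
	}
	defer l.Close()
	log.Printf("listening on %s", l.Addr())
}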

avx512 eBPF elf vs kernel version checking

We need to re-evaluate the kernel version vs. eBPF ELF version checking.

Let's not check for an exact match but instead build in a 'host kernel needs to be at least version X' value and accept any host kernel >= that version.
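
A hedged sketch of such an "at least version X" check, reading the running kernel version from /proc/sys/kernel/osrelease; the minimum version used in main() is only an example value, not the real requirement:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether the running kernel is at least major.minor.
func kernelAtLeast(major, minor int) (bool, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		return false, err
	}
	// osrelease looks like "5.15.0-91-generic"; keep only major.minor.
	parts := strings.SplitN(strings.TrimSpace(string(raw)), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel version %q", raw)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	min, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	return maj > major || (maj == major && min >= minor), nil
}

func main() {
	// Example only: the real minimum would be whatever the eBPF ELF requires.
	ok, err := kernelAtLeast(4, 18)
	if err != nil {
		fmt.Println("version check failed:", err)
		return
	}
	fmt.Println("kernel is recent enough:", ok)
}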

RDT: support for CDP

Support CDP (code and data prioritization) schemas (i.e. separate L3data and L3code).

policy-level tri-state (on/off/auto) knob for rdt

Per-policy control over whether RDT must be on, should be turned off, or is automatically taken into use if it is available. The policy MUST fail to start if RDT cannot be initialized according to this configuration (in other words, if it must be on but is not available).
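
A minimal sketch of how such a tri-state knob could look in Go; the option name and accepted values are assumptions for illustration:

package main

import (
	"errors"
	"fmt"
)

// RdtMode is a hypothetical tri-state policy option for RDT.
type RdtMode string

const (
	RdtOn   RdtMode = "on"   // RDT must be available, otherwise fail
	RdtOff  RdtMode = "off"  // never use RDT
	RdtAuto RdtMode = "auto" // use RDT if available, otherwise continue
)

// checkRdt decides whether the policy may start, given the configured
// mode and whether RDT was successfully initialized.
func checkRdt(mode RdtMode, rdtAvailable bool) (useRdt bool, err error) {
	switch mode {
	case RdtOff:
		return false, nil
	case RdtAuto:
		return rdtAvailable, nil
	case RdtOn:
		if !rdtAvailable {
			return false, errors.New("rdt: required by policy but not available")
		}
		return true, nil
	default:
		return false, fmt.Errorf("invalid rdt mode %q", mode)
	}
}

func main() {
	for _, mode := range []RdtMode{RdtOn, RdtOff, RdtAuto} {
		use, err := checkRdt(mode, false) // pretend RDT is unavailable
		fmt.Printf("mode=%s use=%v err=%v\n", mode, use, err)
	}
}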

pkg/metrics to Prometheus

We want to expose the metrics gathered by collectors registered to pkg/metrics on the CRI-RM Prometheus HTTP endpoint.
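
A minimal sketch of serving collectors from a dedicated registry over an HTTP endpoint with the standard prometheus/client_golang library; the metric name and port are placeholders, and the real pkg/metrics wiring will differ:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// A dedicated registry to which pkg/metrics collectors would be added.
	registry := prometheus.NewRegistry()

	// Placeholder collector standing in for the real pkg/metrics collectors.
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "cri_resmgr_example_requests_total",
		Help: "Example counter exposed through the metrics endpoint.",
	})
	registry.MustRegister(requests)
	requests.Inc()

	// Expose everything in the registry on /metrics.
	http.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8891", nil))
}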

Default config startup failed in case RDT is enabled only with MBM

Supported schemata:

$ cat /sys/fs/resctrl/schemata 
MB:0=100;1=100;2=100;3=100
$
I: [resource-manager] using auto-discovered resctrl-path "/sys/fs/resctrl"
D: [       rdt      ] writing schemata "L3:0=1;1=1;2=1;3=1\nMB:0=100;1=100;2=100;3=100\n"
E: [   cri-resmgr   ] failed to create resource manager instance: resource-manager: failed to create resource manager: rdt: configuration failed: write /sys/fs/resctrl/cri-resmgr.BestEffort/schemata: invalid argument

Documentation about how to start cri-resmgr is outdated

Top-level README.md refers to non-existent cri-resmgr command line flags:

For full message dumping you start the CRI relay like this:

  ./cmd/cri-resmgr/cri-resmgr -policy null -dump 'reset,full:.*' -dump-file /tmp/cri.dump

Same for topology-aware policy README.md:

You can activate the topology-aware policy by setting the --policy option of cri-resmgr to topology-aware. For instance like this:

cri-resmgr --policy topology-aware --reserved-resources cpu=750m

The documentation should be updated to refer to the current method for configuring the server.

metrics: relay correlated metrics to Prometheus

Collected raw metrics are already relayed to Prometheus. Rebalancing will probably produce a number of high-level, domain-specific metrics, more suitable for decision making within cri-resmgr. These metrics should be exported to Prometheus as well.
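
A hedged sketch of exporting one such derived, domain-specific metric alongside the raw ones; the metric name, label and value here are made up for illustration:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// colocationPressure is a hypothetical high-level metric derived from raw
// per-container samples, intended to drive rebalancing decisions.
var colocationPressure = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cri_resmgr_colocation_pressure",
		Help: "Derived pressure score per topology zone (hypothetical).",
	},
	[]string{"zone"},
)

func main() {
	prometheus.MustRegister(colocationPressure)

	// In cri-resmgr this would be computed from correlated raw metrics;
	// here we just set an example value for one zone.
	colocationPressure.WithLabelValues("numa0").Set(0.42)

	fmt.Println("derived metric registered and updated")
}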

RDT Metrics

CRI-RM must provide data on:

  • memory bandwidth
  • cache occupancy

Fix broken native resource requests/limits estimation heuristics.

Our native resource requests/limits estimation heuristics are now broken; the worst offender is the memory request estimation, which is really badly off.

The algorithm should take into account the pod's/container's apparent QoS class, which can be deduced from the pod's cgroup parent path. Fix/improve the estimation to correctly handle the straightforward cases (Guaranteed, BestEffort QoS classes) and work reasonably for the rest (Burstable QoS class).
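
A rough sketch of deducing the apparent QoS class from the cgroup parent path, assuming the usual kubelet layout (besteffort and burstable pods under their own sub-hierarchies, guaranteed pods directly under kubepods):

package main

import (
	"fmt"
	"strings"
)

// qosFromCgroupParent deduces the apparent QoS class of a pod from its
// cgroup parent path, assuming the usual kubelet cgroup layout.
func qosFromCgroupParent(cgroupParent string) string {
	path := strings.ToLower(cgroupParent)
	switch {
	case strings.Contains(path, "besteffort"):
		return "BestEffort"
	case strings.Contains(path, "burstable"):
		return "Burstable"
	default:
		// Guaranteed pods live directly under the kubepods slice/directory.
		return "Guaranteed"
	}
}

func main() {
	for _, p := range []string{
		"/kubepods/besteffort/pod1234",
		"/kubepods/burstable/pod5678",
		"/kubepods/pod9abc",
	} {
		fmt.Printf("%-35s -> %s\n", p, qosFromCgroupParent(p))
	}
}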

Documentation overhaul

We need to go through all the documentation, restructure it and supplement missing areas, before the next release. Motivated by #171

Backlog (please append new items as you encounter areas to fix):

  • #321 README should provide references to all the sub-documents
  • #320 update RDT documentation, better describing partitions, classes, configuration, exclusivity etc.
  • #121 (Documentation: “how to write policies”)
  • #159 (Documentation about how to start cri-resmgr is outdated)
  • #171 (Readme Improvements)
  • #318 document (per policy) handling of isolcpus
