intel / cri-resource-manager

Kubernetes Container Runtime Interface proxy service with hardware resource aware workload placement policies.

License: Apache License 2.0
Prototype WCA integration and required interfaces.
Depends: #63
Add timer-based avxcollector for AVX512 statistics.
Add container affinity/anti-affinity support to topology-aware policy.
Add parallel processing for non-CRI events to enable, e.g., timer-based metrics collection.
I'm going through the README for the first time and I have a few suggestions for improvements:
I would not put the CRI message dumper at the top of the README. This makes it feel like it is the main use case for the project, while it's not. IMHO it's an interesting marginal use of this project and should be mentioned in the README, but not as one of the first things people will read.
How can I actually deploy cri-resource-manager? I was expecting a few practical, complete examples of how I could deploy, install and run cri-resource-manager. I am not sure if I have to build a webhook image, whether I need to start the crm agent or not, etc. A few sequential steps on how to kick things off would already be quite beneficial, but something like a container-based deployment tool or a kubectl apply -f https://github.com/intel/cri-resource-manager/deploy/crm.yaml command line would be even better.
I am also missing an overall picture of the project and how it interacts with Kubernetes, kubelet and CRI-O/containerd. A high-level, illustrated, one-paragraph architecture overview would help.
I hope this helps, I feel this project is very valuable for the k8s ecosystem and I would love to see the README make it shine a little bit more ;-)
It is quite clear how to deal with configuration delivery errors: keep retrying indefinitely and preferably propagate the fact of failed configuration back to the source, IOW to the ConfigMap.
It is not clear, however, how to deal with application-level configuration errors. The agent has no knowledge of the application/configuration semantics, nor is it in control of, or even notified about, system configuration changes.
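As a sketch of the delivery-error side (propagating failure back to the ConfigMap), the agent could record the failure on the ConfigMap itself, e.g. via an annotation. The annotation key and helper below are hypothetical, not an existing interface:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// reportConfigError patches a (hypothetical) status annotation onto the
// source ConfigMap so the failure is visible where the config came from.
func reportConfigError(cs kubernetes.Interface, ns, name string, cfgErr error) error {
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"cri-resource-manager.intel.com/config-status":%q}}}`,
		cfgErr.Error()))
	_, err := cs.CoreV1().ConfigMaps(ns).Patch(
		context.TODO(), name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	// The agent's config watch loop would call, e.g.:
	// reportConfigError(clientset, "kube-system", "cri-resmgr-config", err)
}
```

Application-level errors remain the hard part: only the policy code can detect them, so they would have to be reported back through the same channel after the configuration has already been accepted for delivery.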
Inspired by #14 (review)
There would need to be a standard way of adding out-of-band (i.e. outside of CRI) resource control enforcement points, such as RDT or blockio (or other) cgroup controls. Some design goals:
One possible scheme would be utilizing cache.Container (and/or Pod) for storing the desired state, and resource-manager level callbacks for enforcing the desired state(s). E.g. in the case of RDT there would be Container.SetRdtClass() for setting/storing the desired RDT class, and a separate callback in resource-manager (registered by the RDT handler) would do the enforcement.
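A rough sketch of this scheme, using hypothetical types and names modeled on the description above (this is not the actual cache or resource-manager API):

```go
package main

import "fmt"

// Container mimics a cache.Container entry that stores desired,
// not-yet-enforced state.
type Container struct {
	name     string
	rdtClass string
}

// SetRdtClass only records the desired RDT class; enforcement happens
// later through a resource-manager level callback.
func (c *Container) SetRdtClass(class string) {
	c.rdtClass = class
}

// Controller is an out-of-band enforcement point (RDT, blockio, ...)
// registered with the resource manager.
type Controller interface {
	Enforce(c *Container) error
}

type rdtController struct{}

func (r *rdtController) Enforce(c *Container) error {
	// A real implementation would move the container's tasks into the
	// resctrl group corresponding to c.rdtClass.
	fmt.Printf("enforcing RDT class %q for %s\n", c.rdtClass, c.name)
	return nil
}

func main() {
	controllers := []Controller{&rdtController{}} // RDT handler registration

	c := &Container{name: "pod0:ctr0"}
	c.SetRdtClass("Guaranteed") // policy stores the desired state

	for _, ctl := range controllers { // resource manager enforces it
		_ = ctl.Enforce(c)
	}
}
```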
Through k8s Extended Resources, for example
Bring more NUMA-awareness to the cpuallocator (implemented in pkg/cpuallocator/). Discussing/reviewing the CPU allocation logic with @klihub, we realized that the allocator is too simple, resulting in clearly non-optimal results. This concerns especially takeIdleCores(), which should (we think) try to pack workloads more aggressively and intelligently in a topology-aware manner.
The cpuallocator would need to be improved with additional tightest-fit allocation rules, going beyond the current topology socket/core/thread hierarchy to a more realistic socket/die/NUMA node/core/thread hierarchy.
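A minimal sketch of a tightest-fit rule over such a hierarchy; the topoNode type and tightestFit() function are illustrative, not the actual pkg/cpuallocator code:

```go
package main

import "fmt"

// topoNode is one allocation unit in a socket/die/NUMA node/core/thread tree.
type topoNode struct {
	name     string
	freeCPUs int
	children []*topoNode
}

// tightestFit returns the smallest unit that can still satisfy the request,
// packing workloads densely instead of spreading them out.
func tightestFit(n *topoNode, request int) *topoNode {
	if n.freeCPUs < request {
		return nil
	}
	best := n
	for _, c := range n.children {
		if fit := tightestFit(c, request); fit != nil && fit.freeCPUs < best.freeCPUs {
			best = fit
		}
	}
	return best
}

func main() {
	numa0 := &topoNode{name: "numa0", freeCPUs: 4}
	numa1 := &topoNode{name: "numa1", freeCPUs: 16}
	socket := &topoNode{name: "socket0", freeCPUs: 20, children: []*topoNode{numa0, numa1}}

	// numa0 is chosen: it is the tightest unit that still fits 3 CPUs.
	fmt.Println(tightestFit(socket, 3).name)
}
```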
- Define and implement a dynamic rebalancing cost estimate
- Common mechanism for policies for dynamic rebalancing
Add functionality to detect clocked CPUs using SpeedSelect Base Frequency
Provide data for an external visualization tool. A heatmap is a good candidate: AVX, SpeedSelect, RDT, workload assignments?
Being able to utilize object metadata information (e.g., from CRI) to do proper avx512 workload placement in advance could save us from doing avx512 activity monitoring + rebalancing.
Challenges: NFD labels, for instance, are node affinity labels and not visible in pod metadata.
Externally pluggable policy, WCA integration.
Currently, per-node configuration is possible only for policy-specific configuration. We need this generally for all configuration, not just policies.
Add functionality to detect clocked CPUs using SpeedSelect:
- PP
Prototype a complex NUMA hierarchy.
CRI-RM must provide data on cache hits and misses.
We used to piggyback on the ListContainers requests periodically sent by kubelet to detect state changes of containers. We stopped doing that at one point and, as a side effect, we now don't always properly detect when a container crashes. This eventually leads to resources staying allocated to crashed containers and to new container creation requests failing with insufficient resources.
I think one way to trigger/test this is to run a container that constantly leaks memory, with a memory limit set, until the OOM-killer decides to kill it. At this point, if the container's CPU request is more than the amount of remaining free CPU when the container was still running, the kubelet's attempt to (re)create the crashed container will fail with an error about insufficient available CPU.
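A hypothetical reproducer manifest along these lines (the image, command and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-crasher
spec:
  restartPolicy: Always        # kubelet will try to re-create the container
  containers:
  - name: leaker
    image: python:3-alpine
    # Keep appending 1 MiB chunks until the OOM-killer steps in.
    command: ["python3", "-c", "b = []\nwhile True: b.append(' ' * 1048576)"]
    resources:
      requests:
        cpu: "2"               # large enough that re-creation can fail
        memory: "64Mi"
      limits:
        memory: "64Mi"         # triggers the OOM kill
```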
Here is how it happens:
$ go get github.com/intel/cri-resource-manager/pkg/sysfs
go: finding github.com/intel/cri-resource-manager v0.2.0
go: finding github.com/intel/cri-resource-manager/pkg latest
go: finding github.com/intel/cri-resource-manager/pkg/sysfs latest
go: finding github.com/intel/cri-resource-manager v0.2.0
go: downloading github.com/intel/cri-resource-manager v0.2.0
verifying github.com/intel/cri-resource-manager@v0.2.0: github.com/intel/cri-resource-manager@v0.2.0: reading https://sum.golang.org/lookup/github.com/intel/cri-resource-manager@v0.2.0: 410 Gone
The actual error is this:
not found: unzip /tmp/gopath/pkg/mod/cache/download/github.com/intel/cri-resource-manager/@v/v0.2.0.zip: malformed file path "pkg/sysfs/testdata/sys/devices/pci0000:00/0000:00:02.0/class": invalid char ':'
I'll try to come up with a fix soon.
Currently, there is no way of specifying non-overlapping L3 schemas using the relative (percentage) notation. We might want to specify something like this (using the percentage notation):
L3:         XXXXXXXXXXXXXXXXXXXX
Guaranteed: xxxxxxxxxx
Burstable:            xxxxxxxxxx
BestEffort:           xxxxx
For this we would need both exclusive (non-overlapping) and non-exclusive (overlapping) definitions at the same time. One way to achieve this would be to add one extra level of hierarchy (i.e. groups) to the class configuration. Cache resources could not overlap on the group level. The scheme described above could be achieved with a configuration something like this:
```yaml
resctrlGroups:
  exclusive:
    l3allocation: "50%"
    classes:
      Guaranteed:
        l3schema:
          default: "100%"
  shared:
    l3allocation: "50%"
    classes:
      Burstable:
        l3schema:
          default: "100%"
      BestEffort:
        l3schema:
          default: "50%"
```
In this scheme, Memory Bandwidth (MB) would probably allow over-committing i.e. each group could have e.g. 100%.
It might be useful for the user to be able to configure the amount of cache allocation in kiB/MiB in addition to percentages or absolute bitmasks.
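For illustration only, such a notation might look like this (not an implemented syntax):

```yaml
resctrlGroups:
  exclusive:
    # hypothetical absolute-size notation, presumably rounded
    # to a whole number of cache ways
    l3allocation: "8MiB"
```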
We want to support MB correctly when resctrl is mounted with the -o mba_MBps option.
There should be a way to set the default blockio and/or RDT class on per pod/container level through annotations.
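For example, something along these lines; the annotation keys here are made up for illustration, not an existing API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  annotations:
    # hypothetical per-pod default class annotations
    rdt.cri-resource-manager.intel.com/class: Guaranteed
    blockio.cri-resource-manager.intel.com/class: HighPrio
spec:
  containers:
  - name: ctr0
    image: busybox
```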
When launching cri-resmgr after a reboot, it fails to run:

D: [ cri/server ] waiting for server to become ready...
E: [ cri-resmgr ] failed to start resource manager: resource-manager: failed to start configuration server: failed to listen to socket: listen unix /var/run/cri-resmgr/cri-resmgr-config.sock: bind: no such file or directory

but works after sudo mkdir /var/run/cri-resmgr.
It should not be possible to run two (conflicting) instances of cri-resmgr on the same node.
We need to re-evaluate the kernel version vs. eBPF ELF version checking.
Let's not check for an exact match, but build in a 'host kernel needs to be at least version X' value and accept any host kernel >= that version.
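A minimal sketch of the relaxed check, assuming the minimum version is built into the binary and the host kernel version is parsed elsewhere (e.g. from uname):

```go
package main

import "fmt"

type kernelVersion struct{ major, minor, patch int }

// atLeast reports whether v >= min, compared component by component.
func (v kernelVersion) atLeast(min kernelVersion) bool {
	if v.major != min.major {
		return v.major > min.major
	}
	if v.minor != min.minor {
		return v.minor > min.minor
	}
	return v.patch >= min.patch
}

func main() {
	minRequired := kernelVersion{4, 20, 0} // built into the eBPF ELF/binary
	host := kernelVersion{5, 4, 0}         // parsed from the running kernel
	fmt.Println(host.atLeast(minRequired)) // true: any kernel >= 4.20 is accepted
}
```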
Get improved cache metrics from HW perf counters, e.g. L1 and LLC cache misses.
Currently there is an extra line in go.mod; it needs to be removed.
Support CDP (code and data prioritization) schemas (i.e. separate L3data and L3code).
Per-policy control over whether RDT must be on, should be turned off, or is automatically taken into use if available. A policy MUST fail to start if RDT cannot be initialized according to this configuration (IOW, if it must be on but is not available).
We want to provide the metrics collected by collectors registered to pkg/metrics on the CRI-RM Prometheus HTTP endpoint. To better distinguish CRI-RM's own metrics from the collectors' metrics, use prometheus.BuildFQName().
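prometheus.BuildFQName() joins namespace, subsystem and name with underscores; the namespace/subsystem strings below are illustrative:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Yields "cri_resmgr_policy_rebalance_count", which makes CRI-RM's own
	// metrics easy to tell apart from collector-provided ones.
	fmt.Println(prometheus.BuildFQName("cri_resmgr", "policy", "rebalance_count"))
}
```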
Supported schemata:
$ cat /sys/fs/resctrl/schemata
MB:0=100;1=100;2=100;3=100
$
I: [resource-manager] using auto-discovered resctrl-path "/sys/fs/resctrl"
D: [ rdt ] writing schemata "L3:0=1;1=1;2=1;3=1\nMB:0=100;1=100;2=100;3=100\n"
E: [ cri-resmgr ] failed to create resource manager instance: resource-manager: failed to create resource manager: rdt: configuration failed: write /sys/fs/resctrl/cri-resmgr.BestEffort/schemata: invalid argument
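One possible direction for a fix, sketched assuming the standard resctrl layout where /sys/fs/resctrl/info has one subdirectory per supported resource: probe the info directory and drop unsupported resources (here, L3) from the schemata before writing them:

```go
package main

import (
	"fmt"
	"os"
)

// supportedResources lists the resources (e.g. "L3", "MB") the kernel
// exposes under <resctrl>/info.
func supportedResources(resctrlPath string) map[string]bool {
	supported := map[string]bool{}
	entries, err := os.ReadDir(resctrlPath + "/info")
	if err != nil {
		return supported
	}
	for _, e := range entries {
		if e.IsDir() {
			supported[e.Name()] = true
		}
	}
	return supported
}

func main() {
	res := supportedResources("/sys/fs/resctrl")
	if !res["L3"] {
		fmt.Println("omitting L3 lines from schemata: not supported on this system")
	}
}
```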
Run CRI-validation + node e2e + sonobuoy
Complete required release processes
Target: 2020-05-08
Top-level README.md refers to non-existent cri-resmgr command line flags:
For full message dumping you start the CRI relay like this:
./cmd/cri-resmgr/cri-resmgr -policy null -dump 'reset,full:.*' -dump-file /tmp/cri.dump
Same for the topology-aware policy README.md:

You can activate the topology-aware policy by setting the --policy option of cri-resmgr to topology-aware. For instance like this:
cri-resmgr --policy topology-aware --reserved-resources cpu=750m
The documentation should be updated to refer to the current method for configuring the server.
Make it possible for the user to configure the "root" resctrl group (usually /sys/fs/resctrl/schemata) with CRI-RM.
Collected raw metrics are already relayed to Prometheus. Rebalancing will probably produce a number of high-level, domain-specific metrics, more suitable for decision making within cri-resmgr. These metrics should be exported to Prometheus as well.
Change RDT class configuration to allow soft (preferred) requirements for RDT features (e.g. enforce MB if supported by the system).
CRI-RM must provide data on:
- memory bandwidth
- cache occupancy
Our native resource request/limit estimation heuristics are now hosed. The worst offense is the memory request estimation; it is really badly broken.
The algorithm should take into account the pod's/container's apparent QoS class, which can be deduced from the pod's cgroup parent path. Fix/improve the estimation to correctly handle the straightforward cases (the Guaranteed and BestEffort QoS classes) and to work reasonably for the rest (the Burstable QoS class).
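A sketch of the QoS deduction step, assuming the default kubelet cgroup layout where the parent path contains "besteffort" or "burstable" for those classes:

```go
package main

import (
	"fmt"
	"strings"
)

// qosClassFromCgroupParent deduces the apparent QoS class from the pod's
// cgroup parent path (e.g. "/kubepods/burstable/pod<uid>").
func qosClassFromCgroupParent(path string) string {
	p := strings.ToLower(path)
	switch {
	case strings.Contains(p, "besteffort"):
		return "BestEffort"
	case strings.Contains(p, "burstable"):
		return "Burstable"
	default:
		// Guaranteed pods live directly under the kubepods slice.
		return "Guaranteed"
	}
}

func main() {
	fmt.Println(qosClassFromCgroupParent("/kubepods/burstable/podf00")) // Burstable
}
```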
We need to go through all the documentation, restructure it and supplement missing areas, before the next release. Motivated by #171
Backlog (please append new items as you encounter areas to fix):
Currently, new configuration is applied only to containers created after the configuration has become effective.