Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

License: Other

caelus's Introduction


Caelus is a set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs. These resources come from the underutilization of online jobs, especially during low-traffic periods. To make batch jobs compatible with online jobs, Caelus dynamically manages multiple resource isolation mechanisms and checks various metrics for abnormalities. Batch jobs are throttled or even killed if interference is detected.


  • Collect various metrics, including node resources, cgroup resources and online job latency

  • Batch jobs can run on YARN or Kubernetes

  • Predict the total resource usage of the node, including online jobs and kernel modules, such as slab

  • Dynamically manage multiple resource isolation mechanisms, such as CPU, memory, and disk space

  • Dynamically check various metrics for abnormalities, such as CPU usage or online job latency

  • Throttle or even kill batch jobs when resource pressure or a latency spike is detected

  • Prometheus metrics supported

  • Alarm supported
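The throttle-or-kill behavior in the list above can be pictured with a small Go sketch. This is only an illustration under stated assumptions: the `Sample`, `Thresholds`, and `decide` names, the two-level thresholds, and the sample values are all hypothetical and are not taken from the Caelus code or its rule file.

```go
package main

import "fmt"

// Action is the decision the sketch takes for batch jobs.
type Action int

const (
	NoAction Action = iota
	Throttle
	Kill
)

// Sample holds hypothetical metrics for one check interval.
type Sample struct {
	CPUUsage      float64 // node CPU usage as a ratio in [0, 1]
	OnlineLatency float64 // online job latency in milliseconds
}

// Thresholds holds hypothetical warning and critical levels.
type Thresholds struct {
	CPUWarn, CPUCritical         float64
	LatencyWarn, LatencyCritical float64
}

// decide returns Kill when any metric reaches its critical level,
// Throttle when any metric reaches its warning level, and NoAction otherwise.
func decide(s Sample, t Thresholds) Action {
	switch {
	case s.CPUUsage >= t.CPUCritical || s.OnlineLatency >= t.LatencyCritical:
		return Kill
	case s.CPUUsage >= t.CPUWarn || s.OnlineLatency >= t.LatencyWarn:
		return Throttle
	default:
		return NoAction
	}
}

func main() {
	t := Thresholds{CPUWarn: 0.7, CPUCritical: 0.9, LatencyWarn: 100, LatencyCritical: 200}
	samples := []Sample{
		{CPUUsage: 0.5, OnlineLatency: 50},
		{CPUUsage: 0.75, OnlineLatency: 50},
		{CPUUsage: 0.5, OnlineLatency: 250},
	}
	for _, s := range samples {
		fmt.Printf("cpu=%.2f latency=%.0fms action=%d\n", s.CPUUsage, s.OnlineLatency, decide(s, t))
	}
}
```

The sketch checks the critical level first so that a severe spike is never downgraded to a throttle; the real rules are configured in the rule file mentioned in the getting-started section.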


Find more usage at

The project also has two attached tools:


nm_operator executes YARN commands through a remote API.

Getting started


# binary build, which generates binary under _output/bin/
$ make build

# image build
$ make image

# run unit test
$ make test


Caelus should run on a node that also runs the kubelet process. Write the kubelet's "root-dir" value to "kubelet_root_dir" in the config file.

The config file and rule file are only examples; you can add more features to suit your needs.

# run from a script
$ mkdir -p /etc/caelus/
$ cp hack/config/rules.json /etc/caelus/
$ # if the batch jobs run on YARN, set "offline_type" in hack/config/caelus.json to "yarn_on_k8s", then run:
$ caelus --config=hack/config/caelus.json --v=2
$ # if the batch jobs run on K8S, set "offline_type" in hack/config/caelus.json to "k8s", then run:
$ # (when the command runs inside a pod, the kubeconfig parameter can be omitted)
$ caelus --config=hack/config/caelus.json --hostname-override=xxx --v=2 --kubeconfig=xxx

# run in a container
$ # the container parameters and environment variables can be found in hack/yaml/caelus.json, such as:
$ docker run -it --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add MKNOD --cap-add SYS_PTRACE --cap-add SYS_CHROOT --cap-add SYS_NICE -v /:/rootfs -v /sys:/sys -v /dev/disk:/dev/disk /bin/bash

# run on K8S
$ kubectl create -f hack/yaml/caelus.yaml
$ kubectl label node colation=true
$ kubectl -n kube-system get daemonset


You can find more about getting started with Caelus in the DETAIL document.


For more information about contributing issues or pull requests, see our Contributing to Caelus.


Caelus is under the Apache License 2.0. See the License file for details.

caelus's Issues

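Errors after deploying lighthouse and lighthouse-plugin

Both lighthouse and lighthouse-plugin are deployed and the relevant kubelet parameters have been changed, but startup still fails.

kubelet reports directly that it cannot get the docker version.

The lighthouse process also reports errors: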

I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio


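Reading the caelus source code, I found that pkg/caelus/qos/manager/netio/netio.go uses the Linux tc command to limit traffic. As far as I understand, Linux TC can only limit outgoing (egress) traffic. Does this mean caelus can currently only limit the egress traffic of offline jobs?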


Since lighthouse's func (p *offlineMutator) mutate intercepts the request and executes:

newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
newSplits = append(newSplits, splits[2:]...)
newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
newCgroupParent = "/" + newCgroupParent
containerConfig.InnerHostConfig.CgroupParent = newCgroupParent
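
To make the effect of the quoted lines concrete, here is a minimal, self-contained sketch. It assumes that `splits` comes from splitting the original cgroup parent on "/" (so `splits[0]` is the empty string before the leading slash); the function name `insertOfflineLevel` and the sample path are hypothetical and are not taken from the lighthouse code.

```go
package main

import (
	"fmt"
	"strings"
)

// offlineKey mirrors the value noted in the quoted snippet.
const offlineKey = "offline"

// insertOfflineLevel inserts an "offline" level after the first path
// component of a cgroup parent. The quoted code joins with
// filepath.Separator, which is "/" on Linux; "/" is used directly here.
func insertOfflineLevel(cgroupParent string) string {
	// Assumption: splits is produced like this in the original code.
	splits := strings.Split(cgroupParent, "/")
	if len(splits) < 2 {
		// Nothing to rewrite for an empty or relative single-level path.
		return cgroupParent
	}
	newSplits := []string{}
	newSplits = append(newSplits, splits[1], offlineKey)
	newSplits = append(newSplits, splits[2:]...)
	return "/" + strings.Join(newSplits, "/")
}

func main() {
	// Hypothetical cgroup parent, for illustration only.
	fmt.Println(insertOfflineLevel("/kubepods/burstable/pod123"))
}
```

Under that assumption, a parent such as /kubepods/burstable/pod123 becomes /kubepods/offline/burstable/pod123.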


Some Feedback

In PR #37, in the file pkg/caelus/predict/predict_local.go,
mem := math.Max(memStats.UsageRss-memStats.UsageTotal, 0)
mem is always 0. Could you please confirm whether this is expected?
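
A quick way to check the arithmetic, assuming RSS never exceeds total usage (the struct below is a hypothetical stand-in for `memStats`, and the sample values are made up):

```go
package main

import (
	"fmt"
	"math"
)

// memStats is a hypothetical stand-in for the stats value used in
// predict_local.go; only the two fields from the quoted line are modeled.
type memStats struct {
	UsageRss   float64
	UsageTotal float64
}

// quotedExpr reproduces the expression from the quoted line.
func quotedExpr(s memStats) float64 {
	return math.Max(s.UsageRss-s.UsageTotal, 0)
}

func main() {
	// Made-up samples where RSS <= total, as assumed above.
	samples := []memStats{
		{UsageRss: 1 << 30, UsageTotal: 2 << 30},
		{UsageRss: 3 << 30, UsageTotal: 3 << 30},
		{UsageRss: 0, UsageTotal: 5 << 30},
	}
	for _, s := range samples {
		fmt.Println(quotedExpr(s))
	}
}
```

Whenever UsageRss <= UsageTotal, the difference is zero or negative, so math.Max clamps the result to 0 for every sample.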

hadoop version problem

I tried to deploy nm-operator with my Hadoop cluster, which runs version 2.6, but found that the yarn client does not work with the yarn rmadmin -updateNodeResource command. How can I use nm-operator with a lower version of Hadoop?


lighthouse make rpm fails with an error

/caelus/contrib/lighthouse-plugin$ make rpm
Sending build context to Docker daemon 137.7kB
Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
make: *** [rpm] Error 1


Where is the offline scheduler? Do new resource types such as colocation/cpu and colocation/memory need to be registered through the kubelet device plugin?


systemctl status lighthouse.service
● lighthouse.service - Lighthouse server
Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2022-08-29 18:09:16 CST; 9min ago
Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
Main PID: 57742 (code=exited, status=255)

Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
Aug 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "" in scheme "pkg/runtime/scheme.go:101"
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
Aug 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.

ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory

Hi, I would like to ask for help. When I deploy caelus on k8s, the caelus pods show the following logs:
I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

Did I miss some steps?

Question about metrics collection. The documentation mentions:
Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server.
However, the code does not seem to contain anything for "kernel metrics from eBPF".





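  1. Judging from how the code uses it, does YARN have to run on top of K8S?
  2. Coordination such as Predict currently works only per node. Is there no resource scheduling across the whole cluster?
  3. I hope you can provide a complete, step-by-step README for beginners. Many thanks.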

