tencent / caelus

Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs

License: Other

caelus's Introduction

Caelus

Caelus is a set of Kubernetes solutions for reusing idle node resources by running extra batch jobs. These resources come from the underutilization of online jobs, especially during low-traffic periods. To let batch jobs coexist with online jobs, Caelus dynamically manages multiple resource isolation mechanisms and checks various metrics for abnormalities. Batch jobs are throttled or even killed if interference is detected.

Features

  • Collect various metrics, including node resources, cgroup resources, and online job latency

  • Batch jobs can run on YARN or Kubernetes

  • Predict the total resource usage of the node, including online jobs and kernel components such as slab

  • Dynamically manage multiple resource isolation mechanisms, such as CPU, memory, and disk space

  • Dynamically check various metrics for abnormalities, such as CPU usage or online job latency

  • Throttle or even kill batch jobs when resource pressure or a latency spike is detected

  • Prometheus metrics supported

  • Alarm supported

Usage

Find more usage details in Tutorial.md. The project also has two attached tools:

nm_operator

nm_operator executes YARN commands through a remote API.

Getting started

Build

# binary build, which generates binary under _output/bin/
$ make build

# image build
$ make image

# run unit test
$ make test

Run

Caelus should preferably run on the same node as the kubelet process. Write the kubelet's "root-dir" value to "kubelet_root_dir" in the config file.

The config file and rule file are only examples; you can enable more features to suit your needs.
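To make the "root-dir" step concrete, here is a minimal Go sketch that reads the value from a kubelet command line. The function name `kubeletRootDir` and the sample arguments are hypothetical and are not part of Caelus. The fallback `/var/lib/kubelet` is the kubelet's documented default for `--root-dir`. The sketch only inspects command-line arguments, so it assumes the value is not set through some other mechanism.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultKubeletRootDir is the kubelet's default when --root-dir is not given.
const defaultKubeletRootDir = "/var/lib/kubelet"

// kubeletRootDir (hypothetical helper) scans kubelet command-line arguments
// for --root-dir, accepting both "--root-dir=/path" and "--root-dir /path".
func kubeletRootDir(args []string) string {
	const flagPrefix = "--root-dir="
	for i, arg := range args {
		if strings.HasPrefix(arg, flagPrefix) {
			return strings.TrimPrefix(arg, flagPrefix)
		}
		if arg == "--root-dir" && i+1 < len(args) {
			return args[i+1]
		}
	}
	return defaultKubeletRootDir
}

func main() {
	// Sample arguments for illustration only.
	args := []string{"/usr/bin/kubelet", "--root-dir=/data", "--v=2"}
	fmt.Println(kubeletRootDir(args)) // prints "/data"
}
```

The printed value is what would go into "kubelet_root_dir" in the config file.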

# run from a script
$ mkdir -p /etc/caelus/
$ cp hack/config/rules.json /etc/caelus/
# if the batch job runs on YARN, set "offline_type" in hack/config/caelus.json to "yarn_on_k8s", then run:
$ caelus --config=hack/config/caelus.json --v=2
# if the batch job runs on K8S, set "offline_type" in hack/config/caelus.json to "k8s", then run the command below
# if the command runs inside a pod, you can omit the kubeconfig parameter
$ caelus --config=hack/config/caelus.json --hostname-override=xxx --v=2 --kubeconfig=xxx

# run in a container
# the container parameters and environment variables can be found in hack/yaml/caelus.json, for example:
$ docker run -it --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add MKNOD --cap-add SYS_PTRACE --cap-add SYS_CHROOT --cap-add SYS_NICE -v /:/rootfs -v /sys:/sys -v /dev/disk:/dev/disk ccr.ccs.tencentyun.com/caelus/caelus:v1.0.0 /bin/bash

# running on K8S
$ kubectl create -f hack/yaml/caelus.yaml
$ kubectl label node <node-name> colation=true
$ kubectl -n kube-system get daemonset

More

You can find more about getting started with Caelus in the DETAIL document.

Contributing

For more information about contributing issues or pull requests, see Contributing to Caelus.

License

Caelus is under the Apache License 2.0. See the License file for details.

caelus's People

Contributors

chaosju, chenlingpeng, ddongchen, jiwq, mymneo, testwill, threestoneliu, vanient

caelus's Issues

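Errors after deploying lighthouse and lighthouse-plugin

Both lighthouse and lighthouse-plugin are deployed and the relevant kubelet parameters have been changed, but startup still fails.

kubelet reports directly that it cannot get the Docker version.

The lighthouse process also reports errors: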

I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio

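How offline job traffic is limited

Reading the Caelus source, I found that pkg/caelus/qos/manager/netio/netio.go limits traffic by calling the Linux tc command. As I understand it, Linux tc can only limit egress traffic. Does this mean Caelus can currently limit only the egress traffic of offline jobs?

A question about the offline parent cgroup

Given that lighthouse's func (p *offlineMutator) mutate intercepts the request and executes: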

newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
newSplits = append(newSplits, splits[2:]...)
newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
newCgroupParent = "/" + newCgroupParent
containerConfig.InnerHostConfig.CgroupParent = newCgroupParent

After this interception, these tasks should all sit under the offline parent cgroup directory. So why is moveOfflinePidsTogether in qos_k8s.go still needed? Is moveOfflinePidsTogether redundant here?
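The path rewrite quoted in this issue can be reproduced in isolation. The sketch below is a standalone illustration, not the lighthouse implementation. It assumes that `splits` comes from splitting an absolute cgroup parent path on "/", so that `splits[0]` is the empty string before the leading slash. The function name `insertOfflineParent` and the sample path are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// offlineKey matches the value noted in the quoted snippet.
const offlineKey = "offline"

// insertOfflineParent (hypothetical helper) inserts offlineKey after the
// first path component of an absolute cgroup parent path.
func insertOfflineParent(cgroupParent string) string {
	// Assumption: only absolute paths with a first component are rewritten.
	if !strings.HasPrefix(cgroupParent, "/") {
		return cgroupParent
	}
	splits := strings.Split(cgroupParent, "/")
	if len(splits) < 2 || splits[1] == "" {
		return cgroupParent
	}
	newSplits := make([]string, 0, len(splits))
	newSplits = append(newSplits, splits[1], offlineKey)
	newSplits = append(newSplits, splits[2:]...)
	return "/" + strings.Join(newSplits, "/")
}

func main() {
	// Sample path for illustration only.
	fmt.Println(insertOfflineParent("/kubepods/besteffort/pod123"))
	// prints "/kubepods/offline/besteffort/pod123"
}
```

Under these assumptions the rewrite only changes the cgroup parent that new containers are created under. It does not show what moveOfflinePidsTogether does, so it does not answer whether that function is redundant.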

Some Feedback

In PR #37, in the file pkg/caelus/predict/predict_local.go,
mem := math.Max(memStats.UsageRss-memStats.UsageTotal, 0)
mem is always 0. Could you please confirm whether this is expected?
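The observation can be checked with a small standalone Go sketch. It assumes that total memory usage includes RSS, so UsageRss never exceeds UsageTotal; under that assumption the difference is never positive and math.Max clamps it to 0. The `memStats` struct here is a hypothetical stand-in for the real type, and `nonRss` only illustrates the reversed subtraction; it is not a confirmed fix.

```go
package main

import (
	"fmt"
	"math"
)

// memStats is a hypothetical stand-in for the type used in predict_local.go.
type memStats struct {
	UsageRss   float64
	UsageTotal float64
}

// clampedDiff mirrors the expression quoted in the issue.
func clampedDiff(s memStats) float64 {
	return math.Max(s.UsageRss-s.UsageTotal, 0)
}

// nonRss reverses the operands (illustration only): usage not counted as RSS.
func nonRss(s memStats) float64 {
	return math.Max(s.UsageTotal-s.UsageRss, 0)
}

func main() {
	samples := []memStats{
		{UsageRss: 2 << 30, UsageTotal: 3 << 30}, // RSS below total
		{UsageRss: 3 << 30, UsageTotal: 3 << 30}, // RSS equal to total
	}
	for _, s := range samples {
		// clampedDiff is 0 for every sample where UsageRss <= UsageTotal.
		fmt.Println(clampedDiff(s), nonRss(s))
	}
}
```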

hadoop version problem

I tried to deploy nm-operator with my Hadoop cluster, version 2.6, but found that the YARN client does not work with the yarn rmadmin -updateNodeResource command. How can I use nm-operator with a lower version of Hadoop?

/help

lighthouse make rpm fails with an error

/caelus/contrib/lighthouse-plugin$ make rpm
./hack/rpm
Sending build context to Docker daemon 137.7kB
Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
make: *** [rpm] 错误 1

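Where did the offline scheduler go?

Where did the offline scheduler go? Do new hardware resource types such as colocation/cpu and colocation/memory need to be registered through the kubelet device plugin?

Error when running lighthouse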

systemctl status lighthouse.service
● lighthouse.service - Lighthouse server
Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since 一 2022-08-29 18:09:16 CST; 9min ago
Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
Main PID: 57742 (code=exited, status=255)

8月 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
8月 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "lighthouse.io/v1alpha1" in scheme "pkg/runtime/scheme.go:101"
8月 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
8月 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
8月 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
8月 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
8月 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
8月 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
8月 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
8月 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.

ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory

Hi, I would like to ask for help. When I deploy Caelus on K8s, the Caelus pods show the following logs:
I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory

Did I miss some steps?

Question about metric collection

tutorial.md mentions:
Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server.
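However, the code does not seem to contain the "kernel metrics from eBPF" item.

Does lighthouse support the gRPC protocol?

Judging from the currently open-sourced lighthouse, it supports only the HTTP protocol. If Docker is later replaced with containerd and kubelet sends gRPC CRI requests directly to containerd, can lighthouse still be supported? Thanks.

Incomplete README

I recently started working on colocation scheduling based on Caelus, but after going through the code I am still confused. I have a few questions:

  1. Judging from how the code is used, must YARN run on K8s?
  2. Coordination such as Predict currently works only on a single node. Is there no cluster-wide resource scheduling?
  3. I hope a complete, beginner-friendly README can be provided. Many thanks.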
