Caelus is a set of Kubernetes solutions for reusing idle node resources by running extra batch jobs. These resources come from the underutilization of online jobs, especially during low-traffic periods. To make batch jobs compatible with online jobs, Caelus dynamically manages multiple resource isolation mechanisms and checks various metrics for abnormalities. Batch jobs are throttled or even killed if interference is detected.
- Collect various metrics, including node resources, cgroup resources, and online job latency
- Batch jobs can run on YARN or Kubernetes
- Predict the total resource usage of the node, including online jobs and kernel modules, such as slab
- Dynamically manage multiple resource isolation mechanisms, such as CPU, memory, and disk space
- Dynamically check various metrics for abnormalities, such as CPU usage or online job latency
- Throttle or even kill batch jobs when resource pressure or a latency spike is detected
- Prometheus metrics supported
- Alarms supported
Find more usage details in Tutorial.md. The project also has two attached tools:
nm_operator executes YARN commands through a remote API.
# binary build, which generates binary under _output/bin/
$ make build
# image build
$ make image
# run unit test
$ make test
Caelus should run on a node that also runs the kubelet process. Write the kubelet's "root-dir" value to "kubelet_root_dir" in the config file.
The config file and rule file are only examples; you can enable more features to suit your needs.
# running in script
$ mkdir -p /etc/caelus/
$ cp hack/config/rules.json /etc/caelus/
$ # if the batch job runs on YARN, set "offline_type" in hack/config/caelus.json to "yarn_on_k8s", then run the command
$ caelus --config=hack/config/caelus.json --v=2
$ # if the batch job runs on K8S, set "offline_type" in hack/config/caelus.json to "k8s", then run the command
$ # if the command runs inside a pod, you can omit the kubeconfig parameter
$ caelus --config=hack/config/caelus.json --hostname-override=xxx --v=2 --kubeconfig=xxx
# run in container
$ # the container parameters and environment variables can be found in hack/yaml/caelus.json, such as:
$ docker run -it --cap-add SYS_ADMIN --cap-add NET_ADMIN --cap-add MKNOD --cap-add SYS_PTRACE --cap-add SYS_CHROOT --cap-add SYS_NICE -v /:/rootfs -v /sys:/sys -v /dev/disk:/dev/disk ccr.ccs.tencentyun.com/caelus/caelus:v1.0.0 /bin/bash
# running on K8S
$ kubectl create -f hack/yaml/caelus.yaml
$ kubectl label node colation=true
$ kubectl -n kube-system get daemonset
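The instructions above name two config keys, "offline_type" (with values "k8s" or "yarn_on_k8s") and "kubelet_root_dir". A minimal Go sketch of a pre-flight check for those two keys follows. It assumes both keys sit at the top level of the JSON, which may not match the real layout of hack/config/caelus.json; the Config type and the ParseConfig function are hypothetical names for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Config mirrors only the two keys named in the text.
// Assumption: a flat layout; the real file may nest these keys.
type Config struct {
	OfflineType    string `json:"offline_type"`
	KubeletRootDir string `json:"kubelet_root_dir"`
}

// ParseConfig decodes the JSON and rejects values the text does not mention.
func ParseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, fmt.Errorf("decode config: %w", err)
	}
	switch c.OfflineType {
	case "k8s", "yarn_on_k8s":
		// the two values named in the instructions
	default:
		return c, fmt.Errorf("offline_type %q is neither \"k8s\" nor \"yarn_on_k8s\"", c.OfflineType)
	}
	if c.KubeletRootDir == "" {
		return c, errors.New("kubelet_root_dir is empty")
	}
	return c, nil
}

func main() {
	// The directory below is an example value, not one prescribed by the text.
	raw := []byte(`{"offline_type": "k8s", "kubelet_root_dir": "/var/lib/kubelet"}`)
	c, err := ParseConfig(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("offline_type:", c.OfflineType, "kubelet_root_dir:", c.KubeletRootDir)
}
```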
You can find more about getting started with Caelus in the DETAIL.
For more information about filing issues or submitting pull requests, see Contributing to Caelus.
Caelus is under the Apache License 2.0. See the License file for details.
caelus's Issues
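Have batch-scheduler and coordinator been open-sourced yet?
Errors after deploying lighthouse and lighthouse-plugin
Both lighthouse and lighthouse-plugin are deployed and the relevant kubelet parameters have been changed, but startup still fails.
kubelet fails outright, reporting that it cannot get the docker version,
and the lighthouse process also reports errors: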
I1209 15:13:22.711478 3037 hook_manager.go:164] Build router: post /containers/create
I1209 15:13:22.711633 3037 hook_manager.go:101] Hook manager is running
I1209 15:13:42.033089 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033122 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.033493 3037 hook_manager.go:343] Unhandled request GET /info
I1209 15:13:42.033520 3037 log.go:184] http: proxy error: context canceled
I1209 15:13:42.044698 3037 hook_manager.go:343] Unhandled request GET /versio
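Traffic limiting for offline jobs
Reading the caelus source, I found that pkg/caelus/qos/manager/netio/netio.go calls the Linux tc command to limit traffic. As I understand it, Linux TC can only limit egress traffic. Does this mean caelus can currently only limit the egress traffic of offline jobs?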
question: which tc filter worked in netqos
Hi, I found that the netqos implementation creates two types of tc filters, tc+cgroup and tc+ipset. Which one actually takes effect in the end?
no kind "hookConfiguration"
no kind "hookConfiguration"
How is the hookConfiguration CRD installed?
A question about the offline parent cgroup
Since lighthouse's func (p *offlineMutator) mutate intercepts the request and executes:
newSplits = append(newSplits, splits[1], offlineKey) //offlineKey = "offline"
newSplits = append(newSplits, splits[2:]...)
newCgroupParent := strings.Join(newSplits, string(filepath.Separator))
newCgroupParent = "/" + newCgroupParent
containerConfig.InnerHostConfig.CgroupParent = newCgroupParent
After this interception, these tasks should all sit under the offline parent cgroup directory. So why is moveOfflinePidsTogether in qos_k8s.go still needed? Is moveOfflinePidsTogether redundant here?
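To make the effect of the quoted lines concrete, here is a self-contained Go sketch of the same path rewrite. It assumes that splits comes from splitting an absolute cgroup parent path on "/", so that splits[0] is the empty string; the function name insertOfflineKey and the sample path are hypothetical and not taken from lighthouse.

```go
package main

import (
	"fmt"
	"strings"
)

const offlineKey = "offline"

// insertOfflineKey places offlineKey right after the first path component,
// following the append/join steps of the quoted snippet.
func insertOfflineKey(cgroupParent string) string {
	// Assumption: an absolute path, so the first element after splitting is "".
	splits := strings.Split(cgroupParent, "/")
	if len(splits) < 2 || splits[0] != "" || splits[1] == "" {
		return cgroupParent // not an absolute path with a first component; left unchanged
	}
	newSplits := make([]string, 0, len(splits))
	newSplits = append(newSplits, splits[1], offlineKey)
	newSplits = append(newSplits, splits[2:]...)
	return "/" + strings.Join(newSplits, "/")
}

func main() {
	// prints /kubepods/offline/burstable/pod123
	fmt.Println(insertOfflineKey("/kubepods/burstable/pod123"))
}
```

Under this assumption, every container created through the hook gets a cgroup parent below the shared offline directory, which is the premise of the question above.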
Writing a file to /rootfs/etc inside the container reports: Read-only file system
caelus/pkg/caelus/diskquota/manager/projectquota/projectfile.go
Lines 50 to 53 in 27d65d5
It looks like /etc/ is already mounted directly into the container. Can this part be dropped?
Some Feedback
In PR #37, in the file pkg/caelus/predict/predict_local.go,
mem := math.Max(memStats.UsageRss-memStats.UsageTotal, 0)
mem is always 0. Please confirm whether this is expected.
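The arithmetic behind this report can be checked with a short Go sketch. It assumes that RSS is one component of total memory usage, so UsageRss never exceeds UsageTotal; the memStats type below is a hypothetical stand-in carrying only the two fields used in the quoted line, with float64 assumed as their type.

```go
package main

import (
	"fmt"
	"math"
)

// memStats is a hypothetical stand-in with the two fields from the quoted line.
type memStats struct {
	UsageRss   float64
	UsageTotal float64
}

// quoted reproduces the expression from the quoted line.
func quoted(m memStats) float64 {
	return math.Max(m.UsageRss-m.UsageTotal, 0)
}

func main() {
	// Assumption: UsageRss <= UsageTotal in every sample.
	samples := []memStats{
		{UsageRss: 2, UsageTotal: 8},
		{UsageRss: 8, UsageTotal: 8},
		{UsageRss: 0, UsageTotal: 5},
	}
	for _, s := range samples {
		// The difference is never positive, so clamping at 0 always yields 0.
		fmt.Println(quoted(s))
	}
}
```

The expression returns a non-zero value only when UsageRss exceeds UsageTotal, which the assumption above rules out.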
hadoop version problem
I tried to deploy nm-operator with my Hadoop 2.6 cluster, but found that the YARN client does not support the yarn rmadmin -updateNodeResource
command. How can I use nm-operator with a lower version of Hadoop?
/help
lighthouse make rpm fails
/caelus/contrib/lighthouse-plugin$ make rpm
./hack/rpm
Sending build context to Docker daemon 137.7kB
Error response from daemon: failed to parse Dockerfile: Syntax error - can't find = in "M". Must be of the form: name=value
make: *** [rpm] Error 1
Is the lighthouse component required?
If offline jobs run on YARN, not on K8s, is it fine to skip deploying lighthouse and the plugin server?
Whether LinuxContainerExecutor could be supported on NM runs in Docker
Thanks for your great work on this project, especially for elastic Yarn with K8S.
Sorry, I'm not familiar with K8S and am unsure whether the Hadoop LinuxContainerExecutor is supported natively when the NM runs in Docker.
If you have any ideas on it, please share it with me.
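Where did the offline scheduler go?
Where did the offline scheduler go? Do new hardware resource types such as colocation/cpu and colocation/memory need to be registered through the kubelet device plugin?
lighthouse fails to run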
systemctl status lighthouse.service
● lighthouse.service - Lighthouse server
Loaded: loaded (/usr/lib/systemd/system/lighthouse.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2022-08-29 18:09:16 CST; 9min ago
Process: 57742 ExecStart=/usr/bin/lighthouse $ARGS (code=exited, status=255)
Main PID: 57742 (code=exited, status=255)
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service: main process exited, code=exited, status=255/n/a
Aug 29 18:09:16 host-241 lighthouse[57742]: F0829 18:09:16.070937 57742 server.go:54] failed complete: failed to decode hook configuration file "/etc/lighthouse/config.yaml", no kind "hookConfiguration" is registered for version "lighthouse.io/v1alpha1" in scheme "pkg/runtime/scheme.go:101"
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service holdoff time over, scheduling restart.
Aug 29 18:09:16 host-241 systemd[1]: start request repeated too quickly for lighthouse.service
Aug 29 18:09:16 host-241 systemd[1]: Failed to start Lighthouse server.
Aug 29 18:09:16 host-241 systemd[1]: Unit lighthouse.service entered failed state.
Aug 29 18:09:16 host-241 systemd[1]: lighthouse.service failed.
Running the binary ./caelus --v="2" --kubeconfig=config cannot find the k8s node??
ERROR: /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
Hi, I would like to ask for help. When I deploy caelus on k8s, the caelus pods show the following logs:
I1208 17:04:24.202219 7830 feature_gate.go:243] feature gates: &{map[]}
I1208 17:04:24.202253 7830 types.go:490] current namespace is NOT host
E1208 17:04:24.208590 7830 cpubt.go:56] checking BT file(cpu.offline) err: stat /sys/fs/cgroup/cpu/cpu.offline: no such file or directory
I1208 17:04:24.208616 7830 types.go:708] cpu isolate auto detect is enabled, chosen manage policy is: quota
W1208 17:04:24.208624 7830 types.go:745] adding non-host namespace prefix for kubelet root dir
F1208 17:04:24.208639 7830 types.go:724] cpu manager file(/rootfs/data/cpu_manager_state) err: open /rootfs/data/cpu_manager_state: no such file or directory
Did I miss some steps?
Question about metric collection
tutorial.md mentions:
Multiple metrics supported, including cgroup metrics from cadvisor, node resource metrics, kernel metrics from eBPF, hardware events from PMU, and also Caelus collects online jobs latency from outside in the way of executable command or http server.
However, the code does not seem to contain the "kernel metrics from eBPF" item.
Add support for building caelus in a docker container
Add support for building caelus in a docker container, so that we can build it anywhere :)
/help
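Has the offline scheduler been open-sourced?
Does lighthouse support the gRPC protocol?
Judging from the currently open-sourced lighthouse, it only supports the HTTP protocol. If docker is later replaced with containerd and kubelet sends gRPC CRI requests directly to containerd, can lighthouse still be supported? Thanks.
How to use the lighthouse plugin
Is it done by changing the kubelet startup parameter --container-runtime-endpoint to point to the lighthouse plugin?
Is the implementation of the interference detection part open source?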
Fix the wrong PodSpec for nodemanager.yaml
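As we know, the PodSpec's containers field is an array type, but nodemanager.yaml has a typo there.
Is there a plan to open-source the offline scheduler? If so, roughly when?
Incomplete README
I have recently started working on colocation scheduling based on Caelus, but after reading through the code I am still confused. A few questions:
- Judging from how it is used in the code, does YARN have to run on K8S?
- Coordination such as Predict currently works only on a single node. Is there no resource scheduling across the whole cluster?
- I hope a complete, foolproof README can be provided. Many thanks.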