kubernetes-retired / kube-batch
License: Apache License 2.0
A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
License: Apache License 2.0
Currently, the minimum CPU allocation unit in kube-arbitrator is 1, and the CPU used by pods must be an integer to fit kube-arbitrator.
Need to support a smaller minimum CPU allocation unit, such as 1m (1 CPU = 1000m), to make kube-arbitrator more flexible.
The policy will terminate pods to release resources if the Queue is overused (Used > Allocated). Currently the pod is selected randomly; we need to support more pod-selection strategies based on additional factors.
Currently, kube-batchd assigns pods of a job randomly for minAvailable.
For example, if there are three pods p1, p2, p3 in a job and its minAvailable is 2, kube-batchd may start the minAvailable pods in three ways: p1 and p2, p1 and p3, or p2 and p3.
For some jobs, such as TensorFlow jobs, some pods of the job must start first; otherwise, the job cannot start up.
To handle the above issue, the user could specify a different priority for each pod, and kube-batchd will then start the minAvailable pods of the job based on their priority.
kube-batchd reuses the Kubernetes PDB for minAvailable. The scheduler cache syncs PDBs from the api-server, and the policy takes a cache snapshot for scheduling.
However, the policy may take the cache snapshot before the cache has synced PDBs from the api-server. This may cause the policy to lose the minAvailable of a PodSet when scheduling pods for it.
I noticed today the proposal for an Application API object which, at first glance, appears to be a way to group multiple API objects needed to launch an application. This seems pretty similar to the QueueJob concept from the original kube-arbitrator proposal, at least at a high level (in particular, see the ApplicationSpec.Components field). Anyway, I just wanted to point this out so you can look into whether there is anything to leverage from Application.
Maybe there is no useful connection, but it is probably worth investigating.
Currently, ResourceQuota is used to limit the resource usage of each queue (namespace level, in fact).
In the roadmap, we need new admission control for Queue and QueueJob to limit resource usage.
It seems not to be restricted to a list of const strings in the API definition for now; see pkg/apis/v1/queue.go.
Problem Description:
Upon executing:
chmod +x hack/verify-golint.sh; ./hack/verify-golint.sh
I received the following error message:
Detected go version: go version go1.10 linux/amd64.
Kubernetes requires go1.8.3 or greater.
Please install go1.8.3 or later.
!!! Error in ./hack/verify-golint.sh:334
Error in ./hack/verify-golint.sh:334. 'return 2' exited with status 2
Call stack:
1: ./hack/verify-golint.sh:334 main(...)
Exiting with status 1
I verified my Go version with go version, which confirmed:
go version go1.10 linux/amd64
Currently, the queue controller will update the allocation result to Queue, and the quota manager will update the allocation result from Queue to ResourceQuota. There is a time window between the two updates, and it may cause some strange behavior.
Just opening this issue to track it.
There are no unit test cases for this project. Need to add them for each Go file.
Before killing pods to apply a shrunken quota, the queue needs to wait for a grace period to give the queuejob controller a chance to choose lower-priority pods to be killed first to fulfill the shrunken quota.
Currently, the policy can allocate resources to queues by percentage.
We also need to support allocating resources by DRF (Dominant Resource Fairness).
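For reference, a small sketch of the dominant-share computation such a DRF policy would rely on (function and resource names here are illustrative, not the actual kube-arbitrator API): each queue's dominant share is its largest demand-to-capacity ratio across resource types, and DRF allocates next to the queue with the smallest dominant share.

```go
package main

import "fmt"

// dominantShare returns the DRF dominant share of one queue: the
// maximum over all resource types of demand/capacity.
func dominantShare(demand, capacity map[string]float64) float64 {
	share := 0.0
	for r, d := range demand {
		if c := capacity[r]; c > 0 {
			if s := d / c; s > share {
				share = s
			}
		}
	}
	return share
}

func main() {
	capacity := map[string]float64{"cpu": 100, "memory": 400}
	// Queue A is CPU-dominant (20/100), queue B is memory-dominant (120/400);
	// DRF would next allocate to the queue with the smaller dominant share.
	fmt.Println(dominantShare(map[string]float64{"cpu": 20, "memory": 40}, capacity))  // 0.2
	fmt.Println(dominantShare(map[string]float64{"cpu": 10, "memory": 120}, capacity)) // 0.3
}
```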
Plan to merge the release-pre0.1 branch into master, and provide two independent binaries, kube-batchd and kube-quotalloc.
kube-batchd contains the current master branch functionality; it supports batch job scheduling.
kube-quotalloc contains the current release-pre0.1 branch functionality; it is for resource management.
Need to handle the following items:
Rename Consumer to Queue in the master branch to avoid conflict.
Move pkg/* into pkg/batchd/ in the master branch and build kube-batchd.
Move release-pre0.1 (only including resource allocation for the original Queue) into pkg/quotalloc/ in the master branch and rename Queue to ResourceQuotaAllocation.
Unlike kube-quotalloc, it looks like kube-batchd should be able to work in its own namespace, isolated from others?
Hi, kube-arbitrator members
I am trying to run ML(especially TF) workload on Kubernetes with the support of kube-arbitrator, and have some questions about this repo.
First, what's the relationship between kube-arbitrator and IBM EGO? Is it an open-source re-implementation of EGO? And if I'm not mistaken, I think EGO is similar to Mesos:
"EGO also provides a highly configurable, efficient placement service (布局服务) in its kernel."
I was wondering what a placement service is; I cannot find its description in the post.
Second, what's the relationship between batchd and quotalloc? I found that they have similar CRDs, QuotaAllocator and Queue, but I am not sure what they are used for.
Finally, I have a question about the behavior of batchd:
When resources are not sufficient, kube-batchd just tries to start minAvailable pods of each application as much as possible.
If I have two applications, where app1's minAvailable is 4 pods and app2's minAvailable is 6 pods, and there are only 5 slots in the machine, what will happen when batchd schedules the two applications?
I'd appreciate it if you could help me 😄
The Queue API doc contains a basic functional description of preemption. Need to provide a detailed preemption design doc.
Now pod preemption between PodSets is not supported by kube-batchd.
This may cause the first PodSet to get all resources and the second PodSet to get none unless some pods of the first PodSet finish.
So kube-batchd needs the ability to balance resources between PodSets: preempting resources from one PodSet and assigning them to another.
Queue is an important API object in kube-arbitrator. This issue is used to track all related features.
@jinzhejz, as we discussed offline, please help to create a PR for the CRD definition of ResourceQuotaAllocator.
Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST
The queuejob controller needs to get notifications when Queue/QueueJob/Pods are updated.
This could be implemented as a kind of shared informer and lister.
/kind feature
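A minimal, self-contained sketch of the shared informer/lister pattern the issue suggests (a simplified illustration, not the client-go SharedInformer API; all names here are hypothetical): one event source fans notifications out to every registered handler while maintaining a local cache that serves list calls.

```go
package main

import "fmt"

type eventType string

const (
	added   eventType = "Added"
	updated eventType = "Updated"
	deleted eventType = "Deleted"
)

type handler func(kind eventType, obj string)

// notifier fans watch events out to registered controller handlers and
// keeps a lister-style local cache, so controllers never hit the
// api-server directly for reads.
type notifier struct {
	handlers []handler
	store    map[string]string
}

func newNotifier() *notifier {
	return &notifier{store: map[string]string{}}
}

func (n *notifier) addHandler(h handler) { n.handlers = append(n.handlers, h) }

// onEvent is what the watch loop would call on Queue/QueueJob/Pod updates.
func (n *notifier) onEvent(kind eventType, key, obj string) {
	if kind == deleted {
		delete(n.store, key)
	} else {
		n.store[key] = obj
	}
	for _, h := range n.handlers {
		h(kind, obj)
	}
}

// list plays the lister role: reads are served from the local cache.
func (n *notifier) list() []string {
	out := make([]string, 0, len(n.store))
	for _, v := range n.store {
		out = append(out, v)
	}
	return out
}

func main() {
	n := newNotifier()
	n.addHandler(func(kind eventType, obj string) {
		fmt.Printf("queuejob controller saw %s: %s\n", kind, obj)
	})
	n.onEvent(added, "default/qj1", "QueueJob qj1")
	fmt.Println(len(n.list())) // 1 object in the local cache
}
```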
Hi, all!
I've been investigating how applicable kube-arbitrator might be for running multinode MPI jobs on a kubernetes cluster. Here are my observations - please let me know if I was doing something wrong or whether the outcome was as expected. Also, please note that I don't have too much experience with Kubernetes.
For the experiments, I had three nodes with 80 CPUs each.
I first experimented with kube-batchd, hoping that it wouldn't schedule a job until the requested resources were available. I set up one dormant (sleep infinity) pod that used 50 CPUs. I then created a pod disruption budget with matchLabels: app: mpi3pod and minAvailable: 3. I then created a deployment with 3 replicas, label app: mpi3pod, using kube-arbitrator as the scheduler and requesting 50 CPUs per replica.
It seems that minAvailable is not honoured in the way I expected: the whole deployment was marked as "not ready" (good), but 2 of the 3 replicas were launched. I hoped that kube-arbitrator would not start the pods until all the required resources were available. That's not how it works, I presume?
The second experiment was with kube-quotaalloc. My intended use was to generate a separate namespace/QuotaAllocator/ResourceQuota for each MPI job submitted. So I created three namespaces, three QuotaAllocators (with 100 CPUs each), and three resource quotas, but upon running kube-quotaalloc, I saw that the quotas were clipped to fit within one node:
$ kubectl get quota rq01 -n allocator-ns01 -o yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  creationTimestamp: 2018-02-05T11:08:38Z
  name: rq01
  namespace: allocator-ns01
  resourceVersion: "7507551"
  selfLink: /api/v1/namespaces/allocator-ns01/resourcequotas/rq01
  uid: eac65704-0a64-11e8-b0e9-54ab3a8ca064
spec:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
status:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
  used:
    limits.cpu: "0"
    limits.memory: "0"
    pods: "0"
    requests.cpu: "0"
    requests.memory: "0"
It was initially 100 CPUs, but was trimmed down to 80 CPUs (which is how many CPUs are present on each node).
How can I use kube-arbitrator as a resource manager/scheduler for multinode MPI jobs? Ideally, I'd wish for the Job's pods to only be scheduled once all the resources for the whole job become available.
Also, can kube-batchd handle the case of two applications submitted at pretty much the same time in a way that could lead to a deadlock? I.e. imagine we have 5 nodes and two applications which require 4 nodes each (and this is a hard requirement, i.e. they wait for all their jobs to start running before making progress, thus indefinitely blocking whatever nodes they have been assigned). If they were submitted at the same time, we could run into a deadlock: app1 takes e.g. 3 nodes and app2 takes the remaining 2 nodes in the meantime. Neither app can make progress until it gets more nodes, and hence a deadlock occurs.
I found the document about how to create a queue and set its resource quota:
https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/tutorial.md
But could anyone tell me how to create a pod/task in the queue?
kube-arbitrator uses a resource quota to limit the resource usage of each queue.
Now the quota must be created manually for each queue, and it still exists even after the queue is deleted.
Need to:
PR #151 fixes the following issue in the DRF policy; UT cases for this issue also need to be added.
Now kube-batchd can support kubeflow basically.
Next, we need to build a performance test for TensorFlow running on a Kubernetes cluster and then optimize kube-batchd based on the performance test results.
Currently, users are required to download the source and build kube-batchd/kube-quotalloc.
I am trying to deploy kube-batchd and kube-quotalloc on a minikube cluster. I must ssh into the minikube VM, build kube-batchd/kube-quotalloc from source code, and build docker images there. Unfortunately, minikube doesn't even have apt-get, make, or go installed, so the build process is a nightmare.
Is it possible to open a public repository on Docker Hub? I am glad to write a script to automate the build and push process.
There is a race condition between the cache and the policy, which will cause the policy to schedule more pods on the same host. Here are the details:
AllocationQueue
AllocationQueue and update it into the api-server
AllocationQueue (Pods still pending)
AllocationQueue (Pods still pending)
Hey,
The tutorial here, https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/quotalloc_tutorial.md#5-start-kube-quotalloc, launches kube-quotalloc by downloading the source code and building it. Are there any issues with launching kube-quotalloc via a Kubernetes Deployment?
Currently, there are several features in kube-arbitrator's roadmap, plus other requirements from the community and users. But we cannot build them all in one or two days, so I'm opening this issue to track the discussion of a kube-arbitrator MVP.
The MVP will be the first version of kube-arbitrator; if you have any requirements for the MVP, please feel free to ask :).
An informer for the custom resource definition Queue has already been added (pkg/client/informers/queue); we need to go through the code to see if we can use the new informer.
kube-arbitrator only allocates resources at the queue level now.
However, a queue can also contain multiple queuejobs, as in the design doc, and we need to support allocating/assigning resources to queuejobs.
There may be two ways:
Interface.Assign()
Currently, there is only one integration test case for policy/preemption. Need to add more integration test cases.
Steps to reproduce:
The cluster can run 3 pods. Whatever replica count I put in the replica-set yaml file, as long as the PDB minAvailable is larger than 1, the cluster cannot run the pods.
pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-01
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
ss.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      schedulerName: kube-batchd
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        resources:
          limits:
            memory: "3Gi"
            cpu: "3"
          requests:
            memory: "3Gi"
            cpu: "3"
Now there is a gofmt check via hack/update-gofmt.sh.
Need to add a golint check for the source code and refine the code according to the golint results, e.g. by adding a new script, hack/update-golint.sh, for the check.
Currently, we cache all nodes in kube-batchd and assign nodes to pods accordingly. But for batch jobs, the conflict is heavy :(. We'll use a 'static' zone for batch jobs before finding a "good" way to work together with kube-scheduler.
Some tasks (including but not limited to):
Now some code in the vendor directory is redundant and needs to be synced up from GitHub too, including but not limited to:
The integration test fails in the CI env. We need to install etcd before testing:
The command "make test" exited with 0.
0.04s$ make test-integration
hack/make-rules/test-integration.sh
+++ [0929 07:05:29] Checking etcd is on PATH
+++ [0929 07:05:29] Cannot find etcd, cannot run integration tests.
+++ [0929 07:05:29] Please see https://github.com/kubernetes/community/blob/master/contributors/devel/testing.md#install-etcd-dependency for instructions.
You can use 'hack/install-etcd.sh' to install a copy in third_party/.
!!! Error in hack/make-rules/test-integration.sh:87
Error in hack/make-rules/test-integration.sh:87. 'return 1' exited with status 1
Call stack:
1: hack/make-rules/test-integration.sh:87 main(...)
Exiting with status 1
make: *** [test-integration] Error 1
@jinzhejz , please check travis-ci related doc to fix it :).
Currently, kube-batchd groups pods into a PodSet by their owner references. However, this causes problems in the following cases:
K8s Deployment
A deployment may contain more than one ReplicaSet when a rolling upgrade happens; pods in different ReplicaSets have different owner references, so kube-batchd will group them into different PodSets. However, these pods belong to the same deployment and should be in one PodSet.
Kubeflow job
In a tfjob, kubeflow creates Master/PS/Worker as k8s jobs, and each job contains one pod. kube-batchd will group these pods into different PodSets because of their different owner references. However, they belong to the same tfjob and should be in one PodSet.
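One possible direction, sketched below with hypothetical types (real code would resolve owner UIDs through the API server rather than a precomputed chain): group pods by the root of their owner-reference chain instead of their direct owner, so both cases above collapse into a single PodSet.

```go
package main

import "fmt"

// ownedPod is a simplified pod carrying its owner-reference chain,
// e.g. pod -> ReplicaSet -> Deployment; owners[len-1] is the root.
type ownedPod struct {
	name   string
	owners []string
}

// groupByTopOwner keys each pod by the root of its owner chain, so pods
// from two ReplicaSets of one Deployment (or the per-role jobs of one
// tfjob) land in the same PodSet.
func groupByTopOwner(pods []ownedPod) map[string][]string {
	sets := map[string][]string{}
	for _, p := range pods {
		root := p.name // an unowned pod forms its own PodSet
		if len(p.owners) > 0 {
			root = p.owners[len(p.owners)-1]
		}
		sets[root] = append(sets[root], p.name)
	}
	return sets
}

func main() {
	pods := []ownedPod{
		{"web-a-1", []string{"rs-a", "deploy-web"}},
		{"web-b-1", []string{"rs-b", "deploy-web"}}, // second RS from a rolling upgrade
	}
	fmt.Println(len(groupByTopOwner(pods)["deploy-web"])) // both pods fall into one PodSet
}
```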
Now the QueueJob controller in kube-batchd supports simply creating pods for a QueueJob after #194 is merged; however, this is not enough for the QueueJob lifecycle, and the QueueJob controller doesn't handle the pod lifecycle of a QueueJob.
So kube-batchd should be enhanced to support the following tasks:
Not sure if this really fits the goals of this incubator; if it does not, please feel free to close the issue.
At some level, a batch job processing a large corpus of data should have awareness of the data (location, skew, etc.) for efficient scheduling.
These requirements may be more appropriate for a layer above the scheduler; if so, there should be a way to identify how they would translate into changes at the scheduler layer.
For the policy, we need a snapshot of the cluster, e.g. Pods and Nodes, to calculate how many resources should be allocated to a tenant. This task is to provide such a cache and related helper functions.
Sub-tasks:
A Resource type and related helper functions, e.g. Resource.Add
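A rough sketch of what such a Resource type and its helpers could look like (the field names and the two-dimension model are illustrative; the real type would cover more resource kinds):

```go
package main

import "fmt"

// Resource is a simple bag of named quantities: CPU in millicores,
// memory in bytes.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// Add accumulates another Resource, e.g. when summing the requests of
// all pods on a node while building the snapshot.
func (r *Resource) Add(o *Resource) *Resource {
	r.MilliCPU += o.MilliCPU
	r.Memory += o.Memory
	return r
}

// Less reports whether r fits within o on every dimension, useful for
// "does this request fit in the remaining capacity" checks.
func (r *Resource) Less(o *Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

func main() {
	used := &Resource{}
	used.Add(&Resource{MilliCPU: 500, Memory: 1 << 30})
	used.Add(&Resource{MilliCPU: 250, Memory: 1 << 29})
	fmt.Println(used.MilliCPU) // 750
	fmt.Println(used.Less(&Resource{MilliCPU: 1000, Memory: 1 << 31}))
}
```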
According to the proposal https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit#heading=h.a1k69dgabg0w, a new abstraction is to be incorporated into Kubernetes to support complex batch jobs, also named QueueJobs. These jobs can be composed of services, replica sets, deployments, stateful sets, etc., and might benefit from atomic (e.g., all-or-nothing) allocation, preemption, prioritization of QueueJobs and within the same QueueJob, etc.
We have developed a proof-of-concept prototype that supports jobs composed of services, pods, replica sets, and deployments (https://github.com/hanghliu/kube-arbitrator) using CRDs, as well as prioritization and preemption for them. We would like to integrate this prototype with the current kube-arbitrator project to enhance its support for QueueJobs.
The default scheduler publishes events to pods:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- ---- ------ -------
17m 17m 1 default-scheduler Normal Scheduled Successfully assigned kube-batchd-5f86bd5b6-hx8kt to ist
But batchd does not implement this; it seems that batchd does not have an event recorder. I think it should be added so that users know which scheduler scheduled the pod.
When resources are not sufficient, kube-batchd just tries to start as many of each application's minAvailable pods as possible. This can leave a PodSet running fewer pods than minAvailable, which is not acceptable for some applications, e.g. MPI.
To handle the above issue, kube-batchd needs to support not assigning any resources to a job unless its minAvailable can be met.
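The all-or-nothing admission described above can be sketched as follows (the types and the single-resource model are simplifications, not kube-batchd's actual scheduler loop): a job either gets all of its minAvailable slots at once or gets nothing, which also avoids the partial-start situation from the MPI experiments.

```go
package main

import "fmt"

// jobRequest is a simplified gang-scheduling request: each of the job's
// minAvailable pods needs cpuPerPod millicores.
type jobRequest struct {
	minAvailable int
	cpuPerPod    int64
}

// admit implements the all-or-nothing check: assign nothing unless the
// cluster can host all minAvailable pods at once; otherwise admit the
// whole gang and deduct its resources.
func admit(freeCPU *int64, j jobRequest) bool {
	need := int64(j.minAvailable) * j.cpuPerPod
	if need > *freeCPU {
		return false // do not start a partial podset
	}
	*freeCPU -= need
	return true
}

func main() {
	free := int64(5000) // five 1000m slots
	app1 := jobRequest{minAvailable: 4, cpuPerPod: 1000}
	app2 := jobRequest{minAvailable: 6, cpuPerPod: 1000}
	fmt.Println(admit(&free, app1)) // true: all 4 slots fit
	fmt.Println(admit(&free, app2)) // false: only 1 slot left, so app2 gets nothing
}
```

With the 5-slot example from the questions above, app1 (minAvailable 4) starts fully and app2 (minAvailable 6) stays entirely pending instead of grabbing the leftover slot.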
At this early stage, the integration test is OK to verify our code; it's better to reuse the upstream integration tests here.
Overall, we prefer to build the prototype quickly to get all developers on the same page, and to keep improving it before 0.1.