kubernetes-retired / kube-batch
License: Apache License 2.0
A batch scheduler for Kubernetes for high-performance workloads, e.g. AI/ML, BigData, HPC
License: Apache License 2.0
Currently, the minimum CPU allocation unit in kube-arbitrator is 1, and the CPU used by pods must be an integer to fit kube-arbitrator.
Need to support a smaller minimum CPU allocation unit, such as 1m (1 CPU = 1000m), to make kube-arbitrator more flexible.
The policy will terminate pods to release resources if the Queue is overused (Used > Allocated). Currently the pod is selected randomly; we need to support more pod-selection strategies based on additional factors.
Currently, kube-batchd assigns pods of a job randomly for minAvailable.
For example, if there are three pods p1, p2, p3 in a job and its minAvailable is 2, kube-batchd may start the minAvailable pods in three ways: p1 and p2, p1 and p3, or p2 and p3.
For some jobs, such as TensorFlow jobs, some pods of the job must start first; otherwise, the job cannot start up.
To handle the above issue, the user could specify a different priority for each pod, and kube-batchd will then start the minAvailable pods of the job based on their priority.
kube-batchd reuses the Kubernetes PDB for minAvailable. The scheduler cache syncs PDBs from the api-server, and the policy takes a cache snapshot for scheduling.
However, the policy may take the cache snapshot before the cache has synced PDBs from the api-server. This may cause the policy to lose the minAvailable of a PodSet when scheduling pods for it.
I noticed today the proposal for an Application API object which, at first glance, appears to be a way to group multiple API objects needed to launch an application. This seems pretty similar to the QueueJob concept from the original kube-arbitrator proposal, at least at a high level (in particular, see the ApplicationSpec.Components field). Anyway, I just wanted to point this out so you can look into whether there is anything to leverage from Application.
Maybe there is no useful connection, but it is probably worth investigating.
Currently, ResourceQuota is used to limit the resource usage of each queue (namespace level, in fact).
In the roadmap, we need new admission control for Queue and QueueJob to limit resource usage.
It seems not to be restricted to a list of const strings in the API definition for now; see pkg/apis/v1/queue.go.
Problem Description:
Upon executing:
chmod +x hack/verify-golint.sh; ./hack/verify-golint.sh
I received the following error message:
Detected go version: go version go1.10 linux/amd64.
Kubernetes requires go1.8.3 or greater.
Please install go1.8.3 or later.
!!! Error in ./hack/verify-golint.sh:334
Error in ./hack/verify-golint.sh:334. 'return 2' exited with status 2
Call stack:
1: ./hack/verify-golint.sh:334 main(...)
Exiting with status 1
I verified my Go version with go version, which confirmed:
go version go1.10 linux/amd64
Currently, the queue controller will update the allocation result to Queue, and the quota manager will update the allocation result from Queue to ResourceQuota. There is a time window between the two updates, and it may cause some strange behavior.
Just opening this issue to track it.
There are no unit test cases for this project. Need to add them for each Go file.
Before killing pods to apply a shrunken quota, the queue needs to wait for a grace period to give the queuejob controller a chance to choose lower-priority pods to be killed first to fulfill the shrunken quota.
Currently, the policy can allocate resources to queues by percentage.
We also need to support allocating resources by DRF (Dominant Resource Fairness).
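For reference, a small sketch of the dominant-share computation such a DRF policy would rely on (function and resource names here are illustrative, not the actual kube-arbitrator API): each queue's dominant share is its largest demand-to-capacity ratio across resource types, and DRF allocates next to the queue with the smallest dominant share.

```go
package main

import "fmt"

// dominantShare returns the DRF dominant share of one queue: the
// maximum over all resource types of demand/capacity.
func dominantShare(demand, capacity map[string]float64) float64 {
	share := 0.0
	for r, d := range demand {
		if c := capacity[r]; c > 0 {
			if s := d / c; s > share {
				share = s
			}
		}
	}
	return share
}

func main() {
	capacity := map[string]float64{"cpu": 100, "memory": 400}
	// Queue A is CPU-dominant (20/100), queue B is memory-dominant (120/400);
	// DRF would next allocate to the queue with the smaller dominant share.
	fmt.Println(dominantShare(map[string]float64{"cpu": 20, "memory": 40}, capacity))  // 0.2
	fmt.Println(dominantShare(map[string]float64{"cpu": 10, "memory": 120}, capacity)) // 0.3
}
```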
Plan to merge the release-pre0.1 branch into master, and provide two independent binaries, kube-batchd and kube-quotalloc.
kube-batchd contains the current master branch functionality; it supports batch job scheduling.
kube-quotalloc contains the current release-pre0.1 branch functionality; it is for resource management.
Need to handle the following items:
Rename Consumer to Queue in the master branch to avoid conflict.
Move pkg/* into pkg/batchd/ in the master branch and build kube-batchd.
Move release-pre0.1 (only including resource allocation for the original Queue) into pkg/quotalloc/ in the master branch and rename Queue to ResourceQuotaAllocation.
Unlike kube-quotalloc, it looks like kube-batchd should be able to work in its own namespace, isolated from others?
Hi, kube-arbitrator members
I am trying to run ML(especially TF) workload on Kubernetes with the support of kube-arbitrator, and have some questions about this repo.
First, what's the relationship between kube-arbitrator and IBM EGO? Is it an open-source re-implementation of EGO? And if I'm not mistaken, I think EGO is similar to Mesos:
"EGO also provides a highly configurable, efficient placement service (布局服务) in its kernel."
I was wondering what a placement service is; I cannot find its description in the post.
Second, what's the relationship between batchd and quotalloc? I found that they have similar CRDs, QuotaAllocator and Queue, but I am not sure what they are used for.
Finally, I have a question about the behavior of batchd:
When resources are not sufficient, kube-batchd just tries to start minAvailable pods of each application as much as possible.
If I have two applications, where app1's minAvailable is 4 pods and app2's minAvailable is 6 pods, and there are only 5 slots in the machine, what will happen when batchd schedules the two applications?
I'd appreciate it if you could help me 😄
The Queue API doc contains a basic functional description of preemption. Need to provide a detailed preemption design doc.
Now pod preemption between PodSets is not supported by kube-batchd.
This may cause the first PodSet to get all resources and the second PodSet to get none unless some pods of the first PodSet finish.
So kube-batchd needs the ability to balance resources between PodSets: preempting resources from one PodSet and assigning them to another.
Queue is an important API object in kube-arbitrator. This issue is used to track all related features.
@jinzhejz, as we discussed offline, please help to create a PR for the CRD definition of ResourceQuotaAllocator.
Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST
The queuejob controller needs to get notifications when Queue/QueueJob/Pods are updated.
This could be implemented as a kind of shared informer and lister.
/kind feature
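A minimal, self-contained sketch of the shared informer/lister pattern the issue suggests (a simplified illustration, not the client-go SharedInformer API; all names here are hypothetical): one event source fans notifications out to every registered handler while maintaining a local cache that serves list calls.

```go
package main

import "fmt"

type eventType string

const (
	added   eventType = "Added"
	updated eventType = "Updated"
	deleted eventType = "Deleted"
)

type handler func(kind eventType, obj string)

// notifier fans watch events out to registered controller handlers and
// keeps a lister-style local cache, so controllers never hit the
// api-server directly for reads.
type notifier struct {
	handlers []handler
	store    map[string]string
}

func newNotifier() *notifier {
	return &notifier{store: map[string]string{}}
}

func (n *notifier) addHandler(h handler) { n.handlers = append(n.handlers, h) }

// onEvent is what the watch loop would call on Queue/QueueJob/Pod updates.
func (n *notifier) onEvent(kind eventType, key, obj string) {
	if kind == deleted {
		delete(n.store, key)
	} else {
		n.store[key] = obj
	}
	for _, h := range n.handlers {
		h(kind, obj)
	}
}

// list plays the lister role: reads are served from the local cache.
func (n *notifier) list() []string {
	out := make([]string, 0, len(n.store))
	for _, v := range n.store {
		out = append(out, v)
	}
	return out
}

func main() {
	n := newNotifier()
	n.addHandler(func(kind eventType, obj string) {
		fmt.Printf("queuejob controller saw %s: %s\n", kind, obj)
	})
	n.onEvent(added, "default/qj1", "QueueJob qj1")
	fmt.Println(len(n.list())) // 1 object in the local cache
}
```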
Hi, all!
I've been investigating how applicable kube-arbitrator might be for running multinode MPI jobs on a kubernetes cluster. Here are my observations - please let me know if I was doing something wrong or whether the outcome was as expected. Also, please note that I don't have too much experience with Kubernetes.
For the experiments, I had three nodes with 80 CPUs each.
I first experimented with kube-batchd, hoping that it wouldn't schedule a job until the requested resources were available. I set up one dormant (sleep infinity) pod that used 50 CPUs. I then created a pod disruption budget with matchLabels: app: mpi3pod and minAvailable: 3. I then created a deployment with 3 replicas, label app: mpi3pod, using kube-arbitrator as the scheduler and requesting 50 CPUs per replica.
It seems that minAvailable is not honoured in the way I expected: the whole deployment was marked as "not ready" (good), but 2 of the 3 replicas were launched. I hoped that kube-arbitrator would not start the pods until all the required resources were available. That's not how it works, I presume?
The second experiment was with kube-quotaalloc. My intended use was to generate a separate namespace/QuotaAllocator/ResourceQuota for each MPI job submitted. So I created three namespaces, three QuotaAllocators (with 100 CPUs each), and three resource quotas, but upon running kube-quotaalloc, I saw that the quotas were clipped to fit within one node:
$ kubectl get quota rq01 -n allocator-ns01 -o yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  creationTimestamp: 2018-02-05T11:08:38Z
  name: rq01
  namespace: allocator-ns01
  resourceVersion: "7507551"
  selfLink: /api/v1/namespaces/allocator-ns01/resourcequotas/rq01
  uid: eac65704-0a64-11e8-b0e9-54ab3a8ca064
spec:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
status:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
  used:
    limits.cpu: "0"
    limits.memory: "0"
    pods: "0"
    requests.cpu: "0"
    requests.memory: "0"
It was initially 100 CPUs, but was trimmed down to 80 CPUs (which is how many CPUs are present on each node).
How can I use kube-arbitrator as a resource manager/scheduler for multinode MPI jobs? Ideally, I'd wish for the Job's pods to only be scheduled once all the resources for the whole job become available.
Also, can kube-batchd handle the case of two applications submitted at pretty much the same time in a way that could lead to a deadlock? I.e. imagine we have 5 nodes and two applications which require 4 nodes each (and this is a hard requirement, i.e. they wait for all their jobs to start running before making progress, thus indefinitely blocking whatever nodes they have been assigned). If they were submitted at the same time, we could run into a deadlock: app1 takes e.g. 3 nodes and app2 takes the remaining 2 nodes in the meantime. Neither app can make progress until it gets more nodes, and hence a deadlock occurs.
I found the document about how to create a queue and set its resource quota:
https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/tutorial.md
But could anyone tell me how to create a pod/task in the queue?
kube-arbitrator uses a resource quota to limit the resource usage of each queue.
Now the quota must be created manually for each queue, and it still exists even after the queue is deleted.
Need to:
PR #151 fixes the following issue in the DRF policy; UT cases for this issue also need to be added.
Now kube-batchd can support kubeflow basically.
Next, we need to build a performance test for TensorFlow running on a Kubernetes cluster and then optimize kube-batchd based on the performance test results.
Currently, users are required to download the source and build kube-batchd/kube-quotalloc.
I am trying to deploy kube-batchd and kube-quotalloc on a minikube cluster. I must ssh into the minikube VM, build kube-batchd/kube-quotalloc from source code, and build docker images there. Unfortunately, minikube doesn't even have apt-get, make, or go installed, so the build process is a nightmare.
Is it possible to open a public repository on Docker Hub? I am glad to write a script to automate the build and push process.
There is a race condition between the cache and the policy, which will cause the policy to schedule more pods on the same host. Here are the details:
AllocationQueue
AllocationQueue and update it into the api-server
AllocationQueue (Pods still pending)
AllocationQueue (Pods still pending)
Hey,
The tutorial here, https://github.com/kubernetes-incubator/kube-arbitrator/blob/master/doc/usage/quotalloc_tutorial.md#5-start-kube-quotalloc, launches kube-quotalloc by downloading the source code and building it. Are there any issues with launching kube-quotalloc via a Kubernetes Deployment?
Currently, there are several features in kube-arbitrator's roadmap, plus other requirements from the community and users. But we cannot build them all in one or two days, so I'm opening this issue to track the discussion of a kube-arbitrator MVP.
The MVP will be the first version of kube-arbitrator; if you have any requirements for the MVP, please feel free to ask :).
An informer for the custom resource definition Queue has already been added (pkg/client/informers/queue); we need to go through the code to see if we can use the new informer.
kube-arbitrator only allocates resources at the queue level now.
However, a queue can also contain multiple queuejobs, as in the design doc, and we need to support allocating/assigning resources to queuejobs.
There may be two ways:
Interface.Assign()
Currently, there is only one integration test case for policy/preemption. Need to add more integration test cases.
Steps to reproduce:
The cluster can run 3 pods. Whatever replica count I put in the replica-set yaml file, as long as the PDB minAvailable is larger than 1, the cluster cannot run the pods.
pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-01
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
ss.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      schedulerName: kube-batchd
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        resources:
          limits:
            memory: "3Gi"
            cpu: "3"
          requests:
            memory: "3Gi"
            cpu: "3"
Now there is a gofmt check via hack/update-gofmt.sh.
Need to add a golint check for the source code and refine the code according to the golint results, e.g. by adding a new script, hack/update-golint.sh, for the check.
Currently, we cache all nodes in kube-batchd and assign nodes to pods accordingly. But for batch jobs, the conflict is heavy :(. We'll use a 'static' zone for batch jobs before finding a "good" way to work together with kube-scheduler.
Some tasks (including but not limited to):
Now some code in the vendor directory is redundant and needs to be synced up from GitHub too, including but not limited to:
The integration test fails in the CI env. We need to install etcd before testing:
The command "make test" exited with 0.
0.04s$ make test-integration
hack/make-rules/test-integration.sh
+++ [0929 07:05:29] Checking etcd is on PATH
+++ [0929 07:05:29] Cannot find etcd, cannot run integration tests.
+++ [0929 07:05:29] Please see https://github.com/kubernetes/community/blob/master/contributors/devel/testing.md#install-etcd-dependency for instructions.
You can use 'hack/install-etcd.sh' to install a copy in third_party/.
!!! Error in hack/make-rules/test-integration.sh:87
Error in hack/make-rules/test-integration.sh:87. 'return 1' exited with status 1
Call stack:
1: hack/make-rules/test-integration.sh:87 main(...)
Exiting with status 1
make: *** [test-integration] Error 1
@jinzhejz , please check travis-ci related doc to fix it :).
Currently, kube-batchd groups pods into a PodSet by their owner references. However, this causes problems in the following cases:
K8s Deployment
A deployment may contain more than one ReplicaSet when a rolling upgrade happens; pods in different ReplicaSets have different owner references, so kube-batchd will group them into different PodSets. However, these pods belong to the same deployment and should be in one PodSet.
Kubeflow job
In a tfjob, kubeflow creates Master/PS/Worker as k8s jobs, and each job contains one pod. kube-batchd will group these pods into different PodSets because of their different owner references. However, they belong to the same tfjob and should be in one PodSet.
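One possible direction, sketched below with hypothetical types (real code would resolve owner UIDs through the API server rather than a precomputed chain): group pods by the root of their owner-reference chain instead of their direct owner, so both cases above collapse into a single PodSet.

```go
package main

import "fmt"

// ownedPod is a simplified pod carrying its owner-reference chain,
// e.g. pod -> ReplicaSet -> Deployment; owners[len-1] is the root.
type ownedPod struct {
	name   string
	owners []string
}

// groupByTopOwner keys each pod by the root of its owner chain, so pods
// from two ReplicaSets of one Deployment (or the per-role jobs of one
// tfjob) land in the same PodSet.
func groupByTopOwner(pods []ownedPod) map[string][]string {
	sets := map[string][]string{}
	for _, p := range pods {
		root := p.name // an unowned pod forms its own PodSet
		if len(p.owners) > 0 {
			root = p.owners[len(p.owners)-1]
		}
		sets[root] = append(sets[root], p.name)
	}
	return sets
}

func main() {
	pods := []ownedPod{
		{"web-a-1", []string{"rs-a", "deploy-web"}},
		{"web-b-1", []string{"rs-b", "deploy-web"}}, // second RS from a rolling upgrade
	}
	fmt.Println(len(groupByTopOwner(pods)["deploy-web"])) // both pods fall into one PodSet
}
```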
Now the QueueJob controller in kube-batchd supports simply creating pods for a QueueJob after #194 is merged; however, this is not enough for the QueueJob lifecycle, and the QueueJob controller doesn't handle the pod lifecycle of a QueueJob.
So kube-batchd should be enhanced to support the following tasks:
Not sure if this really fits the goals of this incubator; if it does not, please feel free to close the issue.
At some level, a batch job processing a large corpus of data should have awareness of the data (location, skew, etc.) for efficient scheduling.
These requirements may be more appropriate for a layer above the scheduler; if so, there should be a way to identify how they would translate into changes at the scheduler layer.
For the policy, we need a snapshot of the cluster, e.g. Pods and Nodes, to calculate how many resources should be allocated to a tenant. This task is to provide such a cache and related helper functions.
Sub-tasks:
A Resource type and related helper functions, e.g. Resource.Add
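A rough sketch of what such a Resource type and its helpers could look like (the field names and the two-dimension model are illustrative; the real type would cover more resource kinds):

```go
package main

import "fmt"

// Resource is a simple bag of named quantities: CPU in millicores,
// memory in bytes.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// Add accumulates another Resource, e.g. when summing the requests of
// all pods on a node while building the snapshot.
func (r *Resource) Add(o *Resource) *Resource {
	r.MilliCPU += o.MilliCPU
	r.Memory += o.Memory
	return r
}

// Less reports whether r fits within o on every dimension, useful for
// "does this request fit in the remaining capacity" checks.
func (r *Resource) Less(o *Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

func main() {
	used := &Resource{}
	used.Add(&Resource{MilliCPU: 500, Memory: 1 << 30})
	used.Add(&Resource{MilliCPU: 250, Memory: 1 << 29})
	fmt.Println(used.MilliCPU) // 750
	fmt.Println(used.Less(&Resource{MilliCPU: 1000, Memory: 1 << 31}))
}
```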
According to the proposal https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit#heading=h.a1k69dgabg0w, a new abstraction is to be incorporated into Kubernetes to support complex batch jobs, also named QueueJobs. These jobs can be composed of services, replica sets, deployments, stateful sets, etc., and might benefit from atomic (e.g., all-or-nothing) allocation, preemption, prioritization of QueueJobs and within the same QueueJob, etc.
We have developed a proof-of-concept prototype that supports jobs composed of services, pods, replica sets, and deployments (https://github.com/hanghliu/kube-arbitrator) using CRDs, as well as prioritization and preemption for them. We would like to integrate this prototype with the current kube-arbitrator project to enhance its support for QueueJobs.
The default scheduler publishes events to pods:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- ---- ------ -------
17m 17m 1 default-scheduler Normal Scheduled Successfully assigned kube-batchd-5f86bd5b6-hx8kt to ist
But batchd does not implement this; it seems that batchd does not have an event recorder. I think it should be added so that users know which scheduler scheduled the pod.
When resources are not sufficient, kube-batchd just tries to start as many of each application's minAvailable pods as possible. This can leave a PodSet running fewer pods than minAvailable, which is not acceptable for some applications, e.g. MPI.
To handle the above issue, kube-batchd needs to support not assigning any resources to a job unless its minAvailable can be met.
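The all-or-nothing admission described above can be sketched as follows (the types and the single-resource model are simplifications, not kube-batchd's actual scheduler loop): a job either gets all of its minAvailable slots at once or gets nothing, which also avoids the partial-start situation from the MPI experiments.

```go
package main

import "fmt"

// jobRequest is a simplified gang-scheduling request: each of the job's
// minAvailable pods needs cpuPerPod millicores.
type jobRequest struct {
	minAvailable int
	cpuPerPod    int64
}

// admit implements the all-or-nothing check: assign nothing unless the
// cluster can host all minAvailable pods at once; otherwise admit the
// whole gang and deduct its resources.
func admit(freeCPU *int64, j jobRequest) bool {
	need := int64(j.minAvailable) * j.cpuPerPod
	if need > *freeCPU {
		return false // do not start a partial podset
	}
	*freeCPU -= need
	return true
}

func main() {
	free := int64(5000) // five 1000m slots
	app1 := jobRequest{minAvailable: 4, cpuPerPod: 1000}
	app2 := jobRequest{minAvailable: 6, cpuPerPod: 1000}
	fmt.Println(admit(&free, app1)) // true: all 4 slots fit
	fmt.Println(admit(&free, app2)) // false: only 1 slot left, so app2 gets nothing
}
```

With the 5-slot example from the questions above, app1 (minAvailable 4) starts fully and app2 (minAvailable 6) stays entirely pending instead of grabbing the leftover slot.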
At this early stage, the integration test is OK to verify our code; it's better to reuse the upstream integration tests here.
Overall, we prefer to build the prototype quickly to get all developers on the same page, and to keep improving it before 0.1.