volcano-sh / volcano
A Cloud Native Batch System (Project under CNCF)
Home Page: https://volcano.sh
License: Apache License 2.0
Is this a BUG REPORT or FEATURE REQUEST?:
/kind cleanup
Description:
We may Queue, which will reuse job/cache & JobInfo.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind cleanup
Description:
In the Makefile, our binaries are named vk-controller, vk-scheduler and so on, but the Docker image is volcanosh/volcano-scheduler. It's better to align them with each other to avoid confusion.
/cc @asifdxtreme
This OWNERS file is mandatory for the bot.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind cleanup
Description:
We'd like to depend on the volcano-sh/kube-batch master branch, and use the upstream-master branch for kubernetes-sigs/kube-batch.
Sorry for the inconvenience :)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Add an example of how to run an MPI job :)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
Description:
11 tests failed in CI; we need to get them fixed ASAP before the release.
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: TerminateJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:102
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodFailed; Action: AbortJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:139
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:174
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: TerminateJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:218
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: PodEvicted; Action: AbortJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:262
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: Any; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:306
[Fail] Job Error Handling [It] job level LifecyclePolicy, Event: TaskCompleted; Action: CompletedJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:468
[Fail] Job Error Handling [It] job level LifecyclePolicy, error code: 3; Action: RestartJob
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_error_handling.go:507
[Fail] Job E2E Test [It] Gang scheduling
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/job_scheduling.go:109
[Fail] MPI E2E Test [It] will run and complete finally
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/mpi.go:74
[Fail] Job E2E Test: Test Job Command [It] Suspend pending job
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/command.go:142
xref https://travis-ci.com/volcano-sh/volcano/jobs/197649052
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Currently, we only record events for Commands; it's better to also record an event for each action on a job.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
There are two patches in kube-batch that introduce kubemark into kube-batch; we need to sync them into volcano as well.
Note that the test cases should be updated, since we don't use the original Kubernetes Job resource in tests.
Currently, users can only create a Queue for scheduling, but it's hard to get more info about it, e.g. how many jobs are in the queue or which plugins the queue uses; and if the Queue is deleted, the jobs are still there :( It's better to have a QueueController to manage the Queue's lifecycle and update its status, and to provide a related command line for users to get its info.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
Description:
It's better to support the following targets in the Makefile:
make: only build the related binaries, e.g. controller, scheduler
make images: build the related Docker images
make e2e-test-kind: run e2e tests with kind
make unit-test: run unit tests
make integration-test: run integration tests
volcano/pkg/controllers/job/job_controller_util.go
Lines 153 to 154 in 3632d36
It looks like this label is used for a service; also, I am not sure what the job service is for.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Currently, we delay pod creation in the job controller, which makes two scenarios hard:
enqueue
can not support other operators
To resolve the above issues, we prefer to add an admission controller to check the PodGroup's status for them. If they do not use a PodGroup, PodGroupController will help them create a shadow one.
We already have a design doc for kube-batch at kubernetes-retired/kube-batch#539. The job controller needs to create pods according to the scheduler's feedback.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Support MaxResource of queue by admission controller & queue controller.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Move vkctl queue into volcano; xref https://github.com/volcano-sh/kube-batch/tree/master/pkg/cli/queue .
For now, we found the following two issues:
hack/verify-gofmt.sh for make verify
e2e-test and e2e-test-kind miss scripts
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
During the main logic sync, the change for the combined input & output feature was only partially synced due to a code merge conflict; we need to fix and re-enable it.
See
volcano/pkg/admission/admit_job.go
Line 147 in d9b532c
Is this a BUG REPORT or FEATURE REQUEST?:
/kind cleanup
Description:
The release target is almost equal to all.
The docker target should be images.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
We need to cherry-pick kubernetes-retired/kube-batch#841 to volcano-sh/kube-batch:master, and rebase upstream-master onto k-sigs/kube-batch:master.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
We got this requirement from a user: it's better to support error handling based on exit code.
Currently, we still merge code manually; it's better to have a robot for it. We can leverage a robot from another community, e.g. Kubernetes.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Cherry-pick the related PR in kube-batch to volcano-sh/kube-batch for the conformance test.
/cc @asifdxtreme
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Make sure "completed" & "terminated" jobs will be removed later.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Currently, we only support Job-level and Task-instance-level error handling; TaskSpec-level error handling is also necessary, e.g. the MPI job should be completed when the mpirun Pod completes successfully.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Currently, the default values of PodGroup are set by the operator/customized controller, which is inconvenient for developers. It's better to set those default values on PodGroup for all users/developers.
Both MPI and TensorFlow need a hostfile for their workers; and MPI jobs need more, e.g. ssh authentication. It's better to provide related plugins for different workloads.
The YAML file may be similar to the following:
spec:
  plugins:
    ssh: ["seed"]
    env: [""]
For example, if ssh is enabled, the job controller should create the related RSA public/private keys and mount them for ssh.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
A new enqueue state has been introduced, but it's unfinished; we need to keep working on it and fix the related test case issues.
NOTE: Some test cases expect the job status to go pending->running/xxxxx, which is incorrect with the new enqueue status; please update them all as well.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Currently, kube-batch creates a shadow PodGroup from the pod's OwnerReference for upstream objects, e.g. Deployment. This makes Queue-related features harder, e.g. Queue's status; it's better to have a controller that creates the PodGroup for upstream objects.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Part of the testcases are failing: https://travis-ci.com/volcano-sh/volcano/jobs/186568285
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
For PR #28
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
There is a Travis CI build failure: https://travis-ci.com/volcano-sh/volcano/jobs/185932518
I noticed we want to support Job.Spec updates in Controller.updateJob, but the generated request is
req := apis.Request{
	Namespace: newJob.Namespace,
	JobName:   newJob.Name,
	Event:     vkbatchv1.OutOfSyncEvent,
}
But in syncJob, if no pods are provided in the request, it will create new pods for the Job, so it will fail, and the subsequent status is unknown.
By the way, I am not very familiar with the entire state machine, so maybe I am missing something.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind test
What happened:
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
Description:
It's better to use --hostfile for MPI jobs instead of parsing it into a string:
% mpirun [ -np X ] [ --hostfile <filename> ] <program>
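A minimal sketch of the idea, assuming the controller already knows each worker's hostname and slot count: it renders a hostfile that mpirun can read via --hostfile, rather than splicing the hosts into the command string. The Worker type, renderHostfile and the host names are hypothetical and not taken from the volcano code base; the "host slots=N" line format follows the Open MPI hostfile convention.

```go
package main

import (
	"fmt"
	"strings"
)

// Worker is a hypothetical description of one MPI worker pod.
type Worker struct {
	Host  string // DNS name the launcher can reach
	Slots int    // number of processes to place on this host
}

// renderHostfile returns hostfile content, one "host slots=N" line per worker.
func renderHostfile(workers []Worker) string {
	var b strings.Builder
	for _, w := range workers {
		fmt.Fprintf(&b, "%s slots=%d\n", w.Host, w.Slots)
	}
	return b.String()
}

func main() {
	workers := []Worker{
		{Host: "mpi-job-worker-0", Slots: 2},
		{Host: "mpi-job-worker-1", Slots: 2},
	}
	// The content would be written to a file (for example one mounted into
	// the launcher pod) and passed to mpirun with --hostfile <filename>.
	fmt.Print(renderHostfile(workers))
}
```

Keeping the host list in a file means the mpirun command line stays the same when the number of workers changes.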
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
Description:
The Queue E2E test failed as follows; it seems there are not enough resources for the reclaim e2e test.
• Failure [19.615 seconds]
Queue E2E Test
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:26
Reclaim [It]
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:27
Expected error:
<*errors.errorString | 0xc00028f4f0>: {
s: "expected replica <1> is too small",
}
expected replica <1> is too small
not to have occurred
/home/travis/gopath/src/volcano.sh/volcano/test/e2e/queue.go:57
refer to https://travis-ci.com/volcano-sh/volcano/jobs/188302297 for more detail :)
If a delete event is missed, there is a risk of a memory leak.
func mutateSpec(tasks []v1alpha1.TaskSpec, basePath string) (patch []patchOperation) {
	for index := range tasks {
		// add default task name
		taskName := tasks[index].Name
		if len(taskName) == 0 {
			tasks[index].Name = v1alpha1.DefaultTaskSpec
		}
	}
	patch = append(patch, patchOperation{
		Op:    "replace",
		Path:  basePath,
		Value: tasks,
	})
	return patch
}
If the user does not specify the task names of a job, the name default will be used for each unnamed task in the mutating stage, but the validating admission controller will then reject the Job creation because of duplicate task names.
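One possible way to avoid the collision, sketched under assumptions: give each unnamed task a default name that includes its index. TaskSpec here is a simplified stand-in for v1alpha1.TaskSpec, and defaultTaskPrefix and fillTaskNames are hypothetical names; this is an illustration, not the fix that was adopted.

```go
package main

import "fmt"

// TaskSpec is a simplified, hypothetical stand-in for v1alpha1.TaskSpec.
type TaskSpec struct {
	Name string
}

// defaultTaskPrefix is an assumed prefix; the real constant may differ.
const defaultTaskPrefix = "default"

// fillTaskNames gives every unnamed task a name derived from its index,
// so two unnamed tasks never receive the same default name.
func fillTaskNames(tasks []TaskSpec) {
	for i := range tasks {
		if tasks[i].Name == "" {
			tasks[i].Name = fmt.Sprintf("%s%d", defaultTaskPrefix, i)
		}
	}
}

func main() {
	tasks := []TaskSpec{{}, {Name: "worker"}, {}}
	fillTaskNames(tasks)
	for _, t := range tasks {
		fmt.Println(t.Name)
	}
}
```

A user-supplied name could still coincide with a generated one, so the duplicate-name check in the validating stage remains useful.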
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Support task/job retry; if it still fails after the retry count is reached, mark it as Failed.
xref: kubernetes-retired/kube-batch#797
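The requested behaviour can be sketched as a small state transition, assuming a per-job retry counter and a configured maximum. JobStatus, the phase names and onFailure are hypothetical and only illustrate the retry-then-fail rule described above.

```go
package main

import "fmt"

// Phase is a hypothetical job phase used only for this sketch.
type Phase string

const (
	Restarting Phase = "Restarting"
	Failed     Phase = "Failed"
)

// JobStatus is a hypothetical status holding the retry bookkeeping.
type JobStatus struct {
	RetryCount int
	Phase      Phase
}

// onFailure handles one failed attempt: it restarts the job while the retry
// budget lasts, and marks the job Failed once the budget is used up.
func onFailure(s JobStatus, maxRetry int) JobStatus {
	if s.RetryCount >= maxRetry {
		s.Phase = Failed
		return s
	}
	s.RetryCount++
	s.Phase = Restarting
	return s
}

func main() {
	s := JobStatus{}
	for i := 0; i < 3; i++ {
		s = onFailure(s, 2)
		fmt.Println(s.RetryCount, s.Phase)
	}
}
```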
Even when a job fails, the state machine calls KillJob to delete all owned resources. I think this is not acceptable for users; they need to look into why it failed.
/kind bug
Currently, Travis spends almost 26 minutes finishing the e2e tests; we need to figure out how to speed them up.
Ran for 26 min 14 sec
Ran 33 of 33 Specs in 773.302 seconds
SUCCESS! -- 33 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (773.30s)
PASS
ok volcano.sh/volcano/test/e2e 773.323s
release "integration" deleted
Running kind: [kind delete cluster --name integration]
Deleting cluster "integration" ...
$KUBECONFIG is still set to use /home/travis/.kube/kind-config-integration even though that file has been deleted, remember to unset it
Volcano logs are currently not supported.
The command "make e2e-test-kind" exited with 0.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
Scheduled jobs are a common requirement for high-performance workloads.
Is this a BUG REPORT or FEATURE REQUEST?:
Uncomment only one, leave it on its own line:
/kind bug
/kind feature
What happened:
Error log:
certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system created
NAME AGE REQUESTOR CONDITION
integration-admission-service.kube-system 0s kubernetes-admin Pending
certificatesigningrequest.certificates.k8s.io/integration-admission-service.kube-system approved
ERROR: After approving csr integration-admission-service.kube-system, the signed certificate did not appear on the resource. Giving up after 10 attempts.
Error: plugin "gen-admission-secret" exited with error
Install volcano chart
NAME: integration
LAST DEPLOYED: Mon Apr 1 03:20:04 2019
NAMESPACE: kube-system
STATUS: DEPLOYED
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
kubectl version:
uname -a:
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
Resolve all the golint issues ignored in the .golint_failures file.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
kubectl version:
uname -a:
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
While deleting the Helm chart, we get an error and all CRDs still exist.
# helm delete sid
Error: deletion completed with 1 error(s): mutatingwebhookconfigurations.admissionregistration.k8s.io "sid-mutate-job" already exists
Because of this, to deploy it next time we need to delete all the CRDs and then deploy again.
What you expected to happen:
Deleting the Helm chart should exit cleanly.
We need to update the tutorial & README accordingly for volcano.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
What happened:
Currently, there is only one goroutine worker syncing jobs. For large-scale jobs, this will be a bottleneck.
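One way to remove the single-worker bottleneck, sketched under assumptions: run several workers and route each job key to a worker by hash, so requests for the same job are still handled one at a time. shardFor and runWorkers are hypothetical helpers, and the counter stands in for the real per-job sync; nothing here is taken from the volcano controller.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardFor maps a job key to one of n workers (n must be positive).
// Hashing keeps all requests for the same job on the same worker,
// so a single job is never synced concurrently.
func shardFor(jobKey string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(jobKey))
	return int(h.Sum32() % uint32(n))
}

// runWorkers starts n workers, routes every key to its shard, and returns
// how many keys each worker processed.
func runWorkers(keys []string, n int) []int {
	queues := make([]chan string, n)
	counts := make([]int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		queues[i] = make(chan string, len(keys))
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range queues[id] {
				counts[id]++ // placeholder for the real per-job sync
			}
		}(i)
	}
	for _, k := range keys {
		queues[shardFor(k, n)] <- k
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
	return counts
}

func main() {
	keys := []string{"ns/job-a", "ns/job-b", "ns/job-c", "ns/job-a"}
	fmt.Println(runWorkers(keys, 3))
}
```

Each worker writes only its own slot in counts, so the sketch needs no extra locking; a real controller would more likely use one rate-limited work queue per worker.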
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Description:
When batch jobs have to compete with each other or with elastic jobs for resources, the resources that become available are likely to be taken immediately by elastic jobs. Batch jobs need multiple resources to be available before they can be dispatched. If the cluster is always busy, a large batch job could be pending indefinitely. The more processors a parallel job requires, the worse the problem is. Resource reservation solves this problem by reserving resources as they become available, until there are enough reserved resources to run the batch job.
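The reservation idea can be sketched as follows, assuming resources are counted in abstract slots. Reservation, Offer and Ready are hypothetical names for illustration only and do not reflect an existing scheduler API.

```go
package main

import "fmt"

// Reservation tracks slots held back for one pending batch job.
type Reservation struct {
	Needed   int // slots the batch job needs before it can start
	Reserved int // slots set aside so far
}

// Offer is called when `free` slots become available. It reserves as many as
// the job still needs and returns the slots left over for other jobs.
func (r *Reservation) Offer(free int) (leftover int) {
	missing := r.Needed - r.Reserved
	if missing <= 0 {
		return free
	}
	take := free
	if take > missing {
		take = missing
	}
	r.Reserved += take
	return free - take
}

// Ready reports whether enough slots are reserved to dispatch the job.
func (r *Reservation) Ready() bool {
	return r.Reserved >= r.Needed
}

func main() {
	r := &Reservation{Needed: 4}
	for _, free := range []int{1, 2, 3} {
		left := r.Offer(free)
		fmt.Println(r.Reserved, left, r.Ready())
	}
}
```

Slots are collected a few at a time as they free up, instead of being handed to elastic jobs immediately, which is what lets a large batch job eventually start on a busy cluster.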