Comments (8)
Got it...
from dlrover.
The code related to the k8s ElasticJob operator.
from dlrover.
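Thanks for the answer. Was DLRover designed with the K8s ecosystem as its reference? Many of its ideas seem related to existing k8s tools QAQ. I have also read your source code, including auto-scaling and multi-node management, and it is not very hard to follow. What I am curious about is why it was designed this way, and what it was modeled on QAQ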
from dlrover.
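ElasticJob exists to make it easy for users to submit DLRover elastic, fault-tolerant jobs on k8s. The ElasticJob operator launches a single master Pod, and all remaining worker Pods are launched by the master Pod through the k8s Python API. I think your question is: why did we move Pod CRUD into the master's code and implement it separately in Python? Because the master needs to monitor worker status so that it can coordinate with the training framework running inside the workers. Two examples: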
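- PS training: the master distributes training data shards to the workers. If a worker dies, the master observes the corresponding event and reassigns the dead worker's data shards to the other workers. Another case is worker CPU and memory prediction: after worker-0 starts, it reports its CPU and memory usage to the master after the first N iterations (N=100), and the master can then launch the other workers based on a real worker's CPU and memory usage.
- PyTorch AllReduce training: if a worker dies, the master has to remove the dead worker from the collective communication group, then start a new worker and notify all workers to re-form the group. As another example, if dlrover-run --network-check on a worker Pod finds that the current machine is faulty, the master has to be notified so that it can call the k8s API to isolate that machine.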
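To make the first example concrete, here is a minimal sketch of a master-side dispatcher that re-queues the shards of a failed worker. It is only an illustration under stated assumptions, not DLRover's actual implementation: the class `ShardDispatcher` and its method names are hypothetical, and shards are modeled as simple index ranges.

```python
from collections import deque


class ShardDispatcher:
    """Toy master-side dispatcher (hypothetical, not DLRover's API).

    Hands out data shards to workers and re-queues the unfinished
    shards of a worker that the master has seen fail.
    """

    def __init__(self, num_samples, shard_size):
        # Each shard is a half-open index range [start, end).
        self._todo = deque(
            (start, min(start + shard_size, num_samples))
            for start in range(0, num_samples, shard_size)
        )
        self._doing = {}  # worker_id -> list of shards in flight

    def get_shard(self, worker_id):
        """Return the next shard for a worker, or None if none are left."""
        if not self._todo:
            return None
        shard = self._todo.popleft()
        self._doing.setdefault(worker_id, []).append(shard)
        return shard

    def report_done(self, worker_id, shard):
        """Mark a shard as finished so it is never handed out again."""
        self._doing[worker_id].remove(shard)

    def on_worker_failed(self, worker_id):
        """Put the failed worker's unfinished shards back at the
        front of the queue, preserving their original order."""
        for shard in reversed(self._doing.pop(worker_id, [])):
            self._todo.appendleft(shard)


dispatcher = ShardDispatcher(num_samples=10, shard_size=4)
s0 = dispatcher.get_shard("worker-0")
s1 = dispatcher.get_shard("worker-1")
dispatcher.report_done("worker-0", s0)
dispatcher.on_worker_failed("worker-1")  # worker-1's shard is re-queued
recovered = dispatcher.get_shard("worker-0")
```

In this sketch the shard that worker-1 was holding is the next one worker-0 receives, so no data is skipped when a worker disappears.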
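Operators such as kubeflow/training only handle Pod CRUD (create, read, update, delete), so Pod events cannot be coordinated with the training framework. For example, when a Pod restarts, how does it rejoin the training group? That requires the training framework to be aware of Pod events.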
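As a rough illustration of what "being aware of Pod events" could look like, the sketch below maps Pod status changes to membership changes in the communication group. It assumes simplified status strings ("Running", "Failed", "Deleted") rather than the exact k8s API fields, and the class `RendezvousGroup` is hypothetical; it is not how DLRover or torch elastic implement rendezvous.

```python
class RendezvousGroup:
    """Toy tracker of which workers belong to the communication group.

    Hypothetical sketch: status values are simplified strings, not the
    exact fields returned by the k8s API.
    """

    def __init__(self):
        self.members = set()
        self.version = 0  # bumped on every membership change

    def handle_pod_event(self, pod_name, status):
        """Update membership from a Pod status change.

        Returns True when membership changed, meaning all workers
        would have to re-form the group.
        """
        before = set(self.members)
        if status == "Running":
            self.members.add(pod_name)
        elif status in ("Failed", "Deleted"):
            self.members.discard(pod_name)
        changed = self.members != before
        if changed:
            self.version += 1
        return changed


group = RendezvousGroup()
group.handle_pod_event("worker-0", "Running")
group.handle_pod_event("worker-1", "Running")
must_regroup = group.handle_pod_event("worker-1", "Failed")
group.handle_pod_event("worker-2", "Running")  # replacement worker joins
```

The point of the design is that the component watching Pods and the component deciding group membership are the same process, which a generic operator that only manages Pod lifecycles does not provide.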
from dlrover.
- The master distributes training data shards to the workers; if a worker dies
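Got it. Broadly, this is about keeping training jobs stable on large-scale clusters. It is a bit like task automation, only built around the ML framework. I almost went off in the wrong direction QAQ. Thanks for such detailed guidance.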
from dlrover.
Yes, you can think of it as a scheduler customized for ML frameworks.
from dlrover.
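Thank you so much, that saved me several months. I plan to look at elastic inference frameworks next; similar frameworks no longer look difficult to me.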
from dlrover.
After thinking it over, the key here is still to study schedulers, k8s, ack, and torchx.
from dlrover.