Comments (8)

HongLouyemeng commented on June 15, 2024

Got it...

from dlrover.

workingloong commented on June 15, 2024

The code related to the k8s ElasticJob operator.

HongLouyemeng commented on June 15, 2024

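Thanks for the answer. Was DLRover designed with the k8s ecosystem as its reference? Many of its ideas seem related to existing k8s tools. I have also read your source code, including the autoscaling and multi-node management parts, and it is not very hard to follow. What I am curious about is why it was designed this way, and what it was modeled on QAQ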

workingloong commented on June 15, 2024

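ElasticJob exists to make it easy for users to submit DLRover's elastic, fault-tolerant jobs on k8s. The ElasticJob operator starts a single master Pod; all remaining worker Pods are launched by the master Pod through the k8s Python API. I think your question is really: why did we put Pod CRUD into the master's code and implement it separately in Python? Because the master needs to monitor worker status so that it can coordinate with the training framework running inside the workers. Two examples: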

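  1. PS training: the master distributes training data shards to the workers. If a worker fails, the master observes the worker's failure event and reassigns that worker's data shards to other workers. Another case is predicting worker CPU and memory: after worker-0 starts, it reports its CPU and memory usage to the master after the first N iterations (N=100), and the master can then launch the other workers based on the CPU and memory that a real worker actually uses.
  2. PyTorch AllReduce training: if a worker fails, the master needs to remove the failed worker from the collective communication group, start a new worker, and notify all workers to rebuild the group. As another case, when dlrover-run --network-check on a worker Pod finds that the current machine is faulty, it needs to notify the master so that the master can call the k8s API to isolate the machine.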
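
To make example 1 concrete, here is a minimal sketch of shard reassignment after a worker failure. It is not DLRover's actual code: the class `ShardMaster` and all of its method names are hypothetical, and the failure event is simulated by a direct method call rather than a real k8s Pod event.

```python
from collections import deque


class ShardMaster:
    """Toy master that hands out data shards and recovers them on worker failure.

    Hypothetical illustration only; names do not come from DLRover.
    """

    def __init__(self, num_shards):
        self.todo = deque(range(num_shards))  # shards not yet assigned
        self.doing = {}                       # worker name -> in-flight shards

    def get_shard(self, worker):
        """Assign the next pending shard to `worker`, or None if none are left."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing.setdefault(worker, set()).add(shard)
        return shard

    def report_done(self, worker, shard):
        """The worker finished a shard, so it is no longer in flight."""
        self.doing[worker].discard(shard)

    def on_worker_failed(self, worker):
        """Requeue the unfinished shards of a failed worker and return them."""
        lost = sorted(self.doing.pop(worker, set()))
        # Put lost shards at the front of the queue, keeping their order.
        self.todo.extendleft(reversed(lost))
        return lost


master = ShardMaster(num_shards=4)
a = master.get_shard("worker-0")
b = master.get_shard("worker-1")
master.report_done("worker-0", a)
lost = master.on_worker_failed("worker-1")  # simulated failure event
retry = master.get_shard("worker-0")        # a surviving worker picks it up
```

The point of the sketch is that the component that sees the failure is the same one that owns the shard bookkeeping, so recovery needs no extra hop.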

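Operators such as kubeflow/training only handle Pod CRUD (create, read, update, delete), so Pod events cannot be coupled with the training framework. For example, when a Pod restarts, how does it rejoin the training group? That requires the training framework to be aware of Pod events.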
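
As a rough sketch of what "aware of Pod events" could look like for the AllReduce case, the toy class below rebuilds group membership whenever a Pod event changes the set of live workers. Everything here is an assumption for illustration: `RendezvousMaster`, its methods, and the event strings are hypothetical simplifications, not DLRover's implementation or the exact values returned by the k8s API.

```python
class RendezvousMaster:
    """Toy master that maps Pod events to collective-group membership.

    Hypothetical illustration only; names do not come from DLRover.
    """

    def __init__(self):
        self.members = set()  # Pods currently in the training group
        self.round = 0        # bumped each time the group must be rebuilt

    def handle_pod_event(self, pod, event):
        """Update membership from a simplified Pod event and return the round."""
        before = set(self.members)
        if event == "Running":
            self.members.add(pod)
        elif event in ("Failed", "Deleted"):
            self.members.discard(pod)
        if self.members != before:
            self.round += 1  # workers would be told to re-form the group
        return self.round

    def world(self):
        """Return the current round and a deterministic rank for each member."""
        ranks = {pod: rank for rank, pod in enumerate(sorted(self.members))}
        return self.round, ranks


rdzv = RendezvousMaster()
rdzv.handle_pod_event("worker-0", "Running")
rdzv.handle_pod_event("worker-1", "Running")
rdzv.handle_pod_event("worker-1", "Failed")   # failed worker leaves the group
rdzv.handle_pod_event("worker-2", "Running")  # replacement joins
current_round, ranks = rdzv.world()
```

A generic operator would stop at recreating the Pod; the extra step here is that the same event also triggers a new round so every worker re-forms the group with the replacement included.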

HongLouyemeng commented on June 15, 2024
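  1. The master distributes training data shards to the workers; if a worker fails

Got it. Broadly, this is about keeping training jobs stable on large clusters. It is a bit like task automation, except that it is written to work together with the ML framework. I almost went down the wrong path QAQ. Thanks for such detailed guidance.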

workingloong commented on June 15, 2024

Yes, you can think of it as a scheduler customized for ML frameworks.

HongLouyemeng commented on June 15, 2024

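Thank you so much, this saved me several months. Next I plan to look at elastic inference frameworks; similar frameworks no longer look difficult to me.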

HongLouyemeng commented on June 15, 2024

I thought about it some more: the key here is still to study schedulers, k8sack, torchx
