Comments (10)
this seems to be exactly how the determined master and determined agent separate their responsibilities: https://docs.determined.ai/latest/architecture/index.html
can you elaborate on what specific issues you see with our implementation?
from determined.
@ioga Thanks for your reply. There is a separation of responsibilities, but I am looking for something more distributed. For example, as a member of the platform engineering team, I want to host the master node in my AWS tenant account, while other teams in the organization have their own AWS tenant accounts. All their experiments and tasks should be submitted to the central determined master hosted in the platform engineering domain, while the experiments themselves should run on agent nodes created in the teams' own separate AWS tenant accounts (the customer ones). So one determined master node in tenant-1 would be creating experiments across agent nodes in tenant-2 and tenant-3. ClearML achieves this through the queue-based, distributed architecture referred to earlier.
from determined.
you can set up the determined master in one place, and determined agents in another.
for example, some of our users run a central master in their cloud account in a particular region, a number of agents in their on-prem clusters, and a few more sets of agents in different cloud regions for overflow purposes. we recommend setting this up within a VPN (e.g. tinc, WireGuard, ...) for better networking and security.
from determined.
I think determined currently handles only a one-to-one model for customers. Let's say there are 6 independent teams (customer accounts) running their private experiments. Each team would submit its experiments to the central master, and the master would deploy agents in that team's own infrastructure (its AWS tenant account); that is, 6 AWS tenant accounts would be used as agent environments. How can such a setup be achieved with determined?
from determined.
- if you want dynamic instance provisioning, you'd need to set up the master with proper IAM access to launch instances in all 6 accounts.
- for each account, a separate resource pool should be set up with the appropriate provisioner configuration.
- you should create at least one workspace for each team.
- you'll need to set up RBAC to restrict the users on each individual team to only be able to access their own workspaces. you'll need Determined Enterprise Edition for this.
- you'll need to set up resource pool <-> workspace bindings to lock each account's resource pools to that team's workspaces. this also requires the enterprise edition.
In this setup, you'll have one master with shared user accounts, but separate resources and workspaces. you'll be able to create shared workspaces to share the experiments/metrics/model registry between the teams.
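The layout described above (one resource pool and at least one workspace per team, with each pool bound to its team's workspace) can be pictured with a small sketch. This is plain Python for illustration only and does not use any Determined API: `TeamSetup`, `plan_shared_master`, `pool_for_workspace`, the naming scheme, and the account IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSetup:
    """Hypothetical record of what one team gets on the shared master."""
    team: str
    aws_account: str    # the team's own AWS tenant account (made-up IDs below)
    resource_pool: str  # pool whose provisioner would launch agents in that account
    workspace: str      # workspace that only this team's users may access


def plan_shared_master(team_accounts: dict[str, str]) -> list[TeamSetup]:
    """Build one resource pool and one workspace per team (illustrative naming)."""
    return [
        TeamSetup(team, account, f"{team}-pool", f"{team}-workspace")
        for team, account in sorted(team_accounts.items())
    ]


def pool_for_workspace(plan: list[TeamSetup], workspace: str) -> str:
    """Mimic a pool <-> workspace binding: each workspace maps to exactly one pool."""
    matches = [s.resource_pool for s in plan if s.workspace == workspace]
    if len(matches) != 1:
        raise LookupError(f"no unique pool bound to workspace {workspace!r}")
    return matches[0]


plan = plan_shared_master({"team-a": "111111111111", "team-b": "222222222222"})
```

With this plan, `pool_for_workspace(plan, "team-a-workspace")` resolves to `team-a-pool`, while an unknown workspace raises `LookupError`. The actual pools, RBAC rules, and bindings would be configured in Determined itself.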
if you do not need any sharing between the independent teams, I don't see much benefit in having one shared master, and I'd recommend setting up an individual master per account instead. Since we're in Enterprise Edition territory anyway, I'd also suggest taking a look at our "Bring your own cloud" offering, which does this setup for you.
from determined.
Thanks for the inputs. This is a challenge for us, because we want to avoid cross-account access. Is there a way to achieve this through decoupling, without direct access to the other AWS accounts? I am thinking of something like a queueing layer, where tasks can be pulled independently by the customer AWS accounts and started on their own agent nodes. That would give customers control over execution while requests are still centralised through the central determined master node.
from determined.
without IAM access you'll lose the dynamic instance provisioning, which is an important cost control feature. is this acceptable for you?
we do not have a pull-based autoscaler like ClearML does.
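For readers unfamiliar with the pull model being discussed, here is a minimal conceptual sketch. It is neither Determined nor ClearML code: `CentralQueue`, `desired_instances`, and the tenant and task names are hypothetical. The idea is that the central service only holds per-tenant queues, and a process inside each tenant account pulls its own tasks and decides how many instances to launch with its own credentials.

```python
import math
from collections import defaultdict, deque
from typing import Optional


class CentralQueue:
    """Hypothetical central service holding one FIFO queue per tenant."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)

    def submit(self, tenant: str, task: str) -> None:
        """Called by users: enqueue a task for the given tenant."""
        self._queues[tenant].append(task)

    def pending(self, tenant: str) -> int:
        """Number of tasks still waiting for this tenant."""
        return len(self._queues[tenant])

    def pull(self, tenant: str) -> Optional[str]:
        """Called from inside the tenant's account: take the oldest task, if any."""
        queue = self._queues[tenant]
        return queue.popleft() if queue else None


def desired_instances(pending: int, tasks_per_instance: int, max_instances: int) -> int:
    """Tenant-side scaling rule: enough instances for the backlog, capped at a maximum."""
    return min(max_instances, math.ceil(pending / tasks_per_instance))


central = CentralQueue()
for name in ("exp-1", "exp-2", "exp-3"):
    central.submit("tenant-2", name)
```

In this sketch the central service never needs credentials for the tenant accounts, because provisioning decisions such as `desired_instances(central.pending("tenant-2"), 2, 4)` are made on the tenant side.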
from determined.
Dynamic instance provisioning is an important need for me. Is there an alternative way this can be achieved? If not having dynamic provisioning feature is acceptable, is there a way to achieve this?
from determined.
> Dynamic instance provisioning is an important need for me. Is there an alternative way this can be achieved?
can you elaborate on why exactly you need a centralized master instance? I am not sure what the point of a centralized queue server is if the queues and their resources are separate anyway. having an isolated master and agents per tenant will simplify the setup, and it won't require any enterprise features.
> If not having dynamic provisioning feature is acceptable, is there a way to achieve this?
yes, then you'll only need to do steps 2 through 5 from my earlier message. it'll require the enterprise edition (aka MLDE), so I'd recommend getting in touch with our sales team, who'll also be able to help with building out a PoC for this.
from determined.
@ioga As part of the platform engineering team, the centralized master instance would give me a centrally managed service to which customer accounts could send requests. It acts as the starting point, and in the future it could also support a chargeback model based on the usage reports seen in the central service.
from determined.