
Comments (10)

ioga commented on September 27, 2024

This seems to be exactly how the Determined master and Determined agents separate their responsibilities: https://docs.determined.ai/latest/architecture/index.html

Can you elaborate on what specific issues you see with our implementation?

from determined.

humbleearth commented on September 27, 2024

@ioga Thanks for your reply. There is a separation of responsibilities, but I am looking for something more distributed. For example, as a member of the platform engineering team, I want to host the master node in my AWS tenant account, while different teams in the organization have their own AWS tenant accounts. All their experiments and tasks should be submitted to the central Determined master hosted in the platform engineering domain, while the experiments themselves should run on agent nodes created in the teams' own separate AWS tenant accounts (the customer accounts). So one Determined master in tenant-1 would be creating experiments across agent nodes in tenant-2 and tenant-3. ClearML achieves this through the queue-based, distributed architecture I referred to earlier.


ioga commented on September 27, 2024

You can set up the Determined master in one place and the Determined agents in another.

For example, some of our users run a central master in their cloud account in a particular region, a number of agents in their on-prem clusters, and a few more sets of agents in different cloud regions for overflow purposes. We recommend setting this up within a VPN (e.g. tinc, WireGuard, ...) for better networking and security.


humbleearth commented on September 27, 2024

I think Determined currently handles only a one-to-one model for customers. Let's say there are 6 independent teams (customer accounts) running their private experiments. They would each submit their experiments to the central master, and the master would deploy agents in each team's own infrastructure (their AWS tenant account); that is, 6 AWS tenant accounts would be used as agent environments. How can such a setup be achieved with Determined?


ioga commented on September 27, 2024

  1. If you want dynamic instance provisioning, you'd need to set up the master with proper IAM access to launch instances in all 6 accounts.
  2. For each account, a separate resource pool should be set up with the appropriate provisioner configuration.
  3. Create at least one workspace per team.
  4. You'll need to set up RBAC to restrict the users on each team to accessing only their own workspaces. You'll need Determined Enterprise Edition for this.
  5. You'll need to set up resource pool <-> workspace bindings to lock each account's resource pool to its team's workspaces. This also requires the Enterprise Edition.

In this setup, you'll have one master with shared user accounts, but separate resources and workspaces. You'll also be able to create shared workspaces to share experiments, metrics, and the model registry between the teams.

If you do not need any sharing between the independent teams, a single shared master doesn't seem useful to me, and I'd recommend setting up an individual master per account instead. Since we're in Enterprise Edition territory anyway, I'd also suggest taking a look at our "Bring your own cloud" offering, which does this setup for you.
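
The layout described in steps 2, 3, and 5 (one resource pool and one workspace per team, with each pool bound only to its own team's workspace) can be sketched in Python as below. This is only a planning aid under stated assumptions: `Team`, `build_plan`, the pool and workspace naming scheme, and the dictionary layout are all hypothetical and do not reflect the actual Determined master configuration schema or CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    """A team and the AWS account its agents should run in (placeholder values)."""
    name: str
    aws_account: str


def build_plan(teams):
    """Return one resource pool and one workspace per team, bound one-to-one.

    The keys used here are illustrative only, not a real configuration format.
    """
    pools = []
    bindings = {}
    for team in teams:
        pool_name = f"{team.name}-pool"
        workspace = f"{team.name}-workspace"
        # Step 2: a separate resource pool per account.
        pools.append({"pool_name": pool_name, "aws_account": team.aws_account})
        # Steps 3 and 5: one workspace per team, bound only to that team's pool.
        bindings[pool_name] = [workspace]
    return {"resource_pools": pools, "bindings": bindings}


if __name__ == "__main__":
    example_teams = [Team(f"team{i}", f"account-{i}") for i in range(1, 7)]
    plan = build_plan(example_teams)
    for pool, workspaces in plan["bindings"].items():
        print(pool, "->", workspaces)
```

Keeping the binding one-to-one is what isolates the teams from each other; a shared workspace would be an extra entry added deliberately, not something the plan produces by default.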


humbleearth commented on September 27, 2024

Thanks for the input. This is something of a challenge, given that we want to avoid cross-account access. Is there a way to achieve this through decoupling, without direct access to the other AWS accounts? Something like a queueing layer, where tasks can be pulled independently by the customer AWS accounts and started on their agent nodes. That would give customers control over execution while requests are still centralized through the central Determined master.


ioga commented on September 27, 2024

Without IAM access you'll lose dynamic instance provisioning, which is an important cost control feature. Is this acceptable for you?

We do not have a pull-based autoscaler like ClearML does.
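
For readers unfamiliar with the pull-based model being contrasted here, a minimal Python sketch follows. It only illustrates the pattern from the question (a central service holds one queue per tenant, and each tenant's agent polls for its own work, so the central service needs no access to the tenant accounts). It is not a Determined feature and not ClearML's API; `CentralQueue`, `submit`, and `pull` are hypothetical names.

```python
from collections import deque


class CentralQueue:
    """Central service holding one FIFO task queue per tenant (illustrative only)."""

    def __init__(self):
        self._queues = {}

    def submit(self, tenant, task):
        """Called by users: enqueue a task for a given tenant."""
        self._queues.setdefault(tenant, deque()).append(task)

    def pull(self, tenant):
        """Called by a tenant's own agent: return its next task, or None if idle."""
        queue = self._queues.get(tenant)
        return queue.popleft() if queue else None


if __name__ == "__main__":
    central = CentralQueue()
    central.submit("tenant-2", "experiment-a")
    central.submit("tenant-3", "experiment-b")

    # Each agent runs inside its own account and only asks for its own work.
    for tenant in ("tenant-2", "tenant-3"):
        task = central.pull(tenant)
        print(f"agent in {tenant} picked up {task}")
```

In this model the tenant decides when and where to start instances, which is why the central service cannot do dynamic provisioning on the tenant's behalf.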


humbleearth commented on September 27, 2024

Dynamic instance provisioning is an important need for me. Is there an alternative way this can be achieved? And if going without dynamic provisioning were acceptable, is there a way to achieve this?


ioga commented on September 27, 2024

> Dynamic instance provisioning is an important need for me. Is there an alternative way this can be achieved?

Can you elaborate on why exactly you need a centralized master instance? I am not sure what the point of a centralized queue server is if the queues and their resources are separate anyway. Having an isolated master and agents per tenant will simplify the setup, and it won't require any enterprise features.

> If not having dynamic provisioning feature is acceptable, is there a way to achieve this?

Yes, then you'll only need to do steps 2 through 5 from my earlier message. It'll require the Enterprise Edition (aka MLDE), so I'd recommend getting in touch with our sales team, who'll also be able to help with building out a PoC for this.


humbleearth commented on September 27, 2024

@ioga A centralized master instance would let me, as part of the platform engineering team, offer a centrally managed service to which customer accounts could send requests. It acts as the starting point, and in the future it could also support a chargeback model based on the usage reports seen in the central service.

