
Comments (4)

ainoam avatar ainoam commented on September 4, 2024

Makes total sense @cthorey - Would you care to issue a PR?

from clearml.

cthorey avatar cthorey commented on September 4, 2024

I thought about it, but then I realized we still have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider. It would be better to be able to detect when the agent has been taken down and reschedule the tasks that were interrupted this way.

I raised an issue, allegroai/clearml-agent#188, describing what prevents this for now.

Specifically, when an instance is taken down, SIGTERM is sent to the running processes and the running tasks are marked as completed. It would be better to mark them as failed, so that we have the option to reschedule them via the retry_on_failure parameter, which we can pass to the PipelineController.
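
To make the idea concrete, here is a rough sketch of the behavior I have in mind, using only the Python standard library. It is not how clearml-agent works today: `InterruptibleTask`, `install_sigterm_handler` and `should_retry` are hypothetical names made up for illustration, and `max_retries` only stands in for what retry_on_failure would control on the pipeline side.

```python
import signal


class InterruptibleTask:
    """Hypothetical stand-in for a task whose final status we control."""

    def __init__(self, name):
        self.name = name
        self.status = "in_progress"
        self.status_reason = None

    def mark_failed(self, reason):
        # Record the interruption so a scheduler can tell it apart from success.
        self.status = "failed"
        self.status_reason = reason

    def mark_completed(self):
        # Only a task that is still running may be promoted to "completed".
        if self.status == "in_progress":
            self.status = "completed"


def install_sigterm_handler(task):
    """Register a SIGTERM handler that marks `task` as failed (assumed behavior).

    Must be called from the main thread, as required by signal.signal().
    Returns the handler so it can also be invoked directly.
    """

    def _on_sigterm(signum, frame):
        task.mark_failed(reason=f"interrupted by signal {signum}")

    signal.signal(signal.SIGTERM, _on_sigterm)
    return _on_sigterm


def should_retry(task, attempts, max_retries):
    """Hypothetical retry rule: reschedule failed tasks until the budget is spent."""
    return task.status == "failed" and attempts < max_retries


if __name__ == "__main__":
    demo = InterruptibleTask("demo")
    install_sigterm_handler(demo)
    signal.raise_signal(signal.SIGTERM)  # simulate the cloud provider's SIGTERM
    demo.mark_completed()                # no effect once the task has failed
    print(demo.status, demo.status_reason)
```

With the status left as failed rather than completed, a retry policy on the pipeline side would have something to act on.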


ainoam avatar ainoam commented on September 4, 2024

Sounds like we're mixing up a number of points @cthorey.

  1. Your original post - A race condition where an instance's activity status is obsolete by the time the autoscaler takes action to take it down.
  2. The status of a task once its executing agent was explicitly terminated (which you address in clearml-agent#188) and its effect on pipeline logic.

These should probably be handled independently. WDYT?


cthorey avatar cthorey commented on September 4, 2024

Yep - I agree they should be handled independently.
Regarding 1., and hence this issue, I think we can reasonably close it given that, as I said above, we have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider.

