
Comments (8)

vishnu2kmohan avatar vishnu2kmohan commented on July 17, 2024

is caused somehow by me manually stopping the master instance
I already terminated the master instance and deleted the S3 bucket and the CloudFormation stack, though, to try to start from scratch.

Just so we better understand how we arrived at this issue, before running det deploy aws up did you det deploy aws down the previous cluster, assuming it's a cluster with the same name: moonsniper? We don't recommend manual deletion of resources unless absolutely necessary - it's best to let det deploy coordinate the cleanup (via CloudFormation) on your behalf.

det deploy is intended as a convenience tool for users who are new to Determined, i.e., a first-time/getting-started tool. It is a wrapper around CloudFormation in the case of AWS, and Terraform in the case of GCP.

is it OK to do deployment again with the same command just to change the config like GPU instance type and number of dynamic agents?

Once you have a cluster up and running, you can modify the CloudFormation stack in the EC2 console to change some elements of the configuration - i.e., fields and variables that are configurable in the CF template.

For advanced changes, beyond what's supported in the CF template, you can SSH into the master and edit /usr/local/determined/etc/master.yaml and restart the master process (sudo sh -c 'docker stop determined-master && docker start determined-master').

Note: If you are changing GPU instance types, for safety, please ensure that you've terminated the old instances before restarting the master to reflect the new instance configuration.

from determined.

offchan42 avatar offchan42 commented on July 17, 2024

Thanks. I think that's the problem.
I haven't called the down command yet. Maybe this means I haven't deleted all the resources manually, e.g. the Aurora DB.
So what should I do now?

Once you have a cluster up and running, you can modify the CloudFormation stack in the EC2 console to change some elements of the configuration - i.e., fields and variables that are configurable in the CF template.

Where can I find these variables to change? Is it EC2 console or CloudFormation console?


vishnu2kmohan avatar vishnu2kmohan commented on July 17, 2024

I haven't called the down command yet. Maybe this means I haven't deleted all the resources manually, e.g. the Aurora DB.
So what should I do now?

If you want to start from scratch, then yes, please det deploy aws down your existing cluster - it'll try and delete all the associated resources.

Note: Calling det deploy aws down will delete your database (Aurora Serverless PostgreSQL), and you'll lose all your experiments and metadata.

Where can I find these variables to change? Is it EC2 console or CloudFormation console?

The CloudFormation Service UI on the AWS Console (under CloudFormation -> Stacks, select Update Stack on the relevant deployment) will show you how it's been configured, and what can be modified.
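The same configuration can also be read outside the web UI. As a minimal sketch, assuming the JSON output of `aws cloudformation describe-stacks --stack-name moonsniper --output json` has been saved to a string, the helper below (the function name and the sample parameter keys are hypothetical, not taken from the Determined template) lists the stack's current parameters:

```python
import json


def stack_parameters(describe_stacks_json: str) -> dict:
    """Return {ParameterKey: ParameterValue} for the first stack in the output."""
    stacks = json.loads(describe_stacks_json).get("Stacks", [])
    if not stacks:
        return {}
    return {
        p["ParameterKey"]: p.get("ParameterValue", "")
        for p in stacks[0].get("Parameters", [])
    }


# Hypothetical sample shaped like describe-stacks output; the real keys
# depend on the CloudFormation template used by det deploy.
SAMPLE = json.dumps({
    "Stacks": [{
        "StackName": "moonsniper",
        "Parameters": [
            {"ParameterKey": "ExampleAgentInstanceType", "ParameterValue": "p2.8xlarge"},
            {"ParameterKey": "ExampleMaxDynamicAgents", "ParameterValue": "1"},
        ],
    }]
})

for key, value in stack_parameters(SAMPLE).items():
    print(f"{key} = {value}")
```

Only the parameters reported here can be changed through Update Stack; anything else would need the master.yaml route described earlier in the thread.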


offchan42 avatar offchan42 commented on July 17, 2024

The command det deploy aws down failed because the CloudFormation stack had already been deleted manually.

(base) C:\Users\off99>det deploy aws down --cluster-id moonsniper
An error occurred (ValidationError) when calling the DescribeStacks operation: Stack with id moonsniper does not exist
Stack Deletion Failed. Check the AWS CloudFormation Console for details.

If I create it again with det deploy aws up, here is the error:

(base) C:\Users\off99>det deploy --no-preflight-checks aws up --cluster-id moonsniper --keypair moonsniper
Starting Determined Deployment
Determined Version: 0.16.0
Stack Name: moonsniper
AWS Region: us-west-2
Keypair: moonsniper
Checking if the SSH Keypair (moonsniper) exists: True
Checking if the CloudFormation Stack (moonsniper) exists: False - Creating Stack
Creating stack moonsniper. This may take a few minutes... Check the CloudFormation Console for updates
Waiter StackCreateComplete failed: Waiter encountered a terminal failure state: For expression "Stacks[].StackStatus" we matched expected path: "ROLLBACK_COMPLETE" at least once
Stack Deployment Failed. Check the AWS CloudFormation Console for details.

If I run det deploy aws down on the failed deployment, it also fails, with an odd error that just says 'Outputs':

(base) C:\Users\off99>det deploy aws down --cluster-id moonsniper
'Outputs'
Stack Deletion Failed. Check the AWS CloudFormation Console for details.

So I think I need to delete everything manually. What are the things I need to delete?
First, I'm attempting to delete the RDS parameter groups, which I suspect were created by Determined.
image
It has this error:
image

All of this started from one small action: I stopped the master instance to save compute, then attempted to restart it. It said I was unauthorized to restart the instance, so I ran det deploy aws up again (hoping the command would be able to restart the master instance), which failed as shown below:

(base) C:\Users\off99>det deploy --no-preflight-checks aws up --cluster-id moonsniper --keypair moonsniper
Starting Determined Deployment
Determined Version: 0.16.0
Stack Name: moonsniper
AWS Region: us-west-2
Keypair: moonsniper
Checking if the SSH Keypair (moonsniper) exists: True
Checking if the CloudFormation Stack (moonsniper) exists: True - Updating Stack
Updating stack moonsniper. This may take a few minutes... Check the CloudFormation Console for updates
An error occurred (IncorrectInstanceState) when calling the StopInstances operation: This instance 'i-0bef35c5692b1ced3' is not in a state from which it can be stopped.
Stack Deployment Failed. Check the AWS CloudFormation Console for details.

After that, I deleted the master instance manually and ran det deploy aws up with some config options (hoping it would create a new master instance with the new config), which also failed:

(base) C:\Users\off99>det deploy aws up --cluster-id moonsniper --keypair moonsniper --gpu-agent-instance-type p2.8xlarge --max-dynamic-agents 1 --max-idle-agent-period 5m
Starting Determined Deployment
Determined Version: 0.16.0
Stack Name: moonsniper
AWS Region: us-west-2
Keypair: moonsniper
Checking if the SSH Keypair (moonsniper) exists: True
Checking if the CloudFormation Stack (moonsniper) exists: True - Updating Stack
Updating stack moonsniper. This may take a few minutes... Check the CloudFormation Console for updates
Waiting For Master Instance To Stop
Master Instance Stopped
Waiter StackUpdateComplete failed: Waiter encountered a terminal failure state: For expression "Stacks[].StackStatus" we matched expected path: "UPDATE_ROLLBACK_FAILED" at least once
Stack Deployment Failed. Check the AWS CloudFormation Console for details.

How do I clean everything now? I don't mind losing the experiments.
Thanks.


vishnu2kmohan avatar vishnu2kmohan commented on July 17, 2024

What are the things I need to delete?

You'd now need to manually delete all the resources that were created for the cluster. It appears that all that's left now are the resources relating to the Aurora DB. You may first need to ensure that the database is in a valid state that allows it to be deleted.
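For the stack itself, the two terminal states in the logs earlier in this thread call for different handling. A rough sketch of that decision as a lookup table follows; the status names are standard CloudFormation stack statuses, while the function name and the wording of each suggestion are only illustrative:

```python
# Suggested next step per CloudFormation stack status (illustrative wording).
NEXT_STEP = {
    # Creation failed and was rolled back; such a stack can only be deleted.
    "ROLLBACK_COMPLETE": "delete the stack, then create it again",
    # The rollback of an update failed, e.g. because a resource was removed by hand.
    "UPDATE_ROLLBACK_FAILED": "continue the update rollback, or delete the stack",
    # Deletion failed; some resources may still exist.
    "DELETE_FAILED": "check the stack events, then retry the delete",
}


def next_step(stack_status: str) -> str:
    """Map a stack status to a suggested action, with a generic fallback."""
    return NEXT_STEP.get(
        stack_status, "check the stack events in the CloudFormation console"
    )


for status in ("ROLLBACK_COMPLETE", "UPDATE_ROLLBACK_FAILED"):
    print(f"{status}: {next_step(status)}")
```

With the AWS CLI, the corresponding commands are `aws cloudformation delete-stack --stack-name moonsniper` and `aws cloudformation continue-update-rollback --stack-name moonsniper`.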

It said I was unauthorized to restart the instance so I attempted to delete CF.

Is the user configured for the aws CLI (which det deploy aws uses) different from the user you log in with via the web browser? Please try using the same user that was used to create the cluster.


offchan42 avatar offchan42 commented on July 17, 2024

About the user, that's a good point. I have now logged in with the IAM user that is used by the AWS CLI instead of the Root user. I will try to restart the master instance with this IAM user next time. (Though it's weird that the Root user couldn't restart the instance but could stop it.)

Anyway, this IAM user with admin permissions also cannot delete the DBParameterGroup.
image

This article said that default DB parameter group cannot be deleted:
https://stackoverflow.com/questions/66332624/failed-to-delete-default-mysql8-0-default-dbparametergroup-cannot-be-deleted-d

So maybe this means that there are no resources related to the Aurora DB left besides the parameter groups? (Also, there were no DB instances to delete in the first place, which is weird because I expected Determined to create a few DB instances. Maybe it's because I deleted its bucket from S3?)
So far I haven't deleted anything manually from the RDS console.
image
What are the remaining resources I need to delete next?


offchan42 avatar offchan42 commented on July 17, 2024

Ah, it seems to be a problem related to authorization.
I should have checked this Events section first. Sorry for wasting your time and thanks for your help. Now my headache is reduced a bit. I will look into this and see what the cause is. I think it's related to me requesting a CPU limit of 160, which Amazon didn't fulfill. They gave me 16, so I asked for 160 again; they gave me 32, so I asked for 160 again, and they haven't granted it yet. Maybe they think this is suspicious activity.
image

Also, they seem to think that my account was used by someone else. Have you run into this issue before?
image
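The Events check mentioned above can also be done from the command line. As a minimal sketch, assuming the output of `aws cloudformation describe-stack-events --stack-name moonsniper --output json` is available as a string, the helper below (its name and the sample events are hypothetical) keeps only the failed events together with the reason CloudFormation reported:

```python
import json


def failed_events(describe_stack_events_json: str) -> list:
    """Return (logical id, status, reason) for every event whose status ends in _FAILED."""
    events = json.loads(describe_stack_events_json).get("StackEvents", [])
    return [
        (
            e.get("LogicalResourceId", "?"),
            e.get("ResourceStatus", ""),
            e.get("ResourceStatusReason", ""),
        )
        for e in events
        if e.get("ResourceStatus", "").endswith("_FAILED")
    ]


# Hypothetical sample shaped like describe-stack-events output; the resource
# names and reason text are made up for illustration.
SAMPLE_EVENTS = json.dumps({
    "StackEvents": [
        {"LogicalResourceId": "ExampleDatabase", "ResourceStatus": "CREATE_FAILED",
         "ResourceStatusReason": "example reason text"},
        {"LogicalResourceId": "ExampleBucket", "ResourceStatus": "CREATE_COMPLETE"},
    ]
})

for logical_id, status, reason in failed_events(SAMPLE_EVENTS):
    print(f"{logical_id}: {status} ({reason})")
```

The reason text of the first failed event is usually the quickest pointer to the root cause, such as the authorization problem seen here.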


vishnu2kmohan avatar vishnu2kmohan commented on July 17, 2024

Oh, sorry to hear that.

Hope you resolve the issue with AWS 🤞

Let us know if you need help once you're past that ☝️ - either here, or in the community slack.

