Comments (7)
- We do not explicitly mount `~/.sky` to the VM/container.
- The SSH key changes on each new VM/container.
The error does not happen when I run the same container from my local machine (it launches a different spot controller each time), but when launched from a VM (with a service account), it tries to access the same spot controller and fails.
Thanks for sharing more details @subhamde8247! One hypothesis is that both the Linux username and the hostname (the output of `python -c "import socket; print(socket.gethostname())"`) are the same across the multiple containers. The user hash we generate from those two values to tell machines apart then comes out the same, which leads to the same spot controller being used.
To confirm the hypothesis, it would be helpful to check whether `cat ~/.sky/user_hash` prints the same value across multiple VMs/containers.
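To illustrate the hypothesis, here is a minimal sketch of how a hash derived only from the username and hostname collides when both values repeat. The function name `derive_user_hash`, the use of SHA-256, and the 8-character length are assumptions for illustration, not SkyPilot's actual implementation.

```python
import hashlib


def derive_user_hash(username: str, hostname: str, length: int = 8) -> str:
    """Hypothetical helper: map (username, hostname) to a short hex ID.

    The algorithm and length are illustrative assumptions only.
    """
    digest = hashlib.sha256(f"{username}@{hostname}".encode("utf-8")).hexdigest()
    return digest[:length]


# Two containers with the same username and hostname get the same ID,
# so they would be treated as the same client.
first = derive_user_hash("root", "worker")
second = derive_user_hash("root", "worker")
other = derive_user_hash("root", "laptop")
```

Under these assumptions, `first` equals `second`, while `other` differs because the hostname differs. That would match containers on fresh VMs sharing a controller while local runs do not.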
There are several workarounds:
- Share the SSH key across the VMs/containers by explicitly uploading or mounting the keys to them.
- Or, generate a random user hash for each VM/container when it is first provisioned, by writing a random value to `~/.sky/user_hash`: `python -c "import uuid; print(uuid.uuid4().hex[:8])" > ~/.sky/user_hash`
We will also look into the issue and check whether the username and the hostname (`python -c "import socket; print(socket.gethostname())"`) are insufficient for identifying a user : )
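The second workaround can also be done from a start-up script. The sketch below writes a random 8-character hash only if the file does not already hold one, mirroring the one-liner above. The helper name `ensure_random_user_hash` is hypothetical, and it assumes the file is written before any SkyPilot command runs in the container.

```python
import uuid
from pathlib import Path


def ensure_random_user_hash(path: Path) -> str:
    """Write a random 8-character hex hash to `path` unless one exists.

    Returns the hash that ends up stored in the file.
    """
    if path.exists():
        existing = path.read_text().strip()
        if existing:
            # Keep the existing value so repeated calls are stable.
            return existing
    path.parent.mkdir(parents=True, exist_ok=True)
    new_hash = uuid.uuid4().hex[:8]
    # Trailing newline matches what `print(...) > file` would write.
    path.write_text(new_hash + "\n")
    return new_hash
```

A possible call at container start-up would be `ensure_random_user_hash(Path.home() / ".sky" / "user_hash")`.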
from skypilot.
One hack we have been using for now: at the end of each run, use `gcloud compute instances list` to list all VMs that match the pattern `sky-spot-controller-` and delete them.
Wondering if I am missing something or if there is a better solution.
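A rough sketch of that cleanup hack, assuming the `gcloud` CLI is installed and authenticated in the container. The function names and the dry-run default are illustrative choices. Note that this deletes every instance whose name starts with the prefix, including controllers that may still be in use.

```python
import subprocess
from typing import List, Tuple

CONTROLLER_PREFIX = "sky-spot-controller-"


def list_command(prefix: str = CONTROLLER_PREFIX) -> List[str]:
    """Build the gcloud command listing name and zone of matching VMs."""
    return [
        "gcloud", "compute", "instances", "list",
        f"--filter=name~^{prefix}",
        "--format=value(name,zone.basename())",
    ]


def parse_instances(output: str) -> List[Tuple[str, str]]:
    """Parse whitespace-separated 'name zone' lines into tuples."""
    instances = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            instances.append((parts[0], parts[1]))
    return instances


def delete_command(name: str, zone: str) -> List[str]:
    """Build the gcloud command deleting one VM without prompting."""
    return ["gcloud", "compute", "instances", "delete", name,
            f"--zone={zone}", "--quiet"]


def delete_stale_controllers(dry_run: bool = True) -> None:
    """List matching controller VMs and delete them (or print the commands)."""
    result = subprocess.run(list_command(), check=True,
                            capture_output=True, text=True)
    for name, zone in parse_instances(result.stdout):
        cmd = delete_command(name, zone)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `delete_stale_controllers()` only prints the delete commands; pass `dry_run=False` to actually run them.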
from a docker container inside GCP VM that has a service account attached to it
Thanks for the report @subhamde8247! Could you share some details of this client VM? For example,
- What does `gcloud auth list` show inside this container? Does it only have the service account, or also some static credential files?
- On each day, is the spot launch triggered from a different container on this client VM, or the same container?
- “gcloud auth list” only shows the service account
- The client VM is deleted each day after running; a new VM is created the next day, and a new container is started within that VM. So the spot launch is triggered each day from a different container, running on a new VM instance.
@subhamde8247 Got it. Some followups:
- Do you mount the same `~/.sky` to the new VM/container every day? This would explain why the new VM/container reuses the same spot controller.
- Does the SSH key, `~/.ssh/sky-key{.pub}`, change on the new VM/container? A newly generated key would explain why the connection to the same spot controller VM is unsuccessful.
  - If this is the case, a fix should be mounting the same `~/.ssh/sky-key{.pub}` to the new VM/container every day. This way the same spot controller can be reused, and during idle periods it'd be autostopped to save costs.
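One way to answer the second question is to compare a digest of the public key across runs. The sketch below is only an illustration: the helper name `file_digest` is hypothetical, and it hashes the raw file contents rather than computing an OpenSSH-style fingerprint.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Printing `file_digest(Path.home() / ".ssh" / "sky-key.pub")` in each day's container and comparing the values would show whether the key was regenerated.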
Confirmed that `cat ~/.sky/user_hash` is the same for multiple runs of the docker container when launched from multiple VMs. However, the hashes differ across runs when the same container is run from my local machine.
explicitly upload/mounting those keys
Yeah, this will add some complexity: we would need to store these keys in GCP Secret Manager and load them properly during container start-up each day.
randomly generate a user hash for each VM/container
We don't mind having a new spot controller for each daily job. The only issue is that old spot controllers are not auto-downed, so we are left with a bunch of stopped spot controller instances in our VM list (which we have to delete manually). If available, a `--down` option for the spot controller would work for us.