Comments (2)
Thanks for reporting this, @Hubert-Bonisseur! This is quite weird. Possibly the controller process was somehow killed.
Could you share how many spot jobs you were running concurrently, and whether you have seen any issues with the other spot jobs?
It would be nice if you could share the task YAML you were running as well : )
from skypilot.
I was running only one job at that time. I have since launched 2 concurrent spot jobs and they are working fine so far, but there hasn't been a preemption yet. I will post an update once one happens.
Here is the task.yml:
name: small-yt
resources:
  cloud: gcp
  region: europe-west4
  cpus: 12+
  accelerators: A100
  memory: 6+
  disk_size: 500
  disk_tier: 'medium'
file_mounts:
  ~/secret/service_account.json: /Users/datalab/épellations/STT/finetune/secrets/finetuning-414911-4d293f61509f.json
envs:
  COMMIT: b11a10fb86feef059d4798ff883ea719e4169218
  MODEL_ID: small-yt-V2
  NUM_WORKERS: 12
setup: |
  echo "Begin setup."
  sudo apt-get update
  sudo apt-get -y install ffmpeg
  cd ~/sky_workdir
  git clone [email protected]:data-science/speech/finetune.git
  cd finetune_whisper
  git checkout $COMMIT
  pip install -r requirements.txt
  pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
  mkdir ~/checkpoints/
  if gsutil ls gs://finetuning-checkpoints/$MODEL_ID; then
    gsutil -m cp -r gs://finetuning-checkpoints/$MODEL_ID/* ~/checkpoints/
  else
    echo "Remote folder does not exist. Starting a new training run"
  fi
  echo "Setup complete."
run: |
  echo "Beginning task."
  cd finetune
  export GOOGLE_APPLICATION_CREDENTIALS=$(realpath ~/secret/service_account.json)
  export PYTHONPATH=$PWD
  python finetune run configs/training_config_mosaicML.yml