skypilot-org / skypilot Goto Github PK

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Home Page: https://skypilot.readthedocs.io

License: Apache License 2.0

Python 96.36% Jinja 2.43% Shell 1.02% Dockerfile 0.01% HTML 0.17%

cloud-computing data-science deep-learning gpu hyperparameter-tuning machine-learning tpu job-queue job-scheduler cloud-management

skypilot's Introduction

SkyPilot

Run LLMs and AI on Any Cloud

🔥 News 🔥

[Apr, 2024] Serve and finetune Llama 3 on any cloud or Kubernetes: example
[Apr, 2024] Serve Qwen-110B on your infra: example
[Apr, 2024] Using Ollama to deploy quantized LLMs on CPUs and GPUs: example
[Mar, 2024] Serve and deploy Databricks DBRX on your infra: example
[Feb, 2024] Deploying and scaling Gemma with SkyServe: example
[Feb, 2024] Speed up your LLM deployments with SGLang for 5x throughput on SkyServe: example
[Feb, 2024] Serving Code Llama 70B with vLLM and SkyServe: example
[Dec, 2023] Mixtral 8x7B, a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: example
[Nov, 2023] Using Axolotl to finetune Mistral 7B on the cloud (on-demand and spot): example
[Sep, 2023] Case study: Covariant transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: read the case study
[Aug, 2023] Finetuning Cookbook: Finetuning Llama 2 in your own cloud environment, privately: example, blog post
[June, 2023] Serving LLM 24x Faster On the Cloud with vLLM and SkyPilot: example, blog post

Archived

[Dec, 2023] Using LoRAX to serve 1000s of finetuned LLMs on a single instance in the cloud: example
[Sep, 2023] Mistral 7B, a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: Mistral docs
[July, 2023] Self-Hosted Llama-2 Chatbot on Any Cloud: example
[April, 2023] SkyPilot YAMLs for finetuning & serving the Vicuna LLM with a single command!

SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.

SkyPilot abstracts away cloud infra burdens:

Launch jobs & clusters on any cloud
Easy scale-out: queue and run many jobs, automatically managed
Easy access to object stores (S3, GCS, R2)

SkyPilot maximizes GPU availability for your jobs:

Provision in all zones/regions/clouds you have access to (the Sky), with automatic failover

SkyPilot cuts your cloud costs:

Managed Spot: 3-6x cost savings using spot VMs, with auto-recovery from preemptions
Optimizer: 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
Autostop: hands-free cleanup of idle clusters

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Install with pip (we recommend the nightly build for the latest features or from source):

pip install "skypilot-nightly[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]"  # choose your clouds

To get the last release, use:

pip install -U "skypilot[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]"  # choose your clouds

Current supported providers (AWS, Azure, GCP, OCI, Lambda Cloud, RunPod, Fluidstack, Paperspace, Cudo, IBM, Samsung, Cloudflare, any Kubernetes cluster):

SkyPilot

Getting Started

You can find our documentation here.

SkyPilot in 1 Minute

A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.

Once written in this unified interface (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.

Paste the following into a file my_task.yaml:

resources:
  accelerators: V100:1  # 1x NVIDIA V100 GPU

num_nodes: 1  # Number of VMs to launch

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: ~/torch_examples

# Commands to be run before executing the job.
# Typical use: pip install -r requirements.txt, git clone, etc.
setup: |
  pip install "torch<2.2" torchvision --index-url https://download.pytorch.org/whl/cu121

# Commands to run as a job.
# Typical use: launch the main program.
run: |
  cd mnist
  python main.py --epochs 1

Prepare the workdir by cloning:

git clone https://github.com/pytorch/examples.git ~/torch_examples

Launch with sky launch (note: access to GPU instances is needed for this example):

sky launch my_task.yaml

SkyPilot then performs the heavy-lifting for you, including:

Find the lowest priced VM instance type across different clouds
Provision the VM, with auto-failover if the cloud returned capacity errors
Sync the local workdir to the VM
Run the task's setup commands to prepare the VM for running the task
Run the task's run commands

Refer to Quickstart to get started with SkyPilot.

More Information

To learn more, see our Documentation and Tutorials.

Runnable examples:

LLMs on SkyPilot
- Llama 3
- Qwen
- Databricks DBRX
- Gemma
- Mixtral 8x7B; Mistral 7B (from official Mistral team)
- Code Llama
- vLLM: Serving LLM 24x Faster On the Cloud (from official vLLM team)
- SGLang: Fast and Expressive LLM Serving On the Cloud (from official SGLang team)
- Vicuna chatbots: Training & Serving (from official Vicuna team)
- Train your own Vicuna on Llama-2
- Self-Hosted Llama-2 Chatbot
- Ollama: Quantized LLMs on CPUs
- LoRAX
- QLoRA
- LLaMA-LoRA-Tuner
- Tabby: Self-hosted AI coding assistant
- LocalGPT
- Falcon
- Add yours here & see more in llm/!
Framework examples: PyTorch DDP, DeepSpeed, JAX/Flax on TPU, Stable Diffusion, Detectron2, Distributed TensorFlow, Ray Train, NeMo, programmatic grid search, Docker, Cog, Unsloth, Ollama and many more (examples/).

Follow updates:

Read the research:

Support and Questions

We are excited to hear your feedback!

For issues and feature requests, please open a GitHub issue.
For questions, please use GitHub Discussions.

For general discussions, join us on the SkyPilot Slack.

Contributing

We welcome and value all contributions to the project! Please refer to CONTRIBUTING for how to get involved.

skypilot's People

Contributors

Stargazers

Watchers

Forkers

concretevitamin dmatch01 sumanthgenz romilbhardwaj romakoks lambda7xx franklsf95 akshat977 gilv vivekkhimani ewzeng cohen-j-omer choochutran jeffwan jalajk24 dongreenberg blaisemuhirwa vinvc86 cemberk saruberoz bm-nhs thearchiver zmzlois nanderoo gravitasse foxxnuaa super-rain xzg2022 hadoop835 dylanxult gerrey lilelr ahlfors kinddevil monkeyboy123 hellowrold2022 dd-guo nikkidada jrcribb richardsonjf mkostreba tngamemo alahmdi scsldb backyes new-cherry-inc cyd3nt vincentwei2021 ronenkat luyanzheng becloudready gg-big-org turian archiveproject soumojit rheehot yuzmoon qzweng zhangfaquan fakela blueman1223 cblmemo landscapepainter suryavengadesan markhng525 mental2008 stephenbalaban kai-x-org neolithics-ai mturilli dbonattoj eli-osherovich jermwatt anhmike test-heywtu thalesfsp adamuas ormwish apollohuang1 angeloluidens wsxiaoys jaidsar hbcbh1999 arnaudmkonan mabrains kp-forks miss-bug muertebt5 joolstorrentecalo swifilaboroka botogoske pmb2 doliver-app jeffhsu3 lcsouzamenezes tailaw dewey363 gumplus treebeardtech trail-of-forks

skypilot's Issues

Sanitize cluster names / make names with underscores be able to run on GCP

multi_echo.py was using the cluster name multi_echo. When forcing it on GCP, it failed with

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?alt=json returned "Invalid value for field 'resource.name': 'ray-multi_echo-head-480dea0e'. Must be a match of regex '(?:a-z?)'". Details: "[{'message': "Invalid value for field 'resource.name': 'ray-multi_echo-head-480dea0e'. Must be a match of regex '(?:a-z?)'", 'domain': 'global', 'reason': 'invalid'}]">

Replacing the name with multi-echo worked. It must be regex (?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)

Need to design a solution (force replace _ with -? some other approach?).

Support auto-teardown of clusters?

I ran a large experiment, forgot to ray down <configfile>. AWS billed me for the hours I didn't even use the machine. Should we add some hints to remind the user to teardown their cluster and/or implement automatic teardown of idle clusters?

(feel free to close the issue if it's not a priority)

No such file or directory: '/home/ubuntu/.ssh/config'

https://github.com/concretevitamin/sky-experiments/blob/320699b789479605c259ffe023264a62ea437c5c/prototype/sky/backends/backend_utils.py#L260

I ran into this error when running sky down. I'm launching tasks from a sky cpunode which is a new VM that does not have a ~/.ssh/config file.

Avoid attempting to connect to GCP instance launched by other users

When I tried to run the resnet_app.py example with Resources(sky.GCP(), accelerator={'V100': 4}), the sky will try to connect to the instance that was already launched by another user with only 1 V100 GPU, and raise the following error (the f0eb6cb1 in the error message is the instance id launched by another user). The reason for it can be because we use the same cluster_name in the gcp-ray.yml for everyone.

Traceback (most recent call last):
  File "/data/zhwu/miniconda3/envs/sky/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 1970, in main
    return cli()
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/scripts/scripts.py", line 963, in up
    create_or_update_cluster(
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 242, in create_or_update_cluster
    get_or_create_head_node(config, config_file, no_restart, restart_only, yes,
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/commands.py", line 634, in get_or_create_head_node
    provider.terminate_node(head_node)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 36, in method_with_retries
    return method(self, *args, **kwargs)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 162, in terminate_node
    resource = self._get_resource_depending_on_node_name(node_id)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node_provider.py", line 85, in _get_resource_depending_on_node_name
    return self.resources[GCPNodeType.name_to_type(node_name)]
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/site-packages/ray/autoscaler/_private/gcp/node.py", line 125, in name_to_type
    return GCPNodeType(name.split("-")[-1])
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 384, in __call__
    return cls.__new__(cls, value)
  File "/data/zhwu/miniconda3/envs/sky/lib/python3.9/enum.py", line 702, in __new__
    raise ve_exc
ValueError: 'f0eb6cb1' is not a valid GCPNodeType

`pip install -e .` should auto install requirements.txt

From the latest master, I did a pip install -e . and typed sky, which gave

...

  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/prototype/sky/backends/__init__.py", line 2, in <module>
    from sky.backends.backend import Backend
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/prototype/sky/backends/backend.py", line 5, in <module>
    from sky.backends import backend_utils
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/prototype/sky/backends/backend_utils.py", line 19, in <module>
    from sky import authentication as auth
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/prototype/sky/authentication.py", line 13, in <module>
    from Crypto.PublicKey import RSA
ModuleNotFoundError: No module named 'Crypto'

This is because pip install -r requirements.txt has changed and needs to be manually run again.

We could make the former include the second command.

A bug of detecting key pairs on cloud

I was trying to launch sky from a local server but got the below error for 30mins and autoscaler failed.

[email protected]: Permission denied (publickey).                      
    SSH still not available (SSH command failed.), retrying in 5 seconds.

I investigated and found a bug in sky/authentication.py where we wrongly assume that the same sky-key is used when key.name matches.
https://github.com/concretevitamin/sky-experiments/blob/master/prototype/sky/authentication.py#L85
Turns out that sky-key is a wrong one that was uploaded from my local macbook (which I used before for running sky and generated a different sky-key than the one on the current server). We may need to identify the key by its public key value rather than its name.
@michaelzhiluo

AMI ID does not work across regions

It seems that the AMI ID does not work across regions. For instance, the same deep learning AMI in us-east-2 is identified as ami-04b343a85ab150b2d rather than the below one for us-west-2. We may need to save such info to service directory and change the region/ID accordingly as needed.

https://github.com/concretevitamin/sky-experiments/blob/8b6ba284d384aa183e086082fd677ea15671b40e/prototype/config/aws.yml.j2#L20

`sky status` not consistent with EC2 instance list

What happens if the user terminates an instance while the instance is setting up? Sky status will report the instance even if it does not exist anymore.

This was mentioned in our meetings before, and I found an explicit example to replicate this bug.

Run python examples/resnet_app.py
While the machine is in the setup phase (where you can see pip installing packages on Terminal), go to EC2 instance and kill/terminate the instance.
Run Sky status

Alias sky.Optimizer.optimize() to sky.optimize()

Intent is to have a simpler API.

Experience with the Ray Tune example (Docker-based dry run would be very useful!)

Most of my time spent making the Ray Tune example is to get the Python dependencies right. It turns out the example requires a specific pytorch-lightning version, implicitly depends on torchvision, etc, etc. Part of the blame goes to Ray Tune for not specifying a clean requirements.txt (ray-project/ray#20601), but I guess this can be common for our users too.

In this case I think first debug iteratively using a local Docker would be very useful since it is much faster than waiting for VMs to spin up every time you change a line.

No action items; will try Romil's local Docker backend soon.

Refactor aws.csv

We now have an entry per-instance-type, per-region and per-availability-zone. I think we can simplify this by making SpotPrice a dictionary like {'us-west-1a': 1.0, ...}, since that seems to be the only difference between AZs. This can make getting the instance types easier and reduce duplicate data.

Low-priority for now, since the current design works fine.

Cluster reuse and resize with `[gpu|cpu|tpu]node`

Currently there is no mechanism to support reusing a stopped cluster for gpunode. Intuitively I would expect being able to launch sky gpunode -c se01 --gpus K80 --cloud aws, then do some stuff, sky stop se01. But there's no way for me to restart the cluster again. Intuitively something like sky gpunode -c se01 should just restart the cluster. I know @concretevitamin is working on sky start to fix the re-use problem.

The other thing that should be considered is if we run something like this:

# starts a cluster named se01 with 1 V100 on AWS
sky gpunode -c se01 --gpus V100 --cloud aws

# do some stuff

sky stop se01

# now the same se01's instance type is changed 
# (equivalent to going into the AWS console and using "Change instance type")
# all data on the cluster's disk is the preserved
sky gpunode -c se01 --gpus V100:8

This seems like an intuitive behavior if we specify --gpus that are different from the GPUs we allocated initially, and resizing instances (at least in my experience) is a common thing.

In the future, we may also want to support changing clouds and having the data on that cluster migrate to the other cloud easily (but very low-pri for now).

Bug of post_setup_fn

Got this error when introducing a new example without setting up post_setup_fn. I think we shouldn't assume post_setup_fn exists in every task.

Traceback (most recent call last):
File "tpu_app.py", line 72, in <module>
File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/execution.py", line 409, in execute                                                                   
    runner.run()
File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/execution.py", line 281, in run                                                                       
    commands = fn(self.cluster_ips)
TypeError: 'NoneType' object is not callable

To fix this, we can add an one line check

if task.post_setup_fn is not None:

before this code segment
https://github.com/concretevitamin/sky-experiments/blob/154549c0eec432d366681ac063dacdbafdaa6475/prototype/sky/execution.py#L371-L374

.git directory is not synced to remote VM

I log into the remove VM launched by Sky.
However, the.git under my working directory was ignored and not transferred to the remote VM, which causes some inconvenience for code development.

(bert-sign) ubuntu@ray-sky-gcp-head-487a3051-compute:/tmp/workdir$ ls -alhrt .git
ls: cannot access '.git': No such file or directory

Is this an expected behavior?

Tasks have no default unique id/name

In sky.Task, we do not have any unique identifier for a task. The name field is perhaps intended to be an identifier, but if it is not specified by the user, it defaults to None.

The LocalDockerBackend requires a unique identifier for each task so it can assign container names and track associated volume mounts assigned during build time.

Possible solution: Randomly generate a task name if it is not specified by the user.

https://github.com/concretevitamin/sky-experiments/blob/517b97d102108f7a8cb2a0dbd1061f5b790f8af2/prototype/sky/task.py#L28-L44

`sky attach [--port-forward] <cluster_name>`

Can we have the CLI as the title?

Reason: The sky status only shows the <cluster_name>. If a user would like to connect to the cluster or do the port forwarding, they need to find the path of the config file by themselves. Directly using the sky attach <cluster_name> might be more intuitive.

Missing `cluster_name` when `tpu_type` is found.

https://github.com/concretevitamin/sky-experiments/blob/0910ed15da4c37f346dbd8e70a2d1c1f2717df01/prototype/sky/backends/backend_utils.py#L150-L154

It seems that the cluster_name is not set here, which will fail the tpu_app.py @gmittal .

`sky cancel` needs a cancel all option

Is it able to support sky cancel all or some task name matching, to make it easier for users to kill tasks in batch?

QOL Improvement, format.sh should also run Pylint

^ As title says

Azure K80 provision fails

Azure will raise the following error when launching an instance with K80. We may need to update the azure-ray.yml.j2 to avoid this error.

Exception Details:	(BadRequest) {
	  "error": {
	    "code": "BadRequest",
	    "message": "The selected VM size 'Standard_NC6_Promo' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size."
	  }
	}
	Code: BadRequest
	Message: {
	  "error": {
	    "code": "BadRequest",
	    "message": "The selected VM size 'Standard_NC6_Promo' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size."
	  }
	}

Move local Sky keys/files/metadata to ~/.sky

Makes uninstallation easier and keeps things organized. Maybe not a huge deal, but as a user I feel a bit weird having a Sky app create and save a new SSH key under ~/.ssh/sky-key without me knowing.

Handling error from step.run()

We need to catch errors from each step.run().
https://github.com/concretevitamin/sky-experiments/blob/3e9bac359da41187060b348be48a6400704f25aa/prototype/sky/execution.py#L169

Apparently ray up failed but sky still shows execution finished.

  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node.py", line 267, in <listcomp>
    self.create_instance(
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/ray/autoscaler/_private/gcp/node.py", line 440, in create_instance
    operation = self.resource.instances().insert(
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/weichiang/opt/miniconda3/envs/sky/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?alt=json returned "The resource 'projects/intercloud-320520/zones/us-west1-a/acceleratorTypes/nvidia-tesla-tpu-v3-8' was not found". Details: "[{'message': "The resource 'projects/intercloud-320520/zones/us-west1-a/acceleratorTypes/nvidia-tesla-tpu-v3-8' was not found", 'domain': 'global', 'reason': 'notFound'}]">
Step 000_provision finished

---------------------------
  Sky execution finished
---------------------------

TPU info too long in `sky status`

This looks a bit ugly.

Dataset Memory > Instance Memory

We need to account for the case when the downloaded dataset on the VM size exceeds the size of the VM storage capacity.

Non-Deterministic State for AWS AMI

Sometimes, when the AWS EC2 instance is launched, there is a lock for apt-get and dpkg package handlers.

Code to replicate:

wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo apt-get update
sudo dpkg -i ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

Error:

dpkg: error: dpkg frontend is locked by another process

There is no process/PID that is taking the lock. lsof /var/lib/dpkg/lock-frontend

No `/tmp/workdir` when running `examples/multi_hostname.py` twice

To reproduce:

Run python examples/multi_hostname.py
Run examples/multi_hostname.py again without terminating the instances before running.

The error message is as follows:

I 11-24 11:06:40 cloud_vm_ray_backend.py:656] Starting Task execution.
2021-11-24 11:06:42,780 INFO util.py:282 -- setting max workers for head node type to 0
bash: cd: /tmp/workdir: No such file or directory
Loaded cached provider configuration
Shared connection to 54.149.150.178 closed.
Error: Command failed:
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.

It might because ray will not run the setup_commands again if it finds the instances already exist?

Make config/aws.yml task-specific

Currently its a global config that gets modified each time we provision resources for a task. What happens if we want to launch multiple tasks at the same time?

Use Sky to provision 8 VMs each with 8 GPUs

Lianmin asked us if we can help him find a region with 8x p3.16xlarge VMs (each with 8 GPUs). They have been doing this search manually.

Our current provisioner's retry strategy doesn't work because we consider provisioning successful when the head node setup is complete. But due to Ray's annoying "provision head -> setup head -> provision workers -> setup workers" strategy, if we could not launch the other 7 nodes, we would hang at "waiting for cluster to be ready" state, without moving on to trying a different region.

Would be great to support their use case.

Stream logs to files during VM provisioning and setup

Is it expected behavior that there is no output for VM setup during provisioning? I think it might be hard for the user to debug for the setup commends. Also, I failed to find the output anywhere as a file.

I am running python ./examples/resnet_app.py, and the output is as below.

I 11-12 10:57:38 cloud_vm_ray_backend.py:213] If this takes longer than ~30 seconds, provisioning is likely successful. Setting up may take a few minutes.
I 11-12 10:57:38 backend_utils.py:38] Created or updated file config/aws-ray.yml
I 11-12 11:02:52 cloud_vm_ray_backend.py:238] Successfully provisioned or found existing VM(s). Setup completed.

Key errors when the resource is not available on the cloud

We need a complete directory for each cloud and code logic to search for resources available. For example, when specifying acc='TPU', this method in AWS should return an empty list.

https://github.com/concretevitamin/sky-experiments/blob/3e9bac359da41187060b348be48a6400704f25aa/prototype/sky/clouds/aws.py#L88

#### train ####
Traceback (most recent call last):
  File "resnet_app.py", line 77, in <module>
    dag = sky.Optimizer.optimize(dag, minimize=sky.Optimizer.COST)
  File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/optimizer.py", line 59, in optimize
    optimized_dag, best_plan = Optimizer._optimize_cost(
  File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/optimizer.py", line 143, in _optimize_cost
    launchable_resources = sky.registry.fill_in_launchable_resources(
  File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/registry.py", line 27, in fill_in_launchable_resources
    cloud.get_feasible_launchable_resources(resources))
  File "/Users/weichiang/Workspace/research/sky-experiments/prototype/sky/clouds/aws.py", line 88, in get_feasible_launchable_resources
    return _make(directory[(acc, acc_count)])
KeyError: ('TPU', 1)

Create shared decorator for cpunode/tpunode/gpunode

The click options/args/config decorators are all basically identical for each of these commands in the CLI.

ParTask bug: `conda activate` does not work for the task.run

Code for reproduce:

import sky
from sky import clouds

with sky.Dag() as dag:
    # The working directory contains all code and will be synced to remote.
    workdir = '~/Downloads/tpu'

    # The setup command.  Will be run under the working directory.
    setup = 'pip install --upgrade pip && \
        conda init bash && \
        conda activate resnet || \
          (conda create -n resnet python=3.7 -y && \
           conda activate resnet && \
           pip install tensorflow==2.4.0 pyyaml && \
           cd models && pip install -e .)'

    # The command to run.  Will be run under the working directory.
    run = 'conda activate resnet'

    conda1 = sky.Task(
        'activate_1',
        workdir=workdir,
        setup=setup,
        run=run,
    )
    conda1.set_resources({
        sky.Resources(clouds.AWS(), accelerators='V100'),
    })

    # Run the training and tensorboard in parallel.
    task = sky.ParTask([conda1])
    total = sky.Resources(clouds.AWS(), accelerators={'V100': 1})
    task.set_resources(total)

dag = sky.Optimizer.optimize(dag, minimize=sky.Optimizer.COST)
# sky.execute(dag, dryrun=True)
sky.execute(dag)

Error:

(pid=20548) 
(pid=20548) CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
(pid=20548) To initialize your shell, run
(pid=20548) 
(pid=20548)     $ conda init <SHELL_NAME>
(pid=20548) 
(pid=20548) Currently supported shells are:
(pid=20548)   - bash
(pid=20548)   - fish
(pid=20548)   - tcsh
(pid=20548)   - xonsh
(pid=20548)   - zsh
(pid=20548)   - powershell
(pid=20548) 
(pid=20548) See 'conda init --help' for more information and options.
(pid=20548) 
(pid=20548) IMPORTANT: You may need to close and restart your shell after running 'conda init'.
(pid=20548) 
(pid=20548) 
Traceback (most recent call last):
  File "/tmp/sky_app_lee66joq", line 15, in <module>
    ray.get(futures)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::activate_1 (pid=20548, ip=172.31.23.63)
  File "/tmp/sky_app_lee66joq", line 11, in <lambda>
    shell=True, check=True)) \
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'cd /tmp/workdir && conda activate resnet' returned non-zero exit status 1.

YAML support for sky.Storage

Sky storage (#63) will also require YAML support so Storage objects can be defined as a part of the YAML schema. Maybe something like:

name: resnet-app
workdir: ~/Downloads/tpu

resources:
  cloud: aws
  instance_type: p3.2xlarge

storage:
  name: imagenet-bucket
  source_path=s3://imagenet-bucket
  mount_path=/imagenet/

Don't run optimizer when using cached cluster (e.g. `sky gpunode`)

The optimizer running again when we're re-attaching to an existing cluster (e.g. sky gpunode) is a weird thing to have happen.

Problems of Azure related to `ray` bug

The Azure is now usable, though there are two problems related to ray that needs to be further investigated:

ray up for Azure will ignore the cluster_name, reusing the existing instance. [Fixed in #111 ]
After ray up commend is finished, it sometimes takes a while for the cluster to be available for ray, especially for Azure, which will cause the ray exec and ray get_head-node-ip in our code to fail.

local ray version: 1.9.0
remote ray version: 1.7.0

Sky Authentication Python Test

The test for authentication is missing for Sky. We are limited by several factors:

Creating an account for the Python Test bot on our Repo
Getting the credentials (access key, secret key) and being able to add/remove key pairs from Amazon EC2 service via Python Test Bot

Long waiting time for iterative code development

I was using sky for my code development.
After successfully launched a VM, I changed some codes and re-run

python examples/my_app.py

However, I need to wait more than 5 mins for Ray autoscaler to re-run those unnecessary setup commands, which may be too slow for iterative code development.
And I ended up just ssh logging into the VM and execute the command myself. however, such way would make the codebase asynchronous and I had to manually upload the codes myself.
Can we implement sth like sky exec to just sync the code + run commands for users?

CLI feature requests

It seems we can sky run <yaml file>. Can we also have sky run <python file> for convenience?

-- updated --
Whiteboard doc

`sky status`

sky status

      | Cluster name | Launched | Resources | Status               |
      | sky-xxxx     | 2021-12  | 4 V100    | [stopped | running]  |

For the resources column, an alternative is to show the detailed info using the following commend
- sky status --cluster <cluster_name>
sky status --refresh Partially satisfied with #653
- Update the status from the cloud services, since users may manually stop or terminate some machines.
- Delete the whole cluster if the some node in the cluster is already terminated
Lower priority: sky monitor --cluster cluster_name --gpu --cpu
- Show the resource utilization of the gpus and cpus, like the gpuwatch blaze on the RISE servers.
  This would be very useful for the ML people.

`sky down`

sky down --all
- Stop all clusters, and change the status to be stopped
sky down --all --terminate/kill
- Terminate all clusters, and remove the clusters from the status table
  Reason: user does not need to re-do the setup on the stopped instances, but still can save cost by stopping the cluster. By terminating the clusters, the user can save additional costs from the EBS.

`sky info`

~~sky info~~ Done by sky show-gpus
- This is for showing the resources on the clouds. (similar to sinfo in slurm)
- Show all instances on all available clouds.
sky info --accelerator V100 --cloud GCP
- Show all instances with V100 on GCP

`sky cpu/gpunode`

Need an option that keeps the instance running, when the interactive session is logged out. This would be good when the user has unstable internet connection and want to connect back to the previous debugging session.

Switch TPU implementation to ray up yaml

According to here, ray up supports the tpu configuration in the cluster_yaml. Should we consider switching our implementation to the yaml file for clarity and simplicity? Also, with the cluster_yaml we can make the TPU preemptible, as mentioned in #113. @infwinston

Explicitly execute multiple commands on the head node in parallel.

I am trying to add an example to do training and start a tensorboard at the same time. Is it possible for us to have an explicit API (like the one below) that lets us run multiple commands in parallel on the head node, instead of using the implicit scheduling of ParTask? Though we can use the tensorboard --logdir . & to run it in the background, I am not sure if it is an elegant solution.

train = sky.Task(run='python train.py', num_nodes=2)
tensorboard = sky.Task(run='tensorboard --logdir .')
sky.add_child(train, [tensorboard])

Optimizer should take num_nodes into account

...e.g., during pricing calculation.

Scrambled spaces in logging

There's also a situation where @infwinston and I saw - sometimes after Ctrl-C, or after tailing the log, the lines contain scrambled spaces (like we saw before) or the shell's cursor disappears. Did you see that issue?

Originally posted by @concretevitamin in #69 (comment)

Fork sky.execute when stream_logs=False

Mainly looking at this from a UX perspective, but if you turn stream_logs=False (only pipe output to file, not stdout) you get this awkward interaction where sky execute still has control of the main terminal and it looks like it's just hanging while all the output goes to a logfile. Instead, we should run the process in the background and return control to the user's shell session so they can do other things (e.g. tail the logfile, launch another run, launch a gpunode, write code, etc.).

requirements.txt

We need either requirements.txt for installing dependencies for three clouds (awscli, azure-cli, gcloud), or some better warning e.g. "awscli is not available so you cannot launch stuff on AWS".

`sky gpunode` unreliable for provisioning a GPU instance

sky gpunode -c skylab –gpus V100

# Above didn’t work because it chose GCP...asked for a K80 instead.
sky gpunode -c skylab –gpus K80

# Didn’t work because Azure doesn’t work with K80...forced GCP instead
sky gpunode -c skylab –gpus K80 –cloud gcp

Also, the time to provision is really bad (potentially to the point where a user may think the system is broken) for Azure.

cc @Michaelvll

Support spot pricing history in service directory

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances-history.html
https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-spot-price-history.html

cluster is not registered in status, if killed while setting up

As title. python examples/multi_echo.py and Ctrl-C when it is installing ray. The instance is already launched, but not registered in the status.

Support multiple `tpunode`s / multiple runs of `tpu_app`

sky tpunode  # ...Success.

sky tpunode -c new 

# The second command, however, reuses the tpu name "sky_tpu".
# which will give:

I 12-28 11:58:55 cloud_vm_ray_backend.py:560] Launching on GCP us-central1 (us-central1-a)
E 12-28 11:59:03 cloud_vm_ray_backend.py:511] Updated property [core/project].
E 12-28 11:59:03 cloud_vm_ray_backend.py:511] ERROR: (gcloud.compute.tpus.create) ALREADY_EXISTS: Resource 'projects/intercloud-320520/locations/us-central1-a/nodes/sky_tpu' already exists
E 12-28 11:59:03 cloud_vm_ray_backend.py:511] - '@type': type.googleapis.com/google.rpc.ResourceInfo
E 12-28 11:59:03 cloud_vm_ray_backend.py:511]   resourceName: projects/intercloud-320520/locations/us-central1-a/nodes/sky_tpu
E 12-28 11:59:03 cloud_vm_ray_backend.py:511]
I 12-28 11:59:03 cloud_vm_ray_backend.py:518] TPU sky_tpu already exists; skipped creation.

this means 2 host VMs are created but they connect (? to verify) to the same underlying TPU. called "sky_tpu".

Error when creating directory during filemount

I specify file_mounts like below

file_mounts = {
        '/data/weichiang/dataset/': '/data/weichiang/dataset/',
    }

the autoscaler setup failed and I got the permission denied error.

mkdir: cannot create directory ‘/data’: Permission denied

I had to change all the data paths in my codebase to make it work, which may be too inconvenient for users.
Can we grant root permission during filemount?

Disable `sky [gpu|tpu|cpu]node` workdir syncing

I often forget about this feature and it causes issues when I run it in folders that I don't actually want to sync (e.g. my homedir).