
aviweit / skypilot

This project is forked from skypilot-org/skypilot


SkyPilot is a framework for easily running machine learning workloads on any cloud through a unified interface.

Home Page: https://skypilot.readthedocs.io

License: Apache License 2.0

Shell 0.72% Python 96.48% Dockerfile 0.01% Jinja 2.60% HTML 0.19%

skypilot's People

Contributors

alex000kim, asaiacai, cblmemo, cohen-j-omer, concretevitamin, dongreenberg, ewzeng, franklsf95, gbmarc1, gmittal, hemildesai, hysunhe, hzeng-0, infwinston, iojw, landscapepainter, lhqing, maoziming, michaelvll, michaelzhiluo, mraheja, mtaku3, romilbhardwaj, saihtaungkham, saikrishna-achalla, shethhriday29, sumanthgenz, sunny0826, suquark, woosukkwon

skypilot's Issues

[K8s] Handle exceptions and pod leakage for evicted containers

If I understand correctly, we want to keep the behavior where sky status reports a failed cluster in the INIT state: when a user runs sky status -r, the corrupted cluster is set to INIT, and the user then deletes its resources with sky down.

I looked into this further and found some cloud-specific handling in this method: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L3668. K8s falls through to the generic 'else' branch: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L3849
The ray down command is handled inside the Ray stack: https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/commands.py#L394, which retrieves the non-terminated nodes: https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/commands.py#L439. The K8s provider does not return the pod because it is not Running, so the pod is never removed.

[ks8_cloud_beta1] Launch hangs due to wrong NodePort being used.

2023-08-17 12:46:47,031 INFO updater.py:316 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-08-17 12:46:52,082 VINFO command_runner.py:373 -- Running `uptime`
2023-08-17 12:46:52,082 VVINFO command_runner.py:375 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8290c29bcd/3fca379b3f/%C -o ControlPersist=300s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 30025 -W %h:%p [email protected] -o Port=22 -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
kex_exchange_identification: Connection closed by remote host
2023-08-17 12:46:52,129 INFO updater.py:316 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-08-17 12:46:57,179 VINFO command_runner.py:373 -- Running `uptime`
2023-08-17 12:46:57,179 VVINFO command_runner.py:375 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8290c29bcd/3fca379b3f/%C -o ControlPersist=300s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 30025 -W %h:%p [email protected] -o Port=22 -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
kex_exchange_identification: Connection closed by remote host
2023-08-17 12:46:57,222 INFO updater.py:316 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-08-17 12:47:02,269 VINFO command_runner.py:373 -- Running `uptime`
2023-08-17 12:47:02,269 VVINFO command_runner.py:375 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8290c29bcd/3fca379b3f/%C -o ControlPersist=300s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 30025 -W %h:%p [email protected] -o Port=22 -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
kex_exchange_identification: Connection closed by remote host
2023-08-17 12:47:02,307 INFO updater.py:316 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
2023-08-17 12:47:07,357 VINFO command_runner.py:373 -- Running `uptime`
2023-08-17 12:47:07,357 VVINFO command_runner.py:375 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8290c29bcd/3fca379b3f/%C -o ControlPersist=300s -o ProxyCommand=ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -p 30025 -W %h:%p [email protected] -o Port=22 -o ConnectTimeout=10s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`

Steps to recreate:

  1. sky local up
  2. Launch some tasks so that several Ray clusters are created.
  3. sky local down
  4. sky local up
  5. sky status. It still lists the clusters provisioned earlier.
  6. Launch a task against one of those clusters.
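
One possible reading of these steps, offered as an assumption and not a confirmed diagnosis, is that the cluster records survive sky local down while the services behind them do not, so a later launch reuses a port that no longer points at anything. The sketch below shows such a staleness check; stale_port, cached_ports and live_ports are hypothetical names, and only the port number 30025 comes from the log above.

```python
from typing import Dict, Optional


def stale_port(cluster: str,
               cached_ports: Dict[str, int],
               live_ports: Dict[str, int]) -> Optional[int]:
    """Return the cached port if it no longer matches the live service, else None."""
    cached = cached_ports.get(cluster)
    live = live_ports.get(cluster)
    if cached is not None and cached != live:
        return cached
    return None


# Hypothetical state after recreating the local cluster: the record is still
# cached, but no live service exists for it any more.
cached_ports = {"cluster-a": 30025}
live_ports: Dict[str, int] = {}

print(stale_port("cluster-a", cached_ports, live_ports))
```

A check like this before opening the SSH connection would let the launch fail fast instead of retrying against a dead port.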
