Comments (19)
cc @cblmemo
from skypilot.
Hi @sean-styleai! Thanks for reporting the issue. Could you try to sky launch this YAML directly and see if the error persists? Also, could you share the output of sky status on your local laptop (for more information on the SkyServe controller spec)?
from skypilot.
Thank you for such a fast response! @concretevitamin @cblmemo
@cblmemo It looks like an error still persists?
Last few lines of sky launch:
I 03-19 16:20:42 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 16:21:31 provisioner.py:553] Successfully provisioned cluster: sky-878e-namsangho
⠸ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 16:21:47 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 16:21:47 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-16-10-03-637914/workdir_sync.log
I 03-19 16:22:19 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-878e-namsangho 40 secs ago 1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000']) UP - sky launch -n studio_api ...
sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-03-19-16-10-03-637914 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-03-19-16-10-03-637914/setup-34.73.11.91.log
****** START Last lines of setup output ******
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
******* END Last lines of setup output *******
output of sky status
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-878e-namsangho 2 mins ago 1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000']) UP - sky launch -n studio_api ...
Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)
Services
No live services. (See: sky serve -h)
from skypilot.
Hmm, it seems like the password is not correct? Can you successfully run the setup and run commands on your local laptop?
from skypilot.
@cblmemo
Oh, I'm so sorry. The result above seems to be caused by incorrect Docker auth.
I will recheck and update the comments.
Thank you!
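A side note on the log above: "bash: !dwk: event not found" is the message interactive bash prints when history expansion fails on a "!" sequence, so one possible contributor to the failed login is a "!" in a value that was not single-quoted. This is only a reading of the log and is not confirmed anywhere in the thread. A minimal Python sketch of quoting a value before it is written into a bash setup line follows; the variable name DOCKER_PASSWORD and the sample value are hypothetical.

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Return a bash `export` line with the value safely quoted.

    shlex.quote wraps any value containing shell-special characters
    (including "!") in single quotes, and bash does not perform
    history expansion inside single quotes.
    """
    return f"export {name}={shlex.quote(value)}"


# Hypothetical variable name and value, for illustration only.
print(export_line("DOCKER_PASSWORD", "abc!dwk"))
```

Passing the secret on stdin, as the docker warning itself suggests with --password-stdin, sidesteps the shell quoting question entirely.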
from skypilot.
@cblmemo sky launch works fine!
A few lines of the sky launch output:
I 03-19 17:19:23 provisioner.py:76] Launching on GCP us-east4 (us-east4-a)
I 03-19 17:22:10 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 17:23:00 provisioner.py:553] Successfully provisioned cluster: sky-d8d7-namsangho
⠹ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 17:23:16 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 17:23:16 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-17-19-14-933366/workdir_sync.log
I 03-19 17:23:43 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/gcpuser/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
I 03-19 17:23:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-19 17:24:01 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1
I 03-19 08:24:04 log_lib.py:392] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.0.11']
output of sky status
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-d8d7-namsangho 16 mins ago 1x GCP(g2-standard-4, {'L4': 1}, ports=['8000']) UP - sky launch --env-file /Us...
from skypilot.
With the same settings, I retried sky serve up.
The controller seems to work fine, but the service replicas seem to have the same problem!
Below is the result of sky serve logs studio_api 1; replica creation is retried repeatedly.
I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:00 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:53 cloud_vm_ray_backend.py:4266] Processing file mounts.
I 03-19 08:42:55 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.)
I 03-19 08:42:55 replica_managers.py:121] Traceback: Traceback (most recent call last):
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-19 08:42:55 replica_managers.py:121] sky.launch(task,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-19 08:42:55 replica_managers.py:121] return _execute(
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-19 08:42:55 replica_managers.py:121] backend.sync_file_mounts(handle, task.file_mounts,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-19 08:42:55 replica_managers.py:121] return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121] return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121] self._execute_file_mounts(handle, all_file_mounts)
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-19 08:42:55 replica_managers.py:121] if storage.is_directory(src):
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-19 08:42:55 replica_managers.py:121] p = subprocess.run(command,
I 03-19 08:42:55 replica_managers.py:121] File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-19 08:42:55 replica_managers.py:121] raise CalledProcessError(retcode, process.args,
I 03-19 08:42:55 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.
from skypilot.
Thanks for reporting this! Could you share the output of sky -v and sky -c as well?
from skypilot.
@sean-styleai Also, could you share the current output of sky status, including the controller information?
from skypilot.
@cblmemo
Here it is. Thank you for your fast response!
❯ sky -v
skypilot, version 1.0.0.dev20240317
❯ sky -c
skypilot, commit 823999af850ee93138f45d01abba6c54a93d3c1e
output of sky status
❯ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-serve-controller-b61da251 2 mins ago 1x GCP(n2-standard-4, disk_size=200, ports=['30001-30100']) UP 10m sky serve up -n studio_api...
Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
studio_api - - NO_REPLICA 0/1 34.172.38.176:30001
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
studio_api 1 1 - - - PROVISIONING -
* To see detailed service status: sky serve status -a
* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh
from skypilot.
Hmm, it seems I cannot reproduce this error on the same commit. Could you SSH into the controller, run the following command, and share the output with me?
pushd /tmp &>/dev/null && { gcloud --help > /dev/null 2>&1 || { mkdir -p ~/.sky/logs && wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log && tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log && rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log && mv google-cloud-sdk ~/ && ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 && echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc && source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } && popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38
from skypilot.
@cblmemo
The output of the above command is as follows:
Your "OAuth 2.0 Service Account" credentials are invalid. Please run
$ gcloud auth login
OSError: No such file or directory.
After manually running gcloud auth login, I get the following output:
gs://skypilot-workdir-namsangho-7141c640
gs://skypilot-workdir-namsangho-7141c640/.dockerignore
gs://skypilot-workdir-namsangho-7141c640/.gitignore
gs://skypilot-workdir-namsangho-7141c640/README.md
gs://skypilot-workdir-namsangho-7141c640/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-api.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-7141c640/requirements.txt
gs://skypilot-workdir-namsangho-7141c640/assets/
gs://skypilot-workdir-namsangho-7141c640/dockerfiles/
gs://skypilot-workdir-namsangho-7141c640/infra/
gs://skypilot-workdir-namsangho-7141c640/notebooks/
gs://skypilot-workdir-namsangho-7141c640/scripts/
gs://skypilot-workdir-namsangho-7141c640/src/
from skypilot.
Could you run the command on your local laptop again? If it also fails there, that might be the reason...
from skypilot.
@cblmemo
It works fine locally!
gs://skypilot-workdir-namsangho-89bfeef2/.dockerignore
gs://skypilot-workdir-namsangho-89bfeef2/.gitignore
gs://skypilot-workdir-namsangho-89bfeef2/README.md
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements.txt
gs://skypilot-workdir-namsangho-89bfeef2/assets/
gs://skypilot-workdir-namsangho-89bfeef2/dockerfiles/
gs://skypilot-workdir-namsangho-89bfeef2/infra/
gs://skypilot-workdir-namsangho-89bfeef2/notebooks/
gs://skypilot-workdir-namsangho-89bfeef2/scripts/
gs://skypilot-workdir-namsangho-89bfeef2/src/
from skypilot.
@cblmemo
Is this issue related to the transfer of GCP service account (SA) data from the controller to the service replica?
from skypilot.
Sorry for the late reply; I was a little busy recently. Given that you cannot access your GCS storage on the controller, it seems more likely that the SA credentials are not synced correctly from the local laptop to the controller. cc @Michaelvll for a look here 👀 Are the SA credentials included in the following directory?
Lines 39 to 41 in acb49ee
from skypilot.
Hi @sean-styleai, I experienced a similar issue with the managed spot jobs controller. What worked for me was deleting the gcloud directory in the .config directory. After that, I executed the sky spot launch command again, and everything worked as expected. It seems like this might be a workaround worth trying.
from skypilot.
I experienced the same issue. Basically, gsutil fails with Your "OAuth 2.0 Service Account" credentials are invalid in the controller VM.
I SSHed into the VM and tried several things:
- gcloud storage ls works, but gsutil ls fails
- Installed the latest gcloud via https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-474.0.0-linux-x86_64.tar.gz, but gsutil ls still fails
- Ran apt update, then apt upgrade google-cloud-sdk, and then gsutil ls works
I'm not an expert on this; any thoughts?
from skypilot.
I tried to debug it in the controller VM; gsutil -D ls gives some more detail. I then found that the credential gs_service_key_file in ~/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/.boto points to a path on my local laptop (/Users/xxx/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/adc.json) rather than /home/gcpuser/.config/...
So basically it copied all the credentials to the remote server without ensuring that the paths are correct there.
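The finding above can be checked mechanically on the controller. This is a minimal sketch, assuming the .boto files use the usual INI layout with a [Credentials] section and that the directory layout matches the one described in this comment. find_stale_key_paths is a hypothetical helper, not part of SkyPilot or gcloud.

```python
import configparser
from pathlib import Path
from typing import List, Tuple


def find_stale_key_paths(gcloud_dir: Path) -> List[Tuple[Path, str]]:
    """Return (boto_file, key_path) pairs whose key file is missing here.

    Scans <gcloud_dir>/legacy_credentials/*/.boto and reports every
    gs_service_key_file entry that points to a path that does not exist
    on the current machine (for example a /Users/... path on a Linux VM).
    """
    stale = []
    for boto in sorted(gcloud_dir.glob("legacy_credentials/*/.boto")):
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(boto)
        # Assumed layout: [Credentials] section with gs_service_key_file.
        key_path = parser.get("Credentials", "gs_service_key_file", fallback=None)
        if key_path and not Path(key_path).expanduser().exists():
            stale.append((boto, key_path))
    return stale


if __name__ == "__main__":
    for boto, key_path in find_stale_key_paths(Path.home() / ".config" / "gcloud"):
        print(f"{boto}: gs_service_key_file points to missing file {key_path}")
```

Run on the controller VM, any line it prints would indicate a credential file that was copied over with a path from another machine still inside it.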
from skypilot.