Comments (19)

concretevitamin commented on June 11, 2024

cc @cblmemo

from skypilot.

cblmemo commented on June 11, 2024

Hi @sean-styleai! Thanks for reporting the issue. Could you try running sky launch directly on this YAML to see if the error persists? Also, could you share the output of sky status on your local laptop (for more information on the SkyServe controller spec)?

sean-styleai commented on June 11, 2024

Thank you for the quick response! @concretevitamin @cblmemo
@cblmemo It seems the error persists.

Last few lines of the sky launch output:

I 03-19 16:20:42 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 16:21:31 provisioner.py:553] Successfully provisioned cluster: sky-878e-namsangho
⠸ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 16:21:47 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 16:21:47 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-16-10-03-637914/workdir_sync.log
I 03-19 16:22:19 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
Clusters
NAME                LAUNCHED     RESOURCES                                               STATUS  AUTOSTOP  COMMAND
sky-878e-namsangho  40 secs ago  1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000'])  UP      -         sky launch -n studio_api ...

sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-03-19-16-10-03-637914 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-03-19-16-10-03-637914/setup-34.73.11.91.log

****** START Last lines of setup output ******
bash: !dwk: event not found
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry-1.docker.io/v2/": unauthorized: incorrect username or password
******* END Last lines of setup output *******

Output of sky status:

❯ sky status
Clusters
NAME                LAUNCHED    RESOURCES                                               STATUS  AUTOSTOP  COMMAND
sky-878e-namsangho  2 mins ago  1x GCP(g2-standard-4[Spot], {'L4': 1}, ports=['8000'])  UP      -         sky launch -n studio_api ...

Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)

Services
No live services. (See: sky serve -h)

cblmemo commented on June 11, 2024

Hmm, it seems like the password is not correct? Could you successfully run the setup and run commands on your local laptop?

sean-styleai commented on June 11, 2024

@cblmemo
Oh, I'm so sorry. The result above seems to have been caused by incorrect Docker auth.
I will recheck and update my comments.
Thank you!
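Both failing runs show bash printing "!dwk: event not found" right before Docker warns that --password on the command line is insecure and suggests --password-stdin. One plausible explanation (an assumption, not confirmed in this thread) is that the password contained a "!" sequence, which the interactive /bin/bash -i used for setup treats as history expansion. A minimal Python sketch that avoids shell parsing altogether by passing the password on stdin; the helper names build_docker_login and docker_login are hypothetical, and the username, password, and registry are placeholders:

```python
import subprocess
from typing import List, Tuple


def build_docker_login(username: str, password: str,
                       registry: str = "") -> Tuple[List[str], bytes]:
    """Return (argv, stdin_bytes) for a 'docker login --password-stdin' call.

    The password never appears in argv, so no shell parses it and characters
    such as '!' cannot trigger bash history expansion.
    """
    argv = ["docker", "login", "--username", username, "--password-stdin"]
    if registry:
        # Optional SERVER argument; Docker Hub is used when it is omitted.
        argv.append(registry)
    return argv, password.encode("utf-8")


def docker_login(username: str, password: str, registry: str = "") -> None:
    """Run docker login without a shell; raises CalledProcessError on failure."""
    argv, stdin_bytes = build_docker_login(username, password, registry)
    # shell=False (the default): argv is executed directly, not through bash.
    subprocess.run(argv, input=stdin_bytes, check=True)
```

Because the password goes to the process's stdin instead of argv, neither bash history expansion nor the process list ever sees it. In a real task the credentials would come from wherever they are already kept, for example environment variables loaded with --env-file.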

sean-styleai commented on June 11, 2024

@cblmemo
sky launch works fine!

A few lines of the sky launch output:

I 03-19 17:19:23 provisioner.py:76] Launching on GCP us-east4 (us-east4-a)
I 03-19 17:22:10 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 17:23:00 provisioner.py:553] Successfully provisioned cluster: sky-d8d7-namsangho
⠹ Launching - Opening new portsWARNING:googleapiclient.http:Encountered 403 Forbidden with reason "PERMISSION_DENIED"
I 03-19 17:23:16 cloud_vm_ray_backend.py:2968] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 03-19 17:23:16 cloud_vm_ray_backend.py:2976] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-03-19-17-19-14-933366/workdir_sync.log
I 03-19 17:23:43 cloud_vm_ray_backend.py:3076] Running setup on 1 node.
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/gcpuser/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
I 03-19 17:23:50 cloud_vm_ray_backend.py:3089] Setup completed.
I 03-19 17:24:01 cloud_vm_ray_backend.py:3172] Job submitted with Job ID: 1
I 03-19 08:24:04 log_lib.py:392] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.150.0.11']

Output of sky status:

❯ sky status
Clusters
NAME                           LAUNCHED     RESOURCES                                                    STATUS  AUTOSTOP  COMMAND
sky-d8d7-namsangho             16 mins ago  1x GCP(g2-standard-4, {'L4': 1}, ports=['8000'])             UP      -         sky launch --env-file /Us...

sean-styleai commented on June 11, 2024

@cblmemo

With the same settings, I retried sky serve up.

The controller seems to work fine, but the service replicas seem to have the same problem!

Result of sky serve logs studio_api 1 (the replica keeps getting re-created):

I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:00 provisioner.py:451] Successfully provisioned or found existing instance.
I 03-19 08:42:45 provisioner.py:553] Successfully provisioned cluster: studio_api-1
I 03-19 08:42:53 cloud_vm_ray_backend.py:4266] Processing file mounts.

I 03-19 08:42:55 replica_managers.py:118] Failed to launch the sky serve replica cluster with error: subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.)
I 03-19 08:42:55 replica_managers.py:121]   Traceback: Traceback (most recent call last):
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/serve/replica_managers.py", line 95, in launch_cluster
I 03-19 08:42:55 replica_managers.py:121]     sky.launch(task,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 501, in launch
I 03-19 08:42:55 replica_managers.py:121]     return _execute(
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/execution.py", line 334, in _execute
I 03-19 08:42:55 replica_managers.py:121]     backend.sync_file_mounts(handle, task.file_mounts,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 370, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/utils/common_utils.py", line 349, in _record
I 03-19 08:42:55 replica_managers.py:121]     return f(*args, **kwargs)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/backend.py", line 73, in sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 2990, in _sync_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     self._execute_file_mounts(handle, all_file_mounts)
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/backends/cloud_vm_ray_backend.py", line 4341, in _execute_file_mounts
I 03-19 08:42:55 replica_managers.py:121]     if storage.is_directory(src):
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/site-packages/sky/cloud_stores.py", line 116, in is_directory
I 03-19 08:42:55 replica_managers.py:121]     p = subprocess.run(command,
I 03-19 08:42:55 replica_managers.py:121]   File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
I 03-19 08:42:55 replica_managers.py:121]     raise CalledProcessError(retcode, process.args,
I 03-19 08:42:55 replica_managers.py:121] subprocess.CalledProcessError: Command 'pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38' returned non-zero exit status 1.

cblmemo commented on June 11, 2024

Thanks for reporting this! Could you share the output of sky -v and sky -c as well?

cblmemo commented on June 11, 2024

@sean-styleai Also, could you share the current output of sky status, including the controller information?

sean-styleai commented on June 11, 2024

@cblmemo
Here it is. Thank you for your fast response!

❯ sky -v
skypilot, version 1.0.0.dev20240317
❯ sky -c
skypilot, commit 823999af850ee93138f45d01abba6c54a93d3c1e

Output of sky status:

❯ sky status
Clusters
NAME                           LAUNCHED    RESOURCES                                                    STATUS  AUTOSTOP  COMMAND
sky-serve-controller-b61da251  2 mins ago  1x GCP(n2-standard-4, disk_size=200, ports=['30001-30100'])  UP      10m       sky serve up -n studio_api...

Managed spot jobs
No in-progress spot jobs. (See: sky spot -h)

Services
NAME        VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT
studio_api  -        -       NO_REPLICA  0/1       34.172.38.176:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP  LAUNCHED  RESOURCES  STATUS        REGION
studio_api    1   1        -   -         -          PROVISIONING  -

* To see detailed service status: sky serve status -a
* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh

cblmemo commented on June 11, 2024

Hmm, I cannot reproduce this error on the same commit. Could you SSH into the controller, run the following command, and share the output with me?

pushd /tmp &>/dev/null &&     { gcloud --help > /dev/null 2>&1 ||     { mkdir -p ~/.sky/logs &&     wget --quiet https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-424.0.0-linux-x86_64.tar.gz > ~/.sky/logs/gcloud_installation.log &&     tar xzf google-cloud-sdk-424.0.0-linux-x86_64.tar.gz >> ~/.sky/logs/gcloud_installation.log &&     rm -rf ~/google-cloud-sdk >> ~/.sky/logs/gcloud_installation.log  &&     mv google-cloud-sdk ~/ &&     ~/google-cloud-sdk/install.sh -q >> ~/.sky/logs/gcloud_installation.log 2>&1 &&     echo "source ~/google-cloud-sdk/path.bash.inc > /dev/null 2>&1" >> ~/.bashrc &&     source ~/google-cloud-sdk/path.bash.inc >> ~/.sky/logs/gcloud_installation.log 2>&1; }; } &&     popd &>/dev/null && [[ "$(uname)" == "Darwin" ]] && skypilot_gsutil() { gsutil -m -o "GSUtil:parallel_process_count=1" "$@"; } || skypilot_gsutil() { gsutil -m "$@"; }; GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/application_default_credentials.json skypilot_gsutil ls -d gs://skypilot-workdir-namsangho-dc04aa38
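Stripped of the wrapper that installs gcloud when it is missing, the command above boils down to listing the workdir bucket with gsutil while GOOGLE_APPLICATION_CREDENTIALS points at ~/.config/gcloud/application_default_credentials.json. A minimal Python sketch of just that check, assuming gsutil is already on PATH; build_gsutil_ls and check_bucket are hypothetical names, and the bucket is the one from the log:

```python
import os
import subprocess
from typing import Dict, List, Tuple

# Relative to the home directory, matching ~/.config/gcloud/... in the original command.
ADC_RELPATH = ".config/gcloud/application_default_credentials.json"


def build_gsutil_ls(bucket_url: str,
                    base_env: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) equivalent to
    'GOOGLE_APPLICATION_CREDENTIALS=~/.config/... gsutil -m ls -d <bucket>'."""
    env = dict(base_env)
    # Bash expands '~' in the assignment; do the same using HOME from base_env.
    home = env.get("HOME") or os.path.expanduser("~")
    env["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.join(home, ADC_RELPATH)
    argv = ["gsutil", "-m", "ls", "-d", bucket_url]
    return argv, env


def check_bucket(bucket_url: str) -> int:
    """Run the listing and return gsutil's exit code (0 means the bucket was listed)."""
    argv, env = build_gsutil_ls(bucket_url, dict(os.environ))
    proc = subprocess.run(argv, env=env, capture_output=True, text=True)
    # On failure, a credentials error like the one in this thread shows up on stderr.
    print(proc.stdout if proc.returncode == 0 else proc.stderr)
    return proc.returncode
```

Running check_bucket("gs://skypilot-workdir-namsangho-dc04aa38") on both the laptop and the controller and comparing exit codes and stderr helps separate a credentials problem from a gcloud installation problem.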

sean-styleai commented on June 11, 2024

@cblmemo
The output of the above command is:

Your "OAuth 2.0 Service Account" credentials are invalid. Please run
  $ gcloud auth login
OSError: No such file or directory.

After manually running gcloud auth login, I get the following output:

gs://skypilot-workdir-namsangho-7141c640
gs://skypilot-workdir-namsangho-7141c640/.dockerignore
gs://skypilot-workdir-namsangho-7141c640/.gitignore
gs://skypilot-workdir-namsangho-7141c640/README.md
gs://skypilot-workdir-namsangho-7141c640/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-api.txt
gs://skypilot-workdir-namsangho-7141c640/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-7141c640/requirements.txt
gs://skypilot-workdir-namsangho-7141c640/assets/
gs://skypilot-workdir-namsangho-7141c640/dockerfiles/
gs://skypilot-workdir-namsangho-7141c640/infra/
gs://skypilot-workdir-namsangho-7141c640/notebooks/
gs://skypilot-workdir-namsangho-7141c640/scripts/
gs://skypilot-workdir-namsangho-7141c640/src/

cblmemo commented on June 11, 2024

Could you run the command on your local laptop again? If it also fails there, that might be the reason...

sean-styleai commented on June 11, 2024

@cblmemo
It works fine locally!

gs://skypilot-workdir-namsangho-89bfeef2/.dockerignore
gs://skypilot-workdir-namsangho-89bfeef2/.gitignore
gs://skypilot-workdir-namsangho-89bfeef2/README.md
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api-serverless.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-api.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements-pipeline.txt
gs://skypilot-workdir-namsangho-89bfeef2/requirements.txt
gs://skypilot-workdir-namsangho-89bfeef2/assets/
gs://skypilot-workdir-namsangho-89bfeef2/dockerfiles/
gs://skypilot-workdir-namsangho-89bfeef2/infra/
gs://skypilot-workdir-namsangho-89bfeef2/notebooks/
gs://skypilot-workdir-namsangho-89bfeef2/scripts/
gs://skypilot-workdir-namsangho-89bfeef2/src/

sean-styleai commented on June 11, 2024

@cblmemo
Is this issue related to how the GCP service account (SA) credentials are transferred from the controller to the service replicas?

cblmemo commented on June 11, 2024

Sorry for the late reply; I was a bit busy recently. Given that you cannot access your GCS storage on the controller, it looks more like the SA credentials are not being correctly synced from your local laptop to the controller. cc @Michaelvll for a look here 👀: are the SA credentials included in the following path?

DEFAULT_GCP_APPLICATION_CREDENTIAL_PATH: str = (
    '~/.config/gcloud/'
    'application_default_credentials.json')
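To check what actually arrived on the controller, one can inspect the file at that path on both machines. Application Default Credentials JSON files carry a "type" field, for example "authorized_user" after gcloud auth application-default login, or "service_account" for a service account key. A minimal sketch that reports it; describe_adc is a hypothetical helper that only reads the file and does not verify the credentials with Google:

```python
import json
import os
from typing import Optional

# Same default location as the constant quoted above.
DEFAULT_GCP_APPLICATION_CREDENTIAL_PATH = (
    "~/.config/gcloud/application_default_credentials.json")


def describe_adc(path: str = DEFAULT_GCP_APPLICATION_CREDENTIAL_PATH) -> Optional[str]:
    """Return the 'type' field of an ADC JSON file, or None if the file is missing."""
    full_path = os.path.expanduser(path)
    if not os.path.isfile(full_path):
        return None
    with open(full_path, encoding="utf-8") as f:
        data = json.load(f)
    # Typical values: "authorized_user" (user login) or "service_account" (key file).
    return data.get("type", "<no type field>")
```

If the laptop and the controller report different types, or the controller reports None, the credentials were not synced the way one would expect.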

GrelaM100 commented on June 11, 2024

Hi @sean-styleai, I experienced a similar issue with the managed spot jobs controller. What worked for me was deleting the gcloud directory in the .config directory. After that, I executed the sky spot launch command again, and everything worked as expected. It seems like this might be a workaround worth trying.

martin-liu commented on June 11, 2024

I experienced the same issue. Basically, gsutil fails on the controller VM with: Your "OAuth 2.0 Service Account" credentials are invalid.
I SSHed into the VM and tried several things:

  • gcloud storage ls works, but gsutil ls fails
  • Installed the latest gcloud via https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-474.0.0-linux-x86_64.tar.gz, but gsutil ls still fails
  • Ran apt update, then apt upgrade google-cloud-sdk; after that, gsutil ls works

I'm not an expert on this; any thoughts?

martin-liu commented on June 11, 2024

I tried to debug it on the controller VM. gsutil -D ls gives more detail, and I found that gs_service_key_file in ~/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/.boto points to my local laptop path (/Users/xxx/.config/gcloud/legacy_credentials/xxx.iam.gserviceaccount.com/adc.json) rather than /home/gcpuser/.config/...

So basically, all the credentials were copied to the remote machine without making sure the paths are corrected for the remote server.
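A minimal sketch of the path correction described above, assuming the .boto file stores the key location as an INI-style "gs_service_key_file = <path>" line and that the correct remote location is the same path re-rooted under the remote user's home directory (for example /home/gcpuser). rewrite_boto_key_path is a hypothetical helper for illustration, not part of SkyPilot or gsutil:

```python
import re

# Matches e.g. "gs_service_key_file = /Users/me/.config/gcloud/..." (assumed INI-style line).
_KEY_LINE = re.compile(r"^(\s*gs_service_key_file\s*=\s*)(\S.*?)\s*$")
_GCLOUD_DIR = "/.config/gcloud/"


def rewrite_boto_key_path(boto_text: str, remote_home: str) -> str:
    """Re-root gs_service_key_file under remote_home, keeping the part from /.config/gcloud/ on."""
    out = []
    for line in boto_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        m = _KEY_LINE.match(body)
        if m and _GCLOUD_DIR in m.group(2):
            old_path = m.group(2)
            suffix = old_path[old_path.index(_GCLOUD_DIR):]  # "/.config/gcloud/..."
            ending = line[len(body):]  # preserve the original line ending
            line = f"{m.group(1)}{remote_home.rstrip('/')}{suffix}{ending}"
        out.append(line)
    return "".join(out)
```

On the controller, one would read the .boto file under ~/.config/gcloud/legacy_credentials/<account>/ (the account directory is a placeholder), keep a backup, and write the rewritten text back; all other lines are left untouched.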
