
Comments (14)

azhou-determined commented on September 27, 2024

Nice! Thank you for the detailed logs, I will look into this further.

Closing this ticket for now, please feel free to reopen if you have more questions or run into more issues.

from determined.

azhou-determined commented on September 27, 2024

I can't seem to reproduce this. It seems the error logs are getting swallowed somewhere. Per the "Creating determined_determined-master_1..." line, it looks like the master container started up, so can you get the logs from the container and share them?

Running docker ps -a should show the exited master container. Then docker logs CONTAINER_ID should show its logs.
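
As a rough sketch of that lookup, the Python helper below picks exited containers out of the text printed by docker ps -a with a tab-separated --format template. The function name find_exited and the sample rows are made up for illustration; only the docker ps flags and the .ID, .Names and .Status template fields come from the Docker CLI.

```python
def find_exited(ps_output, name_fragment="determined-master"):
    r"""Return (container_id, name) pairs for exited containers.

    ps_output is expected to be the text printed by:
        docker ps -a --format '{{.ID}}\t{{.Names}}\t{{.Status}}'
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        container_id, name, status = parts
        if name_fragment in name and status.startswith("Exited"):
            matches.append((container_id, name))
    return matches


# Hypothetical sample output, for illustration only.
sample = (
    "0a1b2c3d4e5f\tdetermined_determined-master_1\tExited (1) 2 minutes ago\n"
    "1f2e3d4c5b6a\tdetermined_determined-db_1\tUp 3 minutes\n"
)
print(find_exited(sample))
```

Each returned ID can then be passed to docker logs.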

PurRigiN commented on September 27, 2024

Yes, I tried to find the logs in the master container too, but the containers were removed automatically once the command finished. It seems that Determined sets the --rm flag when it creates containers. I could not find an option in Determined to disable this auto-removal.
The execution is recorded as a GIF here.
(animated GIF)
Am I missing a setting related to auto-removal? Is there any other way to get more logs?

azhou-determined commented on September 27, 2024

I see, yeah, unfortunately we remove containers that hit an error. From your gif, it looks like the container did actually start up at some point, so let's try to find those logs.

Would it be possible for you to configure your docker daemon to keep logs even after container removal? See https://docs.docker.com/config/containers/logging/journald/. The default logging driver does not persist logs after containers are deleted, but you can easily change this in daemon.json (in /etc/docker/ on Linux). If you configure it to use the journald logging driver, the logs for containers should be persisted even after removal, and you can view them using journalctl.
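
A minimal sketch of that daemon.json change, written in Python only so the JSON can be generated and checked. The log-driver key and the journald value come from the Docker documentation; the variable name is illustrative. The new default only applies to containers created after the daemon is restarted, and the persisted logs can then be read with something like journalctl CONTAINER_NAME=determined_determined-master_1.

```python
import json

# Minimal daemon.json content that switches Docker's default logging driver
# to journald. Merge this key into any existing /etc/docker/daemon.json rather
# than overwriting the file, and restart the Docker daemon afterwards.
daemon_config = {"log-driver": "journald"}

print(json.dumps(daemon_config, indent=2))
```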

Barring that, you could try and "catch" the container after it starts but before it exits by running docker logs determined_determined-master_1 -f repeatedly. The container name should be constant, so you wouldn't need to try and find the container ID.
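
The retry idea above could be scripted along these lines. This is only a sketch: catch_logs is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes the docker CLI is on PATH. It keeps calling docker logs -f until the container exists, then blocks until the log stream ends.

```python
import subprocess
import time


def catch_logs(container, attempts=30, delay=1.0, run=subprocess.run):
    """Poll `docker logs -f` until the container exists, then return its output.

    Returns None if the container never showed up within `attempts` tries.
    `run` is injectable so the loop can be exercised without Docker.
    """
    for _ in range(attempts):
        result = run(
            ["docker", "logs", "-f", container],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            # docker logs replays the container's stdout and stderr separately.
            return result.stdout + result.stderr
        time.sleep(delay)  # container not created yet; try again shortly
    return None


# Example (requires Docker, so it is left commented out):
# print(catch_logs("determined_determined-master_1"))
```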

PurRigiN commented on September 27, 2024

Changing the Docker logging driver was really helpful!
Following the steps you provided, I got the log below.
log.log
The log shows that the PostgreSQL database system started up and was ready to accept connections. But the next container received an ambiguous error: "stream copy error: reading from a closed fifo". This error can reportedly appear when disk space runs out, but I checked the disk on the server and it was fine. Do you have any ideas?

azhou-determined commented on September 27, 2024

Hmm, I've never seen this particular issue before. Seems something is being silently killed. My guess would have been to check for memory issues on the machine, or maybe there is some issue with reading the DB volume. Some ideas for further debugging:

  • Are you able to run det deploy local db-up successfully? This would effectively just set up the database container.
  • Does docker volume ls show a Determined DB volume? If so, and if you do not have any important data on it, maybe try deleting it? det deploy local should then try to recreate it.
  • Can you try setting up/running Determined docker containers manually? We have a guide on how to do this. This is usually the recommended installation method for more advanced use cases, but it may also help us debug this issue.

PurRigiN commented on September 27, 2024

It seems that det deploy local db-up is not supported in determined 0.25.1:

usage: det deploy local [-h] subsubcommand ...
det deploy local: error: argument subsubcommand: invalid choice: 'db-up' (choose from 'help', 'agent-down', 'agent-up', 'cluster-down', 'cluster-up', 'logs', 'master-down', 'master-up')

Then I backed up the files corresponding to the volumes and removed the volumes. However, it failed again.
Then I tried your third idea and started the PostgreSQL and master containers using the commands below.

docker run -d --name determined-db -p 5432:5432 -v determined_db:/var/lib/postgresql/data -e POSTGRES_DB=determined -e POSTGRES_PASSWORD=abcABC123 postgres:10.14
docker run -d --name determined-master -p 8080:8080  -e DET_DB_HOST=127.0.0.1  -e DET_DB_NAME=determined -e DET_DB_PORT=5432  -e DET_DB_USER=postgres  -e DET_DB_PASSWORD=abcABC123  determinedai/determined-master:0.25.1

The database started successfully, but the master container failed. This time, however, the log provided more information.

Executing task: docker logs --tail 1000 -f 0135c41c69cb8e538b4df1f80a90016fec14890b3e2b1f2b923d625f002eb5b1 

INFO[2023-11-14T13:39:46Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"127.0.0.1","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0,"RelatedUser":null},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"host_path":"/tmp","propagation":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{}},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********"},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"93e790bb450c"},"feature_switches":[],"resource_manager":{"client_ca":"","default_aux_resource_pool":"default","default_compute_resource_pool":"default","require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_aux_containers_per_agent":100,"task_container_defaults":null,"agent_reat
tach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""}}} 
INFO[2023-11-14T13:39:46Z] Determined master 0.25.1 (built with go1.21.0) 
INFO[2023-11-14T13:39:46Z] connecting to database 127.0.0.1:5432        
WARN[2023-11-14T13:39:50Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:39:54Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:39:58Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:02Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:06Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:10Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:14Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:18Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:22Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:26Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:30Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:34Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:38Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
WARN[2023-11-14T13:40:42Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"
ERRO[2023-11-14T13:40:42Z] failed to connect to `host=127.0.0.1 user=postgres database=determined`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)
could not connect to database after 15 tries
github.com/determined-ai/determined/master/internal/db.ConnectPostgres
        /home/circleci/project/master/internal/db/postgres.go:189
github.com/determined-ai/determined/master/internal/db.Connect
        /home/circleci/project/master/internal/db/setup.go:24
github.com/determined-ai/determined/master/internal/db.Setup
        /home/circleci/project/master/internal/db/setup.go:36
github.com/determined-ai/determined/master/internal.(*Master).Run
        /home/circleci/project/master/internal/core.go:864
main.runRoot
        /home/circleci/project/master/cmd/determined-master/root.go:65
main.newRootCmd.func1
        /home/circleci/project/master/cmd/determined-master/root.go:29
github.com/spf13/cobra.(*Command).execute
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:920
github.com/spf13/cobra.(*Command).ExecuteC
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:968
main.main
        /home/circleci/project/master/cmd/determined-master/main.go:12
runtime.main
        /usr/local/go/src/runtime/proc.go:267
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1650
error connecting to database: 127.0.0.1:5432
github.com/determined-ai/determined/master/internal/db.Connect
        /home/circleci/project/master/internal/db/setup.go:26
github.com/determined-ai/determined/master/internal/db.Setup
        /home/circleci/project/master/internal/db/setup.go:36
github.com/determined-ai/determined/master/internal.(*Master).Run
        /home/circleci/project/master/internal/core.go:864
main.runRoot
        /home/circleci/project/master/cmd/determined-master/root.go:65
main.newRootCmd.func1
        /home/circleci/project/master/cmd/determined-master/root.go:29
github.com/spf13/cobra.(*Command).execute
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:920
github.com/spf13/cobra.(*Command).ExecuteC
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:1044
github.com/spf13/cobra.(*Command).Execute
        /home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:968
main.main
        /home/circleci/project/master/cmd/determined-master/main.go:12
runtime.main
        /usr/local/go/src/runtime/proc.go:267
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1650 
 *  Terminal will be reused by tasks, press any key to close it.

It shows that the master container could not connect to the database. I tested the database connection using a VS Code extension, and it worked well:
(screenshot)
According to the error message "failed to connect to 'host=127.0.0.1 user=postgres database=determined': dial error (dial tcp 127.0.0.1:5432: connect: connection refused)",
is it possible that the password was not passed properly?
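
For context, "connection refused" is reported at the TCP level, before any PostgreSQL authentication takes place, so a wrong password would normally produce a different error. A minimal sketch of a reachability probe follows; can_connect is a hypothetical helper, not part of Determined, and the result only says something about the network context it is run from (host versus inside a container).

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unresolvable hostnames.
        return False


# Example: can_connect("127.0.0.1", 5432)
```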

azhou-determined commented on September 27, 2024

It seems the master container cannot reach the DB container: inside the master container, 127.0.0.1 refers to the container itself, not the host. Can you try running the master with --network host (after you've started up the db)?

docker run -d --name determined-master --network host -e DET_DB_HOST=127.0.0.1 -e DET_DB_NAME=determined -e DET_DB_PORT=5432  -e DET_DB_USER=postgres  -e DET_DB_PASSWORD=abcABC123  determinedai/determined-master:0.25.1

PurRigiN commented on September 27, 2024

OK, it finally worked, and I figured out where the problem is.
If the --network host flag is not set, the container defaults to the bridge network, which is created when Docker is installed. On that network, 127.0.0.1 refers to the master container itself rather than the host, and the default bridge provides no DNS resolution between containers, so neither DET_DB_HOST=127.0.0.1 nor DET_DB_HOST=determined-db can reach the database.
Then it hit me that the network settings might also be what prevents the master from starting with det deploy local master-up. I unset the http_proxy and https_proxy environment variables (both set to http://127.0.0.1:7890) and ran det deploy local master-up again. The master started up successfully. With http_proxy and https_proxy set again, it failed again. So this is the problem. Maybe you can reproduce it by setting these two environment variables.
Now I have restored our data and Determined runs well. Thank you so much!
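
A small sketch of the proxy workaround, for anyone who wants to keep the proxy settings in their shell and strip them only for the deploy command. env_without_proxies is a hypothetical helper; the thread only mentions the lowercase http_proxy and https_proxy, and including the uppercase variants is an assumption.

```python
import os

# Lowercase names are the ones reported in this thread; the uppercase
# variants are included as an assumption.
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def env_without_proxies(env=None):
    """Return a copy of env (default: os.environ) without proxy variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in PROXY_VARS}


# Example (requires the det CLI, so it is left commented out):
# import subprocess
# subprocess.run(["det", "deploy", "local", "master-up"],
#                env=env_without_proxies(), check=True)
```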

mabrowning commented on September 27, 2024

I had exactly the same problem, but solved it with another workaround:

# determined-master.yaml
db:
  host: "determined_determined-db_1"
  user: "postgres"
  port: "5432"
  name: "determined"

Rather than rely on the default determined-db alias, which wasn't resolving, I used the actual container name determined_determined-db_1, which did. Unfortunately, I still had to fill in the other fields under db, which were not otherwise populated with their defaults (though the password field apparently is).

Then this worked:

$ det deploy local cluster-up --no-gpu --master-config-path determined-master.yaml

mabrowning commented on September 27, 2024

FWIW, the logs I had from the original failure hinted at the name resolution issue:

$ det deploy local cluster-up --no-gpu
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available.....................................................
Timed out connecting to master, but attempting to dump logs from cluster...
Stopping determined_determined-master_1
PostgreSQL Database directory appears to contain a database; Skipping initialization
2023-11-13 17:26:12.128 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-11-13 17:26:12.128 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-11-13 17:26:12.131 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-11-13 17:26:12.160 UTC [31] LOG:  database system was shut down at 2023-11-13 17:26:00 UTC
2023-11-13 17:26:12.165 UTC [32] FATAL:  the database system is starting up
2023-11-13 17:26:12.166 UTC [1] LOG:  database system is ready to accept connections

Removing determined_determined-master_1
INFO[2023-11-13T17:26:13Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"host_path":"/cb/home/mark/.local/share/determined","propagation":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":null,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{},"LogPolicies":null,"kubernetes":null},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********","cluster_id":""},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"26ce0f243f6a"},"feature_switches":[],"reserved_ports":null,"resource_manager":{"client_ca":"","default_aux_resource_pool":"default","default_compute_resource_pool":"default","no_default_resource_pools":false,"require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","d
escription":"","provider":null,"max_aux_containers_per_agent":100,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""},"proxied_servers":null}}
INFO[2023-11-13T17:26:13Z] Determined master 0.26.3 (built with go1.21.0)
INFO[2023-11-13T17:26:13Z] connecting to database determined-db:5432
WARN[2023-11-13T17:26:17Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: hostname resolving error (lookup determined-db on 127.0.0.11:53: no such host)"
# .... repeat 14 times
ERRO[2023-11-13T17:27:11Z] could not connect to database after 15 tries: failed to connect to `host=determined-db user=postgres database=determined`: hostname resolving error (lookup determined-db on 127.0.0.11:53: no such host): error connecting to database: determined-db:5432
INFO[2023-11-13T17:27:12Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"host_path":"/cb/home/mark/.local/share/determined","propagation":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":null,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{},"LogPolicies":null,"kubernetes":null},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********","cluster_id":""},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"767e9353fdbc"},"feature_switches":[],"reserved_ports":null,"resource_manager":{"client_ca":"","default_aux_resource_pool":"default","default_compute_resource_pool":"default","no_default_resource_pools":false,"require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","d
escription":"","provider":null,"max_aux_containers_per_agent":100,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""},"proxied_servers":null}}
INFO[2023-11-13T17:27:12Z] Determined master 0.26.3 (built with go1.21.0)
INFO[2023-11-13T17:27:12Z] connecting to database determined-db:5432
WARN[2023-11-13T17:27:16Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: hostname resolving error (lookup determined-db on 127.0.0.11:53: no such host)"
WARN[2023-11-13T17:27:20Z] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: hostname resolving error (lookup determined-db on 127.0.0.11:53: no such host)"


Stopping determined_determined-db_1
Removing determined_determined-db_1
Removing network determined_default
Traceback (most recent call last):
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/local/cluster_utils.py", line 115, in _wait_for_master
    wait_for_master(master_host, master_port)
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/healthcheck.py", line 26, in wait_for_master
    return wait_for_master_url(master_url, timeout, cert)
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/healthcheck.py", line 52, in wait_for_master_url
    raise MasterTimeoutExpired
determined.deploy.errors.MasterTimeoutExpired

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/cli/cli.py", line 232, in main
    parsed_args.func(parsed_args)
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/local/cli.py", line 17, in handle_cluster_up
    cluster_utils.cluster_up(
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/local/cluster_utils.py", line 347, in cluster_up
    master_up(
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/local/cluster_utils.py", line 268, in master_up
    _wait_for_master("localhost", port, cluster_name)
  File "/net/mark-dev/srv/nfs/mark-data/ws/monolith3/gen/python/python-x86_64/lib/python3.8/site-packages/determined/deploy/local/cluster_utils.py", line 120, in _wait_for_master
    raise ConnectionError("Timed out connecting to master")
ConnectionError: Timed out connecting to master
Failed to Create a Determined cluster

PurRigiN commented on September 27, 2024

@mabrowning I tried the same approach, but it failed again:

# master-config-test.yaml
db:
  host: determined_determined-db_1
  name: determined
  port: 5432
  user: postgres
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ det deploy local master-up --master-config-path /home/dl/determined-scripts/master-config-test.yaml
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Stopping determined_determined-master_1
Removing determined_determined-master_1
Stopping determined_determined-db_1
Removing determined_determined-db_1
Removing network determined_default
Failed to Start a Determined master: 
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ det deploy local cluster-up --no-gpu --master-config-path /home/dl/determined-scripts/master-config-test.yaml
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Stopping determined_determined-master_1
Removing determined_determined-master_1
Stopping determined_determined-db_1
Removing determined_determined-db_1
Removing network determined_default
Failed to Create a Determined cluster: 
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ echo $http_proxy
http://127.0.0.1:7890/
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ export http_proxy=""
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ export https_proxy=""
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ det deploy local cluster-up --no-gpu --master-config-path /home/dl/determined-scripts/master-config-test.yaml
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available....
Starting determined-agent-0
(determined) dl@dl-X299-UD4-Pro:~/determined-scripts$ 

On my machine, the problem still seems to come from the network proxy; changing the DB host name in the config does not help. Did you try unsetting these two environment variables? Maybe that will also help you set up Determined. Anyway, thank you for the information!

mabrowning commented on September 27, 2024

I did not have http_proxy set. So perhaps our issues were only similar in nature rather than being the same.

PurRigiN commented on September 27, 2024

Yes. In both cases the DB host name could not be resolved. On my machine this was caused by the proxy, but on yours it was not. I guess your problem is also related to the network settings (just not the proxy, which we have now ruled out).
