
determined-ai / determined


Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

Home Page: https://determined.ai

License: Apache License 2.0

Dockerfile 0.02% Makefile 0.42% Python 31.44% Go 42.93% HCL 0.33% PLpgSQL 0.44% Shell 0.70% Jupyter Notebook 0.03% JavaScript 0.23% HTML 0.02% TypeScript 22.24% SCSS 0.93% Smarty 0.01% Roff 0.28%
deep-learning machine-learning ml-platform ml-infrastructure hyperparameter-optimization hyperparameter-search distributed-training pytorch tensorflow hyperparameter-tuning kubernetes data-science mlops keras

determined's Introduction

Determined AI Logo

Determined is an all-in-one deep learning platform, compatible with PyTorch and TensorFlow.

It takes care of:

  • Distributed training for faster results.
  • Hyperparameter tuning for obtaining the best models.
  • Resource management for cutting cloud GPU costs.
  • Experiment tracking for analysis and reproducibility.

Features gif

How Determined Works

The main components of Determined are the Python library, the command line interface (CLI), and the Web UI.

Python Library

Use the Python library to make your existing PyTorch or TensorFlow code compatible with Determined.

You can do this by organizing your code into one of the class-based APIs:

from determined.pytorch import PyTorchTrial

class YourExperiment(PyTorchTrial):
  def __init__(self, context):
    ...

Or by using just the functions you want, via the Core API:

import determined as det

with det.core.init() as core_context:
    ...
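The Core API pattern is a context manager plus explicit metric reporting. The sketch below illustrates only the shape of that loop: `_FakeCoreContext` and `_FakeTrain` are stand-ins invented here so the example runs without a Determined cluster; with the real library you would write `with det.core.init() as core_context:` and the `report_training_metrics` call would ship metrics to the master.

```python
class _FakeTrain:
    """Stand-in for core_context.train; records metrics instead of reporting them."""
    def __init__(self):
        self.reported = []

    def report_training_metrics(self, steps_completed, metrics):
        # The real Core API call sends these metrics to the Determined master.
        self.reported.append((steps_completed, metrics))


class _FakeCoreContext:
    """Stand-in for the context object yielded by det.core.init()."""
    def __init__(self):
        self.train = _FakeTrain()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


with _FakeCoreContext() as core_context:
    for step in range(1, 4):
        loss = 1.0 / step  # placeholder for a real training step
        core_context.train.report_training_metrics(
            steps_completed=step, metrics={"loss": loss}
        )
```

The point of the pattern is that training stays your own plain Python loop; Determined only sees what you explicitly report.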

Command Line Interface (CLI)

You can use the CLI to:

  • Start a Determined cluster locally:
det deploy local cluster-up
  • Launch Determined on cloud services, such as Amazon Web Services (AWS) or Google Cloud Platform (GCP):
det deploy aws up
  • Train your models:
det experiment create gpt.yaml .

Configure everything from distributed training to hyperparameter tuning using YAML files:

resources:
  slots_per_trial: 8
  priority: 1
hyperparameters:
  learning_rate:
    type: double
    minval: .0001
    maxval: 1.0
searcher:
  name: adaptive_asha
  metric: validation_loss
  smaller_is_better: true
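The `adaptive_asha` searcher is built on asynchronous successive halving: many trials start with a small training budget, and only the best fraction survives to train longer. A toy, synchronous sketch of the successive-halving idea (not Determined's implementation; the "validation loss" here is a random stand-in):

```python
import random


def successive_halving(n_trials=16, min_budget=1, eta=2, rungs=4):
    """Toy successive halving: score all trials at the current budget,
    keep the best 1/eta, and give survivors eta times more budget."""
    rng = random.Random(0)
    # Each "trial" is (learning rate sampled log-uniformly, latent quality).
    trials = [(10 ** rng.uniform(-4, 0), rng.random()) for _ in range(n_trials)]
    budget = min_budget
    for _ in range(rungs):
        # Stand-in validation loss: quality plus noise that shrinks as budget grows.
        scored = sorted(trials, key=lambda t: t[1] + rng.gauss(0, 1.0 / budget))
        trials = scored[: max(1, len(scored) // eta)]  # promote the best 1/eta
        budget *= eta
    return trials[0]  # best surviving configuration


best_lr, _ = successive_halving()
```

With 16 trials and eta=2, each rung halves the field (16, 8, 4, 2, 1), so most of the compute goes to configurations that looked promising early.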

Web UI

Use the Web UI to view loss curves, hyperparameter plots, code and configuration snapshots, model registries, cluster utilization, debugging logs, performance profiling reports, and more.

Web UI

Installation

To install the CLI:

pip install determined

Then use det deploy to start the Determined cluster locally, or on cloud services like AWS and GCP.

For installation details, visit the cluster deployment guide for your environment.

Examples

Get familiar with Determined by exploring the 30+ examples in the examples folder and the determined-examples repo.

Documentation

Community

If you need help, want to file a bug report, or just want to keep up-to-date with the latest news about Determined, please join the Determined community!

Contributing

Contributor's Guide

License

Apache V2

determined's People

Contributors

aaron276h, apizzini, ashtong, azhou-determined, brainhart, carolinaecalderon, dzhu, eecsliu, emilybonar, erikwilson, gt2345, hamidzr, hkang1, ioga, jerryharrow, johnkim-det, julian-determined-ai, keita-determined, liamcli, mackrorysd, mapmeld, neilconway, nicholasblaskey, rb-determined-ai, shiyuann, stoksc, tara-det-ai, thiagodallacqua-hpe, trentwatt, wes-turner


determined's Issues

Typo in release date

A minor heads-up: the release dates below reflect the previous year instead of 2021.
image

Number of training steps is unclear when max_steps > 1 epoch

I have a dataset that has 829 * 100 mini-batches. If I set max_steps to 9 and batches_per_step to 100 (the default), then:

The training stopped before reaching step 9 due to a StopIteration exception, and the Python process was terminated with "container failed with non-zero exit code: container failed with non-zero exit code: 1". Immediately, another Python process started, trained for 6 steps (600 mini-batches), and stopped with "INFO: Workload completed: <RUN_STEP (100): (34,64,10)> (duration 0:00:21.639865)".

So it trained 14.29 steps in total (8.29 in the first run plus 6 in the second). The second run also appears to have trained from scratch rather than reusing the parameters from the first run.

I was expecting 9 steps in total, in one go.
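The arithmetic behind the report can be sketched as follows. This is a toy model of the step accounting (not Determined's actual scheduler), assuming a dataset of 829 mini-batches and 100 batches per step:

```python
def run_until_exhausted(dataset_batches, max_steps, batches_per_step):
    """Consume batches until the dataset iterator is exhausted or the
    requested number of steps completes; return (full_steps, batches_seen)."""
    requested = max_steps * batches_per_step
    # A plain iterator raises StopIteration past the end of the dataset.
    batches_seen = min(requested, dataset_batches)
    return batches_seen // batches_per_step, batches_seen


# First run: 9 steps * 100 batches = 900 requested, but only 829 batches
# exist, so the run dies partway through step 9, after 8.29 "steps".
full_steps, batches_seen = run_until_exhausted(829, 9, 100)
```

This matches the report: 900 batches are requested, only 829 exist, so the first run completes 8 full steps plus a partial ninth before hitting StopIteration.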

description: siim_const_batch_1
environment:
 image: determinedai/environments:cuda-10.1-pytorch-1.4-tf-2.2-gpu-1
data:
 url: https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz
hyperparameters:
 learning_rate: 1.0
 global_batch_size: 32
 n_filters1: 32
 n_filters2: 64
 dropout1: 0.25
 dropout2: 0.5
checkpoint_policy: best
min_validation_period: 1  # in step
searcher:
 name: single
 metric: auc
 max_steps: 10  # 9 steps is ~ one epoch
 smaller_is_better: false
batches_per_step: 100 

resource constrained notebooks

Hi, greetings from Germany!

I just discovered this project and have been trying out some of its features. I have to say that I really like it so far.

I would love to see the feature of resource constrained jupyter notebooks:
At the moment I can request a GPU-accelerated notebook, which blocks an entire GPU.
I would, however, like to host multiple notebooks on a single GPU,
perhaps with a VRAM or processing constraint per notebook.

I think that on the Docker side it should be possible to host multiple instances on a single GPU; correct me if I'm wrong.

Thank you!

Add "Active Filters" count and "Clear Filters" button to tables with filtering

We recently switched to in-column filtering for tables. Users sometimes do not know what is being filtered, because the filter indicator is a small arrow that is sometimes off-screen. Let's make it obvious that a table is being filtered by adding two components to tables with filtering:

  • Active Filters count
  • Clear Filters button

specification of default configs for notebooks created through the web UI

Feature request: it would be really helpful to be able to specify one or more default configs for the notebook instances that are launched with the button in the web UI. In our situation, being able to specify default mounts and agent labels would be helpful, since we typically mount an NFS share for notebook persistence.

It would be really nice if we could have several presets selectable from the dropdown when clicking that button, similar to how you can select "cpu only" currently.

Notebooks started from the UI don't appear in the CLI

Create a local cluster: det-deploy local cluster-up --no-gpu
Visit the UI and launch a new notebook.

The notebook isn't listed in the CLI:

$ det notebook ls
 Id   | Owner   | Description   | State   | Exit Status
------+---------+---------------+---------+---------------

Trial container raises an error when using --no-gpu

Hi,
The container raises a `TypeError: 'NoneType' object does not support item assignment` in _socket_manager.py while the trial is running.

1. Envs

  • CPU only
  • OS: Ubuntu 16.04
  • Determined: 0.15.1
  • Python: 3.7

2. Install and start

pip install determined
det deploy local cluster-up
det experiment create const.yaml . (the official mnist_pytorch example)

3. Master logs

<info>    [2021-04-28, 08:18:41] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/home/shihk/.local/share/determined","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_cpu_resource_pool":"default","default_gpu_resource_pool":"default","scheduler":{"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}]}
<info>    [2021-04-28, 08:18:41] Determined master 0.15.1 (built with go1.16.3)
<info>    [2021-04-28, 08:18:41] connecting to database determined-db:5432
<warning> [2021-04-28, 08:18:45] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: dial error (dial tcp 172.23.0.2:5432: connect: connection refused)"
<info>    [2021-04-28, 08:18:45] running migrations from file:///usr/share/determined/master/static/migrations
<info>    [2021-04-28, 08:18:45] unable to find golang-migrate version
<info>    [2021-04-28, 08:18:45] deleting all snapshots for terminal state experiments
<info>    [2021-04-28, 08:18:45] creating resource pool: default  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] pool default using global scheduling config  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] initializing endpoints for agents
<info>    [2021-04-28, 08:18:45] not enabling provisioner for resource pool: default  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:45] scheduling next resource allocation aggregation in 15h42m14s at 2021-04-29 00:01:00 +0000 UTC  id="allocation-aggregator" system="master" type="allocationAggregator"
<info>    [2021-04-28, 08:18:45] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info>    [2021-04-28, 08:18:45] accepting incoming connections on port 8080
<info>    [2021-04-28, 08:18:45] Subchannel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:45] pickfirstBalancer: HandleSubConnStateChange: 0xc000255d10, {READY <nil>}  system="system"
<info>    [2021-04-28, 08:18:45] Channel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetAgents" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.791" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.323" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:47] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 08:18:48] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:18:48] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:48] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<warning> [2021-04-28, 08:20:03] response already committed
<info>    [2021-04-28, 08:20:03] experiment state changed to ACTIVE  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:03] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: 51871fd3-0ac4-4419-95ce-eee38508beff)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:04] starting container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:06] found container running: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] new connection from container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 trial 1 (experiment 1) at 172.23.0.1:43810
<info>    [2021-04-28, 08:20:09] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] found 4 rendezvous addresses instead of 2 for container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-63a14abe-fc1f-4c44-8583-d2fb6f9c6136" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:09] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:09] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] stopped container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] found container terminated: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:09] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:09] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: d997c647-2a3c-497c-8cd0-7ac7dd468634)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:10] starting container id: d73b67ac-1500-4037-8bc0-066701af1efa slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:12] found container running: d73b67ac-1500-4037-8bc0-066701af1efa (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] new connection from container d73b67ac-1500-4037-8bc0-066701af1efa trial 1 (experiment 1) at 172.23.0.1:43858
<info>    [2021-04-28, 08:20:15] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] found 4 rendezvous addresses instead of 2 for container d73b67ac-1500-4037-8bc0-066701af1efa; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-d73b67ac-1500-4037-8bc0-066701af1efa" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:15] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:15] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] stopped container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] found container terminated: d73b67ac-1500-4037-8bc0-066701af1efa  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:15] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:15] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a8cc64d8-4902-4c37-97c6-00ca352a5bde)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:16] starting container id: b8fb8292-b905-48a2-951d-817b640b4ab6 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:18] found container running: b8fb8292-b905-48a2-951d-817b640b4ab6 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] new connection from container b8fb8292-b905-48a2-951d-817b640b4ab6 trial 1 (experiment 1) at 172.23.0.1:43922
<info>    [2021-04-28, 08:20:21] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] found 4 rendezvous addresses instead of 2 for container b8fb8292-b905-48a2-951d-817b640b4ab6; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-b8fb8292-b905-48a2-951d-817b640b4ab6" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:21] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:21] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] stopped container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] found container terminated: b8fb8292-b905-48a2-951d-817b640b4ab6  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:22] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a7f517a4-d7e2-46fd-be41-c271865069f0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] starting container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:24] found container running: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] new connection from container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 trial 1 (experiment 1) at 172.23.0.1:43982
<info>    [2021-04-28, 08:20:27] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] found 4 rendezvous addresses instead of 2 for container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:27] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:27] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] stopped container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] found container terminated: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:27] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:27] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: ca1b557d-35fa-4c22-ae39-17e86f9cb140)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:28] starting container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:30] found container running: 1636009d-9875-4cf4-91f4-3fe617fba5b7 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] new connection from container 1636009d-9875-4cf4-91f4-3fe617fba5b7 trial 1 (experiment 1) at 172.23.0.1:44030
<info>    [2021-04-28, 08:20:33] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] found 4 rendezvous addresses instead of 2 for container 1636009d-9875-4cf4-91f4-3fe617fba5b7; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1636009d-9875-4cf4-91f4-3fe617fba5b7" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:33] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:33] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] stopped container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] found container terminated: 1636009d-9875-4cf4-91f4-3fe617fba5b7  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:33] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:33] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: fc49fcfc-d6db-4dd5-96b1-1aa52032c118)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:34] starting container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:36] found container running: 1bc2292d-139d-45ea-ad5c-d261d06c4847 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] new connection from container 1bc2292d-139d-45ea-ad5c-d261d06c4847 trial 1 (experiment 1) at 172.23.0.1:44090
<info>    [2021-04-28, 08:20:39] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] found 4 rendezvous addresses instead of 2 for container 1bc2292d-139d-45ea-ad5c-d261d06c4847; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1bc2292d-139d-45ea-ad5c-d261d06c4847" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:39] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:39] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] stopped container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] found container terminated: 1bc2292d-139d-45ea-ad5c-d261d06c4847  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:39] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] exiting trial early: 0xc000d32040  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error shutting down actor  error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:20:39] trial failed unexpectedly  error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to STOPPING_ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] resources are requested by /experiment-1-checkpoint-gc (Task ID: 16f11e76-f209-4ab2-8c79-045bbfd6ccf9)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] experiment shut down successfully  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:40] allocated resources to /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:40] starting checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:40] starting container id: 47011443-1146-4a7e-a138-e94a2115aa66 slots: 0 task handler: /experiment-1-checkpoint-gc  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] stopped container id: 47011443-1146-4a7e-a138-e94a2115aa66  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] finished checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:44] resources are released for /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:49:47] error while actor was running  error="ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb" system="master" type="websocketActor"
<error>   [2021-04-28, 08:49:47] ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932
<error>   [2021-04-28, 08:49:47] error while actor was running  error="child failed: /agents/determined-agent-0/websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb: ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:49:47] http: connection has been hijacked
<info>    [2021-04-28, 08:49:47] agent disconnected  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) (determined-agent-0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:49:47] removing agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 09:01:38] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:41] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/home/shihk/.local/share/determined","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_cpu_resource_pool":"default","default_gpu_resource_pool":"default","scheduler":{"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}]}
<info>    [2021-04-28, 08:18:41] Determined master 0.15.1 (built with go1.16.3)
<info>    [2021-04-28, 08:18:41] connecting to database determined-db:5432
<warning> [2021-04-28, 08:18:45] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: dial error (dial tcp 172.23.0.2:5432: connect: connection refused)"
<info>    [2021-04-28, 08:18:45] running migrations from file:///usr/share/determined/master/static/migrations
<info>    [2021-04-28, 08:18:45] unable to find golang-migrate version
<info>    [2021-04-28, 08:18:45] deleting all snapshots for terminal state experiments
<info>    [2021-04-28, 08:18:45] creating resource pool: default  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] pool default using global scheduling config  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-28, 08:18:45] initializing endpoints for agents
<info>    [2021-04-28, 08:18:45] not enabling provisioner for resource pool: default  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:45] scheduling next resource allocation aggregation in 15h42m14s at 2021-04-29 00:01:00 +0000 UTC  id="allocation-aggregator" system="master" type="allocationAggregator"
<info>    [2021-04-28, 08:18:45] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info>    [2021-04-28, 08:18:45] accepting incoming connections on port 8080
<info>    [2021-04-28, 08:18:45] Subchannel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:45] pickfirstBalancer: HandleSubConnStateChange: 0xc000255d10, {READY <nil>}  system="system"
<info>    [2021-04-28, 08:18:45] Channel Connectivity change to READY  system="system"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetAgents" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.791" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-28T08:18:46Z" grpc.time_ms="0.323" span.kind="server" system="grpc"
<info>    [2021-04-28, 08:18:47] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 08:18:48] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:18:48] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:18:48] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<warning> [2021-04-28, 08:20:03] response already committed
<info>    [2021-04-28, 08:20:03] experiment state changed to ACTIVE  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:03] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: 51871fd3-0ac4-4419-95ce-eee38508beff)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:04] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:04] starting container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:06] found container running: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:06] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] new connection from container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136 trial 1 (experiment 1) at 172.23.0.1:43810
<info>    [2021-04-28, 08:20:09] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] found 4 rendezvous addresses instead of 2 for container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:09] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-63a14abe-fc1f-4c44-8583-d2fb6f9c6136" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:09] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:09] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] stopped container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:09] found container terminated: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] killing container id: 63a14abe-fc1f-4c44-8583-d2fb6f9c6136  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:09] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:09] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:09] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: d997c647-2a3c-497c-8cd0-7ac7dd468634)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:10] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:10] starting container id: d73b67ac-1500-4037-8bc0-066701af1efa slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:12] found container running: d73b67ac-1500-4037-8bc0-066701af1efa (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:12] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] new connection from container d73b67ac-1500-4037-8bc0-066701af1efa trial 1 (experiment 1) at 172.23.0.1:43858
<info>    [2021-04-28, 08:20:15] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] found 4 rendezvous addresses instead of 2 for container d73b67ac-1500-4037-8bc0-066701af1efa; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:15] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-d73b67ac-1500-4037-8bc0-066701af1efa" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:15] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:15] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] stopped container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:15] found container terminated: d73b67ac-1500-4037-8bc0-066701af1efa  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] killing container id: d73b67ac-1500-4037-8bc0-066701af1efa  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:15] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:15] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:15] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a8cc64d8-4902-4c37-97c6-00ca352a5bde)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:16] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:16] starting container id: b8fb8292-b905-48a2-951d-817b640b4ab6 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:18] found container running: b8fb8292-b905-48a2-951d-817b640b4ab6 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:18] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] new connection from container b8fb8292-b905-48a2-951d-817b640b4ab6 trial 1 (experiment 1) at 172.23.0.1:43922
<info>    [2021-04-28, 08:20:21] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] found 4 rendezvous addresses instead of 2 for container b8fb8292-b905-48a2-951d-817b640b4ab6; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:21] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-b8fb8292-b905-48a2-951d-817b640b4ab6" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:21] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:21] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:21] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] stopped container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:22] found container terminated: b8fb8292-b905-48a2-951d-817b640b4ab6  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] killing container id: b8fb8292-b905-48a2-951d-817b640b4ab6  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:22] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: a7f517a4-d7e2-46fd-be41-c271865069f0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:22] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:22] starting container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:24] found container running: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:24] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] new connection from container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4 trial 1 (experiment 1) at 172.23.0.1:43982
<info>    [2021-04-28, 08:20:27] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] found 4 rendezvous addresses instead of 2 for container 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:27] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:27] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:27] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] stopped container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:27] found container terminated: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] killing container id: 8fd8ae29-7f3d-4cf3-a2a4-77f9bdbccda4  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:27] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:27] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:27] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: ca1b557d-35fa-4c22-ae39-17e86f9cb140)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:28] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:28] starting container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:30] found container running: 1636009d-9875-4cf4-91f4-3fe617fba5b7 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:30] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] new connection from container 1636009d-9875-4cf4-91f4-3fe617fba5b7 trial 1 (experiment 1) at 172.23.0.1:44030
<info>    [2021-04-28, 08:20:33] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] found 4 rendezvous addresses instead of 2 for container 1636009d-9875-4cf4-91f4-3fe617fba5b7; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:33] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1636009d-9875-4cf4-91f4-3fe617fba5b7" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:33] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:33] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] stopped container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:33] found container terminated: 1636009d-9875-4cf4-91f4-3fe617fba5b7  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] killing container id: 1636009d-9875-4cf4-91f4-3fe617fba5b7  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:33] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resetting trial 1  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:33] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:33] resources are requested by /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6 (Task ID: fc49fcfc-d6db-4dd5-96b1-1aa52032c118)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] allocated resources to /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:34] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:34] starting container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847 slots: 1 task handler: /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:36] found container running: 1bc2292d-139d-45ea-ad5c-d261d06c4847 (rank 0)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:36] found not all containers are connected  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] new connection from container 1bc2292d-139d-45ea-ad5c-d261d06c4847 trial 1 (experiment 1) at 172.23.0.1:44090
<info>    [2021-04-28, 08:20:39] pushing rendezvous information  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] found all containers are connected successfully  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] found 4 rendezvous addresses instead of 2 for container 1bc2292d-139d-45ea-ad5c-d261d06c4847; dropping rendezvous addresses  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="socket-1bc2292d-139d-45ea-ad5c-d261d06c4847" system="master" type="websocketActor"
<error>   [2021-04-28, 08:20:39] websocket handler error: websocket: close 1006 (abnormal closure): unexpected EOF
<info>    [2021-04-28, 08:20:39] found child actor failed, terminating forcibly  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] stopped container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:39] found container terminated: 1bc2292d-139d-45ea-ad5c-d261d06c4847  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] forcibly terminating trial  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] killing container id: 1bc2292d-139d-45ea-ad5c-d261d06c4847  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:20:39] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] exiting trial early: 0xc000d32040  experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<error>   [2021-04-28, 08:20:39] error shutting down actor  error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="c495adfb-4819-417a-bee0-aae46fd24ba6" system="master" trial-id="1" type="trial"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] resources are released for /experiments/1/c495adfb-4819-417a-bee0-aae46fd24ba6  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:20:39] trial failed unexpectedly  error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to STOPPING_ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] experiment state changed to ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:39] resources are requested by /experiment-1-checkpoint-gc (Task ID: 16f11e76-f209-4ab2-8c79-045bbfd6ccf9)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:39] experiment shut down successfully  id="1" system="master" type="experiment"
<info>    [2021-04-28, 08:20:40] allocated resources to /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:20:40] starting checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:40] starting container id: 47011443-1146-4a7e-a138-e94a2115aa66 slots: 0 task handler: /experiment-1-checkpoint-gc  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] stopped container id: 47011443-1146-4a7e-a138-e94a2115aa66  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:20:44] finished checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-28, 08:20:44] resources are released for /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-28, 08:49:47] error while actor was running  error="ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb" system="master" type="websocketActor"
<error>   [2021-04-28, 08:49:47] ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932
<error>   [2021-04-28, 08:49:47] error while actor was running  error="child failed: /agents/determined-agent-0/websocket-00617727-e0b6-4874-9b0e-b933d51c1ecb: ping 11681f0e-e60d-4706-9c13-571690c5d88e did not receive pong response by 2021-04-28 08:49:47.642705858 +0000 UTC m=+1866.505227932" id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-28, 08:49:47] http: connection has been hijacked
<info>    [2021-04-28, 08:49:47] agent disconnected  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 08:49:47] removing device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) (determined-agent-0)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 08:49:47] removing agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:37] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-28, 09:01:38] agent connected ip: 192.168.100.30 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-28, 09:01:38] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-28, 09:01:38] adding device: cpu0 (Intel(R) Core(TM) i7-9700T CPU @ 2.00GHz x 1 physical cores) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
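The repeated `found 4 rendezvous addresses instead of 2` errors above line up with the rendezvous payload in the trial logs below: each of the two rendezvous ports (1734 and 1750) is published twice by Docker, once on the IPv4 host address and once on the IPv6 wildcard `::`, doubling the address count the master expects. A minimal sketch of that mismatch, using the addresses from the log (the IPv6-filtering rule here is an assumption for illustration, not the master's actual logic):

```python
# Addresses as reported in the "Got rendezvous information" trial log line:
# two rendezvous ports, each published twice (IPv4 host address plus the
# IPv6 wildcard "::"), yielding 4 entries where the master expects 2.
addresses = [
    {"container_port": 1734, "host_ip": "192.168.100.30", "host_port": 49193},
    {"container_port": 1734, "host_ip": "::", "host_port": 49190},
    {"container_port": 1750, "host_ip": "192.168.100.30", "host_port": 49192},
    {"container_port": 1750, "host_ip": "::", "host_port": 49189},
]

# Dropping the IPv6 wildcard duplicates restores the expected count of 2.
ipv4_only = [a for a in addresses if a["host_ip"] != "::"]
print(len(addresses), len(ipv4_only))  # 4 2
```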

3. Trial logs

[2021-04-28T08:20:04Z] 63a14abe || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:04Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:05Z] 63a14abe || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:06Z] 63a14abe || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:06Z] 63a14abe || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:06Z] 63a14abe || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:06Z] 63a14abe || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:06Z] 63a14abe || + '[' -z '' ']'
[2021-04-28T08:20:06Z] 63a14abe || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:06Z] 63a14abe || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:06Z] 63a14abe || + /bin/which python3
[2021-04-28T08:20:06Z] 63a14abe || + '[' /root = / ']'
[2021-04-28T08:20:06Z] 63a14abe || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:08Z] 63a14abe || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:08Z] 63a14abe || + cd /run/determined/workdir
[2021-04-28T08:20:08Z] 63a14abe || + test -f startup-hook.sh
[2021-04-28T08:20:08Z] 63a14abe || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:09Z] 63a14abe || INFO: New trial runner in (container 63a14abe-fc1f-4c44-8583-d2fb6f9c6136) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '63a14abe-fc1f-4c44-8583-d2fb6f9c6136', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/63a14abe-fc1f-4c44-8583-d2fb6f9c6136
[2021-04-28T08:20:09Z] 63a14abe || INFO: Connected to master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Established WebSocket session with master
[2021-04-28T08:20:09Z] 63a14abe || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49193}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49190}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49189}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:09Z] 63a14abe || Traceback (most recent call last):
[2021-04-28T08:20:09Z] 63a14abe ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:09Z] 63a14abe ||     "__main__", mod_spec)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:09Z] 63a14abe ||     exec(code, run_globals)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:09Z] 63a14abe ||     main()
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:09Z] 63a14abe ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:09Z] 63a14abe ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:09Z] 63a14abe ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:09Z] 63a14abe ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:09Z] 63a14abe ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:09Z] 63a14abe || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:09Z] 63a14abe || WARNING: disconnecting websocket
[2021-04-28T08:20:09Z] 63a14abe || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
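The traceback above points at the immediate cause of the crash: the rendezvous payload arrives with `'addrs': None` (see the `Got rendezvous information` line), but `check_for_rendezvous_info` assigns into `addrs` by rank as if it were a list. A minimal sketch reproducing that `TypeError`, assuming only the payload shape shown in the log:

```python
# Minimal reproduction of the TypeError in the trial traceback above:
# the master sends 'addrs': None after dropping the rendezvous addresses,
# and the harness then performs item assignment on None.
rendezvous_info = {"addrs": None, "rank": 0}
rendezvous_ports = [1734, 1750]

addrs = rendezvous_info["addrs"]  # None instead of a list of addresses
rank = rendezvous_info["rank"]

try:
    addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
except TypeError as exc:
    print(exc)  # 'NoneType' object does not support item assignment
```

This also explains why the master reports exit code 137 on every restart: after the harness dies on this exception, the master force-kills the container, so all five attempts fail the same way.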
[2021-04-28T08:20:10Z] d73b67ac || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:10Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:11Z] d73b67ac || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:12Z] d73b67ac || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:12Z] d73b67ac || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:12Z] d73b67ac || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:12Z] d73b67ac || + '[' -z '' ']'
[2021-04-28T08:20:12Z] d73b67ac || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:12Z] d73b67ac || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:12Z] d73b67ac || + /bin/which python3
[2021-04-28T08:20:12Z] d73b67ac || + '[' /root = / ']'
[2021-04-28T08:20:12Z] d73b67ac || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:14Z] d73b67ac || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:14Z] d73b67ac || + cd /run/determined/workdir
[2021-04-28T08:20:14Z] d73b67ac || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:14Z] d73b67ac || + test -f startup-hook.sh
[2021-04-28T08:20:15Z] d73b67ac || INFO: New trial runner in (container d73b67ac-1500-4037-8bc0-066701af1efa) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'd73b67ac-1500-4037-8bc0-066701af1efa', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/d73b67ac-1500-4037-8bc0-066701af1efa
[2021-04-28T08:20:15Z] d73b67ac || INFO: Connected to master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Established WebSocket session with master
[2021-04-28T08:20:15Z] d73b67ac || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49195}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49192}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49191}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:15Z] d73b67ac || Traceback (most recent call last):
[2021-04-28T08:20:15Z] d73b67ac ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:15Z] d73b67ac ||     "__main__", mod_spec)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:15Z] d73b67ac ||     exec(code, run_globals)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:15Z] d73b67ac ||     main()
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:15Z] d73b67ac ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:15Z] d73b67ac ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:15Z] d73b67ac ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:15Z] d73b67ac ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:15Z] d73b67ac ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:15Z] d73b67ac || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:15Z] d73b67ac || WARNING: disconnecting websocket
[2021-04-28T08:20:15Z] d73b67ac || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:16Z] b8fb8292 || INFO: image already found, skipping pull phase: docker.io/determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:16Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/train/model
[2021-04-28T08:20:17Z] b8fb8292 || INFO: copying files to container: /run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + WORKING_DIR=/run/determined/workdir
[2021-04-28T08:20:18Z] b8fb8292 || + STARTUP_HOOK=startup-hook.sh
[2021-04-28T08:20:18Z] b8fb8292 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:18Z] b8fb8292 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-04-28T08:20:18Z] b8fb8292 || + '[' -z '' ']'
[2021-04-28T08:20:18Z] b8fb8292 || + DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:18Z] b8fb8292 || + export DET_PYTHON_EXECUTABLE=python3
[2021-04-28T08:20:18Z] b8fb8292 || + /bin/which python3
[2021-04-28T08:20:18Z] b8fb8292 || + '[' /root = / ']'
[2021-04-28T08:20:18Z] b8fb8292 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.1-py3-none-any.whl
[2021-04-28T08:20:20Z] b8fb8292 || ERROR: determined 0.15.1 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-04-28T08:20:20Z] b8fb8292 || + cd /run/determined/workdir
[2021-04-28T08:20:20Z] b8fb8292 || + test -f startup-hook.sh
[2021-04-28T08:20:20Z] b8fb8292 || + exec python3 -m determined.exec.harness
[2021-04-28T08:20:21Z] b8fb8292 || INFO: New trial runner in (container b8fb8292-b905-48a2-951d-817b640b4ab6) on agent determined-agent-0: {'master_addr': '192.168.100.30', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'b8fb8292-b905-48a2-951d-817b640b4ab6', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/home/shihk/.local/share/determined', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'storage_path': 'determined-checkpoint', 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': 'default'}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 
'pod_spec': None}, 'reproducibility': {'experiment_seed': 1619598003}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 0, 'container_gpus': [], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '7ae8dca8-1748-4f46-9d54-860ca6dc3a74', 'trial_seed': 829893716, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connecting to master at ws://192.168.100.30:8080/ws/trial/1/1/b8fb8292-b905-48a2-951d-817b640b4ab6
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Connected to master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Established WebSocket session with master
[2021-04-28T08:20:21Z] b8fb8292 || INFO: Got rendezvous information: {'addrs': None, 'addrs2': None, 'containers': [{'addresses': [{'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '192.168.100.30', 'host_port': 49197}, {'container_ip': '172.17.0.2', 'container_port': 1734, 'host_ip': '::', 'host_port': 49194}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '192.168.100.30', 'host_port': 49196}, {'container_ip': '172.17.0.2', 'container_port': 1750, 'host_ip': '::', 'host_port': 49193}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-04-28T08:20:21Z] b8fb8292 || Traceback (most recent call last):
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-04-28T08:20:21Z] b8fb8292 ||     "__main__", mod_spec)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-04-28T08:20:21Z] b8fb8292 ||     exec(code, run_globals)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-04-28T08:20:21Z] b8fb8292 ||     main()
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-04-28T08:20:21Z] b8fb8292 ||     build_and_run_training_pipeline(env)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 89, in build_and_run_training_pipeline
[2021-04-28T08:20:21Z] b8fb8292 ||     with layers.SocketManager(env) as socket_mgr:
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 63, in __init__
[2021-04-28T08:20:21Z] b8fb8292 ||     ri = self.check_for_rendezvous_info(ws_event)
[2021-04-28T08:20:21Z] b8fb8292 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/layers/_socket_manager.py", line 153, in check_for_rendezvous_info
[2021-04-28T08:20:21Z] b8fb8292 ||     addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
[2021-04-28T08:20:21Z] b8fb8292 || TypeError: 'NoneType' object does not support item assignment
[2021-04-28T08:20:21Z] b8fb8292 || WARNING: disconnecting websocket
[2021-04-28T08:20:22Z] b8fb8292 || INFO: container failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
[2021-04-28T08:20:22Z through 08:20:39Z] The same startup sequence and traceback repeat for containers 8fd8ae29, 1636009d, and 1bc2292d (automatic restarts; max_restarts is 5), each attempt ending with `TypeError: 'NoneType' object does not support item assignment` and exit code 137.
Trial log stream ended. To reopen log stream, run: det trial logs -f 1

It seems the `rendezvous_ports` variable in `env` is not being set correctly?
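The failure can be reproduced in isolation: the rendezvous message in the log arrives with `'addrs': None`, and assigning into `None` raises exactly the `TypeError` in the traceback. A minimal sketch follows; the dictionary and port values are copied from the log, but the surrounding code is an illustration, not the actual `_socket_manager.py` implementation.

```python
# Minimal reproduction of the failing line. The master's rendezvous
# message arrives with 'addrs': None, so the item assignment below
# raises TypeError before rendezvous can complete.
rendezvous_info = {"addrs": None, "rank": 0}   # values as seen in the log
rendezvous_ports = [1734, 1750]                # from det_rendezvous_ports

addrs = rendezvous_info["addrs"]               # None, not a list
rank = rendezvous_info["rank"]

try:
    addrs[rank] = f"0.0.0.0:{rendezvous_ports[0]}"
except TypeError as err:
    print(err)  # 'NoneType' object does not support item assignment
```

So the proximate cause is that `addrs` is `None` when the harness tries to fill in the rendezvous address, regardless of what `rendezvous_ports` contains.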

`det-deploy local cluster-up` fails when attempting a single-node install

I have now moved over to Ubuntu and am trying to install Determined.

I get a FileNotFoundError when running the above command.

  1. `pip install determined-cli` - This is successful.
  2. `det-deploy local cluster-up` - This gives a stack trace with a FileNotFoundError.
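For context, the FileNotFoundError in the trace below is raised by `sock.connect(self.unix_socket)` inside the docker client, which suggests the Docker daemon's unix socket does not exist on the machine. A hypothetical pre-flight check (not part of the determined CLI; the function name is made up) might look like:

```shell
# check_docker_socket: report whether the Docker daemon socket exists.
# docker-py connects to /var/run/docker.sock by default, so a missing
# socket fails with FileNotFoundError before any HTTP request is sent.
check_docker_socket() {
    # $1: socket path (defaults to the standard /var/run/docker.sock)
    if [ -S "${1:-/var/run/docker.sock}" ]; then
        echo "socket present"
    else
        echo "socket missing: is the Docker daemon installed and running?"
    fi
}

check_docker_socket
```

If the socket is missing, installing and starting Docker before re-running `det-deploy local cluster-up` is the usual first step.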

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/home/greg/.local/lib/python3.8/site-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/usr/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/home/greg/.local/lib/python3.8/site-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 205, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "/home/greg/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 228, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/greg/.local/bin/docker-compose", line 8, in <module>
sys.exit(main())
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/main.py", line 67, in main
command()
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/main.py", line 123, in perform_command
project = project_from_options('.', options)
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/command.py", line 60, in project_from_options
return get_project(
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/command.py", line 131, in get_project
client = get_client(
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/docker_client.py", line 41, in get_client
client = docker_client(
File "/home/greg/.local/lib/python3.8/site-packages/compose/cli/docker_client.py", line 170, in docker_client
client = APIClient(**kwargs)
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 188, in init
self._version = self._retrieve_server_version()
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 212, in _retrieve_server_version
raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/usr/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/home/greg/.local/lib/python3.8/site-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/usr/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/home/greg/.local/lib/python3.8/site-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 205, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "/home/greg/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 228, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/greg/.local/bin/det-deploy", line 8, in <module>
sys.exit(main())
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/cli.py", line 29, in main
environment_map[environment](args)
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/local/cli.py", line 243, in deploy_local
OPERATION_TO_FN[args.command](args)
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/local/cli.py", line 175, in handle_cluster_up
cluster_utils.cluster_up(
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/local/cluster_utils.py", line 164, in cluster_up
cluster_down(cluster_name, delete_db)
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/local/cluster_utils.py", line 192, in cluster_down
stop_cluster_agents(cluster_name=cluster_name)
File "/home/greg/.local/lib/python3.8/site-packages/determined_deploy/local/cluster_utils.py", line 271, in stop_cluster_agents
docker_client = docker.from_env()
File "/home/greg/.local/lib/python3.8/site-packages/docker/client.py", line 84, in from_env
return cls(
File "/home/greg/.local/lib/python3.8/site-packages/docker/client.py", line 40, in init
self.api = APIClient(*args, **kwargs)
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 188, in init
self._version = self._retrieve_server_version()
File "/home/greg/.local/lib/python3.8/site-packages/docker/api/client.py", line 212, in _retrieve_server_version
raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
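The chained FileNotFoundError points at the Docker daemon's Unix socket rather than at Determined itself: both docker-compose and det-deploy fail while connecting to /var/run/docker.sock. A hedged pre-flight check, in plain Python and not part of any Determined API, assuming the default socket path:

```python
import os

def docker_socket_present(path="/var/run/docker.sock"):
    """Return True if the default Docker daemon socket exists on disk."""
    return os.path.exists(path)

if not docker_socket_present():
    # Most often the daemon simply isn't running (or the user lacks
    # permission to reach it); starting Docker usually resolves this.
    print("Docker socket missing; try: sudo systemctl start docker")
```

If the socket exists but the error persists, the current user may not be in the docker group, which produces a similar connection failure.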

Update buf.yaml with version and update deprecated commands/flags

https://docs.buf.build/faq

You want to add a version to the top of buf.yaml:

version: v1beta1

The flag `buf check breaking --against-input` is now `buf check breaking --against`, and `buf image build` is now `buf build` as well.

Note that right now, everything continues to work, which is why you haven't had any issues, but we will likely enforce the version key for buf.yaml at v1.0.

It may be an opportune time to check out buf generate as well :-) https://docs.buf.build/generate-usage Let us know if you have any questions!

remote debugging with VS Code

Hi again,

Would it be possible to support development and debugging inside a remote Docker container via SSH forwarding?
Let's say I have a docker image which runs an SSH server and I deploy it to a machine via your interface.
A feature which automatically redirects the container SSH port to a port on the dev machine would be very helpful.
I could then use the VS Code remote SSH-extension to connect and enable development and debugging.

ALBERT SQuAD example doesn't work

Behind a corporate firewall/proxy, the ALBERT SQuAD example gets stuck at:
Running command git clone -q git://github.com/LiyuanLucasLiu/RAdam.git /run/determined/workdir/src/radam
... I believe this is due to network policies, and request that git:// be replaced by https:// in the requirements file.

Or, in the Dockerfile, perhaps this setting can be applied:
git config --global url."https://github.com/".insteadOf git://github.com/

Unable to open Notebook via Web Services

Determined-AI version: 0.12.11
Installation method: det-deploy local cluster-up

After installing and configuring users through the Determined CLI, I am unable to access a just-created notebook. The system correctly pulls the determined/environments image, and the logs show that the notebook is created.

[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.703 LabApp] Writing notebook server cookie secret to /run/determined/jupyter/runtime/notebook_cookie_secret
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [W 07:50:04.929 LabApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.934 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.934 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.936 LabApp] Serving notebooks from local directory: /run/determined/workdir
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.936 LabApp] The Jupyter Notebook is running at:
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.936 LabApp] http://23b3f02a6e33:8888/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/
[2020-07-20T07:50:04Z] 24b5f33d [RUNNING] ||  [I 07:50:04.936 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

The problem is that when I try to open the notebook from the browser, it keeps loading for minutes and then returns a broken web page with a 502 error.

I think it might be a proxy problem. What can I do to fix this?
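The proxy suspicion can be checked quickly. A hedged sketch of generic proxy-environment inspection (not Determined-specific): if proxy variables are set and the master's host is not in the exclusion list, browser or CLI traffic to the local master may be routed through the corporate proxy and fail there with a 502.

```python
import os

def proxy_bypasses(host, environ=None):
    """Return True if `host` is listed in the no_proxy/NO_PROXY exclusions."""
    env = os.environ if environ is None else environ
    no_proxy = env.get("no_proxy") or env.get("NO_PROXY") or ""
    return host in {h.strip() for h in no_proxy.split(",") if h.strip()}

# If this is False while HTTP_PROXY/HTTPS_PROXY are set, consider adding
# the master host (e.g. localhost) to no_proxy before retrying.
print(proxy_bypasses("localhost"))
```

This only inspects the environment; the actual fix, if the diagnosis holds, is to exclude the master host from proxying in both the shell and the browser settings.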

Container is killed after 'found not all containers are connected'; experiment creation fails

Hi,
Experiments fail to run: the trial fails unexpectedly after every restart.

1. Envs

  • GPU: Titan x (only one)
  • Nvidia driver: 440.31
  • OS: Ubuntu 16.04
  • Docker: 19.03.6
  • Python: 3.7
  • Virtualenv

2. How to install

Following the official installation docs:
pip install determined
export DET_MASTER=127.0.0.1

3. How to reproduce

  1. det deploy local cluster-up
WARNING: The DET_DB_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The DET_VERSION variable is not set. Defaulting to a blank string.
Removing network determined_default
WARNING: Network determined_default not found.
WARNING: The DET_DB_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The DET_VERSION variable is not set. Defaulting to a blank string.
Removing network determined_default
WARNING: Network determined_default not found.
Creating network "determined_default" with the default driver
Creating volume "determined_determined-db-volume" with default driver
Creating determined_determined-db_1 ... done
Creating determined_determined-master_1 ... done
Waiting for master instance to be available......
Starting determined-agent-0
  2. Download mnist_pytorch.tgz from the Quick Start Guide
  3. tar xzvf mnist_pytorch.tgz and cd mnist_pytorch
  4. det experiment create const.yaml .
Preparing files (/data/workspace/github_object/determined/sample/mnist_pytorch) to send to master... 11.5KB and 7 files 
Created experiment 1
  5. docker images
REPOSITORY                       TAG                                         IMAGE ID            CREATED             SIZE
determinedai/determined-agent    0.15.1                                      23ba192606fe        6 days ago          117MB
determinedai/determined-master   0.15.1                                      1c05071d6db5        6 days ago          226MB
determinedai/environments        cuda-10.2-pytorch-1.7-tf-1.15-gpu-0.11.0    5be3c289917e        13 days ago         8.77GB
determinedai/environments        cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40   da8e0cde3cc7        2 weeks ago         8.77GB
determinedai/environments        py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40      8161c4ea54dc        2 weeks ago         3.02GB
fluent/fluent-bit                1.6                                         672c60a7ab2a        3 months ago        78.3MB
postgres                         10.14                                       3cfd168e7b61        6 months ago        200MB

4. Error and Master Logs

The experiment never actually runs, and the container is killed after 'found not all containers are connected':

<info>    [2021-04-26, 09:00:37] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/home/caoyu/.local/share/determined","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":"determined-checkpoint","type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_cpu_resource_pool":"default","default_gpu_resource_pool":"default","scheduler":{"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}]}
<info>    [2021-04-26, 09:00:37] Determined master 0.15.1 (built with go1.16.3)
<info>    [2021-04-26, 09:00:37] connecting to database determined-db:5432
<warning> [2021-04-26, 09:00:41] failed to connect to postgres, trying again in 4s  error="failed to connect to `host=determined-db user=postgres database=determined`: dial error (dial tcp 192.168.208.2:5432: connect: connection refused)"
<info>    [2021-04-26, 09:00:41] running migrations from file:///usr/share/determined/master/static/migrations
<info>    [2021-04-26, 09:00:41] unable to find golang-migrate version
<info>    [2021-04-26, 09:00:42] deleting all snapshots for terminal state experiments
<info>    [2021-04-26, 09:00:42] creating resource pool: default  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-26, 09:00:42] pool default using global scheduling config  id="agentRM" system="master" type="agentResourceManager"
<info>    [2021-04-26, 09:00:42] initializing endpoints for agents
<info>    [2021-04-26, 09:00:42] not enabling provisioner for resource pool: default  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:00:42] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info>    [2021-04-26, 09:00:42] scheduling next resource allocation aggregation in 15h0m17s at 2021-04-27 00:01:00 +0000 UTC  id="allocation-aggregator" system="master" type="allocationAggregator"
<info>    [2021-04-26, 09:00:42] accepting incoming connections on port 8080
<info>    [2021-04-26, 09:00:42] Subchannel Connectivity change to READY  system="system"
<info>    [2021-04-26, 09:00:42] pickfirstBalancer: HandleSubConnStateChange: 0xc0008c0280, {READY <nil>}  system="system"
<info>    [2021-04-26, 09:00:42] Channel Connectivity change to READY  system="system"
<info>    [2021-04-26, 09:00:44] resource pool is empty; using default resource pool: default  id="agents" system="master" type="agents"
<info>    [2021-04-26, 09:00:45] agent connected ip: 192.168.100.212 resource pool: default slots: 1  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:00:45] adding agent: determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:00:45] adding device: gpu0 (GeForce GTX TITAN X) on determined-agent-0  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:00:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetExperiments" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-26T09:00:46Z" grpc.time_ms="0.528" span.kind="server" system="grpc"
<info>    [2021-04-26, 09:00:46] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-26T09:00:46Z" grpc.time_ms="0.538" span.kind="server" system="grpc"
<info>    [2021-04-26, 09:00:47] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="GetAgents" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-26T09:00:47Z" grpc.time_ms="0.575" span.kind="server" system="grpc"
<info>    [2021-04-26, 09:00:47] finished unary call with code Unauthenticated  error="rpc error: code = Unauthenticated desc = invalid credentials" grpc.code="Unauthenticated" grpc.method="Logout" grpc.service="determined.api.v1.Determined" grpc.start_time="2021-04-26T09:00:47Z" grpc.time_ms="0.42" span.kind="server" system="grpc"
<error>   [2021-04-26, 09:00:55] failed to save trial logs  error="error inserting 1 trial logs: ERROR: insert or update on table \"trial_logs\" violates foreign key constraint \"trial_logs_trial_id_fkey\" (SQLSTATE 23503)" id="trialLogger" system="master" type="trialLogger"
<error>   [2021-04-26, 09:00:55] failed to save trial logs  error="error inserting 13 trial logs: ERROR: insert or update on table \"trial_logs\" violates foreign key constraint \"trial_logs_trial_id_fkey\" (SQLSTATE 23503)" id="trialLogger" system="master" type="trialLogger"
<warning> [2021-04-26, 09:01:42] response already committed
<info>    [2021-04-26, 09:01:42] experiment state changed to ACTIVE  id="1" system="master" type="experiment"
<info>    [2021-04-26, 09:01:42] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: 94f512dc-2ff1-4277-bd16-39498ea7e75b)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:01:43] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:01:43] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:01:43] starting container id: f5f3864f-5742-4967-9f43-cdb24bfc4c59 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:01:45] found container running: f5f3864f-5742-4967-9f43-cdb24bfc4c59 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:01:45] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:01:45] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] stopped container id: f5f3864f-5742-4967-9f43-cdb24bfc4c59  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:02:21] found container terminated: f5f3864f-5742-4967-9f43-cdb24bfc4c59  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] killing container id: f5f3864f-5742-4967-9f43-cdb24bfc4c59  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:02:21] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] resetting trial 1  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:21] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: 66fe070d-22b0-4141-a3c1-a29e320fda82)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:21] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:21] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:21] starting container id: 50b4b205-88c9-455a-be43-64f8fa273b72 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:02:23] found container running: 50b4b205-88c9-455a-be43-64f8fa273b72 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:23] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:23] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:58] stopped container id: 50b4b205-88c9-455a-be43-64f8fa273b72  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:02:58] found container terminated: 50b4b205-88c9-455a-be43-64f8fa273b72  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:58] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:58] killing container id: 50b4b205-88c9-455a-be43-64f8fa273b72  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:02:58] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:58] resetting trial 1  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:58] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:58] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: a2258605-5117-4094-baca-d9de8fbf16fa)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:59] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:02:59] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:02:59] starting container id: fc97897e-ceff-4fc9-b4aa-dbfaac8a90e2 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:03:01] found container running: fc97897e-ceff-4fc9-b4aa-dbfaac8a90e2 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:01] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:01] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] stopped container id: fc97897e-ceff-4fc9-b4aa-dbfaac8a90e2  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:03:37] found container terminated: fc97897e-ceff-4fc9-b4aa-dbfaac8a90e2  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] killing container id: fc97897e-ceff-4fc9-b4aa-dbfaac8a90e2  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:03:37] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] resetting trial 1  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:03:37] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: 31ccd525-caab-4cde-bf56-b627b0eb0480)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:03:37] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:03:37] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:37] starting container id: bf544c6d-9a58-4817-acf2-d5641798a793 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:03:40] found container running: bf544c6d-9a58-4817-acf2-d5641798a793 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:40] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:03:40] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] stopped container id: bf544c6d-9a58-4817-acf2-d5641798a793  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:04:15] found container terminated: bf544c6d-9a58-4817-acf2-d5641798a793  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] killing container id: bf544c6d-9a58-4817-acf2-d5641798a793  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:04:15] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] resetting trial 1  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:15] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: 750a0a79-e38c-4032-92ec-c1a21d28f863)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:15] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:15] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:15] starting container id: cdbfd18d-76ce-46b8-9f19-d29113fb73c5 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:04:18] found container running: cdbfd18d-76ce-46b8-9f19-d29113fb73c5 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:18] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:18] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:52] stopped container id: cdbfd18d-76ce-46b8-9f19-d29113fb73c5  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:04:52] found container terminated: cdbfd18d-76ce-46b8-9f19-d29113fb73c5  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:52] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:52] killing container id: cdbfd18d-76ce-46b8-9f19-d29113fb73c5  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:04:53] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:53] resetting trial 1  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:53] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:53] resources are requested by /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10 (Task ID: 35228f6e-a010-4046-b9fa-8b1dced693f9)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:53] allocated resources to /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:04:53] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:53] starting container id: 1d8c20c3-c506-4625-bc1e-8cc403c402d2 slots: 1 task handler: /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:04:56] found container running: 1d8c20c3-c506-4625-bc1e-8cc403c402d2 (rank 0)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:56] pushing rendezvous information  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:04:56] found not all containers are connected  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] stopped container id: 1d8c20c3-c506-4625-bc1e-8cc403c402d2  id="determined-agent-0" system="master" type="agent"
<info>    [2021-04-26, 09:05:31] found container terminated: 1d8c20c3-c506-4625-bc1e-8cc403c402d2  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] forcibly terminating trial  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] killing container id: 1d8c20c3-c506-4625-bc1e-8cc403c402d2  id="determined-agent-0" system="master" type="agent"
<error>   [2021-04-26, 09:05:31] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] trial completed workload: <RUN_STEP (100 Batches) (0 Prior Batches): (1,1,1)>  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] exiting trial early: 0xc0011a95f0  experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<error>   [2021-04-26, 09:05:31] error shutting down actor  error="trial 1 failed and reached maximum number of restarts" experiment-id="1" id="aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10" system="master" trial-id="1" type="trial"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] resources are released for /experiments/1/aaffe6c5-56e8-47da-9d6c-76bdfc4a8f10  id="default" resource-pool="default" system="master" type="ResourcePool"
<error>   [2021-04-26, 09:05:31] trial failed unexpectedly  error="trial 1 failed and reached maximum number of restarts" id="1" system="master" type="experiment"
<info>    [2021-04-26, 09:05:31] experiment state changed to STOPPING_ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-26, 09:05:31] experiment state changed to ERROR  id="1" system="master" type="experiment"
<info>    [2021-04-26, 09:05:31] resources are requested by /experiment-1-checkpoint-gc (Task ID: fb3a9773-9d69-435f-8b75-7c600425198a)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] experiment shut down successfully  id="1" system="master" type="experiment"
<info>    [2021-04-26, 09:05:31] allocated resources to /experiment-1-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info>    [2021-04-26, 09:05:31] starting checkpoint garbage collection  id="experiment-1-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-04-26, 09:05:31] starting container id: 435bfbe1-cf67-4f04-b53b-59bb00720d5f slots: 0 task handler: /experiment-1-checkpoint-gc  id="determined-agent-0" system="master" type="agent"

Can someone help me, please?

Launch Notebook - crash determined-init-container

Hey!

I'm running determined-ai in Kubernetes. When I try to start a Notebook, I get an error:

pod failed with exit code: 1 container determined-init-container: standard_init_linux.go:211: exec user process caused "permission denied"  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"

full log:

[2021-02-04, 15:32:48] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"determined","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"test-postgresql","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"access_key":"********","bucket":"test","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"secret_key":"********","type":"s3"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":{"kind":"Pod","apiVersion":"v1","metadata":{"creationTimestamp":null,"labels":{"app.kubernetes.io/component":"determined-task-cpu","app.kubernetes.io/instance":"test","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"test","helm.sh/chart":"determined-ai-1.2.1"}},"spec":{"volumes":[{"name":"determined-podspec-storage","persistentVolumeClaim":{"claimName":"test"}}],"containers":[{"name":"determined-container","resources":{"limits":{"cpu":"100m","memory":"256Mi"},"requests":{"cpu":"100m","memory":"256Mi"}},"volumeMounts":[{"name":"determined-podspec-storage","mountPath":"/media/test"}]}]},"status":{}},"gpu_pod_spec":{"kind":"Pod","apiVersion":"v1","metadata":{"creationTimestamp":null,"labels":{"app.kubernetes.io/component":"determined-task-gpu","app.kubernetes.io/instance":"test","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"test","helm.sh/chart":"determined-ai-1.2.1"}},"spec":{"volumes":[{"name":"determined-podspec-storage","persistentVolumeClaim":{"claimName":"test"}}],"containers":[{"name":"determined-container","resources":{"limits":{"cpu":"100m","memory":"256Mi","nvidia.com/gpu":"1"},"requests":{"cpu":"100m","memory":"256Mi","nvidia.com/gpu":"1"}},"volumeMounts":[{"name":"determined-podspec-storage","mountPath":"/media/test
"}]}]},"status":{}},"image":{"cpu":"jupyter/base-notebook:notebook-6.0.3","gpu":"jupyter/base-notebook:notebook-6.0.3"}},"port":8080,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":false,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"scheduler":null,"provisioner":null,"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_cpu_containers_per_agent":100}],"resource_manager":{"leave_kubernetes_resources":false,"master_service_name":"test","max_slots_per_pod":10,"namespace":"8fc7b370-fe9a-4acf-a24d-89d8fb747c00","type":"kubernetes"}}
<info> [2021-02-04, 15:32:48] Determined master 0.13.13 (built with go1.15.7)
<info> [2021-02-04, 15:32:48] connecting to database test-postgresql:5432
<info> [2021-02-04, 15:32:58] running migrations from file:///usr/share/determined/master/static/migrations
<info> [2021-02-04, 15:32:59] found golang-migrate version 20200929105146
<info> [2021-02-04, 15:32:59] deleting all searcher events for terminal state experiments
<info> [2021-02-04, 15:32:59] initializing endpoints for pods
<info> [2021-02-04, 15:32:59] kubernetes clientSet initialized  id="pods" system="master" type="pods"
<info> [2021-02-04, 15:32:59] master URL set to 10.233.44.244:8080  id="pods" system="master" type="pods"
<info> [2021-02-04, 15:32:59] telemetry reporting is disabled
<info> [2021-02-04, 15:32:59] accepting incoming connections on port 8080
<info> [2021-02-04, 15:32:59] Subchannel Connectivity change to READY  system="system"
<info> [2021-02-04, 15:32:59] pickfirstBalancer: HandleSubConnStateChange: 0xc0008134d0, {READY <nil>}  system="system"
<info> [2021-02-04, 15:32:59] Channel Connectivity change to READY  system="system"
<info> [2021-02-04, 15:32:59] event listener is starting  id="event-listener" system="master" type="eventListener"
<info> [2021-02-04, 15:32:59] pod informer is starting  id="pod-informer" system="master" type="informer"
<info> [2021-02-04, 15:33:00] deleted configMap cmd-aaf6c8cc-9224-4736-9135-c862dd39d106-eternal-titmouse  handler="/pods" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-02-04, 15:35:20] creating notebook  id="notebooks" system="master" type="notebookManager"
<info> [2021-02-04, 15:35:20] resources are requested by /notebooks/86dd2140-dc46-46aa-b10e-7548b8798955 (Task ID: 86dd2140-dc46-46aa-b10e-7548b8798955)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-02-04, 15:35:20] created notebook 86dd2140-dc46-46aa-b10e-7548b8798955  id="notebooks" system="master" type="notebookManager"
<info> [2021-02-04, 15:35:21] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/notebooks/86dd2140-dc46-46aa-b10e-7548b8798955" task-id="86dd2140-dc46-46aa-b10e-7548b8798955" type="kubernetesResourceManager"
<info> [2021-02-04, 15:35:21] registering pod handler  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="pods" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pods"
<info> [2021-02-04, 15:35:23] created configMap cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info> [2021-02-04, 15:35:27] created pod cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info> [2021-02-04, 15:35:29] transitioning pod state from ASSIGNED to PULLING  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"
<info> [2021-02-04, 15:35:29] transitioning pod state from PULLING to STARTING  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"
<info> [2021-02-04, 15:36:23] transitioning pod state from STARTING to TERMINATED  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"
<info> [2021-02-04, 15:36:23] pod failed with exit code: 1 container determined-init-container: standard_init_linux.go:211: exec user process caused "permission denied"  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"
<info> [2021-02-04, 15:36:23] requesting to delete kubernetes resources  id="pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pod"
<info> [2021-02-04, 15:36:23] de-registering pod handler  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="pods" pod="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pods"
<info> [2021-02-04, 15:36:23] resources are released for /notebooks/86dd2140-dc46-46aa-b10e-7548b8798955  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<warning> [2021-02-04, 15:36:23] received pod status update for un-registered pod  id="pods" pod-name="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pods"
<info> [2021-02-04, 15:36:23] deleted pod cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<warning> [2021-02-04, 15:36:23] received pod status update for un-registered pod  id="pods" pod-name="cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea" system="master" type="pods"
<info> [2021-02-04, 15:36:23] deleted configMap cmd-86dd2140-dc46-46aa-b10e-7548b8798955-mighty-guinea  handler="/pods/pod-a0aa3974-397a-4ceb-a701-fe3b516d9326" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"

my master.yaml:

    scheduler:
      resource_provider:
        type: "kubernetes"
        namespace: 8fc7b370-fe9a-4acf-a24d-89d8fb747c00
        max_slots_per_pod: 10
        master_service_name: test

    task_container_defaults:
      cpu_pod_spec: {"apiVersion":"v1","kind":"Pod","metadata":{"labels":{"app.kubernetes.io/component":"determined-task-cpu","app.kubernetes.io/instance":"test","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"test","helm.sh/chart":"determined-ai-1.2.1"}},"spec":{"containers":[{"name":"determined-container","resources":{"limits":{"cpu":"100m","memory":"256Mi"},"requests":{"cpu":"100m","memory":"256Mi"}},"volumeMounts":[{"mountPath":"/media/test","name":"determined-podspec-storage"}]}],"volumes":[{"name":"determined-podspec-storage","persistentVolumeClaim":{"claimName":"test"}}]}}
      gpu_pod_spec: {"apiVersion":"v1","kind":"Pod","metadata":{"labels":{"app.kubernetes.io/component":"determined-task-gpu","app.kubernetes.io/instance":"test","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"test","helm.sh/chart":"determined-ai-1.2.1"}},"spec":{"containers":[{"name":"determined-container","resources":{"limits":{"cpu":"100m","memory":"256Mi","nvidia.com/gpu":1},"requests":{"cpu":"100m","memory":"256Mi","nvidia.com/gpu":1}},"volumeMounts":[{"mountPath":"/media/test","name":"determined-podspec-storage"}]}],"volumes":[{"name":"determined-podspec-storage","persistentVolumeClaim":{"claimName":"test"}}]}}
      image:
        cpu: "jupyter/base-notebook:notebook-6.0.3"
        gpu: "jupyter/base-notebook:notebook-6.0.3"

Training and Validation times in Summary are inaccurate--continue to accumulate after experiment is over

Something is not right with the training time as depicted in this image. The actual difference between the start and end time in this image is a little less than 5 minutes, and most of that was the spin-up of an agent to support the experiment. Although the experiment is over, every time I return to this page the training time has increased. The same is true of the validation time, though it does not increase at the same rate.

image

Error in 'tf.config.set_visible_devices' when running a Keras model in eager mode

Getting the following error:

[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/run/determined/pythonuserbase/lib/python3.6/site-packages/determined/experimental/keras/_tf_keras_native.py", line 77, in init
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || master_url=master_url,
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/run/determined/pythonuserbase/lib/python3.6/site-packages/determined/experimental/_native.py", line 285, in init_native
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || master_url=master_url,
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/run/determined/pythonuserbase/lib/python3.6/site-packages/determined/experimental/_native.py", line 133, in _init_cluster_mode
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || hvd_config=load.RunpyGlobals.get_instance().hvd_config,
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/run/determined/pythonuserbase/lib/python3.6/site-packages/determined/keras/_tf_keras_trial.py", line 285, in pre_execute_hook
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || TFKerasTrialController._configure_session(env, hvd_config, session_config)
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/run/determined/pythonuserbase/lib/python3.6/site-packages/determined/keras/_tf_keras_trial.py", line 265, in _configure_session
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || tf.config.set_visible_devices(gpu, "GPU")
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || File "/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/util/module_wrapper.py", line 193, in getattr
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || attr = getattr(self._tfmw_wrapped_module, name)
[2020-10-21T19:34:47Z] 0c41d7e5 [RUNNING] || AttributeError: module 'tensorflow._api.v1.config' has no attribute 'set_visible_devices'

A possible solution is changing:

tf.config.set_visible_devices(gpu, "GPU")

to:
tf.config.experimental.set_visible_devices(gpu, "GPU")
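Since the correct call depends on the TF version (`tf.config.set_visible_devices` exists from TF 2.1 onward, while TF 1.15 only exposes the experimental variant), a version-agnostic dispatch is one way to cover both. Below is a sketch; the stub stands in for `tf.config` so the snippet runs without TensorFlow installed:

```python
from types import SimpleNamespace

def set_visible_gpus(config, gpus):
    # `config` stands in for tf.config. TF >= 2.1 exposes
    # set_visible_devices directly; TF 1.15 only has it under
    # tf.config.experimental, which triggers the AttributeError above.
    fn = getattr(config, "set_visible_devices", None)
    if fn is None:
        fn = config.experimental.set_visible_devices
    return fn(gpus, "GPU")

# Stub mimicking a TF 1.15 build, where only the experimental name exists.
calls = []
tf115_config = SimpleNamespace(
    experimental=SimpleNamespace(
        set_visible_devices=lambda devs, kind: calls.append((devs, kind))
    )
)
set_visible_gpus(tf115_config, ["GPU:0"])
print(calls)  # [(['GPU:0'], 'GPU')]
```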

AuthFailure.ServiceLinkedRoleCreationNotPermitted when attempting to launch EC2 Spot Instances

When the cluster attempts to launch EC2 spot instances, I receive the following notice and error:

<info> [2021-01-21, 05:20:24] AWS error while launching spot instances, AuthFailure.ServiceLinkedRoleCreationNotPermitted, The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.  id="provisioner" resource-pool="default" system="master" type="Provisioner"
<error> [2021-01-21, 05:20:24] cannot launch EC2 spot requests  error="AuthFailure.ServiceLinkedRoleCreationNotPermitted: The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.\n\tstatus code: 403, request id: d055892e-797c-46ae-8da5-b98f70638de8" id="provisioner" resource-pool="default" system="master" type="Provisioner"

AWS documentation states that in most cases, the AWSServiceRoleForEC2Spot service-linked role is automatically added on the first attempt at creating an EC2 Spot Instance from the console. I assume that because I have never attempted to create a Spot Instance from the console, this role was never added.

I added the role manually using the AWS CLI:

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

The AWS documentation also provides the following instructions for using the console:

  1. Open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane, choose Roles.
  3. Choose Create role.
  4. On the Select type of trusted entity page, choose EC2, EC2 - Spot Instances, Next: Permissions.
  5. On the next page, choose Next: Review.
  6. On the Review page, choose Create role.

These instructions work except that a new Step 5 should be inserted: "On the next page, choose Next: Tags", and the existing step 5 and step 6 should each be incremented by 1.

I recommend adding instructions for creating the service-linked role to the appropriate page of the Determined documentation: https://docs.determined.ai/latest/topic-guides/deployment/aws-spot.html

Header fonts are hard to distinguish

image
I think the level 1 header font should look clearly different from the level 2 and level 3 header fonts, or something like that.
As it is, it can be confusing whether Manual Deployment is a section that follows the previous one or a new section.
I thought I needed to follow everything, when in fact I only needed to follow the Deploying section. This is because I mistakenly took Manual Deployment for a required continuation of the Deploying section.

Here is example of clear difference:

Header 1

Header 2

Header 3

The above image came from https://docs.determined.ai/latest/how-to/installation/aws.html#aws-manual-deployment

Not possible to turn off telemetry via values.yaml

In the latest Helm chart (0.5.0), it is not possible to disable telemetry via a values.yaml file:

https://github.com/determined-ai/determined/blob/master/helm/charts/determined/templates/master-config.yaml#L93

    {{- if .Values.telemetry }}
    {{- if .Values.telemetry.enabled }}
    telemetry:
      enabled: {{ .Values.telemetry.enabled }}
    {{- end }}
    {{- end }}

If .Values.telemetry.enabled is false, this section is not added to the config YAML at all. Removing the inner if statement results in the desired behavior.
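A corrected version of the template, per the fix described above (a sketch: dropping only the inner if, so `telemetry.enabled` is emitted whenever a telemetry section is configured, whether it is true or false):

```yaml
{{- if .Values.telemetry }}
telemetry:
  enabled: {{ .Values.telemetry.enabled }}
{{- end }}
```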

ERROR 503: Service Unavailable on WebUI Login/Dashboard

Hi,

I wanted to set up a small cluster on my laptop with det-deploy local cluster-up --master-port 8080 --no-gpu. Everything seems fine: I open the web UI at http://localhost:8080 in my browser and the login page appears. After entering my credentials, apparently nothing happens.

I refresh, after which I'm redirected to http://localhost/det/dashboard, where the page keeps loading endlessly.

The debugging console in my browser shows that some requests fail with HTTP 503. On closer inspection with curl -vvv http://localhost:8080/api/v1/agents, I get the following response:

...
<p>The following error was encountered while trying to retrieve the URL: <a href=":8080">:8080</a></p>
...

I suppose a hostname has to be inserted, but where? Is it automatically deduced from some (environment) variable? Can I set it explicitly?

$ det-deploy --version
det-deploy 0.14.3
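The error page quoted above ("The following error was encountered while trying to retrieve the URL") is the typical phrasing of an intercepting HTTP proxy such as squid, which would also explain the 503s: requests to localhost:8080 may be routed through a proxy that cannot resolve the bare `:8080` URL. This is an assumption, not a confirmed diagnosis, but it is cheap to check which proxy variables are set and whether localhost is excluded via `no_proxy`:

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")

def proxy_settings(environ=os.environ):
    """Return the proxy-related environment variables that are set."""
    return {v: environ[v] for v in PROXY_VARS if v in environ}

# If http_proxy is set and no_proxy does not cover localhost, curl and the
# browser will send localhost traffic to the proxy.
print(proxy_settings())
```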

I kindly appreciate any help.

Cheers,
Michael

mnist_gan_pytorch example errors with "'PyTorchTrialContext' object has no attribute 'wrap_model'"

315f8025 [RUNNING] || Wed Aug 12 22:48:22 2020[1]<stderr>:AttributeError: 'PyTorchTrialContext' object has no attribute 'wrap_model'

I get this error while running the 0.12.13 version of the mnist_gan_pytorch example under examples/official/trial/mnist_gan_pytorch.

Seems to be related to lines 85 and 86 of model_def.py:

        self.generator = self.context.wrap_model(Generator(latent_dim=self.context.get_hparam("latent_dim"), img_shape=mnist_shape))
        self.discriminator = self.context.wrap_model(Discriminator(img_shape=mnist_shape))
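If the cause is a version mismatch (an assumption: the 0.12.13 example calls context.wrap_model(), which an older installed determined package does not provide), upgrading the determined library to match the example version is the clean fix. As a stopgap, a compatibility shim can degrade gracefully; LegacyContext below is a hypothetical stand-in for the older context, not a real Determined class:

```python
class LegacyContext:
    """Hypothetical stand-in for an older PyTorchTrialContext without wrap_model."""

def wrap_if_supported(context, model):
    # Newer Determined releases require wrap_model() so the framework can
    # manage the model for distributed training; older releases lack the
    # method entirely, producing the AttributeError above.
    if hasattr(context, "wrap_model"):
        return context.wrap_model(model)
    return model

model = object()
print(wrap_if_supported(LegacyContext(), model) is model)  # True on the old API
```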

cuda related error when try to run example

I tried to run this tutorial. I started the job with the command: det experiment create const.yaml .

I see the failed result in the Web UI. The error log is as follows (the log repeats 6 times):

[2020-11-19, 14:08:19] 25f5d319 [PULLING] || image already found, skipping pull phase: docker.io/determinedai/environments:cuda-10.0-pytorch-1.4-tf-1.15-gpu-1def2ee
[2020-11-19, 14:08:20] 25f5d319 [STARTING] || copying files to container: /
[2020-11-19, 14:08:20] 25f5d319 [STARTING] || copying files to container: /
[2020-11-19, 14:08:20] 25f5d319 [STARTING] || copying files to container: /
[2020-11-19, 14:08:20] 25f5d319 [STARTING] || copying files to container: /run/determined/train
[2020-11-19, 14:08:21] 25f5d319 [STARTING] || copying files to container: /run/determined/workdir
[2020-11-19, 14:08:21] 25f5d319 [STARTING] || copying files to container: /
[2020-11-19, 14:08:21] 25f5d319 [STARTING] || copying files to container: /
[2020-11-19, 14:08:22] 25f5d319 [TERMINATED] || container failed with non-zero exit code: error starting container: Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu compute utility]]

I have installed CUDA 10.2, and I can run GPU-enabled training with TensorFlow and PyTorch on this machine. Can someone tell me what's going on?

Stack Deployment Failed. Check the AWS CloudFormation Console for details

I get this error when trying to deploy on AWS. I think it was somehow caused by my manually stopping the master instance to save money.

(base) C:\Users\off99>det deploy aws up --cluster-id moonsniper --keypair moonsniper --gpu-agent-instance-type p2.8xlarge --max-dynamic-agents 1 --max-idle-agent-period 5m
Starting Determined Deployment
Determined Version: 0.16.0
Stack Name: moonsniper
AWS Region: us-west-2
Keypair: moonsniper
Checking if the SSH Keypair (moonsniper) exists: True
Checking if the CloudFormation Stack (moonsniper) exists: True - Updating Stack
Updating stack moonsniper. This may take a few minutes... Check the CloudFormation Console for updates
'Outputs'
Stack Deployment Failed. Check the AWS CloudFormation Console for details.

I already terminated the master instance and deleted the S3 bucket and the CloudFormation stack to try to start from scratch. How do I fix this?
This is the screenshot of the CloudFormation:
image

Another question:
If I have already deployed the cluster, is it OK to run the deployment again with the same command just to change configuration such as the GPU instance type and the number of dynamic agents?

RuntimeError: CUDA error: no kernel image is available for execution on the device

🐛 Bug

Hi,
I received ERRORED after creating an experiment via the CLI, and the trial logs show the RuntimeError mentioned in the title.
It seems my CUDA version is too high. Should I downgrade it?

1. Envs

  • OS: Ubuntu 20.04
  • Docker: 20.10.6/20.10.2/19.3.0
  • GPU: 3090 & 2080
  • CUDA: 11.2
  • Kubernetes 1.18.8
  • Determined: 1.15.1~1.15.3 all tried
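The "CUDA version too high" suspicion can be narrowed down: a "no kernel image is available" error usually means the installed PyTorch wheel was not compiled for the GPU's compute capability (an RTX 3090 is sm_86, which CUDA 10.x-era wheels do not cover), rather than a driver problem. Below is a pure-Python sketch of the check; in practice the inputs would come from torch.cuda.get_device_capability() and torch.cuda.get_arch_list(), and the arch lists shown are illustrative, not taken from a specific wheel:

```python
def kernels_available(device_capability, arch_list):
    """Check whether a wheel ships kernels usable on this GPU.

    device_capability: (major, minor), e.g. (8, 6) for an RTX 3090.
    arch_list: entries like 'sm_75' (binary kernel, exact arch only) or
               'compute_75' (PTX, JIT-compilable on newer archs too).
    """
    sm = device_capability[0] * 10 + device_capability[1]
    for entry in arch_list:
        kind, _, ver = entry.partition("_")
        if kind == "sm" and int(ver) == sm:
            return True
        if kind == "compute" and int(ver) <= sm:
            return True
    return False

# A CUDA 10.x-era wheel tops out at sm_75, so an sm_86 card has no usable
# kernel image; a CUDA 11.x wheel that includes sm_86 works.
print(kernels_available((8, 6), ["sm_37", "sm_60", "sm_70", "sm_75"]))  # False
print(kernels_available((8, 6), ["sm_80", "sm_86", "compute_86"]))      # True
```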

2. How to install

Use helm to deploy on k8s, according to the official doc:

# Install Helm
$ snap install helm --classic

# Pull chart and values
$ wget https://docs.determined.ai/latest/_downloads/d4ac66d27b6e777fe944620f402e571d/determined-0.5.0.tgz
$ tar zxvf determined-0.5.0.tgz
$ rm determined-0.5.0.tgz
$ cd determined/

# Configure values.yaml
$ vim values.yaml
# Change the file as follow
maxSlotsPerPod: 2
useNodePortForMaster: true

# Install Determined
$ kubectl create ns determined
$ helm install determined-ai ./ -n determined

3. How to use

There are two methods to reproduce my error.

3.1 Method 1: Use by CLI

Launching a Notebook and running the sample notebook in the Web UI works fine, but the MNIST example from the QUICK START GUIDE, submitted via the CLI, ends up ERRORED.

$ wget http://10.102.32.201:30000/docs/_downloads/61c6df286ba829cb9730a0407275ce50/mnist_pytorch.tgz
$ tar xzvf mnist_pytorch.tgz
$ cd mnist_pytorch
$ det experiment create const.yaml .

Preparing files (/home/miuii/det/mnist_pytorch) to send to master... 11.5KB and 7 files
Created experiment 4

After running the above commands, a new experiment appears on the WebUI Dashboard. Several minutes later, it turned from "ACTIVE" to "ERRORED" status.

3.2 Method 2: Use in Notebook

Minimal test case:

import torch
from torchvision import models
import numpy as np
import IPython

print(torch.cuda.is_available())

image = np.random.random(size=[2, 3, 224, 224]).astype("float32")  # astype() converts; assigning to image.dtype would reinterpret the raw buffer

image_tensor = torch.from_numpy(image).cuda()

model = models.resnet50(pretrained=True)
model = model.cuda()

out = model(image_tensor)
print(out)

# IPython.embed()

You may need to run the following commands in the Notebook terminal to avoid "ImportError: IntProgress not found.":

conda install -c conda-forge ipywidgets
jupyter nbextension enable --py widgetsnbextension

Error that occurred when running the above code:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-ee58e2937bbe> in <module>
     14 model = model.cuda()
     15 
---> 16 out = model(image_tensor)
     17 print(out)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/torchvision/models/resnet.py in forward(self, x)
    218 
    219     def forward(self, x):
--> 220         return self._forward_impl(x)
    221 
    222 

/opt/conda/lib/python3.7/site-packages/torchvision/models/resnet.py in _forward_impl(self, x)
    202         # See note [TorchScript super()]
    203         x = self.conv1(x)
--> 204         x = self.bn1(x)
    205         x = self.relu(x)
    206         x = self.maxpool(x)

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py in forward(self, input)
    109             # TODO: if statement only here to tell the jit to skip emitting this when it is None
    110             if self.num_batches_tracked is not None:
--> 111                 self.num_batches_tracked = self.num_batches_tracked + 1
    112                 if self.momentum is None:  # use cumulative moving average
    113                     exponential_average_factor = 1.0 / float(self.num_batches_tracked)

RuntimeError: CUDA error: no kernel image is available for execution on the device
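This RuntimeError usually means the installed PyTorch wheel was not compiled with binary kernels (sm_XX) for the GPU's compute capability. In a live environment, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` report the two sides of that check; a minimal sketch of the logic, with an illustrative (not measured) arch list:

```python
# Hedged sketch: "no kernel image is available" typically indicates that the
# GPU's compute capability is absent from the arch list the wheel was built
# with. The capability tuples and arch list here are illustrative only.

def kernel_available(device_capability, compiled_arch_list):
    """True if the wheel ships a binary kernel (sm_XX) for this device."""
    major, minor = device_capability
    return f"sm_{major}{minor}" in compiled_arch_list

# Example: a wheel built for sm_37..sm_75 has no kernel for an sm_80 GPU.
archs = ["sm_37", "sm_52", "sm_60", "sm_70", "sm_75"]
print(kernel_available((7, 5), archs))  # True
print(kernel_available((8, 0), archs))  # False
```

If the check fails for your GPU, an image built against a newer CUDA toolkit (or a PyTorch build that includes your sm_XX target) is needed.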

4. Error and Master Logs

4.1 Master Logs

<info>    [2021-05-06, 09:04:58] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db-service-determined-ai","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/checkpoints","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null},"port":8081,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_scheduler":"","leave_kubernetes_resources":false,"master_service_name":"determined-master-service-determined-ai","max_slots_per_pod":2,"namespace":"determined","type":"kubernetes"},"resource_pools":null}
<info>    [2021-05-06, 09:04:58] Determined master 0.15.3 (built with go1.16.3)
<info>    [2021-05-06, 09:04:58] connecting to database determined-db-service-determined-ai:5432
<info>    [2021-05-06, 09:04:58] running migrations from file:///usr/share/determined/master/static/migrations
<info>    [2021-05-06, 09:04:58] found golang-migrate version 20210322160616
<info>    [2021-05-06, 09:04:58] deleting all snapshots for terminal state experiments
<info>    [2021-05-06, 09:04:58] initializing endpoints for pods
<info>    [2021-05-06, 09:04:58] kubernetes clientSet initialized  id="pods" system="master" type="pods"
<info>    [2021-05-06, 09:04:58] scheduling next resource allocation aggregation in 14h56m1s at 2021-05-07 00:01:00 +0000 UTC  id="allocation-aggregator" system="master" type="allocationAggregator"
<info>    [2021-05-06, 09:04:58] master URL set to 192.168.245.182:8080  id="pods" system="master" type="pods"
<info>    [2021-05-06, 09:04:58] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info>    [2021-05-06, 09:04:58] accepting incoming connections on port 8081
<info>    [2021-05-06, 09:04:58] Subchannel Connectivity change to READY  system="system"
<info>    [2021-05-06, 09:04:58] pickfirstBalancer: HandleSubConnStateChange: 0xc000556340, {READY <nil>}  system="system"
<info>    [2021-05-06, 09:04:58] Channel Connectivity change to READY  system="system"
<info>    [2021-05-06, 09:04:58] event listener is starting  id="event-listener" system="master" type="eventListener"
<info>    [2021-05-06, 09:04:58] pod informer is starting  id="pod-informer" system="master" type="informer"
<info>    [2021-05-06, 09:04:58] preemption listener is starting  id="preemption-listener" system="master" type="preemptionListener"
<info>    [2021-05-06, 09:04:58] node informer has started  id="node-informer" system="master" type="nodeInformer"

<info>    [2021-05-06, 11:25:56] experiment state changed to ACTIVE  id="4" system="master" type="experiment"
<info>    [2021-05-06, 11:25:56] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: bcccb550-1c39-423b-8353-304fc5dccf72)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:25:56] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="bcccb550-1c39-423b-8353-304fc5dccf72" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:25:56] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:25:56] registering pod handler  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="pods" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info>    [2021-05-06, 11:25:56] created configMap exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:25:56] created pod exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:25:56] transitioning pod state from ASSIGNED to PULLING  id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:25:56] transitioning pod state from PULLING to STARTING  id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:26:00] transitioning pod state from STARTING to RUNNING  id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:26:00] found container running: eac146c4-1b8d-402b-a533-6ed3c3ff0845 (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:00] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:00] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:04] new connection from container eac146c4-1b8d-402b-a533-6ed3c3ff0845 trial 4 (experiment 4) at 100.82.21.4:46300
<info>    [2021-05-06, 11:26:04] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:04] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:26] transitioning pod state from RUNNING to TERMINATED  id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:26:26] pod failed with exit code: 1   id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:26:26] found container terminated: eac146c4-1b8d-402b-a533-6ed3c3ff0845  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:26] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error>   [2021-05-06, 11:26:26] unexpected failure of trial after restart 0/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:26] resetting trial 4  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:26] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:27] requesting to delete kubernetes resources  id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info>    [2021-05-06, 11:26:27] de-registering pod handler  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="pods" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info>    [2021-05-06, 11:26:27] deleted pod exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:26:27] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info>    [2021-05-06, 11:26:27] deleted configMap exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish  handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:26:27] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: 37fda087-fb9f-4ece-ac40-3fd48dd20b05)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:27] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="37fda087-fb9f-4ece-ac40-3fd48dd20b05" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:27] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:27] registering pod handler  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="pods" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info>    [2021-05-06, 11:26:27] created configMap exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:26:27] created pod exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:26:27] transitioning pod state from ASSIGNED to PULLING  id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info>    [2021-05-06, 11:26:27] transitioning pod state from PULLING to STARTING  id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<warning> [2021-05-06, 11:26:33] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info>    [2021-05-06, 11:26:33] transitioning pod state from STARTING to RUNNING  id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info>    [2021-05-06, 11:26:33] found container running: 407621fe-2e52-43f3-a4d0-9058cf2630ed (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:33] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:33] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:26:34] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<warning> [2021-05-06, 11:26:34] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info>    [2021-05-06, 11:26:36] new connection from container 407621fe-2e52-43f3-a4d0-9058cf2630ed trial 4 (experiment 4) at 100.82.21.5:44404
<info>    [2021-05-06, 11:26:36] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:36] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:48] transitioning pod state from RUNNING to TERMINATED  id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info>    [2021-05-06, 11:26:48] pod failed with exit code: 1   id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info>    [2021-05-06, 11:26:48] found container terminated: 407621fe-2e52-43f3-a4d0-9058cf2630ed  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:48] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:48] requesting to delete kubernetes resources  id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info>    [2021-05-06, 11:26:48] de-registering pod handler  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="pods" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info>    [2021-05-06, 11:26:48] deleted pod exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:26:48] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info>    [2021-05-06, 11:26:48] deleted configMap exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck  handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<error>   [2021-05-06, 11:26:48] unexpected failure of trial after restart 1/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:48] resetting trial 4  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:48] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:48] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: d6bfa1db-15fe-4f01-b699-41f60c327e33)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:49] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="d6bfa1db-15fe-4f01-b699-41f60c327e33" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:26:49] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:49] registering pod handler  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="pods" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info>    [2021-05-06, 11:26:49] created configMap exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:26:49] created pod exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:26:49] transitioning pod state from ASSIGNED to PULLING  id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info>    [2021-05-06, 11:26:49] transitioning pod state from PULLING to STARTING  id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<warning> [2021-05-06, 11:26:55] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info>    [2021-05-06, 11:26:55] transitioning pod state from STARTING to RUNNING  id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info>    [2021-05-06, 11:26:55] found container running: 4db3595f-1526-4dde-9385-dc7de6860d56 (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:55] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:55] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:58] new connection from container 4db3595f-1526-4dde-9385-dc7de6860d56 trial 4 (experiment 4) at 100.82.21.8:57194
<info>    [2021-05-06, 11:26:58] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:26:58] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:27:02] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<warning> [2021-05-06, 11:27:02] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<warning> [2021-05-06, 11:27:34] preemption listener stopped unexpectedly  id="preemption-listener" system="master" type="preemptionListener"
<info>    [2021-05-06, 11:27:34] preemption listener is starting  id="preemption-listener" system="master" type="preemptionListener"
<info>    [2021-05-06, 11:27:49] transitioning pod state from RUNNING to TERMINATED  id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info>    [2021-05-06, 11:27:49] pod failed with exit code: 1   id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info>    [2021-05-06, 11:27:49] found container terminated: 4db3595f-1526-4dde-9385-dc7de6860d56  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:49] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error>   [2021-05-06, 11:27:49] unexpected failure of trial after restart 2/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:49] resetting trial 4  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:49] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:27:49] requesting to delete kubernetes resources  id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info>    [2021-05-06, 11:27:49] de-registering pod handler  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="pods" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info>    [2021-05-06, 11:27:49] deleted pod exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:27:49] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: f672e95a-a74b-44c7-991f-2c83fe90e225)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<warning> [2021-05-06, 11:27:49] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info>    [2021-05-06, 11:27:49] deleted configMap exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe  handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:27:50] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="f672e95a-a74b-44c7-991f-2c83fe90e225" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:27:50] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:50] registering pod handler  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="pods" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info>    [2021-05-06, 11:27:50] created configMap exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:27:50] created pod exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:27:50] transitioning pod state from ASSIGNED to PULLING  id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:27:50] transitioning pod state from PULLING to STARTING  id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:27:56] transitioning pod state from STARTING to RUNNING  id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:27:56] found container running: 45593757-aeb6-495b-ae10-b8cfa617ca1c (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:56] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:27:56] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:27:56] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info>    [2021-05-06, 11:28:00] new connection from container 45593757-aeb6-495b-ae10-b8cfa617ca1c trial 4 (experiment 4) at 100.82.21.9:38376
<info>    [2021-05-06, 11:28:00] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:00] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:02] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<warning> [2021-05-06, 11:28:02] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info>    [2021-05-06, 11:28:19] transitioning pod state from RUNNING to TERMINATED  id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:28:19] pod failed with exit code: 1   id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:28:19] found container terminated: 45593757-aeb6-495b-ae10-b8cfa617ca1c  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:19] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error>   [2021-05-06, 11:28:19] unexpected failure of trial after restart 3/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:19] resetting trial 4  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:19] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:19] requesting to delete kubernetes resources  id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info>    [2021-05-06, 11:28:19] de-registering pod handler  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="pods" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info>    [2021-05-06, 11:28:19] deleted pod exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:28:19] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info>    [2021-05-06, 11:28:19] deleted configMap exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie  handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:28:19] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: c824a795-d349-4383-82df-599f3737820c)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:20] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="c824a795-d349-4383-82df-599f3737820c" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:20] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:20] registering pod handler  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="pods" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info>    [2021-05-06, 11:28:20] created configMap exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:28:20] created pod exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:28:20] transitioning pod state from ASSIGNED to PULLING  id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:20] transitioning pod state from PULLING to STARTING  id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:25] transitioning pod state from STARTING to RUNNING  id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:25] found container running: c1b0d268-193b-4591-97bb-a2b44befa968 (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:25] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:25] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:26] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info>    [2021-05-06, 11:28:29] new connection from container c1b0d268-193b-4591-97bb-a2b44befa968 trial 4 (experiment 4) at 100.82.21.10:47188
<info>    [2021-05-06, 11:28:29] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:29] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:32] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<warning> [2021-05-06, 11:28:32] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info>    [2021-05-06, 11:28:41] transitioning pod state from RUNNING to TERMINATED  id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:41] pod failed with exit code: 1   id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:41] found container terminated: c1b0d268-193b-4591-97bb-a2b44befa968  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:41] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:41] requesting to delete kubernetes resources  id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info>    [2021-05-06, 11:28:41] de-registering pod handler  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="pods" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info>    [2021-05-06, 11:28:41] deleted pod exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:28:41] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info>    [2021-05-06, 11:28:41] deleted configMap exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan  handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<error>   [2021-05-06, 11:28:41] unexpected failure of trial after restart 4/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:41] resetting trial 4  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:41] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:41] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: ad26f119-f705-4544-be46-06e7efd80a78)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:42] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="ad26f119-f705-4544-be46-06e7efd80a78" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:28:42] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)>  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:42] registering pod handler  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="pods" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info>    [2021-05-06, 11:28:42] created configMap exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:28:42] created pod exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:28:42] transitioning pod state from ASSIGNED to PULLING  id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info>    [2021-05-06, 11:28:42] transitioning pod state from PULLING to STARTING  id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<warning> [2021-05-06, 11:28:48] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info>    [2021-05-06, 11:28:48] transitioning pod state from STARTING to RUNNING  id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info>    [2021-05-06, 11:28:48] found container running: 98e667f9-a2e1-46a5-87b1-42786cd67805 (rank 0)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:48] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:48] found not all containers are connected  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:49] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<warning> [2021-05-06, 11:28:49] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info>    [2021-05-06, 11:28:51] new connection from container 98e667f9-a2e1-46a5-87b1-42786cd67805 trial 4 (experiment 4) at 100.82.21.11:46114
<info>    [2021-05-06, 11:28:51] pushing rendezvous information  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:28:51] found all containers are connected successfully  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:29:03] transitioning pod state from RUNNING to TERMINATED  id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info>    [2021-05-06, 11:29:03] pod failed with exit code: 1   id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info>    [2021-05-06, 11:29:03] requesting to delete kubernetes resources  id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info>    [2021-05-06, 11:29:03] found container terminated: 98e667f9-a2e1-46a5-87b1-42786cd67805  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:29:03] forcibly terminating trial  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:29:03] de-registering pod handler  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="pods" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info>    [2021-05-06, 11:29:03] received stop pod command for unregistered container id  id="pods" pod-id="98e667f9-a2e1-46a5-87b1-42786cd67805" system="master" type="pods"
<info>    [2021-05-06, 11:29:03] deleted pod exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:03] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info>    [2021-05-06, 11:29:03] deleted configMap exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien  handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<error>   [2021-05-06, 11:29:04] unexpected failure of trial after restart 5/5: container failed with non-zero exit code:  (exit code 1)  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] exiting trial early from <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> with reason ERRORED  experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error>   [2021-05-06, 11:29:04] error shutting down actor  error="trial 4 failed and reached maximum number of restarts" experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<error>   [2021-05-06, 11:29:04] trial failed unexpectedly  error="trial 4 failed and reached maximum number of restarts" id="4" system="master" type="experiment"
<info>    [2021-05-06, 11:29:04] experiment state changed to STOPPING_ERROR  id="4" system="master" type="experiment"
<info>    [2021-05-06, 11:29:04] experiment state changed to ERROR  id="4" system="master" type="experiment"
<info>    [2021-05-06, 11:29:04] resources are requested by /experiment-4-checkpoint-gc (Task ID: 4d36779b-8de5-4a9a-9558-e6655935369f)  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] experiment shut down successfully  id="4" system="master" type="experiment"
<info>    [2021-05-06, 11:29:04] resources assigned with 1 pods  id="kubernetesRM" system="master" task-handler="/experiment-4-checkpoint-gc" task-id="4d36779b-8de5-4a9a-9558-e6655935369f" type="kubernetesResourceManager"
<info>    [2021-05-06, 11:29:04] starting checkpoint garbage collection  id="experiment-4-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-05-06, 11:29:04] registering pod handler  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="pods" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info>    [2021-05-06, 11:29:04] created configMap gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:29:04] created pod gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info>    [2021-05-06, 11:29:04] transitioning pod state from ASSIGNED to PULLING  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:04] transitioning pod state from PULLING to STARTING  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:07] transitioning pod state from STARTING to RUNNING  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:09] transitioning pod state from RUNNING to TERMINATED  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:09] pod exited successfully  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:09] finished checkpoint garbage collection  id="experiment-4-checkpoint-gc" system="master" type="checkpointGCTask"
<info>    [2021-05-06, 11:29:09] resources are released for /experiment-4-checkpoint-gc  id="kubernetesRM" system="master" type="kubernetesResourceManager"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info>    [2021-05-06, 11:29:10] requesting to delete kubernetes resources  id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info>    [2021-05-06, 11:29:10] de-registering pod handler  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="pods" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info>    [2021-05-06, 11:29:10] deleted pod gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod  id="pods" pod-name="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod  id="pods" pod-name="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info>    [2021-05-06, 11:29:10] deleted configMap gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant  handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:11] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<warning> [2021-05-06, 11:29:11] received pod status update for un-registered pod  id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"

4.2 Trial logs

The trial logs, obtained by running `det trial logs 1 > experiment_4_trial_5_logs.txt`:

[2021-05-06T03:34:00Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Pod resources allocated.
[2021-05-06T03:34:00Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-init-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-init-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-fluent-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-fluent-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:03Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-container
[2021-05-06T03:34:03Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-container
[2021-05-06T03:34:04Z] ad2df585 || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:04Z] ad2df585 || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:04Z] ad2df585 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:04Z] ad2df585 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:04Z] ad2df585 || + '[' -z '' ']'
[2021-05-06T03:34:04Z] ad2df585 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:04Z] ad2df585 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:04Z] ad2df585 || + /bin/which python3
[2021-05-06T03:34:04Z] ad2df585 || + '[' /root = / ']'
[2021-05-06T03:34:04Z] ad2df585 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:07Z] ad2df585 || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:07Z] ad2df585 || + cd /run/determined/workdir
[2021-05-06T03:34:07Z] ad2df585 || + test -f startup-hook.sh
[2021-05-06T03:34:07Z] ad2df585 || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:08Z] ad2df585 || INFO: New trial runner in (container ad2df585-0619-404d-92e2-465adc5d00eb) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'ad2df585-0619-404d-92e2-465adc5d00eb', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-e5db4f9e-516e-9153-4a40-b0be71028ab5'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:08Z] ad2df585 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/ad2df585-0619-404d-92e2-465adc5d00eb
[2021-05-06T03:34:08Z] ad2df585 || INFO: Connected to master
[2021-05-06T03:34:08Z] ad2df585 || INFO: Established WebSocket session with master
[2021-05-06T03:34:08Z] ad2df585 || INFO: Got rendezvous information: {'addrs': ['100.122.176.247:1734'], 'addrs2': ['100.122.176.247:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.247', 'container_port': 1734, 'host_ip': '100.122.176.247', 'host_port': 1734}, {'container_ip': '100.122.176.247', 'container_port': 1750, 'host_ip': '100.122.176.247', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:09Z] ad2df585 || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:09Z] ad2df585 || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:10Z] ad2df585 || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:34:10Z] ad2df585 || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:10Z] ad2df585 || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:10Z] ad2df585 || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:10Z] ad2df585 ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:12Z] ad2df585 || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:12Z] ad2df585 || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:34:18Z] ad2df585 || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:34:18Z] ad2df585 || INFO: WebSocket closed
[2021-05-06T03:34:18Z] ad2df585 || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:34:18Z] ad2df585 || Traceback (most recent call last):
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:34:18Z] ad2df585 ||     "__main__", mod_spec)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:34:18Z] ad2df585 ||     exec(code, run_globals)
[2021-05-06T03:34:18Z] ad2df585 ||     main()
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:34:18Z] ad2df585 ||     build_and_run_training_pipeline(env)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:34:18Z] ad2df585 ||     controller.run()
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:34:18Z] ad2df585 ||     w.total_batches_processed,
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:34:18Z] ad2df585 ||     batch_idx=batch_idx,
[2021-05-06T03:34:18Z] ad2df585 ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:18Z] ad2df585 ||     output = self.model(data)
[2021-05-06T03:34:18Z] ad2df585 ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:34:18Z] ad2df585 ||     input = module(input)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:18Z] ad2df585 ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:34:18Z] ad2df585 ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:34:18Z] ad2df585 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:34:18Z] ad2df585 ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:34:18Z] ad2df585 || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:34:20Z] ad2df585 || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-06T03:34:21Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Pod resources allocated.
[2021-05-06T03:34:22Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:22Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-init-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-init-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-fluent-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-fluent-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:25Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-container
[2021-05-06T03:34:26Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-container
[2021-05-06T03:34:29Z] 4568b05f || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:29Z] 4568b05f || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:29Z] 4568b05f || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:29Z] 4568b05f || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:29Z] 4568b05f || + '[' -z '' ']'
[2021-05-06T03:34:29Z] 4568b05f || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:29Z] 4568b05f || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:29Z] 4568b05f || + /bin/which python3
[2021-05-06T03:34:29Z] 4568b05f || + '[' /root = / ']'
[2021-05-06T03:34:29Z] 4568b05f || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:29Z] 4568b05f || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:29Z] 4568b05f || + cd /run/determined/workdir
[2021-05-06T03:34:29Z] 4568b05f || + test -f startup-hook.sh
[2021-05-06T03:34:29Z] 4568b05f || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:29Z] 4568b05f || INFO: New trial runner in (container 4568b05f-8693-4379-9856-eaa3768b5d11) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '4568b05f-8693-4379-9856-eaa3768b5d11', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a523e020-b504-535c-2b83-967ab28cdbab'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:29Z] 4568b05f || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/4568b05f-8693-4379-9856-eaa3768b5d11
[2021-05-06T03:34:29Z] 4568b05f || INFO: Connected to master
[2021-05-06T03:34:29Z] 4568b05f || INFO: Established WebSocket session with master
[2021-05-06T03:34:29Z] 4568b05f || INFO: Got rendezvous information: {'addrs': ['100.122.176.248:1734'], 'addrs2': ['100.122.176.248:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.248', 'container_port': 1734, 'host_ip': '100.122.176.248', 'host_port': 1734}, {'container_ip': '100.122.176.248', 'container_port': 1750, 'host_ip': '100.122.176.248', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:31Z] 4568b05f || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:31Z] 4568b05f || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:31Z] 4568b05f || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:34:31Z] 4568b05f || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:31Z] 4568b05f || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:31Z] 4568b05f || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:31Z] 4568b05f ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:33Z] 4568b05f || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:33Z] 4568b05f || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:34:39Z] 4568b05f || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:34:39Z] 4568b05f || INFO: WebSocket closed
[2021-05-06T03:34:39Z] 4568b05f || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:34:39Z] 4568b05f || Traceback (most recent call last):
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:34:39Z] 4568b05f ||     "__main__", mod_spec)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:34:39Z] 4568b05f ||     exec(code, run_globals)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:34:39Z] 4568b05f ||     main()
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:34:39Z] 4568b05f ||     build_and_run_training_pipeline(env)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:34:39Z] 4568b05f ||     controller.run()
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:34:39Z] 4568b05f ||     w.total_batches_processed,
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:34:39Z] 4568b05f ||     batch_idx=batch_idx,
[2021-05-06T03:34:39Z] 4568b05f ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:34:39Z] 4568b05f ||     output = self.model(data)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:39Z] 4568b05f ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:34:39Z] 4568b05f ||     input = module(input)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:39Z] 4568b05f ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:34:39Z] 4568b05f ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:34:39Z] 4568b05f ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:34:39Z] 4568b05f ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:34:39Z] 4568b05f || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:34:40Z] 4568b05f || INFO: container failed with non-zero exit code:  (exit code 1)
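Every restart in this log fails the same way: the `cuda-10.2` image's PyTorch build ships kernels only for `sm_37`–`sm_75`, while the GeForce RTX 3090 is `sm_86`, so the first convolution raises `RuntimeError: CUDA error: no kernel image is available for execution on the device`. A minimal sketch of the compatibility check implied by the warning — the `is_supported` helper is hypothetical, and in a live environment its inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; here they are hard-coded from the values in the log:

```python
def is_supported(device_capability, arch_list):
    """Return True if the device's sm_XY arch appears in the build's arch list."""
    major, minor = device_capability
    return f"sm_{major}{minor}" in arch_list

# Arches taken from the UserWarning above: the cuda-10.2 image's PyTorch
# was compiled for sm_37 through sm_75 only.
build_arches = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]

print(is_supported((8, 6), build_arches))  # RTX 3090 (sm_86) -> False
print(is_supported((7, 5), build_arches))  # e.g. a Turing GPU (sm_75) -> True
```

When this check fails, the fix is an environment image whose PyTorch was built with a CUDA toolkit that targets the newer architecture (CUDA 11+ for `sm_86`), rather than any change to the model code.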
[2021-05-06T03:34:41Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Pod resources allocated.
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-init-container
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-init-container
[2021-05-06T03:34:43Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:43Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-fluent-container
[2021-05-06T03:34:44Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-fluent-container
[2021-05-06T03:34:44Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:45Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-container
[2021-05-06T03:34:46Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-container
[2021-05-06T03:34:49Z] 7047ec73 || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:49Z] 7047ec73 || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:49Z] 7047ec73 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:49Z] 7047ec73 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:49Z] 7047ec73 || + '[' -z '' ']'
[2021-05-06T03:34:49Z] 7047ec73 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:49Z] 7047ec73 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:49Z] 7047ec73 || + /bin/which python3
[2021-05-06T03:34:49Z] 7047ec73 || + '[' /root = / ']'
[2021-05-06T03:34:49Z] 7047ec73 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:50Z] 7047ec73 || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:50Z] 7047ec73 || + cd /run/determined/workdir
[2021-05-06T03:34:50Z] 7047ec73 || + test -f startup-hook.sh
[2021-05-06T03:34:50Z] 7047ec73 || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:51Z] 7047ec73 || INFO: New trial runner in (container 7047ec73-a6b0-4193-94b8-97dd94eca9b9) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '7047ec73-a6b0-4193-94b8-97dd94eca9b9', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-8e73c217-85aa-c8c5-2551-52c59279e009'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/7047ec73-a6b0-4193-94b8-97dd94eca9b9
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Connected to master
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Established WebSocket session with master
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Got rendezvous information: {'addrs': ['100.122.176.249:1734'], 'addrs2': ['100.122.176.249:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.249', 'container_port': 1734, 'host_ip': '100.122.176.249', 'host_port': 1734}, {'container_ip': '100.122.176.249', 'container_port': 1750, 'host_ip': '100.122.176.249', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:52Z] 7047ec73 || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:52Z] 7047ec73 || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:53Z] 7047ec73 || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:34:53Z] 7047ec73 || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:53Z] 7047ec73 || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:53Z] 7047ec73 || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:53Z] 7047ec73 ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:55Z] 7047ec73 || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:55Z] 7047ec73 || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:02Z] 7047ec73 || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:02Z] 7047ec73 || INFO: WebSocket closed
[2021-05-06T03:35:02Z] 7047ec73 || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:02Z] 7047ec73 || Traceback (most recent call last):
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:02Z] 7047ec73 ||     "__main__", mod_spec)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:02Z] 7047ec73 ||     exec(code, run_globals)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:02Z] 7047ec73 ||     main()
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:02Z] 7047ec73 ||     build_and_run_training_pipeline(env)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:02Z] 7047ec73 ||     controller.run()
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:02Z] 7047ec73 ||     w.total_batches_processed,
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:02Z] 7047ec73 ||     batch_idx=batch_idx,
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:02Z] 7047ec73 ||     output = self.model(data)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:02Z] 7047ec73 ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:02Z] 7047ec73 ||     input = module(input)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:02Z] 7047ec73 ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:02Z] 7047ec73 ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:35:02Z] 7047ec73 ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:02Z] 7047ec73 ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:35:02Z] 7047ec73 || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:03Z] 7047ec73 || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-06T03:35:04Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Pod resources allocated.
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-init-container
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-init-container
[2021-05-06T03:35:06Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:06Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-fluent-container
[2021-05-06T03:35:07Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-fluent-container
[2021-05-06T03:35:07Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:08Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-container
[2021-05-06T03:35:09Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-container
[2021-05-06T03:35:12Z] fd7316ae || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:35:12Z] fd7316ae || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:35:12Z] fd7316ae || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:12Z] fd7316ae || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:12Z] fd7316ae || + '[' -z '' ']'
[2021-05-06T03:35:12Z] fd7316ae || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:12Z] fd7316ae || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:12Z] fd7316ae || + /bin/which python3
[2021-05-06T03:35:12Z] fd7316ae || + '[' /root = / ']'
[2021-05-06T03:35:12Z] fd7316ae || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:35:12Z] fd7316ae || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:35:12Z] fd7316ae || + cd /run/determined/workdir
[2021-05-06T03:35:12Z] fd7316ae || + test -f startup-hook.sh
[2021-05-06T03:35:12Z] fd7316ae || + exec python3 -m determined.exec.harness
[2021-05-06T03:35:13Z] fd7316ae || INFO: New trial runner in (container fd7316ae-4cd1-4324-ab6c-a060b0cea14d) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'fd7316ae-4cd1-4324-ab6c-a060b0cea14d', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-2bfb4f26-19eb-b65f-08b1-1de7d47801e2'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:35:13Z] fd7316ae || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/fd7316ae-4cd1-4324-ab6c-a060b0cea14d
[2021-05-06T03:35:13Z] fd7316ae || INFO: Connected to master
[2021-05-06T03:35:13Z] fd7316ae || INFO: Established WebSocket session with master
[2021-05-06T03:35:13Z] fd7316ae || INFO: Got rendezvous information: {'addrs': ['100.122.176.250:1734'], 'addrs2': ['100.122.176.250:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.250', 'container_port': 1734, 'host_ip': '100.122.176.250', 'host_port': 1734}, {'container_ip': '100.122.176.250', 'container_port': 1750, 'host_ip': '100.122.176.250', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:35:14Z] fd7316ae || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:35:14Z] fd7316ae || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:35:15Z] fd7316ae || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:35:15Z] fd7316ae || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:35:15Z] fd7316ae || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:35:15Z] fd7316ae || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:35:15Z] fd7316ae ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:35:16Z] fd7316ae || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:35:16Z] fd7316ae || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:29Z] fd7316ae || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:29Z] fd7316ae || INFO: WebSocket closed
[2021-05-06T03:35:29Z] fd7316ae || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:29Z] fd7316ae || Traceback (most recent call last):
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:29Z] fd7316ae ||     "__main__", mod_spec)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:29Z] fd7316ae ||     exec(code, run_globals)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:29Z] fd7316ae ||     main()
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:29Z] fd7316ae ||     build_and_run_training_pipeline(env)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:29Z] fd7316ae ||     controller.run()
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:29Z] fd7316ae ||     w.total_batches_processed,
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:29Z] fd7316ae ||     batch_idx=batch_idx,
[2021-05-06T03:35:29Z] fd7316ae ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:29Z] fd7316ae ||     output = self.model(data)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:29Z] fd7316ae ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:29Z] fd7316ae ||     input = module(input)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:29Z] fd7316ae ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:29Z] fd7316ae ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:35:29Z] fd7316ae ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:29Z] fd7316ae ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:35:29Z] fd7316ae || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:30Z] fd7316ae || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-06T03:35:31Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Pod resources allocated.
[2021-05-06T03:35:32Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:32Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-init-container
[2021-05-06T03:35:33Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-init-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-fluent-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-fluent-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:36Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-container
[2021-05-06T03:35:37Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-container
[2021-05-06T03:35:37Z] a58c2d6c || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:35:37Z] a58c2d6c || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:35:37Z] a58c2d6c || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:37Z] a58c2d6c || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:37Z] a58c2d6c || + '[' -z '' ']'
[2021-05-06T03:35:37Z] a58c2d6c || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:37Z] a58c2d6c || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:37Z] a58c2d6c || + /bin/which python3
[2021-05-06T03:35:37Z] a58c2d6c || + '[' /root = / ']'
[2021-05-06T03:35:37Z] a58c2d6c || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:35:39Z] a58c2d6c || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:35:39Z] a58c2d6c || + cd /run/determined/workdir
[2021-05-06T03:35:39Z] a58c2d6c || + test -f startup-hook.sh
[2021-05-06T03:35:39Z] a58c2d6c || + exec python3 -m determined.exec.harness
[2021-05-06T03:35:40Z] a58c2d6c || INFO: New trial runner in (container a58c2d6c-1799-4eca-8af4-b5854c1ba447) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'a58c2d6c-1799-4eca-8af4-b5854c1ba447', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-329704d2-485f-8e97-0c87-112c63f4201b'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/a58c2d6c-1799-4eca-8af4-b5854c1ba447
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Connected to master
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Established WebSocket session with master
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Got rendezvous information: {'addrs': ['100.122.176.251:1734'], 'addrs2': ['100.122.176.251:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.251', 'container_port': 1734, 'host_ip': '100.122.176.251', 'host_port': 1734}, {'container_ip': '100.122.176.251', 'container_port': 1750, 'host_ip': '100.122.176.251', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:35:41Z] a58c2d6c || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:35:41Z] a58c2d6c || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:35:42Z] a58c2d6c || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:35:42Z] a58c2d6c || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:35:42Z] a58c2d6c || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:35:42Z] a58c2d6c || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:35:42Z] a58c2d6c ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:35:44Z] a58c2d6c || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:35:44Z] a58c2d6c || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:49Z] a58c2d6c || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:49Z] a58c2d6c || INFO: WebSocket closed
[2021-05-06T03:35:49Z] a58c2d6c || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:49Z] a58c2d6c || Traceback (most recent call last):
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:49Z] a58c2d6c ||     "__main__", mod_spec)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:49Z] a58c2d6c ||     exec(code, run_globals)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:49Z] a58c2d6c ||     main()
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:49Z] a58c2d6c ||     build_and_run_training_pipeline(env)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:49Z] a58c2d6c ||     controller.run()
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:49Z] a58c2d6c ||     w.total_batches_processed,
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:49Z] a58c2d6c ||     batch_idx=batch_idx,
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:49Z] a58c2d6c ||     output = self.model(data)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:49Z] a58c2d6c ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:49Z] a58c2d6c ||     input = module(input)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:49Z] a58c2d6c ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:49Z] a58c2d6c ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:35:49Z] a58c2d6c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:49Z] a58c2d6c ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:35:49Z] a58c2d6c || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:51Z] a58c2d6c || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-06T03:35:52Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Pod resources allocated.
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-init-container
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-init-container
[2021-05-06T03:35:54Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:54Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-fluent-container
[2021-05-06T03:35:55Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-fluent-container
[2021-05-06T03:35:55Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:57Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-container
[2021-05-06T03:35:57Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-container
[2021-05-06T03:36:00Z] c5a9683c || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:36:00Z] c5a9683c || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:36:00Z] c5a9683c || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:36:00Z] c5a9683c || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:36:00Z] c5a9683c || + '[' -z '' ']'
[2021-05-06T03:36:00Z] c5a9683c || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:36:00Z] c5a9683c || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:36:00Z] c5a9683c || + /bin/which python3
[2021-05-06T03:36:00Z] c5a9683c || + '[' /root = / ']'
[2021-05-06T03:36:00Z] c5a9683c || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:36:00Z] c5a9683c || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:36:00Z] c5a9683c || + cd /run/determined/workdir
[2021-05-06T03:36:00Z] c5a9683c || + test -f startup-hook.sh
[2021-05-06T03:36:00Z] c5a9683c || + exec python3 -m determined.exec.harness
[2021-05-06T03:36:01Z] c5a9683c || INFO: New trial runner in (container c5a9683c-809a-4a1d-b28b-50782b5536c1) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c5a9683c-809a-4a1d-b28b-50782b5536c1', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': 
None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a11059eb-0891-239b-9231-a055bf282a20'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:36:01Z] c5a9683c || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/c5a9683c-809a-4a1d-b28b-50782b5536c1
[2021-05-06T03:36:01Z] c5a9683c || INFO: Connected to master
[2021-05-06T03:36:01Z] c5a9683c || INFO: Established WebSocket session with master
[2021-05-06T03:36:01Z] c5a9683c || INFO: Got rendezvous information: {'addrs': ['100.122.176.252:1734'], 'addrs2': ['100.122.176.252:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.252', 'container_port': 1734, 'host_ip': '100.122.176.252', 'host_port': 1734}, {'container_ip': '100.122.176.252', 'container_port': 1750, 'host_ip': '100.122.176.252', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:36:02Z] c5a9683c || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:36:02Z] c5a9683c || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:36:03Z] c5a9683c || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
[2021-05-06T03:36:03Z] c5a9683c || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:36:03Z] c5a9683c || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:36:03Z] c5a9683c || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:36:03Z] c5a9683c ||   warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:36:04Z] c5a9683c || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:36:04Z] c5a9683c || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:36:10Z] c5a9683c || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:36:10Z] c5a9683c || INFO: WebSocket closed
[2021-05-06T03:36:10Z] c5a9683c || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:36:10Z] c5a9683c || Traceback (most recent call last):
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:36:10Z] c5a9683c ||     "__main__", mod_spec)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:36:10Z] c5a9683c ||     exec(code, run_globals)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:36:10Z] c5a9683c ||     main()
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:36:10Z] c5a9683c ||     build_and_run_training_pipeline(env)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:36:10Z] c5a9683c ||     controller.run()
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:36:10Z] c5a9683c ||     w.total_batches_processed,
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:36:10Z] c5a9683c ||     batch_idx=batch_idx,
[2021-05-06T03:36:10Z] c5a9683c ||   File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:36:10Z] c5a9683c ||     output = self.model(data)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:36:10Z] c5a9683c ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:36:10Z] c5a9683c ||     input = module(input)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:36:10Z] c5a9683c ||     result = self.forward(*input, **kwargs)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:36:10Z] c5a9683c ||     return self._conv_forward(input, self.weight)
[2021-05-06T03:36:10Z] c5a9683c ||   File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:36:10Z] c5a9683c ||     self.padding, self.dilation, self.groups)
[2021-05-06T03:36:10Z] c5a9683c || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:36:12Z] c5a9683c || INFO: container failed with non-zero exit code:  (exit code 1)
Trial log stream ended. To reopen log stream, run: det trial logs -f 1
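The root cause is visible in the UserWarning in the log: the RTX 3090 reports compute capability sm_86, while this PyTorch build only ships kernels for sm_37 through sm_75, so the first CUDA kernel launch fails. On a live system the two inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the helper below is only an illustrative sketch of that check (it looks for an exact sm_XX match and ignores PTX forward compatibility):

```python
def gpu_is_supported(device_capability, compiled_arch_list):
    """Return True if the PyTorch build ships a kernel for this GPU.

    device_capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
    compiled_arch_list: arch strings baked into the build, e.g. ["sm_37", ..., "sm_75"].
    """
    major, minor = device_capability
    cap = major * 10 + minor
    # Collect the compute capabilities the binary was compiled for.
    compiled = {int(arch.split("_")[1])
                for arch in compiled_arch_list
                if arch.startswith("sm_")}
    return cap in compiled

# Values taken from the warning in the log above:
arch_list = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(gpu_is_supported((8, 6), arch_list))  # RTX 3090 -> False
print(gpu_is_supported((7, 5), arch_list))  # a Turing GPU -> True
```

As the later CUDA 11.1 issue below shows, the practical workaround on Ampere GPUs is to point `environment.image` at a CUDA ≥ 11.1 image whose PyTorch build includes sm_86 kernels.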

TensorBoard won't open

After running the example PyTorch MNIST tutorial, I tried to open TensorBoard, but it failed.

I tried two methods.

  1. Web UI:

The web page shows: "Service State: PENDING Waiting for service..".

  2. CLI: det tensorboard start 14

the output is:

Scheduling TensorBoard (typically-harmless-ox) (id: cbf3a135-083c-47c2-83b4-ad875fa6414c)...

Need to re-code everything with the Determined Trial class?

Hi,
Thanks for the good project.
Is there a way to have loose coupling between our existing model and the Determined framework, without re-coding the current model with the Trial class?

Optuna and Horovod proceed that way (external meta-processing); could Determined do the same?
Thanks

optional timeout for idle notebooks and shells

It would be very helpful to be able to configure an optional timeout for idle notebooks and shells, similar to the timeout option available for TensorBoards. This would help prevent team members from accidentally leaving shells and notebooks running and taking up GPU resources without using them.

Support for region us-gov-west-1

Attempting to run det-deploy in region us-gov-west-1 yields the following error:

det-deploy is only supported in ['ap-northeast-1', 'eu-central-1', 'eu-west-1', 'us-east-1', 'us-east-2', 'us-west-2'] - tried to deploy to us-gov-west-1
use the --region argument to deploy to a supported region

Recommended fix: allow us-gov-west-1 as a region for det-deploy

Terminate failed EC2 launch

How does one terminate a failed EC2 launch?

In my particular case, EC2 did not launch because of some severely limited vCPU quotas (32). I'm working to resolve that with AWS; however, in the meantime, every 5 seconds there is a new attempt to launch an instance.

Here was the command I executed (following the MNIST Quick Start tutorial):

det experiment create const.yaml .

Because of insufficient vCPU resources, I get this error every 5 seconds on the determined.ai master instance:

[2021-01-17, 01:58:34] decided to launch 1 instances (type p2.8xlarge)  id="provisioner" resource-pool="default" system="master" type="Provisioner"
<error> [2021-01-17, 01:58:34] cannot launch EC2 instances  error="VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 4 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.\n\tstatus code: 400, request id: 60814cd9-d31e-44d2-a668-bef01139f648" id="provisioner" resource-pool="default" system="master" type="Provisioner"

I attempted this command to no effect:

det -m <MASTER_ADDRESS> agent disable --all

I also tried the following commands with similar no effect:

det experiment cancel 1
det experiment kill 1

The first command (cancel) reported that the experiment was cancelled, and the experiment listing shows it as cancelled. The second command (kill) reported that there is no experiment 1. The master logs show that the provisioner is still attempting to launch a p2.8xlarge instance.

Integration with other experiment trackers (Comet ML, Weights and Biases)

Is it possible to use Determined AI with another experiment tracker, such as Comet ML or Weights and Biases?

The feature of Determined AI that interests me most is the distributed training mechanism. However, it seems that it is not yet possible to couple this feature with the existing experiment trackers mentioned above.
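Until first-class integrations exist, one loose-coupling approach is a small fan-out layer that forwards each reported metric to every attached tracker client from inside the training loop. The sketch below illustrates only the pattern; `InMemoryTracker` is a hypothetical stand-in for a real client object (the actual Comet ML and Weights & Biases clients expose their own logging calls, not this interface):

```python
class MetricFanout:
    """Forward each reported metric to every attached tracker backend."""

    def __init__(self, *trackers):
        self.trackers = list(trackers)

    def report(self, step, metrics):
        # Copy the metrics dict so one backend cannot mutate another's view.
        for tracker in self.trackers:
            tracker.log(step=step, metrics=dict(metrics))


class InMemoryTracker:
    """Hypothetical stand-in for a real tracker client such as wandb or comet_ml."""

    def __init__(self):
        self.rows = []

    def log(self, step, metrics):
        self.rows.append((step, metrics))


fanout = MetricFanout(InMemoryTracker(), InMemoryTracker())
fanout.report(100, {"validation_loss": 0.21})
```

In a trial, the `report` call would sit wherever validation metrics are computed, so swapping trackers never touches the model code.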

WorkerError in distributed training with CUDA 11.1

I use an RTX 3090 for training, but it only supports CUDA version ≥ 11.1, so I used "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" as the environment image, which caused an error in distributed training.

The code is mnist_pytorch from "Docs > Tutorials > Quick Start Guide".

The trial logs:

[2021-05-10T07:34:42Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Pod resources allocated.
[2021-05-10T07:34:42Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Pod resources allocated.
[2021-05-10T07:34:42Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Pod resources allocated.
[2021-05-10T07:34:42Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Pod resources allocated.
[2021-05-10T07:34:42Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Pod resources allocated.
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-init-container
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-init-container
[2021-05-10T07:34:43Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-init-container
[2021-05-10T07:34:43Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-init-container
[2021-05-10T07:34:43Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-init-container
[2021-05-10T07:34:43Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-init-container
[2021-05-10T07:34:44Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-init-container
[2021-05-10T07:34:44Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-init-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-fluent-container
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-fluent-container
[2021-05-10T07:34:45Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:45Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Container image "determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0" already present on machine
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Created container determined-container
[2021-05-10T07:34:48Z] c393b38a || INFO: Pod exp-11-trial-26-rank-1-d14365d6-3b27-4c68-a6bd-f56001924600-wondrous-cub: Started container determined-container
[2021-05-10T07:34:48Z] 13573602 || INFO: Pod exp-11-trial-26-rank-4-d14365d6-3b27-4c68-a6bd-f56001924600-curious-dory: Started container determined-container
[2021-05-10T07:34:49Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Created container determined-container
[2021-05-10T07:34:49Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Created container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Created container determined-container
[2021-05-10T07:34:50Z] 0797af9d || INFO: Pod exp-11-trial-26-rank-0-d14365d6-3b27-4c68-a6bd-f56001924600-proven-finch: Started container determined-container
[2021-05-10T07:34:50Z] c7b38690 || INFO: Pod exp-11-trial-26-rank-3-d14365d6-3b27-4c68-a6bd-f56001924600-crucial-snail: Started container determined-container
[2021-05-10T07:34:50Z] 86c9d22a || INFO: Pod exp-11-trial-26-rank-2-d14365d6-3b27-4c68-a6bd-f56001924600-caring-gannet: Started container determined-container
[2021-05-10T07:34:50Z] 13573602 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:50Z] c393b38a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:50Z] 13573602 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] 13573602 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:50Z] c393b38a || + '[' -z '' ']'
[2021-05-10T07:34:50Z] 13573602 || + '[' -z '' ']'
[2021-05-10T07:34:50Z] c393b38a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] c393b38a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + /bin/which python3
[2021-05-10T07:34:50Z] 13573602 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:50Z] 13573602 || + '[' /root = / ']'
[2021-05-10T07:34:50Z] c393b38a || + /bin/which python3
[2021-05-10T07:34:50Z] c393b38a || + '[' /root = / ']'
[2021-05-10T07:34:50Z] 13573602 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:50Z] c393b38a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] c7b38690 || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 86c9d22a || + WORKING_DIR=/run/determined/workdir
[2021-05-10T07:34:53Z] 0797af9d || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 0797af9d || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 0797af9d || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 0797af9d || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] 0797af9d || + '[' /root = / ']'
[2021-05-10T07:34:53Z] 0797af9d || + /bin/which python3
[2021-05-10T07:34:53Z] 0797af9d || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' -z '' ']'
[2021-05-10T07:34:53Z] 86c9d22a || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] 86c9d22a || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + /bin/which python3
[2021-05-10T07:34:53Z] 86c9d22a || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] 86c9d22a || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:34:53Z] 86c9d22a || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + STARTUP_HOOK=startup-hook.sh
[2021-05-10T07:34:53Z] c7b38690 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-10T07:34:53Z] c7b38690 || + '[' -z '' ']'
[2021-05-10T07:34:53Z] c7b38690 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-10T07:34:53Z] c7b38690 || + '[' /root = / ']'
[2021-05-10T07:34:53Z] c7b38690 || + /bin/which python3
[2021-05-10T07:34:53Z] c7b38690 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.3-py3-none-any.whl
[2021-05-10T07:35:01Z] 13573602 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] 13573602 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] 13573602 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c393b38a || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c393b38a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:01Z] c7b38690 || + cd /run/determined/workdir
[2021-05-10T07:35:01Z] c7b38690 || + test -f startup-hook.sh
[2021-05-10T07:35:01Z] c7b38690 || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:02Z] 13573602 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] 13573602 || INFO: New trial runner in (container 13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-0de43703-9b09-14b9-86e6-8fdb3ba55cb4', 'GPU-4fb2a63e-162c-b338-a19d-57cf7346d36d'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] 13573602 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/13573602-b07b-4ad4-9d6a-7e4ca5d9fb8d
[2021-05-10T07:35:02Z] 13573602 || INFO: Connected to master
[2021-05-10T07:35:02Z] 13573602 || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c393b38a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c393b38a || INFO: New trial runner in (container c393b38a-923e-4d0d-8fcd-bf0967da86aa) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c393b38a-923e-4d0d-8fcd-bf0967da86aa', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-f85c4ec5-d32d-01bb-c08e-d2b779314a9a', 'GPU-72adb4a9-d511-dbec-bd69-10960a34b452'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c393b38a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c393b38a-923e-4d0d-8fcd-bf0967da86aa
[2021-05-10T07:35:02Z] c393b38a || INFO: Connected to master
[2021-05-10T07:35:02Z] c393b38a || INFO: Established WebSocket session with master
[2021-05-10T07:35:02Z] c7b38690 || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:02Z] c7b38690 || INFO: New trial runner in (container c7b38690-2d44-460f-877b-c9c8fdc15157) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c7b38690-2d44-460f-877b-c9c8fdc15157', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a523e020-b504-535c-2b83-967ab28cdbab', 'GPU-8e73c217-85aa-c8c5-2551-52c59279e009'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/c7b38690-2d44-460f-877b-c9c8fdc15157
[2021-05-10T07:35:02Z] c7b38690 || INFO: Connected to master
[2021-05-10T07:35:02Z] c7b38690 || INFO: Established WebSocket session with master
[2021-05-10T07:35:04Z] 0797af9d || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 0797af9d || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 0797af9d || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:04Z] 86c9d22a || + cd /run/determined/workdir
[2021-05-10T07:35:04Z] 86c9d22a || + test -f startup-hook.sh
[2021-05-10T07:35:04Z] 86c9d22a || + exec python3 -m determined.exec.harness
[2021-05-10T07:35:05Z] 0797af9d || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 0797af9d || INFO: New trial runner in (container 0797af9d-18d4-41ee-9128-4f42e8bc1f39) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '0797af9d-18d4-41ee-9128-4f42e8bc1f39', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-afc0f67e-f5bf-0b0d-9bfd-abca42f2de36', 'GPU-2bfb4f26-19eb-b65f-08b1-1de7d47801e2'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/0797af9d-18d4-41ee-9128-4f42e8bc1f39
[2021-05-10T07:35:05Z] 0797af9d || INFO: Connected to master
[2021-05-10T07:35:05Z] 0797af9d || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || WARNING: `global_batch_size` changed from 512 to 510 to divide equally across 10 slots.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: New trial runner in (container 86c9d22a-d207-4d66-abbb-5ae6165c16a4) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '86c9d22a-d207-4d66-abbb-5ae6165c16a4', 'experiment_config': {'description': 'mnist_pytorch_distributed', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 512}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 117}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 10, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0', 'gpu': 'determinedai/environments:cuda-11.1-pytorch-1.8-lightning-1.2-tf-2.4-gpu-0.9.0'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620631820}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 512, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (11,26,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a11059eb-0891-239b-9231-a055bf282a20', 'GPU-4bafdd74-8f91-f260-1c06-8d48864ec2ee'], 'slot_ids': [0, 1], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '26', 'det_experiment_id': '11', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1068804133, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 51, '_global_batch_size': 510}.
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/11/26/86c9d22a-d207-4d66-abbb-5ae6165c16a4
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Connected to master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Established WebSocket session with master
[2021-05-10T07:35:05Z] 86c9d22a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 2, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 0797af9d || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] 13573602 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 4, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c393b38a || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 1, 'type': 'RENDEZVOUS_INFO'}
[2021-05-10T07:35:05Z] c7b38690 || INFO: Got rendezvous information: {'addrs': ['100.122.176.216:1734', '100.82.21.9:1734', '100.122.176.226:1734', '100.82.21.8:1734', '100.122.176.232:1734'], 'addrs2': ['100.122.176.216:1750', '100.82.21.9:1750', '100.122.176.226:1750', '100.82.21.8:1750', '100.122.176.232:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.216', 'container_port': 1734, 'host_ip': '100.122.176.216', 'host_port': 1734}, {'container_ip': '100.122.176.216', 'container_port': 1750, 'host_ip': '100.122.176.216', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.9', 'container_port': 1734, 'host_ip': '100.82.21.9', 'host_port': 1734}, {'container_ip': '100.82.21.9', 'container_port': 1750, 'host_ip': '100.82.21.9', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.226', 'container_port': 1734, 'host_ip': '100.122.176.226', 'host_port': 1734}, {'container_ip': '100.122.176.226', 'container_port': 1750, 'host_ip': '100.122.176.226', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.82.21.8', 'container_port': 1734, 'host_ip': '100.82.21.8', 'host_port': 1734}, {'container_ip': '100.82.21.8', 'container_port': 1750, 'host_ip': '100.82.21.8', 'host_port': 1750}]}, {'addresses': [{'container_ip': '100.122.176.232', 'container_port': 1734, 'host_ip': '100.122.176.232', 'host_port': 1734}, {'container_ip': '100.122.176.232', 'container_port': 1750, 'host_ip': '100.122.176.232', 'host_port': 1750}]}], 'rank': 3, 'type': 'RENDEZVOUS_INFO'}
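Each of the five containers above receives the same rendezvous payload and differs only in its `rank` field; a container can find its own advertised address (and its peers) by indexing `addrs` with that rank. A minimal sketch of that lookup, using the abbreviated payload shape from the rank-3 entry in these logs (illustrative only, not Determined's internal code):

```python
# Abbreviated rendezvous payload, copied from the rank-3 log entry above.
rendezvous = {
    "addrs": [
        "100.122.176.216:1734",
        "100.82.21.9:1734",
        "100.122.176.226:1734",
        "100.82.21.8:1734",
        "100.122.176.232:1734",
    ],
    "rank": 3,
}

# This container's own address is addrs[rank]; the rest are its peers.
own_addr = rendezvous["addrs"][rendezvous["rank"]]
peers = [a for i, a in enumerate(rendezvous["addrs"])
         if i != rendezvous["rank"]]
# own_addr == "100.82.21.8:1734"; four remaining entries are peer containers
```

This matches the logs: all five containers print identical `addrs`/`addrs2` lists, so the `rank` alone determines which entry each one treats as its own.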
[2021-05-10T07:35:05Z] 86c9d22a || 2021-05-10 07:35:05.959396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 0797af9d || 2021-05-10 07:35:05.978576: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c7b38690 || 2021-05-10 07:35:05.978783: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] 13573602 || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:05Z] c393b38a || 2021-05-10 07:35:05.997814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[2021-05-10T07:35:07Z] c7b38690 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 86c9d22a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 13573602 || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] c393b38a || INFO: Horovod config: {'use': True, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-10T07:35:07Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/bin/horovodrun", line 5, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.launch import run_commandline
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 34, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.driver import driver_service
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/driver/driver_service.py", line 23, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.common.service import driver_service
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/service/driver_service.py", line 18, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from horovod.runner.common.util import network
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/common/util/network.py", line 22, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     import cloudpickle
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/__init__.py", line 3, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     from cloudpickle.cloudpickle import *
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 151, in <module>
[2021-05-10T07:35:07Z] 0797af9d ||     _cell_set_template_code = _make_cell_set_template_code()
[2021-05-10T07:35:07Z] 0797af9d ||   File "/opt/conda/lib/python3.8/site-packages/cloudpickle/cloudpickle.py", line 132, in _make_cell_set_template_code
[2021-05-10T07:35:07Z] 0797af9d ||     return types.CodeType(
[2021-05-10T07:35:07Z] 0797af9d || TypeError: an integer is required (got type bytes)
[2021-05-10T07:35:08Z] 0797af9d || INFO: WebSocket closed
[2021-05-10T07:35:08Z] 0797af9d || INFO: Disconnected from master, exiting gracefully
[2021-05-10T07:35:08Z] 0797af9d || Traceback (most recent call last):
[2021-05-10T07:35:08Z] 0797af9d ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2021-05-10T07:35:08Z] 0797af9d ||     return _run_code(code, main_globals, None,
[2021-05-10T07:35:08Z] 0797af9d ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2021-05-10T07:35:08Z] 0797af9d ||     exec(code, run_globals)
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-10T07:35:08Z] 0797af9d ||     main()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-10T07:35:08Z] 0797af9d ||     build_and_run_training_pipeline(env)
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 135, in build_and_run_training_pipeline
[2021-05-10T07:35:08Z] 0797af9d ||     subproc.run()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 268, in run
[2021-05-10T07:35:08Z] 0797af9d ||     self._do_startup_message_sequence()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 246, in _do_startup_message_sequence
[2021-05-10T07:35:08Z] 0797af9d ||     responses, exception_received = self.broadcast_server.gather_with_polling(
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/ipc.py", line 150, in gather_with_polling
[2021-05-10T07:35:08Z] 0797af9d ||     health_check()
[2021-05-10T07:35:08Z] 0797af9d ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/layers/_worker_process.py", line 288, in _health_check
[2021-05-10T07:35:08Z] 0797af9d ||     raise det.errors.WorkerError("Training process died.")
[2021-05-10T07:35:08Z] 0797af9d || determined.errors.WorkerError: Training process died.
[2021-05-10T07:35:09Z] 0797af9d || INFO: container failed with non-zero exit code:  (exit code 1)
[2021-05-10T07:35:25Z] 13573602 || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] c393b38a || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] c7b38690 || INFO: container failed with non-zero exit code:  (exit code 137)
[2021-05-10T07:35:26Z] 86c9d22a || INFO: container failed with non-zero exit code:  (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 26
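The `TypeError: an integer is required (got type bytes)` raised from cloudpickle's `_make_cell_set_template_code` in the traceback above is a known symptom of running a cloudpickle release that predates Python 3.8 support: Python 3.8 added a leading `posonlyargcount` argument to `types.CodeType`, so code objects constructed positionally with the 3.7 argument order pass a `bytes` where an `int` is now expected. A minimal demonstration of the interpreter-side change (a diagnostic sketch, not part of Determined):

```python
import sys

# Python 3.8 inserted `posonlyargcount` as the first argument of
# types.CodeType; code objects expose it as co_posonlyargcount.
# Libraries (such as older cloudpickle releases) that call types.CodeType
# positionally with the 3.7 argument order therefore pass a bytes object
# where 3.8 expects this int -- producing the TypeError in the trial log.
code_obj = compile("pass", "<demo>", "exec")
has_posonly = hasattr(code_obj, "co_posonlyargcount")
print(has_posonly == (sys.version_info >= (3, 8)))  # True on any version
```

Upgrading cloudpickle (and any Horovod build pinned against it) inside the task environment image is the usual remedy; the exact minimum version to use depends on the cloudpickle release notes.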

The following resource(s) failed to create: [DeterminedAddress]

I tried to run det-deploy aws up --cluster-id testcluster0 --keypair mykeypairname --deployment-type vpc. It also failed with --deployment-type simple. I made sure I had all the required permissions before running these commands.
Any help would be appreciated.
Thank you

det-deploy does not operate within a multi-profile aws cli environment

The AWS CLI supports multiple profiles via named profiles in the ~/.aws/config and ~/.aws/credentials files. It appears det-deploy only ever uses the default profile.

Recommendation: add a --profile argument to det-deploy aws up and det-deploy aws down so that det-deploy can operate within a user's multi-profile setup.
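Until such a flag exists, one possible workaround (assuming det-deploy uses boto3 under the hood, since boto3 honors the `AWS_PROFILE` environment variable) is to select the profile via the environment. A sketch; "ml-team" is a placeholder profile name:

```python
import os
import shutil
import subprocess

def deploy_with_profile(profile: str, args: list) -> None:
    # Hypothetical workaround: boto3 resolves credentials from the
    # AWS_PROFILE environment variable, so exporting it before invoking
    # det-deploy should select the named profile instead of the default.
    env = dict(os.environ, AWS_PROFILE=profile)
    subprocess.run(["det-deploy", *args], env=env, check=True)

if shutil.which("det-deploy"):  # only invoke where the CLI is installed
    deploy_with_profile(
        "ml-team",
        ["aws", "up", "--cluster-id", "testcluster0",
         "--keypair", "mykeypairname"],
    )
```

The same effect can be had from the shell by prefixing the command with `AWS_PROFILE=ml-team`.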

Master crash - error converting YAML to JSON

Hi, if you use dots and slashes in label keys:

scheduler:
  resource_provider:
    type: "kubernetes"
    namespace: default
    max_slots_per_pod: 10
    master_service_name: test-determined-ai

task_container_defaults:
  cpu_pod_spec:
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        helm.sh/chart: "determined-ai-0.1.61"
        app.kubernetes.io/name: "test-determined-ai"
        app.kubernetes.io/instance: "test"
        app.kubernetes.io/managed-by: "Helm"

master crash:

error converting YAML to JSON: yaml: line 74: mapping values are not allowed in this context
error unmarshal yaml configuration file
main.mergeConfigBytesIntoViper
	/home/circleci/project/master/cmd/determined-master/root.go:111
main.initializeConfig
	/home/circleci/project/master/cmd/determined-master/root.go:69
main.runRoot
	/home/circleci/project/master/cmd/determined-master/root.go:40
main.glob..func1
	/home/circleci/project/master/cmd/determined-master/root.go:29
github.com/spf13/cobra.(*Command).execute
	/home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:846
github.com/spf13/cobra.(*Command).ExecuteC
	/home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:950
github.com/spf13/cobra.(*Command).Execute
	/home/circleci/go/pkg/mod/github.com/spf13/[email protected]/command.go:887
main.main
	/home/circleci/project/master/cmd/determined-master/main.go:12
runtime.main
	/usr/local/go/src/runtime/proc.go:203
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1357
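For what it's worth, label keys containing dots and slashes are themselves valid YAML, so the parse failure more likely comes from how the pod spec is indented or merged into the master config (the error points at line 74 of the rendered file). A quick check with PyYAML (illustrative only):

```python
import yaml  # PyYAML

# Keys such as "helm.sh/chart" are plain YAML scalars; dots and slashes
# need no escaping, so this snippet parses cleanly on its own.
doc = """
metadata:
  labels:
    helm.sh/chart: "determined-ai-0.1.61"
    app.kubernetes.io/name: "test-determined-ai"
"""
labels = yaml.safe_load(doc)["metadata"]["labels"]
print(labels["helm.sh/chart"])  # -> determined-ai-0.1.61
```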

failed to start Fluent daemon: failed to kill old logging container: failed to list containers by name: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Hi, when I run the command

docker run \
    --name determined-agent \
    --network determined \
    -e DET_MASTER_HOST=determined-master \
    -e DET_MASTER_PORT=8080 \
    determinedai/determined-agent:0.15.1

an error occurs; the agent reports:

xidian@xidian-S2600WFT:~$ docker run     --name determined-agent     --network determined     -e DET_MASTER_HOST=determined-master     -e DET_MASTER_PORT=8080     determinedai/determined-agent:0.15.1
chmod: cannot access '/usr/local/determined/container_startup_script': No such file or directory
/run/determined/workdir/entrypoint.sh: line 4: /usr/local/determined/container_startup_script: No such file or directory
WARN[2021-04-22T12:54:14Z] no configuration file at /etc/determined/agent.yaml, skipping 
INFO[2021-04-22T12:54:14Z] agent configuration: {"config_file":"","master_host":"determined-master","master_port":8080,"agent_id":"e1234989a06d","artificial_slots":0,"slot_type":"auto","container_master_host":"","container_master_port":0,"label":"","resource_pool":"","api_enabled":false,"bind_ip":"0.0.0.0","bind_port":9090,"visible_gpus":"","tls":false,"cert_file":"","key_file":"","http_proxy":"","https_proxy":"","ftp_proxy":"","no_proxy":"","security":{"tls":{"enabled":false,"skip_verify":false,"master_cert":"","master_cert_name":""}},"fluent":{"image":"fluent/fluent-bit:1.6","port":24224}} 
INFO[2021-04-22T12:54:14Z] Determined agent 0.15.1 (built with go1.16.3)  id=agent system=e1234989a06d type=agent
INFO[2021-04-22T12:54:14Z] connecting to master at: ws://determined-master:8080/agents?id=e1234989a06d&resource_pool=  id=agent system=e1234989a06d type=agent
INFO[2021-04-22T12:54:14Z] successfully connected to master              id=agent system=e1234989a06d type=agent
ERRO[2021-04-22T12:54:14Z] error while actor was running                 error="failed to start Fluent daemon: failed to kill old logging container: failed to list containers by name: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" id=agent system=e1234989a06d type=agent
INFO[2021-04-22T12:54:14Z] agent shut down                               id=agent system=e1234989a06d type=agent
FATA[2021-04-22T12:54:14Z] failed to start Fluent daemon: failed to kill old logging container: failed to list containers by name: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
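The agent starts task and logging (Fluent Bit) containers through the host's Docker daemon, so the daemon's socket has to be reachable from inside the agent container. A likely fix is the same command with the socket bind-mounted (a sketch; other flags unchanged — verify against the deployment docs for your version):

```shell
docker run \
    --name determined-agent \
    --network determined \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e DET_MASTER_HOST=determined-master \
    -e DET_MASTER_PORT=8080 \
    determinedai/determined-agent:0.15.1
```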

the master reports:

ERRO[2021-04-22T12:54:14Z] error while actor was running                 error="websocket: close 1006 (abnormal closure): unexpected EOF" id=websocket-fc76cdac-037a-4ca1-b23a-283c3235491f system=master type=websocketActor
ERRO[2021-04-22T12:54:14Z] websocket: close 1006 (abnormal closure): unexpected EOF 
echo: http: response.WriteHeader on hijacked connection from github.com/labstack/echo.(*Response).WriteHeader (response.go:63)
echo: http: response.Write on hijacked connection from github.com/labstack/echo.(*Response).Write (response.go:72)
ERRO[2021-04-22T12:54:14Z] error while actor was running                 error="child failed: /agents/e1234989a06d/websocket-fc76cdac-037a-4ca1-b23a-283c3235491f: websocket: close 1006 (abnormal closure): unexpected EOF" id=e1234989a06d system=master type=agent
INFO[2021-04-22T12:54:14Z] agent disconnected                            id=e1234989a06d system=master type=agent
ERRO[2021-04-22T12:54:14Z] http: connection has been hijacked           
INFO[2021-04-22T12:54:14Z] removing agent: e1234989a06d                  id=default resource-pool=default system=master type=ResourcePool

I don't know how to solve it; can you help me?

Errors in log during agent shutdown on AWS

The following is a snippet from the AWS logs. Essentially, the master node is shutting down the 8x GPU agent due to inactivity. It appears that everything shut down correctly, but the logs reported websocket and HTTP errors.

As far as I can tell, everything worked correctly, but the logged errors suggest something went wrong when there probably wasn't anything wrong.

[2021-01-18, 16:20:03] resources are released for /experiment-34-checkpoint-gc  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:09] decided to terminate 1 instances: ,i-0d2fc018489ab2f24 (reason: long idle)  id="provisioner" resource-pool="default" system="master" type="Provisioner"
<info> [2021-01-18, 16:30:09] terminated 1/1 EC2 instances: i-0d2fc018489ab2f24 (Terminating)  id="provisioner" resource-pool="default" system="master" type="Provisioner"
<error> [2021-01-18, 16:30:10] error while actor was running  error="websocket: close 1006 (abnormal closure): unexpected EOF" id="websocket-393a233c-3a95-4aad-b362-3ac57c244a5c" system="master" type="websocketActor"
<error> [2021-01-18, 16:30:10] websocket: close 1006 (abnormal closure): unexpected EOF
<error> [2021-01-18, 16:30:10] http: connection has been hijacked
<error> [2021-01-18, 16:30:10] error while actor was running  error="child failed: /agents/i-0d2fc018489ab2f24/websocket-393a233c-3a95-4aad-b362-3ac57c244a5c: websocket: close 1006 (abnormal closure): unexpected EOF" id="i-0d2fc018489ab2f24" system="master" type="agent"
<info> [2021-01-18, 16:30:10] removing device: gpu5 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu6 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu7 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu0 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu1 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] agent disconnected  id="i-0d2fc018489ab2f24" system="master" type="agent"
<info> [2021-01-18, 16:30:10] removing device: gpu2 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu3 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing device: gpu4 (Tesla K80) (i-0d2fc018489ab2f24)  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:10] removing agent: i-0d2fc018489ab2f24  id="default" resource-pool="default" system="master" type="ResourcePool"
<info> [2021-01-18, 16:30:14] found state changes in 0 instances:   id="provisioner" resource-pool="default" system="master" type="Provisioner"

incorrect merging behavior for bind_mounts when specified in template and config file

According to the docs "If the field specifies a list value, the merged value will be the concatenation of the list specified in the template and that specified in the configuration."

However, I've noticed that in the case of bind mounts, the bind_mounts section in the config file completely overwrites the one from the template, rather than concatenating the two lists.

On the other hand, I tested the same thing with the environment_variables key and it works as expected in this case.
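The documented rule for list-valued fields can be sketched as plain list concatenation (an illustration of the intended semantics only, not Determined's actual merge code):

```python
def merge_field(template_value, config_value):
    """Merge one config field per the documented rule: list values from
    the template and the config concatenate; for other values the config
    wins (simplified illustration)."""
    if isinstance(template_value, list) and isinstance(config_value, list):
        return template_value + config_value
    return config_value

# Hypothetical bind_mounts entries from a template and a config file:
template_mounts = [{"host_path": "/data", "container_path": "/data"}]
config_mounts = [{"host_path": "/ckpt", "container_path": "/ckpt"}]
merged = merge_field(template_mounts, config_mounts)
print(len(merged))  # -> 2, i.e. both mounts survive the merge
```

Per the report, `environment_variables` behaves this way but `bind_mounts` does not.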

det-deploy is not recognized

I did the pip install, and I get the above error when I try to deploy. Can this run on Windows, or is that something that is not supported?

shell ssh connect operation timed out (local cluster on macOS)

Create a local cluster on a macOS laptop: det-deploy local cluster-up --no-gpu

Start a shell:

$ det shell start
Scheduling Shell (willingly-whole-baboon) (id: 9c0dd595-29ac-4dfe-8862-27cf5e8ef722)...
Shell (willingly-whole-baboon) was assigned to an agent...
[2020-07-11T06:44:01Z] 4037bd18 [PULLING] || image already found, skipping pull phase: docker.io/determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.14-cpu-0c9e956
[2020-07-11T06:44:01Z] 4037bd18 [STARTING] || copying files to container: /
[2020-07-11T06:44:01Z] 4037bd18 [STARTING] || copying files to container: /run/determined/workdir
[2020-07-11T06:44:01Z] 4037bd18 [STARTING] || copying files to container: /
[2020-07-11T06:44:06Z] 4037bd18 [STARTING] || copying files to container: /
disconnecting websocket
ssh: connect to host 172.18.0.1 port 32773: Operation timed out
To reconnect, run: det shell open 9c0dd595-29ac-4dfe-8862-27cf5e8ef722

The shell container (naughty_nightingale) is running and listening on 0.0.0.0:32773:

$ docker ps --format "table {{.Image}}\t{{.Ports}}\t{{.Names}}"
IMAGE                                                                PORTS                     NAMES
determinedai/environments:py-3.6.9-pytorch-1.4-tf-1.14-cpu-0c9e956   0.0.0.0:32773->2222/tcp   naughty_nightingale
determinedai/determined-agent:0.12.10                                                          determined-agent-0
determinedai/determined-master:0.12.10                               0.0.0.0:8080->8080/tcp    determined_determined-master_1
postgres:10.8                                                        5432/tcp                  determined_determined-db_1

Connecting to 0.0.0.0 works:

$ ssh -p 32773 0.0.0.0
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:Unb+Ml1bKeDzzCH2MS1h7blGRLF9uSuOnb0i/MBIOx0.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
Password:

Bridge network gateway is on 172.18.0.1:

$ docker network inspect determined_default
[
    {
        "Name": "determined_default",
        "Id": "0aaafd82d8848d02d38b82fd623ef06f3b2926ac659a5db6bf8e75767e7f4219",
        "Created": "2020-07-11T03:57:05.104944304Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "9bac1e8320ccbf3875671cc0ea4c495b317156a68bf0bd216de96de632997cf8": {
                "Name": "determined_determined-db_1",
                "EndpointID": "7b1ca6ea4292379bf1a635b07626cb0c225ca33648ceb2dc489fc79193dec195",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "c09fb7c3362583590f164ddb8178c5d447e92f5a8d0dc167fe602cb032720b07": {
                "Name": "determined_determined-master_1",
                "EndpointID": "e0800911a7282c03c042a117f0e120385024375355098f3fd26239d0cc6cffb6",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "default",
            "com.docker.compose.project": "determined",
            "com.docker.compose.version": "1.26.2"
        }
    }
]

Looks like det shell is connecting to the bridge network gateway (172.18.0.1) rather than localhost.
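On macOS, Docker runs inside a VM, so bridge-network addresses like 172.18.0.1 are not routable from the host; published ports are only reachable via localhost. A small reachability probe makes the difference visible (a diagnostic sketch, not part of Determined; the addresses and port are from the report above):

```python
import socket

def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Expected on macOS for the session above:
#   reachable("127.0.0.1", 32773)  -> True  (published port on localhost)
#   reachable("172.18.0.1", 32773) -> False (bridge gateway, inside the VM)
```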

How do I use my own dataset?

Let's say I've already downloaded the data set.

The directory structure is as follows:
-rw-r--r-- 1 dc2-user dc2-user 653 Jun 13 08:23 adaptive.yaml
-rw-r--r-- 1 dc2-user dc2-user 405 Jun 13 08:23 const.yaml
-rw-r--r-- 1 dc2-user dc2-user 1.5K Jun 15 22:27 data.py
-rw-r--r-- 1 dc2-user dc2-user 486 Jun 13 08:23 distributed.yaml
-rw-r--r-- 1 dc2-user dc2-user 555 Jun 13 08:23 layers.py
drwxrwxr-x 2 dc2-user dc2-user 4.0K Jun 15 22:20 MNIST_data
-rw-r--r-- 1 dc2-user dc2-user 4.0K Jun 15 22:27 model_def.py

The MNIST_data dir is my own dataset.

When I use the command “det create ------” to create an experiment, the log on the experiment page indicates that no data was found.

How do I use my own dataset?
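In general, files next to the experiment config (like the MNIST_data directory above) are uploaded as the experiment's context and extracted into the container's working directory, so trial code can reference them by relative path; large datasets are better attached via the bind_mounts config, since the context upload has a size limit. A sketch of resolving such a directory (the helper name is hypothetical):

```python
import os

def resolve_dataset_dir(name: str = "MNIST_data") -> str:
    # Hypothetical helper: the experiment context is extracted into the
    # container's working directory, so a dataset shipped alongside the
    # config can be resolved relative to os.getcwd(). Note the context
    # upload has a size limit; use bind_mounts for large datasets.
    path = os.path.join(os.getcwd(), name)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"dataset directory not found: {path}")
    return path
```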

`det-deploy local fixture-up` doesn't show agents on the web

Hi,
I tried to follow the tutorial to set up Determined on my local machine. The machine is running Ubuntu 18.04 with one 2080 Ti; the driver version is 440.82.
After installing the CLI and running the single-node, single-agent setup with det-deploy local fixture-up, no agent shows up on the web UI.

WARNING: The DET_DB_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The DET_HASURA_SECRET variable is not set. Defaulting to a blank string.
WARNING: The DET_VERSION variable is not set. Defaulting to a blank string.
Removing network determined_default
WARNING: Network determined_default not found.
WARNING: The DET_DB_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The DET_HASURA_SECRET variable is not set. Defaulting to a blank string.
WARNING: The DET_VERSION variable is not set. Defaulting to a blank string.
Removing network determined_default
WARNING: Network determined_default not found.
Creating network "determined_default" with the default driver
Creating determined_determined-graphql_1 ... done
Creating determined_determined-db_1      ... done
Creating determined_determined-master_1  ... done
Waiting for master to be available...
Starting determined-agent-0

Do I need to provide a configuration file for the fixture-up run?
