aws / sagemaker-containers

WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.

Home Page: https://github.com/aws/sagemaker-training-toolkit

License: Apache License 2.0

Python 95.52% C 4.00% Dockerfile 0.30% Shell 0.18%

sagemaker-containers's Introduction


SageMaker Containers gives you tools to create SageMaker-compatible Docker containers, and has additional tools for letting you create Frameworks (SageMaker-compatible Docker containers that can run arbitrary Python or shell scripts).

Currently, this library is used by the SageMaker Scikit-learn containers.

Here we'll demonstrate how to create a Docker image using SageMaker Containers in order to show the simplicity of using this library.

Let's suppose we need to train a model with the following training script train.py using TF 2.0 in SageMaker:

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=1)

model.evaluate(x_test, y_test)

We then create a Dockerfile with our dependencies and define the program that will be executed in SageMaker:

FROM tensorflow/tensorflow:2.0.0a0

RUN pip install sagemaker-containers

# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entry point
ENV SAGEMAKER_PROGRAM train.py

More documentation on how to build a Docker container can be found here

We then build the Docker image using docker build:

docker build -t tf-2.0 .

We can use Local Mode to test the container locally:

from sagemaker.estimator import Estimator

estimator = Estimator(image_name='tf-2.0',
                      role='SageMakerRole',
                      train_instance_count=1,
                      train_instance_type='local')

estimator.fit()

After using Local Mode, we can push the image to ECR and run a SageMaker training job. For a complete example of creating a container with SageMaker Containers, including pushing it to ECR, see the example notebook tensorflow_bring_your_own.ipynb.

The training script must be located under the folder /opt/ml/code and its relative path is defined in the environment variable SAGEMAKER_PROGRAM. The following scripts are supported:

  • Python scripts: uses the Python interpreter for any script with .py suffix
  • Shell scripts: uses the Shell interpreter to execute any other script

When training starts, the interpreter executes the entry point. For the example above:

python train.py

Any hyperparameters provided by the training job will be passed by the interpreter to the entry point as script arguments. For example, the training job hyperparameters:

{"HyperParameters": {"batch-size": 256, "learning-rate": 0.0001, "communicator": "pure_nccl"}}

will be passed to a shell entry point as:

./user_script.sh --batch-size 256 --learning-rate 0.0001 --communicator pure_nccl

The entry point is responsible for parsing these script arguments. For example, in a Python script:

import argparse

if __name__ == '__main__':
  parser = argparse.ArgumentParser()

  parser.add_argument('--learning-rate', type=float, default=1.0)
  parser.add_argument('--batch-size', type=int, default=64)
  parser.add_argument('--communicator', type=str)
  parser.add_argument('--frequency', type=int, default=20)

  args = parser.parse_args()
  ...
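
The mapping from hyperparameters to script arguments can be sketched in a few lines. The helper name hyperparameters_to_args below is hypothetical and is not part of SageMaker Containers; it only illustrates the behavior described above, under the assumption that each key becomes a --key flag followed by its value as a string. The default values in parse_args are placeholders chosen for the sketch.

```python
import argparse


def hyperparameters_to_args(hyperparameters):
    """Flatten a hyperparameter dict into a list of command-line arguments.

    Hypothetical helper for illustration only: each key becomes a
    ``--key`` flag followed by its value converted to a string.
    """
    args = []
    for name, value in sorted(hyperparameters.items()):
        args.extend(['--' + name, str(value)])
    return args


def parse_args(argv):
    """Parse the script arguments the entry point receives."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning-rate', type=float, default=0.001)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--communicator', type=str)
    return parser.parse_args(argv)


hyperparameters = {'batch-size': 256, 'learning-rate': 0.0001, 'communicator': 'pure_nccl'}
argv = hyperparameters_to_args(hyperparameters)
args = parse_args(argv)
print(argv)
print(args.batch_size, args.learning_rate, args.communicator)
```

Note that argparse exposes --batch-size as args.batch_size: hyphens in flag names become underscores in attribute names.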

Very often, an entry point needs additional information from the container that is not available in hyperparameters. SageMaker Containers writes this information as environment variables that are available inside the script. For example, the training job below includes the channels training and testing:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py', ...)

estimator.fit({'training': 's3://bucket/path/to/training/data',
               'testing': 's3://bucket/path/to/testing/data'})

The environment variable SM_CHANNEL_{channel_name} provides the path where the channel is located:

import argparse
import os

if __name__ == '__main__':
  parser = argparse.ArgumentParser()

  ...

  # reads input channels training and testing from the environment variables
  parser.add_argument('--training', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
  parser.add_argument('--testing', type=str, default=os.environ['SM_CHANNEL_TESTING'])

  args = parser.parse_args()
  ...
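
When the channel names are not known in advance, the directories can be looked up generically. The sketch below assumes that the variable name is the upper-cased channel name appended to SM_CHANNEL_, as in the example above, and that SM_CHANNELS holds a JSON-encoded list of channel names (both variables are described later in this document). The function name channel_dirs is our own.

```python
import json
import os


def channel_dirs(environ=os.environ):
    """Map each channel name in SM_CHANNELS to its SM_CHANNEL_{NAME} directory."""
    names = json.loads(environ.get('SM_CHANNELS', '[]'))
    return {name: environ['SM_CHANNEL_' + name.upper()] for name in names}


# Example values taken from this document, passed explicitly so the sketch
# also runs outside a SageMaker container.
example_env = {
    'SM_CHANNELS': '["testing","training"]',
    'SM_CHANNEL_TRAINING': '/opt/ml/input/data/training',
    'SM_CHANNEL_TESTING': '/opt/ml/input/data/testing',
}
print(channel_dirs(example_env))
```

Inside a container, calling channel_dirs() with no arguments reads from os.environ.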

When training starts, SageMaker Containers will print all available environment variables.
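
To inspect the same information from inside a script, one option is to filter the process environment. This is a minimal sketch that assumes every variable of interest carries the SM_ prefix, as all the variables listed in this document do; sagemaker_env is a name we made up for the example.

```python
import os


def sagemaker_env(environ=os.environ):
    """Return the SM_-prefixed variables from ``environ``, sorted by name."""
    return {key: environ[key] for key in sorted(environ) if key.startswith('SM_')}


# A small stand-in environment so the sketch runs anywhere.
example_env = {'SM_MODEL_DIR': '/opt/ml/model', 'SM_NUM_GPUS': '1', 'PATH': '/usr/bin'}
for key, value in sagemaker_env(example_env).items():
    print('{}={}'.format(key, value))
```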

These environment variables are those that you're likely to use when writing a user script. A full list of environment variables is given below.

SM_MODEL_DIR=/opt/ml/model

When the training job finishes, the container is deleted along with its file system, with the exception of the /opt/ml/model and /opt/ml/output folders. Use /opt/ml/model to save model checkpoints. These checkpoints will be uploaded to the default S3 bucket. Usage example:

import os

# using it in argparse
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])

# using it as variable
model_dir = os.environ['SM_MODEL_DIR']

# saving checkpoints to model dir in chainer
serializers.save_npz(os.path.join(os.environ['SM_MODEL_DIR'], 'model.npz'), model)

For more information, see: How Amazon SageMaker Processes Training Output.
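
A framework-independent way to write into the model directory is sketched below. It falls back to a temporary directory when SM_MODEL_DIR is not set, so the same code can be tried outside a container. The function name save_model_metadata and the file name metadata.json are assumptions made for the example.

```python
import json
import os
import tempfile


def save_model_metadata(metadata, model_dir=None):
    """Write ``metadata`` as JSON into the model directory and return the path.

    Uses SM_MODEL_DIR when no directory is given, and a temporary directory
    when that variable is not set either.
    """
    model_dir = model_dir or os.environ.get('SM_MODEL_DIR') or tempfile.mkdtemp()
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, 'metadata.json')
    with open(path, 'w') as f:
        json.dump(metadata, f)
    return path


# Demonstration with a throwaway directory standing in for /opt/ml/model.
saved_path = save_model_metadata({'epochs': 1}, model_dir=tempfile.mkdtemp())
print(saved_path)
```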

SM_CHANNELS='["testing","training"]'

Contains the list of input data channels in the container.

When you run training, you can partition your training data into different logical "channels". Depending on your problem, some common channel ideas are: "training", "testing", "evaluation" or "images" and "labels".

SM_CHANNELS includes the name of the available channels in the container as a JSON encoded list. Usage example:

import os
import json

# using it in argparse
parser.add_argument('--channel-names', type=json.loads, default=os.environ['SM_CHANNELS'])

# using it as variable
channel_names = json.loads(os.environ['SM_CHANNELS'])

SM_CHANNEL_TRAINING='/opt/ml/input/data/training'
SM_CHANNEL_TESTING='/opt/ml/input/data/testing'

Contains the directory where the channel named channel_name is located in the container. Usage examples:

import argparse
import os

import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TESTING'])

args = parser.parse_args()

train_file = np.load(os.path.join(args.train, 'train.npz'))
test_file = np.load(os.path.join(args.test, 'test.npz'))

SM_HPS='{"batch-size": "256", "learning-rate": "0.0001","communicator": "pure_nccl"}'

Contains a JSON encoded dictionary with the user provided hyperparameters. Example usage:

import os
import json

hyperparameters = json.loads(os.environ['SM_HPS'])
# {"batch-size": 256, "learning-rate": 0.0001, "communicator": "pure_nccl"}

SM_HP_LEARNING-RATE=0.0001
SM_HP_BATCH-SIZE=10000
SM_HP_COMMUNICATOR=pure_nccl

Contains value of the hyperparameter named hyperparameter_name. Usage examples:

learning_rate = float(os.environ['SM_HP_LEARNING-RATE'])
batch_size = int(os.environ['SM_HP_BATCH-SIZE'])
communicator = os.environ['SM_HP_COMMUNICATOR']

SM_CURRENT_HOST=algo-1

The name of the current container on the container network. Usage example:

import os

# using it in argparse
parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])

# using it as variable
current_host = os.environ['SM_CURRENT_HOST']

SM_HOSTS='["algo-1","algo-2"]'

JSON encoded list containing all the hosts. Usage example:

import os
import json

# using it in argparse
parser.add_argument('--hosts', type=json.loads, default=os.environ['SM_HOSTS'])

# using it as variable
hosts = json.loads(os.environ['SM_HOSTS'])

SM_NUM_GPUS=1

The number of GPUs available in the current container. Usage example:

import os

# using it in argparse
parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])

# using it as variable
num_gpus = int(os.environ['SM_NUM_GPUS'])

SM_NUM_CPUS=32

The number of CPUs available in the current container. Usage example:

import os

# using it in argparse
parser.add_argument('--num-cpus', type=int, default=os.environ['SM_NUM_CPUS'])

# using it as variable
num_cpus = int(os.environ['SM_NUM_CPUS'])

SM_LOG_LEVEL=20

The current log level in the container. Usage example:

import os
import logging

logger = logging.getLogger(__name__)

logger.setLevel(int(os.environ.get('SM_LOG_LEVEL', logging.INFO)))

SM_NETWORK_INTERFACE_NAME=ethwe

Name of the network interface, useful for distributed training. Usage example:

# using it in argparse
parser.add_argument('--network-interface', type=str, default=os.environ['SM_NETWORK_INTERFACE_NAME'])

# using it as variable
network_interface = os.environ['SM_NETWORK_INTERFACE_NAME']

SM_USER_ARGS='["--batch-size","256","--learning-rate","0.0001","--communicator","pure_nccl"]'

JSON encoded list with the script arguments provided for training.

SM_INPUT_DIR=/opt/ml/input/

The path of the input directory, e.g. /opt/ml/input/. This is the directory where SageMaker saves input data and configuration files before and during training.

SM_INPUT_CONFIG_DIR=/opt/ml/input/config

The path of the input configuration directory, e.g. /opt/ml/input/config/. This is the directory where the standard SageMaker configuration files are located.

SageMaker training creates the following files in this folder when training starts:

  • hyperparameters.json: Amazon SageMaker makes the hyperparameters in a CreateTrainingJob request available in this file.
  • inputdataconfig.json: You specify data channel information in the InputDataConfig parameter in a CreateTrainingJob request. Amazon SageMaker makes this information available in this file.
  • resourceconfig.json: name of the current host and all host containers in the training.
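
Reading these three files directly can be sketched as follows. The helper load_input_config is hypothetical; it takes the directory as a parameter so it can be tried against any folder, and treats a missing file as an empty dictionary, which is a choice made for this sketch.

```python
import json
import os
import tempfile


def load_input_config(config_dir):
    """Load the JSON configuration files found in ``config_dir``.

    Files that are missing are returned as empty dicts.
    """
    config = {}
    for name in ('hyperparameters', 'inputdataconfig', 'resourceconfig'):
        path = os.path.join(config_dir, name + '.json')
        if os.path.isfile(path):
            with open(path) as f:
                config[name] = json.load(f)
        else:
            config[name] = {}
    return config


# Demonstration with a throwaway directory standing in for /opt/ml/input/config.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'resourceconfig.json'), 'w') as f:
    json.dump({'current_host': 'algo-1', 'hosts': ['algo-1', 'algo-2']}, f)

config = load_input_config(demo_dir)
print(config['resourceconfig']['current_host'])
```

Inside a container, the directory would come from os.environ['SM_INPUT_CONFIG_DIR'].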

More information about these files can be found here: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html

SM_OUTPUT_DATA_DIR=/opt/ml/output/data/algo-1

The directory for non-model training artifacts (e.g. evaluation results) that SageMaker will retain, e.g. /opt/ml/output/data.

As your algorithm runs in a container, it generates output including the status of the training job and model and output artifacts. Your algorithm should write this information to this directory.

SM_RESOURCE_CONFIG='{"current_host":"algo-1","hosts":["algo-1","algo-2"]}'

The contents from /opt/ml/input/config/resourceconfig.json. It has the following keys:

  • current_host: The name of the current container on the container network. For example, 'algo-1'.
  • hosts: The list of names of all containers on the container network, sorted lexicographically. For example, ['algo-1', 'algo-2', 'algo-3'] for a three-node cluster.
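
Because every container sees the same sorted host list, a script can derive a stable rank for itself. The sketch below assumes the lowest-sorted host acts as the master; that convention is a choice made for the example, not something SageMaker enforces, and host_rank is a hypothetical name.

```python
import json


def host_rank(resource_config):
    """Return (rank, world_size, is_master) for the current host."""
    hosts = sorted(resource_config['hosts'])
    rank = hosts.index(resource_config['current_host'])
    return rank, len(hosts), rank == 0


# Same structure as SM_RESOURCE_CONFIG, seen from the second container.
resource_config = json.loads('{"current_host":"algo-2","hosts":["algo-1","algo-2"]}')
print(host_rank(resource_config))
```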

For more information about resourceconfig.json: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-dist-training

SM_INPUT_DATA_CONFIG='{
    "testing": {
        "RecordWrapperType": "None",
        "S3DistributionType": "FullyReplicated",
        "TrainingInputMode": "File"
    },
    "training": {
        "RecordWrapperType": "None",
        "S3DistributionType": "FullyReplicated",
        "TrainingInputMode": "File"
    }
}'

Input data configuration from /opt/ml/input/config/inputdataconfig.json.

For more information about inputdataconfig.json: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-dist-training

SM_TRAINING_ENV='
{
    "channel_input_dirs": {
        "test": "/opt/ml/input/data/testing",
        "train": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_chainer_container.training:main",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "hyperparameters": {
        "batch-size": 10000,
        "epochs": 1
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "test": {
            "RecordWrapperType": "None",
            "S3DistributionType": "FullyReplicated",
            "TrainingInputMode": "File"
        },
        "train": {
            "RecordWrapperType": "None",
            "S3DistributionType": "FullyReplicated",
            "TrainingInputMode": "File"
        }
    },
    "input_dir": "/opt/ml/input",
    "job_name": "preprod-chainer-2018-05-31-06-27-15-511",
    "log_level": 20,
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-{aws-region}-{aws-id}/{training-job-name}/source/sourcedir.tar.gz",
    "module_name": "user_script",
    "network_interface_name": "ethwe",
    "num_cpus": 4,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data/algo-1",
    "output_dir": "/opt/ml/output",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1",
            "algo-2"
        ]
    }
}'

Provides the entire training information as a JSON-encoded dictionary.
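
Reading individual fields out of SM_TRAINING_ENV can be sketched like this. The example value is a trimmed-down version of the dictionary shown above, and load_training_env is a name chosen for the sketch; returning an empty dict when the variable is missing is also our assumption.

```python
import json
import os


def load_training_env(environ=os.environ):
    """Decode SM_TRAINING_ENV into a dict, or return {} when it is not set."""
    return json.loads(environ.get('SM_TRAINING_ENV', '{}'))


# A trimmed-down value with a few of the keys shown above.
example_env = {
    'SM_TRAINING_ENV': json.dumps({
        'current_host': 'algo-1',
        'hosts': ['algo-1', 'algo-2'],
        'hyperparameters': {'batch-size': 10000, 'epochs': 1},
        'model_dir': '/opt/ml/model',
        'num_gpus': 1,
    })
}

training_env = load_training_env(example_env)
batch_size = training_env['hyperparameters']['batch-size']
use_gpu = training_env['num_gpus'] > 0
print(batch_size, use_gpu)
```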


sagemaker-containers's Issues

Copy bug in 'download_and_extract' function in _files.py

When a script is provided, the .py extension gets stripped to get the module name here in the _env.py file. This stripped module name is ultimately used when the module is run in the _modules.py file

During a review with @mvsusp, he mentioned that this line of code may be a bug?

Instead of

def download_and_extract(uri, name, path):
    ...
    if not os.listdir(path):
        with tmpdir() as tmp:
            if uri.startswith('s3://'):
                ...
            elif os.path.isdir(uri):
                ...
            else:
                shutil.copy2(uri, os.path.join(path, name))

it should be

def download_and_extract(uri, name, path):
    ...
    if not os.listdir(path):
        with tmpdir() as tmp:
            if uri.startswith('s3://'):
                ...
            elif os.path.isdir(uri):
                ...
            else:
                shutil.copy2(uri, path)

I'm not too familiar with the previous use case, so this might be completely wrong. If so, is there any other way to prevent the .py extension of a user script from being stripped while it is copied to the docker container? We basically want to be able to copy over a user .py script before training begins, and then execute the script within docker during training (instead of pulling the script from s3).

I was directed by @mvsusp to submit as an issue to get it reviewed and prioritized.
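
The difference between the two shutil.copy2 calls can be reproduced in isolation. This is a standalone sketch using temporary directories, not the library's code; the variable names mirror the function arguments above, and user_script.py is a placeholder file name.

```python
import os
import shutil
import tempfile

src_dir = tempfile.mkdtemp()
uri = os.path.join(src_dir, 'user_script.py')
with open(uri, 'w') as f:
    f.write('print("hello")\n')

name = 'user_script'  # module name with the .py extension stripped

# Current call: the destination file is named after the module name.
path_a = tempfile.mkdtemp()
shutil.copy2(uri, os.path.join(path_a, name))

# Proposed call: copying into the directory keeps the original file name.
path_b = tempfile.mkdtemp()
shutil.copy2(uri, path_b)

print(os.listdir(path_a))
print(os.listdir(path_b))
```

When the destination is an existing directory, shutil.copy2 copies the file into it under its original base name, which is why the second call preserves the .py extension.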

Serialized hyperparams break containers that aren't launched by python sdk

Cheers,

I have already offloaded this to AWS support, but find it might help other people too, so here I go...

Problem

Serialized hyperparams break containers that aren't launched by python sdk

We are currently using step-functions to automate training of some models using sagemaker.
We use arn:aws:states:::sagemaker:createTrainingJob.sync as outlined here.

This in turn uses the sagemaker API as described here, where it is specified to use Hyperparameters as follows:

"HyperParameters": { 
      "string" : "string" 
   }

(Type: String to string map).

It is reasonable to expect the following syntax to be valid for HyperParameters:

"HyperParameters": {
    "epochs": "1",
    "batch_size": "128",
    "conv_block_length": "2",
    "cycle_length": "10",
    "depth": "5",
    "dropout": "0.5",
    "job_name": "my_job",
    "max_lr": "0.1",
    "min_lr": "0.0001",
    "sagemaker_container_log_level": "20",
    "sagemaker_enable_cloudwatch_metrics": "false",
    "sagemaker_job_name": "my_job",
    "sagemaker_program": "train_n_folds.py",
    ...
}

However, this will fail the training job with the following error:

Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 31, in start
env = TrainingEnvironment()
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 182, in __init__
os.path.join(self.input_config_dir, TrainingEnvironment.HYPERPARAMETERS_FILE))
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 224, in _load_hyperparameters
return self._deserialize_hyperparameters(serialized)
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 238, in _deserialize_hyperparameters
hyperparameter_dict[k] = json.loads(v)
File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None

After some digging into what caused this issue, I became frustrated, because I couldn't find the code from the above stack trace anywhere on GitHub. (:thinking: Am I just looking in the wrong place, or is this intended? If intended, will it ever be published for input from the community? :thinking:)

To get access to the code I downloaded the docker image from ECR and copied it from there and lo and behold I found something:

class TrainingEnvironment(ContainerEnvironment):

    ...

    # TODO expecting serialized hyperparams might break containers that aren't launched by python sdk
    @staticmethod
    def _deserialize_hyperparameters(hp):
        hyperparameter_dict = {}

        for (k, v) in hp.items():
            # Tuning jobs inject a hyperparameter that does not conform to the JSON format
            if k == '_tuning_objective_metric':
                if v.startswith('"') and v.endswith('"'):
                    v = v.strip('"')
                hyperparameter_dict[k] = v
            else:
                hyperparameter_dict[k] = json.loads(v)

        return hyperparameter_dict

So somebody had the right idea 👍, however it wasn't followed through with 👎.

What seems to break things is the fact that json.loads(v) is called on each element of the dict. However, if that element is a plain string such as my_job (as expected from a string-to-string map), it will raise a JSONDecodeError. The sagemaker-sdk does a json.dumps("value") for every input, yielding something like '"mystring"'. You can do a json.loads('"mystring"'), but a json.loads("mystring") will fail.

Solution

There are some possibilities:

  1. open source the code for people to help (if this hasn't happened)
  2. fix the TODO
  3. ... use an ugly workaround (see below)
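
For option 2, one possible shape of a more tolerant deserializer is sketched below. It is only an illustration of the idea, not the code that ships in the container: each value is decoded as JSON when possible and kept as the raw string otherwise.

```python
import json


def deserialize_hyperparameters(hp):
    """Decode each value as JSON, keeping the raw string when that fails."""
    result = {}
    for key, value in hp.items():
        try:
            result[key] = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            result[key] = value
    return result


raw = {'epochs': '1', 'dropout': '0.5', 'job_name': 'my_job', 'quoted': '"my_job"'}
print(deserialize_hyperparameters(raw))
```

One caveat with this approach: unquoted strings that happen to be valid JSON, such as true or null, would still be decoded to True or None rather than kept as text.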

Workaround

Our current workaround is quoting the "true" strings explicitly:

"HyperParameters": {
    "epochs": "1",
    "batch_size": "128",
    "conv_block_length": "2",
    "cycle_length": "10",
    "depth": "5",
    "dropout": "0.5",
    "job_name": "\"my_job\"",
    "max_lr": "0.1",
    "min_lr": "0.0001",
    "sagemaker_container_log_level": "20",
    "sagemaker_enable_cloudwatch_metrics": "false",
    "sagemaker_job_name": "\"my_job\"",
    "sagemaker_program": "\"train_n_folds.py\"",
    ...
}
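
Rather than quoting by hand, the same encoding can be produced programmatically before the request is built. This sketch mirrors what the text above says the sagemaker-sdk does (a json.dumps per value); encode_hyperparameters is a hypothetical helper name.

```python
import json


def encode_hyperparameters(hyperparameters):
    """JSON-encode every value so that json.loads() in the container succeeds."""
    return {str(key): json.dumps(value) for key, value in hyperparameters.items()}


encoded = encode_hyperparameters({'epochs': 1, 'dropout': 0.5, 'job_name': 'my_job'})
print(encoded)
```

Numbers come out as plain digit strings, while real strings gain the extra pair of double quotes used in the workaround.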

InstallModuleError when training local on SageMaker notebook instance

The latest version of sagemaker-containers is giving me an error when training my model locally on a SageMaker notebook instance. The error occurs when running the command: /usr/local/bin/python -m pip install -U .

The code block that failed in the notebook is the following:

from sagemaker.estimator import Estimator

hyperparameters = {'epochs': 4, 'batch-size': 64}

estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=image_name,
                      hyperparameters=hyperparameters)

estimator.fit(f'file://{DATA_PATH}')

Was working fine in a previous version of sagemaker-containers

Full stack trace can be found here:

[{'DataSource': {'FileDataSource': {'FileDataDistributionType': 'FullyReplicated', 'FileUri': 'file:///home/ec2-user/SageMaker/sagemaker-fastai-example/data/shirts'}}, 'ChannelName': 'training', 'DataUri': 'file:///home/ec2-user/SageMaker/sagemaker-fastai-example/data/shirts'}]
Creating tmp4usrw_13_algo-1-VZHOB_1_12f99eb2c642 ... 
Attaching to tmp4usrw_13_algo-1-VZHOB_1_5c757e8ba99a2mdone
algo-1-VZHOB_1_5c757e8ba99a | 2018-11-19 23:17:56,515 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
algo-1-VZHOB_1_5c757e8ba99a | 2018-11-19 23:17:56,539 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
algo-1-VZHOB_1_5c757e8ba99a | 2018-11-19 23:17:56,542 sagemaker_pytorch_container.training INFO     Invoking user training script.
algo-1-VZHOB_1_5c757e8ba99a | 2018-11-19 23:17:56,542 sagemaker-containers INFO     Installing module with the following command:
algo-1-VZHOB_1_5c757e8ba99a | /usr/local/bin/python -m pip install -U . 
algo-1-VZHOB_1_5c757e8ba99a | Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
algo-1-VZHOB_1_5c757e8ba99a | 2018-11-19 23:17:56,969 sagemaker-containers ERROR    InstallModuleError:
algo-1-VZHOB_1_5c757e8ba99a | Command "/usr/local/bin/python -m pip install -U ."
tmp4usrw_13_algo-1-VZHOB_1_5c757e8ba99a exited with code 1
Aborting on container exit...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    124         try:
--> 125             _stream_output(process)
    126         except RuntimeError as e:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _stream_output(process)
    556     if exit_code != 0:
--> 557         raise RuntimeError("Process exited with code: %s" % exit_code)
    558 

RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-12-08a4e64e55a4> in <module>()
      9                       hyperparameters=hyperparameters)
     10 
---> 11 estimator.fit(f'file://{DATA_PATH}')

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    207         self._prepare_for_training(job_name=job_name)
    208 
--> 209         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    210         if wait:
    211             self.latest_training_job.wait(logs=logs)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs)
    460                                           resource_config=config['resource_config'], vpc_config=config['vpc_config'],
    461                                           hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 462                                           tags=estimator.tags)
    463 
    464         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
    279         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    280         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 281         self.sagemaker_client.create_training_job(**train_request)
    282 
    283     def tune(self, job_name, strategy, objective_type, objective_metric_name,

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
     72         training_job = _LocalTrainingJob(container)
     73         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 74         training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
     75 
     76         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in start(self, input_data_config, output_data_config, hyperparameters, job_name)
     68         self.state = self._TRAINING
     69 
---> 70         self.model_artifacts = self.container.train(input_data_config, output_data_config, hyperparameters, job_name)
     71         self.end = datetime.datetime.now()
     72         self.state = self._COMPLETED

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
    128             # which contains the exit code and append the command line to it.
    129             msg = "Failed to run: %s, %s" % (compose_command, str(e))
--> 130             raise RuntimeError(msg)
    131 
    132         artifacts = self.retrieve_artifacts(compose_data, output_data_config, job_name)

RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp4usrw_13/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

A/B testing?

I am unable to find how these containers can be enabled for A/B testing. Is this feature in here?
The docs say "Amazon SageMaker also includes built-in A/B testing capabilities to help you test your model and experiment with different versions to achieve the best results."

Also, any comments on best practices for managing different production model variants? How are your customers using the model variants feature?

MPI Smart Default for Processes per Host

For the referenced line/section below, would it be reasonable to have a smart default here to consume all available GPUs per host if executed inside an SM job and the parameter isn't supplied?

This would prevent the user from having to translate p3.16xlarge to 8 GPUs, for instance, and from modifying these parameters when changing instances.

processes_per_host = _mpi_param_value(mpi_args, env, _params.MPI_PROCESSES_PER_HOST, 1)
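
As a sketch of what such a default could look like, the function below reads the GPU count from SM_NUM_GPUS and falls back to 1 when the variable is missing or reports no GPUs. The function name and the fallback rule are assumptions made for illustration; this is not the library's implementation.

```python
import os


def default_processes_per_host(environ=os.environ):
    """Return one process per GPU, or 1 when no GPUs are reported."""
    num_gpus = int(environ.get('SM_NUM_GPUS', 0))
    return num_gpus if num_gpus > 0 else 1


print(default_processes_per_host({'SM_NUM_GPUS': '8'}))
print(default_processes_per_host({}))
```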

Version `2.5.12` is missing module `scipy.sparse`

I'm getting this error when trying to run train

bash-4.2# train
Traceback (most recent call last):
  File "/usr/bin/train", line 11, in <module>
    load_entry_point('sagemaker-containers==2.5.12', 'console_scripts', 'train')()
  File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 572, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2755, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2408, in load
    return self.resolve()
  File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2414, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/lib64/python2.7/site-packages/sagemaker_containers/cli/train.py", line 14, in <module>
    from sagemaker_containers.beta.framework import trainer
  File "/usr/lib64/python2.7/site-packages/sagemaker_containers/beta/framework/__init__.py", line 19, in <module>
    from sagemaker_containers import _encoders as encoders
  File "/usr/lib64/python2.7/site-packages/sagemaker_containers/_encoders.py", line 22, in <module>
    from scipy.sparse import issparse
ImportError: No module named scipy.sparse

That is with version 2.5.12

If I install 2.5.11, I get

Successfully installed sagemaker-containers-2.5.11
bash-4.2# train
bash-4.2#

wrong env var name in README

Under "List of provided environment variables by SageMaker Containers"

a)
The code example for SM_INPUT_CONFIG_DIR seems to be wrong, it says
SM_INPUT_DIR=/opt/ml/input/config
instead of
SM_INPUT_CONFIG_DIR=/opt/ml/input/config

b)
The first sentence of the explanation text reads:

The path of the input directory, e.g. /opt/ml/input/config/.

I guess this is also a copy&paste error and should instead read:

The path of the input configuration directory, e.g. /opt/ml/input/config/.

Running Sagemaker Endpoint Locally Using default 'serve' entrypoint with args/env args

I have a docker container model I wish to deploy in Sagemaker. I can deploy the Sagemaker model using the Sagemaker SDK just fine, and it works as advertised.

For my use case, I am seeking to run the sagemaker container locally with my own code. I need to run it in a local testing cluster for integration testing, and am trying to get my container to mimic the Sagemaker endpoint environment as closely as possible. It is unclear to me where the files I am specifying in my entry_point argument using the sagemaker sdk are going and what utilities are being used to load them into the PYTHONPATH.

I am looking for the actual docker commands sagemaker is using at runtime to import my entry_point files, unzip/untar my model.tar.gz, and serve it in the endpoint serve script. Currently when I just run docker run <container-name> serve, I find my custom code isn't imported and my saved model isn't decompressed/compiled. I really wish to not hack the container too much, and searching the Sagemaker docs/source code for serve arguments and environment variables for this task has proven quite challenging. Specifically I am using the XGBoost Sagemaker container, but will be using others in the future.

As an example, I currently just have a single entrypoint file defined below, but this could change to be many files in a directory/zip. Here is my file located at `/opt/ml/code` in my container:

entry_point.py:

import os
import logging
import pickle as pkl

import numpy as np
import xgboost as xgb
from sagemaker_xgboost_container import encoder as xgb_encoders
# v0.0.4

logger = logging.getLogger(__name__)


def _clean_csv_string(csv_string, delimiter):
    return ['nan' if x == '' else x for x in csv_string.split(delimiter)]


def csv_to_dmatrix_no_sniff(string_like):  # type: (str) -> xgb.DMatrix
    """Convert a CSV string to a DMatrix object without sniffing the delimiter.
    Args:
        string_like (str): CSV string, comma-delimited, one row per line.
    Returns:
        (xgb.DMatrix): XGBoost DataMatrix
    """
    # logger.warn is a deprecated alias; logger.warning is the supported call.
    logger.warning('I AM A SPECIAL NO SNIFF CSV!!')
    delimiter = ','
    np_payload = np.array(list(map(lambda x: _clean_csv_string(x, delimiter), string_like.split('\n'))))
    return xgb.DMatrix(np_payload)


def input_fn(input_data, content_type):
    """Take request data and de-serializes the data into an object for prediction.
        When an InvokeEndpoint operation is made against an Endpoint running SageMaker model server,
        the model server receives two pieces of information:
            - The request Content-Type, for example "application/json"
            - The request data, which is at most 5 MB (5 * 1024 * 1024 bytes) in size.
        The input_fn is responsible to take the request data and pre-process it before prediction.
    Args:
        input_data (obj): the request data.
        content_type (str): the request Content-Type.
    Returns:
        (obj): data ready for prediction. For XGBoost, this defaults to DMatrix.
    """
    if content_type == 'text/csv':
        return csv_to_dmatrix_no_sniff(input_data)
    else:
        return xgb_encoders.decode(input_data, content_type)


def model_fn(model_dir):
    """Load a model. For XGBoost Framework, a default function to load a model is not provided.
    Users should provide customized model_fn() in script.
    Args:
        model_dir: a directory where model is saved.
    Returns: A XGBoost model.
    """
    model_file = 'xgboost-model'
    with open(os.path.join(model_dir, model_file), 'rb') as f:
        model = pkl.load(f)
    return model

I can use the sagemaker sdk with this file as the entry_point just fine, but cannot get the same functionality locally without hacking the sagemaker container or writing my own serve.py script with custom code.
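
One way to approximate the hosted setup locally is sketched below: unpack model.tar.gz on the host, then mount the model and code directories into the container. This is only a sketch under assumptions. The helper names are hypothetical, and the mount targets and the SAGEMAKER_PROGRAM / SAGEMAKER_SUBMIT_DIRECTORY variables are assumptions about how framework containers find user code; nothing in this thread confirms that the XGBoost container honors them.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, target_dir):
    """Unpack a model.tar.gz archive into target_dir and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return target


def build_serve_command(image, model_dir, code_dir,
                        program="entry_point.py", port=8080):
    """Build (but do not run) a docker command that starts the serve entrypoint.

    Assumed, not confirmed: the container reads the model from /opt/ml/model,
    user code from /opt/ml/code, and the two SAGEMAKER_* variables below.
    """
    return [
        "docker", "run", "--rm",
        "-p", f"{port}:8080",
        "-v", f"{Path(model_dir).resolve()}:/opt/ml/model",
        "-v", f"{Path(code_dir).resolve()}:/opt/ml/code",
        "-e", f"SAGEMAKER_PROGRAM={program}",
        "-e", "SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/code",
        image, "serve",
    ]
```

The returned list can be passed to subprocess.run once the paths and image name have been checked by hand.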

Minor documentation error

I'm working my way through migrating a container to use sagemaker-containers. I've found a minor documentation issue, and I have a few suggestions and observations.

README.rst states

The training script must be located under the folder /opt/ml/model and its relative path is defined in the environment variable SAGEMAKER_PROGRAM

But the Dockerfile copies the file to /opt/ml/code, not /opt/ml/model. I suspect the documentation is wrong?

Also, it took me a while to figure out that ENV is used by sagemaker-containers to locate the train entry point, i.e.

ENV SAGEMAKER_PROGRAM train.py

I think this could be more explicit, e.g. by listing all the ENV options and how they're invoked. I've found SAGEMAKER_PROGRAM and SAGEMAKER_TRAINING_MODULE, which I assume are for scripts and a callable, respectively (the callable seems to take zero args?).

Finally, I've not yet looked at the serve endpoint, but I don't see much in the way of details on how it works, i.e. nothing at the level of TRAINING_IN_DETAIL.rst.

And it would be really nice if 'local' worked without having to set up credentials and pass an ARN to role='' (ideally, docker run -v \opt\ml:\opt\ml XYZ train would work, provided \opt\ml has the required structure).
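
To make the "required structure" concrete, here is a minimal sketch that builds such a directory tree on the host. The layout (input/config, input/data/<channel>, model, output) follows the paths SageMaker documents for training containers. The helper name `make_local_ml_root`, the default channel name, and the choice to store every hyperparameter value as a string are assumptions for illustration.

```python
import json
from pathlib import Path


def make_local_ml_root(root, hyperparameters, channels=("training",)):
    """Create an /opt/ml-style directory tree under root and return it."""
    root = Path(root)
    config_dir = root / "input" / "config"
    config_dir.mkdir(parents=True, exist_ok=True)

    # One data directory per input channel.
    for channel in channels:
        (root / "input" / "data" / channel).mkdir(parents=True, exist_ok=True)

    # Where the script writes the model and any failure/output files.
    (root / "model").mkdir(exist_ok=True)
    (root / "output").mkdir(exist_ok=True)

    # Assumption: values are stored as strings, as the hosted service does.
    serialized = {key: str(value) for key, value in hyperparameters.items()}
    (config_dir / "hyperparameters.json").write_text(json.dumps(serialized))
    return root
```

The resulting directory could then be mounted at /opt/ml, as in the docker run command suggested above. Whether the train entry point runs cleanly against it without credentials is not confirmed here.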

Issue: Failed to run: ['docker-compose', '-f', ...

When I try to run in local mode I get:

tmpsqcmqlg5_algo-1-5rcld_1 exited with code 1
Aborting on container exit...
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 161, in train
    _stream_output(process)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 677, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Development/Projects/DeepTradingAI/deeptradingmodels/DeepTradingEstimator.py", line 7, in <module>
    estimator.fit()
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 494, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\estimator.py", line 1066, in start_new
    estimator.sagemaker_session.train(**train_args)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\session.py", line 590, in train
    self.sagemaker_client.create_training_job(**train_request)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\local_session.py", line 102, in create_training_job
    training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\entities.py", line 96, in start
    input_data_config, output_data_config, hyperparameters, job_name
  File "C:\Program Files\Python37\lib\site-packages\sagemaker\local\image.py", line 166, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', 'C:\\Users\\NEKTAR~1\\AppData\\Local\\Temp\\tmpsqcmqlg5\\docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

Cannot run locally with SageMaker Notebook Instance

Given I call the following code:

local_estimator.deploy(initial_instance_count=1, instance_type='local_cpu')

I get the following error.

Under my requirements.txt I have the following:

Pillow
numpy
keras
tensorflow
pandas
wheel
#theano

I added wheel as a requirement, but it still fails:

Attaching to tmpsnfgja1z_algo-1-dya9s_1
algo-1-dya9s_1  | INFO:__main__:starting services
algo-1-dya9s_1  | INFO:__main__:using default model name: model
algo-1-dya9s_1  | INFO:__main__:tensorflow serving model config: 
algo-1-dya9s_1  | model_config_list: {
algo-1-dya9s_1  |   config: {
algo-1-dya9s_1  |     name: "model",
algo-1-dya9s_1  |     base_path: "/opt/ml/model",
algo-1-dya9s_1  |     model_platform: "tensorflow"
algo-1-dya9s_1  |   },
algo-1-dya9s_1  | }
algo-1-dya9s_1  | 
algo-1-dya9s_1  | 
algo-1-dya9s_1  | INFO:__main__:nginx config: 
algo-1-dya9s_1  | load_module modules/ngx_http_js_module.so;
algo-1-dya9s_1  | 
algo-1-dya9s_1  | worker_processes auto;
algo-1-dya9s_1  | daemon off;
algo-1-dya9s_1  | pid /tmp/nginx.pid;
algo-1-dya9s_1  | error_log  /dev/stderr info;
algo-1-dya9s_1  | 
algo-1-dya9s_1  | worker_rlimit_nofile 4096;
algo-1-dya9s_1  | 
algo-1-dya9s_1  | events {
algo-1-dya9s_1  |   worker_connections 2048;
algo-1-dya9s_1  | }
algo-1-dya9s_1  | 
algo-1-dya9s_1  | http {
algo-1-dya9s_1  |   include /etc/nginx/mime.types;
algo-1-dya9s_1  |   default_type application/json;
algo-1-dya9s_1  |   access_log /dev/stdout combined;
algo-1-dya9s_1  |   js_include tensorflow-serving.js;
algo-1-dya9s_1  | 
algo-1-dya9s_1  |   upstream tfs_upstream {
algo-1-dya9s_1  |     server localhost:8501;
algo-1-dya9s_1  |   }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |   upstream gunicorn_upstream {
algo-1-dya9s_1  |     server unix:/tmp/gunicorn.sock fail_timeout=1;
algo-1-dya9s_1  |   }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |   server {
algo-1-dya9s_1  |     listen 8080 deferred;
algo-1-dya9s_1  |     client_max_body_size 0;
algo-1-dya9s_1  |     client_body_buffer_size 100m;
algo-1-dya9s_1  |     subrequest_output_buffer_size 100m;
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     set $tfs_version 1.13;
algo-1-dya9s_1  |     set $default_tfs_model model;
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     location /tfs {
algo-1-dya9s_1  |         rewrite ^/tfs/(.*) /$1  break;
algo-1-dya9s_1  |         proxy_redirect off;
algo-1-dya9s_1  |         proxy_pass_request_headers off;
algo-1-dya9s_1  |         proxy_set_header Content-Type 'application/json';
algo-1-dya9s_1  |         proxy_set_header Accept 'application/json';
algo-1-dya9s_1  |         proxy_pass http://tfs_upstream;
algo-1-dya9s_1  |     }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     location /ping {
algo-1-dya9s_1  |         proxy_pass http://gunicorn_upstream/ping;
algo-1-dya9s_1  |     }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     location /invocations {
algo-1-dya9s_1  |         proxy_pass http://gunicorn_upstream/invocations;
algo-1-dya9s_1  |     }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     location / {
algo-1-dya9s_1  |         return 404 '{"error": "Not Found"}';
algo-1-dya9s_1  |     }
algo-1-dya9s_1  | 
algo-1-dya9s_1  |     keepalive_timeout 3;
algo-1-dya9s_1  |   }
algo-1-dya9s_1  | }
algo-1-dya9s_1  | 
algo-1-dya9s_1  | 
algo-1-dya9s_1  | WARNING:__main__:failed to run command: tensorflow_model_server --version
algo-1-dya9s_1  | INFO:__main__:tensorflow serving command: tensorflow_model_server --port=9000 --rest_api_port=8501 --model_config_file=/sagemaker/model-config.cfg 
algo-1-dya9s_1  | tensorflow_model_server: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
algo-1-dya9s_1  | INFO:__main__:started tensorflow serving (pid: 9)
algo-1-dya9s_1  | INFO:__main__:installing packages from requirements.txt...
algo-1-dya9s_1  | Collecting Pillow (from -r /opt/ml/model/code/requirements.txt (line 1))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/d6/98/0d360dbc087933679398d73187a503533ec0547ba4ffd2115365605559cc/Pillow-6.1.0-cp35-cp35m-manylinux1_x86_64.whl (2.1MB)
    100% |################################| 2.1MB 687kB/s 
algo-1-dya9s_1  | Collecting numpy (from -r /opt/ml/model/code/requirements.txt (line 2))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/bb/ef/d5a21cbc094d3f4d5b5336494dbcc9550b70c766a8345513c7c24ed18418/numpy-1.16.4-cp35-cp35m-manylinux1_x86_64.whl (17.2MB)
    100% |################################| 17.2MB 79kB/s 
algo-1-dya9s_1  | Collecting keras (from -r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/5e/10/aa32dad071ce52b5502266b5c659451cfd6ffcbf14e6c8c4f16c0ff5aaab/Keras-2.2.4-py2.py3-none-any.whl (312kB)
    100% |################################| 317kB 4.5MB/s 
algo-1-dya9s_1  | Collecting tensorflow (from -r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/7c/fb/7b2c5b3e85ad335b53ca67deb2ef4af574dc0a8759f43b7f45e15005e449/tensorflow-1.14.0-cp35-cp35m-manylinux1_x86_64.whl (109.2MB)
    100% |################################| 109.2MB 12kB/s 
algo-1-dya9s_1  | Collecting pandas (from -r /opt/ml/model/code/requirements.txt (line 5))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/a7/d9/e03b615e973c2733ff8fd53d95bd3633ecbfa81b5af2f83fe39647c02344/pandas-0.25.0-cp35-cp35m-manylinux1_x86_64.whl (10.3MB)
    100% |################################| 10.3MB 143kB/s 
algo-1-dya9s_1  | Collecting wheel (from -r /opt/ml/model/code/requirements.txt (line 6))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/bb/10/44230dd6bf3563b8f227dbf344c908d412ad2ff48066476672f3a72e174e/wheel-0.33.4-py2.py3-none-any.whl
algo-1-dya9s_1  | Collecting scipy>=0.14 (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/14/49/8f13fa215e10a7ab0731cc95b0e9bb66cf83c6a98260b154cfbd0b55fb19/scipy-1.3.0-cp35-cp35m-manylinux1_x86_64.whl (25.1MB)
    100% |################################| 25.1MB 55kB/s 
algo-1-dya9s_1  | Collecting keras-applications>=1.0.6 (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
    100% |################################| 51kB 10.0MB/s 
algo-1-dya9s_1  | Collecting h5py (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/4c/77/c4933e12dca0f61bcdafc207c7532e1250b8d12719459fd85132f3daa9fd/h5py-2.9.0-cp35-cp35m-manylinux1_x86_64.whl (2.8MB)
    100% |################################| 2.8MB 164kB/s 
algo-1-dya9s_1  | Collecting keras-preprocessing>=1.0.5 (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
    100% |################################| 51kB 11.2MB/s 
algo-1-dya9s_1  | Collecting pyyaml (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/a3/65/837fefac7475963d1eccf4aa684c23b95aa6c1d033a2c5965ccb11e22623/PyYAML-5.1.1.tar.gz (274kB)
    100% |################################| 276kB 4.8MB/s 
algo-1-dya9s_1  | Collecting six>=1.9.0 (from keras->-r /opt/ml/model/code/requirements.txt (line 3))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
algo-1-dya9s_1  | Collecting gast>=0.2.0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
algo-1-dya9s_1  | Collecting protobuf>=3.6.1 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/55/34/7158a5ec978f12307eb361a8c4fdd867a8e2a0ab63fac99e5f555ee796d2/protobuf-3.9.0-cp35-cp35m-manylinux1_x86_64.whl (1.2MB)
    100% |################################| 1.2MB 1.2MB/s 
algo-1-dya9s_1  | Collecting absl-py>=0.7.0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/da/3f/9b0355080b81b15ba6a9ffcf1f5ea39e307a2778b2f2dc8694724e8abd5b/absl-py-0.7.1.tar.gz (99kB)
    100% |################################| 102kB 10.1MB/s 
algo-1-dya9s_1  | Collecting tensorflow-estimator<1.15.0rc0,>=1.14.0rc0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/3c/d5/21860a5b11caf0678fbc8319341b0ae21a07156911132e0e71bffed0510d/tensorflow_estimator-1.14.0-py2.py3-none-any.whl (488kB)
    100% |################################| 491kB 2.8MB/s 
algo-1-dya9s_1  | Collecting wrapt>=1.11.1 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/23/84/323c2415280bc4fc880ac5050dddfb3c8062c2552b34c2e512eb4aa68f79/wrapt-1.11.2.tar.gz
algo-1-dya9s_1  | Collecting astor>=0.6.0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/d1/4f/950dfae467b384fc96bc6469de25d832534f6b4441033c39f914efd13418/astor-0.8.0-py2.py3-none-any.whl
algo-1-dya9s_1  | Collecting tensorboard<1.15.0,>=1.14.0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/91/2d/2ed263449a078cd9c8a9ba50ebd50123adf1f8cfbea1492f9084169b89d9/tensorboard-1.14.0-py3-none-any.whl (3.1MB)
    100% |################################| 3.2MB 450kB/s 
algo-1-dya9s_1  | Collecting google-pasta>=0.1.6 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/d0/33/376510eb8d6246f3c30545f416b2263eee461e40940c2a4413c711bdf62d/google_pasta-0.1.7-py3-none-any.whl (52kB)
    100% |################################| 61kB 11.2MB/s 
algo-1-dya9s_1  | Collecting grpcio>=1.8.6 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/7e/8e/9e446349fc449951ecf3768070483ea88e76725cdd5bbddb9bc50f6948d4/grpcio-1.22.0-cp35-cp35m-manylinux1_x86_64.whl (2.2MB)
    100% |################################| 2.2MB 687kB/s 
algo-1-dya9s_1  | Collecting termcolor>=1.1.0 (from tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
algo-1-dya9s_1  | Collecting python-dateutil>=2.6.1 (from pandas->-r /opt/ml/model/code/requirements.txt (line 5))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl (226kB)
    100% |################################| 235kB 5.6MB/s 
algo-1-dya9s_1  | Collecting pytz>=2017.2 (from pandas->-r /opt/ml/model/code/requirements.txt (line 5))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/3d/73/fe30c2daaaa0713420d0382b16fbb761409f532c56bdcc514bf7b6262bb6/pytz-2019.1-py2.py3-none-any.whl (510kB)
    100% |################################| 512kB 2.8MB/s 
algo-1-dya9s_1  | Requirement already satisfied (use --upgrade to upgrade): setuptools in /usr/lib/python3/dist-packages (from protobuf>=3.6.1->tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  | Collecting werkzeug>=0.11.15 (from tensorboard<1.15.0,>=1.14.0->tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/d1/ab/d3bed6b92042622d24decc7aadc8877badf18aeca1571045840ad4956d3f/Werkzeug-0.15.5-py2.py3-none-any.whl (328kB)
    100% |################################| 337kB 4.3MB/s 
algo-1-dya9s_1  | Collecting markdown>=2.6.8 (from tensorboard<1.15.0,>=1.14.0->tensorflow->-r /opt/ml/model/code/requirements.txt (line 4))
algo-1-dya9s_1  |   Downloading https://files.pythonhosted.org/packages/c0/4e/fd492e91abdc2d2fcb70ef453064d980688762079397f779758e055f6575/Markdown-3.1.1-py2.py3-none-any.whl (87kB)
    100% |################################| 92kB 11.5MB/s 
algo-1-dya9s_1  | Building wheels for collected packages: pyyaml, gast, absl-py, wrapt, termcolor
algo-1-dya9s_1  |   Running setup.py bdist_wheel for pyyaml ... error
algo-1-dya9s_1  |   Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hb08wimz/pyyaml/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmp1787yp26pip-wheel- --python-tag cp35:
algo-1-dya9s_1  |   /usr/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'python_requires'
algo-1-dya9s_1  |     warnings.warn(msg)
algo-1-dya9s_1  |   usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
algo-1-dya9s_1  |      or: -c --help [cmd1 cmd2 ...]
algo-1-dya9s_1  |      or: -c --help-commands
algo-1-dya9s_1  |      or: -c cmd --help
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   error: invalid command 'bdist_wheel'
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   ----------------------------------------
algo-1-dya9s_1  |   Failed building wheel for pyyaml
algo-1-dya9s_1  |   Running setup.py clean for pyyaml
algo-1-dya9s_1  |   Running setup.py bdist_wheel for gast ... error
algo-1-dya9s_1  |   Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hb08wimz/gast/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpf50t811rpip-wheel- --python-tag cp35:
algo-1-dya9s_1  |   /usr/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'python_requires'
algo-1-dya9s_1  |     warnings.warn(msg)
algo-1-dya9s_1  |   usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
algo-1-dya9s_1  |      or: -c --help [cmd1 cmd2 ...]
algo-1-dya9s_1  |      or: -c --help-commands
algo-1-dya9s_1  |      or: -c cmd --help
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   error: invalid command 'bdist_wheel'
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   ----------------------------------------
algo-1-dya9s_1  |   Failed building wheel for gast
algo-1-dya9s_1  |   Running setup.py clean for gast
algo-1-dya9s_1  |   Running setup.py bdist_wheel for absl-py ... error
algo-1-dya9s_1  |   Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hb08wimz/absl-py/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmp7mxsqz52pip-wheel- --python-tag cp35:
algo-1-dya9s_1  |   /usr/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
algo-1-dya9s_1  |     warnings.warn(msg)
algo-1-dya9s_1  |   usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
algo-1-dya9s_1  |      or: -c --help [cmd1 cmd2 ...]
algo-1-dya9s_1  |      or: -c --help-commands
algo-1-dya9s_1  |      or: -c cmd --help
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   error: invalid command 'bdist_wheel'
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   ----------------------------------------
algo-1-dya9s_1  |   Failed building wheel for absl-py
algo-1-dya9s_1  |   Running setup.py clean for absl-py
algo-1-dya9s_1  |   Running setup.py bdist_wheel for wrapt ... error
algo-1-dya9s_1  |   Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hb08wimz/wrapt/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmpxdzvv1xepip-wheel- --python-tag cp35:
algo-1-dya9s_1  |   usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
algo-1-dya9s_1  |      or: -c --help [cmd1 cmd2 ...]
algo-1-dya9s_1  |      or: -c --help-commands
algo-1-dya9s_1  |      or: -c cmd --help
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   error: invalid command 'bdist_wheel'
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   ----------------------------------------
algo-1-dya9s_1  |   Failed building wheel for wrapt
algo-1-dya9s_1  |   Running setup.py clean for wrapt
algo-1-dya9s_1  |   Running setup.py bdist_wheel for termcolor ... error
algo-1-dya9s_1  |   Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hb08wimz/termcolor/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d /tmp/tmptrp25yvopip-wheel- --python-tag cp35:
algo-1-dya9s_1  |   usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
algo-1-dya9s_1  |      or: -c --help [cmd1 cmd2 ...]
algo-1-dya9s_1  |      or: -c --help-commands
algo-1-dya9s_1  |      or: -c cmd --help
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   error: invalid command 'bdist_wheel'
algo-1-dya9s_1  |   
algo-1-dya9s_1  |   ----------------------------------------
algo-1-dya9s_1  |   Failed building wheel for termcolor
algo-1-dya9s_1  |   Running setup.py clean for termcolor
algo-1-dya9s_1  | Failed to build pyyaml gast absl-py wrapt termcolor
algo-1-dya9s_1  | Installing collected packages: Pillow, numpy, scipy, six, h5py, keras-applications, keras-preprocessing, pyyaml, keras, gast, protobuf, absl-py, tensorflow-estimator, wrapt, astor, werkzeug, markdown, grpcio, wheel, tensorboard, google-pasta, termcolor, tensorflow, python-dateutil, pytz, pandas
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-53-af53aaad44dc> in <module>()
----> 1 local_estimator.deploy(initial_instance_count=1, instance_type='local_cpu')

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, **kwargs)
    384             update_endpoint=update_endpoint,
    385             tags=self.tags,
--> 386             wait=wait)
    387 
    388     @property

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    301         else:
    302             self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant],
--> 303                                                                      tags, kms_key, wait)
    304 
    305         if self.predictor_cls:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1077 
   1078             self.sagemaker_client.create_endpoint_config(**config_options)
-> 1079         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1080 
   1081     def expand_role(self, role):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
    785         tags = tags or []
    786 
--> 787         self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name, Tags=tags)
    788         if wait:
    789             self.wait_for_endpoint(endpoint_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_endpoint(self, EndpointName, EndpointConfigName, Tags)
    141         endpoint = _LocalEndpoint(EndpointName, EndpointConfigName, Tags, self.sagemaker_session)
    142         LocalSagemakerClient._endpoints[EndpointName] = endpoint
--> 143         endpoint.serve()
    144 
    145     def update_endpoint(self, EndpointName, EndpointConfigName):  # pylint: disable=unused-argument

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in serve(self)
    382 
    383         serving_port = get_config_value('local.serving_port', self.local_session.config) or 8080
--> 384         _wait_for_serving_container(serving_port)
    385         # the container is running and it passed the healthcheck status is now InService
    386         self.state = _LocalEndpoint._IN_SERVICE

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in _wait_for_serving_container(serving_port)
    420             return
    421 
--> 422         time.sleep(5)
    423 
    424 

KeyboardInterrupt: 
Gracefully stopping... (press Ctrl+C again to force)

Errors when training with a custom MXNet container

Hi,

I built a custom MXNet container using https://github.com/aws/sagemaker-mxnet-containers, and pushed it to ECR. The container is fine as far as I can tell (inspecting with 'docker run', etc).

When I run this:

mxnet = sagemaker.estimator.Estimator(
                       image_name=image,
                       role=role, 
                       train_instance_count=1, 
                       train_instance_type='ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

mxnet.fit(data_location)

The training job fails with:

ValueError: Error training mxnet-2018-05-27-12-08-04-153: Failed Reason: AlgorithmError: uncaught exception during training: 'sagemaker_region'
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 31, in start
    env = TrainingEnvironment()
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 219, in __init__
    self.sagemaker_region = self.hyperparameters[ContainerEnvironment.SAGEMAKER_REGION_PARAM_NAME]
KeyError: 'sagemaker_region'

It looks like I'd need to set a 'sagemaker_region' parameter, which is weird because SageMaker should know what the region is.

Anyway, if I try to set it (in the Estimator or with set_hyperparameters):

mxnet.set_hyperparameters(sagemaker_region=region)
mxnet.fit(data_location)

Then the job fails because it can't deserialize hyperparameters:

ValueError: Error training mxnet-2018-05-27-12-22-10-626: Failed Reason: AlgorithmError: uncaught exception during training: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 31, in start
    env = TrainingEnvironment()
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 182, in __init__
    os.path.join(self.input_config_dir, TrainingEnvironment.HYPERPARAMETERS_FILE))
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 224, in _load_hyperparameters
    return self._deserialize_hyperparameters(serialized)
  File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 238, in _deserialize_hyperparameters
    hyperparameter_dict[k] = json.loads(v)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib

Have I missed anything? Thanks for your help.
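
The second traceback ends in json.loads(v) on each hyperparameter value, so the "Expecting value: line 1 column 1 (char 0)" message can be reproduced by passing a value that is not JSON-encoded. The sketch below shows that behavior in isolation. The helper names are hypothetical, "us-west-2" is just an illustrative region, and whether this is the actual cause of the failed job is an assumption.

```python
import json


def encode_hyperparameters(params):
    """JSON-encode every value so that json.loads can read it back."""
    return {key: json.dumps(value) for key, value in params.items()}


def decode_hyperparameters(serialized):
    """Mirror the per-value json.loads call seen in the traceback."""
    return {key: json.loads(value) for key, value in serialized.items()}


# A bare string such as us-west-2 is not valid JSON, so decoding it raises
# a ValueError. The JSON-encoded form (with surrounding quotes) decodes fine.
encoded = encode_hyperparameters({"sagemaker_region": "us-west-2"})
decoded = decode_hyperparameters(encoded)
```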

UnicodeEncodeError when outputting non-ASCII characters

What is the recommended way to output non-ASCII characters?

This minimal example fails with UnicodeEncodeError:

#!/usr/bin/env python

print(u"\u2639")  # print sad face

Specifically, when running with PyTorch in local mode, this is the full stack trace:

algo-1-3lmzz_1  | Invoking script with the following command:
algo-1-3lmzz_1  | 
algo-1-3lmzz_1  | /usr/bin/python -m unicode_test
algo-1-3lmzz_1  | 
algo-1-3lmzz_1  | 
algo-1-3lmzz_1  | 2019-01-07 19:38:45,883 sagemaker-containers ERROR    ExecuteUserScriptError:
algo-1-3lmzz_1  | Command "/usr/bin/python -m unicode_test"
algo-1-3lmzz_1  | Traceback (most recent call last):
algo-1-3lmzz_1  |   File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
algo-1-3lmzz_1  |     "__main__", mod_spec)
algo-1-3lmzz_1  |   File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
algo-1-3lmzz_1  |     exec(code, run_globals)
algo-1-3lmzz_1  |   File "/opt/ml/code/unicode_test.py", line 3, in <module>
algo-1-3lmzz_1  |     print(u"\u2639")  # print sad face
algo-1-3lmzz_1  | UnicodeEncodeError: 'ascii' codec can't encode character '\u2639' in position 0: ordinal not in range(128)
tmpdl15aqf5_algo-1-3lmzz_1 exited with code 1
Aborting on container exit...

It was executed by:

sagemaker.pytorch.PyTorch(
    entry_point='unicode_test.py',
    train_instance_type='local',
    train_instance_count=1,
    framework_version='1.0.0',
 ).fit()

I've tried a bunch of things like changing environment variables to set locale, but the only thing that I've gotten to work is to encode in ascii, which is not a good solution.
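
One possible workaround, sketched here as an assumption and not tested inside the PyTorch container, is to wrap the binary stdout stream in a text wrapper with an explicit UTF-8 encoding, so the output no longer depends on the container's locale. The helper name `utf8_writer` is hypothetical; io.TextIOWrapper is part of the standard library and is available on Python 3.6.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Usage inside a training script (not run here):
#   import sys
#   out = utf8_writer(sys.stdout.buffer)
#   print(u"\u2639", file=out)
```

Setting the PYTHONIOENCODING environment variable to utf-8 is the standard interpreter-level alternative, provided the variable actually reaches the process that runs the script.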

How to emit serving metrics?

There doesn't seem to be any ability to emit serving metrics of any kind in this repository.

There is no way to get metrics for model latency, predict time, inference time, or anything related to the Gunicorn workers.

There seems to be an environment variable for enabling metrics: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_params.py#L24.

It looks like container_support had some support for metrics using Telegraf: https://github.com/aws/sagemaker-containers/blob/r1.0/src/container_support/serving.py#L79

However, I can't find anything equivalent in sagemaker-containers.
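As a stopgap, request latency can be measured in the application itself. The sketch below is a generic WSGI middleware and is not part of sagemaker-containers. The class name TimingMiddleware and the record callback are hypothetical, and how the wrapped app would be handed to Gunicorn in a given container is an assumption left to the reader.

```python
import time


class TimingMiddleware:
    """Hypothetical WSGI wrapper that reports how long each request takes."""

    def __init__(self, app, record):
        self.app = app        # the WSGI application being wrapped
        self.record = record  # callback: record(path, seconds)

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        try:
            # Measures the time until the app returns its response iterable,
            # not the time spent streaming the body to the client.
            return self.app(environ, start_response)
        finally:
            elapsed = time.perf_counter() - start
            self.record(environ.get("PATH_INFO", ""), elapsed)


# Example: collect samples in a list; a real setup might log them instead.
samples = []


def ping_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"pong"]


timed_app = TimingMiddleware(ping_app, lambda path, secs: samples.append((path, secs)))
body = timed_app({"PATH_INFO": "/ping"}, lambda status, headers: None)
```

Recording through a callback keeps the middleware independent of any particular metrics backend, so the samples could be written to logs or forwarded to an agent.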

Running a container to serve my own algorithm

Environment:

  • CentOS 7
  • Docker 18.09.1
  • Python 2.7
  • TensorFlow 1.9.0 (CPU)

My App structure:

  • Dockerfile
  • sagemaker_app
    ---mymodel
    ---nginx.conf
    ---serve
    ---wsgi.py

After building the Docker image with my sagemaker_app, I ran "docker run -p 8080:8080 -it --rm mymodel_sagemaker_app serve" and got this error:
2019-02-01 21:23:11,067 INFO - container_support.serving - starting gunicorn
2019-02-01 21:23:11,079 INFO - container_support.serving - inference server started. waiting on processes: set([16, 15])
2019-02-01 21:23:11.099684: I tensorflow_serving/model_servers/main.cc:154] Building single TensorFlow model file config: model_name: generic_model model_base_path: /opt/ml/model/export/Servo
2019-02-01 21:23:11.099858: I tensorflow_serving/model_servers/server_core.cc:444] Adding/updating models.
2019-02-01 21:23:11.099873: I tensorflow_serving/model_servers/server_core.cc:499] (Re-)adding model: generic_model
2019-02-01 21:23:11.105072: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:370] FileSystemStoragePathSource encountered a file-system access error: Could not find base path /opt/ml/model/export/Servo for servable generic_model
[2019-02-01 21:23:11 +0000] [16] [INFO] Starting gunicorn 19.9.0
[2019-02-01 21:23:11 +0000] [16] [INFO] Listening at: unix:/tmp/gunicorn.sock (16)
[2019-02-01 21:23:11 +0000] [16] [INFO] Using worker: gevent
[2019-02-01 21:23:11 +0000] [27] [INFO] Booting worker with pid: 27
2019-02-01 21:23:11,819 INFO - container_support.serving - creating Server instance
2019-02-01 21:23:12.104205: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:370] FileSystemStoragePathSource encountered a file-system access error: Could not find base path /opt/ml/model/export/Servo for servable generic_model
2019-02-01 21:23:13.105312: E tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:370] FileSystemStoragePathSource encountered a file-system access error: Could not find base path /opt/ml/model/export/Servo for servable generic_model

This is the sequence of commands I used to build the final Docker image "mymodel_sagemaker_app":
$ git clone https://github.com/aws/sagemaker-tensorflow-container.git
$ cd sagemaker-tensorflow-container
$ python setup.py sdist
$ cp dist/sagemaker_tensorflow_container-1.0.0.tar.gz docker/1.9.0/final/py2

$ cd ~
$ wget https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.9.0-cp27-none-linux_x86_64.whl
$ cd sagemaker-tensorflow-container/docker/1.9.0
$ cp ~/tensorflow-1.9.0-cp27-none-linux_x86_64.whl .

$ docker build -t tensorflow-base:1.9.0-cpu-py2 --build-arg py_version=2 --build-arg framework_installable=tensorflow-1.9.0-cp27-none-linux_x86_64.whl -f Dockerfile.cpu .

$ cd sagemaker_container #(where my app is)
$ touch Dockerfile
$ vim Dockerfile

Dockerfile content:

FROM tensorflow-base:1.9.0-cpu-py2

RUN pip --no-cache-dir install \
    flask==1.0 \
    requests \
    snowballstemmer==1.2.1 \
    keras==2.2.4 \
    joblib==0.12.2

ENV PATH="/opt/program:${PATH}" AWS_DEFAULT_REGION=us-west-2

COPY mymodel_sagemaker_app /opt/program
WORKDIR /opt/program

$ docker build -t mymodel_sagemaker_app:latest .
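The repeated error in the log says that TensorFlow Serving could not find /opt/ml/model/export/Servo inside the container, and the docker run command above mounts nothing at /opt/ml/model. TensorFlow Serving expects numbered version subdirectories under the base path, so a quick check like the sketch below, run inside the container, can show whether the model is where the server looks for it. The function name list_servable_versions is made up for illustration.

```python
import os


def list_servable_versions(base_path):
    """Return the numeric version directories under base_path, lowest first."""
    if not os.path.isdir(base_path):
        return []  # the situation reported in the log: base path is missing
    versions = [
        name
        for name in os.listdir(base_path)
        if name.isdigit() and os.path.isdir(os.path.join(base_path, name))
    ]
    return sorted(versions, key=int)


# Path taken from the error message in the log above.
found = list_servable_versions("/opt/ml/model/export/Servo")
if not found:
    print("No servable versions found under /opt/ml/model/export/Servo")
```

If the list is empty, mounting a local model directory at /opt/ml/model with docker run -v may be what is missing, though that has not been tested against this image.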
