aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.

Home Page: https://sagemaker-examples.readthedocs.io

License: Apache License 2.0

Jupyter Notebook 94.37% HTML 0.04% Python 4.65% R 0.02% Shell 0.18% Dockerfile 0.07% Java 0.03% C 0.01% Roff 0.60% Makefile 0.01% Batchfile 0.01% jq 0.01% JavaScript 0.03% CSS 0.01%
sagemaker aws reinforcement-learning machine-learning deep-learning examples jupyter-notebook mlops data-science training

amazon-sagemaker-examples Issues

ValueError: export_outputs must be a dict error when saving model_to_estimator

Hi, I have been trying to test tf.keras.estimator.model_to_estimator(keras_model=model) and save it in order to set up hosting for the model, as in this example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb

However, I keep receiving this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-75-a22213ebd37e> in <module>()
      1 exported_model = model.export_savedmodel(export_dir_base = 'export/Servo/', 
----> 2                                serving_input_receiver_fn = serving_input_fn)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py in export_savedmodel(self, export_dir_base, serving_input_receiver_fn, assets_extra, as_text, checkpoint_path)
    515           serving_input_receiver.receiver_tensors,
    516           estimator_spec.export_outputs,
--> 517           serving_input_receiver.receiver_tensors_alternatives)
    518 
    519       if not checkpoint_path:

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/export/export.py in build_all_signature_defs(receiver_tensors, export_outputs, receiver_tensors_alternatives)
    191     receiver_tensors = {_SINGLE_RECEIVER_DEFAULT_NAME: receiver_tensors}
    192   if export_outputs is None or not isinstance(export_outputs, dict):
--> 193     raise ValueError('export_outputs must be a dict.')
    194 
    195   signature_def_map = {}

ValueError: export_outputs must be a dict.


My code:

import numpy as np
import os
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import joblib


def featureTransform(features, max_words):
    tokenize = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=False)
    tokenize.fit_on_texts(features) 
    return tokenize.texts_to_matrix(features).astype(np.float32) 

def encodeLabels(labels):
    encoder = LabelEncoder()
    encoder.fit(labels)
    y = encoder.transform(labels)
    num_classes = np.max(y) + 1
    print("num classes: {}".format(num_classes))
    return tf.keras.utils.to_categorical(y, num_classes).astype(np.float32)

def baselineModel(max_words, num_classes):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(500, activation='relu', input_shape=(max_words,), name="features"))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(500, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
    
    return tf.keras.estimator.model_to_estimator(keras_model=model)


def serving_input_fn():
    feature_spec = {'features_input': tf.FixedLenFeature(dtype=tf.float32, shape=[500])}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()


def train_input_fn(training_dir, params):
    """Returns input function that would feed the model during training"""
    return input_function(training_dir, 'assignment_train.csv')

def input_function(training_dir, training_filename, shuffle=False):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename=os.path.join(training_dir, training_filename), target_dtype=np.str, features_dtype=np.float32)
    
    input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"features_input":  np.array(training_set.data)}, 
        y=encodeLabels(training_set.target),
        num_epochs=100,
        shuffle=shuffle
    )
    return input_fn

I run the following:

model = baselineModel(500, 123)
model.train(input_fn=input_function('data','assignment_train.csv', shuffle=True))
score = model.evaluate(input_function('data','assignment_train.csv', shuffle=True), steps = 100)
exported_model = model.export_savedmodel(export_dir_base = 'export/Servo/', 
                               serving_input_receiver_fn = serving_input_fn)

Any insight would be greatly appreciated!
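
For reference, the error comes from TensorFlow's export path expecting EstimatorSpec.export_outputs to be a dict keyed by signature name. Below is a minimal sketch of what that looks like in a hand-written model_fn; it is purely illustrative (model_to_estimator generates its own model_fn), not presented as the fix.

    # Illustrative only: the dict shape that export_savedmodel expects to find in
    # EstimatorSpec.export_outputs (keys become SavedModel signature names).
    import tensorflow as tf

    def model_fn(features, labels, mode):
        logits = tf.layers.dense(features['features_input'], 123)
        predictions = {'probabilities': tf.nn.softmax(logits)}
        export_outputs = {
            'serving_default': tf.estimator.export.PredictOutput(predictions)
        }
        # Training/eval branches (loss, train_op) omitted for brevity.
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions,
                                          export_outputs=export_outputs)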

Forbidden S3 bucket in the example: amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/gluon_recommender_system.ipynb

In the example code "amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/gluon_recommender_system.ipynb", I couldn't copy the data file located at "s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz". Please check the code below in that notebook and replace it with a valid S3 address:

aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz /tmp/recsys/

Trying to deploy pretrained MXNet model for inference only

I have a model that I've trained in MXNet to classify images, and I already have the model assets saved as
model.tar.gz in an s3 bucket.

from sagemaker.mxnet.model import MXNetModel
import sagemaker
import sys
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_model = MXNetModel(model_data='s3://bucket-name/model.tar.gz',
                             entry_point='entry_point.py',  # entry_point.py is an empty .py file since we aren't using it for training
                             role=role)
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

I just want to be able to deploy this within a SageMaker notebook to a host and then call the predictor.predict function on an input image. However, the sagemaker_model.deploy call above fails and yields the following error message:

ValueErrorTraceback (most recent call last)
in ()
----> 1 predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/model.pyc in deploy(self, initial_instance_count, instance_type, endpoint_name)
90 production_variant = sagemaker.production_variant(model_name, instance_type, initial_instance_count)
91 self.endpoint_name = endpoint_name or model_name
---> 92 self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant])
93 if self.predictor_cls:
94 return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in endpoint_from_production_variants(self, name, production_variants, wait)
512 self.sagemaker_client.create_endpoint_config(
513 EndpointConfigName=name, ProductionVariants=production_variants)
--> 514 return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
515
516 def expand_role(self, role):

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in create_endpoint(self, endpoint_name, config_name, wait)
344 self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
345 if wait:
--> 346 self.wait_for_endpoint(endpoint_name)
347 return endpoint_name
348

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in wait_for_endpoint(self, endpoint, poll)
405 if status != 'InService':
406 reason = desc.get('FailureReason', None)
--> 407 raise ValueError('Error hosting endpoint {}: {} Reason: {}'.format(endpoint, status, reason))
408 return desc
409

ValueError: Error hosting endpoint sagemaker-mxnet-py2-cpu-2018-03-22-20-10-57-938: Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.

I believe my attempt at using an empty file for the entry_point.py script is the reason this happened. The problem is that nowhere in the documentation was it clear to me what exactly should go in the entry_point.py script when I only want to perform inference, not training, with this model.

My other question is about what the predictor.predict function actually expects. Do I need to pass it a numpy array? Is there a way to pass a string for the image_url instead and then write a simple image preprocessing script that loads the image, resizes it, etc. on the host before calling the MXNet model.predict function? I'm concerned that OpenCV is not part of the endpoint environment by default.
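
If it helps, here is a minimal sketch of what an inference-only entry point could look like for the SageMaker MXNet container, assuming the first-generation model_fn/transform_fn serving interface; the checkpoint file names and input shape are assumptions to adapt to your artifact:

    # entry_point.py -- inference-only sketch for the SageMaker MXNet container
    # (assumes the legacy model_fn/transform_fn serving hooks).
    import json
    import mxnet as mx

    def model_fn(model_dir):
        # Assumption: model.tar.gz contains model-symbol.json and model-0000.params.
        sym, arg_params, aux_params = mx.model.load_checkpoint('%s/model' % model_dir, 0)
        mod = mx.mod.Module(symbol=sym, label_names=None)
        mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
        mod.set_params(arg_params, aux_params, allow_missing=True)
        return mod

    def transform_fn(model, request_body, content_type, accept):
        # Assumption: the client sends a JSON-encoded array of pixel values.
        data = mx.nd.array(json.loads(request_body))
        model.forward(mx.io.DataBatch([data]))
        output = model.get_outputs()[0].asnumpy().tolist()
        return json.dumps(output), accept

Any preprocessing (fetching an image_url, resizing, and so on) could live inside transform_fn, though whether OpenCV is present in the stock container would need to be verified separately.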

Any help with this would be much appreciated.

Tensorflow container error for evaluation

Hello

I am trying to train a Keras model using SageMaker.

I am able to train my model in a SageMaker notebook, but when I try to execute my scripts locally, I get the following error pointing to a failure in the evaluation step (I get this error immediately after evaluation starts):

ERROR - container_support.training - uncaught exception during training: unsupported operand type(s) for /: 'unicode' and 'float'

From:

2018-04-11 21:31:39,770 ERROR - container_support.training - uncaught exception during training: unsupported operand type(s) for /: 'unicode' and 'float'
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 25, in start
    fw.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 107, in train
    train_wrapper.train()
  File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 118, in train
    hparams=hparams)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 218, in run
    return _execute_schedule(experiment, schedule)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 46, in_execute_schedule
    return task()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 661, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 390, in train
    saving_listeners=self._saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 868, in _call_train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 314, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 815, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 539, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1013, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1104, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1089, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 1196, in after_run
    induce_stop = m.step_end(self._last_step, result)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 356, in step_end
    return self.every_n_step_end(step, output)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 694, in every_n_step_end
    validation_outputs = self._evaluate_estimator()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/monitors.py", line 665, in _evaluate_estimator
    name=self.name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 361, in evaluate
    hooks.extend(self._convert_eval_steps_to_hooks(steps))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 375, in _convert_eval_steps_to_hooks
    return [evaluation._StopAfterNEvalsHook(num_evals=steps)]  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/evaluation.py", line 97, in __init__
    else math.floor(num_evals / 10.))
TypeError: unsupported operand type(s) for /: 'unicode' and 'float'

It seems to have something to do with the "training_steps" argument passed to the TensorFlow() estimator:

estimator = TensorFlow(entry_point='itemembd.py',
                       role=role,
                       training_steps=None,
                       evaluation_steps=100,
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       output_path='s3://ml/artifacts/itemembd')

estimator.fit('s3://ml/data/itemembd', job_name='itememdb-01')

and the "num_epochs" argument in my _input_fn:

def _input_fn(training_dir, training_filename):
    training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
        filename=os.path.join(training_dir, training_filename),
        target_dtype=np.int,
        features_dtype=np.int
    )

    return tf.estimator.inputs.numpy_input_fn(
        x={
            USER_EMBEDDING_TENSOR_NAME: np.array(training_set.data[:, 0]),
            ITEM_EMBEDDING_TENSOR_NAME: np.array(training_set.data[:, 1])
        },
        y=np.array(training_set.data[:, 2]),
        shuffle=True,
        batch_size=64,
        num_epochs=10
    )()

i.e., I am trying to use epochs over the training data instead of a fixed number of gradient updates. It's very strange that this works within the notebook and with local deployment as well. Any clues?

scikit_bring_your_own.ipynb deploy model error

When going through the notebook above on a SageMaker notebook instance, everything worked up to the deploy line:
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)
There I get "ClientError: An error occurred (ValidationException) when calling the CreateModel operation: ECR image ".dkr.ecr.us-east-1.amazonaws.com/decision-trees-sample" is invalid."

I tried via the SageMaker console using my root account and got the same ValidationException error.

Testing the image (pulled from ECR) locally with serve_local.sh/predict_local.sh didn't show any errors.
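
For what it's worth, the empty account prefix in that error (".dkr.ecr.us-east-1...") suggests the account ID never made it into the image URI. A minimal sketch of how the fully qualified URI is typically assembled before CreateModel, with an illustrative repository name:

    # Sketch: build the fully qualified ECR image URI; if the account variable is
    # empty, the result is exactly the invalid ".dkr.ecr.us-east-1..." in the error.
    import boto3

    account = boto3.client('sts').get_caller_identity()['Account']
    region = boto3.session.Session().region_name
    image = '{}.dkr.ecr.{}.amazonaws.com/decision-trees-sample:latest'.format(account, region)
    print(image)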

How is the entry point to the code specified in bring your own code?

I'm trying out the sample notebooks, currently the MXNet MNIST example that demonstrates bringing your own code. The entry_point parameter passed when instantiating an estimator only mentions the source file (mnist.py), not a method name or any other point inside the source file.
So how does SageMaker figure out which method to send the training data to?
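
From what I can tell, the framework containers locate functions by name rather than by a configurable entry method: the container imports the entry-point module and calls a train() function (plus optional hooks such as save()) with arguments matched by keyword. A rough sketch of that shape under the legacy MXNet script interface; the exact signature is worth confirming against mnist.py in the example:

    # Sketch (assumed legacy SageMaker MXNet script interface): the container
    # imports this module and invokes these functions by name.
    def train(hyperparameters, channel_input_dirs, num_gpus, **kwargs):
        # channel_input_dirs['training'] holds the downloaded training data;
        # build and fit the network here and return it.
        pass

    def save(net, model_dir):
        # Optional hook: persist the returned model into model_dir, which
        # SageMaker packages and uploads as model.tar.gz.
        pass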

Model Error - Invoke the endpoint

When I try to invoke the endpoint and start the prediction, I get the error below:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "

<title>500 Internal Server Error</title>

Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

". See https://eu-west-1.console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-all-2018-04-11-12-00-21-560 in account 180856571690 for more information.

How to get the training job name inside train program?

I followed link [1] to build a Docker image for training with my own model. Is there a way to get the current training job name inside the train program? I want to find out the S3 folder where the model artifact is uploaded, because I want to upload the training logs and some other intermediate training data to the same folder. It sounds like sagemaker.Session should have a property or method for this, but I haven't found one yet.

Or is there a way to pass the job name in as a parameter to the train script? After all, the job name is available when the training job is created.

Thanks!

[1] https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
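
One workaround, assuming no job-name field is exposed inside the container by default: choose the job name yourself, pass it both to fit(job_name=...) and as a hyperparameter, then read it back inside the train script from the hyperparameters file SageMaker mounts into every training container. A sketch of the in-container side (the path is the standard SageMaker location; the 'job_name' key is whatever you named the hyperparameter):

    # Inside the container's train script: hyperparameters arrive as a JSON file
    # at a fixed path, so a job name passed as a hyperparameter (and hence the
    # S3 output prefix) can be recovered here.
    import json

    with open('/opt/ml/input/config/hyperparameters.json') as f:
        hyperparameters = json.load(f)

    job_name = hyperparameters.get('job_name')                       # hypothetical key
    log_prefix = 's3://my-bucket/output/{}/logs/'.format(job_name)   # illustrative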

Issues with the default (latest?) version of the framework (as of May 24, 2018)

Re: amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/

  1. The training returns strange verbose logs.
    ...
    2018-05-24 18:39:00.972957: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
    2018-05-24 18:39:01.003414: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
    2018-05-24 18:39:01.012146: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
    2018-05-24 18:39:01.026917: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.

  2. Tensorboard does not seem to be working.
    [Errno 111] Connection refused

These issues disappear if framework_version='1.5' is used:

estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                       source_dir=source_dir,
                       framework_version='1.5',
                       role=role,
                       hyperparameters={'min_eval_frequency': 10},
                       training_steps=1000, evaluation_steps=100,
                       train_instance_count=2, train_instance_type='ml.c4.xlarge',
                       base_job_name='tensorboard-example')

The first issue seems to be a known issue relating to S3 and TensorFlow.

Regardless of this strange issue, the model is properly trained and saved in S3, and the endpoint is also created.

How to kill pending notebook instance?

I am attempting to launch an ml.p2.xlarge notebook instance. However, 30 minutes in and counting, it is still Pending, and there is no way to kill it. What do I do about this? And what is the billing status of a notebook instance while it is in the Pending state?

Failed Reason: The primary container for production variant AllTraffic did not pass the ping health check.

I am trying to deploy a BYOM (bring your own model) Keras model. I pushed the image to ECR with the 'latest' tag. All local testing passed, and I am able to successfully train the model, e.g.:

image = '{}.dkr.ecr.{}.amazonaws.com/my-model:latest'.format(account, region)

dl = sage.estimator.Estimator(image,
                       role, 1, 'ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

However attempting to deploy gives me the error:

Failed Reason:  The primary container for production variant AllTraffic did not pass the ping health check.

I am not quite sure where this stems from, given that the local health check passed. Any insight would be great! Thanks.
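
For reference, the hosting container has to answer HTTP on port 8080: SageMaker issues GET /ping and expects a 200 before routing any /invocations traffic, so a container that trains fine can still fail this check if its serving process never starts or listens on another port. A minimal sketch of the two routes (Flask, as in the scikit_bring_your_own serving stack; handler bodies are illustrative):

    # Minimal serving sketch: SageMaker health-checks GET /ping and sends
    # inference requests to POST /invocations, both on port 8080.
    from flask import Flask, Response, request

    app = Flask(__name__)

    @app.route('/ping', methods=['GET'])
    def ping():
        # Return 200 only once the model is actually loadable/loaded.
        return Response(status=200)

    @app.route('/invocations', methods=['POST'])
    def invocations():
        payload = request.data  # decode, predict, and serialize the result here
        return Response(response=payload, status=200, mimetype='text/csv')

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)

Checking the endpoint's CloudWatch logs usually shows why the serving process exited before it could answer /ping.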

Customer Churn Prediction with XGBoost

When I tried a different CSV data set using XGBoost, I got the following issues:

Arguments: train
[2018-01-10:21:51:56:INFO] Running standalone xgboost training.
[2018-01-10:21:51:56:INFO] File size need to be processed in the node: 38.24mb. Available memory size in the node: 8611.8mb
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py:279: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support sep=None with delim_whitespace=False; you can avoid this warning by specifying engine='python'.
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
/opt/amazon/lib/python2.7/site-packages/sage_xgboost/exceptions.py:19: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
message = getattr(exception, 'message', str(exception))
[2018-01-10:21:52:06:ERROR] Algorithm Error: Could not determine delimiter (caused by Error)

Caused by: Could not determine delimiter
Traceback (most recent call last):
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train.py", line 34, in main
standalone_train(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_methods.py", line 16, in standalone_train
train_job(resource_config, train_config, data_config)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 389, in train_job
dtrain = get_dmatrix(train_path, file_type, exceed_memory)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 317, in get_dmatrix
dmatrix = get_csv_dmatrix(files_path)
File "/opt/amazon/lib/python2.7/site-packages/sage_xgboost/train_helper.py", line 279, in get_csv_dmatrix
df = pd.read_csv(os.path.join(files_path, csv_file), sep=None, header=None)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in init
self._make_engine(self.engine)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
self._engine = klass(self.f, **self.options)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1601, in init
self._make_reader(f)
File "/opt/amazon/lib/python2.7/site-packages/pandas/io/parsers.py", line 1705, in _make_reader
sniffed = csv.Sniffer().sniff(line)
File "/opt/amazon/python2.7/lib/python2.7/csv.py", line 188, in sniff
raise Error, "Could not determine delimiter"
Error: Could not determine delimiter

ValueErrorTraceback (most recent call last)
in ()
16 num_round=100)
17
---> 18 xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in fit(self, inputs, wait, logs, job_name)
152 self.latest_training_job = _TrainingJob.start_new(self, inputs)
153 if wait:
--> 154 self.latest_training_job.wait(logs=logs)
155 else:
156 raise NotImplemented('Asynchronous fit not available')

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/estimator.pyc in wait(self, logs)
321 def wait(self, logs=True):
322 if logs:
--> 323 self.sagemaker_session.logs_for_job(self.job_name, wait=True)
324 else:
325 self.sagemaker_session.wait_for_job(self.job_name)

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in logs_for_job(self, job_name, wait, poll)
656
657 if wait:
--> 658 self._check_job_status(job_name, description)
659 if dot:
660 print()

/home/ec2-user/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/sagemaker/session.pyc in _check_job_status(self, job, desc)
399 if status != 'Completed':
400 reason = desc.get('FailureReason', '(No reason provided)')
--> 401 raise ValueError('Error training {}: {} Reason: {}'.format(job, status, reason))
402
403 def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error training xgboost-2018-01-10-21-46-25-058: Failed Reason: InternalServerError: We encountered an internal error. Please try again.
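
One thing worth checking: the built-in XGBoost container sniffs the CSV delimiter, so the training data should be plain comma-separated files with no header and the label in the first column, and the channel should be declared as CSV. A sketch of how the churn notebook wires that up, with placeholder bucket/prefix names:

    # Sketch: declare the channels as CSV so the container parses them as such.
    # The files themselves should be headerless and comma-delimited, label first.
    import sagemaker

    s3_input_train = sagemaker.s3_input(
        s3_data='s3://my-bucket/xgboost/train', content_type='csv')
    s3_input_validation = sagemaker.s3_input(
        s3_data='s3://my-bucket/xgboost/validation', content_type='csv')

    xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})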

scikit_bring_your_own.ipynb not working as expected

Hello,

Background

I'm trying to learn how to create my own containers and deploy a model with them. I have been following the example scikit_bring_your_own.ipynb to push the Docker image to my ECR repository, train, and upload the model artifact to an S3 bucket. The Dockerfile was not changed, so the Docker image should be the same. The model was successfully trained with the given code.

Error

However, when I'm trying to create the model from the container image and artifact (step "Deploy the model" in the notebook) by running:

from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

I encountered an error which is:
An error occurred (ValidationException) when calling the CreateModel operation: ECR image "xxxxxx(my account id here).dkr.ecr.us-west-2.amazonaws.com/decision-trees-sample" is invalid.

Attempts:

  • I have tested hosting the model locally with the given train.sh, serve.sh and predict.sh files; the model works well locally.
  • I have tried to use the SageMaker console to create the model from ECR; the error is the same.
  • I have tried to use the AWS CLI to create the model; same issue.

Could anyone help with this problem?
Thanks a lot!

Tensorboard not displaying scalars

The notebook example with Tensorboard amazon-sagemaker-examples/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb is not displaying scalars or images. Only the graph and projector are displayed.

If one run is terminated and a new one is started (using the same base_job_name so it starts from the previously saved checkpoint) by running again:
estimator.fit(inputs, run_tensorboard_locally=True)
then the scalars and images of the previous run are displayed on Tensorboard but they are not updated as training continues.

Cannot install rJava

I'm creating a SageMaker notebook using R, and I installed the R Kernel correctly so that I can run R inside a Jupyter notebook. There are a few issues I've noticed with installing packages, but they can usually be circumvented by adding a specific repository source, specifying the dependencies, etc. However, there is one package that doesn't follow these workarounds: rJava.

Can rJava be installed in SageMaker?

Background:

When I try to install the library "RWeka" there is an issue that I can't get around. I've traced the error down to the dependency on the "rJava" package. I've entered the following commands:

install.packages('rJava')
install.packages('rJava', repos = 'https://cran.r-project.org/')
This results in the following error:

“installation of package ‘rJava’ had non-zero exit status”
Updating HTML index of packages in '.Library'
I suspect this is because rJava has system requirements of "Java JDK 1.2 or higher (for JRI/REngine JDK 1.4 or higher), GNU make."

SageMaker's Java JDK is 1.8.0_121, so I'm not sure what the issue is. I've tried installing from the terminal and with multiple variants in R (using the devtools library, etc.). Is rJava not supported?

Using caffe with sagemaker

Hello,

Not really an issue, but more of a request.

I have a Caffe model, and I also have a Caffe Docker container.

I was wondering if there are any plans to support Caffe?

Thanks

seq2seq_translation_en-de - bucket for pretrained model artifacts does not exist

Under the pre-trained model section of seq2seq_translation_en-de.ipynb, the instructions say to curl the model artifacts from "https://s3-us-west-2.amazonaws.com/gsaur-seq2seq-data/seq2seq/eng-german/full-nb-translation-eng-german-p2-16x-2017-11-24-22-25-53/output/"

When attempting to curl these artifacts, you get the following error:

NoSuchBucket: The specified bucket does not exist
BucketName: gsaur-seq2seq-data
RequestId: 689B28A38C6874F0
HostId: DB0wm6RExts0znrV1uBktPdZa7ore4QA2IhlP6F7usDNFaZ7I8DbnYVgRPobziNc7cbQzZ2pGps=

Was this bucket deleted accidentally?

Gluon recommender system example causes kernel to crash

When attempting to run this example, the kernel dies somewhere around net.collectParams(), with output that looks like:

mfblock0_ ( Parameter mfblock0_embedding0_weight (shape=(140344, 64), dtype=<class 'numpy.float32'>) Parameter mfblock0_embedding1_weight (shape=(38385, 64), dtype=<class 'numpy.float32'>) Parameter mfblock0_dense0_weight (shape=(64, 0), dtype=<class 'numpy.float32'>) Parameter mfblock0_dense0_bias (shape=(64,), dtype=<class 'numpy.float32'>) )

The error log says .../tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: unknown error. Any thoughts?

Deploying code from notebook

From scikit_bring_your_own.ipynb

When you are using a framework (such as Apache MXNet or TensorFlow) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework.

What exactly is this referring to? I have a Python script currently living inside an .ipynb in Jupyter on SageMaker. It's a batch script which pulls the training data from DynamoDB and runs ALS training on it using Spark's MLlib. It finishes by writing some results to DynamoDB. Can I go ahead and deploy this as a recurring batch job easily, without messing with Docker containers?

scikit_bring_your_own.ipynb train model pandas error

Hello!

I am following the scikit_bring_your_own tutorial and trying to set up a BYO (bring your own) model for production use, but I am encountering the following issue when trying to train the model on AWS SageMaker.


AlgorithmError: Exception during training: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Traceback (most recent call last):
  File "/opt/program/train", line 48, in train
    raw_data = [ pd.read_csv(file, header=None) for file in input_files ]
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 449, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 818, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1049, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1695, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parser

I uploaded the data to s3 using:

    def upload_data(self):
        self.logger.info(
            'Uploading locally available data to s3 in path: %s, using bucket: %s using s3 directory prefix: %s'
            % (
                self.config.data_directory_path,
                self.config.data_upload_bucket,
                self.config.s3_data_directory_prefix,
            )
        )

        self.train_data_location = self.session.upload_data(
            path=self.config.data_directory_path,
            bucket=self.config.data_upload_bucket,
            key_prefix=self.config.s3_data_directory_prefix
        )

        self.logger.info('Uploaded local data to s3 path: %s' % (self.train_data_location))

I ran the build_and_push.sh script.

Then I tried to train the model using:

    def estimator(self):
        self.logger.info(
            'Creating estimator for %s model %s using image %s' % (
                'BYO',
                self.config.model_name,
                self.image,
            )
        )

        return Estimator(
            image_name=self.image,
            role=self.config.role,
            train_instance_count=self.config.train_instance_count,
            train_instance_type=self.config.train_instance_type,
            output_path=self.config.output_path,
            base_job_name=self.config.base_job_name,
            sagemaker_session=self.session,
        )

(I'm using the same code as in the notebook, just rewritten to use it as a class.)

Am I missing something or doing something wrong?

Missing chmod command in build_and_push.sh in scikit_bring_your_own project

In the build_and_push.sh script in the scikit_bring_your_own project there are a few missing lines (which are present in the corresponding notebook section):

#make the program executable
chmod +x decision_trees/train 
#On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
#to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

How do I perform A/B testing?

I'm trying to figure out how to perform A/B testing using AWS SageMaker. I understand that setting train_instance_count will distribute the training across two instances. But how do I set the percentage of inference calls each model will handle and perform A/B testing?
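
Train the two models however you like; the traffic split is configured on the endpoint rather than at training time. A sketch with boto3 (model and endpoint names are placeholders), weighting variant A at 90% and variant B at 10%:

    # Sketch: an endpoint config with two production variants; InitialVariantWeight
    # controls the fraction of inference calls routed to each model.
    import boto3

    sm = boto3.client('sagemaker')

    sm.create_endpoint_config(
        EndpointConfigName='ab-test-config',
        ProductionVariants=[
            {'VariantName': 'variant-a', 'ModelName': 'model-a',
             'InstanceType': 'ml.m4.xlarge', 'InitialInstanceCount': 1,
             'InitialVariantWeight': 0.9},
            {'VariantName': 'variant-b', 'ModelName': 'model-b',
             'InstanceType': 'ml.m4.xlarge', 'InitialInstanceCount': 1,
             'InitialVariantWeight': 0.1},
        ])

    sm.create_endpoint(EndpointName='ab-test-endpoint',
                       EndpointConfigName='ab-test-config')

Each variant receives weight / sum(weights) of the traffic, so 0.9 and 0.1 give the 90/10 split.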

Invoke the endpoint in Sagemaker AWS via predictor.predict ()

Hello SageMaker Community,

I have a problem when I call predictor.predict(" txt format ") in a SageMaker notebook; I get this error, which depends on the format of the parameter passed in:


ValueError Traceback (most recent call last)
in ()
8
9
---> 10 print(predictor.predict("the password is 15jdgvd "))

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data)
72 """
73 if self.serializer is not None:
---> 74 data = self.serializer(data)
75
76 request_args = {

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/predictor.py in call(self, data)
247 return _json_serialize_from_buffer(data)
248
--> 249 raise ValueError("Unable to handle input format: {}".format(type(data)))
250
251

ValueError: Unable to handle input format: <class 'str'>
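
The traceback shows the client-side serializer rejecting a bare Python str before the request ever leaves the notebook, so one workaround is to hand predict() something the configured serializer knows how to encode. A sketch assuming the endpoint speaks JSON (the payload key is illustrative; adjust to whatever your inference code actually parses):

    # Sketch: wrap the text in a JSON-serializable structure instead of a raw str.
    # Assumes a JSON content type on both the client and the serving side.
    from sagemaker.predictor import json_serializer, json_deserializer

    predictor.serializer = json_serializer
    predictor.deserializer = json_deserializer
    predictor.content_type = 'application/json'

    result = predictor.predict({'input': ['the password is 15jdgvd']})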

How to make parameters/files available to Tensorflow Endpoint Instance

I'm looking to make some hyperparameters or files available to the serving endpoint in SageMaker. The training instance is given access to input parameters via the hyperparameters argument in:

estimator = TensorFlow(entry_point='autocat.py',
                       role=role,
                       output_path=params['output_path'],
                       code_location=params['code_location'],
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       training_steps=10000,
                       evaluation_steps=None,
                       hyperparameters=params)

However, when the endpoint is deployed, there is no way to pass in parameters that are used to control the data processing in the input_fn(serialized_input, content_type) function.

What would be the best way to pass parameters to the serving instance? Is the source_dir parameter defined in the sagemaker.tensorflow.TensorFlow class copied to the serving instance? If so, I could use a config.yml or similar.

The reason I'm asking is that I keep the location of a TFIDF vectorizer in the params dictionary and load it at training time from S3. In the future I'd like to use the same approach to load embeddings at serving time.

What is the recommended method to install Python packages?

Re: amazon-sagemaker-examples/introduction_to_applying_machine_learning/gluon_recommender_system/

import pip
pip.main(['install', 'pandas'])

This method does not work with the newer version of pip, and the p2.xlarge instance does not seem to come with pandas. What is the recommended/alternative method for installing Python packages when the training and inference environments do not include a particular module?

Thank you.
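
With newer pip releases pip.main() was removed, so the usual alternative is to shell out to pip (or bake the dependency into the container or a requirements file). A minimal sketch for use inside a notebook or an entry-point script:

    # Sketch: install a package at runtime without relying on the removed pip.main().
    import subprocess
    import sys

    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pandas'])

    import pandas as pd  # import only after the install succeeds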

xgboost direct marketing example doesn't make sense

Something doesn't seem right about this example.

https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_direct_marketing/xgboost_direct_marketing_sagemaker.ipynb

The notes in the Exploration section indicate, correctly, that 90% of the customers do not subscribe.

The notes in the Evaluation section say ~3,700 customers from the test data set were predicted to subscribe. The test data set has right around 4,000 records (10% of the whole dataset). So the model that was built predicts that the vast majority are subscribers. What's going on here?

Connecting to Tensorboard without using notebook?

Hello.

I am trying to use sagemaker.tensorflow.estimator.TensorFlow to train a new TensorFlow model on SageMaker. I would love to have access to TensorBoard no matter what, but since this will be a job run through Airflow on a machine dedicated to data pipeline jobs, I will not have access to the local TensorBoard link. I am not using notebooks either, so I can't really use the 'connect to' option mentioned in the ResNet CIFAR-10 example:

You can access TensorBoard locally at http://localhost:6006 or using your SageMaker notebook instance proxy/6006/(TensorBoard will not work if forget to put the slash, '/', in end of the url).

Is there a way to connect via proxy for a job instead, or another way if not using notebooks?

I instantiate sagemaker.tensorflow.estimator.TensorFlow in the following way:

        return TensorFlow(
            entry_point=self.entry_point,
            source_dir=self.source_dir,
            role=self.config.role,
            output_path=self.config.output_path,
            code_location=self.config.code_location,
            train_instance_count=self.config.train_instance_count,
            train_instance_type=self.config.train_instance_type,
            training_steps=self.config.training_steps,
            evaluation_steps=self.config.evaluation_steps
        )

I'm calling fit like:

            self.estimator.fit(
                self.config.train_data_location,
                job_name=self.training_job_name,
                run_tensorboard_locally=True,
            )

Current cifar10 example uses ml.p2.xlarge which requires request to AWS support

I was working through the CIFAR10 example here after launching a notebook in AWS and noticed that I can't complete the tutorial without requesting a limit increase.

The current tutorial says:
"If you want to try the example without requesting an increase, just change the train_instance_count value to 1."

I did this but I still hit an AWS limit error:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p2.xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

It looks like the current default limit is now 0? Is there an alternative instance type you would suggest for completing this, or is a support request now required?
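
As a stopgap until a limit increase is granted, the estimator can be pointed at an instance family the account already has quota for (training is just slower on CPU). A sketch, assuming the rest of the notebook stays unchanged and the base_job_name is illustrative:

    # Sketch: swap the GPU instance type for a CPU type the default quota allows.
    from sagemaker.tensorflow import TensorFlow

    estimator = TensorFlow(entry_point='resnet_cifar_10.py',
                           source_dir=source_dir,
                           role=role,
                           training_steps=1000, evaluation_steps=100,
                           train_instance_count=1,
                           train_instance_type='ml.c4.xlarge',   # instead of ml.p2.xlarge
                           base_job_name='cifar10-cpu')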

deploy a new algorithm

Hello
I'm trying to deploy my own algorithm. Should I make a Docker image, or is there a better method? I don't know much about SageMaker, so could anyone please give me the steps for deploying my own algorithm?
Best Regards

Getting error while invoking sagemaker endpoint

I created a training job in SageMaker with my own training and inference code using the MXNet framework. I am able to train the model successfully and have created an endpoint as well. But when running inference against the model, I get the following error:
‘ClientError: An error occurred (413) when calling the InvokeEndpoint operation: HTTP content length exceeded 5246976 bytes.’
What I understood from my research is that the error is due to the size of the image. The image shape is (480, 512, 3), and I trained the model with images of the same shape.

When I resized the image to (240, 256), the error went away, but it produced another error, 'shape inconsistent in convolution', since I trained the model with images of size (480, 512).

I don't understand why I am getting this error during inference. Can't we use larger images when invoking the model?
Any suggestions would be helpful.

Thanks, Harathi
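
The 5246976-byte figure in the error is the InvokeEndpoint request-size limit (about 5 MB), so the raw float array for a 480x512x3 image is simply too large to send as-is. One workaround is to send the compressed image bytes and decode them server-side, which also preserves the original resolution. A sketch (the server half assumes the MXNet container's transform_fn hook; endpoint name, file name, and content type are illustrative):

    # Sketch: stay under the InvokeEndpoint payload limit by sending compressed
    # JPEG bytes and decoding them inside the inference script.
    import json
    import boto3
    import mxnet as mx

    # Client side: send the JPEG bytes, not a raw (480, 512, 3) float array.
    runtime = boto3.client('sagemaker-runtime')
    with open('test_image.jpg', 'rb') as f:
        payload = f.read()
    response = runtime.invoke_endpoint(EndpointName='my-endpoint',
                                       ContentType='application/x-image',
                                       Body=payload)

    # Server side (inference entry point): decode back to full resolution.
    def transform_fn(model, request_body, content_type, accept):
        img = mx.image.imdecode(request_body)                 # HWC uint8, e.g. (480, 512, 3)
        batch = mx.nd.transpose(img.astype('float32'), axes=(2, 0, 1)).expand_dims(axis=0)
        model.forward(mx.io.DataBatch([batch]))
        output = model.get_outputs()[0].asnumpy().tolist()
        return json.dumps(output), accept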

SageMaker returns a 500 error after Installing the R Kernel

As soon as I've run the command to install the R kernel and I refresh the Jupyter dashboard, I get a 500 error.

I've tried through the notebook example in advanced_functionality and also through the terminal. I also tried upgrading conda first, but got the same result.

DeepAR external Regressors

I am working on time series forecasting using LSTMs. My dataset has time series data (sales on each day) plus external regressors like the discount on a particular day, holidays, and day of week. Now that I want to move to DeepAR, how do I incorporate these features into the DeepAR training data (JSON format) and run DeepAR? Using "cat": in the training dataset didn't work. Note: I want to add these because I want to know how the forecast is affected by the discount and holiday-period features.
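
DeepAR's JSON Lines format has a field for exactly this, assuming the version in use already supports it: "dynamic_feat" carries one array per regressor, each aligned with "target" ("cat" stays reserved for categorical grouping of series). A sketch of one training record written from Python, with illustrative values:

    # Sketch: one DeepAR JSON Lines record with external regressors passed via
    # "dynamic_feat" (one inner array per regressor, same length as "target").
    import json

    record = {
        "start": "2018-01-01 00:00:00",
        "target": [120.0, 135.5, 98.0, 110.2],      # daily sales
        "cat": [0],                                  # optional series category
        "dynamic_feat": [
            [0.0, 0.1, 0.0, 0.25],                   # discount on each day
            [0, 0, 1, 0],                            # holiday indicator
            [0, 1, 2, 3],                            # day of week
        ],
    }

    with open('train.json', 'w') as f:
        f.write(json.dumps(record) + '\n')

At inference time the dynamic_feat arrays also have to extend over the forecast horizon, since the model needs the future values of the regressors to condition on.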

Question about MxNet estimator

In the example "mnist_with_gluon_local_mode.ipynb" I notice that we have to include a python file to the MXNet estimator. I was wondering what would happen if the pyton file (in this case mnist.py) is dependant on a few other python files? Is it possible to make these available to the estimator as well?

Can't run TensorFlow graphs without Estimator

I have a unique model paradigm that does not fit the structure of an Estimator, i.e., there is no model_fn() definition that would give me the correct inputs/outputs. In other words, even under the rubric of custom estimators, I cannot use a tf.Estimator and have instead had to code everything using the low-level TF API.

Is there a low-level version of the Python SageMaker API such that I can define my graph and run it, using feed_dict as needed, and all the other low-level TF API features?
