
sagemaker-huggingface-inference-toolkit's Introduction

SageMaker Hugging Face Inference Toolkit


SageMaker Hugging Face Inference Toolkit is an open-source library for serving 🤗 Transformers and Diffusers models on Amazon SageMaker. This library provides default pre-processing, prediction, and post-processing for certain 🤗 Transformers and Diffusers models and tasks. It utilizes the SageMaker Inference Toolkit to start up the model server, which is responsible for handling inference requests.

For Training, see Run training on Amazon SageMaker.

For the Dockerfiles used for building SageMaker Hugging Face Containers, see AWS Deep Learning Containers.

For information on running Hugging Face jobs on Amazon SageMaker, please refer to the 🤗 Transformers documentation.

For notebook examples: SageMaker Notebook Examples.


💻 Getting Started with 🤗 Inference Toolkit

Note: this section still needs to be adjusted; it is currently pseudo code.

Install Amazon SageMaker Python SDK

pip install sagemaker --upgrade

Create an Amazon SageMaker endpoint with a trained model.

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    model_data='s3://my-trained-model/artifacts/model.tar.gz',
    role=role,
)
# deploy model to SageMaker Inference
huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

Create an Amazon SageMaker endpoint with a model from the 🤗 Hub.
Note: this is an experimental feature in which the model is loaded after the endpoint is created. Not all SageMaker features are supported, e.g. multi-model endpoints (MME).

from sagemaker.huggingface import HuggingFaceModel
# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad',
  'HF_TASK':'question-answering'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)
# deploy model to SageMaker Inference
huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")
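
After deployment you can send requests with the predictor returned by deploy(). A minimal sketch, restating the deploy() call above so its return value is captured; the payload matches the question-answering task configured above:

# capture the predictor returned by deploy() so you can send requests to the endpoint
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request payload for the question-answering task
data = {
    "inputs": {
        "question": "What is used for inference?",
        "context": "My Name is Philipp and I live in Nuremberg. This model is used with SageMaker for inference."
    }
}
predictor.predict(data)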

🛠️ Environment variables

The SageMaker Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below.

HF_TASK

The HF_TASK environment variable defines the task for the 🤗 Transformers pipeline that is used. A full list of tasks can be found here.

HF_TASK="question-answering"

HF_MODEL_ID

The HF_MODEL_ID environment variable defines the model id, which is automatically loaded from huggingface.co/models when creating your SageMaker Endpoint. The 🤗 Hub provides more than 10,000 models, all available through this environment variable.

HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english"

HF_MODEL_REVISION

The HF_MODEL_REVISION is an extension to HF_MODEL_ID and allows you to define/pin a revision of the model to make sure you always load the same model on your SageMaker Endpoint.

HF_MODEL_REVISION="03b4d196c19d0a73c7e0322684e97db1ec397613"

HF_API_TOKEN

The HF_API_TOKEN environment variable defines your Hugging Face authorization token. The HF_API_TOKEN is used as an HTTP bearer authorization for remote files, like private models. You can find your token on your settings page.

HF_API_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

HF_TRUST_REMOTE_CODE

The HF_TRUST_REMOTE_CODE environment variable defines whether or not to allow custom models defined on the Hub in their own modeling files. Allowed values are "True" and "False".

HF_TRUST_REMOTE_CODE="True"

HF_OPTIMUM_BATCH_SIZE

The HF_OPTIMUM_BATCH_SIZE environment variable defines the batch size used when compiling the model to Neuron. The default value is 1. It is not required when the model is already converted.

HF_OPTIMUM_BATCH_SIZE="1"

HF_OPTIMUM_SEQUENCE_LENGTH

The HF_OPTIMUM_SEQUENCE_LENGTH environment variable defines the sequence length used when compiling the model to Neuron. There is no default value. It is not required when the model is already converted.

HF_OPTIMUM_SEQUENCE_LENGTH="128"
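
These variables are passed to the container through the env argument of HuggingFaceModel. A minimal sketch combining several of them; the model id, revision, and token values are illustrative placeholders:

from sagemaker.huggingface import HuggingFaceModel

# environment variables read by the inference toolkit inside the container
hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK': 'text-classification',
    'HF_MODEL_REVISION': 'main',  # pin a model revision (illustrative value)
    'HF_API_TOKEN': 'api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX',  # only needed for private models
}

huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)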

🧑🏻‍💻 User defined code/modules

The Hugging Face Inference Toolkit allows users to override the default methods of the HuggingFaceHandlerService. To do so, create a folder named code/ with an inference.py file in it inside your model archive. You can find an example in sagemaker/17_customer_inference_script. For example:

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt 

In this example, pytorch_model.bin is the model file saved from training, inference.py is the custom inference module, and requirements.txt is a requirements file for additional dependencies. The custom module can override the following methods (a minimal sketch follows the list):

  • model_fn(model_dir): overrides the default method for loading the model. The return value model will be used in predict() for predictions. It receives the argument model_dir, the path to your unzipped model.tar.gz.
  • transform_fn(model, data, content_type, accept_type): overrides the default transform function with a custom implementation. Customers using this would have to implement the preprocess, predict and postprocess steps in transform_fn. NOTE: This method can't be combined with input_fn, predict_fn or output_fn mentioned below.
  • input_fn(input_data, content_type): overrides the default method for pre-processing. The return value data will be used in the predict() method for predictions. The inputs are input_data, the raw body of your request, and content_type, the content type from the request header.
  • predict_fn(processed_data, model): overrides the default method for predictions. The return value predictions will be used in the postprocess() method. The input is processed_data, the result of the preprocess() method.
  • output_fn(prediction, accept): overrides the default method for post-processing. The return value result will be the response of your request (e.g. JSON). The inputs are predictions, the result of the predict() method, and accept, the accept type from the HTTP request, e.g. application/json.

🤝 Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.


📜 License

SageMaker Hugging Face Inference Toolkit is licensed under the Apache 2.0 License.


🧑🏻‍💻 Development Environment

Install all test and development packages with

pip3 install -e ".[test,dev]"

Run Model Locally

  1. Manually change MMS_CONFIG_FILE:
wget -O sagemaker-mms.properties https://raw.githubusercontent.com/aws/deep-learning-containers/master/huggingface/build_artifacts/inference/config.properties
  2. Run the container, e.g. for text-to-image:
HF_MODEL_ID="stabilityai/stable-diffusion-xl-base-1.0" HF_TASK="text-to-image" python src/sagemaker_huggingface_inference_toolkit/serving.py
  3. Adjust handler_service.py and comment out the line if content_type in content_types.UTF8_TYPES: (that's needed for SageMaker but cannot be used locally).

  4. Send a request:

curl --request POST \
  --url http://localhost:8080/invocations \
  --header 'Accept: image/png' \
  --header 'Content-Type: application/json' \
  --data '{"inputs": "Camera"}' \
  --output image.png


sagemaker-huggingface-inference-toolkit's Issues

Any plans for inference acceleration support?

I tried

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    accelerator_type="ml.eia2.medium",
    endpoint_name=endpoint_name,
)

And I got a SageMaker error: Unsupported image scope: eia. You may need to upgrade your SDK version (pip install -U sagemaker) for newer image scopes. Supported image scope(s): training, inference.

batch transform can not find custom inference.py

Here is my problem: does the SageMaker Hugging Face Inference Toolkit support batch transform? Since HuggingFaceModel is not available in sagemaker 2.39.0, I created a FrameworkModel loading a trained model saved on S3:

model = FrameworkModel(
    model_data=model_url,
    image_uri=image_uri,
    role=role,
    entry_point='inference.py', 
    name=name,
    vpc_config={
        'Subnets': vpc_subnets,
        'SecurityGroupIds': vpc_sgs        
    },
    source_dir='./source',
    sagemaker_session=Session(boto_session=boto3_session, default_bucket=s3_bucket),
)

transformer = model.transformer(
    tags=tags,
    instance_count=1,
    instance_type='ml.g4dn.xlarge',
    assemble_with='Line',
    output_path=predict_output_path,
    accept='application/jsonlines'
)

transformer.transform(
    data=input_path,
    content_type='application/jsonlines',
    split_type='Line',
    join_source= "Input",
)

Return error message:

2021-07-23 04:04:25,545 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - sagemaker_inference.errors.UnsupportedFormatError: Content type application/jsonlines is not supported by this framework.

However, such a content_type is handled in the inference.py. So I assume that this script is not loaded during the batch transform. Please help.

InternalServerException at runtime

Hi all,

I am trying to run https://huggingface.co/nomic-ai/gpt4all-13b-snoozy on a SageMaker endpoint using the HF inference toolkit. I extended the base DLC container to install a newer version of the transformers library (4.28.0). The endpoint is successfully deployed, however at runtime I get the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/nomic-ai__gpt4all-13b-snoozy with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.llama.modeling_llama.LlamaForCausalLM\u0027\u003e)."
}

Here is the code I'm using:

# Hub Model configuration. https://huggingface.co/models
hub_snoozy = {
	'HF_MODEL_ID':'nomic-ai/gpt4all-13b-snoozy',
	'HF_TASK':'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model_snoozy = HuggingFaceModel(
        image_uri=ecr_image,
	transformers_version='4.28.0',
	pytorch_version='1.13.1',
	py_version='py39',
	env=hub_snoozy,
	role=role, 
)

predictor_snoozy = huggingface_model_snoozy.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.g5.48xlarge', # ec2 instance type,
    endpoint_name='gpt4all-13b-snoozy-text2text-generation',
    container_startup_health_check_timeout=600
)

data = {
"inputs": {
    "question": "What is used for inference?",
    "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

predictor_snoozy.predict(data)

where ecr_image is my custom container based on 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04

sagemaker version 2.155.0

I also tested with a different task (text-generation) with the same outcome.

Am I doing something wrong ?

Thank you !

Where should we store config.json?

Hi,

I'm deploying the following artifact to SM HF Hosting

model.tar.gz
    code/
        inference.py
    squad_tf/
        config.json
        tf_model.h5
    squad_tf_tokenizer/
        special_tokens_map.json
        tokenizer.json
        tokenizer_config.json
        vocab.txt

When I send inferences I get errors in my logs, including file /.sagemaker/mms/models/model/config.json not found. Where should that config.json be?

What if I want to bring more than an inference.py?

Hi,

I see in the docs that I can customize inference with an inference.py script. But what if I need more than that?

  1. In particular, is it possible to bring a directory of scripts?
  2. Is it possible to choose a custom entry point name, or does it have to be named inference.py?

logs not showing up in cloudwatch

When deploying a question-answering model with later versions of Python, transformers and PyTorch, the logs do not show up in CloudWatch. If I deploy the same model with older Python and module versions, the logs can be seen in CloudWatch.

Recreate the issue

Case 1, No logs seen

Code


hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad',
  'HF_TASK':'question-answering'
}

huggingface_model = HuggingFaceModel(
   env=hub,
   role=role,
   transformers_version="4.12.3",
   pytorch_version="1.9.1", 
   py_version="py38",
)
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

data = {
"inputs": {
    "question": "What is used for inference?",
    "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

# request
predictor.predict(data)

logs:


#015Downloading:   0%\|          \| 0.00/2.74k [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 2.74k/2.74k [00:00<00:00, 3.36MB/s]
--
#015Downloading:   0%\|          \| 0.00/451 [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 451/451 [00:00<00:00, 387kB/s]
#015Downloading:   0%\|          \| 0.00/265M [00:00<?, ?B/s]#015Downloading:   3%\|β–Ž         \| 8.39M/265M [00:00<00:03, 83.9MB/s]#015Downloading:   6%\|β–‹         \| 17.2M/265M [00:00<00:02, 86.1MB/s]#015Downloading:  10%\|β–‰         \| 25.8M/265M [00:00<00:03, 79.6MB/s]#015Downloading:  13%\|β–ˆβ–Ž        \| 34.0M/265M [00:00<00:02, 80.7MB/s]#015Downloading:  16%\|β–ˆβ–Œ        \| 42.1M/265M [00:00<00:02, 74.5MB/s]#015Downloading:  19%\|β–ˆβ–Š        \| 49.6M/265M [00:00<00:02, 72.9MB/s]#015Downloading:  21%\|β–ˆβ–ˆβ–       \| 57.0M/265M [00:00<00:03, 68.6MB/s]#015Downloading:  24%\|β–ˆβ–ˆβ–       \| 63.9M/265M [00:00<00:03, 63.6MB/s]#015Downloading:  27%\|β–ˆβ–ˆβ–‹       \| 70.7M/265M [00:00<00:03, 64.9MB/s]#015Downloading:  29%\|β–ˆβ–ˆβ–‰       \| 77.5M/265M [00:01<00:02, 65.5MB/s]#015Downloading:  32%\|β–ˆβ–ˆβ–ˆβ–      \| 84.1M/265M [00:01<00:02, 62.5MB/s]#015Downloading:  34%\|β–ˆβ–ˆβ–ˆβ–      \| 90.4M/265M [00:01<00:02, 61.2MB/s]#015Downloading:  37%\|β–ˆβ–ˆβ–ˆβ–‹      \| 97.4M/265M [00:01<00:02, 63.8MB/s]#015Downloading:  39%\|β–ˆβ–ˆβ–ˆβ–‰      \| 104M/265M [00:01<00:02, 63.0MB/s] #015Downloading:  42%\|β–ˆβ–ˆβ–ˆβ–ˆβ–     \| 110M/265M [00:01<00:02, 60.5MB/s]#015Downloading:  44%\|β–ˆβ–ˆβ–ˆβ–ˆβ–     \| 116M/265M [00:01<00:02, 58.9MB/s]#015Downloading:  46%\|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     \| 123M/265M [00:01<00:02, 60.0MB/s]#015Downloading:  48%\|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     \| 129M/265M [00:01<00:02, 60.1MB/s]#015Downloading:  51%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     \| 135M/265M [00:02<00:02, 60.0MB/s]#015Downloading:  53%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    \| 141M/265M [00:02<00:02, 60.3MB/s]#015Downloading:  55%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ    \| 147M/265M [00:02<00:01, 59.7MB/s]#015Downloading:  58%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    \| 153M/265M [00:02<00:01, 59.8MB/s]#015Downloading:  60%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    \| 159M/265M [00:02<00:01, 58.4MB/s]#015Downloading:  62%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   \| 165M/265M [00:02<00:01, 60.3MB/s]#015Downloading:  65%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   \| 171M/265M [00:02<00:01, 60.1MB/s]#015Downloading:  67%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   \| 177M/265M [00:02<00:01, 59.4MB/s]#015Downloading:  69%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   \| 183M/265M [00:02<00:01, 57.9MB/s]#015Downloading:  71%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   \| 189M/265M [00:02<00:01, 56.6MB/s]#015Downloading:  73%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  \| 195M/265M [00:03<00:01, 56.9MB/s]#015Downloading:  76%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ  \| 202M/265M [00:03<00:01, 61.1MB/s]#015Downloading:  79%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰  \| 210M/265M [00:03<00:00, 66.0MB/s]#015Downloading:  81%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– \| 216M/265M [00:03<00:00, 62.2MB/s]#015Downloading:  84%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– \| 223M/265M [00:03<00:00, 60.2MB/s]#015Downloading:  86%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ \| 229M/265M [00:03<00:00, 58.8MB/s]#015Downloading:  88%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š \| 235M/265M [00:03<00:00, 58.3MB/s]#015Downloading:  91%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ \| 241M/265M [00:03<00:00, 58.9MB/s]#015Downloading:  93%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž\| 247M/265M [00:03<00:00, 57.4MB/s]#015Downloading:  95%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ\| 252M/265M [00:04<00:00, 56.9MB/s]#015Downloading:  97%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹\| 258M/265M [00:04<00:00, 57.9MB/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰\| 264M/265M [00:04<00:00, 58.3MB/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 265M/265M [00:04<00:00, 62.1MB/s]
#015Downloading:   0%\|          \| 0.00/466k [00:00<?, ?B/s]#015Downloading:  18%\|β–ˆβ–Š        \| 86.0k/466k [00:00<00:00, 567kB/s]#015Downloading:  92%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–\| 430k/466k [00:00<00:00, 1.56MB/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 466k/466k [00:00<00:00, 1.52MB/s]
#015Downloading:   0%\|          \| 0.00/28.0 [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 28.0/28.0 [00:00<00:00, 20.7kB/s]
#015Downloading:   0%\|          \| 0.00/232k [00:00<?, ?B/s]#015Downloading:  16%\|β–ˆβ–Œ        \| 36.9k/232k [00:00<00:00, 242kB/s]#015Downloading:  82%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– \| 190k/232k [00:00<00:00, 802kB/s] #015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 232k/232k [00:00<00:00, 757kB/s]
WARNING - Overwriting /.sagemaker/mms/models/distilbert-base-uncased-distilled-squad ...
Warning: MMS is using non-default JVM parameters: -XX:-UseContainerSupport
Model server started.

Case 2, logs visible

Code


hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad',
  'HF_TASK':'question-answering'
}

huggingface_model = HuggingFaceModel(
   env=hub,
   role=role,
   transformers_version="4.6", 
   pytorch_version="1.7",
   py_version="py36",
)
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

data = {
"inputs": {
    "question": "What is used for inference?",
    "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

# request
predictor.predict(data)

logs:



This is an experimental beta features, which allows downloading model from the Hugging Face Hub on start up. It loads the model defined in the env var `HF_MODEL_ID`
--
#015Downloading:   0%\|          \| 0.00/2.74k [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 2.74k/2.74k [00:00<00:00, 2.74MB/s]
#015Downloading:   0%\|          \| 0.00/451 [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 451/451 [00:00<00:00, 473kB/s]
#015Downloading:   0%\|          \| 0.00/265M [00:00<?, ?B/s]#015Downloading:   3%\|β–Ž         \| 8.17M/265M [00:00<00:03, 81.7MB/s]#015Downloading:   6%\|β–Œ         \| 16.5M/265M [00:00<00:03, 82.9MB/s]#015Downloading:   9%\|β–‰         \| 24.8M/265M [00:00<00:03, 75.2MB/s]#015Downloading:  12%\|β–ˆβ–        \| 32.4M/265M [00:00<00:03, 66.9MB/s]#015Downloading:  15%\|β–ˆβ–        \| 39.8M/265M [00:00<00:03, 69.1MB/s]#015Downloading:  18%\|β–ˆβ–Š        \| 46.8M/265M [00:00<00:03, 66.9MB/s]#015Downloading:  20%\|β–ˆβ–ˆ        \| 53.6M/265M [00:00<00:03, 64.6MB/s]#015Downloading:  23%\|β–ˆβ–ˆβ–Ž       \| 60.2M/265M [00:00<00:03, 65.1MB/s]#015Downloading:  25%\|β–ˆβ–ˆβ–Œ       \| 67.3M/265M [00:00<00:02, 66.7MB/s]#015Downloading:  29%\|β–ˆβ–ˆβ–‰       \| 76.5M/265M [00:01<00:02, 74.3MB/s]#015Downloading:  32%\|β–ˆβ–ˆβ–ˆβ–      \| 85.2M/265M [00:01<00:02, 78.0MB/s]#015Downloading:  35%\|β–ˆβ–ˆβ–ˆβ–Œ      \| 93.0M/265M [00:01<00:02, 76.5MB/s]#015Downloading:  39%\|β–ˆβ–ˆβ–ˆβ–Š      \| 103M/265M [00:01<00:01, 82.1MB/s] #015Downloading:  42%\|β–ˆβ–ˆβ–ˆβ–ˆβ–     \| 111M/265M [00:01<00:01, 82.6MB/s]#015Downloading:  45%\|β–ˆβ–ˆβ–ˆβ–ˆβ–     \| 119M/265M [00:01<00:01, 76.1MB/s]#015Downloading:  48%\|β–ˆβ–ˆβ–ˆβ–ˆβ–Š     \| 127M/265M [00:01<00:01, 75.6MB/s]#015Downloading:  51%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     \| 135M/265M [00:01<00:01, 76.9MB/s]#015Downloading:  54%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    \| 143M/265M [00:01<00:01, 71.3MB/s]#015Downloading:  57%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    \| 150M/265M [00:02<00:01, 70.4MB/s]#015Downloading:  59%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰    \| 157M/265M [00:02<00:01, 68.8MB/s]#015Downloading:  62%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   \| 165M/265M [00:02<00:01, 70.2MB/s]#015Downloading:  65%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   \| 172M/265M [00:02<00:01, 68.7MB/s]#015Downloading:  67%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   \| 179M/265M [00:02<00:01, 64.9MB/s]#015Downloading:  70%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   \| 185M/265M [00:02<00:01, 62.4MB/s]#015Downloading:  72%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  \| 191M/265M [00:02<00:01, 61.1MB/s]#015Downloading:  75%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  \| 198M/265M [00:02<00:01, 63.2MB/s]#015Downloading:  78%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š  \| 206M/265M [00:02<00:00, 67.1MB/s]#015Downloading:  81%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  \| 214M/265M [00:03<00:00, 71.8MB/s]#015Downloading:  83%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž \| 222M/265M [00:03<00:00, 72.3MB/s]#015Downloading:  86%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ \| 229M/265M [00:03<00:00, 69.9MB/s]#015Downloading:  89%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ \| 236M/265M [00:03<00:00, 71.1MB/s]#015Downloading:  92%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–\| 245M/265M [00:03<00:00, 76.8MB/s]#015Downloading:  95%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ\| 253M/265M [00:03<00:00, 75.0MB/s]#015Downloading:  98%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š\| 261M/265M [00:03<00:00, 69.5MB/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 265M/265M [00:03<00:00, 70.8MB/s]
#015Downloading:   0%\|          \| 0.00/466k [00:00<?, ?B/s]#015Downloading:   6%\|β–Œ         \| 28.7k/466k [00:00<00:02, 191kB/s]#015Downloading:  45%\|β–ˆβ–ˆβ–ˆβ–ˆβ–     \| 209k/466k [00:00<00:00, 783kB/s] #015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 466k/466k [00:00<00:00, 1.23MB/s]
#015Downloading:   0%\|          \| 0.00/28.0 [00:00<?, ?B/s]#015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 28.0/28.0 [00:00<00:00, 19.5kB/s]
#015Downloading:   0%\|          \| 0.00/232k [00:00<?, ?B/s]#015Downloading:  18%\|β–ˆβ–Š        \| 41.0k/232k [00:00<00:00, 273kB/s]#015Downloading:  88%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š \| 205k/232k [00:00<00:00, 750kB/s] #015Downloading: 100%\|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ\| 232k/232k [00:00<00:00, 762kB/s]
WARNING - Overwriting /.sagemaker/mms/models/distilbert-base-uncased-distilled-squad ...
2022-03-23 14:31:16,829 [INFO ] main com.amazonaws.ml.mms.ModelServer -
MMS Home: /opt/conda/lib/python3.6/site-packages
Current directory: /
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 1
Max heap size: 3234 M
Python executable: /opt/conda/bin/python3.6
Config file: /etc/sagemaker-mms.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8080
Model Store: /.sagemaker/mms/models
Initial Models: ALL
Log dir: /logs
Metrics dir: /logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Preload model: false
Prefer direct buffer: false
2022-03-23 14:31:16,921 [WARN ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-9000-distilbert-base-uncased-d
2022-03-23 14:31:17,049 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - model_service_worker started with args: --sock-type unix --sock-name /home/model-server/tmp/.mms.sock.9000 --handler sagemaker_huggingface_inference_toolkit.handler_service --model-path /.sagemaker/mms/models/distilbert-base-uncased-distilled-squad --model-name distilbert-base-uncased-distilled-squad --preload-model false --tmp-dir /home/model-server/tmp
2022-03-23 14:31:17,051 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.mms.sock.9000
2022-03-23 14:31:17,051 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - [PID] 35
2022-03-23 14:31:17,051 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - MMS worker started.
2022-03-23 14:31:17,051 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Python runtime: 3.6.13
2022-03-23 14:31:17,052 [INFO ] main com.amazonaws.ml.mms.wlm.ModelManager - Model distilbert-base-uncased-distilled-squad loaded.
2022-03-23 14:31:17,057 [INFO ] main com.amazonaws.ml.mms.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-03-23 14:31:17,074 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.mms.sock.9000
2022-03-23 14:31:17,151 [INFO ] main com.amazonaws.ml.mms.ModelServer - Inference API bind to: http://0.0.0.0:8080
Model server started.
2022-03-23 14:31:17,175 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.mms.sock.9000.
2022-03-23 14:31:17,176 [WARN ] pool-2-thread-1 com.amazonaws.ml.mms.metrics.MetricCollector - worker pid is not available yet.
2022-03-23 14:31:18,810 [INFO ] W-9000-distilbert-base-uncased-d-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Model distilbert-base-uncased-distilled-squad loaded io_fd=daebe6fffed0ff6d-00000017-00000001-e136dd8f47306764-937c7c58
2022-03-23 14:31:18,816 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 1563
2022-03-23 14:31:18,818 [WARN ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerLifeCycle - attachIOStreams() threadName=W-distilbert-base-uncased-d-1
2022-03-23 14:31:19,265 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 31
2022-03-23 14:31:24,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:31:29,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:31:34,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:31:39,201 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:31:44,201 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:31:49,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:31:54,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:31:57,949 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Preprocess time - 0.028371810913085938 ms
2022-03-23 14:31:57,949 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Predict time - 1648045917948.6992 ms
2022-03-23 14:31:57,949 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 70
2022-03-23 14:31:57,949 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Postprocess time - 0.07843971252441406 ms
2022-03-23 14:31:57,950 [INFO ] W-9000-distilbert-base-uncased-d ACCESS_LOG - /169.254.178.2:51978 "POST /invocations HTTP/1.1" 200 74
2022-03-23 14:31:59,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:32:04,201 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:32:09,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:32:14,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:32:19,202 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 1
2022-03-23 14:32:24,201 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0
2022-03-23 14:47:01,381 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Preprocess time - 0.029802322387695312 ms
2022-03-23 14:47:01,381 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Predict time - 1648046821381.0054 ms
2022-03-23 14:47:01,381 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 61
2022-03-23 14:47:01,381 [INFO ] W-9000-distilbert-base-uncased-d ACCESS_LOG - /169.254.178.2:51978 "POST /invocations HTTP/1.1" 200 62
2022-03-23 14:47:01,381 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Postprocess time - 0.08940696716308594 ms
2022-03-23 14:47:02,644 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Preprocess time - 0.027418136596679688 ms
2022-03-23 14:47:02,644 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 61
2022-03-23 14:47:02,644 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Predict time - 1648046822644.274 ms
2022-03-23 14:47:02,645 [INFO ] W-9000-distilbert-base-uncased-d ACCESS_LOG - /169.254.178.2:51978 "POST /invocations HTTP/1.1" 200 62
2022-03-23 14:47:02,645 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Postprocess time - 0.08702278137207031 ms
2022-03-23 14:47:03,714 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Preprocess time - 0.02288818359375 ms
2022-03-23 14:47:03,715 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Predict time - 1648046823714.5115 ms
2022-03-23 14:47:03,715 [INFO ] W-9000-distilbert-base-uncased-d com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 61
2022-03-23 14:47:03,715 [INFO ] W-distilbert-base-uncased-d-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Postprocess time - 0.08249282836914062 ms
2022-03-23 14:47:03,715 [INFO ] W-9000-distilbert-base-uncased-d ACCESS_LOG - /169.254.178.2:51978 "POST /invocations HTTP/1.1" 200 61
2022-03-23 14:47:04,201 [INFO ] pool-1-thread-3 ACCESS_LOG - /169.254.178.2:51978 "GET /ping HTTP/1.1" 200 0

Support passing model_kwargs to pipeline

I'm trying to deploy BLIP-2 (specifically Salesforce/blip2-opt-2.7b) to a Sagemaker (SM) endpoint, but coming up against some problems.

We can deploy this model by tar'ing the model artifacts as model.tar.gz and hosting on S3, but creating a ~9GB tar file is time-consuming and leads to slow deployment feedback loops.

Alternatively, the toolkit has experimental support for downloading models from the 🤗 Hub on start-up, which is more time/space efficient.
However, this functionality only supports passing HF_TASK and HF_MODEL_ID as env vars. In order to run inference on this model using the GPUs available on SM (T4/A10), we need to pass additional model_kwargs as:

pipe = pipeline(model="Salesforce/blip2-opt-2.7b", model_kwargs={"load_in_8bit": True})

A potential solution to this would be:
On line 104 of handler_service.py the ability to pass kwargs has not been implemented, but the function get_pipeline allows for kwargs.
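
A sketch of what this could look like, assuming a hypothetical HF_MODEL_KWARGS environment variable that is not part of the toolkit today:

import json
import os

from transformers import pipeline

# hypothetical: JSON-encoded kwargs forwarded to the pipeline,
# e.g. HF_MODEL_KWARGS='{"load_in_8bit": true}'
model_kwargs = json.loads(os.environ.get("HF_MODEL_KWARGS", "{}"))

pipe = pipeline(
    task=os.environ["HF_TASK"],
    model=os.environ["HF_MODEL_ID"],
    model_kwargs=model_kwargs,
)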

FileNotFoundError when providing entry_point

I can successfully deploy a HF model to Sagemaker using the standard inference methods. However, I can't get the custom inference to deploy. My code is the following:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://beaver-models-us/paragraph_classifier.tar.gz",   
    role=role,
    entry_point="inference.py",
    source_dir="./code",
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
)

predictor = huggingface_model.deploy(
    endpoint_name="paragraph-classifier",
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

The directory structure of paragraph_classifier.tar.gz is the following:

.
└── paragraph_classifier.tar.gz/
    ├── code/
    │   └── inference.py
    ├── pytorch_model.bin
    ├── config.json
    ├── tokenizer.json
    └── vocab.txt

When I run this, I get an error on the deploy() command saying that the code directory was not found! Full error stack below:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-15-44909ec8a9da> in <module>
     18     endpoint_name="paragraph-classifier",
     19     initial_instance_count=1, # number of instances
---> 20     instance_type='ml.m5.xlarge' # ec2 instance type
     21 )

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    765                 self._base_name = "-".join((self._base_name, compiled_model_suffix))
    766 
--> 767         self._create_sagemaker_model(instance_type, accelerator_type, tags)
    768         production_variant = sagemaker.production_variant(
    769             self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
    267                 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
    268         """
--> 269         container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
    270 
    271         self._ensure_base_name_if_needed(container_def["Image"])

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/huggingface/model.py in prepare_container_def(self, instance_type, accelerator_type)
    272 
    273         deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 274         self._upload_code(deploy_key_prefix, repack=True)
    275         deploy_env = dict(self.env)
    276         deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
   1144                 repacked_model_uri=repacked_model_data,
   1145                 sagemaker_session=self.sagemaker_session,
-> 1146                 kms_key=self.model_kms_key,
   1147             )
   1148 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
    413 
    414         _create_or_update_code_dir(
--> 415             model_dir, inference_script, source_directory, dependencies, sagemaker_session, tmp
    416         )
    417 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/utils.py in _create_or_update_code_dir(model_dir, inference_script, source_directory, dependencies, sagemaker_session, tmp)
    456         if os.path.exists(code_dir):
    457             shutil.rmtree(code_dir)
--> 458         shutil.copytree(source_directory, code_dir)
    459     else:
    460         if not os.path.exists(code_dir):

~/anaconda3/envs/python3/lib/python3.6/shutil.py in copytree(src, dst, symlinks, ignore, copy_function, ignore_dangling_symlinks)
    313 
    314     """
--> 315     names = os.listdir(src)
    316     if ignore is not None:
    317         ignored_names = ignore(src, names)

FileNotFoundError: [Errno 2] No such file or directory: './code'

What am I doing wrong here? Has anyone successfully deployed with custom inference code?

Thank you!!

More info on settings appearing in the logs?

I see the following in the HF-TF CPU endpoint image logs:

Preload model: false
Prefer direct buffer: false

What is this? Some nice serving settings that we can tune? Is it explained in some doc?

Support for return_all_scores in pipeline

For the text-classification pipeline, the parameter return_all_scores=True is needed to get the scores of all labels. Could this be integrated into the toolkit? (Or, for newer versions, the parameter is top_k=n.)

Allowing additional parameters needed for the pipeline to be passed would be great.

pipe = pipeline("text-classification", model='path/to/mode', tokenizer='path/to/tokenizer', return_all_scores=True)

how to deploy https://huggingface.co/flair/ner-english-ontonotes-large as it doesn't have a config.json

Hi, I was trying to deploy this model on SageMaker and unfortunately it throws an error because it can't find the config.json. I was able to deploy other models using the same template provided, but this one failed because there is no config.json. The model is available for inference in the Hugging Face models API and it's working fine, so please let me know how I can deploy this model onto SageMaker: https://huggingface.co/flair/ner-english-ontonotes-large

How do I use my own config.properties

How can I use my own config.properties with the AWS deep learning inference images (e.g. 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04) built on top of this toolkit?

Weird behavior of custom inference.py script

Hi,

I'm deploying the custom inference.py script below

import os

import tensorflow as tf
from transformers import TFAutoModelForQuestionAnswering, AutoTokenizer




def load_fn(model_dir):
    """this function reads the model from disk"""
    
    print('load_fn dir view:')
    print(os.listdir())
    
    # load model
    model = TFAutoModelForQuestionAnswering.from_pretrained(os.environ['MODEL_FOLDER'])
    
    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(os.environ['TOKENIZER_FOLDER'])
    
    return model, tokenizer




def preprocess_fn(input_data, content_type):
    """this function pre-processes the input.
       payload format is {"inputs": {"question": XX, "text": XX}}"""
    
    print('input_data received:')
    print(input_data)
    return input_data['inputs']['question'], input_data['inputs']['text']




def predict(processed_data)
    """this function runs inference"""

    print('processed_data received: ')
    print(processed_data)
    
    question, text = processed_data

    input_dict = tokenizer(question, text, return_tensors='tf')
    outputs = model(input_dict)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    
    all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
    answer = ' '.join(all_tokens[tf.math.argmax(start_logits, 1)[0] : tf.math.argmax(end_logits, 1)[0]+1])
    
    return answer

There are 2 weirds things happening:

  1. the print statements are not showing up in CloudWatch
  2. the inference errors with W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - mms.service.PredictionException: invalid syntax (inference.py, line 39) : 400

Anything wrong in the code?

References to transformers 4.4 and PyTorch 1.6, but the image version no longer exists on ECR

The instructions on https://github.com/aws/sagemaker-huggingface-inference-toolkit suggest the following:

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.4',
    pytorch_version='1.6',
    env=hub,
    role=role,
    name=hub['HF_MODEL_ID'], 
)

However it comes back with "The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.4-gpu-py36-cu110-ubuntu18.04' does not exist.."

I found this blog that references a different transformer and pytorch version:
https://aws.amazon.com/blogs/machine-learning/announcing-managed-inference-for-hugging-face-models-in-amazon-sagemaker/

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://bucket/model.tar.gz", # S3 path to your trained sagemaker model
   role=<SageMaker Role>, # IAM role with permissions to create an Endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version="py36", # python version of the DLC
)

Could you update this repo to prevent further confusion?

Can't use deploy function

Hi, I'm testing how to deploy a hf model using the example script:

from sagemaker.huggingface import HuggingFaceModel
# Hub Model configuration. https://huggingface.co/models
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad',
  'HF_TASK':'question-answering'
}
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    env=hub,
    role=role,
)
# deploy model to SageMaker Inference
huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

But the following error appears when executing the last line:


TypeError Traceback (most recent call last)
in
16 )
17 # deploy model to SageMaker Inference
---> 18 huggingface_model.deploy(initial_instance_count=1,instance_type="ml.m5.xlarge")

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
709 self._base_name = "-".join((self._base_name, compiled_model_suffix))
710
--> 711 self._create_sagemaker_model(instance_type, accelerator_type, tags)
712 production_variant = sagemaker.production_variant(
713 self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
277 vpc_config=self.vpc_config,
278 enable_network_isolation=enable_network_isolation,
--> 279 tags=tags,
280 )
281

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in create_model(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container, tags)
2651 enable_network_isolation=enable_network_isolation,
2652 primary_container=primary_container,
-> 2653 tags=tags,
2654 )
2655 LOGGER.info("Creating model with name: %s", name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _create_model_request(self, name, role, container_defs, vpc_config, enable_network_isolation, primary_container, tags)
2567 container_defs = primary_container
2568
-> 2569 role = self.expand_role(role)
2570
2571 if isinstance(container_defs, list):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in expand_role(self, role)
3530 str: The corresponding AWS IAM role ARN.
3531 """
-> 3532 if "/" in role:
3533 return role
3534 return self.boto_session.resource("iam").Role(role).arn

TypeError: argument of type 'function' is not iterable


Any help is appreciated, thanks!

pipeline parameters not read by the deployed endpoint

Hi,

I followed the documentation at https://huggingface.co/docs/sagemaker/inference to set up an endpoint hosting a sentiment analysis model. However, when passed inputs longer than 512 tokens, it breaks:

line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "index out of range in self"
}

Setup code:

hub = {
    "HF_MODEL_ID": "distilbert-base-uncased",
    "HF_TASK": "sentiment-analysis",
}
huggingface_model = HuggingFaceModel(
    env=hub,  # configuration for loading model from Hub
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.6.1",  # transformers version used
    pytorch_version="1.7.1",  # pytorch version used
    py_version="py36",  # python version used
)
huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

Invoking endpoint using this data:

data = {
    "parameters": {
        "return_all_scores": True,
        "truncation": "longest_first",
    },
    "inputs": "My Name is Philipp and I live in Nuremberg. "*300,
}

The same setup directly with Pipeline in the transformers API works perfectly fine:

model_name = "distilbert-base-uncased"
pipe = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)
data = {
    "parameters": {
        "return_all_scores": True,
        "truncation": "longest_first",
    },
    "inputs": "My Name is Philipp and I live in Nuremberg. "*300,
}

params = data.pop("parameters", None)
results = pipe(data["inputs"], **params)

where print(results) gives [{'label': 'LABEL_1', 'score': 0.5471687912940979}]

What is possibly wrong here? Could it be the toolkit version loaded by HuggingFaceModel?

I verified that the endpoint runs with the following versions: transformers=4.6.1, tokenizers=0.10.3, sagemaker=2.68.0.

Make `DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY` configurable through environment variable

Currently, DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY points to /.sagemaker/mms/models, which is only 50GB, of which ~27GB are already reserved for system files. This means that customers can only deploy models of up to ~23GB in size.
We should either change this by default to /tmp/sagemaker/mms/models or make it configurable through an environment variable.
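
A sketch of the environment-variable option, assuming a hypothetical HF_HUB_MODEL_EXPORT_DIRECTORY variable:

import os

# hypothetical: allow overriding the export directory, keeping the current path as the fallback
DEFAULT_HF_HUB_MODEL_EXPORT_DIRECTORY = os.environ.get(
    "HF_HUB_MODEL_EXPORT_DIRECTORY", "/.sagemaker/mms/models"
)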

SageMaker HuggingFace - 'ConversationalPipeline expects a Conversation or list of Conversations as an input' - custom inference.py ignored

I've deployed a custom huggingface model based off of microsoft/DialoGPT-small and uploaded it to huggingface. I've then deployed the model to an endpoint using AWS sagemaker with the following inside sagemaker studio (mimicked from this aws blog post):

!pip install "sagemaker" -q --upgrade

import sagemaker

sess = sagemaker.Session()
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

role = sagemaker.get_execution_role()
hub = {
	'HF_MODEL_ID': '[masked model name]',
	'HF_TASK': 'conversational',
}

huggingface_model = sagemaker.huggingface.HuggingFaceModel(
	transformers_version='4.6.1',
	pytorch_version='1.7.1',
	py_version='py36',
	role=role,
        env=hub,
)

predictor = huggingface_model.deploy(
	initial_instance_count=1, # number of instances
	instance_type='ml.m5.xlarge' # ec2 instance type
)

print(predictor.endpoint_name)

Issue:

Huggingface transformer ConversationalPipeline does not take the standard inputs when generating a prediction

Tried invoking it with the following (input data is a copy of the huggingface api documentation for conversational pipelines):

boto3.client('sagemaker-runtime').invoke_endpoint(
    EndpointName='[model endpoint here]',
    Body=json.dumps({
        'inputs': {
            "past_user_inputs": ["Which movie is the best ?"],
            "generated_responses": ["It's Die Hard for sure."],
            "text": "Can you explain why ?",
        }
    }),
    ContentType='application/json'
)

Gives the following error in AWS (as seen in cloudwatch):

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 222, in handle",
    response = self.transform_fn(self.model, input_data, content_type, accept)",
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 181, in transform_fn",
    predictions = self.predict(processed_data, model)",
  File "/opt/conda/lib/python3.6/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 149, in predict",
    prediction = model(inputs)",
  File "/opt/conda/lib/python3.6/site-packages/transformers/pipelines/conversational.py", line 241, in __call__",
    raise ValueError("ConversationalPipeline expects a Conversation or list of Conversations as an input")",
ValueError: ConversationalPipeline expects a Conversation or list of Conversations as an input",
During handling of the above exception, another exception occurred:",
Traceback (most recent call last):",
  File "/opt/conda/lib/python3.6/site-packages/mms/service.py", line 108, in predict",
     ret = self._entry_point(input_batch, self.context)",
   File "/opt/conda/lib/python3.6/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 231, in handle",
     raise PredictionException(str(e), 400)",
 mms.service.PredictionException: ConversationalPipeline expects a Conversation or list of Conversations as an input : 400"

As a workaround for this, I attempted to add custom inference.py script overwriting the default predict_fn and postprocess_fn as follows:

import json
from typing import Dict, Any
from transformers.pipelines import ConversationalPipeline, Conversation

def predict_fn(data: Dict[str, Any], model: ConversationalPipeline) -> Conversation:
    inputs = data['inputs']
    c = Conversation(inputs['text'], past_user_inputs=inputs.get('past_user_inputs', []), generated_responses=inputs.get('generated_responses', []))
    
    prediction = model(c) # in this case, my model object returns a Conversation object with a new generated response appended to the object's generated_responses property
    return prediction


def output_fn(prediction: Conversation, accept: str) -> str:
    return json.dumps({
        'generated_text': prediction.generated_responses[-1],
        'conversation': {
            'past_user_inputs': prediction.past_user_inputs,
            'generated_responses': prediction.generated_responses
        }
    })

I followed the documentation for this library, adding the inference.py inside code/:

|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt 

But still, the error above happens. You can see from the stack trace that the custom functions I've added are ignored and not used by the HuggingFaceHandlerService.

How to dynamically batch to help handle high load?

Hi there,

I'm trying to deploy an endpoint that has bursts of high load. I'd like the endpoint to batch requests so we can increase throughput under high load at the cost of a slight increase in latency under low load.

I found a blog post about how this can be done through torchserve and aws. See the section TorchServe dynamic batching on SageMaker.

I'd like to have dynamic batching in a huggingface container, as I'm told there are optimizations taken for transformer models there.

I can see the param for batch_size in the handler_service.py code, but I'm not sure of the recommended way to adjust this, along with a parameter for max_batch_delay.

Is this something currently available?

I reached out to AWS for support, who suggested I open an issue here for assistance. Do let me know if this is more appropriate for a Q&A forum and point me there.

Thanks so much in advance,
Jamie

SageMaker deployment in local mode fails with KeyError: 'ModelDataUrl'

I'm using the example notebook for deploying a Hub model to SageMaker, and it works fine for deploying to AWS. Now, for development purposes, I'd like to deploy locally, for which I change the code to:

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="local"
)

However, that fails with the following error:

.../venv/lib/python3.8/site-packages/sagemaker/local/entities.py in serve(self)
    576         )
    577         self.container.serve(
--> 578             self.primary_container["ModelDataUrl"], self.primary_container["Environment"]
    579         )
    580 

KeyError: 'ModelDataUrl'

My assumption is that this is because model_data is not provided to the HuggingFaceModel constructor and we rely on env instead. Is there a workaround / fix for this please?

Data format for inference

Hi there,

I'm experimenting with the Dolly model and I'm trying to deploy it in SageMaker. It all works fine, but I'm struggling to run inference: there's something going on with the data format I'm passing, but I cannot figure out what!

import json

import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel


# %% Deploy new model
role = sagemaker.get_execution_role()
hub = {"HF_MODEL_ID": "databricks/dolly-v2-12b", "HF_TASK": "text-generation"}

# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    env=hub,
    role=role,
)

# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type="ml.m5.xlarge",  # ec2 instance type
)

predictor.predict({"inputs": "Once upon a time there "})

results in:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "\u0027gpt_neox\u0027"
}

I've tried using json strings but no luck either.

Any help appreciated!
Cheers.

Update transformers

Hi! I'd like to deploy a SegFormer model, which was introduced in v4.13.0. However, the Sagemaker toolkit only supports transformers up to v4.12.3.

I already have a working custom inference.py script, but it uses an outdated feature extractor because of the wrong version of transformers, thus leading to incorrect predictions.

Is there a workaround, or when do you expect to support the newer versions of transformers?

Why SM HF Hosting shows some SM Profilers-related logs?

Hi,

I see some logs about SM Profiler

W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2021-06-25 10:14:23.533531: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler
W-9000-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2021-06-25 10:14:23.533634: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.

I thought Profiler was a training thing? What is it doing in the inference DLC?

How to view train error metrics for a model

Thanks for the package, I have enjoyed using it greatly but have a question about metrics.

I have trained a model using the Hello World example for token classification.

I can easily calculate and view the metrics generated on the evaluation test set: accuracy, f-score, precision, recall etc. by calling training_job_analytics on the trained model: huggingface_estimator.training_job_analytics.dataframe()

How can I also see the same metrics on training sets (or even training error for each epoch)?

Training code is basically the same as the link with extra parts of the docs added:

from sagemaker.huggingface import HuggingFace

# optionally parse logs for key metrics
# from the docs: https://huggingface.co/docs/sagemaker/train#sagemaker-metrics
metric_definitions = [
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}
]

# hyperparameters, which are passed into the training job
hyperparameters={
    'epochs': 5,
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}

# init the estimator (model not yet trained)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters = hyperparameters,
    metric_definitions=metric_definitions
)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

# does not return metrics on training - only on eval!
huggingface_estimator.training_job_analytics.dataframe()

This is my output (a screenshot of the metrics dataframe shows the same values), as JSON:

[{'timestamp': 0.0, 'metric_name': 'eval_loss', 'value': 2.09915896824428},
 {'timestamp': 0.0,
  'metric_name': 'eval_accuracy',
  'value': 0.48287172011661805},
 {'timestamp': 0.0, 'metric_name': 'eval_f1', 'value': 0.0023274664199115004},
 {'timestamp': 0.0,
  'metric_name': 'eval_precision',
  'value': 0.0012611563074355545},
 {'timestamp': 0.0,
  'metric_name': 'eval_recall',
  'value': 0.02029664324746292},
 {'timestamp': 0.0,
  'metric_name': 'eval_runtime',
  'value': 0.19744285714285711},
 {'timestamp': 0.0,
  'metric_name': 'eval_samples_per_second',
  'value': 51.15857142857143},
 {'timestamp': 0.0, 'metric_name': 'epoch', 'value': 11.857142857142858}]
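
A note on the metric_definitions above: the 'loss' and 'epoch' regexes only produce rows if the training script actually prints those values to the job log. Below is a minimal sketch of Trainer settings that would make that happen, assuming train.py uses the Hugging Face Trainer API and a transformers version that supports logging_strategy (all values are hypothetical):

# Hypothetical excerpt from train.py: log the training loss once per epoch so the
# existing 'loss'/'epoch' regexes can match lines in the CloudWatch job log.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/opt/ml/output/data",      # SM_OUTPUT_DATA_DIR on SageMaker
    num_train_epochs=5,
    per_device_train_batch_size=32,
    logging_strategy="epoch",              # emit {'loss': ..., 'epoch': ...} each epoch
    evaluation_strategy="epoch",           # keep the per-epoch eval metrics as before
)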

How to pass device_id to overridden functions?

I want to select the GPU that my custom model is being loaded onto. I can't find a way to do this, because when we do self.load_fn = load_fn we lose the reference to self inside load_fn.

Support for huggingface/peft

After looking through the code, it currently seems that it is not possible to load adapter models produced by peft. Support for this would be a great addition to the HF DLCs.

inference.py ignored by SM HF

Hi,

I have the following in my model.tar.gz

special_tokens_map.json
vocab.txt
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
code/
code/inference.py

I'm deploying with the following SDK call:

from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


env = {
    'MODEL_FOLDER': unique_dir,
    'HF_TASK':'question-answering'
}
    
# create Hugging Face Model Class
huggingface_model = Model(
    env=env,
    model_data=model_arn,
    role=get_execution_role(),
    image_uri=tf_cpu_uri,
    predictor_cls=Predictor)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.4xlarge")

predictor.serializer = JSONSerializer()

My inference.py is the following (it is incorrect and WIP, but that's not the problem in this issue):

import os

import tensorflow as tf
from transformers import TFAutoModelForQuestionAnswering, AutoTokenizer




def load(model_dir):
    """this function reads the model from disk"""
    
    print('load_fn dir view:')
    print(os.listdir())
    
    # load model
    model = TFAutoModelForQuestionAnswering.from_pretrained(os.environ['MODEL_FOLDER'])
    
    # load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(os.environ['MODEL_FOLDER'])
    
    return model, tokenizer




def predict(processed_data):
    """this function runs inference"""

    print('processed_data received: ')
    print(processed_data)
    
    question, text = processed_data['input']['question'], processed_data['input']['text']

    input_dict = tokenizer(question, text, return_tensors='tf')
    outputs = model(input_dict)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    
    all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
    answer = ' '.join(all_tokens[tf.math.argmax(start_logits, 1)[0] : tf.math.argmax(end_logits, 1)[0]+1])
    
    return answer

But it is ignored by SM Hosting. When I run the following inference, I get the default QA answer template, which is not what my handler returns (it should be only the answer). Also, since my inference.py reads from the wrong model dir, I wouldn't be surprised if it failed instead of returning anything at all.
Why is SM HF ignoring my inference.py?

payload = {"question": "What is my name?", "context": "My name is Clara and I live in Berkeley."}
predictor.predict(payload)
'{"score":0.993812620639801,"start":11,"end":16,"answer":"Clara"}'

Endpoint inference with trained HuggingFaceEstimator fails

Hi,

I .deploy() the model.tar.gz created by two sample notebooks, and both fail. It seems that dependencies are not the same between training and inference. Is this something that could be automated, or documented? I used to think that the config.json would be enough for inference; I don't understand why SM Hosting wants to use the training script (in theory it doesn't need to).

  • PyTorch sample: .deploy() works correctly both on CPU and GPU, but GPU inference fails with No module named sklearn
  • TF sample: .deploy() works correctly both on CPU and GPU, but GPU inference fails with No module named datasets

SageMaker endpoint can't load huggingface tokenizer

I used Amazon SageMaker to train a HuggingFace model. At the end of the training script provided to the estimator, I saved the model into the correct path (SM_MODEL_DIR):

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    ...
    trainer.model.save_pretrained(args.model_dir)

After the model was trained, I deployed it using the deploy method of the HuggingFace estimator. Once the endpoint was successfully created, I tried inference with the returned predictor:

response = self.predictor.predict(
    {"inputs": "I want to know where is my order"}
)

And I received the following client error:

{'code': 400, 'type': 'InternalServerException', 'message': "Can't load tokenizer for '/.sagemaker/mms/models/model'. Make sure that:\n\n- '/.sagemaker/mms/models/model' is a correct model identifier listed on 'https://huggingface.co/models'\n\n- or '/.sagemaker/mms/models/model' is the correct path to a directory containing relevant tokenizer files\n\n"}

The problem seems to be with the path that the endpoint uses to load the model in the from_pretrained method.

Any idea of why the tokenizer cannot be loaded?
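
One likely cause, judging from the snippet above, is that only the model weights are written to SM_MODEL_DIR, so the serving container finds no tokenizer files next to the model. Below is a minimal sketch of the end of the training script (trainer and tokenizer are assumed to be defined earlier in the script, as in the original excerpt):

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    ...
    trainer.model.save_pretrained(args.model_dir)
    tokenizer.save_pretrained(args.model_dir)  # also write the tokenizer files next to the model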

batch transform fails to install

I'm using HuggingFaceModel from the SageMaker inference toolkit to create batch transform jobs for large-scale inference data pipelines. I started by looking at the example and walkthrough here: https://huggingface.co/docs/sagemaker/inference

I regularly run into an issue where about half of the submitted jobs fail at around the 35-minute mark.

Looking at the logs, what I see is:

ValueError("failed to install required packages")

This is the same error across all of the failures.

I run a lot of batch transform jobs on production scale data, and I'm wondering if we're simply running into issues with throttling from pip.

Is there an approach to leverage the HuggingFaceModel container creation, but then reuse the registered model within Sagemaker so that we're not re-creating the container with each new execution?

I have looked around and do not see any examples like this.

This is the general approach we use for creating the batch transform jobs:

from sagemaker.huggingface.model import HuggingFaceModel
hub = {"HF_TASK": "summarization"}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=model_s3_location,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    env=hub,
)

batch_job = huggingface_model.transformer(
    instance_count=instance_count,
    instance_type=instance_type,
    strategy="SingleRecord",
    output_path=output_path,
    assemble_with="Line",
)

formatted_time = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

batch_job.transform(
    data=input_path,
    content_type="application/json",
    split_type="Line",
    logs=False,
    wait=False,
    job_name=f"my-batch-job-batch-{batch}-shard-{shard}-{formatted_time}",
    model_client_config={"InvocationsMaxRetries": 3, "InvocationsTimeoutInSeconds": 600},
)
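
On the question of reusing an already-registered model: one possible pattern is to create the SageMaker Model once and then drive each batch transform job through the low-level Transformer class with that model's name, so the artifact and container are not repackaged per job. This is a sketch under that assumption; the model name, bucket, and paths are hypothetical.

# Minimal sketch: reuse an existing SageMaker Model by name instead of rebuilding it.
from sagemaker.transformer import Transformer

batch_job = Transformer(
    model_name="my-registered-hf-model",           # hypothetical, created once up front
    instance_count=1,
    instance_type="ml.m5.xlarge",
    strategy="SingleRecord",
    assemble_with="Line",
    output_path="s3://my-bucket/batch-output/",    # hypothetical bucket
)

batch_job.transform(
    data="s3://my-bucket/batch-input/data.jsonl",  # hypothetical input
    content_type="application/json",
    split_type="Line",
    wait=False,
)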

Using Huggingface Estimator - exec: "serve": executable file not found in $PATH

I've successfully used the SageMaker (2.69.0), Hugging Face (4.10) and TensorFlow (2.5) libraries to train a model, initiated as follows:

model_name = 'bert-base-cased'
import datetime
ct = datetime.datetime.now() 
current_time = str(ct.now()).replace(":", "-").replace(" ", "-")[:19]
training_job_name=f'finetune-{model_name}-{current_time}'
print( training_job_name )

from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 16,
                 'eval_batch_size' : 32,
                 'model_name': model_name
                 }

huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.10',
                            tensorflow_version='2.5',
                            py_version='py37',
                            hyperparameters = hyperparameters)

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path}, job_name=training_job_name)

The training script starts with the usual argument parsing and logging configuration. It differs from the samples in that it adds some additional layers to the model for fine-tuning. It also freezes all the layers in the base bert-base-cased model (to avoid GPU out-of-memory issues when training):

    base_model = TFAutoModel.from_pretrained(args.model_name)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)
    
    # lock base model layers so only the additional task specific layers are
    # changed. Trying to get past OOM blocker
    for layer in base_model.layers:
        layer.trainable=False
    
    # two input layers, we ensure layer name variables match to dictionary keys in TF dataset
    input_ids = tf.keras.layers.Input(shape=(512,), name='input_ids', dtype='int32')
    mask = tf.keras.layers.Input(shape=(512,), name='attention_mask', dtype='int32')

    # we access the transformer model within our bert object using the bert attribute (eg bert.bert instead of bert)
    embeddings = base_model.bert(input_ids, attention_mask=mask)[1]  # access final activations (already max-pooled) [1]
    # convert bert embeddings into 5 output classes
    x = tf.keras.layers.Dense(1024, activation='relu')(embeddings)
    y = tf.keras.layers.Dense(5, activation='softmax', name='outputs')(x)
    
    model = tf.keras.Model(inputs=[input_ids, mask], outputs=y)

    # define optimizer and loss
    optimizer = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    loss = tf.keras.losses.CategoricalCrossentropy()
    acc = tf.keras.metrics.CategoricalAccuracy('accuracy')
    model.compile(optimizer=optimizer, loss=loss, metrics=[acc])
    
    # Preprocess train dataset
    train_features = {
        "input_ids": train_dataset["input_ids"],
        "attention_mask": train_dataset["attention_mask"]
    }
    tf_train_dataset = tf.data.Dataset.from_tensor_slices((train_features, train_dataset["labels"])).batch(
        args.train_batch_size
    )

    # Preprocess test dataset
    test_features = {
        "input_ids": test_dataset["input_ids"],
        "attention_mask": test_dataset["attention_mask"]
    }
    tf_test_dataset = tf.data.Dataset.from_tensor_slices((test_features, test_dataset["labels"])).batch(
        args.eval_batch_size
    )
    
    # Training
    if args.do_train:

        train_results = model.fit(tf_train_dataset, epochs=args.epochs, batch_size=args.train_batch_size)
        logger.info("*** Train ***")

        output_eval_file = os.path.join(args.output_data_dir, "train_results.txt")

        with open(output_eval_file, "w") as writer:
            logger.info("***** Train results *****")
            logger.info(train_results)
            for key, value in train_results.history.items():
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

    # Evaluation
    if args.do_eval:

        result = model.evaluate(tf_test_dataset, batch_size=args.eval_batch_size, return_dict=True)
        logger.info("*** Evaluate ***")

        output_eval_file = os.path.join(args.output_data_dir, "eval_results.txt")

        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            logger.info(result)
            for key, value in result.items():
                logger.info("  %s = %s", key, value)
                writer.write("%s = %s\n" % (key, value))

    # Save result
    model_dir = f'{args.model_dir}/00000001'
    model.save(model_dir)
    tokenizer.save_pretrained(model_dir)

I can successfully download the trained model from S3, unzip it in a terminal on the SM notebook instance, and then load the model directly with TensorFlow to perform inference operations.

import tensorflow as tf

model = tf.keras.models.load_model("/home/ec2-user/SageMaker/00000001")

So the model seems to have trained correctly and works. However, when I try to deploy it as an SM endpoint:

from sagemaker.estimator import Estimator

# job which is going to be attached to the estimator
old_training_job_name='finetune-bert-base-cased-2021-12-08-20-18-18'

# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)

# get model output s3 from training job
huggingface_estimator_loaded.model_data

This works and displays the expected model_data S3 path, but then:

predictor = huggingface_estimator_loaded.deploy(1,"ml.g4dn.xlarge")

This eventually fails. The Endpoint view in the AWS console says: "The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint."

Looking into the logs for the endpoint, I see the message repeated 100 times or so:
"exec: "serve": executable file not found in $PATH"

Is this a bug in the Hugging Face Docker image? The only help I've seen for this type of error relates to people hand-rolling their own containers incorrectly. Or is there some issue in the way I'm training and saving the model that causes this? I've been able to successfully deploy an endpoint with Hugging Face using the standard samples, so I wonder if this issue is due to the extra layers I added to the model. If that's the case, I'd love to know how to fix it, and this issue would then become a request for more useful logs or error messages to help debug these scenarios.

Thanks!
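
For anyone debugging the same symptom: the "serve" error can appear when the endpoint is started from the training image rather than an inference image, which is what deploying a generic attached Estimator does by default. Below is a sketch (not a verified fix for this particular Keras setup) that reuses the training job's artifact with an inference-oriented model class; versions are illustrative, and the saved TF model format still has to match what the serving container expects.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator_loaded.model_data,  # from Estimator.attach(...) above
    role=role,
    transformers_version="4.10",
    tensorflow_version="2.5",
    py_version="py37",
)
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")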

Serverless Inference

Hello, and thank you for the open-source code. Does this code support serverless inference on AWS SageMaker as described here?

Zero Shot Multi-label text classification

Greetings,

I have developed a script on my computer to do some zero-shot multi-label text classification using xlm-roberta.
I want to reproduce my work on SageMaker using the Hugging Face Inference Toolkit, and I'm having some trouble doing so.

Locally, when I do the classification, I run the following:

classifier = pipeline(model="joeddav/xlm-roberta-large-xnli", task="zero-shot-classification")

predictions = classifier(sequence_to_classify, candidate_labels, multi_label=True)

On SageMaker, I configure the model from the Hub and launch a batch transform job for inference, but I can't seem to find the multi_label parameter in the following:

huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    env=hub,
    role=event['role'],
)

bt_output_key = f"s3://{event['bucket']}/{event['output_prefix']}/{event['execution_id']}"

hf_transformer = huggingface_model.transformer(
    instance_count=event["instance_count"],
    instance_type=event["instance_type"],
    output_path=bt_output_key,
    strategy="SingleRecord",
    max_concurrent_transforms=event["concurrent_transforms"],
)

hf_transformer.transform(
    data=event['input_s3_path'],
    content_type="application/json",
    split_type="Line",
    wait=False
)

I looked in the environment variables list, but I think I'm missing something.
Thank you for your help.
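
For reference, the toolkit's default handler appears to forward a "parameters" object from each request to the underlying pipeline call, so the multi_label flag and candidate labels can travel with the data instead of via an environment variable. Here is a sketch of building a JSON Lines input for the batch transform job above (file name, labels, and sentence are hypothetical):

# Build a JSON Lines input file where every record carries its own pipeline parameters.
import json

records = [
    {
        "inputs": "SageMaker makes it easy to deploy models.",
        "parameters": {
            "candidate_labels": ["cloud", "sports", "cooking"],
            "multi_label": True,
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line for split_type="Line"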

test_decode_csv fails when running locally

When running the tests locally I encounter a failure:

>       assert decoded_data == {"inputs": ["I love you", "I like you"]}
E       AssertionError: assert {'inputs': [{'context': 'My name is Philipp and I live in Nuremberg',\n             'question': 'where do i live?'},\n            {'context': 'Berlin is the capital of Germany',\n             'question': 'where is Berlin?'}]} == {'inputs': ['I love you', 'I like you']}
E         Differing items:
E         {'inputs': [{'context': 'My name is Philipp and I live in Nuremberg', 'question': 'where do i live?'}, {'context': 'Berlin is the capital of Germany', 'question': 'where is Berlin?'}]} != {'inputs': ['I love you', 'I like you']}
E         Full diff:
E           {
E         -  'inputs': ['I love you',
E         -             'I like you'],
E         +  'inputs': [{'context': 'My name is Philipp and I live in Nuremberg',
E         +              'question': 'where do i live?'},
E         +             {'context': 'Berlin is the capital of Germany',
E         +              'question': 'where is Berlin?'}],
E           }

...

[gw0] FAILED tests/unit/test_decoder_encoder.py

This appears to be caused by a typo that enters the wrong input data for the second assertion in the test. I have proposed a PR to fix it here - #38.

Accessing context available to the handler in a custom model_fn

The default load function in the HuggingFaceHandlerService uses the context available on the handler to set the device via self.device, e.g.:

if "HF_TASK" in os.environ:
    hf_pipeline = get_pipeline(task=os.environ["HF_TASK"], model_dir=model_dir, device=self.device)

When we override the load function with a custom model_fn loaded from local storage, that function is not bound to the HuggingFaceHandlerService instance, so it does not have access to useful local context such as self.device. This makes it difficult to set the device for our custom pipeline.

Using types.MethodType to bind the custom model_fn would solve this problem and allow us to access context available to the handler when initializing our pipeline.
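
To illustrate the suggestion, here is a small self-contained sketch; the class below is only a stand-in for HuggingFaceHandlerService, not the real implementation.

import types


class HandlerSketch:
    """Stand-in for HuggingFaceHandlerService (illustrative only)."""

    def __init__(self):
        self.device = 0  # e.g. the GPU id the real handler resolves during initialization

    def load(self, model_dir):
        raise NotImplementedError  # default loading logic would live here


def custom_model_fn(self, model_dir):
    # Because the function is bound with MethodType, it can read handler state.
    print(f"loading {model_dir} on device {self.device}")
    return "model"


handler = HandlerSketch()
# A plain assignment (handler.load = custom_model_fn) would not pass the handler in as `self`;
# binding with types.MethodType keeps the handler instance available inside the override.
handler.load = types.MethodType(custom_model_fn, handler)
handler.load("/opt/ml/model")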

Support for image-classification tasks

I'm attempting to deploy a Huggingface vision transformer model to Sagemaker, and it works great if I get the model directly from the Huggingface Hub. However, if I try to provide my own model with a custom code/inference.py file, I get an error saying Task couldn't be inferenced from BeitForImageClassification. Inference Toolkit can only inference tasks from architectures ending with <display list of architectures here>.

Is it just a matter of adding "ForImageClassification": "image-classification" to the ARCHITECTURES_2_TASK dict in transformers_utils.py? If so I'll happily add a PR.
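
For clarity, the proposed change would presumably amount to a single extra entry along these lines (sketch only; the real dictionary in transformers_utils.py contains more mappings):

# Sketch of the proposed addition; existing entries abridged.
ARCHITECTURES_2_TASK = {
    # ... existing mappings such as "ForQuestionAnswering": "question-answering" ...
    "ForImageClassification": "image-classification",
}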

InternalServerException while deploying HuggingFace model on SageMaker

Background

We are working to deploy the AllenAI Cosmo XL model - https://huggingface.co/allenai/cosmo-xl for inference on SageMaker.

Our Approach

We are using a SageMaker notebook.
Instance: ml.p3.2xlarge
Kernel: conda_pytorch_p39

The code is taken directly from the Hugging Face website here: https://huggingface.co/allenai/cosmo-xl
We select Deploy -> Amazon SageMaker,
select Task=Conversational and Configuration=AWS,
and copy the code into our notebook.

Error

The model is created and deployed.
But when we run predictor.predict for model inference, we run into the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "Could not load model /.sagemaker/mms/models/allenai__cosmo-xl with any of the following classes: (\u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM\u0027\u003e, \u003cclass \u0027transformers.models.auto.modeling_auto.AutoModelForCausalLM\u0027\u003e, \u003cclass \u0027transformers.models.t5.modeling_t5.T5ForConditionalGeneration\u0027\u003e)."
}

Relevant error messages from CloudWatch

2023-03-15T19:53:32,781 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Prediction error
2023-03-15T19:53:32,783 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/sagemaker_huggingface_inference_toolkit/handler_service.py", line 219, in handle
2023-03-15T19:53:32,784 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -   File "/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py", line 549, in pipeline
2023-03-15T19:53:32,785 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}."
2023-03-15T19:53:32,785 [INFO ] W-allenai__cosmo-xl-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Could not load model /.sagemaker/mms/models/allenai__cosmo-xl with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM'>, <class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>)

[Feature Request] Support Japanese language

In some cases, dedicated libraries (e.g. fugashi, ipadic) are required for Japanese tokenizers.
Currently, these libraries are not included in the inference container.
Would it be possible to include these libraries, or to have an option in the transformers installation?

For example, if we could rewrite the Dockerfile like this, we could handle it:
transformers[sentencepiece] → transformers[ja]

Currently, if we deploy from S3, we can work around it with a requirements.txt and an empty inference.py (see the sketch below), but if we deploy from the HF Hub, we don't have a workaround.
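
For completeness, the S3-based workaround mentioned above looks roughly like this inside model.tar.gz (the exact package list is only an example):

config.json
pytorch_model.bin
tokenizer_config.json
...
code/
code/inference.py        (can be left empty)
code/requirements.txt    (listing e.g. fugashi and ipadic for Japanese tokenizers)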

Thanks!

how to load data in inference.py

Hey,

I'm implementing an endpoint that can perform semantic search on pre-computed embeddings. I think everything is working fine, but I am having trouble loading data in the inference.py. How do I do this effectively? (The embeddings and the CSV I want to load are pretty small, so there is no need to use Elasticsearch.)

Do I put the data in the model.tar.gz, or load it separately from S3? Neither seems to work.

import torch
import torch.nn.functional as F
import pandas as pd
import numpy as np

corpus_embeddings = np.load('s3://custommodels/sentence_embeddings.npy')
data = pd.read_csv('s3://custommodels/sentences.csv', index_col=0)

def similarity(embeddings_1, embeddings_2):
    normalized_embeddings_1 = F.normalize(embeddings_1, p=2)
    normalized_embeddings_2 = F.normalize(embeddings_2, p=2)
    return torch.matmul(
        normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1)
    )

def output_fn(prediction, accept):
    prediction = torch.tensor(prediction)
    attention_mask = torch.ones(prediction.shape[1], dtype=torch.uint8)
    mask = attention_mask.unsqueeze(-1).expand(prediction.size()).float()
    masked_embeddings = prediction * mask
    summed = torch.sum(masked_embeddings, 1)
    summed_mask = torch.clamp(mask.sum(1), min=1e-9)
    query_embedding = (summed / summed_mask)
    scores = similarity(torch.tensor(corpus_embeddings, dtype=torch.float32), torch.tensor(query_embedding, dtype=torch.float32))
    maximum_score = torch.max(scores, 0)
    result = data.iloc[int(maximum_score[1])].to_dict()
    result['Ähnlichkeit'] = float(maximum_score[0][0])  # 'Ähnlichkeit' means 'similarity'
    return result

and the error:

---------------------------------------------------------------------------
ModelError                                Traceback (most recent call last)
<ipython-input-53-590fc088c70c> in <module>
----> 1 test = predictor.predict({"inputs": "abcd"})
      2 
      3 
      4 test

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
    159             data, initial_args, target_model, target_variant, inference_id
    160         )
--> 161         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
    162         return self._handle_response(response)
    163 

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    389                     "%s() only accepts keyword arguments." % py_operation_name)
    390             # The "self" in this scope is referring to the BaseClient.
--> 391             return self._make_api_call(operation_name, kwargs)
    392 
    393         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    717             error_code = parsed_response.get("Error", {}).get("Code")
    718             error_class = self.exceptions.from_code(error_code)
--> 719             raise error_class(parsed_response, operation_name)
    720         else:
    721             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "[Errno 2] No such file or directory: \u0027s3://custommodels/sentence_embeddings.npy\u0027"
}
". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/huggingface-pytorch-inference-2021-12-21-16-26-49-230 in account 736551082663 for more information.

ModuleNotFoundError: No module named 'mms'

Hi,

I'm in the process of adding sagemaker_huggingface_inference_toolkit to conda-forge (conda-forge/staged-recipes#16505).

As part of the testing stage I'm importing sagemaker_huggingface_inference_toolkit and also testing the CLI by running serve (I know it's generic, I just want to verify that it works).

This fails with the error: ModuleNotFoundError: No module named 'mms'

From a cursory look at the code, it seems like this CLI is not actually exposed to the end user and is instead only used internally, so it should also only work in that context.
Before I remove that part again, though, I just wanted to confirm whether that's actually the case or if I'm missing something?
Thanks!

Serverless inference using the Sagemaker toolkit

Hey!
I was looking at how you've built this inference toolkit to try and figure out how to couple the Multi-Model-Service package with serverless inference. I've seen that you coded your own start_model_server and your own service handler; I'd be super interested to hear whether any of those changes are related to using serverless inference endpoints. Thank you!

config.json file not found when loading model from AWS S3 bucket.

Ok, so I am trying to deploy this model: https://huggingface.co/sshleifer/distilbart-cnn-12-6 with a custom inference.py to an endpoint on Amazon Web Services. First, however, I am trying to deploy it as is, before I add the custom inference.py file.

I will walk you through the steps I have taken so far, in the hope that you can tell me what I am doing wrong.

  1. Download the model files using git clone https://huggingface.co/sshleifer/distilbart-cnn-12-6

  2. Compress the 5.2 GB model into a model.tar.gz using the command 'tar -czf model.tar.gz distilbart-cnn-12-6'

  3. Upload the model.tar.gz to my S3 bucket

  4. Deploy my model using this script

[screenshot of the deployment script]

Whenever I run this, the endpoint deploys successfully. However, when I try to run a prediction using

predictor.predict({ 'inputs': "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct." })

I get the following error in my AWS logs:

2021-07-22 16:02:37,654 [INFO ] W-model-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: ("You need to define one of the following ['feature-extraction', 'text-classification', 'token-classification', 'question-answering', 'table-question-answering', 'fill-mask', 'summarization', 'translation', 'text2text-generation', 'text-generation', 'zero-shot-classification', 'conversational', 'image-classification'] as env 'TASK'.", 403)

So I went into the source code here and found this section:
[screenshot of the relevant source code]

So for some reason, my config.json file is not being loaded. I think it has something to do with the model directory not being in the right place, and I am kind of lost. Any help would be much appreciated!
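
Two things may be worth checking here, based on the error and the steps above: the archive was created around a top-level distilbart-cnn-12-6/ folder, so config.json is probably not at the root of the extracted model directory (repackaging the tar from inside that folder would address that), and the task can also be set explicitly via HF_TASK instead of relying on config.json. A rough sketch with a hypothetical S3 location and illustrative versions:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/distilbart-cnn-12-6/model.tar.gz",  # hypothetical, repackaged archive
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    env={"HF_TASK": "summarization"},
    role=role,
)
predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")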
