microsoft / deepspeed-mii

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

License: Apache License 2.0

Python 99.54% Shell 0.46%

deep-learning inference pytorch

deepspeed-mii's Issues

Can DeepSpeed-MII also support models/libraries like YOLO / Detectron2 / FCOS / VitDet?

Some generate parameters do not work for query

When using DeepSpeed MII, some parameters do not work when querying the model, even though they work with model.generate or with Hugging Face pipelines. I have also tried these parameters using DeepSpeed inference on its own and found them to work.

The parameters that cause issues for me are num_beams and bad_words_ids, but there may be more.

I have found that do_sample, max_length, min_length, top_k, top_p, temperature, repetition_penalty, and early_stopping do not cause issues, but there may be more.

Error in inf/nan tensors

@mrwyattii

Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: probability tensor contains either `inf`, `nan` or element < 0"
        debug_error_string = "{"created":"@1667201081.916813042","description":"Error received from peer ipv6:[::1]:50956","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Exception calling application: probability tensor contains either `inf`, `nan` or element < 0","grpc_status":2}"

I see this sometimes for BLOOM-176B

mii example FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mii_cache/bert-base-uncased_deployment/score.py'

Python=3.8
Deepspeed-MII=latest version

test example in mii.py file

import mii

# roberta
name = "roberta-base"
mask = "<mask>"
# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")

generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})
print(result.response)
print("time_taken:", result.time_taken)

python mii.py

Error code

(ds1) [root@6301babb8dc8a1eeb0ed2044 DeepSpeed-MII (main)]# python mii.py
Querying bert-base-uncased...
Traceback (most recent call last):
  File "mii.py", line 11, in <module>
    generator = mii.mii_query_handle(name + "_deployment")
  File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/server_client.py", line 34, in mii_query_handle
    configs = mii.utils.import_score_file(deployment_name).configs
  File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/utils.py", line 147, in import_score_file
    spec.loader.exec_module(score)
  File "<frozen importlib._bootstrap_external>", line 839, in exec_module
  File "<frozen importlib._bootstrap_external>", line 975, in get_code
  File "<frozen importlib._bootstrap_external>", line 1032, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mii_cache/bert-base-uncased_deployment/score.py'

Why is this error occurring?
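One possible cause, offered here as an assumption rather than a confirmed diagnosis, is that no deployment named bert-base-uncased_deployment was created on this machine, so the score file that mii_query_handle tries to import was never generated. The sketch below only checks for that file before querying. The helper names are hypothetical; the path layout is taken from the traceback above.

```python
import os

# Default cache location, as it appears in the traceback above.
MII_CACHE_ROOT = "/tmp/mii_cache"


def score_file_path(deployment_name, cache_root=MII_CACHE_ROOT):
    """Return the score.py path that the traceback shows being imported."""
    return os.path.join(cache_root, deployment_name, "score.py")


def deployment_exists(deployment_name, cache_root=MII_CACHE_ROOT):
    """True if the generated score file is present on disk."""
    return os.path.isfile(score_file_path(deployment_name, cache_root))


name = "bert-base-uncased_deployment"
if not deployment_exists(name):
    print(f"No score file at {score_file_path(name)}; deploy the model first.")
```

If the check fails, running the matching deployment script on the same machine before querying is the first thing to try.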

No text is shown when using MII in fp32 and greedy search

When using greedy search (do_sample=False) and dtype=fp32 the generated tokens are not shown in the output of the query. I believe the text generation is happening, because different values for max_new_tokens lead to different runtimes for the query. See this notebook as a minimal example.

possibly related to #101
1 T4, GPU memory 16GB
deepspeed-mii version 0.0.3
transformers version 4.24.0
Amazon Linux 2

Support for Albert and Swin/ViT

Just curious whether there is a plan to support Albert and Swin/ViT. Currently I am playing with a model for multimodal learning which involves language models like Albert and visual transformers like Swin and ViT. If there is no immediate plan for this due to limited bandwidth, I am wondering whether there are any documents to guide adding support for new or customized models, so I could help?

AML deployment error due to missing az cli arguments

When trying to run the AML example, e.g. bloom aml, it tries to run get_acr_name() but fails because it's missing the resource group name argument. Is there a way to pass in user arguments such as the resource group, subscription, etc.? It would also be nice to expose more arguments for the AML online endpoints, such as auth_mode; e.g. we aren't allowed to use keys, only aml_tokens, in production environments. I can also imagine other deployment attributes/arguments being useful, such as instance_count or type.

[2022-12-08 10:53:37,253] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
ERROR: the following arguments are required: --resource-group/-g, --name/-n

Examples from AI knowledge base:
https://aka.ms/cli_ref
Read more about the command in reference docs

 ------------------------------ 

Unable to obtain ACR name from Azure-CLI. Please verify that you:
        - Have Azure-CLI installed (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
        - Are logged in to an active account on Azure-CLI ($az login)
        - Have Azure-CLI ML plugin installed ($az extension add --name ml)

 ------------------------------ 

Traceback (most recent call last):
  File "/mnt/c/Users/davidaponte/Documents/CS677-DeepLearning/deeplearning/deeplearning/deep_learning/text_to_image/deepspeed_mii/bloom560m-aml.py", line 7, in <module>
    mii.deploy(task='text-generation',
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 112, in deploy
    _deploy_aml(deployment_name=deployment_name, model_name=model, version=version)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 124, in _deploy_aml
    acr_name = mii.aml_related.utils.get_acr_name()
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 31, in get_acr_name
    raise (e)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 13, in get_acr_name
    acr_name = subprocess.check_output(
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.

Setup:
deepspeed==0.7.6
deepspeed-mii==0.0.4
py3.9.0
Ubuntu 20.04.4 LTS (Focal Fossa)
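The CLI error above lists --resource-group/-g and --name/-n as required. A minimal sketch of how the same az ml workspace show call could take those two values from the user is shown below. The function names and their parameters are hypothetical and not part of MII; actually running the command needs the Azure CLI, the ml extension, and an active login.

```python
import subprocess


def build_acr_query_cmd(resource_group, workspace_name):
    """Build the `az ml workspace show` call with the two required flags."""
    return [
        "az", "ml", "workspace", "show",
        "--resource-group", resource_group,
        "--name", workspace_name,
        "--query", "container_registry",
    ]


def query_container_registry(resource_group, workspace_name):
    """Run the command and return its raw output (needs a logged-in az CLI)."""
    cmd = build_acr_query_cmd(resource_group, workspace_name)
    return subprocess.check_output(cmd, text=True).strip()


# Placeholder values; substitute a real resource group and workspace.
print(build_acr_query_cmd("my-rg", "my-workspace"))
```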

CUDA OOM when loading large models

I'm trying out deepspeed-mii on a local machine (8 GPU with 23GB VRAM each). Smaller models like bloom-560m and EleutherAI/gpt-neo-2.7B worked well. However, I got CUDA OOM errors when loading larger models, like bloom-7b1. For some even larger models like EleutherAI/gpt-neox-20b, the server just crashed without any specific error messages or logs.

I've tried deepspeed inference before, and it worked fine on these models.

I use this script to deploy models

import mii

mii_configs = {"tensor_parallel": 8, "dtype": "fp16"}
mii.deploy(task='text-generation',
           model="facebook/opt-6.7b",
           deployment_name="facebook/opt-6.7b",
           model_path="/home/ubuntu/.cache/huggingface/hub",
           mii_config=mii_configs)

Is there something I should change to my deployment script?

Thanks!

Support multiple nodes deployment?

As per the subject: if I have to deploy one model across more than one machine, what kind of configuration can I use?

OPT in TP or PP mode

Is there a way to run inference on OPT models in TensorParallel or PipelineParallel mode?

As I understand:

BLOOM uses llm provider which loads the model weights as meta tensors first and then assigns devices during checkpoint loading in ds-inference.
OPT uses hf provider with 🤗 pipeline and directly loads checkpoint weights on a specific device.

However, only MP is supported on the 🤗 side (using accelerate). Is there a way to run OPT inference with the llm provider?

Is there a way to communicate with the individual models in an MII deployment?

Taking Stable Diffusion as an example, is it possible to communicate with CLIP, the VAE, and the UNet separately?

Support for Fairseq Translation Model

Hi, does DeepSpeed-MII support fairseq's translation models, such as transformer.wmt16.en-de or transformer.wmt19.en-de? No translation task is listed in the Supported Models and Tasks section.

Clean up and enhance MII config

Add pydantic support for our config so that mistyped configs error out instead of silently ignoring like they do in deepspeed
Bake in default config values so that if someone passes a config of {"tensor_parallel": 4} it will pick up the default port number without them needing to specify it.

example use case:

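import mii

name = "bert-base-uncased"  # example model name
# Only tensor_parallel is set; the port number should fall back to its default.
config = {"tensor_parallel": 4}
mii.deploy('fill-mask',
           name,
           mii.DeploymentType.LOCAL,
           deployment_name=name + "_deployment",
           local_model_path=".cache/models/" + name,
           mii_configs=config,
           enable_deepspeed=True)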

add unit tests around configuration files, e.g., error out on typos and incorrect types
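The behaviour requested above (defaults baked in, typos and wrong types rejected) can be sketched with a standard-library dataclass. This deliberately swaps in dataclasses for pydantic so the example runs without extra dependencies. The class name, field set, and default values are illustrative assumptions, not MII's actual config.

```python
from dataclasses import dataclass, fields


@dataclass
class MIIConfigSketch:
    """Hypothetical config; field names and defaults are illustrative."""
    tensor_parallel: int = 1
    port_number: int = 50050
    dtype: str = "fp16"

    def __post_init__(self):
        # Reject values whose type does not match the annotation.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )


# A partial config picks up the remaining defaults.
config = MIIConfigSketch(**{"tensor_parallel": 4})
print(config)

# A mistyped key is rejected instead of being silently ignored.
try:
    MIIConfigSketch(**{"tensor_paralel": 4})
    typo_error = None
except TypeError as exc:
    typo_error = str(exc)
print("typo rejected:", typo_error)

# A value of the wrong type is rejected as well.
try:
    MIIConfigSketch(tensor_parallel="4")
    type_error = None
except TypeError as exc:
    type_error = str(exc)
print("wrong type rejected:", type_error)
```

The two try blocks are the kind of cases the unit tests mentioned above would cover.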

Unable to run mii-sd.py for txt2img benchmark

https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/benchmark/txt2img

I've run into some strange protobuf-related errors. When I first ran into this, I was able to resolve it by changing my protobuf version to >=3.20.0, but now that doesn't work anymore.

My hunch is that it's related to how I am installing things. I wasn't sure what the correct way was to install deepspeed and deepspeed-mii, so I have been using the following:
pip install deepspeed[sd] deepspeed-mii

I am now seeing this error when trying to run mii-sd.py:

Setup:
deepspeed==0.7.6
deepspeed-mii==0.0.4
diffusers==0.9.0
Ubuntu 18.04.6 LTS
py3.8.15

Support for FLAN-T5

I saw that T5 wasn't in the list of supported huggingface transformers models. Are there plans / ETA for when the T5 family would be added? FLAN-T5 is a very strong llm for zero/fewshot instruction prompting. I am currently building out a hacky implementation for hosting with deepspeed-inference, but having it natively supported in deepspeed-mii would be ideal.

8x40GB GPU instance failing with bloom-deepspeed-inference-int8

int8 doesn't work. (DeepSpeed-MII) only has fp16 support for now.
Any idea on when int8 will be supported?

TXT2IMAGE - RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix`

Hello!
Thanks for this great optimization,
We're using a fresh ec2 G5XL instance,

After installing everything and running python baseline-sd.py
I see the following error:

    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I've installed the environment using: pip install deepspeed[sd] deepspeed-mii

when running ds_report I see the following output:

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

Multi-gpu inference, the query gets stuck when using my own provider

I modified this file with my own defined provider. No problems with single cards, but query gets stuck with multiple cards. Here is the file I mainly modified.
"DeepSpeed-MII/mii/models/providers/huggingface.py"

Now it's stuck

Here is my deepspeed version

Graceful teardown

Currently there's no way to tear down a local or Azure deployment gracefully. We currently just pkill python, which is clearly not a clean solution.
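As a rough illustration of what a graceful teardown could look like for a locally launched server process, the sketch below asks the process to exit, waits, and only then escalates to a kill. It uses a sleeping Python child as a stand-in for the server. stop_process is a hypothetical helper, and a launcher that spawns its own children may need process-group handling that is not shown here.

```python
import subprocess
import sys


def stop_process(process, timeout=5.0):
    """Ask the process to exit, escalating to kill after `timeout` seconds."""
    if process.poll() is not None:
        return process.returncode  # already exited
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()


# Stand-in for a server process; a real deployment would keep its own handle.
server = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(server)
print("server stopped with exit code", exit_code)
```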

Errors running Zero-Inference text generation example

Hey,

I'm trying to run the example provided for text generation with Zero-Inference, and having trouble getting predictions without running into errors.

When I try to deploy the exact same model and config, I first get a validation error for the aio configuration.

Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 70, in <module>
    main()
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 56, in main
    inference_pipeline = load_models(task_name=args.task_name,
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/models/load_models.py", line 87, in load_models
    ds_config = DeepSpeedConfig(ds_config_path)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 811, in __init__
    self._initialize_params(copy.copy(self._param_dict))
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 830, in _initialize_params
    self.zero_config = get_zero_config(param_dict)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/zero/config.py", line 66, in get_zero_config
    return DeepSpeedZeroConfig(**zero_config_dict)
  File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config_utils.py", line 54, in __init__
    super().__init__(**data)
  File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DeepSpeedZeroConfig
aio
  extra fields not permitted (type=value_error.extra)

If I remove the aio config, the server starts successfully, but when I try to create a generator and query it (just like in your sample), I get another error for the generator.query() call:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_5004/1158427391.py in <cell line: 1>()
----> 1 result = generator.query({'query': ["DeepSpeed is the", "Seattle is"]})

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/server_client.py in query(self, request_dict, **query_kwargs)
    357         else:
    358             assert self.initialize_grpc_client, "grpc client has not been setup when this model was created"
--> 359             response = self.asyncio_loop.run_until_complete(
    360                 self._query_in_tensor_parallel(request_dict,
    361                                                query_kwargs))

~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/base_events.py in run_until_complete(self, future)
    590         """
    591         self._check_closed()
--> 592         self._check_running()
    593 
    594         new_task = not futures.isfuture(future)

~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/base_events.py in _check_running(self)
    550     def _check_running(self):
    551         if self.is_running():
--> 552             raise RuntimeError('This event loop is already running')
    553         if events._get_running_loop() is not None:
    554             raise RuntimeError(

RuntimeError: This event loop is already running

Any help is greatly appreciated.

RuntimeError: This event loop is already running

Hi all, really intrigued by this project, love the idea of democratising the use of large models! I've been playing around and encountered a few bugs/unexpected behaviours, so I will raise some issues. Happy to help and provide constructive feedback where I can :)

When running the example provided at https://github.com/microsoft/deepspeed-mii#deploying-mii-public I receive the following error: RuntimeError: This event loop is already running. See this notebook for a minimal example to reproduce the error.

I found a workaround using nest_asyncio.apply(), see this notebook. Nevertheless this strikes me as a bug (or at least unintended behaviour).

Possibly related to #87 , although this example here is not using ZeRO
Tested on a vanilla g4dn EC2 instance, no processes running
1 T4, GPU memory 16GB
deepspeed-mii version 0.0.3
transformers version 4.24.0
Amazon Linux 2
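The error itself can be reproduced without MII: calling run_until_complete on a loop that is already running (as is the case inside a Jupyter cell) raises it. The sketch below shows the failure and one alternative to nest_asyncio, namely driving the coroutine on a fresh loop in a worker thread. fake_request is a hypothetical stand-in for the gRPC call, and the assumption that the client reuses the notebook's running loop is inferred from the traceback, not confirmed.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fake_request():
    """Hypothetical stand-in for the async gRPC call."""
    await asyncio.sleep(0)
    return "response"


def blocking_query(loop):
    """Mimics a sync wrapper that drives a coroutine on the given loop."""
    coro = fake_request()
    try:
        return loop.run_until_complete(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


def query_in_thread():
    """Run the blocking wrapper on a fresh loop inside a worker thread."""
    def worker():
        loop = asyncio.new_event_loop()
        try:
            return blocking_query(loop)
        finally:
            loop.close()

    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(worker).result()


async def notebook_cell():
    """Simulates code running where a loop is already active."""
    running = asyncio.get_running_loop()
    try:
        blocking_query(running)
        error = None
    except RuntimeError as exc:
        error = str(exc)
    return error, query_in_thread()


error, result = asyncio.run(notebook_cell())
print("direct call failed with:", error)
print("threaded call returned:", result)
```

The worker thread has no running loop of its own, so the blocking call succeeds there; the trade-off is that the calling loop is blocked while it waits.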

[BUG] number of dims don't match in permute when inferencing with bert model type

When serving models of "bert" type, the following error showed up.

To reproduce, checkout #19 and start a local server with "fill-mask-example.py" using "bert-base-uncased", and query the server with the following client code

import os
import grpc
import mii

# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")

generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})

Issue with default mii_cache location

The default mii_cache location is hardcoded as /tmp/mii_cache, and we run into issues in a cluster environment when multiple users submit jobs and write to that directory. It may be better to make the default cache location respect an environment variable set on the system.

DeepSpeed-MII/mii/constants.py

Line 98 in 81aca4d

MII_CACHE_PATH_DEFAULT = "/tmp/mii_cache"
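A minimal sketch of the suggestion, assuming an environment variable takes precedence over the hardcoded default. The variable name MII_CACHE_PATH and the helper resolve_cache_path are hypothetical, chosen only for illustration.

```python
import os

# Value of the constant quoted above.
MII_CACHE_PATH_DEFAULT = "/tmp/mii_cache"


def resolve_cache_path(env=None):
    """Return the cache directory, preferring an environment override.

    MII_CACHE_PATH is a hypothetical variable name used for illustration.
    """
    env = os.environ if env is None else env
    override = env.get("MII_CACHE_PATH")
    return override if override else MII_CACHE_PATH_DEFAULT


print(resolve_cache_path({}))
print(resolve_cache_path({"MII_CACHE_PATH": "/scratch/alice/mii_cache"}))
```

On a shared cluster, each user could then point the variable at a directory they own.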

Example "text2img-example.py" not working

When running text2img-example.py I encounter the following error message :

raise ValueError(f"model must be a torch.nn.Module, got {type(self.module)}"

It's raised from

mii.deploy(task='text-to-image',
           model="CompVis/stable-diffusion-v1-4",
           deployment_name="sd_deploy",
           mii_config=mii_configs)

Is "CompVis/stable-diffusion-v1-4" still handled?

Installed packages

asyncio==3.4.3
certifi @ file:///croot/certifi_1665076670883/work/certifi
charset-normalizer==2.1.1
deepspeed==0.7.3
deepspeed-mii==0.0.2
diffusers==0.6.0
filelock==3.8.0
grpcio==1.50.0
grpcio-tools==1.50.0
hjson==3.1.0
huggingface-hub==0.10.1
idna==3.4
importlib-metadata==5.0.0
ninja==1.10.2.4
numpy==1.23.4
packaging==21.3
Pillow==9.2.0
protobuf==4.21.9
psutil==5.9.3
py-cpuinfo==9.0.0
pydantic==1.10.2
pyparsing==3.0.9
PyYAML==6.0
regex==2022.9.13
requests==2.28.1
six==1.16.0
tokenizers==0.12.1
torch==1.13.0+cu116
torchaudio==0.13.0+cu116
torchvision==0.14.0+cu116
tqdm==4.64.1
transformers==4.21.2
typing_extensions==4.4.0
urllib3==1.26.12
zipp==3.10.0

Passing `stopping_criteria` to DeepSpeed MII

Hi, would it be possible to pass in a stopping_criteria inside .generate()?

https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate

...
mii_generator = mii.mii_query_handle('name')
mii_generator.query({"query": ['hello']}, stopping_criteria=[])

Currently we get an error (can't pass a list of objects through grpc):

~/venv/lib/python3.7/site-packages/mii/utils.py in kwarg_dict_to_proto(kwarg_dict)
    176         return proto_value
    177
--> 178     return {k: get_proto_value(v) for k, v in kwarg_dict.items()}
    179
    180

~/venv/lib/python3.7/site-packages/mii/utils.py in <dictcomp>(.0)
    176         return proto_value
    177
--> 178     return {k: get_proto_value(v) for k, v in kwarg_dict.items()}
    179
    180

~/venv/lib/python3.7/site-packages/mii/utils.py in get_proto_value(value)
    173     def get_proto_value(value):
    174         proto_value = mii.grpc_related.proto.modelresponse_pb2.Value()
--> 175         setattr(proto_value, dtype_proto_field[type(value)], value)
    176         return proto_value
    177

KeyError: <class 'list'>

Use case is, for a text-generation task, I'd like to stop at a newline / custom token.
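Until list-valued arguments can be sent through the query call, one client-side workaround is to truncate the returned text at the first stop sequence. This is post-processing only: the server still generates the full sequence, so it saves no compute. truncate_at_stop is a hypothetical helper, not part of MII.

```python
def truncate_at_stop(text, stop_sequences, prompt=""):
    """Cut generated text at the earliest stop sequence after the prompt."""
    # Skip over the prompt so stop sequences inside it are not matched.
    start = len(prompt) if text.startswith(prompt) else 0
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match everywhere
        index = text.find(stop, start)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = "hello world\nsecond line<END>tail"
print(repr(truncate_at_stop(generated, ["\n"])))
print(repr(truncate_at_stop(generated, ["<END>"])))
```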

feature request : Docker image for deepspeed-mii

Motivation :

As a developer, I want to be able to test deepspeed-mii easily.
However, even when using conda (or another Python package manager, e.g. pypenv), I still encounter errors (with protobuf, for example).

Solution :

Fastest one : Provide a Dockerfile that the developer/user could build to use and test deepspeed-mii.
What would be amazing : On each deepspeed-mii change, a CI job builds the Docker image and uploads/updates it on Docker Hub.

This shouldn't take long to do and would be great to have 🙂

Question : How to query a remote DeepSpeed server?

I don't see any parameter allowing the user to specify a remote DeepSpeed server to target.
Is there any option for that?

If yes :

If no :

How could we do it manually?
Do you intend to implement such a feature in the near future?

Second question : Is there any option to manually load/unload a model at query time?

Stable Diffusion | Multi-Pipeline Support (i.e. img2img)

Hi, thank you for the incredible work done here.

Curious as to if img2img and inpainting are planned for release via MII for Stable Diffusion? Happy to potentially help add those features.

It would be ideal to be able to pass in any class that inherits from diffusers.pipeline_utils.DiffusionPipeline, and then just allow the passed kwargs to handle the various inputs. Doing this would allow both img2img, inpainting, and any other community pipelines that exist out there to take advantage of mii.

OOM Error when deploying BLOOM-3B on 16GB GPU via MII

When deploying the bigscience/bloom-3b (in fp32) via MII on a T4 GPU I receive a CUDA out of memory error, see this notebook. When deploying the same model (also in fp32) via the standard HF Pipeline API, it works, see this notebook.

My expectation would be that it should be possible to deploy the same model via MII if I can deploy it via HF Pipelines. If this is not possible then it'd be good to explain why and set expectations with users.

1 T4, GPU memory 16GB
deepspeed-mii version 0.0.3
transformers version 4.24.0
Amazon Linux 2

FileNotFoundError: [Errno 2] No such file or directory: 'deepspeed'

Hello,

When running the following code I get the FileNotFoundError Error.

Any idea why this happens? I follow the usual install through conda (pytorch+cuda) and pip install .

mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="gpt2",
           deployment_name="gpt2_deployment",
           mii_config=mii_configs)

[2022-08-25 12:41:19,489] [INFO] [deployment.py:74:deploy] *************DeepSpeed Optimizations: True*************
[2022-08-25 12:41:19,524] [INFO] [server_client.py:206:_initialize_service] multi-gpu deepspeed launch: ['deepspeed', '--num_gpus', '1', '--no_local_rank', '--no_python', '/mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'gpt2', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogImZwMTYiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGx9']

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [2], in <cell line: 2>()
      1 mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
----> 2 mii.deploy(task="text-generation",
      3            model="gpt2",
      4            deployment_name="gpt2_deployment",
      5            mii_config=mii_configs)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/deployment.py:94, in deploy(task, model, deployment_name, deployment_type, model_path, enable_deepspeed, enable_zero, ds_config, mii_config)
     92     print(f"Score file created at {generated_score_path(deployment_name)}")
     93 elif deployment_type == DeploymentType.LOCAL:
---> 94     return _deploy_local(deployment_name, model_path=model_path)
     95 else:
     96     raise Exception(f"Unknown deployment type: {deployment_type}")

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/deployment.py:100, in _deploy_local(deployment_name, model_path)
     99 def _deploy_local(deployment_name, model_path):
--> 100     mii.utils.import_score_file(deployment_name).init()

File /tmp/mii_cache/gpt2_deployment/score.py:29, in init()
     26 assert task is not None, "The task name should be set before calling init"
     28 global model
---> 29 model = mii.MIIServerClient(task,
     30                             model_name,
     31                             model_path,
     32                             ds_optimize=configs[mii.constants.ENABLE_DEEPSPEED_KEY],
     33                             ds_zero=configs[mii.constants.ENABLE_DEEPSPEED_ZERO_KEY],
     34                             ds_config=configs[mii.constants.DEEPSPEED_CONFIG_KEY],
     35                             mii_configs=configs[mii.constants.MII_CONFIGS_KEY],
     36                             use_grpc_server=use_grpc_server,
     37                             initialize_grpc_client=initialize_grpc_client)

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/server_client.py:83, in MIIServerClient.__init__(self, task_name, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs, initialize_service, initialize_grpc_client, use_grpc_server)
     80     self.model = None
     82 if self.initialize_service:
---> 83     self.process = self._initialize_service(model_name,
     84                                             model_path,
     85                                             ds_optimize,
     86                                             ds_zero,
     87                                             ds_config,
     88                                             mii_configs)
     89     if self.use_grpc_server:
     90         self._wait_until_server_is_live()

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/server_client.py:209, in MIIServerClient._initialize_service(self, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs)
    207     mii_env = os.environ.copy()
    208     mii_env["TRANSFORMERS_CACHE"] = model_path
--> 209     process = subprocess.Popen(cmd, env=mii_env)
    210 return process

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/subprocess.py:951, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask)
    947         if self.text_mode:
    948             self.stderr = io.TextIOWrapper(self.stderr,
    949                     encoding=encoding, errors=errors)
--> 951     self._execute_child(args, executable, preexec_fn, close_fds,
    952                         pass_fds, cwd, env,
    953                         startupinfo, creationflags, shell,
    954                         p2cread, p2cwrite,
    955                         c2pread, c2pwrite,
    956                         errread, errwrite,
    957                         restore_signals,
    958                         gid, gids, uid, umask,
    959                         start_new_session)
    960 except:
    961     # Cleanup if the child failed starting.
    962     for f in filter(None, (self.stdin, self.stdout, self.stderr)):

File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/subprocess.py:1821, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
   1819     if errno_num != 0:
   1820         err_msg = os.strerror(errno_num)
-> 1821     raise child_exception_type(errno_num, err_msg, err_filename)
   1822 raise child_exception_type(err_msg)

FileNotFoundError: [Errno 2] No such file or directory: 'deepspeed'
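The launch command in the log starts with the bare name deepspeed, so this error means the subprocess could not find that executable on PATH. A plausible cause, stated as an assumption, is that the notebook kernel was started from an environment where the deepspeed launcher script is not on PATH. The sketch below checks for the launcher up front; require_executable is a hypothetical helper.

```python
import shutil


def require_executable(name):
    """Return the full path of `name` on PATH, or raise a descriptive error."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"'{name}' was not found on PATH; is the environment that "
            f"installed it activated for this process?"
        )
    return path


launcher = shutil.which("deepspeed")
print("deepspeed launcher:", launcher or "not found on PATH")
```

Running this in the same kernel that calls mii.deploy shows whether the launcher is visible to that process.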

DS-MII License query

Hi @mrwyattii ,
I am trying to create an inference solution for large models that has support for various frameworks like DS-inference, DS-ZeRO and standard HF codebase.

Is it fine, if I extend some of the classes in MII like MIIServerClient and borrow some pieces of code from the proto files?
This is the relevant PR: huggingface/transformers-bloom-inference#25

Add release tag 0.05

I noticed version.txt is at 0.05 but there is no release tag for 0.05 and PyPI is at 0.04. This change was made over a month ago. Perhaps there was meant to be a release tag but for some reason, it was forgotten?

Is model split in OPT TP mode?

I tried HF OPT-13b on a 4-GPU machine with tensor-parallel: 4. One observation is that all GPUs used the same amount of memory (~25G), which is consistent with other users' reports. I also found that the memory usage is the same as with tensor-parallel: 2. So my question is whether the model is actually split after it is loaded into CPU memory, as said in this thread? My understanding is that per-GPU memory should be a quarter of the full model with tensor-parallel: 4 and a half with tensor-parallel: 2.

By the way, I also didn't see any real latency reduction when increasing the tensor-parallel degree (the latency differs by only 2 or 3 ms).

Multiple people querying MII issue

I have an issue: if multiple people query a model deployed via MII, I run into an "event loop is already running" error.

@jeffra
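
The error message itself can be reproduced with the standard library alone. The sketch below assumes the client drives a single shared asyncio loop with `run_until_complete` (tracebacks elsewhere on this page show `self.asyncio_loop.run_until_complete(` inside `query`); `fake_rpc` and `handler` are hypothetical stand-ins, not MII code.

```python
import asyncio


async def fake_rpc():
    """Hypothetical stand-in for an async gRPC call."""
    await asyncio.sleep(0)
    return "ok"


async def handler(loop):
    """Simulates a second query arriving while the loop is already running."""
    coro = fake_rpc()
    try:
        # Re-entering a running loop is rejected by asyncio.
        loop.run_until_complete(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


loop = asyncio.new_event_loop()
message = loop.run_until_complete(handler(loop))
loop.close()
print(message)
```

One possible workaround on the caller's side, until the client handles this itself, is to serialize queries (for example with a lock or a request queue) so that only one `run_until_complete` call is active at a time.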

Custom model configs

Allow users to pass a dictionary or a transformers.PretrainedConfig when deploying models.

'DSUNet' object has no attribute 'config'

I'm unable to get the example script working from here:
https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/txt2img-example.py

When I run it without arguments, it loads the model and deploys fine.
But running it with --query produces this:

ERROR:grpc._server:Exception calling application: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/_server.py", line 443, in _call_behavior
    response_or_iterator = behavior(argument, context)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/grpc_related/modelresponse_server.py", line 77, in Txt2ImgReply
    response = self.inference_pipeline(request, **query_kwargs)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 504, in __call__
    height = height or self.unet.config.sample_size * self.vae_scale_factor
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
  File "deploy.py", line 52, in <module>
    result = generator.query({
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 367, in query
    response = self.asyncio_loop.run_until_complete(
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 263, in _query_in_tensor_parallel
    await responses[0]
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 313, in _request_async_response
    response = await self.stubs[stub_id].Txt2ImgReply(req)
  File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: 'DSUNet' object has no attribute 'config'"
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {created_time:"2022-11-28T15:19:01.639187607-08:00", grpc_status:2, grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'"}"

Versions:

deepspeed                     0.7.5
deepspeed-mii                 0.0.3
transformers                  4.24.0
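
The traceback shows the pipeline reading `self.unet.config.sample_size` while the replaced `DSUNet` module exposes no `config`. The toy sketch below illustrates that failure mode and one possible way a wrapper could forward unknown attributes. All class names and the `sample_size` value are hypothetical; this is not the DeepSpeed implementation or its actual fix.

```python
class ToyUNet:
    """Stand-in for the original UNet, which carries a `config` attribute."""

    def __init__(self):
        self.config = {"sample_size": 64}  # hypothetical value


class PlainWrapper:
    """Wrapper that stores the module but does not re-export its attributes."""

    def __init__(self, inner):
        self.inner = inner


class ForwardingWrapper(PlainWrapper):
    """Same wrapper, but unknown attributes fall through to the wrapped object."""

    def __getattr__(self, name):
        # __getattr__ runs only after normal attribute lookup has failed.
        inner = self.__dict__.get("inner")
        if inner is None:
            raise AttributeError(name)
        return getattr(inner, name)


try:
    PlainWrapper(ToyUNet()).config
    plain_result = "ok"
except AttributeError as exc:
    plain_result = f"AttributeError: {exc}"

forwarded = ForwardingWrapper(ToyUNet()).config["sample_size"]
print(plain_result)
print(forwarded)
```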

Deactivate quantization?

Hello,

I've been playing around with the SD image generation. I am seeing the 1.8x speedup (which is awesome), but I've also noticed a small drop in quality. How would I go about deactivating quantization to see whether that's the reason for the drop?
Thanks

Change number of max_tokens from 1024

I am currently able to deploy, query, and shut down a model using the provided scripts.

However, unlike using DeepSpeed inference on its own, I am not able to figure out how to change the number of max generated tokens from 1024 to a different value.

I believe this is currently not supported, but I could be mistaken.

I believe the issue can be found with the code here:

DeepSpeed-MII/mii/models/load_models.py

Lines 73 to 80 in 79b56af

    
    engine = deepspeed.init_inference(getattr(inference_pipeline,
                                              "model",
                                              inference_pipeline),
                                      mp_size=world_size,
                                      dtype=mii_config.torch_dtype(),
                                      replace_method='auto',
                                      enable_cuda_graph=mii_config.enable_cuda_graph,
                                      **ds_kwargs)

A value called max_tokens needs to be passed as an argument.

If I am correct, this should be a fairly simple fix. I may create a PR for it if I can resolve it.

Memory issue when loading OPT

Recently I have been trying to run OPT models on MII but came across some memory issues. The OPT model I used is facebook/opt-13b. The mii_config and deployment parameters are as follows:

mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 4,
}

name = "facebook/opt-13b"

mii.deploy(task='text-generation',
           model=name,
           deployment_name=name + "_deployment",
           model_path='/root/ckpt/opt_13b/mii',
           mii_config=mii_configs)

The checkpoint is already downloaded into model_path. Since the opt-13b checkpoint is around 26 GB, I assumed it would work on a machine with 4 x V100 and 224 GB of memory. But during loading (even before the server started), MII reported that the server had crashed and exited quietly. I then checked the memory usage and was surprised to find that MII had used up all 224 GB of memory. So my question is: why does MII consume several times more memory than the checkpoint size? Is there any configuration to change this behavior?
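
A rough back-of-the-envelope estimate may help frame the question. The sketch below assumes roughly 13 billion parameters and, as a loud assumption, that each of the 4 tensor-parallel ranks loads a full fp32 copy of the weights into host memory before any splitting happens. Whether MII actually does this is exactly what the issue is asking, so treat the result as a hypothesis only.

```python
GB = 1e9  # decimal gigabytes


def weights_gb(num_params, bytes_per_param):
    """Size of the raw weights in GB for a given parameter width."""
    return num_params * bytes_per_param / GB


num_params = 13e9  # approximate parameter count of opt-13b
ranks = 4          # "tensor_parallel": 4 from the config above

fp16_copy = weights_gb(num_params, 2)  # matches the ~26 GB checkpoint size
fp32_copy = weights_gb(num_params, 4)  # "dtype": "fp32" doubles the footprint

# ASSUMPTION: every rank holds its own full fp32 copy in host memory.
host_total = ranks * fp32_copy

print(f"one fp16 copy: {fp16_copy:.0f} GB")
print(f"one fp32 copy: {fp32_copy:.0f} GB")
print(f"{ranks} full fp32 copies: {host_total:.0f} GB")
```

Under that assumption the total comes close to the 224 GB of host memory reported above, before counting any other overhead.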

Support for int8 inference

Any plans to support int-8 inference any time soon?

Stop Sequence

Hi Deepspeed-MII team,

I was wondering if there is a way to implement a stop sequence or stop token in ds-mii to stop generation early.

In the current implementation, the model mostly generates max_new_tokens number of tokens. In huggingface transformers, it's possible to implement custom stopping criteria but I did not find this option here.

I tried setting the eos_token_id to the desired stop token but somehow the model keeps generating even after producing the stop token.

Cheers, V
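
Until a server-side option exists, one client-side workaround is to truncate the returned text at the first stop sequence. The sketch below is plain Python and makes no MII calls; `truncate_at_stop` is a hypothetical helper. Note that it only trims the output and does not stop generation early, so it saves no compute.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty stop strings
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


generated = "Answer: 42\nQuestion: what is next?\nAnswer: 43"
print(truncate_at_stop(generated, ["\nQuestion:"]))
```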

Error when trying MII with dtype=fp32 with sampling

When running the example from https://github.com/microsoft/deepspeed-mii#deploying-mii-public in fp32, I receive an AioRpcError. See this notebook for a minimal example to reproduce the error.

1 T4, GPU memory 16GB
deepspeed-mii version 0.0.3
transformers version 4.24.0
Amazon Linux 2

New microsoft/bloom-deepspeed-inference-fp16 weights not working with DeepSpeed MII

The new microsoft/bloom-deepspeed-inference-fp16 and microsoft/bloom-deepspeed-inference-int8 weights are not working with DeepSpeed MII.

@jeffra @RezaYazdaniAminabadi

Traceback (most recent call last):
  File "scripts/bloom-inference-server/server.py", line 83, in <module>
    model = DSInferenceGRPCServer(args)
  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/grpc_server.py", line 36, in __init__
    mii.deploy(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/deployment.py", line 70, in deploy
    mii.utils.check_if_task_and_model_is_valid(task, model)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/utils.py", line 108, in check_if_task_and_model_is_valid
    assert (
AssertionError: text-generation only supports [.....]

The list of models doesn't contain the new weights.

Socket timeouts in MII

@mrwyattii seeing this a lot lately:

Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928505909","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928504405","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3477' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3472' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3473' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
    response = await self.stubs[stub_id].GeneratorReply(req)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
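
`StatusCode.UNAVAILABLE` with "failed to connect to all addresses" means the client could not reach the server socket at all. A quick way to separate "server process is gone" from "server is up but slow" is to probe the port directly. The sketch below uses only the standard library; the host and port are assumptions (50050 appears in another traceback on this page) and should be replaced with the deployment's actual values.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # ASSUMPTION: the MII server listens on localhost:50050.
    print(port_is_open("localhost", 50050))
```

If the probe fails while the deployment is supposed to be running, the server process has most likely exited and its own logs are the next place to look.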

"error: cuda_runtime_api.h: No such file or directory"

Hello, I'm trying to run the basic example. For reference, I have several LLMs working and have used Huggingface Hub to download them. However, I get the error in the title. Indeed, this file is not found in:
/home/user/.local/lib/python3.10/site-packages/torch/include/c10/
I did find it here:
/usr/local/cuda-11.7/targets/x86_64-linux/include/cuda_runtime_api.h

I had a challenging time getting my NVIDIA driver to work with the right CUDA version during the torch install. The current PyTorch version is 1.12.1+cu116. You can see version 11.7 in the path above. I'm not sure how relevant that is, but this is the only combination of CUDA and torch versions I could get working. I think c10 denotes the default version of torch installed with Python 3.10 on Ubuntu 22.04, which is supported by this quote from SE:

"PyTorch doesn't use the system's CUDA library. When you install PyTorch using the precompiled binaries using either pip or conda it is shipped with a copy of the specified version of the CUDA library which is installed locally."

The output does say:
Installed CUDA version 11.7 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination Using /home/user/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...

Do I need to set some environment variables and/or install another version of PyTorch in a virtualenv? I'm a little short on space, so I'm hoping not. It seems there is some conflict between the default PyTorch c10 locations and the discovered 11.6/11.7 version of CUDA.

Quick side note: the models were downloaded to /tmp/mii_models. Is it possible to use the standard Huggingface model locations?
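
Because the header exists under the system CUDA install but not under the torch include directory, one thing worth checking is whether the extension build can see the CUDA toolkit. PyTorch's extension builder consults the `CUDA_HOME` environment variable, so the sketch below looks for the header under `CUDA_HOME` and a couple of candidate roots. The candidate list and the `find_cuda_header` helper are illustrative assumptions, not part of MII or DeepSpeed.

```python
import os

HEADER = "cuda_runtime_api.h"


def find_cuda_header(roots):
    """Return the first existing path to HEADER under any root, else None."""
    relative_dirs = (
        "include",
        os.path.join("targets", "x86_64-linux", "include"),
    )
    for root in roots:
        if not root:
            continue  # skip an unset CUDA_HOME
        for rel in relative_dirs:
            candidate = os.path.join(root, rel, HEADER)
            if os.path.isfile(candidate):
                return candidate
    return None


# ASSUMPTION: these roots are guesses based on the path quoted in the issue.
roots = [os.environ.get("CUDA_HOME"), "/usr/local/cuda-11.7", "/usr/local/cuda"]
print(find_cuda_header(roots))
```

If the header is only found under a root that `CUDA_HOME` does not point to, exporting `CUDA_HOME` to that root before the first deployment may be worth trying.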

Running bigscience/bloom-350m example returns AssertionError

When I run the following example from the readme:

import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="bigscience/bloom-350m",
           deployment_name="bloom350m_deployment",
           mii_config=mii_configs)

It returns the following:

/usr/local/lib/python3.7/dist-packages/mii/utils.py in check_if_task_and_model_is_valid(task, model_name)
    108     assert (
    109         model_name in valid_task_models
--> 110     ), f"{task_name} only supports {valid_task_models}"
    111 
    112 

AssertionError: text-generation only supports....

I suspect this error is related to a change in model weights.

Can you point me in the right direction?

And also, thanks for this amazing repo! Can't wait to use it 👍 💯

Currently https://huggingface.co/Salesforce/codegen-16B-multi and its smaller variants are not supported.
It seems to be a standard text-generation transformer, so maybe it just doesn't work yet because of the model-type constraints in SUPPORTED_MODEL_TYPES?
This doesn't match any of the supported model types: https://huggingface.co/api/models?filter=codegen&full=true

RuntimeError: server crashed for some reason, unable to proceed

Using the default example from "Deploying MII-Public" on Azure ML:
Compute instance: TeslaK80 12GB
Kernel: Python 3.8 - AzureML

pip install deepspeed-mii

restart kernel

using this fails:

import mii

mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task='text-generation',
           model="bigscience/bloom-560m",
           deployment_name="bloom560m_deployment",
           mii_config=mii_configs)

AssertionError: text-generation only supports ['distilgpt2', 'gpt2-large'...

using this modified to tensor_parallel=1 fails:

import mii

mii_configs = {
    "dtype": "fp16",
    "tensor_parallel": 1,
    "port_number": 50950,
}
name = "microsoft/bloom-deepspeed-inference-fp16"

mii.deploy(task='text-generation',
           model=name,
           deployment_name=name + "_deployment",
           model_path="/data/bloom-mp",
           mii_config=mii_configs)

RuntimeError: server crashed for some reason, unable to proceed

Also switching to int8 didn't help.

Is my compute instance too small?
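
A rough size check suggests an answer. The sketch below assumes that microsoft/bloom-deepspeed-inference-fp16 holds the full 176-billion-parameter BLOOM model (an assumption not stated in the issue) and compares the raw weight size against the 12 GB of the K80 mentioned above, ignoring activations and other overhead.

```python
import math

GB = 1e9

num_params = 176e9  # ASSUMPTION: full BLOOM parameter count
gpu_memory_gb = 12  # Tesla K80 memory quoted in the issue

fp16_gb = num_params * 2 / GB  # 2 bytes per parameter
int8_gb = num_params * 1 / GB  # 1 byte per parameter

# Lower bound on the number of 12 GB devices needed for the weights alone.
min_gpus_fp16 = math.ceil(fp16_gb / gpu_memory_gb)
min_gpus_int8 = math.ceil(int8_gb / gpu_memory_gb)

print(f"fp16 weights: {fp16_gb:.0f} GB, at least {min_gpus_fp16} x 12 GB GPUs")
print(f"int8 weights: {int8_gb:.0f} GB, at least {min_gpus_int8} x 12 GB GPUs")
```

Under these assumptions a single 12 GB GPU with tensor_parallel set to 1 is far too small for either variant, which would be consistent with the server crashing during load.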
