microsoft / deepspeed-mii
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
License: Apache License 2.0
When using DeepSpeed MII, some parameters do not work when querying the model, even though they work with model.generate or with Hugging Face pipelines. I have also tried these parameters using DeepSpeed inference on its own and found them to work.
The parameters that cause issues for me are num_beams and bad_words_ids, but there may be more.
I have found that do_sample, max_length, min_length, top_k, top_p, temperature, repetition_penalty, and early_stopping do not cause issues, but there may be more.
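As a stopgap, a client could separate the generation arguments reported as working from those reported as failing before calling query. The sketch below is only an illustration: the two sets are copied from this report (they are not an official compatibility list), and split_query_kwargs is a hypothetical helper, not part of MII.

```python
# Parameter names copied from the report above; not an official MII list.
REPORTED_WORKING = {
    "do_sample", "max_length", "min_length", "top_k",
    "top_p", "temperature", "repetition_penalty", "early_stopping",
}
REPORTED_FAILING = {"num_beams", "bad_words_ids"}


def split_query_kwargs(kwargs):
    """Split kwargs into (send, held_back, unknown) dicts.

    Hypothetical helper: 'send' holds arguments reported to work,
    'held_back' holds those reported to fail, and 'unknown' holds
    anything this report does not cover.
    """
    send, held_back, unknown = {}, {}, {}
    for key, value in kwargs.items():
        if key in REPORTED_WORKING:
            send[key] = value
        elif key in REPORTED_FAILING:
            held_back[key] = value
        else:
            unknown[key] = value
    return send, held_back, unknown


send, held_back, unknown = split_query_kwargs(
    {"do_sample": True, "top_p": 0.9, "num_beams": 4, "max_new_tokens": 32}
)
print(send)       # arguments that could be forwarded as query keyword arguments
print(held_back)  # arguments to leave out until the issue is resolved
print(unknown)    # arguments not covered by this report
```

The "unknown" bucket is kept separate on purpose, since the report itself says there may be more parameters in either group.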
Traceback (most recent call last):
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
response = await self.stubs[stub_id].GeneratorReply(req)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: probability tensor contains either `inf`, `nan` or element < 0"
debug_error_string = "{"created":"@1667201081.916813042","description":"Error received from peer ipv6:[::1]:50956","file":"src/core/lib/surface/call.cc","file_line":1068,"grpc_message":"Exception calling application: probability tensor contains either `inf`, `nan` or element < 0","grpc_status":2}"
I see this sometimes for BLOOM-176B
Python=3.8
Deepspeed-MII=latest version
test example in mii.py file
import mii
# roberta
name = "roberta-base"
mask = "<mask>"
# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")
generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})
print(result.response)
print("time_taken:", result.time_taken)
python mii.py
Error code
(ds1) [root@6301babb8dc8a1eeb0ed2044 DeepSpeed-MII (main)]# python mii.py
Querying bert-base-uncased...
Traceback (most recent call last):
File "mii.py", line 11, in <module>
generator = mii.mii_query_handle(name + "_deployment")
File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/server_client.py", line 34, in mii_query_handle
configs = mii.utils.import_score_file(deployment_name).configs
File "/root/workspace/sharing/big-storage/hyungrak/tuning/text_filter/mii_test/DeepSpeed-MII/mii/utils.py", line 147, in import_score_file
spec.loader.exec_module(score)
File "<frozen importlib._bootstrap_external>", line 839, in exec_module
File "<frozen importlib._bootstrap_external>", line 975, in get_code
File "<frozen importlib._bootstrap_external>", line 1032, in get_data
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/mii_cache/bert-base-uncased_deployment/score.py'
Why is this error occurring?
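The traceback shows mii_query_handle trying to import /tmp/mii_cache/bert-base-uncased_deployment/score.py, so one thing worth checking is whether that file exists before querying. The sketch below builds a path in the same shape as the one in the error. The /tmp/mii_cache root and the <deployment_name>/score.py layout are inferred from this traceback only, and score_file_for is a hypothetical helper. If the file is missing, one possible reason is that no deployment with that name was created on this machine.

```python
from pathlib import Path


def score_file_for(deployment_name, cache_root="/tmp/mii_cache"):
    """Return the score.py path in the layout seen in the traceback.

    Assumption: <cache_root>/<deployment_name>/score.py, inferred from
    the error message above rather than from MII documentation.
    """
    return Path(cache_root) / deployment_name / "score.py"


path = score_file_for("bert-base-uncased_deployment")
if path.is_file():
    print(f"Found {path}; the deployment looks queryable.")
else:
    print(f"Missing {path}; was this deployment created on this machine?")
```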
When using greedy search (do_sample=False) and dtype=fp32, the generated tokens are not shown in the output of the query. I believe the text generation is happening, because different values for max_new_tokens lead to different runtimes for the query. See this notebook as a minimal example.
Just curious whether there is a plan to support Albert and Swin/ViT. I am currently working on a multimodal learning model that involves language models like Albert and vision transformers like Swin and ViT. If there is no immediate plan for this due to limited bandwidth, are there any documents that explain how to add support for new or customized models, so that I could help?
When trying to run the AML example, e.g. bloom aml, it tries to run get_acr_name() but fails because it is missing the resource group name argument. Is there a way to pass in user arguments such as the resource group, subscription, etc.? It would also be nice to expose more arguments for the AML online endpoints, such as auth_mode; for example, we aren't allowed to use keys in production environments, only aml_tokens. I can also imagine other deployment attributes/arguments being useful, such as instance_count or type.
[2022-12-08 10:53:37,253] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
ERROR: the following arguments are required: --resource-group/-g, --name/-n
Examples from AI knowledge base:
https://aka.ms/cli_ref
Read more about the command in reference docs
------------------------------
Unable to obtain ACR name from Azure-CLI. Please verify that you:
- Have Azure-CLI installed (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
- Are logged in to an active account on Azure-CLI ($az login)
- Have Azure-CLI ML plugin installed ($az extension add --name ml)
------------------------------
Traceback (most recent call last):
File "/mnt/c/Users/davidaponte/Documents/CS677-DeepLearning/deeplearning/deeplearning/deep_learning/text_to_image/deepspeed_mii/bloom560m-aml.py", line 7, in <module>
mii.deploy(task='text-generation',
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 112, in deploy
_deploy_aml(deployment_name=deployment_name, model_name=model, version=version)
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 124, in _deploy_aml
acr_name = mii.aml_related.utils.get_acr_name()
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 31, in get_acr_name
raise (e)
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 13, in get_acr_name
acr_name = subprocess.check_output(
File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.
Setup:
deepspeed==0.7.6
deepspeed-mii==0.0.4
py3.9.0
Ubuntu 20.04.4 LTS (Focal Fossa)
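The Azure CLI error above says that --resource-group/-g and --name/-n are required. A sketch of how the same az ml workspace show command from the traceback could be built with those two flags supplied explicitly follows. build_workspace_show_cmd and the example values are hypothetical, and this is not how MII currently exposes these settings.

```python
def build_workspace_show_cmd(resource_group, workspace_name):
    """Build the command from the traceback plus the two flags the CLI asked for."""
    return [
        "az", "ml", "workspace", "show",
        "--resource-group", resource_group,  # flag named in the CLI error
        "--name", workspace_name,            # flag named in the CLI error
        "--query", "container_registry",
    ]


# Placeholder values; replace with your own resource group and workspace.
cmd = build_workspace_show_cmd("my-resource-group", "my-workspace")
print(" ".join(cmd))
# To execute it, pass the list to subprocess.check_output(cmd) on a machine
# where the Azure CLI and its ml extension are installed and logged in.
```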
I'm trying out deepspeed-mii on a local machine (8 GPUs with 23GB VRAM each). Smaller models like bloom-560m and EleutherAI/gpt-neo-2.7B worked well. However, I got CUDA OOM errors when loading larger models, like bloom-7b1. For some even larger models like EleutherAI/gpt-neox-20b, the server just crashed without any specific error messages or logs.
I've tried deepspeed inference before, and it worked fine on these models.
I use this script to deploy models
import mii
mii_configs = {"tensor_parallel": 8, "dtype": "fp16"}
mii.deploy(task='text-generation',
           model="facebook/opt-6.7b",
           deployment_name="facebook/opt-6.7b",
           model_path="/home/ubuntu/.cache/huggingface/hub",
           mii_config=mii_configs)
Is there something I should change in my deployment script?
Thanks!
As in the subject: if I have to deploy one model across more than one machine, what configuration can I use?
Is there a way to inference OPT models in TensorParallel or PipelineParallel mode?
As I understand:
BLOOM uses llm provider which loads the model weights as meta tensors first and then assigns devices during checkpoint loading in ds-inference.
OPT uses hf provider with 🤗 pipeline and directly loads checkpoint weights on a specific device.
However, only MP is supported from 🤗 side (using accelerate). Is there a way to inference OPT with llm provider?
For example, with Stable Diffusion, is it possible to communicate with CLIP, VAE, and UNet separately?
Hi, does DeepSpeed-MII support fairseq's translation models, such as transformer.wmt16.en-de or transformer.wmt19.en-de? I ask because no translation task is listed in the Supported Models and Tasks section.
{"tensor_parallel": 4}
it will pick up the default port number without them needing to specify it. Example use case:
config = {"tensor_parallel": 4}
mii.deploy('fill-mask',
           name,
           mii.DeploymentType.LOCAL,
           deployment_name=name + "_deployment",
           local_model_path=".cache/models/" + name,
           mii_configs=config,
           enable_deepspeed=True)
https://github.com/microsoft/DeepSpeed-MII/tree/main/examples/benchmark/txt2img
I've run into some strange protobuf-related errors. When I first ran into this, I was able to resolve it by changing my protobuf version to >=3.20.0, but now that doesn't work anymore.
My hunch is that it's related to how I am installing things. I wasn't sure what the correct way to install deepspeed and deepspeed-mii was, so I have been using the following:
pip install deepspeed[sd] deepspeed-mii
I am now seeing this error when trying to run mii-sd.py:
Setup:
deepspeed==0.7.6
deepspeed-mii==0.0.4
diffusers==0.9.0
Ubuntu 18.04.6 LTS
py3.8.15
I saw that T5 wasn't in the list of supported huggingface transformers models. Are there plans / ETA for when the T5 family would be added? FLAN-T5 is a very strong llm for zero/fewshot instruction prompting. I am currently building out a hacky implementation for hosting with deepspeed-inference, but having it natively supported in deepspeed-mii would be ideal.
int8 doesn't work. (DeepSpeed-MII) only has fp16 support for now.
Any idea on when int8 will be supported?
Hello!
Thanks for this great optimization.
We're using a fresh ec2 G5XL instance.
After installing everything and running python baseline-sd.py
I see the following error:
attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
I've installed the environment using: pip install deepspeed[sd] deepspeed-mii
When running ds_report I see the following output:
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
Currently there's no way to tear down a local or Azure deployment gracefully. We currently just pkill python, which is clearly not a clean solution.
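A gentler alternative to pkill, assuming the caller holds the subprocess.Popen handle of the server process, is to ask the process to exit and only force-kill it after a timeout. This is a generic standard-library sketch, not an MII API: stop_process is a hypothetical helper and the sleeping child merely stands in for a server. A launcher that spawns its own worker processes may need extra handling that is not shown here.

```python
import subprocess
import sys


def stop_process(process, timeout=5.0):
    """Ask a child process to exit, then force-kill it if it does not.

    Returns the exit code of the process.
    """
    process.terminate()  # polite request (SIGTERM on POSIX)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()  # last resort
        return process.wait()


# Stand-in for a server process; a real deployment would hold its own handle.
server = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(server)
print("server stopped, exit code:", exit_code)
```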
Hey,
I'm trying to run the example provided for text generation with Zero-Inference, and having trouble getting predictions without running into errors.
When I try to deploy the exact same model and config, I first get a validation error for the aio configuration.
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 70, in <module>
main()
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/launch/multi_gpu_server.py", line 56, in main
inference_pipeline = load_models(task_name=args.task_name,
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/models/load_models.py", line 87, in load_models
ds_config = DeepSpeedConfig(ds_config_path)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 811, in __init__
self._initialize_params(copy.copy(self._param_dict))
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 830, in _initialize_params
self.zero_config = get_zero_config(param_dict)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/zero/config.py", line 66, in get_zero_config
return DeepSpeedZeroConfig(**zero_config_dict)
File "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/deepspeed/runtime/config_utils.py", line 54, in __init__
super().__init__(**data)
File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for DeepSpeedZeroConfig
aio
extra fields not permitted (type=value_error.extra)
If I remove the aio config, the server starts successfully, but when I try to create a generator and query it (just like in your sample), I get another error for the generator.query() call:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_5004/1158427391.py in <cell line: 1>()
----> 1 result = generator.query({'query': ["DeepSpeed is the", "Seattle is"]})
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/mii/server_client.py in query(self, request_dict, **query_kwargs)
357 else:
358 assert self.initialize_grpc_client, "grpc client has not been setup when this model was created"
--> 359 response = self.asyncio_loop.run_until_complete(
360 self._query_in_tensor_parallel(request_dict,
361 query_kwargs))
~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/base_events.py in run_until_complete(self, future)
590 """
591 self._check_closed()
--> 592 self._check_running()
593
594 new_task = not futures.isfuture(future)
~/anaconda3/envs/pytorch_p38/lib/python3.8/asyncio/base_events.py in _check_running(self)
550 def _check_running(self):
551 if self.is_running():
--> 552 raise RuntimeError('This event loop is already running')
553 if events._get_running_loop() is not None:
554 raise RuntimeError(
RuntimeError: This event loop is already running
Any help is greatly appreciated.
Hi all, really intrigued by this project, love the idea of democratising the use of large models! I've been playing around and encountered a few bugs/unexpected behaviours, so I will raise some issues. Happy to help and provide constructive feedback where I can :)
When running the example provided at https://github.com/microsoft/deepspeed-mii#deploying-mii-public I receive the following error: RuntimeError: This event loop is already running. See this notebook for a minimal example to reproduce the error.
I found a workaround using nest_asyncio.apply(), see this notebook. Nevertheless this strikes me as a bug (or at least unintended behaviour).
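The error can be reproduced without MII: calling run_until_complete on a loop that is already running (as is the case inside a notebook cell) raises this RuntimeError. The sketch below reproduces it and then shows a different, standard-library-only alternative to the nest_asyncio workaround mentioned above, namely running the coroutine on a fresh loop in a worker thread. fake_query is a hypothetical stand-in for the client's coroutine, and whether this approach fits MII's client, which keeps its own loop, is untested.

```python
import asyncio
import threading


async def fake_query():
    """Hypothetical stand-in for the coroutine a client awaits."""
    await asyncio.sleep(0)
    return "response"


def query_in_worker_thread():
    """Run the coroutine on a fresh event loop in a separate thread."""
    result = {}

    def worker():
        result["value"] = asyncio.run(fake_query())

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["value"]


async def notebook_cell():
    """Simulate code that runs while an event loop is already running."""
    loop = asyncio.get_running_loop()
    coro = fake_query()
    try:
        loop.run_until_complete(coro)  # same call pattern as in the traceback
        error = None
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning
        error = str(exc)
    return error, query_in_worker_thread()


error, response = asyncio.run(notebook_cell())
print(error)
print(response)
```

The worker thread blocks the calling loop until the query finishes, which matches the synchronous behaviour of query but would not suit code that needs the outer loop to stay responsive.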
When serving models of "bert" type, the following error showed up.
To reproduce, checkout #19 and start a local server with "fill-mask-example.py" using "bert-base-uncased", and query the server with the following client code
import os
import grpc
import mii
# bert
name = "bert-base-uncased"
mask = "[MASK]"
print(f"Querying {name}...")
generator = mii.mii_query_handle(name + "_deployment")
result = generator.query({'query': "Hello I'm a " + mask + " model."})
The default mii_cache location is hardcoded as /tmp/cache, and we run into issues in a cluster environment when multiple users try to submit jobs and write to that directory. It may be better to make the default cache location respect an environment variable set on the system.
DeepSpeed-MII/mii/constants.py
Line 98 in 81aca4d
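A minimal sketch of the suggested behaviour, assuming an environment variable override with a fallback default. The variable name MII_CACHE_PATH is chosen for this sketch only and is not confirmed by the source; the default used here is /tmp/mii_cache, the path seen in tracebacks elsewhere on this page.

```python
import os

# Default taken from the paths that appear in MII tracebacks.
DEFAULT_CACHE_DIR = "/tmp/mii_cache"
# Hypothetical variable name, chosen for this sketch only.
CACHE_ENV_VAR = "MII_CACHE_PATH"


def resolve_cache_dir(environ=None):
    """Return the cache directory, preferring the environment variable."""
    environ = os.environ if environ is None else environ
    return environ.get(CACHE_ENV_VAR) or DEFAULT_CACHE_DIR


print(resolve_cache_dir({}))
print(resolve_cache_dir({CACHE_ENV_VAR: "/scratch/alice/mii_cache"}))
```

With a per-user value such as the second one, users on a shared cluster would no longer write to the same directory.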
When running text2img-example.py I encounter the following error message :
raise ValueError(f"model must be a torch.nn.Module, got {type(self.module)}")
It's raised from
mii.deploy(task='text-to-image',
           model="CompVis/stable-diffusion-v1-4",
           deployment_name="sd_deploy",
           mii_config=mii_configs)
Is "CompVis/stable-diffusion-v1-4" still handled?
asyncio==3.4.3
certifi @ file:///croot/certifi_1665076670883/work/certifi
charset-normalizer==2.1.1
deepspeed==0.7.3
deepspeed-mii==0.0.2
diffusers==0.6.0
filelock==3.8.0
grpcio==1.50.0
grpcio-tools==1.50.0
hjson==3.1.0
huggingface-hub==0.10.1
idna==3.4
importlib-metadata==5.0.0
ninja==1.10.2.4
numpy==1.23.4
packaging==21.3
Pillow==9.2.0
protobuf==4.21.9
psutil==5.9.3
py-cpuinfo==9.0.0
pydantic==1.10.2
pyparsing==3.0.9
PyYAML==6.0
regex==2022.9.13
requests==2.28.1
six==1.16.0
tokenizers==0.12.1
torch==1.13.0+cu116
torchaudio==0.13.0+cu116
torchvision==0.14.0+cu116
tqdm==4.64.1
transformers==4.21.2
typing_extensions==4.4.0
urllib3==1.26.12
zipp==3.10.0
Hi, would it be possible to pass in a stopping_criteria inside .generate()?
...
mii_generator = mii.mii_query_handle('name')
mii_generator.query({"query": ['hello']}, stopping_criteria=[])
Currently we get an error (can't pass a list of objects through grpc):
~/venv/lib/python3.7/site-packages/mii/utils.py in kwarg_dict_to_proto(kwarg_dict)
176 return proto_value
177
--> 178 return {k: get_proto_value(v) for k, v in kwarg_dict.items()}
179
180
~/venv/lib/python3.7/site-packages/mii/utils.py in <dictcomp>(.0)
176 return proto_value
177
--> 178 return {k: get_proto_value(v) for k, v in kwarg_dict.items()}
179
180
~/venv/lib/python3.7/site-packages/mii/utils.py in get_proto_value(value)
173 def get_proto_value(value):
174 proto_value = mii.grpc_related.proto.modelresponse_pb2.Value()
--> 175 setattr(proto_value, dtype_proto_field[type(value)], value)
176 return proto_value
177
KeyError: <class 'list'>
The use case is that, for a text-generation task, I'd like to stop at a newline / custom token.
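Until a list of stopping criteria can be sent through gRPC, a client-side fallback is to cut the returned text at the first stop sequence. This sketch does not stop generation early, so it saves no compute; it only trims the output. truncate_at_stop is a hypothetical helper, not an MII feature, and the sample text is made up.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty stop strings
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = "The answer is 42.\nNext question: what is"
print(truncate_at_stop(generated, ["\n", "###"]))
```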
As a developer I want to be able to test deepspeed-mii easily.
However, while using conda (or another Python package manager, i.e. pypenv), I still encounter errors (with protobuf, for example).
Fastest one: provide a Dockerfile that the developer/user could build to use and test deepspeed-mii.
What would be amazing: at each deepspeed-mii modification, a CI job builds the Docker image and uploads/updates it on Docker Hub.
This should take long to do but would be great to have.
I don't see any parameter allowing the user to specify a remote DeepSpeed server to target.
Is there any option for that?
If yes :
If no :
Second question : Is there any option to manually load/unload a model at query time ?
Hi, thank you for the incredible work done here.
Curious as to whether img2img and inpainting are planned for release via MII for Stable Diffusion? Happy to potentially help add those features.
It would be ideal to be able to pass in any class that inherits from diffusers.pipeline_utils.DiffusionPipeline, and then just allow the passed kwargs to handle the various inputs. Doing this would allow img2img, inpainting, and any other community pipelines that exist out there to take advantage of mii.
When deploying the bigscience/bloom-3b (in fp32) via MII on a T4 GPU I receive a CUDA out of memory error, see this notebook. When deploying the same model (also in fp32) via the standard HF Pipeline API, it works, see this notebook.
My expectation would be that it should be possible to deploy the same model via MII if I can deploy it via HF Pipelines. If this is not possible then it'd be good to explain why and set expectations with users.
Hello,
When running the following code I get the FileNotFoundError Error.
Any idea why this happens? I follow the usual install through conda (pytorch+cuda) and pip install .
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="gpt2",
           deployment_name="gpt2_deployment",
           mii_config=mii_configs)
[2022-08-25 12:41:19,489] [INFO] [deployment.py:74:deploy] *************DeepSpeed Optimizations: True*************
[2022-08-25 12:41:19,524] [INFO] [server_client.py:206:_initialize_service] multi-gpu deepspeed launch: ['deepspeed', '--num_gpus', '1', '--no_local_rank', '--no_python', '/mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-generation', '--model', 'gpt2', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'hugging-face', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogImZwMTYiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGx9']
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Input In [2], in <cell line: 2>()
1 mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
----> 2 mii.deploy(task="text-generation",
3 model="gpt2",
4 deployment_name="gpt2_deployment",
5 mii_config=mii_configs)
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/deployment.py:94, in deploy(task, model, deployment_name, deployment_type, model_path, enable_deepspeed, enable_zero, ds_config, mii_config)
92 print(f"Score file created at {generated_score_path(deployment_name)}")
93 elif deployment_type == DeploymentType.LOCAL:
---> 94 return _deploy_local(deployment_name, model_path=model_path)
95 else:
96 raise Exception(f"Unknown deployment type: {deployment_type}")
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/deployment.py:100, in _deploy_local(deployment_name, model_path)
99 def _deploy_local(deployment_name, model_path):
--> 100 mii.utils.import_score_file(deployment_name).init()
File /tmp/mii_cache/gpt2_deployment/score.py:29, in init()
26 assert task is not None, "The task name should be set before calling init"
28 global model
---> 29 model = mii.MIIServerClient(task,
30 model_name,
31 model_path,
32 ds_optimize=configs[mii.constants.ENABLE_DEEPSPEED_KEY],
33 ds_zero=configs[mii.constants.ENABLE_DEEPSPEED_ZERO_KEY],
34 ds_config=configs[mii.constants.DEEPSPEED_CONFIG_KEY],
35 mii_configs=configs[mii.constants.MII_CONFIGS_KEY],
36 use_grpc_server=use_grpc_server,
37 initialize_grpc_client=initialize_grpc_client)
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/server_client.py:83, in MIIServerClient.__init__(self, task_name, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs, initialize_service, initialize_grpc_client, use_grpc_server)
80 self.model = None
82 if self.initialize_service:
---> 83 self.process = self._initialize_service(model_name,
84 model_path,
85 ds_optimize,
86 ds_zero,
87 ds_config,
88 mii_configs)
89 if self.use_grpc_server:
90 self._wait_until_server_is_live()
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/NLP/Text_Generation/text_generator/DeepSpeed-MII/mii/server_client.py:209, in MIIServerClient._initialize_service(self, model_name, model_path, ds_optimize, ds_zero, ds_config, mii_configs)
207 mii_env = os.environ.copy()
208 mii_env["TRANSFORMERS_CACHE"] = model_path
--> 209 process = subprocess.Popen(cmd, env=mii_env)
210 return process
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/subprocess.py:951, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask)
947 if self.text_mode:
948 self.stderr = io.TextIOWrapper(self.stderr,
949 encoding=encoding, errors=errors)
--> 951 self._execute_child(args, executable, preexec_fn, close_fds,
952 pass_fds, cwd, env,
953 startupinfo, creationflags, shell,
954 p2cread, p2cwrite,
955 c2pread, c2pwrite,
956 errread, errwrite,
957 restore_signals,
958 gid, gids, uid, umask,
959 start_new_session)
960 except:
961 # Cleanup if the child failed starting.
962 for f in filter(None, (self.stdin, self.stdout, self.stderr)):
File /mnt/2287294e-32c7-437b-84bd-452a29548b1a/conda_env/DeepSpeedInterface/lib/python3.9/subprocess.py:1821, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
1819 if errno_num != 0:
1820 err_msg = os.strerror(errno_num)
-> 1821 raise child_exception_type(errno_num, err_msg, err_filename)
1822 raise child_exception_type(err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'deepspeed'
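The last frame shows subprocess.Popen failing to find an executable named 'deepspeed', which is the first element of the launch command logged above. When Popen is given a bare program name, it is looked up on PATH, so one quick check is whether the launcher is visible from the process that calls mii.deploy (for example, a notebook kernel may not inherit the conda environment's bin directory, though that is only one possibility). find_launcher below is a hypothetical helper built on the standard library.

```python
import shutil


def find_launcher(name="deepspeed"):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


launcher = find_launcher()
if launcher is None:
    print("'deepspeed' is not on PATH; subprocess.Popen would raise FileNotFoundError.")
else:
    print("deepspeed launcher found at", launcher)
```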
Hi @mrwyattii ,
I am trying to create an inference solution for large models that has support for various frameworks like DS-inference, DS-ZeRO and standard HF codebase.
Is it fine, if I extend some of the classes in MII like MIIServerClient and borrow some pieces of code from the proto files?
This is the relevant PR: huggingface/transformers-bloom-inference#25
I noticed version.txt is at 0.05 but there is no release tag for 0.05 and PyPI is at 0.04. This change was made over a month ago. Perhaps there was meant to be a release tag but for some reason, it was forgotten?
I tried HF OPT-13b on a 4 GPU machine with tensor-parallel: 4. One observation is that all GPUs used the same amount of memory (~25G). This is consistent with other users' reports. I also found that the memory is the same as the memory used with tensor-parallel: 2. So my question is whether the model is split after it is loaded into CPU memory, as said in this thread? My understanding is that the memory should be a quarter if the model is split with tensor-parallel: 4 and a half with tensor-parallel: 2.
By the way, I also didn't really find latency reduction when increasing tensor parallel number (the latency only has 2 or 3 ms difference).
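The expectation stated above can be written as simple arithmetic. The sketch assumes roughly 13e9 parameters for OPT-13b, fp16 weights (the report does not state the dtype), and a perfectly even split of weights across GPUs; it ignores activations, caches, and framework overhead, so it is an idealized lower bound rather than a prediction.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weights_gb_per_gpu(num_params, dtype, tensor_parallel):
    """Idealized weight memory per GPU in GB (10**9 bytes), assuming an even split."""
    return num_params * BYTES_PER_PARAM[dtype] / tensor_parallel / 1e9


# Assumed parameter count and dtype; see the caveats above.
for tp in (1, 2, 4):
    print(f"tensor_parallel={tp}: {weights_gb_per_gpu(13e9, 'fp16', tp):.1f} GB per GPU")
```

Under these assumptions, the reported ~25G per GPU is close to the unsplit figure for tensor_parallel=1 rather than to the split figures.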
I have an issue
If multiple people query a model deployed via MII, I run into an "event loop is already running" error.
Allow the users to pass a dictionary or transformers.PretrainedConfig when deploying models.
I'm unable to get the example script working from here:
https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/txt2img-example.py
When I run without arguments it loads the model and deploys okay.
But then using --query produces this:
ERROR:grpc._server:Exception calling application: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/_server.py", line 443, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/grpc_related/modelresponse_server.py", line 77, in Txt2ImgReply
response = self.inference_pipeline(request, **query_kwargs)
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 504, in __call__
height = height or self.unet.config.sample_size * self.vae_scale_factor
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DSUNet' object has no attribute 'config'
Traceback (most recent call last):
File "deploy.py", line 52, in <module>
result = generator.query({
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 367, in query
response = self.asyncio_loop.run_until_complete(
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 263, in _query_in_tensor_parallel
await responses[0]
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/mii/server_client.py", line 313, in _request_async_response
response = await self.stubs[stub_id].Txt2ImgReply(req)
File "/home/snd/bin/miniconda3/envs/ldm/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception calling application: 'DSUNet' object has no attribute 'config'"
debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {created_time:"2022-11-28T15:19:01.639187607-08:00", grpc_status:2, grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'"}"
Versions:
deepspeed 0.7.5
deepspeed-mii 0.0.3
transformers 4.24.0
Hello,
I've been playing around with the SD image generation. I am seeing the 1.8x speedup (which is awesome), but I've also noticed a small drop in quality. How would I go about deactivating quantization to see whether that's the reason for the drop?
Thanks
I am currently able to deploy, query, and shut down a model using the provided scripts.
However, unlike using DeepSpeed inference on its own, I am not able to figure out how to change the number of max generated tokens from 1024 to a different value.
I believe this is currently not supported, but I could be mistaken.
I believe the issue can be found with the code here:
DeepSpeed-MII/mii/models/load_models.py
Lines 73 to 80 in 79b56af
A value called max_tokens needs to be passed as an argument.
If I am correct, this should be a fairly simple fix. I may create a PR for it if I can resolve it.
Recently I have been trying to run OPT models on MII but came across some memory issues. The OPT model I used is facebook/opt-13b. The mii-config and deployment parameters are like this:
mii_configs = {
    "dtype": "fp32",
    "tensor_parallel": 4,
}
name = "facebook/opt-13b"
mii.deploy(task='text-generation',
           model=name,
           deployment_name=name + "_deployment",
           model_path='/root/ckpt/opt_13b/mii',
           mii_config=mii_configs)
The checkpoint is already downloaded into the model_path. Since the checkpoint size of opt-13b is around 26 GB, I supposed it should work on a machine with 4 x V100 and 224G memory. But it turns out that during the loading part (even before the server started), MII reported a "server crashed" error and exited quietly. I then checked the memory usage and, surprisingly, found that MII used up all 224G of memory. So my question is: why does MII consume several times more memory than the checkpoint size? Is there any configuration to change this behavior?
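One possible explanation, offered purely as an assumption and not confirmed by the source, is that each of the 4 tensor-parallel processes loads a full fp32 copy of the weights into host memory before any splitting happens. The back-of-the-envelope calculation below uses an approximate 13e9 parameter count and ignores all other overhead; host_memory_gb is a hypothetical helper.

```python
def host_memory_gb(num_params, bytes_per_param, num_processes):
    """Host RAM in GB (10**9 bytes) if every process loads a full copy of the weights."""
    return num_params * bytes_per_param * num_processes / 1e9


# Assumed: ~13e9 parameters, fp32 (4 bytes each), one full copy per process.
print(host_memory_gb(13e9, 4, 4))  # four fp32 copies
print(host_memory_gb(13e9, 2, 1))  # a single fp16 copy, for comparison
```

Under that assumption the first figure comes to about 208 GB, which is close to the 224G of host memory on the machine, while the second roughly matches the 26 GB checkpoint size quoted above.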
Any plans to support int-8 inference any time soon?
Hi Deepspeed-MII team,
I was wondering if there is a way to implement a stop sequence or stop token in ds-mii to stop generation early.
In the current implementation, the model mostly generates max_new_tokens number of tokens. In huggingface transformers, it's possible to implement custom stopping criteria but I did not find this option here.
I tried setting the eos_token_id to the desired stop token but somehow the model keeps generating even after producing the stop token.
Cheers, V
When running the example from https://github.com/microsoft/deepspeed-mii#deploying-mii-public in fp32 I receive an AioRpcError. See this notebook for a minimal example to reproduce the error.
New microsoft/bloom-deepspeed-inference-fp16 and microsoft/bloom-deepspeed-inference-int8 weights not working with DeepSpeed MII
Traceback (most recent call last):
File "scripts/bloom-inference-server/server.py", line 83, in <module>
model = DSInferenceGRPCServer(args)
File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/grpc_server.py", line 36, in __init__
mii.deploy(
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/deployment.py", line 70, in deploy
mii.utils.check_if_task_and_model_is_valid(task, model)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/mii/utils.py", line 108, in check_if_task_and_model_is_valid
assert (
AssertionError: text-generation only supports [.....]
The list of models doesn't contain the new weights.
@mrwyattii seeing this a lot lately:
Traceback (most recent call last):
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
response = await self.stubs[stub_id].GeneratorReply(req)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202507.928505909","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928504405","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3477' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
response = await self.stubs[stub_id].GeneratorReply(req)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202507.928579654","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202507.928578643","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3472' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
response = await self.stubs[stub_id].GeneratorReply(req)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202508.129364892","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.129363364","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>
Task exception was never retrieved
future: <Task finished name='Task-3473' coro=<MIIServerClient._request_async_response() done, defined at /net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py:260> exception=<AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
>>
Traceback (most recent call last):
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/mii/server_client.py", line 268, in _request_async_response
response = await self.stubs[stub_id].GeneratorReply(req)
File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/bloom-server/lib/python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1667202508.453402948","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":5391,"referenced_errors":[{"created":"@1667202508.453401110","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}"
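StatusCode.UNAVAILABLE with "failed to connect to all addresses" generally means that nothing was accepting connections at the target address when the client tried to reach it. A quick way to tell a dead server from a client-side problem is to probe the port directly. This is a minimal standard-library sketch; `is_port_open` is a hypothetical helper, and the host and port in the usage line are assumptions that must match the deployment's own settings.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Assumed values: replace with the host and port of your own deployment.
print(is_port_open("localhost", 50950))
```

If this prints False while requests are failing, the server process has most likely exited or never finished starting, and its own log is the next place to look.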
Hello, I'm trying to run the basic example. I have several LLMs working and have used Huggingface Hub to download them, for reference. However, I get this error in the title. Indeed this file is not found in:
/home/user/.local/lib/python3.10/site-packages/torch/include/c10/
I did find it here:
/usr/local/cuda-11.7/targets/x86_64-linux/include/cuda_runtime_api.h
I had a challenging time getting my NVIDIA driver to work with the right CUDA version during the torch install. The current PyTorch version is 1.12.1+cu116. You can see version 11.7 in the path above. I'm not sure how relevant that is, but this is the only combination of CUDA and torch versions I could get working. I think c10 denotes the default version of torch installed with Python 3.10 on Ubuntu 22.04, which is supported by this quote from SE:
"PyTorch doesn't use the system's CUDA library. When you install PyTorch using the precompiled binaries using either pip or conda it is shipped with a copy of the specified version of the CUDA library which is installed locally."
The output does say:
Installed CUDA version 11.7 does not match the version torch was compiled with 11.6 but since the APIs are compatible, accepting this combination
Using /home/user/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Do I need to set some environment variables and/or install another version of PyTorch in a virtualenv? I'm a little short on space, so I'm hoping not. It seems there is some conflict between the default PyTorch c10 locations and the discovered 11.6/11.7 version of CUDA.
Quick side note: the models downloaded to /tmp/mii_models. Is it possible to use the standard Huggingface model locations?
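The "PyTorch extensions root" line in the output suggests the kernels are being compiled at first use, and that build needs the headers of a system CUDA toolkit in addition to the CUDA libraries bundled with the PyTorch wheel. As far as I know, PyTorch's extension builder looks for the toolkit through the CUDA_HOME environment variable first. The sketch below checks which candidate directory actually contains the missing header. `find_cuda_home` is a hypothetical helper, the candidate paths are guesses based on the paths in this post, and whether exporting CUDA_HOME resolves this particular error is an assumption.

```python
import os
from pathlib import Path
from typing import Iterable, Optional


def find_cuda_home(candidates: Iterable[str]) -> Optional[str]:
    """Return the first directory that contains include/cuda_runtime_api.h."""
    for root in candidates:
        if not root:
            continue  # skip unset environment variables
        if (Path(root) / "include" / "cuda_runtime_api.h").is_file():
            return root
    return None


# Candidate locations are assumptions; adjust them to your system.
candidates = [os.environ.get("CUDA_HOME", ""), "/usr/local/cuda", "/usr/local/cuda-11.7"]
print(find_cuda_home(candidates))
```

If a directory is found, exporting it as CUDA_HOME before starting Python is the usual thing to try.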
When I run the following example from the readme:
import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task="text-generation",
           model="bigscience/bloom-350m",
           deployment_name="bloom350m_deployment",
           mii_config=mii_configs)
It returns the following error:
/usr/local/lib/python3.7/dist-packages/mii/utils.py in check_if_task_and_model_is_valid(task, model_name)
    108     assert (
    109         model_name in valid_task_models
--> 110     ), f"{task_name} only supports {valid_task_models}"
    111
    112
AssertionError: text-generation only supports....
I suspect this is related to a change in model weights.
Can you point me in the right direction?
And also, thanks for this amazing repo! Can't wait to use it.
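The assertion in the traceback above is a plain membership test: the requested model name must appear in a list of names supported for the task. A minimal sketch of that logic follows; `check_model_supported` and the `SUPPORTED` table are hypothetical stand-ins, and the two model names are only an illustrative subset taken from an error message elsewhere on this page, not the real list.

```python
from typing import Dict, List

# Hypothetical, truncated stand-in for the real list of supported models.
SUPPORTED: Dict[str, List[str]] = {
    "text-generation": ["distilgpt2", "gpt2-large"],
}


def check_model_supported(task_name: str, model_name: str) -> None:
    """Raise AssertionError if `model_name` is not listed for `task_name`."""
    valid_task_models = SUPPORTED.get(task_name, [])
    assert model_name in valid_task_models, (
        f"{task_name} only supports {valid_task_models}"
    )


try:
    check_model_supported("text-generation", "bigscience/bloom-350m")
except AssertionError as err:
    print(f"AssertionError: {err}")
```

With a check of this shape, a model that was renamed or newly published fails until its exact name is in the list, even when the architecture itself is supported.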
After #25 is complete we want to expose all DS-inference configs (https://deepspeed.readthedocs.io/en/latest/inference-init.html#deepspeed.init_inference) and ZeRO inference configs in the MII config dictionary.
Provide a local AML deployment option; this will use the AML inference server for the front end.
We can then easily deploy an MII generated score file via: azmlinfsrv --model_dir <model-path> --entry_script score.py
Currently https://huggingface.co/Salesforce/codegen-16B-multi and its smaller variants are not supported.
It seems to be a standard text-generation transformer, so maybe it just doesn't work yet because of the model-type constraints in SUPPORTED_MODEL_TYPES?
This doesn't match any of the supported model types: https://huggingface.co/api/models?filter=codegen&full=true
Using the default example from "Deploying MII-Public" to deploy on Azure ML:
Compute instance: Tesla K80 12GB
Kernel: Python 3.8 - AzureML
pip install deepspeed-mii
restart kernel
using this fails:
import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp16"}
mii.deploy(task='text-generation',
           model="bigscience/bloom-560m",
           deployment_name="bloom560m_deployment",
           mii_config=mii_configs)
AssertionError: text-generation only supports ['distilgpt2', 'gpt2-large'...
using this modified to tensor_parallel=1 fails:
import mii
mii_configs = {
    "dtype": "fp16",
    "tensor_parallel": 1,
    "port_number": 50950,
}
name = "microsoft/bloom-deepspeed-inference-fp16"
mii.deploy(task='text-generation',
           model=name,
           deployment_name=name + "_deployment",
           model_path="/data/bloom-mp",
           mii_config=mii_configs)
RuntimeError: server crashed for some reason, unable to proceed
Also switching to int8 didn't help.
Is my compute instance too small?
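A back-of-the-envelope check helps with the sizing question. The sketch below assumes that microsoft/bloom-deepspeed-inference-fp16 holds the 176B-parameter BLOOM weights and counts the weights only, ignoring activations and any runtime overhead; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


BLOOM_PARAMS = 176e9   # assumed parameter count of the checkpoint
GPU_MEMORY_GB = 12     # Tesla K80 compute instance from the post

for dtype, nbytes in (("fp16", 2), ("int8", 1)):
    need = weight_memory_gb(BLOOM_PARAMS, nbytes)
    print(f"{dtype}: {need:.0f} GB of weights vs {GPU_MEMORY_GB} GB on the GPU")
```

Under these assumptions the weights alone need 352 GB in fp16 and 176 GB in int8, so with tensor_parallel=1 they cannot fit on a single 12 GB GPU in either precision.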