tunib-ai / parallelformers
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
Home Page: https://tunib-ai.github.io/parallelformers
License: Apache License 2.0
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

if __name__ == '__main__':
    model_name = 'facebook/opt-30b'
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    parallelize(model, num_gpus=8, fp16=True)
This error was thrown by the parallelize method:
Bus error (core dumped)
With parallelformers 1.2.6 and transformers 4.21.11 this error was not thrown; it occurs only with parallelformers 1.2.7 and transformers 4.21.11.
Hi,
Would it be possible to support new OPT models (a suite of GPT-like models)?
Here's the official doc:
https://huggingface.co/docs/transformers/model_doc/opt
Thanks for your great work!
We will continue to log problems with Docker containers in this thread, and we aim to solve them. Ultimately, the goal is to deploy the model in a Kubernetes environment. If anyone has any problems with the Docker environment, please feel free to leave issues. We will actively review and resolve them.
Hello,
To check whether the GPTJForCausalLM model is supported,
I was testing inference with the parallelformers library
using KoGPT3.
The code I ran is as follows.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from parallelformers import parallelize
tokenizer = AutoTokenizer.from_pretrained(
    'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
    bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
)
model = AutoModelForCausalLM.from_pretrained(
    'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
    pad_token_id=tokenizer.eos_token_id,
    torch_dtype='auto'
)
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
prompt = '''[공부, 학생, 힘들] => 힘들더라도 학생의 본분은 공부입니다
[시작, 떨림, 긴장] => 새로운 시작은 항상 떨리고 긴장되죠 파이팅!!
[방어, 제철, 겨울] => 겨울에는 방어가 제철이죠 방어회 어떠세요?
[겸손, 인생, 변화] => 인생은 어떻게 변할지 몰라요 항상 겸손한 태도를 갖춰야해요
[학교, 선생님, 은혜] => 학창시절 선생님의 은혜를 잊지 못해요 감사합니다.
[입사, 회사, 신입] =>'''
temperature = 0.8
max_length = 140
batch_size = 5
inputs = tokenizer([prompt]*batch_size, return_tensors="pt")
## when passing **inputs
gen_tokens = model.generate(**inputs, do_sample=True, temperature=temperature, max_length=max_length)
## when passing input_ids and attention_mask explicitly
## gen_tokens = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, do_sample=True, temperature=temperature, max_length=max_length)
generated = tokenizer.batch_decode(gen_tokens)
The OUTPUT is as follows.
When the model is wrapped with parallelformers as above, the generation quality sometimes degrades (even the grammar of the results breaks down).
I'm leaving this issue to ask whether I'm using it incorrectly, or whether GPT-3-style models are simply not supported :)
I am using a 3060 and a 3090 to split GPT models two ways, including GPT-J and GPT-Neo 2.7B. When generating many tokens, say 500, the model hangs and either takes an abnormal amount of time to finish or does not finish (I kill it). Generating 50 tokens does not have this issue.
During this issue, the 3090 memory is pinned to 100% while the 3060 stays low.
Subjectively, especially for GPT-J, the results, while not complete gibberish, seem to be of lower quality.
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
The system distributes the model across the GPUs, but during generation the second GPU reaches 100% load and never leaves that state. Generation fails.
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA Tesla K80
GPU 1: NVIDIA Tesla K80
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.10.1+cu113
[conda] numpy 1.21.6 pypi_0 pypi
[conda] torch 1.10.1+cu113 pypi_0 pypi
Thank you for the great project!
gpt2, gpt2-medium, gpt2-large
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
outputs = function_(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
return self.greedy_search(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
transformer_outputs = self.transformer(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
outputs = block(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
attn_outputs = self.attn(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
query = self._split_heads(query, self.num_heads, self.head_dim)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
tensor = tensor.view(new_shape)
RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
outputs = function_(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
return self.greedy_search(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
transformer_outputs = self.transformer(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
outputs = block(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
attn_outputs = self.attn(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
query = self._split_heads(query, self.num_heads, self.head_dim)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
tensor = tensor.view(new_shape)
RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
Hi
Thanks for this library.
I am using the Hugging Face zero-shot classification pipeline with the typeform/distilbert-base-uncased-mnli model.
classifier = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli", device=0)
res = classifier(prod_name_lst, tag_values)
prod_name_lst has 500K entries and tag_values has 52 labels.
I am currently using a loop-based approach, as the code above results in an OOM error.
Please advise on how I can use parallelformers to scale to my dataset.
Thanks,
Subham
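For reference, a minimal chunking sketch, assuming prod_name_lst and tag_values as described above; the chunk size of 256 is an arbitrary placeholder, and whether the pipeline composes with a parallelized model depends on the library, but the chunking itself is plain transformers API:

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
    device=0,
)

def classify_in_chunks(texts, labels, chunk=256):
    # Feed the sequences in fixed-size chunks so only one chunk's activations
    # live on the GPU at a time, instead of all 500K sequences at once.
    results = []
    for i in range(0, len(texts), chunk):
        results.extend(classifier(texts[i:i + chunk], labels))
    return results

# res = classify_in_chunks(prod_name_lst, tag_values)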
>>> a = Foo()
>>> a.predict()
Can you please add support for gpt_neox?
Its official documentation is here: https://huggingface.co/docs/transformers/model_doc/gpt_neox
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize
import torch
model = AutoModelForCausalLM.from_pretrained("./2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
print("Model Loaded")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
Thanks for sharing the great code.
Let me ask you one question.
In the list of Fully Supported Models, it says Megatron BERT, but the following code does not work.
from transformers import MegatronBertModel
from parallelformers import parallelize
model = MegatronBertModel.from_pretrained('nvidia/megatron-bert-cased-345m')
parallelize(model, num_gpus=2, verbose='detail')
Also, I could not find MegatronBertModel in the policies.
How can I parallelize the MegatronBertModel?
from transformers import TrainingArguments
import torch
# get the number of gpus
num_gpus = torch.cuda.device_count()
if num_gpus > 1:
    from parallelformers import parallelize
    parallelize(model, num_gpus=num_gpus, fp16=True, verbose="detail")
gives
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)
WARNING  No nodes ran. Repeat the previous command to attempt a new run.  runner.py:213
[10/15/23 12:57:26] ERROR  Node 'sort_using_baal: preprocess_and_sort([baal.reed_textkernel_labeled,params:reed.pretrained_model_name,reed.aimwel_labeled.finetuned_pre_trained_isco_classifier]) -> [reed.textkernel_labeled.sorted_jobs,baal.reed_textkernel_labeled_parquet]' failed with error: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)  node.py:356
python 3.10.1
parallelformers latest
OS: Ubuntu
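A hedged guess at a workaround: worker_count=9 against world_size=8 suggests the new processes attached to a stale rendezvous store from a previous run. torch.distributed reads MASTER_ADDR and MASTER_PORT from the environment (see the _env_rendezvous_handler trace later on this page), so pointing a fresh run at an unused port avoids rejoining the old store; the port below is a placeholder:

import os

# Hypothetical workaround: choose a fresh rendezvous port before calling
# parallelize() so the new process group cannot attach to a stale store.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"  # any free port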
Hi there!
Thanks for the awesome work on this lib! Just wanted to ask what the recommended way is to clean up a loaded model that has been parallelized using this library. What method should be called to clean up all the resources, move data off the GPU, empty the CUDA cache, and shut down the master process?
Tried to run this, but it hung:
model = ...
p = parallelize(
    model,
    num_gpus=2,
    fp16=True,
    verbose="simple",
)
# Do some inference
p.deparallelize()  # --> This hung
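If I read the README correctly, deparallelization is exposed on the model object rather than on parallelize()'s return value; a hedged sketch:

import torch

# Hedged sketch: call deparallelize on the model itself, then release
# whatever cached GPU memory remains in the current process.
model.deparallelize()
torch.cuda.empty_cache()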
Is it possible to use this library for CNN networks implemented in PyTorch? Can you show me an example?
The model:
google/ul2
The Hardware:
2x RTX Titan
AMD Ryzen 9 5900X 12-Core Processor
64 GB RAM
The Environment:
Python 3.9.13
PyTorch 1.12.0+cu102
NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5
Code used:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from parallelformers import parallelize
import torch
tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"
inputs = tokenizer(input_string, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
Error Message:
$ python test.py
/home/******/miniconda3/envs/ul2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
Is this something I can fix? I would love to use this large model, as it's near SOTA on everything :)
I am trying to use RoBERTa NER and BERT NER uncased, but for both models I am getting the following issue. Is this something still under development, or is something wrong on my side?
Error:
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
It was discussed in #4.
When an attn_qkv Layer is set with n_fused>1 and reversed=False, the shape of its sliced weight is incorrect.
It seems that the root cause is here:
parallelformers/parallelformers/parallel/slicing.py
Lines 79 to 95 in 436573b
For an attn_qkv weight, the arg dim is 0. So when reversed=False and n_fused>1, the tensor is chunked on dim 0 and then concatenated on dim 1, which makes its shape incorrect.
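A minimal sketch of the mismatch with hypothetical sizes (hidden=8, n_fused=3, world_size=2): chunking a fused [3*hidden, hidden] weight on dim 0 and concatenating a rank's pieces on dim 1 yields the wrong shape, while concatenating on dim 0 keeps the fused layout:

import torch

hidden, world_size, n_fused = 8, 2, 3
w = torch.randn(n_fused * hidden, hidden)             # fused q/k/v weight: [24, 8]
pieces = torch.chunk(w, world_size * n_fused, dim=0)  # six [4, 8] pieces

# Buggy path: cat rank 0's q/k/v pieces on dim 1 -> [4, 24], not a valid slice.
wrong = torch.cat([pieces[0], pieces[2], pieces[4]], dim=1)
print(wrong.shape)  # torch.Size([4, 24])

# Expected: stack the same pieces on dim 0 -> [12, 8], one q/k/v slice per rank.
right = torch.cat([pieces[0], pieces[2], pieces[4]], dim=0)
print(right.shape)  # torch.Size([12, 8])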
I started recording my work here; please take note.
@stas00 @RezaYazdaniAminabadi
Hello,
Do I need to use the no_grad context manager, or is it already applied internally?
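For what it's worth, a minimal sketch of applying it explicitly, assuming model and inputs from the surrounding examples; if the library already disables gradients internally, the extra context manager is harmless:

import torch

# Explicitly disabling autograd tracking costs nothing if it is already off,
# and saves activation memory during generation if it is not.
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)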
Hello, first of all congratulations on this amazing project. It's simple, efficient, and versatile. Very useful.
In some cases, one has several GPUs but not enough RAM to parallelize the model.
When loading the model on GPU and then parallelizing, I'm getting the error below:
AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.
It doesn't stop the script, but it seems that the parallelization fails.
My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient) or not at all?
Thanks!
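One hedged option on the RAM side, using plain transformers API rather than anything parallelformers-specific: loading directly in half precision with low_cpu_mem_usage=True roughly halves the CPU memory needed while still leaving the model on CPU, which is what the assertion expects. The model name below is a placeholder:

import torch
from transformers import AutoModelForCausalLM

# Load on CPU in fp16 via the low-memory loading path; the model stays on
# CPU as parallelize() asserts, but needs roughly half the RAM.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",  # placeholder model name
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)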
Dear Community,
I could not find the Falcon models in either the list of supported or unsupported models. Are these models supported by parallelformers? If not, are there any plans to add support for them on the roadmap?
It doesn't look like this library is still in development. What are some other libraries you can point us to that do similar things? Or has HF integrated something similar themselves?
Hi, I wonder if there is any plan to support the salesforce/codegen model family?
Using a p4d.24xlarge:
from parallelformers import parallelize
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "facebook/opt-66b"
batch_size = [1]
batch = [["out story begins on"] * bs for bs in batch_size]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
inputs = [tokenizer(seq, return_tensors="pt").input_ids for seq in batch]
parallelize(model, num_gpus=8, fp16=True)
for _ in range(100):
    model.generate(
        torch.cat(inputs, dim=0),
        do_sample=True,
        max_length=2048,
        num_return_sequences=1,
    )
It loads okay and begins performing inference.
I can see all 8 GPUs at 90+% utilization in nvidia-smi for a while.
Then eventually one GPU drops to 0% and the others jump to 100%.
Terminal shows:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
return resource_sharer.DupFd(fd)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
It then seems to hang forever from there.
I do realize this stack trace doesn't give enough to trace back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?
I tried running the example from the readme but received the above error. Does that mean that my hardware is not supported?
Is there any way to perform tensor parallelism across multiple nodes instead just in a single node? Any tips would be helpful!
Trying to use parallelformers with the megatron-11b pip package.
The MegatronPolicy class is as provided on the megatron-11b PyPI page.
from megatron_11b import MegatronForCausalLM, MegatronTokenizer
tokenizer = MegatronTokenizer.from_pretrained("./megatron-11B")
model = MegatronForCausalLM.from_pretrained("./megatron-11B")
# https://tunib-ai.github.io/parallelformers/intro/POLICY.html
from parallelformers.policies.base import Policy, Layer
from parallelformers.utils.dist_utils import AllReduceLinear
from megatron_11b.modeling_megatron import MegatronDecoderLayer
class MegatronPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        return {
            # 1. reduce hidden size
            "self_attn.embed_dim": config.d_model // world_size,
            # 2. reduce number of heads
            "self_attn.num_heads": config.encoder_attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        return [
            Layer(
                weight="self_attn.q_proj.weight",
                bias="self_attn.q_proj.bias",
            ),
            Layer(
                weight="self_attn.k_proj.weight",
                bias="self_attn.k_proj.bias",
            ),
            Layer(
                weight="self_attn.v_proj.weight",
                bias="self_attn.v_proj.bias",
            ),
        ]

    @staticmethod
    def attn_out():
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        return [
            Layer(
                weight="fc1.weight",
                bias="fc1.bias",
            ),
        ]

    @staticmethod
    def mlp_out():
        return [
            Layer(
                weight="fc2.weight",
                bias="fc2.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def original_layer_class():
        return MegatronDecoderLayer
from parallelformers import parallelize
parallelize(model, num_gpus=8, fp16=True, verbose='detail', custom_policies=[MegatronPolicy])
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,AutoModelForCausalLM
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')
parallelize(model, num_gpus=4, fp16=False)
Thanks for the great repo! I have tried it out; it's really amazing to load such a large model across multiple GPUs.
Currently, GPT-J is supported only in HF 4.7.0 and by installing
pip install git+https://github.com/finetuneanon/transformers@gpt-j
Your requirements pin HF 4.8.0, which is needed to load several new models. Soon GPT-J will be fully integrated in HF: huggingface/transformers#12243
I am wondering if there is an easy way to have backward compatibility, or to include GPT-J soon.
Thanks again for your great repo 👍🏻
-- Andrea
> os.environ["CUDA_VISIBLE_DEVICES"]="1" , parallelize(model_2, ... )
.... (an error occurs when loading the second model)
===========================================================
model name : ./model/ko-gpt-trinity-1.2B-v0.5
CUDA_VISIBLE_DEVICES : 1
request_gpu : 1
used_gpu : 2
===========================================================
Process ParallelProcess-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/process.py", line 254, in run
custom_policies=self.custom_policies,
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
dist.init_process_group(backend=backend)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: Address already in use
def create_process_group(self, backend: str):
    """
    Create PyTorch distributed process group.

    Args:
        backend (str): distributed backend

    Returns:
        ProcessGroupNCCL: process group for parallelization
    """
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)

    torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
    new_group = dist.new_group([i for i in range(self.num_gpus)])
    return new_group
I'm getting the following error while trying to run the example in the getting started document
Process ParallelProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
engine = ParallelEngine(
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch._C._cuda_setDevice(device)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ParallelProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
engine = ParallelEngine(
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
dist.init_process_group(backend=backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer
This is my code. I'm running it on an AWS g5.12xlarge instance with 4 GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
from parallelformers import parallelize
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 29C P8 19W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 29C P8 16W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 29C P8 16W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 30C P8 15W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I pip installed multiprocess (https://pypi.org/project/multiprocess/), as initially I kept getting "importing multiprocess as mp, multiprocess not found". Then I noticed there was a PR by @Oaklight that removed torch.multiprocessing. Maybe I'm not using the right multiprocessing library? Reverting it back to torch.multiprocessing caused the same error noticed by @Oaklight.
import parallelformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]',
)
model = AutoModelForCausalLM.from_pretrained(model_name)  # .to(device='cuda', non_blocking=True)
_ = model.eval()

parallelformers.parallelize(model, num_gpus=4, fp16=True, verbose='detail')

tok = tokenizer("My name is Kevin." * 10, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)
When running inference repeatedly, occasionally a particular GPU node gets stuck with its utilization pinned at 100%, blocking everything.
Even Ctrl+C can't stop the process, because it is deadlocked on a semaphore.
Wondering if it was an error in my code, I deliberately induced a bug as below; in that case the utilization of all nodes drops to 0% and Ctrl+C
prints the causing error, so this doesn't seem to be the same issue.
tok = tokenizer("My name is Kevin."*2048, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/albert/modeling_albert.py", line 368, in forward
self.dense.weight.t()
RuntimeError: shape '[6, 64, 384]' is invalid for input of size 294912
I wonder if there's any plan to support 8-bit inference in parallelformers. Right now, we can load 🤗 transformers models in 8-bit, e.g.:
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
However, it's not possible to parallelize() the model with parallelformers, since only fp16 mode is supported at the moment.
If 8-bit inference could be supported, it would be good to add another argument like the one for fp16, e.g.:
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained(model_name)
parallelize(model, num_gpus=2, int8=True, verbose='detail')
# or one argument for precision mode, where dtype can be either "int8", "fp16", or "fp32" (default)
# parallelize(model, num_gpus=2, dtype='int8', verbose='detail')
Hi, I'm very interested in this work, looks super interesting and useful. Unfortunately one of my models is an EncoderDecoder model and I have no idea how to get it to work. Your FAQ makes it clear I'd have to implement a custom Policy, but it's not clear to me where to start. Do you have an example that one could follow? Is this something that needs to be different for each individual EncoderDecoderModel or can it be automated?
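As a starting point, here is a hedged skeleton following the Policy interface used by the MegatronPolicy example elsewhere on this page. Every attribute path and the BartDecoderLayer import below are placeholders; they must be renamed to match the actual decoder layer of the EncoderDecoderModel in question, and the encoder layer (and cross-attention) would likely need its own policy entry passed alongside this one:

from parallelformers.policies.base import Policy, Layer
from parallelformers.utils.dist_utils import AllReduceLinear

class MyDecoderPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        # Shrink per-rank head count and hidden size (attribute names are placeholders).
        return {
            "self_attn.num_heads": config.num_attention_heads // world_size,
            "self_attn.embed_dim": config.hidden_size // world_size,
        }

    @staticmethod
    def attn_qkv():
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        return [
            Layer(weight="self_attn.out_proj.weight", bias="self_attn.out_proj.bias", replace=AllReduceLinear),
        ]

    @staticmethod
    def mlp_in():
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        from transformers.models.bart.modeling_bart import BartDecoderLayer  # placeholder
        return BartDecoderLayer

# parallelize(model, num_gpus=2, custom_policies=[MyDecoderPolicy])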
It would be great to see the recent BLOOM model from BigScience added to the auto policy. BLOOM is another auto-regressive large language model, so its policy might be inherited from existing policies.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-2b5")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-2b5")
from parallelformers import parallelize
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
Hi,
Would it be possible to support LLaMA models? LLaMA is open and serves as the base model for some other large models, such as Alpaca.
Here's the official doc:
https://huggingface.co/docs/transformers/model_doc/llama
Thanks for your great work!
I was running some performance tests and I noticed that checking if an object is picklable takes a lot of time when the output is big (e.g. when a model returns a large logits tensor), because the whole object is being serialized into memory and then deserialized. I wonder what the cases are in which check_pickable helps, as dataclasses and ModelOutput should be as picklable as their dictionary representation.
If the check is still needed, I guess the code could still be sped up by modifying an object only on pickle failure. That would require some workarounds (perhaps overriding https://github.com/python/cpython/blob/9dc787ea96916552695e79397588fdfa68f22024/Lib/multiprocessing/queues.py#L275), so I want to make sure the check is still necessary before giving it a shot. Another option is to always check for:
parallelformers/parallelformers/parallel/process.py
Lines 236 to 239 in ccaea51
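To see the cost being described, a quick timing sketch with a hypothetical logits-sized tensor (~200 MB of float32); the serialize-then-deserialize round trip is where the time goes:

import pickle
import time

import torch

outputs = {"logits": torch.randn(2, 512, 50257)}  # ~200 MB of float32 logits

start = time.time()
pickle.loads(pickle.dumps(outputs))  # round trip, as the picklability check does
print(f"pickle round trip: {time.time() - start:.2f}s")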
First of all, thanks for this great project!
I'm facing an issue running the test code provided here on Kubernetes.
This is what I'm running inside a Kubeflow pod:
python3 tests/seq2seq_lm.py --test-name=test --name=Helsinki-NLP/opus-mt-en-zh --gpu-from=0 --gpu-to=3 --use-pf
I'm using a g4dn.12xlarge AWS machine with four T4 GPUs.
The pod hangs when executing this line until I manually terminate it.
I suspected this change might have been the culprit, so I ran the same code with v1.2.4 of parallelformers. This time, the pod quits during execution of the same line without outputting any errors, which is odd.
Notably, if I run the same command without --use-pf, it runs fine.
I saw you've reported some problems using Docker. However, memory should not be an issue here, since I'm using the Helsinki-NLP/opus-mt-en-zh model, which is relatively small.
I was wondering if parallelformers code has ever been tested on Kubernetes?
Also would appreciate it if you could look into this issue. Thanks!