tunib-ai / parallelformers
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
Home Page: https://tunib-ai.github.io/parallelformers
License: Apache License 2.0
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

if __name__ == '__main__':
    model_name = 'facebook/opt-30b'
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    parallelize(model, num_gpus=8, fp16=True)
This error was thrown by the parallelize method:
Bus error (core dumped)
With parallelformers 1.2.6 and transformers 4.21.11 this error was not thrown; it occurs only with parallelformers 1.2.7 and transformers 4.21.11.
Hi,
Would it be possible to support new OPT models (a suite of GPT-like models)?
Here's the official doc:
https://huggingface.co/docs/transformers/model_doc/opt
Thanks for your great work!
We will continue to log problems with Docker containers in this thread, and we aim to solve them. Ultimately, the goal is to deploy the model in a Kubernetes environment. If anyone has any problems with the Docker environment, please feel free to leave issues. We will actively review and resolve them.
Hello,
To check whether the GPTJForCausalLM model is supported,
I was testing inference with the parallelformers library
using KoGPT3.
The code I ran is as follows.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from parallelformers import parallelize
tokenizer = AutoTokenizer.from_pretrained(
    'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
    bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
)
model = AutoModelForCausalLM.from_pretrained(
    'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
    pad_token_id=tokenizer.eos_token_id,
    torch_dtype='auto'
)
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
prompt = '''[공부, 학생, 힘들] => 힘들더라도 학생의 본분은 공부입니다
[시작, 떨림, 긴장] => 새로운 시작은 항상 떨리고 긴장되죠 파이팅!!
[방어, 제철, 겨울] => 겨울에는 방어가 제철이죠 방어회 어떠세요?
[겸손, 인생, 변화] => 인생은 어떻게 변할지 몰라요 항상 겸손한 태도를 갖춰야해요
[학교, 선생님, 은혜] => 학창시절 선생님의 은혜를 잊지 못해요 감사합니다.
[입사, 회사, 신입] =>'''
temperature = 0.8
max_length = 140
batch_size = 5
inputs = tokenizer([prompt]*batch_size, return_tensors="pt")
## when passing **inputs
gen_tokens = model.generate(**inputs, do_sample=True, temperature=temperature, max_length=max_length)
## when passing input_ids and attention_mask explicitly
## gen_tokens = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, do_sample=True, temperature=temperature, max_length=max_length)
generated = tokenizer.batch_decode(gen_tokens)
The OUTPUT is as follows.
When the model is wrapped with parallelformers as above, the generation quality sometimes degrades (even the grammar of the results breaks down).
I'm leaving this issue to ask whether I'm using it incorrectly, or whether GPT-3-style models are simply not supported :)
I am using a 3060 and a 3090 to split GPT models two ways, including GPT-J and GPT-Neo 2.7B. When generating many tokens, say 500, the model hangs and either takes an abnormal amount of time to finish or does not finish (I kill it). Generating 50 tokens does not have this issue.
During this issue, the 3090 memory is pinned to 100% while the 3060 stays low.
Subjectively, especially for GPT-J, the results, while not complete gibberish, seem to be of lower quality.
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
The system distributes the model across the GPUs, but during generation the second GPU reaches 100% load and never leaves that state. Generation fails.
PyTorch version: 1.10.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA Tesla K80
GPU 1: NVIDIA Tesla K80
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.10.1+cu113
[conda] numpy 1.21.6 pypi_0 pypi
[conda] torch 1.10.1+cu113 pypi_0 pypi
Thank you for the great project!
gpt2, gpt2-medium, gpt2-large
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
outputs = function_(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
return self.greedy_search(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
transformer_outputs = self.transformer(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
outputs = block(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
attn_outputs = self.attn(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
query = self._split_heads(query, self.num_heads, self.head_dim)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
tensor = tensor.view(new_shape)
RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
outputs = function_(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
return self.greedy_search(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
outputs = self(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
transformer_outputs = self.transformer(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
outputs = block(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
attn_outputs = self.attn(
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
query = self._split_heads(query, self.num_heads, self.head_dim)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
tensor = tensor.view(new_shape)
RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
Hi
Thanks for this library.
I am using the Hugging Face zero-shot classification pipeline with the typeform/distilbert-base-uncased-mnli model.
classifier = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli", device=0)
res = classifier(prod_name_lst, tag_values)
prod_name_lst has 500K entries and tag_values has 52 labels.
I am currently using a loop-based approach, as the code above results in an OOM error.
Please advise on how I can use parallelformers to scale to my dataset.
Thanks,
Subham
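For reference, a minimal chunking sketch, assuming prod_name_lst and tag_values as described above; the chunk size of 256 is an arbitrary placeholder, and whether the pipeline composes with a parallelized model depends on the library, but the chunking itself is plain transformers API:

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli",
    device=0,
)

def classify_in_chunks(texts, labels, chunk=256):
    # Feed the sequences in fixed-size chunks so only one chunk's activations
    # live on the GPU at a time, instead of all 500K sequences at once.
    results = []
    for i in range(0, len(texts), chunk):
        results.extend(classifier(texts[i:i + chunk], labels))
    return results

# res = classify_in_chunks(prod_name_lst, tag_values)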
>>> a = Foo()
>>> a.predict()
Can you please add support for gpt_neox?
Its official documentation is here: https://huggingface.co/docs/transformers/model_doc/gpt_neox
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize
import torch
model = AutoModelForCausalLM.from_pretrained("./2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
print("Model Loaded")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
Thanks for sharing the great code.
Let me ask you one question.
In the list of Fully Supported Models, it says Megatron BERT, but the following code does not work.
from transformers import MegatronBertModel
from parallelformers import parallelize
model = MegatronBertModel.from_pretrained('nvidia/megatron-bert-cased-345m')
parallelize(model, num_gpus=2, verbose='detail')
Also, I could not find MegatronBertModel in the policies.
How can I parallelize the MegatronBertModel?
from transformers import TrainingArguments
import torch
# get the number of gpus
num_gpus = torch.cuda.device_count()
if num_gpus > 1:
    from parallelformers import parallelize
    parallelize(model, num_gpus=num_gpus, fp16=True, verbose="detail")
gives
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)
WARNING  No nodes ran. Repeat the previous command to attempt a new run.  runner.py:213
[10/15/23 12:57:26] ERROR  Node 'sort_using_baal: preprocess_and_sort([baal.reed_textkernel_labeled,params:reed.pretrained_model_name,reed.aimwel_labeled.finetuned_pre_trained_isco_classifier]) -> [reed.textkernel_labeled.sorted_jobs,baal.reed_textkernel_labeled_parquet]' failed with error: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)  node.py:356
python 3.10.1
parallelformers latest
OS: Ubuntu
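A hedged guess at a workaround: worker_count=9 against world_size=8 suggests the new processes attached to a stale rendezvous store from a previous run. torch.distributed reads MASTER_ADDR and MASTER_PORT from the environment (see the _env_rendezvous_handler trace later on this page), so pointing a fresh run at an unused port avoids rejoining the old store; the port below is a placeholder:

import os

# Hypothetical workaround: choose a fresh rendezvous port before calling
# parallelize() so the new process group cannot attach to a stale store.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29501"  # any free port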
Hi there!
Thanks for the awesome work on this lib! Just wanted to ask what the recommended way is to clean up a loaded model that has been parallelized using this library. What method should be called to clean up all the resources, move data off the GPU, empty the CUDA cache, and shut down the master process?
Tried to run this, but it hung:
model = ...
p = parallelize(
    model,
    num_gpus=2,
    fp16=True,
    verbose="simple",
)
# Do some inference
p.deparallelize()  # --> This hung
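If I read the README correctly, deparallelization is exposed on the model object rather than on parallelize()'s return value; a hedged sketch:

import torch

# Hedged sketch: call deparallelize on the model itself, then release
# whatever cached GPU memory remains in the current process.
model.deparallelize()
torch.cuda.empty_cache()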
Is it possible to use this library for CNN networks implemented in PyTorch? Can you show me an example?
The model:
google/ul2
The Hardware:
2x RTX Titan
AMD Ryzen 9 5900X 12-Core Processor
64 GB RAM
The Environment:
Python 3.9.13
PyTorch 1.12.0+cu102
NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5
Code used:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from parallelformers import parallelize
import torch
tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"
inputs = tokenizer(input_string, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
Error Message:
$ python test.py
/home/******/miniconda3/envs/ul2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
Is this something I can fix? I would love to use this large model, as it's near SOTA on everything :)
I am trying to use RoBERTa NER and BERT NER uncased, but for both models I am getting the following issue. Is this something still under development, or is something wrong on my side?
Error:
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
It was discussed in #4.
When an attn_qkv Layer is set with n_fused>1 and reversed=False, the shape of its sliced weight is incorrect.
It seems that the root cause is here:
parallelformers/parallelformers/parallel/slicing.py
Lines 79 to 95 in 436573b
For an attn_qkv weight, the arg dim is 0. So when reversed=False and n_fused>1, the tensor is chunked on dim 0 and then concatenated on dim 1, which makes its shape incorrect.
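A minimal sketch of the mismatch with hypothetical sizes (hidden=8, n_fused=3, world_size=2): chunking a fused [3*hidden, hidden] weight on dim 0 and concatenating a rank's pieces on dim 1 yields the wrong shape, while concatenating on dim 0 keeps the fused layout:

import torch

hidden, world_size, n_fused = 8, 2, 3
w = torch.randn(n_fused * hidden, hidden)             # fused q/k/v weight: [24, 8]
pieces = torch.chunk(w, world_size * n_fused, dim=0)  # six [4, 8] pieces

# Buggy path: cat rank 0's q/k/v pieces on dim 1 -> [4, 24], not a valid slice.
wrong = torch.cat([pieces[0], pieces[2], pieces[4]], dim=1)
print(wrong.shape)  # torch.Size([4, 24])

# Expected: stack the same pieces on dim 0 -> [12, 8], one q/k/v slice per rank.
right = torch.cat([pieces[0], pieces[2], pieces[4]], dim=0)
print(right.shape)  # torch.Size([12, 8])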
I started recording my work here; please take note.
@stas00 @RezaYazdaniAminabadi
Hello,
Do I need to use the no_grad context manager, or is it already applied internally?
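For what it's worth, a minimal sketch of applying it explicitly, assuming model and inputs from the surrounding examples; if the library already disables gradients internally, the extra context manager is harmless:

import torch

# Explicitly disabling autograd tracking costs nothing if it is already off,
# and saves activation memory during generation if it is not.
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)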
Hello, first of all congratulations on this amazing project. It's simple, efficient, and versatile. Very useful.
In some cases, one has several GPUs but not enough RAM to parallelize the model.
When loading the model on GPU and then parallelizing, I'm getting the error below:
AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.
It doesn't stop the script, but it seems that the parallelization fails.
My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient) or not at all?
Thanks!
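One hedged option on the RAM side, using plain transformers API rather than anything parallelformers-specific: loading directly in half precision with low_cpu_mem_usage=True roughly halves the CPU memory needed while still leaving the model on CPU, which is what the assertion expects. The model name below is a placeholder:

import torch
from transformers import AutoModelForCausalLM

# Load on CPU in fp16 via the low-memory loading path; the model stays on
# CPU as parallelize() asserts, but needs roughly half the RAM.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",  # placeholder model name
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)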
Dear Community,
I could not find the Falcon models in either the list of supported or unsupported models. Are these models supported by parallelformers? If not, are there any plans to add support for them on the roadmap?
It doesn't look like this library is still in development. What are some other libraries you can point us to that do similar things? Or has HF integrated something similar themselves?
Hi, I wonder if there is any plan to support the salesforce/codegen model family?
Using a p4d.24xlarge:
from parallelformers import parallelize
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "facebook/opt-66b"
batch_size = [1]
batch = [["out story begins on"] * bs for bs in batch_size]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
inputs = [tokenizer(seq, return_tensors="pt").input_ids for seq in batch]
parallelize(model, num_gpus=8, fp16=True)
for _ in range(100):
    model.generate(
        torch.cat(inputs, dim=0),
        do_sample=True,
        max_length=2048,
        num_return_sequences=1,
    )
It loads okay and begins performing inference.
I can see all 8 GPUs at 90+% utilization in nvidia-smi for a while.
Then eventually one GPU drops to 0% and the others jump to 100%.
Terminal shows:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
df = multiprocessing.reduction.DupFd(fd)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
return resource_sharer.DupFd(fd)
File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
It then seems to hang forever from there.
I do realize this stack trace doesn't give enough to trace back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?
I tried running the example from the readme but received the above error. Does that mean that my hardware is not supported?
Is there any way to perform tensor parallelism across multiple nodes instead just in a single node? Any tips would be helpful!
Trying to use parallelformers with the megatron-11b pip package.
The MegatronPolicy class is as provided on the megatron-11b PyPI page.
from megatron_11b import MegatronForCausalLM, MegatronTokenizer
tokenizer = MegatronTokenizer.from_pretrained("./megatron-11B")
model = MegatronForCausalLM.from_pretrained("./megatron-11B")
# https://tunib-ai.github.io/parallelformers/intro/POLICY.html
from parallelformers.policies.base import Policy, Layer
from parallelformers.utils.dist_utils import AllReduceLinear
from megatron_11b.modeling_megatron import MegatronDecoderLayer
class MegatronPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        return {
            # 1. reduce hidden size
            "self_attn.embed_dim": config.d_model // world_size,
            # 2. reduce number of heads
            "self_attn.num_heads": config.encoder_attention_heads // world_size,
        }

    @staticmethod
    def attn_qkv():
        return [
            Layer(
                weight="self_attn.q_proj.weight",
                bias="self_attn.q_proj.bias",
            ),
            Layer(
                weight="self_attn.k_proj.weight",
                bias="self_attn.k_proj.bias",
            ),
            Layer(
                weight="self_attn.v_proj.weight",
                bias="self_attn.v_proj.bias",
            ),
        ]

    @staticmethod
    def attn_out():
        return [
            Layer(
                weight="self_attn.out_proj.weight",
                bias="self_attn.out_proj.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def mlp_in():
        return [
            Layer(
                weight="fc1.weight",
                bias="fc1.bias",
            ),
        ]

    @staticmethod
    def mlp_out():
        return [
            Layer(
                weight="fc2.weight",
                bias="fc2.bias",
                replace=AllReduceLinear,
            ),
        ]

    @staticmethod
    def original_layer_class():
        return MegatronDecoderLayer
from parallelformers import parallelize
parallelize(model, num_gpus=8, fp16=True, verbose='detail', custom_policies=[MegatronPolicy])
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,AutoModelForCausalLM
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')
parallelize(model, num_gpus=4, fp16=False)
Thanks for the great repo! I have tried it out; it's really amazing to load such a large model across multiple GPUs.
Currently, GPT-J is supported only in HF 4.7.0 and by installing
pip install git+https://github.com/finetuneanon/transformers@gpt-j
Your requirements pin HF 4.8.0, which is needed to load several new models. Soon GPT-J will be fully integrated in HF: huggingface/transformers#12243
I am wondering if there is an easy way to have backward compatibility, or to include GPT-J soon.
Thanks again for your great repo 👍🏻
-- Andrea
> os.environ["CUDA_VISIBLE_DEVICES"]="1" , parallelize(model_2, ... )
.... (an error occurs when loading the second model)
===========================================================
model name : ./model/ko-gpt-trinity-1.2B-v0.5
CUDA_VISIBLE_DEVICES : 1
request_gpu : 1
used_gpu : 2
===========================================================
Process ParallelProcess-2:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/process.py", line 254, in run
custom_policies=self.custom_policies,
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
dist.init_process_group(backend=backend)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: Address already in use
def create_process_group(self, backend: str):
    """
    Create PyTorch distributed process group.

    Args:
        backend (str): distributed backend

    Returns:
        ProcessGroupNCCL: process group for parallelization
    """
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)

    torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
    new_group = dist.new_group([i for i in range(self.num_gpus)])
    return new_group
I'm getting the following error while trying to run the example in the getting started document
Process ParallelProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
engine = ParallelEngine(
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
torch._C._cuda_setDevice(device)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ParallelProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
engine = ParallelEngine(
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
self.mp_group = self.create_process_group(backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
dist.init_process_group(backend=backend)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer
This is my code. I'm running it on an AWS g5.12xlarge instance with 4 GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
from parallelformers import parallelize
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1B.0 Off | 0 |
| 0% 29C P8 19W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G On | 00000000:00:1C.0 Off | 0 |
| 0% 29C P8 16W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G On | 00000000:00:1D.0 Off | 0 |
| 0% 29C P8 16W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 30C P8 15W / 300W | 2MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I pip installed multiprocess (https://pypi.org/project/multiprocess/), as initially I kept getting "importing multiprocess as mp, multiprocess not found". Then I noticed there was a PR by @Oaklight that removed torch.multiprocessing. Maybe I'm not using the right multiprocessing library? Reverting it back to torch.multiprocessing caused the same error noticed by @Oaklight.
import parallelformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]',
)
model = AutoModelForCausalLM.from_pretrained(model_name)  # .to(device='cuda', non_blocking=True)
_ = model.eval()

parallelformers.parallelize(model, num_gpus=4, fp16=True, verbose='detail')

tok = tokenizer("My name is Kevin." * 10, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)
When running inference repeatedly, occasionally a particular GPU node gets stuck with its utilization pinned at 100%, blocking everything.
Even Ctrl+C can't stop the process, because it is deadlocked on a semaphore.
Wondering if it was an error in my code, I deliberately induced a bug as below; in that case the utilization of all nodes drops to 0% and Ctrl+C
prints the causing error, so this doesn't seem to be the same issue.
tok = tokenizer("My name is Kevin."*2048, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)
File "/opt/conda/lib/python3.8/site-packages/transformers/models/albert/modeling_albert.py", line 368, in forward
self.dense.weight.t()
RuntimeError: shape '[6, 64, 384]' is invalid for input of size 294912
I wonder if there's any plan to support 8-bit inference in parallelformers. Right now, we can load 🤗 transformers models in 8-bit, e.g.:
model_8bit = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
However, it's not possible to parallelize() the model with parallelformers, since only fp16 mode is supported at the moment.
If 8-bit inference could be supported, it would be good to add another argument like the one for fp16, e.g.:
from parallelformers import parallelize
model = AutoModelForCausalLM.from_pretrained(model_name)
parallelize(model, num_gpus=2, int8=True, verbose='detail')
# or one argument for precision mode, where dtype can be either "int8", "fp16", or "fp32" (default)
# parallelize(model, num_gpus=2, dtype='int8', verbose='detail')
Hi, I'm very interested in this work, looks super interesting and useful. Unfortunately one of my models is an EncoderDecoder model and I have no idea how to get it to work. Your FAQ makes it clear I'd have to implement a custom Policy, but it's not clear to me where to start. Do you have an example that one could follow? Is this something that needs to be different for each individual EncoderDecoderModel or can it be automated?
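As a starting point, here is a hedged skeleton following the Policy interface used by the MegatronPolicy example elsewhere on this page. Every attribute path and the BartDecoderLayer import below are placeholders; they must be renamed to match the actual decoder layer of the EncoderDecoderModel in question, and the encoder layer (and cross-attention) would likely need its own policy entry passed alongside this one:

from parallelformers.policies.base import Policy, Layer
from parallelformers.utils.dist_utils import AllReduceLinear

class MyDecoderPolicy(Policy):
    @staticmethod
    def replace_arguments(config, world_size):
        # Shrink per-rank head count and hidden size (attribute names are placeholders).
        return {
            "self_attn.num_heads": config.num_attention_heads // world_size,
            "self_attn.embed_dim": config.hidden_size // world_size,
        }

    @staticmethod
    def attn_qkv():
        return [
            Layer(weight="self_attn.q_proj.weight", bias="self_attn.q_proj.bias"),
            Layer(weight="self_attn.k_proj.weight", bias="self_attn.k_proj.bias"),
            Layer(weight="self_attn.v_proj.weight", bias="self_attn.v_proj.bias"),
        ]

    @staticmethod
    def attn_out():
        return [
            Layer(weight="self_attn.out_proj.weight", bias="self_attn.out_proj.bias", replace=AllReduceLinear),
        ]

    @staticmethod
    def mlp_in():
        return [Layer(weight="fc1.weight", bias="fc1.bias")]

    @staticmethod
    def mlp_out():
        return [Layer(weight="fc2.weight", bias="fc2.bias", replace=AllReduceLinear)]

    @staticmethod
    def original_layer_class():
        from transformers.models.bart.modeling_bart import BartDecoderLayer  # placeholder
        return BartDecoderLayer

# parallelize(model, num_gpus=2, custom_policies=[MyDecoderPolicy])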
It would be great to see the recent BLOOM model from BigScience added to the auto policy. BLOOM is another auto-regressive large language model, so its policy might be inherited from existing policies.
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-2b5")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-2b5")
from parallelformers import parallelize
parallelize(model, num_gpus=2, fp16=True, verbose='detail')
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)
print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
Hi,
Would it be possible to support LLaMA models? LLaMA is open and serves as the base model for some other large models, such as Alpaca.
Here's the official doc:
https://huggingface.co/docs/transformers/model_doc/llama
Thanks for your great work!
I was running some performance tests and I noticed that checking if an object is picklable takes a lot of time when the output is big (e.g. when a model returns a large logits tensor), because the whole object is being serialized into memory and then deserialized. I wonder what the cases are in which check_pickable helps, as dataclasses and ModelOutput should be as picklable as their dictionary representation.
If the check is still needed, I guess the code could still be sped up by modifying an object only on pickle failure. That would require some workarounds (perhaps overriding https://github.com/python/cpython/blob/9dc787ea96916552695e79397588fdfa68f22024/Lib/multiprocessing/queues.py#L275), so I want to make sure the check is still necessary before giving it a shot. Another option is to always check for:
parallelformers/parallelformers/parallel/process.py
Lines 236 to 239 in ccaea51
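To see the cost being described, a quick timing sketch with a hypothetical logits-sized tensor (~200 MB of float32); the serialize-then-deserialize round trip is where the time goes:

import pickle
import time

import torch

outputs = {"logits": torch.randn(2, 512, 50257)}  # ~200 MB of float32 logits

start = time.time()
pickle.loads(pickle.dumps(outputs))  # round trip, as the picklability check does
print(f"pickle round trip: {time.time() - start:.2f}s")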
First of all, thanks for this great project!
I'm facing an issue running the test code provided here on Kubernetes.
This is what I'm running inside a Kubeflow pod:
python3 tests/seq2seq_lm.py --test-name=test --name=Helsinki-NLP/opus-mt-en-zh --gpu-from=0 --gpu-to=3 --use-pf
I'm using a g4dn.12xlarge AWS machine with four T4 GPUs.
The pod hangs when executing this line until I manually terminate it.
I suspected this change might have been the culprit, so I ran the same code with v1.2.4 of parallelformers. This time, the pod quits during execution of the same line without outputting any errors, which is odd.
Notably, if I run the same command without --use-pf, it runs fine.
I saw you've reported some problems using Docker. However, memory should not be an issue here, since I'm using the Helsinki-NLP/opus-mt-en-zh model, which is relatively small.
I was wondering if parallelformers code has ever been tested on Kubernetes?
Also would appreciate it if you could look into this issue. Thanks!