Comments (8)
From the information you provided:
- The comm kernel is prefixed with `ncclDevKernel`, which suggests P2P is enabled.
- PyTorch's IntraNodeComm is not enabled (IIRC it only supports A100 and onward).
I suggest:
- Checking `nvidia-smi topo -m` to see if the GPUs are directly connected.
- Running `nsys` with `--gpu-metrics-device=all` to collect the activities of both GPUs. Compare them to see if it's a straggler problem.
- Running a minimal all_reduce microbenchmark to see if the issue presents there as well (see the sketch below).
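A minimal sketch of such a microbenchmark (the tensor size, dtype, and iteration count are arbitrary assumptions, not values from this thread):

```python
# bench_allreduce.py; launch with: torchrun --nproc_per_node=2 bench_allreduce.py
import os
import time

import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    x = torch.randn(16 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")

    # Warm up so NCCL communicator setup is excluded from the timing.
    for _ in range(10):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    if rank == 0:
        avg_us = (time.perf_counter() - start) / iters * 1e6
        print(f"all_reduce avg latency: {avg_us:.1f} us")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the straggler pattern shows up here too, the problem is below gpt-fast (NCCL, driver, or topology); if it doesn't, that points back at the model code.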
Thank you for your response. I have confirmed the presence of a straggler issue. As illustrated in the attached images, the first GPU remains idle, waiting for the second GPU during the AllReduce operation.
However, I am puzzled by the cause of this behavior. I have ensured that no other programs are running on my system that could potentially cause interference. This setup was established by simply cloning this repository and executing the provided code. Could there be a misconfiguration or another underlying issue that I might have overlooked?
@duanzhaol I don't think you're using compilation, are you?
Yes, I haven't used compile in my setup. Is compilation a necessary step for tensor parallelism? I think it should work without it.
Compilation will significantly reduce the tensor-parallel latency.
In general, gpt-fast will not be particularly fast without using compilation :P
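For reference, compilation in gpt-fast is enabled roughly like the sketch below. The function name `decode_one_token` matches gpt-fast's generate.py, but the body here is a simplified stand-in rather than the repo's actual code:

```python
import torch

# Simplified stand-in for gpt-fast's per-token decode step (generate.py).
def decode_one_token(model, x, input_pos):
    logits = model(x, input_pos)
    return torch.argmax(logits[:, -1], dim=-1)

# mode="reduce-overhead" enables CUDA graphs, which removes most per-token
# kernel-launch overhead (a likely source of the straggling seen in eager mode).
# fullgraph=True errors on graph breaks instead of silently falling back to eager.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```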
I opted not to use compilation because my objective is to use tensor parallelism on a serverless platform. The initial compilation process is significantly time-consuming, which is impractical in our case since each request necessitates a fresh compilation. This overhead is unacceptable for our use case. If there were a method to persist or checkpoint the results of the compilation, similar to checkpointing an engine in TensorRT, it would greatly improve efficiency. Unfortunately, I have yet to discover a tool or method that provides this capability. Any guidance or suggestions on how to address this challenge would be immensely appreciated.
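One possibly relevant avenue, offered as an assumption rather than something confirmed in this thread: recent PyTorch releases can cache Inductor's compilation artifacts on disk, which shortens warm-start compilation without fully eliminating it:

```python
# Sketch: persist torch.compile (Inductor) artifacts across processes.
# Assumes a recent PyTorch; set these before the first compilation
# (before importing torch is safest).
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"                 # reuse compiled FX graphs from disk
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/shared/cache"  # e.g. a volume shared across instances

import torch  # imported after the env vars so Inductor picks them up
```

This behaves more like warming a cache than loading a serialized TensorRT engine; some startup cost remains.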
@duanzhaol Out of curiosity, what level of overhead is acceptable?
Maybe less than a second? In serverless, if the function is purely stateless, every request needs to recompile the model. And even if the platform is optimized for Model-as-a-Service, compilation will severely restrict the ability to scale out new instances to handle burst workloads.
Moreover, I'm still puzzled by the significant straggler issue we're encountering without compilation. The kernel launch times, according to the nsys traces, show considerable variability. Does this problem originate from the implementation of gpt-fast, or is it a broader issue with employing tensor parallelism in PyTorch without compilation?
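To narrow down where the launch variability comes from, one option (a sketch, not gpt-fast code) is to capture a torch.profiler trace per rank and overlay them; `run_step` below is a placeholder workload standing in for the real decode step:

```python
import os

import torch
from torch.profiler import ProfilerActivity, profile

rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(rank)

def run_step():
    # Placeholder workload; substitute the real per-token decode step here.
    a = torch.randn(4096, 4096, device="cuda")
    return a @ a

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        run_step()
    torch.cuda.synchronize()

# One trace per rank; compare them in Perfetto or chrome://tracing to see
# whether one GPU consistently launches kernels later than the other.
prof.export_chrome_trace(f"trace_rank{rank}.json")
```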