Comments (8)
From the information you provided:
- The comm kernel is prefixed with `ncclDevKernel`, which suggests P2P is enabled.
- PyTorch's IntraNodeComm is not enabled (IIRC it only supports A100 and onward).
I suggest:
- Checking `nvidia-smi topo -m` to see if the GPUs are directly connected.
- Running `nsys` with `--gpu-metrics-device=all` to collect the activities of both GPUs. Compare them to see if it's a straggler problem.
- Running a minimal all_reduce microbenchmark to see if the issue presents there as well (see the sketch below).
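A minimal sketch of such a microbenchmark (the tensor size, dtype, and iteration count are arbitrary assumptions, not values from this thread):

```python
# bench_allreduce.py; launch with: torchrun --nproc_per_node=2 bench_allreduce.py
import os
import time

import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl")

    x = torch.randn(16 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")

    # Warm up so NCCL communicator setup is excluded from the timing.
    for _ in range(10):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    if rank == 0:
        avg_us = (time.perf_counter() - start) / iters * 1e6
        print(f"all_reduce avg latency: {avg_us:.1f} us")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the straggler pattern shows up here too, the problem is below gpt-fast (NCCL, driver, or topology); if it doesn't, that points back at the model code.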
Thank you for your response. I have confirmed the presence of a straggler issue. As illustrated in the attached images, the first GPU remains idle, waiting for the second GPU during the AllReduce operation.
However, I am puzzled by the cause of this behavior. I have ensured that no other programs are running on my system that could potentially cause interference. This setup was established by simply cloning this repository and executing the provided code. Could there be a misconfiguration or another underlying issue that I might have overlooked?
@duanzhaol I don't think you're using compilation, are you?
Yes, I haven't used compile in my setup. Is compilation a necessary step for tensor parallelism? I think it should work without it.
Compilation will significantly reduce the tensor-parallel latency.
In general, gpt-fast will not be particularly fast without using compilation :P
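For reference, compilation in gpt-fast is enabled roughly like the sketch below. The function name `decode_one_token` matches gpt-fast's generate.py, but the body here is a simplified stand-in rather than the repo's actual code:

```python
import torch

# Simplified stand-in for gpt-fast's per-token decode step (generate.py).
def decode_one_token(model, x, input_pos):
    logits = model(x, input_pos)
    return torch.argmax(logits[:, -1], dim=-1)

# mode="reduce-overhead" enables CUDA graphs, which removes most per-token
# kernel-launch overhead (a likely source of the straggling seen in eager mode).
# fullgraph=True errors on graph breaks instead of silently falling back to eager.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```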
I opted not to use compilation because my objective is to use tensor parallelism on a serverless platform. The initial compilation process is significantly time-consuming, which is impractical in our case since each request necessitates a fresh compilation. This overhead is unacceptable for our use case. If there were a method to persist or checkpoint the results of the compilation, similar to checkpointing an engine in TensorRT, it would greatly improve efficiency. Unfortunately, I have yet to discover a tool or method that provides this capability. Any guidance or suggestions on how to address this challenge would be immensely appreciated.
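One possibly relevant avenue, offered as an assumption rather than something confirmed in this thread: recent PyTorch releases can cache Inductor's compilation artifacts on disk, which shortens warm-start compilation without fully eliminating it:

```python
# Sketch: persist torch.compile (Inductor) artifacts across processes.
# Assumes a recent PyTorch; set these before the first compilation
# (before importing torch is safest).
import os

os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"                 # reuse compiled FX graphs from disk
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/shared/cache"  # e.g. a volume shared across instances

import torch  # imported after the env vars so Inductor picks them up
```

This behaves more like warming a cache than loading a serialized TensorRT engine; some startup cost remains.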
@duanzhaol Out of curiosity, what level of overhead is acceptable?
Maybe less than a second? In serverless, if the function is purely stateless, every request needs to recompile the model. And even if the platform is optimized for Model-as-a-Service, compilation will severely restrict the ability to scale out new instances to handle burst workloads.
Moreover, I'm still puzzled by the significant straggler issue we're encountering without compilation. The kernel launch times, according to the nsys traces, show considerable variability. Does this problem originate from the implementation of gpt-fast, or is it a broader issue with employing tensor parallelism in PyTorch without compilation?
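To narrow down where the launch variability comes from, one option (a sketch, not gpt-fast code) is to capture a torch.profiler trace per rank and overlay them; `run_step` below is a placeholder workload standing in for the real decode step:

```python
import os

import torch
from torch.profiler import ProfilerActivity, profile

rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(rank)

def run_step():
    # Placeholder workload; substitute the real per-token decode step here.
    a = torch.randn(4096, 4096, device="cuda")
    return a @ a

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        run_step()
    torch.cuda.synchronize()

# One trace per rank; compare them in Perfetto or chrome://tracing to see
# whether one GPU consistently launches kernels later than the other.
prof.export_chrome_trace(f"trace_rank{rank}.json")
```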