
Comments (8)

yifuwang avatar yifuwang commented on July 16, 2024

From the information you provided:

  • The comm kernel is prefixed with ncclDevKernel which suggests P2P is enabled
  • PyTorch's IntraNodeComm is not enabled (IIRC it only supports A100 and onward)

I suggest:

  • Checking nvidia-smi topo -m to see if the GPUs are directly connected
  • Running nsys with --gpu-metrics-device=all to collect the activities of both GPUs. Compare them to see if it's a straggler problem
  • Running a minimal all_reduce microbenchmark to see if the issue also presents
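A minimal all_reduce microbenchmark along the lines suggested above could look like this (a sketch, assuming two GPUs on one node, the NCCL backend, and launch via `torchrun --nproc_per_node=2`; the tensor size and iteration counts are arbitrary):

```python
# allreduce_bench.py -- minimal all_reduce latency microbenchmark (sketch).
# Launch: torchrun --nproc_per_node=2 allreduce_bench.py
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    # Assumes single-node: global rank == local GPU index.
    torch.cuda.set_device(rank)
    x = torch.randn(1 << 20, device="cuda")  # 2^20 fp32 elements = 4 MiB

    for _ in range(10):  # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if rank == 0:
        print(f"avg all_reduce latency: {elapsed / iters * 1e6:.1f} us")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this standalone loop shows the same idle gaps under nsys, the straggler is below gpt-fast (NCCL/topology/clocks); if it is clean, the problem is in how the model issues work between collectives.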

from gpt-fast.

duanzhaol avatar duanzhaol commented on July 16, 2024

Thank you for your response. I have confirmed the presence of a straggler issue. As illustrated in the attached images, the first GPU remains idle, waiting for the second GPU during the AllReduce operation.
[attached image]
However, I am puzzled by the cause of this behavior. I have ensured that no other programs are running on my system that could potentially cause interference. This setup was established by simply cloning this repository and executing the provided code. Could there be a misconfiguration or another underlying issue that I might have overlooked?


Chillee avatar Chillee commented on July 16, 2024

@duanzhaol I don't think you're using compilation are you?


duanzhaol avatar duanzhaol commented on July 16, 2024

@duanzhaol I don't think you're using compilation are you?

Right, I haven't used compilation in my process. Is compilation a necessary step for tensor parallelism? I think it should work without it.


Chillee avatar Chillee commented on July 16, 2024

Compilation will significantly reduce the tensor-parallel latency.

In general, gpt-fast will not be particularly fast without using compilation :P
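For reference, enabling compilation is a one-line wrapper around the decode step (a sketch, not the exact gpt-fast invocation; `decode_one_token` here is a hypothetical stand-in for the model's per-token function). The `reduce-overhead` mode uses CUDA graphs, which eliminates per-kernel launch gaps of exactly the kind that show up as stragglers in eager mode:

```python
import torch


def decode_one_token(model, x):
    # Hypothetical per-token decode step; stands in for the real function.
    return model(x)


# fullgraph=True asks the compiler to capture the whole step without graph
# breaks; "reduce-overhead" enables CUDA graphs to hide launch latency.
compiled_decode = torch.compile(
    decode_one_token, mode="reduce-overhead", fullgraph=True
)
```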


duanzhaol avatar duanzhaol commented on July 16, 2024

I opted not to use compilation because my objective is to use tensor parallelism on a serverless platform. The initial compilation is significantly time-consuming, which is impractical in our case since each request necessitates a fresh compilation. This overhead is unacceptable for our use case. If there were a method to persist or checkpoint the results of compilation, similar to serializing an engine in TensorRT, it would greatly improve efficiency. Unfortunately, I have yet to find a tool or method that provides this capability. Any guidance or suggestions on how to address this challenge would be immensely appreciated.
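One partial mitigation worth noting: Inductor (the default `torch.compile` backend) keeps an on-disk cache of compiled artifacts, and its location can be redirected to durable storage so warm instances may reuse earlier compilation work. This is a hedged sketch, not a full TensorRT-style engine serialization: cache-hit behavior varies across PyTorch versions, and the path below is hypothetical.

```python
import os

# Must be set before torch triggers any compilation; hypothetical shared path.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/mnt/shared/inductor-cache"

import torch


@torch.compile
def double(x):
    # Trivial function; its compiled artifacts land in the cache dir above,
    # where a later process on the same hardware/version may pick them up.
    return x * 2
```

Even with a warm cache, some per-process startup cost remains (Dynamo tracing, CUDA graph capture), so sub-second cold starts are not guaranteed.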


Chillee avatar Chillee commented on July 16, 2024

@duanzhaol Out of curiosity, what level of overhead is acceptable?


duanzhaol avatar duanzhaol commented on July 16, 2024

@duanzhaol Out of curiosity, what level of overhead is acceptable?

Maybe less than a second? In serverless, if the function is purely stateless, every request needs to recompile the model. And if it is run as a Model-as-a-Service platform, compilation will severely restrict the ability to scale out new instances to handle burst workloads.

Moreover, I'm still puzzled by the significant straggler issue we're encountering without compilation. The kernel launch times, according to nsys traces, show considerable variability. Does this problem originate from the implementation of gpt-fast, or is it a broader issue with using tensor parallelism in PyTorch without compilation?

