Comments (2)

varunponda commented on May 24, 2024

You are getting a tensor size mismatch error because the batch size is not divisible by the number of GPUs you are using. Change either the batch size or the number of GPUs so that the batch size is a multiple of the GPU count. Please let me know if the issue persists.
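As a rough illustration of the divisibility check described above, here is a minimal sketch. It assumes the batch size is a global value that gets split evenly across GPUs; the function name `per_gpu_batch_size` is hypothetical and not part of phenaki-pytorch.

```python
def per_gpu_batch_size(batch_size: int, num_gpus: int) -> int:
    """Return the per-GPU share of a global batch, or raise if it cannot be split evenly.

    Hypothetical helper for illustration only; assumes a global batch
    that is divided equally among the GPUs.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be a positive integer")
    if batch_size % num_gpus != 0:
        raise ValueError(
            f"batch_size={batch_size} is not divisible by num_gpus={num_gpus}"
        )
    return batch_size // num_gpus
```

Calling this once before training starts (for example with `batch_size=3` and `num_gpus=3`) would surface an uneven split early instead of as a shape error inside the model.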

from phenaki-pytorch.

sezginerr commented on May 24, 2024

Hello @varunponda,

Thank you very much for your reply. Even after following your suggestions, I still get the division by 0 error when using batch_size=3 with 3 GPUs. I initially suspected my custom dataloader, since it groups batches by their number of slices and my input images have a varying number of layers. I therefore sliced the input images to a uniform size and used the dataloader from this repository instead, but the error persisted. The error also occurs before any data is loaded: I am working with ~10 TB of data, dataset initialization takes ~1 hour, and the error appears after a couple of seconds. Also, the code works when I don't wrap the model.

To further investigate the problem, I attempted the following:

- Used PyTorch's DistributedDataParallel (DDP) to wrap the model instead of the Hugging Face Accelerate accelerator, but the issue remained.
- Tested PyTorch 2.0 on an 8-GPU node in an HPC cluster to rule out issues with my local SSH setup, but the problem remained.
- Using Fully Sharded Data Parallel (FSDP) in the Accelerate launch seems to resolve the issue, so for now I am using it for my training. I did need to change the code a little to make it work, though: the dataloader is wrapped before cycle is called, which causes the dataloader to stop loading images after 1 iteration with FSDP. I simply called cycle(dl) before wrapping the dataloader.
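For readers unfamiliar with the `cycle` helper mentioned in the last point, here is a minimal sketch of what such a helper typically looks like. It assumes `cycle` is a generator that restarts the dataloader each time it is exhausted; the exact definition in the repository may differ, and a plain list stands in for the dataloader so the example runs without PyTorch.

```python
from itertools import islice


def cycle(dl):
    # Assumed behavior: iterate over `dl` forever, restarting it
    # from the beginning each time it runs out of batches.
    while True:
        for batch in dl:
            yield batch


# Stand-in for a dataloader that yields two batches per epoch.
batches = [[0, 1, 2], [3, 4, 5]]

# Draw five batches; the stream wraps around after the second one.
first_five = list(islice(cycle(batches), 5))
```

Because the generator calls `iter()` on whatever object it was given at each restart, the order in which the dataloader is wrapped and passed to `cycle` determines which object gets re-iterated.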

I believe that the discriminator model may not be wrapped correctly with DDP. I would sincerely appreciate any further guidance or recommendations you might have to help me address this issue.

Thank you once again for your assistance.

