
Comments (5)

fattorib avatar fattorib commented on August 15, 2024

As a first debugging step, I will train a much smaller model (~50M parameters) in the multihost setting to see whether the same issues arise. In theory, if everything is working, this model shouldn't overfit.

from zero-transformer.

fattorib avatar fattorib commented on August 15, 2024

The 50M run for 4 epochs on the pod completed, and the training curve looks as expected. Because this run only took around 12 hours, I didn't have to restart training at any point. Next, I need to investigate the code for resuming runs.

For reference, commit 80825b39bb20256f7df48f70537b74efc47e1c67.


fattorib avatar fattorib commented on August 15, 2024

This could also be the result of issues with the data or model scale. Following OPT, I will try logging the L2 norm of the final-layer activations, which they found to be a good indicator of loss divergences. I'm a bit skeptical that this is the cause, but working on it is still good practice.
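
A minimal sketch of the kind of metric that could be logged, assuming JAX and activations of shape (batch, seq_len, d_model). The function name and the choice to average per-token norms are illustrative assumptions, not code from this repository or from OPT:

```python
import jax.numpy as jnp


def final_activation_l2_norm(activations: jnp.ndarray) -> jnp.ndarray:
    """Mean per-token L2 norm of final-layer activations.

    `activations` is assumed to have shape (batch, seq_len, d_model).
    """
    # Cast to float32 so the reduction does not overflow in bf16/fp16.
    acts = activations.astype(jnp.float32)
    # Norm over the feature axis gives one value per token: (batch, seq_len).
    per_token_norm = jnp.linalg.norm(acts, axis=-1)
    # Average over batch and sequence to get a single scalar to log.
    return per_token_norm.mean()


# Toy input: every token is a vector of 16 ones, so each norm is sqrt(16).
toy_activations = jnp.ones((2, 4, 16))
toy_norm = final_activation_l2_norm(toy_activations)
```

Logging this scalar alongside the loss at each step would make it possible to check whether a sharp rise in activation norm precedes a divergence.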


fattorib avatar fattorib commented on August 15, 2024

Made a small change to how data resuming is handled:

  • RNG keys are now split when the dataset resumes. Previously, RNG keys weren't split when resuming from a previous checkpoint, so the RNG state was reused.

Confirmed that this works properly by training a 125M model and forcing a resume about 50% of the way through training. The model trained properly from this checkpoint and the loss didn't diverge. Considering this issue closed for now.
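
The idea behind the fix can be sketched as follows, assuming JAX-style PRNG keys. `split_for_shuffle` and the variable names are hypothetical and only illustrate the pattern, they are not the repository's code:

```python
import jax
import jax.numpy as jnp


def split_for_shuffle(key):
    """Return (carry_key, shuffle_key).

    carry_key is the key to keep (and checkpoint); shuffle_key is
    consumed for the current pass over the data.
    """
    carry_key, shuffle_key = jax.random.split(key)
    return carry_key, shuffle_key


# Before the interruption: split once and shuffle with the subkey.
key = jax.random.PRNGKey(0)
key, shuffle_a = split_for_shuffle(key)
saved_key = key  # stand-in for the key stored in a checkpoint

# After resuming: split the restored key again instead of reusing shuffle_a,
# so the resumed run does not replay the same shuffle.
restored_key, shuffle_b = split_for_shuffle(saved_key)

perm_a = jax.random.permutation(shuffle_a, 8)
perm_b = jax.random.permutation(shuffle_b, 8)
```

If the resumed run reused `shuffle_a` directly, it would draw the same data ordering as before the interruption, which is the reuse described above.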


fattorib avatar fattorib commented on August 15, 2024

Turns out there was an odd duplication bug with the webtext data. I remade the dataset from scratch and will see how new models perform.
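
The exact nature of the duplication bug isn't stated here. As a rough sanity check on a rebuilt dataset, exact duplicates could be counted by hashing each document. This is a sketch under the assumption that documents are available as plain strings, and `count_exact_duplicates` is a hypothetical helper:

```python
import hashlib
from collections import Counter


def count_exact_duplicates(documents):
    """Return (num_unique, num_duplicates) for an iterable of strings.

    Documents are compared after stripping leading/trailing whitespace;
    near-duplicates are not detected.
    """
    counts = Counter(
        hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        for doc in documents
    )
    num_unique = len(counts)
    # Every occurrence beyond the first of each hash is a duplicate.
    num_duplicates = sum(counts.values()) - num_unique
    return num_unique, num_duplicates


docs = ["a b c", "d e f", "a b c ", "g"]
result = count_exact_duplicates(docs)
```

A nonzero duplicate count on data that is expected to be unique would be a cue to inspect the dataset-building step.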

