Comments (9)
How many epochs does your run go for? Also, what is the save_filename? Is it unique per timestamp (to verify it isn't being overwritten each time)?
from composer.
Only 1 checkpoint is getting saved, in the format 'ep3-ba1458-rank0.pt'. If I specify a save interval of 1ep, only the 1st epoch gets saved even though I have 30 epochs. And if the save interval is 3ep, the 3rd epoch gets saved and the rest are ignored.
from composer.
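To rule out the overwrite question raised above without running training, the filename template can be rendered by hand for every epoch boundary. This is only a sketch: the template is copied from the 'ep3-ba1458-rank0.pt' name reported in this thread, `BATCHES_PER_EPOCH = 486` is inferred from 1458 / 3 and is an assumption, and `checkpoint_names` and `overwrites_itself` are hypothetical helpers, not Composer APIs.

```python
# Template taken from the reported filename 'ep3-ba1458-rank0.pt'.
TEMPLATE = "ep{epoch}-ba{batch}-rank{rank}.pt"
# Assumption: 1458 batches after 3 epochs implies 486 batches per epoch.
BATCHES_PER_EPOCH = 486


def checkpoint_names(template, num_epochs, batches_per_epoch, rank=0):
    """Render the template once per epoch end (hypothetical helper)."""
    return [
        template.format(epoch=epoch, batch=epoch * batches_per_epoch, rank=rank)
        for epoch in range(1, num_epochs + 1)
    ]


def overwrites_itself(names):
    """True if two epoch ends would map to the same filename."""
    return len(set(names)) < len(names)


names = checkpoint_names(TEMPLATE, 30, BATCHES_PER_EPOCH)
print(names[2])                  # ep3-ba1458-rank0.pt
print(overwrites_itself(names))  # False: each epoch gets its own file

# A template without {epoch} or {batch} would collide on every save.
fixed = checkpoint_names("latest-rank{rank}.pt", 30, BATCHES_PER_EPOCH)
print(overwrites_itself(fixed))  # True
```

With a template of this shape every epoch maps to a distinct name, so a single surviving file is unlikely to be explained by same-name overwriting alone.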
Hm... do you get any errors or traces? Mind sharing a minimal repro please?
from composer.
There are no errors and everything is working fine, including plotting the loss etc.
I am using this git branch for MosaicBERT pretraining: https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples
from composer.
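Since the run reports no errors, one way to pin down the symptom for a repro is to compare what is on disk against what the save interval should have produced. The sketch below assumes checkpoints follow the 'ep3-ba1458-rank0.pt' naming seen in this thread; `saved_epochs` and `missing_epochs` are hypothetical helpers, and the folder passed to `saved_epochs` is whatever save folder the run was configured with.

```python
import re
from pathlib import Path

# Matches names such as 'ep3-ba1458-rank0.pt' (format reported in this thread).
CKPT_PATTERN = re.compile(r"ep(\d+)-ba(\d+)-rank(\d+)\.pt$")


def saved_epochs(folder):
    """Return the sorted epoch numbers that have a checkpoint file in folder."""
    epochs = set()
    for path in Path(folder).glob("*.pt"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)


def missing_epochs(found, interval, total_epochs):
    """Return the epochs where a checkpoint was expected but not found."""
    expected = range(interval, total_epochs + 1, interval)
    return [epoch for epoch in expected if epoch not in found]


# Example with the numbers from this thread: interval 3ep, 30 epochs,
# and only the epoch-3 checkpoint present.
print(missing_epochs([3], interval=3, total_epochs=30))
# [6, 9, 12, 15, 18, 21, 24, 27, 30]
```

Attaching the output of `saved_epochs` for the real save folder to the issue would show exactly which epochs were dropped.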
Mind sharing what version of composer you are using?
from composer.
0.17.2
from composer.
That's a very old (>6 months old) version of composer. Is there a reason you need to use that old of a version?
from composer.
I have been using this particular docker image https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples to use flashattention, triton etc. This setup requires specific versions; if I update to the latest composer, triton fails.
But I have now managed by updating to a slightly higher version, 0.19, and it works fine. I still wonder why composer failed only at the checkpointing part with the previous 0.17.2 version.
from composer.
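For anyone pinned to an older image, a quick check of the installed version against the one that worked here may help. This is a sketch under assumptions: it looks up the `mosaicml` distribution (the name Composer is published under on PyPI), and the `0.19` threshold is only the version reported as working in this thread, not a confirmed fix version. `version_tuple` and `installed_version` are hypothetical helpers.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '0.17.2' into (0, 17, 2), keeping only leading digits per part."""
    matches = (re.match(r"\d+", part) for part in version.split("."))
    return tuple(int(m.group()) for m in matches if m)


def installed_version(dist="mosaicml"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


# Assumption: 0.19 is the version reported as working in this thread.
KNOWN_GOOD = version_tuple("0.19")

current = installed_version()
if current is None:
    print("composer (mosaicml) is not installed in this environment")
elif version_tuple(current) < KNOWN_GOOD:
    print(f"composer {current} is older than the version reported to work")
else:
    print(f"composer {current} is at or above the version reported to work")
```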
Going to close as it seems to work.
I'm not super sure what the bug was, but it's certainly possible there was an issue in older versions. Definitely recommend upgrading to latest :)
from composer.