Comments (5)
Using the sample deepspeed command, I hit the same situation: the process is killed and returns -9. How can I fix this?
from deepseek-coder.
Please check your package versions and whether you have enough memory.
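For readers unsure what the -9 means: under Python's `subprocess` convention, a negative return code is the number of the signal that terminated the process, and signal 9 is SIGKILL, which is what the Linux OOM killer sends. The sketch below decodes such a code. It assumes the launcher reports return codes using that convention, and `describe_return_code` is a hypothetical helper name, not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Describe a subprocess-style return code (hypothetical helper).

    Assumes the subprocess convention: a negative value -N means the
    process was terminated by signal N.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"unknown signal {-code}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_return_code(-9))
```

A SIGKILL on its own does not prove an out-of-memory kill, so it is still worth watching system RAM while the job starts up.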
from deepseek-coder.
Thanks for the quick response!
You are right. It looks like an OOM, but of system RAM rather than VRAM. All 170GB of system RAM was in use just before the Python process died.
I changed pin_memory to false in configs/ds_config_zero3.json for both offload_optimizer and offload_param, and that got fine-tuning to start. (I also changed the deepspeed param --per_device_train_batch_size to 1 instead of 16.)
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": false
    },
    "offload_param": {
        "device": "cpu",
        "pin_memory": false
    },
    ...
Not sure if I should tune sub_group_size as well, but it is fine-tuning now, so I will report back if I find a better config for a single A100 80GB.
Thanks again for the help!
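For anyone who would rather script this change than edit the file by hand, here is a minimal sketch. It assumes the config is plain JSON with the `zero_optimization` layout shown in this comment. `disable_pinned_memory` is a hypothetical helper name, and the sample dictionary (including its `true` starting values) is illustrative rather than the actual contents of configs/ds_config_zero3.json.

```python
import copy
import json


def disable_pinned_memory(config: dict) -> dict:
    """Return a copy of a DeepSpeed-style config with pin_memory set to
    False in both ZeRO offload sections (hypothetical helper)."""
    patched = copy.deepcopy(config)
    zero = patched.get("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        # Only touch sections that are actually present in the config.
        if isinstance(zero.get(section), dict):
            zero[section]["pin_memory"] = False
    return patched


# Illustrative config fragment; the starting values are assumptions.
sample = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    }
}

patched = disable_pinned_memory(sample)
print(json.dumps(patched["zero_optimization"], indent=2))
```

To apply it to a real file, load the file with `json.load`, pass the result through the helper, and write it back with `json.dump`. The batch size is a command-line argument in this setup, so it still has to be changed separately.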
from deepseek-coder.
Monitoring the VRAM with nvidia-smi showed that my per_device_train_batch_size was way too small.
from deepseek-coder.
Hey! Can you tell me the minimum requirements (such as GPU VRAM, system RAM, and memory) for fine-tuning with the edits you made in config.json? I am getting the same -9 error. Have you found a better config.json?
from deepseek-coder.