Comments (16)
I haven't run into that; on my side the code basically hasn't hit any DeepSpeed-related problems.
A quick Google search suggests it may be a port misconfiguration. Try explicitly setting the DeepSpeed hostfile, or the master_port, include, and related parameters in your launch script.
from llms_tool.
The core code hasn't changed much, and I'm also using the project's example data.
Got it. My dataset is only a simple debugging dataset, so there's no guarantee it trains to anything useful. If you need a valid baseline from real data, I can swap the debugging dataset out for a proper one.
I tried comparison_gpt4_data_zh.json from https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
The dataset has 48k samples. I even tried swapping chosen and rejected (to sanity-check the code), but as shown in the screenshot above the loss still doesn't move, with 'rewards/chosen': 0.0 and 'rewards/rejected': 0.0.
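For reference, a constant loss of 0.6931 is exactly ln(2), which in DPO means the policy and reference log-probabilities are identical on every batch, so both implicit rewards are 0. A minimal pure-Python sketch of the standard DPO objective illustrates this; the function name and beta value are illustrative, not taken from this repo:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are beta-scaled log-prob ratios vs. the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
    return loss, chosen_reward, rejected_reward

# If the policy's log-probs equal the reference's (e.g. the "ref model"
# silently shares weights with the trained model), both rewards are 0 and
# the loss is -log(sigmoid(0)) = ln(2), regardless of the data.
loss, cr, rr = dpo_loss(-2537.4, -1308.0, -2537.4, -1308.0)
print(cr, rr, round(loss, 4))  # 0.0 0.0 0.6931
```

That the value never moves even after swapping chosen/rejected points at the reference model not being truly independent of the policy, rather than at a data problem.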
What learning rate are you using?
As follows:
deepspeed \
--include="localhost:"${gpus} \
--master_port=9909 \
main.py \
--deepspeed deepspeed_configs/zero_stage2_config.json \
--mode dpo_train \
--fine_tuning_type full \
--model_path ${model_path} \
--output_dir ${output_dir} \
--cache_data_path ${cache_dir} \
--task_name ${task_name} \
--do_train \
--num_train_epochs 1.0 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--save_strategy epoch \
--logging_steps 1 \
--model_type baichuan \
--prompt_template baichuan
Additional info: the model is Baichuan-13B-Chat, full-parameter fine-tuning, DPO training.
Found the cause; a fix is in progress.
@tuzeao Hello, how much GPU memory and RAM does DPO fine-tuning of the 13B model need? I see a deepcopy in the code, and I keep hitting OOM when running it.
The deepcopy is there because DPO needs a ref model as the reference; whether you deepcopy or call model.from_pretrained again, the purpose is the same.
It runs with stage 3 but not stage 2, and it essentially saturates 8×A100; a single card definitely won't work.
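Rough back-of-the-envelope arithmetic (my assumptions: 13B parameters, mixed-precision Adam with fp32 master weights and optimizer states, a frozen fp16 reference copy, activations excluded) is consistent with the 8×A100 figure:

```python
# Estimate total state memory for full-parameter DPO on a 13B model.
params = 13e9
train_bytes = params * (2 + 2 + 12)  # fp16 weights + fp16 grads + fp32 Adam state
                                     # (4B master weights + 4B + 4B moments)
ref_bytes = params * 2               # frozen fp16 reference model for DPO
total_gb = (train_bytes + ref_bytes) / 1024**3
print(round(total_gb))  # ~218 GB of state alone, before activations
```

ZeRO stage 3 shards all of that state across the 8 GPUs (roughly 27 GB each plus activations), while stage 2 keeps the full fp16 weights replicated on every card, which plausibly explains why stage 3 fits and stage 2 does not.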
@tuzeao Thanks. After fixing the OOM I hit another issue: the program exits right in the DeepSpeed initialization phase, with no other error message at all. Have you seen anything like this? My code is basically the same as SFT; only the trainer was swapped for the DPO trainer:
Time to load utils op: 0.4331231117248535 seconds
[2023-08-31 19:02:27,894] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-08-31 19:02:27,895] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 14.31 GB Max_CA 14 GB
[2023-08-31 19:02:27,895] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 16.27 GB, percent = 4.3%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-08-31 19:02:28,030] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-08-31 19:02:28,031] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 14.31 GB Max_CA 14 GB
[2023-08-31 19:02:28,031] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 16.28 GB, percent = 4.3%
[2023-08-31 19:02:28,148] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions
[2023-08-31 19:02:28,149] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 14.31 GB Max_CA 14 GB
[2023-08-31 19:02:28,149] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 16.27 GB, percent = 4.3%
[2023-08-31 19:02:40,532] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 7
[2023-08-31 19:02:40,533] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 13.54 GB Max_CA 14 GB
[2023-08-31 19:02:40,533] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 16.26 GB, percent = 4.3%
[2023-08-31 19:02:40,654] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions
[2023-08-31 19:02:40,655] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 13.54 GB Max_CA 14 GB
[2023-08-31 19:02:40,655] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 16.26 GB, percent = 4.3%
[2023-08-31 19:02:55,692] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions
[2023-08-31 19:02:55,693] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 13.54 GB Max_CA 14 GB
[2023-08-31 19:02:55,693] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 41.39 GB, percent = 11.0%
[2023-08-31 19:02:55,814] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states
[2023-08-31 19:02:55,815] [INFO] [utils.py:786:see_memory_usage] MA 12.74 GB Max_MA 12.74 GB CA 13.54 GB Max_CA 14 GB
[2023-08-31 19:02:55,815] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 41.42 GB, percent = 11.0%
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 80359) of binary: /opt/conda/bin/python
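Not the author, but one hedged observation on that log: torch's elastic launcher reports a negative exit code when a worker is killed by a signal, and -9 is SIGKILL. With no Python traceback, that almost always means the host OS's OOM killer terminated the process, here plausibly while building the fp32 optimizer partitions in CPU RAM. A quick way to decode the exit code:

```python
import signal

# torchrun/deepspeed print "exitcode: -9" when a worker dies from signal 9.
exitcode = -9
sig = signal.Signals(-exitcode)
print(sig.name)  # SIGKILL: killed externally, not a Python error
```

Checking `dmesg` for an OOM-killer entry right after the crash, or reducing CPU-side memory pressure (e.g. disabling optimizer offload or adding swap), would confirm or rule this out.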
How was this problem fixed: loss stuck at 0.6931 with 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0?
{'loss': 0.6931, 'learning_rate': 1.8467489107293509e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -1307.955322265625, 'logps/chosen': -2537.4287109375, 'logits/rejected': 46.67363739013672, 'logits/chosen': 47.7917366027832, 'epoch': 0.5}
{'loss': 0.6931, 'learning_rate': 8.217156947590064e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -566.4472045898438, 'logps/chosen': -863.7874755859375, 'logits/rejected': 47.26816177368164, 'logits/chosen': 47.84038543701172, 'epoch': 0.51}
{'loss': 0.6931, 'learning_rate': 2.05569786813925e-07, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -1334.9326171875, 'logps/chosen': -2294.96533203125, 'logits/rejected': 46.71002960205078, 'logits/chosen': 47.56126022338867, 'epoch': 0.52}
{'loss': 0.6931, 'learning_rate': 0.0, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -2079.239501953125, 'logps/chosen': -4097.9521484375, 'logits/rejected': 46.693851470947266, 'logits/chosen': 47.634830474853516, 'epoch': 0.52}
I replaced ref_model = deepcopy(model) with loading the model a second time:
ref_model = model_class.from_pretrained(
    args.model_name_or_path,
    torch_dtype=torch_dtype,
    device_map=args.device_map,
    trust_remote_code=args.trust_remote_code,
)
Now the rewards are no longer 0, but the loss is absurdly large. Is this normal?
Training logs below:
{'loss': 0.6931, 'learning_rate': 0.0004998072590601808, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -1000.675537109375, 'logps/chosen': -2531.12158203125, 'logits/rejected': 71.77708435058594, 'logits/chosen': 89.86573791503906, 'epoch': 0.06}
{'loss': 27.3212, 'learning_rate': 0.000499229333433282, 'rewards/chosen': 2.425292491912842, 'rewards/rejected': 23.138628005981445, 'rewards/accuracies': 0.390625, 'rewards/margins': -20.713336944580078, 'logps/rejected': -1084.484375, 'logps/chosen': -2350.399658203125, 'logits/rejected': 119.3428955078125, 'logits/chosen': 148.06057739257812, 'epoch': 0.12}
{'loss': 16.8992, 'learning_rate': 0.0004982671142387316, 'rewards/chosen': -2.7379796504974365, 'rewards/rejected': -34.28330993652344, 'rewards/accuracies': 0.578125, 'rewards/margins': 31.545333862304688, 'logps/rejected': -1494.697021484375, 'logps/chosen': -2707.468505859375, 'logits/rejected': 116.12063598632812, 'logits/chosen': 105.90425109863281, 'epoch': 0.18}
{'loss': 631.0259, 'learning_rate': 0.0004969220851487844, 'rewards/chosen': -819.938720703125, 'rewards/rejected': -214.3435821533203, 'rewards/accuracies': 0.046875, 'rewards/margins': -605.5950317382812, 'logps/rejected': -3010.5205078125, 'logps/chosen': -10806.6650390625, 'logits/rejected': 108.48920440673828, 'logits/chosen': 106.30279541015625, 'epoch': 0.24}
{'loss': 249.3126, 'learning_rate': 0.0004951963201008077, 'rewards/chosen': -336.6590881347656, 'rewards/rejected': -88.30933380126953, 'rewards/accuracies': 0.0625, 'rewards/margins': -248.34974670410156, 'logps/rejected': -1614.24853515625, 'logps/chosen': -5640.9951171875, 'logits/rejected': 97.84693908691406, 'logits/chosen': 96.17440032958984, 'epoch': 0.3}
{'loss': 29.9124, 'learning_rate': 0.0004930924800994192, 'rewards/chosen': -69.85091400146484, 'rewards/rejected': -53.566314697265625, 'rewards/accuracies': 0.34375, 'rewards/margins': -16.284597396850586, 'logps/rejected': -1685.541748046875, 'logps/chosen': -3788.849609375, 'logits/rejected': 93.85322570800781, 'logits/chosen': 92.70535278320312, 'epoch': 0.36}
{'loss': 64.2579, 'learning_rate': 0.0004906138091134118, 'rewards/chosen': -111.27135467529297, 'rewards/rejected': -97.27839660644531, 'rewards/accuracies': 0.453125, 'rewards/margins': -13.992941856384277, 'logps/rejected': -2088.65576171875, 'logps/chosen': -4169.7021484375, 'logits/rejected': 65.00901794433594, 'logits/chosen': 64.33920288085938, 'epoch': 0.42}
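One thing stands out in those logs, for what it's worth: the learning rate is ~5e-4 for full-parameter DPO, and the loss values track the exploded reward margins almost exactly. For a large negative margin m, the DPO loss -log(sigmoid(m)) ≈ -m, so 'rewards/margins': -605.6 implies a per-sample loss near 605.6, matching the 631-range batch averages above. A tiny numerically safe sketch (function name illustrative):

```python
import math

def dpo_loss_from_margin(margin):
    # -log(sigmoid(margin)); use the exact linear tail for very negative
    # margins to avoid overflow in exp(-margin).
    if margin < -30:
        return -margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(dpo_loss_from_margin(0.0), 4))      # 0.6931 -> the healthy starting point
print(round(dpo_loss_from_margin(-605.595), 1)) # 605.6  -> a diverged policy
```

So the loss function itself is behaving as defined; the magnitudes suggest the policy has drifted far from the reference, a classic symptom of too large a learning rate (the other runs in this thread use 1e-5 or lower) rather than a bug in the loss.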
Thanks for the project's very clean and well-structured code; it has been a pleasure to read and adapt.
However, during DPO training my loss and rewards/chosen stay like the following the whole time. Is this normal?
{'loss': 0.6931, 'learning_rate': 9.99231529256779e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -445.9979248046875, 'logps/chosen': -30.411256790161133, 'logits/rejected': 2.6535236835479736, 'logits/chosen': 1.1344398260116577, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.991217477220333e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -200.39610290527344, 'logps/chosen': -33.55436706542969, 'logits/rejected': 1.5881015062332153, 'logits/chosen': 3.952385187149048, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.990119661872873e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -155.58612060546875, 'logps/chosen': -40.69193649291992, 'logits/rejected': 2.194831132888794, 'logits/chosen': 1.8327357769012451, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.989021846525415e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -258.0289306640625, 'logps/chosen': -41.872779846191406, 'logits/rejected': 4.575325965881348, 'logits/chosen': 0.9270402789115906, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.987924031177957e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -319.283203125, 'logps/chosen': -37.61365509033203, 'logits/rejected': 2.6409215927124023, 'logits/chosen': 2.5549163818359375, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.986826215830499e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -215.65521240234375, 'logps/chosen': -34.17694091796875, 'logits/rejected': 5.640789985656738, 'logits/chosen': 0.18884596228599548, 'epoch': 0.0}
{'loss': 0.6931, 'learning_rate': 9.986826215830499e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -422.649658203125, 'logps/chosen': -43.95429992675781, 'logits/rejected': 1.0700151920318604, 'logits/chosen': 1.0940637588500977, 'epoch': 0.0}
I'm hitting the same problem: the loss stays at 0.6931 and rewards/chosen is 0. Lowering the learning rate and switching bf16 made no difference. Any suggestions on how to debug this?
Related Issues (17)
- Could you provide the script that generates the JSON? The one I generate raises errors — is it an encoding issue? HOT 4
- Pls support RWKV world model
- About weight merging
- About weight merging HOT 1
- Error when fine-tuning the Qwen model with prefix-tuning HOT 1
- Looking forward to pre-training code HOT 1
- No eval loss output during SFT HOT 3
- Could the author set up a WeChat group or leave contact info? HOT 1
- DeepSpeed training error with baichuan2-13b-chat, during DPO training
- The vocabulary-extension code needs optimization
- About model pre-training HOT 2
- Could you try building a WebUI for training? HOT 1
- DeepSpeed error
- QLoRA doesn't seem to work with DeepSpeed ZeRO-3? HOT 1
- SFT dataset fields HOT 1
- AttributeError: 'DataManager' object has no attribute 'generating_args_preprocess' HOT 1