Comments (17)
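The model is first loaded into CPU memory and then, under ZeRO-3's control, partitioned across the GPUs; each GPU receives a different shard.
OK, I understand now. Thank you!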
from xtuner.
I saw that the earlier workaround was to switch to ZeRO-3. Does this mean xtuner cannot load a model the way device_map='auto' does?
from xtuner.
In my tests, single-GPU training loads the parameters onto the first GPU only.
from xtuner.
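During training, xtuner does not support the device_map='auto' behavior of HF model loading. If a single GPU runs out of memory (OOM), please try a ZeRO strategy.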
from xtuner.
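During training, xtuner does not support the device_map='auto' behavior of HF model loading. If a single GPU runs out of memory (OOM), please try a ZeRO strategy.
Are there any plans to support this strategy in the future?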
from xtuner.
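During training, xtuner does not support the device_map='auto' behavior of HF model loading. If a single GPU runs out of memory (OOM), please try a ZeRO strategy.
Are there any plans to support this strategy in the future?
This strategy resembles pipeline parallelism and is not efficient: GPU utilization is very low, and it offers no significant advantage over ZeRO, so we are not considering support for now.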
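To make the utilization argument concrete, here is a toy timeline model. It is only a sketch under strong assumptions: layers are placed sequentially across GPUs, a single batch flows through them with no micro-batching, each stage takes one time step, and communication cost is ignored. The function names are hypothetical and are not part of xtuner, transformers, or DeepSpeed.

```python
def sequential_schedule(num_gpus):
    """Layers split across GPUs in order; one batch passes through stage by stage.

    Row t is a time step, column g is a GPU; True means GPU g computes at step t.
    """
    return [[g == t for g in range(num_gpus)] for t in range(num_gpus)]


def data_parallel_schedule(num_gpus):
    """Idealized data parallelism: every GPU works on its own batch at every step."""
    return [[True for _ in range(num_gpus)] for _ in range(num_gpus)]


def utilization(schedule):
    """Fraction of (time step, GPU) slots in which a GPU is computing."""
    slots = [busy for step in schedule for busy in step]
    return sum(slots) / len(slots)


seq_util = utilization(sequential_schedule(4))
dp_util = utilization(data_parallel_schedule(4))
print(f"sequential placement: {seq_util:.2f}, data parallel: {dp_util:.2f}")
```

In this toy model, sequential placement keeps only one of the four GPUs busy at any time. Real ZeRO training does pay communication overhead, so its utilization is below the idealized value used here.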
from xtuner.
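@LumenScope With ZeRO-3, the weights are loaded in shards when the model is loaded, so you will not hit OOM from loading the model repeatedly.
However, your config appears to use LoRA. LoRA/QLoRA are incompatible with ZeRO-3 and can only be used with ZeRO-1/2.
We recommend training with QLoRA to avoid the OOM caused by loading the model repeatedly. In our own development work, we have found no accuracy difference between QLoRA and LoRA.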
from xtuner.
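During training, xtuner does not support the device_map='auto' behavior of HF model loading. If a single GPU runs out of memory (OOM), please try a ZeRO strategy.
Are there any plans to support this strategy in the future?
This strategy resembles pipeline parallelism and is not efficient: GPU utilization is very low, and it offers no significant advantage over ZeRO, so we are not considering support for now.
However, on my multi-GPU server, device_map='auto' combined with ZeRO-3 seems to support full-parameter training without LoRA. In that case, would a framework that supports device_map='auto' have an advantage over xtuner?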
from xtuner.
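@LumenScope
xtuner also supports full-parameter training with ZeRO-3.
As for combining device_map='auto' with ZeRO-3, I have not yet seen a way to use the two together directly. Do you have code that does this?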
from xtuner.
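@LumenScope xtuner also supports full-parameter training with ZeRO-3. As for combining device_map='auto' with ZeRO-3, I have not yet seen a way to use the two together directly. Do you have code that does this?
The Firefly framework can do it, but I ran out of memory (OOM) when doing full-parameter training of a 20B model with xtuner + ZeRO-3: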
Generating train split: 2182 examples [00:00, 58146.15 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████| 2182/2182 [00:00<00:00, 5632.11 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████| 2182/2182 [00:00<00:00, 5927.05 examples/s]
Filter (num_proc=32): 100%|██████████████████████████████████████████████████████████| 2182/2182 [00:00<00:00, 6130.98 examples/s]
Map (num_proc=32): 100%|██████████████████████████████████████████████████████████████| 2182/2182 [00:06<00:00, 355.65 examples/s]
Filter (num_proc=32): 100%|██████████████████████████████████████████████████████████| 2182/2182 [00:00<00:00, 5824.98 examples/s]
Flattening the indices (num_proc=32): 100%|██████████████████████████████████████████| 2182/2182 [00:00<00:00, 4669.87 examples/s]
Map (num_proc=32): 100%|█████████████████████████████████████████████████████████████| 2182/2182 [00:00<00:00, 5952.87 examples/s]
01/31 11:55:39 - mmengine - WARNING - Dataset Dataset has no metainfo. ``dataset_meta`` in visualizer will be None.
[2024-01-31 11:55:40,523] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 339, num_elems = 19.86B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 21/21 [00:33<00:00, 1.61s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 21/21 [00:33<00:00, 1.61s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 21/21 [00:34<00:00, 1.63s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████| 21/21 [00:42<00:00, 2.03s/it]
01/31 11:56:23 - mmengine - INFO - dispatch internlm2 attn forward
01/31 11:56:23 - mmengine - WARNING - Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
[2024-01-31 11:56:23,356] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.0, git-hash=unknown, git-branch=unknown
[2024-01-31 11:56:23,383] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-31 11:56:23,386] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-31 11:56:23,386] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-31 11:56:23,453] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-31 11:56:23,453] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-31 11:56:23,453] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-01-31 11:56:23,453] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2024-01-31 11:56:23,607] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-01-31 11:56:23,608] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 12.12 GB CA 15.64 GB Max_CA 16 GB
[2024-01-31 11:56:23,608] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 8.05 GB, percent = 6.4%
[2024-01-31 11:56:23,612] [INFO] [stage3.py:128:__init__] Reduce bucket size 500,000,000
[2024-01-31 11:56:23,612] [INFO] [stage3.py:129:__init__] Prefetch bucket size 50,000,000
[2024-01-31 11:56:23,765] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-01-31 11:56:23,766] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 10.0 GB CA 15.64 GB Max_CA 16 GB
[2024-01-31 11:56:23,766] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 8.05 GB, percent = 6.4%
Parameter Offload: Total persistent parameters: 595968 in 97 params
[2024-01-31 11:56:23,977] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-01-31 11:56:23,978] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 10.0 GB CA 15.64 GB Max_CA 16 GB
[2024-01-31 11:56:23,978] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 8.05 GB, percent = 6.4%
[2024-01-31 11:56:24,135] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-01-31 11:56:24,136] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 10.0 GB CA 15.64 GB Max_CA 16 GB
[2024-01-31 11:56:24,136] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 8.05 GB, percent = 6.4%
[2024-01-31 11:56:36,419] [INFO] [utils.py:791:see_memory_usage] After creating fp16 partitions: 5
[2024-01-31 11:56:36,421] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 10.0 GB CA 14.98 GB Max_CA 16 GB
[2024-01-31 11:56:36,421] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 35.86 GB, percent = 28.6%
[2024-01-31 11:56:36,585] [INFO] [utils.py:791:see_memory_usage] Before creating fp32 partitions
[2024-01-31 11:56:36,586] [INFO] [utils.py:792:see_memory_usage] MA 10.0 GB Max_MA 10.0 GB CA 14.98 GB Max_CA 15 GB
[2024-01-31 11:56:36,586] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 35.86 GB, percent = 28.6%
Traceback (most recent call last):
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 299, in <module>
main()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 295, in main
runner.train()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
self.strategy.prepare(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
self.model = self._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 16, in _wrap_model
wrapper = super()._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 396, in _wrap_model
engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in __init__
self._setup_for_real_optimizer()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
self._create_fp32_partitions()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 860, in _create_fp32_partitions
self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacty of 31.74 GiB of which 2.66 GiB is free. Including non-PyTorch memory, this process has 29.07 GiB memory in use. Of the allocated memory 23.20 GiB is allocated by PyTorch, and 5.02 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 299, in <module>
main()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 295, in main
runner.train()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
self.strategy.prepare(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
self.model = self._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 16, in _wrap_model
wrapper = super()._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 396, in _wrap_model
engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in __init__
self._setup_for_real_optimizer()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
self._create_fp32_partitions()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 860, in _create_fp32_partitions
self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 2 has a total capacty of 31.74 GiB of which 1.95 GiB is free. Including non-PyTorch memory, this process has 29.78 GiB memory in use. Of the allocated memory 23.20 GiB is allocated by PyTorch, and 5.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 299, in <module>
main()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 295, in main
runner.train()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
self.strategy.prepare(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
self.model = self._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 16, in _wrap_model
wrapper = super()._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 396, in _wrap_model
engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in __init__
self._setup_for_real_optimizer()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
self._create_fp32_partitions()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 860, in _create_fp32_partitions
self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 3 has a total capacty of 31.74 GiB of which 2.64 GiB is free. Including non-PyTorch memory, this process has 29.09 GiB memory in use. Of the allocated memory 23.20 GiB is allocated by PyTorch, and 5.09 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 299, in <module>
main()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py", line 295, in main
runner.train()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1182, in train
self.strategy.prepare(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 381, in prepare
self.model = self._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/engine/_strategy/deepspeed.py", line 16, in _wrap_model
wrapper = super()._wrap_model(model)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/mmengine/_strategy/deepspeed.py", line 396, in _wrap_model
engine, self.optim_wrapper.optimizer, *_ = deepspeed.initialize(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1247, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1569, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer_Stage3(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 361, in __init__
self._setup_for_real_optimizer()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 468, in _setup_for_real_optimizer
self._create_fp32_partitions()
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 860, in _create_fp32_partitions
self.device).clone().float().detach())
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 1 has a total capacty of 31.74 GiB of which 1.94 GiB is free. Including non-PyTorch memory, this process has 29.79 GiB memory in use. Of the allocated memory 23.20 GiB is allocated by PyTorch, and 5.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-01-31 11:56:39,982] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 36032) of binary: /home/tangshi/miniconda3/envs/xtuner/bin/python
Traceback (most recent call last):
File "/home/tangshi/miniconda3/envs/xtuner/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/tangshi/miniconda3/envs/xtuner/lib/python3.10/site-packages/xtuner/tools/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-01-31_11:56:39
host : tangshi
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 36033)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-01-31_11:56:39
host : tangshi
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 36034)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-01-31_11:56:39
host : tangshi
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 36035)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-31_11:56:39
host : tangshi
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 36032)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Here is the config:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine import DatasetInfoHook, EvaluateChatHook
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/home/tangshi/TangShi/Pku政务大模型/Models/internlm2-chat-20b'
# Data
alpaca_zh_path = '/home/tangshi/TangShi/Pku政务大模型/Trainer/Tools/xtuner/data'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 4096
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = "你的任务是重庆市政务文书写作、政务问答 \n 你生成的问题必须包含:1、留言标题,2、留言摘要。你生成的答复内容部分必须有法律依据,且表明已审查,参照你固有的知识或者我给出的法律文献,在引用法律文件时使用《》包裹其名称。\n"
evaluation_inputs = [
'留言标题:合川区一超市怀疑出售过期食品\n留言摘要:市民在合川区一超市发现了怀疑是过期的食品,担心会对消费者的健康造成威胁。', '留言标题:渝北区小区内共享单车难以管理\n留言摘要:渝北区某小区内共享单车聚集,严重影响居民出行,请求相关部门加强管理。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')
model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')
    ),
    # lora=dict(
    #     type=LoraConfig,
    #     r=64,
    #     lora_alpha=16,
    #     lora_dropout=0.1,
    #     bias='none',
    #     task_type='CAUSAL_LM')
)
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=alpaca_zh_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_zh_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_zh,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        T_max=max_epochs,
        convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(by_epoch=True, max_epochs=max_epochs, val_interval=1)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per epoch.
    checkpoint=dict(type=CheckpointHook, interval=1),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
from xtuner.
Judging from this, Firefly does not actually use 'auto'.
from xtuner.
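Judging from this, Firefly does not actually use 'auto'.
So is full-parameter fine-tuning with xtuner + ZeRO-3 still impossible, even if I have many GPUs? I think you should prioritize this issue.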
from xtuner.
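xtuner + ZeRO-3 can do full-parameter fine-tuning.
For full fine-tuning of a 20B model in your case, four GPUs with 32 GB of memory each are not enough.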
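A rough back-of-the-envelope estimate shows why. The sketch below assumes mixed-precision Adam, for which the model states commonly cost about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for the fp32 master weights, momentum, and variance), and assumes ZeRO-3 divides these states evenly across GPUs with no CPU offload. The parameter count (19.86B) and GPU capacity (31.74 GiB) are taken from the log above. Activations, communication buffers, and fragmentation are ignored, so real usage is higher.

```python
import math

# Assumed per-parameter cost of model states under mixed-precision Adam:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights, momentum, variance (12).
BYTES_PER_PARAM = 2 + 2 + 12

NUM_PARAMS = 19.86e9  # "num_elems = 19.86B" in the DeepSpeed log
GPU_GIB = 31.74       # per-GPU capacity reported in the OOM message


def model_state_gib(num_params, num_gpus):
    """Model-state memory per GPU in GiB if ZeRO-3 shards everything evenly."""
    return num_params * BYTES_PER_PARAM / num_gpus / 2**30


per_gpu_with_4 = model_state_gib(NUM_PARAMS, 4)
min_gpus = math.ceil(model_state_gib(NUM_PARAMS, 1) / GPU_GIB)

print(f"model states per GPU with 4 GPUs: {per_gpu_with_4:.1f} GiB")
print(f"GPUs needed for model states alone: {min_gpus}")
```

Under these assumptions, four GPUs would each need roughly 74 GiB for model states alone, far above the 31.74 GiB available, which is consistent with the OOM raised while the fp32 partitions were being created.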
from xtuner.
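For full fine-tuning of a 20B model, four GPUs with 32 GB of memory each are not enough.
But by xtuner's logic, no number of 32 GB GPUs would work, because however many there are, each GPU is handled in the same way. Is that understanding correct? So to use xtuner, must I have 80 GB GPUs?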
from xtuner.
@LumenScope
No. ZeRO-3 automatically partitions the model.
https://deepspeed.readthedocs.io/en/latest/zero3.html
from xtuner.
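@LumenScope No. ZeRO-3 automatically partitions the model. https://deepspeed.readthedocs.io/en/latest/zero3.html
I understand how ZeRO-3 partitioning works, but I am still a bit confused:
When xtuner is set up for multiple GPUs, every GPU is forced to process in parallel and 'auto' is not available. Does this mean training is impossible whenever a single GPU's memory cannot sustain ZeRO-3?
Or, with multi-GPU xtuner + ZeRO-3, does each GPU actually receive a different shard?
But I observed that memory usage is identical across the GPUs.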
from xtuner.
The model is first loaded into CPU memory and then, under ZeRO-3's control, partitioned across the GPUs; each GPU receives a different shard.
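The following sketch illustrates why memory usage can look identical on every GPU even though each one holds different data. It is a simplified stand-in, not DeepSpeed's actual implementation: the parameters are modeled as one flat list, and `partition_flat` is a hypothetical helper that splits it into equally sized contiguous shards, padding the tail.

```python
def partition_flat(params, world_size):
    """Split a flat parameter list into world_size equal contiguous shards.

    The tail is zero-padded so that every rank holds the same number of elements.
    """
    shard_len = -(-len(params) // world_size)  # ceiling division
    padded = params + [0.0] * (shard_len * world_size - len(params))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


params = [float(i) for i in range(10)]  # stand-in for the model's flattened weights
shards = partition_flat(params, world_size=4)

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {len(shard)} elements -> {shard}")
```

Every rank ends up with a shard of the same length, so the per-GPU footprint matches, while the contents of the shards differ.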
from xtuner.