
llama-x's People

Contributors

aethercortex, aiforallnlp, chelsea-yeung, cyril-jz, llm-p, nlpkazhe, nlpxucan, robertmarton, sky1worker, victorsungo, wangpeiyi9979, yxkryptonite, zhaopu7


llama-x's Issues

utils.py and openai==0.8.0 TypeError: 'type' object is not subscriptable

In ./src/utils.py, line 47 (return_text=False) inside

    def openai_completion(
        prompts: Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]],
        decoding_args: OpenAIDecodingArguments,
        model_name="text-davinci-003",
        sleep_time=2,
        batch_size=1,
        max_instances=sys.maxsize,
        max_batches=sys.maxsize,
        return_text=False,

raises this error:

    File "train.py", line 27, in <module>
      import utils
    File "/home/llama/Llama-X-main/src/utils.py", line 47, in <module>
      return_text=False,
    TypeError: 'type' object is not subscriptable

My versions are openai==0.8.0 and deepspeed==0.10.0, with the other packages as pinned in requirements.txt. Even after changing the transformers version to 4.29.2 I could not solve this error, so maybe some package versions are wrong. If anyone has a suggestion, please leave a comment here. Thank you very much!
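
A likely cause, independent of the package versions: the dict[str, str] annotation in the signature above uses built-in generics, which are only valid on Python 3.9+; on Python 3.8 and earlier, evaluating them at definition time raises exactly this TypeError. A minimal sketch of a fix under that assumption (either defer annotation evaluation or switch to typing.Dict):

    # Hypothetical patch for src/utils.py, assuming the failure comes from
    # evaluating dict[str, str] at definition time on Python < 3.9.
    from __future__ import annotations  # defer evaluation of all annotations

    import sys
    from typing import Dict, Sequence, Union

    def openai_completion(
        prompts: Union[str, Sequence[str], Sequence[Dict[str, str]], Dict[str, str]],
        decoding_args=None,  # OpenAIDecodingArguments in the original code
        model_name="text-davinci-003",
        sleep_time=2,
        batch_size=1,
        max_instances=sys.maxsize,
        max_batches=sys.maxsize,
        return_text=False,
    ):
        ...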

There is an endless loop between convert_tokens_to_ids(self.unk_token) and self._convert_token_to_id_with_added_voc(tokens)

Hi, this is part of my error information. Thanks!

    File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
      return self.unk_token_id
    File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
      return self.convert_tokens_to_ids(self.unk_token)
    File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
      return self._convert_token_to_id_with_added_voc(tokens)
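
The traceback shows mutual recursion: unk_token_id calls convert_tokens_to_ids(self.unk_token), which falls back to unk_token_id again when the unk token cannot be resolved in the vocabulary, so the two methods call each other forever. A hedged workaround sketch, assuming the loaded LLaMA tokenizer is simply missing its special-token definitions (the path and token strings are illustrative):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("./llama-7b-hf")  # illustrative path

    # If unk/bos/eos were never registered, convert_tokens_to_ids(unk_token)
    # can recurse endlessly; declaring them explicitly breaks the cycle.
    tokenizer.add_special_tokens(
        {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
    )
    print(tokenizer.unk_token_id)  # should now resolve to a concrete id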

How to fix unstable loss? I am using the WizardLM / Llama-X training code with the vicuna-style chat format to fine-tune the Llama-2-7b-hf model.

I'm using the 'Llama-X' (https://github.com/AetherCortex/Llama-X) training code with the vicuna-style chat template to fine-tune the Llama-2-7b-hf model. However, I'm observing an unstable loss during the process.

Please find the detailed Weights & Biases report at (https://wandb.ai/arpitsh018/huggingface/reports/Untitled-Report--Vmlldzo1NjE2Njgz).

Training Parameters:

    os.system(f'deepspeed train.py \
        --model_name_or_path meta-llama/Llama-2-7b-hf \
        --data_path ../data/dummy_conversation.json \
        --output_dir ./checkpoint/finetuned-llama2-7b \
        --num_train_epochs 1 \
        --model_max_length 4096 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 1 \
        --gradient_accumulation_steps 1 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 1000 \
        --save_total_limit 3 \
        --learning_rate 2e-5 \
        --warmup_steps 0 \
        --logging_steps 1 \
        --lr_scheduler_type "cosine" \
        --report_to "wandb" \
        --gradient_checkpointing True \
        --deepspeed configs/deepspeed_config.json \
        --bf16 True')

deepspeed_config.json:

    {
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": true
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": true
            },
            "overlap_comm": true,
            "contiguous_gradients": true,
            "sub_group_size": 0,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "stage3_max_live_parameters": 0,
            "stage3_max_reuse_distance": 0,
            "stage3_gather_16bit_weights_on_model_save": true
        },
        "bf16": {
            "enabled": true
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": [0.9, 0.999],
                "eps": 1e-8,
                "weight_decay": 0
            }
        },
        "scheduler": {
            "type": "WarmupDecayLR",
            "params": {
                "warmup_min_lr": "auto",
                "warmup_max_lr": "auto",
                "warmup_num_steps": "auto",
                "total_num_steps": "auto"
            }
        },
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "wall_clock_breakdown": false
    }

I'm eager to understand how to stabilize the loss for my training sessions. Any insights or recommendations would be greatly appreciated.
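
One detail that stands out in the command above is --warmup_steps 0 combined with a cosine schedule; instruction-tuning recipes for LLaMA-style models usually include a short warmup, and skipping it is a common cause of early loss spikes. A hedged sketch of the adjusted launch, under the assumption that warmup is the culprit (100 steps is an illustrative value, not a verified fix):

    import os

    # Identical to the command above except for --warmup_steps (was 0);
    # a short ramp-up often smooths the early loss curve.
    os.system('deepspeed train.py '
              '--model_name_or_path meta-llama/Llama-2-7b-hf '
              '--data_path ../data/dummy_conversation.json '
              '--output_dir ./checkpoint/finetuned-llama2-7b '
              '--num_train_epochs 1 '
              '--model_max_length 4096 '
              '--per_device_train_batch_size 1 '
              '--gradient_accumulation_steps 1 '
              '--warmup_steps 100 '  # illustrative, not a verified value
              '--learning_rate 2e-5 '
              '--lr_scheduler_type cosine '
              '--gradient_checkpointing True '
              '--deepspeed configs/deepspeed_config.json '
              '--bf16 True')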

Learning rate fixed at 0 during training via DeepSpeed

I followed all the setup instructions given in the README.
The command I am using is:

deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path Llama-X/data/alpaca_data.json \
    --output_dir ./model_weights_finetuned \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True

Initially, I got the following error:

ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.

I downgraded to transformers version 4.29.2 as suggested here.

Now training runs, but the learning rate is fixed at zero from the very beginning. Below are the logs:

[2023-08-28 04:36:42,566] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:43,585] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-28 04:36:43,585] [INFO] [runner.py:555:main] cmd = /home/anmol/anaconda3/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --model_name_or_path meta-llama/Llama-2-7b-hf --data_path /home/anmol/TieredModels/code/08_llamax_approach/Llama-X/data/alpaca_data.json --output_dir ./model_weights_finetuned --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 64 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed configs/deepspeed_config.json --fp16 True
[2023-08-28 04:36:44,193] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,187] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-28 04:36:45,187] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-28 04:36:45,187] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-28 04:36:45,187] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-28 04:36:45,187] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-28 04:36:45,955] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,977] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,980] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,996] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:47,765] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,765] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,765] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-28 04:36:47,772] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,772] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,773] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,801] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,801] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:54,794] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.97s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Running tokenizer on train dataset (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52002/52002 [00:02<00:00, 24943.76 examples/s]
52002
Sample 12208 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 4391, 385, 4544, 1813, 411, 263, 28435, 322, 263, 1014, 2813, 292, 13, 13, 2277, 29937, 13291, 29901, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2]}.
Sample 46872 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 29871, 29941, 29899, 29946, 10541, 5828, 1048, 263, 285, 9102, 1058, 27401, 278, 2462, 29889, 13, 13, 2277, 29937, 13291, 29901, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2]}.
Sample 4920 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892, 3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 740, 297, 5132, 304, 7252, 1023, 6031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 1576, 1023, 6031, 526, 525, 11548, 29915, 322, 525, 272, 927, 4286, 13, 13, 2277, 29937, 13291, 29901, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2]}.
[2023-08-28 04:37:06,368] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,370] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,376] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,377] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/TH -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/THC -isystem /home/anmol/anaconda3/envs/llamax/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -march=native -fopenmp -D__AVX256__ -D__DISABLE_CUDA__ -c /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.43460202217102 seconds
Time to load cpu_adam op: 16.424538373947144 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.438047647476196 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.528525590896606 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.06}

Does anyone have any idea what I might be doing wrong?
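
For the original ValueError, an alternative to downgrading transformers is to add a scheduler block to configs/deepspeed_config.json alongside the optimizer, mirroring the scheduler section shown in the deepspeed_config.json of the previous issue (the "auto" values are resolved by the HF Trainer; this is a sketch, not a confirmed fix for the zero learning rate):

    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    }

Separately, loss 0.0 together with learning_rate 0.0 under --fp16 sometimes indicates fp16 overflow (Llama-2 weights were trained in bf16), in which case the loss scaler keeps skipping optimizer steps; checking the logs for loss-scale messages, or switching to --bf16 on hardware that supports it, may be worth trying.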

Concern about the language

Interesting project, but I have some concerns about the language.
It is well known that there are few Chinese tokens in Llama's training data, and that each Chinese character is tokenized into several tokens, which is inefficient for generation. Will the project handle this, e.g., by adding new tokens and doing some pretraining?
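
For the token-extension part of that suggestion, a hedged sketch of what adding new tokens and resizing the embeddings could look like with the transformers API (paths and tokens are illustrative, and the new embedding rows would still need pretraining):

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Illustrative: extend the vocabulary, then resize the embedding matrix
    # so the model has rows for the new ids (initialized randomly).
    tokenizer = AutoTokenizer.from_pretrained("./llama-7b-hf")
    model = AutoModelForCausalLM.from_pretrained("./llama-7b-hf")

    new_tokens = ["你好", "世界"]  # illustrative Chinese tokens
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added > 0:
        model.resize_token_embeddings(len(tokenizer))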

About the training strategy

Very nice project, and I appreciate your contribution!

I have looked at the deepspeed config and want to confirm the current training strategy. For LLaMA-13B, the training uses ZeRO-3 optimization, gradient checkpointing, and CPU offload, right? I'm curious whether you have tried tensor parallelism (used in the original LLaMA training) or model parallelism.

We would also love to contribute a model-parallel training implementation for fast large-scale training, aimed at models larger than 13B. Currently I'm investigating the torch/fairscale pipeline-parallel mechanism.

Best,
Fangkai

About Llama-X and Alpaca repo

Hi, may I know why the hyperparameters of the training commands in Llama-X (this repo) and Alpaca are different? E.g., batch size 128 vs. 512 (64*8), and warmup 0.03 (a ratio) vs. 2 steps.
Which hyperparameters should we adopt?

Another question: what is the Llama-i (7B) in the Llama-X Evaluation section? Also, its GSM8K result is 18.8%, while my own Llama-X model (using the hyperparameters in this repo) reaches only 10%. I am not sure why the gap is so large. Would you mind sharing your GSM8K evaluation script for Llama-X? Thank you.

improve LLaMA for visual understanding like GPT-4

Thanks for the good work!

We have tried to improve the LLaMA model to understand visual information and support multi-modal chatting.
We are inspired by the idea that a good ViT (e.g., the CLIP vision encoder) and a well-trained large language model (e.g., LLaMA), joined by a connection network (e.g., an MLP or a Transformer), can cover visual applications, much like PaLM-E; a minimal sketch of such a connector follows the list below.

The results on image captioning, VQA, and more multi-modal tasks are promising at 7B, and we call on more people to help test larger models.

Github: https://github.com/feizc/Visual-LLaMA

  • fine-tuning scripts and hyper-parameters setting
  • datasets for fine-grained alignment and instruct tuning
  • interactive gradio and visual chatbot
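
As referenced above, a hedged sketch of the connection-network idea: project frozen CLIP vision features into the LLaMA embedding space with a small MLP. The dimensions are illustrative (CLIP ViT-L/14 features into a LLaMA-7B hidden size), not the actual Visual-LLaMA implementation:

    import torch
    import torch.nn as nn

    class VisualConnector(nn.Module):
        """Map CLIP patch features into LLaMA's embedding space."""
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
            # vision_feats: (batch, num_patches, vision_dim)
            # output: (batch, num_patches, llm_dim)
            return self.proj(vision_feats)

The projected patch embeddings can then be concatenated with the text token embeddings before being fed to the language model.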

Why offload_param to the CPU?

I think loading the model shards should already fit in GPU memory at 6.7B parameters, so why offload the parameters to the CPU?

        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },

Need optimization for mps

Reproduction info:

  • MacBook Pro 2021 14" with M1 chip
  • macOS 12.4
  • 6B model

The inference speed is very slow on a macOS machine with mps and needs further optimization.

only load lm_head.weight and embed_tokens.weight parameters

I can only load these parameters:

  • lm_head.weight : 131076096
  • model.embed_tokens.weight : 131076096

All the other parameters are None.

Number of parameters:  262152192
model.embed_tokens.weight : 131076096
model.layers.0.self_attn.q_proj.weight : 0
model.layers.0.self_attn.k_proj.weight : 0
model.layers.0.self_attn.v_proj.weight : 0
model.layers.0.self_attn.o_proj.weight : 0
model.layers.0.mlp.gate_proj.weight : 0
model.layers.0.mlp.down_proj.weight : 0
model.layers.0.mlp.up_proj.weight : 0
model.layers.0.input_layernorm.weight : 0
model.layers.0.post_attention_layernorm.weight : 0
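
Zero-sized parameters like this are the expected view when the model was initialized under DeepSpeed ZeRO stage 3: each weight is partitioned across ranks, so locally the tensors look empty until they are gathered. A hedged sketch of how to inspect the real values under that assumption (model is your ZeRO-3 initialized module):

    import deepspeed

    # Under ZeRO-3, p.numel() is 0 for partitioned weights; GatheredParameters
    # temporarily reassembles the full tensors so they can be inspected.
    params = list(model.parameters())
    with deepspeed.zero.GatheredParameters(params, modifier_rank=None):
        for name, p in model.named_parameters():
            print(name, p.numel())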

How big is your RAM?

I trained on 4 V100 32G GPUs and got OOM. If I set offload to cpu, the process is killed because it runs out of RAM. So how large is your RAM? And is it possible to train LLaMA 7B with 4 V100s? Many thanks for any reply.

RuntimeError: CUDA out of memory.

I fine-tune llama-7b on 8 V100 32G GPUs. However, it runs out of CUDA memory.

RuntimeError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 6; 31.75 GiB total capacity; 29.87 GiB already allocated; 41.94 MiB free; 30.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-04 17:43:59,766] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1840
[2023-04-04 17:44:02,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1841
[2023-04-04 17:44:05,213] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1842
[2023-04-04 17:44:07,817] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1843
[2023-04-04 17:44:10,420] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1844
[2023-04-04 17:44:13,023] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1845
[2023-04-04 17:44:15,585] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1846
[2023-04-04 17:44:15,586] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1847
[2023-04-04 17:44:18,429] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiscuser/.conda/envs/llamax/bin/python3.1', '-u', 'train.py', '--local_rank=7', '--model_name_or_path', './llama-7b-hf', '--data_path', '../data/alpaca_data.json', '--output_dir', 'output/', '--num_train_epochs', '3', '--per_device_train_batch_size', '64', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '2', '--learning_rate', '2e-5', '--warmup_steps', '2', '--logging_steps', '2', '--lr_scheduler_type', 'cosine', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--deepspeed', 'configs/deepspeed_config.json', '--fp16', 'True'] exits with return code = 1

watch -n 1 nvidia-smi


    Tue Apr  4 17:48:15 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                    0 |
    | N/A   42C    P0    67W / 300W |   4491MiB / 32768MiB |     32%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                    0 |
    | N/A   44C    P0    57W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000003:00:00.0 Off |                    0 |
    | N/A   40C    P0    54W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000004:00:00.0 Off |                    0 |
    | N/A   41C    P0    56W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000005:00:00.0 Off |                    0 |
    | N/A   39C    P0    54W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000006:00:00.0 Off |                    0 |
    | N/A   43C    P0    59W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000007:00:00.0 Off |                    0 |
    | N/A   40C    P0    55W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000008:00:00.0 Off |                    0 |
    | N/A   43C    P0    53W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|

About llama-2-70B fine-tuning

Thanks for your amazing work!

I'm experiencing out-of-memory problems when using Llama-X's fine-tuning code for supervised fine-tuning of the 70B model (Llama-2-13B doesn't have this problem), on a configuration of 3 nodes with 8 A100s (40G) each.

So I would like to ask about the training configuration you used, if possible. Thanks a lot!
