aethercortex / llama-x Goto Github PK

View Code? Open in Web Editor NEW

1.6K 1.6K 101.0 7.59 MB

Open Academic Research on Improving LLaMA to SOTA LLM

License: Apache License 2.0

Python 100.00%

llama-x's People

Contributors

Stargazers

Watchers

Forkers

wangjiaqiys nlpxucan sheli00 zwhinmedia jiayong zurichrain patrick-tssn jxzhangjhu yxkryptonite nickfantasy dumpmemory janglichao jiangwei victorsungo bruinxiong fpzh2011 jeffrey28 shengzing zhenyangiacas ddlbojack nlpkazhe sky1worker chelsea-yeung reborm ftgreat felixgithub2017 jchzhao zxyscz hanbu97 baponkar neromous lw3259111 newlightlw kindow guihonghao furongpeng jibin5167 ucas010 julyhcw rchanggogogo chenllliang royiluo mikesimb s1530129650 zhanghg1983 git-models glin93 vinglogn wangpeiyi9979 qshuang123 synho gaoxiaojun xwjim syp2ysy zss205 xgreat8 amishakov vishal-s-p robertmarton dkzdev huijiawu0 152334h thalesfsp oalee ltngonnguyen kennydizi tpk28 jgayle99 worthmining jasonfenggit deltavml teknium1 generalzh l1aoxingyu haorand lp177 hertera1 ogyank mars-wei songfang w32zhong sysang heraclex12 elshor wang-xiaodong1899 tadeze xipq thanhpham1987 franciszhangkk xedis gptcrash notvicent3 jingchensun greatwildfire belle9217 zhymma moyeen52 dongxf369 heli-dawnlab703 250handsomeliang

llama-x's Issues

utils.py and openai==0.8.0 TypeError: 'type' object is not subscriptable

this code ./src/utils.py
line 47 return_text=False in
def openai_completion(
prompts: Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]],
decoding_args: OpenAIDecodingArguments,
model_name="text-davinci-003",
sleep_time=2,
batch_size=1,
max_instances=sys.maxsize,
max_batches=sys.maxsize,
return_text=False,
shows this error:
File "train.py", line 27, in
import utils
File "/home/llama/Llama-X-main/src/utils.py", line 47, in
return_text=False,
TypeError: 'type' object is not subscriptable

my openai==0.8.0,deepspeed==0.10.0,and other pip in requirements.txt , After changed the transformer version to 4.29.2 but i can't slove this error, maybe some pip version are not true, if anyone have some suggestion, please leave your comment here，thank you very much!

there is a endless loop in convert_tokens_to_ids(self.unk_token) and self._convert_token_to_id_with_added_voc(tokens)

hi,this is part of my error information,thanks!

File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
return self.unk_token_id
File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_base.py", line 1150, in unk_token_id
return self.convert_tokens_to_ids(self.unk_token)
File "/mnt/data/wxc/workspace/Llama-X/src/transformers/src/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
return self._convert_token_to_id_with_added_voc(tokens)

How to fix unstable loss. I am using wizardlm or Llama-X training code with vicuna style chat format for fine-tuning Llama-2-7b-hf model.

I'm using the 'Llama-X' (https://github.com/AetherCortex/Llama-X) training code with the vicuna-style chat template to fine-tune the Llama-2-7b-hf model. However, I'm observing an unstable loss during the process.

Please find the detailed Weights & Biases report at (https://wandb.ai/arpitsh018/huggingface/reports/Untitled-Report--Vmlldzo1NjE2Njgz).

Training Parameters:

os.system(f'deepspeed train.py
--model_name_or_path meta-llama/Llama-2-7b-hf
--data_path ../data/dummy_conversation.json
--output_dir ./checkpoint/finetuned-llama2-7b
--num_train_epochs 1
--model_max_length 4096
--per_device_train_batch_size 1
--per_device_eval_batch_size 1
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 1000
--save_total_limit 3
--learning_rate 2e-5
--warmup_steps 0
--logging_steps 1
--lr_scheduler_type "cosine"
--report_to "wandb"
--gradient_checkpointing True
--deepspeed configs/deepspeed_config.json
--bf16 True')

deepspeed_config.json:

{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 0,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0,
"stage3_gather_16bit_weights_on_model_save": true
},
"bf16": {
"enabled": true
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": [0.9, 0.999],
"eps": 1e-8,
"weight_decay": 0
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"wall_clock_breakdown": false
}

I'm eager to understand how to stabilize the loss for my training sessions. Any insights or recommendations would be greatly appreciated.

Learning rate fixed at 0 during training via DeepSpeed

I followed all the setup instruction given in the README.
The command I am using is:

deepspeed train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path Llama-X/data/alpaca_data.json \
    --output_dir ./model_weights_finetuned \
    --num_train_epochs 3 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True

Initially, I got the following error:

ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.

I downgraded to transformers version 4.29.2 as suggested here.

Now, training is happening but the learning rate from the beginning itself is fixed to zero. Below are the logs:

[2023-08-28 04:36:42,566] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:43,585] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-28 04:36:43,585] [INFO] [runner.py:555:main] cmd = /home/anmol/anaconda3/envs/llamax/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --model_name_or_path meta-llama/Llama-2-7b-hf --data_path /home/anmol/TieredModels/code/08_llamax_approach/Llama-X/data/alpaca_data.json --output_dir ./model_weights_finetuned --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 64 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 2 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed configs/deepspeed_config.json --fp16 True
[2023-08-28 04:36:44,193] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,187] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-28 04:36:45,187] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-28 04:36:45,187] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-28 04:36:45,187] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-28 04:36:45,187] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-28 04:36:45,955] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,977] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,980] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:45,996] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-28 04:36:47,765] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,765] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,765] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-28 04:36:47,772] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,772] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,773] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,773] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:47,801] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-28 04:36:47,801] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-28 04:36:54,794] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.91s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:03<00:00,  1.97s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Running tokenizer on train dataset (num_proc=32): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52002/52002 [00:02<00:00, 24943.76 examples/s]
52002
Sample 12208 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 4391, 385, 4544, 1813, 411, 263, 28435, 322, 263, 1014, 2813, 292, 13, 13, 2277, 29937, 13291, 29901, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 29966, 1420, 29958, 13, 1678, 529, 2813, 29958, 13, 4706, 529, 3257, 29958, 5494, 292, 322, 3323, 2813, 292, 829, 3257, 29958, 13, 1678, 1533, 2813, 29958, 13, 1678, 529, 2587, 29958, 13, 4706, 529, 29882, 29896, 29958, 5494, 292, 829, 29882, 29896, 29958, 13, 4706, 529, 29882, 29906, 29958, 4035, 2813, 292, 829, 29882, 29906, 29958, 13, 1678, 1533, 2587, 29958, 13, 829, 1420, 29958, 2]}.
Sample 46872 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 29871, 29941, 29899, 29946, 10541, 5828, 1048, 263, 285, 9102, 1058, 27401, 278, 2462, 29889, 13, 13, 2277, 29937, 13291, 29901, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 26222, 2501, 263, 931, 29892, 727, 471, 263, 26565, 285, 9102, 1058, 10600, 297, 263, 282, 898, 297, 278, 25013, 29889, 3118, 2462, 29892, 263, 2107, 13569, 3974, 14455, 714, 29892, 20616, 292, 599, 278, 15006, 297, 278, 4038, 29889, 450, 285, 9102, 471, 10087, 304, 1371, 29892, 577, 1183, 5089, 2986, 714, 310, 278, 282, 898, 322, 4822, 278, 25013, 29892, 11705, 292, 19225, 11308, 297, 902, 13394, 29889, 2296, 5096, 287, 701, 278, 11308, 2820, 902, 282, 898, 322, 2825, 263, 2594, 4336, 304, 5557, 278, 3974, 515, 9677, 292, 29889, 2973, 14332, 9109, 29892, 278, 285, 9102, 750, 7160, 278, 2462, 29991, 2]}.
Sample 4920 of the training set: {'input_ids': [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29892, 3300, 2859, 411, 385, 1881, 393, 8128, 4340, 3030, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 6113, 263, 740, 297, 5132, 304, 7252, 1023, 6031, 29889, 13, 13, 2277, 29937, 10567, 29901, 13, 1576, 1023, 6031, 526, 525, 11548, 29915, 322, 525, 272, 927, 4286, 13, 13, 2277, 29937, 13291, 29901, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1753, 7252, 29918, 19651, 29898, 29879, 29896, 29892, 269, 29906, 1125, 13, 29871, 396, 3831, 598, 1023, 6031, 322, 736, 263, 7223, 995, 13, 29871, 565, 269, 29896, 1275, 269, 29906, 29901, 13, 1678, 736, 5852, 13, 29871, 1683, 29901, 13, 1678, 736, 7700, 13, 13, 29937, 4321, 1206, 13, 1807, 29896, 353, 525, 11548, 29915, 13, 1807, 29906, 353, 525, 272, 927, 29915, 13, 13, 2914, 353, 7252, 29918, 19651, 29898, 1807, 29896, 29892, 1347, 29906, 29897, 13, 2158, 29898, 2914, 29897, 2]}.
[2023-08-28 04:37:06,368] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,370] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,376] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-08-28 04:37:06,377] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
 [WARNING]  cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /home/anmol/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/anmol/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/TH -isystem /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/include/THC -isystem /home/anmol/anaconda3/envs/llamax/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -march=native -fopenmp -D__AVX256__ -D__DISABLE_CUDA__ -c /home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/home/anmol/anaconda3/envs/llamax/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.43460202217102 seconds
Time to load cpu_adam op: 16.424538373947144 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.438047647476196 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 16.528525590896606 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.06}

Does anyone have any idea on what I might be doing wrong ?

good in llama-7b,but bad in llama-13B ?

finetune the same data on translation,good result in llama-7b,while bad result in llama-13b.why?

Concern on the language

Interesting project, but I have some concern on the language.
As is known that there are less Chinese tokens in the training data of Llama, and each Chinese token is tokenized into several tokens which is ineffecient in generation. Would the project hand this? e.g. add new tokens and do some pretraining?

Have you ever tried PEFT on LLaMA?

Hi, have you ever tried to integrate PEFT to your code and provide the training scripts ?

How to run 13B LLaMA model for Full-parameters SFT ?

Is there anybody can provide a running command example for LLaMA-13B model Full-parameters SFT?

Btw, I gonna run this experiment on 8 * A100-80GB hardware envs.

Thanks so much!

About the training strategy

Very nice project and appreciate your contribution!

I have seen the deepspeed config and I want to confirm the current training strategy. For LLaMA-13B, the training uses Zero-3 optimization, checkpointing, and CPU-offload, right? I'm curious if you have tried tensor parallel (used in original LLaMA training) or model parallel?

We would also love to contribute to the training implementation about model parallel for fast large scale training, aiming at models greater than 13B. Currently I'm investigating torch/fairscale pipeline parallel mechanism.

Best,
Fangkai

About Llama-X and Alpaca repo

Hi, may I know why the hyperparameters of the training command in Llama-x (this repo) and Alpaca are different. Eg., the batch size 128 vs. 512 (64*8), the warmup steps 0.03 (ratio) vs. 2.
Which hyperparameter should we adopt?

Another question is what is the Llama-i (7B) in the Llama-X Evaluation section? And the GSM8K result is 18.8% while my own LLAMA-X model (using the hyperparamters in this repo) is only 10%. Not sure why the gap is so large. Would you mind sharing your evaluation script on GSM8K in Llama-X? Thank you.

improve LLaMA for visual understanding like GPT-4

Thanks for the good works!

We have tried to improve LLaMa model to understand visual information and support multi-modal chatting.
We are inspired that a good vit, e.g., CLIP vision encoder, and a well-trained large language model, e.g., LLaMA, with connection network, e.g., MLP or Transformer, can cover visual applications, like PALM-E.

The results in image captioning, VQA, and more multi-modal tasks, are promising in 7B and we call on more people to support testing of larger models.

Github: https://github.com/feizc/Visual-LLaMA

fine-tuning scripts and hyper-parameters setting
datasets for fine-grained alignment and instruct tuning
interactive gradio and visual chatbot

Demo is dead

The demo link gives an 502

extremely bad performance

just an FYI for other people who may use this repo

Why use offload_param in CPU？

I think the model fragment loading can be completed under the 6.7B parameter, why use parameterized offload to the cpu?

        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },

Need optimization for mps

Reproduction info:

MacBook Pro 2021 14' with M1 chip
macOS 12.4
6B model

The inference speed is very slow on macOS machine with mps, which needs further optimization.

only load lm_head.weight and embed_tokens.weight parameters

I can only load these parameters,
○ lm_head.weight : 131076096
○ model.embed_tokens.weight : 131076096
and the other parameter is None.

Number of parameters:  262152192
model.embed_tokens.weight : 131076096
model.layers.0.self_attn.q_proj.weight : 0
model.layers.0.self_attn.k_proj.weight : 0
model.layers.0.self_attn.v_proj.weight : 0
model.layers.0.self_attn.o_proj.weight : 0
model.layers.0.mlp.gate_proj.weight : 0
model.layers.0.mlp.down_proj.weight : 0
model.layers.0.mlp.up_proj.weight : 0
model.layers.0.input_layernorm.weight : 0
model.layers.0.post_attention_layernorm.weight : 0

how big is your RAM

I trained on 4V100 32g and got OOM . If i set offload to cpu ,the process will be killed because of out of RAM，So how large is your RAM ? And is it possible to train llama 7B with 4v100? Really thanks for reply.

RuntimeError: CUDA out of memory.

I fine-tune llama-7b on 8 V100 32G. However， it occurs CUDA out of memory.

RuntimeError: CUDA out of memory. Tried to allocate 688.00 MiB (GPU 6; 31.75 GiB total capacity; 29.87 GiB already allocated; 41.94 MiB free; 30.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-04 17:43:59,766] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1840
[2023-04-04 17:44:02,370] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1841
[2023-04-04 17:44:05,213] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1842
[2023-04-04 17:44:07,817] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1843
[2023-04-04 17:44:10,420] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1844
[2023-04-04 17:44:13,023] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1845
[2023-04-04 17:44:15,585] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1846
[2023-04-04 17:44:15,586] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1847
[2023-04-04 17:44:18,429] [ERROR] [launch.py:324:sigkill_handler] ['/home/aiscuser/.conda/envs/llamax/bin/python3.1', '-u', 'train.py', '--local_rank=7', '--model_name_or_path', './llama-7b-hf', '--data_path', '../data/alpaca_data.json', '--output_dir', 'output/', '--num_train_epochs', '3', '--per_device_train_batch_size', '64', '--per_device_eval_batch_size', '1', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '2', '--learning_rate', '2e-5', '--warmup_steps', '2', '--logging_steps', '2', '--lr_scheduler_type', 'cosine', '--report_to', 'tensorboard', '--gradient_checkpointing', 'True', '--deepspeed', 'configs/deepspeed_config.json', '--fp16', 'True'] exits with return code = 1

watch -n 1 nvidia-smi


    Tue Apr  4 17:48:15 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000001:00:00.0 Off |                    0 |
    | N/A   42C    P0    67W / 300W |   4491MiB / 32768MiB |     32%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000002:00:00.0 Off |                    0 |
    | N/A   44C    P0    57W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000003:00:00.0 Off |                    0 |
    | N/A   40C    P0    54W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000004:00:00.0 Off |                    0 |
    | N/A   41C    P0    56W / 300W |   2482MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  Tesla V100-SXM2...  On   | 00000005:00:00.0 Off |                    0 |
    | N/A   39C    P0    54W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  Tesla V100-SXM2...  On   | 00000006:00:00.0 Off |                    0 |
    | N/A   43C    P0    59W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  Tesla V100-SXM2...  On   | 00000007:00:00.0 Off |                    0 |
    | N/A   40C    P0    55W / 300W |   2502MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  Tesla V100-SXM2...  On   | 00000008:00:00.0 Off |                    0 |
    | N/A   43C    P0    53W / 300W |   2522MiB / 32768MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|

About llama-2-70B fine-tuning

Thanks for your amazing work!

I'm experiencing out-of-memory problems when using Llama-X's fine-tuning code to do Supervised fine-tuning for 70B (Llama-2-13B doesn't have this problem), using a configuration of 3 sets of 8*A100 (40G).

So would like to inquire about the training configuration used if possible, thanks a lot!

Does this repository support training of the GPT series?

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.