pacman100 / llm-workshop Goto Github PK

View Code? Open in Web Editor NEW

315.0 315.0 110.0 45.73 MB

LLM Workshop by Sourab Mangrulkar

License: Apache License 2.0

Python 4.15% Jupyter Notebook 95.23% Shell 0.62%

llm-workshop's Introduction

hey there

About Me :

I'm Sourab Mangrulkar; an Applied Scientist and Machine Learning Engineer from India 🇮🇳.

🔭 I’m currently working as an Applied Scientist at Amazon.
🌱 Exploring Natural Language Processing, Computer Vision and Distributed Training at Scale. Always up for meaningful collaboration.
😄 Pronouns: He/His/Him.
⚡ Painting 🎨, sketching ✍️ and poetry 📝 are my favourite hobbies. Recently, I've started reading up on stocks and economic markets.
📫 How to reach me:

llm-workshop's People

Contributors

Stargazers

Watchers

Forkers

archan49 chintan-donda challasandeep420 nikshrimali siva-taicho77 keshavrg taltaf913 pksvv sahdev77 annejeevan archan49scs iamanubhav10 jphme oceanos74 bdytx5 sonwy2 imvijay23 orangetin zhaoli2017 xiechengmude yynnxu singl3 aprilatmc hiboyang bkrmint richardtvu densimm fnauman jon-chun dhinchakdhoomdev 3dalgolab taner45 arpieb phoenicxai joru10 osipov bdice codeblackwell homiehari skazo4nick wenjun90 0xd49daa johnbendi datxuantran shyam-sundar-7 nicoyang-21 lchcapitalhumain soheilpaper andrmoura amrrs techthiyanes creatorlimen hkordbacheh wangguojim we1k thoo purnasai benedict-lee prashant9501 badbyte7 mankra zhichaoxu-shufe saravananpsg timxx ismail777777 dipeshpaulsystango manfredwang093 pminhyung auyez-k peiji1981 sebastianbodza rr-jino-jose shatealaboxiaowang karn3003 omnave c-tc alirezabayatmk kghamilton89 mrm8488 shossain nahidalam kiranbharadwaj alvin-r npn279 vualidon allenliu88 blockchainexpert2022 weltonjerry gnosia93 bohdanpetryshyn felix-red-panda lancelotnd heshwa inweb3 mrvladd1 guptaarpit arjun-mavonic faroucksc haizhixin2021 kristiankielhofner

llm-workshop's Issues

Incorrectness in Flash Attention

We completely ignore attention_mask here: https://github.com/pacman100/DHS-LLM-Workshop/blob/53672e1b774da7798fb10a50ef8ca5b2750c5608/personal_copilot/training/starcoder_flash_attn_monkey_patch.py#L60

If the input had padding, then this is incorrect (not by a major amount I think but that might depend on how much padding the input has).
We need to maintain cu_seqlens and use the packed version of flash here.
But the current implementementation is easier to implement and maintain I guess?

Can we add a note regarding the incorrect behaviour?

train Segmentation fault

python train.py
--model_path "bigcode/starcoderbase-1b"
--dataset_name "smangrul/hf-stack-v1"
--subset "data"
--data_column "content"
--split "train"
--seq_length 2048
--max_steps 2000
--batch_size 1
--gradient_accumulation_steps 1
--learning_rate 5e-5
--lr_scheduler_type "cosine"
--weight_decay 0.01
--num_warmup_steps 30
--eval_freq 100
--save_freq 500
--log_freq 25
--use_reentrant False
--num_workers 4
--bf16
--no_fp16
--output_dir "starcoderbase1b-personal-copilot-A100-40GB-colab"
--fim_rate 0.5
--fim_spm_rate 0.5
--use_flash_attn

error:/u01/liuys/anaconda3/envs/code/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/u01/liuys/anaconda3/envs/code/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Segmentation fault (core dumped)

Fine Tuning with LoRA failed during train step

Below is the notebook link from your blog - https://huggingface.co/blog/personal-copilot
https://colab.research.google.com/drive/1Tz9KKgacppA4S6H4eo_sw43qEaC9lFLs?usp=sharing

!git pull
!python train.py \
    --model_name_or_path "bigcode/starcoder" \
    --dataset_name "smangrul/hf-stack-v1" \
    --subset "data" \
    --data_column "content" \
    --splits "train" \
    --seq_length 2048 \
    --max_steps 2000 \
    --batch_size 4 \
    --gradient_accumulation_steps 4 \
    --learning_rate 5e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 0.01 \
    --num_warmup_steps 30 \
    --eval_freq 100 \
    --save_freq 100 \
    --log_freq 25 \
    --num_workers 4 \
    --bf16 \
    --no_fp16 \
    --output_dir "peft-lora-starcoder15B-v2-personal-copilot-A100-40GB-colab" \
    --fim_rate 0.5 \
    --fim_spm_rate 0.5 \
    --use_peft_lora \
    --lora_r 32 \
    --lora_alpha 64 \
    --lora_dropout 0.0 \
    --lora_target_modules "c_proj,c_attn,q_attn,c_fc,c_proj" \
    --use_flash_attn \
    --use_4bit_qunatization \
    --use_nested_quant \
    --bnb_4bit_compute_dtype "bfloat16"

I am stuck at this step.

Below is the error

Already up to date.
2024-05-09 20:44:58.617684: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-09 20:44:58.617733: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-09 20:44:58.619695: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-09 20:44:58.630452: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-09 20:45:00.111432: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/DHS-LLM-Workshop/personal_copilot/training/train.py", line 494, in <module>
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 348, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--subset', 'data', '--data_column', 'content', '--seq_length', '2048', '--batch_size', '4', '--num_warmup_steps', '30', '--eval_freq', '100', '--save_freq', '100', '--log_freq', '25', '--num_workers', '4', '--no_fp16', '--use_4bit_qunatization']

Problem training with FSDP

When I am trying to train a model with FSDP, I am getting following error.

*** TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union

It is happening on this specific line
trainer.model = trainer.model_wrapped = FSDP(trainer.model, **kwargs)

and after a bit of debugging it feels like it has something to do with auto_wrap_policy. I am not really sure how to solve this. Do you have any suggestions. It was working fine until few days ago.

Using device_map auto when launch acceleator

in https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/training/utils.py#L182C9-L182C19, what is the reason to set device_map = 'auto'
When I run it with accelerator (with fsdp) I got the error

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.

Also in here (huggingface/accelerate#1840) it states that it is not compatible with using DistributedDataParallel.

how to load model trained by accelerate with fsdp.

Hi dear:

I finetuned with accelerate with fsdp, but i do not know how to load checkpoint to do inference, checkpoint output is as below:

checkpoin-100

optomizer_0
- __0_0.distcp
- __1_0.distcp
- _-2_0,distcp
pytorch_model_fsdp_0
- .metadata
- __0_0.distcp
- __1_0.distcp
- _-2_0,distcp
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
scheduler.pt
trainer_state.json

Looking forward to your reply

chat assistant traning: CUDA out of memory

Hi, I am getting a CUDA out of memory error when I try to run the chat_assistant training's run_fsdp.sh script on a 34b model. Changing the model from 7b to 34b is the only change I made.

Local edits

I only edited chat_assistant/training/run_fsdp.sh to replace the 7b model with a 34b model. Screenshot:

Stack trace

  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 252, in <module>
    main(args)
  File "/home/ubuntu/steven/DHS-LLM-Workshop/chat_assistant/training/train.py", line 223, in main
    trainer.train()
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1986, in backward
    loss.backward(**kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.29 GiB. GPU 1 has a total capacty of 79.19 GiB of which 1.26 GiB is free. Including non-PyTorch memory, this process has 77.93 GiB memory in use. Of the allocated memory 74.62 GiB is allocated by PyTorch, and 1.24 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):

Hardware / Gpu info

I was running this job on a machine with 8 H100 gpus. Below is screenshot of nvidia-smi -l 1 output; you can see that in the span of 5 seconds the GPU memory usage spiked from around 42GiB on all 8 devices to 80GiB on all 8 devices, and then the process crashed.

Full logs:
full_logs_dhs.txt

Let me know if there's any additional information I can provide to be helpful. Thanks in advance!

How to draw the loss graph?

Error on save_steps using FSDP

I am currently using the FSDP (Fully Sharded Data Parallelism) approach with the Llama 2 70B model. The training process has begun, but I encounter an error when attempting to save the checkpoint at each save_step. I have set the save_step as 50.

System: 1 Node with 2 A100 80 GB GPU

Here are the supporting screenshots

@pacman100

url in the readme is broken

Hi @pacman100, the url mentioned in the README is broken

No Version mentioned for any library used in the project.

I tried to follow https://huggingface.co/blog/personal-copilot

But everything is breaking. Maybe because the version of the dependency needed is not mentioned. Is it not a good practice to mention the version so that we are always sure the software keeps working. How can I know which version of the dependency is compatible for the current date.

Can you please make sure to update the requirements.txt? Also, Please update the version of python dependency in the colab link as well. Referring to the PEFT one.

Eval is like running forever

Hello author,

Thanks for your tutorial.

I am using the dataset hf-codegen-v2 which has 370k rows.

The validation set is about 1850. The batch size is 4. For other params, they are the same as the ones in run_peft.sh.

The training speed is normal but the eval loop is running forever.

Below is the log for evaluation:
{'eval_loss': 0.1817416101694107, 'eval_runtime': 6666.0306, 'eval_samples_per_second': 8.072, 'eval_steps_per_second': 2.018, 'epoch': 0.5}

I am not sure if this is normal.

Any help will be appreciated!

Enhancements for Efficient Utilization and Optimization in Fine-tuning Llama 2 70B Example

Hi @pacman100 ,

Firstly, thank you for the well-detailed article! I am writing to provide some feedback and seek clarification.

Optimizer Selection:
- The blog post demonstrates the use of a particular optimizer, "paged_adamw_32bit". However, upon altering this to "adamw_torch", I encountered an Out Of Memory (OOM) issue. Could you elucidate on the critical role the default optimizer plays in the successful execution of the example provided? Any insight into why the memory issue arises with "adamw_torch" would be highly valuable.
GPU Utilization:
- In attempting to replicate the described setup on a 2 nodes x 8 H100s machine, I observed a relatively low GPU utilization rate of around 20% with the GPUs drawing only about ~200 Watts. Is there any recommendation on how to elevate the GPU utilization rate, to potentially expedite the training process and maximize the computational resources at hand?

Your guidance will be immensely beneficial!

Thank you for your time.

train.py: error: ambiguous option: --split could match --splits, --split_batches

I uploaded my private github repo as data set to private hugging face dataset. Below is the error I receive when I try to train using PEFT method

2024-05-05 18:43:36.206142: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-05 18:43:36.206194: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-05 18:43:36.207621: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-05 18:43:36.214881: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-05-05 18:43:37.321192: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
usage: train.py [-h] --model_name_or_path MODEL_NAME_OR_PATH [--lora_alpha LORA_ALPHA]
[--lora_dropout LORA_DROPOUT] [--lora_r LORA_R]
[--lora_target_modules LORA_TARGET_MODULES]
[--use_nested_quant [USE_NESTED_QUANT]]
[--bnb_4bit_compute_dtype BNB_4BIT_COMPUTE_DTYPE]
[--bnb_4bit_quant_type BNB_4BIT_QUANT_TYPE] [--use_flash_attn [USE_FLASH_ATTN]]
[--use_peft_lora [USE_PEFT_LORA]]
[--use_8bit_qunatization [USE_8BIT_QUNATIZATION]]
[--use_4bit_quantization [USE_4BIT_QUANTIZATION]]
[--use_reentrant [USE_REENTRANT]] [--use_unsloth [USE_UNSLOTH]]
[--use_loftq [USE_LOFTQ]] [--use_loftq_callback [USE_LOFTQ_CALLBACK]]
[--dataset_name DATASET_NAME] [--dataset_text_field DATASET_TEXT_FIELD]
[--max_seq_length MAX_SEQ_LENGTH] [--test_size TEST_SIZE] [--fim_rate FIM_RATE]
[--fim_spm_rate FIM_SPM_RATE] [--splits SPLITS] --output_dir OUTPUT_DIR
[--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]]
[--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]]
[--eval_strategy {no,steps,epoch}] [--prediction_loss_only [PREDICTION_LOSS_ONLY]]
[--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE]
[--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
[--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
[--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY]
[--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY]
[--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON]
[--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS]
[--max_steps MAX_STEPS]
[--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau,cosine_with_min_lr,warmup_stable_decay}]
[--lr_scheduler_kwargs LR_SCHEDULER_KWARGS] [--warmup_ratio WARMUP_RATIO]
[--warmup_steps WARMUP_STEPS]
[--log_level {detail,debug,info,warning,error,critical,passive}]
[--log_level_replica {detail,debug,info,warning,error,critical,passive}]
[--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node]
[--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
[--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS]
[--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter]
[--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS]
[--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]]
[--no_save_safetensors] [--save_on_each_node [SAVE_ON_EACH_NODE]]
[--save_only_model [SAVE_ONLY_MODEL]]
[--restore_callback_states_from_checkpoint [RESTORE_CALLBACK_STATES_FROM_CHECKPOINT]]
[--no_cuda [NO_CUDA]] [--use_cpu [USE_CPU]] [--use_mps_device [USE_MPS_DEVICE]]
[--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
[--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]]
[--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,apex,cpu_amp}]
[--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]]
[--tf32 TF32] [--local_rank LOCAL_RANK]
[--ddp_backend {nccl,gloo,mpi,ccl,hccl,cncl}] [--tpu_num_cores TPU_NUM_CORES]
[--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG [DEBUG ...]]
[--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS]
[--dataloader_prefetch_factor DATALOADER_PREFETCH_FACTOR]
[--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
[--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns]
[--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]]
[--metric_for_best_model METRIC_FOR_BEST_MODEL]
[--greater_is_better GREATER_IS_BETTER] [--ignore_data_skip [IGNORE_DATA_SKIP]]
[--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS]
[--fsdp_config FSDP_CONFIG]
[--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP]
[--accelerator_config ACCELERATOR_CONFIG] [--deepspeed DEEPSPEED]
[--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
[--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_torch_npu_fused,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit,rmsprop,rmsprop_bnb,rmsprop_bnb_8bit,rmsprop_bnb_32bit,galore_adamw,galore_adamw_8bit,galore_adafactor,galore_adamw_layerwise,galore_adamw_8bit_layerwise,galore_adafactor_layerwise}]
[--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]]
[--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME]
[--report_to REPORT_TO] [--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS]
[--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB]
[--ddp_broadcast_buffers DDP_BROADCAST_BUFFERS]
[--dataloader_pin_memory [DATALOADER_PIN_MEMORY]] [--no_dataloader_pin_memory]
[--dataloader_persistent_workers [DATALOADER_PERSISTENT_WORKERS]]
[--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics]
[--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
[--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT]
[--hub_model_id HUB_MODEL_ID]
[--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
[--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]]
[--hub_always_push [HUB_ALWAYS_PUSH]]
[--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
[--gradient_checkpointing_kwargs GRADIENT_CHECKPOINTING_KWARGS]
[--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]]
[--eval_do_concat_batches [EVAL_DO_CONCAT_BATCHES]] [--no_eval_do_concat_batches]
[--fp16_backend {auto,apex,cpu_amp}] [--evaluation_strategy {no,steps,epoch}]
[--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
[--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION]
[--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
[--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]]
[--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO]
[--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
[--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND]
[--torch_compile_mode TORCH_COMPILE_MODE] [--dispatch_batches DISPATCH_BATCHES]
[--split_batches SPLIT_BATCHES]
[--include_tokens_per_second [INCLUDE_TOKENS_PER_SECOND]]
[--include_num_input_tokens_seen [INCLUDE_NUM_INPUT_TOKENS_SEEN]]
[--neftune_noise_alpha NEFTUNE_NOISE_ALPHA]
[--optim_target_modules OPTIM_TARGET_MODULES]
train.py: error: ambiguous option: --split could match --splits, --split_batches

Flash Attention for fine-tuning

How can we use flash attention v2 for fine-tuning with huggingface models?

Does the path only works for pre-training(or extended pre-training)?

Link

All the discussions mentioned below are for pre-training(or extended pre-training).

I would like to fine-tune bigcode/starcoder 15.5 billion parameter model with 2k context length using A100-80GB.

In the preparetion of train dataset, why "input_ids" is equal with "labels"?

in the personal_copilot/training/train.py class ConstantLengthDataset(IterableDataset) where 249th and 250th line.
yield { "input_ids": torch.LongTensor(example), "labels": torch.LongTensor(example), }

If input is " I LIKE APPLE", why we need to teach the model to repeat "I LIKE APPLE" instead of saying "I LIKE APPLE, BECAUSE IT IS SWEET"?

Packing = True

Quick Question:

You have kept packing=True.
Shouldn't we be computing loss for the response part only? When setting packing as true, we won't get that functionality of mentioning to compute loss only for response template.
So, is it ok to train(compute loss) right from first token (including prompt) in sft?

@pacman100

Finetune 70B model on one node

Thanks for your educational blog post and this repo.

Could you please provide your scripts to finetune the 70B model in this repo?

BTW, when I run your 7B finetune script, it consumes ~60GB GPU memory using FSDP with batch size=1. Is this normal?

pacman100 / llm-workshop Goto Github PK

llm-workshop's Introduction

hey there

About Me :

📝 Research :

✍️ Blog Posts :

💬 Talks and Presentations