
Comments (13)

DejianYang avatar DejianYang commented on August 22, 2024 1

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]).

https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading
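
That error usually means the checkpoint was saved while the weights were still partitioned by ZeRO-3, so the saved tensors are empty (torch.Size([0])). A minimal sketch of the gather-then-save pattern that the linked guide describes (the model name and output path here are placeholders):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
model = accelerator.prepare(model)

# ... training loop ...

# Under ZeRO-3 each rank only holds a shard of every parameter, so the full
# state dict has to be gathered through the accelerator before saving.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "SaveOutputFolder",  # placeholder output path
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)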

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024 1

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]).

https://huggingface.co/docs/accelerate/usage_guides/deepspeed#saving-and-loading

Thank you so much for your time and help. This helped me understand the different DeepSpeed config files and ZeRO stages.
What I did to resolve my issue was to add trainer.save_model() and the tokenizer save to the finetune script, as below:

trainer.train()
trainer.save_model("SaveOutputFolder")
trainer.tokenizer.save_pretrained("SaveOutputFolder")
trainer.save_state()

Thanks once again so much for all your help.

from deepseek-coder.

DejianYang avatar DejianYang commented on August 22, 2024

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Exit code -9 means the process was killed (SIGKILL), usually by the OS out-of-memory killer. Please check whether you have enough CPU memory.

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Exit code -9 means the process was killed (SIGKILL), usually by the OS out-of-memory killer. Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek require for finetuning?

from deepseek-coder.

DejianYang avatar DejianYang commented on August 22, 2024

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Exit code -9 means the process was killed (SIGKILL), usually by the OS out-of-memory killer. Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek require for finetuning?

I do not have an exact number for the RAM required by finetuning. The finetune script uses DeepSpeed, which requires a lot of RAM for CPU offload. You can try another DeepSpeed config to reduce the CPU memory used if you have enough GPU memory. Maybe you can try our 1.3b model first.

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024

I am trying to finetune DeepSeek-Coder but I am getting this -9 kill code, and I have no idea why. My dataset is in the following format:

Exit code -9 means the process was killed (SIGKILL), usually by the OS out-of-memory killer. Please check whether you have enough CPU memory.

I have 64 GB of RAM (CPU memory). How much does DeepSeek require for finetuning?

I do not have an exact number for the RAM required by finetuning. The finetune script uses DeepSpeed, which requires a lot of RAM for CPU offload. You can try another DeepSpeed config to reduce the CPU memory used if you have enough GPU memory. Maybe you can try our 1.3b model first.

Can you give me an idea of how much GPU VRAM I would need if I have 64 GB of system RAM?
Moreover, should I put false instead of true in the CPU offload parameters in ds_config_zero3.json, like:
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": false
},
"offload_param": {
"device": "cpu",
"pin_memory": false
},
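
Or would the cleaner way be to disable offload entirely so everything stays on the GPU? I am guessing from general DeepSpeed config conventions (not this repo's docs) that setting the offload device to "none" would do that:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "none"
    },
    "offload_param": {
        "device": "none"
    }
}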

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024

[Screenshot from 2024-03-13 13-06-24]

@DejianYang, can you help me?

I was able to finetune the 6.7b parameter model using 1 x H100 80GB SXM5 (80 GB VRAM, 251 GB RAM, 24 vCPUs). The finetune script created files in the given output folder, but model.safetensors is only 539.6 kB.

Doing inference on the finetuned directory first gave the following error:

"RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32256, 2048]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method."

After setting the ignore_mismatched_sizes=True argument in the from_pretrained method, the model outputs gibberish. You can see the inference code and the output in the screenshot.

Am I missing something?
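
In case it matters, two things I have seen suggested for ZeRO-3 (from general DeepSpeed/Transformers usage; I have not confirmed them for this repo) are setting "stage3_gather_16bit_weights_on_model_save": true in the DeepSpeed config so trainer.save_model() writes the full gathered weights, or consolidating the sharded checkpoint afterwards:

# Sketch: rebuild full fp32 weights from the ZeRO shards that DeepSpeed
# writes under the output directory (the global_step*/ folders).
from transformers import AutoModelForCausalLM
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base", trust_remote_code=True)
model = load_state_dict_from_zero_checkpoint(model, "SaveOutputFolder")  # checkpoint dir
model.save_pretrained("ConsolidatedFolder")  # tensors are no longer empty

Would either of these be the right fix here?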

from deepseek-coder.

seancarmod-y avatar seancarmod-y commented on August 22, 2024

@DejianYang I'm looking to fine-tune deepseek-coder-1.3b-base. Ideally, I'd like to do it using Hugging Face libraries as I have done for TinyLlama in the attached file. Is this possible, or do I need to use finetune_deepseekcoder.py (and can that even be used for the 1.3b model)?
fine_tune_tiny_llama.txt

from deepseek-coder.

DejianYang avatar DejianYang commented on August 22, 2024

fine_tune_tiny_llama.txt
Yes, you can use your script to finetune our model, just as you would with other Llama models.
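
If it helps, here is a minimal sketch of a plain Hugging Face Trainer loop for deepseek-coder-1.3b-base (the dataset path and hyperparameters are placeholders, not the repo's official settings):

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding during collation

# Expects a CSV with a 'text' column, like the TinyLlama setup described above.
dataset = load_dataset("csv", data_files="train.csv", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-5, bf16=True),
    train_dataset=dataset,
    # mlm=False gives standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("out")
tokenizer.save_pretrained("out")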

from deepseek-coder.

seancarmod-y avatar seancarmod-y commented on August 22, 2024

fine_tune_tiny_llama.txt
Yes, you can use your script to finetune our model, just as you would with other Llama models.

Hi, that's great, thanks. I can't seem to find documentation on how to format the custom dataset. For both Llama2 and TinyLlama I formatted it as a CSV with a 'text' column. Is there a similar format I can follow for deepseek-coder-1.3b? The format of each row is:
Llama2: [INST] prompt [/INST] Llama2 answer </s>
TinyLlama: <|user|>
prompt
<|assistant|>
TinyLlama answer
I then load the dataset like this:
from datasets import load_dataset
dataset = load_dataset(dataset_folder, split="train")

from deepseek-coder.

LarkLeeOnePiece avatar LarkLeeOnePiece commented on August 22, 2024

Sorry, were you able to solve the mismatched_sizes problem after adding
trainer.train()
trainer.save_model("SaveOutputFolder")
trainer.tokenizer.save_pretrained("SaveOutputFolder")
trainer.save_state()
I met the same problem, could you help me out?
Do you use AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained to load the model and tokenizer?

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024

Sorry, were you able to solve the mismatched_sizes problem after adding trainer.train() trainer.save_model("SaveOutputFolder") trainer.tokenizer.save_pretrained("SaveOutputFolder") trainer.save_state()? I met the same problem, could you help me out? Do you use AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained to load the model and tokenizer?

Yes, I used the following code for inference:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("SaveOutputFolder", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("SaveOutputFolder", ignore_mismatched_sizes=True, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=128,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    temperature=0.9,  # Adjust as needed
    repetition_penalty=1.2,  # Penalize repeated tokens
    no_repeat_ngram_size=2,  # Prevent repeating n-grams
    num_return_sequences=1,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

from deepseek-coder.

A-Janj avatar A-Janj commented on August 22, 2024

fine_tune_tiny_llama.txt
Yes, you can use your script to finetune our model, just as you would with other Llama models.

Hi, that's great, thanks. I can't seem to find documentation on how to format the custom dataset. For both Llama2 and TinyLlama I formatted it as a CSV with a 'text' column. Is there a similar format I can follow for deepseek-coder-1.3b? The format of each row is: Llama2: [INST] prompt [/INST] Llama2 answer </s> TinyLlama: <|user|> prompt <|assistant|> TinyLlama answer. I then load the dataset like this: from datasets import load_dataset dataset = load_dataset(dataset_folder, split="train")

This is the sample dataset format for DeepSeek-Coder: https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1

The .json file should contain data like:
[
    {
        "instruction": "give python syntax in a Nutshell",
        "output": "Row1"
    },
    {
        "instruction": "Print the content in between the curly brackets to the template output",
        "output": "Row2"
    },
    {
        "instruction": "Statements of the Jinja language that do not have an output.",
        "output": "Row3"
    }
]
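
A small sketch of loading that JSON with datasets and flattening each row into a single text field (the prompt template below is illustrative only; finetune_deepseekcoder.py defines the exact format the repo expects):

from datasets import load_dataset

dataset = load_dataset("json", data_files="train.json", split="train")

def to_text(example):
    # Illustrative template only; check the repo's finetune script for the real one.
    return {"text": f"### Instruction:\n{example['instruction']}\n### Response:\n{example['output']}"}

dataset = dataset.map(to_text)
print(dataset[0]["text"])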

from deepseek-coder.
