
deepseek-moe's People

Contributors

zwd003


deepseek-moe's Issues

About flash_attn

torch 2.1
transformers: 4.37.1
GPU: A800
I installed flash_attn manually and the install went through, but it is still not detected.
Installing it directly with pip does not work.
According to the flash_attn GitHub page, the A800 is supported.

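In case it helps with reproducing: a quick way to confirm that flash_attn is importable from the exact interpreter that runs the model (a minimal sketch; my guess is that the package was installed into a different environment, or fails at import time, which transformers then reports as "not detected"):

import sys
print(sys.executable)            # make sure this is the interpreter used to launch the model

try:
    import flash_attn
    print("flash_attn", flash_attn.__version__)
except ImportError as e:
    # an import-time failure (e.g. a torch/CUDA mismatch) also looks like a missing package
    print("flash_attn import failed:", e)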

Selective precision in gate and norm may conflict with DeepSpeed?

These lines of code set norm and gate to be trained in float32

for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        if training_args.bf16:
            module = module.to(torch.bfloat16)
    if 'norm' in name or 'gate' in name:
        module = module.to(torch.float32)
    if 'lm_head' in name or 'embed_tokens' in name:
        if hasattr(module, 'weight'):
            if training_args.bf16 and module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)

But with DeepSpeed's bf16 training, I think its initialization will cast the whole model to bf16?
https://github.com/microsoft/DeepSpeed/blob/8ec1cc3be315e2a3276a771e6de706aae91cd330/deepspeed/runtime/engine.py#L1094-L1097
Or are they somehow compatible? An explanation would be really appreciated.
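
A quick way to check which dtype these modules actually end up with after DeepSpeed wraps the model (a minimal sketch, assuming `engine` is the object returned by deepspeed.initialize in a hypothetical standalone setup):

for name, param in engine.module.named_parameters():
    if "norm" in name or "gate" in name:
        # float32 means the selective cast survived; bfloat16 means DeepSpeed re-cast it
        print(name, param.dtype)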

Thank you

Error during finetuning

Hello, I followed the official instructions for full-parameter finetuning and ran into the following problem. How should I solve it?

Detected kernel version 4.19.118, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/data/share_user//training/fintune.py", line 332, in <module>
    train()
  File "/data/share_user/training/fintune.py", line 322, in train
    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
  File "/root/miniconda3/envs/moe-env/lib/python3.10/site-packages/transformers/trainer.py", line 408, in __init__
    raise ValueError(
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

The command I ran is:

DATA_PATH="garage-bAInd/Open-Platypus"
OUTPUT_PATH="/test"
MODEL_PATH="deepseek-ai/deepseek-moe-16b-base"

deepspeed training/finetune.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --cache_dir ./cache \
    --num_train_epochs 1 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 10 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --deepspeed training/configs/ds_config_zero3.json \
    --bf16 True 
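
For context, that ValueError is raised by transformers' Trainer when the model it receives was loaded in a quantized form (e.g. load_in_8bit / load_in_4bit or a quantization_config) without trainable adapters attached. If full-parameter bf16 finetuning is the goal, the model load should look roughly like this (a sketch, not the exact script):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-moe-16b-base",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # no load_in_8bit / load_in_4bit / quantization_config here;
    # any of those would make Trainer raise the error above
)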

The released DeepSeekMoE 16B Base has 3 different vocab sizes

In config.json, vocab_size is 102400, which matches the shape of "model.embed_tokens.weight".
But tokenizer.vocab_size says 100000 and len(tokenizer) gives 100015. I have not dug deeper into the code. Is this intended, or is it a bug?
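
For reference, the three numbers come from different places and can legitimately differ (a minimal sketch of how to reproduce them; the comments are my understanding, not an official answer):

from transformers import AutoConfig, AutoTokenizer

repo = "deepseek-ai/deepseek-moe-16b-base"
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

print(config.vocab_size)     # 102400: size of the embedding matrix (may include reserved/padded slots)
print(tokenizer.vocab_size)  # 100000: base vocabulary, excluding added special tokens
print(len(tokenizer))        # 100015: base vocabulary plus added special tokens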

Thank you!

Help: the model cannot be loaded

Hello,

I ran into some problems while loading the MoE model.

The error says flash_attn is missing, so the model cannot be loaded, but flash_attn is present in my environment.

I even created a brand-new conda environment to test this, and the problem persists.

I hope you can help me with this.


During LoRA finetuning of the deepseek-moe model, the loss suddenly drops to 0 and stays there, which breaks inference.

Phenomenon 1: During LoRA finetuning of the deepseek-moe model, the loss suddenly drops to 0 and stays at 0 until the end, which breaks inference; the output is: !!!.

Phenomenon 2: Continuing LoRA finetuning from a saved checkpoint raises an error.
I have to change trainer.train(resume_from_checkpoint = resume_from_checkpoint_dir) to
trainer.train() for training to start, but then the new checkpoints start from scratch instead of resuming from the original checkpoint.

Looking forward to your reply, thanks!

About expert capacities: Is there token-dropping during training?

Hi there, thanks for open-sourcing such an amazing LM~

I'm wondering if you limit the maximum expert capacity by dropping some tokens during training like Switch Transformer.
Since training large MoE models is pretty costly, maybe it's necessary to add a hard constraint to ensure faster expert parallelism?

Thanks again and looking forward to hearing from you.


Question about AddAuxiliaryLoss

In AddAuxiliaryLoss, the loss is not stored or used in the forward function. Does that mean its gradient is always 1? Should it be grad_output * loss instead?

Thanks a lot if you can straighten this out for me.

class AddAuxiliaryLoss(torch.autograd.Function):
    """
    The trick function of adding auxiliary (aux) loss,
    which includes the gradient of the aux loss during backpropagation.
    """
    @staticmethod
    def forward(ctx, x, loss):
        assert loss.numel() == 1
        ctx.dtype = loss.dtype
        ctx.required_aux_loss = loss.requires_grad
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad_loss = None
        if ctx.required_aux_loss:
            grad_loss = torch.ones(1, dtype=ctx.dtype, device=grad_output.device)
        return grad_output, grad_loss
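
For context, my reading is that the constant gradient of 1 is intentional: forward returns x unchanged, and backward injects d(total)/d(aux_loss) = 1, which is exactly what you would get if aux_loss were literally added to the final scalar loss; grad_output is the gradient w.r.t. x, so grad_output * loss would not be the right quantity. A tiny standalone check using the class above (a sketch; the usage mirrors how I understand the MoE layer attaches the aux loss without changing its output):

import torch

x = torch.randn(4, 8, requires_grad=True)
aux_loss = torch.tensor([0.5], requires_grad=True)   # hypothetical aux loss value

y = AddAuxiliaryLoss.apply(x, aux_loss)              # y is numerically identical to x
main_loss = y.sum()
main_loss.backward()

print(aux_loss.grad)  # tensor([1.]): the same gradient as if the objective were main_loss + aux_loss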

@DeepSeekDDM @zwd003 Thanks a lot for helping.

#feature request# DeepSeek-MoE for code

Thank you for your excellent work.
DeepSeek-MoE and DeepSeek-Coder are impressive.

Will you combine DeepSeek-Coder with the MoE architecture?
This would bring significant performance improvements.

load errors

Is the flash_attn library required to load your model? When I load it, I get this error: ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn
Thank you so much!

deepseek-moe-16b inference speed is slower than Baichuan-13b

Hi, I have tested the inference performance of deepseek-moe-16b and baichuan-13b on an A800-80G; the results are:

deepseek-moe-16b: 14.73 tokens/s
baichuan-13b: 22.00 tokens/s

Are these results in line with expectations, or am I doing something wrong?
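
For comparison, here is one simple way to measure single-stream decode throughput (a sketch of a measurement method, not necessarily how the numbers above were obtained):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/deepseek-moe-16b-base"   # assumption: base model, greedy decoding, batch size 1
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("An apple a day", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.2f} tokens/s")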

Abnormal output from the finetuned model

After finetuning on the Alpaca data with the provided finetune.py script, I tested the model with the code below. The output contains extra <|EOT|> symbols. How can I get rid of these symbols? Thanks.
Test code:

model.eval()
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
messages = [
    {"role": "user", "content": "How to learn English?"},
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(input_tensor.to(model.device), max_new_tokens=max_new_tokens)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

Model output:

1. Reading: Reading is the first step in learning English. It helps to develop vocabulary, grammar, and pronunciation skills. Choose a picture book or a good friend to read on.

2. Writing: Writing is a great way to practice grammar and sentence structure. Write short stories or practice writing on a computer or journal.

3. Listening and Speaking: Listening to English and speaking in front of an English audience can help to improve pronunciation and listening skills.

4. Reading books and games: Reading books and games with English can help to develop vocabulary, reading style, and comprehension skills. Choose books and games that have interactive elements and that challenge you.

5. Speaking with an English tutor or tutor in a classroom: Having a trained English tutor or tutor in a classroom can provide you with real-life practice and help you improve grammar and vocabulary skills.

6. Watching movies and TV shows: Watching movies and TV shows with subtitles or in international English can help to improve your listening skills and vocabulary. Choose movies and shows that have subtitles or that feature English characters.

7. Playing with a native speaker: Playing with a native speaker of the language can help to improve pronunciation and vocabulary skills. Choose a friend to play with and practice in a group.


8. Speaking with a real-life English person: Speaking with a real-life English person can help to improve grammar, vocabulary, and listening skills. Choose a friend to practice with a trained English tutor or tutor.
<|EOT|>OT|>2. Playing with all of these methods above can help to improve your English skills.
<||EOT|>||
<|EOT|>|
<|EOT|>
<EOT>
<|EOT>
<EOTOT>
(roughly a hundred more lines of repeated, increasingly corrupted <|EOT|> / <EOT> fragments follow)
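
One possible cause (an assumption on my part, not confirmed): the <|EOT|> marker added to the finetuning data is not being used as the stop token at generation time, so generate runs on past the end of the answer and eventually emits corrupted fragments of the marker. If <|EOT|> is registered in the tokenizer, something like this might stop generation at the first end-of-turn token (reusing the variable names from the test code above):

eot_id = tokenizer.convert_tokens_to_ids("<|EOT|>")   # assumes <|EOT|> exists as a single token
outputs = model.generate(
    input_tensor.to(model.device),
    max_new_tokens=max_new_tokens,
    eos_token_id=eot_id,
)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)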

Reproducing the reported evaluation results

How were the model evaluations in the paper carried out? I evaluated the DeepSeek-MoE-16b-chat model with the lm-eval library on the DROP and GSM8K datasets, but the results differ greatly from the paper. What could be the reason? The test commands and results are below; everything was run on 2x 40G A100 GPUs. Thanks!
DROP test command:

lm_eval --model hf \  
    --model_args pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype="bfloat16",parallelize=True \  
    --tasks drop \
    --batch_size 8 \
    --num_fewshot 1 \
    --output_path deepseek-moe-16b-chat-drop.json \
    --trust_remote_code

DROP test results:

{
  "results": {
    "drop": {
      "em,none": 0.011954697986577181,
      "em_stderr,none": 0.001113005689885913,
      "f1,none": 0.044237625838926285,
      "f1_stderr,none": 0.0014818249593304353,
      "alias": "drop"
    }
  },
  "group_subtasks": {
    "drop": []
  },
  "configs": {
    "drop": {
      "task": "drop",
      "dataset_path": "benchmark_datasets/drop",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "training_split": "train",
      "validation_split": "validation",
      "process_docs": "def process_docs(dataset):\n    def _process(doc):\n        return {\n            \"id\": doc[\"query_id\"],\n            \"passage\": doc[\"passage\"],\n            \"question\": doc[\"question\"],\n            \"answers\": get_answers(doc),\n        }\n\n    return dataset.map(_process)\n",
      "doc_to_text": "{{passage}} {{question}}",
      "doc_to_target": "{{ answer|join(',')}}",
      "process_results": "def process_results(doc, results):\n    preds, golds = results, doc[\"answers\"]\n    max_em = 0\n    max_f1 = 0\n    for gold_answer in golds:\n        exact_match, f1_score = get_metrics(preds, gold_answer)\n        if gold_answer[0].strip():\n            max_em = max(max_em, exact_match)\n            max_f1 = max(max_f1, f1_score)\n    return {\"em\": max_em, \"f1\": max_f1}\n",
      "description": "",
      "target_delimiter": "",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 1,
      "metric_list": [
        {
          "metric": "em",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "f1",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "."
        ]
      },
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "{{passage}} {{question}}",
      "metadata": {
        "version": 3.0
      }
    }
  },
  "versions": {
    "drop": 3.0
  },
  "n-shot": {
    "drop": 1
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype=bfloat16,parallelize=True,trust_remote_code=True",
    "batch_size": "8",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null
  },
  "git_hash": "2dafddf",
  "transformers_version": "4.38.1",
  "upper_git_hash": null
}

GSM8K test command:

lm_eval --model hf \  
    --model_args pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype="bfloat16",parallelize=True \  
    --tasks gsm8k \
    --batch_size 8 \
    --num_fewshot 0 \
    --output_path deepseek-moe-16b-chat-gsm8k.json \
    --trust_remote_code

GSM8K test results:

{
  "results": {
    "gsm8k": {
      "exact_match,strict-match": 0.0,
      "exact_match_stderr,strict-match": 0.0,
      "exact_match,flexible-extract": 0.33358605003790753,
      "exact_match_stderr,flexible-extract": 0.012987282131410809,
      "alias": "gsm8k"
    }
  },
  "group_subtasks": {
    "gsm8k": []
  },
  "configs": {
    "gsm8k": {
      "task": "gsm8k",
      "group": [
        "math_word_problems"
      ],
      "dataset_path": "benchmark_datasets/gsm8k",
      "dataset_name": "main",
      "training_split": "train",
      "test_split": "test",
      "fewshot_split": "train",
      "doc_to_text": "Question: {{question}}\nAnswer:",
      "doc_to_target": "{{answer}}",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "exact_match",
          "aggregation": "mean",
          "higher_is_better": true,
          "ignore_case": true,
          "ignore_punctuation": false,
          "regexes_to_ignore": [
            ",",
            "\\$",
            "(?s).*#### ",
            "\\.$"
          ]
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "Question:",
          "</s>",
          "<|im_end|>"
        ],
        "do_sample": false,
        "temperature": 0.0
      },
      "repeats": 1,
      "filter_list": [
        {
          "name": "strict-match",
          "filter": [
            {
              "function": "regex",
              "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
            },
            {
              "function": "take_first"
            }
          ]
        },
        {
          "name": "flexible-extract",
          "filter": [
            {
              "function": "regex",
              "group_select": -1,
              "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
            },
            {
              "function": "take_first"
            }
          ]
        }
      ],
      "should_decontaminate": false,
      "metadata": {
        "version": 3.0
      }
    }
  },
  "versions": {
    "gsm8k": 3.0
  },
  "n-shot": {
    "gsm8k": 0
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype=bfloat16,parallelize=True,trust_remote_code=True",
    "batch_size": "8",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null
  },
  "git_hash": "2c017c1",
  "transformers_version": "4.38.1",
  "upper_git_hash": null
}
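
One factor that might explain part of the gap (an assumption, not verified): the default lm-eval prompts are plain few-shot text, while deepseek-moe-16b-chat was aligned on a chat template, so scoring the chat model without that template can give numbers far below the reported ones. A quick way to see the prompt format the chat model actually expects (a minimal sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-chat", trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 12 * 7?"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)   # the template the chat model was tuned on, which the raw lm-eval prompt does not use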
