
deepseek-moe's People

Contributors

zwd003


deepseek-moe's Issues

About flash_attn

torch 2.1
transformers: 4.37.1
GPU: A800
I installed flash_attn manually and the install went through, but it is still not detected.
Installing it directly with pip does not work.
According to the flash_attn GitHub page, the A800 is supported.

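In case it helps with reproducing: a quick way to confirm that flash_attn is importable from the exact interpreter that runs the model (a minimal sketch; my guess is that the package was installed into a different environment, or fails at import time, which transformers then reports as "not detected"):

import sys
print(sys.executable)            # make sure this is the interpreter used to launch the model

try:
    import flash_attn
    print("flash_attn", flash_attn.__version__)
except ImportError as e:
    # an import-time failure (e.g. a torch/CUDA mismatch) also looks like a missing package
    print("flash_attn import failed:", e)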

Selective precision in gate and norm may conflict with DeepSpeed?

These lines of code set norm and gate to be trained in float32

for name, module in model.named_modules():
    if isinstance(module, LoraLayer):
        if training_args.bf16:
            module = module.to(torch.bfloat16)
    if 'norm' in name or 'gate' in name:
        module = module.to(torch.float32)
    if 'lm_head' in name or 'embed_tokens' in name:
        if hasattr(module, 'weight'):
            if training_args.bf16 and module.weight.dtype == torch.float32:
                module = module.to(torch.bfloat16)

But with DeepSpeed's bf16 training, I think its initialization will cast the whole model to bf16?
https://github.com/microsoft/DeepSpeed/blob/8ec1cc3be315e2a3276a771e6de706aae91cd330/deepspeed/runtime/engine.py#L1094-L1097
Or are they somehow compatible? An explanation would be really appreciated.
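
A quick way to check which dtype these modules actually end up with after DeepSpeed wraps the model (a minimal sketch, assuming `engine` is the object returned by deepspeed.initialize in a hypothetical standalone setup):

for name, param in engine.module.named_parameters():
    if "norm" in name or "gate" in name:
        # float32 means the selective cast survived; bfloat16 means DeepSpeed re-cast it
        print(name, param.dtype)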

Thank you

Error during finetuning

Hello, I followed the official instructions for full-parameter finetuning and ran into the following problem. How should I solve it?

Detected kernel version 4.19.118, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Traceback (most recent call last):
  File "/data/share_user//training/fintune.py", line 332, in <module>
    train()
  File "/data/share_user/training/fintune.py", line 322, in train
    trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
  File "/root/miniconda3/envs/moe-env/lib/python3.10/site-packages/transformers/trainer.py", line 408, in __init__
    raise ValueError(
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

The command I ran is:

DATA_PATH="garage-bAInd/Open-Platypus"
OUTPUT_PATH="/test"
MODEL_PATH="deepseek-ai/deepseek-moe-16b-base"

deepspeed training/finetune.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --cache_dir ./cache \
    --num_train_epochs 1 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --warmup_steps 10 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --deepspeed training/configs/ds_config_zero3.json \
    --bf16 True 
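
For context, that ValueError is raised by transformers' Trainer when the model it receives was loaded in a quantized form (e.g. load_in_8bit / load_in_4bit or a quantization_config) without trainable adapters attached. If full-parameter bf16 finetuning is the goal, the model load should look roughly like this (a sketch, not the exact script):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-moe-16b-base",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # no load_in_8bit / load_in_4bit / quantization_config here;
    # any of those would make Trainer raise the error above
)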

The released DeepSeekMoE 16B Base has 3 different vocab sizes

In config.json, vocab_size is 102400, which matches the shape of "model.embed_tokens.weight".
But tokenizer.vocab_size says 100000 and len(tokenizer) gives 100015. I have not dug deeper into the code. Is this intended, or is it a bug?
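
For reference, the three numbers come from different places and can legitimately differ (a minimal sketch of how to reproduce them; the comments are my understanding, not an official answer):

from transformers import AutoConfig, AutoTokenizer

repo = "deepseek-ai/deepseek-moe-16b-base"
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

print(config.vocab_size)     # 102400: size of the embedding matrix (may include reserved/padded slots)
print(tokenizer.vocab_size)  # 100000: base vocabulary, excluding added special tokens
print(len(tokenizer))        # 100015: base vocabulary plus added special tokens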

Thank you!

Help: the model cannot be loaded

Hello,

I ran into some problems while loading the MoE model.

The error says flash_attn is missing, so the model cannot be loaded, but flash_attn is present in my environment.

I even created a brand-new conda environment to test this, and the problem persists.

I hope you can help me with this.


During LoRA finetuning of the deepseek-moe model, the loss suddenly drops to 0 and stays there, which breaks inference.

Phenomenon 1: During LoRA finetuning of the deepseek-moe model, the loss suddenly drops to 0 and stays at 0 until the end, which breaks inference; the output is: !!!.

Phenomenon 2: Continuing LoRA finetuning from a saved checkpoint raises an error.
I have to change trainer.train(resume_from_checkpoint = resume_from_checkpoint_dir) to
trainer.train() for training to start, but then the new checkpoints start from scratch instead of resuming from the original checkpoint.

Looking forward to your reply, thanks!

About expert capacities: Is there token-dropping during training?

Hi there, thanks for open-sourcing such an amazing LM~

I'm wondering if you limit the maximum expert capacity by dropping some tokens during training like Switch Transformer.
Since training large MoE models is pretty costly, maybe it's necessary to add a hard constraint to ensure faster expert parallelism?

Thanks again and looking forward to hearing from you.


Question about AddAuxiliaryLoss

In AddAuxiliaryLoss, the loss is not stored or used in the forward function. Does that mean its gradient is always 1? Should it be grad_output * loss instead?

Thanks a lot if you can straighten this out for me.

class AddAuxiliaryLoss(torch.autograd.Function):
    """
    The trick function of adding auxiliary (aux) loss,
    which includes the gradient of the aux loss during backpropagation.
    """
    @staticmethod
    def forward(ctx, x, loss):
        assert loss.numel() == 1
        ctx.dtype = loss.dtype
        ctx.required_aux_loss = loss.requires_grad
        return x

    @staticmethod
    def backward(ctx, grad_output):
        grad_loss = None
        if ctx.required_aux_loss:
            grad_loss = torch.ones(1, dtype=ctx.dtype, device=grad_output.device)
        return grad_output, grad_loss
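
For context, my reading is that the constant gradient of 1 is intentional: forward returns x unchanged, and backward injects d(total)/d(aux_loss) = 1, which is exactly what you would get if aux_loss were literally added to the final scalar loss; grad_output is the gradient w.r.t. x, so grad_output * loss would not be the right quantity. A tiny standalone check using the class above (a sketch; the usage mirrors how I understand the MoE layer attaches the aux loss without changing its output):

import torch

x = torch.randn(4, 8, requires_grad=True)
aux_loss = torch.tensor([0.5], requires_grad=True)   # hypothetical aux loss value

y = AddAuxiliaryLoss.apply(x, aux_loss)              # y is numerically identical to x
main_loss = y.sum()
main_loss.backward()

print(aux_loss.grad)  # tensor([1.]): the same gradient as if the objective were main_loss + aux_loss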

@DeepSeekDDM @zwd003 Thanks a lot for helping.

#feature request# DeepSeek-MoE for code

Thank you for your excellent work.
DeepSeek-MoE and DeepSeek-Coder are impressive.

Will you combine DeepSeek-Coder with the MoE architecture?
This would bring significant performance improvements.

load errors

Is the flash_attn library required to load your model? When I load it, I get this error: ImportError: This modeling file requires the following packages that were not found in your environment: flash_attn. Run pip install flash_attn
Thank you so much!

deepseek-moe-16b inference speed is slower than Baichuan-13b

Hi, I have tested the inference performance of deepseek-moe-16b and baichuan-13b on an A800-80G; the results are:

deepseek-moe-16b: 14.73 tokens/s
baichuan-13b: 22.00 tokens/s

Are these results in line with expectations, or am I doing something wrong?
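
For comparison, here is one simple way to measure single-stream decode throughput (a sketch of a measurement method, not necessarily how the numbers above were obtained):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/deepseek-moe-16b-base"   # assumption: base model, greedy decoding, batch size 1
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("An apple a day", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.2f} tokens/s")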

Abnormal output from the finetuned model

After finetuning on the Alpaca data with the provided finetune.py script, I tested the model with the code below. The output contains extra <|EOT|> symbols. How can I get rid of these symbols? Thanks.
Test code:

model.eval()
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
messages = [
    {"role": "user", "content": "How to learn English?"},
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = model.generate(input_tensor.to(model.device), max_new_tokens=max_new_tokens)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

Model output:

1. Reading: Reading is the first step in learning English. It helps to develop vocabulary, grammar, and pronunciation skills. Choose a picture book or a good friend to read on.

2. Writing: Writing is a great way to practice grammar and sentence structure. Write short stories or practice writing on a computer or journal.

3. Listening and Speaking: Listening to English and speaking in front of an English audience can help to improve pronunciation and listening skills.

4. Reading books and games: Reading books and games with English can help to develop vocabulary, reading style, and comprehension skills. Choose books and games that have interactive elements and that challenge you.

5. Speaking with an English tutor or tutor in a classroom: Having a trained English tutor or tutor in a classroom can provide you with real-life practice and help you improve grammar and vocabulary skills.

6. Watching movies and TV shows: Watching movies and TV shows with subtitles or in international English can help to improve your listening skills and vocabulary. Choose movies and shows that have subtitles or that feature English characters.

7. Playing with a native speaker: Playing with a native speaker of the language can help to improve pronunciation and vocabulary skills. Choose a friend to play with and practice in a group.


8. Speaking with a real-life English person: Speaking with a real-life English person can help to improve grammar, vocabulary, and listening skills. Choose a friend to practice with a trained English tutor or tutor.
<|EOT|>OT|>2. Playing with all of these methods above can help to improve your English skills.
<||EOT|>||
<|EOT|>|
<|EOT|>
<EOT>
<|EOT>
<EOTOT>
(roughly a hundred more lines of repeated, increasingly corrupted <|EOT|> / <EOT> fragments follow)
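
One possible cause (an assumption on my part, not confirmed): the <|EOT|> marker added to the finetuning data is not being used as the stop token at generation time, so generate runs on past the end of the answer and eventually emits corrupted fragments of the marker. If <|EOT|> is registered in the tokenizer, something like this might stop generation at the first end-of-turn token (reusing the variable names from the test code above):

eot_id = tokenizer.convert_tokens_to_ids("<|EOT|>")   # assumes <|EOT|> exists as a single token
outputs = model.generate(
    input_tensor.to(model.device),
    max_new_tokens=max_new_tokens,
    eos_token_id=eot_id,
)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)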

Reproducing the reported evaluation results

How were the model evaluations in the paper carried out? I evaluated the DeepSeek-MoE-16b-chat model with the lm-eval library on the DROP and GSM8K datasets, but the results differ greatly from the paper. What could be the reason? The test commands and results are below; everything was run on 2x 40G A100 GPUs. Thanks!
DROP test command:

lm_eval --model hf \  
    --model_args pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype="bfloat16",parallelize=True \  
    --tasks drop \
    --batch_size 8 \
    --num_fewshot 1 \
    --output_path deepseek-moe-16b-chat-drop.json \
    --trust_remote_code

DROP test results:

{
  "results": {
    "drop": {
      "em,none": 0.011954697986577181,
      "em_stderr,none": 0.001113005689885913,
      "f1,none": 0.044237625838926285,
      "f1_stderr,none": 0.0014818249593304353,
      "alias": "drop"
    }
  },
  "group_subtasks": {
    "drop": []
  },
  "configs": {
    "drop": {
      "task": "drop",
      "dataset_path": "benchmark_datasets/drop",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "training_split": "train",
      "validation_split": "validation",
      "process_docs": "def process_docs(dataset):\n    def _process(doc):\n        return {\n            \"id\": doc[\"query_id\"],\n            \"passage\": doc[\"passage\"],\n            \"question\": doc[\"question\"],\n            \"answers\": get_answers(doc),\n        }\n\n    return dataset.map(_process)\n",
      "doc_to_text": "{{passage}} {{question}}",
      "doc_to_target": "{{ answer|join(',')}}",
      "process_results": "def process_results(doc, results):\n    preds, golds = results, doc[\"answers\"]\n    max_em = 0\n    max_f1 = 0\n    for gold_answer in golds:\n        exact_match, f1_score = get_metrics(preds, gold_answer)\n        if gold_answer[0].strip():\n            max_em = max(max_em, exact_match)\n            max_f1 = max(max_f1, f1_score)\n    return {\"em\": max_em, \"f1\": max_f1}\n",
      "description": "",
      "target_delimiter": "",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 1,
      "metric_list": [
        {
          "metric": "em",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "f1",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "."
        ]
      },
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "{{passage}} {{question}}",
      "metadata": {
        "version": 3.0
      }
    }
  },
  "versions": {
    "drop": 3.0
  },
  "n-shot": {
    "drop": 1
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype=bfloat16,parallelize=True,trust_remote_code=True",
    "batch_size": "8",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null
  },
  "git_hash": "2dafddf",
  "transformers_version": "4.38.1",
  "upper_git_hash": null
}

GSM8K test command:

lm_eval --model hf \  
    --model_args pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype="bfloat16",parallelize=True \  
    --tasks gsm8k \
    --batch_size 8 \
    --num_fewshot 0 \
    --output_path deepseek-moe-16b-chat-gsm8k.json \
    --trust_remote_code

GSM8K test results:

{
  "results": {
    "gsm8k": {
      "exact_match,strict-match": 0.0,
      "exact_match_stderr,strict-match": 0.0,
      "exact_match,flexible-extract": 0.33358605003790753,
      "exact_match_stderr,flexible-extract": 0.012987282131410809,
      "alias": "gsm8k"
    }
  },
  "group_subtasks": {
    "gsm8k": []
  },
  "configs": {
    "gsm8k": {
      "task": "gsm8k",
      "group": [
        "math_word_problems"
      ],
      "dataset_path": "benchmark_datasets/gsm8k",
      "dataset_name": "main",
      "training_split": "train",
      "test_split": "test",
      "fewshot_split": "train",
      "doc_to_text": "Question: {{question}}\nAnswer:",
      "doc_to_target": "{{answer}}",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "exact_match",
          "aggregation": "mean",
          "higher_is_better": true,
          "ignore_case": true,
          "ignore_punctuation": false,
          "regexes_to_ignore": [
            ",",
            "\\$",
            "(?s).*#### ",
            "\\.$"
          ]
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "Question:",
          "</s>",
          "<|im_end|>"
        ],
        "do_sample": false,
        "temperature": 0.0
      },
      "repeats": 1,
      "filter_list": [
        {
          "name": "strict-match",
          "filter": [
            {
              "function": "regex",
              "regex_pattern": "#### (\\-?[0-9\\.\\,]+)"
            },
            {
              "function": "take_first"
            }
          ]
        },
        {
          "name": "flexible-extract",
          "filter": [
            {
              "function": "regex",
              "group_select": -1,
              "regex_pattern": "(-?[$0-9.,]{2,})|(-?[0-9]+)"
            },
            {
              "function": "take_first"
            }
          ]
        }
      ],
      "should_decontaminate": false,
      "metadata": {
        "version": 3.0
      }
    }
  },
  "versions": {
    "gsm8k": 3.0
  },
  "n-shot": {
    "gsm8k": 0
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=deepseek-ai/deepseek-moe-16b-chat,dtype=bfloat16,parallelize=True,trust_remote_code=True",
    "batch_size": "8",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null
  },
  "git_hash": "2c017c1",
  "transformers_version": "4.38.1",
  "upper_git_hash": null
}
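
One factor that might explain part of the gap (an assumption, not verified): the default lm-eval prompts are plain few-shot text, while deepseek-moe-16b-chat was aligned on a chat template, so scoring the chat model without that template can give numbers far below the reported ones. A quick way to see the prompt format the chat model actually expects (a minimal sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-chat", trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 12 * 7?"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)   # the template the chat model was tuned on, which the raw lm-eval prompt does not use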
