
deepseek-v2's Introduction

DeepSeek-V2

Model Download | Evaluation Results | Model Architecture | API Platform | License | Citation

Paper Link👁️

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

1. Introduction

Today, we’re introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

We pretrained DeepSeek-V2 on a diverse and high-quality corpus comprising 8.1 trillion tokens. This comprehensive pretraining was followed by a process of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unleash the model's capabilities. The evaluation results validate the effectiveness of our approach as DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.

2. News

  • 2024.05.16: We released DeepSeek-V2-Lite.
  • 2024.05.06: We released DeepSeek-V2.

3. Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| DeepSeek-V2-Lite | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2-Lite-Chat (SFT) | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2 | 236B | 21B | 128k | 🤗 HuggingFace |
| DeepSeek-V2-Chat (RL) | 236B | 21B | 128k | 🤗 HuggingFace |

Due to the constraints of HuggingFace Transformers, the open-source code currently runs slower on GPUs than our internal codebase. To facilitate efficient execution of our model, we offer a dedicated vLLM solution that optimizes inference performance.

4. Evaluation Results

Base Model

Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | LLaMA3 70B | Mixtral 8x22B | DeepSeek-V1 (Dense-67B) | DeepSeek-V2 (MoE-236B) |
| --- | --- | --- | --- | --- | --- |
| MMLU | English | 78.9 | 77.6 | 71.3 | 78.5 |
| BBH | English | 81.0 | 78.9 | 68.7 | 78.9 |
| C-Eval | Chinese | 67.5 | 58.6 | 66.1 | 81.7 |
| CMMLU | Chinese | 69.3 | 60.0 | 70.8 | 84.0 |
| HumanEval | Code | 48.2 | 53.1 | 45.1 | 48.8 |
| MBPP | Code | 68.6 | 64.2 | 57.4 | 66.6 |
| GSM8K | Math | 83.0 | 80.3 | 63.4 | 79.2 |
| MATH | Math | 42.2 | 42.5 | 18.7 | 43.6 |

Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
| --- | --- | --- | --- | --- |
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| MATH | Math | 3.3 | 4.3 | 17.1 |
For more evaluation details, such as few-shot settings and prompts, please check our paper.

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

Chat Model

Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | QWen1.5 72B Chat | Mixtral 8x22B | LLaMA3 70B Instruct | DeepSeek-V1 Chat (SFT) | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | English | 76.2 | 77.8 | 80.3 | 71.1 | 78.4 | 77.8 |
| BBH | English | 65.9 | 78.4 | 80.1 | 71.7 | 81.3 | 79.7 |
| C-Eval | Chinese | 82.2 | 60.0 | 67.9 | 65.2 | 80.9 | 78.0 |
| CMMLU | Chinese | 82.9 | 61.0 | 70.7 | 67.8 | 82.4 | 81.6 |
| HumanEval | Code | 68.9 | 75.0 | 76.2 | 73.8 | 76.8 | 81.1 |
| MBPP | Code | 52.2 | 64.4 | 69.8 | 61.4 | 70.4 | 72.0 |
| LiveCodeBench (0901-0401) | Code | 18.8 | 25.0 | 30.5 | 18.3 | 28.7 | 32.5 |
| GSM8K | Math | 81.9 | 87.9 | 93.2 | 84.1 | 90.8 | 92.2 |
| MATH | Math | 40.6 | 49.8 | 48.5 | 32.6 | 52.7 | 53.9 |

Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
| --- | --- | --- | --- | --- |
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| MATH | Math | 14.7 | 15.2 | 27.9 |

English Open Ended Generation Evaluation

We evaluate our model on AlpacaEval 2.0 and MT-Bench, showing the competitive performance of DeepSeek-V2-Chat-RL on English conversation generation.

Chinese Open Ended Generation Evaluation

AlignBench (https://arxiv.org/abs/2311.18743)

| Model | Open/Closed Source | Overall | Chinese Reasoning | Chinese Language |
| --- | --- | --- | --- | --- |
| gpt-4-1106-preview | Closed | 8.01 | 7.73 | 8.29 |
| DeepSeek-V2 Chat (RL) | Open | 7.91 | 7.45 | 8.36 |
| erniebot-4.0-202404 (文心一言) | Closed | 7.89 | 7.61 | 8.17 |
| DeepSeek-V2 Chat (SFT) | Open | 7.74 | 7.30 | 8.17 |
| gpt-4-0613 | Closed | 7.53 | 7.47 | 7.59 |
| erniebot-4.0-202312 (文心一言) | Closed | 7.36 | 6.84 | 7.88 |
| moonshot-v1-32k-202404 (月之暗面) | Closed | 7.22 | 6.42 | 8.02 |
| Qwen1.5-72B-Chat (通义千问) | Open | 7.19 | 6.45 | 7.93 |
| DeepSeek-67B-Chat | Open | 6.43 | 5.75 | 7.11 |
| Yi-34B-Chat (零一万物) | Open | 6.12 | 4.86 | 7.38 |
| gpt-3.5-turbo-0613 | Closed | 6.08 | 5.35 | 6.71 |
| DeepSeek-V2-Lite 16B Chat | Open | 6.01 | 4.71 | 7.32 |

Coding Benchmarks

We evaluate our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that surpasses several other sophisticated models (see the chat benchmark table above), which highlights its effectiveness on live coding tasks.

5. Model Architecture

DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:

  • For attention, we design MLA (Multi-head Latent Attention), which utilizes low-rank key-value union compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference (see the sketch after this list).
  • For Feed-Forward Networks (FFNs), we adopt DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower costs.
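
To make the key-value compression concrete, here is a minimal sketch of the idea, with made-up dimensions, a single head, and no RoPE (an illustration only, not the released modeling code): the hidden state is down-projected into a small latent vector that is the only tensor kept in the cache, and keys and values are reconstructed from that latent at attention time.

import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    # Sketch of MLA-style low-rank KV compression (illustrative names and sizes)
    def __init__(self, d_model=1024, d_latent=64, d_head=128):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress h_t -> c_kv
        self.up_k = nn.Linear(d_latent, d_head, bias=False)      # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_head, bias=False)      # reconstruct values

    def forward(self, h_t, latent_cache):
        c_kv = self.down_kv(h_t)                                 # (batch, 1, d_latent)
        latent_cache = torch.cat([latent_cache, c_kv], dim=1)    # only the latent is cached
        k = self.up_k(latent_cache)                              # keys for all cached positions
        v = self.up_v(latent_cache)                              # values for all cached positions
        return k, v, latent_cache

layer = LowRankKV()
cache = torch.zeros(1, 0, 64)                                    # empty cache: (batch, seq, d_latent)
for _ in range(3):                                               # decode three tokens
    k, v, cache = layer(torch.randn(1, 1, 1024), cache)
print(cache.shape, k.shape, v.shape)                             # cache stores 64 dims per token, not 2 x 128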

6. Chat Website

You can chat with DeepSeek-V2 on DeepSeek's official website: chat.deepseek.com

7. API Platform

We also provide an OpenAI-compatible API at the DeepSeek Platform: platform.deepseek.com. Sign up to receive millions of free tokens, or use the pay-as-you-go option at a competitive price.
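
As a quick illustration, the API can also be called with the standard OpenAI Python client (a minimal sketch; the key is a placeholder, and the model name and base URL follow the LangChain example later in this README):

from openai import OpenAI

client = OpenAI(api_key="<your-deepseek-api-key>", base_url="https://api.deepseek.com/v1")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)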

8. How to run locally

To utilize DeepSeek-V2 in BF16 format for inference, 80GB*8 GPUs are required.

Inference with Huggingface's Transformers

You can directly employ Huggingface's Transformers for model inference.

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

The complete chat template can be found in tokenizer_config.json in the Hugging Face model repository.

An example of the chat template is as follows:

<|begin▁of▁sentence|>User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:

You can also add an optional system message:

<|begin▁of▁sentence|>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:
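
To check the rendered template locally, you can ask the tokenizer to return the formatted string instead of token ids (a small sketch; the system message is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Chat", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a piece of quicksort code in C++"},
]
# tokenize=False returns the rendered prompt string rather than token ids
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))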

Inference with vLLM (recommended)

To utilize vLLM for model inference, please merge this Pull Request into your vLLM codebase: vllm-project/vllm#4650.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

LangChain Support

Since our API is compatible with OpenAI's, you can easily use it in LangChain. Here is an example:

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key='<your-deepseek-api-key>',
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)
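
A minimal usage sketch, assuming a valid API key is supplied above:

# invoke() sends one chat turn; the reply text is in .content
response = llm.invoke("Write a haiku about mixture-of-experts models.")
print(response.content)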

9. License

This code repository is licensed under the MIT License. The use of DeepSeek-V2 Base/Chat models is subject to the Model License. DeepSeek-V2 series (including Base and Chat) supports commercial use.

10. Citation

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

11. Contact

If you have any questions, please raise an issue or contact us at [email protected].

deepseek-v2's People

Contributors

benjamin-eecs, deepseekddm, luofuli, soloice, stack-heap-overflow


deepseek-v2's Issues

Source code

Will the model's source code be released later?

Clarifications Needed on KVCache Compression and Matrix Operations in MLA KVCache

In MLA, the KVCache compresses $h_t$ into $C_t^{KV} \in \mathbb{R}^{d_c}$, and to circumvent the issue of incompatibility with RoPE for low-rank KVCache compression, it concatenates $k_t^R = \text{RoPE}(W^{KR}h_t) \in \mathbb{R}^{d_h^R}$.

However, according to equation (17): $k_{t,i}=[k_{t,i}^C; k_t^R]$, during the computation of attention, $k_t^c = W^{UK}C_t^{KV} \in \mathbb{R}^{d_hn_h}$ is used instead of $C_t^{KV}$.

Appendix B mentions that by applying the associative law of matrix multiplication, $W^{UK}$ can be absorbed into $W^Q$: $W^Q[W^{UK}(W^{DKV}h_t)] = (W^QW^{UK})(W^{DKV}h_t)=(W^{UQ})C_t^{KV}$.

Questions:

  1. Given that $W^Q \in \mathbb{R}^{d_hn_h \times d}$ and $W^{UK} \in \mathbb{R}^{d_hn_h \times d_c}$, how are these matrices multiplied to derive $W^{UQ}$?
  2. How are the values for the matrices $W^{DKV}, W^{UK}, W^{KR}$ computed? Appendix B suggests that these are calculated offline once and not during training as part of the low-rank matrix values.

Any insights or detailed explanations regarding these points would be highly appreciated.
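
The absorption in Appendix B is just the associative law of matrix multiplication applied inside the attention score; a small numerical sketch with made-up per-head dimensions (my own illustration, not the paper's notation or DeepSeek's implementation) shows that precomputing the product of the query projection and the key up-projection gives the same score while only the compressed latent needs to be cached. The equality holds purely by associativity, independent of how the matrices were trained.

import torch

torch.manual_seed(0)
d, d_c, d_h = 64, 16, 32                             # hidden, latent, per-head dims (made up)
W_DKV = torch.randn(d_c, d, dtype=torch.float64)     # down-projection: h -> c_kv
W_UK = torch.randn(d_h, d_c, dtype=torch.float64)    # key up-projection (one head)
W_Q = torch.randn(d_h, d, dtype=torch.float64)       # query projection (one head)

h = torch.randn(d, dtype=torch.float64)
c_kv = W_DKV @ h                                     # the only tensor that needs caching

score_naive = (W_Q @ h) @ (W_UK @ c_kv)              # materialize q and k explicitly
score_absorbed = h @ (W_Q.T @ W_UK) @ c_kv           # W_Q^T W_UK can be precomputed once
assert torch.allclose(score_naive, score_absorbed)
print(score_naive.item(), score_absorbed.item())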

Error in Equation 16?

It appears that the current formulation is:

$$
q_{t,i} = [q_{t,i}^{C}; q_{t}^{R}]
$$

Is there an error here? (The decoupled query would be expected to carry a per-head index, i.e. $q_{t,i}^{R}$.)

Drop Token

Hello @DeepSeekDDM @luofuli,
I have some questions about token dropping in DeepSeek-V2.
Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension?
If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk) / num_experts * capacity_factor, and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls below the capacity.
If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk) / num_groups * capacity_factor? How is token dropping executed in this case?
Because the paper mentions device-level token dropping, I am confused about the above.
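
To make the expert-dimension reading concrete, here is a minimal sketch of capacity-based dropping (my own illustration with invented names, not DeepSeek's implementation): assignments routed to an expert beyond its capacity are dropped, lowest routing score first.

import math
import torch

def drop_by_capacity(expert_ids, scores, num_experts, capacity_factor=1.0):
    # expert_ids, scores: 1-D tensors with one entry per token-to-expert assignment
    num_assignments = expert_ids.numel()
    capacity = math.ceil(num_assignments / num_experts * capacity_factor)
    keep = torch.zeros_like(scores, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # keep only the top-`capacity` assignments by routing score for this expert
            keep[idx[scores[idx].topk(capacity).indices]] = True
        else:
            keep[idx] = True
    return keep

# toy usage: 10 assignments over 2 experts; expert 0 is overloaded and drops one token
expert_ids = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = torch.rand(10)
print(drop_by_capacity(expert_ids, scores, num_experts=2, capacity_factor=1.0))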

About datasets

Hi, thank you for your great work!

Could you provide more details about the pretraining dataset?
How was the pretraining dataset improved in DeepSeek-V2 compared to the previous version, DeepSeek?

Thank you.

Knowledge cutoff date

Dear developers, what is the model's knowledge cutoff date? It gives different answers when asked in different languages. I hope you can clarify this; thanks in advance, and best wishes for your work!


LangChain cannot be used inside AutoGPT

Standalone LangChain works, but it does not work inside AutoGPT.

Language model:

llm = ChatOpenAI(
    openai_api_base="https://api.deepseek.com/v1",
    openai_api_key="sk-exxxxxxxxxxxxxxxxxxx6",
    model="deepseek-chat",
    temperature=0,
    model_kwargs={
        "seed": 42
    },
)

Sensitive-word blocking issue

  • "taipei is the capital of Taiwan": returns 400
  • Asking for an evaluation of Chairman Mao (请评价下毛主席): returns an empty string

Some requests return BadRequestError: Error code: 400 - {'detail': 'Content Exists Risk'}, while others return an empty string. Could the behavior be made consistent?
Also, the API documentation says 400 means "malformed request body", which is rather sloppy.

MLA vs MHA

Hello, great work. I want to know why the performance of MLA is better than that of MHA. I think MLA is an approximate low-rank decomposition of MHA.

How is device-limited routing implemented?

Under device-limited routing, each token is dispatched to at most M distinct devices, so the required communication volume is reduced in theory.
However, with either of Megatron's two MoE communication implementations (all-gather or all-to-all), the actual communication group is the full EP group, so as I understand it the real communication volume is not reduced.
How do you implement this on the engineering side so that the strategy actually reduces communication?
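
For reference, here is a minimal sketch of device-limited routing as described in the paper (pick the top-M devices by their best expert affinity, then take the top-K experts within those devices); the function and variable names are my own, and it says nothing about how the dispatch communication itself is implemented.

import torch

def device_limited_topk(affinity, num_devices, m_devices, top_k):
    # affinity: (num_tokens, num_experts); experts are split evenly across devices
    num_tokens, num_experts = affinity.shape
    experts_per_device = num_experts // num_devices
    # best affinity per device: (num_tokens, num_devices)
    per_device = affinity.view(num_tokens, num_devices, experts_per_device).max(dim=-1).values
    top_devices = per_device.topk(m_devices, dim=-1).indices           # (num_tokens, M)
    # mask out experts living on non-selected devices, then do the usual top-k
    device_of_expert = torch.arange(num_experts) // experts_per_device
    allowed = (device_of_expert.view(1, -1, 1) == top_devices.unsqueeze(1)).any(-1)
    masked = affinity.masked_fill(~allowed, float("-inf"))
    return masked.topk(top_k, dim=-1).indices                          # chosen expert ids

# toy usage: 4 tokens, 16 experts on 4 devices, top-6 experts restricted to at most 3 devices
print(device_limited_topk(torch.rand(4, 16), num_devices=4, m_devices=3, top_k=6))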

Sending images

The content field only accepts the str type; how can I send an image?

Will the auxiliary-loss and token-dropping code be open-sourced?

Will the code for the device-level balance loss and the communication balance loss (in the auxiliary-loss part), as well as the later token-dropping strategy, be open-sourced?

Why is vLLM inference of DeepSeek-V2 slow?

I use vLLM to run inference on DeepSeek-V2 and Flask to deploy the model. When a request reaches the model, it always gets stuck for a long time at the prompt-processing step. The code I use is your example code.

Invalid max_token values

I followed the instructions in the README about how to use DeepSeek in LangChain:

model = ChatOpenAI(
        model="deepseek-chat",
        openai_api_key=API_KEY,
        openai_api_base='https://api.deepseek.com/v1',
        temperature=0.85,
        max_tokens=8000)

However, it seems that max_tokens is still restricted to 4k, and an error is raised when the model is integrated into a chain and invoked:

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=model,
    chain_type="stuff",
    retriever=pinecone.as_retriever()
)
query = "foobar"

result = qa_with_sources.invoke(query)

Warning

File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 153, in invoke
self._call(inputs, run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 137, in _call
output, extra_return_dict = self.combine_docs(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 244, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 316, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/_api/deprecation.py", line 148, in warning_emitting_wrapper
return wrapped(*args, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 378, in call
return self.invoke(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 163, in invoke
raise e
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 153, in invoke
self._call(inputs, run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 126, in _call
response = self.generate([inputs], run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 138, in generate
return self.llm.generate_prompt(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 560, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 421, in generate
raise e
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 411, in generate
self._generate_with_cache(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 632, in _generate_with_cache
result = self._generate(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_openai/chat_models/base.py", line 522, in _generate
response = self.client.create(messages=message_dicts, **params)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_utils/_utils.py", line 277, in wrapper
return func(*args, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 590, in create
return self._post(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 1240, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 921, in request
return self._request(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 1020, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'detail': 'Invalid max_tokens value, the valid range of max_tokens is [0, 4096]'}

Error executing method determine_num_available_blocks

Starting the OpenAI-compatible server with vLLM throws an error, while the official demo script works fine.

Launch command: python -m vllm.entrypoints.openai.api_server --model /data/huggingface/models--deepseek-ai--DeepSeek-V2-Chat/snapshots/cfa90959d985cd3288fd835519099d9c46fa4842 --tensor-parallel-size 8 --served-model-name deepseek-v2-chat --dtype auto --api-key none --trust-remote-code

error log

(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:16 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:16 selector.py:32] Using XFormers backend. [repeated 6x across cluster]
Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=1401736) Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:18 pynccl_utils.py:43] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
INFO 05-10 20:16:56 model_runner.py:175] Loading model weights took 56.1087 GB
(RayWorkerWrapper pid=1401736) INFO 05-10 20:17:00 model_runner.py:175] Loading model weights took 56.1087 GB
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:21 utils.py:132] reading GPU P2P access cache from /home/centos/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) Cache shape torch.Size([163840, 64]) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run()
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] 
(RayWorkerWrapper pid=1402053) INFO 05-10 20:17:05 model_runner.py:175] Loading model weights took 56.1087 GB [repeated 6x across cluster]
ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last):
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run()
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 05-10 20:17:12 worker_base.py:145] 
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 172, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 27, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
[rank0]:     raise e
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
[rank0]:     hidden_states = model_executable(**execute_model_kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
[rank0]:     hidden_states, residual = layer(positions, hidden_states,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
[rank0]:     final_hidden_states = fused_moe(hidden_states,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
[rank0]:     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 18x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs) [repeated 18x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run() [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs) [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs) [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape), [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]  [repeated 6x across cluster]
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:475 'an illegal memory access was encountered'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Reproduce inference benchmark mentioned in the paper

I have a few questions about the inference efficiency of DeepSeek-V2:
1.

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8.

Is all storage and computation performed in FP8? Does this harm the model's performance?
2.

On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput
exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of
DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens
per second.

Is this throughput achieved with test requests of 128K context length? Can we reproduce it using vllm-project/vllm#4650?

Any plans to support VQA?

I have tried DeepSeek (model version 1) and found it quite good at VQA. Does your team plan to release a similar demo on Hugging Face for this version, like the first one? Thanks!

`V-MoE` token dropping and `MoD`

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD (Mixture-of-Depths)? They look like very similar techniques.

API ERROR

"detail": "Invalid top_logprobs and logprobs value"

When the final chunk has "finish_reason":"stop", the role value is null

data: {"id":"2c0e9145-0ef9-4037-a5b0-590c6df994b0","choices":[{"index":0,"delta":{"content":".","role":"assistant"},"finish_reason":null,"logprobs":null}],"created":1715667871,"model":"deepseek-chat","system_fingerprint":null,"object":"chat.completion.chunk"}

data: {"id":"2c0e9145-0ef9-4037-a5b0-590c6df994b0","choices":[{"index":0,"delta":{"content":"","role":null},"finish_reason":"stop","logprobs":null}],"created":1715667871,"model":"deepseek-chat","system_fingerprint":null,"object":"chat.completion.chunk","usage":{"prompt_tokens":8,"completion_tokens":27,"total_tokens":35}}

In .NET, streaming output with the Semantic Kernel framework throws an error because the last chunk has role = null.

Please expand the model's Chinese vocabulary

The current DeepSeek-V2 does not seem to have an expanded Chinese vocabulary, so Chinese inference efficiency is not yet at its best.

Statistics from vocab-coverage (https://github.com/twang2218/vocab-coverage):
"General Standard Chinese Characters", Level 1: 3500 characters, 3168 fully covered, 90.51%
"General Standard Chinese Characters", Level 2: 3000 characters, 251 fully covered, 8.37%
"General Standard Chinese Characters", Level 3: 1605 characters, 5 fully covered, 0.31%
"Standard Form of Common National Characters", Table A (ext.): 1749 characters, 0 fully covered, 0.00%
"Standard Form of Common National Characters", Table B (ext.): 4503 characters, 0 fully covered, 0.00%
"Unicode CJK Unified Ideographs" (ext.): 6910 characters, 1 fully covered, 0.01%

For comparison, Qwen's model:
"General Standard Chinese Characters", Level 1: 3500 characters, 3500 fully covered, 100.00%
"General Standard Chinese Characters", Level 2: 3000 characters, 3000 fully covered, 100.00%
"General Standard Chinese Characters", Level 3: 1605 characters, 1605 fully covered, 100.00%
"Standard Form of Common National Characters", Table A (ext.): 1749 characters, 633 fully covered, 36.19%
"Standard Form of Common National Characters", Table B (ext.): 4503 characters, 4 fully covered, 0.09%
"Unicode CJK Unified Ideographs" (ext.): 6910 characters, 32 fully covered, 0.46%

Failure to reproduce MLA > MHA

I tried out MLA and it was noticeably worse than MHA, and I wanted to find out why. Firstly, I am using a hybrid model, so I do not use RoPE in either MLA or MHA and therefore use the basic version of MLA. I suspect the issue could be due to the part saying:
"In addition, the low-rank compression and fine-grained expert segmentation
will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm
layers after the compressed latent vectors, and multiply additional scaling factors at the width
bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training."
It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and what this factor would be. Another possible reason is that the RoPE version of MLA gives it a performance boost.

Any clarification on this scaling factor and its placement would be great, thanks!

Preference data construction method

The paper mentions:
We obtain code preference data based on compiler-feedback, and mathematical
preference data based on the ground-truth labels
Could you explain in more detail how this was done?

Model deployment confusion

On an 8×A800 machine, I load the model following the Chat Completion example code, with
max_memory = {i: "75GB" for i in range(8)}
device_map="sequential"
After startup, GPU memory usage is concentrated on GPUs 0-6 while GPU 7 is left completely empty; then, with a slightly longer test context, an out-of-memory error is raised.
What causes this? Has anyone else encountered it?
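
One possible workaround (my own assumption, not an official fix): with device_map="sequential", weights fill the earlier GPUs up to their max_memory budget first, so lowering the budget on GPU 0 reserves headroom there for activations and the KV cache and pushes more weights onto the later, otherwise empty GPUs.

# hypothetical rebalancing, reusing the imports and model_name from the Chat Completion example
max_memory = {0: "40GB", **{i: "75GB" for i in range(1, 8)}}
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")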

Server deployment question

Which environment/dependencies do I need to install to run V2 on Linux with this demo? Thanks.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
