
deepseek-v2's Introduction

DeepSeek-V2

Model Download | Evaluation Results | Model Architecture | API Platform | License | Citation

Paper Link👁️

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

1. Introduction

Today, we’re introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

We pretrained DeepSeek-V2 on a diverse and high-quality corpus comprising 8.1 trillion tokens. This comprehensive pretraining was followed by a process of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unleash the model's capabilities. The evaluation results validate the effectiveness of our approach as DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.

2. News

  • 2024.05.16: We released DeepSeek-V2-Lite.
  • 2024.05.06: We released DeepSeek-V2.

3. Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| DeepSeek-V2-Lite | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2-Lite-Chat (SFT) | 16B | 2.4B | 32k | 🤗 HuggingFace |
| DeepSeek-V2 | 236B | 21B | 128k | 🤗 HuggingFace |
| DeepSeek-V2-Chat (RL) | 236B | 21B | 128k | 🤗 HuggingFace |

Due to the constraints of HuggingFace Transformers, the open-source code currently runs slower on GPUs than our internal codebase. To facilitate efficient execution of our model, we offer a dedicated vLLM solution that optimizes inference performance.

4. Evaluation Results

Base Model

Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | LLaMA3 70B | Mixtral 8x22B | DeepSeek-V1 (Dense-67B) | DeepSeek-V2 (MoE-236B) |
| --- | --- | --- | --- | --- | --- |
| MMLU | English | 78.9 | 77.6 | 71.3 | 78.5 |
| BBH | English | 81.0 | 78.9 | 68.7 | 78.9 |
| C-Eval | Chinese | 67.5 | 58.6 | 66.1 | 81.7 |
| CMMLU | Chinese | 69.3 | 60.0 | 70.8 | 84.0 |
| HumanEval | Code | 48.2 | 53.1 | 45.1 | 48.8 |
| MBPP | Code | 68.6 | 64.2 | 57.4 | 66.6 |
| GSM8K | Math | 83.0 | 80.3 | 63.4 | 79.2 |
| MATH | Math | 42.2 | 42.5 | 18.7 | 43.6 |

Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B (Dense) | DeepSeekMoE 16B | DeepSeek-V2-Lite (MoE-16B) |
| --- | --- | --- | --- | --- |
| Architecture | - | MHA+Dense | MHA+MoE | MLA+MoE |
| MMLU | English | 48.2 | 45.0 | 58.3 |
| BBH | English | 39.5 | 38.9 | 44.1 |
| C-Eval | Chinese | 45.0 | 40.6 | 60.3 |
| CMMLU | Chinese | 47.2 | 42.5 | 64.3 |
| HumanEval | Code | 26.2 | 26.8 | 29.9 |
| MBPP | Code | 39.0 | 39.2 | 43.2 |
| GSM8K | Math | 17.4 | 18.8 | 41.1 |
| MATH | Math | 3.3 | 4.3 | 17.1 |
For more evaluation details, such as few-shot settings and prompts, please check our paper.

Context Window

Evaluation results on the Needle In A Haystack (NIAH) tests. DeepSeek-V2 performs well across all context window lengths up to 128K.

Chat Model

Standard Benchmark (Models larger than 67B)

| Benchmark | Domain | QWen1.5 72B Chat | Mixtral 8x22B | LLaMA3 70B Instruct | DeepSeek-V1 Chat (SFT) | DeepSeek-V2 Chat (SFT) | DeepSeek-V2 Chat (RL) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | English | 76.2 | 77.8 | 80.3 | 71.1 | 78.4 | 77.8 |
| BBH | English | 65.9 | 78.4 | 80.1 | 71.7 | 81.3 | 79.7 |
| C-Eval | Chinese | 82.2 | 60.0 | 67.9 | 65.2 | 80.9 | 78.0 |
| CMMLU | Chinese | 82.9 | 61.0 | 70.7 | 67.8 | 82.4 | 81.6 |
| HumanEval | Code | 68.9 | 75.0 | 76.2 | 73.8 | 76.8 | 81.1 |
| MBPP | Code | 52.2 | 64.4 | 69.8 | 61.4 | 70.4 | 72.0 |
| LiveCodeBench (0901-0401) | Code | 18.8 | 25.0 | 30.5 | 18.3 | 28.7 | 32.5 |
| GSM8K | Math | 81.9 | 87.9 | 93.2 | 84.1 | 90.8 | 92.2 |
| MATH | Math | 40.6 | 49.8 | 48.5 | 32.6 | 52.7 | 53.9 |

Standard Benchmark (Models smaller than 16B)

| Benchmark | Domain | DeepSeek 7B Chat (SFT) | DeepSeekMoE 16B Chat (SFT) | DeepSeek-V2-Lite 16B Chat (SFT) |
| --- | --- | --- | --- | --- |
| MMLU | English | 49.7 | 47.2 | 55.7 |
| BBH | English | 43.1 | 42.2 | 48.1 |
| C-Eval | Chinese | 44.7 | 40.0 | 60.1 |
| CMMLU | Chinese | 51.2 | 49.3 | 62.5 |
| HumanEval | Code | 45.1 | 45.7 | 57.3 |
| MBPP | Code | 39.0 | 46.2 | 45.8 |
| GSM8K | Math | 62.6 | 62.2 | 72.0 |
| MATH | Math | 14.7 | 15.2 | 27.9 |

English Open Ended Generation Evaluation

We evaluate our model on AlpacaEval 2.0 and MT-Bench, showing the competitive performance of DeepSeek-V2-Chat-RL on English conversation generation.

Chinese Open Ended Generation Evaluation

AlignBench (https://arxiv.org/abs/2311.18743)

| Model | Open/Closed Source | Overall | Chinese Reasoning | Chinese Language |
| --- | --- | --- | --- | --- |
| gpt-4-1106-preview | Closed | 8.01 | 7.73 | 8.29 |
| DeepSeek-V2 Chat (RL) | Open | 7.91 | 7.45 | 8.36 |
| erniebot-4.0-202404 (文心一言) | Closed | 7.89 | 7.61 | 8.17 |
| DeepSeek-V2 Chat (SFT) | Open | 7.74 | 7.30 | 8.17 |
| gpt-4-0613 | Closed | 7.53 | 7.47 | 7.59 |
| erniebot-4.0-202312 (文心一言) | Closed | 7.36 | 6.84 | 7.88 |
| moonshot-v1-32k-202404 (月之暗面) | Closed | 7.22 | 6.42 | 8.02 |
| Qwen1.5-72B-Chat (通义千问) | Open | 7.19 | 6.45 | 7.93 |
| DeepSeek-67B-Chat | Open | 6.43 | 5.75 | 7.11 |
| Yi-34B-Chat (零一万物) | Open | 6.12 | 4.86 | 7.38 |
| gpt-3.5-turbo-0613 | Closed | 6.08 | 5.35 | 6.71 |
| DeepSeek-V2-Lite 16B Chat | Open | 6.01 | 4.71 | 7.32 |

Coding Benchmarks

We evaluate our model on LiveCodeBench (0901-0401), a benchmark designed for live coding challenges. DeepSeek-V2 demonstrates considerable proficiency on LiveCodeBench, achieving a Pass@1 score that surpasses several other sophisticated models (see the chat benchmark table above), which highlights its effectiveness on live coding tasks.

5. Model Architecture

DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference:

  • For attention, we design MLA (Multi-head Latent Attention), which utilizes low-rank key-value union compression to eliminate the bottleneck of the inference-time key-value cache, thus supporting efficient inference (see the sketch after this list).
  • For Feed-Forward Networks (FFNs), we adopt DeepSeekMoE architecture, a high-performance MoE architecture that enables training stronger models at lower costs.
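
To make the key-value compression concrete, here is a minimal sketch of the idea, with made-up dimensions, a single head, and no RoPE (an illustration only, not the released modeling code): the hidden state is down-projected into a small latent vector that is the only tensor kept in the cache, and keys and values are reconstructed from that latent at attention time.

import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    # Sketch of MLA-style low-rank KV compression (illustrative names and sizes)
    def __init__(self, d_model=1024, d_latent=64, d_head=128):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)  # compress h_t -> c_kv
        self.up_k = nn.Linear(d_latent, d_head, bias=False)      # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_head, bias=False)      # reconstruct values

    def forward(self, h_t, latent_cache):
        c_kv = self.down_kv(h_t)                                 # (batch, 1, d_latent)
        latent_cache = torch.cat([latent_cache, c_kv], dim=1)    # only the latent is cached
        k = self.up_k(latent_cache)                              # keys for all cached positions
        v = self.up_v(latent_cache)                              # values for all cached positions
        return k, v, latent_cache

layer = LowRankKV()
cache = torch.zeros(1, 0, 64)                                    # empty cache: (batch, seq, d_latent)
for _ in range(3):                                               # decode three tokens
    k, v, cache = layer(torch.randn(1, 1, 1024), cache)
print(cache.shape, k.shape, v.shape)                             # cache stores 64 dims per token, not 2 x 128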

6. Chat Website

You can chat with DeepSeek-V2 on DeepSeek's official website: chat.deepseek.com

7. API Platform

We also provide an OpenAI-compatible API at the DeepSeek Platform: platform.deepseek.com. Sign up to receive millions of free tokens, or use the pay-as-you-go option at a competitive price.
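
As a quick illustration, the API can also be called with the standard OpenAI Python client (a minimal sketch; the key is a placeholder, and the model name and base URL follow the LangChain example later in this README):

from openai import OpenAI

client = OpenAI(api_key="<your-deepseek-api-key>", base_url="https://api.deepseek.com/v1")
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)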

8. How to run locally

To utilize DeepSeek-V2 in BF16 format for inference, 80GB*8 GPUs are required.

Inference with Huggingface's Transformers

You can directly employ Huggingface's Transformers for model inference.

Text Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat Completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)

The complete chat template can be found in tokenizer_config.json in the Hugging Face model repository.

An example of the chat template is as follows:

<|begin▁of▁sentence|>User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:

You can also add an optional system message:

<|begin▁of▁sentence|>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:
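
To check the rendered template locally, you can ask the tokenizer to return the formatted string instead of token ids (a small sketch; the system message is only an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Chat", trust_remote_code=True)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a piece of quicksort code in C++"},
]
# tokenize=False returns the rendered prompt string rather than token ids
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))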

Inference with vLLM (recommended)

To utilize vLLM for model inference, please merge this Pull Request into your vLLM codebase: vllm-project/vllm#4650.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8
model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

LangChain Support

Since our API is compatible with OpenAI's, you can easily use it in LangChain. Here is an example:

from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key='<your-deepseek-api-key>',
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)
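
A minimal usage sketch, assuming a valid API key is supplied above:

# invoke() sends one chat turn; the reply text is in .content
response = llm.invoke("Write a haiku about mixture-of-experts models.")
print(response.content)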

9. License

This code repository is licensed under the MIT License. The use of DeepSeek-V2 Base/Chat models is subject to the Model License. DeepSeek-V2 series (including Base and Chat) supports commercial use.

10. Citation

@misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, 
      author={DeepSeek-AI},
      year={2024},
      eprint={2405.04434},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

11. Contact

If you have any questions, please raise an issue or contact us at [email protected].

deepseek-v2's People

Contributors

benjamin-eecs, deepseekddm, luofuli, soloice, stack-heap-overflow


deepseek-v2's Issues

Source code

Will the model's source code be released later?

Clarifications Needed on KVCache Compression and Matrix Operations in MLA KVCache

In MLA, the KVCache compresses $h_t$ into $C_t^{KV} \in \mathbb{R}^{d_c}$, and to circumvent the issue of incompatibility with RoPE for low-rank KVCache compression, it concatenates $k_t^R = \text{RoPE}(W^{KR}h_t) \in \mathbb{R}^{d_h^R}$.

However, according to equation (17): $k_{t,i}=[k_{t,i}^C; k_t^R]$, during the computation of attention, $k_t^c = W^{UK}C_t^{KV} \in \mathbb{R}^{d_hn_h}$ is used instead of $C_t^{KV}$.

Appendix B mentions that by applying the associative law of matrix multiplication, $W^{UK}$ can be absorbed into $W^Q$: $W^Q[W^{UK}(W^{DKV}h_t)] = (W^QW^{UK})(W^{DKV}h_t)=(W^{UQ})C_t^{KV}$.

Questions:

  1. Given that $W^Q \in \mathbb{R}^{d_hn_h \times d}$ and $W^{UK} \in \mathbb{R}^{d_hn_h \times d_c}$, how are these matrices multiplied to derive $W^{UQ}$?
  2. How are the values for the matrices $W^{DKV}, W^{UK}, W^{KR}$ computed? Appendix B suggests that these are calculated offline once and not during training as part of the low-rank matrix values.

Any insights or detailed explanations regarding these points would be highly appreciated.
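
The absorption in Appendix B is just the associative law of matrix multiplication applied inside the attention score; a small numerical sketch with made-up per-head dimensions (my own illustration, not the paper's notation or DeepSeek's implementation) shows that precomputing the product of the query projection and the key up-projection gives the same score while only the compressed latent needs to be cached. The equality holds purely by associativity, independent of how the matrices were trained.

import torch

torch.manual_seed(0)
d, d_c, d_h = 64, 16, 32                             # hidden, latent, per-head dims (made up)
W_DKV = torch.randn(d_c, d, dtype=torch.float64)     # down-projection: h -> c_kv
W_UK = torch.randn(d_h, d_c, dtype=torch.float64)    # key up-projection (one head)
W_Q = torch.randn(d_h, d, dtype=torch.float64)       # query projection (one head)

h = torch.randn(d, dtype=torch.float64)
c_kv = W_DKV @ h                                     # the only tensor that needs caching

score_naive = (W_Q @ h) @ (W_UK @ c_kv)              # materialize q and k explicitly
score_absorbed = h @ (W_Q.T @ W_UK) @ c_kv           # W_Q^T W_UK can be precomputed once
assert torch.allclose(score_naive, score_absorbed)
print(score_naive.item(), score_absorbed.item())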

Error in Equation 16?

It appears that the current formulation is:

$$
q_{t,i} = [q_{t,i}^{C}; q_{t}^{R}]
$$

Is there an error here? (The decoupled query would be expected to carry a per-head index, i.e. $q_{t,i}^{R}$.)

Drop Token

Hello @DeepSeekDDM @luofuli,
I have some questions about token dropping in DeepSeek-V2.
Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension?
If it's on the expert dimension, then the capacity is calculated as capacity = math.ceil(num_tokens * topk) / num_experts * capacity_factor, and each expert processes its own tokens, dropping the lowest-scored tokens if the token count exceeds the capacity and padding if it falls below the capacity.
If it's on the device dimension, is the capacity calculated as capacity = math.ceil(num_tokens * topk) / num_groups * capacity_factor? How is token dropping executed in this case?
Because the paper mentions device-level token dropping, I am confused about the above.
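
To make the expert-dimension reading concrete, here is a minimal sketch of capacity-based dropping (my own illustration with invented names, not DeepSeek's implementation): assignments routed to an expert beyond its capacity are dropped, lowest routing score first.

import math
import torch

def drop_by_capacity(expert_ids, scores, num_experts, capacity_factor=1.0):
    # expert_ids, scores: 1-D tensors with one entry per token-to-expert assignment
    num_assignments = expert_ids.numel()
    capacity = math.ceil(num_assignments / num_experts * capacity_factor)
    keep = torch.zeros_like(scores, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # keep only the top-`capacity` assignments by routing score for this expert
            keep[idx[scores[idx].topk(capacity).indices]] = True
        else:
            keep[idx] = True
    return keep

# toy usage: 10 assignments over 2 experts; expert 0 is overloaded and drops one token
expert_ids = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = torch.rand(10)
print(drop_by_capacity(expert_ids, scores, num_experts=2, capacity_factor=1.0))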

About datasets

Hi, thank you for your great work!

Could you provide more details about the pretraining dataset?
How was the pretraining dataset improved in DeepSeek-V2 compared to the previous version, DeepSeek?

Thank you.

Knowledge cutoff date

Dear developers, what is the model's knowledge cutoff date? It gives different answers when asked in different languages. I hope you can clarify this; thanks in advance, and best wishes for your work!


LangChain cannot be used inside AutoGPT

Standalone LangChain works, but it does not work inside AutoGPT.

Language model:

llm = ChatOpenAI(
    openai_api_base="https://api.deepseek.com/v1",
    openai_api_key="sk-exxxxxxxxxxxxxxxxxxx6",
    model="deepseek-chat",
    temperature=0,
    model_kwargs={
        "seed": 42
    },
)

Sensitive-word blocking issue

  • "taipei is the capital of Taiwan": returns 400
  • Asking for an evaluation of Chairman Mao (请评价下毛主席): returns an empty string

Some requests return BadRequestError: Error code: 400 - {'detail': 'Content Exists Risk'}, while others return an empty string. Could the behavior be made consistent?
Also, the API documentation says 400 means "malformed request body", which is rather sloppy.

MLA vs MHA

Hello, great work. I want to know why the performance of MLA is better than that of MHA. I think MLA is an approximate low-rank decomposition of MHA.

How is device-limited routing implemented?

Under device-limited routing, each token is dispatched to at most M distinct devices, so the required communication volume is reduced in theory.
However, with either of Megatron's two MoE communication implementations (all-gather or all-to-all), the actual communication group is the full EP group, so as I understand it the real communication volume is not reduced.
How do you implement this on the engineering side so that the strategy actually reduces communication?
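
For reference, here is a minimal sketch of device-limited routing as described in the paper (pick the top-M devices by their best expert affinity, then take the top-K experts within those devices); the function and variable names are my own, and it says nothing about how the dispatch communication itself is implemented.

import torch

def device_limited_topk(affinity, num_devices, m_devices, top_k):
    # affinity: (num_tokens, num_experts); experts are split evenly across devices
    num_tokens, num_experts = affinity.shape
    experts_per_device = num_experts // num_devices
    # best affinity per device: (num_tokens, num_devices)
    per_device = affinity.view(num_tokens, num_devices, experts_per_device).max(dim=-1).values
    top_devices = per_device.topk(m_devices, dim=-1).indices           # (num_tokens, M)
    # mask out experts living on non-selected devices, then do the usual top-k
    device_of_expert = torch.arange(num_experts) // experts_per_device
    allowed = (device_of_expert.view(1, -1, 1) == top_devices.unsqueeze(1)).any(-1)
    masked = affinity.masked_fill(~allowed, float("-inf"))
    return masked.topk(top_k, dim=-1).indices                          # chosen expert ids

# toy usage: 4 tokens, 16 experts on 4 devices, top-6 experts restricted to at most 3 devices
print(device_limited_topk(torch.rand(4, 16), num_devices=4, m_devices=3, top_k=6))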

Sending images

The content field only accepts the str type; how can I send an image?

Will the auxiliary-loss and token-dropping code be open-sourced?

Will the code for the device-level balance loss and the communication balance loss (in the auxiliary-loss part), as well as the later token-dropping strategy, be open-sourced?

Why is vLLM inference of DeepSeek-V2 slow?

I use vLLM to run inference on DeepSeek-V2 and Flask to deploy the model. When a request reaches the model, it always gets stuck for a long time at the prompt-processing step. The code I use is your example code.

Invalid max_token values

I followed the instructions in the README about how to use DeepSeek in LangChain:

model = ChatOpenAI(
        model="deepseek-chat",
        openai_api_key=API_KEY,
        openai_api_base='https://api.deepseek.com/v1',
        temperature=0.85,
        max_tokens=8000)

However, it seems that max_tokens is still restricted to 4k, and an error is raised when the model is integrated into a chain and invoked:

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=model,
    chain_type="stuff",
    retriever=pinecone.as_retriever()
)
query = "foobar"

result = qa_with_sources.invoke(query)

Warning

File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 153, in invoke
self._call(inputs, run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 137, in _call
output, extra_return_dict = self.combine_docs(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 244, in combine_docs
return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 316, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/_api/deprecation.py", line 148, in warning_emitting_wrapper
return wrapped(*args, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 378, in call
return self.invoke(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 163, in invoke
raise e
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/base.py", line 153, in invoke
self._call(inputs, run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 126, in _call
response = self.generate([inputs], run_manager=run_manager)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain/chains/llm.py", line 138, in generate
return self.llm.generate_prompt(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 560, in generate_prompt
return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 421, in generate
raise e
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 411, in generate
self._generate_with_cache(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_core/language_models/chat_models.py", line 632, in _generate_with_cache
result = self._generate(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/langchain_openai/chat_models/base.py", line 522, in _generate
response = self.client.create(messages=message_dicts, **params)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_utils/_utils.py", line 277, in wrapper
return func(*args, **kwargs)
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/resources/chat/completions.py", line 590, in create
return self._post(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 1240, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 921, in request
return self._request(
File "/Users/zhouyu/miniconda3/envs/ob_chatbot/lib/python3.10/site-packages/openai/_base_client.py", line 1020, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'detail': 'Invalid max_tokens value, the valid range of max_tokens is [0, 4096]'}

Error executing method determine_num_available_blocks

Starting the OpenAI-compatible server with vLLM throws an error, while the official demo script works fine.

Launch command: python -m vllm.entrypoints.openai.api_server --model /data/huggingface/models--deepseek-ai--DeepSeek-V2-Chat/snapshots/cfa90959d985cd3288fd835519099d9c46fa4842 --tensor-parallel-size 8 --served-model-name deepseek-v2-chat --dtype auto --api-key none --trust-remote-code

error log

(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:16 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:16 selector.py:32] Using XFormers backend. [repeated 6x across cluster]
Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=1401736) Cache shape torch.Size([163840, 64])
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:18 pynccl_utils.py:43] vLLM is using nccl==2.20.5 [repeated 6x across cluster]
INFO 05-10 20:16:56 model_runner.py:175] Loading model weights took 56.1087 GB
(RayWorkerWrapper pid=1401736) INFO 05-10 20:17:00 model_runner.py:175] Loading model weights took 56.1087 GB
(RayWorkerWrapper pid=1402517) INFO 05-10 20:16:21 utils.py:132] reading GPU P2P access cache from /home/centos/.config/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) Cache shape torch.Size([163840, 64]) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run()
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states,
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=1402053) ERROR 05-10 20:17:12 worker_base.py:145] 
(RayWorkerWrapper pid=1402053) INFO 05-10 20:17:05 model_runner.py:175] Loading model weights took 56.1087 GB [repeated 6x across cluster]
ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last):
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run()
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs)
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states,
ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 05-10 20:17:12 worker_base.py:145] 
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 168, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 172, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 27, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
[rank0]:     raise e
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run
[rank0]:     self.execute_model(seqs, kv_caches)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model
[rank0]:     hidden_states = model_executable(**execute_model_kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 429, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 400, in forward
[rank0]:     hidden_states, residual = layer(positions, hidden_states,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 362, in forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
[rank0]:     final_hidden_states = fused_moe(hidden_states,
[rank0]:   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe
[rank0]:     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape),
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return executor(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 18x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return func(*args, **kwargs) [repeated 18x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 139, in determine_num_available_blocks [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     self.model_runner.profile_run() [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 888, in profile_run [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     self.execute_model(seqs, kv_caches) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 808, in execute_model [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = model_executable(**execute_model_kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return self._call_impl(*args, **kwargs) [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return forward_call(*args, **kwargs) [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/models/deepseek_v2.py", line 156, in forward [repeated 24x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.model(input_ids, positions, kv_caches, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states, residual = layer(positions, hidden_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     hidden_states = self.mlp(hidden_states) [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     final_hidden_states = fused_moe(hidden_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]   File "/data/envs/ll3_3_ds2_vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 510, in fused_moe [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]     return torch.sum(intermediate_cache3.view(*intermediate_cache3.shape), [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] RuntimeError: CUDA error: an illegal memory access was encountered [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] For debugging consider passing CUDA_LAUNCH_BLOCKING=1. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. [repeated 6x across cluster]
(RayWorkerWrapper pid=1402517) ERROR 05-10 20:17:12 worker_base.py:145]  [repeated 6x across cluster]
Failed: Cuda error /home/runner/work/vllm/vllm/csrc/custom_all_reduce.cuh:475 'an illegal memory access was encountered'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Reproduce inference benchmark mentioned in the paper

I have a few questions about the inference efficiency of DeepSeek-V2:
1.

In order to efficiently deploy DeepSeek-V2 for service, we first convert its parameters into the precision of FP8.

Is all storage and computation performed in FP8? Does this harm the model's performance?
2.

On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput
exceeding 50K tokens per second, which is 5.76 times the maximum generation throughput of
DeepSeek 67B. In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens
per second.

Is this throughput achieved with test requests of 128K context length? Can we reproduce it using vllm-project/vllm#4650?

Any plans to support VQA?

I have tried DeepSeek (model version 1) and found it quite good at VQA. Does your team plan to release a similar demo on Hugging Face for this version, like the first one? Thanks!

`V-MoE` token dropping and `MoD`

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD (Mixture-of-Depths)? They look like very similar techniques.

API ERROR

"detail": "Invalid top_logprobs and logprobs value"

When the final chunk has "finish_reason":"stop", the role value is null

data: {"id":"2c0e9145-0ef9-4037-a5b0-590c6df994b0","choices":[{"index":0,"delta":{"content":".","role":"assistant"},"finish_reason":null,"logprobs":null}],"created":1715667871,"model":"deepseek-chat","system_fingerprint":null,"object":"chat.completion.chunk"}

data: {"id":"2c0e9145-0ef9-4037-a5b0-590c6df994b0","choices":[{"index":0,"delta":{"content":"","role":null},"finish_reason":"stop","logprobs":null}],"created":1715667871,"model":"deepseek-chat","system_fingerprint":null,"object":"chat.completion.chunk","usage":{"prompt_tokens":8,"completion_tokens":27,"total_tokens":35}}

In .NET, streaming output with the Semantic Kernel framework throws an error because the last chunk has role = null.

Please expand the model's Chinese vocabulary

The current DeepSeek-V2 does not seem to have an expanded Chinese vocabulary, so Chinese inference efficiency is not yet at its best.

Statistics from vocab-coverage (https://github.com/twang2218/vocab-coverage):
"General Standard Chinese Characters", Level 1: 3500 characters, 3168 fully covered, 90.51%
"General Standard Chinese Characters", Level 2: 3000 characters, 251 fully covered, 8.37%
"General Standard Chinese Characters", Level 3: 1605 characters, 5 fully covered, 0.31%
"Standard Form of Common National Characters", Table A (ext.): 1749 characters, 0 fully covered, 0.00%
"Standard Form of Common National Characters", Table B (ext.): 4503 characters, 0 fully covered, 0.00%
"Unicode CJK Unified Ideographs" (ext.): 6910 characters, 1 fully covered, 0.01%

For comparison, Qwen's model:
"General Standard Chinese Characters", Level 1: 3500 characters, 3500 fully covered, 100.00%
"General Standard Chinese Characters", Level 2: 3000 characters, 3000 fully covered, 100.00%
"General Standard Chinese Characters", Level 3: 1605 characters, 1605 fully covered, 100.00%
"Standard Form of Common National Characters", Table A (ext.): 1749 characters, 633 fully covered, 36.19%
"Standard Form of Common National Characters", Table B (ext.): 4503 characters, 4 fully covered, 0.09%
"Unicode CJK Unified Ideographs" (ext.): 6910 characters, 32 fully covered, 0.46%

Failure to reproduce MLA > MHA

I tried out MLA and it was noticeably worse than MHA, and I wanted to find out why. Firstly, I am using a hybrid model, so I do not use RoPE in either MLA or MHA and therefore use the basic version of MLA. I suspect the issue could be due to the part saying:
"In addition, the low-rank compression and fine-grained expert segmentation
will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm
layers after the compressed latent vectors, and multiply additional scaling factors at the width
bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training."
It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and what this factor would be. Another possible reason is that the RoPE version of MLA gives it a performance boost.

Any clarification on this scaling factor and its placement would be great, thanks!

Preference data construction method

The paper mentions:
We obtain code preference data based on compiler-feedback, and mathematical
preference data based on the ground-truth labels
Could you explain in more detail how this was done?

Model deployment confusion

On an 8×A800 machine, I load the model following the Chat Completion example code, with
max_memory = {i: "75GB" for i in range(8)}
device_map="sequential"
After startup, GPU memory usage is concentrated on GPUs 0-6 while GPU 7 is left completely empty; then, with a slightly longer test context, an out-of-memory error is raised.
What causes this? Has anyone else encountered it?
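
One possible workaround (my own assumption, not an official fix): with device_map="sequential", weights fill the earlier GPUs up to their max_memory budget first, so lowering the budget on GPU 0 reserves headroom there for activations and the KV cache and pushes more weights onto the later, otherwise empty GPUs.

# hypothetical rebalancing, reusing the imports and model_name from the Chat Completion example
max_memory = {0: "40GB", **{i: "75GB" for i in range(1, 8)}}
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")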

Server deployment question

Which environment/dependencies do I need to install to run V2 on Linux with this demo? Thanks.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# `max_memory` should be set based on your devices
max_memory = {i: "75GB" for i in range(8)}
# `device_map` cannot be set to `auto`
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
