deepseek-math's People

Contributors

chenxwh · guoday · pkuzqh · zhihongshao

deepseek-math's Issues

Distribution of the SFT data

Congratulations on the excellent results! I have a question I'd like to ask:

I'd like to understand the distribution of the SFT data. The paper reports 776K training examples, but my estimate of the component datasets may be off. For the English mathematical datasets, the GSM8K and MATH portions appear to be annotated following ToRA, so based on that paper they should be about 69K; the MathInstruct subset (of 260K) is hard to estimate precisely, so I'll assume roughly 200K; Lila-OOD is 32.2K. That comes to about 300K in total, and the MATH and GSM8K examples inside MathInstruct presumably overlap with the 69K above. That would leave about 476K examples for the Chinese mathematical datasets. Was this dataset collected by you, and will it be open-sourced later?
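The arithmetic behind the estimate above, for reference (all counts are the questioner's own estimates, not confirmed figures):

\begin{align*}
  69\text{K} + 200\text{K} + 32.2\text{K} &\approx 300\text{K} \quad \text{(estimated English examples)},\\
  776\text{K} - 300\text{K} &\approx 476\text{K} \quad \text{(remainder, presumably Chinese)}.
\end{align*}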

Suggestion: check the data

A friendly personal reminder: when a small model reaches this level on MATH, be wary of data leakage.
You can compare the MATH and HumanEval datasets to get a feel for the difficulty 😂

What is your chat template for the Hugging Face chat-ui?

I saw that OpenChat has this chat template for the Hugging Face chat-ui:

{
  "name": "openchat/openchat-3.5-0106",
  "displayName": "openchat/openchat-3.5-0106",
  "description": "OpenChat 3.5 is the #1 model on MT-Bench, with only 7B parameters.",
  "websiteUrl": "https://huggingface.co/openchat/openchat-3.5-0106",
  "preprompt": "",
  "chatPromptTemplate": "{{#each messages}}{{#ifUser}}GPT4 Correct User: {{#if @FIRST}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}}<|end_of_turn|>GPT4 Correct Assistant:{{/ifUser}}{{#ifAssistant}}{{content}}<|end_of_turn|>{{/ifAssistant}}{{/each}}",
  "parameters": {
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 6016,
    "max_new_tokens": 2048,
    "stop": ["<|end_of_turn|>"]
  },
  "promptExamples": [
    {
      "title": "Write an email from bullet list",
      "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
    },
    {
      "title": "Code a snake game",
      "prompt": "Code a basic snake game in python, give explanations for each step."
    },
    {
      "title": "Assist in a task",
      "prompt": "How do I make a delicious lemon cheesecake?"
    }
  ]
}

=====================================================
What is the chat template for DeepSeek-Math for the Hugging Face chat-ui?
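Not an official config, but one way to avoid hand-writing the template is to pull the chat format straight from the released tokenizer and mirror it in chatPromptTemplate. A minimal sketch, assuming the public deepseek-ai/deepseek-math-7b-instruct repo; everything else is illustrative:

from transformers import AutoTokenizer

# Load the released tokenizer; if it ships a chat template,
# apply_chat_template renders the exact prompt format the model was tuned on.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")

messages = [
    {"role": "user", "content": "What is the integral of x^2 from 0 to 2?"}
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect this string to write a matching chatPromptTemplate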

A question about data packing in the SFT stage

Hi, Section 3.2 of the paper mentions randomly concatenating training examples up to a length of 4K tokens. Does this mean the SFT data are packed into the form (q0, a0, q1, a1, ...) and the loss is computed only on the answer parts?
Many thanks for your great work!
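For reference, a minimal sketch (not the authors' code) of what packing to 4K tokens with answer-only loss usually looks like: question tokens get a label of -100 so the cross-entropy loss ignores them. The 4096 length and the -100 ignore index follow the common Hugging Face convention; the rest is illustrative.

MAX_LEN = 4096
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def pack_examples(examples, eos_id):
    """examples: list of (question_token_ids, answer_token_ids) pairs."""
    packed_input_ids, packed_labels = [], []
    input_ids, labels = [], []
    for q_ids, a_ids in examples:
        piece_len = len(q_ids) + len(a_ids) + 1
        # start a new packed sequence if this example would overflow 4K tokens
        if input_ids and len(input_ids) + piece_len > MAX_LEN:
            packed_input_ids.append(input_ids)
            packed_labels.append(labels)
            input_ids, labels = [], []
        input_ids += q_ids + a_ids + [eos_id]
        # mask the question so only answer (and EOS) tokens contribute to the loss
        labels += [IGNORE_INDEX] * len(q_ids) + a_ids + [eos_id]
    if input_ids:
        packed_input_ids.append(input_ids)
        packed_labels.append(labels)
    return packed_input_ids, packed_labels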

Question about the way to extract text from CC HTML

Hi guys @DeepSeekPH , thanks so much for sharing such excellent work. I note that OpenWebMath uses a specialized pipeline to extract content from HTML instead of directly using the WET files from Common Crawl. I wonder how you dealt with this problem. Did you also follow OpenWebMath and process the HTML with a private pipeline? Sincerely awaiting your feedback.
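For context, a hedged sketch of the alternative the question alludes to: reading raw HTML out of a Common Crawl WARC file and running a generic extractor over it, rather than relying on the pre-extracted WET text. The warcio and trafilatura packages and the file path are assumptions here, not what the paper used.

from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path):
    """Yield extracted text for every HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # HTML-to-text extraction step
            if text:
                yield text

# usage sketch: for text in extract_from_warc("CC-MAIN-example.warc.gz"): ...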

Publish on Ollama

Currently deepseek-llm and deepseek-coder are available on Ollama.ai, and I'd love to try out deepseek-math with ollama :)

Reproduced MATH test score: acc = 43.6

MATH on 5000 problems.
deepseek-math-7b-rl @ cot @ greedy @ max 512 tokens

Accuracy: 43.6
Non-decoded answers: 0.0

Level 1: 344/437   (78.72%)
Level 2: 546/894   (61.07%)
Level 3: 585/1131  (51.72%)
Level 4: 467/1214  (38.47%)
Level 5: 240/1324  (18.13%)

Algebra:                764/1187  (64.36%)
Counting & Probability: 181/474   (38.19%)
Geometry:               173/479   (36.12%)
Intermediate Algebra:   170/903   (18.83%)
Number Theory:          217/540   (40.19%)
Prealgebra:             561/871   (64.41%)
Precalculus:            116/546   (21.25%)

How should the code data be used?

The paper says code data helps mathematical reasoning ability, but most open-source code data has no function comments or descriptions, just a mixture of languages.
Have you verified whether using only a single language, e.g. C++, as the code data affects the results?
Also, does it matter if the code data has no comments? I previously used the code stack data released by LEMMA and the results actually got worse.

Unable to get evaluation results

I've followed the steps given in the evaluation README and run the inference file.
However, I don't get any results :/

The results.json file has all blank entries, with n_samples=0 everywhere.

Ask about the evaluation of deepseek-math-rl

I cloned this repo and ran submit_eval_jobs.py with 8 GPUs. However, the tool-based result on the MATH test set is 0.5786, where we set iter=4 and our vLLM version is 0.2.0 as recommended. This has a gap with the reported 58.8; do you have any suggestions?

Paper Section 2.2 (pre-training): why train on up to 150B tokens for datasets of very different sizes?

The dataset used for the math model is 120B tokens, while the comparison corpora are 8.9B tokens, 13.6B tokens, and 13.6B×4 + 10.3B×1 + 28.0B×2 ≈ 120B tokens (please correct me if these numbers are wrong). This means the smallest corpus would be trained for close to 20 epochs, which is quite likely to overfit and hurt performance. Generally, wouldn't a fairer comparison pick a smaller token budget, e.g. the size of the smallest corpus or less, and downsample any corpus above that threshold?

I'd like to ask what considerations this experimental setting was based on.
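A quick back-of-envelope check of the numbers quoted above (all token counts are the questioner's estimates, not confirmed figures):

\begin{align*}
  13.6\text{B}\times 4 + 10.3\text{B}\times 1 + 28.0\text{B}\times 2 &\approx 120.7\text{B} \approx 120\text{B tokens},\\
  150\text{B trained tokens} \,/\, 8.9\text{B tokens (smallest corpus)} &\approx 16.9 \text{ epochs}.
\end{align*}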

Path Issue when running evals

Hi there! While trying to reproduce the evals with python submit_eval_jobs.py --n-gpus 8 on 8 A100 GPUs, I came across an error dumped in the logs:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
  Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "/home/ubuntu/src/DeepSeek-Math/evaluation/infer/run_cot_eval.py", line 11, in <module>
    from eval.utils import generate_completions, load_hf_lm_and_tokenizer
ModuleNotFoundError: No module named 'eval'

Was wondering if this is familiar. I have a similar setup with the conda envs and packages, and I'm running from <path_to_local_repo>/evaluation.

Some questions:

  1. Is there any missing files like __init__.py?
  2. Is the MKL_THREADING_LAYER=INTEL a cause of concern?

Thank you!
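Not an official answer, but a sketch of a possible workaround for both symptoms, assuming the module layout implied by the traceback (an eval/utils.py under the evaluation directory); the wrapper itself and the exact path are illustrative:

import os
import sys

# 1) MKL threading clash: the error message itself suggests forcing Intel's layer.
os.environ.setdefault("MKL_SERVICE_FORCE_INTEL", "1")

# 2) ModuleNotFoundError: put the evaluation directory on sys.path so that
#    "from eval.utils import ..." resolves, equivalent to exporting PYTHONPATH.
EVAL_ROOT = "/home/ubuntu/src/DeepSeek-Math/evaluation"  # path from the traceback
if EVAL_ROOT not in sys.path:
    sys.path.insert(0, EVAL_ROOT)

from eval.utils import generate_completions, load_hf_lm_and_tokenizer  # noqa: E402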

[Question] SFT Data Curation

Hi. Thank you very much for the excellent paper. I was impressed throughout by both the strong performance and the detailed description.

I have one question that I didn't understand while reading.
In Section 3.1 (SFT Data Curation), for the English mathematical datasets:

problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023)

The text says problems are "paired" with solutions, but is it like case 1 below, where all three formats are written in a single answer? Or is it like case 2 below, where each question is paired separately with each format, i.e. one question + one reasoning format per example? If possible, could you show me an example file for one question?

case1) 
Question : what is 1+1 ? 
Answer : CoT + PoT + Tool-integrated reasoning format 
ex) 1 + 1 = 2 / print(1+1) / ... 

case 2) 
Question : what is 1+1 ? 
Answer : CoT
ex) 1 + 1 = 2

Question 
Answer : PoT
ex) print(1+1) 

Question 
Answer : Tool-integrated reasoning format 

Thank you again for the great paper.
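Purely as an illustration of the distinction the question draws (this is not the actual SFT file format, just a hypothetical layout): case 2 would mean the same question appears as separate records, one per reasoning format, roughly like:

# hypothetical records, one per reasoning format (case 2 above)
records = [
    {"question": "What is 1 + 1?",
     "format": "cot",
     "answer": "1 + 1 = 2. The answer is 2."},
    {"question": "What is 1 + 1?",
     "format": "pot",
     "answer": "print(1 + 1)"},
    {"question": "What is 1 + 1?",
     "format": "tool-integrated",
     "answer": "Let's compute it with code: print(1 + 1), which gives 2."},
]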

Proportion of Chinese vs. English corpus in the math data

Thank you for your excellent work. We have tried running some experiments with deepseek-math-base and found it much better than other base models, so we'd like to ask: when training deepseek-math-base, what was the ratio of Chinese corpus to English corpus?

About raw common crawl data

Hi,

I'm trying to reproduce your paper. However, I find that a lot of math-related content is filtered out by many popular text-extraction pipelines. I'm wondering which version of the Common Crawl data you used to mine high-quality math content. Did you use a custom pipeline for web-data processing or something more specific? I cannot find any details regarding this in your paper.
