deepseek-math's People

Contributors

chenxwh · guoday · pkuzqh · zhihongshao

deepseek-math's Issues

Distribution of the SFT data

Congratulations on the excellent results! I have a question I'd like to ask:

I'd like to understand the distribution of the SFT data. The paper reports 776K training examples, but my estimate of the component datasets may be off. For the English mathematical datasets, the GSM8K and MATH portions appear to be annotated following ToRA, so based on that paper they should be about 69K; the MathInstruct subset (of 260K) is hard to estimate precisely, so I'll assume roughly 200K; Lila-OOD is 32.2K. That comes to about 300K in total, and the MATH and GSM8K examples inside MathInstruct presumably overlap with the 69K above. That would leave about 476K examples for the Chinese mathematical datasets. Was this dataset collected by you, and will it be open-sourced later?
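The arithmetic behind the estimate above, for reference (all counts are the questioner's own estimates, not confirmed figures):

\begin{align*}
  69\text{K} + 200\text{K} + 32.2\text{K} &\approx 300\text{K} \quad \text{(estimated English examples)},\\
  776\text{K} - 300\text{K} &\approx 476\text{K} \quad \text{(remainder, presumably Chinese)}.
\end{align*}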

Suggestion: check the data

A friendly personal reminder: when a small model reaches this level on MATH, be wary of data leakage.
You can compare the MATH and HumanEval datasets to get a feel for the difficulty 😂

What is your chat template for the Hugging Face chat-ui?

I saw that OpenChat has this chat template for the Hugging Face chat-ui:

{
  "name": "openchat/openchat-3.5-0106",
  "displayName": "openchat/openchat-3.5-0106",
  "description": "OpenChat 3.5 is the #1 model on MT-Bench, with only 7B parameters.",
  "websiteUrl": "https://huggingface.co/openchat/openchat-3.5-0106",
  "preprompt": "",
  "chatPromptTemplate": "{{#each messages}}{{#ifUser}}GPT4 Correct User: {{#if @FIRST}}{{#if @root.preprompt}}{{@root.preprompt}}\n{{/if}}{{/if}}{{content}}<|end_of_turn|>GPT4 Correct Assistant:{{/ifUser}}{{#ifAssistant}}{{content}}<|end_of_turn|>{{/ifAssistant}}{{/each}}",
  "parameters": {
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "top_k": 50,
    "truncate": 6016,
    "max_new_tokens": 2048,
    "stop": ["<|end_of_turn|>"]
  },
  "promptExamples": [
    {
      "title": "Write an email from bullet list",
      "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
    },
    {
      "title": "Code a snake game",
      "prompt": "Code a basic snake game in python, give explanations for each step."
    },
    {
      "title": "Assist in a task",
      "prompt": "How do I make a delicious lemon cheesecake?"
    }
  ]
}

=====================================================
What is the chat template for DeepSeek-Math for the Hugging Face chat-ui?
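Not an official config, but one way to avoid hand-writing the template is to pull the chat format straight from the released tokenizer and mirror it in chatPromptTemplate. A minimal sketch, assuming the public deepseek-ai/deepseek-math-7b-instruct repo; everything else is illustrative:

from transformers import AutoTokenizer

# Load the released tokenizer; if it ships a chat template,
# apply_chat_template renders the exact prompt format the model was tuned on.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-math-7b-instruct")

messages = [
    {"role": "user", "content": "What is the integral of x^2 from 0 to 2?"}
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # inspect this string to write a matching chatPromptTemplate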

A question about data packing in the SFT stage

Hi, Section 3.2 of the paper mentions randomly concatenating training examples up to a length of 4K tokens. Does this mean the SFT data are packed into the form (q0, a0, q1, a1, ...) and the loss is computed only on the answer parts?
Many thanks for your great work!
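For reference, a minimal sketch (not the authors' code) of what packing to 4K tokens with answer-only loss usually looks like: question tokens get a label of -100 so the cross-entropy loss ignores them. The 4096 length and the -100 ignore index follow the common Hugging Face convention; the rest is illustrative.

MAX_LEN = 4096
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def pack_examples(examples, eos_id):
    """examples: list of (question_token_ids, answer_token_ids) pairs."""
    packed_input_ids, packed_labels = [], []
    input_ids, labels = [], []
    for q_ids, a_ids in examples:
        piece_len = len(q_ids) + len(a_ids) + 1
        # start a new packed sequence if this example would overflow 4K tokens
        if input_ids and len(input_ids) + piece_len > MAX_LEN:
            packed_input_ids.append(input_ids)
            packed_labels.append(labels)
            input_ids, labels = [], []
        input_ids += q_ids + a_ids + [eos_id]
        # mask the question so only answer (and EOS) tokens contribute to the loss
        labels += [IGNORE_INDEX] * len(q_ids) + a_ids + [eos_id]
    if input_ids:
        packed_input_ids.append(input_ids)
        packed_labels.append(labels)
    return packed_input_ids, packed_labels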

Question about the way to extract text from CC HTML

Hi guys @DeepSeekPH , thanks so much for sharing such excellent work. I note that OpenWebMath uses a specialized pipeline to extract content from HTML instead of directly using the WET files from Common Crawl. I wonder how you dealt with this problem. Did you also follow OpenWebMath and process the HTML with a private pipeline? Sincerely awaiting your feedback.
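For context, a hedged sketch of the alternative the question alludes to: reading raw HTML out of a Common Crawl WARC file and running a generic extractor over it, rather than relying on the pre-extracted WET text. The warcio and trafilatura packages and the file path are assumptions here, not what the paper used.

from warcio.archiveiterator import ArchiveIterator
import trafilatura

def extract_from_warc(warc_path):
    """Yield extracted text for every HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            text = trafilatura.extract(html)  # HTML-to-text extraction step
            if text:
                yield text

# usage sketch: for text in extract_from_warc("CC-MAIN-example.warc.gz"): ...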

Publish on Ollama

Currently deepseek-llm and deepseek-coder are available on Ollama.ai, and I'd love to try out deepseek-math with ollama :)

Reproduced MATH test score: acc = 43.6

MATH on 5000 problems.
deepseek-math-7b-rl @ cot @ greedy @ max 512 tokens

Accuracy: 43.6
Non-decoded answers: 0.0

Level 1: 344/437   (78.72%)
Level 2: 546/894   (61.07%)
Level 3: 585/1131  (51.72%)
Level 4: 467/1214  (38.47%)
Level 5: 240/1324  (18.13%)

Algebra:                764/1187  (64.36%)
Counting & Probability: 181/474   (38.19%)
Geometry:               173/479   (36.12%)
Intermediate Algebra:   170/903   (18.83%)
Number Theory:          217/540   (40.19%)
Prealgebra:             561/871   (64.41%)
Precalculus:            116/546   (21.25%)

How should the code data be used?

The paper says code data helps mathematical reasoning ability, but most open-source code data has no function comments or descriptions, just a mixture of languages.
Have you verified whether using only a single language, e.g. C++, as the code data affects the results?
Also, does it matter if the code data has no comments? I previously used the code stack data released by LEMMA and the results actually got worse.

Unable to get evaluation results

I've followed the steps given in the evaluation README and run the inference file.
However, I don't get any results :/

The results.json file has all blank entries, with n_samples=0 everywhere.

Ask about the evaluation of deepseek-math-rl

I cloned this repo and ran submit_eval_jobs.py with 8 GPUs. However, the tool-based result on the MATH test set is 0.5786, where we set iter=4 and our vLLM version is 0.2.0 as recommended. This has a gap with the reported 58.8; do you have any suggestions?

Paper Section 2.2 (pre-training): why train on up to 150B tokens for datasets of very different sizes?

The dataset used for the math model is 120B tokens, while the comparison corpora are 8.9B tokens, 13.6B tokens, and 13.6B×4 + 10.3B×1 + 28.0B×2 ≈ 120B tokens (please correct me if these numbers are wrong). This means the smallest corpus would be trained for close to 20 epochs, which is quite likely to overfit and hurt performance. Generally, wouldn't a fairer comparison pick a smaller token budget, e.g. the size of the smallest corpus or less, and downsample any corpus above that threshold?

I'd like to ask what considerations this experimental setting was based on.
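A quick back-of-envelope check of the numbers quoted above (all token counts are the questioner's estimates, not confirmed figures):

\begin{align*}
  13.6\text{B}\times 4 + 10.3\text{B}\times 1 + 28.0\text{B}\times 2 &\approx 120.7\text{B} \approx 120\text{B tokens},\\
  150\text{B trained tokens} \,/\, 8.9\text{B tokens (smallest corpus)} &\approx 16.9 \text{ epochs}.
\end{align*}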

Path Issue when running evals

Hi there! While trying to reproduce the evals with python submit_eval_jobs.py --n-gpus 8 on 8 A100 GPUs, I came across an error dumped in the logs:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
  Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "/home/ubuntu/src/DeepSeek-Math/evaluation/infer/run_cot_eval.py", line 11, in <module>
    from eval.utils import generate_completions, load_hf_lm_and_tokenizer
ModuleNotFoundError: No module named 'eval'

Was wondering if this is familiar. I have a similar setup with the conda envs and packages, and I'm running from <path_to_local_repo>/evaluation.

Some questions:

  1. Is there any missing files like __init__.py?
  2. Is the MKL_THREADING_LAYER=INTEL a cause of concern?

Thank you!
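Not an official answer, but a sketch of a possible workaround for both symptoms, assuming the module layout implied by the traceback (an eval/utils.py under the evaluation directory); the wrapper itself and the exact path are illustrative:

import os
import sys

# 1) MKL threading clash: the error message itself suggests forcing Intel's layer.
os.environ.setdefault("MKL_SERVICE_FORCE_INTEL", "1")

# 2) ModuleNotFoundError: put the evaluation directory on sys.path so that
#    "from eval.utils import ..." resolves, equivalent to exporting PYTHONPATH.
EVAL_ROOT = "/home/ubuntu/src/DeepSeek-Math/evaluation"  # path from the traceback
if EVAL_ROOT not in sys.path:
    sys.path.insert(0, EVAL_ROOT)

from eval.utils import generate_completions, load_hf_lm_and_tokenizer  # noqa: E402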

[Question] SFT Data Curation

Hi. Thank you very much for the excellent paper. I was impressed throughout by both the strong performance and the detailed description.

I have one question that I didn't understand while reading.
In Section 3.1 (SFT Data Curation), for the English mathematical datasets:

problems are paired with solutions in chain-of-thought (CoT) (Wei et al., 2022), program-of-thought (PoT) (Chen et al., 2022; Gao et al., 2023), and tool-integrated reasoning format (Gou et al., 2023)

The text says problems are "paired" with solutions, but is it like case 1 below, where all three formats are written in a single answer? Or is it like case 2 below, where each question is paired separately with each format, i.e. one question + one reasoning format per example? If possible, could you show me an example file for one question?

case1) 
Question : what is 1+1 ? 
Answer : CoT + PoT + Tool-integrated reasoning format 
ex) 1 + 1 = 2 / print(1+1) / ... 

case 2) 
Question : what is 1+1 ? 
Answer : CoT
ex) 1 + 1 = 2

Question 
Answer : PoT
ex) print(1+1) 

Question 
Answer : Tool-integrated reasoning format 

Thank you again for the great paper.
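Purely as an illustration of the distinction the question draws (this is not the actual SFT file format, just a hypothetical layout): case 2 would mean the same question appears as separate records, one per reasoning format, roughly like:

# hypothetical records, one per reasoning format (case 2 above)
records = [
    {"question": "What is 1 + 1?",
     "format": "cot",
     "answer": "1 + 1 = 2. The answer is 2."},
    {"question": "What is 1 + 1?",
     "format": "pot",
     "answer": "print(1 + 1)"},
    {"question": "What is 1 + 1?",
     "format": "tool-integrated",
     "answer": "Let's compute it with code: print(1 + 1), which gives 2."},
]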

Proportion of Chinese vs. English corpus in the math data

Thank you for your excellent work. We have tried running some experiments with deepseek-math-base and found it much better than other base models, so we'd like to ask: when training deepseek-math-base, what was the ratio of Chinese corpus to English corpus?

About raw common crawl data

Hi,

I'm trying to reproduce your paper. However, I find that a lot of math-related content is filtered out by many popular text-extraction pipelines. I'm wondering which version of the Common Crawl data you used to mine high-quality math content. Did you use a custom pipeline for web-data processing or something more specific? I cannot find any details regarding this in your paper.
