PRM800K: A Process Supervision Dataset

This repository accompanies the paper Let's Verify Step by Step and presents the PRM800K dataset introduced there. PRM800K is a process supervision dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset. More information on PRM800K and the project can be found in the paper.

We are releasing the raw labels as well as the instructions we gave labelers during phase 1 and phase 2 of the project. Example labels can be seen in the image below.

Data

The data/ folder contains our labels formatted as newline-delimited lists of json data. The data has been uploaded with Git LFS, which you'll need to install in order to properly clone the repository.

Each line represents 1 full solution sample and can contain many step-level labels. Here is one annotated line:

{
  // UUID representing a particular labeler.
  "labeler": "340d89bc-f5b7-45e9-b272-909ba68ee363",

  // The timestamp this trajectory was submitted.
  "timestamp": "2023-01-22T04:34:27.052924",

  // In phase 2, we split our data collection into generations, using our best
  // PRM so far to pick which solutions to score in the next generation.
  // In phase 1, this value should always be null.
  "generation": 9,

  // In each generation, we reserve some solutions for quality control. We serve
  // these solutions to every labeler, and check that they agree with our
  // gold labels.
  "is_quality_control_question": false,

  // generation -1 was reserved for a set of 30 questions we served every
  // labeler in order to screen for base task performance.
  "is_initial_screening_question": false,

  // Metadata about the question this solution is a response to.
  "question": {
    // Text of the MATH problem being solved.
    "problem": "What is the greatest common factor of $20 !$ and $200,\\!000$?  (Reminder: If $n$ is a positive integer, then $n!$ stands for the product $1\\cdot 2\\cdot 3\\cdot \\cdots \\cdot (n-1)\\cdot n$.)",
    // Ground truth solution from the MATH dataset.
    "ground_truth_solution": "The prime factorization of $200,000$ is $2^6 \\cdot 5^5$. Then count the number of factors of $2$ and $5$ in $20!$. Since there are $10$ even numbers, there are more than $6$ factors of $2$. There are $4$ factors of $5$. So the greatest common factor is $2^6 \\cdot 5^4=\\boxed{40,\\!000}$.",
    // Ground truth answer.
    "ground_truth_answer": "40,\\!000",

    // The full steps of the model-generated solution. This is only set for
    // phase 2 where we pre-generated all solutions that we labeled.
    "pre_generated_steps": [
      "I want to find the largest positive integer that divides both $20 !$ and $200,\\!000$ evenly.",
      "One way to do this is to factor both numbers into prime factors and look for the common ones.",
      "I know that $200,\\!000 = 2^5\\cdot 10^4 = 2^9\\cdot 5^4$.",
      "To find the prime factorization of $20 !$, I can use the fact that it is the product of all the positive integers from $1$ to $20$.",
      "For each prime number $p$ between $1$ and $20$, I can count how many multiples of $p$ are in that range.",
      "For example, there are $10$ multiples of $2$ between $1$ and $20$, namely $2, 4, 6, \\dots, 20$.",
      "But there are also $5$ multiples of $4$, which is $2^2$, and $2$ multiples of $8$, which is $2^3$, and $1$ multiple of $16$, which is $2^4$.",
      "So, the total power of $2$ in $20 !$ is $10 + 5 + 2 + 1 = 18$.",
      "Similarly, there are $4$ multiples of $5$, namely $5, 10, 15, 20$, so the power of $5$ in $20 !$ is $4$.",
      "There are $6$ multiples of $3$, namely $3, 6, 9, \\dots, 18$, but there are also $2$ multiples of $9$, which is $3^2$, so the power of $3$ in $20 !$ is $6 + 2 = 8$.",
      "There are $2$ multiples of $7$, namely $7$ and $14$, so the power of $7$ in $20 !$ is $2$.",
      "There are $1$ multiple of each of the other prime numbers $11, 13, 17$, and $19$, so the powers of those primes in $20 !$ are $1$ each.",
      "Therefore, the prime factorization of $20 !$ is $2^{18}\\cdot 3^8\\cdot 5^4\\cdot 7^2\\cdot 11\\cdot 13\\cdot 17\\cdot 19$.",
      "To find the greatest common factor of $20 !$ and $200,\\!000$, I need to take the lowest power of each common prime factor.",
      "The only common prime factors are $2$ and $5$, and the lowest powers are $9$ and $4$, respectively.",
      "So, the greatest common factor is $2^9\\cdot 5^4 = 512\\cdot 625 = 320,\\!000$.\n\n# Answer\n\n320,000"
    ],
    // The answer given at the end of the pre-generated solution. We can see
    // this solution is incorrect.
    "pre_generated_answer": "320,000",
    // The score given by our PRM to this solution. This one isn't rated very
    // highly!
    "pre_generated_verifier_score": 0.010779580529581414
  },

  // The human data we collected for this solution, containing correctness
  // labels for each step of the solution.
  "label": {
    "steps": [
      // Each object here represents labels for one step of the solution.
      {
        // Each step will contain one or more completions. These are candidate
        // steps the model output at this step of the trajectory. In phase 1,
        // we frequently collect labels on alternative steps, while in phase 2
        // we only collect labels on alternative steps after the first mistake,
        // so most completions lists are singletons.
        "completions": [
          {
            // Text of the step.
            "text": "I want to find the largest positive integer that divides both $20 !$ and $200,\\!000$ evenly.",
            // The rating the labeler gave to this step. Can be -1, 0, or +1.
            // This is a 0 because it isn't incorrect, but it does not make
            // any progress.
            "rating": 0,
            // The labeler can flag steps that they don't know how to label.
            // This is rarely used.
            "flagged": null
          }
        ],
        // In phase 1, if all completions were rated -1, we allowed labelers to
        // write their own +1 step. This is null for all steps in phase 2.
        "human_completion": null,
        // The index of the completion "chosen" at this step, or null if the
        // human_completion was used. You can reconstruct the solution
        // trajectory like:
        // [
        //     step["human_completion"] if step["chosen_completion"] is None
        //     else step["completions"][step["chosen_completion"]]["text"]
        //     for step in labeled_solution["label"]["steps"]
        // ]
        "chosen_completion": 0
      },
      {
        "completions": [
          {
            "text": "One way to do this is to factor both numbers into prime factors and look for the common ones.",
            "rating": 0,
            "flagged": null
          }
        ],
        "human_completion": null,
        "chosen_completion": 0
      },
      {
        // Some steps contain multiple alternative completions, and each one
        // gets a rating.
        "completions": [
          {
            "text": "I know that $200,\\!000 = 2^5\\cdot 10^4 = 2^9\\cdot 5^4$.",
            "rating": -1,
            "flagged": null
          },
          {
            "text": "To factor $20 !$, I can use the fact that every factorial is a multiple of every number less than or equal to it.",
            "rating": 0,
            "flagged": false
          },
          {
            "text": "I can use a factor tree to find the prime factors of $200,\\!000$: $200,\\!000 = 2^5\\cdot 10^4 = 2^5\\cdot 2^4\\cdot 5^4 = 2^9\\cdot 5^4$.",
            "rating": -1,
            "flagged": false
          },
          {
            "text": "I can use a factor tree to find the prime factors of $200,\\!000$.",
            "rating": 0,
            "flagged": false
          },
          {
            "text": "To factor $20 !$, I can use the fact that any factorial is divisible by all the primes less than or equal to the input.",
            "rating": 0,
            "flagged": false
          }
        ],
        "human_completion": null,
        "chosen_completion": null
      }
    ],
    // Total time in milliseconds spent on labeling this solution.
    "total_time": 278270,
    // Final result of labeling this solution. Will be one of:
    //   - "found_error": In phase 2 we stop labeling a solution after the
    //                    first error is found.
    //   - "solution": We reached a step that concluded in the correct answer
    //                 to the problem.
    //   - "bad_problem": The labeler reported the problem as broken.
    //   - "give_up": The labeler was stuck (the problem was taking too long,
    //                or the instructions were unclear) and moved onto the
    //                next problem.
    "finish_reason": "found_error"
  }
}
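
For convenience, here is a minimal loading sketch (not part of the repository) that reads one of the label files and reconstructs a solution trajectory from the chosen completions, following the comment above. The file name is assumed; substitute whichever phase file you want from data/:

import json

# Read the first labeled solution from an (assumed) phase-2 label file.
with open("data/phase2_train.jsonl") as f:
    sample = json.loads(f.readline())

# Reconstruct the trajectory from the chosen completions, as described in the
# annotated example above. Steps with neither a chosen completion nor a human
# completion (e.g. the step at which the first error was found) are skipped.
trajectory = []
for step in sample["label"]["steps"]:
    if step["chosen_completion"] is not None:
        trajectory.append(step["completions"][step["chosen_completion"]]["text"])
    elif step["human_completion"] is not None:
        trajectory.append(step["human_completion"])  # format of human completions may differ

print("\n".join(str(s) for s in trajectory))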

Instructions

The instructions/ folder contains the instructions documents we gave to labelers during each phase of the project.

Answer Grading

The grading/ folder contains the python grading logic we used for determining if a model-outputted answer correctly matched the ground truth answer in Hendrycks' MATH dataset. We build off of Hendrycks' math normalization logic in math_normalize.py and use sympy to check for equality of expressions in grader.py. We recommend using grader.grade_answer(model_answer, gt_answer) where both answers are strings to determine if a solution is correct or not.

Answer grading is difficult in general. This grading logic is designed to be conservative and will sometimes reject correct answers, though it does so less frequently than the normalization logic from MATH. Our logic might sometimes admit incorrect answers, though we've put effort into minimizing this.
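
As a usage sketch (the import assumes the grading/ folder is importable from your working directory; adjust the path to your setup):

from grading import grader  # import path assumed

# Both arguments are plain strings; grade_answer returns True when the model
# answer is judged equivalent to the ground truth after normalization and the
# sympy-based equality check.
print(grader.grade_answer("40,\\!000", "40,\\!000"))  # True
print(grader.grade_answer("320,000", "40,\\!000"))    # False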

MATH Splits

As explained in Let's Verify Step by Step, we use a nonstandard MATH train/test split.

In order to avoid the risk of over-fitting on the 7,500 MATH training problems, we expanded the training set to include 4,500 MATH test split problems. We therefore evaluate our models only on the remaining 500 held-out problems. We selected these 500 test problems uniformly at random, and we believe they are representative of the test set as a whole.

The math_splits/ folder contains our selected splits in the train.jsonl and test.jsonl files. You'll need Git LFS to properly clone these files.

Scored Samples

We release all large-scale model samples used to evaluate the large-scale ORM and PRM, corresponding to Figure 3 in the paper. Each test problem has up to 1860 scored samples. Solutions that failed to reach an answer within 1024 tokens were discarded, resulting in fewer than 1860 samples for some problems. We account for this in the best-of-N evaluation logic.

Evaluate the PRM:

python eval/eval.py --method prm

Evaluate the ORM:

python eval/eval.py --method orm
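
For intuition only, here is a rough sketch of the best-of-N selection these scripts perform. This is not the eval/eval.py implementation, and the per-sample field names are assumptions:

import random

from grading import grader  # import path assumed; see Answer Grading above

def best_of_n_correct(scored_samples, gt_answer, n, rng=None):
    # scored_samples: list of dicts with assumed keys "answer" (the model's
    # final answer string) and "verifier_score" (the ORM or PRM score).
    # Sample n solutions, keep the one the reward model scores highest, and
    # grade its answer against the ground truth.
    rng = rng or random.Random(0)
    picks = rng.sample(scored_samples, min(n, len(scored_samples)))
    best = max(picks, key=lambda s: s["verifier_score"])
    return grader.grade_answer(best["answer"], gt_answer)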

Citation

Please use the below BibTeX entry to cite this dataset:

@article{lightman2023lets,
      title={Let's Verify Step by Step}, 
      author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
      journal={arXiv preprint arXiv:2305.20050},
      year={2023}
}


Issues

Question about the correctness of step-level rating

Hi! Thank you for the inspiring work. I would like to know the accuracy of the step-level ratings annotated by the human labelers, since I may have just found a mistake in the third example of phase2_test.jsonl:

{"labeler": "b5b0d9b5-90f0-4756-ab90-140836ed55ab", "timestamp": "2022-11-17T18:14:18.615199", "generation": 6, "is_quality_control_question": false, "is_initial_screening_question": false, "question": {"problem": "A strictly increasing sequence of positive integers $a_1$, $a_2$, $a_3$, $\\dots$ has the property that for every positive integer $k$, the subsequence $a_{2k-1}$, $a_{2k}$, $a_{2k+1}$ is geometric and the subsequence $a_{2k}$, $a_{2k+1}$, $a_{2k+2}$ is arithmetic. Suppose that $a_{13} = 2016$. Find $a_1$.", "ground_truth_solution": "Let $\\frac{a_2}{a_1} = \\frac{b}{a},$ where $a$ and $b$ are relatively prime positive integers, and $a < b.$  Then $a_2 = \\frac{b}{a} \\cdot a_1,$ and\n\\[a_3 = \\frac{a_2^2}{a_1} = \\frac{(b/a \\cdot a_1)^2}{a_1} = \\frac{b^2}{a^2} \\cdot a_1.\\]This implies $a_1$ is divisible by $a^2.$  Let $a_1 = ca^2$; then $a_2 = cab,$ $a_3 = cb^2,$\n\\begin{align*}\na_4 &= 2a_3 - a_2 = 2cb^2 - cab = cb(2b - a), \\\\\na_5 &= \\frac{a_4^2}{a_3} = \\frac{[cb(2b - a)]^2}{(cb^2)} = c(2b - 2a)^2, \\\\\na_6 &= 2a_5 - a_4 = 2c(2b - a)^2 - cb(2b - a) = c(2b - a)(3b - 2a), \\\\\na_7 &= \\frac{a_6^2}{a_5} = \\frac{[c(2b - a)(3b - 2a)]^2}{c(2b - a)^2} = c(3b - 2a)^2, \\\\\na_8 &= 2a_7 - a_6 = 2c(3b - 2a)^2 - c(2b - a)(3b - 2a) = c(3b - 2a)(4b - 3a), \\\\\na_9 &= \\frac{a_8^2}{a_7} = \\frac{[c(3b - 2a)(4b - 3a)]^2}{[c(3b - 2a)^2} = c(4b - 3a)^2,\n\\end{align*}and so on.\n\nMore generally, we can prove by induction that\n\\begin{align*}\na_{2k} &= c[(k - 1)b - (k - 2)a][kb - (k - 1)a], \\\\\na_{2k + 1} &= c[kb - (k - 1)a]^2,\n\\end{align*}for all positive integers $k.$\n\nHence, from $a_{13} = 2016,$\n\\[c(6b - 5a)^2 = 2016 = 2^5 \\cdot 3^2 \\cdot 7 = 14 \\cdot 12^2.\\]Thus, $6b - 5a$ must be a factor of 12.\n\nLet $n = 6b - 5a.$  Then $a < a + 6(b - a) = n,$ and\n\\[n - a = 6b - 6a = 6(b - a),\\]so $n - a$ is a multiple of 6.  
Hence,\n\\[6 < a + 6 \\le n \\le 12,\\]and the only solution is $(a,b,n) = (6,7,12).$  Then $c = 14,$ and $a_1 = 14 \\cdot 6^2 = \\boxed{504}.$", "ground_truth_answer": "504", "pre_generated_steps": ["I notice that the problem involves both geometric and arithmetic sequences, so I should recall the definitions and formulas for them.", "A geometric sequence has a common ratio $r$ such that $a_{n+1} = r a_n$ for all $n$.", "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "I also notice that the problem gives me a specific value for $a_{13}$, so I should try to work backwards from there and see if I can find a pattern or a relation between the terms.", "Since $a_{13}$ is part of both a geometric and an arithmetic subsequence, I can write two equations involving $a_{13}$, $a_{12}$, $a_{11}$, and $a_{14}$.", "For the geometric subsequence $a_{11}$, $a_{12}$, $a_{13}$, I have $a_{13} = r a_{12}$ and $a_{12} = r a_{11}$.", "I can eliminate $r$ by dividing the two equations and get $a_{13}^2 = a_{11} a_{12}^2$.", "Similarly, for the arithmetic subsequence $a_{12}$, $a_{13}$, $a_{14}$, I have $a_{14} = a_{13} + d$ and $a_{13} = a_{12} + d$.", "I can eliminate $d$ by subtracting the two equations and get $a_{14} - a_{13} = a_{13} - a_{12}$.", "Now I have two equations with four unknowns, which is not enough to solve for them.", "However, I can use the fact that the sequence is strictly increasing, which means that $a_n > a_{n-1}$ for all $n$.", "This implies that $r > 1$ and $d > 0$.", "I can also use the fact that all the terms are positive integers, which means that $r$ and $d$ are also positive integers, and $a_n$ divides $a_{n+1}$ for all $n$ in a geometric subsequence.", "With these additional constraints, I can try to guess some possible values for $r$ and $d$ that satisfy the equations and the conditions.", "For example, if $r = 2$ and $d = 1$, then I have $a_{13} = 2 a_{12}$, $a_{12} = 2 a_{11}$, $a_{14} = a_{13} + 1$, and $a_{13} - a_{12} = 1$.", "Plugging in $a_{13} = 2016$, I get $a_{12} = 1008$, $a_{11} = 504$, and $a_{14} = 2017$.", "This seems to work, but I need to check if the previous terms also follow the same pattern.", "If $r = 2$ and $d = 1$, then the previous geometric subsequence should be $a_9$, $a_{10}$, $a_{11}$, and the previous arithmetic subsequence should be $a_{10}$, $a_{11}$, $a_{12}$.", "Using the same formulas, I get $a_{11} = 2 a_{10}$, $a_{10} = 2 a_9$, $a_{12} = a_{11} + 1$, and $a_{11} - a_{10} = 1$.", "Plugging in $a_{11} = 504$, I get $a_{10} = 252$, $a_9 = 126$, and $a_{12} = 505$.", "This also seems to work, and I can continue this process until I reach $a_1$.", "After a few more steps, I get $a_7 = 63$, $a_6 = 32$, $a_5 = 16$, $a_4 = 9$, $a_3 = 5$, $a_2 = 3$, and $a_1 = 2$.", "This is the smallest possible value for $a_1$ that satisfies all the conditions, and I can verify that the sequence continues to be strictly increasing, geometric, and arithmetic as required.", "# Answer\n\n2"], "pre_generated_answer": "2", "pre_generated_verifier_score": 5.517159691249014e-05}, "label": {"steps": [{"completions": [{"text": "I notice that the problem involves both geometric and arithmetic sequences, so I should recall the definitions and formulas for them.", "rating": 1, "flagged": null}], "human_completion": null, "chosen_completion": 0}, {"completions": [{"text": "A geometric sequence has a common ratio $r$ such that $a_{n+1} = r a_n$ for all $n$.", "rating": 1, "flagged": null}], "human_completion": null, 
"chosen_completion": 0}, {"completions": [{"text": "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "rating": -1, "flagged": null}, {"text": "An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$.", "rating": -1, "flagged": false}], "human_completion": null, "chosen_completion": null}], "total_time": 1097989, "finish_reason": "found_error"}}

Here, the third step is labeled -1, but it seems to be correct:

"An arithmetic sequence has a common difference $d$ such that $a_{n+1} = a_n + d$ for all $n$."

In another issue, you said

our labelers probably made small mistakes more often than the MATH answer key was wrong

And in Appendix B of the paper, you mentioned that you

only admitted labelers that agreed with our gold labels at least 75% of the time.

So is the accuracy of the step-level annotations perhaps only slightly higher than that?

When rating is null, what does it mean and how should it be used?

@peterjdolan @agnoster
Thanks for sharing your work.
I found that some steps in a solution have a rating of null, as shown below.
How should I use data like that?

{
      "completions": [
          {
              "text": "Now I can use the property that if two powers of the same base are equal, then their exponents must be equal as well.",
              "rating": null,
              "flagged": null
          }
      ],
      "human_completion": null,
      "chosen_completion": 0
  },
  {
      "completions": [
          {
              "text": "So I have\n\\[-\\frac{3}{2} + 2 \\cos \\theta - 1 = \\frac{1}{4**} + \\cos \\theta - 1.\\]",
              "rating": null,
              "flagged": null
          }
      ],
      "human_completion": null,
      "chosen_completion": 0
  },
  {
      "completions": [
          {
              "text": "Simplifying this equation, I get\n\\[2 \\cos \\theta - \\frac{5}{2} = \\cos \\theta - \\frac{3}{4}.\\]",
              "rating": null,
              "flagged": null
          }
      ],
      "human_completion": null,
      "chosen_completion": 0
  },
  {
      "completions": [
          {
              "text": "Subtracting $\\cos \\theta$ from both sides, I get\n\\[\\cos \\theta - \\frac{5}{2} = - \\frac{3}{4}.\\]",
              "rating": null,
              "flagged": null
          }
      ],
      "human_completion": null,
      "chosen_completion": 0
  }

Questions about the solution-level score.

Hi guys, thanks again for your excellent work on the PRM. I also have some questions regarding the solution-level score.

  1. In the paper you mentioned "we consider neutral steps to be positive, and we define the solution score to be the product of step-level scores". In my understanding, this setup will give a high score to short solutions that make some progress but do not arrive at a final answer. This may not be a big problem in the best-of-1860 scenario, but it will definitely be a problem when used in RLHF or in best-of-N when N is much larger. What's your opinion on this? (See the sketch after this list.)
  2. You also mentioned "Using the product instead of the minimum as the reduction does create a slight bias against solutions with a larger number of steps". I went through the prediction file and found that in around 60% of response pairs, the response with the smaller PRM score has the larger number of steps. When used in RLHF, this will result in shorter solutions having higher PRM scores. What is your recommendation when applying this in RLHF?
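
For concreteness, a small sketch of the solution-level scoring rule quoted in point 1 (neutral steps treated as positive, solution score equal to the product of step-level scores). The numbers are illustrative, not taken from the paper:

import math

def solution_score(step_probs):
    # step_probs: for each step, the PRM's probability that the step is not a
    # mistake (i.e. that its rating is 0 or +1). The solution-level score is
    # the product of these per-step probabilities.
    return math.prod(step_probs)

# A short, partially complete solution can outscore a longer complete one,
# which is the concern raised in point 1 above.
print(solution_score([0.99, 0.98]))   # ~0.970
print(solution_score([0.97] * 12))    # ~0.694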

Incorrect training data?

[{"question": "The first four terms in an arithmetic sequence are $x+y$, $x-y$, $xy$, and $x/y$, in that order. What is the fifth term? Express your answer as a common fraction."}] 

I really don't understand why the answer is 123/40. Is the data missing some conditions?

What's the training pipeline of the PRM?

Hi, @longouyang @welinder @avital, thanks for your work on the PRM. I want to ask: what is the training pipeline of the PRM?
For the ORM, we can define chosen/rejected pairs and use a pairwise loss to supervise the reward model, but I couldn't figure out how the PRM is trained, which still confuses me even after reading the dataset's structure. It would be appreciated if anyone could help explain or discuss it. Thanks a lot!

MathMix

Hi, is the MathMix dataset mentioned in the paper open-sourced?

Question about multiple alternative steps

To train the PRM, how did you handle multiple alternative steps? Did you supervise every step concatenated with its history, or only consider a single trajectory? Would supervising every step be inefficient? (See the sketch after the examples below.)

For example, supervising every step would look like:

Step 1: [A, B, C] Chosen: A
Step 2: [D] Chosen: D
Step 3: [E, F] Chosen: E

Training data: [A] [B] [C] [A, D] [A, D, E] [A, D, F]

If we supervised only a single trajectory, that would look like:

Step 1: [A, B, C] Chosen: A
Step 2: [D] Chosen: D
Step 3: [E, F] Chosen: E

Training data: [A, D, E]
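
A minimal sketch of the first option, under the assumption that every rated completion, prefixed by the chosen completions of earlier steps, becomes one training example. This is one possible reading of the data format, not a recipe confirmed by the authors:

def build_examples(labeled_solution):
    # Every rated completion, prefixed by the chosen completions of the
    # preceding steps, becomes one (prefix, rating) example. Human-written
    # completions (phase 1) are ignored for brevity.
    examples = []
    prefix = []
    for step in labeled_solution["label"]["steps"]:
        for completion in step["completions"]:
            if completion["rating"] is not None:
                examples.append((prefix + [completion["text"]], completion["rating"]))
        chosen = step["chosen_completion"]
        if chosen is None:
            break  # no chosen continuation, e.g. the first error was found
        prefix.append(step["completions"][chosen]["text"])
    return examples

For the example above, this yields the prefixes [A], [B], [C], [A, D], [A, D, E], and [A, D, F], each paired with the rating of its final step.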

PRM model training details

Hello, I have a question about training the PRM using the dataset from the paper.

I've seen two related issues: #10 and #7

From #10, I understood that you create one sequence for each intermediate trajectory and compute a single reward at the end of each sequence, corresponding to the last step in that sequence.

Using the same example:

Step 1: [A, B, C] Chosen: A
Step 2: [D] Chosen: D
Step 3: [E, F] Chosen: E

Training sequences with labels:
[A,1] [B,-1] [C,-1] [A, D, 1] [A, D, E, 1] [A, D, F, -1]

But from issue #7, I understand that the correct way is not to have a sequence for each intermediate trajectory, but instead to predict a reward after each step of a larger trajectory:

Training sequences:
[B, -1]
[C, -1]
[A, 1, D, 1, E, 1]
[A, 1, D, 1, F, -1]

This way the number of training sequences is reduced.

Which way did you use?

Question about reward model evaluation metric

Thanks for this great work! I have one question about how you measure the performance of the reward model. You mentioned in section 2.1 that "We evaluate a reward model by its ability to perform best-of-N search over uniformly sampled solutions from the generator". I am curious: why not directly calculate the reward model's accuracy over the test set and use that as the metric?

PRM800K prompt

Could you tell me what prompt you used to train your model with the PRM800K dataset?

What is the role of active learning?

Does it use the generator and the PRM to create more data with correct final answers for further training, since the majority of PRM800K solutions have an incorrect final answer?

Questions about implementation details.

Thanks for your excellent work on open-sourcing the data. I have a few questions regarding the implementation details.

  1. You mentioned "We train like a normal LM, learning to predict a single rating token (-1, 0, 1) given a problem and the solution up to the current step". In my understanding, the input is something like [question_tokens, sep_token, solution_step1_tokens, rating_token_for_1, solution_step2_tokens, rating_token_for_0, solution_step3_tokens, rating_token_for_-1]. You use a different token for each score, and when predicting the rating token for step 3, the model has access to the rating scores for step 1 and step 2 because they are present in the context. The training loss is the log-likelihood of these rating tokens over the whole vocabulary. Am I correct?
  2. You also mentioned "it suffices to perform a single PRM forward pass over the whole solution". I am not sure how that works out. My best guess is that a rating placeholder token is appended at the end of every step, so the input is something like [question_tokens, sep_token, solution_step1_tokens, rating_placeholder_token, solution_step2_tokens, rating_placeholder_token, solution_step3_tokens, rating_placeholder_token]. You then take the log-likelihood at these placeholder positions and normalize it over the 3 rating tokens (rating_token_for_-1, rating_token_for_0, rating_token_for_1). Is that correct? (A sketch of this layout follows below.)
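
As a purely illustrative sketch of the layout guessed at in point 2 (the token names are placeholders, and this is not a confirmed description of the actual implementation):

RATING_TOKENS = ["<rating_-1>", "<rating_0>", "<rating_+1>"]  # hypothetical names
PLACEHOLDER = "<rate>"                                        # hypothetical name

def build_prm_input(question_tokens, step_token_lists):
    # Append a rating placeholder after each step. A single forward pass then
    # yields logits at every placeholder position, which can be restricted to
    # the three rating tokens and normalized into step-level probabilities.
    tokens = list(question_tokens) + ["<sep>"]
    rating_positions = []
    for step_tokens in step_token_lists:
        tokens.extend(step_tokens)
        rating_positions.append(len(tokens))  # index of this step's placeholder
        tokens.append(PLACEHOLDER)
    return tokens, rating_positions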

Question about Q*

On November 23, 2023 Reuters reported that OpenAI has been working on a "project called Q*".

Has this dataset been used for the Q* project?

Thanks.
