rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into some really cutting-edge research that's making waves in the world of AI, specifically in the realm of math reasoning. It’s getting pretty wild how much these models are progressing.
Guest: Hi Leo, thanks for having me! I'm just as excited to talk about this. The advancements in AI, especially in areas like mathematical reasoning, are happening at a breakneck pace. It's sometimes hard to keep up with all the new breakthroughs!
Host: Absolutely! It feels like every week there's something new and mind-blowing. And today's topic is no exception. We're going to be discussing a paper that explores how small language models, yes, those smaller versions, can actually master complex math reasoning. Now, that's not something we would’ve expected a few years ago, right?
Guest: Definitely not! Traditionally, we’ve always thought that bigger models were synonymous with better performance, especially in challenging tasks like math. But this paper really challenges that notion. It suggests that with the right approach, smaller models can achieve results that are comparable to, or even better than, their larger counterparts.
Host: Exactly. It's all about the methodology, it seems. The paper's titled 'rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking.' It’s a mouthful, but it basically means they're using a clever strategy to train these smaller models. It's almost like they are learning to think in a more profound way, which is pretty fascinating.
Guest: Yeah, the term ‘deep thinking’ is really interesting here, because it emphasizes a move away from just simple pattern matching to something that mirrors human-like reasoning. They are not just crunching numbers; they are constructing logical arguments and step-by-step solutions, which is a critical shift. This reminds me a lot of how we teach mathematical proof in schools; it's not just about getting the right answer but about showing how we arrived at the conclusion.
Host: Right, and that's where the 'self-evolved' part comes in, I think. This isn't just about feeding a model tons of data. It's about creating a system where the model gradually improves its own training data. It’s like the model is learning how to learn, which is a game changer. We can get into the specific technicalities, but the idea that models can iteratively improve themselves by creating their training material is so cool.
Guest: It's absolutely a step towards more autonomous AI, really. Instead of solely relying on curated datasets, these models are learning to evaluate their own outputs and enhance their capabilities iteratively. They're using a method called Monte Carlo Tree Search, or MCTS, which is a pretty powerful approach for exploration and decision-making.
Host: Okay, so let's start unpacking this a little bit. The paper is structured in a pretty logical way, I think. First off, they introduce the problem they're trying to solve, right? Basically stating that while bigger models are good, they can also be kind of...brute force-y? They generate fast but often flawed solutions. And that we need to emulate human-like 'System 2' thinking, which is slower, more deliberate.
Guest: Right, and the paper frames this with the distinction between System 1 and System 2 thinking, concepts borrowed from cognitive psychology. System 1 thinking is rapid, automatic, and often prone to errors, like when a person guesses the answer quickly without considering all the steps. System 2, on the other hand, is slow, deliberate, and analytical, which is vital for complex mathematical reasoning. So, the goal here isn't just to get a correct answer, but to reach that answer through a methodical thought process. This is a significant leap from merely generating a single solution via System 1.
Host: Right, and they point out that relying on models to generate full solutions in one go, which is like 'System 1' thinking, often leads to errors. So, the goal is to shift towards a 'System 2' approach, where the model breaks the math problems into multiple steps, and the steps are evaluated individually. They kind of break the problem into a series of logical deductions, so you can verify each piece. This really highlights the importance of the process itself.
Guest: And that's a crucial point. It's not enough to have the final answer correct; the journey there matters just as much, especially in mathematics. Incorrect intermediate steps can significantly reduce the quality of the training data, even if they happen to lead to the correct final answer. The problem is that these models might sometimes land on a correct answer by sheer luck or via an incorrect approach. So, it is crucial that each step in the reasoning process is logical, sound, and free of flaws.
Host: And that leads into the challenges of training these models, right? The paper makes it clear that finding high-quality data for math reasoning is really tough. And even when they try to create their own training data, it's hard to filter out the bad stuff. It's like they need to be sure that the model’s internal steps are also good, not just the final result.
Guest: Exactly. It's a multi-faceted challenge. High-quality math reasoning data is inherently scarce, and it's not like you can easily scale it up. If you're trying to automatically generate solutions, you have to be very careful about what you're including in the training data. Erroneous reasoning steps can easily creep in and corrupt the training process, and those steps are difficult to discern. A correct final answer doesn't necessarily imply that all the steps were correct too. So, how do you make sure each step of the reasoning process is high quality?
Host: And they specifically mention that distillation-based methods, where a bigger, more powerful model teaches a smaller one, have diminishing returns: the distilled model can’t surpass the capability of the original. It's like the student can't get better than the teacher, which is kind of a bummer, and it's the core reason why their approach needs to be different.
Guest: That’s a fundamental limitation in many traditional approaches. If the teacher model doesn't understand how to solve a problem, it's hard for the student model to learn to do it effectively. The authors explain that while techniques like scaling up the use of GPT-4 distilled data have shown some effectiveness, they reach a point where they no longer improve. The student, so to speak, plateaus, and you need new ways to push the training forward.
Host: Okay, so, to recap, they've identified that relying on big models alone isn’t sustainable and that the key is to move towards a more human-like reasoning process. They also highlight the challenges of finding high-quality math reasoning data and the limitations of existing methods like distillation. That brings us to what they actually propose: a self-evolvable approach. Can you give us a bit of a breakdown of what this self-evolutionary approach is about?
Guest: Sure, so the core of their method is called rStar-Math, and it's a self-evolvable System 2 reasoning approach. Instead of relying on bigger models for training data, they start with a smaller language model (SLM) and use a technique called Monte Carlo Tree Search or MCTS. This is where they start breaking math problems down into individual steps. They use the MCTS to generate step-by-step solutions, essentially letting the model explore multiple different paths to arrive at the answer. What really makes this special is that each step is evaluated, leading to the ‘deep thinking’ we spoke about earlier.
Host: And that's what's so interesting to me! So, instead of just generating one long solution, it's exploring multiple, step-by-step solutions using MCTS, kind of like how a human might approach a problem, exploring different possibilities. Can we talk about how they use Monte Carlo Tree Search (MCTS)? For someone who is not too familiar, it's basically this algorithm that explores options in a tree-like structure, right?
Guest: That’s right. Think of a complex math problem as a tree with numerous branches. MCTS explores these branches by going through multiple iterations of selection, expansion, rollout, and backpropagation. The model starts at the root of the tree, which is the problem itself, then samples candidate nodes at each step. Each node represents a step in the reasoning process, and the nodes that look promising are explored further, with Python code attached to each candidate step for verification.
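To make those four phases concrete, here is a minimal, generic MCTS sketch in Python. It is illustrative only: the Node fields, the expand and rollout callables, and the exploration constant are invented for this example and do not come from the paper's code.

```python
import math
import random

class Node:
    """One candidate reasoning step in the search tree (illustrative only)."""
    def __init__(self, state, parent=None):
        self.state = state      # partial solution built so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0        # accumulated reward from rollouts

    def q(self):
        return self.value / self.visits if self.visits else 0.0

def uct_select(node, c=1.4):
    """Selection: balance exploiting high-Q children and exploring rarely visited ones."""
    return max(node.children,
               key=lambda ch: ch.q() + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)))

def mcts(root, expand, rollout, n_iters=32):
    """Generic MCTS loop: selection, expansion, rollout, backpropagation."""
    for _ in range(n_iters):
        # 1. Selection: walk down the tree until reaching a leaf.
        node = root
        while node.children:
            node = uct_select(node)
        # 2. Expansion: ask the policy model for candidate next steps.
        for child_state in expand(node.state):
            node.children.append(Node(child_state, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Rollout: finish the solution from here and score it (1 = correct answer).
        reward = rollout(node.state)
        # 4. Backpropagation: credit every node on the path with the reward.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root

# Toy usage: states are integers, and "reaching 2 or more" stands in for a correct answer.
root = Node(state=0)
mcts(root, expand=lambda s: [s + 1, s + 2], rollout=lambda s: 1.0 if s >= 2 else 0.0)
print(max(root.children, key=lambda ch: ch.q()).state)
```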
Host: Okay, so at each step, the SLM generates a step, and it also generates some Python code that goes with it, right? That way they can actually verify if each step is correct? So it's like a double check, not just relying on the language model's output. That’s very smart.
Guest: Exactly! This is where the ‘code-augmented CoT data synthesis’ comes in. The language model isn't only generating the next step in natural language, but also the corresponding Python code. This code is then executed, and only steps whose code runs successfully are kept as valid. It is an incredibly effective way to filter out erroneous reasoning steps, because producing code that actually executes usually requires the underlying step to be logically sound. This significantly reduces instances where the model makes an error but still generates something that seems correct.
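As a rough sketch of that filtering idea, assuming the pipeline simply executes each step's attached Python snippet and discards steps whose code fails, something like the following would do the check. The helper name, timeout, and toy snippet are all hypothetical; the paper's actual pipeline may apply additional checks (for example, on the final answer).

```python
import subprocess
import tempfile
import textwrap

def step_passes_code_check(python_snippet: str, timeout_s: int = 5) -> bool:
    """Execute the Python code attached to a candidate reasoning step.
    The step is kept only if its code runs to completion without errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent(python_snippet))
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical candidate step produced by the policy model:
candidate = """
# Step 2: solve 3x + 5 = 20 for x
x = (20 - 5) / 3
assert abs(x - 5) < 1e-9
"""
print(step_passes_code_check(candidate))  # True -> keep this step in the tree
```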
Host: Okay, so now the model has generated a bunch of possible steps, each with its own Python code for verification. How do they decide which steps are better and how to assign values to these steps? That's where the Q-values come in, I believe?
Guest: You've got it. After running MCTS, each step is given a Q-value. It’s like a score that indicates how useful that step was to arriving at the correct answer. For example, if a step consistently leads to correct solutions during exploration, it gets a high Q-value; conversely, steps that lead nowhere will get low Q-values. This means the Q-values act as an automated feedback mechanism, guiding the exploration process towards more promising solution paths.
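One toy way to picture that scoring: treat a step's Q-value as the fraction of rollouts passing through it that ended at a verified correct answer. The function and variable names below are hypothetical, and the paper's annotation scheme is richer than this, but the bookkeeping captures the intuition.

```python
from collections import defaultdict

# How often each intermediate step appears in any rollout, and in correct ones.
visits = defaultdict(int)
correct = defaultdict(int)

def update_q(trajectory_steps, reached_correct_answer: bool):
    """After a rollout finishes, credit every step on its path."""
    for step in trajectory_steps:
        visits[step] += 1
        if reached_correct_answer:
            correct[step] += 1

def q_value(step) -> float:
    """Q ~ fraction of rollouts through this step that ended correctly."""
    return correct[step] / visits[step] if visits[step] else 0.0

# Toy example: one good rollout and one bad rollout sharing the first step.
update_q(["step_A", "step_B"], reached_correct_answer=True)
update_q(["step_A", "step_C"], reached_correct_answer=False)
print(q_value("step_A"), q_value("step_B"), q_value("step_C"))  # 0.5 1.0 0.0
```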
Host: Okay, so the model is not only generating these steps, but also automatically evaluating them based on how well they contribute to the correct answer. The more a step contributes to a successful solution, the higher its Q-value. This is like a built-in reward system, and that’s very clever. It seems like this eliminates the need for human labeling, which you mentioned wouldn’t be sustainable in the long run?
Guest: That’s precisely right. It's a self-annotated system that learns from its own experience. The model uses those Q-values to determine which steps are most beneficial. And this is critical because manually annotating intermediate steps is not just time-consuming but is also challenging because humans can be inconsistent when it comes to labeling. They introduce this notion of a ‘process reward model,’ or PRM for short, which provides step-by-step feedback. However, there is the issue of how one trains this model.
Host: Ah, yes, I remember that. It’s a big issue because while process reward models are super useful for giving granular feedback, you need accurate step-level feedback in your training data, which is very difficult to obtain in practice. It’s just not scalable if we're going to rely solely on human labelers. So, how does this paper address that challenge? How do they train a ‘process reward model’ in an efficient and scalable way?
Guest: That’s where their second key innovation lies, their method for training a process preference model, or PPM. They avoid directly using those Q-values as hard reward scores, which can be unreliable. Instead, they use those values to create preference pairs. They choose steps with high Q-values as positive examples and steps with low Q-values as negative examples, and they use this to construct preference pairs for each step. This allows the model to learn to discriminate between good steps and bad steps, and it's a far more effective approach than directly using the noisy Q-values.
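Here is a rough PyTorch sketch of that idea: contrast the highest-Q and lowest-Q sibling steps to form a preference pair, then train the PPM with a pairwise, Bradley-Terry-style loss so it only has to rank steps rather than predict absolute reward scores. The pairing rule, the margin, and the example scores are invented for illustration.

```python
import torch
import torch.nn.functional as F

def build_preference_pair(candidates, margin=0.5):
    """From sibling candidate steps at the same position, pair the highest-Q
    step (preferred) with the lowest-Q step (rejected), if they differ enough."""
    ranked = sorted(candidates, key=lambda c: c["q"], reverse=True)
    best, worst = ranked[0], ranked[-1]
    return (best["text"], worst["text"]) if best["q"] - worst["q"] >= margin else None

def pairwise_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: the PPM should score the preferred step
    above the rejected one; no absolute reward targets are needed."""
    return -F.logsigmoid(score_pos - score_neg).mean()

# Toy usage with hypothetical PPM scores for one preference pair.
pair = build_preference_pair([
    {"text": "Apply the quadratic formula to x^2 - 5x + 6 = 0 ...", "q": 0.9},
    {"text": "Guess x = 1 and move on ...",                         "q": 0.1},
])
pos_score = torch.tensor([2.3])   # PPM score of the preferred step
neg_score = torch.tensor([-0.7])  # PPM score of the rejected step
print(pair, pairwise_loss(pos_score, neg_score).item())
```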
Host: Ah, that's a clever way of doing it. So, rather than trying to pinpoint an exact numerical reward for each step, they’re just comparing steps and teaching the model to distinguish which steps are better than others. The PPM is trained to predict which step is preferred, instead of what the perfect score should be. They're learning preferences, not hard scores, which is a great distinction.
Guest: Exactly. By learning pairwise preferences, the model can learn to discern subtle differences in step quality. It doesn’t need to predict an exact score; it just needs to learn to differentiate between high-quality steps and irrelevant or flawed ones. This significantly stabilizes the training process and produces much more reliable results.
Host: Okay, so now we have a model that uses MCTS to generate step-by-step verified reasoning trajectories with Q-values. And we also have the process preference model that can evaluate these steps. It’s like we are teaching the model to learn how to evaluate its own steps, not just to generate solutions. And it's a self-evolving system because this process repeats iteratively, improving the policy model and the process preference model, which leads to even better data in subsequent rounds. Let's talk about this iterative, self-evolutionary process.
Guest: Right, so their method has four rounds of self-evolution that build on each other. It’s not just training one model and then stopping. In the first round, they start with a decent, but not amazing, policy model and generate an initial dataset using MCTS with basic terminal-guided annotation. Then, in the second round, they refine the policy and introduce a more reliable process preference model, which lets them improve the step-by-step data. They iterate on both, improving the quality of the data and increasing coverage of challenging math problems in each cycle.
Host: So, each round of this self-evolution builds on the previous one. In each round the policy model and the PPM get better, which generates even higher quality training data, which then enables the model to handle more challenging and complex problems. So, at every round the model is getting better. It’s kind of like it’s teaching itself.
Guest: Exactly, it’s a progressive refinement. By round three, they’re using the PPM-augmented MCTS, which leads to significantly improved training data. Then, by the fourth round, they’re pushing to solve even the most challenging, Olympiad-level math problems, and if needed, they increase the MCTS rollouts to explore the solution space more fully. They really want to ensure they have enough data of very high quality.
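At a very high level, that round structure can be summarized as a control-flow skeleton like the one below. Everything here is a placeholder sketch: the Trajectory type, the callables, and the rollout counts are assumptions standing in for the real data-generation and training code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    steps: List[str]
    correct: bool   # whether the final answer was verified correct

def self_evolve(generate: Callable[[int], List[Trajectory]],
                train_policy: Callable[[List[Trajectory]], None],
                train_ppm: Callable[[List[Trajectory]], None],
                n_rounds: int = 4) -> None:
    """Each round regenerates data with the current models, keeps only
    trajectories whose final answer checks out, and retrains both the
    policy model and the process preference model on them."""
    for round_idx in range(1, n_rounds + 1):
        # Later rounds spend more search on hard problems (numbers are illustrative).
        n_rollouts = 16 if round_idx < 4 else 64
        trajectories = generate(n_rollouts)
        verified = [t for t in trajectories if t.correct]
        train_policy(verified)   # fine-tune on step-by-step verified trajectories
        train_ppm(verified)      # build preference pairs from Q-values and train

# Toy usage with no-op trainers, just to show the control flow.
self_evolve(
    generate=lambda n: [Trajectory(["step 1", "step 2"], correct=True)],
    train_policy=lambda data: None,
    train_ppm=lambda data: None,
)
```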
Host: Okay, this four-round self-evolution is really where the 'self-evolved' aspect comes to the fore. It's not just about training on a static dataset; it's about the model continuously improving its training process, like a feedback loop. This self-evolution is key to achieving those state-of-the-art results they showed in the paper.
Guest: Precisely! This progressive approach allows them to cover more and more of the mathematical problem space. In their experiments, they start with 747,000 math problems from public sources. By the fourth round, they have a policy model and a PPM that can handle almost 90% of these math problems, including very tough ones. So, they’ve proven this method is incredibly effective.
Host: Right, that’s quite impressive. It's like the models are becoming increasingly competent mathematical thinkers, not just data processors. Now, after these four rounds of self-evolution, what kind of results do they end up seeing? It seems like they evaluate their approach on a wide range of math benchmarks?
Guest: They sure do. Their evaluation is quite comprehensive, encompassing various mathematical benchmarks, including standard datasets like GSM8K, challenging benchmarks like MATH, AIME 2024, and even out-of-domain data like the Chinese College Entrance Math Exam. They're not just looking at one kind of problem, they want to see if it really does generalize across many different benchmarks.
Host: And they evaluate their approach with smaller models ranging from 1.5 to 7 billion parameters. It's like they're not relying on the scale of the models, which, again, was one of their main points, right? It’s about showing that the method works across different model sizes. How do the smaller models compare to the bigger ones that we often see?
Guest: The results are quite striking. They show that models like Qwen2.5-Math-7B, which is a relatively small model, can achieve performance on par with or even surpassing OpenAI’s o1-preview. For example, this smaller model saw a huge boost in accuracy on the MATH benchmark, going from 58.8% to 90.0%. They also exceed the performance of their base models and the best-of-N baselines on most benchmarks. It's a major achievement and validates their approach.
Host: That's incredible! To go from 58.8% to 90.0% on a challenging benchmark like MATH is a massive leap. And they also do really well on the Olympiad-level AIME, where they are solving an average of 8 out of 15 problems, which puts them in the top 20% of the brightest high school math students! These are not trivial problems; these are the cream of the crop, right?
Guest: Absolutely! These are incredibly challenging problems, requiring not just mathematical skills but also problem-solving and strategic thinking. They’re on the level of complex mathematical competitions. The fact that a relatively small model trained using their approach can achieve that level of performance, comparable to some of the most sophisticated models out there, is quite remarkable.
Host: And they are also comparing it to other methods like GPT-distilled datasets and also the best-of-N approaches. It is important because they’re not just showing how they beat other small models but also other established methods. How do they compare against those other baselines?
Guest: They do a series of thorough comparisons. In terms of data, their step-by-step verified reasoning trajectories are far more effective than existing methods that use data distilled from GPT-4 or randomly sampled from themselves. They also show that their method consistently improves the reasoning accuracy of base models. And compared to the Best-of-N baselines, even when those baselines used much larger reward models, their approach comes out ahead. It demonstrates that it's not all about having a gigantic model; it's about having a better approach and method for thinking.
Host: And they also show results with increasing the test time compute, right? It’s like they are giving the model more time to think and explore more trajectories during evaluation. So, how did those results change as they increased test time computation?
Guest: That’s an important point. By increasing the test-time computation, their results show that the models get even better. The more time the models had to explore the solution space, the higher their accuracy went across all benchmarks. This further underscores the effectiveness of their MCTS-driven search method and demonstrates that scaling computation during testing can significantly enhance the capabilities of the model. For the most challenging benchmarks such as MATH and AIME, accuracy saturates at about 64 trajectories, while on the College Math benchmark, accuracy is still improving at that point.
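To show mechanically what "more trajectories at test time" means, here is a small sketch: sample N scored trajectories, aggregate their final answers by PPM score, and check whether the selected answer gets more reliable as N grows. The score-weighted vote and the fake search function are simplifications invented for this example, not the paper's exact selection rule.

```python
from collections import Counter
from typing import Callable, List, Tuple

def select_answer(trajectories: List[Tuple[str, float]]) -> str:
    """Given (final_answer, ppm_score) pairs from N trajectories, pick the
    answer whose trajectories accumulate the highest total score."""
    totals = Counter()
    for answer, score in trajectories:
        totals[answer] += score
    return totals.most_common(1)[0][0]

def accuracy_vs_compute(run_search: Callable[[int], List[Tuple[str, float]]],
                        gold: str, budgets=(1, 4, 16, 64)) -> dict:
    """Check whether the selected answer matches the gold answer as the
    trajectory budget grows (the test-time scaling curve, in miniature)."""
    return {n: int(select_answer(run_search(n)) == gold) for n in budgets}

# Toy usage with a fake search that returns scored candidate answers.
fake_search = lambda n: ([("42", 0.9)] * max(1, n // 2)
                         + [("41", 0.2)] * (n - max(1, n // 2)))
print(accuracy_vs_compute(fake_search, gold="42"))  # {1: 1, 4: 1, 16: 1, 64: 1}
```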
Host: Okay, so we've covered the core ideas of the paper, the methodology, and the main results. They really do demonstrate how smaller models, through this self-evolving system, can perform some impressive math reasoning. What other insights or interesting points does the paper raise?
Guest: The paper also includes a fascinating ablation study that tests the effectiveness of their innovations separately. This shows that the key to their success is not just the self-evolution or the MCTS, but a combination of all their innovations: the step-by-step verified reasoning trajectory, the code augmented generation, and of course, the process preference model. All three of these elements play a critical role.
Host: And it’s not just about the numbers. They have some interesting qualitative observations too, right? Like the model's ability to self-correct and its preference for theorem applications. Can you tell me more about that?
Guest: That’s right. One of the most intriguing findings is the emergence of what they call ‘intrinsic self-reflection’ during problem-solving. The model has demonstrated the ability to recognize when it has made a mistake and actually backtrack to explore different solutions to reach the right answer. And this isn't some specific training that they did; this ability just spontaneously appeared.
Host: Wow, so it’s not like it was trained to do this self-correction. It's just a byproduct of this iterative self-evolution approach. That’s incredibly interesting. It's like it develops a certain level of metacognitive ability by itself. And what about the preference for theorem applications? That seems important too.
Guest: Yes, this is another key observation. The process preference model has a preference for steps that apply a relevant theorem or key concept. When faced with a challenging math problem, identifying and using the right theorem is essential. The model seems to naturally recognize the significance of these theorem-application steps and rewards them accordingly.
Host: So, the model isn't just generating random steps. It’s actually guided towards logical deductions using the appropriate theorems. This really highlights the ‘deep thinking’ aspect of the model. And it is really impressive for a model of this size. That’s almost like it's showing signs of actual mathematical understanding and not just pattern recognition.