s1: Simple test-time scaling
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
Discussion
Host: Hey everyone, welcome back to the podcast! I'm Leo, and I'm super excited about today's episode. We're diving into some really cool stuff that's been popping up in the AI research world, something that I think has the potential to change how we approach machine learning.
Guest: Hey Leo, thanks for having me! I'm always thrilled to chat about the latest in AI, especially when it gets into the more intricate details. It's a field that's moving so fast; it feels like there's always something new and mind-bending to unpack.
Host: Exactly! And today’s topic is definitely that. So, we’ve been seeing a lot of buzz about test-time scaling, this idea that you can improve a language model’s performance just by throwing more compute at it when it's actually doing its thing, not just during training. It's almost like giving a student extra time to double-check their work on an exam. Have you noticed how much attention this has been getting lately?
Guest: Absolutely, Leo. It’s fascinating because traditionally, the focus has been so heavily on train-time compute – how much processing power you use while creating the model. Now, this new paradigm suggests that there’s significant potential in how we use compute when the model is actively being used. It’s like the difference between intensive training for a marathon and having extra reserves on race day. The core shift is from building the smartest model through sheer scale during training to letting a model reach its peak performance on any given task by scaling the compute available during inference, right?
Host: Yeah, and it's kind of a mind-bender because it’s not just about having a bigger model; it’s about how a model can adaptively use more or less compute based on the task at hand. That's the bit that intrigues me the most. We're talking about active, adaptive models, not just static ones. It’s like, traditionally, we've been focused on building a really powerful engine, but now, it's also about how to control the gears of that engine in real-time to get peak performance, making it a much more flexible and efficient system. How do you see this compared to what’s been done previously?
Guest: That's a great analogy, Leo. It's less about the raw horsepower and more about the efficiency and strategic application of that power. Previously, we’d see most advancements come from just scaling up training datasets, throwing huge amounts of text at enormous models. That method is still incredibly powerful, but it has its limits, especially in terms of the resources required. Huge models also run into diminishing returns, where each additional parameter buys a smaller performance gain. Test-time scaling suggests an alternative approach, a way to extract more out of a model that’s already trained, using real-time computational adjustments rather than just building a bigger model. Think of it like refining a skill through practice, rather than learning a new one. We can actually squeeze out more performance without the extensive retraining that’s typically needed.
Host: Okay, that makes a lot of sense. It's like we're not just building bigger muscles; we're learning how to use them more effectively. And I guess that leads us nicely into this paper we're going to discuss today, which was all about simplifying test-time scaling, right? I mean, they're calling it the simplest approach, and I’m definitely curious to see what they came up with, because usually things that sound simple in tech aren’t necessarily simple to come up with!
Guest: Exactly, Leo! The paper, titled 's1: Simple test-time scaling,' does exactly that. What’s really interesting is how it challenges the narrative that achieving high-level performance and test-time scaling requires incredibly complex techniques, like large-scale reinforcement learning with millions of data samples. Remember how OpenAI demonstrated the capability with o1 but never shared the methodology? Most replication efforts assumed it would take massive data and complex training pipelines. This team, in contrast, is like, ‘Hold up, let's see if we can get similar results with a more straightforward process.’ And that’s the exciting part. It's about questioning the status quo and looking for simpler, more accessible pathways to complex outcomes.
Host: Yeah, that’s exactly what got me hooked on this paper. It's so refreshing to see researchers asking, 'Can we do this without needing a supercomputer?' So, what exactly is this ‘simple approach’ they're proposing? I mean, the paper is saying it's all about using a small, curated dataset, right? It's almost a counter-intuitive idea when so many advancements have been about scale.
Guest: Precisely, Leo! The core idea is to first curate a small, but incredibly potent dataset – they call it 's1K', which is just 1,000 questions that are specifically designed to push the limits of reasoning ability in these language models. It’s all about selecting the right data, not just having a ton of it. They focused on three key criteria: difficulty, diversity, and quality. It's not just a grab-bag of questions; they've really thought about what makes a problem complex. These questions required a fair bit of reasoning to solve, and they also covered a range of different domains, making sure the models aren't just good at one particular type of task. And, on top of that, they had to be really high quality – no poorly formatted or ambiguous questions that could lead to confusion. This isn't just a matter of picking random problems from the internet. It's a very deliberate process of selecting problems that will really push and improve these models.
Host: That makes a huge difference, doesn't it? It's not enough to just have a big dataset; you need one that's actually challenging and diverse enough to improve a model. So, it's not just about the quantity but the quality and the specific kind of thinking it requires, right? Because, as you said, the goal here isn’t just to make the model memorize information but to improve its actual reasoning capabilities. It’s almost like creating a 'thinking bootcamp' instead of a general knowledge quiz.
Guest: Exactly, Leo! It's a laser-focused approach to enhancing reasoning. And what they did was build this s1K dataset by gathering over 59,000 problems from 16 different sources and then filtering down to the 1,000 most challenging. It’s like panning for gold: they started with a large pool and then carefully extracted what they considered the most valuable pieces. The sources include math problems from online websites, historical AIME problems, and olympiad-level questions covering areas from astronomy to physics and computer science. They even included some original problems of their own, drawn from the probability section of a PhD qualifying exam, plus some very hard brain teasers that are common in quantitative trading interviews. They wanted a really robust, diverse set of problems that aren’t easy to solve with a trivial approach. It’s that curated approach that seems to give the dataset the edge.
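To make that filtering process a little more concrete, here is a rough sketch of how a quality, difficulty, and diversity filter could be written. The field names, thresholds, and selection heuristic below are illustrative assumptions, not the authors' released curation pipeline.

```python
# A rough sketch of the quality -> difficulty -> diversity filtering idea behind s1K.
# Field names, thresholds, and the selection heuristic are illustrative assumptions,
# not the released curation pipeline.
import random
from collections import defaultdict

def curate_subset(pool, target_size=1000):
    # 1) Quality: drop malformed or ambiguous items.
    clean = [q for q in pool if q["well_formatted"] and q["has_reference_answer"]]

    # 2) Difficulty: keep problems that baseline models fail and that need long reasoning traces.
    hard = [q for q in clean
            if not q["solved_by_baseline"] and q["reasoning_trace_tokens"] > 1000]

    # 3) Diversity: spread the final picks across domains instead of exhausting one topic.
    by_domain = defaultdict(list)
    for q in hard:
        by_domain[q["domain"]].append(q)
    for qs in by_domain.values():
        qs.sort(key=lambda q: q["reasoning_trace_tokens"])  # hardest-looking traces last

    selected = []
    while len(selected) < target_size and by_domain:
        domain = random.choice(list(by_domain))
        selected.append(by_domain[domain].pop())  # take the longest remaining trace in that domain
        if not by_domain[domain]:
            del by_domain[domain]
    return selected
```

The ordering matters: quality and difficulty shrink the pool, and the diversity step then spreads the final 1,000 picks across domains rather than letting one topic dominate.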
Host: Okay, so they've got this elite dataset, s1K. What's the other key piece to their 'simplest approach'? Because, I know, it's not just about the data, right? How do they make use of the dataset to actually achieve test-time scaling?
Guest: That's where 'budget forcing' comes in, Leo. It's a very clever technique that lets them control the amount of compute the model uses at test time. They basically intervene while the model is generating its answer. If the model is taking too long—generating too many tokens—they force it to stop and give an answer. This is like saying, ‘Okay, your time’s up, what’s the best answer you have right now?’ On the flip side, if they want the model to spend more time thinking, they'll inject the word 'Wait' into the model's text generation, which effectively lengthens the reasoning process, making it reflect on the problem some more. It’s like adding an additional ‘thinking step’. This method is far simpler than techniques like Monte Carlo Tree Search or other forms of reinforcement learning which are computationally expensive to implement.
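To make the mechanism concrete, here is a minimal sketch of what a budget-forcing loop could look like. The model interface (`model.generate`), the end-of-thinking delimiter, and the token counter are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch of budget forcing: cap the thinking phase at a token budget, and
# append "Wait" when the model tries to stop early so it keeps reasoning.

END_THINK = "<|end_think|>"  # hypothetical delimiter marking the end of the thinking phase

def count_tokens(text):
    # Crude whitespace proxy; a real run would use the model's tokenizer.
    return len(text.split())

def budget_forced_answer(model, question, max_thinking_tokens=4096, max_wait_injections=2):
    """Generate an answer while forcing the thinking phase into a token budget."""
    prompt = question + "\nThink step by step.\n"
    thinking = ""
    waits_used = 0

    while True:
        remaining = max_thinking_tokens - count_tokens(thinking)
        # Generate more thinking, stopping at the end-of-thinking marker or when the budget runs out.
        chunk = model.generate(prompt + thinking, stop=[END_THINK], max_tokens=remaining)
        thinking += chunk

        if count_tokens(thinking) >= max_thinking_tokens:
            break  # budget exhausted: forcefully end the thinking phase
        if waits_used < max_wait_injections:
            # The model tried to stop early: suppress the end marker and nudge it to keep thinking.
            thinking += " Wait"
            waits_used += 1
            continue
        break  # model finished and there are no more "Wait" nudges to spend

    # Append the end-of-thinking marker and ask for the final answer.
    answer = model.generate(prompt + thinking + END_THINK + "\nFinal Answer:", max_tokens=256)
    return thinking, answer
```

Varying `max_thinking_tokens` or the number of "Wait" injections is what traces out the test-time scaling curve the paper reports.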
Host: Wow, that's actually really straightforward, and it almost seems too easy, right? It's like they're just putting these simple levers in place to control the model’s 'thinking' duration. So, they're either hitting the 'stop' button early, or giving the model a gentle nudge to think more by adding the word 'wait'. It reminds me of how we sometimes do things – force ourselves to make a decision or consciously slow down to review our thinking. What makes this budget forcing so effective?
Guest: It is surprisingly simple, but it’s the simplicity that makes it so elegant. The effectiveness lies in the fact that it controls the model's processing time directly. What they've found is that models sometimes 'self-correct' when given more time. Sometimes they might start down an incorrect path, but with enough compute time, they’ll back-track, re-assess the problem, and come to the right answer. It's like letting a student show their work in detail, allowing them to catch their own mistakes. Conversely, setting a hard limit prevents endless loops and makes the model transition to an answer. By varying the 'compute budget,' they can make the model think more deeply when needed, or be more efficient when not required. This also makes the model more compute-efficient overall, because it isn’t spending an excessive amount of compute on a simple task.
Host: Okay, so the budget forcing acts as a dynamic compute dial. The model uses that time to check its own work or try new approaches, not just re-hashing old information. This approach sounds surprisingly versatile. It's not just about being faster or slower, but making the model better through compute. Now, how did their approach actually perform? Because this all sounds really good in theory, but the proof is always in the pudding.
Guest: That's right, Leo. They tested their model, which they call 's1-32B,' on a range of widely used reasoning benchmarks, including AIME24, which consists of challenging math problems from the American Invitational Mathematics Examination, along with the MATH500 and GPQA Diamond benchmarks. What they discovered is pretty remarkable. Their model outperformed OpenAI’s o1-preview model on these highly complex math problems, exceeding its performance by up to 27%. And what’s most astonishing is that they got those results using just this 1,000-sample dataset and the budget forcing technique. They also found that by scaling test-time compute with budget forcing, their model could perform even better without any additional training. They showed that just by using 'Wait' to extend the thinking, the model's AIME24 score improved from 50% to 57% correct.
Host: That's incredible! So, by just strategically controlling how long their model thinks at test-time, they're able to surpass the performance of closed-source models that, according to reports, use a lot more complex methods and data. I mean, that kind of improvement is not just a small bump; it’s a significant leap in efficiency. And that really underscores their claim, doesn’t it? That the approach is indeed, simpler. It’s like finding a secret shortcut that bypasses the conventional, heavy-lifting methods.
Guest: It absolutely does, Leo! The contrast with models like the DeepSeek R1 series, which also tries to replicate the performance of OpenAI’s models through reinforcement learning on millions of samples, really highlights the value of simplicity. The DeepSeek R1 series did achieve high performance, but it needed hundreds of times more data and compute to train. The fact that this team got into the same league, and beat o1-preview on competition math, using a thousand carefully curated examples is just remarkable. This really shows how crucial data selection can be, and that we should be focusing on quality rather than quantity when it comes to datasets.
Host: Absolutely. And I think it’s a very compelling point. Because one of the things that I find quite daunting is the compute and data requirements of training large models. I can see the appeal of a more streamlined approach. This idea of test-time scaling becomes a lot more accessible and feasible when you aren't talking about massive datasets and complex training paradigms. So, aside from the performance benchmarks, did the authors also perform any experiments to further test their findings, and see what makes their approach so special?
Guest: Yes, they did, and their ablation studies are really insightful, Leo. They started by testing their dataset’s composition. They trained on a randomly selected set of 1,000 problems, on just the most diverse problems, and also on those that had the longest reasoning traces, which is a measure of difficulty they used. Then, they also tried training on the whole 59K dataset to see what the impact of their filtered approach was. And what they found was that combining the quality, difficulty, and diversity criteria in their dataset creation made a significant difference. The randomly selected samples and those based solely on diversity or difficulty each performed much worse. This showed they weren't just getting lucky with the s1K dataset. And even training on the full 59K pool showed no significant gain over training with the filtered 1K dataset, further proving that a smaller, curated dataset is superior to a larger but less focused one. I’d say, that’s a solid argument for ‘quality over quantity’ if I ever heard one!
Host: That really drives home the point that thoughtful data curation is critical, doesn't it? It's not just about having a lot of data, but about having the right data. It’s like how a good personal trainer can be much more effective than just having a really long workout. You need the right exercises, the right intensity and the right focus to see real results. And I guess this is applicable to so many areas of life beyond AI. Did they also perform any tests related to test-time scaling, given that that’s what the entire paper is about?
Guest: Yeah, absolutely. They didn't just stop at the dataset; they also explored the best ways to scale the model’s computation at test time. They compared their budget forcing technique against several baselines, including methods that try to control generation length by explicitly telling the model to generate for a certain number of tokens or steps, or by using generic ‘short’ or ‘long’ prompts. They also tried a method called rejection sampling, where they would keep sampling responses until one fit a predetermined compute budget. Overall, they found that budget forcing provides the most controllable, scalable, and ultimately the best performance. The other methods ran into problems: the model couldn’t reliably hit a requested token or step count, or it learned to ‘game the system’ and not follow the prompt precisely. Some methods even scaled inversely, with more compute leading to poorer performance. Budget forcing just showed the best consistency and the best overall performance.
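For contrast with budget forcing, the rejection-sampling baseline mentioned here could look roughly like the sketch below; the model interface and token counter are assumed helpers, not the paper's code.

```python
# A minimal sketch of the rejection-sampling baseline: resample until a full generation
# fits the target token budget. `model.generate` is an assumed interface.

def count_tokens(text):
    # Crude whitespace proxy; a real run would use the model's tokenizer.
    return len(text.split())

def rejection_sample_to_budget(model, question, token_budget, max_attempts=64):
    """Sample completions until one fits the compute budget; fall back to the last sample otherwise."""
    last = None
    for _ in range(max_attempts):
        response = model.generate(question, temperature=1.0)
        if count_tokens(response) <= token_budget:
            return response  # accepted: this generation fits the budget
        last = response      # rejected: too long, sample again
    return last
```

Note that this only filters on length after the fact; it never intervenes in the reasoning itself, which is one plausible reason it scaled inversely in the paper's experiments.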
Host: That's fascinating. It's interesting how just trying to tell a model to ‘think more’ doesn't necessarily work without this budget forcing mechanism. And rejection sampling failing – that’s almost counterintuitive. Because you'd think just re-sampling until it fits a budget would be efficient. What do you think was the reason for that?
Guest: Well, the authors hypothesized that this could be because the shorter generations might be the ones where the model was on the right track from the get-go, whereas the longer generations might be those where the model made a mistake and had to backtrack or correct itself, which in turn led to longer generations that are ultimately wrong rather than better. So, the sampling method didn’t really select for models that were improving their reasoning; it just caught the ones that were rambling, for lack of a better word. That shows how important it is to not just rely on the metric of computation time, but to look at the quality of the thinking process itself. It also shows that sometimes the simplest techniques work the best.