Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of (input=[query; noisy response], output=critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks across different base models such as Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. We further expand to the MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our Qwen2.5-Math-CFT model, trained on just 50K samples, matches or outperforms competitive models such as AceMath and Qwen2.5-Math-Instruct on most benchmarks, even though both use over 2M samples. Ablation studies show that CFT is robust to the source of the noisy responses and to the choice of teacher critique model. Through these findings, we argue that critique-based training offers a more effective alternative for advancing the reasoning of language models.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today we've got something pretty interesting to dive into. We're going to be talking about a new approach to training language models, something that might just shake up the way we think about fine-tuning these powerful tools. It's always exciting to explore these cutting-edge ideas, so I'm really happy you're here to listen.
Guest: Yeah, Leo, I'm excited about this too. It seems like just yesterday we were all amazed by the raw capabilities of these large models, and now we're already figuring out how to make them even better. It’s definitely a fast-moving field and it makes you wonder what tomorrow will bring.
Host: Absolutely! And today’s topic is all about that 'making better' part. We’re going to be talking about a research paper that challenges the standard method of supervised fine-tuning, or SFT as it's commonly called. They've come up with something they’re calling Critique Fine-Tuning, or CFT. Basically, instead of just training models to imitate correct answers, they’re training them to critique wrong ones. It's like teaching a student not just by giving them the answers, but by having them analyze where they went wrong when they miss a question. I found that very interesting.
Guest: That's a really interesting analogy, Leo. It immediately makes intuitive sense. When we learn, understanding why something is incorrect is often far more valuable than simply seeing the right answer. It forces you to engage with the material on a deeper level. So, the idea of applying that to language models sounds promising and definitely worth looking into. I'm eager to see how it actually works and what kind of results they've achieved.
Host: Exactly! And the paper really emphasizes this point about learning through critique. They're not just blindly following the typical SFT approach. They make a compelling argument that simply having models imitate responses, even if they're correct, can actually lead to diminishing returns, especially when you're working with already powerful base models. So it's like, they're saying, 'hey, we've got these really smart models, let's teach them to think critically instead of just copying.'
Guest: That's a crucial point, and it really highlights where the field is headed. I think we've all experienced those moments when we're trying to learn something new by just repeating it, only to hit a wall because we don't understand the underlying principles. It's the same with these models, right? If they don't truly understand the nuances of a task, their performance improvements will plateau. It sounds like CFT is an attempt to break through that barrier by encouraging a more analytical form of learning. It's moving past rote memorization to true comprehension, which is exciting.
Host: Precisely! So, let's start by talking about how they actually set this up. First, they had to build a dataset for this kind of training, and it's a little different from your standard SFT dataset. They used a dataset called WebInstruct, which is basically a collection of instructions and responses from the internet. But here's the catch: they used GPT-4o to act as a 'teacher', and instead of giving correct answers, GPT-4o generated critiques of the noisy responses, that is, responses that are prone to errors. So, the dataset is full of input-output pairs where the input is a question plus a noisy answer, and the output is a detailed critique from GPT-4o. It's quite an ingenious setup, really. The model isn't learning to give the answer, but rather to explain why the provided response is wrong.
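To make that setup concrete, here is a minimal sketch of how such critique data could be generated, assuming the OpenAI Python client. The prompt wording, function name, and record schema are illustrative assumptions, not the paper's exact pipeline; only the (input=[query; noisy response], output=critique) shape comes from the paper.

```python
# Sketch of critique-data generation with GPT-4o as the teacher.
# Prompt text and field names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITIQUE_PROMPT = (
    "You are a teacher. Given a question and a student's solution, write a "
    "detailed critique: point out each error, explain why it is wrong, and "
    "conclude whether the solution is correct or incorrect."
)

def make_cft_example(query: str, noisy_response: str) -> dict:
    """Turn one (query, noisy response) pair into a CFT training example."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CRITIQUE_PROMPT},
            {"role": "user",
             "content": f"Question:\n{query}\n\nSolution:\n{noisy_response}"},
        ],
    )
    critique = completion.choices[0].message.content
    # Input is the concatenated [query; noisy response]; output is the critique.
    return {"input": f"{query}\n{noisy_response}", "output": critique}
```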
Guest: That's a clever way to generate training data, actually. Using GPT-4o to generate critiques is a good use of its analytical abilities. It's like having a really experienced tutor looking at a student's work and giving specific feedback, not just a simple 'yes' or 'no'. I imagine the quality of the critique is really important here, though. It makes me wonder how good these critiques actually are, and how their quality affects the learning outcome.
Host: That's definitely a concern, and it's something the paper addresses as well; we'll get to that a bit later. But first, regarding the data itself, they created a few different subsets for their experiments. There's WebInstruct-SFT, which is just a subset of the original dataset and has a very high error ratio. There's WebInstruct-verified, where they had GPT-4o check the answers and retained only the most accurate ones. Then there's WebInstruct-GPT-4o, which keeps the questions but replaces the answers with ones generated by GPT-4o directly. But the really important one here is WebInstruct-CFT, the one with the critiques from GPT-4o. They also have a smaller version called WebInstruct-CFT-Tiny for testing with a larger model on a smaller budget. I like how they considered all these scenarios and tested them out. It's not a simple comparison between two datasets, but rather several datasets with different characteristics, which adds to the rigor of the research.
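For readers who want to picture the difference between these subsets, here is one illustrative record per variant. The questions, answers, and field names are invented for exposition; what matters is which role each variant assigns to input and output.

```python
# Illustrative records for each WebInstruct variant (contents are assumptions).
sft_example = {        # WebInstruct-SFT: imitate the raw, possibly wrong response
    "input": "What is 15% of 80?",
    "output": "15% of 80 is 10.",  # noisy, unverified answer
}
verified_example = {   # WebInstruct-verified: only answers GPT-4o judged correct
    "input": "What is 15% of 80?",
    "output": "15% of 80 is 12.",
}
gpt4o_example = {      # WebInstruct-GPT-4o: same question, answer rewritten by GPT-4o
    "input": "What is 15% of 80?",
    "output": "0.15 * 80 = 12, so the answer is 12.",
}
cft_example = {        # WebInstruct-CFT: critique the noisy response instead
    "input": "What is 15% of 80?\nProposed solution: 15% of 80 is 10.",
    "output": "Incorrect. 15% of 80 is 0.15 * 80 = 12, not 10; the solution "
              "appears to have divided 80 by 8 instead of multiplying by 0.15.",
}
```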
Guest: That sounds like a very thorough approach to data preparation, Leo. Having different subsets allows them to really isolate the impact of the critique-based training. Comparing CFT against SFT on verified responses and on responses generated directly by GPT-4o lets them pinpoint whether CFT genuinely outperforms the established SFT methods. It's clear they're trying to eliminate as many confounding factors as possible, which is great for evaluating the actual merit of this new approach.
Host: Exactly! And they also made sure to compare the size and scope of their dataset with others out there. What's interesting is that their CFT datasets, despite being relatively small compared to a lot of the large instruction datasets, cover a wide range of STEM topics. They're not just focused on math problems; they're also looking at things like physics, chemistry, business, the humanities, and so on. This highlights one of their points, that CFT is very data efficient. It suggests that they're not just trying to get better results in a narrow area, but to build a model that can reason more generally through this method. That's a huge advantage over current training approaches. Less data, more effective learning, which is the future we want to move toward.
Guest: That broader coverage definitely makes the results more applicable, and it speaks to the underlying principle that critical thinking skills are not domain-specific. If this CFT approach can help models learn to analyze and critique across various subjects, it suggests that the model is actually learning how to learn, not just memorizing specific responses. In addition, as you mentioned earlier, the efficiency is remarkable. Training on 50,000 samples and achieving comparable or even better results than models trained on millions of examples is a massive leap forward. Less data, lower computational costs, better models. It's the holy grail of machine learning!
Host: Okay, so that's the data, but how do they actually train the models using this data? Well, the training objective is relatively simple. They give the model a question and a noisy response as input, and then they train it to generate the critique that was provided by GPT-4o. So, they're essentially optimizing the model to predict critiques based on the query-response pairs. It's all about maximizing the likelihood of the model producing the correct critique, given the input. In mathematical terms, they're maximizing P(c | [x; y]) over the model's parameters, where c is the critique and [x; y] is the concatenated query-response pair. It's a very targeted method of training.
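In practice, that objective reduces to standard next-token cross-entropy computed only over the critique tokens. Here is a minimal sketch using Hugging Face Transformers; the model name is a placeholder, the prompt formatting is an assumption, and this is meant to illustrate the loss masking, not to reproduce the paper's exact training recipe.

```python
# Sketch of the CFT objective: cross-entropy over critique tokens only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B")

def cft_loss(query: str, noisy_response: str, critique: str) -> torch.Tensor:
    # [x; y] -- the query concatenated with the noisy response -- is the prompt.
    prompt = f"{query}\n{noisy_response}\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    critique_ids = tokenizer(
        critique + tokenizer.eos_token, return_tensors="pt"
    ).input_ids

    input_ids = torch.cat([prompt_ids, critique_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss

    # Minimizing this loss maximizes log P(c | [x; y]).
    return model(input_ids=input_ids, labels=labels).loss
```

The -100 label is the conventional "ignore" index, so gradients flow only through the critique tokens; the model conditions on the query and noisy response but is never trained to reproduce them.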
Guest: That's a very clear and straightforward objective, and the math makes sense. By focusing the training process on generating these critiques, the models aren’t simply memorizing answers but rather, learning the analytical process behind why certain answers are wrong. This allows them to not only critique but also understand what makes a good or bad answer in the first place. This understanding is essential for genuine problem-solving and critical thinking, which is something most current language models simply do not have.
Host: Yeah, and now comes the fun part, the actual experiments and the results! They did a lot of testing and compared CFT with SFT using different base models. They experimented with three 7B-scale base models: DeepSeek-Math-7B, Qwen2.5-7B, and Qwen2.5-Math-7B. They evaluated these models across various mathematical reasoning benchmarks. That includes standard math benchmarks like MATH, Minerva-Math, and GSM8K, but also more challenging competition-level benchmarks like AIME24, AMC23, and OlympiadBench. They even went beyond math and looked at broader STEM benchmarks like GPQA, TheoremQA, and MMLU-Pro, covering topics such as physics, chemistry, and mathematics, to make sure the gains are robust. It's a really well-rounded evaluation.
Guest: That’s a really comprehensive set of benchmarks. It’s essential to not just look at standard math datasets, but also more challenging, competition-level problems and broader STEM topics, because that’s the true test of a model’s general reasoning capabilities. The fact that they tested across that whole range is really great and it provides a holistic view of the effectiveness of CFT.
Host: And the results, well, they were pretty striking. Across all the base models, CFT consistently outperformed SFT. I mean, on average, the CFT-trained models performed 4-10% better than the best SFT-trained models on the math benchmarks. For instance, on the DeepSeek-Math-7B, CFT had a 3.5% absolute improvement over the best SFT version. On Qwen2.5-7B, it was even more impressive, with a 10.4% improvement. And on Qwen2.5-Math-7B, it beat SFT by 5.7%. It just goes to show that this critique-based training really does bring substantial performance gains. It's not just minor tweaks but significant steps forward.
Guest: Those are really impressive numbers, Leo! A 4-10% improvement across various benchmarks is a massive jump in this field; it's almost a paradigm shift. It really backs up the idea that the model benefits significantly from learning how to critique and analyze solutions, rather than just imitating them. And to see these gains across three different base models shows that the effectiveness of CFT isn't tied to a specific model architecture or the particular characteristics of one model. It's a learning method that can be broadly applied, which is quite powerful.
Host: Absolutely. And they didn’t stop there, they wanted to delve into the training dynamics, to understand how these models learned through CFT. They plotted graphs that tracked performance improvements on different benchmarks over training time. What they observed is that, on benchmarks like MATH and Minerva-Math, the CFT models not only converged faster, but they also reached higher levels of performance compared to the SFT variants. It was like, the CFT models understood the material more deeply and got to the answer quicker and more accurately. It's quite interesting to see that unfold.
Guest: Those training dynamic comparisons provide a lot of insights, Leo. Faster convergence means that the model is learning more efficiently, requiring less training time and resources. And when they reach higher performance, it shows that CFT isn't just a faster approach, it's also learning more effectively than the SFT methods. It reinforces the idea that critique-based training isn't just about better results; it’s also about the learning process. The model is internalizing the information and developing a strong understanding.
Host: And it wasn't just about comparing against SFT methods. They also put their CFT-trained models up against some of the top-performing open-source reasoning models, including various specialized math models. The results here were also surprising. Their Qwen2.5-Math-7B-CFT model, trained on just 50,000 samples, achieved better average performance than many models trained on millions of samples, models like DeepSeek-Math-7B-Instruct, Mathstral-7B, and NuminaMath-7B-CoT. It even performed better than much larger models like Llama-3.1-70B-Instruct and NuminaMath-72B-CoT, with a fraction of the training data and parameters. It's just another demonstration of how data efficient this method is.
Guest: Those results are simply astonishing, Leo! To outperform models that are significantly larger and trained on much larger datasets with just 50,000 samples is a testament to the power of CFT. It really underscores the fact that the quality of learning trumps the quantity of data in this case. It shows that by teaching models to critique, they're not just learning to perform specific tasks; they're learning general reasoning capabilities that apply across various tasks, which lets them achieve these really impressive results with significantly fewer resources and less effort. This approach is much more efficient.
Host: And they didn't stop at 7B parameter models. They also tested CFT on 32B models, specifically a Qwen2.5-32B model. And again, the results were impressive. The Qwen2.5-32B-Instruct-CFT model, trained on just 4,000 samples, outperformed a similar model, Sky-T1-32B-Preview, which was trained on 17,000 samples. This once again highlights the data efficiency. On some benchmarks, the performance gains were quite substantial, like the 10% improvement on the AMC23 benchmark. The general principle remains consistent: CFT improves the efficiency and effectiveness of model training, no matter the size.
Guest: The 32B model experiments just confirm what we've been discussing, that it isn’t just a fluke happening in the 7B scale models. CFT isn’t just efficient, it’s consistent, no matter the scale of the models. It's a solid way to improve models. The 10% jump in AMC23 performance especially is very impressive. It's as though the model is actually understanding the questions and answering better because of the critiques, rather than just relying on memorizing the data it has been trained on.
Host: Now, to really get to the bottom of things, the researchers conducted a bunch of ablation studies. Basically, they wanted to understand which factors had the most impact on the success of CFT. One thing they looked at was how the dataset source affected the results. They trained on three different datasets: MetaMathQA, NuminaMath, and WebInstruct. What they saw was that, while SFT performed better when trained on the more curated MetaMathQA and NuminaMath data, with CFT it was surprisingly WebInstruct that performed the best. It suggests that the way the model is trained matters more for the final results than the polish of the underlying dataset.