SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into some pretty fascinating research that's been making waves in the AI community. We’re going to be discussing a paper that looks at how we train large language models and vision models, specifically focusing on two popular techniques: supervised fine-tuning, or SFT, and reinforcement learning, or RL. We'll be exploring how these methods affect the model's ability to generalize, which is really the key to creating truly intelligent AI. It's like, can these models actually understand and adapt to new situations, or are they just really good at memorizing data? It’s a big question, and I'm excited to get into it with you all today.
Host: So, let's set the stage a bit. This paper, titled 'SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training,' comes from a team of researchers including Tianzhe Chu, Yuexiang Zhai, and a bunch of other very smart people. They're tackling this really important question about how we post-train these giant AI models. Now, these models, like GPT or the vision models you see generating images, they’re all initially trained on massive datasets. But, to make them useful for specific tasks, we usually fine-tune them. That's where SFT and RL come into play. Essentially, it's like taking a student who has learned a lot of general knowledge and then training them to be really good at, say, math or art. But the question is, which of these training methods leads to better overall learning and adaptability?
Host: Okay, so the paper kicks off with this big idea: while both SFT and RL are used a lot to get foundation models ready for the real world, it's not super clear how each method affects generalization. Generalization, as I was saying, is a huge deal – it's about how well the model can apply what it's learned to new situations it hasn't seen before. It’s the difference between, say, a student who can solve a specific type of math problem versus a student who truly understands the concepts and can tackle all kinds of math challenges. They point out that one of the toughest things to figure out is whether a model is just memorizing the training data or if it's actually learning some transferable rules or principles.
Host: Right, and that’s such a crucial distinction, you know? Because we don't want AI that just spits back what it’s seen before; we need it to be adaptable and to actually understand. So, to really get to the heart of the matter, the researchers focus on two different kinds of generalization. First, they look at 'textual rule-based generalization,' which is all about whether a model can apply a set of rules, given as text, to variations of those rules. Think of it like learning the basic rules of grammar and then being able to understand all sorts of different sentences, even if they use slightly different phrasing. The second is 'visual generalization,' which checks how well vision-language models handle changes in visual input, like different colors or layouts. It's like if you train a model to recognize cats, can it still recognize cats if they’re a different color or in a different position?
Host: Exactly! To really put this to the test, they designed their own environment, a card game called 'GeneralPoints'. It's sort of like the game 24, where you have to use four numbers to reach a target number. But they made it way more interesting by changing how the cards are presented, both as text descriptions and as images. This really challenges both the text and visual processing abilities of the models. So, you could have a language model trying to solve this just from the text of the card values, and then a vision-language model having to both recognize the card values and then solve the equation. It's a clever way to test both abilities. And they also looked at a real-world navigation task called 'V-IRL,' which is basically about navigating in a realistic environment using visual and textual cues. It's like giving someone instructions on how to navigate around a city, and then seeing if they can actually follow the instructions and find their way.
Host: The whole setup sounds really well designed to test both text-based and vision-based reasoning. And as they mention in the paper, for the GeneralPoints game, the model gets four cards, and the goal is to create a mathematical equation using those card values to reach 24. But there's a twist: they use different ways of interpreting the face cards, like Jack, Queen, and King, which really tests the model’s ability to understand and apply the rules. And with V-IRL, it's this real navigation environment where the model needs to follow instructions to find a location, which introduces spatial reasoning into the mix. In short, one environment is a focused, rule-based setting and the other is a real-world situation, which gives the researchers solid ground to analyze how the models are actually learning, not just memorizing. This reminds me of how a lot of human learning works, going from a controlled environment to a real, messy scenario.
Host: Yeah, exactly. It’s not just about getting the correct answer, but how they get there, and how well they adapt to changes. The key here is that they're introducing variations to the rules and visual inputs. For example, in GeneralPoints, they might train the model with the rule that J, Q, and K each equal 10, but then test it with a rule where they represent 11, 12, and 13, respectively. This is what tests rule-based generalization. Then, for the visual variations, they might train the model to recognize black-suited cards and test it on red-suited ones. With V-IRL, there are different action spaces: an absolute one, where the actions are things like 'go north' and 'go east,' and a relative one, where they're 'turn left' and 'turn right.' So the model really has to adjust its understanding of the environment, not just reproduce what it has previously seen in training. It's these little twists that make this study so insightful. It really helps isolate whether the model is genuinely understanding things or just blindly following patterns.
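To make the rule-variant idea concrete, here is a minimal sketch of an outcome-based verifier for a GeneralPoints-style task. The FACE_RULES table, the verify() function, and the card format are assumptions chosen for illustration; they are not the paper's actual environment code.

```python
import re

# Two hypothetical face-card rules: the training rule maps J, Q, and K to 10,
# while a held-out variant maps them to 11, 12, and 13.
FACE_RULES = {
    "all_ten":  {"J": 10, "Q": 10, "K": 10, "A": 1},
    "11_12_13": {"J": 11, "Q": 12, "K": 13, "A": 1},
}

def card_value(card: str, rule: str) -> int:
    """Look up a card's numeric value under the chosen rule."""
    return int(card) if card.isdigit() else FACE_RULES[rule][card]

def verify(cards: list[str], equation: str, rule: str, target: int = 24) -> bool:
    """Outcome-based check: the equation must use exactly the four card values
    (under the chosen rule) and evaluate to the target number."""
    expected = sorted(card_value(c, rule) for c in cards)
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != expected:
        return False
    try:
        return abs(eval(equation) - target) < 1e-6  # toy check on trusted input only
    except (SyntaxError, ZeroDivisionError):
        return False

# An answer that is correct under the training rule (J = Q = 10) ...
print(verify(["2", "2", "J", "Q"], "2 + 2 + 10 + 10", rule="all_ten"))   # True
# ... no longer verifies under the held-out rule (J = 11, Q = 12).
print(verify(["2", "2", "J", "Q"], "2 + 2 + 10 + 10", rule="11_12_13"))  # False
```

The point of the sketch is that the reward only depends on the final outcome under the active rule, so a model that memorized "J means 10" fails as soon as the rule text changes.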
Host: Yeah, that whole memorization vs. generalization point is a huge one. And it really highlights why it’s so important to dig into these models and really understand what they’re learning. They also build on work about scaling up ‘inference-time compute,’ which is a fascinating concept: instead of just focusing on how big or powerful the model is, you look at how much computation you allocate when the model is actually being used. A lot of recent research suggests that giving the model more ‘thinking time’ during inference can actually improve its performance. It’s like letting a student have more time on a test, so they can really think through a problem instead of rushing to an answer. They mention earlier research finding that generating intermediate reasoning steps actually improves the model, essentially guiding it through the reasoning process so it can solve harder and more complicated tasks. So they’re taking a lot of insights from these earlier studies and applying them to their own multi-turn RL framework, which lets the model correct its own errors during the learning process. It seems to be a really effective way of increasing performance.
Host: Definitely, it's like giving the model the chance to refine its approach. They’ve also looked into improving the visual capabilities of vision-language models, because, while these models are super impressive in a lot of ways, they do have shortcomings in how they perceive visual data. They note some previous approaches to improving visual perception, like using multiple encoders, providing higher-quality training data, or changing how the visual encoder is trained. But what makes this paper unique is showing that RL is another way to improve visual perception. So it’s not just about providing better data; the training process itself can play a huge role in how well the model perceives the world. All of this prior work sets the stage for the paper's focus: the role RL plays in improving the model's visual perception, rather than just SFT, which is what a lot of other research has concentrated on. They’re building on the existing knowledge in the field.
Host: Okay, so that's a great overview of the setup and some of the research the authors are building on. Before diving into the experiments, they lay out some standard RL terms, just to make sure everyone's on the same page, since this can get really technical. There's the state space, which represents the different possible situations the model can find itself in; the action space, which is the range of actions a model can take; and the reward function, which tells the model how well it's doing based on the actions it takes. Then they adapt these terms specifically for large language models and vision-language models by involving a verifier. This verifier is really key, because it's what tells the model how well it's doing and what it needs to adjust. So, in their setup, the state for the language model is just the current input prompt, and for the vision-language model it's the prompt combined with the visual observation, while the action space is basically the text the model outputs. And then there's the sequential revision we mentioned before, which is about how the model refines its responses based on what it's done previously, like a running memory of all its prior actions and verifier feedback.
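As a rough illustration of that formulation, here is a small Python sketch of a sequential-revision rollout driven by an outcome-based verifier. The EpisodeState container, run_episode(), and the policy/verifier interfaces are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeState:
    prompt: str                                   # task prompt (plus visual observation for a VLM)
    history: list = field(default_factory=list)   # prior (answer, verifier feedback) pairs

def run_episode(policy, verifier, prompt: str, max_turns: int = 5):
    """Roll out one multi-turn episode; the collected trajectory feeds the RL update."""
    state = EpisodeState(prompt=prompt)
    trajectory = []
    for _ in range(max_turns):
        answer = policy(state)               # action: the free-form text the model outputs
        reward, feedback = verifier(answer)  # outcome-based reward plus textual error info
        trajectory.append((state.prompt, list(state.history), answer, reward))
        if reward > 0:                       # success: stop revising
            break
        state.history.append((answer, feedback))  # remember the mistake and try again
    return trajectory
```

The design choice this sketch highlights is that the "state" grows with every failed attempt, so the model conditions on its own earlier answers and the verifier's feedback when it revises.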
Host: Yeah, they are very thorough in laying the groundwork. So, let's talk about the actual tasks they used to evaluate all of this. The first is the GeneralPoints environment we mentioned earlier, which is their variation on the Points24 game, used to test the model's arithmetic reasoning. For the language model version, the cards are given to the model in text form, and the model needs to build an equation that reaches the target number, which is 24 by default. Then, for the vision-language model version, they give the model images of the cards instead of text and see if it can do the same thing, so it adds that extra layer of visual perception. It's a great way to assess both types of reasoning, and the rule variations, where J, Q, and K are interpreted differently, help them figure out whether the model is actually understanding the math or simply memorizing the training data. Similarly, the visual variations, where the cards can be different colors, let them pinpoint how well the model is actually perceiving the visual world, rather than just the specific visual representations it saw in training.
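A rough sketch, with an assumed prompt format rather than the paper's actual template, of how the same GeneralPoints episode might be posed to the two model types: the language model sees the card values as text, while the vision-language model gets an image of the cards plus the same instruction, adding a recognition step before the arithmetic.

```python
RULE_TEXT = "J, Q, and K each count as 10."  # swapped out to create the rule variants

def build_llm_prompt(cards: list[str], target: int = 24) -> str:
    # Text-only version: the card values are spelled out directly.
    return (f"Cards: {', '.join(cards)}. {RULE_TEXT} Write an equation that uses "
            f"each card value exactly once and equals {target}.")

def build_vlm_prompt(card_image_path: str, target: int = 24) -> dict:
    # Vision-language version: the image goes alongside the text, so the model
    # must first read the card values off the image before doing the math.
    instruction = (f"{RULE_TEXT} Identify the four cards in the image, then write an "
                   f"equation that uses each value exactly once and equals {target}.")
    return {"image": card_image_path, "text": instruction}

print(build_llm_prompt(["2", "7", "K", "A"]))
```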
Host: Exactly, and it's not just a simple card game, they’ve really thought out how to make this an effective testbed. The second environment they used is V-IRL, which is much more real-world; it's about spatial reasoning and navigation. Again, they tested a language model version and a vision-language model version, where one uses text descriptions only and the other also uses the visual component to help guide it. The model is given instructions on how to navigate to a location, and they test how well it can follow those instructions to reach its destination. It's important to note that they're doing this in an open-world environment, so they're simulating real-world navigation tasks with realistic visual input. So this is testing spatial reasoning, a key part of what any model needs to understand the real world. And again, similar to GeneralPoints, they're introducing variations here. They have the absolute action space we discussed, where you're going north, south, east, or west, and the relative action space, where you're turning left, turning right, or making slight left and slight right turns. And similar to the cards, the visual variation here is about training the model on one location and then testing it on other locations, which forces the model to really reason about the underlying spatial relationships and generalize its knowledge. So, between these two environments, they're trying to cover as many bases as possible to see how well RL and SFT apply across a multitude of scenarios.
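To make the action-space difference concrete, here is a tiny sketch of translating between an absolute, compass-style action space and a relative, heading-dependent one. The heading table and the absolute_to_relative() helper are hypothetical, purely for illustration and not the V-IRL API; the real relative action set also includes finer moves like slight left and slight right.

```python
HEADINGS = {"north": 0, "east": 90, "south": 180, "west": 270}

def absolute_to_relative(current_heading: str, target_direction: str) -> str:
    """Re-express an absolute compass move as a relative turn for the agent."""
    delta = (HEADINGS[target_direction] - HEADINGS[current_heading]) % 360
    return {
        0: "forward",
        90: "turn right",
        180: "turn around",   # simplified here; e.g. two right turns
        270: "turn left",
    }[delta]

# The same instruction "head east" maps to different relative tokens depending
# on the agent's current heading, which is exactly what the relative-action
# variant forces the model to reason about.
print(absolute_to_relative("north", "east"))  # turn right
print(absolute_to_relative("south", "east"))  # turn left
```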