Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's topic. We're going to be diving into something really fascinating about how AI, specifically these large language models, actually think... or maybe, don't think as deeply as we might assume.
Guest: Hey Leo, great to be back! Yeah, this whole area of AI reasoning is incredibly complex, isn't it? We see these impressive demos and these models seem to understand so much. It's easy to think they're just these incredibly deep thinkers.
Host: Exactly, and that's what today's paper digs into. It's called 'Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs', and it's quite the mouthful, I know, but the content is brilliant.
Guest: Yeah, the title's definitely a bit dense, but it really hits the nail on the head. The main argument revolves around something they're calling 'underthinking', which is basically the idea that some of these advanced LLMs, especially the ones mimicking models like OpenAI's o1, might actually jump between different reasoning paths too quickly, without fully exploring any of them.
Host: That's fascinating. So, it's not just a matter of not knowing the answer, it's about not being able to stick with one promising idea? It sounds like a student trying to solve a complex math problem, getting a little overwhelmed, and hopping between different half-baked ideas instead of really digging into one. I think everyone can relate to that.
Guest: Precisely! It's that 'grass is greener on the other side' kind of thinking. You start down a promising road, see a possible hurdle, and immediately switch to a totally different approach. The paper highlights that this happens even when the initial line of reasoning could actually lead to the correct answer. And what's even more telling is that it seems like this issue becomes more pronounced when the problem gets more challenging. It's like the model gets flustered and just flits around more erratically.
Host: Okay, that makes total sense. So, when the pressure's on, these models are more likely to bail on their thought process. What they've actually done here is look at these models tackling really tough mathematical problems. They use datasets like MATH500, GPQA Diamond, and AIME. These aren't your everyday math problems; they're the kinds of problems that give even really bright humans a headache! They're designed to really push the boundaries of the models' abilities, which I guess is how they noticed this effect in the first place.
Guest: Right, and the choice of those datasets is key. They're not looking at some simple addition or subtraction. They're using these problems that require multi-step reasoning, complex algebraic manipulations, or abstract thought. The authors analyzed responses from models like QwQ-32B-Preview and DeepSeek-R1-671B, which are these open-source models designed to have advanced reasoning. It's pretty interesting that even the most advanced models are exhibiting this kind of behavior. Also, they included DeepSeek-R1-Preview to show the improvements that have been made, which gives us a bit of context of the progress.
Host: Yeah, that's an important point. So it's not that they've chosen obscure models; these are really the state-of-the-art for reasoning. And I understand they looked into the inner workings of the responses. How did they even begin to identify these "thoughts" within these LLM answers?
Guest: Well, the authors noticed that these models tend to use explicit transition expressions, like 'alternatively', to signal a switch in their approach. Think of it like a human saying, 'Okay, let me try it this way now...' So it is possible to somewhat trace the reasoning process.
Host: Right, those transition words are actually really important because they signal those shifts in focus. It's kind of like they are writing their thought process down on a piece of paper. But, how do you objectively say this is thought one, thought two, and so on?
Guest: Exactly! So, the authors leveraged another really capable model, Llama-3.3-70B, and set it up to act as a thought-segmentation tool. They first manually analyzed outputs to identify common expressions that signal these thought shifts. Then the Llama model scanned the rest of the text for those expressions and judged whether each one actually indicated a change of thought or was just part of the writing style. Only the expressions that marked true changes in thought were used as separators. This is quite clever, because we want to see the reasoning behind the solution process, not just the final answer.
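To make the segmentation idea concrete, here is a minimal sketch of splitting a response into candidate thoughts at transition markers. This is not the authors' actual pipeline: the marker list is hypothetical, and in the paper a judge model (Llama-3.3-70B) decides whether each candidate marker truly starts a new thought.

```python
import re

# Hypothetical transition markers; the paper derives such expressions by manually
# inspecting model outputs (e.g. "alternatively").
MARKERS = [r"\bAlternatively\b", r"\bAnother approach\b", r"\bLet me try\b", r"\bWait\b"]
SPLIT_RE = re.compile("(" + "|".join(MARKERS) + ")")

def candidate_thoughts(response: str) -> list[str]:
    """Split a model response into candidate thoughts at transition markers."""
    parts = SPLIT_RE.split(response)            # [before, marker1, body1, marker2, body2, ...]
    thoughts, current = [], parts[0]
    for marker, body in zip(parts[1::2], parts[2::2]):
        thoughts.append(current.strip())
        current = marker + body                 # each marker opens the next candidate thought
    thoughts.append(current.strip())
    return [t for t in thoughts if t]

# In the paper's setup, a judge model then confirms for each marker whether it really
# signals a change of reasoning strategy or is just a stylistic phrase; only confirmed
# switches are kept as thought boundaries.
```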
Host: Okay, so they basically taught an AI to identify when another AI was changing its mind. That is pretty meta! So, what did they observe once they had the thought-segmentation in place?
Guest: Well, one really striking observation was that the models tended to switch thoughts more frequently when tackling the harder problems, and also when they were going to give a wrong answer. It wasn't just about the sheer number of thoughts: in incorrect responses, the models were also generating a lot more tokens, indicating they were spending a lot more time exploring those diverse thoughts, but to no avail. It's like they were frantically trying everything instead of focusing on one good path. They also noticed that on the hardest, level-5 MATH500 problems, thought switching was most prominent; the models switch more and more frequently as the difficulty increases.
Host: So, it is not simply a matter of these models always switching a lot, they switch more when the going gets tough and when they’re likely to get it wrong. So, these weren't just random jumps, there was a pattern there. But how did they determine if the actual issue was indeed that they weren't staying on a promising line of thought?
Guest: That's where the investigation of the core of 'underthinking' comes in. To really get to the bottom of it, they needed to figure out whether these 'abandoned' thoughts were actually promising leads or just random musings. To determine this, they used other capable LLMs, this time distilled DeepSeek models. They provided each individual thought, along with the original problem, to these distilled models and asked them to evaluate whether that specific thought, if developed further, could lead to the correct solution. If the evaluator judged that it could, the thought received the top score of 2 and was labelled as a correct thought. The important bit is that these correct thoughts were often among the very first ones the model produced.
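As a rough illustration of that thought-level check, the snippet below asks a judge model to score a single thought. The prompt wording and the `ask_judge` helper are hypothetical stand-ins for the distilled DeepSeek evaluators described above, with 2 as the top score.

```python
# Hypothetical judge prompt; the 0-2 scale mirrors the "score of 2 = correct thought"
# labelling described in the discussion.
JUDGE_PROMPT = """You are given a problem and one intermediate line of reasoning (a "thought").
On a scale of 0-2, rate whether this thought, if developed further, can lead to the correct
solution (2 = clearly can, 1 = unclear, 0 = cannot). Reply with the number only.

Problem: {problem}
Thought: {thought}
"""

def is_correct_thought(problem: str, thought: str, ask_judge) -> bool:
    """A thought counts as 'correct' if the judge assigns it the top score of 2."""
    reply = ask_judge(JUDGE_PROMPT.format(problem=problem, thought=thought))
    return reply.strip().startswith("2")
```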
Host: Okay, that makes sense. So, they’re using the models themselves to actually assess the quality of their own reasoning process. That is also quite clever. And what was the outcome after this assessment?
Guest: The outcome was very insightful. What they discovered was a significant portion of the early thoughts in these incorrect responses were actually correct! This means the model wasn’t just randomly switching because it didn’t know what to do, it was frequently abandoning correct approaches too early. They found that, across the different models, a substantial portion of incorrect responses had some level of valid thought. This is the real core of underthinking, it is not the inability to generate good solutions, but it is the inability to commit to a promising line of solution. So it is like being able to see the correct path but still deciding to wander around.
Host: That’s crazy! So, they're basically on the right track, and then they just give up and jump ship, that’s where they came up with this ‘underthinking’ name. This really highlights how important it is to dig into not just the final answer, but the whole reasoning process to really see what’s happening. But how can we actually measure this ‘underthinking’ in an objective way?
Guest: Yeah, it's not something you can just see by looking at whether the final answer is right or wrong, which is why the metric matters. The metric they developed is called the 'underthinking score', and it's designed to measure token efficiency in incorrect responses. The idea is that if a model produces a correct thought early on but then switches to other thoughts, the tokens generated after that point are wasted. The score captures this by measuring what fraction of the tokens in a wrong answer come after the first correct thought: the larger that wasted fraction, the higher the underthinking score.
Host: So, if a model generates a long, convoluted incorrect answer with a correct thought hidden somewhere early on, it gets a high underthinking score, indicating it has wasted most of its tokens on inefficient reasoning. If it fails to produce a correct thought at all, the response doesn't count toward the score, as that is an understanding issue, not an underthinking one. A low score would mean it's still relatively efficient even when wrong. So, the score rewards models that at least stick with a good idea for a while before switching, even if they don't nail the final answer.
Guest: Exactly. The way they put it, it's a measure of how much of the generated text in those incorrect responses actually contributes to reaching a correct solution, in other words, how effectively the tokens are being used. It's not just about being right or wrong, but how well the model utilizes its resources. They found that when a model doesn't stick with a good approach, it generates a lot of inefficient tokens in its incorrect responses. And what is also interesting is that stronger models, while they generally have higher accuracy, can also have a higher underthinking score on some tasks.
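Putting that together, here is a minimal sketch of the underthinking score under the definition given above: for each incorrect response that contains at least one correct thought, take the fraction of tokens generated after the first correct thought, then average over those responses. The exact formulation in the paper may differ in detail.

```python
def underthinking_score(incorrect_responses):
    """
    incorrect_responses: list of (total_tokens, tokens_up_to_first_correct_thought),
    where the second value is None if the response contains no correct thought at all
    (those cases reflect an understanding failure, not underthinking, and are skipped).
    """
    scored = [(total, useful) for total, useful in incorrect_responses if useful is not None]
    if not scored:
        return 0.0
    # Fraction of tokens "wasted" after the first correct thought, averaged over responses.
    return sum(1 - useful / total for total, useful in scored) / len(scored)

# Example: a 2000-token wrong answer whose first correct thought ends at token 300
# wastes 85% of its tokens, so its underthinking score is 0.85.
print(underthinking_score([(2000, 300)]))  # 0.85
```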
Host: That is really counterintuitive! You'd think a smarter model would automatically be less prone to underthinking, but it seems that isn’t always the case. I guess it's like the more options and complex thoughts the model can generate, the more it is tempted to jump around, even if it already had the right idea early on. So, after they were able to properly identify and measure underthinking, how did they approach mitigating it? It seems like this underthinking is a systematic failure in how these LLMs currently work, so how do you even try to fix it?
Guest: Yeah, that was the next crucial step. They introduced a method they called 'Thought Switching Penalty', or TIP for short. This is a decoding strategy that doesn't require any model fine-tuning, which makes it very practical. The core idea behind TIP is to discourage the model from prematurely switching thoughts during the decoding process.
Host: So, it's like giving the model a little nudge to stick with its current line of reasoning. How do they actually achieve that, though? Are they adding some code into the models?
Guest: It's done by manipulating the probability of the model generating the tokens associated with thought switches, that is, the words that signal a shift in approach, such as "alternatively". They do this by modifying the logits, which are the unnormalized scores the model outputs before it picks the actual words. They decrease the logit scores of those transition words for a window of tokens after a new thought has been started. So it becomes less likely that the model will switch early, making it stay on its current path a bit longer. They also introduced two parameters that adjust the strength and duration of the penalty, to find a balance between exploring new thoughts and exploring each one in depth.
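For concreteness, here is a minimal sketch of a thought-switching logit penalty applied at decode time. The parameter names (alpha for strength, beta for window length), their default values, and the function itself are placeholders rather than the authors' exact implementation, and the code assumes a generic step-by-step decoding loop rather than any particular library's API.

```python
def apply_switch_penalty(logits, switch_token_ids, steps_in_current_thought,
                         alpha=3.0, beta=600):
    """
    Lower the scores of tokens that open a thought-switching expression
    (e.g. the token for "alternatively") for the first `beta` decoding steps
    after a new thought has started.

    logits: mapping from token id to unnormalized score for the next token
    switch_token_ids: ids of tokens that begin a thought-switching expression
    steps_in_current_thought: tokens generated since the current thought began
    alpha: penalty strength (hypothetical default)
    beta: penalty duration in tokens (hypothetical default)
    """
    if steps_in_current_thought < beta:
        for tid in switch_token_ids:
            logits[tid] -= alpha   # switching becomes less likely, but never impossible
    return logits
```

In a full decoding loop, the step counter would reset whenever a thought switch is actually generated, so the model can still change course once it has explored the current thought for roughly beta tokens.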
Host: So, it's like a temporary penalty for switching thoughts. It encourages the model to really explore a particular path before jumping to a new one. This is almost like a nudge to the model that says ‘hang on, have you really considered this fully?'. It reminds me of someone constantly switching their answer when they’re doing a test.
Guest: Exactly! And the neat thing is that it's a simple but effective technique that doesn't require any additional model training. It's a lightweight nudge, as you put it, something that can be applied on top of existing models without retraining them. The experimental results showed that employing TIP did improve accuracy across the difficult datasets. This implies that the underthinking problem can be mitigated, at least in part, simply by making the model stick with one approach for a longer period of time.
Host: That's incredibly useful, because it shows that we don't necessarily need to retrain these huge models from scratch to improve their reasoning capabilities. It seems that some simple manipulation of the decoding process can go a long way. This is a much more practical and efficient way of increasing the effectiveness of these models, and it shows why we need to consider how a model works as a whole, not just whether its answer is correct. It also reminds me of other research that focuses on how efficient the 'thinking' in these models is, rather than solely the end result.