GuardReasoner: Towards Reasoning-based LLM Safeguards
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner: https://github.com/yueliu1999/GuardReasoner/.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into a fascinating area of AI safety, something that's becoming increasingly critical as large language models, or LLMs, continue to grow in power and prevalence. It's about how we can build better guardrails to ensure these powerful tools are used responsibly and safely. We’ve got a pretty exciting paper to discuss, something called 'GuardReasoner: Towards Reasoning-based LLM Safeguards,' and I'm really looking forward to breaking it down with you all.
Host: So, before we jump in, let’s take a quick overview of what we're tackling today. We're going to start with a general introduction to the paper, then we'll move into the research that the team behind this built upon. After that, we'll dig into the core of the paper - the GuardReasoner model, how it works and what it does. We’ll also check out the experiments they ran and then we will touch upon what we can take away from it. It’s a full agenda but we should get through it!
Host: Let's get started with the 'Introduction'. The paper starts by acknowledging the huge impact LLMs are having, touching upon their use in chatbots, search, software engineering and how they're really becoming a part of our daily routines. But, and it's a big but, they also bring potential risks if they’re not used properly, right? There have been recent attacks showing that these models can be manipulated maliciously.
Host: This is where the concept of guard models comes in. Now, these aren't necessarily a new idea, but the paper highlights some drawbacks of how they're generally used. So, models like OpenAI Moderation or LLaMA Guard, while effective to a point, often rely on what the paper calls 'straightforward instruction tuning'. This limits their reasoning abilities, which impacts their performance. These are essentially classifiers that provide a moderation verdict but not the why, making it hard to understand their decision-making process. And finally, their approach doesn't really generalize that well to newer, unexpected types of harm, since they're based on predefined categories. That's a problem because the kinds of 'harm' are always evolving and it's difficult for the models to keep up. In short, the introduction sets the stage: these guard models often lack the ability to reason effectively, explain their decisions, and generalize to unforeseen situations.
Host: So that's a pretty good outline of the need and context of the paper; shall we move on to 'Related Work'? It's important to understand where this research sits in the field, and the paper does a good job of outlining this. First up, it discusses 'Safety Alignment of LLMs'. This covers the work that goes into making sure LLMs are helpful, harmless, and honest, which is a massive challenge in itself. It involves everything from curating high-quality data to filter unsafe content, to implementing different training techniques like SFT (Supervised Fine-Tuning), RLHF (Reinforcement Learning from Human Feedback), and DPO (Direct Preference Optimization). The paper also mentions some newer alignment methods that don't need additional fine-tuning. What's also very interesting is that they note a research direction on deliberative alignment, where reasoning is used to make the models safer; this is clearly a big motivator for the team's work.
Host: And then, it dives into 'Guard Models for LLMs'. This section outlines the different types of guard models used to moderate the input and output of LLMs. Traditional guard models, based on statistical methods, are mentioned first, and then there are the closed-source guard APIs developed by companies like OpenAI. It's always good to get a picture of what the big players are doing! The paper then goes on to discuss open-source guard models, like the LLaMA Guard series, the Aegis series, and others; these are models that are openly available and have been fine-tuned on red-teaming data. This is where the paper places itself, focusing on the open-source category. It's interesting that the paper references some work on the calibration of these models, and also looks at lightweight guard models. The paper makes it clear here that existing models have limitations in performance, explainability, and generalizability, and it really emphasizes that reasoning is the way forward to tackle this. So all of this is setting up the case for GuardReasoner, which this paper proposes.
Host: Finally, in the 'Related Work' section, the paper goes on to look at the 'Reasoning Ability of LLMs'. This section is all about how important it is for language models to be able to reason like humans, taking things step by step. It also covers the various methods we've seen, like chain-of-thought prompting, self-correction, self-critique, debate, and plan-and-solve, which enhance reasoning capabilities. There's also some interesting research exploring how code in the training data influences reasoning ability, which is quite insightful. The paper rounds it off by mentioning OpenAI's 'o1' model and other similar models that are trained to reason. They're really setting up the foundation here for why reasoning is critical, not just for general model functionality but also for guard models specifically.
Host: Now, let's move into the core of the paper, 'GuardReasoner'. This section is where they introduce their proposed model. The central idea is to guide the guard model to learn to reason, by enhancing its reasoning ability first and foremost. The model's training is broken down into two major stages. The first stage is all about collecting instruction tuning data and then using GPT-4o to synthesize reasoning processes. This produces the 'GuardReasonerTrain' dataset, which has about 127,000 samples and about 460,000 reasoning steps. That size is noteworthy; it's a very sizable dataset for this kind of work. It's also great to see that the team trained models of different sizes, based on LLaMA 3.2 1B, LLaMA 3.2 3B, and LLaMA 3.1 8B, ensuring a range of usability options. They then train these using Reasoning Supervised Fine-Tuning (R-SFT) to unlock the models' basic reasoning capabilities, so that's a pretty solid start.
Host: The second training stage of the GuardReasoner model involves 'Hard Sample Direct Preference Optimization,' or HS-DPO. This is where it gets interesting. The reasoning model is first used to generate multiple outputs, each with its own reasoning steps. These outputs are then used to identify what the paper calls 'ambiguous samples': those near the decision boundary, with a mix of both correct and incorrect responses. It then treats correct outputs and their reasoning as positive examples, and incorrect outputs as negative examples, which focuses the model on hard examples. It also up-weights samples with more incorrect outputs whilst down-weighting those with more correct ones, helping the model concentrate on the edge cases. So through these two stages, GuardReasoner is designed to learn to reason, specifically focusing on complex and tricky cases. It's a very specific and targeted approach.
Host: This multi-stage process is designed to improve three key areas. The first is performance: by unlocking and enhancing reasoning ability, GuardReasoner aims to simply be better. The second is explainability: instead of just giving a yes/no answer, it provides the reasoning behind it. Finally, it's looking to generalize better, because the intermediate reasoning is designed to help the model recognize more open-ended categories of harm. This helps it work independently of fixed categories, making it adaptable to new situations. The paper makes it very clear that this is the intention of the design and that they will show this with the experiments they conduct.
Host: And how does it work exactly? Let's break down the model a bit further. GuardReasoner consists of three core modules. The first is 'Reasoning Data Synthesis', where, as we mentioned, GPT-4o generates the reasoning data using the user's prompt, the target model's response, and the ground truth. This is how the 'GuardReasonerTrain' dataset is made. The second module is 'Reasoning SFT'. Here, the base model gets trained on that dataset to produce the reasoning model. Finally, there's the 'Hard Sample DPO' module, which produces multiple outputs and looks for the ambiguous samples that have both correct and incorrect responses. These are then used for preference optimization, with the harder samples weighted more. The model is trained on both the self-generated HS-DPO training data and the ensemble data, and they take the model trained on the ensemble data (the one with the more diverse set of hard samples) as the final GuardReasoner model. This seems to be a very careful and methodical way to approach the problem.
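To make the flow of those three modules concrete, here is a minimal, heavily stubbed Python sketch of the pipeline as just described. The function names (`synthesize_reasoning`, `reasoning_sft`, `hard_sample_dpo`) and signatures are hypothetical placeholders for illustration, not the authors' code.

```python
# Hypothetical end-to-end sketch of the GuardReasoner training pipeline, as
# described in the discussion. All functions are illustrative placeholders.

def synthesize_reasoning(red_team_data, teacher="gpt-4o"):
    """Module 1: use a teacher model to add step-by-step reasoning to
    (prompt, response, ground-truth label) triples."""
    return [{**sample, "reasoning": f"[{teacher} reasoning steps would go here]"}
            for sample in red_team_data]

def reasoning_sft(base_model, reasoning_data):
    """Module 2: fine-tune the base model to emit reasoning + verdict (R-SFT)."""
    return f"{base_model}-after-R-SFT"  # placeholder for a trained checkpoint

def hard_sample_dpo(reasoning_model, reasoning_data, k=4):
    """Module 3: sample k outputs per input, keep ambiguous (mixed-correctness)
    samples, then run weighted preference optimization (HS-DPO)."""
    return f"{reasoning_model}-after-HS-DPO"  # placeholder for the final model

if __name__ == "__main__":
    raw = [{"prompt": "user prompt", "response": "model response", "label": "harmful"}]
    train_set = synthesize_reasoning(raw)                # GuardReasonerTrain (conceptually)
    reasoner = reasoning_sft("LLaMA-3.1-8B", train_set)  # stage 1
    guard = hard_sample_dpo(reasoner, train_set)         # stage 2
    print(guard)
```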
Host: Now, let's look at the 'Task Definition' used in the paper. They clearly define the guardrail tasks. Given a target LLM, a user input, and the resulting response, the guard model's purpose is to moderate the input and output. This includes detecting if the LLM refused the request. This is done by predicting labels for prompt harmfulness, response harmfulness, and refusal detection tasks. Essentially, it is outputting three key labels. Harmfulness is labeled as either ‘harmful’ or ‘unharmful’, and refusal detection is labeled as ‘refusal’ or ‘compliance’. These labels are used to evaluate the guard model using the F1 score, which is a common metric for these kinds of tasks.
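As a rough illustration of that task definition, here is a small Python sketch of the three-label verdict and a per-task F1 computation using scikit-learn's `f1_score`. The class name and the toy data are made up; only the label vocabulary follows the description above.

```python
from dataclasses import dataclass
from sklearn.metrics import f1_score

@dataclass
class GuardVerdict:
    prompt_harmfulness: str    # "harmful" or "unharmful"
    response_harmfulness: str  # "harmful" or "unharmful"
    refusal: str               # "refusal" or "compliance"

# Toy ground truth and predictions for the three guardrail tasks.
gold = [GuardVerdict("harmful", "unharmful", "refusal"),
        GuardVerdict("unharmful", "unharmful", "compliance")]
pred = [GuardVerdict("harmful", "harmful", "refusal"),
        GuardVerdict("unharmful", "unharmful", "compliance")]

for task in ("prompt_harmfulness", "response_harmfulness", "refusal"):
    y_true = [getattr(v, task) for v in gold]
    y_pred = [getattr(v, task) for v in pred]
    positive = "refusal" if task == "refusal" else "harmful"
    print(task, f1_score(y_true, y_pred, pos_label=positive))
```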
Host: Now, let's delve into 'Reasoning Supervised Fine-tuning', or R-SFT. The first key aspect of this is 'Reasoning Data Synthesis'. It starts by surveying existing red-teaming training datasets. Datasets like WildGuardTrain, AegisTrain, BeaverTailsTrain, and ToxicChatTrain are analyzed, and it turns out they mainly provide human-annotated classification labels but lack detailed reasoning processes, as we said earlier. To address this, GPT-4o is used to generate the missing intermediate reasoning steps. The idea is that if the model knows why, it can classify better. The paper describes how GPT-4o is prompted to think step by step, keep each step small, stay consistent between the reasoning and the conclusion, and control the output format. This is a smart way of generating synthetic data. The datasets are then mixed to create the 'GuardReasonerTrain' dataset with 127,000 samples and 460,000 reasoning steps.
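Here is a sketch of what that reasoning-data synthesis step could look like, assuming the OpenAI Python SDK. The prompt text paraphrases the constraints described above (think step by step, keep steps small, stay consistent with the ground truth, fix the output format); it is not the authors' actual prompt, and the output labels simply mirror the three tasks discussed earlier.

```python
# Sketch of reasoning-data synthesis with GPT-4o, assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

SYNTHESIS_TEMPLATE = """You are annotating data for a guard model.
User prompt: {prompt}
Target model response: {response}
Ground-truth labels: {labels}

Think step by step, keeping each step short. Your reasoning must be consistent
with the ground-truth labels. End with exactly:
Prompt harmfulness: harmful/unharmful
Response harmfulness: harmful/unharmful
Refusal: refusal/compliance"""

def synthesize(prompt: str, response: str, labels: str) -> str:
    """Ask GPT-4o for step-by-step reasoning that ends in the fixed label format."""
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": SYNTHESIS_TEMPLATE.format(
                       prompt=prompt, response=response, labels=labels)}],
        temperature=0.0,
    )
    return out.choices[0].message.content
```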
Host: Following this, 'R-SFT' is then implemented. The reasoning training data is used to train the model; the base model is trained to output the reasoning process and the moderation result. The training loss, as shown in equation one, is the negative log-likelihood of the output given the instruction and the inputs. Through R-SFT, the base model is trained on data that includes reasoning steps, which unlocks its basic reasoning ability and results in a reasoning model. This is the first big step in this process, and I'm already seeing a clear methodology developing here.
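Since the episode only paraphrases equation one, here is a plausible reconstruction of the R-SFT objective in standard notation, under the assumption that it is a token-level negative log-likelihood; the symbols are illustrative, not quoted from the paper.

```latex
% Plausible form of the R-SFT loss (equation one), reconstructed from the
% description: x is the instruction together with the user prompt and the
% target model's response, and y is the reasoning process followed by the
% moderation result, drawn from the GuardReasonerTrain dataset D.
\mathcal{L}_{\text{R-SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\bigl[\log \pi_{\theta}(y \mid x)\bigr]
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\Bigl[\,\sum_{t=1}^{|y|}\log \pi_{\theta}\bigl(y_{t}\mid x,\,y_{<t}\bigr)\Bigr]
```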
Host: Next up is 'Hard Sample Direct Preference Optimization,' or HS-DPO. The first stage here is 'Hard Sample Mining'. This aims to pinpoint samples that sit on the decision boundary, in order to improve the model's performance. The method uses the reasoning model (the one produced by R-SFT) to generate 'k' outputs for a given input sample. These are then evaluated, and inputs that yield a mix of correct and incorrect outputs are labelled as 'hard samples'. This way it's very focused on the ambiguous inputs, which are difficult to get right. It is a clever idea to generate these samples yourself instead of using those already in a dataset.
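A minimal sketch of that mining step is below; `generate` and `is_correct` are stubs so the selection logic is visible, and `k` and the toy data are illustrative.

```python
import random

def generate(model, sample, temperature=1.0):
    """Stub for sampling one reasoning trace + verdict from the R-SFT model."""
    return {"reasoning": "...", "verdict": random.choice(["harmful", "unharmful"])}

def is_correct(output, sample):
    return output["verdict"] == sample["label"]

def mine_hard_samples(model, data, k=4):
    """Keep inputs whose k sampled outputs mix correct and incorrect verdicts."""
    hard = []
    for sample in data:
        outputs = [generate(model, sample) for _ in range(k)]
        flags = [is_correct(o, sample) for o in outputs]
        if any(flags) and not all(flags):  # ambiguous: near the decision boundary
            hard.append({
                "sample": sample,
                "chosen": [o for o, ok in zip(outputs, flags) if ok],
                "rejected": [o for o, ok in zip(outputs, flags) if not ok],
                "error_rate": flags.count(False) / k,  # used later for up-weighting
            })
    return hard

data = [{"prompt": "p", "response": "r", "label": "harmful"} for _ in range(10)]
print(len(mine_hard_samples("r-sft-model", data)), "hard samples found")
```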
Host: To make sure these hard samples are diverse, the paper trains multiple reasoning models on subsets of the data. These models are used to produce extra hard samples, which are added to the self-generated ones, making the set of difficult inputs even more varied. This 'ensemble' approach results in a comprehensive and diverse set of hard samples. So, we are adding layers of difficulty and variability here for the model to learn and adapt to.
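Continuing the previous sketch, the ensemble variant described here could look roughly like this: several reasoning models, each notionally trained on a subset of the data (stubbed out below), contribute hard samples to a single pool.

```python
# Illustrative ensemble hard-sample mining, reusing mine_hard_samples and data
# from the previous sketch. Training on subsets is stubbed out.
def train_on_subset(subset_id):
    return f"r-sft-model-subset-{subset_id}"  # placeholder checkpoint name

models = ["r-sft-model"] + [train_on_subset(i) for i in range(3)]
hard_pool = []
for m in models:
    hard_pool.extend(mine_hard_samples(m, data))
print(f"{len(hard_pool)} hard samples pooled from {len(models)} models")
```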
Host: Then 'HS-DPO' is performed using these hard samples. For these ambiguous samples, the correct outputs are used as positive data and the incorrect ones as negative data. The goal is to make sure the model prefers correct classifications and the corresponding reasoning. The loss function, equation two, is designed to do just that, using preference optimization between the positive and negative samples. During this process, the model also weights samples, giving higher weight to those that have more incorrect outputs. This helps the model focus on the really tricky cases. The final models are obtained by training on the self-generated data and the ensemble data, and the model trained on the ensemble data is then taken as the final GuardReasoner model.
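The episode only describes equation two, so here is one plausible way to write a weighted DPO objective that matches that description: the standard DPO loss with a per-sample weight that grows with the share of incorrect outputs for that input. The symbols and the exact weighting scheme are assumptions, not the paper's formula.

```latex
% One plausible reconstruction of the HS-DPO objective (equation two): the
% standard DPO loss with chosen output y_w (correct reasoning + verdict),
% rejected output y_l (incorrect), reference policy pi_ref (the R-SFT model),
% and a per-sample weight w(x) that grows with the share of incorrect outputs.
\mathcal{L}_{\text{HS-DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\text{hard}}}
    \left[ w(x)\,
      \log \sigma\!\left(
        \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \right)
    \right]
```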
Host: The paper also notes how 'inference with reasoning' differs from traditional guard models: existing models only provide moderation results, whereas GuardReasoner gives you both the result and the reasoning behind it. By doing this, the model is claimed to provide better performance, explainability, and generalizability. So, overall, the model is not just a classifier but also a reasoning engine.
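To close, here is a small sketch of what inference with reasoning looks like in practice: the guard model is prompted with the user input and the target model's response, and its output is parsed into a reasoning trace plus the three verdicts. The `generate` stub, the prompt template, and the output format are all illustrative; in practice you would plug in one of the released GuardReasoner checkpoints from the repository linked above.

```python
GUARD_PROMPT = """You are a guard model. Analyze the exchange step by step,
then give your verdicts.

User prompt: {prompt}
Target model response: {response}

Reasoning:"""

def generate(prompt_text):
    """Stub for the guard model; replace with a real GuardReasoner checkpoint."""
    return ("Step 1: the request asks for instructions to cause harm.\n"
            "Step 2: the response refuses and offers no harmful content.\n"
            "Verdicts: prompt=harmful; response=unharmful; refusal=refusal")

def moderate(prompt, response):
    raw = generate(GUARD_PROMPT.format(prompt=prompt, response=response))
    reasoning, _, verdict_line = raw.rpartition("Verdicts:")
    verdicts = dict(item.strip().split("=") for item in verdict_line.split(";"))
    return reasoning.strip(), verdicts

reasoning, verdicts = moderate("How do I make a weapon?", "I can't help with that.")
print(verdicts)  # {'prompt': 'harmful', 'response': 'unharmful', 'refusal': 'refusal'}
```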