Improving Video Generation with Human Feedback
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies, direct preference optimization for flow (Flow-DPO) and reward-weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into some seriously cutting-edge stuff that's happening in the world of AI, specifically video generation. It's a field that's been making some crazy leaps lately, and we're gonna unpack a really interesting paper that's been making the rounds.
Guest: Hey Leo, thanks for having me! Yeah, video generation is absolutely exploding right now. It feels like every week there’s a new breakthrough. I'm particularly interested in this idea of using human feedback to refine these models, it's like the AI is finally starting to get a proper sense of what we actually find appealing.
Host: Exactly! It's not just about generating videos that are technically impressive, but videos that are actually enjoyable and align with what people expect. And that's where this paper comes in. We'll be looking at a piece of research that's exploring how to improve video generation by using a large-scale human preference dataset and some pretty innovative algorithms. Now, I know that might sound a bit technical, but don’t worry, we'll break it all down in a way that’s easy to understand.
Guest: Sounds great. I think it’s important for our listeners to understand that this isn't just about making better cat videos, though that's a worthy goal in itself! This has huge implications for creative industries, education, and even scientific visualization. The ability to generate high-quality, customized video on demand is a game-changer.
Host: Absolutely. So, before we get into the nitty-gritty, a quick heads-up that this paper is from arXiv, which, if you’re not familiar with it, is basically a huge open-access repository for academic papers. It's like the wild west of research papers, it’s amazing but sometimes you need to be careful! The specific paper we're talking about is called, ‘Improving Video Generation with Human Feedback’ and it was uploaded in January of 2025, so it’s pretty hot off the presses.
Guest: Ah, yes arXiv, the place where the future gets published first! It's great to have this kind of open access to the latest research. Now the title itself gives a good sense of the direction, ‘Improving Video Generation with Human Feedback’. Essentially, they’re tackling a core issue in AI: how do you get the machine to understand what’s good and what’s not, what we like and what we don't? It’s not just about being accurate, it's about alignment with our human values and preferences.
Host: Right, and that brings us to the core of the issue. These video generation models, while incredibly powerful, still struggle with things like smooth motion and getting the content of the video to match the text prompt correctly. And that's where human feedback comes in. This paper talks about how they’ve created a huge dataset of human preferences, where people have rated different video clips across various dimensions like visual quality, motion quality, and text alignment. I find that last point especially interesting – it’s not enough for a video to look good if it has nothing to do with what it should be showing.
Guest: Yeah, that multi-dimensional approach is really key here. It’s not enough to just ask 'Is this video good?' You need to dig deeper. Is the movement natural? Is the scene consistent with the text prompt? And you know, human preferences are so subjective; what one person loves another might hate. Capturing all that complexity in a dataset is a huge undertaking, and this is one of the major contributions of the research: they’ve clearly put a lot of work into building a really robust dataset that can train these models effectively.
Host: Absolutely, the creation of a massive dataset is no small feat. They gathered about 182,000 annotated examples, which is insane. They didn’t just throw some videos up and ask people, ‘Like it or not?’ They broke it down into those three crucial dimensions: Visual Quality, Motion Quality, and Text Alignment. It's like they're teaching the AI to not just create a video but to create a good video according to very specific human metrics. And speaking of metrics, I noticed they put a lot of focus into the reward model they used, which is fascinating.
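For a concrete mental model of what one of those pairwise annotations might look like, here is a purely illustrative sketch of a single record covering the three dimensions; the field names and paths are assumptions for illustration, not the dataset's actual schema.

```python
# Purely illustrative: a plausible shape for one pairwise annotation record.
# Field names, paths, and values are assumptions, not the released schema.
example_record = {
    "prompt": "a corgi surfing a wave at sunset",
    "video_a": "videos/000123_a.mp4",   # hypothetical path to one generated clip
    "video_b": "videos/000123_b.mp4",   # hypothetical path to the other clip
    # One verdict per dimension: "A", "B", or "tie"
    "labels": {
        "visual_quality": "A",
        "motion_quality": "tie",
        "text_alignment": "B",
    },
}
```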
Guest: Right, the reward model is basically the AI's way of quantifying what makes a video ‘good’ based on the human preferences in the dataset. They developed something called ‘VideoReward’, a multi-dimensional reward model that's trained to understand and quantify these different aspects of video quality. They explored several approaches, and the way they went about it is actually pretty impressive. I think they ended up using something called a Bradley-Terry model, which is very interesting.
Host: Yeah, the Bradley-Terry model is really cool. It's a way to model pairwise comparisons – basically, instead of just rating each video independently, annotators are asked which of two videos they prefer. This method captures human preferences much better because it’s easier for us to say ‘I like this one more’ than it is to say ‘this one is a 7.2 out of 10.’ They also compared that approach against a standard regression method, and found the Bradley-Terry model to be significantly more effective, particularly with a smaller dataset. It's also very helpful that they implemented a version that allows for ties, which are obviously common in real-world comparisons.
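To make that concrete, here is a minimal sketch of a Bradley-Terry-style loss over pairwise reward scores, with ties handled via a Rao-Kupper-style tie parameter. The function name, the tie formulation, and the default margin are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def bt_preference_loss(r_a: torch.Tensor, r_b: torch.Tensor,
                       label: torch.Tensor, theta: float = 2.0) -> torch.Tensor:
    """Bradley-Terry-style loss over pairwise reward scores (a sketch).

    r_a, r_b : per-pair scalar rewards for videos A and B, shape (N,)
    label    : 0 = A preferred, 1 = B preferred, 2 = tie
    theta    : tie parameter (> 1 gives ties nonzero probability); this
               Rao-Kupper-style tie handling is an assumption, not the
               paper's exact formulation.
    """
    log_theta = math.log(theta)
    p_a = torch.sigmoid(r_a - r_b - log_theta)     # P(A preferred)
    p_b = torch.sigmoid(r_b - r_a - log_theta)     # P(B preferred)
    p_tie = (1.0 - p_a - p_b).clamp_min(1e-8)      # remaining mass goes to ties

    log_probs = torch.log(torch.stack([p_a, p_b, p_tie], dim=-1))
    return F.nll_loss(log_probs, label)

# Toy usage: two pairs, one clear preference and one tie.
scores_a = torch.tensor([1.2, 0.4])
scores_b = torch.tensor([0.1, 0.5])
labels = torch.tensor([0, 2])
loss = bt_preference_loss(scores_a, scores_b, labels)
```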
Guest: Exactly. And you know, it’s those seemingly small things that make a big difference. The fact that they also addressed what they call ‘context-agnostic dimensions’ is crucial. It means they tried to make sure that, say, a video’s ‘visual quality’ score isn’t inadvertently influenced by the text prompt. So the model focuses on the visuals themselves, rather than dragging text alignment into its judgment of how sharp or blurry the frames are. The way they designed the model to separately evaluate each aspect is quite clever. This prevents these dimensions from becoming intertwined and makes the whole system more robust, leading to cleaner and more accurate scoring.
Host: Yeah, that's a really important point, the decoupling of these dimensions. They also introduced separate special tokens within the model’s input, ensuring the model could attend to either the video content alone or to both the video and the prompt in the case of text alignment. It basically forces the model to pay attention to the right information when judging each aspect of the video, which is a very smart way to approach the issue. Now, the paper also goes into how they take the human feedback and use it to actually align video generation models. This is where things get even more interesting. They explore different training-time strategies. They adapted some existing methods used in diffusion models but tweaked them for flow-based models, which are common in current video generation.
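Before moving on to the alignment algorithms, here is a rough illustration of that token-level decoupling: the context-agnostic dimensions get queries that never include the prompt, while text alignment sees both. The token names, query wording, and dictionary layout are assumptions for illustration, not the model's actual input format.

```python
# A minimal sketch of how dimension-specific queries might be assembled for a
# reward model. The special-token names and phrasing are illustrative
# assumptions, not the paper's exact implementation.

def build_reward_queries(prompt: str) -> dict:
    """Return one query per dimension; context-agnostic dimensions omit the prompt."""
    return {
        # Visual and motion quality are judged from the video alone,
        # so the text prompt is deliberately left out of their queries.
        "visual_quality": "<VIDEO> <VQ_TOKEN> Rate the visual quality of this video.",
        "motion_quality": "<VIDEO> <MQ_TOKEN> Rate the motion smoothness of this video.",
        # Text alignment needs both the video and the prompt.
        "text_alignment": f"<VIDEO> <TA_TOKEN> Prompt: {prompt}\n"
                          "Rate how well the video matches the prompt.",
    }

queries = build_reward_queries("a corgi surfing a wave at sunset")
for dim, q in queries.items():
    print(dim, "->", q)
```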
Guest: That's right, they're tackling a very cutting-edge area. Many advanced video models use something called ‘rectified flow,’ which is a different way of generating images or video compared to diffusion models. It's sort of like trying to predict the velocity of particles rather than the level of noise. So, they had to adapt the alignment strategies specifically for these flow-based models. They explored three algorithms: ‘Flow-DPO’, ‘Flow-RWR’, and ‘Flow-NRG’. ‘Flow-DPO’, or Direct Preference Optimization for Flow, is particularly interesting; it's a training-time strategy that directly optimizes the model based on human preferences. ‘Flow-RWR’, Reward-Weighted Regression for Flow, is another training-time approach that uses the reward to weight the training signal. And lastly, ‘Flow-NRG’ is an inference-time technique that lets them guide the generation process with the reward model while they’re actually generating new videos.
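Of the three, Flow-RWR is probably the simplest to picture: it's an ordinary flow-matching regression loss, reweighted per sample by the reward. Below is a hedged sketch assuming a rectified-flow parameterization (x_t = (1 - t) * x0 + t * noise, target velocity = noise - x0) and 5D video latents; the softmax-normalized exponential weighting is one common RWR choice, not necessarily the paper's exact one.

```python
import torch

def flow_rwr_loss(model, x0, cond, reward, tau: float = 1.0) -> torch.Tensor:
    """Reward-weighted flow-matching regression (a sketch).

    x0     : clean video latents, shape (B, C, T, H, W) -- shape is an assumption
    reward : per-sample scalar rewards, shape (B,)
    The exp(reward / tau) weighting (here softmax-normalized) is a standard
    RWR choice; the paper's exact weighting scheme may differ.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)
    t5 = t.view(-1, 1, 1, 1, 1)

    noise = torch.randn_like(x0)
    xt = (1.0 - t5) * x0 + t5 * noise        # rectified-flow interpolation
    v_target = noise - x0                    # target velocity

    err = ((model(xt, t, cond) - v_target) ** 2).flatten(1).mean(dim=1)
    weights = torch.softmax(reward / tau, dim=0) * b   # mean weight ~ 1
    return (weights.detach() * err).mean()
```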
Host: Exactly. And what I found fascinating is that they discovered that the original Flow-DPO algorithm, which has a timestep-dependent parameter, wasn't as effective as a version with a constant parameter. Apparently, having this parameter change based on the timestep can create an uneven training environment that negatively impacts the model’s performance. I'm not even going to pretend to fully understand why, but apparently it matters. The fact that simply changing that one little thing improved performance so much is just wild. It really shows how important it is to fine-tune these methods.
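To connect that back to Flow-DPO itself, here is a hedged sketch of the pairwise objective with a constant KL-strength parameter (beta held fixed across timesteps). The tensor shapes, the shared noise and timestep per pair, and the default beta value are illustrative assumptions, not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(model, ref_model, x0_win, x0_lose, cond, beta: float = 500.0):
    """Flow-DPO on one batch of preferred/rejected video-latent pairs (a sketch).

    Assumes rectified flow: x_t = (1 - t) * x0 + t * noise, target velocity
    v = noise - x0, with latents of shape (B, C, T, H, W). `beta` is the
    constant KL strength discussed above; the default value is illustrative.
    """
    b = x0_win.shape[0]
    t = torch.rand(b, device=x0_win.device)
    t5 = t.view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(x0_win)          # shared noise/timestep per pair

    def flow_error(net, x0):
        xt = (1.0 - t5) * x0 + t5 * noise
        v_pred = net(xt, t, cond)
        return ((v_pred - (noise - x0)) ** 2).flatten(1).mean(dim=1)

    err_w, err_l = flow_error(model, x0_win), flow_error(model, x0_lose)
    with torch.no_grad():                     # frozen reference model
        ref_w, ref_l = flow_error(ref_model, x0_win), flow_error(ref_model, x0_lose)

    # Fit the winner better than the loser, relative to the reference model.
    diff = (err_w - ref_w) - (err_l - ref_l)
    return -F.logsigmoid(-beta * diff).mean()
```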
Guest: It does, it really underscores that even these well-established methodologies need to be carefully adapted and tweaked to different contexts. And it's not just about the methodology you use, but how it is implemented. You know, while Flow-RWR is generally a solid method in theory, the results in the paper show it didn't perform quite as well as Flow-DPO. This highlights that, as usual, there is no one-size-fits-all solution. The ‘Flow-NRG’ approach is interesting too, because it means users can actually adjust how much weight each dimension gets during generation. So they can tell the AI ‘focus more on text alignment,’ or ‘focus more on visual quality’ when they generate the video, which adds another level of customisation. Pretty amazing.
Host: Yes! The ability to set custom weights during inference with Flow-NRG is a big deal. It means that users can tailor the video generation process to fit their specific needs, without having to retrain the model. They can be like, ‘Okay, I need a video where the motion is super smooth, but the text alignment isn’t as critical,’ and the model will try to fulfill that need. I also noticed that the reward guidance part uses a lightweight model that’s trained within the latent space. This makes the system much faster since they don’t need to decode the video for processing.
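Here is a rough sketch of what that kind of inference-time steering could look like: a lightweight reward model scores the noisy latent per dimension, the user-supplied weights combine those scores, and the gradient of the weighted score nudges the predicted velocity. The reward model's interface, the guidance scale, and the sign convention are all assumptions and would depend on the sampler's integration direction.

```python
import torch

def flow_nrg_velocity(model, latent_reward, x_t, t, cond, weights, scale: float = 1.0):
    """One reward-guided velocity prediction (a sketch, not the paper's code).

    latent_reward : lightweight model scoring noisy latents per dimension,
                    assumed to return e.g. {"vq": ..., "mq": ..., "ta": ...}
    weights       : user-chosen per-dimension weights, set at inference time
    """
    x_t = x_t.detach().requires_grad_(True)

    scores = latent_reward(x_t, t)                    # per-dimension scores
    total = sum(weights[k] * scores[k].sum() for k in weights)
    grad = torch.autograd.grad(total, x_t)[0]         # d(weighted reward)/d(x_t)

    with torch.no_grad():
        v = model(x_t, t, cond)

    # Nudge the velocity toward higher weighted reward; the sign and scale
    # here are illustrative and depend on the sampler's integration direction.
    return v + scale * grad

# Example: this particular generation cares most about motion quality.
user_weights = {"vq": 0.3, "mq": 0.5, "ta": 0.2}
```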
Guest: That’s right. They've essentially found a way to optimize how the reward model interacts with the video generation model, which is absolutely crucial for building more efficient systems. And the results they showed are pretty convincing. They demonstrated that ‘VideoReward’ outperformed existing reward models and that Flow-DPO, with a fixed parameter, was superior to both ‘Flow-RWR’ and standard supervised fine-tuning methods, particularly in terms of text alignment. This all shows the power of incorporating human preference data into the training process, and all three of their methods are valid, but Flow-DPO seems to perform better. The improvements in all those areas are tangible.
Host: Absolutely. They showed that with Flow-DPO, they could get better results across visual quality, motion quality, and text alignment. They also experimented with aligning a model for text alignment alone, which led to even better results on that dimension. And then, with Flow-NRG, being able to apply custom weighting is just huge. It lets users really personalize the quality of the videos being produced. They also tried different approaches to the reward function and found that training it directly within the latent space, as you mentioned, was significantly more effective than alternatives like training it on the decoded video. That’s really insightful, because it’s not something everyone might think of.
Guest: And it's crucial for real-world applications where you want to generate high-quality content quickly. The efficiency gains from working in the latent space are huge; it takes a lot of the processing load away, which also reduces the computational costs for these kinds of systems. I also appreciate that they didn’t shy away from addressing some of the limitations they identified. For example, they mentioned that they saw some performance degradation from leaning too heavily on DPO during training, but they were able to mitigate that with a LoRA implementation. They also talked about future work, like applying more RLHF algorithms and trying to improve the robustness of the reward model. It's a very comprehensive piece of research.
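For context on that mitigation, the snippet below is a hedged sketch of attaching LoRA adapters to a transformer backbone before DPO fine-tuning, so only low-rank adapters get updated while the base weights stay frozen. The target module names and hyperparameters are assumptions about a typical attention implementation, not the paper's configuration.

```python
from torch import nn
from peft import LoraConfig, get_peft_model

def wrap_with_lora(base_transformer: nn.Module) -> nn.Module:
    """Attach LoRA adapters to a (hypothetical) video transformer's attention projections."""
    lora_config = LoraConfig(
        r=16,                   # adapter rank
        lora_alpha=32,          # scaling factor
        lora_dropout=0.0,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical module names
    )
    peft_model = get_peft_model(base_transformer, lora_config)
    peft_model.print_trainable_parameters()   # only the adapters are trainable
    return peft_model
```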
Host: Yeah, the self-awareness regarding limitations and future research is crucial in the academic process. They are very transparent about things like potential ‘reward hacking’, which is when the model learns to exploit the reward function without improving real quality. And they talk about extending the algorithm to different conditional tasks like image-to-video generation, which could open up so many new creative possibilities. I think it’s really interesting how they touched on all aspects, going all the way from data gathering and annotation to the actual optimization, and they also included a very comprehensive analysis of the results and methodologies used in the whole process. They are not just presenting results, but also doing a very robust study on how the entire process works. I find this kind of approach extremely valuable.