Cosmos World Foundation Model Platform for Physical AI
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited for today's episode. We've got a really fascinating topic to dive into that I think you're going to find incredibly interesting.
Host: Today, we're going to be unpacking a very recent paper – it actually just came out this year. It’s called, “Cosmos World Foundation Model Platform for Physical AI,” and it's from NVIDIA. Now, that title might sound a bit sci-fi, and honestly, some of it is, but it’s also incredibly grounded in where AI is going, especially when it comes to how AI interacts with the real, physical world.
Host: So, buckle up, because we're about to explore how researchers are building digital twins of the world to train AI, what it actually means for physical AI, and what it could mean for all sorts of future tech. It's going to be quite a ride! Let's get right into it.
Host: Okay, so first off, I think the whole concept of a 'World Foundation Model' is super interesting. Essentially, they’re envisioning a model, a digital representation, that can simulate the physical world. It's kind of like a video game engine, but way, way more complex. Imagine building a complete, simulated universe inside a computer.
Guest: Exactly, Leo. And the idea isn't just to create pretty visuals. The core concept revolves around building something that can understand how the world works – the physics, the interactions, the dynamics. Think of it like a sandbox for AI, where they can learn by safely interacting with a digital world that mimics reality.
Host: That's a great way to put it – a sandbox. It lets you play around and learn without breaking anything in the real world. They’re specifically focusing on training Physical AI, right? Not just any AI, but AI that needs to interact with the physical environment. Like robots or self-driving cars.
Guest: Yep, that's exactly right. Physical AI is different. It's not just processing information; it's about sensors, actuators, observing, and acting. The data needed to train these systems is hard to gather. You need sequences of observations and actions, and actions in the real world can be costly or even dangerous, especially during the learning process. Imagine a robot arm in its early training stage – you don't want it smashing valuable objects while it’s still figuring things out. Or, think about self-driving cars. We can't have these AI agents practicing on our streets while they're learning.
Host: That’s totally fair! And that’s where the idea of digital twins comes in, I suppose. We use this 'World Foundation Model' to create a safe space, a digital twin of the real world, so the AI can experiment, fail, and learn, all in a virtual setting. It's like a hyper-realistic video game for robots, allowing them to develop the policies needed for acting in the actual world.
Guest: Precisely. And the paper introduces this platform called 'Cosmos', which includes different components to make these 'World Foundation Models' a reality. It covers things like data curation, pre-trained models, and even guardrails to ensure safe application. It's a whole ecosystem for Physical AI development.
Host: Okay, so let's delve into some of these core components, starting with data curation. From what I gathered, they have this whole pipeline to find useful clips from a massive collection of video data. It sounds like they're not just grabbing random videos; they're actually being selective.
Guest: Exactly. They start with 20 million hours of raw video. That's a mountain of data! But a lot of it is semantically redundant or simply not useful for training the model. So, they've developed a system that intelligently sifts through it, looking for clips with rich dynamics and high visual quality.
Host: Okay, that makes sense. I mean, you need a huge amount of data, but it also has to be good data. The paper mentions they have this five-step pipeline: splitting, filtering, annotation, deduplication, and sharding. Sounds quite systematic. So, what are some of the key aspects of how they refine all of this content?
Guest: Right, let's break it down. The splitting step is all about taking those long videos and breaking them into shorter, visually consistent clips based on shot changes. They use advanced shot boundary detection to segment the videos into scenes. This is important because they want the model to learn physically plausible transitions, not those created by editing. They don’t want abrupt changes like jumping from two people talking in a kitchen to lions chasing zebras on a savanna – that would be nonsensical in a physical context.
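As a rough illustration of the splitting idea, here is a minimal Python sketch that cuts a video wherever the frame-to-frame color histogram changes sharply. The actual pipeline uses learned shot-boundary detectors, so the function names and the threshold below are purely illustrative.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.5):
    """Flag frame indices where the color histogram changes sharply.

    `frames` is a list of HxWx3 uint8 arrays. A real pipeline would use a
    learned shot-boundary detector; this histogram heuristic is illustrative.
    """
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 255))
        hist = hist / hist.sum()
        if prev_hist is not None:
            # L1 distance between normalized histograms of consecutive frames.
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(i)
        prev_hist = hist
    return boundaries

def split_into_clips(frames, min_len=16):
    """Split a long video into visually consistent clips at shot boundaries."""
    cuts = [0] + detect_shot_boundaries(frames) + [len(frames)]
    clips = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        if end - start >= min_len:  # drop clips too short to be useful
            clips.append(frames[start:end])
    return clips
```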
Host: Totally. The goal is realism within each clip, and then I guess the filtering stage is where they remove the clips that just aren't suitable.
Guest: Exactly. Filtering is where they remove noisy video clips with poor quality or random camera motion. They also tag videos with different types of camera movements like pan, zoom, and tilt, which provides additional information to guide model training. Think of a clip where the camera is just jittering around randomly – it doesn't tell the model much that's reliable about how the world itself moves.
Host: And there's more, right? Like text overlay filtering and video type filtering. So it’s not just about visual quality, but also the content itself.
Guest: Absolutely. They filter out videos with added text because it's often associated with visual effects, and they also use a detailed video taxonomy to filter out content that would be less useful for training a world model, like abstract visual patterns or animated content. Then they upsample categories that are more relevant to world foundation models, like human action and object interaction, and downsample those that aren't, like landscapes. This is really about tailoring the data to focus on building a WFM.
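Here is a hedged sketch of what filtering and category re-balancing logic could look like. The metadata fields, category names, thresholds, and weights are invented for illustration and are not the paper's actual schema.

```python
import random

# Hypothetical per-clip metadata produced by upstream classifiers; the field
# names and thresholds are illustrative, not the paper's actual schema.
def keep_clip(meta):
    if meta["quality_score"] < 3.5:          # drop low visual quality
        return False
    if meta["camera_motion"] == "jittery":   # drop random/shaky camera motion
        return False
    if meta["has_overlay_text"]:             # drop clips with burned-in text
        return False
    if meta["video_type"] in {"animation", "abstract_pattern"}:
        return False                         # not useful for a physical world model
    return True

# Re-weight categories so physically rich content is sampled more often.
SAMPLING_WEIGHT = {"human_action": 2.0, "object_interaction": 2.0,
                   "driving": 1.5, "landscape": 0.3}

def resample(clips):
    kept = []
    for clip in clips:
        if not keep_clip(clip["meta"]):
            continue
        w = SAMPLING_WEIGHT.get(clip["meta"]["category"], 1.0)
        # Duplicate (upsample) or randomly drop (downsample) according to weight.
        copies = int(w) + (random.random() < (w - int(w)))
        kept.extend([clip] * copies)
    return kept
```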
Host: That makes a lot of sense. I guess you don't want to train your physics-based model on a cartoon, right? It’s gotta be realistic, real world kind of stuff. Then the next bit is annotation, where they’re actually giving a description to the video content using a VLM? What was that part about?
Guest: Yep. They use a Vision-Language Model, or VLM, to create a detailed caption for each video clip. This gives the model a language-based understanding of the visual content, which helps it learn to generate videos from both visual and textual inputs. They're not relying on alt text here, because they want consistency in how each video is described. The VLM is also prompted to focus on material facts and details rather than adding extra style or formatting to the captions.
Host: Okay, I can see how that’d be key in making the model useful for many different contexts. And deduplication - it’s there to make sure there’s not an overlap, or that there isn’t repetitive information going in? Sort of like, ‘oh, we already saw that!?’
Guest: Precisely. They use semantic deduplication, meaning they're not just looking for exact copies but for clips with similar visual content. This helps create a diverse, compact, and well-balanced training dataset, keeps the model from over-memorizing specific training samples, and significantly improves overall training efficiency.
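A minimal sketch of semantic deduplication, assuming per-clip embeddings are already available from some video encoder; the greedy cosine-similarity filter below is illustrative, since a production pipeline would use clustering or approximate nearest neighbors at this scale.

```python
import numpy as np

def dedup_by_embedding(embeddings, threshold=0.97):
    """Greedy semantic deduplication.

    `embeddings` is an (N, D) array of per-clip embeddings. A clip is kept only
    if its cosine similarity to every previously kept clip stays below
    `threshold`. This O(N^2) version is for illustration only.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept  # indices of clips to keep
```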
Host: That’s a really good point, actually. Memorization is something you definitely don't want. And lastly, sharding, which I'm guessing is about preparing the data for training purposes?
Guest: Exactly. They're essentially packaging the processed video clips into a format their model trainer can consume directly, sharding the data by things like resolution, aspect ratio, and length. It's about making the training data as efficient and usable as possible, and about fitting it to different parallel training setups.
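For a sense of what sharding might involve, here is a toy sketch that buckets clips by resolution, aspect ratio, and length and writes shard manifests. The bucket keys and file format are assumptions, not the paper's implementation.

```python
import json
from collections import defaultdict

def shard_key(meta, frames_per_bucket=128):
    """Bucket clips so each shard holds clips of similar shape and length."""
    aspect = "wide" if meta["width"] >= meta["height"] else "tall"
    length_bucket = meta["num_frames"] // frames_per_bucket
    return (meta["height"], aspect, length_bucket)

def write_shards(clips, max_clips_per_shard=1000):
    buckets = defaultdict(list)
    for clip in clips:
        buckets[shard_key(clip["meta"])].append(clip["path"])
    for key, paths in buckets.items():
        for i in range(0, len(paths), max_clips_per_shard):
            shard = {"bucket": str(key), "clips": paths[i:i + max_clips_per_shard]}
            name = f"shard_{key[0]}p_{key[1]}_{key[2]}_{i // max_clips_per_shard:05d}.json"
            with open(name, "w") as f:
                json.dump(shard, f)
```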
Host: Okay, so we’ve covered the data side of things. That's a really involved process. Let's move on to the 'tokenizer'. What exactly is it, and why is it so important for a WFM?
Guest: The tokenizer is a fundamental component. It's responsible for turning the raw video data into a compressed, efficient representation that the model can process. They essentially learn a bottlenecked latent space in an unsupervised manner. Think of it like converting a large file into a zip file - you get a smaller, more manageable representation.
Host: So it's about compressing the data so it’s easier to handle for a big model. And the paper mentions they have two types of tokenizers, continuous and discrete. What are their differences?
Guest: Right. Continuous tokenizers create continuous latent embeddings, like vectors, that are well-suited for models that generate data by sampling from continuous distributions. You see these in latent diffusion models. Discrete tokenizers encode visual data into quantized indices, integers essentially, which is necessary for models that are trained with cross-entropy loss, like autoregressive transformers. It’s sort of like a library: a continuous tokenizer describes each book with a set of measured attributes, while a discrete tokenizer just assigns each book a catalog number on the shelf.
Host: That makes a lot more sense now. So which type of model uses what tokenizer?
Guest: The diffusion-based models use continuous tokens, and the autoregressive models use discrete tokens. They both serve the same fundamental purpose – making the visual information more tractable – but their output forms are different and work better with different model architectures.
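To make the continuous-versus-discrete distinction concrete, here is a minimal PyTorch sketch of the two interfaces. The tiny encoder/decoder and the nearest-codebook lookup are illustrative only; the actual Cosmos tokenizers are deep causal networks operating in wavelet space and use their own quantization scheme.

```python
import torch
import torch.nn as nn

class ContinuousTokenizer(nn.Module):
    """Maps video (B, 3, T, H, W) to continuous latents, as used by diffusion WFMs."""
    def __init__(self, latent_channels=16):
        super().__init__()
        self.encoder = nn.Conv3d(3, latent_channels, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose3d(latent_channels, 3, kernel_size=4, stride=4)

    def encode(self, video):
        return self.encoder(video)          # continuous latent tensor

    def decode(self, latent):
        return self.decoder(latent)

class DiscreteTokenizer(ContinuousTokenizer):
    """Adds a codebook lookup so each latent becomes an integer index,
    as used by autoregressive WFMs trained with cross-entropy loss."""
    def __init__(self, latent_channels=16, codebook_size=1024):
        super().__init__(latent_channels)
        self.codebook = nn.Embedding(codebook_size, latent_channels)

    def encode(self, video):
        z = self.encoder(video)                                  # (B, D, t, h, w)
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, z.shape[1])  # (N, D)
        dist = torch.cdist(flat, self.codebook.weight)           # distance to each code
        idx = dist.argmin(dim=1)                                 # nearest code index
        return idx.view(z.shape[0], *z.shape[2:])                # integer tokens
```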
Host: Okay, it's all starting to come together. The paper emphasizes pushing the compression rate while still maintaining visual reconstruction quality, which is also crucial, right? You have to make sure you're still representing what the video content actually is.
Guest: Absolutely. The goal is to find the sweet spot where you compress the data as much as possible without losing key details. That's the central challenge in tokenizer design: compress too aggressively and you lose too much information; compress too lightly and you don't save much space or computation. They've designed their architecture to operate in wavelet space, which removes pixel-level redundancies and allows for better compression.
Host: And they also built these tokenizers to be temporally causal. It is a fairly interesting design decision. Can you explain more about it?
Guest: Yes, this means that the token computation for each video frame depends only on the current and past frames; it never looks ahead into the future. This is crucial for both training and application. During training, it allows joint image and video training, because a causal video tokenizer can also be treated as an image tokenizer when the input is just a single image. In application, it aligns better with Physical AI systems, which live in a causal world: the tokens for the present can't rely on future information.
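Here is a small sketch of the temporal-causality idea using left-only padding on the time axis; the module below is an assumption made for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution padded so each output frame only sees current/past frames.

    A minimal sketch of temporal causality: all temporal padding goes on the
    *left* of the time axis, so the first frame can be tokenized on its own,
    which is what lets the same network double as an image tokenizer.
    """
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        self.time_pad = kernel[0] - 1               # all temporal padding in the past
        space_pad = (kernel[1] // 2, kernel[2] // 2)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, padding=(0, *space_pad))

    def forward(self, x):                            # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0)) # pad time axis on the left only
        return self.conv(x)

# A single image (T=1) passes through the same module with its shape preserved:
frames = torch.randn(1, 3, 1, 64, 64)
print(CausalConv3d(3, 8)(frames).shape)  # torch.Size([1, 8, 1, 64, 64])
```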
Host: That causal design really makes a lot of sense when you think about it from a Physical AI perspective. The AI needs to operate based on past and present information, not the future, which it obviously can’t see. They also claim that their tokenizer is much faster than other existing models, by quite a bit it seems, while maintaining very competitive reconstruction quality.
Guest: That’s right. In their evaluation, their tokenizers achieve significant improvements in reconstruction quality and run much faster than previous models, for both the continuous and the discrete versions. It’s a big step for efficient video tokenization.
Host: All right, let's move onto the main players – the World Foundation Models themselves. They have two families, diffusion-based and autoregressive, right? And I remember you mentioning earlier that they're both about breaking complex problems down into easier sub-problems.
Guest: Exactly. Diffusion models break down video generation into a sequence of denoising steps. They start with a noisy video and then gradually remove the noise. Autoregressive models, on the other hand, generate video step-by-step, predicting the next token based on past generations. Both methods use transformers, which are really powerful and scalable for video processing and generation.
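As a rough picture of the diffusion side, here is a minimal Euler-style denoising loop in latent space, assuming a `model(x, sigma)` callable that predicts the clean latent from a noisy one (one common convention). The real sampler and noise schedule are considerably more elaborate.

```python
import math
import torch

def denoise_video(model, shape, num_steps=50, sigma_max=80.0, sigma_min=0.02):
    """Start from pure noise in the tokenizer's latent space and repeatedly denoise."""
    sigmas = torch.logspace(math.log10(sigma_max), math.log10(sigma_min), num_steps)
    x = torch.randn(shape) * sigmas[0]               # initial fully noisy latent
    for i in range(num_steps - 1):
        denoised = model(x, sigmas[i])               # model's estimate of the clean latent
        d = (x - denoised) / sigmas[i]               # direction toward less noise
        x = x + d * (sigmas[i + 1] - sigmas[i])      # Euler step to the next noise level
    return x
```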
Host: So, both are capable, just in different ways. They’re using a diffusion model as well as an autoregressive model to get different perspectives on how this kind of world simulation can work. And the paper talks about scaling these models to 10,000 GPUs – that's an insane amount of compute power!
Guest: Absolutely. They utilize all sorts of parallelism techniques to efficiently train their models, given the computational demands of high resolution videos. It's quite impressive the scale they’ve managed. For their diffusion models, they adapt a model originally designed for label-conditional image generation (called DiT), but modify it to make it better suited for controllable video generation. They incorporate 3D positional embeddings, text conditioning, and they also normalize query and key vectors.
Host: Yeah, they also mentioned the use of AdaLN-LoRA for parameter reduction, which I didn’t fully understand. Something to do with low-rank adaptations?
Guest: That’s right. It’s a way to reduce the number of parameters in the model without sacrificing performance. They employ Low-Rank Adaptation, or LoRA, to make the adaptive layer normalization layers much more parameter efficient. This leads to a significant reduction in parameter count while maintaining the same level of generation performance. It’s all about efficiency.
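A minimal sketch of what adaptive layer normalization with a low-rank projection could look like; the dimensions and module structure below are assumptions, meant only to show where the parameter savings come from, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AdaLNLoRA(nn.Module):
    """Adaptive LayerNorm whose scale/shift come from a low-rank projection
    of the conditioning vector (e.g. a denoising-timestep embedding).

    A dense AdaLN layer needs a full (cond_dim x 2*hidden) matrix per block;
    factoring it through rank r cuts that parameter count roughly by hidden / r.
    """
    def __init__(self, hidden=1024, cond_dim=1024, rank=32):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.down = nn.Linear(cond_dim, rank, bias=False)   # low-rank bottleneck
        self.up = nn.Linear(rank, 2 * hidden, bias=True)    # produces scale and shift

    def forward(self, x, cond):          # x: (B, N, hidden), cond: (B, cond_dim)
        scale, shift = self.up(self.down(cond)).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```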
Host: And for the autoregressive models, it sounds like they're adopting methods inspired by Large Language Models, using a next-token prediction task. How do they make that work for videos?
Guest: They essentially treat video frames as a sequence of discrete tokens, similar to words in a text. They then train a Transformer decoder to predict the next token using past video tokens as context. They also add some modifications specific to video, including 3D-aware positional embeddings, cross-attention layers to incorporate textual information, and QK-normalization to improve training stability. These are introduced specifically to tackle the challenges of large-scale video modeling.
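To illustrate the next-token objective on discrete video tokens, here is a tiny stand-in model and loss computation; it deliberately omits the 3D positional embeddings, text cross-attention, and QK-normalization mentioned above, and the vocabulary size is an assumption.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # size of the discrete video-token codebook (illustrative)

class TinyVideoLM(nn.Module):
    """Stand-in for the autoregressive WFM: embeds discrete video tokens
    and predicts the next one with a causal Transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                       # tokens: (B, L) integer ids
        L = tokens.shape[1]
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.head(h)                          # (B, L, VOCAB) logits

tokens = torch.randint(0, VOCAB, (2, 128))           # flattened video tokens
logits = TinyVideoLM()(tokens)
# Next-token objective: predict token t+1 from tokens up to t.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
```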
Host: So they’re treating videos a bit like ‘sentences’, where each token is a little piece of the video. And with those two types of models, what kind of results are they actually getting? Are there real videos we can see and look at?
Guest: Yep, they do show some examples. The diffusion-based models, for instance, create impressive, high-resolution videos aligned with input text prompts. These diffusion-based models also support both image and video conditioning and can generate long videos autoregressively. The autoregressive models, while sometimes a little blurry, can create videos from a single conditioning frame and also produce videos aligned with text prompts. They additionally use a diffusion decoder to enhance the quality of the videos produced by the autoregressive models, mapping the discrete tokens back into the continuous space.
Host: They've also got this interesting thing called a ‘Prompt Upsampler’. What’s that all about and why do they use it?
Guest: This is used with the diffusion-based models to bridge the gap between the detailed video descriptions used during training and the typically shorter prompts users write. The upsampler takes a basic prompt and expands it with more descriptive information, adding visual details and keeping the description structure consistent, which leads to higher-quality output. It’s like taking a very short sentence and turning it into a paragraph full of detail.
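A toy sketch of the prompt-upsampling interface, with a hypothetical `llm_generate` callable standing in for whatever upsampler model is used; the instruction text and example output are invented for illustration.

```python
UPSAMPLER_INSTRUCTION = (
    "Expand the user's short video prompt into a detailed description. "
    "Keep the original intent, add concrete visual details (subjects, motion, "
    "lighting, camera behavior), and do not invent text overlays or logos."
)

def upsample_prompt(short_prompt, llm_generate):
    """`llm_generate` is any text-in/text-out callable wrapping a language model."""
    return llm_generate(f"{UPSAMPLER_INSTRUCTION}\n\nUser prompt: {short_prompt}")

# Example with a trivial stand-in "model" that just appends extra detail:
detailed = upsample_prompt(
    "a robot arm stacking blocks",
    llm_generate=lambda p: p.split("User prompt: ")[1] + ", on a cluttered workbench, "
    "steady tripod camera, soft indoor lighting, smooth deliberate arm motion.")
print(detailed)
```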
Host: Okay, so it’s a way to make sure user prompts give the best results, even if those prompts are very short or not fully descriptive. It sounds like they’ve really thought through all the different steps to get this model ready for real-world applications. This is all getting fairly complex, so let’s shift gears a bit and look into something else mentioned in the paper. They also discuss the applications of these pre-trained WFMs, fine-tuning them for other use cases, such as camera control, robotics, and autonomous driving. So how do they actually go about doing this and what types of applications can we see?
Guest: Exactly. The pre-trained WFMs are generalists, capturing the general knowledge of the visual world. They're designed to be fine-tuned to adapt for various tasks. For camera control, they integrate camera pose information, so the world model can generate videos based on how a camera is moving through a scene. The paper shows they fine-tune on a dataset with diverse camera poses and can create 3D navigable worlds. The model can generate videos that are both temporally coherent and consistent with their underlying 3D structure. It is like having a simulator with a controllable camera that can generate videos from different viewpoints.
Host: That sounds incredible. It's like creating a virtual reality world on the fly. And for robotics, it seems like they are also using two methods, action-based and instruction-based?
Guest: Yeah, for robotics, they focus on two key areas. One is instruction-based video prediction, where the model generates a video of a robot executing an instruction described in text. This is useful for seeing how a robot would respond to a given instruction: the model looks at the current video and extrapolates based on the instruction. They also look at action-based next-frame generation, where the model predicts the immediate next frame based on the action the robot performs. Think of it as a real-time simulation of what's going to happen at the next instant of robotic movement.
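Here is a toy sketch of the action-conditioned next-frame interface on raw pixels; the real model operates on latent video tokens, so the module below (and its assumed 7-dimensional action vector) is only meant to show how an action can be injected into the prediction.

```python
import torch
import torch.nn as nn

class ActionConditionedPredictor(nn.Module):
    """Toy sketch of action-based next-frame prediction: given the current
    frame and a robot action (e.g. an end-effector command), predict the
    next frame."""
    def __init__(self, action_dim=7, hidden=64):
        super().__init__()
        self.frame_enc = nn.Conv2d(3, hidden, kernel_size=8, stride=8)
        self.action_enc = nn.Linear(action_dim, hidden)
        self.frame_dec = nn.ConvTranspose2d(hidden, 3, kernel_size=8, stride=8)

    def forward(self, frame, action):                     # frame: (B,3,H,W), action: (B,7)
        h = self.frame_enc(frame)                         # encode current observation
        h = h + self.action_enc(action)[:, :, None, None] # inject the action
        return self.frame_dec(h)                          # predicted next frame

frame = torch.randn(1, 3, 256, 256)
action = torch.randn(1, 7)
next_frame = ActionConditionedPredictor()(frame, action)  # (1, 3, 256, 256)
```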
Host: So it’s kind of like, if I tell the robot to move the block, it simulates that action, and if I give it an exact movement to perform, it can also simulate what that will look like in the video space. And what about autonomous driving? That’s another area they’re looking at, right? I suppose it's where a realistic model will be very useful?
Guest: Yes, they fine-tune their models to simulate multiple camera views at the same time, which matches the multi-camera sensor setups on autonomous vehicles. They also train the model on trajectory data, which means you can generate a video simulation of a car driving along a given path. This allows for detailed training scenarios for self-driving AI in ways that would otherwise be too costly, dangerous, or impossible to set up in the real world.
Host: It’s like a real-world driving simulator that’s actually very close to reality. This is incredible. Finally, let’s talk about the ‘guardrails’ they’ve developed for their platform. What was the purpose of those?
Guest: The guardrails are critical for ensuring safe and responsible use of the models. They've developed a two-stage system: pre-Guard and post-Guard. Pre-Guard blocks harmful text prompts using a keyword blocklist plus a Large Language Model-based classifier that flags potentially unsafe prompts. Post-Guard then filters unsafe visual content with a safety classifier and also blurs faces in the output.
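As a rough sketch of the two-stage guardrail idea, here is a minimal pre-Guard/post-Guard pair with stubbed-in classifiers; the keyword list and function boundaries are illustrative assumptions, not the platform's actual safety stack.

```python
BLOCKLIST = {"violent", "gore"}   # illustrative keywords, not the real list

def pre_guard(prompt, llm_safety_classifier):
    """Stage 1: cheap keyword check. Stage 2: an LLM-based safety classifier
    (any callable returning True for unsafe prompts; supplied by the caller)."""
    if any(word in prompt.lower() for word in BLOCKLIST):
        return False                       # refuse before any generation happens
    if llm_safety_classifier(prompt):
        return False
    return True

def post_guard(video_frames, frame_safety_classifier, blur_faces):
    """After generation: reject videos with unsafe frames, then blur detected faces."""
    if any(frame_safety_classifier(f) for f in video_frames):
        return None
    return [blur_faces(f) for f in video_frames]
```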
Host: So it’s like a double layer of security to protect against any possible misuse of the technology. And from what you mentioned, they have even done red-teaming efforts to ensure that it functions as expected?
Guest: Yep. They have a dedicated red team whose job is to actively try to break the system using adversarial and standard test prompts to find vulnerabilities. These red team results are then carefully reviewed by trained annotators to determine the presence of any unsafe content. It shows a commitment to safety, as this is a powerful technology.
Host: This is such a fascinating area of research. I can see how these world foundation models could potentially revolutionize so many different areas, especially when it comes to Physical AI and the training of robots and other physical systems. Well, this is all we have time for today!