2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from webpages and face challenges such as low knowledge density, loose image-text relations, and poor logical coherence between images. Meanwhile, the internet hosts vast instructional videos (e.g., online geometry courses) that humans widely use to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. We then progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos and organize them into an image-text interleaved corpus in temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly on knowledge- and reasoning-intensive tasks such as ScienceQA and MathVista. Moreover, VLMs pretrained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at \url{https://github.com/DAMO-NLP-SG/multimodal_textbook}.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today we're diving into a really fascinating area at the intersection of AI and education. It's something that's been brewing for a while, but now with the latest AI tools it's becoming incredibly powerful. We're going to be talking about a new way to teach AI by leveraging the vast amount of educational video content online. I'm your host, Leo, and I'm super excited to get into this.
Guest: Hi Leo, and hey everyone! Yeah, I think this is a topic that's incredibly relevant right now. It's not just about making AI smarter, but also about how AI can help us learn better. The way educational videos are being used now is a real opportunity. It’s pretty cool.
Host: Absolutely. So, we're specifically going to be discussing a really interesting paper that introduces this concept of a 'multimodal textbook'. It's a dataset created using 2.5 years of instructional videos. That's a lot of learning material! I find it amazing how the researchers have been able to take that content, and not just use it, but also organize it into an easily digestible format, similar to a textbook but with both visuals and text, you know?
Guest: Yeah, when you think about it, the sheer volume of instructional content out there is staggering. But, like you said, it's about how to structure it. It's not just randomly throwing videos at a model and expecting it to magically learn everything. The idea of a multimodal textbook, with that structured combination of images and text, makes much more sense. It’s almost like learning with a teacher, who's explaining diagrams, you know.
Host: Exactly! The paper actually highlights that existing datasets, often scraped from websites, have issues like weak text-image connections, low knowledge density, and even incoherent image sequences. And that's where these instructional videos really shine, right? They inherently have that strong connection between what you're seeing and what's being explained.
Guest: Definitely. Think about an online math lecture, for instance. You've got the instructor explaining concepts verbally, and then you have the corresponding visuals, like equations or diagrams, appearing on screen. That’s a really tight coupling of visual and textual information. Webpages often lack that. A page might show a diagram of something, and it would just sit there next to text about the general topic, not specifically tied to the image.
Host: Right, and that’s not even considering the logical coherence. If you think about a step-by-step explanation of a physics problem, you'd see a sequence of images or animations that build on each other, right? That's the kind of structured knowledge the authors are trying to leverage, which a webpage might not provide.
Guest: Yeah, that's the key. It’s not just about having a bunch of images and text; it's about how those images and text are related both semantically and logically. The authors also talk about how web pages have all sorts of content mixed in with the important stuff, like ads, entertainment content, things that aren't really foundational knowledge. It dilutes the dataset.
Host: Precisely! So, they’ve built this textbook from the ground up, focusing on instructional videos, specifically for VLM training – that's Vision-Language Models, for our listeners who might not be familiar with it. It’s interesting they also mention how these videos are actually widely used by us humans, to learn. It's like we have this gold mine of knowledge that's currently underutilised for AI.
Guest: That's such a great point! We often talk about how AI needs to learn from human-like data, and educational videos are about as human-like as it gets when we talk about learning. These videos are how we, you know, a lot of us, actually go and try to learn new things. It makes perfect sense for AI to learn from them too.
Host: Okay, so let's dive into the construction process of this multimodal textbook. The first step was to collect the right videos, right? But they didn't just randomly grab videos from YouTube. They used an LLM, a Large Language Model, to create a kind of knowledge map first. I found this quite interesting.
Guest: Yeah, that's a really clever approach. They used the LLM to create a taxonomy that had four levels: subject, course, sub-course, and then finally, the knowledge point. So, they had a very granular map of what kind of educational content they wanted. It's almost like making an outline before writing an essay. This way, they are systematically categorizing the vast world of knowledge.
Host: Exactly. They went from broad subjects like 'mathematics' down to very specific concepts like 'the definition of irrational numbers'. This taxonomy allowed them to really pinpoint the types of videos they needed. And, of course, they didn't just stick to one subject. They covered six major areas, including physics, chemistry, even computer science and engineering. They seem to have focused on what is considered core foundational knowledge.
Guest: Yeah, so they were aiming for broad coverage, not just deep dives into one area. This makes sense if you're aiming for a more foundational level of learning. It’s like they’re trying to build a very strong knowledge base across several important STEM subjects. And, with 3,915 knowledge points, that’s some detailed taxonomy. I bet that was challenging to pull together!
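To make the taxonomy idea a bit more concrete, here is a minimal sketch of what a four-level subject → course → sub-course → knowledge-point hierarchy could look like in code, with the knowledge points flattened into search keywords. The entries and structure below are illustrative placeholders, not the paper's actual taxonomy.

```python
# Illustrative four-level taxonomy (subject -> course -> sub-course -> knowledge point).
# All entries are made-up placeholders, not the paper's real taxonomy.
taxonomy = {
    "mathematics": {
        "algebra": {
            "number systems": [
                "the definition of irrational numbers",
                "properties of rational numbers",
            ],
        },
    },
    "physics": {
        "mechanics": {
            "kinematics": [
                "displacement vs. distance",
                "uniform acceleration equations",
            ],
        },
    },
}

def knowledge_points(tax):
    """Flatten the taxonomy into (subject, course, sub_course, knowledge_point) tuples."""
    for subject, courses in tax.items():
        for course, sub_courses in courses.items():
            for sub_course, points in sub_courses.items():
                for point in points:
                    yield subject, course, sub_course, point

# Each knowledge point doubles as a search keyword for gathering candidate videos.
search_queries = [point for *_, point in knowledge_points(taxonomy)]
print(search_queries)
```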
Host: Definitely. So, once they had this taxonomy, they used those knowledge points as keywords to actually search for videos on YouTube. I wonder how much data deduplication was involved after that search? I can imagine there could be lots of videos that have overlapping content.
Guest: Yeah, you’re right. They mention that they deduplicated based on video IDs and used another LLM to review the video metadata, like titles, descriptions, and even comments, to filter out low-quality or inappropriate content. That’s another important step. You wouldn’t want to feed the model anything that's not relevant or even harmful.
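For the curious, this is roughly what the deduplication and metadata screening step could look like; the sample results, the keyword blocklist, and the field names are all invented here, standing in for the paper's LLM-based review of titles, descriptions, and comments.

```python
# Minimal sketch of deduplicating search results by video ID and screening metadata.
search_results = [
    {"video_id": "abc123", "title": "Irrational numbers explained", "description": "A short lesson."},
    {"video_id": "abc123", "title": "Irrational numbers explained", "description": "A short lesson."},  # duplicate hit
    {"video_id": "xyz789", "title": "Top 10 funny cat moments", "description": "Compilation."},
]

# 1) Deduplicate by video ID.
unique_videos = {v["video_id"]: v for v in search_results}.values()

# 2) Crude metadata screen (stand-in for the paper's LLM-based quality review).
BLOCKLIST = {"funny", "prank", "reaction"}

def looks_instructional(video):
    text = (video["title"] + " " + video["description"]).lower()
    return not any(word in text for word in BLOCKLIST)

candidates = [v for v in unique_videos if looks_instructional(v)]
print([v["video_id"] for v in candidates])  # ['abc123']
```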
Host: Right, and I think that’s where this meticulous curation really comes into play. You have to think about those steps, because if the foundation is built on poor quality or inappropriate data, then the results from using it won't be great either. So, the meticulousness of the data gathering is absolutely key here. I like that.
Guest: Totally. It's not just about gathering a lot of data; it's about gathering good data, right? It’s interesting that they’re using LLMs at so many stages of the process, almost like having an AI assistant helping them create the dataset. I think that’s a trend we're likely to see more of in the future.
Host: So, after the videos were collected and cleaned, then they moved to the extraction phase, right? And this was multi-level. They weren't just extracting text, but also the visual content, the actual frames from the videos themselves.
Guest: Exactly. They had this video-to-textbook pipeline where they extracted information from each video at multiple levels. First, they extracted the audio and transcribed it into text using ASR, or Automatic Speech Recognition. But they didn’t just stop there, as raw ASR can be a bit messy sometimes.
Host: Yeah, that’s true. I've definitely experienced that with transcriptions. They mention that instructors use colloquial language, so the raw transcriptions can have higher perplexity, meaning, kind of messier text. That's where the LLM comes in again. They used another LLM to rewrite the transcripts, improving their fluency and coherence while keeping the core meaning intact.
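As a rough sketch of this stage, the snippet below runs off-the-shelf ASR with openai-whisper and then hands the raw transcript to an LLM for rewriting. The paper does not specify these exact tools, and `call_llm` is a hypothetical helper standing in for whatever model endpoint is used.

```python
import whisper  # openai-whisper; one possible ASR backend, not necessarily the paper's choice

model = whisper.load_model("base")
result = model.transcribe("lecture_video.mp4")  # placeholder path; whisper extracts the audio via ffmpeg
raw_transcript = result["text"]

# Prompting an LLM to polish colloquial ASR text while preserving meaning.
# `call_llm` is a hypothetical function, not a real API.
REWRITE_PROMPT = (
    "Rewrite the following lecture transcript so it reads fluently and coherently, "
    "without changing its meaning or removing any technical content:\n\n{transcript}"
)

def rewrite_transcript(transcript, call_llm):
    return call_llm(REWRITE_PROMPT.format(transcript=transcript))
```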
Guest: Right, and that's also what allows them to perform another round of filtering. They actually use the ASR transcriptions to filter out videos that are not genuinely instructional, by checking things like relevance, knowledge density, and even the quality of the transcription itself.
Host: Yeah, they look for whether the ASR matches the knowledge point they're targeting, and if it actually contains meaningful educational content rather than just filler words. And, if the ASR is just a mess of repetitive or nonsensical text, then that video is out. It’s almost like quality control at each stage. It’s pretty rigorous.
Guest: And that was just the video level! They went a step further by splitting the long videos into shorter clips based on the timestamps of the ASR text, so that the text and the visual information are actually temporally aligned. It's like how you'd watch a video in class, where the explanation matches up with what you’re seeing on the screen. And then, another round of filtering happens after that.
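Here's a small illustration of that clip-splitting idea: consecutive ASR segments, each carrying start and end timestamps, are grouped into clips of bounded duration so the text stays temporally aligned with the visuals. The 30-second budget and the segment format are assumptions for the example, not the paper's settings.

```python
# Rough sketch: group consecutive ASR segments into clips of bounded duration.
def split_into_clips(segments, max_clip_seconds=30.0):
    clips, current = [], []
    for seg in segments:  # each seg: {"start": float, "end": float, "text": str}
        current.append(seg)
        if current[-1]["end"] - current[0]["start"] >= max_clip_seconds:
            clips.append(current)
            current = []
    if current:
        clips.append(current)
    return [
        {
            "start": c[0]["start"],
            "end": c[-1]["end"],
            "text": " ".join(s["text"].strip() for s in c),
        }
        for c in clips
    ]

segments = [
    {"start": 0.0, "end": 12.5, "text": "Today we will define irrational numbers."},
    {"start": 12.5, "end": 28.0, "text": "Recall that a rational number is a ratio of integers."},
    {"start": 28.0, "end": 41.0, "text": "The square root of two cannot be written that way."},
]
print(split_into_clips(segments))
```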
Host: That's right, the clip-level filtering is so interesting. They used another VLM, a Vision-Language Model, to generate captions for each clip, and then checked the similarity between that caption and the ASR transcript. If the visual content doesn't seem related to the text, the clip gets tossed out. But it's crucial to note that even when a clip was visually uninformative, they still retained the useful transcript, which is a very clever design. It means they don't lose any useful text.
Guest: That's a smart move, because a video might have a few scenes with no real visuals, like if a teacher is just talking to the screen, but the audio explanation could still be valuable. It makes sense to separate the visual from the textual, so you can still retain the textual parts even if the corresponding video clip isn't good enough. It shows they were very thoughtful in making this.
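One plausible way to implement that clip-level check is to embed the VLM-generated caption and the ASR text with a sentence encoder and compare them. The model choice and threshold below are illustrative assumptions, and the caption is assumed to come from an upstream captioning step not shown here.

```python
from sentence_transformers import SentenceTransformer, util

# Keep a clip's visuals only if its caption and ASR text are semantically related;
# the transcript itself can be retained regardless.
text_model = SentenceTransformer("all-MiniLM-L6-v2")

def keep_clip_visuals(clip_caption: str, clip_asr: str, threshold: float = 0.3) -> bool:
    emb = text_model.encode([clip_caption, clip_asr], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return similarity >= threshold

# Example: a matching caption keeps the frames; an off-topic one would drop them.
print(keep_clip_visuals(
    "A slide showing the proof that the square root of two is irrational",
    "So suppose root two equals p over q in lowest terms...",
))
```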
Host: And finally, at the keyframe level, they extract the keyframes themselves, so the actual visual content. They used an algorithm based on the similarity between consecutive frames to identify important moments, which makes sense, as not every frame in a video is actually useful. I like how they’re going from the overall video, down to the specific clip, down to even the specific frames within a clip.
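A toy version of that keyframe selection might look like the following: sample frames, compare each candidate against the last kept frame using grayscale histogram correlation, and keep it when the similarity drops below a threshold. The exact similarity measure, sampling rate, and thresholds used by the authors may differ.

```python
import cv2

# Toy keyframe extractor based on similarity between consecutive sampled frames.
def extract_keyframes(video_path, similarity_threshold=0.85, sample_every_n=30):
    cap = cv2.VideoCapture(video_path)
    keyframes, last_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every_n == 0:  # subsample to avoid comparing every frame
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            # Keep the frame if it differs enough from the last kept one.
            if last_hist is None or cv2.compareHist(last_hist, hist, cv2.HISTCMP_CORREL) < similarity_threshold:
                keyframes.append((frame_idx, frame))
                last_hist = hist
        frame_idx += 1
    cap.release()
    return keyframes
```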
Guest: Yeah, it's that coarse-to-fine strategy they mentioned earlier, which makes sure no unnecessary content is in the data, and also ensures the content they have is in an optimal format. So, they're not only discarding useless content but also selecting the most informative frames, not the redundant ones. Then, as if that wasn’t enough, they also used OCR to extract text from the keyframes themselves, like formulas and mathematical symbols.
Host: That’s right, because a lot of these instructional videos contain text that’s embedded in the visuals, like slides or diagrams. OCR is essential to capture that information. And then, just like with the ASR text, they also filter the OCR to remove anything that's redundant or low quality. They're really focusing on a high-quality dataset, which is so important for effective learning.
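As a sketch, OCR over a keyframe could be as simple as the snippet below, using pytesseract as a generic stand-in engine (the paper doesn't mandate a specific one) plus a crude length-based quality check; the filename is a placeholder.

```python
import pytesseract
from PIL import Image

# Extract on-screen text (slides, formulas) from a keyframe; requires the Tesseract binary.
def ocr_keyframe(image_path: str) -> str:
    text = pytesseract.image_to_string(Image.open(image_path))
    # Crude cleanup: drop very short or whitespace-only results as low-quality OCR.
    return text.strip() if len(text.strip()) > 3 else ""

print(ocr_keyframe("keyframe_0001.png"))  # placeholder filename
```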
Guest: Definitely. The final dataset consists of those keyframes, the refined ASR, and the extracted OCR, all organized into an interleaved format based on time. So you have the keyframes mixed with the corresponding text. That’s what gives it the textbook-like structure that they’re after.
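The interleaving itself is conceptually straightforward: merge keyframes and text blocks by timestamp into one time-ordered sample. The field names and schema below are illustrative, not the released dataset's exact format.

```python
# Merge keyframes and text blocks by timestamp into one interleaved, textbook-style sample.
def interleave(keyframes, text_blocks):
    """keyframes: [{"time": float, "path": str}], text_blocks: [{"time": float, "text": str}]"""
    items = [(k["time"], "image", k["path"]) for k in keyframes] + \
            [(t["time"], "text", t["text"]) for t in text_blocks]
    items.sort(key=lambda x: x[0])
    return [
        {"type": "image", "path": payload} if kind == "image" else {"type": "text", "text": payload}
        for _, kind, payload in items
    ]

sample = interleave(
    keyframes=[{"time": 3.0, "path": "frame_0003.png"}, {"time": 20.0, "path": "frame_0020.png"}],
    text_blocks=[{"time": 0.0, "text": "Today we define irrational numbers."},
                 {"time": 15.0, "text": "The square root of two is one example."}],
)
print(sample)
```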
Host: It's really quite a rigorous process, and the results are impressive. They created a dataset of 6.5 million images and 0.75 billion text tokens, which is, you know, pretty substantial. And it's all extracted from 75,000 instructional videos totaling over 22,000 class hours. It just shows the scale of educational content that's available and now usable for training AI.
Guest: Yeah, and the best thing about this is how natural this data feels! It’s a much more natural way for a model to learn, rather than learning from randomly aligned text and images, or something. It’s almost like the model is attending a virtual lecture.
Host: Exactly. So, now that we've talked about how this multimodal textbook was created, I think we should delve a bit into the actual analysis of it, and also the experiments they did. Because how do you measure the quality of such a complex dataset? That’s an interesting question.
Guest: Yeah, absolutely. I think the comparison they made with existing datasets is pretty important. They broke it down into how their data is different to the usual datasets out there.
Host: Right, they compared their dataset with image-text pair datasets and webpage-centric interleaved datasets, focusing on the distribution of images and text. And they found that their dataset has a significantly higher average number of images and text tokens per sample. So, there's more contextual information within each sample.
Guest: And it’s not just about the quantity, right? They also looked at the quality of the image relationships. They introduce this 'In-Sample Image Similarity' metric, which essentially measures how similar the images are within the same sample. They found that their video-based textbook has much higher similarity scores.
Host: Yeah, they’re saying the images in their dataset are much more related to each other semantically and structurally than in the other datasets. It makes sense when you consider the source, right? They're all part of the same explanation or example, as opposed to just randomly scraped from web pages. The difference in the similarity was almost double compared to other datasets, which is remarkable.
Guest: It really highlights the importance of the video format, as the images show a dynamic explanation or a sequential process. And it's also interesting that the similarity score for their dataset stays stable as they add more images, while the others tend to decline, which further supports the coherence of the video format. It’s like the more images they add, the more coherent it stays, while with the other datasets, adding more images makes it less and less clear what’s going on.
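A reasonable approximation of an in-sample image similarity score is the average pairwise cosine similarity of CLIP image embeddings within a single sample, as sketched below; the paper's exact feature extractor and averaging details may differ.

```python
import torch
from itertools import combinations
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Average pairwise cosine similarity of CLIP image embeddings within one sample.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def in_sample_image_similarity(image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize so dot product = cosine similarity
    sims = [float(feats[i] @ feats[j]) for i, j in combinations(range(len(images)), 2)]
    return sum(sims) / len(sims) if sims else float("nan")
```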
Host: Yeah, I agree. So, having analyzed the dataset itself, they then use that dataset to pre-train VLMs, and then evaluate these models on several downstream benchmarks. I’m always most interested in seeing the experiments when I read a research paper. It's like the results are finally presented, and we can see the impact of all the previous steps. They used both LLaVA and Idefics2 models, right?
Guest: That's right. They used LLaVA-1.5-7B and Idefics2-8B, both powerful models. They ran the experiment in two ways: one was continual pre-training, taking the pre-existing model and further training it on the new dataset, and the other was pre-training Idefics2 from scratch, without starting from a pre-trained checkpoint. And, of course, they made sure to use equivalent sample sizes from the other datasets to ensure a fair comparison. The experiments showed clear improvements from using the textbook data. It's like all the effort paid off.
Host: Definitely. Both models showed significant improvements on several benchmarks, particularly on the knowledge and reasoning-intensive ones like ScienceQA and MathVista. It really highlights the fact that this high-quality dataset was able to impart a stronger understanding of core knowledge, and also reasoning abilities. It wasn't just on some random benchmarks either, these are all challenging tasks.
Guest: Yeah, and it's important to note that they saw that improvement with both models. So it’s not just a peculiarity of using a particular model. It’s more of a systemic advantage that this data provides. I also thought the interesting thing was that even for really advanced models like the Idefics2, this textbook dataset brought an additional improvement. That shows the quality and impact of the data itself.
Host: That's right. They also showed that the textbook dataset improved the models' in-context learning capabilities. This is really important, since in-context learning is a big part of what makes these large models so useful in practice. Specifically, their models became more effective at using both visual and textual cues in the few-shot examples to solve problems. It was like they became better at learning by observing how you do things, and then following suit.
Guest: Yeah, that's the real key takeaway for me. They saw this even in some of the general-domain tasks, the textbook-trained model performed better as they added more examples. That's because of the coherent context they're getting from the video data. And this is a stark contrast to the other datasets. The interleaved approach was not just a data format; it actually had real implications on the learning itself.
Host: Right, and this shows that the coherence of the image and text sequences in the textbook was key to the models’ ability to grasp foundational knowledge. And to further dive into this, they actually designed this really fascinating 'Cheat Test'.
Guest: Yeah, that 'Cheat Test' is a really ingenious way to assess whether the models are actually paying attention to their context. They replaced one of the few-shot examples with the exact test case, basically giving a massive hint. A model with strong in-context learning abilities would immediately recognise that, and it would answer without having to think too hard about it. And the results were quite dramatic.
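To picture the cheat test, here is a tiny text-only mock-up: one of the few-shot exemplars is replaced by the test case itself, answer included, and a context-aware model should simply copy that planted answer. The real benchmark examples are multimodal, and the prompt template here is purely illustrative.

```python
# Minimal mock-up of the "cheat test": plant the test case (with its answer) among
# the few-shot exemplars and check whether the model exploits it.
def build_cheat_prompt(few_shot_examples, test_question, test_answer, cheat_position=0):
    examples = list(few_shot_examples)
    examples[cheat_position] = {"question": test_question, "answer": test_answer}
    blocks = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

prompt = build_cheat_prompt(
    few_shot_examples=[{"question": "What is 2 + 2?", "answer": "4"},
                       {"question": "What is 3 * 3?", "answer": "9"}],
    test_question="What is the square root of 144?",
    test_answer="12",
)
print(prompt)  # a context-aware model should answer "12" by copying the planted exemplar
```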