Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address these scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples the model's non-embedding parameters by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image-modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense-baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
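The core idea above -- per-modality feed-forward, projection, and normalization parameters combined with a single global self-attention over the mixed sequence -- can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it uses single-head attention, omits layer normalization, causal masking, and multi-head splitting, and the class and method names (`MoTLayer`, `_per_modality`) are invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """Minimal sketch of one Mixture-of-Transformers layer (simplified):
    non-embedding parameters (here the Q/K/V projections and the FFN)
    are duplicated per modality, while self-attention itself runs
    globally over the full interleaved sequence."""

    def __init__(self, d_model, d_ff, n_modalities, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        # One independent set of weights per modality.
        self.wq = rng.normal(0, s, (n_modalities, d_model, d_model))
        self.wk = rng.normal(0, s, (n_modalities, d_model, d_model))
        self.wv = rng.normal(0, s, (n_modalities, d_model, d_model))
        self.w1 = rng.normal(0, s, (n_modalities, d_model, d_ff))
        self.w2 = rng.normal(0, s, (n_modalities, d_ff, d_model))
        self.d_model = d_model

    def _per_modality(self, x, modality, weights):
        # Route each token to the weight matrix of its own modality.
        out = np.empty((x.shape[0], weights.shape[2]))
        for m in range(weights.shape[0]):
            idx = modality == m
            out[idx] = x[idx] @ weights[m]
        return out

    def forward(self, x, modality):
        # x: (seq_len, d_model); modality: (seq_len,) integer labels.
        # Modality-specific Q/K/V projections...
        q = self._per_modality(x, modality, self.wq)
        k = self._per_modality(x, modality, self.wk)
        v = self._per_modality(x, modality, self.wv)
        # ...but global (dense) self-attention over the whole sequence,
        # so text, image, and speech tokens still attend to each other.
        attn = softmax(q @ k.T / np.sqrt(self.d_model))
        h = x + attn @ v
        # Modality-specific feed-forward network with a ReLU.
        h1 = np.maximum(self._per_modality(h, modality, self.w1), 0.0)
        return h + self._per_modality(h1, modality, self.w2)
```

The sparsity saving comes from the routing in `_per_modality`: each token activates only its own modality's FFN and projection weights, so per-token FLOPs match a dense model of the same width even though total parameters scale with the number of modalities.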
Discussion
Host: Hey everyone, and welcome back to another episode of Tech Forward! Today, we're diving deep into the world of arXiv, that incredible online repository of pre-prints. I've got my friend, Sarah, here with me, and she's a real expert on this. So Sarah, thanks for joining me!
Guest: Thanks for having me, Leo! Always happy to talk about arXiv. It's such a vital resource for researchers across so many fields.
Host: Absolutely! For those listening who aren't familiar, arXiv is basically a massive database of research papers, right? But what makes it so unique? It's not like a traditional journal, is it?
Guest: Exactly. It's a pre-print server, which means researchers upload their papers before they've been formally peer-reviewed and published in a journal. This means that the latest research is available much faster than through the traditional publishing route. Think of it as a first look at cutting-edge work – sometimes years ahead of publication in journals.
Host: That's fascinating! So, it's kind of like a sneak peek at the future of research? And this means faster dissemination of new findings, leading to quicker advancements across different fields, right?
Guest: Precisely! It accelerates the research cycle significantly. Imagine a researcher in, say, astrophysics, making a breakthrough. They can immediately upload their findings to arXiv, making the information available globally to other researchers, who can then build upon that work, collaborate, and potentially even replicate the study much more quickly. The traditional peer-review process can take months, even years, which is why arXiv's immediacy is so valuable.
Host: That makes perfect sense. But what about the lack of peer review? Isn't there a risk of inaccurate or flawed research being disseminated?
Guest: That's a valid concern. While arXiv doesn't perform peer review, it does have a moderation system in place to catch obvious errors or plagiarism. Plus, the community itself acts as a kind of informal peer review system. Researchers scrutinize each other's work, and any serious flaws usually get identified pretty quickly through discussion and further research. However, it's crucial to remember that pre-prints haven't undergone the rigorous scrutiny of a traditional journal publication.
Host: So it's a system of trust and self-regulation, combined with quick dissemination. That's a pretty remarkable balance. I can imagine there is a lot of buzz surrounding a new paper when it first goes on arXiv.
Guest: Absolutely! The excitement of seeing new research posted there is palpable. It fosters a really vibrant and dynamic research community. Researchers often cite arXiv papers even before they are published in journals, which speaks volumes to its importance and influence.
Host: It sounds like it's a crucial tool not only for the rapid advancement of science but also for the collaboration among scientists themselves. It truly democratizes access to research, doesn't it? I mean anyone can access these papers, right?
Guest: Ideally, yes. arXiv's mission is to make research openly accessible, and that's largely achieved. While there's a login system, it's primarily there to manage submissions and author accounts; most content is freely available. The core principle is about disseminating research, not restricting access. This open-access principle is vital for the advancement of knowledge.
Host: That's incredible. It really underscores the importance of open science and collaboration in driving progress. Now, you mentioned something about HTML conversions and accessibility. Can you elaborate on that?
Guest: Yes, so, sometimes there are technical challenges. Authors upload papers in different formats, including LaTeX, which is a very common typesetting system in academia. arXiv tries to convert these to HTML for better accessibility, but sometimes the conversion fails, leading to the 'No HTML' message. This is something they are constantly working on improving. If you are an author and you encounter this, you can usually help by providing information on your file formats to improve future conversions.
Host: That's good to know. So, it’s not just about the speed and accessibility of the research, but also the technical challenges of maintaining such a vast database. It's a massive undertaking.
Guest: Absolutely. arXiv relies heavily on the support of various institutions and individuals, including donations, to continue its operation and improve its services. It's a testament to the collaborative nature of the scientific community that arXiv isn't a for-profit endeavor.