SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into some pretty fascinating stuff, a deep dive into the world of multi-agent systems and how they learn to cooperate. It's a topic that's not only super relevant to a lot of fields but also incredibly complex and intriguing. We'll be exploring a new approach that's showing a lot of promise in this area. Think about how a flock of birds manages to move so seamlessly together, or how a team of robots can collaborate efficiently. It's all about coordination and shared understanding, and that's what we're going to unravel today.
Host: So, the paper we're focusing on today is titled 'SRMT: Shared Memory for Multi-agent Lifelong Pathfinding,' by Alsu Sagirova, Yuri Kuratov, and Mikhail Burtsev. It's a mouthful, I know, but trust me, the core concept is something we can really get into. Before we really get down to details, imagine a group of agents, whether they are robots or simulated characters, navigating a complex environment. Traditional methods might rely on explicit communication – imagine these agents sending messages back and forth like walkie-talkies. But this approach introduces a lot of challenges, like bandwidth limitations, signal interference, or agents simply being out of range. What this paper proposes is a new way for the agents to interact and coordinate – it involves a shared memory space.
Host: Yeah, it’s like, instead of directly messaging each other, they’re all writing their thoughts on a shared whiteboard, and everyone can see what's there. Obviously, it’s more technically nuanced than that, but the analogy helps, right? This approach, as the paper argues, could solve some of the bottlenecks that plague current multi-agent systems. It’s about implicit communication rather than explicit communication. I mean, think about how you and your friends work together in a crowded place. You're not necessarily sending constant messages but you all kind of understand each other's goals and intentions just from observation and that’s what we're trying to get to with these multi-agent systems. It’s all about this shared understanding. And the authors, they've taken inspiration from something called 'global workspace theory' from cognitive science, which is kind of cool.
Host: Okay, so let's dive in. The paper first sets the stage by discussing some related work in multi-agent reinforcement learning, or MARL as it's commonly known. MARL is basically the application of reinforcement learning, where agents learn through trial and error, in multi-agent settings. Now, there are a few ways that MARL has been approached. The paper discusses that you can have centralized settings, where a central controller oversees all the agents, making all the decisions. It's kind of like a single conductor leading an orchestra, where everything is pre-planned and coordinated from one point. Then, you have a completely decentralized approach, where each agent acts independently and makes decisions solely based on its own observations, like a bunch of solo artists improvising. And then you've got the networked agents setting. This is where things get a bit more interesting, as agents share information with each other to improve coordination.
Host: Exactly! And the paper highlights a lot of really interesting stuff that's already out there. For instance, there are methods like IQL, VDN, QMIX, and QPLEX that work in decentralized settings without explicit communication. They focus on each agent making its decision based on its own limited perspective. These methods operate with individual Q-functions, which can be beneficial in large-scale systems where a central controller might be infeasible. Then, on the other side, you've got centralized methods like LaCAM and RHCR. These use a global view, but that's not always feasible because it relies on the entire system being accessible to a single point of control. You get bottlenecks when scaling up or when dealing with partial observability. And then, you get the approaches that do use communication, like DCC, MAMBA, and SCRIMP, which allow the agents to exchange info, each with its own communication strategy. They all attempt to strike a balance between the efficiency of decentralized approaches and the effectiveness of centralized ones. It's like trying to find that sweet spot, right?
Host: And that's where the 'Shared Recurrent Memory Transformer' or SRMT for short comes into play. This approach fits under the decentralized setting with networked agents. But instead of direct communication, it proposes an indirect way of information sharing. It's like all agents can see and update a global shared memory space. This shared memory allows the agents to implicitly coordinate their actions. It's a subtle but significant shift. The paper argues that this approach can provide the benefits of both decentralized and centralized methods, without the explicit coordination that can bog down traditional systems. Agents can act independently based on local observations but they also have an understanding of what the other agents are doing through the shared memory. And that’s where the ‘recurrent’ part in SRMT comes in. It’s not just about a single snapshot of memory but a running memory over time.
Host: So, let's talk about this 'shared recurrent memory' part in a bit more detail. They're building on some work that's been done with memory transformers, which also use this idea of adding special tokens to the input sequence. They basically extend memory transformers to multi-agent settings by pooling and globally broadcasting individual memories. Memory Transformer itself, or MT, was kind of the base for this. It's a type of architecture that adds these trainable 'memory tokens' to the input. These tokens act like a working memory space for the model. Then you get the Recurrent Memory Transformer, or RMT, which allows these memory tokens to pass information between different segments of input. So instead of a one-shot scratchpad, the memory tokens carry information forward from segment to segment, working together like a recurrent hidden state. Then you have stuff like ATM, or Agent Transformer Memory, which uses a transformer-based working memory in multi-agent reinforcement learning. The key difference between these and SRMT is how they utilize and share this memory space. It's all about whether the memory is individual to the agent or accessible by all of them.
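To make the memory-token idea concrete, here is a minimal PyTorch sketch of how an RMT-style block can carry trainable memory tokens across input segments. The class and parameter names are our own illustration, not the MT/RMT authors' implementation.

```python
# A minimal sketch (illustrative, not the MT/RMT reference code) of recurrent memory
# tokens: trainable tokens are prepended to each input segment, processed together
# with it, and the updated memory is carried over to the next segment.
import torch
import torch.nn as nn

class RecurrentMemoryBlock(nn.Module):
    def __init__(self, d_model=128, n_mem=4, n_heads=4):
        super().__init__()
        self.mem_init = nn.Parameter(torch.randn(n_mem, d_model))  # trainable memory tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_mem = n_mem

    def forward(self, segment, memory=None):
        # segment: (batch, seq_len, d_model); memory: (batch, n_mem, d_model) or None
        if memory is None:
            memory = self.mem_init.unsqueeze(0).expand(segment.size(0), -1, -1)
        x = torch.cat([memory, segment], dim=1)   # memory tokens sit in front of the segment
        h = self.encoder(x)
        new_memory = h[:, :self.n_mem]            # read back the updated memory tokens
        hidden = h[:, self.n_mem:]                # contextualized segment representations
        return hidden, new_memory                 # new_memory is fed into the next segment
```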
Host: Exactly, and it's fascinating how this works. SRMT pools and globally broadcasts the individual memories of all agents. It essentially creates this collective awareness, without requiring the agents to send messages directly to each other. And this architecture, this SRMT, is designed specifically to help agents coordinate and make joint decisions. Think about the way a team works in a football game. The players don't have to tell each other their exact moves every single time, but they share an understanding and anticipation. That's kind of the effect that SRMT is trying to achieve. It's not just about remembering a single state or a sequence, it’s about collective understanding and action planning. The agents learn to read from the shared memory what the other agents are doing and also learn to write to this shared space what their intentions are.
Host: Alright, let's really break down the architecture of the SRMT. The paper describes it as an extension of memory transformers to multi-agent settings. So, what's happening is that each agent at each time step has its own personal memory vector, along with a sequence of its past observations and its current step observation. This is then passed through a self-attention mechanism. Basically, the agent is processing its own internal thoughts. But, importantly, then comes the cross-attention layer. This cross-attention layer lets the hidden representations of each agent interact with the shared memory. So, the memory is like a bulletin board where all the agents post their memory vectors and read what the others posted. It's the interaction between the personal memory and this shared memory that enables each agent to incorporate global context into its decision-making. Finally, the memory head is used to update the agent's memory vector before the agent goes on to make its action decision. It's like a cycle of observe, think, read, write, and act, constantly updating and refining the team's understanding.
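As a rough picture of what one such decision step might look like, here is a hedged PyTorch sketch. Shapes, module names, and the tanh memory update are our own simplifications, not the exact architecture from the paper: the agent self-attends over its personal memory and observation history, cross-attends to the pooled memories of all agents, then writes a new personal memory and produces action logits.

```python
# A simplified sketch of one SRMT-style agent step (names and details are illustrative,
# not the paper's exact implementation).
import torch
import torch.nn as nn

class SRMTAgentCore(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_actions=5):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_head = nn.Linear(d_model, d_model)    # writes the new personal memory
        self.policy_head = nn.Linear(d_model, n_actions)  # maps to action logits

    def forward(self, memory, history, shared_memory):
        # memory: (B, 1, D) personal memory vector
        # history: (B, T, D) embedded past observations plus the current one
        # shared_memory: (B, N_agents, D) pooled memories broadcast to every agent
        x = torch.cat([memory, history], dim=1)
        h, _ = self.self_attn(x, x, x)                            # process own memory + observations
        h, _ = self.cross_attn(h, shared_memory, shared_memory)   # read the shared "bulletin board"
        new_memory = torch.tanh(self.memory_head(h[:, :1]))       # write back the personal memory
        action_logits = self.policy_head(h[:, -1])                # act from the current-step slot
        return action_logits, new_memory
```

In a full system, each agent's new memory would then be gathered into the shared memory for the next time step, which is what makes the memory both recurrent over time and shared across agents.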
Host: Okay, now the fun part: how they actually tested this thing! The authors used a couple of different scenarios to evaluate the SRMT. They started with a relatively simple scenario, a 'Bottleneck task'. It's basically a navigation problem where agents need to pass through a narrow corridor. Imagine two rooms connected by a single narrow door and two agents that start in different rooms needing to switch places. It's simple, but it requires some level of coordination to avoid getting stuck and deadlocking. They tested the SRMT against a bunch of different baselines like MAMBA, QPLEX, ATM, RATE, and RRNN, and also some ablations of SRMT itself. Ablations are important because they let you test the effectiveness of individual components of your system. So, they tested SRMT with a recurrent memory but without sharing, with a transformer without memory, and with a basic RNN structure instead of the attention-based methods, which are all pretty standard things to test. They also tested different types of reward functions, like a dense scheme that rewards agents for moving closer to their goal and a sparse one that rewards them only when they reach the goal.
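For intuition about those two reward regimes, here is a tiny illustrative sketch; the exact reward values and names in the paper differ, so treat this purely as an example of dense shaping versus a sparse, goal-only signal.

```python
# Illustrative reward functions (values and names are ours, not the paper's exact setup).
def dense_reward(prev_dist_to_goal, new_dist_to_goal, reached_goal):
    """Dense shaping: small bonus for getting closer, full reward on arrival."""
    if reached_goal:
        return 1.0
    return 0.01 if new_dist_to_goal < prev_dist_to_goal else 0.0

def sparse_reward(reached_goal):
    """Sparse: no intermediate signal at all, only success counts."""
    return 1.0 if reached_goal else 0.0
```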
Host: The results they got from this bottleneck task are pretty interesting. They found that SRMT consistently outperformed all the baselines, especially under the most challenging reward settings, like the sparse reward, where the agent is only rewarded when it reaches its goal and there's no in-between reward for partially succeeding. It's the kind of task that demands really good coordination. What this indicates is that shared memory improves coordination between agents in their decision-making. With no intermediate rewards, it can be incredibly challenging for agents to figure out what to do, and the fact that SRMT still performed effectively under these conditions really underscores how powerful the shared memory is. It seems like having access to a shared space really allows agents to learn to cooperate and coordinate their movements even without explicit communication. They've also shown that SRMT generalizes well. The agents that were trained on corridors of a certain length could still perform well when faced with corridors that are significantly longer. This is crucial because, in real-world scenarios, you need a system that can handle new and unknown situations.
Host: And it's not just the bottleneck task they tested it on. They also tested it on more complex 'Lifelong Multi-Agent Pathfinding' or LMAPF tasks, using the POGEMA benchmark, which is a standardized platform for testing these kinds of algorithms. LMAPF is when agents continuously receive new destinations upon reaching their current goals. It's like a delivery task, where as soon as you deliver one package, you immediately get assigned a new one. So, the task never stops, the agents always have to keep going. They tested SRMT on mazes, random environments, and a couple of more complex environments like warehouses, comparing it to other MARL methods, hybrid methods, and planning-based methods. The tests covered throughput, congestion in a multi-agent system, and also scalability. The results indicate that SRMT is competitive with other state-of-the-art methods in general. It shows how well SRMT generalizes, especially on new maps not used during training.
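To make the 'lifelong' part concrete, here is a small, self-contained sketch of such an evaluation loop. The environment object and its methods are hypothetical stand-ins, not the actual POGEMA API; the point is just that goals are reassigned immediately and throughput counts goals reached per step.

```python
# Sketch of a lifelong MAPF evaluation loop with throughput as the metric.
# `env` and its methods are hypothetical stand-ins, not the real POGEMA interface.
def run_lifelong(env, policy, num_steps=512):
    observations = env.reset()
    goals_reached = 0
    for _ in range(num_steps):
        actions = policy(observations)                  # one action per agent, chosen locally
        observations, infos = env.step(actions)
        for agent_id, info in enumerate(infos):
            if info.get("reached_goal", False):
                goals_reached += 1
                env.assign_new_goal(agent_id)           # lifelong: a fresh goal right away
    return goals_reached / num_steps                    # average throughput over the run
```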
Host: Yeah, and the interesting takeaway is how SRMT manages different situations. The paper mentions the 'Warehouse environment', which is complex because it contains narrow corridors, high agent density, and congestion. In this environment, the SRMT with heuristic planning performed even better than a lot of MARL baselines. Integrating heuristic planning, which is like using some prior knowledge of the environment, with the learning-based approach really improves performance in difficult settings. So, the results basically tell you that SRMT is capable of adapting to new environments, showing that the agents can learn cooperative behavior that is not just specific to the training setup. In the warehouse scenario, the heuristic planning gives a better general idea of how to reach a goal while avoiding collisions, but the SRMT helps agents do it collectively by sharing this information with each other. It seems that SRMT strikes a good balance between a learning-based approach and a more informed planning method.
Host: Absolutely, and the authors also look at some key performance metrics such as performance itself (measured by throughput), pathfinding (how close a single agent's path is to optimal), congestion handling, and cooperation (how well the method handles complex multi-agent situations). The analysis also covered generalization ability by testing on unseen maps, which is the 'out-of-distribution' setting, and a scalability metric that shows how well the method scales as the number of agents increases. Across these tests, the SRMT showed very competitive results, often being the top performer on a number of metrics. It really emphasizes how well SRMT balances these different requirements of multi-agent systems. For instance, the scalability results are particularly important for real-world applications where you might have dozens, hundreds, or even thousands of agents.
Host: And the interesting thing is, they even did a deep dive into the memory representations themselves. They analyzed the agent memory vectors and related those to the physical distances between agents in the environment. Basically, when agents were physically close, their memory representations were also close, and this correlation was consistent as the agents moved and interacted with each other. This analysis gives you a glimpse into what the shared memory space is doing and how the agents are actually using it to build a collective awareness. The fact that the memory representations closely match the spatial arrangement of agents in the environment kind of confirms that the shared memory is indeed capturing the information that is relevant for multi-agent coordination.
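The analysis described here can be illustrated with a few lines of NumPy; the paper's exact procedure may differ, but the idea is to compare pairwise distances between agents' memory vectors with their pairwise physical distances on the map.

```python
# Illustrative version of the memory-vs-position analysis (not the authors' exact code).
import numpy as np

def memory_position_correlation(memories, positions):
    # memories: (n_agents, d) memory vectors; positions: (n_agents, 2) grid coordinates
    mem_dists, pos_dists = [], []
    n = len(memories)
    for i in range(n):
        for j in range(i + 1, n):
            mem_dists.append(np.linalg.norm(memories[i] - memories[j]))
            pos_dists.append(np.linalg.norm(positions[i] - positions[j]))
    return np.corrcoef(mem_dists, pos_dists)[0, 1]  # Pearson correlation between the two
```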
Host: So, let's wrap up by summarizing the key findings. The authors of this paper proposed a novel architecture called the Shared Recurrent Memory Transformer, or SRMT, which enables agents to implicitly exchange information and coordinate actions without direct communication protocols. They tested it on both the bottleneck navigation task and the POGEMA benchmark for lifelong multi-agent pathfinding, which includes Mazes, Random, MovingAI, and Warehouse environments. The results showed that SRMT consistently outperformed the baselines on the bottleneck task, especially in challenging scenarios with sparse rewards, and remained competitive with existing MARL, planning, and hybrid algorithms on the POGEMA maps. The system also scaled well with a larger number of agents. It's this combination of implicit communication, shared memory, and good generalization ability that makes the SRMT so promising. In some sense, it provides a new way of achieving collective intelligence in these multi-agent systems. The paper highlights how incorporating shared memory into transformer-based architectures matters for multi-agent reinforcement learning.
Host: Yeah, and they're also quite transparent about the limitations of this study, which I appreciate. They point out that in their experiments, they assume that all the agents have perfect localization and mapping. This is not always the case in real-world scenarios, and there might be noise or uncertainties in the observations. Also, they assume that agents execute their actions perfectly and that they are synchronized. Obviously, in a real-world environment, agents might have varying speeds, actions might fail, or there might be delays in communication. These limitations indicate that there's still a lot of work to be done before this can be directly implemented in all real-world applications. However, the paper makes a compelling case for the potential of this architecture and the concept of shared memory. The limitations don't detract from the paper's contribution, and it's good that the authors point them out so we know exactly where the open problems are.