Constraint-Based Multi-Agent RL Enhances Cooperative Control

Recent advances in deep reinforcement learning (RL) have unlocked new possibilities for training control policies in robotics and gaming, particularly for tasks involving continuous action spaces. Among the actor-critic family of algorithms, Soft Actor-Critic (SAC) has emerged as a strong performer due to its stability and robustness. While single-agent RL frameworks are mature, extending them to multi-agent cooperative tasks—especially those with sequential objectives—remains challenging. The difficulty lies both in enabling effective agent interaction and in designing reward structures that foster coordinated behavior.


Multi-agent environments require agents to coordinate their actions in a shared, dynamic space. Platforms such as OpenAI Gym, ML-Agents, and RoboGym have proven effective for single-agent control, but collaborative settings call for more elaborate reward shaping and interaction modeling. Centralized Training with Decentralized Execution (CTDE), as implemented in MADDPG, has been widely adopted to address these issues. In CTDE, agents train with access to shared global information but execute their policies independently, each from its own observations. The work described here integrates SAC into CTDE, creating a Multi-Agent Soft Actor-Critic (MSAC) framework for continuous control with improved stability.
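To make the CTDE split concrete, here is a minimal sketch in plain Python. It is not the authors' MSAC implementation: the linear "networks", the weights, and the function names `actor` and `centralized_critic` are all illustrative assumptions. It only shows which inputs each component is allowed to see.

```python
import math
from typing import List, Sequence


def actor(local_obs: Sequence[float], weights: Sequence[float]) -> float:
    """Decentralized actor: uses only this agent's own observation."""
    z = sum(w * o for w, o in zip(weights, local_obs))
    return math.tanh(z)  # squashed to [-1, 1]


def centralized_critic(all_obs: Sequence[Sequence[float]],
                       all_actions: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Centralized critic: scores the joint observation-action vector."""
    joint: List[float] = [x for obs in all_obs for x in obs] + list(all_actions)
    if len(joint) != len(weights):
        raise ValueError("critic weights must match the joint input size")
    return sum(w * x for w, x in zip(weights, joint))


# Two agents, each with a 2-dimensional local observation (toy numbers).
obs = [[0.5, -0.2], [0.1, 0.3]]
actor_weights = [[1.0, 0.5], [-0.5, 2.0]]  # hypothetical parameters
actions = [actor(o, w) for o, w in zip(obs, actor_weights)]

# Training time only: the critic sees both observations and both actions.
critic_weights = [0.1] * 6  # 4 observation values + 2 actions
q_value = centralized_critic(obs, actions, critic_weights)
```

At execution time only `actor` is called, so each agent needs nothing beyond its own observation; the critic exists purely to shape the training signal.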

However, when a task decomposes into sequential phases, standard CTDE struggles with multi-objective reward design. Hierarchical RL can address this by assigning sub-objectives to lower-level controllers, but it increases architectural complexity. To avoid that overhead, the authors borrow a concept from safe RL and use constraints not just for safety but to encode intermediate objectives. In their Constrained Multi-Agent Soft Actor-Critic (C-MSAC) approach, the objectives of every phase except the last are treated as constraints, and the system optimizes the final goal only when the earlier-phase constraints are satisfied.
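For a two-phase task, the idea can be written as a constrained optimization problem. The notation below is an assumption made for illustration and is not taken from the paper: $J_k(\pi)$ denotes the expected return of the joint policy $\pi$ under the phase-$k$ reward, and $d_1$ is the threshold the phase-one return must reach.

```latex
\max_{\pi} \; J_2(\pi)
\quad \text{subject to} \quad
J_1(\pi) \ge d_1
```

With more phases, each earlier phase would contribute its own constraint of the same form, and only the last phase would appear in the objective.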

The testbed for this method is a physically simulated tray-balancing task in Unity. Two humanoid agents, controlled via inverse kinematics, must lift and stabilize a tray (phase one) before guiding a ball along a predefined moving trajectory (phase two). The environment supports different target paths, including randomized ellipses and S-curves, with evaluation also performed on unseen shapes like triangles and squares. State inputs include positions, orientations, velocities, and distances relevant to the tray, ball, and targets. Actions are forces applied at tray anchor points, scaled from normalized policy outputs.
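The last step, turning normalized policy outputs into forces, might look like the sketch below. The function name `scale_action` and the `max_force` value are hypothetical; the article does not give the actual force limits used in the Unity environment.

```python
from typing import List, Sequence


def scale_action(normalized: Sequence[float], max_force: float) -> List[float]:
    """Map policy outputs in [-1, 1] to forces in [-max_force, max_force].

    Values outside [-1, 1] are clipped first so the applied force
    never exceeds the assumed limit.
    """
    clipped = [max(-1.0, min(1.0, a)) for a in normalized]
    return [max_force * a for a in clipped]


# Example: three force components for one anchor point, assumed limit of 10.
forces = scale_action([0.5, -1.5, 0.0], max_force=10.0)
```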

Rewards are phase-specific: tray lifting combines distance-to-target height and orientation stability, while target following rewards proximity of the ball to the moving target. Both agents share averaged rewards to encourage cooperation. Early termination conditions—such as the ball falling off or the tray dropping—improve sample efficiency.
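A rough sketch of such a reward scheme follows. The exponential shaping, the equal weights, and all function names are assumptions chosen for illustration; the article states only which quantities each phase reward depends on and that the agents share an averaged reward.

```python
import math
from typing import List, Sequence


def lifting_reward(tray_height: float, target_height: float, tilt_rad: float,
                   w_height: float = 0.5, w_tilt: float = 0.5) -> float:
    """Phase one: reward height accuracy and a level tray (assumed shaping)."""
    height_term = math.exp(-abs(target_height - tray_height))
    tilt_term = math.exp(-abs(tilt_rad))
    return w_height * height_term + w_tilt * tilt_term


def following_reward(ball_xy: Sequence[float], target_xy: Sequence[float]) -> float:
    """Phase two: reward proximity of the ball to the moving target."""
    return math.exp(-math.dist(ball_xy, target_xy))


def shared_reward(per_agent: Sequence[float]) -> List[float]:
    """Give every agent the mean of the individual rewards."""
    mean = sum(per_agent) / len(per_agent)
    return [mean] * len(per_agent)


# Both agents receive the same averaged signal.
team = shared_reward([0.25, 0.75])
```

Because both agents receive the same number, neither can improve its own reward at the expense of the other, which is the cooperative incentive the averaging is meant to create.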

C-MSAC maintains a separate value function for each phase. During training, it chooses which phase to optimize by checking whether the earlier phase's constraint threshold is met: while the threshold is unmet it optimizes that phase, and once it is met it optimizes the final objective. This primal constraint-handling approach avoids the complexity of dual variables and Lagrange multipliers, simplifying implementation.
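The switching rule can be sketched in a few lines. This is a simplified illustration, not the authors' code: the article does not say exactly how constraint satisfaction is estimated, so comparing a phase-one value estimate against a fixed threshold is an assumption, as are the names `select_phase` and `policy_objective`.

```python
from typing import Tuple


def select_phase(lift_value: float, lift_threshold: float) -> str:
    """Pick the phase whose value function drives the next policy update."""
    return "follow" if lift_value >= lift_threshold else "lift"


def policy_objective(q_lift: float, q_follow: float,
                     lift_threshold: float) -> Tuple[str, float]:
    """Return the active phase and the value the policy should increase.

    No Lagrange multiplier is involved: the update simply switches
    between the two critics depending on the constraint check.
    """
    phase = select_phase(q_lift, lift_threshold)
    return phase, (q_follow if phase == "follow" else q_lift)


# Constraint not yet met: the update targets the lifting critic.
early = policy_objective(q_lift=0.4, q_follow=5.0, lift_threshold=0.8)
# Constraint met: the update targets the target-following critic.
late = policy_objective(q_lift=0.9, q_follow=5.0, lift_threshold=0.8)
```

A Lagrangian method would instead optimize a weighted sum of both critics and learn the weight; the hard switch trades that extra variable for a simpler update rule.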

In experiments, both MSAC and C-MSAC were trained for 45,000 episodes on ellipse and S-curve trajectories. C-MSAC initially lags in target-phase reward because it focuses on satisfying the lifting constraint, but it eventually surpasses MSAC, achieving higher and more stable rewards. In phase-one performance, C-MSAC reaches constraint thresholds faster and with less variance.

On-target performance—measured as the percentage of time the ball stays within a high-reward zone—was consistently higher for C-MSAC across both training trajectories. When tested on unseen triangle and square paths, both models generalized, but C-MSAC achieved better rewards and stability. Sharp corners in the square path posed more difficulty, likely due to training only on smooth curves.
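An on-target metric of this kind could be computed as in the sketch below. Defining the high-reward zone as a fixed radius around the target is an assumption, and `on_target_percentage` is a hypothetical helper name; the article does not specify the zone's exact definition.

```python
from typing import Sequence


def on_target_percentage(distances: Sequence[float], radius: float) -> float:
    """Percentage of time steps where the ball is within `radius` of the target."""
    if not distances:
        return 0.0  # no steps recorded, nothing was on target
    hits = sum(1 for d in distances if d <= radius)
    return 100.0 * hits / len(distances)


# Ball-to-target distances logged over four time steps (toy values).
score = on_target_percentage([0.1, 0.3, 0.05, 0.5], radius=0.2)
```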

Robustness tests introduced disturbance forces during execution. As disturbance magnitude increased, performance dropped for both models, but C-MSAC maintained higher mean rewards and lower variance, indicating greater resilience. Notably, no disturbances were applied during training, suggesting that SAC’s entropy-driven exploration contributes to robustness.

The framework demonstrates that constraint-based multi-agent RL can effectively manage sequential cooperative objectives without the overhead of hierarchical controllers. While the current work uses inverse kinematics for arm control and focuses on two phases, the authors note that extending to joint-level RL control and adding more phases could further exploit the method’s potential. The approach is also adaptable to competitive scenarios, broadening its applicability in complex multi-agent domains.
