The Alignment Waltz: Jointly Training Agents to Collaborate for Safety – Jack Zhang (JHU)
Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency[…]