Supervising Models that are Smarter than Us – Shi Feng (George Washington University)
Abstract:
Advanced AI systems are being deployed for increasingly complex tasks. To ensure reliable human oversight of these systems, we need supervision protocols that remain effective as task complexity and model capabilities grow. Many approaches to this challenge assist human supervisors with a second model that compensates for the human's weaknesses. However, this assistance can also introduce new vulnerabilities. In this talk, I will discuss new research on both methods and threat models for assisted supervision protocols. I'll also share my thoughts on the meta-question of how we can make progress in scalable oversight, and how it overlaps with other AI safety research agendas.
Bio:
Shi Feng is an assistant professor of computer science at George Washington University. He received his PhD from the University of Maryland and held postdoctoral positions at the University of Chicago and New York University. He works on AI safety; his recent work focuses on mitigating the risk of AIs sabotaging human oversight and control, exploring concepts such as deception, collusion, and honesty. Previously, he worked on adversarial robustness and interpretability.