Antoine Bosselut (EPFL) “From Mechanistic Interpretability to Mechanistic Reasoning”
Abstract: Pretrained language models (LMs) encode implicit representations of knowledge in their parameters. Even so, our best methods for interpreting these representations yield few actionable insights into how to manipulate this parameter space for[…]