Evaluating LLM Performance in Research Astronomy

Large language models (LLMs) are being used not only for common-knowledge information retrieval, but also in specialized disciplines such as cutting-edge astronomical research. However, in specialized domains we lack robust, realistic, and user-oriented evaluations of LLM capabilities. Human evaluations are time-intensive, subjective, and difficult to reproduce, while automated metrics such as perplexity or task benchmarks fail to reflect realistic performance. We seek to advance understanding of LLM capabilities for supporting scientific research through user-centric analysis and the development of robust evaluation standards; while we expect the outputs from this workshop to generalize, we will focus on astronomy. Astronomy offers open data and an active community willing to partner in the design, experimentation, and evaluation processes. The primary goal of the workshop is to develop a quantifiable metric or objective function for evaluating LLMs in astronomy research, thereby taking humans out of the evaluation loop. A secondary goal is to understand how the evaluation criteria for a specialized use case (astronomy) compare to those for typical English conversations. Our proposal explores the first step toward a lofty goal, asking how AI can transform science for the better by first evaluating what it means to be better.
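To make concrete what an automated metric like perplexity measures (and why it may not reflect user-facing usefulness), here is a minimal sketch, assuming the Hugging Face transformers and PyTorch libraries; the model name and sample passage are placeholders, not choices made by the workshop.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal language model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# A short astronomy-flavored passage to score.
text = (
    "The stellar initial mass function describes the distribution of "
    "masses for a population of stars at formation."
)

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels equal to input_ids makes the model return the mean
    # cross-entropy loss over predicted tokens; exponentiating gives perplexity.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity on the sample passage: {perplexity:.2f}")

A low perplexity only indicates that the model finds the text statistically unsurprising; it says nothing about whether the model's answers would actually help an astronomer, which is the gap this workshop targets.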

Team Leader
John Wu

Senior Members
Anjalie Field
Jo Ciuca
Josh Peek
Kartheik Iyer
Michelle Ntampaka
Philipp Koehn
Sanjib Sharma

Grad Students
Elina Baral
Mikaeel Yunus

Undergrad Students
Alina Hyk
Charles O’Neill
Christine Ye
Kiera McCormick

Staff
Jenn Kotler (designer/software engineer)

Opening Day Team Presentation (Video)
Group Reports (Team Website)
Closing Presentation (Video)
