Evaluating LLM Performance in Research Astronomy

Large language models (LLMs) are being used not only for common-knowledge information retrieval but also in specialized disciplines such as cutting-edge astronomical research. In these specialized domains, however, we lack robust, realistic, and user-oriented evaluations of LLM capabilities. Human evaluations are time-intensive, subjective, and difficult to reproduce, while automated metrics like perplexity or task benchmarks fail to reflect realistic performance. We seek to advance understanding of LLM capabilities for supporting scientific research through user-centric analysis and the development of robust evaluation standards; while we expect the outputs of this workshop to generalize, we will focus on astronomy. Astronomy has open data and a vibrant, active community that is open to partnering in the design, experimentation, and evaluation processes. The primary goal of the workshop is to develop a quantifiable metric or objective function for evaluating LLMs in astronomy research, thereby taking humans out of the evaluation loop. A secondary goal is to understand how the evaluation criteria for a specialized use case (astronomy) compare to those for typical English conversation. Our proposal explores the first step toward a lofty goal, asking how AI can transform science for the better by first evaluating what it means to be better.
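As an illustration of the kind of automated metric mentioned above, the sketch below computes the perplexity of a causal language model on a short astronomy passage. This is not the workshop's evaluation method; the model name ("gpt2") and the sample passage are placeholder assumptions chosen only to make the example self-contained.

```python
# Minimal sketch: perplexity of a causal LM on an astronomy-flavored passage.
# The model and text are illustrative placeholders, not the workshop's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM on the Hugging Face Hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = (
    "The Lyman-alpha forest traces neutral hydrogen absorption along "
    "quasar sightlines and constrains the thermal state of the intergalactic medium."
)

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss over
    # predicted tokens; perplexity is the exponential of that loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity on the sample passage: {perplexity:.2f}")
```

A low perplexity only shows that the model finds the text statistically unsurprising; as noted above, it says little about whether the model's answers would actually help a working astronomer, which is the gap the workshop's metric aims to close.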

Team Leader

John Wu

Senior Members

Anjalie Field
Ioana “Jo” Ciuca
Philipp Koehn
Kartheik Iyer
Sanjib Sharma
Daniel Khashabi (part time)
Josh Peek (part time)
Michelle Ntampaka (part time)

Graduate Students

Mikaeel Yunus
Elina Baral

Undergraduate Students

Charles O’Neill
Christine Ye
Kiera McCormick

Staff

Jenn Kottler (designer/software engineer)

Opening Day Team Presentation (Video)
Group Reports (Team Website)
Closing Presentation (Video)
