Andrew Blair-Stanek (JHU) – Shelter Check Dataset and Experiments; LLM Instability on Legal Questions
Abstract
This presentation will cover two papers, both on LLMs' abilities to handle complex legal tasks. The first introduces the Shelter Check dataset, a curated set of past tax-minimization schemes. We test the abilities of o1 and claude-3.5 to understand these strategies and to verify that they meet their stated goals. Given the relevant background information, we also prompt LLMs to generate individual steps in tax strategies, as well as to generate entire tax strategies from scratch. Intriguingly, despite high variance in overall performance, LLM-based reasoning produced three previously unknown tax strategies.

The second paper examines LLM instability on legal questions. An LLM is "stable" if it reaches the same conclusion when asked the identical question multiple times. We find that leading LLMs like gpt-4o, claude-3.5, and gemini-1.5 are unstable when answering hard legal questions, even when made as deterministic as possible by setting temperature to 0. We curate and release a novel dataset of 500 legal questions distilled from real cases, each involving two parties, with facts, competing legal arguments, and the question of which party should prevail. Given the exact same question, an LLM sometimes says one party should win and other times says the other party should win. This instability has implications for the growing number of legal AI products, legal processes, and lawyers that rely on these LLMs.
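To make the stability experiment concrete, the sketch below repeats an identical two-party question against a single model at temperature 0 and checks whether the verdict ever flips. It is a minimal illustration only, assuming access to the OpenAI chat completions API; the model name, prompt wording, repetition count, and answer parsing are illustrative placeholders, not the paper's actual setup.

from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question in the dataset's two-party format: facts,
# competing arguments, and a forced choice of which party prevails.
QUESTION = (
    "Facts: ...\n"
    "Party A argues: ...\n"
    "Party B argues: ...\n"
    "Which party should prevail? Answer with exactly 'A' or 'B'."
)

def ask_once(model: str = "gpt-4o") -> str:
    """Ask the identical question once, as deterministically as possible."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # nominally deterministic decoding
        messages=[{"role": "user", "content": QUESTION}],
    )
    return response.choices[0].message.content.strip()

# Repeat the identical query and tally the answers.
verdicts = Counter(ask_once() for _ in range(20))
print(verdicts)
if len(verdicts) > 1:
    print("Unstable: the model gave different answers to the same question.")

Even with temperature set to 0, repeated runs of a loop like this can produce different verdicts, which is the instability the second paper measures.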
Bio
Andrew is a tax law professor at the University of Maryland and a seventh-year Ph.D. student at Johns Hopkins University, advised by Benjamin Van Durme.