Jaemin Cho (UNC Chapel Hill) – “Faithful Reasoning and Fine-grained Evaluation for Multimodal Generation”

When:
March 3, 2025 @ 12:00 pm – 1:15 pm
Where:
Hackerman Hall B17
3400 N. Charles St.
Cost:
Free

Abstract

The paradigm of training large-scale foundation models has driven significant advances in multimodal AI. However, pursuing further performance gains solely through model scaling is becoming impractical due to rising computational costs and resource limitations. Moreover, the reasoning and generation processes of these models remain largely uninterpretable and uncontrollable, often leading to unfaithful outputs. In this talk, I will discuss my efforts to make multimodal generative models more controllable and trustworthy without increasing their size. First, I will introduce faithful reasoning frameworks, in which the multimodal generation process mirrors how humans reason about and create content such as images and videos. Concretely, in these frameworks, models create a detailed plan that decomposes a complex generation task into simpler steps, and retrieve relevant information from multimodal knowledge bases before generating the final outputs. Next, I will describe fine-grained evaluation methods that assess model capabilities across multiple dimensions, such as object counting and spatial relation understanding, thereby providing a detailed picture of models' strengths and weaknesses. In turn, these evaluations enable targeted model improvements that address identified weaknesses through test-time guidance or by updating training environments. Together, these directions offer a pathway toward more intelligent, reliable, and efficient multimodal AI models.

Bio

Jaemin Cho is a PhD candidate in the Department of Computer Science at UNC-Chapel Hill. His research focuses on improving reasoning capabilities in multimodal generation. His work has been featured at top conferences in computer vision (CVPR, ICCV, ECCV), natural language processing (EMNLP, NAACL, COLM), and machine learning (NeurIPS, ICML, ICLR, AAAI). His work has been recognized through multiple oral/spotlight presentations and a top reviewer award at NeurIPS, the Bloomberg Data Science PhD Fellowship, and media coverage (MIT Technology Review, IEEE Spectrum, and WIRED). He has also co-organized the T4V: Transformers for Vision workshop at CVPR 2023 and 2024.

Center for Language and Speech Processing