Omnimodal Encoders 

Authors: David Harwath, Karen Livescu, Georg Heigold, Shankar Kumar.


We are heading towards a near future where we use multimodal large language models (LLMs) to do most tasks involving a variety of modalities — written language, spoken language, general audio, images and video, and beyond. Today’s multimodal LLMs typically involve a combination of pre-trained text LLMs and modality-specific encoders for the various input modalities. This approach places a large burden on the encoders, and raises several natural questions: 

  • For this paradigm to work well, the encoders must be task-universal, i.e., a single set of encoders should provide the information needed for all of the tasks the multimodal LLM might be used for. How should we train encoders to satisfy this goal? Can we learn deep semantic representations that are task-universal, or only low-level ones (leaving the deeper work to downstream models)?
  • How modality-universal can the encoders and their learning techniques be? That is, can we jointly learn encoders for multiple modalities, or even share much of the model across modalities? How do we account for modality-specific information vs. shared information?
  • Can we reliably evaluate encoder quality in a more efficient way than plugging them into LLMs and using large LLM benchmarks? What is a necessary and sufficient set of intrinsic encoder evaluation tasks?

Existing work has begun to address some of these questions, but much of the space remains unexplored. Our proposed workshop project will address these challenges by: (1) developing techniques for learning task- and modality-universal encoders and (2) establishing benchmark tasks for efficient intrinsic encoder quality evaluation that correlate well with downstream performance.

Center for Language and Speech Processing