Authors: David Harwath, Karen Livescu, Georg Heigold, Shankar Kumar.
We are heading towards a near future in which multimodal large language models (LLMs) are used for most tasks involving a variety of modalities: written language, spoken language, general audio, images and video, and beyond. Today’s multimodal LLMs typically combine pre-trained text LLMs with modality-specific encoders for the various input modalities. This approach places a large burden on the encoders, and raises several natural questions:
Existing work has begun to address some of these questions, but much of the space remains unexplored. Our proposed workshop project will address these challenges by (1) developing techniques for learning task- and modality-universal encoders and (2) establishing benchmark tasks for efficient intrinsic evaluation of encoder quality that correlates well with downstream performance.