Multimodal large language models (LLMs) that can process inputs in a variety of modalities, including text, speech, images and video, and perhaps generate outputs in multiple modalities as well, are an active topic of research and development.
Most current models rely on a pre-trained text-only LLM as their backbone, with modality-specific encoders and adaptors that map non-textual inputs into the same representation space as the text LLM, and corresponding modality-specific adaptors and decoders that generate non-textual outputs. This modular approach has its advantages. For example, it enables each encoder to be pre-trained independently in an unsupervised manner. However, it also has disadvantages, which motivate the following research questions:
- Can one train modality-specific encoders to be task-universal? Specifically, each encoder must effectively encode all the information needed for a diverse set of downstream tasks that are not known during pre-training, and must do so without access to task-relevant information from the other modalities.
- Can one train a single modality-universal encoder, or jointly train modality-specific encoders, without assuming any “parallel” (i.e., synchronously captured) training data? Specifically, learning downstream tasks could be simplified if semantically equivalent input elements across modalities were disentangled from elements unique to a single modality and mapped to a shared representation subspace.
- Can encoders be evaluated intrinsically, before they are integrated with the backbone LLM and assessed on a multimodal benchmark? Specifically, a set of encoder-only tasks whose results correlate strongly with eventual performance on the multimodal benchmark would greatly accelerate the development and adaptation of encoders.
A team led by David Harwath (University of Texas, Austin), with team members from other universities and industry, proposes to tackle these questions.