Joint Visual-Text Modeling

Research Group of the 2004 Summer Workshop

Research activity in multimedia information retrieval has surged in recent years. This may be partly attributed to the emergence of a NIST-sponsored video analysis track at TREC, which coincides with renewed interest from industry and government in techniques for mining multimedia data.

Traditionally, multimedia retrieval has been viewed as a system-level combination of text- and speech-based retrieval techniques and image content-based retrieval techniques. It is our hypothesis that such system-level integration limits the exploitation of mutually informative and complementary cues in the different modalities. In addition, prevailing techniques for retrieval of images and speech differ vastly, which further inhibits truly cohesive interaction between these systems for multimedia information retrieval. For instance, if the query words have been incorrectly recognized by the ASR system, speech-based retrieval systems may fail to retrieve relevant video shots. Current systems then back off to image content-based search, and since image retrieval systems perform poorly at finding images related only by semantics, the overall performance of such late-fusion systems is poor. The situation is exacerbated in cross-language information retrieval, where subsequent machine translation further degrades the ASR transcripts.

We propose to investigate a unified approach to multimedia information retrieval. We proceed by discretizing the visual information in videos using blobs (Carson, 1997). Contrary to the simplistic representation that the name suggests, blobs robustly capture some key features of image regions or segments that are fundamental to object recognition: shape, texture and color information. The discretization then permits us to view multimedia retrieval as a task of retrieving documents comprising these visual tokens and words. This generalizes the statistical text retrieval models of IR to statistical multimedia retrieval models. With joint visual-text modeling, we can better represent the relationships between words and the associated image cues. In cases where the speech transcript is inaccurate, the visual part of the document can still be related to the query terms.
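The discretization step can be illustrated with a minimal sketch. It assumes each image region has already been reduced to a numeric feature vector and that a codebook of centroids is available; the function names, the toy two-dimensional features and the `blob_N` token format are hypothetical choices made for this example, not the workshop's actual infrastructure.

```python
from math import dist


def nearest_blob(feature, codebook):
    """Return the index of the codebook centroid closest to `feature`."""
    return min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))


def tokenize_regions(region_features, codebook):
    """Map each region feature vector to a discrete visual token."""
    return [f"blob_{nearest_blob(f, codebook)}" for f in region_features]


def joint_document(transcript_words, region_features, codebook):
    """Build one document that mixes transcript words and visual tokens."""
    words = [w.lower() for w in transcript_words]
    return words + tokenize_regions(region_features, codebook)


# Toy codebook (assumed): three centroids in a 2-D feature space.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# One video shot: two transcript words and two segmented regions.
shot = joint_document(["Tiger", "grass"], [(0.1, 0.2), (0.9, 0.8)], codebook)
print(shot)
```

Once a shot is represented this way, any retrieval model that operates on bags of tokens can in principle be applied to it without distinguishing words from blobs.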

The infrastructure for visual tokenization will be developed prior to the workshop and will not be a focus of the workshop. During the workshop, the team will focus on novel techniques for joint modeling of visual and text information. We will investigate a variety of techniques incorporating approaches suggested by Berger and Lafferty (1999), Ponte and Croft (1998), and Duygulu et al. (2002). In particular, methods that handle visual tokens in the same manner as word tokens will be investigated.
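As one illustration of handling visual tokens in the same manner as word tokens, the sketch below scores a query against a document with a translation-style query likelihood, loosely in the spirit of the cited translation and language-modeling approaches. It is not a reproduction of any of those models: the linear interpolation with a background distribution, the `trans` table, the token names and all probabilities are assumptions made up for this example.

```python
from collections import Counter
from math import log


def query_log_likelihood(query, doc_tokens, trans, background, lam=0.8):
    """Log-likelihood of the query words given one joint document.

    trans maps (query_word, doc_token) to an assumed translation probability;
    doc_token may be a word or a visual token.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no tokens")
    score = 0.0
    for q in query:
        p_doc = 0.0
        for token, count in counts.items():
            # Pairs missing from the table: a token translates only to itself.
            t_prob = trans.get((q, token), 1.0 if q == token else 0.0)
            p_doc += t_prob * count / total
        # Interpolate with a background model so unseen words do not zero out.
        score += log(lam * p_doc + (1.0 - lam) * background.get(q, 1e-6))
    return score


# Hypothetical values: blob_7 tends to co-occur with the word "tiger".
trans = {("tiger", "blob_7"): 0.6}
background = {"tiger": 0.01}

# In both shots the transcript has missed the word "tiger".
shot_a = ["grass", "blob_7"]
shot_b = ["grass", "blob_2"]

score_a = query_log_likelihood(["tiger"], shot_a, trans, background)
score_b = query_log_likelihood(["tiger"], shot_b, trans, background)
print(score_a > score_b)
```

In this toy setting the shot containing the associated visual token is ranked higher even though neither transcript contains the query word, which is the behavior the joint model is meant to provide when the transcript is unreliable.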

All techniques will be evaluated using the NIST 2003 Video TREC benchmark test corpus and queries. Information retrieval systems using the joint modeling approach will be compared with late fusion of unimodal systems of identical design (e.g., machine translation-based retrieval systems, LSI-based systems for both word tokens and visual tokens). We will investigate whether our newly proposed (joint) early fusion approach is indeed beneficial and compare its performance with a state-of-the-art Video TREC multimodal system.
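For contrast with the joint approach, a late-fusion baseline can be sketched as a weighted combination of the scores produced by separate text and visual systems. The weighted sum is only one common way to fuse scores and is an assumption here; the function name, the shot identifiers and the score values are hypothetical.

```python
def late_fusion(text_scores, visual_scores, text_weight=0.5):
    """Rank shots by a weighted sum of two unimodal scores.

    A shot missing from one system is given a score of 0.0 for that system.
    """
    shots = set(text_scores) | set(visual_scores)
    fused = {
        s: text_weight * text_scores.get(s, 0.0)
        + (1.0 - text_weight) * visual_scores.get(s, 0.0)
        for s in shots
    }
    # Highest fused score first; ties broken by shot identifier.
    return sorted(fused, key=lambda s: (-fused[s], s))


# Made-up scores from two independent unimodal systems.
text_scores = {"shot1": 0.9, "shot2": 0.1}
visual_scores = {"shot2": 0.7, "shot3": 0.4}

ranking = late_fusion(text_scores, visual_scores)
print(ranking)
```

Here the two modalities interact only through their final scores, whereas an early (joint) fusion model relates words and visual tokens within a single document representation.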

This workshop offers a unique opportunity to bring together experts from distinct disciplines. The team is diverse, with members from both industrial research and academia and with expertise in multimedia processing, language modeling and machine translation. This collaboration will result in a first demonstration of joint visual-text modeling for multimedia retrieval. The workshop will also allow graduate students to develop skills and new research directions, and will introduce undergraduate students to cutting-edge research in multimedia search.

Bibliography

Statistical Models for Automatic Video Annotation and Retrieval, Presentation by R. Manmatha

Team Update, August 4, 2004

Final Presentation, August 16, 2004

Final Report, November 24, 2004

Final Report (Addendum), February 24, 2005

Final Presentation Video

Team Members
Senior Members
Sanjeev Khudanpur	CLSP
Pinar Duygulu	Bilkent University, Turkey
Giri Iyengar	IBM TJ Watson Research Center
Dietrich Klakow	University of Saarland, Germany
Harriet Nock	IBM TJ Watson Research Center
R. Manmatha	University of Massachusetts
Graduate Students
Shaolei Feng	University of Massachusetts
Pavel Ircing	University of West Bohemia
Brock Pytlik	JHU
Paola Virga	JHU
Undergraduate Students
Desislava Petkova	Mount Holyoke
Matthew Krause	Georgetown

Center for Language and Speech Processing