Joint Visual-Text Modeling for Multimedia Retrieval


The ability to search for text in large databases and on the web has added tremendous value to our lives. Yet these capabilities do not extend to searching for images or video clips unless the images are accompanied by text in the form of captions. Wouldn't it be great if we could describe in words a picture we are looking for in a database or on the web, and get back images and video clips that match? Wouldn't it be great to have a multimedia Google?

This proposal is about Multimedia Information Retrieval (content-based search of video-stream archives) using combined techniques from image processing and text processing. Today's systems for Multimedia Information Retrieval approach this by focusing on either the text (speech) or the video part alone. In this project, we will work on combining these distinct approaches in a mathematically consistent framework.

Our main idea is to use automatic techniques to identify object-regions in images and characterize their shape, color and texture in an approximate way, and simultaneously to perform automatic speech recognition on the audio component of the video. We will then use machine learning and language modeling techniques to automatically identify typical visual objects that are frequently seen together with keywords, e.g., a checkered black-and-white blob along with the name Yasser Arafat (recall that Arafat usually wears a checkered head-scarf). These joint statistics will be used to construct a unified visual-text model.
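To make the idea of joint statistics concrete, here is a minimal sketch that counts how often a visual token and a recognized word occur in the same video shot, and normalizes those counts into an estimate of P(word | visual token). It is an illustration only, not the model this project will develop: the function name, the token names, and the toy data are all hypothetical, and the sketch assumes that region detection and speech recognition have already turned each shot into discrete tokens.

```python
from collections import Counter, defaultdict


def cooccurrence_model(shots):
    """Estimate P(word | visual token) from shot-level co-occurrence counts.

    `shots` is a list of (visual_tokens, words) pairs, one pair per shot.
    Both the data layout and the relative-frequency estimate are
    assumptions made for this sketch.
    """
    pair_counts = defaultdict(Counter)
    for visual_tokens, words in shots:
        # Count each (visual token, word) pair at most once per shot.
        for v in set(visual_tokens):
            for w in set(words):
                pair_counts[v][w] += 1

    model = {}
    for v, counts in pair_counts.items():
        total = sum(counts.values())
        # Relative frequency, with no smoothing.
        model[v] = {w: c / total for w, c in counts.items()}
    return model


# Toy, made-up data: three shots with hypothetical token names.
shots = [
    (["checkered_blob", "sky"], ["arafat", "speech"]),
    (["checkered_blob"], ["arafat"]),
    (["sky", "grass"], ["football"]),
]

model = cooccurrence_model(shots)
# Word most strongly associated with the checkered blob in the toy data.
best_word = max(model["checkered_blob"], key=model["checkered_blob"].get)
print(best_word, model["checkered_blob"][best_word])
```

On real data, raw relative frequencies like these would be dominated by frequent words, so some form of smoothing or a proper translation or language model would be needed in their place.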

We plan to investigate several modeling techniques from language processing and apply them to a vocabulary of ordinary words and visual tokens. The approaches that we develop will be benchmarked against leading systems in the latest NIST evaluations on Multimedia Retrieval.

Participating in this project gives you an opportunity to work with top people from industrial and academic research, and to advance the state of the art in multimedia retrieval.


The Center for Language and Speech Processing
The Johns Hopkins University
3400 North Charles Street, Barton Hall
Baltimore, MD 21218
Telephone: (410) 516-4237
Fax: (410) 516-5050
E-mail: clsp@clsp.jhu.edu