Localizing Objects and Actions in Videos with the Help of Accompanying Text

Multimedia content is a growing focus of search and retrieval, personalization, categorization, and information extraction. Video analysis can find both objects and actions in video, but recognizing a large variety of categories remains very challenging. Text accompanying a video, however, often describes objects and actions at a semantic level and outlines the salient information in the video. Such textual descriptions are commonly available as closed captions, transcripts, or program notes. In this interdisciplinary project, we will combine natural language processing, computer vision, and machine learning to investigate how the semantic information in textual sources can be leveraged to improve the detection of objects and complex actions in video. We will parse the text to obtain verb-object dependencies, use lexical knowledge bases to identify words that describe these objects and actions, use web-scale image databases to obtain exemplars of the objects and actions, and build models that detect where in the video the objects and actions are localized.
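
To make the first step of this pipeline concrete, the following minimal Python sketch shows how verb-object pairs might be collected from a dependency parse of a caption. It is an illustration only, not the project's implementation: the example caption, the hand-written triples, the `dobj` relation label, and the function names `verb_object_pairs` and `objects_by_verb` are all assumptions introduced here. A real system would obtain the triples from a dependency parser.

```python
from collections import defaultdict

# Hypothetical, hand-written dependency triples for the made-up caption
# "The chef chops the onion and stirs the soup."
# Each triple is (head, relation, dependent). In practice these would be
# produced by a dependency parser, not written by hand.
CAPTION_DEPENDENCIES = [
    ("chops", "nsubj", "chef"),
    ("chops", "dobj", "onion"),
    ("chops", "conj", "stirs"),
    ("stirs", "dobj", "soup"),
    ("onion", "det", "the"),
]


def verb_object_pairs(dependencies, object_relation="dobj"):
    """Return (verb, object) pairs for every direct-object relation."""
    return [
        (head, dependent)
        for head, relation, dependent in dependencies
        if relation == object_relation
    ]


def objects_by_verb(pairs):
    """Group objects under the verb (action) that governs them."""
    grouped = defaultdict(list)
    for verb, obj in pairs:
        grouped[verb].append(obj)
    return dict(grouped)


pairs = verb_object_pairs(CAPTION_DEPENDENCIES)
actions = objects_by_verb(pairs)
print(pairs)
print(actions)
```

Under these assumptions, each resulting pair names an action and the object it applies to, which could then serve as a query term for the later steps (lexical lookup, exemplar retrieval, and detection in video).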


Final Report

Final Presentation | Video

Team Members

Senior Members

Cornelia Fermueller, University of Maryland
Jana Kosecka, George Mason
Jan Neumann, StreamSage/Comcast
Evelyne Tzoukermann, StreamSage

Graduate Students

Rizwan Chaudhry, Johns Hopkins University
Yi Li, University of Maryland
Ben Sapp, University of Pennsylvania
Gautam Singh, George Mason
Ching Lik Teo, University of Maryland
Xiaodong Yu, University of Maryland

Undergraduate Students

Francis Ferraro, University of Rochester
He He, Hong Kong Polytechnic University
Ian Perera, University of Pennsylvania

Affiliate Members

Yiannis Aloimonos, University of Maryland
Greg Hager, Johns Hopkins University
Rene Vidal, Johns Hopkins University