Localizing Objects and Actions in Videos with the Help of Accompanying Text

Multimedia content is a growing focus of search and retrieval, personalization, categorization, and information extraction. Video analysis can find both objects and actions in video, but recognizing a large variety of categories remains very challenging. Text accompanying the video, however, often describes objects and actions at a semantic level and outlines the salient information present in the video. Such textual descriptions are often available as closed captions, transcripts, or program notes. In this interdisciplinary project, we will combine natural language processing, computer vision, and machine learning to investigate how the semantic information contained in textual sources can be leveraged to improve the detection of objects and complex actions in video. We will parse the text to obtain verb-object dependencies, use lexical knowledge bases to identify words that describe these objects and actions, use web-scale image databases to get exemplars of the objects and actions, and build models that can localize the objects and actions in the video.
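
As a rough illustration of the first step, the sketch below pulls verb-object pairs out of timestamped captions and turns them into time windows where a detector might search. It is not the project's implementation: the Caption class, the function names, the "dobj" relation label, the 2-second padding, and the sample captions are all assumptions made for this example, and the dependency triples are written by hand rather than produced by a real parser.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    """A closed-caption segment (hypothetical structure for this sketch)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    # Dependency triples (head, relation, dependent). Assumed to come from
    # some external parser; hand-written here for illustration.
    deps: list


def verb_object_pairs(deps):
    """Return (verb, object) pairs from direct-object dependency triples."""
    return [(head, dep) for head, rel, dep in deps if rel == "dobj"]


def candidate_windows(captions, pad=2.0):
    """Map each verb-object pair to padded time windows to search in the video.

    The padding is an assumed tolerance for captions that lead or lag
    the action they describe.
    """
    windows = {}
    for cap in captions:
        for pair in verb_object_pairs(cap.deps):
            window = (max(0.0, cap.start - pad), cap.end + pad)
            windows.setdefault(pair, []).append(window)
    return windows


# Invented sample data, not taken from the project.
captions = [
    Caption(10.0, 14.0, [("chop", "nsubj", "chef"), ("chop", "dobj", "onion")]),
    Caption(30.5, 33.0, [("stir", "dobj", "soup")]),
    Caption(1.0, 3.0, [("chop", "dobj", "onion")]),
]
windows = candidate_windows(captions)
```

In a fuller pipeline, each pair would presumably be expanded with a lexical knowledge base and matched against image exemplars before a visual model is run inside the candidate windows.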

Final Report
Final Presentation
Final Presentation Video

Team Members
Senior Members
Cornelia Fermueller, University of Maryland
Jana Kosecka, George Mason
Jan Neumann, StreamSage/Comcast
Evelyne Tzoukermann, StreamSage
Graduate Students
Rizwan Chaudhry, Johns Hopkins University
Yi Li, University of Maryland
Ben Sapp, University of Pennsylvania
Gautam Singh, George Mason
Ching Lik Teo, University of Maryland
Xiaodong Yu, University of Maryland
Undergraduate Students
Francis Ferraro, University of Rochester
He He, Hong Kong Polytechnic University
Ian Perera, University of Pennsylvania
Affiliate Members
Yiannis Aloimonos, University of Maryland
Greg Hager, Johns Hopkins University
Rene Vidal, Johns Hopkins University

Center for Language and Speech Processing