Localizing Objects and Actions in Videos with the Help of Accompanying Text

Multimedia content is a growing focus of search and retrieval, personalization, categorization, and information extraction. Video analysis can find both objects and actions in video, but recognizing a large variety of categories remains very challenging. Text accompanying the video, however, often describes objects and actions well at a semantic level and outlines the salient information in the video. Such textual descriptions are often available as closed captions, transcripts, or program notes. In this interdisciplinary project, we will combine natural language processing, computer vision, and machine learning to investigate how the semantic information contained in textual sources can be leveraged to improve the detection of objects and complex actions in video. We will parse the text to obtain verb-object dependencies, use lexical knowledge bases to identify words that describe these objects and actions, use web-wide image databases to get exemplars of the objects and actions, and build models that can detect where in the video the objects and actions are localized.

Final Report
Final Presentation
Final Presentation Video


Team Members 
Senior Members
Cornelia Fermueller, University of Maryland
Jana Kosecka, George Mason
Jan Neumann, StreamSage/Comcast
Evelyne Tzoukermann, StreamSage
Graduate Students
Rizwan Chaudhry, Johns Hopkins University
Yi Li, University of Maryland
Ben Sapp, University of Pennsylvania
Gautam Singh, George Mason
Ching Lik Teo, University of Maryland
Xiaodong Yu, University of Maryland
Undergraduate Students
Francis Ferraro, University of Rochester
He He, Hong Kong Polytechnic University
Ian Perera, University of Pennsylvania
Affiliate Members
Yiannis Aloimonos, University of Maryland
Greg Hager, Johns Hopkins University
Rene Vidal, Johns Hopkins University

Center for Language and Speech Processing