Multimedia content is a growing focus of search and retrieval, personalization, categorization, and information extraction. Video analysis allows us to find both objects and actions in video, but recognizing a large variety of categories is very challenging. Text accompanying the video, however, often describes objects and actions at a semantic level and outlines the salient information present in the video. Such textual descriptions are commonly available as closed captions, transcripts, or program notes. In this interdisciplinary project, we will combine natural language processing, computer vision, and machine learning to investigate how the semantic information contained in textual sources can be leveraged to improve the detection of objects and complex actions in video. We will parse the text to obtain verb-object dependencies, use lexical knowledge bases to identify words that describe these objects and actions, use web-wide image databases to obtain exemplars of the objects and actions, and build models that localize these objects and actions within the video.
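As a rough illustration of the text side of this pipeline, the sketch below extracts verb-object dependencies from a caption sentence and looks up candidate word senses in a lexical knowledge base. The use of spaCy for dependency parsing and NLTK's WordNet interface is an assumption for illustration only, not the project's actual toolchain.

```python
# Minimal sketch (illustrative assumptions, not the project's actual tools):
# extract verb-object dependencies from accompanying text and map the words
# to WordNet senses that could later be linked to visual exemplars.
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")  # small English model with a dependency parser

def verb_object_pairs(text):
    """Return (verb lemma, direct-object lemma) pairs found in the text."""
    doc = nlp(text)
    pairs = []
    for token in doc:
        # A direct object attaches to its governing verb via the 'dobj' relation.
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            pairs.append((token.head.lemma_, token.lemma_))
    return pairs

caption = "The chef chops the onions and then stirs the soup."
for verb, obj in verb_object_pairs(caption):
    # Candidate senses for the action and the object; in the project these
    # would be matched against exemplars retrieved from image databases.
    verb_senses = wn.synsets(verb, pos=wn.VERB)
    obj_senses = wn.synsets(obj, pos=wn.NOUN)
    print(verb, obj, verb_senses[:1], obj_senses[:1])
```

Running the sketch requires the spaCy English model and the NLTK WordNet data to be installed locally.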
- Abstract
- Final Report
- Final Presentation
- Final Presentation Video
| Team Members | Affiliation |
| --- | --- |
| Senior Members | |
| Cornelia Fermueller | University of Maryland |
| Jana Kosecka | George Mason University |
| Jan Neumann | StreamSage/Comcast |
| Evelyne Tzoukermann | StreamSage |
| Graduate Students | |
| Rizwan Chaudhry | Johns Hopkins University |
| Yi Li | University of Maryland |
| Ben Sapp | University of Pennsylvania |
| Gautam Singh | George Mason University |
| Ching Lik Teo | University of Maryland |
| Xiaodong Yu | University of Maryland |
| Undergraduate Students | |
| Francis Ferraro | University of Rochester |
| He He | Hong Kong Polytechnic University |
| Ian Perera | University of Pennsylvania |
| Affiliate Members | |
| Yiannis Aloimonos | University of Maryland |
| Greg Hager | Johns Hopkins University |
| Rene Vidal | Johns Hopkins University |