Kate Saenko (Boston University) “Connecting Vision and Language End-to-End”

March 2, 2018 @ 12:00 pm – 1:15 pm
Hackerman Hall B17
3400 N Charles St
Baltimore, MD 21218


Despite much progress in neural models for joint vision and language understanding, current models are largely opaque and non-compositional. Many language tasks are inherently compositional and can be solved by decomposing them into modular sub-problems. I will describe End-to-End Module Networks (N2NMNs), which learn to answer questions about images by learning to decompose each question into subtasks, implemented as neural network modules. Experimental results show that N2NMNs achieve better accuracy than state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question. I will also talk about our recent work on dense video captioning, and describe an end-to-end network that localizes activities in a long video and generates a caption describing each detected activity.


Kate Saenko is an Associate Professor of Computer Science at Boston University, director of the Computer Vision and Learning Group, and co-director of the AI Research initiative at BU. Her past academic positions include Assistant Professor in the Computer Science Department at UMass Lowell, Postdoctoral Researcher at ICSI, Visiting Scholar at UC Berkeley EECS, and Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests are in the broad area of Artificial Intelligence, with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.

Center for Language and Speech Processing