Towards a Detailed Understanding of Objects and Scenes in Natural Images

Final Report

When a human glances at an object, for example an apple, a building, or a rifle, they are immediately aware of many of the object's qualities. For instance, the apple may be red or green, the building exterior may be reflective (glass) or dull (concrete), and the rifle may be made of metal or plastic. These properties, or attributes, can be used to describe objects (e.g., differentiating a green apple from a red one), to further qualify them (e.g., a plastic rifle is probably a toy), or to improve discrimination (e.g., an object in the shape of a cat but made of stone is probably not an animal, but a statue). By contrast, even the best systems for artificial vision have a much more limited understanding of objects and scenes. For instance, state-of-the-art object detectors model objects as distributions of simple features which capture blurred statistics of the two-dimensional shape of the objects. Colour, material, texture, and most other object attributes are largely ignored in the process.

The objective of this workshop is to develop novel methods to reliably extract a diverse set of attributes from images, and to use them to improve the accuracy, informativeness, and interpretability of object models. The goal is to combine advances in discrete-continuous optimisation, machine learning, and computer vision to significantly advance our understanding of visual attributes and to produce new state-of-the-art methods for their extraction. Inspired by popular features, easy-to-use open-source software for the extraction of the new attributes will be released, with the goal of commodifying the use of attributes in computer vision.

Due to their significance, visual attributes have become an increasingly popular topic of research. Nevertheless, results have so far been limited. For example, while most attributes are associated with a given object or object part, and therefore have a local scope, some methods treat them as global image properties. As a consequence, such methods are more likely to deduce the presence of attributes from correlated objects (e.g., there is a car and hence metal) instead of detecting the attributes as such (e.g., this region looks chrome, indicating that there may be a car or another metallic object). Other methods roughly localise attributes by bounding boxes, for instance from regions obtained by detecting the object of interest first. In this case, attributes may be useful to qualify objects a posteriori, but are not an integral part of the object model during detection. Finally, all these methods use canned features and models, which are probably suboptimal for a large number of attributes, as these are visually subtle properties (for example, detecting chrome requires finding reflections).

The work of this six-week workshop will focus on four areas: (i) identifying and systematising visual attributes, (ii) collecting annotated data for learning and evaluating attribute models, (iii) exploring novel learning and inference techniques to better extract a diverse set of attributes, and (iv) evaluating the new representations in canonical tasks such as object detection on international benchmark data. These areas are detailed next.

Attributes: Attributes may refer to any of a large number of very different concepts, including colours, textures, materials, two- or three-dimensional shapes, object parts, and relations. While these attributes are often treated equally in terms of modelling and detection, they share little beyond being localised in an image. In preparation for the workshop, attributes will be roughly subdivided by expected modelling requirements, abstraction level, and visualness, identifying prototypical attributes for each class. By focusing on these prototypes, most of the key issues in attribute extraction will be investigated, while keeping the scope reasonable for the relatively short time available.

Data: High-quality data and annotations have been instrumental to many advances in computer vision. For example, the introduction of the PASCAL VOC challenge data has significantly boosted the performance of object detectors. In preparation for the workshop, in a collaborative effort using Amazon Mechanical Turk, the existing PASCAL VOC dataset will be extended with annotations for the selected attributes, including their localisation in images. By extending an established dataset, this effort can be expected to be useful to the computer vision community at large.

Modelling, inference, and learning: The core of the workshop, spanning the six weeks, will be the development, learning, and evaluation of models for visual attributes. Modelling and inference will be based on novel ideas in discrete-continuous optimisation. The goal is to efficiently decompose an image into a set of regions characterised by semantically meaningful but visually subtle attributes. Unfortunately, the expressive power of standard segmentation methods such as Markov Random Fields (MRFs) is severely limited by their inability to capture the appearance of segments as a whole. MRFs, for example, simply add a smoothness prior to evidence that is otherwise pooled locally and independently at each image pixel. These algorithms are therefore usually considered adequate for recognising stuff, i.e. homogeneous patterns which do not have a characteristic shape, such as grass, sky, or wood. They cannot, however, infer holistic models of the regions, which leaves out cues such as gradients, low-rank textures, repeated patterns, and shapes that can be essential in the recognition of certain attributes. For example, recognising chrome requires analysing the overall structure of a region to identify reflections, something that a standard MRF cannot achieve. Discriminating between different instances of the same attribute is also difficult: for example, in order to differentiate horizontal black-and-white stripes from vertical yellow-and-green ones, one would have to instantiate a new MRF label for each case, an approach that does not scale.
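As a rough illustration of the limitation described above, the following minimal sketch evaluates the energy of a labelling under a standard pairwise MRF with a Potts smoothness prior on a 4-connected pixel grid. The function name `potts_energy`, the nested-list data layout, and the `smoothness` weight are assumptions made for this example and are not part of the workshop software. Note that the data term looks at each pixel in isolation, so no property of a region as a whole enters the energy.

```python
def potts_energy(unary, labels, smoothness=1.0):
    """Energy of a labelling on a 4-connected pixel grid (illustrative only).

    unary[r][c][k] is the cost of giving pixel (r, c) label k;
    labels[r][c] is the label chosen for that pixel.
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            # Data term: evidence pooled independently at each pixel.
            energy += unary[r][c][labels[r][c]]
            # Smoothness term: penalise a label change to the right...
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                energy += smoothness
            # ...and a label change to the pixel below.
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                energy += smoothness
    return energy


# Toy 2x2 image with two labels: the top row prefers label 0, the bottom row label 1.
unary = [[[0, 1], [0, 1]],
         [[1, 0], [1, 0]]]
split = [[0, 0], [1, 1]]    # follows the per-pixel evidence, pays for two label changes
uniform = [[0, 0], [0, 0]]  # perfectly smooth, pays the data cost on the bottom row
print(potts_energy(unary, split), potts_energy(unary, uniform))
```

In this toy case both labellings have the same energy when `smoothness` is 1, which shows how the prior trades off against purely local evidence.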

While segmentation methods have been combined with holistic top-down models before, for example to propose possible object locations, to refine object localisation, or both, simultaneously segmenting the image and estimating non-trivial models for each segment has led to intractable energy functions that are difficult to optimise even approximately. Nevertheless, there exist a few examples that carry out this program efficiently, at least in special cases. In particular, the workshop will explore powerful techniques inspired by the combination of MRFs with sampling techniques such as RANSAC.
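For readers unfamiliar with RANSAC, the sketch below shows the generic hypothesise-and-verify loop on a toy problem, fitting a line to 2-D points contaminated by outliers. It is not the method developed in the workshop; how sampling is coupled with an MRF is not specified here. The function name `ransac_line` and its parameters are assumptions made for this example.

```python
import random


def ransac_line(points, n_iters=200, inlier_tol=0.1, seed=0):
    """Fit y = a*x + b to 2-D points by RANSAC; returns (a, b, inliers)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    best = (0.0, 0.0, [])
    for _ in range(n_iters):
        # Hypothesise a model from a minimal sample of two points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line: not representable as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Verify the hypothesis by collecting the points that agree with it.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= inlier_tol]
        if len(inliers) > len(best[2]):
            best = (a, b, inliers)
    return best


# Ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(3, 20), (7, -5), (1, 9)]
a, b, inliers = ransac_line(points)
print(a, b, len(inliers))
```

The appeal of this scheme in the present context is that a model is estimated from the data that support it as a whole, rather than from independent per-pixel evidence.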

In terms of preparatory work, the VLFeat library will be adopted as a simple-to-use toolkit for basic image processing and feature extraction. Additional software implementing the discrete-continuous optimisation techniques discussed above will be made available to the workshop participants. During the workshop these techniques will be used to design, learn, and test new and more advanced attribute models. The synergy between the team members, each with their own expertise in various areas of computer vision, is likely to be key to success.

Evaluation: Throughout the workshop, attribute models will be evaluated in terms of retrieval performance on the annotated PASCAL VOC data. Starting from week four, attributes will be tested as an additional cue in learning object category models in PASCAL VOC. Extensible state-of-the-art software will be provided to the participants to make this a plug-and-play operation. Both the accuracy and the interpretability of the new detectors will be evaluated (for example, we expect to learn automatically that cars are often chrome and have windows made of glass).
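The text does not say which retrieval measure is used. One common choice for ranked retrieval is average precision, and the sketch below computes its plain (non-interpolated) form, which may differ from the variant implemented in the PASCAL VOC evaluation software. The function name `average_precision` and the example scores are assumptions made for illustration.

```python
def average_precision(scores, relevant):
    """Non-interpolated average precision of the ranking induced by the scores.

    scores[i] is the attribute classifier score for item i;
    relevant[i] is True if item i actually has the attribute.
    """
    # Rank items from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Four regions scored for a hypothetical "chrome" attribute; two are truly chrome.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(ap)
```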


Team Members

Senior Members
Matthew Blaschko, Ecole Centrale Paris
Iasonas Kokkinos, Ecole Centrale Paris
Subhransu Maji, Toyota Technological Institute at Chicago
Esa Rahtu, University of Oulu, Finland
Ben Taskar, University of Pennsylvania
Andrea Vedaldi, University of Oxford

Graduate Students
Ross Girshick, University of Chicago
Siddarth Mahendran, Johns Hopkins University
Karen Simonyan, University of Oxford

Undergraduate Students
Sammy Mohamed, Stony Brook University
Naomi Saphra, Carnegie Mellon University

Affiliate Members
Juho Kannala, University of Oulu, Finland

Center for Language and Speech Processing