In today's post I want to briefly discuss a computer vision paper which has caught my attention.
In the paper "Recognizing Indoor Scenes," Quattoni and Torralba build a scene recognition system for categorizing indoor images. Instead of performing learning directly in descriptor space (such as the GIST computed over the entire image), the authors use a "distance-space" representation: an image is described by a vector of distances to a large number of scene prototypes. A scene prototype consists of a root feature (the global GIST) as well as features belonging to a small number of regions of interest (ROIs) associated with the prototype. One example of such a prototype might be an office scene with a monitor region in the center of the image and a keyboard region below it. However, the ROIs (which can be thought of as parts of the scene) are often more abstract and do not neatly correspond to a single object.
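To make the distance-space idea concrete, here is a minimal sketch of how an image could be mapped to a vector of prototype distances. The function names, the Euclidean distance, and the rule that each ROI matches its closest candidate region are my own simplifying assumptions for illustration, not the exact formulation from the paper.

```python
import numpy as np

def prototype_distance(root, regions, prototype):
    """Distance from one image to one prototype (hypothetical form).

    root:      (d_root,) global descriptor of the image, e.g. GIST
    regions:   (n_candidates, d_reg) descriptors of candidate windows
    prototype: dict with 'root' (d_root,) and 'rois' (n_rois, d_reg)
    """
    # Root term: how far the global descriptor is from the prototype's root.
    d = np.linalg.norm(root - prototype["root"])
    for roi in prototype["rois"]:
        # Assumption: each ROI is matched to its closest candidate region.
        d += np.min(np.linalg.norm(regions - roi, axis=1))
    return d

def distance_space(root, regions, prototypes):
    """Describe an image by its distances to every prototype."""
    return np.array([prototype_distance(root, regions, p) for p in prototypes])

# Toy image: a 4-d "global" descriptor and two 2-d region descriptors.
root = np.zeros(4)
regions = np.array([[0.0, 0.0], [3.0, 4.0]])

prototypes = [
    {"root": np.zeros(4), "rois": np.array([[3.0, 4.0]])},
    {"root": np.array([2.0, 0.0, 0.0, 0.0]),
     "rois": np.array([[0.0, 0.0], [3.0, 0.0]])},
]

vec = distance_space(root, regions, prototypes)
print(vec)  # one distance per prototype
```

The resulting vector has one entry per prototype, so its length depends on the number of prototypes rather than on the dimensionality of the underlying descriptors.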
The learning problem (which is solved once per category) is then to find the internal parameters of each prototype as well as the per-class prototype distance weights which are used for classification. From a distance function learning point of view, it is rather interesting to see distances to many exemplars being used as opposed to the distance to a single focal exemplar.
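As a rough illustration of the per-class distance weights, the sketch below fits a linear classifier for a single category on top of precomputed distance vectors. It is only a stand-in: I am using plain logistic regression with gradient descent, the prototypes are held fixed rather than learned jointly, and the names `fit_class_weights` and `score` are hypothetical.

```python
import numpy as np

def fit_class_weights(D, y, lr=0.1, steps=500):
    """Fit weights for one category over prototype distances.

    D: (n_images, n_prototypes) distance-space vectors
    y: (n_images,) 1 if the image belongs to the category, else 0
    Assumption: logistic loss, prototypes held fixed (not the paper's objective).
    """
    X = np.hstack([D, np.ones((D.shape[0], 1))])  # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # gradient step on the logistic loss
    return w

def score(D, w):
    """Linear score for the category; positive means 'belongs'."""
    X = np.hstack([D, np.ones((D.shape[0], 1))])
    return X @ w

# Toy data: images of the category are close to prototype 0 and far from prototype 1.
D = np.array([[0.1, 2.0], [0.2, 1.8], [2.0, 0.1], [1.9, 0.3]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = fit_class_weights(D, y)
scores = score(D, w)
print(scores)
```

In this toy setup, the weight on a prototype's distance tells you whether being close to that prototype counts as evidence for or against the category.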
Although the authors report results on the image categorization task, it is worthwhile to ask whether scene prototypes could be used for object localization. While it is easy to play the evil genius and devise an image unique enough that it doesn't conform to any notion of a prototype, I wouldn't be surprised if 80% of the images we encounter on the internet conform to a few hundred scene prototypes. Of course, the problem of learning such prototypes from data without prototype labeling (which requires expert vision knowledge) is still open. Overall, I like the direction and ideas contained in this research paper, and I'm looking forward to seeing how these ideas develop.
Isn't it similar to the part-based detector work? I guess the region prototypes are treated as latent variables.
Yeah this work reminds me of the Deformable Part Models that Deva and Pedro have worked on!
Is this one of the attempts to exploit the benefit of a huge visual space (from similar images) for recognition/detection? I see Torralba's group has been pursuing this direction for a couple of years.
While I think this approach does use a reasonably sized set of images, it is somewhat orthogonal to the gargantuan datasets used in Hays/Efros-style image matching or Torralba-style 80 million tiny image approaches.
I think that this paper is more similar to Torralba's work on context regarding object information given the gist. The most similar prior work to this from Torralba's group is Bryan Russell's Scene Alignment paper from NIPS.
Because they are using prototypes, this paper has a nice link to representational theories of concepts from psychology.
In the next decade or so we will see these types of deformable scene models applied to millions of images. This type of work might even come out of Torralba's group -- if I don't beat them to it!