Today's post is about 3D object recognition, that is, the localization and recognition of objects in 3D laser data (as opposed to the perception/recovery of 3D from 2D images).
My first exposure to object recognition was in the context of specific object recognition inside 3D laser scans. In specific object recognition, you are looking for 'stapler X' or 'computer keyboard Y' and not just any stapler/computer keyboard. If the computer keyboard was black then it will always be black since we assume intrinsic appearance doesn't change in specific object recognition. This is a different (and easier!) problem than category-based recognition where colors and shapes can change due to intra-class variation.
The problem of specific object 3D recognition I'll be discussing is as follows:
Given M detailed 3D object models, localize all (if any) of these objects (in any spatial configuration) in a 3D laser scan of a scene potentially containing much more stuff than just the objects of interest (aka the clutter).
There was actually quite a lot of research in this style of 3D recognition in the 1990s, driven by the belief that 3D recognition would be much simpler than recognition from 2D images. The idea (Marr's idea, actually) was that object recognition in 2D images would be preceded by object-identity-independent 3D surface extraction, so that 2D recognition would resemble this version of 3D recognition after some initial geometric processing.
However, it turns out that many of the ambiguities present in 2D imagery are also present in 3D laser data -- the problems of bottom-up perceptual grouping are as difficult in 3D as in 2D. Just because you have 3D locations associated with parts of an object does not make it any easier to tell where the object begins and where it ends (namely the problem of segmentation). It is this inability to segment out objects that resulted in the widespread use of local descriptors such as SIFT.
Many of today's 2D object recognition approaches rely on local descriptors, which bypass the problem of segmentation, so it isn't surprising that the 3D recognition problem I described above was elegantly approached by A.E. Johnson and M. Hebert as early as 1997 via a local 3D descriptor known as a Spin Image.
The idea behind a Spin Image is actually very similar to that of a SIFT descriptor used in image-based object recognition. A spin image is a regional point descriptor used to characterize the shape properties of a 3D object with respect to a single oriented point (a surface point together with its normal). It is called a "spin" image because the process of creating such a descriptor can be envisioned as spinning a sheet around the axis defined by the oriented point and collecting the contributions of nearby points as the sheet sweeps through them. Since a point's normal can be computed fairly robustly given its neighboring points, the spin image is highly robust to rigid transformations when defined with respect to this canonical frame. Since it is a 2D histogram and not a full 3D one (the angle around the normal is thrown away), it does lose some discriminative power -- two different yet related surface chunks can have the same spin image. The idea behind using this descriptor for recognition is that we can compute many of these descriptors all over the surface of our object models as well as over the input 3D laser scan. We then have to match these descriptors to create some sort of correspondences (potentially spatially verified).
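To make the "spinning sheet" picture concrete, here is a minimal sketch in Python/NumPy. Each neighboring point is reduced to two numbers relative to the oriented point (p, n): beta, its signed height along the normal, and alpha, its distance from the axis through p along n. The spin image is then a 2D histogram over (alpha, beta). The function names, `bin_size`, and `image_width` are my own choices for illustration, and I use plain binning here, whereas the original formulation spreads each point over neighboring bins with bilinear interpolation. Treat it as a sketch, not a reference implementation.

```python
import numpy as np

def spin_coordinates(points, p, n):
    """Map 3D points to (alpha, beta) relative to the oriented point (p, n)."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n  # signed height along the normal
    # Radial distance from the axis through p along n (clamped against round-off).
    alpha = np.sqrt(np.maximum(np.sum(d * d, axis=1) - beta ** 2, 0.0))
    return alpha, beta

def spin_image(points, p, n, bin_size=0.25, image_width=8):
    """Accumulate (alpha, beta) into a 2D histogram; points outside the support are ignored."""
    alpha, beta = spin_coordinates(points, p, n)
    support = bin_size * image_width
    alpha_edges = np.linspace(0.0, support, image_width + 1)
    beta_edges = np.linspace(-support / 2, support / 2, image_width + 1)
    hist, _, _ = np.histogram2d(alpha, beta, bins=[alpha_edges, beta_edges])
    return hist

# Toy data: a random point cloud standing in for a scan, with a made-up normal.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))
p = points[0]
n = rng.normal(size=3)
n /= np.linalg.norm(n)

# Apply a random rigid transformation (rotation R, translation t) to everything.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1  # make it a proper rotation
t = rng.normal(size=3)
points_moved = points @ R.T + t
p_moved = R @ p + t
n_moved = R @ n

a1, b1 = spin_coordinates(points, p, n)
a2, b2 = spin_coordinates(points_moved, p_moved, n_moved)
h1 = spin_image(points, p, n)
h2 = spin_image(points_moved, p_moved, n_moved)
print("max coordinate difference:", max(np.abs(a1 - a2).max(), np.abs(b1 - b2).max()))
```

The (alpha, beta) coordinates agree before and after the rigid motion up to floating-point error, which is where the robustness to rigid transformations comes from. In a real pipeline the normal would be estimated from neighboring points rather than handed in, and matching would compare many such histograms between the models and the scene.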
(For a fairly recent overview of spin images as well as other similar regional shape descriptors and their applications to 3D object recognition, check out Andrea Frome's ECCV 2004 paper, Recognizing Objects in Range Data Using Regional Point Descriptors.)
Spin images aren't a thing of the past. In fact, here is a link to an RSS 2009 paper by Kevin Lai and Dieter Fox which uses spin images (and my local distance function learning approach!):
3D Laser Scan Classification Using Web Data and Domain Adaptation