Tombone's Computer Vision Blog: localizing sheep

Monday, April 17, 2006

localizing sheep

Here are our top 9 sheep detections sorted by a primitive confidence value. The dotted green bounding box is our result and the ground truth label is in yellow. Note that our top 9 sheep detections are all true positives.
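For readers who want to reproduce this kind of ranking, here is a minimal sketch of one way it could be scored. It is not the code behind the figure. It assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and counts a detection as a true positive when its intersection-over-union (IoU) with some ground-truth box is at least 0.5, the usual PASCAL VOC convention. The function names `iou` and `rank_detections` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def rank_detections(detections, ground_truth, threshold=0.5):
    """Sort detections by confidence (highest first) and flag true positives.

    detections: list of (confidence, box) pairs.
    ground_truth: list of labeled boxes for the same image.
    Assumption: the 0.5 threshold is illustrative, not taken from the post.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    results = []
    for confidence, box in ranked:
        # Best overlap with any labeled box; 0.0 if there are no labels.
        best = max((iou(box, gt) for gt in ground_truth), default=0.0)
        results.append((confidence, box, best >= threshold))
    return results
```

This sketch lets several detections match the same ground-truth box; a full evaluation would normally count only the highest-scoring match and treat the rest as false positives.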

7 comments:

Anonymous, 3:54 PM
So your machine can pretty readily detect sheep, good work. However, will you have to do this training and analysis for every object you want your program to recognize? I guess that's probably where the big problem resides, right?
Tomasz Malisiewicz, 8:33 AM
It did an exceptionally good job on the first 9 sheep, but it is still far from perfect on sheep.

Collecting manually labeled training data for every object in the world simply doesn't work. Ideally we would provide the computer with a small amount of training data (on the order of 100K labeled images of 'popular' objects) and have the machine automatically learn about new objects in new images. How this is going to work, I have no clue, but perhaps you understand why I'm as interested in Machine Learning as I am in Computer Vision.

A deeper question is: can a machine (or even a human) learn everything we want it to know about the world from disconnected views (unrelated images) of the world?
Anonymous, 6:53 PM
Maybe images are just not the way to go. Maybe, just maybe, they don't contain enough information to do anything like what you "vision ppl" are trying to do. I know the excuse/motivation cited every single time is that "hey, humans can do it.. so we should try to make a machine that can do the same". Well, what if humans use memory? What if a single image doesn't even contain enough information for such a thing to ever be possible? Or even worse, what if all that humans do is learn just enough to get by in their lives?
Tomasz Malisiewicz, 8:07 PM
Just as Alan Turing thought that machines could pass the Turing Test by the year 2000, vision researchers are overly optimistic about the future of computer vision.

Before giving up building vision systems that learn about the visual world from disconnected views of the world (unrelated images), I would like to propose a few alternative directions.

Active Agents
First of all, there is something very interesting going on when a human is learning about objects in the world. As an active agent in the world, a human is generally capable of altering their viewpoint in order to get an alternative view of an object of interest. Most modern vision research is concerned with learning from static images, and the vision system is not allowed to 'get a slightly different view.' In other words, vision research has been making the assumption that data collection can be decoupled from learning -- but I'm fairly confident in saying that for humans, gathering data (observing new things) and learning are somehow integrated.

Video
Perhaps using video (sequences of highly related images) and adding the time dimension into the picture can help vision. After all, when a human reasons about objects they don't simply reason about the visual appearance of that particular object; humans also reason about how particular objects traverse space-time. Sometimes one image of an object could be enough to learn something about it, but I speculate that just a few more views of an object of interest would help a vision system (it would certainly help segment the object from the background).

Physics
Ultimately, the best way to test what a vision system has learned is to present it with one static image and ask it to parse the image into objects. Even though this is very easy for a human to do, humans are also capable of doing something significantly more difficult than just getting at the objects in an image. With little effort, humans can look at a static image (a snapshot of the world) and essentially go forward/backward in time and mentally re-synthesize the image at those slightly different times. For example, given an image of a car on a road, a human can easily understand the movement of the car. Even from one image, a human can see in what direction the car is going (or, from the context, why the car is probably stationary). What I'm trying to get at here is that [for a human] objects are more than just shapes and appearances. There is an understanding of the physical world (the basic laws of motion) going on when we interpret novel images, and perhaps vision researchers need to go beyond shapes and appearances if they ever want their vision systems to see.

However, just because humans can do all of this 'visual world understanding' doesn't mean that vision systems will be able to do it anytime soon.
Anonymous, 12:33 PM
Interestingly enough, the view that visual perception has both "top-down" and "bottom-up" control has been emerging in the cognitive neuroscience literature. That is, recognition and perceptual control occur early on in the reception of light (e.g. retina and visual cortex) and also higher up in the cognitive hierarchy (cerebral cortex, temporal lobe, etc.). Indeed, humans do learn how to determine what's important and what to focus on in a given field of view, and how they do this is surely tied to memory and subject to individual differences (e.g. not everyone can be an amazing baseball hitter). Anyway, taking a look at some cognitive neuroscience papers might be an interesting thing for you to do.

So, I was thinking that instead of labeling and classifying single objects one at a time, a better approach might be a two-step training process. The first step would simulate the bottom-up control of visual perception: the computer picks the areas of an image it assumes are objects (where "real" objects exist, where to focus), using images that contain multiple objects. You would then grade it with your regular regime and let it go, allowing it to detect objects within an image.

Secondly, you obtain the essence, or rather the differentiation and recognition, of these objects in the visual world, with custom coding to prioritize objects of extreme interest (which ones to really pay attention to); this is the top-down control. In a sense, you have the "abstract" object detection first, and the definition and differentiation of objects second.

However, instead of trying to teach a computer each type of object separately, which is what you do now, it might be more attainable to equip it with the tools, or rather a simulation of them, that we use throughout our experiences.

A clear problem is that if you want a computer to perceive the essence of an object in an image, to really know what it is, and therefore have the ability to define and differentiate novel objects (not just differentiate them), you need some sort of connection to these objects' function or purpose. However, would it really be a bad thing to have some perceptual control over your machines, such that the machine can clearly function in the visual world but does not have the ability to define novel objects? Like a kid asking a parent "what is this?", it would ask either its human caretaker/employer or a master computer (an internet-searching AI/detector) to define these novel objects.
Tomasz Malisiewicz, 6:27 PM
I want to quickly respond to a few ideas, and a more in-depth response will come later.

The use of both top-down and bottom-up elements of visual perception has been in computer vision for quite some time now. People like Tomaso Poggio from MIT's Center for Biological & Computational Learning are computational neuroscientists who are very very interested in computer vision (no surprise here).

My current research attempts to "simulate the bottom-up control of visual perception"; however, it is not so easy to group together low-level elements such as pixels directly into objects. An early step in the current object recognition framework I'm playing with utilizes a family of segmentation algorithms. These algorithms group together pixels (low-level primitives) into chunks of pixels (superpixels) that have a high probability of belonging to one object. These mid-level primitives (superpixels) are then processed and given object-level assignments.
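To make the pixels-to-superpixels-to-objects pipeline concrete, here is a toy sketch. It is not one of the segmentation algorithms referred to above. It stands in a very crude rule (merge 4-connected pixels whose grayscale intensities differ by at most `tol`) for a real segmenter, and then gives each superpixel an object label by majority vote over hypothetical per-pixel guesses. The names `superpixels`, `assign_labels`, and `tol` are all assumptions made for illustration.

```python
from collections import Counter


def superpixels(image, tol=10):
    """Group 4-connected pixels whose intensities differ by at most tol.

    image: list of rows of integer intensities.
    Returns a grid of the same shape holding a region id per pixel.
    """
    h, w = len(image), len(image[0])
    parent = list(range(h * w))  # union-find forest over pixel indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    for y in range(h):
        for x in range(w):
            # Merge with the right and lower neighbors when they look similar.
            if x + 1 < w and abs(image[y][x] - image[y][x + 1]) <= tol:
                union(y * w + x, y * w + x + 1)
            if y + 1 < h and abs(image[y][x] - image[y + 1][x]) <= tol:
                union(y * w + x, (y + 1) * w + x)

    # Relabel union-find roots as consecutive ids in scan order.
    ids = {}
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            labels[y][x] = ids.setdefault(find(y * w + x), len(ids))
    return labels


def assign_labels(regions, pixel_guesses):
    """Give each superpixel the most common per-pixel guess inside it."""
    votes = {}
    for region_row, guess_row in zip(regions, pixel_guesses):
        for region, guess in zip(region_row, guess_row):
            votes.setdefault(region, Counter())[guess] += 1
    return {region: counts.most_common(1)[0][0] for region, counts in votes.items()}
```

Voting over a whole superpixel lets a few wrong pixel-level guesses be outvoted by their neighbors, which is one reason to reason about chunks of pixels rather than individual pixels.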

I believe that you acknowledge the difficulty of this 'computer vision enterprise.' By saying "if you want a computer to perceive the essence of an object in an image," you probably agree with me that what I'm after is intelligence and not image processing -- I want to build a machine that can 'get at' the world (make a theory or two).

Metaphysics, here we come!
Finally, I completely agree with you when you say that "[in order to] have the ability to define and differentiate novel objects ... you need some sort of connection to these objects' function or purpose." This is the 'lack of metaphysics in vision' dilemma. Allow me to quickly explain. When you define objects by their purpose in the world, you are transcending a visual description of the objects -- hinting at something beyond the physical description of the world. This is a problem for the computer vision community -- which is overly obsessed with the visual (physical) description of objects (to no surprise). I'm pretty confident that it is just a matter of time until computer vision researchers start asking questions like 'What is an object?' and then define objects in terms of attributes such as purpose, which are not directly observable. But if we start asking those questions, will the field outgrow the name 'computer vision'? Perhaps 'computational metaphysics' will be a better name.
Andy G., 11:27 AM
Tomasz, I just had the fun of stumbling upon this old post. Looks like you were years ahead of the game. Of course, objectness, "What is an Object?" (literally the name of the CVPR 2010 paper), and attributes have become *the* trendy topics in CVPR and I/ECCV in the past 2 or 3 years...