A machine with primitive, 'in-the-moment' direct perception of the world is one that has solved the correspondence problem at small spatio-temporal scales. Here, the time dimension is represented by a time-ordered sequence of images, and spatial scale refers to the distance between objects (in 3-space and/or in image space). Grouping together similar pixels to produce a mid-level segmentation of a single image is the single-image segmentation problem. One could imagine choosing a temporal scale small enough that viewpoint variation is negligible, and registering the superpixels across the frames in that sequence. However, the ability to register superpixels across a small temporal scale, i.e. direct perception, doesn't solve the problem of vision in its entirety.
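As a minimal sketch of what registering superpixels across a short temporal window might look like, the toy function below (my own illustrative code, not a reference to any particular system) matches each superpixel in one frame to the best-overlapping superpixel in the next frame by intersection-over-union, under the assumption stated above that viewpoint variation between frames is negligible:

```python
import numpy as np

def register_superpixels(labels_a, labels_b):
    """Match superpixels of frame A to superpixels of frame B by mask overlap.

    labels_a, labels_b: integer label maps of identical shape, one integer
    label per superpixel. Because the temporal scale is assumed small, a
    superpixel should roughly overlap its counterpart in the next frame.
    Returns {label_in_a: (best_label_in_b, iou)}.
    """
    matches = {}
    for a in np.unique(labels_a):
        mask_a = labels_a == a
        best_b, best_iou = None, 0.0
        # Only consider superpixels in B that actually overlap mask_a.
        for b in np.unique(labels_b[mask_a]):
            mask_b = labels_b == b
            inter = np.logical_and(mask_a, mask_b).sum()
            union = np.logical_or(mask_a, mask_b).sum()
            iou = inter / union
            if iou > best_iou:
                best_b, best_iou = b, float(iou)
        matches[int(a)] = (int(best_b), best_iou)
    return matches
```

With identical label maps every superpixel matches itself with IoU 1.0; with a small camera motion the IoU drops but the matching typically survives, which is exactly the small-temporal-scale regime described above.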
The ability to transcend spatio-temporal scale and register objects across all of time, as opposed to within a small temporal window, is necessary for true image understanding. You can think of direct perception as the process of staring at something and not thinking about anything but the image that you see. In some sense, pure direct perception is not even possible for humans; indirect perception is the key to vision. Imagine sitting on your couch and typing on your laptop while staring at the laptop screen. If you're in a familiar location, you don't really have to look around much, since you know how everything is arranged; in some sense you have such a strong prior on what you expect to see that you need minimal image data (only what you see on the fringes of your vision as you stare at the monitor) to understand the world around you. Or imagine walking down a street and closing your eyes for two seconds: you can almost 'see' the world around you, yet your eyes are closed. These examples suggest that there is a model of the world inside of us that we can be directly aware of even when our eyes are closed. I would be willing to bet that after training on high-quality image data, a real-time system would be able to understand the world with an extremely low-quality camera.
On another note, I would like to build computer vision systems that can infer fundamental physical relationships among observed objects, such as the law of gravity. Such physical relations will emerge if they help 'compress' image data; and they always do. The reason the concept of gravity compresses image data is that it places a strong prior on the relative locations of familiar objects. For example, the conditional probability density over the rigid-body configuration of a vehicle given the configuration of a road is far more concentrated than the marginal probability density over the vehicle's configuration. In layman's terms, if you know where a road is in an image, then you can be pretty sure where you are going to find the cars in the image; if you don't know the location of the road, the cars could be located anywhere in the image plane.

If we want intelligent agents to see the physical world around them, we have to remember that they will only be able to understand the large amount of visual data that we give them if they can compress it. In this context, compressing image data is equivalent to performing object recognition on the image. Compression will occur not only at the object level but also at the world level. Object-level compression entails understanding hierarchies of objects (such as: a Ford is a car), while world-level compression entails understanding the physical relationships between objects in the world. Object-level compression is important if we want to understand all of the different objects in the world, and world-level compression is necessary if we ever want to 'understand' it in a reasonable amount of time. Object-level compression is also related to the concept of meta-objects and the question of object generalization. World-level compression is related to physics and metaphysics.
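The road-and-car claim can be made quantitative with a toy simulation (the numbers below are purely illustrative assumptions, not measurements): if the car's image row always sits within a couple of rows of the road's, then the entropy of the car's location given the road is a fraction of its marginal entropy, which is exactly the compression described above:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 100  # image height in rows (illustrative)

# Assumed toy world: the road can appear on any row, and gravity keeps
# the car within +/-2 rows of the road.
road = rng.integers(0, H, size=100_000)
car = np.clip(road + rng.integers(-2, 3, size=road.size), 0, H - 1)

def entropy(counts):
    """Shannon entropy in bits of an empirical count vector."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Marginal entropy of the car's row: close to log2(100) ~ 6.6 bits.
h_car = entropy(np.bincount(car, minlength=H))

# Conditional entropy H(car | road), approximated by the entropy of the
# car-road offset (the offset is nearly independent of the road's row):
# at most log2(5) ~ 2.3 bits.
offsets = (car - road) + 2  # shift into [0, 4] for bincount
h_car_given_road = entropy(np.bincount(offsets))
```

Knowing the road's location saves roughly four bits per car here; scale that across every object pair governed by a physical relation and the data-compression view of physical priors becomes concrete.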