There is a nice analogy between the problem of segmentation and the problem of object detection/classification/recognition.
Segmentation is grouping on an intra-image spatial level.
Detection/classification/recognition is grouping on an inter-image level.
Tracking is grouping at the inter-frame temporal level.
Let me remind the reader about the theory-observation distinction, which is mainly attributed to Karl Popper. "All observation is selective and theory-laden," and similar quotes, can be found in the Stanford Encyclopedia of Philosophy entry on Karl Popper. The entry further states that Popper repudiates induction, rejects the view that it is the characteristic method of scientific investigation and inference, and substitutes falsifiability in its place.
Researchers in the field of machine vision could learn from the philosophy of science. When placed in the context of machine intelligence, Popper's ideas sound like this:
The notion of training a system to classify images by presenting it with a large set of labeled examples and building a visual model is analogous to using induction over a finite set of observations. However, since the philosophy of science has taught us that there is much to be said for positing a theory first, maybe we should be less concerned with machines that perform data-driven model building and more concerned with building machines that can posit models and then try to falsify them.
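To make the contrast concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional "images", the function names `induce_threshold` and `survives`, and the three candidate rules are hypothetical and do not come from any real vision system. The first function builds a model from the data alone; the second takes a model that was posited up front and checks whether any observation refutes it.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[float, int]  # (feature value, label); a stand-in for a labeled image


def induce_threshold(examples: List[Example]) -> float:
    """Data-driven model building: put the decision threshold midway
    between the mean of the positive and the mean of the negative examples."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def survives(hypothesis: Callable[[float], int], examples: List[Example]) -> bool:
    """Posit-and-verify: a hypothesis survives only if no observation
    contradicts it. A single counterexample falsifies it."""
    return all(hypothesis(x) == y for x, y in examples)


# Hypothetical toy observations.
examples: List[Example] = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]

# Induction: the model is whatever the data produces.
threshold = induce_threshold(examples)

# Falsification: the models are posited first, then tested against the data.
candidates: Dict[str, Callable[[float], int]] = {
    "x > 0.2": lambda x: int(x > 0.2),
    "x > 0.5": lambda x: int(x > 0.5),
    "x > 0.8": lambda x: int(x > 0.8),
}
survivors = [name for name, h in candidates.items() if survives(h, examples)]

print(threshold)
print(survivors)
```

In this sketch the two routes happen to agree on a boundary near 0.5, but they get there differently: the inductive route never states a claim that could be wrong in advance, while the second route states three claims and discards the two that the observations refute.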
Should we be building machines that posit scientific theories, or are we doing this already?