There is no indication that some simple trick will ever be found that miraculously solves most problems in vision. It turns out that the processing system must be able to implement a model structure whose complexity is directly related to the structural complexity of the problem under consideration in the external world. It has become increasingly apparent that Vision cannot be treated in isolation from response generation, because a very high degree of integration is required between different levels of percepts and the corresponding response primitives. The response to be produced at a given instant depends as much upon the state of the system as upon the percepts impinging upon it. In addition, it has become apparent that many classical aspects of perception, such as geometry, probably belong not to the percept domain of a Vision system but to the response domain. This article will focus on what are considered crucial future problems in Vision for robotics, rather than on the classical solutions of today. It will discuss hierarchical architectures for the combination of percept and response primitives, and the concept of combined percept–response invariances as important structural elements for Vision. It will be maintained that learning is essential to obtain the necessary flexibility and adaptivity. In consequence, it will be argued that invariances for the purpose of Vision are not abstractly geometrical, but derived from the percept–response interaction with the environment. The issue of information representation becomes extremely important in distributed structures of the types foreseen, where the uncertainty of information has to be stated for the update of models and associated data. The question of object representation is central to the paper. Equivalence is established between the representations of response, geometry and time.
Finally, an integrated percept–response structure is proposed for flexible response control.