We introduce a compact coding of image information in terms of local multi-modal image descriptors. This coding allows for an explicit separation of local image information into different visual sub-modalities: geometric information (orientation) and structural information (contrast transition and colour). Based on this image representation, we derive a similarity function that compares visual information within each of these sub-modalities. This allows us to investigate the importance of the different factors for stereo matching on a large data set. From this investigation we conclude that it is the combination of visual modalities that gives the best results, and we measure concrete weights for their relative importance. In addition to these quantitative results, our simulations demonstrate that although our image representation reduces the image information by 97%, we achieve a matching performance comparable to that of block-matching techniques. This shows that our highly condensed representation preserves the relevant visual information. © 2004 Elsevier B.V. All rights reserved.
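The weighted combination of sub-modality similarities described above can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: the function names, the use of cosine similarity as the per-modality comparison, and the example descriptors and weights are all assumptions for demonstration only (the paper measures its own concrete weights).

```python
import numpy as np

def modality_similarity(a, b):
    # Cosine similarity between two feature vectors of one sub-modality.
    # (Illustrative choice; the paper's per-modality comparison may differ.)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def combined_similarity(desc1, desc2, weights):
    # Weighted sum of per-modality similarities.
    # desc1, desc2: dicts mapping modality name -> feature vector.
    # weights: dict mapping modality name -> relative weight (assumed to sum to 1).
    return sum(w * modality_similarity(desc1[m], desc2[m])
               for m, w in weights.items())

# Hypothetical descriptors for two image patches (NOT values from the paper):
d1 = {'orientation': np.array([1.0, 0.0]),
      'contrast':    np.array([0.5, 0.5]),
      'colour':      np.array([0.2, 0.8, 0.1])}
d2 = {'orientation': np.array([0.9, 0.1]),
      'contrast':    np.array([0.4, 0.6]),
      'colour':      np.array([0.25, 0.7, 0.15])}

# Hypothetical relative weights for the three sub-modalities:
weights = {'orientation': 0.4, 'contrast': 0.3, 'colour': 0.3}

score = combined_similarity(d1, d2, weights)
```

With non-negative features and weights summing to one, the combined score stays in [0, 1], and identical descriptors score exactly 1, which makes the weights directly interpretable as the relative importance of each sub-modality.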