Hints from Life to AI, edited by Ugur HALICI, METU, 1994

Artificial versus Natural Stereo Depth Perception

06531 Ankara, TURKEY
lel@tbtk.metu.edu.tr
Some cues on stereo vision implied by research on natural visual systems are reviewed, followed by the methods used by computer vision systems to solve the same problem. The influence of these natural cues on computational stereo is discussed.
1. Introduction
Visual perception is an interpretation of the 2-dimensional, time-varying light information on the retinae to form a spatiotemporal reconstruction of the 3-dimensional world. During the long course of evolution, this ability has reached an astonishing complexity, especially in higher animals. From the "meaningless" sensory input, that is, from a set of activation values of sensory cells, a rich, abstract description of the environment is derived. The recovery of the third dimension, which is lost during the projection onto the retinae, is vital in this reconstruction process. In the human visual system, there are several known ways of estimating depth, such as motion, shading, texture and stereo. Although these mechanisms are known in principle, the underlying biological structures and processes are not yet fully understood. Stereo vision is one of the most investigated depth perception mechanisms.
Figure 1:
Eyes are fixated on the farthest of three balloons.
Stereo vision is based on the differences between the right and left eye images. Due to the distance between the eyes (the interocular distance), the projections of a given point on the two retinae are at different positions. The difference in these positions is called disparity, and its value is related to the distance of the object. Figures 1 and 2 illustrate this process schematically. The two eyes are fixated on the farthest of three balloons (Figure 1); that is, the eyes are directed in such a way that the images of that balloon fall at the centres of both retinae. The resulting left and right images are shown in Figure 2. It is clearly seen that the disparity values are larger for closer balloons. Once, for each point in one image, the corresponding point in the other is determined, the depth of all points can be calculated as a function of the eye vergence.
Figure 2:
The images formed on the left and right retinae of the eyes in Figure 1.
Our brains do this computation for us continuously and we do not even notice the effort: it is totally unconscious and automatic. Understanding how this is possible is the subject of several disciplines. Neuroanatomy and neurohistology identify the structures involved in vision at the macro and micro scale, respectively. Neurophysiology tries to explain how these structures function. Psychophysics examines how certain visual inputs are perceived under certain conditions. Cognitive psychology tries to explain the stereo mechanism at a more abstract level than neurophysiology.
Among other reasons, understanding natural stereo vision is useful for constructing similar artificial systems. Such systems have a wide range of application areas (though not all are innocent), so considerable effort is devoted to this field. An artificial system need not share the properties of natural systems as long as it works properly, but ideas from nature have repeatedly proved powerful.
2. Problems in Matching
To understand why calculating stereo correspondence is not a trivial task, consider the stereo image pair given in Figure 3 and the light intensity levels of two small image patches from this figure (the areas inside the rectangles in Figure 3), as given in Figure 4.
Figure 3: Pentagon
stereo pair (from Prof. Takeo Kanade of Carnegie-Mellon University)
The average disparity of 1 pixel is not evident even after a long inspection by eye. The light intensity is coded as binary numbers in computers and as the frequency of action potentials in the human retina, but the very same data is the input to both systems. A simple comparison of intensity values is clearly not sufficient for determining corresponding points. One reason for the differences in intensity values is the noise that appears in any sensor, whether biological or electronic. Another is the reflectance properties of most surfaces: the light reflected from a surface depends on the viewing angle. Areas with no significant texture and areas with repetitive texture, like a chessboard, increase the ambiguity. Areas which are seen in one image but occluded in the other are another source of ambiguity, because we do not know which areas are occluded before calculating the correspondence.
Since intensity values change considerably across images, we need some invariant properties for matching. These matching primitives may be edge points, line segments, blobs or similar local image properties which can be calculated by simple local processing units. Alternatively, the matching of images can be postponed until the objects in the images have been recognised monocularly; the recognised objects can then be matched easily. After reviewing some facts on natural stereo vision, we will discuss which strategy is used by biological systems and which one is appropriate for computer stereo vision.
[Tables of raw intensity values: a) intensity levels of left patch; b) intensity levels of right patch]
Figure 4:
Intensity levels of the two patches shown in Figure 3.
3. Natural Stereo Vision
Although a complete theory of biological stereo vision has not yet been built, there is a large body of information obtained through neurophysiological and psychological research on stereopsis. Here, some facts on human stereo vision which are closely related to computer stereo vision will be briefly presented. The interested reader is referred to [Hubel88], [Bruce90] and [Spillmann90] for detailed information.
A very remarkable feature of human stereopsis is its speed: it takes about 200 ms from the presentation of the stimulus to the occurrence of depth perception [Yeshurun89]. That duration is very close to the time needed for the information on the retinae to reach the visual cortex via the visual pathway.
Stereopsis is a low-level process; that is, it does not require recognition or any abstract understanding of the image. It was first demonstrated by Julesz [Julesz60] that stereopsis survives in the absence of any monocular cue such as texture, a priori knowledge of the shapes and sizes of objects, shading, etc. Figure 5 is an example of the random-dot stereograms invented by Julesz. One can see the floating square above the background by fixating the eyes at a nearer point in such a way that the two images overlap in the centre. But this phenomenon does not imply that other depth cues do not affect the stereo process. On the contrary, there is strong evidence that the presence of monocular depth cues facilitates stereo vision.
Figure 5:
A random-dot stereogram
Only the surfaces within a specific disparity interval, the so-called Panum's fusional area, can be fused. The extent of this range has been measured as 10-40 minutes of arc, depending on the data used. There is evidence that this range is larger for inputs with low-frequency content than for high-frequency inputs [Marr82] [Schor84].
It was shown by Julesz [Julesz71] that changes in the magnitude of the contrast across the images do not destroy stereopsis, but a change in the sign of the contrast makes fusion of the images impossible [Julesz60].
Even though the average distance between the light-sensitive cells of the retina (cones) is about 20-30 seconds of arc at the fovea, where those cells are densest, disparity differences down to 2 seconds of arc are detectable by the human visual system [Morgan82]. But this hyperacuity drops drastically for non-zero disparities [Badcock85].
If the rate of change in disparity, that is, the disparity gradient, exceeds a certain limit, the images cannot be fused and objects appear double (diplopia) [Burt80].
Although there is some interaction of information from the two eyes on the way from the retinae to the cortex, the first place where cells differentially sensitive to binocular disparity are observed is the visual cortex, in cats and monkeys. A considerable proportion of the cells in the visual cortex are binocularly sensitive [Hubel62].
Binocularly
sensitive cells can be classified as balanced or unbalanced according to the
type of their sensitivity [Poggio77]. Balanced cells respond equally to stimuli
from each eye, but respond very strongly when stimulated binocularly.
Unbalanced cells either respond more strongly to one eye or exhibit a complex ocular dominance pattern.
A certain layer of the visual cortex (layer 4) is organised in ocular dominance columns. These vertical strips, which are about 1 mm wide in monkeys and 2 mm wide in humans, respond alternately to the left eye and the right eye. Binocular cells are located above and below these monocular cells.
Almost all of the cells in the visual cortex exhibit orientation selectivity at various angles, but most of them respond best to bars oriented within 20 degrees of the vertical [Poggio77].
Another important property of these cells is their frequency selectivity. The optimal spatial frequencies range from 0.3 to 3 cycles/degree in cats and from 2 to 8 cycles/degree in monkeys [Bruce90]. The bandwidth of the cells is, on average, a little larger than one octave. The constancy of relative bandwidths over scales can be justified by the statistics of natural images [Field87]: since the amplitude spectrum of natural images generally falls off as 1/f, there is almost constant energy in all such channels.
The receptive field is the activation pattern of a cell as a function of stimulus position on the retina. According to the pattern of their receptive fields, the cells in the visual cortex are classified as simple and complex cells [Schiller76a]. Simple cells have smaller receptive fields and low spontaneous activity; some parts of their receptive field respond to the onset of the stimulus while other parts respond to the offset. Complex cells, on the other hand, respond to both the onset and the offset; they have larger receptive fields and greater spontaneous activity.
According to their binocular sensitivity, the cells in the visual cortex were classified into four groups by Poggio and Fischer [Poggio77]: tuned excitatory (TE), tuned inhibitory (TI), near and far. TE cells are excited by stimuli at the fixation distance. If the stimulus is disparate by more than 0.1 degrees, the cell's activity is suppressed; that is, these cells are sharply tuned to zero disparity. The response pattern of TI cells as a function of disparity is the reverse of, but not as sharp as, that of the TE cells. Near cells are sensitive to stimuli nearer than the fixation distance, and far cells to stimuli farther away. Among these cell groups, only TE cells are ocularly balanced. Later, other kinds of cells were also identified, and it has been claimed that the types of binocular sensitivity form a continuum rather than discrete groups [Freeman90].
The monocular receptive fields of simple cells are well described by Gabor functions [Marcelja80] [Daugman80], which are filters limited in both space and frequency. Gabor filters will be discussed in detail later in this article. There is evidence that simple cells occur in pairs with an approximate phase difference of 90 degrees [Pollen81], which may compute the real and imaginary parts of a complex Gabor filter. The integration of data from the monocular receptive fields was modelled as a linear summation by Ohzawa and Freeman [Ohzawa86] on the basis of neurophysiological experiments. Nomura et al. [Nomura90] proposed a similar model in which the linear summation is followed by a non-linear smoothed thresholding function; this model largely predicts the binocular behaviour of cells in the striate cortex. Freeman and Ohzawa observed that the phase-difference-sensitive responses of simple cells are not disturbed by large contrast differences between the right and left eyes. Considering this observation, they proposed a monocular contrast gain mechanism that keeps the effect of contrast almost constant.
There is evidence that data from low-frequency channels constrain the matching at high frequencies. Wilson et al. [Wilson91] found that channels more than 2 octaves apart are processed independently, but that closer channels interact: low-frequency signals affect fusion in high-frequency channels but not vice versa. Watt [Watt87] also concludes, after a series of experiments, that the human visual system uses a coarse-to-fine strategy.
4. Computer Stereo Vision
At the beginning, we considered the problem of what to choose for matching across images. The fact that human stereopsis can survive without monocular recognition is very comforting for computer stereo research, since the recognition performance of state-of-the-art computer vision is very weak.
We know that raw intensity values are not appropriate as matching primitives, while recognised objects are not available. What we need, at this point, are matching primitives that are more abstract and invariant than intensity values and that can be determined without any help from top-down processes. The primitives currently used in computer vision fall into two rough groups. The first consists of features like edges, corners, blobs, etc., which can be detected using local intensity values. The second group, area-based properties, consists of functions of the intensity values that can be calculated at almost every point of an image.
The image features chosen for matching are high-interest points or point sets such as edgels, edge segments or intervals between edges. Features can be localised very accurately (generally with sub-pixel resolution), so the accuracy of the computed disparity is also high. Features generally correspond to the physical boundaries of objects, surface markings or other physical discontinuities, and so provide valuable depth information. Features are typically sparse, that is, they occupy only a very low percentage of an image. This speeds up processing, but disparities at non-feature points must be interpolated.
Figure 6:
Laplacian-of-Gaussian operator
The use of features for stereo matching is biologically plausible, because cells sensitive to edges and corners are observed in the visual system. Based on the properties of some cells in the lateral geniculate, Marr and Hildreth [Marr80] proposed the zero-crossings of Laplacian-of-Gaussian (LoG) filtered images for edge detection. The LoG operator (Figure 6),

$$\nabla^2 G(x,y) = -\frac{1}{\pi\sigma^4}\left(1 - \frac{x^2+y^2}{2\sigma^2}\right) e^{-\frac{x^2+y^2}{2\sigma^2}},$$

which is Gaussian smoothing followed by a second-derivative operation, has several useful properties. The scale factor $\sigma$, which is the standard deviation of the Gaussian, is inversely proportional to the average density of edges. Besides, even large convolutions can be calculated quickly, either by approximating the LoG by a difference-of-Gaussians function or by decomposing the LoG. The disadvantage of the LoG is the displacement of edges with growing $\sigma$.
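As an illustration, the following numpy sketch samples a LoG kernel and its difference-of-Gaussians approximation; the 4-sigma support and the 1.6 sigma ratio are conventional choices, not values taken from the text.

    import numpy as np

    def log_kernel(sigma):
        # Sampled LoG kernel with support of about 4 sigma.
        half = int(4 * sigma)
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        r2 = (x**2 + y**2) / (2.0 * sigma**2)
        return -(1.0 / (np.pi * sigma**4)) * (1.0 - r2) * np.exp(-r2)

    def dog_kernel(sigma, ratio=1.6):
        # Difference of two normalised Gaussians approximating the
        # (negated) LoG; 1.6 is the usual engineering choice of ratio.
        half = int(4 * sigma * ratio)
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        def gauss(s):
            g = np.exp(-(x**2 + y**2) / (2.0 * s**2))
            return g / g.sum()
        return gauss(sigma) - gauss(sigma * ratio)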
Following Marr [Marr82], a number of researchers used zero-crossing edges as matching primitives. The direction of an edge is approximated by the direction of the gradient of the filtered image. Only edgels with the same sign and roughly the same orientation are considered as possible matches. This is in accordance with the psychophysical observation that images with opposite contrast cannot be fused.
More abstract image features are edge segments, either linear segments or curves. Here the edgels are not matched individually but grouped into segments. This grouping can be performed using Gestalt rules. Grouping reduces the number of possible matches significantly. Besides, one can define similarity measures between two edge segments using their length, orientation, curvature, strength, the coordinates of their edge points, the average intensity or the intensity slope on each side, etc.
Area properties are those which are available at almost every point in an image. The simplest area property is the image intensity itself, which is not appropriate for stereo matching due to its sensitivity to noise as well as to photometric variation. Another simple primitive is the spatial derivative of the intensity, which is less sensitive to photometric variation but too sensitive to noise.
A common way to match areas directly is to compute correlations between areas from the left and right images. The cross-correlation and normalized cross-correlation at position (i,j) of the right image with disparity d are

$$C(i,j,d) = \sum_{(u,v) \in W} I_L(i+u+d,\; j+v)\, I_R(i+u,\; j+v)$$

and

$$NC(i,j,d) = \frac{C(i,j,d)}{\sqrt{\displaystyle\sum_{(u,v) \in W} I_L^2(i+u+d,\; j+v) \sum_{(u,v) \in W} I_R^2(i+u,\; j+v)}},$$

respectively, where W is the correlation window and the first coordinate runs along the horizontal (epipolar) direction. There are several other correlation-like measures, of which the most frequently used is the sum of squared differences:

$$SSD(i,j,d) = \sum_{(u,v) \in W} \bigl(I_L(i+u+d,\; j+v) - I_R(i+u,\; j+v)\bigr)^2.$$
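The following brute-force sketch implements the SSD measure above; the window half-width and the search over non-negative integer disparities are illustrative assumptions.

    import numpy as np

    def ssd_disparity(left, right, max_disp, half=3):
        # For each right-image pixel, choose the integer disparity d
        # minimising the SSD between its window and the window shifted
        # horizontally by d in the left image.
        h, w = right.shape
        left = left.astype(float)
        right = right.astype(float)
        disp = np.zeros((h, w), dtype=int)
        for i in range(half, h - half):           # i: row
            for j in range(half, w - half):       # j: column (horizontal)
                win_r = right[i - half:i + half + 1, j - half:j + half + 1]
                best, best_d = np.inf, 0
                for d in range(min(max_disp, w - half - 1 - j) + 1):
                    win_l = left[i - half:i + half + 1,
                                 j + d - half:j + d + half + 1]
                    err = np.sum((win_l - win_r) ** 2)
                    if err < best:
                        best, best_d = err, d
                disp[i, j] = best_d
        return disp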
Although correlation techniques are successful in textured areas, they fail around depth discontinuities, since the area inside the correlation window then belongs to at least two surfaces at different depths, so the window does not match completely at any disparity value. They also suffer from disparity gradients, because one of the signals is scaled compared to the other. Besides, the accuracy obtained is lower than that of feature-based matches. Another drawback of the correlation technique is its computational complexity. As the size of the correlation window grows, the computational complexity and the uncertainty in disparity increase and the problematic regions near discontinuities get larger, but the match becomes more robust to noise.
Another dense property to match is the local frequency content [Clark86] [Sanger88] [Fleet91] [Westelius92] [Weng93] [Nomura93]. The shift theorem of the Fourier transform states that when a function f(x) with Fourier transform F(u) is shifted by an amount x_0, the Fourier transform of the shifted function is $e^{-i 2\pi u x_0} F(u)$; a shift in the spatial domain thus corresponds to a phase shift in the frequency domain. If the left view were simply a shifted version of the right view, it would be possible to determine the amount of shift from the phases of the Fourier transforms of the two images. But since the shift, i.e. the disparity, differs in various regions of the images, one needs a local frequency filter to determine the phase differences. A natural choice for such a function is the Gabor filter [Gabor46], which is a bandpass filter with limited spatial width:

$$g(x) = e^{-\frac{x^2}{2\sigma^2}}\, e^{i 2\pi u_0 x},$$

whose Fourier transform is

$$G(u) = \sqrt{2\pi}\,\sigma\, e^{-2\pi^2 \sigma^2 (u - u_0)^2},$$

for which the product of the spatial and frequency widths, $\Delta x\,\Delta u$, attains the theoretical minimum for a linear complex filter [Gabor46]. This choice is also biologically plausible, since the receptive fields of simple cells are not statistically distinguishable from Gabor filters [Marcelja80]. Besides, simple cells are found in pairs with an approximate phase difference of 90 degrees [Pollen81], and this justifies the use of complex filters. If the ratio of the spatial width to the period, $\sigma u_0$, is held constant, then the shape of the filter and the relative bandwidth

$$\beta = \log_2 \frac{u_0 + \Delta u}{u_0 - \Delta u}$$

in octaves remain unchanged. Figure 7 shows the real and imaginary parts of a Gabor filter with a bandwidth of 1 octave. The 2-dimensional extension of the filter is

$$g(x,y) = e^{-\frac{x^2 + y^2}{2\sigma^2}}\, e^{i 2\pi (u_0 x + v_0 y)}.$$
Figure 7:
The real and imaginary parts of a Gabor filter with a bandwidth of 1 octave.
Note that the filter is separable, so the computational complexity is reduced from $n^2$ to $2n$ operations per pixel for an $n \times n$ mask. The filtered versions of the right and left images, r(x,y) and l(x,y), are

$$R_f(x,y) = (r * g)(x,y) \qquad\text{and}\qquad L_f(x,y) = (l * g)(x,y).$$
Since the Gabor-filtered image is a band-pass signal, it can be modelled (in 1-D for simplicity) as [Fleet91]

$$L_f(x) = \rho(x)\, e^{i\phi(x)}, \qquad \phi(x) \approx 2\pi u_0 x,$$

where $u_0$ is the centre frequency, equal to the frequency of the filter. The local frequency is defined as $k(x) = \phi'(x)$ [Papoulis65]. If we assume perfect sinusoids, that is, $k(x) = 2\pi u_0$, then we can estimate the disparity as [Sanger88]

$$d(x) = \frac{\phi_L(x) - \phi_R(x)}{2\pi u_0}.$$
Since the bandwidth of the filter is non-zero, the local frequency may vary around the centre frequency and disturb this linearity. But in real images with sufficient texture, the phase is almost linear over the image except in some regions. Fleet et al. [Fleet91] showed that the bandpass phase is not sensitive to the typical distortions that exist between right and left images.
Note that the phase measurements give the disparity directly, so no search for a best fit is performed; for this reason, phase-based techniques are sometimes called ``correspondenceless''. It is worth mentioning that matching phases is a generalisation of matching zero-crossings, because the zero-crossings of band-pass filters such as the LoG correspond roughly to the level curves $\phi = \pm\pi/2$ of the phase signal. Another advantage of phase measurements is that they provide sub-pixel measurements without explicitly reconstructing the signal between pixels. This hyperacuity is also in accordance with biological findings.
Phase measurements are valid only within a limited disparity range because of the wrap-around problem: we measure only the principal value of the phase, in the range $[-\pi, \pi)$, so a filter with fundamental frequency $u_0$ (period $\lambda = 1/u_0$) can signal only disparities from $-\lambda/2$ to $\lambda/2$.
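As a toy illustration of the whole scheme, the following 1-D sketch, assuming the reconstruction above, filters both signals with a complex Gabor kernel and reads the disparity from the wrapped phase difference; sigma, u0 and the 3-pixel test shift are illustrative choices.

    import numpy as np

    def gabor_1d(sigma, u0):
        # Complex Gabor: exp(-x^2/(2 sigma^2)) * exp(i 2 pi u0 x).
        x = np.arange(-int(4 * sigma), int(4 * sigma) + 1, dtype=float)
        return np.exp(-x**2 / (2 * sigma**2)) * np.exp(2j * np.pi * u0 * x)

    def phase_disparity(left, right, sigma=8.0, u0=1.0 / 16):
        # d = (phi_L - phi_R) / (2 pi u0), from the wrapped phase difference.
        g = gabor_1d(sigma, u0)
        L = np.convolve(left, g, mode="same")
        R = np.convolve(right, g, mode="same")
        dphi = np.angle(L * np.conj(R))   # principal value in (-pi, pi]
        return dphi / (2 * np.pi * u0)    # valid only while |d| < 1/(2 u0)

    # Toy usage: a random texture and its copy shifted right by 3 pixels,
    # i.e. right(x) = left(x - 3), so the true disparity is 3.
    rng = np.random.default_rng(0)
    left = rng.standard_normal(256)
    right = np.roll(left, 3)
    print(np.median(phase_disparity(left, right)[32:-32]))  # close to 3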
Nomura [Nomura93] introduced a fundamental equation for binocular disparity, analogous to the gradient model of the optical flow field:

$$\frac{\partial I}{\partial o} + d\, \frac{\partial I}{\partial x} = 0,$$

where o is the eye position, I is the intensity and d is the disparity. Substituting the Gabor-filtered image in place of I, he obtained an analogous constraint on the filter responses. He also showed that the terms other than d can be approximated as linear combinations of far, near and tuned inhibitory type simple cells.
Another area-based method that takes its flavour from natural stereo vision is the cepstral filtering approach of Yeshurun and Schwartz [Yeshurun89]. Cepstral filtering is a Fourier transform followed by a logarithm and an inverse Fourier transform. Yeshurun and Schwartz append the left image patch l(x,y) to the left of the right image patch r(x,y). Assuming that the width of the patches is D and that r(x,y) equals l(x-d,y), where d is the disparity to be computed, the compound image f(x,y) can be written as

$$f(x,y) = l(x,y) + l(x-D-d,\,y) = l(x,y) * \bigl[\delta(x,y) + \delta(x-D-d,\,y)\bigr],$$

with the Fourier transform

$$F(u,v) = L(u,v)\left(1 + e^{-i 2\pi u (D+d)}\right).$$

When we take the logarithm of F(u,v), the product becomes a sum:

$$\log F(u,v) = \log L(u,v) + \log\left(1 + e^{-i 2\pi u (D+d)}\right).$$

Expanding the second term as a power series and taking the inverse Fourier transform, we obtain

$$\hat{f}(x,y) = \hat{l}(x,y) + \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}\, \delta\bigl(x - n(D+d),\, y\bigr).$$

Thus, we can find the disparity of the patch by locating the largest of these delta functions.
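A sketch of this procedure on synthetic patches follows; using the power cepstrum and searching only quefrencies beyond the patch width D are implementation choices assumed here, not details from [Yeshurun89].

    import numpy as np

    def cepstral_disparity(l_patch, r_patch):
        D = l_patch.shape[1]                  # patch width
        f = np.hstack([l_patch, r_patch])     # compound image [l | r]
        F = np.fft.fft2(f)
        # Power cepstrum: inverse FFT of the log power spectrum.
        cep = np.abs(np.fft.ifft2(np.log(np.abs(F) ** 2 + 1e-12)))
        # The echo term produces a peak at column D + d of the zero row.
        col = cep[0, D:2 * D].argmax() + D
        return col - D

    # Toy usage: the right patch is the left patch shifted by 2 pixels,
    # i.e. r(x,y) ~ l(x-2,y).
    rng = np.random.default_rng(1)
    l = rng.standard_normal((32, 32))
    r = np.roll(l, 2, axis=1)
    print(cepstral_disparity(l, r))           # 2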
The ocular dominance columns in the visual cortex, which correspond to alternating image patches from the right and left retinae, bear a great similarity to the above method. Besides, the width of the ocular dominance columns is in accordance with Panum's fusional area. The authors also claim that this cepstral filter can be implemented using a set of bandpass filters similar to those found in the visual cortex, so the approach is biologically plausible.
Since the number of combinations of possible matches is enormous, some a priori knowledge about the disparity field is needed. The assumptions made are imposed on the algorithms as constraints. Every stereo algorithm uses some of these constraints, implicitly or explicitly.
Marr and Poggio [Marr76] stated that matter is cohesive, that is, "it is separated into objects, and the surfaces of objects are generally smooth in the sense that the surface variation due to roughness, cracks, or other sharp differences that can be attributed to changes in distance from the viewer, are small compared with the overall distance from the viewer" [Marr82].
The disparity field produced by such surfaces is smooth everywhere except at object boundaries, which occupy only a small portion of an image. Considering this fact, the computed disparity field is forced to be as smooth as possible. Under the smoothness assumption, the ill-posed stereo problem has a unique solution. This constraint is related to regularization theory, the branch of mathematics dealing with ill-posed problems [Poggio85]. Blind use of the smoothness constraint can cause problems at depth discontinuities. One method proposed to avoid smoothing the disparity field at and near these areas is to use line processes at which the smoothness constraint is broken.
A weaker form of the smoothness constraint is the figural continuity constraint, first exploited by Mayhew and Frisby [Mayhew81]. This constraint implies smooth variation of disparity along edges, because the edgels on the same edge segment are assumed to belong to the same object, and this assumption is almost always valid. Note that the figural continuity constraint is automatically satisfied when contours are used as matching primitives, so in that case it cannot provide an additional correction.
The smoothness constraint can also be expressed as a gradient limit on disparity, which is known to be used in human stereopsis. Generally, the support from a neighbouring match to a potential match is inversely scaled by the disparity gradient between the two matches [Prazdny85].
This assumption is violated if there are semi-transparent surfaces in the image, but these are very rare in natural images, except for objects like fences or bushes that partially occlude the background. In the case of transparency, the continuity constraint is not applicable, since the disparity field switches frequently between background and foreground. The human visual system can cope with transparency without difficulty. To handle transparency as well as discontinuities at object boundaries, Prazdny introduced the coherence principle, which states that the world is made of (either opaque or transparent) objects, each occupying a well-defined 3D volume. So: "a discontinuous disparity may be a superposition of a number of several interlaced continuous disparity fields each corresponding to a piecewise smooth surface"; as a result, "Two disparities are either similar, in which case they facilitate each other because they possibly contain information about the same surface, or dissimilar in which case they are informationally orthogonal, and should not interact at all because they potentially carry information about different surfaces" [Prazdny85].
He proposed the support function

$$s(i,j) = \frac{1}{\sqrt{2\pi}\,c}\, \exp\!\left(-\frac{(d_i - d_j)^2}{2 c^2\, \|x_i - x_j\|^2}\right),$$

where $s(i,j)$ is the support from the neighbouring point $x_j$ to the point $x_i$. Among the possible matches at point $x_j$, only the one with the minimum disparity difference is used in the calculation of support. The term in the exponent contains the disparity gradient $(d_i - d_j)/\|x_i - x_j\|$, so the support function imposes a disparity gradient limit implicitly.
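Under the reconstruction above, the support might be computed as follows; the constant c is an assumed tuning parameter.

    import numpy as np

    def support(x_i, d_i, x_j, d_j, c=0.5):
        # Support that the neighbouring match (x_j, d_j) lends to
        # (x_i, d_i): a Gaussian in the disparity gradient between them.
        r = np.linalg.norm(np.subtract(x_i, x_j).astype(float))
        grad = (d_i - d_j) / r                 # disparity gradient
        return np.exp(-grad**2 / (2 * c**2)) / (np.sqrt(2 * np.pi) * c)

    # Similar disparities at nearby points reinforce each other strongly,
    # while dissimilar ones are nearly "informationally orthogonal".
    print(support((0, 0), 5.0, (4, 3), 5.2))   # strong support
    print(support((0, 0), 5.0, (4, 3), 9.0))   # much weaker support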
Assume that a point A, and a point B to the right of A, match points A' and B' in the other image. The ordering constraint states that B' cannot be to the left of A'. The true disparity field violates this constraint only if the disparity difference between a figure and its background is larger than the width of the figure in the image. Such objects, like thin columns, ropes, etc., are rare in natural images, so this constraint is frequently used to reduce ambiguity. The human visual system also prefers order-preserving solutions [Weinshall89].
The uniqueness constraint states that a point in one image matches only one point in the other image, that is, the disparity field is a single-valued function. In stereo pairs involving only opaque surfaces, this constraint greatly reduces the number of possible solutions. Whether the human visual system uses this constraint is controversial, since there is evidence both for the use of the constraint [Weinshall89] and for the existence of multiple matches [Pollard90].
The compatibility constraint states that if point A in the right image matches point B in the left image, then point B matches point A. Some researchers calculate the right and left image disparities independently and then check for compatibility across the fields to eliminate false matches. Figure 8 schematically shows valid and invalid matches across two lines, where circles and arrows represent pixels and matches, respectively.
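A minimal sketch of such a left-right compatibility check follows; the sign convention of the two disparity maps and the one-pixel tolerance are assumptions.

    import numpy as np

    def cross_check(disp_right, disp_left, tol=1):
        # Keep a match only if the independently computed maps agree:
        # right pixel (i, j) maps to left pixel (i, j + d), whose
        # left-to-right disparity should then be -d.
        h, w = disp_right.shape
        valid = np.zeros((h, w), dtype=bool)
        for i in range(h):
            for j in range(w):
                d = disp_right[i, j]
                jl = j + d                     # corresponding left column
                if 0 <= jl < w:
                    valid[i, j] = abs(disp_left[i, jl] + d) <= tol
        return valid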
The epipolarity constraint exploits the fact that the match of a point must lie on the corresponding epipolar line. Affine transformations are applied to the images so that the epipolar lines are collinear with the image rows. The determination of the epipolar line reduces the search space to one dimension, while the alignment with image rows greatly simplifies the search. In the human visual system, this constraint is satisfied once both eyes are fixated on the same point, but small vertical disparities still remain due to the perspective projection onto the retinae.
Figure 8:
Matches between rows R and L violating a) the uniqueness constraint, b) the compatibility constraint and c) the orderedness constraint. d) A valid matching field with 2 occluded pixels in row R.
In accordance with Panum's fusional area, the disparity range in which a match is searched for is determined a priori. Sometimes, even when the epipolarity constraint is used, a small vertical disparity range is allowed to compensate for inexact registration.
Once the matching primitives are decided and the constraints are set, we face a very large search problem: a multi-dimensional space is to be searched for (in some sense) the best solution satisfying all constraints. Since visiting all states in search of the best solution is impractical, if not impossible, we need to employ heuristics to reach the best, or at least a good, solution.
The existence of different band-pass frequency channels in the vertebrate visual cortex has led some researchers to use frequency filters in stereo algorithms. Gaussian smoothing and Gabor-like filters are the ones mostly used for band-pass filtering. As the channel gets coarser (lower frequency), the size of the required masks gets larger, so the computational cost of the filters increases. An equivalent and simpler method is to smooth the image with a Gaussian kernel and to subsample it successively. In this way, a Gaussian image pyramid with various resolutions is formed. Usually a spacing of one octave between the channels is used, which makes each resolution half that of the finer channel (e.g., 256x256, 128x128, 64x64). A more rapid way to form the image pyramid is image consolidation, which replaces four adjacent pixels with one pixel having the average intensity of the four. Consider an n by n stereo pair with disparity range m. If integer disparity values are used, there are $m^{n^2}$ possible solutions to the problem, while the number of possible solutions in the coarser channel is only $(m/2)^{n^2/4}$. The accuracy of the coarse result is half that of the finer channel, but we can use it to constrain the solution in the next finer channel. This strategy is called coarse-to-fine analysis (see Figure 9) and is very popular in stereo research. Besides the computational savings, this method generally leads to more accurate final results.

Figure 9:
Coarse-to-fine control strategy.
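A sketch of the two pyramid constructions just described, assuming one-octave spacing; the kernel width is an illustrative choice.

    import numpy as np

    def gaussian_reduce(img, sigma=1.0):
        # Smooth with a separable Gaussian, then keep every other pixel.
        x = np.arange(-int(3 * sigma), int(3 * sigma) + 1, dtype=float)
        k = np.exp(-x**2 / (2 * sigma**2))
        k /= k.sum()
        out = img.astype(float)
        out = np.apply_along_axis(lambda v: np.convolve(v, k, "same"), 0, out)
        out = np.apply_along_axis(lambda v: np.convolve(v, k, "same"), 1, out)
        return out[::2, ::2]

    def consolidate(img):
        # Replace each 2x2 block of pixels by their average intensity.
        img = img[:img.shape[0] // 2 * 2, :img.shape[1] // 2 * 2].astype(float)
        return (img[0::2, 0::2] + img[0::2, 1::2] +
                img[1::2, 0::2] + img[1::2, 1::2]) / 4.0

    def pyramid(img, levels=3, reduce_fn=gaussian_reduce):
        # Build, e.g., 256x256 -> 128x128 -> 64x64 channels.
        chans = [img.astype(float)]
        for _ in range(levels - 1):
            chans.append(reduce_fn(chans[-1]))
        return chans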
The disadvantage of the method is that any error at a coarse level spreads to the finer levels. The method also assumes spectral continuity. The alternative multi-channel approach to coarse-to-fine analysis is to process each channel independently and to combine the results subsequently.
It is well known that human visual perception owes its power to the integration of information from a variety of sources, such as motion, shading, etc. Computer vision, maturing in each of these methods, is now on the way to building more complete vision systems that integrate such modules.
Fusing motion and stereo has been considered by a number of researchers [Mutch86] [Waxman86] [Li93]. If we know either the disparity field or the optical flow for a sequence of stereo images, it is easier to compute the other. Besides, the discontinuities of the optical flow are generally also depth discontinuities, so one of them is computed first and used to guide the other. Toborg and Hwang [Toborg91], however, calculated stereo disparity, optical flow and intensity contours simultaneously and co-operatively, and demonstrated the effectiveness of integrating visual modules on synthetic images.
Other visual cues used with stereopsis include shape-from-shading [Thompson93] [Grimson84] [Bulthoff88] and shape-from-texture [Moerdler88]. Also, active systems, which seek useful additional information by controlling camera parameters, are used more and more frequently [Ahuja93] [Coombs92] [Krotkov93] [Yuille90].
5. Conclusions
Natural evolution seems to find the optimum solution for perception of the environment. But the solution is optimal in terms of the needs of the species and the available "hardware" of biological systems, so the way they solve perceptual problems may not be appropriate for machine vision. Nevertheless, the information obtained through the study of these systems has proved useful in guiding the design of computer vision systems. This conclusion may also be generalized to other kinds of information processing systems: as the biological basis of perception and cognition is understood better, more powerful information processors can be built.
[Ahuja93] Narendra
Ahuja and A. Lynn Abbott, "Active Stereo: Integrating Disparity, Vergence,
Focus, Aperture, and Calibration for Surface Estimation", IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, 1007-1029,
October 1993.
[Badcock85] David R.
Badcock and Clifton M. Schor, "Depth-Increment Detection Function for
Individual Spatial Channels", Optical Society of America A, vol. 2, no. 7,
1211-1216, July 1985.
[Bruce90] Vicki Bruce
and Patrick Green, Visual Perception: Physiology, Psychology and Ecology,
Lawrence Erlbaum Associates, Hove, UK, 1990.
[Bulthoff88] Heinrich
H. Bulthoff and Hanspeter A. Mallot, "Integration of Depth Modules: Stereo
and Shading", Optical Society of America A, vol. 5, no. 10, 1749-1758,
1988.
[Burt80] Peter Burt
and Bela Julesz, "A Disparity Gradient Limit for Binocular Vision",
Science, vol. 208, 615-617, May 1980.
[Clark86] J. J. Clark
and P. D. Lawrence, "A Theoretical Basis for Diffrequency Stereo",
Computer Vision, Graphics and Image Processing, vol. 35, 1-19, 1986.
[Cochran90] Steven
Douglas Cochran, Surface Description from Binocular Stereo, PhD Thesis, School
of Engineering, University of Southern California, November 1990.
[Coombs92] David John
Coombs, Real-time Gaze Holding in Binocular Robot Vision, PhD Thesis,
Department of Computer Science, University of Rochester, June 1992.
[Daugman80] J. G.
Daugman, "Two-Dimensional Spectral Analysis of Cortical Receptive Field
Profile", Vision Research, vol. 20, 847-856, 1980.
[Field87] David J.
Field, "Relations Between the Statistics of Natural Images and the
Response Properties of Cortical Cells", Optical Society of America A, vol.
4, no. 12, 2379-2394, December 1987.
[Fleet91] David J.
Fleet, Allan D. Jepson, Michael R. M. Jenkin, "Phase-Based Disparity
Measurement", CVGIP: Image Understanding, vol. 53, no. 2, 198-210, March
1991.
[Freeman90] Ralph D.
Freeman and Izumi Ohzawa, "On the Neurophysiological Organization of
Binocular Vision", Vision Research, vol. 30, no. 11, 1661-1676, 1990.
[Gabor46] D. Gabor,
"Theory of Communication", Journal of IEE, vol. 93, 429-459, 1946.
[Grimson84] W. E. L.
Grimson, "Binocular Shading and Visual Surface Reconstruction",
Computer Vision, Graphics and Image Processing, vol. 28, 19-43, 1984.
[Hubel62] David H.
Hubel and T. N. Wiesel, "Receptive Fields, Binocular Interaction and
Functional Architecture in the Cat's Visual Cortex", Journal of
Physiology, London, vol. 160, 106-154, 1962.
[Hubel88] David H.
Hubel, Eye, Brain and Vision, Scientific American Library, New York, USA, 1988.
[Julesz60] Bela
Julesz, "Binocular Depth Perception of Computer Generated Patterns",
Bell Systems Technical Journal, vol. 39, 1125-1162, 1960.
[Julesz71] Bela
Julesz, Foundations of Cyclopean Perception, The University of Chicago Press,
Chicago, 1971.
[Krotkov93] Eric
Krotkov and Ruzena Bajcsy, "Active Vision for Reliable Ranging:
Cooperating Focus, Stereo, and Vergence", International Journal of
Computer Vision, vol. 11, no. 2, 187-203, 1993.
[Li93] Lingxiao Li and
James H. Duncan, "3-D Translational Motion and Structure from Binocular
Image Flows", IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 15, no. 7, 657-667, 1993.
[Marcelja80] S.
Marcelja, "Mathematical Description of the Responses of Simple Cortical
Cells", Optical Society of America A, vol. 70, 1297-1300, 1980.
[Marr76] David Marr
and T. Poggio, "A Cooperative Computation of Stereo Disparity",
Science, vol. 194, 283-287, 1976.
[Marr80] David Marr
and E. Hildreth, "Theory of Edge Detection", Proceedings of Royal
Society of London B, vol. 207, 187-217, 1980.
[Marr82] David Marr,
Vision, W. H. Freeman and Company, New York, 1982.
[Mayhew81] John E. W. Mayhew
and John P. Frisby, "Psychophysical and Computational Studies towards a
Theory of Human Stereopsis", Artificial Intelligence, vol. 17, 349-385,
1981.
[Moerdler88] M. L.
Moerdler, "The Integration from Stereo and Multiple Shape-from-Texture
Cues", Image Understanding Workshop, 786-793, April 1988.
[Morgan82] M. J.
Morgan and R. J. Watt, "Mechanisms of Interpolation in Human Spatial
Vision", Nature, vol. 299, 553-555, October 1982.
[Nomura90] M. Nomura
and G. Matsumoto and S. Fujiwara, "A Binocular Model for the Simple
Cell", Biological Cybernetics, vol. 63, 237-242, 1990.
[Nomura93] Masahide
Nomura, "A Model for Neural Representation of Binocular Disparity in
Striate Cortex: Distributed Representation and Veto Mechanisms",
Biological Cybernetics, vol. 69, 165-171, 1993.
[Mutch86] K. M. Mutch,
"Determining Object Translation Information Using Stereoscopic
Motion", IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 8, no. 6, 750-763, 1986.
[Ohzawa86] Izumi
Ohzawa and Ralph D. Freeman, "The Binocular Organization of Simple Cells
in the Cat's Visual Cortex", Journal of Neurophysiology, vol. 56, no. 1,
221-242, July 1986.
[Papoulis65] A.
Papoulis, Probability, Random Variables and Stochastic Processes, McGraw-Hill,
Singapore, 1965.
[Poggio77] G. F.
Poggio and B. Fischer, "Binocular Interaction and Depth Sensitivity in
Striate and Prestriate Cortex of Behaving Rhesus Monkey", Journal of
Neurophysiology, vol. 40, no. 6, 1392-1405, November 1977.
[Poggio85] T. Poggio,
V. Torre and C. Koch, "Computational Vision and Regularization
Theory", Nature, vol. 317, no. 26, 314-319, September 1985.
[Pollard90] Stephen
B. Pollard and John P. Frisby, "Transparency and the Uniqueness Constraint
in Human and Computer Stereo Vision", Nature, vol. 347, no. 11, 553-556,
October 1990.
[Pollen81] Daniel A.
Pollen and Steven F. Ronner, "Phase Relationships Between Adjacent Simple
Cells in the Visual Cortex", Science, vol. 212, 1409-1411, 19 June 1981.
[Prazdny85] K.
Prazdny, "Detection of Binocular Disparities", Biological
Cybernetics, vol. 52, no. 2, 93-99, 1985.
[Sanger88] T. D.
Sanger, "Stereo Disparity Computation Using Gabor Filters",
Biological Cybernetics, vol. 59, 405-418, 1988.
[Schiller76a] Peter
Schiller and Barbara L. Finlay and Susan F. Volman, "Quantitative Studies
of Single-Cell Properties in Monkey Striate Cortex. I. Spatiotemporal
Organization of Receptive Fields", Journal of Neurophysiology, vol. 39,
no. 6, 1288-1319, November 1976.
[Schor84] C. Schor, I.
Wood and J. Ogawa, "Binocular Sensory Vision is Limited by Spatial
Resolution", Vision Research, vol. 24, 661-665, 1984.
[Spillmann90] L.
Spillmann and J. S. Werner, editors, Visual Perception: The Neurophysiological
Foundations, Academic Press, Inc., New York, USA, 1990.
[Thompson93] Clay Matthew Thompson, Robust Photo-Topography by Fusing Shape-from-Shading and Stereo, PhD Thesis, Massachusetts Institute of Technology, February 1993.
[Toborg91] Scott T.
Toborg and Kai Hwang, "Cooperative Vision Integration Through
Data-Parallel Neural Computations", IEEE Transactions on Computers, vol.
40, no. 12, 1368-1379, 1991.
[Watt87] R. J. Watt,
"Scanning from Coarse to Fine Spatial Scales in the Human Visual System
After the Onset of a Stimulus", Optical Society of America A, vol. 4, no.
10, 2006-2021", October 1987.
[Waxman86] A. M.
Waxman and J. H. Duncan, "Binocular Image Flows: Steps Toward Stereo-Motion
Fusion", IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 8, no. 6, 715-731, 1986.
[Weinshall89] Daphna
Weinshall, "Perception of Multiple Transparent Planes in Stereo
Vision", Nature, vol. 341, no. 26, 737-739, October 1989.
[Weng93] J. J. Weng,
"Image Matching Using the Windowed Fourier Phase", International
Journal of Computer Vision, vol. 11, 211-239, 1993.
[Westelius92] C.-J.
Westelius, Preattentive Gaze Control for Robot Vision, PhD Thesis, Department
of Electrical Engineering, Linköping University, 1992.
[Wilson91] Hugh R.
Wilson and Randolph Blake and D. Lynn Halpern, "Coarse Spatial Scales
Constrain the Range of Binocular Vision on Fine Scales", Optical Society
of America A, vol. 8, no. 1, 229-236, January 1991.
[Yeshurun89] Yehezkel
Yeshurun and Eric L. Schwartz, "Cepstral Filtering on a Columnar Image
Architecture: A Fast Algorithm for Binocular Stereo Segmentation", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 7,
759-767, July 1989.
[Yuille90] Alan Yuille
and Davi Geiger, "Stereo and Controlled Movement", International
Journal of Computer Vision, vol. 4, 141-152, 1990.