The decision to use image features to find optic flow was made for several reasons. Firstly, optic flow found using two dimensional features should contain as much information about the scene motion as is available, at the places in the image where flow recovery is best conditioned and where the information is most relevant. Directly related to this, flow discontinuities are not smoothed in the proposed method, as they are with most others, and no extra constraints are needed to recover full flow. The use of features to find optic flow also leads naturally to a sensible and simple representation of object shape, as discussed in Section 6. Finally, finding image flow to the accuracy required here is relatively computationally expensive with other methods of motion estimation. (The advantages of using two dimensional features are discussed further elsewhere [2], [9].)
The two dimensional features are found using either the SUSAN corner detector [27], [26] or the Harris corner detector [15].
Feature tracking is performed using simple two dimensional motion models. Either constant velocity or constant acceleration models are used, depending on the application. (Few applications benefit from using the higher order constant acceleration model.)
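As an illustration of these two models, a minimal sketch follows; the structure and names are assumptions for exposition, not taken from the original system:

```cpp
// Per-feature 2D motion model state.
struct Feature2D {
    float x, y;    // image position (pixels)
    float vx, vy;  // inter-frame velocity (pixels/frame)
    float ax, ay;  // inter-frame acceleration (pixels/frame^2)
};

// Predict where a feature should appear in the next frame, under
// either the constant-velocity or the constant-acceleration model.
void predict(Feature2D& f, bool constant_acceleration) {
    if (constant_acceleration) {
        f.x += f.vx + 0.5f * f.ax;  // second-order position term
        f.y += f.vy + 0.5f * f.ay;
        f.vx += f.ax;
        f.vy += f.ay;
    } else {
        f.x += f.vx;
        f.y += f.vy;
    }
}
```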
The first stage in tracking features is to instantiate a motion model for each new feature, by matching features between the first two frames. If the image motion is smaller than the spacing between image features then finding a list of correct matches is trivial. However, even in the case of stereo matching, where epipolar geometry constrains the possible disparity vectors, ambiguous matching possibilities usually arise; in monocular motion estimation they arise even more frequently. An intelligent algorithm to disambiguate the possible matches is therefore necessary. (The following is described in more detail in [29].)
Matching is based on local operations. Each feature in the first image of a pair is examined in turn, and an attempt is made to find a matching feature in the second image. The criteria used to measure the quality of a match vary greatly in their complexity. Many non-real-time programs (both stereo and motion) have used correlation of small patches to measure match quality; see [3] (and the discussion therein of different methods), [18], [36] and [25]. However, the theoretical justification for correlation-based matching is weak if the features arise from the physical corners of objects, where the moving background makes up more than half of the surrounding patch. Clearly, if the background has local structure then small patch correlation is a poor choice, as the correlation score for a correct match may be very low; the same holds if the motion is strongly non-translational. Correlation is also in general much more computationally intensive than the method described below.
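For concreteness, here is a minimal sketch of small-patch correlation matching of the kind used by the non-real-time systems cited above; the image layout (row-major, 8-bit) and the 7x7 patch size are illustrative assumptions:

```cpp
#include <cstdint>

// Sum-of-squared-differences score between a patch around feature
// (x1,y1) in image 1 and a patch around candidate (x2,y2) in image 2;
// lower is better. The caller must keep both patches inside the images.
constexpr int kHalf = 3;  // patch radius, giving a 7x7 patch

float patch_ssd(const uint8_t* img1, const uint8_t* img2, int width,
                int x1, int y1, int x2, int y2) {
    float ssd = 0.0f;
    for (int dy = -kHalf; dy <= kHalf; ++dy)
        for (int dx = -kHalf; dx <= kHalf; ++dx) {
            float d = float(img1[(y1 + dy) * width + (x1 + dx)]) -
                      float(img2[(y2 + dy) * width + (x2 + dx)]);
            ssd += d * d;
        }
    return ssd;
}
```

Every candidate pairing requires 49 pixel reads from each image, which is the source of the computational cost, and at a physical object corner roughly half of those pixels belong to the moving background.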
Another method for matching is to use a smaller amount of information describing the image structure at the feature. Possible functions suitable for matching include the image brightness and local image spatial derivatives. The chosen image functions are combined into a vector which is compared with the corresponding vector in the second image. A vector built from n functions is n dimensional, and the functions need to be weighted so that each affects the vector to an extent roughly proportional to its reliability as a matching component.
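A minimal sketch of this attribute-vector comparison follows; the choice of three attributes and the particular weights are placeholders rather than values from the original system:

```cpp
#include <array>
#include <cmath>

// Each feature carries a small vector of n attributes (for example,
// brightness plus two local structure measures). Each attribute is
// weighted in rough proportion to its reliability, and two features
// are compared by the distance between their weighted vectors.
constexpr int kNumAttrs = 3;
using AttrVec = std::array<float, kNumAttrs>;

// Placeholder weights; in practice these are tuned empirically.
constexpr AttrVec kWeights = {1.0f, 0.5f, 0.5f};

float attribute_distance(const AttrVec& a, const AttrVec& b) {
    float d2 = 0.0f;
    for (int i = 0; i < kNumAttrs; ++i) {
        float d = kWeights[i] * (a[i] - b[i]);
        d2 += d * d;
    }
    return std::sqrt(d2);
}
```

A candidate in the second image might then be accepted, for example, when its attribute distance is below a threshold and clearly smaller than that of the runner-up; with only n numbers per feature this costs far less than a patch correlation.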
This method of finding feature matches was chosen because it is much faster than a correlation-based method but not much less reliable. Correlation-based matching is approximately 20 times slower than matching using a small number (three, for example) of feature attributes. The two methods typically attain matching success rates of approximately 95% and 85% respectively, where this ``percentage'' is given by
$$100 \times \frac{\text{number of correct matches}}{\text{total number of matches reported}} .$$
(An interactive graphics program was written to allow features to be matched by eye, so that the automatically produced match lists could be analysed.)
When using the SUSAN corner detector, the best attributes were found to be the image brightness (not smoothed) and the x and y components of the position of the USAN centre of gravity. When using the Harris detector, the attributes are computed in hardware and are fixed as the smoothed image brightness and the x and y image derivatives.
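As an illustration of how such an attribute vector might be filled in for a SUSAN feature, here is a hedged sketch; the square 5x5 mask (SUSAN proper uses a circular mask) and the brightness similarity threshold are simplifying assumptions:

```cpp
#include <cstdint>
#include <cstdlib>

// Build three matching attributes for a feature at (x, y): the raw
// (unsmoothed) brightness of the nucleus, and the x and y offsets of
// the USAN centre of gravity from the nucleus. The USAN is the set of
// mask pixels with brightness similar to the nucleus.
void susan_attributes(const uint8_t* img, int width,
                      int x, int y, float attrs[3]) {
    const int t = 20;  // similarity threshold (assumed value)
    const int nucleus = img[y * width + x];
    float sum_x = 0.0f, sum_y = 0.0f;
    int count = 0;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx) {
            int p = img[(y + dy) * width + (x + dx)];
            if (std::abs(p - nucleus) < t) {  // pixel is inside the USAN
                sum_x += dx;
                sum_y += dy;
                ++count;
            }
        }
    attrs[0] = float(nucleus);  // unsmoothed brightness
    attrs[1] = sum_x / count;   // x offset of USAN centre of gravity
    attrs[2] = sum_y / count;   // y offset of USAN centre of gravity
}
```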
Once a corner has been matched for the first time, the resulting velocity estimate allows a large reduction in the search space needed for subsequent matches. The motion model is updated using a simplified two dimensional Kalman filter, in which the search space is reduced and the model estimates are given increasing weight as the track ages.
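A minimal sketch of such a simplified filter, in the style of an alpha-beta tracker; the gain schedule, the search-radius decay and all constants are illustrative assumptions:

```cpp
#include <algorithm>

// Simplified 2D tracking filter. As a feature's track lengthens, the
// gains fall (so the model is trusted more than the raw measurement)
// and the search radius for the next match shrinks.
struct Track {
    float x = 0.0f, y = 0.0f;     // filtered position (pixels)
    float vx = 0.0f, vy = 0.0f;   // filtered velocity (pixels/frame)
    int   age = 0;                // successful matches so far
    float search_radius = 10.0f;  // pixels
};

void update(Track& t, float meas_x, float meas_y) {
    // Gains decay towards a floor as the track ages.
    const float alpha = std::max(0.1f, 1.0f / float(t.age + 2));
    const float beta  = 0.5f * alpha;

    const float rx = meas_x - (t.x + t.vx);  // innovation in x
    const float ry = meas_y - (t.y + t.vy);  // innovation in y
    t.x += t.vx + alpha * rx;
    t.y += t.vy + alpha * ry;
    t.vx += beta * rx;
    t.vy += beta * ry;

    ++t.age;
    // Shrink the search window for the next frame, down to a minimum.
    t.search_radius = std::max(2.0f, 0.9f * t.search_radius);
}
```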
Simple logic is used to cope with the standard issues of temporarily unmatched features, new features, the purging of ``bad'' features and the designation of ``high quality'' features. Typically, two thirds of the features found by the feature detector are tracked successfully enough to be reported to the following stage as ``high quality''.
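One plausible sketch of this bookkeeping; the state names, thresholds and counting scheme are assumptions, not details from the original:

```cpp
// Per-feature bookkeeping: new features start unproven, repeatedly
// matched features are promoted to "high quality", and features
// that go unmatched for too long are purged.
enum class Status { New, Tracked, HighQuality, Lost };

struct Bookkeeping {
    Status status = Status::New;
    int matched = 0;    // successful matches so far
    int unmatched = 0;  // consecutive missed frames
};

void step(Bookkeeping& b, bool matched_this_frame) {
    if (matched_this_frame) {
        ++b.matched;
        b.unmatched = 0;
        if (b.matched >= 5)
            b.status = Status::HighQuality;  // reported to the next stage
        else if (b.status == Status::New)
            b.status = Status::Tracked;
    } else if (++b.unmatched >= 3) {
        b.status = Status::Lost;  // purge as a "bad" feature
    }
}
```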
Figure 3 shows the set of velocity vectors found by the feature tracking stage of ASSET-2 at frame 44 of the sequence described and shown earlier. The Landrover has very little motion, the background is generally moving to the right, and the ambulance is moving to the left and away from the observer. Note the negative divergence in the flow of the ambulance.

Figure 3: An example set of flow vectors found by the feature tracking stage of ASSET-2. The vectors point in the direction of motion and have magnitudes equal to twice the inter-frame image motion.