
A Voting Strategy for Visual Ego-Motion from Stereo

Štěpán Obdržálek and Jiří Matas

Center for Machine Perception, Czech Technical University Prague

xobdrzal@fel.cvut.cz, matas@fel.cvut.cz

Abstract— We present a procedure for egomotion estimation from the visual input of a stereo pair of video cameras. The 3D egomotion problem, which has six degrees of freedom in general, is simplified to four dimensions and further decomposed into two two-dimensional subproblems. The decomposition allows us to use a voting strategy to identify the most probable solution, avoiding random sampling (RANSAC) or other approximation techniques.

The input consists of image correspondences between consecutive stereo pairs, i.e. feature points do not need to be tracked over time. The experiments show that even if the trajectory is assembled as a simple concatenation of frame-to-frame increments, it comes out reliable and precise.

I. INTRODUCTION

This paper concerns the estimation of egomotion of a vehicle carrying a stereo pair of video cameras. The problem is well studied in the literature [1], [2], [3], [4]. Our situation differs from the majority of the published work in two key aspects.

First, the intended use is in urban scenes with the possibility of heavy traffic. A large part, even a majority, of the field of view can be covered by moving objects, which distract the egomotion estimate. Second, the vehicle moves in an open space, where the distance to the observed objects is large compared to the baseline of the stereo pair. This results in very imprecise 3D triangulation, with the spatial uncertainty of triangulated points in the tens of meters. This differs from navigation in small closed environments, e.g. in laboratories or corridors, where the triangulation errors are smaller.

Our intended application is the detection of moving objects around the vehicle and the estimation of their motion in the world coordinate frame. We are thus interested in reliable and precise egomotion estimation, but only locally, within a short time span. The problem of obtaining a globally correct trajectory is not dealt with, nor is the problem of detecting a repeated visit of a location (drift removal, or loop closing). Solutions to these are found in the literature (e.g. [5], [6], [7]). We assume that a GPS system solves these problems in practice.

A 3D motion has six degrees of freedom, three for rotation (change in orientation) and three for translation (change in position). A standard approach to recover the six unknowns is to align three pairs of corresponding triangulated 3D points [8], which however becomes tricky once the triangulation errors are large. Alternatively, if only a single camera is available, the motion can be computed from correspondences of at least five points, e.g. [9].

The authors were supported by Toyota Motor Corporation and by the Czech Government under the research program MSM6840770038.

In our task, however, it is not necessary to recover all six parameters. We are interested only in the horizontal projection of the motion, as if seen on a map. In that case the motion has only three unknowns: two for the 2D location on the ground plane and one for orientation – the heading (or yaw) angle.

We also estimate the pitch angle for a total of four unknowns.

Pitch is the vertical angle between the optical axis and the ground plane and is used to compute the elevation of observed objects relative to our vehicle. The remaining two unknowns, whose computation we avoid, are the rotation around the camera optical axis (roll) and the absolute elevation of our vehicle. We assume the roll to be negligible for ground vehicles under normal driving conditions, and the vehicle's absolute elevation is not useful to us.

The egomotion estimation is based on a voting scheme.

The four-dimensional problem is decomposed into two two-dimensional subproblems, which makes the voting feasible.

The rotation angles (yaw and pitch) are estimated first. The 2D translation is computed in a second voting step, in which each vote explicitly reflects the triangulation imprecision.

II. EGOMOTION ESTIMATION

The egomotion is computed in the form of increments from one stereo image pair to another. Therefore, only four images are involved in the computation at a time - the current pair and the immediately preceding one. The situation is illustrated in Fig. 1.

Fig. 1. Illustration of images involved in the computation – two stereo pairs that are connected by three sets of image correspondences.

The figure depicts the four images: the current stereo pair taken at time $t$ (images $I_L^t$ and $I_R^t$) and the preceding one taken at time $t-1$ (images $I_L^{t-1}$ and $I_R^{t-1}$). Three sets of pixel correspondences are computed. Two sets ($\mathcal{C}^t$ and $\mathcal{C}^{t-1}$: $\{c_i = (x_L, y_L, x_R, y_R)\}$) link pixels within the stereo pairs, the third one ($\mathcal{C}_L$: $\{c_i = (x_L^t, y_L^t, x_L^{t-1}, y_L^{t-1})\}$) connects the two images of the left camera. A 3D scene point $X$ is at time $t$ projected to $I_L^t$ and $I_R^t$ at locations $(x_L^t, y_L^t)$ and $(x_R^t, y_R^t)$. At time $t-1$, it was projected to the preceding image pair at pixels $(x_L^{t-1}, y_L^{t-1})$ and $(x_R^{t-1}, y_R^{t-1})$.

Image correspondences are computed with the approach described in [10]. This method gives a semi-dense correspondence map; typically tens of thousands of correspondences are found for a pair of 640×480 images. The camera pair is calibrated, hence the two stereo correspondence sets $\mathcal{C}^t$ and $\mathcal{C}^{t-1}$ can be triangulated, yielding two sets of 3D points $X^t$ and $X^{t-1}$ in camera-centric coordinates. The two 3D point sets are connected by the correspondence set $\mathcal{C}_L$, forming a set of 3D vectors.
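For a calibrated and rectified pair, this triangulation reduces to the standard disparity relations. The following is a minimal sketch, not the authors' implementation; the focal length `f` (in pixels), baseline `b` (in meters) and principal point `(cx, cy)` are assumed calibration constants.

```python
import numpy as np

def triangulate(x_l, y_l, x_r, f, b, cx, cy):
    """Triangulate a rectified stereo correspondence (x_l, y_l) <-> (x_r, y_l)
    into camera-centric 3D coordinates (X, Y, Z), in meters.
    Sign conventions for the disparity vary; here d = x_l - x_r > 0."""
    d = x_l - x_r               # disparity in pixels
    Z = f * b / d               # depth from the rectified-stereo relation
    X = (x_l - cx) * Z / f      # lateral offset
    Y = (y_l - cy) * Z / f      # vertical offset
    return np.array([X, Y, Z])
```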

A 3D camera motion can be decomposed into two independent components: alignment of the directions of the camera axes (camera rotation) and alignment of the camera centers (camera translation). The decomposition is commutative.

A. Estimation of Rotation

The rotation is computed using the 'in-time' correspondence set $\mathcal{C}_L$, containing motion vectors $c_L = (x_L^t, y_L^t, x_L^{t-1}, y_L^{t-1}) \in \mathcal{C}_L$. Let us inspect the effect of the rotation on the motion vectors, assuming for now that the vehicle position is not changing and that the scene is static. Fig. 2 illustrates image motion vectors caused by the pitch, yaw and roll components of the 3D rotation (for a camera with spherical projection, which approximates a perspective camera well for narrow fields of view). Assuming zero roll for ground vehicles, the motion vectors caused by rotation are linear segments that are identical across the image, independent of object distance.

Fig. 2. Image motion vectors due to individual components of 3D rotation.

We are interested only in the yaw and pitch; the roll is ignored.

Fig. 3. Image motion vectors due to vehicle translation. Forward motion on the left, sidewise on the right.

If the camera also moves, in addition to rotating, the observed motion field is affected differently. Fig. 3 illustrates the effect. Forward motion produces motion vectors oriented in the direction of the so-called focus of expansion, i.e. the image of the scene point towards which the camera moves. A sidewise motion produces parallel motion vectors similar to the rotation. Importantly, for both types of translation the length of the motion vectors decreases with the distance to the observed scene point.

The observed motion field is a combination of the translation and rotation components, where the effect of the translation decreases with distance – motion vectors of points at infinity are affected only by the rotation. And the distances are known from the stereo triangulation. This leads to a very simple voting algorithm for rotation estimation. The rotation is estimated by adding votes to an accumulator, as in the Hough transform. Votes are cast by motion vectors $c_L \in \mathcal{C}_L$, with the weight of the vote being proportional to $c_L$'s 3D distance from the camera. The accumulator domain is in image pixels, its resolution is set to one pixel and its range to $(-\Theta_x, \Theta_x)$ on the x-axis and $(-\Theta_y, \Theta_y)$ on the y-axis.

The resolution is given by the precision with which the image correspondences are computed. The bounds on the maximal rotation are set empirically and depend on the maximal angular speed, the framerate and the resolution of the cameras.

In our setup we have $\Theta_x = 100$ and $\Theta_y = 50$ pixels, which cover all realistic situations with a large margin.

The procedure is summarised in Algorithm 1. At the end we identify the rotation vector $r = (r_x, r_y)$, in pixels, which has the largest support from the motion vectors. The precision of the estimate is further improved (on the x-axis only) by fitting a quadratic curve to the neighbouring support values in the accumulator, i.e. to $A(r_x-1, r_y)$, $A(r_x, r_y)$ and $A(r_x+1, r_y)$. The position of the maximum of the parabola is found in closed form, and it gives us the rotation vector $r$ with sub-pixel precision.
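The paper does not spell out the closed-form maximum; for reference, a parabola through the three accumulator samples gives the usual sub-pixel offset

$$
\delta = \frac{A(r_x-1, r_y) - A(r_x+1, r_y)}{2\left(A(r_x-1, r_y) - 2A(r_x, r_y) + A(r_x+1, r_y)\right)}, \qquad r_x^{\ast} = r_x + \delta,
$$

with $|\delta| < \tfrac{1}{2}$ whenever $A(r_x, r_y)$ is a strict local maximum.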

A final step is to convert the pixel-based vector $r$ to the yaw ($\psi$) and pitch ($\theta$) angles. As shown in Fig. 4, the angle is the inverse tangent of the vector length multiplied by the pixel size $p$ and divided by the focal length $f$:

$$\psi = \tan^{-1}\!\left(\frac{r_x\, p_x}{f}\right), \qquad \theta = \tan^{-1}\!\left(\frac{r_y\, p_y}{f}\right),$$

where $p_x$ and $p_y$ are the horizontal and vertical pixel dimensions, in millimeters. Conveniently, in the standard representation of intrinsic camera parameters by an upper triangular $3\times 3$ matrix $K$ [11], the $\frac{f}{p_x}$ and $\frac{f}{p_y}$ ratios are found in its first two diagonal elements. We can therefore write

$$\psi = \tan^{-1}\!\left(\frac{r_x}{K_{1,1}}\right), \qquad \theta = \tan^{-1}\!\left(\frac{r_y}{K_{2,2}}\right).$$
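As a hedged illustration of this conversion (variable names are ours, not the paper's):

```python
import numpy as np

def rotation_angles_from_pixels(rx, ry, K):
    """Convert a rotation vector (rx, ry), estimated in pixels, to yaw (psi)
    and pitch (theta) in radians, using the intrinsic matrix K where
    K[0, 0] = f / px and K[1, 1] = f / py."""
    psi = np.arctan(rx / K[0, 0])     # yaw
    theta = np.arctan(ry / K[1, 1])   # pitch
    return psi, theta
```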

Naturally, the system would be fooled if the field of view were obstructed by a large moving object, e.g. a truck passing close in front of the vehicle. These situations can be detected, since the 3D depths are known, and a failure of the estimation can be reported. We have not implemented such a detection though, and, as shown in the experiments, failures of this kind occur.

B. Estimation of Translation

We have two sets of triangulated 3D points, $X^t$ and $X^{t-1}$, observed in two consecutive views. The points are in camera-centric coordinates, i.e. the origins of the coordinate systems coincide with the cameras.


Fig. 4. Left: Relation between a rotation vector $r$, in image coordinates, and the angle of rotation. Right: Estimation of 3D reconstruction tolerance.

Algorithm 1 Rotation by voting

Input: $\mathcal{C}_L$: correspondences between points of two consecutive images from one of the cameras
Input: $\mathcal{C}^t$: correspondences between points of the stereo image pair
Output: $r$: vector of rotation, in pixels

/* Initialise the accumulator */
$A_{\Delta x, \Delta y} := 0$, $\Delta x \in (-\Theta_x, \Theta_x)$, $\Delta y \in (-\Theta_y, \Theta_y)$
foreach $c_i^t := (x_{i,L}^t, y_{i,L}^t, x_{i,R}^t, y_{i,R}^t) \in \mathcal{C}^t$, $c_{j,L} := (x_{j,L}^t, y_{j,L}^t, x_{j,L}^{t-1}, y_{j,L}^{t-1}) \in \mathcal{C}_L$ where $x_{i,L}^t = x_{j,L}^t$ and $y_{i,L}^t = y_{j,L}^t$ do
    /* $X$: a 3D point in camera-centric coordinates */
    $X :=$ triangulate($x_{i,L}^t$, $y_{i,L}^t$, $x_{i,R}^t$, $y_{i,R}^t$)
    /* $d$: distance between $X$ and the camera (the origin) */
    $d := \|X\|$
    /* vote for the rotation, weighted by distance $d$ */
    $\Delta x := x_{j,L}^{t-1} - x_{j,L}^t$
    $\Delta y := y_{j,L}^{t-1} - y_{j,L}^t$
    $A_{\Delta x, \Delta y} := A_{\Delta x, \Delta y} + d$
end
/* find where the maximum is */
$r := (r_x, r_y) := \arg\max_{\Delta x \in (-\Theta_x, \Theta_x),\, \Delta y \in (-\Theta_y, \Theta_y)} A_{\Delta x, \Delta y}$
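A minimal Python sketch of this voting step (not the authors' code), assuming the stereo and in-time correspondence sets are row-aligned so that row i of both arrays refers to the same left-image pixel, and that `triangulate` is a callable mapping (xL, yL, xR) to a camera-centric 3D point (e.g. a closure over the hypothetical helper sketched earlier):

```python
import numpy as np

def estimate_rotation(c_t, c_L, triangulate, theta_x=100, theta_y=50):
    """Rotation voting (cf. Algorithm 1). c_t: (N, 4) stereo correspondences
    (xL, yL, xR, yR) at time t; c_L: (N, 4) in-time correspondences
    (xL_t, yL_t, xL_t-1, yL_t-1), row-aligned with c_t.
    Returns the rotation vector (rx, ry) in pixels."""
    acc = np.zeros((2 * theta_x + 1, 2 * theta_y + 1))
    for (xl, yl, xr, _), (xt, yt, xp, yp) in zip(c_t, c_L):
        X = triangulate(xl, yl, xr)            # 3D point, camera-centric
        d = np.linalg.norm(X)                  # distance to the camera
        dx = int(round(xp - xt))               # in-time pixel motion
        dy = int(round(yp - yt))
        if -theta_x <= dx <= theta_x and -theta_y <= dy <= theta_y:
            acc[dx + theta_x, dy + theta_y] += d   # distance-weighted vote
    rx, ry = np.unravel_index(np.argmax(acc), acc.shape)
    return rx - theta_x, ry - theta_y
```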

We are looking for a coordinate system transformation that will align the scene points while moving the cameras accordingly. Knowing the rotation between the two views, $X^t$ is rotated around the origin (camera) by $\theta$ and $\psi$. See Fig. 5 for an illustration. After that, the transformation from $X^{t-1}$ to $X^t$ is a translation $s$.

What complicates its identification are the great imprecisions in the triangulated coordinates and errors in the established correspondences. In the following we search for a translation vector $s$ that would best explain the difference between the two point sets, given the triangulation errors. Again, a voting scheme is adopted in order to be robust to mismatches in the correspondence sets.

C. Triangulation Uncertainty

A 3D point $X$ is a measurement given by a stereo correspondence $c_i^t = (x_{i,L}^t, y_{i,L}^t, x_{i,R}^t, y_{i,R}^t) \in \mathcal{C}^t$, with uncertainty increasing with the distance to the object. The uncertainty is a function of the imprecision in the correspondence (in pixels), of the camera resolution and calibration, of the disparity, of the pixel's position in the image, and generally of the image content (e.g. there may be smaller uncertainty in higher-contrast areas).

Fig. 5. Two-stage egomotion estimation. The rotation (yaw and pitch) is computed first, followed by the translation.

Let us assume that the images are rectified, i.e. that for any stereo correspondence $c_i^t = (x_{i,L}^t, y_{i,L}^t, x_{i,R}^t, y_{i,R}^t) \in \mathcal{C}^t$ it holds that $y_{i,L}^t = y_{i,R}^t$. The correspondence of $(x_{i,L}^t, y_{i,L}^t)$ is then given by a single number, the disparity $d_i = x_{i,R}^t - x_{i,L}^t$. Let the disparities be computed with a tolerance of, say, $\epsilon = 1$ pixel, i.e. if a correspondence with a disparity $\hat{d}$ was established, the actual disparity is considered to be $d \in (\hat{d}-\epsilon, \hat{d}+\epsilon)$ with a uniform distribution over the interval.

The pixel-wise tolerance is transformed to 3D by triangulating both ends of the interval, i.e. both $(x_{i,L}^t, y_{i,L}^t, x_{i,R}^t - \epsilon, y_{i,R}^t)$ and $(x_{i,L}^t, y_{i,L}^t, x_{i,R}^t + \epsilon, y_{i,R}^t)$. See Fig. 4 for an illustration. This gives us two endpoints of a 3D line segment on which the scene point $X_i$ is located, with a distribution that is again approximately uniform (the distribution is in fact a piece of a quadratic function, since the triangulation error grows quadratically with the distance).

The segment goes in the direction of the reference (left) camera and its length increases with the distance, reflecting the higher uncertainty of more distant depth measurements.
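For the rectified geometry above, the familiar first-order relation makes this growth explicit; here $b$ is the stereo baseline, $f$ the focal length in pixels and $d$ the disparity magnitude (symbols not quoted by the paper):

$$
Z = \frac{f\,b}{d}, \qquad \Delta Z \approx \left|\frac{\partial Z}{\partial d}\right|\epsilon = \frac{f\,b}{d^{2}}\,\epsilon = \frac{Z^{2}}{f\,b}\,\epsilon ,
$$

so a disparity tolerance of $\epsilon = 1$ pixel translates into a depth tolerance growing roughly quadratically with the distance $Z$, which is consistent with Table I.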

Table I shows the uncertainty of our stereo configuration, tabulated for some typical distances.

There are other forms of triangulation imprecision, coming from the imprecise calibration of the stereo pair, but their magnitude is significantly smaller. They are all modelled together as Gaussian and we treat them later.

TABLE I
3D triangulation uncertainty for image correspondences with tolerance ±1 pixel.

Distance to the object | Uncertainty of 3D triangulation
  5 m                  | ±6 cm
 10 m                  | ±22 cm
 15 m                  | ±49 cm
 20 m                  | ±88 cm
 30 m                  | ±200 cm
 50 m                  | ±555 cm
 80 m                  | ±1600 cm
100 m                  | ±2500 cm

D. Translation by voting

Fig. 5 illustrates the two-step egomotion recovery from a top view. Two scene points, $X_1$ and $X_2$, are shown as their respective tolerance segments. We denote by $X^-$ the closer end of the tolerance segment, obtained as the triangulation of $(x_L, y_L, x_R - \epsilon, y_L)$, and by $X^+$ the farther end, of $(x_L, y_L, x_R + \epsilon, y_L)$.

In the figure on the left, the points are in camera-centric coordinates, as triangulated from the two stereo pairs. The middle figure shows the situation after the rotation by the estimated yaw $\psi$ is applied to the points from the current $t$-th frame. Finally, on the right, the points from the $t$-th frame are aligned with their counterparts from the $(t-1)$-th frame by the translation vector $s$, yet unknown.

Fig. 6 shows what the vector $s$ can be, i.e. how we can move from $X^{t-1}$ in the previous frame, represented by the tolerance segment $X^{-,t-1}X^{+,t-1}$, to $X^t$, represented by the segment $X^{-,t}X^{+,t}$, in the current frame. All possible translation vectors form (in the 2D top-view projection) a tetragon shown in the middle of the figure. The coordinates of its vertices are the differences between the tolerance segment endpoints: $X^{-,t-1}-X^{+,t}$, $X^{-,t-1}-X^{-,t}$, $X^{+,t-1}-X^{+,t}$ and $X^{+,t-1}-X^{-,t}$. This tetragon represents the vote that $X$ casts into the accumulator.

Under the assumption that the distance to the point $X$ does not change much between the frames, i.e. that it is large relative to the length of the translation vector $s$, the tolerance segments do not change significantly. We can assume them identical, i.e. $X^{+,t-1} - X^{-,t-1} = X^{+,t} - X^{-,t}$. In that case, the vote degenerates to a line segment from $(X^{-,t-1}-X^{+,t})$ to $(X^{+,t-1}-X^{-,t})$, as shown on the right side of Fig. 6.

The voting procedure is summarised in Algorithm 2. An accumulator of votes is initialised first, its domain being the translations in world coordinates. We set its resolution to 1 mm, its range for the left-right offset to $\Theta_X^{min} = -200$ mm and $\Theta_X^{max} = 200$ mm, and its backward-forward range to $\Theta_Z^{min} = -500$ mm and $\Theta_Z^{max} = 1500$ mm. Then, each point $X$ that was successfully triangulated in both the $t$-th and the $(t-1)$-th frame adds a vote in the form of a top-view projected 2D line segment.

As a final step, the accumulator is convolved with a kernel of a 2D normal distribution, with deviation $\sigma$ appropriate to cover all the other imprecisions in the triangulation; we have $\sigma = 5$ mm. The position of the maximum in the convolved accumulator is then found as the translation vector $s$. Fig. 7 shows examples of the accumulated votes. Note that a typical length of a vote is, in world coordinates, in the order of meters or tens of meters. Yet, as shown in the experiments, the maximum can be localised with a precision of a few millimeters.

The computational cost of the procedure is low once the correspondences have been obtained. Since the correspondences are discretised in the pixel domain, the triangulation into camera-centric coordinates can be implemented as a table look-up. The voting itself requires rendering of line segments, which, if implemented on graphics hardware, is almost instantaneous. The only remaining non-trivial operations relate to the accumulator management – initialisation, convolution with a Gaussian kernel and the maximum search – which are all fast and easily parallelisable.
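Since the disparities are integers in pixels, the depth part of the triangulation can indeed be precomputed once; a minimal sketch under assumed calibration constants (`f` in pixels, `b` in meters, neither quoted in the paper):

```python
import numpy as np

def build_depth_lut(f, b, max_disparity=256):
    """Precompute depth (in meters) for every integer disparity value, so
    that triangulation at run time becomes a single table look-up."""
    depth = np.full(max_disparity, np.inf)        # d = 0 maps to infinity
    d = np.arange(1, max_disparity)
    depth[1:] = f * b / d                         # rectified-stereo relation
    return depth

# usage (hypothetical calibration): z = build_depth_lut(f=800.0, b=0.3)[disparity]
```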

Fig. 6. Shape of the translation vote. See text for explanation.

Fig. 7. Estimation of translation: two examples of the voting accumulators, each showing the accumulator $A$ and its $\mathcal{N}(0,\sigma)$-smoothed variant. Both cases represent an almost forward motion. The right one is at a lower speed and there was another motion candidate, caused by a car in front of our vehicle going at about the same speed and turning to the right. The illusory motion is therefore to the left, with no forward component. The coordinate lines intersect at $s = (0,0)$.

III. EXPERIMENTS

The approach was tested on sequences taken with a stereo camera pair mounted on a vehicle driven through a city. The sequences, each several thousand images long, represent real-world scenarios. They include sections of high-speed driving on an expressway as well as traffic congestion and a drive through a city centre with pedestrian-crowded alleyways. Sample images are shown in Fig. 8.

Fig. 8. Sample frames from sequences used to test the egomotion estimation. From open areas to crowded alleyways.

The egomotion was computed on a frame-to-frame basis. An update to the orientation and location was calculated from one frame to the immediately following one, never considering preceding images. Therefore, the trajectories presented here are concatenations of thousands of increments.


Algorithm 2 Translation by voting

Input: $\mathcal{C}_L$: correspondences between points of two consecutive images from one of the cameras
Input: $\mathcal{C}^t$, $\mathcal{C}^{t-1}$: correspondences between points of the stereo image pairs, current and previous frames
Output: $s$: vector of translation, in world coordinates

/* Initialise the accumulator */
$A_{\Delta X, \Delta Z} := 0$, $\Delta X \in (\Theta_X^{min}, \Theta_X^{max})$, $\Delta Z \in (\Theta_Z^{min}, \Theta_Z^{max})$
foreach $c_i^t := (x_{i,L}^t, y_{i,L}^t, x_{i,R}^t, y_{i,R}^t) \in \mathcal{C}^t$, $c_j^{t-1} := (x_{j,L}^{t-1}, y_{j,L}^{t-1}, x_{j,R}^{t-1}, y_{j,R}^{t-1}) \in \mathcal{C}^{t-1}$, $c_{k,L} := (x_{k,L}^t, y_{k,L}^t, x_{k,L}^{t-1}, y_{k,L}^{t-1}) \in \mathcal{C}_L$ where $x_{i,L}^t = x_{k,L}^t$ and $y_{i,L}^t = y_{k,L}^t$ and $x_{j,L}^{t-1} = x_{k,L}^{t-1}$ and $y_{j,L}^{t-1} = y_{k,L}^{t-1}$ do
    /* $\hat{X}^{\pm,t}$, $X^{\pm,t-1}$: endpoints of 3D tolerance segments in camera-centric coordinates */
    $\hat{X}^{-,t} :=$ triangulate($x_{i,L}^t$, $y_{i,L}^t$, $x_{i,R}^t - \epsilon$, $y_{i,R}^t$)
    $\hat{X}^{+,t} :=$ triangulate($x_{i,L}^t$, $y_{i,L}^t$, $x_{i,R}^t + \epsilon$, $y_{i,R}^t$)
    $X^{-,t-1} :=$ triangulate($x_{j,L}^{t-1}$, $y_{j,L}^{t-1}$, $x_{j,R}^{t-1} - \epsilon$, $y_{j,R}^{t-1}$)
    $X^{+,t-1} :=$ triangulate($x_{j,L}^{t-1}$, $y_{j,L}^{t-1}$, $x_{j,R}^{t-1} + \epsilon$, $y_{j,R}^{t-1}$)
    /* Rotate $\hat{X}^{\pm,t}$ by $\theta$ and $\psi$ */
    $X^{\pm,t} := R_{\theta,\psi} \cdot \hat{X}^{\pm,t}$
    /* vote for the translation $s$ with a line segment $uv$ */
    $u := (X_X^{-,t-1} - X_X^{+,t},\; X_Z^{-,t-1} - X_Z^{+,t})$
    $v := (X_X^{+,t-1} - X_X^{-,t},\; X_Z^{+,t-1} - X_Z^{-,t})$
    addLineSegment($A$, $uv$)
end
/* add tolerance to other forms of noise */
$A :=$ convolve($A$, $\mathcal{N}(0, \sigma)$)
/* find where the maximum is */
$s := (s_X, s_Z) := \arg\max_{\Delta X \in (\Theta_X^{min}, \Theta_X^{max}),\, \Delta Z \in (\Theta_Z^{min}, \Theta_Z^{max})} A_{\Delta X, \Delta Z}$
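A hedged Python sketch of this translation voting, assuming the tolerance-segment endpoints have already been computed, rotated by the estimated yaw/pitch and projected to the ground plane (X: left-right, Z: forward); `scipy.ndimage.gaussian_filter` stands in for the Gaussian convolution and simple sampling replaces hardware line rendering:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_translation(prev_segments, curr_segments, res_mm=1,
                         x_range=(-200, 200), z_range=(-500, 1500), sigma_mm=5):
    """Translation voting (cf. Algorithm 2). prev_segments / curr_segments:
    (N, 2, 2) arrays of tolerance segment endpoints [X-, X+] per scene point,
    as (X, Z) ground-plane coordinates in mm; rows refer to the same point.
    Returns the translation (sX, sZ) in mm."""
    nx = int((x_range[1] - x_range[0]) / res_mm) + 1
    nz = int((z_range[1] - z_range[0]) / res_mm) + 1
    acc = np.zeros((nx, nz))
    for (pm, pp), (cm, cp) in zip(prev_segments, curr_segments):
        u = pm - cp                          # X^{-,t-1} - X^{+,t}
        v = pp - cm                          # X^{+,t-1} - X^{-,t}
        # rasterise the vote segment uv by dense sampling
        n = max(2, int(np.linalg.norm(v - u) / res_mm))
        for t in np.linspace(0.0, 1.0, n):
            x, z = u + t * (v - u)
            ix = int(round((x - x_range[0]) / res_mm))
            iz = int(round((z - z_range[0]) / res_mm))
            if 0 <= ix < nx and 0 <= iz < nz:
                acc[ix, iz] += 1.0
    acc = gaussian_filter(acc, sigma=sigma_mm / res_mm)  # absorb remaining noise
    ix, iz = np.unravel_index(np.argmax(acc), acc.shape)
    return ix * res_mm + x_range[0], iz * res_mm + z_range[0]
```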

If an error was made in the computation of an increment, it was not compensated for later. Nonetheless, the trajectories are precise, indicating that there were only a few mistakes and that no significant errors accumulated over time.

Fig. 9 shows a top view of the sequences. The sequence on the left lasted about 8 minutes and consists of about 14000 image pairs taken at 30 frames per second. The figure shows our reconstructed trajectory (yellow) overlaid on a satellite map. The actual path, hand-drawn, is shown in red. For this sequence we also have a record of the in-car data from the CANbus, with speed and turning angle readings. The trajectory restored from the CANbus data is shown in green.

Using the CANbus data as ground truth, we can separately evaluate the rotation and translation estimates. The rotations are shown in the left part of Fig. 10. By summing the incremental changes in yaw ($\psi$) computed at each frame, we obtain the cumulative orientation drawn in red. The green line is the CANbus orientation, which is again a cumulative sum of per-frame readings. The differences in the graphs are within the precision of the camera calibration, which indicates that there is no systematic error in the computation accumulating over time.

The right side of Fig. 10 shows the progression of the vehicle's speed. At each frame, the actual speed is the length of the translation vector $s$. Again, our measurements are shown in red while the CANbus data are in green. The graphs correspond well, but some mistakes can be seen. Mostly they concern acceleration from a stop at a crossing when there is another car immediately in front of us accelerating concurrently. In such cases the visually perceived speed is lower than the actual one.

Fig. 10. Comparison of computed (red) and CANbus (green) estimates on the sequence from the left side of Fig. 9. Left: progression of the orientation (heading) of the test vehicle. The computed orientation (red) is the cumulative sum of about 14000 yaw angle increments ($\psi$). Right: Speed of the vehicle. The computed speed (red) is the length of the translation vector $s$.

The most pronounced case can be seen at frames around 2200. Yet the mistakes are few and their overall effect on the trajectory shown in Fig. 9 is small. In numbers, the difference between the vision and CANbus speeds is in 92.5% of the measurements less than 1 m/s (33 mm per frame at 30 fps), in 79.5% less than 10 mm, and in 55% less than 5 mm.

The second sequence shown in Fig. 9 is longer, lasting almost half an hour and consisting of about 50000 stereo image pairs. Although the trajectory looks rather like a mess, it is in fact mostly correct at the local scale. We start at the bottom right corner, and until we reach the topmost part, about 25000 video frames later, the differences are small.

There we fail to estimate the orientation correctly, bending the trajectory by about 45 degrees. The same happens in the leftmost part, resulting in a total difference in orientation of about 90 degrees at the end of the sequence. In both cases, the failure was due to other vehicles passing from left to right very close in front of our car, obscuring most of the field of view (see Fig. 12).

Figure 11 shows parts of the sequences in detail. The trajectory segments are accompanied by representative images from the on-board cameras. The first segment is a passage through a detour lasting over a minute, with multiple distracting moving objects present, but none of them dominant. The second one shows a turnabout maneuver that includes reversing. The orthomap backgrounds under the trajectories were aligned manually.

Fig. 12. A situation where we fail to recover the rotation correctly. The car passing in front of us makes for a phantom rotation to the left.


Fig. 9. Comparison of reconstructed trajectories (yellow) with hand-drawn ground-truth (red). For the sequence on the left a trajectory obtained from the vehicle’s CANbus data is also shown (green).

Fig. 11. Two segments of the computed trajectory with corresponding scene images. Left: repeated structures on the fences interfere with the correspondence search process, and other moving objects in the surroundings create illusions of false egomotion. Right: a turnabout maneuver which includes reversing.

IV. CONCLUSIONS

We have proposed a solution to the problem of estimating egomotion from visual input acquired with a stereo pair of video cameras. The general 3D motion problem with six unknowns was simplified to four dimensions and further decomposed into two two-dimensional subproblems.

The decomposition allowed us to use a voting scheme to reliably identify the most representative egomotion, even when the input data – image correspondences – were noisy.

Experimental evaluation on real-world sequences has shown that although the egomotion was computed in the form of differences between consecutive video frames, the method provides reliable and precise output. Occasional mistakes occur when the visual input is dominated by another object moving in the scene.

A more complex egomotion estimation system can be built on top of the proposed procedure. Results of the visual estimator should be combined with the other sensors available, e.g. accelerometers or the CANbus car controls. Restrictions from a vehicle motion model should be considered, e.g. reflecting the minimal turning radius. And corrections at the global scale should be obtained using a positioning system (GPS) and/or any of the vision methods for long-term drift removal.

REFERENCES

[1] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone, "Rover navigation using stereo ego-motion," Robotics and Autonomous Systems, vol. 43, no. 4, pp. 215–229, 2003.

[2] T. Lemaire, C. Berger, I.-K. Jung, and S. Lacroix, "Vision-based SLAM: Stereo and monocular approaches," Int. J. Comput. Vision, vol. 74, no. 3, pp. 343–364, 2007.

[3] A. Howard, "Real-time stereo visual odometry for autonomous ground vehicles," in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2008, pp. 3946–3952.

[4] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, 2006.

[5] K. Cornelis, F. Verbiest, and L. Van Gool, "Drift detection and removal for sequential structure from motion algorithms," IEEE PAMI, vol. 26, no. 10, pp. 1249–1259, 2004.

[6] T. Thormählen, N. Hasler, M. Wand, and H.-P. Seidel, "Merging of feature tracks for camera motion estimation from video," in Conference on Visual Media Production, 2008.

[7] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE PAMI, vol. 29, 2007.

[8] R. M. Haralick, C.-N. Lee, K. Ottenberg, and M. Nölle, "Review and analysis of solutions of the three point perspective pose estimation problem," Int. J. Comput. Vision, vol. 13, no. 3, pp. 331–356, 1994.

[9] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE PAMI, vol. 26, no. 6, pp. 756–777, 2004.

[10] Š. Obdržálek, M. Perďoch, and J. Matas, "Dense linear-time correspondences for tracking," in Workshop on Visual Localization for Mobile Platforms, CVPR 2008, June 2008.

[11] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
