With some more free time lately I’ve decided to get back into some structure from motion (SFM). In this post I show a simple SFM pipeline using a mix of OpenCV, GTSAM and PMVS to create accurate and dense 3D point clouds.

This code only serves as an example. There is minimal error checking and it will probably bomb out on large datasets!

# Code

https://github.com/nghiaho12/SFM_example

You will need to edit src/main.cpp to point to your own dataset, or you can test with my desk dataset.

http://nghiaho.com/uploads/desk.zip

# Introduction

There are four main steps in the SFM pipeline code, which I will briefly cover below.

If your code editor supports code folding you can turn it on and see each step locally scoped within curly braces in src/main.cpp.

To simplify the code I assume the images are taken sequentially, that is, each subsequent image is physically close to the previous one and taken by the same camera with a fixed zoom.

## Feature matching

The first step is to find matching features between the images. There are many OpenCV tutorials on feature matching out there so I won’t go into too much detail.

There are a few ways to go about picking pairs of images to match. In my code I match every image against every other. So if there are N images, there are N*(N-1)/2 image pairs to match. This of course can get pretty slow, since it scales quadratically with the number of images.
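The exhaustive pairing can be sketched as follows (a toy illustration with plain STL, not the actual matching code from the repository):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Enumerate every unique image pair (i, j) with i < j.
// For n images this yields n*(n-1)/2 pairs.
std::vector<std::pair<int, int>> allPairs(int n) {
    std::vector<std::pair<int, int>> pairs;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            pairs.emplace_back(i, j);
    return pairs;
}
```

A smarter strategy on large datasets would only match each image against its k nearest neighbours in the sequence.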

I initially tried ORB but found AKAZE produced more matches. Matches have to pass a ratio test (best match vs. second best match) and the fundamental matrix constraint to be kept.

The only slightly tedious part is the bookkeeping of matched features. You need to keep track of features that match across many images. For example, a feature in image 0 may have a match in image 1 and image 2. This is important because features that are seen in at least 3 views allow us to recover the relative scale between motions.

## Motion recovery and triangulation

This is the core part of the SFM pipeline and I take advantage of the following OpenCV functions

- cv::findEssentialMat
- cv::recoverPose
- cv::triangulatePoints

cv::findEssentialMat uses the 5-point algorithm to calculate the essential matrix from the matched features. It’s the de-facto algorithm for finding the essential matrix and is superior to the 8-point algorithm when used in conjunction with RANSAC to filter out outliers. But it is not as easy to code up from scratch as the 8-point algorithm.

The pose from the essential matrix is recovered by cv::recoverPose. It returns a 3×3 rotation (R) and translation vector (t). The pose at camera n is calculated as follows

pose_n = pose_(n-1) * T^-1, where T = [R t; 0 0 0 1]

is a 4×4 matrix. The inverse is needed because recoverPose’s [R|t] maps points from the previous camera’s frame into the current one, so T^-1 = [R^T | -R^T*t; 0 0 0 1] is the relative motion of the camera itself. pose_0 is the identity.
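The pose update itself is just a product of 4×4 homogeneous transforms. A plain-array sketch (the real pipeline would use cv::Mat; all names here are my own):

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Compose the previous camera pose with the relative camera motion:
// pose_n = pose_{n-1} * T_rel, both 4x4 homogeneous transforms.
Mat4 compose(const Mat4& prev, const Mat4& rel) {
    Mat4 out{};  // zero-initialized
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            for (int k = 0; k < 4; k++)
                out[r][c] += prev[r][k] * rel[k][c];
    return out;
}
```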

Given two poses the 2D image points can then be triangulated using cv::triangulatePoints. This function expects two projection matrices (3×4) as input. To make a projection matrix from the pose you also need the camera matrix K (3×3). Let’s say we break up the pose matrix into its R (3×3) and t (3×1) components as shown below.

pose = [R t; 0 0 0 1]

The 3×4 projection matrix for camera n is

P_n = K * [R^T | -R^T*t]

This process is performed sequentially, e.g. image 0 to image 1, image 1 to image 2, and so on.
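Building the projection matrix from a pose can be sketched with plain arrays (an illustration under the assumption that the pose stores the camera-to-world transform, so the projection is P = K * [R^T | -R^T*t]):

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Vec3 = std::array<double, 3>;
using Mat34 = std::array<std::array<double, 4>, 3>;

// P = K * [R^T | -R^T*t], where R, t are the rotation and translation
// blocks of the camera-to-world pose.
Mat34 projectionMatrix(const Mat3& K, const Mat3& R, const Vec3& t) {
    // World-to-camera transform: Rt = R^T, tt = -R^T * t
    Mat3 Rt{};
    Vec3 tt{};
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            Rt[r][c] = R[c][r];
    for (int r = 0; r < 3; r++)
        for (int k = 0; k < 3; k++)
            tt[r] -= Rt[r][k] * t[k];
    // Multiply by the camera matrix: P = K * [Rt | tt]
    Mat34 P{};
    for (int r = 0; r < 3; r++) {
        for (int c = 0; c < 3; c++)
            for (int k = 0; k < 3; k++)
                P[r][c] += K[r][k] * Rt[k][c];
        for (int k = 0; k < 3; k++)
            P[r][3] += K[r][k] * tt[k];
    }
    return P;
}
```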

There is an ambiguity in the scale recovered from the essential matrix (hence it has only 5 degrees of freedom). We don’t know the length of the translation between each image, just the direction. But that’s okay: we can use the first image pair (image 0 and image 1) as the reference for relative scale. An illustration of how this is done is shown below.

The following calculations occur:

- Recover motion between image 0 and 1 and triangulate to get the blue points. These points will be used as the reference to figure out the relative scale.
- Recover motion between image 1 and image 2 and triangulate to get the red points. Using our 2D feature matching done earlier we find that they are the same as the blue points. The rotation (R) is correct, but the translation (t) needs to be scaled.
- Find the relative scale by picking a pair of blue points and the matching red points and calculate the ratio of the vector lengths.
- Apply the scale to t, update the transformation, and re-triangulate the points.
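The relative scale step can be sketched as a ratio of distances between matched triangulated points, averaged over consecutive pairs (a simplified illustration; a robust version might take a median instead):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

double dist(const Vec3& a, const Vec3& b) {
    double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// ref: reference (blue) points, cur: new (red) triangulation of the
// same physical points, matched by index. The scale is the average
// ratio of distances between corresponding point pairs.
double relativeScale(const std::vector<Vec3>& ref,
                     const std::vector<Vec3>& cur) {
    double sum = 0;
    int count = 0;
    for (std::size_t i = 0; i + 1 < ref.size(); i++) {
        double d = dist(cur[i], cur[i + 1]);
        if (d > 1e-12) {  // skip degenerate pairs
            sum += dist(ref[i], ref[i + 1]) / d;
            count++;
        }
    }
    return count ? sum / count : 1.0;
}
```

Multiplying t by this scale before re-triangulating brings the red points into the same scale as the blue reference points.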

Everything I’ve outlined so far is basically visual odometry. This method only works if the images are taken sequentially, with each image close to the previous one.

## Bundle adjustment

Everything we’ve done so far more or less works, but is not optimal in the statistical sense. We have not taken advantage of the constraints imposed by features matched across multiple views, nor have we considered uncertainty in the measurements (the statistical part). As a result, the poses and 3D points recovered won’t be as accurate as they can be. That’s where bundle adjustment comes in. It treats the problem as one big non-linear optimization, where the variables to be optimized can include all the poses, all the 3D points and the camera matrix. This isn’t trivial to code up from scratch, but fortunately there are a handful of libraries out there designed for this task. I chose GTSAM because I was learning it for another application. They have an example of using GTSAM for SFM in their examples directory, which I based my code on.

## Output for PMVS

After performing bundle adjustment we have a good estimate of the poses and 3D points. However the point cloud generally isn’t dense. To get a denser point cloud we can use PMVS. PMVS only requires the 3×4 projection matrix for each image as input. It will generate its own features for triangulation. The projection matrix can be derived from the poses optimized by GTSAM.
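If I recall PMVS2’s input convention correctly, each image gets a txt/%08d.txt file containing the word CONTOUR followed by the three rows of the projection matrix. A sketch of writing one such file (the format details are from memory, so double-check against the PMVS documentation):

```cpp
#include <array>
#include <cassert>
#include <cstdio>
#include <string>

using Mat34 = std::array<std::array<double, 4>, 3>;

// Write one PMVS2 camera file: "CONTOUR" on the first line, then the
// 3x4 projection matrix, one row per line.
bool writePmvsCamera(const std::string& path, const Mat34& P) {
    FILE* fp = std::fopen(path.c_str(), "w");
    if (!fp) return false;
    std::fprintf(fp, "CONTOUR\n");
    for (int r = 0; r < 3; r++)
        std::fprintf(fp, "%f %f %f %f\n", P[r][0], P[r][1], P[r][2], P[r][3]);
    std::fclose(fp);
    return true;
}
```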

# Example

Here’s a dataset of my desk you can use to verify everything is working correctly.

http://nghiaho.com/uploads/desk.zip

Here’s a screenshot of the output point cloud in Meshlab.

If you set Meshlab to display in orthographic mode you can see the walls make a right angle.

# References

http://rpg.ifi.uzh.ch/visual_odometry_tutorial.html

# Comments

Thanks for the good intro to SFM. I am just now learning the basics of OpenCV as a hobby. I am still confused by terminology such as camera pose. In OpenCV it seems that the convention is that [R|T] is the projection matrix used to go from homogeneous world coordinates to homogeneous normalized camera coordinates. It is my understanding that the recoverPose function returns the R and T such that the projection matrix is [R|T]. So I feed triangulatePoints the projection matrix = K[R|T], where K is the intrinsic camera matrix, rather than using K[R^T|-R^T*T] as you mention. So I worry I am not understanding the conventions used by the OpenCV functions recoverPose and triangulatePoints?

My understanding is that the camera pose describes the camera’s movement, whereas the projection operates on 3D points, so they should have different representations. A simple example you can think of is a camera that moves in the x direction by 1 unit with no rotation. The camera pose is R = identity, t = [1,0,0]. If we are looking at a bunch of 3D points in front of us, then by moving the camera right we expect the points to move left, resulting in the projection matrix [R = identity | t = [-1,0,0]].

Can you confirm you are getting correct results with what you are doing?

Thanks much. That makes sense that the term “camera pose” refers to the location and orientation of the camera, as opposed to coordinates in the camera’s reference frame given by the projection matrix. I guess my confusion stems from the fact that pose could also refer to the orientation of an object relative to the camera (maybe that would be called “object pose”). I recently ran camera calibration for the first time using a checkerboard, and in some documentation, pose referred to the pose of the checkerboard as it is moved relative to a fixed camera.

Also, I learned about the essential matrix and how it is defined in terms of the camera coordinates. I ran the findEssentialMat(pts1, pts2, …..) function with a 5×5 grid of simulated point correspondences (simulating moving the camera to the right by shifting the second image’s normalized coordinates to the left by an arbitrary 0.1 units, and then adding a small bit of forward motion by magnifying the coords by a factor of 1.02). So this simulates moving the camera by (dx=0.1, dy=0, dz=1.0-(1.0/1.02)=0.01961). I checked that the resulting essential matrix E honored x2 · E · x1 = 0 for all the point correspondences, so I feel good about the E matrix I got. Then I ran recoverPose(E, pts1, pts2….) and got the identity matrix for R (as expected, no rotation, so it chose the correct R from the 2 possible ones), and it output T = (-0.1, 0, -0.01961), which is consistent with the [R|T] of the cam2 projection matrix (the cam1 matrix being [I|0] in my way of thinking about it). So this is why I am confused, since recoverPose seemed to return the R, T of the projection matrix for cam2 instead of the pose of camera 2. But I may be thinking about this the wrong way, and the documentation just says that recoverPose recovers the relative rotation and translation of two cameras.

Well, there are many possible meanings for that, and the only thing I am sure of is that my head is spinning and translating. Or is that translating then spinning, which to me appears as the world reverse-spinning and negative-translating, or is that negative-translating and reverse-spinning, respectively…

Thanks for the materials. Can I ask what platform you are using? I found lots of difficulties making GTSAM and PMVS work on Windows…

Ubuntu 16.04. A lot of the libraries it depends on are developed for Unix-ish platforms.

Hi, I just started learning about SFM and stumbled on your articles. I have tried your “RunSFM” and it works great! I’m currently trying to compile the “SFM_example” listed in this article, but can’t seem to find the “compile.sh” script anywhere to create the “./main” executable. Since it isn’t provided on the GitHub page, should I try some other way to create the executable to run the “SFM_example”?

Thanks in advance!

Oops, do a git pull now.