Category Archives: Technical

A bunch of technical posts to help me remember useful stuff, should I become senile in the future.

Update on Markerless Augmented Reality stuff

Haven’t posted up anything for the past weeks. Been busy working on this Markerless AR demo code, which should hopefully be released soon.

Recently. there have been a lot of interest in getting AR working on mobile phones (dubbed MAR for Mobile Augmented Reality). The results from papers I’ve found have been quite impressive. But I thought it would be just as fun going the other way and trying to get it running as fast as possible on a powerful PC. This allows the user to do more intensive processing and yet run in real-time. Complex 3D models, animation, particle effects, sound, and maybe even some spare cycles for a frame of Crysis … thats the idea anyway.

To do this I’ve been using the GPU where possible. So far the GPU is used to: extract patch features, build the feature descriptor, match features using a simple K-Tree structure on the GPU.

I’ve spent some time trying to put the Homography calculation on the GPU with poor results, probably because my card does not support double types, which may be necessary for SVD calculations. My laptop’s graphics card is a Nvidia GTS 360M. I can get 20-30 fps easily. I’d imagine with the latest and greatest Nvidia card on a desktop PC, it can only go higher.

In the mean time, here’s a recent screenshot. The room was poorly lit so the webcam ran only like 5 fps. But the processing, at the current time, was capable of 33 fps or so. I’m hoping to increase that with some further GPU optimisation. I’d also like to point out it’s 33 fps WITHOUT any ‘real’ tracking code. I’ve only used a bounding box from previous matched frames to limit the feature search. It’s pretty much running in semi-detection mode all the time. One of the main reason why I didn’t put in a tracker yet is because my room is poorly lit, which makes the webcam very susceptible to blurring. The blurring will easily kill any tracker. So it would be kinda moot to test.

I also got rid of the teapot and replaced it with a 3D model of Tux! which turned out to be a pain in the !@#$ a$$ to get up and running, but I’ll leave that rant for another post.

Markerless Augmented Reality
Markerless Augmented Reality Demo App on Nvidia GTS 360M (laptop GPU)



Update on RANSAC Homography on the GPU

Updating on my previous post, I found a tiny bug when calculating the inliers. Still don’t know why the normalisation code produces less inliers though. So with the fixed code the inlier results without normalisation for Kyral castle and the house are 437 and 158 respectively, instead of 402 and 151 as reported before. The normalisation and bias random flag can be toggled in CUDA_RANSAC_Homography.h using the #define at the top. The new code can be downloaded from the same link as before:

CUDA_RANSAC_Homography.tar.bz2 (1 MB)

The code was written on Linux using CodeBlocks. If you don’t have both it should not be difficult to manually create a project in the IDE of your choice. If you try to compile on Windows you will have to get rid of the Linux specific timing code.

RANSAC Homography on the GPU using CUDA

UPDATE: 28 June 2011

It turns out the SVD code I used does not work particularly well if the image undergoes significant rotation. I have experimented with porting GNU GSL Jacobi SVD code and still no go. My GPU does not support double type, which I suspect is fairly important for SVD calculations. Oh well, guess I’ll have to wait till I get my hands on a newer Nvidia card.

This week I thought it would be interesting to see how well the popular 2D Homography using RANSAC algorithm would perform on the GPU. My first hunch was that it should work well because each RANSAC iteration is completely independent, at least for the vanilla implementation. I decided to write a simple test program that would find the homography between two images and output the matching points. The program does the following:

1. Extract FAST corners in both images (using OpenCV)
2. Use the GPU to do a simple brute force match using the Sum of Absolute Difference (SAD) of each feature patch (16×16 block).
3. Use the GPU to do the RANSAC homography calculation and compare with OpenCV’s cv::findHomography function.
4. Output matches

Things sounded simple enough until I realised I eventually needed an SVD function (for calculating the homography) that could be called in a CUDA kernel. I ended up using C code from:

with some modifications to make it work in CUDA. I also remembered my GeForce 360M GTS does not support double types in CUDA code, so things are a bit unclear in regards to the reliability and accuracy of the SVD results. At this point I just crossed my fingers and hoped for the best, like any good engineer would do.

Results and discussion

I used two image pairs to test the program. Both pairs were originally taken for panoramic stitching. The camera was rotated about the centre (no significant translation), which makes using the homography possible for this example.  Below shows the final results of each pair.

Kryal castle



Timing results for Kryal castle:

Load images: 27.847 ms
Convert to greyscale: 2.556 ms
FAST corners: 7.743 ms
Feature matching (GPU): 818.109 ms
<strong>RANSAC Homography (OpenCV): 24.791 ms
RANSAC Homography (GPU): 10.149 ms</strong>
Refine homography: 1.011 ms
Total time: 892.3 ms

Features extracted: 1724 1560
OpenCV Inliers: 518
GPU Inliers: 402
GPU RANSAC iterations: 2875

RANSAC parameters:
Confidence: 0.99
Inliers ratio: 0.2
Data is normalised: no
Bias random selection: yes

Homography matrix
1.31637 -0.0165917 -304.696
0.108898 1.22028 -73.6643
0.000389038 4.51187e-05 1

And timing results for the house image:

Load images: 30.393 ms
Convert to greyscale: 2.481 ms
FAST corners: 6.126 ms
Feature matching (GPU): 529.514 ms
<strong>RANSAC Homography (OpenCV): 82.173 ms
RANSAC Homography (GPU): 8.498 ms</strong>
Refine homography: 0.607 ms
Total time: 659.896 ms

Features extracted: 1635 1011
OpenCV Inliers: 195
GPU Inliers: 151
GPU RANSAC iterations: 2875

RANSAC parameters:
Confidence: 0.99
Inliers ratio: 0.2
Data is normalised: no
Bias random selection: yes

Homography matrix
1.58939 0.119068 -438.437
0.187436 1.43225 -117.053
0.000693531 0.000185465 1

The GPU code is faster than OpenCV’s cv::findHomography function, which is always a good start to the day. The speed improvement varies (2x to 10x) because OpenCV’s RANSAC code will adapt the number of iterations depending on the highest number of inliers it has found at the current time. The maximum iteration is capped at 2000. Here’s a snippet of OpenCV’s RANSAC parameters from opencv/modules/calib3d/src/fundam.cpp line 222 onwards:

const double confidence = 0.995;
const int maxIters = 2000;

There are some obvious and some less so obvious parameters I have used in my code that I need to explain. My RANSAC code uses a fixed number of iterations based on confidence and expected inlier ratio, based on the theoretical equation from the Wikipedia. I set the inlier to 0.2, which means I’m expecting 20% of the data to be good and the remaining 80% to be bad/noise/crap. I’m playing it safe here since I made no attempts to do any filtering during the feature matching step. In a real application you would, because any bad data that can be pruned out early will make RANSAC jobs easier.

The next parameter is data normalisation when calculating the homography. According to the book Multiple View Geometry, aka ‘bible’, you should never skip this stage, because doing so will unleash the gates of hell. So out of curiosity I made it an option that I can turn off and on to see the effect. For some reason I get more inliers with it turned off?! Which just ain’t right … I’ve checked the code over and over and have not been able to figure out why that is. My normalisation code is the same one described in Multiple View Geometry. Below are the results with and without normalisation for the test images:

No. Inliers with normalisation No. Inliers without normalisation
Kryal castle 320 402
House 44 151

It would be good to test the same code on a newer GeForce that supports double type to see if the results are different.

The last parameter, ‘Bias random selection’, was a simple and quick idea that I threw in, hoping it would improve the RANSAC point selection process. The idea is to bias the random selection to pick point pairs that have a stronger matching score. I got the idea when I was reading about the PROSAC algorithm, an improvement over RANSAC when extra information is available. I used the SAD score of each feature pair from the feature matching stage and transformed it so a good match has a high score. In the beginning I saw improvements in the number of inliers, but after adding the homography refinement code, it gave the same results with or without. Still, I like the idea of having there, it gives me a warm and fuzzy feeling at night.


From the brief playing around with the code I wrote, I can safely say that a GPU version of RANSAC for homography calculation looks like a real winner. The speed improvement is significant enough to justify using it in the real world. On my laptop, the GPU can do over 300,000 RANSAC homography iterations a second. I’m hoping to use it on my old Markerless Augmented Reality project to get real-time speed without trying too hard.

A point worth mentioning, if your data for some reason has a lot of noise and requires more than 2000 iterations to find a usable solution then OpenCV’s cv::findHomography will probably fail. I find it odd it’s not parameter that the user can change.



The project can be compiled in CodeBlocks. The project assumes CUDA is installed at /usr/local/cuda. If not, you will need to right click on each *.cu file, click Properties, Advanced and change the custom command.

The two image pair I used can be found in the samples diretory. The program’s usage is:

CUDA_RANSAC_Homography [img.jpg] [target.jpg] [results.png]

Simple Pano Stitcher

In the past I’ve received emails from people on the OpenCV Yahoo group asking for help on panoramic stitching, to which I answered them to the best of my knowledge, though the answers were theoretical. Being the practical man I like to think of myself,  I decided to finally get down and dirty and write a simple panoramic stitcher that should help beginners venturing into panoramic stitching. In the process of doing so I learned some new things along the way. I believe OpenCV is scheduled to include some sort of panoramic stitching engine in future release. In the meantime you can check out my Simple Pano Stitcher program here:

Last update: 31/03/2014


It uses OpenCV to do fundamental image processing stuff and includes a public domain Levenberg-Marquardt non-linear solver.

To compile the code just type make and run bin/Release/SimplePanoStitcher. It will output a bunch of layer-xxx.png images as output. I’ve included sample images in the sample directory, which the program uses by default. Edit the source code to load your own images. You’ll need to know the focal length in pixels used to capture each image.

To keep things simple I only wrote the bare minimum to get it up and running, for more information check the comments in main.cpp at the top of the file.

Results with Simple Pano Stitcher

Below is a sample result I got after combining all the layers, by simply pasting them on top of each other. No fancy blending or anything. On close inspection there is some noticeable parallax. If I had taken it on a tripod instead of hand held it should be less noticeable.

Sample results from Simple Pano Stitcher
Sample results from Simple Pano Stitcher


One of the crucial steps to getting good results is the non-linear optimisation step. I decided to optimise the focal length, yaw, pitch, and roll for each image. A total of 4 parameters. I thought it would be interesting to compare results with and without any optimisation. For comparison I’ll be using the full size images (3872 x 2592), which is not included in the sample directory to keep the package size reasonable. Below shows a close up section of the roof of the house without optimisation.

Panoramic stitching without non-linear optimisation

The root mean squared error reported is 22.5008 pixels. The results are without a doubt awful! Looking at it too long would cause me to require new glasses with a stronger optical prescription. I’ll now turn on the non-linear optimisation step for the 4 parameters mentioned previously. Below is what we get.

Panoramic stitching with optimisation
Panoramic stitching with non-linear optimisation

The reported root mean squared error is 3.84413 pixels. Much better! Not perfect, but much more pleasant to look at than the previous image.

There is no doubt a vast improvement by introducing non-linear optimisation to panoramic stitching. One can improve the results further by taking into consideration for radial distortion, but I’ll leave that as an exercise to the reader.

I originally got the focal length for my camera lens from calibration using the sample program (calibration.cpp) in the OpenCV 2.x directory. It performs a standard calibration using a checkerboard pattern. Yet even then its accuracy is still a bit off. This got me thinking a bit. Can we use panoramic stitching to refine the focal length? What are the pros and cons? The images used in the stitching process should ideally be scenery with features that are very very far, so that parallax is not an issue, such as outdoor landscape of mountains or something. Seems obvious enough, there are probably papers out there but I’m too lazy to check at this moment …

KD-tree on GPU for 3D point searching

This post is a follow up on my previous post about using the GPU for brute force nearest neighbour searching. In the previous post I found the GPU to be slower than using the ANN library, which utilises some sort of KD-tree to speed up searches. So I decided to get around to implementing a basic KD-tree on the GPU for a better comparison.

My approach was to build the tree on the host and transfer it to the GPU.  Sounds conceptually easy enough, as most things do. The tricky part was transferring the tree structure. I ended up representing the tree as an array node on the GPU, with each node referencing other nodes using an integer index, rather than using pointers, which I suspect would have got rather ugly. The KD-tree that I coded allows the user to set the maximum depth if they want to, which turned out to have some interesting effects as shown shortly.  But first lets see some numbers:

nghia@nghia-laptop:~/Projects/CUDA_KDtree$ bin/Release/CUDA_KDtree
CUDA blocks/threads: 196 512
Points in the tree: 100000
Query points: 100000

Results are the same!

GPU max tree depth: 99
GPU create + search: 321.849 + 177.968 = 499.817 ms
ANN create + search: 162.14 + 456.032 = 618.172 ms
Speed up of GPU over CPU for searches: 2.56x

Alright, now we’re seeing better results from the GPU. As you can see I broken the timing down for the create/search part of the KD-tree. My KD-tree creation code is slower than ANN, no real surprise there. However the searches using the GPU is faster, the results I was hoping for. But only 2.56x faster, nothing really to rave about. I could probably get the same speed up, or more, by multi-threading the searches on the CPU. I then started to play around with the maximum depth of my tree and got these results:

GPU max tree depth: 13
GPU create + search: 201.771 + 56.178 = 257.949 ms
ANN create + search: 163.414 + 403.978 = 567.392 ms
Speed up of GPU over CPU for searches: 7.19x

From 2.56x to 7.19x, a bit better! A max tree depth of 13 produced the fastest results. It would seem that there is a sweet spot between creating a tree with the maximum depth required, so that each leaf of the tree holds only a single point, versus a tree with a limited depth and doing a linear search on the leaves. I suspect most of the bottleneck is due to the heavy use of global memory, especially randomly accessing them from different threads, which is why we’re not seeing spectacular speed ups.

The code for the CUDA KD-tree can be downloaded here Its written for Linux and requires the ANN library to be installed.

Brute Force Nearest Neighbour in CUDA for 3D points

It has been a while since I touched CUDA and seeing as I haven’t posted anything interesting on the front page yet I decided to write a simple brute force nearest neighbour to pass time. The idea has been implemented by the fast kNN library. The library is restricted to non-commerical purposes.

In the past I’ve needed to do nearest neighbour search mainly in 3D space, particular for the Iterative Closest Point (ICP) algorithm, or the likes of it. I relied on the ANN library to do fast nearest neighbour searches, which it does very well. A brute force implementation at the time would have been unimaginably slow and simply impractical. It would have been as slow as watching paint dry for the kind of 3D data I worked on. On second thoughts, watching paint dry would have been faster. Brute force has a typical running time of O(NM), where N and M are the size of the two data set to be compared, basically quadratic running time if both data sets are similar in size. This is generally very slow and impractical for very large data set using the CPU. You would almost be crazy to do so. But on the GPU it is a different story.

I ran a test case where I have 100,000 3D points and 100,000 query points. Each query point searches for its nearest point. This equates to 100,000,000,000 comparisons (that’s a lot of zeros, did I count them right? yep I did). This is what I got on my Intel i7 laptop with a GeForce GTS 360M:

nghia@nghia-laptop:~/Projects/CUDA_NN$ bin/Release/CUDA_NN
Data dimension: 3
Total data points: 100000
Total query points: 100000

GPU: 2193.72 ms
CPU brute force: 69668.3 ms
libANN: 540.008 ms

Speed up over CPU brute force from using GPU: 31.76x
Speed up over libANN using GPU: 0.25x

ANN  produces the fastest results for 3D point searching. The GPU brute force version is slower but still practical for real applications, especially given its ease of implementation and reasonable running time.

I’d imagine even greater speed up (orders of magnitude?) is achievable with a KD-tree created on the GPU. A paper describing building a KD-tree on the GPU can be found at  However I could not find any source code on the website. I might give it a shot when I have time.

The demo code I wrote only searches for the nearest point given a query point. Extending the program to find the K nearest neighbour for a bunch of query points should not be too difficult. Each query point would have to maintain a sorted array of indexes and distances.

My code can be downloaded here ( If you’re on Linux just type make and run CUDA_NN. Just a warning, on Linux if a CUDA program runs for more than 5 second or so the operating system will kill it and display some warning.