Category Archives: Technical

A bunch of technical posts to help me remember useful stuff, should I become senile in the future.

Using Maxima to help write compact and fast matrix maths code and more

I’ve recently discovered this cool piece of open source software called Maxima. For those unfamiliar, it is a Computer Algebra System (CAS). You can express maths problems using symbols (as opposed to numerical values only) and apply common maths operations like integration, differentiation, matrix manipulation, finding roots, solving for x, simplification and so on. In fact, it could probably do most of your calculus homework for you, provided the teacher overlooks the fact that you didn’t show any working.

Maxima operates on the command line but I tend to use the GUI frontend wxMaxima.

Here’s a programming scenario: you need to do a simple matrix operation, like inverting a 3×3 matrix to solve a linear problem. You want it fast but don’t want the hassle of linking to an external library, for whatever reason. Sure, you could do a quick Google and probably find an answer, or you could use it as an excuse to play with Maxima. Here is how to do it in wxMaxima.

3x3 matrix inversion

M is the 3×3 matrix with each element indexed by some letter. The output can now be translated directly into C, C++ or whatever your favourite language is. Notice how the denominator is the same for every element, so you can improve efficiency by storing it in a variable. You can also check if the denominator is near zero, which would indicate a singular matrix that can’t be inverted.
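For example, a direct translation of that output into C++ might look like the sketch below. I’m assuming the elements of M were labelled a through i, row by row; the function name and the near-zero threshold are my own, so adapt them to whatever symbols you used.

#include <cmath>

// Invert a 3x3 matrix whose elements are labelled a..i, row by row:
//     | a b c |
// M = | d e f |
//     | g h i |
// Returns false if the determinant is near zero (singular matrix).
// The shared denominator from the Maxima output is factored into one variable.
bool Invert3x3(const double M[3][3], double out[3][3])
{
    const double a = M[0][0], b = M[0][1], c = M[0][2];
    const double d = M[1][0], e = M[1][1], f = M[1][2];
    const double g = M[2][0], h = M[2][1], i = M[2][2];

    // Common denominator = determinant of M
    const double det = a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g);

    if (std::fabs(det) < 1e-12) { // threshold is arbitrary, pick one to suit your data
        return false; // singular, or numerically close to it
    }

    const double inv_det = 1.0 / det;

    out[0][0] = (e*i - f*h) * inv_det;
    out[0][1] = (c*h - b*i) * inv_det;
    out[0][2] = (b*f - c*e) * inv_det;
    out[1][0] = (f*g - d*i) * inv_det;
    out[1][1] = (a*i - c*g) * inv_det;
    out[1][2] = (c*d - a*f) * inv_det;
    out[2][0] = (d*h - e*g) * inv_det;
    out[2][1] = (b*g - a*h) * inv_det;
    out[2][2] = (a*e - b*d) * inv_det;

    return true;
}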

Eigenvalues of a 3×3 matrix? Sure!

Eigenvalues of a 3x3 matrix ... bad idea

Err okay, maybe not such a great idea.

Need to solve Ax = b? Just re-arrange it to x = invert(A)b and you have your answer. You probably only need one IF statement and the rest is maths operations. Not a single FOR loop. How’s that for lean and mean code?
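As a rough sketch, re-using the hypothetical Invert3x3() from the previous snippet, the 3×3 case boils down to:

// Solve Ax = b for a 3x3 system via x = invert(A)*b.
// The single IF is the singularity check; the rest is straight-line maths,
// no FOR loop in sight.
bool Solve3x3(const double A[3][3], const double b[3], double x[3])
{
    double inv[3][3];

    if (!Invert3x3(A, inv)) {
        return false; // A is singular, no unique solution
    }

    x[0] = inv[0][0]*b[0] + inv[0][1]*b[1] + inv[0][2]*b[2];
    x[1] = inv[1][0]*b[0] + inv[1][1]*b[1] + inv[1][2]*b[2];
    x[2] = inv[2][0]*b[0] + inv[2][1]*b[1] + inv[2][2]*b[2];

    return true;
}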

Simple Pano Stitcher

In the past I’ve received emails from people on the OpenCV Yahoo group asking for help with panoramic stitching, which I answered to the best of my knowledge, though the answers were mostly theoretical. Being the practical man I like to think I am, I decided to finally get down and dirty and write a simple panoramic stitcher that should help beginners venturing into panoramic stitching, and I learned some new things along the way. I believe OpenCV is scheduled to include some sort of panoramic stitching engine in a future release. In the meantime you can check out my Simple Pano Stitcher program here:

Last update: 31/03/2014

Download SimplePanoStitcher-0.2.0.zip

It uses OpenCV to do fundamental image processing stuff and includes a public domain Levenberg-Marquardt non-linear solver.

To compile the code just type make and run bin/Release/SimplePanoStitcher. It will output a bunch of layer-xxx.png images. I’ve included sample images in the sample directory, which the program uses by default. Edit the source code to load your own images. You’ll need to know the focal length in pixels used to capture each image.

To keep things simple I only wrote the bare minimum to get it up and running. For more information check the comments at the top of main.cpp.

Results with Simple Pano Stitcher

Below is a sample result I got after combining all the layers, by simply pasting them on top of each other. No fancy blending or anything. On close inspection there is some noticeable parallax. If I had shot it on a tripod instead of hand held it would be less noticeable.

Sample results from Simple Pano Stitcher

Discussion

One of the crucial steps to getting good results is the non-linear optimisation step. I decided to optimise the focal length, yaw, pitch and roll for each image, a total of 4 parameters per image. I thought it would be interesting to compare results with and without any optimisation. For comparison I’ll be using the full size images (3872 x 2592), which are not included in the sample directory to keep the package size reasonable. Below is a close up of the roof of the house without optimisation.

Panoramic stitching without non-linear optimisation

The root mean squared error reported is 22.5008 pixels. The results are without a doubt awful! Looking at it too long would cause me to require new glasses with a stronger optical prescription. I’ll now turn on the non-linear optimisation step for the 4 parameters mentioned previously. Below is what we get.

Panoramic stitching with non-linear optimisation

The reported root mean squared error is 3.84413 pixels. Much better! Not perfect, but much more pleasant to look at than the previous image.

There is no doubt a vast improvement from introducing non-linear optimisation into panoramic stitching. One can improve the results further by taking radial distortion into account, but I’ll leave that as an exercise for the reader.
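To make the optimisation step a bit more concrete, here is a rough sketch, not the actual SimplePanoStitcher code and with made-up names, of how the 4 parameters per image can enter the reprojection residual that a Levenberg-Marquardt solver minimises.

#include <cmath>

// Illustrative only: one way the per-image focal length and yaw/pitch/roll
// can parameterise the reprojection error between two images.

struct Camera {
    double f;                 // focal length in pixels
    double yaw, pitch, roll;  // camera rotation, in radians
    double cx, cy;            // principal point (roughly the image centre)
};

// Build R = Ry(yaw) * Rx(pitch) * Rz(roll), row-major. Here R maps a ray in
// the shared panorama frame into this camera's frame (conventions vary).
static void BuildRotation(const Camera &c, double R[9])
{
    const double cosy = std::cos(c.yaw),   siny = std::sin(c.yaw);
    const double cosp = std::cos(c.pitch), sinp = std::sin(c.pitch);
    const double cosr = std::cos(c.roll),  sinr = std::sin(c.roll);

    R[0] = cosy*cosr + siny*sinp*sinr;  R[1] = -cosy*sinr + siny*sinp*cosr;  R[2] = siny*cosp;
    R[3] = cosp*sinr;                   R[4] = cosp*cosr;                    R[5] = -sinp;
    R[6] = -siny*cosr + cosy*sinp*sinr; R[7] = siny*sinr + cosy*sinp*cosr;   R[8] = cosy*cosp;
}

// Residual for one feature match: pixel (ua, va) in camera A should land on
// pixel (ub, vb) once mapped through both cameras' parameters. Assumes the
// point projects in front of camera B (p[2] > 0).
static void Residual(const Camera &A, const Camera &B,
                     double ua, double va, double ub, double vb,
                     double res[2])
{
    double Ra[9], Rb[9];
    BuildRotation(A, Ra);
    BuildRotation(B, Rb);

    // Back-project the pixel in A to a ray in A's camera frame.
    const double ray[3] = { ua - A.cx, va - A.cy, A.f };

    // Rotate into the panorama frame (R is orthonormal, so inverse = transpose) ...
    const double w[3] = {
        Ra[0]*ray[0] + Ra[3]*ray[1] + Ra[6]*ray[2],
        Ra[1]*ray[0] + Ra[4]*ray[1] + Ra[7]*ray[2],
        Ra[2]*ray[0] + Ra[5]*ray[1] + Ra[8]*ray[2]
    };
    // ... then into B's camera frame.
    const double p[3] = {
        Rb[0]*w[0] + Rb[1]*w[1] + Rb[2]*w[2],
        Rb[3]*w[0] + Rb[4]*w[1] + Rb[5]*w[2],
        Rb[6]*w[0] + Rb[7]*w[1] + Rb[8]*w[2]
    };

    // Project with B's focal length and compare against the matched pixel.
    res[0] = (B.f * p[0] / p[2] + B.cx) - ub;
    res[1] = (B.f * p[1] / p[2] + B.cy) - vb;
}

The solver then tweaks f, yaw, pitch and roll for every image to drive the sum of squared residuals over all feature matches down, which is roughly the quantity behind the RMS error figures quoted above.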

I originally got the focal length of my camera lens from calibration using the sample program (calibration.cpp) in the OpenCV 2.x directory. It performs a standard calibration using a checkerboard pattern. Yet even then its accuracy is still a bit off. This got me thinking: can we use panoramic stitching to refine the focal length? What are the pros and cons? The images used in the stitching process should ideally be scenery with features that are very far away, so that parallax is not an issue, such as an outdoor landscape of mountains. Seems obvious enough; there are probably papers out there but I’m too lazy to check at the moment ...

KD-tree on GPU for 3D point searching

This post is a follow up on my previous post about using the GPU for brute force nearest neighbour searching. In the previous post I found the GPU to be slower than using the ANN library, which utilises some sort of KD-tree to speed up searches. So I decided to get around to implementing a basic KD-tree on the GPU for a better comparison.

My approach was to build the tree on the host and transfer it to the GPU. Sounds conceptually easy enough, as most things do. The tricky part was transferring the tree structure. I ended up representing the tree as an array of nodes on the GPU, with each node referencing other nodes by integer index rather than by pointer, which I suspect would have got rather ugly. The KD-tree I coded allows the user to set the maximum depth if they want to, which turned out to have some interesting effects, as shown shortly. But first let’s see some numbers:

nghia@nghia-laptop:~/Projects/CUDA_KDtree$ bin/Release/CUDA_KDtree
CUDA blocks/threads: 196 512
Points in the tree: 100000
Query points: 100000

Results are the same!

GPU max tree depth: 99
GPU create + search: 321.849 + 177.968 = 499.817 ms
ANN create + search: 162.14 + 456.032 = 618.172 ms
Speed up of GPU over CPU for searches: 2.56x

Alright, now we’re seeing better results from the GPU. As you can see I’ve broken the timing down into the create and search parts of the KD-tree. My KD-tree creation code is slower than ANN’s, no real surprise there. However the searches on the GPU are faster, the result I was hoping for. But only 2.56x faster, nothing really to rave about. I could probably get the same speed up, or more, by multi-threading the searches on the CPU. I then started to play around with the maximum depth of my tree and got these results:

GPU max tree depth: 13
GPU create + search: 201.771 + 56.178 = 257.949 ms
ANN create + search: 163.414 + 403.978 = 567.392 ms
Speed up of GPU over CPU for searches: 7.19x

From 2.56x to 7.19x, a bit better! A max tree depth of 13 produced the fastest results. It would seem there is a sweet spot between building the tree to its full depth, so that each leaf holds only a single point, and limiting the depth and doing a linear search over the points in each leaf. I suspect most of the bottleneck is due to the heavy use of global memory, especially random accesses from different threads, which is why we’re not seeing spectacular speed ups.
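To make the layout a bit more concrete, here is a rough sketch of the index-based node representation and the linear scan over a leaf’s points. Field and function names are illustrative, not necessarily those used in the download.

// Child links are array indices instead of pointers, so the whole tree can
// be copied to the GPU in one go and traversed without pointer fix-ups.
struct KDNode {
    int   left, right;     // indices into the node array, -1 if absent
    int   split_axis;      // 0 = x, 1 = y, 2 = z (internal nodes)
    float split_value;     // splitting plane position along that axis
    int   first_point;     // index of this leaf's first point
    int   num_points;      // > 0 only for leaves; grows as max depth shrinks
};

// Leaves are scanned linearly inside the search kernel, which is where the
// depth trade-off comes from: a shallower tree means fewer global memory
// hops through nodes but more points to test per leaf.
__device__ void ScanLeaf(const KDNode &leaf, const float3 *pts,
                         const int *pt_indices, float3 query,
                         float &best_dist2, int &best_idx)
{
    for (int k = 0; k < leaf.num_points; k++) {
        const int idx = pt_indices[leaf.first_point + k];
        const float3 p = pts[idx];
        const float dx = p.x - query.x;
        const float dy = p.y - query.y;
        const float dz = p.z - query.z;
        const float d2 = dx*dx + dy*dy + dz*dz;
        if (d2 < best_dist2) {
            best_dist2 = d2;
            best_idx = idx;
        }
    }
}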

The code for the CUDA KD-tree can be downloaded here: CUDA_KDtree.zip. It’s written for Linux and requires the ANN library to be installed.

Brute Force Nearest Neighbour in CUDA for 3D points

It has been a while since I touched CUDA, and seeing as I haven’t posted anything interesting on the front page yet I decided to write a simple brute force nearest neighbour search to pass the time. The idea has already been implemented by the fast kNN library, which is restricted to non-commercial purposes.

In the past I’ve needed to do nearest neighbour searches mainly in 3D space, particularly for the Iterative Closest Point (ICP) algorithm, or the likes of it. I relied on the ANN library to do fast nearest neighbour searches, which it does very well. A brute force implementation at the time would have been unimaginably slow and simply impractical. It would have been as slow as watching paint dry for the kind of 3D data I worked on. On second thought, watching paint dry would have been faster. Brute force has a typical running time of O(NM), where N and M are the sizes of the two data sets being compared, basically quadratic running time if both data sets are similar in size. This is generally very slow and impractical for very large data sets on the CPU. You would almost be crazy to do it. But on the GPU it is a different story.
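The kernel itself is about as simple as CUDA gets. A minimal sketch, not necessarily identical to the code in the download, assigns one thread per query point:

#include <cfloat>

// Brute force 1-NN: each thread takes one query point and scans every data
// point, keeping the closest. Squared distances are enough for comparison.
__global__ void BruteForceNN(const float3 *data, int num_data,
                             const float3 *queries, int num_queries,
                             int *nearest_idx, float *nearest_dist2)
{
    const int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= num_queries) return;

    const float3 query = queries[q];
    float best_d2 = FLT_MAX;
    int best_i = -1;

    for (int i = 0; i < num_data; i++) {
        const float dx = data[i].x - query.x;
        const float dy = data[i].y - query.y;
        const float dz = data[i].z - query.z;
        const float d2 = dx*dx + dy*dy + dz*dz;
        if (d2 < best_d2) {
            best_d2 = d2;
            best_i = i;
        }
    }

    nearest_idx[q] = best_i;
    nearest_dist2[q] = best_d2;
}

An obvious refinement would be to stage chunks of the data points in shared memory so the threads in a block share the loads, but even a naive version like this gives a decent speed up over the CPU, as the numbers below show.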

I ran a test case where I have 100,000 3D points and 100,000 query points. Each query point searches for its nearest point. This equates to 10,000,000,000 comparisons (that’s a lot of zeros, did I count them right? yep I did). This is what I got on my Intel i7 laptop with a GeForce GTS 360M:

nghia@nghia-laptop:~/Projects/CUDA_NN$ bin/Release/CUDA_NN
Data dimension: 3
Total data points: 100000
Total query points: 100000

GPU: 2193.72 ms
CPU brute force: 69668.3 ms
libANN: 540.008 ms

Speed up over CPU brute force from using GPU: 31.76x
Speed up over libANN using GPU: 0.25x

ANN produces the fastest results for 3D point searching. The GPU brute force version is slower but still practical for real applications, especially given its ease of implementation and reasonable running time.

I’d imagine even greater speed up (orders of magnitude?) is achievable with a KD-tree created on the GPU. A paper describing building a KD-tree on the GPU can be found at http://www.kunzhou.net/.  However I could not find any source code on the website. I might give it a shot when I have time.

The demo code I wrote only searches for the single nearest point given a query point. Extending the program to find the K nearest neighbours for a bunch of query points should not be too difficult. Each query point would have to maintain a sorted array of indexes and distances.
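A sketch of what that per-query bookkeeping might look like, with K as a compile-time constant and all names purely illustrative:

// Keep the K best candidates sorted by squared distance, inserting new ones
// with a small insertion-sort style shuffle. Both arrays live in per-thread
// local memory (or registers) and best_d2[] should be initialised to FLT_MAX
// before scanning the data points.
constexpr int K = 8;

__device__ void InsertCandidate(float dist2, int idx,
                                float best_d2[K], int best_idx[K])
{
    if (dist2 >= best_d2[K - 1]) return; // worse than the current K-th best

    // Walk from the end, shifting worse entries down to make room.
    int j = K - 1;
    while (j > 0 && best_d2[j - 1] > dist2) {
        best_d2[j]  = best_d2[j - 1];
        best_idx[j] = best_idx[j - 1];
        j--;
    }
    best_d2[j]  = dist2;
    best_idx[j] = idx;
}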

My code can be downloaded here (CUDA_LK.zip). If you’re on Linux just type make and run CUDA_NN. Just a warning: on Linux, if a CUDA program runs for more than 5 seconds or so the operating system will kill it and display a warning.