Brute Force Nearest Neighbour in CUDA for 3D points

It has been a while since I touched CUDA and seeing as I haven’t posted anything interesting on the front page yet I decided to write a simple brute force nearest neighbour to pass time. The idea has been implemented by the fast kNN library. The library is restricted to non-commerical purposes.

In the past I’ve needed to do nearest neighbour search mainly in 3D space, particular for the Iterative Closest Point (ICP) algorithm, or the likes of it. I relied on the ANN library to do fast nearest neighbour searches, which it does very well. A brute force implementation at the time would have been unimaginably slow and simply impractical. It would have been as slow as watching paint dry for the kind of 3D data I worked on. On second thoughts, watching paint dry would have been faster. Brute force has a typical running time of O(NM), where N and M are the size of the two data set to be compared, basically quadratic running time if both data sets are similar in size. This is generally very slow and impractical for very large data set using the CPU. You would almost be crazy to do so. But on the GPU it is a different story.

I ran a test case where I have 100,000 3D points and 100,000 query points. Each query point searches for its nearest point. This equates to 100,000,000,000 comparisons (that’s a lot of zeros, did I count them right? yep I did). This is what I got on my Intel i7 laptop with a GeForce GTS 360M:

nghia@nghia-laptop:~/Projects/CUDA_NN$ bin/Release/CUDA_NN
Data dimension: 3
Total data points: 100000
Total query points: 100000

GPU: 2193.72 ms
CPU brute force: 69668.3 ms
libANN: 540.008 ms

Speed up over CPU brute force from using GPU: 31.76x
Speed up over libANN using GPU: 0.25x

ANN  produces the fastest results for 3D point searching. The GPU brute force version is slower but still practical for real applications, especially given its ease of implementation and reasonable running time.

I’d imagine even greater speed up (orders of magnitude?) is achievable with a KD-tree created on the GPU. A paper describing building a KD-tree on the GPU can be found at  However I could not find any source code on the website. I might give it a shot when I have time.

The demo code I wrote only searches for the nearest point given a query point. Extending the program to find the K nearest neighbour for a bunch of query points should not be too difficult. Each query point would have to maintain a sorted array of indexes and distances.

My code can be downloaded here ( If you’re on Linux just type make and run CUDA_NN. Just a warning, on Linux if a CUDA program runs for more than 5 second or so the operating system will kill it and display some warning.