Brute Force Nearest Neighbour in CUDA for 3D points

It has been a while since I touched CUDA and seeing as I haven’t posted anything interesting on the front page yet I decided to write a simple brute force nearest neighbour to pass time. The idea has been implemented by the fast kNN library. The library is restricted to non-commerical purposes.

In the past I’ve needed to do nearest neighbour search mainly in 3D space, particular for the Iterative Closest Point (ICP) algorithm, or the likes of it. I relied on the ANN library to do fast nearest neighbour searches, which it does very well. A brute force implementation at the time would have been unimaginably slow and simply impractical. It would have been as slow as watching paint dry for the kind of 3D data I worked on. On second thoughts, watching paint dry would have been faster. Brute force has a typical running time of O(NM), where N and M are the size of the two data set to be compared, basically quadratic running time if both data sets are similar in size. This is generally very slow and impractical for very large data set using the CPU. You would almost be crazy to do so. But on the GPU it is a different story.

I ran a test case where I have 100,000 3D points and 100,000 query points. Each query point searches for its nearest point. This equates to 100,000,000,000 comparisons (that’s a lot of zeros, did I count them right? yep I did). This is what I got on my Intel i7 laptop with a GeForce GTS 360M:

nghia@nghia-laptop:~/Projects/CUDA_NN$ bin/Release/CUDA_NN
Data dimension: 3
Total data points: 100000
Total query points: 100000

GPU: 2193.72 ms
CPU brute force: 69668.3 ms
libANN: 540.008 ms

Speed up over CPU brute force from using GPU: 31.76x
Speed up over libANN using GPU: 0.25x

ANN  produces the fastest results for 3D point searching. The GPU brute force version is slower but still practical for real applications, especially given its ease of implementation and reasonable running time.

I’d imagine even greater speed up (orders of magnitude?) is achievable with a KD-tree created on the GPU. A paper describing building a KD-tree on the GPU can be found at  However I could not find any source code on the website. I might give it a shot when I have time.

The demo code I wrote only searches for the nearest point given a query point. Extending the program to find the K nearest neighbour for a bunch of query points should not be too difficult. Each query point would have to maintain a sorted array of indexes and distances.

My code can be downloaded here ( If you’re on Linux just type make and run CUDA_NN. Just a warning, on Linux if a CUDA program runs for more than 5 second or so the operating system will kill it and display some warning.

4 thoughts on “Brute Force Nearest Neighbour in CUDA for 3D points”

  1. Nice and useful article.
    I made some test and found that there are limitations in the number of query points and the number of points in the reference set. Far below the global memory of the device the program fails. Reading the code I think than its limit is related to the number of available threads.
    On the other hand I have being playing with other implementation of a “brute force” on CUDA and it is faster. I hope that the limitation of my implementation only depends on the global memory. I need only the closest point (not k- ) for an ICP based registration, so are similar problems.
    Used the following card:
    Cuda device : 1
    Device name : GeForce 9800 GT
    Device global mem : 1073741824
    Device shared mem per block: 16384
    Registers per block : 8192
    Max Threads per block : 512
    Multi-processors count : 14
    Threads per multiprocessor : 768

    Results for 128k points and 6k query points (3D) :

    My Method Brute on i7 Your code :
    sum 237644759.0 237644759.0 237644759.0
    time[s] 0.886741 5.659189 2.467112

    I could “clean” my code and send you for your advice and considerations.


  2. News ! I changed the device and here your code is faster.

    Cuda device : 1
    Device name : GeForce GT 430
    Device global mem : 1073414144
    Device shared mem per block: 49152
    Registers per block : 32768
    Max Threads per block : 1024
    Multi-processors count : 2
    Threads per multiprocessor : 1536

    237644759.000000 237644759.000000 237644759.000000
    0.363197 3.941930 0.145398

    So your code takes 0.14 s and mine 0.36 s. !
    Brute Force on i7 2600k : 3.94 s (but without multithread, only one core, when i use OpenMP it is much better)

  3. Hi. I made minimum changes in your code thinking in a static reference set for points. So I took the transfer of the reference data set out. The reference data is loaded only at initialization. In a machine with low bandwidth, like my GeForce 9800 GT this was the bottleneck. Also the index output was removed, because it is not needed for registration using ICP. After doing this your code is better for my problem.
    Of course, this modification is not valid for a dynamic data set, like some particle (or N-body) problems.
    Any way, I can not beat the performance of kdtree or NearPT3 (from Randolph Franklin) in conjuntion with OpenMP, using pure CPU code running on i7.

Leave a Reply

Your email address will not be published. Required fields are marked *