KD-tree on GPU for 3D point searching

This post is a follow up on my previous post about using the GPU for brute force nearest neighbour searching. In the previous post I found the GPU to be slower than using the ANN library, which utilises some sort of KD-tree to speed up searches. So I decided to get around to implementing a basic KD-tree on the GPU for a better comparison.

My approach was to build the tree on the host and transfer it to the GPU.  Sounds conceptually easy enough, as most things do. The tricky part was transferring the tree structure. I ended up representing the tree as an array node on the GPU, with each node referencing other nodes using an integer index, rather than using pointers, which I suspect would have got rather ugly. The KD-tree that I coded allows the user to set the maximum depth if they want to, which turned out to have some interesting effects as shown shortly.  But first lets see some numbers:

nghia@nghia-laptop:~/Projects/CUDA_KDtree$ bin/Release/CUDA_KDtree
CUDA blocks/threads: 196 512
Points in the tree: 100000
Query points: 100000

Results are the same!

GPU max tree depth: 99
GPU create + search: 321.849 + 177.968 = 499.817 ms
ANN create + search: 162.14 + 456.032 = 618.172 ms
Speed up of GPU over CPU for searches: 2.56x

Alright, now we’re seeing better results from the GPU. As you can see I broken the timing down for the create/search part of the KD-tree. My KD-tree creation code is slower than ANN, no real surprise there. However the searches using the GPU is faster, the results I was hoping for. But only 2.56x faster, nothing really to rave about. I could probably get the same speed up, or more, by multi-threading the searches on the CPU. I then started to play around with the maximum depth of my tree and got these results:

GPU max tree depth: 13
GPU create + search: 201.771 + 56.178 = 257.949 ms
ANN create + search: 163.414 + 403.978 = 567.392 ms
Speed up of GPU over CPU for searches: 7.19x

From 2.56x to 7.19x, a bit better! A max tree depth of 13 produced the fastest results. It would seem that there is a sweet spot between creating a tree with the maximum depth required, so that each leaf of the tree holds only a single point, versus a tree with a limited depth and doing a linear search on the leaves. I suspect most of the bottleneck is due to the heavy use of global memory, especially randomly accessing them from different threads, which is why we’re not seeing spectacular speed ups.

The code for the CUDA KD-tree can be downloaded here CUDA_KDtree.zip. Its written for Linux and requires the ANN library to be installed.

2 thoughts on “KD-tree on GPU for 3D point searching”

  1. Hello Nghia:
    I am a phd student. And my major is computer graphis. I am working on KD-Tree those days. your codes is really helpful. And I think there are several things you can do to improve the performance. So, I want to discuss with you about the following three points.

    1. For building KD-Tree process, the most time consuming part is sorting to find median value. This part we can use GPU to do sorting. I try to use thrust library to do that. However, my performance didn’t improve too much. I think it’s because of the data size.
    2. For traversing KD-Tree process, I think we can updata radius of the circle while the best_dist is updata. In __device__ void Search(..) function:
    if(tmp_dist < best_dist) {
    best_dist = tmp_dist;
    best_idx = tmp_idx;
    radius = sqrt(best_dist);//updata the radius
    }
    3. I think this equation: CUDA_STACK = 2^(MAX_TREE_LEVEL) should be satisfied in the code. Because it's tatally possible that one thread would back traverse from the bottom to the top level of the tree.

    Thanks very much for sharing the codes.

    1. Hi,

      If you manage to improve the code let me know, because I’ve been busy with other projects lately so do not have time to go back to this one.

Leave a Reply

Your email address will not be published. Required fields are marked *