Last Updated on March 19, 2011 by nghiaho12
This post is a follow up on my previous post about using the GPU for brute force nearest neighbour searching. In the previous post I found the GPU to be slower than using the ANN library, which utilises some sort of KD-tree to speed up searches. So I decided to get around to implementing a basic KD-tree on the GPU for a better comparison.
My approach was to build the tree on the host and transfer it to the GPU. Sounds conceptually easy enough, as most things do. The tricky part was transferring the tree structure. I ended up representing the tree as an array node on the GPU, with each node referencing other nodes using an integer index, rather than using pointers, which I suspect would have got rather ugly. The KD-tree that I coded allows the user to set the maximum depth if they want to, which turned out to have some interesting effects as shown shortly. But first lets see some numbers:
nghia@nghia-laptop:~/Projects/CUDA_KDtree$ bin/Release/CUDA_KDtree CUDA blocks/threads: 196 512 Points in the tree: 100000 Query points: 100000 Results are the same! GPU max tree depth: 99 GPU create + search: 321.849 + 177.968 = 499.817 ms ANN create + search: 162.14 + 456.032 = 618.172 ms Speed up of GPU over CPU for searches: 2.56x
Alright, now we’re seeing better results from the GPU. As you can see I broken the timing down for the create/search part of the KD-tree. My KD-tree creation code is slower than ANN, no real surprise there. However the searches using the GPU is faster, the results I was hoping for. But only 2.56x faster, nothing really to rave about. I could probably get the same speed up, or more, by multi-threading the searches on the CPU. I then started to play around with the maximum depth of my tree and got these results:
GPU max tree depth: 13 GPU create + search: 201.771 + 56.178 = 257.949 ms ANN create + search: 163.414 + 403.978 = 567.392 ms Speed up of GPU over CPU for searches: 7.19x
From 2.56x to 7.19x, a bit better! A max tree depth of 13 produced the fastest results. It would seem that there is a sweet spot between creating a tree with the maximum depth required, so that each leaf of the tree holds only a single point, versus a tree with a limited depth and doing a linear search on the leaves. I suspect most of the bottleneck is due to the heavy use of global memory, especially randomly accessing them from different threads, which is why we’re not seeing spectacular speed ups.
The code for the CUDA KD-tree can be downloaded here CUDA_KDtree.zip. Its written for Linux and requires the ANN library to be installed.
Hello Nghia:
I am a phd student. And my major is computer graphis. I am working on KD-Tree those days. your codes is really helpful. And I think there are several things you can do to improve the performance. So, I want to discuss with you about the following three points.
1. For building KD-Tree process, the most time consuming part is sorting to find median value. This part we can use GPU to do sorting. I try to use thrust library to do that. However, my performance didn’t improve too much. I think it’s because of the data size.
2. For traversing KD-Tree process, I think we can updata radius of the circle while the best_dist is updata. In __device__ void Search(..) function:
if(tmp_dist < best_dist) {
best_dist = tmp_dist;
best_idx = tmp_idx;
radius = sqrt(best_dist);//updata the radius
}
3. I think this equation: CUDA_STACK = 2^(MAX_TREE_LEVEL) should be satisfied in the code. Because it's tatally possible that one thread would back traverse from the bottom to the top level of the tree.
Thanks very much for sharing the codes.
Hi,
If you manage to improve the code let me know, because I’ve been busy with other projects lately so do not have time to go back to this one.
Hello Nghia, Thanks a lot for this code! I’m trying to run this by changing the number of dimensions of the points, but it doesn’t seem to be working for any number of dimensions except 3. I get the following error:
terminate called after throwing an instance of std::length_error’
what(): vector::_M_fill_insert
Any help would be much appreciated! Thanks once again!
Your vector might be too large.
Hello Nghia. I am an undergrad student working with speeding up K-Nearest Neighbor (KNN) search using GPU. Thank you for sharing your codes.
Initially I was a bit confused as to how you were calculating the nearest neighbor. Upon looking at your code in detail, I understand that you are actually calculating the nearest neighbor. Can the same be modified to work for K nearest neighbors?
Thank you in advance.
It is possible. You would keep track of the top K min distances instead of just the single min distance.