This post is a follow up on my previous post about using the GPU for brute force nearest neighbour searching. In the previous post I found the GPU to be slower than using the ANN library, which utilises some sort of KD-tree to speed up searches. So I decided to get around to implementing a basic KD-tree on the GPU for a better comparison.
My approach was to build the tree on the host and transfer it to the GPU. Sounds conceptually easy enough, as most things do. The tricky part was transferring the tree structure. I ended up representing the tree as an array node on the GPU, with each node referencing other nodes using an integer index, rather than using pointers, which I suspect would have got rather ugly. The KD-tree that I coded allows the user to set the maximum depth if they want to, which turned out to have some interesting effects as shown shortly. But first lets see some numbers:
nghia@nghia-laptop:~/Projects/CUDA_KDtree$ bin/Release/CUDA_KDtree CUDA blocks/threads: 196 512 Points in the tree: 100000 Query points: 100000 Results are the same! GPU max tree depth: 99 GPU create + search: 321.849 + 177.968 = 499.817 ms ANN create + search: 162.14 + 456.032 = 618.172 ms Speed up of GPU over CPU for searches: 2.56x
Alright, now we’re seeing better results from the GPU. As you can see I broken the timing down for the create/search part of the KD-tree. My KD-tree creation code is slower than ANN, no real surprise there. However the searches using the GPU is faster, the results I was hoping for. But only 2.56x faster, nothing really to rave about. I could probably get the same speed up, or more, by multi-threading the searches on the CPU. I then started to play around with the maximum depth of my tree and got these results:
GPU max tree depth: 13 GPU create + search: 201.771 + 56.178 = 257.949 ms ANN create + search: 163.414 + 403.978 = 567.392 ms Speed up of GPU over CPU for searches: 7.19x
From 2.56x to 7.19x, a bit better! A max tree depth of 13 produced the fastest results. It would seem that there is a sweet spot between creating a tree with the maximum depth required, so that each leaf of the tree holds only a single point, versus a tree with a limited depth and doing a linear search on the leaves. I suspect most of the bottleneck is due to the heavy use of global memory, especially randomly accessing them from different threads, which is why we’re not seeing spectacular speed ups.
The code for the CUDA KD-tree can be downloaded here CUDA_KDtree.zip. Its written for Linux and requires the ANN library to be installed.