This is my own implementation of the Lucas-Kanade optical flow algorithm using CUDA, based on the paper:
I have always wanted to learn how to program using CUDA, so I decided to start by implementing the Lucas-Kanade optical flow algorithm. My version calculates optical flow for every pixel (dense optical flow), as opposed to sparse. This makes coding much easier (no need to write a feature detector), and having a dense field is always nice.
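The dense formulation maps naturally onto CUDA: one thread per pixel, each solving the 2x2 Lucas-Kanade normal equations over a small window around its pixel. A minimal sketch of that idea (the kernel name, fixed 5x5 window, and gradient-array layout are illustrative assumptions, not the actual implementation):

```cuda
// One thread per pixel: accumulate the structure tensor and mismatch
// vector over a 5x5 window, then solve the 2x2 system A [u v]^T = -b.
// Ix, Iy are spatial gradients, It the temporal difference (all w*h).
__global__ void lkDenseKernel(const float *Ix, const float *Iy, const float *It,
                              float *u, float *v, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sIxx = 0, sIyy = 0, sIxy = 0, sIxt = 0, sIyt = 0;
    for (int dy = -2; dy <= 2; dy++) {
        for (int dx = -2; dx <= 2; dx++) {
            int px = min(max(x + dx, 0), w - 1);   // clamp at the border
            int py = min(max(y + dy, 0), h - 1);
            int i  = py * w + px;
            sIxx += Ix[i] * Ix[i];
            sIyy += Iy[i] * Iy[i];
            sIxy += Ix[i] * Iy[i];
            sIxt += Ix[i] * It[i];
            sIyt += Iy[i] * It[i];
        }
    }
    // Solve [sIxx sIxy; sIxy sIyy] [u v]^T = -[sIxt sIyt]^T by Cramer's rule
    float det = sIxx * sIyy - sIxy * sIxy;
    if (fabsf(det) > 1e-6f) {
        int i = y * w + x;
        u[i] = (-sIyy * sIxt + sIxy * sIyt) / det;
        v[i] = ( sIxy * sIxt - sIxx * sIyt) / det;
    }
}
```

In practice this step is wrapped in an iterative, coarse-to-fine pyramid scheme rather than run once, but the per-pixel structure is the same.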
I get about a 30x speed-up over OpenCV's version. However, the results are slightly different: OpenCV tends to produce slightly more accurate results, which I have yet to replicate.
- Nvidia graphics card with CUDA installed
On Linux you can compile the code by running the compile script. Most likely you'll have to edit the script to match your system; by default it links against the 64-bit CUDA libraries. The program is run from the command line as follows:
./cudaLK img1.png img2.png
This will produce a cudaLK.png and an opencv.png file for comparison, with optical flow vectors drawn every 16 pixels. The input images can be in any of the popular image formats supported by the OpenCV library (e.g. JPEG, BMP, PNG).
This code is by no means production quality; it was written just so I could get familiar with programming in CUDA. It won't produce results of the same quality as OpenCV, but it should nonetheless serve as a rough guide for comparison.
Results on my laptop
The results were obtained on my Asus laptop with the following specs:
- Intel i7 Q720 @ 1.60GHz
- Nvidia GeForce GTS 360M with 1GB VRAM
- 6GB RAM
- Ubuntu 10.04 LTS 64bit version
- CUDA 3.0
The following parameters were used for the optical flow calculation:
| Parameter | Value |
|---|---|
| Termination condition | delta < 0.01 pixels |
| Dense optical flow (all pixels) | Yes |
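The termination condition means each pixel keeps refining its own displacement until the incremental update drops below 0.01 pixels (with an iteration cap as a safety net). Inside the per-pixel kernel, that loop looks roughly like this (the helper name and the cap of 20 iterations are assumptions for illustration):

```cuda
// Per-pixel iterative refinement: stop once the incremental update
// (du, dv) is smaller than 0.01 pixels, or after a fixed iteration cap.
const float kEps      = 0.01f;   // termination threshold in pixels
const int   kMaxIters = 20;      // safety cap (assumed value)

float u = 0.0f, v = 0.0f;
for (int it = 0; it < kMaxIters; it++) {
    float du, dv;
    solveLKStep(x, y, u, v, &du, &dv);   // hypothetical helper: one 2x2 solve
    u += du;
    v += dv;
    if (du * du + dv * dv < kEps * kEps) // i.e. delta < 0.01 pixels
        break;
}
```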
I used the following two images extracted from a Ladybug camera sample video from Point Grey (hope they don’t mind). You can download the originals by clicking on the thumbnails.
Below is a time breakdown of the stages involved. I used the gettimeofday() function to time the different stages as seen from the CPU, but included GPU times from the CUDA profiler for a more accurate breakdown. The CUDA profiler does account for CPU time, but only for function calls, not arbitrary sections of code.
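For reference, the gettimeofday() timing pattern looks like this. The synchronize call matters: kernel launches are asynchronous, so without it the second gettimeofday() fires before the GPU work actually finishes. (The wrapper function is illustrative; cudaThreadSynchronize() is the CUDA 3.x-era call, later renamed cudaDeviceSynchronize().)

```cuda
#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock timing of one stage as seen from the CPU, in milliseconds.
double timeStageMs(void (*stage)(void))
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    stage();                      // e.g. launch the greyscale kernel
    cudaThreadSynchronize();      // wait for the asynchronous GPU work
    gettimeofday(&t1, NULL);
    return (t1.tv_sec  - t0.tv_sec)  * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0;
}
```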
| Operation | CPU (ms) | GPU (ms) |
|---|---|---|
| Copying 2 images from CPU to GPU | 1 | |
| Converting RGB to greyscale | 1 | 0.947 |
| Generating the pyramids | 1 | 0.665 |
| Copying results from GPU to CPU | 7 | |
| Total time for cudaLK | 918 | |
| OpenCV's optical flow (8 threads) | 28194 | |
That works out to about 892,000 optical flow pixels per second using CUDA. Pretty good! Compared with OpenCV's highly optimised CPU implementation utilising all 4 cores (8 threads), the GPU version is about 30x faster.
And of course the actual results. Obviously there is some room for improvement …
As shown, the GPU is much faster than the equivalent CPU implementation, even when all cores are utilised. I made use of CUDA's texture memory, which is not only faster than global memory (because of caching) but also has hardware bilinear interpolation support. One thing I did not implement is explicit boundary checking when a patch falls partially off the image; I relied on the texture memory returning clamped values for off-texture pixels and hoped they didn't affect the overall tracking significantly.
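Both texture properties mentioned above are just configuration flags on the texture reference. A sketch of the setup, using the classic texture-reference API of the CUDA 3.x era (the reference and function names are mine, not from the actual code):

```cuda
// Texture setup matching the two properties relied on above:
// hardware bilinear interpolation, and clamped reads for fetches
// that fall off the image (no explicit boundary checks needed).
texture<float, 2, cudaReadModeElementType> texImg;

void bindImageTexture(cudaArray *imgArray)
{
    texImg.addressMode[0] = cudaAddressModeClamp;  // off-image fetches in x
    texImg.addressMode[1] = cudaAddressModeClamp;  //   and y return edge pixels
    texImg.filterMode     = cudaFilterModeLinear;  // free bilinear interpolation
    texImg.normalized     = false;                 // address in pixel coordinates
    cudaBindTextureToArray(texImg, imgArray);
}

// In a kernel, a sub-pixel fetch is then a single instruction:
//   float val = tex2D(texImg, x + u, y + v);
```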
Even with the extra work needed to bring the optical flow quality up to OpenCV's level, I would still expect the GPU version to run at least 15-20x faster. Maybe in the future when I get around to it.