Optical flow on CUDA

This is my own implementation of the Lucas Kanade optical flow algorithm in CUDA, based on the paper:

Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the algorithm, by Jean-Yves Bouguet.

I have always wanted to learn how to program using CUDA, so I decided to start by implementing the Lucas Kanade optical flow algorithm. My version calculates optical flow for every pixel (dense optical flow), as opposed to a sparse set of features. This makes coding much easier (no need to write a feature detector), and having a dense field is always nice.
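
To make the per-pixel calculation concrete, here is a minimal CPU sketch in C++ of a single Lucas Kanade step at one pixel, following the 2×2 linear system described in Bouguet's paper. It is not taken from cudaLK: the Image and Flow structs, the lkStep name, the border clamping and the zero initial guess are all assumptions made for illustration. A full tracker would repeat this step iteratively and across pyramid levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical row-major greyscale image (not a type from cudaLK).
struct Image {
    int width;
    int height;
    std::vector<float> data;

    // Read a pixel, clamping coordinates to the image border.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return data[y * width + x];
    }
};

struct Flow {
    float u;
    float v;
    bool valid;  // false when the 2x2 system is singular (e.g. a flat patch)
};

// One Lucas Kanade step at pixel (px, py), starting from a zero flow guess.
// Solves G * d = b over a (2*halfPatch+1)^2 patch, where G is built from the
// spatial gradients of frameA and b from the frame difference.
Flow lkStep(const Image& frameA, const Image& frameB, int px, int py, int halfPatch) {
    double gxx = 0.0, gxy = 0.0, gyy = 0.0, bx = 0.0, by = 0.0;
    for (int dy = -halfPatch; dy <= halfPatch; ++dy) {
        for (int dx = -halfPatch; dx <= halfPatch; ++dx) {
            const int x = px + dx;
            const int y = py + dy;
            // Central-difference spatial gradients of the first frame.
            const double ix = 0.5 * (frameA.at(x + 1, y) - frameA.at(x - 1, y));
            const double iy = 0.5 * (frameA.at(x, y + 1) - frameA.at(x, y - 1));
            // Temporal difference between the two frames.
            const double dt = frameA.at(x, y) - frameB.at(x, y);
            gxx += ix * ix;
            gxy += ix * iy;
            gyy += iy * iy;
            bx += dt * ix;
            by += dt * iy;
        }
    }
    const double det = gxx * gyy - gxy * gxy;
    if (std::fabs(det) < 1e-9) {
        return {0.0f, 0.0f, false};
    }
    // Closed-form inverse of the symmetric 2x2 matrix G.
    const double u = (gyy * bx - gxy * by) / det;
    const double v = (gxx * by - gxy * bx) / det;
    return {static_cast<float>(u), static_cast<float>(v), true};
}
```

For a 13×13 patch, halfPatch would be 6. Because the step linearises the image around the current guess, a single call only approximates the true displacement, which is why the algorithm iterates.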

I get about a 30x speed-up over OpenCV’s version. However, the results are slightly different: OpenCV tends to produce slightly more accurate results, which I have yet to replicate.



Requirements

  • OpenCV
  • Nvidia graphics card with CUDA installed
  • GCC

Download cudaLK.zip

On Linux you can compile the code by running the compile script. You will most likely have to edit the script to match your system; by default it links against the 64-bit CUDA libraries. The program is run from the command line as follows:

./cudaLK img1.png img2.png

This will produce a cudaLK.png and an opencv.png file for comparison, with optical flow drawn every 16 pixels. The input images can be in any of the popular image formats supported by the OpenCV library (e.g. JPEG, BMP, PNG).


This code is by no means production quality. It was written just so I could get familiar with programming in CUDA. It won’t produce results as high in quality as OpenCV’s, but it should nonetheless serve as a rough guide for comparison.

Results on my laptop

The results were obtained on my Asus laptop with the following specs:

  • Intel i7 Q720 @ 1.60GHz
  • Nvidia GeForce GTS 360M, 1GB VRAM
  • 6GB RAM
  • Ubuntu 10.04 LTS 64bit version
  • CUDA 3.0

The following parameters were used for the optical flow calculation:

  • Image size: 1280×640
  • Patch size: 13×13
  • Pyramid level: 3
  • Maximum iterations: 10
  • Termination condition: delta < 0.01 pixels
  • Dense optical flow (all pixels): yes

I used the following two images extracted from a Ladybug camera sample video from Point Grey (hope they don’t mind). You can download the originals by clicking on the thumbnails.

The table below shows a time breakdown of the stages involved. I used the gettimeofday() function to time the different stages as seen from the CPU, and included GPU times from the CUDA profiler for a more accurate breakdown. The CUDA profiler does account for CPU time, but only for function calls, not for sections of code.
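
For reference, CPU-side timing with gettimeofday() can be sketched as follows. The helper names wallClockMs and timeStageMs are made up for this example and are not from cudaLK. One caveat worth stating as an assumption: CUDA kernel launches return before the kernel finishes, so a stage that launches kernels would need to wait for the GPU (for example with cudaThreadSynchronize() in CUDA 3.0, or cudaDeviceSynchronize() in later versions) before the end time is read. The post does not say how the original code handled this.

```cpp
#include <sys/time.h>

#include <cassert>

// Current wall-clock time in milliseconds, as reported by gettimeofday().
double wallClockMs() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// Run one stage and return how long it took in milliseconds.
// `stage` is any callable; for GPU work it should block until the GPU is
// done, otherwise only the launch overhead is measured.
template <typename Stage>
double timeStageMs(Stage stage) {
    const double start = wallClockMs();
    stage();
    return wallClockMs() - start;
}
```

A call would look like `double ms = timeStageMs([&] { /* run one stage */ });`. Since gettimeofday() reports wall-clock time with microsecond resolution at best, stages that take around a millisecond will be measured only coarsely, which fits with the CPU column showing whole milliseconds.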

Operation | CPU (ms) | GPU (ms)
Copying 2 images from CPU to GPU | 1 | -
Converting RGB to greyscale | 1 | 0.947
Generating the pyramids | 1 | 0.665
Optical flow | 907 | 904.682
Copying results from GPU to CPU | 7 | -
Total time for cudaLK | 918 | -
OpenCV’s optical flow (8 threads) | 28194 | -

That works out to about 892,000 optical flow pixels per second using CUDA. Pretty good! Compared with OpenCV’s highly optimised CPU implementation using all 4 cores (8 threads), the GPU version is about 30x faster.
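
The two headline figures follow directly from the image size and the measured totals. The small C++ sketch below reproduces the arithmetic; the function names are only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Throughput in pixels per second for one dense flow field.
double pixelsPerSecond(int width, int height, double elapsedMs) {
    return static_cast<double>(width) * height / (elapsedMs / 1000.0);
}

// How many times faster the accelerated run is than the baseline.
double speedup(double baselineMs, double acceleratedMs) {
    return baselineMs / acceleratedMs;
}
```

With the numbers from the post, `pixelsPerSecond(1280, 640, 918.0)` is 819,200 pixels divided by 0.918 s, or roughly 892,000 pixels per second, and `speedup(28194.0, 918.0)` is roughly 30.7.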

And of course the actual results. Obviously there is some room for improvement …

Results from CudaLK
Results from OpenCV

Some thoughts

As shown, the GPU implementation is much faster than the equivalent CPU implementation, even when all cores are utilised. I made use of CUDA’s texture memory, which is not only faster than global memory thanks to caching, but also has hardware bilinear interpolation support. One thing I did not implement is explicit boundary checking for when the patch is partially off the image. I relied on the texture memory returning a clamped value for pixels off the texture and hoped this didn’t affect the overall tracking significantly.
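
To show what the texture unit does here, this is a small CPU emulation in C++ of a clamped, bilinearly interpolated image read. In CUDA terms this behaviour most likely corresponds to a texture configured with cudaAddressModeClamp and cudaFilterModeLinear, although the actual cudaLK settings are not shown in this post. The function names are hypothetical, and the sketch treats integer coordinates as pixel centres. The hardware uses a half-texel offset and reduced-precision interpolation weights, so results will not match it bit for bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Read one pixel from a row-major image, clamping coordinates to the border.
// This mimics a clamped texture: reads off the image return the edge value.
float texelClamped(const std::vector<float>& img, int w, int h, int x, int y) {
    x = std::min(std::max(x, 0), w - 1);
    y = std::min(std::max(y, 0), h - 1);
    return img[y * w + x];
}

// Bilinear sample at a fractional position (x, y).
// Assumption: integer coordinates land exactly on pixel centres.
float sampleBilinearClamped(const std::vector<float>& img, int w, int h,
                            float x, float y) {
    const float fx = std::floor(x);
    const float fy = std::floor(y);
    const int x0 = static_cast<int>(fx);
    const int y0 = static_cast<int>(fy);
    const float ax = x - fx;  // horizontal blend weight in [0, 1)
    const float ay = y - fy;  // vertical blend weight in [0, 1)

    const float top = (1.0f - ax) * texelClamped(img, w, h, x0, y0) +
                      ax * texelClamped(img, w, h, x0 + 1, y0);
    const float bottom = (1.0f - ax) * texelClamped(img, w, h, x0, y0 + 1) +
                         ax * texelClamped(img, w, h, x0 + 1, y0 + 1);
    return (1.0f - ay) * top + ay * bottom;
}
```

With clamping, a patch that hangs over the image edge simply sees the border pixels repeated. That removes the image gradient in the off-image part of the patch, which is one plausible reason tracking near the borders could be less reliable than in the interior.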

With some extra work to get the optical flow quality up to OpenCV’s level, I would still expect the GPU version to run at least 15-20x faster. Maybe in the future when I get around to it.

6 thoughts on “Optical flow on CUDA”

  1. Hm.. I used OpenCV’s Farneback dense optical flow recently (no GPU) and was getting over 7 fps at 640×480 inside a virtual machine on my laptop (i7 @ 2.8 GHz). That’s over 2 million optical flow pixels per second. I wonder what the discrepancy is — different settings or something? I did it using 320×240 and got about 30 fps, which is pretty consistent. I wonder what I’m missing. The patch size and pyramid level look pretty close to what I was using.


      1. I didn’t do an in-depth analysis of the accuracy but the flow maps looked good to my eye and the couple of points I did check seemed pretty good.

  2. A critical error in this performance analysis is that when you’re crunching numbers very hard, as optical flow and video encoding certainly do, hyper-threading will kill your performance by at least 25%.

    The tester needs to disable hyper-threading in the BIOS so that the 4 real cores have 4 threads in total between them. The GPU will still win, but the difference won’t be 30x.

    Hyper-threading allows the system to respond faster to user input when the system is burdened, and to share CPU instructions of that core that aren’t used at the same time.

    When you split optical flow between 8 threads, all 8 threads are going to be using similar special CPU instructions, so 4 of those threads are going to be competing with and blocking on the other 4, repeatedly, until the task is done. 4 cores cannot parallelise optical flow across 8 threads. They’ll try, but you’re just slowing things down.

    Max Cannaday
    {Threading Meister}
