This is a quick revisit of my recent post comparing three different libraries with matrix support. As suggested in one of the comments on the last post, I’ve turned off any debugging options each library may have. In practice you would leave them on most of the time for safety reasons, but for this test I thought it would be interesting to see how things perform with them turned off.

Armadillo and Eigen use the defines ARMA_NO_DEBUG and NDEBUG respectively to turn off error checking. I could not find an immediate way to do the same in OpenCV short of editing the source code, which I chose not to do, so keep that in mind. I also adjusted the number of iterations for each of the 5 operations performed to get slightly more accurate figures. Fast operations like add, multiply, transpose and invert get more iterations for a better average, compared to SVD, which is quite slow.

On with the results …

# Add

Performing C = A + B

## Raw data

| Results in ms | OpenCV | Armadillo | Eigen |
|---------------|--------|-----------|-------|
| 4×4 | 0.00093 | 0.00008 | 0.00007 |
| 8×8 | 0.00039 | 0.00006 | 0.00015 |
| 16×16 | 0.00066 | 0.00030 | 0.00059 |
| 32×32 | 0.00139 | 0.00148 | 0.00194 |
| 64×64 | 0.00654 | 0.00619 | 0.00712 |
| 128×128 | 0.02454 | 0.02738 | 0.03225 |
| 256×256 | 0.09144 | 0.11315 | 0.10920 |
| 512×512 | 0.47997 | 0.57668 | 0.47382 |

## Normalised

| Speed up over slowest | OpenCV | Armadillo | Eigen |
|-----------------------|--------|-----------|-------|
| 4×4 | 1.00x | 12.12x | 14.35x |
| 8×8 | 1.00x | 6.53x | 2.63x |
| 16×16 | 1.00x | 2.19x | 1.13x |
| 32×32 | 1.39x | 1.31x | 1.00x |
| 64×64 | 1.09x | 1.15x | 1.00x |
| 128×128 | 1.31x | 1.18x | 1.00x |
| 256×256 | 1.24x | 1.00x | 1.04x |
| 512×512 | 1.20x | 1.00x | 1.22x |

# Multiply

Performing C = A * B

## Raw data

| Results in ms | OpenCV | Armadillo | Eigen |
|---------------|--------|-----------|-------|
| 4×4 | 0.00115 | 0.00017 | 0.00086 |
| 8×8 | 0.00195 | 0.00078 | 0.00261 |
| 16×16 | 0.00321 | 0.00261 | 0.00678 |
| 32×32 | 0.01865 | 0.01947 | 0.02130 |
| 64×64 | 0.15366 | 0.33080 | 0.07835 |
| 128×128 | 1.87008 | 1.72719 | 0.35859 |
| 256×256 | 15.76724 | 3.70212 | 2.70168 |
| 512×512 | 119.09382 | 24.08409 | 22.73524 |

## Normalised

| Speed up over slowest | OpenCV | Armadillo | Eigen |
|-----------------------|--------|-----------|-------|
| 4×4 | 1.00x | 6.74x | 1.34x |
| 8×8 | 1.34x | 3.34x | 1.00x |
| 16×16 | 2.11x | 2.60x | 1.00x |
| 32×32 | 1.14x | 1.09x | 1.00x |
| 64×64 | 2.15x | 1.00x | 4.22x |
| 128×128 | 1.00x | 1.08x | 5.22x |
| 256×256 | 1.00x | 4.26x | 5.84x |
| 512×512 | 1.00x | 4.94x | 5.24x |

# Transpose

Performing C = A^T

## Raw data

| Results in ms | OpenCV | Armadillo | Eigen |
|---------------|--------|-----------|-------|
| 4×4 | 0.00067 | 0.00004 | 0.00003 |
| 8×8 | 0.00029 | 0.00006 | 0.00008 |
| 16×16 | 0.00034 | 0.00028 | 0.00028 |
| 32×32 | 0.00071 | 0.00068 | 0.00110 |
| 64×64 | 0.00437 | 0.00592 | 0.00500 |
| 128×128 | 0.01552 | 0.06537 | 0.03486 |
| 256×256 | 0.08828 | 0.40813 | 0.20032 |
| 512×512 | 0.52455 | 1.51452 | 0.77584 |

## Normalised

| Speed up over slowest | OpenCV | Armadillo | Eigen |
|-----------------------|--------|-----------|-------|
| 4×4 | 1.00x | 17.61x | 26.76x |
| 8×8 | 1.00x | 4.85x | 3.49x |
| 16×16 | 1.00x | 1.20x | 1.21x |
| 32×32 | 1.56x | 1.61x | 1.00x |
| 64×64 | 1.35x | 1.00x | 1.18x |
| 128×128 | 4.21x | 1.00x | 1.88x |
| 256×256 | 4.62x | 1.00x | 2.04x |
| 512×512 | 2.89x | 1.00x | 1.95x |

# Inversion

Performing C = A^-1

## Raw data

| Results in ms | OpenCV | Armadillo | Eigen |
|---------------|--------|-----------|-------|
| 4×4 | 0.00205 | 0.00046 | 0.00271 |
| 8×8 | 0.00220 | 0.00417 | 0.00274 |
| 16×16 | 0.00989 | 0.01255 | 0.01094 |
| 32×32 | 0.06101 | 0.05146 | 0.05023 |
| 64×64 | 0.41286 | 0.25769 | 0.27921 |
| 128×128 | 3.60347 | 3.76052 | 1.88089 |
| 256×256 | 33.72502 | 23.10218 | 11.62692 |
| 512×512 | 285.03784 | 126.70175 | 162.74253 |

## Normalised

| Speed up over slowest | OpenCV | Armadillo | Eigen |
|-----------------------|--------|-----------|-------|
| 4×4 | 1.32x | 5.85x | 1.00x |
| 8×8 | 1.90x | 1.00x | 1.52x |
| 16×16 | 1.27x | 1.00x | 1.15x |
| 32×32 | 1.00x | 1.19x | 1.21x |
| 64×64 | 1.00x | 1.60x | 1.48x |
| 128×128 | 1.04x | 1.00x | 2.00x |
| 256×256 | 1.00x | 1.46x | 2.90x |
| 512×512 | 1.00x | 2.25x | 1.75x |

# SVD

Performing full SVD, [U,S,V] = SVD(A)

## Raw data

| Results in ms | OpenCV | Armadillo | Eigen |
|---------------|--------|-----------|-------|
| 4×4 | 0.01220 | 0.22080 | 0.01620 |
| 8×8 | 0.01760 | 0.05760 | 0.03340 |
| 16×16 | 0.10700 | 0.16560 | 0.25540 |
| 32×32 | 0.51480 | 0.70230 | 1.13900 |
| 64×64 | 3.63780 | 3.43520 | 6.63350 |
| 128×128 | 27.04300 | 23.01600 | 64.27500 |
| 256×256 | 240.11000 | 210.70600 | 675.84100 |
| 512×512 | 1727.44000 | 1586.66400 | 6934.32300 |

## Normalised

| Speed up over slowest | OpenCV | Armadillo | Eigen |
|-----------------------|--------|-----------|-------|
| 4×4 | 18.10x | 1.00x | 13.63x |
| 8×8 | 3.27x | 1.00x | 1.72x |
| 16×16 | 2.39x | 1.54x | 1.00x |
| 32×32 | 2.21x | 1.62x | 1.00x |
| 64×64 | 1.82x | 1.93x | 1.00x |
| 128×128 | 2.38x | 2.79x | 1.00x |
| 256×256 | 2.81x | 3.21x | 1.00x |
| 512×512 | 4.01x | 4.37x | 1.00x |

# Discussion

Overall, the average running time has decreased for all operations, which is a good sign. Even OpenCV improved, perhaps because NDEBUG has an effect, since it’s a standardised define.

The results from the addition test show all 3 libraries giving more or less the same result. This is probably not a surprise, since adding matrices is a straightforward element-wise operation, linear in the number of elements.

The multiply test is a bit more interesting. For matrices 64×64 or larger, there is a noticeable gap between the libraries. Eigen is very fast, with Armadillo coming in second for matrices 256×256 or greater. I’m guessing Eigen and Armadillo leverage the extra CPU cores for larger matrices, because I did see all the CPU cores utilised briefly during benchmarking.

The transpose test involves shuffling memory around, so it is affected by the CPU’s caching behaviour. OpenCV does a good job as the matrix size increases.

The inversion test is a bit of a mixed bag. OpenCV seems to be the slowest of the three for most sizes.

The SVD test is interesting. OpenCV and Armadillo are consistently faster here, and Eigen lags behind by quite a bit as the matrix size increases.

# Conclusion

In practice, if you just want a matrix library and nothing more, then Armadillo or Eigen is probably the way to go. If you want something that is very portable with minimal effort, then choose Eigen, because the entire library is header-only, so no library linking is required. If you want the fastest matrix code possible, then you can be adventurous and try combining the best of each library.

# Download

Code compiled with:

```
g++ test_matrix_lib.cpp -o test_matrix_lib -lopencv_core -larmadillo -lgomp -fopenmp \
    -march=native -O3 -DARMA_NO_DEBUG -DNDEBUG
```

Good info. Thanks!

Very useful post. Besides the basic operations, I’d like to see some benchmarks of implementations of classic algorithms (for example k-means), since very long and complex expressions might slow things down: many temporary objects can be created if the library is not optimised for such scenarios.

OpenCV isn’t a math library. They’re shipping a copy of Eigen under the hood. With your numbers so small, you’re most likely measuring the overhead of your system rather than the performance of these libraries. Put those matrix operators in a for loop 10,000 times and you’ll get better numbers. However, you’ll also need to add a random number generator so the compiler doesn’t reduce your matrix operations to constant math.

It’s true OpenCV isn’t a maths library, but since I do a lot of computer vision it was of interest to me. The basic functions I tested are the ones I most commonly use. I used 1000 iterations for each function, and a variable number of iterations for SVD depending on the matrix size, because it was so slow for larger matrices that I didn’t need that many loops. 1000 iterations felt adequate at the time based on the raw time of the loop, but maybe I need to reconsider, especially for the super fast operations like add/subtract/transpose.

I found that gcc didn’t optimise away the results in the loop provided I made use of the variable later, which I do when printing out the results. Without the printf the loop isn’t even compiled in! I verified this using the assembly output.

For small matrices, Eigen supports fixed-size matrices; these would likely provide much better performance.

Also, Eigen uses a slower algorithm when variables may be aliased, e.g. A = A * B; this matters for multiplication. To indicate that this isn’t the case, you should use C.noalias() = A * B; instead.

Finally, the benchmark is a little tricky in that you’re trying to benchmark “basic” operations, but in actual fact more complex expressions aren’t necessarily executed by simply chaining the basic operations; instead they are (sometimes) evaluated directly. All this means you should probably benchmark “realistic” expressions to get the most out of Eigen and Armadillo (both of which use a delayed-evaluation approach).

It depends on your problem, but I’ve found that fixed-size matrices, noalias and smart expression evaluation can easily improve performance by an order of magnitude in many common cases. The simple approach is easy, but you might want to mention such a large caveat.

A few minor details: the version of g++ can make a significant difference, as can 32-bit vs 64-bit: which versions are you using? Finally, -ffast-math can sometimes make an impact and it’s often safe, so that’s potentially interesting to look at.

Thanks for the info. I wasn’t aware of the aliasing issue. A proper test incorporating all the operations might be a good idea. Though I have yet to come up with a good one. Suggestions are welcome 🙂

I’m using 64-bit gcc 4.6.1, but without -ffast-math, fearing it might have some impact on numerical accuracy.

Hello man,

I have just done some evaluations of matrix multiplication with OpenCV, Eigen, Armadillo and OpenBLAS on Windows, with some interesting results. You can see it here: http://4fire.wordpress.com/2012/04/29/matrices-multiplication-on-windows-matlab-is-the-champion-again

Hi,

Those are some nice results. I wonder if OpenCV is slower because of checking overheads; it tends to do a lot of error checking.

It would be good to add an extra column for “cores”. You mentioned it in the paragraph below, but it would be nice to see it in the same table.

To preserve our sanity: basic cases are not a proper basis for conclusions about real performance, since the chaining and lazy-evaluation approach of Eigen and Armadillo will change the result.

That is true. I have on my TODO list some benchmarking involving all the operations mentioned. Whether I get around to it is a different story …

Good info.

Here’s Eigen’s take on comparing to Armadillo:

http://eigen.tuxfamily.org/index.php?title=Benchmark

I believe Armadillo advises users to plug LAPACK/BLAS into it for performance.

Hey mate!

Thanks for the info. Quite interesting.

Seems like anything matrix related I’ve come across comes crawling back to LAPACK/BLAS one way or another hehe.