In this post I’ll be comparing 3 popular C++ matrix libraries found on Linux.

OpenCV is a large computer vision library with matrix support. Armadillo wraps around LAPACK. Eigen is an interesting library: the whole implementation lives in C++ headers, much like Boost. That makes it simple to link against, but it takes more time to compile.

The 5 matrix operations I’ll be focusing on are: **add, multiply, transpose, inversion, SVD**. These are the most common functions I use. All the libraries are open source and run on a variety of platforms but I’ll just be comparing them on Ubuntu Linux.

Each of the 5 operations was tested on randomly generated matrices of different sizes NxN, with the average running time recorded.
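To make the measurement idea concrete, here is a minimal sketch of an averaging timer, assuming a plain flat-vector matrix and `std::chrono`. It is not the code from test_matrix_lib.cpp; the names `Matrix`, `random_matrix`, `average_ms` and `add` are made up for illustration, and the hand-written `add` only stands in for a library call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Dense row-major N x N matrix stored in a flat vector (illustrative type).
using Matrix = std::vector<double>;

// Fill an N x N matrix with uniform random values in [0, 1).
Matrix random_matrix(std::size_t n, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    Matrix m(n * n);
    for (double& v : m) v = dist(rng);
    return m;
}

// Run `op` `iterations` times and return the mean wall-clock time in ms.
template <typename Op>
double average_ms(Op op, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) op();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Stand-in operation: C = A + B, element by element.
void add(const Matrix& a, const Matrix& b, Matrix& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
}
```

A call such as `average_ms([&] { add(a, b, c); }, 1000)` would then give one cell of a raw-data table. For the very small sizes a large iteration count matters, since a single run is far below the timer resolution.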

I was tossing up whether to use a bar chart to display the results, but they span a very large interval. A log graph would show all the data easily but make numerical comparisons harder. So in the end I opted to show the raw data plus a normalised version to compare relative speed-ups. Values highlighted in red indicate the best results.
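As a rough sketch of how a normalised row can be derived from a raw row, assuming "speed up over slowest" means the slowest time in the row divided by each library's time (the helper name `speedup_over_slowest` is made up here):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Given one row of raw timings (ms), return each entry's speed-up
// relative to the slowest entry in that row: slowest_time / time.
std::vector<double> speedup_over_slowest(const std::vector<double>& times_ms) {
    assert(!times_ms.empty());
    const double slowest = *std::max_element(times_ms.begin(), times_ms.end());
    std::vector<double> result;
    result.reserve(times_ms.size());
    for (double t : times_ms) {
        assert(t > 0.0);
        result.push_back(slowest / t);  // the slowest entry maps to exactly 1.0
    }
    return result;
}
```

Feeding it the 512×512 add timings (1.047, 1.176, 1.339) gives roughly 1.28, 1.14 and 1.00, which matches that row of the normalised table. For the smallest sizes the published ratios can differ a little from what the rounded raw values give, presumably because the raw numbers are shown to only five decimal places.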

# Add

Performing C = A + B

## Raw data

| Size (time in ms) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 0.00098 | 0.00003 | 0.00002 |
| 8×8 | 0.00034 | 0.00006 | 0.00017 |
| 16×16 | 0.00048 | 0.00029 | 0.00077 |
| 32×32 | 0.00142 | 0.00208 | 0.00185 |
| 64×64 | 0.00667 | 0.00647 | 0.00688 |
| 128×128 | 0.02190 | 0.02776 | 0.03318 |
| 256×256 | 0.23900 | 0.27900 | 0.30400 |
| 512×512 | 1.04700 | 1.17600 | 1.33900 |

## Normalised

| Size (speed-up over slowest) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 1.00x | 30.53x | 44.41x |
| 8×8 | 1.00x | 5.56x | 2.02x |
| 16×16 | 1.62x | 2.66x | 1.00x |
| 32×32 | 1.46x | 1.00x | 1.12x |
| 64×64 | 1.03x | 1.06x | 1.00x |
| 128×128 | 1.52x | 1.20x | 1.00x |
| 256×256 | 1.27x | 1.09x | 1.00x |
| 512×512 | 1.28x | 1.14x | 1.00x |

The average running times for all 3 libraries are very similar, so I would say there is no clear winner here. In the 4×4 case, where OpenCV is much slower, it might be due to overhead in error checking.

# Multiply

Performing C = A * B

## Raw data

| Size (time in ms) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 0.00104 | 0.00007 | 0.00030 |
| 8×8 | 0.00070 | 0.00080 | 0.00268 |
| 16×16 | 0.00402 | 0.00271 | 0.00772 |
| 32×32 | 0.02059 | 0.02104 | 0.02527 |
| 64×64 | 0.14835 | 0.18493 | 0.06987 |
| 128×128 | 1.83967 | 1.10590 | 0.60047 |
| 256×256 | 15.54500 | 9.18000 | 2.65200 |
| 512×512 | 133.32800 | 35.43100 | 21.53300 |

## Normalised

| Size (speed-up over slowest) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 1.00x | 16.03x | 3.52x |
| 8×8 | 3.84x | 3.35x | 1.00x |
| 16×16 | 1.92x | 2.84x | 1.00x |
| 32×32 | 1.23x | 1.20x | 1.00x |
| 64×64 | 1.25x | 1.00x | 2.65x |
| 128×128 | 1.00x | 1.66x | 3.06x |
| 256×256 | 1.00x | 1.69x | 5.86x |
| 512×512 | 1.00x | 3.76x | 6.19x |

Average running times for all 3 are similar up to 32×32; from 64×64 onwards Eigen comes out as the clear winner.

# Transpose

Performing C = A^T.

## Raw data

| Size (time in ms) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 0.00029 | 0.00002 | 0.00002 |
| 8×8 | 0.00024 | 0.00007 | 0.00009 |
| 16×16 | 0.00034 | 0.00019 | 0.00028 |
| 32×32 | 0.00071 | 0.00088 | 0.00111 |
| 64×64 | 0.00458 | 0.00591 | 0.00573 |
| 128×128 | 0.01636 | 0.13390 | 0.04576 |
| 256×256 | 0.12200 | 0.77400 | 0.32400 |
| 512×512 | 0.68700 | 3.44700 | 1.17600 |

## Normalised

| Size (speed-up over slowest) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 1.00x | 17.00x | 12.57x |
| 8×8 | 1.00x | 3.45x | 2.82x |
| 16×16 | 1.00x | 1.81x | 1.20x |
| 32×32 | 1.56x | 1.26x | 1.00x |
| 64×64 | 1.29x | 1.00x | 1.03x |
| 128×128 | 8.18x | 1.00x | 2.93x |
| 256×256 | 6.34x | 1.00x | 2.39x |
| 512×512 | 5.02x | 1.00x | 2.93x |

Comparable running time up to 64×64, after which OpenCV is the winner by quite a bit. Some clever memory manipulation?
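I don’t know what OpenCV actually does internally, so the following is only an illustration of the kind of memory trick that can help a transpose at larger sizes: cache blocking (tiling). It is a sketch under that assumption, not OpenCV’s implementation; `transpose_blocked` and the default tile size of 32 are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose a row-major n x n matrix tile by tile, so that the reads and
// writes of each tile stay within a small region of memory.
std::vector<double> transpose_blocked(const std::vector<double>& a,
                                      std::size_t n,
                                      std::size_t block = 32) {
    assert(a.size() == n * n);
    assert(block > 0);
    std::vector<double> out(n * n);
    for (std::size_t ib = 0; ib < n; ib += block) {
        for (std::size_t jb = 0; jb < n; jb += block) {
            // Clamp the tile at the matrix edge when n is not a multiple of block.
            const std::size_t i_end = std::min(ib + block, n);
            const std::size_t j_end = std::min(jb + block, n);
            for (std::size_t i = ib; i < i_end; ++i)
                for (std::size_t j = jb; j < j_end; ++j)
                    out[j * n + i] = a[i * n + j];
        }
    }
    return out;
}
```

A plain double loop writes the output with a stride of n elements, which gets less cache friendly as n grows. Tiling keeps both the source and destination accesses local, and the best tile size would have to be found by measurement on the target machine.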

# Inversion

Performing C = A^-1

## Raw data

| Size (time in ms) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 0.00189 | 0.00018 | 0.00090 |
| 8×8 | 0.00198 | 0.00414 | 0.00271 |
| 16×16 | 0.01118 | 0.01315 | 0.01149 |
| 32×32 | 0.06602 | 0.05445 | 0.05464 |
| 64×64 | 0.42008 | 0.32378 | 0.30324 |
| 128×128 | 3.67776 | 4.52664 | 2.35105 |
| 256×256 | 35.45200 | 16.41900 | 17.12700 |
| 512×512 | 302.33500 | 122.48600 | 97.62200 |

## Normalised

| Size (speed-up over slowest) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 1.00x | 10.22x | 2.09x |
| 8×8 | 2.09x | 1.00x | 1.53x |
| 16×16 | 1.18x | 1.00x | 1.15x |
| 32×32 | 1.00x | 1.21x | 1.21x |
| 64×64 | 1.00x | 1.30x | 1.39x |
| 128×128 | 1.23x | 1.00x | 1.93x |
| 256×256 | 1.00x | 2.16x | 2.07x |
| 512×512 | 1.00x | 2.47x | 3.10x |

Some mixed results up until 128×128, where Eigen appears to be the better choice.

# SVD

Performing [U,S,V] = SVD(A)

## Raw data

| Size (time in ms) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 0.00815 | 0.01752 | 0.00544 |
| 8×8 | 0.01498 | 0.05514 | 0.03522 |
| 16×16 | 0.08335 | 0.17098 | 0.21254 |
| 32×32 | 0.53363 | 0.73960 | 1.21068 |
| 64×64 | 3.51651 | 3.37326 | 6.89069 |
| 128×128 | 25.86869 | 24.34282 | 71.48941 |
| 256×256 | 293.54300 | 226.95800 | 722.12400 |
| 512×512 | 1823.72100 | 1595.14500 | 7747.46800 |

## Normalised

| Size (speed-up over slowest) | OpenCV | Armadillo | Eigen |
|---|---|---|---|
| 4×4 | 2.15x | 1.00x | 3.22x |
| 8×8 | 3.68x | 1.00x | 1.57x |
| 16×16 | 2.55x | 1.24x | 1.00x |
| 32×32 | 2.27x | 1.64x | 1.00x |
| 64×64 | 1.96x | 2.04x | 1.00x |
| 128×128 | 2.76x | 2.94x | 1.00x |
| 256×256 | 2.46x | 3.18x | 1.00x |
| 512×512 | 4.25x | 4.86x | 1.00x |

Looks like OpenCV and Armadillo are the winners, depending on the size of the matrix.

# Discussion

With mixed results left, right and centre it is hard to come to any definite conclusion. The benchmark itself is very simple. I only focused on square matrices with power-of-two sizes, and I compared execution speed only, not accuracy, which is important for SVD.

What’s interesting from the benchmark is the clear difference in speed for some of the operations depending on the matrix size. Since the margins can be large, this can have a noticeable impact on your application’s running time. It would be pretty cool if there was a matrix library that could switch between different algorithms depending on the size/operation requested, fine-tuned to the machine it is running on. Sort of like what ATLAS/BLAS does.
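To illustrate the switching idea, here is a toy sketch of a multiply that picks one of two loop orders based on matrix size. Everything in it is hypothetical: the function names, the two hand-written kernels and the default threshold of 64 are placeholders, and a real library would tune the cut-over point per machine by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major n x n, illustrative type

// Straightforward i-j-k triple loop with a scalar accumulator.
Matrix multiply_ijk(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    return c;
}

// i-k-j loop order: the inner loop walks rows of b and c contiguously.
Matrix multiply_ikj(const Matrix& a, const Matrix& b, std::size_t n) {
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}

// Hypothetical dispatcher: choose a kernel from the matrix size.
// `threshold` is a made-up tuning knob, not a measured value.
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n,
                std::size_t threshold = 64) {
    assert(a.size() == n * n && b.size() == n * n);
    return (n < threshold) ? multiply_ijk(a, b, n) : multiply_ikj(a, b, n);
}
```

Both kernels compute the same product, so the caller never sees which one ran. Which kernel is actually faster at a given size is something to measure rather than assume, which is exactly what the tables above suggest.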

So which library is faster? I have no idea, try them all for your application and see 🙂

# Download

Here is the code used to generate the benchmark: test_matrix_lib.cpp

Compiled with:

g++ test_matrix_lib.cpp -o test_matrix_lib -lopencv_core -larmadillo -lgomp -fopenmp -march=native -O3

You’ll get more speed out of Armadillo if you define ARMA_NO_DEBUG before including the Armadillo header (or -DARMA_NO_DEBUG on the command line). By default Armadillo has debugging enabled to aid correct algorithm development.

Also, Armadillo uses BLAS and LAPACK for many routines, meaning you can use high-speed LAPACK replacement libraries like Intel MKL or AMD ACML. Instead of linking with the runtime component via -larmadillo (which is an alias to whatever BLAS and LAPACK libraries are installed), you can link directly with MKL and Armadillo will make use of it.

Thanks for the tip! I’ll have to do a second run with debugging turned off for all libraries, if possible. Something I overlooked. It’s useful to know Armadillo can support other linear algebra backends, but in the spirit of open source I’ll stick with BLAS/LAPACK for now 🙂

@nghiaho12

Did you install BLAS and LAPACK for Armadillo?

I have some problems using Armadillo with LAPACK and BLAS.

Can you post an article on how to make them work together?

Initially I installed BLAS, LAPACK, and ATLAS but the matrix multiplication didn’t work; it gave me these errors:

/usr/local/include/armadillo_bits/blas_wrapper.hpp|79|undefined reference to `wrapper_dgemv_'|

/usr/local/include/armadillo_bits/blas_wrapper.hpp|114|undefined reference to `wrapper_dgemm_'|

And now after I installed ACML I have the same problem.

This is my C++ code:

    #include <iostream>
    #include "armadillo"

    using namespace arma;
    using namespace std;

    int main()
    {
        // A is 3x2 and B is 2x3, so the product C is 3x3
        mat A = "1 2;3 4;5 6";
        mat B = "2 4 8;2 4 8";
        mat C = A * B;
        cout << C << "\n";
        return 0;
    }

Did you face the same problem? Can you give me some help with this?

I used the default BLAS/LAPACK/ATLAS packages that came with Ubuntu 11.x. I think the trick is to make sure you have up-to-date packages, relative to the Armadillo version you’re installing.

I had the same problem with wrapper_dgemm_ not defined. It went away when I added -larmadillo to the command line.

No way!

The original BLAS is very slow; try GotoBLAS2 http://www.tacc.utexas.edu/tacc-projects/gotoblas2 or at least ACML.

Eigen has its own vectorization system, so if you are on an x86 system you will want to enable at least SSE2 by using -msse2

I compiled with gcc using -march=native, which according to the GCC documentation does:

-mtune=native and -march=native will produce code optimized for the host architecture as detected using the cpuid instruction.

So it should have all the SSE supported by my Intel i7 CPU.

Can you link against GPU matrix libraries such as BLAS when using armadillo?

I meant *MAGMA

I’m not sure on this one. If they implement the BLAS routines faithfully and completely then it might be possible.