OpenCV vs. Armadillo vs. Eigen vs. more! Round 3: pseudoinverse test

November 25, 2012 nghiaho12 23 Comments

Last Updated on December 7, 2012 by nghiaho12

Okay, the title of this post is getting longer and sillier, but this is the 3rd continuation of my last two post on comparing different libraries for everyday matrix operations. The last two posts compared basic operations such as multiplication, transposition, inversion etc. etc. in isolation, which is probably not a good reflection of real life usage. So I decided to come up with a new test that would combine different matrix operations together. I chose the pseudoinverse because it is something I use every now and then and it combines multiplication, transposition and inversion, which seems like a good test.

For benchmarking I’m going to be solving the following over determined linear system:

$AX = B$

and solve for X using

$X = \left(A^TA\right)^{-1}A^{T}B$

A is a NxM matrix, where N is much larger than M. I’ll be using N=1,000,000 data points and M (dimensions of the data) varying from 2 to 16.

B is a Nx1 matrix.

The matrix values will be randomly generated from 0 to 1 with uniform noise of [-1,1] added to B. They values are kept to a small range to avoid any significant numerical problems that can come about doing the pseudoinverse this way, not that I care too much for this benchmark. Each test is performed for 10 iterations, but not averaged out since I’m not interested in absolute time but relative to the other libraries.

Just to make the benchmark more interesting I’ve added GSL and OpenBLAS to the test, since they were just an apt-get away on Ubuntu.

Results

The following libraries were used

OpenCV 2.4.3 (compiled from source)
Eigen 3.1.2 (C++ headers from website)
Armadillo 3.4.4 (compiled from source)
GSL 1.15 (Ubuntu 12.10 package)
OpenBLAS 1.13 (Ubuntu 12.10 package)
Atlas 3.8.4 (Ubuntu 12.10 package)

My laptop has an Intel i7 1.60GHz with 6GB of RAM.

All values reported are in milliseconds. Each psuedoinverse test is performed 10 times but NOT averaged out. Lower is better. Just as a reminder each test is dealing with 1,000,000 data points of varying dimensions.

	2	3	4	5	6	7	8	9
OpenCV	169.619	321.204	376.3	610.043	873.379	1185.82	1194.12	1569.16
Eigen	152.159	258.069	253.844	371.627	423.474	577.065	555.305	744.016
Armadillo + Atlas	162.332	184.834	273.822	396.629	528.831	706.238	848.51	1088.47
Armadillo + OpenBLAS	79.803	118.718	147.714	298.839	372.235	484.864	411.337	507.84
GSL	507.052	787.429	1102.07	1476.67	1866.33	2321.66	2831.36	3237.67

	10	11	12	13	14	15	16
OpenCV	1965.95	2539.57	2495.63	2909.9	3518.22	4023.67	4064.92
Eigen	814.683	1035.96	993.226	1254.8	1362.02	1632.31	1615.69
Armadillo + Atlas	1297.01	1519.04	1792.74	2064.77	1438.16	1720.64	1906.79
Armadillo + OpenBLAS	534.947	581.294	639.175	772.382	824.971	825.79	893.771
GSL	3778.44	4427.47	4917.54	6037.29	6303.08	7187.5	7280.27

Ranking from best to worse

Armadillo + OpenBLAS
Eigen
Armadillo + Atlas (no multi-core support out of the box???)
OpenCV
GSL

All I can say is, holly smokes Batman! Armadillo + OpenBLAS wins out for every single dimension! Last is GSL, okay no surprise there for me. It never boasted being the fastest car on the track.

The cool thing about Armadillo is switching the BLAS engine only requires a different library to be linked, no recompilation of Armadillo. What is surprising is the Atlas library doesn’t seem to support multi-core by default. I’m probably not doing it right. Maybe I’m missing an environmental variable setting?

OpenBLAS is based on GotoBLAS and is actually a ‘made in China’ product, except this time I don’t get to make any jokes about the quality. It is fast because it takes advantage of multi-core CPU, while the others appear to only use 1 CPU core.

I’m rather sad OpenCV is not that fast since I use it heavily for computer vision tasks. My compiled version actually uses Eigen, but that doesn’t explain why it’s slower than Eigen! Back in the old days OpenCV used to use BLAS/LAPACK, something they might need to consider bringing back.

Code

test_matrix_pseudoinverse.cpp (right click save as)

Edit the code to #define in the libraries you want to test. Make sure you don’t turn on Armadillo + GSL, because they have conflicting enums. Instructions for compiling is at the top of the cpp file, but here it is again for reference.

To compile using ATLAS:

g++ test_matrix_pseudoinverse.cpp -o test_matrix_pseudoinverse -L/usr/lib/atlas-base -L/usr/lib/openblas-base -lopencv_core -larmadillo -lgomp -fopenmp -lcblas -llapack_atlas -lgsl -lgslcblas -march=native -O3 -DARMA_NO_DEBUG -DNDEBUG -DHAVE_INLINE -DGSL_RANGE_CHECK_OFF

To compile with OpenBLAS:

g++ test_matrix_pseudoinverse.cpp -o test_matrix_pseudoinverse -L/usr/lib/atlas-base -L/usr/lib/openblas-base -lopencv_core -larmadillo -lgomp -fopenmp -lopenblas -llapack_atlas -lgsl -lgslcblas -march=native -O3 -DARMA_NO_DEBUG -DNDEBUG -DHAVE_INLINE -DGSL_RANGE_CHECK_OFF

23 thoughts on “OpenCV vs. Armadillo vs. Eigen vs. more! Round 3: pseudoinverse test”

Gael Guennebaud says:

November 28, 2012 at 10:34 am

Your benchmark has many shortcomings:

First, with matrices of these sizes (1M * very_small), your benchmark is entirely dominated by the cost of computing A^T * A. So if the rest of steps are properly called (see below), then your benchmark reduce to benchmarking A^T * T, nothing else.

Second, you should use Eigen::VectorXd for vectors instead of regular MatrixXd. To reduce the runtime overhead for small objects, in Eigen we choose to assume that a MatrixX* object has no dimension equal to 1, and so the optimized paths for vector are not taken in your current version of the test.

Third, for matrix size larger than 4, it not recommended to compute the inverse of a matrice when what we want is solving Ax=b. Instead, we compute a matrix factorization, (e.g., Cholesky if you know the matrix is SPD), and Directly use it for solving. In Eigen:
X = (A.transpose()*A).llt().solve(A.transpose()*B);
(in your benchmark, this does not make much difference because the computation time is dominated by A^T * A)

Fourth, if you really want to explicitly compute the inverse matrix, then you should at least performs vector operations rather than matrix-matrix ops:
X = (_A.transpose()*_A).inverse()*(_A.transpose()*_B);
Same for GSL, opencv, …
In your current version it’s like you are solving for 1M right hand side, instead of a single.

Reply
1. nghiaho12 says:
  
  November 28, 2012 at 6:59 pm
  
  Thanks for the feedback. I did miss using better inversion for SPD matrix. The Armadillo doc explicitly states faster function for such matrix. The benchmark is more reflective of what I personally deal with than anything in general.
  
  Yep it’s true it is more or less an A^T*A test, I added various dimensions in case there was any surprise (none). However, it does highlight one teensy advantage of Armadillo over Eigen and that is I didn’t have to think about mixing matrix/vector types, it just worked fast. I changed the Eigen code to use VectorXd, it’s like 20% faster, but still overall slower than Armadillo.
  
  Reply
  1. Gael Guennebaud says:
    
    November 29, 2012 at 1:14 am
    
    Changing from MatrixXd to VectorXd is indeed not very important here since it benches A^T * A. However adding the parenthesis like this:
    X = (_A.transpose()*_A).inverse()*(_A.transpose()*_B);
    to avoid additional matrix-matrix operations is very important. Likewise, the GSL implementation should call gsl_blas_dgemv and not gsl_blas_dgemm like this:
    gsl_blas_dgemv(CblasTrans, CblasNoTrans, 1.0, A, _B, 0.0, At_B);
    gsl_blas_dgemv(CblasNoTrans, CblasNoTrans, 1.0, inv_AtA, At_B, 0.0, X);
    This reduces the number of FLOPS from 1e6*n*(n+1) to 1e6*n+n*n…
    
    Reply
    1. nghiaho12 says:
      
      November 29, 2012 at 3:35 am
      
      That’s a very good point about the bracketing. I made the changes and Eigen moved up to 2nd place 🙂
      
      OpenBLAS is faster because it utilises more than 1 CPU core, where as Eigen doesn’t. In fact, their website states:
      
      “Although we have very good performance for 1 CPU core, we don’t have any parallelization so we don’t take advantage of multicore CPUs; and we also don’t take advantage of GPUs”
      
      Reply
Mark Tall says:

December 18, 2012 at 11:43 pm

Minor comment: the Armadillo code can be made more compact by using .t() instead of trans(), eg. A.t()*B instead of trans(A)*B. There’s also .i() which is the corresponding shortcut for inv().

You may also want to look at using the solve() function instead of directly using inv(). Armadillo can automatically convert inv(A)*B into solve(A,B), but the mechanism that detects that such automatic conversions are possible is a work-in-progresss.

Reply
1. nghiaho12 says:
  
  December 19, 2012 at 12:03 am
  
  When I first played with Armadillo a while back I assumed they had those shortcuts already but they didn’t. It’s only recently that I found out they’ve implemented it. I’ve been using them for new code and it makes it more readable.
  
  I wonder if Armadillo has some support for delayed operations so it can figure out how to group the operations to do the least amount of work. Much like how one the comments suggest using brackets to reduce the number of operations.
  
  Reply
Michael Rans says:

March 8, 2013 at 5:19 am

Can you please try your benchmarking with Blaze-lib?
https://code.google.com/p/blaze-lib/wiki/Benchmarks

It claims to beat everything.

Reply
1. nghiaho12 says:
  
  March 8, 2013 at 5:38 am
  
  I would love to but I’m bogged down on another task. The graphs are certainly impressive. Maybe you would like to give it a go?
  
  Reply
  1. Michael Rans says:
    
    March 8, 2013 at 6:49 am
    
    Unfortunately I don’t have a Linux machine and Blaze is out of the box set up for Linux not Windows. Also I’d have to rerun all the other benchmarks since my configuration is different to yours.
    
    Reply
Binlong Li says:

April 3, 2013 at 8:16 pm

Thanks a lot for your comparison, it does helps a lot.
I’m not sure whether I’m the first on to tell you this little trick or not. I believe, for armadillo you could also initial the _A and _B matrix in the following way:

arma:mat _A(&A[0], N, M);
arma:mat _B(&B[0], N, 1);

instead of copying the data in a for loop.

Best

Reply
1. nghiaho12 says:
  
  April 4, 2013 at 2:47 am
  
  Thanks for the tip, didn’t know about that one!
  
  Reply
Ievgen Khvedchenia says:

July 16, 2013 at 6:25 am

Nice benchmark!
I haven’t heard about Armadillo, thanks for pointing me to something new)

In your test i noticed you use gettimeofday function to measure execution time. It’s not the best function to measure the execution time – it has too small precision. The best option is to use high-resolution performance counter (QueryPerformanceCounter http://msdn.microsoft.com/en-us/library/ms644904%28VS.85%29.aspx on Windows platform, or mach_absolute_time() on any POSIX-compatible OS) or use rdtsc instruction). I beleive using more precise timer will affect the results of your tests.

In defence of Eigen.
I’m quite surprised of Eigen’s results. Can i ask you do few modifications of your test?

1) Please, try defining NDEBUG for Eigen test as well. It also loves this define.
2) Also it’s recommended to enable SSE2 instruction set by adding the “-msse2” flag.
3) A matrix product assumes there is aliasing by default. Adding noalias explicitly helps compiler to generate more optimal code:
X.noalias() = (_A.transpose()*_A).inverse()*(_A.transpose()*_B);
4) There can be a reason to reorganize your code:

MatrixXd AtA;
MatrixXd AtB;
…
AtA.noalias() = _A.transpose()*_A;
AtB.noalias() =_A.transpose()*_B;
X.noalias() = AtA .inverse()*AtB;

The reason is to avoid memory allocations each time you compute the preudoinverse.

5) There is a Eigen::Map concept that can be used to map plan C-style arrays to eigen matrices. You can use it to initialze _A and _B matrices: http://eigen.tuxfamily.org/dox/classEigen_1_1Map.html

Reply
1. nghiaho12 says:
  
  July 16, 2013 at 6:57 am
  
  Hi,
  
  Thanks for your suggestions regarding Eigen, this is stuff I’m unaware of. Regarding your points
  
  1) I compiled with gcc -march=native, this enables every possible extension for my i7 CPU, including SSE2.
  2) If you check the cpp file you would see I also added NDEBUG 🙂
  3) Sounds interesting, never heard of the noalias() function. Might give it a shot when I have time.
  4) I avoided caching the transpose on purpose becaue I wanted to time the group of matrix operations as a whole, not in isolation. Else I would simply be timing a matrix multiplication only.
  5) Sounds handy
  
  Keep in mind that the only reason Eigen is in 2nd place is because Armadillo + OpenBLAS is multi-threaded. If Eigen could be threaded it might even beat Armadillo + OpenBLAS. Armadillo + OpenBLAS in my test is at most 2x faster than Eigen for the bigger calculations, even though it seems to be utilizing 4 CPU cores.
  
  On Linux, I find the gettimeofday function is generally “good enough”, it goes down to microsecond. It maybe not the most accurate but this is alleviated by running the timing over a for loop so tbat the total time taken is much larger than +- error of the timer. In my table the fastest timing is is 79.803 ms, which is okay if the gettimeofday() error is say +- 5 ms (~6% error), a wild guess there. Of course, if in doubt just increase the for loop so it runs in the seconds.
  
  Reply
Bo Li says:

September 28, 2013 at 5:33 pm

eigen supports intel MKL. Wondering if you could enable that and then compare eigen+MKL vs armadillo+openBLAS

Reply
1. nghiaho12 says:
  
  September 28, 2013 at 7:57 pm
  
  Didn’t know that. I don’t have MKL installed at the moment but I’ll see if I can find time to set it up.
  
  Reply
Anton Zuykov says:

July 10, 2014 at 4:16 am

It is a nice and useful comparison, but unfortunately the test combines SSE or other SIMD acceleration WITH multithread support which is a completely different thing.
Not in all systems many threads are supported, but this test sort of implies that armadillo+BLAS is the winner. In fact it is oranges vs apples here. And the you din’t even specify how many threads each library was using.

For example, had I decided to opt for embedded systems of low price level such as BBB or other BB-like stuff – it would be totally useless to assume that BLAS will help you out – it woun’t. Single thread and all multithreaded capability is going out of the window,say for ARMv6 or ARM7 to that matter.

Reply
1. nghiaho12 says:
  
  July 10, 2014 at 6:54 am
  
  True, forget to mention the number of threads. It was 4 (default). I did mention OpenBlas was multi-threaded and the others werent, at least in the previous blog post.
  
  Reply
pierre says:

July 18, 2014 at 7:19 pm

nice benchmarks but you should consider compiling all libs (especially atlas!) on the benched machine instead of using these nasty generic ubuntu packages. there are many preparations done, even during configuring, to fit most as possible into the target architecture.

for example there is definitely no avx256 support in the precompiled packages or users without a cpu lilke your corei7 would not be able to run them. i think openblas will speed up and atlas will get ways in front of eigen.

Reply
1. nghiaho12 says:
  
  July 20, 2014 at 6:21 pm
  
  Good point about the default Ubuntu packages, I did drop the ball a bit there. I have compiled Atlas in the past, not sure why I didn’t that time.
  
  Reply
JB says:

May 31, 2015 at 12:02 am

If you use Atlas and want it to be threaded you should link with the ptblas library versions, rather than the blas versions.

Reply
abd says:

April 13, 2016 at 1:24 pm

can i use armadillo in borland c compiler???

Reply
1. nghiaho12 says:
  
  April 18, 2016 at 12:26 am
  
  don’t know, try it and see
  
  Reply
Chip says:

April 27, 2016 at 5:04 pm

mingw-w64install.exe has HW32.Packed.10DE on it…considered malicious. that was considered to be required as per OpenBLAS instructions. What’s up with that?

Reply

Nghia Ho

OpenCV vs. Armadillo vs. Eigen vs. more! Round 3: pseudoinverse test

Results

Code

23 thoughts on “OpenCV vs. Armadillo vs. Eigen vs. more! Round 3: pseudoinverse test”

Leave a Reply Cancel reply

Where boredom, free time, and curiosity meet together