OpenCV is not designed for matrix operations such as multiplication, eigenvalue decomposition, SVD, etc., so your result does not surprise me. For your problem there are two kinds of choices: the first is to use a dedicated library (LAPACK or Eigen is recommended); the second is to use a wrapper library such as Armadillo or CVMlib. I have tested all of them (you can see more at this), and my choice is LAPACK, which gives the fastest speed (on Windows, with the VS IDE). Hope this helps.
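For illustration, here is a minimal sketch of doing such operations with Eigen (assuming Eigen 3 is on the include path; the matrix size and the use of JacobiSVD are my own illustrative choices, not taken from the original question):

#include <Eigen/Dense>
#include <iostream>

int main()
{
    // Two random 512x512 matrices (size chosen arbitrarily for the example).
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(512, 512);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(512, 512);

    // Dense matrix product, dispatched to Eigen's optimized kernels.
    Eigen::MatrixXd C = A * B;

    // Singular value decomposition of A (thin U and V).
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);

    std::cout << "C(0,0) = " << C(0, 0)
              << ", largest singular value of A = " << svd.singularValues()(0) << std::endl;
    return 0;
}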
2.
This kind of question keeps recurring and deserves, for once on Stack Overflow, a clearer answer than “Matlab uses highly optimized libraries” or “Matlab uses the MKL”.
History:
Matrix multiplication (together with matrix-vector and vector-vector multiplication and many of the matrix decompositions) is among the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.
I’m not an expert on the history, but apparently back then, everybody just rewrote his Fortran version with simple loops. Some standardization then came along, with the identification of “kernels” (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called: Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.
BLAS:
BLAS evolved from level 1 (the first version, which defined scalar-vector and vector-vector operations) to level 2 (matrix-vector operations) to level 3 (matrix-matrix operations), providing more and more “kernels” and thus standardizing more and more of the fundamental linear algebra operations. The original Fortran 77 implementations are still available on Netlib’s website.
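To make the levels concrete, here is a sketch of one representative routine per level through the CBLAS C interface (this assumes a CBLAS implementation such as OpenBLAS or the MKL is installed and linked; the routine names are the standard ones, but the wrapper function itself is purely illustrative):

#include <cblas.h>

// One representative routine per BLAS level (illustrative wrapper only).
void blas_levels_demo(int n, double *x, double *y,
                      double *A, double *B, double *C)
{
    // Level 1: vector-vector, e.g. the dot product x'y.
    double d = cblas_ddot(n, x, 1, y, 1);
    (void)d;

    // Level 2: matrix-vector, y = 1.0*A*x + 1.0*y (A is n-by-n, row-major).
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 1.0, y, 1);

    // Level 3: matrix-matrix, C = 1.0*A*B + 0.0*C (all n-by-n, row-major).
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}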
Towards better performance:
So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.
I don’t know all the historical implementations (I was not born or a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your Matlab uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.
Technical details on Matrix multiplication:
So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.
Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your accesses in your inner loop to the k-th column of “matice2”: matice2[m][k] are very slow. Indeed, when you access matice2[0][k], you must get the k-th element of the array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another array (the array 1). Then in the next iteration you access yet another array, and so on… Since the entire matrix matice2 can’t fit in the highest caches (it’s 8*1024*1024 bytes large), the program must fetch the desired element from main memory, losing a lot of time.
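Presumably the original multiplication looked roughly like this (reconstructed from the description above; matice1, matice2, matice3 and rozmer are the names used in the question):

// The inner access matice2[m][k] walks down a column, jumping from one
// row array to another on every iteration, which defeats the cache.
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        float temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp += matice1[j][m] * matice2[m][k];   // column-wise access to matice2
        }
        matice3[j][k] = temp;
    }
}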
If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster because now the compiler can load entire rows in the cache at the same time. Just try this modified version:
timer.start();
float temp = 0;
// transpose matice2 into tempmat so that both operands are read row-wise
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
// multiply: matice3 = matice1 * matice2, using the transposed copy
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
So you can see how just cache locality increased your code’s performance quite substantially. Now real dgemm implementations exploit that to a very extensive level: They perform the multiplication on blocks of the matrix defined by the size of the TLB (Translation lookaside buffer, long story short: what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization, they use the processor’s vectorized instructions for optimal instruction throughput, which you can’t really do from your cross-platform C++ code.
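As a very rough illustration of the blocking idea (only in spirit: the block size BS here is an arbitrary choice, whereas a real BLAS derives it from the cache/TLB parameters and adds hand-vectorized microkernels), the loops can be tiled like this, assuming matice3 has been zeroed beforehand and <algorithm> is included for std::min:

// Rough cache-blocking sketch (illustrative only). Requires matice3 to be
// zero-initialized; BS is an arbitrary block size, not a tuned value.
const int BS = 64;
for (int jj = 0; jj < rozmer; jj += BS)
{
    for (int mm = 0; mm < rozmer; mm += BS)
    {
        for (int kk = 0; kk < rozmer; kk += BS)
        {
            // multiply one BS-by-BS block of matice1 with one block of matice2
            for (int j = jj; j < std::min(jj + BS, rozmer); j++)
            {
                for (int m = mm; m < std::min(mm + BS, rozmer); m++)
                {
                    float a = matice1[j][m];
                    for (int k = kk; k < std::min(kk + BS, rozmer); k++)
                    {
                        matice3[j][k] += a * matice2[m][k];   // contiguous access to matice2 and matice3
                    }
                }
            }
        }
    }
}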
Finally, people claiming that it’s because of Strassen’s or the Coppersmith–Winograd algorithm are wrong; neither of these algorithms is implementable in practice, because of the hardware considerations mentioned above.
References:
1. http://answers.opencv.org/question/34125/opencv-solve-linear-equation-system-too-slow/
2. http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication