Cache efficient matrix multiplication

Regardless of cache coherency problem in multithreaded environment one must think about cache efficiency when operating on various memory locations.

It is amazing how good cache efficient code performs.

As a classical example there is matrix multiplication problem.
When values are stored in memory in row major form you have good sequential access to the elements inside one row but when you need to access an element at another row it usually results in a cache miss so this works terribly slow.

Conventionally in order to multiply two matrices one must cycle through rows and columns.

There is another approach to transpose the second matrix so that rows in both matrices are visited one by one.
The second method with transposed matrix works faster than naive implementation, however there is a smarter way.
This is how matrix multiplication is implemented in performance critical libraries.

The trick here is to update each element of resulting matrix several times.
Instead of one cycle we are visiting elements of resulting matrix several times, adding new values, depending on how many rows the second matrix has.

I have written some example in C/C++ in order to illustrate this idea:

#include <iostream>
using namespace std;
int main(int argc, char * argv[])
{
 const int M=4,N=6;
 int a[N][M];
 int b[M][N];
 for(int i=0;i<M;++i)
 {
  for(int j=0;j<N;++j)
  {
   a[j][i] = j;
   b[i][j] = j;
  }
 }
 int result[N][N];
 memset(result, 0, sizeof(int) * N * N);
 int * pLeft = &a[0][0];
 int * pDstLim = &result[0][0] + N * N;
 int * pRightLim = &b[0][0] + N * M;
 for(int * pDst = &result[0][0]; pDst< pDstLim; pDst += N )
 {
  int * pRight = &b[0][0];
  while(pRight < pRightLim)
  {
   int * pTmpLim = pDst + N;
   while(pDst < pTmpLim)
   {
    *(pDst++) +=  *pLeft * *(pRight++);
   }
   ++pLeft;
   pDst -= N;
  }
 }
 for(int i=0;i<N;++i)
 {
  for(int j=0;j<N;++j)
  {
   std::cout << (int) result[i][j] << " ";
  }
  std::cout << std::endl;
 }
}

This code is incredibly faster (at least 40 times faster on my laptop, but I believe the speed up is better than linear) than other variants. For the general idea of how it works one can read Algorithms and Special Topics, which contains wonderful description of this and other great tricks.

Leave a Reply