The first time I really got a flavor of what it was like to sit and optimize code, and to think about the underlying hardware, was last semester, when I had to write code for Strassen's matrix multiplication algorithm. The code wasn't particularly good, but it was my first taste of using multiple cores.

Unfortunately, not knowing much about cache optimization or proper profiling, I didn't really do the code justice, even though it was one of the faster implementations among the submissions. One idea that occurred to me was to use the GPU, but at the time I didn't know the slightest thing about OpenCL and had no way to program my laptop's Radeon GPU. Asking around in forums and public programming groups, I came upon that golden nugget of architecture, SIMD, and found that, to my luck, the C# team had just added support for it to the C# JIT (RyuJIT), with a NuGet package on the side, and RyuJIT had just shipped as the default JIT in the current release of the .NET x86-64 runtime.

So I got to work writing in the vectors (in what is, in hindsight, very ugly code) and got a small performance improvement.

Then came the summer, and I really wanted to rewrite the code in a language that would give me more performance, like C. The one hurdle was that pthreads really didn't seem to be the right way to go. Through the forums I discovered OpenMP and started fiddling around with it. I began the optimization work in earnest during the semester, after we were taught the OpenMP basics. I rewrote the code first in C#, with cache optimizations in mind, then in plain C, then in C with AVX2 vector intrinsics, and later with SSE4.1 because I wanted to compare performance against my laptop, which did not support AVX2. I finally completed an OpenMP version too and wrote a complete library with different compiler options for different hardware.
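By "cache optimizations" I mostly mean reordering the loops so the inner loop walks memory sequentially. Here is a minimal sketch of the idea (illustrative only, not my library code; the name matmul_ikj and row-major double matrices are assumptions for the example):

```c
/* Classic i-k-j loop reordering: the inner loop streams through b and c
 * row by row, hitting consecutive cache lines instead of striding down
 * a column of b. Assumes row-major storage and a zero-initialized c.
 * Compile with: gcc -O3 -std=c99 */
void matmul_ikj(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* reused across the whole inner loop */
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```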

My test setup:

  • Custom-built desktop with an Intel Core i5-6402P, 8 GB of RAM, and 256/1024/6144 kB of L1/L2/L3 cache
  • Running Windows 10 x64
  • Using GCC 4.8 on WSL for C, and the .NET Framework 4.6 versions of Roslyn and RyuJIT for C#

The code generates two random 4096×4096 matrices, multiplies them, and times the multiplication. I've included two algorithms: Strassen's and naive.
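The benchmark harness is roughly this shape (a sketch under my own naming, not the exact code; matmul_ikj is the illustrative kernel from above):

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime under -std=c99 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

void matmul_ikj(const double *a, const double *b, double *c, int n); /* sketch above */

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof(double));
    double *b = malloc((size_t)N * N * sizeof(double));
    double *c = calloc((size_t)N * N, sizeof(double)); /* zeroed: the kernel accumulates */

    /* Fill the inputs with random values; only the multiply is timed. */
    srand((unsigned)time(NULL));
    for (size_t i = 0; i < (size_t)N * N; i++) {
        a[i] = (double)rand() / RAND_MAX;
        b[i] = (double)rand() / RAND_MAX;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul_ikj(a, b, c, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%.3f seconds\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```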

As for performance:

| Program | Algorithm | Time | Notes |
| --- | --- | --- | --- |
| The first implementation (C#) | Naive | 364.383 seconds | |
| The first implementation (C#) | Strassen's | 81.383 seconds | |
| The second implementation (C#) | Naive | 23.028 seconds | |
| The second implementation (C#) | Strassen's | 36.370 seconds | |
| The third implementation (C) | Naive | 26.328 seconds | Without any parallelization, just cache optimization |
| The third implementation (C) | Strassen's | 14.390 seconds | Without any parallelization, just cache optimization |
| The third implementation (C) | Naive | 22.031 seconds | With SIMD (AVX2) |
| The third implementation (C) | Strassen's | 9.390 seconds | With SIMD (AVX2) |
| The third implementation (C) | Naive | 16.767 seconds | With OpenMP (just parallel for) |
| The third implementation (C) | Strassen's | 5.990 seconds | With OpenMP (fork-join style) |
| The third implementation (C) | Naive | 15.982 seconds | With OpenMP and SIMD |
| The third implementation (C) | Strassen's | 4.075 seconds | With OpenMP and SIMD |
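For reference, the "just parallel for" variant of the naive code is essentially one pragma on the outermost loop; each thread gets its own block of rows of c, so there are no write conflicts and no synchronization inside the loop. A minimal sketch, reusing the i-k-j loop order from earlier (the fork-join Strassen variant instead runs the seven recursive sub-multiplications concurrently, e.g. via OpenMP sections or tasks; not shown here):

```c
/* "Just parallel for": one pragma splits the rows of c across threads.
 * Each thread writes a disjoint set of rows, so no locking is needed.
 * Compile with: gcc -O3 -std=c99 -fopenmp */
void matmul_ikj_omp(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```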

As the implementations get faster and more parallel, the constant factors in the running time shrink and copying arrays around in memory becomes the hotspot. SIMD becomes less relevant because the bottleneck is no longer computation but memory.

Here is a performance comparison of AVX2, with its 256-bit vectors, against SSE4.1, with its 128-bit vectors (without OpenMP):

| Program | Algorithm | Time | Notes |
| --- | --- | --- | --- |
| The third implementation (C) | Strassen's | 14.390 seconds | AVX2 |
| The third implementation (C) | Strassen's | 11.110 seconds | SSE4.1 |

This shows that raw data-processing rate is hardly the bottleneck, and only small improvements can be gained from wider vectors.
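For context, the only thing that changes between the two builds is the vectorized inner loop. A sketch of the 256-bit flavor (assuming double-precision data, n a multiple of 4, and my illustrative loop order; these particular intrinsics are plain AVX, which the AVX2 target also enables):

```c
#include <immintrin.h>

/* 256-bit vectors process 4 doubles per instruction; the SSE4.1 build
 * has the same shape with __m128d / _mm_* and a stride of 2.
 * Assumes a zero-initialized c and n divisible by 4.
 * Compile with: gcc -O3 -std=c99 -mavx2 */
void matmul_ikj_avx(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            __m256d aik = _mm256_set1_pd(a[i * n + k]); /* broadcast a[i][k] */
            for (int j = 0; j < n; j += 4) {
                __m256d bkj = _mm256_loadu_pd(&b[k * n + j]);
                __m256d cij = _mm256_loadu_pd(&c[i * n + j]);
                cij = _mm256_add_pd(cij, _mm256_mul_pd(aik, bkj));
                _mm256_storeu_pd(&c[i * n + j], cij);
            }
        }
}
```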

Other observations: the Strassen code runs slower than the naive code on some systems (specifically a laptop with a Core M processor) and uses a lot more memory.

You can find my code here.