Optimizing the Strassens Algorithm
The first time I really got a flavor of what it was like to sit and optimize code
and think about the underlying hardware, was last semester, when I had to write code for the Strassens matrix multiplication algorithm. The code wasn’t particularly good, and was my first taste of using multiple cores in my code.
Unfortunately, not knowing much about optimizing for the cache or about profiling properly, I didn’t really do justice to the code, despite it being one of the faster implementations of the submissions. One thing that occurred to me was to use the GPU, but at the time, I hadn’t the slightest knowledge of OpenCL and no way to program for my laptops Radeon GPU. Asking around in some forums and some public programming groups, I came upon this golden nugget in architecture that was SIMD, and found that to my luck, the C# guys had JUST implemented it into the C# JIT (called RyuJIT) with a Nuget package on the side, and this had JUST been pushed to the current release of the .NET X86-64 runtime as the default JIT.’
So I got to work, writing in the vectors, (in what is very ugly code in hindsight) and got a small performance improvement.
Then came the summer, and I really wanted to rewrite the code, in a language that would give me more performance like C. The one hurdle was that pthreads really didnt seem to be the right way to go. So through the forums, I discovered OpenMP and was fiddling around with it. I began to actually try the optimization during semester, after we were taught the OpenMP basics, and rewrote the code first in C#, with cache optimizations in mind, then in C, then C with vectors from the AVX 2 library, and then later with SSE4.1 because I wanted to compare performance with my laptop which did not support SSE2. I finally managed to complete it with an OpenMP version too and wrote a complete library with different compiler options for different hardware.
My test setup is a:
- Custom Built Desktop with an Intel Core i5 6402p, 8GB of RAM and 256/1024/6144 kB of L1/L2/L3 cache
- Running Windows 10 X64
- Using the GCC 4.8 compiler on WSL for C and the Dotnet framework 4.6 version of Roslyn and RyuJIT for C#
The code generates random 4096x4096 arrays, multiplies them, and times the multiplication. Ive included two algorithms, Strassens and Naive.
And as for performance,
Program | Algorithm | Time | Notes |
---|---|---|---|
The first implementation (C#) | Naive | 364.383 seconds | |
The first implementation (C#) | Strassens | 81.383 seconds | |
The second implementation (C#) | Naive | 23.028 seconds | |
The second implementation (C#) | Strassens | 36.370 seconds | |
The third implementation (C) | Naive | 26.328 seconds | Without any parallelization, just cache optimization |
The third implementation (C) | Strassens | 14.390 seconds | Without any parallelization, just cache optimization |
The third implementation (C) | Naive | 22.031 seconds | With SIMD (AVX 2) |
The third implementation (C) | Strassens | 9.390 seconds | With SIMD (AVX 2) |
The third implementation (C) | Naive | 16.767 seconds | With OpenMP (just parallel for) |
The third implementation (C) | Strassens | 5.99 seconds | With OpenMP (a la fork join) |
The third implementation (C) | Naive | 15.982 seconds | With OpenMP and SIMD |
The third implementation (C) | Strassens | 4.075 seconds | With OpenMP and SIMD |
As the implementation becomes faster and more parallel, the constants for time complexity in the program become smaller and copying the arrays around in memory becomes the hotspot. SIMD becomes less relevant as the bottleneck is no longer the computation but the memory.
Here is a comparison of AVX2 and its 256 bit vectors vs SSE4.1 and its 128 bit vectors on performance (without OpenMP)
Program | Algorithm | Time | Notes |
---|---|---|---|
The third implementation (C) | Strassens | 14.390 seconds | AVX 2 |
The third implementation (C) | Strassens | 11.110 seconds | SSE4.1 |
Showing that data processing rate is hardly the bottleneck and only small improvements can be gained here.
Other observations were that the Strassens code runs slower than the Naive code on some systems (specifically a Laptop with a Core M processor) and uses a lot more memory.
You can find my code here.