Friday, February 18, 2011

Using SIMD for Hardware Acceleration

Most modern processors can apply the same instruction to multiple pieces of data at once. The acronym SIMD, which stands for "Single Instruction, Multiple Data", describes just that.

Intel first introduced instructions that could operate on multiple data with MMX, which shipped in the Pentium MMX CPUs back in 1997. MMX works on 64 bit registers and on integer data only. Intel followed this up with an extended instruction set called Streaming SIMD Extensions, or SSE. The SSE instruction set, which first came with the Pentium III processor, uses 128 bit registers that can each pack 4 single precision floating point values (SSE2 later added packed integers on the same registers). The advantage of doing this becomes clear if one tries to consecutively add four different sets of operands. Using the basic instruction set one needs to:
  1. Move the two operands to registers.
  2. Add the operands.
  3. Move result to memory.
  4. Repeat the above steps for three more sets of operands.
But with SSE, just three instructions (load, add, store) do all of this at once for the four sets of operands, which can improve performance considerably.
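To make the comparison above concrete, here is a rough sketch of both approaches side by side. It uses the compiler intrinsics that are covered later in this post, and the function names add4_scalar and add4_sse are my own, chosen only for illustration. I use the unaligned load and store intrinsics here so that the sketch works for any float array.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Scalar version: one load/add/store sequence per element. */
void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
}

/* SSE version: all four elements are loaded, added and stored together. */
void add4_sse(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load a[0..3], no alignment needed */
    __m128 vb = _mm_loadu_ps(b);             /* load b[0..3], no alignment needed */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* add the four pairs and store them */
}
```

Both functions produce the same four sums. Keep in mind that an optimizing compiler may vectorize the scalar loop on its own, so measure before assuming a speedup.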

Before I show you how to use these SSE instructions it is important to know how data is laid out in a SIMD register. An SSE register is 128 bits wide, which means it can hold 4 x 32 bit values, in other words 4 floats (or, with SSE2, 4 ints). The four values are referred to, in order, as [x, y, z, w]. The following SSE instruction, addps, adds the four floats in the 128 bit XMM0 register to the four floats in the 128 bit XMM1 register and stores the results back in the XMM0 register.

addps xmm0, xmm1
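The [x, y, z, w] ordering is easy to get wrong from C, so here is a small sketch using intrinsics (introduced later in this post). The helper names lanes, lane_x and add_lanes_demo are hypothetical, made up for this example. The thing to watch is that _mm_set_ps takes its arguments from w down to x, while _mm_setr_ps takes them from x up to w.

```c
#include <xmmintrin.h>

/* Copies the four values of v to out[0..3] in the order x, y, z, w. */
void lanes(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

/* Returns only the lowest value (x) of v. */
float lane_x(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Does what "addps xmm0, xmm1" does: an element-by-element addition. */
__m128 add_lanes_demo(void)
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);     /* x=1,  y=2,  z=3,  w=4  */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);  /* x=10, y=20, z=30, w=40 */
    return _mm_add_ps(a, b);                            /* [11, 22, 33, 44]       */
}
```

Storing the result with lanes() gives 11, 22, 33, 44 in that order, and lane_x() returns 11.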

The four floats in an SSE register can also be moved from memory individually, but such operations are slow compared with a single 128 bit load. Moreover, moving data between the FPU (floating point unit) registers and the general purpose CPU registers is particularly slow, because the CPU has to wait for the FPU to complete the operation at hand. Hence it is good practice to leave the data in the SSE registers for as long as possible, until the registers are needed for something else.

Let us now see how we can leverage these SIMD instructions from C/C++. Many compilers provide special data types for SIMD operations. Here I will discuss only the Microsoft Visual C++ compiler (MSVC). MSVC provides a predefined data type, __m128, which can be used to declare a variable that holds the contents of an SSE register. The compiler tries to keep a __m128 variable in an SSE register, although it can still spill it to memory when it runs out of registers. Variables of type __m128 are automatically aligned on 16 byte boundaries, but when you load data from your own arrays it is the programmer's responsibility to align that data to a 16 byte address.
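Since alignment is the part that usually bites first, here is a minimal sketch of checking it at run time. The helpers is_aligned16 and load4 are hypothetical names of my own, not part of any library. The sketch assumes that falling back to the unaligned load intrinsic _mm_loadu_ps is acceptable when the pointer is not 16 byte aligned.

```c
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* _mm_load_ps, _mm_loadu_ps */

/* Returns non-zero when p sits on a 16 byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Loads p[0..3], using the aligned load only when it is safe to do so. */
static __m128 load4(const float *p)
{
    return is_aligned16(p) ? _mm_load_ps(p)    /* requires 16 byte alignment */
                           : _mm_loadu_ps(p);  /* works for any address      */
}
```

In real code you would normally guarantee the alignment up front and call _mm_load_ps directly. A check like this is mostly useful for debugging or for data you do not control.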

Here's a sample function which demonstrates the usage of SIMD to perform addition on four floating point values.

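__m128 addMMX(__m128 a, __m128 b)
{
    __m128 result;
    /* inline assembly (MSVC syntax, 32 bit x86 builds only) */
    __asm
    {
        movaps xmm0, xmmword ptr [a]       ; load the four floats of a
        movaps xmm1, xmmword ptr [b]       ; load the four floats of b
        addps  xmm0, xmm1                  ; xmm0 = a + b, element by element
        movaps xmmword ptr [result], xmm0  ; store the four sums
    }
    return result;
}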

This is however a bad approach because the code is not portable (MSVC only supports inline assembly when targeting 32 bit x86) and one has to embed assembly into high level code. A better way to do the same thing is to use intrinsics. Intrinsics look and behave like C functions, but the compiler translates each one directly into the corresponding machine instructions instead of a function call. In order to use the SSE intrinsics be sure to include the xmmintrin.h header in your code.

#include <xmmintrin.h>

__m128 addSIMDwithIntrinsics(__m128 a, __m128 b)
{
    /* use intrinsics: _mm_add_ps compiles to addps */
    __m128 result = _mm_add_ps(a, b);
    return result;
}
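Because intrinsics behave like ordinary functions, they can be nested like any other expression. As a small sketch of that, here is a multiply followed by an add. The function name mulAddSIMD is my own and is only meant as an illustration.

```c
#include <xmmintrin.h>

/* Computes a * b + c for each of the four floats. */
__m128 mulAddSIMD(__m128 a, __m128 b, __m128 c)
{
    __m128 product = _mm_mul_ps(a, b);  /* four multiplications at once */
    return _mm_add_ps(product, c);      /* four additions at once       */
}
```

For example, with a = [1, 2, 3, 4], every element of b set to 2 and every element of c set to 0.5, the result is [2.5, 4.5, 6.5, 8.5].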

To load 4 floats into an SSE register simply use the load intrinsic.

/* _mm_load_ps and _mm_store_ps require a 16 byte aligned address,
so be sure to align your arrays (use _mm_loadu_ps for unaligned data) */

__declspec(align(16)) float A[] = {1.0f, 2.0f, 3.0f, 4.0f};
__declspec(align(16)) float B[] = {4.0f, 3.0f, 2.0f, 1.0f};
__declspec(align(16)) float C[] = {0.0f, 0.0f, 0.0f, 0.0f};

int main(int argc, char* argv[])
{
    /* load a and b from the arrays above */
    __m128 a = _mm_load_ps(&A[0]);
    __m128 b = _mm_load_ps(&B[0]);
    __m128 c;

    /* call addSIMDwithIntrinsics() function from above */
    c = addSIMDwithIntrinsics(a, b);

    /* write the result back to array C */
    _mm_store_ps(&C[0], c);

    return 0;
}

The next time you set out to write FFT functions, or just about any repetitive math operation, consider putting this feature to use.
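As a starting point for that kind of repetitive work, here is a rough sketch of adding two float arrays of any length. The function name add_arrays is hypothetical. It assumes nothing about alignment, so it uses the unaligned load and store intrinsics, and it falls back to plain scalar code for the last few elements when the length is not a multiple of four.

```c
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE intrinsics */

/* out[i] = a[i] + b[i] for i in [0, n), four elements at a time. */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* main loop: process full groups of four floats */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* tail loop: handle the remaining 0 to 3 elements one by one */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

If you can guarantee that all three arrays are 16 byte aligned, the aligned variants _mm_load_ps and _mm_store_ps can be used in the main loop instead.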