A packed vector library is a library that makes it easier to use the SIMD instruction set extensions (packed vectors) of modern CPUs.

There are quite a few packed vector libraries for C++. Most of them don't provide all the functionality I wanted, so I wrote my own.

The functionality I wanted to have in such a library:

  • uses intrinsics, no inline assembly allowed
    • Visual C++ doesn't support inline assembly when compiling for 64-bit targets
    • inline assembly may need manual register allocation
    • compiler-generated code is very good, so no need to get down to the lowest level
  • provides operator overloading at least for the arithmetic operators
    • you don't have to actually know all of the intrinsics
    • your code clearly states its intent with normal infix operator syntax; compare the raw intrinsics version:
      __m128 a ..., b ...;
      __m128 c = _mm_add_ps(a, b);

      to the operator version:

      vec4f_t a ... , b ...;
      vec4f_t c = a + b;
    • it is possible to compile scalar code from the same source code
  • a scalar fallback must be present and you should not have to write it manually (a minimal sketch of such a wrapper follows after this list)
  • support for as many instruction sets as possible
  • support at least these compilers: VC++, g++, clang++
  • support for compiling a single executable with all paths present without having to dispatch on each single operation: just write the algorithm in terms of packed vectors and let the compiler generate fallback paths for platforms with restrictions
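
As a rough illustration of the operator overloading and scalar fallback points above, here is a minimal sketch of such a wrapper. This is not the library's actual interface (its classes appear in the example further below); the type names vec4f_sse and vec4f_scalar are made up for this sketch:

#include <xmmintrin.h> // SSE intrinsics, __m128

// SSE-backed 4-float vector: operator+ maps directly to _mm_add_ps.
struct vec4f_sse {
    __m128 v;
    friend vec4f_sse operator+(vec4f_sse a, vec4f_sse b) {
        vec4f_sse r = { _mm_add_ps(a.v, b.v) };
        return r;
    }
};

// Scalar fallback with the same interface: plain float math, no SIMD needed.
struct vec4f_scalar {
    float v[4];
    friend vec4f_scalar operator+(vec4f_scalar a, vec4f_scalar b) {
        vec4f_scalar r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
};

// Code written against a template parameter compiles for either type.
template<typename vec_t>
vec_t add_three(vec_t a, vec_t b, vec_t c) { return a + b + c; }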

The last point is actually not completely done yet, but I think I will get to it in the near future. The same is true for the second-to-last point on platforms other than x86 and amd64/x64.

Compiling a single executable with all the paths present

In general there are two ways to implement a single executable that runs on all machines, i.e. one that provides fallback implementations for machines without support for a specific instruction set:

  1. implement all paths manually (including the scalar path) and dispatch to the correct version for the machine at runtime; the dispatch goes through a function pointer (or pointer to a functor object) that is initialized at program start to a dispatch initializer function; on the first call, this initializer sets the function pointer to the version matching the instruction set supported by the machine and then calls that version;
    every subsequent indirect call jumps to the correct implementation directly, so the dispatch initializer function is never called again (see the sketch after this list);
    all implementations can reside in the same module (executable or DLL/shared object);
    the indirect call through a function pointer is relatively expensive because it is less likely to be predicted correctly and it can cause cache misses, so it is something you want to keep out of inner loops; if you pay it for every single operation, the price for the flexibility is high
  2. build one DLL/shared object for each supported instruction set and load the correct one at program startup; since DLLs/shared objects use essentially the same call mechanism as function pointers, this has the same disadvantages as way 1 and in addition requires a more complex build process set up with conditional compilation
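
Here is a minimal sketch of the self-initializing function pointer used in way 1. The names (render_func_t, render_scalar, render_sse, cpu_has_sse) are hypothetical, the actual implementations and the CPU detection are assumed to exist elsewhere, and thread-safety of the one-time pointer update is ignored:

#include <cstddef>

typedef void (*render_func_t)(float* pixels, std::size_t count);

// One implementation per instruction set plus a detection helper,
// all assumed to be defined elsewhere (hypothetical names).
void render_scalar(float* pixels, std::size_t count);
void render_sse(float* pixels, std::size_t count);
bool cpu_has_sse();

// Forward declaration so the pointer can start out pointing at the initializer.
static void render_dispatch_init(float* pixels, std::size_t count);

// The function pointer everyone calls; initially it points at the initializer.
static render_func_t render = &render_dispatch_init;

// Runs only on the first call: picks the right implementation, patches the
// pointer and forwards the current call. Every later call through `render`
// jumps straight to the chosen implementation.
static void render_dispatch_init(float* pixels, std::size_t count) {
    render = cpu_has_sse() ? &render_sse : &render_scalar;
    render(pixels, count);
}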

Using my packed vector library you only need to write the packed vector version of your algorithm; the scalar version used as fallback is generated for you automatically. Of course, this generated scalar version is probably slower than a hand-written one, but your program will run on machines without support for the needed SIMD instruction sets.

To make this happen, you first need to extract your inner loops / algorithms into templates with a template parameter for the packed vector type to be used. My packed vector library provides implementations for some SIMD instruction sets and has all the scalar fallbacks implemented.

Next, you instantiate these templates with the concrete packed vector types ("concrete" meaning that the vector classes carry a template parameter for the SIMD instruction set for which code is to be generated), either in an array of function pointers or in a switch. With the array of function pointers, the last step is to call the respective function indirectly (the switch calls the instantiations directly).
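
Here is a sketch of the switch variant, written in terms of the same vec_t instantiations that appear in the example below; the surrounding function name and the elided argument lists are placeholders, and target is assumed to hold the instruction set detected at startup (e.g. via CPUID):

void render_op_invert(int target /*, ... actual arguments ... */) {
    // Each case calls one concrete instantiation directly, so there is
    // no indirect call through a function pointer.
    switch (target) {
    case PMATH_TARGET_SSE2:
        do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SSE2>>(/*...*/);
        break;
    case PMATH_TARGET_SSE:
        do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SSE>>(/*...*/);
        break;
    default:
        do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SCALAR>>(/*...*/);
        break;
    }
}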

Example

Here's an example of an Invert operator in my engine; an image is stored as normalized floats (0-1) in RGBA order, so a vec4f is used:

before (only SSE code is generated):

// loop ...
    math::vec4f_t col = ...load RGBA pixel value...;
    math::vec4f_t inv(math::vec4f_t::math_t::ones() - col);
    // ... extract inverted channel value(s) according to setting
    //     and store the result
// end of loop

after (scalar code can be generated, too):

template<typename vec_t>
static void do_render_op_invert(...) {
    // loop ...
        vec_t col = ...load RGBA pixel value...;
        vec_t inv(vec_t::math_t::ones() - col);
        // ... extract inverted channel value(s) according to setting
        //     and store the result
    // end of loop
}

typedef void (*func_t)(...);
static const func_t funcs[PMATH_TARGET_COUNT] = {
    do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SCALAR>>,
    do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SSE>>,
    do_render_op_invert<math::vec_t<float,4,__m128,PMATH_TARGET_SSE2>>,
    ...
};
funcs[/* runtime variable with instruction set from CPUID */](...);
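
The comment inside the index brackets stands for a value determined once at startup. One possible way to obtain it (not the library's code; detect_pmath_target is a made-up name) is the GCC/clang builtin __builtin_cpu_supports; with VC++ you would query __cpuid from <intrin.h> instead:

static int detect_pmath_target() {
    // Test from the most capable supported instruction set down to the
    // scalar fallback; the PMATH_TARGET_* values are the ones used above.
    if (__builtin_cpu_supports("sse2")) return PMATH_TARGET_SSE2;
    if (__builtin_cpu_supports("sse"))  return PMATH_TARGET_SSE;
    return PMATH_TARGET_SCALAR;
}

The returned value is then used as the index into funcs.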

Implementation of this feature in the packed vector library is not completely done yet, as there are still some compatibility problems with VC++ and GCC/clang.