Here's a result of porting my packed vector library to ARM NEON compiler intrinsics.


Applying a 2D Gaussian filter to an image is a separable operation if the 2D filter kernel used
is separable, i.e. if it can be factorized into the outer product of two vectors
(or was constructed as such a product in the first place).
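
To make the factorization concrete, here is a minimal sketch that builds a 2D kernel as the outer product of two vectors. The function name `outer_product` is made up for this illustration and is not part of my engine.

```cpp
#include <array>
#include <cstddef>

// Outer product of a column vector and a row vector:
// kernel[r][c] = col[r] * row[c].
// A 2D kernel that can be written this way is separable.
template <std::size_t N>
std::array<std::array<float, N>, N> outer_product(const std::array<float, N> &col,
                                                  const std::array<float, N> &row)
{
    std::array<std::array<float, N>, N> kernel{};
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            kernel[r][c] = col[r] * row[c];
    return kernel;
}
```

With the vector (1, 2, 1), a common binomial approximation of a Gaussian, this yields the 3x3 kernel with the rows (1, 2, 1), (2, 4, 2) and (1, 2, 1).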

In my engine, such a separable operation is computed in two steps:

  1. apply the horizontal vector to the input image
  2. apply the vertical vector to the resulting image
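
The two steps can be sketched in plain scalar C++ as follows. This is only an illustration under simplifying assumptions: the `Image` type, the single float channel and the function names are hypothetical, and only the inner area is computed (border pixels are left at zero).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel, row-major image; not the engine's image type.
struct Image {
    std::size_t width, height;
    std::vector<float> data;
    float  at(std::size_t x, std::size_t y) const { return data[y * width + x]; }
    float &at(std::size_t x, std::size_t y)       { return data[y * width + x]; }
};

// Step 1: apply the 1D kernel along each row (inner area only).
Image horizontal_pass(const Image &in, const std::vector<float> &k)
{
    assert(k.size() % 2 == 1);              // odd kernel length, centered tap
    const std::size_t r = k.size() / 2;     // kernel radius
    Image out{in.width, in.height, std::vector<float>(in.data.size(), 0.0f)};
    for (std::size_t y = 0; y < in.height; ++y)
        for (std::size_t x = r; x + r < in.width; ++x) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < k.size(); ++i)
                acc += k[i] * in.at(x + i - r, y);
            out.at(x, y) = acc;
        }
    return out;
}

// Step 2: apply the 1D kernel along each column (inner area only).
Image vertical_pass(const Image &in, const std::vector<float> &k)
{
    assert(k.size() % 2 == 1);
    const std::size_t r = k.size() / 2;
    Image out{in.width, in.height, std::vector<float>(in.data.size(), 0.0f)};
    for (std::size_t y = r; y + r < in.height; ++y)
        for (std::size_t x = 0; x < in.width; ++x) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < k.size(); ++i)
                acc += k[i] * in.at(x, y + i - r);
            out.at(x, y) = acc;
        }
    return out;
}

// The separable operation: the vertical pass works on the horizontal result.
Image separable_filter(const Image &in, const std::vector<float> &k)
{
    return vertical_pass(horizontal_pass(in, k), k);
}
```

For a kernel of length n, the two passes cost 2n multiplications per pixel instead of the n*n needed when applying the full 2D kernel directly.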

Here's a piece of the source for the calculation of the horizontal pass of the inner area,
i.e. all positions where the filter application is completely within the boundaries of the image:


The inner loop calculates the resulting pixel value by loading a pixel, weighting its component values by the kernel coefficient and accumulating the results. "v_kernel" is an array of packed vectors with 4 floats (denoted by the type name "vec4f_t"). Each element of the array contains the respective coefficient of the filter vector, already duplicated to all four elements. loada() loads a packed vector from the given address.

The normalization is applied afterwards. As the blurring operation should not affect the alpha channel of the image, the last step before storing the result into the output buffer is a 2-step shuffle that combines the result with a fixed alpha value.
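
For readers without the library at hand, the per-pixel work described above can be sketched in portable C++. This is not the engine's source: `vec4f` is a plain-struct stand-in for the packed type "vec4f_t", and `splat`, `load`, `store` and `blur_pixel` are names assumed for this illustration. The alpha replacement is written as a simple element assignment where the real code uses a 2-step shuffle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Portable stand-in for a packed vector of 4 floats (hypothetical;
// the real type wraps a NEON or SSE register).
struct vec4f { float v[4]; };

inline vec4f splat(float s) { return {{s, s, s, s}}; }
inline vec4f load(const float *p) { return {{p[0], p[1], p[2], p[3]}}; }
inline void store(float *p, const vec4f &a)
{
    for (int i = 0; i < 4; ++i) p[i] = a.v[i];
}
inline vec4f operator+(const vec4f &a, const vec4f &b)
{
    vec4f r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline vec4f operator*(const vec4f &a, const vec4f &b)
{
    vec4f r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

// Computes one output pixel of the horizontal pass.
// src points at the first RGBA float pixel covered by the kernel;
// v_kernel holds each coefficient duplicated to all four elements.
inline vec4f blur_pixel(const float *src, const std::vector<vec4f> &v_kernel,
                        float norm, float alpha)
{
    assert(!v_kernel.empty());
    vec4f acc = splat(0.0f);
    for (std::size_t i = 0; i < v_kernel.size(); ++i)
        acc = acc + load(src + 4 * i) * v_kernel[i];   // weight and accumulate
    acc = acc * splat(norm);                           // normalization afterwards
    acc.v[3] = alpha;                                  // fixed alpha, blur must not touch it
    return acc;
}
```

The caller would then write the result to the output buffer with `store(dst, blur_pixel(src, v_kernel, norm, alpha))`.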


The point of this article is to demonstrate the code generated for ARM NEON after porting my packed vector library to that instruction set.

The shuffle operations are relatively complex in comparison to the main operation, but are not part of the inner loop and don't contain memory accesses.


Some of the shuffle operations could be improved by implementing specialized versions, but the result is acceptable considering that all of them are generated by the following set of macros, with a little help from the preprocessor.

#define ZIP____(a,b,c,d,lane1,l1,lane2,l2,lane3,l3,lane4,l4)    \
template<typename T> T a##b##c##d(const T &);                   \
template<> inline float32x4_t a##b##c##d(const float32x4_t & v) \
{                                                               \
    return                                                      \
        vcombine_f32(                                           \
            vzip_f32(                                           \
                priv::shuffle_f32x2<l1,l1>(                     \
                    priv::get_lane_f32x4<lane1>(v)              \
                ),                                              \
                priv::shuffle_f32x2<l2,l2>(                     \
                    priv::get_lane_f32x4<lane2>(v)              \
                )                                               \
            ).val[0],                                           \
            vzip_f32(                                           \
                priv::shuffle_f32x2<l3,l3>(                     \
                    priv::get_lane_f32x4<lane3>(v)              \
                ),                                              \
                priv::shuffle_f32x2<l4,l4>(                     \
                    priv::get_lane_f32x4<lane4>(v)              \
                )                                               \
            ).val[0]                                            \
        );                                                      \
}

#define ZIP___(a,b,c,lane1,l1,lane2,l2,lane3,l3) \
    ZIP____(a,b,c,x,lane1,l1,lane2,l2,lane3,l3,0,0) \
    ZIP____(a,b,c,y,lane1,l1,lane2,l2,lane3,l3,0,1) \
    ZIP____(a,b,c,z,lane1,l1,lane2,l2,lane3,l3,1,0) \
    ZIP____(a,b,c,w,lane1,l1,lane2,l2,lane3,l3,1,1)

#define ZIP__(a,b,lane1,l1,lane2,l2) \
    ZIP___(a,b,x,lane1,l1,lane2,l2,0,0) \
    ZIP___(a,b,y,lane1,l1,lane2,l2,0,1) \
    ZIP___(a,b,z,lane1,l1,lane2,l2,1,0) \
    ZIP___(a,b,w,lane1,l1,lane2,l2,1,1)

#define ZIP_(a,lane1,l1) \
    ZIP__(a,x,lane1,l1,0,0) \
    ZIP__(a,y,lane1,l1,0,1) \
    ZIP__(a,z,lane1,l1,1,0) \
    ZIP__(a,w,lane1,l1,1,1)

#define ZIP \
    ZIP_(x,0,0) \
    ZIP_(y,0,1) \
    ZIP_(z,1,0) \
    ZIP_(w,1,1)


Note that this generates all 256 combinations for the ARM NEON implementation. (The x86 version is simpler because SSE/SSE2 provides machine instructions for these operations.)
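
To see what one of the generated functions computes, here is a portable emulation of the construction used in the main macro. It is a sketch only: the helper names (`get_half`, `dup_lane`, `zip_first`, `combine`, `swizzle`) are made up here and merely mimic the NEON intrinsics named in the comments, and the component selection is a runtime argument instead of part of the function name.

```cpp
#include <array>
#include <cassert>

using f32x2 = std::array<float, 2>;
using f32x4 = std::array<float, 4>;

// Emulates vget_low_f32 (half == 0) and vget_high_f32 (half == 1).
inline f32x2 get_half(const f32x4 &v, unsigned half)
{
    return {v[2 * half], v[2 * half + 1]};
}
// Emulates vdup_lane_f32: duplicates one element to both positions.
inline f32x2 dup_lane(const f32x2 &v, unsigned lane) { return {v[lane], v[lane]}; }
// Emulates vzip_f32(a, b).val[0]: interleaves the first elements.
inline f32x2 zip_first(const f32x2 &a, const f32x2 &b) { return {a[0], b[0]}; }
// Emulates vcombine_f32: joins two halves into one 4-element vector.
inline f32x4 combine(const f32x2 &lo, const f32x2 &hi)
{
    return {lo[0], lo[1], hi[0], hi[1]};
}

// Components are numbered x=0, y=1, z=2, w=3. As in the macro arguments,
// a component c lives in half c / 2 at element c % 2.
inline f32x4 swizzle(const f32x4 &v, unsigned c0, unsigned c1, unsigned c2, unsigned c3)
{
    assert(c0 < 4 && c1 < 4 && c2 < 4 && c3 < 4);
    auto pick = [&v](unsigned c) { return dup_lane(get_half(v, c / 2), c % 2); };
    return combine(zip_first(pick(c0), pick(c1)),
                   zip_first(pick(c2), pick(c3)));
}
```

For v = (10, 20, 30, 40), `swizzle(v, 2, 0, 3, 1)` corresponds to the generated function `zxwy` and yields (30, 10, 40, 20). Four choices for each of the four result positions give the 4^4 = 256 combinations.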


The base operations used in the main macro are implemented as follows:

namespace priv {
    template<unsigned t_lane> float32x2_t get_lane_f32x4(float32x4_t v);
    template<> inline float32x2_t get_lane_f32x4<0>(float32x4_t v) { return vget_low_f32(v); }
    template<> inline float32x2_t get_lane_f32x4<1>(float32x4_t v) { return vget_high_f32(v); }

    template<unsigned t_lo, unsigned t_hi> float32x2_t shuffle_f32x2(float32x2_t v);
    template<> inline float32x2_t shuffle_f32x2<0,0>(float32x2_t v) { return vdup_lane_f32(v, 0); }
    template<> inline float32x2_t shuffle_f32x2<0,1>(float32x2_t v) { return v; }
    template<> inline float32x2_t shuffle_f32x2<1,0>(float32x2_t v) { return vrev64_f32(v); }
    template<> inline float32x2_t shuffle_f32x2<1,1>(float32x2_t v) { return vdup_lane_f32(v, 1); }