For approximation of some trigonometric functions you can evaluate an approximating polynomial which can be faster if you can live with restrictions in the parameter range. This article is not about acquiring the coefficients of these polynomials but rather the implementation of their evaluation.

For a fast sine approximation I use the following polynomial as an example (from 3D Game Engine Architecture, David H. Eberly, second printing, ISBN 978-0-12-229064-0):

sinx/x = 1 - 0.16605x^2 + 0.00761x^4 + e(x)

The input range is [0,pi/2] and the error: |e(x)| <= 1.7e-4

As there are a few of these approximations and typing out all of those multiplies manually is error-prone, I use recursive template instantiation to accomplish this task. This does not compromise runtime performance as you will see later.

The implementation of this particular polynomial looks like this:

static inline __m128 fast_sin_0(__m128 angles) { static const float coeffs[] = { -0.16605f, 0.00761f }; return angles * ( ones() + polynomialeval_even_packed_t< sizeof(coeffs)/sizeof(coeffs[0]), float,__m128 >()(angles, coeffs) ); }

The polynomialeval_even_packed_t is a templated struct implementing the call operator:

template<unsigned N, typename real_t, typename packed_t> struct polynomialeval_even_packed_t { inline packed_t operator()(packed_t x, const real_t * coeffs) { return coeffs[N-1] * expeval_packed_t<N*2,real_t,packed_t>()(x) + polynomialeval_even_packed_t<N-1,real_t,packed_t>()(x, coeffs); } }; template<typename real_t, typename packed_t> struct polynomialeval_even_packed_t<0,real_t,packed_t> { inline packed_t operator()(packed_t x, const real_t * coeffs) { return math_t<real_t,packed_t>::zeroes(); } };

This struct returns `sum[0<=i<N](coeffs[i]*x^(n*2+2))`, which means for the case of fast_sin_0() which uses a coefficient array of size 2:

sum[0<=i<2](coeffs[i]*x^(n*2+2)) = coeffs[0]*x^2 + coeffs[1]*x^4

The expeval_packed_t struct returns x^N.

Now lets look at the code generated by VC++12 for x64:

?fast_sin_0@?... PROC 00000 0f 28 21 movaps xmm4, XMMWORD PTR [rcx] 00003 0f 57 c9 xorps xmm1, xmm1 00006 0f 28 c4 movaps xmm0, xmm4 00009 0f 59 c4 mulps xmm0, xmm4 0000c 0f 28 d8 movaps xmm3, xmm0 0000f 0f 59 05 00 00 00 00 mulps xmm0, XMMWORD PTR __xmm@be2a0903be2a0903be2a0903be2a0903 00016 0f 59 dc mulps xmm3, xmm4 00019 0f 58 c1 addps xmm0, xmm1 0001c 0f 59 dc mulps xmm3, xmm4 0001f 0f 59 1d 00 00 00 00 mulps xmm3, XMMWORD PTR __xmm@3bf95d4f3bf95d4f3bf95d4f3bf95d4f 00026 0f 58 c3 addps xmm0, xmm3 00029 0f 58 05 00 00 00 00 addps xmm0, XMMWORD PTR ?ones_@?1??... 00030 0f 59 c4 mulps xmm0, xmm4 00033 c3 ret 0 ?fast_sin_0@?... ENDP

This is not bad and it was -- as requested -- inlined (not shown here). The conversion of the scalar coefficients to packed values was done at compile time. There's an unneeded addition of zero involved which is used to terminate the template instantiation recursion. I compiled with /fp:precise, so compiling with /fp:fast might lead to the removal of that addition.