For approximation of some trigonometric functions you can evaluate an approximating polynomial which can be faster if you can live with restrictions in the parameter range. This article is not about acquiring the coefficients of these polynomials but rather the implementation of their evaluation.


For a fast sine approximation I use the following polynomial as an example (from 3D Game Engine Architecture, David H. Eberly, second printing, ISBN 978-0-12-229064-0):

sinx/x = 1 - 0.16605x^2 + 0.00761x^4 + e(x)

The input range is [0,pi/2] and the error: |e(x)| <= 1.7e-4

As there are a few of these approximations and typing out all of those multiplies manually is error-prone, I use recursive template instantiation to accomplish this task. This does not compromise runtime performance as you will see later.

The implementation of this particular polynomial looks like this:

static inline __m128 fast_sin_0(__m128 angles) {
    static const float coeffs[] = { -0.16605f, 0.00761f };
    return angles * (
        ones() + 
        >()(angles, coeffs)

The polynomialeval_even_packed_t is a templated struct implementing the call operator:

template<unsigned N, typename real_t, typename packed_t>
    struct polynomialeval_even_packed_t {
        inline packed_t operator()(packed_t x, const real_t * coeffs) {
            return coeffs[N-1] * expeval_packed_t<N*2,real_t,packed_t>()(x) +
                polynomialeval_even_packed_t<N-1,real_t,packed_t>()(x, coeffs);
template<typename real_t, typename packed_t>
struct polynomialeval_even_packed_t<0,real_t,packed_t> {
    inline packed_t operator()(packed_t x, const real_t * coeffs) {
        return math_t<real_t,packed_t>::zeroes();

This struct returns sum[0<=i<N](coeffs[i]*x^(n*2+2)), which means for the case of fast_sin_0() which uses a coefficient array of size 2:

sum[0<=i<2](coeffs[i]*x^(n*2+2)) = coeffs[0]*x^2 + coeffs[1]*x^4

The expeval_packed_t struct returns x^N.


Now lets look at the code generated by VC++12 for x64:

?fast_sin_0@?... PROC

  00000	0f 28 21	 movaps	 xmm4, XMMWORD PTR [rcx]
  00003	0f 57 c9	 xorps	 xmm1, xmm1
  00006	0f 28 c4	 movaps	 xmm0, xmm4
  00009	0f 59 c4	 mulps	 xmm0, xmm4
  0000c	0f 28 d8	 movaps	 xmm3, xmm0
  0000f	0f 59 05 00 00
	00 00		 mulps	 xmm0, XMMWORD PTR __xmm@be2a0903be2a0903be2a0903be2a0903
  00016	0f 59 dc	 mulps	 xmm3, xmm4
  00019	0f 58 c1	 addps	 xmm0, xmm1
  0001c	0f 59 dc	 mulps	 xmm3, xmm4
  0001f	0f 59 1d 00 00
	00 00		 mulps	 xmm3, XMMWORD PTR __xmm@3bf95d4f3bf95d4f3bf95d4f3bf95d4f
  00026	0f 58 c3	 addps	 xmm0, xmm3
  00029	0f 58 05 00 00
	00 00		 addps	 xmm0, XMMWORD PTR ?ones_@?1??...
  00030	0f 59 c4	 mulps	 xmm0, xmm4
  00033	c3		 ret	 0

?fast_sin_0@?... ENDP

This is not bad and it was -- as requested -- inlined (not shown here). The conversion of the scalar coefficients to packed values was done at compile time. There's an unneeded addition of zero involved which is used to terminate the template instantiation recursion. I compiled with /fp:precise, so compiling with /fp:fast might lead to the removal of that addition.