Of all the articles at #AltDevBlogADay I really enjoy the ones on optimization. I've always worked at smaller companies where programmers act in a more generalist capacity. While this has given me the opportunity to solve a wide range of problems, I have never had access to someone who could formally introduce me to the black art of optimization. Because of this I've had to scour books and blogs looking for ways to increase the performance of my own code and fill in the blanks from there.

So for my "show your ignorance" post I would like to show how I have reasoned about creating a SIMD-enabled math library. I consider it a clever way to approach the problem, as I haven't seen a similar implementation. Having said that, there might be a good reason why it shouldn't be done this way, so feedback is appreciated. If it's a valid approach I plan on continuing development and releasing the library under a permissive license.

So let's begin.

The Problem

So we're writing a math library. At the very least we'll need abstractions for vectors, in 2D, 3D and 4D flavors, and for matrices. We're developing for the PC, so SSE is available, and we write our library using SSE intrinsics.

Now we want to port our work over to iOS. The SIMD instruction set for ARM processors is NEON. Unfortunately it is not available on devices older than the iPhone 3GS and iPad, so to support those devices we'll also need a version that uses regular old floats.

This is turning out to be quite a bit of work.

Rather than rewriting all our math code for each architecture, we can define a SIMD interface and build our math classes on top of it. Then, when porting to a different architecture, we just implement the interface and all the math classes and functions should work unchanged.

The Interface

The SIMD interface should expose common operations available in the instruction set, such as basic math operations. It should also expose more complicated operations, such as the dot product, square root and shuffling. For the sake of this article we’ll just define enough to normalize a 3D vector.

In pseudocode this would look like:

interface simd_type
{
    simd_type();
    simd_type(value0, value1, value2, value3);
    simd_type operator* (rhs);
};

simd_type dot3(value1, value2);
simd_type inv_sqrt(value);

The C++ analog of an interface is a class with only pure virtual methods. That isn't an acceptable approach here: virtual calls go through a vtable and prevent inlining, and we want this code to run as fast as possible. So instead we'll create a concrete SIMD type that follows a consistent naming scheme for its member functions, and our math classes will take the SIMD type as a template argument. Something like the ill-fated concepts feature that was dropped from C++0x would be useful here, because it would let us verify that a SIMD type implements the interface properly.
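
In the absence of concepts, one crude substitute is a function template that simply exercises every operation in the interface; instantiating it for a given SIMD type turns any missing operation into a compile error. Something like this sketch (the name simd_concept_check is my own, not part of any library):

template <typename Simd>
void simd_concept_check()
{
    Simd a;                              // default constructor
    Simd b(1.0f, 2.0f, 3.0f, 4.0f);      // four-component constructor
    Simd c = b * b;                      // component-wise multiply
    Simd d = dot3(b, b);                 // dot product of the first three lanes
    Simd e = inv_sqrt(b);                // reciprocal square root
    (void)a; (void)c; (void)d; (void)e;  // silence unused-variable warnings
}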

So our 3D vector class would look something like this.

template <typename Real, typename Rep>
class vector3
{
public:
    vector3()
    { }

    vector3(Real x, Real y, Real z)
    : _rep(x, y, z, (Real)0)   // pad the unused fourth lane with zero
    { }

    friend vector3 normalize(const vector3& value)
    {
        // v / |v| == v * (1 / sqrt(v . v))
        return vector3(value._rep * inv_sqrt(dot3(value._rep, value._rep)));
    }

private:
    vector3(const Rep& rep)
    : _rep(rep)
    { }

    Rep _rep;
};

Then we create an SSE implementation of the SIMD type; all of the intrinsics used below come from <xmmintrin.h>.

#include <xmmintrin.h>

class sse_float
{
public:
    inline sse_float()
    { }

    // Note that _mm_set_ps takes its arguments from high lane to low lane.
    inline sse_float(float value0, float value1, float value2, float value3)
    : _values(_mm_set_ps(value3, value2, value1, value0))
    { }

    inline sse_float operator* (const sse_float& rhs) const
    {
        return sse_float(_mm_mul_ps(_values, rhs._values));
    }

    inline friend sse_float dot3(const sse_float& value1, const sse_float& value2)
    {
        // Multiply component-wise, splat each of the first three lanes
        // across a register and sum them, so the dot product ends up
        // replicated in every lane of the result.
        const __m128 t0 = _mm_mul_ps(value1._values, value2._values);
        const __m128 t1 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(0,0,0,0));
        const __m128 t2 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(1,1,1,1));
        const __m128 t3 = _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2,2,2,2));

        return sse_float(_mm_add_ps(t1, _mm_add_ps(t2, t3)));
    }

    inline friend sse_float inv_sqrt(const sse_float& value)
    {
        // _mm_rsqrt_ps is only a rough estimate, so perform one
        // Newton-Raphson iteration on the reciprocal square root:
        // y(n+1) = (y(n) * (3 - x * y(n)^2)) / 2
        const __m128 yn = _mm_rsqrt_ps(value._values);
        const __m128 xyn2 = _mm_mul_ps(_mm_mul_ps(value._values, yn), yn);

        return sse_float(_mm_mul_ps(_mm_mul_ps(_mm_set_ps1(0.5f), yn),
                            _mm_sub_ps(_mm_set_ps1(3.0f), xyn2)));
    }

private:
    inline sse_float(const __m128 values)
    : _values(values)
    { }

    __m128 _values;
};
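
For the NEON port the same interface can be implemented with the intrinsics from <arm_neon.h>. What follows is only a rough, untested sketch of what that might look like; in particular the dot product could likely be done more efficiently with NEON's pairwise adds instead of the scalar extraction shown here.

#include <arm_neon.h>

class neon_float
{
public:
    inline neon_float()
    { }

    inline neon_float(float value0, float value1, float value2, float value3)
    {
        const float values[4] = { value0, value1, value2, value3 };
        _values = vld1q_f32(values);
    }

    inline neon_float operator* (const neon_float& rhs) const
    {
        return neon_float(vmulq_f32(_values, rhs._values));
    }

    inline friend neon_float dot3(const neon_float& value1, const neon_float& value2)
    {
        // Multiply component-wise, extract the first three lanes,
        // sum them in scalar code and splat the result into every lane.
        const float32x4_t t0 = vmulq_f32(value1._values, value2._values);
        const float dot = vgetq_lane_f32(t0, 0)
                        + vgetq_lane_f32(t0, 1)
                        + vgetq_lane_f32(t0, 2);

        return neon_float(vdupq_n_f32(dot));
    }

    inline friend neon_float inv_sqrt(const neon_float& value)
    {
        // vrsqrteq_f32 gives a rough estimate; vrsqrtsq_f32 computes the
        // Newton-Raphson step (3 - x * y^2) / 2 for us.
        const float32x4_t yn = vrsqrteq_f32(value._values);
        const float32x4_t step = vrsqrtsq_f32(vmulq_f32(value._values, yn), yn);

        return neon_float(vmulq_f32(yn, step));
    }

private:
    inline neon_float(const float32x4_t values)
    : _values(values)
    { }

    float32x4_t _values;
};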

The following code demonstrates normalizing a vector using the SSE implementation.

int main()
{
    typedef vector3<float, sse_float> vector3f;

    vector3f test(1.0f, 2.0f, 3.0f);
    vector3f normal = normalize(test);
}

Additionally, we can write an implementation for architectures without SIMD support.

#include <cmath>

template <typename Real>
class packed_type
{
public:
    inline packed_type()
    { }

    inline packed_type(Real value)
    {
        // Splat a single value across all four components.
        _values[0] = value;
        _values[1] = value;
        _values[2] = value;
        _values[3] = value;
    }

    inline packed_type(Real value0, Real value1, Real value2, Real value3)
    {
        _values[0] = value0;
        _values[1] = value1;
        _values[2] = value2;
        _values[3] = value3;
    }

    inline packed_type operator* (const packed_type& rhs) const
    {
        return packed_type(
            _values[0] * rhs._values[0],
            _values[1] * rhs._values[1],
            _values[2] * rhs._values[2],
            _values[3] * rhs._values[3]);
    }

    inline friend packed_type dot3(const packed_type& value1, const packed_type& value2)
    {
        // The single-value constructor splats the dot product across all
        // four components, matching the behavior of the SSE version.
        return packed_type(
            (value1._values[0] * value2._values[0]) +
            (value1._values[1] * value2._values[1]) +
            (value1._values[2] * value2._values[2]));
    }

    inline friend packed_type inv_sqrt(const packed_type& value)
    {
        return packed_type(
            (Real)1 / std::sqrt(value._values[0]),
            (Real)1 / std::sqrt(value._values[1]),
            (Real)1 / std::sqrt(value._values[2]),
            (Real)1 / std::sqrt(value._values[3]));
    }

private:
    Real _values[4];
};
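
Because the representation is just a template argument, switching implementations is a one-line change to a typedef; the math classes themselves are untouched. For example (keying off the __SSE__ macro that GCC and Clang define when SSE is enabled):

#if defined(__SSE__)
typedef vector3<float, sse_float> vector3f;
#else
typedef vector3<float, packed_type<float> > vector3f;
#endif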

Results

By adding an additional layer below the math library, the amount of code required to support different architectures is minimized. Assuming the compiler does a good job of inlining the member functions, performance should be as good as code written without the extra layer.
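
That assumption is worth checking, either by inspecting the generated assembly or with a quick timing run. Here is a minimal timing sketch using std::clock; the normalized vectors are written back to the array so the compiler cannot optimize the work away:

#include <cstdio>
#include <ctime>

int main()
{
    typedef vector3<float, sse_float> vector3f;

    static vector3f vectors[4096];
    for (int i = 0; i < 4096; ++i)
        vectors[i] = vector3f((float)(i + 1), (float)(i + 2), (float)(i + 3));

    const std::clock_t start = std::clock();

    for (int pass = 0; pass < 10000; ++pass)
        for (int i = 0; i < 4096; ++i)
            vectors[i] = normalize(vectors[i]);

    const std::clock_t stop = std::clock();
    std::printf("%.3f seconds\n", (double)(stop - start) / CLOCKS_PER_SEC);

    return 0;
}

Comparing the numbers against a hand-written SSE loop, and against the packed_type fallback, gives a good idea of how much (if anything) the abstraction costs.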

Resources

Designing Fast Cross-Platform SIMD Vector Libraries

Becoming a console programmer : Math Libraries Part 2

Square Roots in vivo: normalizing vectors

Software optimization resources