Performance metaprogramming

Programming comes down to making a solution with the appropriate set of trade-offs for a set of constraints. A typical choice is speed versus memory. One easy way to optimize a piece of code is to create specialized versions for different circumstances. Now, I’m not much of for fancy C++. Given the option I’d program in C99. I regard the upcoming C++0x standard with similar dread as I would a high school reunion. But the canonical example is probably using templatatized types to implicitly create versions of a routine for all widths of integer and float types. Generically I call this approach performance metaprogramming.

Although this seems like an unalloyed good, often this comes down to choosing performance over clarity. This is first of three posts about times I made code less readable to make it go faster.

Templates

This snippet of code abstracts some SPU code I changed. The specific problem was the test for m_do_expensive.

void DoCollision1(Manifold* manifold)
{
  // some setup

  for (int i = 0; i < manifold->m_num_cached_objects; ++i)
  {
    CachedObject* cached_object = manifold->m_cached_objects + i;
    for (int j = 0; j < cached_object->m_num_rigid_bodies; ++j)
    {
      RigidBody* rigid_body = cached_object->m_rigid_bodies + j;

      for (int k = 0; k < rigid_body->m_num_collision_prims; ++k)
      {
        COLL::Prim* prim = rigid_body->m_collision_prims + k;

        if (manifold->m_do_expensive)
        {
          DoExpensiveFunction(prim);
        }

        DoCheapFunction(prim);
      }
    }
  }
}

void ManifoldJob()
{
  Manifold* manifold;

  // ...

  DoCollision0(manifold);
  DoCollision1(manifold);
}

The SPU as you know only static branch prediction. Explicit hint instructions (hbr, hbra, hbrr) are added at compile time to indicate that a branch is coming up. The cost for a non-hinted branch is 18 cycles. Hint instructions can be explicitly added as well (most conveniently via the gcc __builtin_expect extension)

// we anticipate this conditional will be false
if (__builtin_expect((a>b),0))
  c += a;
else
  d += 1;

As it turned out, m_do_expensive is constant for any run of ManifoldJob. What we’d like to do is move the test for m_do_expensive out of the inner loop to as high a level as possible. Essentially we need two versions of DoCollision1, but without the associated maintenance. This, however, seemed like a good situation for the use of templates.

template <bool expensive_test>
void DoCollision1Internal(Manifold* manifold)
{
  // some setup

  for (int i = 0; i < manifold->m_num_cached_objects; ++i)
  {
    CachedObject* cached_object = manifold->m_cached_objects + i;
    for (int j = 0; j < cached_object->m_num_rigid_bodies; ++j)
    {
      RigidBody* rigid_body = cached_object->m_rigid_bodies + j;

      for (int k = 0; k < rigid_body->m_num_collision_prims; ++k)
      {
        COLL::Prim* prim = rigid_body->m_collision_prims + k;

        if (expensive_test)
        {
          DoExpensiveFunction(prim);
        }

        DoCheapFunction(prim);
      }
    }
  }
}

void DoCollision1(Manifold* manifold)
{
  if (manifold->m_do_expensive)
    DoCollision1Internal<true>(manifold);
  else
    DoCollision1Internal<false>(manifold);
}

Moving the test into a template parameter, the compiler has the information it needs to completely elide the test (and, does). This is all in the same translation unit as well, avoiding the usual template instantiation mess.

Next up: sorting by branchiness

Sometimes an instance has characteristics that affect many features of its behavior, and is expressed via the same boolean tests in many different functions. A typical example is whether a path follower is looping or not. Next time I’ll break down the process of identifying and taking advantage of these sort of characteristics to make a system run faster.