I was recently optimizing some OpenGL ES 2.0 shaders for iOS/Android, and it was funny to see how performance tricks that were cool in 2001 are having their revenge again. Here’s a small example of starting with a normal-mapped Blinn-Phong shader and optimizing it to run several times faster. Most of the clever stuff below was actually done by ReJ, props to him!
Here’s a small test I’ll be working on: just a single plane with albedo and normal map textures:

I’ll be testing on an iPhone 3Gs running iOS 4.2.1. The timer starts before glClear() and stops after a glFinish() call that I added right after drawing the mesh.
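A minimal sketch of that measurement, with the GL work passed in as a callable so the snippet stays self-contained. The names `timeFrameMs`, `averageFrameMs` and `drawFrame` are hypothetical, and averaging over several frames is my own addition rather than something the test app is known to do:

```cpp
#include <cassert>
#include <chrono>

// Times a single frame in milliseconds. `drawFrame` is a hypothetical callable
// that would issue glClear(), draw the mesh, and then call glFinish() so the
// GPU work has really completed before the clock is read again.
template <typename DrawFn>
double timeFrameMs(DrawFn drawFrame)
{
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now(); // started before glClear()
    drawFrame();                                  // glClear(); draw; glFinish();
    const Clock::time_point stop = Clock::now();  // stopped after glFinish()
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Averages the frame time over `frames` runs to smooth out noise
// (an assumption of this sketch, not part of the original setup).
template <typename DrawFn>
double averageFrameMs(DrawFn drawFrame, int frames)
{
    assert(frames > 0);
    double total = 0.0;
    for (int i = 0; i < frames; ++i)
        total += timeFrameMs(drawFrame);
    return total / frames;
}
```

Without the glFinish() inside the callable, the timer would mostly measure how long it takes to queue commands, not how long the GPU takes to run them.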
Let’s start with an initial naïve shader version:
#ifdef VERTEX
attribute vec4 a_position;
attribute vec2 a_uv;
attribute vec3 a_normal;
attribute vec4 a_tangent;
uniform mat4 u_mvp;
uniform mat4 u_world2object;
uniform vec4 u_worldlightdir;
uniform vec4 u_worldcampos;
varying vec2 v_uv;
varying vec3 v_lightdir;
varying vec3 v_viewdir;
void main()
{
    gl_Position = u_mvp * a_position;
    v_uv = a_uv;
    vec3 bitan = cross (a_normal.xyz, a_tangent.xyz) * a_tangent.w;
    mat3 tsprotation = mat3 (
        a_tangent.x, bitan.x, a_normal.x,
        a_tangent.y, bitan.y, a_normal.y,
        a_tangent.z, bitan.z, a_normal.z);
    vec3 objLightDir = (u_world2object * u_worldlightdir).xyz;
    vec3 objCamPos = (u_world2object * u_worldcampos).xyz;
    vec3 objViewDir = objCamPos - a_position.xyz;
    v_lightdir = tsprotation * objLightDir;
    v_viewdir = tsprotation * objViewDir;
}
#endif
#ifdef FRAGMENT
precision highp float;
uniform vec4 u_lightcolor;
uniform vec4 u_matcolor;
uniform float u_spec;
varying vec2 v_uv;
varying vec3 v_lightdir;
varying vec3 v_viewdir;
uniform sampler2D u_texcolor;
uniform sampler2D u_texnormal;
void main()
{
    vec4 albedo = texture2D (u_texcolor, v_uv) * u_matcolor;
    vec3 normal = texture2D (u_texnormal, v_uv).rgb * 2.0 - 1.0;
    vec3 halfdir = normalize (normalize(v_lightdir) + normalize(v_viewdir));
    float diff = max (0.0, dot (normal, v_lightdir));
    float nh = max (0.0, dot (normal, halfdir));
    float spec = pow (nh, u_spec);
    vec4 c = albedo * u_lightcolor * diff + u_lightcolor * spec;
    gl_FragColor = c;
}
#endif
Should be pretty self-explanatory to anyone who’s familiar with tangent space normal mapping and the Blinn-Phong BRDF. Running time: 24.5 milliseconds. On iPhone 4’s Retina resolution, this would be about 4x slower, since there are four times as many pixels to shade!
What can we do next? On mobile platforms, using the appropriate precision for variables is often very important, especially in a fragment shader. So let’s go and add highp/mediump/lowp qualifiers to the fragment shader: https://gist.github.com/783703/05e78340b12739e853ce031bd0388430ea95f2a6
Still the same running time! Alas, iOS does not have low level shader analysis tools, so we can’t really tell why that is happening. We could be limited by something else (e.g. normalizing vectors and computing pow() being the bottlenecks that run in parallel with all low precision stuff), or the driver might be promoting most of our computations to higher precision because it feels like it. It’s a magic box!
Let’s start approximating instead. How about computing normalized view direction per vertex, and interpolating that for the fragment shader? It won’t be entirely “correct”, but hey, it’s a phone we’re talking about. https://gist.github.com/783703/1e4fd0daa384d308d125a748985e8e203e49625a
15 milliseconds! But… the rendering is wrong; everything turned white near the bottom of the screen:

Turns out PowerVR SGX (the GPU in all current iOS devices) really does mean “low precision” when we add two lowp vectors and normalize the result. Let’s try promoting one of them to medium precision with a “varying mediump vec3 v_viewdir”: https://gist.github.com/783703/591eb83dacaae3840cc4e4d3d8b95a4fc3abdd65
That fixed rendering, but we’re back to 24.5 milliseconds. Sad shader writers are sad… oh shader performance analysis tools, where art thou?
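For some intuition on how a lowp normalize can go wrong, here is a rough CPU model. It assumes lowp behaves like the minimum the GLSL ES 1.00 spec asks for (range of about ±2, steps of 1/256) and that out-of-range values saturate. Both the `lowp()` helper and the saturation behaviour are assumptions of this sketch; what SGX really does internally is not something I can confirm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Hypothetical model of a lowp value: saturate to [-2, 2) and round to 1/256.
float lowp(float v)
{
    const float step = 1.0f / 256.0f;
    const float clamped = std::min(std::max(v, -2.0f), 2.0f - step);
    return std::round(clamped / step) * step;
}

// normalize(a + b) with the sum, the squared length and the result all
// forced through lowp().
Vec3 lowpNormalizedSum(Vec3 a, Vec3 b)
{
    const Vec3 s = { lowp(a.x + b.x), lowp(a.y + b.y), lowp(a.z + b.z) };
    // The squared length of a sum of two unit vectors can reach 4,
    // which does not fit into the modelled range and saturates just below 2.
    const float lenSq = lowp(s.x * s.x + s.y * s.y + s.z * s.z);
    assert(lenSq > 0.0f);
    const float invLen = 1.0f / std::sqrt(lenSq);
    return { lowp(s.x * invLen), lowp(s.y * invLen), lowp(s.z * invLen) };
}
```

In this model, two nearly parallel unit vectors produce a "normalized" result noticeably longer than 1. Feeding an N.H above 1 into pow() would blow out the specular term, which is at least consistent with the white area in the screenshot.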
Let’s try approximating some more: compute the half-vector in the vertex shader, and interpolate the normalized value. This would get rid of all normalizations in the fragment shader. https://gist.github.com/783703/6360c2912b860aa30415e5120ef147169274cd71
16.3 milliseconds, not too bad! We still have pow() computed in the fragment shader, and that one is probably not the fastest operation there…
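The per-vertex half-vector math is simple enough to sketch on the CPU. The `Vec3` helpers below are hypothetical stand-ins for the GLSL built-ins, and unlike the shader this version normalizes both inputs defensively:

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 add(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
float length(Vec3 v) { return std::sqrt(dot(v, v)); }

Vec3 normalize(Vec3 v)
{
    const float len = length(v);
    assert(len > 0.0f);
    return { v.x / len, v.y / len, v.z / len };
}

// Blinn-Phong half vector: the normalized sum of light and view directions.
Vec3 halfVector(Vec3 lightDir, Vec3 viewDir)
{
    return normalize(add(normalize(lightDir), normalize(viewDir)));
}

// Linear interpolation, which is what the rasterizer does to a varying
// between vertices (ignoring perspective correction).
Vec3 lerp(Vec3 a, Vec3 b, float t)
{
    return { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t, a.z + (b.z - a.z) * t };
}
```

Interpolating two unit-length half vectors gives a vector shorter than 1 in between (for two perpendicular ones, the midpoint has length of about 0.71). Skipping the per-pixel normalize means accepting that error, which is small when neighbouring vertices have similar half vectors.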
Almost a decade ago, a very common trick was to use a lookup texture to do the lighting. For example, a 2D texture indexed by (N.L, N.H). Since all lighting data would be “baked” into the texture, it does not necessarily have to be Blinn-Phong even; we can prepare faux-anisotropic, metallic, toon-shading or other fancy BRDFs there, as long as they can be expressed in terms of N.L and N.H. So let’s try creating a 128×128 RGBA lookup texture and using that: https://gist.github.com/783703/87f1cf5529d644cab16123550e809e9f7598f4f3
Quick and not particularly efficient code to create the lighting lookup texture for Blinn-Phong:
// lr,lg,lb - light color
// spec - specular power
// data - width*height*4 bytes of RGBA output (assumed to be unsigned char)
int idx = 0;
for (int y = 0; y < height; ++y)
{
    for (int x = 0; x < width; ++x, idx+=4)
    {
        float vx = float(x) / width;   // horizontal axis: N.L
        float vy = float(y) / height;  // vertical axis: N.H
        float nl = vx;
        float nh = vy;
        float s = powf (nh, spec);     // Blinn-Phong specular term
        data[idx+0] = nl * lr * 255.0f; // diffuse, tinted by light color
        data[idx+1] = nl * lg * 255.0f;
        data[idx+2] = nl * lb * 255.0f;
        data[idx+3] = s * 255.0f;       // specular intensity goes into alpha
    }
}
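To see how such a table gets consumed, here is a self-contained CPU sketch that builds the same table and then does what the fragment shader would: fetch by (N.L, N.H) and combine the result with albedo. `Texel`, `toByte`, `buildLUT`, `sampleLUT` and `shadeChannel` are hypothetical names, the fetch uses nearest filtering for simplicity, and the specular power of 32 in the note below is just an example value.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Texel { std::uint8_t r, g, b, a; };

// Converts a [0,1] intensity to a byte, clamping out-of-range input.
std::uint8_t toByte(float v)
{
    return static_cast<std::uint8_t>(std::min(std::max(v, 0.0f), 1.0f) * 255.0f);
}

// Same layout as the loop above: x indexes N.L, y indexes N.H.
std::vector<Texel> buildLUT(int width, int height,
                            float lr, float lg, float lb, float spec)
{
    assert(width > 0 && height > 0);
    std::vector<Texel> lut(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y)
    {
        const float nh = float(y) / float(height);
        const float s = std::pow(nh, spec);
        for (int x = 0; x < width; ++x)
        {
            const float nl = float(x) / float(width);
            lut[static_cast<std::size_t>(y) * width + x] =
                { toByte(nl * lr), toByte(nl * lg), toByte(nl * lb), toByte(s) };
        }
    }
    return lut;
}

// Nearest-neighbour fetch; coordinates outside [0,1] land on the edge texels.
Texel sampleLUT(const std::vector<Texel>& lut, int width, int height, float nl, float nh)
{
    const int x = std::min(std::max(static_cast<int>(nl * width), 0), width - 1);
    const int y = std::min(std::max(static_cast<int>(nh * height), 0), height - 1);
    return lut[static_cast<std::size_t>(y) * width + x];
}

// One color channel of "albedo * lighting + specular", saturated to 1.
float shadeChannel(float albedo, std::uint8_t lutColor, std::uint8_t lutSpec)
{
    return std::min(1.0f, albedo * (lutColor / 255.0f) + lutSpec / 255.0f);
}
```

One thing this makes visible: with the `x / width` mapping, the last row of a 128-texel table stores N.H = 127/128 rather than 1, so for a specular power of 32 the brightest specular entry is roughly 0.78 and the table never quite reaches full intensity.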
9.1 milliseconds! We lost some precision in the specular though (it’s dimmer):

What else can be done? Notice that we clamp N.L and N.H values in the fragment shader, but this could be done just as well by the texture sampler, if we set the texture’s addressing mode to CLAMP_TO_EDGE. Let’s get rid of the clamps: https://gist.github.com/783703/e24a2475fded83d2196372c8092a0d8de80a98eb
This is 8.3 milliseconds, or 7.6 milliseconds if we reduce our lighting texture resolution to 32×128.
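Why dropping the clamps is safe can be checked with a tiny CPU model of the sampler. This is only a sketch: `fetchClampToEdge` is a hypothetical stand-in for a nearest-filtered 1D texture fetch with CLAMP_TO_EDGE addressing, not real GL code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Models a nearest-filtered fetch where out-of-range coordinates are
// clamped to the first or last texel (CLAMP_TO_EDGE behaviour).
float fetchClampToEdge(const std::vector<float>& texels, float u)
{
    assert(!texels.empty());
    const int n = static_cast<int>(texels.size());
    const int i = static_cast<int>(std::floor(u * n));
    return texels[std::min(std::max(i, 0), n - 1)];
}

// Variant A: clamp the dot product in the "shader", then fetch.
float lookupWithShaderClamp(const std::vector<float>& texels, float d)
{
    return fetchClampToEdge(texels, std::max(0.0f, d));
}

// Variant B: pass the raw dot product and let the sampler clamp.
float lookupWithSamplerClamp(const std::vector<float>& texels, float d)
{
    return fetchClampToEdge(texels, d);
}
```

Both variants return the same texel for any input, including negative dot products, so the max() in the shader is redundant work once the addressing mode is set.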
Should we stop there? Not necessarily. For example, the shader is still multiplying albedo with a per-material color. Maybe that’s not very useful and can be let go. Maybe we can also make specular be always white?
// Final for now...
// iPhone 3Gs: 5.9ms
#ifdef VERTEX
attribute vec4 a_position;
attribute vec2 a_uv;
attribute vec3 a_normal;
attribute vec4 a_tangent;
uniform mat4 u_mvp;
uniform mat4 u_world2object;
uniform vec4 u_worldlightdir;
uniform vec4 u_worldcampos;
varying vec2 v_uv;
varying vec3 v_lightdir;
varying vec3 v_halfdir;
void main()
{
    gl_Position = u_mvp * a_position;
    v_uv = a_uv;
    vec3 bitan = cross (a_normal.xyz, a_tangent.xyz) * a_tangent.w;
    mat3 tsprotation = mat3 (
        a_tangent.x, bitan.x, a_normal.x,
        a_tangent.y, bitan.y, a_normal.y,
        a_tangent.z, bitan.z, a_normal.z);
    vec3 objLightDir = (u_world2object * u_worldlightdir).xyz;
    vec3 objCamPos = (u_world2object * u_worldcampos).xyz;
    vec3 objViewDir = objCamPos - a_position.xyz;
    v_lightdir = tsprotation * objLightDir;
    vec3 viewdir = normalize(tsprotation * objViewDir);
    v_halfdir = normalize (v_lightdir + viewdir);
}
#endif
#ifdef FRAGMENT
uniform lowp vec4 u_lightcolor;
uniform lowp vec4 u_matcolor;
uniform mediump float u_spec;
varying mediump vec2 v_uv;
varying lowp vec3 v_lightdir;
varying lowp vec3 v_halfdir;
uniform sampler2D u_texcolor;
uniform sampler2D u_texnormal;
uniform sampler2D u_texLUT;
void main()
{
    lowp vec4 albedo = texture2D (u_texcolor, v_uv);
    lowp vec3 normal = texture2D (u_texnormal, v_uv).rgb * 2.0 - 1.0;
    lowp float diff = dot (normal, v_lightdir);
    lowp float nh = dot (normal, v_halfdir);
    lowp vec2 luv = vec2(diff,nh);
    lowp vec4 l = texture2D (u_texLUT, luv);
    lowp vec4 c = albedo * l + l.a;
    gl_FragColor = c;
}
#endif
How fast is this? 5.9 milliseconds, or over 4 times faster than our original shader.
Could it be made faster? Maybe; that’s an exercise for the reader :) I tried computing just the RGB color channels and setting alpha to zero, but that got slightly slower. Without real shader analysis tools it’s hard to see where or if additional cycles could be squeezed out.
I’m attaching an Xcode project with the sources, textures and shaders of this experiment. Notes about it: only tested on iPhone 3Gs (it will probably crash on iPhone 3G, and iPad will have the wrong aspect ratio). Might not work at all! The shader is read from Resources/Shaders/shader.txt; next to it are the shader versions from each step of this experiment. Enjoy!
This is a cross-post from my blog: http://aras-p.info/blog/2011/02/01/ios-shader-tricks-or-its-2001-all-over-again/