For me, and for a lot of other programmers, floating point numbers have long been one of those black boxes. There is a lot of (good and bad) information on the web about floating point; most of it describes the data format, how the bits are interpreted, what epsilon values you should use, or how to deal with accuracy issues in floats. Hardly any article talks about where all of this actually comes from or how fundamental floating point operations are implemented.

So in this article I will talk about how some of these operations are implemented, specifically multiplication, addition and fused multiply-add. I won’t talk about decimal-to-float conversions, float-to-double or float-to-int casts, division, comparisons or trigonometric functions. If you’re interested in these I suggest taking a look at John Hauser’s excellent SoftFloat library listed below. It’s the same library I borrowed the code samples in this article from.

For convenience’s sake I’ll also show an image of the floating point data layout taken from Wikipedia, because this might help explain some of the magic numbers and masks used in the code below. The hardware diagrams are taken from the “Floating-Point Fused Multiply-Add Architectures” paper linked below and are diagrams for **double precision** implementations (I wasn’t able to produce pictures this pretty myself). Keep that in mind when reading them.
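To make the masks a bit more concrete, here is a minimal sketch that pulls the three fields out of a float. It assumes `float` is an IEEE 754 single precision value on your platform; the `FloatFields` struct and the `decode_float` name are made up for this illustration and are not part of SoftFloat.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a single precision float. */
typedef struct {
    uint32_t sign;      /* 1 bit                               */
    uint32_t exponent;  /* 8 bits, biased by 127               */
    uint32_t fraction;  /* 23 bits, without the implicit 1     */
} FloatFields;

static FloatFields decode_float(float value)
{
    uint32_t bits;
    FloatFields f;

    /* memcpy is the portable way to read the bit pattern of a float */
    memcpy(&bits, &value, sizeof bits);

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFF;
    f.fraction = bits & 0x007FFFFF;
    return f;
}
```

For example, -2.5 is -1.25 * 2^1, so it decodes to a sign of 1, a biased exponent of 128 and a fraction of 0x200000 (the 0.25 part of the mantissa).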

The way IEEE 754 multiplication works is identical to how it works for regular scientific notation. Simply multiply the coefficients and add the exponents. However, because this is done in hardware we have some extra constraints, such as overflow and rounding, to take into account. These extra constraints are what make floats appear so ‘fuzzy’ to some.

- Check if either of the operands (A and B) is zero (early out)
- Check for potential exponent overflow and raise the corresponding overflow exception
- Compute sign as C_{sign} = A_{sign} XOR B_{sign}
- Compute the exponent C_{exponent} = A_{exponent} + B_{exponent} - 127
- Compute mantissa C_{mantissa} = A_{mantissa} * B_{mantissa} (an integer multiply of the two 24-bit mantissas, i.e. the 23 stored bits plus the implicit leading 1) and round the result according to the currently set rounding mode
- If C_{mantissa} is not normalized, shift it by one bit and adjust C_{exponent} to match (the code below does C_{mantissa} <<= 1, C_{exponent} -= 1 when the top bit of the product is clear)

```c
#include <stdint.h>

// f32 holds the raw bit pattern of a float, not a float value
typedef uint32_t f32;
typedef uint32_t u32;
typedef int32_t  i32;
typedef uint64_t u64;

f32 float32_mul(f32 a, f32 b)
{
    // extract mantissa, exponent and sign
    u32 aFrac = a & 0x007FFFFF;
    u32 bFrac = b & 0x007FFFFF;
    u32 aExp = (a >> 23) & 0xFF;
    u32 bExp = (b >> 23) & 0xFF;
    u32 aSign = a >> 31;
    u32 bSign = b >> 31;

    // compute sign bit
    u32 zSign = aSign ^ bSign;

    // removed: handle edge conditions where the exponent is about to overflow
    // see the SoftFloat library for more information

    // compute exponent
    u32 zExp = aExp + bExp - 0x7F;

    // add implicit `1' bit
    aFrac = (aFrac | 0x00800000) << 7;
    bFrac = (bFrac | 0x00800000) << 8;

    u64 zFrac = (u64)aFrac * (u64)bFrac;
    u32 zFrac0 = zFrac >> 32;
    u32 zFrac1 = zFrac & 0xFFFFFFFF;

    // fold the discarded low bits into a sticky bit
    zFrac0 |= (zFrac1 != 0);

    // the packing step below adds the leading 1 into the exponent field,
    // so if the product is below 2.0 (top bit clear) we shift and compensate
    if(0 <= (i32)(zFrac0 << 1))
    {
        zFrac0 <<= 1;
        zExp--;
    }

    // reconstruct the float; I've removed the rounding code and just truncate
    return (zSign << 31) | ((zExp << 23) + (zFrac0 >> 7));
}
```

Again, the steps for floating point addition are based on calculating with scientific notation. First you align the exponents, then you add the mantissas. The alignment step is the reason for the big inaccuracies with adding small and large numbers together.

- Align binary point
  - If A_{exponent} > B_{exponent}, then do B_{mantissa} >>= 1 until you have B_{mantissa} * 2^{B_{exponent} - A_{exponent}}
  - If B_{exponent} > A_{exponent}, then do A_{mantissa} >>= 1 until you have A_{mantissa} * 2^{A_{exponent} - B_{exponent}}
- Compute sum of aligned mantissas
  - A_{mantissa} * 2^{A_{exponent} - B_{exponent}} + B_{mantissa}
  - Or B_{mantissa} * 2^{B_{exponent} - A_{exponent}} + A_{mantissa}
- Normalize and round results
- Check for exponent overflow and raise the corresponding overflow exception
- If C_{mantissa} is zero, set the entire float to zero to return a ‘correct’ 0 float.
```c
// for reference: shift right, but OR any bits that fall off into the lowest bit
static u32 shift32RightJamming(u32 a, int count)
{
    if(count == 0)
        return a;
    else if(count < 32)
        return (a >> count) | ((a << ((-count) & 31)) != 0);
    else
        return a != 0;
}

// implementation only works with a and b of equal sign
// if a and b are of different sign, we call float32_sub instead
// look at the SoftFloat source-code for specifics.
static f32 float32_add(f32 a, f32 b)
{
    int zExp;
    u32 zFrac;

    u32 aFrac = a & 0x007FFFFF;
    u32 bFrac = b & 0x007FFFFF;
    int aExp = (a >> 23) & 0xFF;
    int bExp = (b >> 23) & 0xFF;
    u32 aSign = a >> 31;
    u32 zSign = aSign; // b has the same sign, see above
    int expDiff = aExp - bExp;

    aFrac <<= 6;
    bFrac <<= 6;

    // align exponents if needed
    if(expDiff > 0)
    {
        if(bExp == 0) --expDiff;
        else bFrac |= 0x20000000;
        bFrac = shift32RightJamming(bFrac, expDiff);
        zExp = aExp;
    }
    else if(expDiff < 0)
    {
        if(aExp == 0) ++expDiff;
        else aFrac |= 0x20000000;
        aFrac = shift32RightJamming(aFrac, -expDiff);
        zExp = bExp;
    }
    else
    {
        if(aExp == 0) return (zSign << 31) | ((aFrac + bFrac) >> 6);
        zFrac = 0x40000000 + aFrac + bFrac;
        zExp = aExp;
        return (zSign << 31) | ((zExp << 23) + (zFrac >> 7));
    }

    // add the implicit `1' bit of the operand that was not shifted; OR-ing it
    // into aFrac works in both cases because only the sum is used from here on
    aFrac |= 0x20000000;
    zFrac = (aFrac + bFrac) << 1;
    --zExp;
    if((i32)zFrac < 0)
    {
        zFrac = aFrac + bFrac;
        ++zExp;
    }

    // reconstruct the float; I've removed the rounding code and just truncate
    return (zSign << 31) | ((zExp << 23) + (zFrac >> 7));
}
```

An overview of floating point addition hardware. The implementation will make a distinction between adding numbers where the exponent differs (the far path) and numbers where the exponent is the same (the close path), much like the implementation above.

The multiply-add operation is basically a combination of both of these operations that is as efficient as, or more efficient than, implementing both operations separately in hardware. The primary difference (as long as it’s not a *pseudo-fma*) is that only one rounding operation is done, at the very end, instead of one in the multiply *and* one in the add circuits (steps 3 and 4 respectively).
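To get a feel for what that single rounding buys you, here is a small sketch that compares rounding the product before the add against keeping the product unrounded until after the add. Note that this is only an illustration of the idea and not a real fused multiply-add: it uses a double as the wider intermediate (a float times a float always fits exactly in a double), so the final result is still rounded twice, once to double and once to float. The function names are made up for this example.

```c
/* Two roundings: the product is rounded to float before the add. */
static float mul_then_add(float a, float b, float c)
{
    /* volatile forces the product to really be rounded to single precision */
    volatile float product = a * b;
    return product + c;
}

/* The product is kept exact until after the add. A float*float product has
 * at most 48 significant bits, which fit in the 53 bits of a double. */
static float mul_add_wide(float a, float b, float c)
{
    double exactProduct = (double)a * (double)b;
    return (float)(exactProduct + (double)c);
}
```

Take a = b = 1 + 2^-12 and c = -(1 + 2^-11). The exact product is 1 + 2^-11 + 2^-24, but the 2^-24 term doesn’t fit in a float, so `mul_then_add` rounds it away and returns 0. `mul_add_wide` still has it when the add happens and returns 2^-24.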

Some, if not most, SIMD architectures on current-gen platforms are actually built around just the fused multiply-add and don’t have regular multiply or addition hardware (they’ll just insert identity constants into one of the three operands). A simple give-away for this is usually that the cycle count for these operations is identical in each case.

*Single precision floating-point format.* (2011, June 19). Retrieved July 2011, from Wikipedia: http://en.wikipedia.org/wiki/Single_precision_floating-point_format

Quinnell, E. C. (2007, May). *Floating-Point Fused Multiply-Add Architectures.* Retrieved June 2011, from http://repositories.lib.utexas.edu/bitstream/handle/2152/3082/quinnelle60861.pdf

Shaadan, D. M. (2000, January). *Floating Point Arithmetic Using The IEEE 754 Standard Revisited.* Retrieved June 2011, from http://meseec.ce.rit.edu/eecc250-winter99/250-1-27-2000.pdf

Hauser, J. (2010, June). *SoftFloat.* Retrieved June 2011, from http://www.jhauser.us/arithmetic/SoftFloat.html

Giesen, F. (2011, July). *Int-multiply-using-floats trickery.* Retrieved July 2011, from http://pastebin.com/jyT0gTSS

The algorithm consists of 4 different steps implemented in 5 different shaders to accomplish what Crytek calls a *fancy edge blur*. I’ll outline the steps first and then go into detail about what each part does.

- Edge detection
- Count line lengths, in log_{4}(maxLineLength) passes
- Determine blend weights
- Blur (I won’t go into this here because it’s a simple blur filter based on the previously calculated weights)

Although there are many ways to do edge detection in a pixel shader, the paper implements it as a difference based on color rather than on normals or depth. As it turns out, the most recent version at the time of writing also supports color + depth edge detection. Currently the shader converts colors to LAB color space, calculates the Euclidean distance between two colors and optionally does a depth compare; when this distance exceeds a certain threshold you have your edge.
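The color test itself boils down to very little code. Below is a rough CPU-side sketch in C rather than the actual shader; it assumes both colors have already been converted to LAB, and the `LabColor` struct, the `is_color_edge` name and the threshold parameter are made up for this illustration (the depth compare is left out).

```c
/* Hypothetical type: a color that has already been converted to LAB. */
typedef struct {
    float L, a, b;
} LabColor;

/* Returns 1 if the Euclidean distance between p and q exceeds threshold.
 * Comparing squared values avoids the square root and gives the same
 * answer as long as threshold is not negative. */
static int is_color_edge(LabColor p, LabColor q, float threshold)
{
    float dL = p.L - q.L;
    float da = p.a - q.a;
    float db = p.b - q.b;
    float distanceSquared = dL * dL + da * da + db * db;

    return distanceSquared > threshold * threshold;
}
```

You would run this test once against the neighbouring pixel in each direction you want to detect edges in and write the result to the matching channel.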

The paper uses and outputs to two textures, a mask and the texture used to count line lengths and they are laid out like this:

- The mask consists of 2 channels:
  - R gets a 1 if there is a **horizontal** edge (zero otherwise)
  - G gets a 1 if there is a **vertical** edge (zero otherwise)
- The line length texture consists of 4 channels:
  - B and A get a value of 1/255 if there is a **horizontal** discrepancy (zero otherwise)
  - R and G get a value of 1/255 if there is a **vertical** discrepancy (zero otherwise)
  - If there is a horizontal discrepancy, this means that there is a vertical line and vice versa! Keep this in mind for the Count line lengths shader.
For my implementation I dropped the mask because the line length texture can serve the exact same purpose, except that its data is in different channels and checks for “equals 1” have to be converted to “doesn’t equal 0”. This made the flow of the algorithm a lot easier and saved some memory.

The process of determining the line lengths uses a technique called *recursive doubling*, which is the reason this part of the process gets its logarithmic runtime performance of O(log_{4}(maxLineLength)). This basically comes down to doing 4 passes for a maximum length of 256 pixels. To see how this works we should first see exactly how the line-length buffer is structured.

Basically, the R channel stores how many pixels a certain line has to the left of it and the G channel stores how many there are to the right; this means that for each pixel you can look up the length of the edge by summing the R and G channels for the horizontal length and the B and A channels for the vertical length.

Orange represents the alpha channel. This is the content of the buffer when the algorithm is done processing.

The gist of the algorithm is pretty simple and I’ll go over it briefly. Just keep in mind that the following loop is done per channel (i.e. 4 times).

```hlsl
// PreviousLengths is initialized to 1.f/255.f for edges and 0 for non-edges
float4 currentLengths = tex2D(PreviousLengths, Tex);
float4 currentDelta = currentLengths * PixelSize.zzww;
const float Threshold = Level / 255.f;

if(currentLengths.r >= Threshold)
{
    float2 newTex = Tex - float2(currentDelta.r, 0);
    for(int k = 0; k < 3; k++)
    {
        float oneDelta = tex2D(PreviousLengths, newTex).r;
        currentLengths.r += oneDelta;
        newTex.x -= oneDelta * PixelSize.z;
    }
}
```

For the first pass Level is initialized to 1 (then to 4, 16 and 64), so this check does the length count only if the shader thinks it should still be counting the length of the edge. PixelSize is initialized to (1/width, 1/height, 255/width, 255/height), so multiplying by zzww converts from increments of 1/255 to increments of 1/width and 1/height.

The interesting part, however, is inside the loop, as it moves more pixels to the side depending on the value in the R channel it retrieves. This has the effect that if the value at that pixel is 0, nothing changes and currentLengths doesn’t get incremented.
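If the jumping around is hard to picture, it might help to see the same idea on the CPU. The sketch below is my own simplification and not the shader: it handles a single row and a single channel (the count to the left), works with whole pixel counts instead of multiples of 1/255, and treats everything outside the row as a non-edge. The `count_left_lengths` name is made up.

```c
#include <stdlib.h>
#include <string.h>

/* len[i] must start out as 1 for edge pixels and 0 for non-edges. Afterwards
 * len[i] holds the number of consecutive edge pixels that end at pixel i,
 * looking to the left and counting pixel i itself (up to 256). */
static void count_left_lengths(int *len, int n)
{
    int *prev = malloc((size_t)n * sizeof *prev);
    if(prev == NULL)
        return;

    /* one iteration per shader pass: Level is 1, 4, 16 and 64 */
    for(int level = 1; level <= 64; level *= 4)
    {
        /* every pass only reads the results of the previous pass */
        memcpy(prev, len, (size_t)n * sizeof *prev);

        for(int i = 0; i < n; i++)
        {
            if(prev[i] < level)
                continue; /* this line has already ended */

            int pos = i - prev[i]; /* first pixel we haven't counted yet */
            for(int k = 0; k < 3; k++)
            {
                int oneDelta = (pos >= 0) ? prev[pos] : 0;
                len[i] += oneDelta;
                pos -= oneDelta; /* a delta of 0 keeps us in place */
            }
        }
    }

    free(prev);
}
```

For a row of `0 1 1 1 1 1 1 1 1 1 1 0 1 1` the first pass counts up to 4 pixels and the second pass picks up the rest, so the last pixel of the ten pixel run ends up with a length of 10 and the two pixels after the gap get 1 and 2.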

The different colors do not indicate different channels, they are merely different lines.

When doing normal point-sampling, you’ll know the line has ended when you reach 0, and the loop makes sure you stop there. However, as shown by Nicolas Vizerie in MLAA (MorphoLogical AntiAliasing) on the GPU using Direct3D9.0, using bilinear filtering can reduce the number of texture fetches by testing two lines at a time.

The blend weights are calculated from a pre-generated lookup table (look for tabAires in the source code; I didn’t bother to re-implement it). The basic gist of the table is that the vertical axis is the size of the line (i.e. the sum of two channels in the LineLength texture) and the horizontal axis is the size in one direction (i.e. one of the two channels in the LineLength texture). The contents of the table are the areas below the triangles that the edges form, and they are almost all handled by the formula `0.5 * (1. - (2 * j + 1) / (float)S)`.

The other formulas in the lookup table are there to help in edge cases where the equation tails off. Each pixel in the lookup table has a range of 0.25 ≤ pixelvalue ≤ 0.5 because the blending of a certain pixel has a maximum contribution of four other pixels.
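As a quick illustration, this is the main formula as a plain C function. I’m assuming here that `S` is the total size of the line and `j` the size in one direction, matching the two axes of the table; the `blend_area` name is made up, and the special cases from the real table are left out.

```c
/* Area for the common case of the lookup table.
 * S: total size of the line, j: size in one direction (both in pixels). */
static float blend_area(int j, int S)
{
    return 0.5f * (1.0f - (2 * j + 1) / (float)S);
}
```

For a line of 4 pixels and j = 0 this gives 0.5 * (1 - 1/4) = 0.375.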

The shader, although lengthy, has quite a straightforward implementation that basically checks the endpoints of the lines to see if there exists an edge orthogonal to it. If that’s the case, it uses the shortest of the two line-segments and the total size of the original line to determine the weights in the lookup table. This process is done four times:

- Horizontal for current pixel
- Vertical for current pixel
- Horizontal for one pixel to the left
- Vertical for one pixel above

- MLAA Demo 2 by @synulation: Optimized DX10 version of the previous version of the shader.
- Practical Morphological Anti-Aliasing: Other MLAA on the GPU technique.
- MLAA (MorphoLogical AntiAliasing) on the GPU using Direct3D9.0: Another MLAA implementation (also uses the old version of the paper).
- Morphological Antialiasing on GPU

- Interactive Summed-Area Table Generation for Glossy Environmental Reflections