[OpenWalnut-Dev] Shader Performance

Fri Oct 15 13:46:51 CEST 2010

Hi

For all you shader programmers out there, here are some hints to improve shader performance. You might ask, "why do we count cycles here?" Well, each shader gets executed millions of times per frame (depending on how much geometry is on screen and so on). If you reduce the runtime of a shader by 5 cycles and it runs a million times per frame, you save 5 million cycles per frame. Current GPUs use clock frequencies of 1.0 GHz to 1.5 GHz. At 1 GHz, each frame finishes 5/1000 of a second sooner.

That does not sound too cool, but look at it this way: say you have 10 frames per second. At 1 GHz this means the GPU needs roughly 100 million cycles per frame. If you reduce that by 5 million cycles, you get 95 million cycles per frame, which means about 10.5 frames per second. Simply by reformulating two or three multiplications? Cool, isn't it? If you reduce your shader code by 10 cycles, you get over 1 additional frame per second. Of course this is only a rough estimate that leaves out several other factors which influence the fps, but hey, who cares ;-).

Here are some ways to improve the performance of your shader:

Swizzle:
--------

What the f*** are swizzles? You know them already. A swizzle accesses vector elements through component masks:

 vec4 fun = ...
 fun.xy = hello.zw;  // these .xy and .zw are swizzles

Always use this kind of assignment instead of:

 fun.x = hello.z;
 fun.y = hello.w;

You can assume swizzle'd assignments to be nearly cost-free in hardware.
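
A few more swizzle patterns, as a minimal sketch. The uniform name "hello" is only taken over from the snippet above and the output is meaningless; the shader just shows which masks are legal.

```glsl
#version 120

// Hypothetical input; the name is an assumption made for this sketch.
uniform vec4 hello;

void main()
{
    vec4 fun = vec4( 0.0 );

    fun.xy = hello.zw;          // copy two components at once
    fun.zw = hello.xx;          // replicate one component into two slots
    vec3 reversed = hello.zyx;  // reorder components
    vec4 gray = hello.rrra;     // rgba names address the same components as xyzw

    // Combine everything so that no value is unused.
    gl_FragColor = fun + vec4( reversed, 0.0 ) + gray;
}
```

Note that a component may be repeated when reading (hello.xx) but not when writing: something like fun.xx = ... is not allowed.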

Madness:
--------

You most probably know this kind of scaling to bring a [-1,1] ranged vector to [0,1]:

vec4 result = (vector / 2.0) + 0.5;

That is a division and an addition -> at least two cycles. But the GPU offers something called "MAD" - "Multiply, then Add". This operation gets executed in only one cycle. Of course a smart compiler can transform the statement above into a MAD, but you should not rely on it. Now let's reformulate the scaling as a proper MAD operation:

vec4 result = vector * 0.5 + 0.5;

Keep this in mind. This kind of scaling is a very common operation in shaders, so the optimization potential is quite high. As mentioned earlier, you can't assume the GLSL compiler to be very smart. For you, the operation

float result = 0.5 * ( 1.0 + value );

is of course a multiply-add operation because of distributivity. But you should reformulate this too:

float result = 0.5 + 0.5 * value;
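
The same idea works for any range, not only fixed constants. The following is a minimal sketch, assuming a value that lies in [rangeMin, rangeMax]; all three uniform names are made up for the illustration.

```glsl
#version 120

// Hypothetical inputs; names and ranges are assumptions made for this sketch.
uniform float value;     // expected to lie in [rangeMin, rangeMax]
uniform float rangeMin;
uniform float rangeMax;

void main()
{
    // Straightforward form: a subtraction followed by a division.
    // float normalized = ( value - rangeMin ) / ( rangeMax - rangeMin );

    // MAD form: fold the range into one scale and one bias ...
    float scale = 1.0 / ( rangeMax - rangeMin );
    float bias  = -rangeMin * scale;

    // ... so that the per-fragment work is a single multiply-add.
    float normalized = value * scale + bias;

    gl_FragColor = vec4( vec3( normalized ), 1.0 );
}
```

Since scale and bias depend only on uniforms here, they could probably be computed once on the CPU and passed in as uniforms themselves, which leaves nothing but the MAD in the shader.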

Swizzle + MAD:
--------------

Using swizzle masks and MAD operations is cool and fast. So why not use them in conjunction with each other? Well, do! A very common operation in fragment shaders is setting the rgb part of the output color while resetting the alpha value to 1.0:

outColor.rgb = color.rgb;
outColor.a = 1.0;
gl_FragColor = outColor;

These instructions cause at least two move instructions on the GPU. The exact number depends on the compiler and on whether the GPU can set different parts of gl_FragColor at once. Using a MAD instruction in combination with clever swizzle masks, this can be done in only one cycle:

const vec2 constant = vec2( 1.0, 0.0 );
gl_FragColor = color.xyzw * constant.xxxy + constant.yyyx;

What does this mean? The multiplication causes the rgb components to be scaled with 1.0 and the alpha component with 0.0. The addition adds 0.0 to each rgb component and 1.0 to the alpha component. This is exactly what you intended.

Use Built-In operations:
------------------------

GLSL provides some fast built-in functions. For calculating the dot product, use dot() instead of calculating it by hand. The same goes for clamping values. If-then-else blocks can increase execution time tremendously, so instead of writing this, use clamp( value, min, max ):

if ( value > 0.5 )
  value = 0.5;
if ( value < 0.2 )
  value = 0.2;

Use the following code for the same effect:

value = clamp( value, 0.2, 0.5 );

Another even more common thing in shader code is linear interpolation between two colors or other values:

vec4 color1 = ...
vec4 color2 = ...

vec4 mixedColor = color1 * ( 1.0 - value ) + color2 * value;

This can of course be reformulated into a MAD sequence, but the best way is to use mix():

mixedColor = mix( color1, color2, value );
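
The dot product mentioned at the start of this section can be sketched the same way. This is only an illustration: the varying "normal" and the uniform "lightDir" are hypothetical names, and the simple diffuse term is just a convenient place to use dot().

```glsl
#version 120

// Hypothetical inputs; the names are assumptions made for this sketch.
varying vec3 normal;
uniform vec3 lightDir;

void main()
{
    vec3 n = normalize( normal );
    vec3 l = normalize( lightDir );

    // By hand: three multiplications and two additions spelled out.
    // float diffuse = n.x * l.x + n.y * l.y + n.z * l.z;

    // Built-in: the same value, with the choice of instructions left to the compiler.
    float diffuse = dot( n, l );

    // max() is another built-in that replaces an if; here it clamps negative values to zero.
    diffuse = max( diffuse, 0.0 );

    gl_FragColor = vec4( vec3( diffuse ), 1.0 );
}
```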

Branching:
----------

Please, please, please, pleeeaaase avoid branching wherever possible or, at least, reduce it to a minimum by merging as many conditions as possible into one branch. I will try to explain why branching is such a problem on GPUs. Assume the following shader:

void main()
{
   float value = calculateSomehow();

   if ( value < someThreshold )
      doThis();   
   else
      doThat();
}

It is quite a simple example, but it is enough to demonstrate why branching is evil. The key to understanding is HOW the GPU executes the shader. Assume two fragments being processed by the above shader. For fragment 1, the condition evaluates to true. For fragment 2, it evaluates to false.

The GPU organizes its threads into grids, blocks and warps (at least on NVidia cards). A grid contains several blocks, each block contains several warps. Each warp executes several threads synchronized at instruction level. That means all threads in a single warp execute the same operation in the same cycle (called SIMT: single instruction, multiple threads; no, it is not SIMD).

What happens when the warp reaches the if statement? The threads evaluate the condition and begin to diverge in the path they follow. This would mean different instructions per thread in a warp per cycle, which simply is not possible (at least with current NVidia GPUs). How does the GPU solve the problem? It first executes the instructions of all threads in the warp that take the if-then path while suspending the other threads, whose condition is false. AFTER that, it suspends the threads where the condition was true and executes the instructions of the threads where the condition was false.

As an example, remember our fragments 1 and 2. Let them be neighbouring pixels and let them execute in the same warp. Then each of the two threads needs the cycles of the first branch AND the second branch. Especially if one branch is very time consuming, the other fragments in the same warp simply can't continue execution until it is done.
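
Merging conditions into one branch, as asked for at the start of this section, could look like the following minimal sketch. The names "value", "lower" and "upper" are made up, and whether the merged form really ends up cheaper depends on the compiler and the GPU.

```glsl
#version 120

// Hypothetical inputs; names are assumptions made for this sketch.
varying float value;
uniform float lower;
uniform float upper;

void main()
{
    vec4 color = vec4( 0.0, 0.0, 0.0, 1.0 );

    // Nested form: two if statements where threads of a warp can diverge.
    // if ( value > lower )
    // {
    //     if ( value < upper )
    //         color.r = 1.0;
    // }

    // Merged form: the same result with a single if statement.
    if ( value > lower && value < upper )
        color.r = 1.0;

    gl_FragColor = color;
}
```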

The big question now is: can we use evaluated conditions in formulas like "( a > b ) * value;" without any extra cost? This is not easy to answer, as NVidia does not clearly say how comparisons and the like are executed on the GPU. If the result of the comparison is stored as a bit inside a register, then you definitely need at least two cycles for the statement (one executing the multiplication for the threads where a > b is true and one for those where it is false). If the GPU evaluates the condition into a value register, the operation can be done in 1 cycle. But I actually do not know which is the case.
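
For reference, this is how such a condition-in-a-formula could be written in GLSL. It is a sketch only: GLSL does not multiply a bool with a float directly, so the comparison is converted with float( ... ) first. The helpers colorThis() and colorThat() are hypothetical stand-ins for the two sides of the branch in the shader above, and nothing here settles the open question of what the hardware does with the comparison.

```glsl
#version 120

// Hypothetical inputs and helpers; none of these names come from a real shader.
uniform float someThreshold;
varying float value;

vec4 colorThis() { return vec4( 1.0, 0.0, 0.0, 1.0 ); }
vec4 colorThat() { return vec4( 0.0, 0.0, 1.0, 1.0 ); }

void main()
{
    // 1.0 where value < someThreshold, 0.0 otherwise.
    float mask = float( value < someThreshold );

    // Both results are computed by every thread; mask picks one of them.
    gl_FragColor = mix( colorThat(), colorThis(), mask );
}
```

Every thread now pays for both sides, so this form is only likely to help when both sides are cheap. For an expensive side, a real branch that most warps skip as a whole may still be the better choice.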

Keep all this in mind when you write your shaders.

Bye
Sebastian

-- 
Dipl.-Inf. Sebastian Eichelbaum
Universität Leipzig
Institut für Informatik
Abteilung Bild- und Signalverarbeitung
PF 100920
D-04009 Leipzig