[WIP]
Many things can impact performance:
Choice of algorithm and data structure
There is often more than one way to do a thing, and which one is best depends on the situation.
→ Choose wisely. ( Typically in 3D rendering, ray tracing to the analytic intersection is faster than sphere tracing, which is faster than ray marching with regular small steps, generally speaking. )
For instance, some quadratic algorithms can be made linear if you split them into two passes, the first pass being stored in BuffA. ( Examples: Gaussian filter, Fourier transform… )
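As an illustration, here is a minimal sketch of the separable-Gaussian idea (the names and kernel size are mine, not from any particular shader): a horizontal pass writes to BuffA, and the second pass below reads BuffA on iChannel0 and blurs vertically.

    #define RADIUS 8
    float w(int i) { return exp(-float(i*i) / float(RADIUS*RADIUS)); }   // un-normalized Gaussian weight

    // Vertical pass (BuffB or Image), reading the horizontal pass from BuffA on iChannel0.
    // The horizontal pass is identical, with vec2(i,0) instead of vec2(0,i).
    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec4 sum = vec4(0); float wsum = 0.;
        for (int i = -RADIUS; i <= RADIUS; i++) {
            sum  += w(i) * texture(iChannel0, (fragCoord + vec2(0, i)) / iResolution.xy);
            wsum += w(i);
        }
        fragColor = sum / wsum;   // 2*(2R+1) taps in total instead of (2R+1)^2
    }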
Simpler sanity check: choose the necessary precision and loop lengths wisely.
Also, a reminder that shader programming is about massive parallelism: the program is called once per pixel, so the worst thing you can do is to loop over every element to draw, as you would in CPU programming. Especially for 2D shaders, check whether you can determine which element (or which subset of IDs) may cover the current pixel.
– See making a shader loopless.
– For splatting particles, see Voronoï particle tracking.
– Regular symmetries and repetitions (in 2D or 3D): use domain folding ( using abs or mod on pixel coordinates, in linear or polar; see the folding sketch below ).
Another consequence is that it makes no sense to write a shader with a first loop computing data into an array and a second loop using that data linearly: either do everything in the same loop, or precompute the data in BuffA.
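For the domain-folding item above, a minimal sketch (the SDF and constants are arbitrary placeholders): a single evaluation covers infinitely many mirrored copies, with no loop over instances.

    float sdDisk(vec2 p, float r) { return length(p) - r; }

    float scene(vec2 p) {
        p = mod(p, 2.0) - 1.0;     // infinite repetition with period 2 (linear folding)
        p = abs(p);                // mirror symmetry inside each tile
        // polar variant: fold the angle with mod/abs instead of the coordinates.
        return sdDisk(p - vec2(0.5), 0.3);
    }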
Maths
Simplify formulas mathematically ( e.g. using trigonometry, geometry, algebraic properties… ): optimizers don’t know about symbolic calculus.
Conversely, this is useless on consts, since the compiler will evaluate them at compile time.
About factoring expressions, and pre- or post-computing what does not depend on a loop:
Yes, the optimizer can do a lot of smart work you wouldn’t imagine… but sometimes it doesn’t do obvious things.
→ Don’t tempt the devil: do factor and avoid redundancy. Plus, it helps readability and debugging.
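A minimal sketch of what factoring out of a loop means in practice (constants and names are purely illustrative):

    const int N = 64;
    float wave(vec2 uv) {
        float freq = 10.0, phase = 0.3;
        float base = iTime * freq;          // loop-invariant: computed once, outside the loop
        float c = 0.0;
        for (int i = 0; i < N; i++)
            c += cos(base + float(i) * phase + uv.x);   // rather than cos(iTime*freq + ...) at every iteration
        return c / float(N);
    }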
Shadertoy tricks
Your costly precomputations in BuffA won’t change over time? Compute them just once:
→ if (iFrame == 0) { /* do the computation */ } else fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);
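Spelled out as a full BuffA pass reading itself on iChannel0 (expensivePrecomputation() is a hypothetical placeholder for your actual work), this could look like:

    vec4 expensivePrecomputation(vec2 p) {          // placeholder for the real costly work
        vec4 acc = vec4(0);
        for (int i = 0; i < 256; i++) acc += sin(p.xyxy * float(i) * 0.01);
        return acc / 256.0;
    }

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        if (iFrame == 0)
            fragColor = expensivePrecomputation(fragCoord);          // done once, at frame 0
        else
            fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);  // afterwards: just copy the cached value
    }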
Your ultra-costly shader won’t change over time? Compute it just once:
→ if (iFrame != 0) { discard; }
In both cases, this won’t work if you access a texture, because textures are loaded asynchronously: you first have to wait for the loading to finish.
→ see how to.
You still want to update one of those in case of a resolution change (e.g. going fullscreen)?
→ see how to. ( Even simpler for a mouse move or click: just test it. )
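One possible approach (an assumed sketch, not necessarily the one linked above): memorize the resolution used at init time in a spare texel of BuffA, and re-run the init whenever it no longer matches or whenever the mouse is pressed, reusing the hypothetical expensivePrecomputation() from the sketch above.

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec2 oldRes = texelFetch(iChannel0, ivec2(0), 0).xy;   // resolution memorized at last init
        bool reinit = iFrame == 0 || oldRes != iResolution.xy || iMouse.z > 0.0;
        if (ivec2(fragCoord) == ivec2(0)) {                    // bookkeeping texel
            fragColor = vec4(iResolution.xy, 0, 0);
            return;
        }
        if (reinit) fragColor = expensivePrecomputation(fragCoord);
        else        fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);
    }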
Your costly precomputations in BuffA do change over time, but don’t require full resolution?
→ compute only in a corner of BuffA, then access it via bilinear filtering or MIPmapping ( done for you by the hardware almost for free: just activate MIPmap filtering in the texture binder ).
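A minimal sketch of the low-resolution-corner idea, assuming the data fits in the bottom-left quarter of BuffA (SCALE and the helper name are mine):

    const float SCALE = 0.25;                      // BuffA only fills this fraction of the screen

    // In BuffA:  if (any(greaterThan(fragCoord, SCALE * iResolution.xy))) discard;
    //            ... compute the data as if the resolution were SCALE * iResolution.xy ...

    // In the consuming pass (BuffA on iChannel0, bilinear or MIPmap filtering enabled):
    vec4 readLowRes(vec2 uv) {                     // uv in [0,1]^2, full-screen coordinates
        return texture(iChannel0, uv * SCALE);     // hardware interpolation upsamples for free
    }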
Similarly, you can use MIPmaps to approximate any spatial integration: see the General Purpose MIPmap page and examples. E.g. here we compute a Gaussian blur with a mix of crude sampling (using the proper filter weights) and MIPmap lookups to get each sample value (from smaller scales); in addition, the 2D filter is separated into two 1D passes. ( Attention: beyond the MIPmap approximation itself, it can be pretty biased on non-power-of-2 textures. If you need precision, better rely on CubeMaps, currently the only way to get square power-of-2 textures in Shadertoy. )
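For the integration idea, a minimal sketch (the lod formula is a rough heuristic of mine, not taken from the linked example): a single MIP lookup stands in for averaging over a radius-sized neighbourhood.

    // Requires MIPmap filtering enabled on iChannel0.
    vec4 mipBlur(vec2 uv, float radiusInPixels) {
        float lod = log2(max(radiusInPixels, 1.0));    // MIP level whose texel size roughly matches the radius
        return textureLod(iChannel0, uv, lod);
    }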
GPU/SIMD issues and goodies
Parallelism is powerful, but it comes with constraints. Divergence in SIMD is a big one: a neighborhood of 32 pixels ( an NVIDIA warp ) is computed in total sync at the assembly-instruction level, so branching ( conditions, variable-length loops, switch, etc. ) forces “visiting” each configuration one after the other as soon as one pixel in the warp enters it.
A common irrational belief is then to fear and avoid any “if” and replace it with workarounds… which can be worse (and obfuscating).
→ The problem is not the “if” per se, but how you use it (in situations that are truly divergent). Typically, doing the same heavy work with different parameters in the two diverged branches is not parallelism-friendly, while just setting different parameters in the branches and applying them after the branching is.
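A minimal sketch of the difference (heavyShading() is a stand-in for your actual costly code):

    vec3 heavyShading(vec2 p, vec3 tint) {                 // stand-in for the expensive part
        vec3 c = vec3(0);
        for (int i = 0; i < 100; i++) c += tint * sin(float(i) * p.x + p.y);
        return c * 0.01;
    }

    vec3 shade(vec2 p, bool insideShape) {
        // Divergence-unfriendly: a mixed warp runs the heavy call twice.
        //   if (insideShape) return heavyShading(p, vec3(1, .5, .2));
        //   else             return heavyShading(p, vec3(.2, .5, 1));
        // Friendlier: only the cheap parameter selection diverges; the heavy call runs once for all lanes.
        vec3 tint = insideShape ? vec3(1, .5, .2) : vec3(.2, .5, 1);
        return heavyShading(p, tint);
    }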
Note that badly accounting for parallelism can also cause ultra-long or crashing compilation.
Read more here.
Also, on GPU a very large number of computing units must share caches and memory accesses: the ratio is even ridiculous compared to CPU programming, so coding the same way can be disastrous for performance. In particular, arrays can be ultra-costly (they prevent parallelism from covering wait states, by eating all the registers), and re-computing an expression can be faster than fetching it from memory or a texture, possibly even if it is already available in cache!
→ Be wise, but also test, if it’s something critical in your case (alas, the optimum is probably GPU-dependent).
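As a caricatural illustration of the array point (values are arbitrary): a per-pixel lookup table lives in registers or local memory and can hurt occupancy, while recomputing is just a bit of ALU work.

    float lookupOrRecompute(int k) {
        // Table version (can eat registers or spill to local memory):
        //   float tab[16];
        //   for (int i = 0; i < 16; i++) tab[i] = sin(float(i) * 0.4);
        //   return tab[k];
        // Recompute version: often as fast or faster on GPU.
        return sin(float(k) * 0.4);
    }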
Hardware derivatives:
If you need the screen-space derivative of anything, you can get it (almost) for free using dFdx, dFdy, fwidth, since the SIMD sync allows the GPU to compute finite differences for you.
Still, note that the precision is limited (non-centered finite differences evaluated every two pixels, which is ok for slowly varying quantities), they are killed by divergence (be smart and compute them upstream), and they are not implemented on low-end devices.
For instance, you can often get antialiasing (almost) for free by drawing the fwidth-normalized distance to the shape, v/fwidth(v). ( But in case of discontinuity or divergence in v you have to be a bit smarter: see here. )
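A minimal sketch of this AA trick on a disk SDF (shape and colours arbitrary):

    float disk(vec2 p, float r) { return length(p) - r; }

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec2 uv = (2.0 * fragCoord - iResolution.xy) / iResolution.y;
        float v = disk(uv, 0.5);
        float cov = clamp(0.5 - v / fwidth(v), 0.0, 1.0);   // ~1-pixel-wide smooth edge, resolution-independent
        fragColor = vec4(vec3(cov), 1.0);
    }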
( To be continued )