[WIP]
Many things can impact performance:
Choice of algorithm and data structure
There is often more than one way to do a thing, and which one is best depends on the situation.
→ Choose wisely. ( Typically in 3D rendering, ray tracing to the analytic intersection is faster than sphere tracing, which is faster than ray marching with regular small steps, generally speaking. )
For instance, some quadratic algorithms can be made linear if you split them into two passes, the first pass being stored in BuffA. ( Examples: Gaussian filter, Fourier transform… )
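As an illustration, here is a minimal sketch of the separable-Gaussian idea (the names and kernel size are mine, not from any particular shader): a horizontal pass writes to BuffA, and the second pass below reads BuffA on iChannel0 and blurs vertically.

    #define RADIUS 8
    float w(int i) { return exp(-float(i*i) / float(RADIUS*RADIUS)); }   // un-normalized Gaussian weight

    // Vertical pass (BuffB or Image), reading the horizontal pass from BuffA on iChannel0.
    // The horizontal pass is identical, with vec2(i,0) instead of vec2(0,i).
    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec4 sum = vec4(0); float wsum = 0.;
        for (int i = -RADIUS; i <= RADIUS; i++) {
            sum  += w(i) * texture(iChannel0, (fragCoord + vec2(0, i)) / iResolution.xy);
            wsum += w(i);
        }
        fragColor = sum / wsum;   // 2*(2R+1) taps in total instead of (2R+1)^2
    }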
Simpler sanity check: choose the necessary precision and loop lengths wisely.
Also, a reminder that shader programming is about massive parallelism: the program is called once per pixel, so the worst thing you can do is to loop over every element to draw, as you would in CPU programming. Especially for 2D shaders, check whether you can determine which element (or which subset of IDs) may cover the current pixel.
– See making a shader loopless.
– For splatting particles, see Voronoï particle tracking.
– Regular symmetries and repetitions (in 2D or 3D): use domain folding ( using abs or mod on pixel coordinates, in linear or polar; see the folding sketch below ).
Another consequence is that it makes no sense to write a shader with a first loop computing data into an array and a second loop using that data linearly: either do everything in the same loop, or precompute the data in BuffA.
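For the domain-folding item above, a minimal sketch (the SDF and constants are arbitrary placeholders): a single evaluation covers infinitely many mirrored copies, with no loop over instances.

    float sdDisk(vec2 p, float r) { return length(p) - r; }

    float scene(vec2 p) {
        p = mod(p, 2.0) - 1.0;     // infinite repetition with period 2 (linear folding)
        p = abs(p);                // mirror symmetry inside each tile
        // polar variant: fold the angle with mod/abs instead of the coordinates.
        return sdDisk(p - vec2(0.5), 0.3);
    }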
Maths
Simplify formulas mathematically ( e.g. using trigonometry, geometry, algebraic properties… ): optimizers don’t know about symbolic calculus.
Conversely, this is useless on consts, since the compiler will evaluate them at compile time.
About factoring expressions, and pre- or post-computing what does not depend on a loop:
Yes, the optimizer can do a lot of smart work you wouldn’t imagine… but sometimes it doesn’t do obvious things.
→ Don’t tempt the devil: do factor and avoid redundancy. Plus, it helps readability and debugging.
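A minimal sketch of what factoring out of a loop means in practice (constants and names are purely illustrative):

    const int N = 64;
    float wave(vec2 uv) {
        float freq = 10.0, phase = 0.3;
        float base = iTime * freq;          // loop-invariant: computed once, outside the loop
        float c = 0.0;
        for (int i = 0; i < N; i++)
            c += cos(base + float(i) * phase + uv.x);   // rather than cos(iTime*freq + ...) at every iteration
        return c / float(N);
    }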
Shadertoy tricks
Your costly precomputations in BuffA won’t change over time? Compute them just once:
→ if (iFrame == 0) { /* do the computation */ } else fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);
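Spelled out as a full BuffA pass reading itself on iChannel0 (expensivePrecomputation() is a hypothetical placeholder for your actual work), this could look like:

    vec4 expensivePrecomputation(vec2 p) {          // placeholder for the real costly work
        vec4 acc = vec4(0);
        for (int i = 0; i < 256; i++) acc += sin(p.xyxy * float(i) * 0.01);
        return acc / 256.0;
    }

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        if (iFrame == 0)
            fragColor = expensivePrecomputation(fragCoord);          // done once, at frame 0
        else
            fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);  // afterwards: just copy the cached value
    }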
Your ultra-costly shader won’t change over time? Compute it just once:
→ if (iFrame != 0) { discard; }
In both cases, this won’t work if you access a texture, because textures are loaded asynchronously: you first have to wait for the loading to finish.
→ see how to.
You still want to update one of those in case of a resolution change (e.g. going fullscreen)?
→ see how to. ( Even simpler for a mouse move or click: just test it. )
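One possible approach (an assumed sketch, not necessarily the one linked above): memorize the resolution used at init time in a spare texel of BuffA, and re-run the init whenever it no longer matches or whenever the mouse is pressed, reusing the hypothetical expensivePrecomputation() from the sketch above.

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec2 oldRes = texelFetch(iChannel0, ivec2(0), 0).xy;   // resolution memorized at last init
        bool reinit = iFrame == 0 || oldRes != iResolution.xy || iMouse.z > 0.0;
        if (ivec2(fragCoord) == ivec2(0)) {                    // bookkeeping texel
            fragColor = vec4(iResolution.xy, 0, 0);
            return;
        }
        if (reinit) fragColor = expensivePrecomputation(fragCoord);
        else        fragColor = texelFetch(iChannel0, ivec2(fragCoord), 0);
    }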
Your costly precomputations in BuffA do change over time, but don’t require full resolution?
→ compute only in a corner of BuffA, then access it via bilinear filtering or MIPmapping ( done for you by the hardware almost for free: just activate MIPmap filtering in the texture binder ).
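A minimal sketch of the low-resolution-corner idea, assuming the data fits in the bottom-left quarter of BuffA (SCALE and the helper name are mine):

    const float SCALE = 0.25;                      // BuffA only fills this fraction of the screen

    // In BuffA:  if (any(greaterThan(fragCoord, SCALE * iResolution.xy))) discard;
    //            ... compute the data as if the resolution were SCALE * iResolution.xy ...

    // In the consuming pass (BuffA on iChannel0, bilinear or MIPmap filtering enabled):
    vec4 readLowRes(vec2 uv) {                     // uv in [0,1]^2, full-screen coordinates
        return texture(iChannel0, uv * SCALE);     // hardware interpolation upsamples for free
    }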
Similarly, you can use MIPmaps to approximate any spatial integration: see the General Purpose MIPmap page and examples. E.g. here we compute a Gaussian blur with a mix of crude sampling (using the proper filter weights) and MIPmap lookups to get each sample value (from smaller scales); in addition, the 2D filter is separated into two 1D passes. ( Attention: beyond the MIPmap approximation itself, it can be pretty biased on non-power-of-2 textures. If you need precision, better rely on CubeMaps, currently the only way to get square power-of-2 textures in Shadertoy. )
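For the integration idea, a minimal sketch (the lod formula is a rough heuristic of mine, not taken from the linked example): a single MIP lookup stands in for averaging over a radius-sized neighbourhood.

    // Requires MIPmap filtering enabled on iChannel0.
    vec4 mipBlur(vec2 uv, float radiusInPixels) {
        float lod = log2(max(radiusInPixels, 1.0));    // MIP level whose texel size roughly matches the radius
        return textureLod(iChannel0, uv, lod);
    }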
GPU/SIMD issues and goodies
Parallelism is powerful, but it comes with constraints. Divergence in SIMD is a big one: a neighborhood of 32 pixels ( an NVIDIA warp ) is computed in total sync at the assembly-instruction level, so branching ( conditions, variable-length loops, switch, etc. ) forces “visiting” each configuration one after the other as soon as one pixel in the warp enters it.
A common irrational belief is then to fear and avoid any “if” and replace it with workarounds… which can be worse (and obfuscating).
→ The problem is not the “if” per se, but how you use it (in situations that are truly divergent). Typically, doing the same heavy work with different parameters in the two diverged branches is not parallelism-friendly, while just setting different parameters in the branches and applying them after the branching is.
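A minimal sketch of the difference (heavyShading() is a stand-in for your actual costly code):

    vec3 heavyShading(vec2 p, vec3 tint) {                 // stand-in for the expensive part
        vec3 c = vec3(0);
        for (int i = 0; i < 100; i++) c += tint * sin(float(i) * p.x + p.y);
        return c * 0.01;
    }

    vec3 shade(vec2 p, bool insideShape) {
        // Divergence-unfriendly: a mixed warp runs the heavy call twice.
        //   if (insideShape) return heavyShading(p, vec3(1, .5, .2));
        //   else             return heavyShading(p, vec3(.2, .5, 1));
        // Friendlier: only the cheap parameter selection diverges; the heavy call runs once for all lanes.
        vec3 tint = insideShape ? vec3(1, .5, .2) : vec3(.2, .5, 1);
        return heavyShading(p, tint);
    }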
Note that badly accounting for parallelism can also cause ultra-long or crashing compilation.
Read more here.
Also, on GPU a very large number of computing units must share caches and memory accesses: the ratio is even ridiculous compared to CPU programming, so coding the same way can be disastrous for performance. In particular, arrays can be ultra-costly (they prevent parallelism from covering wait states, by eating all the registers), and re-computing an expression can be faster than fetching it from memory or a texture, possibly even if it is already available in cache!
→ Be wise, but also test, if it’s something critical in your case (alas, the optimum is probably GPU-dependent).
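As a caricatural illustration of the array point (values are arbitrary): a per-pixel lookup table lives in registers or local memory and can hurt occupancy, while recomputing is just a bit of ALU work.

    float lookupOrRecompute(int k) {
        // Table version (can eat registers or spill to local memory):
        //   float tab[16];
        //   for (int i = 0; i < 16; i++) tab[i] = sin(float(i) * 0.4);
        //   return tab[k];
        // Recompute version: often as fast or faster on GPU.
        return sin(float(k) * 0.4);
    }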
Hardware derivatives:
If you need the screen-space derivative of anything, you can get it (almost) for free using dFdx, dFdy, fwidth, since the SIMD sync allows the GPU to compute finite differences for you.
Still, note that the precision is limited (non-centered finite differences evaluated every two pixels, which is ok for slowly varying quantities), they are killed by divergence (be smart and compute them upstream), and they are not implemented on low-end devices.
For instance, you can often get antialiasing (almost) for free by drawing the fwidth-normalized distance to the shape, v/fwidth(v). ( But in case of discontinuity or divergence in v you have to be a bit smarter: see here. )
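A minimal sketch of this AA trick on a disk SDF (shape and colours arbitrary):

    float disk(vec2 p, float r) { return length(p) - r; }

    void mainImage(out vec4 fragColor, in vec2 fragCoord) {
        vec2 uv = (2.0 * fragCoord - iResolution.xy) / iResolution.y;
        float v = disk(uv, 0.5);
        float cov = clamp(0.5 - v / fwidth(v), 0.0, 1.0);   // ~1-pixel-wide smooth edge, resolution-independent
        fragColor = vec4(vec3(cov), 1.0);
    }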
( To be continued )