Division
SM translates this division in code:
out = total / 4410;
Into this ASM:
movaps xmm0,total;
divps xmm0,F4410;
movaps out,xmm0;
Only three statements. But one of them, divps, can be VERY expensive.
If you don't need the full precision of divps and can settle for about 12 bits of precision, you can multiply by the reciprocal instead.
movaps xmm0,F4410;
rcpps xmm0,xmm0;
mulps xmm0,total;
movaps out,xmm0;
One instruction more, but both rcpps and mulps are cheap.
This little trick applied to just ONE case reduced CPU usage from ~5.2% to ~5.0%, roughly a 4% relative improvement!
FYI, in this context the reciprocal of a number is 1 divided by that number: the reciprocal of A is 1/A.
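To get a feel for what a 12-bit reciprocal costs in accuracy, here is a rough Python sketch. Python has no rcpps, so approx_reciprocal is a made-up stand-in that simply rounds 1/d to 12 significant bits; the real instruction's estimate will differ in detail. The value of total is arbitrary too.

```python
import math

def approx_reciprocal(d, bits=12):
    # Hypothetical stand-in for rcpps: 1/d rounded to `bits` significant bits.
    m, e = math.frexp(1.0 / d)   # 1/d == m * 2**e, with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)

def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)

total = 123456.0                             # arbitrary sample value
exact = total / 4410.0                       # full-precision division (the divps version)
approx = total * approx_reciprocal(4410.0)   # reciprocal estimate, then multiply
err = relative_error(approx, exact)
print(f"exact={exact:.6f} approx={approx:.6f} relative error={err:.2e}")
```

The error stays below about one part in 4096, which is fine for many audio parameters but probably not for anything that accumulates over time.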
If you divide by the same number more than once, you can even take the time to calculate the reciprocal of the divisor in full precision and multiply by that instead:
out = A / 1234;
out = B / 1234;
Translates to:
movaps xmm0,A;
divps xmm0,F1234;
movaps out,xmm0;
movaps xmm0,B;
divps xmm0,F1234;
movaps out,xmm0;
But if you calculate the reciprocal of 1234 first:
R1234 = 1/1234;
out = A * R1234;
out = B * R1234;
or in ASM...
movaps xmm0,F1;
divps xmm0,F1234;
movaps R1234,xmm0;
movaps xmm0,A;
mulps xmm0,R1234;
movaps out,xmm0;
movaps xmm0,B;
mulps xmm0,R1234;
movaps out,xmm0;
Even though the latter uses more code, the advantage of having only one divps and two mulps versus two divps is quite high.
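Here is the same idea as a small Python sketch, assuming several values share one divisor. The function names are made up for illustration, and Python floats are doubles rather than the single-precision values SSE works on, so this only shows the arithmetic, not the speed.

```python
def divide_each(values, divisor):
    # One division per value (one divps per value in the ASM above).
    return [v / divisor for v in values]

def multiply_by_reciprocal(values, divisor):
    # One division up front, then only multiplications.
    reciprocal = 1.0 / divisor
    return [v * reciprocal for v in values]

values = [10.0, 250.5, 99999.0]   # arbitrary sample inputs
slow = divide_each(values, 1234.0)
fast = multiply_by_reciprocal(values, 1234.0)
for s, f in zip(slow, fast):
    print(s, f, abs(s - f))
```

The two results can differ in the last bit or so, because the reciprocal is rounded once before the multiply. For most uses that difference is far below anything audible.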
Of course, if the divisor is static or comes from outside the code module, you should precalculate the reciprocal.

Doing that brought my ~5.2% down a further 0.1%, to ~4.9% (whilst improving quality, provided the constant is rounded to sufficient digits). These sound like minor improvements, but if you add up many of these simple changes, the performance difference can be dramatic.
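On the "sufficient digits" point, here is a quick Python sketch of how much the number of digits in a hand-typed reciprocal constant matters. The two constants are my own roundings of 1/4410, picked purely for illustration; nine significant digits is the usual figure for pinning down a single-precision float.

```python
exact = 1.0 / 4410.0

# Hypothetical hand-typed constants for 1/4410.
R4410_SHORT = 0.000227          # 3 significant digits
R4410_LONG = 0.000226757370     # 9 significant digits

def relative_error(constant):
    return abs(constant - exact) / exact

print(f"3 digits: {relative_error(R4410_SHORT):.1e}")
print(f"9 digits: {relative_error(R4410_LONG):.1e}")
```

The 3-digit constant is off by roughly 0.1%, which is worse than a 12-bit reciprocal estimate, so a sloppy constant can throw away the quality gain.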