wurstcaptures - x86 assembly tricks

Optimized x86 assembly stuff and tricks.

Most people think that nowadays you can't optimize better than the compiler. This is true for most parts of your program, but some functions can still be sped up with hand-crafted assembly when needed. Here you'll find some functions I've created approximations for or converted to assembly to speed them up.

Fast powf() for computation of specular intensity
Fast roundf(), floorf(), ceilf() and truncf() functions
A shufps immediate value calculator
References

Fast powf() for computation of specular intensity:
In the Phong lighting model the intensity of specular highlights is calculated using the power function a^b. Usually numbers up to 128 are used for b - the higher the exponent, the sharper/smaller the highlight. The standard function is powf(). Here we use the famous approximation by C. Schlick [1], which is a^b ~= a / (b - a * b + a) and gets fairly close to the original. The MSVC compiler does very well when optimizing the function in release mode, but you can still gain a bit of performance with a bit of work.
In the assembler code we get from the compiler we remove the last two instructions, which are redundant anyway, and reorder the fld instructions, which saves us an fxch.

float powf_schlick(const float a, const float b)
{
    return (a / (b - a * b + a));
}

float powf_schlickASM(const float a, const float b)
{
    __asm {
                                  // floating point stack
        fld         dword ptr [b] // = b
        fmul        dword ptr [a] // = b*a
        //fxch
        fsubr       dword ptr [b] // = b-b*a
        fadd        dword ptr [a] // = b-b*a+a
        fdivr       dword ptr [a] // = a /(b-b*a+a)
        //fstp        dword ptr [esp+8] // empty
        //fld         dword ptr [esp+8] // pow(a, b)
    }
}

You can also convert this function to SSE to process floating point vectors really fast:

void powf_schlickSSE(const float * a, const float b, float * result)
{
    __asm {
        mov         eax, a              //load address of vector
        movss       xmm0, dword ptr [b] //load exponent into SSE register
        movups      xmm1, [eax]         //load vector into SSE register
        shufps      xmm0, xmm0, 0       //shuffle b into all floats
        movaps      xmm2, xmm1          //duplicate vector
        mov         eax, result         //load address of result
        mulps       xmm1, xmm0          //xmm1 = a*b
        subps       xmm0, xmm1          //xmm0 = b-a*b
        addps       xmm0, xmm2          //xmm0 = b-a*b+a
        rcpps       xmm0, xmm0          //xmm0 = 1 / (b-a*b+a) (approximation)
        mulps       xmm2, xmm0          //xmm2 = a * (1 / (b-a*b+a))
        movups      [eax], xmm2         //store result
    }
}

The SSE function is as fast as the single float function! By using a reciprocal and then a multiply we lose a bit of precision, but in the desired range there is no problem with that.
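The same sequence can also be sketched with compiler intrinsics, which is handy where inline assembly is not available. This is only a sketch under assumptions: the name powf_schlick_intrin, the extra length parameter n and the HAVE_SSE macro are made up for this example, and the scalar branch is just there so the code still compiles on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define HAVE_SSE 1 /* hypothetical macro, only used by this sketch */
#endif

/* Schlick's a / (b - a*b + a) for n floats; n must be a multiple of 4. */
static void powf_schlick_intrin(const float *a, float b, float *result, size_t n)
{
    assert(n % 4 == 0);
#ifdef HAVE_SSE
    const __m128 vb = _mm_set1_ps(b);                    /* b in all four lanes   */
    for (size_t i = 0; i < n; i += 4) {
        __m128 va  = _mm_loadu_ps(a + i);                /* unaligned load (movups) */
        __m128 den = _mm_sub_ps(vb, _mm_mul_ps(va, vb)); /* b - a*b               */
        den = _mm_add_ps(den, va);                       /* b - a*b + a           */
        /* rcpps only approximates 1/x, so the result is slightly less precise */
        _mm_storeu_ps(result + i, _mm_mul_ps(va, _mm_rcp_ps(den)));
    }
#else
    for (size_t i = 0; i < n; ++i)                       /* scalar fallback       */
        result[i] = a[i] / (b - a[i] * b + a[i]);
#endif
}
```

Called with a four-element input array, this should agree with powf_schlick() to within the precision of the reciprocal approximation.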

Here are some values measured on an Intel Pentium D 3.4GHz with 100*10^6 random numbers in the range [0,128]:

Method	Values	Execution time	Median error	Maximum error
powf()	a ∈ [0,1] b ∈ [1,128]	382.8ns	0.0	0.0
powf_schlick()	a ∈ [0,1] b ∈ [1,128]	26.6ns	< 0.01	~0.2
powf_schlickASM()	a ∈ [0,1] b ∈ [1,128]	24.1ns	< 0.01	~0.2
powf_schlickSSE()	a ∈ [0,1] b ∈ [1,128]	28.8ns (7.2ns per float)	< 0.01	~0.2

With my Pentium M745 (1.8GHz) the results look a bit different. The ASM method is a bit slower on that processor and the default powf() is much faster. But the SSE function is actually faster while operating on 4 floats simultaneously:

Method	Values	Execution time	Median error	Maximum error
powf()	a ∈ [0,1] b ∈ [1,128]	249.8ns	0.0	0.0
powf_schlick()	a ∈ [0,1] b ∈ [1,128]	29.1ns	< 0.01	~0.2
powf_schlickASM()	a ∈ [0,1] b ∈ [1,128]	29.7ns	< 0.01	~0.2
powf_schlickSSE()	a ∈ [0,1] b ∈ [1,128]	26.8ns (6.7ns per float)	< 0.01	~0.2

Here is an animated image comparing the real powf() (grey) to the approximation (red). The error is shown in green:
(a ∈ [0,1] and b ∈ {1, 2, 4, 8, 16, 32, 64, 128})
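To get a feeling for the error without the image, the comparison can be sketched in a few lines of C. The helpers pow_pow2 and schlick_max_error are hypothetical names for this example. As an assumption to keep the sketch free of the math library, the reference value is computed by repeated squaring instead of powf(), which only works because the exponents listed above are powers of two.

```c
#include <assert.h>

/* Schlick's approximation of a^b, same formula as powf_schlick() above. */
static float powf_schlick(float a, float b)
{
    return a / (b - a * b + a);
}

/* Reference value a^(2^k), computed by squaring k times (hypothetical helper). */
static float pow_pow2(float a, int k)
{
    assert(k >= 0);
    while (k-- > 0)
        a *= a;
    return a;
}

/* Largest absolute error for b = 2^k over steps+1 evenly spaced
   values of a in [0,1] (hypothetical helper). */
static float schlick_max_error(int k, int steps)
{
    const float b = (float)(1 << k);
    float worst = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        float a = (float)i / (float)steps;
        float err = powf_schlick(a, b) - pow_pow2(a, k);
        if (err < 0.0f)
            err = -err;
        if (err > worst)
            worst = err;
    }
    return worst;
}
```

For example, schlick_max_error(7, 1000) scans b = 128 and should land at roughly 0.2, in line with the maximum error in the tables above. At a = 0 and a = 1 the approximation gives 0 and 1, so the error sits in between.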


Fast roundf(), floorf(), ceilf() and truncf() functions:
Often you need a fast floor() function for your algorithms. One way is to load the number, change the FPU control word, store the number as an int and restore the original control word. There are faster methods though. This one is from Laurent de Soras [2] and uses pure FPU code. It rounds, floors, ceils or truncates without touching the control word, but it relies on the FPU being in its default round-to-nearest mode. Read his PDF for the other functions and extensive information on rounding.

float floorf_ASM(const float a)
{
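    const float roundTowardsNI = -0.5f; //use 0.5f to round to nearest integer
    int result; //fistp stores a 32-bit integer here
    __asm {
        fld         dword ptr [a]
        fadd        st, st                // = 2*a
        fadd        roundTowardsNI        // = 2*a - 0.5
        fistp       dword ptr [result]    // = round(2*a - 0.5)
        //return float value
        sar         dword ptr [result], 1 // = floor(a)
        fild        dword ptr [result]    // load the integer back as a float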
        /*or: return int value
        mov         eax, dword ptr [result]
        sar         eax, 1
        */
    }
}

float ceilf_ASM(const float a)
{
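    const float roundTowardsPI = -0.5f;
    int result; //fistp stores a 32-bit integer here
    __asm {
        fld         dword ptr [a]
        fadd        st, st                // = 2*a
        fsubr       roundTowardsPI        // = -0.5 - 2*a
        fistp       dword ptr [result]    // = round(-0.5 - 2*a)
        //return float value
        sar         dword ptr [result], 1 // = -ceil(a)
        neg         dword ptr [result]    // = ceil(a)
        fild        dword ptr [result]    // load the integer back as a float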
        /*or: return int value
        mov         eax, dword ptr [result]
        sar         eax, 1
        neg         eax
        */
    }
}
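The arithmetic behind the two routines above can be written down in C to see why the doubling and the 0.5 offset work. This is only a sketch under assumptions: the names to_int_current_mode, floor_sketch, ceil_sketch and round_sketch are made up here, the SSE2 conversion (or lrint() as a fallback) stands in for fistp, and the intermediate is a double instead of an x87 register. It also assumes the default round-to-nearest mode and an arithmetic right shift for negative ints, which C leaves implementation-defined.

```c
#include <assert.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
/* cvtsd2si: double -> int using the current MXCSR rounding mode. */
static int to_int_current_mode(double v)
{
    return _mm_cvtsd_si32(_mm_set_sd(v));
}
#else
#include <math.h>
/* lrint() also converts using the current rounding mode. */
static int to_int_current_mode(double v)
{
    return (int)lrint(v);
}
#endif

/* floor(x) = round(2x - 0.5) >> 1; the doubled value must fit into an int. */
static int floor_sketch(float x)
{
    assert(x > -1073741824.0f && x < 1073741824.0f);
    return to_int_current_mode(2.0 * x - 0.5) >> 1;
}

/* ceil(x) = -(round(-0.5 - 2x) >> 1) */
static int ceil_sketch(float x)
{
    assert(x > -1073741824.0f && x < 1073741824.0f);
    return -(to_int_current_mode(-0.5 - 2.0 * x) >> 1);
}

/* round to nearest, halves go up: round(2x + 0.5) >> 1 */
static int round_sketch(float x)
{
    assert(x > -1073741824.0f && x < 1073741824.0f);
    return to_int_current_mode(2.0 * x + 0.5) >> 1;
}
```

Doubling the input turns the question "which side of the integer am I on" into a question the round-to-nearest conversion can answer, and the shift divides the result by two again. Inputs of very small magnitude are lost when 0.5 is added, so in this sketch a tiny negative number floors to 0 instead of -1.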

Laurent also presents methods for trunc and round to nearest, but those can be done with instructions that already exist in SSE and are probably faster:

float trunc_SSE(const float a)
{
    int result; //cvttss2si produces a 32-bit integer
    __asm {
        cvttss2si   eax, dword ptr [a]
        //cvttsd2si   eax, qword ptr [a] //for double values
        //return float value. if you want to return an int, you're actually done already...
        mov         dword ptr [result], eax
        fild        dword ptr [result] //convert the integer back to a float
    }
}

float round_SSE(const float a)
{
    int result; //cvtss2si produces a 32-bit integer
    __asm {
        cvtss2si    eax, dword ptr [a] //round according to rounding control bits in the MXCSR register
        //cvtsd2si    eax, qword ptr [a] //for double values
        //return float value. if you want to return an int, you're actually done already...
        mov         dword ptr [result], eax
        fild        dword ptr [result] //convert the integer back to a float
    }
}
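The two conversions are also reachable through intrinsics, which makes the difference between them easy to try out. A minimal sketch, assuming the default MXCSR rounding mode (round to nearest, ties to even); the names trunc_intrin and round_intrin are made up for this example, and the non-SSE branch is only a fallback so the code compiles elsewhere.

```c
#include <assert.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
/* cvttss2si: always truncates towards zero. */
static int trunc_intrin(float a)
{
    assert(a > -2147483648.0f && a < 2147483648.0f); /* must fit into an int */
    return _mm_cvttss_si32(_mm_set_ss(a));
}
/* cvtss2si: rounds according to the MXCSR rounding control bits. */
static int round_intrin(float a)
{
    assert(a > -2147483648.0f && a < 2147483648.0f);
    return _mm_cvtss_si32(_mm_set_ss(a));
}
#else
#include <math.h>
static int trunc_intrin(float a)
{
    assert(a > -2147483648.0f && a < 2147483648.0f);
    return (int)a;          /* a C cast truncates towards zero */
}
static int round_intrin(float a)
{
    assert(a > -2147483648.0f && a < 2147483648.0f);
    return (int)lrintf(a);  /* uses the current rounding mode */
}
#endif
```

With the default mode, trunc_intrin(-2.7f) gives -2 while round_intrin(-2.7f) gives -3, and exact halves go to the even neighbour, so round_intrin(2.5f) gives 2 and round_intrin(3.5f) gives 4.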

Another trunc variant comes with SSE3, which adds the FPU instruction fisttp for this purpose. It always truncates, regardless of the rounding mode in the FPU control word, and might be faster in some scenarios:

float trunc_SSE3(const float a)
{
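    int result; //fisttp stores a 32-bit integer here
    __asm {
        fld         dword ptr [a]
        fisttp      dword ptr [result]
        //fisttp      qword ptr [result] //stores a 64-bit integer, result must then be 8 bytes wide
        //return float value
        fild        dword ptr [result] //convert the integer back to a float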
        /*or: return int value
        mov         eax, dword ptr [result]
        */
    }
}

Sadly I have no timing values yet...


A shufps immediate value calculator:
The shufps instruction always annoys the hell out of me. You need it often, but I usually get the immediate value wrong even though I know how it works. And when reading some SSE code I always have to figure out what goes where... Use the combo boxes or change the text field value to see the results.
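What the calculator works out can also be sketched in a few lines of C. The immediate holds four 2-bit selectors: the two low result elements are picked from the destination operand and the two high result elements from the source operand. The names shufps_imm and shufps_model are made up for this example, and the model is plain C that only mimics the instruction.

```c
#include <assert.h>

/* Packs four selectors (each 0..3) into a shufps immediate.
   selN decides which input element ends up in result element N:
   sel0 and sel1 index the destination operand, sel2 and sel3 the source. */
static unsigned shufps_imm(unsigned sel3, unsigned sel2,
                           unsigned sel1, unsigned sel0)
{
    assert(sel0 < 4 && sel1 < 4 && sel2 < 4 && sel3 < 4);
    return (sel3 << 6) | (sel2 << 4) | (sel1 << 2) | sel0;
}

/* Plain C model of "shufps dst, src, imm" (dst and src may be the same array). */
static void shufps_model(float dst[4], const float src[4], unsigned imm)
{
    /* read everything first, because dst is overwritten afterwards */
    float r0 = dst[imm & 3];
    float r1 = dst[(imm >> 2) & 3];
    float r2 = src[(imm >> 4) & 3];
    float r3 = src[(imm >> 6) & 3];
    dst[0] = r0;
    dst[1] = r1;
    dst[2] = r2;
    dst[3] = r3;
}
```

shufps_imm(0, 0, 0, 0) gives 0, the broadcast used in powf_schlickSSE() above. shufps_imm(3, 2, 1, 0) gives 0xE4, which leaves a register unchanged when shuffled with itself, and shufps_imm(0, 1, 2, 3) gives 0x1B, which reverses it. The packing is the same one the _MM_SHUFFLE macro from xmmintrin.h uses.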

The great line drawing JavaScript is from p01. Thanks a lot man!


References:
- 1: C. Schlick, A Fast Alternative to Phong's Specular Model, in P. Heckbert, editor, Graphics Gems IV, pages 385-387. Academic Press, 1994.
- 2: Laurent de Soras, Fast Rounding of Floating Point Numbers on Wintel Platform, http://ldesoras.free.fr/doc/articles/rounding_en.pdf (12.01.2007)