64bit - How to force gcc to use all SSE (or AVX) registers?


I'm trying to write some computationally intensive code for a Windows x64 target, using SSE or the new AVX instructions, compiling with GCC 4.5.2 and 4.6.1, MinGW64 (the TDM GCC build, and a custom build). My compiler options are -O3 -mavx. (-m64 is implied.)

In short, I want to perform a lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data from and onto the stack. So my question is:

Is there a way to convince GCC to use all the registers xmm0-xmm15?

To fix ideas, consider the following SSE code (for illustration only):

    void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
        for (int i = 0; i < 10; i++) {
            vect<__m128> v = q2 - q1;
            a1 += v;
    //      a2 -= v;

            q2 *= _mm_set1_ps(2.);
        }
    }

Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves, and everything is performed in registers xmm0-xmm10. When I remove the comment on a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or something.

I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
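For anyone who wants to reproduce this, here is a minimal sketch of what such a vect<T> might look like. The question does not show the real definition, so the member names x, y, z and the operator overloads below are assumptions reconstructed from the description; a2 -= v is enabled so that all 12 values are live in the loop.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_sub_ps, ...

// Hypothetical reconstruction of vect<T>: three packed values, one per
// coordinate. The member names x, y, z are assumptions.
template <typename T>
struct vect {
    T x, y, z;
};

// Component-wise difference of two vectors.
inline vect<__m128> operator-(const vect<__m128>& a, const vect<__m128>& b) {
    return vect<__m128>{_mm_sub_ps(a.x, b.x), _mm_sub_ps(a.y, b.y), _mm_sub_ps(a.z, b.z)};
}

// Component-wise in-place addition.
inline vect<__m128>& operator+=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_add_ps(a.x, b.x);
    a.y = _mm_add_ps(a.y, b.y);
    a.z = _mm_add_ps(a.z, b.z);
    return a;
}

// Component-wise in-place subtraction.
inline vect<__m128>& operator-=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_sub_ps(a.x, b.x);
    a.y = _mm_sub_ps(a.y, b.y);
    a.z = _mm_sub_ps(a.z, b.z);
    return a;
}

// In-place multiplication of every coordinate by a packed "scalar".
inline vect<__m128>& operator*=(vect<__m128>& a, __m128 s) {
    a.x = _mm_mul_ps(a.x, s);
    a.y = _mm_mul_ps(a.y, s);
    a.z = _mm_mul_ps(a.z, s);
    return a;
}

// The loop from the question, with a2 -= v enabled.
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= _mm_set1_ps(2.0f);
    }
}
```

Compiling this with -O3 -S and reading the generated assembly shows which xmm registers are actually used and where spills occur. GCC may warn about ignored attributes on the __m128 template argument; the code still compiles.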

Two points:

  • First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches, register shadowing and other tricks), and the 64-bit-only registers are more costly to access (in terms of larger instructions), so it may just be that GCC's version is as fast as, or faster than, the one you want.
  • Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there was, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is approximate.)

So, if you want better register allocation, you have two options:

  • write a better register allocator, and patch it into GCC, or
  • bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.
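As a rough illustration of the second option, here is a minimal sketch using GCC extended inline assembly rather than a separate assembly file. The function name add_in_xmm15 is made up for this example, and the sketch assumes an x86-64 target and GCC's default AT&T assembler syntax. It only shows how one specific register can be named and declared as clobbered; rewriting the whole loop this way would follow the same pattern on a larger scale.

```cpp
#include <xmmintrin.h>  // __m128 and SSE intrinsics

// Hypothetical helper: adds two packed-float vectors, explicitly routing the
// computation through xmm15. Assumes x86-64 and AT&T syntax (GCC default).
inline __m128 add_in_xmm15(__m128 a, __m128 b) {
    __m128 result;
    __asm__(
        "movaps %1, %%xmm15\n\t"  // xmm15 = a
        "addps  %2, %%xmm15\n\t"  // xmm15 += b
        "movaps %%xmm15, %0"      // result = xmm15
        : "=x"(result)            // output: any SSE register
        : "x"(a), "x"(b)          // inputs: any SSE registers
        : "xmm15");               // tell GCC that xmm15 is overwritten
    return result;
}
```

The clobber list is what keeps this safe: GCC will not place the operands in xmm15, and where the calling convention makes xmm15 callee-saved, the compiler takes care of saving and restoring it around the function body.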
