Since nothing changes in the GPU code as you switch your project from x86 to x64, it all has to do with how the multiplication is performed on the CPU. There are some subtle differences between floating-point handling in x86 and x64 mode, and the biggest one is that since every x64 CPU also supports SSE and SSE2, these are used by default for math operations in 64-bit mode on Windows.
The HD4770 GPU does all computations using single-precision floating point units. Modern x64 CPUs on the other hand have two kinds of functional units that handle floating point numbers:
- the x87 FPU, which operates with the much higher extended precision of 80 bits
- the SSE FPU, which operates with 32-bit and 64-bit precision and is much more compatible with how other CPUs handle floating-point numbers
In 32-bit mode the compiler does not assume that SSE is available and generates the usual x87 FPU code to do the math. In this case operations like data[i] * data[i] are performed internally using the much higher 80-bit precision. A comparison of the kind if (result[i] == data[i] * data[i]) is performed as follows:
- data[i] is pushed onto the x87 FPU stack using FLD DWORD PTR data[i]
- data[i] * data[i] is computed using FMUL DWORD PTR data[i]
- result[i] is pushed onto the x87 FPU stack using FLD DWORD PTR result[i]
- both values are compared using FUCOMPP
Here comes the problem: data[i] * data[i] resides in an x87 FPU stack element with 80-bit precision, while result[i] comes from the GPU with 32-bit precision. The two numbers will most likely differ, since data[i] * data[i] has many more significant digits, whereas result[i], once extended to 80-bit precision, simply has lots of trailing zeros!
In 64-bit mode things happen differently. The compiler knows that your CPU is SSE-capable and uses SSE instructions to do the math. The same comparison statement is performed in the following way on x64:
- data[i] is loaded into an SSE register using MOVSS XMM0, DWORD PTR data[i]
- data[i] * data[i] is computed using MULSS XMM0, DWORD PTR data[i]
- result[i] is loaded into another SSE register using MOVSS XMM1, DWORD PTR result[i]
- both values are compared using UCOMISS XMM1, XMM0
In this case the squaring is performed with the same 32-bit single precision that is used on the GPU, and no intermediate 80-bit results are generated. That's why the results are the same.
It is very easy to actually test this without any GPU being involved. Just run the following simple program:
#include <stdlib.h>
#include <stdio.h>

float mysqr(float f)
{
    f *= f;    /* the intermediate product is stored into a 32-bit float */
    return f;
}

int main(void)
{
    int i, n;
    float f;
    srand(1);
    for (i = n = 0; n < 1000000; n++)
    {
        f = rand() / (float)RAND_MAX;
        if (mysqr(f) != f * f) i++;
    }
    printf("%d of %d squares differ\n", i, n);
    return 0;
}
mysqr is specifically written so that the intermediate 80-bit result gets converted to a 32-bit precision float. If you compile and run in 64-bit mode, the output is:
0 of 1000000 squares differ
If you compile and run in 32-bit mode, the output is:
999845 of 1000000 squares differ
In principle you should be able to change the floating-point model in 32-bit mode (Project Properties -> Configuration Properties -> C/C++ -> Code Generation -> Floating Point Model), but doing so changes nothing: at least on VS2010, intermediate results are still kept in the FPU. What you can do instead is force a store and reload of the computed square, so that it is rounded to 32-bit precision before being compared with the result from the GPU. In the simple example above this is achieved by changing:
if (mysqr(f) != f*f) i++;
to
if (mysqr(f) != (float)(f*f)) i++;
After this change the 32-bit code output becomes:
0 of 1000000 squares differ