The answer depends on your target platform. Assuming you are on one of the most common x86 CPUs, see http://instlatx64.atw.hu/. It is a collection of measured instruction latencies (how long the CPU takes to produce a result once its arguments are available) and of how instructions are pipelined, for many x86 and x86_64 processors. If your target is not x86, you can measure the cost yourself or consult your CPU documentation.
First, get a disassembly of your operations, either from the compiler (e.g. with gcc: gcc file.c -O3 -S -o file.asm)
or by disassembling the compiled binary (e.g. with the help of a debugger).
Remember that your operation also loads and stores a value, and that cost must be counted separately.
Here are two examples from friweb.hu:
For the Core 2 Duo E6700, the latency (L) of SQRT (x87, SSE, and SSE2 versions) is
- 29 ticks for 32-bit float; 58 ticks for 64-bit double; 69 ticks for 80-bit long double;
and the latency of floating-point DIVIDE is
- 18 ticks for 32-bit; 32 ticks for 64-bit; 38 ticks for 80-bit
For newer processors the cost is lower and almost the same for DIV and SQRT, e.g. for the Intel Sandy Bridge CPU:
Floating-point SQRT is
- 14 ticks for 32-bit; 21 ticks for 64-bit; 24 ticks for 80-bit
Floating-point DIVIDE is
- 14 ticks for 32-bit; 22 ticks for 64-bit; 24 ticks for 80-bit
Going by these figures, SQRT is even one tick faster than DIVIDE for 64-bit.
So: on older CPUs, sqrt itself takes roughly 60-80% longer than fdiv (29 vs. 18, 58 vs. 32, and 69 vs. 38 ticks above); on newer CPUs the cost is about the same.
On newer CPUs, both operations are cheaper than they were on older CPUs.
Longer floating-point formats need more time: 64-bit takes about 1.5-2x as long as 32-bit, but 80-bit is cheap compared with 64-bit, costing only about 10-20% more.
Also, newer CPUs have vector operations (SSE, SSE2, AVX) that run at the same speed as the scalar ones (x87). A vector holds several values of the same type, e.g. 4 floats or 2 doubles in a 128-bit SSE register. If you can arrange your loop to apply the same operation to several FP values at once, you will get more performance out of the CPU.