Intro
Section Old Question contains the initial question (Further Investigation and Conclusion have been added since).
Skip to the section Further Investigation below for a detailed comparison of the different timing methods (
rdtsc
,clock_gettime
andQueryThreadCycleTime
).I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see section Conclusion).
The code used for testing is at the bottom of this question (see section Appendix).
Apologies for the length.
Old Question
In short: I am using clock_gettime
to measure the execution time of many code segments. I am experiencing very inconsistent measurements between separate runs. The method has an extremely high standard deviation when compared to other methods (see Explanation below).
Question: Is there a reason why clock_gettime
would give so inconsistent measurements when compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?
Explanation: I am trying to profile a number of small parts of C code. The execution time of each of the code segments is not more than a couple of microseconds. In a single run, each of the code segments will execute some hundreds of times, which produces runs × hundreds
of measurements.
I also have to measure only the time the thread actually spends executing (which is why rdtsc
is not suitable). I also need a high resolution (which is why times
is not suitable).
I've tried the following methods:
rdtsc
(on Linux and Windows),clock_gettime
(with 'CLOCK_THREAD_CPUTIME_ID'; on Linux), andQueryThreadCycleTime
(on Windows).
Methodology: The analysis was performed on 25 runs. In each run, separate code segments repeat a 101 of times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements, and also calculate some basic stuff (like the mean, std.dev., median, mode, min, and max).
I do not present how I measured the 'similarity' of the three methods, but this simply involved a basic comparison of proportion of times spent in each code segment ('proportion' means that the times are normalised). I then look at the pure differences in these proportions. This comparison showed that all 'rdtsc', 'QTCT', and 'CGT' measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.
Results:
A comparison of clock_gettime
with rdtsc
for the same code segment (25 runs of 101 measurements = 2525 readings):
clock_gettime:
- 1881 measurements of 11 ns,
- 595 measurements were (distributed almost normally) between 3369 and 3414 ns,
- 2 measurements of 11680 ns,
- 1 measurement of 1506022 ns, and
the rest is between 900 and 5000 ns.
Min: 11 ns
- Max: 1506022 ns
- Mean: 1471.862 ns
- Median: 11 ns
- Mode: 11 ns
- Stddev: 29991.034
rdtsc (note: no context switches occurred during this run, but if it happens, it usually results in just a single measurement of 30000 ticks or so):
- 1178 measurements between 274 and 325 ticks,
- 306 measurements between 326 and 375 ticks,
- 910 measurements between 376 and 425 ticks,
- 129 measurements between 426 and 990 ticks,
- 1 measurement of 1240 ticks, and
1 measurement of 1256 ticks.
Min: 274 ticks
- Max: 1256 ticks
- Mean: 355.806 ticks
- Median: 333 ticks
- Mode: 376 ticks
- Stddev: 83.896
Discussion:
rdtsc
gives very similar results on both Linux and Windows. It has an acceptable standard deviation--it is actually quite consistent/stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment with an average of 1000 ticks or so will take ~30000 ticks every now and then--definitely because of pre-emption).QueryThreadCycleTime
gives very consistent measurements--i.e. much lower standard deviation when compared tordtsc
. When no context switches happen, this method is almost identical tordtsc
.clock_gettime
, on the other hand, is producing extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (when compared tordtsc
).
I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for any such differences between 'rdtsc' and 'clock_gettime'. What is going on?
Further Investigation
I have investigated this a bit further. I have done two things:
Measured the overhead of just calling
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t)
(see code 1 in Appendix), andin a plain loop called
clock_gettime
and stored the readings into an array (see code 2 in Appendix). I measure the delta times (difference in successive measurement times, which should correspond a bit to the overhead of the call ofclock_gettime
).
I have measured it on two different computers with two different Linux Kernel versions:
CGT:
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
- Estimated
clock_gettime
overhead: between 690-710 ns Delta times:
- Average: 815.22 ns
- Median: 713 ns
- Mode: 709 ns
- Min: 698 ns
- Max: 23359 ns
Histogram (left-out ranges have frequencies of 0):
Range | Frequency ------------------+----------- 697 < x ≤ 800 -> 78111 <-- cached? 800 < x ≤ 1000 -> 16412 1000 < x ≤ 1500 -> 3 1500 < x ≤ 2000 -> 4836 <-- uncached? 2000 < x ≤ 3000 -> 305 3000 < x ≤ 5000 -> 161 5000 < x ≤ 10000 -> 105 10000 < x ≤ 15000 -> 53 15000 < x ≤ 20000 -> 8 20000 < x -> 5
- Estimated
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
- Estimated
clock_gettime
overhead: between 279-283 ns Delta times:
- Average: 320.00
- Median: 1
- Mode: 1
- Min: 1
- Max: 3495529
Histogram (left-out ranges have frequencies of 0):
Range | Frequency --------------------+----------- x ≤ 1 -> 86738 <-- cached? 282 < x ≤ 300 -> 13118 <-- uncached? 300 < x ≤ 440 -> 78 2000 < x ≤ 5000 -> 52 5000 < x ≤ 30000 -> 5 3000000 < x -> 8
- Estimated
RDTSC:
Related code rdtsc_delta.c
and rdtsc_overhead.c
.
CPU: Core 2 Duo L9400 @ 1.86GHz
Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386
Results:
- Estimated overhead: between 39-42 ticks
Delta times:
- Average: 52.46 ticks
- Median: 42 ticks
- Mode: 42 ticks
- Min: 35 ticks
- Max: 28700 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency ------------------+----------- 34 < x ≤ 35 -> 16240 <-- cached? 41 < x ≤ 42 -> 63585 <-- uncached? (small difference) 48 < x ≤ 49 -> 19779 <-- uncached? 49 < x ≤ 120 -> 195 3125 < x ≤ 5000 -> 144 5000 < x ≤ 10000 -> 45 10000 < x ≤ 20000 -> 9 20000 < x -> 2
CPU: 4 × Dual Core AMD Opteron Processor 275
Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux
Results:
- Estimated overhead: between 13.7-17.0 ticks
Delta times:
- Average: 35.44 ticks
- Median: 16 ticks
- Mode: 16 ticks
- Min: 14 ticks
- Max: 16372 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency ------------------+----------- 13 < x ≤ 14 -> 192 14 < x ≤ 21 -> 78172 <-- cached? 21 < x ≤ 50 -> 10818 50 < x ≤ 103 -> 10624 <-- uncached? 5825 < x ≤ 6500 -> 88 6500 < x ≤ 8000 -> 88 8000 < x ≤ 10000 -> 11 10000 < x ≤ 15000 -> 4 15000 < x ≤ 16372 -> 2
QTCT:
Related code qtct_delta.c
and qtct_overhead.c
.
CPU: Core 2 6700 @ 2.66GHz
Kernel: Windows 7 64-bit
Results:
- Estimated overhead: between 890-940 ticks
Delta times:
- Average: 1057.30 ticks
- Median: 890 ticks
- Mode: 890 ticks
- Min: 880 ticks
- Max: 29400 ticks
Histogram (left-out ranges have frequencies of 0):
Range | Frequency ------------------+----------- 879 < x ≤ 890 -> 71347 <-- cached? 895 < x ≤ 1469 -> 844 1469 < x ≤ 1600 -> 27613 <-- uncached? 1600 < x ≤ 2000 -> 55 2000 < x ≤ 4000 -> 86 4000 < x ≤ 8000 -> 43 8000 < x ≤ 16000 -> 10 16000 < x -> 1
Conclusion
I believe the answer to my question would be a buggy implementation on my machine (the one with AMD CPUs with an old Linux kernel).
The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we'll see that the most frequent delta is 1 ns. This means t