Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)

Ubuntu : 17.04

I have been trying to collect stats of memory-accesses using perf stat. I am able to collect stats for memory-stores but the count for memory-loads return me a 0 value.

The below is the details for memory-stores :-

perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

       158,115,510      cpu/mem-stores/u                                            

       0.559922797 seconds time elapsed

For memory-loads, I get a 0 count as can be seen below :-

perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25

 Performance counter stats for './libquantum_base.arnab 100':

                 0      cpu/mem-loads/u                                             

       0.563806170 seconds time elapsed

I cannot understand why this does not count properly. Should I use a different event in any way to get proper data ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
684 views
Welcome To Ask or Share your Answers For Others

1 Answer

The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.

MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):

When a user elects to sample one of these events, special hardware is used that can keep track of a data load from issue to completion. This is more complicated than simply counting instances of an event (as with normal event-based sampling), and so only some loads are tracked. Loads are randomly chosen, the latency determined for each, and the correct event(s) incremented (latency >4, >8, >16, etc). Due to the nature of the sampling for this event, only a small percentage of an application's data loads can be tracked at any one time.

As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means count the total number of loads and it is not designed for that purpose at all.

If you want to to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem and it achieves its purpose by using this event.

If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.

On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.

So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...