Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Greetings noble community,

I want to have the following loop:

for(i = 0; i < MAX; i++)
    A[i] = B[i] + C[i];

This will run in parallel on a shared-memory quad-core computer using threads. The two alternatives below are being considered for the code to be executed by these threads, where tid is the id of the thread: 0, 1, 2 or 3.

(for simplicity, assume MAX is a multiple of 4)

Option 1:

for(i = tid; i < MAX; i += 4)
    A[i] = B[i] + C[i];

Option 2:

for(i = tid*(MAX/4); i < (tid+1)*(MAX/4); i++)
    A[i] = B[i] + C[i];

My question is whether one is more efficient than the other, and why?


1 Answer

The second one is better than the first. Short answer: the second one minimizes false sharing.

A modern CPU does not load bytes into the cache one at a time. It reads memory in fixed-size blocks called cache lines. When two threads modify different variables that sit on the same cache line, each write invalidates the other core's copy of that line, so the other core must reload it before its next access.
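To make this concrete for the two options in the question, here is a minimal sketch that maps array indexes to cache lines. It assumes 64-byte cache lines, 4-byte elements, and an array that starts on a cache-line boundary; none of these is guaranteed on a given machine, and the class and method names are made up for illustration.

```java
// Sketch: which cache line does each thread write first under each option?
// Assumptions (not guarantees): 64-byte lines, 4-byte elements,
// array starts on a cache-line boundary.
final class CacheLineMap {
    static final int LINE_BYTES = 64;
    static final int ELEM_BYTES = 4;
    static final int THREADS = 4;

    // Cache line number that holds element i.
    static int lineOf(int i) {
        return (i * ELEM_BYTES) / LINE_BYTES;
    }

    // Index written by thread tid at step k under Option 1 (interleaved).
    static int option1Index(int tid, int k) {
        return tid + k * THREADS;
    }

    // Index written by thread tid at step k under Option 2 (contiguous blocks).
    static int option2Index(int tid, int k, int max) {
        return tid * (max / THREADS) + k;
    }

    public static void main(String[] args) {
        int max = 1024;
        for (int tid = 0; tid < THREADS; tid++) {
            System.out.printf("tid %d: option 1 -> line %d, option 2 -> line %d%n",
                    tid,
                    lineOf(option1Index(tid, 0)),
                    lineOf(option2Index(tid, 0, max)));
        }
    }
}
```

Under these assumptions, every thread starts on line 0 with Option 1, so all four cores keep writing to the same line. With Option 2 and MAX = 1024 the threads start on lines 0, 16, 32 and 48, and can only collide near block boundaries.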

When would this happen?

Basically, elements that are close together in memory will be in the same cache line. Neighboring elements of an array will share a cache line, since an array is just a contiguous chunk of memory. Likewise, foo1 and foo2 below might share a cache line, since they are declared next to each other in the same class.

class Foo {

private int foo1;
private int foo2;

}

How bad is false sharing?

I refer to Example 6 from the Gallery of Processor Cache Effects:

private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done. On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
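If you want to try the experiment on the JVM, the following is a rough Java port, not the original benchmark. It swaps the plain int array for an AtomicIntegerArray so that every update is a volatile read and write, which is my own choice to keep the JIT from folding the loop into a single addition. The class name and the run helper are hypothetical, and the timings you get will depend on your hardware and JVM.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

final class FalseSharingDemo {
    // Starts one thread per position; each thread bumps only its own slot.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long run(AtomicIntegerArray counters, int[] positions, int iterations) {
        Thread[] threads = new Thread[positions.length];
        for (int t = 0; t < positions.length; t++) {
            final int position = positions[t];
            threads[t] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    // Volatile read then volatile write. Safe without CAS
                    // because no other thread touches this slot.
                    counters.set(position, counters.get(position) + 3);
                }
            });
        }
        long start = System.nanoTime();
        for (Thread th : threads) {
            th.start();
        }
        try {
            for (Thread th : threads) {
                th.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        long adjacent = run(new AtomicIntegerArray(1024), new int[] {0, 1, 2, 3}, iterations);
        long spread = run(new AtomicIntegerArray(1024), new int[] {16, 32, 48, 64}, iterations);
        System.out.printf("adjacent slots: %d ms%n", adjacent / 1_000_000);
        System.out.printf("spread slots:   %d ms%n", spread / 1_000_000);
    }
}
```

Because each thread owns exactly one slot, the results are deterministic even though the timings are not: after run returns, every used slot holds 3 * iterations.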

How to detect false sharing?

Linux perf can be used to count cache misses and therefore help you analyze such problems.

Following the analysis in CPU Cache Effects and Linux Perf, use perf to count L1 cache misses for almost the same code example as above:

Performance counter stats for './cache_line_test 0 1 2 3':
10,055,747 L1-dcache-load-misses     #    1.54% of all L1-dcache hits   [51.24%]
Performance counter stats for './cache_line_test 16 32 48 64':
  36,992 L1-dcache-load-misses     #    0.01% of all L1-dcache hits   [50.51%]

The output shows that L1 data-cache load misses drop from 10,055,747 to 36,992 without false sharing. The miss itself is not where the time goes. The cost comes from what follows each miss: fetching the line again from L2, L3, or main memory.

Is there some good practice in industry?

LMAX Disruptor is a high-performance inter-thread messaging library, and it is the default messaging system for intra-worker communication in Apache Storm. The underlying data structure is a simple ring buffer, but to make it fast it uses many tricks to reduce false sharing.

For example, it defines the superclass RingBufferPad to put padding in front of the fields of RingBuffer:

abstract class RingBufferPad
{
    protected long p1, p2, p3, p4, p5, p6, p7;
}
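The same idea can be applied to a single hot field. The sketch below is similar in spirit to the padding class above but is not Disruptor code; all class names are hypothetical. It relies on the JVM laying out superclass fields before subclass fields, which HotSpot has traditionally done but the Java language does not guarantee, and it assumes 64-byte cache lines.

```java
// Left padding: seven longs (56 bytes) laid out before the hot field.
class LeftPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

// The field that one thread updates frequently.
class HotValue extends LeftPad {
    protected volatile long value;
}

// Right padding: seven more longs laid out after the hot field, so that
// no unrelated field is likely to land on the same 64-byte line.
final class PaddedCounter extends HotValue {
    protected long p9, p10, p11, p12, p13, p14, p15;

    long get() {
        return value;
    }

    // Intended for a single writer thread; this read-modify-write is not atomic.
    void increment() {
        value = value + 1;
    }
}
```

Splitting the padding across a class hierarchy, rather than declaring all fields in one class, is a way to discourage the JVM from reordering the padding away from the hot field.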

Also, when it allocates memory for the buffer, it adds padding at both the front and the tail so that the entries are not affected by data in adjacent memory:

this.entries   = new Object[sequencer.getBufferSize() + 2 * BUFFER_PAD];
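A simplified sketch of how such a padded array can be indexed follows. This is not the Disruptor implementation: the class name, the pad size of 32 slots, and the use of plain array indexing are all assumptions made for illustration.

```java
// Sketch: a ring buffer whose live slots sit between two unused pad regions.
final class PaddedRing {
    // Hypothetical pad size: 32 unused slots on each side of the live region.
    static final int BUFFER_PAD = 32;

    private final Object[] entries;
    private final int indexMask;

    PaddedRing(int bufferSize) {
        if (bufferSize < 1 || Integer.bitCount(bufferSize) != 1) {
            throw new IllegalArgumentException("bufferSize must be a power of 2");
        }
        this.indexMask = bufferSize - 1;
        this.entries = new Object[bufferSize + 2 * BUFFER_PAD];
    }

    // Physical array index for a logical sequence number:
    // skip the front pad, then wrap the sequence with the mask.
    int slotOf(long sequence) {
        return BUFFER_PAD + (int) (sequence & indexMask);
    }

    void put(long sequence, Object item) {
        entries[slotOf(sequence)] = item;
    }

    Object get(long sequence) {
        return entries[slotOf(sequence)];
    }

    int physicalLength() {
        return entries.length;
    }
}
```

The first and last BUFFER_PAD slots are never read or written, so whatever the allocator places next to the array does not share a cache line with a live entry.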


You probably want to learn more about all the magic tricks. Take a look at one of the author's posts: Dissecting the Disruptor: Why it's so fast

