c++ - CUDA: In warp reduction and volatile keyword

Question

Ask a Question

Welcome To Ask or Share your Answers For Others

c++ - CUDA: In warp reduction and volatile keyword

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

After reading the question and its answer from the following
LINK

I still have a question remaining in my mind. From my background in C/C++; I understand that using volatile has it's demerits. And also it is pointed in the answers that in case of CUDA, the optimizations can replace shared array with registers to keep data if volatile keyword is not used.

I want to know what would be the performance issues that can be encountered when calculating (sum) reduction. e.g.

__device__ void sum(volatile int *s_data, int tid)
{
    if (tid < 16)
    {
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid +  8];
        s_data[tid] += s_data[tid +  4];
        s_data[tid] += s_data[tid +  2];
        s_data[tid] += s_data[tid +  1];
    }
}

I am using in warp reduction. Since all the threads with in warp are in sync, therefore I believe there is no need to use syncthreads() construct.

I want to know will removing the keyword volatile mess up my sum (due to cuda optimizations)? Can I use reduction such as this without volatile keyword.

Since I use this reduction function multiple time, will volatile keyword cause any performance degradation?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

559 views

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:42:11+0000

Removing the volatile keyword from that code could break that code on Fermi and Kepler GPUS. Those GPUs lack instructions to directly operate on shared memory. Instead, the compiler must emit a load/store pair to and from register.

What the volatile keyword does in this context is make the compiler honour that load-operate-store cycle and not perform an optimisation that would keep the value of s_data[tid] in register. To keep the sum accumulating in register would break the implicit memory syncronisation required to make that warp level shared memory summation work correctly.

Categories

c++ - CUDA: In warp reduction and volatile keyword

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags