Let's suppose I have a kernel call with a 2D grid, like so:
dim3 dimGrid(x, y); // not important what the actual values are
dim3 dimBlock(blockSize, blockSize);
myKernel <<< dimGrid, dimBlock >>>();
Now I've read that multidimensional grids are merely a programming convenience: the underlying hardware only ever sees linearly addressed 1D memory (unless you use texture memory, but that's not relevant here).
My question is: in what order are threads assigned to the grid indices when they are grouped into warps for scheduling? Are they assigned horizontally ("iterate" x, then y) or vertically ("iterate" y, then x)? This could matter for memory coalescing, depending on how I access my memory in the kernel.
To make this clearer, let's say the following represents the thread IDs as applied to my (imaginary) grid with a "horizontal" distribution:
[ 0 1 2 3 ]
[ 4 5 6 7 ]
[ 8 9 10 11 ]
[ ... ]
And "vertical" distribution would be:
[ 0 4 8 .. ]
[ 1 5 9 .. ]
[ 2 6 10 .. ]
[ 3 7 11 .. ]
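One way I could imagine checking this empirically is a small test kernel in which every thread records its lane within its warp, which I then compare against both layouts on the host. This is only a sketch: `horizontalId`, `verticalId`, `recordLanes` and `runLaneCheck` are names I made up, the 8x8 block size is arbitrary, and I'm assuming a warp size of 32 and that the `%laneid` PTX register reflects a thread's actual position within its warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helpers: the two candidate linearizations from the diagrams.
__host__ __device__ inline unsigned horizontalId(unsigned x, unsigned y,
                                                 unsigned dimX, unsigned dimY) {
    (void)dimY;
    return y * dimX + x;  // x varies fastest
}

__host__ __device__ inline unsigned verticalId(unsigned x, unsigned y,
                                               unsigned dimX, unsigned dimY) {
    (void)dimX;
    return x * dimY + y;  // y varies fastest
}

// Each thread stores the lane (position within its warp) reported by the hardware.
__global__ void recordLanes(unsigned *lanes) {
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    // The index below is only a storage slot; it does not presuppose an answer.
    lanes[threadIdx.y * blockDim.x + threadIdx.x] = lane;
}

// Launches one 8x8 block (two warps, assuming 32 threads per warp) and prints
// how many threads contradict each hypothesis. Returns 0 on success, 1 on a CUDA error.
int runLaneCheck() {
    constexpr unsigned dimX = 8, dimY = 8, n = dimX * dimY;
    constexpr unsigned assumedWarpSize = 32;  // assumption, not queried from the device
    unsigned host[n];
    unsigned *dev = nullptr;

    if (cudaMalloc(&dev, n * sizeof(unsigned)) != cudaSuccess) return 1;
    recordLanes<<<1, dim3(dimX, dimY)>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, n * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return 1;

    int badHorizontal = 0, badVertical = 0;
    for (unsigned y = 0; y < dimY; ++y) {
        for (unsigned x = 0; x < dimX; ++x) {
            unsigned lane = host[y * dimX + x];
            if (lane != horizontalId(x, y, dimX, dimY) % assumedWarpSize) ++badHorizontal;
            if (lane != verticalId(x, y, dimX, dimY) % assumedWarpSize) ++badVertical;
        }
    }
    printf("mismatches: horizontal=%d vertical=%d\n", badHorizontal, badVertical);
    return 0;
}
```

Calling `runLaneCheck()` from `main` should show zero mismatches for whichever layout the hardware actually uses. Note that this only looks at threads within a single block, not at how blocks are placed within the grid.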
I hope you can see how this might affect coalescing: With each variant, there will be a specific optimal way to access my device memory buffer.
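To illustrate what I mean by "optimal way to access", here is a rough sketch of the two access patterns I would choose between. The kernel names, the index helpers and the `buf`/`width`/`height`/`factor` parameters are all hypothetical. If the distribution is "horizontal", I'd expect consecutive threads in `scaleRowMajor` to touch consecutive addresses; if it is "vertical", the same would hold for `scaleColumnMajor` instead.

```cuda
#include <cuda_runtime.h>

// Hypothetical index helpers for a width x height buffer stored in 1D memory.
__host__ __device__ inline int rowMajorIndex(int col, int row, int width) {
    return row * width + col;   // neighbours in x are adjacent in memory
}

__host__ __device__ inline int columnMajorIndex(int col, int row, int height) {
    return col * height + row;  // neighbours in y are adjacent in memory
}

// Variant A: threads that differ only in threadIdx.x access adjacent elements.
__global__ void scaleRowMajor(float *buf, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        buf[rowMajorIndex(col, row, width)] *= factor;
}

// Variant B: threads that differ only in threadIdx.y access adjacent elements.
__global__ void scaleColumnMajor(float *buf, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        buf[columnMajorIndex(col, row, height)] *= factor;
}
```

Either kernel would be launched like the call at the top, e.g. `scaleRowMajor<<<dimGrid, dimBlock>>>(devBuf, width, height, 2.0f);`, where `devBuf` is a device buffer I'd have allocated beforehand.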
Unfortunately, I have not found any detailed information on this yet.