Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.

A sequential version of this would be the Fisher-Yates shuffle.

Example:

Let S={1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
323 views
Welcome To Ask or Share your Answers For Others

1 Answer

Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle vector of 8 elements. On first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on second iteration 0<->2, 1<->3, 4<->5, 6<->7; and on last iteration 0<->4, 1<->5, 2<->6, 3<->7.

ParallelFisherYates

This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):

const int id  = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id]     = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;
unsigned int pos = id * 2;  
while(shift <= BLOCK_SIZE)
{
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}

Here the curand initialization code is omitted, and method swap(int *p, int i, int j) exchanges values p[i] and p[j].

Note that the code above has the following assumptions:

  1. The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
  2. 2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
  3. BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)

To generate more than one permutation I would suggest to utilize different CUDA blocks. If the goal is to make permutation of 7 elements (as it was mentioned in the original question) then I believe it will be faster to do it in single thread.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...