In a distributed job system written in C++11 I have implemented a fence (i.e. a thread outside the worker thread pool may ask to block until all currently scheduled jobs are done) using the following structure:
struct fence
{
std::atomic<size_t> counter;
std::mutex resume_mutex;
std::condition_variable resume;
fence(size_t num_threads)
: counter(num_threads)
{}
};
The code implementing the fence looks like this:
void task_pool::fence_impl(void *arg)
{
auto f = (fence *)arg;
if (--f->counter == 0) // (1)
// we have zeroed this fence's counter, wake up everyone that waits
f->resume.notify_all(); // (2)
else
{
unique_lock<mutex> lock(f->resume_mutex);
f->resume.wait(lock); // (3)
}
}
This works very well if threads enter the fence over a period of time. However, if they try to do it almost simultaneously, it seems to sometimes happen that between the atomic decrementation (1) and starting the wait on the conditional var (3), the thread yields CPU time and another thread decrements the counter to zero (1) and fires the cond. var (2). This results in the previous thread waiting forever in (3), because it starts waiting on it after it has already been notified.
A hack to make the thing workable is to put a 10 ms sleep just before (2), but that's unacceptable for obvious reasons.
Any suggestions on how to fix this in a performant way?
See Question&Answers more detail:os