More than once I have parallelized portions of a program with OpenMP, only to notice that, despite good scalability, most of the expected speed-up was lost because the OpenMP version running on a single thread performed poorly compared with the serial version.
The usual explanation found on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. However, I have not been able to find a reference that explains why the assembly may be worse.
So, what I would like to ask the compiler guys out there is:
Can compiler optimizations be inhibited by multi-threading? If so, how can performance be affected?
If it helps to narrow down the question, I am mainly interested in high-performance computing.
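To make the comparison concrete, here is a minimal sketch of the kind of code pair I have in mind. The function names `dot_serial` and `dot_omp` are made up for illustration and do not come from any real program. The two loops are identical apart from the OpenMP directive, so any difference between the serial build and the OpenMP build run with `OMP_NUM_THREADS=1` has to come from how the compiler and runtime handle the parallel region.

```c
#include <assert.h>

/* Serial reference (hypothetical name): a plain reduction loop. */
double dot_serial(const double *x, const double *y, long n)
{
    assert(x != 0 && y != 0 && n >= 0);
    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* OpenMP variant (hypothetical name): same loop body, plus a directive.
 * If OpenMP is not enabled at compile time, the pragma is ignored and
 * this function behaves like dot_serial. */
double dot_omp(const double *x, const double *y, long n)
{
    assert(x != 0 && y != 0 && n >= 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

With GCC or Clang, the OpenMP version would be built with `-fopenmp` and timed with `OMP_NUM_THREADS=1`, then compared against `dot_serial` built with the same optimization flags. Comparing the assembly of the two loop bodies (for example with `-S`) is one way to check whether the directive changed what the optimizer produced.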
Disclaimer: As stated in the comments, part of the answers below may become obsolete in the future as they briefly discuss the way in which optimizations are handled by compilers at the time the question was posed.