I do not have experience with GEMM, but your problem seems to be related to issues that appear in all kinds of multi-threading scenarios.
When using multi-threading, you introduce a couple of potential overheads, the most common of which are:

1. creation/cleanup of starting/ending threads
2. context switches when (number of threads) > (number of CPU cores)
3. locking of resources, waiting to obtain a lock
4. cache synchronization issues
Items 2 and 3 probably don't play a role in your example: you are using 12 threads on 12 (hyperthreading) cores, and your algorithm does not involve locks.
However, item 1 might be relevant in your case: you are creating a total of 40000 threads, each of which multiplies and adds just 200 values. I'd suggest trying a less fine-grained threading, maybe only splitting after the first loop; it's always a good idea not to split the problem into pieces smaller than necessary (see the sketch below).
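For illustration, here is a minimal sketch of what coarser-grained threading could look like for a square matrix multiplication, assuming row-major `std::vector<float>` storage and one worker per hardware thread. All names (`matmul_rows`, `matmul_parallel`) are made up for this example and are not from your code:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each worker computes a contiguous block of output rows of C = A * B.
void matmul_rows(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, std::size_t n,
                 std::size_t row_begin, std::size_t row_end)
{
    for (std::size_t i = row_begin; i < row_end; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;  // each thread writes only its own rows
        }
}

void matmul_parallel(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, std::size_t n)
{
    // One thread per core instead of one thread per output element.
    const std::size_t num_threads =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back(matmul_rows, std::cref(A), std::cref(B),
                             std::ref(C), n, begin, end);
    }
    for (auto& w : workers) w.join();  // total thread creations: ~num_threads
}
```

The exact split is a judgment call; the point is that the number of threads stays close to the number of cores and each thread does a substantial amount of work before being torn down.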
Item 4 will also very likely be important in your case. While you're not running into a race condition when writing the results to the array (because every thread writes to its own index position), you are very likely to provoke a large overhead of cache synchronization.
"Why?" you might think, because you're writing to different places in memory. That's because a typical CPU cache is organized in cache lines, which on the current Intel and AMD CPU models are 64 bytes long. This is the smallest size that can be used for transfers from and to the cache, when something is changed. Now that all CPU cores are reading and writing to adjacent memory words, this leads to synchronization of 64 bytes between all the cores whenever you are writing just 4 bytes (or 8, depending on the size of the data type you're using).
If memory is not an issue, you can simply "pad" every output array element with "dummy" data so that there is only one output element per cache line. With a 4-byte data type, this means skipping 15 array elements for each real data element (see the sketch below). The cache issues will also improve when you make your threading less fine-grained, because every thread will then access its own contiguous region in memory practically without interfering with other threads' memory.
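A minimal sketch of the padding idea, assuming C++17 (for the over-aligned `std::vector` allocation) and the 64-byte cache line mentioned above; `PaddedFloat` is an illustrative name:

```cpp
#include <cstddef>
#include <vector>

// alignas(64) blows each slot up to a full cache line, so two threads
// writing to neighbouring slots never touch the same line.
struct alignas(64) PaddedFloat {
    float value;  // the real 4-byte payload; the rest of the line is padding
};

int main() {
    const std::size_t num_slots = 12;            // e.g. one slot per thread
    std::vector<PaddedFloat> results(num_slots); // 12 * 64 bytes, not 12 * 4
    results[0].value = 1.0f;                     // each writer owns its own line
}
```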
Edit: A more detailed description by Herb Sutter (one of the Gurus of C++) can be found here: http://www.drdobbs.com/parallel/maximize-locality-minimize-contention/208200273
Edit 2: BTW, it's suggested to avoid `std::move` in the return statement, as it might get in the way of return-value optimization and the copy-elision rules, which the standard now demands to happen automatically. See: Is returning with `std::move` sensible in the case of multiple return statements?
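A minimal sketch of the difference:

```cpp
#include <utility>
#include <vector>

std::vector<float> make_result() {
    std::vector<float> result(1000, 0.0f);
    // return std::move(result);  // pessimization: disables NRVO
    return result;                // NRVO elides the copy, or moves as fallback
}
```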