I've implemented a multithreaded version of PageRank and I'm running it on a 4-core Q6600. When I run it with 4 threads, I get:
real 6.968s
user 26.020s
sys 0.050s
When I run with 128 threads I get:
real 0.545s
user 1.330s
sys 0.040s
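This makes no sense to me: on a 4-core machine, the 128-thread run is roughly 13 times faster in wall-clock time and uses about a twentieth of the CPU time. The basic algorithm is a sum-reduce: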
- All threads sum a subset of the input;
- Synchronize;
- Each thread then accumulates part of the results from the other threads;
- The main thread sums an intermediate value from all the threads and then determines whether to continue.
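To make the structure concrete, the steps above can be sketched roughly as follows. This is a minimal Python illustration of the pattern only, not the actual implementation: the names `sum_reduce`, `partial`, and `combined`, and the way values are bucketed into per-thread slots, are assumptions made up for this example, and it says nothing about the PageRank update itself or about performance.

```python
import threading


def sum_reduce(values, num_threads):
    """Two-phase sum-reduce over `values`; returns the grand total."""
    n = len(values)
    # partial[t][s] holds thread t's contribution to slot s (hypothetical layout).
    partial = [[0.0] * num_threads for _ in range(num_threads)]
    # combined[s] holds slot s after accumulating across all threads.
    combined = [0.0] * num_threads
    barrier = threading.Barrier(num_threads)

    def worker(t):
        # Step 1: sum this thread's subset of the input, bucketed by slot.
        lo = t * n // num_threads
        hi = (t + 1) * n // num_threads
        for i in range(lo, hi):
            partial[t][i % num_threads] += values[i]

        # Step 2: synchronize so every thread's partial sums are complete.
        barrier.wait()

        # Step 3: accumulate this thread's slot from all the other threads.
        combined[t] = sum(partial[other][t] for other in range(num_threads))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    # Step 4: the main thread sums the intermediate values from all threads.
    return sum(combined)
```

Calling `sum_reduce(data, 4)` and `sum_reduce(data, 128)` should return the same total; only the number of threads sharing the work changes.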
Profiling hasn't helped. I'm not sure what data would be helpful to understand my code - please just ask.
It really has me puzzled.