As Öö Tiib suggested, just break the program in a debugger. The way I do it is get the program running, switch to the output window, type Ctrl-C to interrupt the program, switch back to the GDB window, type "thread 1" so as to be in the context of the main program, and type "bt" to see the stack trace.
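For example, a session might look like this (the program name, source files, and stack frames are made up for illustration - the commands are the point):

    $ gdb ./myprog
    (gdb) run
    ...                      <- while it runs, type Ctrl-C to interrupt it
    Program received signal SIGINT, Interrupt.
    (gdb) thread 1           <- get into the context of the main program
    (gdb) bt                 <- show the stack trace
    #0  read_record ()  at parse.cpp:210
    #1  load_file ()    at parse.cpp:48
    #2  main ()         at main.cpp:17
    (gdb) continue           <- let it run, then interrupt it again

Each line of that trace is a place where time is being spent, and each one is a candidate for scrutiny.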
Now, look at the stack trace and understand it, because while the instruction at the program counter is responsible for that particular cycle being spent, so is every call on the stack.
If you do this a few times, you're going to see exactly what line is responsible for the bottleneck.
As soon as you see it on two (2) samples, you've nailed it.
Then fix it and do it all again, finding the next bottleneck, and so on.
You could easily find that you get enormous speedup this way.
<flame>
Some people say this is exactly what profilers do, only they do it better.
That's what you hear in lecture halls and on blogs, but here's the deal:
There are things you can do to speed up your code that do not reveal themselves as "slow functions" or "hot paths" - for example, reorganizing the data structure.
In a profile, every function looks more-or-less innocent, even if it has a high inclusive time percentage.
They do reveal themselves if you actually look at stack samples.
So the problem with good profilers is not in the collection of samples, it is in the presentation of results. Statistics and measurements cannot tell you what a small selection of samples, examined carefully, can tell you.
What about the issue of small vs. large number of samples? Aren't more better?
OK, suppose you have an infinite loop - or one that isn't infinite but runs far longer than you know it should. Would 1000 stack samples find it any better than a single sample? (No.) If you look at it under a debugger, you know you're in the loop, because it takes basically 100% of the time. It's on the stack somewhere - just scan up the stack until you find it.
Even if the loop only takes 50% or 20% of the time, that's the probability each sample will see it.
So, if you see something you could get rid of on as few as two samples, it's worth fixing.
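To put rough numbers on that - this is plain binomial arithmetic, nothing specific to any tool: if a problem is on the stack a fraction f of the time, the chance that n samples show it at least twice is

    1 - (1-f)^n - n*f*(1-f)^(n-1)

    f = 0.5, n = 10:  about 0.99
    f = 0.2, n = 10:  about 0.62

So a small handful of samples is already very likely to expose anything big.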
So, what do the 1000 samples buy you?
Maybe one thinks: "So what if we miss a problem or two? Maybe it's good enough." Well, is it?
Suppose the code has three problems: P taking 50% of the time, Q taking 25%, and R taking 12.5%. The good stuff is called A.
The table below shows the speedup you get if you fix one of them, two of them, or all three:
    PRPQPQPAPQPAPRPQ  original time, with avoidable code P, Q, and R all mixed together
    RQQAQARQ          fix P            - 2    x speedup
    PRPPPAPPAPRP      fix Q            - 1.3  x speedup
    PPQPQPAPQPAPPQ    fix R            - 1.14 x speedup
    RAAR              fix P and Q      - 4    x speedup
    QQAQAQ            fix P and R      - 2.7  x speedup
    PPPPAPPAPP        fix Q and R      - 1.6  x speedup
    AA                fix P, Q, and R  - 8    x speedup
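The arithmetic behind those numbers is just speedup = 1 / (1 - fraction of time removed). Here is a minimal C++ check, using the fractions from the example above:

    #include <cstdio>

    int main() {
        // Fractions of total time spent in each avoidable problem,
        // from the example above: P = 50%, Q = 25%, R = 12.5%.
        double P = 0.50, Q = 0.25, R = 0.125;

        // Removing a fraction of the total time speeds the program up
        // by a factor of 1 / (1 - fraction_removed).
        auto speedup = [](double removed) { return 1.0 / (1.0 - removed); };

        std::printf("fix P       : %.2fx\n", speedup(P));          // 2.00x
        std::printf("fix Q       : %.2fx\n", speedup(Q));          // 1.33x
        std::printf("fix R       : %.2fx\n", speedup(R));          // 1.14x
        std::printf("fix P, Q    : %.2fx\n", speedup(P + Q));      // 4.00x
        std::printf("fix P, R    : %.2fx\n", speedup(P + R));      // 2.67x
        std::printf("fix Q, R    : %.2fx\n", speedup(Q + R));      // 1.60x
        std::printf("fix P, Q, R : %.2fx\n", speedup(P + Q + R));  // 8.00x
        return 0;
    }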
Does this make it clear why the ones that "get away" really hurt?
Miss even one of them and the best you can do is half the full speedup - twice as slow as it could have been.
They are easy to find if you examine samples. P is on half the samples.
If you fix P and do it again, Q is on half the samples. Once you fix Q, R is on half the samples.
Fix R and you've got your 8x speedup.
You don't have to stop there. You can keep going until you truly can't find anything to fix.
The more problems there are, the higher the potential speedup - but you can't afford to miss any.
The problem with profilers (even good ones) is that, by denying you the chance to see and study individual samples, they hide problems that you need to find.
For the statistically inclined, here's how it works.
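In brief (my sketch of the reasoning, not a formal derivation): each wall-time stack sample independently lands in a problem costing fraction f of the time with probability f, so the number of hits in n samples is binomial. With a uniform prior on f, seeing s hits in n samples gives an expected cost of

    E[f] = (s+1)/(n+2)        (Laplace's rule of succession)

    e.g. s = 2 hits out of n = 10 samples:  E[f] = 3/12 = 25%

Even a handful of samples estimates the cost well enough to decide whether fixing it is worthwhile, and locating the problem matters far more than measuring it precisely.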
There are good profilers.
The best are wall-time stack samplers that report inclusive percent at individual lines, letting you turn sampling on and off with a hot-key.
Zoom is one such profiler.
But even those make the mistake of assuming you need lots of samples.
You don't, and the price you pay for them is that you can't actually see any of them - so you can't see why the time is being spent, you can't easily tell if it's necessary, and you can't get rid of something unless you know you don't need it.
The result is you miss bottlenecks, and they end up stunting your speedup.
</flame>