I'm new to Grand Central Dispatch and have been running some tests with it, doing some processing on an image. Basically, I'm running a grayscale algorithm both sequentially and in parallel with GCD and comparing the results.
Here is the basic loop:
UInt8 r, g, b;
uint pixelIndex;
for (uint y = 0; y < height; y++) {
    for (uint x = 0; x < width; x++) {
        pixelIndex = (uint)(y * width + x);
        if (pixelIndex + 2 < width * height) {
            sourceDataPtr = &sourceData[pixelIndex];
            r = sourceDataPtr[0];
            g = sourceDataPtr[1];
            b = sourceDataPtr[2];
            int value = (r + g + b) / 3;
            if (value > MAX_COLOR_VALUE) {
                value = MAX_COLOR_VALUE;
            }
            targetData[pixelIndex] = value;
            self.imageData[pixelIndex] = value;
        }
    }
}
It simply runs through, takes the average of the red, green, and blue values, and uses that as the gray value. Very simple. The parallel version basically breaks the image into portions and then computes those portions separately: 2, 4, 8, 16, and 32 portions. I'm using basic GCD and passing each portion in as its own block to run concurrently. Here is the GCD-wrapped code:
dispatch_group_t myTasks = dispatch_group_create();
for (int startX = 0; startX < width; startX += width / self.numHorizontalSegments) {
    for (int startY = 0; startY < height; startY += height / self.numVerticalSegments) {
        // For each segment, enqueue a block of code to compute it.
        dispatch_group_async(myTasks, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
            // grayscale code...
        });
    }
}
dispatch_group_wait(myTasks, DISPATCH_TIME_FOREVER);
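For reference, the elided `// grayscale code...` inside each block is just the same averaging loop restricted to one tile. A sketch of what that per-tile routine looks like, written here as plain C so it's easy to test in isolation (the function name `processTile` and the explicit bounds parameters are my own; the 1-byte-per-pixel indexing mirrors the sequential loop above):

```c
#include <stdint.h>

// Hypothetical per-tile grayscale pass mirroring the sequential loop:
// averages three consecutive bytes starting at pixelIndex and writes
// the result back, skipping the last two indexes as the original does.
static void processTile(const uint8_t *sourceData, uint8_t *targetData,
                        unsigned width, unsigned height,
                        unsigned startX, unsigned startY,
                        unsigned tileW, unsigned tileH)
{
    unsigned endX = startX + tileW, endY = startY + tileH;
    if (endX > width)  endX = width;   // clamp tile to the image edge
    if (endY > height) endY = height;
    for (unsigned y = startY; y < endY; y++) {
        for (unsigned x = startX; x < endX; x++) {
            unsigned pixelIndex = y * width + x;
            if (pixelIndex + 2 < width * height) {
                const uint8_t *p = &sourceData[pixelIndex];
                targetData[pixelIndex] = (uint8_t)((p[0] + p[1] + p[2]) / 3);
            }
        }
    }
}
```

Each GCD block captures its own `startX`/`startY` and calls this with the segment dimensions, so no two blocks write the same pixel.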
Everything is working fine. But what I am not understanding is the speedup / CPU usage. Running the tests in the simulator (which uses my dual-core CPU), I am getting:
- ~0.0945s run time sequentially
- ~0.0675s run time using GCD
That is a reduction of around 28% in run time (i.e., the GCD version takes ~72% of the sequential time, roughly a 1.4x speedup). Theoretically, on a 2-core machine a 100% speedup (2x, half the run time) is the maximum. So this is falling well short of that, and I can't figure out why.
I monitor the CPU usage and it maxes out around 118% - why is it not reaching closer to 200%? If anyone has an idea as to what I should change, or what is the culprit here I would greatly appreciate it.
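To take the simulator and GCD's scheduling out of the picture, I also put together a reduced test case that does the same split with two raw pthreads, one half of the rows per thread, so I can time it on its own (plain C; the `Band` struct and function names are mine, and the averaging is the same as the block above):

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    const uint8_t *src;
    uint8_t *dst;
    unsigned width, height;    // full image dimensions
    unsigned rowStart, rowEnd; // this thread's band of rows
} Band;

// Same averaging as the GCD block, restricted to a band of rows.
static void *grayBand(void *arg)
{
    Band *b = arg;
    unsigned total = b->width * b->height;
    for (unsigned y = b->rowStart; y < b->rowEnd; y++) {
        for (unsigned x = 0; x < b->width; x++) {
            unsigned i = y * b->width + x;
            if (i + 2 < total)
                b->dst[i] = (uint8_t)((b->src[i] + b->src[i + 1] + b->src[i + 2]) / 3);
        }
    }
    return NULL;
}

// Run the pass on two threads, one half of the rows each.
static void grayTwoThreads(const uint8_t *src, uint8_t *dst,
                           unsigned width, unsigned height)
{
    pthread_t t;
    Band top = { src, dst, width, height, 0, height / 2 };
    Band bot = { src, dst, width, height, height / 2, height };
    pthread_create(&t, NULL, grayBand, &top);
    grayBand(&bot); // do the second half on the calling thread
    pthread_join(t, NULL);
}
```

Since the bands write disjoint rows, no locking is needed, and any gap between this version's scaling and the GCD version's would point at the dispatch overhead rather than the algorithm.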
My theories:
- Not enough work per block? (But the image is ~3,150,000 pixels.)
- Not enough time to ramp up to near 200%? Maybe each thread requires a longer runtime before it starts chewing up that much of the CPU?
- I thought maybe the overhead was high, but a test of launching 32 empty blocks onto the queue (also in a group) took at most ~0.0005s.