Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Windows Nvidia driver (version 331.65) does not immediately dispatch a CUDA kernel when invoked via the runtime API. To illustrate the problem I profiled the mergeSort application (from the examples that ship with CUDA 5.5).

Consider first the kernel launch time when running in Linux:

linux_launch

Next, consider the launch time when running in Windows:

windows_launch

This post suggests the problem might have something to do with the windows driver batching the kernel launches. Is there anyway I can disable this batching?

I am running with a GTX 690 GPU, Windows 7, and version 331.65 of the Nvidia driver.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
847 views
Welcome To Ask or Share your Answers For Others

1 Answer

There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.

As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.

The best solution under windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e. not GeForce. (It also has the side effect of preventing the device from being used as a windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)

AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver, but unofficially I've heard , according to Greg@NV in this link the command to issue after the cuda kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to "flush" to the GPU.

As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.

EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);

EDIT2: Using recent drivers on windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up for primary display. The nvidia-smi tool will allow you to switch modes (using nvidia-smi --help for more info).

Additional info about the TCC driver model can be found in the windows install guide, including that it may reduce the latency of kernel launches.

The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...