You can't wait for a thread to exit in DllMain(). Unless the thread had already exited by the time the DLL_PROCESS_DETACH was received, doing so will always deadlock. This is the expected behaviour.
The reason for this is that calls to DllMain() are serialized, via the loader lock. When ExitThread() is called, it claims the loader lock so that it can call DllMain() with DLL_THREAD_DETACH. Until that call has finished, the thread is still running.
So DllMain is waiting for the thread to exit, and the thread is waiting for DllMain to exit, a classic deadlock situation.
See also Dynamic-Link Library Best Practices on MSDN.
The solution is to add a new function to your DLL for the application to call before unloading the DLL. As you have noted, your code already works perfectly well when called explicitly.
In the case where backwards compatibility requirements make adding such a function impossible, and if you must have the worker threads, consider splitting your DLL into two parts, one of which is dynamically loaded by the other. The dynamically loaded part would contain (at a minimum) all of the code needed by the worker threads.
When the DLL that was loaded by the application itself receives DLL_PROCESS_DETACH, you just set the event to signal the threads to exit and then return immediately. One of the threads would have to be designated to wait for all the others and then free the second DLL, which you can do safely using FreeLibraryAndExitThread().
(Depending on the circumstances, and in particular if worker threads are exiting and/or new ones being created as part of regular operations, you may need to be very careful to avoid race conditions and/or deadlocks; this would likely be simpler if you used a thread pool and callbacks rather than creating worker threads manually.)
In the special case where the threads do not need to use any but the very simplest Windows APIs, it might be possible to use a thread pool and work callbacks to avoid the need for a second DLL. Once the callbacks have exited, which you can check using WaitForThreadpoolWorkCallbacks(), it is safe for the library to be unloaded - you do not need to wait for the threads themselves to exit.
The catch here is that the callbacks must avoid any Windows APIs that might take the loader lock. It is not documented which API calls are safe in this respect, and it varies between different versions of Windows. If you are calling anything more complicated than SetEvent or WriteFile, say, or if you are using a library rather than native Windows API functions, you must not use this approach.