Background:
I'm asking this because I currently have an application with many (hundreds to thousands of) threads. Most of those threads are idle much of the time, waiting for work items to be placed in a queue. When a work item becomes available, it is processed by calling some arbitrarily complex existing code. On some operating system configurations, the application bumps up against kernel parameters governing the maximum number of user processes, so I'd like to experiment with ways to reduce the number of worker threads.
My proposed solution:
It seems like a coroutine-based approach, where I replace each worker thread with a coroutine, would help to accomplish this. I could then have a work queue backed by a pool of actual (kernel) worker threads. When an item is placed in a particular coroutine's queue for processing, an entry would be placed into the thread pool's queue. A pool thread would then resume the corresponding coroutine, process its queued data, and suspend it again, freeing up the pool thread to do other work.
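To make the scheduling part concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: `LogicalWorker`, `Pool`, `post`, and `run_demo` are names I made up, work items are plain `int`s, and the coroutine resume is stood in for by an ordinary `handler` call, since the point here is the queueing rather than the coroutine mechanism itself.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one former worker thread. In the real design,
// calling `handler` would be replaced by resuming this worker's coroutine.
struct LogicalWorker {
    std::mutex m;
    std::deque<int> inbox;             // pending work items
    bool scheduled = false;            // true while queued on, or running in, the pool
    std::function<void(int)> handler;  // existing processing code
};

class Pool {
public:
    explicit Pool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~Pool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    // Queue an item for `w`; put `w` on the ready queue unless it is already there.
    void post(LogicalWorker& w, int item) {
        assert(w.handler && "handler must be set before posting");
        bool need_schedule;
        {
            std::lock_guard<std::mutex> lk(w.m);
            w.inbox.push_back(item);
            need_schedule = !w.scheduled;
            w.scheduled = true;
        }
        if (need_schedule) {
            { std::lock_guard<std::mutex> lk(m_); ready_.push_back(&w); }
            cv_.notify_one();
        }
    }
private:
    void run() {
        for (;;) {
            LogicalWorker* w;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !ready_.empty(); });
                if (ready_.empty()) return;  // stop requested and nothing left to run
                w = ready_.front();
                ready_.pop_front();
            }
            drain(*w);
        }
    }
    // "Resume" the worker: process its inbox, then mark it suspended again.
    static void drain(LogicalWorker& w) {
        for (;;) {
            int item;
            {
                std::lock_guard<std::mutex> lk(w.m);
                if (w.inbox.empty()) { w.scheduled = false; return; }
                item = w.inbox.front();
                w.inbox.pop_front();
            }
            w.handler(item);  // a coroutine resume would go here
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<LogicalWorker*> ready_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

// Demo: 100 logical workers served by 4 kernel threads, 10 items each.
int run_demo() {
    std::vector<LogicalWorker> workers(100);
    std::atomic<int> sum{0};
    for (auto& w : workers)
        w.handler = [&sum](int x) { sum += x; };
    {
        Pool pool(4);
        for (int round = 0; round < 10; ++round)
            for (auto& w : workers)
                pool.post(w, 1);
    }  // the Pool destructor lets the threads finish queued work, then joins them
    return sum.load();
}
```

The `scheduled` flag is there so that a given logical worker is only ever run by one pool thread at a time, which is the property the original one-thread-per-worker design gave me for free. What this sketch cannot do is suspend in the middle of `handler`, which is exactly where the coroutine would come in.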
Implementation details:
In thinking about how I would do this, I'm having trouble understanding the functional differences between stackless and stackful coroutines. I have some experience with stackful coroutines through the Boost.Coroutine library. I find them relatively easy to comprehend at a conceptual level: for each coroutine, the library maintains a copy of the CPU context and stack, and when you switch to a coroutine, it switches to that saved context (just like a kernel-mode scheduler would).
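To show what I mean by the stackful model, here is a minimal sketch using the `ucontext` functions rather than Boost.Coroutine, so that the context and stack handling is visible. It assumes a Linux/glibc system where `getcontext`, `makecontext`, and `swapcontext` are available; the function names (`nested`, `coroutine_body`, `run_stackful_demo`) and the `trace` vector are mine, and I make no claim that Boost does it this way internally.

```cpp
#include <ucontext.h>
#include <cassert>
#include <vector>

static ucontext_t main_ctx, co_ctx;
static std::vector<int> trace;   // records the order in which code runs
static bool finished = false;

// Switch from the coroutine back to whoever resumed it.
static void yield_to_main() { swapcontext(&co_ctx, &main_ctx); }

// An ordinary nested function that suspends the coroutine from inside a
// deeper call frame; that frame stays alive on the coroutine's own stack.
static void nested(int base) {
    int local = base + 1;         // automatic variable on the coroutine's stack
    trace.push_back(local);
    yield_to_main();
    trace.push_back(local + 1);   // `local` is still intact after resuming
}

static void coroutine_body() {
    nested(10);
    nested(20);
    finished = true;
    // Returning from here switches to uc_link, i.e. main_ctx.
}

std::vector<int> run_stackful_demo() {
    trace.clear();
    finished = false;
    std::vector<char> stack(64 * 1024);   // the coroutine's private stack
    int rc = getcontext(&co_ctx);
    assert(rc == 0);
    (void)rc;
    co_ctx.uc_stack.ss_sp = stack.data();
    co_ctx.uc_stack.ss_size = stack.size();
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, coroutine_body, 0);
    while (!finished) {
        trace.push_back(0);                // marks control being in the caller
        swapcontext(&main_ctx, &co_ctx);   // resume the coroutine
    }
    return trace;
}
```

Calling `run_stackful_demo()` yields the sequence 0, 11, 0, 12, 21, 0, 22: the caller and the coroutine alternate, and each suspension happens inside `nested` rather than in the coroutine's top-level function.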
What is less clear to me is how a stackless coroutine differs from this. In my application, the amount of overhead associated with the above-described queuing of work items is very important. Most of the implementations I've seen, like the new CO2 library, suggest that stackless coroutines provide much lower-overhead context switches.
Therefore, I'd like to understand the functional differences between stackless and stackful coroutines more clearly. Specifically, I have these questions:
References like this one suggest that the distinction lies in where you can yield/resume in a stackful vs. stackless coroutine. Is this the case? Is there a simple example of something that I can do in a stackful coroutine but not in a stackless one?
Are there any limitations on the use of automatic storage variables (i.e. variables "on the stack")?
Are there any limitations on what functions I can call from a stackless coroutine?
If there is no saving of stack context for a stackless coroutine, where do automatic storage variables go when the coroutine is running?