Often I see people asking why they can't create more than
around 2000 threads in a process. The reason is not that there is
any particular limit inherent in Windows. Rather, the programmer
failed to take into account the amount of address space each thread
uses.
A thread consists of some memory in kernel mode (kernel stacks
and object management), some memory in user mode (the thread
environment block, thread-local storage, that sort of thing), plus
its stack. (Or stacks if you're on an Itanium system.)
Usually, the limiting factor is the stack size.
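#include <stdio.h>
#include <windows.h>
DWORD CALLBACK ThreadProc(void*)
{
    Sleep(INFINITE); /* park the thread so its stack stays allocated */
    return 0;
}
int __cdecl main(int argc, const char* argv[])
{
    int i;
    for (i = 0; i < 100000; i++) { /* arbitrary cap, well above the expected limit */
        HANDLE h = CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL);
        if (!h) break; /* creation failed: the limit has been reached */
        CloseHandle(h); /* closing the handle does not end the thread */
    }
    printf("Created %d threads\n", i);
    return 0;
}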
This program will typically print a value around 2000 for the
number of threads.
Why does it give up at around 2000?
Because the default stack size assigned by the linker is 1MB,
and 2000 stacks times 1MB per stack equals around 2GB, which is how
much address space is available to user-mode programs in a 32-bit
process by default.
You can try to squeeze more threads into your process by
reducing your stack size, which can be done either by tweaking
linker options or by manually overriding the stack size passed to
the CreateThread function as described in MSDN.
With the stack size reduced to 4KB, I was able to squeak in
around 13000 threads. While that's certainly better than 2000, it's
short of the naive expectation of 500,000 threads. (Each thread
uses 4KB of stack in a 2GB address space.) But you're forgetting
the other overhead. Address space allocation granularity is 64KB,
so each thread's stack occupies 64KB of address space even if only
4KB of it is used. Plus of course you don't have free rein over all
2GB of the address space; there are system DLLs and other things
occupying it.
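The arithmetic can be sketched in a few lines. This is only a back-of-the-envelope estimate: the function name is made up, the 64KB granularity is the figure quoted above, and it counts nothing but stack address space, so it gives an upper bound rather than a prediction.

```c
#include <stdint.h>

/* Allocation granularity as quoted in the text: 64KB. */
#define ALLOCATION_GRANULARITY (64u * 1024u)

/* Hypothetical helper: upper bound on the number of threads if stacks
   were the only thing occupying the address space. */
uint64_t estimate_max_threads(uint64_t address_space, uint64_t stack_reserve)
{
    /* Each stack reservation is rounded up to a multiple of the
       allocation granularity. */
    uint64_t rounded = (stack_reserve + ALLOCATION_GRANULARITY - 1)
                       / ALLOCATION_GRANULARITY * ALLOCATION_GRANULARITY;
    if (rounded == 0) {
        return 0; /* a zero-byte stack makes no sense; avoid dividing by zero */
    }
    return address_space / rounded;
}
```

For a 2GB address space, this gives 2048 for 1MB stacks and 32768 for 4KB stacks, since 4KB rounds up to 64KB. The 13000 I actually got sits well below that second figure because of everything else competing for the address space.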
But the real question that is raised whenever somebody asks,
"What's the maximum number of threads that a process can create?"
is "Why are you creating so many threads that this even becomes an
issue?"
The "one thread per client" model is well known not to scale
beyond a dozen clients or so. If you're going to be handling more
than that many clients simultaneously, you should move to a model
where, rather than dedicating a thread to each client, you allocate
an object. (Someday I'll muse on the duality between threads and
objects.) Windows provides I/O completion ports and a thread pool
to help you convert from a thread-based model to a
work-item-based model.
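To give a feel for the shape of the object-per-client model, here is a deliberately simplified sketch in portable C. Everything in it is hypothetical: the Client structure, the states, and the single loop that stands in for a pool worker. In a real Windows program the work items would be handed to the thread pool or driven by I/O completion port notifications rather than by a round-robin loop.

```c
#include <stddef.h>

/* Hypothetical per-client state: the things a dedicated thread would
   otherwise have kept on its own stack. */
typedef enum { CLIENT_READING, CLIENT_REPLYING, CLIENT_DONE } ClientState;

typedef struct {
    int id;
    ClientState state;
    int bytes_pending; /* stand-in for real I/O progress */
} Client;

/* One work item: advance a single client by one step, then return so
   the worker is free to service some other client. */
void process_work_item(Client* c)
{
    switch (c->state) {
    case CLIENT_READING:
        c->bytes_pending -= 1;
        if (c->bytes_pending <= 0) {
            c->state = CLIENT_REPLYING;
        }
        break;
    case CLIENT_REPLYING:
        c->state = CLIENT_DONE;
        break;
    case CLIENT_DONE:
        break;
    }
}

/* Stand-in for a pool worker: services clients round-robin until all
   of them are done. Returns the number of work items processed. */
int run_until_done(Client* clients, size_t count)
{
    int items = 0;
    size_t remaining = count;
    while (remaining > 0) {
        remaining = 0;
        for (size_t i = 0; i < count; i++) {
            if (clients[i].state != CLIENT_DONE) {
                process_work_item(&clients[i]);
                items++;
                if (clients[i].state != CLIENT_DONE) {
                    remaining++;
                }
            }
        }
    }
    return items;
}
```

The point of the design is that a client costs one small structure instead of one stack, so the number of clients is no longer tied to the number of threads.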
Note that fibers do not help much here, because a fiber has a
stack, and it is the address space required by the stack that is
the limiting factor nearly all of the time.