Ran a benchmark and got these results:
| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | False | 4000000 | 4.360 s | 0.0796 s | 0.0744 s | 7000.0000 | 6000.0000 | 1000.0000 | 625.18 MB |
| BenchmarkParallel | 0 | True | 4000000 | 4.109 s | 0.0803 s | 0.0955 s | 6000.0000 | 6000.0000 | - | 401.18 MB |
This is for scheduling 4 million handles, flushing them, calling Complete() on each one in a row, and then calling Return() on each one in a row.
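For reference, the measured loop looks roughly like this. This is only a sketch of the described workload, not the actual benchmark code; every name except Complete() and Return() (scheduler, Schedule, Flush, and the job/handle types) is an assumption based on the description above:

```csharp
// Sketch of the benchmarked workload. Method and type names other than
// Complete()/Return() are assumptions, not the real benchmark code.
var handles = new JobHandle[4_000_000];

// Schedule 4 million handles...
for (int i = 0; i < handles.Length; i++)
    handles[i] = scheduler.Schedule(job);

// ...flush them to the worker threads...
scheduler.Flush();

// ...call Complete() on each one in a row...
for (int i = 0; i < handles.Length; i++)
    handles[i].Complete();

// ...and then call Return() on each one in a row.
for (int i = 0; i < handles.Length; i++)
    handles[i].Return();
```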
There are a couple of culprits, one of which I'm aware of...

Queue&lt;T&gt; vs. ConcurrentQueue&lt;T&gt;

ConcurrentQueue&lt;T&gt;, as it turns out, doesn't pool the nodes of its internal linked list of segments. Nodes are rarely actually allocated, because each one contains a fairly large internal array; however, with larger struct elements and many additions and removals, it starts to generate garbage. Here's a benchmark that demonstrates this: it first adds a ton of items to both queues, runs queue.Clear(), and then runs the benchmarks, which repeat the process.
(BenchmarkEmpty shows that 600 B is the measurement lower bound for some reason; it literally doesn't do anything.)
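The same effect can be reproduced without BenchmarkDotNet. This is not the code from #8, just a minimal dependency-free sketch using GC.GetAllocatedBytesForCurrentThread to measure what a refill after Clear() allocates:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class QueueAllocDemo
{
    const int Capacity = 100_000;

    // Fill the queue, Clear() it, then measure bytes allocated by refilling.
    static long MeasureRefill(Action<long> enqueue, Action clear)
    {
        for (long i = 0; i < Capacity; i++) enqueue(i);
        clear();

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (long i = 0; i < Capacity; i++) enqueue(i);
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        var queue = new Queue<long>(Capacity);
        var concurrent = new ConcurrentQueue<long>();

        long queueBytes = MeasureRefill(queue.Enqueue, queue.Clear);
        long concurrentBytes = MeasureRefill(concurrent.Enqueue, concurrent.Clear);

        // Queue<T>.Clear() keeps the backing array, so the refill allocates
        // essentially nothing; ConcurrentQueue<T>.Clear() drops its segments,
        // so the refill allocates fresh ones every time.
        Console.WriteLine($"Queue<long> refill:           {queueBytes} B");
        Console.WriteLine($"ConcurrentQueue<long> refill: {concurrentBytes} B");
    }
}
```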
| Method | QueueCapacity | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|
| BenchmarkEmpty | 100 | 138.00 ns | 17.89 ns | 52.76 ns | 100.0 ns | 600 B |
| BenchmarkConcurrentQueue | 100 | 3,093.62 ns | 65.38 ns | 127.53 ns | 3,100.0 ns | 2648 B |
| BenchmarkQueue | 100 | 776.27 ns | 19.36 ns | 42.91 ns | 800.0 ns | 600 B |
| BenchmarkEmpty | 5000000 | 1,115,969.23 ns | 15,240.34 ns | 12,726.38 ns | 1,112,600.0 ns | 600 B |
| BenchmarkConcurrentQueue | 5000000 | 25,210,842.86 ns | 372,509.73 ns | 330,220.18 ns | 25,179,650.0 ns | 41947736 B |
| BenchmarkQueue | 5000000 | 7,210,566.67 ns | 49,435.43 ns | 46,241.94 ns | 7,193,800.0 ns | 600 B |
As you can see, Queue&lt;T&gt; properly caches allocations in its internal array, while ConcurrentQueue&lt;T&gt; generates garbage without bound, causing eventual GCs.
However, ConcurrentQueue&lt;T&gt; is the thread-safe option, and per the table above Queue&lt;T&gt; is actually 3-4x faster even without multithreaded code, so we need to decide what we're going to prioritize here.
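If we end up prioritizing allocations, one possible middle ground (just a sketch, not anything from #8) is a plain Queue&lt;T&gt; behind a lock, trading ConcurrentQueue&lt;T&gt;'s lock-free operations for array reuse:

```csharp
using System.Collections.Generic;

// Sketch: a minimal lock-guarded wrapper around Queue<T>. Keeps Queue<T>'s
// backing-array reuse but pays a lock on every operation instead of
// ConcurrentQueue<T>'s lock-free fast path. Not code from #8.
public sealed class LockedQueue<T>
{
    private readonly Queue<T> _queue;
    private readonly object _gate = new object();

    public LockedQueue(int capacity) => _queue = new Queue<T>(capacity);

    public void Enqueue(T item)
    {
        lock (_gate) _queue.Enqueue(item);
    }

    public bool TryDequeue(out T item)
    {
        lock (_gate)
        {
            if (_queue.Count > 0)
            {
                item = _queue.Dequeue();
                return true;
            }
            item = default!;
            return false;
        }
    }
}
```

Whether the added lock contention costs more than the saved GC pressure would need its own benchmark.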
Testing with Queue&lt;T&gt;

Out of curiosity, I ran a smaller benchmark with both a Queue&lt;T&gt; implementation and a ConcurrentQueue&lt;T&gt; implementation. Pay attention to the allocations...
ConcurrentQueue
| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 958,487,072.22 ns | 18,238,321.64 ns | 19,514,799.82 ns | 955,026,850.0 ns | 1000.0000 | 1000.0000 | 113553496 B |
Queue
| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 929,862,868.4 ns | 18,076,785.37 ns | 20,092,299.04 ns | 929,196,000.0 ns | 1000.0000 | 1000.0000 | 79998808 B |
As you can see, although Queue allocates about 30% less than ConcurrentQueue, there's still GC pressure.
All my benchmark code (minus swapping out the queue type) can be found in #8.
I'll run some more testing and see if I can figure out what's allocating.