genaray / zeroallocjobscheduler

A high-performance, allocation-free C# job scheduler.

License: Apache License 2.0

C# 100.00%
csharp ecs engine game game-development gamedev godot jobs monogame multithreading net6 net7 netstandard21 unity unity3d

zeroallocjobscheduler's Introduction

Hi, I'm genaray 💻

I am a 23-year-old computer science student and technology enthusiast living in NRW, Germany.

I recently finished my bachelor's thesis at a small local IT company, and I am looking forward to pursuing my master's degree as well.

I use C# and Java on a daily basis, but I love learning new languages and technologies. In my spare time, I mostly play with Unity and develop small games.

Hobbies

  • Video games 🎮
  • Gym 🏋️
  • Photography 📷
  • Learning new languages 🇹🇷
  • Game development 🤖

zeroallocjobscheduler's People

Contributors

genaray, lilithsilver


zeroallocjobscheduler's Issues

Pool the `CombineDependencies` list

Currently, users are forced to maintain a List<JobHandle> for each CombineDependencies call they make, and they can't even reuse a list between CombineDependencies calls until they're sure the job has resolved. This isn't great from an API standpoint.

With #11, we have a value n we can consult to determine the current maximum simultaneous job count without allocation; n - 1 is therefore the theoretical maximum number of dependencies that can be passed to CombineDependencies in one go. If we pre-allocate n lists of length n - 1, we can clear and reuse a list whenever one is returned, or fall back to allocating new lists (or growing existing ones) if none are available.

Users would still have to maintain a List<JobHandle>, but it wouldn't be tied to internal state anymore, so it's a much more reasonable ask.

As discussed in #7.
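As a starting point, a minimal sketch of such a pool (names are hypothetical, not the library's API):

using System.Collections.Generic;

internal sealed class HandleListPool
{
    private readonly Stack<List<JobHandle>> _free = new();

    public HandleListPool(int maxConcurrentJobs)
    {
        // Pre-allocate n lists of capacity n - 1, per the reasoning above.
        for (var i = 0; i < maxConcurrentJobs; i++)
            _free.Push(new List<JobHandle>(maxConcurrentJobs - 1));
    }

    // Falls back to allocating a fresh list when the pool is exhausted.
    public List<JobHandle> Rent() => _free.Count > 0 ? _free.Pop() : new List<JobHandle>();

    public void Return(List<JobHandle> list)
    {
        list.Clear(); // cleared lists keep their capacity, so reuse stays allocation-free
        _free.Push(list);
    }
}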

Nonatomic Queue Operations

The following operation in RangeWorkStealingDeque is not atomic.

var b = Volatile.Read(ref _bottom);
// we're popping, so decrement the bottom in advance.
// Doing this in advance ensures that we "reserve space" in the size, so that even if someone steals,
// they can't steal past this _bottom (their size would return less than 0, and they wouldn't steal). 
// At the end of this method, if we need to adjust this (i.e. there ended up being nothing to steal, or less than amount to steal)
// we resolve this.
b--;
Volatile.Write(ref _bottom, b);

Suggested replacement:

var b = Interlocked.Decrement(ref _bottom);
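(Note: Interlocked.Decrement performs the read-modify-write atomically and returns the decremented value, so b holds the same post-decrement value as in the original code and the separate Volatile.Write is no longer needed.)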

Cannot `stackalloc JobHandle[]`

Even though JobHandle is a struct, the compiler does not allow me to stackalloc an array of them, which makes allocation-free scheduling of jobs from within a loop of unknown iteration count very cumbersome (borderline impossible).

It would be extremely convenient to be able to do this, in order to directly and synchronously schedule and then complete workloads without allocating or messing around with a List or the like.

It would also help with combining job dependencies as these jobs are being scheduled in parallel. As far as the package's API goes, that requires some sort of Span to be passed in, but assembling that Span in memory is impossible without going through another type like System.Array or System.Collections.Generic.List.

[screenshot: compiler error when attempting stackalloc JobHandle[]]

... but it is a struct. And that totally works with other structs:

[screenshot: stackalloc compiling fine for a different struct]

What am I getting wrong here? :)
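(A likely cause, though not confirmed in this thread: stackalloc only accepts unmanaged types, and JobHandle presumably contains a reference-type field, e.g. a reference back to its scheduler, which makes it a managed struct. A minimal illustration:)

using System;

struct UnmanagedHandle { public int Id; }          // no reference fields: unmanaged
struct ManagedHandle { public object Scheduler; }  // reference field: managed

static class StackallocDemo
{
    static void Main()
    {
        Span<UnmanagedHandle> ok = stackalloc UnmanagedHandle[4]; // compiles
        // Span<ManagedHandle> bad = stackalloc ManagedHandle[4]; // error: managed type
        Console.WriteLine(ok.Length);
    }
}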

I tried to work with a list with pre-reserved capacity, but I can't even submit a List to Scheduler.CombineDependencies. So there's no chance (without juggling array indices and counting jobs?) to track dependencies as they accrue?
[screenshot: List<JobHandle> rejected by Scheduler.CombineDependencies]

Add a "max expected concurrent jobs" value to JobScheduler's constructor, to prevent spontaneous allocation

There is a lot of arbitrary allocation happening in the code at the moment, as discussed in #9:

  • On the first Schedule() for a new number of concurrent jobs, we always allocate to QueuedJobs
  • DefaultObjectPool<ManualResetEvent> will start thrashing and generating lots of garbage if we try to take more than 32 items at once
    • The limit can be increased with a constructor parameter, but it significantly lowers pool efficiency.
  • ConcurrentQueue will generate garbage every time we move past 32 items and then dequeue everything
    • There's a way to increase this segment size, but the solution's janky. See #10 for the detailed fix.

Adding a "max expected concurrent jobs" parameter to the JobScheduler constructor with a reasonable default of 32 or perhaps 64 would let us initialize everything right off the bat, and have a truly guaranteed ZeroAllocJobScheduler! (Unless you go past your limit of concurrent scheduled or in-flight jobs.)

We could even add a strict mode that throws an error if you go over the limit.
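A hedged sketch of what that constructor could look like (all names hypothetical; JobMeta stands in for the scheduler's internal job record):

using System.Collections.Generic;

internal struct JobMeta { }

public sealed class JobScheduler
{
    private readonly Queue<JobMeta> _queuedJobs;
    private readonly bool _strict;

    public JobScheduler(int maxExpectedConcurrentJobs = 32, bool strictMode = false)
    {
        // Size all internal storage once, up front, so steady-state scheduling
        // never allocates as long as the caller stays under the limit.
        _queuedJobs = new Queue<JobMeta>(maxExpectedConcurrentJobs);
        _strict = strictMode; // strict mode: throw instead of allocating past the limit
    }
}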

(I can work on this once #8 and #7 are complete, but I'm adding these issues to track them so we don't forget.)

ZeroAllocJobScheduler isn't actually ZeroAlloc

Ran a benchmark and got these results:

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | False | 4000000 | 4.360 s | 0.0796 s | 0.0744 s | 7000.0000 | 6000.0000 | 1000.0000 | 625.18 MB |
| BenchmarkParallel | 0 | True | 4000000 | 4.109 s | 0.0803 s | 0.0955 s | 6000.0000 | 6000.0000 | - | 401.18 MB |

This is for scheduling 4 million handles, flushing them, calling Complete() on each one in a row, and then calling Return() on each one in a row.

There are a couple culprits, one of which I'm aware of...

Queue<T> vs. ConcurrentQueue<T>

ConcurrentQueue<T>, as it turns out, doesn't pool its internal segments (the nodes of its linked list). Segments are rarely actually allocated, because each contains a fairly large internal array; however, with larger struct elements and many additions and removals, it starts to generate garbage. Here's a benchmark that demonstrates this: it first adds a ton of items to both queues, runs queue.Clear(), and then runs the benchmarks, which repeat the process.

(BenchmarkEmpty shows that 600 B is the measurement floor for some reason; it literally doesn't do anything.)
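For reference, a minimal sketch of what that setup might look like with BenchmarkDotNet (int elements; the real run used larger struct elements, so treat this as an approximation of the method, not the exact code):

using System.Collections.Concurrent;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class QueueGarbageBenchmark
{
    [Params(100, 5_000_000)]
    public int QueueCapacity;

    private Queue<int> _queue = null!;
    private ConcurrentQueue<int> _concurrentQueue = null!;

    [GlobalSetup]
    public void Setup()
    {
        // Grow both queues once up front, then empty them.
        _queue = new Queue<int>(QueueCapacity);
        _concurrentQueue = new ConcurrentQueue<int>();
        for (var i = 0; i < QueueCapacity; i++)
        {
            _queue.Enqueue(i);
            _concurrentQueue.Enqueue(i);
        }
        _queue.Clear();
        _concurrentQueue.Clear();
    }

    [Benchmark]
    public void BenchmarkEmpty() { } // measurement floor

    [Benchmark]
    public void BenchmarkQueue()
    {
        for (var i = 0; i < QueueCapacity; i++) _queue.Enqueue(i);
        _queue.Clear(); // Queue<T> keeps its backing array, so this allocates nothing
    }

    [Benchmark]
    public void BenchmarkConcurrentQueue()
    {
        for (var i = 0; i < QueueCapacity; i++) _concurrentQueue.Enqueue(i);
        _concurrentQueue.Clear(); // ConcurrentQueue<T> discards its segments, generating garbage
    }
}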

| Method | QueueCapacity | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|
| BenchmarkEmpty | 100 | 138.00 ns | 17.89 ns | 52.76 ns | 100.0 ns | 600 B |
| BenchmarkConcurrentQueue | 100 | 3,093.62 ns | 65.38 ns | 127.53 ns | 3,100.0 ns | 2648 B |
| BenchmarkQueue | 100 | 776.27 ns | 19.36 ns | 42.91 ns | 800.0 ns | 600 B |
| BenchmarkEmpty | 5000000 | 1,115,969.23 ns | 15,240.34 ns | 12,726.38 ns | 1,112,600.0 ns | 600 B |
| BenchmarkConcurrentQueue | 5000000 | 25,210,842.86 ns | 372,509.73 ns | 330,220.18 ns | 25,179,650.0 ns | 41947736 B |
| BenchmarkQueue | 5000000 | 7,210,566.67 ns | 49,435.43 ns | 46,241.94 ns | 7,193,800.0 ns | 600 B |

As you can see, Queue<T> properly caches allocations in its internal array, while ConcurrentQueue<T> generates garbage without bound, causing eventual GCs.

However, ConcurrentQueue is much, much faster even without multithreaded code, so we need to decide what to prioritize here.

Testing with Queue<T>

Out of curiosity, I ran a smaller benchmark with both the Queue and the ConcurrentQueue implementations. Pay attention to the allocations...

ConcurrentQueue

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 958,487,072.22 ns | 18,238,321.64 ns | 19,514,799.82 ns | 955,026,850.0 ns | 1000.0000 | 1000.0000 | 113553496 B |

Queue

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 929,862,868.4 ns | 18,076,785.37 ns | 20,092,299.04 ns | 929,196,000.0 ns | 1000.0000 | 1000.0000 | 79998808 B |

As you can see, although Queue allocates about 30% less than ConcurrentQueue, there's still GC pressure.

All my benchmark code (minus swapping out the queue type) can be found in #8.

I'll run some more testing and see if I can figure out what's allocating.

`JobScheduler` never really quits.

The scheduler doesn't clean up its worker threads, which stops the application from shutting down properly. Even if Dispose is called on the scheduler, the application never quits. The workaround for now is to call Environment.Exit(0) after everything has closed.
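One likely fix, assuming the workers are ordinary foreground Threads (not verified against the source): mark them as background threads, which .NET tears down automatically at process exit. WorkerLoop here is a hypothetical stand-in for the scheduler's worker body:

using System.Threading;

class WorkerExample
{
    static void WorkerLoop() { /* hypothetical worker body */ }

    static void Start()
    {
        var worker = new Thread(WorkerLoop)
        {
            // Background threads are killed automatically when all foreground
            // threads exit, so they can never block process shutdown.
            IsBackground = true,
            Name = "JobScheduler worker"
        };
        worker.Start();
    }
}

Dispose would still want to signal the loop to exit and Join each worker for a clean shutdown.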

The dependencies thread-stall problem: how to fix?

Consider the following:

  • A scheduler has 2 threads (for simplicity), thread A and thread B
  • Four jobs X, Y, Z, W are scheduled and flushed:
    • Job X has a runtime of 100ms
    • Job Y has a dependency on Job X, and a runtime of 10ms
    • Jobs Z and W have a runtime of 2ms each, and have no dependencies.

In the current system, here's what will happen:

  • Thread A starts working on job X
  • Thread B starts working on job Y
  • After 100ms, thread A completes job X and moves on to job Z
  • Job Y is signaled to continue now that its dependency is resolved
  • Thread A completes job Z after 2ms, picks up job W
  • Thread A completes job W after 2ms
  • After 6ms more, Thread B completes, and we're done.

Clearly, there's a problem here! Thread B was stuck on Job Y, even though it could have moved on to jobs Z and W first with no penalty.

How big of a problem is this? Do you have any ideas for fixing it?

A few of my thoughts:

  • I don't think a full Work Stealing algorithm is relevant here. From what I've read, the advantages of Work Stealing are mostly seen in a fork/join context, but we explicitly cannot fork from within jobs (you can only schedule on the main thread, with good reason).
  • We may be able to draw concepts from Work Stealing. For example, the idea of a Deque instead of a Queue, when done properly, can let heavier jobs go to the top and lighter ones go to the bottom. But I don't know at all how that would be applied.
  • It might help some to insertion-sort the unflushed job queue by dependency count. That would ensure all jobs in a given flush would have mostly-ideal sorting for concurrency. But if the user flushes a lot it would become moot.
  • A priority queue implementation probably wouldn't work, because there would be no way to update the dependency count once the job is added, not without removing and re-adding jobs. That sounds like a concurrency nightmare.
  • One option could be that, when a thread picks up a job, it checks whether its dependencies are complete. If not, it puts the job at the back of the queue and tries another (roughly sketched below). I don't know whether this would be efficient, though, or at what point the thread should give up and just await a dependency handle (to avoid busy-waiting/spinning). And by that point the queue might be in an even worse state.

Any ideas here? Nothing particularly good is coming to mind; just half-solutions.
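For concreteness, here's a rough shape of that requeue idea (JobMeta here is an illustrative stand-in, not the scheduler's real job record):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class JobMeta
{
    public List<JobMeta> Dependencies { get; } = new();
    public bool IsComplete { get; private set; }
    public Action Work { get; set; } = () => { };
    public void Execute() { Work(); IsComplete = true; }
}

static class RequeueSketch
{
    // Dequeues jobs until it finds one whose dependencies are all complete;
    // blocked jobs go to the back of the queue. Gives up after maxRetries.
    public static bool TryRunNextJob(ConcurrentQueue<JobMeta> queue, int maxRetries)
    {
        for (var attempt = 0; attempt < maxRetries; attempt++)
        {
            if (!queue.TryDequeue(out var job)) return false;

            if (job.Dependencies.All(d => d.IsComplete))
            {
                job.Execute();
                return true;
            }

            queue.Enqueue(job); // dependencies unfinished: try another job
        }

        // The open question above: at this point the thread should probably
        // stop spinning and wait on a dependency handle instead.
        return false;
    }
}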

Counters

Is it possible to know how many jobs are currently scheduled and how many jobs are currently running?

Real-World Benchmarks

I'd like to have some sort of real-world benchmarks. Currently, we just have some micro-benchmarks of varying quality.

Here's some research on the topic from more general schedulers...

  • In An Efficient Work-Stealing Scheduler for Task Dependency Graph (Lin et al., 2020), they use two approaches...

    • Micro-benchmarks: a series of tests (similar to what we're doing already) using standard algorithms. Implementing these would be reasonably easy; the linear chain, for instance, is sketched after this list. Here are the chosen programs from the paper:
      • Linear chain: each task has one successor and one predecessor, except the first and last tasks. Each task increments a global counter by one. The graph has 8388608 tasks.
      • Binary tree: each task has one predecessor and two successors, except the root and leaf tasks. The tasks perform no computation; the graph has 8388607 tasks.
      • Graph traversal: a directed acyclic graph with random dependencies. The degree of each node (its number of successors and predecessors) is bounded by four. Each task sets a boolean flag marking its node as visited. The graph has 4000000 tasks.
      • Matrix multiplication: a task graph that multiplies two square matrices of size 2048×2048. The graph has two parts: (1) the first initializes the values of the matrices; (2) the second performs the multiplication. Computation proceeds row by row to represent a generic parallel-for pattern.
    • Real-world tests using VLSI timing analysis from OpenTimer. See the paper for details. I think this probably isn't for us, but it's an interesting approach worth mentioning.
  • If someone wanted to take up the work, deciphering Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance (Slaughter et al., 2020) and creating an implementation in this scheduler could be a fun (or maybe not-so-fun) challenge. I don't have the knowledge or experience to do it properly, though.
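The linear-chain micro-benchmark might look roughly like this; the Schedule overload taking a dependency handle is assumed, not verified against the current API:

public class IncrementJob : IJob
{
    public static long Counter;
    public void Execute() => Counter++; // safe: the chain forces sequential execution
}

public static class LinearChainBenchmark
{
    public static void Run(JobScheduler scheduler)
    {
        var job = new IncrementJob();
        JobHandle previous = default;

        for (var i = 0; i < 8388608; i++)
        {
            // Each task depends on its predecessor, forming one long chain.
            previous = scheduler.Schedule(job, previous);
        }

        scheduler.Flush();
        previous.Complete(); // waiting on the last link waits for the whole chain
    }
}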

There is an issue with these research approaches: none of them focus on what this job scheduler is really for, namely use in Arch and, more generally, in games and game-like applications. I'm not sure how to get some ECS-focused benchmarks up and running.

There's also the subject of comparisons against other C# solutions. For example, it would be good to have benchmarks we could run against Unity's job system, Unreal's Tasks System, Quartz.NET, or maybe even C#'s default Task library, to see how our overhead compares and where we might improve.

Possible ways to reduce locks?

I wonder if there are techniques that can be used to reduce locks e.g. for accessing the arrays and in some other places.

Undocumented value "amount" in Scheduler.Schedule

None of the APIs document what "amount" does in the Scheduler.Schedule method. The internal methods give it a default value of 0, but that isn't an allowed value in the user-facing API.
[screenshot: Scheduler.Schedule signature with the undocumented "amount" parameter]

I would have interpreted it as the number of total work items that will be split into batches.

IJobParallelFor.Execute(int index) would then be the batch index.

However, this doesn't seem to be the case either. I'm getting astronomical (integer overflow) index*BatchSize values trying to chop a Memory<T> of ~0.25 million contiguous Ts into chunks of 16384 and execute them in parallel.

It also doesn't seem to be the case that amount is the number of batches. (Why do we need to specify a batch size then, anyway? There's quite a bit of logic in the source that seems to find a middle ground between the thread limits and chopping up the parallel work, but it doesn't seem to be working as intended.)

Here's my use case:

Memory is the appropriate size, spanning precisely the entire workload.
It does not (and might never) divide perfectly by batch size; it could even be a prime number depending on application state.

    private class Parallel<C1> : IJobParallelFor
    {
        public RefAction_C<C1> Action;
        public JobHandle Handle;
        public Memory<C1> Memory { get; set; }

        public void Execute(int index)
        {
            var start = index * BatchSize;
            var length = Math.Min(BatchSize, Memory.Length - start);
            var span = Memory.Span.Slice(start, length);

            foreach (ref var c in span) Action(ref c);
        }
        
        public void Finish()
        {
        }

        public int ThreadCount { get; set; }
        public int BatchSize { get; set; }
    }

What is the amount I need to specify when scheduling the job?
job.Handle = Scheduler.Schedule(job, table.Count);

table.Count is 0.25 million (I want to schedule 4 jobs that each batch 0.25 million System.Numerics.Vector3s). The array is jagged, so it's not easily possible to batch them all together. job.BatchSize is 16384 (but any batch size causes issues, and the Scheduler seems to always wait on the 4th job to complete with ThreadCount = 4).
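An unverified guess, based on Unity's analogous IJobParallelFor: amount may be the total element count, with Execute(int index) invoked once per element (index in [0, amount)) and BatchSize only controlling how indices are grouped onto threads. Under that reading, Execute shouldn't multiply by BatchSize at all:

    public void Execute(int index)
    {
        // Assumption: index is already an element index in [0, amount),
        // so there is no multiplication by BatchSize (the source of the overflow above).
        Action(ref Memory.Span[index]);
    }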

CancellationToken

It would be useful to be able to stop the scheduler with a cancellation token.

Scheduler.CompleteAll or Scheduler.CombineDependencies(JobHandle, JobHandle)

Because tracking dependencies is quite inconvenient (with or without the ability to stackalloc, etc.), I would like to recommend providing an API that allows completing all outstanding dependencies at the Scheduler level.

Alternatively, an easier way to combine dependencies without heap-allocated objects would be great; e.g. combining two dependencies individually, iteratively.
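A hedged sketch of that pairwise style and how it would be used (the two-handle overload is hypothetical, as are jobs and scheduler):

// Hypothetical overload: JobHandle CombineDependencies(JobHandle a, JobHandle b).
JobHandle combined = default;

foreach (var job in jobs) // jobs: whatever workloads the caller is scheduling
{
    var handle = scheduler.Schedule(job);
    // Fold each new handle into the running combination: no list, no heap allocation.
    combined = scheduler.CombineDependencies(combined, handle);
}

scheduler.Flush();
combined.Complete();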

I believe there are two styles of use for a scheduler:

  • the Unity style (which ZeroAlloc seems to mimic / which Unity seems to copy from ZeroAlloc)
  • the "I have a tightly defined workload, a delegate pointing to it, and want it done with zero allocations while I synchronously wait"-style, which neither ThreadPool.QueueUserWorkItem nor anything in the TPL offer.

The latter is tedious with the way completing dependencies currently works.

The former is debatably great and flexible. (It's also a path straight into self-made dependency hell.)

Remove JobScheduler singleton

Is there a reason other than convenience that JobScheduler produces a singleton?

Instance isn't referenced anywhere in the code except in IJobExtensions.Schedule(), which is just a convenience method to avoid scheduler.Schedule(job). It doesn't even free the client from keeping a reference to the scheduler or referencing the instance themselves, because they still need to call scheduler.Flush() and scheduler.Dispose(), neither of which has an extension method (and no extension method would make sense).

It seems to me it adds complexity, testing issues, and thread-safety problems.

Imagine the following...

var scheduler1 = new JobScheduler(...);

// somewhere completely else, or on a different thread:
var scheduler2 = new JobScheduler(...);
scheduler2.Dispose();

myJob.Schedule(...); // schedules on an invalid, disposed JobScheduler, even if you might expect the first!

I would argue we remove the Instance and instead let the user track their own singleton instead, if they wish. If that sounds OK, let me know, and I can make a PR!

JobScheduler.Test.csproj missing from Git

The .sln file contains the following:

Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "JobScheduler.Test", "JobScheduler.Test\JobScheduler.Test.csproj", "{22CA05F3-3921-4DEE-BC02-392BBC7F885C}"
EndProject

But this file is missing from Git, so when opened in Visual Studio, it errors out.

Batched parallel jobs (IJobParallelFor)

Right now, the job system can't really handle spawning a large number of very small jobs; the overhead makes it not worth it.

We should provide a batched parallel job API, in a way similar to Unity's IJobParallelFor API, with four jobs scheduled to four threads, and a pseudo-work-stealing implementation between them. (Where each thread has a deque of work which other threads can grab from, to reduce concurrency issues.)

A couple issues to solve:

  • Figuring out the data structure; is there a concurrent deque we could use?
  • How would we solve the zero-alloc problem, if batches can be any size?
  • One simple solution to both issues: the queues aren't "real", just a number and an offset; the range shrinks via interlocked operations as threads claim sub-ranges to process (see the sketch after this list).
  • The queues would need to be associated to a JobID to make sure we don't actually pull from some other job.
  • The API call would have to return a single job handle that only completes when all batches are empty.
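A minimal sketch of that interlocked-range idea (all names hypothetical):

using System;
using System.Threading;

// Each parallel job owns one shared range; worker threads claim batches
// from it with a single interlocked add instead of a real deque.
public sealed class WorkRange
{
    private int _next;                 // next unclaimed element index
    private readonly int _end;         // exclusive end of the range
    private readonly int _batchSize;

    public WorkRange(int amount, int batchSize)
    {
        _end = amount;
        _batchSize = batchSize;
    }

    // Claims the next batch; returns false once the range is exhausted.
    public bool TryClaim(out int start, out int length)
    {
        start = Interlocked.Add(ref _next, _batchSize) - _batchSize;
        length = Math.Min(_batchSize, _end - start);
        return length > 0;
    }
}

Each thread would loop on TryClaim and run Execute over [start, start + length); the single returned job handle completes once every range is exhausted and all claimed batches have finished.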

ConcurrentQueue generates garbage

As I discussed in #9, ConcurrentQueue generates garbage.

I did a bit more testing and produced the following table by repeatedly clearing and re-adding a set number of items to a ConcurrentQueue:

| Method | QueueCapacity | Reps | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|---|
| BenchmarkConcurrentQueue | 1 | 32 | 5,011.1 ns | 103.44 ns | 145.00 ns | 5,000.0 ns | 16984 B |
| BenchmarkConcurrentQueueWithDequeue | 1 | 32 | 2,174.3 ns | 47.47 ns | 78.00 ns | 2,200.0 ns | 600 B |
| BenchmarkQueue | 1 | 32 | 648.0 ns | 17.21 ns | 50.22 ns | 600.0 ns | 600 B |
| BenchmarkConcurrentQueue | 32 | 32 | 19,262.5 ns | 203.21 ns | 199.58 ns | 19,200.0 ns | 16984 B |
| BenchmarkConcurrentQueueWithDequeue | 32 | 32 | 28,842.9 ns | 416.75 ns | 369.44 ns | 28,800.0 ns | 600 B |
| BenchmarkQueue | 32 | 32 | 8,573.5 ns | 172.01 ns | 277.77 ns | 8,500.0 ns | 600 B |
| BenchmarkConcurrentQueue | 64 | 32 | 35,466.7 ns | 504.52 ns | 393.89 ns | 35,350.0 ns | 41560 B |
| BenchmarkConcurrentQueueWithDequeue | 64 | 32 | 51,761.5 ns | 398.80 ns | 333.01 ns | 51,700.0 ns | 1368 B |
| BenchmarkQueue | 64 | 32 | 9,938.5 ns | 198.82 ns | 166.02 ns | 10,000.0 ns | 600 B |

This shows that even if we're just adding and removing a single item, it still generates garbage every single time the queue is cleared! However, if we dequeue the whole queue instead of clearing it, it reuses its segment... unless we exceed the segment length.

Looking at the ConcurrentQueue source, the initial segment size is 32 items. So, in the current system, if we exceed 32 JobMetas at once, we start allocating... and the moment we dequeue things, we generate garbage.

However, there's a hack to increase the initial segment size! If we pass a dummy IEnumerable<JobMeta> to the queue's constructor, we can raise that limit. So during initialization we should fill the queue with a sequence as long as our intended limit, and then immediately dequeue (NOT clear!) the queue.
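A sketch of that initialization (maxExpectedJobs is a stand-in name):

using System.Collections.Concurrent;

// The IEnumerable constructor sizes the initial segment to fit the whole sequence.
var queue = new ConcurrentQueue<JobMeta>(new JobMeta[maxExpectedJobs]);

// Dequeue, do NOT Clear(): dequeuing keeps the oversized segment for reuse,
// while Clear() would discard it and fall back to the default 32-slot segment.
while (queue.TryDequeue(out _)) { }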
