genaray / zeroallocjobscheduler

A high-performance, allocation-free C# job scheduler.

License: Apache License 2.0

C# 100.00%
csharp ecs engine game game-development gamedev godot jobs monogame multithreading net6 net7 netstandard21 unity unity3d

zeroallocjobscheduler's Introduction

Hi, I'm genaray 💻

I am a 23-year-old computer science student and technology enthusiast living in NRW, Germany.

I recently finished my bachelor's thesis at a small local IT company, and I am looking forward to pursuing my master's degree as well.

I use C# and Java on a daily basis, but I love learning new languages and technologies. In my spare time, I mostly play with Unity and develop small games.

Hobbies

  • Video games 🎮
  • Gym 🏋️
  • Photography 📷
  • Learning new languages 🇹🇷
  • Game development 🤖

zeroallocjobscheduler's People

Contributors

genaray, lilithsilver


zeroallocjobscheduler's Issues

Pool the `CombineDependencies` list

Currently, users are forced to maintain a List<JobHandle> for each CombineDependencies call they make, and they can't even reuse a list between CombineDependencies calls until they're sure the job has resolved. This isn't great from an API standpoint.

With #11, we have a value n we can consult to determine the current maximum simultaneous job count without allocation; n - 1 is therefore the theoretical maximum number of dependencies that can be passed to CombineDependencies in one go. If we pre-allocate n lists of length n - 1, we can clear and reuse a list whenever one is returned, or fall back to allocating new lists (or growing existing ones) if none are available.

Users would still have to maintain a List<JobHandle>, but it wouldn't be tied to internal state anymore, so it's a much more reasonable ask.

As discussed in #7.
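As a starting point, a minimal sketch of such a pool (names are hypothetical, not the library's API):

using System.Collections.Generic;

internal sealed class HandleListPool
{
    private readonly Stack<List<JobHandle>> _free = new();

    public HandleListPool(int maxConcurrentJobs)
    {
        // Pre-allocate n lists of capacity n - 1, per the reasoning above.
        for (var i = 0; i < maxConcurrentJobs; i++)
            _free.Push(new List<JobHandle>(maxConcurrentJobs - 1));
    }

    // Falls back to allocating a fresh list when the pool is exhausted.
    public List<JobHandle> Rent() => _free.Count > 0 ? _free.Pop() : new List<JobHandle>();

    public void Return(List<JobHandle> list)
    {
        list.Clear(); // cleared lists keep their capacity, so reuse stays allocation-free
        _free.Push(list);
    }
}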

Nonatomic Queue Operations

The following operation in RangeWorkStealingDeque is not atomic.

var b = Volatile.Read(ref _bottom);
// we're popping, so decrement the bottom in advance.
// Doing this in advance ensures that we "reserve space" in the size, so that even if someone steals,
// they can't steal past this _bottom (their size would return less than 0, and they wouldn't steal). 
// At the end of this method, if we need to adjust this (i.e. there ended up being nothing to steal, or less than amount to steal)
// we resolve this.
b--;
Volatile.Write(ref _bottom, b);

Suggested replacement:

var b = Interlocked.Decrement(ref _bottom);
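(Note: Interlocked.Decrement performs the read-modify-write atomically and returns the decremented value, so b holds the same post-decrement value as in the original code and the separate Volatile.Write is no longer needed.)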

Cannot `stackalloc JobHandle[]`

Even though JobHandle is a struct, the compiler does not allow me to stackalloc an array of them, which makes allocation-free scheduling of jobs from within a loop of unknown iteration count very cumbersome (borderline impossible).

It would be extremely convenient to be able to do this, in order to directly and synchronously schedule and then complete workloads without allocating or messing around with a List or the like.

It would also help with combining job dependencies as these jobs are being scheduled in parallel. As far as the package's API goes, that requires some sort of Span to be passed in, but assembling that Span in memory is impossible without going through another type like System.Array or System.Collections.Generic.List.

[screenshot: compiler error when attempting stackalloc JobHandle[]]

... but it is a struct. And that totally works with other structs:

[screenshot: stackalloc compiling fine for a different struct]

What am I getting wrong here? :)
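(A likely cause, though not confirmed in this thread: stackalloc only accepts unmanaged types, and JobHandle presumably contains a reference-type field, e.g. a reference back to its scheduler, which makes it a managed struct. A minimal illustration:)

using System;

struct UnmanagedHandle { public int Id; }          // no reference fields: unmanaged
struct ManagedHandle { public object Scheduler; }  // reference field: managed

static class StackallocDemo
{
    static void Main()
    {
        Span<UnmanagedHandle> ok = stackalloc UnmanagedHandle[4]; // compiles
        // Span<ManagedHandle> bad = stackalloc ManagedHandle[4]; // error: managed type
        Console.WriteLine(ok.Length);
    }
}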

I tried to work with a list with pre-reserved capacity, but I can't even submit a List to Scheduler.CombineDependencies. So there's no chance (without juggling array indices and counting jobs?) to track dependencies as they accrue?
[screenshot: List<JobHandle> rejected by Scheduler.CombineDependencies]

Add a "max expected concurrent jobs" value to JobScheduler's constructor, to prevent spontaneous allocation

There is a lot of arbitrary allocation happening in the code at the moment, as discussed in #9:

  • On the first Schedule() for a new number of concurrent jobs, we always allocate to QueuedJobs
  • DefaultObjectPool<ManualResetEvent> will start thrashing and generating lots of garbage if we try to take more than 32 items at once
    • The limit can be increased with a constructor parameter, but it significantly lowers pool efficiency.
  • ConcurrentQueue will generate garbage every time we move past 32 items and then dequeue everything
    • There's a way to increase this segment size, but the solution's janky. See #10 for the detailed fix.

Adding a "max expected concurrent jobs" parameter to the JobScheduler constructor with a reasonable default of 32 or perhaps 64 would let us initialize everything right off the bat, and have a truly guaranteed ZeroAllocJobScheduler! (Unless you go past your limit of concurrent scheduled or in-flight jobs.)

We could even add a strict mode that throws an error if you go over the limit.
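A hedged sketch of what that constructor could look like (all names hypothetical; JobMeta stands in for the scheduler's internal job record):

using System.Collections.Generic;

internal struct JobMeta { }

public sealed class JobScheduler
{
    private readonly Queue<JobMeta> _queuedJobs;
    private readonly bool _strict;

    public JobScheduler(int maxExpectedConcurrentJobs = 32, bool strictMode = false)
    {
        // Size all internal storage once, up front, so steady-state scheduling
        // never allocates as long as the caller stays under the limit.
        _queuedJobs = new Queue<JobMeta>(maxExpectedConcurrentJobs);
        _strict = strictMode; // strict mode: throw instead of allocating past the limit
    }
}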

(I can work on this once #8 and #7 are complete, but I'm adding these issues to track them so we don't forget.)

ZeroAllocJobScheduler isn't actually ZeroAlloc

Ran a benchmark and got these results:

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | False | 4000000 | 4.360 s | 0.0796 s | 0.0744 s | 7000.0000 | 6000.0000 | 1000.0000 | 625.18 MB |
| BenchmarkParallel | 0 | True | 4000000 | 4.109 s | 0.0803 s | 0.0955 s | 6000.0000 | 6000.0000 | - | 401.18 MB |

This is for scheduling 4 million handles, flushing them, calling Complete() on each one in a row, and then calling Return() on each one in a row.

There are a couple culprits, one of which I'm aware of...

Queue<T> vs. ConcurrentQueue<T>

ConcurrentQueue<T>, as it turns out, doesn't pool its internal segments (the nodes of its linked list). Segments are rarely actually allocated, because each contains a fairly large internal array; however, with larger struct elements and many additions and removals, it starts to generate garbage. Here's a benchmark that demonstrates this: it first adds a ton of items to both queues, runs queue.Clear(), and then runs the benchmarks, which repeat the process.

(BenchmarkEmpty shows that 600 B is the measurement floor for some reason; it literally doesn't do anything.)
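For reference, a minimal sketch of what that setup might look like with BenchmarkDotNet (int elements; the real run used larger struct elements, so treat this as an approximation of the method, not the exact code):

using System.Collections.Concurrent;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class QueueGarbageBenchmark
{
    [Params(100, 5_000_000)]
    public int QueueCapacity;

    private Queue<int> _queue = null!;
    private ConcurrentQueue<int> _concurrentQueue = null!;

    [GlobalSetup]
    public void Setup()
    {
        // Grow both queues once up front, then empty them.
        _queue = new Queue<int>(QueueCapacity);
        _concurrentQueue = new ConcurrentQueue<int>();
        for (var i = 0; i < QueueCapacity; i++)
        {
            _queue.Enqueue(i);
            _concurrentQueue.Enqueue(i);
        }
        _queue.Clear();
        _concurrentQueue.Clear();
    }

    [Benchmark]
    public void BenchmarkEmpty() { } // measurement floor

    [Benchmark]
    public void BenchmarkQueue()
    {
        for (var i = 0; i < QueueCapacity; i++) _queue.Enqueue(i);
        _queue.Clear(); // Queue<T> keeps its backing array, so this allocates nothing
    }

    [Benchmark]
    public void BenchmarkConcurrentQueue()
    {
        for (var i = 0; i < QueueCapacity; i++) _concurrentQueue.Enqueue(i);
        _concurrentQueue.Clear(); // ConcurrentQueue<T> discards its segments, generating garbage
    }
}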

| Method | QueueCapacity | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|
| BenchmarkEmpty | 100 | 138.00 ns | 17.89 ns | 52.76 ns | 100.0 ns | 600 B |
| BenchmarkConcurrentQueue | 100 | 3,093.62 ns | 65.38 ns | 127.53 ns | 3,100.0 ns | 2648 B |
| BenchmarkQueue | 100 | 776.27 ns | 19.36 ns | 42.91 ns | 800.0 ns | 600 B |
| BenchmarkEmpty | 5000000 | 1,115,969.23 ns | 15,240.34 ns | 12,726.38 ns | 1,112,600.0 ns | 600 B |
| BenchmarkConcurrentQueue | 5000000 | 25,210,842.86 ns | 372,509.73 ns | 330,220.18 ns | 25,179,650.0 ns | 41947736 B |
| BenchmarkQueue | 5000000 | 7,210,566.67 ns | 49,435.43 ns | 46,241.94 ns | 7,193,800.0 ns | 600 B |

As you can see, Queue<T> properly caches allocations in its internal array, while ConcurrentQueue<T> generates garbage without bound, causing eventual GCs.

However, ConcurrentQueue is much, much faster even without multithreaded code, so we need to decide what to prioritize here.

Testing with Queue<T>

Out of curiosity, I ran a smaller benchmark with both the Queue and the ConcurrentQueue implementations. Pay attention to the allocations...

ConcurrentQueue

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 958,487,072.22 ns | 18,238,321.64 ns | 19,514,799.82 ns | 955,026,850.0 ns | 1000.0000 | 1000.0000 | 113553496 B |

Queue

| Method | Threads | SetupJobsFirst | JobCount | Mean | Error | StdDev | Median | Gen0 | Gen1 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkParallel | 0 | True | 1000000 | 929,862,868.4 ns | 18,076,785.37 ns | 20,092,299.04 ns | 929,196,000.0 ns | 1000.0000 | 1000.0000 | 79998808 B |

As you can see, although Queue allocates about 30% less than ConcurrentQueue, there's still GC pressure.

All my benchmark code (minus swapping out the queue type) can be found in #8.

I'll run some more testing and see if I can figure out what's allocating.

`JobScheduler` never really quits.

The scheduler doesn't clean up its worker threads, which stops the application from shutting down properly. Even if Dispose is called on the scheduler, the application never quits. The workaround for now is to call Environment.Exit(0) after everything has closed.
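One likely fix, assuming the workers are ordinary foreground Threads (not verified against the source): mark them as background threads, which .NET tears down automatically at process exit. WorkerLoop here is a hypothetical stand-in for the scheduler's worker body:

using System.Threading;

class WorkerExample
{
    static void WorkerLoop() { /* hypothetical worker body */ }

    static void Start()
    {
        var worker = new Thread(WorkerLoop)
        {
            // Background threads are killed automatically when all foreground
            // threads exit, so they can never block process shutdown.
            IsBackground = true,
            Name = "JobScheduler worker"
        };
        worker.Start();
    }
}

Dispose would still want to signal the loop to exit and Join each worker for a clean shutdown.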

The dependencies thread-stall problem: how to fix?

Consider the following:

  • A scheduler has 2 threads (for simplicity), thread A and thread B
  • Four jobs X, Y, Z, W are scheduled and flushed:
    • Job X has a runtime of 100ms
    • Job Y has a dependency on Job X, and a runtime of 10ms
    • Jobs Z and W have a runtime of 2ms each, and have no dependencies.

In the current system, here's what will happen:

  • Thread A starts working on job X
  • Thread B starts working on job Y
  • After 100ms, thread A completes job X and moves on to job Z
  • Job Y is signaled to continue now that its dependency is resolved
  • Thread A completes job Z after 2ms, picks up job W
  • Thread A completes job W after 2ms
  • After 6ms more, Thread B completes, and we're done.

Clearly, there's a problem here! Thread B was stuck on Job Y, even though it could have moved on to jobs Z and W first with no penalty.

How big of a problem is this? Do you have any ideas for fixing it?

A few of my thoughts:

  • I don't think a full Work Stealing algorithm is relevant here. From what I've read, the advantages of Work Stealing are mostly seen in a fork/join context, but we explicitly cannot fork from within jobs (you can only schedule on the main thread, with good reason).
  • We may be able to draw concepts from Work Stealing. For example, the idea of a Deque instead of a Queue, when done properly, can let heavier jobs go to the top and lighter ones go to the bottom. But I don't know at all how that would be applied.
  • It might help some to insertion-sort the unflushed job queue by dependency count. That would ensure all jobs in a given flush would have mostly-ideal sorting for concurrency. But if the user flushes a lot it would become moot.
  • A priority queue implementation probably wouldn't work, because there would be no way to update the dependency count once the job is added, not without removing and re-adding jobs. That sounds like a concurrency nightmare.
  • One option could be that, when a thread picks up a job, it checks whether its dependencies are complete. If not, it puts the job at the back of the queue and tries another (roughly sketched below). I don't know whether this would be efficient, though, or at what point the thread should give up and just await a dependency handle (to avoid busy-waiting/spinning). And by that point the queue might be in an even worse state.

Any ideas here? Nothing particularly good is coming to mind; just half-solutions.
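For concreteness, here's a rough shape of that requeue idea (JobMeta here is an illustrative stand-in, not the scheduler's real job record):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class JobMeta
{
    public List<JobMeta> Dependencies { get; } = new();
    public bool IsComplete { get; private set; }
    public Action Work { get; set; } = () => { };
    public void Execute() { Work(); IsComplete = true; }
}

static class RequeueSketch
{
    // Dequeues jobs until it finds one whose dependencies are all complete;
    // blocked jobs go to the back of the queue. Gives up after maxRetries.
    public static bool TryRunNextJob(ConcurrentQueue<JobMeta> queue, int maxRetries)
    {
        for (var attempt = 0; attempt < maxRetries; attempt++)
        {
            if (!queue.TryDequeue(out var job)) return false;

            if (job.Dependencies.All(d => d.IsComplete))
            {
                job.Execute();
                return true;
            }

            queue.Enqueue(job); // dependencies unfinished: try another job
        }

        // The open question above: at this point the thread should probably
        // stop spinning and wait on a dependency handle instead.
        return false;
    }
}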

Counters

Is it possible to know how many jobs are currently scheduled and how many jobs are currently running?

Real-World Benchmarks

I'd like to have some sort of real-world benchmarks. Currently, we just have some micro-benchmarks of varying quality.

Here's some research on the topic from more general schedulers...

  • In An Efficient Work-Stealing Scheduler for Task Dependency Graph (Lin et al., 2020), they use two approaches...

    • Micro-benchmarks: a series of tests (similar to what we're doing already) using standard algorithms. Implementing these would be reasonably easy; the linear chain, for instance, is sketched after this list. Here are the chosen programs from the paper:
      • Linear chain: each task has one successor and one predecessor, except the first and last tasks. Each task increments a global counter by one. The graph has 8388608 tasks.
      • Binary tree: each task has one predecessor and two successors, except the root and leaf tasks. The tasks perform no computation; the graph has 8388607 tasks.
      • Graph traversal: a directed acyclic graph with random dependencies. The degree of each node (its number of successors and predecessors) is bounded by four. Each task sets a boolean flag marking its node as visited. The graph has 4000000 tasks.
      • Matrix multiplication: a task graph that multiplies two square matrices of size 2048×2048. The graph has two parts: (1) the first initializes the values of the matrices; (2) the second performs the multiplication. Computation proceeds row by row to represent a generic parallel-for pattern.
    • Real-world tests using VLSI timing analysis from OpenTimer. See the paper for details. I think this probably isn't for us, but it's an interesting approach worth mentioning.
  • If someone wanted to take up the work, deciphering Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance (Slaughter et al., 2020) and creating an implementation in this scheduler could be a fun (or maybe not-so-fun) challenge. I don't have the knowledge or experience to do it properly, though.
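The linear-chain micro-benchmark might look roughly like this; the Schedule overload taking a dependency handle is assumed, not verified against the current API:

public class IncrementJob : IJob
{
    public static long Counter;
    public void Execute() => Counter++; // safe: the chain forces sequential execution
}

public static class LinearChainBenchmark
{
    public static void Run(JobScheduler scheduler)
    {
        var job = new IncrementJob();
        JobHandle previous = default;

        for (var i = 0; i < 8388608; i++)
        {
            // Each task depends on its predecessor, forming one long chain.
            previous = scheduler.Schedule(job, previous);
        }

        scheduler.Flush();
        previous.Complete(); // waiting on the last link waits for the whole chain
    }
}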

There is an issue with these research approaches: none of them focus on what this job scheduler is really for, namely use in Arch and, more generally, in games and game-like applications. I'm not sure how to get some ECS-focused benchmarks up and running.

There's also the subject of comparisons against other C# solutions. For example, it would be good to have benchmarks we could run against Unity's job system, Unreal's Tasks System, Quartz.NET, or maybe even C#'s default Task library, to see how our overhead compares and where we might improve.

Possible ways to reduce locks?

I wonder if there are techniques that can be used to reduce locks e.g. for accessing the arrays and in some other places.

Undocumented value "amount" in Scheduler.Schedule

None of the APIs document what "amount" does in the Scheduler.Schedule method. The internal methods give it a default value of 0, but that isn't an allowed value in the user-facing API.
[screenshot: Scheduler.Schedule signature with the undocumented "amount" parameter]

I would have interpreted it as the number of total work items that will be split into batches.

IJobParallelFor.Execute(int index) would then be the batch index.

However, this doesn't seem to be the case either. I'm getting astronomical (integer overflow) index*BatchSize values trying to chop a Memory<T> of ~0.25 million contiguous Ts into chunks of 16384 and execute them in parallel.

It also doesn't seem to be the case that amount is the number of batches. (Why do we need to specify a batch size then, anyway? There's quite a bit of logic in the source that seems to find a middle ground between the thread limits and chopping up the parallel work, but it doesn't seem to be working as intended.)

Here's my use case:

Memory is the appropriate size, spanning precisely the entire workload.
It does not (and might never) divide perfectly by batch size; it could even be a prime number depending on application state.

    private class Parallel<C1> : IJobParallelFor
    {
        public RefAction_C<C1> Action;
        public JobHandle Handle;
        public Memory<C1> Memory { get; set; }

        public void Execute(int index)
        {
            var start = index * BatchSize;
            var length = Math.Min(BatchSize, Memory.Length - start);
            var span = Memory.Span.Slice(start, length);

            foreach (ref var c in span) Action(ref c);
        }
        
        public void Finish()
        {
        }

        public int ThreadCount { get; set; }
        public int BatchSize { get; set; }
    }

What is the amount I need to specify when scheduling the job?
job.Handle = Scheduler.Schedule(job, table.Count);

table.Count is 0.25 million (I want to schedule 4 jobs that each batch 0.25 million System.Numerics.Vector3s). The array is jagged, so it's not easily possible to batch them all together. job.BatchSize is 16384 (but any batch size causes issues, and the Scheduler seems to always wait on the 4th job to complete with ThreadCount = 4).
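An unverified guess, based on Unity's analogous IJobParallelFor: amount may be the total element count, with Execute(int index) invoked once per element (index in [0, amount)) and BatchSize only controlling how indices are grouped onto threads. Under that reading, Execute shouldn't multiply by BatchSize at all:

    public void Execute(int index)
    {
        // Assumption: index is already an element index in [0, amount),
        // so there is no multiplication by BatchSize (the source of the overflow above).
        Action(ref Memory.Span[index]);
    }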

CancellationToken

It would be useful to be able to stop the scheduler with a cancellation token.

Scheduler.CompleteAll or Scheduler.CombineDependencies(JobHandle, JobHandle)

Because tracking dependencies is quite inconvenient (with or without the ability to stackalloc, etc.), I would like to recommend providing an API that allows completing all outstanding dependencies at the Scheduler level.

Alternatively, an easier way to combine dependencies without heap-allocated objects would be great; e.g. combining two dependencies individually, iteratively.
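A hedged sketch of that pairwise style and how it would be used (the two-handle overload is hypothetical, as are jobs and scheduler):

// Hypothetical overload: JobHandle CombineDependencies(JobHandle a, JobHandle b).
JobHandle combined = default;

foreach (var job in jobs) // jobs: whatever workloads the caller is scheduling
{
    var handle = scheduler.Schedule(job);
    // Fold each new handle into the running combination: no list, no heap allocation.
    combined = scheduler.CombineDependencies(combined, handle);
}

scheduler.Flush();
combined.Complete();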

I believe there are two styles of use for a scheduler:

  • the Unity style (which ZeroAlloc seems to mimic / which Unity seems to copy from ZeroAlloc)
  • the "I have a tightly defined workload, a delegate pointing to it, and want it done with zero allocations while I synchronously wait"-style, which neither ThreadPool.QueueUserWorkItem nor anything in the TPL offer.

The latter is tedious with the way completing dependencies currently works.

The former is debatably great and flexible. (It's also a path straight into self-made dependency hell.)

Remove JobScheduler singleton

Is there a reason other than convenience that JobScheduler produces a singleton?

Instance isn't referenced anywhere in the code except in IJobExtensions.Schedule(), which is just a convenience method to avoid scheduler.Schedule(job). It doesn't even free the client from keeping a reference to the scheduler or referencing the instance themselves, because they still need to call scheduler.Flush() and scheduler.Dispose(), neither of which has an extension method (and no extension method would make sense).

It seems to me it adds complexity, testing issues, and thread-safety problems.

Imagine the following...

var scheduler1 = new JobScheduler(...);

// somewhere completely else, or on a different thread:
var scheduler2 = new JobScheduler(...);
scheduler2.Dispose();

myJob.Schedule(...); // schedules on an invalid, disposed JobScheduler, even if you might expect the first!

I would argue we remove the Instance and instead let the user track their own singleton instead, if they wish. If that sounds OK, let me know, and I can make a PR!

JobScheduler.Test.csproj missing from Git

The .sln file contains the following:

Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "JobScheduler.Test", "JobScheduler.Test\JobScheduler.Test.csproj", "{22CA05F3-3921-4DEE-BC02-392BBC7F885C}"
EndProject

But this file is missing from Git, so when opened in Visual Studio, it errors out.

Batched parallel jobs (IJobParallelFor)

Right now, the job system can't really handle spawning a large number of very small jobs; the overhead makes it not worth it.

We should provide a batched parallel job API, in a way similar to Unity's IJobParallelFor API, with four jobs scheduled to four threads, and a pseudo-work-stealing implementation between them. (Where each thread has a deque of work which other threads can grab from, to reduce concurrency issues.)

A couple issues to solve:

  • Figuring out the data structure; is there a concurrent deque we could use?
  • How would we solve the zero-alloc problem, if batches can be any size?
  • One simple solution to both issues: the queues aren't "real", just a number and an offset; the range shrinks via interlocked operations as threads claim sub-ranges to process (see the sketch after this list).
  • The queues would need to be associated to a JobID to make sure we don't actually pull from some other job.
  • The API call would have to return a single job handle that only completes when all batches are empty.
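A minimal sketch of that interlocked-range idea (all names hypothetical):

using System;
using System.Threading;

// Each parallel job owns one shared range; worker threads claim batches
// from it with a single interlocked add instead of a real deque.
public sealed class WorkRange
{
    private int _next;                 // next unclaimed element index
    private readonly int _end;         // exclusive end of the range
    private readonly int _batchSize;

    public WorkRange(int amount, int batchSize)
    {
        _end = amount;
        _batchSize = batchSize;
    }

    // Claims the next batch; returns false once the range is exhausted.
    public bool TryClaim(out int start, out int length)
    {
        start = Interlocked.Add(ref _next, _batchSize) - _batchSize;
        length = Math.Min(_batchSize, _end - start);
        return length > 0;
    }
}

Each thread would loop on TryClaim and run Execute over [start, start + length); the single returned job handle completes once every range is exhausted and all claimed batches have finished.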

ConcurrentQueue generates garbage

As I discussed in #9, ConcurrentQueue generates garbage.

I did a bit more testing and produced the following table by repeatedly clearing and re-adding a set number of items to a ConcurrentQueue:

| Method | QueueCapacity | Reps | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|---|
| BenchmarkConcurrentQueue | 1 | 32 | 5,011.1 ns | 103.44 ns | 145.00 ns | 5,000.0 ns | 16984 B |
| BenchmarkConcurrentQueueWithDequeue | 1 | 32 | 2,174.3 ns | 47.47 ns | 78.00 ns | 2,200.0 ns | 600 B |
| BenchmarkQueue | 1 | 32 | 648.0 ns | 17.21 ns | 50.22 ns | 600.0 ns | 600 B |
| BenchmarkConcurrentQueue | 32 | 32 | 19,262.5 ns | 203.21 ns | 199.58 ns | 19,200.0 ns | 16984 B |
| BenchmarkConcurrentQueueWithDequeue | 32 | 32 | 28,842.9 ns | 416.75 ns | 369.44 ns | 28,800.0 ns | 600 B |
| BenchmarkQueue | 32 | 32 | 8,573.5 ns | 172.01 ns | 277.77 ns | 8,500.0 ns | 600 B |
| BenchmarkConcurrentQueue | 64 | 32 | 35,466.7 ns | 504.52 ns | 393.89 ns | 35,350.0 ns | 41560 B |
| BenchmarkConcurrentQueueWithDequeue | 64 | 32 | 51,761.5 ns | 398.80 ns | 333.01 ns | 51,700.0 ns | 1368 B |
| BenchmarkQueue | 64 | 32 | 9,938.5 ns | 198.82 ns | 166.02 ns | 10,000.0 ns | 600 B |

This shows that even if we're just adding and removing a single item, it still generates garbage every single time the queue is cleared! However, if we dequeue the whole queue instead of clearing it, it reuses its segment... unless we exceed the segment length.

Looking at the ConcurrentQueue source, the initial segment size is 32 items. So, in the current system, if we exceed 32 JobMetas at once, we start allocating... and the moment we dequeue things, we generate garbage.

However, there's a hack to increase the initial segment size! If we pass a dummy IEnumerable<JobMeta> to the queue's constructor, we can raise that limit. So during initialization we should fill the queue with a sequence as long as our intended limit, and then immediately dequeue (NOT clear!) the queue.
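A sketch of that initialization (maxExpectedJobs is a stand-in name):

using System.Collections.Concurrent;

// The IEnumerable constructor sizes the initial segment to fit the whole sequence.
var queue = new ConcurrentQueue<JobMeta>(new JobMeta[maxExpectedJobs]);

// Dequeue, do NOT Clear(): dequeuing keeps the oversized segment for reuse,
// while Clear() would discard it and fall back to the default 32-slot segment.
while (queue.TryDequeue(out _)) { }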
