
Multithreading (engo issue, closed, 31 comments)

EtienneBruines commented:
Multithreading


Comments (31)

matiwinnetou commented:

This is quite a controversial change. Keep in mind that writing to the same components from multiple threads causes problems, so all component data access would have to be synchronized.

If we could solve this problem, we would rock; but the problem is rather tricky, IMHO.

Putting a read-write mutex everywhere may not buy us much either. For the 2D games this engine aims at, it is extra complexity that is perhaps not necessary, albeit very awesome from a technical point of view.

EtienneBruines commented:

That depends: how often do Systems write to something other than the object they're currently handling?
Are there any aspects in particular where you expect to see problems? (Multithreading in Go is quite easy, so I think it could be done.)

Keep in mind that we're not letting two Systems process in parallel. Only one System would be handling all Entitys at the same time.

Things that would have to be made thread-safe:

  • AddEntity (to the World) - see the sketch below;
  • AddSystem (to the World - shouldn't happen, but we could support it either way).

Things that don't matter:

  • AddComponent (to an Entity - but we could implement it either way);
  • all read-only data access.
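
A minimal sketch of what a thread-safe AddEntity could look like (the World and Entity shapes here are placeholders, not the current implementation):

package main

import (
        "fmt"
        "sync"
)

type Entity struct{ id int }

type World struct {
        mu       sync.Mutex
        entities []*Entity
}

// AddEntity can then be called safely from any goroutine.
func (w *World) AddEntity(e *Entity) {
        w.mu.Lock()
        defer w.mu.Unlock()
        w.entities = append(w.entities, e)
}

func main() {
        w := &World{}
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
                wg.Add(1)
                go func(i int) {
                        defer wg.Done()
                        w.AddEntity(&Entity{id: i})
                }(i)
        }
        wg.Wait()
        fmt.Println(len(w.entities)) // always 100, even under -race
}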

paked commented:

There are actually a boatload of problems with using concurrency in Go, in my experience. If we cannot guarantee the order in which a group of Systems runs, the game engine will actually behave differently.

Consider a collision system and a movement system. Depending on the order they are run in, they produce _different_, game-breaking outcomes.

CollisionSystem before MovementSystem

This leaves the player sinking into the ground when the state is rendered, because the collision is accounted for first... and then the player is moved to account for gravity.

MovementSystem before CollisionSystem

This is the correct order: gravity and movement are calculated, and then, if the player collides, it is moved again.

Maybe I'm misunderstanding how you intend to use concurrency; in that case, correct me.

EtienneBruines commented:

I get your example. But:

  • this is a problem even without multithreading;
    • as it stands, they are also handled System A first, then System B -- see the current implementation in the first post;
  • this could be fixed, even when using multithreading.
    • All we need is some kind of Priority for the Systems (see the sketch below).
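
A minimal sketch of such a Priority mechanism (this Systemer shape is an assumption for illustration; it is not the current interface):

package main

import (
        "fmt"
        "sort"
)

type Systemer interface {
        Priority() int
        Update(dt float32)
}

// updateAll runs the Systems one at a time, highest Priority first, so a
// MovementSystem can be guaranteed to run before a CollisionSystem even if
// each System processes its own Entitys concurrently internally.
func updateAll(systems []Systemer, dt float32) {
        sort.SliceStable(systems, func(i, j int) bool {
                return systems[i].Priority() > systems[j].Priority()
        })
        for _, s := range systems {
                s.Update(dt)
        }
}

type namedSystem struct {
        name string
        prio int
}

func (n namedSystem) Priority() int     { return n.prio }
func (n namedSystem) Update(dt float32) { fmt.Println(n.name) }

func main() {
        updateAll([]Systemer{
                namedSystem{"CollisionSystem", 1},
                namedSystem{"MovementSystem", 2},
        }, 0.016)
        // Output: MovementSystem, then CollisionSystem
}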

(Thinking about multithreading now is probably a lot easier than doing so in later stages; and it would be awesome to have a multithreaded game engine.)


The race detector in Go can also be of help, in case we have doubts.

EtienneBruines commented:

@paked @matiwinnetou Is it OK if I spend some time on this, to see if we can do this without changing the current behavior? Then when reviewing the PR, we could always say GO/NO-GO.

(If you think it's a waste of time: I'll probably be starting in 8 hours or so, and doing it for fun - so I won't be offended if it's denied afterwards.)

paked commented:

@EtienneBruines If you can make it work, I am all ears. Go for it!

EtienneBruines commented:

@matiwinnetou I used your galaxian example, just to show it works for massive numbers of renderables. It won't be part of any PR or anything; at least not by me. That's your accomplishment/project.

Note that it becomes slower over time, because of #56. This is expected.

Also note that the lag starts after it begins rendering 60,000 objects - this is not shown, though.

@paked @matiwinnetou You can look at the code; it's a bit more complicated, but then again: once it works, it works. And I believe it works. Thoughts?

The -race detector detected no flaws.

paked commented:

This is pretty awesome. I'll review this.

paked commented:

@EtienneBruines now that we have benchmarks, would you mind running them on this?

EtienneBruines commented:

What kind of benchmarks are you looking for? Anything in particular? (The estimate of 60k objects in the post above was based on that benchmark implementation.)

paked commented:

@EtienneBruines A comparison between non-concurrent and concurrent would be good.

EtienneBruines commented:

OK. Will do.

EtienneBruines commented:

The results are ... complicated.

When the task is "wait 1 nanosecond":

BenchmarkWorldWorkersOne-4          5000        988508 ns/op       72257 B/op       1006 allocs/op
BenchmarkWorldWorkersTwo-4          3000       1176729 ns/op       72259 B/op       1006 allocs/op
BenchmarkWorldWorkersFour-4         3000       1680458 ns/op       72266 B/op       1006 allocs/op
BenchmarkWorldWorkersEight-4        3000       1697502 ns/op       72261 B/op       1006 allocs/op
BenchmarkWorldLegacy-4              5000        821611 ns/op       72208 B/op       1003 allocs/op

When the task is "wait 50 nanoseconds":

BenchmarkWorldWorkersOne-4           500       7182040 ns/op      721995 B/op      10006 allocs/op
BenchmarkWorldWorkersTwo-4           500      12208238 ns/op      722007 B/op      10006 allocs/op
BenchmarkWorldWorkersFour-4          300      17106894 ns/op      721984 B/op      10006 allocs/op
BenchmarkWorldWorkersEight-4         300      17787672 ns/op      721996 B/op      10006 allocs/op
BenchmarkWorldLegacy-4               500       8147327 ns/op      721937 B/op      10003 allocs/op

When the task is "wait 500 nanoseconds":

BenchmarkWorldWorkersOne-4            10     395278811 ns/op      722220 B/op      10006 allocs/op
BenchmarkWorldWorkersTwo-4            20     209402795 ns/op      722169 B/op      10006 allocs/op
BenchmarkWorldWorkersFour-4           30     133606247 ns/op      722244 B/op      10006 allocs/op
BenchmarkWorldWorkersEight-4          50      75034864 ns/op      722412 B/op      10007 allocs/op
BenchmarkWorldLegacy-4                10     449733866 ns/op      721936 B/op      10003 allocs/op

It does not matter whether you have 1k or 10k Entitys; the relation between the number of workers and the runtime stays pretty much the same.

So, depending on how fast the System's Update function is, multithreading either makes perfect sense or only makes things slower.

EtienneBruines commented:

Seeing the results above, I think we could create a per-System setting, RunInParallel? Depending on the workload of the System, the user could say whether or not to run it in parallel, because doing so may either greatly improve performance or greatly reduce it.

Thoughts @paked @matiwinnetou ?

paked commented:

I'd like to see this proposed... Not quite sure how reliable it would be in practice.

EtienneBruines commented:

Well, Systemer is an interface which defines everything a System needs to have. It would gain an additional method:

type Systemer interface {
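    // RunInParallel reports whether this System's Update may be called
    // for all of its Entitys concurrently.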
    RunInParallel() bool
}

Indicating whether or not it should process all Entitys in parallel. The base System struct would return false just to be safe, and people can override it with true whenever they feel it would boost their performance.

This is desirable, probably because it's the only way they'd get multithreading at all - there is no other way to tell engi: "give me all Entitys on multiple goroutines".
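
A hedged sketch of how a System might opt in (the embedded base System and the RenderSystem here are illustrative assumptions, not engo's actual types):

// The base System provides the safe default.
type System struct{}

func (System) RunInParallel() bool { return false }

// A System whose per-Entity work is heavy and independent can opt in.
type RenderSystem struct {
    System
}

func (RenderSystem) RunInParallel() bool { return true }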

paked commented:

OK. I understand now. This seems like it would be totally kick ass.

matiwinnetou commented:

@EtienneBruines

"So, depending on how fast the Update function on the System is, it makes perfect sense to multitask, or it will only make it slower."

This does not surprise me at all; it matches my knowledge of how multithreaded systems work.

EtienneBruines commented:

The turning point is somewhere between 50 and 500 nanoseconds; I'm not sure what the "average time" a System needs is. So we could let the programmer decide.

Kunde21 commented:

If we can be sure that a call to System.Update(entity *Entity, dt float32) won't cause any internal race conditions, there's no reason to limit the number of go-routines in the concurrent implementation.

func (w *World) update(dt float32) {
        w.pre()

        var unp *UnpauseComponent

        for _, system := range w.Systems() {
                if headless && system.SkipOnHeadless() {
                        continue // so skip it
                }

                system.Pre()
                entities := system.Entities()
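                // complChan receives one signal per Update goroutine launched below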
                complChan := make(chan struct{})
                count := len(entities)
                for _, entity := range entities {
                        if (w.paused && !entity.Component(&unp)) || !entity.Exists {
                                count--
                                continue // so skip it
                        }
                        go func(entity *Entity) {
                                system.Update(entity, dt)
                                complChan <- struct{}{}
                        }(entity)
                }
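                // wait for every launched goroutine to signal completion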
                for count > 0 {
                        <-complChan
                        count--
                }
                system.Post()
        }
        w.post()
}

The runtime scheduler is highly optimized, and go-routine creation overhead is much lower than that of a hardware thread, so there's no need to limit ourselves to one go-routine per CPU core.

EtienneBruines commented:

Is there a reason not to limit it? (e.g. with Intel's Hyper-Threading, one could limit it to 2*NumCPU.)

Even if there's little overhead, what's the reason for having the overhead, if it can be avoided?

(I don't mean to "attack" you; just curious - I might even learn something)

Kunde21 commented:

Limiting the number of workers using channel constructs means you're paying the time and resource overhead to synchronize and pass data via the channel. There's no guarantee that the dispatcher is able to saturate the workers with tasks, resulting in under-utilized worker go-routines.

Spinning off a go-routine per job means that every go-routine has 100% utilization, internally, and the runtime scheduler will efficiently multiplex the work across hardware threads.

There was an interesting tweet about this a couple of weeks ago.

EtienneBruines commented:

Interesting that you mention that tweet; I noticed it too, but I also noticed that neither of the things he's comparing is actually executed in parallel. If one uses multiple long-lived workers, the workers are actually faster.

But I only tested that hypothesis using the same example as bradfitz (multiplying the input by 2). I'm not sure if it holds in this scenario.

The upside of this is that we don't need an output channel. All we need is to know that everything has been processed. (That would further simplify your snippet.)

I managed to achieve "maximum" performance not by "sending work" to those workers (using a channel), but by letting those workers "get" the work themselves, simply by incrementing an atomic integer value (indicating the index of the next Entity in the array). When that number became higher than the array length, workers knew they were done and stopped.
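
A minimal sketch of that atomic-index pattern (the Entity and System shapes here are placeholders, not engo's actual ones):

package main

import (
        "runtime"
        "sync"
        "sync/atomic"
)

type Entity struct{}

type System interface {
        Update(entity *Entity, dt float32)
}

type noopSystem struct{}

func (noopSystem) Update(entity *Entity, dt float32) {}

// updateParallel lets a fixed set of long-lived workers "get" work by
// atomically claiming the next Entity index, instead of being "sent"
// work over a channel.
func updateParallel(system System, entities []*Entity, dt float32) {
        var next int64 = -1
        var wg sync.WaitGroup
        for w := 0; w < runtime.NumCPU(); w++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        for {
                                i := atomic.AddInt64(&next, 1) // claim the next index
                                if i >= int64(len(entities)) {
                                        return // past the end of the array: this worker is done
                                }
                                system.Update(entities[i], dt)
                        }
                }()
        }
        wg.Wait() // every Entity has been processed exactly once
}

func main() {
        entities := make([]*Entity, 10000)
        for i := range entities {
                entities[i] = &Entity{}
        }
        updateParallel(noopSystem{}, entities, 0.016)
}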

Kunde21 commented:

The complChan channel in my snippet is used in place of a waitgroup, to ensure all go-routines have finished before moving on to system.Post(). Sending a struct{}{} down a channel doesn't allocate any memory, so the cost of that operation is minimal.
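
For comparison, a sync.WaitGroup version of the same synchronization (a fragment reusing the names from the snippet above):

var wg sync.WaitGroup
for _, entity := range entities {
        wg.Add(1)
        go func(entity *Entity) {
                defer wg.Done()
                system.Update(entity, dt)
        }(entity)
}
wg.Wait() // all Updates have finished; safe to call system.Post()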

EtienneBruines commented:

My bad; you're completely right. (I overlooked the struct{} part ;-).)

Kunde21 commented:

Do you have those benchmarks handy from these tests: #53 (comment)?

Kunde21 commented:

I put a few more steps into the current benchmarks. It seems that the tipping point is around 20 entities in a system. The efficiency of concurrency grows really quickly above that, reaching a 2-4x speed-up at 1,000 entities.

Here are my results with GOMAXPROCS set to 8 and to 4:

Concurrent-8:
BenchmarkCollisionSystem10-8       30000             51,234 ns/op            1,905 B/op         86 allocs/op
BenchmarkCollisionSystem15-8       20000             85,077 ns/op            3,649 B/op        137 allocs/op
BenchmarkCollisionSystem20-8       10000            125,822 ns/op            5,986 B/op        216 allocs/op
BenchmarkCollisionSystem25-8       10000            182,450 ns/op            8,929 B/op        287 allocs/op
BenchmarkCollisionSystem50-8        3000            421,889 ns/op           33,452 B/op        906 allocs/op
BenchmarkCollisionSystem100-8       2000          1,059,544 ns/op          134,764 B/op      3,057 allocs/op
BenchmarkCollisionSystem1000-8        20         85,438,450 ns/op       12,417,864 B/op    265,484 allocs/op

Concurrent-4:
BenchmarkCollisionSystem10-4       30000             49,635 ns/op            1,904 B/op         86 allocs/op
BenchmarkCollisionSystem15-4       20000             86,418 ns/op            3,648 B/op        137 allocs/op
BenchmarkCollisionSystem20-4       10000            127,601 ns/op            5,985 B/op        216 allocs/op
BenchmarkCollisionSystem25-4       10000            172,231 ns/op            8,930 B/op        287 allocs/op
BenchmarkCollisionSystem50-4        3000            431,928 ns/op           33,453 B/op        906 allocs/op
BenchmarkCollisionSystem100-4       1000          1,253,998 ns/op          134,770 B/op      3,058 allocs/op
BenchmarkCollisionSystem1000-4        10        127,351,280 ns/op       12,537,835 B/op    273,501 allocs/op

Serial-8:
BenchmarkCollisionSystem10-8       50000             39,582 ns/op            1,696 B/op         83 allocs/op
BenchmarkCollisionSystem15-8       20000             79,311 ns/op            3,440 B/op        134 allocs/op
BenchmarkCollisionSystem20-8       10000            134,940 ns/op            5,776 B/op        213 allocs/op
BenchmarkCollisionSystem25-8        5000            210,266 ns/op            8,720 B/op        284 allocs/op
BenchmarkCollisionSystem50-8        2000            804,187 ns/op           33,237 B/op        903 allocs/op
BenchmarkCollisionSystem100-8        500          3,406,920 ns/op          134,604 B/op      3,058 allocs/op
BenchmarkCollisionSystem1000-8         3        360,511,833 ns/op       13,342,314 B/op    324,384 allocs/op

Serial-4:
BenchmarkCollisionSystem10-4       50000             38,015 ns/op            1,696 B/op         83 allocs/op
BenchmarkCollisionSystem15-4       20000             75,529 ns/op            3,440 B/op        134 allocs/op
BenchmarkCollisionSystem20-4       10000            129,650 ns/op            5,776 B/op        213 allocs/op
BenchmarkCollisionSystem25-4        5000            204,636 ns/op            8,720 B/op        284 allocs/op
BenchmarkCollisionSystem50-4        2000            834,539 ns/op           33,237 B/op        903 allocs/op
BenchmarkCollisionSystem100-4        500          3,241,025 ns/op          134,592 B/op      3,057 allocs/op
BenchmarkCollisionSystem1000-4         3        353,174,266 ns/op       13,292,928 B/op    321,298 allocs/op

Kunde21 commented:

Added PR #88

Kunde21 commented:

I just realized the misunderstanding in your comment here: #53 (comment). In fact, I initially thought the same thing.

That benchmark code isn't comparing concurrent to serial. Instead, it's comparing the cost of passing a value between go-routines through a channel against the cost of passing a value to a go-routine via a function call to a closure.

It's essentially a comparison of the overhead costs we were discussing in #53 (comment) and #53 (comment). That's why I said spawning off a go-routine is cheaper than passing a value via a channel to a worker.

It's probably the biggest conceptual break when moving from C-style threads to go-routines (which was my path). Since the runtime absorbs the cost of creating a hardware thread, we come out ahead when we feed enough go-routines down that hardware thread. Some webservers have logged millions of concurrent go-routines, which spreads the cost of the hardware threads across all of the go-routines that use them.

My benchmark results are indicative of using go-routines to spread re-entrant functions across threads. Go has no native support for map/reduce; that, in my opinion, is mainly because the go-routine pattern I'm describing accomplishes the map/reduce pattern with only a bit more code.
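
For illustration, a sketch of that pattern: one go-routine per item performs the "map" step, and a channel collects the results for the "reduce" step:

package main

import "fmt"

func main() {
        nums := []int{1, 2, 3, 4}
        results := make(chan int, len(nums))
        for _, n := range nums {
                go func(n int) { results <- n * n }(n) // "map": one go-routine per item
        }
        sum := 0
        for range nums {
                sum += <-results // "reduce": combine results as they arrive
        }
        fmt.Println(sum) // 30
}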

EtienneBruines commented:

I have updated my branch to the current origin/master, and am now producing this benchmark (GOMAXPROCS=8):

BenchmarkCollisionSystem10-8      200000         26652 ns/op        1764 B/op         87 allocs/op
BenchmarkCollisionSystem15-8      200000         41264 ns/op        3504 B/op        138 allocs/op
BenchmarkCollisionSystem20-8      100000         59754 ns/op        5840 B/op        217 allocs/op
BenchmarkCollisionSystem25-8      100000         70440 ns/op        8784 B/op        288 allocs/op
BenchmarkCollisionSystem50-8       30000        167668 ns/op       33296 B/op        907 allocs/op
BenchmarkCollisionSystem100-8      10000        648894 ns/op      134580 B/op       3057 allocs/op
BenchmarkCollisionSystem1000-8       100      66815395 ns/op    12284235 B/op     258119 allocs/op

Using your code from PR #88 for comparison (because my CPU != your CPU), also with GOMAXPROCS=8:

BenchmarkCollisionSystem10-8      200000         24898 ns/op        1712 B/op         84 allocs/op
BenchmarkCollisionSystem15-8      100000         54139 ns/op        3456 B/op        135 allocs/op
BenchmarkCollisionSystem20-8       50000        113021 ns/op        5889 B/op        215 allocs/op
BenchmarkCollisionSystem25-8       30000        170921 ns/op        8833 B/op        286 allocs/op
BenchmarkCollisionSystem50-8       20000        254129 ns/op       33348 B/op        905 allocs/op
BenchmarkCollisionSystem100-8      10000        628400 ns/op      134638 B/op       3055 allocs/op
BenchmarkCollisionSystem1000-8       100      63574836 ns/op    12311827 B/op     258508 allocs/op

I guess for a high number of Entities, the overhead becomes less significant.

It's nice of you to clarify what the tweet is about. The overhead of incrementing a shared (atomic) integer value (the index) is even less than that of creating a goroutine (and therefore also less than passing values through a channel).

EtienneBruines commented:

Fixed as of #88 and #89.
