
Comments (7)

Yawning commented on June 15, 2024

Oh boy.

(Note: I just quickly skimmed the related issues, so my understanding of what the runtime does is surface-level at best.) I assume that edwardsMultiscalarMulPippengerVartimeVector is the routine causing the most grief, since that is what does the heavy lifting for the batch sizes you are working with.

The possibility of marking safepoints in assembly code is only documented in the Go proposal, though I haven't seen any examples of assembly code doing it in, e.g., the Go standard library so far. Is marking safepoints actually possible right now in the latest version of Go?

As far as I can tell, this is not possible, might not be needed, and may not help that much anyway. From what I can tell by skimming the issues, the runtime's (new-ish) async preemption mechanism is based on periodically sending SIGURG to the process (once every 10? ms), and even prior to the introduction of async preemption, function calls are preemption points (specifically, when a stack frame is allocated).

For the sake of my implementation sanity, the underlying operations in what I assume is the problematic routine are already broken up into sub-routines, and while the multiscalar multiply does have a number of tight loops, each loop iteration already has numerous occasions for the scheduler to do the right thing (since each call to AddExtendedCached and SubExtendedCached calls at least one assembly routine that uses a stack frame, vecMul_AVX2).

One thing that you could try is to insert explicit call(s) to runtime.Gosched(). Intuitively I suspect that the following loops are good candidates to investigate (a rough sketch of what this could look like follows the loops below), though the frequency of the explicit yield probably needs benchmarking so as not to overly impact non-concurrent performance.

for _, scalar := range scalars {

for i, point := range points {

for i := 0; i < size; i++ {
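
For illustration, a rough sketch of what such an explicit yield could look like; processPoints, its loop body, and the yieldEvery interval are all made up and merely stand in for the real routines above:

package example

import "runtime"

// yieldEvery is a made-up interval; the right value would need benchmarking
// so the yields do not overly impact non-concurrent performance.
const yieldEvery = 64

// processPoints is a hypothetical stand-in for one of the tight loops above.
func processPoints(points []int) {
	for i := range points {
		if i%yieldEvery == 0 {
			runtime.Gosched() // cooperatively yield to the Go scheduler
		}
		_ = points[i] // the real per-point work would go here
	}
}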

Are there any security implications to encouraging pre-emption for such crypto-related code?

Technically yes. The routines that would need additional pre-emption points are also used for things that work on secret material, and so additional copies of sensitive intermediaries will get spilled somewhere. That said, such things are outside the threat model of this library, due to it being an extremely difficult problem to solve in general, especially with Go.

(Note: I am technically out of the office for the next week or so. Responses may be delayed, but I am interested in getting this library to work well for you.)


Yawning commented on June 15, 2024

Addendum (which naturally occurred to me after I turned off my laptop....)

and that there isn't an option to, e.g., offload the execution of such code into a separate thread pool away from I/O-bound tasks

You could try doing just that as well, like so:

  1. Bump up the number of OS threads the scheduler will create (runtime.GOMAXPROCS).
  2. Spawn a pool of dedicated crypto workers, and give each of them a dedicated OS thread (runtime.LockOSThread/runtime.UnlockOSThread).

This should push the responsibility for "making the right thing happen" onto the kernel rather than the Go scheduler.
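
A minimal sketch of those two steps, assuming the crypto work arrives on a channel; startCryptoWorkers and the factor of 2 are illustrative choices, not anything this library provides:

package example

import (
	"runtime"
	"sync"
)

// startCryptoWorkers bumps GOMAXPROCS and spawns n workers, each locked to its
// own OS thread, that drain CPU-bound tasks from the channel until it closes.
func startCryptoWorkers(tasks <-chan func()) *sync.WaitGroup {
	n := runtime.GOMAXPROCS(0)
	runtime.GOMAXPROCS(n * 2) // leave scheduling slots free for I/O-bound goroutines

	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			runtime.LockOSThread() // dedicate an OS thread to this worker
			defer runtime.UnlockOSThread()
			defer wg.Done()
			for fn := range tasks {
				fn()
			}
		}()
	}
	return &wg
}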


lithdew commented on June 15, 2024

Thanks for the quick response! I'll report how things go after inserting a few explicit yield points in the areas you suggested, and after offloading all crypto code to a dedicated goroutine pool for CPU-bound tasks.

The idea of creating a goroutine pool dedicated to executing code by making use of runtime.{Lock, Unlock}OSThread() never occurred to me 🤔.


lithdew commented on June 15, 2024

EDIT: Take the solution below with a grain of salt; it may not actually solve the problem of I/O-bound goroutines being starved by CPU-bound code. I'm going to run my benchmarks in a few different environments and take some time to look at the trace logs just to make sure. IIRC I heard long ago that there are certain cases where runtime.LockOSThread() may not actually dedicate a single goroutine to a single thread the way we'd want it to.

I went for the latter approach of offloading the execution of all crypto code to a dedicated goroutine pool for CPU-bound tasks, and it looks like all of the I/O-bound goroutines are no longer being starved 😌.

The pool is a bit makeshift (e.g. hardcoded constants, and it forcefully doubles GOMAXPROCS), though here's the code for it in case it's helpful.

package rheia

import (
	"runtime"
	"sync"
	"sync/atomic"
)

const PoolWorkerPipelineSize = 1

type PoolTask struct {
	wg *sync.WaitGroup
	fn func()
}

type Pool struct {
	closed uint32
	queue  struct {
		sync.Mutex
		cond    sync.Cond
		entries []PoolTask
	}
	wg sync.WaitGroup
}

func NewPool() *Pool {
	pool := new(Pool)
	pool.queue.cond.L = &pool.queue.Mutex
	return pool
}

func (p *Pool) Close() {
	if !atomic.CompareAndSwapUint32(&p.closed, 0, 1) {
		return
	}
	p.queue.cond.Broadcast()
	p.wg.Wait()
}

// Run doubles GOMAXPROCS so that the workers locked to OS threads below do
// not consume every scheduling slot, then spawns one worker per original P.
func (p *Pool) Run() {
	count := runtime.GOMAXPROCS(0)
	runtime.GOMAXPROCS(count * 2)

	p.wg.Add(count)
	for i := 0; i < count; i++ {
		go func() {
			// Dedicate an OS thread to this worker for its lifetime.
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()

			defer p.wg.Done()

			for {
				p.queue.Lock()
				for atomic.LoadUint32(&p.closed) != 1 && len(p.queue.entries) == 0 {
					p.queue.cond.Wait()
				}
				count, overflow := len(p.queue.entries), false
				if count > PoolWorkerPipelineSize {
					count, overflow = PoolWorkerPipelineSize, true
				}
				tasks := p.queue.entries[:count]
				p.queue.entries = p.queue.entries[count:]
				if overflow {
					p.queue.cond.Signal()
				}
				p.queue.Unlock()

				if atomic.LoadUint32(&p.closed) == 1 && len(tasks) == 0 {
					return
				}

				for _, task := range tasks {
					task.fn()
					task.wg.Done()
				}
			}
		}()
	}
}

var poolTaskPool = sync.Pool{New: func() interface{} { return PoolTask{wg: new(sync.WaitGroup)} }}

// Submit enqueues fn and blocks until a worker has executed it, or runs fn
// inline if the pool has already been closed.
func (p *Pool) Submit(fn func()) {
	if atomic.LoadUint32(&p.closed) == 1 {
		fn()
		return
	}

	task := poolTaskPool.Get().(PoolTask)
	defer poolTaskPool.Put(task)

	task.wg.Add(1)
	task.fn = fn

	p.queue.Lock()
	p.queue.entries = append(p.queue.entries, task)
	p.queue.cond.Signal()
	p.queue.Unlock()

	task.wg.Wait()
}
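
A hypothetical usage sketch, just to make Submit's blocking behaviour explicit; verifyBatch here is a made-up stand-in for the real CPU-bound crypto call:

// exampleVerify runs verifyBatch on one of the locked worker threads and
// blocks the calling goroutine until it has finished.
func exampleVerify(pool *Pool, verifyBatch func() bool) bool {
	var ok bool
	pool.Submit(func() {
		ok = verifyBatch()
	})
	return ok
}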


lithdew commented on June 15, 2024

After looking at the trace logs more closely and talking with some people about the issue on Discord, unfortunately, the approach of using runtime.{Lock, Unlock}OSThread() to create a goroutine pool for CPU-bound tasks does not prevent the I/O-bound goroutines from being starved.

The main reason is this particular comment:

@ianlancetaylor (comment): Another, probably more important, issue is that the Go scheduler only permits up to GOMAXPROCS threads to be running Go code at one time. It enforces that restriction even on goroutines that are locked to threads using LockOSThread. So the scheduler will also preempt locked threads in order to let other goroutines run.

Yes. LockOSThread does not affect goroutine scheduling. I guess if people find this to be confusing we could add a sentence to the docs.

Because of this limitation in the scheduler, CPU-bound goroutines that are locked to OS threads can still starve I/O-bound goroutines.

Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.


Yawning commented on June 15, 2024

Hmm. That's unfortunate, though I am somewhat confused, since the combination of increasing GOMAXPROCS and locking the workers to a subset of the threads should guarantee that there are always OS-level threads available to do I/O-bound work (Ian Lance Taylor's comment and the linked issue seemed to be concerned with the reverse situation, where they want CPU-bound work to be prioritized, which isn't guaranteed).

The situation should be something like "there are count * 2 OS threads, of which only count are allowed to execute the goroutines dedicated to batch verification (but they can also execute other things)". This would be all sorts of awful if the intention were to prevent I/O-bound tasks from starving CPU-bound ones, but I still don't see a reason why the reverse isn't possible.

Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.

Another thing that may be worth investigating is breaking up each large batch into sub-batches, with explicit yields between each sub-batch. It is marginally more book-keeping on the caller side, but it is functionally similar to inserting yields into the batch verification routine itself, just with a coarser yield interval. The underlying algorithm(s) do favor batches being as large as possible though.
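
As a rough, hypothetical sketch of that idea (verifyBatch, item, and subBatchSize are made-up stand-ins for the caller's real batch-verification call and inputs, not anything from this library):

package example

import "runtime"

// item stands in for one (public key, message, signature) tuple.
type item struct{}

// verifyBatch stands in for the real batch-verification call.
func verifyBatch(items []item) bool { return len(items) >= 0 }

// verifyInSubBatches splits a large batch into sub-batches and explicitly
// yields to the scheduler between them, trading away some of the benefit of
// verifying one maximally large batch. subBatchSize is assumed to be positive.
func verifyInSubBatches(items []item, subBatchSize int) bool {
	ok := true
	for start := 0; start < len(items); start += subBatchSize {
		end := start + subBatchSize
		if end > len(items) {
			end = len(items)
		}
		ok = verifyBatch(items[start:end]) && ok
		runtime.Gosched() // let I/O-bound goroutines run between sub-batches
	}
	return ok
}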

I might have more ideas if I could see how the rest of the application is put together as well.


Yawning commented on June 15, 2024

TLDR: Closing due to lack of required functionality in the toolchain, and the compiler should be doing the right thing anyway.

I'm going to close this as "not much I can do about it" for the following reasons:

  • There is no current way to annotate assembly routines for such a thing.
  • The assembly used in this context should not run long enough to cause scheduling issues anyway.
  • The expensive vectorized assembly routines use a stack frame (so the compiler should be auto-inserting a safe point already).
  • The multiscalar multiply used during large batch verification has a large number of function calls (so the compiler should be auto-inserting a safe point already).

If someone comes up with a sensible location for explicit scheduler yields that does not overly hurt performance, I'm more than happy to take a PR. As a side note, I am still unconvinced that using runtime calls to limit which OS threads can do batch verification won't work, but at this point this is more "think about how the caller does things" than about how this library behaves.

