Comments (7)
Oh boy.
Note: I just quickly skimmed the related issues, so my understanding of what the runtime does is surface-level at best. I assume that edwardsMultiscalarMulPippengerVartimeVector is the routine that's causing the most grief, since it does the heavy lifting for the batch sizes you are working with.
The possibility of marking safepoints in assembly code is only documented in the Go proposal, and I haven't seen any examples of assembly code doing it in, e.g., the Go standard library so far. Is marking safepoints actually possible right now in the latest version of Go?
As far as I can tell, this is not possible, might not be needed, and may not help that much. From what I can tell by skimming the issues, the runtime's (new-ish) async preemption mechanism is based on periodically sending SIGURG to the process (once every 10? ms), and even prior to the introduction of async preemption, function calls were preemption points (specifically when a stack frame is allocated).
For the sake of my implementation sanity, the underlying operations in what I assume is the problematic routine are already broken up into sub-routines, and while the multiscalar multiply does have a number of tight loops, each loop iteration already has numerous occasions for the scheduler to do the right thing (since each call to AddExtendedCached and SubExtendedCached calls at least one assembly routine that uses a stack frame, vecMul_AVX2).
One thing that you could try is to insert explicit call(s) to runtime.Gosched(). Intuitively I suspect that the following loops are good candidates to investigate, though the frequency of the explicit yield probably needs benchmarking so as not to overly impact non-concurrent performance.
- curve25519-voi/curve/scalar_mul_pippenger.go, line 177 (at fc681bf)
- curve25519-voi/curve/scalar_mul_pippenger.go, line 184 (at fc681bf)
- curve25519-voi/curve/scalar_mul_pippenger.go, line 198 (at fc681bf)
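As a rough sketch of what an explicit yield in one of those tight loops could look like (the loop body, `sumBuckets`, and `yieldEvery` here are placeholders for illustration, not the actual Pippenger code):

```go
package main

import (
	"fmt"
	"runtime"
)

// yieldEvery is how many iterations to run between explicit yields.
// The right value needs benchmarking: too small hurts single-threaded
// throughput, too large defeats the purpose.
const yieldEvery = 64

// sumBuckets stands in for one of the tight bucket-accumulation loops
// in the multiscalar multiply; the real loop bodies are the point
// additions at the lines referenced above.
func sumBuckets(buckets []int) int {
	sum := 0
	for i, b := range buckets {
		sum += b // placeholder for the real per-iteration work
		if i%yieldEvery == yieldEvery-1 {
			// Give the scheduler an explicit chance to run
			// other goroutines mid-loop.
			runtime.Gosched()
		}
	}
	return sum
}

func main() {
	buckets := make([]int, 1000)
	for i := range buckets {
		buckets[i] = i
	}
	fmt.Println(sumBuckets(buckets)) // 499500
}
```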
Are there any security implications to encouraging pre-emption for such crypto-related code?
Technically yes. The routines that would need additional pre-emption points are also used for things that work on secret material, and so additional copies of sensitive intermediaries will get spilled somewhere. That said, such things are outside the threat model of this library, due to it being an extremely difficult problem to solve in general, especially with Go.
(Note: I am technically out of the office for the next week or so. Responses may be delayed, but I am interested in getting this library to work well for you.)
from curve25519-voi.
Addendum (which naturally occurred to me after I turned off my laptop....)
and that there isn't an option to i.e. offload the execution of such code into a separate thread pool away from I/O-bound tasks
You could try doing just that as well, like so:
- Bump up the number of OS threads the scheduler will create (runtime.GOMAXPROCS).
- Spawn a pool of dedicated crypto workers, and give each of them a dedicated OS thread (runtime.LockOSThread/runtime.UnlockOSThread).
This should push the responsibility of "making the right thing happen" to being the kernel's problem rather than the Go scheduler's.
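A minimal sketch of those two steps, assuming a plain channel-fed worker pool (the `jobs` channel and worker shape here are illustrative, not prescribed by the library):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// Keep the original number of procs available for I/O-bound work
	// and add that many again for the crypto workers.
	n := runtime.GOMAXPROCS(0)
	runtime.GOMAXPROCS(n * 2)

	jobs := make(chan func())
	var wg sync.WaitGroup

	// Spawn n crypto workers, each wired to its own OS thread.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			for job := range jobs {
				job() // CPU-bound batch verification etc.
			}
		}()
	}

	// Submit one job, then shut the pool down.
	done := make(chan string, 1)
	jobs <- func() { done <- "verified" }
	close(jobs)
	wg.Wait()
	fmt.Println(<-done) // verified
}
```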
Thanks for the quick response! I'll report back on how things go after inserting a few explicit yield points in the areas you suggested, and after offloading all crypto code to a dedicated goroutine pool for CPU-bound tasks.
The idea of creating a goroutine pool dedicated to executing such code using runtime.{Lock, Unlock}OSThread() never occurred to me 🤔.
EDIT: Take the solution below with a grain of salt; I'm not yet sure it actually solves the problem of I/O-bound goroutines being starved by CPU-bound code. I'm going to run my benchmarks in a few different environments and take some time to look at the trace logs just to make sure. IIRC I heard long ago that there are certain cases where runtime.LockOSThread() may not actually dedicate a single goroutine to a single thread the way we'd want it to.
I went for the latter approach of offloading the execution of all crypto code to a dedicated goroutine pool for CPU-bound tasks, and it looks like all of the I/O-bound goroutines are no longer being starved 😌.
The pool is a bit makeshift (e.g. hardcoded constants, and it forcefully allocates GOMAXPROCS additional OS threads), but here's the code for it in case it may be helpful.
```go
package rheia

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// PoolWorkerPipelineSize is the maximum number of tasks a worker
// drains from the queue in one pass.
const PoolWorkerPipelineSize = 1

type PoolTask struct {
	wg *sync.WaitGroup
	fn func()
}

type Pool struct {
	closed uint32
	queue  struct {
		sync.Mutex
		cond    sync.Cond
		entries []PoolTask
	}
	wg sync.WaitGroup
}

func NewPool() *Pool {
	pool := new(Pool)
	pool.queue.cond.L = &pool.queue.Mutex
	return pool
}

func (p *Pool) Close() {
	if !atomic.CompareAndSwapUint32(&p.closed, 0, 1) {
		return
	}
	// Broadcast while holding the queue lock, so a worker that has
	// checked the predicate but not yet parked cannot miss the wakeup.
	p.queue.Lock()
	p.queue.cond.Broadcast()
	p.queue.Unlock()
	p.wg.Wait()
}

func (p *Pool) Run() {
	// Double GOMAXPROCS so the locked crypto workers do not consume
	// the threads that I/O-bound goroutines need.
	count := runtime.GOMAXPROCS(0)
	runtime.GOMAXPROCS(count * 2)

	p.wg.Add(count)
	for i := 0; i < count; i++ {
		go func() {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			defer p.wg.Done()

			for {
				p.queue.Lock()
				for atomic.LoadUint32(&p.closed) != 1 && len(p.queue.entries) == 0 {
					p.queue.cond.Wait()
				}
				n, overflow := len(p.queue.entries), false
				if n > PoolWorkerPipelineSize {
					n, overflow = PoolWorkerPipelineSize, true
				}
				tasks := p.queue.entries[:n]
				p.queue.entries = p.queue.entries[n:]
				if overflow {
					// More work is left; wake another worker.
					p.queue.cond.Signal()
				}
				p.queue.Unlock()

				if atomic.LoadUint32(&p.closed) == 1 && len(tasks) == 0 {
					return
				}
				for _, task := range tasks {
					task.fn()
					task.wg.Done()
				}
			}
		}()
	}
}

var poolTaskPool = sync.Pool{New: func() interface{} { return PoolTask{wg: new(sync.WaitGroup)} }}

// Submit enqueues fn and blocks until it has run. If the pool is
// closed, fn runs inline on the caller's goroutine.
func (p *Pool) Submit(fn func()) {
	if atomic.LoadUint32(&p.closed) == 1 {
		fn()
		return
	}
	task := poolTaskPool.Get().(PoolTask)
	defer poolTaskPool.Put(task)
	task.wg.Add(1)
	task.fn = fn
	p.queue.Lock()
	p.queue.entries = append(p.queue.entries, task)
	p.queue.cond.Signal()
	p.queue.Unlock()
	task.wg.Wait()
}
```
After looking closer at the trace logs and talking with some people about the issue on Discord, unfortunately, the approach of using runtime.{Lock, Unlock}OSThread() to create a goroutine pool for CPU-bound tasks does not prevent the I/O-bound goroutines from being starved.
The main reason is this particular comment:
@ianlancetaylor (comment): Another, probably more important, issue is that the Go scheduler only permits up to GOMAXPROCS threads to be running Go code at one time. It enforces that restriction even on goroutines that are locked to threads using LockOSThread. So the scheduler will also preempt locked threads in order to let other goroutines run.
Yes. LockOSThread does not affect goroutine scheduling. I guess if people find this to be confusing we could add a sentence to the docs.
Because of this limitation in the scheduler, CPU-bound goroutines that are locked to OS threads can still starve I/O-bound goroutines.
Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.
Hmm. That's unfortunate, though I am somewhat confused since a combination of increasing GOMAXPROCS and locking the workers to a subset of them should guarantee that there always are OS level threads available to do I/O bound work (Ian Taylor's comment and the linked issue seemed to be concerned with the reverse situation, where they want CPU bound work to be prioritized, which isn't guaranteed).
The situation should be something like "there are count * 2 OS threads, of which only count are allowed to execute the goroutines dedicated to batch verification (but they can also execute other things)". This is all sorts of awful if the intention was to prevent I/O-bound tasks from starving CPU-bound ones, but I still don't see a reason why the reverse isn't possible.
Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.
Another thing that may be worth investigation is to break up each large batch into sub-batches, with explicit yields between each sub-batch. It is marginally more book-keeping in the caller-side, but it is functionally similar to inserting yields into the batch verification routine itself, just with a coarser yield interval. The underlying algorithm(s) do favor batches being as large as possible though.
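A sketch of that sub-batching idea, assuming a hypothetical verifySubBatch helper standing in for the real batch verification call (neither function below is part of the library's API):

```go
package main

import (
	"fmt"
	"runtime"
)

// subBatchSize is the coarse yield interval; it needs tuning, since
// the underlying algorithms favor batches being as large as possible.
const subBatchSize = 64

// verifySubBatch is a hypothetical stand-in for running batch
// verification over one chunk of signatures.
func verifySubBatch(sigs []int) bool {
	for range sigs {
		// expensive per-signature work would happen here
	}
	return true
}

// verifyAll splits one large batch into sub-batches, with an explicit
// scheduler yield between each, at the cost of some caller-side
// book-keeping and a little batching efficiency.
func verifyAll(sigs []int) bool {
	for len(sigs) > 0 {
		n := subBatchSize
		if n > len(sigs) {
			n = len(sigs)
		}
		if !verifySubBatch(sigs[:n]) {
			return false
		}
		sigs = sigs[n:]
		// Let I/O-bound goroutines run between sub-batches.
		runtime.Gosched()
	}
	return true
}

func main() {
	fmt.Println(verifyAll(make([]int, 1000))) // true
}
```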
I might have more ideas if I could see how the rest of the application is put together as well.
TLDR: Closing due to lack of required functionality in the toolchain, and the compiler should be doing the right thing anyway.
I'm going to close this as "not much I can do about it" for the following reasons:
- There is currently no way to annotate safepoints in assembly routines.
- The assembly used in this context does not run long enough per call to cause scheduling issues anyway.
- The expensive vectorized assembly routines use a stack frame (so the compiler should be auto-inserting a safe point already).
- The multiscalar multiply used during large batch verification has a large number of function calls (so the compiler should be auto-inserting a safe point already).
If someone comes up with a sensible location for explicit scheduler yields that does not overly hurt performance, I'm more than happy to take a PR. As a side note, I am still unconvinced that using runtime calls to limit which OS threads can do batch verification won't work, but at this point this is more "think about how the caller does things" than how this library behaves.