Comments (6)

abbycin commented on July 4, 2024

emmm...

[image]

from libaco.

hnes commented on July 4, 2024

Hello, please give me some time to read it through carefully before I reply :D

hnes commented on July 4, 2024

I have read your blog post; it's a very nice one :D

Here are the differences between your library and libaco:

  1. It is not recommended for users to implement a coroutine library in C++ themselves (it should be provided by the C++ standard and its library, which is the only way to guarantee correctness), because the C++ ABI depends on the compiler (even on its version) and on the platform. The implementation in abbycin/tools/coroutine follows only the Sys V ABI standard, which is incorrect. Alternatively, you could implement the coroutine library in C and call it from C++, taking care to handle the boundary between the two languages correctly.

  2. Even taking the Sys V ABI standard as the reference, switch_stack.asm does not fully comply with it (the problem with the FPU control word and the MXCSR control bits; Tencent's libco made the same mistake, see the bug issue);

  3. Your Windows implementation, switch_stack_win.asm, is also wrong, because the Microsoft x64 ABI is far more complex than the Sys V AMD64 ABI;

  4. A libaco coroutine supports not only a standalone execution stack but also sharing a single execution stack with an unlimited number of other coroutines (guard pages on the execution stack are supported as well), while abbycin/tools/coroutine supports only standalone execution stacks, which consume a huge amount of virtual memory in high-concurrency scenarios (1 to 10 million coroutines, for example).
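Point 2 concerns the x87 control word and the control bits of MXCSR, which the Sys V AMD64 ABI treats as callee-saved. Below is a minimal sketch of saving and restoring them around a switch, assuming GCC or Clang on x86-64 (other targets get a no-op fallback so the file still compiles). The `fp_ctl` type and the helper names are hypothetical and are not taken from libaco or abbycin/tools/coroutine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-coroutine slot for the floating-point control state. */
typedef struct {
    uint32_t mxcsr;   /* SSE control/status register */
    uint16_t x87_cw;  /* x87 FPU control word */
} fp_ctl;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
static void fp_ctl_save(fp_ctl *c)
{
    assert(c != NULL);
    __asm__ volatile("stmxcsr %0" : "=m"(c->mxcsr));
    __asm__ volatile("fnstcw %0" : "=m"(c->x87_cw));
}

static void fp_ctl_restore(const fp_ctl *c)
{
    assert(c != NULL);
    __asm__ volatile("ldmxcsr %0" : : "m"(c->mxcsr));
    __asm__ volatile("fldcw %0" : : "m"(c->x87_cw));
}
#else
/* Assumption: on other targets this sketch does nothing. */
static void fp_ctl_save(fp_ctl *c) { assert(c != NULL); c->mxcsr = 0; c->x87_cw = 0; }
static void fp_ctl_restore(const fp_ctl *c) { assert(c != NULL); }
#endif

/* Simulates a coroutine that changes the SSE rounding mode, then checks
 * that restoring the saved state brings the control bits back. */
static int fp_ctl_roundtrip_ok(void)
{
    fp_ctl before, scratch, after;
    fp_ctl_save(&before);
    scratch = before;
    scratch.mxcsr ^= 0x6000u;      /* flip the two rounding-control bits */
    fp_ctl_restore(&scratch);      /* "other coroutine" changed the mode */
    fp_ctl_restore(&before);       /* switch back: restore saved control state */
    fp_ctl_save(&after);
    /* Compare control bits only (mask 0xFFC0); status flags are not preserved. */
    return (before.mxcsr & 0xFFC0u) == (after.mxcsr & 0xFFC0u)
        && before.x87_cw == after.x87_cw;
}
```

A real context switch would do the equivalent inside the assembly routine itself; the C helpers here only illustrate which state is involved.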

yuanzhubi commented on July 4, 2024

Apart from not saving and restoring RDI and RSI, the original poster's library looks fine.
You forgot to save and restore the callee-saved registers RDI and RSI on Windows x64.

hnes commented on July 4, 2024

Here are the registers that the Windows x64 ABI standard classifies as nonvolatile (callee-saved):

Nonvolatile registers: R12:R15 RDI RSI RBX RBP RSP XMM6:XMM15

The code snippet below shows what a correct implementation on Windows should look like (it may still contain some bugs, since I have not fully checked it yet).

libcoro/coro.c#L137:

       #if __amd64

         #if _WIN32 || __CYGWIN__
           #define NUM_SAVED 29
           "\tsubq $168, %rsp\n" /* one dummy qword to improve alignment */
           "\tmovaps %xmm6, (%rsp)\n"
           "\tmovaps %xmm7, 16(%rsp)\n"
           "\tmovaps %xmm8, 32(%rsp)\n"
           "\tmovaps %xmm9, 48(%rsp)\n"
           "\tmovaps %xmm10, 64(%rsp)\n"
           "\tmovaps %xmm11, 80(%rsp)\n"
           "\tmovaps %xmm12, 96(%rsp)\n"
           "\tmovaps %xmm13, 112(%rsp)\n"
           "\tmovaps %xmm14, 128(%rsp)\n"
           "\tmovaps %xmm15, 144(%rsp)\n"
           "\tpushq %rsi\n"
           "\tpushq %rdi\n"
           "\tpushq %rbp\n"
           "\tpushq %rbx\n"
           "\tpushq %r12\n"
           "\tpushq %r13\n"
           "\tpushq %r14\n"
           "\tpushq %r15\n"
           #if CORO_WIN_TIB
             "\tpushq %fs:0x0\n"
             "\tpushq %fs:0x8\n"
             "\tpushq %fs:0xc\n"
           #endif
           "\tmovq %rsp, (%rcx)\n"
           "\tmovq (%rdx), %rsp\n"
           #if CORO_WIN_TIB
             "\tpopq %fs:0xc\n"
             "\tpopq %fs:0x8\n"
             "\tpopq %fs:0x0\n"
           #endif
           "\tpopq %r15\n"
           "\tpopq %r14\n"
           "\tpopq %r13\n"
           "\tpopq %r12\n"
           "\tpopq %rbx\n"
           "\tpopq %rbp\n"
           "\tpopq %rdi\n"
           "\tpopq %rsi\n"
           "\tmovaps (%rsp), %xmm6\n"
           "\tmovaps 16(%rsp), %xmm7\n"
           "\tmovaps 32(%rsp), %xmm8\n"
           "\tmovaps 48(%rsp), %xmm9\n"
           "\tmovaps 64(%rsp), %xmm10\n"
           "\tmovaps 80(%rsp), %xmm11\n"
           "\tmovaps 96(%rsp), %xmm12\n"
           "\tmovaps 112(%rsp), %xmm13\n"
           "\tmovaps 128(%rsp), %xmm14\n"
           "\tmovaps 144(%rsp), %xmm15\n"
           "\taddq $168, %rsp\n"
         #else
           #define NUM_SAVED 6
           "\tpushq %rbp\n"
           "\tpushq %rbx\n"
           "\tpushq %r12\n"
           "\tpushq %r13\n"
           "\tpushq %r14\n"
           "\tpushq %r15\n"
           "\tmovq %rsp, (%rdi)\n"
           "\tmovq (%rsi), %rsp\n"
           "\tpopq %r15\n"
           "\tpopq %r14\n"
           "\tpopq %r13\n"
           "\tpopq %r12\n"
           "\tpopq %rbx\n"
           "\tpopq %rbp\n"
         #endif
         "\tpopq %rcx\n"
         "\tjmpq *%rcx\n"
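The constants in the Windows branch can be checked with a little arithmetic: ten XMM registers of 16 bytes each plus one 8-byte dummy qword give the 168 bytes reserved by `subq`, which is 21 qwords, and the eight pushed general-purpose registers bring the total to 29 qwords, matching `NUM_SAVED`. The sketch below spells this out. The names are hypothetical, the count ignores the optional `CORO_WIN_TIB` pushes, and the alignment check assumes the usual state at function entry, where `rsp` is 8 bytes past a 16-byte boundary.

```c
#include <assert.h>

enum {
    XMM_SAVED = 10,  /* xmm6 .. xmm15 */
    XMM_BYTES = 16,  /* size of one XMM register */
    PAD_BYTES = 8,   /* the "dummy qword" from the comment */
    GPR_SAVED = 8,   /* rsi rdi rbp rbx r12 r13 r14 r15 */
    QWORD     = 8
};

/* Bytes reserved below the return address for the XMM save area. */
static int xmm_area_bytes(void)
{
    return XMM_SAVED * XMM_BYTES + PAD_BYTES;
}

/* Number of 8-byte slots saved in total (XMM area plus pushed GPRs). */
static int saved_qwords(void)
{
    return xmm_area_bytes() / QWORD + GPR_SAVED;
}

/* rsp modulo 16 after the subq, given rsp modulo 16 at function entry. */
static int rsp_mod16_after_sub(int entry_mod16)
{
    assert(entry_mod16 >= 0 && entry_mod16 < 16);
    return ((entry_mod16 - xmm_area_bytes()) % 16 + 16) % 16;
}
```

With `rsp` at 8 modulo 16 on entry, the save area ends up 16-byte aligned, which is what the aligned `movaps` stores require.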

hnes commented on July 4, 2024

In the Best Practice part:

In summary, if you want to get the best performance out of libaco, keep the stack usage of a non-standalone non-main co at the point where it calls aco_yield as small as possible.

       co_fp 
       /  \
      /    \  
    f1     f2
   /  \    / \
  /    \  f4  \
yield  f3     f5

The stack usage of a non-standalone (i.e. sharing its stack with other coroutines) non-main co at the moment it yields (i.e. calls aco_yield to yield back to the main co) has a big impact on the performance of context switching between coroutines, as the benchmark results clearly show. In the diagram above, the stack usage of f2, f3, f4 and f5 has no direct influence on context-switching performance, since no aco_yield happens while they are executing. The stack usage of co_fp and f1, on the other hand, determines the value of co->save_stack.max_cpsz and has a big influence on context-switching performance.
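To see why the bytes in use at the yield point matter, here is a toy model of shared-stack switching. It assumes the general copy-stack technique (the used part of the shared stack is copied into a private buffer on yield and copied back on resume) and is not libaco's actual code. `toy_co`, `toy_yield_save` and `toy_resume_restore` are hypothetical names; only `max_cpsz` is borrowed from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-coroutine save area for a shared-stack coroutine. */
typedef struct {
    unsigned char *save_buf;  /* private copy of the stack contents */
    size_t save_cap;          /* capacity of save_buf in bytes */
    size_t saved_sz;          /* bytes saved at the last yield */
    size_t max_cpsz;          /* largest copy seen so far */
} toy_co;

/* On yield: copy the `used` bytes of the shared stack into the private
 * buffer. The cost of this memcpy grows with `used`. Returns 0 on success. */
static int toy_yield_save(toy_co *co, const unsigned char *stack_used, size_t used)
{
    assert(co != NULL);
    if (used > co->save_cap) {
        unsigned char *p = realloc(co->save_buf, used);
        if (p == NULL)
            return -1;
        co->save_buf = p;
        co->save_cap = used;
    }
    if (used > 0)
        memcpy(co->save_buf, stack_used, used);
    co->saved_sz = used;
    if (used > co->max_cpsz)
        co->max_cpsz = used;
    return 0;
}

/* On resume: copy the saved bytes back onto the shared stack. */
static void toy_resume_restore(const toy_co *co, unsigned char *stack_dst)
{
    assert(co != NULL);
    if (co->saved_sz > 0)
        memcpy(stack_dst, co->save_buf, co->saved_sz);
}
```

In this model a coroutine that yields with 8 bytes in use pays for an 8-byte copy, while one that yields with 488 bytes in use pays for a 488-byte copy on every switch.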

The key to keeping a function's stack usage small is to allocate its local variables (especially the big ones) on the heap and manage their lifetime manually, instead of allocating them on the stack by default. The -fstack-usage option of gcc is very helpful for this.
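A minimal sketch of that advice in plain C, without any libaco calls: the same computation written once with a large stack buffer and once with a heap buffer. The function names and the 64 KiB size are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { WORK_BUF_SIZE = 64 * 1024 };

/* Stack version: the 64 KiB buffer lives in this function's frame. If this
 * function (or anything it calls) yielded, the buffer would be part of the
 * stack contents that have to be saved. */
static unsigned long sum_on_stack(unsigned char fill)
{
    unsigned char buf[WORK_BUF_SIZE];
    unsigned long sum = 0;
    memset(buf, fill, sizeof buf);
    for (size_t i = 0; i < sizeof buf; i++)
        sum += buf[i];
    return sum;
}

/* Heap version: the frame holds only a pointer and a few scalars; the
 * buffer's lifetime is managed manually with malloc/free.
 * Returns 0 on success, -1 if the allocation fails. */
static int sum_on_heap(unsigned char fill, unsigned long *out)
{
    assert(out != NULL);
    unsigned char *buf = malloc(WORK_BUF_SIZE);
    if (buf == NULL)
        return -1;
    unsigned long sum = 0;
    memset(buf, fill, WORK_BUF_SIZE);
    for (size_t i = 0; i < WORK_BUF_SIZE; i++)
        sum += buf[i];
    free(buf);
    *out = sum;
    return 0;
}
```

Compiling a file like this with gcc's -fstack-usage option writes a per-function stack usage report, which makes the difference between the two frames visible.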

And from the Benchmark part:

aco_create/init_save_stk_sz=64B                              1     0.000 s      230.00 ns/op    4347824.79 op/s
aco_resume/co_amount=1/copy_stack_size=0B             20000000     0.412 s       20.59 ns/op   48576413.55 op/s
  -> acosw                                            40000000     0.412 s       10.29 ns/op   97152827.10 op/s
aco_destroy                                                  1     0.000 s      650.00 ns/op    1538461.66 op/s

aco_create/init_save_stk_sz=64B                       10000000     1.240 s      123.97 ns/op    8066542.54 op/s
aco_resume/co_amount=10000000/copy_stack_size=8B      40000000     1.327 s       33.17 ns/op   30143409.55 op/s
aco_destroy                                           10000000     0.328 s       32.82 ns/op   30467658.05 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.659 s       65.94 ns/op   15165717.02 op/s
aco_resume/co_amount=10000000/copy_stack_size=24B     40000000     1.345 s       33.63 ns/op   29737708.53 op/s
aco_destroy                                           10000000     0.337 s       33.71 ns/op   29666697.09 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.654 s       65.38 ns/op   15296191.35 op/s
aco_resume/co_amount=10000000/copy_stack_size=40B     40000000     1.348 s       33.71 ns/op   29663992.77 op/s
aco_destroy                                           10000000     0.336 s       33.56 ns/op   29794574.96 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.653 s       65.29 ns/op   15316087.09 op/s
aco_resume/co_amount=10000000/copy_stack_size=56B     40000000     1.384 s       34.60 ns/op   28902221.24 op/s
aco_destroy                                           10000000     0.337 s       33.73 ns/op   29643682.93 op/s

aco_create/init_save_stk_sz=64B                       10000000     0.652 s       65.19 ns/op   15340872.40 op/s
aco_resume/co_amount=10000000/copy_stack_size=120B    40000000     1.565 s       39.11 ns/op   25566255.73 op/s
aco_destroy                                           10000000     0.443 s       44.30 ns/op   22574242.55 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.61 ns/op   15241722.94 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B     20000000     0.947 s       47.36 ns/op   21114212.05 op/s
aco_destroy                                            2000000     0.125 s       62.35 ns/op   16039466.45 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.71 ns/op   15218784.72 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B     20000000     0.948 s       47.39 ns/op   21101216.29 op/s
aco_destroy                                            2000000     0.125 s       62.73 ns/op   15941559.26 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.49 ns/op   15270258.18 op/s
aco_resume/co_amount=2000000/copy_stack_size=152B     20000000     1.069 s       53.44 ns/op   18714275.17 op/s
aco_destroy                                            2000000     0.122 s       61.05 ns/op   16378678.85 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.132 s       65.91 ns/op   15171336.62 op/s
aco_resume/co_amount=2000000/copy_stack_size=232B     20000000     1.190 s       59.48 ns/op   16813230.99 op/s
aco_destroy                                            2000000     0.123 s       61.26 ns/op   16324298.25 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.131 s       65.68 ns/op   15224361.30 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B     20000000     1.828 s       91.40 ns/op   10941133.56 op/s
aco_destroy                                            2000000     0.145 s       72.56 ns/op   13781182.82 op/s

aco_create/init_save_stk_sz=64B                        2000000     0.132 s       65.80 ns/op   15197461.34 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B     20000000     1.829 s       91.47 ns/op   10932139.32 op/s
aco_destroy                                            2000000     0.149 s       74.70 ns/op   13387258.82 op/s
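The columns in the output above are related by simple arithmetic: the per-operation cost is the total time divided by the operation count, and the throughput is its reciprocal. A small sketch with hypothetical helper names, useful for sanity-checking a row:

```c
#include <assert.h>

/* Average cost of one operation in nanoseconds. */
static double ns_per_op(double total_seconds, long ops)
{
    assert(ops > 0);
    return total_seconds * 1e9 / (double)ops;
}

/* Throughput in operations per second implied by a per-operation cost. */
static double ops_per_second(double ns_op)
{
    assert(ns_op > 0.0);
    return 1e9 / ns_op;
}
```

For the first aco_resume row, 20000000 operations in 0.412 s come out at about 20.6 ns each. The acosw row beneath it counts 40000000 operations over the same 0.412 s, so about 10.3 ns each. The small differences from the printed values come from the total time being rounded to three decimals.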

As the README already describes, there are some limitations when you use libaco in the shared execution stack mode, but as long as you keep the stack usage on the shared execution stack at the moment of yielding as small as you can accept, it works just fine.

Still, using DMA in userspace is a very valuable approach and worth further investigation in the future (although it could be OS dependent, or even OS version dependent).

Thank you very much, @abbycin and your friend :D