Comments (6)
emmm...
from libaco.
Hello, please give me some time to read it through carefully before I reply :D
I have read your blog post, and it is a very nice one :D
Here are the differences between your library and libaco:
- Implementing a coroutine library yourself in C++ is not recommended (it should be provided by the C++ standard and its library, which is the only way to guarantee correctness), because the C++ ABI depends on the compiler (even on the compiler version) and on the platform. The implementation in abbycin/tools/coroutine relies only on the Sys V ABI standard, which is incorrect. An alternative is to implement the coroutine library in C and call it from C++, taking care to handle the boundary between the two languages correctly.
- Even taking the Sys V ABI as the reference, switch_stack.asm does not fully comply with it: it does not handle the FPU and MXCSR control words. Tencent's libco makes the same mistake (see the corresponding bug issue).
- switch_stack_win.asm is also wrong, because the Microsoft x64 ABI is far more complex than the Sys V AMD64 ABI.
- libaco supports not only a standalone execution stack per coroutine but also sharing a single execution stack among an unlimited number of coroutines (and it supports a guard page on the execution stack). abbycin/tools/coroutine supports only standalone execution stacks, which consumes a huge amount of virtual memory in high-concurrency scenarios, for example with 1 to 10 million coroutines.
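To make the second point more concrete, here is a minimal sketch of what saving and restoring the FPU and MXCSR control words around a switch could look like. It is not the code of libaco, libco, or abbycin/tools/coroutine: the names `fp_ctl_t`, `fp_ctl_save`, `fp_ctl_restore`, and `demo_switch_preserves_rounding` are hypothetical, the inline assembly assumes GCC or Clang on x86-64, and on any other target the sketch falls back to plain variables that only emulate the registers.

```c
#include <assert.h>
#include <stdint.h>

/* Control state that the Sys V AMD64 ABI treats as callee-saved:
 * the x87 control word and the control bits of MXCSR. */
typedef struct {
    uint16_t x87_cw;
    uint32_t mxcsr;
} fp_ctl_t;

#define MXCSR_CONTROL_MASK  0xFFC0u /* bits 6-15: DAZ, exception masks, rounding, FZ */
#define MXCSR_ROUND_TO_ZERO 0x6000u /* rounding-control bits 13-14 both set */

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
static void fp_ctl_save(fp_ctl_t *out) {
    __asm__ volatile("fnstcw %0" : "=m"(out->x87_cw));
    __asm__ volatile("stmxcsr %0" : "=m"(out->mxcsr));
}
static void fp_ctl_restore(const fp_ctl_t *in) {
    __asm__ volatile("fldcw %0" : : "m"(in->x87_cw));
    __asm__ volatile("ldmxcsr %0" : : "m"(in->mxcsr));
}
#else
/* Assumption for non-x86-64 builds: emulate the two registers with a
 * variable holding the usual power-on defaults, so the sketch still runs. */
static fp_ctl_t emulated_regs = { 0x037F, 0x1F80 };
static void fp_ctl_save(fp_ctl_t *out) { *out = emulated_regs; }
static void fp_ctl_restore(const fp_ctl_t *in) { emulated_regs = *in; }
#endif

/* Returns 1 if a rounding-mode change made "inside a coroutine" is undone
 * once the caller's control words are restored, 0 otherwise. */
static int demo_switch_preserves_rounding(void) {
    fp_ctl_t caller, inside, after;

    fp_ctl_save(&caller);             /* what a switch routine would save */

    inside = caller;
    inside.mxcsr = (inside.mxcsr & MXCSR_CONTROL_MASK) | MXCSR_ROUND_TO_ZERO;
    fp_ctl_restore(&inside);          /* the coroutine changes the rounding mode */
    fp_ctl_save(&after);
    if ((after.mxcsr & MXCSR_ROUND_TO_ZERO) != MXCSR_ROUND_TO_ZERO)
        return 0;

    fp_ctl_restore(&caller);          /* what a switch routine would restore */
    fp_ctl_save(&after);
    return (after.mxcsr & MXCSR_CONTROL_MASK) == (caller.mxcsr & MXCSR_CONTROL_MASK)
        && after.x87_cw == caller.x87_cw;
}
```

A switch routine that skips the save and restore steps would let the rounding mode chosen by one coroutine leak into whichever coroutine runs next, which is the kind of non-compliance described above.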
Apart from forgetting to save and restore the callee-saved registers RDI and RSI on Windows x64, the original poster's library has no other problems.
Here is the list of nonvolatile (callee-saved) registers in the Windows x64 ABI standard:
Nonvolatile registers: R12:R15 RDI RSI RBX RBP RSP XMM6:XMM15
The code snippet below shows roughly what a correct implementation on Windows should look like (it may still contain bugs, since I have not fully checked it yet).
#if __amd64
#if _WIN32 || __CYGWIN__
#define NUM_SAVED 29
"\tsubq $168, %rsp\n" /* one dummy qword to improve alignment */
"\tmovaps %xmm6, (%rsp)\n"
"\tmovaps %xmm7, 16(%rsp)\n"
"\tmovaps %xmm8, 32(%rsp)\n"
"\tmovaps %xmm9, 48(%rsp)\n"
"\tmovaps %xmm10, 64(%rsp)\n"
"\tmovaps %xmm11, 80(%rsp)\n"
"\tmovaps %xmm12, 96(%rsp)\n"
"\tmovaps %xmm13, 112(%rsp)\n"
"\tmovaps %xmm14, 128(%rsp)\n"
"\tmovaps %xmm15, 144(%rsp)\n"
"\tpushq %rsi\n"
"\tpushq %rdi\n"
"\tpushq %rbp\n"
"\tpushq %rbx\n"
"\tpushq %r12\n"
"\tpushq %r13\n"
"\tpushq %r14\n"
"\tpushq %r15\n"
#if CORO_WIN_TIB
"\tpushq %fs:0x0\n"
"\tpushq %fs:0x8\n"
"\tpushq %fs:0xc\n"
#endif
"\tmovq %rsp, (%rcx)\n"
"\tmovq (%rdx), %rsp\n"
#if CORO_WIN_TIB
"\tpopq %fs:0xc\n"
"\tpopq %fs:0x8\n"
"\tpopq %fs:0x0\n"
#endif
"\tpopq %r15\n"
"\tpopq %r14\n"
"\tpopq %r13\n"
"\tpopq %r12\n"
"\tpopq %rbx\n"
"\tpopq %rbp\n"
"\tpopq %rdi\n"
"\tpopq %rsi\n"
"\tmovaps (%rsp), %xmm6\n"
"\tmovaps 16(%rsp), %xmm7\n"
"\tmovaps 32(%rsp), %xmm8\n"
"\tmovaps 48(%rsp), %xmm9\n"
"\tmovaps 64(%rsp), %xmm10\n"
"\tmovaps 80(%rsp), %xmm11\n"
"\tmovaps 96(%rsp), %xmm12\n"
"\tmovaps 112(%rsp), %xmm13\n"
"\tmovaps 128(%rsp), %xmm14\n"
"\tmovaps 144(%rsp), %xmm15\n"
"\taddq $168, %rsp\n"
#else
#define NUM_SAVED 6
"\tpushq %rbp\n"
"\tpushq %rbx\n"
"\tpushq %r12\n"
"\tpushq %r13\n"
"\tpushq %r14\n"
"\tpushq %r15\n"
"\tmovq %rsp, (%rdi)\n"
"\tmovq (%rsi), %rsp\n"
"\tpopq %r15\n"
"\tpopq %r14\n"
"\tpopq %r13\n"
"\tpopq %r12\n"
"\tpopq %rbx\n"
"\tpopq %rbp\n"
#endif
"\tpopq %rcx\n"
"\tjmpq *%rcx\n"
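The constants in the Windows x64 snippet above (`$168` and `NUM_SAVED 29`) can be checked with a little arithmetic. The sketch below is only a sanity check of those numbers, written under two assumptions: `CORO_WIN_TIB` is not defined, and RSP is 8 modulo 16 at function entry (the ABI requires 16-byte alignment at the call site, and the call then pushes an 8-byte return address). All names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    XMM_SAVED   = 10, /* XMM6..XMM15 */
    XMM_BYTES   = 16, /* each XMM register is 128 bits wide */
    DUMMY_BYTES = 8,  /* the "one dummy qword" mentioned in the snippet */
    GPR_SAVED   = 8,  /* RSI RDI RBP RBX R12 R13 R14 R15 */
    QWORD_BYTES = 8
};

/* Bytes reserved by the subq instruction: 10 * 16 + 8 = 168. */
static size_t xmm_area_bytes(void) {
    return (size_t)XMM_SAVED * XMM_BYTES + DUMMY_BYTES;
}

/* Qwords saved in total: 168 / 8 + 8 pushes = 29. */
static size_t saved_qwords(void) {
    return xmm_area_bytes() / QWORD_BYTES + GPR_SAVED;
}

/* movaps faults on addresses that are not 16-byte aligned, so the XMM
 * save area must start on a 16-byte boundary after the subq. */
static int movaps_slots_aligned(uintptr_t rsp_at_entry) {
    return (rsp_at_entry - xmm_area_bytes()) % 16 == 0;
}
```

With an entry RSP such as `0x1008`, `movaps_slots_aligned` returns 1, while an entry value of `0x1000` would leave the save area misaligned. This is why the extra dummy qword is there. RSP itself is in the nonvolatile list but is not pushed, because the snippet stores it through the pointer passed in RCX.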
From the Best Practice part of the README:
In summary, if you want to get the best performance out of libaco, just keep the stack usage of the non-standalone non-main co at the point of calling aco_yield as small as possible.
            co_fp
            /  \
           /    \
          f1     f2
         /  \    /  \
        /    \  f4   \
      yield  f3      f5
The stack usage of a non-standalone, non-main co (one that shares its stack with other coroutines) at the moment it yields (i.e. calls aco_yield to yield back to the main co) has a big impact on the performance of context switching between coroutines, as the benchmark results clearly show. In the diagram above, the stack usage of f2, f3, f4 and f5 has no direct influence on context-switching performance, because no aco_yield is called while they are executing. The stack usage of co_fp and f1, by contrast, dominates the value of co->save_stack.max_cpsz and has a big influence on context-switching performance.
The key to keeping a function's stack usage tiny is to allocate local variables (especially the big ones) on the heap and manage their lifecycle manually, instead of allocating them on the stack by default. The -fstack-usage option of gcc is very helpful for this.
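The heap-allocation advice can be sketched roughly as follows. This is an illustration only: `yield_point` is a hypothetical stand-in for the place where a real coroutine function would call aco_yield, and `SCRATCH_BYTES` and `checksum_with_heap_scratch` are made-up names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { SCRATCH_BYTES = 64 * 1024 };

/* Hypothetical stand-in for the point where a coroutine would yield. */
static void yield_point(void) {}

/* Keeps the large scratch buffer on the heap, so that at the yield point
 * only a pointer and a few scalars occupy the (shared) stack. Declaring
 * `unsigned char scratch[SCRATCH_BYTES];` as a local instead would put
 * all 64 KiB on the stack that has to be copied at every switch.
 * Returns the sum of the buffer's bytes, or 0 if allocation fails. */
static unsigned long checksum_with_heap_scratch(unsigned char fill) {
    unsigned char *scratch = malloc(SCRATCH_BYTES);
    unsigned long sum = 0;
    size_t i;

    if (scratch == NULL)
        return 0;

    memset(scratch, fill, SCRATCH_BYTES);
    yield_point();                 /* the buffer survives on the heap */

    for (i = 0; i < SCRATCH_BYTES; i++)
        sum += scratch[i];

    free(scratch);                 /* lifecycle is managed manually */
    return sum;
}
```

Compiling a file like this with gcc's -fstack-usage option writes a `.su` file listing the stack bytes used by each function, which makes it easy to spot functions on the yield path that still hold large locals.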
And from the Benchmark part:
comment                                             task_amount  all_time_cost     ns_per_op             speed

aco_create/init_save_stk_sz=64B                               1        0.000 s  230.00 ns/op   4347824.79 op/s
aco_resume/co_amount=1/copy_stack_size=0B              20000000        0.412 s   20.59 ns/op  48576413.55 op/s
-> acosw                                               40000000        0.412 s   10.29 ns/op  97152827.10 op/s
aco_destroy                                                   1        0.000 s  650.00 ns/op   1538461.66 op/s

aco_create/init_save_stk_sz=64B                        10000000        1.240 s  123.97 ns/op   8066542.54 op/s
aco_resume/co_amount=10000000/copy_stack_size=8B       40000000        1.327 s   33.17 ns/op  30143409.55 op/s
aco_destroy                                            10000000        0.328 s   32.82 ns/op  30467658.05 op/s

aco_create/init_save_stk_sz=64B                        10000000        0.659 s   65.94 ns/op  15165717.02 op/s
aco_resume/co_amount=10000000/copy_stack_size=24B      40000000        1.345 s   33.63 ns/op  29737708.53 op/s
aco_destroy                                            10000000        0.337 s   33.71 ns/op  29666697.09 op/s

aco_create/init_save_stk_sz=64B                        10000000        0.654 s   65.38 ns/op  15296191.35 op/s
aco_resume/co_amount=10000000/copy_stack_size=40B      40000000        1.348 s   33.71 ns/op  29663992.77 op/s
aco_destroy                                            10000000        0.336 s   33.56 ns/op  29794574.96 op/s

aco_create/init_save_stk_sz=64B                        10000000        0.653 s   65.29 ns/op  15316087.09 op/s
aco_resume/co_amount=10000000/copy_stack_size=56B      40000000        1.384 s   34.60 ns/op  28902221.24 op/s
aco_destroy                                            10000000        0.337 s   33.73 ns/op  29643682.93 op/s

aco_create/init_save_stk_sz=64B                        10000000        0.652 s   65.19 ns/op  15340872.40 op/s
aco_resume/co_amount=10000000/copy_stack_size=120B     40000000        1.565 s   39.11 ns/op  25566255.73 op/s
aco_destroy                                            10000000        0.443 s   44.30 ns/op  22574242.55 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.131 s   65.61 ns/op  15241722.94 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B      20000000        0.947 s   47.36 ns/op  21114212.05 op/s
aco_destroy                                             2000000        0.125 s   62.35 ns/op  16039466.45 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.131 s   65.71 ns/op  15218784.72 op/s
aco_resume/co_amount=2000000/copy_stack_size=136B      20000000        0.948 s   47.39 ns/op  21101216.29 op/s
aco_destroy                                             2000000        0.125 s   62.73 ns/op  15941559.26 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.131 s   65.49 ns/op  15270258.18 op/s
aco_resume/co_amount=2000000/copy_stack_size=152B      20000000        1.069 s   53.44 ns/op  18714275.17 op/s
aco_destroy                                             2000000        0.122 s   61.05 ns/op  16378678.85 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.132 s   65.91 ns/op  15171336.62 op/s
aco_resume/co_amount=2000000/copy_stack_size=232B      20000000        1.190 s   59.48 ns/op  16813230.99 op/s
aco_destroy                                             2000000        0.123 s   61.26 ns/op  16324298.25 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.131 s   65.68 ns/op  15224361.30 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B      20000000        1.828 s   91.40 ns/op  10941133.56 op/s
aco_destroy                                             2000000        0.145 s   72.56 ns/op  13781182.82 op/s

aco_create/init_save_stk_sz=64B                         2000000        0.132 s   65.80 ns/op  15197461.34 op/s
aco_resume/co_amount=2000000/copy_stack_size=488B      20000000        1.829 s   91.47 ns/op  10932139.32 op/s
aco_destroy                                             2000000        0.149 s   74.70 ns/op  13387258.82 op/s
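The columns of the benchmark listing are tied together by simple arithmetic, which the sketch below reproduces. It assumes, from reading the numbers rather than from the benchmark's source, that the first numeric column is the operation count and the second is the total elapsed time. The helper names `ns_per_op`, `ops_per_sec`, and `within` are hypothetical.

```c
#include <assert.h>

/* Average cost of one operation in nanoseconds. */
static double ns_per_op(double total_seconds, double op_count) {
    return total_seconds * 1e9 / op_count;
}

/* Throughput in operations per second for a given per-operation cost. */
static double ops_per_sec(double nanoseconds_per_op) {
    return 1e9 / nanoseconds_per_op;
}

/* Returns 1 if a and b differ by at most tol, 0 otherwise. */
static int within(double a, double b, double tol) {
    double diff = a - b;
    if (diff < 0)
        diff = -diff;
    return diff <= tol;
}
```

For the second row, `ns_per_op(0.412, 20000000.0)` gives about 20.6 ns, in line with the listed 20.59 ns/op (the printed total time is rounded to three decimals). The `-> acosw` row reports the same 0.412 s over twice as many operations, so its per-operation cost is half that of aco_resume. Comparing the copy_stack_size=8B row (33.17 ns/op) with the 488B rows (about 91 ns/op) shows how the amount of stack copied at each yield drives the switching cost.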
As the README already describes, libaco has some limitations in shared execution stack mode, but as long as you keep the stack usage on the shared execution stack at the point of yielding as small as you can accept, it works just fine.
Still, using DMA in userspace is a very valuable approach and is worth further investigation in the future (though it could be OS dependent, or even OS version dependent).
Thank you very much, @abbycin and your friend :D