hnes / libaco
A blazing fast and lightweight C asymmetric coroutine library 💎 ⛅🚀⛅🌞
Home Page: https://libaco.org
License: Apache License 2.0
In the use case I'm considering for libaco, I'll have fewer than 20 standalone non-main coroutines, and the cost of switching between coroutines is critical. The current cost of 10ns for acosw() fits my application, but having to go through the main coroutine first to switch to another non-main coroutine puts the cost of a switch at 20ns, which becomes too much for what I'm trying to accomplish.
Is it possible to have a version of aco_yield() that receives the coroutine to yield to?
Thank goodness that I got the chance to come back!
Sorry for this not-so-temporary "temporary" leaving...
From now on, I will try to fix any bug whenever one is reported. And I am very happy to see that there has been no bug report since libaco was released, which is going quite as I expected.
All the features, pull requests, and the development of the next release will be postponed until I have enough spare time. Sorry for that. It may take one or two months. Still, I cannot make any promises about this. Life is full of accidents, sometimes even when we do not want them at all :-(
You have my sincere apologies and best regards. And, again, have fun :-)
Hello,
I currently have asynchronous code with callbacks that I would potentially like to replace with fibers.
The operation is simple: I run a parallel for loop over an event list and call the assigned callback functions.
With fibers, I would initially be on the thread with main_co and resume the coroutines with a parallel for loop; they can yield again (but they cannot be resumed again in the same loop).
Is it possible ?
I suppose there must be some synchronization to do on main_co when we resume, yield, or exit.
I would normally have only a few fibers (about fifty).
Thanks
I am using the actor model; actors get scheduled and run in a thread pool.
This means a coroutine can be yielded by any thread of the pool and then resumed by any thread of the pool.
But currently libaco binds the coroutine to a "main co", and it has to be resumed by the same thread.
Is there any possibility to support this?
void common_entry() {
    call_fp();   // call the current coroutine's aco_cofuncp
    aco_exit();
}
Then the cofuncp can return like a normal function without calling aco_exit.
Hello, I'm really interested in this library, but I'm sorry, I don't quite understand from the diagrams what is going on.
Do you think we can work together to improve the documentation? My naivety might help, as I can point out what I don't understand and perhaps make a PR with improved documentation.
In the first instance, I don't understand how this differs from normal stackful coroutines (fibres). To me, it looks like you are storing the registers in a private area but then sharing a single stack between multiple coroutines. Does this mean the stack is trashed when switching between coroutines, i.e. local variables can't be used? Is this an implementation of stackless coroutines?
Hi, our open source lib (https://github.com/alibaba/PhotonLibOS) is based on stackful coroutines and a high-efficiency IO scheduler. It looks like your implementation is stackless; would you like to integrate this project into our IO engine?
Our basic coroutine APIs include interrupt, context_switch, create, etc.
As per title.
#define aco_amd64_inline_short_aligned_memcpy_test_ok(dst, src, sz) \
( \
(((uintptr_t)(src) & 0x0f) == 0) && (((uintptr_t)(dst) & 0x0f) == 0) \
&& \
(((sz) & 0x0f) == 0x08) && (((sz) >> 4) >= 0) && (((sz) >> 4) <= 8) \
)
I think it should be (((sz) & 0x0f) == 0) instead of (((sz) & 0x0f) == 0x08).
Line 65 in c941da6
Do libaco coroutines run on different CPUs at the same time, i.e. does program performance benefit from having multiple CPUs? Or do coroutines run on only one CPU at any given point in time? In that case, libaco could be used in conjunction with a standard threading library, I guess...
The following source code structure is repeated a few times:
…
…* p = malloc(…);
assertalloc_ptr(p);
…
What do you think about combining it into a function like "aco_malloc" (or "aco_calloc")?
120 bytes sounds like a lot.
Is it possible to use ideas from here to lower it?
https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
Would 40 bytes be possible?
Any possibility to add support for thread (coroutine) local variables?
That would be very useful, thanks!
hi, I am reading the code and have done some benchmarks:
https://github.com/guangqianpeng/libaco/blob/master/bench_result
I have two questions:
1. tcmalloc improves the benchmark results. With aco_amount=1000000 and copy_stack_size=56B, the tcmalloc version achieves 37ns per aco_resume() operation but the default takes 66ns. Why? In this case, aco_resume() does not allocate memory, which is really confusing...
2. libaco uses %xmm registers to optimize small memory copying. But according to my benchmark, this does not make much of a difference. I guess memcpy() already takes advantage of these registers. Do you have more benchmark results?
I will be very grateful to you for answering my questions :-)
Dear author, hello :)
I'd like to ask you a simple question via the issue area.
Why did you choose to write your own assembly for context switching instead of using the ready-made setjmp? Is it because setjmp's functionality is insufficient, or because its performance is worse than hand-written assembly?
Looking forward to your reply!
Thanks for this great library!
I think that having something similar to the following functions would be very convenient:
    void* aco_yield_value();
    void aco_resume_value(aco_t* aco, void* val);
I see two practical use cases for this feature: It allows us to create constructs similar to Python's generators, where the coroutine iterates over some data structure or computed sequence and incrementally returns values from it. It also makes it easier to use libaco with event-based frameworks like io_uring, where events have a return value that needs to be passed back to the coroutine:
// Inside coroutine...
struct io_uring_sqe* sqe;
int amt_read;
char buf[1024];
int fd;
// ...
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
io_uring_sqe_set_data(sqe, aco_get_co());
io_uring_submit(ring);
amt_read = (int) (intptr_t) aco_yield_value();
// ...

// Inside event loop...
struct io_uring_cqe* cqe;
aco_t* coro;
int result_val;
// ...
io_uring_wait_cqe(ring, &cqe);
coro = io_uring_cqe_get_data(cqe);
result_val = cqe->res;
io_uring_cqe_seen(ring, cqe);
aco_resume_value(coro, (void*) (intptr_t) result_val);
As I understand it, adding a return value would require saving/restoring an additional register. Perhaps this feature could be put behind a #define macro, so that users can choose whether they want to take a slight performance hit for this feature.
Right now, this feature can be emulated by using an externally maintained pointer to marshal the data. But this is a very janky design pattern, and I believe that it is best to avoid polluting a codebase with pointers if the necessary data can simply be passed by value.
I find that an include guard like "ACO_H" can be too short for the safe reuse of your header file (when it belongs to an application programming interface).
Some of the code should also be wrapped in extern "C" so that C++ tools can use it.
arm & arm64 support are mentioned in the README's TODO section. I can't use this library until that is added :(
I didn't see an active issue tracker for it, so I figured I'd make one.
I would like to point out that an identifier like "_ACO_H" does not fit the naming convention expected by the C language standard: identifiers that begin with an underscore followed by an uppercase letter are reserved for the implementation.
Would you like to adjust your selection of unique names?
I'm planning to use libaco in C++, but I need some clarification about these 3 scenarios: whether they are free of undefined behavior and will work correctly.
Does libaco only have the implementation of shared stack mode?
I wonder how to determine whether the current co is the main co or a non-main co while executing the same function.
As the document says, aco_get_co() should be called by a non-main co.
How about returning a NULL pointer to reveal that the caller is the main co?
Add arm32/aarch64 support for Android and iOS.
Below is the output from my shell:
jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ bash make.sh -o no-m32
OUTPUT_DIR: ./output
CFLAGS: -g -O2 -Wall -Werror
EXTRA_CFLAGS:
ACO_EXTRA_CFLAGS:
OUTPUT_SUFFIX: ..no_valgrind.standaloneFPUenv
cc -g -O2 -Wall -Werror acosw.S aco.c test_aco_tutorial_0.c -o ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv
/usr/bin/ld: cannot open output file ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv: No such file or directory
collect2: error: ld returned 1 exit status
error: make.sh:build fail
error: make.sh:exit
jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ uname -a
Linux wejoy 4.18.0-13-generic #14-Ubuntu SMP Wed Dec 5 09:04:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.10
Release: 18.10
Codename: cosmic
int* gl_ptr;

void inc_p(int* p){ (*p)++; }

void co_fp0() {
    int ct = 0;
    gl_ptr = &ct;    // line 7
    aco_yield();
    check(ct);
    int* ptr = &ct;
    inc_p(ptr);      // line 11
    aco_exit();
}

void co_fp1() {
    do_sth(gl_ptr);  // line 16
    aco_exit();
}
The gl_ptr in co_fp1 (line 16) has totally different semantics from the gl_ptr at line 7 of co_fp0, and that kind of code would probably corrupt the execution stack of co_fp1. But line 11 is fine, because the variable ct and the function inc_p are in the same coroutine context. Allocating variables of that kind (ones that need to be shared with other coroutines) on the heap would simply solve such problems.
Is co_fp0 affected? ct was modified during yield.
Compared with this one: https://github.com/abbycin/tools/tree/master/coroutine
Could you please tell me the differences between my coroutine library and libaco?
Aborting is fine for debugging, but in production it is usually undesirable to crash. For example, on a failed malloc, an error code or NULL can be returned to the caller.
I am just wondering what the technique is to create a co inside a co, and whether it affects performance.
For example, I want to create a few cos inside a non-main co; what should I do,
like:
Also one more question: the documentation says to allocate local variables (especially big ones) on the heap, so what is stack usage in a function good for?
Is there a technique to calculate how much shared stack size we should allocate?
I need advice on this; I hope someone can give me some tips.
Thanks, and sorry for my bad English.
According to MacOS's IA-32 Function Calling Conventions:
The function calling conventions used in the IA-32 environment are the same as those used in the System V IA-32 ABI, with the following exceptions:
Different rules for returning structures
The stack is 16-byte aligned at the point of function calls
Large data types (larger than 4 bytes) are kept at their natural alignment
Most floating-point operations are carried out using the SSE unit instead of the x87 FPU, except when operating on long double values. (The IA-32 environment defaults to 64-bit internal precision for the x87 FPU.)
The content of this article is largely based in System V Application Binary Interface: Intel386 Architecture Processor Supplement, available at http://www.sco.com/developers/devspecs/abi386-4.pdf.
and MacOS's x86-64 Function Calling Conventions:
The OS X x86-64 function calling conventions are the same as the function calling conventions described in System V Application Binary Interface AMD64 Architecture Processor Supplement.
That means libaco already supports macOS, but there still exist some build problems due to differences in compiler toolchains between Linux and macOS.
Support for macOS will be released in v1.2.3 (coming in several days).
If a coroutine finishes without calling aco_exit, we can make it return to a function that has the same functionality as aco_exit. So why do we need aco_exit?
When I build libaco by executing the command bash make.sh -o no-m32 -o no-valgrind, I get the following errors:
test_aco_benchmark.c:106:17: error: implicit declaration of function 'clock_gettime' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));
test_aco_benchmark.c:168:31: error: use of undeclared identifier 'CLOCK_MONOTONIC'
assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));
And here is my environment:
Did I miss any key point?
How can I launch a coroutine which waits for 5 seconds and then does some operation on a data structure?
I don't want the main thread to wait; sleep(5) blocks the main thread itself.
To avoid being optimized away by the compiler, the benchmark coroutines co_fp_stksz_xx() and co_fp_alloca() use memset(). However, this failed on my machine (Ubuntu 16.04, Linux 4.15, gcc 5.4). We can use the following method to keep the intended stack usage:
void do_not_optimize(void* value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

void co_fp_stksz_128() {
    int ip[28];
    do_not_optimize(ip);
    // memset(ip, 1, sizeof(ip));
    while(1) {
        aco_yield();
    }
    aco_exit();
}
Hi!
I want to build a system with coroutines, but performance is my concern. My question is how to design the system: should I create a fixed number of coroutines at the beginning and keep reusing them, or can I simplify my design by creating a coroutine, destroying it once it is done, and creating another one? I would prefer the second approach unless it is slower.
The asm("acosw"); in aco.h needs to be __asm("acosw");