hnes / libaco

A blazing fast and lightweight C asymmetric coroutine library 💎 ⛅🚀⛅🌞

Home Page: https://libaco.org

License: Apache License 2.0

C 78.78% Assembly 6.80% Shell 14.42%
c coroutine coroutine-library high-performance lightweight

libaco's People

Contributors

bkrmendy, blakejakopovic, carterli, hnes, johnlunney, postwait, shujianqian, techbech, uael


libaco's Issues

Request: a version of aco_yield() that takes the coroutine to yield to

In the use case where I'm considering libaco, I'll have fewer than 20 standalone non-main coroutines, and the cost of switching between them is critical. The current cost of about 10 ns for acosw() fits my application, but having to go through the main coroutine first to switch to another non-main coroutine doubles the cost to about 20 ns, which becomes too much for what I'm trying to accomplish.

Is it possible to have a version of aco_yield() that takes the coroutine to yield to?

Status and Future Plans of libaco

Thank goodness that I got the chance to come back!

Sorry for this not-so-temporary "temporary" leave...

  1. From now on, I will try to fix any bug as soon as it is reported. And I am very happy to see that no bug has been reported since libaco was released, which is going quite as I expected.

  2. New features, pull requests, and the development of the next release will be postponed until I have enough spare time. Sorry about that. It may take one or two months, but I cannot make any promises. Life is full of accidents, sometimes even when we do not want them at all :-(

You have my sincere apologies and best regards. And, again, have fun :-)

resume multiple coroutines in parallel

Hello,

I currently have asynchronous callback-based code that I would like to replace with fibers.
The operation is simple: I run a parallel for loop over an event list and call the assigned callback functions.

With fibers, I would initially be on the thread holding main_co and resume the coroutines in a parallel for loop; they may yield again (they cannot be resumed again in the same loop).

Is this possible?
I suppose some synchronization is needed on main_co when we resume, yield, or exit.
I would normally have few fibers (about fifty).

Thanks

Can libaco support yield/resume across threads?

I am using the actor model: actors get scheduled and run on a thread pool.

This means a coroutine may be yielded by any thread of the pool and then resumed by any other thread of the pool.
Currently, however, libaco binds each coroutine to a "main co" and it has to be resumed on the same thread.
Is there any possibility of supporting this?

improvements to documentation

Hello, I'm sorry, but I'm really interested in this library and I don't quite understand from the diagrams what is going on.

Do you think we can work together to improve the documentation? My naivety might help to improve the documentation as I can point out what I don't understand and perhaps make a PR with improved documentation.

To start with, I don't understand how this differs from normal stackful coroutines (fibers). It looks to me like you are storing the registers in a private area but sharing a single stack between multiple coroutines. Does this mean the stack is trashed when switching between coroutines, i.e. local variables can't be used? Is this an implementation of stackless coroutines?

Integrate into our io engine?

Hi, our open source library (https://github.com/alibaba/PhotonLibOS) is based on stackful coroutines and a high-efficiency IO scheduler. It looks like your implementation is stackless; would you like to integrate this project into our IO engine?

Our basic coroutine APIs are interrupt, context_switch, create, etc.

`aco_amd64_inline_short_aligned_memcpy_test_ok` size alignment

#define aco_amd64_inline_short_aligned_memcpy_test_ok(dst, src, sz) \
    (   \
        (((uintptr_t)(src) & 0x0f) == 0) && (((uintptr_t)(dst) & 0x0f) == 0) \
        &&  \
        (((sz) & 0x0f) == 0x08) && (((sz) >> 4) >= 0) && (((sz) >> 4) <= 8) \
    )

I think it should be (((sz) & 0x0f) == 0) instead of (((sz) & 0x0f) == 0x08).

Which is better for passing an argument to a co function?

  1. Pass it through the void* arg parameter of the aco_create function, or
  2. create a global variable that stores every void* arg for the specified function, so the co function can fetch its argument from the global variable (or the global heap, maybe)?

Question: Would there be a problem if a signal handler is triggered during the call of `acosw`?

libaco/acosw.S

Line 65 in c941da6

mov ecx,DWORD PTR [esp+0x8] // to_co

As I understand it, the earlier line `lea ecx,[esp+0x4] // esp` temporarily saves the stack-top pointer in ecx to serve as a fixed reference point, since the comment above it already notes that a signal may change the value of esp. But this line then fetches to_co's address as esp+0x8; if esp had already been changed by a signal before this point, the value loaded into ecx here would no longer be to_co's address. I'm not sure whether my understanding, taken together with your comment above, is correct. Just raising it for discussion...

Question: benchmark on tcmalloc and memcpy

Hi, I am reading the code and have run some benchmarks:
https://github.com/guangqianpeng/libaco/blob/master/bench_result
I have two questions:

  1. tcmalloc improves the benchmark results. With aco_amount=1000000 and copy_stack_size=56B, the tcmalloc version achieves 37ns per aco_resume() operation while the default takes 66ns. Why? In this case, aco_resume() does not allocate memory, which is really confusing...
  2. When copying the stack, you use %xmm registers to optimize small memory copies. But according to my benchmark, this does not make much difference. I guess memcpy() already takes advantage of these registers. Do you have more benchmark results?

I will be very grateful to you for answering my questions :-)

Feature request: Ability to return a value from aco_yield()

Thanks for this great library!

I think that having something similar to the following functions:

void* aco_yield_value();
void aco_resume_value(aco_t* aco, void* val);

would be very convenient. I see two practical use cases for this feature. It allows us to create constructs similar to Python's generators, where the coroutine iterates over some data structure or computed sequence and incrementally returns values from it. It also makes it easier to use libaco with event-based frameworks like io_uring, where events have a return value that needs to be passed back to the coroutine:

// Inside coroutine...
struct io_uring_sqe* sqe;
int amt_read;
char buf[1024];
int fd;
// ...
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
io_uring_sqe_set_data(sqe, aco_get_co());
io_uring_submit(ring);
amt_read = (int) (intptr_t) aco_yield_value();
// ...

// Inside event loop...
struct io_uring_cqe* cqe;
aco_t* coro;
int result_val;
// ...
io_uring_wait_cqe(ring, &cqe);
coro = io_uring_cqe_get_data(cqe);
result_val = cqe->res;
io_uring_cqe_seen(ring, cqe);
aco_resume_value(coro, (void*) (intptr_t) result_val);

As I understand it, adding a return value would require saving/restoring an additional register. Perhaps this feature could be put behind a #define macro, so that users can choose whether to take a slight performance hit for it.

Right now, this feature can be emulated by using an externally maintained pointer to marshal the data. But this is a very janky design pattern, and I believe that it is best to avoid polluting a codebase with pointers if the necessary data can simply be passed by value.

Make an include guard unique

I find that an include guard like "ACO_H" can be too short for safe reuse of your header file (when it belongs to an application programming interface).

arm & arm64 support

arm & arm64 support is mentioned in the README's TODO section. I can't use this library until it is added :(
I didn't see an active issue tracking it, so I figured I'd open one.

Few questions about proper usage

I'm planning to use libaco in C++, but I need some clarifications about those 3 scenarios, whether they are not UB and will work correctly.

  1. Creating a coroutine with a non-main coroutine as the "main"/parent parameter. When I use aco_create to create non-main co A, and then pass the resulting coroutine into another aco_create call as the main co parameter to create B, it seems to work fine; i.e. B yields to A, then A can yield to main. Is this valid, though?
  2. Can there be 2 main coroutines in a single thread?
  3. Can aco_exit() be called from a C++ destructor? I've noticed that aco_exit() causes C++ destructors in a coroutine NOT to fire, which is a problem. As a workaround I found that creating a dummy object at the beginning of the function, whose destructor calls aco_exit(), lets the other destructors fire before aco_exit, and it works even with a return statement, without triggering the "last word" crash.

build failed from the latest release

Below is the output from my shell:

jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ bash make.sh -o no-m32
OUTPUT_DIR:       ./output
CFLAGS:           -g -O2 -Wall -Werror
EXTRA_CFLAGS:     
ACO_EXTRA_CFLAGS: 
OUTPUT_SUFFIX:    ..no_valgrind.standaloneFPUenv
        cc -g -O2 -Wall -Werror   acosw.S aco.c test_aco_tutorial_0.c  -o ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv
/usr/bin/ld: cannot open output file ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv: No such file or directory
collect2: error: ld returned 1 exit status
error: make.sh:build fail
error: make.sh:exit

jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ uname -a
Linux wejoy 4.18.0-13-generic #14-Ubuntu SMP Wed Dec 5 09:04:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.10
Release:        18.10
Codename:       cosmic

Is this description incorrect?

int* gl_ptr;
void inc_p(int* p){ (*p)++; }
void co_fp0() {
    int ct = 0;
    gl_ptr = &ct; // line 7
    aco_yield();
    check(ct);
    int* ptr = &ct;
    inc_p(ptr);   // line 11
    aco_exit();
}

void co_fp1() {
    do_sth(gl_ptr); // line 16
    aco_exit();
}
  1. In the above code snippet, we assume that co_fp0 and co_fp1 share the same share stack (they are both non-main co) and that their running sequence is "co_fp0 -> co_fp1 -> co_fp0". Since they share the same stack, the address held in gl_ptr in co_fp1 (line 16) has totally different semantics from the gl_ptr set at line 7 of co_fp0, and that kind of code would probably corrupt the execution stack of co_fp1. Line 11 is fine, though, because the variable ct and the function inc_p are in the same coroutine context. Allocating such variables (those that need to be shared with other coroutines) on the heap would simply solve the problem.

Is co_fp0 affected? ct was modified during the yield.

Is it worth creating a coroutine inside a coroutine?

I'm just wondering what the technique is for creating a co inside another co, and whether it affects performance.

For example, I want to create a few coroutines inside a non-main co. What should I do? For instance:

  1. Should the share stack be stored on the global heap instead of in a local variable of the co?
  2. How much stack size should the main share stack and the share stack inside the non-main co function have?

One more question: the documentation says to allocate local variables (especially big ones) on the heap, so what is the stack in a function still good for?
Is there a technique to calculate how large the share stack should be?

I need advice on this; I hope someone can give me some tips.

Thanks, and sorry for my bad English.

Proposal: Add macOS support

According to MacOS's IA-32 Function Calling Conventions:

The function calling conventions used in the IA-32 environment are the same as those used in the System V IA-32 ABI, with the following exceptions:

  • Different rules for returning structures

  • The stack is 16-byte aligned at the point of function calls

  • Large data types (larger than 4 bytes) are kept at their natural alignment

  • Most floating-point operations are carried out using the SSE unit instead of the x87 FPU, except when operating on long double values. (The IA-32 environment defaults to 64-bit internal precision for the x87 FPU.)

The content of this article is largely based in System V Application Binary Interface: Intel386 Architecture Processor Supplement, available at http://www.sco.com/developers/devspecs/abi386-4.pdf.

and MacOS's x86-64 Function Calling Conventions:

The OS X x86-64 function calling conventions are the same as the function calling conventions described in System V Application Binary Interface AMD64 Architecture Processor Supplement.

That means libaco already supports macOS, but some build problems still exist due to differences in compiler toolchains between Linux and macOS.

Support for macOS will be released in v1.2.3 (coming in several days).

is aco_exit necessary?

If a coroutine finishes without calling aco_exit, we could make it return into a function that has the same functionality as aco_exit. So why do we need aco_exit?

Fail to build libaco

When I build libaco by executing the command bash make.sh -o no-m32 -o no-valgrind, I get this error:

test_aco_benchmark.c:106:17: error: implicit declaration of function 'clock_gettime' is invalid in C99
      [-Werror,-Wimplicit-function-declaration]
    assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));
test_aco_benchmark.c:168:31: error: use of undeclared identifier 'CLOCK_MONOTONIC'
    assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));

And here is my environment:

  • OS: Mac OS X El Capitan
  • make: GNU Make 4.2.1
  • compiler: Apple LLVM version 8.0.0 (clang-800.0.42.1)

Am I missing some key point?

Proposal: the stack size of the benchmark coroutines co_fp_stksz_xx() is optimized away by the compiler

To avoid being optimized away by the compiler, the benchmark coroutines co_fp_stksz_xx() and co_fp_alloca() use memset(). However, this failed on my machine (Ubuntu 16.04, Linux 4.15, gcc 5.4). We can use the following method to preserve the intended stack size:

void do_not_optimize(void* value) {
  asm volatile("" : : "r,m"(value) : "memory");
}
void co_fp_stksz_128() {
    int ip[28];
    do_not_optimize(ip);
    // memset(ip, 1, sizeof(ip));
    while(1){
        aco_yield();
    }
    aco_exit();
}

Question: which is faster, creating new coroutines or reusing old coroutines

Hi!

I want to build a system with coroutines, but performance is my concern. My question is how to design the system: should I create a fixed number of coroutines at the beginning and keep reusing them, or can I simplify the design by creating a coroutine, destroying it once it is done, and creating another one? I would prefer the second solution unless it is slower.
