hnes / libaco

A blazing fast and lightweight C asymmetric coroutine library 💎 ⛅🚀⛅🌞

Home Page: https://libaco.org

License: Apache License 2.0

C 78.78% Assembly 6.80% Shell 14.42%
c coroutine coroutine-library high-performance lightweight

libaco's People

Contributors

bkrmendy, blakejakopovic, carterli, hnes, johnlunney, postwait, shujianqian, techbech, uael


libaco's Issues

Request: a version of aco_yield() that takes the coroutine to yield to

In the use case where I'm considering libaco, I'll have fewer than 20 standalone non-main coroutines, and the cost of switching between them is critical. The current cost of about 10 ns for acosw() fits my application, but having to go through the main coroutine first to switch to another non-main coroutine doubles the cost to about 20 ns, which becomes too much for what I'm trying to accomplish.

Is it possible to have a version of aco_yield() that takes the coroutine to yield to?

Status and Future Plans of libaco

Thank goodness that I got the chance to come back!

Sorry for this not-so-temporary "temporary" leave...

  1. From now on, I will try to fix any bug as soon as it is reported. And I am very happy to see that no bug has been reported since libaco was released, which is going quite as I expected.

  2. New features, pull requests, and the development of the next release will be postponed until I have enough spare time. Sorry about that. It may take one or two months, but I cannot make any promises. Life is full of accidents, sometimes even when we do not want them at all :-(

You have my sincere apologies and best regards. And, again, have fun :-)

resume multiple coroutines in parallel

Hello,

I currently have asynchronous callback-based code that I would like to replace with fibers.
The operation is simple: I run a parallel for loop over an event list and call the assigned callback functions.

With fibers, I would initially be on the thread holding main_co and resume the coroutines in a parallel for loop; they may yield again (they cannot be resumed again in the same loop).

Is this possible?
I suppose some synchronization is needed on main_co when we resume, yield, or exit.
I would normally have few fibers (about fifty).

Thanks

Can libaco support yield/resume across threads?

I am using the actor model: actors get scheduled and run on a thread pool.

This means a coroutine may be yielded by any thread of the pool and then resumed by any other thread of the pool.
Currently, however, libaco binds each coroutine to a "main co" and it has to be resumed on the same thread.
Is there any possibility of supporting this?

improvements to documentation

Hello, I'm sorry, but I'm really interested in this library and I don't quite understand from the diagrams what is going on.

Do you think we can work together to improve the documentation? My naivety might help to improve the documentation as I can point out what I don't understand and perhaps make a PR with improved documentation.

To start with, I don't understand how this differs from normal stackful coroutines (fibers). It looks to me like you are storing the registers in a private area but sharing a single stack between multiple coroutines. Does this mean the stack is trashed when switching between coroutines, i.e. local variables can't be used? Is this an implementation of stackless coroutines?

Integrate into our io engine?

Hi, our open source library (https://github.com/alibaba/PhotonLibOS) is based on stackful coroutines and a high-efficiency IO scheduler. It looks like your implementation is stackless; would you like to integrate this project into our IO engine?

Our basic coroutine APIs are interrupt, context_switch, create, etc.

`aco_amd64_inline_short_aligned_memcpy_test_ok` size alignment

#define aco_amd64_inline_short_aligned_memcpy_test_ok(dst, src, sz) \
    (   \
        (((uintptr_t)(src) & 0x0f) == 0) && (((uintptr_t)(dst) & 0x0f) == 0) \
        &&  \
        (((sz) & 0x0f) == 0x08) && (((sz) >> 4) >= 0) && (((sz) >> 4) <= 8) \
    )

I think it should be (((sz) & 0x0f) == 0) instead of (((sz) & 0x0f) == 0x08).

Which is better for passing an argument to a co function?

  1. Pass it through the void* arg parameter of the aco_create function, or
  2. create a global variable that stores every void* arg for the specified function, so the co function can fetch its argument from the global variable (or the global heap, maybe)?

Question: Would there be a problem if a signal handler is triggered during the call of `acosw`?

libaco/acosw.S

Line 65 in c941da6

mov ecx,DWORD PTR [esp+0x8] // to_co

As I understand it, the earlier line `lea ecx,[esp+0x4] // esp` temporarily saves the stack-top pointer in ecx to serve as a fixed reference point, since the comment above it already notes that a signal may change the value of esp. But this line then fetches to_co's address as esp+0x8; if esp had already been changed by a signal before this point, the value loaded into ecx here would no longer be to_co's address. I'm not sure whether my understanding, taken together with your comment above, is correct. Just raising it for discussion...

Question: benchmark on tcmalloc and memcpy

Hi, I am reading the code and have run some benchmarks:
https://github.com/guangqianpeng/libaco/blob/master/bench_result
I have two questions:

  1. tcmalloc improves the benchmark results. With aco_amount=1000000 and copy_stack_size=56B, the tcmalloc version achieves 37ns per aco_resume() operation while the default takes 66ns. Why? In this case, aco_resume() does not allocate memory, which is really confusing...
  2. When copying the stack, you use %xmm registers to optimize small memory copies. But according to my benchmark, this does not make much difference. I guess memcpy() already takes advantage of these registers. Do you have more benchmark results?

I will be very grateful to you for answering my questions :-)

Feature request: Ability to return a value from aco_yield()

Thanks for this great library!

I think that having something similar to the following functions:

void* aco_yield_value();
void aco_resume_value(aco_t* aco, void* val);

would be very convenient. I see two practical use cases for this feature. It allows us to create constructs similar to Python's generators, where the coroutine iterates over some data structure or computed sequence and incrementally returns values from it. It also makes it easier to use libaco with event-based frameworks like io_uring, where events have a return value that needs to be passed back to the coroutine:

// Inside coroutine...
struct io_uring_sqe* sqe;
int amt_read;
char buf[1024];
int fd;
// ...
sqe = io_uring_get_sqe(ring);
io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
io_uring_sqe_set_data(sqe, aco_get_co());
io_uring_submit(ring);
amt_read = (int) (intptr_t) aco_yield_value();
// ...

// Inside event loop...
struct io_uring_cqe* cqe;
aco_t* coro;
int result_val;
// ...
io_uring_wait_cqe(ring, &cqe);
coro = io_uring_cqe_get_data(cqe);
result_val = cqe->res;
io_uring_cqe_seen(ring, cqe);
aco_resume_value(coro, (void*) (intptr_t) result_val);

As I understand it, adding a return value would require saving/restoring an additional register. Perhaps this feature could be put behind a #define macro, so that users can choose whether to take a slight performance hit for it.

Right now, this feature can be emulated by using an externally maintained pointer to marshal the data. But this is a very janky design pattern, and I believe that it is best to avoid polluting a codebase with pointers if the necessary data can simply be passed by value.

Make an include guard unique

I find that an include guard like "ACO_H" can be too short for safe reuse of your header file (when it belongs to an application programming interface).

arm & arm64 support

arm & arm64 support is mentioned in the README's TODO section. I can't use this library until it is added :(
I didn't see an active issue tracking it, so I figured I'd open one.

Few questions about proper usage

I'm planning to use libaco in C++, but I need some clarifications about those 3 scenarios, whether they are not UB and will work correctly.

  1. Creating a coroutine with a non-main coroutine as the "main"/parent parameter. When I use aco_create to create non-main co A, and then pass the resulting coroutine into another aco_create call as the main co parameter to create B, it seems to work fine; i.e. B yields to A, then A can yield to main. Is this valid, though?
  2. Can there be 2 main coroutines in a single thread?
  3. Can aco_exit() be called from a C++ destructor? I've noticed that aco_exit() causes C++ destructors in a coroutine NOT to fire, which is a problem. As a workaround I found that creating a dummy object at the beginning of the function, whose destructor calls aco_exit(), lets the other destructors fire before aco_exit, and it works even with a return statement, without triggering the "last word" crash.

build failed from the latest release

Below is the output from my shell:

jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ bash make.sh -o no-m32
OUTPUT_DIR:       ./output
CFLAGS:           -g -O2 -Wall -Werror
EXTRA_CFLAGS:     
ACO_EXTRA_CFLAGS: 
OUTPUT_SUFFIX:    ..no_valgrind.standaloneFPUenv
        cc -g -O2 -Wall -Werror   acosw.S aco.c test_aco_tutorial_0.c  -o ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv
/usr/bin/ld: cannot open output file ./output/test_aco_tutorial_0..no_valgrind.standaloneFPUenv: No such file or directory
collect2: error: ld returned 1 exit status
error: make.sh:build fail
error: make.sh:exit

jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ uname -a
Linux wejoy 4.18.0-13-generic #14-Ubuntu SMP Wed Dec 5 09:04:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
jiangyunfan@wejoy:~/Downloads/libaco-1.2.4$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.10
Release:        18.10
Codename:       cosmic

Is this description incorrect?

int* gl_ptr;
void inc_p(int* p){ (*p)++; }
void co_fp0() {
    int ct = 0;
    gl_ptr = &ct; // line 7
    aco_yield();
    check(ct);
    int* ptr = &ct;
    inc_p(ptr);   // line 11
    aco_exit();
}

void co_fp1() {
    do_sth(gl_ptr); // line 16
    aco_exit();
}
  1. In the above code snippet, we assume that co_fp0 and co_fp1 share the same share stack (they are both non-main co) and that their running sequence is "co_fp0 -> co_fp1 -> co_fp0". Since they share the same stack, the address held in gl_ptr in co_fp1 (line 16) has totally different semantics from the gl_ptr set at line 7 of co_fp0, and that kind of code would probably corrupt the execution stack of co_fp1. Line 11 is fine, though, because the variable ct and the function inc_p are in the same coroutine context. Allocating such variables (those that need to be shared with other coroutines) on the heap would simply solve the problem.

Is co_fp0 affected? ct was modified during the yield.

Is it worth creating a coroutine inside a coroutine?

I'm just wondering what the technique is for creating a co inside another co, and whether it affects performance.

For example, I want to create a few coroutines inside a non-main co. What should I do? For instance:

  1. Should the share stack be stored on the global heap instead of in a local variable of the co?
  2. How much stack size should the main share stack and the share stack inside the non-main co function have?

One more question: the documentation says to allocate local variables (especially big ones) on the heap, so what is the stack in a function still good for?
Is there a technique to calculate how large the share stack should be?

I need advice on this; I hope someone can give me some tips.

Thanks, and sorry for my bad English.

Proposal: Add macOS support

According to MacOS's IA-32 Function Calling Conventions:

The function calling conventions used in the IA-32 environment are the same as those used in the System V IA-32 ABI, with the following exceptions:

  • Different rules for returning structures

  • The stack is 16-byte aligned at the point of function calls

  • Large data types (larger than 4 bytes) are kept at their natural alignment

  • Most floating-point operations are carried out using the SSE unit instead of the x87 FPU, except when operating on long double values. (The IA-32 environment defaults to 64-bit internal precision for the x87 FPU.)

The content of this article is largely based in System V Application Binary Interface: Intel386 Architecture Processor Supplement, available at http://www.sco.com/developers/devspecs/abi386-4.pdf.

and MacOS's x86-64 Function Calling Conventions:

The OS X x86-64 function calling conventions are the same as the function calling conventions described in System V Application Binary Interface AMD64 Architecture Processor Supplement.

That means libaco already supports macOS, but some build problems still exist due to differences in compiler toolchains between Linux and macOS.

Support for macOS will be released in v1.2.3 (coming in several days).

is aco_exit necessary?

If a coroutine finishes without calling aco_exit, we could make it return into a function that has the same functionality as aco_exit. So why do we need aco_exit?

Fail to build libaco

When I build libaco by executing the command bash make.sh -o no-m32 -o no-valgrind, I get this error:

test_aco_benchmark.c:106:17: error: implicit declaration of function 'clock_gettime' is invalid in C99
      [-Werror,-Wimplicit-function-declaration]
    assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));
test_aco_benchmark.c:168:31: error: use of undeclared identifier 'CLOCK_MONOTONIC'
    assert(0 == clock_gettime(CLOCK_MONOTONIC, &tstart));

And here is my environment:

  • OS: Mac OS X El Capitan
  • make: GNU Make 4.2.1
  • compiler: Apple LLVM version 8.0.0 (clang-800.0.42.1)

Am I missing some key point?

Proposal: the stack size of the benchmark coroutines co_fp_stksz_xx() is optimized away by the compiler

To avoid being optimized away by the compiler, the benchmark coroutines co_fp_stksz_xx() and co_fp_alloca() use memset(). However, this failed on my machine (Ubuntu 16.04, Linux 4.15, gcc 5.4). We can use the following method to preserve the intended stack size:

void do_not_optimize(void* value) {
  asm volatile("" : : "r,m"(value) : "memory");
}
void co_fp_stksz_128() {
    int ip[28];
    do_not_optimize(ip);
    // memset(ip, 1, sizeof(ip));
    while(1){
        aco_yield();
    }
    aco_exit();
}

Question: which is faster, creating new coroutines or reusing old coroutines

Hi!

I want to build a system with coroutines, but performance is my concern. My question is how to design the system: should I create a fixed number of coroutines at the beginning and keep reusing them, or can I simplify the design by creating a coroutine, destroying it once it is done, and creating another one? I would prefer the second solution unless it is slower.
