
Comments (8)

zainryan commented on July 18, 2024

Hi!
AIFM uses SIGUSR2 and SIGUSR1 to force task preemption and GC triggering, so it's expected to see that in GDB. You can suppress them using handle SIGUSR2 nostop noprint and handle SIGUSR1 nostop noprint.

Could you please rerun the experiment using GDB and show me the call stack when segfault is triggered?
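
For readers following along, here is a minimal sketch of why a debugger sees these signals at all. It is not AIFM's code: the handler, the counters, and the function name deliver_user_signals are hypothetical, and the only assumption is a POSIX system where SIGUSR1 and SIGUSR2 are available. The program installs handlers and raises both signals on itself; run under GDB, each raise stops the debugger by default, which is what the handle ... nostop noprint commands above suppress.

```cpp
#include <csignal>

// Hypothetical counters for illustration only; AIFM's real handlers
// (task preemption, GC triggering) are not shown in this thread.
static volatile std::sig_atomic_t g_usr1_seen = 0;
static volatile std::sig_atomic_t g_usr2_seen = 0;

// Async-signal-safe handler: it only updates sig_atomic_t counters.
extern "C" void on_user_signal(int signo) {
  if (signo == SIGUSR1) {
    g_usr1_seen = g_usr1_seen + 1;
  } else if (signo == SIGUSR2) {
    g_usr2_seen = g_usr2_seen + 1;
  }
}

// Installs the handlers, raises each signal once on the calling thread,
// and returns how many user signals were handled during this call.
int deliver_user_signals() {
  g_usr1_seen = 0;
  g_usr2_seen = 0;
  std::signal(SIGUSR1, on_user_signal);
  std::signal(SIGUSR2, on_user_signal);
  std::raise(SIGUSR1);  // the handler runs before raise() returns
  std::raise(SIGUSR2);
  return g_usr1_seen + g_usr2_seen;
}
```

Calling deliver_user_signals() from main and running the binary under GDB should show one stop per signal unless the two handle commands have been issued first. The signals still reach the program either way, so the handlers keep working.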

from aifm.

BinZlP commented on July 18, 2024

Thanks for your explanation!

This is the call stack at the point where the segfault occurred:

Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 7531)]
far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
434 if (!ptr->meta().is_shared()) {
(gdb) bt
#0 far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
#1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffc26cf6fe0)
at /usr/include/c++/9/bits/std_function.h:683
#2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffc26cf6fd0) at thread.cc:15
#3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
#4 0x0000000000000000 in ?? ()

Also, another problem came up: the program sometimes hangs while reading the file, like this:

Have read 72351744 bytes.
Have read 73400320 bytes.
Have read 74448896 bytes.
Have read 75497472 bytes.
Have read 76546048 bytes.
Have read 77594624 bytes.
Have read 78643200 bytes.
Have read 79691776 bytes.
Have read 80740352 bytes.
Have read 81788928 bytes.
Have read 82837504 bytes.
( ... stops here and does not proceed )

It looked like a deadlock, so I interrupted the program with Ctrl+C. This is the call stack at the SIGINT, after the program had hung:

Thread 1 "main" received signal SIGINT, Interrupt.
0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1 0x00005555555a396e in kthread_yield_to_iokernel () at runtime/kthread.c:118
#2 kthread_park (voluntary=) at runtime/kthread.c:244
#3 0x00005555555a4dff in schedule () at ./inc/runtime/preempt.h:53
#4 0x00005555555a3e80 in ?? () at runtime/sched.c:175
#5 0x0000000000000000 in ?? ()

Thanks!


BinZlP commented on July 18, 2024

Also, I tried running the test code while reducing the local cache size from 16GB down to 1GB, and the problems started appearing at 14GB. From 14GB down, the program sometimes pauses briefly while reading and then hangs at the end of the reading stage (or earlier) without moving on to the compression stage. In this case, the call stack is the same as the last one in my previous comment.

Lastly, the segfault is sometimes triggered after reading has completed. Here's the log from that case:

Have read 998244352 bytes.
Have read 999292928 bytes.
[ 7.498414] CPU 07| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 1919 times
[ 7.498425] CPU 07| <3> txq full
[ 9.267466] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 4930 times
[ 9.267480] CPU 02| <3> txq full
[ 11.240640] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 9693 times
[ 11.240655] CPU 02| <3> txq full
[ 13.067836] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 8229 times
[ 13.067850] CPU 02| <3> txq full
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 10236)]
far_memory::GCParallelMarker::slave_fn (this=, tid=) at ../../..//src/manager.cpp:434

And this is the call stack for that case:
(gdb) bt

#0 far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
#1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffbec176fe0)
at /usr/include/c++/9/bits/std_function.h:683
#2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffbec176fd0) at thread.cc:15
#3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
#4 0x0000000100000000 in ?? ()
#5 0x0000100000082f00 in ?? ()
#6 0x00007ffbf19e6f01 in ?? ()
#7 0x00007fff88011b18 in ?? ()
#8 0x000055555556ffc0 in ?? () at ../../..//inc/internal/parallel.ipp:77
#9 0x000055555556ffa0 in ?? () at /usr/include/c++/9/bits/stl_deque.h:273
#10 0x0000000000000000 in ?? ()

Thanks for your help :)


zainryan commented on July 18, 2024

Hi, thanks for your information! This looks like a bug, which is intolerable. I will try to reproduce and fix it once I get a chance. Should be soon.


zainryan commented on July 18, 2024

Hi, I just ran the fig.11a code with local_ram=14G 100 times on my CloudLab instance (the one mentioned in the README) and didn't observe a segfault or a deadlock. Maybe what you are seeing is caused by a misconfiguration, or by actual bugs that are hard to trigger on my instance. In either case, I'd be happy to help if I can ssh into your instance. You can send me an email ([email protected]).


BinZlP commented on July 18, 2024

Thanks for your kindness! Then I'll send you an email :D


zainryan commented on July 18, 2024

Hi Han, the commit above should fix everything. Feel free to reopen this issue if you find anything wrong.


