Comments (8)
Hi!
AIFM uses SIGUSR2 and SIGUSR1 to force task preemption and to trigger GC, so it's expected to see them in GDB. You can suppress them using `handle SIGUSR2 nostop noprint` and `handle SIGUSR1 nostop noprint`.
Could you please rerun the experiment under GDB and show me the call stack when the segfault is triggered?
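The two `handle` commands above can also be kept in a GDB command file so they don't have to be retyped every session. This is only a sketch: the file name `sigusr.gdb` is a placeholder and not something from the AIFM repo. `pass` is already GDB's default for these signals and is spelled out only to make it explicit that the program still receives them.

```gdb
# sigusr.gdb (hypothetical file name)
# Don't stop or print on the signals the runtime uses internally,
# but still deliver them to the program.
handle SIGUSR1 nostop noprint pass
handle SIGUSR2 nostop noprint pass
```

It can then be loaded with `gdb -x sigusr.gdb --args <your binary and its arguments>`. SIGSEGV is left at its default handling, so GDB still stops at the segfault and `bt` works as usual.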
from aifm.
Thanks for your explanation!
This is the call stack at the point where the segfault occurred:
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 7531)]
far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
434 if (!ptr->meta().is_shared()) {
(gdb) bt
#0 far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
#1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffc26cf6fe0)
at /usr/include/c++/9/bits/std_function.h:683
#2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffc26cf6fd0) at thread.cc:15
#3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
#4 0x0000000000000000 in ?? ()
Another problem also came up: the program sometimes hangs while reading the file, as shown below:
Have read 72351744 bytes.
Have read 73400320 bytes.
Have read 74448896 bytes.
Have read 75497472 bytes.
Have read 76546048 bytes.
Have read 77594624 bytes.
Have read 78643200 bytes.
Have read 79691776 bytes.
Have read 80740352 bytes.
Have read 81788928 bytes.
Have read 82837504 bytes.
(... the program stops here and does not proceed)
It looked like a deadlock, so I interrupted the program with Ctrl+C. This is the call stack at the SIGINT, after the program had hung:
Thread 1 "main" received signal SIGINT, Interrupt.
0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 0x00007ffff6e58317 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1 0x00005555555a396e in kthread_yield_to_iokernel () at runtime/kthread.c:118
#2 kthread_park (voluntary=) at runtime/kthread.c:244
#3 0x00005555555a4dff in schedule () at ./inc/runtime/preempt.h:53
#4 0x00005555555a3e80 in ?? () at runtime/sched.c:175
#5 0x0000000000000000 in ?? ()
Thanks!
Also, I tried running the test code while reducing the local cache size step by step from 16GB down to 1GB, and the problems started to appear at 14GB. From 14GB on, the program sometimes pauses briefly while reading, and sometimes hangs at the end of the reading stage (or earlier) without moving on to the compression stage. In this case, the call stack is the same as the last one in my previous comment.
Lastly, a segfault is sometimes triggered after the reading has completed. Here is the log for that case:
Have read 998244352 bytes.
Have read 999292928 bytes.
[ 7.498414] CPU 07| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 1919 times
[ 7.498425] CPU 07| <3> txq full
[ 9.267466] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 4930 times
[ 9.267480] CPU 02| <3> txq full
[ 11.240640] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 9693 times
[ 11.240655] CPU 02| <3> txq full
[ 13.067836] CPU 02| <3> runtime/net/directpath/mlx5/mlx5_rxtx.c:105 mlx5_transmit_one() suppressed 8229 times
[ 13.067850] CPU 02| <3> txq full
Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff954fe700 (LWP 10236)]
far_memory::GCParallelMarker::slave_fn (this=, tid=) at ../../..//src/manager.cpp:434
And this is the call stack for that case:
(gdb) bt
#0 far_memory::GCParallelMarker::slave_fn (this=, tid=)
at ../../..//src/manager.cpp:434
#1 0x00005555555a1786 in std::function<void ()>::operator()() const (this=0x7ffbec176fe0)
at /usr/include/c++/9/bits/std_function.h:683
#2 rt::thread_internal::ThreadTrampolineWithJoin (arg=0x7ffbec176fd0) at thread.cc:15
#3 0x00005555555a3dd0 in ?? () at runtime/sched.c:128
#4 0x0000000100000000 in ?? ()
#5 0x0000100000082f00 in ?? ()
#6 0x00007ffbf19e6f01 in ?? ()
#7 0x00007fff88011b18 in ?? ()
#8 0x000055555556ffc0 in ?? () at ../../..//inc/internal/parallel.ipp:77
#9 0x000055555556ffa0 in ?? () at /usr/include/c++/9/bits/stl_deque.h:273
#10 0x0000000000000000 in ?? ()
Thanks for your help :)
Hi, thanks for your information! This looks like a bug, which is intolerable. I will try to reproduce and fix it once I get a chance. Should be soon.
Hi, I just ran the fig.11a code with local_ram=14G 100 times on my CloudLab instance (the one mentioned in the README) and didn't observe a segfault or a deadlock. Maybe what you are facing is caused by some misconfiguration, or by actual bugs that are hard to trigger on my instance. In either case, I'd be happy to help you if I'm able to ssh into your instance. You can send me an email ([email protected]).
Thanks for your kindness! Then I'll send you an email :D
Hi Han, the commit above should fix everything. Feel free to reopen this issue if you find anything wrong.