Comments (9)
Oh, some additional context that may help: you can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly (though it's possible this is just a placebo ritual), when I encountered it in real experiments it seemed more likely to occur when starting multiple experiments at the same time. I stopped doing that, letting each experiment finish bootstrapping before starting the next, and the problem seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.
from shadow.
Thanks for the report! This is likely the same as #3266 (comment).
Shadow uses `RootedRc` to add `Send`/`Sync` non-atomic reference counting to objects. A detail of the implementation is that you cannot simply drop these `RootedRc` objects; instead you need to call a method on them (`fn explicit_drop(self, root: &Self::ExplicitDropParam)`) to free them. If you don't call this, it's a resource leak (the `T` in `RootedRc<T>` is never dropped). Since Rust doesn't have any way to prevent dropping objects at compile time, `RootedRc` uses a run-time check that panics in debug builds if the `RootedRc` is dropped. So the panics should only occur in debug builds; in release builds it should log an error and continue (leaking the resource).
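The explicit-drop pattern described above can be sketched in plain Rust. This is a minimal illustration with a hypothetical `ExplicitRc` type, not Shadow's actual `RootedRc` API (which additionally ties handles to a `Root` for thread-safety):

```rust
use std::rc::Rc;

// Hypothetical `ExplicitRc`: a handle that must be freed via explicit_drop().
// Dropping it implicitly panics in debug builds and leaks (with a logged
// error) in release builds, mirroring the behavior described above.
struct ExplicitRc<T> {
    inner: Option<Rc<T>>, // becomes None once explicit_drop() has run
}

impl<T> ExplicitRc<T> {
    fn new(value: T) -> Self {
        ExplicitRc { inner: Some(Rc::new(value)) }
    }

    // The only sanctioned way to release the handle.
    fn explicit_drop(mut self) {
        drop(self.inner.take()); // release the Rc; Drop below sees None
    }
}

impl<T> Drop for ExplicitRc<T> {
    fn drop(&mut self) {
        if let Some(rc) = self.inner.take() {
            if cfg!(debug_assertions) {
                panic!("ExplicitRc dropped without explicit_drop()");
            }
            eprintln!("error: ExplicitRc leaked; its T will never be dropped");
            std::mem::forget(rc); // leak rather than run T's destructor
        }
    }
}

fn main() {
    let handle = ExplicitRc::new(vec![1, 2, 3]);
    handle.explicit_drop(); // fine: no panic, no leak
}
```

Because Rust has no compile-time "must not drop" guarantee, the run-time check in `Drop` is the best available enforcement point.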
This specific error is probably because `clone_internal` is returning early without calling `explicit_drop` on the new `RootedRc` that it creates. There are a few reasons it could return early, such as the application providing invalid flags to `clone()`. It would be interesting to know whether `clone` is called differently by the application when this occurs, but the real bug here is not calling `explicit_drop` before returning.
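The bug class being described can be sketched as follows (hypothetical names; this is not Shadow's actual `clone_internal`, and the guard type stands in for `RootedRc`):

```rust
// A stand-in for RootedRc: a handle that must be explicitly released.
struct MustRelease(Option<String>);

impl MustRelease {
    fn new(v: &str) -> Self { MustRelease(Some(v.to_string())) }
    fn release(mut self) { let _ = self.0.take(); } // stand-in for explicit_drop
}

impl Drop for MustRelease {
    fn drop(&mut self) {
        // Shadow panics here in debug builds; the debug_assert mimics that.
        debug_assert!(self.0.is_none(), "dropped without release()");
    }
}

// Buggy shape: the early return implicitly drops `child`, tripping the
// debug check (mirroring clone_internal returning early on bad flags).
fn clone_buggy(flags_ok: bool) -> Result<(), &'static str> {
    let child = MustRelease::new("child task");
    if !flags_ok {
        return Err("invalid flags"); // BUG: `child` dropped implicitly here
    }
    child.release();
    Ok(())
}

// Fixed shape: release the handle on every exit path before returning.
fn clone_fixed(flags_ok: bool) -> Result<(), &'static str> {
    let child = MustRelease::new("child task");
    if !flags_ok {
        child.release(); // free the handle before the early return
        return Err("invalid flags");
    }
    child.release();
    Ok(())
}

fn main() {
    assert_eq!(clone_fixed(false), Err("invalid flags")); // error path, no leak
    assert!(clone_fixed(true).is_ok());
}
```

The fix is simply to make every error path release the handle before returning, so the error is still reported without tripping the drop check.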
You're right that it continues when Shadow is compiled without debug mode, but I don't see any warnings that would explain an early return. According to shadow's strace logging, the `clone3` flags are the same in the error and non-error case (88, no idea which flags those are). Edit: Ah, the second field of `clone3` is a size, not the flags, so disregard that.
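For reference, the second argument of `clone3(2)` is the size of the `clone_args` struct the caller passes, while the flags are the struct's first field. A `repr(C)` sketch of that layout (fields as documented in the `clone3` man page) shows that 88 bytes corresponds to a struct including all fields up through `cgroup`, i.e. `CLONE_ARGS_SIZE_VER2`:

```rust
// Layout of the kernel's clone_args struct, per the clone3(2) man page.
// The syscall is clone3(struct clone_args *cl_args, size_t size): the
// CLONE_* flag bits live in the first field; the second syscall argument
// is just the struct size the caller is using.
#[repr(C)]
#[allow(dead_code)]
struct CloneArgs {
    flags: u64, // CLONE_* flag bits
    pidfd: u64,
    child_tid: u64,
    parent_tid: u64,
    exit_signal: u64,
    stack: u64,
    stack_size: u64,
    tls: u64,          // through here: CLONE_ARGS_SIZE_VER0 (64 bytes)
    set_tid: u64,
    set_tid_size: u64, // through here: CLONE_ARGS_SIZE_VER1 (80 bytes)
    cgroup: u64,       // through here: CLONE_ARGS_SIZE_VER2 (88 bytes)
}

fn main() {
    // The "88" in the log above is consistent with a VER2-sized struct.
    assert_eq!(std::mem::size_of::<CloneArgs>(), 88);
}
```

So the 88 seen in the log was the struct size, not a flags value, consistent with the correction above.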
Taking a look at this...
This should be fixed now, though chances are the clone call that was causing a crash before will now return an error, so the sim might still not work as desired. I also updated the strace logging to log the contents of the clone_args struct, including the flags, which might make it a bit easier to dig in further.
Tried to reproduce to verify the fix. When run natively I run out of memory (the panic output below is interleaved because multiple threads panic at once):

```
$ cargo run --release
   Compiling proc-macro2 v1.0.78
   Compiling unicode-ident v1.0.12
   Compiling libc v0.2.153
   Compiling pin-project-lite v0.2.13
   Compiling quote v1.0.35
   Compiling syn v2.0.48
   Compiling num_cpus v1.16.0
   Compiling tokio-macros v2.2.0
   Compiling tokio v1.35.1
   Compiling shadow-bug v0.1.0 (/home/jnewsome/tmp/actions-sandbox)
    Finished release [optimized] target(s) in 4.04s
     Running `target/release/shadow-bug`
thread '<unnamed>' panicked at library/std/src/sys/unix/stack_overflow.rs:147:13:
failed to set up alternative stack guard page: Cannot allocate memory (os error 12)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at thread 'library/std/src/sys/unix/stack_overflow.rs<unnamed>:' panicked at 143library/std/src/sys/unix/stack_overflow.rs::13143:
thread ':failed to allocate an alternative stack: Cannot allocate memory (os error 12)<unnamed>13
' panicked at :
library/core/src/panicking.rs:failed to allocate an alternative stack: Cannot allocate memory (os error 12)126
:5:
panic in a function that cannot unwind
thread '<unnamed>thread 'stack backtrace:
' panicked at <unnamed>library/core/src/panicking.rs' panicked at :library/core/src/panicking.rs126::1265::
5panic in a function that cannot unwind:
panic in a function that cannot unwind
   0: 0x55f615e51e5c - <unknown>
   1: 0x55f615e72a8c - <unknown>
   2: 0x55f615e4f74e - <unknown>
   3: 0x55f615e51c44 - <unknown>
   4: 0x55f615e532f3 - <unknown>
   5: 0x55f615e5300c - <unknown>
   6: 0x55f615e53879 - <unknown>
   7: 0x55f615e53731 - <unknown>
   8: 0x55f615e52386 - <unknown>
   9: 0x55f615e534c2 - <unknown>
  10: 0x55f615e190c3 - <unknown>
  11: 0x55f615e19167 - <unknown>
  12: 0x55f615e191f3 - <unknown>
  13: 0x55f615e55f42 - <unknown>
  14: 0x7f0357e94ac3 - <unknown>
  15: 0x7f0357f26850 - <unknown>
  16: 0x0 - <unknown>
thread caused non-unwinding panic. aborting.
stack backtrace:
   0: 0x55f615e51e5c - <unknown>
   1: 0x55f615e72a8c - <unknown>
   2: 0x55f615e4f74e - <unknown>
   3: 0x55f615e51c44Aborted (core dumped)
```
If I change the thread count from 20000 to 2000, the simulation works even before the fix, on my host machine (Ubuntu 22.04). I'll try again in a debian 12 container; this is probably a glibc difference resulting in different clone flags.
Reopening while I verify the fix...
I'm getting the same behavior in my debian 12 container.
> Oh, some additional context that may help: You can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly, it's possible this is just a placebo ritual, but when I encountered it in real experiments, it seemed more likely to occur when starting multiple experiments at the same time, so I stopped doing that, and let each experiment finish bootstrapping before starting the next, and it seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.
Ah, yeah, maybe we need to be somewhat resource-constrained to trigger the bug, but not so constrained that we can't allocate all the thread stacks, heh.
I'll close again for now but feel free to reopen if it seems not fixed.
Thanks for looking into this. The results are now the same for me irrespective of thread count (i.e., the simulation successfully completes with a few seemingly harmless warnings). Is this an indication that everything is fine, or is it possible some error is now getting buried?
Strange. It might be worth checking the strace log - based on the past behavior, I'd expect to see some `clone` or `clone3` syscalls returning an error where shadow was previously crashing. I suppose the questions are whether tokio deals with that gracefully, and why these `clone`/`clone3` syscalls fail only when there are very many threads.
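When reading through the strace log, a small helper can decode a raw flags word into `CLONE_*` names. This is a sketch using the stable flag values from `linux/sched.h` (hardcoded here rather than pulled from the `libc` crate; it is not part of Shadow):

```rust
// Decode a clone/clone3 flags word into the names of the set CLONE_* bits.
// Flag values are the stable kernel ABI constants from linux/sched.h.
fn decode_clone_flags(flags: u64) -> Vec<&'static str> {
    const FLAGS: &[(u64, &str)] = &[
        (0x0000_0100, "CLONE_VM"),
        (0x0000_0200, "CLONE_FS"),
        (0x0000_0400, "CLONE_FILES"),
        (0x0000_0800, "CLONE_SIGHAND"),
        (0x0000_4000, "CLONE_VFORK"),
        (0x0001_0000, "CLONE_THREAD"),
        (0x0004_0000, "CLONE_SYSVSEM"),
        (0x0008_0000, "CLONE_SETTLS"),
        (0x0010_0000, "CLONE_PARENT_SETTID"),
        (0x0020_0000, "CLONE_CHILD_CLEARTID"),
    ];
    FLAGS
        .iter()
        .filter(|&&(bit, _)| flags & bit != 0)
        .map(|&(_, name)| name)
        .collect()
}

fn main() {
    // Example: the flag combination a pthread-style thread create uses
    // typically includes at least CLONE_VM and CLONE_THREAD.
    println!("{:?}", decode_clone_flags(0x0001_0100));
}
```

Comparing the decoded names between the succeeding and failing calls should make it clearer whether tokio (or glibc) is actually issuing different flags when very many threads exist.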