Describe the issue I can consistently crash shadow by running any

crash: rootedcell "Dropped without calling `safely_drop`" about shadow HOT 9 CLOSED

jtracey commented on June 6, 2024

crash: rootedcell "Dropped without calling `safely_drop`"

from shadow.

Comments (9)

jtracey commented on June 6, 2024

Oh, some additional context that may help: You can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly, it's possible this is just a placebo ritual, but when I encountered it in real experiments, it seemed more likely to occur when starting multiple experiments at the same time, so I stopped doing that, and let each experiment finish bootstrapping before starting the next, and it seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.

stevenengler commented on June 6, 2024

Thanks for the report! This is likely the same as #3266 (comment).

Shadow uses `RootedRc` to add `Send`/`Sync` non-atomic reference counting to objects. Because of how it is implemented, you cannot simply drop a `RootedRc`; you must instead call a method on it (`fn explicit_drop(self, root: &Self::ExplicitDropParam)`) to free it. If you don't call this method, the resource leaks (the `T` in `RootedRc<T>` is never dropped). Since Rust has no way to prevent a value from being dropped at compile time, `RootedRc` uses a run-time check that panics in debug builds if the `RootedRc` is dropped. So the panics should only occur in debug builds. In release builds it should log an error and continue, leaking the resource.
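As a rough illustration of the run-time check described above, here is a minimal sketch. It is not Shadow's `RootedRc`: the names `MustExplicitDrop` and `Root` are hypothetical, the reference counting is left out entirely, and the `std::thread::panicking()` guard is an addition of this sketch to avoid a double panic.

```rust
/// Hypothetical stand-in for the "root" parameter passed to `explicit_drop`.
pub struct Root;

/// Hypothetical wrapper whose inner value must be released explicitly.
/// This only illustrates the run-time check; it is not Shadow's `RootedRc`.
pub struct MustExplicitDrop<T> {
    value: Option<T>,
}

impl<T> MustExplicitDrop<T> {
    pub fn new(value: T) -> Self {
        Self { value: Some(value) }
    }

    /// Consumes the wrapper and drops the inner value.
    pub fn explicit_drop(mut self, _root: &Root) {
        // Taking the value leaves `None` behind, so the `Drop` impl
        // below sees that the release already happened.
        drop(self.value.take());
    }
}

impl<T> Drop for MustExplicitDrop<T> {
    fn drop(&mut self) {
        if let Some(value) = self.value.take() {
            // Model the leak: the inner `T` is never dropped.
            std::mem::forget(value);
            if cfg!(debug_assertions) && !std::thread::panicking() {
                // Debug builds: fail loudly.
                panic!("dropped without calling `explicit_drop`");
            } else {
                // Release builds: report the problem and keep going.
                eprintln!("error: dropped without calling `explicit_drop`; leaking");
            }
        }
    }
}
```

Calling `explicit_drop(&root)` on the wrapper releases the inner value quietly; letting the wrapper go out of scope instead takes the panic or log path, depending on the build profile.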

This specific error is probably because `clone_internal` is returning early without calling `explicit_drop` on the new `RootedRc` that it creates. There are a few reasons it could return early, such as invalid flags being passed to `clone()`. It would be interesting to know whether the application calls `clone` differently when this occurs. But the real bug here is not calling `explicit_drop` before returning.
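The early-return pattern suspected here can be sketched in isolation. Everything below is hypothetical (`Guard`, `check_flags`, `clone_buggy`, `clone_fixed`, and the `SUPPORTED` mask are illustrative names, not Shadow code), and the guard counts leaks instead of panicking so that both paths can be compared.

```rust
use std::cell::Cell;

/// Hypothetical guard that records a leak when it is dropped implicitly.
struct Guard<'a> {
    leaks: &'a Cell<u32>,
    released: bool,
}

impl<'a> Guard<'a> {
    fn new(leaks: &'a Cell<u32>) -> Self {
        Self { leaks, released: false }
    }

    /// Explicit release; marks the guard so `Drop` stays silent.
    fn explicit_drop(mut self) {
        self.released = true;
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        if !self.released {
            self.leaks.set(self.leaks.get() + 1);
        }
    }
}

/// Hypothetical validation step standing in for "invalid flags".
fn check_flags(flags: u64) -> Result<(), &'static str> {
    const SUPPORTED: u64 = 0xff; // assumption, for illustration only
    if flags & !SUPPORTED != 0 {
        Err("unsupported flags")
    } else {
        Ok(())
    }
}

/// Buggy shape: `?` returns early, so the guard is dropped implicitly.
fn clone_buggy(flags: u64, leaks: &Cell<u32>) -> Result<(), &'static str> {
    let guard = Guard::new(leaks);
    check_flags(flags)?; // the error path skips `explicit_drop`
    guard.explicit_drop();
    Ok(())
}

/// Fixed shape: the guard is released on every path before returning.
fn clone_fixed(flags: u64, leaks: &Cell<u32>) -> Result<(), &'static str> {
    let guard = Guard::new(leaks);
    let result = check_flags(flags);
    guard.explicit_drop();
    result
}
```

With an unsupported flag value, `clone_buggy` bumps the leak counter while `clone_fixed` returns the same error without leaking.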

jtracey commented on June 6, 2024

You're right that it continues when Shadow is compiled without debug mode, but I don't see any warnings that would explain an early return. ~~According to Shadow's strace configuration, the `clone3` flags are the same in the error and non-error case (88, no idea which flags those are).~~ edit: Ah, the second field of `clone3` is a size, not the flags, so disregard that.
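For reference, 88 matches the size of the full `struct clone_args` that `clone3` takes a pointer to: eleven 64-bit fields. The sketch below mirrors that layout as documented in clone3(2); the Rust names `CloneArgs` and `CLONE_ARGS_SIZE` are made up for this example. The flags live in the first field of the struct, not in the size argument.

```rust
use std::mem::size_of;

/// Mirrors the layout of Linux's `struct clone_args` (see clone3(2)).
/// The Rust type name is illustrative.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct CloneArgs {
    pub flags: u64,        // CLONE_* flags
    pub pidfd: u64,        // where to store a PID file descriptor
    pub child_tid: u64,    // where to store the child TID in the child
    pub parent_tid: u64,   // where to store the child TID in the parent
    pub exit_signal: u64,  // signal delivered to the parent on exit
    pub stack: u64,        // lowest byte of the child's stack
    pub stack_size: u64,   // size of the child's stack
    pub tls: u64,          // location of the new TLS
    pub set_tid: u64,      // pointer to a pid_t array (since Linux 5.5)
    pub set_tid_size: u64, // number of elements in set_tid (since Linux 5.5)
    pub cgroup: u64,       // target cgroup file descriptor (since Linux 5.7)
}

/// Value passed as the second argument of `clone3` when the full struct is used.
pub const CLONE_ARGS_SIZE: usize = size_of::<CloneArgs>();
```

Eleven fields of 8 bytes each give `CLONE_ARGS_SIZE == 88`.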

sporksmith commented on June 6, 2024

Taking a look at this...

sporksmith commented on June 6, 2024

This should be fixed now, though chances are the clone call that was causing a crash before will now return an error, so the sim might still not work as desired. I also updated the strace logging to log the contents of the `clone_args` struct, including the flags, so that might make it a bit easier to dig in further.

sporksmith commented on June 6, 2024

Tried to reproduce to verify the fix. When run natively I run out of memory:

```
$ cargo run --release
   Compiling proc-macro2 v1.0.78
   Compiling unicode-ident v1.0.12
   Compiling libc v0.2.153
   Compiling pin-project-lite v0.2.13
   Compiling quote v1.0.35
   Compiling syn v2.0.48
   Compiling num_cpus v1.16.0
   Compiling tokio-macros v2.2.0
   Compiling tokio v1.35.1
   Compiling shadow-bug v0.1.0 (/home/jnewsome/tmp/actions-sandbox)
    Finished release [optimized] target(s) in 4.04s
     Running `target/release/shadow-bug`
thread '<unnamed>' panicked at library/std/src/sys/unix/stack_overflow.rs:147:13:
failed to set up alternative stack guard page: Cannot allocate memory (os error 12)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at thread 'library/std/src/sys/unix/stack_overflow.rs<unnamed>:' panicked at 143library/std/src/sys/unix/stack_overflow.rs::13143:
thread ':failed to allocate an alternative stack: Cannot allocate memory (os error 12)<unnamed>13
' panicked at :
library/core/src/panicking.rs:failed to allocate an alternative stack: Cannot allocate memory (os error 12)126
:5:
panic in a function that cannot unwind
thread '<unnamed>thread 'stack backtrace:
' panicked at <unnamed>library/core/src/panicking.rs' panicked at :library/core/src/panicking.rs126::1265::
5panic in a function that cannot unwind:

panic in a function that cannot unwind
   0:     0x55f615e51e5c - <unknown>
   1:     0x55f615e72a8c - <unknown>
   2:     0x55f615e4f74e - <unknown>
   3:     0x55f615e51c44 - <unknown>
   4:     0x55f615e532f3 - <unknown>
   5:     0x55f615e5300c - <unknown>
   6:     0x55f615e53879 - <unknown>
   7:     0x55f615e53731 - <unknown>
   8:     0x55f615e52386 - <unknown>
   9:     0x55f615e534c2 - <unknown>
  10:     0x55f615e190c3 - <unknown>
  11:     0x55f615e19167 - <unknown>
  12:     0x55f615e191f3 - <unknown>
  13:     0x55f615e55f42 - <unknown>
  14:     0x7f0357e94ac3 - <unknown>
  15:     0x7f0357f26850 - <unknown>
  16:                0x0 - <unknown>
thread caused non-unwinding panic. aborting.
stack backtrace:
   0:     0x55f615e51e5c - <unknown>
   1:     0x55f615e72a8c - <unknown>
   2:     0x55f615e4f74e - <unknown>
   3:     0x55f615e51c44Aborted (core dumped)
```

If I change the thread count from 20000 to 2000, the simulation works even before the fix on my host machine (Ubuntu 22.04).

I'll try again in a debian 12 container; this is probably a glibc difference resulting in different clone flags.

Reopening while I verify the fix...

sporksmith commented on June 6, 2024

I'm getting the same behavior in my debian 12 container.

> Oh, some additional context that may help: You can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly, it's possible this is just a placebo ritual, but when I encountered it in real experiments, it seemed more likely to occur when starting multiple experiments at the same time, so I stopped doing that, and let each experiment finish bootstrapping before starting the next, and it seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.

Ah, yeah, maybe we need to be somewhat resource constrained to trigger the bug, but not so much that we can't allocate all the thread stacks, heh.

I'll close again for now but feel free to reopen if it seems not fixed.

jtracey commented on June 6, 2024

Thanks for looking into this. The results are now the same for me irrespective of thread count (i.e., successfully completes with a few seemingly harmless warnings). Is this an indication that everything is fine, or is it possible some error is now getting buried?

sporksmith commented on June 6, 2024

Strange. It might be worth checking the strace log - based on the past behavior I'd expect to see some clone or clone3 syscalls that are returning an error where shadow was previously crashing. I suppose the question is whether tokio deals with that gracefully, and why there are these clone / clone3 syscalls that are failing only when there are very many threads.
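To see from the application side how a failing `clone`/`clone3` surfaces, a small probe like the following may help. It is a sketch using only the standard library, not tokio and not the original reproducer; `spawn_many` is a hypothetical helper. `std::thread::Builder::spawn` returns an `io::Result`, so a failed thread creation can be counted instead of panicking the way `std::thread::spawn` would.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Tries to keep `count` threads alive at once and returns
/// `(spawned, failed)`. Hypothetical helper for probing spawn failures.
pub fn spawn_many(count: usize) -> (usize, usize) {
    // Gate that keeps every spawned thread parked until we open it,
    // so the threads exist concurrently rather than finishing at once.
    let gate = Arc::new((Mutex::new(false), Condvar::new()));
    let mut handles = Vec::with_capacity(count);
    let mut failed = 0;

    for i in 0..count {
        let gate = Arc::clone(&gate);
        let builder = thread::Builder::new().name(format!("worker-{i}"));
        let result = builder.spawn(move || {
            let (lock, cvar) = &*gate;
            let mut open = lock.lock().unwrap();
            while !*open {
                open = cvar.wait(open).unwrap();
            }
        });
        match result {
            Ok(handle) => handles.push(handle),
            Err(err) => {
                // Thread creation failed in the OS; report it and continue.
                failed += 1;
                eprintln!("spawn {i} failed: {err}");
            }
        }
    }

    // Open the gate and wait for all threads that did start.
    {
        let (lock, cvar) = &*gate;
        *lock.lock().unwrap() = true;
        cvar.notify_all();
    }
    let spawned = handles.len();
    for handle in handles {
        let _ = handle.join();
    }
    (spawned, failed)
}
```

Running `spawn_many` with a large count under Shadow and comparing the `failed` count against the `clone`/`clone3` errors in the strace log would show whether the errors reach the application. How tokio reacts to the same failures is not established in this thread.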
