
Comments (9)

jtracey avatar jtracey commented on June 6, 2024

Oh, some additional context that may help: You can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly, it's possible this is just a placebo ritual, but when I encountered it in real experiments, it seemed more likely to occur when starting multiple experiments at the same time, so I stopped doing that, and let each experiment finish bootstrapping before starting the next, and it seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.

from shadow.

stevenengler avatar stevenengler commented on June 6, 2024

Thanks for the report! This is likely the same as #3266 (comment).

Shadow uses RootedRc to give objects non-atomic reference counting that is still Send/Sync. Because of how it is implemented, you cannot simply drop a RootedRc object; instead you need to call a method on it (fn explicit_drop(self, root: &Self::ExplicitDropParam)) to free it. If you don't call this, it's a resource leak (the T in RootedRc<T> is never dropped). Since Rust doesn't have any way to prevent dropping objects at compile time, RootedRc uses a run-time check that panics in debug builds if the RootedRc is dropped without explicit_drop having been called. So the panics should only occur in debug builds. In release builds it should log an error and continue, leaking the resource.
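To make the run-time check concrete, here is a minimal sketch of the pattern. It is not Shadow's RootedRc: the type name MustExplicitDrop is hypothetical, there is no root argument, and no reference counting. It only shows a handle that panics in debug builds, and logs and leaks in release builds, when it is dropped without explicit_drop.

```rust
/// Hypothetical stand-in for a handle that must be released explicitly.
/// Not Shadow's RootedRc: no root argument and no reference counting.
struct MustExplicitDrop<T> {
    value: Option<T>,
}

impl<T> MustExplicitDrop<T> {
    fn new(value: T) -> Self {
        Self { value: Some(value) }
    }

    /// Consumes the handle and drops the inner value.
    fn explicit_drop(mut self) {
        drop(self.value.take());
        // `self` is dropped here, but `value` is now `None`,
        // so the check in `Drop` passes silently.
    }
}

impl<T> Drop for MustExplicitDrop<T> {
    fn drop(&mut self) {
        if let Some(value) = self.value.take() {
            // The inner value is leaked rather than dropped.
            std::mem::forget(value);
            if cfg!(debug_assertions) {
                // Avoid a double panic (which would abort) while unwinding.
                if !std::thread::panicking() {
                    panic!("dropped without calling explicit_drop");
                }
            } else {
                eprintln!("error: dropped without calling explicit_drop; leaking the value");
            }
        }
    }
}

fn main() {
    let handle = MustExplicitDrop::new(String::from("resource"));
    handle.explicit_drop(); // correct usage: no panic, no leak
}
```

With this shape, forgetting the call is invisible to the compiler and only shows up at run time, which is why the symptom differs between debug and release builds.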

This specific error is probably because clone_internal is returning early without calling explicit_drop on the new RootedRc that it creates. There are a few reasons that it could return early such as providing invalid flags to clone(). It would be interesting to know if clone is called differently from the application when this occurs. But the real bug here is not calling explicit_drop before returning.
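The early-return shape described above can be sketched roughly as follows. Everything here is hypothetical (Handle, clone_buggy, clone_fixed, and the flags_ok parameter are illustration names, not Shadow's code); a counter of live handles stands in for the leaked resource.

```rust
use std::cell::Cell;

/// Hypothetical handle that must be released with `explicit_drop`.
/// `live` counts handles that have been created but not yet released.
struct Handle<'a> {
    live: &'a Cell<usize>,
}

impl<'a> Handle<'a> {
    fn new(live: &'a Cell<usize>) -> Self {
        live.set(live.get() + 1);
        Handle { live }
    }

    fn explicit_drop(self) {
        self.live.set(self.live.get() - 1);
    }
}

/// Buggy shape: the error path returns while `handle` is still live.
fn clone_buggy(live: &Cell<usize>, flags_ok: bool) -> Result<(), &'static str> {
    let handle = Handle::new(live);
    if !flags_ok {
        return Err("invalid flags"); // `handle` is implicitly dropped here
    }
    // A real implementation would hand the handle off; the sketch just releases it.
    handle.explicit_drop();
    Ok(())
}

/// Fixed shape: every return path releases the handle first.
fn clone_fixed(live: &Cell<usize>, flags_ok: bool) -> Result<(), &'static str> {
    let handle = Handle::new(live);
    if !flags_ok {
        handle.explicit_drop();
        return Err("invalid flags");
    }
    handle.explicit_drop();
    Ok(())
}

fn main() {
    let live = Cell::new(0);
    let _ = clone_buggy(&live, false);
    println!("live handles after buggy error path: {}", live.get());
    live.set(0);
    let _ = clone_fixed(&live, false);
    println!("live handles after fixed error path: {}", live.get());
}
```

An alternative design is to validate the arguments before creating the handle at all, so that there is nothing to release on the error path.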

jtracey avatar jtracey commented on June 6, 2024

You're right that it continues when Shadow is compiled without debug mode, but I don't see any warnings that would explain an early return. According to Shadow's strace configuration, the clone3 flags are the same in the error and non-error case (88, no idea which flags those are). Edit: Ah, the second argument of clone3 is a size, not the flags, so disregard that.
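For reference, 88 is consistent with the size of the kernel's struct clone_args in recent Linux versions, which has eleven 64-bit fields. The sketch below uses a local mirror of that struct (CloneArgs is a name chosen for this example, not a type from Shadow or from a crate) to check the arithmetic; the flags themselves live in the first field of the struct that the first argument points to.

```rust
use std::mem::size_of;

/// Local mirror of the kernel's `struct clone_args` (linux/sched.h).
/// Every field is a 64-bit value in the kernel definition.
#[allow(dead_code)]
#[repr(C)]
#[derive(Default)]
struct CloneArgs {
    flags: u64,
    pidfd: u64,
    child_tid: u64,
    parent_tid: u64,
    exit_signal: u64,
    stack: u64,
    stack_size: u64,
    tls: u64,
    set_tid: u64,
    set_tid_size: u64,
    cgroup: u64,
}

/// The value a caller would pass as the second argument of clone3.
fn clone_args_size() -> usize {
    size_of::<CloneArgs>()
}

fn main() {
    println!("size passed to clone3: {}", clone_args_size());
}
```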

sporksmith avatar sporksmith commented on June 6, 2024

Taking a look at this...

sporksmith avatar sporksmith commented on June 6, 2024

This should be fixed now, though chances are the clone call that was causing a crash before will return an error, so the sim might still not work as desired. I did also update the strace logging to log the contents of the clone_args struct including the flags, so that might make it a bit easier to dig in further.

sporksmith avatar sporksmith commented on June 6, 2024

Tried to reproduce to verify the fix. When run natively I run out of memory:

$ cargo run --release
   Compiling proc-macro2 v1.0.78
   Compiling unicode-ident v1.0.12
   Compiling libc v0.2.153
   Compiling pin-project-lite v0.2.13
   Compiling quote v1.0.35
   Compiling syn v2.0.48
   Compiling num_cpus v1.16.0
   Compiling tokio-macros v2.2.0
   Compiling tokio v1.35.1
   Compiling shadow-bug v0.1.0 (/home/jnewsome/tmp/actions-sandbox)
    Finished release [optimized] target(s) in 4.04s
     Running `target/release/shadow-bug`
thread '<unnamed>' panicked at library/std/src/sys/unix/stack_overflow.rs:147:13:
failed to set up alternative stack guard page: Cannot allocate memory (os error 12)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at thread 'library/std/src/sys/unix/stack_overflow.rs<unnamed>:' panicked at 143library/std/src/sys/unix/stack_overflow.rs::13143:
thread ':failed to allocate an alternative stack: Cannot allocate memory (os error 12)<unnamed>13
' panicked at :
library/core/src/panicking.rs:failed to allocate an alternative stack: Cannot allocate memory (os error 12)126
:5:
panic in a function that cannot unwind
thread '<unnamed>thread 'stack backtrace:
' panicked at <unnamed>library/core/src/panicking.rs' panicked at :library/core/src/panicking.rs126::1265::
5panic in a function that cannot unwind:

panic in a function that cannot unwind
   0:     0x55f615e51e5c - <unknown>
   1:     0x55f615e72a8c - <unknown>
   2:     0x55f615e4f74e - <unknown>
   3:     0x55f615e51c44 - <unknown>
   4:     0x55f615e532f3 - <unknown>
   5:     0x55f615e5300c - <unknown>
   6:     0x55f615e53879 - <unknown>
   7:     0x55f615e53731 - <unknown>
   8:     0x55f615e52386 - <unknown>
   9:     0x55f615e534c2 - <unknown>
  10:     0x55f615e190c3 - <unknown>
  11:     0x55f615e19167 - <unknown>
  12:     0x55f615e191f3 - <unknown>
  13:     0x55f615e55f42 - <unknown>
  14:     0x7f0357e94ac3 - <unknown>
  15:     0x7f0357f26850 - <unknown>
  16:                0x0 - <unknown>
thread caused non-unwinding panic. aborting.
stack backtrace:
   0:     0x55f615e51e5c - <unknown>
   1:     0x55f615e72a8c - <unknown>
   2:     0x55f615e4f74e - <unknown>
   3:     0x55f615e51c44Aborted (core dumped)

If I change the thread count from 20000 to 2000, the simulation works on my host machine (Ubuntu 22.04) even before the fix.

I'll try again in a Debian 12 container; this is probably a glibc difference resulting in different clone flags.

Reopening while I verify the fix...

sporksmith avatar sporksmith commented on June 6, 2024

I'm getting the same behavior in my Debian 12 container.

> Oh, some additional context that may help: You can also decrease the thread count of one process and start many copies of the process at the same time, and that will also trigger it. More interestingly, it's possible this is just a placebo ritual, but when I encountered it in real experiments, it seemed more likely to occur when starting multiple experiments at the same time, so I stopped doing that, and let each experiment finish bootstrapping before starting the next, and it seemed to go away. If that effect is real and not just superstition, it smells like /dev/shm or some other shared resource to me.

Ah, yeah, maybe we need to be somewhat resource-constrained to trigger the bug, but not so much that we can't allocate all the thread stacks, heh.

I'll close again for now but feel free to reopen if it seems not fixed.

jtracey avatar jtracey commented on June 6, 2024

Thanks for looking into this. The results are now the same for me irrespective of thread count (i.e., successfully completes with a few seemingly harmless warnings). Is this an indication that everything is fine, or is it possible some error is now getting buried?

sporksmith avatar sporksmith commented on June 6, 2024

Strange. It might be worth checking the strace log - based on the past behavior I'd expect to see some clone or clone3 syscalls that are returning an error where shadow was previously crashing. I suppose the question is whether tokio deals with that gracefully, and why there are these clone / clone3 syscalls that are failing only when there are very many threads.
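A quick way to do that check is to filter the strace log for clone / clone3 lines with a negative return value. The sketch below assumes each log line ends in " = <return value>" and that failures are logged as negative numbers; both the line format and the sample lines are assumptions made for illustration, not output copied from Shadow.

```rust
/// Returns the log lines for clone / clone3 calls whose return value is negative.
/// Assumes each line ends with " = <return value>" (an assumed format).
fn failed_clone_calls(log: &str) -> Vec<&str> {
    log.lines()
        .filter(|line| line.contains("clone(") || line.contains("clone3("))
        .filter(|line| {
            line.rsplit_once(" = ")
                .map(|(_, ret)| ret.trim_start().starts_with('-'))
                .unwrap_or(false)
        })
        .collect()
}

fn main() {
    // Made-up sample lines in the assumed format.
    let log = "\
00:00:01.000000 [tid 1000] clone3(0x7f00, 88) = 1001
00:00:01.000100 [tid 1000] clone3(0x7f00, 88) = -22 (EINVAL)
00:00:01.000200 [tid 1000] mmap(0x0, 4096) = -12 (ENOMEM)";
    for line in failed_clone_calls(log) {
        println!("{line}");
    }
}
```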
