Coder Social home page Coder Social logo

Comments (10)

sethp avatar sethp commented on June 17, 2024

Oh, I also meant to mention: I found someone on the mailing list with a similar-looking symptom ( https://groups.google.com/g/chipyard/c/i0pNR4t8HFA/m/NBMP4fcsAQAJ ), but given that they reported 1) the crash occurred in what appears to be an initial block rather than the nba driving the network_tick, and 2) solving the problem by changing the name of the connected bus I suspect that is a different issue, though perhaps still somewhere in SimNetwork.cc & friends.

from chipyard.

jerryz123 avatar jerryz123 commented on June 17, 2024

I wonder if adding a lock to the network__tick and network_init functions would be sufficient.

from chipyard.

sethp avatar sethp commented on June 17, 2024

I think it depends on how you mean: I noticed the cur that's 0x0 in gdb there is a static __thread context_t* cur;; since it's a thread-local, I read that as saying the faulting thread's copy of that storage is uninitialized. No other thread can (well, should) access the faulting thread's local storage, so locking to wait for it to be constructed would probably just hang.

If you mean "instead of using a ucontext/coroutine thing, set up a cond-with-lock in network_init that parks a pthread until signaled by network_tick", then I think there's a potentially fruitful path there: there's even a bit of an example implementation in the fesvr's context_t as an alternative to using a ucontext, albeit one that doesn't look directly usable.

Between repair and replace, I'm personally leaning towards trying to figure out what the fesvr's context_t thing wants, since there's a handful of other uses of in chipyard already (e.g. the spike tile) that would either suffer from a similar issue or possibly provide a solution. I'm hoping to continue posting my notes here as I learn my way through the fesvr and verilator threading models—unless you'd rather I didn't, of course!

from chipyard.

jerryz123 avatar jerryz123 commented on June 17, 2024

unless you'd rather I didn't, of course!

I suspect the solution to the problem does not require messing around in context_t. I believe several other devices uses FESVR's context_t and behave correctly in multithreaded sims.

from chipyard.

sethp avatar sethp commented on June 17, 2024

Perhaps! I did see that other devices made use of context_t, which is what leads me to want to understand the problem a little better. I found certain --threads counts do in fact produce a simulation that works for a given model using SimNetwork; not just 1 (which always works), but sometimes the model will work indefinitely with --threads 2 and crash immediately with --threads 3.

Two especially relevant details I've noticed so far:

  1. The verilator multithreading model appears to schedule micro-tasks statically; i.e. the same thread always resolves the same DPI-C call for a given model
  2. The crash occurs inside context_t when the thread-local cur variable is NULL (0x0). cur usually looks like it gets initialized as a side-effect of calling init in that thread (for any context_t instance, I think?)

I think it's the combination of these two that's causing the behavior I'm seeing: when the scheduler happens to place the network_init and network_tick DPI calls into the same thread's work queue (P=100% with one thread, ~50% with two, ~33 % with three, etc...), then there's no crash—network_init populates the thread-local, and network_tick uses it.

If I'm further right in saying that any call to static context_t::current() in a thread "pre-warms" it, then adding more independently-scheduled instances (like, say, 2x ice nics + a block device + 8 spike tiles), we might end up rapidly (but asymptotically) approaching a 100% chance that some initial block populates each thread's cur storage for a given number of threads. Which could account for (directly, or indirectly) your experience that multi-threaded sims with context_t work fine?

from chipyard.

sethp avatar sethp commented on June 17, 2024

Hmm, well, yes and no to that last question. Adding a spike tile1 did perturb the scheduler enough that the simulation worked at least once2 with VERILATOR_THREADS=3:

[UART] UART0 is here (stdin/stdout).
network init (tid=198488)
No tap interface provided
Constructing spike processor_t (tid=198490)
Done constructing spike processor
network tick (tid=198488)
- /home/seth/Code/src/github.com/ucb-bar/chipyard/sims/verilator/generated-src/chipyard.harness.TestHarness.TapNICRocketConfig/gen-collateral/TestDriver.v:158: Verilog $finish

and failed with VERILATOR_THREADS=4:

network init (tid=205548)
No tap interface provided
Constructing spike processor_t (tid=205550)
Done constructing spike processor
network tick (tid=205551)
zsh: segmentation fault (core dumped)  ./sims/verilator/simulator-chipyard.harness-TapNICRocketConfig 

But, adding/removing cores doesn't change which threads do the initialization:

Constructing spike processor_t (tid=219082)
Done constructing spike processor
Constructing spike processor_t (tid=219082)
Done constructing spike processor
Constructing spike processor_t (tid=219082)
Done constructing spike processor
Constructing spike processor_t (tid=219082)
Done constructing spike processor

And, experimenting also provided a counterexample to my speculation that any context_t::init caller in a thread would suffice:

[UART] UART0 is here (stdin/stdout).
network init (tid=192456)
No tap interface provided
Constructing spike processor_t (tid=192457)
Done constructing spike processor
network tick (tid=192457)
zsh: segmentation fault (core dumped)  ./sims/verilator/simulator-chipyard.harness-TapNICRocketConfig 

Also, it seems that bdev is vulnerable to the same crash (as long as the sim is run with +blkdev=somefile, otherwise the blkdev never inits or ticks):

bdev init (tid=312407)
[UART] UART0 is here (stdin/stdout).
...
bdev tick (tid=312410)
zsh: segmentation fault (core dumped)  ./sims/verilator/simulator-chipyard.harness-TapNICRocketConfig +permissive   
$ coredumpctl debug
...
Core was generated by `./sims/verilator/simulator-chipyard.harness-TapNICRocketConfig +permissive +blk'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000556201ef4530 in context_t::switch_to (this=0x55620306f9e0) at ../fesvr/context.cc:86
86        cur = this;
[Current thread is 1 (Thread 0x7fcebc80c6c0 (LWP 312410))]
(gdb) bt
#0  0x0000556201ef4530 in context_t::switch_to (this=0x55620306f9e0) at ../fesvr/context.cc:86
#1  0x00005562015cecbe in block_device_tick ()
#2  0x00005562017a7206 in VTestDriver___024unit____Vdpiimwrap_block_device_tick_TOP____024unit(unsigned char, unsigned char&, unsigned char, unsigned int, unsigned int, unsigned int, unsigned char, unsigned char&, unsigned long, unsigned int, unsigned char&, unsigned char, unsigned long&, unsigned int&) ()
#3  0x00005562019a38b0 in VTestDriver___024root___nba_sequent__TOP__1888(VTestDriver___024root*) ()
#4  0x0000556201646d0f in VTestDriver___024root____Vthread__nba__2(void*, bool) ()
#5  0x000055620160ce51 in VlWorkerThread::workerLoop() ()
#6  0x00007fcebe86fe95 in std::execute_native_thread_routine (__p=<optimized out>)
    at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
#7  0x00007fcebe53e55a in ?? () from /usr/lib/libc.so.6
#8  0x00007fcebe5bba5c in ?? () from /usr/lib/libc.so.6
(gdb) p cur
$1 = (context_t *) 0x0

It seems like the spike tile is in fact the odd one out; since it's driven from a single DPI-C entrypoint that inits itself on demand, it's not possible(?) for it to be scheduled on to two different threads by the Verilator microtask scheduler.

Footnotes

  1. so the new config is

    class TapNICRocketConfig extends Config(
    new chipyard.WithNSpikeCores(4) ++
    
    new chipyard.harness.WithSimNetwork ++
    new icenet.WithIceNIC ++
    new freechips.rocketchip.subsystem.WithNBigCores(1) ++
    new chipyard.config.AbstractConfig)
    
  2. I'm testing with:

    touch sims/verilator/generated-src/chipyard.harness.TestHarness.TapNICRocketConfig/gen-collateral/filelist.f && MAKEFLAGS=-j`nproc` make -C sims/verilator VERILATOR_THREADS=N CONFIG=TapNICRocketConfig && ./sims/verilator/simulator-chipyard.harness-TapNICRocketConfig toolchains/riscv-tools/riscv-tests/build/isa/rv64ui-p-simple
    

    for various values of N. I'm also using rv64ui-p-simple because it doesn't seem important which test program to use: whether it segfaults or not happens on the first network_tick call and is apparently independent of whatever the simulated binary does.

from chipyard.

sethp avatar sethp commented on June 17, 2024

Ah, one speculative code change later and, oh, hello:

[UART] UART0 is here (stdin/stdout).
network init (tid=536361)
No tap interface provided
network tick (tid=536364)
- /home/seth/Code/src/github.com/ucb-bar/chipyard/sims/verilator/generated-src/chipyard.harness.TestHarness.TapNICRocketConfig/gen-collateral/TestDriver.v:158: Verilog $finish

init and tick ran in different OS threads, and yet no crash! That result via noticing that it was a very different crash in the case that someone had previously populated the thread-local.

In reading the code to try and identify how the control flow ought to return, I noticed that both the device and the switch had a pointer to another thread's local storage (they were both storing cur in a field). So, overriding some access modifiers so I could update them from network_tick:

  if (!netdev || !netsw) {
    fprintf(stderr, "You forgot to call network_init!");
    exit(1);
  }

+  netdev->target = context_t::current();
+  netsw->main = context_t::current();

  netdev->tick(out_valid, out_data, out_last);
  netdev->switch_to_host();

  netsw->distribute();
  netsw->switch_to_worker();

and, since context_t::current exists to populate the thread-local, it seems neither crash occurred.

I'm still not quite sure what all this means yet, nor what course of action it suggests, but I did find that result very interesting.

from chipyard.

sethp avatar sethp commented on June 17, 2024

Ok, so I've experimented a bit more, and I've come up with three potentially useful perspectives here:

  • SimNetwork (& SimBlockDevice) are mis-using context_t by stashing the pointer from context_t::current in a field during construction. That's always a pointer to a thread-local, and if the constructor is called in a different thread (as here), that gets very difficult to reason about. A change suggested by this perspective is an iteration on (but functionally the same as) "update in network tick" from above, maybe something like this in device.h:
    -    void switch_to_host(void) { host.switch_to(); }
    +    void switch_to_host(void) { target = context::current(); host.switch_to(); target = NULL; }
  • context_t is challenging to use correctly, especially in a threaded program. An idea for how to improve the surface a bit would be to promote prev to a thread-local itself and add something like:
    void context_t::yield() {
    #ifdef USE_UCONTEXT 
      if (swapcontext(cur->context.get(), prev->context.get()) != 0)
        abort();
    #else
      assert(false && "todo!");
      abort();
    #endif
    }
    which would allow NetworkDevice & NetworkSwitch a way to pass control flow back to the simulation without needing to stash their own context pointers. The full fix would also require changing context_t::switch_to so that it always initializes the thread-local if it's NULL (or, possibly better yet, just make cur ~ static __thread context_t cur so it's always valid memory)
  • verilator's threading model is already providing (effectively) stackless (micro-)tasks that can be efficiently distributed to threads. context_t does provide stackful coroutines, but the gain from using context_t might be relatively small for its cost in complexity (and the implicit syscall to set the signal mask every swap). The direction suggested here would be to refactor the tick functions to return after a single step (i.e. on the current loop back-edge); to my eye that looks fairly achievable for the NetworkDevice & NetworkSwitch after shuffling a few stack-locals to become class members or shim'd-in arguments. BlockDevice would be even simpler, and I don't see any other uses of fesvr/context.h in my clone of the chipyard repo at this time (although I only have the default list of submodules from ./build-setup.sh).

I'd be lying if I said I didn't see the last as the most pragmatic option. I also feel like it's a bit of of a loss: I've really enjoyed learning about the ucontext_t stuff, and I appreciate the utility of being able to multiplex lightweight user tasks over the same thread. That said, I spend a lot of time learning about weird corners of computing, and it was still very strange to me on first encounter. That to me is an important signal about code accessibility, and for the same reason non-local control flow is... well, best used sparingly.

I believe I can fill out any of those three directions into a full-fledged PR to the upstream(s) in question here (IceNet & testchipip, or riscv-isa-sim). Do any of them especially call to you, @jerryz123 ? I see that you've done a lot of this work, so I suspect you'd have a better sense for the overall context (pun intended).

from chipyard.

jerryz123 avatar jerryz123 commented on June 17, 2024

SimNetwork (& SimBlockDevice) are mis-using context_t by stashing the pointer from context_t::current in a field during construction. That's always a pointer to a thread-local, and if the constructor is called in a different thread (as here), that gets very difficult to reason about.

While I agree with your reasoning here, I don't think its reasonable to expect/require SimDevice implementations to be thread-safe, where the constructor/tick functions can be called from distinct threads. As far as I can tell, this quirk only appears with Verilator multi-threading. The other simulators don't do this, even with multithreading enabled.

context_t is challenging to use correctly, especially in a threaded program. An idea for how to improve the surface a bit would be to promote prev to a thread-local itself and add something like:

Does this generalize to systems with multiple contexts? IMO its better to require the programmer to explicitly specify the next context to execute. There are use-cases of context_t which have multiple contexts (not just target/host).

The direction suggested here would be to refactor the tick functions to return after a single step

The htif/tsi mechanism uses context_t, but I believe the implementation is buried with the static FESVR library, which is compiled as part of spike (Spike uses htif and context_t as well in its own simulation loop).
The FireSim FPGA emulation driver also heavily uses context_t.

Another example is the tick function for SpikeTile, which allows the Spike C++ core model to interact with the Chipyard RTL simulation.


I couldn't think of a way to make that system work without context_t.

My belief here is that the init/tick functions being called from separate threads is a Verilator-specific quirk that we should work around with minimal disruption to existing other code/interfaces. Perhaps the simplest thing is to merge network_tick and network_init, and make the tick function initialize the devices on-demand?

Thank you for digging into this, I've learned quite a bit about the subtleties of the context_t behavior in a multi-threaded system from your analysis.

from chipyard.

sethp avatar sethp commented on June 17, 2024

While I agree with your reasoning here, I don't think its reasonable to expect/require SimDevice implementations to be thread-safe, where the constructor/tick functions can be called from distinct threads.

Yeah, I hear you about not wanting the {runtime,complexity} overhead of generalized thread-safety in the simulated devices. I want to note that the situation here calls for a much narrower "kind" of thread safety—it's the difference between what Rust calls Sync (you might be called from multiple threads at the same time) and the much simpler Send (it's safe to move the resource between threads)12. There is a strict happens-before relationship between the verilog inital blocks that call init and the always blocks that call tick, which it appears that verilator correctly implements. So to achieve correctness here it's not that the sim needs to handle multiple concurrent callers, but more or less just needs to avoid stashing a reference to another thread's local storage.

As far as I can tell, this quirk only appears with Verilator multi-threading. The other simulators don't do this, even with multithreading enabled.

To be fully transparent, I have only a few dozen hours' worth of experience with any of the commercial verilog implementations, and none to the depth I've gotten here with verilator. I do wonder if it's on accident or by design that the other simulators don't encounter this behavior: do you know if there's some verilog standard (implicit or explicit) that verilator is violating here by evaluating the initial block in a different thread than the always block? If it should be treating the module, say, as the "unit" of work (perhaps iff the module contains DPI?), that's something it might be worth raising with them upstream, too.

Does this generalize to systems with multiple contexts? IMO its better to require the programmer to explicitly specify the next context to execute. There are use-cases of context_t which have multiple contexts (not just target/host).

The yeild semantics I implemented do generalize in the sense that every context_t has a most-recently-swapped antecedent, but as written would probably produce surprising behavior when trying to "nest" contexts (M switch_to A switch_to B followed by yield would "return" to B, but a second yield from A would pass control flow back to B). We could imagine a context "stack", with switch_to pushing a task, and a definition of yield that acts as a "pop." I believe that would work to implement arbitrarily nested contexts just fine, although there's other simple solutions too when the number of tasks is small and statically fixed, as I think is the case here3.

My belief here is that the init/tick functions being called from separate threads is a Verilator-specific quirk that we should work around with minimal disruption to existing other code/interfaces.

I appreciate the examples! I'm glad to have the benefit of your experience here. I'd agree at this point that "avoid use of context_t entirely" is a path not worth further exploration.

Unfortunately, I'm not sure there's a general resolution that doesn't at least involve at least looking at the other use sites: neither threading nor non-local control flow are famous for composing well. I haven't identified any answers that reside entirely within context_t or the verilated main or some other high-leverage point that would span all devices (at least, not yet).

Perhaps the simplest thing is to merge network_tick and network_init, and make the tick function initialize the devices on-demand?

It's a good idea, that's how it seems the spike tile (and, perhaps, htif?) gets away with using context_t when verilated as a multi-threaded model. I considered it, but I didn't bring it up, because it has some immediate consequences that I presumed would be disqualifying (all the _tick interfaces would have to take all the _init parameters, for example).

I'm also not entirely sure how durable it would be, as a solution: the verilator documentation on task scheduling suggests that they tried both static and dynamic scheduling and went for the static for performance (rather than correctness) reasons. I suspect a dynamically scheduled runtime (e.g. one based on work-stealing) would probably cause even spiketile.cc to spontaneously fail, as it got moved around between threads.

My guess is that's a decision that's unlikely to be reversed any time soon ("efficient dynamic scheduling" notwithstanding), and you're in the much better position than I to know if "all DPI-C that uses context_t has a single verilog-facing entrypoint" is an invariant you feel is more maintainable.

I think my plan at this point is to continue experimenting with ucontexts to get a better understanding of what it means to nest them (& how else they're used by htif & friends), and whether there's somewhere besides a thread-local to pass task-local sideband data to try and break the coupling there.

Pursuing a definition of a task that didn't care what thread it was scheduled on (as long as it wasn't scheduled more than once) seems like it offers a resolution that's relatively low-impact and high-durability to me.

Thank you for digging into this, I've learned quite a bit about the subtleties of the context_t behavior in a multi-threaded system from your analysis.

Thank you for reading, and for the feedback! I'm glad you've found it helpful, I've very much enjoyed learning about all these fine details as well 😄

Footnotes

  1. It's fairly Rust-jaron-rich, but the rust user forums have a good discussion about what being !Send + Sync means.

  2. The only other non-experiential citation I have for this is the Rustonomicon, which unhelpfully states "A type is Send if it is safe to send it to another thread," and seems to mistake its own premise further down (filed as https://github.com/rust-lang/nomicon/issues/453 , for anyone reading that desires homework from a footnote).

  3. If you're curious, I have an example that I'm playing with: https://github.com/sethp/ucontext-coroutine . I haven't gotten to threading just yet, and I suspect my implementation is broken even for tail-recursive / single-recursion cases, but it's been enlightening.

from chipyard.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.