
Comments (9)

dallison commented on June 23, 2024

I've found the issue. Accessing an argument after it has been std::moved. I have a fix.
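
For context, a tiny, hypothetical example of that bug class (use after std::move), not the actual subspace code:

#include <iostream>
#include <string>
#include <utility>

// Hypothetical illustration only: once `name` has been moved from, reading it
// again yields a valid but unspecified (typically empty) string.
void Register(std::string name) {
  std::string stored = std::move(name);  // `name` is now in a moved-from state
  std::cout << "registered: " << name << "\n";    // bug: reads the moved-from argument
  std::cout << "registered: " << stored << "\n";  // fix: read the moved-to object instead
}

int main() { Register("channel0"); }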


mikael-s-persson commented on June 23, 2024

I found the bug.

It was a race condition between the publisher's increment of the ref count on the new (resized) slot (at https://github.com/dallison/subspace/blob/main/client/client.cc#L264), via the SetSlotToBiggestBuffer call after the server's reply to the resize request, and the subscriber unmapping unused buffers via the UnmapUnusedBuffers function (see https://github.com/dallison/subspace/blob/main/common/channel.cc#L432C15-L432C33). Basically, the subscriber can come around to clean up unused buffers right after the server has resized the channel but before the publisher has claimed the new slot. Kaboom.

I can fix it pretty easily by never cleaning up the last buffer (i.e., always assume that the largest buffer is in use, even if its ref count is zero). I'll make a PR soon.
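
A minimal sketch of what that fix could look like, assuming per-buffer ref counts on mmap'ed buffers; the types and names here are illustrative, not the actual subspace code:

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <vector>

// Illustrative bookkeeping for one mmap'ed buffer; not the real subspace types.
struct BufferInfo {
  char *buffer = nullptr;    // mapped region, nullptr once unmapped
  size_t mapped_size = 0;    // length that was passed to mmap
  int64_t refs = 0;          // how many slots/clients still reference it
};

// Unmap buffers that nothing references any more, but always skip the last
// (largest) buffer: after a resize the server has created it, yet the
// publisher may not have bumped its ref count via SetSlotToBiggestBuffer yet.
void UnmapUnusedBuffers(std::vector<BufferInfo> &buffers) {
  if (buffers.empty()) {
    return;
  }
  for (size_t i = 0; i + 1 < buffers.size(); ++i) {  // i + 1: last buffer excluded
    BufferInfo &b = buffers[i];
    if (b.buffer != nullptr && b.refs == 0) {
      munmap(b.buffer, b.mapped_size);
      b.buffer = nullptr;
    }
  }
}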


mikael-s-persson commented on June 23, 2024

Friendly ping @dallison, any opinion on what might be causing this?


mikael-s-persson commented on June 23, 2024

@a7g4 I dug into this a bit, but I'm not able to reproduce the crashes in any config I try (gcc / clang, libstdc++ / libc++, opt / dbg).

But looking at the code and at Dave's fix to the 'reload' function on Monday (see 8a4bdeb#diff-0456d178d4819d469423f58832be3dd34231c1325bb7ea678813e48cca5f1b11L99), I suspect this bug is fixed, because that reload bug would definitely cause the sort of crash you reported, i.e., the buffer_index being out of sync with the buffers_ vector, since the reload is how a subscriber checks whether updating buffers_ is necessary before it proceeds to look at the message slots.

Like I said, I can't confirm that, because I cannot reproduce your error.
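
To make that failure mode concrete, here is a minimal, hypothetical sketch (illustrative types only, not the actual subspace code) of how a stale buffers_ view plus a slot's buffer_index produces exactly this kind of crash:

#include <cstddef>
#include <vector>

// Illustrative-only types; the real structures live in common/channel.h.
struct MessageSlot { int buffer_index = 0; };
struct MappedBuffer { char *buffer = nullptr; };

// If the subscriber's local buffers_ copy is stale (the reload didn't pick up
// a resize), a slot's buffer_index can point at an entry the subscriber has
// not mapped yet, or has already unmapped, and dereferencing it segfaults.
char *SlotBuffer(const MessageSlot &slot,
                 const std::vector<MappedBuffer> &buffers) {
  if (slot.buffer_index < 0 ||
      static_cast<size_t>(slot.buffer_index) >= buffers.size()) {
    return nullptr;  // stale view: index beyond what this client has mapped
  }
  return buffers[slot.buffer_index].buffer;  // may still be nullptr if unmapped
}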


dallison commented on June 23, 2024

Sorry for the delay; for some reason I am not getting notified of some issues on GitHub.


a7g4 commented on June 23, 2024

No worries! Let us know how we can help 🙂


a7g4 commented on June 23, 2024

> I've found the issue. Accessing an argument after it has been std::moved. I have a fix.

Is that this commit? 8a4bdeb

If so, it doesn't seem to resolve the issue 😢

We did notice that this issue seems pretty common on aarch64 but not on x86_64. Does that help identify/diagnose the problem, or is there any extra data we can gather?


mikael-s-persson commented on June 23, 2024

I did a bit of debugging on our machine. Here is what I've been able to gather so far from running that test in gdb.

Error:

Thread 4 "client_test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb4ae0110 (LWP 13282)]
subspace::Channel::NextSlot(subspace::MessageSlot*, bool, int, std::function<bool (subspace::ChannelLock*)>) (this=0x7fac001058, slot=0x7fb42d32a8, reliable=false, owner=1, reload=...)
    at common/channel.cc:683
683   Prefix(slot)->flags |= kMessageSeen;

Stack traces of the relevant threads:

Thread 4 (Thread 0x7fb4ae0110 (LWP 13282)):
#0  subspace::Channel::NextSlot(subspace::MessageSlot*, bool, int, std::function<bool (subspace::ChannelLock*)>) (this=0x7fac001058, slot=0x7fb42d32a8, reliable=false, owner=1, reload=...)
    at common/channel.cc:683
#1  0x0000007fb7ec35fc in subspace::details::SubscriberImpl::NextSlot(std::function<bool (subspace::ChannelLock*)>) (this=0x7fac001050, reload=...) at ./client/client_channel.h:246
#2  0x0000007fb7ebb550 in subspace::Client::ReadMessageInternal (this=0x7fffffd138, subscriber=0x7fac001050, mode=subspace::ReadMode::kReadNext, pass_activation=false, clear_trigger=true)
    at client/client.cc:454
#3  0x0000007fb7ebb93c in subspace::Client::ReadMessage (this=0x7fffffd138, subscriber=0x7fac001050, mode=subspace::ReadMode::kReadNext) at client/client.cc:535
#4  0x00000055555a0720 in subspace::Subscriber::ReadMessage (this=0x7fb4adf7b0, mode=subspace::ReadMode::kReadNext) at ./client/client.h:707
#5  0x0000005555599408 in ClientTest_TestCrash_Test::<lambda()>::operator()(void) const (__closure=0x55556112f8) at client/client_test.cc:1268

Thread 3 (Thread 0x7fb572d110 (LWP 13281)):
#0  0x0000007fb632f0b4 in __libc_recv (fd=10, buf=0x7fb572c460, len=4, flags=0) at ../sysdeps/unix/sysv/linux/recv.c:28
#1  0x0000007fb6d9261c in toolbelt::ReceiveFully (c=0x0, fd=10, length=4, buffer=0x7fb572c460 "\220\304r\265\177", buflen=4) at external/toolbelt/toolbelt/sockets.cc:90
#2  0x0000007fb6d92c88 in toolbelt::Socket::ReceiveMessage (this=0x7fffffc098, buffer=0x7fffffc0b0 "", buflen=4096, c=0x0) at external/toolbelt/toolbelt/sockets.cc:200
#3  0x0000007fb7ebdb1c in subspace::Client::SendRequestReceiveResponse (this=0x7fffffc078, req=..., response=..., fds=std::vector of length 0, capacity 100) at client/client.cc:953
#4  0x0000007fb7ebd7e4 in subspace::Client::ResizeChannel (this=0x7fffffc078, publisher=0x7fa8001050, new_slot_size=32) at client/client.cc:914
#5  0x0000007fb7eba774 in subspace::Client::GetMessageBuffer (this=0x7fffffc078, publisher=0x7fa8001050, max_size=32) at client/client.cc:261
#6  0x00000055555a0234 in subspace::Publisher::GetMessageBuffer (this=0x7fb572c7c8, max_size=32) at ./client/client.h:589
#7  0x000000555559926c in ClientTest_TestCrash_Test::<lambda()>::operator()(void) const (__closure=0x55556112c8) at client/client_test.cc:1259

Thread 2 (Thread 0x7fb5f36110 (LWP 13280)):
#0  0x0000007fb6327550 in __pthread_mutex_lock_full (mutex=0x7fb42dc080) at pthread_mutex_lock.c:313
#1  0x0000007fb7ec0b0c in subspace::ChannelLock::Lock (this=0x7fb002b3f0) at ./common/channel.h:111
#2  0x0000007fb7e45ad4 in subspace::ChannelLock::ChannelLock(pthread_mutex_t*, std::function<bool (subspace::ChannelLock*)>) (this=0x7fb002b3f0, lock=0x7fb42dc080, reload=...)
    at ./common/channel.h:98
#3  0x0000007fb7e434d4 in subspace::Channel::ExtendBuffers (this=0x7fb004e8d0, new_slot_size=32) at common/channel.cc:382
#4  0x0000007fb7f931c4 in subspace::ClientHandler::HandleResize (this=0x7fb0020240, req=..., response=0x7fb004d920, fds=std::vector of length 0, capacity 0) at server/client_handler.cc:395
#5  0x0000007fb7f914f8 in subspace::ClientHandler::HandleMessage (this=0x7fb0020240, req=..., resp=..., fds=std::vector of length 0, capacity 0) at server/client_handler.cc:87
#6  0x0000007fb7f91000 in subspace::ClientHandler::Run (this=0x7fb0020240, c=0x7fb00212a0) at server/client_handler.cc:29
#7  0x0000007fb7f72d04 in subspace::Server::<lambda(co::Coroutine*)>::operator()(co::Coroutine *) const (__closure=0x7fb00212b0, c=0x7fb00212a0) at server/server.cc:278

Some stack variables printed out from gdb:

(gdb) print slot->id
$2 = 3
(gdb) print slot->message_size
$3 = 16
(gdb) print slot->buffer_index
$4 = 4
(gdb) print ccb_->num_buffers
$11 = 5
(gdb) print buffers_.size()
$7 = 5
(gdb) print buffers_[0].buffer
$12 = 0x0
(gdb) print buffers_[1].buffer
$13 = 0x7fb42d7000 ""
(gdb) print buffers_[2].buffer
$14 = 0x7fb42cf000 "\001"
(gdb) print buffers_[3].buffer
$15 = 0x7fb42cd000 "\001"
(gdb) print buffers_[4].buffer
$16 = 0x0
(gdb) print buffers_[0].slot_size
$17 = 0
(gdb) print buffers_[1].slot_size
$18 = 2
(gdb) print buffers_[2].slot_size
$19 = 4
(gdb) print buffers_[3].slot_size
$20 = 8
(gdb) print buffers_[4].slot_size
$21 = 0

Observations:

  • The server is waiting on the ChannelLock to resize the channel to new_slot_size=32, meaning that the resize to 16 should have already happened at this point.
  • The subscriber client is getting the next slot, which it sees as having a message size of 16 and a buffer index of 4.
  • The subscriber does have 5 buffers, but the last buffer (index 4) has a slot_size of 0 and a null buffer pointer.

Currently, I'm suspicious of client_handler.cc:404, which calls channel->AddBuffer without holding a ChannelLock; without the strong memory ordering inherent to x86, this might lead to the subscriber observing inconsistent writes on aarch64.
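
For illustration, a minimal, hypothetical sketch of the ordering concern (assumed names, not the real subspace code): on a weakly ordered architecture like aarch64, publishing a new buffer entry and bumping the buffer count needs release/acquire ordering (or a lock), otherwise a reader can observe the new count before the entry's fields:

#include <atomic>
#include <cstddef>

// Illustrative shared state; not the real subspace layout.
struct BufferEntry {
  char *base = nullptr;
  size_t slot_size = 0;
};

struct SharedChannelState {
  BufferEntry buffers[16];
  std::atomic<int> num_buffers{0};
};

// Writer (server) side: fill in the entry, then advertise it with a release
// store so the entry's fields are visible before the new count.
void AddBuffer(SharedChannelState &s, char *base, size_t slot_size) {
  int i = s.num_buffers.load(std::memory_order_relaxed);  // single writer
  s.buffers[i].base = base;
  s.buffers[i].slot_size = slot_size;
  s.num_buffers.store(i + 1, std::memory_order_release);
}

// Reader (subscriber) side: the acquire load pairs with the release store, so
// any entry below the returned count is fully initialized when read.
int VisibleBufferCount(const SharedChannelState &s) {
  return s.num_buffers.load(std::memory_order_acquire);
}

On x86 the stronger (TSO) memory model would largely hide such a reordering, which would fit the observation that the crash shows up on aarch64 but not on x86_64.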


mikael-s-persson commented on June 23, 2024

I guess in theory there is still a race condition: if multiple publishers are resizing at the same time, there could be multiple buffers on the tail end of the channel that don't yet have a ref count. Maybe the proper solution is for the server to set the ref count to 1 when extending the buffers, instead of having the publishers increment it after the reply.
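
A minimal, hypothetical sketch of that alternative (assumed names, not the actual subspace API), with the server seeding the new buffer's ref count so there is never a zero-ref window for the publisher to close:

#include <atomic>
#include <cstdint>

// Illustrative ref-count handling; not the real subspace types.
struct BufferRefs {
  std::atomic<int64_t> refs{0};
};

// Server side: when extending the channel with a new (larger) buffer, seed its
// ref count with 1 on behalf of the publisher that requested the resize, so a
// cleanup pass can never see the newest buffer at zero refs and reclaim it.
void ServerExtendBuffers(BufferRefs &new_buffer) {
  new_buffer.refs.store(1, std::memory_order_release);
}

// Publisher side: instead of incrementing after the server's reply, the
// publisher inherits the reference the server seeded, and only drops the
// reference on the buffer it was using before the resize.
void PublisherReleaseOldBuffer(BufferRefs &old_buffer) {
  old_buffer.refs.fetch_sub(1, std::memory_order_acq_rel);
}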

