Coder Social home page Coder Social logo

Worker Crash SIGABRT about mediasoup HOT 27 CLOSED

GEverding avatar GEverding commented on June 4, 2024
Worker Crash SIGABRT

from mediasoup.

Comments (27)

ibc avatar ibc commented on June 4, 2024 1

Reading over my code there could be a race condition and 2 webrtcservers get created at startup. In this situation only 1 would be used for creating the all the transports.

Do you literally mean that you are creating WebRtcServers in different Workers and assigning them the same listening ports?

This is not possible. worker.createWebRtcServer() will throw if given UDP or TCP port is already in use by other process:

mediasoup:WARN:Channel request failed [method:WORKER_CREATE_WEBRTCSERVER, id:1]: uv_listen() failed [transport:tcp, ip:'0.0.0.0', port:44444]: address already in use [method:worker.createWebRtcServer]
mediasoup:WARN:Channel request failed [method:WORKER_CREATE_WEBRTCSERVER, id:1]: uv_udp_bind() failed [transport:udp, ip:'0.0.0.0', port:44444]: address already in use [method:worker.createWebRtcServer]

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024 1

We have 1 worker with 1 webrtcserver with 4 ports 40001+40002 for internal udp+tcp, 40003+40004 for public announced ips. Nothing is shared.

I found this out yesterday. We were running a lot (95%) of traffic through TCP b/c of a config error. Since fixing the config error we see fewer errors.

from mediasoup.

ibc avatar ibc commented on June 4, 2024 1

@GEverding do you know which devices connect to your server? They obviously implement ICE renomination which is the reason of the crash (see PR) and AFAIK Chrome and libwebrtc doesn't support it. Well it does but I'm not aware that it's enabled by default.

from mediasoup.

ibc avatar ibc commented on June 4, 2024 1

@GEverding I'm merging PR #1182 since there was an obvious bug in the code and that PR fixes the vulnerability in the code. 3.12.16 released. But still I'd appreciate if you could validate whether it fixes your scenario. Thanks a lot.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024
2023-10-15T20:43:50.557Z mediasoup:ERROR:PayloadChannel Producer PayloadChannel error: Error: read ECONNRESET
2023-10-15T20:43:50.558Z mediasoup:ERROR:Channel Producer Channel error: Error: read ECONNRESET
2023-10-15T20:43:50.559Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:20, code:null, signal:SIGABRT]

Also seen it with this error as well.

from mediasoup.

jmillan avatar jmillan commented on June 4, 2024

@GEverding, there is basically no description about the issue. Please elaborate the scenario in which the problem happens, and wether it reproduces always, etc.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

My apologies I was dealing with another issue. I'm seeing this in normal use of mediasoup in production. It's a fairly rare issues. My guess is it happens when a client disconnects half way through setup because I see this more on startup of a new instance. I know you really need core dumps for this and I'm working on that.

from mediasoup.

jmillan avatar jmillan commented on June 4, 2024

The error refers to the UnixSocket between Node and C++ worker. It's not related to client connections.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

Do you want a core dump to debug this or do you need something else. I did see this on v20 but I'm seeing it more on v21-nightly (there's a node ssl crash in v20-v18)

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Coredump please. If you figure out reliable steps to reproduce it would be great but I think that will be hard.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

Still working on the core dump. but saw this in the log and might be the issue:

2023-10-16T19:42:22.374Z mediasoup:ERROR:Worker (stderr) (ABORT) RTC::IceServer::HandleTuple() | failed assertion `this->tuples.empty()': state is 'disconnected' but there are 1 tuples

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Interesting. I will check tomorrow. If you can get the core dump it would be better but maybe I'm able to figure out the issue without it. Thanks.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024
(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 167607312, 0, 5, 0, 281470681743361, 281473119789568, 281473931202416, 281473119637420, 187650778675576, 281473114525712, 281473114527504, 187651764764656,
            281473931202384, 281473115126340, 1}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x0000ffff910a7aa0 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {281473116505920, 2676586395008836901, 2676586395008836901, 2334397743342383717,
              7813874495109036581, 684901, 2317476195752624424, 0, 0, 18446744073692774400, 0, 0, 1, 4616194021469978624, 0, 4616194021471028225}}, sa_flags = 1074791425,
          sa_restorer = 0xaaaada013978 <[email protected]>}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x0000aaaad9a4b00c in RTC::IceServer::HandleTuple(RTC::TransportTuple*, bool, bool, unsigned int) ()
No symbol table info available.
#3  0x0000aaaad9af7b5c in non-virtual thunk to RTC::WebRtcServer::OnTcpConnectionPacketReceived(RTC::TcpConnection*, unsigned char const*, unsigned long) ()
No symbol table info available.
#4  0x0000aaaad9ad3948 in RTC::TcpConnection::UserOnTcpConnectionRead() ()
No symbol table info available.
#5  0x0000aaaad9d91794 in uv.read ()
No symbol table info available.
#6  0x0000aaaad9d920f4 in uv.stream_io ()
No symbol table info available.
#7  0x0000aaaad9d98d54 in uv.io_poll ()
No symbol table info available.
#8  0x0000aaaad9d8c734 in uv_run ()
No symbol table info available.
#9  0x0000aaaad99d8784 in DepLibUV::RunLoop() ()
No symbol table info available.
#10 0x0000aaaad99e2060 in Worker::Worker(Channel::ChannelSocket*, PayloadChannel::PayloadChannelSocket*) ()
No symbol table info available.
#11 0x0000aaaad99d72a0 in mediasoup_worker_run ()
No symbol table info available.
#12 0x0000aaaad99d5ea4 in main ()
No symbol table info available.

Is this enough of a coredump? I have not been able to repo in testing.

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Yes, it's enough. Thanks. Will work on it finally tomorrow. BTW do you have some special scenario such as clients connecting to mediasoup through TCP instead of UDP? Are you using separate WebRtcServer to make all WebRtcTransports listen in same port(s) or not?

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

Yes we have some clients using TCP b/c of network restrictions. We run UDP and TCP on different ports.

We have 1 worker and it gets a WebRTCServer. Reading over my code there could be a race condition and 2 webrtcservers get created at startup. In this situation only 1 would be used for creating the all the transports.

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Reading over my code there could be a race condition and 2 webrtcservers get created at startup. In this situation only 1 would be used for creating the all the transports.

Do you literally mean that you are creating WebRtcServers in different Workers and assigning them the same listening ports?

from mediasoup.

ibc avatar ibc commented on June 4, 2024

So the thing is that iceServer->HandleTuple() private method was called while in 'disconnected' mode.

  1. ' IceServer' entered in 'disconnected' mode before because iceServer->RemoveTuple() was called and it removed the last/only tuple, so IceServer changed its state to "disconnected". That's the only way to switch to 'disconnected' state.
  2. But somehow iceServer->HandleTuple() was called later and it added a tuple, but it didn't leave the 'disconnected' state. It doesn't make sense.
  3. So later, while in 'disconnected' state and with a tuple in the list, somehow iceServer->HandleTuple() was called again and and here the assertion happens ("state is 'disconnected' but there are 1 tuples").

from mediasoup.

ibc avatar ibc commented on June 4, 2024

So the bug is here, in HandleTuple() method (for both 'NEW' and 'DISCONNECTED' states (maybe more):

	void IceServer::HandleTuple(
	  RTC::TransportTuple* tuple, bool hasUseCandidate, bool hasNomination, uint32_t nomination)
	{
		MS_TRACE();

		switch (this->state)
		{
			case IceState::NEW:
			{
				// There should be no tuples.
				MS_ASSERT(
				  this->tuples.empty(), "state is 'new' but there are %zu tuples", this->tuples.size());

				// There shouldn't be a selected tuple.
				MS_ASSERT(!this->selectedTuple, "state is 'new' but there is selected tuple");

				if (!hasUseCandidate && !hasNomination)
				{
					MS_DEBUG_TAG(
					  ice,
					  "transition from state 'new' to 'connected' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32
					  "]",
					  hasUseCandidate ? "true" : "false",
					  hasNomination ? "true" : "false",
					  nomination);

					// Store the tuple.
					auto* storedTuple = AddTuple(tuple);

					// Mark it as selected tuple.
					SetSelectedTuple(storedTuple);
					// Update state.
					this->state = IceState::CONNECTED;
					// Notify the listener.
					this->listener->OnIceServerConnected(this);
				}
				else
				{
					// Store the tuple.
					auto* storedTuple = AddTuple(tuple);

					if ((hasNomination && nomination > this->remoteNomination) || !hasNomination)
					{
						MS_DEBUG_TAG(
						  ice,
						  "transition from state 'new' to 'completed' [hasUseCandidate:%s, hasNomination:%s, nomination:%" PRIu32
						  "]",
						  hasUseCandidate ? "true" : "false",
						  hasNomination ? "true" : "false",
						  nomination);

						// Mark it as selected tuple.
						SetSelectedTuple(storedTuple);
						// Update state.
						this->state = IceState::COMPLETED;
						// Update nomination.
						if (hasNomination && nomination > this->remoteNomination)
							this->remoteNomination = nomination;
						// Notify the listener.
						this->listener->OnIceServerCompleted(this);
					}
				}

				break;
			}

Notice that there are cases in which we store the tuple but we do NOT change the state, so there we end with tuples in the list and still in 'DISCONNECTED' state:

CleanShot-2023-10-18-at-19 02 36@2x

from mediasoup.

ibc avatar ibc commented on June 4, 2024

So basically after this PR #756 we made it possible for IceServer to have tuples in the list while none of them is selected and state remains 'new' or 'disconnected'.

Hi @penguinol, problems here.

from mediasoup.

ibc avatar ibc commented on June 4, 2024

PR done here: #1182
Will merge and release new version tomorrow after getting some eyes on it.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

We have Chrome (electron), iOS and Android all connecting to our servers

from mediasoup.

penguinol avatar penguinol commented on June 4, 2024

Well, we only use udp, so we have never met this crash. I'll check the PR.

from mediasoup.

ibc avatar ibc commented on June 4, 2024

@GEverding it should be fixed in #1182, but I'm extremely busy working on N things at the same time. Is it possible for you to try the branch in that PR in your setup with TCP enabled and see if it works as expected? I've done all tests I could but I've not been able to reproduce the bug using v3 branch.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

Yeah I can build it and put it out in production. I'll know by tmrw likely if I don't see any crashes.

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Thanks a lot. I'm 95% sure it will work fine.

from mediasoup.

GEverding avatar GEverding commented on June 4, 2024

Hey its been out for 48hrs now and its looking good 👍

from mediasoup.

ibc avatar ibc commented on June 4, 2024

Great!

from mediasoup.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.